COUPLING AN ANNOTATED CORPUS AND A LEXICON FOR AMAZIGH POS TAGGING

Authors

  • SAMIR AMRI, LEC, EMI, Med V University, Rabat, Morocco
  • LAHBIB ZENKOUAR, LEC, EMI, Med V University, Rabat, Morocco
  • MOHAMED OUTAHAJALA, Rabat, Morocco

Keywords:

POS tagging, Amazigh, TreeTagger, Machine Learning, NLP, Tagset

Abstract

This paper investigates how best to couple hand-annotated data with information extracted from an external lexical resource to improve part-of-speech tagging performance. Focusing mostly on Amazigh tagging, we introduce a decision tree and Markov model tagger built with the TreeTagger system. This system reaches 92.3% accuracy on the Amazigh corpus, an error reduction of 15% (18.45% on unknown words) over the same tagger without lexical information. We perform a series of experiments that help explain how this lexical information improves tagging accuracy. We also conduct experiments on datasets and lexicons of varying sizes to assess the best tradeoff between annotating data and developing a lexicon. We find that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that, for fixed performance levels, the availability of the full lexicon consistently reduces the need for supervised data.
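The "error reduction" quoted in the abstract can be read as a relative reduction in error rate. The sketch below illustrates that arithmetic only; it is not the authors' evaluation code. It assumes that the 15% figure is relative to the overall token error rate of the tagger without the lexicon, and the function names are hypothetical. The abstract does not report the baseline accuracy, so the value computed here is merely what the two reported figures would imply under that assumption.

```python
def error_reduction(baseline_acc: float, improved_acc: float) -> float:
    """Relative error reduction, with both accuracies given as fractions in [0, 1]."""
    baseline_err = 1.0 - baseline_acc
    improved_err = 1.0 - improved_acc
    if baseline_err <= 0.0:
        raise ValueError("baseline has no errors to reduce")
    return (baseline_err - improved_err) / baseline_err


def implied_baseline_accuracy(improved_acc: float, reduction: float) -> float:
    """Baseline accuracy implied by an improved accuracy and a relative error reduction."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    improved_err = 1.0 - improved_acc
    # improved_err = baseline_err * (1 - reduction)  =>  solve for baseline_err
    return 1.0 - improved_err / (1.0 - reduction)


# Figures from the abstract: 92.3% accuracy, 15% error reduction.
# ASSUMPTION: the 15% is relative to the overall token error rate.
baseline = implied_baseline_accuracy(0.923, 0.15)
print(f"implied baseline accuracy: {baseline:.4f}")
print(f"round-trip error reduction: {error_reduction(baseline, 0.923):.2f}")
```

Under this reading, a 7.7% error rate after a 15% relative reduction corresponds to a baseline error rate of about 9.1%. The same functions apply to the unknown-word figure once the unknown-word accuracies are known, which the abstract does not state.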

 

Published

2017-03-29
