COUPLING AN ANNOTATED CORPUS AND A LEXICON FOR AMAZIGH POS TAGGING
Keywords:
POS tagging, Amazigh, TreeTagger, Machine Learning, NLP, Tagset
Abstract
This paper investigates how best to couple hand-annotated data with information extracted from an external lexical resource to improve part-of-speech tagging performance. Focusing mostly on Amazigh tagging, we introduce a decision-tree and Markov-model tagger built with the TreeTagger system. This system achieves 92.3% accuracy on the Amazigh corpus, an error reduction of 15% (18.45% on unknown words) over the same tagger without lexical information. We perform a series of experiments that help explain how this lexical information improves tagging accuracy. We also conduct experiments on datasets and lexicons of varying sizes to assess the best tradeoff between annotating data and developing a lexicon. We find that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that for fixed performance levels the availability of the full lexicon consistently reduces the need for supervised data.
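To make the reported figures concrete, the short sketch below shows how an error reduction relates two accuracy values. It assumes the 15% is a relative reduction in the token-level error rate, which is the usual reading but is not spelled out in the abstract. The function names are illustrative, and the baseline accuracy it derives is implied by the two reported numbers rather than quoted from the paper.

```python
def relative_error_reduction(baseline_acc: float, improved_acc: float) -> float:
    """Fraction of the baseline tagger's errors removed by the improved tagger."""
    baseline_err = 1.0 - baseline_acc
    improved_err = 1.0 - improved_acc
    return (baseline_err - improved_err) / baseline_err


def implied_baseline_accuracy(improved_acc: float, reduction: float) -> float:
    """Invert the formula above: the baseline accuracy consistent with a
    given improved accuracy and relative error reduction."""
    improved_err = 1.0 - improved_acc
    return 1.0 - improved_err / (1.0 - reduction)


# Figures taken from the abstract: 92.3% accuracy, 15% error reduction.
# The baseline value is derived here, not reported by the authors.
baseline = implied_baseline_accuracy(0.923, 0.15)
print(f"implied baseline accuracy: {baseline:.4f}")
print(f"round-trip error reduction: {relative_error_reduction(baseline, 0.923):.2f}")
```

Under that assumption, the tagger without lexical information would sit at roughly 90.9% accuracy, so the lexicon accounts for a gain of about 1.4 percentage points overall.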