Joint Models for Sentence Segmentation and Named Entity Recognition in Literary Sinitic Text
DOI:
https://doi.org/10.13052/jwe1540-9589.2512
Keywords:
Literary Sinitic, sentence segmentation, NER, transformer
Abstract
Literary Sinitic text from the Joseon dynasty is difficult to understand because it lacks explicit word separators, which creates significant semantic ambiguity. Resolving this ambiguity requires both sentence segmentation and named entity recognition (NER). We propose a Transformer-based analyzer that performs the two tasks simultaneously. Trained on a labeled corpus from the Seungjeongwon Ilgi, the model segments sentences and identifies named entities, which clarifies sentence structure and overall context.
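The abstract does not say how the two tasks are combined at the label level. One common way to train a single tagger on both is to give every character a joint label that pairs a BIO-style entity tag with a sentence-boundary tag. The sketch below illustrates that idea only and is not the authors' scheme: the label format (`<entity>|<boundary>`, with `E` marking the last character of a sentence and `M` any other character), the entity types `LOC` and `PER`, the helper names `encode` and `decode`, and the two example sentences are all assumptions invented for illustration, not taken from the Seungjeongwon Ilgi corpus.

```python
from typing import List, Optional, Tuple

# (text, entity type or None); a hypothetical annotation format for this sketch.
Span = Tuple[str, Optional[str]]


def encode(sentences: List[List[Span]]) -> Tuple[List[str], List[str]]:
    """Flatten annotated sentences into characters and joint labels.

    Each label is "<entity tag>|<boundary tag>". The entity tag is BIO-style;
    the boundary tag is "E" on the last character of a sentence, "M" elsewhere.
    """
    chars: List[str] = []
    labels: List[str] = []
    for sentence in sentences:
        start = len(labels)
        for text, ent in sentence:
            for i, ch in enumerate(text):
                if ent is None:
                    tag = "O"
                else:
                    tag = ("B-" if i == 0 else "I-") + ent
                chars.append(ch)
                labels.append(tag + "|M")
        if len(labels) > start:
            # Mark the final character of a non-empty sentence as a boundary.
            labels[-1] = labels[-1][:-1] + "E"
    return chars, labels


def decode(chars: List[str], labels: List[str]):
    """Recover sentences and (entity text, type) pairs from joint labels."""
    sentences: List[str] = []
    entities: List[Tuple[str, str]] = []
    sent, ent, ent_type = "", "", None
    for ch, label in zip(chars, labels):
        tag, boundary = label.split("|")
        # Close the open entity unless this character continues it.
        if ent and tag != "I-" + ent_type:
            entities.append((ent, ent_type))
            ent, ent_type = "", None
        if tag.startswith("B-"):
            ent, ent_type = ch, tag[2:]
        elif tag.startswith("I-") and ent:
            ent += ch
        sent += ch
        if boundary == "E":
            sentences.append(sent)
            sent = ""
    if ent:
        entities.append((ent, ent_type))
    if sent:
        sentences.append(sent)
    return sentences, entities


# Invented example: two short sentences with one entity each.
example = [
    [("上御", None), ("宣政殿", "LOC")],
    [("李舜臣", "PER"), ("入侍", None)],
]
chars, labels = encode(example)
sentences, entities = decode(chars, labels)
```

With this encoding, `chars` is the unpunctuated character stream a model would read, and `labels` is a single target sequence that a token-classification head could predict. Running `decode` on the predicted labels yields both the sentence splits and the entity spans in one pass.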

