Fine-grained Web Content Classification via Entity-level Analytics: The Case of Semantic Fingerprinting
Keywords:
Fine-grained Web Content Classification, Entity-level Web Analytics, Advanced Web Engineering, Web Semantics, Semantic Fingerprinting
Abstract
After almost three decades of Web contents being created, the amount of heterogeneous data of diverse provenance has become seemingly overwhelming, and its organization is a “continuous battle” against time. In parallel, business, sociological, political, and media analysts require structured access to these contents in order to conduct their studies. To this end, concise and, at the same time, efficient engineering methods are required to classify Web contents accordingly. However, the task is not as simple as classifying something as A or B; rather, it is to assign the most suitable (sub-)category to each Web content based on a fine-grained classification scheme. In practice, the underlying type hierarchies are commonly excerpts of large-scale ontologies containing several hundreds or even thousands of (sub-)types organized under a few top-level types. With such a fine-grained type hierarchy, the engineering task of Web content classification becomes extremely challenging. Our main objective in this work is to investigate whether entity-level analytics can be utilized to characterize a Web content and align it with a fine-grained hierarchy. We hypothesize that “You know a document by the named entities it contains”. To this end, we present a novel concept, called “Semantic Fingerprinting”, that allows Web content classification solely based on the information derived from the named entities contained in a Web document. It encodes the semantic nature of a Web content into a concise vector, namely the semantic fingerprint. Thus, we expect that semantic fingerprints, when utilized in combination with machine learning, will enable a fine-grained classification of Web contents. In order to empirically validate the effectiveness of semantic fingerprinting, we perform a case study on the classification of Wikipedia documents.
Furthermore, we thoroughly examine the results by analyzing the performance of semantic fingerprinting with respect to the characteristics of the data set used for the experiments. In addition, we investigate performance aspects of the engineered approach by comparing its run time with that of the competing baselines. We observe that the semantic fingerprinting approach outperforms the state-of-the-art baselines, as it raises Web contents to the entity level and captures their core essence. Moreover, our approach achieves a superior run-time performance on the test data in comparison to its competitors.
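The idea of encoding a document's named entities into a vector and classifying on that vector can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the five-type list `TYPES`, the definition of a fingerprint as a normalized count of entity types, the category labels, and the nearest-centroid rule with cosine similarity, which stands in for the unspecified machine learning component. Entity recognition and typing are assumed to have happened upstream.

```python
from collections import Counter
from math import sqrt

# Hypothetical top-level types; a real type hierarchy would be far larger.
TYPES = ["person", "organization", "location", "event", "artifact"]


def fingerprint(entity_types):
    """Turn the types of a document's named entities into a normalized count vector."""
    counts = Counter(t for t in entity_types if t in TYPES)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(TYPES)
    return [counts[t] / total for t in TYPES]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(fp, centroids):
    """Return the label whose centroid fingerprint is most similar to fp."""
    return max(centroids, key=lambda label: cosine(fp, centroids[label]))


# Hypothetical category centroids, each built from one toy example document.
centroids = {
    "politics": fingerprint(["person", "person", "organization", "location"]),
    "sports": fingerprint(["person", "event", "event", "organization"]),
}

# Entity types assumed to come from an upstream entity recognition step.
doc = fingerprint(["person", "organization", "organization", "location", "location"])
label = classify(doc, centroids)
print(label)
```

In this toy setting the document is assigned to the category whose entity-type distribution it most resembles; a fine-grained setup would use many more types and a trained classifier in place of the centroid comparison.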