Fine-grained Web Content Classification via Entity-level Analytics: The Case of Semantic Fingerprinting


  • Govind Université de Caen Normandie, Department of Computer Science, Campus Côte de Nacre, F-14032 Caen, France
  • Celine Alec Université de Caen Normandie, Department of Computer Science, Campus Côte de Nacre, F-14032 Caen, France
  • Marc Spaniol Université de Caen Normandie, Department of Computer Science, Campus Côte de Nacre, F-14032 Caen, France


Fine-grained Web Content Classification, Entity-level Web Analytics, Advanced Web Engineering, Web Semantics, Semantic Fingerprinting


After nearly three decades of Web content creation, the amount of heterogeneous data of diverse provenance has become seemingly overwhelming, and its organization is a “continuous battle” against time. In parallel, business, sociological, political, and media analysts require structured access to this content in order to conduct their studies. To this end, concise and, at the same time, efficient engineering methods are required to classify Web content accordingly. However, the task is not as simple as classifying something as A or B; it requires assigning the most suitable (sub-)category to each Web document based on a fine-grained classification scheme. In practice, the underlying type hierarchies are commonly excerpts of large-scale ontologies containing several hundreds or even thousands of (sub-)types organized under a few top-level types. With such a fine-grained type hierarchy, the engineering task of Web content classification becomes extremely challenging. Our main objective in this work is to investigate whether entity-level analytics can be utilized to characterize a Web document and align it with a fine-grained hierarchy. We hypothesize that “You know a document by the named entities it contains”. To this end, we present a novel concept, called “Semantic Fingerprinting”, that allows Web content classification based solely on the information derived from the named entities contained in a Web document. It encodes the semantic nature of a Web document into a concise vector, namely the semantic fingerprint. Thus, we expect that semantic fingerprints, when utilized in combination with machine learning, will enable a fine-grained classification of Web content. In order to empirically validate the effectiveness of semantic fingerprinting, we perform a case study on the classification of Wikipedia documents.
Furthermore, we thoroughly examine the results by analyzing the performance of semantic fingerprinting with respect to the characteristics of the data set used for the experiments. In addition, we investigate performance aspects of the engineered approach by discussing its run time in comparison with the competing baselines. We observe that the semantic fingerprinting approach outperforms the state-of-the-art baselines as it raises Web content to the entity level and captures its core essence. Moreover, our approach achieves superior run-time performance on the test data in comparison to its competitors.
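
To illustrate the idea of encoding a document as a vector derived only from its named entities, the following is a minimal sketch and not the authors' implementation. The type hierarchy `PARENT`, the lookup table `ENTITY_TYPE`, and the function names are hypothetical stand-ins: the abstract does not specify how entities are typed or how the vector is weighted, so the sketch simply assumes that each entity contributes to its own type and to all ancestor types, and that counts are normalized by the number of typed entities.

```python
from collections import Counter

# Hypothetical toy type hierarchy (child -> parent). The paper refers to
# excerpts of large-scale ontologies, which are not reproduced here.
PARENT = {
    "person": None,
    "politician": "person",
    "scientist": "person",
    "organization": None,
    "university": "organization",
    "location": None,
    "city": "location",
}

# Hypothetical entity-to-type lookup standing in for a real named entity
# disambiguation and typing step.
ENTITY_TYPE = {
    "Marie Curie": "scientist",
    "Albert Einstein": "scientist",
    "Sorbonne": "university",
    "Paris": "city",
}

TYPES = sorted(PARENT)  # fixed dimension order of the vector


def ancestors_and_self(type_name):
    """Yield a type followed by all of its ancestors in the hierarchy."""
    while type_name is not None:
        yield type_name
        type_name = PARENT[type_name]


def semantic_fingerprint(entities):
    """Return a normalized type-frequency vector for a list of entity names."""
    counts = Counter()
    known = 0
    for entity in entities:
        entity_type = ENTITY_TYPE.get(entity)
        if entity_type is None:
            continue  # skip entities the lookup cannot type
        known += 1
        counts.update(ancestors_and_self(entity_type))
    if known == 0:
        return [0.0] * len(TYPES)
    return [counts[t] / known for t in TYPES]


doc_entities = ["Marie Curie", "Albert Einstein", "Sorbonne", "Paris"]
fingerprint = semantic_fingerprint(doc_entities)
for type_name, weight in zip(TYPES, fingerprint):
    print(f"{type_name:12s} {weight:.2f}")
```

Under these assumptions, a vector of this kind could serve as the feature input to any standard classifier; which learner and which weighting scheme the approach actually uses is not stated in the abstract.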




