GATHERING WEB PAGES OF ENTITIES WITH HIGH PRECISION

Authors

  • BYUNG-WON ON Department of Statistics and Computer Science, Kunsan National University 558, Daehak-ro, Gunsan-si, Jeollabuk-do 573-701, Republic of Korea
  • MUHAMMD OMAR Department of Information and Communication Engineering, Yeungnam University 280, Daehak-ro, Gyeongsan-si, Gyeongsangbuk-do 712-749, Republic of Korea
  • GYU SANG CHOI Department of Information and Communication Engineering, Yeungnam University 280, Daehak-ro, Gyeongsan-si, Gyeongsangbuk-do 712-749, Republic of Korea
  • JUNBEOM KWON Department of Software Science, Dankook University 152, Jukjeon-ro, Suji-gu, Yongin-si, Gyeonggi-do 448-701, Republic of Korea

Keywords

Precision, Support Vector Machines, Supervised Learning, Query Expansion

Abstract

A search engine such as Yahoo looks for entities (specific people, places, or things) on web pages in response to search queries. Depending on the granularity of the query keywords and the performance of the search engine, the retrieved web pages may be very numerous, may include many irrelevant pages, and may not be properly ordered. Because so many pages are retrieved, manually deciding the relevance of each one is infeasible. Another challenge is to develop a language-independent relevance classification of the results provided by a search engine. To improve the quality of a search engine, it is desirable to evaluate its results automatically and to decide the relevance of each retrieved web page to the user query and to the entity the query is about. A step toward this improvement is to prune irrelevant web pages by understanding the needs of a user, in order to discover knowledge of entities in a particular domain. We propose a novel method to improve the precision of a search engine; the method is language independent and does not rely on search engine query logs or user click-through data, both widely used in recent work. We devise novel language-independent features to build a support vector machine relevance classification model, with which we can automatically classify whether a web page retrieved by a search engine is relevant to the desired entity.
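To make the goal concrete, the following is a minimal sketch of how pruning predicted-irrelevant pages can raise precision (the fraction of returned pages that are truly relevant). It is an illustration under stated assumptions, not the authors' implementation: the page data, the `feature_score` field, and the fixed-threshold `threshold_classifier` are all hypothetical, and the threshold rule merely stands in for a trained support vector machine, whose features the abstract does not list.

```python
from typing import Callable, List, Tuple

# Each retrieved page: (url, feature_score, is_truly_relevant).
# feature_score is a hypothetical stand-in for the classifier's input;
# the paper's actual features are not given in the abstract.
Page = Tuple[str, float, bool]


def precision(pages: List[Page]) -> float:
    """Fraction of the given pages that are truly relevant."""
    if not pages:
        return 0.0
    return sum(1 for _, _, relevant in pages if relevant) / len(pages)


def prune(pages: List[Page], is_relevant: Callable[[Page], bool]) -> List[Page]:
    """Keep only the pages the classifier predicts to be relevant."""
    return [p for p in pages if is_relevant(p)]


# Hypothetical search results for one entity query (made-up data).
retrieved: List[Page] = [
    ("http://example.org/a", 0.92, True),
    ("http://example.org/b", 0.15, False),
    ("http://example.org/c", 0.71, True),
    ("http://example.org/d", 0.64, False),
    ("http://example.org/e", 0.08, False),
]


def threshold_classifier(page: Page) -> bool:
    """Stand-in for a trained SVM decision function (assumption)."""
    return page[1] >= 0.5


kept = prune(retrieved, threshold_classifier)
precision_before = precision(retrieved)  # 2 relevant out of 5 retrieved
precision_after = precision(kept)        # 2 relevant out of 3 kept
```

In this toy data the stand-in classifier still admits one irrelevant page, so precision rises from 2/5 to 2/3 rather than to 1. A better-trained model would narrow that gap.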

 


Published

2014-02-28

How to Cite

ON, B.-W., OMAR, M., CHOI, G. S., & KWON, J. (2014). GATHERING WEB PAGES OF ENTITIES WITH HIGH PRECISION. Journal of Web Engineering, 13(5-6), 378–404. Retrieved from https://journals.riverpublishers.com/index.php/JWE/article/view/3903

Section

Articles