GATHERING WEB PAGES OF ENTITIES WITH HIGH PRECISION
Keywords:
Precision, Support Vector Machines, Supervised Learning, Query Expansion
Abstract
A search engine such as Yahoo is queried to find web pages about entities such as specific people, places, or things. Depending on the granularity of the query keywords and the performance of the search engine, the result set may be very large, contain many irrelevant pages, and be poorly ordered. Because so many pages are retrieved, judging the relevance of each one manually is infeasible. A further challenge is to classify the relevance of search results in a language-independent way. To improve the quality of a search engine, it is desirable to evaluate its results automatically and to decide whether each retrieved web page is relevant to the user query and to the entity the query is about. A step towards this improvement is to prune irrelevant web pages by understanding the user's needs, so that knowledge about entities in a particular domain can be discovered. We propose a novel method for improving the precision of a search engine that is language independent and does not rely on search engine query logs or user click-through data, both of which have been widely used in recent work. We devise novel language-independent features and use them to build a support vector machine relevance classification model, which automatically classifies whether a web page retrieved by a search engine is relevant to the desired entity.
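As a rough illustration of why pruning irrelevant pages raises precision, the sketch below computes precision (the fraction of retrieved pages that are relevant) before and after filtering a result list. The page names, the relevance labels, and the `classifier` function are all hypothetical stand-ins: the fixed lookup only plays the role of a trained relevance model and is not the support vector machine or the features described in the abstract.

```python
from __future__ import annotations

from typing import Callable, Sequence


def precision(retrieved: Sequence[str], relevant: set[str]) -> float:
    """Fraction of retrieved pages that are relevant (0.0 for an empty list)."""
    if not retrieved:
        return 0.0
    hits = sum(1 for page in retrieved if page in relevant)
    return hits / len(retrieved)


def prune(retrieved: Sequence[str], is_relevant: Callable[[str], bool]) -> list[str]:
    """Keep only the pages the classifier accepts, preserving the original order."""
    return [page for page in retrieved if is_relevant(page)]


# Hypothetical search results and ground-truth relevance labels.
retrieved = ["page_a", "page_b", "page_c", "page_d", "page_e"]
relevant = {"page_a", "page_c", "page_e"}

# Hypothetical classifier output: an imperfect stand-in for a trained model.
# It wrongly accepts page_b and wrongly rejects page_e.
predicted_relevant = {"page_a", "page_b", "page_c"}


def classifier(page: str) -> bool:
    return page in predicted_relevant


before = precision(retrieved, relevant)   # 3 relevant out of 5 retrieved
kept = prune(retrieved, classifier)       # pages that survive pruning
after = precision(kept, relevant)         # 2 relevant out of 3 kept

print(f"precision before pruning: {before:.2f}")
print(f"precision after pruning:  {after:.2f}")
```

In this toy data the classifier is deliberately imperfect: precision still rises, but the relevant page `page_e` is lost, which shows that pruning for precision can cost recall.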