THE MODIFIED CONCEPT BASED FOCUSED CRAWLING USING ONTOLOGY
Keywords:
Concept Vector, Focused Crawling, Information Retrieval, Ontology
Abstract
The major goal of focused crawlers is to crawl web pages that are relevant to a specific topic. One of the important issues with focused crawlers is the difficulty of determining which web pages are relevant to the desired topic. An ontology-based web crawler uses a domain ontology to estimate the semantic content of a URL, and the relevancy of the URL is determined by an association metric. In concept-based focused crawling, a topic is represented by an overall concept vector, determined by combining the concept vectors of the individual pages associated with the seed URLs. Pages are ranked by comparing concept vectors at each depth, across depths, and against the overall topic-indicating concept vector. In this work, however, we determine and rank the seed page set from the seed URLs, and we rank and filter the page sets at the succeeding depths of the crawl. We propose a method to include relevant concepts from the ontology that were missed by the initial set of seed URLs. The performance of the proposed work is evaluated using two new evaluation metrics, convergence and density contour. The modified concept-based focused crawling process produces a convergence value of 0.82 and, with the inclusion of missing concepts, a density contour value of 0.58.
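The ranking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how concept vectors are combined or compared, so averaging and cosine similarity are assumptions made here, and the function names, the 0.2 threshold, and the sample concepts are all hypothetical.

```python
from collections import Counter
from math import sqrt


def concept_vector(page_concepts):
    """Count how often each ontology concept occurs in one page."""
    return Counter(page_concepts)


def combine(vectors):
    """Average per-page concept vectors into one overall topic vector.

    Averaging is an assumption; the abstract only says vectors are combined.
    """
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {concept: weight / len(vectors) for concept, weight in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse concept vectors (assumed measure)."""
    dot = sum(weight * b.get(concept, 0.0) for concept, weight in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_pages(pages, topic, threshold=0.0):
    """Score each page against the topic vector, filter, and sort descending."""
    scored = [(url, cosine(concept_vector(cs), topic)) for url, cs in pages.items()]
    kept = [(url, score) for url, score in scored if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


def missing_concepts(ontology_concepts, topic):
    """Ontology concepts that the seed pages never mentioned."""
    return sorted(set(ontology_concepts) - set(topic))


# Hypothetical seed pages, already mapped to ontology concepts.
seed_pages = {
    "seed1": ["crop", "soil", "crop"],
    "seed2": ["soil", "irrigation"],
}
topic = combine([concept_vector(cs) for cs in seed_pages.values()])

# Hypothetical pages found at the next crawl depth.
candidates = {"a": ["crop", "soil"], "b": ["football"], "c": ["irrigation"]}
ranked = rank_pages(candidates, topic, threshold=0.2)
gaps = missing_concepts(["crop", "soil", "irrigation", "fertilizer"], topic)
```

Here page `b` shares no concept with the topic vector and is filtered out, while `gaps` lists the ontology concept that the seed pages did not cover and that could be added to the topic vector before the next depth is crawled.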