AUTOMATIC MAINTENANCE OF WEB DIRECTORIES BY MINING WEB BROWSING DATA
Keywords:
Web directories, Web Mining, Query Logs
Abstract
Web directories allow Web users to browse a hierarchy of categories under which different types of resources are classified. We study the problem of maintaining a Web directory, that is, the problem of continually discovering and ranking resources that are relevant to the categories of the directory. We propose an unsupervised computational method that maintains the directory by analyzing user browsing data. The method is based on the extraction of user sessions (sequences of resources selected by users) and their classification into the categories of the directory. In addition, we show that the directory maintenance method can be slightly modified to find queries that help retrieve relevant resources, allowing users to switch from directory browsing to query formulation. Experimental results indicate that the proposed methods are effective: they identify new pages in each category and recommend related queries with high precision, without needing labeled data to carry out traditional Web page and query classification tasks.
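As a rough illustration of the session-based idea described in the abstract, the following Python sketch extracts sessions from a click log, assigns each session to a directory category, and ranks unlisted resources per category. It is not the paper's actual method: the 30-minute inactivity timeout, the majority-vote classification rule, the session-count ranking, and all function and variable names are assumptions made only for this example.

```python
from collections import Counter, defaultdict

SESSION_GAP = 30 * 60  # seconds; hypothetical inactivity timeout (assumption)


def extract_sessions(clicks, gap=SESSION_GAP):
    """Group (user, timestamp, url) clicks into per-user sessions split on inactivity."""
    by_user = defaultdict(list)
    for user, ts, url in clicks:
        by_user[user].append((ts, url))

    sessions = []
    for events in by_user.values():
        events.sort()  # order each user's clicks by timestamp
        current, last_ts = [], None
        for ts, url in events:
            # Start a new session when the user was idle longer than `gap`.
            if last_ts is not None and ts - last_ts > gap:
                sessions.append(current)
                current = []
            current.append(url)
            last_ts = ts
        if current:
            sessions.append(current)
    return sessions


def classify_session(session, directory):
    """Return the category holding most of the session's already-listed URLs, or None."""
    votes = Counter(directory[u] for u in session if u in directory)
    return votes.most_common(1)[0][0] if votes else None


def rank_candidates(sessions, directory):
    """Per category, rank unlisted URLs by the number of sessions they appear in."""
    scores = defaultdict(Counter)
    for session in sessions:
        category = classify_session(session, directory)
        if category is None:
            continue  # no listed URL in this session, so it cannot be classified
        for url in set(session):
            if url not in directory:
                scores[category][url] += 1
    return {c: counts.most_common() for c, counts in scores.items()}


# Toy data: `directory` maps already-listed URLs to their category.
directory = {"a.com": "Sports", "b.com": "Sports", "c.com": "Travel"}
clicks = [
    ("u1", 0, "a.com"), ("u1", 60, "x.com"), ("u1", 120, "b.com"),
    ("u1", 10000, "c.com"), ("u1", 10060, "y.com"),
    ("u2", 0, "b.com"), ("u2", 30, "x.com"),
]
candidates = rank_candidates(extract_sessions(clicks), directory)
```

In this toy log, `x.com` co-occurs with listed Sports pages in two sessions and `y.com` with a listed Travel page in one, so each becomes a candidate for the corresponding category. No labeled training data is used; the only supervision is the directory's existing listings.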