A PREPROCESSING FRAMEWORK AND APPROACH FOR WEB APPLICATIONS
Keywords:
World Wide Web, Data preprocessing, Data cleaning, Data integration
Abstract
To meet the common requirements of several typical Web applications, we propose a new preprocessing framework and a corresponding approach. The framework has three parts: Web page cleaning, replica removal, and Web page integration. After the preprocessing stage, Web pages are purified and transformed into a general model called DocView. The model consists of eight elements: identifier, type, content classification code, title, keywords, abstract, topic content, and relevant hyperlinks. The first six are metadata, while the last two are content data. The approach first partitions a page into several content blocks according to selected tags in the markup tag tree. Based on a set of heuristics, it then identifies the blocks that contain the topic content of the page and derives a quantitative measure (a feature vector) of those blocks with respect to the topic. From this topic feature vector, the elements of DocView are extracted by corresponding algorithms. The main advantage of our approach is that it needs no information beyond the raw page, whereas previous related work usually requires additional information. The preprocessing framework and approach have been applied to our search engine (Tianwang [15]) and to our Web page classification system. The improvement observed in both applications shows the practicability of the framework and supports the validity of the approach. After such a preprocessing stage, a well-formed, purified, easily manipulated information layer can be set up on top of any Web page collection (including the WWW) for Web applications.
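As a rough illustration of the pipeline described in the abstract, the sketch below defines a DocView record with the eight named elements and partitions a page into content blocks at a set of selected tags. It is not the authors' implementation: the tag set `BLOCK_TAGS`, the field types, the thresholds `min_chars` and `max_link_ratio`, and the link-density heuristic used to pick topic blocks are all assumptions made for this example, since the abstract does not specify them.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Tuple

# Assumed set of block-delimiting tags; the abstract does not list the real ones.
BLOCK_TAGS = {"table", "tr", "td", "div", "p", "ul"}


@dataclass
class DocView:
    """The eight DocView elements named in the abstract (field types are assumed)."""
    identifier: str
    type: str
    classification_code: str
    title: str
    keywords: List[str]
    abstract: str
    topic_content: str
    relevant_hyperlinks: List[str] = field(default_factory=list)


class BlockPartitioner(HTMLParser):
    """Splits a page into text blocks at the selected tags."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: List[Tuple[str, int]] = []  # (block text, characters inside links)
        self._text: List[str] = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = False  # True while inside <script> or <style>

    def finish(self) -> None:
        """Close the current block, keeping it only if it has visible text."""
        text = " ".join(" ".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True
        elif tag in BLOCK_TAGS:
            self.finish()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False
        elif tag in BLOCK_TAGS:
            self.finish()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def topic_blocks(html: str, min_chars: int = 40, max_link_ratio: float = 0.5) -> List[str]:
    """Assumed heuristic: keep blocks that are long enough and not dominated by link text."""
    parser = BlockPartitioner()
    parser.feed(html)
    parser.close()
    parser.finish()
    return [
        text
        for text, link_chars in parser.blocks
        if len(text) >= min_chars and link_chars / len(text) <= max_link_ratio
    ]
```

The blocks returned by `topic_blocks(page_html)` could then be joined to fill the `topic_content` field of a `DocView` instance; the remaining elements (keywords, abstract, classification code) would come from the feature-vector algorithms the paper describes, which are not sketched here.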
References
N. Ashish and C. A. Knoblock. Wrapper generation for semi-structured Internet sources. In
Proceedings of the Workshop on Management of Semistructured Data, Tucson, 1997.
S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer
Networks and ISDN Systems, 30(1-7):107-117, 1998.
L. Yi, B. Liu, and X. Li. Eliminating noise information in Web pages for data mining.
SIGKDD, 2003.
DOM. http://www.w3.org/dom/.
J. Hammer, H. Garcia-Molina, J. Cho, A. Crespo, and R. Aranha. Extracting semistructured
information from the web. In Proceedings of the Workshop on Management of Semistructured Data,
pages 18-25, May 1997.
D. Hawking, N. Craswell, P. Bailey, and K. Griffiths. Measuring search engine quality. Information
Retrieval, 4(1):33-59, 2001.
C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured data extraction
from the web. Information Systems, 23(8):521-538, 1998.
J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM,
(5):604-632, 1999.
N. Kushmerick, D. S. Weld, and R. B. Doorenbos. Wrapper induction for information extraction. In
Intl. Joint Conference on Artificial Intelligence (IJCAI), pages 729-737, 1997.
D. D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka. Training algorithms for linear text
classifiers. In H.-P. Frei, D. Harman, P. Schäuble, and R. Wilkinson, editors, Proceedings of SIGIR-
, 19th ACM International Conference on Research and Development in Information Retrieval, pages
-306, Zürich, CH, 1996. ACM Press, New York, US.
S.-H. Lin and J.-M. Ho. Discovering informative content blocks from web documents. SIGKDD,
U. Manber. Finding similar files in a large file system. In Proceedings of the USENIX Winter
Technical Conference, pages 1-10, San Francisco, CA, USA, 1994.
Dublin Core. http://dublincore.org/documents/dces/.
Encoded Archival Description. http://lcweb.loc.gov/ead/.
Networks Lab, Peking University. http://e.pku.edu.cn/.
T.-H. Ong and H. Chen. Updateable pat-tree approach to Chinese key phrase extraction using
mutual information: A linguistic foundation for knowledge management. In Proceedings of the Second
International Conference of Asian Digital Library, pages 63-84, Taipei, Taiwan, November 1999.
RFC 1321. http://www.faqs.org/rfcs/rfc1321.html.
G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information
Processing and Management, 24(5):513-523, 1988.
N. Shivakumar and H. Garcia-Molina. SCAM: A copy detection mechanism for digital
documents. In Proceedings of the Second Annual Conference on the Theory and Practice of Digital
Libraries, 1995.
N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents on the web. In
WEBDB: International Workshop on the World Wide Web and Databases, WebDB. LNCS, 1999.
I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning. KEA: Practical
automatic keyphrase extraction. In ACM DL, pages 254-255, 1999.
L. Xiaoli and S. Zhongzhi. Innovating web page classification through reducing noise. Journal of
Computer Science and Technology, 17(1), January 2002.
Y. Yang. Noise reduction in a statistical approach to text categorization. In E. A. Fox, P.
Ingwersen, and R. Fidel, editors, Proceedings of SIGIR-95, 18th ACM International Conference on
Research and Development in Information Retrieval, pages 256-263, Seattle, US, 1995. ACM Press,
New York, US.
Y. Yang and X. Liu. A re-examination of text categorization methods. In M. A. Hearst, F. Gey,
and R. Tong, editors, Proceedings of SIGIR-99, 22nd ACM International Conference on Research and
Development in Information Retrieval, pages 42-49, Berkeley, US, 1999. ACM Press, New York, US.
Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization.
In D. H. Fisher, editor, Proceedings of ICML-97, 14th International Conference on Machine Learning,
pages 412-420, Nashville, US, 1997. Morgan Kaufmann Publishers, San Francisco, US.