A PREPROCESSING FRAMEWORK AND APPROACH FOR WEB APPLICATIONS

ZHIGANG ZHANG; JING CHEN; XIAOMING LI

Authors

ZHIGANG ZHANG Peking University, Beijing
JING CHEN Peking University, Beijing
XIAOMING LI Peking University, Beijing

Keywords:

World Wide Web, Data preprocessing, Data cleaning, Data integration

Abstract

To meet the common requirements of several typical Web applications, we propose a new preprocessing framework and a corresponding approach. The framework has three parts: Web page cleaning, replica removal, and Web page integration. After preprocessing, Web pages are purified and transformed into a general model called DocView. The model consists of eight elements: identifier, type, content classification code, title, keywords, abstract, topic content, and relevant hyperlinks. The first six are metadata, while the last two are content data. The approach first partitions a page into content blocks according to selected tags in the markup tag tree. Using a set of heuristics, it then identifies the blocks that contain the topic content of the page. Next, a quantitative measure (a feature vector) of the blocks with respect to the topic is obtained. From this topic feature vector, the elements of DocView are extracted by corresponding algorithms. The main advantage of our approach is that it needs no information beyond the raw page, whereas previous related work usually requires additional information. The framework and approach have been applied to our search engine (Tianwang [15]) and to our Web page classification system. The improvements observed in these applications demonstrate the practicality of the framework and the validity of the approach. After such a preprocessing stage, a well-formed, purified, and easily manipulated information layer can be set up on top of any Web page collection (including the WWW) for Web applications.
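The DocView model and the block-partitioning step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the selected tags or the heuristics, so the tag set `BLOCK_TAGS`, the class `BlockPartitioner`, and the "longest block" rule in `pick_topic_block` are all hypothetical stand-ins chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List

# Assumption: the abstract does not list the "selected tags"; these are placeholders.
BLOCK_TAGS = {"table", "td", "div", "p"}


@dataclass
class DocView:
    """The eight DocView elements named in the abstract."""
    identifier: str
    type: str
    classification_code: str
    title: str
    keywords: List[str]
    abstract: str
    topic_content: str                      # content data
    relevant_hyperlinks: List[str] = field(default_factory=list)  # content data


class BlockPartitioner(HTMLParser):
    """Cut a page into text blocks wherever a tag from BLOCK_TAGS opens or closes."""

    def __init__(self):
        super().__init__()
        self.blocks: List[str] = []
        self._buf: List[str] = []

    def flush(self):
        # Collapse whitespace and keep the buffered text only if it is non-empty.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buf.append(data)


def partition(html: str) -> List[str]:
    parser = BlockPartitioner()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text after the last block tag
    return parser.blocks


def pick_topic_block(blocks: List[str]) -> str:
    # Hypothetical heuristic: treat the longest block as the topic content.
    return max(blocks, key=len)
```

For a page such as `<div>Home | About</div><p>Main story text here.</p><div>Copyright</div>`, `partition` yields three blocks and `pick_topic_block` selects the middle one, which could then populate the `topic_content` field of a `DocView` record. A real system would need richer heuristics (for example, link density or block position) than block length alone.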

References

[1] N. Ashish and C. A. Knoblock. Wrapper generation for semi-structured Internet sources. In Proceedings of the Workshop on Management of Semistructured Data, Tucson, 1997.

[2] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7):107-117, 1998.

[3] Lan Yi, Bing Liu, and Xiaoli Li. Eliminating noise information in Web pages for data mining. SIGKDD, 2003.

[4] DOM. http://www.w3.org/dom/.

[5] J. Hammer, H. Garcia-Molina, J. Cho, A. Crespo, and R. Aranha. Extracting semistructured information from the web. In Proceedings of the Workshop on Management of Semistructured Data, pages 18-25, May 1997.

[6] D. Hawking, N. Craswell, P. Bailey, and K. Griffiths. Measuring search engine quality. Information Retrieval, 4(1):33-59, 2001.

[7] C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8):521-538, 1998.

[8] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, (5):604-632, 1999.

[9] N. Kushmerick, D. S. Weld, and R. B. Doorenbos. Wrapper induction for information extraction. In Intl. Joint Conference on Artificial Intelligence (IJCAI), pages 729-737, 1997.

[10] D. D. Lewis, R. E. Schapire, J. P. Callan, and R. Papka. Training algorithms for linear text classifiers. In H.-P. Frei, D. Harman, P. Schäuble, and R. Wilkinson, editors, Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, pages -306, Zürich, CH, 1996. ACM Press, New York, US.

[11] S.-H. Lin and J.-M. Ho. Discovering informative content blocks from web documents. SIGKDD,

[12] U. Manber. Finding similar files in a large file system. In Proceedings of the USENIX Winter Technical Conference, pages 1-10, San Francisco, CA, USA, 1994.

[13] Dublin Core. http://dublincore.org/documents/dces/.

[14] Encoded Archival Description. http://lcweb.loc.gov/ead/.

[15] Networks Lab, Peking University. http://e.pku.edu.cn/.

[16] T.-H. Ong and H. Chen. Updateable PAT-tree approach to Chinese key phrase extraction using mutual information: A linguistic foundation for knowledge management. In Proceedings of the Second International Conference of Asian Digital Library, pages 63-84, Taipei, Taiwan, November 1999.

[17] RFC 1321. http://www.faqs.org/rfcs/rfc1321.html.

[18] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.

[19] N. Shivakumar and H. García-Molina. SCAM: A copy detection mechanism for digital documents. In Proceedings of the Second Annual Conference on the Theory and Practice of Digital Libraries, 1995.

[20] N. Shivakumar and H. Garcia-Molina. Finding near-replicas of documents on the web. In WEBDB: International Workshop on the World Wide Web and Databases, WebDB. LNCS, 1999.

[21] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning. KEA: Practical automatic keyphrase extraction. In ACM DL, pages 254-255, 1999.

[22] L. Xiaoli and S. Zhongzhi. Innovating web page classification through reducing noise. Journal of Computer Science and Technology, 17(1), January 2002.

[23] Y. Yang. Noise reduction in a statistical approach to text categorization. In E. A. Fox, P. Ingwersen, and R. Fidel, editors, Proceedings of SIGIR-95, 18th ACM International Conference on Research and Development in Information Retrieval, pages 256-263, Seattle, US, 1995. ACM Press, New York, US.

[24] Y. Yang and X. Liu. A re-examination of text categorization methods. In M. A. Hearst, F. Gey, and R. Tong, editors, Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval, pages 42-49, Berkeley, US, 1999. ACM Press, New York, US.

[25] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In D. H. Fisher, editor, Proceedings of ICML-97, 14th International Conference on Machine Learning, pages 412-420, Nashville, US, 1997. Morgan Kaufmann Publishers, San Francisco, US.
