Keywords: World Wide Web, Data preprocessing, Data cleaning, Data integration

Abstract
Aiming to meet the common requirements of several typical Web applications, we propose a new preprocessing framework and a corresponding approach. The framework comprises three parts: Web page cleaning, replica removal, and Web page integration. After preprocessing, Web pages are purified and transformed into a general model called DocView. The model consists of eight elements: identifier, type, content classification code, title, keywords, abstract, topic content, and relevant hyperlinks. The first six are metadata; the last two are content data. The approach first partitions a page into several content blocks according to selected tags in the markup tag tree. Based on a set of heuristics, it then identifies the blocks that contain the topic content of the page and derives a quantitative measure (a feature vector) of the blocks with respect to the topic. From this topic feature vector, the elements of DocView are extracted by corresponding algorithms. The main advantage of our approach is that it needs no information beyond the raw page, whereas previous related work usually requires additional information. The preprocessing framework and approach have been applied to our search engine (Tianwang [15]) and to our Web page classification system. The improvement observed in these applications shows the practicability of the framework and supports the validity of the approach. After such a preprocessing stage, a well-formed, purified, easily manipulated information layer can be set up on top of any Web page collection (including the WWW) for Web applications.
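
To make the DocView model and the block-partitioning step concrete, the following is a minimal Python sketch, not the implementation used in our system. The field names, the `BLOCK_TAGS` set, and the `partition` helper are assumptions introduced for illustration; the abstract states only that "selected tags" delimit blocks and does not list them, and the heuristics that pick out topic blocks are not modeled here.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Assumed set of block-delimiting tags; the actual selection is not given in the text.
BLOCK_TAGS = {"table", "tr", "td", "div", "p", "ul"}


@dataclass
class DocView:
    """The eight DocView elements (field names are illustrative)."""
    # Metadata elements
    identifier: str
    type: str
    classification_code: str
    title: str
    keywords: list[str]
    abstract: str
    # Content data elements
    topic_content: str
    relevant_hyperlinks: list[str] = field(default_factory=list)


class BlockPartitioner(HTMLParser):
    """Collects page text into blocks, cutting at every tag in BLOCK_TAGS."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buf = []

    def flush(self):
        # Normalize whitespace and keep only non-empty blocks.
        text = " ".join(" ".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buf.append(data)


def partition(html):
    """Return the text of each content block in document order."""
    parser = BlockPartitioner()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text after the last block tag
    return parser.blocks


page = (
    "<html><body><div>Home | About</div>"
    "<p>Web page cleaning removes noise.</p>"
    "<p>Replica removal drops duplicates.</p></body></html>"
)
blocks = partition(page)
```

In this sketch, `blocks` holds one string per content block; a later stage (not shown) would score the blocks against the page topic and fill in a `DocView` instance from the blocks judged to carry topic content.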
