AN INVESTIGATION OF CLUSTERING ALGORITHMS IN THE IDENTIFICATION OF SIMILAR WEB PAGES

Authors

  • ANDREA DE LUCIA Dipartimento di Matematica e Informatica, University of Salerno, Italy
  • MICHELE RISI Dipartimento di Matematica e Informatica, University of Salerno, Italy
  • GIUSEPPE SCANNIELLO Dipartimento di Matematica e Informatica, University of Basilicata, Italy
  • GENOVEFFA TORTORA Dipartimento di Matematica e Informatica, University of Salerno, Italy

Keywords:

clone analysis, clustering algorithms, latent semantic indexing, Levenshtein string edit distances, program comprehension, reverse engineering

Abstract

In this paper we investigate the use of clustering algorithms in reverse engineering to identify web pages that are similar at either the structural level or the content level. To this end, we used two instances of a general process that differ only in the measure used to compare web pages: pages are compared at the structural level using the Levenshtein edit distance and at the content level using Latent Semantic Indexing. The static pages of two web applications and one static web site were used to compare the results achieved by the considered clustering algorithms at both the structural and the content level. On these applications we generally achieved comparable results. The investigation also suggested some heuristics to quickly identify the best partition of web pages into clusters, among the possible partitions, at both the structural and the content level.
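
As an illustration of the structural comparison mentioned in the abstract, the following is a minimal sketch of a Levenshtein edit distance computed over the tag sequences of two pages. The abstract does not say how page structure is encoded, so representing a page as its sequence of opening tag names is an assumption made here, as are the hypothetical names `tag_sequence` and `structural_distance` and the normalization by the longer sequence.

```python
from html.parser import HTMLParser


class TagSequence(HTMLParser):
    """Collect the names of opening tags in document order (assumed encoding)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_sequence(html):
    """Return the list of opening tag names found in an HTML string."""
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return parser.tags


def levenshtein(a, b):
    """Dynamic-programming edit distance between two sequences."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def structural_distance(html_a, html_b):
    """Edit distance between tag sequences, scaled to [0, 1] (hypothetical)."""
    a, b = tag_sequence(html_a), tag_sequence(html_b)
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

A pairwise distance of this kind could then be fed to any of the clustering algorithms under comparison; pages whose distance is close to 0 would be candidates for the same cluster.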

 

Published

2009-08-12

How to Cite

DE LUCIA, A., RISI, M., SCANNIELLO, G., & TORTORA, G. (2009). AN INVESTIGATION OF CLUSTERING ALGORITHMS IN THE IDENTIFICATION OF SIMILAR WEB PAGES. Journal of Web Engineering, 8(4), 346–370. Retrieved from https://journals.riverpublishers.com/index.php/JWE/article/view/4047

Section

Articles