Java Bytecode Control Flow Classification: Framework for Guiding Java Decompilation

Authors

  • Siwadol Sateanpattanakul, Pattern REcognition and Computational InTElligent Laboratory (PRECITE Lab), Department of Computer Engineering, Faculty of Engineering at Kamphaengsean, Kasetsart University, Nakhon Pathom, Thailand, https://orcid.org/0000-0002-0030-8553
  • Duangpen Jetpipattanapong, Pattern REcognition and Computational InTElligent Laboratory (PRECITE Lab), Department of Computer Engineering, Faculty of Engineering at Kamphaengsean, Kasetsart University, Nakhon Pathom, Thailand, https://orcid.org/0000-0003-2464-7223
  • Seksan Mathulaprangsan, Pattern REcognition and Computational InTElligent Laboratory (PRECITE Lab), Department of Computer Engineering, Faculty of Engineering at Kamphaengsean, Kasetsart University, Nakhon Pathom, Thailand, https://orcid.org/0000-0002-8810-4183

DOI:

https://doi.org/10.13052/jmm1550-4646.1822

Keywords:

Decompilation, Feature Selection, Latent Semantic Indexing, Genetic Algorithm

Abstract

Decompilation is an important process in software development, particularly when a program's lost source code must be recovered. Although decompiling Java bytecode is easier than decompiling native machine code, many Java decompilers still fail to recover the original source, especially selection statements such as the if statement. This deficiency directly affects decompilation performance. In this paper, we propose a methodology for guiding a Java decompiler to deal with the aforementioned problem. In the framework, Java bytecode is transformed into two kinds of features, called the frame feature and the latent semantic feature. The former is extracted directly from the bytecode. The latter is obtained in two steps, by transforming the Java bytecode into bigrams and then into term frequency-inverse document frequency (TF-IDF) vectors. Both feature sets are then fed to a genetic algorithm to reduce their dimensionality. The proposed feature is obtained by converting the selected TF-IDF vectors into a latent semantic feature and concatenating it with the selected frame feature. Finally, k-nearest neighbors (KNN) is used to classify the proposed feature. The experimental results show a decompilation accuracy of 93.68 percent, which is clearly better than that of Java Decompiler.
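The core of the feature pipeline described above can be sketched in a simplified form: opcode sequences are turned into bigrams, weighted by TF-IDF, and classified with a nearest-neighbor rule. This is a minimal illustrative sketch only; the opcode names, the toy corpus, and the omission of the frame feature, the genetic-algorithm selection step, and the SVD-based latent semantic indexing are all assumptions, not the paper's implementation.

```python
import math
from collections import Counter

def bigrams(opcodes):
    """Adjacent opcode pairs, e.g. ['iload', 'ifeq'] -> [('iload', 'ifeq')]."""
    return list(zip(opcodes, opcodes[1:]))

def tfidf_vectors(docs):
    """docs: list of bigram lists -> list of dicts mapping bigram -> TF-IDF weight."""
    n = len(docs)
    df = Counter()                      # document frequency of each bigram
    for d in docs:
        df.update(set(d))
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: (c / len(d)) * math.log(n / df[t]) for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_predict(train_vecs, labels, query_vec, k=1):
    """Majority label among the k training vectors most similar to the query."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(query_vec, train_vecs[i]),
                    reverse=True)
    return Counter(labels[i] for i in ranked[:k]).most_common(1)[0][0]
```

In the paper's full framework, the GA would prune the TF-IDF dimensions and the surviving dimensions would be projected through LSI before concatenation with the frame feature; the KNN step at the end is the same in spirit as `knn_predict` above.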


Author Biographies

Siwadol Sateanpattanakul, Pattern REcognition and Computational InTElligent Laboratory (PRECITE Lab), Department of Computer Engineering, Faculty of Engineering at Kamphaengsean, Kasetsart University, Nakhon Pathom, Thailand

Siwadol Sateanpattanakul received the D.Eng. degree from King Mongkut’s Institute of Technology Ladkrabang, Bangkok, Thailand, in 2012. He is currently a lecturer in the Department of Computer Engineering, Faculty of Engineering at Kamphaengsean, Kasetsart University. His research interests are software engineering, Java technology, compiler construction, computer programming languages, artificial intelligence, and machine learning.

Duangpen Jetpipattanapong, Pattern REcognition and Computational InTElligent Laboratory (PRECITE Lab), Department of Computer Engineering, Faculty of Engineering at Kamphaengsean, Kasetsart University, Nakhon Pathom, Thailand

Duangpen Jetpipattanapong received the Ph.D. degree from Sirindhorn International Institute of Technology, Thammasat University, Thailand, in 2017. She is currently a lecturer in the Department of Computer Engineering, Faculty of Engineering at Kamphaengsean, Kasetsart University. Her research interests are machine learning and numerical computation.

Seksan Mathulaprangsan, Pattern REcognition and Computational InTElligent Laboratory (PRECITE Lab), Department of Computer Engineering, Faculty of Engineering at Kamphaengsean, Kasetsart University, Nakhon Pathom, Thailand

Seksan Mathulaprangsan received the B.S. and M.S. degrees in Computer Engineering from King Mongkut’s University of Technology Thonburi, Bangkok, Thailand, in 1999 and 2003, respectively. In 2019, he received the Ph.D. degree in applied computer science and information engineering from National Central University, Taoyuan, Taiwan. He is currently a lecturer in the Department of Computer Engineering, Kasetsart University. His research interests include sound processing, image processing, dictionary learning, deep learning, and machine learning.


Published

2021-11-16

Issue

Section

Smart Innovative Technology for Future Industry and Multimedia Applications