Java Bytecode Control Flow Classification: Framework for Guiding Java Decompilation
DOI:
https://doi.org/10.13052/jmm1550-4646.1822Keywords:
Decompilation, Feature Selection, Latent Semantic Indexing, Genetic AlgorithmAbstract
Decompilation is the main process of software development, which is very important when a program tries to retrieve lost source codes. Although decompiling Java bytecode is easier than bytecode, many Java decompilers cannot recover originally lost sources, especially the selection statement, i.e., if statement. This deficiency affects directly decompilation performance. In this paper, we propose the methodology for guiding Java decompiler to deal with the aforementioned problem. In the framework, Java bytecode is transformed into two kinds of features called frame feature and latent semantic feature. The former is extracted directly from the bytecode. The latter is achieved by two-step transforming the Java bytecode to bigram and then term frequency-inverse document frequency (TFIDF). After that, both of them are fed to the genetic algorithm to reduce their dimensions. The proposed feature is achieved by converting the selected TFIDF to a latent semantic feature and concatenating it with the selected frame feature. Finally, KNN is used to classify the proposed feature. The experimental results show that the decompilation accuracy is 93.68 percent, which is obviously better than Java Decompiler.
Downloads
References
I. Sommerville, Software Engineering, 9th ed. Boston, MA, USA: Addison Wesley, 2011.
G. Nolan, Decompiling Java, 1st ed. Berkeley,CA: Apress, 2004.
M. V Emmerik and T. Waddington, “Using a decompiler for real-world source recovery,” in 11th Working Conference on Reverse Engineering, 2004, pp. 27–36.
J. Hamilton and S. Danicic, “An Evaluation of Current Java Bytecode Decompilers,” in 2009 Ninth IEEE International Working Conference on Source Code Analysis and Manipulation, 2009, pp. 129–136.
J. Kostelanský and L’. Dedera, “An evaluation of output from current Java bytecode decompilers: Is it Android which is responsible for such quality boost?,” in 2017 Communication and Information Technologies (KIT), 2017, pp. 1–6.
N. Harrand, C. Soto-Valero, M. Monperrus, and B. Baudry, “The Strengths and Behavioral Quirks of Java Bytecode Decompilers,” in 2019 19th International Working Conference on Source Code Analysis and Manipulation (SCAM), 2019, pp. 92–102.
N. Harrand, C. Soto-Valero, M. Monperrus, and B. Baudry, “Java decompiler diversity and its application to meta-decompilation,” J. Syst. Softw., vol. 168, p. 110645, 2020.
K. Miller, Y. Kwon, Y. Sun, Z. Zhang, X. Zhang, and Z. Lin, “Probabilistic Disassembly,” in Proceedings of the 41st International Conference on Software Engineering, 2019, pp. 1187–1198.
E. Schulte, J. Ruchti, M. Noonan, D. Ciarletta, and A. Loginov, “Evolving Exact Decompilation,” 2018.
D. S. Katz, J. Ruchti, and E. Schulte, “Using recurrent neural networks for decompilation,” in 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), 2018, pp. 346–356.
J. Gosling, B. Joy, G. Steele, G. Bracha, and A. Buckley, The Java®Language Specification: Java {SE} 7 Edition, 1st ed. Boston, MA, USA: Addison Wesley Professional, 2013.
S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman, “Indexing by Latent Semantic Analysis,” J. Am. Soc. Inf. Sci., vol. 41, no. 6, pp. 391–407, 1990.
R. E. Story, “An explanation of the effectiveness of latent semantic indexing by means of a Bayesian regression model,” Inf. Process. & Manag., vol. 32, no. 3, pp. 329–344, 1996.
M. Berry, Z. Drmac, and E. Jessup, “Matrices, Vector Spaces, and Information Retrieval,” SIAM Rev., vol. 41, no. 2, pp. 335–362, 1999.
S. Dumais, “Using {LSI} for information filtering: {TREC-3} experiments,” in The Third Text REtrieval Conference (TREC-3), 1995, pp. 219–230.
M. E. Wall, A. Rechtsteiner, and L. Rocha, “Singular Value Decomposition and Principal Component Analysis,” A Pract. Approach to Microarray Data Anal., vol. 5, 2002.
S. Lipovetsky, “{PCA} and {SVD} with Nonnegative Loadings,” Pattern Recognit., vol. 42, no. 1, pp. 68–76, 2009.
W. Song and S. C. Park, “Genetic algorithm for text clustering based on latent semantic indexing,” Comput. & Math. with Appl., vol. 57, no. 11, pp. 1901–1907, 2009.
B. Yu, Z. Xu, and C. Li, “Latent semantic analysis for text categorization using neural network,” Knowledge-Based Syst., vol. 21, no. 8, pp. 900–904, 2008.
S. Zelikovitz and H. Hirsh, “Using {LSI} for Text Classification in the Presence of Background Text,” in Proceedings of the Tenth International Conference on Information and Knowledge Management, 2001, pp. 113–118.
M. Komosiñski and K. Krawiec, “Evolutionary weighting of image features for diagnosing of CNS tumors,” Artif. Intell. Med., vol. 19, pp. 25–38, 2000.
U. Maulik and S. Bandyopadhyay, “Performance evaluation of some clustering algorithms and validity indices,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 12, pp. 1650–1654, 2002.
C. Sima, S. Attoor, U. M. Braga-Neto, J. Lowey, E. Suh, and E. R. Dougherty, “Impact of error estimation on feature selection,” Pattern Recognit., vol. 38, pp. 2472–2482, 2005.
“JDT Core Component,” 2018. [Online]. Available: https://www.eclipse.org/jdt/core/.
“Apache Commons,” 2020. [Online]. Available: https://commons.apache.org/proper/commons-bcel/.
“The Apache Ant Project,” 2019. [Online]. Available: https://ant.apache.org/.
I. Markovsky, Low Rank Approximation: Algorithms, Implementation, Applications. Springer Publishing Company, Incorporated, 2011.
W. Song and S. C. Park, “Genetic Algorithm-Based Text Clustering Technique,” in Advances in Natural Computation, 2006, pp. 779–782.
H. Kim, P. Howland, and H. Park, “Dimension Reduction in Text Classification with Support Vector Machines,” J. Mach. Learn. Res., vol. 6, pp. 37–53, 2005.
L. Shi, J. Zhang, E. Liu, and P. He, “Text Classification Based on Nonlinear Dimensionality Reduction Techniques and Support Vector Machines,” in Third International Conference on Natural Computation (ICNC 2007), 2007, vol. 1, pp. 674–677.
M. Zhu, J. Zhu, and W. Chen, “Effect analysis of dimension reduction on support vector machines,” in 2005 International Conference on Natural Language Processing and Knowledge Engineering, 2005, pp. 592–596.
K. Kowsari, K. Jafari Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, and D. Brown, “Text Classification Algorithms: A Survey,” Information, vol. 10, no. 4, 2019.
B. Xu, X. Guo, Y. Ye, and J. Cheng, “An Improved Random Forest Classifier for Text Categorization,” J. Comput., vol. 7, pp. 2913–2920, 2012.
Y. Sun, Y. Li, Q. Zeng, and Y. Bian, “Application Research of Text Classification Based on Random Forest Algorithm,” in 2020 3rd International Conference on Advanced Electronic Materials, Computers and Software Engineering (AEMCSE), 2020, pp. 370–374.