ScaleNet: Scalable and Hybrid Framework for Cyber Threat Situational Awareness Based on DNS, URL, and Email Data Analysis

Authors

  • R. Vinayakumar, Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, India
  • K. P. Soman, Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, India
  • Prabaharan Poornachandran, Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, India
  • Vysakh S. Mohan, Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, India
  • Amara Dinesh Kumar, Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, India

DOI:

https://doi.org/10.13052/2245-1439.823

Keywords:

cyber security, natural language processing, text mining, machine learning, neural networks, deep learning, big data, cognitive security, distributed and semantic word representation, domain generation algorithms, uniform resource locator, spam, ransomware

Abstract

A computer virus, or malware, is a computer program written with the purpose of causing harm to a system. This year has witnessed a rise in malware, and the losses it has caused are high. Cyber criminals are continually advancing their methods of attack. The existing methodologies for detecting such malicious programs and preventing them from executing are static, dynamic, and hybrid analysis. These approaches are adopted by anti-malware products. The conventional methods are efficient only to a certain extent. They are slow to label malware because of the time it takes to reverse engineer a sample and generate a signature. By the time the signature becomes available, there is a high chance that a significant amount of damage has already occurred. However, malicious activities can potentially be detected quickly by analyzing events in DNS logs, Emails, and URLs. As these unstructured raw data are a rich source of information, we explore how this large volume of data can be leveraged to create intelligent cyber situational awareness and mitigate advanced cyber threats. Deep learning is a machine learning technique that has been widely used by researchers in recent years. It avoids feature engineering, which is a critical step for conventional machine learning algorithms. It can be used alongside existing automation methods such as rule-based, heuristics-based, and machine learning techniques. This work takes advantage of deep learning architectures to classify and correlate malicious activities perceived from various sources such as DNS, Email, and URLs. Unlike conventional machine learning approaches, deep learning architectures do not depend on manual feature engineering or feature representation methods. They can extract optimal features by themselves. Still, additional domain-level features can be defined for deep learning methods in NLP tasks to enhance performance.
The cyber security events considered in this study are text-based. To convert text to real-valued vectors, various natural language processing and text mining methods are incorporated. To our knowledge, this is the first attempt at a framework that can analyze and correlate the events of DNS, Email, and URLs at scale to provide situational awareness against malicious activities. The developed framework is highly scalable and capable of detecting malicious activities in near real time. Moreover, the framework can be easily extended to handle large volumes of other cyber security events by adding resources. These characteristics make the proposed framework stand out from other systems of a similar kind.
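
As an illustration of converting text to real-valued vectors, the following minimal sketch encodes a domain name or URL at the character level and looks up one vector per character. It is not the framework's implementation: the alphabet, the padding and unknown ids, the sequence length, the embedding dimension, and the function names encode, make_embedding_table, and embed are all assumptions made for this example, and the embedding table is randomly initialized rather than learned.

```python
import random
import string

# Assumed alphabet: lowercase letters, digits, and symbols common in domains and URLs.
ALPHABET = string.ascii_lowercase + string.digits + "-._/:?=&%@"
PAD_ID, UNK_ID = 0, 1  # reserved ids for padding and out-of-alphabet characters
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}
VOCAB_SIZE = len(ALPHABET) + 2


def encode(text, max_len=32):
    """Map a string to a fixed-length list of integer character ids."""
    ids = [CHAR_TO_ID.get(ch, UNK_ID) for ch in text.lower()[:max_len]]
    # Right-pad short inputs so every sample has the same length.
    return ids + [PAD_ID] * (max_len - len(ids))


def make_embedding_table(vocab_size, dim=4, seed=0):
    """Build a random embedding table; a real model would learn these values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def embed(ids, table):
    """Replace each character id with its real-valued vector."""
    return [table[i] for i in ids]


EMBEDDINGS = make_embedding_table(VOCAB_SIZE)
ids = encode("example.com", max_len=16)
vectors = embed(ids, EMBEDDINGS)  # 16 vectors of 4 floats each
```

In a deep learning model, a table like EMBEDDINGS would be a trainable layer, so the character vectors are learned jointly with the classifier rather than designed by hand.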

 

Author Biographies

R. Vinayakumar, Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, India

R. Vinayakumar has been a Ph.D. student in Computer Science at Amrita Vishwa Vidyapeetham, Coimbatore, since July 2015. He received his BCA from JSS College of Arts, Commerce and Science, Ooty Road, Mysore, and his MCA from Amrita Vishwa Vidyapeetham, Mysore. He has several papers on machine learning applied to cyber security. His Ph.D. work centers on the application of machine learning and deep learning to cyber security and discusses the importance of natural language processing, image processing, and big data analytics for cyber security. More details are available at https://vinayakumarr.github.io/

K. P. Soman, Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, India

K. P. Soman has 25 years of research and teaching experience at Amrita School of Engineering, Coimbatore. He has around 150 publications in national and international journals and conference proceedings. He has organized a series of workshops and summer schools for industry and academia on advanced signal processing using wavelets, kernel methods for pattern classification, deep learning, and big data analytics. He has authored the books “Insight into Wavelets”, “Insight into Data mining”, “Support Vector Machines and Other Kernel Methods” and “Signal and Image processing-the sparse way,” published by Prentice Hall, New Delhi, and Elsevier. More details are available at https://nlp.amrita.edu/somankp/

Prabaharan Poornachandran, Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, India

Prabaharan Poornachandran is a professor at Amrita Vishwa Vidyapeetham. He has more than two decades of experience in computer science and security. His areas of interest are malware, critical infrastructure security, complex binary analysis, AI, and machine learning.

 

Vysakh S. Mohan, Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, India

Vysakh S. Mohan has been an MTech student at Amrita Vishwa Vidyapeetham, Coimbatore, since July 2016. His MTech work centers on object detection using deep learning. He is an AI enthusiast and developer at Accubits Technologies Inc, where he is actively involved in creating artificial intelligence solutions, and he has several noted research papers in domains such as deep learning, computer vision, cyber security, and natural language processing. More details are available at https://vysakhsmohan.wixsite.com/vysakhsmohan

Amara Dinesh Kumar, Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, India

Amara Dinesh Kumar has been an MTech student at Amrita Vishwa Vidyapeetham, Coimbatore, since July 2017. He is an AI researcher and cyber security enthusiast. More details are available at https://sites.google.com/view/dineshkumaramara/home

Published

2018-10-13

Issue

Section

Articles