JITA4DS: Disaggregated Execution of Data Science Pipelines Between the Edge and the Data Centre
Keywords: Disaggregated data centers, data science pipelines, edge computing
This paper targets the execution of data science (DS) pipelines supported by data processing, transmission and sharing across several resources executing greedy processes. Current data science pipeline environments provide various infrastructure services with computing resources such as general-purpose processors (GPPs), Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs) and Tensor Processing Units (TPUs), coupled with platform and software services to design, run and maintain DS pipelines. These one-size-fits-all solutions impose the complete externalization of data pipeline tasks. However, some tasks can be executed at the edge, and the backend can provide just-in-time resources to ensure ad hoc and elastic execution environments.
This paper introduces an innovative composable “Just in Time Architecture” for configuring data centers (DCs) for Data Science Pipelines (JITA-4DS) and associated resource management techniques. JITA-4DS is a cross-layer management system that is aware of both the application characteristics and the underlying infrastructure, which lets it break the barriers between the application, middleware/operating system, and hardware layers. Vertical integration of these layers is needed to build a customizable Virtual Data Center (VDC) that meets the dynamically changing requirements of data science pipelines, such as performance, availability, and energy consumption. Accordingly, the paper presents an experimental simulation that runs data science workloads to determine the best strategies for scheduling the resource allocation implemented by JITA-4DS.