Evaluation and Analysis of Distributed Graph-Parallel Processing Frameworks
DOI: https://doi.org/10.13052/jcsm2245-1439.333

Keywords: Big data, Graph-parallel computing, Distributed processing

Abstract
A number of graph-parallel processing frameworks have been proposed in recent years to address the need to process complex, large-scale graph-structured datasets. Although significant performance improvements by those frameworks have been reported, the comparative advantages of each framework over the others have not been fully studied, which impedes the best utilization of those frameworks for a specific graph computing task and setting. In this work, we conducted a systematic comparison of parallel processing systems for large-scale graph computations, aiming to reveal the characteristics of those systems when performing common graph algorithms on real-world datasets on the same ground. We selected three popular graph-parallel processing frameworks (Giraph, GPS, and GraphLab) for the study and also included a representative general data-parallel computing system, Spark, in order to understand how well a general data-parallel system can handle graph problems. We applied basic performance metrics measuring speed, resource utilization, and scalability to answer the question of which graph-parallel processing platform is better suited for which applications and datasets. Three widely used graph algorithms, clustering coefficient, shortest path length, and PageRank score, were used for benchmarking on the targeted computing systems. We ran those algorithms against three real-world network datasets with diverse characteristics and scales on a research cluster and obtained a number of interesting observations. For instance, all evaluated systems showed poor scalability (i.e., the runtime increased with more computing nodes) on small datasets, likely due to communication overhead. Furthermore, among the evaluated graph-parallel computing platforms, PowerGraph consistently exhibited better performance than the others.
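As context for the PageRank benchmark mentioned in the abstract, the sketch below shows a minimal single-machine power iteration in Python. It is illustrative only, not the distributed vertex-centric implementation evaluated in the paper; the damping factor of 0.85 and the uniform redistribution of rank from dangling nodes are common assumptions rather than details taken from the study.

```python
def pagerank(edges, num_nodes, damping=0.85, iters=20):
    """Plain power-iteration PageRank over a directed edge list.

    edges: iterable of (src, dst) pairs with node ids in [0, num_nodes).
    Returns a list of rank scores summing to ~1.0.
    """
    # Count outgoing edges per node.
    out_deg = [0] * num_nodes
    for u, _ in edges:
        out_deg[u] += 1

    # Start from a uniform rank distribution.
    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iters):
        # Teleportation term: (1 - d) spread uniformly.
        new = [(1.0 - damping) / num_nodes] * num_nodes
        # Each node sends its damped rank evenly along its out-edges.
        for u, v in edges:
            new[v] += damping * rank[u] / out_deg[u]
        # Rank held by dangling nodes (no out-edges) is redistributed uniformly.
        dangling = damping * sum(
            rank[u] for u in range(num_nodes) if out_deg[u] == 0
        ) / num_nodes
        rank = [r + dangling for r in new]
    return rank
```

In Pregel-style systems such as Giraph or GPS, the inner loop above corresponds to one superstep: each vertex sends `rank/out_degree` messages to its neighbors and then sums the messages it receives.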
References
Message Passing Interface Forum, MPI: A Message-Passing Interface Standard. 1994, University of Tennessee.
Dagum, L. and R. Menon, OpenMP: An Industry-Standard API for Shared-Memory Programming. IEEE Comput. Sci. Eng., 1998. 5(1): p. 46–55.
Dean, J. and S. Ghemawat, MapReduce: simplified data processing on large clusters. Commun. ACM, 2008. 51(1): p. 107–113.
Low, Y., et al., Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proc. VLDB Endow., 2012. 5(8): p. 716–727.
Low, Y., et al., GraphLab: A new framework for parallel machine learning. arXiv preprint arXiv:1006.4990, 2010.
Gonzalez, J.E., et al., PowerGraph: distributed graph-parallel computation on natural graphs, in Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation. 2012, USENIX Association: Hollywood, CA, USA. p. 17–30.
Katz, R.F., et al., Numerical simulation of geodynamic processes with the Portable Extensible Toolkit for Scientific Computation. Physics of the Earth and Planetary Interiors, 2007. 163(1–4): p. 52–68.
Chen, W.-Y., et al., Parallel spectral clustering in distributed systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2011. 33(3): p. 568–586.
Zaharia, M., et al., Spark: cluster computing with working sets, in Proceedings of the 2nd USENIX conference on Hot topics in cloud computing. 2010, USENIX Association: Boston, MA. p. 10–10.
Zaharia, M., et al., Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing, in Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. 2012, USENIX Association: San Jose, CA. p. 2–2.
Guo, Y., et al., How well do graph-processing platforms perform? An empirical performance evaluation and analysis.
Scott, J. and P.J. Carrington, The SAGE handbook of social network analysis. 2011: SAGE publications.
Newman, M., Networks: An Introduction. 2010: Oxford University Press, Inc. 720.
Malewicz, G., et al., Pregel: a system for large-scale graph processing, in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. 2010, ACM: Indianapolis, Indiana, USA. p. 135–146.
Elser, B. and A. Montresor, An evaluation study of Big Data frameworks for graph processing, in Proceedings of the 2013 IEEE International Conference on Big Data. 2013.
Guo, Y., et al., Towards Benchmarking Graph-Processing Platforms.
Salihoglu, S. and J. Widom, GPS: a graph processing system, in Proceedings of the 25th International Conference on Scientific and Statistical Database Management. 2013, ACM.
The Apache Software Foundation, Apache Giraph. 2014 [cited 2014]; Available from: http://giraph.apache.org/.
Page, L., et al., The PageRank citation ranking: Bringing order to the web. 1999.
Haveliwala, T., S. Kamvar, and G. Jeh, An analytical comparison of approaches to personalizing PageRank. 2003.
Langville, A.N. and C.D. Meyer, Deeper inside pagerank. Internet Mathematics, 2004. 1(3): p. 335–380.
Watts, D.J. and S.H. Strogatz, Collective dynamics of ‘small-world’ networks. Nature, 1998. 393(6684): p. 440–442.
Minas Gjoka, Maciej Kurant, Carter T. Butts and Athina Markopoulou, Walking in Facebook: A Case Study of Unbiased Sampling of OSNs, Proceedings of IEEE INFOCOM ’10, San Diego, CA, 2010.
Crankshaw, Daniel, Ankur Dave, Reynold S. Xin, Joseph E. Gonzalez, Michael J. Franklin, and Ion Stoica, The GraphX Graph Processing System.
D. Gregor and A. Lumsdaine, “The Parallel BGL: A Generic Library for Distributed Graph Computations,” POOSC, 2005.
K. Kambatla, G. Kollias, and A. Grama, “Efficient Large-Scale Graph Analysis in MapReduce,” in PMAA, 2012.
M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. Franklin, S. Shenker, and I. Stoica, Fast and interactive analytics over Hadoop data with Spark. USENIX ;login:, 37(4), 2012.
Yue Zhao, Kenji Yoshigoe, Mengjun Xie, Suijian Zhou, Remzi Seker, and Jiang Bian, LightGraph: Lighten Communication in Distributed Graph-Parallel Processing, in Proceedings of the 3rd IEEE International Congress on Big Data (BigData 2014), Anchorage, Alaska, USA, 2014.
Power, R. and J. Li, Piccolo: building fast, distributed programs with partitioned tables, in OSDI. 2010.