FAULT RESOLUTION SYSTEM FOR INTER-CLOUD ENVIRONMENT
Keywords:
Cloud Computing, Fault Resolution, Peer-to-Peer Network, Inter-Cloud Environment, Bug Tracking System
Abstract
Fault resolution in communication networks and distributed systems is a complicated process that requires system administrators and supporting systems to monitor, diagnose, resolve and record faults. This process becomes more challenging in an inter-cloud environment, where multiple cloud systems coordinate to provision applications and services. In this context, we propose a fault resolution system that assists system administrators in resolving faults in an inter-cloud environment. The proposed system is characterized by its ability to share and search fault knowledge resources among cloud systems. It uses a peer-to-peer network of fault managers that provide facilities to monitor faults occurring in a cloud system and to search for similar faults, together with their solutions, that have occurred in other cloud systems. We have implemented several components of the proposed system, including the fault monitor, fault searcher and fault updater. We have also evaluated the prototype system on fault databases obtained from several fault sources, such as bug tracking systems, online discussion forums and vendor knowledge bases.
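The search idea summarized in the abstract can be illustrated with a small sketch. The abstract does not specify the similarity measure or the peer-to-peer search protocol, so the sketch below assumes token-based Jaccard similarity and hop-limited flooding purely for illustration. All names in it (`FaultRecord`, `FaultManager`, `search`, `ttl`, `threshold`) are hypothetical and are not taken from the proposed system.

```python
from dataclasses import dataclass, field


@dataclass
class FaultRecord:
    description: str  # free-text fault report
    solution: str     # recorded resolution


def tokens(text: str) -> set:
    """Lowercased whitespace tokens of a fault description."""
    return set(text.lower().split())


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity; an assumed stand-in for the real measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


@dataclass
class FaultManager:
    """Hypothetical peer holding the fault records of one cloud system."""
    name: str
    records: list = field(default_factory=list)
    neighbors: list = field(default_factory=list)

    def search_local(self, query: str, threshold: float) -> list:
        q = tokens(query)
        hits = []
        for record in self.records:
            score = jaccard(q, tokens(record.description))
            if score >= threshold:
                hits.append((score, self.name, record))
        return hits

    def search(self, query, threshold=0.3, ttl=2, visited=None):
        """Search this peer, then forward to neighbors while ttl > 0."""
        visited = set() if visited is None else visited
        if self.name in visited:
            return []  # avoid revisiting a peer on cyclic topologies
        visited.add(self.name)
        hits = self.search_local(query, threshold)
        if ttl > 0:
            for peer in self.neighbors:
                hits.extend(peer.search(query, threshold, ttl - 1, visited))
        # Best matches first
        return sorted(hits, key=lambda hit: hit[0], reverse=True)


if __name__ == "__main__":
    cloud_a = FaultManager("cloud-a")
    cloud_b = FaultManager("cloud-b", records=[
        FaultRecord("datanode fails to start after disk full",
                    "free disk space and restart datanode"),
    ])
    cloud_a.neighbors.append(cloud_b)
    for score, origin, record in cloud_a.search("datanode fails to start"):
        print(f"{score:.2f} {origin}: {record.solution}")
```

In this sketch the `ttl` parameter bounds how many hops a query travels, and the shared `visited` set keeps a query from being answered twice by the same peer. A real deployment would likely replace the token overlap with a richer fault representation.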