Resilience Engineering in Distributed Cloud Architectures

Ramanan Hariharan

doi:10.58425/ijea.v2i1.355

Authors

Ramanan Hariharan, Principal Engineering Manager, Security and Resiliency, Microsoft, Mountain View, USA


Keywords:

Resilience engineering, distributed cloud systems, fault tolerance, hybrid cloud strategies, AI-driven self-healing systems

Abstract

Aim: This study evaluates fundamental resilience engineering strategies in distributed cloud systems and examines their role in enhancing system availability, security, and fault tolerance. As businesses increasingly rely on geographically dispersed cloud infrastructures, ensuring continuous service delivery amid failures and cyber threats has become critical.

Methods: The research adopts a qualitative case analysis approach, complemented by a thorough literature review, to investigate key resilience practices such as redundancy, fault tolerance, proactive monitoring, and disaster recovery planning.

Results: The analysis reveals that integrating artificial intelligence (AI)-based identity and access management (IAM) tools and dynamic load balancing significantly improves system recovery performance, reduces downtime, and supports continuous availability of services. Additionally, the study finds that combining multi-cloud architectures with automated security mechanisms substantially strengthens cloud system robustness against localized failures and security breaches. These resilience strategies improve fault tolerance and support scalability and adaptive performance under changing workloads.

Conclusion: There is a clear need for resilience engineering as cloud adoption and system complexity continue to grow.

Recommendations: Organizations should invest in hybrid cloud infrastructures and AI-driven self-healing capabilities to ensure long-term operational stability, data protection, and compliance in dynamic digital environments.

