The idea of computing as a utility was realized with the emergence of the cloud computing paradigm. Cloud service providers offer a wide range of services that are delivered over the Internet to cloud service consumers. In their current manifestation, cloud services are realized over multiple logical, virtualized, and distributed resources, typically using a multi-layered architecture. Providers document non-functional service-level guarantees such as availability, performance, and security as Service Level Objectives (SLOs) in the Service Level Agreements (SLAs) provided to the consumer. The wide adoption of cloud computing, compounded by the emergence of microservice architecture, has considerably increased the number of components involved in service delivery. Manually addressing failures in real time is inefficient and often impossible at cloud scale, where failures are the norm rather than the exception. Ensuring the quality of an application service, as documented in the SLA, therefore requires autonomous mechanisms that enhance cloud services’ resilience.
Though cloud setups rely on highly autonomous service layers for managing, provisioning, and monitoring applications, most of these mechanisms focus on a specific layer of the cloud service architecture or consider only a particular set of faults. Yet any component across the cloud service stack that is involved in service delivery could disrupt the SLO. Further, because cloud services use shared infrastructure, monitoring and acting on the metrics of an individual service layer is limiting. In such a scenario, visibility of a failure anywhere in the stack enables effective recovery and remediation strategies; hence, a resiliency solution calls for an application-oriented approach that takes an end-to-end view of failures. Towards this, we propose an end-to-end service resilience framework that employs data-dependent, intelligent, autonomous mechanisms to deal with cloud service disruptions efficiently. The intelligence to reduce the effect of disruptions is based on understanding the complex interconnections and inter-dependencies of end-to-end components in the cloud service stack.
The different cloud service abstraction layers and infrastructure sharing have resulted in an increased occurrence of faults, in particular saturation faults. The initial phase of this work examines real-world disruption scenarios to understand the faults that could disrupt a cloud service. Because applications and the environments that host them are ever-changing, building a failure repository for cloud service faults is infeasible. This makes conventional data-oriented approaches less practical and methods driven by dynamic observability data more desirable. Towards this, the second phase of this work developed a Topology Aware Root Cause Detection algorithm (TA-RCD) that considers the observability data from end-to-end service components and their interconnectedness. Our results from the fault injection studies show that, on average, the proposed approach outperforms the state-of-the-art RCD algorithm by at least 2x for Top-5 recall and 4x for Top-3 recall.
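For readers unfamiliar with the metric, the sketch below illustrates one common way to compute Top-k recall for root cause rankings: the fraction of injected faults whose true root cause appears among the first k ranked candidates. It assumes a single true root cause per fault case, which may differ from the exact definition used in this work. The function name `top_k_recall` and the component names and rankings are hypothetical, made up for illustration; they are not results or identifiers from TA-RCD.

```python
from typing import Sequence


def top_k_recall(
    rankings: Sequence[Sequence[str]],
    true_causes: Sequence[str],
    k: int,
) -> float:
    """Fraction of fault cases whose true root cause is in the top-k candidates.

    Assumption: exactly one true root cause per fault case.
    """
    if len(rankings) != len(true_causes):
        raise ValueError("rankings and true_causes must have the same length")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not rankings:
        return 0.0
    hits = sum(
        1 for ranked, cause in zip(rankings, true_causes) if cause in ranked[:k]
    )
    return hits / len(rankings)


# Hypothetical ranked candidates for four injected faults (illustrative only).
rankings = [
    ["db", "cache", "api", "node-1", "lb"],
    ["api", "node-2", "db", "cache", "lb"],
    ["lb", "api", "cache", "node-1", "db"],
    ["cache", "lb", "api", "db", "node-3"],
]
# Hypothetical ground-truth root cause for each injected fault.
true_causes = ["db", "db", "node-1", "node-3"]

print(top_k_recall(rankings, true_causes, k=3))  # 0.5
print(top_k_recall(rankings, true_causes, k=5))  # 1.0
```

In this toy data, two of the four true causes fall within the first three candidates and all four fall within the first five, so a tighter k rewards algorithms that place the true cause nearer the top of the ranking.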
To autonomously recover a service from its anomalous state, the remediation should target the root cause of the anomalous behavior. Root-cause localizations, though accurate, are not restricted to a specific component because of causal effects arising from service interactions. To identify the anomalous component, the third phase of this work developed a Topology Aware end-to-end failure Recovery framework (TA-REC) that identifies the appropriate remediation strategy for an anomaly. The anomaly score assignment and component activity tracking in TA-REC facilitate the identification of the component and of the remediation that needs to be applied to it. For the saturation fault scenarios injected across the stack, TA-REC can identify a more adequate remediation or recovery strategy than the state-of-the-art because of its better visibility into the origin of the failure. End-to-end visibility hence enables TA-REC to be effective against an anomaly.
In conclusion, this work demonstrated the usefulness of the end-to-end topology of a cloud application service for efficiently remediating anomalies that challenge service quality. The observations show that treating the service as a black box restricts the development of intelligent autonomous approaches to guarantee SLOs. The proof-of-concept evaluations demonstrated that the intelligence to maintain service resilience effectively is based on an accurate understanding of the end-to-end state, as it facilitates maintaining component serviceability by targeting the cause of failure in the stack. Future work aims to evaluate both TA-RCD and TA-REC for a broader range of fault scenarios in real-life production deployments.