10 Oct 2023 DevOps

Navigating DevOps Incidents: Lessons from Real-World Examples

In today's era of Internet-driven applications, no update or feature rollout occurs in isolation. Modern applications rely on a complex web of plugins and API dependencies, each with its own interdependencies on other services, including the unpredictable Internet itself. A failure in any single element of this chain can disrupt the entire application delivery process, affecting both upstream and downstream components.

Even though a software team may be responsible for a specific part of an application, they must have immediate visibility into how their actions affect other code packages within the same application. Additionally, they need to understand how underlying Internet and network infrastructure impacts their application's performance.

Incidents, disruptions, and outages are common occurrences, irrespective of a team's size or the resources at its disposal. Let's look at three instances where these interdependencies led to unintended DevOps incidents.

 

Embracing Automation, but Treading Carefully

Automation is hailed as a revolutionary approach in DevOps, offering the promise of reducing human error and improving operational efficiency. For example, organizations often use automation to simulate failure scenarios and test the resilience of their infrastructure. This means automating both the testing processes and the technologies that support them.

However, the key challenge arises when automated responses do not behave as expected. In a real-world example from early September, Microsoft encountered an issue related to Azure Front Door (AFD), an Application Delivery Network (ADN) service. AFD is designed to intelligently balance traffic between global edge sites and seamlessly shift traffic in case of failures.

In this instance, an unexpected spike in traffic triggered an automated response within AFD. Unfortunately, instead of mitigating the issue, it exacerbated the problem, causing multiple environments to go offline. This demonstrates that while automation is powerful, it must be implemented and tested meticulously to avoid unintended complexities and challenges during incident resolution.
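One common way to keep an automated response from making things worse is to cap how much of the system it may act on before a human is involved. The sketch below illustrates that idea only; it is not how Azure Front Door works, and the `FailoverGuard` class, its method names, and the 25% budget are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverGuard:
    """Caps how many sites automation may drain at the same time (hypothetical helper)."""

    total_sites: int
    max_drained_fraction: float = 0.25  # assumed safety budget; tune per system
    drained: set = field(default_factory=set)

    def request_drain(self, site: str) -> bool:
        """Return True if automation may drain `site`, False if it should escalate."""
        if site in self.drained:
            return True  # already drained, nothing new to approve
        limit = int(self.total_sites * self.max_drained_fraction)
        if len(self.drained) + 1 > limit:
            return False  # budget exhausted: stop and page a human instead
        self.drained.add(site)
        return True

    def restore(self, site: str) -> None:
        """Mark a site as back in rotation, freeing budget for later drains."""
        self.drained.discard(site)


guard = FailoverGuard(total_sites=8)
for site in ["edge-a", "edge-b", "edge-c"]:
    action = "drain" if guard.request_drain(site) else "escalate to on-call"
    print(site, "->", action)
```

With eight sites and a 25% budget, the third request is refused, so a traffic spike that trips the automation repeatedly cannot take every site offline on its own.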

 

The Road to Performance Has Its Pitfalls

As organizations strive to optimize their applications and services, they often make changes at various levels, including the code and application layers. However, these changes can have ripple effects throughout the service delivery chain, especially when third-party dependencies are involved.

In late August, Microsoft 365 desktop users faced authentication issues, impacting a significant user base, particularly in Japan. The root cause of the problem was traced back to a third-party security plug-in, rather than any issue within Microsoft's own infrastructure. It was discovered that a vulnerability scan inadvertently uninstalled the desktop authentication client used for connecting to Microsoft 365 services.

While Microsoft and the plug-in provider, Tenable, eventually resolved the issue, it had already caused disruptions. Notably, the impact was limited to a specific configuration of enterprise systems, but this incident serves as a reminder of how dependencies, even from third-party sources, can impact the overall service delivery and user experience.

 

When Things Go Wrong in the Fast Lane

In the age of DevOps, the focus is on agility and rapid deployment, but this pace can introduce risks. To mitigate these risks, organizations often schedule deployments during non-business hours to minimize user impact. However, the success of this strategy depends on correctly identifying and isolating potentially impacted users.

In early August, Google experienced an outage that affected services like Google Search, Google Maps, and Gmail. The root cause was traced back to a software update gone awry. Customer-facing impacts were observed during evening hours on the U.S. East Coast, but the outage fell right in the middle of business hours for APAC users, who bore the brunt of it.

This incident underscores that fast deployments are not enough on their own: the timing also has to be chosen with user geography and local business hours in mind.
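A simple pre-deployment check can make this timing question explicit. The sketch below, written under stated assumptions, takes a proposed deployment time in UTC and lists the regions where it would land during local business hours. The region names, the fixed UTC offsets (which ignore daylight saving time), and the 09:00 to 18:00 weekday window are illustrative and not taken from any of the incidents above.

```python
from datetime import datetime, time, timedelta, timezone

# Hypothetical regions with fixed UTC offsets (simplification: no DST handling).
REGION_UTC_OFFSETS = {
    "us-east": -5,
    "europe-west": 1,
    "japan": 9,
}

# Assumed business-hours window, applied to every region.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(18, 0)


def regions_in_business_hours(deploy_utc: datetime) -> list[str]:
    """Return regions where deploy_utc is a weekday between 09:00 and 18:00 local time."""
    affected = []
    for region, offset in REGION_UTC_OFFSETS.items():
        local = deploy_utc.astimezone(timezone(timedelta(hours=offset)))
        is_weekday = local.weekday() < 5  # Monday=0 ... Friday=4
        if is_weekday and BUSINESS_START <= local.time() < BUSINESS_END:
            affected.append(region)
    return affected


# An arbitrary example slot: 01:00 UTC on a Wednesday.
proposed = datetime(2024, 1, 10, 1, 0, tzinfo=timezone.utc)
print(regions_in_business_hours(proposed))
```

For this example slot it is 20:00 the previous evening in `us-east` but 10:00 in `japan`, so only the latter is reported. A production version would more likely use real time zone data (for instance Python's `zoneinfo` module) and per-region traffic figures rather than fixed offsets.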

 

In conclusion, these real-world examples illustrate the critical need for DevOps teams to be vigilant about how their actions affect the entire service delivery chain. Automation, while powerful, requires careful planning and testing. Dependencies, whether internal or third-party, must be considered when making changes. Finally, strategic timing of deployments can play a crucial role in minimizing the impact on users when something does go wrong.

Learn how to master DevOps and handle incidents effectively with the AZ-400T00-Designing and Implementing Microsoft DevOps Solutions course, which can help you gain the expertise needed to excel in DevOps and incident management. Get started today with Formatech!