Kubernetes Service Interruption 6/29/17

Executive Summary

During a routine upgrade to a core component of the Kubernetes control plane, the cluster became unstable because disk latency on the underlying AWS datastores grew too high. Because of this instability, the cluster could not store or verify essential information necessary to run the cluster, which led to erratic behavior. Initially, the instability was attributed to a "split brain" scenario, in which the components of the distributed control plane fall out of sync and the system cannot determine which dataset is correct. After several days of troubleshooting, the team conclusively attributed all instabilities to the disk latency issue and reproduced the errors in a test environment. As a result, the team has upgraded the underlying datastores for the configuration servers to SSDs, mitigating the issue. During this process, we became aware that several Tier 1 services were deployed to a single Kubernetes cluster, which constitutes a single point of failure. Going forward, we will require Tier 1 services to deploy to multiple clusters, eventually targeting Google's managed Kubernetes service and Nordstrom-managed clusters on AWS. This will eliminate service disruption in the event that a single control plane becomes unstable.

Incident

Incident Description

During the routine upgrade of the Kubernetes cluster's distributed configuration database (etcd), the etcd nodes fell out of sync, ultimately appearing to result in a "split brain" condition in which each node in the database would return a different value for the same key, with no way to detect which version was correct. etcd holds key information about the deployment and operation of the cluster, and without it being accurate and available, the cluster began to behave erratically. The seamless, online upgrade process for etcd is a well-tested and well-known procedure that has been performed successfully on many occasions.
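
For illustration only, the kind of disagreement described above can be probed by reading the same key from each etcd member directly and comparing the results. The sketch below uses the third-party python-etcd3 client; the member addresses and probe key are placeholders, not the cluster's actual configuration, and this is not the diagnostic procedure the team ran.

  # Rough per-member consistency probe: read the same key from every etcd
  # member and compare values and mod revisions. Healthy members agree;
  # disagreement is a symptom of the split-brain condition described above.
  import etcd3

  MEMBERS = ["10.0.1.10", "10.0.2.10", "10.0.3.10"]  # placeholder member IPs
  PROBE_KEY = "/probe/consistency-check"             # placeholder key

  results = {}
  for host in MEMBERS:
      member = etcd3.client(host=host, port=2379)
      value, meta = member.get(PROBE_KEY)
      results[host] = (value, meta.mod_revision if meta else None)

  if len(set(results.values())) > 1:
      print("members disagree (possible split brain):", results)
  else:
      print("all members agree:", results)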

Incident Summary

The upgrade process began with a verification of the process and procedure in the Kubernetes development environment earlier in the day. Once verified, the tested process and procedure was applied to the production cluster and completed by 2:00 PM. Between 2:00 and 2:30 PM, the Kubernetes team was alerted to irregularities in the environment by automated alerts and confirmed them by visual inspection of the monitoring system. The issue was difficult to attribute because several unrelated alerts fired simultaneously. Upon further inspection, a split brain condition was suspected, and after initial troubleshooting a full control plane recovery process was initiated. This process is well documented, and the pre-established run book for this situation was executed. To allow a faster transition in the event that the control plane was deemed unrecoverable, creation of a new cluster was started in parallel. When the recovery run book completed, the cluster was still not stable, and the control plane for the production cluster was officially deemed unrecoverable. At that point, the newly created cluster was promoted and workloads were prioritized for migration. The primary customers were informed of the migration and assisted in verifying the configurations required for the new cluster. By 5:30 PM, customer-facing services were verified by the business owners. The incident was reduced from red to yellow at 6:00 PM, and the full migration was completed by 8:00 PM.

Ultimately, the underlying cause was that the AWS Elastic Block Store (EBS) volumes backing the etcd database were responding too slowly to allow consistent operation. While the distributed consensus protocol (Raft) underlying the datastore's consistency guarantees should have accounted for the increased latency, the scenario could be predictably recreated in test environments. To mitigate, the team tested replacing the EBS volumes with high-performance SSDs, which performed orders of magnitude better in both throughput and latency. Because several new, large deployments had increased load on the cluster, the upgrade process was a coincidental tipping point from a performance perspective, as resyncing to new nodes places increased demands on the etcd datastore. At current growth rates, this issue would have presented itself in the near future even without this manual intervention.
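
etcd exposes its disk sync latency on its Prometheus metrics endpoint, which is the kind of signal shown in the graphs below. As a minimal sketch, assuming an etcd client endpoint at the address shown and a roughly 10 ms warning threshold taken from etcd's general hardware guidance, the average WAL fsync latency can be read directly from that endpoint:

  # Minimal sketch: read etcd's WAL fsync latency from its Prometheus metrics
  # endpoint. The endpoint URL and the ~10 ms warning threshold are assumptions
  # for illustration, not the production values.
  import urllib.request

  METRICS_URL = "http://127.0.0.1:2379/metrics"  # assumed etcd client endpoint
  WARN_SECONDS = 0.010                           # assumed threshold (~10 ms)

  def metric_value(lines, name):
      """Return the value of a plain (non-bucket) Prometheus metric line."""
      for line in lines:
          if line.startswith(name + " "):
              return float(line.split()[-1])
      return None

  lines = urllib.request.urlopen(METRICS_URL).read().decode().splitlines()
  total = metric_value(lines, "etcd_disk_wal_fsync_duration_seconds_sum")
  count = metric_value(lines, "etcd_disk_wal_fsync_duration_seconds_count")

  if count:
      avg = total / count
      status = "WARN" if avg > WARN_SECONDS else "ok"
      print(f"average WAL fsync duration: {avg * 1000:.2f} ms ({status})")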

Incident Timeline

Time     Description
1:11 PM  etcd deployment begins
1:57 PM  etcd deployment complete
2:02 PM  First alarm received
2:30 PM  Platform impact confirmed via visual inspection of graphs
2:30 PM  API returning different datasets
2:30 PM  IRT P1 incident created
2:30 PM  Troubleshooting and control plane recovery run book executed
4:00 PM  etcd servers still out of sync after recovery process
4:00 PM  New cluster creation started
4:11 PM  New cluster ready
4:30 PM  etcd state rolled over
4:30 PM  Migration of workloads begins
5:00 PM  Old cluster deemed unrecoverable
5:10 PM  Full migration started
5:48 PM  Customer migration prioritized
6:00 PM  Pri 1 customers confirmed up
6:15 PM  IRT incident reduced from red to yellow

MTTD: 5 minutes

MTTR: 4 hours, 18 minutes

Incident Impact Graphs

[Image: Steel-Etcd_rpcRate_diskSyncDuration.png]

Figure 1: Disk Sync Duration Latency on Impacted Cluster

[Image: Barcelona-Etcd_rpcRate_diskSyncDuration.png]

Figure 2: Disk Sync Duration Latency on Mitigated Cluster

[Image: Overall VolumeQueueLength during 6-29 incident.png]

Figure 3: VolumeQueueLength during the 6/29 incident

Whys

Why did we upgrade in the first place?

  • Several new features in the newest release would help improve visibility and stability for Anniversary

Why did we deploy before a holiday?

  • The team did not consider it a high-risk operation, and the perceived benefits were deemed valuable

Why didn't we notify teams that there was a high-risk situation?

  • The team did not consider it a high-risk operation, based on several past successful executions of the same run book

Why didn't we do regression testing?

  • The team performed a few hours of testing in the development environment prior to rolling the upgrade out to the production cluster

Why wasn't this issue identified during the regression testing?

  • The validation load was not sufficient to surface the underlying disk issue during testing

Go Forward Plan and Lessons Learned

Communication

One of the primary friction points in this incident was the lack of clear communication from the team around timelines and action plans once the incident was detected. Streamlining this process will be one of the primary goals of the new Site Reliability Engineering (SRE) team. The expectation is that the SRE dedicated to the Kubernetes team will coordinate with the Incident Response Team (IRT) to facilitate communication and planning during an incident. We will also establish clear guidelines for a Tier 1 outage to help coordinate communication and planning.

Test Environment Fidelity

One of the primary reasons this outage occurred was the mismatch between load in the test environment and production workloads. Specifically, as more teams onboarded to the Kubernetes cluster, the scale of the load tests used for cluster validation in the development environment was no longer sufficient to surface issues that would arise in production. With adequate synthetic load, the etcd nodes would have failed during run book validation, surfacing the problem before the production rollout and preventing the site outage.
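
As an illustration of the kind of synthetic load that was missing, the sketch below drives concurrent writes directly at a development etcd endpoint using the third-party python-etcd3 client. The endpoint, key prefix, payload size, and worker counts are placeholders to be tuned against observed production churn; this is a sketch of the approach, not the team's validation tooling.

  # Synthetic etcd write load to run alongside run book validation in the
  # development environment. All parameters are illustrative placeholders.
  import os
  import threading
  import etcd3

  ETCD_HOST = "etcd.dev.example.internal"  # placeholder dev etcd endpoint
  WRITERS = 20                             # concurrent writers; tune to match prod churn
  WRITES_PER_WORKER = 5000
  VALUE = os.urandom(512).hex()            # ~1 KiB payload per key

  def writer(worker_id):
      client = etcd3.client(host=ETCD_HOST, port=2379)
      for i in range(WRITES_PER_WORKER):
          client.put(f"/loadtest/worker-{worker_id}/key-{i}", VALUE)

  threads = [threading.Thread(target=writer, args=(w,)) for w in range(WRITERS)]
  for t in threads:
      t.start()
  for t in threads:
      t.join()
  print("synthetic load complete")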

Application Expectations

Tier 1 apps were deployed as a single instance to a single control plane. This left those apps without a disaster recovery (DR) strategy in the event of a control plane failure. While the platform service is deployed in a high-availability configuration, it still represents a single deployment with a single blast radius. As such, we should be very specific about deployment expectations for Tier 1 apps and what can be expected from platform-provided services. We have established the production application release requirements detailed below.

Production Application Requirements

[Table of Roles and Responsibilities]

Tier 1 Service

  • Minimum of 2 deployments
    • [Per Deployment]
      • Minimum of 3 instances of the service
      • Each instance of the service should be in its own availability zone until all AZs are represented (see the deployment sketch after this list)
      • Instances should be evenly distributed across AZs
      • The service should be able to maintain SLA latencies in the event of a single AZ failure
      • Rollouts should be performed in a blue-green pattern
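
The per-deployment AZ distribution above can be expressed with pod anti-affinity on the zone topology label. The sketch below builds such a Deployment object with the official Kubernetes Python client; the service name, image, and labels are placeholders, and the zone label key shown (failure-domain.beta.kubernetes.io/zone) is the node label of this Kubernetes generation. A second, identically shaped deployment in another cluster satisfies the two-deployment minimum.

  # Sketch of a Tier 1 Deployment: 3 replicas with preferred pod anti-affinity
  # on the zone label, which nudges the scheduler to place each replica in a
  # different availability zone. Names, labels, and image are placeholders.
  from kubernetes import client

  def tier1_deployment(name="example-tier1-service", image="example.org/app:1.0"):
      labels = {"app": name}
      spread_across_zones = client.V1PodAntiAffinity(
          preferred_during_scheduling_ignored_during_execution=[
              client.V1WeightedPodAffinityTerm(
                  weight=100,
                  pod_affinity_term=client.V1PodAffinityTerm(
                      label_selector=client.V1LabelSelector(match_labels=labels),
                      topology_key="failure-domain.beta.kubernetes.io/zone",
                  ),
              )
          ]
      )
      return client.V1Deployment(
          metadata=client.V1ObjectMeta(name=name, labels=labels),
          spec=client.V1DeploymentSpec(
              replicas=3,  # minimum of 3 instances per deployment
              selector=client.V1LabelSelector(match_labels=labels),
              template=client.V1PodTemplateSpec(
                  metadata=client.V1ObjectMeta(labels=labels),
                  spec=client.V1PodSpec(
                      affinity=client.V1Affinity(pod_anti_affinity=spread_across_zones),
                      containers=[client.V1Container(name=name, image=image)],
                  ),
              ),
          ),
      )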

Tier 2 or below Service

  • Each instance of the service should be in its own availability zone until all AZs are represented
  • Should be able to maintain SLA latencies in the event of a single AZ failure
  • Rollouts should be performed in a blue-green pattern

New Kubernetes Architecture

  • Non Prod Cluster
  • Prod Cluster (A)
  • Prod Cluster (B) [Future cluster planned for GKE]

Blue-green deployment lifecycle:

[Image: blue-green.png]

Figure 4: Blue Green Deployment Lifecycle

  • Integration testing in the Non Prod (Staging) cluster
  • Initial deployment occurs in a Canary environment (in Prod Cluster (A)) with 1% - 5% of total traffic flowing to the new deployment
  • Once deployment stability is confirmed in the Canary environment, full promotion to Prod Cluster (A) can proceed with a slow traffic ramp (e.g., 5% every 5 min; see the ramp sketch after this list)
  • After 30 min of stability in Prod Cluster (A), a full deploy to Prod Cluster (B) can be performed
  • After 2 hours of seamless operation in all clusters, destroy the previous version of the app in all clusters
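
As a purely illustrative sketch of the promotion ramp above: set_traffic_weight() and deployment_healthy() are hypothetical hooks into whatever traffic-shaping and monitoring tooling fronts the clusters, and the 5% every 5 minutes cadence mirrors the step in the list.

  # Hypothetical traffic ramp for promoting a green deployment, following the
  # 5% every 5 minutes step above. Both helper functions are stand-ins for
  # real traffic-shaping and health-check integrations.
  import time

  RAMP_STEP_PERCENT = 5
  RAMP_INTERVAL_SECONDS = 5 * 60

  def set_traffic_weight(percent):
      """Hypothetical hook: route `percent` of traffic to the green deployment."""
      print(f"routing {percent}% of traffic to the green deployment")

  def deployment_healthy():
      """Hypothetical hook: check error rate and latency SLOs for the green deployment."""
      return True

  def ramp_traffic():
      for percent in range(RAMP_STEP_PERCENT, 101, RAMP_STEP_PERCENT):
          set_traffic_weight(percent)
          time.sleep(RAMP_INTERVAL_SECONDS)
          if not deployment_healthy():
              set_traffic_weight(0)  # roll all traffic back to the blue deployment
              raise RuntimeError(f"green deployment unhealthy at {percent}% traffic")
      print("green deployment serving 100% of traffic")

  if __name__ == "__main__":
      ramp_traffic()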

Date: 7/1/17

Author: Jeff Rose

Created: 2017-07-09 Sun 22:23
