Kubernetes Cluster Failure Resolved

By Switching To The Fastest Drives AWS Can Offer

Kubernetes is an open-source product developed by Google and designed to automate, manage, and orchestrate many of the processes involved in deploying and scaling containerized applications.

One of Scandiweb's products, Readymage, is built with this functionality in mind. The basic idea behind Readymage is that it combines a Magento back end with a ScandiPWA front end to launch projects in a Kubernetes cluster. The cluster is managed by AWS, and the service we use for this purpose is called Amazon EKS.

Recently, we encountered an unexpected Kubernetes problem that affected the integrity of a containerized setup and caused project deployments to crash the entire cluster.

The problem

Throughout Spring and early Summer 2020, after loading the Kubernetes cluster with a few heavy projects, we came across an unexpected issue: Kubernetes nodes (AWS instances) started randomly freezing for no obvious reason.

To mitigate failures like this, a Kubernetes cluster normally relocates pods (sets of application containers) from the frozen nodes to healthy ones. This was acceptable until the problem began to appear immediately after relocation. As a result, the healthy nodes froze as well, leading to a cascade of failed pods.

Investigation

We contacted the AWS Premium Support service for assistance, but after a month of back-and-forth communication, no real solution was proposed.

To investigate the issue, we set up monitoring to track the exact moment a node froze. Whenever this happened, the faulty node was removed from load balancing, removed from the Kubernetes cluster, and stopped. Then the node's disk was mounted onto a healthy instance so that we could examine precisely what had happened.
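One way to detect a frozen node is to look at the Ready condition that Kubernetes reports for each node. The snippet below is a minimal sketch of that idea, not the exact tooling used in the investigation. It assumes the node list has been exported in the JSON shape produced by `kubectl get nodes -o json`; the function name and the sample node names are hypothetical.

```python
import json


def not_ready_nodes(node_list: dict) -> list:
    """Return names of nodes whose Ready condition is anything but "True"."""
    suspects = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        # A missing Ready condition, or a status of "False" or "Unknown",
        # means the control plane cannot confirm that the node is healthy.
        if ready is None or ready.get("status") != "True":
            suspects.append(name)
    return suspects


# Hypothetical sample in the shape of `kubectl get nodes -o json`.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
    {"metadata": {"name": "node-b"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}}
  ]
}
""")

print(not_ready_nodes(sample))  # prints ['node-b']
```

A node that has lost network connectivity typically shows up with a Ready status of "Unknown", because its kubelet stops reporting to the control plane.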

Eventually, the new data shed some light on the nature of the problem. As it turned out, the faulty instance lost network connectivity due to a kernel panic caused by the network interface driver.

This explained why the Kubernetes cluster lost control of the node, and why the AWS health check failed, with the instance eventually being terminated.

A further search revealed that we were not alone in facing this problem. Other people using AWS-managed Kubernetes clusters had encountered similar issues, as can be seen from this GitHub thread.

The community’s suggestion was that the issues appeared under intense disk input/output, in other words, during heavy reading or writing. This is exactly what happens during application deployment.

A fresh look at the setup

At the time, we were using fast SSD disks with burstable performance (AWS EBS GP2). EBS volumes are attached over the network: they live in a huge disk array and are attached on demand to the instance that needs them.

These units offer pretty decent performance:

writing speed of 270 MB/s for a prolonged period of time (until the burst balance is drained),
up to 3000 input/output operations per second (IOPS).
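To give a feel for how long the burst balance lasts, here is a rough sketch. It relies on figures from the AWS documentation for GP2 as we understand them, so treat them as assumptions: a baseline of 3 IOPS per GiB (minimum 100), a burst ceiling of 3000 IOPS, and a full bucket of 5.4 million I/O credits. The balance is counted in I/O operations, and the 100 GiB volume size is a hypothetical example, not our actual configuration.

```python
def gp2_burst_seconds(volume_gib: float, credits: float = 5_400_000) -> float:
    """Estimate how long a GP2 volume can sustain its 3000 IOPS burst.

    Assumed model: duration = credit balance / (burst IOPS - baseline IOPS).
    """
    burst_iops = 3000.0
    baseline_iops = max(100.0, 3.0 * volume_gib)  # assumed: 3 IOPS per GiB, minimum 100
    if baseline_iops >= burst_iops:
        # Large volumes already run at or above the burst rate, so nothing drains.
        return float("inf")
    return credits / (burst_iops - baseline_iops)


# Hypothetical 100 GiB volume starting with a full credit bucket.
print(gp2_burst_seconds(100))  # prints 2000.0
```

Under these assumptions, a sustained deployment workload on a small volume can exhaust the balance in about half an hour, after which the volume falls back to its baseline rate.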

An alternative type of AWS instance has SSD disks attached directly to the physical host hardware. This kind of storage is called “instance store”, and instances that offer it are marked with the letter “d” in the suffix, as in “c5d.xlarge”. An instance store is not a persistent disk, meaning that all the stored data is lost if the instance is shut down. At the same time, instance store disks are much faster, not so much in raw throughput as in latency, which is far lower.

The findings

To see how the two types of hardware compare in terms of performance, we measured both EBS and instance store disks by repeatedly copying files of different size thousands of times:

File size	Number of files	Total size	EBS GP2 speed, MB/s	EBS GP2 time, s	Instance store speed, MB/s	Instance store time, s	Speedup (x times)
1 KB	100 000	102 MB	1.6	65.5	22.2	4.6	14.2
10 KB	100 000	1 GB	14.9	68.6	196	5.2	13.1
100 KB	100 000	10 GB	92.7	110.4	362	28.3	3.9
1 MB	10 000	10 GB	269	38	368	27.8	1.4
10 MB	1 000	10 GB	269	38	368	27.8	1.4
100 MB	100	10 GB	269	38.1	363	28.2	1.3
1 GB	10	10 GB	267	38.4	363	28.2	1.4
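The measurement procedure can be approximated with a short script. This is a minimal sketch rather than the benchmark we actually ran: the function name and file naming are hypothetical, and on a real test the page cache can hide disk latency, so the results are only meaningful with data sets large enough (or caches dropped) to force real disk I/O.

```python
import os
import shutil
import tempfile
import time


def copy_benchmark(file_size: int, file_count: int, workdir: str) -> dict:
    """Create file_count files of file_size bytes, copy them one by one, time the copy."""
    src = os.path.join(workdir, "src")
    dst = os.path.join(workdir, "dst")
    os.makedirs(src)
    os.makedirs(dst)

    payload = os.urandom(file_size)
    for i in range(file_count):
        with open(os.path.join(src, f"f{i}.bin"), "wb") as fh:
            fh.write(payload)

    start = time.perf_counter()
    for i in range(file_count):
        shutil.copyfile(os.path.join(src, f"f{i}.bin"),
                        os.path.join(dst, f"f{i}.bin"))
    if hasattr(os, "sync"):
        os.sync()  # ask the OS to flush pending writes (Unix only)
    elapsed = max(time.perf_counter() - start, 1e-9)

    total_bytes = file_size * file_count
    return {
        "total_bytes": total_bytes,
        "seconds": elapsed,
        "mb_per_s": total_bytes / 1e6 / elapsed,
    }


# Small demo run in a temporary directory: 100 files of 1 KB each.
with tempfile.TemporaryDirectory() as tmp:
    print(copy_benchmark(1024, 100, tmp))
```

To compare two disks, point `workdir` at a directory on each of them and run the same combinations of file size and file count.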

Unsurprisingly, the physically attached disk is faster (by roughly 40% for files of 1 MB and larger). But the difference becomes especially pronounced when small files are being transferred, a direct result of reduced latency, i.e. the time added to each I/O operation.

As seen from the above table, for files of 1 MB and larger the choice of disk matters much less: EBS and instance store units show comparable speeds, about 269 and 368 MB/s respectively.

However, when copying files of 10 KB or smaller, the gap widens dramatically, with the instance store being up to 14x faster.
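The latency effect becomes visible when the total times from the table are divided by the number of files. The short calculation below uses only the measured values for the three smallest file sizes; the helper name is ours.

```python
# (file size, number of files, EBS GP2 time in s, instance store time in s),
# copied from the measurement table.
rows = [
    ("1 KB", 100_000, 65.5, 4.6),
    ("10 KB", 100_000, 68.6, 5.2),
    ("100 KB", 100_000, 110.4, 28.3),
]


def per_file_ms(total_seconds: float, file_count: int) -> float:
    """Average time spent on one file, in milliseconds."""
    return total_seconds / file_count * 1000


for label, count, ebs_s, store_s in rows:
    print(f"{label:>6}: EBS {per_file_ms(ebs_s, count):.3f} ms/file, "
          f"instance store {per_file_ms(store_s, count):.3f} ms/file")
```

On EBS, a 1 KB file and a 10 KB file cost almost the same, roughly 0.66 and 0.69 ms each, so the time is dominated by per-operation overhead rather than by the amount of data. On the instance store the same files take about 0.05 ms each.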

Solution

This difference in performance is crucial for Magento-based projects, since a Magento codebase consists of a large number of small files.

For example, assume a sample project with a codebase of 1.6 GB and 123K files:

an EBS disk will require 68 seconds to copy the codebase, while
an instance store disk will require only 5 seconds to perform the same task.

With these findings in mind, we revised our setup and switched fully to instance store disks within the Kubernetes cluster. After this, the node freezing stopped.

As a result, our Kubernetes deployments are now faster and no longer crash the entire cluster, benefits that well justify the 13% cost increase for the faster hardware.