
Nuance in “Restarting” a k8s Pod

I left CircleCI and joined a new organization recently (in September 2024).

As part of our onboarding training, I was instructed to run kubectl delete pod ... to restart a service.

“Wouldn't this cause a hopefully brief period where the service is down then? Do we have any uptime SLA or policy with our customers?” I asked.

I was told that, if anything, it should be unnoticeable. However, I understood that the presenter was not familiar with Kubernetes (k8s) and thus might not appreciate the nuance of the delete-pod operation.

Regrettably, I did not follow up to share more at the time. I hope I can at least share my understanding of the nuances here.

When you delete a pod managed by a Deployment, the k8s cluster's control plane creates a new pod to fulfil the Deployment's replica specification.

This means (1) scheduling the new pod onto an available node, and (2) pulling the required image on that node.
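As a concrete sketch, suppose the service is backed by a Deployment named web with a single replica; the deployment name, label, and pod name suffixes below are hypothetical, so adjust them to your own cluster.

    # Delete the only pod backing the (hypothetical) "web" Deployment.
    kubectl delete pod web-5d8c7f9b6d-abcde

    # The control plane immediately creates a replacement pod to satisfy the
    # replica count. Watch it move through Pending -> ContainerCreating -> Running.
    kubectl get pods -l app=web --watch

The gap between the delete and the replacement reporting Ready is exactly the window this post is about.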

At best, steps (1) and (2) complete quickly because the chosen node already has the image cached.
At worst, they can take a while if that node is pulling the image for the first time. Imagine pulling a large image over a slow network.
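If you want to see where that time goes, the new pod's events record both the scheduling decision and the image pull; on recent Kubernetes versions the Pulled event even includes the pull duration. Again using a hypothetical pod name:

    # The Events section lists Scheduled, Pulling, Pulled, Created, and Started
    # entries, showing how long the image pull took on that node.
    kubectl describe pod web-7c9d6b5f4d-xyz12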

If this is the only pod backing the deployment (i.e., a replica count of 1), the service is “down”, or unavailable, while the replacement pod is still pending.
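One way to observe this, assuming a Service named web (a hypothetical name) selects those pods: while the lone pod is not yet Ready, the Service has no endpoints, i.e., nothing to route traffic to.

    # An empty ENDPOINTS column means no pod is currently serving this Service.
    kubectl get endpoints web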

We can choose to run kubectl rollout restart deploy/<deployment> instead.
This is effectively a rolling update, where new pod(s) are created before the existing pods are destroyed.

This ensures we do not experience service downtime while the pods are being replaced.
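A rough equivalent of the earlier sketch, using the rollout approach with the same hypothetical Deployment name:

    # Trigger a rolling replacement of the pods.
    kubectl rollout restart deploy/web

    # Block until the new pods are Ready and the old ones are gone.
    kubectl rollout status deploy/web

Note that “no downtime” assumes the new pods only report Ready once they can actually serve traffic, which is what a readiness probe is for.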

However, because a rolling update creates new pods before removing the old ones, your k8s nodes need sufficient spare resources to accommodate the extra pods during the rollout.
Otherwise, the new pods will be stuck in Pending, waiting to be scheduled onto a node.
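How many extra pods appear at once is governed by the Deployment's rolling update strategy (maxSurge and maxUnavailable, both defaulting to 25%). A hedged sketch of inspecting that setting and diagnosing a stuck rollout, again with hypothetical names:

    # Inspect the surge settings that determine how many extra pods are created.
    kubectl get deploy web -o jsonpath='{.spec.strategy.rollingUpdate}'

    # If the new pods sit in Pending, their events typically show a
    # FailedScheduling reason such as "Insufficient cpu" or "Insufficient memory".
    kubectl describe pod web-6f7d8c9b5c-pqrst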
