Waiting: The Subtle Art That You Should Master

Recently, while working on a workshop entitled Test your Pull Request on Kubernetes with GKE and GitHub Actions, I faced the same problem twice: service A needs service B, but service A starts faster than service B, and the system fails. In this article, I want to describe the context of these problems and how I solved both with the same tool.
Waiting in Kubernetes
It may seem strange to wait in Kubernetes. The self-healing nature of the Kubernetes platform is one of its greatest strengths. Consider two pods: a Python application and a PostgreSQL database.
The application starts very quickly and impatiently tries to establish a connection to the database. Meanwhile, the database is still initializing with the provided data; the connection fails, and the pod ends up in the Failed state.
After a while, Kubernetes checks the state of the application pod. Because it failed, Kubernetes terminates it and starts a new pod. At this point, two things can happen: either the database pod is not ready yet, and we are back to square one, or it is ready, and the application finally connects.
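The application itself could also be made more patient. Here is a minimal retry-with-delay sketch in Python; the connect callable and the parameter values are hypothetical, not taken from the workshop code:

```python
import time

def wait_for_service(connect, retries=10, delay=2.0):
    """Call `connect` until it succeeds or retries are exhausted."""
    last_error = None
    for _ in range(retries):
        try:
            return connect()  # e.g., open a database connection
        except ConnectionError as e:
            last_error = e
            time.sleep(delay)  # back off before the next attempt
    raise RuntimeError(f"service still down after {retries} attempts") from last_error
```

With such a loop, the pod stays Running instead of crashing, at the cost of hiding the failure from Kubernetes' own restart machinery.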
To speed up the process, Kubernetes offers startup probes:
startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
With the probe above, Kubernetes first waits ten seconds before checking the state of the pod. If the check fails, it waits another ten seconds and tries again. Rinse and repeat up to 30 times, i.e., five minutes in total, before it fails permanently.
You may have noticed the HTTP /health endpoint above. Kubernetes offers two mutually exclusive probe configuration options: httpGet or exec. The former is suited to web applications, while the latter covers all other applications. This implies that we must know what kind of container the pod runs and how to check its status, provided we can. I am no PostgreSQL expert, so I searched for a status-check command. The Bitnami Helm chart, once rendered, looks like the following:
startupProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - -e
      - exec pg_isready -U $PG_USER -h $PG_HOST -p $PG_PORT
Note that the above is a simplification; the real chart also deals with the database name and an SSL certificate.
The startup probe speeds things up compared to the default behavior, provided you configure it correctly. You can define a long initial delay, then shorter check intervals. However, the more diverse the containers, the harder the configuration becomes, because you must be an expert in each of the underlying containers.
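As a sketch, such a tuned probe could look like this; the numbers are illustrative, not recommendations:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30   # give the container a generous head start
  periodSeconds: 5          # then check more frequently
  failureThreshold: 24      # give up after a further two minutes
```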
It would be worth looking for alternatives.
Wait4x
The alternatives are tools whose sole focus is waiting. A long time ago, I found the wait-for script for that purpose. The idea is simple:
./wait-for is a script designed to synchronize services such as Docker containers. It is sh and alpine compatible.
Here's how to wait for an HTTP API:
sh -c "./wait-for -- echo \"The API is up! Let's use it\""
It did the job, but at the time, you had to copy the script and check for updates manually. I just checked, and the project now provides a container image.
Wait4X plays the same role but is available as a container image and supports more targets to wait for: HTTP, DNS, databases, and message queues. It is my current choice.
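For reference, here is what waiting on different targets looks like on the command line. The postgresql and http forms and the --expect-status-code flag match the usage shown later in this article; the tcp subcommand and the --timeout flag are my recollection of wait4x's options, so double-check them against its documentation (hosts are placeholders):

```shell
# Wait at most 60s for a TCP port to open
wait4x tcp db.example.com:5432 --timeout 60s

# Wait until an HTTP endpoint answers with a 200 status code
wait4x http https://api.example.com/health --expect-status-code 200

# Wait until PostgreSQL accepts connections
wait4x postgresql 'postgres://user:pass@db.example.com:5432/app?sslmode=disable'
```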
Whatever tool you choose, you can use it in an init container:
A Pod can have multiple containers running apps within it, but it can also have one or more init containers, which are run before the app containers are started.
Init containers are exactly like regular containers, except:
- Init containers always run to completion.
- Each init container must complete successfully before the next one starts.
Imagine the following Pod that depends on a PostgreSQL Deployment:
apiVersion: v1
kind: Pod
metadata:
  labels:
    type: app
    app: recommandations
spec:
  containers:
    - name: recommandations
      image: recommandations:latest
      envFrom:
        - configMapRef:
            name: postgres-config
The application is written in Python and starts fairly quickly. It tries to connect to the PostgreSQL database. Unfortunately, the database has not finished initializing, so the connection fails and Kubernetes restarts the Pod.
We can fix this with an initContainer that waits:
apiVersion: v1
kind: Pod
metadata:
  labels:
    type: app
    app: recommandations
spec:
  initContainers:
    - name: wait-for-postgres
      image: atkrad/wait4x:3.1
      command:
        - wait4x
        - postgresql
        - postgres://$(DATABASE_URL)?sslmode=disable
      envFrom:
        - configMapRef:
            name: postgres-config
  containers:
    - name: recommandations
      image: recommandations:latest
      envFrom:
        - configMapRef:
            name: postgres-config
In the above configuration, the initContainer does not finish until the database accepts connections. When it does, the init container completes, and the recommandations container can start. Kubernetes does not need to terminate the Pod as in the previous configuration! That means fewer logs and potentially fewer alerts.
When waiting becomes mandatory
The above is a nice improvement, but you can live without it. In other cases, waiting becomes mandatory. I recently experienced this while preparing the workshop mentioned above. The scenario is the following:
- The pipeline applies a manifest on the Kubernetes side
- In the next step, it runs the tests
- If the tests start before the application is ready, they fail
We must wait for the backend to be ready before testing. Let's use wait4x to wait for the Pod to accept requests before launching the tests:
- name: Wait until the application has started
  uses: addnab/docker-run-action@v3                                      #1
  with:
    image: atkrad/wait4x:latest
    run: wait4x http ${{ env.BASE_URL }}/health --expect-status-code 200 #2
1. The GitHub Action allows running a container. I could have downloaded the Go binary instead.
2. Wait until the /health endpoint returns a 200 response code.
Conclusion
Kubernetes startup probes are a great way to avoid unnecessary restarts when you start services that depend on each other. The alternative is an external waiting tool configured in an initContainer. wait4x is such a tool, and it can be used in other contexts too, such as CI/CD pipelines. It's now part of my toolbelt.
Go further:
Originally published at A Java Geek on April 20, 2025