
Backpressure Isn’t a Bug: It’s a Feature for Building Resilient Systems

Backpressure is the hidden negotiation between producers and consumers. Master it, and your systems scale gracefully. Ignore it, and they collapse under peak load.

The quote above suggests that even the most robust and well-designed dams cannot withstand the destructive force of an unchecked flood. Likewise, in a distributed system, an uncontrolled caller can overwhelm the entire system and cause cascading failures.

In a previous article, I wrote about how a retry storm has the potential to take down an entire service if the appropriate guardrails are not in place. Here, I explore when a service should consider applying backpressure to its callers, how it can be applied, and what callers can do to cope with it.

Backpressure

As the name suggests, backpressure is a mechanism in distributed systems that refers to a system's ability to throttle the rate at which data is consumed or produced, to avoid overloading itself or its downstream components. A system applying backpressure on its callers is not always explicit, for example in the form of throttling or load shedding; it is sometimes implicit, such as the system slowing itself down by adding latency to the requests it serves without saying so.

Both implicit and explicit backpressure are intended to slow down the caller, either because the caller is misbehaving or because the service itself is unhealthy and needs time to recover.

Why backpressure is needed

Let's take an example to illustrate when a system should apply backpressure. In this example, we build a control plane service with three main components: a frontend where customer requests are received, an internal queue where customer requests are buffered, and a consumer application that reads messages from the queue and writes them to a database for persistence.

Figure 1: A sample control plane

Producers outpacing consumers

Consider a scenario where actors/clients hit the frontend at such a high rate that either the internal queue is full or the worker writing to the database is busy, leading to a full queue. In that case, requests cannot be buffered, so rather than silently dropping customer requests it is better to inform customers up front. This mismatch can happen for various reasons, such as a burst of incoming traffic, or a glitch in the system in which the consumer was down for a while and must now work extra hard to drain the backlog that accumulated during its downtime.
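
As a minimal sketch (not the original service's code), a frontend can buffer requests in a bounded queue and return an explicit "slow down" response instead of silently dropping work once the buffer is full; the handler shape, status codes, and capacity below are assumptions for illustration.

import queue

QUEUE_CAPACITY = 1000                     # illustrative capacity
request_queue = queue.Queue(maxsize=QUEUE_CAPACITY)

def handle_request(request):
    try:
        # Enqueue without blocking; fail fast when the buffer is full.
        request_queue.put_nowait(request)
        return {"status": 202, "body": "accepted"}
    except queue.Full:
        # Explicit backpressure signal so the caller can back off and retry.
        return {"status": 429, "body": "queue full, retry with backoff"}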

Resource constraints and cascading failures

Imagine a scenario where your queue is approaching 100% of its capacity, while it normally sits at 50%. To match the increased incoming rate, you scale up your consumer application and start writing to the database at a higher rate. However, the database cannot handle the increase (for example, due to limits on writes per second) and breaks down. This failure takes down the whole system and increases the mean time to recover (MTTR). Applying backpressure in the appropriate places becomes critical in such scenarios.

Missed SLAs

Consider a scenario where the data written to the database is processed every 5 minutes, and another application listens for those updates to stay current. Now, if the system cannot meet that SLA for some reason, such as the queue being 90% full and potentially taking up to 10 minutes to clear all the messages, it is better to resort to backpressure techniques.

You can inform customers that you will miss the SLA and ask them to retry later, or apply backpressure by dropping non-urgent requests from the queue so the SLA is still met for critical events/requests.
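
One way to picture this, as a rough sketch rather than a prescribed design, is an admission check that keeps accepting critical requests while deprioritizing everything else once the queue crosses a high watermark; the 90% threshold mirrors the scenario above, and the helper names are hypothetical.

import queue

request_queue = queue.Queue(maxsize=1000)
HIGH_WATERMARK = 0.9                      # start shedding non-urgent work at 90% utilization

def admit(request, critical=False):
    utilization = request_queue.qsize() / request_queue.maxsize
    if utilization >= HIGH_WATERMARK and not critical:
        # Drop non-urgent work early so critical requests can still meet the SLA.
        return {"status": 429, "body": "deprioritized, retry later"}
    try:
        request_queue.put_nowait(request)
        return {"status": 202, "body": "accepted"}
    except queue.Full:
        return {"status": 429, "body": "queue full, retry later"}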

Challenges with backpressure

Based on what is described above, it seems we should always apply backpressure, and there should not be much debate about it. As true as that may be, the main challenge is not whether we should apply backpressure, but rather how to identify the right points at which to apply it, and the mechanisms for applying it that meet specific service/business needs.

Backpressure forces a trade-off between throughput and stability, made more complex by the challenge of predicting load.

Identifying backpressure points

Every system has bottlenecks. Some can withstand pressure and protect themselves, and some cannot. Think of a system where a large data plane fleet (thousands of hosts) depends on a small control plane fleet (fewer than 5 hosts) to receive configurations persisted in the database, as shown in the diagram above.

The large fleet can easily overwhelm the small fleet. In this case, to protect itself, the small fleet must have mechanisms to apply backpressure on its callers. Another common weak link in architectures is centralized components that make decisions for the whole system, such as anti-entropy scanners. If they fail, the system can never reach a stable state, and that can bring down the entire service.

Use system dynamics: monitors and metrics

Another common way of finding backpressure points for your system is to have the appropriate monitors and metrics in place. Continuously monitor the system's behavior, including queue depths, CPU/memory utilization, and network throughput. Use this real-time data to identify emerging bottlenecks and adjust your backpressure points accordingly.

Building a holistic view through metrics or observers such as performance canaries across the different components of the system is another way of knowing that your system is under stress and should assert backpressure on its users/callers. These performance canaries can be isolated to different aspects of the system to find choke points. In addition, a real-time dashboard of internal resource usage is another excellent way to use system dynamics to find points of interest and be more proactive.
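
As a small illustration of the idea, the check below turns a couple of live metrics into a single backpressure decision; the specific watermarks (80% queue utilization, 90% CPU) are assumptions for illustration, not recommendations.

def should_apply_backpressure(queue_depth, queue_capacity, cpu_utilization):
    # Flag stress when either the queue or the CPU crosses its watermark.
    queue_utilization = queue_depth / queue_capacity
    return queue_utilization > 0.8 or cpu_utilization > 0.9

# Example: a dashboard or canary feeding this check every few seconds
print(should_apply_backpressure(queue_depth=850, queue_capacity=1000, cpu_utilization=0.4))  # True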

Boundaries: the principle of least astonishment

The most obvious places for customers are the service surfaces they interact with. These are usually the APIs customers use to get their requests served. This is also where customers will be least surprised by backpressure, because it clearly signals that the system is under stress. It can take the form of throttling or load shedding.

The same principle can be applied within the service itself, across its different subcomponents and the interfaces through which they interact with each other. These surfaces are the best places to exercise backpressure. This helps minimize confusion and makes system behavior more predictable.

How to apply backpressure in distributed systems

In the last section, we discussed how to find the right points of interest for asserting backpressure. Once we know those points, here are some ways we can assert it in practice:

Build explicit flow control

The idea is to make the queue size visible to your callers and let them control their call rate based on it. By knowing the queue size (or whatever resource is the bottleneck), they can increase or decrease their call rate to avoid overwhelming the system. This kind of technique is particularly useful when several internal components work together and want to do their best without impacting one another. The equation below can be used at any point to calculate the caller rate, and a short sketch of the calculation follows it. Note: the actual call rate will depend on various other factors, but the equation should give a good idea.

CallRate_new = CallRate_normal * (1 - (Q_currentSize / Q_maxSize))
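
A small sketch of that formula in code, with variable names following the equation; the example numbers are purely illustrative.

def adjusted_call_rate(call_rate_normal, q_current_size, q_max_size):
    # The caller scales its rate down linearly as the published queue fills up.
    utilization = q_current_size / q_max_size
    return call_rate_normal * (1 - utilization)

# Example: at 75% queue utilization, a caller that normally sends 100 req/s
# throttles itself down to 25 req/s.
print(adjusted_call_rate(100, 75, 100))  # 25.0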

Reversed responsibilities

In some systems, it is possible to reverse the flow of requests: instead of callers explicitly sending requests to the service, the service requests work itself when it is ready to serve. This kind of technique gives the receiving service full control over how much it can process and lets it dynamically change the request size based on its latest state. You can employ a token-bucket strategy in which the receiving service fills the bucket, and the bucket tells the caller when and how much it can send to the server. Here is a sample algorithm the caller can use:

# The service requests work from the caller when it has capacity
if tokens_available > 0:
    # Request work, up to a maximum limit per request
    work_request_size = min(tokens_available, work_request_size_max)
    send_request_to_caller(work_request_size)

# The caller sends work only if the service has enough tokens for it
if tokens_available >= work_request_size:
    send_work_to_service(work_request_size)
    tokens_available = tokens_available - work_request_size

# Tokens are replenished at a fixed rate, up to the bucket size
tokens_available = min(tokens_available + token_refresh_rate, token_bucket_size)
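
To make the pull model concrete, here is a self-contained sketch in which the service advertises how much work it can take and the caller never sends more than that grant; the class, tick loop, and numbers are illustrative assumptions rather than a reference implementation.

class PullBasedService:
    def __init__(self, bucket_size, refresh_rate, max_request_size):
        self.tokens_available = bucket_size
        self.bucket_size = bucket_size
        self.refresh_rate = refresh_rate
        self.max_request_size = max_request_size

    def request_work(self):
        # Ask the caller for work only up to the current token budget.
        return min(self.tokens_available, self.max_request_size)

    def accept_work(self, size):
        self.tokens_available -= size

    def replenish(self):
        # Tokens refill at a fixed rate, capped at the bucket size.
        self.tokens_available = min(self.tokens_available + self.refresh_rate,
                                    self.bucket_size)

service = PullBasedService(bucket_size=100, refresh_rate=20, max_request_size=40)
pending_work = 150
for tick in range(5):
    grant = service.request_work()        # the service pulls work when it is ready
    sent = min(grant, pending_work)       # the caller sends at most the grant
    service.accept_work(sent)
    pending_work -= sent
    service.replenish()
    print(f"tick={tick} sent={sent} backlog={pending_work}")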

Proactive adjustments

Sometimes you know in advance that your system will soon be overwhelmed, and you take proactive measures, such as asking the caller to slow down its call volume and then increase it again slowly. Think of a scenario where your downstream system was down and rejected all your requests.

During that period, you queued up all the work and are now ready to drain it to meet your SLA. If you drain it faster than the normal rate, you risk taking down the downstream services. To address this, you proactively throttle the caller's limits, or engage with the caller to reduce its call volume, and open the floodgates slowly.
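
A rough sketch of opening the floodgates slowly: drain the backlog well below the normal rate and increase the rate additively each interval. The send_batch and backlog_remaining callbacks and the 10% steps are hypothetical.

import time

def drain_backlog(send_batch, backlog_remaining, normal_rate):
    rate = normal_rate * 0.1                               # start at 10% of the normal drain rate
    while backlog_remaining():
        send_batch(int(rate))                              # drain one interval's worth of work
        rate = min(normal_rate, rate + normal_rate * 0.1)  # additive ramp-up toward the normal rate
        time.sleep(1)                                      # one ramp-up interval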

Throttling

Restrict the number of requests a service can serve and reject requests beyond that. Throttling can be applied at the service or API level. This throttling is a direct backpressure signal telling the caller to slow down its call volume. You can go a step further with fairness or priority in the throttling to ensure that customers see the least possible impact.
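
As one possible shape for this, a per-API token bucket can serve as the throttle: requests beyond the configured rate are rejected with an explicit "retry later" signal. This is a sketch; the rate, burst size, and handler shape are assumptions.

import time

class Throttle:
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                      # caller should back off and retry

throttle = Throttle(rate_per_sec=50, burst=100)

def handle(request):
    if not throttle.allow():
        return {"status": 429, "body": "rate limit exceeded, retry later"}
    return {"status": 200, "body": "served"}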

Load shedding

Throttling is about rejecting requests when certain predefined limits are breached. Customer requests can still be rejected if the service is under stress and decides to proactively drop requests it has already promised to serve. This kind of action is generally a last resort, taken so the service can protect itself while informing the caller.
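
As a last-resort sketch, a service might shed a fraction of already-accepted requests once a stress signal crosses a threshold, telling each affected caller rather than failing silently; the stress_level input, the 0.8 threshold, and the notify_caller callback are all hypothetical.

import random

def shed_load(request_queue, stress_level, notify_caller):
    # stress_level in [0, 1], e.g. derived from CPU, latency, or queue depth.
    if stress_level < 0.8:
        return 0                                     # healthy enough: shed nothing
    drop_probability = min(1.0, stress_level - 0.8)  # shed more as stress grows
    kept, shed = [], 0
    while request_queue:
        request = request_queue.pop()
        if random.random() < drop_probability:
            notify_caller(request)                   # explicit signal, not a silent drop
            shed += 1
        else:
            kept.append(request)
    request_queue.extend(kept)
    return shed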

Conclusion

Backpressure is an essential challenge in distributed systems that can significantly impact performance and stability. Understanding the causes and effects of backpressure, along with effective techniques for managing it, is crucial to building robust, high-performance distributed systems. When implemented properly, backpressure can improve a system's stability, reliability, and scalability, leading to a better user experience.

However, if it is poorly managed, it can erode customer confidence and even contribute to system instability. Addressing backpressure proactively and monitoring the system carefully are key to maintaining system health. Although implementing backpressure involves trade-offs, such as a potential impact on throughput, the benefits in overall system resilience and user satisfaction are substantial.
