When more and more users start using a system, horizontal autoscaling can help improve performance and keep the system stable, but it is not as simple to configure as it may appear.
In this context, from a technology-agnostic point of view, this article defines horizontal autoscaling with some examples and discusses best practices for configuring autoscaling effectively.
Horizontal Autoscaling
Horizontal scaling (or scaling out) is the ability to add or remove instances of an application. Horizontal autoscaling means performing this process automatically.
The autoscaling process works by adding or removing application instances, as a result of metric alerts and/or scheduled actions.
Figure 1 shows an example that illustrates how this process works in more detail:
Figure 1 – Horizontal autoscaling architecture
Some components are represented more simply than they really are because the focus here is the autoscaling process, not other aspects. Here is a quick description of each component:
- Load Balancer - distributes client requests among all available Example APP nodes
- Example APP - a generic application that can have 1 to N nodes
- Application Infrastructure - manages the infrastructure required to run both Load Balancer and the Example APP nodes. It also collects infrastructure and application metrics and sends them to the Metrics Service
- Metrics Service - contains infrastructure metrics (CPU, memory, disk, etc.) and application metrics (response times, requests per second, etc.). It also sends alerts to the Autoscaler when, for example, the average CPU utilization reaches a given percentage
- Configuration file - contains all autoscaling configurations
- Autoscaler - manages the autoscaling process. Based on its configuration, it reacts to metric alerts and checks scheduled configurations, instructing the Application Infrastructure to add or remove Example APP instances. The Autoscaler also takes into account a configuration called the scaling cooldown, which specifies how long it should wait before taking a new action. This prevents unnecessary increases/reductions in the number of nodes, since changes in the number of nodes can take some time to complete. For simplicity, let's consider the Autoscaler's scaling cooldown to be 1 minute for any new action (a minimal sketch of this cooldown gate follows the list)
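To make the cooldown behavior concrete, here is a minimal sketch of the gate described above. All names are hypothetical and it is not the API of any real Autoscaler:

```python
import time

# Hypothetical sketch of a scaling cooldown gate: a new scaling action is only
# allowed once the cooldown window since the last action has fully elapsed.
class Autoscaler:
    def __init__(self, cooldown_seconds: int = 60):
        self.cooldown_seconds = cooldown_seconds
        self.last_action_time = None  # when the last scaling action happened

    def can_act(self) -> bool:
        """Return True if enough time has passed since the last scaling action."""
        if self.last_action_time is None:
            return True
        return (time.monotonic() - self.last_action_time) >= self.cooldown_seconds

    def record_action(self) -> None:
        """Remember when a scaling action happened, starting a new cooldown window."""
        self.last_action_time = time.monotonic()
```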
Let’s check some examples to get a better understanding.
Example 1 – Using Average CPU Utilization
Suppose the following configuration:
- Minimum number of nodes: 1
- Maximum number of nodes: 10
- Metric alerts:
- Average CPU utilization >= 75%
With this configuration, when the Autoscaler receives a metric alert of average CPU utilization equal to or above 75%, it instructs the Application Infrastructure to add one more Example APP node. If the number of nodes changes, a new change can only happen after 1 minute (due to the scaling cooldown). Otherwise, a change can happen at the next average CPU utilization alert.
Note that the number of nodes always stays between 1 and 10: no action will reduce the count below 1 or increase it above 10. The sketch below illustrates this decision rule.
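Here is a minimal sketch of the Example 1 rule, assuming one node is added per CPU alert and the result is always clamped to the configured limits (threshold and names are illustrative):

```python
# Illustrative scale-out rule for Example 1: one node is added per CPU alert,
# and the node count is always clamped between the configured minimum and maximum.
MIN_NODES, MAX_NODES = 1, 10
CPU_THRESHOLD = 75.0  # percent

def desired_nodes(current_nodes: int, avg_cpu_percent: float) -> int:
    target = current_nodes + 1 if avg_cpu_percent >= CPU_THRESHOLD else current_nodes
    return max(MIN_NODES, min(MAX_NODES, target))

# Example: 10 nodes at 90% CPU stays at 10 because of the maximum.
assert desired_nodes(10, 90.0) == 10
```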
Example 2 – Using a Schedule
Suppose the following configuration:
- Minimum number of nodes: 2
- Maximum number of nodes: 2
- Scheduled configuration:
- From: 2022/12/24
- To: 2022/12/25
- Minimum number of nodes: 10
- Maximum number of nodes: 10
This is a scheduled configuration, which is very useful when you already know that traffic will increase significantly during a specific time range. In this example, the Autoscaler instructs the Application Infrastructure to keep 10 nodes during that range. Before and after it, the Autoscaler uses the base configuration already in place, with a minimum and maximum of 2 nodes.
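A minimal sketch of how such a schedule could be resolved follows, assuming the scheduled window simply overrides the base minimum/maximum while the current date falls inside it (names and structures are hypothetical):

```python
from datetime import date

# Hypothetical resolution of Example 2: the scheduled window overrides the base
# minimum/maximum for the dates it covers.
BASE = {"min_nodes": 2, "max_nodes": 2}
SCHEDULED = {"from": date(2022, 12, 24), "to": date(2022, 12, 25),
             "min_nodes": 10, "max_nodes": 10}

def active_limits(today: date) -> dict:
    if SCHEDULED["from"] <= today <= SCHEDULED["to"]:
        return {"min_nodes": SCHEDULED["min_nodes"], "max_nodes": SCHEDULED["max_nodes"]}
    return BASE

assert active_limits(date(2022, 12, 24))["min_nodes"] == 10
assert active_limits(date(2022, 12, 26))["min_nodes"] == 2
```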
Best Practices
After defining horizontal autoscaling with some examples, let's discuss some best practices for configuring it. Some of them are more related to the application than to the autoscaling itself.
Configure a Graceful Start
Load Balancers usually use a health check endpoint to identify when they should start or stop sending requests to a given application node. If the node is healthy, requests are routed to it; otherwise, they are not. However, when a node has just started, if the health check already returns success while resources are still being loaded, clients can receive errors.
The simplest approach to solve this issue is to implement a health check endpoint in the application that only returns success when the application is really ready to accept requests. This may require changes in the application code, such as moving code from "post-start" to "pre-start" hooks. With this approach, nodes are added to the Load Balancer routing only when they are really ready, as in the sketch below.
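A minimal, framework-agnostic sketch of such an endpoint, using only the Python standard library (the port and path are illustrative):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Sketch of a readiness-aware health check: the READY flag is flipped only after
# all startup work has finished, so the Load Balancer never routes traffic to a
# node that is still loading resources.
READY = False

def load_resources():
    global READY
    # ... open database connections, warm caches, etc. ...
    READY = True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            status = 200 if READY else 503  # 503 keeps the node out of rotation
            self.send_response(status)
            self.end_headers()
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    load_resources()  # "pre-start": finish loading before serving health checks
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```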
Note that more sophisticated Application Infrastructures like Kubernetes have more types of health checks. Kubernetes, for example, has three: liveness, readiness, and startup probes. Each one works in a different way [2]:
- Liveness checks whether the application is running. If it fails, the application is restarted
- Readiness checks whether the application is ready to accept requests. If it fails, the application is removed from the Load Balancer request routing but is still allowed to run. Once the check succeeds, the application resumes receiving requests
- Startup checks the application's startup. While it has not succeeded, the Readiness and Liveness checks are disabled, preventing them from interfering with the application startup flow. If it fails, the application is killed and becomes subject to the restart policy configuration. This check is useful for legacy applications that take a long time to start
More details about Kubernetes probes can be found in the Kubernetes documentation [2]. For Kubernetes and other cases, it's important to understand and properly configure the health check types to prevent autoscaling issues.
Configure a Graceful Shutdown
Ensure your application has a graceful shutdown configured; otherwise, some requests can be interrupted abruptly when the number of nodes is reduced. The result can be errors returned to the client, or even inconsistent data related to the request.
In most cases, mature frameworks already do this, but developers should double-check this configuration and also ensure resources are properly closed when the application shuts down, as in the sketch below.
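A minimal sketch of the idea, assuming a Unix-like environment where the infrastructure sends SIGTERM before removing a node (the drain deadline and helper functions are placeholders):

```python
import signal
import sys
import time

# Sketch of a graceful shutdown: on SIGTERM, stop accepting new work, let
# in-flight requests finish (bounded by a deadline), then close resources.
shutting_down = False

def handle_sigterm(signum, frame):
    global shutting_down
    shutting_down = True  # stop accepting new requests

signal.signal(signal.SIGTERM, handle_sigterm)

def in_flight_requests() -> int:
    return 0  # placeholder: number of requests still being processed

def close_resources():
    pass  # placeholder: close database connections, flush buffers, etc.

def main_loop():
    while not shutting_down:
        time.sleep(0.1)  # placeholder for serving requests
    deadline = time.monotonic() + 30  # drain window (illustrative)
    while in_flight_requests() > 0 and time.monotonic() < deadline:
        time.sleep(0.1)
    close_resources()
    sys.exit(0)

if __name__ == "__main__":
    main_loop()
```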
In addition, some Load Balancer implementations (like the AWS Application Load Balancer) allow setting a connection-draining time. This configuration gives the application node a window to finish processing its current requests, after the node becomes unavailable for new requests and before a complete shutdown.
Use an Appropriate Load Balancer Configuration
A Load Balancer algorithm determines how the load is distributed among the application nodes. A good distribution of the load is essential for effective autoscaling of your application. Below is a quick description of some of the best-known Load Balancer algorithms (a minimal sketch of a few of them follows the list):
- Round robin - chooses the instance in a circular order
- IP hash - chooses the instance based on the hash of the client IP address
- Least connection - chooses the instance with the fewest active connections
- Least response time - chooses the instance based on the lowest average response time
- Least bandwidth - chooses the instance based on the least amount of network traffic
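Here is a simplified, single-process sketch of three of these selection strategies; real Load Balancers track connection state and health very differently, and the node names are illustrative:

```python
import hashlib
from itertools import cycle

# Illustrative node selection for round robin, least connection, and IP hash.
nodes = ["node-1", "node-2", "node-3"]
round_robin = cycle(nodes)
active_connections = {n: 0 for n in nodes}  # would be updated per request

def pick_round_robin() -> str:
    # Cycle through the nodes in a fixed circular order.
    return next(round_robin)

def pick_least_connections() -> str:
    # Choose the node currently handling the fewest active connections.
    return min(active_connections, key=active_connections.get)

def pick_ip_hash(client_ip: str) -> str:
    # The same client IP always maps to the same node.
    digest = int(hashlib.sha256(client_ip.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]
```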
In addition, there is another useful configuration called sticky sessions. With this configuration, the Load Balancer makes a client (or a group of clients) always hit the same application node, only choosing another one if that initial node becomes unavailable. Behind the scenes, the Load Balancer uses the IP address or other specific information (such as a cookie or header) to decide which node to select. This is especially useful for stateful applications that store session information, a situation you generally want to avoid but that is sometimes unavoidable.
It's important to note that a bad Load Balancer configuration may end up adding more instances than necessary or even reducing instances when it's not ideal. The reason is that if the load is not well distributed among the instances, the metrics will reach the autoscaling thresholds prematurely, which makes the Autoscaler trigger changes in the number of instances before it is really needed.
Be Careful with Stateful Applications
During autoscaling, keeping a consistent state inside the application nodes can be a challenge. Note that a state change in one node needs to be replicated to all others. Consider the following two possibilities:
If the replication is synchronous, you will have guarantees that the data is the same among all nodes. However, this process is heavy and can reduce the performance of your application, especially when you start having many nodes.
If the replication is asynchronous, you will have fewer performance issues, but the state will only be eventually consistent. Depending on the case, this won't work well, because an application node may not have the latest state, and errors can be raised when a client hits that node. Some ideas to reduce this issue include configuring Load Balancer sticky sessions and adding retries in the Load Balancer/clients.
So, the first recommendation is to avoid stateful applications. If this is not possible, try to delegate the internal state to an external service (e.g., a cache, as in the sketch below). Another option is to store the data in a JWT token that the client re-sends on every request (through headers or cookies).
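A minimal sketch of delegating session state to an external cache, assuming a Redis instance and the redis-py client; the host name, key format, and TTL are illustrative:

```python
import json
import redis  # assumes the redis-py client is installed (pip install redis)

# Sessions live in an external cache, so any application node can serve any
# client and nodes themselves stay stateless.
store = redis.Redis(host="cache.internal", port=6379)  # illustrative host

def save_session(session_id: str, data: dict, ttl_seconds: int = 1800) -> None:
    store.set(f"session:{session_id}", json.dumps(data), ex=ttl_seconds)

def load_session(session_id: str) -> dict | None:
    raw = store.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```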
If you really need a stateful application, evaluate the pros and cons about the replication types discussed above.
Understand the Scenario before Choosing the Autoscaling Configurations
Although it's very common to use a simple average CPU utilization alert metric, it's better to understand how your application works and which constraints you have before choosing the autoscaling configurations.
Using only the average CPU utilization alert metric can make your scaling triggers fire either too soon or too late, leading to higher costs and waste, or to a poor experience for users during high-traffic periods. Machine Learning or predictive statistics (e.g., anomaly detection) can help, but they are usually not enough because they don't always reflect the reality of the application.
Below are some examples where average CPU utilization alone is not a good approach:
- Applications that are I/O intensive (often wait for input and output operations) don’t use much CPU. For example, an application that essentially performs requests to a database and sends the data back to the client fits in this category. For this case, alert metrics like current number of requests or response times are good candidates.
- Applications that are event consumers may not use much CPU when there is a limit on how many events can be processed at the same time. For this case, the size of the event queue would be a better alert metric (see the sizing sketch after this list).
- Applications that are heavily used during specific time ranges. In this case, the Autoscaler may not be able to add nodes fast enough, partly due to the cooldown period. Here, using a scheduled configuration is recommended.
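As an illustration of the event-consumer case, here is a sketch that sizes the fleet by queue depth instead of CPU; the per-node capacity and limits are assumptions, not measured values:

```python
import math

# Illustrative sizing rule for an event consumer: scale by queue depth.
MIN_NODES, MAX_NODES = 1, 20
EVENTS_PER_NODE = 500  # assumed number of queued events one node can absorb

def desired_nodes_from_queue(queue_length: int) -> int:
    target = math.ceil(queue_length / EVENTS_PER_NODE)
    return max(MIN_NODES, min(MAX_NODES, target))

# Example: 2,600 queued events -> 6 nodes.
assert desired_nodes_from_queue(2600) == 6
```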
In addition to these examples, there can be specific constraints like costs and resource limits, which can affect your choices.
You can also combine multiple alert metrics for a more effective result. For example, you can use an application-specific alert metric that reflects the load of the system and, at the same time, the average CPU utilization as a contingency for unexpected scenarios, as in the sketch below.
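A minimal sketch of that combination, taking the larger node count that either metric asks for; the per-node capacity and CPU target are assumptions, and the CPU rule is roughly in the spirit of target-tracking autoscalers:

```python
import math

# Combine two alert metrics: requests per second (primary) and average CPU
# utilization (contingency). Whichever asks for more nodes wins.
MIN_NODES, MAX_NODES = 1, 20
REQUESTS_PER_NODE = 200       # assumed capacity per node
CPU_TARGET_PERCENT = 60.0     # assumed target average CPU utilization

def desired_nodes(current_nodes: int, requests_per_second: float, avg_cpu: float) -> int:
    from_requests = math.ceil(requests_per_second / REQUESTS_PER_NODE)
    from_cpu = math.ceil(current_nodes * (avg_cpu / CPU_TARGET_PERCENT))
    target = max(from_requests, from_cpu)
    return max(MIN_NODES, min(MAX_NODES, target))
```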
Note that finding a good autoscaling configuration is an iterative process that requires analysis, changes, and tests. This usually takes some time.
Run Tests to Validate your Configuration
Do not assume your autoscaling configuration will work well without testing it.
It's important to run load tests against a test environment with the same configuration as the production environment (CPU, memory, disk, etc.). Also try to simulate the same request patterns in your load tests, and use twice the load you are expecting.
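One possible way to script such a test is with a load-testing tool like Locust; the endpoints, weights, and host below are illustrative:

```python
from locust import HttpUser, task, between

# Minimal Locust load test sketch. Run with, for example:
#   locust -f loadtest.py --host https://test-env.example.com
class ApiUser(HttpUser):
    wait_time = between(1, 3)  # seconds between tasks, to mimic real users

    @task(3)
    def list_items(self):
        self.client.get("/items")  # illustrative read-heavy endpoint

    @task(1)
    def create_item(self):
        self.client.post("/items", json={"name": "example"})  # illustrative write
```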
If issues happen, do an analysis, adjust the autoscaling configurations, and run the same tests again. If everything goes well, check if there is anything else that can be improved. If not, you have found a good autoscaling configuration for your application.
Don't Reinvent the Wheel
Don't implement your own Autoscaler unless you really need to. Although this article focuses on technology-agnostic aspects, it's recommended to use well-known tools like Kubernetes, AWS Auto Scaling Groups, Azure Autoscale, etc. Also, DevOps techniques like containerization and infrastructure-as-code can help a lot when implementing autoscaling.
Conclusion
This article introduces some concepts related to the horizontal autoscaling of applications and discusses important aspects to be considered when configuring autoscaling.
It's not an easy task, but with knowledge and effort, effective autoscaling is possible.
References
[1] Kubernetes. Horizontal Pod Autoscaling. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
[2] Kubernetes. Configure Liveness, Readiness and Startup Probes. https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
[3] Ahmed, M. Kubernetes Autoscaling 101: Cluster Autoscaler, Horizontal Pod Autoscaler, and Vertical Pod Autoscaler. 2018. https://levelup.gitconnected.com/kubernetes-autoscaling-101-cluster-autoscaler-horizontal-pod-autoscaler-and-vertical-pod-2a441d9ad231
[4] Atatus. CPU, I/O and Memory Bound. https://www.atatus.com/ask/cpu-io-memory-bound
Acknowledgement
This piece was written by João Longo – Systems Architect and Innovation Leader at Encora’s Engineering Technology Practices group, and Ildefonso Barros – Senior DevOps Engineer. Thanks to Isac Souza and João Caleffi for reviews and insights.