Complete Guide to Cluster Immune System Technique

Cluster Immune System

With new architectural design patterns and strategies, software deployment is becoming increasingly complex. This complexity, combined with a large number of users, brings more edge cases and problems.

Imagine that you developed a new super cool service, which uses distributed computing techniques to deal with thousands of simultaneous users. A week passes and you receive feedback and decide to incorporate it into your service. You work on the changes and decide to deploy the new version, which leads you to two options:

Destroy all nodes and deploy the new version
Progressively deploy nodes with the new version, shifting users to them.

Since you don’t want to annoy your users with downtime, you choose the latter and start the process immediately. Your first node comes up and runs correctly, so you decide to take a walk while the process continues.

Upon returning to your desk, you are confused by hundreds of emails and missed calls, only to discover that you made a small mistake and all nodes are crashing every couple of minutes, losing data and making your customers angry. Murphy’s law at its best.

This situation could be avoided if the deployment tool had access to logs and other information to halt the process when necessary. But there are even more advanced techniques, like the one that we are going to talk about today, the Cluster Immune System.

Definition

A Cluster Immune System is an extension of a technique called Canary Deployment. In it, the tools used to monitor the production environment are linked to the release process.

If a new version of an application does not meet performance targets, or if a problem is detected, the release process is halted without human intervention, and code is automatically rolled back to the previous version. The deployment remains locked while the problem is investigated.
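The behavior described above (halt, roll back, lock) can be sketched as a small state model. This is only an illustration under stated assumptions: the `Release` class, its fields, and the `deploy` and `on_health_check` functions are hypothetical names, not part of any particular deployment tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Release:
    """Hypothetical release state; all names here are illustrative."""
    live_version: str
    previous_version: str
    locked: bool = False
    events: List[str] = field(default_factory=list)


def deploy(release: Release, new_version: str) -> bool:
    """Start serving new_version, unless the release is locked."""
    if release.locked:
        return False  # deployments stay blocked while the problem is investigated
    release.previous_version = release.live_version
    release.live_version = new_version
    return True


def on_health_check(release: Release, healthy: bool) -> Release:
    """Roll back to the previous version and lock the release on failure."""
    if healthy or release.locked:
        return release
    release.events.append(
        f"rollback {release.live_version} -> {release.previous_version}"
    )
    release.live_version = release.previous_version
    release.locked = True  # no human intervention was needed to get here
    return release
```

In this sketch, unlocking is deliberately left out: clearing `locked` would be a manual step taken once the investigation is finished.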

How it Works

This technique has two parts: the monitoring infrastructure and its connection to the deployment procedure. As a result, it works differently depending on the deployment technique being used.

Assuming that the relevant metrics are already being collected, the remaining step is to define the expected values for those metrics, from which deployment decisions will be made. These could be API error rates, latency, usage statistics (e.g. number of logged-in users, transaction rate), or any other metric relevant to a stakeholder. Errors caused by previous updates are a good source of candidates. In the case of Canary Deployments or Rolling Updates, the cluster immune system is used to monitor the new version and control the percentage of users being migrated to it. The following decision table illustrates how the measurements could impact the deployment process:

[Figure: decision table relating measurements to deployment actions]
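As a rough illustration of how measurements can map to deployment actions, the sketch below compares each metric with a warning level and a critical level. The metric names, the threshold values, and the three decisions are assumptions made for this example; they are not taken from the original table.

```python
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    HOLD = "hold"
    ROLLBACK = "rollback"


# Hypothetical limits: (warning level, critical level) for each metric.
THRESHOLDS = {
    "error_rate": (0.01, 0.05),        # fraction of failed requests
    "p95_latency_ms": (300.0, 800.0),  # 95th percentile latency
}


def decide(metrics: dict) -> Decision:
    """Return the most severe decision triggered by any metric."""
    decision = Decision.PROCEED
    for name, (warn, critical) in THRESHOLDS.items():
        value = metrics[name]
        if value >= critical:
            return Decision.ROLLBACK  # one critical metric is enough
        if value >= warn:
            decision = Decision.HOLD
    return decision
```

For example, `decide({"error_rate": 0.02, "p95_latency_ms": 120.0})` returns `Decision.HOLD`, because the error rate is above its warning level but below its critical level.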

Depending on the maturity of the deployment process, other decisions could be linked to the metrics, such as update rate or the algorithm to select the set of users receiving the new version.
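One way to picture the link between decisions and the share of migrated users is the sketch below. The decision labels and the `step` parameter (standing in for the update rate) are hypothetical; a real rollout tool would expose its own controls.

```python
def next_traffic_percent(current: int, decision: str, step: int = 10) -> int:
    """Return the share of users (0-100) routed to the new version.

    decision is one of "proceed", "hold" or "rollback" (hypothetical labels).
    """
    if decision == "rollback":
        return 0  # send everyone back to the old version
    if decision == "hold":
        return current  # keep the current split while the signal is unclear
    if decision == "proceed":
        return min(100, current + step)
    raise ValueError(f"unknown decision: {decision}")


def simulate(decisions, step=10):
    """Apply a sequence of decisions and return the traffic share after each."""
    percent, history = 0, []
    for decision in decisions:
        percent = next_traffic_percent(percent, decision, step)
        history.append(percent)
        if decision == "rollback":
            break  # the rollout stops once a rollback is triggered
    return history
```

A more mature process could vary `step` based on the metrics themselves, for instance taking smaller steps when latency is close to its limit.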

Possible Challenges

A good, reliable set of metrics is necessary to use this technique
Problems with the monitoring solution might affect the rollout process
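To limit the second risk, the immune system can treat missing or stale measurements as a reason to pause rather than as a healthy signal. The sketch below assumes each metric arrives with a sampling timestamp; the function names, the 60-second freshness window, and the decision labels are illustrative choices.

```python
import time
from typing import Optional


def metric_is_usable(value: Optional[float], sampled_at: float,
                     now: Optional[float] = None,
                     max_age_s: float = 60.0) -> bool:
    """A metric is usable only if it exists and was sampled recently."""
    now = time.time() if now is None else now
    return value is not None and (now - sampled_at) <= max_age_s


def safe_decision(value: Optional[float], sampled_at: float, limit: float,
                  now: Optional[float] = None) -> str:
    """Pause the rollout instead of guessing when monitoring data is unusable."""
    if not metric_is_usable(value, sampled_at, now):
        return "hold"
    return "rollback" if value > limit else "proceed"
```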

When to Use

This technique can be used whenever an automated deployment strategy and a reliable monitoring system are in place. The level of automation (regarding the deployment control) can be chosen according to the maturity of the deployment process. Rollbacks are especially difficult to handle when data schema changes are part of the update, so triggering them automatically could be risky.
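The caution about schema changes could be encoded as a simple guard that decides whether a rollback may be triggered automatically. The `ReleaseInfo` fields are hypothetical flags that a release pipeline would have to supply.

```python
from dataclasses import dataclass


@dataclass
class ReleaseInfo:
    version: str
    has_schema_change: bool             # hypothetical flag set by the pipeline
    schema_is_reversible: bool = False  # True if a tested down-migration exists


def rollback_mode(release: ReleaseInfo) -> str:
    """Choose how a rollback may be triggered for this release."""
    if release.has_schema_change and not release.schema_is_reversible:
        return "manual"  # a person must decide, since data could be lost
    return "automatic"
```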

Adopting in a Greenfield

The first step is to adopt one of the required deployment processes, such as Canary Deployment or Rolling Updates.

It’s also important to define, during the design of every feature, which signals will be monitored and acted upon during an update. In addition, the deployment process for every component must support, as a requirement, all the automatic actions that the immune system can trigger.

The monitoring system must also be designed to be robust enough for the immune system to act on its error signals.

Adopting in a Brownfield

The first step in adopting this technique is to make sure the deployment process is mature enough so that it can be automated. This can be done by incrementally introducing the immune system with the following steps:

Select candidate metrics from the monitoring system. You can use historical failure-after-update data to get good candidates.
Track the quality of the candidate metrics emitted by the monitoring system, fixing false alarms and adding missing alarms.
Implement a dry-run mode for the immune system, so that decisions can be tracked without actually being executed automatically.
Implement a manual gatekeeper for the actions, so that operations staff can review and approve (apply) them.
Apply this mechanism in a pre-production environment, where you have the same configuration and a controlled environment to validate each new iteration of the process evolution.
Promote stable/reliable actions to be fully automated.
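The dry-run, manual approval, and fully automated stages from the steps above can be sketched as three modes of a single action handler. The `Mode` names and the `execute` and `approve` callbacks are assumptions for illustration; in practice they would wrap the deployment tool and an approval workflow.

```python
from enum import Enum
from typing import Callable, List


class Mode(Enum):
    DRY_RUN = "dry_run"  # record the decision only
    MANUAL = "manual"    # wait for operator approval
    AUTO = "auto"        # execute immediately


def handle_action(action: str, mode: Mode,
                  execute: Callable[[str], None],
                  approve: Callable[[str], bool],
                  log: List[str]) -> bool:
    """Record the proposed action and execute it only if the mode allows it.

    Returns True when the action was actually executed.
    """
    log.append(f"{mode.value}: proposed {action}")
    if mode is Mode.DRY_RUN:
        return False
    if mode is Mode.MANUAL and not approve(action):
        log.append(f"{mode.value}: rejected {action}")
        return False
    execute(action)
    log.append(f"{mode.value}: executed {action}")
    return True
```

Because every proposal is logged regardless of mode, the dry-run log can be compared with what operators actually did, which helps decide when an action is reliable enough to promote to `Mode.AUTO`.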

Key Takeaways

Good metrics are essential. Take your time to select the best ones for this technique.
Rollbacks are especially difficult if you have changes in data structure. Make sure that you can revert changes in your database schema, for example.
This technique works best with small but continuous changes. If you are planning to make a major one, check other options (or make sure that you have taken all precautions).

Acknowledgement

This article was written by João Pedro São Gregório Silva, Software Developer and co-authored by Isac Sacchi Souza, Principal DevOps Specialist, Systems Architect & member of the DevOps Technology Practice. Thanks to João Augusto Caleffi and the DevOps Technology Practice for reviews and insights.
