How we managed to prove the ‘unverifiable’
Deploying an untested change to production will always cause some amount of stress.
The stress level increases with the importance of a system that is changing. If there is any seasonality in your traffic patterns, the stress also increases during a “busy season”. That can be daily spikes, high traffic periods of the year, the week your business goes public, or any similar seasonality. Every past incident you had when you implemented a change also affects your stress level. The newer they are, the more stress they add.
In my experience, implementing a proven change reduces stress to almost zero. Even with testing in place, incidents can and will happen, but you have a way to catch them sooner and not repeat them.
Stress reduction is reason enough to put effort into testing changes, even ones that seem difficult or even impossible to test before they hit production.
HAProxy, the Nemesis of Tests
A infobeep, we implement more than 1000 different changes in production every day. Most of those changes are code, which can and is tested. But there are also changes to network configurations, virtualization, or storage layers, which are much more difficult to test.
Until recently, one of the things we considered impossible to prove was HAProxy configurations. But we just weren’t thinking about it enough.
All of our incoming HTTP traffic goes through an L7 balancing layer, where we use HAProxy. Over the years, we have grown to over 40 data centers, many of which have specific configurations, primarily due to special customer requirements, various migrations, and different product stacks available in specific data centers. It is impossible to have a staging environment that addresses all discrepancies.
Where we fell short before
We tested a change in a single test environment we had, manipulating the configuration to a state similar to what we wanted to change in production. This process was completely dependent on the engineers making the change.
There was no strict procedure that required us to test the change before production. We take it as common sense and preach it, but when someone created a pull request for the change, we never checked to see if it was tested.
The only requirement to deploy a change to production was a pull request approved by one of the engineers. A few incidents later, we increased this to two engineers for the most important systems. We consider this a temporary measure until we think of something better. Of course, this was a pure hopium based strategy.
This is an example of a HAProxy related incident:
This was a change that resulted in a severe degradation of our platform. After applying this, some of the requests on the api.infobip.com endpoint were routed to the wrong backend. As seen in the photo, it was a simple change.
Two engineers approved it, both thinking that HAProxy behaves differently than it does. One of them was a new employee. Imagine the stress they experienced and all the flashbacks they had when the next change was needed. We decided to improve this process significantly.
A New Approach: Scratch Only Where It Itches
At first, we thought that testing HAProxy configurations before production was inherent to the type of configuration and the way we use it.
“But every HAProxy configuration is unique to a specific data center!”
“We can’t pretend responses All backends!”
“There’s a lot of customer specific context in our HAProxy configurations that you must understand to do right!”
Some costly incidents over the years made us think harder to find a solution. And it wasn’t that hard to find. In the process, we learned that while these statements are still true, they do not make the configuration untestable.
Our HAProxy configuration consists of many access lists that route traffic to specific backends based on regular expressions. When we looked closely at the incidents, we noticed that about 90% were due to a misconfigured regular expression or incorrect access list ordering.
That’s when we realized we didn’t need to test every line of the HAProxy configuration or do the test process more complicated than it should be. We should try the things we have trouble with! Scratch only where it itches!
And this part of the itch can be tested if we do so. So, we did it.
Every change we make to the HAProxy configuration must pass a series of tests, but someone still has to approve it. On some HAProxies we now run more than 3000 tests on each change. Some of them were generated automatically and some were written manually.
Each push to a git repository triggers a Jenkins build, which creates an ad-hoc Docker environment, prepares the HAProxy configuration for testing, and runs all tests. HAProxy directives that are not subject to tests are replaced with generic values. If any test fails, we consider the build to have failed and the change cannot be merged into the production branch.
We tested a couple of tools to evaluate responses to HTTP requests and chose Throw due to its good assertion engine, simple file format, and speed of test execution.
Hurl is a command line tool that executes HTTP requests defined in a simple plain text format. You can make requests, capture values, and evaluate queries on headers and body responses. It is very versatile: it can be used both to obtain data and to test HTTP sessions. You can assert both the JSON response and the HTTP response headers. Furthermore, you can even test full HTML pages or even byte content.
A simple web server that only responds (hence this “brilliant” name) HTTP 200 OK, reflecting back everything it received. Each HTTP request header that the http responder receives is returned in a JSON structure to the client.
During configuration preparation, all back-end servers in the HAProxy configuration point to the http responder. Also, the backend name is injected as a header in each request. This allows us to check if our search is routed to the expected backend, or if a reverse proxy injected any required headers, or if it blocks the request with malicious content in the HTTP headers as we configured.
The things we tried are:
- Access lists and request routing
- HTTP response status codes
- HTTP request header injections
- HTTP Response Header Injections
- HTTP header or URL path rewrites
write the tests
To test the behavior of a reverse proxy, you only need to describe the desired result of a specific request.
For example, we have the following directive in our configuration:
To translate, all requests in
test.api.infobip.com endpoint with a path starting with
/mobile/ should be routed to the
iam-mobile back end
Now we just write this sentence as proof.
This test runs on every change. So if we introduce some faulty regex that would cause traffic on /mobile to end up on another backend, this test will fail.
Zero Incident Achievement: Unlocked
The introduction of a test channel in our configuration change process improved the quality of our platform and the number of related incidents was reduced to zero. My stress level is also pretty close to that number now.
I encourage you to check your change deployment processes and try to find a place for automated testing. Remember: you don’t have to try everything. Try what matters. I guarantee you it will be worth it.