How we are redesigning our microservices deployment strategy
Sharing the story of why we decided to adopt Keptn internally at Dynatrace and how we are making it happen
Struggling with pipeline maintenance
Around two years ago, we started developing new cloud-native microservices, and we needed to create workflows and mechanisms to deploy them. So, we chose to create a deployment pipeline with Jenkins, as it’s quite standard.
Slowly but surely, this pipeline evolved into something that does everything: building, running unit tests, uploading the image to the registry, promoting the image to the next stage, rolling out, rolling back, etc. This was great for one microservice, but it became an issue down the line because, for each new microservice, developers would copy the same Jenkins pipeline to save time and adapt it to their needs. We ended up with many very similar pipelines, but nothing was standardized.
This situation introduced a lot of maintenance effort into our daily work. Here’s an example: one of our pipelines logged some tokens into the Jenkins log files. Due to the duplicated nature of the code, this bug had to be fixed four times. All that time could have been spent doing some actual product development!
So, it was time for a change, and we picked Keptn as our new tooling to improve the situation.
We want to give developers the best and easiest experience ever when developing and deploying their applications.
Why did we choose Keptn?
Our primary reasons were:
- Implementing quality gates
Keptn is well-suited for making data-driven decisions, and we wanted to ensure that the tests we create work and that the deployment meets our quality requirements.
- Less maintenance needed
Keptn uses standard CloudEvents, making it easier for developers to create their own integrations with other open-source tools.
- Customer zero
A Dynatrace-internal team is one of the main contributors to Keptn, and we want to be customer zero for them while collaborating directly with them.
- Improve Keptn for enterprise cases
We also wanted to make Keptn better by providing use cases for large-scale enterprise deployments.
Getting started with a new deployment process
Once we decided to start with this project, we found out that another team was working on the same topic within Dynatrace. So, we reached out to them and started collaborating to create a common approach for deploying applications.
Our shared goals were to:
- Stabilize the deployment mechanism
In the beginning, the Keptn Helm service had some instabilities that led to deployment issues, which could not be recovered from without manual intervention on the Kubernetes cluster. We looked into the code of the Helm service and of Helm itself and found that Helm has an atomic flag, which makes it easier to roll back deployments if they fail.
- Make the whole CD invisible for developers
Developers should not need to trigger Keptn interactively. They should only need to run git commit and git push.
- Set a declarative configuration in the git repository
Instead of declaring each step to Keptn one by one, we wanted to describe the deployment’s desired future state. Developers don’t need to know how the tools work; they only need to write the configuration. The deployment tooling itself does the rest.
- Have a small piece of glue code for CI/CD integration
A shell script to automate pushing the configuration files to Keptn.
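To make the first goal more concrete, here is a minimal sketch of what an atomic Helm rollout looks like when expressed as a CLI call. It only builds the argument list; the release name, chart path, namespace, and timeout are made-up values, and how the Keptn Helm service sets the flag internally is not shown here.

```python
from typing import List


def build_helm_upgrade(release: str, chart: str, namespace: str,
                       timeout: str = "5m") -> List[str]:
    """Return the argv list for an atomic Helm upgrade.

    All argument values are illustrative; only the Helm flags are real.
    """
    return [
        "helm", "upgrade", release, chart,
        "--install",               # install the release if it does not exist yet
        "--atomic",                # roll back automatically if the upgrade fails
        "--namespace", namespace,
        "--timeout", timeout,      # how long Helm waits before treating it as failed
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_helm_upgrade("test-service", "./chart", "dev")))
```

With the atomic flag set, a failed upgrade leaves the previous release in place instead of a half-deployed one that needs manual cleanup on the cluster.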
All this is designed in a way that developers don’t need to interact with Keptn directly, as pictured here:
Everything starts with a git push, and then the CI pipeline will begin. If everything is fine, the pull request will be ready to be merged to master. All that the developer needs to see is the successful delivery in Keptn’s Bridge.
Our first approach: using Keptn in a Git-based, declarative way
We wanted to follow a declarative approach to trigger Keptn, and our first attempt was to do it with Lambda Functions. We had a bit of code that triggered a Lambda function that triggered a Keptn deployment and created services. This worked well, but we realized that this was not platform-agnostic (it only worked on AWS).
After some time, we built a Git Operator for Keptn: a component in the Keptn Control Plane that detects whether something has changed in the Git repository and, if so, triggers the needed actions. We wanted it to translate the declarative configuration from our first approach into imperative Keptn commands.
So, our architecture looks like this:
We have a Code Repository that contains the whole truth of our application and the Keptn deployment configuration. In the CI transformation step, we transform this configuration for the Keptn configuration repository, which contains no application code, just the Keptn configuration. (This is a good solution, in my opinion, because the Keptn configuration might reside in the cloud and be accessible to different services, but it doesn’t have to contain the application code. The only things relevant to Keptn are the artifacts, such as the Helm chart and the container image. Therefore, we don’t need the application code there.)
In the Control Plane Cluster, the Keptn Git Operator does nothing more than check the configuration repository for changes every 30 seconds. If something has changed, it works out what (new services, changes in specification, etc.) and communicates it to the Keptn API and the Kubernetes API. It creates Kubernetes objects on its own to track the state of the Keptn services.
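The polling behavior can be sketched roughly as follows. This is not the operator's actual code: the snapshot format (service name mapped to version), the action names, and the choice to deploy a new service right after creating it are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

# Assumed snapshot format: service name -> version declared in the config repo.
Snapshot = Dict[str, str]
Action = Tuple[str, str, str]  # (action name, service, version)


def plan_actions(previous: Snapshot, current: Snapshot) -> List[Action]:
    """Turn the difference between two snapshots into imperative actions."""
    actions: List[Action] = []
    for service, version in sorted(current.items()):
        if service not in previous:
            # New service: create it, then deploy it (assumed behavior).
            actions.append(("create-service", service, version))
            actions.append(("trigger-deployment", service, version))
        elif previous[service] != version:
            # Known service whose declared version changed.
            actions.append(("trigger-deployment", service, version))
    return actions


def watch(read_snapshot: Callable[[], Snapshot],
          apply: Callable[[Action], None],
          interval_seconds: float = 30.0,
          iterations: Optional[int] = None) -> None:
    """Poll the repository and apply whatever changed since the last poll."""
    previous: Snapshot = {}
    done = 0
    while iterations is None or done < iterations:
        current = read_snapshot()
        for action in plan_actions(previous, current):
            apply(action)
        previous = current
        done += 1
        time.sleep(interval_seconds)
```

Keeping the diff in a pure function such as plan_actions makes the declarative-to-imperative translation easy to test without a cluster or a Git repository.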
We also introduced some custom resources into Kubernetes, such as Keptn Project and Keptn Service. If we want to create something, we say, “yes, Keptn, please create the service.” If the deployment changes, we say, “yes, Keptn, please trigger my deployment,” and Keptn does its job and sends the deployment-triggered event. The Keptn Helm Service, in turn, watches for deployment events from the Keptn API and deploys the application to the application namespace as it usually does.
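For readers who have not seen a CloudEvent before, a deployment-triggered event could look roughly like the one built below. The top-level attributes follow the CloudEvents 1.0 specification. The event type string and the payload fields are modeled on Keptn's naming scheme and should be checked against the Keptn version in use; the source value is a made-up name.

```python
import json
import uuid
from typing import Any, Dict


def deployment_triggered_event(project: str, stage: str, service: str,
                               version: str) -> Dict[str, Any]:
    """Build a CloudEvent-shaped dict for a triggered deployment (illustrative)."""
    return {
        # Required CloudEvents 1.0 context attributes.
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": "git-operator",  # hypothetical source name
        "type": "sh.keptn.event.deployment.triggered",  # assumed type string
        "datacontenttype": "application/json",
        # Assumed payload shape: where to deploy, plus the version as a label.
        "data": {
            "project": project,
            "stage": stage,
            "service": service,
            "labels": {"version": version},
        },
    }


if __name__ == "__main__":
    event = deployment_triggered_event("demo", "dev", "test-service", "1.0.0")
    print(json.dumps(event, indent=2))
```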
Shell script overload
We needed to push configuration files to Keptn, so we thought we’d use shell scripts to glue CI and CD. This worked fine until we had three different services, which meant three different shell scripts as well.
Eventually, we decided to make a Go binary for all of this: a CLI tool that integrates into the CI pipeline and requires no Keptn knowledge from developers. Developers only need to know how to write Helm charts or Locust files because the Keptn repository already includes all the required configurations. For each stage, there’s a different Helm chart, and the developer needs to follow a specific directory structure (see the example here).
In the CLI, you call trigger-deployment with the name of the service and the version:
ci-connect-cli trigger-deployment --service test-service --version 1.0.0
This triggers the whole CD pipeline, and developers don’t need to be Keptn experts to use it.
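The real tool is a Go binary, and its source is not shown here. Purely to illustrate the shape of the command-line interface, the sketch below parses the same invocation with Python's argparse. Only the subcommand and the two flags come from the example above; everything else is an assumption.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the trigger-deployment invocation."""
    parser = argparse.ArgumentParser(prog="ci-connect-cli")
    subcommands = parser.add_subparsers(dest="command", required=True)
    trigger = subcommands.add_parser(
        "trigger-deployment",
        help="trigger the CD pipeline for one service version",
    )
    trigger.add_argument("--service", required=True, help="name of the service")
    trigger.add_argument("--version", required=True, help="version to deploy")
    return parser


def parse(argv: List[str]) -> argparse.Namespace:
    """Parse an argument list (without the program name)."""
    return build_parser().parse_args(argv)


if __name__ == "__main__":
    args = parse(["trigger-deployment", "--service", "test-service",
                  "--version", "1.0.0"])
    print(args.command, args.service, args.version)
```

The point of such a narrow interface is that a CI job only has to supply two values it already knows, the service name and the version.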
However, soon we encountered another issue…
Creating the Promotion Service
Keptn always takes the latest configuration version from the “stage” repository, but the Helm chart there can change independently of the container image version.
That means that if you were to promote:
- Custom Helm Chart + Container Image -> both in version 1.0
If you approve this one week later, but the Helm chart has been upgraded to version 1.1 in the meantime, the result would be:
- Helm Chart Version 1.1 (HEAD of stage) + Container Image 1.0
The new Helm chart got promoted to hardening, but we still had the old container image. So, we found out that this approach results in inconsistent deployments. We introduced this issue ourselves, since we used custom Helm charts instead of the standard Keptn deployment mechanisms.
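The mismatch is easy to reproduce in a few lines. The sketch below models the stage configuration as a list of chart versions with HEAD at the end; the function names and the data model are invented for illustration.

```python
from typing import Dict, List


def resolve_from_head(chart_history: List[str], approved_image: str) -> Dict[str, str]:
    """Pair the approved image with whatever chart is at HEAD of the stage."""
    return {"chart": chart_history[-1], "image": approved_image}


def is_consistent(deployment: Dict[str, str]) -> bool:
    """A deployment is consistent when chart and image share one version."""
    return deployment["chart"] == deployment["image"]


if __name__ == "__main__":
    # Version 1.0 was submitted for promotion; the chart moved to 1.1 before approval.
    promoted = resolve_from_head(["1.0", "1.1"], approved_image="1.0")
    print(promoted, "consistent:", is_consistent(promoted))
```

Resolving the chart at approval time rather than at submission time is what lets the two versions drift apart.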
Keptn did not have a solution for this issue out-of-the-box, so we decided to work together with the Keptn team and build something new. This is what we came up with:
The CI-Connect CLI takes our git commit and adds a tag with the service name and the commit version. Then we write the version into the deployment manifest, which is committed to master.
Afterward, the Operator reads the version from the deployment manifest in master, detects that something changed, and determines which version it has to fetch. It then creates the deployment event with the version as a label. Each stage has its own promotion service, which pulls the service and the version from the git tag. The repository contains a base directory with the Helm chart and stage-specific configuration. The promotion service merges these for a specific version and writes exactly that result to the stage branch.
As a result, the stage branch becomes a single source of truth for everything that gets deployed to the target stage. Any GitOps tool could pull the configuration from there and deploy it (this is not possible currently, but PoCs are already in progress). Finally, the contents of the stage branch are deployed by the Keptn Helm service.
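The merge step of the promotion service can be approximated as a deep merge of base values with stage-specific overrides, pinned to one version. This is a sketch under assumptions: the tag format, the values layout, and the function names are all hypothetical, and the real service works on files in a Git repository rather than on dictionaries.

```python
import copy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where override wins, merging nested dicts recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def tag_name(service: str, version: str) -> str:
    """Hypothetical tag format combining service name and version."""
    return f"{service}-{version}"


def render_stage_config(base: Dict[str, Any], stage_overrides: Dict[str, Any],
                        service: str, version: str) -> Dict[str, Any]:
    """Merge base and stage configuration and pin the image to one version."""
    values = deep_merge(base, stage_overrides)
    # Pin the image so chart values and image always refer to the same version.
    values.setdefault("image", {})["tag"] = version
    return {"source_tag": tag_name(service, version), "values": values}


if __name__ == "__main__":
    base = {"replicas": 1,
            "image": {"repository": "registry.example/test-service"},
            "resources": {"cpu": "100m", "memory": "128Mi"}}
    hardening = {"replicas": 3, "resources": {"cpu": "500m"}}
    print(render_stage_config(base, hardening, "test-service", "1.0.0"))
```

Because the result is written to the stage branch as a whole, the chart and the image can no longer drift apart between submission and approval.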
How the deployment steps fit together
Here is an overview of what the whole process looks like:
In the Code Repository, we have the Application code and Deployment Configuration. This is our single source of truth.
Then, we trigger our CI pipeline. The KIA/CI-Connect CLI pushes to the Configuration Repository (here, we do not depend on Jenkins; if we ever want to change CI tooling, we can simply change one environment variable).
The Configuration Repository is watched by the Git Operator, which in turn gives instructions to the Keptn API. The Promotion Service listens for promotion events (defined as the first sequence step in each stage) and provides the configuration for the deployment tooling (Keptn Helm Service in our case). Finally, the Keptn Helm Service creates the namespace in Kubernetes (if it doesn’t exist) and deploys the application.
Next step: testing
Now that our deployment is ready, we need to run tests to ensure that our code works correctly. For this, we created a Job Executor Service. But to avoid making this blog post longer than it should be, I will explain what it is and how it works in a second blog post.
Read part 2 here:
How we are redesigning our microservices deployment strategy was originally published in Dynatrace Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.