Take the Ops out of GitOps

Harry Panayiotou
6 min read · Jan 24, 2021

Architecture for a marriage between GitOps and CD

Introduction

This is a description of how my teammates and I implemented ArgoCD for Kubernetes. The difference is that we did it in a world of small teams, with no capacity to manage files in Git manually and with Kubernetes knowledge limited to the DevOps team.

The Rise of GitOps

GitOps has become really popular over the last couple of years. Keeping your cluster state in lock-step with a git repository can have some great benefits:

  • A more consistent understanding of what’s in your cluster
  • No drift from manual operations that never make it into version control, because sync is enforced at all times
  • DR enablement, by being able to stand up a new cluster from a declarative state

There are some drawbacks though:

  • Committing changes is a manual process
  • The peer-review process for applying commits can delay releases
  • Notifications on cluster state need more care
  • Without a templating solution, it can be hard to roll out platform updates across the board
  • Git isn’t well suited to automated, parallel processes
  • Developers may need to understand Kubernetes objects to update apps

We implemented GitOps for all the benefits I mentioned. However, we had a small team and the developers weren’t full stack, so it wouldn’t have been reasonable to expect them to manage all the relevant manifests manually. We had to find a way to implement GitOps while keeping our CI/CD pipelines.

Our GitOps Repo Setup

We wanted a simple Git repository: a single branch on a single repo for all clusters and all environments. This approach would let us see the state of our entire platform in one place, but it would make a simple pipeline hard to achieve, since automating the commit process against a single branch increases the likelihood of Git conflicts.

As I mentioned, we had a small team, so we had to be careful about where we placed our complexity. We decided to place it in the automation of the pipeline, rather than spreading our state over a number of repositories. Each cluster’s operator would subscribe to a specific sub-directory of the Git repository to understand what it should sync.
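To make that concrete, here is a hypothetical layout (the directory and file names are invented for illustration); each cluster’s ArgoCD instance would be pointed at its own top-level directory:

```
gitops-repo (single repo, single branch)
├── cluster-prod
│   ├── namespaces
│   │   └── billing.yaml
│   └── apps
│       └── billing-service.yaml
└── cluster-staging
    ├── namespaces
    │   └── billing.yaml
    └── apps
        └── billing-service.yaml
```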

The Git automation problem

Git’s job is to keep your source code versioned. The mechanism it uses to accomplish this is complex, and it can be fragile if things aren’t done in exactly the right order.

The natural way to use Git in a pipeline is to pull the repo, add a commit to the branch and push it back up. However, with multiple apps come concurrent pipelines, which means the tip of the remote branch can change between the pull and the push, causing the push to be rejected. This is especially likely with a simple, single-branch repo structure like ours. There is a workaround of rerunning the deployment step in the pipeline, but that’s not a nice experience for developers.

Git in CI/CD

The approach was simple: we had to find a way to serialise automated operations on Git. We did this using a queuing service that we called pubsub2git.

In our CI/CD pipeline, we wouldn’t push commits to a repo; we would publish a message to GCP’s Pub/Sub. The message contained the path and contents of a file to be committed, where the path was a string and the contents were base64 encoded.
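As a minimal sketch of the publishing side, using the cloud.google.com/go/pubsub client (the project, topic and file paths here are hypothetical, not the real system’s names):

```go
package main

import (
	"context"
	"encoding/base64"
	"fmt"
	"log"
	"os"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()

	// Hypothetical project and topic names.
	client, err := pubsub.NewClient(ctx, "my-gcp-project")
	if err != nil {
		log.Fatalf("pubsub client: %v", err)
	}
	defer client.Close()

	// Read the rendered manifest the pipeline wants committed.
	manifest, err := os.ReadFile("rendered/billing-service.yaml")
	if err != nil {
		log.Fatalf("read manifest: %v", err)
	}

	// Publish the target repo path as an attribute and the
	// base64-encoded file contents as the message body.
	topic := client.Topic("gitops-commits")
	res := topic.Publish(ctx, &pubsub.Message{
		Data: []byte(base64.StdEncoding.EncodeToString(manifest)),
		Attributes: map[string]string{
			"path": "cluster-prod/apps/billing-service.yaml",
		},
	})

	// Block until Pub/Sub acknowledges; the pipeline can then exit.
	id, err := res.Get(ctx)
	if err != nil {
		log.Fatalf("publish: %v", err)
	}
	fmt.Println("published message", id)
}
```

Because res.Get blocks until the ack, successive publishes from a single job reach the topic in order, a property that becomes useful later for dependency ordering.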

On the other side of the queue, we wrote a listener that would take messages from Pub/Sub one by one, clone the GitOps repo, add a commit and push it up to the remote. As this was done one message at a time, we had no Git conflicts at all.
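Here is a sketch of what that listener could look like, again with hypothetical names; the key detail is constraining the subscription to a single goroutine with one outstanding message, which is what serialises the commits:

```go
package main

import (
	"context"
	"encoding/base64"
	"fmt"
	"log"
	"os"
	"os/exec"
	"path/filepath"

	"cloud.google.com/go/pubsub"
)

// run executes a command inside dir, surfacing its output on failure.
func run(dir, name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Dir = dir
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("%s %v: %w\n%s", name, args, err, out)
	}
	return nil
}

// commit clones the repo fresh, writes one file and pushes one commit.
func commit(path string, contents []byte) error {
	dir := "/tmp/gitops-repo"
	os.RemoveAll(dir)
	// Hypothetical remote URL.
	if err := run("/tmp", "git", "clone", "git@example.com:org/gitops.git", dir); err != nil {
		return err
	}
	full := filepath.Join(dir, path)
	if err := os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
		return err
	}
	if err := os.WriteFile(full, contents, 0o644); err != nil {
		return err
	}
	if err := run(dir, "git", "add", path); err != nil {
		return err
	}
	// Committer identity here is a placeholder.
	if err := run(dir, "git", "-c", "user.name=pubsub2git",
		"-c", "user.email=pubsub2git@example.com",
		"commit", "-m", "deploy: "+path); err != nil {
		return err
	}
	return run(dir, "git", "push", "origin", "HEAD")
}

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-gcp-project")
	if err != nil {
		log.Fatalf("pubsub client: %v", err)
	}
	defer client.Close()

	sub := client.Subscription("gitops-commits-sub")
	// The crucial part: one goroutine, one outstanding message,
	// so commits are fully serialised and never conflict.
	sub.ReceiveSettings.NumGoroutines = 1
	sub.ReceiveSettings.MaxOutstandingMessages = 1

	err = sub.Receive(ctx, func(_ context.Context, m *pubsub.Message) {
		contents, err := base64.StdEncoding.DecodeString(string(m.Data))
		if err != nil {
			m.Nack()
			return
		}
		if err := commit(m.Attributes["path"], contents); err != nil {
			log.Printf("commit failed, will retry: %v", err)
			m.Nack()
			return
		}
		m.Ack()
	})
	if err != nil {
		log.Fatalf("receive: %v", err)
	}
}
```

Nacking on failure lets Pub/Sub redeliver the message, so a transient push failure is retried without any developer involvement.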

What was good about this?

Automation

We were able to build an automated pipeline with GitOps, so there was no need to commit manually, cutting the amount of toil required to deploy apps. We reaped the benefits of declarative state and a single-pane view without having to commit files by hand.

It also enabled us to extract all of the app declarations, re-generate their templates with platform improvements and then commit them back in one commit. We didn’t get a chance to orchestrate this bit, but it was on the cards.

Pipeline Speed

We were also able to split the pipeline. The jobs ran quickly because the queue was asynchronous: it was a case of fire and forget.

Security Isolation

Pushing deployment objects to a queue meant we could reduce the permissions the pipeline had. Before, we had to make sure the pipeline had kubectl and the google-cloud-sdk installed, along with the permissions to run kube commands. Now, all it needed was a small Go binary to push files to Pub/Sub, with another on the other side to read them.

Flexibility

At the time of development, ArgoCD had an issue where you couldn’t define the same state in two applications. This was problematic because we had split applications by namespace. We wanted to deploy the namespace alongside every app so that a brand-new app’s first deployment would work, but that broke down when there were two deployments of the same app, such as with feature branches.

We were able to easily add a new app that managed only the namespace and make sure its message went through the queue before any app was deployed. The serial Pub/Sub approach meant we could manage dependencies purely through message order, without needing any more complex checks, as the sketch below shows.
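A hedged sketch of that ordering, reusing the hypothetical names from earlier; because each publish blocks until it is acknowledged, and the consumer handles one message at a time, the namespace is always committed, and therefore synced, first:

```go
package main

import (
	"context"
	"encoding/base64"
	"log"
	"os"

	"cloud.google.com/go/pubsub"
)

// publishManifest blocks until Pub/Sub acks, so successive calls
// reach the topic (and the serialised consumer) in order.
func publishManifest(ctx context.Context, topic *pubsub.Topic, repoPath, localPath string) error {
	contents, err := os.ReadFile(localPath)
	if err != nil {
		return err
	}
	res := topic.Publish(ctx, &pubsub.Message{
		Data:       []byte(base64.StdEncoding.EncodeToString(contents)),
		Attributes: map[string]string{"path": repoPath},
	})
	_, err = res.Get(ctx)
	return err
}

func main() {
	ctx := context.Background()
	client, err := pubsub.NewClient(ctx, "my-gcp-project")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
	topic := client.Topic("gitops-commits")

	// Namespace first, then the app that lives in it: the serial
	// consumer commits them in exactly this order.
	if err := publishManifest(ctx, topic,
		"cluster-prod/namespaces/billing.yaml", "rendered/namespace.yaml"); err != nil {
		log.Fatal(err)
	}
	if err := publishManifest(ctx, topic,
		"cluster-prod/apps/billing-service.yaml", "rendered/billing-service.yaml"); err != nil {
		log.Fatal(err)
	}
}
```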

Microservice Pipelines

We were able to give the pipeline fewer things to do by porting parts of it to microservices like pubsub2git and ArgoCD. We also did that with other bits, like template generation. This meant our pipeline became a collection of service calls for each stage, which gave us improved iteration and testing patterns for our pipelines. It dropped development of a new type of pipeline from about a week to two days.

What could be improved?

Performance

The obvious issue was that operations were serialised. So if you had a number of deployments, you’d have to wait for the queue, then the commit and then the sync. The queue and commit would only take about 2–4 seconds, so the delay wasn’t too bad. Weirdly, however, this grew to over 30 seconds after a few months, probably because the growing number of commits lengthened the clone step, although this remained unverified.

Deployment Status Visibility

Developers had grown used to viewing the status of a deployment as part of their pipeline status. Given the asynchronous nature we introduced, the pipeline no longer had the deployment status. We could have resolved this with a function that continually checked the ArgoCD API for status updates, but this task was never completed.
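One shape that missing piece might have taken: a pipeline step that polls ArgoCD’s application endpoint until the app reports Synced and Healthy. The server address, app name and token below are hypothetical:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

// appStatus is a minimal slice of the Application object
// returned by the ArgoCD API.
type appStatus struct {
	Status struct {
		Sync struct {
			Status string `json:"status"`
		} `json:"sync"`
		Health struct {
			Status string `json:"status"`
		} `json:"health"`
	} `json:"status"`
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	// Hypothetical server, app name and token.
	url := "https://argocd.example.com/api/v1/applications/billing-service"
	token := os.Getenv("ARGOCD_TOKEN")

	for {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			fmt.Fprintln(os.Stderr, "request:", err)
			os.Exit(1)
		}
		req.Header.Set("Authorization", "Bearer "+token)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			fmt.Fprintln(os.Stderr, "poll:", err)
			os.Exit(1)
		}
		var app appStatus
		err = json.NewDecoder(resp.Body).Decode(&app)
		resp.Body.Close()
		if err != nil {
			fmt.Fprintln(os.Stderr, "decode:", err)
			os.Exit(1)
		}
		// Done once ArgoCD reports the app as synced and healthy.
		if app.Status.Sync.Status == "Synced" && app.Status.Health.Status == "Healthy" {
			fmt.Println("deployment complete")
			return
		}
		time.Sleep(5 * time.Second)
	}
}
```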

Scale

In a large organisation, the initial architecture wouldn’t scale. However, there were multiple components we could fan out, such as the number of Git repos we used and, therefore, the number of pubsub2git processes, to ease any growing pains.

Complexity

This was, overall, a more complex setup with more moving parts: there were more services to build and manage.

Validations

As the manifests would no longer go through a peer-review process, we’d need to make sure there was no scope for injecting sub-optimal YAML that could generate issues or cause security exploits.

We weren’t exposed to this because we had a Jsonnet templating mechanism, so we were able to run kubeval and unit test the templates, but some consideration is required before automating GitOps without those validations.

What would I do differently?

Git is a good first step for storing Kubernetes objects. However, I don’t feel it’s optimal if you want a single-pane view of what’s on your platform. If I were to do this another way, I’d love to implement ArgoCD with something like Cloud Storage as the storage mechanism.

It has object versioning, supports multiple writers and is resilient. There would probably be issues in picking up updates, but maybe that could be solved with webhooks to initiate a sync? Food for thought.
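To give a flavour, here is a rough sketch of what the write path could look like against GCS using the cloud.google.com/go/storage client; the bucket name and object path are hypothetical, and the bucket is assumed to have object versioning enabled:

```go
package main

import (
	"context"
	"log"
	"os"

	"cloud.google.com/go/storage"
)

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatalf("storage client: %v", err)
	}
	defer client.Close()

	manifest, err := os.ReadFile("rendered/billing-service.yaml")
	if err != nil {
		log.Fatalf("read manifest: %v", err)
	}

	// With object versioning enabled on the bucket, each write
	// creates a new generation, so history is retained, and
	// concurrent writers to different objects never conflict.
	obj := client.Bucket("my-gitops-state").Object("cluster-prod/apps/billing-service.yaml")
	w := obj.NewWriter(ctx)
	if _, err := w.Write(manifest); err != nil {
		log.Fatalf("write: %v", err)
	}
	if err := w.Close(); err != nil {
		log.Fatalf("close: %v", err)
	}
}
```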

We also encrypted all of our manifests in the GitOps repo using SOPS. This meant the diff functionality was useless to us anyway, so there was less of a barrier to using something like GCS.

This would also eliminate the performance penalty of running Git operations over a large number of commits. I think the 30-second figure came at around 11,000 commits.

Conclusion

This methodology was quite cool. It worked pretty well despite a missing feature (deployment status updates in the pipeline) and a performance problem (30s per push), and it enabled us to look into creating an entire cluster from scratch and syncing all the apps in one go.

I might expand on the real code in future if I can, but I wanted to get the idea around the architecture out first.

Update: Acknowledgements

Shout out to Dan P and Icelyn Jennings for help with getting this up and running. It was Icelyn’s experience that helped us go down the GitOps route and avoid some pitfalls while Dan helped with a lot of the development around the tooling for pubsub2git and encryption and all sorts of other pieces.
