Modern software teams leverage continuous delivery to gain control over their software. The growing set of practices of testing, automation and CI/CD puts every team member on equal footing, and empowers a team to move fast and with confidence. This effort requires commitment, and I am always happy when I see a team that understands this.
I also see that Ops teams have adopted this model. DevOps teams deploy and manage complex multi-cloud environments, and CI/CD workflows are becoming the norm. If we can automate the build of an app, we can automate its infrastructure; we can script Day 2 operations: alerting, upgrades, runbooks, security audits, etc.
All of that is well and good, but I still don't like Jenkins. We took time last summer to find alternatives, and while lots of systems are better-looking, faster, and leaner, none of them is strikingly different.
Spinnaker stood out for its prescriptive management of EC2 instances. It seems to work well, but it doesn't handle retrofits well. Inside a Kubernetes cluster, various GitOps tools fulfill their role with a vengeance: Flux and Flagger feel like pure magic, albeit with a narrow and focused scope.
Build tools have clearly evolved, although there's nothing that can quite match the sheer backlog of Jenkins plugins. What won us over is the Docker-based, yaml-driven approach made popular by tools like Circle or Travis, and implemented more recently by GitLab. So we implemented it in Jenkins using the wonderful CasC plugin.
The problem with Jenkins, and by extension with other tools that follow its master-slave model, is that it's responsible for everything. The butler holds all the keys to the mansion, so to speak. It knows all the jobs, it keeps all the credentials, it enforces permissions, and implements and runs all the logic needed to make your team happy and productive. Do you see a problem with that? If you do, please bear with me.
In a world of microservices and decoupled logic, to have the singular Master Jenkins that sneers at you in Groovy is an issue. Jenkins is a gatekeeper: to get to the promised land of developer bliss, you'll need to pass through him; and most team members don't have enough street cred for the devious butler.
"But surely mister", I hear you say, "we have solutions to these problems". Indeed we do. Other tools help you more, and you can get started faster. Even Jenkins can be tamed. If you get him drunk enough on yaml, he'll make friends with git and get out of your way. Groovy can be learned, libraries can be written. Still, solutions are particular to your CI/CD of choice. If you've hired Jenkins, then you're stuck with its butler logic, for better or worse.
Case in point: my deployment script may be sparkly and unit-tested, but I still need to write a bunch of tangled bash and pipeline-groovy to make it run.
Better days ahead
My vision for a CI/CD toolset is an SDK for API-driven pipelines. Yup. Let's unpack that.
If you have a complex pipeline, chances are it has many parameters, runs steps in parallel, does its own stop/go logic, and more. Why would you want to encode this in yaml or a weird DSL? You can have the power of a full-blown language at your fingertips. Does your team write Python? Java? Go? Write your pipeline in code: do complex builds, deploy on multiple clouds, make infrastructure on demand, re-use code. This is where an SDK comes in: you need to tie your custom logic together in a way that makes sense for CI/CD. You also need to get to git and build a docker image easily. These are all part of the SDK.
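To make that concrete, here's a minimal Python sketch of a pipeline written as plain code. Every name here is mine and purely illustrative; the point is that parallelism, parameters and stop/go logic become ordinary language constructs:

```python
# A pipeline as plain code: parallel fan-out is just a thread pool,
# parameters are just function arguments. All names are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def build(target: str) -> str:
    # Stand-in for a real build step.
    return f"built-{target}"

def deploy(artifact: str, cloud: str) -> str:
    # Stand-in for a real deploy step.
    return f"{artifact}@{cloud}"

def pipeline(version: str, clouds: list) -> list:
    artifact = build(version)
    # Fan out deploys in parallel, like a "parallel" stage in a DSL,
    # except it's testable, reusable code.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda c: deploy(artifact, c), clouds))
```

Calling `pipeline("3.0", ["aws", "gcp"])` runs both deploys concurrently and returns the results in order; adding a condition, a retry, or a new cloud is a code change, not a DSL fight.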
An API-driven microservice is a pattern that has gained popularity in recent times. The idea is that you break your system down into parts that each do one thing, and you expose that thing through an API. My vision for a pipeline is a gRPC microservice that serves requests for jobs of a single type. Easy, right? You write your logic, compile and run, and you can then trigger any job execution of that pipeline with an API call.
This pipeline can be a build-and-test: you can trigger a new release, and builds can run in parallel, or sequential, based on what your code decides (e.g. hardware resources); it can be a deployment - you can trigger a deploy of a new version, and your pipeline can decide in which region to deploy it, test it, and finally promote it on other regions; surely you don't have multi-region in QA: why not re-use that code, but add a simple "if"?
Sure, all of this can be done with Jenkins, or with any other platform. But the questions I ask myself are:
- is our CI/CD code maintainable?
- is our CI/CD code easy to extend or re-use?
- is our CI/CD code portable? Can I run it somewhere else today?
- is our CI/CD pipeline testable? Even with unit tests?
- can our CI/CD platform scale?
If you ask these questions of your app toolchain, the answers are easy. If you ask them of your CI/CD, all of a sudden you're scratching your head.
The story of a failed pipeline run
I want to illustrate how this would work. Let's say I want to build a pipeline that deploys a new app version in 10 different Kubernetes clusters. I have 1 in QA and 4 in stage, and we multiply that by two because we're multi-cloud. To make it a bit harder, let's make it fail at the end. (Oh! the horror!)
(I'm challenging myself to think of a K8s scenario, since: 1. it's poised to take over the world 2. it makes deploys easy and standard and 3. not all k8s use-cases are covered by existing solutions - at least not yet)
The pipeline microservice registers itself to the master as a gRPC service. The master uses gRPC reflection to see that the service exposes a function: DeployApp(). This function logs in to my cloud providers and checks live traffic. It can decide which regions to deploy to in Stage based on two strategies: lowest-traffic and random.
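In code, that strategy choice is trivial. A hedged sketch (the strategy names follow the ones above; the traffic data shape is my own invention):

```python
# Pick a Stage region given per-region traffic numbers.
# Strategies: "lowest-traffic" and "random", as described above.
import random

def pick_stage_region(traffic: dict, strategy: str) -> str:
    # traffic maps region name -> current load (illustrative units).
    if strategy == "lowest-traffic":
        return min(traffic, key=traffic.get)
    if strategy == "random":
        return random.choice(list(traffic))
    raise ValueError(f"unknown strategy: {strategy}")
```

Because this is plain code, swapping in a third strategy (say, cheapest region) is one more branch and one more unit test, not a plugin.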
We have a cluster microservice in each cluster. Its role is to deploy our app inside the cluster. These microservices all register to the master and expose the DeployInCluster() function.
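As a toy illustration of this registration model (skipping real gRPC and reflection entirely; every name here is hypothetical), the master can be thought of as little more than a lookup table of registered job functions:

```python
# Toy stand-in for the master: services register one job function each,
# and running a job is a lookup plus a call. Real gRPC registration and
# reflection are out of scope for this sketch.
class Master:
    def __init__(self):
        self.services = {}

    def register(self, name, fn):
        # A pipeline or cluster microservice announces itself.
        self.services[name] = fn

    def run_job(self, name, **params):
        # A job execution is just a dispatched call with parameters.
        return self.services[name](**params)
```

The interesting part is what the master does *not* do: it holds no pipeline logic, only the registry, access control and status reporting.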
I login to the Master's UI. I create a new Job Definition. This requires me to choose a remote function, and parameter and credential types.
The function is: DeployApp()
The parameters are: App Version, Region strategy
The credentials are: Cloud credentials
I then setup the Credentials and issue a job execution. I choose App Version 3.0, Region strategy lowest-traffic, and hit Run.
The master checks to see if my pipeline microservice is up. If yes, it runs DeployApp() and waits for results.
DeployApp() first starts 2 parallel builds in QA. Remember that we have one cluster in each cloud. To do that, it tells the master to fetch a list of cluster microservices that match the "QA" label. It then tells the master to run DeployInCluster(AppVersion = "3.0") on each service, and return the result.
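The QA fan-out could look like this in plain Python. The master lookup is mocked with a simple callback, and the result shape is invented, so treat it as a sketch of the flow rather than an implementation:

```python
# Fan out DeployInCluster() to every cluster labelled "QA", in parallel.
from concurrent.futures import ThreadPoolExecutor

def deploy_in_cluster(cluster: str, app_version: str) -> dict:
    # Stand-in for the remote DeployInCluster() call relayed by the master.
    return {"cluster": cluster, "version": app_version, "ok": True}

def deploy_to_qa(clusters_by_label, app_version: str) -> list:
    # clusters_by_label stands in for asking the master for services
    # matching a label ("QA" here).
    qa = clusters_by_label("QA")
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda c: deploy_in_cluster(c, app_version), qa))
```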
DeployInCluster(AppVersion = "3.0") rolls out the version fast: it kills any old version and brings up the new version. It runs basic tests. It scales the app in preparation for the incoming tests. It then reports back to the master, which passes the results to the initial caller, our DeployApp() function.
When the results of the deploy are in, our DeployApp() runs integration and load tests. It triggers two external systems to do this. When it is satisfied, it's time for a stage rollout.
DeployApp() uses its cloud credentials, already received from the master, to see which regions have the lowest traffic. It selects one region per cloud, and then asks the Master to fetch a pair of cluster microservices that match this region selection. Then it asks the Master to run DeployInCluster(AppVersion = "3.0").
The DeployInCluster() functions in stage are a bit different. They do a zero-downtime deploy and switch the traffic over to the new app version in steps, checking in-cluster metrics to see if the app is behaving properly (e.g. an increase in 50x errors for a webapp/API) before finally switching all the traffic to the new version. This all happens inside the k8s cluster, to which neither the Master nor our DeployApp() function has any access.
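A sketch of that stepwise switch-over. The 5% error threshold and the step percentages are assumptions of mine, not anything prescribed; the point is that the promote-or-stop decision is a plain loop over in-cluster metrics:

```python
# Shift traffic to the new version in increments; stop if the error
# rate crosses a threshold. Numbers are illustrative.
ERROR_THRESHOLD = 0.05  # assumed 5% ceiling on 5xx error rate

def shift_traffic(steps, error_rate_at):
    # steps: increasing traffic percentages, e.g. [10, 50, 100].
    # error_rate_at(pct): reads the in-cluster error rate at that step.
    current = 0
    for pct in steps:
        if error_rate_at(pct) > ERROR_THRESHOLD:
            # Misbehaving: hold at the last known-good percentage.
            return {"promoted": False, "stopped_at": current}
        current = pct
    return {"promoted": current == 100, "stopped_at": current}
```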
When one of the two cloud regions fails, the master tells DeployApp() the bad news.
I can see this instantly in the Master UI: DeployApp() reports status continually, and the Master reports in real time any other jobs that DeployApp() starts, together with their status. DeployApp() sees the failure as well and issues a rollback call to the first cloud, via a function called DeployInCloudRollback(). On the failing cluster, this function is not needed, as the rollback is done automatically by the in-cluster process.
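The rollback decision itself is a few lines of ordinary code. An illustrative sketch (the result shape and function name follow the story above, but are otherwise invented):

```python
# Decide which clouds need an explicit rollback after a stage deploy.
# The failing cluster rolls itself back in-cluster, so only the clouds
# that *succeeded* get a DeployInCloudRollback() call.
def clouds_to_roll_back(results: dict) -> list:
    # results maps cloud name -> True/False deploy success.
    if all(results.values()):
        return []  # everything succeeded, nothing to undo
    return [cloud for cloud, ok in results.items() if ok]
```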
What was that all about?
We can do this with current systems. The advantages to my approach are:
- the control logic for the whole process is written in code, in whatever language you are familiar with
- so it can be tested, scaled, deployed anywhere with a minimum of fuss - it's a gRPC service, nothing more
- devs can own it and work on it with their existing toolchain
- same goes for the in-cluster components
- the Master is lean and dumb, but still plays a crucial role in offering access controls and reporting status
- we can distribute our CI/CD pipeline any way we like
- and we can scale it: the DeployInCluster() functions are the same codebase everywhere
- fewer credentials: build locality means we can use current access levels without managing complex credentials, their access and their rotation (e.g. DeployApp() can live in AWS and use IAM w/o creds to check ELB traffic; you could have another one in a different cloud, doing the same thing, and reporting back)
I view this system as generic. If you can write it as a gRPC server, then you can tie it into the Master and run it. The Master will issue creds, job executions, and take care of reporting status and storing build results.
CI/CD as API-driven SDKs - does it make sense?