A New Recipe for Idempotent Cloud Deployments

This post describes an idea that materialized after many, many talks with Nathan and Hillel.

I’ve seen many deployment workflows and scripts, most of which use GitHub Actions or equivalents, Docker for build stages, and Terraform/CloudFormation/custom scripts for deployment.

Once the kinks are sorted out, these work really well.

The Infrastructure as Code movement is thriving, DevOps is better, and we’re happier.

Except that every time someone (usually me) pushes a weird enough merge commit to my GitHub repo, the action that builds and deploys my lambdas is triggered, the Docker build comes out just a tiny bit different this time, and everything breaks.

Docker is not really reproducible

Unless you’re distroless and in control of everything (like Nathan is), you already know what I’m saying.

Usually it’s apt-get that kills reproducibility, but it’s hardly the only danger.

Maybe always use the same machine?

One way to solve reproducibility is to always use the same machine: the one on which you’ve already run the Docker builds that got deployed.

Docker will reuse cached layers on subsequent builds, so the build artifacts will stay the same.

But you want to use GitHub Actions – so it’s always a new environment.

Replacing Reproducibility with Idempotency

I have come to terms with the fact that I cannot reproduce builds.

Not with the current technology stack I’m familiar with(*).

Instead, I’ll settle for not building a second time whatever I have already built, and not deploying a second time whatever I have already deployed.

The action of build and deploy needs to be idempotent.

GitHub Actions workflows are sometimes presented as idempotent, in that you can configure a workflow to trigger only when certain parts of your code change.

But since the trigger is based on the commit tree – and not the code itself – the workflow can still trigger even if the code didn’t change.

Moreover, if a deployment is triggered for one artifact, usually this cascades into a build and deploy of all artifacts, even if the others didn’t change.

(*) There’s a package manager called Nix, and an OS called NixOS, that maybe completely solve reproducible idempotent builds and deployments, but I haven’t understood them yet.

The new recipe

I’ll present the solution I’ve started using, with the concrete choices I made, though there’s a clear general principle that can be extracted and translated into other environments.

1. Put build stages in a DVC DAG

What is DVC?

DVC is short for Data Version Control; it is a Python package for keeping track of data, machine learning experiments, and models.

Putting aside the original intent of DVC, it is perfect for idempotent builds:

  1. You define build stages
  2. Each stage spells out explicitly its dependencies – all the code it uses
  3. Each stage writes explicitly its outputs – built artifacts
  4. On rerunning, only changed stages are rerun – idempotency.

Note that although DVC is a Python package, none of your code needs to be in Python, and nothing in my recipe requires Python. DVC is just a neat tool for pipeline and remote storage management.

Wait, why not make?

Yes, you can define DAGs of builds using make, and in fact this is very common.

But:

  1. Running make a second time is idempotent only if you’re on the same machine, which you’re not, since you’re on GitHub Actions.
  2. The artifacts are not automatically stored anywhere.

What does DVC look like?

Here is a dvc.yaml file defining a single build stage:

stages:
  myapp:
    deps:
      - myapp/
      - buildmyapp.sh
    cmd: ./buildmyapp.sh
    outs:
      - "artifacts/myapp.zip"

In DVC terminology, this is a pipeline with a single stage, which depends on everything in the myapp directory and also on the script that builds the app.

The stage also declares a single output, a zip file that will go in the artifacts directory.

2. Define a DVC remote

Similar to a git remote, you tell DVC where you want data files pushed to.

Generalizing the original intent of the authors of DVC, instead of pushing raw data of some kind, or some machine learning model, we will push the built zip file that is our application.

Personally, I defined my remote to be on S3.
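
For concreteness, this is roughly what defining an S3 remote looks like with the DVC CLI (the remote name and bucket path here are made up for illustration):

dvc remote add -d myremote s3://my-bucket/dvc-artifacts

The -d flag makes this the default remote, so dvc push and dvc pull will use it without extra arguments.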

We’ll soon talk about when to actually push the artifacts to the remote.

3. Put the deployment as a stage

To make deployment contingent on the code actually changing, all we have to do is use the DAG we have, in this case dvc.yaml.

Updating the dvc.yaml pipeline file above, it looks like this:

stages:
  myapp:
    deps:
      - myapp/
      - buildmyapp.sh
    cmd: ./buildmyapp.sh
    outs:
      - "artifacts/myapp.zip"
  deploy:
    deps:
      - artifacts/
      - deploy.sh
    cmd: ./deploy.sh

The deploy stage depends on the artifacts directory, so any time the first stage modifies the artifacts, the deploy stage will also run.

But the deploy stage can also run if only the deploy script is changed.

This is where we can already see some idempotency: if we don’t change myapp’s code, the artifact won’t change, and so its deployment won’t be affected – even if GitHub decides to run the workflow that runs DVC.

4. The GitHub Workflow

The GitHub workflow needs to run these commands:

dvc pull \
 && dvc repro \
 && dvc push \
 && git add dvc.lock \
 && git commit -m "updated dvc.lock" \
 && git push

Of course, the environment in which these commands run needs to have DVC installed, as well as credentials to push to both the DVC and git remotes.
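
To make this concrete, here is a minimal sketch of what such a workflow might look like, assuming the S3 remote from above and AWS credentials stored as repository secrets. All names here – the secret names, the branch, the bot identity – are hypothetical, and installing whatever deploy.sh itself needs (e.g. CDK) is omitted:

name: build-and-deploy
on:
  push:
    branches: [main]

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
        with:
          # a token with push rights; the default GITHUB_TOKEN can also push,
          # but pushes made with it will not trigger another workflow run
          token: ${{ secrets.REPO_PUSH_TOKEN }}
      - uses: actions/setup-python@v4
        with:
          python-version: "3.11"
      - run: pip install "dvc[s3]"
      - env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          git config user.name "ci-bot"
          git config user.email "ci-bot@users.noreply.github.com"
          # when nothing changed there is nothing to commit, so you may want
          # to guard the last three commands with a check on git status
          dvc pull \
            && dvc repro \
            && dvc push \
            && git add dvc.lock \
            && git commit -m "updated dvc.lock" \
            && git push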

dvc pull

This pulls the previously built artifacts from the DVC remote, in my case from S3.

If we changed the deploy script, but not myapp, we still need the myapp zip artifact for the deploy script to work correctly.

Since we’re pulling exactly the zip file that was deployed last, we know for certain that we aren’t changing myapp.

dvc repro

This runs the stages in the pipeline defined in dvc.yaml.

dvc push

This pushes the created artifacts to the DVC remote.

The point here is that on subsequent builds, even if they are on different machines, we’ll still get the last built artifact from dvc pull.

This combination of dvc push and dvc pull is one of the key differences from using make.

git add+commit+push

If any of our code changed, whether it’s the myapp code, the deployment script, or both, the hashes have changed.

Hashes of the pipeline’s dependencies and outputs are stored in a file called dvc.lock.
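
For reference, the dvc.lock for the pipeline above might look roughly like this (only the myapp stage is shown; hashes, sizes, and file counts are made up for illustration, and the exact fields can vary between DVC versions):

schema: '2.0'
stages:
  myapp:
    cmd: ./buildmyapp.sh
    deps:
    - path: buildmyapp.sh
      md5: 3f2c...
      size: 287
    - path: myapp/
      md5: 9c1d....dir   # directories are hashed as a whole, with a .dir suffix
      size: 10240
      nfiles: 12
    outs:
    - path: artifacts/myapp.zip
      md5: b0d7...
      size: 204800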

We push the changes to origin.

Is it ok to push from a workflow action? Won’t we get an infinite loop?

This is where idempotency is critical.

The push of dvc.lock will indeed trigger another run of the workflow.

But this time, since the code hasn’t changed since the last run, all hashes will be the same, and DVC won’t run any of the pipeline stages.

This means there won’t be anything to git push, and we won’t get a third run of the workflow.

Versioning

Questions of identity go back thousands of years.

If we change one line of code, then another, and then another, until all code has been refactored, is it still the same code?

We’ll go with no.

Code is exactly the same only if it really is exactly the same.

Assuming our hash function is effectively collision-free, we’ll also allow the hash of the code to identify it.

Given this, in order to know the version of a deployed artifact, we just need to know its hash.

Version in Logs

The most common type of versioning requirement is that the version be printed in logs of systems running the artifact.

In my own current projects the artifacts are zip files that are deployed to AWS Lambda, and these zips contain Python packages.

Here is the top of my lambda function:

import os

def lambda_handler(event, context):
    print(f"CODE_ZIP_HASH='{os.environ.get('CODE_ZIP_HASH', 'n/a')}'")

This prints (to cloudwatch) the environment variable CODE_ZIP_HASH.

I populate this variable through the definition of the lambda in the deployment script.

Specifically, my deployment script uses AWS CDK, also in Python, and the relevant lines are these:

    code_zip_hash = hashlib.md5(open(zip_path, 'rb').read()).hexdigest()

    func = aws_lambda.Function(
        stack,
        "myapp",
        runtime=aws_lambda.Runtime.PYTHON_3_8,
        code=aws_lambda.Code.from_asset(zip_path),
        handler='lambda_function.lambda_handler',
        description="Best app ever!",
        timeout=timeout,
        environment={'CODE_ZIP_HASH': code_zip_hash}
    )

There’s an additional layer of awesomeness here: DVC also uses md5 as its hashing algorithm.

Say the md5 hash of an artifact at some point was “b0d7…”, and we see it in some log.

Even if the repo has since moved on, we can look for all commits whose dvc.lock file has this hash “b0d7…”.

Though getting the commits is helpful, remember that builds aren’t reproducible, so if we want the actual artifact corresponding to “b0d7…” we need something else.

This is where DVC’s storage model comes into play: it stores files exactly by their hash. In fact, it’s pretty much the same way git stores blobs.

So getting this old artifact is as simple as downloading /path/to/remote/b0/d7...
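
With an S3 remote like the hypothetical one defined earlier, that download can be as simple as this (the exact layout under the remote prefix can differ between DVC versions):

aws s3 cp s3://my-bucket/dvc-artifacts/b0/d7... ./myapp.zip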

What about logging the commit itself?

Of course you can do it.

You can inject the current git commit into the artifact build stage.
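
A minimal sketch of how that might look inside the buildmyapp.sh script from above (the staging directory and BUILD_COMMIT file name are hypothetical; the commit is written into a copy rather than into myapp/ itself, so the stage’s tracked dependencies stay untouched):

# inside buildmyapp.sh (sketch): stage the sources, record the commit, then zip
rm -rf build && mkdir -p build artifacts
cp -r myapp build/
git rev-parse HEAD > build/myapp/BUILD_COMMIT
(cd build && zip -r ../artifacts/myapp.zip myapp)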

Since git commits aren’t unique identifiers of artifacts, I chose not to do this in my own project.

Version in Deployment

Having the version printed in logs allows us to look back at what has already run.

But what if I want to know what will run if I trigger my lambda right now?

In my project, I wanted my lambda to have its unique identifier right in its description, so just by looking in AWS Console I can know what will run next.

All I had to do was update my lambda definition above to this:

    code_zip_hash = hashlib.md5(open(zip_path, 'rb').read()).hexdigest()

    func = aws_lambda.Function(
        stack,
        "myapp",
        runtime=aws_lambda.Runtime.PYTHON_3_8,
        code=aws_lambda.Code.from_asset(zip_path),
        handler='lambda_function.lambda_handler',
        description=f'Best app ever! [CODE_ZIP_HASH={code_zip_hash}]',  # <-- modified line
        timeout=timeout,
        environment={'CODE_ZIP_HASH': code_zip_hash}
    )

Recap

In order to have idempotent artifact builds and cloud deployments:

  1. Put builds and deployment as stages in a DVC pipeline.
  2. Define all the dependencies and outputs correctly.
  3. Use a DVC remote such as S3.
  4. Use: dvc pull && dvc repro && dvc push && git add && git commit && git push.
  5. It’s also easy to add identifiers and commit hashes to logs and descriptions.

The benefits are:

  1. Changing one artifact’s code does not force rebuilding other artifacts, even if you’re building on a new VM every time.
  2. Changing only the deployment script won’t build any artifacts at all.
  3. You have an artifact repository that just works.
  4. Your git history contains the hashes of all built artifacts.
  5. You can look up any artifact using its hash.

Alternatives

The usage of DVC in the recipe is pretty basic, and you could write the infrastructure yourself instead.

It’s mostly hashing things, saving to files, checking files, and hashing some more.

Nathan pointed me to git-annex, a tool that also saves files remotely in a git-like structure, referencing files by their hashes. You can replace that part of my recipe with git-annex.

There are also other pipeline tools out there, the most familiar being Airflow, Luigi, and Prefect.

These don’t understand caching, so I’m not sure how to replace DVC’s pipeline with theirs.

There is one more pipeline tool called Dagster, which has an experimental caching option. This can indeed replace DVC’s pipeline in the recipe.

Finally, Nix should get one more mention.

Nix is supposed to solve all reproducibility and idempotency problems by saving everything by hash. Really everything. From the libraries you use, to glibc itself. I’m still looking into this.

2 thoughts on “A New Recipe for Idempotent Cloud Deployments”

  1. Very interesting use!

    I’ll note that you shouldn’t need to worry about an infinite loop (pushing a commit) from GitHub Actions unless you are using a Personal Access Token. If you push or trigger any event with the default GHA token it is designed to NOT trigger events, for the exact reason you mentioned (GitHub doesn’t want to deal with users making infinite loops).

    • Thanks 🙂
      Nice, I didn’t know that!
      And you’re spot on, I indeed use a personal access token (I don’t like running on GitHub resources for different reasons, so my deployments are run from ECS).
