Dockerization Part 2: Deploying

Now that we have containers, we need to push them to our sub-prod environments so they can be tested. Bear with me; this is where things get a little complicated.

Docker Setup

Most people take the easy way out when they move to docker: they ship their containers to the cloud and let someone else manage the installation, upgrades, and maintenance on the docker hosts. We don’t do things the easy way around these parts, though, so we have our own server farm: a series of VMs in our datacenter. Everything below the VM is maintained by another team; my team is responsible for the software layer of the VM, and the containers that run on top. We have a handful of servers in our sub-prod environments, and then a handful more in our various production DMZs.

For management, most people seem to choose Kubernetes, but again, we don’t do things the easy way around here, so we went with a less popular product called Rancher. Now, Rancher is a management interface that can sit on top of a number of underlying technologies, including Kubernetes, but we chose to use their house-brand management system, called Cattle, instead. They were nice enough to give us a bunch of training in Docker, including the advice that forms the basis for their theme: if servers were pets, carefully maintained and fed over the years, containers should be like cattle, slaughtered and replaced as soon as they seem to be ill so they don’t infect the whole herd.

Rancher is a really great tool if you’re working in the GUI. It has the concept of an Environment (which we use to separate dev from QA from demo), which spans one or more Hosts (the servers that run Docker and manage the containers). Inside the Environment are Stacks, each a named collection of related containers. It also handles a lot of the networking between containers: it comes with its own DNS for the internal container network, so you can just resolve Stackname/ContainerName to find a given container in your Environment. You can upload a docker-compose.yml file to create a stack if you’re using Compose, and the extra metadata Rancher uses can be stored in a rancher-compose.yml that can also be uploaded when you make a stack.

Rancher running on my local machine, showing a project I have in progress

Deployment

Manual deployment is super easy in Rancher: create a new stack, add services, paste in the container name from our build step, and let it handle everything else. Moving between environments manually once it works in dev is also easy: download the compose files, then upload them into the next environment. But we’re doing CI/CD, and the developers are constantly asking how they can speed up their release schedule. How do we do this automatically?

There’s two tools that come with Rancher that can help here. One is the extensive API; pretty much everything you can do in the GUI can be done via the JSON-based REST API. The other is the pair of command-line tools they produce: Rancher-compose and Rancher CLI. Since I was also trying to release quickly, I used the API for my initial round of deployment scripts; in a later post, I’ll talk through how I’ve begun to convert to using the CLI commands instead, as I feel they’re faster and cleaner.

For Bamboo, I needed something that could run in a Deploy Project and update the stack in a given environment. I decided to write a Node.js script, because when all I have is a node-shaped hammer every build script becomes a nail 😉 (Actually, it was so our Node developers could read the script themselves). I didn’t do much special here, just your standard API integration using a promise-based architecture; however, this is a chunk of a bigger library I decided to write around Rancher, so you’ll see a lot of config options:

// Excerpted from the library. Assumption: it wraps the promise-based request-promise
// package; opts, log, and update come from the library's shared config and logging helpers.
const request = require('request-promise');

function findStack(stackName, environment) {
    return request({
        uri: `${opts.url}/v2-beta/projects/${opts.projectIDs[environment]}/stacks?name=${stackName}`,
        auth: {
            username: opts.auth[environment].key,
            password: opts.auth[environment].secret
        },
        json: true // Automatically parses the JSON response
    });
}

function getContainerInfo(environment, stackName, containerName) {
    log(`Getting container info for ${containerName}`);
    return findStack(stackName, environment)
        .then((body) => request({
            method: 'GET',
            uri: `${opts.url}/v1/services/?environmentId=${body.data[0].id}&name=${containerName}`,
            auth: {
                username: opts.auth[environment].key,
                password: opts.auth[environment].secret
            },
            json: true // Automatically parses the JSON response
        }));
}

function performAction(serviceId, action, environment, launchConfig, stackName) {
    update(`Performing action ${action} on service ${serviceId}`, 'info', stackName);
    return request({
        method: 'POST',
        uri: `${opts.url}/v1/services/${serviceId}/?action=${action}`,
        body: {
            'inServiceStrategy': {
                'batchSize': 1,
                'intervalMillis': 2000,
                'startFirst': true,
                'launchConfig': launchConfig
            }
        },
        auth: {
            username: opts.auth[environment].key,
            password: opts.auth[environment].secret
        },
        json: true
    });
}

    performServiceUpgrade: function (stackName, containerName, environment, image) {
        update(`Upgrading ${containerName} in stack ${stackName} in ${environment} to image ${image}`, 'info', stackName, environment);
        return getContainerInfo(environment, stackName, containerName).then((body) => {
            if (body.data.length <= 0) {
                throw new Error(`Could not find service ${containerName} in stack ${stackName} in ${environment}`);
            }
            let serviceId = body.data[0].id;
            let launchConfig = body.data[0].launchConfig;
            launchConfig.imageUuid = image;

            return performAction(serviceId, 'upgrade', environment, launchConfig, stackName)
            .catch((err) => {
                if ((err.statusCode == 422 || err.status == 422) && opts.retries.on422) {
                    log('Detected invalid state. Rolling back to retry.', 'info', stackName);
                    return performAction(serviceId, 'rollback', environment, launchConfig, stackName)
                        .then(() => this.waitForActionComplete(stackName, containerName, environment, 'active'))
                        .then(() => performAction(serviceId, 'upgrade', environment, launchConfig, stackName));
                } else {
                    log('Detected error condition. Aborting', 'error', stackName);
                    throw err;
                }
            })
            .then(() => this.waitForActionComplete(stackName, containerName, environment, 'upgraded'))
            .then(() => serviceId);
        });
    }

Two lines in performServiceUpgrade are worth calling out: the ones where I capture the existing launchConfig and swap in the new image. Rancher lets you update anything about a service using the same endpoint, which is kind of nice and kind of rough: I need to specify every single attribute of the service, or it’ll assume I meant to blank out the setting (rather than assuming I meant to leave it unchanged). To make this easier, I captured the existing launch configuration, changed just the container image, and sent the whole thing back.

Upgrading a service in Rancher is a two-step process: first, you upgrade, which launches a copy of the new container for every copy of the existing container, and then you “finish” the upgrade, which removes the old containers. This is so that if there’s a problem with the new container, you can issue a “rollback” action, which turns the old containers back on and removes the new ones — much faster than trying to pull a fresh copy of the old container back. However, this means sometimes you’ll be trying to upgrade while it’s in an “upgraded” state, waiting for you to finish or roll back. When that happens, Rancher issues a status code 422. My library optionally rolls back and issues the upgrade action again if it encounters this state.
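To make that two-step flow concrete, here’s roughly how the “finish” and “rollback” steps can be driven through the same performAction helper from above. This is a sketch rather than part of the library as shown; in particular, the “finishupgrade” action name is an assumption worth verifying against your Rancher version’s API before relying on it.

// Sketch: completing or abandoning an in-flight upgrade via the same helper.
// Assumption: the v1 API spells the finishing action 'finishupgrade'; verify for your version.
function finishServiceUpgrade(stackName, containerName, environment) {
    return getContainerInfo(environment, stackName, containerName).then((body) => {
        let service = body.data[0];
        // Removes the old containers, leaving only the upgraded ones running
        return performAction(service.id, 'finishupgrade', environment, service.launchConfig, stackName);
    });
}

function rollbackServiceUpgrade(stackName, containerName, environment) {
    return getContainerInfo(environment, stackName, containerName).then((body) => {
        let service = body.data[0];
        // Turns the old containers back on and removes the new ones
        return performAction(service.id, 'rollback', environment, service.launchConfig, stackName);
    });
}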

The hardest part was figuring out how to tell when Rancher was done upgrading. Some of our images are huge, particularly the ones that contain monoliths we’re still in the process of breaking up; it can take several minutes for these containers to download and start up. Eventually, I settled on a polling-based strategy:

waitForActionComplete: function(stackName, containerName, environment, desiredState) {
    update('Waiting for upgrade to complete', 'info', stackName, environment);
    return new Promise((resolve, reject) => {
        // Poll until the service reaches the desired state or we run out of retries
        let retries = opts.retries.actionComplete;
        function checkState() {
            getContainerInfo(environment, stackName, containerName).then((body) => {
                let container = body.data[0];
                log('Current state: ' + container.state);

                // Check if the action is done
                if (container.state == desiredState) {
                    log('Action complete');
                    return resolve();
                } else {
                    retries--;
                    if (retries < 0) {
                        return reject(new Error('Timed out waiting for action to complete'));
                    }
                    log(`${retries} left, running again`);
                    return setTimeout(checkState, 1000);
                }
            })
            .catch(reject); // Fail the wait instead of hanging if a poll request errors out
        }
        setTimeout(checkState, 500);
    });
}

This will keep running the checkState function until either the container reaches the desired state or it runs out of retries (set in the library’s config). I’ve had to tune the number of retries several times; right now, for our production deploy, it’s something outrageous like 600.
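For reference, here’s roughly the shape of the config object (opts) that the functions above read from. Every field shown is one the snippets actually reference; the values themselves are made up.

// Hypothetical config; field names match what the library reads from opts above
let opts = {
    url: 'https://rancher.internal:8080',   // Rancher server base URL
    projectIDs: {                           // Rancher project (Environment) IDs
        dev: '1a5',
        qa: '1a7',
        demo: '1a9'
    },
    auth: {                                 // environment API key/secret per environment
        dev:  { key: 'accessKey', secret: 'secretKey' },
        qa:   { key: 'accessKey', secret: 'secretKey' },
        demo: { key: 'accessKey', secret: 'secretKey' }
    },
    retries: {
        on422: true,          // roll back and retry when an upgrade hits a 422
        actionComplete: 600   // polling attempts before waitForActionComplete gives up
    }
};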

This library is called from a simple wrapper for Bamboo’s sub-prod deploys; for production, however, I got a lot trickier. Stay tuned for that write-up next week!

Dockerization Part 1: Building

I’ve been long overdue for a series of articles explaining how our current build system works. One of the major projects I was involved with before this recent reorg involved overhauling our manual build process into a shiny new CI/CD system that would take the code from commit to production in a regulated, automated fashion. As always, the reward for doing a good job is more work like that; when we decided to move to Docker to better support our new team structure, I ended up doing a lot of the foundational work on our new build-test-deliver pipeline. Part one of that pipeline is, of course, building and storing containers.

Your mission, if you choose to accept it

In the old world, before we dockerized our applications, we were following a fairly typical system (that I designed): our CI server runs tests against the code, then bundles it up as an archive file. After that, one environment at a time and on request, it would SCP the tarball down to the server, stop the running process, remove the old codebase, and unpack the new before starting the process again. There were configuration files that had to be saved off and moved back in afterward in a few cases, but we had all those edge cases ironed out. It was working, and there were almost no changes to it in the year before we launched docker.

As we were preparing to go live, I didn’t want to lose the build pipelines we had worked so hard on. And yet, Docker containers are fundamentally different from tarballs of code files. Furthermore, our operators (who are responsible for putting code into production) complained of having too many buttons to click: often, our servers had 3-4 codebases on them, meaning 3-4 buttons to click to update one server. They definitely didn’t want to do one button per container. On the other hand, our developers were clear on what they wanted: more deploys, faster deploys, and breaking out their monoliths into modules and microservices so they could go even faster. How to balance these concerns?

Another wrinkle emerged once I got my hands on our environment: we had chosen Rancher as our Docker management tool. Rancher is a great little tool, and I enjoy working with its GUI, but with most companies seemingly standardizing on Kubernetes, it was hard to find good examples and tutorials for working with Rancher instead.

With all those pressures bearing down on me, my task was straightforward, but far from simple.

How to build a container in 30 days

The promise of containers seemed like it would resolve a lot of our headaches overall: developers control the interior of the container, and Platform Ops controls the outside of it. In this brave new world, I don’t have to care what goes in a container, but it’s my job to ensure they get to where they’re going every time without fail. In practice, however, I found I needed to understand quite a bit about containers themselves.

For the purposes of this article, you don’t need to know or care about the virtualization layer; just trust that a container is isolated from everything around it, until and unless you drill holes in it (which we do. A lot. But I understand that’s common). You will need to know a little about how they’re built, however.

Picture a repository of source code. At some point, to dockerize the application contained within, you need a Dockerfile: a file of instructions on how to build this container. Almost every container begins with an instruction to extend from another image, much like classes extending from a base class. This was really handy for us, since it means we can put anything we need into a custom base image and all the developers will have it pre-installed.

From there, there’s a series of customizations to the container. Generally, one step involves copying the code into the container, and another tells the container what executable to run when it starts. For Node.js, we ask our developers to put their code in a standard location, then execute “npm start” when the container boots up, letting them define what that means for their application.

Once you’re happy with what the container contains, it’s time to seal it up and ship it. In this case, that means two commands: a “tag” command, which gives it a name more interesting than the default (which will be something like 2b9c0185251d), and a “push” command, which uploads the docker container to a remote repository. If the container is intended to live in a central repository, it has to be tagged with that repository as part of the name (including a port number, which usually defaults to 5000 for a Docker registry unless you put an Nginx in front to make it 80): something like “artifactory.internal:5000/dt-node-base”. Appended to that is a version: this can be a sequential number, or a word or anything else. By convention, each container is tagged twice: once with a sequential number, and once with the word “latest”. That makes it so you can always pull down the very latest node base container from our Artifactory repository by asking it for “artifactory.internal:5000/dt-node-base:latest”.

The system

So we have a number of parts to this build system that the CI/CD server has to integrate with. The first piece is to begin with raw source code, including a Dockerfile; we had been using Subversion, but the developers had been asking for Git for so long we finally broke down and bought a Bitbucket server and let them migrate.

The next piece is to build the containers with Docker. Since we were using Bamboo as our CI/CD server, I installed Docker on all the remote agents; this required upgrading their OS to Red Hat 7, but I was able to script the install using Ansible to make rolling it out across our whole system less painful.

The next piece is somewhere to store the containers when we’re done with them. As you can guess from the earlier example, we decided to use Artifactory for this; this is mostly because, as the developers moved to Node, they were asking for a private NPM server, and Artifactory is able to do double duty and hold both types of artifacts.

For the communication between them, my coworker put together a script we could put on each build server that the plans could use to ensure they didn’t miss any steps. It’s straightforward, looking something like this:

#!/bin/sh -e
# $1 Project Name (dt-nodejs)

# Build from the checked-out repo (the current directory) and tag with both the build number and "latest"
docker build -t artifactory.internal:5000/$1:$bamboo_buildNumber \
 -t artifactory.internal:5000/$1:latest .

docker push artifactory.internal:5000/$1:$bamboo_buildNumber
docker push artifactory.internal:5000/$1:latest
echo "$1:$bamboo_buildNumber and $1:latest pushed to Artifactory on artifactory.internal:5000"

This means that every build tags the container with the number of the build, giving us an easy source of sequential numbers for the containers without thinking about it. It does mean, however, that building a new pipeline for an existing container name will start the numbering over from 1 and overwrite old containers, but we encourage developers to edit their build plans instead of starting over where possible. If you have any ideas on how to prevent that, I’d love to hear them.

(I’ve actually enhanced this script since, but I’ll talk about that in a future entry)

 

Why the hiatus? And what comes next?

You may have noticed there’s not been any real content on this blog in a hot minute. That’s because I made a career move: I stopped being an automation engineer/QA consultant and began working on a new team in a more DevOps-oriented role.

My team now is called Platform Ops. We’re a cross-functional team dedicated primarily to supporting and improving the lives of the developers at my company. The devs were reorganized into product-specific teams so they could be more agile, while we ended up with the cross-platform work: implementing Docker containers, supporting frameworks shared across teams, and responding to midnight emergencies when Operations has no idea what’s failing or who to call about it.

Now that it’s January and I’ve become more comfortable in my role, you can expect to see more articles on this blog. The content will be different; I’m going to talk about Docker, about keeping servers running, about testable infrastructure-as-code, about moving from a monolithic Waterfall environment to a DevOpsy microservice-friendly space, leapfrogging many of the in-between steps in our quest to please the unpleasable developers.

Watch this space!