The Missing Stair in IT

There are missing stairs all over software development, and I’m willing to bet your organization, if it’s larger than the smallest of startups, has one. Do you know who it is? And if so, why do they still work for you?

What do I mean by “missing stair”? The Missing Stair is a concept that started in BDSM communities but quickly spread like wildfire across all sorts of nerdy and alternative communities. It was coined by the blogger Cliff Pervocracy, on their blog Pervocracy, to describe a rapist in the community. Everyone knew he was a rapist, or something close to it, but nobody wanted to kick him out of the community. He was like a stair missing from a flight of stairs: everyone knew to step over the hole, but nobody wanted to fix it. Every so often, a new person would fall in. “Ah yes,” everyone would say, nodding wisely. “That stair’s missing. You should just avoid it.”

I came across the concept on the blog Captain Awkward; if you don’t already subscribe, please do browse the archives and consider it. She talks at length about geek social fallacies and how to rectify them so that we build safer, healthier organizations. And what is a tech company but a geek organization with money involved?

Let me tell you about a past workplace, since I don’t want to out anyone at my current job. At one of my past companies (I won’t say which), everyone knew the infra team was the Missing Stair. If you had an infra need, you were instructed to file a ticket… and that ticket would sit in the queue… and sit… and sit. Months later, having moved on to a whole different project, you might get a curt message from the infra team demanding more information… and then it would sit again. They were too understaffed and overwhelmed to do their jobs in anything like a timely fashion. The only hope of getting something you truly needed done quickly was to turn up at one of their cubicles with baked goods and a sob story. And then your ticket would get done… at the expense of whoever was next in line waiting a little longer. Multiply this by a hundred and you have the problem we were facing.

In the book The Goal, the protagonist, Alex, goes on a Boy Scout hike with his son’s troop. (I promise this is relevant, if you haven’t already heard the story.) He lets the kids order themselves in line, which results in the slowest hiker, Herbie, ending up at the very back. Alex keeps having to double back to check on Herbie while the faster kids wait impatiently up ahead at the designated rest areas. Finally, he has a bolt of inspiration: put Herbie at the front. The whole line can only move as fast as Herbie goes, after all; nobody gets to go home until Herbie finishes the hike. Herbie’s slow pace effectively caps the rate at which the other kids can finish their hike. Alex likens this to his manufacturing plant, where, by the logic of the Theory of Constraints, the whole line can only go as fast as the slowest machine. By buying a second copy of the slowest machine, he produces real speed benefits, instead of the phantom benefits created by buying another fast machine and having it sit idle most of the time.

In software development, this usually gets applied to ticket processing time: your slowest tickets bottleneck the faster ones because they eat up developer time. But it’s also true that engineering as a whole is like one big manufacturing line. The slowest department or person slows the work of everyone around them, and it’s terrible for morale besides. Who likes being unable to close tickets because some asshole on the other side of the company won’t give you what you need? Never mind that the “asshole” is drowning in more work than he should rightly be expected to complete.

Are you hiring software developers right now? Seriously, look at your open reqs. My company is hiring at least three. But we all know where the missing stair is in our organization. I’m sure even the managers know at this point. How could they not? So why are we spending our money on more fast developers when a single addition to the missing-stair team would enable so many people to move faster?

Dockerization pt 3: Deploybot

One of the key components to making a good integration between products is understanding the mental model of each product. What one product calls a “counter” another could call a “metric” or a “stat”, for example; or worse, one product could be reporting, say, the amount of free memory, while another is reporting the percentage.

The same goes for integration points between teams. When I built our previous release pipeline, I discovered very quickly that while developers were comfortable talking about repositories, operations thought in terms of applications, and neither knew nor cared how many moving parts went into a single app so long as it was all on a server together. This resulted in a system in which devs had to create a long, complex manifest to explain to ops exactly what code should be deployed by the night shift when, and a number of errors came out of that process.

I knew if I had per-container deploy buttons in production, like I did for pre-production, we’d have similar issues. Developers want to push single containers out into test environments, but operations wants to deploy a single application, with as little fuss as possible, and a quick painless rollback if they notice problems. So when it came time to move to containers in production, I designed them a custom application to do just that. I called it Deploybot.

Rancher ships with a catalog of common apps that you might want to deploy in your container environment. It also comes with the ability to make your own catalogs. A catalog consists of a docker-compose and a rancher-compose; the former is a standard format for spinning up a set of containers that depend on each other as a single unit, while the latter is a Rancher-specific extension to the format that allows the product to add extra metadata to the containers once they’re spun up. In the GUI, the catalog makes it really simple to deploy a whole application at once: just a few clicks and you can have a running app within seconds. Furthermore, when new versions are released to the catalog, it’s only two clicks to upgrade to the new version, which updates only those containers which have changed.

Not only does this let us quickly upgrade, it keeps a history of everything that’s ever been deployed. This means we can easily spin up a version in develop that is identical to this week’s production environment — or last week’s, or next week’s. Furthermore, we can roll back easily to a given deployment in the event of an emergency, and we have a record in the UI of every deploy that’s been done.

I determined quickly that the API would let me upgrade from the catalog if a new version had been pushed. From there, my biggest problem was logistics. Let’s say an app has four containers in it. At the time, we had four build plans, each of which built a single container and pushed it to Artifactory. When deploying one container at a time, we could simply use the Rancher API to push that container to our test environment. But for a catalog update, things were more complicated.

Step 1: Mark for Release

The first problem: how do we determine whether a container has passed testing when the final phase of our testing is still manual? Artifactory supports properties on any given artifact, and the API lets you search for only the containers that carry a given property. So I added a deploy target called “Mark for Release” after our development and test targets; when a given asset is ready to go to production, the lead developer for the team “deploys” to that “target”. Because I’m a masochist, I did this in Node.js:

addLabel (args) { // eslint-disable complexity no-console
    if (!args || !args.container || !args.version || !args.label || typeof args.value === 'undefined') {
        return Promise.reject(new Error('Incorrect parameters: please supply container, version, label, and value'));
    }
    console.log(`Adding label ${args.label}=${args.value} to ${args.container}:${args.version}`);

    const {container, version, label, value} = args;

    return request({
        method: 'PUT',
        uri: `${opts.uri}/artifactory/api/storage/docker-local/${container}/${version}?properties=${label}=${value}`,
        auth: {
            user: opts.user,
            pass: opts.password,
            sendImmediately: true
        }
    });
}

This is just a wrapper around a PUT request to Artifactory; docker-local is our local Docker repository, as you can tell by the oh-so-creative name. Later logic pulls the list of items with that label and finds the highest-numbered container that is ready to be released. At first, this worked great… right up until someone decided they wanted to fall back to an older version of a container, and we had to manually remove the label from the newer one. Then I added guard conditions:

markForRelease (artif, args) {
    return artif.addLabel({
        container: args.container,
        version: args.ver,
        label: 'release',
        value: 'true'
    }).then(() => artif.getAllItemsByLabel('release', 'true'))
        .then((items) => artif.filterDockerResults(items))
        .then((containers) => {
            const promises = containers.map((item) => {
                const [, name, version] = artif.parseVersion(item);

                // Guard: only touch other versions of this same container,
                // and only those numbered higher than the one being released
                if (name !== args.container) return Promise.resolve();
                if (Number(version) <= Number(args.ver)) return Promise.resolve();

                // Un-mark the newer version so the one just marked wins
                return artif.addLabel({
                    container: args.container,
                    version: item.substring(item.lastIndexOf('/') + 1),
                    label: 'release',
                    value: 'false'
                });
            });

            return Promise.all(promises);
        })
        .catch((err) => {
            console.error(err);
            process.exit(1);
        });
}

After calling addLabel, markForRelease fetches all the items with that label. (filterDockerResults removes anything that’s not actually a container, because the layers of a container get labeled along with the container itself.) At that point, anything with a higher number is un-marked, so that the highest-numbered container still marked is the right one to deploy.

Step 2: Update catalog

The next problem is how to get the right version numbers and edit the catalog. To do that, I had to deep dive into the catalog format for Rancher.

Any given catalog is a git repository. Inside the repository is a series of folders, one per application. Each of those folders contains a couple of descriptive files plus a series of version folders, one per version. Each version folder holds two files: a docker-compose and a rancher-compose. When you make changes, you’re meant to add a new folder with a new version number so that people can pull in the update at their leisure. When Rancher detects a newer version than the one you’re on, it displays a little icon in the UI to indicate that an upgrade is available.

What I needed was a way to take an existing configuration file, change the version numbers of specific containers, and write out a new version with that updated information. I could have used regular expressions to do the parsing, but that smelled like the wrong solution. So instead I added a folder called “_template” alongside the numbered versions. Rancher ignores it, because it isn’t a numbered version; furthermore, the files inside are “docker-compose.hbs” and “rancher-compose.hbs”, which Rancher doesn’t think it can parse. My Node.js application, however, can read them in as Handlebars templates.
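
To make the layout concrete, a catalog entry under this scheme ends up looking something like the sketch below (file and folder names are illustrative; the descriptive files hold the usual catalog metadata):

my-app/
  config.yml                 <- catalog metadata (name, description, etc.)
  0/
    docker-compose.yml
    rancher-compose.yml
  1/
    docker-compose.yml
    rancher-compose.yml
  _template/                 <- ignored by Rancher, read by Deploybot
    docker-compose.hbs
    rancher-compose.hbs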

This means I can write a template like this:

version: '2'
services:
  nginx:
    image: artifactory:5000/nginx-fram:{{nginx-fram}}
    stdin_open: true
    volumes:
    - /var/log:/var/log
    tty: true
    labels:
      io.rancher.container.pull_image: always
      io.rancher.scheduler.affinity:container_label_soft_ne: io.rancher.stack_service.name=$${stack_name}/$${service_name}
      io.rancher.container.hostname_override: container_name
  fram:
    image: artifactory:5000/fram:{{fram}}
    stdin_open: true
    tty: true
    labels:
      io.rancher.container.pull_image: always
      io.rancher.scheduler.affinity:container_label_soft_ne: io.rancher.stack_service.name=$${stack_name}/$${service_name}
      io.rancher.container.hostname_override: container_name

See the double curly braces? Those get replaced by the templating engine with the correct version number for each container name. This is easily accomplished: I hydrate the template with an object whose keys are container names and whose values are version numbers pulled from the Artifactory API, using the functions below:

getAllItemsByLabel (label, value) {
    // Ask Artifactory for every artifact carrying this property
    return request(
        `${opts.uri}/artifactory/api/search/prop?${label}=${value}&repos=${opts.repository}`,
        {
            auth: {
                user: opts.user,
                pass: opts.password,
                sendImmediately: true
            }
        }
    ).then((response) => JSON.parse(response).results);
},
filterDockerResults (containers) {
    return new Promise((resolve) => {
        const uris = containers.map((item) => item.uri);

        // Drop image layers and manifests; keep only the tagged containers
        resolve(uris.filter((item) => item.indexOf('sha256') === -1 && item.indexOf('manifest.json') === -1));
    });
},
getVersions (urls) {
    return new Promise((resolve, reject) => {
        if (!urls) {
            return reject(new Error('Could not get container versions from Artifactory'));
        }
        const retVal = {};

        urls.forEach((url) => {
            const [, name, version] = artifactory.parseVersion(url);

            // Keep only the highest version seen for each container name
            if (retVal[name] && Number(retVal[name]) >= Number(version)) {
                return;
            }
            retVal[name] = version;
        });
        return resolve(retVal);
    });
},

When I compose these three functions together, they fetch all the containers that have been marked for release, remove anything that isn’t a real container, and return only the highest-numbered version of each. That makes a neat little object to pass into the Handlebars engine. I then take the hydrated template, write it out to disk as a new catalog version, commit, and push. Voila!
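
The hydrate-and-write step itself isn’t shown in the snippets above. A minimal sketch of it, assuming the handlebars package and hypothetical arguments for the catalog path and version map (this is illustrative, not the actual Deploybot code), might look like this:

const fs = require('fs');
const path = require('path');
const Handlebars = require('handlebars');

// Illustrative sketch: hydrate the _template compose files with the version map
// produced by getVersions() (e.g. {'nginx-fram': '42', fram: '17'}) and write
// them out as a new numbered catalog version.
function writeCatalogVersion (catalogDir, appName, newVersion, versions) {
    const templateDir = path.join(catalogDir, appName, '_template');
    const versionDir = path.join(catalogDir, appName, String(newVersion));

    fs.mkdirSync(versionDir);

    ['docker-compose', 'rancher-compose'].forEach((name) => {
        const source = fs.readFileSync(path.join(templateDir, `${name}.hbs`), 'utf8');
        const template = Handlebars.compile(source);

        fs.writeFileSync(path.join(versionDir, `${name}.yml`), template(versions));
    });
}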

Step 3: Update Rancher

Now that there’s a new catalog version, I have to tell Rancher to update the stack to that version. First, I tell it to poll the catalog for updates. I do so with this neat little function (note the bonus fun comment):

refreshCatalog: function(environment) {
    log('Refreshing Rancher catalog');

    //This API is synch, so it won't return until the catalog is refreshed
    //...in theory
    //I'm pretty sure the above is a lie. A vicious, vicious lie. 
    //But it is the lie told to me by my ancestors, so I will repeat it.
    return request({
        uri: `${opts.url}/v1-catalog/templates?action=refresh`,
        auth: {
            username: opts.auth[environment].key,
            password: opts.auth[environment].secret
        },
        json: true// Automatically stringifies the body to JSON
    });
},

Since the function was definitely returning well before the version showed up in the catalog, I wrote a wait for this particular step:

waitForCatalog: function(catalogId, version, environment, stackName) {
    update('Waiting for catalog to show up', 'info', stackName, environment);
    let that = this;
    return new Promise((resolve, reject) => {
        //Wait for the service to finish upgrading
        let retries = opts.retries.catalog;
        let secondsBetweenRetries = 3; 
        function checkCatalog() {
            that.refreshCatalog(environment)
            .then(() => request({
                method: 'GET',
                uri: `${opts.url}/v1-catalog/templates/DealerTire:${catalogId}`,
                auth: {
                    username: opts.auth[environment].key,
                    password: opts.auth[environment].secret
                },
                json: true
            })).then((body) => {
                let versionList = body.versionLinks;

                if (versionList[version]) {
                    log('Found version', 'info', stackName);
                    log('Catalog update took ' + (opts.retries.catalog - retries) * secondsBetweenRetries + ' seconds.');
                    return resolve();
                } else {
                    retries--;
                    log(`${retries} retries remaining`);
                    if (retries < 0) {
                        log('Did not find version in catalog. Check Bitbucket?');
                        return reject(new Error('Timed out waiting for version to appear in catalog.'));
                    }
                    setTimeout(checkCatalog, secondsBetweenRetries * 1000);
                }
            });
        }
        setTimeout(checkCatalog, 500);
    });
},

This is basically the same as the waits I mentioned in part 2, of course, just waiting for the version to show up in the catalog instead of waiting for the service upgrade to finish.

Once the catalog entry is showing up, I can submit a stack-level “upgrade” action. Since I knew I’d also have to do a “complete” action and a “rollback” action, I went ahead and genericized this function as well:

performStackAction: function (stackId, environment, action, body = {}) {
    return request({
        method: 'POST',
        uri: `${opts.url}/v2-beta/projects/${opts.projectIDs[environment]}/stacks/${stackId}?action=${action}`,
        body: body,
        auth: {
            username: opts.auth[environment].key,
            password: opts.auth[environment].secret
        },
        json: true
    });
},

Now, in the case of an upgrade action specifically, you need to pass the compose files into the API. I’m not sure what happens if they disagree with what’s in the catalog; this catalog portion may not be strictly necessary if the passed-in configs are preferred. That said, we knew we wanted the record in the catalog, and it’s easy enough to hand the API the same data we just wrote to disk a moment ago.

performStackUpgrade: function (stackName, environment, newnum, composes) {
    update(`Upgrading ${stackName} in ${environment} to version ${newnum}`, 'info', stackName, environment)
    return findStack(stackName, environment).then((body) => {
        if (body.data.length <= 0) {
            throw new Error(`Could not find stack ${stackName} in ${environment}`);
        }
        let stackId = body.data[0].id;
        let catalogId = getCatalogId(stackName);
        return this.performStackAction(stackId, environment, 'upgrade', {
            dockerCompose: composes.docker,
            rancherCompose: composes.rancher,
            externalId: `catalog://Dealertire:${catalogId}:${newnum}`
        }).then(() => stackId);
    });
},

The “externalId” key is what’s telling it that we’re doing a catalog upgrade. However, passing in blank composes results in nothing changing; it won’t read them out of the catalog to determine what to change. It’d be great if it did, though, hint hint.

We then wait for upgrade in much the same way we did when updating a single service, then report success back to the user. The overall sequence thus looks something like this:

return updStack(`Upgrading stack ${stack}...`, 'pending')
.then(() => gitLib.clone())
.then(() => gitLib.config())
.then(() => updStack('Creating new version...', 'pending'))
.then(() => artLib.getContainers())
.then((urls) => artLib.getVersions(urls))
.then((versions) => gitLib.update(catalogID, versions))
.then(retval => {releaseData = retval})
.then(() => gitLib.commit(`Release for ${stack} on ${releaseData.internalID}`, catalogID, releaseData.folder))
.then(() => updStack(`Created version ${releaseData.id}. Waiting for update...`, 'pending'))
.then(() => ranLib.refreshCatalog(env))
.then(() => ranLib.waitForCatalog(catalogID, releaseData.id, env))
.then(() => updStack('Performing upgrade...', 'pending'))
.then(() => ranLib.performStackUpgrade(stack, env, releaseData.folder, releaseData))
.then((id) => releaseData.stackID = id)
.then(() => updStack('Waiting for upgrade to complete...', 'pending'))
.then(() => ranLib.waitForActionCompleteStack(releaseData.stackID, env, 'upgraded', stack.display))
.then(() => updStack('Waiting for health checks...', 'pending'))
.then(() => ranLib.waitForHealthCheckStack(releaseData.stackID, env, stack.display))

“updStack” here is my function for sending a message back through the websocket to the client; gitLib is my git library, artLib is the Artifactory library, and ranLib is the Rancher library. You have to wait for the health checks here or you’ll run into issues if you complete the upgrade right away, but again, that’s pretty straightforward.
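
For context, updStack itself is nothing fancy. A minimal sketch of that kind of helper, assuming a ws-style socket and a JSON message format (neither of which this post actually specifies), might be:

// Illustrative only: push a status update to the Deploybot UI over the websocket.
// The real message shape and socket setup aren't shown in this post.
function makeUpdStack (socket, stack, environment) {
    return function updStack (message, status) {
        socket.send(JSON.stringify({stack, environment, message, status}));
        return Promise.resolve();
    };
}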

Deploybot

The interface to my final product, therefore, is really straightforward: a password prompt with AD authentication so only our Operations team can deploy, then three buttons per application. This screenshot was taken during the initial rollout; we have dozens of applications now, each with their own set of three buttons. Ops will do an upgrade, then leave the stack in the upgraded state while they confirm that the application is running correctly; when that’s done, they can complete or roll back, according to the test result. All the functionality for rollback and complete comes from Rancher; I’ve just exposed it here in a series of handy buttons to avoid them having to dig through the UI in an emergency or at 3 in the morning.

Is it overcomplicated? Yeah, totally. There’s a lot I plan to change in version 2.0, which will probably need to be tweaked when we upgrade Rancher and move to a Kubernetes-backed system. But it gets the job done, and after the first few weeks of finding edge cases and smoothing them out, it works like a charm. Sometimes that’s what you’re looking for in a devops world: a reliable, straightforward tool that gets the job done so you can get back to releasing new functionality.

Dockerization part 2: Deploying

Now that we have containers, we need to push them to our subprod environments so they can be tested. Bear with me, this is where things get a little complicated.

Docker Setup

Most people take the easy way out when they move to docker: they ship their containers to the cloud and let someone else manage the installation, upgrades, and maintenance on the docker hosts. We don’t do things the easy way around these parts, though, so we have our own server farm: a series of VMs in our datacenter. Everything below the VM is maintained by another team; my team is responsible for the software layer of the VM, and the containers that run on top. We have a handful of servers in our sub-prod environments, and then a handful more in our various production DMZs.

For management, most people seem to choose Kubernetes, but again, we don’t do things the easy way around here, so we went with a less popular product called Rancher. Now, Rancher is a management interface that can sit on top of a number of underlying technologies, including Kubernetes, but we chose to use their house-brand management system, called Cattle, instead. They were nice enough to give us a bunch of training in Docker, including the advice that forms the basis for their theme: if servers were pets, carefully maintained and fed over the years, containers should be like cattle, slaughtered and replaced as soon as they seem to be ill so they don’t infect the whole herd.

Rancher is a really great tool if you’re working in the GUI. It has the concept of an Environment (which we use to separate dev from QA from demo), which spans one or more Hosts (the servers that run Docker and manage the containers). Inside the Environment are Stacks: named collections of related containers. Rancher also handles a lot of the networking between containers, as it comes with its own DNS for the internal container network, so you can just resolve StackName/ContainerName to find a given container in your Environment. You can upload a docker-compose.yml file to create a stack if you’re using Compose, and the extra metadata Rancher uses can be stored in a rancher-compose.yml that can be uploaded alongside it.
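
To give a sense of what that extra metadata looks like, a bare-bones rancher-compose.yml for a single service might read something like this (the service name, scale, and health-check values are made up for illustration):

version: '2'
services:
  nginx:
    scale: 2
    health_check:
      port: 80
      interval: 2000
      healthy_threshold: 2
      unhealthy_threshold: 3
      response_timeout: 2000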

Rancher running on my local machine, showing a project I have in progress

Deployment

Manual deployment is super easy in Rancher: create a new stack, add services, paste in the container name from our build step, and let it handle everything else. Moving between environments manually once it works in dev is also easy: download the compose files, then upload them into the next environment. But we’re doing CI/CD, and the developers are constantly asking how they can speed up their release schedule. How do we do this automatically?

There are two tools that come with Rancher that can help here. One is the extensive API; pretty much everything you can do in the GUI can be done via the JSON-based REST API. The other is the pair of command-line tools they produce: rancher-compose and the Rancher CLI. Since I was also trying to release quickly, I used the API for my initial round of deployment scripts; in a later post, I’ll talk through how I’ve begun converting to the CLI commands instead, as I find them faster and cleaner.

For Bamboo, I needed something that could run in a Deploy Project that would update the stack in a given environment. I decided to write a Node.JS script, because when all I have is a node-shaped hammer every build script becomes a nail 😉 (Actually, it was so our Node developers could read the script themselves). I didn’t do much special here, just your standard API integration using a promise-based architecture; however, this is a chunk of a bigger library I decided to write around Rancher, so you’ll see a lot of config options:

function findStack(stackName, environment) {
    return request({
        uri: `${opts.url}/v2-beta/projects/${opts.projectIDs[environment]}/stacks?name=${stackName}`,
        auth: {
            username: opts.auth[environment].key,
            password: opts.auth[environment].secret
        },
        json: true// Automatically stringifies the body to JSON
    });
}

function getContainerInfo(environment, stackName, containerName) {
    log(`Getting container info for ${containerName}`);
    return findStack(stackName, environment)
    .then((body) => request({
        method: 'GET',
        uri: `${opts.url}/v1/services/?environmentId=${body.data[0].id}&name=${containerName}`,
        auth: {
            username: opts.auth[environment].key,
            password: opts.auth[environment].secret
        },
        json: true // Automatically stringifies the body to JSON
    }));
}

function performAction(serviceId, action, environment, launchConfig, stackName) {
    update(`Performing action ${action} on service ${serviceId}`, 'info', stackName);
    return request({
        method: 'POST',
        uri: `${opts.url}/v1/services/${serviceId}/?action=${action}`,
        body: {
            'inServiceStrategy': {
                'batchSize': 1,
                'intervalMillis': 2000,
                'startFirst': true,
                'launchConfig': launchConfig
            }
        },
        auth: {
            username: opts.auth[environment].key,
            password: opts.auth[environment].secret
        },
        json: true
    });
}

    performServiceUpgrade: function (stackName, containerName, environment, image) {
        update(`Upgrading ${containerName} in stack ${stackName} in ${environment} to image ${image}`, 'info', stackName, environment)
        return getContainerInfo(environment, stackName, containerName).then((body) => {
            if (body.data.length <= 0) {
                throw new Error(`Could not find service ${containerName} in stack ${stackName} in ${environment}`);
            }
            let serviceId = body.data[0].id;
            let launchConfig = body.data[0].launchConfig;
            launchConfig.imageUuid = image;

            return performAction(serviceId, 'upgrade', environment, launchConfig, stackName)
            .catch((err) => {
                if ((err.statusCode == 422 || err.status == 422) && opts.retries.on422) {
                    log('Detected invalid state. Rolling back to retry.', 'info', stackName)
                    return performAction(serviceId, 'rollback', environment, launchConfig, stackName)
                        .then(() => this.waitForActionComplete(stackName, containerName, environment, 'active', stackName))
                        .then(() => performAction(serviceId, 'upgrade', environment, launchConfig, stackName));
                } else {
                    log('Detected error condition. Aborting', 'error', stackName)
                    throw err;
                }
            })
            .then(() => this.waitForActionComplete(stackName, containerName, environment, 'upgraded', stackName))
            .then(() => serviceId);
        });
    }

The two lines that grab the existing launchConfig and swap in the new image deserve attention, because they’re a little strange. Rancher lets you update anything about a service using the same endpoint, which is kind of nice and kind of rough: I need to specify every single attribute of the service, or it’ll assume I meant to blank out that setting (rather than assuming I meant to leave it unchanged). To make this easier, I capture the existing launch configuration, change just the image, and send the whole thing back.

Upgrading a service in Rancher is a two-step process: first, you upgrade, which launches a copy of the new container for every copy of the existing container, and then you “finish” the upgrade, which removes the old containers. This is so that if there’s a problem with the new container, you can issue a “rollback” action, which turns the old containers back on and removes the new ones — much faster than trying to pull a fresh copy of the old container back. However, this means sometimes you’ll be trying to upgrade while it’s in an “upgraded” state, waiting for you to finish or roll back. When that happens, Rancher issues a status code 422. My library optionally rolls back and issues the upgrade action again if it encounters this state.
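
For completeness, finishing (or abandoning) the upgrade goes through the same action endpoint. A sketch of that step, assuming Rancher 1.x action names (“finishupgrade” and “rollback”, worth verifying against your API version) and reusing the waitForActionComplete helper shown below, might look like:

// Sketch only: once the new containers look healthy, finish the upgrade;
// swap 'finishupgrade' for 'rollback' to back out instead.
finishServiceUpgrade: function (stackName, containerName, environment) {
    return getContainerInfo(environment, stackName, containerName).then((body) => {
        let serviceId = body.data[0].id;

        return request({
            method: 'POST',
            uri: `${opts.url}/v1/services/${serviceId}/?action=finishupgrade`,
            auth: {
                username: opts.auth[environment].key,
                password: opts.auth[environment].secret
            },
            json: true
        });
    }).then(() => this.waitForActionComplete(stackName, containerName, environment, 'active'));
}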

The hardest part was figuring out how to figure out when Rancher was done upgrading. Some of our images are huge, particularly the ones that contain monoliths we’re still in the process of breaking up; it can take several minutes for these containers to download and start up. Eventually, I settled on a polling-based strategy:

waitForActionComplete: function(stackName, containerName, environment, desiredState) {
    update('Waiting for upgrade to complete', 'info', stackName, environment);
    return new Promise((resolve, reject) => {
        //Wait for the service to finish upgrading
        let retries = opts.retries.actionComplete;
        function checkState() {
            getContainerInfo(environment, stackName, containerName).then((body) => {
                let container = body.data[0];
                log('Current state: ' + container.state);

                //Check if upgrade is done
                if (container.state == desiredState) {
                    log('Action complete');
                    return resolve();
                } else {
                    retries--;
                    if (retries < 0) {
                        return reject(new Error('Timed out waiting for action to complete'));
                    }
                    log(`${retries} left, running again`);
                    return setTimeout(checkState, 1000);
                }
            });
        }
        setTimeout(checkState, 500);
    });
}

This will keep running the checkState function until either the container’s state enters the desired state, or it runs out of retries (configured in the config for the library). I’ve had to tune the number of retries several times; right now, for our production deploy, it’s something outrageous like 600.
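
For reference, the opts object threaded through all of these functions is plain configuration. A hypothetical shape that would satisfy the code above (the values are placeholders, not our real settings) is roughly:

// Hypothetical config for the Rancher library; values are placeholders.
const opts = {
    url: 'https://rancher.example.internal:8080',
    projectIDs: {dev: '1a5', qa: '1a7'},                 // Rancher environment (project) IDs
    auth: {
        dev: {key: 'ACCESS_KEY', secret: 'SECRET_KEY'},
        qa: {key: 'ACCESS_KEY', secret: 'SECRET_KEY'}
    },
    retries: {
        actionComplete: 600,   // polls of checkState, one second apart
        catalog: 30,           // polls of checkCatalog, three seconds apart
        on422: true            // roll back and retry automatically on a 422
    }
};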

This library is called from a simple wrapper for Bamboo’s sub-prod deploys; for production, however, I got a lot trickier. Stay tuned for that write-up next week!

Dockerization Part 1: Building

I’ve been long overdue for a series of articles explaining how our current build system works. One of the major projects I was involved with before this recent reorg involved overhauling our manual build process into a shiny new CI/CD system that would take the code from commit to production in a regulated, automated fashion. As always, the reward for doing a good job is more work like that; when we decided to move to Docker to better support our new team structure, I ended up doing a lot of the foundational work on our new build-test-deliver pipeline. Part one of that pipeline is, of course, building and storing containers.

Your mission, if you choose to accept it

In the old world, before we dockerized our applications, we followed a fairly typical system (that I designed): our CI server ran tests against the code, then bundled it up as an archive file. After that, one environment at a time and on request, it would SCP the tarball down to the server, stop the running process, remove the old codebase, and unpack the new one before starting the process again. There were configuration files that had to be saved off and moved back in afterward in a few cases, but we had all those edge cases ironed out. It was working, and there were almost no changes to it in the year before we launched Docker.

As we were preparing to go live, I didn’t want to lose the build pipelines we had worked so hard on. And yet, docker containers are fundamentally different than tarballs of code files. Furthermore, our operators (who are responsible for putting code into production) complained of having too many buttons to click: often, our servers had 3-4 codebases on them, meaning 3-4 buttons to click to update one server. They definitely didn’t want to do one button per container. On the other hand, our developers were clear on what they wanted: more deploys, faster deploys, and breaking out their monoliths into modules and microservices so they could go even faster. How to balance these concerns?

Another wrinkle emerged as well once I got my hands on our environment: we chose Rancher as our docker management tool of choice. Rancher is a great little tool, and I enjoy working with its GUI, but when most companies seem to be standardizing on Kubernetes, it was hard to find good examples and tutorials for how to work with Rancher instead.

With all those pressures bearing down on me, my task was straightforward, but far from simple.

How to build a container in 30 days

The promise of containers seemed like it resolved a lot of our headaches overall: developers control the interior of the container, and Platform Ops controls the outside of it. In this brave new world, I don’t have to care what goes in a container, but it’s my job to ensure they get to where they’re going every time without fail. In practice, however, I found I need to understand quite a bit about containers themselves.

For the purposes of this article, you don’t need to know or care about the virtualization layer; just trust that a container is isolated from everything around it, until and unless you drill holes in it (which we do. A lot. But I understand that’s common). You will need to know a little about how they’re built, however.

Picture a repository of source code. At some point, to dockerize the application contained within, you need a Dockerfile: a file of instructions on how to build this container. Almost every container begins with an instruction to extend from another image, much like classes extending from a base class. This was really handy for us, since it means we can put anything we need into a custom base image and all the developers will have it pre-installed.

From there, there’s a series of customizations to the container. Generally, one step involves copying the code into the container, and another tells the container what executable to run when it starts. For Node.js, we ask our developers to put their code in a standard location, then execute “npm start” when the container boots up, letting them define what that means for their application.
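
As an illustration only (this is not our actual Dockerfile), a Node.js service following that pattern might have a Dockerfile like:

# Extend from the internal base image discussed below
FROM artifactory.internal:5000/dt-node-base:latest

# Copy the application code into the agreed-upon standard location
WORKDIR /opt/app
COPY . /opt/app

# Install dependencies; the container runs "npm start" when it boots
RUN npm install --production
CMD ["npm", "start"]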

Once you’re happy with what the container contains, it’s time to seal it up and ship it. In this case, that means two commands: a “tag” command, which gives it a name more interesting than the default (which will be something like 2b9c0185251d), and a “push” command, which uploads the docker container to a remote repository. If the container is intended to live in a central repository, it has to be tagged with that repository as part of the name (including a port number, which usually defaults to 5000 for a Docker registry unless you put an Nginx in front to make it 80): something like “artifactory.internal:5000/dt-node-base”. Appended to that is a version: this can be a sequential number, or a word or anything else. By convention, each container is tagged twice: once with a sequential number, and once with the word “latest”. That makes it so you can always pull down the very latest node base container from our Artifactory repository by asking it for “artifactory.internal:5000/dt-node-base:latest”.
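
Spelled out as commands, tagging and pushing that base image by hand would look roughly like this (the image ID and the version number 42 are placeholders):

docker tag 2b9c0185251d artifactory.internal:5000/dt-node-base:42
docker tag 2b9c0185251d artifactory.internal:5000/dt-node-base:latest
docker push artifactory.internal:5000/dt-node-base:42
docker push artifactory.internal:5000/dt-node-base:latest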

The system

So we have a number of parts to this build system that the CI/CD server has to integrate with. The first piece is to begin with raw source code, including a Dockerfile; we had been using Subversion, but the developers had been asking for Git for so long we finally broke down and bought a Bitbucket server and let them migrate.

The next piece is to build the containers with Docker. Since we were using Bamboo as our CI/CD server, I installed Docker on all the remote agents; this required an OS upgrade for them to Red Hat 7, but I was able to script the install using Ansible to make doing it across our whole system less painful.

The next piece is somewhere to store the containers when we’re done with them. As you can guess by the previous example, we decided to use Artifactory for this; this is mostly because, as the developers moved to Node, they were asking for a private NPM server, and Artifactory is able to do double duty and hold both types of artifacts.

For the communication between them, my coworker put together a script we could put on each build server that the plans could use to ensure they didn’t miss any steps. It’s straightforward, looking something like this:

#!/bin/sh -e
# $1 Project Name (dt-nodejs)

docker build -t artifactory.internal:5000/$1:$bamboo_buildNumber \
 -t artifactory.internal:5000/$1:latest .

docker push artifactory.internal:5000/$1:$bamboo_buildNumber
docker push artifactory.internal:5000/$1:latest
echo "$1:$bamboo_buildNumber and $1:latest pushed to Artifactory on artifactory.internal:5000"

This means that every build tags the container with the number of the build, giving us an easy source of sequential numbers for the containers without thinking about it. It does mean, however, that building a new pipeline for an existing container name will start the numbering over from 1 and overwrite old containers, but we encourage developers to edit their build plans instead of starting over where possible. If you have any ideas on how to prevent that, I’d love to hear them.

(I’ve actually enhanced this script since, but I’ll talk about that in a future entry)

 

Why the hiatus? And what comes next?

You may have noticed there’s not been any real content on this blog in a hot minute. That’s because I made a career move: I stopped being an automation engineer/QA consultant, and began working on a new team in a more DevOps manner.

My team now is called Platform Ops. We’re a cross-functional team dedicated primarily to supporting and improving the lives of the developers at my company. The devs were reorganized into product-specific teams so they could be more agile, while we ended up with the cross-platform work: implementing Docker containers, supporting frameworks shared across teams, and responding to midnight emergencies when Operations has no idea what’s failing or who to call about it.

Now that it’s January and I’ve become more comfortable in my role, you can expect to see more articles on this blog. The content will be different; I’m going to talk about Docker, about keeping servers running, about testable infrastructure-as-code, about moving from a monolithic Waterfall environment to a DevOpsy microservice-friendly space, leapfrogging many of the in-between steps in our quest to please the unpleasable developers.

Watch this space!

Selenium Grid on Docker and Vagrant: Part 1

I’ve been putting together a quick proof-of-concept here at work about how we could use Docker to run a Selenium Grid. I’m not sure we’ll go that route, but I was curious how it could be done.

One of the main advantages of doing this sort of rough proof in Vagrant is that it becomes very portable. At the end of the day, I have a mini testing cloud I can run my tests against — and any member of my team can check out a few files and have their own mini testing cloud. It’s pretty neat, and it means that even if we decide against implementing this on a larger scale, I get some value out of it in years to come.

I’ll assume you’re passingly familiar with vagrant already, and have at least read the getting started docs. I was an absolute newbie to Docker when I started, so this discussion will assume no prior Docker knowledge. If you do know Docker, feel free to tell me how wrong I am in the comments section 🙂

I went down the path of using a Docker Provisioner for an hour or so before I realized that was the wrong path: I want to use the Docker Provider. The way to think of this is like a series of super tiny VMs which have to live on a giant VM in much the same way lily pads decorate the top of a pond. Docker as a provider can manage the whole set of lily pads and knows nothing about the pond; Docker as a provisioner can add a lily pad to your existing pond ecosystem without making as many waves.

So we have a secret VM, and a series of explicit Docker containers. Now, this was a proof of concept, but I actually care what OS that secret VM uses; if it’s not compatible with RHEL 6, then I won’t be able to make a good case for it in the end. Lots of shiny new toys only work on Ubuntu, after all.

Vagrant by default picks the tiniest OS it can find, just enough to support the containers on top. Usually that’s a good decision, but as we just discussed I want that secret VM to be CentOS 6 instead. This is where things get a little difficult: to specify your own VM to use, you give Vagrant another Vagrantfile.

Because Vagrantfiles need to be called “Vagrantfile”, you have to create a subfolder; mine is “dockerHost/Vagrantfile” for lack of better terminology. I also wanted to limit the amount of RAM Virtualbox would eat up, and enable networking (this will become important later). Try to think through what you’ll need, because every time you need to destroy and recreate this box, it’s going to suck and feel like it takes forever.

My dockerHost vagrantfile:

Vagrant.configure("2") do |config|
    # Every Vagrant development environment requires a box. You can search for
    # boxes at https://atlas.hashicorp.com/search.
    config.vm.box = "bento/centos-6.7"


    # Create a forwarded port mapping which allows access to a specific port
    # within the machine from a port on the host machine. In this case,
    # accessing "localhost:8088" on the host will reach port 80 on the guest.
    config.vm.network "forwarded_port", guest: 80, host: 8088

    # Create a private network, which allows host-only access to the machine
    # using a specific IP.
    config.vm.network "private_network", ip: "192.168.33.10"


    # Provider-specific configuration so you can fine-tune various
    # backing providers for Vagrant. These expose provider-specific options.
    config.vm.provider "virtualbox" do |vb|
        # Customize the amount of memory on the VM:
        vb.memory = "1024"

        # enable network features
        vb.customize ["modifyvm", :id, "--natdnshostresolver1", "on"]
        vb.customize ["modifyvm", :id, "--natdnsproxy1", "on"]
    end

    # Docker provisioner will install docker when given no options
    # This prepares the box to be a base image for the docker-related script
    config.vm.provision "docker"

    # The following line terminates all ssh connections. Therefore
    # Vagrant will be forced to reconnect.
    # That's a workaround to have the docker command in the PATH
    config.vm.provision "shell", inline:
        "ps aux | grep 'sshd:' | awk '{print $2}' | xargs kill"

    # Below are to fix an issue with docker provisioning
    config.vm.provision "shell", inline: "sudo chmod 777 /var/lib/docker "

end

A few things to point out:

  • I needed to enable network features, but not right away; that’s a later addition, for things we won’t get to until part 2.
  • Docker as a provisioner comes back into the mix in a rather unintuitive way. When given no images to load or dockerfiles to build, the provisioner simply installs Docker and exits. This makes a very easy, platform-agnostic way to install Docker. I was halfway through a shell script to do the provisioning when I learned this, and frankly, I just didn’t want to bother learning how to install Docker. On the other hand, this step takes forEVER to run, so you don’t want to recreate the VM often.
  • Docker isn’t available as a command until the ssh has been kicked out and reconnected. This is probably a Vagrant bug. I found the workaround listed above and stopped looking, because I didn’t want to spend more time on this than necessary.
  • The last line isn’t needed until part 2 of this series, but if you plan to build your own docker images, you probably want it.

When this is run with “vagrant up”, it creates a VM that has Docker installed. You probably want to test this before moving on, but once you do, you won’t need to explicitly start this again.
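
A quick smoke test of that host VM might look like the following (just one way to verify; adjust to taste):

cd dockerHost
vagrant up
vagrant ssh -c "docker --version"   # confirm the Docker client is on the PATH
vagrant ssh -c "sudo docker info"   # confirm the daemon is up and reachable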

So let’s go to our upper-level Vagrantfile. I looked around and very quickly found some Docker images I want to use out of the box: https://github.com/SeleniumHQ/docker-selenium. The first one to get running is the hub node, the central node for our grid. We configure Docker like any other provider:

Vagrant.configure("2") do |config|
    # The most common configuration options are documented and commented below.
    # For a complete reference, please see the online documentation at
    # https://docs.vagrantup.com.

    # Skip checking for an updated Vagrant box
    config.vm.box_check_update = false

    # Always use Vagrant's default insecure key
    config.ssh.insert_key = false

    # Disable synced folders (prevents an NFS error on "vagrant up")
    config.vm.synced_folder ".", "/vagrant", disabled: true

    # Configure the Docker provider for Vagrant
    config.vm.provider "docker" do |docker|

        # Define the location of the Vagrantfile for the host VM
        # Comment out this line to use default host VM
        docker.vagrant_vagrantfile = "dockerHost/Vagrantfile"

        # Specify the Docker image to use
        docker.image = "selenium/hub"

        # Specify a friendly name for the Docker container
        docker.name = 'selenium-hub'
    end
end

Here we can see:

  • I’ll confess I stole that synced-folders workaround from another tutorial. It’s probably cargo-culting, since I never ran into that issue myself, but on the other hand, I’m not using shared folders here, and neither should you be. If you need to use shared-folders, use them in the lower level. If you need to move files into your container, that should be done using Docker’s native utilities for file system manipulation, which will be covered in part 2.
  • The vagrantfile for the host VM is the vagrantfile we built above, the centOS one.
  • The image to use is just the name of the image. Much like vagrant boxes, this will search the central repository and find the right container image to use, so don’t worry about this unless it fails.
  • The friendly name is used in the log output, so make it something you’ll recognize.

Once that launches successfully, the hard part is done: we now have a container on top of a custom VM. Now we just add nodes, which are also provided from the same source. Of course, now we’re moving from a single-machine setup to a multi-machine setup, so we use the multi-machine namespace tools Vagrant provides. We also should probably open port 4444 so that we can actually connect to the grid from our proper host machine.

# Parallelism will damage the links 
ENV['VAGRANT_NO_PARALLEL'] = 'yes'

# All Vagrant configuration is done below. The "2" in Vagrant.configure
# configures the configuration version (we support older styles for
# backwards compatibility). Please don't change it unless you know what
# you're doing.
Vagrant.configure("2") do |config|
    # The most common configuration options are documented and commented below.
    # For a complete reference, please see the online documentation at
    # https://docs.vagrantup.com.

    # Skip checking for an updated Vagrant box
    config.vm.box_check_update = false

    # Always use Vagrant's default insecure key
    config.ssh.insert_key = false

    # Disable synced folders (prevents an NFS error on "vagrant up")
    config.vm.synced_folder ".", "/vagrant", disabled: true

    config.vm.define "hub" do |hub|
        # Configure the Docker provider for Vagrant
        hub.vm.provider "docker" do |docker|

            # Define the location of the Vagrantfile for the host VM
            # Comment out this line to use default host VM
            docker.vagrant_vagrantfile = "dockerHost/Vagrantfile"

            # Specify the Docker image to use
            docker.image = "selenium/hub"

            # Specify port mappings
            # If omitted, no ports are mapped!
            docker.ports = ['4444:4444']

            # Specify a friendly name for the Docker container
            docker.name = 'selenium-hub'
        end
    end

    #We can parallel now
    ENV['VAGRANT_NO_PARALLEL'] = 'no'
    config.vm.define "chrome" do |chrome|
        # Configure the Docker provider for Vagrant
        chrome.vm.provider "docker" do |docker|

            # Define the location of the Vagrantfile for the host VM
            # Comment out this line to use default host VM that is
            # based on boot2docker
            docker.vagrant_vagrantfile = "dockerHost/Vagrantfile"

            # Specify the Docker image to use
            docker.image = "selenium/node-chrome:2.53.0"

            # Specify a friendly name for the Docker container
            docker.name = 'selenium-chrome'

            docker.link('selenium-hub:hub')
        end
    end

    config.vm.define "firefox" do |firefox|
        # Configure the Docker provider for Vagrant
        firefox.vm.provider "docker" do |docker|

            # Define the location of the Vagrantfile for the host VM
            # Comment out this line to use default host VM that is
            # based on boot2docker
            docker.vagrant_vagrantfile = "dockerHost/Vagrantfile"

            # Specify the Docker image to use
            docker.image = "selenium/node-firefox"

            # Specify a friendly name for the Docker container
            docker.name = 'selenium-firefox'

            docker.link('selenium-hub:hub')
        end
    end
end

Some things to note:

  • We use docker.link to link the nodes to the hub. This is a very Dockery thing, so I’m not entirely sure of the implications yet, but essentially, it pokes a bit of a hole in the container walls so that the processes in one container can see another container. This link creates our little network of grid nodes, allowing the nodes to register with the hub.
  • We can’t create the hub and the nodes in parallel, because the nodes need to link to the hub and the hub may not be started yet when they try to register. I tried to turn parallel back on after the hub was created but I don’t think it actually works. Oh well. Maybe move the hub to its own machine that’s always up and only control the nodes with Docker?
  • You can pin to a specific version of the container, as I did for chrome, or you can leave it at the latest, as I did for firefox. There’s no reason I did them both differently except that I was testing out options to become more comfortable with the setup.

If you only need to test Chrome and Firefox, you can easily see how you can set up a small grid network this way. If you use vagrant heavily already with a cloud or private-cloud setup, you can just plug and play, replacing the virtualbox stuff with your provider of choice.

What about testing IE? Well, I started to put something together with the modern.ie VMs, as separate VMs that would be launched alongside the Docker provider and plug back into the hub, but ultimately I abandoned that course of action. We wouldn’t use Vagrant for that task in a real setup; we’d just have a permanent VM set up to multithread requests for testing IE.

Instead, what interested me more was Android testing with Selendroid. There was no Docker image for Selendroid, however… yet. Docker as a provider also lets you build your own custom image, so that’s what I set out to do. Unfortunately, that doesn’t work yet. To Be Continued!