Dockerization pt 3: Deploybot

One of the key components to making a good integration between products is understanding the mental model of each product. What one product calls a “counter” another could call a “metric” or a “stat”, for example; or worse, one product could be reporting, say, the amount of free memory, while another is reporting the percentage.

The same goes for integration points between teams. When I built our previous release pipeline, I discovered very quickly that while developers were comfortable talking about repositories, operations thought in terms of applications, and neither knew nor cared how many moving parts went into a single app so long as it was all on a server together. This resulted in a system in which devs had to create a long, complex manifest explaining to ops exactly what code the night shift should deploy and when, and a number of errors came out of that process.

I knew if I had per-container deploy buttons in production, like I did for pre-production, we’d have similar issues. Developers want to push single containers out into test environments, but operations wants to deploy a single application, with as little fuss as possible, and a quick painless rollback if they notice problems. So when it came time to move to containers in production, I designed them a custom application to do just that. I called it Deploybot.

Rancher ships with a catalog of common apps that you might want to deploy in your container environment. It also lets you make your own catalogs. A catalog entry consists of a docker-compose file and a rancher-compose file; the former is a standard format for spinning up a set of containers that depend on each other as a single unit, while the latter is a Rancher-specific extension that attaches extra metadata to the containers once they’re spun up. In the GUI, the catalog makes it really simple to deploy a whole application at once: just a few clicks and you can have a running app within seconds. Furthermore, when a new version is released to the catalog, it’s only two clicks to upgrade to it, which updates only those containers that have changed.

Not only does this let us quickly upgrade, it keeps a history of everything that’s ever been deployed. This means we can easily spin up a version in develop that is identical to this week’s production environment — or last week’s, or next week’s. Furthermore, we can roll back easily to a given deployment in the event of an emergency, and we have a record in the UI of every deploy that’s been done.

I determined quickly that the API would let me upgrade from the catalog if a new version had been pushed. From there, my biggest problem was logistics. Let’s say an app has 4 containers in it. At the time, we had 4 build plans, each of which built a single container and pushed it to Artifactory. When deploying one container at a time, we could easily just use the Rancher API to push that container to our test environment. But for a catalog update, things were more complicated.

Step 1: Mark for Release

The first problem: how do we determine if a container has passed testing when the final phase of our testing is still manual? Artifactory has a system of properties on a given artifact; you can do a search using the API for only containers that have a given property. So I added a deploy target called “Mark for Release” after our development and test targets; when a given asset is ready to go to production, the lead developer for the team “deploys” to that “target”. Because I’m a masochist, I did this in node.js:

addLabel (args) { // eslint-disable complexity no-console
    if (!args || !args.container || !args.version || !args.label || typeof args.value === 'undefined') {
        return Promise.reject(new Error('Incorrect parameters: please supply container, version, label, and value'));
    }
    console.log(`Adding label ${args.label}=${args.value} to ${args.container}:${args.version}`);

    const {container, version, label, value} = args;

    return request({
        method: 'PUT',
        uri: `${opts.uri}/artifactory/api/storage/docker-local/${container}/${version}?properties=${label}=${value}`,
        auth: {
            user: opts.user,
            pass: opts.password,
            sendImmediately: true
        }
    });
}

This is just a wrapper around a PUT request to artifactory; docker-local is our local docker repository, as you can tell by the oh-so-creative name. Later logic will pull the list of items with that label and find the highest-numbered container that is ready to be released. At first, this worked great… right up until someone decided they wanted to fall back to an older version of a container and we had to manually remove the label from the newer container. Then I added guard conditions:

markForRelease (artif, args) {
        return artif.addLabel({
            container: args.container,
            version: args.ver,
            label: 'release',
            value: 'true'
        }).then(() => artif.getAllItemsByLabel('release', 'true'))
            .then((items) => artif.filterDockerResults(items))
            .then((containers) => {
                const promises = containers.map((item) => {
                    const [, name, version] = artif.parseVersion(item);

                    // Guard: only replace the current container, with versions higher than this one
                    if (name !== args.container) return Promise.resolve();
                    if (Number(version) <= Number(args.ver)) return Promise.resolve();

                    return artif.addLabel({
                        container: args.container,
                        version: item.substring(item.lastIndexOf('/') + 1),
                        label: 'release',
                        value: 'false'
                    });
                });

                return Promise.all(promises);
            })
            .catch((err) => {
                console.error(err);
                process.exit(1);
            });
    }

After adding the label, the function fetches all the items that carry it. (filterDockerResults removes anything that’s not a container, because the layers of a container are also labeled when you label the container itself.) At that point, anything with a higher version number is un-marked, so the highest-numbered marked container is always the right one to deploy.

Step 2: Update catalog

The next problem was how to get the right version numbers and edit the catalog. To do that, I had to take a deep dive into Rancher’s catalog format.

Any given catalog is a git repository. Inside the repository there are a series of folders, one for each application. Each of those folders contains a couple of descriptive files and another series of folders, one for each version. In the version folder you have two files: a docker-compose.yml and a rancher-compose.yml. When you make changes, you are meant to put out a new folder with a new version number so that people can pull in the update at their leisure. When Rancher detects that there is a newer version than the one you are on, it displays a little icon in the UI to indicate that an upgrade is available.

What I needed was a way to take an existing configuration file, change the numbers of the specific containers, and make a new version with that updated information. I could have used Regular Expressions to do the parsing, but that just smelled like the wrong solution. So instead, I added a folder called “_template” to the list of versions. Rancher will ignore this, because it’s not a numbered version; furthermore, the files inside are “docker-compose.hbs” and “rancher-compose.hbs”, which Rancher does not think it can parse. However, my Node.JS application can read these in as Handlebars templates.
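Put together, an application’s folder in the catalog repository ends up looking something like this; the app name, version numbers, and descriptive files here are just for illustration:

my-app/
  config.yml
  0/
    docker-compose.yml
    rancher-compose.yml
  1/
    docker-compose.yml
    rancher-compose.yml
  _template/
    docker-compose.hbs
    rancher-compose.hbs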

This means I can write a template like this:

version: '2'
services:
  nginx:
    image: artifactory:5000/nginx-fram:{{nginx-fram}}
    stdin_open: true
    volumes:
    - /var/log:/var/log
    tty: true
    labels:
      io.rancher.container.pull_image: always
      io.rancher.scheduler.affinity:container_label_soft_ne: io.rancher.stack_service.name=$${stack_name}/$${service_name}
      io.rancher.container.hostname_override: container_name
  fram:
    image: artifactory:5000/fram:{{fram}}
    stdin_open: true
    tty: true
    labels:
      io.rancher.container.pull_image: always
      io.rancher.scheduler.affinity:container_label_soft_ne: io.rancher.stack_service.name=$${stack_name}/$${service_name}
      io.rancher.container.hostname_override: container_name

You see the double curly braces? Those will be replaced by the templating engine with the correct version number for each container name. This is easily accomplished: I hydrate the template with an object whose keys are container names and whose values are version numbers pulled from the Artifactory API, using the functions below:

        getAllItemsByLabel (label, value) {
            return request(
                `${opts.uri}/artifactory/api/search/prop?${label}=${value}&repos=${opts.repository}`,
                {
                    auth: {
                        user: opts.user,
                        pass: opts.password,
                        sendImmediately: true
                    }
                }
            ).then((response) => JSON.parse(response).results);
        },
        filterDockerResults (containers) {
            return new Promise((resolve) => {
                const uris = containers.map((item) => item.uri);

                resolve(uris.filter((item) => item.indexOf('sha256') === -1 && item.indexOf('manifest.json') === -1));
            });
        },
        getVersions (urls) {
            return new Promise((resolve, reject) => {
                if (!urls) {
                    return reject(new Error('Could not get container versions from Artifactory'));
                }
                const retVal = {};

                urls.forEach((url) => {
                    const [, name, version] = artifactory.parseVersion(url);

                    if (retVal[name] && Number(retVal[name]) >= Number(version)) {
                        return;
                    }
                    retVal[name] = version;
                });
                return resolve(retVal);
            });
        },

When I compose these three functions together, they get all the containers that have been marked for release, remove anything that isn’t a real container, and return only the highest numbered one of each. This makes a neat little object to pass into the Handlebars engine. I then take the hydrated template, write it out to disk as a new catalog version, commit, and push. Voila!
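As a rough sketch, the composition looks something like this; the function names are the ones shown above, but the wiring and the template path are assumptions for illustration:

const fs = require('fs');
const Handlebars = require('handlebars');

// Collect the highest released version of each container, then hydrate the
// docker-compose template with that map. (Error handling omitted for brevity.)
function buildComposeForRelease (artif, templatePath) {
    return artif.getAllItemsByLabel('release', 'true')
        .then((items) => artif.filterDockerResults(items))
        .then((uris) => artif.getVersions(uris))
        .then((versions) => {
            // versions looks like { 'nginx-fram': '112', 'fram': '87' }
            const source = fs.readFileSync(templatePath, 'utf8');
            const template = Handlebars.compile(source);

            return template(versions); // the new docker-compose.yml contents
        });
}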

Step 3: Update Rancher

Now that there’s a new catalog version, I have to tell Rancher to update the stack to that version. First, I tell it to poll the catalog for updates. I do so with this neat little function (note the bonus fun comment):

refreshCatalog: function(environment) {
    log('Refreshing Rancher catalog');

    //This API is synch, so it won't return until the catalog is refreshed
    //...in theory
    //I'm pretty sure the above is a lie. A vicious, vicious lie. 
    //But it is the lie told to me by my ancestors, so I will repeat it.
    return request({
        uri: `${opts.url}/v1-catalog/templates?action=refresh`,
        auth: {
            username: opts.auth[environment].key,
            password: opts.auth[environment].secret
        },
        json: true// Automatically stringifies the body to JSON
    });
},

Since the function was definitely returning well before the version showed up in the catalog, I wrote a wait for this particular step:

waitForCatalog: function(catalogId, version, environment, stackName) {
    update('Waiting for catalog to show up', 'info', stackName, environment);
    let that = this;
    return new Promise((resolve, reject) => {
        //Wait for the service to finish upgrading
        let retries = opts.retries.catalog;
        let secondsBetweenRetries = 3; 
        function checkCatalog() {
            that.refreshCatalog(environment)
            .then(() => request({
                method: 'GET',
                uri: `${opts.url}/v1-catalog/templates/DealerTire:${catalogId}`,
                auth: {
                    username: opts.auth[environment].key,
                    password: opts.auth[environment].secret
                },
                json: true
            })).then((body) => {
                let versionList = body.versionLinks;

                if (versionList[version]) {
                    log('Found version', 'info', stackName);
                    log('Catalog update took ' + (opts.retries.catalog - retries) * secondsBetweenRetries + ' seconds.');
                    return resolve();
                } else {
                    retries--;
                    log(`${retries} retries remaining`);
                    if (retries < 0) {
                        log('Did not find version in catalog. Check Bitbucket?');
                        return reject(new Error('Timed out waiting for version to appear in catalog.'));
                    }
                    setTimeout(checkCatalog, secondsBetweenRetries * 1000);
                }
            });
        }
        setTimeout(checkCatalog, 500);
    });
},

This is basically the same as the waits I mentioned in part 2, of course, just waiting for the version to show up in the catalog instead of waiting for the service upgrade to finish.

Once the catalog entry is showing up, I can submit a stack-level “upgrade” action. Since I knew I’d also have to do a “complete” action and a “rollback” action, I went ahead and genericized this function as well:

performStackAction: function (stackId, environment, action, body = {}) {
    return request({
        method: 'POST',
        uri: `${opts.url}/v2-beta/projects/${opts.projectIDs[environment]}/stacks/${stackId}?action=${action}`,
        body: body,
        auth: {
            username: opts.auth[environment].key,
            password: opts.auth[environment].secret
        },
        json: true
    });
},

Now, in the case of an upgrade action specifically, you need to pass the compose files into the API. I’m not sure what happens if they disagree with what’s in the catalog; the catalog step may not be strictly necessary if the passed-in configs take precedence. That said, we knew we wanted the record in the catalog, and it’s easy enough to hand the API the same data we just wrote to disk a moment ago.

performStackUpgrade: function (stackName, environment, newnum, composes) {
    update(`Upgrading ${stackName} in ${environment} to version ${newnum}`, 'info', stackName, environment)
    return findStack(stackName, environment).then((body) => {
        if (body.data.length <= 0) {
            throw new Error(`Could not find stack ${stackName} in ${environment}`);
        }
        let stackId = body.data[0].id;
        let catalogId = getCatalogId(stackName);
        return this.performStackAction(stackId, environment, 'upgrade', {
            dockerCompose: composes.docker,
            rancherCompose: composes.rancher,
            externalId: `catalog://Dealertire:${catalogId}:${newnum}`
        }).then(() => stackId);
    });
},

The “externalId” key is what’s telling it that we’re doing a catalog upgrade. However, passing in blank composes results in nothing changing; it won’t read them out of the catalog to determine what to change. It’d be great if it did, though, hint hint.

We then wait for upgrade in much the same way we did when updating a single service, then report success back to the user. The overall sequence thus looks something like this:

return updStack(`Upgrading stack ${stack}...`, 'pending')
.then(() => gitLib.clone())
.then(() => gitLib.config())
.then(() => updStack('Creating new version...', 'pending'))
.then(() => artLib.getContainers())
.then((urls) => artLib.getVersions(urls))
.then((versions) => gitLib.update(catalogID, versions))
.then(retval => {releaseData = retval})
.then(() => gitLib.commit(`Release for ${stack} on ${releaseData.internalID}`, catalogID, releaseData.folder))
.then(() => updStack(`Created version ${releaseData.id}. Waiting for update...`, 'pending'))
.then(() => ranLib.refreshCatalog(env))
.then(() => ranLib.waitForCatalog(catalogID, releaseData.id, env))
.then(() => updStack('Performing upgrade...', 'pending'))
.then(() => ranLib.performStackUpgrade(stack, env, releaseData.folder, releaseData))
.then((id) => releaseData.stackID = id)
.then(() => updStack('Waiting for upgrade to complete...', 'pending'))
.then(() => ranLib.waitForActionCompleteStack(releaseData.stackID, env, 'upgraded', stack.display))
.then(() => updStack('Waiting for health checks...', 'pending'))
.then(() => ranLib.waitForHealthCheckStack(releaseData.stackID, env, stack.display))

“updStack” here is my function to send a message back through the websocket to the client; gitLib is my git library, artLib is the Artifactory library, and ranLib is the Rancher library. You have to wait for health checks here or you’ll run into issues if you complete the upgrade right away, but again, that’s pretty straightforward.
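For completeness, the complete and rollback buttons described in the next section go through the same generic stack-action helper. A minimal sketch, with the caveat that the wrapper names are mine and “finishupgrade” is what the Rancher API calls the action the UI labels “complete”:

completeStackUpgrade: function (stackId, environment) {
    // Removes the old containers once ops is happy with the new ones
    return this.performStackAction(stackId, environment, 'finishupgrade');
},

rollbackStack: function (stackId, environment) {
    // Puts the old containers back and discards the new ones
    return this.performStackAction(stackId, environment, 'rollback');
},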

Deploybot

The interface to my final product, therefore, is really straightforward: a password prompt with AD authentication so only our Operations team can deploy, then three buttons per application. This screenshot was taken during the initial rollout; we have dozens of applications now, each with their own set of three buttons. Ops will do an upgrade, then leave the stack in the upgraded state while they confirm that the application is running correctly; when that’s done, they can complete or roll back, according to the test result. All the functionality for rollback and complete comes from Rancher; I’ve just exposed it here in a series of handy buttons to avoid them having to dig through the UI in an emergency or at 3 in the morning.

Is it overcomplicated? Yeah, totally. There’s a lot I plan to change in version 2.0, which will probably need to be tweaked when we upgrade Rancher and move to a Kubernetes-backed system. But it gets the job done, and after the first few weeks of finding edge cases and smoothing them out, it works like a charm. Sometimes that’s what you’re looking for in a devops world: a reliable, straightforward tool that gets the job done so you can get back to releasing new functionality.

Dockerization part 2: Deploying

Now that we have containers, we need to push them to our subprod environments so they can be tested. Bear with me, this is where things get a little complicated.

Docker Setup

Most people take the easy way out when they move to docker: they ship their containers to the cloud and let someone else manage the installation, upgrades, and maintenance on the docker hosts. We don’t do things the easy way around these parts, though, so we have our own server farm: a series of VMs in our datacenter. Everything below the VM is maintained by another team; my team is responsible for the software layer of the VM, and the containers that run on top. We have a handful of servers in our sub-prod environments, and then a handful more in our various production DMZs.

For management, most people seem to choose Kubernetes, but again, we don’t do things the easy way around here, so we went with a less popular product called Rancher. Now, Rancher is a management interface that can sit on top of a number of underlying technologies, including Kubernetes, but we chose to use their house-brand management system, called Cattle, instead. They were nice enough to give us a bunch of training in Docker, including the advice that forms the basis for their theme: if servers were pets, carefully maintained and fed over the years, containers should be like cattle, slaughtered and replaced as soon as they seem to be ill so they don’t infect the whole herd.

Rancher is a really great tool if you’re working in the GUI. It has the concept of an Environment (which we use to separate dev from QA from demo), which spans one or more Hosts (the servers that run Docker and manage the containers). Inside the Environment are Stacks, which are named collections of related containers. It also handles a lot of the networking between containers: it comes with its own DNS for the internal container network, so you can just resolve Stackname/ContainerName to find a given container in your Environment. You can upload a docker-compose.yml file to create a stack if you’re using Compose, and the extra metadata Rancher uses can be stored in a rancher-compose.yml that can also be uploaded when you make a stack.

Rancher running on my local machine, showing a project I have in progress

Deployment

Manual deployment is super easy in Rancher: create a new stack, add services, paste in the container name from our build step, and let it handle everything else. Moving between environments manually once it works in dev is also easy: download the compose files, then upload them into the next environment. But we’re doing CI/CD, and the developers are constantly asking how they can speed up their release schedule. How do we do this automatically?

There are two tools that come with Rancher that can help here. One is the extensive API; pretty much everything you can do in the GUI can be done via the JSON-based REST API. The other is the pair of command-line tools they produce: rancher-compose and the Rancher CLI. Since I was also trying to release quickly, I used the API for my initial round of deployment scripts; in a later post, I’ll talk through how I’ve begun to convert to the CLI commands instead, as I feel they’re faster and cleaner.

For Bamboo, I needed something that could run in a Deploy Project that would update the stack in a given environment. I decided to write a Node.JS script, because when all I have is a node-shaped hammer every build script becomes a nail 😉 (Actually, it was so our Node developers could read the script themselves). I didn’t do much special here, just your standard API integration using a promise-based architecture; however, this is a chunk of a bigger library I decided to write around Rancher, so you’ll see a lot of config options:

function findStack(stackName, environment) {
    return request({
        uri: `${opts.url}/v2-beta/projects/${opts.projectIDs[environment]}/stacks?name=${stackName}`,
        auth: {
            username: opts.auth[environment].key,
            password: opts.auth[environment].secret
        },
        json: true// Automatically stringifies the body to JSON
    });
}

function getContainerInfo(environment, stackName, containerName) {
    log(`Getting container info for ${containerName}`);
    return findStack(stackName, environment)
    .then((body) => request({
        method: 'GET',
        uri: `${opts.url}/v1/services/?environmentId=${body.data[0].id}&name=${containerName}`,
        auth: {
            username: opts.auth[environment].key,
            password: opts.auth[environment].secret
        },
        json: true // Automatically stringifies the body to JSON
    }));
}

function performAction(serviceId, action, environment, launchConfig, stackName) {
    update(`Performing action ${action} on service ${serviceId}`, 'info', stackName);
    return request({
        method: 'POST',
        uri: `${opts.url}/v1/services/${serviceId}/?action=${action}`,
        body: {
            'inServiceStrategy': {
                'batchSize': 1,
                'intervalMillis': 2000,
                'startFirst': true,
                'launchConfig': launchConfig
            }
        },
        auth: {
            username: opts.auth[environment].key,
            password: opts.auth[environment].secret
        },
        json: true
    });
}

    performServiceUpgrade: function (stackName, containerName, environment, image) {
        update(`Upgrading ${containerName} in stack ${stackName} in ${environment} to image ${image}`, 'info', stackName, environment)
        return getContainerInfo(environment, stackName, containerName).then((body) => {
            if (body.data.length <= 0) {
                throw new Error(`Could not find service ${containerName} in stack ${stackName} in ${environment}`);
            }
            let serviceId = body.data[0].id;
            let launchConfig = body.data[0].launchConfig;
            launchConfig.imageUuid = image;

            return performAction(serviceId, 'upgrade', environment, launchConfig, stackName)
            .catch((err) => {
                if ((err.statusCode == 422 || err.status == 422) && opts.retries.on422) {
                    log('Detected invalid state. Rolling back to retry.', 'info', stackName)
                    return performAction(serviceId, 'rollback', environment, launchConfig, stackName)
                        .then(() => this.waitForActionComplete(stackName, containerName, environment, 'active', stackName))
                        .then(() => performAction(serviceId, 'upgrade', environment, launchConfig, stackName));
                } else {
                    log('Detected error condition. Aborting', 'error', stackName)
                    throw err;
                }
            })
            .then(() => this.waitForActionComplete(stackName, containerName, environment, 'upgraded', stackName))
            .then(() => serviceId);
        });
    }

The two lines that grab the launchConfig and overwrite imageUuid deserve a closer look, however, because they are a little strange. Rancher lets you update anything about a service using the same endpoint, which is kind of nice and kind of rough: I need to specify every single attribute of the service, or it’ll assume I meant to blank out the setting (rather than assuming I meant to leave it unchanged). To make this easier, I captured the existing launch configuration, swapped in the new image, and sent it back.

Upgrading a service in Rancher is a two-step process: first, you upgrade, which launches a copy of the new container for every copy of the existing container, and then you “finish” the upgrade, which removes the old containers. This is so that if there’s a problem with the new container, you can issue a “rollback” action, which turns the old containers back on and removes the new ones — much faster than trying to pull a fresh copy of the old container back. However, this means sometimes you’ll be trying to upgrade while it’s in an “upgraded” state, waiting for you to finish or roll back. When that happens, Rancher issues a status code 422. My library optionally rolls back and issues the upgrade action again if it encounters this state.
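The finish step isn’t shown above, but it’s just another POST in the same shape as performAction. A minimal sketch, assuming Rancher’s “finishupgrade” action name for what the UI presents as finishing the upgrade:

function finishUpgrade(serviceId, environment, stackName) {
    update(`Finishing upgrade on service ${serviceId}`, 'info', stackName);
    // Removes the old containers; only valid while the service is in the
    // 'upgraded' state.
    return request({
        method: 'POST',
        uri: `${opts.url}/v1/services/${serviceId}/?action=finishupgrade`,
        auth: {
            username: opts.auth[environment].key,
            password: opts.auth[environment].secret
        },
        json: true
    });
}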

The hardest part was figuring out when Rancher was done upgrading. Some of our images are huge, particularly the ones that contain monoliths we’re still in the process of breaking up; it can take several minutes for these containers to download and start up. Eventually, I settled on a polling-based strategy:

waitForActionComplete: function(stackName, containerName, environment, desiredState) {
    update('Waiting for upgrade to complete', 'info', stackName, environment);
    return new Promise((resolve, reject) => {
        //Wait for the service to finish upgrading
        let retries = opts.retries.actionComplete;
        function checkState() {
            getContainerInfo(environment, stackName, containerName).then((body) => {
                let container = body.data[0];
                log('Current state: ' + container.state);

                //Check if upgrade is done
                if (container.state == desiredState) {
                    log('Action complete');
                    return resolve();
                } else {
                    retries--;
                    if (retries < 0) {
                        return reject(new Error('Timed out waiting for action to complete'));
                    }
                    log(`${retries} left, running again`);
                    return setTimeout(checkState, 1000);
                }
            });
        }
        setTimeout(checkState, 500);
    });
}

This will keep running the checkState function until either the container’s state enters the desired state, or it runs out of retries (configured in the config for the library). I’ve had to tune the number of retries several times; right now, for our production deploy, it’s something outrageous like 600.
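For reference, the options object the library reads looks roughly like this; the shape is pieced together from the calls above, and every value here is a placeholder rather than our real config:

const opts = {
    url: 'https://rancher.example.internal:8080',
    projectIDs: {
        // Rancher environment (project) IDs, one per environment name
        dev: '1a5',
        qa: '1a7',
        production: '1a9'
    },
    auth: {
        // API key pairs, one per environment
        dev: {key: 'ACCESS_KEY', secret: 'SECRET_KEY'}
    },
    retries: {
        actionComplete: 600, // polls of checkState before giving up
        on422: true          // roll back and retry when an upgrade hits a 422
    }
};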

This library is called from a simple wrapper for Bamboo’s sub-prod deploys; for production, however, I got a lot trickier. Stay tuned for that write-up next week!

Dockerization Part 1: Building

I’ve been long overdue for a series of articles explaining how our current build system works. One of the major projects I was involved with before this recent reorg involved overhauling our manual build process into a shiny new CI/CD system that would take the code from commit to production in a regulated, automated fashion. As always, the reward for doing a good job is more work like that; when we decided to move to Docker to better support our new team structure, I ended up doing a lot of the foundational work on our new build-test-deliver pipeline. Part one of that pipeline is, of course, building and storing containers.

Your mission, if you choose to accept it

In the old world, before we dockerized our applications, we were following a fairly typical system (that I designed): our CI server runs tests against the code, then bundles it up as an archive file. After that, one environment at a time and on request, it would SCP the tarball down to the server, stop the running process, remove the old codebase, and unpack the new before starting the process again. There were configuration files that had to be saved off and moved back in afterward in a few cases, but we had all those edge cases ironed out. It was working, and there were almost no changes to it in the year before we launched docker.

As we were preparing to go live, I didn’t want to lose the build pipelines we had worked so hard on. And yet, docker containers are fundamentally different than tarballs of code files. Furthermore, our operators (who are responsible for putting code into production) complained of having too many buttons to click: often, our servers had 3-4 codebases on them, meaning 3-4 buttons to click to update one server. They definitely didn’t want to do one button per container. On the other hand, our developers were clear on what they wanted: more deploys, faster deploys, and breaking out their monoliths into modules and microservices so they could go even faster. How to balance these concerns?

Another wrinkle emerged as well once I got my hands on our environment: we chose Rancher as our docker management tool of choice. Rancher is a great little tool, and I enjoy working with its GUI, but when most companies seem to be standardizing on Kubernetes, it was hard to find good examples and tutorials for how to work with Rancher instead.

With all those pressures bearing down on me, my task was straightforward, but far from simple.

How to build a container in 30 days

The promise of containers seemed like it resolved a lot of our headaches overall: developers control the interior of the container, and Platform Ops controls the outside of it. In this brave new world, I don’t have to care what goes in a container, but it’s my job to ensure they get to where they’re going every time without fail. In practice, however, I found I need to understand quite a bit about containers themselves.

For the purposes of this article, you don’t need to know or care about the virtualization layer; just trust that a container is isolated from everything around it, until and unless you drill holes in it (which we do. A lot. But I understand that’s common). You will need to know a little about how they’re built, however.

Picture a repository of source code. At some point, to dockerize the application contained within, you need a Dockerfile: a file of instructions on how to build the container. Almost every Dockerfile begins with an instruction to extend from another image, much like a class extending from a base class. This was really handy for us, since it means we can put anything we need into a custom base image and all the developers will have it pre-installed.

From there, there’s a series of customizations to the container. Generally, one step involves copying the code into the container, and another tells the container what executable to run when it starts. For Node.js, we ask our developers to put their code in a standard location, then execute “npm start” when the container boots up, letting them define what that means for their application.

Once you’re happy with what the container contains, it’s time to seal it up and ship it. In this case, that means two commands: a “tag” command, which gives it a name more interesting than the default (which will be something like 2b9c0185251d), and a “push” command, which uploads the docker container to a remote repository. If the container is intended to live in a central repository, it has to be tagged with that repository as part of the name (including a port number, which usually defaults to 5000 for a Docker registry unless you put an Nginx in front to make it 80): something like “artifactory.internal:5000/dt-node-base”. Appended to that is a version: this can be a sequential number, or a word or anything else. By convention, each container is tagged twice: once with a sequential number, and once with the word “latest”. That makes it so you can always pull down the very latest node base container from our Artifactory repository by asking it for “artifactory.internal:5000/dt-node-base:latest”.

The system

So we have a number of parts to this build system that the CI/CD server has to integrate with. The first piece is to begin with raw source code, including a Dockerfile; we had been using Subversion, but the developers had been asking for Git for so long we finally broke down and bought a Bitbucket server and let them migrate.

The next piece is to build the containers with Docker. Since we were using Bamboo as our CI/CD server, I installed Docker on all the remote agents; this required an OS upgrade for them to Red Hat 7, but I was able to script the install using Ansible to make doing it across our whole system less painful.

The next piece is somewhere to store the containers when we’re done with them. As you can guess by the previous example, we decided to use Artifactory for this; this is mostly because, as the developers moved to Node, they were asking for a private NPM server, and Artifactory is able to do double duty and hold both types of artifacts.

For the communication between them, my coworker put together a script we could put on each build server that the plans could use to ensure they didn’t miss any steps. It’s straightforward, looking something like this:

#!/bin/sh -e
# $1 Project Name (dt-nodejs)

docker build -t artifactory.internal:5000/$1:$bamboo_buildNumber \
 -t artifactory.internal:5000/$1:latest .

docker push artifactory.internal:5000/$1:$bamboo_buildNumber
docker push artifactory.internal:5000/$1:latest
echo "$1:$bamboo_buildNumber and $1:latest pushed to Artifactory on artifactory.internal:5000"

This means that every build tags the container with the number of the build, giving us an easy source of sequential numbers for the containers without thinking about it. It does mean, however, that building a new pipeline for an existing container name will start the numbering over from 1 and overwrite old containers, but we encourage developers to edit their build plans instead of starting over where possible. If you have any ideas on how to prevent that, I’d love to hear them.

(I’ve actually enhanced this script since, but I’ll talk about that in a future entry)

 

Selenium Grid on Docker and Vagrant: Part 2

Last time we got Vagrant configured to run a single VM with three docker containers: a Selenium Grid hub, a Chrome node, and a Firefox node. This is a good start, but I wanted to configure a Selendroid node to round out the browser selection. That’s when things got a little… messy.

So upon investigation into how the Docker images I was already using were constructed, I discovered a few key points:

  • Docker images are defined by Dockerfiles in the same way Vagrant VMs are defined by Vagrantfiles. The format and syntax are totally different, but both are more-or-less human-readable flatfiles that explain how to set up a system. So far so good.
  • Dockerfiles are nestable, and in fact, are often nested. The ones from Selenium HQ have a clear hierarchy. This pleased me, because I figured it gave me a nice stable base to work on: my file, like the Chrome and Firefox files, would inherit from the node-base image, but with tweaks specific to Selendroid.

So here’s the top of my Dockerfile:

FROM selenium/node-base:2.53.0
MAINTAINER Bgreen <[redacted]>

USER root

I cracked open the Chrome dockerfile and the Dockerfile reference guide and got reading. It looked pretty straightforward at first: just write a bash script, but stick “RUN” in front of each line. Spoiler alert: as I started working on my own script, I learned that this was entirely the wrong way to go about writing a dockerfile. Docker has a lot of useful commands other than “RUN”, and it wasn’t long before I was breaking apart my scripts, learning how to put the dockerfile together properly.

Looking at the Selendroid grid instructions and the Selendroid getting started guide, there were three major steps:

  • Install the JDK
  • Install the Android SDK
  • Install Selendroid and start it in grid mode

Step 1 appeared to be done already, by virtue of the Node-Base dockerfile. This was a dangerous and ultimately wrong assumption, but it was one I worked with for over a day before I realized my mistake. It turns out, the JRE was installed in the base image… under the name openJDK. Nice.

Java installation:

#===============
# JAVA
#===============
RUN apt-get update && apt-get install -y openjdk-8-jdk
RUN ls -l /usr/lib/jvm/
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64

Now it was time to install the Android SDK. And here I ran into the first massive bunch of problems. I spent several hours fighting with the system, making small tweaks, before I realized I’d accidentally installed Android Studio instead of Android SDK and had to start over.

I originally was going to do a wget followed by a tar, but then I learned about Docker’s ADD command. The ADD command takes a file located in the same directory structure as the Dockerfile and copies it into the directory structure inside the container. If the file is a tarball, it will untar the file into a folder as it does, removing the need to write an explicit tar command — a major plus, as tar commands are always annoying to write. I chose to download the tar into the file structure to avoid the network hit and used the ENV command to set ANDROID_HOME the same way I set JAVA_HOME:

#===============
# Android SDK
#===============
ADD android-sdk_r24.4.1-linux.tgz /opt/selenium/
ENV ANDROID_HOME=/opt/selenium/android-sdk-linux
ENV PATH=${PATH}:${ANDROID_HOME}/tools:${ANDROID_HOME}/platform-tools

However, upon installation, there is no such folder as ANDROID_HOME/platform-tools. This is because it is only created once you fire up the android sdk tool and begin downloading SDKs to develop with. So I figured I’d do RUN android update sdk --no-ui. Then I learned you have to accept the license agreement. So I updated my code to the very idiomatic RUN yes | android update sdk --no-ui. And well… the results were mildly amusing, but not what I hoped for:

Do you accept the license 'google-gdk-license-35dc2951' [y/n]:
Unknown response 'y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y
y

Dear maintainers of linux software: Don’t break the ‘yes’ command! Sincerely, Bay.

Thanks to Stack Overflow, I found this:

#===============
# Android SDK
#===============
ADD android-sdk_r24.4.1-linux.tgz /opt/selenium/
ENV ANDROID_HOME=/opt/selenium/android-sdk-linux
ENV PATH=${PATH}:${ANDROID_HOME}/tools:${ANDROID_HOME}/platform-tools

#The following downloads the platform-tools folder
RUN ( sleep 5 && while [ 1 ]; do sleep 1; echo y; done ) \
    | android update sdk --no-ui --all\
 	--filter tool,platform-tools,android-23,build-tools-23.0.3

Which got us to our next section: installing Selendroid. It seemed pretty simple:

#===============
# Selendroid
#===============
ADD selendroid-standalone-0.17.0-with-dependencies.jar /opt/selenium/selendroid.jar
ADD selendroid-grid-plugin-0.17.0.jar /opt/selenium/selendroid-grid.jar

RUN java -jar /opt/selenium/selendroid.jar
RUN java -Dfile.encoding=UTF-8 -cp "/opt/selenium/selendroid-grid.jar:/opt/selenium/selendroid.jar" org.openqa.grid.selenium.GridLauncher -capabilityMatcher io.selendroid.grid.SelendroidCapabilityMatcher -role hub -host 127.0.0.1 -port 4444

But it didn’t work. And this, dear reader, is where I was stuck for hours, tearing my hair out in frustration. There were three errors. The first, it seems, is a red herring: there’s nothing actually wrong here (so why is it marked “SEVERE”? Bad usability, Selendroid!)

    android: SEVERE: Error executing command: /opt/selenium/android-sdk-linux/build-tools/23.0.3/aapt remove /tmp/android-driver7255065332626262791.apk META-INF/NDKEYSTO.RSA
    android: org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
    android:    at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:377)
    android:    at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:160)
    android:    at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:147)
    android:    at io.selendroid.standalone.io.ShellCommand.exec(ShellCommand.java:49)
    android:    at io.selendroid.standalone.android.impl.DefaultAndroidApp.deleteFileFromWithinApk(DefaultAndroidApp.java:112)
    android:    at io.selendroid.standalone.builder.SelendroidServerBuilder.deleteFileFromAppSilently(SelendroidServerBuilder.java:133)
    android:    at io.selendroid.standalone.builder.SelendroidServerBuilder.resignApp(SelendroidServerBuilder.java:148)
    android:    at io.selendroid.standalone.server.model.SelendroidStandaloneDriver.initApplicationsUnderTest(SelendroidStandaloneDriver.java:172)
    android:    at io.selendroid.standalone.server.model.SelendroidStandaloneDriver.<init>(SelendroidStandaloneDriver.java:94)
    android:    at io.selendroid.standalone.server.SelendroidStandaloneServer.initializeSelendroidServer(SelendroidStandaloneServer.java:63)
    android:    at io.selendroid.standalone.server.SelendroidStandaloneServer.<init>(SelendroidStandaloneServer.java:52)
    android:    at io.selendroid.standalone.SelendroidLauncher.launchServer(SelendroidLauncher.java:65)
    android:    at io.selendroid.standalone.SelendroidLauncher.main(SelendroidLauncher.java:117)

The second drove me nuts because it outright lied to me. The file it complained about not having was right there, with executable permissions, owned by root (which I was operating as)!

INFO: Executing shell command: /opt/selenium/android-sdk-linux/build-tools/23.0.3/aapt remove /tmp/android-driver2951817352746346830.apk META-INF/MANIFEST.MF
Jul 06, 2016 8:22:29 AM io.selendroid.standalone.io.ShellCommand exec
SEVERE: Error executing command: /opt/selenium/android-sdk-linux/build-tools/23.0.3/aapt remove /tmp/android-driver2951817352746346830.apk META-INF/MANIFEST.MF
java.io.IOException: Cannot run program "/opt/selenium/android-sdk-linux/build-tools/23.0.3/aapt" (in directory "."): error=2, No such file or directory
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
        at java.lang.Runtime.exec(Runtime.java:620)

It turns out that the “No such file or directory” was coming from inside the named executable, not referring to the executable itself: aapt needed 32-bit libraries that weren’t installed. I fixed that with RUN apt-get update && apt-get install -y lib32stdc++6 lib32z1.

Now I had a different problem involving the keytool:

Jul 07, 2016 6:17:04 AM io.selendroid.standalone.io.ShellCommand exec
INFO: Executing shell command: /usr/lib/jvm/java-8-openjdk-amd64/bin/keytool -genkey -v -keystore /home/seluser/.android/debug.keystore -storepass android -alias androiddebugkey -keypass android -dname CN=Android Debug,O=Android,C=US -storetype JKS -sigalg MD5withRSA -keyalg RSA -validity 9999
Jul 07, 2016 6:17:06 AM io.selendroid.standalone.io.ShellCommand exec
SEVERE: Error executing command: /usr/lib/jvm/java-8-openjdk-amd64/bin/keytool -genkey -v -keystore /home/seluser/.android/debug.keystore -storepass android -alias androiddebugkey -keypass android -dname CN=Android Debug,O=Android,C=US -storetype JKS -sigalg MD5withRSA -keyalg RSA -validity 9999
org.apache.commons.exec.ExecuteException: Process exited with an error: 1 (Exit value: 1)
        at org.apache.commons.exec.DefaultExecutor.executeInternal(DefaultExecutor.java:377)
        at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:160)
        at org.apache.commons.exec.DefaultExecutor.execute(DefaultExecutor.java:147)
        at io.selendroid.standalone.io.ShellCommand.exec(ShellCommand.java:49)
        at io.selendroid.standalone.builder.SelendroidServerBuilder.signTestServer(SelendroidServerBuilder.java:277)
        at io.selendroid.standalone.builder.SelendroidServerBuilder.resignApp(SelendroidServerBuilder.java:154)
        at io.selendroid.standalone.server.model.SelendroidStandaloneDriver.initApplicationsUnderTest(SelendroidStandaloneDriver.java:172)
        at io.selendroid.standalone.server.model.SelendroidStandaloneDriver.<init>(SelendroidStandaloneDriver.java:94)
        at io.selendroid.standalone.server.SelendroidStandaloneServer.initializeSelendroidServer(SelendroidStandaloneServer.java:63)
        at io.selendroid.standalone.server.SelendroidStandaloneServer.<init>(SelendroidStandaloneServer.java:52)
        at io.selendroid.standalone.SelendroidLauncher.launchServer(SelendroidLauncher.java:65)
        at io.selendroid.standalone.SelendroidLauncher.main(SelendroidLauncher.java:117)

 

This simply claims that the process exited with a failure; neither the stack trace nor the error message is useful. When I tried to execute that command myself, it complained about an invalid command-line option, which to me indicated that there needed to be quotes around the dname. I didn’t have access to change that, though I could generate my own keystore. However, I also had the third error to deal with:

SEVERE: Error building server: io.selendroid.standalone.exceptions.ShellCommandException: Error executing shell command: /usr/lib/jvm/java-8-openjdk-amd64/bin/jarsigner -sigalg MD5withRSA -digestalg SHA1 -signedjar /tmp/resigned-android-driver694668080026603748.apk -storepass android -keystore /root/.android/debug.keystore /tmp/android-driver694668080026603748.apk androiddebugkey
Exception in thread "main" java.lang.RuntimeException: io.selendroid.standalone.exceptions.ShellCommandException: Error executing shell command: /usr/lib/jvm/java-8-openjdk-amd64/bin/jarsigner -sigalg MD5withRSA -digestalg SHA1 -signedjar /tmp/resigned-android-driver694668080026603748.apk -storepass android -keystore /root/.android/debug.keystore /tmp/android-driver694668080026603748.apk androiddebugkey
        at io.selendroid.standalone.server.model.SelendroidStandaloneDriver.initApplicationsUnderTest(SelendroidStandaloneDriver.java:175)
        at io.selendroid.standalone.server.model.SelendroidStandaloneDriver.<init>(SelendroidStandaloneDriver.java:94)
        at io.selendroid.standalone.server.SelendroidStandaloneServer.initializeSelendroidServer(SelendroidStandaloneServer.java:63)
        at io.selendroid.standalone.server.SelendroidStandaloneServer.<init>(SelendroidStandaloneServer.java:52)
        at io.selendroid.standalone.SelendroidLauncher.launchServer(SelendroidLauncher.java:65)
        at io.selendroid.standalone.SelendroidLauncher.main(SelendroidLauncher.java:117)
Caused by: io.selendroid.standalone.exceptions.ShellCommandException: Error executing shell command: /usr/lib/jvm/java-8-openjdk-amd64/bin/jarsigner -sigalg MD5withRSA -digestalg SHA1 -signedjar /tmp/resigned-android-driver694668080026603748.apk -storepass android -keystore /root/.android/debug.keystore /tmp/android-driver694668080026603748.apk androiddebugkey
        at io.selendroid.standalone.io.ShellCommand.exec(ShellCommand.java:56)
        at io.selendroid.standalone.builder.SelendroidServerBuilder.signTestServer(SelendroidServerBuilder.java:296)
        at io.selendroid.standalone.builder.SelendroidServerBuilder.resignApp(SelendroidServerBuilder.java:154)
        at io.selendroid.standalone.server.model.SelendroidStandaloneDriver.initApplicationsUnderTest(SelendroidStandaloneDriver.java:172)
        ... 5 more
Caused by: io.selendroid.standalone.exceptions.ShellCommandException:

This executable was missing altogether, and rightly so: I couldn’t find it on the filesystem. And that was when I realized my “JDK” was a JRE. Installing the proper JDK, shown above, took care of both those errors. Lessons learned.

(As a side note: one thing I really like about Docker is the caching strategy. It only seemed to re-install the Android SDKs if I changed that step or an earlier one, preferring the cached version when I was working on the later steps — something that saved me a ton of time and frustration.)

So now we have a working (sort of) Dockerfile! Two problems left:

  • There are no emulators available for Selendroid. Oops!
  • The dockerfile starts Selenium Server and then hangs forever, because it doesn’t return from that. TBD

Now, normally you’d fire up the Android Studio GUI and create yourself an AVD for Selendroid to use, but I’m doing it all the hard way, via the command line in my Dockerfile. The first thing I have to do is download a system image (ABI) to make an AVD out of:

#The following downloads the platform-tools folder and the ABI
RUN ( sleep 5 && while [ 1 ]; do sleep 1; echo y; done ) \
    | android update sdk --no-ui --all\
 	--filter tool,platform-tools,android-23,sys-img-x86-android-23,build-tools-23.0.3

And then, I create an AVD out of it. Note that we are asked one question I couldn’t get rid of using command-line flags, so I used the “echo” command to send a newline and accept the default option (no hardware profile):

#Create AVD. Echo sends a newline and nothing else here, for accepting the default to the question asked.
RUN echo | android create avd --name Default --target android-23 --abi x86

Now before it hangs forever, it clearly states:

android: INFO: Shell command output
android: -->
android: Available Android Virtual Devices:
android:     Name: Default
android:     Path: /root/.android/avd/Default.avd
android:   Target: Android 6.0 (API level 23)
android:  Tag/ABI: default/x86
android:     Skin: WVGA800
android: <--
android:

Progress made!

It was then that I began to really dig into the nitty-gritty of how the base image starts the Selenium server. It seems that Selenium HQ chose to use a shell script to wrangle a series of environment variables; since they know the product better than I do, I went down the same path and created my own version of this script, modified for Selendroid:

#!/bin/bash

source /opt/bin/functions.sh

java ${JAVA_OPTS} -jar /opt/selenium/selendroid.jar -keystore /home/seluser/debug.keystore &
NODE_PID=$!

curl -H "Content-Type: application/json" -X POST --data @/opt/selenium/config.json http://$HUB_PORT_4444_TCP_ADDR:$HUB_PORT_4444_TCP_PORT/grid/register

trap shutdown SIGTERM SIGINT
wait $NODE_PID

You can see how much shorter it is; Selendroid is weird in that it doesn’t seem to take most of the config options required, and requires me to manually curl the config to the hub node.

A note: be sure to save this with unix line endings. You’ll get a very strange error reading [8] System error: no such file or directory if you do not, and that’s awful to try and figure out because it’s so generic. I also got really comfortable SSHing into the underlying VM to run “docker rm -f” at this point, because the container was building fine but erroring out, so now it had name conflicts when I tried to rebuild.

At this point, the container built, but did not run successfully. This means our debugging strategy changes from inspecting the vagrant output carefully to reading the container’s logs using “docker logs selenium-selendroid”. I now found the resurgence of several of our older problems, which was incredibly frustrating; I was sure that those had been resolved already. It was all stupid little fixes, like generating the keystore in a known location as root and then passing it into selendroid (an approach I had tried earlier but found unnecessary, and one that is already accounted for in the above shell script) or making sure to generate the AVD as the same user that would run selendroid so it ended up in the right location.

At this point we have a working Selendroid docker container! But….. it doesn’t register with the hub correctly. Also, the hub’s web console isn’t accessible, making debugging troubling. At this point, I’m taking a breather, because it’s been 3 days and I’m frustrated. We’ll return in part 3 to make this fully functional. Feel free to comment if you have tips and tricks!

Our current dockerfile:

FROM selenium/node-base:2.53.0
MAINTAINER Bgreen <[redacted email]>

USER root

#===============
# JAVA
#===============
RUN apt-get update && apt-get install -y openjdk-8-jdk
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64

#Keystore generation is broken somehow?
RUN /usr/lib/jvm/java-8-openjdk-amd64/bin/keytool -genkey -v -keystore /home/seluser/debug.keystore \
    -storepass android -alias androiddebugkey -keypass android \
    -dname CN="Android Debug,O=Android,C=US" -storetype JKS -sigalg MD5withRSA \
    -keyalg RSA -validity 9999

RUN chown seluser /home/seluser/debug.keystore

#===============
# Android SDK
#===============
ADD android-sdk_r24.4.1-linux.tgz /opt/selenium/
ENV ANDROID_HOME=/opt/selenium/android-sdk-linux
ENV PATH=${PATH}:${ANDROID_HOME}/tools:${ANDROID_HOME}/platform-tools

#The following downloads the platform-tools folder and the ABI
RUN ( sleep 5 && while [ 1 ]; do sleep 1; echo y; done ) \
    | android update sdk --no-ui --all\
 	--filter tool,platform-tools,android-23,sys-img-x86-android-23,build-tools-23.0.3

#========================
# Selenium Configuration
#========================
COPY config.json /opt/selenium/config.json

#========================
# Extra libraries
#========================
RUN apt-get update && apt-get install -y lib32stdc++6 lib32z1
RUN apt-get install -y curl

#===============
# Selendroid
#===============
ADD selendroid-standalone-0.17.0-with-dependencies.jar /opt/selenium/selendroid.jar
ADD selendroid-grid-plugin-0.17.0.jar /opt/selenium/selendroid-grid.jar

COPY startSelendroid.sh /opt/bin/
RUN chmod +x /opt/bin/startSelendroid.sh

#===============
# Start the grid
#===============
USER seluser

#Create AVD. Echo sends a newline and nothing else here, for accepting the default to the question asked.
RUN echo | android create avd --name Default --target android-23 --abi x86

CMD ["/opt/bin/startSelendroid.sh"]

 

Selenium Grid on Docker and Vagrant: Part 1

I’ve been putting together a quick proof-of-concept here at work about how we could use Docker to run a Selenium Grid. I’m not sure we’ll go that route, but I was curious how it could be done.

One of the main advantages of doing this sort of rough proof in Vagrant is that it becomes very portable. At the end of the day, I have a mini testing cloud I can run my tests against — and any member of my team can check out a few files and have their own mini testing cloud. It’s pretty neat, and it means that even if we decide against implementing this on a larger scale, I get some value out of it in years to come.

I’ll assume you’re passingly familiar with vagrant already, and have at least read the getting started docs. I was an absolute newbie to Docker when I started, so this discussion will assume no prior Docker knowledge. If you do know Docker, feel free to tell me how wrong I am in the comments section 🙂

I went down the path of using a Docker Provisioner for an hour or so before I realized that was the wrong path: I want to use the Docker Provider. The way to think of this is like a series of super tiny VMs which have to live on a giant VM in much the same way lily pads decorate the top of a pond. Docker as a provider can manage the whole set of lily pads and knows nothing about the pond; Docker as a provisioner can add a lily pad to your existing pond ecosystem without making as many waves.

So we have a secret VM, and a series of explicit Docker containers. Now, this was a proof of concept, but I actually care what OS that secret VM uses; if it’s not compatible with RHEL 6, then I won’t be able to make a good case for it in the end. Lots of shiny new toys only work on Ubuntu, after all.

Vagrant by default picks the tiniest OS it can find, just enough to support the containers on top. Usually that’s a good decision, but as we just discussed I want that secret VM to be CentOS 6 instead. This is where things get a little difficult: to specify your own VM to use, you give Vagrant another Vagrantfile.

Because Vagrantfiles need to be called “Vagrantfile”, you have to create a subfolder; mine is “dockerHost/Vagrantfile”, for lack of a better name. I also wanted to limit the amount of RAM Virtualbox would eat up, and enable networking (this will become important later). Try to think through what you’ll need, because every time you need to destroy and recreate this box, it’s going to suck and feel like it takes forever.

My dockerHost vagrantfile:

Vagrant.configure("2") do |config|
    # Every Vagrant development environment requires a box. You can search for
    # boxes at https://atlas.hashicorp.com/search.
    config.vm.box = "bento/centos-6.7"


    # Create a forwarded port mapping which allows access to a specific port
    # within the machine from a port on the host machine. In the example below,
    # accessing "localhost:8080" will access port 80 on the guest machine.
    config.vm.network "forwarded_port", guest: 80, host: 8088

    # Create a private network, which allows host-only access to the machine
    # using a specific IP.
    config.vm.network "private_network", ip: "192.168.33.10"


    # Provider-specific configuration so you can fine-tune various
    # backing providers for Vagrant. These expose provider-specific options.
    config.vm.provider "virtualbox" do |vb|
        # Customize the amount of memory on the VM:
        vb.memory = "1024"

        # enable network features
        vb.customize ["modifyvm", :id, "--natdnshostresolver1", "on"]
        vb.customize ["modifyvm", :id, "--natdnsproxy1", "on"]
    end

    # Docker provisioner will install docker when given no options
    # This prepares the box to be a base image for the docker-related script
    config.vm.provision "docker"

    # The following line terminates all ssh connections. Therefore
    # Vagrant will be forced to reconnect.
    # That's a workaround to have the docker command in the PATH
    config.vm.provision "shell", inline:
        "ps aux | grep 'sshd:' | awk '{print $2}' | xargs kill"

    # Below are to fix an issue with docker provisioning
    config.vm.provision "shell", inline: "sudo chmod 777 /var/lib/docker "

end

A few things to point out:

  • I needed to enable network features, but not right away; that’s a later addition, for things we won’t get to until part 2.
  • Docker as a provisioner comes back into the mix in a rather unintuitive way. When given no images to load or dockerfiles to build, the provisioner simply installs Docker and exits. This makes a very easy, platform-agnostic way to install Docker. I was halfway through a shell script to do the provisioning when I learned this, and frankly, I just didn’t want to bother learning how to install Docker. On the other hand, this step takes forEVER to run, so you don’t want to recreate the VM often.
  • Docker isn’t available as a command until the ssh has been kicked out and reconnected. This is probably a Vagrant bug. I found the workaround listed above and stopped looking, because I didn’t want to spend more time on this than necessary.
  • The last line isn’t needed until part 2 of this series, but if you plan to build your own docker images, you probably want it.

When this is run with “vagrant up”, it creates a VM that has Docker installed. You probably want to test this before moving on, but once you do, you won’t need to explicitly start this again.
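
A quick smoke test along those lines, run from inside the dockerHost folder (the hello-world pull assumes the VM has internet access):

cd dockerHost
vagrant up                                         # first run is slow: box download plus the Docker install
vagrant ssh -c "docker --version"                  # is the docker command on the PATH now?
vagrant ssh -c "sudo docker run --rm hello-world"  # can we actually pull and run a container?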

So let’s go to our upper-level Vagrantfile. I looked around and very quickly found some Docker images I want to use out of the box: https://github.com/SeleniumHQ/docker-selenium. The first one to get running is the hub, the central coordinator for our grid. We configure Docker like any other provider:

Vagrant.configure("2") do |config|
    # The most common configuration options are documented and commented below.
    # For a complete reference, please see the online documentation at
    # https://docs.vagrantup.com.

    # Skip checking for an updated Vagrant box
    config.vm.box_check_update = false

    # Always use Vagrant's default insecure key
    config.ssh.insert_key = false

    # Disable synced folders (prevents an NFS error on "vagrant up")
    config.vm.synced_folder ".", "/vagrant", disabled: true

    # Configure the Docker provider for Vagrant
    config.vm.provider "docker" do |docker|

        # Define the location of the Vagrantfile for the host VM
        # Comment out this line to use default host VM
        docker.vagrant_vagrantfile = "dockerHost/Vagrantfile"

        # Specify the Docker image to use
        docker.image = "selenium/hub"

        # Specify a friendly name for the Docker container
        docker.name = 'selenium-hub'
    end
end

Here we can see:

  • I’ll confess I stole that synced-folders workaround from another tutorial. It’s probably cargo-culting, since I never ran into that issue myself; but then, I’m not using shared folders here, and neither should you be. If you need shared folders, use them in the lower-level (dockerHost) Vagrantfile. If you need to move files into your container, that should be done with Docker’s native utilities for file-system manipulation, which will be covered in part 2.
  • The vagrantfile for the host VM is the vagrantfile we built above, the centOS one.
  • The image to use is just the name of the image. Much like Vagrant boxes, this searches the central registry and finds the right container image, so don’t worry about it unless it fails (see the sketch after this list for pulling the image by hand when it does).
  • The friendly name is used in the log output, so make it something you’ll recognize.
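
If the image lookup does fail, the quickest way I know to see the real error is to pull it by hand from a shell on the host VM; a rough sketch:

cd dockerHost && vagrant ssh     # assuming you brought the host VM up from that folder, as above
# then, inside the VM:
sudo docker pull selenium/hub    # the same pull the provider performs, but with the error visible
sudo docker images               # confirm the image actually landed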

Once that launches successfully, the hard part is done: we now have a container running on top of a custom VM. Now we just add nodes, which come from the same source. Of course, we’re moving from a single-machine setup to a multi-machine one, so we use Vagrant’s multi-machine support. We should also open port 4444 so that we can actually reach the grid from our real host machine.

# Parallel startup breaks the container links: nodes may try to register before the hub exists
ENV['VAGRANT_NO_PARALLEL'] = 'yes'

# All Vagrant configuration is done below. The "2" in Vagrant.configure
# configures the configuration version (we support older styles for
# backwards compatibility). Please don't change it unless you know what
# you're doing.
Vagrant.configure("2") do |config|
    # The most common configuration options are documented and commented below.
    # For a complete reference, please see the online documentation at
    # https://docs.vagrantup.com.

    # Skip checking for an updated Vagrant box
    config.vm.box_check_update = false

    # Always use Vagrant's default insecure key
    config.ssh.insert_key = false

    # Disable synced folders (prevents an NFS error on "vagrant up")
    config.vm.synced_folder ".", "/vagrant", disabled: true

    config.vm.define "hub" do |hub|
        # Configure the Docker provider for Vagrant
        hub.vm.provider "docker" do |docker|

            # Define the location of the Vagrantfile for the host VM
            # Comment out this line to use default host VM
            docker.vagrant_vagrantfile = "dockerHost/Vagrantfile"

            # Specify the Docker image to use
            docker.image = "selenium/hub"

            # Specify port mappings
            # If omitted, no ports are mapped!
            docker.ports = ['4444:4444']

            # Specify a friendly name for the Docker container
            docker.name = 'selenium-hub'
        end
    end

    # Try to re-enable parallelism now that the hub is defined
    # (this doesn't appear to take effect; see the notes below)
    ENV['VAGRANT_NO_PARALLEL'] = 'no'
    config.vm.define "chrome" do |chrome|
        # Configure the Docker provider for Vagrant
        chrome.vm.provider "docker" do |docker|

            # Define the location of the Vagrantfile for the host VM
            # Comment out this line to use default host VM that is
            # based on boot2docker
            docker.vagrant_vagrantfile = "dockerHost/Vagrantfile"

            # Specify the Docker image to use
            docker.image = "selenium/node-chrome:2.53.0"

            # Specify a friendly name for the Docker container
            docker.name = 'selenium-chrome'

            docker.link('selenium-hub:hub')
        end
    end

    config.vm.define "firefox" do |firefox|
        # Configure the Docker provider for Vagrant
        firefox.vm.provider "docker" do |docker|

            # Define the location of the Vagrantfile for the host VM
            # Comment out this line to use default host VM that is
            # based on boot2docker
            docker.vagrant_vagrantfile = "dockerHost/Vagrantfile"

            # Specify the Docker image to use
            docker.image = "selenium/node-firefox"

            # Specify a friendly name for the Docker container
            docker.name = 'selenium-firefox'

            docker.link('selenium-hub:hub')
        end
    end
end

Some things to note:

  • We use docker.link to link the nodes to the hub. This is a very Dockery thing, so I’m not entirely sure of the implications yet, but essentially, it pokes a bit of a hole in the container walls so that the processes in one container can see another container. This link creates our little network of grid nodes, allowing the nodes to register with the hub (the sketch after this list shows the native Docker equivalent).
  • We can’t create the hub and the nodes in parallel, because the nodes need to link to the hub, and the hub may not be started yet when they try to register. I tried to turn parallel back on after the hub was created, but I don’t think it actually works; most likely the whole Vagrantfile gets evaluated before any machines come up, so flipping the environment variable partway through doesn’t help. Oh well. Maybe move the hub to its own machine that’s always up and only control the nodes with Docker?
  • You can pin to a specific version of the container, as I did for chrome, or you can leave it at the latest, as I did for firefox. There’s no reason I did them both differently except that I was testing out options to become more comfortable with the setup.
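
If it helps to demystify docker.link, the native Docker equivalent of the setup above is roughly the following; it’s a sketch using the container names from the Vagrantfile, run directly on the host VM:

docker run -d --name selenium-hub -p 4444:4444 selenium/hub
docker run -d --name selenium-chrome  --link selenium-hub:hub selenium/node-chrome:2.53.0
docker run -d --name selenium-firefox --link selenium-hub:hub selenium/node-firefox

# the link gives each node a "hub" hostname plus HUB_* environment variables,
# which is how the node images find the grid and register with it:
docker exec selenium-chrome env | grep -i hub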

If you only need to test Chrome and Firefox, it’s easy to see how you could set up a small grid network this way. If you already use Vagrant heavily with a cloud or private-cloud setup, you can just plug and play, replacing the Virtualbox pieces with your provider of choice.
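
Once everything is defined, a sanity check from the real host machine looks something like this; the IP is the private_network address from dockerHost/Vagrantfile, so adjust it if yours differs:

vagrant up --no-parallel       # hub first, then the chrome and firefox nodes
# expect a 200 once the hub is listening:
curl -s -o /dev/null -w "%{http_code}\n" http://192.168.33.10:4444/grid/console
# open that console URL in a browser to see both nodes registered;
# tests then point RemoteWebDriver at http://192.168.33.10:4444/wd/hub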

What about testing IE? Well, I started to put something together with the modern.ie VMs, as separate VMs that would be launched alongside the Docker provider and plug back into the hub, but ultimately I abandoned that course of action. We wouldn’t use Vagrant for that task in a real setup; we’d just have a permanent VM set up to multithread requests for testing IE.

Instead, what interested me more was Android testing with Selendroid. There was no Docker image for Selendroid, however… yet. Docker as a provider also lets you build your own custom image, so that’s what I set out to do. Unfortunately, that doesn’t work yet. To Be Continued!

Teatime: Containers and VMs

Welcome back to Teatime! This is a weekly feature in which we sip tea and discuss some topic related to quality. Feel free to bring your tea and join in with questions in the comments section.

Tea of the week: Ceylon by Sub Rosa Tea. This is a nice, basic, bold tea, very astringent; it’s great for blending so long as you don’t choose delicate flavors to blend with. It really adds a kick!

Today’s Topic: Containers and virtualization

Today, I’m going to give you a brief overview of a technology I think might be helpful when running a test lab. So often as testers we neglect to follow trends in development; we figure devs love their fancy toys, but the processes for testing software really don’t change, so there’s no need to pay much heed to what they’re doing. Too often we forget that, especially as automation engineers, we are writing software and using software and immersing ourselves in software just like they are. So it’s worth taking the time to attend tooling talks now and then and see if there’s anything worth picking up.

Vagrant

A tool I’ve picked up and put down a few times over the past year or so is Vagrant. Vagrant makes it very easy to provision VMs; you can store the configuration for the server needed to run software right with the source code or binaries. Adopting a system in which developers keep the vagrantfiles up to date and testers use them to spin up test instances can ensure that every test we run is on a valid system configuration, and both teams know what the supported configurations entail.

At a high level, the workflow is simple:

  1. Create a Vagrantfile
  2. On the command line, type “vagrant up”
  3. Wait for your VM to finish booting

In order for this to work, however, you have to have what’s called a “provider” configured with Vagrant. This is the specific VM technology you’re using at your workplace; in my experiments, I’ve used Virtualbox, but if you’re already using something like VMWare or a cloud provider like AWS for your test lab, there are integrations with those systems as well.
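
In practice, the whole loop looks something like this on the command line (the box name is just HashiCorp’s example Ubuntu box):

mkdir vagrant-demo && cd vagrant-demo
vagrant init hashicorp/precise64   # writes a commented starter Vagrantfile
vagrant up                         # downloads the box if needed, boots the VM, runs provisioners
vagrant ssh                        # shell into the running VM
vagrant destroy -f                 # tear the whole thing down when you're done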

When creating the vagrantfile, you first select a base image to use. Typically, this will be a machine with a given version of a given OS and possibly some software that’s more complex to install (to save time). HashiCorp, makers of Vagrant, provide a number of base machines that can be used, or you can create your own. This of course means that every VM you bring up has the same OS and patch level to begin with.

The next step is provisioning the box with the specific software you’re using. This is where you would install your application, any dependencies it has, and any dependencies of those dependencies, and so on. Since everything is installed automatically, everything is installed at the same version and with the same configuration, making it really easy to load up a fresh box with a known good state. Provisioning can be as simple as a handful of shell scripts, or it can use any of a number of provisioning systems, such as Chef, Ansible, or Puppet.

Here is a sample vagrantfile:

# -*- mode: ruby -*-

  $provisionScript = <<SCRIPT
    #Node & NPM
    sudo apt-get install -y curl
    curl -sL https://deb.nodesource.com/setup | sudo bash -  #We have to install from a newer location, the repo version is too old
    sudo apt-get install -y nodejs
    sudo ln -s /usr/bin/nodejs /usr/bin/node
    cd /vagrant
    sudo npm install --no-bin-links
SCRIPT

# vi: set ft=ruby :

# All Vagrant configuration is done below. The "2" in Vagrant.configure
# configures the configuration version (we support older styles for
# backwards compatibility). Please don't change it unless you know what
# you're doing.
Vagrant.configure(2) do |config|
  # The most common configuration options are documented and commented below.
  # For a complete reference, please see the online documentation at
  # https://docs.vagrantup.com.

  # Every Vagrant development environment requires a box. You can search for
  # boxes at https://atlas.hashicorp.com/search.
  config.vm.box = "hashicorp/precise64"

  config.vm.provider "virtualbox" do |v|
    v.customize ["setextradata", :id, "VBoxInternal2/SharedFoldersEnableSymlinksCreate/v-root", "1"]
  end
  config.vm.network "private_network", ip: "192.168.33.11"
  
  #Hosts file plugin
  #To install: vagrant plugin install vagrant-hostsupdater
  #This will let you access the VM at servercooties.local once it's up
  config.vm.hostname = "servercooties.local"
  
  config.vm.provision "shell",
  inline: $provisionScript

end

I left a good deal of the tutorial text in place, just in case I needed to reference it. We’re using Ubuntu Precise Pangolin 64-bit as the base box, distributed by HashiCorp, and I use a plugin that modifies my hosts file so that I can always find the machine in my browser at a known host. The provision script is just a simple shell script embedded within the config; I’ve placed it at the top so it’s easy to find.

One other major feature that I haven’t yet played with is the ability for a single Vagrantfile to bring up multiple machines. If your cluster generally consists of, say, two web servers, a database server, and a load balancer, you can encode that all in a single vagrantfile to bring up a fresh cluster on demand. This makes it simple to bring up new testing environments with just one command.
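
Day to day, a multi-machine cluster like that is driven entirely by machine name; the names below (web1, db, lb) are hypothetical and would be whatever you define in the Vagrantfile:

vagrant up              # bring up the whole cluster in one shot
vagrant up db           # or just one piece of it
vagrant ssh web1        # shell into a specific machine
vagrant status          # check the state of every machine in the cluster
vagrant destroy -f lb   # throw away (and later recreate) any one piece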

Docker

I haven’t played much with Docker, but everyone seems to be raving about it, so I figured I’d touch on it as an alternative to Vagrant. Docker takes the metaphor of shipping containers, which revolutionized the shipping industry by abstracting the handling of specific types of goods away from the underlying business of moving goods around, and extends it to software. Before standard shipping containers, different goods were packed differently, required different packaging material to keep them safe, and shipped in different quantities and weights; cargo handlers had to learn all these things, and merchants were a little wary of trusting their precious goods to someone who was less experienced. The invention of the standard shipping container changed all that: shipping companies just had to understand how to load and transport containers, and it was up to the manufacturers to figure out how to pack them. Docker does the same thing for software: operations staff just have to know how to deploy containers, while it’s up to the application developers to understand how to pack them.

Inside a docker container live the application, its dependencies, and its required libraries, all pinned to the right versions. Outside, the operating system and any system-wide dependencies can be maintained by the operational staff. When it’s time to upgrade, they just remove the existing container and deploy the new one over top. Different containers with different versions of the same dependency can live side by side; each one can only see its own contents and the host’s contents.
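
In shell terms, that upgrade story is roughly the following; the image and container names here are made up for illustration:

docker pull example/myapp:2.0            # fetch the new version of the image
docker stop myapp && docker rm myapp     # remove the old container; the host OS is untouched
docker run -d --name myapp -p 8080:8080 example/myapp:2.0   # drop the new one in its place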

And thus, we reach the limit of my knowledge of Docker. Do you have more knowledge? Do you have experience with Vagrant? Share in the comments!