Dockerization part 2: Deploying

Now that we have containers, we need to push them to our subprod environments so they can be tested. Bear with me, this is where things get a little complicated.

Docker Setup

Most people take the easy way out when they move to Docker: they ship their containers to the cloud and let someone else manage the installation, upgrades, and maintenance of the Docker hosts. We don’t do things the easy way around these parts, though, so we have our own server farm: a series of VMs in our datacenter. Everything below the VM is maintained by another team; my team is responsible for the software layer of the VMs and the containers that run on top. We have a handful of servers in our sub-prod environments, and a handful more in our various production DMZs.

For management, most people seem to choose Kubernetes, but again, we don’t do things the easy way around here, so we went with a less popular product called Rancher. Now, Rancher is a management interface that can sit on top of a number of underlying technologies, including Kubernetes, but we chose their house-brand orchestrator, called Cattle, instead. They were nice enough to give us a bunch of Docker training, including the advice that forms the basis for their naming theme: if servers are pets, carefully maintained and fed over the years, containers should be cattle, slaughtered and replaced as soon as they seem ill so they don’t infect the whole herd.

Rancher is a really great tool if you’re working in the GUI. It has the concept of an Environment (which we use to separate dev from QA from demo), which spans one or more Hosts (the servers that run Docker and manage the containers). Inside the Environment are Stacks: named collections of related containers. Rancher also handles a lot of the networking between containers; it comes with its own DNS for the internal container network, so you can resolve StackName/ContainerName to find a given container in your Environment. If you’re using Compose, you can upload a docker-compose.yml file to create a stack, and the extra metadata Rancher needs can go in a rancher-compose.yml file uploaded alongside it.
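To make that concrete, here’s a minimal sketch of what that pair of files might look like for a hypothetical two-container stack; the image names, scale, and health check values are invented for illustration, not one of our real stacks:

# docker-compose.yml -- a hypothetical stack with a web app and its database
web:
  image: registry.example.com/myapp:1.2.3
  ports:
    - "8080:8080"
  links:
    - db
db:
  image: postgres:9.5

# rancher-compose.yml -- Rancher-specific metadata for the same stack
web:
  scale: 2
  health_check:
    port: 8080
    interval: 2000
    request_line: GET /health HTTP/1.0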

[Screenshot: Rancher running on my local machine, showing a project I have in progress]

Deployment

Manual deployment is super easy in Rancher: create a new stack, add services, paste in the container name from our build step, and let it handle everything else. Moving between environments manually once it works in dev is also easy: download the compose files, then upload them into the next environment. But we’re doing CI/CD, and the developers are constantly asking how they can speed up their release schedule. How do we do this automatically?

There are two tools that come with Rancher that can help here. One is the extensive API: pretty much everything you can do in the GUI can be done via the JSON-based REST API. The other is the pair of command-line tools they produce, rancher-compose and the Rancher CLI. Since I was also trying to release quickly, I used the API for my initial round of deployment scripts; in a later post, I’ll talk through how I’ve begun converting to the CLI commands instead, as I feel they’re faster and cleaner.

For Bamboo, I needed something that could run in a Deploy Project and update the stack in a given environment. I decided to write a Node.JS script, because when all I have is a node-shaped hammer, every build script becomes a nail 😉 (actually, it was so our Node developers could read the script themselves). There’s nothing too special here, just standard API integration using promises; however, this is a chunk of a bigger library I decided to write around Rancher, so you’ll see a lot of config options:

// Assumes the request-promise library; opts holds the library's config (sketched below)
const request = require('request-promise');

function findStack(stackName, environment) {
    return request({
        uri: `${opts.url}/v2-beta/projects/${opts.projectIDs[environment]}/stacks?name=${stackName}`,
        auth: {
            username: opts.auth[environment].key,
            password: opts.auth[environment].secret
        },
        json: true // automatically parses the response body as JSON
    });
}
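Every function leans on that opts object, so it’s worth showing roughly what the library is configured with. This is a sketch derived from how the code uses it; every value here is a placeholder, not one of our real IDs or keys:

// Hypothetical shape of the opts config object the library is initialized with
const opts = {
    url: 'https://rancher.example.com',   // base URL of the Rancher server
    projectIDs: {                          // Rancher's internal ID for each Environment
        dev: '1a5',
        qa: '1a7'
    },
    auth: {                                // API key pair per environment
        dev: { key: 'ACCESS_KEY', secret: 'SECRET_KEY' },
        qa: { key: 'ACCESS_KEY', secret: 'SECRET_KEY' }
    },
    retries: {
        on422: true,          // roll back and retry when an upgrade hits a 422
        actionComplete: 600   // polling attempts in waitForActionComplete
    }
};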

function getContainerInfo(environment, stackName, containerName) {
    log(`Getting container info for ${containerName}`);
    return findStack(stackName, environment)
    .then((body) => request({
        method: 'GET',
        uri: `${opts.url}/v1/services/?environmentId=${body.data[0].id}&name=${containerName}`,
        auth: {
            username: opts.auth[environment].key,
            password: opts.auth[environment].secret
        },
        json: true // automatically parses the response body as JSON
    }));
}

function performAction(serviceId, action, environment, launchConfig, stackName) {
    update(`Performing action ${action} on service ${serviceId}`, 'info', stackName);
    return request({
        method: 'POST',
        uri: `${opts.url}/v1/services/${serviceId}/?action=${action}`,
        body: {
            'inServiceStrategy': {
                'batchSize': 1,          // replace one container at a time
                'intervalMillis': 2000,  // wait 2s between batches
                'startFirst': true,      // start the new container before stopping the old
                'launchConfig': launchConfig
            }
        },
        auth: {
            username: opts.auth[environment].key,
            password: opts.auth[environment].secret
        },
        json: true
    });
}

    // (a method on the library's exported object, hence the different syntax)
    performServiceUpgrade: function (stackName, containerName, environment, image) {
        update(`Upgrading ${containerName} in stack ${stackName} in ${environment} to image ${image}`, 'info', stackName, environment);
        return getContainerInfo(environment, stackName, containerName).then((body) => {
            if (body.data.length <= 0) {
                throw new Error(`Could not find service ${containerName} in stack ${stackName} in ${environment}`);
            }
            let serviceId = body.data[0].id;
            let launchConfig = body.data[0].launchConfig;
            launchConfig.imageUuid = image;

            return performAction(serviceId, 'upgrade', environment, launchConfig, stackName)
            .catch((err) => {
                if ((err.statusCode == 422 || err.status == 422) && opts.retries.on422) {
                    log('Detected invalid state. Rolling back to retry.', 'info', stackName)
                    return performAction(serviceId, 'rollback', environment, launchConfig, stackName)
                        .then(() => this.waitForActionComplete(stackName, containerName, environment, 'active'))
                        .then(() => performAction(serviceId, 'upgrade', environment, launchConfig, stackName));
                } else {
                    log('Detected error condition. Aborting', 'error', stackName)
                    throw err;
                }
            })
            .then(() => this.waitForActionComplete(stackName, containerName, environment, 'upgraded'))
            .then(() => serviceId);
        });
    }

Two lines in performServiceUpgrade deserve a callout: the ones that grab body.data[0].launchConfig and overwrite its imageUuid, because they’re a little strange. Rancher lets you update anything about a service using the same endpoint, which is kind of nice and kind of rough: I have to specify every single attribute of the service, or it’ll assume I meant to blank out the missing settings (rather than assuming I meant to leave them unchanged). To make that easier, I capture the existing launch configuration, swap in the new image, and send the whole thing back.

Upgrading a service in Rancher is a two-step process: first you upgrade, which launches a copy of the new container alongside every copy of the existing container, and then you “finish” the upgrade, which removes the old containers. This way, if there’s a problem with the new container, you can issue a “rollback” action, which turns the old containers back on and removes the new ones, much faster than pulling a fresh copy of the old image. However, it also means you’ll sometimes try to upgrade while the service is still in an “upgraded” state, waiting for you to finish or roll back. When that happens, Rancher responds with a status code 422. My library optionally rolls back and reissues the upgrade action when it encounters this state.
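For completeness, the finish step goes through the same action endpoint as everything else. This isn’t from the library above; it’s a sketch assuming the same request/opts setup and Rancher’s v1 action name, which I believe is finishupgrade:

// Completing the upgrade removes the old containers so the service
// leaves the 'upgraded' state and returns to 'active'
function finishUpgrade(serviceId, environment) {
    return request({
        method: 'POST',
        uri: `${opts.url}/v1/services/${serviceId}/?action=finishupgrade`,
        auth: {
            username: opts.auth[environment].key,
            password: opts.auth[environment].secret
        },
        json: true
    });
}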

The hardest part was figuring out when Rancher was actually done upgrading. Some of our images are huge, particularly the ones containing monoliths we’re still in the process of breaking up; it can take several minutes for those containers to download and start. Eventually, I settled on a polling-based strategy:

waitForActionComplete: function(stackName, containerName, environment, desiredState) {
    update('Waiting for upgrade to complete', 'info', stackName, environment);
    return new Promise((resolve, reject) => {
        // Poll until the service reaches the desired state or we run out of retries
        let retries = opts.retries.actionComplete;
        function checkState() {
            getContainerInfo(environment, stackName, containerName).then((body) => {
                let container = body.data[0];
                log('Current state: ' + container.state);

                // Check if the action is done
                if (container.state == desiredState) {
                    log('Action complete');
                    return resolve();
                }
                retries--;
                if (retries < 0) {
                    return reject(new Error('Timed out waiting for action to complete'));
                }
                log(`${retries} retries left, running again`);
                return setTimeout(checkState, 1000);
            }).catch(reject); // a failed status request would otherwise hang the promise
        }
        setTimeout(checkState, 500);
    });
}

This will keep running the checkState function until either the container reaches the desired state or it runs out of retries (configurable in the library’s options). I’ve had to tune the number of retries several times; right now, for our production deploys, it’s something outrageous like 600.

This library is called from a simple wrapper for Bamboo’s sub-prod deploys. I won’t reproduce the real wrapper, but a hypothetical sketch looks something like this (the Bamboo variable names are invented for illustration):
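// deploy.js -- hypothetical wrapper; Bamboo exposes its deploy variables as
// environment variables, and the image tag comes from the build step
const rancher = require('./rancherLib')(config); // config: the opts object sketched earlier

rancher.performServiceUpgrade(
    process.env.bamboo_stack_name,
    process.env.bamboo_container_name,
    process.env.bamboo_deploy_environment,
    `docker:${process.env.bamboo_image_tag}` // Rancher's imageUuid format is docker:<image>:<tag>
).then((serviceId) => {
    console.log(`Upgrade complete for service ${serviceId}`);
}).catch((err) => {
    console.error(err);
    process.exit(1); // fail the Bamboo deploy step
});

For production, however, I got a lot trickier. Stay tuned for that write-up next week!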

Testing Socksite: Functional tests with Node.JS

Background

As you may know if you’ve ever browsed my GitHub account, I am a member of a tiny open-source organization called SockDrawer. Odds are, we don’t make anything you’ve heard of; we’re just a group of individuals that all belong to a forum and wanted to make tools to enhance the forum-going experience. SockDrawer began when the founder, Accalia, wanted to make a platform for easily creating bots that interact with up-and-coming forum software Discourse. She named her platform SockBot, after the forum term “sock puppet”, and soon had more help than she could easily organize from other forumgoers who wanted to pitch in.

My connection with SockDrawer came when Accalia solicited some advice on how to unit test SockBot. The architecture wasn’t designed well for testability; since I work in QA, I had plenty of advice to dispense. Furthermore, she wanted help writing documentation; technical writing is also something I’m interested in, so I joined SockDrawer and stuck around.

The Ticket

The ticket that generated today’s adventure was filed by Onyx, the UX expert for SockDrawer. It was a feature request for another product, Socksite, a website used to monitor the forum’s uptime; the production version can be seen at www.isitjustmeorservercooties.com if you want to follow along.

The ticket was issue number 45: “No method of simulating server cooties” (“server cooties” means, loosely translated, that the forum in question is behaving incredibly slowly or not responding due to an unknown cause). The text:

We are missing a method of simulating server cooties outside of calling one of the status endpoints. This is not really useful for testing live update issues.

A good solution might be a way to set a “delay” variable at runtime. Value of this variable could then be added to the actual measured time (in ms). Setting this variable per tested endpoint would be nice, but not essential.

Of course, I immediately threw out the suggested solution :). To me, the best way to simulate server cooties was to mock the data coming from the server, putting the site into an artificially induced but otherwise real failure state, sort of an emergency drill. And the best way to ensure that the frontend responds correctly to what the backend is doing is to codify those scenarios in functional tests, thereby removing the burden of manual regression testing whenever we make front-end changes.

Webdriver bindings

I have used Selenium Webdriver for functional testing before, so I googled for a Node library that would expose the bindings. Webdriver.io was my first attempt; however, its interface is so radically different from the standard one that I found myself rapidly frustrated by the constant need to refer to the docs to write anything. What it did do well was abstract away browser creation and cleanup, so I didn’t have to write that code. Ultimately, though, I abandoned it and returned to the standard selenium-webdriver library.

I knew that I’d eventually want to use a remote webdriver service, particularly if we ever wanted to use BrowserStack or some other third-party webdriver source. Wouldn’t it be cool, I thought, if when we’re not running from CI, we launched the server portion of the remote webdriver automatically? This proved frustratingly difficult with the selenium-webdriver library alone. I almost gave up, until I found selenium-standalone. This was hands-down the easiest library for controlling webdriver I’ve ever used, and I intend to bring it back to my workplace and suggest we start using it immediately. It has commands to automatically install and launch the Selenium server along with the various browser drivers, such as the ones for Chrome and Internet Explorer. I eventually moved the install out of the test scripts, figuring it could be done during setup before the tests run.
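That install step can also be done programmatically; something like this callback-style sketch, which mirrors what the command-line install does:

var selenium = require('selenium-standalone');

// Downloads the Selenium server jar and the browser drivers -- the
// programmatic equivalent of running `selenium-standalone install`
selenium.install(function(err) {
    if (err) throw err;
    // once this finishes, selenium.start() can launch the server (see below)
});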

Using Mocha, this made my before method nice and clean:

// Assumed requires for this excerpt; the paths are guesses at the project layout
var webdriver = require('selenium-webdriver');
var selenium = require('selenium-standalone');
var assert = require('chai').assert;      // assuming chai, since assert.match is used later
var socksite = require('../../server');   // hypothetical path to the Socksite app
var cache = require('../../lib/cache');   // hypothetical path to the cache module
var testData = require('./testData.json');
var driver;

before('init browser session', function(done) {
    socksite.log = function() {}; // no-op log function == quiet mode
    socksite.start(8888, 'localhost', function() {
        selenium.start(function(err, child) {
            if (err) throw err;

            driver = new webdriver.Builder()
                .usingServer('http://localhost:4444/wd/hub')
                .withCapabilities(webdriver.Capabilities.firefox())
                .build();

            // In order to know when we're ready to run the test, we have to load a page;
            // there's no "on browser ready" to hook into that I've found
            driver.get("localhost:8888").then(done);
        });
    });
});

And the first test:

it('should be running', function(done) {
    driver.getTitle().then(function(title) {
        assert.strictEqual(title, 'Is it just me or server cooties?', "Should have the right title");
        done();
    });
});

The teardown code had a similar issue: it needs to be async so that the teardown completes before Mocha exits:

after('end browser session', function(done) {
    driver.quit().then(done); // quit() ends the whole session; close() only closes the window
});

Mocking data

Now, I needed to be able to mock the up and downtime of the site. Because we’re in Node, and not behind Apache or anything like that, I’ve launched the server in code; I have a handle directly to the application already. But how to feed it false data? Did I need to use Sinon.js to mock out the module that takes the samples? Try to intercept the socket?

It turns out, we have a cache module that stores the latest result. When a page load is requested, the server fetches the data from the cache and embeds it on the page. While we could also emit the fake data using the web socket, that gets us into the messy territory of knowing when the client has received the data and finished updating, so that we can test that it updated correctly. This is worth doing to test the sockets later, but for now, I figured changing the cache and issuing another page load would be sufficient.

I encapsulated some data packets in a json file, which I loaded into the variable testData (see the requires above). I won’t reproduce the real packets, but hypothetically the file maps each status tier to a canned summary; the field names below are guesses, not Socksite’s actual schema:
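{
    "greatData":   { "status": "Great",   "time": 120 },
    "goodData":    { "status": "Good",    "time": 450 },
    "okData":      { "status": "OK",      "time": 900 },
    "badData":     { "status": "Bad",     "time": 2500 },
    "offlineData": { "status": "Offline", "time": null }
}

This let my tests be simple and clean again: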

describe('TRWTF', function() {
    it('is You when status is "Great"', function(done) {
        cache.summary = testData.greatData;

        driver.get("localhost:8888").then(function() {
            driver.findElement(webdriver.By.css("#header-image-wrapper img")).getAttribute("src").then(function(value) {
                assert.match(value, /isyou\.png/, "Image should say 'Is you'");
                done();
            });
        });
    });

    //[...]

    it('is Discourse when status is "Offline"', function(done) {
        cache.summary = testData.offlineData;

        driver.get("localhost:8888").then(function() {
            driver.findElement(webdriver.By.css("#header-image-wrapper img")).getAttribute("src").then(function(value) {
                assert.match(value, /isdiscourse\.png/, "Image should say 'Is discourse'");
                done();
            });
        });
    });
});

Screenshots

Now we were getting somewhere! I could see firefox open, flash through the various statuses, and close again. All I had to do was use Webdriver’s screenshot capability to capture images and we’d have a visual reference for what the site looks like in each of the various cootie configurations.

I created a second file, generateScreenshots.js, and put together a suite that does just that and nothing else. I’m running Node on Windows, so I needed the path library to handle the difference in slash direction between my machine and the Linux-based CI server and dev environments other developers were using. I also used path.resolve to generate the folder to save the screenshots to, since it makes relative paths absolute against the current working directory.

Here’s the complete text of the screenshot module:

// Assumed requires; the paths are guesses at the project layout
var fs = require('fs');
var path = require('path');
var webdriver = require('selenium-webdriver');
var selenium = require('selenium-standalone');
var socksite = require('../../server');   // hypothetical path to the Socksite app
var cache = require('../../lib/cache');   // hypothetical path to the cache module
var testData = require('./testData.json');

describe('Taking screenshots...', function() {
    this.timeout(40000);
    var driver;

    var folder = path.resolve("test", "functional", "screenshots");

    before('init browser session', function(done) {
        socksite.log = function() {}; // no-op log function == quiet mode
        socksite.start(8888, 'localhost', function() {
            selenium.start(function(err, child) {
                if (err) throw err;

                driver = new webdriver.Builder()
                    .usingServer('http://localhost:4444/wd/hub')
                    .withCapabilities(webdriver.Capabilities.firefox())
                    .build();
                driver.get("localhost:8888").then(done);
            });
        });
    });

    it('when status is "Great"', function(done) {
        cache.summary = testData.greatData;

        driver.get("localhost:8888").then(function() {
            driver.takeScreenshot().then(function(image) {
                fs.writeFile(path.join(folder, 'great.png'), image, 'base64', done);
            });
        });
    });

    it('when status is "Good"', function(done) {
        cache.summary = testData.goodData;

        driver.get("localhost:8888").then(function() {
            driver.takeScreenshot().then(function(image) {
                fs.writeFile(path.join(folder, 'good.png'), image, 'base64', done);
            });
        });
    });

    it('when status is "OK"', function(done) {
        cache.summary = testData.okData;

        driver.get("localhost:8888").then(function() {
            driver.takeScreenshot().then(function(image) {
                fs.writeFile(path.join(folder, 'ok.png'), image, 'base64', done);
            });
        });
    });

    it('when status is "Bad"', function(done) {
        cache.summary = testData.badData;

        driver.get("localhost:8888").then(function() {
            driver.takeScreenshot().then(function(image) {
                fs.writeFile(path.join(folder, 'bad.png'), image, 'base64', done);
            });
        });
    });

    it('when status is "Offline"', function(done) {
        cache.summary = testData.offlineData;

        driver.get("localhost:8888").then(function() {
            driver.takeScreenshot().then(function(image) {
                fs.writeFile(path.join(folder, 'offline.png'), image, 'base64', done);
            });
        });
    });

    after('end browser session', function(done) {
        driver.quit().then(done); // quit() ends the whole session; close() only closes the window
    });
});

Conclusion

Finally, to make it easy to run, I created some npm commands in the package.json:

"scripts": {
    "test": "npm install selenium-standalone -g && selenium-standalone install && mocha test\\functional\\webdriverTests.js",
    "screenshot": "npm install selenium-standalone -g && selenium-standalone install && mocha test\\functional\\generateScreenshots.js"
  },

This lets us run the tests with npm test, and the screenshots with npm run screenshot. Forward slashes in the mocha paths work on my Windows machine as well as the Linux-based CI, so the same scripts run everywhere. Nice and simple for developers to use, or to hook into CI.

In the long run, SockDrawer is moving toward a gulp-based build pipeline, while I’m only familiar with Grunt. Accalia said she’d turn these simple scripts into steps in the eventual gulpfile, letting the CI run them. I also want to change the test code to default to PhantomJS, with an optional input to use any browser, so we can run cross-browser in the future. But for an evening’s tinkering, I’d say this isn’t bad 🙂