Teatime: Improving websites, with science!

Welcome back to Teatime! This is a weekly feature in which we sip tea and discuss some topic related to quality. Feel free to bring your tea and join in with questions in the comments section.

Tea of the week: All the Lemon! by The Tea Dude. I know I usually recommend black teas, but I’ve got a sore throat this week, and lemon tisane hits the spot perfectly. 

Today’s Topic: Improving websites, with science!

“What should I do for teatime this week?” I asked my fellow Sock Devs.

“A/B Testing! With cookies!” It was so off the wall, I had to do it. You might want to grab some cookies to go with your tea this week 😉


So let’s say we have a website.


It’s a very good website; it sells cookies, and cookies are amazing. Everyone wants your cookies. Well, your boss’ cookies. You don’t bake the cookies; you’re the IT guy, keeping the registers working and the website up and running. It’s a small business, so it’s just you doing “the computer things”.

The website takes orders and puts them in a queue; every morning, the baker makes the custom orders, boxes them up, and ships them via FedEx, marking the orders done on their tablet. They then bake the day’s cookies for the shop, open the place, and a series of bored teenagers try to make the stupid Square dongle work on the tablet so people can buy cookies in person. And it works, but not well enough.

For some reason, the online business isn’t taking off the way they said it should. Mrs Smith, the baker, asks you, “Do we need to call up the Googles and ask them to make the website better?” You assure the baker – she’s a sweet older lady, she doesn’t know much about computers – that your SEO is fine, really, your site is at the top if you search for “Akron Cookies rainbows”, and that’s the best search term. Analytics say a bunch of people are showing up on the site, but there’s not a lot of orders. Maybe the website needs an update?

Conversion Rate

What she’s really worried about isn’t the SEO, it’s the Conversion Rate. The conversion rate is simply a percentage that indicates how many of the people who land on your website go on to “do the thing” – in this case, buy cookies. It could be signing up for a cookie newsletter, or leaving a review on Yelp, or whatever, but today we’re worried about sales, because when your sales are low, nothing else matters.
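The arithmetic is as simple as it sounds, but it’s worth pinning down. A quick sketch (the class name and the visitor numbers here are mine, purely for illustration – they’re not Mrs Smith’s real figures):

```java
public class ConversionRate {
    // conversions / visitors, expressed as a percentage
    static double conversionRate(int conversions, int visitors) {
        if (visitors == 0) return 0.0; // no traffic, no rate
        return 100.0 * conversions / visitors;
    }

    public static void main(String[] args) {
        // Say 1,000 people landed on the site and 12 bought cookies:
        System.out.println(conversionRate(12, 1000) + "%"); // 1.2%
    }
}
```

If the in-store rate really is around 80%, a web rate like 1.2% makes the problem pretty stark.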

Well, you don’t know much about improving conversion rates. You don’t know much about baking cookies, either, or what kinds of behaviors drive people to buy cookies online. I mean, the problem could be anything, right? Maybe people just don’t like cookies anymore? No, no, Mrs Smith assures you that people buy cookies when they’re in the store. Her physical conversion rate is like, 80%. It’s something about the website.

The Scientific Method

Well, you don’t know much about conversion optimization. And you don’t know that much about websites. But you do know something about science. You aced your science fair in middle school, and you got all the way through AP bio in high school before you decided to do websites full-time instead of becoming a world-famous botanist. Surely the scientific method could help you here, right?

So you dig out your composition book and turn to a new page. We’re ready to begin doing science now. How did it start….

Ask a Question

Oh right. The first thing you need to do is ask a question. As scientists, we are always making observations about the world around us; we don’t start digging into something, however, until we have a question. Why does that flower look that way? Why does paper fall slower than bowling balls if acceleration due to gravity is a constant? What happens if I bake cookies at twice the temperature for half the time?

So what’s our question? How about, “Why aren’t people buying cookies online?”

Gather Data

Next, we need to do some research. If we don’t have good, concrete data about the world around us, it’s going to be impossible to answer the questions. We’re not forging ahead into a realm of experimental physics here; we’re looking at something tangible and concrete in the world, so we can measure what it’s like currently.

So you go and you pull data from Google Analytics for a week or two, tracking where people go when they’re on the site. It’s not very exciting data; you have a bunch of visitors, but most of them seem to be hitting the home page, scrolling down for a bit, then leaving. Why do they leave? What’s going on here? Now we can dig into the specifics.

Construct Hypothesis

So here’s your cookie homepage:

You hit the site, pretending you’ve never been here before. Maybe you don’t realize you can buy cookies here? I mean, it says cookies, but there’s sky behind it, maybe this is like, a cookie fansite? Or a dictionary page?

It’s time to make a hypothesis, a concrete statement of fact that can be proven or disproven by the data you can collect. One format for a hypothesis is the following madlib: “If I _____, then ________ will happen.”

So you use this template and jot down your hypothesis in your composition notebook:


“If I change the home page picture, then more people will buy cookies.”

Conduct Experiment

Alright, time to get down, dirty, and dangerous! Wait, that’s botany. We’re dealing with websites here. But still, time for science! We’re going to change the home page, and see if conversions go up. But, wait a minute, Mrs Smith was just telling you about her new ad campaign that’s sure to “fix the Googles problem”. And it’s almost October and that’s National Cookie Month, and people tend to buy more cookies when they hear about that on the radio. So how will you know it works? The data’s going to be contaminated by all these outside factors changing! You can’t very well expect Mrs Smith to put her advertising efforts on hold for the purity of your data, and you can’t ask time to stop for you. So what can we do?

A/B Testing


This is where A/B testing comes in. The idea is, you show the new homepage header to half the incoming visitors. That way, the influx of new visitors doesn’t impact your data, because you’re comparing results over the same time period rather than across separate months. The data remains pure and clean, and you can track this specific variable without any outside interference.
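Google Analytics does the coin-flipping for you in this story, but the core mechanic is easy to sketch. Here’s a hypothetical server-side version (the AbTestRouter name and the cookie-based visitor ID are my assumptions, not part of the GA setup): hash a stable visitor ID into a bucket, so the same visitor sees the same variant on every visit.

```java
public class AbTestRouter {
    // Hash a stable visitor ID (e.g. from a cookie) into one of two buckets,
    // so a returning visitor always sees the same variant of the page.
    static String variantFor(String visitorId) {
        int bucket = Math.floorMod(visitorId.hashCode(), 2);
        return bucket == 0 ? "A" : "B";
    }

    public static void main(String[] args) {
        System.out.println(variantFor("visitor-1234")); // stable across visits
    }
}
```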

Since this is a simple website and you’re using Google Analytics, this is pretty simple to set up. You upload an alternate version of the home page over FTP, one with the cookies header. Then you go into Google Analytics, add an experiment, and put in both URLs. You put some tracking code onto each version of the home page and re-upload them, and voilà, Google handles the rest. From there, half the visitors are routed to one version of the home page, and the other half to the other version. Google tracks how many of them go on to buy cookies, and starts giving you data pretty quickly.

Analyze Data

Now in Analytics you can see the results: a breakdown of how many conversions you’re getting on each version of the homepage on each day. And when the website goes down because the power goes out to the bakery and Mrs Smith doesn’t understand what the cloud is or why she should use it when there’s a perfectly good server sitting in the broom closet, that doesn’t affect the data either, because you’re getting 0 conversions for those six hours with both variations. So it all works out!
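Analytics also does the statistics for you, deciding when one variant is reliably ahead rather than just lucky. A common tool for that judgment is a two-proportion z-test; here’s a sketch of the idea (the class name and the sample numbers are made up for illustration):

```java
public class AbSignificance {
    // Two-proportion z-test: how many standard errors apart are the two
    // conversion rates? Roughly, |z| > 1.96 corresponds to ~95% confidence.
    static double zScore(int convA, int nA, int convB, int nB) {
        double pA = (double) convA / nA;
        double pB = (double) convB / nB;
        double pooled = (double) (convA + convB) / (nA + nB);
        double se = Math.sqrt(pooled * (1 - pooled) * (1.0 / nA + 1.0 / nB));
        return (pB - pA) / se;
    }

    public static void main(String[] args) {
        // 30/1000 conversions on the old page vs 60/1000 on the cookie header
        System.out.println(zScore(30, 1000, 60, 1000)); // ~3.2: significant
    }
}
```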

Act on the Results


Now that you have your results, you can act on them. The cookie homepage is definitely bringing in more sales; you stop the experiment, and re-upload the cookie header version as index.php so it shows up for everyone. The orders start pouring in, so much so that Mrs Smith is able to hire another baker to help handle just home deliveries. She’s so happy, she offers to design a line of cookies with your name on it! You thank her politely, but honestly, it was your interest in botany that saved the website, so maybe cookies that look like plants?


So she agrees! And you all live happily ever after… at least, until the conversion rate tops out and you start learning about SEO in earnest. Then you get to start all over again from the top.


Cookie Monster copyright Sesame Street. Sample website created with Mobirise website generation software.

Teatime: Testing Large Domains

Welcome back to Teatime! This is a (semi-)weekly feature in which we sip tea and discuss some topic related to quality. Feel free to bring your tea and join in with questions in the comments section.

Tea of the week: Dragon Pearls by Teavana. My grandmother gave me some of this for my birthday a few years back, and it’s become one of my favorite (and most expensive!) teas since. Definitely a special occasion tea!

Edit: Teavana has stopped selling tea online after being bought by Starbucks. You know how I love a good chai, so how about the Republic Chai from Republic of Tea?

Today’s topic: Testing large domains

One challenge that intrigues me as much as it scares me is the idea of testing a product with a large domain of test inputs. Now, I’m not talking about a domain name or “big data”; instead, I mean a mathematical domain, as in the set of potential inputs to a function (or process). If you try to test every combination of multiple sets of inputs, or even every relevant one (barring a few you have decided won’t happen in nature), you’ll quickly run afoul of one of the key testing principles: exhaustive testing is impossible. Sitting down and charting out test cases without doing some prep work first can quickly lead to madness and excessively large numbers of tests. That’s about where a BA I work with was when I offered to help, using the knowledge I’ve gained from my QA training courses.

The Project

The project’s central task was to automate the entry and processing of warranty claims for our products. We facilitate the collection of data and the shipping of the product back to the manufacturer as an added service for our customers, as well as handling the financial rules involved to ensure that everyone who should be paid is paid in a timely fashion. However, the volume of warranty claims was growing too large for our human staff to handle alone. Therefore, we set out to construct an automated system that would check certain key rules and disallow any claim to be entered that was likely to be rejected by the manufacturer.

The domain for this process is the Cartesian join of the possible inputs: every manufacturer, every customer of ours, every warehouse that can serve the customer, every specific product (in case it’s on a recall), and every possible reason a customer might return a product (as they each have different rules). Our staff did a wonderful job of boiling them down to a test set that includes a variety of situations and distinct classes, but we were still looking at over 30,000 individual test cases to ensure that all the bases were covered by our extensive rules engine. What’s a test lead to do?

Technique: Equivalence Partitioning

The first technique is pretty straightforward and simple, but if you’ve never used it before, it can be a lightbulb moment. The basic idea is to consider the set of inputs and figure out what distinguishes one subset from another. For example, instead of trying to enter every credit card number in the world, you can break them out into partitions: a valid Visa card, a valid Mastercard, a valid American Express, a card number that is not valid, and a string of non-numeric characters. Suddenly, your thousands of test cases can be cut down to a mere five!
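As a sketch, those five partitions might look like this in code (the prefix rules are the standard ones: 16-digit Visa numbers start with 4, Mastercard with 51–55, 15-digit Amex with 34 or 37; a real validator would also run the Luhn checksum, which I’m skipping here):

```java
public class CardPartition {
    // Classify a card number into one of the five equivalence partitions.
    static String partition(String number) {
        if (!number.matches("\\d+")) return "non-numeric";
        if (number.matches("4\\d{15}")) return "valid Visa";
        if (number.matches("5[1-5]\\d{14}")) return "valid Mastercard";
        if (number.matches("3[47]\\d{13}")) return "valid American Express";
        return "not valid";
    }

    public static void main(String[] args) {
        // One representative test case per partition is all you need.
        System.out.println(partition("4111111111111111")); // valid Visa
    }
}
```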

In essence, this is what the business folks did to arrive at 30,000 from literally infinite: they isolated a set of warehouses that represent all warehouses, a set of customers that represent all types of customers, and a set of SKUs that represent all types of SKUs.

Technique: Separation of Concerns

The next thing I did isn’t so much a testing technique as a development technique I adapted for testing. I realized that we were trying to do too much in one test: combinatorial testing, functional testing, data setup verification, and exploratory testing. By separating them into explicitly different concerns, we could drastically cut down on the number of test cases. I suggested to the BA that as part of go-live we get a dump of the data setup and manually verify it, eliminating the need to test all the possible rule scenarios for all possible manufacturers. I split my test cases into combinatorial happy-path tests that make sure every potential input is tested at least once, and functional testing to verify that each rule works correctly. That cut way down on the number of cases. Divide and conquer!

Technique: Decision Tables

To create the functional tests, I used a technique called a decision table. Or well, a whole set of them, but I digress. Essentially, you identify each decision point in your algorithm and use them as conditions in the top portion. You then identify each action taken as a result, and list those in the bottom portion. Finally, you fill in test values (often true/false or yes/no, but sometimes numeric; you could have written C3 in the example as “transaction amount” with “<$500” and “>=$500” as your values).

If any of you have written out a truth table before, this is essentially the testing version of that. In the long form, this would have a truth table of the conditions, with the actions specified based on the algorithm. You can then take any two test cases that produce identical output and have at least one identical input and elide them together.
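To make that concrete, here’s a hypothetical, heavily simplified decision table for a single return reason, already elided the way described above (the rule names are mine for illustration; the real tables covered far more conditions):

```java
public class WarrantyDecisionTable {
    // Conditions (top of the table): is the claim within the warranty window,
    // does this manufacturer require a receipt, and was one provided?
    // Actions (bottom of the table): accept or reject the claim.
    static String decide(boolean inWindow, boolean receiptRequired, boolean hasReceipt) {
        if (!inWindow) return "REJECT";                      // rule 1: window expired; receipt is a don't-care
        if (receiptRequired && !hasReceipt) return "REJECT"; // rule 2: required receipt is missing
        return "ACCEPT";                                     // rule 3: everything else
    }

    public static void main(String[] args) {
        // Eight truth-table rows collapse to three rules once the
        // don't-care columns are elided together.
        System.out.println(decide(true, true, false)); // REJECT
    }
}
```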

I started putting together a decision table for each return reason, with every rule down the left and every manufacturer across the top:


As you can see, it got really messy really fast! That was when I decided to try using equivalence partitioning on the decision tree itself. I figured, not every manufacturer cares about every rule for every reason. If I did one table per reason, and only considered the test cases that could arise from the actual data, I would have something manageable on my hands.

I sat down with a big list of manufacturers and their rules, and I divided that into a set of rules which can have a threshold (giving us two cases: valid or invalid) or a “don’t care” (giving two more cases: valid but the rule does not apply, and invalid but the rule does not apply). That cut down the number of manufacturers needed to test considerably, and allowed me to begin constructing a decision table.

A list of what manufacturers consider what rules.

The output of that was a lot cleaner and easier to read:

One of eight decision tables that generated the new tests

Technique: Classification Trees

The next technique is an interesting one. When I learned it, I didn’t think I’d ever use it; however, I found it to be immensely valuable here. A classification tree begins life as a tree, the top half of the diagram you’re seeing: you break out all the possible inputs, and break out the equivalence partitions of the domain of each in a nice flat tree like this. Then you draw a table underneath it.

By OMPwiki - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=27692755

The ISTQB syllabus suggests using a specialized tool that can generate pairs, triples, et cetera according to rules you punch in, but I didn’t need one for this; my coverage criterion was just to cover each factor at least once, so I figured I needed at least as many tests as the largest domain (the OEMs). I then went through and marked off items to make sure each one was covered at least once. You can do more with it, but that’s all I needed.

My makeshift classification tree
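That “cover each factor at least once” criterion can be sketched as an algorithm: take as many rows as the largest domain has values, and cycle the smaller domains to fill in the other columns. Everything here (the class name, the factor values) is illustrative, not the real test data:

```java
import java.util.ArrayList;
import java.util.List;

public class EachChoiceCoverage {
    // Build one test row per value of the largest factor, cycling the
    // smaller factors so every value of every factor appears at least once.
    static List<String> generate(List<List<String>> factors) {
        int rows = factors.stream().mapToInt(List::size).max().orElse(0);
        List<String> tests = new ArrayList<>();
        for (int i = 0; i < rows; i++) {
            StringBuilder row = new StringBuilder();
            for (List<String> factor : factors) {
                if (row.length() > 0) row.append(" / ");
                row.append(factor.get(i % factor.size()));
            }
            tests.add(row.toString());
        }
        return tests;
    }

    public static void main(String[] args) {
        // Three OEMs (the largest domain) and two customer types -> three tests.
        List<List<String>> factors = List.of(
                List.of("OEM-1", "OEM-2", "OEM-3"),
                List.of("RetailCustomer", "FleetCustomer"));
        generate(factors).forEach(System.out::println);
    }
}
```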

At last, we had a lovely set of combinatorial tests we could run:


These tests, if you recall from above, were to verify that various customer-reason-warehouse-manufacturer combinations were configured correctly. This ensured that each of our representative samples was used in at least one test case, regardless of its data setup.


Have you ever faced a problem like this? What did you do?

Cucumber is magic, right?

Everywhere I go, I see talk about BDD and Cucumber. Cucumber is the promised messiah, the single technology that bridges the world of the business and the world of the programmer, allowing your BA to write executable test cases so you don’t have to spend time automating once you have a solid framework. It’s the future, the new order, and it’s here, now, ready for prime time. Who wouldn’t want to learn it, right?

That’s what I thought before I took a class on Ruby that happened to use it and got a good look under the hood. And what I saw wasn’t magical unicorn sparkles at all. It was a tangled nest of Regex. Lots and lots of Regex.

“I had a problem, so I used regular expressions. Now I have two problems!” – Perturbed Picard

If that hasn’t already scared you away, dear reader, strap in, because you’re in for a wild ride.

So I have a test automation framework for our public-facing site, Right Turn, which allows users to search for and purchase tires for their vehicle. We have tests, but the maintenance is killing us, so I’m exploring all kinds of alternate options. Among other things, I wanted to expose the framework to a Gherkin-style test harness so that the BA could put in test scripts for the ephemeral one-off bug fixes, get them tested, and remove them when she’s sure they’re not going to regress.

That’s one thing I really want to stress: I already have a framework that’s capable of driving my page using traditional WebDriver PageObjects. This initiative doesn’t remove or lessen that requirement at all, so there goes the “I don’t have to write as much code” benefit right out of the gate. You might be able to start doing this without a proper framework, but I doubt you’d be able to finish without it.

So I start puttering about in Java, pulling in Cucumber JVM as a Maven dependency. What will I need? Well, every test is going to want to specify a vehicle, I imagine, so I’ll toss in a step like that:


@Given("I have searched for tires for a ([\\w\\s]+)")
public void I_have_found_a_product(String searchCriteria) {
    // ...
}

This regex is simple: just grab everything after the intro bit. Now what? Well, I need to turn that string into a Vehicle object. So I skim over my existing Vehicle object for a parser… and I don’t have one. I’ll need to write it. Okay, how hard could it be to turn “2009 Kia Spectra” into Vehicle(2009, “Kia”, “Spectra”)?

//Sample: 2009 Kia Spectra
pattern = Pattern.compile("(\\d{4}) (\\w+) (\\w+)");
m = pattern.matcher(vehicleString);
if (m.matches()) {
    return new Vehicle(m.group(1), m.group(2), m.group(3), null, null);
}

Those nulls? Those are Trim and Option. We’ll ignore them for now. Year Make Model, done, check.

Until I want to use a “2005 Land Rover LR3”. Well crap. I know the right way to do this, but it’s not fun.

private final static String Makes = "Acura|Audi|BMW|Bentley|Buick|Cadillac|Chevrolet|Chrysler|Dodge|Ford|GMC|Honda|Hummer|Hyundai|Infiniti|Jaguar|Jeep|Kia|Land Rover|Lexus|Lincoln|MINI|Maserati|Maybach|Mazda|Mercedes-Benz|Mercury|Mitsubishi|Nissan|Pontiac|Porsche|Rolls Royce|Saab|Saturn|Scion|Smart|Subaru|Suzuki|Tesla|Toyota|Volkswagen|Volvo";

public static Vehicle parseString(String vehicleString) {
    Pattern pattern;
    Matcher m;
    //Sample: 1999 Kia Spectra
    pattern = Pattern.compile("(\\d{4}) (" + Makes + ") ([\\w\\s]+)");
    m = pattern.matcher(vehicleString);
    if (m.matches()) {
        return new Vehicle(m.group(1), m.group(2), m.group(3), null, null);
    }
    throw new IllegalArgumentException("Cannot parse vehicle string: " + vehicleString);
}

Okay. So then I’ll need a product. So I go to the site, putting in my test vehicle (RIP my poor little Kia, but she makes a better test vehicle than she did a transportation option anyway), and… I can’t go on in the purchase funnel. Why? You can’t deduce the tire size from the YMM, you need Trim on this vehicle.

Okay. Sure. Whatever. Let’s do this.

private final static String Makes = "Acura|Audi|BMW|Bentley|Buick|Cadillac|Chevrolet|Chrysler|Dodge|Ford|GMC|Honda|Hummer|Hyundai|Infiniti|Jaguar|Jeep|Kia|Land Rover|Lexus|Lincoln|MINI|Maserati|Maybach|Mazda|Mercedes-Benz|Mercury|Mitsubishi|Nissan|Pontiac|Porsche|Rolls Royce|Saab|Saturn|Scion|Smart|Subaru|Suzuki|Tesla|Toyota|Volkswagen|Volvo";

public static Vehicle parseString(String vehicleString) {
    Pattern pattern;
    Matcher m;
    //Sample: 2009 Kia Spectra EX
    pattern = Pattern.compile("(\\d{4}) (" + Makes + ") ([A-Za-z0-9]+) ([\\w]+)");
    m = pattern.matcher(vehicleString);
    if (m.matches()) {
        return new Vehicle(m.group(1), m.group(2), m.group(3), m.group(4), null);
    }

    //Sample: 1999 Kia Spectra
    pattern = Pattern.compile("(\\d{4}) (" + Makes + ") ([\\w\\s]+)");
    m = pattern.matcher(vehicleString);
    if (m.matches()) {
        return new Vehicle(m.group(1), m.group(2), m.group(3), null, null);
    }
    throw new IllegalArgumentException("Cannot parse vehicle string: " + vehicleString);
}

I’m starting to feel vaguely nauseated, but I’ve got a vehicle parser that can handle the one test case I’m trying to put together for a demo. It’s held together with twine and duct tape, but it parses reliably. I can tell because, of course, I wrote unit tests:


Anyway, the Kia doesn’t have options, so we’ll just move on for now. (This is hard for me: it’s wrong and I know it but I want to get to a working demo this week, so I have to force myself to leave it alone).

Right, so I’ve got a vehicle. Now I need to navigate through the purchase funnel. This is where the framework saves me:

 LocationPage locationPage = (LocationPage) new LocationPage(driver).navigateTo();
 VehiclePage vehiclePage = (VehiclePage) locationPage.clickNext();
 //Skip vehicle page and tire coach

All that logic was pre-existing, lifted right out of one of our existing tests. Except, you may have noticed one tiny problem: where did that Driver come from?

For now, I just construct one the old-fashioned way:

WebDriver driver = new FirefoxDriver();

Great, it works, we navigate. Now what?

One test case I heard the BA complaining about regressing often had to do with our product comparison feature: when you added product A, then product B, then product C, then hit “compare”, it should list them in the order A, B, C on the comparison page, but it kept doing them in the order that the API happened to return them, which was arbitrary. It had regressed a few times from simple mistakes, and she never remembered to test it, so she’d benefit from a test that could verify it quickly.

So I figure, okay, we’ll need to add a product to the compare widget:

@When("I add ([\\w\\s]+) to the compare widget")
public void I_add_to_compare_widget(String product) {
    WebDriver driver = new FirefoxDriver();
    ProductPage productPage = new ProductPage(driver);
    Product p = getProductByName(product, productPage);
    // ...
}

And click compare:

@When("I click compare")
public void I_click_compare() {
    WebDriver driver = new FirefoxDriver();
    ProductPage productPage = new ProductPage(driver);
    // ...
}

And verify the position:

@Then("([\\w\\s]+) should be in the (left|right|middle) container")
public void item_in_compare_bucket(String productName, String position) {
    WebDriver driver = new FirefoxDriver();
    ComparePage productComparePage = new ComparePage(driver);
    int slotNum = 0;
    if (position.equalsIgnoreCase("left")) {
        slotNum = ComparePage.LEFT_SLOT;
    }
    if (position.equalsIgnoreCase("right")) {
        slotNum = ComparePage.RIGHT_SLOT;
    }
    if (position.equalsIgnoreCase("middle")) {
        slotNum = ComparePage.CENTER_SLOT;
    }
    Product actual = productComparePage.getProductInSlot(slotNum);
    assertEquals(actual.getName(), productName);
}

I’m sure by now you’re screaming at me; the mistake is a newbie one, but it’s glaring and obvious once you know what to look for. You see, each of those drivers will drive separate instances of the browser; it’ll open three windows, and be very confused when it’s not on the right page at all.

What I need now is one of the harder problems with Cucumber: shared state. Somehow, I have to persist the driver between steps, but not between tests that happen to run in parallel (and we do a lot of parallelization of our webdriver tests, as they’re slow and clunky).

For now, I’ll pray that the parallelization engine properly constructs a new instance of my step class for each test, and make it a class variable:

public class sampleStepDefs {
    WebDriver driver;
    // ...step definitions...
}

And while I’m at it, I’ll move driver construction to a method, and swap it out for our remoteWebDriver boilerplate code (hardcoded to localhost for now, as I don’t want to get into configuration just yet):

private WebDriver getDriver() throws IOException {
    DesiredCapabilities capabilities = new DesiredCapabilities();
    capabilities.setCapability(CapabilityType.SUPPORTS_LOCATION_CONTEXT, true);
    capabilities.setCapability("autoAcceptAlerts", true);
    WebDriver driver = new RemoteWebDriver(new URL("http://localhost:4444/wd/hub"), capabilities);
    return driver;
}

Oh, and of course, it throws a MalformedURLException in case the hardcoded URL that’s worked every other time somehow stops working. Which means I need to catch that or bubble it up:

@Given("I have searched for tires for a ([\\w\\s]+)")
public void I_have_found_a_product(String searchCriteria) throws IOException {
    if (driver == null) {
        driver = getDriver();
    }
    Vehicle vehicle = Parser.parseVehicleString(searchCriteria);

    LocationPage locationPage = (LocationPage) new LocationPage(driver).navigateTo();
    VehiclePage vehiclePage = (VehiclePage) locationPage.clickNext();
    //Skip vehicle page and tire coach
}

I also now have a distinction between which steps are allowed to be Givens (and thus construct a WebDriver) and which are only Whens and Thens (which do not). I’m imposing arbitrary rules above and beyond the domain language, and it’s awful; I’ll have to put some thought around a better way to enact this. But now my brain is firmly gathering wool, chasing every little optimization, and I still don’t have a working demo just yet.

This writeup glosses over the half hour I spent with an online regex tester perfecting the regexes you saw above; do not, however, let that fool you: you’ll need to be good at regex to make this work. Essentially, the more flexible you want to be for your users, the more you need to get into natural language processing, which is a skill I would never expect an automation engineer to possess. Don’t we have enough domain skills to pick up without adding entire fields of study to our toolbox? So regex it is, and forcing our users to bend to fit our molds, which goes against everything we know about usability. But what can we do about it, really?

All that work for this:

Feature: Comparison

  Scenario: Add to compare page
    Given I have searched for tires for a 2009 Kia Spectra LX
    When I add Assurance Fuel Max to the compare widget
    And I add Precision Sport to the compare widget
    And I add AVID Ascend to the compare widget
    And I click compare
    Then Assurance Fuel Max should be in the left container
    And Precision Sport should be in the middle container
    And AVID Ascend should be in the right container

Is it worth it? Is this something our business users or BAs can even produce? There’s a hidden rigidity behind the deceptively fluid language, a whole world of rules they have to learn and memorize. But if we can offload some of the cognitive load to them, doesn’t that let us solve more of the hard problems? I don’t have answers here. I just want to be clear about what we’re doing: involving BAs in the process of test automation, not handwaving the entire process away with a magic wand.

Teatime: The Three Ways of Devops

Welcome back to Teatime! This is a weekly feature in which we sip tea and discuss some topic related to quality. Feel free to bring your tea and join in with questions in the comments section.

Tea of the week: I’m feeling herbal today, chilling out with a refreshing cup of mint tea. I grow my own mint, but you can find good varieties at your local grocery store in all probability. I like mine with honey and lemon. 

Today’s topic: The Three Ways of Devops

DevOps is one of my favorite topics to touch on. I firmly feel that the path to better quality software lies through the DevOps philosophy, even more so than through good QA. After all, finding bugs in finished software can only do so much; it’s better to foster a mindset in developers that prevents them from occurring in the first place.

What is DevOps?

DevOps is essentially a cultural phenomenon that aims to build a working relationship and harmony between development, QA, and operations. You can’t buy DevOps; you can’t hire DevOps. You have to shift your culture to follow the tao of DevOps, one step at a time. Anyone trying to sell you instant DevOps is lying, and you should avoid them like the plague. In that way, it’s a lot like Agile 🙂

DevOps is a logical extension of Agile, in fact. It was developed from the theories laid out in The Phoenix Project, a book I suggested reading a few weeks back, and made famous by a talk at the Velocity conference called “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr” by John Allspaw and Paul Hammond.

You can see in the below graphic how it sort of extends the agile cycle into an infinity symbol:

For teams that are already agile, the jump is pretty straightforward: the definition of “done” now means, not just tested, but released, working in production. It adds a mindset that devs should absolutely care about how their code is doing in production, that the feedback loop should go all the way to the end user and back to the dev.

The First Way

Let me explain each way with a graphic designed by Gene Kim, one of the authors of The Phoenix Project:

The first way is the left-to-right flow of development to release. It promotes systems thinking, trying to get people in each stage of the way (us QA folks are about halfway along that line) to consider not just their own piece in the assembly line, but the line as a whole, maximizing for global efficiency rather than local. What good is testing quickly if the devs can’t get you another release until two weeks after you finish testing? What good is finding and fixing a ton of bugs if the user won’t see the fixes for two years? What good is a million passing tests if the application crashes after three hours in prod?

The principles necessary here to enact the first way are:

  • Small batch sizes. Don’t release huge projects once a year, release small chunks of functionality regularly.
  • Continuous build. If it doesn’t build, it won’t go to prod.
  • Continuous integration. If your code builds alone but not when your neighbor’s code is added, it’ll blow up in production, because it’s only ever been tested in isolation.
  • Continuous deployment. Get the code on a server ASAP so it can be tested.
  • Continuous testing. Unit tests, integration tests, functional tests; we want to know ASAP when something’s broken.
  • Limit the amount of work in progress; too many balls in the air means you never know what’s moving and what’s stalled.

The biggest rule of the first way: never pass defects downstream. Fix them where you find them.

The Second Way

The first way was the left-to-right flow of software; the second way concerns the right-to-left flow of information back down to development:

Information needs to flow back from QA to dev, back from staging to dev, back from deployment to dev, back from production to dev. The goal is to keep problems from recurring when we’ve already solved them once. We need to foster faster detection of problems, and faster recovery when they occur. As QA professionals, we know you have to create quality at the source; you can’t test it in. In order to do that, we need to embed knowledge where it can do the most good: at the source.

In order to achieve the second way, you need to:

  • Stop the production line when tests fail. Getting that crucial information back to development is key, and fixing problems before moving on to new development is crucial to ensure quality.
  • Elevate the improvement of work over the work itself. If there’s a tool you could spend 2 hours building that would save 10 hours per week, why wouldn’t you spend the two hours and build it?
  • Fast automated test suites. Overnight is too slow; we need information ASAP, before the dev’s attention span wanders.
  • Shared goals and shared pain between dev and ops. When ops has a goal to keep the system up, dev should be given the same goal; when the system goes down, devs should help fix it. Their knowledge can be highly useful in an incident.
  • Pervasive production information. Devs should know at all times what their code is doing in the wild, and be keeping an eye on things like performance and security.

The Third Way

We have a cycle, it’s very agile, we’re testing all the things; what more could we possibly learn? The Third Way, it turns out, is fostering a culture of experimentation and repetition in order to prepare for the unexpected and continue to innovate.

We have two competing concerns here in the third way, the yin and yang of innovation: how do we take risks without hurting our hard-won stability? The key is repetitive practice. It’s the same principle behind fire drills: when you do something regularly, it becomes second nature, and you’re ready to handle it when it happens for real. That leaves you free to experiment and take risks, falling back on your well-practiced habits to save you when something goes wrong.

Key practices are:

  • Promote risk taking over mindless order taking
  • Create an environment of high trust over low trust
  • Spend 20% of every cycle on nonfunctional requirements, such as performance and security concerns
  • Constantly reinforce the idea that improvements are encouraged, even celebrated. Bring in cake when someone achieves something. Make it well known that they’ve done something good.
  • Run recovery drills regularly, making sure everyone knows how to recover from common issues
  • Warm up with code kata (or test kata) daily


Where are you in your journey toward DevOps? Have you begun? What can you take from this back to your daily life?

Teatime: Performance Testing Basics with jMeter

Welcome back to Teatime! This is a weekly feature in which we sip tea and discuss some topic related to quality. Feel free to bring your tea and join in with questions in the comments section.

Tea of the week: Black Cinnamon by Sub Rosa tea. It’s a nice spicy black tea, very like a chai, but with a focus on cinnamon specifically; perfect for warming up on a cold day.

Today’s topic: Performance Testing with jMeter

This is the second in a two-part miniseries about performance testing. After giving my talk on performance testing, my audience wanted to know about jMeter specifically, as they had some ideas for performance testing. So I spent a week learning a little bit of jMeter, and created a Part 2. I’ve augmented this talk with my learnings since then, but I’m still pretty much a total novice in this area 🙂

And, as always when talking about performance testing, I’ve included bunnies.

What can jMeter test?

jMeter is a very versatile application; it works at the raw TCP/IP layer, so it can exercise both your basic websites (by issuing a GET over HTTP) and your API layer (with SOAP or XML-RPC calls). It ships with a JDBC connector, so it can test your database performance specifically, and it comes with basic configurations for testing LDAP, email (POP3, IMAP, or SMTP), and FTP protocols. It’s pretty handy that way!

A handy bunny

Setting up jMeter

The basic unit in jMeter is called a test plan; you get one of those per file, and it outlines what will be run when you hit “go”. In that plan, you have the next smaller unit: a thread group. Thread groups contain one or more actions and zero or more reporters or listeners that report out on the results. A test plan can also have listeners or reporters that are global to the entire plan.

The thread group controls the number of threads allocated to the actions underneath it; in layman’s terms, how many things it does simultaneously, simulating how many users. There are also settings for a ramp-up time (how long it takes to go from 0 users/threads to the total number) and the number of executions. The example in the documentation lays it out like so: if you want 10 users to hit your site at once, and you have a ramp-up time of 100 seconds, each thread will start 10 seconds after the previous one, so that after 100 seconds you have 10 threads going at once, performing the same action.
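That ramp-up arithmetic is easy to sketch; here’s the documentation’s example expressed as a quick JavaScript calculation (purely illustrative, not jMeter code):

```javascript
// Each thread starts rampUpSeconds / numThreads after the previous one.
function threadStartTimes(numThreads, rampUpSeconds) {
  const interval = rampUpSeconds / numThreads;
  return Array.from({ length: numThreads }, (_, i) => i * interval);
}

// 10 users ramping up over 100 seconds: starts at 0 s, 10 s, ... 90 s.
console.log(threadStartTimes(10, 100));
```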

Actions are implemented via a unit called a “Controller”. Controllers come in two basic types: samplers and logic controllers. A sampler sends a request and waits for the response; this is how you tell the thread what it’s doing. A logic controller performs some basic logic, such as “only send this request once ever” (useful for things like logging in) or “alternate between these two requests”.
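To make those two controller behaviors concrete, here’s a rough JavaScript analogy (this is not jMeter code, just a sketch of the concepts):

```javascript
// "Once Only" controller: runs its action the first time, then never again.
function onceOnly(action) {
  let ran = false;
  return function () {
    if (!ran) {
      ran = true;
      return action();
    }
  };
}

// "Alternate" controller: switches between two actions on each call.
function alternate(a, b) {
  let useA = true;
  return function () {
    const result = useA ? a() : b();
    useA = !useA;
    return result;
  };
}

const login = onceOnly(() => 'logged in');               // e.g. log in exactly once
const request = alternate(() => 'GET /a', () => 'GET /b'); // then alternate requests
```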

Multiple bunnies working together

You can see here an example of some basic logic:


In this example, once only, I log in and set my dealership (required for this request to go through successfully). Then, in a loop, I send a request to our service (here called PQR), submitting a request for our products. I then verify that there was a successful return, and wait 300 milliseconds between requests (to simulate the interface doing something with the result). In the thread group, I have it set to one user, ramping up in 1 second, looping once; this is where I’d tweak it to do a proper load test, or leave it like that for a simple response check.

Skeptical bunny is skeptical

In this test, I believe I changed it to 500 requests and then ran the whole thing three times until I felt I had enough data to take a reasonable average. The graph results listener gave me a nice, easy way to see how the results were trending, which gave me a feel for whether or not the graph was evening out. My graph ended up looking something like this:

The blue line is the average; breaks are where I ran another set of tests. The purple is the median, which you can see is basically levelling out here. The red is the deviation from the average, and the green is the requests per minute.

Good result? Bad result? Bunnies.

Have you ever done anything with jMeter, readers? Any tips on how to avoid those broken graph lines? Am I doing everything wrong? Let me know in the comments 🙂

Teatime: Testing Application Performance

Welcome back to Teatime! This is a weekly feature in which we sip tea and discuss some topic related to quality. Feel free to bring your tea and join in with questions in the comments section.

Tea of the week: Still on a chai kick from last week, today I’m sipping on Firebird’s Child Chai from Dryad Teas. I first was introduced to Dryad Tea at a booth at a convention; I always love being able to pick out teas in person, and once I’d found some good ones I started ordering online regularly. It’s a lovely warm chai, with a great kick to it. 

Today’s topic: Testing Application Performance

This is the first in a two-part miniseries about performance testing. The first time I gave this talk, it was a high-level overview of performance testing, and when I asked (as I usually do) if anyone had topic requests for next week, they all wanted to know about jMeter. So I spent a week learning a little bit of jMeter, and created a Part 2.

This talk, however, remains high-level. I’m going to cover three main topics:

  • Performance Testing
  • Load Testing
  • Volume Testing

I have a bit of a tradition when I talk about performance testing: I always illustrate my talks with pictures of adorable bunnies. After all, bunnies are fast, but also cute and non-threatening. Who could be scared of perf tests when they’re looking at bunnies?

White Angora Bunny Rabbit
Aww, lookit da bunny!

Performance Testing

Performance testing is any testing that assesses the performance of the application. It’s really an umbrella over the other two categories, in that load testing and volume testing are both types of performance testing. However, when used without qualifiers, “performance testing” typically means measuring the response time under a typical load, to determine how fast the application will perform in the average, everyday use case.

You can measure the entire application, end to end, but it’s often more valuable to test small pieces of functionality in isolation. Typically, we do this the same way a user would: we make an HTTP request (for a web app), or a series of HTTP requests, and measure how long it took to come back. Ideally, we do this a lot of times, to simulate a number of simultaneous users of our site. For a desktop application, instead of simulating average load, we’re concerned with average hardware: we run the application on an “average” system and measure how long it takes to, say, repaint the screen after a button click.
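The core measurement loop is simple to sketch in Node (illustrative only; the `action` callback is a stand-in for whatever request you’re actually timing):

```javascript
// Run an async action `samples` times and return the average elapsed
// time in milliseconds. Real tools also report median and deviation.
async function averageTime(action, samples) {
  const times = [];
  for (let i = 0; i < samples; i++) {
    const start = process.hrtime.bigint();
    await action();
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
    times.push(elapsedMs);
  }
  return times.reduce((sum, t) => sum + t, 0) / times.length;
}

// Example: time a dummy 5 ms "request" ten times.
averageTime(() => new Promise((resolve) => setTimeout(resolve, 5)), 10)
  .then((avg) => console.log(`average: ${avg.toFixed(1)} ms`));
```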

But what target do you aim for? The Nielsen Norman Group outlined some general guidelines:

  • One tenth of a second response time feels like the direct result of a user’s action. When the user clicks a button, for example, the button should animate within a tenth of a second, and ideally, the entire interaction should complete within that time. Then the user feels like they are in control of the situation: they did something and it made the computer respond!
  • One second feels like a seamless interaction. The user did something, and it made the computer go do something complicated and come back with an answer. It’s less like moving a lever or pressing a button, and more like waiting for an elevator’s door to close after having pressed the button: you don’t doubt that you made it happen, but it did take a second to respond.
  • Ten seconds and you’ve entirely lost their attention. They’ve gone off to make coffee. This is a slow system.
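If you want your test suite to flag regressions against those guidelines, the thresholds are trivial to encode; here’s a hypothetical helper using the numbers above (the labels are my own shorthand):

```javascript
// Bucket a measured response time (in ms) using the Nielsen Norman
// Group guidelines discussed above.
function perceivedSpeed(ms) {
  if (ms <= 100) return 'instantaneous'; // feels like direct manipulation
  if (ms <= 1000) return 'seamless';     // noticeable, but flow unbroken
  if (ms <= 10000) return 'sluggish';    // the user is actively waiting
  return 'abandoned';                    // they've gone off to make coffee
}

console.log(perceivedSpeed(80));   // a snappy button click
console.log(perceivedSpeed(2500)); // a slow page load
```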
Baby Bunny Rabbit

Load Testing

Load testing is testing the application under load. In this situation, you simulate the effects of a number of users all using your system at once. This generally ends up having one of two goals: either you’re trying to determine what the maximum capacity of your system is, or you’re trying to figure out if the system gracefully degrades when it exceeds that maximum capacity. Either way, you typically start with a few users and “ramp up” the number of users over a period of time. You should figure out before you begin what the maximum capacity you intend to have is, so you know if you’re on target or need to do some tuning.

Like the previous tests, this can be done at any layer; you can fire off a ton of requests at your API server, for example, or simulate typical usage of a user loading front-end pages that fire requests at the API tier. Often, you’ll do a mix of approaches: you’ll generate load using API calls, then simulate a user’s degraded experience as though they were browsing the site.

There’s an interesting twist on load testing where you attempt to test for sustained load: what happens to your application over a few days of peak usage rather than having a few hours and then a downtime in between? This can sometimes catch interesting memory leaks and so forth that you wouldn’t catch in a shorter test.

Bunny under load

Volume Testing

While load testing handles the case of a large amount of users at once, volume testing is more focused on the data: what happens when there’s a LOT of data involved. Does the system slow down finding search results when there’s millions or billions of records? Does the front-end slow to a crawl when the API returns large payloads of data? Does the database run out of memory executing complex query plans when there’s a lot of records?
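One practical prerequisite for volume testing is generating all that data in the first place. A throwaway seed-data generator can be a few lines; this sketch uses an invented order-record shape, so match it to your real schema:

```javascript
// Generate n fake order records to bulk-load before a volume test.
// The field names here are made up for illustration.
function makeOrders(n) {
  const orders = [];
  for (let i = 0; i < n; i++) {
    orders.push({
      id: i,
      customer: 'customer-' + (i % 1000), // cycle through 1000 customers
      total: ((i % 500) + 1) / 100,       // totals from $0.01 to $5.00
    });
  }
  return orders;
}

console.log(makeOrders(1000000).length); // a million records, ready to load
```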

A high volume of bunnies

Do you load test at your organization? How have you improved your load testing over the years? What challenges do you face?

Teatime: SQL Testing

Welcome back to Teatime! This is a weekly feature in which we sip tea and discuss some topic related to quality. Feel free to bring your tea and join in with questions in the comments section.

Tea of the week: I like a nice, spicy chai in the winter to warm me up, especially when I’m grappling with rough questions like today’s topic. Rather than subject myself to excessive caffeine on a cold afternoon, I’m sipping on Rooibos Chai from Sub Rosa Tea. Just the smell alone makes me more alert and ready to take on the day, and the flavor does not disappoint! 

Today’s Topic: Testing SQL – How and why?

In today’s teatime, I wanted to touch on an often-overlooked part of the stack: the database. At my current company, we have a separate development team that focuses entirely on stored procedures and database development, above and beyond the usual operational DBAs. And yet, before I became the QA coordinator, all our database testing was still manual! I set about researching how we could test the business logic stored in our stored procedures.

Two main approaches

There are two overarching approaches to database testing. The one I chose not to focus on due to our company’s setup was to test the database from outside the database. This is particularly popular in .net shops, as Visual Studio includes many tools that make it easy to unit test your stored procedures in the same suite you’re using to unit test your code. I would imagine this would also be useful in a Node shop, as you could do much the same thing. Most of our database access comes from our Coldfusion API layer, which is a little more challenging to set tests up in; furthermore, the Coldfusion code was maintained by different people than the SQL, and the SQL team was not comfortable enough in Coldfusion (or .Net) to write their tests there.

The other approach, the one I will be focusing on in this talk, is to test the database from within the database: using something like the popular tSQLt framework to write tests in SQL that test your SQL. This is similar to how unit testing works in other layers; it’s very rare to see unit tests written in a different language than the code under test. Furthermore, you can keep the unit tests right next to the code, just like you would in other layers. It reduces the overhead of context-switching between writing code and writing SQL, which is great when you specialize in SQL itself.

How to write unit tests

In any language, a unit test has three basic phases:

  • Arrange the environment by performing any setup steps or preconditions that are required,
  • Act on the system, usually by invoking the item under test, and
  • Assert that the result was within acceptable parameters.

In this sample unit test (from the tSQLt documentation), you can see the steps in action:

CREATE PROCEDURE testFinancialApp.[test that ConvertCurrency converts using given conversion rate]
AS
BEGIN
    DECLARE @actual MONEY;
    DECLARE @rate DECIMAL(10,4);
    SET @rate = 1.2;
    DECLARE @amount MONEY;
    SET @amount = 2.00;
    SELECT @actual = FinancialApp.ConvertCurrency(@rate, @amount);
    DECLARE @expected MONEY;
    SET @expected = 2.4; --(rate * amount)
    EXEC tSQLt.AssertEquals @expected, @actual;
END;

First we arrange the environment by declaring some variables, setting them to the amounts needed for the test. We act by calling the procedure (FinancialApp.ConvertCurrency), and then we assert that the actual response was what we expected (with a comment about why we expected it to be that).

Note how the expected result is a solid number, not the result of doing some math. If the math were wrong in the procedure, duplicating the logic here wouldn’t help us test anything. Instead, work the algorithm by hand, coming up with the expected outcome, and hard-code it into the test. That ensures that no mistakes were made implementing the algorithm as it was on paper.

One of the things you’re not seeing is that when this is run, it’s wrapped in a transaction, which is automatically rolled back at the end of the execution. This prevents any side effects from affecting your data, such as insertion of records into a table. The library also provides functions for mocking out tables and stubbing other functions, which I can cover in a future teatime.

But Why?

But why would you want to test stored procedures and functions? To me, it’s pretty straightforward: if there’s business logic there, it needs to be tested. But if you’re not already convinced, here’s some talking points to mull over:

  • Code that is unit tested ends up being cleaner, more efficient, and easier to refactor. This is well documented in terms of program code, but it’s also been examined for database code as well; for example, see this blog post about test-driven database development, or this one, or this one.
  • Tests provide living documentation of the expectation of the code. This is also true of stored procedures, some of which can run into dozens or hundreds of lines, with dizzying amounts of table joins and complex branching. A simple suite of tests can easily tell a new developer what exactly they’re looking at — and ensure that they didn’t break anything.
  • You can plug the tests into a development pipeline like we discussed last week for instant feedback upon committing your stored procedures. That only works if your procs are in version control, but of course they already are, right? 🙂


Do you test your database? Why or why not? Discuss in the comments 🙂

Teatime: Deployment Pipelines

Welcome back to Teatime! This is a weekly feature in which we sip tea and discuss some topic related to quality. Feel free to bring your tea and join in with questions in the comments section.

Tea of the week: An old standby, Twinings Ceylon Orange Pekoe. There’s no orange flavor in it; the Orange refers to the size of the leaves. It’s a good staple tea I can find in my local supermarkets, solid and dependable — just like a deployment pipeline should be. 

Deployment Pipelines

Today’s topic is a little more afield from last week’s discussion of testing types, but I feel it firmly falls under the umbrella of quality. A good deployment pipeline, as you will see shortly, improves the maintainability of code and prevents unwanted regressions.

Like last week, much of this talk touches on concepts laid out in  Continuous Delivery by Jez Humble and David Farley. If your company isn’t already performing continuous delivery, I highly recommend the book, as it talks through the benefits and how to get there in small increments. In the book, they lay out a simple goal:

“Our goal as software professionals is to deliver useful, working software to users as quickly as possible”

Note that they said “software professionals”, not developers. After all, isn’t that the ultimate goal of SQA as well? And of the BAs and project managers?

Feedback Loops

In order to achieve the goal — delivering software that is both useful and working — Humble and Farley suggest that there needs to be a tight feedback loop of information about how well the software works and how useful it is to the end user delivered back to the development team so they can adjust their course in response. In order to validate traditional software, one typically has to build it first; they advocate building the software after every change so that the build is always up to date and ready for validation. Automate this process, including the delivery of build results to the development team, and you have created a feedback loop — specifically, the first step in a deployment pipeline.

Automated Deployment

In order to validate software that builds correctly, it must be installed, either on an end-user-like testing machine or to a web server that will then serve up the content (depending on the type of software). This, too, can be automated — and now you’ve gained benefits for the development team (who get feedback right away when they make a change that breaks the installation) as well as the testing team (who always have a fresh build ready to test). Furthermore, your infrastructure and/or operations teams have benefits now as well; when they need to spin up a new instance for testing or for a developer to use, they now can deploy to it using the same automated script.

Automated deployment is a must for delivering working software. The first deploy is always the most painful; when it’s done by hand at 2am in production, you’ve already lost the war for quality. Not only should your deploys be automated, they should be against production-like systems, ideally created automatically as well (humans make mistakes, after all).

Continuous Testing

And now we see how this pipeline connects to QA’s more traditional role: testing. Once we have the basic structure in place, typically using a CI Server to automatically build on every commit, we can start adding automatic quality checks into the process to give development feedback on the quality of the code they’ve committed. This can include static checks like linting (automated maintainability checking) as well as simple dynamic tests like unit tests or performance tests. Ideally, however, you want to keep your feedback loop tight; don’t run an eight-hour automated regression suite on every commit. The key is to get information back to the developer before they get bored and wander off to get coffee 🙂

Essential Practices

In order to make this really work for your organization, there are a number of practices that must be upheld, according to the authors of Continuous Delivery. These are basic maintenance sort of things, required for code to keep the level of quality it has over time. They are:

  • Commit early, commit often. Uncommitted code can’t be built, and thus, can’t be analysed.
  • Don’t commit broken code. Developers love to “code first, test later”, and, if they’re not used to this principle, tend to commit code with broken unit tests, intending to go back and clean it up “later”. Over time, the broken windows of old failing tests inoculate people against the warning tests can give. They become complacent; “oh, that always fails, pay it no mind”, they say, and then you might as well not have tests at all.
  • Wait for feedback before moving on. If your brain’s on the next task already, you’ll file away a broken unit test under the “I’ll fix it later” category, and then the above will happen. Especially, never go home on a broken build!
  • Never comment out failing tests. Why are they failing? What needs to be fixed? Commenting them out removes all their value.


Do any of you use continuous testing and/or a deployment pipeline? Maybe with software like Jenkins, Travis CI, or Bamboo? Let’s chat in the comments!

Teatime: What kinds of testing should I do?

Welcome back to Teatime! This is a weekly feature in which we sip tea and discuss some topic related to quality. Feel free to bring your tea and join in with questions in the comments section.

Tea of the week: It’s been a stressful (and cold!) week as I pre-write this, so I’m chilling out with some Toasted Caramel Rooibos from Sub Rosa Teas


Today’s topic: What kind of testing should I do?

For the first QA-content-filled teatime, I wanted to start at the beginning and touch briefly on what kinds of testing there are, and when to use each of them. This is not intended to be an exhaustive list, more of a gentle overview while we sip tea and meditate on our own projects and the testing each one needs.

Things to consider

When you’re putting together a test plan, and you’re considering performing various types of tests, you should consider the following:

  • What is the purpose of this kind of test?
  • In what situations is this test most suited?
  • What differentiates this test type from others?
  • Why are we performing this test?

Testing without purpose is not so much testing as playing; if there’s no business need and no reason to perform the test, all you’re doing is wasting your time. Likewise, if your needs are already covered, adding more types of tests won’t, by definition, help anything: further testing is just more wasted time.

The Testing Quadrants

test quadrants

I came across a version of this diagram in the book Continuous Delivery by Jez Humble and David Farley, and I really liked it, so I’ve recreated my own version here. There are two axes represented above, both equally important. On the horizontal axis, testing can either support development efforts, which is to say, it can provide input into the ongoing effort of building the software to correct the course by small degrees; or, it can critique a finished product, producing a feedback loop for development of future enhancements and/or future products. On the vertical axis, tests can be developer-facing, giving feedback to the development team on the project, or they can be user-facing, giving feedback to the BA from a user’s perspective of the software.

Therefore, in this model, acceptance tests are user-oriented tests; do not write acceptance tests for things that are invisible to the user! On the other hand, they support development, so we want to run them as early as possible to tighten the feedback loop so the devs can course-correct. Which argues for an Agile approach 🙂

Unit tests are similar to Acceptance Tests in that they support development and so should be written and executed as early as possible, but they are developer-facing, so they should absolutely test for things that only developers can see, like method signatures and code organization.

Exploratory tests are like Acceptance Tests in that they are user-facing, and thus should only focus on things the end user can see; however, they are a critique, intended to find things when the product is in a stable, “release candidate” state. They serve to prove to the business that what we built meets their expectations, not to aid developers in building it.

And finally, Nonfunctional acceptance tests allow us to critique the product from a development standpoint: now that we know it works (because of Unit tests), we need to see if it’s performant, secure, et cetera.

What about regression testing?

You may have noticed that regression testing isn’t in any of the four quadrants. It doesn’t really fit into this graph; it lies orthogonal to it, or envelops the entire thing. Regression testing is simply the act of verifying that requirements which previously tested as working have not been broken since the test was last run. Without qualifiers, we typically mean functional regression testing, which is simply running old acceptance or unit tests again to ensure that the functionality was not broken. You can also, however, perform nonfunctional regression testing, say, to verify that software that was previously fast has not gotten slower, or that software that previously installed on Windows still does after being enhanced to run on Linux. Exploratory regression testing would be exploratory testing of areas that have not changed in the most recent version.

So what kinds should I do?

All of them 🙂

Honestly, only you can answer the question of what types of testing are right for your business needs, your product, your development team, your release cycle, et cetera. Hopefully, however, you have some ideas now that you’ve paused to consider the types available and what goals they fulfill. So, you tell me: what kinds of testing will you do?

Testing Socksite: Functional tests with Node.JS


As you may know if you’ve ever browsed my GitHub account, I am a member of a tiny open-source organization called SockDrawer. Odds are, we don’t make anything you’ve heard of; we’re just a group of individuals that all belong to a forum and wanted to make tools to enhance the forum-going experience. SockDrawer began when the founder, Accalia, wanted to make a platform for easily creating bots that interact with up-and-coming forum software Discourse. She named her platform SockBot, after the forum term “sock puppet”, and soon had more help than she could easily organize from other forumgoers who wanted to pitch in.

My connection with SockDrawer came when Accalia solicited some advice on how to unit test SockBot. The architecture wasn’t designed well for testability; since I work in QA, I had plenty of advice to dispense. Furthermore, she wanted help writing documentation; technical writing is also something I’m somewhat interested in, so I joined Sock Drawer and stuck around.

The Ticket

The ticket that generated today’s adventure was filed by Onyx, the UX expert for SockDrawer. It was a feature request for another product, Sock Site, which is a website that is used to monitor the forum’s uptime; the production version can be seen at www.isitjustmeorservercooties.com if you want to follow along.

The ticket was issue number 45: “No method of simulating server cooties” (“server cooties” means, loosely translated, that the forum in question is behaving incredibly slowly or not responding due to an unknown cause). The text:

We are missing a method of simulating server cooties outside of calling one of the status endpoints. This is not really useful for testing live update issues.

A good solution might be a way to set a “delay” variable at runtime. Value of this variable could then be added to the actual measured time (in ms). Setting this variable per tested endpoint would be nice, but not essential.

Of course, I immediately threw out the suggested solution :). To me, the best way to simulate server cooties was to mock the data coming from the server, putting the site into an artificially induced yet realistic state, sort of an emergency drill. The best way to ensure that the frontend responds correctly to what the backend is doing is to codify those scenarios in the functional tests, thereby removing the burden of manual regression testing when making front-end changes.

Webdriver bindings

I have used Selenium Webdriver for functional testing before, so I googled to find a Node library that would expose its bindings. Webdriver.io was my first attempt; however, its interface is so radically different from the standard one that I found myself constantly frustrated by the need to refer to the docs to write anything. What it did well was abstracting the creation of the browser and the cleanup, so that I didn’t have to write that code. Ultimately, though, I abandoned it and returned to the standard selenium-webdriver library.

I knew that I’d eventually want to use a remote webdriver service, particularly if we ever wanted to use BrowserStack or some third party source for webdriver. Wouldn’t it be cool, I thought, if when we’re not running from CI, we launched the server portion of the remote webdriver automatically? This proved to be frustratingly difficult using the selenium-webdriver library. I almost gave up — until I found selenium-standalone. This was hands-down the easiest library for controlling webdriver I’ve ever used, and I intend to bring it back to my workplace and suggest we start using it immediately. It contains commands to automatically install and launch the selenium server along with various browser drivers such as the add-on driver for Chrome or Internet Explorer. I eventually moved the install out of the scripts, figuring it could be done during the setup before the tests were run.

Using Mocha, this made my before method nice and clean:

before('init browser session', function(done) {
    socksite.log = function() {}; // no-op log function == quiet mode
    socksite.start(8888, 'localhost', function() {
        selenium.start(function(err, child) {
            if (err) throw err;
            driver = new webdriver.Builder()
                .usingServer('http://localhost:4444/wd/hub') // selenium-standalone's default
                .forBrowser('firefox')
                .build();
            // In order to know when we're ready to run the test, we have to load a page;
            // there's no "on browser ready" to hook into that I've found.
            driver.get('http://localhost:8888').then(function() { done(); });
        });
    });
});
And the first test:

it('should be running', function(done) {
    driver.getTitle().then(function(title) {
        assert.strictEqual(title, 'Is it just me or server cooties?', "Should have the right title");
        done();
    });
});

The teardown code had a similar issue: it needs to be async so that the teardown completes before Mocha exits:

after('end browser session', function(done){
    // quit() is asynchronous; signal done only after the browser has closed
    driver.quit().then(function() { done(); });
});

Mocking data

Now, I needed to be able to mock the up and downtime of the site. Because we’re in Node, and not behind Apache or anything like that, I’ve launched the server in code; I have a handle directly to the application already. But how to feed it false data? Did I need to use Sinon.js to mock out the module that takes the samples? Try to intercept the socket?

It turns out, we have a cache module that stores the latest result. When a page load is requested, the server fetches the data from the cache and embeds it on the page. While we could also emit the fake data using the web socket, that gets us into the messy territory of knowing when the client has received the data and finished updating, so that we can test that it updated correctly. This is worth doing to test the sockets later, but for now, I figured changing the cache and issuing another page load would be sufficient.
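To make the idea concrete, here's a minimal sketch of the cache-swap approach. The names here (cache, renderPage, the status strings) are stand-ins for illustration, not SockDrawer's real API — the point is just that if every page load reads the latest sample from a cache object, overwriting that object is enough to drive the frontend into any state we want to test.

```javascript
// Hypothetical miniature of the real setup: the server keeps the latest
// sample in a cache module, and every page render embeds it.
var cache = { summary: { status: 'Great' } };

// Stand-in for the server's page render, which reads the cache on each load.
function renderPage() {
  return '<img src="/img/is' +
    (cache.summary.status === 'Offline' ? 'discourse' : 'you') + '.png">';
}

var before = renderPage();              // rendered with the real-looking "Great" sample
cache.summary = { status: 'Offline' };  // swap in a fake "down" sample
var after = renderPage();               // the next page load reflects the fake data
```

No sockets, no stubbing library: the test just pokes the cache and reloads the page.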

I encapsulated some data packets in a json file, which I loaded into the variable testData. This let my tests be simple and clean again:

describe('TRWTF', function() {
    it('is You when status is "Great"', function(done) {
        cache.summary = testData.greatData;
        driver.get("localhost:8888").then(function() {
            driver.findElement(webdriver.By.css("#header-image-wrapper img")).getAttribute("src").then(function(value) {
                assert.match(value, /isyou\.png/, "Image should say 'Is you'");
                done();
            });
        });
    });

    it('is Discourse when status is "Offline"', function(done) {
        cache.summary = testData.offlineData;
        driver.get("localhost:8888").then(function() {
            driver.findElement(webdriver.By.css("#header-image-wrapper img")).getAttribute("src").then(function(value) {
                assert.match(value, /isdiscourse\.png/, "Image should say 'Is discourse'");
                done();
            });
        });
    });
});


Now we were getting somewhere! I could see Firefox open, flash through the various statuses, and close again. All I had to do was use Webdriver's screenshot capability to capture images, and we'd have a visual reference for what the site looks like in each of the various cootie configurations.

I created a second file, generateScreenshots.js, and put together a suite that does just that and nothing but that. I'm using Node on Windows, so I needed to use the path library to handle the differing direction of slashes on my machine versus the Linux-based CI server or dev environments other developers were using. I also used path.resolve to generate the folder to save the screenshots to, since it uses the current directory to make relative paths absolute.

Here’s the complete text of the screenshot module:

describe('Taking screenshots...', function() {
    var driver;
    var folder = path.resolve("test", "functional", "screenshots");

    before('init browser session', function(done){
        socksite.log = function() {}; // no-op log function == quiet mode
        socksite.start(8888, 'localhost', function() {
            selenium.start(function(err, child) {
                if (err) throw err;
                driver = new webdriver.Builder()
                    .usingServer('http://localhost:4444/wd/hub')
                    .forBrowser('firefox')
                    .build();
                driver.get("localhost:8888").then(function() { done(); });
            });
        });
    });


    it('when status is "Great"', function(done) {
        cache.summary = testData.greatData;
        driver.get("localhost:8888").then(function() {
            driver.takeScreenshot().then(function(image) {
                fs.writeFile(path.join(folder, 'great.png'), image, 'base64', done);
            });
        });
    });

    it('when status is "Good"', function(done) {
        cache.summary = testData.goodData;
        driver.get("localhost:8888").then(function() {
            driver.takeScreenshot().then(function(image) {
                fs.writeFile(path.join(folder, 'good.png'), image, 'base64', done);
            });
        });
    });

    it('when status is "OK"', function(done) {
        cache.summary = testData.okData;
        driver.get("localhost:8888").then(function() {
            driver.takeScreenshot().then(function(image) {
                fs.writeFile(path.join(folder, 'ok.png'), image, 'base64', done);
            });
        });
    });

    it('when status is "Bad"', function(done) {
        cache.summary = testData.badData;
        driver.get("localhost:8888").then(function() {
            driver.takeScreenshot().then(function(image) {
                fs.writeFile(path.join(folder, 'bad.png'), image, 'base64', done);
            });
        });
    });

    it('when status is "Offline"', function(done) {
        cache.summary = testData.offlineData;
        driver.get("localhost:8888").then(function() {
            driver.takeScreenshot().then(function(image) {
                fs.writeFile(path.join(folder, 'offline.png'), image, 'base64', done);
            });
        });
    });

    after('end browser session', function(done){
        driver.quit().then(function() { done(); });
    });
});


Finally, to make it easy to run, I created some npm commands in the package.json:

"scripts": {
    "test": "npm install selenium-standalone -g && selenium-standalone install && mocha test\\functional\\webdriverTests.js",
    "screenshot": "npm install selenium-standalone -g && selenium-standalone install && mocha test\\functional\\generateScreenshots.js"
}

This lets us run the tests with npm test, and the screenshots with npm run screenshot. Nice and simple for developers to use, or to hook into CI.

In the long run, SockDrawer is moving toward a gulp-based build pipeline, while I'm only familiar with Grunt. Accalia said she'd turn these simple scripts into steps in the eventual gulpfile, letting CI run them. I also want to change the test code to default to PhantomJS, with an optional input to use any browser, so we can run the tests cross-browser in the future. But for an evening's tinkering, I'd say this isn't bad 🙂