Great xUnit Test Suites: the Pre-TDD Conversation

A Burning Issue

We interrupt the current blog thread (the “E” word series) to bring you a burning issue. Well, burning for me, anyway.

I have been working with some other Pillar programmers on systems for helping not-yet-agile programmers learn some best practices. And while many of us in the industry are accustomed to coaching, mentoring, training, and otherwise cajoling people to attempt TDD specifically as a practice, I recently have begun to suspect that in fact, that’s a poor place to start the conversation.

TDD is all about getting a good design, good tests that serve as specifications, and, most critically to my mind, a great xUnit test suite for regression protection. But what is a great xUnit test suite? What does that look like?

I have been finding (but not grokking until recently) that before I can have a TDD conversation with anyone, I really have to have a good conversation about the characteristics and value of a great xUnit suite.

Characteristics of a Great xUnit Test Suite

So when I come across a fresh codebase (I mean fresh to me — it might actually be quite rotten), these are the things I want to see in the xUnit tests. In future posts, I can give these more discussion, and perhaps include code snippets, but for today, it’s just a list:

  • Code coverage is no lower than 85%. (Note: As important as code coverage is — especially for teams new to xUnit best practices — it can be a dangerous narcotic. It can hide bigger problems. It is possible to have a test suite that provides 100% coverage that is about 100% crappy. People do things like comment out all assertions except assertNotNull(blah), and make other poor choices when under pressure to (A) keep the coverage rates up, and (B) get the features out the door.)
  • As much of the testing as possible is accomplished by “isolation tests”: small unit tests that run entirely in memory, with no dependencies on file systems, networks, databases, or other external resources. This is Mike Feathers’ definition of a unit test. This level of isolation (and the execution speed that goes with it) in turn depends on proper use of static and dynamic mocks. That in turn depends on dependency injection, which in turn depends on people knowing enough OO to code to interfaces.
  • Speaking of execution speed: isolation test suites should average no more than 0.5 seconds per test, on a crappy machine. If everything really is in memory, it’s pretty common to get speeds of more like 100 isolation tests per second.
  • The suite also includes end-to-end tests, “collaboration tests,” and other tests that are more real-world than isolation tests, involve little or no mocking, and take longer to set up and run. These tests do talk to real databases, real networks, and perhaps completely external systems through various APIs.
  • The isolation tests and non-isolation tests are separate from each other (separate source folders, to my mind), so that they can easily be run separately by developers and by a CI server. As projects grow, their non-isolation suites slow down. Because we don’t want to discourage programmers from running isolation test suites frequently, we want to keep the isolation test execution speed fast, and we want to keep the build nice and fast. That means running the slower non-isolation suites separately, and perhaps less frequently: if the slow tests run slowly enough, we may not make them part of each CI build, but instead run them every few hours, or overnight, in a separate CI target.
  • Each test method involves only one cycle of Arrange/Act/Assert (set up and instantiate the objects, exercise the behavior under test, and verify the resulting state).
  • Each isolation test method isolates a thin slice of system behavior. One industry term for this (proposed by Industrial Logic) is “micro-tests.”
  • Average length of test methods is under 20 lines, ideally fewer than 10 lines.
  • Test methods and TestCase classes are written and organized in terms of system behavior, not system structure. Related to this: all the test methods in a TestCase use the code in the setUp() method in that class, with as little additional test-specific setup as possible. All of the “Arrange” part of “Arrange/Act/Assert” really should be handled in the setUp() method, whenever possible.
  • TestCases systematically cover unhappy paths: exception cases, edge cases and boundary conditions, etc. Mocks/fakes are used to simulate failure of external dependent resources.
  • TestCase object trees make effective use of base TestCase classes, and make good use of reusable, private or protected helper methods (a sort of local testing DSL). Or, as Ryan points out in the comment below, the TestCases all use a separate object tree that holds a well-thought-out, rich little local testing DSL, completely decoupled from the test code. The more of that DSL pattern you need, as Ryan might say, the less you want to use inheritance, and the more you want to use composition.
  • Test suites manage test data centrally (the repository of canonical test data might be a static class full of constants, or an in-memory database, or whatever). TestCases and test methods avoid primitive type literals wherever possible, and likewise avoid duplicate local variables and constants.
  • Test suites, TestCase classes, and test methods contain as little duplicate code as possible. This includes small details like recurring complex assertion patterns that can be extracted, repeating the name of the TestCase in a test method name, etc.
  • TestCase classes and Test methods have intention-revealing names, and use a consistent naming convention.
  • Test suites are designed to be as resistant as possible to production code design changes. They are robust, not brittle.
  • Test suites test the hard and harder things: XML configuration files, servlets, Swing GUIs, JSP files, etc.
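Though the fuller discussion can wait for future posts, here is one minimal sketch to make a few of these characteristics concrete: a single Arrange/Act/Assert cycle, a dependency injected through an interface, a hand-rolled in-memory fake (so nothing touches a database or network), and canonical test data held in constants rather than scattered literals. All the names here (PricingService, RateProvider, and so on) are invented for illustration, and in real code this would be a JUnit test method rather than a main():

```java
// The seam: the production class codes to an interface, so a test can
// inject an in-memory fake instead of a real (slow, external) resource.
interface RateProvider {
    double rateFor(String region);
}

class PricingService {
    private final RateProvider rates;

    PricingService(RateProvider rates) {  // constructor injection
        this.rates = rates;
    }

    double priceFor(String region, double basePrice) {
        return basePrice * rates.rateFor(region);
    }
}

public class IsolationTestSketch {
    // Canonical test data in one place: no primitive literals
    // sprinkled through the test bodies.
    static final String TEST_REGION = "midwest";
    static final double TEST_RATE = 1.25;
    static final double BASE_PRICE = 100.0;

    public static void main(String[] args) {
        // Arrange: a hand-rolled fake, entirely in memory
        RateProvider fakeRates = region -> TEST_RATE;
        PricingService service = new PricingService(fakeRates);

        // Act: exercise one thin slice of behavior
        double price = service.priceFor(TEST_REGION, BASE_PRICE);

        // Assert: one verification cycle, intention-revealing failure message
        if (price != BASE_PRICE * TEST_RATE)
            throw new AssertionError("expected " + (BASE_PRICE * TEST_RATE)
                    + " but got " + price);
        System.out.println("priceFor returns base price times region rate: passed");
    }
}
```

Because everything lives in memory, hundreds of tests shaped like this can run in a second or two, which is exactly what keeps programmers running them constantly.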

I’ve gathered up this first-draft list of characteristics from multiple sources — books, others’ experience, and my own experience. I’m sure I’m missing a few things in there — I’ll add and prune according to my future thinking and your comments.

Paint the Fence; Sand the Floor

Before people can talk to me with authority about the value of TDD, they need to talk with authority about the value of a great xUnit test suite. And before they can do that, they need to have (as my late mother would have said) suffered enough. They need to have suffered at the hands of codebases without great xUnit suites. They also need to have had their bacon saved by great xUnit suites.

So before we get to the TDD conversation, I increasingly want to encourage programmers new to xUnit testing practices to shoot for an xUnit test suite with the above characteristics. I don’t especially care, at first, how or why they paint the fence (from The Karate Kid), as long as they do it. I would in fact prefer that life and code provide them with the painful, indelible lessons that go with good and bad xUnit test suites.

THEN, once they have felt how hard it is to get that great xUnit suite when they have to stop, go back, and retrofit tests to existing code, and once they have felt how hard it is to debug an “Eager Test” (from Gerard Meszaros’ great book on refactoring xUnit tests), THEN we can talk about how, hey, you know, if that great xUnit test suite is your goal, then my experience has been that TDD gets me there better and faster.

Now we are painting the fence in a specific way.

But along the way, it’s all good.

Ugly vs Clean Code; Part Two

The TicTacToe “Ugly vs Clean” Eclipse Project

In my first post on this topic, I set the stage. I had a need for two implementations of the same problem domain: one ugly, one not. As promised, by the way, you can anonymously download the entire codebase discussed in this series of blog posts from a Google Code project here. It’s an Eclipse project, all zipped up.

The project includes an applet that you can run to play the game (right-click on source/, and pull down “Run As > Java Applet”). There is a first-draft README file that describes the whole shebang, and suggests some exercises to try. See what you think of it all.

The legacy version of the TicTacToe game is in legacy/. Now, take into consideration that this version reflects lots of little refactorings on my part, dating back to when I had characterization tests for this “class” (I’ve since removed all of those tests — I didn’t want students and job candidates subjected to these exercises to benefit from them). I renamed a lot of methods that started out with names like “c24occx()”, assigning placeholder-quality names that I thought my characterization tests were revealing to me, like tryToFindPositionGivingSeriesOf4OnTwoOrMoreAxes(). In some cases my educated guesses were accurate, and in some other cases, I later learned that I was far off.

I extracted a few small methods from other, larger, stranger ones, naming them as meaningfully as I could at the time. I extracted lots of constants. I renamed variables. I killed a lot of dead code and inscrutable comments. I managed to extract the Java applet code (woven into the gameplay code’s DNA) into its own class. I just couldn’t stand not doing that. (Clue: what do you notice about that applet code?)

But eventually, I just gave up working with it. After person-days of JUnit poking and prodding, this codebase remained quite opaque to me. I’ve inferred a lot of its algorithmic meat from its external gameplay behavior. But I’m still baffled by much of it.

So this is our first measure of inextensibility in a codebase we discover: what Uncle Bob Martin calls opacity, one of the characteristics I wrote about here. As we glance through it, as we write tests for it, we struggle to understand it.

But it seems that every month or so these days, there are new tools to help us grasp what we are up against. I ran Crap4J against it, and of course it pegs the tool’s little meter at the far right, at 36.84, as if pressed forcefully against that right-hand fence, searching for a measure of even more non-test-protected cyclomatic complexity. The average Crap4J score (blue triangle) is just under 5, BTW. As you can see, that little yellow triangle is trying to leave the ballpark:


So, I did determine how the legacy game manages its board state and game state, and got lots of peeks into how it determines which move to make next. Enough so that I was able to run the game from a test harness, one move at a time. This is the TestCase that pits the two games, old and new, against each other, some number of times. Currently that number is 200. You can find this code in manualTests/.
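For the curious, the shape of that kind of head-to-head harness can be sketched in a few dozen lines. Everything below is invented for illustration — a 3x3 board and two deliberately naive strategies stand in for the real 10 x 10 games — but the structure is the same: two players behind one interface, a referee loop, and a tally over 200 games:

```java
import java.util.Arrays;

public class HeadToHeadHarness {
    // Both games implement one interface, so the referee can pit any two
    // strategies against each other. In the real project, the two players
    // would be the LegacyGame and the new test-driven Game.
    interface Player { int chooseMove(char[] board); }

    // Two deliberately naive stand-in strategies.
    static final Player FIRST_EMPTY = b -> {
        for (int i = 0; i < b.length; i++) if (b[i] == ' ') return i;
        return -1;
    };
    static final Player LAST_EMPTY = b -> {
        for (int i = b.length - 1; i >= 0; i--) if (b[i] == ' ') return i;
        return -1;
    };

    // The eight winning lines of a 3x3 board.
    static final int[][] LINES = {
        {0,1,2},{3,4,5},{6,7,8},{0,3,6},{1,4,7},{2,5,8},{0,4,8},{2,4,6}
    };

    static char winner(char[] b) {
        for (int[] l : LINES)
            if (b[l[0]] != ' ' && b[l[0]] == b[l[1]] && b[l[1]] == b[l[2]])
                return b[l[0]];
        return ' ';
    }

    // The referee: alternate moves until someone wins or the board fills.
    static char playOneGame(Player x, Player o) {
        char[] board = new char[9];
        Arrays.fill(board, ' ');
        Player[] turn = {x, o};
        char[] mark = {'X', 'O'};
        for (int move = 0; move < 9; move++) {
            board[turn[move % 2].chooseMove(board)] = mark[move % 2];
            char w = winner(board);
            if (w != ' ') return w;
        }
        return ' '; // draw
    }

    public static void main(String[] args) {
        int xWins = 0, oWins = 0, draws = 0;
        for (int game = 0; game < 200; game++) {  // 200 games, as in the post
            char w = playOneGame(FIRST_EMPTY, LAST_EMPTY);
            if (w == 'X') xWins++; else if (w == 'O') oWins++; else draws++;
        }
        System.out.println("X wins: " + xWins + ", O wins: " + oWins
                + ", draws: " + draws);
        // prints: X wins: 200, O wins: 0, draws: 0
    }
}
```

With a board-printing line added inside the referee loop, this is also exactly where the move-by-move printouts for manual examination come from.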

The source folder and package names contain the word “manual” because at first, I was printing out a representation of the board after each move taken by each game. I was examining System.out.println() output manually, to learn.

It took a bit for it to dawn on me: I was doing exploratory testing.

Old Game vs New Game: Exploratory Testing

So I started with lots of high hopes, deep fears, and ignorance about my prospects of test-driving a decent version of this problem domain. My goal was for my game, if it took the first move, to beat the old game or play it to a draw most of the time. (As it turns out, I did much better than that. After my second run at this code, I ended up with a game that beats the old LegacyGame about 50% of the time, and beats it to a draw about 40% of the time. When I go first, the old game wins no more than 7% of the time.)

The new game clobbers the old game, after much research and development.

In my first test-driven version, my first few defensive algorithms were, in addition to being completely ineffectual against the old game strategically and tactically, pretty badly conceived. My object model was in parts over-engineered, and in other parts procedural, sloppy, and under-engineered. I paired with my good friend Dave LeBlanc on it for an hour, and he made several forthright observations about what I had done well and what I had done poorly. My design had some real flaws. I had pretty good test coverage, but nothing like what I wanted. For the next few days I pushed this first codebase version as far as I could, and got it to the point where it edged out the old game if it went first, on average. It performed OK.

But I was deeply disappointed at the results. I knew I had to rewrite it. I can get an A+ in any course I’ve already taken, if I take it again enough times. Dave had encouraged me with suggested new design approaches. I wiped the slate clean. I started over with an empty Game class, and a much better sense of which strategic and tactical behaviors I wanted to test drive in what order.

That’s when I started turning my attention more rigorously to the move-by-move board printouts I was logging in my manual game-against-game test harness. I started combing through each loss I suffered as I test-drove my second version, looking at the strategic setup patterns, while I looked for a cleaner way to represent the basic defensive and offensive patterns. I watched carefully as the old game, ugly or not, set itself up cleverly to defeat me a couple of moves into a new game.

And as all kinds of interesting patterns emerged from this manual exploratory testing, I began to understand the problem domain much more deeply. And this, of course, made Simple Design easier, and refactoring easier, and test-driving easier. I noted specific patterns, wrapped test data and failing tests around them, and produced new behaviors of my own that played the game better.


Then suddenly one day, I finished one particular bit of strategy: collecting all possible moves, ranked by tactical priority, and looking for any of the highest-priority moves that also matched lower-priority moves (blocking the other player’s new series while simultaneously extending a series of our own, for example). I saw a huge new jump in my game’s performance. I added another bit of logic around responding to the other player’s first move, then another around making the first move on one of the center-most 4 squares on the board. With each of these well-thought-out bits of new behavior, I saw big jumps in my game’s performance against the old one. Meanwhile, there was not that much total strategic and tactical logic, and I was simplifying and consolidating as I went. I had a reasonably clean, reasonably well test-protected codebase that was kicking the other game’s keister. It was rewarding.
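That move-ranking bit of strategy can be sketched roughly like this. All the names and the priority scheme here are invented for illustration (lower number means higher tactical priority, and the goal lists are hypothetical); the idea is just to prefer, among the squares that serve the top-priority goal, the one that also serves the most lower-priority goals:

```java
import java.util.*;

public class MoveRanker {
    // Given each candidate square and the tactical goals it serves
    // (lower number = higher priority), pick the square whose best goal
    // is highest-priority; break ties by how many total goals it serves.
    static int pickMove(Map<Integer, List<Integer>> goalsBySquare) {
        int best = -1, bestPriority = Integer.MAX_VALUE, bestGoalCount = -1;
        for (Map.Entry<Integer, List<Integer>> e : goalsBySquare.entrySet()) {
            int top = Collections.min(e.getValue()); // the square's best goal
            int count = e.getValue().size();         // how many goals it serves
            if (top < bestPriority
                    || (top == bestPriority && count > bestGoalCount)) {
                best = e.getKey();
                bestPriority = top;
                bestGoalCount = count;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> goals = new HashMap<>();
        goals.put(12, List.of(1));    // blocks the opponent's series
        goals.put(45, List.of(1, 3)); // blocks AND extends a series of our own
        goals.put(7,  List.of(2));    // merely extends our own series
        System.out.println("chosen square: " + pickMove(goals));
        // prints: chosen square: 45
    }
}
```

The dual-purpose move (square 45 here) wins the tie, which is exactly the block-while-extending behavior described above.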

Challenge: Try the Exercise Yourself

Feel free to download, unzip, import, and play around with the codebase. Follow the instructions in the README file. Let me know what you learn, and how you would like this (or any of my exercises) to be improved.

Ugly vs Clean Code: an A/B Comparison Exercise

A Tale of Two Codebases

I have been working for years, off and on, on a “breakable toy” codebase to use for three main purposes: evaluating the technical skills of programming candidates at Pillar Technology, making baseline assessments about the technical skills of new hires and client programmers, and conducting classes on agile/OO programming practices. In this and coming blogs, I am going to share with you the codebase itself (Java/Eclipse project), and my experiences and insights while developing and using it. I am also going to solicit your input on how to improve it as a teaching/mentoring tool, and as a set of exercises for evaluating programmers.

The Problem Domain

The codebase is two completely separate implementations (Legacy and “Cleaner” implementations) of a 10 x 10 TicTacToe game where the first player to 5 in a row in any direction wins. You play against the computer, and it typically kicks your patootie, whichever version you are playing against. (Well, it kicks mine.)

So I happened upon the Legacy version of this codebase more than a year ago, when looking to design this pedagogical tool, and determined it to be perfect, in a kind of sick, pathological way. Let me explain. The original codebase is a Java applet. It is, including all the applet code, a single, 1200+ line “class” with dozens of methods that looks like this:

Actually, that snippet includes method extractions and renames that I did. It was much worse before I got hold of it and started retrofitting Mike Feathers-style “characterization tests” and doing bits of opportunistic refactoring here and there.

So the whole thing has a fabulously high cyclomatic complexity. In other words, though this little TicTacToe applet is quite clever algorithmically at kicking your keister, it is supremely inextensible code, along several axes of extension. That was exactly what I needed.

The Point of the Exercise

The entire point of the codebase and its various uses is to highlight the differences between extensible and inextensible code, and to measure and teach the practices that are most central to the extensibility of a codebase. The point is to give people an experiential sense of NOT using the best practices listed above, on the one hand, and of using them, on the other. It’s a sharp-edged little A/B comparison exercise.

[Note: this exercise was conducted, to rave reviews, at both Agile 2008 and Agile 2009.]

What I Did, Crazy Man that I Am

I began with this Legacy game code, taking tentative steps of poking and prodding it with tests (more on all of that later). I played against it (and lost!) a lot. I began to learn the algorithmic problem domain (entirely despite the maddeningly bad design).

I then constructed a JUnit test harness that would enable me to play a new, test-driven version of the game against the old one, and measure how much of the time I won. And I began to test-drive my first version of this game.

I ran my head-to-head test a lot. It was depressing. Despite months of research and learning, I was still getting my rear kicked 98% of the time, or so.

And that is roughly where we shall pick up in the next blog post, when I’ll share with you a zip file of the entire Eclipse project, including the Legacy and Less-Legacy versions of the code.

Until then, fair readers.