Test in Moderation

Software testing is a huge topic, and intense debates go on all the time about the best way to do it. Test-Driven Development (TDD) has built up a strong following over the past 15 years, and with good reason. TDD can have a lot of benefits if done well. It also has its downsides, and if taken to an extreme, TDD can cause more harm than good. Programmers are justified in pushing back on the more extreme forms of TDD, but we don't want to go to the other extreme, either. Testing is a necessary part of a high-quality development process that increases the odds of releasing highly functional software. We need to do testing in some form. The real questions are: how should we test, and how much?

The intention here is not to focus solely on TDD, although that will be difficult because it's something most programmers are familiar with. What I really want to talk about is testing in general, and TDD is but one piece of the complete testing picture. The rest of the picture includes things like usability testing, code reviews, and quality assurance (QA) testing. Proponents of TDD, when at their most extreme, claim that TDD subsumes and encompasses all other types of testing. They take every possible type of testing, define and categorize them, and put them all under the banner of TDD. It may be a convenient title, but in practice most programmers think of TDD as focusing on unit and integration testing because those are the tests that are most easily automated.

When referring to TDD, the concept I'm thinking of is essentially the same as the one used by Steve Freeman and Nat Pryce in Growing Object-Oriented Software, Guided by Tests:
When we’re implementing a feature, we start by writing an acceptance test, which exercises the functionality we want to build. While it’s failing, an acceptance test demonstrates that the system does not yet implement that feature; when it passes, we’re done. When working on a feature, we use its acceptance test to guide us as to whether we actually need the code we’re about to write—we only write code that’s directly relevant. Underneath the acceptance test, we follow the unit level test/implement/refactor cycle to develop the feature[.]
I'm using 'integration' whereas they use 'acceptance,' but they're pretty interchangeable. It's clear from this explanation of the TDD process that acceptance tests do not include getting the user interface in front of actual users, getting the code in front of coworkers, and getting the product in front of physical, human testers. That's okay. The definition of TDD should be focused, but we shouldn't forget about the rest of the testing picture.

In the Grand Scheme of Things


One great way to step back and see the bigger picture is to do a postmortem on a project (referred to as a retrospective in agile development). Despite the name, don't wait until the project is finished (or dead) to do a postmortem. You can learn a lot from looking back on a project after certain milestones are reached, like major releases, and many software projects are never truly finished anyway.

When I was working for an ASIC design company, we sometimes did a postmortem right before we would release the second revision of silicon for prototyping. This was a perfect time to take a look back at what happened during the previous cycle of silicon evaluation because we were about to send off the design for fabrication again. It would be 6-8 weeks before we'd see the results, and it would be very unlikely that we could fix any mistakes we found after pulling the trigger.

I remember one postmortem in particular that was for a project with a high profile customer. We were aiming to complete the design in two and a half spins—that means the third revision would only change the metal interconnect layers of the design—and as always, we were schedule constrained to get it right. Before releasing the design for the second spin we took a look back at all of the bugs that we discovered during the first silicon evaluation. We had found 35 bugs (pretty respectable for first silicon), and they had varying degrees of severity and scope.

While we didn't discover anything that caused us to delay the next release, we did learn some interesting things about the source of the bugs. Part of the postmortem exercise involved figuring out how we could prevent bugs like the ones we found from recurring in future designs. I can't remember the exact numbers, but we found that a significant share of those bugs, more than half, could not have been prevented with better automated testing and simulations.

Some bugs we definitely should have caught. They were obvious after we had gotten silicon back, the fixes were obvious, and the simulations required to catch the bugs were obviously missing. It's your standard face-palm kind of stuff.

Other bugs had a root cause that didn't originate in the implementation, but in the specification. These were the inevitable misunderstandings of how the product should work, either on our part, the customer's part, or the communication between us. We continually strove to reduce these kinds of mistakes, but there's always room for improvement when it comes to product specifications and customer communication.

Then there were bugs that we could not have predicted would be an issue until we had physical silicon in our hands to test. Similar to most software development, we were doing something that had never been done before, at least by the collection of engineers participating in this project. We knew going in that there would be issues with the design, but we couldn't know what all of them would be. We did try to mitigate the risk by making the design as flexible as possible in the areas where we thought there would be trouble, and we planned for the inevitable bugs with the two and a half spin schedule. It's actually a tribute to the design team that there were so few bugs in first silicon and most of them were of this unpredictable variety.

We ended up shipping revision B2 as planned, and it was one of the most well-executed projects I've had the pleasure of working on. However, the takeaway for this discussion is that in spite of our quite rigorous automated testing, we didn't catch every bug that we needed to, nor would we have had any hope of doing so even if we had devoted ten times the effort to testing simulations.

Imperfections in Testing


The equivalent automated testing in the software world is integration and unit testing, whether it's TDD or not, and Steve McConnell has some interesting statistics on the effectiveness of different types of testing in Code Complete 2:
.. software testing has limited effectiveness when used alone—the average defect detection rate is only about 30 percent for unit testing, 35 percent for integration testing, and 35 percent for low-volume beta testing. In contrast, the average effectiveness of design and code inspections are 55 and 60 percent.
The defect detection rates for unit and integration testing are amazingly low. Similar rates are probably applicable for ASIC simulation testing, and that is why we would have had to put in so much more effort to catch the obvious bugs that got through. During development those bugs are not obvious, and you have to write a ton of extra tests and cover cases that aren't broken to find the few remaining cases that are hiding the bugs. Other types of testing can make those bugs more obvious.
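To put rough numbers on that last point, here is a minimal sketch that combines the average rates quoted above. It rests on one loud assumption that is mine, not McConnell's: each technique misses defects independently of the others. Real techniques overlap, so treat the result as an optimistic illustration rather than a prediction. The function name `combined_detection` is made up for this example.

```python
# Average defect detection rates quoted above from Code Complete 2.
RATES = {
    "unit testing": 0.30,
    "integration testing": 0.35,
    "design inspection": 0.55,
    "code inspection": 0.60,
}


def combined_detection(rates):
    """Fraction of defects caught by at least one technique.

    ASSUMPTION: each technique misses defects independently of the others.
    """
    missed = 1.0
    for rate in rates:
        missed *= 1.0 - rate  # a defect slips through only if every technique misses it
    return 1.0 - missed


automated_only = combined_detection(
    [RATES["unit testing"], RATES["integration testing"]]
)
all_four = combined_detection(RATES.values())

print(f"unit + integration: {automated_only:.1%}")
print(f"plus both inspections: {all_four:.1%}")
```

Under that independence assumption, unit and integration testing together catch roughly 55 percent of defects, while adding the two kinds of inspection pushes the figure to roughly 92 percent.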

One argument against TDD is based on this issue of missing obvious bugs. The argument goes something like this. If you write the minimum amount of test code to produce a failing test and then write the minimum amount of production code to make the failing test pass, what you’re actually doing is minimizing the amount of production code you write. In the end, you’ll probably write much less error checking code, and you’ll forget to write error checking code for certain conditions or in certain circumstances, letting bugs get through the development process.

I don't think this argument applies only to TDD. You have to think of the error condition where a bug might exist whether or not you're practicing TDD. The TDD approach just changes the way you handle error checking during development. If you think of a possible error condition in your code, you write a test for it first to make the failure a reality, and then write code to fix it. Humans are notoriously bad at predicting failure, so it may be better to make sure the bug is real before trying to protect against it. TDD doesn't relieve developers of the need to be diligent about error checking, although it may help eliminate unnecessary error-checking code.
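As a small illustration of that test-first handling of an error condition, here is a sketch using Python's standard `unittest` module. The function `parse_percentage` and its 0 to 100 range are hypothetical, invented for this example. The idea is that the out-of-range test is written first and seen to fail, and only then is the range check added to the production code.

```python
import unittest


def parse_percentage(text):
    """Convert a string such as '42' to an int in the range 0..100."""
    value = int(text)
    # This check was added only after test_rejects_out_of_range
    # had been written and observed to fail without it.
    if not 0 <= value <= 100:
        raise ValueError(f"percentage out of range: {value}")
    return value


class ParsePercentageTest(unittest.TestCase):
    def test_accepts_valid_value(self):
        self.assertEqual(parse_percentage("42"), 42)

    def test_rejects_out_of_range(self):
        # Written first: it makes the suspected failure real
        # before any protective code exists.
        with self.assertRaises(ValueError):
            parse_percentage("150")
```

Saved to a file, the tests can be run with `python -m unittest` followed by the file name. Note that the sketch adds exactly one error check, the one a failing test demanded, which is the sense in which this style may trim speculative error handling.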

Another argument against TDD came up in a discussion on the Stack Overflow podcast #38. (I know it's old, but history can still be relevant.) Joel and Jeff talk about unit testing, TDD, and the SOLID principles. Joel questions whether TDD and SOLID are agile or not. It's an interesting thought, and I can see how he would come to that conclusion if these practices were taken to their extreme: 100% unit testing, TDD for every single line of production code, and all the SOLID principles applied to every class in the system. Such a code base would be an incredible, unmanageable mess. In particular, unit testing every little detail of every class while also splitting classes into thousands of tiny single-function elements makes the entire design rigid and fragile in the face of change. I shudder at the thought of it.

While it is useful to have an automated test to check code that you've written so that you don't have to keep checking it manually all the time, that does not always trump other factors. Sometimes it is a piece of code that will never break again, and writing the test is way more complicated than the actual code. In some cases it makes sense to not write the test and instead write the code, check it manually, and be done with it. As a general rule, if a test is hard to write—much harder than writing the actual code—think about whether or not to actually write the test. It would have to provide some benefit, like if the code is especially risky, before it's valuable to write and verify a complicated test.

I like the balance that Michael Hartl strikes in his Ruby on Rails Tutorial. Sometimes he uses TDD, sometimes he writes integration tests or unit tests after coding the functionality, and his assertions have a light touch that leaves enough room for the code to be flexible.
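Hartl's tutorial is written in Ruby, and the sketch below is not taken from it. It is only one reading, in Python, of what a "light touch" in assertions could mean. The helper `greeting_html` and both check functions are hypothetical names invented for the example.

```python
def greeting_html(name):
    """Hypothetical view helper, invented for this example."""
    return f'<h1 class="welcome">Hello, {name}!</h1>'


def brittle_check(html):
    # Pins the exact markup: any change to the tag or CSS class breaks it.
    return html == '<h1 class="welcome">Hello, Ada!</h1>'


def light_check(html):
    # Pins only the behavior that matters: the user's name is shown.
    return "Ada" in html


# Both checks pass against the current markup.
current = greeting_html("Ada")
print(brittle_check(current), light_check(current))

# After a purely cosmetic redesign, only the light check still passes.
restyled = '<h2 class="hero">Hello, Ada!</h2>'
print(brittle_check(restyled), light_check(restyled))
```

The brittle check would force a test edit for every cosmetic change, while the light one leaves the markup free to evolve as long as the name still appears.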

Expanding the Testing Repertoire


When writing tests, it's best to KISS. If you get too wrapped up in testing it's easy to go overboard, and that can waste a lot of time. There is some intersection of tests that catch bugs and code that is the source of bugs (or code changes that create bugs). That sounds like a Venn diagram to me.

 Venn diagram of bug catching tests vs buggy code

If you write tests for 100% code coverage, you're creating a huge number of tests and spending the corresponding huge amount of time writing those tests when some percentage of tests will never ever fail, and are thus useless.

Another way to think about this problem is to consider how many bugs are caught as you ramp up the amount of time you spend using a given testing style. For unit testing, as you write more tests it gets harder and harder to find more bugs, and tests start to overlap in the bugs that they catch. This is why 100% code coverage doesn't mean 100% of bugs caught; going by the average detection rate quoted above, it means closer to 30% of bugs caught. Over time, testing efficiency tends to look like this:

Estimated plot of bugs caught vs. testing effort

At some point, putting in more effort doesn't give you any more progress. This trend is true of all types of testing. Whether it's unit testing, usability testing, code reviews, or QA testing, testing efficiency is going to saturate at some level below 100% bug-free because different testing styles will catch different classes of bugs. Instead of taking one testing style to the extreme, it's better to mix different testing styles together so that you're hitting your software from multiple angles. If you have a fixed amount of effort that you can dedicate to testing, different mixes of testing styles will result in varying degrees of quality in the final product.
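The saturation argument can be sketched as a toy model. Everything in it is an assumption made for illustration: the exponential shape of the curve, the 20-hour scale, the 100-hour budget, and the idea that styles miss bugs independently. The ceilings simply reuse the averages quoted earlier for unit testing and code inspection. None of it is measured data.

```python
import math

# ASSUMPTION: each style saturates at a ceiling. The ceilings reuse the
# averages quoted earlier; the 20-hour scale is made up for illustration.
CEILING = {"unit tests": 0.30, "code review": 0.60}
SCALE_HOURS = 20.0
BUDGET_HOURS = 100.0


def caught(style, hours):
    """Fraction of bugs one style catches after `hours` of effort (toy model)."""
    return CEILING[style] * (1.0 - math.exp(-hours / SCALE_HOURS))


def caught_by_mix(hours_by_style):
    """Fraction caught by at least one style, assuming independent misses."""
    missed = 1.0
    for style, hours in hours_by_style.items():
        missed *= 1.0 - caught(style, hours)
    return 1.0 - missed


for unit_hours in (100.0, 50.0, 0.0):
    review_hours = BUDGET_HOURS - unit_hours
    mix = {"unit tests": unit_hours, "code review": review_hours}
    print(
        f"{unit_hours:3.0f} h unit tests + {review_hours:3.0f} h review: "
        f"{caught_by_mix(mix):.0%} of bugs caught"
    )
```

In this toy model the even split comes out ahead of spending the whole budget on either style alone, because the last hours poured into a single style buy almost nothing once its curve has flattened.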

Chart of testing efficiency for fixed effort

This chart represents what the space of testing efficiency might look like if you could spread out all variations in the mix of testing styles on one axis. Somewhere in that curve would be a point where you did 30% automated testing, 20% code reviews, 10% usability testing, 20% QA testing, and 20% other miscellaneous testing. Maybe 100% automated testing is near that valley on the left. It's really representing a multi-dimensional space on a single axis, and it's purposefully vague because the maximum testing efficiency is going to heavily depend on everything—the product, the problem domain, the experience and skill of the team, the schedule, the customer, and the list goes on. Experience is probably the best remedy for this dilemma because with experience you'll learn what a good mix of testing is for the contexts that you generally develop in.

One thing's for sure. It's not worth hammering on one testing style to the detriment of all others. That's a terrible waste of time. It's better to back off the 100% code coverage goal—trust me, it's a mirage because you're still far from 100% input coverage—and look for other ways to test more effectively. Automated tests are great, but they should be used judiciously where they give you the most benefit by making sure your stuff isn't fundamentally broken. Then, spending some time getting your product in front of real users or having another developer look over your code and talking them through it will provide much more value for your testing time than writing another test that verifies that, yes, your getters and setters do indeed work. Like all things, testing is best done in moderation.
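For a concrete picture of the gap between code coverage and input coverage, here is a minimal sketch. The function `average` and its test are hypothetical, chosen only because the example is small.

```python
def average(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)


def test_average():
    # This single test executes every line of average(): 100% line coverage.
    assert average([1, 2, 3]) == 2


test_average()

# The fully covered function still fails on an input the test never tried.
try:
    average([])
except ZeroDivisionError:
    print("average([]) raises ZeroDivisionError despite 100% line coverage")
```

A coverage tool would report this function as completely tested, yet the empty list, an entirely plausible input, was never exercised.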