Let's Refactor Some Bad Code, Part 1

Unfortunately, we can't always be writing new code when programming. Much of being a programmer involves working with code that already exists because there is so much code out there already. There are mountains and mountains of code, and, as every programmer knows, not all of it is awesome. Sometimes this code has to be refactored to adequately maintain it, sometimes it needs to be done before new features can be shoehorned in, and sometimes it just needs to be done in order to stay sane while working with it. This is not always other people's code, either. Oftentimes it's your own code. I know I've written more than my fair share of bad code, and I may even use it as an example someday. That would be fun.

Right now, I want to look at refactoring, not my own, but someone else's code. Some time ago I was working on a feature for a product that required a DSP algorithm called rainflow counting. This algorithm splits up a signal's range into bins, and counts how many cycles the signal goes through, where a cycle starts in some bin A, peaks in another bin B, and then returns with another peak in bin A. The algorithm will also identify sub-cycles in the signal where, on its way from bin A to bin B, it backtracks to bin C before reaching bin B. It's called the rainflow counting algorithm because the Japanese researchers that developed the algorithm, Tatsuo Endo and M. Matsuishi, compared the signal being analyzed to rain running down a pagoda style roof, where the signal is viewed turned on its end:

Example rainflow graph
By Cutler at English Wikipedia [Public domain], via Wikimedia Commons
Don't worry if this doesn't make a whole lot of sense. If you really want to know how the rainflow counting algorithm works, it should become apparent from the code after it's refactored.

Finding a Working Example


I had some trouble understanding how the algorithm really worked, so I wanted to find a working example program that I could use as a model. After a bit of searching, I found a C++ program that I could use from the blog Vibrationdata by Tom Irvine. It wasn't the best piece of code I'd ever seen, but it implemented the algorithm I wanted to understand, so I set about trying to make it work and then make it better.

Now what follows is in no way a criticism of Tom's code. We all write bad code from time to time and we're all at different points on our programming journey. I'm not trying to single this code out in anyway, except that I found it to be a nice, self-contained example of code that could be improved, and it wasn't a trivial piece of code or a sprawling monstrosity. I've seen code much worse than this and code much better than this, but this was some code that I needed to modify to suit my needs and I thought I'd go through the changes I made to clean it up in the process.

I saw a number of issues with this code, including:
  • It didn't compile
  • No tests
  • Inconsistent indenting and formatting
  • Useless comments and commented-out code
  • Non-descriptive naming
  • Loooong functions
  • Poor structure
  • Cumbersome UI
I also didn't need some of the features it had and wanted to add a couple features of my own. I'll address each of these shortcomings in turn and show how the code improved along the way with short code snippets. I also put all of the changes into a git repo, so you can look at all of the diffs and follow along there. The code starts off at 590 lines, so I won't show all of the changes here.

First off, the code doesn't compile, and I wanted a working example of a rainflow algorithm, so I'll have to address that right away. Luckily, this was a simple change. He had used a getch() that needed to be changed to getchar(). He was using a different I/O library that I didn't have called conio.h, so I removed it along with <iostream>, which wasn't used. That was it. Now the code compiles and runs, so I'm off to a good start.

Time for Some Tests


With the small amount of changes I've done so far, I've tried to ignore any other issues with the code and just do what's necessary to get it running. I don't want to make any other changes that could screw up what the code does until I've added in some tests to verify the changes I make. So what should I do for tests? It's a small program, under 1 kLOC, and I'm mostly interested in how the algorithm works and the results it produces. I don't need to worry about testing an interface to this algorithm because I'll need to do something different to put it into its production form anyway. This is just an exploratory model. That means I don't want to mess around with anything remotely complicated, including a unit testing framework.

If this code was part of a larger system or it was going to be more permanent, a unit testing framework would definitely be in order. It's a judgment call, but you definitely want to use the right tool for the job. For a bigger, more industrial job, you want to use the industrial-strength tools. In this case, that would have been overkill.

What I ended up doing was dead simple. I had a few input files that I wanted to run through the algorithm, some with known results and some too complicated to calculate the results outside of the model. I ran each of these files through the program and saved the output files as golden examples. After each change to the code, I could diff the golden files against the new output to make sure I didn't break anything. This approach may not be automated, but it certainly works, especially considering that the code in its present form does not lend itself well to using an automated testing framework.

The golden outputs are fairly small. Here's an example:

 Amplitude = (peak-valley)/2 

 
          Range            Cycle   Ave    Max    Ave    Min      Max
         (units)           Counts  Amp    Amp   Mean  Valley    Peak 
   27.0000 to  30.0000      0.5     15     15      0     -15      15
   24.0000 to  27.0000      0.0      0      0      0       0       0
   21.0000 to  24.0000      1.0   10.5   10.5    1.5      -9      12
   18.0000 to  21.0000      1.5  9.667     10      2      -9      15
   15.0000 to  18.0000      2.0      8      8     -1     -15      11
   12.0000 to  15.0000      0.0      0      0      0       0       0
    9.0000 to  12.0000      2.5    4.9    5.5    4.9       0      11
    6.0000 to   9.0000      0.0      0      0      0       0       0
    4.5000 to   6.0000      0.0      0      0      0       0       0
    3.0000 to   4.5000      0.0      0      0      0       0       0
    1.5000 to   3.0000      0.0      0      0      0       0       0
    0.7500 to   1.5000      0.0      0      0      0       0       0
    0.0000 to   0.7500      0.0      0      0      0       0       0
The range column shows the bins that the signal range is split into, and there are columns for cycle counts and some other statistics. Pretty simple, and these output files contain exactly what I care about when checking that I haven't broken the algorithm with any changes I make. With that, it's time to do a second commit before moving on.

Formatting with Ease


The first thing that jumps out at you when perusing this code is that the formatting is all over the place. There is no consistency in indentation, spacing, or empty lines. As an example, here's a section of the main algorithm:
    for(i=1;i<(NP-1);i++)
 {

        slope1=(  y[i]-y[i-1]);
        slope2=(y[i+1]-y[i]);

        if((slope1*slope2)<=0. && fabs(slope1)>0.)
        {
          a.push_back(y[i]);
          k++;
        }

 }
    a.push_back(ylast);
    k++;

 last_a=k-1;

    hold=last_a;
This is very representative of the rest of the code. Braces don't line up where they should, some lines are indented too much, and operations are crammed together so that they're hard to read. I could go through and fix the formatting line-by-line, but that sounds terribly tedious. Actually, it sounds like something that a program should do. If you're using an IDE, it should have a code formatter that will automatically fix this stuff. I use Vim, so I went and looked for a C++ formatter. ClangFormat, part of the Clang compiler, turns out to be an excellent formatter that's quick and easy. I installed it on Ubuntu with
$ sudo apt install clang-format
and then ran the rainflow code through it with
$ clang-format -style=llvm -i rainflow.cpp
The llvm style option is pretty much exactly my formatting style preference. Now the above code snippet (and all the rest of the code) is nicely formatted:
  for (i = 1; i < (NP - 1); i++) {
    slope1 = (y[i] - y[i - 1]);
    slope2 = (y[i + 1] - y[i]);

    if ((slope1 * slope2) <= 0. && fabs(slope1) > 0.) {
      a.push_back(y[i]);
      k++;
    }
  }
  a.push_back(ylast);
  k++;

  last_a = k - 1;

  hold = last_a;
Much better, and much more readable! After checking that nothing broke, this code is ready for its third commit. This is an important point to check in the code because basically the entire code base changed when it was run through the formatter. It will be hard to trace any other changes across this commit, so I don't want to make any other changes until I've committed this code.

What's Next


Now that the code's in a semi-usable state, I have some options for how to proceed. I could work on organizing the code better or making the variable and function names more descriptive, but I think the most pressing thing to do next is to clean up the UI. Running the tests is a bit tedious because I have to keep answering prompts for features that I'll never use. Next time I'll start by streamlining that interface before moving on to some of the more meaty refactorings.

No comments:

Post a Comment