Optimization can take many forms. Over the years I've learned to focus on the ones that yield more bang for the buck - architectural, data structure, algorithm, and readability optimizations - and avoid those that are more trouble than they're worth. These troublesome optimizations can generally be classified as premature optimizations and micro-optimizations - categories that are not mutually exclusive. It is all too common to see micro-optimizations that are done prematurely.
Before we go any further, some definitions are in order. Premature optimization is simply optimization that is done before it's known to be necessary, i.e. before you have actually measured the time consumed by the particular piece of code you want to optimize and found it to be a little CPU piggy relative to the rest of your program. Micro-optimization is twiddling with small code sections to try to beat the compiler at its job without making significant gains in performance, i.e. moving the deck chairs around on the Titanic.
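As a contrived example of a micro-optimization, consider replacing a divide by a power of two with a shift. Any modern optimizing compiler generates the same instruction for both, so the "clever" version costs readability and buys nothing:

// Micro-optimization: hand-coded shift to "avoid" a divide.
unsigned scaleDownClever(unsigned x) {
    return x >> 3;    // harder to read, and no faster
}

// Straightforward version: the compiler emits the same shift anyway.
unsigned scaleDownClear(unsigned x) {
    return x / 8;
}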
Avoiding these types of optimization does not give you the right to be sloppy. You should still be picking appropriate data structures and algorithms for the job at hand. Joe Duffy has a great writeup on ways that premature optimization has been used as an excuse for bad programming choices. Don't be yet another example of that.
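For instance (a contrived sketch, not from any particular project), reaching for a hash-based container when you know you'll be doing lots of membership tests is just sound engineering, not premature optimization:

#include <string>
#include <unordered_set>
#include <vector>

// Linear scan: O(n) per lookup. Fine for a handful of entries, sloppy as a
// default when the collection can grow.
bool isRegisteredSlow(const std::vector<std::string>& users, const std::string& name) {
    for (const std::string& u : users)
        if (u == name) return true;
    return false;
}

// Hash set: O(1) average per lookup. Picking this up front is an appropriate
// data structure choice, not a premature optimization.
bool isRegisteredFast(const std::unordered_set<std::string>& users, const std::string& name) {
    return users.count(name) != 0;
}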
Why Not to Optimize
There are a number of great reasons to avoid these bad optimizations, the most obvious being that you are likely wasting your time. If you are optimizing code that doesn't have timing constraints or already has good enough performance, those optimizations are worthless. If your optimizations are getting optimized out by the compiler or the compiler would have done the same thing anyway, your hard work would be all for naught. Instead of fighting on the compiler's turf, you could be spending your time optimizing at a higher level in an area of your program that will matter. Measure first, then optimize only where you need to.
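A crude timer around the suspect code is often all it takes to find out whether it's worth touching at all. Here's a minimal sketch in desktop C++ (the DSP code later in this post uses CLK_gethtime() for the same purpose); parseInput() is a stand-in for whatever you suspect is slow:

#include <chrono>

// Time a chunk of work in microseconds. Crude, but enough to tell whether a
// piece of code even registers relative to the rest of the program.
template <typename F>
long long measureMicroseconds(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}

// Usage:
//   long long us = measureMicroseconds([] { parseInput(); });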
When you optimize, you are committing to potentially more complicated code for the sake of potentially more performance. If the optimizations are done at a high level, that commitment is probably fine. The mental overhead of the optimizations is integrated into the architecture of the program, and so it is manageable and possibly simplifies the design instead of complicating it.
If the optimizations are at a low level, you are at risk of competing with the compiler, with the libraries and frameworks you depend on, or even with the hardware you're running on. After all, processors do all kinds of optimizations, including branch prediction, out-of-order execution, and memory caching, and they are changing and improving all the time. Every time you upgrade your programming environment, you will have to remeasure your optimized code to make sure it still performs well. And besides, are your users all using the same environment that you are? Even changes to other parts of your program could invalidate the assumptions that made the optimization work and nullify the performance gains. Any change, within or outside your control, could make the optimization obsolete, so it will have to be constantly retested and verified. Trust me, you want to avoid that rabbit hole.
Finally, you should be optimizing for readability over performance whenever you can. That may seem a little harsh, but in reality, readable code begets performant code. I can't begin to count the number of times I refactored a complicated section of code to be more readable, and only when I was finished simplifying did I see another way to refactor it to make it faster and use less memory. Oftentimes I could see that the original code was trying to be a performance optimization, but the complexity was getting in the way. It wasn't until I made the code comprehensible that I could make it performant, and without fail the newer, faster code was much easier to read and smaller. That's a win-win-win. The bottom line is that in almost all cases, you should optimize first for readability.
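Here's a contrived sketch of that pattern (not one of my actual refactors): the tangled version hides redundant work that only becomes obvious once the loop reads cleanly.

#include <cctype>
#include <cstring>

// Tangled version: the strlen() buried in the loop condition re-scans the
// whole string on every iteration, but it's easy to miss among the bookkeeping.
int countDigitsTangled(const char *s) {
    int count = 0;
    for (unsigned i = 0; i < std::strlen(s); i++) {
        if (std::isdigit(static_cast<unsigned char>(s[i]))) count++;
    }
    return count;
}

// Readable version: the single pass over the string is now obvious, and the
// accidental O(n^2) behavior disappears along with the clutter.
int countDigitsClear(const char *s) {
    int count = 0;
    for (const char *p = s; *p != '\0'; ++p) {
        if (std::isdigit(static_cast<unsigned char>(*p))) count++;
    }
    return count;
}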
When Measured Optimization Becomes Premature
Even though I try to follow these guidelines to the best of my ability, I still get caught by premature optimization sometimes. Last week was one of those times, and it brought out another reason to resist the urge to optimize until you are sure of how the program is working. But before getting into the code, here's a little background to hopefully make better sense of the example.
I'm writing embedded C++ code for a real-time application running on a TI DSP. A big part of what makes the real-time data processing possible for this application is the DMA (direct memory access) controller. The application has a number of memory buffers to stage the data so the processor can do calculations efficiently on a contiguous block of data before it's shuttled off to the next staging buffer. The DMA controller takes care of moving the data to the processor's internal memory and back out to external memory so that the processor is free to do program control and computation.
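To give a feel for the staging pattern (this is a generic ping-pong sketch with made-up helper names, not the application's actual code), the idea is that the processor crunches one internal buffer while the DMA controller is already filling the other:

// Hypothetical helpers, not the real driver API: dmaStart() kicks off a
// transfer and returns an id, dmaWait() blocks until that transfer completes.
int  dmaStart(const Sample *src, Sample *dst, unsigned bytes);
void dmaWait(int id);
void compute(Sample *block, int count);

const int kBlock = 1024;
Sample bufA[kBlock], bufB[kBlock];   // ping-pong staging buffers in internal memory

void processStream(const Sample *extSrc, int nBlocks) {
    Sample *working = bufA;
    Sample *filling = bufB;
    int id = dmaStart(extSrc, working, kBlock * sizeof(Sample));   // prime the first block
    for (int i = 0; i < nBlocks; i++) {
        dmaWait(id);                                               // working buffer is now valid
        if (i + 1 < nBlocks)                                       // start streaming the next block...
            id = dmaStart(extSrc + (i + 1) * kBlock, filling, kBlock * sizeof(Sample));
        compute(working, kBlock);                                  // ...while the processor works on this one
        Sample *tmp = working; working = filling; filling = tmp;   // swap roles for the next pass
    }
}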
This DMA controller is packed with features to automate different types of memory transfers and kick off transfers from external events, but one of the more basic features is the ability to send a set of commands to the controller to initiate an immediate block transfer. The controller will go off and move the data and then interrupt the processor when it's done. Great!
The issue that I was dealing with was that sometimes there was nothing to do while waiting for the controller to move the data, so the processor had to wait for it to finish. It seemed like the controller was taking a fair amount of time, and I wanted to see if a plain old memcpy() call would be faster. Here's the little program I used to compare them:
void main() {
    int cs = 2000;                                // number of samples in each buffer
    Sample *pSrc = new Sample[cs];
    for (int i = 0; i < cs; i++) pSrc[i] = i;     // fill the source with a test pattern
    Sample *pDst = new Sample[cs];

    int cTests = 10;
    for (int j = 0; j < cTests; j++) {
        // copy sizes step up from cs/cTests samples to the full buffer
        unsigned cb = (j+1)*cs/cTests*sizeof(Sample);
        unsigned tStart = CLK_gethtime();
        int id = DatCopy(pSrc, pDst, cb);         // kick off the DMA block transfer...
        DatWait(id);                              // ...and wait for it to complete
        unsigned tDatCopy = CLK_gethtime();
        memcpy(pDst, pSrc, cb);                   // same copy under program control
        unsigned tMemcpy = CLK_gethtime();
        UTL_logDebug2("Test %d copying %d bytes", j+1, cb);
        UTL_logDebug2("  DatCopy: %d, memcpy: %d",
                      tDatCopy - tStart, tMemcpy - tDatCopy);
    }
}
The DatCopy() function call sets up the DMA transfer of cb bytes from pSrc to pDst, and then the DatWait() call waits until the transfer is complete. CLK_gethtime() is a special function that returns a count of the number of processor clock cycles since reset, and the UTL_logDebug2() calls are special print statements that log the formatted strings to a memory buffer that can be viewed with an emulator.
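Since CLK_gethtime() returns raw cycle counts, the logged deltas only turn into wall-clock time once you divide by the CPU clock rate. Something like this does the conversion (the 300 MHz figure is just for illustration, not the actual device's clock):

// Convert a cycle-count delta into microseconds. kCpuHz is a placeholder
// clock rate; substitute the actual device clock.
const unsigned long long kCpuHz = 300000000ULL;

unsigned cyclesToMicroseconds(unsigned cycles) {
    return (unsigned)((cycles * 1000000ULL) / kCpuHz);
}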
The real application doesn't copy more than 4000 bytes at a time, so this loop measures the amount of time that the DMA and memcpy() take to copy from 400 to 4000 bytes. Here are the results I got:
So according to this measurement, memcpy() is always significantly faster than DatCopy(), and I should pretty much always use memcpy() unless there is some other computation that can be done after kicking off the transfer to hide the latency in DatCopy(), right? That's what I thought, too. It seemed pretty straightforward, and I figured it would be an easy performance gain. I was about to go change all my uses of DatCopy()-DatWait() to memcpy(), but I had a nagging feeling that this couldn't be right.
The Premature Optimization Bug
Because of the memory architecture of this processor, memcpy() has to use the DMA controller to do its copying. All memory operations go through the DMA controller, but memcpy() was doing the transfer in small pieces under program control while DatCopy() should have been using the controller directly to transfer the data in one big block. It shouldn't be possible for memcpy() to be faster. Indeed, it isn't.
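To make that distinction concrete, here's a rough sketch (not the actual library code): the program-controlled copy pays for every element with CPU instructions, while the block transfer hands the whole job to the controller in a single request.

// Program-controlled copy: the CPU issues every load and store itself, plus
// loop overhead, and each access still has to go through the memory system.
void copyUnderProgramControl(Sample *src, Sample *dst, unsigned count) {
    for (unsigned i = 0; i < count; i++)
        dst[i] = src[i];
}

// Block DMA copy: the CPU writes one transfer request and the controller
// streams the whole block, interrupting the processor when it's done.
void copyAsOneBlock(Sample *src, Sample *dst, unsigned count) {
    int id = DatCopy(src, dst, count * sizeof(Sample));
    DatWait(id);
}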
There was a bug in the DatCopy() code. The problem was that DMA transfers are set up by default to link to a null transfer that tells the controller to do nothing. The null transfer can be replaced by another DMA transfer so that when one transfer completes, it starts the next transfer automatically. However, if the null transfer is left there, then the DMA controller flags it as a missed transfer and takes its sweet time getting back to tell the processor that it's finished.
Since all of these DatCopy() transfers are one-offs, they shouldn't link to another transfer. They should return immediately. Once I figured this out and flipped a controller bit to make the transfers static instead of linked, I got this instead:
Ah, the world is right again. DatCopy() is generally faster than memcpy(), as it should be, except for transfers of less than about 1000 bytes, where the time needed to set up a DMA transfer dominates. There's only one place in the application where the transfers were guaranteed to be that small and where a faster copy could improve throughput, so that's the only place where I put in the memcpy() optimization. As a bonus, all the other transfers sped up because the DMA setup bug in DatCopy() was fixed.
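For the curious, here's a rough sketch of the kind of change that fix amounts to; the structure and bit names are made up for illustration and are not TI's actual register definitions or the real DatCopy() internals.

// Illustrative only: a DMA transfer descriptor with an options word and a
// link field naming the next transfer to load when this one completes.
struct DmaTransfer {
    unsigned    options;      // transfer options
    const void *src;
    void       *dst;
    unsigned    byteCount;
    unsigned    linkAddress;  // descriptor to chain to on completion
};

const unsigned kOptStatic = 0x1;   // hypothetical "static" bit: don't reload from linkAddress

void setupOneShot(DmaTransfer &t, const void *src, void *dst, unsigned cb) {
    t.src = src;
    t.dst = dst;
    t.byteCount = cb;
    // Leaving the default (linked) behavior chains the transfer to the null
    // entry, which the controller flags as a missed transfer and services
    // slowly. A one-shot transfer should be marked static so nothing is
    // linked in at all.
    t.options = kOptStatic;
}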
If I hadn't listened to that nagging doubt, I would have changed the code in dozens of places in the mistaken belief that I was improving performance, when in fact I was papering over a performance bug. Don't fall victim to that kind of hastiness. If something doesn't seem right, think it through and make sure you're not optimizing prematurely. You could be trying to do the compiler's job. You could be committing to code maintenance that isn't worth it. You could be unnecessarily complicating your code. Or you could be ignoring bugs that are preventing real performance gains. Do your homework instead, and only do the optimizations that matter.