No matter how strange a bug looked at the outset, I do believe we always got to the answer in the end. I cannot recall a single bug for any of our projects that we did not solve. Notice I didn't say "fixed." Not all bugs are fixed. Sometimes the cost-benefit tradeoff isn't enough to go ahead with the fix, or a workaround exists that adequately avoids the bug. Even in those cases, every known bug should be well understood. The team I worked on took bugs very seriously, and we never gave up when searching for the root cause of an elusive bug.
Lost in the Woods
Most bugs were easy to solve in a matter of hours or days. The symptoms were clear and reproducible, and the underlying causes were straightforward to analyze and simulate, so fixing them was routine. In some cases debugging would stretch out into weeks or, even more rarely, months. After solving a couple of those types of bugs, I started to appreciate that if you pursue a bug long enough, the answer will present itself. If it hasn't, you have not yet put in the requisite time. The answer is there, waiting to be discovered, but it's going to take a certain amount of time to find it. Eventually it became a saying. Whenever someone was getting discouraged on a bug-finding mission, we could remind them, "You just haven't put in the requisite time." And that would improve their resolve.
This idea is not destiny at work. It's more of a path through the woods that takes time to travel. Your starting point depends on the knowledge you have of the design. The path you take depends on the skills and experience you have in debugging. And the speed at which you go depends on the tools and resources you have available. When you actually find the root cause, it's going to be sudden. The trees will part and the solution will be revealed without any warning, so you better be paying attention. I've always found this to be true. Whether the bug is solved in hours or months, there is no slow buildup to the solution. Everything you've learned about the problem and every piece of data you've gathered culminates in the immediate realization of what's causing the bug. Often you look back and wonder why it took so long because the solution is so obvious in hindsight.
Perhaps more interesting is that this realization can happen anywhere at any time. I've had it happen while making breakfast, while walking to or from work, and especially before falling asleep. I had a coworker who routinely thought of solutions while in the shower or brushing his teeth. Everyone probably has a slightly different prime activity that allows them to think clearly, but more often than not that activity is not sitting in front of the problem, trying to solve it. Don't get me wrong, that time spent in front of the problem is valuable and necessary. You need to take measurements, gather data, and process information to learn as much as possible about the bug. But I've found that I do my best thinking when I can defocus and let my mind wander through all of the information about the problem that I have loaded in my head. I will often stumble upon the solution or a critical insight without realizing that I'm thinking seriously about the problem. Those moments are sublime.
The Ghost in the Machine
What do you do when you haven't been able to find the root cause of a bug for years? I've had one such experience so far, and hopefully it will be the only one. It was on one of the ASIC projects I worked on at my previous company. I'll keep the descriptions vague enough to protect the innocent, so to speak, but I'll try to give enough detail to keep it interesting and hopefully informative.
Let's call it The Bug, just to put it on a grander scale. It didn't actually take years of dedicated, focused effort to ferret out The Bug. I'm not sure how I would have survived an ordeal like that. I would have gone positively loopy looking for a bug for that long. But the time from when the symptoms were first found to when the root cause was finally uncovered was at least three years, possibly as long as five. I'm a little fuzzy on that detail since this long trek was interspersed with numerous other projects and bugs, but at any rate, it was a long time.
The initial problem was that our customer found a part whose output would get stuck at a high voltage after it was powered on, with no response to changes in the input. Power cycle the part and it would behave normally.
Almost right away we had a suspicion that The Bug was not going to be easy to find. The first thing you do when dealing with a new bug is try to reliably reproduce it. Well, that alone proved difficult. The part would seem to get into its stuck state completely randomly and extremely rarely. That's a bad combination. Oh, and it would only happen when it was in a temperature range of 75-85 degrees Celsius. Worse. On top of that it was encased in a plastic epoxy so that the part and its surrounding circuit were completely inaccessible. Even worse!
It took us weeks just to figure out the optimal temperature to coax the part into sometimes getting into its stuck output state. Then we had to deal with the fact that we had no visibility into the part. All we had were three wires - power, ground, and the output. The output was also a communication line to the part, but getting into that mode required a reset that would clear the error. The input signals were accessible, but we quickly eliminated them as a possible source of the error, and we were back to square one.
We were dealing with two critical debugging issues here - reproducibility and visibility. If you can reproduce a bug on demand and you have perfect visibility into the system, that bug is going to be toast in short order. Not so for The Bug. We were basically flying blind in a random world. Not. A. Fun. Time. Luckily, the customer had found a few more parts that exhibited the same behavior. I say this in hindsight, of course. We didn't think it was so lucky at the time. Up until that point we had to be very careful with our only errant part. The extra failures gave us the leeway to be more aggressive with our debugging.
We had to increase our visibility, so we decided to try milling out some of that plastic epoxy to get at the pins of the ASIC and see what was going on there. Miraculously, the first attempt was successful and we had a still-operational part that, most importantly, still had the output getting stuck. Unfortunately, we didn't see anything unusual at the ASIC pins. Even the reset pin was being released properly. Our best running theory at the time was that the part was stuck in reset, because the stuck output value was the same as the output in reset. That stuck-in-reset behavior should have been a clue, but we weren't ready for it yet. We hadn't put in the requisite time.
After a few weeks of messing around with ASIC pins and milling, we weren't able to find anything else noteworthy, so we decided to take the next step and expose the silicon die by decapping the ASIC. This step was another attempt to increase our visibility into the system. There were numerous complications: the chemicals used to remove the plastic package wouldn't dissolve the epoxy, stronger chemicals might destroy the die, and the epoxy surrounding the part interfered with the whole process. It took a lot longer than usual, but after some amazing work by the failure analysis team at an external lab, we were able to get a still-working-and-failing part with an exposed die.
We did a bunch of cool tests on the die while it was in the stuck output state, and got lots of colorful close-up pictures of different parts of the circuit. But we still didn't find anything compelling. To increase visibility again, we decided to pull out the die and put it in a new package so we could analyze it separately from the rest of the system. The transplant was unsuccessful. Each one of these processing steps carried a risk of irrecoverable damage, and we had finally reached our luck limit. We would have to take a different route.
We decided to investigate some suspicious defects on the surface of the die, which was a destructive process. Since we had other failures to analyze, and this die was no longer working, we could afford to sacrifice it. What we found were places where it looked like the metal wires on the die surface were cracked and crushed. That kind of damage could be caused by particles in the packaging plastic that were too large. The interaction between the epoxy, the plastic package, and the die surface could cause some of those particles to push into the die surface, causing damage and erratic circuit behavior. This behavior could change with temperature as the materials involved expanded and contracted at different rates. It looked like we had our root cause.
Conveniently, we were in the process of changing packaging houses, so we started analyzing parts from the new packaging house, and we couldn't find any problems. Even though it is impossible to prove a negative, we were fairly convinced that we had solved The Bug, and we were ready to close it. I've oversimplified a great many things here, and this entire process had taken almost a year with all kinds of meetings, brainstorming, analysis, and testing. We had done a lot of failure analysis of the other parts that showed the stuck output as well, and we thought all the evidence pointed to the packaging material as the root cause.
When a Bug Becomes a Zombie
At this point I should mention two things about debugging. First, if you cannot turn the bug on and off, you do not truly understand the bug! If you understand a bug, you can recreate the conditions that make it show itself, and you can remove the conditions (or the bug) to make the system behave correctly. It's like a switch - bug on, bug off - and if you can't control it, you don't understand it. We were not able to do this with The Bug, and at the time we didn't think it was possible because of the root cause. It just wasn't possible to recreate that kind of packaging defect or correct it once it happened with any kind of reliability.
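To make the bug-on/bug-off idea concrete, here's a minimal sketch in Python of what that switch looks like when you do have it. The toy Device model and its "stuck high without a reset" fault are hypothetical stand-ins, not the actual ASIC - the point is only that one test recreates the triggering condition and a second test removes it.

    # A minimal, hypothetical sketch of the "bug on, bug off" switch.
    class Device:
        def __init__(self, reset_asserted_at_power_on: bool):
            # Without a proper reset, the output register wakes up stuck high.
            self.ready = reset_asserted_at_power_on
            self.output = 0.0 if reset_asserted_at_power_on else 5.0

        def respond(self, input_volts: float) -> float:
            # A healthy part tracks its input; a stuck part ignores it.
            return input_volts if self.ready else self.output

    def test_bug_on():
        # Recreate the triggering condition: power on without a reset.
        part = Device(reset_asserted_at_power_on=False)
        assert part.respond(1.2) == 5.0   # stuck high, input ignored

    def test_bug_off():
        # Remove the condition: a clean reset restores normal behavior.
        part = Device(reset_asserted_at_power_on=True)
        assert part.respond(1.2) == 1.2   # output tracks the input

    if __name__ == "__main__":
        test_bug_on()
        test_bug_off()
        print("bug toggles on and off on demand")

Until you can write the equivalent of both of those tests against the real system, the root cause is still only a theory.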
Second, debugging a system with low visibility is hard. ASICs generally fall into this category of systems regardless of whether you have access to the package pins or the die itself. Your measurement options are always limited, and if the bug is not within your line of sight into the circuit or you can't deduce its location from the view that you have, it's extremely difficult to find it. Simulations can be helpful because they provide much more visibility, but if the bug is related to startup or power conditions, simulations can get pretty inaccurate and unreliable. Debuggers are a thing of dreams in the world of ASICs. As we saw with the first attempt at solving The Bug, most of the effort revolved around increasing our limited visibility into an opaque system. With those two ideas in mind, let's continue our tale.
So the design team had moved on. There were other bugs to slay and other projects to finish. A year or two passed, and we thought we had The Bug behind us. Then one day we got a call from the customer with some startling news. They were seeing an unusually high failure rate on their production line, with parts' outputs getting stuck at a high voltage and failing calibration. The parts were recoverable by cycling power. Uh-oh. The Bug had come back from the dead.
The customer had already done a fair amount of analysis and discovered some incredibly useful things about these new failures. First, they were failing at room temperature, which would make our debugging process much easier. Second, the customer could deterministically reproduce the failure by power cycling a failed part - switching it off and back on again within a very short time. A bad part's output would get stuck high, guaranteed, which meant we could easily reproduce The Bug for measurements and know for sure whether any applied conditions truly fixed it. And third, they had caught a number of failures before injecting the plastic epoxy that had been the bane of our existence in the last go-around with The Bug.
Suffice it to say, we were in a much better position to find The Bug this time. We fairly quickly characterized the behavior and got to work trying things out. One difficulty with debugging is that the longer it goes on, the less methodical it gets. You may start out with a defined, logical process of measuring the behavior of the bug, narrowing your search, and eliminating theories of the root cause. But at some point you run out of theories, and at that point you can't be afraid to start throwing things at the wall to see what sticks. When you enter this wild experimental theory mode, the faster you can generate ideas and try them out, the better. Anything goes. Don't discount off-the-wall ideas; the normal debugging process has failed, and the root cause is only going to be found by more radical means.
If you hadn't already guessed, we were approaching wild experimental theory mode. We had a few parts decapped since we had a decent collection of them, and we were doing the routine emission tests on the exposed die with no clear results. We were becoming more convinced that the problem was with the power-on-reset circuit because of the way the failed parts reacted to rapid power cycling, but all of our measurements showed that the chip reset was okay.
One day I finally decided to go in with some micro-probes and a microscope and actually measure the low-level signals feeding the power-on-reset circuit while power cycling the part. This was no easy feat with all of the lab equipment hooked up around the microscope and wires hanging off the probes to create the right conditions for reproducing the failure, but I got lucky. The wires I needed to probe on the ASIC surface passed through an open area of the die with no other signals around them, so it was relatively easy to set a probe down on them. What I found was that a critical reference voltage was not coming up fast enough at power on, so the reset that appeared to be released properly was actually never getting asserted at all. What's more, if I forced the reference voltage to come up faster, the stuck output never happened. Bingo!
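For illustration, here's a small Python sketch of the mechanism as I understand it. All of the thresholds, ramp times, and the comparator model are made-up numbers for the sake of the example, not values from the actual part. The idea is that the power-on-reset pulse can only assert while the supply is still ramping and the internal reference is already valid; if the reference lags, that window closes and reset never fires.

    # Hypothetical model of the failure: reset only asserts if the reference is
    # valid while the supply is still below the POR trip point. All numbers are
    # invented for illustration.
    def por_reset_asserted(supply_ramp_ms: float, ref_ramp_ms: float,
                           step_ms: float = 0.01) -> bool:
        VDD_FINAL, VREF_FINAL = 3.3, 1.2   # final rails, volts
        POR_TRIP, REF_VALID = 2.0, 1.0     # comparator thresholds, volts
        t = 0.0
        while t < max(supply_ramp_ms, ref_ramp_ms):
            vdd = min(VDD_FINAL, VDD_FINAL * t / supply_ramp_ms)
            vref = min(VREF_FINAL, VREF_FINAL * t / ref_ramp_ms)
            # Reset can only be asserted while the supply is still ramping up,
            # and only if the reference the comparator needs is already usable.
            if vdd < POR_TRIP and vref >= REF_VALID:
                return True
            t += step_ms
        return False

    print(por_reset_asserted(supply_ramp_ms=1.0, ref_ramp_ms=0.2))  # True: healthy part
    print(por_reset_asserted(supply_ramp_ms=1.0, ref_ramp_ms=5.0))  # False: slow reference, reset never asserts

Forcing the reference to come up faster is the equivalent of shrinking ref_ramp_ms in this toy model, which is exactly the bug-off half of the switch we had been missing.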
Always Do a Postmortem
The rest of the details fell into place immediately, but they aren't important. The Bug was revealed after years of hiding, and of course, it was now so obvious. We had been blinded by a number of assumptions that continually led us astray. It was only after an extended hiatus and the reappearance of The Bug in an easier form that we were prepared to find the real root cause. It was as if we were different people looking at The Bug anew.
Was it possible to have shortened the time necessary to find this bug? Eric Raymond's "Linus's Law" says that "given enough eyeballs, all bugs are shallow." That certainly applies in the open source community, where many individuals work on a bug independently, but I'm not so sure it applies in this case. We definitely tried involving lots of smart people to find The Bug, but beyond merely generating more ideas, which was helpful when the core team was hitting a wall, mostly what happened was groupthink that led us to dead ends. The problem was that we couldn't easily work independently with such a limited number of failures, and only the core team had the knowledge of the design and use of the debugging tools to actively search for The Bug. If we could have solved these debug scaling problems, we likely could have made The Bug more shallow.
In hindsight, the stuck output that looked suspiciously like the output at reset should have been a giant clue, and we should have dug into that power-on-reset circuit more deeply from the beginning. This particular bug ended up requiring years of investigation and learning before we were ready to find it. I hope I learned enough in that experience to prevent a repeat performance in the future, but even if I do come across another monster bug like this, I'll be more prepared to put in the requisite time. Remember, given enough effort every bug's time will come.