
How to Learn a New Code Base

Learning a new code base is hard. When you move to a new job, switch to a new team, or otherwise take on a project already in progress, you'll need to come up to speed as quickly as possible on a code base that's entirely new to you. After doing this a couple of times, I've found some things that work and some things that don't. Maybe some of these tips can help you the next time you inherit a code base and are at a loss for how to proceed.

The Mindset


When picking up a new code base, it's helpful to have the right mindset, and the right mindset is one of acceptance. Accept the fact that all of the code is going to be new. You don't know anything about it, yet. You may have decades of experience with programming and have seen a million lines of code, but you haven't seen this code. The more experience you have, the easier it will be to make associations with your prior knowledge, but you shouldn't make any assumptions about how this code base works. Let the code speak for itself, and treat this exercise as a pure learning experience.

Accept that it will be overwhelming at times, and that's okay. There will be an avalanche of information to explore, process, and understand. You'll be struggling to gain a toe-hold on the cliff of understanding, and trying to pull yourself up with your own skills and effort. Be patient and don't get frustrated. Persistence will give you the familiarity you need to be comfortable and effective with this new terrain.

Get an Overview


The first thing you're going to want to do with a new code base is try to orient yourself. Try to find the important parts of the system and how they're connected together. Try to identify the major structures and hierarchies and separate them from the multitude of details and specific objects in the system. There will likely be a lot of specific, low-level code that implements the features of the system, many of them fairly similar. These code sections can be mentally grouped together and put aside for the moment. For example, an embedded system is going to have a number of interfaces to different peripherals, an operating system is going to have hundreds of different drivers, and a web app is going to have lots of routes and helper functions. The details of these things are not important in the beginning, and focusing on them will only serve to confuse you. How those details are organized in the larger system is what's important right now.

Your goal should be to build up a foundation of knowledge to hang all of the workings of the system on. Once you have a good mental model of the system's organization, then you can start working out the details. If the code base is millions of lines of code, hopefully you have a mentor to help guide you in the right direction while exploring the code's architecture, otherwise it may take you a long time to get your bearings. If the code base is less than a hundred thousand lines of code, hopefully you don't have a mentor so you are forced to figure it out for yourself, otherwise it may take you a long time to wean yourself off of your dependence on your mentor. Smaller code bases are completely manageable for one person to learn without help, and you'll become self-sufficient faster if you start out that way.

As for the actual mechanics of gaining an overview of the code base, I haven't found documentation to be all that helpful. When documentation is even available—and in most of my experiences it is not—it tends to over-generalize and describe things too abstractly. It focuses on the run-time behavior of the code and all of the features of the code instead of what you really need to know—how the system is physically organized and connected together at the code level. Unit tests aren't much use for this purpose, either. They're so specific to the low-level functionality that they're testing that they fail to give you any kind of overview of the code base at large. Unit tests are great for seeing how to set up and do specific things with the code, but not for finding your way around when everything is completely new to you.

What I've found works best to get an overview of a code base is to start with the files and directory structure of the code. An organized directory structure and good file names can say a lot about how a software system is put together. Code files that do related things tend to be grouped together in directories, and file names should identify the basic functionality of the code they contain. Browsing through the directory structure can reveal how you should work with the code base, and more importantly, where to find things. Oftentimes, finding the right file for the functionality that you need to work on is half the battle. If the files and directory structure aren't well organized, you'll have a harder time of it, but this strategy is still useful. You may also come up with some good ways to organize things better, and that line of thinking will also help you organize the layout of the code base in your head.

Once you know your way around the file system, more or less, you should look for the main entry point to the system. This entry point is where the system will be initialized at run time, and it's a treasure trove of information on how the system is organized. Reading the initialization code and stepping through it, maybe mentally while reading the code or literally by loading it into a debugger, will help you better understand how the system is built up at run time and how everything is connected together. Sometimes the main entry point will instantiate a bunch of objects, build up global tables or singletons, and hook them all up before setting a main loop in motion. Other times it will spin up a bunch of threads or tasks and a main message loop to coordinate communication between them. These threads will be the main functions of the system, and it's worth your time to become familiar with how they operate and interact.
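
As a purely hypothetical illustration of what that kind of entry point might look like (none of these names come from a real code base), a stripped-down main in Python could read something like this:

# Hypothetical sketch; the Database, Scheduler, and ApiServer names are made up.
class Database:
    def __init__(self, url):
        self.url = url

class Scheduler:
    def start(self):
        print("scheduler started")

class ApiServer:
    def __init__(self, db, scheduler):
        self.db, self.scheduler = db, scheduler

    def run(self):
        print("serving requests backed by " + self.db.url)  # stand-in for the main loop

def main():
    # Reading this function reveals the system's shape: what gets built,
    # in what order, and how the pieces are hooked together.
    db = Database("postgres://localhost/app")
    scheduler = Scheduler()
    api = ApiServer(db, scheduler)
    scheduler.start()
    api.run()

if __name__ == "__main__":
    main()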

This part of learning a new code base is much like learning a new level in a real-time strategy (RTS) game like Starcraft. You start out with a small view of the level where your newly-established base is set. You can see a limited area immediately around your base, and the rest of the map is black. One of the first things you'll want to do is send off a unit as a scout to see what's around. You're looking for obstacles, resources, and enemy bases. Knowledge is power, and to get it you need to explore. That's all getting an overview of a new code base is, doing your best to explore it and appreciating the sights that you'll see.

Learn By Doing


Getting an overview of the code base is all about exploration. The rest of learning the code base is all about experimentation. You have to be willing to try things, change things, and break things. Be sure to take good notes and learn from your mistakes. Oh, and you are using version control, aren't you? It will definitely make your experiments easier to roll back, more productive for you, and less harmful to the code base.

One of the best things you can do to get more familiar with a system after you have a reasonably good idea of how it's organized is to go in and fix a bug. Ask your team (or if that's not possible, figure out for yourself) what's broken and pick a bug that looks manageable to start with. Then dive in and try to fix it. While getting an overview gave you breadth of the system, this exercise is going to give you depth. You'll probably have to figure out how at least a few parts of the system interact and understand the detailed implementation of one section of the code base to fix the bug. It's a great way to expand your knowledge of the system while making a real contribution to your team.

Beyond fixing bugs you can also try adding new features to the code. I wouldn't start with a far-reaching feature that requires major changes to the architecture, but there's always plenty of small features that can be added and are good exercises for learning the code base in more depth. Try to fit new features into the existing code as neatly as possible. The goal here is not to build an entire new wing off of the existing structure that looks nothing like the code that's already there. You want to be as true to the current architecture as possible and see how well you can understand how things currently work when you're adding the new feature.

One more way to experiment with the code base is to pick a reasonably sized file, preferably somewhere in the middle of the architecture, and rewrite it. Note that I did not say rewrite the entire system, just one file. If you pick one that's fairly connected to the rest of the system, something that implements part of the core functionality, you'll learn a lot about how the system works. The reason I suggest rewriting it instead of reading it is that you tend to miss a lot of the subtleties of code when you're only reading it. If you rewrite it, you have to pay more attention to method calls, data structures, algorithms, and everything else that makes the code work. If you rewrite some code and keep the system working, you'll be well on your way to learning the system more fully. Unless the rewrite is also a significant improvement, you'll probably want to roll back your changes when you're done, lest you tick off all of your teammates with unnecessary changes.

Finally, when you're fixing bugs or adding features, you're likely to have questions about how to do things. When you have questions, try to do as much research as you can around the question so that instead of asking a generic "How do I do this?" question, you can ask a very specific question with plenty of context. A well-researched question would sound something like, "I'm trying to do x, and from what I understand of the system, I should be able to do it like this…<explanation>… I'm planning to implement it like so…<explanation>…, and I've found this and this location that seem to be good places to hook in my code. I'm stuck on where this code gets called or where this configuration is set up." That type of question gives the other developer something to chew on. You may be totally off-base and have to switch directions, but it shows that you were thinking and putting in a good-faith effort. You'll also have learned much more about the code base than if someone else always immediately tells you what to do without any effort on your part. You internalize things better if you work them out and discover them for yourself. Many times, doing this work on the question will lead you to the answer without even having to ask anyone for help.

Learning a new code base doesn't have to be terrifying or overwhelming. If you put yourself in the right mindset and accept that there's a lot you'll have to learn, you can knuckle down and prepare for the work ahead. With some patience, the exploration and experimentation can be rewarding and even fun. Make the most of it, and don't worry. You'll come up to speed before you know it.

Programming Bias Comes From Experience

We all have our personal biases, especially when it comes to programming. There are eternal debates going on about what the best programming style is or why this programming language is great and that one sucks. We love to argue about static vs. dynamic, CamelCase vs. snake_case, and tabs vs. spaces. Most of these unwinnable arguments stem from personal biases. We argue because we feel so emotionally committed to our case, but there is no objectively right answer. (Well, maybe for tabs vs. spaces—the editor should handle it for you, of course.)

These personal biases develop because of things that have bitten us in the past. Behind every irrational, subjective programming opinion is a story about how a programmer got burned while using a particular language or trying out a new programming practice. That thing that the programmer grew to despise may not have been directly responsible for the failure, but it was there and it was noticed. Biases are born of cold, hard experience, and the frustration of having to deal with an issue that's getting in your way at a critical time is not soon forgotten.

That's not to say that the reasons for any of these programming biases are invalid. They are certainly real enough, but the opposing arguments stem from equally valid stories of their own. For every programmer that's been burned by static typing, there's one that's had their fair share of shocks from dynamic typing. In the end the opposing arguments tend to boil down to differences in experience and possibly personality, and it can be hard to tell those apart. Did a programmer have a bad experience because of a predisposition to a certain programming style, or did his preferred programming style develop from the string of experiences he had? Separating those causes and effects is difficult.

To give an example, I recently listened to a discussion about some of the things Jeff Atwood hates about C# on Stack Overflow Podcast #73 (~50 minutes in). To be clear, C# was Jeff's preferred language at the time when he was working on StackOverflow.com, and he claimed that you don't really know a language unless you can name at least five things you hate about it. Fair enough. I'm not sure I agree with that, but anyway, he could only come up with three for C#.

What struck me was that the things he came up with were things that have never bothered me about programming languages in the slightest. I couldn't believe he was so frustrated by them. His number one complaint was case sensitivity. It has never occurred to me that case sensitivity in a programming language was even an issue. I learned to program in Pascal and C, so to me, case sensitivity was The Way Things Were. I learned to be careful with capitalization when programming, the same way I'm careful when writing. Jeff used to program in VB, a case-insensitive language, so I can appreciate that he was used to not caring about case when programming and had to make an adjustment. At the same time, I can't see why it's that big of a deal.

So you see, we both have our personal biases when it comes to case sensitivity, and the real reason we differ is probably because of our early experiences with programming languages, not because one is intrinsically better than the other.

His second complaint was that C# didn't have dynamic background compilation. Today it does, so it's no longer an issue, but even at the time I don't remember it being particularly painful to have to compile before seeing where you made syntax errors. Actually, having to wait for the compiler to report syntax errors may have saved time in the long run because it forced me to develop good habits when writing code so that it would be as error-free as possible.

Things are certainly better now with IDEs, like Visual Studio and Eclipse, that can point out syntax errors as you type. I can't argue with that. Although, it would be nice if they held off a little bit instead of underlining stuff in a line before you've finished it. I'm not sure that I can program any faster because of background compilation, but it's nice to have the extra confidence that the code will compile right away.

Jeff's last complaint was that static and dynamic typing both have their strengths and weaknesses, and languages would really benefit from having features from both typing systems. Joel added that type inference can help static type systems a lot by adding some nice dynamic features. I definitely agree with them here. It's great to remove some of the line noise when creating objects, especially when the names are long. In C# for example,
VeryLongTypeNameForAnObject obj = new VeryLongTypeNameForAnObject();
becomes the slightly more tolerable
var obj = new VeryLongTypeNameForAnObject();
While type inference adds dynamic features to static typing, Jeff also wanted to see some static features added to dynamic typing so that it would be easier to find all uses of a variable in a program written in a dynamic language. In a statically typed language, the type system can disambiguate variables with the same name that are declared in different places, for the most part, so the IDE can rename variables and methods with a greater chance of success. Reflection can still muck up this operation, but it's generally safer in a statically typed language.

In my experience I have run into both of these issues with static and dynamic typing from time to time, but it's been rather rare. I tend to adapt to the language I'm using. When I write in a static language, I try to cater to static typing's strengths, and when I write in a dynamic language, I take advantage of dynamic typing's features. I can't say that I've come across many code bases where similar variable and method names were strewn across dozens of files, and if that is happening in a code base, there are bigger issues in play than what renaming is going to solve. But that's my experience, and if you had to deal with lots of far-reaching variables every day, you'll probably have different biases than I do.

I probably can't come up with too many things I hate about any given programming language I've used. Having to pass structs around in C is a bit of a nuisance, so a basic object-oriented system would be nice. Having a sane syntax in C++ would be great. Having a real object system in Python, so we wouldn't have to pass self into every class method, would be an improvement.

If I think about it, these complaints are all variations on a theme. What I struggle with the most in any language is how to best express what I want to in as succinct a way as possible. I hate verbosity. When a language makes me repeat myself or say things in ten lines when one should do, that's where I feel the most pain. In fact, if every language were as succinct and expressive as Ruby, I would be happy as a clam. It's too bad Ruby isn't fast enough to be used in more places.

We should be aware of our biases and where they come from so that we're not making decisions purely on gut reactions. I recognize that my bias in favor of Ruby comes from past experiences I've had with other languages. Programmers that are biased against Ruby have their own reasons. The same thing goes for many of the endless debates we have about languages, styles, practices, etc. For some of those topics, we may be able to make real measurements about what is better, but even in those cases it's probably not cut-and-dried. Some programmers will be more productive with one choice, and other programmers will do better with the opposing choice. For the rest of the topics, there may be no definitive right or wrong answer, only a field of options, and our past experiences may be the best guide we have for making a good choice.

Be Careful What You Wish For

Who hasn't wished for a library that neatly solves the programming problem that is currently staring you in the face? When you're down in the dirt, gearing up to tackle some gnarly problem that seems like it should have a general solution, it's only natural to wish for, or even expect, a library that addresses your need. Once you have a workable solution that you think is elegant and useful, don't you think about releasing it for the world, or at least your company, to fulfill other programmers' wishes for self-contained solutions to their problems? Be careful what you wish for.

This post is probably not going to be about what you're thinking right now. I'm not going to relate some harrowing tale about a software library I picked up off the street, hoping it was going to solve all of my problems, only to find that it was more trouble than it was worth. Honestly, I haven't had many experiences like that. Either I've had amazing luck finding exactly what I need, or I have low expectations and work through the problems with the libraries I find as they come up. No, this post is about the way the software library marketplace has grown exponentially over the years and how one type of complaint about libraries has morphed into another, seemingly without notice.

When I was first learning to program, the languages I learned had small standard libraries. C has a pretty manageable standard library. C++ also has a small standard library by today's standards, even though it includes most of the C standard library as well. You could also include the standard template library, still keep all of the C++ library functionality in your head, and recall it at will to solve your programming problems. Early Java also had a tractable standard library.

Admittedly, my early view of the world of programming languages was somewhat limited, but the world was definitely smaller back then. One of the most common complaints I heard was that the world was a little too small. We needed more good libraries and frameworks for building applications—prepackaged plug-and-play code that would solve all of our problems and make development easy and painless. Sure, there were some isolated examples of application development frameworks out there, like MFC for Windows development in C++. But we wanted more, much more, and we weren't afraid to ask for it.

Now I'm not in any way knocking the fact that we wished for all of these libraries and frameworks, or that we expected them to solve all of our problems. It was and is entirely reasonable, even virtuous, to not want to reinvent the wheel over and over again. We should package up solutions to solved problems and make them part of the foundation that we then build on top of. I wouldn't have expected or wanted it to happen any other way, but I don't think many programmers realized that it would happen the way it did.

Today Java 8 has over 4,200 classes in the standard library public API. The .NET 3.5 framework has nearly 10,000 public classes—the latest version I could find numbers for. The .NET 4.5 framework has well over that number, I'm sure. No single person could possibly know everything that's in the standard libraries of these languages, and that's not even counting all of the contributions available from open source libraries.

It's a bit harder to get stats on open source projects. Ruby, a language with a fairly tractable standard library, has 6,272 gems available on RubyGems.org. Take into account that gems are generally a larger unit of measure than classes and the gems available at RubyGems.org are only a subset of all available gems, and it is clear that there are a lot of open source libraries available for Ruby. The same goes for all of the currently popular languages, especially JavaScript. Because of the nature of JavaScript—ease of releasing projects and use on both the client and the server—the number of libraries available is becoming truly astounding. Looking only at GitHub, there are over 320,000 active JavaScript repositories! Of course some (possibly many) of these repositories are mis-categorized, and many more of them aren't actually reusable libraries, but other varieties of projects. Even so, any way you slice that number, it's big, and growing fast.

The point here is not to get an accurate measure of the number of libraries available for different languages, but to show that in any language you use, that number is absolutely massive. The game has fundamentally changed from knowing everything that's available and how to use most of it to searching for what you need when you need it and evaluating whether or not you can use it (90% of it is crap, as per Sturgeon's Law). This shift makes some people very uncomfortable.

You see, back when we were complaining about the lack of good libraries and frameworks for building awesome applications and solving all of the mundane problems of software development, this is not what we wanted. We wanted elegant, high-performance tools that perfectly fit the problems we were trying to solve, not a mountain of crap to sift through for a few choice pearls. I posit that the former was never going to happen, and the latter is the best we could have hoped for.

Part of the reason is purely because that's the nature of software development and the web. When anyone can post a library on the web for others to use, you're going to end up with a ton of stuff that nobody needs or is of decidedly low quality. To get the fraction of really good stuff, we need to put up with a lot of mediocre, useless stuff.

The other part of the reason is that perfect software libraries are one of Steve Yegge's Nonesuch Beasts. They don't exist. You can't build one. It's impossible, and here's why. Any library that's generally applicable to a wide range of projects necessarily needs to solve problems in a general way. On the other side of the coin, you likely require a specific solution to a narrow problem, and a large collection of projects like yours will have a wide range of narrow problems that are slightly different.

What we end up with is a large problem set (the projects) on one side and a large solution set (the libraries) on the other side with a lot of thin connections running between them. One project will use a part of one library in a particular way, and another project will use the same library in a different way or a different library altogether. There's an enormous mismatch, and you can't have both a general solution for many problems and a perfect solution for each problem. Abstractions are leaky. Although we can keep making things better, and I think we have much better tools than we did a decade ago, that mismatch will always exist. Ruby on Rails is one example among many excellent web development frameworks. Python and R have some great data analysis tools. JavaScript continues to impress me with the libraries available for adding all kinds of features to your web application. The solutions are there if we know where to look and how to evaluate them.

When programmers complain about how hard it is to find the right library for their needs and how everything out there is crap, I think they're actually complaining about needing to think. What they want is to be able to go on Google and type in, "I need a JavaScript library for x," be taken to a GitHub page with an easy-to-install library, and be able to use the library intuitively. No reading documentation, comparing options, or analyzing performance required. Sometimes we can get pretty close to this experience. Some libraries are very good and the choice is clear, but most of the time software development is still hard. The choices are vague and complicated. Nothing may be the right fit, and we need to develop our own solution. Programming still requires plenty of thought.

I count that as a good thing. If programming was simple, and all we had to do was plug together a bunch of libraries we found on GitHub, it would get unbearably boring. Besides, if we truly solved the software library problem, it would mean we were no longer innovating and pushing the frontier of software. Researching options and piecing them together to build novel solutions, basically creative problem solving, is an inherent part of our job. It is fundamental to what we do. Without that difficulty, most of us would probably be out of a job. We wished for more and better libraries. Now we're getting them, and we're wishing that we didn't have to think so hard about when and how to use them. We should be careful what we wish for. What if we actually get it?

Face It, You Can't Predict the Future

We programmers are a funny bunch. We think we can predict the future. No, really, we do. Every time we add a layer of abstraction or a level of indirection or a complicated inheritance hierarchy just in case, we're trying to predict the future. Those three words—just in case—should be closely associated with two other words in our minds—make work. In reality, that's what we're doing most of the time. We're making work for ourselves instead of solving the problem at hand and finishing the work that needs to be done.

I fall into this trap of trying to predict the future all the time. I have to stop myself when my mind goes off on tangents, dreaming up Rube Goldberg architectures to handle all kinds of imagined circumstances that may or may not ever happen. Sometimes it's easier to architect than to work on edge cases. Sometimes it's more interesting to rough out structures for potential future issues than to implement the nuts and bolts of features that need to get done now. Sometimes it's fun to over-engineer and over-think a problem.

Other times we need to get stuff done. During those times (honestly, it's most of the time, isn't it?) how do we stay on track instead of heading off into the over-engineered weeds?

For starters, if you're writing some extra structure into your code, just in case you need it, don't. You should be able to hear yourself saying it. "What if we need to support more than one type of gizmo that talks to the thingamajig? I'll make a thingamajig-gizmo bridge interface so that we can easily implement many types of gizmos, just in case." You better be sure that either the number of gizmo types is really more than one (and you will definitely use more than one as part of the requirements of the product), or the use of the gizmo will be strewn over so much of the code that it creates substantial risk should you ever need to replace it. If neither of these things is true, maybe it's better to put your head down and implement the gizmo instead, just in case you want to ship your software.
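
To make that concrete, here's a minimal, entirely hypothetical sketch in Python contrasting the two approaches; the gizmo and thingamajig names come from the paragraph above, and nothing here is meant as real code:

from abc import ABC, abstractmethod

# The "just in case" version: an abstraction layer built before a second
# gizmo type even exists.
class GizmoBridge(ABC):
    @abstractmethod
    def send(self, message): ...

class SerialGizmo(GizmoBridge):          # still the only implementation
    def send(self, message):
        print("serial -> " + message)

# The direct version: implement the one gizmo you actually have, and extract
# an interface later if (and only if) a second type really shows up.
class Gizmo:
    def send(self, message):
        print("serial -> " + message)

SerialGizmo().send("ping")               # both do exactly the same thing today
Gizmo().send("ping")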

It's worth carefully considering your options any time you want to abstract away part of your system, especially the core parts of your system. Will you really ever switch views? Databases? Libraries? Ask yourself: are you making an all-singing, all-dancing platform, or are you making a targeted product that real people are going to use to solve real problems, hopefully soon?

Maybe you think you are making a platform, but remember, most of today's platforms didn't start out that way. They started out as products that people found useful and eventually grew into platforms. You can't address every single concern up front, and if you try, you'll never ship. You'll be spending all kinds of time working on abstract interfaces to loosely hook up all of your code to…um…save time. I'm not sure how that works, exactly. Only a few of those abstractions will be needed. Pick a view. Pick a database. Pick your libraries. Make some decisions and ship your product.

Speaking of platforms, platforms need the feature of being able to scale. 'Scale' can mean a lot of different things. It can mean millions of users per month. It can mean tens of thousands of transactions per day. It can mean thousands of requests per second. For a team that's still working on their product, it mostly means the load the software should be able to support, just in case it gets big fast. Face it, not everyone is Twitter or Slack. Not even Twitter was Twitter when they started, and they had all kinds of scaling problems as they got huge. It didn't matter, and they dealt with the scaling problems in real time.

If you're engineering your software to scale before you need to, don't. There are too many things you should be doing to make your product better so that you actually have problems with scaling later. Don't waste time on scaling before it's a problem. If you waste enough time, scaling may never be a problem because your product never ships.

So far I've been talking about programming in the large, but let's turn to programming in the small. Programming in the large is all about architecture and scaling. Programming in the small is all about refactoring and optimizing. Refactoring is another one of those slippery words like scaling. It can mean whatever you want it to mean, so everyone loves it and loves doing it. I love refactoring, too. What does it mean? I'm not sure, but I love it.

I think it means change. I try not to change code unless it markedly improves its readability or performance, but mostly readability. When writing code, the primary goal should be to make it as clear and concise as possible. A piece of code should express its intent as obviously as the language allows. I find that clear, concise code also performs better than verbose, flexible code most of the time. Making code shorter tends to make it faster, to a point. And 90% of the time shorter code has good enough performance while being more clear.

Another reason why performance isn't an issue for most code, and you can get away with focusing on clarity, is that most code doesn't need to run fast. The bulk of most code bases is there to set up the system or provide configuration options, or it's I/O bound, waiting for user input or slower processes.

One of my previous projects was writing code for an embedded system that did real-time signal processing at a sample rate of up to 1 MHz for four input channels. Functions could be attached to each of the channels for calculating things like running averages, frequencies, and peak-to-peak amplitudes. The more functions that could run simultaneously, the better. Certain sections of code in this system had to run as fast as they possibly could, so those sections were written in the specialized assembly language of the DSP processor we used. To make this code run even faster, a lot of things were set up and calculated ahead of time so that as few things as possible needed to be recalculated during run time. All of this setup and configuration code didn't need to be fast, so it was written as clearly as possible. That made it much easier to change as requirements changed, and I could implement new features within days of them being proposed because I could look at most of the code and immediately understand how it worked.

If performance is a problem and if you have measured and found where the code is slow, then, and only then, should you optimize the code in question. Since the code is already as clear and concise as you can make it, optimizing will mean making the code more complicated in some way. Maybe you need to use a more complex algorithm. Maybe you need to use more obfuscated low-level functions or assembly language. Maybe you need to add caching or memoization that muddies up the code. That's why optimizations should only be done where they're necessary, and the goal should still be to make the optimized code as clear as possible.

Beyond premature optimizations, we can also have premature refactorings in the name of DRY (Don't Repeat Yourself). DRY is a fundamental principle of good software development from Andy Hunt and David Thomas in The Pragmatic Programmer. (Excellent book. You should read it, if you haven't already.) But the DRY principle can be abused if you try to compartmentalize every operation into its own function, just in case that operation may be used again in some other place.

The problem with making every operation a function is twofold. First, you probably can't predict how the operation will be subtly different in other cases, so the function interface will likely be wrong. Second, doing this fractures your code base to the point where it's hard to find or understand anything. It actually makes it more likely that you'll have code duplication because you'll have hundreds of tiny functions that all do similar things, and you don't even realize it because you can't remember what they all are. If the original code adheres to DRY, leave it alone until code duplication actually occurs. Once it does occur, then refactor.

Obviously, the over-arching theme here is don't do something just in case it may be required in the future. If you do things just in case, you'll have spent huge amounts of time writing code that will mostly go unused. That time would be better spent thinking more deeply about the real problems and writing code to address those problems. Make sure the code is clear and concise and the architecture closely fits the problem you're currently trying to solve. Wait to scale until scaling becomes the problem you need to solve, and only optimize the code that needs to run fast. Face it, you can't predict the future. Instead, deal with the present.

Don't Fool Yourself


The first principle is that you must not fool yourself and you are the easiest person to fool. -Richard P. Feynman
Have you ever fooled yourself? Maybe a better question is, do you remember the last time you fooled yourself? You've worked all day or all week on a problem, and you think you've come up with a reasonable, if not elegant, solution. You show it to someone you trust—a coworker or a friend maybe—and she immediately sees the flaw in your logic or proposes a dead simple alternative that you never thought of. Face-palm.

I fool myself all the time, and I'm not ashamed to admit it. Those face-palms are the times when real learning happens, if you're willing to examine where you went off track and why the better solution works. Then you can generalize it to other problems. I came across one fascinating post by Ivan Yurchenko a few months back that perfectly illustrates how you can fool yourself on multiple levels when you're excited about something and want it to be more than it is.

The post is about calculating the Nth Fibonacci Number in O(log n) time, and it caught my attention because I thought that was impossible. To be clear, I don't want anyone to think I'm trying to take down Ivan or his post in any way. That's not the intent here. I enjoyed reading his post, and it made me think hard about algorithms, benchmarking, and optimization. It's a great post to learn from, and I'm sharing some of the insights I had. Getting back to Fibonacci Numbers, why did the claim of O(log n) time bother me?

Question Your Assumptions


The Fibonacci Numbers are a common sequence of numbers that are generated by starting with 0 and 1 (or 1 and 1, depending on context and personal preference) and continually adding the last two numbers to get the next number in the sequence. The first 10 Fibonacci Numbers are
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, ...
Using a loop, we can write a simple program that calculates the Nth Fibonacci Number. Ivan does this, and here is his version in Python:
class IterationFibonacci:
  def __init__(self):
    pass
 
  def get_number(self, n):
    """Get nth Fibonacci number (n is int, non-negative)."""

    if n == 0:
      return 0
    a = 0
    b = 1
    c = 1
    for i in range(n-1):
      c = a + b
      a = b
      b = c
    return c
By inspection of the for loop, this algorithm runs in O(n) time, but that's already making an assumption that we'll get back to in a minute. First, can we improve on this naive algorithm at all? A quick search of Wikipedia shows that there is actually a closed-form expression for the Nth Fibonacci Number:

F_n = \frac{\varphi^n-\psi^n}{\varphi-\psi} = \frac{\varphi^n-\psi^n}{\sqrt 5}

where

\varphi = \frac{1 + \sqrt{5}}{2} \approx 1.61803\,39887\cdots

is the golden ratio, and

\psi = \frac{1 - \sqrt{5}}{2} = 1 - \varphi = -\frac{1}{\varphi} \approx -0.61803\,39887\cdots

Normally, a closed form expression signals that we can compute a result in constant time, but take a closer look at that expression. The n for the Nth Fibonacci Number that we're trying to find is an exponent in two places, and φ and ψ are irrational numbers. Taking the nth power of a number requires n-1 multiplications. That makes the time complexity of this closed-form expression O(n). We could be smarter about the multiplications and split them up into a binary tree of multiplications. That would make the time complexity O(log n).
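
To see where that O(log n) count comes from, here's a small sketch (my own, not from the original post) of exponentiation by squaring, which needs only about log2(n) squarings plus a few extra multiplications instead of n-1 multiplications:

def power(x, n):
    """Compute x**n with O(log n) multiplications via repeated squaring."""
    result = 1
    while n > 0:
        if n & 1:        # low bit set: fold the current square into the result
            result *= x
        x *= x           # square once per binary digit of n
        n >>= 1
    return result

print(power(3, 13))      # 1594323; about 7 multiplications here instead of 12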

Can we do better? How about using a power function provided by a standard library? It turns out that the run time complexity of a power function is dependent on the precision that you need in the result, and once you start looking at the precision that you need for Fibonacci Numbers, you start to see the real problem that we're dealing with. The Nth Fibonacci Number will have on the order of n bits in binary form. The 100th number has 69 bits. The 1,000th number has 694 bits. The 10,000th number has 6,942 bits.
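
Those bit counts are easy to verify; here's a quick check (my own snippet, assuming the 0, 1, 1, 2, ... indexing used in this post):

def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

for n in (100, 1000, 10000):
    print(n, fib(n).bit_length())   # prints 69, 694, and 6942 bits respectively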

It takes a little while for the series to get going, but after a while every subsequent number requires another bit to calculate and store it. That growth means that a power function needs n bits of precision to compute the Nth Fibonacci Number exactly, so it will have O(n) time complexity, assuming you already know the golden ratio to the required number of bits outright. That's the best we can do. Returning to the simple iterative algorithm above, the addition inside the for loop doesn't take constant time. It takes O(n) time because there are O(n) bits to add together. That was the incorrect assumption we were making, and the algorithm actually has O(n^2) time complexity.

That brings us to the matrix multiplication algorithm that Ivan presented from The Art of Computer Programming. It only has O(log n) complexity if you assume that all multiplies can be done in constant time, but the numbers being multiplied are on the order of n bits long, so the complexity is actually more like O(n log n) (and that's assuming big-number multiplication that's roughly linear in the number of bits). Knowing that calculating Fibonacci Numbers means working with ginormous numbers is a good thing to keep in mind when figuring out the complexity of a real implementation. The theory may be making assumptions that can't be met on a physical computer.
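
For reference, the identity that makes the matrix approach work (it's implied by the algorithm below but never written out) is the standard result

\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}^n = \begin{pmatrix} F_{n+1} & F_n \\ F_n & F_{n-1} \end{pmatrix}

so raising that 2x2 matrix to the nth power with repeated squaring takes O(log n) matrix multiplications, each made up of a handful of big-number multiplications.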

Pay Attention to What You're Testing


Moving on from complexity theory to benchmarking, Ivan compared the simple iterative algorithm above to the matrix multiplication algorithm, and the results showed that the run time of the iterative algorithm grew linearly and the matrix multiplication algorithm run time appeared to stay constant. There were a couple wrinkles in his tests, though.

The test procedure consisted of doing test runs of each algorithm, calculating 10,000 Fibonacci Numbers around each value of n, with the exact index chosen pseudo-randomly in the range n±5. He did these 10,000-iteration runs for n from 10 up to 1,000 in increments of 10. Here's the code:
import random
import time

# Assumes the IterationFibonacci and MatrixFibonacci classes from this post
# are defined in the same file.
def benchmark():
  random.seed()
 
  for around in range(10, 1010, 10):
    ifib = IterationFibonacci()
    mfib = MatrixFibonacci()
    ti = 0
    tm = 0
    count = 10000
    for _ in range(count):
      n = random.randint(around-5, around+5)
      t = time.time()
      ifib.get_number(n)
      t1 = time.time() - t
      ti += t1
 
      t = time.time()
      mfib.get_number(n)
      t2 = time.time() - t
      tm += t2
 
    print("{0}\t{1}\t{2}".format(around, ti, tm))
In other words, he started with n=10, calculated 10,000 Nth Fibonacci Numbers in the range of 5 to 15, moved to n=20, calculated 10,000 Nth Fibonacci Numbers in the range of 15 to 25, and so on up to n=1000. Right away, one issue is that the benchmark is only calculating relatively small Fibonacci Numbers, where the linear time complexity of the additions and multiplications doesn't come into play too much. Maybe the application that would use these algorithms doesn't need to calculate larger numbers, but in that case, why not calculate all of the numbers in the required range once and store them in a look-up table? It wouldn't take too much time or space unless the application was severely memory-constrained.
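
As a rough sketch of that look-up table idea (my own, with the range chosen to match the benchmark above):

def build_fib_table(max_n):
    """Precompute F(0)..F(max_n) once; every lookup afterwards is O(1)."""
    table = [0, 1]
    for _ in range(max_n - 1):
        table.append(table[-1] + table[-2])
    return table

FIB = build_fib_table(1005)   # covers the benchmark's largest possible n (1,000 + 5)
print(FIB[100])               # 354224848179261915075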

The second issue arises when we look at how the matrix algorithm is implemented. Another optimization was added to the matrix algorithm to speed it up more. In the part of the algorithm that decomposes n into powers of 2 and calculates the matrices for those powers of 2, the matrices are calculated once and saved in a hash table so that they can be recalled the next time they're needed instead of calculating them again. This optimization is called memoization, and it muddies the waters a bit because we're no longer comparing two algorithms, we're comparing one algorithm without an optimization and another one with an optimization. Here's the matrix algorithm with memoization:
class MatrixFibonacci:
    A = [[1, 1],
         [1, 0]]

    def __init__(self):
        self.__memo = {}

    def __multiply_matrices(self, M1, M2):
        """2x2 matrices multiplication"""

        a11 = M1[0][0]*M2[0][0] + M1[0][1]*M2[1][0]
        a12 = M1[0][0]*M2[0][1] + M1[0][1]*M2[1][1]
        a21 = M1[1][0]*M2[0][0] + M1[1][1]*M2[1][0]
        a22 = M1[1][0]*M2[0][1] + M1[1][1]*M2[1][1]
        r = [[a11, a12], [a21, a22]]
        return r

    def __get_matrix_power(self, M, p):
        """Calculate matrix power (p is a power of 2)"""

        if p == 1:
            return M
        if p in self.__memo:
            return self.__memo[p]
        K = self.__get_matrix_power(M, int(p/2))
        R = self.__multiply_matrices(K, K)
        self.__memo[p] = R
        return R

    def get_number(self, n):
        """Get nth Fibonacci number (n is int, non-negative)."""

        if n == 0:
            return 0
        if n == 1:
            return 1
        # Decompose n-1 into powers of 2,
        # e.g. for n = 62: 61 = 2^5 + 2^4 + 2^3 + 2^2 + 2^0 = 32 + 16 + 8 + 4 + 1.
        powers = [int(pow(2, b))
                  for (b, d) in enumerate(reversed(bin(n-1)[2:])) if d == '1']
        # Less pythonic: http://pastebin.com/h8cKDkHX

        matrices = [self.__get_matrix_power(MatrixFibonacci.A, p)
                    for p in powers]
        while len(matrices) > 1:
            M1 = matrices.pop()
            M2 = matrices.pop()
            R = self.__multiply_matrices(M1, M2)
            matrices.append(R)
        return matrices[0][0][0]
Notice that in the benchmark code the memoization hash is initialized when a MatrixFibonacci object is initialized, and it's maintained for the life of the object. Because the objects are created before each run of 10,000 Fibonacci Numbers, this optimization gives the matrix algorithm a big advantage. We could apply this same memoization technique to the iterative algorithm, and it would give the iterative algorithm an even bigger advantage because instead of only saving parts of the calculation, it would be saving every Fibonacci Number as it was calculated, basically building the look-up table that I described earlier on the fly. If we ran this benchmark on the iterative algorithm with memoization, it would be orders of magnitude faster than the matrix algorithm.
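
Here's a rough sketch of what that memoized iterative version might look like (my own illustration, not code from Ivan's post):

class MemoIterationFibonacci:
  def __init__(self):
    self.__cache = [0, 1]   # F(0) and F(1); grows as larger n are requested

  def get_number(self, n):
    """Get nth Fibonacci number, reusing every value computed so far."""
    cache = self.__cache
    while len(cache) <= n:
      cache.append(cache[-1] + cache[-2])
    return cache[n]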

To compare the two algorithms directly, we can remove the memoization from the matrix algorithm. Even so, running up to n=1,000 doesn't show a substantial difference between the two algorithms. To get a better sense of their difference, we can run out to n=10,000, which results in run times like this:

Chart of Fibonacci algorithm runtimes

This chart clearly shows the O(n^2) behavior of the iterative algorithm. It's not quite as clear that the matrix algorithm is O(n log n), but it looks like it's at least linear and not sub-linear. It's important to know what you want to test when creating benchmarks, and make sure that you are indeed testing what you think you're testing.

Keep Looking for a Simpler Way


If a solution seems complicated, it probably is. There is often a simpler approach that will accomplish the same goal more clearly and in less code. The above matrix algorithm is pretty complicated, and a few of the commenters on the original post came up with some nice improvements. One that I particularly liked from Juan Lopes was dramatically shorter, and it's reproduced here with a few edits to put it in the same format as the previous algorithms for running in the test bench:
class MatrixFibonacciShort:
  def __init__(self):
    pass

  def mul(self, A, B):
    return (A[0]*B[0] + A[1]*B[2], A[0]*B[1] + A[1]*B[3], 
            A[2]*B[0] + A[3]*B[2], A[2]*B[1] + A[3]*B[3])
    
  def fib(self, n):
    if n==1: return (1, 1, 1, 0)
    A = self.fib(n>>1)
    A = self.mul(A, A)
    if n&1: A = self.mul(A, self.fib(1))
    return A

  def get_number(self, n):
    if n==0: return 0
    return self.fib(n)[1]
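As a quick sanity check (my own addition), the shortened version can be compared against the iterative algorithm from the top of the post, assuming both classes are defined in the same file:

iterative = IterationFibonacci()
short = MatrixFibonacciShort()
assert all(iterative.get_number(n) == short.get_number(n) for n in range(200))
print("first 200 Fibonacci numbers match")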
The matrix was flattened from a 2-dimensional list into a 4-element tuple, which compresses the multiplication code, and the core of the algorithm in fib() is now five lines long. It's a very elegant way to dynamically decompose the matrix multiplications into powers of 2 because the binary 1s in n mark where the powers of 2 are, and the algorithm uses that information to build the multiplication tree on the stack as it shifts through the binary bits. It's also about twice as fast as the original matrix algorithm:

Chart of Fibonacci matrix algorithm runtimes

I run into this problem all the time where I think long and hard about the best way to express some algorithm only to have someone else (or a Google search) show me a much more elegant solution. It's always great to learn about another clever way to solve a problem, and over time these solutions build up your repertoire. But every time I'm attacking a new problem—and it's good practice to always spend some time reasoning about solutions yourself—I basically assume that whatever solution I come up with, I'll find something better on the internet.

It is so true that the easiest person to fool is yourself, and these mindsets are a good way to prevent at least some foolishness. First, always question your assumptions. Assumptions are the blind spots that hide obvious mistakes. Second, make sure that when you're comparing ideas, you know exactly what you're comparing and you are comparing exactly what you should to reach a valid conclusion. Finally, assume that whatever solution you come up with, a better one exists, especially if your current solution seems too complicated. Then you have to decide if further optimizations are worth pursuing, or if your current solution is good enough.