A Microsecond in the Life of a Line of Code

We sit atop a large technology stack. Every layer in that stack is an abstraction of the layer below it that hides details and makes it easier to do more work with less effort. We have built up a great edifice of abstractions, and most people understand and work within a couple levels of this technology stack at most. No matter where you work in the stack, it's worthwhile to know something about what's going on below you.

Last week I talked about the importance of knowing the basics. Part of knowing the basics can be thought of as learning more about some of the layers of abstraction below your normal layer of expertise. I certainly cannot do justice to every layer of the computing tech stack, nor can I cover any layer in much detail, but it should still be fun to take a journey through the stack, stopping at each layer to see what goes on there.

We'll follow one line of code from a Ruby on Rails application and see how incredibly tall the tech stack has become. I only picked Rails as a starting point because it's one of the higher-level frameworks in a high-level language, and I happen to know it. Other choices exist, but it gives us something to focus on instead of talking in generalities all the way down. I'm also focusing on one line of code instead of following an operation like an HTTP request because it's fascinating that all of the actions on each of these layers are happening simultaneously, whereas an HTTP request has more sequential components to it. Maybe I can cover an HTTP request another time.

Let's get started on this incredible microsecond in the life of a line of code. With every level that we descend, things are getting more physical, and keep in mind all of these actions are happening at the same time on the same piece of silicon. Let's start peeling back the layers.

Ruby on Rails Application

Here we are on the top of the tech stack. Things are pretty cushy up here, and we can express an awful lot of work with only a few lines of code. For our example, we'll use this line of code:
respond_with @user
This piece of code sits in a controller that handles an HTTP POST request, takes the appropriate actions for the request, and renders a response to send back to the web browser. Because we're at such a high level of abstraction, this one line of code does a tremendous amount of work. To see what it does, we'll have to peel away a layer and look in the Rails framework.
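
To make that concrete, here's a minimal sketch of the kind of controller this line might live in. The model, action, and formats are made up for illustration; your app's details will differ:
class UsersController < ApplicationController
  respond_to :html, :json  # declare the formats this controller serves

  def create
    @user = User.new(params[:user])
    @user.save
    respond_with @user  # renders or redirects based on the request format
  end
end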

Ruby on Rails Framework

The respond_with method call is a call into the Ruby on Rails framework, and here's what the source code looks like:
# File actionpack/lib/action_controller/metal/
# mime_responds.rb, line 390
    def respond_with(*resources, &block)
      if self.class.mimes_for_respond_to.empty?
        raise "In order to use respond_with, first you " + 
              "need to declare the formats your controller " + 
              "responds to in the class level."
      end

      if collector = retrieve_collector_from_mimes(&block)
        options = resources.size == 1 ? {} : 
                    resources.extract_options!
        options = options.clone
        options[:default_response] = collector.response
        (options.delete(:responder) || self.class.responder).
          call(self, resources, options)
      end
    end
First, the method checks that it knows which MIME types to generate the response for, and if none are declared, it raises an error. Next, it calls another method deeper inside the Rails framework that executes the block of code provided with the call to respond_with, if there is one, and returns an object that knows the response type. Finally, a call is made at the end to render the response from a template in the Rails application.

I'm glossing over a lot of stuff here, so if this explanation doesn't make sense, don't worry about it. I'm not trying to analyze Rails in depth, so we can just enjoy the view as we pass by. Many more method calls are happening inside the methods called from this method, and it's all thin layers of abstraction contained within Rails and the other gems that it depends on. Gems are collections of Ruby code that are packaged up into convenient chunks of functionality that can be included in other Ruby programs, and they could be considered a layer in and of themselves. We'll group all of those thin layers into one for the purposes of this discussion; otherwise, this post will go on forever. The important thing to keep in mind is that a ton of stuff is going on at each level, and the sheer number of moving parts flying around as we continue down is astonishing.

Let's focus in on one line of code and move down to the next layer.

Ruby

This line looks fairly simple:
options = resources.size == 1 ? {} : resources.extract_options!
Ruby on Rails is written in a programming language called Ruby. (Bet you didn't already know that from the name, amiright?) The line of code we're looking at is a line of Ruby code. What does it do? It assigns the local variable options one of two values depending on the size of resources. If the size is one, options is an empty hash; otherwise, it gets the options hash extracted from the end of resources.
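
That extract_options! method comes from ActiveSupport rather than Ruby itself, and its core behavior is simple enough to sketch in a couple of lines (the real version has an extra check or two, so treat this as an approximation):
class Array
  # Pop the trailing options hash off the array if there is one;
  # otherwise return an empty hash.
  def extract_options!
    last.is_a?(::Hash) ? pop : {}
  end
end

resources = [:user, { location: :dashboard }]  # hypothetical arguments
resources.extract_options!  # => { location: :dashboard }
resources                   # => [:user]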

Ruby is made up of a set of thin layers of abstraction much like Rails, but with Ruby the layers are made up of the standard library and the core language features. The call to resources.size is actually a call to a built-in method of the Array class, referred to as Array#size. Normally a language and its standard library go together, so we'll consider them one layer in the stack.

Ruby is an interpreted language, which means there is another program running—called the interpreter—that reads every line of code in a Ruby program file and figures out what it means and what it needs to do at run time. An interpreter can be implemented in a few different ways. In particular, Ruby has interpreters written in C (the MRI interpreter), Java (JRuby for the JVM), and Ruby (Rubinius—how's that for recursion). To make things a bit more interesting, let's look at the JRuby implementation.

JRuby Interpreter

The JRuby interpreter reads files of Ruby code, parses them, and executes instructions to do whatever the Ruby code says should be done. Since JRuby is written mostly in Java, it's compiled Java code that's doing the work of figuring out array sizes, deciding which values to use, and assigning hash tables to local variables in our line of code above. An interpreter has a ton of decisions to make, and a lot of code is being executed to make those higher-level Ruby statements happen.

What is essentially happening in the interpreter is that code in one language is getting translated into code in another language using a third language. In this case, Ruby code is translated into Java bytecode using Java. Java bytecode is compiled Java code that is similar to an assembly language. If we assume with our example line of Ruby code that the conditional assignment operator is implemented in the JRuby interpreter as a Java method with references for the empty hash and the options hash, then the method might look like this:
public Object conditionalAssign(int lhsCompare,
                                int rhsCompare,
                                Object trueObj,
                                Object falseObj) {
  if (lhsCompare == rhsCompare) {
    return trueObj;
  } else {
    return falseObj;
  }
}
This Java method returns a reference to an object dependent on the outcome of the equality comparison. That's what we want for the Ruby conditional assignment operator, but the interpreter wouldn't be outputting Java code; it would be outputting Java bytecode. The equivalent Java bytecode for the conditionalAssign method is this:
0: iload_1
1: iload_2
2: if_icmpne     7
5: aload_3
6: areturn
7: aload_4
8: areturn
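(If you want to inspect bytecode like this yourself, the JDK ships with a disassembler: running javap -c on a compiled class prints the bytecode for each of its methods.)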
This Java bytecode runs on a virtual machine, which brings us to the next layer in our tech stack.

JVM

The JVM is a virtual machine that models a physical processor in software so it can take in instructions represented as Java bytecode and output the right assembly code for the processor it's running on. The original idea with the JVM was that a software program could be compiled once for the JVM, and then it could run on any hardware that had the JVM implemented on it. This idea of write once, run anywhere didn't quite turn out as the JVM proponents hoped, but the JVM still has a lot of value because many languages can run on it and it runs on many hardware platforms.

Much like the JRuby interpreter, the JVM is doing a translation, this time from bytecode to assembly code, and the JVM itself is written in yet another language (HotSpot, the most common JVM, is written mostly in C++). If the JVM happened to be running on an x64 processor, it might emit assembly code that looks like this:
 .globl conditionalAssign
 .type conditionalAssign, @function
conditionalAssign:
.LFB0:
 .cfi_startproc
 pushq %rbp
 .cfi_def_cfa_offset 16
 .cfi_offset 6, -16
 movq %rsp, %rbp
 .cfi_def_cfa_register 6
 movl %edi, -4(%rbp)
 movl %esi, -8(%rbp)
 movq %rdx, -16(%rbp)
 movq %rcx, -24(%rbp)
 movl -4(%rbp), %eax
 cmpl -8(%rbp), %eax
 jne .L2
 movq -16(%rbp), %rax
 jmp .L3
.L2:
 movq -24(%rbp), %rax
.L3:
 popq %rbp
 .cfi_def_cfa 7, 8
 ret
 .cfi_endproc
I know, things are starting to get ugly, but we're definitely making our way deep into the stack now. This is the kind of code that the actual physical microprocessor understands, but before we get to the processor, the assembly code has to get from a hard disk onto that processor, and that requires memory.

The Memory Hierarchy

The memory hierarchy of a modern processor is deep and complicated. We'll assume there's enough memory so that the entire state of the program—all of its program code and data—can be held in it without paging anything to disk. That simplifies things a bit since we can ignore things like disk I/O, virtual memory, and the TLB (translation look-aside buffer). Those things are very important to modern processors, so remember that they exist, but we'll focus on the rest of the memory hierarchy.

Computer motherboard diagram

The main goal of memory is to get instructions and data to the processor as fast as possible to keep it well fed. If the processor doesn't have the next instruction or piece of data that it needs, it will stall, and that's wasted cycles. Let's focus on a single assembly instruction at this point, the jne .L2 instruction. This instruction is shorthand for jump-not-equal, and its target is the .L2 label. We'll get into what it does in the next layer. Right now we only need to know that the instruction isn't actually represented as text in memory. It's represented as a string of 1s and 0s called bits, and depending on the processor, instructions could be 16, 32, or 64 bits in length. Some processors even have variable-length instructions.

So the processor needs this jne instruction, and the instruction has to be loaded from memory. Normally it starts out in main program memory (DDR in the diagram above), which is very large but much slower than the processor. It then makes its way down the hierarchy to the L3, L2, and L1 caches on the processor. Each cache level is faster and smaller than the previous one, and depending on the processor, it may have fewer levels of caching. Each cache has a policy to decide whether to keep or remove instructions and data, and it has to keep track of whether cache lines have been written to or are stale. It all gets extremely complicated, but the basic idea is that the lowest level of cache should hold the instructions and data that are most often or most recently used. Ideally the L1 cache will always have the next instruction that the processor needs because it normally runs at or very near the processor's clock speed.
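
To make "a policy to decide whether to keep or remove" a little more concrete, here's a toy Ruby sketch of a least-recently-used (LRU) policy. Real caches implement this in hardware with set associativity and cheap approximations of LRU, so treat this purely as an illustration of the idea:
class LRUCache
  def initialize(capacity)
    @capacity = capacity
    @lines = {}  # Ruby hashes remember insertion order
  end

  # Reading a line marks it as most recently used.
  def read(address)
    return nil unless @lines.key?(address)
    @lines[address] = @lines.delete(address)  # move to the "recent" end
  end

  # Filling a line may evict the least recently used one.
  def fill(address, data)
    @lines.delete(address)
    @lines[address] = data
    @lines.delete(@lines.keys.first) if @lines.size > @capacity
  end
end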

The Microprocessor

Once this jne instruction reaches the processor, it needs to be executed. For that to happen, the processor needs to know what the instruction is and where its inputs and outputs are. The instruction is decoded to figure this out. The processor looks at the series of 1s and 0s of the jne instruction and decides that it needs to look at flags that were set by the previous instruction, and if the zero flag is 0 (meaning the compared values were not equal), it will jump to the target address specified by the label .L2.
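
For a flavor of what decoding means, the short form of jne on x86 is the opcode byte 0x75 followed by a signed 8-bit displacement. A toy Ruby decoder for just that one instruction might look like this (the displacement value is made up):
bytes = [0x75, 0xF6]  # jne with a hypothetical displacement of -10
opcode, disp = bytes
if opcode == 0x75
  offset = [disp].pack("C").unpack1("c")  # reinterpret the byte as signed
  puts "jne: if the zero flag is 0, jump #{offset} bytes from the next instruction"
end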

Not all instructions are decoded into a single operation. Some instructions do more work than is reasonable to do all at once. These instructions are broken up into more bite-sized chunks of work, called micro-operations (the recipes for these translations live in the processor's microcode), before they are executed. Then all of these operations are fed to the next layer of abstraction.

The Processor Pipeline

Instructions are no longer executed on a processor in a single clock cycle unless we're talking about a very simple processor, like a 16-bit microcontroller. Instead, some small amount of work is done for each instruction on each clock cycle. Fetching operands, executing operations, writing results, and even the initial decode steps are part of the pipeline. Breaking the work up into these pipeline stages allows the clock speed to be faster because it's no longer limited by the longest instruction, but by the longest pipeline stage. Naturally, hardware architects try to keep pipeline stages well balanced.
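
Here's a toy Ruby model of that idea. Each instruction takes five cycles from fetch to writeback, but once the pipeline fills, one instruction completes every cycle (this ignores stalls, branches, and everything else that makes real pipelines hard):
STAGES = [:fetch, :decode, :execute, :memory, :writeback]
pipeline = Array.new(STAGES.size)  # one instruction slot per stage
program  = %w[add sub load store jne and or]
total    = program.size

cycle = 0
completed = 0
while completed < total
  pipeline.pop                      # drop whatever finished writeback
  pipeline.unshift(program.shift)   # feed the next instruction into fetch
  cycle += 1
  completed += 1 if pipeline.last   # something reached writeback this cycle
  puts "cycle #{cycle}: #{STAGES.zip(pipeline).to_h}"
end
puts "#{total} instructions in #{cycle} cycles, versus #{total * STAGES.size} with no overlap"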

Intel Nehalem Microarchitecture
"Intel Nehalem arch" by Appaloosa - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Intel_Nehalem_arch.svg#mediaviewer/File:Intel_Nehalem_arch.svg

Modern processors also have stages that figure out if multiple instructions can be executed at once and feed them to multiple execution units. These dispatch stages can even decide to execute later instructions ahead of earlier ones if their dependencies are satisfied. In today's processors, hundreds of instructions can be in flight at once, starting and finishing out of order. If you thought you knew what your code micro-optimizations were doing for performance, you might want to reconsider that position. The processor is way ahead of you.

Our original Rails code is completely unrecognizable at this point, but we're not done, yet.

The Pipeline Stage

A pipeline stage consists of some combinational logic that runs between clock cycles, and a set of memory elements, called flip-flops, that store the results of that logic for the next pipeline stage. The results are latched into the flip-flops on every clock edge.

Combinational logic can do all kinds of operations, including logical operations, arithmetic, and shifting. This logic is normally described using an HDL (Hardware Description Language), like Verilog, that resembles other programming languages but has a concept of a clock, with statements executing in parallel between clock edges. A ton of stuff is happening at once in Verilog simulations because all combinational logic in the processor evaluates its inputs as soon as they change.
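
As a loose software analogy (and only an analogy; flip-flops are circuits, not objects), you can picture a pipeline stage as a pure function whose result only becomes visible to the next stage on a clock edge:
class PipelineStage
  def initialize(&combinational_logic)
    @logic = combinational_logic
    @latched = nil  # stands in for the stage's flip-flops
  end

  # Between clock edges, the combinational logic settles on a result...
  def evaluate(inputs)
    @next_value = @logic.call(inputs)
  end

  # ...and the rising clock edge latches it for the next stage to read.
  def clock_edge!
    @latched = @next_value
  end

  attr_reader :latched
end

adder = PipelineStage.new { |a, b| a + b }
adder.evaluate([3, 4])  # logic settles between edges
adder.clock_edge!       # the edge makes the result visible
adder.latched           # => 7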

MIPS pipeline stages

But the processor isn't executing Verilog. The Verilog is synthesized into digital logic gates, and those are what make up combinational logic and our next layer of abstraction.

Digital Logic Gates

The basic digital logic gates are NAND, NOR, and NOT. NAND stands for NOT-AND and NOR stands for NOT-OR. Why aren't AND and OR gates fundamental? It turns out that transistor circuits that implement logic gates inherently invert their outputs from 0 to 1 and 1 to 0, so AND and OR gates require an extra NOT gate on the output to invert the result again.

Digital logic gates and truth tables

Many other logic gates can be built up from these three basic gates, and a digital synthesis library—used by the synthesizer to select gates to build up the pipeline stages—can consist of hundreds or even thousands of variations of logic gates. All of these gates need to be connected together to make a processor. Sometimes it's done by hand for regular structures like memory or ALUs (Arithmetic Logic Units), and for other blocks it's done with automated place-and-route tools. Suffice it to say, this stuff gets complicated in a hurry.
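
You can get a feel for why these three gates are enough by composing them. Here's a small Ruby sketch that treats gates as functions on 0s and 1s, building AND, OR, and XOR out of the fundamentals:
NOT  = ->(a)    { 1 - a }
NAND = ->(a, b) { NOT.(a & b) }
NOR  = ->(a, b) { NOT.(a | b) }

AND = ->(a, b) { NOT.(NAND.(a, b)) }  # AND is just NAND plus an inverter
OR  = ->(a, b) { NOT.(NOR.(a, b)) }   # likewise for OR
XOR = ->(a, b) { m = NAND.(a, b); NAND.(NAND.(a, m), NAND.(b, m)) }  # four NANDs

[[0, 0], [0, 1], [1, 0], [1, 1]].each do |a, b|
  puts "#{a},#{b} -> NAND=#{NAND.(a, b)} AND=#{AND.(a, b)} XOR=#{XOR.(a, b)}"
end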

We're now ready to leave the land of ones and zeros because we've finally made it to the transistor.

The Transistor

Digital logic gates are normally made up of CMOS (Complementary Metal-Oxide Semiconductor) transistors. These transistors come in two varieties—NMOS and PMOS. They both have three terminals: the source, drain, and gate. A fourth connection always exists below transistors, and that is referred to as the bulk connection. If the gate voltage is greater than or equal to the source voltage for a PMOS or less than or equal to the source voltage for an NMOS, the transistor turns off. If the gate voltage goes in the opposite direction, the transistor turns on.

CMOS NAND gate

This schematic shows a NAND gate that is made up of four transistors. Two PMOS have their sources connected to Vdd (the digital supply voltage), and two NMOS are stacked together with the bottom one's source connected to ground. When an input voltage is at Vdd, that corresponds to a 1, and an input voltage at ground corresponds to a 0. In the case of the NAND gate, if either input is at ground, one of the NMOS will turn off and one of the PMOS will turn on, pulling the output up to Vdd. If both inputs are at Vdd, both NMOS turn on and both PMOS turn off, pulling the output to ground. This behavior exactly matches what a NAND gate should do.
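
To tie the schematic to the truth table, here's a toy switch-level model in Ruby. It's a deliberate simplification that ignores thresholds, timing, and everything analog:
def cmos_nand(a, b)
  pull_up   = a == 0 || b == 0  # parallel PMOS: either input at ground pulls the output up
  pull_down = a == 1 && b == 1  # stacked NMOS: both inputs at Vdd pull the output down
  pull_up ? 1 : (pull_down ? 0 : :floating)  # valid inputs never leave the output floating
end

[[0, 0], [0, 1], [1, 0], [1, 1]].each do |a, b|
  puts "NAND(#{a}, #{b}) = #{cmos_nand(a, b)}"
end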

To understand how a transistor works, we need to descend another level.

Semiconductor Physics

A transistor is made up of a silicon substrate, a polysilicon gate sitting on top of a silicon oxide insulating layer, and two regions on either side of the gate that are doped (implanted) with impurity ions, each forming a P-N junction with the substrate. The doped regions are the source and drain of the transistor.

A P-N junction makes a diode where electrons will flow from the N-type side to the P-type side, but not the other way around. A PMOS needs to sit inside an extra well that is N-type so that the source and drain aren't shorted to the substrate and each other.

CMOS transistor cross sections

Considering just the NMOS, when the gate voltage is less than or equal to the source voltage (normally at ground), the source and drain are not connected and electrons can't flow from the source to the drain through the substrate. When the gate voltage rises above the source voltage by more than the transistor's threshold voltage, a channel forms between the source and drain, allowing electrons to flow and pulling down the drain voltage. The behavior of the PMOS is similar with the voltages reversed.

The Tower of Knowledge

There you have it. We've reached the bottom of the tech stack. This huge pile of technology and abstractions is sitting atop some basic materials and physics. When you build an application in a high-level framework, it's truly astounding how much other technology you depend on. The original respond_with method call in a Rails application is the tip of a very large iceberg.

Every layer in this stack is much more complex than I've described. You can read books, take courses, and spend years learning about the details of each layer and how to design within it. If you spend all of your time in one layer, it's probably worthwhile to take a look at what's happening at least one or two layers below you to get a better understanding of how the things you depend on actually work. It's one way of learning the basics.

It may also be worth learning about the layer above where you work so that you can better appreciate the issues that designers in that layer have to deal with all the time. The better these layers of technology can interact and the better the abstractions become, the more progress we'll make designing new technologies. Knowledge is power, and there's an awful lot of knowledge to be had.
