# Math Riff: The Supercomputer Brain Part 2

In the previous post we set out to talk about the comparison between brains and computers, looking specifically at whether Ray Kurzweil’s prediction for human-level artificial intelligence is likely in the next few decades. Our main conclusion based on looking at the structure of a brain is that we’d need roughly 32 petabytes of space to accurately model what the brain looks like. This post delves into what kind of silicon-based infrastructure you would need to process a data structure that size.

As a first step, we might try to figure out how to get 32PB of information into the computer’s main Random Access Memory (RAM). This is different from the disk storage space, which is always a much larger number. The amount of RAM in your computer is significantly less than what the hard disk can store, but it can be manipulated much, much faster. In fact, a computer really can’t do anything with any information unless it’s loaded at least partially into RAM.

RAM is typically installed using a little card called a DIMM that is plugged into the computer’s motherboard. The most cost-effective DIMMs today hold 2GB of memory (you can get bigger ones, but they cost significantly more per unit of storage). It would take 16 million of today’s DIMMs to store our 32PB brain model. If we figured our brain model might run on a high-end computer with space for 16 DIMMs on the motherboard, we would still need a 1,000,000x improvement in memory density.

Unfortunately, Moore’s Law says we only double our density every two years. A 1,000,000x increase equates roughly to 2^20, which means we’re going to need 20 generations of improvement over 40 years before we get our 2PB DIMMs. That already takes us out to the year 2049, so we’re pushing Ray’s calculations a bit, but we’re still in the ballpark. That is, if RAM is actually fast enough.
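The DIMM arithmetic is easy to check with a few lines of Python. This is only a sketch of the estimate described above: it assumes decimal units (1PB = 10^15 bytes) to match the round numbers in the text, and the helper name `doublings_needed` is made up for illustration.

```python
import math

# Decimal units, chosen to match the post's round numbers (an assumption).
GB = 10**9
PB = 10**15

def doublings_needed(improvement):
    """Smallest number of doublings that reaches the given improvement factor."""
    return math.ceil(math.log2(improvement))

model_bytes = 32 * PB        # size of the brain model
dimm_bytes = 2 * GB          # capacity of one cost-effective DIMM
dimm_slots = 16              # slots on a high-end motherboard
years_per_doubling = 2       # Moore's Law cadence used in the post

dimms_needed = model_bytes // dimm_bytes        # DIMMs at today's density
density_gain = dimms_needed // dimm_slots       # improvement needed per DIMM
generations = doublings_needed(density_gain)    # Moore's Law generations
years = generations * years_per_doubling

print(f"DIMMs needed today:   {dimms_needed:,}")
print(f"Density gain needed:  {density_gain:,}x")
print(f"Generations / years:  {generations} / {years}")
```

A million is slightly less than 2^20 (1,048,576), which is why rounding up to 20 doublings is a fair approximation.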

## Is RAM Fast Enough for Brain Modeling?

Once we’ve got our model in RAM, we can get the CPU to do something with it. The neurons in your head signal each other fairly slowly, but they do it all in parallel. So, even though a neuron might “fire” only 5 or 6 times per second, you’re talking about 100 billion of them working in parallel, so you’ve got quite a bit of processing going on.

Normally the way someone might look at this problem is to start with the number of axons and figure we need to run a short bit of computer code across each one a few times a second. If we start with our figure of about 10^16 axons and figure each scan across a neuron or axon might take roughly 10 microprocessor instructions on each pass, that gets us in the ballpark of 217 floating point operations, or about 10 petaflops. No surprise, this is spot-on with Ray’s calculations. But we’re already anticipating supercomputers with more than this range of processing power… At that scale, we should be seeing HAL 9000s pop up around us soon.

We’ve already realized we need to look at axons, which led us to our 32PB data structure. Let’s imagine that to mimic a human brain, we need to run our model through the microprocessor five times each second. That means moving each little piece of our model out of RAM and into the processor, updating it, then moving it back to main RAM, five times each second. Each pass requires moving 32PB of data back and forth, or 64PB of transfer. Doing that five times a second requires 320PB/s of memory bandwidth.

A modern memory bus moves 128 bits (16 bytes) of data at a rate of 1.3 gigahertz. Multiply those two values and you get a rate of about 20GB/s of bandwidth. How do we figure the needed technological advance here? No problem. We’ll just figure out the relationship between 320PB/s and 20GB/s and Moore’s Law should help us again…

320PB/s = 320,000 TB/s = 320,000,000 GB/s

320,000,000 / 20 = 16,000,000x

So we need a 16 million times increase in performance, or roughly 2^24, which gives us 24 generations or (at two years per doubling) 48 more years of Moore’s Law moving forward. Right around the corner, right?
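Here is the same bandwidth estimate as a small Python sketch. It assumes decimal units and, like the text, rounds the bus bandwidth down to 20GB/s for the ratio. The function name `required_bandwidth` is hypothetical.

```python
import math

# Decimal units, matching the post's conversions (an assumption).
GB = 10**9
PB = 10**15

def required_bandwidth(model_bytes, passes_per_second):
    """Bytes per second moved if every pass reads the model and writes it back."""
    return 2 * model_bytes * passes_per_second

# What a memory bus delivers: 128 bits wide at 1.3 GHz.
bus_width_bytes = 128 // 8
bus_clock_hz = 1.3e9
bus_bandwidth = bus_width_bytes * bus_clock_hz   # about 20.8 GB/s

# What the brain model needs: 32PB in and out, five times a second.
needed = required_bandwidth(32 * PB, 5)

# The post rounds the bus figure to 20 GB/s before taking the ratio.
ratio = needed / (20 * GB)
generations = math.ceil(math.log2(ratio))
years = generations * 2

print(f"Bus bandwidth:    {bus_bandwidth / GB:.1f} GB/s")
print(f"Needed bandwidth: {needed // PB} PB/s")
print(f"Shortfall:        {ratio:,.0f}x")
print(f"Generations:      {generations} ({years} years at two years per doubling)")
```

Using the unrounded 20.8GB/s figure changes the shortfall slightly but still rounds up to the same 24 doublings.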

Unfortunately, physics has something to say here. Current technology is building circuits on process technology with a feature size of 45 nanometers. A silicon atom in the silicon crystal structure is actually around 250 picometers (0.25nm) wide, which might make you think features on a current silicon die are about 180 atoms across. However, the legs feeding into the transistor are around half the feature size, or about 22nm. That means our current process technology is producing connections that are only about 90 atoms wide. Using the same math, the 32nm process that Intel is ramping up for 2010 production will have on-chip connections only 64 atoms wide.

Even starting from current process technology, you can’t cut a number like 90 in half too many times. In fact, after five or six cuts (or about eight new generations of semiconductor processes on Intel's road map above) you’re down to a feature only an atom or two across. Halving the feature size yields four times the density, so we’re talking about a potential increase of 4^6, or about 4,000x. That’s exactly 2^12, or half-way up the exponential curve to 16,000,000x on our doubling scale, but in absolute terms only 1/4000th of the way to the memory performance we need. If we project out to 2nm features, or 1nm interconnects, we're envisioning structures that are only 4 atoms wide. That represents something closer to 2^8 improvement (256x). Realistically, we’re going to be lucky to get 200x-300x more performance out of semiconductor technology before we hit the atomic wall, and we’ll probably see the curve flattening out on the time axis long before we get there.
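To see how quickly the atoms run out, here is a rough Python sketch of the same arithmetic. It assumes a silicon atom is 0.25nm wide and that interconnects are exactly half the feature size, as described above. The function names are invented for this example.

```python
ATOM_WIDTH_NM = 0.25  # approximate width of a silicon atom, per the text

def atoms_across(feature_nm):
    """Atoms spanning an interconnect that is half the process feature size."""
    interconnect_nm = feature_nm / 2
    return interconnect_nm / ATOM_WIDTH_NM

def width_after_halvings(width_atoms, halvings):
    """Interconnect width, in atoms, after repeatedly cutting it in half."""
    return width_atoms / (2 ** halvings)

def density_gain(halvings):
    """Each halving of the feature size packs four times as much into the same area."""
    return 4 ** halvings

print(f"45nm process: {atoms_across(45):.0f} atoms across")
print(f"32nm process: {atoms_across(32):.0f} atoms across")
print(f" 2nm process: {atoms_across(2):.0f} atoms across")

for cuts in range(1, 7):
    remaining = width_after_halvings(atoms_across(45), cuts)
    print(f"{cuts} cuts: {remaining:5.1f} atoms wide, {density_gain(cuts):,}x density")
```

By the sixth cut the interconnect is down to roughly one atom, while the density gain of 4,096x is still far short of the 16,000,000x the bandwidth estimate calls for.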

But the astute reader says, “Hey! Don’t we already have a 20 petaflop computer coming using today’s technology? Why won’t that work?” The problem with the performance numbers for modern supercomputers is that they’re related to highly distributed problem solving that, so far, doesn’t look well suited to modeling the complex connections between neurons in a brain. In fact, the Roadrunner computer uses 129,000 separate smaller processing cores to reach its peak number.

Modern supercomputers are really collections of very large numbers of independent processors, memory and subsystems that are designed to perform lots of independent calculations. If a single CPU today can perform a billion calculations a second, chaining 1,000 machines together and sending 1,000 independent problems, one to each machine, has the effect of running a trillion calculations per second. But if one of those machines needs to communicate with another for part of its work, it requires sending a message over a similar sort of memory bus or a network, which literally requires thousands of times the overhead of just doing a step locally at the CPU. Once you’re “off the chip” the performance advantage of a large number of processors is quickly eaten up.

Modern CPUs avoid moving off a single chunk of silicon by storing as much information as possible right by the CPU in very, very fast memory called cache. In fact, most of the extra transistors enabled by advances in semiconductor process technology go to larger and larger caches, because the CPU cores can run much faster if they go out to RAM less frequently. When the CPU needs something that’s in the cache, it gets it darn fast and can keep running full steam ahead. Imagine your CPU is a sports car tearing up the striped line on the highway. As long as the CPU can find what it needs in the cache, it’s like that car is doing 100MPH. When something isn’t in the cache, and it has to go all the way to main memory, you get something called a “pipeline stall” which is the equivalent of your Ferrari CPU stopping dead at a red light. And if it has to go all the way to the hard disk for anything, that’s basically like stopping at a resort for a weekend vacation.

As long as a single problem can be broken down in such a way that there’s very little communication between each part of the problem and you can fit most of this information neatly into the processor’s cache, this works very nicely. Things like 3D rendering, weather modeling or simulating nuclear tests work very nicely on supercomputers because you know data elements are located in a physical geometry (X/Y/Z coordinates in physical space) that won’t change and their values are a function of their neighbors’ values. Some value 1,000 points away isn’t relevant to this point right here. Modeling problems this way eliminates most communication between nodes, and makes quickly running lots of parallel independent calculations possible.

Modeling large scale synaptic networks with 10,000 connections (axons) between nodes (neurons), where the connections aren’t necessarily spatially dependent and where they change frequently, simply doesn’t fit the architecture of current supercomputers. The problem isn’t easily split apart into thousands of independent chunks of work. There’s too much required coordination over too many changing channels of communication.

Even if you think of a neuron only connecting to 100 other potential neurons, you don’t need a large number of “hops” between neurons before you’re basically traversing the whole architecture. One neuron may connect to say 100 others, but those 100 may themselves connect to another 100 each, and so on. If you make eight hops, you’re at 100^8 or 10^16 connections. That means checking our brain model and seeing how a series of eight neurons might connect up could wind up anywhere in the model’s space… You basically need the whole model in a single fast memory (like the CPU cache) to do any realistic work at the CPU’s top speed.
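The fan-out argument can be sketched in a couple of short Python functions. This is a simplified count that assumes every neuron connects to exactly 100 others and ignores paths that loop back on themselves. The function names are made up for illustration.

```python
def paths_after_hops(fan_out, hops):
    """Number of distinct paths if every neuron connects to `fan_out` others."""
    return fan_out ** hops

def hops_to_cover(fan_out, total_connections):
    """Fewest hops at which the path count reaches the size of the whole model."""
    hops = 0
    paths = 1
    while paths < total_connections:
        paths *= fan_out
        hops += 1
    return hops

for hops in (1, 2, 4, 8):
    print(f"{hops} hops: {paths_after_hops(100, hops):,} possible paths")

print("Hops needed to span 10^16 connections:", hops_to_cover(100, 10**16))
```

Even with a fan-out 100 times smaller than the 10,000 connections per neuron used elsewhere in this post, eight hops is enough to land anywhere in the model.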

What this implies is that even if you have an extraordinarily powerful microprocessor, or lots of them, or lots of cores, or whatever the current paradigm is for CPU architecture 40 years from now, a reasonable model of the brain is still going to have to access a fairly holistic problem model. Running lots of computers (or independent processors in a supercomputing cluster) won’t help because those independent processors can’t see enough of an unpredictably changing model without a gigantic cache. Without some radically new architecture or memory model, you really need one “silicon brain” looking at the electronic simulation of our human brain repeatedly (and quickly) to make the magic happen. And the bottom line is we don’t have enough atoms left at the bottom end of Moore’s Law to get enough memory bandwidth using the semiconductor technology we employ today.

Ray remains optimistic that we’ll escape the atomic limits of Moore’s Law and move on to sub-atomic scale computing, quantum dot based calculation or some other ground-shaking paradigm. Obviously, our brains work well using a different type of technology, so ultimately we’ll find some new approach that reaches and exceeds human capacity. Until then, my advice is to keep your brain in tip-top shape (lots of worksheets!) because computers are a long way from taking over thinking for you.