The Universal Scalability Law and the Mythical Man Month


View on Twitter

People dunked on this tweet, saying, in essence, “This isn’t 100% correct - you shouldn’t pay attention.” But that misses the point. The value of any model is that it’s simpler than reality so that you can gain insight. Here are the insights I have gained from this model.

Fred Brooks first put forth the idea that adding people to a late project makes it later, and stated that the pairwise communication was the real killer. Note that he was only talking about adding people to a late project - more on this later. But first, a digression (or two)!

Around the time I started QLDB, I read about the Universal Scalability Law via the always-informative Marc Brooker. This law extends Amdahl’s Law to explain why adding more processors to a task can make the task take longer.

As a refresher, Amdahl’s Law says that if you use N processors, the throughput relative to a single processor is N/(1 + α(N - 1)), where 0 <= α <= 1. Amdahl’s Law is usually stated in terms of latency, but I’m going to use the throughput formulation, as I find it more useful.

Throughput is more useful when analyzing distributed systems with many servers. I am more interested in how continuously-arriving tasks are handled by the service as a whole as opposed to analyzing the latency of a single task with serial and parallel portions.

The USL extends Amdahl’s Law with a second coefficient:

\[\frac{N}{1 + α(N - 1) + βN(N - 1)}\]

It’s this beta coefficient that models the negative returns to adding additional processors. The intuition behind the parameters is as follows.

Alpha is the fraction of the workload that is single-threaded. Gunther (who created the USL) calls it “contention.” So if every request has to go through a single-threaded authorization process, and that takes 5% of total request time, α = 0.05.

Beta is Gunther’s contribution to the model. He calls it “coherency.” It acknowledges that some things are worse than a bottleneck. One example is cache coherency slowdown in multi-core systems.

In a distributed system, this term comes from gathering consensus. It’s why nobody implements naïve Paxos, where any node can propose a new state. If multiple nodes make simultaneous proposals, you need extra communication to determine the winner.

For high-throughput distributed consensus, you first elect a leader as the sole proposer, and pay the expensive coherency cost only when a new leader needs to be elected. In the equation, more nodes mean more potential for conflicting proposals, i.e., higher beta.
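To make the difference between the two laws concrete, here is a minimal Python sketch of both formulas. The value α = 0.05 is the authorization example above; β = 0.001 is a hypothetical number chosen only for illustration, and the function names are not from any library.

```python
import math


def amdahl_throughput(n, alpha):
    """Relative throughput of n processors under Amdahl's Law."""
    return n / (1 + alpha * (n - 1))


def usl_throughput(n, alpha, beta):
    """Relative throughput of n processors under the USL."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def usl_peak(alpha, beta):
    """N at which USL throughput peaks (requires beta > 0).

    Setting dX/dN = 0 for the USL formula gives
    N* = sqrt((1 - alpha) / beta).
    """
    return math.sqrt((1 - alpha) / beta)


ALPHA = 0.05   # the 5% single-threaded authorization example from the text
BETA = 0.001   # hypothetical coherency cost, chosen only for illustration

for n in (1, 8, 16, 32, 64, 128):
    print(f"N={n:3d}  Amdahl={amdahl_throughput(n, ALPHA):6.2f}  "
          f"USL={usl_throughput(n, ALPHA, BETA):6.2f}")

print(f"USL throughput peaks near N = {usl_peak(ALPHA, BETA):.1f}")
```

With these assumed numbers, the Amdahl curve flattens toward 1/α = 20 but never declines, while the USL curve peaks at roughly 31 processors and then falls. That decline is the negative return that beta models.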

What does all of this have to do with the Mythical Man Month? The “pairwise communication” part of “adding people to a late project makes it later” is beta. Gunther himself saw this parallel and wrote about it.

When analyzing team throughput, alpha represents the time spent in one-to-many communication (e.g. team meetings where just one person is speaking). This time is a fixed tax on each additional person’s contribution to the project. Beta is the time spent on pairwise coherency.

An obvious example of pairwise coherency is standup. If every person speaks, the time goes up linearly with the number of people, and since every person is in standup, the total people-minutes consumed by standup goes up as N².

Wait, you say. Is it βN² or βN(N - 1)? This gets at the point I started with. It doesn’t matter. I’m not trying to claim that by plugging in α, β, and N you can precisely compute how long a project will take if you add three new people. That’s not the value of the model.
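One way to see why the choice hardly matters is to compute the best size under each form of the coherency term. This is only a sketch: α = 0.05 and β = 0.001 are hypothetical values, and the helper names are made up for illustration.

```python
def throughput(n, alpha, beta, coherency):
    """USL-style throughput with a pluggable coherency term."""
    return n / (1 + alpha * (n - 1) + beta * coherency(n))


def pairs_term(n):
    """Coherency cost proportional to N(N - 1), as in the USL."""
    return n * (n - 1)


def square_term(n):
    """Coherency cost proportional to N squared."""
    return n * n


def best_size(alpha, beta, coherency, max_n=200):
    """Integer N in 1..max_n that maximizes throughput."""
    return max(range(1, max_n + 1),
               key=lambda n: throughput(n, alpha, beta, coherency))


ALPHA, BETA = 0.05, 0.001  # hypothetical values, for illustration only

for name, term in (("N(N-1)", pairs_term), ("N^2", square_term)):
    n_best = best_size(ALPHA, BETA, term)
    print(f"{name:7s} peak at N={n_best}, "
          f"throughput={throughput(n_best, ALPHA, BETA, term):.2f}")
```

Under these assumptions both variants peak at the same N, and the peak heights differ by about one percent. The shape of the curve, not the exact formula, carries the insight.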

Now for computer systems, the model can be fit statistically to predict future throughput. Unfortunately, humans are way more complicated. But that doesn’t mean the model doesn’t have value! I claim that by studying this model we can extract insight in order to guide action.
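For the computer-systems case, one simple way to do such a fit (an assumption here, not necessarily the procedure Gunther or anyone else uses) is to rearrange the USL into a form that is linear in α and β and apply least squares. The "measurements" below are synthetic, generated from known parameters, so the sketch only demonstrates that the fit recovers them.

```python
def usl_throughput(n, alpha, beta):
    """Relative throughput of n processors under the USL."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def fit_usl(sizes, throughputs):
    """Least-squares estimate of (alpha, beta).

    Throughputs must be relative to the single-processor case.
    Rearranging the USL gives
        N / X - 1 = alpha * (N - 1) + beta * N * (N - 1),
    which is linear in alpha and beta with no intercept, so the
    2x2 normal equations can be solved directly.
    """
    suu = suv = svv = suy = svy = 0.0
    for n, x in zip(sizes, throughputs):
        u, v, y = n - 1, n * (n - 1), n / x - 1
        suu += u * u
        suv += u * v
        svv += v * v
        suy += u * y
        svy += v * y
    det = suu * svv - suv * suv
    alpha = (suy * svv - svy * suv) / det
    beta = (suu * svy - suv * suy) / det
    return alpha, beta


# Synthetic data: generated from known parameters, not real measurements.
sizes = [1, 2, 4, 8, 16, 32, 64]
measured = [usl_throughput(n, 0.05, 0.001) for n in sizes]

alpha_hat, beta_hat = fit_usl(sizes, measured)
print(f"alpha ~ {alpha_hat:.4f}, beta ~ {beta_hat:.5f}")
```

Real measurements are noisy, so a real fit would also need some check on how well the curve matches the data before trusting any extrapolation.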

When I started at Amazon in 1998, there were 60 people in tech. Not 60 SDEs. 60 people total, including DBAs, SAs, TPMs, managers, etc. We were divided into two teams of roughly 30 people. Each team met every other week and all 60 met together in the alternate weeks.

This setup didn’t last. As we grew, it became impractical for us all to meet together, so we split into more and more units. But those team meetings aren’t the story here - they’re alpha. The real story is beta - and how it creeps in where you might not expect.

Part of the reason for Amazon’s incredible success across a staggering array of ventures is our focus on pushing autonomy down as far as possible. Jeff said from the start, “I don’t want to make communication more efficient - I want there to be less communication!”

So we focused on creating small teams with clear business goals. Those teams of 30 were really subdivided into smaller teams. For example, when I started, there was a “search” team with four people and a “personalization and community” team with three.

As we grew, these teams continued to split and specialize. So the three people on personalization and community became six and then split into two teams - one for personalization and one for community - allowing them to each grow to six.

By having two teams of six instead of a team of twelve, standup costs were capped, but that’s not the main source of beta, it turns out. Just as in the USL applied to services, the main source of beta is coherency, or coming to consensus.
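The capped standup cost is easy to put numbers on. The sketch below assumes every person speaks for two minutes (a made-up figure) and counts only within-team cost. It deliberately ignores the cross-team consensus that the rest of this story is about.

```python
def standup_people_minutes(team_size, minutes_per_speaker=2):
    """Total people-minutes for one standup where everyone speaks.

    minutes_per_speaker is a hypothetical figure for illustration.
    """
    meeting_minutes = team_size * minutes_per_speaker
    return team_size * meeting_minutes


def pairwise_links(team_size):
    """Number of distinct pairs of people on a team."""
    return team_size * (team_size - 1) // 2


one_team_of_twelve = standup_people_minutes(12)
two_teams_of_six = 2 * standup_people_minutes(6)

print(f"standup, one team of 12:  {one_team_of_twelve} people-minutes")
print(f"standup, two teams of 6:  {two_teams_of_six} people-minutes")
print(f"pairs, one team of 12:    {pairwise_links(12)}")
print(f"pairs, two teams of 6:    {2 * pairwise_links(6)}")
```

Splitting halves the standup cost and cuts the within-team pairs by more than half, which is why the saving looks so attractive on paper.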

It’s straightforward to take six people and split them into two disjoint teams. You just pick from the 20 possible combinations! :) But that doesn’t magically partition either the software or the knowledge in their heads. Here’s where beta really gets you.

Some of the software is now shared between the two teams. These teams have different goals and priorities. Maybe one team wants to extend some functionality to be more flexible, but that would mean the other team has to adjust how they are using the shared software.

Now the teams have to spend time deciding the best way to make the changes, how valuable they are, etc. They would have had to have similar discussions if they were one team, but it’s harder to achieve consensus among groups with different priorities.

This is one reason that I insisted early on that our sub-teams share as little code as possible. Some of the engineers thought it was dumb that each of them was writing similar services to manage, say, the EC2 instances they needed.

“Duplicated effort” is anathema to most developers. But for a new service, when you are figuring things out and need to move quickly, the cost of consensus can be crushing. It’s better to have “duplicate” efforts and join them later if you figure out they really are the same.

Obviously, you do really need to have discussions about strategy, or software design, or whatever. Gathering different perspectives is important. Letting many people have a voice is important. You just need to be conscious of the cost and pay it when it’s worth it.

It’s so easy in these discussions to become attached to being “right,” and to continue to argue your position even if the other position is just different, or if it’s unknowable which one will prove to be right. This is beta - remember that it has N² impact on productivity.

Sometimes it is important to continue to argue. We make decisions every day that have lasting ramifications. But often it’s really hard to know how the decisions will play out. What you do know is that time spent coming to consensus is time nobody is producing software.

For more-senior people, especially, it can be hard to let things go when you think the path being proposed isn’t the best. But try to ask yourself whether paying the cost of consensus delivers more than letting the less-senior person try it their way. You might be surprised!

Amazon has grown its tech community by an average of over 30% per year for over 20 years, so I’ve had a lot of opportunity to observe this process. And my observation is that it’s hard for everybody to adjust to growth, regardless of their position on the team.

Consider the least-senior member of the original personalization and community team. He was involved in all the strategy discussions for the combined team, got to weigh in on the software for both pieces - generally knew everything going on.

After the split, he is on one of the two teams, and is only participating in strategy discussions for that team. He probably isn’t participating in the design discussions for the shared software. It feels like his world has shrunk.

In reality, the total complexity of the business goals and software being developed by the post-split team of six is commensurate with the original team. It’s deeper, but it’s not as broad or diverse, so it feels smaller.

On the other hand, consider the most-senior member of the original team. She maybe doesn’t end up on either team, but rather carries responsibility for both. Her challenge is that she can no longer keep up with all the details on both teams to the same level.

She has to let go of decisions that she would have weighed in on if there were only six people total. Now that there are twelve on two teams, she has to figure out what she needs to pay attention to, and what should be completely delegated to people on the individual teams.

Complicating all of this is that it’s not really a discrete process where you were on a team of six and wake up the next day to two teams of six. It’s a continuous process where teams grow one person at a time and then semi-split-but-not-really-until-it’s-way-too-many-people.

I thought about this a lot when we were growing the QLDB team over the past year. I talked to the team about how they might find themselves feeling left out as teams split apart. I talked to them about the costs of consensus in decision making. I’m not sure how much it helped.

This is also why we haven’t added very many people over the past six months. To Dr. Brooks’ point, the cost of consensus goes up when you add new people. They have to be brought up to speed, they cause teams to split apart, etc.

Once things settle back down, beta goes down as well - people are dedicating more of their time to productive measures. Obviously the simple interpretation of the original tweet - that once the team grows you have permanently more communication - is wrong.

But I also see deep wisdom in the observation that the number of interactions goes up with the square of the number of participants. It’s a call to be aware and intentional about the increasing costs of consensus so that you don’t end up getting less done with more people.

PS. A big thanks to @vijayravindran, who encouraged me to keep working on this thread when I was ready to give up trying to express myself in … well … coherent … 280-character chunks.

PPS. As always, if you like my stories, please also consider following these people and reading their stories. Since the tweet below, I’ve added @slooterman, @taralconley, @DadTrans, @mssinenomine, @ToriGlass, @debcha, @graceelavery, @Cal__Montgomery, et al.

View on Twitter