
Software is hard - Part 1: Scaling

Building software at scale is no longer about writing code.

George Hadjiyiannis


A long time ago, when I was still in college, I was fortunate enough to take a Software Engineering class taught by Prof. John Guttag. I remember him starting one of the lectures with the question:

“Can anyone here write a 10,000 LOC program in a year?”

A large portion of the class raised their hands (this is MIT, after all). He then proceeded to ask:

“Can anyone here write a 100,000 LOC program in ten years?”

Almost all the same students raised their hands. Prof. Guttag then drove it home:

“Can anyone here write a 1,000,000 LOC program?”

That had the (apparently) desired effect: stunned silence. He proceeded to ask:

“How do you put together a 1,000,000 LOC program in 3 years?”

Someone timidly said: “You get 33 developers…” but we all knew it would not be as simple as that.

“Well, 33 developers could probably do it if all they did was code. But now they can't afford to do just that - they need to coordinate: specify the interfaces between the pieces; coordinate schedules so that modules that interact are available in the same time-frame to be tested together; perform the joint tests; debug the results; and come up with fixes that impact both components… You now have a significant communication overhead, don't you? So you definitely need more than 33 developers. Perhaps if you had 44…”

“What about a 10,000,000 LOC program? Can you do that with 440 developers?”

Silence. By now we were beginning to see the hard truth…

“Well, how does the communication overhead grow with the number of people?”

“Order n squared,” said someone…

And there was the hard truth in full glory: writing code scales quite badly. And just in case you think that 10 MLOC is a theoretical figure of little more than academic concern, it is worth keeping in mind that Debian has been past the 100 MLOC mark ever since 2002.
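To make the arithmetic concrete, here is a minimal sketch in Python (assuming, as a worst case, that every pair of developers is a potential communication channel; the developer counts are the ones from the lecture):

```python
# Pairwise communication channels between n people: n * (n - 1) / 2, i.e. O(n^2).
def channels(n: int) -> int:
    """Worst-case number of communication channels in a group of n people."""
    return n * (n - 1) // 2

for n in (33, 44, 440):
    print(f"{n:>3} developers -> {channels(n):>6} potential channels")

# Output:
#  33 developers ->    528 potential channels
#  44 developers ->    946 potential channels
# 440 developers ->  96580 potential channels
```

Going from 33 to 440 developers is about a 13x increase in headcount, but roughly a 183x increase in potential channels - which is why simply multiplying the team size does not work.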

Instinctively, software engineers know that you can't write large pieces of software as fast as small ones just by adding people to the team. What is less obvious, though, is that as the scale increases, the nature of the challenge changes: there is a tipping point at which the technical challenges of the work are completely overwhelmed by the need to manage the communication and coordination overhead. At that point, managing software teams turns from an engineering exercise into an organizational and people-management exercise. And that tipping point can come surprisingly early - at as few as 40-50 people, depending on the structure of the teams.

We have various tools to mitigate this effect. First of all, we often use architecture to minimize the need for communication and coordination. To a degree, this is one of the major advantages of the move to microservices: hide the dependencies behind a narrow, slowly-changing API, and communication between the teams working on the clients and the team working on the service can be reduced largely to the definition of the API itself. Encapsulate the functionality behind the API into a small, independently deployable service, and you can reduce the coordination overhead as well. If it so happens that communication with the team that implemented the service is no longer possible (perhaps because of the loss of key people), one could even consider replacing the entire implementation of the service.
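As a minimal sketch of this idea (all names here - PricingService, get_price, HttpPricingClient - are hypothetical, not from any particular system): the narrow API is the only thing the teams need to agree on, and the implementation behind it can be replaced without touching the clients.

```python
# A minimal sketch of hiding a dependency behind a narrow, slowly-changing API.
# All names (PricingService, get_price, HttpPricingClient) are hypothetical.
from typing import Protocol

class PricingService(Protocol):
    """The narrow contract that client teams code against.

    This interface is the entire coordination surface between the teams:
    as long as it stays stable, the implementation behind it can change
    (or be rewritten wholesale) without cross-team coordination.
    """
    def get_price(self, sku: str) -> int:
        """Return the current price of an item, in cents."""
        ...

class HttpPricingClient:
    """One possible implementation; client code never depends on it directly."""
    def __init__(self, base_url: str) -> None:
        self.base_url = base_url

    def get_price(self, sku: str) -> int:
        # A real implementation would issue an HTTP request here; stubbed out
        # because the point of the sketch is the shape of the boundary.
        raise NotImplementedError

def checkout_total(pricing: PricingService, skus: list[str]) -> int:
    # Client-team code depends only on the narrow API, not on the service.
    return sum(pricing.get_price(sku) for sku in skus)
```

The service team can swap HttpPricingClient for a complete rewrite, and the client teams never need to know; the coordination cost is concentrated in the (rare) changes to the contract itself.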

Another tool is organizational structure. This is the main reason behind the famous Bezos Two-Pizza Rule, and the key reason for keeping Agile teams at around 7 people: keep the team size small, and you keep the communication overhead within the team under control. Note, however, that this increases the coordination overhead between teams, as there are now more of them. Inevitably, this manifests on the organizational and process side of the equation, so we get things like Scrum of Scrums, or SAFe (Scaled Agile Framework). At the end of the day, however, this need to manage the overhead cannot be avoided; it is an inherent part of the system. Note how quickly we went from talking about coding to worrying about organizations - this happens even at the scale of a few teams. It is also worth keeping in mind that, realistically, neither Scrum of Scrums nor SAFe remains viable when you are dealing with organizations spanning thousands of developers and hundreds of teams. Yet such organizations are quite common (and quite necessary).
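A back-of-the-envelope illustration of the trade-off, with hypothetical numbers (49 people, split into 7 teams of 7):

```python
def channels(n: int) -> int:
    """Worst-case pairwise communication channels in a group of n people."""
    return n * (n - 1) // 2

people, team_size = 49, 7
teams = people // team_size

print(channels(people))             # 1176 channels in one undivided group
print(teams * channels(team_size))  # 147 channels within the 7 small teams
print(channels(teams))              # 21 team-to-team channels left over
```

Small teams cut the total number of channels roughly sevenfold, but each of the 21 remaining team-to-team channels is far more expensive to operate - and that gap is precisely what Scrum of Scrums and SAFe try to fill.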

I would generally classify the effects that the architectural and organizational measures above try to address as first-order effects. There are, however, a whole slew of second-order effects. First and foremost is the issue of autonomy versus synergy. I will attempt to illustrate using the example of build pipelines, but keep in mind that this could apply to anything that can become a local-versus-global choice (including technology choices, tooling, repositories, best practices, etc.). Imagine that you are the manager of 7 agile teams. Do you enforce a single, global choice of CI pipeline, or do you allow some of the teams to use GoCD and some to use good, old-fashioned Jenkins? At first, this may appear to be a simple technology decision, but the fundamental trade-off behind it is very much an organizational one: if you allow the teams to use different tools, then integration testing between components built in the different systems carries a coordination overhead. And vice versa: if you force them to use the same technology, then you need to coordinate pipeline software versions, design patterns for the pipelines, and so on. Either way, you have a coordination overhead to worry about, and more often than not, it is the deciding factor in the choice. Pretty soon you start talking about guilds, Centers of Competence, and all sorts of other organizational structures whose main function is simply to manage this coordination. Once again, managing the overall organization becomes less about technology and engineering, and more about managing the coordination overhead.

Another second-order effect is inter-team dynamics. While maintaining a reasonable level of collaboration within a team is not very taxing, the looser bond between teams means that collaboration across teams takes a lot more work to get right. In an environment with significant outside pressures, teams naturally tend towards competition with each other. When things go wrong, a team will tend to minimize its contribution to the event, often by highlighting the contribution of other teams. When things go right, it will tend to see its own contribution more clearly than that of other teams, and often feel that it contributed more than the others did. Even when things are going well, teams will naturally enter into a (hopefully healthy) competition. The end result is that the relationships between teams develop certain inefficiencies, and these take active effort to overcome. In short, overhead between teams is more costly than overhead between people within a team. Add to this the fact that, by design, we tend to minimize communication between teams, and that different teams tend to have different cultures, and the idea of multiple teams soon comes to represent a very significant communication and coordination cost. Consider how much money and energy is spent on improving working relationships between teams - off-sites, hackathons, team events, travel to other sites, not to mention the amount of management bandwidth devoted to this - and it soon becomes obvious that this is a major overhead indeed.

To conclude: we ended up talking about everyday practices that are extremely common in the industry, and instantly recognizable, yet most of them have only a passing relationship to technology; all of them are fundamentally attempts to deal with the communication and coordination overhead of large teams. I believe that, beyond a certain size, the business of creating and maintaining software is dominated by the concern of managing this overhead, rather than by the engineering concerns inherent in coding. In fact, I am fairly convinced that the main reason software seems easier to put together in a startup than in an “established” company is not the tendency of the latter towards processes (which I believe to be an end-result), but rather the failure to recognize early enough that managing software at scale is almost entirely about managing the people aspect of software teams rather than the technology.
