Horizontal, vertical... is that it?

You've probably heard something about scalability already. Most sources divide the many scalability problems into two big baskets:

Vertical scaling. That is, making the existing server more powerful;
Horizontal scaling. Making more servers... and everything else.

I think this dichotomy is shallow, and all it does is hide complexity. Many different problems emerge only at scale, and some of them have nothing to do with each other. Furthermore, the solutions to those problems sometimes come from very peculiar tradeoffs, which can differ from industry to industry. I propose splitting scalability problems into four major categories. And with that, let's dive in!

Numerical values scaling

Let's start with the most obvious problem: the scalability of numerical values. This dimension covers a variety of problems. Some of them can be attributed to horizontal or vertical scaling, e.g.:

How do we make sure our servers can grow in size efficiently?
Will the system remain correct if new servers are added?

But there are many more questions that a) deal with numerical values and b) do not have a clear answer.

For example, not so long ago the number of IPv4 addresses was enough for everyone. This is no longer the case. Is it a horizontal scalability problem? Well, not really. It's something completely different. You don't "scale your way out of it". Instead, you invent completely new solutions like Network Address Translation (NAT).
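To put a number on that limit: an IPv4 address is 32 bits wide, so the address space is fixed no matter how many servers you buy. Here is a minimal sketch using Python's standard ipaddress module. The variable names are mine, and the three private ranges are the ones reserved by RFC 1918, which NAT setups typically reuse behind a public address.

```python
import ipaddress

# The whole IPv4 space is one /0 network of 32-bit addresses.
total = ipaddress.ip_network("0.0.0.0/0").num_addresses
print(total)  # 4294967296, i.e. 2**32

# Private ranges from RFC 1918. Many networks can reuse them at the
# same time because NAT hides them behind public addresses.
private_ranges = ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
private_total = sum(
    ipaddress.ip_network(cidr).num_addresses for cidr in private_ranges
)
print(private_total)  # 17891328
```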

Another case: when your company is small enough, you can put the addresses of all your servers into a single DNS record. There is no way the same approach would work when you have thousands of machines! The creativity that goes into solving problems at scale can blow your mind.
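A rough back-of-the-envelope sketch of why the single-record approach stops working. It assumes classic DNS over UDP with the 512-byte message limit from RFC 1035, name compression in the answers, and a hypothetical hostname api.example.com. EDNS(0) or TCP raise the limit, so treat the result as an illustration, not a hard rule.

```python
def max_a_records(hostname: str, udp_limit: int = 512) -> int:
    """Estimate how many IPv4 (A) answers fit into one DNS response."""
    header = 12  # fixed DNS header size in bytes
    # The name is encoded as length-prefixed labels plus a root byte.
    qname = sum(len(label) + 1 for label in hostname.strip(".").split(".")) + 1
    question = qname + 4  # QTYPE (2 bytes) + QCLASS (2 bytes)
    # Compressed name pointer, type, class, TTL, data length, IPv4 address.
    per_answer = 2 + 2 + 2 + 4 + 2 + 4
    return (udp_limit - header - question) // per_answer


# "api.example.com" is a made-up name used only for this estimate.
print(max_a_records("api.example.com"))  # 29
```

Under these assumptions only a few dozen addresses fit into one response, which is far short of thousands of machines.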

Why does Facebook pass a request through an L4+L7 load balancer, and then do it again inside the data center [1]? Is it merely a "horizontal scalability" problem? You could say that. You could also say that some subset of solutions operates on a completely different level. Some of them are about a constant struggle with the physical limitations of the Internet and old protocols!

Cost optimizations at scale

A somewhat related topic is cost optimization at scale, the importance of which grows linearly with the growth of, well, costs. Let's pretend you are the CEO of a tiny startup and your infrastructure costs are around $10,000 per year. Would you sacrifice your developers' time to cut those costs by 5%, that is, by $500? Probably not. What if you are Google and 5% is hundreds of millions of dollars in possible savings? Heck, even cutting costs by 1% saves an astronomical amount of money!
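The same arithmetic as a small sketch. The startup figures come from the paragraph above. The day rate and the large company's budget are hypothetical placeholders chosen only to show the difference in scale, not real numbers for any company.

```python
def yearly_saving(infra_cost: int, percent: int) -> float:
    """Money saved per year by cutting infrastructure costs by `percent`."""
    return infra_cost * percent / 100


def break_even_days(infra_cost: int, percent: int, day_rate: int) -> float:
    """Engineer-days of work that one year of savings would pay for."""
    return yearly_saving(infra_cost, percent) / day_rate


DAY_RATE = 500  # hypothetical cost of one engineer-day, in dollars

startup = break_even_days(10_000, 5, DAY_RATE)
large = break_even_days(2_000_000_000, 5, DAY_RATE)  # hypothetical budget

print(startup)  # 1.0
print(large)    # 200000.0
```

With these made-up inputs, the startup's saving pays for a single day of engineering work, while the large company's saving pays for a whole department.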

The problem here is that optimizing something gets ridiculously hard the more you do it. Picking the low-hanging fruit is easy. Optimizing at scale is very hard. It also doesn't have much to do with either horizontal or vertical scaling, as the approaches used here may vary drastically.

Geographical scaling

Next comes geographical scalability, which brings with it a) latencies and b) excellent opportunities for optimization. For example, imagine a taxi service that is available both in Moscow and in New York. How important is it for the Moscow service to know about the existence of a similar service in New York? Surely no living soul would take a ride between the two cities. Does it also mean you can split the application behind the taxi service into two completely independent domains? Well, maybe, maybe not. It's too hard a question to give a generic answer. But it is a huge opportunity to reduce maintenance costs.

Administrative scaling

Geo-distribution brings in another fascinating set of problems, namely the need for administrative scalability. "Administrative" is not the most fitting word, but I borrowed it from [2], so let's leave it as it is. I am talking about the aspects of a large business's growth that affect the technical side. For example, how does a merger between two companies affect the overall architecture and infrastructure of the company? How do you make this process less painful? How does a mega-corporation integrate products from hundreds of other companies, so to speak, "at scale"? Some of those issues are legal in nature, but most of them are about dealing with uncertainty and having a well-thought-out technical plan.

A related but distinct group of problems is dealing with legal peculiarities in different countries. For example, how should personal data be stored in each country? And how do you adapt the existing system to print the correct cash receipt? Those things differ wildly between countries. The answer to each of these questions affects both the growth rate and the overall architecture of technical solutions. Financial and medical companies have an especially difficult time dealing with legal specifics because the hierarchy of inspection bodies has been growing around them for several centuries. I'm fairly confident that "horizontal" and "vertical" scaling and the problem of quickly onboarding new countries are beasts of quite a different nature.

People scaling

Finally, my favorite kind of scalability issue: people scaling [3]. How do you make sure that 10 developers don't quarrel while writing code together? Shouldn't be too hard, just don't put them all on the same project. What about 100 devs? Or 10,000? Similarly, how do you make sure that such a crowd follows the same set of standards? A related and very important problem is scaling, or to be more precise NOT scaling, the number of communication links between people.

Ideally, you would want the number of communication channels to grow sublinearly with the number of employees. The reasoning here is that people are not machines. People get tired of one another, and information gets lost or misunderstood. Making sure all the people in your organization align on the same vision and pursue the same goal might just be the hardest thing to do in any growing company. Speaking of hundreds of people, what should the hiring process look like to hire 15 developers in a year? And five thousand? Those would be wildly different!
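To see why unmanaged communication does not scale, count the possible one-to-one links in a group where everyone may talk to everyone: n(n-1)/2, which grows quadratically. A minimal sketch follows. The function name is mine, and the group sizes are the ones from the questions above.

```python
def pairwise_channels(people: int) -> int:
    """Number of distinct one-to-one links among `people` individuals."""
    return people * (people - 1) // 2


for n in (10, 100, 10_000):
    print(n, pairwise_channels(n))
# 10 45
# 100 4950
# 10000 49995000
```

Ten developers already have 45 possible links. At 10,000 the count is close to fifty million, which is why the structure of an organization has to keep most of those links from ever being needed.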

Okay, so you've solved the problem of hiring. But there are still massive complications everywhere. For example: how safe is it for a few thousand people to change the codebase every single day? How do you test and deploy all these gigabytes of code? One fun example here is Google, where everyone and everything lives in a home-grown monorepo. The thing is that it's not only people who change code. There is also a robot (Rosie) making a few tens of thousands of commits per month [4]. Well, that's cool, but what is cooler is that quite a lot of other companies operate at a similar scale and a) do not have a monorepo, and b) certainly don't have a robot changing everything. So there is always a tradeoff and things to think about.

A separate but related topic: how do you support products that are already 10+ years old? Developers rarely stay that long at one company, so knowledge sharing becomes mandatory for survival. There is a chapter about this in [3], and another on an equally interesting topic: deprecation of products, a.k.a. how to properly decommission something.

Closing notes

The problem of scaling is astonishingly deep and fascinating. Trying to hide this complexity does no one any good. Instead, we should embrace the complexity, divide the problems into solvable pieces, and fix each piece individually. That's what software engineers do!

There is a lot more to say on the topic, for example, security at scale, R&D, or retention strategies. I've outlined four categories which I think are a good start, but there is always room to grow. Those four are:

Numerical values scaling
Geographical scaling
Administrative scaling
People scaling

Sources:

[1] Usama Naseer, Luca Niccolini, Udip Pant, Alan Frindell, Ranjeeth Dasineni, Theophilus A. Benson. Zero Downtime Release: Disruption-free Load Balancing of a Multi-Billion User Website.
[2] B. Clifford Neuman. Scale in Distributed Systems.
[3] Titus Winters, Tom Manshreck, Hyrum Wright. Software Engineering at Google.
[4] Rachel Potvin, Josh Levenberg. Why Google Stores Billions of Lines of Code in a Single Repository.

Shark in IT