Understanding the intricacies of designing a large scale distributed system and why being agile is critical...
The industrial revolution made possible the use of new materials and designs that radically altered shipbuilding. The last two decades have been like an industrial revolution for software design and development. Just as the availability of steel, manufacturing processes, education, pre-fabrication techniques and the production line changed the scale of ships, the availability of cloud computing, a global customer base, ready-to-use components, broad adoption of service-oriented architecture and the Internet have come together to change the scale of software. Just as building a boat differs from building a ship, building a large scale distributed system is fundamentally different from building a small custom application. They are different processes with different focus areas and different assumptions that only look deceptively similar.
Not built by one; not built for one
Most software today is not built from scratch and does not have a closed or well defined group of builders or users. Take a moment to think about that. Most boats have the same set of people working on them from start to finish. Those builders can assume that everyone knows everything, including all of the compromises made along the way. Most ships are built over years. People join and leave the build team. And the ship continues to be tweaked, modified and upgraded years after it sails out of the shipyard. Doesn’t that sound very similar to the way software is ‘shipped’ today?
Software teams today need to start with the assumption that team members may not be around as the software continues to be developed and matures. Establishing that premise from the start enforces stricter contracts between components, reduces dependence on unstated assumptions and drives better design as well as documentation. Moreover, any large scale distributed system consists of multiple components spread across teams, so following established design patterns and simple, clear contracts is very important.
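To make the idea of a contract between components concrete, here is a minimal sketch in Python. The names (WishlistItem, WishlistStore, InMemoryWishlistStore, count_items) are hypothetical and chosen only for illustration. The point is that calling code depends on a small, explicit interface rather than on what a particular teammate happens to know about the implementation.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class WishlistItem:
    """Hypothetical record exchanged between components."""
    user_id: str
    name: str


class WishlistStore(Protocol):
    """The contract: callers rely on these two methods and nothing else."""

    def add(self, item: WishlistItem) -> None: ...

    def items_for(self, user_id: str) -> list[WishlistItem]: ...


class InMemoryWishlistStore:
    """One interchangeable implementation; a hosted database could replace it."""

    def __init__(self) -> None:
        self._items: dict[str, list[WishlistItem]] = {}

    def add(self, item: WishlistItem) -> None:
        self._items.setdefault(item.user_id, []).append(item)

    def items_for(self, user_id: str) -> list[WishlistItem]:
        # Return a copy so callers cannot mutate internal state.
        return list(self._items.get(user_id, []))


def count_items(store: WishlistStore, user_id: str) -> int:
    """Written against the contract, not against a concrete class."""
    return len(store.items_for(user_id))
```

Because count_items only knows about the contract, the team can later swap the in-memory store for an off-the-shelf data store without touching the callers, even if the person who wrote the first version has moved on.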
Similar assumptions and premises need to be established for a customer. When designing a large scale distributed system you are generally not targeting one known customer. You are targeting a ‘use-case’. You are targeting a problem that is valuable to solve and that will likely remain valuable over a long term. So while user groups and active feedback are important tools, they definitely fall short of being comprehensive for the modern software world. You have to build on premises of continuous evolution of the use-case and the user. That calls for simplifying and automating the voice of the customer. A/B testing, instrumentation, automated behavioral inputs, and in-flow feedback become more and more important as compared to concerted one-off events.
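As a rough illustration of what automating the voice of the customer can look like, the sketch below assigns users to A/B variants deterministically and counts the events they trigger. It is a simplified sketch under my own assumptions, not a production experimentation framework: the function assign_variant, the class EventCounter and the experiment name are all hypothetical.

```python
import hashlib
from collections import Counter


def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant of one experiment.

    Hashing the experiment name together with the user id keeps the
    assignment stable across sessions without storing any state.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


class EventCounter:
    """Tiny in-memory stand-in for an instrumentation pipeline."""

    def __init__(self) -> None:
        self.counts = Counter()

    def record(self, experiment: str, variant: str, event: str) -> None:
        self.counts[(experiment, variant, event)] += 1


# Usage: every interaction is tagged with the variant the user saw.
counter = EventCounter()
for uid in ("alice", "bob", "carol"):
    variant = assign_variant(uid, "new-checkout-flow")
    counter.record("new-checkout-flow", variant, "page_view")
```

The feedback arrives as a side effect of normal use, so it keeps flowing as the use-case and the user base evolve, with no one-off event to organize.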
Making these assumptions requires clarity of the goal and big thinking. Don’t limit yourself to a single developer or a single customer. Go beyond thinking about how your solution will be used in the next 2 years. Your design assumptions and tenets need to be bold. Compromises can be made during implementation, as that is bound to be iterative. A bold, long term oriented design which does not assume a closed group of builders or users will naturally enable an iterative, improvement oriented implementation cycle.
Don’t build all of your tools
Builders enjoy building from scratch. It’s fun, rewarding and a great learning experience. It works great for small projects, but designing and building anything larger requires a strong adherence to “don’t reinvent the wheel”. No modern large scale project is built from scratch. Ship builders buy entire hulls pre-fabricated by other companies. At the very least, a large number of complex components such as electricals, engines, cooling systems etc. are usually built by other companies independent of a particular ship’s requirements. These systems are designed to suit most ships. Similarly, the modern computing ecosystem is replete with reusable components, libraries and complete solutions. Some things we have already started taking for granted: for most web services, you usually don’t have to worry about managing a data center, building a storage system or solving networking problems. The range and scope of things that you don’t have to do is rapidly growing. For example, you don’t have to build a messaging or queue based communication system between components. SQS and SNS are existing solutions that you can use off the shelf. You don’t have to build an optimized key-value store; you can choose from a plethora of options such as MongoDB or DynamoDB. You don’t have to build a scalable cache; you can just use a solution like ElastiCache. Going forward, you may just be able to write a piece of code that responds to an event without worrying about any of the underlying complexity (see AWS Lambda). This is not intended to be a list of AWS services, but I use them as obvious examples.
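To give a feel for how little code “a piece of code that responds to an event” can be, here is a minimal sketch of a Python function in the shape AWS Lambda invokes: a function taking (event, context). It assumes the function is triggered by an SQS queue, in which case the event carries a list of “Records”, each with a string “body”. The “item” field inside the message body is a hypothetical payload invented for this example.

```python
import json


def handler(event, context):
    """Handle a batch of queue messages delivered as one event.

    The queue, the polling, the scaling and the servers are all handled
    by the platform; only the logic below is ours to write.
    """
    processed = []
    for record in event.get("Records", []):
        message = json.loads(record["body"])  # body arrives as a JSON string
        processed.append(message["item"])     # "item" is a hypothetical field
    return {"processed": len(processed), "items": processed}


# Usage: the handler is a plain function, so it can be exercised locally
# with a hand-built event before it is ever deployed.
sample_event = {"Records": [{"body": json.dumps({"item": "kayak"})}]}
result = handler(sample_event, None)
```

Everything unique to your solution lives inside the loop; everything else is someone else’s component.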
The point that I am trying to make is that a lot of design time is, and should be, spent on deciding on the unique and valuable parts of your solution. I will borrow from Guy Kawasaki’s “The art of getting customers”. He says that in today’s market, if you want your solution to be a successful one then it needs to be both valuable and unique. That is such a simple and powerful notion. So as a group of designers and developers, you need to maximize the time that you spend on your core valuable and unique solution. For anything that is not core, you need to be strongly biased towards existing components or systems that can be reused. That touches upon a key Amazon leadership principle as well, Invent and Simplify. Simplify. Invent. It is incredibly important to simplify your solution down to its very core. When I hear people say that their website’s value proposition is fast response and data retrieval, I start with the assumption that it’s not. Those are essential requirements that can be met by existing solutions. That is generally not a unique value proposition.
Don’t boil the ocean; Think big, build small
This write-up would be incomplete without touching upon the other end of the spectrum of software design: implementation. ‘Think big, build small’ is proving to be the most successful model of development. Design for 10 years but build for just a few months. Why? The answer is rooted in understanding that platforms, components and systems are continuously evolving. No modern software system exists in isolation. So if you don’t need a piece of functionality implemented right away, don’t implement it. Chances are that someone else will solve it by the time you need it and you can just reuse what they build. The ecosystem will continue to change as your solution is being built. It will either completely solve, greatly simplify or deprecate the implementation that you are thinking of. So unless it will be used by someone in the very near future, defer development. At any given point of time, investing your key resources in the most important, most immediately needed implementation while adhering to a bold, long term design is, in my experience, the best approach known today. That’s why being ‘agile’ is critical. A lot of developers that I talk to during interviews treat being agile as ‘being aware of constantly changing customer requirements’. That is only part of the picture though. It is equally important to be both externally and internally aware beyond the customer requirements. Knowing the ecosystem and constantly updating your assumptions about what is readily available, what can be reused, what the current strengths of the team are and how best to utilize these assets are strong aspects of agile development.
I will close this out with an anecdote. A team of 3 developers, one strong in web technologies, one in core computing fundamentals and one a mathematician, started developing a phone app that managed wish lists. They spent the first 3 months of implementation designing a data model and a custom data store to hold dynamic wish lists. Then the mathematician left the team. Another developer joined the team. He was a contributor to MongoDB. They ended up throwing away what they had built in the first 3 months and re-designed a much better data store using existing solutions in 1 week.
When building a ship, optimize for time, resources, the long term and a constantly changing reality.