The System Design Template I Use – Aditya Rohilla

System Design Primer: https://github.com/donnemartin/system-design-primer

System / architecture design is an important part of any software engineering project. Right after gathering requirements for features and before diving into development, every project lead has to come up with a system design document illustrating what the overall system will look like and how it will interact with external services. This process is followed at almost all the big tech companies, including FAANG (or MAMAA now?) and others.

Today I am going to present a system design document outline which I personally use before working on any large-scale project. It is inspired by many senior engineers’ design documents / templates and will hopefully help you with your next software design.

The design usually has two parts, High Level Design and Low Level Design, but I will present a template which combines both.

Without further ado:

Overview (Optional)

Describe the purpose of the document and who it is for. Should readers know anything before diving in? Is there a target audience (technical vs. non-technical, etc.)?

Problem

This is an important part of the document.

Explain the problem at hand that needs a solution. Don’t discuss the background here; add that information to the Appendix. Share relevant information about the problem and link the wiki documents the reader can go to in order to understand more about the scenario. Avoid abbreviations and assume the reader has no prior knowledge of the problem.

Tenets (Optional)

A tenet is a principle or belief that helps align teams and bring everyone into agreement around critical questions. At Amazon, we add this section to go back to first principles while making a design decision. It is optional for small designs, though.

Requirements

Describe all the requirements that the problem imposes. What is in scope for the proposed solution? Preferably write from the end user’s perspective. What are they expecting from this solution? List all the requirements that you can think about for the project / product and get them verified by the stakeholders.

Sometimes it helps to add use cases:

As a retail website user, I want to be able to add a product to my cart.
As a financial analyst, I want the monthly report to be generated in under 5 minutes.
As an on-call engineer, I need a dashboard representing the health of the system by region.

Out of Scope

Requirements describe what is in scope for the project. However, in some cases it is worth explaining what is out of scope, to help the reader better understand the decision-making framework and to avoid unnecessary questions.

Success Criteria

Imagine the solution is in production already. How will you evaluate the success of the product? What data can you use?
For example:

Time-to-Market reduced by 90%.
The solution can scale to 10,000 users with 100ms P90 latency.
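
A criterion like the second one is only useful if it can be checked against measured data. Here is a minimal sketch of such a check, assuming latency samples in milliseconds have already been collected. The function name `percentile` and the sample values are made up for illustration, and the nearest-rank method used here is only one of several common percentile definitions.

```python
import math


def percentile(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical latency samples in milliseconds (illustrative only).
latencies_ms = [42, 55, 61, 70, 73, 80, 88, 95, 99, 250]

p90 = percentile(latencies_ms, 90)
meets_target = p90 <= 100  # the 100 ms P90 target from the example above
print(f"P90 = {p90} ms, meets target: {meets_target}")
```

Note that the single slow request (250 ms) does not break a P90 target, which is why it is worth stating which percentile a criterion refers to.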

Architecture

Describe the architecture of the solution with explanatory text, bullet points and diagrams. Try to avoid using too many lists or only diagrams with no text. A good design should balance a substantial amount of explanation with a few key diagrams and lists. Also mention why this design is the recommended solution for the problem at hand.

High-Level Overview (HLD)

Start with an overview of the high-level design. Make a list of system components. Focus on the logical components of the solution rather than on particular technologies.
Good example:

Data Ingestion service
Data storage
Web UI
Notification service
Export functionality

Bad example:

DynamoDB
ReactJS

Creating a diagram for a high-level overview is always a good idea.

API Design

List all the APIs through which users / services will interact with this product / service. Mention the payloads, verbs and versions for each API. How can this API model evolve in the future? How will customers interact with these APIs?

This section can evolve during implementation, so it is fine to leave some aspects for later.
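
One lightweight way to capture this list is a small table of endpoints kept next to the design. The sketch below is illustrative only: the `Endpoint` class, the paths and the payload field names are hypothetical, loosely based on the cart use case from the Requirements section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """One row of the API list: verb, version, path and payload fields."""
    verb: str
    version: str
    path: str
    request_fields: tuple = ()
    response_fields: tuple = ()

    def route(self) -> str:
        # Putting the version in the path is one common way to let an API evolve.
        return f"{self.verb} /{self.version}{self.path}"


# Hypothetical endpoints for a shopping cart service (all names are assumptions).
API = [
    Endpoint("POST", "v1", "/carts/{cart_id}/items",
             request_fields=("product_id", "quantity"),
             response_fields=("cart_id", "item_count")),
    Endpoint("GET", "v1", "/carts/{cart_id}",
             response_fields=("cart_id", "items", "total")),
]

for endpoint in API:
    print(endpoint.route(), "request:", endpoint.request_fields)
```

Keeping verbs, versions and payloads in one structure makes it easy to spot gaps, such as an endpoint with no documented response.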

Data Storage and Model

What data model will you be using? Which database is suitable for this? Evaluate how much data the system will be processing. Make a future growth forecast. Prove that the solution will scale to the needs of the business in 3-5 years.

Also think about data pipelines, data ingestion and pre-processing, the storage layer, etc.
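
A growth forecast can start as simple arithmetic. The following is a rough sketch that assumes compound growth at a constant yearly rate, which real systems rarely follow exactly. The function name and all input numbers are hypothetical.

```python
def forecast_storage_gb(current_gb, yearly_growth_rate, years):
    """Project storage per year, assuming compound growth at a constant rate."""
    if current_gb < 0 or years < 0:
        raise ValueError("current_gb and years must be non-negative")
    return [current_gb * (1 + yearly_growth_rate) ** year for year in range(years + 1)]


# Hypothetical inputs: 500 GB today, doubling every year, 5-year horizon.
projection = forecast_storage_gb(current_gb=500, yearly_growth_rate=1.0, years=5)
for year, size_gb in enumerate(projection):
    print(f"year {year}: {size_gb:,.0f} GB")
```

Even a crude projection like this shows whether the chosen database has to handle gigabytes or terabytes by the end of the planning horizon.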

Application / Component Level Design (LLD)

Dive deeper into the design of each individual component in the following sections. Add component, data flow and control flow diagrams, whichever are applicable.

Dependencies

Be explicit about the other systems you’re interacting with. Are they internal or external? Dependencies carry a large number of risks because you have no control over those systems. They require thorough analysis and risk mitigation.
Share your assumptions about dependencies. For instance: we assume Service A will be able to handle 50,000 TPS, and so on.
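
Writing an assumption down as data makes it easy to re-check when the numbers change. This is a minimal sketch: "Service A" and 50,000 TPS come from the example above, while the `DependencyAssumption` class, the `internal` flag value and the expected peak load are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencyAssumption:
    """A capacity assumption about a dependency, recorded so it can be checked."""
    name: str
    internal: bool
    assumed_max_tps: int

    def headroom(self, expected_peak_tps: int) -> float:
        # Fraction of the assumed capacity left over at our expected peak load.
        return 1 - expected_peak_tps / self.assumed_max_tps


# The peak load of 30,000 TPS is a made-up figure for illustration.
service_a = DependencyAssumption(name="Service A", internal=True, assumed_max_tps=50_000)
print(f"{service_a.name}: {service_a.headroom(expected_peak_tps=30_000):.0%} headroom")
```

A negative headroom would mean the design already expects more traffic than the dependency is assumed to handle.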

Design Alternatives Considered

Talk about all the different design alternatives you considered (combinations of different infrastructure platforms, databases, service frameworks, logical approaches, etc.), then state which one you recommend as best suited for this project and why.

A table of pros and cons, with columns like cost, scalability, ease of use, latency, maintainability and community support, is a good way to judge the best services and platforms to use for this design.
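
If the table becomes hard to compare by eye, one possible extension is a weighted score per alternative. The sketch below is not part of the template itself: the weights, option names and scores are all hypothetical, and a weighted sum is just one simple way to rank options.

```python
# Hypothetical weights per criterion (higher means more important).
WEIGHTS = {"cost": 3, "scalability": 5, "ease_of_use": 2, "maintainability": 4}

# Hypothetical scores from 1 (worst) to 5 (best) for two made-up options.
OPTIONS = {
    "Option A": {"cost": 4, "scalability": 3, "ease_of_use": 5, "maintainability": 4},
    "Option B": {"cost": 2, "scalability": 5, "ease_of_use": 3, "maintainability": 5},
}


def weighted_score(scores, weights):
    """Sum of score * weight over all criteria; every criterion must be scored."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(scores[criterion] * weight for criterion, weight in weights.items())


# Rank the options from highest to lowest weighted score.
ranking = sorted(OPTIONS, key=lambda name: weighted_score(OPTIONS[name], WEIGHTS), reverse=True)
for name in ranking:
    print(name, weighted_score(OPTIONS[name], WEIGHTS))
```

The score should support the written reasoning, not replace it; the weights themselves are a design decision worth stating in the document.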

Cost Analysis

Analyze how much the solution will cost in infrastructure. This section can be optional for smaller problems; however, as a general rule it is beneficial to spend some time calculating the impact in order to choose the right solution.

Plan for future growth.
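
A first estimate can be a back-of-the-envelope formula evaluated for today's load and for the expected future load. In this sketch the unit prices, traffic figures and the `monthly_cost` function are all hypothetical and are not taken from any real provider's price list.

```python
def monthly_cost(requests_per_month, storage_gb, price_per_million_requests, price_per_gb):
    """Estimate a monthly bill from request volume and stored data."""
    request_cost = requests_per_month / 1_000_000 * price_per_million_requests
    storage_cost = storage_gb * price_per_gb
    return request_cost + storage_cost


# Hypothetical unit prices (assumptions for illustration only).
PRICE_PER_MILLION_REQUESTS = 0.5
PRICE_PER_GB = 0.25

today = monthly_cost(200_000_000, 500, PRICE_PER_MILLION_REQUESTS, PRICE_PER_GB)
# Same formula with traffic and data both assumed to grow 4x.
later = monthly_cost(800_000_000, 2_000, PRICE_PER_MILLION_REQUESTS, PRICE_PER_GB)
print(f"today: ${today:,.2f} per month, after 4x growth: ${later:,.2f} per month")
```

Running the same formula for each design alternative gives a like-for-like cost column for the comparison table.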

Failure Modes

Consider what can go wrong in the system: dependency failure, traffic overflow, performance degradation, bugs in the business logic, etc. It is extremely important to know how the system can fail.

Non Functional Requirements

Think about the non-functional aspects of your software project – Scalability, Availability, Maintainability, Reliability, Latency, Security, etc. All of these are very important for the project in the long run.

Scalability

How many users are you expecting? How many transactions / queries? How much data?

Latency and Availability

What are the P50, P99, etc. numbers? Create an SLA (Service Level Agreement) with your end customers around these numbers.
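
Availability targets are easier to discuss once they are translated into allowed downtime. The following sketch does that arithmetic, assuming a 30-day period. The function name and the list of targets are illustrative choices, not values from any particular SLA.

```python
def allowed_downtime_minutes(availability_pct, period_days=30):
    """Minutes of downtime per period that still meet the availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


# Illustrative targets only; pick the ones your customers actually need.
for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows {minutes:.1f} minutes of downtime")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is useful context when agreeing on the SLA.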

Maintainability

How much maintenance will be required for each service / component? Who will manage the services in the long run?

Security

Mention the level of security required by this system and how the system will achieve it. The level of security depends on the kinds of data the system is storing / interacting with – user information, financial information, user connection details, etc.

Testing and Observability

Talk about the different testing, monitoring and alerting strategies which will be used before and after the product goes into production.

Testing (Optional)

This section describes how testing will be done (unit tests, integration tests, A/B tests, stress tests, etc.) and which tools will be used for it. Its content depends on team and company-wide practices.

Metrics and Alarms

Talk about the key metrics which will be tracked and how success will be measured. Have clear dashboards tracking the most important metrics. Mention the alarms tracking the SLAs and where the team will see them.

Concerns / Risks

Discuss the main risks and concerns which can affect the project. For example, these can be external dependencies or resource shortages.

Future Improvements (Optional)

This section talks about the features which users can expect in future releases. Talk about the different features you plan to roll out and what the timeline could look like.

FAQ (Optional)

This section is a great place to put questions that people ask regularly and that you don’t want to spend time talking through during a review. If the design is new and you don’t have a history of questions, try to anticipate what people might ask.

Appendix A: Subtitle (Optional)

This can contain deep technical analysis, data, detailed descriptions of investigations and so on: material that people might ask for but that not everyone needs to read during a review.

Glossary

MTTR – Mean Time To Resolution of a system outage.

References

Service 1: [link]

[1] Glossary – an alphabetical list of terms or words found in or relating to a specific subject, text, or dialect, with explanations; a brief dictionary.

This is the extensive system design outline that will help you think about all the aspects of your software project. Let me know if I missed something.