Learning Notes on Designing Data-Intensive Applications (i)

Chapter 1 Reliable, Scalable, and Maintainable Applications


This is a series of learning notes on Designing Data-Intensive Applications.

A data-intensive application is typically built from standard building blocks. They usually need to:

  • Store data so that it, or another application, can find it again later (databases), e.g. SQL databases, MongoDB, Aurora, Dynamo
  • Remember the result of an expensive operation, to speed up reads (caches)
  • Allow users to search data by keyword or filter it in various ways (search indexes)
  • Send a message to another process, to be handled asynchronously (stream processing)
  • Periodically crunch a large amount of accumulated data (batch processing)

An application has to meet functional requirements (e.g. allowing data to be stored, retrieved, searched, and processed in various ways) and nonfunctional requirements (e.g. security, reliability, compliance, scalability, compatibility, and maintainability).

  • Reliability. To work correctly even in the face of adversity (hardware or software faults, human error).
  • Scalability. To have reasonable ways of dealing with growth in data volume, traffic, or complexity.
  • Maintainability. To let many different people work on the system productively over time.


Typical expectations for software:

  • The application performs the function that the user expected
  • It can tolerate the user making mistakes or using the software in unexpected ways
  • Its performance is good enough for the required use case, under the expected load and data volume
  • It prevents unauthorized access and abuse

Systems that anticipate faults and can cope with them are called fault-tolerant or resilient. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.

Faults cannot be eliminated entirely, so you should build systems that tolerate them, and it can even be worth triggering faults deliberately to exercise the fault-tolerance machinery. Common classes of faults:

  • Hardware faults

Until recently, redundancy of hardware components was sufficient for most applications. As data volumes and applications’ computing demands have increased, there has been a shift towards software fault-tolerance techniques, used in preference to or in addition to hardware redundancy. One advantage of such systems: a single-server system requires planned downtime when the machine needs to be rebooted (e.g. to apply operating system security patches), whereas a system that can tolerate machine failure can be patched one node at a time, without downtime of the entire system (a rolling upgrade).
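The rolling-upgrade idea can be sketched as a loop that takes down at most one node at a time. This is a minimal illustration, not production code; the `patch` and `healthy` callbacks are hypothetical stand-ins for real patching and health-check steps.

```python
# Minimal sketch of a rolling upgrade: patch one node at a time so the
# cluster as a whole never loses more than one node. The patch/health-check
# callbacks are hypothetical placeholders, not a real orchestration API.

def rolling_upgrade(nodes, patch, healthy):
    patched = []
    for node in nodes:
        patch(node)               # e.g. apply the OS security patch and reboot
        if not healthy(node):     # verify the node is serving again...
            raise RuntimeError(f"{node} unhealthy, halting rollout")
        patched.append(node)      # ...before taking down the next one
    return patched

# Example: all nodes stay healthy, so the whole cluster gets patched in order.
order = rolling_upgrade(["node-1", "node-2", "node-3"],
                        patch=lambda n: None,
                        healthy=lambda n: True)
# order == ["node-1", "node-2", "node-3"]
```

Real orchestrators (e.g. Kubernetes rolling updates) add draining of in-flight requests and automatic rollback, but the invariant is the same: never take down more than one node at a time.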

  • Software errors

Software failures are more correlated than hardware failures: a fault in one node is more likely to cause many system failures than uncorrelated hardware faults are.

  • Human errors

Humans are known to be unreliable; configuration errors by operators are a leading cause of outages. You can make systems more reliable by:

  • Minimising the opportunities for error, e.g. with admin interfaces that make it easy to do the “right thing” and discourage the “wrong thing”
  • Decoupling the places where people make the most mistakes from the places where mistakes can cause failures, e.g. sandbox environments with real data
  • Testing thoroughly at all levels, from unit tests to whole-system integration tests
  • Allowing quick and easy recovery from human error, e.g. fast rollback of configuration changes
  • Setting up detailed and clear monitoring, such as performance metrics and error rates


Scalability is about how we cope with increased load. To reason about it, we need to describe load, describe performance, and then design approaches to cope with the load.


Load is described depending on the system type — requests per minute, cache hits / minute, simultaneous users, etc.


Performance also depends on the system type; examples are throughput and response time. A useful framing: when a load parameter increases, how much do resources need to increase to keep performance unchanged?

Latency and response time are not the same: the response time is what the client sees, while latency is the duration that a request is waiting to be handled.

Percentiles are the usual metric for assessing performance:

  • Median (50th percentile or p50): half of requests are served faster than this value, half slower
  • Higher percentiles (p95, p99, p999): the tail latencies experienced by the slowest requests

Amazon describes response time requirements for internal services in terms of the 99.9th percentile, because the customers with the slowest requests are often those who have the most data, i.e. the most valuable customers. Optimising for the 99.99th percentile, however, was judged too expensive.

Service level objectives (SLOs) and service level agreements (SLAs) define the expected performance and availability of a service, an SLA being the contractual form. An SLA may state that the median response time must be under 200 ms and the 99th percentile under 1 s.
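To make the percentile definitions concrete, here is a small sketch that computes p50 and p99 from response-time samples using the nearest-rank method, and checks them against an SLA like the one above. The sample data is made up for illustration.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p% of all samples are less than or equal to it."""
    ranked = sorted(samples)
    rank = math.ceil(len(ranked) * p / 100)   # 1-based nearest rank
    return ranked[rank - 1]

# Hypothetical response times in milliseconds for ten requests.
response_times_ms = [12, 35, 48, 50, 90, 120, 180, 210, 450, 980]

p50 = percentile(response_times_ms, 50)   # 90 ms: half the requests were faster
p99 = percentile(response_times_ms, 99)   # 980 ms: dominated by the slowest request

meets_sla = p50 < 200 and p99 < 1000      # median < 200 ms, p99 < 1 s
```

In production you would compute these over a sliding window of recent response times rather than a fixed list; data structures such as HdrHistogram exist for doing this efficiently.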

Queueing delays often account for a large part of the response time at high percentiles. Since a server can only process a limited number of requests in parallel, a few slow requests can hold up the processing of subsequent requests, an effect known as head-of-line blocking.

The latency of an end-user request is determined by the slowest of all the parallel backend calls it makes. The more backend calls we make, the higher the chance that at least one of them is slow. This is known as tail latency amplification.
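A back-of-the-envelope calculation shows how quickly this amplification kicks in. Assuming, purely for illustration, that each backend call is independently slow 1% of the time (i.e. at its p99), the chance that a user request touching n backends hits at least one slow call is 1 − 0.99ⁿ:

```python
def p_at_least_one_slow(n_calls, p_slow=0.01):
    """Probability that at least one of n independent backend calls is slow,
    given each call is slow with probability p_slow (1% here, i.e. its p99)."""
    return 1 - (1 - p_slow) ** n_calls

# With 1 backend call, 1% of user requests are slow; with 100 parallel
# calls, roughly 63% of user requests are affected by at least one slow call.
for n in (1, 10, 100):
    print(n, round(p_at_least_one_slow(n), 3))
```

So even if each service individually meets a tight p99, a request that fans out to a hundred of them will be slow most of the time, which is why tail latencies matter so much in service-oriented architectures.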

Approaches for coping with load

  • Scaling up (vertical scaling): moving to a more powerful machine
  • Scaling out (horizontal scaling): distributing the load across multiple smaller machines (a shared-nothing architecture)

Stateful vs stateless: scaling stateless services out is straightforward, but taking stateful data systems from a single node to a distributed setup can introduce a lot of complexity. Until recently, common wisdom was to keep your database on a single node until cost or availability requirements could no longer be satisfied.


The three design principles for maintainable software systems:

  • Operability. Make it easy for operations teams to keep the system running smoothly.
  • Simplicity. Make it easy for new engineers to understand the system, by removing complexity.
  • Evolvability. Make it easy to change the system in the future, adapting it to new requirements.

Operability: A good operations team is responsible for:

  • Monitoring the health of the system and quickly restoring service if it goes into a bad state

Data systems can make routine tasks easy, e.g. by:

  • Providing visibility into the runtime behavior and internals of the system, with good monitoring.


Making a system simpler means removing accidental complexity: complexity that is not inherent in the problem that the software solves.

One of the best tools for removing accidental complexity is abstraction, which hides implementation details behind clean, easy-to-understand facades.

Evolvability: System requirements change constantly and we must ensure that we’re able to deal with those changes.

  • Functional requirements: what the application should do
  • Nonfunctional requirements: general properties such as security, reliability, scalability, compatibility, and maintainability

That’s all for this chapter!

Happy Reading!