Learning Notes on Designing Data-Intensive Applications (i)


This is a series of learning notes on Designing Data-Intensive Applications.

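A data-intensive application is typically built from standard building blocks. It usually needs to:

  • Store data so that it can be found again later (databases), e.g. SQL databases, MongoDB, Aurora, Dynamo
  • Remember the result of expensive operations to speed up reads (caches), e.g. Redis
  • Let users search data by keyword or filter it in various ways (search indexes), e.g. Elasticsearch
  • Send messages to other processes to be handled asynchronously (stream processing), e.g. message queues such as Apache Kafka
  • Periodically crunch a large amount of accumulated data (batch processing), e.g. Azure Batch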

An application has to meet functional requirements (e.g. allowing data to be stored, retrieved, searched, and processed in various ways) and nonfunctional requirements (e.g. security, reliability, compliance, scalability, compatibility, and maintainability). These notes focus on three of the nonfunctional ones:

  • Reliability. To work correctly even in the face of adversity.
  • Scalability. Reasonable ways of dealing with growth.
  • Maintainability. Be able to work on it productively.


Typical expectations for reliable software:

  • Application performs the expected function
  • Fault tolerant
  • Good performance
  • The system prevents abuse e.g. unauthorised access

Systems that anticipate faults and can cope with them are called fault-tolerant or resilient. A fault is usually defined as one component of the system deviating from its spec, whereas failure is when the system as a whole stops providing the required service to the user.

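Faults are impossible to eliminate entirely, so a system should be designed to tolerate them. It can even be worth triggering faults deliberately to check that the fault-tolerance machinery really works. Faults come from three main sources: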

  • Hardware faults

Until recently, redundancy of hardware components was sufficient for most applications. As data volumes and applications’ computing demands have increased, there has been a shift towards using software fault-tolerance techniques in preference to, or in addition to, hardware redundancy. Such systems also have an operational advantage: a single-server system needs planned downtime whenever the machine is rebooted (e.g. to apply operating system security patches), whereas a system that tolerates machine failure can be patched one node at a time, without downtime of the entire system (a rolling upgrade).
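To make the rolling upgrade idea concrete, here is a minimal sketch in Python. It only simulates the process: the `nodes` dictionary, the `rolling_upgrade` function and the version strings are hypothetical names made up for this example, and a real upgrade would also drain connections and run health checks.

```python
def rolling_upgrade(nodes, new_version):
    """Patch nodes one at a time and return the lowest number of nodes
    that were serving traffic at any point during the upgrade.

    `nodes` maps a node name to {"version": str, "serving": bool};
    this structure is hypothetical and only exists for this sketch.
    """
    fewest_serving = sum(1 for n in nodes.values() if n["serving"])
    for node in nodes.values():
        node["serving"] = False            # take the node out of rotation
        serving_now = sum(1 for n in nodes.values() if n["serving"])
        fewest_serving = min(fewest_serving, serving_now)
        node["version"] = new_version      # stand-in for patch + reboot
        node["serving"] = True             # put it back before touching the next one
    return fewest_serving


cluster = {
    "node-a": {"version": "1.0", "serving": True},
    "node-b": {"version": "1.0", "serving": True},
    "node-c": {"version": "1.0", "serving": True},
}
fewest = rolling_upgrade(cluster, "1.1")
print(f"fewest serving nodes during upgrade: {fewest} of {len(cluster)}")
```

With three nodes, two keep serving at every step. With a single node the same function returns 0, which is the planned downtime described above.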

  • Software errors

Software faults tend to be more correlated than hardware faults. A hardware fault usually affects a single machine, whereas a software bug can hit every node running the same code at the same time, so it tends to cause many more system failures than uncorrelated hardware faults.

  • Human errors

Humans are known to be unreliable, and configuration errors by operators are a leading cause of outages. You can make systems more reliable by:

  • Minimising the opportunities for error, e.g. with admin interfaces that make it easy to do the “right thing” and discourage the “wrong thing”.
  • Decoupling the places where people make the most mistakes (a sandbox) from the places where mistakes cause failures (production).
  • Testing thoroughly, including unit, integration, and manual tests.
  • Allowing quick and easy recovery from human error: make it fast to roll back configuration changes, roll out new code gradually, and provide tools to recompute data.
  • Setting up detailed and clear monitoring, such as performance metrics and error rates (telemetry).
  • Implementing good management practices and training.


Scalability is about how we cope with increased load. To do that, we need to describe load, describe performance, and then design the system to cope with the load.


Load is described depending on the system type — requests per minute, cache hits / minute, simultaneous users, etc.


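Performance also depends on the system type; examples are throughput and response time. There are two ways to investigate it. When a load parameter increases and resources stay the same, how is performance affected? And when a load parameter increases, how much do the resources need to increase to keep performance unchanged?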

Latency and response time: The response time is what the client sees, i.e. the time to process the request (the service time) plus network and queueing delays. Latency is the duration that a request spends waiting to be handled.

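Response time is best assessed with percentiles rather than the mean, because the mean does not tell you how many users actually experienced a given delay.

  • Median (50th percentile or p50): half of the requests are faster than this value and half are slower.
  • Percentiles 95th, 99th and 99.9th (p95, p99 and p999) are good for figuring out how bad your outliers are.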
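As a quick illustration of why percentiles say more than the mean, here is a minimal sketch that computes them with the nearest-rank method. The sample response times are made up for the example, and `percentile` is a helper written for this note, not a library function.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds; two requests are slow outliers.
response_times_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 900]

p50 = percentile(response_times_ms, 50)
p95 = percentile(response_times_ms, 95)
p99 = percentile(response_times_ms, 99)
mean = sum(response_times_ms) / len(response_times_ms)
print(f"p50={p50}ms p95={p95}ms p99={p99}ms mean={mean:.1f}ms")
```

Here the median is 13 ms while the mean is 125.6 ms, a value that no request actually experienced. The high percentiles (900 ms) expose the outliers directly.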

Amazon describes response time requirements for internal services in terms of the 99.9th percentile, because the customers with the slowest requests are often those who have the most data, which makes them the most valuable customers. Optimising for the 99.99th percentile, on the other hand, was deemed too expensive for the benefit it brings.

Service level objectives (SLOs) and service level agreements (SLAs) are contracts that define the expected performance and availability of a service. An SLA may state the median response time to be less than 200ms and a 99th percentile under 1s.

Queueing delays often account for a large part of the response time at high percentiles. Since a server can only process a small number of things in parallel, a few slow requests can hold up the processing of subsequent requests (head-of-line blocking), making them slow as well.
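Head-of-line blocking is easy to see in a toy simulation. The sketch below assumes the simplest possible server: a single worker that handles requests strictly in arrival order. The function name and the workload are invented for this example.

```python
def simulate_fifo(requests):
    """Simulate one worker that serves requests in arrival order.

    `requests` is a list of (arrival_ms, service_ms) tuples sorted by
    arrival time. Returns the response time (queueing + service) of
    each request, in the same order.
    """
    worker_free_at = 0
    response_times = []
    for arrival, service in requests:
        start = max(arrival, worker_free_at)   # wait while the worker is busy
        worker_free_at = start + service
        response_times.append(worker_free_at - arrival)
    return response_times


# Hypothetical workload: one slow request (500 ms of work) arrives first,
# followed by three fast ones (10 ms of work each).
requests = [(0, 500), (10, 10), (20, 10), (30, 10)]
print(simulate_fifo(requests))
```

Each fast request needs only 10 ms of work, yet the client sees 500 ms for all of them, because they sit in the queue behind the slow one. This is why response times should be measured on the client side.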

When serving an end-user request requires several backend calls in parallel, the end-user request is only as fast as the slowest of those calls. The more backend calls we make, the higher the chance that at least one of them is slow. This is known as tail latency amplification.
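A bit of arithmetic shows how quickly this adds up. The sketch below assumes that every backend call is slow independently with the same probability, which is a simplification (real calls often share the same overloaded dependency), and the 1% figure is only an example.

```python
def chance_of_slow_request(p_slow_call, n_calls):
    """Probability that at least one of n independent backend calls is slow,
    given that each call is slow with probability p_slow_call."""
    return 1 - (1 - p_slow_call) ** n_calls


# Assume 1% of backend calls are slow (i.e. slower than the backend's p99).
for n in (1, 10, 100):
    print(f"{n:>3} backend calls -> {chance_of_slow_request(0.01, n):.1%} of user requests are slow")
```

With a single call, 1% of user requests are slow. With 100 parallel calls, roughly 63% of user requests hit at least one slow call, even though each backend looks fine on its own.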

Approaches for coping with load

  • Scaling up (vertical scaling): moving to a more powerful machine.
  • Scaling out (horizontal scaling): distributing the load across multiple smaller machines.
  • Elastic scaling: the system automatically adds computing resources when it detects a load increase.
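An elastic system needs some rule for deciding how many machines to run. The sketch below uses a simple proportional rule: keep the load per node close to a target value. The function name, the parameters and the numbers are assumptions made for this example, not taken from the book or from any particular autoscaler.

```python
import math


def desired_nodes(current_nodes, load_per_node, target_load_per_node, max_nodes=20):
    """Return how many nodes are needed so that each node carries roughly
    target_load_per_node, clamped to the range [1, max_nodes]."""
    total_load = current_nodes * load_per_node
    wanted = math.ceil(total_load / target_load_per_node)
    return max(1, min(max_nodes, wanted))


# Hypothetical numbers: load is measured in requests per second per node.
print(desired_nodes(4, 900, 600))     # overloaded: scale out
print(desired_nodes(4, 300, 600))     # underused: scale in
print(desired_nodes(4, 10000, 600))   # spike: capped at max_nodes
```

The cap matters in practice: without it, a traffic spike or a bad metric could request an unbounded number of machines.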

Stateful vs stateless: Distributing stateless services across multiple machines is fairly straightforward. Taking stateful data systems from a single node to a distributed setup, however, can introduce a lot of complexity. For that reason, until recently it was common wisdom to keep your database on a single node (scaling up) until cost or availability requirements forced you to make it distributed.


Three design principles make software systems maintainable:

  • Operability. Make it easy for operation teams to keep the system running.
  • Simplicity. Make it easy for new engineers to understand the system by removing as much complexity as possible.
  • Evolvability. Make it easy for engineers to make changes to the system in the future.

Operability: A good operations team is responsible for:

  • Monitoring the health of the system and quickly restoring service if it goes into a bad state
  • Tracking down the cause of problems
  • Keeping software and platforms up to date
  • Anticipating future problems
  • Establishing good practices and tools for development
  • Preserving the organisation’s knowledge about the system

Data systems can do the following to make routine tasks easy:

  • Providing visibility into the runtime behavior and internals of the system, with good monitoring.
  • Providing good support for automation and integration with standard tools.
  • Providing good documentation and an easy-to-understand operational model.
  • Self-healing where appropriate, but also giving administrators manual control over the system state when needed.


Simplicity: Making a system simpler means removing accidental complexity, i.e. complexity that is not inherent in the problem that the software solves but arises only from the implementation.

One of the best tools for removing accidental complexity is abstraction that hides the implementation details behind clean and simple to understand facades.

Evolvability: System requirements change constantly and we must ensure that we’re able to deal with those changes.

  • Functional requirements: what the application should do
  • Nonfunctional requirements: general properties like security, reliability, compliance, scalability, compatibility and maintainability.

That’s all for this part!

Happy Reading!


