Note: personally I feel some background knowledge is recommended on how file systems work at the hardware level, and on popular file system structures, since storage engines take reference and inspiration from both.
Data Structures That Power Your Database
- Many databases internally use a log, which is an append-only data file.
- To retrieve data efficiently, we need an index, an additional structure derived from the primary data that affects only the performance of queries, not their results.
- Well-chosen indexes speed up queries but slow down writes. Therefore, databases don’t index everything by default and require developers to choose indexes manually.
Hash indexes
- Key-value stores are quite similar to the dictionary type (hash map or hash table).
- The simplest indexing strategy is to keep an in-memory hash map where every key is mapped to a byte offset in the data file. Whenever you append a new key-value pair to the file, you also update the hash map to record the offset of the data you just wrote (see the sketch at the end of this section).
- To avoid running out of disk space with this ever-growing log, we can break the log into segments (each segment has its own in-memory hash map) and perform compaction (keeping only the most recent value for each key) while segments get merged in a background thread. After that, the old segments can be deleted.
- To read a key, the most recent segment’s hash map is checked first; if the key is not present there, we check the second-most-recent segment, and so on.
- There are a few details during implementation:
- Deletes: append a special deletion record (a tombstone) to the data file; the data is then discarded during merging and compaction.
- Crash recovery: the in-memory hash maps are lost on restart, so they need to be snapshotted to disk to speed up recovery.
- Corrupt records: checksums are required for detecting and ignoring partially written records.
- Concurrency: writes have to be strictly sequential. Many implementations choose to have only one writer thread.
- Pros:
- Sequential writes are much faster than random writes, especially on magnetic spinning-disk drives.
- Concurrency and crash recovery are much simpler if segment files are append-only or immutable (only need the latest copy).
- Merging old segments prevents data files from becoming fragmented over time.
- Cons:
- Scalability: the hash table must fit in memory, which becomes a problem with a very large number of keys.
- Range queries are not efficient (each key has to be looked up individually).
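A minimal Python sketch of the idea, in the spirit of Bitcask: an append-only data file plus an in-memory map from key to byte offset. The names here (`HashIndexStore` and its methods, the file name) are made up for illustration, and the one-record-per-line format assumes values without newlines and keys without commas:

```python
import os

class HashIndexStore:
    """Append-only log with an in-memory hash index (Bitcask-style sketch)."""

    def __init__(self, path):
        self.index = {}               # key -> byte offset of its latest record
        self.f = open(path, "a+b")    # append-only data file

    def set(self, key: str, value: str):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()        # where the new record starts
        self.f.write(f"{key},{value}\n".encode())
        self.f.flush()
        self.index[key] = offset      # point the key at the newest record

    def get(self, key: str):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.f.seek(offset)           # one disk seek, then one read
        _, _, value = self.f.readline().decode().rstrip("\n").partition(",")
        return value

store = HashIndexStore("data.log")    # hypothetical file name
store.set("42", "San Francisco")
store.set("42", "San Jose")           # the old record stays in the file...
store.get("42")                       # -> "San Jose"; the index points at the new one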
SSTables and LSM-Trees
- The Sorted String Table (SSTable) requires each segment file to be sorted by key.
- Pros:
- Merging segments is simple and efficient, like the merge step of mergesort; when multiple segments contain the same key, only the value from the most recent segment is kept (see the merge sketch at the end of this list).
- Lookups no longer require the offset of every single key in memory: a sparse index with one key for every few kilobytes of segment file is usually sufficient.
- Read requests often need to scan over several adjacent key-value pairs, so it is possible to group the records into a block and compress it before writing it to disk. Each entry of the sparse index then points to the start of a compressed block. This saves disk space and reduces I/O bandwidth use.
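A sketch of the merge step, assuming each segment is an in-memory list of (key, value) pairs sorted by key (real compaction streams segment files from disk). Tagging each entry with its segment’s age makes `heapq.merge` emit the newest value first whenever keys collide:

```python
import heapq

def merge_segments(segments):
    """Merge sorted segments; segments[0] is the newest.
    Keeps only the most recent value for each key."""
    tagged = [
        [(key, age, value) for key, value in seg]   # age 0 = newest segment
        for age, seg in enumerate(segments)
    ]
    merged, last_key = [], object()
    for key, _age, value in heapq.merge(*tagged):
        if key != last_key:            # first entry for a key is the newest
            merged.append((key, value))
            last_key = key
    return merged

newer = [("apple", "3"), ("cherry", "9")]
older = [("apple", "1"), ("banana", "2")]
merge_segments([newer, older])
# -> [("apple", "3"), ("banana", "2"), ("cherry", "9")]
```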
Constructing and maintaining SSTables
- While maintaining a sorted structure on disk is possible (see B-trees below), it is much easier in memory: red-black trees or AVL trees can be used to keep entries sorted.
- When a write comes in, insert the entry to the in-memory data structure (memtable).
- When memtable gets bigger than some threshold, write the memtable to disk.
- For reads, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, and so on.
- Occasionally, run a merging and compaction process to combine segment files.
- To prevent recent writes from being lost in a crash, a separate append-only log is kept on disk, to which every write is immediately added. Every time the memtable is written out to an SSTable, the corresponding log can be discarded.
Systems that use this principle of merging and compacting sorted files are often called LSM storage engines (Log-Structured Merge-Tree); a toy sketch of the write/read path follows.
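In this sketch, plain dicts stand in for both the memtable (which a real engine keeps in a balanced tree) and the on-disk SSTables; tombstones, the recovery log, and background compaction are omitted:

```python
class SimpleLSM:
    """Toy LSM storage engine: memtable + sorted segments, newest first."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.segments = []                    # sorted segments, newest first
        self.memtable_limit = memtable_limit

    def set(self, key, value):
        self.memtable[key] = value            # writes only touch the memtable
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out in sorted order (an SSTable, in effect).
        self.segments.insert(0, dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:              # 1. check the memtable
            return self.memtable[key]
        for segment in self.segments:         # 2. then newest segment first
            if key in segment:
                return segment[key]
        return None                           # not found anywhere
```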
Performance optimisations
- A lookup can take a long time if the entry does not exist, as the memtable and all SSTables have to be checked. Bloom filters can be used to solve this issue: a Bloom filter is a memory-efficient data structure for approximating the contents of a set, and it can tell you when a key definitely does not appear in the database (see the sketch after this list).
- There are two major strategies to determine the order and timing of merging and compaction: size-tiered compaction (HBase, Cassandra) and leveled compaction (LevelDB, RocksDB, Cassandra).
- LSM-trees offer high write throughput and can efficiently index data much larger than memory.
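A minimal Bloom filter sketch (the sizes and the hash-salting scheme are arbitrary choices for illustration). A zero bit at any probed position proves the key was never added, so the engine can skip the SSTable scan; an all-ones answer only means "maybe present":

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)   # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive num_hashes bit positions by salting a single hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False is definitive; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))
```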
B-trees
- B-trees are the standard index implementation in almost all relational databases and most non-relational databases.
- Unlike log-structured indexes, which break the database down into variable-size segments, B-trees break the database down into fixed-size blocks or pages. Each page can be identified by its address on disk, which allows one page to refer to another. Pages are usually small, typically 4 KB, to resemble the underlying disk blocks.
- B-trees also keep key-value entries sorted by key, which allows quick value lookups and range queries.
- A root page contains the full range of keys (or references to pages covering the full range) and is where queries start. A leaf page contains the individual keys, each holding its value inline or a reference to where the value can be found. The branching factor is the number of references to child pages in one page of a B-tree (see the lookup sketch after this list).
- For update/add value:
- Update: the page containing the value is looked up, modified, and written back to disk.
- Add: the page whose range contains the key is looked up. If there is spare space in the page, the key-value entry is simply added; otherwise, the page is split into two half-full pages and the parent page is updated to account for the new subdivision of key ranges.
- This algorithm ensures a B-tree with n keys always remains balanced, with a depth of O(log n).
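A sketch of the lookup path only (insertion with page splits is more involved). Pages are plain Python objects here rather than 4 KB disk blocks; `bisect_right` picks the child whose key range covers the search key:

```python
import bisect

class BTreePage:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys          # sorted boundary keys
        self.children = children  # len(keys) + 1 child refs; None for a leaf
        self.values = values      # leaf pages: key -> value (or a reference)

    def lookup(self, key):
        if self.children is None:                # reached a leaf page
            return self.values.get(key)
        i = bisect.bisect_right(self.keys, key)  # child covering this key
        return self.children[i].lookup(key)

leaf1 = BTreePage(keys=[], values={"a": 1, "b": 2})
leaf2 = BTreePage(keys=[], values={"c": 3, "d": 4})
root = BTreePage(keys=["c"], children=[leaf1, leaf2])
root.lookup("d")   # -> 4, after descending root -> leaf2
```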
Making B-Trees reliable
- When updating or adding values, a B-tree overwrites pages in place, whereas an LSM-tree only appends. This is risky: if anything crashes mid-write, the index could be corrupted. A common solution is to include a write-ahead log (WAL), an append-only file to which every modification is written before being applied to the pages themselves. In case of failure, this log is used to restore the B-tree to a consistent state (see the sketch after this list).
- With concurrency, e.g. multiple threads accessing the tree while it is being modified, a thread may see the tree in an inconsistent state. The solution is latches (lightweight locks). This is less of an issue for LSM-trees, where merging happens in background threads without disturbing incoming queries.
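A minimal WAL sketch, assuming whole 4 KB pages are logged before being overwritten in place (real WALs log smaller records with checksums and sequence numbers). The invariant is simply: the log record must be durable before the page overwrite begins.

```python
import os

PAGE_SIZE = 4096

def write_page(pagefile, wal, page_no: int, data: bytes):
    """Log the intended page write, force it to disk, then overwrite."""
    wal.write(page_no.to_bytes(8, "big") + data)   # 1. append intent to log
    wal.flush()
    os.fsync(wal.fileno())                         # 2. make the log durable
    pagefile.seek(page_no * PAGE_SIZE)             # 3. only now overwrite
    pagefile.write(data)

def recover(pagefile, wal_path: str):
    """After a crash, re-apply every complete log record."""
    with open(wal_path, "rb") as wal:
        while True:
            header, data = wal.read(8), wal.read(PAGE_SIZE)
            if len(header) < 8 or len(data) < PAGE_SIZE:
                break                              # torn tail: stop replaying
            pagefile.seek(int.from_bytes(header, "big") * PAGE_SIZE)
            pagefile.write(data)
```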
B-tree optimisations
- Additional pointers can be added to the tree. E.g. a leaf page may have references to its sibling pages, which allows scanning keys in order without jumping back to parent pages.
- Copy-on-write schemes can be used instead of a WAL for crash recovery: a modified page is written to a new location, and parents are updated to point to it.
- Keys can be compressed by storing only enough information to act as boundaries between key ranges.
B-trees and LSM-trees
- LSM-trees break the database down into variable-size segments, typically several megabytes or more; B-trees break the database down into fixed-size blocks or pages, traditionally 4 KB.
- LSM-trees are typically faster for writes, whereas B-trees are faster for reads. Reads are typically slower on LSM-trees because they have to check several different data structures and SSTables at different stages of compaction.
Advantages of LSM-trees:
- LSM-trees are typically able to sustain higher write throughput for two main reasons: lower write amplification (one write to the database resulting in multiple writes to disk over its lifetime) and sequential rather than random writes. The more a storage engine writes to disk per database write, the fewer writes per second it can handle. A B-tree index must write every piece of data at least twice: once to the write-ahead log, and once to the tree page itself.
- LSM-trees can be compressed better, and thus often produce smaller files on disk than B-trees. B-trees tend to leave disk space unused due to fragmentation.
Downsides of LSM-trees:
- The compaction process can sometimes interfere with the performance of ongoing reads and writes, as they compete with compaction for disk resources.
- In a B-tree, each key exists in exactly one place in the index, which makes it easier to offer strong transactional semantics: transaction isolation can be implemented using locks on ranges of keys, attached directly to the tree.
Other indexing structures
Secondary indexes:
- Secondary indexes can be constructed from a key-value index. The main difference between primary and secondary indexes is that secondary index keys are not unique: either each index value maps to a list of matching row identifiers, or each key is made unique by appending a row identifier (the sketch below uses the first approach).
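A sketch with a made-up table; the index is just a key-value mapping where each value is a list of primary keys:

```python
from collections import defaultdict

rows = {  # primary key -> row
    1: {"name": "Ada",   "city": "London"},
    2: {"name": "Grace", "city": "NYC"},
    3: {"name": "Alan",  "city": "London"},
}

city_index = defaultdict(list)    # secondary index on the city column
for pk, row in rows.items():
    city_index[row["city"]].append(pk)

[rows[pk] for pk in city_index["London"]]   # both London rows
```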
Storing values within the index:
- In a key-value pair, the key is what queries search for, and the value can be either the actual data or a reference to the storage location (a heap file).
- Heap files: Using heap files is common for building multiple secondary indexes to reduce duplication as each index just references a location in the heap file.
- Clustered index: Sometimes, the extra hop to the heap file is too expensive for reads, and the indexed row is stored directly in the index.
- Covering index: A compromise between a non-clustered index and a clustered index is a covering index or index with included columns, where only some columns are stored within the index.
Multi-column indexes
- Multi-column indexes are created for querying rows using multiple columns of a table.
- Concatenated index: the most common type of multi-column index, built by concatenating fields together into one key.
- Multi-dimensional index: these are more general, e.g. querying geospatial data by a longitude range and a latitude range simultaneously. One option is to translate a two-dimensional location into a single number using a space-filling curve and then use a regular B-tree index (sketched below); specialized spatial indexes such as R-trees are also used.
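A sketch of one such space-filling curve, the Z-order (Morton) code: interleaving the bits of two quantised coordinates yields a single integer key for a regular B-tree, under which nearby points tend to stay near each other. The quantisation helper and bit widths are illustrative choices:

```python
def morton_code(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into one number (Z-order curve)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits -> even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits -> odd positions
    return z

def quantise(value: float, lo: float, hi: float, bits: int = 16) -> int:
    """Map a coordinate in [lo, hi] onto an integer grid."""
    return int((value - lo) / (hi - lo) * ((1 << bits) - 1))

# One sortable key for a (longitude, latitude) point:
morton_code(quantise(-122.4, -180, 180), quantise(37.8, -90, 90))
```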
Full-text search and fuzzy indexes:
- The indexes discussed so far don’t allow you to search for similar keys, such as misspelled words. Such fuzzy querying requires different techniques, e.g. a Levenshtein automaton, which supports efficient search for words within a given edit distance (see the sketch below).
- Lucene uses an SSTable-like structure for its term dictionary.
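For reference, the edit distance itself, computed with the classic dynamic programme (a Levenshtein automaton answers "is the distance to this word within d?" without computing the full table for every dictionary entry):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))        # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

levenshtein("kitten", "sitting")   # -> 3
```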
Keeping everything in memory
- Data on disk vs. data in RAM: disks have two significant advantages over main memory, durability and a lower cost per gigabyte.
- As RAM has become cheaper, in-memory databases have been developed; some, e.g. Memcached, are intended for caching use only, where losing data on restart is acceptable.
- In-memory databases are faster mainly because they avoid the overhead of encoding and decoding between in-memory structures and the binary format written to disk.
- In-memory databases can also provide data models that are hard to implement with disk-based indexes, such as priority queues and stacks.
Transaction processing or analytics?
Transaction: A group of reads and writes that form a logical unit.
OLAP: Online Analytics Processing. Refers to queries, generally performed by business analysts, that scan over a huge number of records and calculate aggregate statistics.
OLTP: Online Transaction Processing. Interactive queries which typically return a small number of records.
In the past, OLTP-type queries and OLAP-type queries were performed on the same databases. However, there has been a push for OLAP-type queries to be run on separate data warehouses, as OLAP has a very different access pattern: a query typically scans over a huge number of records, reads only a few columns per record, and calculates aggregate statistics.
Data warehousing
- A data warehouse contains a read-only copy of the data from OLTP systems (via periodic data dumps or a continuous stream of updates).
- The process of getting data from OLTP systems to data warehouses is called Extract-Transform-Load (ETL).
Stars and Snowflakes: Schemas for Analytics
- Star schema: at its centre is a fact table, where each row represents an event. The fact table uses foreign keys to refer to other tables (called dimension tables) for the who, what, where, when, how, and why of the event.
- Snowflake schema: further breaks dimensions down into sub-dimensions. Snowflake schemas are more normalized than star schemas.
- In a typical data warehouse, fact tables can be very wide (up to several hundred columns), and dimension tables can also be very wide.
Column-Oriented Storage
- Analytics queries often access millions of rows but only a few columns, which is a poor fit for OLTP-style row-oriented storage, where all the values from one row are stored next to each other.
- In column-oriented storage, all the values from each column are stored together. It relies on each column file containing the rows in the same order.
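The layout difference in miniature, with made-up data: the same table stored row by row versus column by column, where the k-th entry of every column belongs to row k:

```python
# Row-oriented: all values of one row are stored together.
rows = [
    ("2024-01-01", "apples", 3.0),
    ("2024-01-02", "pears",  2.0),
]

# Column-oriented: all values of one column are stored together,
# with every column file keeping the rows in the same order.
columns = {
    "date":    ["2024-01-01", "2024-01-02"],
    "product": ["apples", "pears"],
    "amount":  [3.0, 2.0],
}

sum(columns["amount"])   # an aggregate touching only one column "file"
```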
Column Compression
- Column-oriented storage often lends itself to good compression, since the number of distinct values in a column is normally small compared to the number of rows.
- Bitmap encoding: take a column with n distinct values and turn it into n separate bitmaps: one bitmap for each distinct value, with one bit for each row. The bit is 1 if the row has that value, and 0 if not.
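A sketch of bitmap encoding over a toy column (real engines would additionally run-length encode these bitmaps, since they are mostly zeros):

```python
def bitmap_encode(column):
    """One bitmap per distinct value, one bit per row."""
    bitmaps = {value: [0] * len(column) for value in set(column)}
    for row, value in enumerate(column):
        bitmaps[value][row] = 1
    return bitmaps

bitmap_encode(["DE", "US", "DE", "FR"])
# -> {"DE": [1, 0, 1, 0], "US": [0, 1, 0, 0], "FR": [0, 0, 0, 1]}
```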
Writing to Column-Oriented Storage
- Sorted columns optimise for read-heavy analytics queries, yet they make writes more difficult.
- An in-place update would require rewriting all the column files on every write to keep the rows aligned. Instead, an LSM-tree-like structure can be used: writes accumulate in memory, and once enough have built up, they are merged with the column files and written to new files in bulk.
Aggregation: Data Cubes and Materialised Views
- Materialised view: like a view in a relational database, but with the query results actually written to disk. It caches aggregate results to avoid recomputing expensive queries, at the cost of extra work on writes.
- Data cube or OLAP cube: a common special case of a materialised view where data is pre-aggregated over each dimension. This makes certain queries very fast because they are effectively precomputed, but offers little flexibility (see the sketch below).
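A tiny two-dimensional cube over made-up sales events, pre-summing the amount for every (date, product) cell plus the per-dimension and grand totals (with None standing for "all"):

```python
from collections import defaultdict

def build_cube(events):
    cube = defaultdict(float)
    for date, product, amount in events:
        cube[(date, product)] += amount   # individual cell
        cube[(date, None)] += amount      # total over all products
        cube[(None, product)] += amount   # total over all dates
        cube[(None, None)] += amount      # grand total
    return cube

sales = [("2024-01-01", "apples", 3.0),
         ("2024-01-01", "pears",  2.0),
         ("2024-01-02", "apples", 5.0)]
build_cube(sales)[(None, "apples")]   # all apple sales, precomputed: 8.0
```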
That’s a lot to take in!
Happy Reading!