Learning Notes on Designing Data-Intensive Applications (ii)

Chapter 2 Data Models and Query Languages

Photo by The BlackRabbit on Unsplash

This is a series of learning notes on Designing Data-Intensive Applications.

Data models deeply affects how we think about the problem. Each layer of data provides abstractions for interactions with layers above it

SQL is the most popular data model. SQL was rooted in relational databases for business data processing in the 1960s and 1970s, and was used for transaction processing and batch processing. In SQL, data is organised into relations (tables), which is an unordered collection of tuples (rows).

Two of the other earlier competitors were the network model and the hierarchical model. NoSQL is the latest attempt to overthrow SQL’s dominance.

NoSQL Databases are in two forms:

  • Document databases: Targets use cases where data comes in self-contained documents and relationships between one document and another are rare.
  • Graph databases: These go in the opposite direction, they target use cases where anything is potentially related to everything.

With a SQL model, if data is stored in a relational tables, a translation layer is required, which is called impedance mismatch.

Strategies to deal with the mismatch:

  • Normalised databases with foreign keys.
  • Use a database that Supports for structured data (PostgreSQL)
  • Encode as JSON or XML and store as text in database. It can’t be queried this way.
  • Store as JSON in a document in document-oriented databases (MongoDB). This has a better locality than the normalised representation.

JSON model:

Reduces the impedance mismatch and the lack of schema is often cited as an advantage. JSON representation has better locality than the multi-table SQL schema, as all the relevant information is in one place.

Many-to-One and Many-to-Many Relationships

  • For enum-type strings, we can store a separate normalised ID to string table, and use the ID in other parts of the database.
  • Using ids reduces duplication.

The network model

Known as the CODASYL model, the network model allows multiple parents for each record. The lines between records were not foreign keys, but more like pointers (called access path). A query is done by moving a cursor through the database by iterating over lists of records and following access paths. Although it made the most efficient use of the hardware, the model made querying and updating the database complicated and inflexible.

The relational model

The relational model is simply a relation (table) and a collection of tuples (rows). In a relational database, the query optimizer automatically decides how to efficiently execute the query, where the developer has to decide the access path in the network model.

  • Document databases: better schema flexibility, better performance due to locality, and closer to data structures in applications.
  • Relational databases: better support for joins, and many-to-one and many-to-many relationships.

Which data model leads to simpler application code?

  • Records in document models are more difficult to directly access when they are deeply nested.
  • For highly interconnected data, document model is awkward, and the relational model is acceptable, and graph models are the most natural.

Schema flexibility in the document level

  • XML support in relational databases usually comes with schema validation while JSON support does not.
  • Document databases is schema-on-read, meaning schema is only interpreted when data is read, in contrast with schema-on-write for relational database, where validation occurs during write time.
  • An analogy to type checking is dynamic type checking (runtime) and static typing checking (compile-time). In general there is no right or wrong answer.
  • Schema-on-read is good if items in the collection don’t all have the same structure. Schema-on-write is good when all records are expected to have the same structure.

Data locality for queries

  • If the whole document (string of JSON, XML) is required by application often, there is a performance advantage to this storage locality (single lookup, no joins required). If only a small portion of the data is required, this can be wasteful.
  • It’s generally recommended to keep documents small and avoid writes that increases the size of documents.

Convergence of document and relational databases

  • A hybrid of relational and document models is probably the future.
  • SQL is a declarative query language. In an imperative language, you tell the computer to perform certain operations in order.
  • In a declarative query language you just specify the pattern of the data you want, but not how to achieve that goal.
  • Declarative languages often lend themselves to parallel execution while imperative code is very hard to parallelise across multiple cores because it specifies instructions that must be performed in a particular order.

Declarative Queries on the Web

  • CSS and XSL are declarative languages to specify styling in HTML, while changing styles directly through Javascript is imperative.

MapReduce Querying

  • MapReduce is a programming model for processing large amounts of data in bulk across many machines, and a limited form of MapReduce is supported by MongoDB and CouchDB.

If many-to-many relationships are very common in your application, it becomes more natural to start modelling your data as a graph. A graph consists of vertices (nodes or entities) and edges (relationships or arcs).

There are several ways of structuring and querying the data. The property graph model (implemented by Neo4j, Titan, and Infinite Graph) and the triple-store model (implemented by Datomic, AllegroGraph, and others). There are also three declarative query languages for graphs: Cypher, SPARQL, and Datalog.

Property graphs

Each vertex consists of:

  • Unique identifier
  • Outgoing edges
  • Incoming edges
  • Collection of properties (key-value pairs)

Each edge consists of:

  • Unique identifier
  • Vertex at which the edge starts (tail vertex)
  • Vertex at which the edge ends (head vertex)
  • Label to describe the kind of relationship between the two vertices
  • A collection of properties (key-value pairs)

Graphs provide a great deal of flexibility for data modelling. Graphs are good for evolvability.

Triple-stores and SPARQL

In a triple-store, all information is stored in the form of very simple three-part statements: subject, predicate, object (peg: Jim, likes, bananas). A triple is equivalent to a vertex in graph.

That’s it!

Happy Reading!