Learning Notes on Designing Data-Intensive Applications (xi)
Chapter 11. Stream Processing
This is a series of learning notes on Designing Data-Intensive Applications.
Transmitting event streams
A stream is usually a sequence of events: small, self-contained, immutable objects, each carrying a timestamp. Related events are usually grouped together into a topic.
An event is generated once by a producer (publisher or sender), and then processed by consumers (subscribers or recipients).
Within the publish/subscribe model, we can differentiate the systems by asking two questions:
- What happens if the producers send messages faster than the consumers can process them? The system can drop messages, buffer the messages in a queue, or apply backpressure (flow control, blocking the producer from sending more messages).
- What happens if nodes crash or temporarily go offline, are any messages lost? Durability may require some combination of writing to disk and/or replication.
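Two of these strategies can be sketched with a bounded in-memory queue (the names and buffer size below are illustrative, not a real broker API): a full queue can either block the producer (backpressure) or discard the message (dropping).

```python
import queue

# A bounded buffer between producer and consumer (size is illustrative).
buf = queue.Queue(maxsize=2)

def send_blocking(msg):
    """Backpressure: block the producer until the consumer frees a slot."""
    buf.put(msg)  # blocks while the queue is full

def send_or_drop(msg):
    """Dropping: discard the message when the buffer is full."""
    try:
        buf.put_nowait(msg)
        return True
    except queue.Full:
        return False
```

Buffering in an unbounded queue is the third option; in practice the queue lives on the broker rather than in the producer's process.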
A number of messaging systems use direct communication between producers and consumers without intermediary nodes. These direct messaging systems require the application code to be aware of the possibility of message loss.
An alternative is to send messages via a message broker (or message queue).
By centralising the data, these systems can easily tolerate clients that come and go, and the question of durability is moved to the broker instead. Some brokers only keep messages in memory, while others write them to disk.
Some brokers can even participate in two-phase commit protocols using XA and JTA. This makes them similar to databases, aside from some differences:
- Most message brokers automatically delete a message when it has been successfully delivered to its consumers. This makes them not suitable for long-term storage.
- Most message brokers assume that their working set is fairly small. If the broker needs to buffer a lot of messages, each individual message takes longer to process, and the overall throughput may degrade.
- Message brokers often support some way of subscribing to a subset of topics matching some pattern.
- Message brokers do not support arbitrary queries, but they do notify clients when data changes.
When multiple consumers read messages in the same topic, two main patterns are used:
- Load balancing: Each message is delivered to one of the consumers. The broker may assign messages to consumers arbitrarily.
- Fan-out: Each message is delivered to all of the consumers.
In order to ensure that the message is not lost, message brokers use acknowledgements: a client must explicitly tell the broker when it has finished processing a message so that the broker can remove it from the queue.
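A minimal sketch of this acknowledgement protocol (the `AckQueue` class and its method names are made up for illustration, not a real client API): the broker holds on to delivered-but-unacknowledged messages so it can redeliver them if the consumer crashes before acknowledging.

```python
import collections
import itertools

class AckQueue:
    """Minimal AMQP/JMS-style queue: messages are kept until acknowledged."""

    def __init__(self):
        self._queue = collections.deque()
        self._pending = {}              # delivery tag -> message awaiting ack
        self._tags = itertools.count()

    def publish(self, msg):
        self._queue.append(msg)

    def deliver(self):
        """Hand the next message to a consumer; keep it until acked."""
        msg = self._queue.popleft()
        tag = next(self._tags)
        self._pending[tag] = msg
        return tag, msg

    def ack(self, tag):
        """Consumer finished processing: the broker may now delete the message."""
        del self._pending[tag]

    def requeue_unacked(self):
        """If the consumer crashed, put its unacked messages back on the queue."""
        self._queue.extend(self._pending.values())
        self._pending.clear()
```

Note that when redelivery is combined with load balancing, a redelivered message may go to a different consumer than the original one, so messages can be processed out of order.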
A key feature of batch processing is that you can run jobs repeatedly without the risk of damaging the input. This is not the case with AMQP/JMS-style messaging: receiving a message is destructive if the acknowledgement causes it to be deleted from the broker. This is where the idea behind log-based message brokers comes from.
A log is simply an append-only sequence of records on disk. The same structure can be used to implement a message broker: a producer sends a message by appending it to the end of the log, and a consumer receives messages by reading the log sequentially. If a consumer reaches the end of the log, it waits for a notification that a new message has been appended.
To scale to higher throughput than a single disk can offer, the log can be partitioned. Within each partition, the broker assigns a monotonically increasing sequence number, or offset, to every message.
Apache Kafka, Amazon Kinesis Streams, and Twitter’s DistributedLog are log-based message brokers that work like this.
The log-based approach trivially supports fan-out messaging, as several consumers can independently read the log without affecting each other.
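The partitioned-log idea can be sketched as follows (a toy in-memory model, not how Kafka is implemented; partitioning by hash of a key is one common assumption, and it keeps per-key ordering within a partition):

```python
class LogBroker:
    """Toy log-based broker: one append-only list per partition.
    Each consumer tracks its own offset, so reads never destroy messages."""

    def __init__(self, num_partitions=2):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, msg):
        # Same key -> same partition, so events for one key stay ordered.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(msg)
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def read(self, partition, offset):
        """Return all messages at or after `offset`; the caller (not the
        broker) remembers how far it has read."""
        return self.partitions[partition][offset:]
```

Because each consumer only advances its own offset, a second consumer can read the same partition from offset 0 and see every message: that is the fan-out pattern for free.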
Databases and streams
A replication log is a stream of database write events, produced by the leader as it processes transactions. Followers apply that stream of writes to their own copy of the database and thus end up with an accurate copy of the same data. The same idea underlies change data capture (CDC): observing all data changes written to a database and extracting them in a form that other systems can consume.
There are some parallels between the ideas we’ve discussed here and event sourcing.
Applications that use event sourcing need to take the log of events and transform it into application state that is suitable for showing to a user; they typically also have some mechanism for storing snapshots of that state, so they don't have to replay the full log every time.
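A minimal sketch of this replay-plus-snapshot idea (the bank-account events and function names are hypothetical): current state is a fold over the event log, and a snapshot records the state at some offset so only later events need replaying.

```python
def apply(state, event):
    """Fold one event into the current state (a running balance)."""
    kind, amount = event
    return state + amount if kind == "deposit" else state - amount

def rebuild(events, snapshot=(0, 0)):
    """Start from a snapshot (state, offset) and replay only later events."""
    state, offset = snapshot
    for event in events[offset:]:
        state = apply(state, event)
    return state
```

Starting from the snapshot `(state_after_first_n_events, n)` must give the same result as replaying the whole log from the beginning; the snapshot is purely an optimisation.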
Three types of joins may appear in stream processors:
- Stream-stream joins: Both input streams consist of events, and the join operator searches for related events that occur within some window of time.
- Stream-table joins: One input stream consists of events, while the other is a database change log. The change log keeps a local copy of the database up to date. For each activity event, the join operator queries the database and outputs an enriched activity event.
- Table-table joins: Both input streams are database change logs. In this case, every change on one side is joined with the latest state of the other side. The result is a stream of changes to the materialized view of the join between the two tables.
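The stream-table case can be sketched like this (a hypothetical users-and-clicks example; the table and event shapes are made up): one function consumes the change log to keep a local replica of the table, and another enriches each activity event from that replica.

```python
# Local replica of the users table, maintained from its change log.
users = {}

def on_change(change):
    """Apply one change-log entry: (user_id, new_row) or (user_id, None)."""
    user_id, row = change
    if row is None:
        users.pop(user_id, None)   # deletion in the change log
    else:
        users[user_id] = row       # insert or update

def enrich(event):
    """Stream-table join: attach the latest user state to an activity event."""
    return {**event, "user": users.get(event["user_id"])}
```

Querying the local replica instead of the remote database keeps the join fast, at the cost of the replica lagging slightly behind the source table.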
Batch processing frameworks can tolerate faults fairly easily: if a task in a MapReduce job fails, it can simply be started again on another machine, since the input files are immutable and the output is written to a separate file.
With stream processing, waiting until a task is finished before making its output visible is not an option, because a stream is infinite. Two approaches are used instead:
- Microbatching: break the stream into small blocks, and treat each block like a miniature batch process.
- Checkpointing: generate rolling checkpoints of state and write them to durable storage. If a stream operator crashes, it can restart from its most recent checkpoint.
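Checkpointing can be sketched as follows (a toy running-sum operator; the file path, checkpoint interval, and function names are all made up): every few events, the operator persists `(offset, state)`, so after a crash it resumes from the last checkpoint instead of replaying the whole stream.

```python
import json
import os
import tempfile

# Hypothetical checkpoint location; real systems use durable replicated storage.
CKPT = os.path.join(tempfile.gettempdir(), "ckpt.json")

def save(offset, state):
    with open(CKPT, "w") as f:
        json.dump({"offset": offset, "state": state}, f)

def load():
    """Return (state, offset) from the last checkpoint, or (0, 0) if none."""
    try:
        with open(CKPT) as f:
            c = json.load(f)
        return c["state"], c["offset"]
    except FileNotFoundError:
        return 0, 0

def process(stream, every=2):
    """Running sum over the stream, checkpointing every `every` events."""
    state, offset = load()
    for i, event in enumerate(stream[offset:], start=offset):
        state += event
        if (i + 1) % every == 0:
            save(i + 1, state)          # rolling checkpoint
    return state
```

Note that events after the last checkpoint are reprocessed on restart, so on its own this gives at-least-once processing; exactly-once semantics need the operator's effects to be idempotent or transactional as well.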