Systems often consist of multiple databases, indexes, caches, and analytics stores, so they need mechanisms for moving data from one store to another. Systems that store and process data can be grouped into two broad categories:
- Systems of record: act as the source of truth; each fact is represented exactly once, and the data is usually normalised.
- Derived data systems: the result of transforming or processing data from another source; they can easily be recreated if lost, and are usually denormalised and redundant.
All systems can fit into three categories:
- Services (online): the system waits for requests, tries to handle each one as quickly as possible, and sends back a response. Response time is the primary measure of performance.
- Batch processing (offline): the system runs a job to process a large amount of data and produces output data. Jobs are often scheduled periodically and can take up to several days to finish.
- Stream processing (near-real-time): the system consumes unbounded input shortly after it becomes available, processes it, and produces output, with lower latency than batch processing.
Batch Processing with Unix Tools
Many data analyses can be done in a few minutes using a combination of Unix commands. For example, finding the five most requested URLs in an nginx access log:
```bash
cat /var/log/nginx/access.log |
  awk '{print $7}' |
  sort |
  uniq -c |
  sort -r -n |
  head -n 5
```
Compared with a custom program (e.g. in JavaScript) that aggregates counts in an in-memory hash table, the Unix pipeline handles datasets larger than memory: sort performs an external merge sort, writing sorted segments to disk and merging them at the end.
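For contrast, a minimal sketch of the in-memory approach might look like this (Python; the log path and field position simply mirror the pipeline above, and the counting table must fit in memory):

```python
from collections import Counter

# Hypothetical in-memory equivalent of the pipeline above: a hash table
# of URL -> count that must fit entirely in memory.
counts = Counter()
with open("/var/log/nginx/access.log") as log:
    for line in log:
        fields = line.split()
        if len(fields) > 6:
            counts[fields[6]] += 1       # awk's $7 is index 6 here

for url, count in counts.most_common(5):
    print(count, url)
```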
The Unix philosophy
- Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new “features”.
- Expect the output of every program to become the input to another, as yet unknown, program. Don’t clutter output with extraneous information. Avoid stringently columnar or binary input formats. Don’t insist on interactive input.
- Design and build software, even operating systems, to be tried early, ideally within weeks. Don’t hesitate to throw away the clumsy parts and rebuild them.
- Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you’ve finished using them.
The limitation of Unix tools is that they run only on a single machine; that's where tools like Hadoop come in.
MapReduce and Distributed File Systems
A MapReduce job can be distributed across thousands of machines; it reads and writes files on a distributed filesystem (e.g. HDFS). A job works in four steps:
- Read a set of input files, and break it up into records.
- Call the mapper function to extract a key and value from each input record.
- Sort all of the key-value pairs by key.
- Call the reducer function to iterate over the sorted key-value pairs.
It needs the client to provide:
- a mapper callback function, called once for every input record, whose job is to extract key-value pair(s) from the record;
- a reducer callback function, called with a key and an iterator that sequentially scans over all records with the same key.
The process of partitioning by reducer, sorting, and copying data partitions from mappers to reducers is known as the shuffle.
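To make these steps concrete, here is a toy single-process simulation of a MapReduce job (the URL-counting job, record format, and function names are illustrative assumptions, not from the original text):

```python
import itertools
from operator import itemgetter

def mapper(record):
    # Step 2: extract a key-value pair from each input record.
    yield record, 1                          # key = URL, value = 1

def reducer(key, values):
    # Step 4: called with a key and an iterator over all values for that key.
    yield key, sum(values)

def run_job(records, num_reducers=3):
    # Shuffle: assign each key-value pair to a reducer partition by key hash.
    partitions = [[] for _ in range(num_reducers)]
    for record in records:                   # Step 1: read input records
        for key, value in mapper(record):
            partitions[hash(key) % num_reducers].append((key, value))

    output = []
    for partition in partitions:
        partition.sort(key=itemgetter(0))    # Step 3: sort by key
        for key, group in itertools.groupby(partition, key=itemgetter(0)):
            output.extend(reducer(key, (v for _, v in group)))
    return output

print(run_job(["/home", "/about", "/home", "/home", "/contact"]))
```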
Note that MapReduce has no concept of indexes to help with joins; it reads the entire content of its input files.
MapReduce jobs can be chained together into workflows, where the output of one job becomes the input to the next job.
MapReduce can perform joins through several strategies (a sketch of one appears after this list):
- sort-merge joins,
- broadcast hash joins,
- partitioned hash joins
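As one illustration, a broadcast hash join loads the smaller input entirely into an in-memory hash table on each mapper and streams the larger input past it. The sketch below is a single-process approximation with made-up user/event data, not a definitive implementation:

```python
# Broadcast hash join sketch: the small input fits in memory and is
# "broadcast" to every mapper; the large input is streamed through.

def broadcast_hash_join(small_rows, large_rows, key):
    lookup = {row[key]: row for row in small_rows}   # must fit in memory
    for row in large_rows:
        match = lookup.get(row[key])
        if match is not None:
            yield {**match, **row}                   # joined record

users = [{"user_id": 1, "name": "Alice"}, {"user_id": 2, "name": "Bob"}]
events = [{"user_id": 1, "url": "/home"}, {"user_id": 1, "url": "/about"}]

for joined in broadcast_hash_join(users, events, "user_id"):
    print(joined)
```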
By treating inputs as immutable and avoiding side effects (such as writing to external databases), batch jobs not only achieve good performance but also become much easier to maintain.
Beyond MapReduce
MapReduce has poor performance for some kinds of processing. The files on the distributed filesystem are simply intermediate state: a means of passing data from one job to the next. The process of writing out the intermediate state to files is called materialisation.
MapReduce’s approach of fully materialising state has some downsides compared to Unix pipes:
- A MapReduce job can only start when all tasks in the preceding jobs have completed, whereas processes connected by a Unix pipe are started at the same time.
- Mappers are often redundant: they just read back the same file that was just written by a reducer.
- Files are replicated across several nodes, which is often overkill for such temporary data.
As a result, dataflow engines (e.g. Spark, Tez, Flink) handle the entire workflow as one job, rather than breaking it into small independent sub-jobs. They also provide more flexible callback functions (operators), rather than only map and reduce.
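As a rough sketch of what this looks like in practice, the earlier top-URLs analysis could be expressed as a single Spark job (PySpark RDD API; the input path and field position are assumptions for illustration):

```python
from pyspark import SparkContext

sc = SparkContext(appName="top-urls")

# The whole pipeline runs as one job: intermediate results flow between
# operators without being fully materialised as files on HDFS.
top5 = (sc.textFile("hdfs:///logs/access.log")
          .map(lambda line: line.split())
          .filter(lambda fields: len(fields) > 6)
          .map(lambda fields: (fields[6], 1))       # URL -> 1
          .reduceByKey(lambda a, b: a + b)          # sum counts per URL
          .sortBy(lambda kv: kv[1], ascending=False)
          .take(5))

print(top5)
```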
Happy Reading!