Operating Systems 101: Persistence — Log-structured File System, and Flash-based SSD(vi)
This is a course note of an online course I took relating to the Operating System on educative.io.
Log-structured file system
LFS, short for the Log-structured File System. When writing to disk, LFS first buffers all updates in an in-memory segment. When the segment is full, it is written to disk in one long, sequential transfer to an unused part of the disk. LFS never overwrites existing data, but rather always writes segments to free locations. Because segments are large, the disk (or RAID) is used efficiently, and the performance of the file system approaches its zenith. Later, LFS then reclaims that old space through cleaning. This approach is called copy-on-write.
To ease the inode finding, LFS uses a level of indirection between inode numbers and the inodes through a data structure called the inode map (imap). The imap is a structure that takes an inode number as input and produces the disk address of the most recent version of the inode.
LFS divides the disk into segments, only one of which is active at any one time. Each segment has a header called a summary block. Each summary block contains a pointer to the next summary block, linking segments into one long chain that LFS treats as a linear log. The segments do not necessarily have to be adjacent to each other on disk; for this reason, larger segment sizes (between 384KB and 1MB) are recommended because they amortise the cost of seeking between segments.
Update Process
Whenever a file or directory is changed, LFS writes to the head of this log:
- Any changed or new data blocks.
- Indirect blocks updated to point to (1).
- Inodes updated to point to (2).
- Inode map blocks updated to point at (3).
Unlike UFS, inodes in LFS do not have fixed locations. An inode map — a flat list of inode block locations — is used to track them.
When a segment is filled, LFS goes on to fill the next free or clean segment. Segments are said to be dirty if they contain live blocks, or blocks for which no newer copies exist further ahead in the log. The LFS garbage collector turns dirty segments into clean ones by copying live blocks from the dirty segment into the current segment, freeing up the old ones for writing.
At a checkpoint (usually scheduled once every 30 seconds), LFS writes the last known block locations of the inode map and the number of the current segment to a checkpoint region at a fixed place on disk. Once written, a checkpoint represents the last consistent snapshot of the file system. Recovery after a crash and normal mounting work the same way — the file system simply reconstructs its state from the last checkpoint and resumes logging from there.
During crash, LFS tries to rebuild many of those segments through a technique known as roll forward.
Roll forward: The Rollforward redoes the changes made by a transaction, after the committed transaction and over-writes the changed value once again to ensure consistency.
Roll back: The Rollback transaction is a transaction which rolls back the transaction to the beginning of the transaction (Rollback Transaction_name). It is possible to use before Commit transaction.
Flash-based SSD
Flash (NAND-based flash): is a type of non-volatile storage technology that does not require power to retain data. In order to write to a given chunk of it ( a flash page), first have to erase a bigger chunk ( a flash block). Also writing too often to a page will cause it to wear out.
Solid-state storage : such devices have no mechanical or moving parts like hard drives; rather, it is a storage drive composed entirely of memory chips(transistors). However, unlike typical random-access memory (e.g., DRAM), such a SSD retains information despite power loss and thus is an ideal candidate for use in persistent storage of data. Most SSDs are built of Flash.
Transistor: is a semiconductor device used to amplify or switch electronic signals and electrical power. Transistors are one of the basic building blocks of modern electronics. It is composed of semiconductor material usually with at least three terminals for connection to an external circuit.
Low-level operations
- Read (a page): A client of the flash chip can read any page (e.g., 2KB or 4KB), with the read command and page number.
- Erase (a block): Before writing to a page within a flash, the chip will erase the entire block the page lies within.
- Program (a page): Once a block has been erased, the program command can be used to write the desired contents of a page to the flash.
Concerns
- Wear out: when a flash block is erased and programmed, t becomes increasingly difficult to differentiate between a 0 and a 1.
- Disturbance: when accessing a particular page within a flash, it is possible that some bits get flipped in neighbouring pages, known as read disturbs or program disturbs, depending on read or programmed.
From Raw Flash to Flash-Based SSDs
What does an SSD contain:
- flash chips (for persistent storage)
- volatile (i.e., non-persistent) memory (e.g., SRAM)
- caching and buffering of data as well as for mapping tables
- control logic to orchestrate device operation
Flash translation layer: takes read and write requests on logical blocks (that comprise the device interface) and turns them into low-level read, erase, and program commands on the underlying physical blocks and physical pages (that comprise the actual flash device).
That’s so much of it!
Happy Reading!