Building A Log-Structured Merge-Tree: Part 2

November 11, 2020

Last time we looked at the memtable implementation. We’ll recall that the memtable is the first layer in an LSMT and the only layer to accept both reads and writes. As part of running the LSMT we’ll want to flush the in memory data to disk when the memtable reaches a certain size. The disk-backed layers that get flushed out from the memtable will be read only, so keep in mind that we only need to write once when flushing and users of the LSMT will only read the on disk data. In this post we’ll look at the mechanisms to flush the memtable to disk.

Static Sorted Table

A Static Sorted Table (henceforth SST) is a file containing sorted key value pairs. The pairs run sequentially through the file such that the key in the head position is sorted before the subsequent key and so forth up to the key in the tail position which is sorted behind all of the other pairs. In addition to storing key value pairs, there are various different schemes for storing additional metadata in the file that we’ll get into a little bit, but the only requirement for an SST is that it stores sorted key value pairs. The on disk persistence in LSMTs is comprised of a series of SSTs. When flushing the memtable to disk our goal will be to produce a series of SSTs that have all of the key value pairs that were present in the memtable.

Block Based SSTs

In this implementation we’ll be using block-based SSTs. The block scheme divides each SST into equal sized (size being measured in bytes) chunks. Included in the SST file will be metadata about the start and end keys for each of these chunks. If a process needs to point-read a key or find a range of keys, it can consult the metadata to find the relevant block(s). These blocks will be useful for a number of reasons further down the line, but for now know that it’s extra metadata we’ll store within each file.

Configuration

In to know how to perform the flush, or when to even trigger a flush, we’ll need to require some user provided configuration. At this point, we’ll need to know how large the memtable should get before attempting to flush to disk, as well as the size for each SST file and the size of the blocks in each file. For now we’ll ignore levels and approach that topic later, so for this blog we can just assume that the memtable will fit in one level. We’ll store the number of files and the block size in a Level configuration struct, but we’ll only have one Level for now. If you’re following along in the git history you can see the configuraiton here, but I’ll also copy it below.

type Level struct {
	blockSize uint64
	sstSize   uint64
}

type Options struct {
	levels              []Level
	path                string
	memtableMaximumSize uint64
	keyMaximumSize      uint64
	valueMaximumSize    uint64
}

Flush

The flush function can be found here. Starting with the signature of the function, we’ll take the Options and the Level configuration as well as an Iterator. In the previous post I mentioned that the Iterator interface would get used all over the place and as you can see we’re using it here. For now, we can assume that the Iterator given as an argument will be the memtable’s UnboundedIterator. This way we can iterate, in sorted order, over each key value pair in the memtable. The function will return a collection of the newly created SSTs if it succeeds, or return an error if does not succeed.

At a high level, we’re going to pull every key value pair from the Iterator and write it to a buffered file writer. We’ll store some running totals in order to know when we’ve hit block and/or file limits and then act accordingly for each of these scenarios. We’ll also store some metadata so that we can write it to each file when we’re ready to write it out. You’ll see this running data initialized at the top of the function.

// the current file
var f *os.File
// the buffered writer for the current file
var w *bufio.Writer

// the generated ssts
ssts := []*sst{}
// the block metadata for the current file
var blocks []*block
// the current pair pulled from the iterator
var currentBlock *block
// the previous pair pulled from the iterator (useful for storing the end of blocks)
var previousPair *common.Pair

// The total number of bytes written in the current file
bytesWritten := int64(0)
// The total number of bytes written in the current block
currentBlockSize := int64(0)

Once we start iterating over the pairs we’ll run through a few scenarios. First, we’ll see if the value is a Tombstone value. A Tombstone denote deletes so, for now, we’ll simply omit that key value pair when writing to disk which will effectively delete it. For now this works, but later on we’ll need to handle Tombstones differently.

Next, we’ll calculate how long the record will be in bytes. Each record is the size of the key and value added together plus 2 additional bytes since we’ll store the length of the key in 1 byte and the length of the value in 1 bytes. I don’t believe that this as space optimized as it could be, but for simplicitly’s sake we’ll implement it this way.

Next, we’ll check and see if the record size causes the SST size to be larger than the configured SST size. If this is the case, we’ll store the end of the current block (which will be the last block), we’ll store the metaoffset location for the metadata, and then we’ll write the metadata to buffer and close file. At this point we’ll be ready to start a new SST file.

Then, we’ll check and see if the file is nil. This effectively means that this is the first pair and we haven’t written anything at all yet, or the case in the previous paragraph was hit and we’re ready to open a new file. If this check succeeds then we’ll open a new file, open a new buffer for writing to the file, generate a new slice to store block metadata and append the newly created sst to our list of ssts.

Finally, we’ll check and see if the record causes us to go over the configured block size. If that’s the case we’ll simply store the end of the block and start a new block with our current key as the start of the block.

Once we’ve checked these cases we’ll write the length of the key into a byte, write the entire key, write the length of the value into a byte, write the entire value, and check to see if the iterator has any more key value pairs. Once we’re out of key value pairs we close the last file and write everything to disk. Provided that we’ve hit no errors along the way we return the newly created SSTs out of the function.

Next Up

Next time, we’ll look at implementing the public functions that make an end user would make use of: Get, Write, Delete, Iterator. From there we’ll iteratively add more features while keeping the public interface in place, beginning with managing multiple SSTs layers and causing a Flush to occur as the result of a memtable write.