Building A Log-Structured Merge-Tree: Part 1

September 23, 2020

For an introduction to log-structured merge-trees see Using Log-Structured Merge-Trees

After finishing my previous side project I started thinking about trying to work on something more involved that would take more time and ultimately produce something maybe a bit more interesting. However, I’m now a father so my dedicated stare at the computer time comes in 2ish hour increments at best. This meant that I also needed something with small pieces of functionality that I could pick up and set down, sometimes with breaks for long periods of time, and still make progress in each session. I’ve had this type of experience two times before with side projects. Both of those project were notably lisp interpreters although one was significantly more ambitious than the other.

I decided to start working on a LSMT implementation with the catch that I was going to meander my way through it by bumbling about and hopefully orienting myself in the right direction. What do I mean by this? I was going to try and work off the knowledge I already had and not refer to things like the RocksDB source code. This meant that I would probably have bad and subtly broken components, but through the power of clean interfaces, tests, thoughtful abstraction, and generally moving in the right direction (hopefully) I’d get to something passable. Why take this route? I don’t know it seems more fun than knowing how to do things the right way™.

For those who want to follow along with the codebase the code is here.

Language Choice And Tooling

I decided to use Go for no particularly good reason. My impression is that people generally seem to either love it or hate it and I wanted to see what it was all about.

The only really thought out take I have so far is that gofmt is perfect. Linters that only tell you what’s wrong with code formatting are making useless toil. Computers are very good at executing algorithms which means that reformatting code is something a computer can do well. Furthermore, formatters and linters that require configuration are creating unnecessary decisions. gofmt ships working out of the box without need for any opinions. gofmt simply reformats the code allowing us to all move on from the bear trap of code formatting discourse unscathed.

Part 1: Memtable

The initial implementation can be found in this commit

The memtable was the first component I wrote which, for better or for worse, I didn’t know how to implement “correctly”. At a high level, I knew that it should support a 1 writer, n readers concurrency model since that’s what log-structured merge-trees typically follow. I didn’t really want to implement locking mechanisms since I frankly didn’t know how to do it right. I thought that working with immutable data would make the n readers requirement much easier to deal with. For this reason I decided to implement the underlying memtable as a persistent red-black tree.

Clojure has a persistent red-black tree based on a paper by Okasaki, Kahrs, Larsen et al that I can’t find a link to. I only know about the implementation for clojure’s sorted map because I reimplemented it in clojure as part of a clojure-in-clojure bootstrapping project that ultimately didn’t materialize. None the less, this experience lead me to reuse a persistent red-black tree here and most of the code linked above is another reimplementation, but this time in Go.

Reimplementing a known algorithm isn’t super interesting, so I’ll try and highlight some other things that went into the codebase in this commit. First, the comparison function was implemented. This function will compare two byte slices lexicographically which is the sorting that the lsmt will use. I was a little bit surprised to find that Go doesn’t ship with a Comparable-esque interface, but it doesn’t really matter for this use case since it’ll always be comparing byte slices. Having the comparison function rely on the byte slice concrete type is fine.

Second, this commit contains what will become an important interface in the project. You’ll notice that the Memtable has a corresponding Iterator which has a Get and Next. Later on this interface gets cleaned up but it’s important to see it change and get codified into the project because it ends up being used everywhere and in more interesting ways. More on that as this blog series (hopefully) moves along.

Third, there is some early sketching on what the top level lsmt will look like. In this commit it was decided that it will contain one active memtable and a collection of inactive memtables. The idea here is that only one memtable will take writes, but while flushing is ongoing the inactive memtables will still be read from until the flushing operation runs to completion. Conceptually the in-limbo read only memtables will act as a 0th level. Maybe, or that might be a weird way to phrase it. Think of in a way that works for you but that’s what we’re aiming for.

With this first commit the project had a working in memory sorted tree (which we can have some confidence in due to handrolled random testing). The public interface ended up being a Get function that takes a single key, a Write function that takes a key and a value, and an Iterator function that takes a lower bound and an upper bound.

func (memtable *Memtable) Get(key []byte) ([]byte, bool) { 
  # ...
}

func (memtable *Memtable) Write(key, value []byte) {
  # ...
}

func (memtable *Memtable) Iterator(start, end []byte) Iterator {
  # ...
}

The container struct contains a single pointer to a given iteration of the underlying persistent red-black tree. Each time a write occurs the pointer will be given a new instance to refer to thus insuring that any outstanding iterators will be able to continue to use any older versions. Also, since this is using a 1 writer model we don’t have to worry about multiple writers trying to update this pointer at the same time. The interface and implementation felt good enough to consider this a stand alone component that I could continue to build on top of.

Next Up

Next post we’ll look at the code that takes a memtable and flushes it disk. Of course, we’ll also want to read the values back from disk so we’ll look at that as well.