Tuesday, May 19, 2015

Stonebraker on Databases

I recently listened to a podcast on Software Engineering Radio with database expert Michael Stonebraker (at the recommendation of a friend; thanks, Larry). It's a few years old, but still quite relevant.

As usual, I relate much of what I read and hear about to Suneido. In this case it was interesting to see how Suneido's database held up to Stonebraker's criticism of current conventional relational databases.

He talks about profiling a database and finding that 90% of the time was spent on "overhead". That overhead fell into four main areas:

buffer pool management
Keeping a cache of database pages, tracking page replacement, and converting from external (disk) record format to internal (in-memory) format. Since most transactional (OLTP) databases now fit in main memory, this work is even more wasteful.

record locking
Most conventional relational databases use pessimistic row-level read and write locks to ensure that transactions are atomic, consistent, and isolated (the A, C, and I of ACID). Managing locks and waiting to acquire locks can be slow.

thread locking
Most databases are multi-threaded which, in most cases, means they use locking to protect shared data structures. Locking limits parallelism.

crash recovery write-ahead logs
For durability (the last part of ACID) databases commonly use a write-ahead log where changes are written (and in theory flushed to disk) prior to updating the actual database. In case of a crash, the log can be used for recovery. But writing the log is slow.

So how does Suneido do in these areas?

Instead of a buffer pool, the database is memory mapped. This handles databases that fit into memory as well as ones that don't. When they don't, page replacement is handled by the operating system, which is in a good position to do this efficiently. Modern operating systems and hardware already have so many layers of caching and buffering that it seems crazy to add yet another layer of your own!
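As a rough illustration of the idea (not Suneido's actual code), here is a minimal Python sketch that reads "pages" directly out of a memory-mapped file. The file name, the 4096-byte PAGE_SIZE, and the read_page helper are all assumptions made up for the example. There is no application-level cache here; the operating system decides which pages stay resident.

```python
import mmap
import os
import tempfile

PAGE_SIZE = 4096  # hypothetical page size chosen for this sketch


def read_page(mm, page_no):
    """Return the bytes of one page straight from the mapping."""
    start = page_no * PAGE_SIZE
    return mm[start:start + PAGE_SIZE]


# Build a small three-page file to stand in for a database.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
with open(path, "wb") as f:
    for page_no in range(3):
        f.write(bytes([page_no]) * PAGE_SIZE)

with open(path, "rb") as f:
    # Map the whole file read-only. Nothing is copied into a buffer pool;
    # the OS pages data in (and out) on demand.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        page1 = read_page(mm, 1)
        first_byte_of_page2 = mm[2 * PAGE_SIZE]
```

The same code works whether the file is smaller or larger than physical memory, which is the point of leaving page replacement to the operating system.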

Suneido also does as much work as possible using the external format of records, only converting to internal format when necessary. For example, most database "wheres" are done in external format. The encoding of values into external format maintains ordering, so sorting can also be done while still in external format.
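To make the order-preserving idea concrete, here is a small Python sketch. It is not Suneido's actual encoding; it is just one well-known way to encode signed 64-bit integers (add a bias, then store big-endian) so that comparing the raw bytes gives the same answer as comparing the numbers. The encode_int and decode_int names are invented for the example.

```python
import struct

BIAS = 1 << 63  # shifts the signed 64-bit range onto the unsigned range


def encode_int(n):
    """Encode a signed 64-bit int so byte order matches numeric order."""
    return struct.pack(">Q", n + BIAS)


def decode_int(b):
    """Inverse of encode_int."""
    return struct.unpack(">Q", b)[0] - BIAS


values = [5, -3, 0, 1000, -1000]
encoded = [encode_int(v) for v in values]

# Sorting the encoded bytes sorts the underlying numbers.
decoded_in_byte_order = [decode_int(b) for b in sorted(encoded)]

# A "where value > 0" can compare encoded bytes without decoding each record.
threshold = encode_int(0)
matches = [b for b in encoded if b > threshold]
```

Only the rows that pass the filter ever need to be converted to internal format, and the sort never decodes anything.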

Rather than pessimistic record locking, Suneido uses optimistic multi-version concurrency control in conjunction with an append-only "immutable" database. This means that read transactions do not require any locking and do not interact with update transactions. Write transactions only require locking for the actual commit.
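The following Python sketch shows the general shape of optimistic multi-version concurrency control. It is a toy model under stated assumptions, not Suneido's implementation: the Database, Transaction, and Conflict names are invented, each version is a whole copied dict (a real system would share structure), and conflicts are detected per key. Note that reads take no lock, and the only lock is held during commit.

```python
import threading


class Conflict(Exception):
    """Raised when a transaction's reads or writes were overtaken."""


class Database:
    def __init__(self):
        self._versions = [{}]     # append-only list of snapshots, never mutated
        self._changed = [set()]   # keys written by each version
        self._commit_lock = threading.Lock()

    def begin(self):
        # No lock needed: a transaction just remembers the latest version.
        return Transaction(self, len(self._versions) - 1)


class Transaction:
    def __init__(self, db, start):
        self.db, self.start = db, start
        self.snapshot = db._versions[start]
        self.reads, self.writes = set(), {}

    def get(self, key):
        self.reads.add(key)
        return self.writes[key] if key in self.writes else self.snapshot.get(key)

    def put(self, key, value):
        self.writes[key] = value  # buffered until commit

    def commit(self):
        if not self.writes:
            return                # read-only: nothing to validate or lock
        db = self.db
        with db._commit_lock:     # the only lock, held just for the commit
            touched = self.reads | set(self.writes)
            for changed in db._changed[self.start + 1:]:
                if changed & touched:
                    raise Conflict(sorted(changed & touched))
            new_version = dict(db._versions[-1])
            new_version.update(self.writes)
            db._versions.append(new_version)
            db._changed.append(set(self.writes))


db = Database()
setup = db.begin(); setup.put("x", 1); setup.commit()

reader = db.begin()               # sees x == 1 for its whole lifetime
a, b = db.begin(), db.begin()
a.put("x", a.get("x") + 1); a.commit()
b.put("x", b.get("x") + 10)
try:
    b.commit()
    conflicted = False
except Conflict:
    conflicted = True             # b read x before a changed it
```

The reader keeps seeing its original snapshot no matter what commits afterwards, and the losing writer finds out at commit time rather than waiting on a lock.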

Suneido's database server is multi-threaded with requests handled by a pool of threads. But most of the internal data is in immutable persistent data structures, which require minimal locking. And the database itself is an immutable persistent data structure requiring minimal locking. (I minimized the locking as much to avoid bugs as for performance.)
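For readers who haven't met persistent data structures, here is a minimal Python sketch of one: an immutable binary search tree where insert returns a new root and copies only the nodes on the path to the change. This is a generic textbook technique, not the structure Suneido uses, and the Node, insert, and lookup names are invented for the example. It is unbalanced to keep it short.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(node, key, value):
    """Return a new tree containing key; the old tree is left untouched."""
    if node is None:
        return Node(key, value)
    if key < node.key:
        return node._replace(left=insert(node.left, key, value))
    if key > node.key:
        return node._replace(right=insert(node.right, key, value))
    return node._replace(value=value)


def lookup(node, key):
    while node is not None:
        if key == node.key:
            return node.value
        node = node.left if key < node.key else node.right
    return None


v1 = insert(insert(insert(None, 5, "five"), 2, "two"), 8, "eight")
v2 = insert(v1, 1, "one")   # new version; v1 is still valid and unchanged
```

Because nothing is ever modified in place, a thread holding v1 can keep reading it without any lock while another thread builds v2. The untouched right subtree is shared between the two versions rather than copied.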

Finally, Suneido doesn't use a write-ahead log. Instead, its immutable append-only design makes it a type of log-structured database, where the database itself can act as the log.
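Here is a hedged Python sketch of how an append-only file can double as its own recovery log. The record layout (a length and CRC32 header before each commit) and the append_commit and recover helpers are assumptions made up for this example, not Suneido's file format. Recovery simply scans forward and stops at the first commit that is incomplete or fails its checksum.

```python
import io
import struct
import zlib

HEADER = struct.Struct(">II")  # payload length, CRC32 of payload


def append_commit(f, payload):
    """Append one commit to the end of the file; nothing is overwritten."""
    f.seek(0, io.SEEK_END)
    f.write(HEADER.pack(len(payload), zlib.crc32(payload)))
    f.write(payload)


def recover(f):
    """Return (valid commits, offset where valid data ends)."""
    f.seek(0)
    data = f.read()
    commits, pos = [], 0
    while pos + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, pos)
        body = data[pos + HEADER.size:pos + HEADER.size + length]
        if len(body) < length or zlib.crc32(body) != crc:
            break              # torn or corrupt commit: ignore it and the rest
        commits.append(body)
        pos += HEADER.size + length
    return commits, pos


f = io.BytesIO()               # stands in for the database file
append_commit(f, b"a=1")
append_commit(f, b"b=2")
good_size = f.tell()

# Simulate a crash partway through a third commit.
f.write(HEADER.pack(100, 0) + b"partial")

commits, valid_end = recover(f)
```

After a crash, the file can be truncated back to valid_end and the database is consistent again, with no separate log to replay.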

NOTE: This is based on the newer Java implementation of Suneido which has a different database engine than the older C++ version.

I haven't benchmarked Suneido's database against other systems so I can't make any claims about speed. But in terms of avoiding most of the overheads that Stonebraker identifies, Suneido seems to hold up pretty well.

1 comment:

Larry Reid said...

I love the "locking" vs. "multi-threading" conundrum. At the end of the day, if some things have to happen in a certain order, they have to happen in a certain order. There is no free lunch.

The key is to do like Suneido and other modern databases: do things in order as quickly as possible.