Toward the middle and end of my internship, I got to modify parts of the storage system in Sink, the library that handles all of Kube's data.
The goal was both to speed up the storage and to reduce disk usage. These two goals often go hand in hand in databases, since smaller data means faster disk lookups and more data kept in memory, available for direct use.
Reducing key sizes
The first important modification was to store keys in a binary format instead of the displayable format. This makes data retrieval faster, because the binary format is much smaller than the display format.
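To give a sense of the size difference, here is a minimal Python sketch. It assumes the keys are standard UUIDs shown in the braced, hyphenated form; the exact key layout used in Sink may differ, and the key below is made up for illustration.

```python
import uuid

# Hypothetical key: a random UUID, used here only for illustration.
key = uuid.uuid4()

# Displayable format: braces around the usual hyphenated hex form.
display_key = ("{" + str(key) + "}").encode("ascii")

# Binary format: the raw 16 bytes of the UUID.
binary_key = key.bytes

print(len(display_key))  # 38
print(len(binary_key))   # 16

# The conversion is lossless: the display form can be rebuilt on demand.
assert uuid.UUID(bytes=binary_key) == key
```

Under that assumption, each key shrinks to less than half its displayed size, and the displayable form can still be rebuilt whenever it is needed.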
The effect on memory usage is more awkward to measure: on the one hand, the keys take less space in memory, but on the other hand, LMDB (the database system we are using) will try to keep more data in memory.
On the whole, though, either we use less memory or Kube gets faster overall. The reality is probably a mix of the two.
As for the numbers, these are the results of two different benchmarks:
Metric | Develop branch | This patch |
---|---|---|
Current RSS usage [kb] | 40700 | 39112 |
On disk [kb] | 10788 | 8836 |
Write amplification | 12.0075 | 9.83485 |

Metric | Develop branch | This patch |
---|---|---|
Total pages | 760 | 603 |
Used size | 1425408 | 1191936 |
Total on disk | 3293184 | 2650112 |
Write amplification | 3.63268 | 3.51402 |
As both tables show, we use less disk space after the patch, but memory usage has not gone down in all places.
Overall, we use approximately 20% less disk space.
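The 20% figure can be checked against the on-disk sizes in the two tables above. The short Python calculation below uses only those numbers; the helper name `reduction_percent` is just for this illustration.

```python
# On-disk sizes (before, after) copied from the two benchmark tables.
benchmarks = {
    "first benchmark": (10788, 8836),
    "second benchmark": (3293184, 2650112),
}


def reduction_percent(before, after):
    """Relative saving of `after` compared to `before`, in percent."""
    return (before - after) / before * 100


for name, (before, after) in benchmarks.items():
    print(f"{name}: {reduction_percent(before, after):.1f}% less disk space")
# first benchmark: 18.1% less disk space
# second benchmark: 19.5% less disk space
```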
Separating UIDs and revisions
Another important modification to the storage system was separating UIDs and revisions. Before this patch, we stored the UID and the revision together as the key inside the database.
What’s more, the key had the format “{this-is-a-uid}0000000000042”. The number was padded with zeroes because we need the data to be ordered, and the key had to be stored as a string because of the UID.
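To see why the padding matters when keys are compared as strings, here is a small Python illustration. The revision numbers and the padding width are arbitrary values chosen for the example.

```python
revisions = [2, 10, 42, 100]

# Hypothetical padding width, chosen only for this illustration.
WIDTH = 13

unpadded = [str(r) for r in revisions]
padded = [str(r).zfill(WIDTH) for r in revisions]

# Without padding, string order differs from numeric order.
print(sorted(unpadded))  # ['10', '100', '2', '42']

# With padding, string order matches numeric order.
print(sorted(padded) == padded)  # True
```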
However, every revision is unique, so we can store only the revision as the key. This also allows us to store the key as a number and use the “integer key” feature of LMDB, so the sorting stays correct and we save a nice amount of space.
But since we can keep old revisions, we need to track them. Therefore, we need another table that maps UIDs to revisions.
The rationale behind this patch is that since the “uid to revisions” database will be much smaller than the main database, it can be kept in memory, leading to faster lookups.
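The two-table layout can be modelled with plain Python dictionaries. This is only a conceptual sketch, not Sink's actual code or the LMDB API: the names `main_table`, `uid_to_revisions`, `write` and `read_latest` are hypothetical.

```python
from collections import defaultdict

# Main table: revision number -> stored entity (integer keys).
main_table = {}
# Secondary table: UID -> every revision recorded for that UID.
uid_to_revisions = defaultdict(list)

next_revision = 1


def write(uid, data):
    """Store a new revision of `uid` and return its revision number."""
    global next_revision
    revision = next_revision
    next_revision += 1
    main_table[revision] = data
    uid_to_revisions[uid].append(revision)
    return revision


def read_latest(uid):
    """Look up the newest revision of `uid` via the small mapping table."""
    return main_table[uid_to_revisions[uid][-1]]


write("uid-a", "first version")
write("uid-b", "other entity")
write("uid-a", "second version")

print(uid_to_revisions["uid-a"])  # [1, 3]
print(read_latest("uid-a"))       # second version
```

In this model, a lookup by UID first goes through the small mapping and then fetches a single entry from the main table by its integer key.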
Even though the results were not as substantial as those of the “reducing key sizes” patch, we still seem to have gained a small performance improvement. We will get a better picture of the difference in the near future, though.
In the meantime, we can use the benchmark graphs to see how performance, memory and disk usage were impacted (courtesy of our CI):




