RocksDB with ZenFS
RocksDB is a persistent key-value store for fast storage devices. It is implemented using a Log-Structured Merge-Tree (LSM-tree) data structure. It is simlilar to LSM-tree based key-value engine implementations: values are stored in tables that are sorted in increasing key order. Tables are sequentially written and never modified. This basic principle of the LSM-tree data structure makes it possible to support zoned block devices.
The storage plugin architecture of RocksDB makes it possible to accommodate different storage backends. More to the point, ZenFS implements support for zoned block devices and is integrated into RocksDB.
ZenFS is a file system plugin for RocksDB that uses the RocksDB FileSystem [sic] interface to place files into zones on a raw zoned block device. By separating files into zones and utilizing "write lifetime hints" (WLTH) to co-locate data of similar lifetimes, ZenFS can reduce system write amplification as compared to regular file systems on conventional block devices. ZenFS ensures that there is no background garbage collection in the file system or on the device, which improves performance in terms of throughput, tail latencies and endurance.
ZenFS can store multiple files in a single zone by using an extent allocation scheme. A file can be composed of one or more extents and all the extents that compose that file can be stored in the same zone (or in different zones) of the device. An extent never spans multiple zones. When all file extents in a zone are invalidated, the zone can be reset and then reused to store new file extents.
ZenFS places file extents into zones based on write lifetime hints (WLTH) provided by RocksDB library. ZenFS always attempts to place file extents together in the same zones when they have similar WLTH.
In ZenFS, data garbage collection is performed only by RocksDB when it initiates the LSM-tree table compaction process. No garbage collection is executed by ZenFS and no garbage collection is executed by the ZNS device controller.
Further information is available in the ZNS: Avoiding the Block Interface Tax for Flash-based SSDs USENIX ATC 2021 article.
Getting Started
Prerequisites
ZenFS requires Linux kernel version 5.9 or newer. The kernel used must be configured with zoned block device support enabled.
ZenFS uses the libzbd library. The latest version of this library must be compiled and installed prior to building and installing ZenFS.
Building and Installing ZenFS
ZenFS is embedded into RocksDB. It is available as a submodule in RocksDB and must be explicitly enabled when compiling RocksDB.
Instructions that explain how to compile and install RocksDB with ZenFS are maintained in the ZenFS project README file.
Remember to set the block device IO scheduler to "deadline" to prevent write operations from being reordered. This can be done automatically on system boot using a udev rule.
ZenFS Command Line
ZenFS provides a command line utility called zenfs. This utility is used to format the zoned device to create a new filesystem, to list the files, and to back up and restore the filesystem.
Create a ZenFS file system
To create or format a zoned block device (e.g., NVMe ZNS device /dev/nvme0n1
),
use the following command:
zenfs mkfs --zbd=nvme0n1 --aux_path=/tmp/zone-aux --force
# Output example
ZenFS file system created. Free space: 220246 MB
List files within a ZenFS file system
After the zoned block device has been formatted, RocksDB manages its files through ZenFS. To list the files present on the zoned block device, use the following command:
zenfs list --zbd=nvme0n1 --path=rocksdbtest/dbbench
# Output example:
0 Jul 20 2021 18:13:52 LOCK
66979 Jul 20 2021 18:14:01 LOG
26961453 Jul 20 2021 18:13:55 000014.sst
26961524 Jul 20 2021 18:13:55 000015.sst
26961904 Jul 20 2021 18:13:55 000016.sst
26963148 Jul 20 2021 18:13:55 000017.sst
102734225 Jul 20 2021 18:13:56 000019.sst
26962608 Jul 20 2021 18:13:55 000020.sst
26961566 Jul 20 2021 18:13:56 000021.sst
26963214 Jul 20 2021 18:13:56 000022.sst
26963380 Jul 20 2021 18:13:57 000023.sst
102594916 Jul 20 2021 18:13:58 000025.sst
26546055 Jul 20 2021 18:13:59 000026.sst
102540090 Jul 20 2021 18:14:00 000028.sst
190808702 Jul 20 2021 18:14:01 000029.log
102826791 Jul 20 2021 18:14:00 000030.sst
16 Jul 20 2021 18:13:52 CURRENT
37 Jul 20 2021 18:13:52 IDENTITY
1586 Jul 20 2021 18:14:01 MANIFEST-000004
6178 Jul 20 2021 18:13:52 OPTIONS-000007
Back up files within a ZenFS file system
To back up all table and metadata files within a ZenFS file system to a local filesystem, use the following command:
zenfs backup --zbd=nvme0n1 --path=/tmp/backup
# Output example
rocksdbtest/dbbench/LOCK
rocksdbtest/dbbench/LOG
rocksdbtest/dbbench/000014.sst
rocksdbtest/dbbench/000015.sst
rocksdbtest/dbbench/000016.sst
rocksdbtest/dbbench/000017.sst
rocksdbtest/dbbench/000019.sst
rocksdbtest/dbbench/000020.sst
rocksdbtest/dbbench/000021.sst
rocksdbtest/dbbench/000022.sst
rocksdbtest/dbbench/000023.sst
rocksdbtest/dbbench/000025.sst
rocksdbtest/dbbench/000026.sst
rocksdbtest/dbbench/000028.sst
rocksdbtest/dbbench/000029.log
rocksdbtest/dbbench/000030.sst
rocksdbtest/dbbench/CURRENT
rocksdbtest/dbbench/IDENTITY
rocksdbtest/dbbench/MANIFEST-000004
rocksdbtest/dbbench/OPTIONS-000007
Restore files within a ZenFS file system
To restore files from a previous backup, use the following command:
zenfs restore --zbd=nvme0n1 --path=/tmp/backup/rocksdbtest/dbbench/ \
--restore_path=rocksdbtest/dbbench
# Output example
/tmp/backup/rocksdbtest/dbbench/CURRENT
/tmp/backup/rocksdbtest/dbbench/LOCK
/tmp/backup/rocksdbtest/dbbench/000015.sst
/tmp/backup/rocksdbtest/dbbench/000022.sst
/tmp/backup/rocksdbtest/dbbench/LOG
/tmp/backup/rocksdbtest/dbbench/MANIFEST-000004
/tmp/backup/rocksdbtest/dbbench/000023.sst
/tmp/backup/rocksdbtest/dbbench/000028.sst
/tmp/backup/rocksdbtest/dbbench/000025.sst
/tmp/backup/rocksdbtest/dbbench/000030.sst
/tmp/backup/rocksdbtest/dbbench/IDENTITY
/tmp/backup/rocksdbtest/dbbench/OPTIONS-000007
/tmp/backup/rocksdbtest/dbbench/000020.sst
/tmp/backup/rocksdbtest/dbbench/000019.sst
/tmp/backup/rocksdbtest/dbbench/000016.sst
/tmp/backup/rocksdbtest/dbbench/000014.sst
/tmp/backup/rocksdbtest/dbbench/000026.sst
/tmp/backup/rocksdbtest/dbbench/000029.log
/tmp/backup/rocksdbtest/dbbench/000021.sst
/tmp/backup/rocksdbtest/dbbench/000017.sst
Benchmarking
RocksDB provides the db_bench utility to test and benchmark the performance of a device. The following command provides an example of db_bench execution, using a zoned block device that has been formatted with ZenFS.
db_bench --fs_uri=zenfs://dev:nvme0n1 --benchmarks=fillrandom \
--use_direct_reads --key_size=16 --value_size=800 \
--target_file_size_base=2147483648 \
--use_direct_io_for_flush_and_compaction \
--max_bytes_for_level_multiplier=4 --write_buffer_size=2147483648 \
--target_file_size_multiplier=1 --num=1000000 --threads=2 \
--max_background_jobs=4
# Output example
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB: version 6.21
Date: Tue Jul 20 20:24:56 2021
CPU: 32 * AMD EPYC 7302P 16-Core Processor
CPUCache: 512 KB
Keys: 16 bytes each (+ 0 bytes user-defined timestamp)
Values: 800 bytes each (400 bytes after compression)
Entries: 1000000
Prefix: 0 bytes
Keys per prefix: 0
RawSize: 778.2 MB (estimated)
FileSize: 396.7 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: skip_list
Perf Level: 1
------------------------------------------------
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
DB path: [rocksdbtest/dbbench]
fillrandom : 10.447 micros/op 191441 ops/sec; 149.0 MB/s