Linux Zoned Storage Support Overview
Zoned block device support was initially released with Linux® kernel version 4.10. Subsequent versions improved this support and added new features beyond the raw block device access interface. More advanced features, such as device mapper support and ZBD-aware file systems, are now available.
Application developers can use zoned block devices through various I/O paths, controlled with different programming interfaces and exposing the device in different ways. A simplified representation of the possible access paths is shown in the figure below.
Zoned block device support features
Three different I/O paths implement two POSIX-compatible interfaces that completely hide the write constraints of zoned block device sequential zones. These three I/O paths are suitable for executing legacy applications, that is, applications that have not been modified to issue fully sequential write streams.
- File Access Interface This is the interface implemented by a file system, allowing an application to organize its data in files and directories. Two different implementations are available.
- ZBD File System With this implementation, the file system is modified to directly handle the sequential write constraint of zoned block devices. Random writes to files by applications are transformed into sequential write streams by the file system, concealing the device constraints from the application. An example of this is the F2FS file system.
- Legacy File System In this case, an unmodified file system is used and the device write constraints are handled by a device mapper driver exposing a regular block device. This device mapper is dm-zoned. Its characteristics and use are discussed in detail in this article.
- Block Access Interface This is the raw block device access interface that results from the application directly accessing the block device file. This interface is implemented using the dm-zoned device mapper to hide the sequential write constraints of the zoned block device from the application.
Two additional interfaces are usable by applications that have been written or modified to comply with the sequential write constraint of zoned block devices. These interfaces do not hide the constraints, and applications must ensure that data is written in sequential streams starting from zone write pointer positions.
- Zoned Block Access Interface This is the counterpart of the block access interface, without any intermediate driver to handle the device constraints. An application can use this interface by directly opening the zoned block device file, gaining access to the zone information and management interfaces provided by the block layer. As an example, Linux System Utilities use this interface. Physical zoned block devices as well as logically created zoned block devices (e.g. zoned block devices created with the dm-linear device mapper target) support this interface.
- Direct Device Access Interface This is the interface provided by the SCSI generic driver, which allows an application to send SCSI commands directly to the device. The kernel interferes minimally with the commands sent by applications, so the application itself must handle all device characteristics (e.g. logical and physical sector sizes, command timeouts, command retry counts, etc.). User-level libraries such as libzbc can greatly simplify the implementation of applications using this interface.
The initial release of zoned block device support with kernel 4.10 was limited to the block layer ZBD interface, I/O ordering control, and native support for the F2FS file system. Following kernel versions added more features, such as device mapper drivers and support for the block multi-queue infrastructure.
The figure below summarizes the evolution of zoned block device support with kernel versions.
Kernel versions and ZBD features
Direct Access Support (SG Access) Support for exposing host managed ZBC/ZAC hard disks as SCSI generic (SG) nodes was officially added to kernel 3.18 with the definition of the device type TYPE_ZBC for SCSI devices and the device class ATA_DEV_ZAC for ATA devices. For kernels older than version 3.18, SATA host managed ZAC disks are not exposed to users as SG nodes nor as block device files. These older kernels simply ignore SATA devices reporting a host managed ZAC device signature, and such devices are not usable in any way. For SCSI disks, or for SATA disks connected to a compatible SAS HBA, host managed disks are accessible to the user through the SG node file created by the kernel to represent these disks.
Zoned Block Device Access and F2FS Support The block I/O layer zoned block device support added to kernel version 4.10 enables exposing host managed ZBC and ZAC disks as block device files, similarly to regular disks. This support also includes changes to the kernel libata command translation to translate SCSI ZBC zone block commands into ATA ZAC zone commands. For applications relying on SCSI generic direct access, this enables handling both ZBC (SCSI) and ZAC (ATA) disks with the same code (e.g. ATA commands do not need to be issued directly). Access to zoned block devices is also possible using the disk block device file (e.g. the /dev/sdX device file) with regular POSIX system calls. However, compared to regular disks, some restrictions still apply (see Kernel ZBD Support Restrictions).
Device Mapper and dm-zoned Support With kernel version 4.13.0, support for zoned block devices was added to the device mapper infrastructure. This support allows using the dm-linear and dm-flakey device mapper targets on top of zoned block devices. The new dm-zoned device mapper target was also added.
Block multi-queue and SCSI multi-queue Support (scsi-mq) With kernel version 4.16.0, support for the block multi-queue infrastructure was added. This improvement enables using host managed ZBC and ZAC disks with SCSI multi-queue (scsi-mq) support enabled, while retaining support for the legacy single-queue block I/O path. The block multi-queue and scsi-mq I/O paths have been the default since kernel version 5.0, with the removal of the legacy single-queue block I/O path.
Improvements to the kernel zoned block device support are still ongoing. Support for new file systems (e.g. btrfs) will be released in the coming months.
Recommended Kernel Versions
All kernel versions since 4.10 include ZBD support. However, as shown in the figure Kernel versions and ZBD features, some versions are recommended over others.
Long Term Stable (LTS) Versions Kernel versions 4.14 and 4.19 are long term stable kernel versions that receive bug fix back-ports from the mainline (development) kernel. These versions thus benefit from stability improvements developed for later versions. Fixes to the zoned block device support are also back-ported to these versions.
Latest Stable Version While not necessarily marked as a long term stable version, the latest stable kernel version receives all bug fixes developed in the mainline kernel version that follows it. Unless the version is tagged as a long term support version, back-porting of fixes to a stable kernel version stops when the next version switches from mainline to stable. Using a particular stable kernel version for a long time is thus not recommended.
For any stable or long term stable kernel version, it is recommended that system administrators use the latest available release within that version to ensure that all known problem fixes are applied.
ZBD Support Restrictions
In order to minimize the amount of changes to the block layer code, various existing features were reused. Furthermore, other kernel components that are not compatible with zoned block device behavior and are too complex to change were left unmodified. This approach led to a set of constraints that all zoned block devices must meet to be usable.
Zone size While the ZBC and ZAC standards do not impose any constraint on the zone layout of a device, that is, zones can be of any size, the kernel ZBD support is restricted to zoned devices with all zones of equal size. The zone size must also be a power-of-two number of logical blocks. Only the last zone of a device may optionally have a smaller size (a so-called runt zone). This zone size restriction allows the kernel code to reuse the block layer "chunked" space management normally used for software RAID devices. The chunked space management uses power-of-two arithmetic (bit shift operations) to determine which chunk (i.e. which zone) is being accessed and to ensure that block I/O operations do not cross zone boundaries.
Unrestricted Reads The ZBC and ZAC standards define the URSWRZ bit, which indicates whether a device returns an error when a read is directed at the unwritten sectors of a sequential zone, that is, for a read command accessing sectors after the write pointer position of a zone. Linux only supports ZBC and ZAC host managed disks allowing unrestricted read commands, or in other words, disks reporting the URSWRZ bit as not set. This restriction ensures that the block layer disk partition scanning process does not result in failed read commands whenever the disk partition table is checked.
Direct IO Writes The kernel page cache does not guarantee that cached dirty pages will be flushed to a block device in sequential sector order. This can lead to unaligned write errors if an application uses buffered writes to write to the sequential write required zones of a device. To avoid this pitfall, applications directly using the block device without a file system should always write to host managed disk sequential zones using direct I/O operations, that is, issue write() system calls with the block device file open using the O_DIRECT flag.
All known ZBC and ZAC host-managed hard disks available on the market today have characteristics compatible with these requirements and can operate with a ZBD compatible Linux kernel.