Zoned block device (ZBD) support was added to the device mapper subsystem in kernel version 4.13. Two existing targets gained zoned device support: dm-linear and dm-flakey. A new target driver, dm-zoned, was also added.
The dm-linear target maps a linear range of blocks of the device-mapper device onto a linear range on a backend device. dm-linear is the basic building block of logical volume managers like LVM.
When used with zoned block devices, the dm-linear device that is created will also be a zoned block device with the same zone size as the underlying device. Several conditions are enforced by the device-mapper core-management code during the creation of a dm-linear target device.
- All backend devices used to map different ranges of the target device must have the same zone model.
- If the backend devices are zoned block devices, all devices must have the same zone size.
- The mapped ranges must be zone aligned, that is, partial zone mapping is not possible.
This example illustrates how to create a small host-managed disk that uses zone ranges from a high-capacity host-managed disk. The zone information of the backend device used is shown below.
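Such zone information can be obtained with the blkzone utility from the util-linux package. A minimal sketch, assuming the backend device is /dev/sdb (the device name is illustrative):

```
# Report all zones of the backend device: the start, length, type,
# condition and write pointer position are printed for each zone.
blkzone report /dev/sdb
```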
To create a dm-linear device named "small-sdb" that joins the first 5 conventional zones of the backend device with the first 10 sequential zones, use the following command.
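A sketch of such a command, assuming the backend device is /dev/sdb with a zone size of 524288 sectors (512 B unit) and 524 conventional zones, as on the example disk described below:

```
# First mapping: 5 conventional zones (5 x 524288 sectors, starting
# at sector 0).
# Second mapping: the first 10 sequential zones, which start right
# after the 524 conventional zones, at sector 524 x 524288 = 274726912.
dmsetup create small-sdb << EOF
0 2621440 linear /dev/sdb 0
2621440 5242880 linear /dev/sdb 274726912
EOF
```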
The resulting device zone model is host-managed and has 15 zones, as shown below.
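One way to verify this, assuming the target device created above (paths are illustrative):

```
# The zone model of the target device is exposed through sysfs.
cat /sys/block/$(basename $(readlink -f /dev/mapper/small-sdb))/queue/zoned
# blkzone prints one line per zone, so the zone count can be
# obtained by counting the lines of the report.
blkzone report /dev/mapper/small-sdb | wc -l
```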
The following shows a script that facilitates the creation of dm-linear devices using zone ranges from a single zoned block device. Such small zoned block devices can be used to test application limits (e.g., disk-full conditions).
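A minimal sketch of such a helper, assuming fixed-size zones as reported by the chunk_sectors sysfs queue attribute (the script name and interface are illustrative):

```
#!/bin/bash
# zlinear.sh - expose a contiguous range of zones of a zoned block
# device as a new dm-linear device.
# Usage: zlinear.sh <disk> <name> <first zone number> <number of zones>

disk="$1"; name="$2"; first="$3"; count="$4"

# Zone size in 512 B sectors (zoned block devices report their zone
# size through the chunk_sectors queue attribute).
zsize=$(cat /sys/block/$(basename "${disk}")/queue/chunk_sectors)

echo "0 $(( count * zsize )) linear ${disk} $(( first * zsize ))" | \
	dmsetup create "${name}"
```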
dm-linear can also be used to aggregate a zoned block device's conventional zones into a target device that is usable as a regular disk (conventional zones can be randomly written). Reusing the previous example backend disk, 524 conventional zones of 524288 sectors (512 B unit) are available. The following command creates a dm-linear device joining all conventional zones together.
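A sketch of such a command, again assuming /dev/sdb as the backend device (524 conventional zones of 524288 sectors give 274726912 sectors in total; the target name is illustrative):

```
# A single mapping covering all 524 conventional zones.
echo "0 274726912 linear /dev/sdb 0" | dmsetup create sdb-conv
```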
The target device is again a host-managed disk but contains only conventional zones.
Because this zoned block device is composed entirely of conventional zones, all sectors are randomly writable and can therefore be used directly with any file system.
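For example, a file system with no zoned block device support can be created and mounted on such a target device (the target name is the one assumed above):

```
mkfs.ext4 /dev/mapper/sdb-conv
mount /dev/mapper/sdb-conv /mnt
```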
Applications that require frequent random updates to their metadata can use such a setup to simplify the implementation of complex metadata structures. The remaining sequential zones of the disk can be used directly by the application to store data.
The dm-flakey target is similar to the dm-linear target, except that it periodically exhibits unreliable behavior. This target is useful for simulating failing devices during testing. In the case of zoned block devices, simulating write errors to sequential zones can help to debug application write-pointer management.
dm-flakey works as follows: starting when the table is loaded, the device behaves normally and does not generate errors for a configurable number of seconds (the up time), then exhibits unreliable behavior for a configurable number of seconds (the down time). This cycle then repeats.
Several error-simulation behaviors can be configured.
- drop_writes: All write I/Os are silently ignored and dropped. Read I/Os are handled correctly.
- error_writes: All write I/Os are failed with an error signaled. Read I/Os are handled correctly.
- corrupt_bio_byte: During the down time, replace the Nth byte of the data of each read or write block I/O with a specified value.
The default error mode is to fail all I/O requests during the down time of the simulation cycle.
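As a sketch, the following creates a dm-flakey device over the first 10 sequential zones of the example disk that behaves normally for 60 seconds and then silently drops all writes for 5 seconds (the table format is documented in dm-flakey.txt; names and intervals are illustrative):

```
# Table format: start length flakey <dev> <offset> <up interval>
#               <down interval> [<num_features> [<feature arguments>]]
echo "0 5242880 flakey /dev/sdb 274726912 60 5 1 drop_writes" | \
	dmsetup create flakey-sdb
```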
The same restrictions that apply to the dm-linear target also apply to dm-flakey.
dm-flakey detailed documentation and usage examples can be found in the kernel source code documentation file Documentation/device-mapper/dm-flakey.txt.
The dm-zoned device mapper target provides random write access to zoned block devices (ZBC and ZAC compliant devices). It hides the sequential write constraint of host-managed zoned block devices from the device user (the "device user", in this context, is a file system or an application accessing a raw block device). This allows the use of applications and file systems that do not have native zoned block device support.
File systems or applications that can natively support host-managed zoned block devices (e.g. the f2fs file system since kernel 4.10) do not need to use the dm-zoned device mapper target.
dm-zoned implements an on-disk write buffering scheme to handle random write accesses to sequential write required zones of a zoned block device. Conventional zones of the backend device are used for buffering random accesses, as well as for storing internal metadata.
The figure below illustrates the dm-zoned zone-usage principle.
Optionally, since Linux kernel version 5.8.0, an additional regular block device can be used to provide randomly writable storage, replacing the conventional zones of the backend zoned block device for write buffering. With this new version of dm-zoned, multiple zoned block devices can also be used to increase performance.
All zones of the device(s) used to back a dm-zoned target are separated into two types:
- Metadata zones: These are randomly-writable zones that are used to store metadata. Randomly writable zones may be conventional zones or sequential write preferred zones (host-aware devices only). Metadata zones are not reported as usable capacity to the user. If an additional regular block device is used for write buffering, metadata zones are stored on this cache device.
- Data zones: All remaining zones of the device. The majority of these zones are sequential zones, which are used exclusively for storing user data. The conventional zones (or part of the sequential write preferred zones on a host-aware device) may also be used for buffering random user writes. User data may thus be stored either in a conventional zone or in a sequential zone.
As shown in the above figure, the target device is divided into chunks that have the same size as the zones of the backing zoned devices. A logical chunk can be mapped to zones of the backing device in different ways.
- Conventional or cache zone mapping: This is the case for chunk A in the figure, which is mapped to the conventional zone CA. This is the default mapping that is initialized when the first write command is issued to an empty (unwritten) chunk. As long as a chunk is mapped to a conventional zone, any incoming write request can be executed directly using the mapped conventional zone.
- Sequential zone mapping: A chunk can be mapped initially to a sequential zone, as shown for chunk C (mapped to the sequential zone SC in the figure). With such a mapping, an already-written block of the chunk cannot be modified directly. To handle this case, the next mapping type is used.
- Dual conventional-sequential zone mapping: A conventional zone is temporarily added to the chunk mapping when data in an already-written block of a chunk mapped to a sequential zone must be updated. Any write that targets a written block is then processed using the conventional zone instead of the sequential zone.
dm-zoned metadata include a set of bitmaps to track the validity state of blocks in the zones of the backing device. Any write-operation execution is always followed by an update to the bitmaps, to mark the written blocks as valid. In the case of the dual conventional-sequential chunk mapping, the bitmap for the blocks of the sequential zone is updated to clear the bits that represent the blocks that have been updated with a write in the conventional zone. By doing this, incoming reads always have access to the latest version of the block data simply by inspecting the block validity bitmaps.
dm-zoned exposes a logical device with a sector size of 4096 bytes, regardless of the physical sector size of the zoned block device that is used as a backend. This reduces the amount of metadata needed to manage valid (written) blocks. The on-disk metadata format is as follows:
- Super block: The first block of the first randomly-writable zone that is found contains the super block, which describes the amount and position (on the disk) of metadata blocks.
- Mapping-table blocks: After the super block, there is a set of blocks that describes the mapping of the logical chunks of the target logical device to data zones. The mapping is indexed by logical chunk number, and each mapping entry indicates the data zone that stores the chunk data. It can also indicate the zone number of a random zone that is used to buffer random modifications to the chunk data.
- Bitmap-storage blocks: After the mapping-table blocks, there is a set of blocks used to store bitmaps that indicate the validity of blocks in the data zones. A valid block is any block that was written and not discarded. In a buffered data zone, a block can be valid in either the data zone or the buffer zone, but not in both.
dm-zoned uses two sets of metadata zones to protect internal metadata against corruption in case of sudden power loss or a system crash. One set (the primary set) is used as the main metadata set. The other set (the secondary set) is used as a log: modified metadata are first written to the secondary set, and the log is validated by writing an updated super block in the secondary set. After this log operation completes, the primary metadata set is updated. This ensures that one of the two sets is always correct.
Flush operations are used as a commit point: when a flush operation is received, metadata activity is temporarily stopped, all dirty metadata is logged and updated, and then normal operation resumes. Flush operations temporarily delay write requests and discard requests. Read requests can be processed while metadata logging is executed.
For logical chunks that are mapped to random data zones, all write operations are processed by writing directly to the data zone. If the chunk is mapped to a sequential zone, the write operation is processed directly only if the write offset within the logical chunk is equal to the write pointer offset within the sequential data zone (i.e., the write operation is aligned on the zone write pointer). Otherwise, write operations are processed indirectly, using a buffer zone: a randomly writable free data zone is allocated and assigned to the chunk that is being accessed, in addition to the already-mapped sequential data zone. Writing blocks to the buffer zone invalidates the same blocks in the sequential data zone.
Read operations are processed according to the block validity information provided by the bitmaps: valid blocks are read either from the data zone or, if the data zone is buffered, from the buffer zone that is assigned to the data zone.
After some time, the limited number of available random zones may be exhausted, making unaligned writes to unbuffered zones impossible. To avoid this situation, a reclaim process regularly scans used random zones and tries to "reclaim" them by (sequentially) copying the valid blocks of the buffer zone to a free sequential zone. Once the copy completes, the chunk mapping is updated to point to the sequential zone and the buffer zone is freed for reuse.
The dmzadm command-line utility is used to format backend zoned devices for use with the dm-zoned device mapper target. This utility verifies the device zone model and prepares and writes on-disk dm-zoned metadata according to the device's capacity and zone size.
The dm-zoned-tools project was formerly hosted on GitHub under the HGST organization. The repository has since moved to the Western Digital Corporation organization on GitHub.
dmzadm detailed usage is as follows.
Formatting a single device target is done using the following command:
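A sketch of the general form of the command:

```
dmzadm --format /dev/<disk name>
```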
/dev/<disk name> identifies the backend zoned block device to use. An example execution using an SMR hard disk is shown below:
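A sketch of such an execution (the device name is illustrative; the command output is omitted):

```
# Format a host-managed SMR disk for use with dm-zoned.
dmzadm --format /dev/sdi
```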
As of Linux kernel v5.8.0, dm-zoned can use regular block devices (such as SSDs) together with zoned block devices. In this case, conventional zones are emulated (so that the regular block device can hold dm-zoned metadata and buffering data). When a regular block device is used, the zone-reclaim process operates by copying data from emulated conventional zones on the regular block device to zones of the zoned block device. This dual-drive configuration can significantly increase the performance of the target device under write-intensive workloads.
Use the following command to format a dm-zoned target device that uses an additional regular block device (and, optionally, several zoned-block devices):
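A sketch, assuming the regular block device is listed first, followed by the zoned block device(s) (device names are illustrative):

```
dmzadm --format /dev/nvme2n1 /dev/sdi
```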
(In this example, /dev/nvme2n1 is an NVMe SSD and /dev/sdi is a host-managed SMR hard disk.)
Start the dm-zoned target device (using a formatted zoned device or set of devices) by running dmzadm with the --start operation:
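A sketch for the single-device case (device name illustrative):

```
dmzadm --start /dev/sdi
```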
Confirm the target's activation by looking at the kernel messages:
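For example (the exact message format depends on the kernel version):

```
# Device-mapper kernel messages are prefixed with "device-mapper:".
dmesg | grep device-mapper
```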
The target device that was created is a regular disk that can be used with any file system.
For a multi-device target, you must specify the same list of devices that you specified when you formatted them:
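A sketch, with the same device order as used for formatting (device names illustrative):

```
dmzadm --start /dev/nvme2n1 /dev/sdi
```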
Check the kernel messages to confirm the activation of the target device, as in the single-device case.
Use dmzadm with the --stop operation to disable a dm-zoned target device:
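A sketch for the single-device case (device name illustrative):

```
dmzadm --stop /dev/sdi
```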
For a multi-device target, you must specify the same list of devices that you specified when you formatted them:
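A sketch (device names illustrative):

```
dmzadm --stop /dev/nvme2n1 /dev/sdi
```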