
Device Mapper

Zoned block device support was added to the device mapper subsystem with kernel version 4.13. Two existing targets gained zoned block device support: the dm-linear target and the dm-flakey target. Zoned block device support also introduced a new target driver, dm-zoned.

dm-linear

The dm-linear target maps a linear range of blocks of the device-mapper device onto a linear range of a backend device. dm-linear is the basic building block of logical volume managers such as LVM.
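
A dm-linear table line has the form "start length linear device offset", with all values expressed in 512 B sectors. As a minimal illustration (the device name and range size below are arbitrary), the following maps the first 2097152 sectors (1 GiB) of /dev/sdc to a new device named example-linear.

# echo "0 2097152 linear /dev/sdc 0" | dmsetup create example-linear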

Zoned Block Device Restrictions

When used with zoned block devices, the dm-linear device created will also be a zoned block device with the same zone size as the underlying device. Several conditions are enforced by the device mapper core management code for the creation of a dm-linear target device.

  • All backend devices used to map different ranges of the target device must have the same zone model.
  • If the backend devices are zoned block devices, all devices must have the same zone size (the zone model and zone size can both be checked from sysfs, as shown below).
  • The mapped ranges must be zone aligned, that is, partial zone mapping is not possible.
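
The first two conditions can be verified from sysfs before writing a table; a minimal check, assuming /dev/sdb and /dev/sdc are the intended backend devices:

# grep . /sys/block/sd[bc]/queue/zoned /sys/block/sd[bc]/queue/chunk_sectors

Each device must report the same zone model (queue/zoned) and, for zoned devices, the same zone size in sectors (queue/chunk_sectors).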

Example: Creating a Small Host Managed Disk

This example illustrates how to create a small host managed disk using zone ranges from a high capacity host managed disk. The zone information of the backend device used is shown below.

# cat /sys/block/sdb/queue/zoned
host-managed
# cat /sys/block/sdb/queue/chunk_sectors
524288
# blkzone report /dev/sdb
  start: 0x000000000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000080000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000100000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000180000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  ...
  start: 0x010580000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x010600000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x010680000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x010700000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  ...
  start: 0x6d2300000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x6d2380000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]

To create a dm-linear device named "small-sdb" joining the first 5 conventional zones of the backend device with the first 10 sequential zones, the following command can be used.

# echo "0 2621440 linear /dev/sdb 0
2621440 5242880 linear /dev/sdb 274726912" | dmsetup create small-sdb
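
The table values derive directly from the 524288-sector zone size and from the zone report shown above: the conventional range covers 5 zones starting at offset 0, and the sequential range covers 10 zones starting at the first sequential zone of the backend device (0x010600000). The arithmetic can be checked in the shell:

# echo $(( 5 * 524288 )) $(( 10 * 524288 )) $(( 0x010600000 ))
2621440 5242880 274726912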

The resulting device zone model is also host managed and has 15 zones as shown below.

# cat /sys/block/dm-0/queue/zoned
host-managed
# cat /sys/block/dm-0/queue/chunk_sectors
524288
# blkzone report /dev/dm-0
  start: 0x000000000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000080000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000100000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000180000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000200000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000280000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000300000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000380000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000400000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000480000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000500000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000580000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000600000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000680000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]
  start: 0x000700000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)]

The following script facilitates the creation of dm-linear devices using zone ranges from a single zoned block device. Such small zoned block devices are useful for testing application limits (e.g. disk full conditions).

#!/bin/bash

if [ $# != 3 ]; then
    echo "Usage: $0 <disk> <num conv zones> <num seq zones>"
    exit 1
fi

disk="$1"
nrconv=$2
nrseq=$3
dname="`basename ${disk}`"

# Linear table entries: "start length linear device offset"
# start: starting block in virtual device
# length: length of this segment
# device: block device, referenced by the device name or by major:minor
# offset: starting offset of the mapping on the device

convlen=$(( $nrconv * 524288 ))
seqlen=$(( $nrseq * 524288 ))

if [ $convlen -eq 0 ] && [ $seqlen -eq 0 ]; then
    echo "0 zones..."
    exit 1
fi

# Sector offset of the first sequential write required zone of the disk
seqofst=`zbc_report_zones ${disk} | grep "Sequential-write-required" | head -n1 | cut -f5 -d',' | cut -f3 -d' '`
if [ $convlen -gt $seqofst ]; then
    nrconv=$(( $seqofst / 524288 ))
    echo "Too many conventional zones requested: truncating to $nrconv"
    convlen=$seqofst
fi

if [ $convlen -eq 0 ]; then
    # Sequential zones only
    echo "0 ${seqlen} linear ${disk} ${seqofst}" | dmsetup create small-${dname}
elif [ $seqlen -eq 0 ]; then
    # Conventional zones only
    echo "0 ${convlen} linear ${disk} 0" | dmsetup create small-${dname}
else
    # Conventional zones followed by sequential zones
    echo "0 ${convlen} linear ${disk} 0
${convlen} ${seqlen} linear ${disk} ${seqofst}" | dmsetup create small-${dname}
fi
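
Assuming the script above is saved as small-zbd.sh (the file name is arbitrary) and that the zbc_report_zones utility from libzbc is installed, the 15-zone device of the previous example can be recreated with:

# ./small-zbd.sh /dev/sdb 5 10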

Example: Conventional Zones as a Regular Disk

dm-linear can also be used to aggregate the conventional zones of a zoned block device into a target device that is usable as a regular disk (conventional zones can be randomly written). Reusing the backend disk of the previous example, 524 conventional zones of 524288 sectors (512 B sector unit) are available. The following command creates a dm-linear device joining all conventional zones together.

# echo "0 274726912 linear /dev/sdb 0" | dmsetup create small-sdb
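
The 274726912-sector length used in the table corresponds exactly to the 524 conventional zones of the backend disk:

# echo $(( 524 * 524288 ))
274726912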

The target device is again a host managed disk but contains only conventional zones.

# cat /sys/block/dm-0/queue/zoned
host-managed
# cat /sys/block/dm-0/queue/chunk_sectors
524288
# blkzone report /dev/dm-0
  start: 0x000000000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000080000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000100000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x000180000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  ...
  start: 0x010500000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]
  start: 0x010580000, len 0x080000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)]

Since this zoned block device is composed only of conventional zones, all sectors are randomly writable and the device can thus be used directly with any file system.

# mkfs.ext4 /dev/dm-0
mke2fs 1.44.6 (5-Mar-2019)
Creating filesystem with 34340864 4k blocks and 8585216 inodes
Filesystem UUID: 3957429a-5dab-4b30-9797-f9736036a47b
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
    4096000, 7962624, 11239424, 20480000, 23887872

Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done

# mount /dev/dm-0 /mnt
# ls -l /mnt
total 16
drwx------ 2 root root 16384 May 21 17:03 lost+found

Applications needing frequent random updates to their metadata can use such a setup to facilitate the implementation of a complex metadata structure. The remaining sequential zones of the disk can be used directly by the application to store data, as sketched below.
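
A minimal sketch of such a split setup, reusing the example backend disk: the conventional zone range (274726912 sectors) is exposed as one randomly writable device for metadata, and the remaining sequential zones as a second, zoned device for data (the device names used here are arbitrary).

# conv=274726912
# size=`blockdev --getsize /dev/sdb`
# echo "0 ${conv} linear /dev/sdb 0" | dmsetup create sdb-meta
# echo "0 $(( size - conv )) linear /dev/sdb ${conv}" | dmsetup create sdb-data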

dm-flakey

The dm-flakey target is similar to the dm-linear target except that it exhibits unreliable behavior periodically. This target is useful in simulating failing devices for testing purposes. In the case of zoned block devices, simulating write errors to sequential zones can help in debugging application write pointer management.

Starting from the time the table is loaded, the device operates normally (no errors are generated) for the configured up interval in seconds, then exhibits unreliable behavior for the configured down interval in seconds. This cycle then repeats.

Error Modes

Several error simulation behaviors can be configured.

  • drop_writes: all write I/Os are silently ignored and dropped. Read I/Os are handled correctly.
  • error_writes: all write I/Os are failed with an error signaled. Read I/Os are handled correctly.
  • corrupt_bio_byte: during the down time, replace the Nth byte of the data of each read or write block I/O with a specified value.

The default error mode is to fail all I/O requests during the down time of the simulation cycle.
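
A minimal sketch of a dm-flakey setup covering the whole example disk, with 60 seconds of normal operation followed by 5 seconds during which all writes are silently dropped (the interval values and the drop_writes feature are chosen here for illustration; the table line format is "start length flakey device offset up_interval down_interval [num_features [feature_args]]").

# echo "0 `blockdev --getsize /dev/sdb` flakey /dev/sdb 0 60 5 1 drop_writes" | \
  dmsetup create flakey-sdb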

Zoned Block Device Restrictions

The same restrictions as for the dm-linear target apply.

Examples

Detailed dm-flakey documentation and usage examples can be found in the kernel source code documentation file Documentation/device-mapper/dm-flakey.txt.

dm-zoned

The dm-zoned device mapper target provides unconstrained write access to zoned block devices. It hides from the device user (a file system or an application doing raw block device accesses) any sequential write constraints on host managed devices and can mitigate a potential device-side performance degradation with host aware zoned devices. In essence, the dm-zoned target is a host level implementation of a drive managed model (See SMR Interface Implementations).

File systems or applications that can directly support host managed zoned block devices (e.g. the f2fs file system since kernel 4.10) do not need to use the dm-zoned device mapper target.
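
For example, an f2fs file system can be created directly on the host managed disk; a minimal sketch, assuming an f2fs-tools version whose mkfs.f2fs supports the -m option enabling zoned block device mode:

# mkfs.f2fs -m /dev/sdb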

Design Overview

The dm-zoned target implementation focuses on simplicity and on minimizing host resource overhead (CPU and memory consumption) while maximizing the usable capacity of the backing zoned block device. For a 10 TB host managed disk with 256 MB zones, dm-zoned memory usage per disk instance is at most 4.5 MB, and as few as 5 zones will be used internally for storing metadata and performing garbage collection (zone reclaim) operations.

The figure below illustrates the dm-zoned zone usage principle.

Figure: Zone mapping overview of the dm-zoned device mapper target.

The zones of the device are separated into two types:

  1. Metadata zones: randomly writable zones used to store metadata. Randomly writable zones may be conventional zones or sequential write preferred zones (host aware devices only). Metadata zones are not reported as usable capacity to the user.

  2. Data zones: all remaining zones, the majority of which will be sequential zones. These are used exclusively to store user data. The conventional zones (or part of the sequential write preferred zones on a host aware device) may also be used for buffering incoming random writes. Data in these zones may be permanently mapped to the randomly writable zone initially used, or moved to a sequential zone after some time so that the random zone can be reused for buffering new incoming random writes.

As shown in the above figure, the target device is divided into chunks that have the same size as the backing zoned device zones. A logical chunk can be mapped to zones of the backing zoned device in different ways.

  1. Conventional zone mapping: this is the case for chunk A in the figure, which is mapped to the conventional zone CA. This is the default mapping initialized when the first write command is issued to an empty (unwritten) chunk. As long as a chunk is mapped to a conventional zone, any incoming write request can be directly executed using the mapped conventional zone.
  2. Sequential zone mapping: a chunk can also be initially mapped to a sequential zone, as shown for chunk C mapped to the sequential zone SC in the figure. With such a mapping, an already written block of the chunk cannot be modified directly. To handle this case, the next mapping type is used.
  3. Dual conventional-sequential zone mapping: to process data updates to written blocks of a chunk mapped to a sequential zone, a conventional zone may be temporarily added to the chunk mapping. Any write targeting an already written block will then be processed using the conventional zone rather than the sequential zone.

The dm-zoned metadata includes a set of bitmaps to track the validity state of blocks in the backing device zones. The execution of any write operation is always followed by an update of the bitmaps to mark the written blocks as valid. In the case of the dual conventional-sequential chunk mapping, the bitmap for the blocks of the sequential zone is also updated to clear the bits representing the blocks that were updated but written to the conventional zone. This way, incoming reads always gain access to the latest version of the block data by simply inspecting the block validity bitmaps.

The dm-zoned target always exposes a logical device with a logical block size of 4096 B (4 KiB), regardless of the logical sector size of the backend zoned device being used. This reduces the amount of metadata (block bitmap size) needed to track valid (written) blocks.
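
Once a dm-zoned target is created (see Userspace Tool below), the exposed logical block size can be verified from sysfs; dm-0 here stands for the dm-zoned target device.

# cat /sys/block/dm-0/queue/logical_block_size
4096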

On-Disk Metadata

The dm-zoned on-disk metadata format is as follows.

  1. The first block of the first randomly writable zone found contains the super block, which describes the number and position on disk of the metadata blocks.

  2. Following the super block, a set of blocks is used to describe the mapping of the logical chunks of the target logical device to data zones. The mapping is indexed by logical chunk number and each mapping entry indicates the data zone storing the chunk data and, optionally, the zone number of a random zone used to buffer random modifications to the chunk data.

  3. A set of blocks used to store bitmaps indicating the validity of blocks in the data zones follows the mapping table blocks. A valid block is a block that was written and not discarded. For a chunk with a dual mapping, a given block can be valid in either the sequential zone or the conventional zone, but not in both.

Read-Write Processing

For a logical chunk mapped to a random data zone, all write operations are processed by directly writing to the data zone. If the chunk is mapped to a sequential zone, the write operation is processed directly if and only if the write offset within the logical chunk is equal to the write pointer offset within the sequential data zone (i.e. the write operation is aligned on the zone write pointer). Otherwise, write operations are processed indirectly using a buffer zone: a randomly writable free data zone is allocated and assigned to the chunk being accessed in addition to the already mapped sequential data zone. Writing a block to the conventional zone invalidates the same block in the sequential zone.

Read operations are processed according to the block validity information provided by the bitmaps. Valid blocks are read from one of the zones of the chunk mapping.

Conventional Zone Reclaim

When all of the limited number of conventional zones of the backing device are used to map chunks, processing incoming random writes to unwritten chunks or to chunks mapped to sequential zones becomes impossible.

To avoid this situation, a reclaim process regularly scans used conventional zones and reclaims them by rewriting (sequentially) the valid blocks of the zone to an unused sequential zone. Once all valid blocks have been copied to the sequential zone, the chunk mapping is updated to point to the sequential zone and the conventional zone is freed for reuse.

Reclaiming the conventional zone used for a chunk with a dual mapping requires either copying all valid blocks to the sequential zone of the mapping or, if that is not possible, copying the valid blocks of the two mapping zones to an unused sequential zone. This is the case illustrated in the figure above, with the valid blocks of the zones C and SC being moved to another sequential zone.

Metadata Protection

To protect internal metadata against corruption in case of a sudden power loss or system crash, two sets of metadata zones are used. One set, the primary set, is used as the main metadata set, while the secondary set is used as a log. Modified metadata are first written to the secondary set, and the log thus created is validated by writing an updated super block in the secondary set. Once this log operation completes, in-place updates of the metadata blocks can be done in the primary metadata set, ensuring that one of the sets is always correct. Flush operations are used as a commit point: upon reception of a flush operation, metadata activity is temporarily stopped, all dirty metadata are logged and updated, and then normal operation resumes. This only temporarily delays write and discard requests; read requests can still be processed while metadata logging is executed.

Userspace Tool

The dmzadm command line utility formats backend zoned devices for use with the dm-zoned device mapper target. This utility verifies the device zone model and prepares and writes the on-disk dm-zoned metadata according to the device capacity, zone size, etc.

The source code for the dmzadm utility is hosted on GitHub. The project README file provides instructions on how to compile and install the utility.

To create a dm-zoned target device, a zoned block device must first be formatted using the dmzadm utility. This tool analyzes the device zone configuration, determines where to place the metadata sets and initializes the on-disk metadata used by the dm-zoned target driver.

An example execution is shown below with a 15TB host managed disk.

# dmzadm --format /dev/sdb
/dev/sdb: 29297213440 512-byte sectors (13970 GiB)
  Host-managed device
  55880 zones of 524288 512-byte sectors (256 MiB)
  65536 4KB data blocks per zone
Resetting sequential zones
Writing primary metadata set
  Writing mapping table
  Writing bitmap blocks
  Writing super block
Writing secondary metadata set
  Writing mapping table
  Writing bitmap blocks
  Writing super block
Syncing disk
Done.

The dm-zoned target device can now be created using the dmsetup utility. The following script (dm-zoned-setup.sh) simplifies this procedure.

#!/bin/sh

if [ $# -lt 1 ]; then
    echo "Usage: $0 <Zoned device path> [Options]"
    echo "Options:"
    echo "  rlow=<perc>      : Start reclaiming random zones if the "
    echo "                     percentage of free random data zones falls "
    echo "                     below <perc>."
    echo "  idle_rlow=<perc> : When the disk is idle (no I/O activity), "
    echo "                     start reclaiming random zones if the "
    echo "                     percentage of free random data zones falls "
    echo "                     below <perc>."
    exit 1
fi

dev="${1}"
shift
options="$@"

modprobe dm-zoned

echo "0 `blockdev --getsize ${dev}` zoned ${dev} ${options}" | \
dmsetup create zoned-`basename ${dev}`

The following options can be passed to dmsetup to tune the target operation.

Option              Description
rlow=<perc>         Start reclaiming conventional zones if the percentage of free conventional data zones falls below <perc>.
idle_rlow=<perc>    When the disk is idle (no I/O activity), start reclaiming conventional zones if the percentage of free conventional data zones falls below <perc>.
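
For example, both thresholds can be specified at creation time; the percentage values below are only illustrative.

# dm-zoned-setup /dev/sdb rlow=30 idle_rlow=50

Creating the target with the default settings:
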
# dm-zoned-setup /dev/sdb
# ls -l /dev/mapper/
total 0
crw------- 1 root root 10, 236 May 20 15:35 control
lrwxrwxrwx 1 root root       7 May 21 20:29 zoned-sdb -> ../dm-0

The target device is now created. More messages are visible in the kernel log, indicating the number of zones lost to dm-zoned management: 4 metadata zones plus 16 sequential zones reserved for the execution of conventional zone reclaim.

# dmesg
...
device-mapper: zoned metadata: (sdb): Using 2682240 B for zone information
device-mapper: zoned metadata: (sdb): Using super block 0 (gen 1)
device-mapper: zoned metadata: (sdb): Host-managed zoned block device
device-mapper: zoned metadata: (sdb):   29297213440 512-byte logical sectors
device-mapper: zoned metadata: (sdb):   55880 zones of 524288 512-byte logical sectors
device-mapper: zoned metadata: (sdb):   4 metadata zones
device-mapper: zoned metadata: (sdb):   55860 data zones for 55860 chunks
device-mapper: zoned metadata: (sdb):     520 random zones (520 unmapped)
device-mapper: zoned metadata: (sdb):     55340 sequential zones (55340 unmapped)
device-mapper: zoned metadata: (sdb):   16 reserved sequential data zones
device-mapper: zoned metadata: (sdb): Format:
device-mapper: zoned metadata: (sdb): 111871 metadata blocks per set (656 max cache)
device-mapper: zoned metadata: (sdb):   110 data zone mapping blocks
device-mapper: zoned metadata: (sdb):   111760 bitmap blocks
device-mapper: zoned: (sdh): Target device: 29286727680 512-byte logical sectors (3660840960 blocks)

The target device created is a regular disk that can be used with any file system.

# cat /sys/block/dm-0/queue/zoned
none
# mkfs.ext4 /dev/dm-0
mke2fs 1.44.6 (5-Mar-2019)
Discarding device blocks: done
Creating filesystem with 3660840960 4k blocks and 457605120 inodes
Filesystem UUID: f91ab6a5-4b71-4577-b2d2-22e043b1f083
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
    4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
    102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
    2560000000

Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done

# mount /dev/dm-0 /mnt
# ls -l /mnt
total 16
drwx------ 2 root root 16384 May 21 20:33 lost+found