ZFS Deduplication
ZFS supports deduplication as a feature. Deduplication means that identical data is only stored once, which can greatly reduce storage size. However, deduplication is a balance between many factors, including cost, speed, and resource needs. The implications must be understood and considered carefully before enabling it on a pool.
Deduplication is one technique ZFS can use to store file and other data in a pool. If several files contain the same pieces (blocks) of data, or any other pool data occurs more than once in the pool, ZFS stores just one copy of it. In effect, instead of storing many copies of a book, it stores one copy and an arbitrary number of pointers to that one copy. Only when no file uses that data is the data actually deleted. ZFS keeps a reference table that links files and pool data to the actual storage blocks containing “their” data. This is the Deduplication Table (DDT).
The DDT is a fundamental ZFS structure. It is treated as part of the pool’s metadata. If a pool (or any dataset in the pool) has ever contained deduplicated data, the pool contains a DDT, and that DDT is as fundamental to the pool data as any of its other file system tables. Like any other metadata, DDT contents may temporarily be held in the ARC (RAM/memory cache) or L2ARC (disk cache) for speed and repeated use, but the DDT is not a disk cache. It is a fundamental part of the ZFS pool structure and of how ZFS organises pool data on its disks. Therefore, like any other pool data, if DDT data is lost the pool is likely to become unreadable, so it is important that the DDT is stored on redundant devices.
A pool can contain any mix of deduplicated and non-deduplicated data, coexisting. Data is written using the DDT if deduplication is enabled at the time of writing, and is written non-deduplicated if deduplication is not enabled at the time of writing. The data then remains as it was written until it is deleted.
The only way to convert existing data to be deduplicated or non-deduplicated, or to change how it is deduplicated, is to create a new copy of it while the new settings are active. This can be done by copying the data within a file system, copying it to a different file system, or replicating it using zfs send and zfs receive or the Web UI replication functions. Data in snapshots is fixed, and can only be changed by replicating the snapshot to a different pool with different settings (which preserves its snapshot status) or by copying its contents.
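For example, an existing dataset can be rewritten with deduplication enabled by replicating it to a new dataset that has the property set at receive time. This is a minimal sketch only; the pool name tank and dataset names are hypothetical, and a recent OpenZFS version is assumed for the receive -o option:
zfs snapshot tank/docs@migrate
zfs send tank/docs@migrate | zfs receive -o dedup=on tank/docs-dedup
The received copy is written through the DDT; the original tank/docs remains non-deduplicated until it is destroyed.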
It is possible to specify that only certain datasets and volumes in a pool are deduplicated. The DDT encompasses the entire pool, but only data in those locations is deduplicated when written. Other data, which will not deduplicate well or where deduplication is inappropriate, is not deduplicated when written, saving resources.
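As a sketch, assuming a hypothetical pool named tank, the dedup property controls this per dataset:
zfs set dedup=on tank/vm-backups
zfs set dedup=off tank/media
zfs get dedup tank/vm-backups tank/media
Only new writes to tank/vm-backups go through the DDT; tank/media is written normally.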
The main benefit of deduplication is that, where appropriate, it can greatly reduce the size of a pool and the disk count and cost. For example, if a server stores files with identical blocks, it could store thousands or even millions of copies for almost no extra disk space. When data is read or written, it is also possible that a large block read or write can be replaced by a smaller DDT read or write, reducing disk I/O size and quantity.
The deduplication process is very demanding! There are four main costs to using deduplication: large amounts of RAM, fast SSDs, CPU resources, and a general reduction in performance. So the trade-off with deduplication is reduced server RAM/CPU/SSD performance and loss of “top end” I/O speeds in exchange for saving storage size and pool expenditures.
When data is not sufficiently duplicated, deduplication wastes resources, slows the server down, and has no benefit. Even when data is heavily duplicated, consider the costs, hardware demands, and performance impact of deduplication before enabling it on a ZFS pool.
High quality mirrored SSDs configured as a “special vdev” for the DDT (and usually all metadata) are strongly recommended for deduplication unless the entire pool is built with high quality SSDs. Expect potentially severe issues if these are not used as described below. NVMe SSDs are recommended whenever possible. SSDs must be large enough to store all metadata.
The deduplication table (DDT) contains small entries about 300-900 bytes in size. It is primarily accessed using 4K reads. This places extreme demand on the disks containing the DDT.
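As a rough illustrative calculation: a pool holding 20 TB of unique data with an average block size of 128 KB has about 160 million unique blocks, so at 300-900 bytes per entry the on-disk DDT alone is on the order of 45-140 GB, nearly all of it accessed as small random reads.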
When choosing SSDs, remember that a deduplication-enabled server can have considerable mixed I/O and very long sustained access. Try to find “real-world” performance data wherever possible. It is recommended to use SSDs that do not rely on a limited amount of fast cache to bolster weak sustained performance. Most SSDs’ performance (latency) drops when the onboard cache is fully used and more writes occur. Always review the steady state performance for 4K random mixed read/write.
Special vdev SSDs receive continuous, heavy I/O. HDDs and many common SSDs are inadequate. As of 2021, some recommended SSDs for deduplicated ZFS include Intel Optane 900p, 905p, P48xx, and better devices. Lower cost solutions are high quality consumer SSDs such as the Samsung EVO and PRO models. PCIe NVMe SSDs (NVMe, M.2 “M” key, or U.2) are recommended over SATA SSDs (SATA or M.2 “B” key).
When special vdevs cannot contain all the pool metadata, then metadata is silently stored on other disks in the pool. When special vdevs become too full (about 85%-90% usage), ZFS cannot run optimally and the disks operate slower. Try to keep special vdev usage under 65%-70% capacity whenever possible. Try to plan how much future data will be added to the pool, as this increases the amount of metadata in the pool. More special vdevs can be added to a pool when more metadata storage is needed.
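As an illustrative sketch only (device names are placeholders that differ per system, and the TrueNAS web UI pool manager is the usual way to add vdevs), a mirrored special vdev can be attached to an existing pool and then monitored for capacity:
zpool add tank special mirror nvme0n1 nvme1n1
zpool list -v tank
Note that a special vdev generally cannot be removed later from a pool that contains RAIDZ vdevs, so plan its size carefully before adding it.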
Deduplication is memory intensive. When the system does not contain sufficient RAM, it cannot cache the DDT in memory when it is read, and system performance can decrease.
The RAM requirement depends on the size of the DDT and how much data will be stored in the pool. The more duplicated the data, the fewer the entries and the smaller the DDT. Pools suitable for deduplication, with deduplication ratios of 3x or more (data can be reduced to a third or less of its original size), might only need 1-3 GB of RAM per 1 TB of data. The actual DDT size can be estimated by deduplicating a limited amount of data in a temporary test pool, or by running zdb -S from the command line.
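For example, assuming a hypothetical pool named tank, a dry-run estimate can be produced with:
zdb -S tank
The simulation reads the existing data, so it can take a long time and generate heavy I/O on a large pool; the summary at the end includes the estimated deduplication ratio and a simulated DDT histogram.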
The tunable vfs.zfs.arc.meta_min (type=LOADER, value=bytes) can be used to force ZFS to reserve no less than the given amount of RAM for metadata caching.
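A minimal sketch, assuming a reservation of 16 GiB (17179869184 bytes) is wanted; the exact value depends on the measured DDT size. On a FreeBSD-based system the current value can be read from the command line:
sysctl vfs.zfs.arc.meta_min
To set it persistently, add a Tunable in the web UI with Variable vfs.zfs.arc.meta_min, Value 17179869184, and Type LOADER; loader tunables take effect after a reboot.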
Deduplication consumes extensive CPU resources and it is recommended to use a high-end CPU with 4-6 cores at minimum.
If deduplication is used in an inadequately built system, these symptoms might be seen:
**Insufficient RAM**
- Cause: Continuous DDT access is limiting the available RAM, or RAM usage is generally very high. This can also slow memory access if the system uses swap space on disk to compensate.
- Diagnose: Open the command line and enter top. The header indicates ARC and other memory usage statistics. Additional commands for investigating RAM or ARC usage and performance are arc_summary and arcstat (sample invocations are shown after this list).
- Solutions:
  - Install more RAM.
  - Add a new System > Tunable: vfs.zfs.arc.meta_min with Type=LOADER and Value=bytes. This specifies the minimum RAM that is reserved for metadata use and cannot be evicted from RAM when new file data is cached.
**Disk I/O Slowdown**
- Cause: The system must perform disk I/O to fetch DDT entries, but these are usually small 4K operations and the underlying disk hardware is unable to cope in a timely manner.
- Diagnose: Open the command line and enter gstat to show heavy I/O traffic for either the DDT or the pool in general, although DDT traffic is more often the cause. zpool iostat is another option that can show unexpected or very high disk latencies. When networking slowdowns are also seen, tcpdump or an application’s TCP monitor can show a low or zero TCP window for extended durations.
- Solutions: Add high quality SSDs as a special vdev and either move the data or rebuild the pool to use the new storage.
**Unexpected Disconnection of Networked Resources**
- Cause: This is a byproduct of the Disk I/O Slowdown issue. Network buffers become congested with incomplete requests for file data, and the entire ZFS I/O system is delayed by tens or hundreds of seconds while huge numbers of DDT entries are fetched. Timeouts occur when networking buffers can no longer handle the demand. Because all services on a network connection share the same buffers, all of them become blocked. This is usually seen as file activity working for a while and then unexpectedly stalling, after which file and networked sessions fail as well. Services can become responsive again when the disk I/O backlog clears, but this can take several minutes. This problem is more likely with high speed networking because the network buffers fill faster.
**CPU Starvation**
- Cause: When ZFS has fast special vdev SSDs, sufficient RAM, and is not limited by disk I/O, hash calculation becomes the next bottleneck. Most of the ZFS CPU consumption comes from trying to keep hashing up to date with disk I/O. When the CPU is overburdened, the console becomes unresponsive and the web UI fails to connect. Other tasks might not run properly because of timeouts. This is often encountered during pool scrubs, and it can be necessary to pause the scrub temporarily when other tasks are a priority.
- Diagnose: An easily seen symptom is that console logins or prompts take several seconds to display. Running top can confirm the issue: multiple entries with the command *kernel {z_rd_int_[NUMBER]}* are using the CPU capacity, and the CPU is heavily (98%+) used with almost no idle time.
- Solutions: Changing to a more performant CPU can help but might have limited benefit; 40 core CPUs have been observed to struggle as much as 4 or 8 core CPUs. A usual workaround is to temporarily pause scrubs and other background ZFS activities that generate large amounts of hashing. It can also be possible to limit I/O using tunables that control disk queues and disk I/O ceilings, but this can impact general performance and is not recommended.
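A minimal sketch of the diagnostic and workaround commands referenced above, run from the Shell; the pool name tank is a placeholder and intervals are in seconds:
top                      # ARC size and free memory are shown in the header
arc_summary              # detailed ARC and metadata cache statistics
arcstat 5                # ARC hit/miss rates, refreshed every 5 seconds
gstat                    # per-disk load and latency (FreeBSD-based systems)
zpool iostat -v tank 5   # per-vdev bandwidth and operations every 5 seconds
zpool scrub -p tank      # pause a running scrub
zpool scrub tank         # run again later to resume the paused scrub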
DDT statistics for a pool can be displayed from the command line with zpool status -Dv. Typical output will include a line like this:
dedup: DDT entries 227317061, size 672B on disk, 217B in core
This means that the DDT contains 227 million entries (unique blocks), each using 672 bytes in the pool and 217 bytes of RAM when cached in the ARC. The two values differ because ZFS uses different structures for DDT entries on disk and in RAM. There is also a table showing how many blocks (actual and referenced) are duplicated, summarized in bands (or “buckets”) of powers of 2, with their average actual and referenced sizes.
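As an illustrative calculation with hypothetical figures: if the totals of that table show 3 TB of referenced data but only 1 TB actually allocated, the effective deduplication ratio is 3 ÷ 1 = 3x, meaning roughly two thirds of the storage that would otherwise be needed has been saved.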
Checking disk usage for each individual vdev (for example with zpool list -v) helps confirm that the DDT has not overflowed onto other disks in the pool.
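For example, assuming a hypothetical pool named tank:
zpool status -Dv tank
zpool list -v tank
The first command prints the DDT summary and histogram described above; the second prints capacity and allocation for each vdev, including any special vdevs.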