ZFS Deduplication
ZFS supports deduplication as a feature. Deduplication means that identical data is only stored once, which can greatly reduce storage size. However, deduplication is a balance between many factors, including cost, speed, and resource needs. The implications must be understood and considered carefully before enabling it on a pool.
Deduplication is one technique ZFS can use to store file and other data in a pool. If several files contain the same pieces (blocks) of data, or any other pool data occurs more than once in the pool, ZFS stores just one copy of it. In effect, instead of storing many copies of a book, it stores one copy and an arbitrary number of pointers to that one copy. Only when no file uses that data is the data actually deleted. ZFS keeps a reference table that links files and pool data to the actual storage blocks containing “their” data. This is the Deduplication Table (DDT).
The DDT is a fundamental ZFS structure. It is treated as part of the pool’s metadata. If a pool (or any dataset in the pool) has ever contained deduplicated data, the pool contains a DDT, and that DDT is as fundamental to the pool data as any of its other file system tables. Like any other metadata, DDT contents may temporarily be held in the ARC (RAM/memory cache) or L2ARC (disk cache) for speed and repeated use, but the DDT is not a disk cache. It is a fundamental part of the ZFS pool structure and of how ZFS organises pool data on its disks. Therefore, like any other pool data, if DDT data is lost the pool is likely to become unreadable, so it is important that the DDT is stored on redundant devices.
A pool can contain any mix of deduplicated and non-deduplicated data, coexisting. Data is written using the DDT if deduplication is enabled at the time of writing, and is written non-deduplicated if deduplication is not enabled at the time of writing. The data then remains as it was written until it is deleted.
The only way to convert existing data to be deduplicated or non-deduplicated, or to change how it is deduplicated, is to create a new copy of it while the new settings are active. This can be done by copying the data within a file system, copying it to a different file system, or replicating it using zfs send and zfs receive or the Web UI replication functions. Data in snapshots is fixed, and can only be changed by replicating the snapshot to a different pool with different settings (which preserves its snapshot status) or by copying its contents.
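For example, an existing dataset can be rewritten with deduplication enabled by replicating it to a new dataset that has the property set at receive time. This is a minimal sketch only; the pool name tank and dataset names are hypothetical, and a recent OpenZFS version is assumed for the receive -o option:
zfs snapshot tank/docs@migrate
zfs send tank/docs@migrate | zfs receive -o dedup=on tank/docs-dedup
The received copy is written through the DDT; the original tank/docs remains non-deduplicated until it is destroyed.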
It is possible to specify that only certain datasets and volumes in a pool are deduplicated. The DDT encompasses the entire pool, but only data in those locations is deduplicated when written. Other data, which will not deduplicate well or where deduplication is inappropriate, is not deduplicated when written, saving resources.
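As a sketch, assuming a hypothetical pool named tank, the dedup property controls this per dataset:
zfs set dedup=on tank/vm-backups
zfs set dedup=off tank/media
zfs get dedup tank/vm-backups tank/media
Only new writes to tank/vm-backups go through the DDT; tank/media is written normally.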
The main benefit of deduplication is that, where appropriate, it can greatly reduce the size of a pool and the disk count and cost. For example, if a server stores files with identical blocks, it could store thousands or even millions of copies for almost no extra disk space. When data is read or written, it is also possible that a large block read or write can be replaced by a smaller DDT read or write, reducing disk I/O size and quantity.
The deduplication process is very demanding! There are four main costs to using deduplication: large amounts of RAM, fast SSDs, CPU resources, and a general reduction in performance. So the trade-off with deduplication is reduced server RAM/CPU/SSD performance and loss of “top end” I/O speeds in exchange for saving storage size and pool expenditures.
When data is not sufficiently duplicated, deduplication wastes resources, slows the server down, and has no benefit. Even when data is heavily duplicated, consider the costs, hardware demands, and performance impact of deduplication before enabling it on a ZFS pool.
High quality mirrored SSDs configured as a “special vdev” for the DDT (and usually all metadata) are strongly recommended for deduplication unless the entire pool is built with high quality SSDs. Expect potentially severe issues if these are not used as described below. NVMe SSDs are recommended whenever possible. SSDs must be large enough to store all metadata.
The deduplication table (DDT) contains small entries about 300-900 bytes in size. It is primarily accessed using 4K reads. This places extreme demand on the disks containing the DDT.
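As a rough illustrative calculation: a pool holding 20 TB of unique data with an average block size of 128 KB has about 160 million unique blocks, so at 300-900 bytes per entry the on-disk DDT alone is on the order of 45-140 GB, nearly all of it accessed as small random reads.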
When choosing SSDs, remember that a deduplication-enabled server can have considerable mixed I/O and very long sustained access. Try to find “real-world” performance data wherever possible. It is recommended to use SSDs that do not rely on a limited amount of fast cache to bolster weak sustained performance. Most SSDs’ performance (latency) drops when the onboard cache is fully used and more writes occur. Always review the steady state performance for 4K random mixed read/write.
Special vdev SSDs receive continuous, heavy I/O. HDDs and many common SSDs are inadequate. As of 2021, some recommended SSDs for deduplicated ZFS include Intel Optane 900p, 905p, P48xx, and better devices. Lower cost solutions are high quality consumer SSDs such as the Samsung EVO and PRO models. PCIe NVMe SSDs (NVMe, M.2 “M” key, or U.2) are recommended over SATA SSDs (SATA or M.2 “B” key).
When special vdevs cannot contain all the pool metadata, then metadata is silently stored on other disks in the pool. When special vdevs become too full (about 85%-90% usage), ZFS cannot run optimally and the disks operate slower. Try to keep special vdev usage under 65%-70% capacity whenever possible. Try to plan how much future data will be added to the pool, as this increases the amount of metadata in the pool. More special vdevs can be added to a pool when more metadata storage is needed.
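As an illustrative sketch only (device names are placeholders that differ per system, and the TrueNAS web UI pool manager is the usual way to add vdevs), a mirrored special vdev can be attached to an existing pool and then monitored for capacity:
zpool add tank special mirror nvme0n1 nvme1n1
zpool list -v tank
Note that a special vdev generally cannot be removed later from a pool that contains RAIDZ vdevs, so plan its size carefully before adding it.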
Deduplication is memory intensive. When the system does not contain sufficient RAM, it cannot cache the DDT in memory when it is read, and system performance can decrease.
The RAM requirement depends on the size of the DDT and how much data will be stored in the pool. The more duplicated the data, the fewer the entries and the smaller the DDT. Pools suitable for deduplication, with deduplication ratios of 3x or more (data can be reduced to a third or less of its original size), might only need 1-3 GB of RAM per 1 TB of data. The actual DDT size can be estimated by deduplicating a limited amount of data in a temporary test pool, or by running zdb -S from the command line.
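For example, assuming a hypothetical pool named tank, a dry-run estimate can be produced with:
zdb -S tank
The simulation reads the existing data, so it can take a long time and generate heavy I/O on a large pool; the summary at the end includes the estimated deduplication ratio and a simulated DDT histogram.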
The tunable vfs.zfs.arc.meta_min (type=LOADER, value=bytes) can be used to force ZFS to reserve no less than the given amount of RAM for metadata caching.
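A minimal sketch, assuming a reservation of 16 GiB (17179869184 bytes) is wanted; the exact value depends on the measured DDT size. On a FreeBSD-based system the current value can be read from the command line:
sysctl vfs.zfs.arc.meta_min
To set it persistently, add a Tunable in the web UI with Variable vfs.zfs.arc.meta_min, Value 17179869184, and Type LOADER; loader tunables take effect after a reboot.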
Deduplication consumes extensive CPU resources and it is recommended to use a high-end CPU with 4-6 cores at minimum.
If deduplication is used in an inadequately built system, these symptoms might be seen:
**Insufficient RAM**
- Cause: Continuous DDT access is limiting the available RAM, or RAM usage is generally very high. This can also slow memory access if the system uses swap space on disk to compensate.
- Diagnose: Open the command line and enter top. The header indicates ARC and other memory usage statistics. Additional commands for investigating RAM or ARC usage and performance are arc_summary and arcstat (sample invocations are shown after this list).
- Solutions:
  - Install more RAM.
  - Add a new System > Tunable: vfs.zfs.arc.meta_min with Type=LOADER and Value=bytes. This specifies the minimum RAM that is reserved for metadata use and cannot be evicted from RAM when new file data is cached.
**Disk I/O Slowdown**
- Cause: The system must perform disk I/O to fetch DDT entries, but these are usually small 4K operations and the underlying disk hardware is unable to cope in a timely manner.
- Diagnose: Open the command line and enter gstat to show heavy I/O traffic for either the DDT or the pool in general, although DDT traffic is more often the cause. zpool iostat is another option that can show unexpected or very high disk latencies. When networking slowdowns are also seen, tcpdump or an application’s TCP monitor can show a low or zero TCP window for extended durations.
- Solutions: Add high quality SSDs as a special vdev and either move the data or rebuild the pool to use the new storage.
**Unexpected Disconnection of Networked Resources**
- Cause: This is a byproduct of the Disk I/O Slowdown issue. Network buffers become congested with incomplete requests for file data, and the entire ZFS I/O system is delayed by tens or hundreds of seconds while huge numbers of DDT entries are fetched. Timeouts occur when networking buffers can no longer handle the demand. Because all services on a network connection share the same buffers, all of them become blocked. This is usually seen as file activity working for a while and then unexpectedly stalling, after which file and networked sessions fail as well. Services can become responsive again when the disk I/O backlog clears, but this can take several minutes. This problem is more likely with high speed networking because the network buffers fill faster.
**CPU Starvation**
- Cause: When ZFS has fast special vdev SSDs, sufficient RAM, and is not limited by disk I/O, hash calculation becomes the next bottleneck. Most of the ZFS CPU consumption comes from trying to keep hashing up to date with disk I/O. When the CPU is overburdened, the console becomes unresponsive and the web UI fails to connect. Other tasks might not run properly because of timeouts. This is often encountered during pool scrubs, and it can be necessary to pause the scrub temporarily when other tasks are a priority.
- Diagnose: An easily seen symptom is that console logins or prompts take several seconds to display. Running top can confirm the issue: multiple entries with the command *kernel {z_rd_int_[NUMBER]}* are using the CPU capacity, and the CPU is heavily (98%+) used with almost no idle time.
- Solutions: Changing to a more performant CPU can help but might have limited benefit; 40 core CPUs have been observed to struggle as much as 4 or 8 core CPUs. A usual workaround is to temporarily pause scrubs and other background ZFS activities that generate large amounts of hashing. It can also be possible to limit I/O using tunables that control disk queues and disk I/O ceilings, but this can impact general performance and is not recommended.
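A minimal sketch of the diagnostic and workaround commands referenced above, run from the Shell; the pool name tank is a placeholder and intervals are in seconds:
top                      # ARC size and free memory are shown in the header
arc_summary              # detailed ARC and metadata cache statistics
arcstat 5                # ARC hit/miss rates, refreshed every 5 seconds
gstat                    # per-disk load and latency (FreeBSD-based systems)
zpool iostat -v tank 5   # per-vdev bandwidth and operations every 5 seconds
zpool scrub -p tank      # pause a running scrub
zpool scrub tank         # run again later to resume the paused scrub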
DDT statistics for a pool can be displayed from the command line with zpool status -Dv. Typical output will include a line like this:
dedup: DDT entries 227317061, size 672B on disk, 217B in core
This means that the DDT contains 227 million entries (unique blocks), each using 672 bytes in the pool and 217 bytes of RAM when cached in the ARC. The two values differ because ZFS uses different structures for DDT entries on disk and in RAM. There is also a table showing how many blocks (actual and referenced) are duplicated, summarized in bands (or “buckets”) of powers of 2, with their average actual and referenced sizes.
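As an illustrative calculation with hypothetical figures: if the totals of that table show 3 TB of referenced data but only 1 TB actually allocated, the effective deduplication ratio is 3 ÷ 1 = 3x, meaning roughly two thirds of the storage that would otherwise be needed has been saved.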
Checking disk usage for each individual vdev (for example with zpool list -v) helps confirm that the DDT has not overflowed onto other disks in the pool.
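For example, assuming a hypothetical pool named tank:
zpool status -Dv tank
zpool list -v tank
The first command prints the DDT summary and histogram described above; the second prints capacity and allocation for each vdev, including any special vdevs.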