ZFS. It's not a filesystem, it's an ecosystem
What is a filesystem?
All computers need to give you access to files. This seems obvious at first, but most people don't think about how those files get stored. Files need to be stored on a disk (or on a network, but let's focus on disks), and that disk needs a way to know where files are, how big they are, etc. This is all part of a filesystem. Some common ones that people may know about are:
- NTFS
- FAT32
- EXT4
- HFS+
- APFS
These are just a few examples that can be found on different operating systems, and you are bound to recognize at least one.
What's the point? Isn't keeping track of files easy?
Different filesystems are built with different goals or operating systems in mind. As a quick example, HFS+ was built before SSDs were common and is optimized for spinning disk drives. That doesn't mean you can't use it on a solid state drive, but the performance could be better. This is what gave rise to APFS on the Mac, which is built with SSDs in mind. Once again, you can use it on spinning disk drives, but it generally won't perform as well as HFS+ there.
Another big area where filesystems differ is features. More modern filesystems may offer things like on-disk compression to save space without losing data, permissions to prevent users from accessing, modifying, or running files that they aren't allowed to, and much more. Not all filesystems are created equal, and each has upsides and downsides.
ZFS is a filesystem, but it's also not
Why explain what a filesystem is if ZFS is not one? Well, ZFS is not just a filesystem. It includes a filesystem as a component, but it is far more. I won't explain all of the features it offers here, just some of the more useful ones that I take advantage of.
Redundant Array of Independent Disks (RAID)
RAID is a complex topic, so I'll only get into the basics here. It allows you to use more than one disk (SSD, spinning disk, etc.) as one logical drive. There are many solutions for RAID, from hardware-backed RAID cards, to software in your BIOS/EFI, to LVM. One of the main drawbacks of hardware RAID is that if your RAID card dies, you may not be able to read your data without a compatible replacement card. ZFS, on the other hand, stores the array layout on the disks themselves, so as long as enough of the disks show up, the data is there. ZFS also allows some special types of RAID, covered later, that aren't possible with traditional RAID without setting up complex layers of software on top of it. You can read a bit more about ZFS vs Hardware RAID here.
Combining RAID types (VDEV)
Storing lots of data means that combining multiple RAID types is sometimes more cost or performance efficient. A common example is RAID 10: a RAID 0 (stripe) laid on top of two or more RAID 1 (mirror) sets. It would look something like this.
In ZFS, we call these sections of disks VDEVs. The above image would show 2 disks in each VDEV, and the stripe over all VDEVs is known as a "pool". Every pool has at least 1 VDEV, even if that VDEV is a single disk.
Here is an example of the ZFS root pool used in one of my servers.
╰─$ zpool list zroot -v
NAME          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG    CAP  DEDUP  HEALTH  ALTROOT
zroot         476G   199G   277G        -         -   14%    41%  1.00x  ONLINE  -
  nvme0n1p2   476G   199G   277G        -         -   14%  41.7%      -  ONLINE
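As a rough sketch of how the RAID 10 style layout described above could be built, the commands below create a pool out of two mirror VDEVs. The pool name `tank` and the four `/dev/sdX` device names are placeholders I picked for illustration, so swap in your own. Because `zpool create` wipes whatever is on the disks, the `run` helper (also just my own convenience wrapper) only prints the commands unless you set `DRY_RUN=0`.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set DRY_RUN=0 and run as root to execute for real.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

create_striped_mirrors() {
    # $1 = pool name. The /dev/sdX names are hypothetical placeholders.
    # Two mirror VDEVs; ZFS stripes data across them (RAID 10 style).
    run zpool create "$1" \
        mirror /dev/sda /dev/sdb \
        mirror /dev/sdc /dev/sdd
    # Show the resulting layout, one line per VDEV and disk.
    run zpool list -v "$1"
}

create_striped_mirrors tank
```

Adding a third `mirror ...` group to the same command would simply widen the stripe by one more VDEV.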
Layers of ZFS
ZFS has some unique properties as far as filesystems go. I won't list all of the layers, as some are optional, but I'll highlight a few of the important ones to know about.
ARC
ZFS has a cache called the ARC (Adaptive Replacement Cache) that keeps recently and frequently used data in RAM. This allows frequently accessed files to be read much faster than from disk, even if the disk is a fast SSD, as RAM is always faster.
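If you want a feel for how much work the ARC is doing on your own machine, its hit ratio is a decent first number to look at. This is a small sketch that assumes OpenZFS on Linux, where the counters are exposed in `/proc/spl/kstat/zfs/arcstats` as `name type data` lines. The function name `arc_hit_ratio` is my own, and other platforms expose these counters differently.

```shell
#!/bin/sh
# Print the ARC hit ratio as a percentage.
# Assumes OpenZFS on Linux, where arcstats has "name type data" columns.
arc_hit_ratio() {
    stats_file="${1:-/proc/spl/kstat/zfs/arcstats}"
    awk '
        $1 == "hits"   { hits = $3 }
        $1 == "misses" { misses = $3 }
        END {
            total = hits + misses
            if (total > 0) printf "%.1f%%\n", 100 * hits / total
            else print "no ARC activity recorded"
        }
    ' "$stats_file"
}

# Only query the live counters when this machine actually has them.
if [ -r /proc/spl/kstat/zfs/arcstats ]; then
    arc_hit_ratio
fi
```

A ratio that stays high under your normal workload is a good hint that the ARC alone is doing the job.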
L2ARC
This is an optional secondary ARC that can be stored on an SSD to speed up reads when the ARC in RAM is full. It is generally only used on massive arrays, as the ARC is really efficient at deciding what should be cached on smaller arrays, and L2ARC has a drawback of its own: indexing it takes up some RAM.
ZIL/SLOG
The ZIL is the ZFS Intent Log. This is where ZFS records synchronous writes before they are committed to the pool as part of a larger batch, so they can be acknowledged quickly and still survive a crash. This is great in case of a power outage or a kernel panic stopping the system in the middle of a write: on the next import, ZFS replays the intent log, and because ZFS is copy-on-write, a half-finished write never overwrites good data, so there won't be corruption. The ZIL normally lives on the same disk(s) as the pool, though some arrays add a separate log device called a SLOG, usually a fast SSD, to take these intent writes and free up the normal disks for regular pool writes. You can read further on this topic here.
Special VDEV
Special VDEVs are a class of VDEV unique to ZFS. ZFS keeps track of blocks by type and size. Metadata and small blocks are not what spinning disks are good at, so this allows you to have a special VDEV made of SSDs to take the burden of these kinds of blocks: metadata goes there by default, and small file blocks can be sent there too. This gives a massive increase in performance while keeping overall storage cost low, as most of the bulk storage is handled by the slow spinning disk drives and the SSDs are used where they are best. This is a fantastic read on the topic.
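The optional devices from this section (L2ARC, SLOG, and special VDEV) are all attached to an existing pool with `zpool add`. Here is a minimal sketch, assuming a pool named `tank` already exists. Every device name is a made-up placeholder, mirroring the SLOG and the 32K small-block cutoff are just my choices for the example, and the `run` wrapper prints the commands instead of executing them unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set DRY_RUN=0 and run as root to execute for real.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# All device names below are hypothetical placeholders.
add_support_devices() {
    pool="$1"
    # L2ARC: a single SSD partition is fine, it only holds copies of data.
    run zpool add "$pool" cache /dev/sde1
    # SLOG: mirrored here because it briefly holds sync writes
    # that have not reached the main disks yet.
    run zpool add "$pool" log mirror /dev/sde2 /dev/sdf2
    # Special VDEV: mirror it, since losing it means losing the pool.
    run zpool add "$pool" special mirror /dev/sde3 /dev/sdf3
    # Also store data blocks of 32K or smaller on the special VDEV.
    run zfs set special_small_blocks=32K "$pool"
}

add_support_devices tank
```

The special VDEV is the one to be careful with: unlike the cache device, it holds the only copy of the pool's metadata.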
Filesystem, and RAID, what else?
I could spend the rest of existence rambling about everything that ZFS can do, so I'll leave a list of other features that are worth looking into.
- Compression
- NFS exports
- Data scrubbing (protecting against bitrot)
- Snapshots
- Send/Recv (sending data quickly, including diffs)
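To give a taste of a few items from the list above, here is a short sketch that turns on compression and an NFS export for one dataset and then starts a scrub. The names `tank` and `tank/data` are hypothetical, `lz4` is simply the algorithm I reach for, and the `run` wrapper only prints the commands unless you set `DRY_RUN=0`.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set DRY_RUN=0 and run as root to execute for real.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
}

# $1 = pool, $2 = dataset (both hypothetical names in the call below).
enable_extras() {
    pool="$1"
    dataset="$2"
    # Transparent compression; only applies to newly written blocks.
    run zfs set compression=lz4 "$dataset"
    # Export the dataset over NFS with default options.
    run zfs set sharenfs=on "$dataset"
    # Read every block and verify its checksum (catches bitrot).
    run zpool scrub "$pool"
    # Check scrub progress and any errors that were found.
    run zpool status "$pool"
}

enable_extras tank tank/data
```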
Conclusion
These are the features that make ZFS the ultimate ecosystem, and not just a filesystem, for my NAS/SAN use case. They also give me data protection on even my single disks, allowing me to back up and restore quickly with snapshots, and to send/recv faster than any other method I've tried. I've accidentally deleted TBs of data before when targeting the wrong disk in an rm operation, only to undelete the files in less than 5 seconds with a snapshot. I've moved countless TBs over a network, maxing out 10 gigabit speeds in ways that things like cp and rsync could never get close to matching. I've even torture tested machines by pulling RAM out of them while data was being sent, just to see if I could cause corruption, and found none (data that wasn't sent was missing, but everything that was sent was saved properly). This is unmatched on any other filesystem in my opinion, including BTRFS, but that's a rant for another day.
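For anyone curious what the snapshot and send/recv workflow looks like in practice, here is a rough sketch. The dataset, snapshot, host, and target names are all hypothetical, and the `show_or_run` helper is my own wrapper that prints each command unless `DRY_RUN=0` is set. It also assumes the backup host already received the older snapshot in an earlier full send.

```shell
#!/bin/sh
# Prints each command by default; set DRY_RUN=0 to execute it instead.
show_or_run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then echo "$1"; else sh -c "$1"; fi
}

# Hypothetical names: dataset, backup host, and target dataset.
DATASET="tank/data"
REMOTE="backup-host"
TARGET="backup/data"

# Cheap safety net before a risky operation.
take_snapshot() { show_or_run "zfs snapshot $DATASET@$1"; }

# Discard everything written since the snapshot (e.g. after a bad rm).
undo_to() { show_or_run "zfs rollback $DATASET@$1"; }

# Send only the blocks that changed between two snapshots.
# The receiving side must already hold the older snapshot.
send_diff() {
    show_or_run "zfs send -i $DATASET@$1 $DATASET@$2 | ssh $REMOTE zfs recv $TARGET"
}

take_snapshot monday
undo_to monday
take_snapshot tuesday
send_diff monday tuesday
```

Because the incremental stream only carries changed blocks, ZFS does not have to walk and compare every file the way a file-level copy does.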
Further Reading
- OpenZFS wiki
- Wikipedia ZFS page
Bonus
Below is an example of my array that is currently live in my SAN, serving everything including this page. It consists of 3 10TB spinning disks in a raidz1, plus 2 500GB SSDs acting as an L2ARC as well as a special VDEV in a mirror.
╰─$ zpool list tank -v
NAME                                        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank                                       27.7T  8.71T  19.0T        -         -    0%    31%  1.00x  ONLINE  -
  raidz1                                   27.3T  8.64T  18.6T        -         -    0%  31.7%      -  ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_1EG9UBBN         -      -      -        -         -     -      -      -  ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_1EGG56NZ         -      -      -        -         -     -      -      -  ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_2YJXTUWD         -      -      -        -         -     -      -      -  ONLINE
special                                        -      -      -        -         -     -      -      -       -
  mirror                                    428G  70.5G   358G        -         -    9%  16.5%      -  ONLINE
    sda5                                       -      -      -        -         -     -      -      -  ONLINE
    ata-CT500MX500SSD1_2005E286AD8B            -      -      -        -         -     -      -      -  ONLINE
cache                                          -      -      -        -         -     -      -      -       -
  ata-CT500MX500SSD1_1904E1E57733-part1    34.7G  31.2G  3.51G        -         -    0%  89.9%      -  ONLINE