What is a filesystem?

All computers need to give you access to files. This seems obvious at first, but most people don't think about how those files actually get stored. Files need to be stored on a disk (or a network, but let's focus on disk), and that disk needs a way to know where files are, how big they are, etc. This is all part of a filesystem. Some common ones that people may know about are

  • NTFS
  • FAT32
  • EXT4
  • HFS+
  • APFS

These are just a few examples that can be found on different operating systems, and you are bound to recognize at least one.

What's the point? Isn't keeping track of files easy?

Different filesystems are built with different goals, or different operating systems, in mind. As a quick example, HFS+ was built before SSDs existed, and is optimized for spinning disk drives. That doesn't mean you can't use it on a solid state drive, but performance won't be optimal. This is what gave rise to APFS on the Mac: it's built with only SSDs in mind. Once again, you can use it on spinning disk drives, but it won't perform as well as HFS+.

Another big area filesystems are optimized for is features. More modern filesystems may offer things like on-disk compression, to save space while losing no data; permissions, to prevent users from accessing, modifying, or running files they aren't allowed to; and much more. Not all filesystems are created equal, and each has upsides and downsides.

ZFS is a filesystem, but it's also not

Why explain what a filesystem is if ZFS is not one? Well, ZFS is not just a filesystem. It includes a filesystem as a component, but it is far more. I won't explain all of the features it offers here, but I'll highlight some of the more useful ones that I take advantage of.

Redundant Array of Independent Disks (RAID)

RAID is a complex topic, so I'll only get into the basics here. It allows you to use more than 1 disk (SSD, spinning disk, etc.), all as one logical drive. There are many approaches to RAID, from hardware-backed RAID cards, to software in your BIOS/EFI, to LVM. One of the main drawbacks of hardware RAID is that if your RAID card dies, you lose access to your data unless you have an exact replacement for the card. ZFS, on the other hand, lets you keep your data: as long as enough of the disks show up, the data is there. ZFS also allows some other special types of RAID, discussed later, that aren't possible with traditional RAID without complex layers of software set up on top of it. You can read a bit more about ZFS vs hardware RAID here.
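As a rough sketch, creating a redundant pool from the command line looks something like this (the pool name tank and the device paths are placeholders for illustration):

╰─$ zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc   # single-parity RAID across 3 disks
╰─$ zpool status tank                                     # verify the pool and its layout

Any one of those three disks can die and the data survives, with no controller card in the picture: move the disks to another machine and zpool import finds them.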

Combining RAID types (VDEV)

Storing lots of data means that sometimes combining multiple RAID types is more cost- or performance-efficient. A common RAID type is RAID 10: a RAID 1 (mirror) with a RAID 0 (stripe) on top of it. It would look something like this.

[Diagram: RAID 10, a RAID 0 stripe across mirrored pairs]

In ZFS, we call these sections of disks VDEVs. In the diagram above, each VDEV contains 2 disks, and the stripe over all VDEVs is known as a "pool". Every ZFS array has at least 1 pool and 1 VDEV, even if it's a single disk.
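To make that concrete, here is a sketch of building a RAID 10 style pool in ZFS: you list two mirror VDEVs, and the pool stripes across them automatically (pool and device names are placeholders):

╰─$ zpool create tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd

Each mirror is one VDEV, and the stripe can be grown later by adding another mirror VDEV with zpool add.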

Here is an example of a ZFS root filesystem used in one of my servers.

╰─$ zpool list zroot -v
NAME          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zroot         476G   199G   277G        -         -    14%    41%  1.00x    ONLINE  -
  nvme0n1p2   476G   199G   277G        -         -    14%  41.7%      -    ONLINE

Layers of ZFS

ZFS has some unique properties as far as filesystems go. I won't list all of the layers, as some are optional, but I'll highlight a few of the important ones to know about.

ARC

ZFS has a read cache called the ARC (Adaptive Replacement Cache) that keeps data in RAM. This allows frequently accessed files to be read much faster than from disk, even if the disk is a fast SSD, as RAM is always faster.
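On Linux with OpenZFS, you can get a feel for how well the ARC is working with the bundled arc_summary tool, or by reading the raw kstats (paths assume the OpenZFS module is loaded):

╰─$ arc_summary                                            # full report: size, hit rate, etc.
╰─$ grep -E '^(size|c_max) ' /proc/spl/kstat/zfs/arcstats  # current and maximum ARC size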

L2ARC

This is an optional secondary ARC that can be stored on an SSD to speed up reads once RAM is full. It's generally only used on massive arrays, as the ARC is really efficient at deciding what should be cached on smaller arrays, and an L2ARC has drawbacks of its own, since its index takes up some RAM.
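If an array does warrant one, attaching an L2ARC is a one-liner (the device path is a placeholder):

╰─$ zpool add tank cache /dev/nvme1n1

The device then shows up under a cache section in zpool list -v, like the one in the Bonus section below.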

ZIL/SLOG

The ZIL is the ZFS Intent Log. This is where ZFS records the data it intends to write, so it can verify that everything was written correctly before committing it to disk. This is great in case of a power outage or a kernel panic stopping the system in the middle of a write: if the data wasn't written properly, it won't be committed to the disk, and there won't be corruption. The ZIL normally lives on the same disk(s) as the filesystem, though some arrays add a special device called a SLOG, usually an SSD, to write these intents to, freeing up the normal disks to only write good data. You can read further on this topic here.
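Adding a SLOG is equally short; mirroring it is common practice so a single SSD failure can't take in-flight synchronous writes with it (device paths are placeholders):

╰─$ zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1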

Special VDEV

Special VDEVs are a VDEV type unique to ZFS. ZFS keeps track of files and blocks by size. Small files and things like metadata are exactly where spinning disks are weakest, so this allows you to have a special VDEV made of SSDs to take the burden of those files and blocks. This can massively increase performance while keeping overall storage cost low, as most of the bulk storage is handled by the slow spinning disk drives, while the SSDs are used where they are best. This is a fantastic read on the topic.
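As a sketch, adding a mirrored special VDEV looks like this, along with the dataset property that controls which small data blocks it absorbs (device paths and the 32K cutoff are illustrative; by default only metadata lands on it):

╰─$ zpool add tank special mirror /dev/sdb /dev/sdc
╰─$ zfs set special_small_blocks=32K tank   # also steer data blocks of 32K or less to the SSDs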

Filesystem and RAID, what else?

I could spend the rest of existence rambling about everything that ZFS can do, so I'll leave a list of other features that are worth looking into:

  • Snapshots, cheap point-in-time copies of a dataset
  • send/recv, for replicating datasets and snapshots to another pool or machine
  • Transparent on-disk compression
  • Per-block checksumming, so corruption is detected, and repaired when there's redundancy
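As a quick sketch of the snapshot and send/recv workflow mentioned in the conclusion below (dataset names and the remote host are placeholders):

╰─$ zfs snapshot tank/data@before-cleanup    # point-in-time snapshot, near instant
╰─$ zfs rollback tank/data@before-cleanup    # undo an accidental rm in seconds
╰─$ zfs send tank/data@before-cleanup | ssh backup-host zfs recv backup/data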

Conclusion

These are the features that make ZFS the ultimate ecosystem, and not just a filesystem, for my NAS/SAN use case, as well as for data protection on even my single disks, letting me back up and restore quickly with snapshots, and send/recv faster than any other method available. I've accidentally deleted TBs of data after targeting the wrong disk in an rm operation, only to undelete the files in less than 5 seconds with a snapshot. I've moved countless TBs over a network, maxing out 10 gigabit speeds in ways that things like cp and rsync could never get close to matching. I've even torture tested machines by pulling RAM out of them while data was being sent, just to see if I could cause corruption, and found none (data that hadn't been sent yet was missing, but everything that was sent was saved properly). This is unmatched on any other filesystem in my opinion, including BTRFS, but that's a rant for another day.

Further Reading

  • OpenZFS wiki
  • Wikipedia ZFS page

Bonus

Below is an example of my array that is currently live in my SAN, serving everything including this page. It consists of 3 10TB spinning disks and 2 500GB SSDs acting as an L2ARC as well as a special VDEV in a mirror.

╰─$ zpool list tank -v
NAME                                      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank                                     27.7T  8.71T  19.0T        -         -     0%    31%  1.00x    ONLINE  -
  raidz1                                 27.3T  8.64T  18.6T        -         -     0%  31.7%      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_1EG9UBBN       -      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_1EGG56NZ       -      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_2YJXTUWD       -      -      -        -         -      -      -      -    ONLINE
special                                      -      -      -        -         -      -      -      -  -
  mirror                                  428G  70.5G   358G        -         -     9%  16.5%      -    ONLINE
    sda5                                     -      -      -        -         -      -      -      -    ONLINE
    ata-CT500MX500SSD1_2005E286AD8B          -      -      -        -         -      -      -      -    ONLINE
cache                                        -      -      -        -         -      -      -      -  -
  ata-CT500MX500SSD1_1904E1E57733-part1  34.7G  31.2G  3.51G        -         -     0%  89.9%      -    ONLINE