What's a Glusterfs?

Glusterfs is a network filesystem with many features, but the important ones here are its ability to live on top of another filesystem, and to offer high availability. If you have used SSHFS, it's quite similar in concept: you get a "fake" filesystem from a remote machine, and as a user you can use it just like normal without caring about where the files are actually stored, beyond "over there I guess". Unlike SSHFS, Glusterfs can spread the data across multiple machines, similar to network RAID. If one machine goes down, the data is still all there and well.

Why even bother?

A few years ago I decided that I was tired of managing docker services per machine and wanted them in a swarm. No more thinking! If a machine goes down, the service is either still up (already replicated across servers, like this blog), or will come up on another server once the swarm sees the service isn't alive. This is all well and good until the SAN needs to go down. Now all of the data is missing, the servers don't know it, and you basically have to kick the entire cluster over to bring it back to life. Not exactly ideal, to say the least.

Side rant. Feel free to skip if you only care about the tech bits.

While ZFS has kept my data very safe over the years, it can't always prevent machine oddity. I have had strange issues such as Ryzen bugs that could lock up machines at idle, a still-unexplained random network hang that resolves itself 10 seconds later (despite changing 80% of the machine, including all disks, the operating system, and network cards), and so on. As much as I always want to have a reliable machine, updates will require service restarts, reboots need to be done, and honestly, I'm tired of having to babysit computers. Docker swarm and NixOS are in my life because I don't want to babysit; I want to solve problems once and be done with them. Storage stability was the next nail to hit. It was arguably a small problem, but it still reminded me that computers exist when I wasn't in the mood for them to exist.

Why Glusterfs as opposed to Ceph or anything else?

Glusterfs sits on top of an existing filesystem. This is the feature that took me to it over anything else. I have trusted my data to ZFS for many years, and have done countless things that should have cost me data, including "oops, I deleted 2TB of data on the wrong machine" and having to force power off machines (usually systemd reasons), and all of my data is safe. For the very few things it couldn't save me from, it will happily tell me where the corruption is so I can replace the affected data from a backup. With all of that said, Glusterfs happily lives on top of ZFS, even letting me keep using datasets just as I have for ages, while also letting me expand across several machines. There are a ton of modes in Glusterfs, much like any RAID software, but I'm sticking to what is effectively a mirror (RAID 1). Let's look at the hardware setup to explain this a bit better.

The hardware

planex

  • Ryzen 5700
  • 32GB RAM
  • 2x16TB Seagate Exos
  • 2x1TB Crucial MX500
pool
-------------------------- 
exos
 mirror-0
   wwn-0x5000c500db2f91e8
   wwn-0x5000c500db2f6413
special
 mirror-1
   wwn-0x500a0751e5b141ca
   wwn-0x500a0751e5aff797
-------------------------- 

morbo

  • Ryzen 2700
  • 32GB RAM
  • 5x3TB Western Digital Red
  • 1x10TB Western Digital (replaced a red when it died)
  • 2x500GB Crucial MX500
pool
--------------------------------------------
red
  raidz2-0
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3EVYXPT
    ata-WDC_WD100EMAZ-00WJTA0_1EG9UBBN
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N6ARC4SV
    ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N6ARCZ43
    ata-WDC_WD30EFRX-68N32N0_WD-WCC7K2KU0FUR
    ata-WDC_WD30EFRX-68N32N0_WD-WCC7K7FD8T6K
special
  mirror-2
    ata-CT500MX500SSD1_1904E1E57733-part2
    ata-CT500MX500SSD1_2005E286AD8B-part2
logs
  mirror-1
    ata-CT500MX500SSD1_1904E1E57733-part1
    ata-CT500MX500SSD1_2005E286AD8B-part1
-------------------------------------------- 

kif

  • Intel i3 4170
  • 8GB RAM
  • 2x256GB Inland SSD
pool
-------------------------------
inland
  mirror-0
    ata-SATA_SSD_22082224000061
    ata-SATA_SSD_22082224000174
-------------------------------

Notes

These machines are a bit different in terms of storage layout. Morbo and Planex both store decent amounts of data, while kif is there just to help validate things, so it doesn't need much of anything. We'll see why later. Would giving Morbo and Planex identical disk layouts increase performance? Yes, but so would SSDs for all of the data. Tradeoffs.

ZFS setup

I decided to keep my setup simple on all of my systems and keep the mount points for glusterfs the same. On each system, I created a dataset named gluster and set its mountpoint to /mnt/gluster. This means I don't have to remember which machine has data where, and keeps things streamlined. It may look something like this.

# Create a dataset for the gluster brick and mount it in the same place on every machine
zfs create pool/gluster
zfs set mountpoint=/mnt/gluster pool/gluster

If you have one disk, or just want everything on gluster, you could just mount the entire drive/pool somewhere you'll remember, but I find it simplest to use datasets, and I still have to migrate data that lives outside of gluster on the same array into gluster. That's it for ZFS-specific things.
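
Since the whole point is that every brick lives at the same path, it's worth a quick sanity check that each machine agrees. A minimal check, assuming the dataset was created as above ("pool" is just a stand-in for whatever your pool is actually called):

# Should report /mnt/gluster on every machine
zfs get mountpoint pool/gluster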

Creating a gluster storage pool
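
One step I'm glossing over: the servers have to be joined into a trusted pool before a volume can span them. Assuming glusterd is already installed and running on each machine, probing from any one node looks something like this (hostnames are from my setup):

# Run once, from any single node
gluster peer probe morbo
gluster peer probe kif

# Every peer should show up as connected
gluster peer status

With the peers talking to each other, the volume itself can be created.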

gluster volume create media replica 2 arbiter 1 planex:/mnt/gluster/media morbo:/mnt/gluster/media kif:/mnt/gluster/media force

This may look like a blob of text that means nothing, so let's look at what it does.

# Tells gluster that we want to make a volume named "media"
gluster volume create media

# Replica 2 arbiter 1 tells gluster to use the first 2 servers to store the
# full data in a mirror (replica) and set the last as an arbiter. This acts
# as a tie breaker for the case that anything ever disagrees, and you
# need a source of truth. It costs VERY little data to store this.
replica 2 arbiter 1

# The server names, and the paths where each of them stores its data
planex:/mnt/gluster/media
morbo:/mnt/gluster/media
kif:/mnt/gluster/media

# Normally gluster wants to create its own directory. When we use datasets,
# the folder will already exist, so we have to force it. Be aware that this
# can cause issues if you point it at the wrong place, so check first
force

If all goes well, you can start the volume with

gluster volume start media

You'll want to check the status once it's started. From any node, it should look something like this.
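
gluster volume status media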

Status of volume: media
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick planex:/mnt/gluster/media             57715     0          Y       1009102
Brick morbo:/mnt/gluster/media              57485     0          Y       1530585
Brick kif:/mnt/gluster/media                54466     0          Y       1015000
Self-heal Daemon on localhost               N/A       N/A        Y       1009134
Self-heal Daemon on kif                     N/A       N/A        Y       1015144
Self-heal Daemon on morbo                   N/A       N/A        Y       1854760

Task Status of Volume media
------------------------------------------------------------------------------

With that taken care of, you can now mount your Gluster volume on any machine that needs it! Just follow the normal install instructions for your platform, as they will be different for each one. On NixOS at the time of writing, I'm using this to manage Glusterfs on any storage-hosting machine in my docker swarm. https://git.kdb424.xyz/kdb424/nixFlake/src/commit/5a1c902d0233af2302f28ba30de4fec23ddaaac9/common/networking/gluster.nix
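
On a more traditional distro, the install is usually just the gluster server package plus enabling the daemon. As a rough sketch (package and service names here are the Debian/Ubuntu ones; other platforms will differ):

# Debian/Ubuntu-style install on every machine that will host a brick
apt install glusterfs-server
systemctl enable --now glusterd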

Using gluster volumes

Once a volume is started, you can mount it by pointing at any machine that has data in the volume. In my case I can mount from planex, morbo, or kif, and even if one goes down, the data is still served. You can treat this mount just like storing files locally or over NFS/SSHFS; any data stored on it is replicated and stays highly available if a server needs to go down for maintenance or has issues. This also provides a bit of a backup (in the same way a RAID mirror does; never rely on online machines as your only backup). Not only do you get higher uptime on the data, but if you previously replicated data on a schedule to an always-on backup machine, this now happens in real time, which is a nice side effect.
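
For reference, a plain client mount might look roughly like this. The mount point is just my choice here, and the backup-volfile-servers option (which lets the client fail over to another node if the one named in the mount is unreachable) may vary slightly between gluster versions:

# Create a mount point and mount the "media" volume from planex,
# failing over to morbo or kif if planex can't be reached
mkdir -p /mnt/media
mount -t glusterfs -o backup-volfile-servers=morbo:kif planex:/media /mnt/media

# Roughly the same thing as an /etc/fstab entry
# planex:/media  /mnt/media  glusterfs  defaults,_netdev,backup-volfile-servers=morbo:kif  0 0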

Now what?

With my docker swarm able to keep serving through odd quirks without interruption, and with this replacing my need for ZFS send/recv backups between live machines (please still keep a cold storage backup in a fire box if you care about your data, along with an off-site backup), I can continue to forget that computers exist and focus on the things I want to work on, like eventually setting up email alerts for ZFS scrubs or S.M.A.R.T. scans with any drive warnings. I can mostly forget about the details and stay focused on the problems that are fun to solve. Yes, I could host my data elsewhere, but even ignoring the insane cost that I won't pay, I get to actually own my data and not have a company creeping on things. Just because I have nothing to hide doesn't mean I leave my door unlocked.

Obligatory "things I say I won't do, but probably will later"

  • Dual network paths. A single switch or cable failure can still knock machines offline.
  • Dual routers! Router upgrades always take too long. 5 minutes offline isn't acceptable these days!
  • Discover the true power of TempleOS.