What's all the fuss about docker?

Why is everyone excited?

Docker is, simply put, a tool for running containers. How is this at all exciting? Think about any time you set up a system, and how long you spend setting it up, whether it's a web server, a build environment, or any other system service. Do you want to do that all over again when you move systems? What about updating the operating system under it, and having to deal with config file changes and the other cruft that gets left behind and causes problems over time? Not having to deal with any of this is why people like Docker.

What docker is, and more importantly, what it's not.

Some people assume that Docker is just another kind of virtual machine. While it is true that Docker will spin up a minimal Linux environment, it does this every single time the container is started, and when it's stopped, it throws it all away. This seems like it would be a massive pain to update every time you want to add new content, but this wasn't overlooked when Docker was designed.

Getting started with Docker

The getting started link for Docker is here so you can install it properly no matter what platform you are on. Once Docker is up and running, come back and let's see what it can do for you.

Basic example

Give this command a run in a terminal/command prompt, then navigate a web browser to http://127.0.0.1 and you should see a web page being served.

docker run -d -p 80:80 docker/getting-started

Let's break down what that command is doing. docker is the basic command to directly control Docker. run tells it that you want to run something inside of Docker. -d tells Docker to run it in the background, as a daemon. -p 80:80 forwards port 80 on your local machine, where Docker is running, to port 80 in the container. This is the standard HTTP port, which is what lets you reach the server. The URL could have been typed as http://127.0.0.1:80, but 80 is implied for HTTP, so it didn't need to be typed. docker/getting-started is the image that runs inside of the container, and that image starts the web server you see when you load the page.
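If you want to check on that container or stop it later, the basic lifecycle commands look something like this (a quick sketch; the container ID will be different on your machine):

# List running containers and grab the ID or name
docker ps

# Stop and remove the getting-started container using the ID from docker ps
docker stop <container-id>
docker rm <container-id>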

Docker-compose

Typing out commands can get very confusing, especially as things get more complex, so I would recommend learning docker-compose early. You may need to install it separately on your system. The same command as above would be saved to a file called docker-compose.yml, which is YAML formatted, as the filename implies.

---
version: "2.3"
services:
    getting-started:
        image: docker/getting-started
        ports:
            - 80:80

Make it useful

Docker compose files can be brought up by running docker-compose up -d. The -d will run it in the background. If you omit it, you can see what's going on, and stop it with Ctrl+C like any normal command.

---
version: "3"  # Specifies the compose version

services:  # The list of services are below
    nginxBlog:  # The only service will be this blog, running on nginx
        image: nginx  # Runs on the nginx official image
        container_name: blog  # Sets the name of the container to keep track easier
        ports:
            - 80:80  # opens up port 80 to let you access the blog
        volumes:
            # Passes through /mnt/data/blog from the host to where nginx expects a web page to be
            - /mnt/data/blog:/usr/share/nginx/html
        restart: unless-stopped  # Automatically restarts the service on restart of docker, host reboot, etc.
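Once it's up (docker-compose up -d as above), a couple of other subcommands are handy. A quick sketch, run from the directory containing the file:

# Follow the logs for the nginxBlog service
docker-compose logs -f nginxBlog

# Stop and remove the container when you're done
docker-compose down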

This is how this blog gets to you (partially). What happens if I start this docker container on another machine? Do I have to upload my blog to all of them and keep them in sync? Not at all. It just takes an edit to how docker gets access to the data.

---
version: "3"  # Specifies the compose version

services:  # The list of services are below
    nginxBlog:  # The only service will be this blog, running on nginx
        image: nginx  # Runs on the nginx official image
        container_name: blog  # Sets the name of the container to keep track easier
        ports:
            - 80:80  # opens up port 80 to let you access the blog
        volumes:
            # This time we will pass the volume from below through to the container.
            - blog:/usr/share/nginx/html  # mounts the "blog" volume defined below where nginx expects a web page to be
        restart: unless-stopped  # Automatically restarts the service on restart of docker, host reboot, etc.
        
volumes:
    blog:
        driver: local
        driver_opts:
            type: nfs
            o: "addr=192.168.25.51"
            device: ":/mnt/data/blog"

This will let docker manage a mount through NFS (assuming it's available on that machine). This means that you can use this file on any computer that has access to that NFS export.
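If you want to sanity check that a new machine can actually reach the NFS export before handing it to docker, something like this works (assumes the NFS client tools are installed; /mnt/test is just a hypothetical mount point):

# Ask the NFS server what it exports
showmount -e 192.168.25.51

# Try mounting the export by hand, then clean up
mkdir -p /mnt/test
mount -t nfs 192.168.25.51:/mnt/data/blog /mnt/test
ls /mnt/test
umount /mnt/test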

Docker swarm

Speaking of managing multiple computers with docker, why bother choosing what goes where when you don't care which machine hosts it? Docker swarm has you covered. I'll link the getting started guide here as a reference, but I'll highlight some of the things I was confused about going in, as well as some other benefits to running a swarm.

Short list of upsides

  • High reliability services. Can run multiple instances in case one is restarting/crashing/overloaded
  • Automatically can use any node that joins the swarm with little to no effort after joining
  • Can easily reboot machines for updates, and docker containers stay up, or automatically come back up on another machine

Questions I and others have/had


Q: How do I know what IP address to access?

A: Docker swarm includes a load balancer. You can access any machine in the swarm on the port you want, and it will serve it to you properly.


Q: What if I need something specific for the container?

A: Docker swarm includes the concept of tagging nodes (labels). You may want to separate things that need ARM or x86_64 CPUs. You may also tag a system as "low_ram", like a Raspberry Pi, so a Minecraft server doesn't decide to try to start there. Tags are arbitrary, so you can craft them to your needs.
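As a rough sketch (pi1 and the low_ram label are names I made up here), you tag the node from a swarm manager with:

docker node update --label-add low_ram=true pi1

Then, in the stack file for something heavy, you tell swarm to avoid those nodes:

services:
    minecraft:
        image: some/minecraft-image  # placeholder image name
        deploy:
            placement:
                constraints:
                    - node.labels.low_ram != true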


Q: How do I update the container?

A: Most of the time, you don't think about it. The implied tag for images is :latest, which pulls the latest version of the image when the service is deployed or updated. If you pin a version, you decide when to change the tag, and docker does the rest for you.
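In practice, forcing an update looks something like this (a sketch; the service name below follows swarm's stackname_servicename convention and is just an example):

# Swarm: re-resolve the image and roll the service
docker service update --image nginx:latest blog_nginxBlog

# Plain docker-compose on a single machine
docker-compose pull
docker-compose up -d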

Other uses for docker

Docker isn't limited to running services for servers. You can use it as a container to test applications without installing them on your system directly. This is also great for dev environments, as there are no more "works on my system" bugs, because everything inside of the container is always the same on all systems. I'll link an article on how to do this with Rust, but it should translate to most projects well.
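As a rough example of the idea, based on the official rust image (the paths and tag are just what I'd reach for; adjust to your project):

# Build the project in a throwaway container using the official rust image,
# mounting the current directory so the build output ends up on the host
docker run --rm -it \
    -v "$PWD":/usr/src/myapp \
    -w /usr/src/myapp \
    rust:latest \
    cargo build --release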

Conclusion

Docker is a great way to carry around services, build environments, and many other things that help you think less about the "how do I get there" and more about whatever your goal is. When I wanted to spin this blog up, I didn't care that I had to use a web server, or how it went together. I just started an nginx instance in docker, and I am done forever. Hopefully this has helped you see what's so great about docker. Feel free to reach out with questions, and I'll update the page with any common ones.

ZFS. It's not a filesystem, it's an ecosystem

What is a filesystem?

All computers need to give you access to files. This seems quite obvious at first, but most people don't think about how those files get stored. Files need to be stored on a disk (or a network, but let's focus on disk), and that disk needs a way to know where files are, how big they are, etc. This is all part of a filesystem. Some common ones that people may know about are

  • NTFS
  • FAT32
  • EXT4
  • HFS+
  • APFS

These are just a few examples that can be found on different operating systems, and you are bound to recognize at least one.

What's the point? Isn't keeping track of files easy?

Different filesystems are built with different goals, or different operating systems, in mind. As a quick example, HFS+ was built before SSDs existed, and is optimized for spinning disk drives. That doesn't mean you can't use it on a solid state drive, but the performance could be better. This is what gave rise to APFS for Mac, which is built with only SSDs in mind. Once again, you can use it on spinning disk drives, but it won't perform as well as HFS+.

Another big area filesystems are optimized for is features. More modern filesystems may offer things like on-disk compression, to save space while losing no data, and permissions, to prevent users from accessing, modifying, or running files that they aren't allowed to, and much more. Not all filesystems are created equal, and each has upsides and downsides.

ZFS is a filesystem, but it's also not

Why explain what a filesystem is if ZFS is not one? Well, ZFS is not just a filesystem. It includes a filesystem as a component, but is far more. I won't explain all of the features it offers here, but I will cover some of the more useful ones that I take advantage of.

Redundant Array of Independent Disks (RAID)

RAID is a complex topic, so I'll only get into the basics here. It allows you to use more than one disk (SSD, spinning disk, etc.), all as one logical drive. There are many solutions for RAID, from hardware-backed RAID cards, to software in your BIOS/EFI, to LVM. One of the main drawbacks of hardware RAID is that if your RAID card dies, you lose your data unless you have an exact replacement for the card. ZFS, on the other hand, lets you keep your data, and as long as enough of the disks show up, the data is there. ZFS also allows some other special types of RAID, which will be talked about later, that aren't possible with traditional RAID without complex layers of software set up on top of it. You can read a bit more about ZFS vs hardware RAID here.

Combining RAID types (VDEV)

Storing lots of data means that sometimes combining multiple RAID types together is more cost or performance efficient. A common RAID type is RAID 10. This is a RAID 1 (mirror) with a RAID 0 on top of it. It would look something like this.

RAID 10

In ZFS, we call these sections of disks VDEVs. The above image shows 2 disks in each VDEV, and the stripe over all VDEVs is known as a "pool". Every ZFS array has at least 1 pool and 1 VDEV, even if it's a single disk.
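To make that concrete, building a pool like the RAID 10 picture above out of two mirrored VDEVs would look roughly like this (the pool and disk names are made up; use /dev/disk/by-id paths on real hardware):

# Create a pool named "mypool" out of two mirror VDEVs (4 disks total)
zpool create mypool \
    mirror /dev/sda /dev/sdb \
    mirror /dev/sdc /dev/sdd

# Confirm the layout
zpool status mypool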

Here is an example of a ZFS root filesystem used in one of my servers.

╰─$ zpool list zroot -v
NAME          SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zroot         476G   199G   277G        -         -    14%    41%  1.00x    ONLINE  -
  nvme0n1p2   476G   199G   277G        -         -    14%  41.7%      -    ONLINE

Layers of ZFS

ZFS has some unique properties as far as filesystems go. I won't list all of the layers, as some are optional, but I'll highlight a few of the important ones to know about.

ARC

ZFS has a RAM cache called the ARC (Adaptive Replacement Cache). This allows frequently accessed data to be read much faster than from disk, even if the disk is a fast SSD, as RAM is always faster.

L2ARC

This is an optional secondary ARC that can be stored on an SSD to speed up reads when RAM is totally full. It's generally only used on massive arrays, as the ARC is really efficient at deciding what should be cached on smaller arrays, and an L2ARC has some drawbacks, as it takes up some RAM of its own.
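If you do decide an array needs one, adding (or removing) an L2ARC is non-destructive and roughly looks like this (the device path is a placeholder):

# Add an SSD partition as a cache (L2ARC) device
zpool add tank cache /dev/disk/by-id/ata-SOME_SSD-part1

# It can be removed again at any time
zpool remove tank /dev/disk/by-id/ata-SOME_SSD-part1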

ZIL/SLOG

The ZIL is the ZFS Intent Log. This is where ZFS stores the data that it intends to write, so it can verify that it was written correctly before committing it to disk. This is great in case a power outage or a kernel panic stops the system in the middle of a write. If it wasn't written properly, the data won't be committed to the disk, and there won't be corruption. This normally lives on the same disk(s) as the filesystem, though some arrays add a special device called a SLOG, usually an SSD, to write these intents to, freeing up the normal disks to only write good data. You can read further on this topic here.
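Adding a SLOG is just another zpool add, and mirroring it is a good idea since it holds in-flight writes (the device names below are placeholders):

# Add a mirrored SLOG made of two small, fast SSD partitions
zpool add tank log mirror /dev/nvme0n1p3 /dev/nvme1n1p3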

Special VDEV

Special VDEVs are a type of RAID that is unique to ZFS. ZFS keeps track of files and blocks by size. Small files and things like metadata are not what spinning disks are good at, so this allows you to have a special VDEV made of SSDs to take the burden of those kinds of files and blocks. This gives a massive increase in performance while keeping overall storage cost low, as most of the bulk storage is handled by the slow spinning disk drives, while the SSDs are used where they are best. This is a fantastic read on the topic.
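The rough shape of that, with placeholder device names (the special_small_blocks cutoff is just a value I'd start with; tune it to your data):

# Add a mirrored special VDEV for metadata and small blocks
zpool add tank special mirror /dev/sde /dev/sdf

# Also send small file blocks (up to 32K) to the special VDEV
zfs set special_small_blocks=32K tank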

Filesystem, and RAID, what else?

I could spend the rest of existence rambling about everything that ZFS can do, so I'll just leave a short list of other features that are worth looking into on your own.

  • Snapshots, and rolling back to them
  • send/recv for fast replication and backups
  • Transparent on-disk compression

Conclusion

These are the features that make ZFS the ultimate ecosystem, and not just a filesystem, for my NAS/SAN use case, as well as for data protection on even my single disks, letting me back up and restore quickly with snapshots, and send/recv faster than any other method available. I've accidentally deleted TBs of data before when targeting the wrong disk in an rm operation, only to undelete the files in less than 5 seconds with a snapshot. I've moved countless TBs over a network, maxing out 10 gigabit speeds in ways that things like cp and rsync could never get close to matching. I've even torture tested machines by pulling RAM out of them while data was being sent, just to see if I could cause corruption, and found none (some data that hadn't been sent yet was missing, but everything that was sent was saved properly). This is unmatched on any other filesystem in my opinion, including BTRFS, but that's a rant for another day.
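For a sense of scale, the snapshot and send/recv workflow I'm describing is only a few commands (the dataset names and remote host below are made up):

# Take a snapshot before doing anything risky
zfs snapshot tank/data@before-cleanup

# Oops, deleted the wrong thing? Roll straight back
zfs rollback tank/data@before-cleanup

# Replicate the snapshot to another machine over ssh
zfs send tank/data@before-cleanup | ssh backuphost zfs recv backup/data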

Further Reading

  • OpenZFS wiki
  • Wikipedia ZFS page

Bonus

Below is an example of my array that is currently live in my SAN, serving everything including this page. It consists of 3 10TB spinning disks and 2 500GB SSDs acting as an L2ARC as well as a special VDEV in a mirror.

╰─$ zpool list tank -v
NAME                                      SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank                                     27.7T  8.71T  19.0T        -         -     0%    31%  1.00x    ONLINE  -
  raidz1                                 27.3T  8.64T  18.6T        -         -     0%  31.7%      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_1EG9UBBN       -      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_1EGG56NZ       -      -      -        -         -      -      -      -    ONLINE
    ata-WDC_WD100EMAZ-00WJTA0_2YJXTUWD       -      -      -        -         -      -      -      -    ONLINE
special                                      -      -      -        -         -      -      -      -  -
  mirror                                  428G  70.5G   358G        -         -     9%  16.5%      -    ONLINE
    sda5                                     -      -      -        -         -      -      -      -    ONLINE
    ata-CT500MX500SSD1_2005E286AD8B          -      -      -        -         -      -      -      -    ONLINE
cache                                        -      -      -        -         -      -      -      -  -
  ata-CT500MX500SSD1_1904E1E57733-part1  34.7G  31.2G  3.51G        -         -     0%  89.9%      -    ONLINE

Why all the distros?

Why does distro even matter

Linux is insanely vague in what it is, and is just a pile of source code for the most part. It's up to people to put the pieces together and choose what's included or not. There is no perfection in life for all use cases, or all users. This leaves us with lots of different "distros", which for the most part are just sets of default applications that get installed. Some defaults are better for things like running a server, others are better for realtime audio workstations, and others are better for new users of desktop systems and include everything but the kitchen sink.

Servers

Linux on servers comes with a plethora of options. Any Linux can be used as a server, though if you want something that's low maintenance, you'll probably want to look for a few key things.

  • Light and slim default installs
  • Security focused
  • Kernel updates that don't overwrite the running kernel (Arch-like distros tend to overwrite it)

Desktop

Linux on the desktop is normally focused on things like having more choice in applications and support for more desktop environments, and many prefer having updates sooner rather than waiting for everything to pass all of the security and stability checks. Rebooting a personal computer is way less problematic than rebooting a server that has many users connected to it. A short list of what people tend to want out of a Linux desktop:

  • Good graphics driver support
  • More application choices, especially graphical ones
  • Up to date kernels for the best performance, and newest drivers
  • Easy access to add user submitted packages (AUR on Arch, PPA on Ubuntu, etc.)

What distros do I use on my machines, and which is the best?

First off, there is no "best Linux". It could be argued that some are better at some tasks, but with enough work, any Linux distro can do anything the others can. That said, why do all of that work if someone else gets you close enough? Do note that I only use Linux distros without systemd. I'll do my own writeup later on my personal reasons why, but if you are looking for something now, here is a good resource. With that out of the way, this is a list of what I use and why I use them.

Alpine

Alpine is a very lightweight linux distro with security and speed in mind. Alpine uses musl as opposed to glibc, which most other distros use. This causes some incompatibility with closed source applications like Steam, and makes it less suited for a desktop distro unless you only use open source tools and drivers.

Good parts

  • Insanely light and efficient
  • Very good package manager
  • Quite up to date
  • Built to run on servers and embedded systems
  • Runs openrc
  • Is the basis of tons of docker containers
  • Very stable

Bad parts

  • Command line installer (could be an upside)
  • Can't run many popular closed source Linux tools/drivers without workarounds
  • Runs ASH and not BASH by default which can confuse new users
  • Very different from most popular distros, so has a learning curve
  • Less software choice, but most needed things are there for servers

Artix

Artix Linux is based on Arch. Obligatory "I use Arch BTW" meme aside, there are some things about Arch based distros that I really like, mainly the AUR and pacman. We all take package managers for granted, and just assume that they can all do the same things as the others. Pacman has features built for the power user, while being fairly simple to use, and even has graphical front ends for those that just don't want to use a terminal.

Good parts

  • Amazing driver support
  • Light and efficient
  • Very good package manager
  • Not bleeding edge, but rapidly updated
  • Has GUI installer or can be installed manually
  • Almost every Linux package available thanks to the AUR
  • Optionally runs openrc

Bad parts

  • Updates the kernel in place. Reboots needed often
  • Almost bleeding edge means security may not always be as high
  • Being a rolling release, it may require user intervention on updates
  • Not as stable as other mentioned offerings due to its rapid updates

Gentoo

Gentoo is the odd one out. It's a source based Linux distro, which means that most or all installed packages are compiled from source into binaries on your system, leading to potentially long compile times. The upshot of this is extreme control over your installed programs, down to compile flags. I rarely use this distro, so I'll keep the list short, as I never recommend it unless you are looking to learn how a system works at its core, or you know why you want it.

Good parts

  • Light and efficient
  • Insanely good package manager
  • Optionally runs openrc
  • More customizable than most distros due to source based packages
  • Very stable

Bad parts

  • Being a rolling release, it may require user intervention on updates
  • Manual install only
  • Fairly limited in packages
  • Heavy on disk space
  • Most packages need to be compiled from source

Void

Void Linux is often talked about as "the BSD of Linux". I don't use Void often due to a personal distaste for runit, but that's subjective, and I would not let it deter you from trying Void. There are great tools to help with this "problem", such as vsv or sv-helper, to name just two.

Good parts

  • Insanely light and efficient
  • Very good package manager
  • Quite up to date
  • Fairly stable
  • Built to run on servers, or desktops alike
  • Offers both musl and glibc

Bad parts

  • Command line installer (could be an upside)
  • Can't run many popular closed source Linux tools/drivers without workarounds on musl
  • Less software choice, but most needed things are available

Honorable mentions

These are some distros that I have used in the past, but have switched away from for one reason or another. I will leave them in the order I'd recommend them generically for desktops and servers. All of these run systemd, as opposed to everything above.

Desktop

  1. Manjaro (Arch based)
  2. PopOS
  3. Arch

Servers

  1. Debian
  2. Ubuntu Server

Closing thoughts

This is currently what I'm using in terms of Linux distros on my machines. Artix is used for anything I sit at with a display/keyboard by default. Void is used on occasion, though I have run into a few bugs here and there when using it as a desktop, and between that and not being a personal fan of runit, I just don't use it often. Gentoo was a great learning experience, and I can recommend using it for a while to gain something from it, even if it doesn't stick around. I'm sure the list below will change over time, but here's a list of most of my running machines with their OS's at the moment.

  • pfsense
    • Lenovo Thinkstation
    • OPNsense (BSD)
    • Router/Firewall
  • planex
    • Ryzen custom built
    • Artix (openrc) (Soon to be Void or Alpine)
    • SAN
  • bender
    • Ryzen 2500u Dell Inspiron
    • Void
    • Tinker laptop
  • amy
    • Ryzen 4500u Lenovo Flex 5
    • Artix (openrc)
    • Primary laptop
  • farnsworth
    • M1 Mac Mini
    • OSX
    • Primary desktop
  • zapp
    • Linode
    • Alpine
    • VPS

Hosting (Aka, how this page gets to you)

How this page gets to you

TLDR: VPS --> zerotier --> docker swarm --> docker container

The long answer

Nginx

Opening ports is generally a security risk, so I wanted to be able to self host without opening ports where possible. With a cheap VPS running nginx, I'm able to reverse proxy back to where my docker swarm is. One would think that still requires opening ports, but the magic of zerotier makes that not a problem. I'll talk about this later.

Reverse proxies

Configuring a reverse proxy is quite simple in nginx.

server {
        server_name blog.kdb424.xyz;
        listen 80;
        listen [::]:80;

        access_log /var/log/nginx/reverse-access.log;
        error_log /var/log/nginx/reverse-error.log;

        location / {
                proxy_pass http://planex.far:8197;
        }
}

Hosts file

$ cat /etc/hosts
127.0.1.1	ubuntu
192.168.194.161 planex.far

# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

Zerotier

Zerotier allows me to create a virtual network without setting up a complex VPN, and just treat it as if all of my machines are on a private subnet, but without an exit node. It's an ethereal network, creating connections as needed. In the above example, proxy_pass http://planex.far:8197; points to an address that shouldn't exist publicly. It's added to my /etc/hosts for convenience, but using the zerotier IP directly would work too. A simple ping test shows what's going on.

$ ping planex.far -c 1
PING planex.far (192.168.194.161) 56(84) bytes of data.
64 bytes from planex.far (192.168.194.161): icmp_seq=1 ttl=64 time=10.4 ms
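Getting a machine onto that network in the first place is only a couple of commands (the network ID below is a placeholder; you still authorize the new member in your ZeroTier controller afterwards):

# Join the private network (16 character network ID from your controller)
zerotier-cli join 1234567890abcdef

# Check that the network came up and what IP was assigned
zerotier-cli listnetworks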

Docker swarm

Docker swarm conveniently takes the same compose files that you are used to using, and deploys them to a swarm. Swarms consist of multiple machines and can offer failover modes, load balancing, and many other nice things, though in this example I'll just show how I deploy this blog to my swarm.

---
version: "3"

services:
    nginxBlog:
        image: nginx
        container_name: blog
        ports:
            - 8197:80
        restart: unless-stopped

        volumes:
            - blog:/usr/share/nginx/html

volumes:
    blog:
        driver: local
        driver_opts:
            type: nfs
            o: "addr=192.168.25.51,rw,nfsvers=4"
            device: ":/mnt/data/blog"

The reason we don't use a drive mounted locally is that we need to ensure that whatever machine in the swarm starts this server has access to the same data.
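For reference, pushing that file out to the swarm is a single command (I've called the stack blog here; the name is arbitrary):

# Deploy (or update) the stack from a manager node
docker stack deploy -c docker-compose.yml blog

# See where the service landed and that it's running
docker service ls
docker service ps blog_nginxBlog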

Why a swarm and not just run docker on each machine?

Swarms offer things like load balancing, which includes finding the correct machine for a service, even if you point at a different node in the cluster. If you had 3 machines in your swarm, you could access the port number on any of the machines, and the load balancer would transparently direct you to the service. Another reason is maintenance. If you have to restart a machine, or shut it down, services will just be moved to another machine in the cluster, or stay up if you had multiple instances already running, offering you little to no downtime and no extra thought as you maintain machines, or bring them up and down.
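The maintenance flow looks roughly like this (node2 is a placeholder node name):

# Tell swarm to move everything off the node before rebooting it
docker node update --availability drain node2

# ... reboot, update, etc ...

# Let it take work again when it's back
docker node update --availability active node2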

Static site generator

As you may have seen from the footer of every page, this blog is created with Pelican, which is a static site generator. The source code can actually be seen here.

Conclusion

Hopefully this gives you a better idea of how I host things, and keep my ports closed and secure. If you have any more questions or comments on this, feel free to reach out to me, and I'll be glad to chat about it, or do more writeups on specifics if things are commonly asked.

PMM (Package Manager Manager)

Motivation

PMM came about from a simple frustration. "Why is there so much cruft installed that I just don't care about in my system?!" This is pretty simple to solve one would think. Just list all packages with something like pacman -Qqe, go through them all one by one, and remove what you don't want. After doing this many times, I remembered that Gentoo had solved this in a much better way. A world file was the simple way to keep track of what you explicitly installed, and not worry about dependencies. It only tracked what you actually wanted. I went on the hunt to see if other package managers were able to keep track of things like orphans, dependencies, and explicitly installed packages, and it turns out that most modern package managers can.

Where it started

I knew on things like Arch based Linux distros, I could use things like

$ pacman -Qqe
acpi
acpid
acpid-openrc
alsa-firmware
alsa-utils
...

This lists the things that were explicitly installed. Simply pipe that into a file, and it's effectively a worldfile similar to what Gentoo has.

pacman -Qqe > worldfile

Sets

What became clear quickly was that this file was very long and hard to go through unless I knew what I was looking for, and sometimes things logically consist of more than one package. Pipewire, for example, was actually installed on my systems as a set of

easyeffects
lsp-plugins
pipewire
pipewire-alsa
pipewire-jack
pipewire-pulse

These packages were all logically "pipewire" for my system, and so sets were created. I could have a world file, now known as a "set", that can be pulled into other world files in any order. This allows for things like quickly changing between Sway and XFCE, or PulseAudio and PipeWire.

Installing, and removing

While installing things is great for setting up a system, removing things is also important for cleaning up the cruft, which is why this was all started in the first place. I found that most package managers know the difference between a dependency and an explicitly installed package. The world file is what we want the package manager to consider explicitly installed, so pmm can list a diff to tell you what was added/removed, or tell the system to add/purge things to be in sync with the world file. This keeps the system lean, but able to change rapidly, while keeping track of things by ideas, and not just packages, thanks to sets.

Orphans

PMM doesn't actually remove packages when syncing. This allows running software to stay as it is while the system transforms itself into the new "correct" state defined by the world file. It marks the things not explicitly installed as dependencies and trusts that the package manager knows how to clean up after itself. Known supported package managers are listed, and pmm offers a way to purge orphans as a wrapper, using the package manager itself to do all of the work.
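On an Arch/Artix system, the primitives pmm leans on look roughly like this (a sketch of the idea, not pmm's actual invocation; the package name is an example):

# Mark a package as a dependency instead of explicitly installed
pacman -D --asdeps easyeffects

# List orphans (dependencies nothing requires anymore)...
pacman -Qtdq

# ...and purge them, configs included
pacman -Rns $(pacman -Qtdq)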

Real world examples

Feel free to check my worldfiles here

Amy Lenovo Flex 5 laptop running Artix Linux

Planex Desktop SAN running Artix Linux

Farnsworth Mac Mini M1 (2020) Running OSX