Storage: Trials and Tribulations in the Homelab

My problem:

Right now I run TrueNAS (formerly FreeNAS). It’s a great storage distribution that lets you easily set up ZFS pools and present data via SMB, iSCSI, NFS, or even with 3rd-party tools like S3 and NextCloud. While it’s a great distribution and ZFS is an amazing file system, there are some caveats.

ZFS presents data in “Pools”, and each pool consists of one or more “vDevs”. These vDevs (Virtual Devices) are essentially groups of disks presented as a software RAID, and the Pool ties those devices together and stripes data across them. While this is great for performance and lets you easily slap more disks into a pool, getting disks OUT, on the other hand, is basically impossible without a mountain of work and a literal shitload of wear on your drives. My current predicament? 2 sets of 12x 4TB drives, each set its own vDev. Meaning I would need to buy 12 disks just to upgrade a single vDev and expand my storage. I could of course add the new drives to the pool and leave the 24x 4TB drives in place, but we’re trying to save on power, and those 4TB drives are gonna be dropping like flies at some point.

The problem is that a ZFS pool’s layout is essentially immutable: once you add a vDev, you can’t take it back out. While there are some pushes in GitHub land to allow striped vDevs to be removed without any data loss, RaidZ* vDevs are a no-go Johnny Blow. So once you set up a vDev with 12 disks in a RaidZ2 configuration, the only way to swap out those disks is to replace them.

One….

by…

One…

While waiting for the Pool to resilver data onto each new disk in the process (which can take multiple days per disk, depending on the size of the disk and the pool).
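For anyone who hasn’t had the pleasure, the one-at-a-time dance looks roughly like this. A sketch, not my exact commands: the device names are made up and the pool is imaginatively called “tank”:

    # Swap one member of the RAIDZ2 vDev for a bigger disk (device names are hypothetical).
    zpool replace tank /dev/disk/by-id/ata-OLD_4TB_1 /dev/disk/by-id/ata-NEW_14TB_1

    # Watch the resilver crawl along; days per disk is not unusual on a full pool.
    zpool status tank

    # ...then repeat for all 12 disks in the vDev. The extra capacity only shows up
    # once every disk has been replaced (autoexpand=on saves a manual expand step).
    zpool set autoexpand=on tank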

Or, of course, I could build an entirely NEW pool and transfer the data over, wasting money on the space overhead needed to attempt such a feat (which I actually did attempt, but I ran into the ZFS 95%-full brick wall and performance basically died). Plus, that doesn’t give me the option to change the Pool’s makeup down the line; it just exacerbates the issue at hand. So, instead of doing that (because I’m not THAT well off to buy 12x 14TB drives at $340 each), I decided to migrate all of my data to a different storage system using whatever spare drives I had on hand (and uh, borrowing some drives from the current storage array). The solutions I’m going to test are:

First Choice: UnRAID

UnRAID is a paid-for storage distribution that essentially uses up to two parity drives, with any other drives you add acting as actual data drives. Since it supports up to 2x parity, you can have 2 drives fail without worry (essentially it’s just RAID6). So me being stuck in standard RAID/ZFS mode, I assumed that read throughput should be insane while writes, not so much… LOL nope. Luckily there’s a free trial so you can give it a shot, YMMV.

So apparently UnRAID doesn’t stripe data across the data disks; it just shoves each file onto one disk, and the 2 parity disks are literally just parity (drives to grab the XOR data from, advanced RAID stuff, go google it). So writes are determined by how fast (or slow) your parity disks are. It does have an option for “Turbo Writes”, but with normal spinners you don’t see a huge improvement, and you’re still basically stuck below a single drive’s write speed.

So… as you can imagine, trying to back up 70TB of data onto that, starting from scratch at <100MB/sec, was excruciatingly slow. Not only that, but if a disk fails, actually using the array while it recovers data is basically a no-go. So at the end of the day, I trashed it. Sure, it’s great as a “warm” backup solution, but that’s all it’s really good for. After scouring forums, Reddit, and Google on how to increase speeds, thinking maybe there was a magic bullet hiding somewhere in the settings that would net me even a 10% performance gain, alas, there was not.

Verdict: Pass. Fine as a “warm” backup target, far too slow to be my primary storage.


Second Choice: Ceph

Ceph is a distributed storage system that’s been around for nearly a decade now. If you’ve ever heard of OpenStack, chances are you’ve heard of Ceph or tech based on it. It can present data in a multitude of ways, whether it’s CephFS (basically NFS), RBD (iSCSI-style block devices) or RGW (S3 on top of RADOS); it can tickle your fancy whichever route you want to take. Reading around Reddit and the forums, people were singing its praises to the high heavens like it was the second coming of Jesus in terms of storage for homelabs. So obviously, I gave it a shot.

Ceph can be set up as a standalone node (it’s highly discouraged, because you essentially have no HA and Ceph performs better as a cluster anyway), but me being the cheap’ish asshole that I am, I went standalone. So after setting up 16 OSDs as Bluestore disks, shoving the WAL/DB onto 4x SSDs, and fine-tuning a few things here and there, I finally had my Ceph “cluster” set up.
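For reference, standing the OSDs up with the WAL/DB carved off onto SSD looks something like this. This is a sketch of one way to do it (ceph-volume, with made-up device names), not necessarily the exact deployment path I took:

    # One Bluestore OSD per spinner, with its DB/WAL living on an SSD partition.
    # Repeat per disk; /dev/sdb and /dev/nvme0n1p1 are placeholders.
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1

    # Sanity-check that the OSDs came up and the cluster is happy-ish.
    ceph osd tree
    ceph -s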

Now there are a few ways you can set up your “Pools” in Ceph. You can have Replicated pools (where it makes literal copies of the data for redundancy across OSDs (disks) or Nodes (servers)), or Erasure Coded pools. Erasure Coding acts somewhat like parity striping in RAID: it stores parity data for however many chunks you’re spreading the data across. So for instance I tried 14+2 (k=14, m=2). Essentially what this means is: chop my data up into “k” slices and store “m” parity chunks for redundancy. Quick drunk napkin math says I should only have ~15% overhead (m/k). Lol… nope.
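The pool side of that is only a couple of commands. A sketch, with my own profile/pool names, and the failure domain dropped to “osd” because this is all one box (the default of “host” would never go healthy on a single node):

    # 14 data chunks + 2 coding chunks, spread across OSDs instead of hosts.
    ceph osd erasure-code-profile set ec-14-2 k=14 m=2 crush-failure-domain=osd

    # Create the erasure-coded data pool using that profile (PG count is a guess;
    # tune it for your OSD count).
    ceph osd pool create cephfs_data_ec 128 128 erasure ec-14-2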

So with Ceph, the easiest way to store data that looks like a standard filesystem is CephFS. The problem with CephFS is that if you want to use EC (Erasure Coded) pools, you have to enable EC overwrites. Basically, let’s say my disks are set to 4K blocks and I have a 6K chunk of data. EC overwrites splits that 6K chunk into a 4K chunk and a 2K chunk padded with binary 0’s (because each block needs to be 4K in size), essentially wasting 2K worth of space, multiplied by k. The higher the chunk count (k), the more space gets wasted. Ceph developers call this write amplification. So instead of the ~15% overhead we were assuming, it’s more like 40-60%. This is a known issue with no workaround as of now. CephFS did, however, have decent read/write performance with 16 OSDs on a single server (~350MB/s write and ~400MB/s read). Even with a power-of-2 k I was seeing double the expected overhead.
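For completeness, the EC-overwrites bit that drags all of that write amplification along with it looks roughly like this (filesystem name, pool name and directory are placeholders):

    # CephFS (and RBD) won't touch an EC pool until overwrites are enabled on it.
    ceph osd pool set cephfs_data_ec allow_ec_overwrites true

    # Attach the EC pool to the filesystem as an additional data pool...
    ceph fs add_data_pool cephfs cephfs_data_ec

    # ...and point a directory at it with a file layout attribute.
    setfattr -n ceph.dir.layout.pool -v cephfs_data_ec /mnt/cephfs/media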

So “alright,” I think to myself. “CephFS is a no-go, and so is RBD since it requires EC overwrites too. How about S3FS?” Essentially what this does is present an S3 bucket as a filesystem. Surely this should be decent for large file storage and have at least acceptable performance for reading files. Not only that, but it doesn’t require EC overwrites, so theoretically the overhead should actually land in that ~15% ballpark.

Lol, nope.

S3FS had terrible read performance. Wait, terrible “random” read performance, which I should have expected. My Plex server would take over 15 minutes to scan a single season of Anime that I have, and would sometimes time out, when normally it takes seconds. RSync would also take a 10-second break between file uploads because of some strange “HEAD” calls through the HTTP API. awscli’s “sync” tool did help a lot in that regard, but the problem there is that anything deeper than 1 level of folders wouldn’t have any UID/GID metadata (S3 doesn’t have the concept of “folders”, it’s a flat system with no hierarchy), so it would show up as unreadable unless you set a compat mode switch that made reads even slower.
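For what it’s worth, the s3fs mount against the RADOS Gateway itself is dead simple. This is a sketch with placeholder bucket, endpoint and keys (the compat switch I mentioned is just another -o option on top of this):

    # RGW credentials in s3fs's ACCESS_KEY:SECRET_KEY format (placeholders here).
    echo 'ACCESSKEY:SECRETKEY' > ~/.passwd-s3fs
    chmod 600 ~/.passwd-s3fs

    # Mount the bucket as a filesystem; RGW generally wants path-style requests.
    s3fs media /mnt/media \
        -o url=http://ceph-rgw.local:7480 \
        -o use_path_request_style \
        -o passwd_file=$HOME/.passwd-s3fs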

So alas, no Ceph here, unless I wanted 100%-200% overhead to get the redundancy I wanted. I’m sure a lot of the Ceph diehards will chime in and say “Well, you weren’t using it the way it was meant to be used” or “You didn’t go through the laundry list of 50 fucking million tunables.” The problem is, I shouldn’t have to. Ceph should be mature enough after a decade that I shouldn’t have to dig through mailing lists, GitHub PRs, Reddit threads and a partridge in a pear tree to get basic file storage that performs decently and takes up the space it says on the label.

Verdict: Pass. Between the write amplification and the tuning rabbit hole, the overhead killed it for me.


Third Choice: Gluster

Gluster is YADS (Yet Another Distributed Storage) from Red Hat. The only major difference is that it’s pretty Plain Jane when it comes to sharing the data (a FUSE Gluster mount or NFSv3). You can, however, share SMB and iSCSI devices on top of it as you would any other folder/block device. Gluster has Bricks (essentially vDevs, or hard drives grouped into one mount, typically LVM Logical Volumes), and those bricks can be combined in a few ways: Distributed, which stores entire files (not parts of files) in a round-robin sort of deal across the bricks and tries to use the space equally depending on brick size; Replicated, which makes literal duplicates of files on different bricks; Striped, which stripes the data across bricks (like RAID 0); or Dispersed, which is Gluster’s version of Erasure Coding.

Gluster’s main selling point for me is that I can REMOVE bricks at any time, which ZFS does NOT allow me to do. So if I wanted to replace a set of drives, I don’t have to go through the painful process of replacing them One… By… One… while waiting for a resilver to finish for each disk. Instead I can remove a brick, Gluster will transfer that brick’s data off to the other bricks (space permitting), and poof, magically I can pull those disks without a care in the world.
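The removal itself is a three-step affair; a sketch with made-up volume, host and brick names:

    # Start draining data off the brick that's going away.
    gluster volume remove-brick gv0 storage1:/bricks/old-4tb/brick start

    # Poll until the migration says "completed".
    gluster volume remove-brick gv0 storage1:/bricks/old-4tb/brick status

    # Commit the removal; the disks behind that brick are now free to pull.
    gluster volume remove-brick gv0 storage1:/bricks/old-4tb/brick commit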

Naturally I tried the Dispersed option first, since I was after minimal overhead for data redundancy. However, Gluster has write amplification of its own, and the only way to expand a Dispersed volume (a group of bricks) is to add the same number of bricks of the same size, creating a “Distributed Dispersed” volume, basically another set of bricks to stripe against. Obviously that was out of the question, because expanding that way down the line would lead me right back to where I’m at now. So…. on to the fourth option…

Verdict: Pass on its own, but keep reading.


Fourth Choice: ZFS + Gluster

Ok, I know what you’re thinking. “Devin, you’re trying to move AWAY from ZFS, why the f**k are you going BACK to it!?”. OK, hear me out, because there’s a method to my madness.

My main problem is that not only can I not replace the vDevs in my Pool at an affordable price point (because I would have to replace 12 drives one by one), I also can’t REMOVE a vDev if I wanted to re-arrange the pool without losing data. So this is where Gluster comes in. What do you think would happen if I presented whole Pools of disks as bricks in a Gluster distributed volume? Well… let’s try that.

I currently have 2 pools, each with a single vDev: one of 8x 14TB drives and one of 6x 4TB drives. Both are RAIDZ2 for data redundancy (there’s a rough setup sketch after the list below). I don’t have to worry about redundancy on the Gluster side because:

  • ZFS is taking care of the data redundancy per brick.
  • Each brick is a single Pool with a RAIDz2 vDev, so I can have two drives in each brick die before I need to start sweating.
  • I get the performance and redundancy of ZFS in each brick, but also the versatility of Gluster.
  • Everything is on a single host. I don’t need host redundancy at this time. There are ways to do that later, which I’ll write about.
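Roughly, the setup looks like this. A sketch, not gospel: pool, dataset, volume and host names are all made up, and there are plenty of ZFS and Gluster tunables I’m glossing over.

    # One RAIDZ2 pool per set of drives; each pool gets a dataset to hold a brick.
    zpool create tank1 raidz2 /dev/disk/by-id/ata-14TB_{1..8}
    zpool create tank2 raidz2 /dev/disk/by-id/ata-4TB_{1..6}
    zfs create -o mountpoint=/bricks/tank1 tank1/gluster
    zfs create -o mountpoint=/bricks/tank2 tank2/gluster
    mkdir /bricks/tank1/brick /bricks/tank2/brick

    # Tie the two bricks together as a plain distributed Gluster volume.
    gluster volume create media-vol storage1:/bricks/tank1/brick storage1:/bricks/tank2/brick
    gluster volume start media-vol

    # Mount it through the FUSE client and start filling it up.
    mount -t glusterfs storage1:/media-vol /mnt/media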

So for instance, if I wanted to replace my 12x 4TB drives with another set of drives, all I have to do is throw the new drives into a new zpool, add it as a brick, queue the old zpool’s brick to be removed from Gluster, let Gluster migrate the data off said old zpool, and voila! I can destroy that pool with no loss of data and expand my storage whenever I need to. If I wanted extra performance, I could throw an NVMe ZIL (SLOG) partition on each zpool brick, or add another RAIDz2 vDev of drives for each pool to stripe across for faster writes.
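In command form, the swap is something like this (again a sketch with made-up names; the remove-brick part is the same start/status/commit dance from earlier):

    # New drives become their own RAIDZ2 pool and a fresh brick.
    zpool create tank3 raidz2 /dev/disk/by-id/ata-NEW14TB_{1..8}
    zfs create -o mountpoint=/bricks/tank3 tank3/gluster
    mkdir /bricks/tank3/brick
    gluster volume add-brick media-vol storage1:/bricks/tank3/brick
    gluster volume rebalance media-vol start

    # Drain the old pool's brick (wait for remove-brick status to show completed
    # before committing), then the old zpool can go away entirely.
    gluster volume remove-brick media-vol storage1:/bricks/tank2/brick start
    gluster volume remove-brick media-vol storage1:/bricks/tank2/brick commit
    zpool destroy tank2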