Our Experience with PostgreSQL on ZFS

First a little context: At Faradai we had been using PostgreSQL on the ext4 file system for a number of years until it recently became evident that the constantly growing database size (over 5 TB at the time) would start to cause issues in general performance, cloud costs, and the ability to have reliable backups. We reviewed a number of alternative file systems, and chose ZFS mainly for its following three features:

Instantenous Filesystem Snapshots
This makes creating incremental database backups a piece of cake, something we didn’t have before. It never fails to amaze me how fast and easy it is to revert to a snapshot. This is doubly important if one is managing a write-heavy database that changes every second. We use the snapshot feature to transfer hourly snapshots to a backup server which also doubles as our development database server. More on that later.

Built-in Compression
Most of the data we collect are time-series sensor measurements, and they compress really well. In fact, the compression ratio is over 4.5x in our case! That naturally decreases our disk storage needs by a similar ratio. Another side benefit is that the disk throughput would be higher because data takes less space on disk and would require fewer I/O operations and less bandwidth. CPU usage is increased slightly when compression is enabled, which is fine with us.

Built-in Support for Secondary Caching on SSD Disks
Block storage on the cloud is great until it isn’t. Yes, we are getting virtually infinite storage that can be dynamically allocated and resized, and it tends to be more reliable than individual physical disks, but the network interface to storage is significantly slower. That is the price we pay for flexibility. So, we needed an easy way to utilize the fast physical disks attached to VMs for caching. We don’t care if they fail. We want to use them as a secondary cache, and ZFS makes that really easy.

ZFS has other neat features such as the ability to detect bit rot and its own implementation of software RAID, which aren’t really critical for us, but they are still very nice to have.

How we Do Backups on ZFS

There are a number of open-source tools that are used to automate ZFS backups, but our needs aren’t complex, and we prefer our critical tooling like backup scripts as simple as possible to avoid unpleasant surprises when things go south. So, we went ahead and wrote a short script and added 2 cron entries. Less than 20 lines in total including comments. It can’t probably get simpler than that thanks to ZFS’ amazing incremental snapshot feature that finishes its job in seconds. We take 23 hourly and 14 daily snapshots on the production server, and those snapshots are transferred to the backup server. Our backup server initiates and controls the entire process, and it is essentially a perfect replica of our production server, but is of course up to an hour behind.

We Live with our Backups

Test your backups! How many times have we heard that? Too many to count, but it’s absolutely true. Probably one of the most famous examples of the untested backups went bad is Toy Story 2. The entire movie production could have been lost due to corrupt backups and the infamous ‘rm -rf’ command in the wrong directory. Fortunately, they discovered a few weeks old copy of the film assets on the computer of a pregnant employee working from home, and saved the production. It’s not hard to imagine that others in a similar situation might not be so lucky.

Probably one of the most effective ways to test the backups is to actually use their contents, and use them every day. Here is how we do it:

Every few weeks, we pick a recent daily snaphot on the backup server, and create a ZFS clone from the snapshot to make it writable.
We then make the PostgreSQL instance running on the backup server use the clone’s mount point as its data directory.
Finally, we restart PostgreSQL to have it recognize the new data directory.

The backup server is now doubling as our development database server, and the process of taking hourly snapshots from the production server continues to run in the background.

Technical Bits

Installation

It takes only a few commands to install ZFS and PostgreSQL on a Linux machine (Ubuntu in our case).

sudo apt install zfsutils-linux                                          # Install zfs filesystem packages
sudo fdisk -l                                                            # Take note of disk names for the next step
sudo zpool create -o autoexpand=on db /dev/nvme1n1 /dev/nvme2n1          # Create a striped RAID0 pool from 2 block storage volumes
sudo zpool add db cache /dev/nvme3n1                                     # Add local SSD drive as cache
sudo zfs create db/data -o mountpoint=/var/lib/postgresql                # Create dataset and mount it to use for PostgreSQL
sudo zfs set compression=lz4 db                                          # Enable zfs compression
sudo zfs set atime=off db                                                # Disable atime for improved performance
sudo zfs set recordsize=32k db                                           # Set zfs record size to 32 KB (default: 128 KB)
sudo reboot                                                              # Reboot
sudo zpool import db                                                     # Import zfs pool; has to be done only once after first reboot
wget --quiet -O - \
https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -  # Add PostgreSQL repository key
sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt \
$(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'       # Add PostgreSQL repository
sudo apt update                                                          # Update apt packages
sudo apt install postgresql-14 postgresql-client-14                      # Install PostgreSQL 14
sudo sysctl -w vm.swappiness=1                                           # Tell Linux not to use swap unless absolutely necessary
echo 'vm.swappiness=1' | sudo tee -a /etc/sysctl.conf                    # Append 'vm.swappiness=1' to systctl.conf for permanent effect

Careful readers will notice that we have created a striped RAID0 pool out of two disks on a production server, which should be verboten, and I would normally agree with you if these were actual physical disks, but they aren’t. The disks are commissioned from network attached storage, and they are managed by a tier-1 cloud provider. We think they know what they are doing, and they have the necessary redundancy in place. In any case, we have a backup/development server that is ready to switch to production in a moment’s notice and a separate set of backups stored on object storage.

After the installation, the settings in postgresql.conf needs to be optimized to make better use of the number of CPUs and amount of RAM on the machine. PGTune’s recommendations is generally a good starting point.

Finally, we disable full page writes in PostgreSQL, and restart the service to eliminate unnecessary disk writes. By design it is impossible to write partial pages on ZFS.

ALTER SYSTEM set full_page_writes=off;
CHECKPOINT;

Backup Script

Here is our backup script: “remote_db_snapshot” running on our backup server. The script accesses the production server (remote) over SSH, switches PostgreSQL to online backup mode, takes a snapshot, exits backup mode, prunes old snapshots, and finally copies the new snapshot over.

#!/bin/bash

HOST="some-user@prod-db"
NOW=$(date --iso-8601=minutes | sed 's/:/-/g' | cut -c 1-16)
DELETE_OLDEST=$(date -d "$1 ago" --iso-8601=minutes | sed 's/:/-/g' | cut -c 1-16)
LAST_SNAPSHOT=$(sudo zfs list -t snapshot | tail -n1 | cut -d " " -f 1)

# Switch remote PostgreSQL to backup mode, take a ZFS snapshot, and exit from PostgreSQL backup mode
ssh $HOST "sudo -u postgres psql -c \"SELECT pg_start_backup('$NOW', true);\""
ssh $HOST "sudo zfs snapshot db/data@$NOW"
ssh $HOST "sudo -u postgres psql -c 'SELECT pg_stop_backup();'"

# Destroy oldest snapshot on local server, and if successful delete on remote
sudo zfs destroy -v db/data@$DELETE_OLDEST && ssh $HOST "sudo zfs destroy -v db/data@$DELETE_OLDEST"

# Copy newly taken snapshot from remote server
ssh $HOST "sudo zfs send -cRi $LAST_SNAPSHOT db/data@$NOW" | sudo zfs receive -vF db/data

The cron entries:

0 1-23 * * *	. $HOME/.profile; ~/scripts/remote_db_snapshot '1 day'   2>&1 | logger -t remote_db_snapshot_hourly
0    0 * * *	. $HOME/.profile; ~/scripts/remote_db_snapshot '2 weeks' 2>&1 | logger -t remote_db_snapshot_daily

To make it easy to keep track, our snapshots are named with the time they are taken:

Daily snapshots

...
db/data@2022-03-30T00-00
db/data@2022-03-31T00-00
db/data@2022-04-01T00-00
db/data@2022-04-02T00-00
db/data@2022-04-03T00-00
db/data@2022-04-04T00-00
db/data@2022-04-05T00-00
db/data@2022-04-06T00-00

24-hour rolling window snapshots:

db/data@2022-04-06T18-00
db/data@2022-04-06T19-00
db/data@2022-04-06T20-00
db/data@2022-04-06T21-00
db/data@2022-04-06T22-00
db/data@2022-04-06T23-00
db/data@2022-04-07T00-00
...
db/data@2022-04-07T11-00
db/data@2022-04-07T12-00
db/data@2022-04-07T13-00
db/data@2022-04-07T14-00
db/data@2022-04-07T15-00
db/data@2022-04-07T16-00
db/data@2022-04-07T17-00

Conclusion and Disclaimer

Our experience with PostgreSQL on ZFS has been absolutely positive so far, but we certainly don’t know everything there is to know about the duo. We are confident that we have a handle on the important bits though. So, please do take our experience with a grain of salt, and if you think we are doing something terribly wrong or have any recommendations, please let me know, and I will update this article.

Worth a Read:

Shared by:

Related: