Last week I had a massive hardware failure on one of the GlusterFS storage nodes in the ILRI, Kenya Research Computing cluster: two drives failed simultaneously in the underlying RAID5 array. As RAID5 can only withstand one drive failure, the entire 31TB array was toast. FML.
After replacing the failed disks, rebuilding the array, and formatting my bricks, I decided I would use rsync to pre-seed my bricks from the good node before bringing glusterd back up.
tl;dr: rsync is amazing, but it's single threaded and struggles when you tell it to sync large directory hierarchies. Here's how you can speed it up.
rsync #fail
I figured syncing the brick hierarchy from the good node to the bad node was simple enough, so I stopped the glusterd service on the bad node and invoked:
# rsync -aAXv --delete --exclude=.glusterfs storage0:/path/to/bricks/homes/ storage1:/path/to/bricks/homes/
After a day or so I noticed I had only copied ~1.5TB (over one hop on a dedicated 10GbE switch!), and I realized something must be wrong. I attached to the rsync process with strace -p and saw a bunch of system calls in one particular user's directory. I dug deeper:
# find /path/to/bricks/homes/ukenyatta/maker/genN_datastore/ -type d | wc -l
1398640
So this one directory in a single user's home contained over a million subdirectories and $god knows how many files, and the find command alone took several hours to finish! To make matters worse, spot checks of other users' home directories revealed more massive directory structures.
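If you want to watch what a busy rsync is actually doing, you can attach to it directly and trace only its file-related system calls; the pgrep invocation is just one assumed way of finding the PID:
# strace -e trace=file -p $(pgrep -xo rsync)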
What we've learned:
- rsync is single threaded
- rsync generates a list of files to be synced before it starts the sync
- MAKER creates a ton of output files/directories 😉
It's pretty clear (now) that a recursive rsync on my huge directory hierarchy is out of the question!
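You can get a feel for the cost of the file-list phase on a tree like this by asking rsync for a dry run with statistics; nothing is copied, but it still has to walk everything (the destination here is just a hypothetical placeholder):
# rsync -a --dry-run --stats /path/to/bricks/homes/ /tmp/sync-test/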
rsync #winning
I had a look around and saw lots of people complaining about rsync being "slow" and others suggesting tips to speed it up. One very promising strategy was described on this wiki, and there's a great discussion in the comments.
Basically, he describes a clever use of find and xargs to split the problem up into smaller pieces that rsync can process more quickly.
sync_brick.sh
So here's my adaptation of his script for the purpose of syncing failed GlusterFS bricks, sync_brick.sh:
#!/usr/bin/env bash
# borrowed / adapted from: https://wiki.ncsa.illinois.edu/display/~wglick/Parallel+Rsync
# RSYNC SETUP
RSYNC_PROG=/usr/bin/rsync
# note the important use of --relative to use relative paths so we don't have to specify the exact path on dest
RSYNC_OPTS="-aAXv --numeric-ids --progress --human-readable --delete --exclude=.glusterfs --relative"
# weak/fast cipher and no compression for a trusted, dedicated link
# (newer OpenSSH releases drop arcfour; use something like aes128-ctr there)
export RSYNC_RSH="ssh -T -c arcfour -o Compression=no -x"
# ENV SETUP
SRCDIR=/path/to/good/brick
DESTDIR=/path/to/bad/brick
# recommended to match the number of CPUs
THREADS=4
BAD_NODE=server1
cd "$SRCDIR" || exit 1
# COPY
# note the combination of -print0 and -0 to handle whitespace in file names!
find . -mindepth 1 -maxdepth 1 -print0 | \
    xargs -0 -n1 -P$THREADS -I% \
    $RSYNC_PROG $RSYNC_OPTS "%" "$BAD_NODE:$DESTDIR"
Pay attention to the source/destination paths, the number of THREADS, and the BAD_NODE name, and then you should be ready to roll.
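Since a full brick sync can run for many hours, I'd launch it inside screen or tmux and keep a log; the session and log names below are just examples:
# screen -S brick-sync
# ./sync_brick.sh 2>&1 | tee /var/log/sync_brick.log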
The Magic, Explained
It’s a bit of magic, but here are the important parts:
- The -aAXv options to rsync tell it to archive, preserve ACLs, and preserve eXtended attributes. Extended attributes are critically important in GlusterFS >= 3.3, and also if you're using SELinux.
- The --exclude=.glusterfs option to rsync tells it to ignore this directory at the root of the brick, as the self-heal daemon (glustershd) will rebuild it from the files' extended attributes once we restart the glusterd service.
- The --relative option to rsync means we don't have to bother constructing the destination path, as rsync will treat each path as relative to the destination's top.
- The RSYNC_RSH options influence rsync's use of SSH, basically telling it to use very weak encryption and disable any unnecessary features for non-interactive sessions (tty, X11, etc).
- Using find with -mindepth 1 and -maxdepth 1 means we only pick up the files/directories one level below the brick root, so each one becomes its own rsync job.
- Using xargs with -n1 and -P tells it to pass one argument per command invocation, and to run $THREADS processes at a time.
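Once the pre-seed completes, the rough follow-up on the repaired node is to start the service and let glustershd rebuild .glusterfs and heal anything that changed during the sync; the volume name "homes" is hypothetical, so substitute your own (and on older init systems use service glusterd start):
# systemctl start glusterd
# gluster volume heal homes full
# gluster volume heal homes info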
I hope this helps someone!