Last week I had a massive hardware failure on one of the GlusterFS storage nodes in the ILRI, Kenya Research Computing cluster: two drives failed simultaneously on the underlying RAID5. As RAID5 can only withstand one drive failure, the entire 31TB array was toast. FML.
After replacing the failed disks, rebuilding the array, and formatting my bricks, I decided I would use `rsync` to pre-seed my bricks from the good node before bringing `glusterd` back up.
rsync is amazing, but it’s single threaded and struggles when you tell it to sync large directory hierarchies. Here’s how you can speed it up.
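You can watch that up-front file-list scan yourself with a dry run. Here's a toy example (made-up paths under `/tmp`; assumes `rsync` is installed locally):

```shell
# Build a tiny tree, then ask rsync to plan the sync without copying.
# The --stats summary shows the file list rsync builds before any data
# moves; on millions of entries this scan alone can take hours.
mkdir -p /tmp/rsync-demo/src/a /tmp/rsync-demo/src/b
touch /tmp/rsync-demo/src/a/one.txt /tmp/rsync-demo/src/b/two.txt
rsync -a --dry-run --stats /tmp/rsync-demo/src/ /tmp/rsync-demo/dest/ | grep 'Number of files'
```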
I figured syncing the brick hierarchy from the good node to the bad node was simple enough, so I stopped the `glusterd` service on the bad node and invoked:

```shell
# rsync -aAXv --delete --exclude=.glusterfs storage0:/path/to/bricks/homes/ /path/to/bricks/homes/
```
After a day or so I noticed I had only copied ~1.5TB (over one hop on a dedicated 10GbE switch!), and I realized something must be wrong. I attached to the `rsync` process with `strace -p` and saw a bunch of system calls in one particular user's directory. I dug deeper:
```shell
# find /path/to/bricks/homes/ukenyatta/maker/genN_datastore/ -type d | wc -l
1398640
```
So this one directory in one user's home contained over a million other directories and $god knows how many files, and the `find` command itself took several hours to finish! To make matters worse, careful trial-and-error inspection of other users' home directories revealed more massive directory structures as well.
What we've learned:

- `rsync` is single threaded
- `rsync` generates a list of files to be synced before it starts the sync
- MAKER creates a ton of output files/directories 😉
It's pretty clear (now) that a recursive `rsync` on my huge directory hierarchy is out of the question!
I had a look around and saw lots of people complaining about `rsync` being "slow" and others suggesting tips to speed it up. One very promising strategy was described on this wiki, and there's a great discussion in the comments.
Basically, he describes a clever use of `xargs` to split up the problem set into smaller pieces that `rsync` can process more quickly.
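The idea in miniature, with `echo` standing in for `rsync` (the directory names here are invented):

```shell
# Split the top level of a tree into separate arguments and hand each
# one to a parallel worker; -print0/-0 keep odd filenames safe, and
# -P2 runs two workers at a time.
mkdir -p /tmp/fanout-demo/home1 /tmp/fanout-demo/home2 /tmp/fanout-demo/home3
cd /tmp/fanout-demo
find . -mindepth 1 -maxdepth 1 -print0 | \
    xargs -0 -n1 -P2 -I% echo "worker would sync: %"
```

Each `%` becomes one small, fast job instead of one gigantic recursive walk.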
So here's my adaptation of his script for the purpose of syncing failed GlusterFS bricks:

```shell
#!/usr/bin/env bash
# borrowed / adapted from: https://wiki.ncsa.illinois.edu/display/~wglick/Parallel+Rsync

# RSYNC SETUP
RSYNC_PROG=/usr/bin/rsync
# note the important use of --relative to use relative paths so we don't
# have to specify the exact path on dest
RSYNC_OPTS="-aAXv --numeric-ids --progress --human-readable --delete --exclude=.glusterfs --relative"
export RSYNC_RSH="ssh -T -c arcfour -o Compression=no -x"

# ENV SETUP
SRCDIR=/path/to/good/brick
DESTDIR=/path/to/bad/brick
# Recommended: match the number of CPUs
THREADS=4
BAD_NODE=server1

cd "$SRCDIR"

# COPY
# note the combination of -print0 and -0!
find . -mindepth 1 -maxdepth 1 -print0 | \
    xargs -0 -n1 -P$THREADS -I% \
    $RSYNC_PROG $RSYNC_OPTS "%" $BAD_NODE:"$DESTDIR"
```
Pay attention to the source/destination paths, the number of `THREADS`, and the `BAD_NODE` name, and you should be ready to roll.
The Magic, Explained
It’s a bit of magic, but here are the important parts:
- `-aAX` tells `rsync` to archive, preserve ACLs, and preserve eXtended attributes. Extended attributes are critically important in GlusterFS >= 3.3, and also if you're using SELinux.
- `--exclude=.glusterfs` tells `rsync` to ignore this directory at the root of the brick, as the self-heal daemon — `glustershd` — will rebuild it based on the files' extended attributes once we restart the `glusterd` service.
- `--relative` means we don't have to bother constructing the destination path, as `rsync` will imply the path is relative to our destination's top.
- `RSYNC_RSH` tweaks `rsync`'s use of SSH, basically telling it to use very weak encryption and to disable any unnecessary features for non-interactive sessions (tty, X11, etc).
- `-mindepth 1 -maxdepth 1` just means `find` concentrates on files/directories one level below the top of the brick, without recursing any deeper.
- `-n1` tells `xargs` to use one argument per command line, and `-P$THREADS` to launch `$THREADS` processes at a time.
I hope this helps someone!