I’m in the middle of a new GlusterFS deployment for our research computing infrastructure (storage of Bioinformatics and GIS data). At first I was just happy to see the huge size of my /home partition when I did
df -h, but I quickly found that performance wasn’t quite what I was expecting and I started to worry.
In this particular instance I was rsync’ing a few terabytes of user homes from my old server to my GlusterFS FUSE mount, /home:
[root@server1]# rsync -avz server2:/mnt/export/home/ /home
I was shocked to find that, after 24 hours, I had only transferred ~1.5 TB! That works out to an average of something like 18 MB/sec. We’re not even talking about hammering the storage in “real world” use cases yet, I’m just migrating my data over!
For reference, here’s the setup I’m using. Two servers with:
- GlusterFS 3.3.1
- CentOS 6.4
- 12 x 3TB drives in hardware RAID5, mounted as one ~30TB XFS brick on each server
- One replicated GlusterFS volume (/homes), set up roughly as sketched after this list
- One client (for now)
- Dedicated 1GbE switch
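The volume itself is nothing exotic. Roughly speaking, a two-brick replicated volume like this gets created along these lines (the brick paths here are a guess based on where my XFS arrays are mounted, so treat them as illustrative):

[root@server1]# gluster volume create homes replica 2 transport tcp server1:/mnt/gfs server2:/mnt/gfs
[root@server1]# gluster volume start homes

and the client mounts it over FUSE with something like:

[root@client]# mount -t glusterfs server1:/homes /home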
The GlusterFS stack isn’t complicated per se (application- and network-layer abstraction of raw block storage), but this problem has lots of variables and it’s possible there isn’t one easy answer. By my count we have to consider network, filesystems, hard disks, and overhead from applications like
rsync, not to mention GlusterFS itself. I’m not an expert at any of those things, but I can certainly use the power of logic to deduce the problem.
My first inclination was to take disk and application overhead out of the picture and just test raw network performance. I’m using a dedicated 1GbE network switch so the two servers in question are only one hop away. I used
iperf to test the throughput from server2 to server1:
[root@server1]# iperf -s
[root@server2]# iperf -c server1
------------------------------------------------------------
Client connecting to server1, TCP port 5001
TCP window size: 23.2 KByte (default)
------------------------------------------------------------
[  3] local server2 port 37731 connected with server1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.10 GBytes  944 Mbits/sec
944 Mbits/sec is fine for what I’m doing right now, and is MUCH faster than the ~18 MB/sec average I was seeing when copying my data over (a saturated gigabit link works out to roughly 118 MB/sec). I had been planning on bonding a few Ethernet ports to get aggregated throughput on these NICs, but for now it seems I’m ok. I guess my current issue isn’t related to the network configuration.
Filesystem / disk?
My next guess was that I had somehow misconfigured either the XFS filesystem or something to do with the hard disks or the hardware RAID on the back-end servers themselves.
Drop all in-memory file caches and then create a 1GB test file on my RAID5 mount point:
[root@server1 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@server1 ~]# dd if=/dev/zero of=/mnt/gfs/zero1 bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.821399 s, 1.3 GB/s
Make a few copies. Using
dd makes it easier to see the speed of the copy operation:
[root@server1 ~]# dd if=/mnt/gfs/zero1 of=/mnt/gfs/zero2
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.94788 s, 266 MB/s
[root@server1 ~]# dd if=/mnt/gfs/zero1 of=/mnt/gfs/zero3
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.7447 s, 280 MB/s
Well 266 – 280 megabytes per second isn’t bad, but let’s drop caches again to see if we can get a more realistic view of raw hard drive performance:
[root@server1 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@server1 ~]# dd if=/mnt/gfs/zero1 of=/mnt/gfs/zero4
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 8.66945 s, 121 MB/s
[root@server1 ~]# dd if=/mnt/gfs/zero1 of=/mnt/gfs/zero4
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 8.06204 s, 130 MB/s
120 – 130 megabytes per second isn’t going to break any records, but it’s definitely within the realm of what we expect from 7200 RPM hard drives, if only a bit low. It seems that my GlusterFS speed issues aren’t related to the filesystem or the hard disks (not yet, at least).
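As an aside, instead of dropping caches by hand, dd can be told to flush the output file to disk before it reports a speed, which keeps the page cache from flattering the write side of the copy. A quick sketch (the output file name is just an example):

[root@server1 ~]# dd if=/mnt/gfs/zero1 of=/mnt/gfs/zero5 bs=1M conv=fdatasync

The read side can still be served from cache, though, so dropping caches first is still useful if you want both halves of the copy to be honest.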
During all my testing I noticed that
rsync was taking 10 – 20 minutes before it would even start copying! That points to serious overhead somewhere: rsync’s differential algorithm, ssh as a transport, or GlusterFS’ FUSE abstraction layer. If I were to guess I’d say I have 50 or so users, with 100,000 or so files between them, totaling around 2.5TB of data; it’s not hard to believe that rsync takes a while to figure out where it left off in that sea of files and directories.
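If I wanted to confirm that the wait really is rsync building and comparing its file list (rather than the transfer itself), a dry run with statistics would show it. A sketch using the same paths as before:

[root@server1]# time rsync -avzn --stats server2:/mnt/export/home/ /home

With -n nothing is actually copied, so the wall-clock time is essentially just the cost of walking ~100,000 files and directories on both ends.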
I decided to try copying a file directly within the FUSE mount’s context, as this is a more likely use case I’ll experience (users manipulating data within their home directory):
[root@server2]# cd /home/bam
[root@server2]# dd if=file.bam of=file.bam2
^C7334992+0 records in
7334992+0 records out
3755515904 bytes (3.8 GB) copied, 372.932 s, 10.1 MB/s
After a few minutes I was sick of waiting and had to cancel it.
iftop on the back-end storage showed the network traffic moving at ~90 megabits, which is pretty lousy. file.bam is 21 gigs, and at that rate I would have had to wait roughly another half hour for it to finish!
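(For the record, by watching iftop on the back-end storage I mean something along these lines, where the interface name is whatever your storage network sits on:)

[root@server1 ~]# iftop -n -i eth0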
Maybe it’s FUSE? Let’s try the same thing over NFS, first with dd:
[root@server2 ~]# umount /home
[root@server2 ~]# mount -t nfs -o vers=3 server1:/homes /home
[root@server2]# cd /home/bam
[root@server2]# dd if=file.bam of=file.bam2
41623550+1 records in
41623550+1 records out
21311257776 bytes (21 GB) copied, 454.733 s, 46.9 MB/s
47 MB/sec, not bad… then with cp:
[root@server1]# time cp file.bam file.bam2
cp: overwrite `file.bam2'? y

real    6m52.757s
user    0m0.089s
sys     0m16.054s
Closer to 50 MB/sec this time, though that could be some element of caching somewhere; it’s basically on par with the previous run.
In any case,
iftop on the back-end server shows ~600 megabits of inbound traffic and ~850 megabits of cumulative outbound traffic to server2 as well as the other back-end server (the GlusterFS mirror). It seems NFS is much faster than the FUSE client for this particular operation. Also, for what it’s worth, this is surely nowhere near the maximum speed of the hard disk arrays on these servers, and aggregating the Ethernet ports on all the servers involved could be beneficial in the future.
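To give an idea of what that aggregation would involve, here’s a rough sketch of a two-port bond on CentOS 6 (interface names, addresses, and the bonding mode are all assumptions; mode 802.3ad needs LACP configured on the switch, while something like balance-alb works without any switch support):

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
IPADDR=192.168.1.11
NETMASK=255.255.255.0
BOOTPROTO=none
ONBOOT=yes
BONDING_OPTS="mode=802.3ad miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-eth0 (and the same for ifcfg-eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes

One caveat worth remembering: a single TCP stream (one client talking to one server) generally won’t exceed one link’s worth of bandwidth even over a bond, so aggregation mostly pays off once several clients are hitting the storage at once.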
rsync’ing all your data over to your shiny new GlusterFS storage isn’t a real-world use case (not for us, at least!), so don’t worry if your performance isn’t top-notch the first time you take it for a spin. If you’re having performance issues, think logically and isolate the problem if you can.
In my case it seems like I should be using the NFS client. There’s only one application I have that requires extended ACLs (which the NFS client doesn’t support), so I will have to keep that data somewhere else, on a FUSE mount. Also, since the raw I/O speeds of my disk arrays are much faster than the speeds I was seeing, I should explore Ethernet link aggregation of 2+ ports to get higher throughput.
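Concretely, the client-side mounts I have in mind would look something like this in /etc/fstab (the paths and options are illustrative; the acl mount option on the FUSE mount is exactly what the NFS client can’t give me):

# regular home directories over the (faster) NFS client
server1:/homes   /home      nfs         vers=3,_netdev         0 0

# a FUSE mount with extended ACL support, for the one application that needs it
server1:/homes   /gluster   glusterfs   defaults,acl,_netdev   0 0

Whether the ACL-dependent data lives on the same volume at a different mount point or on a separate volume entirely is a detail to sort out later.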