Troubleshooting GlusterFS performance issues

I’m in the middle of a new GlusterFS deployment for our research computing infrastructure (storage of bioinformatics and GIS data). At first I was just happy to see the huge size of my /home partition when I ran df -h, but I quickly found that performance wasn’t quite what I was expecting, and I started to worry.

In this particular instance I was rsync’ing a few terabytes of user homes from my old server to my GlusterFS FUSE mount, /home:

[root@server1]# rsync -avz server2:/mnt/export/home/ /home

I was shocked to find that, after 24 hours, I had only transferred ~1.5 TB (an average of roughly 17 MB/s)! We’re not even talking about hammering the storage in “real world” use cases yet; I’m just migrating my data over!
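To put that number in perspective, here is a rough sketch of the arithmetic. It assumes "~1.5 TB" means 1.5e12 bytes and uses decimal megabytes; rate_mb_s and mb_s_to_mbit are hypothetical helper names, not part of any tool.

```shell
# rate_mb_s BYTES SECONDS: average throughput in MB/s (decimal megabytes).
rate_mb_s() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1000000 }'
}

# mb_s_to_mbit MB_PER_SEC: the same rate expressed in megabits per second.
mb_s_to_mbit() {
    awk -v r="$1" 'BEGIN { printf "%.0f\n", r * 8 }'
}

# Assumption: ~1.5 TB taken as 1.5e12 bytes, moved in 24 hours (86400 s).
rate_mb_s 1500000000000 86400   # prints 17.4
mb_s_to_mbit 17.4               # prints 139
```

Roughly 139 Mbit/s is a small fraction of what a 1GbE link can carry, which is why the migration felt so slow.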

My setup

For reference, here’s the setup I’m using. Two servers with:

GlusterFS 3.3.1
CentOS 6.4
12 x 3TB drives in hardware RAID5, mounted as one ~30TB XFS brick on each server
One replicated GlusterFS volume (/homes)
One client (for now)
Dedicated 1GbE switch

The GlusterFS stack isn’t complicated per se (application- and network-layer abstraction of raw block storage), but this problem has lots of variables and it’s possible there isn’t one easy answer. By my count we have to consider the network, filesystems, hard disks, and overhead from applications like rsync, not to mention GlusterFS itself. I’m not an expert at any of those things, but I can certainly use the power of logic to narrow down the problem.

Network?

My first inclination was to take disk and application overhead out of the picture and just test raw network performance. I’m using a dedicated 1GbE network switch so the two servers in question are only one hop away. I used iperf to test the throughput from server2 to server1:

On server1:

[root@server1]# iperf -s

On server2:

[root@server2]# iperf -c server1
------------------------------------------------------------
Client connecting to server1, TCP port 5001
TCP window size: 23.2 KByte (default)
------------------------------------------------------------
[  3] local server2 port 37731 connected with server1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.10 GBytes   944 Mbits/sec

944 Mbit/s is fine for what I’m doing right now, and is MUCH faster than what I was seeing when copying my data over. I had been planning on bonding a few Ethernet ports to get aggregated throughput on these NICs, but for now it seems I’m OK. I guess my current issue isn’t related to the network configuration.

Filesystem / disk?

My next guess was that I had somehow misconfigured either the XFS filesystem or something to do with the hard disks or the hardware RAID on the back-end servers themselves.

Drop all in-memory file caches and then create a 1GB test file on my RAID5 mount point:

[root@server1 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@server1 ~]# dd if=/dev/zero of=/mnt/gfs/zero1 bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB) copied, 0.821399 s, 1.3 GB/s

Make a few copies. Using dd makes it easier to see the speed of the copy operation:

[root@server1 ~]# dd if=/mnt/gfs/zero1 of=/mnt/gfs/zero2
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.94788 s, 266 MB/s
[root@server1 ~]# dd if=/mnt/gfs/zero1 of=/mnt/gfs/zero3
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 3.7447 s, 280 MB/s

Well, 266 – 280 megabytes per second isn’t bad, but let’s drop caches again to see if we can get a more realistic view of raw hard drive performance:

[root@server1 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@server1 ~]# dd if=/mnt/gfs/zero1 of=/mnt/gfs/zero4
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 8.66945 s, 121 MB/s
[root@server1 ~]# dd if=/mnt/gfs/zero1 of=/mnt/gfs/zero4
2048000+0 records in
2048000+0 records out
1048576000 bytes (1.0 GB) copied, 8.06204 s, 130 MB/s

120 – 130 megabytes per second isn’t going to break any records, but it’s definitely within the realm of what we expect from 7200 RPM hard drives, if a bit low. It seems that my GlusterFS speed issues aren’t related to filesystem or hard disks (not yet, at least).
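Two details in the dd runs above are worth a second look. The copies report 2048000 records for 1048576000 bytes, which means dd was using its default 512-byte block size, and the initial 1.3 GB/s write most likely measured the page cache rather than the disks. Below is a minimal sketch of a more conservative write test, assuming GNU dd; TESTDIR and TESTFILE are hypothetical names, and the 64 MiB size is only there to keep the example quick (use something far larger than RAM cache effects for a real measurement).

```shell
# Hypothetical scratch location: set TESTDIR to the brick (e.g. /mnt/gfs)
# to test the array itself; it defaults to the system temp directory.
TESTDIR="${TESTDIR:-${TMPDIR:-/tmp}}"
TESTFILE="$TESTDIR/dd-sync-test.$$"

# Write 64 MiB in 1 MiB blocks. conv=fsync makes dd flush the file to
# disk before it reports, so the elapsed time includes the real write.
dd if=/dev/zero of="$TESTFILE" bs=1M count=64 conv=fsync

# Record the size that actually landed on disk, then clean up.
SIZE=$(wc -c < "$TESTFILE" | tr -d ' ')
rm -f "$TESTFILE"
echo "wrote $SIZE bytes"
```

The throughput dd prints for this run should be closer to what the storage can sustain than a figure taken from a cached write.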

Application overhead?

During all my testing I noticed that rsync was taking 10 – 20 minutes before it would even start copying! That implies a serious overhead in rsync’s differential algorithm, ssh as a transport, or GlusterFS’ FUSE abstraction layer. If I were to guess, I’d say I have 50 or so users, with 100,000 or so files between them, totaling around 2.5TB of data; therefore it’s not hard to believe that rsync takes a while to figure out where it left off in that sea of files and directories.

I decided to try copying a file directly within the FUSE mount’s context, as this is a more likely use case I’ll experience (users manipulating data within their home directory):

[root@server2]# cd /home/bam
[root@server2]# dd if=file.bam of=file.bam2
^C7334992+0 records in
7334992+0 records out
3755515904 bytes (3.8 GB) copied, 372.932 s, 10.1 MB/s

After a few minutes I was sick of waiting and had to cancel it. iftop on the back-end storage showed the network traffic moving at ~90 megabits per second, which is pretty lousy. file.bam is 21 gigs, and at that rate I would have had to wait roughly another half hour for it to finish!
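A quick check of that wait, as a sketch: the total size is the byte count dd reports for file.bam in the NFS run further down, the bytes already copied and the rate come from the interrupted run above, and eta_minutes is a hypothetical helper name.

```shell
# eta_minutes TOTAL_BYTES DONE_BYTES RATE_MB_S
# Minutes left to copy the remaining bytes at a steady rate (decimal MB/s).
eta_minutes() {
    awk -v t="$1" -v d="$2" -v r="$3" \
        'BEGIN { printf "%.0f\n", (t - d) / (r * 1000000) / 60 }'
}

# 21311257776 bytes total, 3755515904 already copied, 10.1 MB/s observed.
eta_minutes 21311257776 3755515904 10.1   # prints 29
```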

Maybe it’s FUSE? Let’s try the same thing over NFS, first with dd:

[root@server2 ~]# umount /home
[root@server2 ~]# mount -t nfs -o vers=3 server1:/homes /home
[root@server2]# cd /home/bam
[root@server2]# dd if=file.bam of=file.bam2
41623550+1 records in
41623550+1 records out
21311257776 bytes (21 GB) copied, 454.733 s, 46.9 MB/s

47 MB/sec, not bad… then with cp:

[root@server1]# time cp file.bam file.bam2
cp: overwrite `file.bam2'? y

real	6m52.757s
user	0m0.089s
sys	0m16.054s

About 52 MB/sec this time (21 GB in just under seven minutes), which could owe something to caching somewhere and is basically on par with the previous run.
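Since cp doesn’t print a transfer rate the way dd does, the figure has to be derived from the output of time. A small sketch, assuming the "real" value is in the NmS.SSSs form shown above and that cp copied the same 21311257776-byte file; real_to_seconds and rate_from_time are hypothetical helper names.

```shell
# real_to_seconds 6m52.757s: convert the "real" field of time(1) to seconds.
real_to_seconds() {
    echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# rate_from_time BYTES REAL: average throughput in MB/s (decimal megabytes).
rate_from_time() {
    secs=$(real_to_seconds "$2")
    awk -v b="$1" -v s="$secs" 'BEGIN { printf "%.1f\n", b / s / 1000000 }'
}

real_to_seconds 6m52.757s              # prints 412.757
rate_from_time 21311257776 6m52.757s   # prints 51.6
```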

In any case, iftop on the back-end server shows ~600 megabits of inbound traffic and ~850 megabits of cumulative outbound traffic to server2 as well as the other back-end server (the GlusterFS mirror). It seems NFS is much faster than the FUSE client for this particular operation. Also, for what it’s worth, this is surely nowhere near the maximum speed of the hard disk arrays on these servers, and aggregating the Ethernet ports on all the servers involved could be beneficial in the future.
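The replica traffic also hints at a ceiling for the FUSE client. As Ben England notes in the comments, with replication the FUSE client itself sends every write to both replicas, so a single client link is shared between the copies. A rough sketch of that upper bound, assuming the 944 Mbit/s iperf figure is the usable link rate and ignoring protocol overhead; replica_write_ceiling is a hypothetical helper name.

```shell
# replica_write_ceiling LINK_MBIT REPLICAS
# Best-case client write rate in MB/s when each write crosses the
# client's link once per replica.
replica_write_ceiling() {
    awk -v l="$1" -v n="$2" 'BEGIN { printf "%.1f\n", l / n / 8 }'
}

replica_write_ceiling 944 2   # prints 59.0
replica_write_ceiling 944 1   # prints 118.0
```

So on this setup a FUSE client could not write much faster than about 59 MB/s over one 1GbE port, whatever the disks can do.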

The takeaway

rsync’ing all your data over to your shiny new GlusterFS storage isn’t a real-world use case (not for us at least!). Don’t worry if your performance isn’t top-notch the first time you take it for a spin! If you’re having performance issues, try to think logically and isolate the problem if you can.

In my case it seems like I should be using the NFS client. There’s only one application I have that requires the use of extended ACLs (which the NFS client doesn’t support), so I will have to keep that data somewhere else on a FUSE mount. Also, as the raw IO speeds of my disk arrays are much faster than the speeds I was seeing, I should explore Ethernet link aggregation of 2+ ports to get higher throughput.

7 thoughts on “Troubleshooting GlusterFS performance issues”

Dennis says:

March 26, 2013 at 22:11

Excellent read, man.
And how come some filesystems use so many resources (CPU time and RAM) when they are in use?
Charles Weber says:

March 27, 2013 at 00:35

Similar but larger setup. I gave up and went back to drbd. Fuse/Gluster has max group membership limit. Kind of a problem with AD authentication.
Ben England says:

March 27, 2013 at 01:12

If you use Gluster replication, then writes have to travel 2x across the network, from the client to the 2 replicas. Keep that in mind when comparing to NFS with large files.

You get better write performance from FUSE with large transfer sizes (use the bs=64k option for example). dd defaults to an insanely small transfer size (I think it’s 512 bytes on RHEL6). Try strace to observe this. This might explain high CPU utilization — FUSE doesn’t aggregate writes (yet) so every tiny transfer has to make it through the kernel and out the other side to glusterfs, not Gluster’s fault. There are patches in upstream Linux relating to this, so eventually it will get fixed, but in the meantime use a larger transfer size, or if you really can’t do that, use NFS client to talk to Gluster, this will aggregate writes on the client side.

Gluster throughput is not good with small-files, and efforts are underway to improve it in this area.

However, your article illustrates the disconnect between the way distributed filesystems work and the way that applications try to use them, even in 2013. There is absolutely nothing in the functional specification of rsync that requires it to work on a single file at a time, but that’s pretty much what it does. With Gluster, if you do a single file at a time it will only utilize a small fraction of the hardware capacity, particularly with small files, and I suspect several other distributed filesystems will have similar problems.

You could try doing multiple rsyncs in parallel to a single Gluster mountpoint. Gluster’s consistent hashing will spread this load across multiple glusterfsd processes on different servers (1 glusterfsd/brick).

You could try pulling files using servers instead of pushing from a client — login to each gluster server, use/create a mountpoint there, and pull in parallel with rsync. This will give you more scalability because you’ll have multiple glusterfs processes (1 glusterfs/mountpoint) working in parallel.

Some tuning suggestions are available at http://community.gluster.org/a/linux-kernel-tuning-for-glusterfs/

Post a reply if you found any of this useful.
1. Alan says:
  
  March 27, 2013 at 10:33
  
  Hey, Ben, thanks for stopping by to chime in.
  
  Regarding the distribution of writes with FUSE, that explains why I get roughly half of my network capacity during writes; approximately 350-400mbps from my client to each of my two replicas, just about the max for 1GbE if you add a bit of overhead. I can see how if you were CERN and needed 1 petabyte of data NOW you would want many nodes to distribute the writes. I’m slowly learning these tradeoffs, and it’s helping me to understand how to tune GlusterFS for our own needs. I realize now that what we’re after is basically capacity, with “regular” write speeds. we’re coming from the world of attached storage, where the most you can get on a mid-range server is in the 10 TB neighborhood. So we just want more space, without having to suffer too much from moving storage to the network (also, Gluster’s global namespace and easy adding of storage nodes is a plus).
  
  I know what you mean about the small block sizes used for dd. I should make a note to try those same tests with a higher block size, but also I realize that running dd all day doesn’t really help me much in the end because none of my users have even heard of it, so it’s not really a “real” use case for me. Kinda like complaining about your new phone not getting as high of Quadrant or AnTuTu benchmark speeds; it only matters if you plan on running benchmarks all day! In our case, I think I’ll just use NFS as, again, we’re really just after increased capacity and reasonable access speeds (ie, don’t take forever to ls or cp for interactive users).
  
  I had seen the GlusterFS kernel tuning guide, as well as a very helpful guide from the FHGFS guys[1]. I was thinking about doing a followup to this article to document some of the scheduler, kernel, and GlusterFS tweaks I’m using (along with my rationale for choosing them) to hopefully help some people as well as get some feedback on them.
  
  [1] http://www.fhgfs.com/wiki/wikka.php?wakka=StorageServerTuning
Stefan Hajnoczi says:

March 27, 2013 at 12:45

The dd(1) benchmarks are going through the page cache. You can avoid that using:

# dd if=/dev/zero of=/mnt/gfs/zero1 bs=1M count=1000 oflag=direct

or

# dd if=/mnt/gfs/zero1 of=/mnt/gfs/zero3 iflag=direct oflag=direct

That way you see true file system and disk I/O performance. Dropping caches before the benchmark doesn’t help if dd(1) is just going to stash data into the host’s page cache.
1. Stefan Hajnoczi says:
  
  March 27, 2013 at 12:49
  
  Quick follow up: I saw later you’re using FUSE here.
  
  If O_DIRECT is not available on FUSE you can use dd conv=fsync instead. This calls fsync(2) at the end of the benchmark, forcing written data out of the page cache and into FUSE -> GlusterFS -> file system -> disk.
  1. Alan says:
    
    March 28, 2013 at 00:12
    
    Stefan,
    
    Thanks for the explanation about the direct flag; now that you mention it I remember seeing it from time to time but I wasn’t aware of its significance in this type of test.
    
    I’m curious to re-run my tests but not expecting much change in the numbers. What worries me is that cp of some several-hundred megabyte – several gigabyte data sets runs at a pretty miserable speed, and that’s a more “real-world” use case for my users than dd with strange block sizes etc. :\
    
    Having said that, I think the benchmarks help test the peak performance of the stack as well as help me, the sysadmin, understand some of the values of the equation and how to react to different workload demands.
    
    So thanks for the tips!

Mjanja Tech

Ujanja Ni Uhai (Hustling Is Life)