XFS fsync write amplification

By steve, 29 October, 2015

I have just finished debugging an iSCSI storage system where the I/O hitting the disks appeared to be significantly higher than the I/O issued by the running programs. The summary lines at the top of iotop confirmed this:

Total DISK READ : 63.46 M/s | Total DISK WRITE : 42.03 M/s
Actual DISK READ: 15.28 M/s | Actual DISK WRITE: 112.13 M/s

The actual reads can be lower than the total reads due to caching, and the actual writes can be higher than the total writes due to journalling. I then did some reading and came across this mailing list post:
http://oss.sgi.com/archives/xfs/2015-07/msg00135.html

My understanding is that whenever XFS writes to its journal, the write is padded up to the log stripe unit. On top of that, a journal write is forced every time fsync() is called. This leads me to what I believe is our problem:
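As a rough illustration of the padding (my own sketch, not XFS code), a log write smaller than the stripe unit still consumes a full stripe unit on disk:

```shell
# Sketch only: round a log write up to the stripe unit, which is my
# understanding of how XFS pads its journal writes. Numbers match our setup.
sunit_bytes=$((64 * 4096))   # sunit=64 x 4k blocks = 256k
round_up() {
  echo $(( ($1 + $2 - 1) / $2 * $2 ))
}
round_up 512 "$sunit_bytes"  # a tiny 512-byte log record still costs 262144 bytes
```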

  • We use an iSCSI target (scst) on top of Gluster on top of XFS.
  • We recently changed all disks in scst to write_through mode, which sets the O_SYNC flag when the backing file is opened.
  • Gluster calls fsync() after every write operation when the O_SYNC flag is set.
  • The XFS filesystem was created with sunit=64 and blocks=521728 (2GB) in the log section.
  • This causes XFS to write 64*4k=256k of log for every write operation, and because the log can grow to 2GB, these log writes end up on the hard disks instead of staying in cache.

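Putting rough numbers on it (my own back-of-the-envelope figures, assuming one fsync-driven log write per application write):

```shell
# Assumed: every 4k application write is followed by an fsync, which forces
# a padded 256k journal write in addition to the 4k of data itself.
app_write=$((4 * 1024))
log_write=$((64 * 4096))
echo "bytes hitting disk per 4k write: $((app_write + log_write))"
echo "amplification: $(( (app_write + log_write) / app_write ))x"
```

On these assumptions a 4k write turns into 266240 bytes on disk, a 65x amplification, which would be more than enough to explain the iotop numbers above.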
Since we are using hardware RAID with a BBU, I now believe the best mkfs.xfs log-section options for our scenario (a small number of large files that do not grow) would be sunit=0 or 1, to avoid the padded log writes, together with the minimum log size (512 blocks), to try to keep the log in the RAID controller's memory instead of flushing it to the disks. I need to wait for a second gluster node before I can test this out.
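For reference, the invocation I intend to try looks something like this (untested on my side; the device name is a placeholder, and I have not yet confirmed mkfs.xfs will accept a log this small on our version):

```shell
# Proposed, untested: no log stripe padding and the minimum log size,
# hoping the log stays in the RAID controller's BBU-backed cache.
# /dev/sdX is a placeholder for the real backing device.
mkfs.xfs -f -l sunit=0,size=512b /dev/sdX
```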
