
Optimizing Linux Filesystem Performance

Tobi Oetiker, 2009-10-18

The Issue

When looking at data sheets and benchmarks for hard disks, I often see fantastic transfer rates of over 200 MB/s, while my real-life experience is vastly different.

We run a busy Linux file server providing both NFS and Samba file access, backed by a hardware RAID6 running an ext3 filesystem.

Experience tells me that the worst performance is to be expected when there are competing read and write accesses to the filesystem. Looking at existing benchmarks, this pattern does not seem to get exercised a lot, and certainly not in a way that resembles the activity to be expected on a file server. The best approximation I have seen is people running competing dd processes, which is not even close to reality.

I have therefore developed a new benchmark to better approximate the activities of a file server. The test concerns mainly reading and writing files in a tree with a file size distribution inspired by the content of a real home directory partition. See the FsOpBench Homepage for details.

The benchmark not only measures throughput: it also measures the time required for each operation it executes and reports detailed statistics.
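
To give a feel for what that means, here is a minimal Python sketch of the idea (not the actual fsopbench code; the tree path is a placeholder): every single call is timed, and summary statistics are derived from the samples afterwards.

import os
import statistics
import time

def timed(op, *args):
    """Run one filesystem call, return its result and the elapsed milliseconds."""
    start = time.perf_counter()
    result = op(*args)
    return result, (time.perf_counter() - start) * 1000.0

samples = {"read dir": [], "lstat file": [], "open file": [], "rd 1st byte": []}

def read_tree(tree):
    """Traverse one file tree, timing every operation individually."""
    entries, ms = timed(os.listdir, tree)
    samples["read dir"].append(ms)
    for name in entries:
        path = os.path.join(tree, name)
        st, ms = timed(os.lstat, path)
        samples["lstat file"].append(ms)
        if os.path.isdir(path):
            read_tree(path)
            continue
        f, ms = timed(open, path, "rb")
        samples["open file"].append(ms)
        _, ms = timed(f.read, 1)
        samples["rd 1st byte"].append(ms)
        f.read()              # read the rest of the file for the data rate
        f.close()

read_tree("/srv/fsopbench/tree-0")    # placeholder path for a prepared tree

for label, values in samples.items():
    print(f"{label:12s} cnt {len(values):6d} "
          f"min {min(values):7.3f} ms max {max(values):9.3f} ms "
          f"avg {statistics.mean(values):6.3f} ms "
          f"med {statistics.median(values):6.3f} ms "
          f"std {statistics.pstdev(values):8.3f}")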

Armed with this new benchmark I went ahead to figure out how to make our file server faster.

Over the years, Linux has gained a fair number of knobs one can twiddle to optimize filesystem performance:

  • at the lowest level there is the choice of io-scheduler; opinions seem to be split between cfq and deadline.

  • when working with ext3 there are three ways to handle the journaling, data=journal, data=ordered and data=writeback, the popular wisdom being that data=writeback is the fastest but also the most risky option. The journal settings mainly affect write performance since read access does not touch the journal.

  • the noatime or relatime mount options, which prevent (or, in the case of relatime, drastically reduce) atime updates on read access.

  • keeping the journal inside the filesystem, on an external disk, or on an SSD device.

  • whether or not to activate the write barrier.

We are looking at 72 potential combinations of these settings. Ample work for the benchmark program. And that is not even taking into account other filesystems like xfs, ext4 or btrfs. To further complicate the situation, there is a lot of activity in kernel development regarding IO performance, so the kernel version is bound to play a role too. In this evaluation I have worked with the Ubuntu server kernel for Hardy (2.6.24) as well as the latest official kernel, 2.6.31.2.
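
The 72 is simply the product of the alternatives listed above, reading noatime/relatime as one alternative to plain atime and counting three journal placements (internal, external disk, external SSD); a quick sketch:

# 2 schedulers x 3 data modes x 2 atime settings x 3 journal placements
# x 2 barrier settings = 72 combinations to benchmark
from itertools import product

schedulers = ["cfq", "deadline"]
data_modes = ["data=journal", "data=ordered", "data=writeback"]
atime      = ["relatime", "atime"]
journal    = ["internal", "external disk", "external ssd"]
barriers   = ["barrier=0", "barrier=1"]

combos = list(product(schedulers, data_modes, atime, journal, barriers))
print(len(combos))     # 72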

The Benchmark

The first time the benchmark is run, it sets up a 20 GB file tree. This tree is then used as input for the read operations.

In normal operation, the benchmark works as follows:

  1. Sync and drop all caches

  2. Fork 1 reader, going through file tree number 0

  3. Fork 3 readers going through file trees 1 to 3 in parallel

  4. Fork 3 writers creating new file trees while the 3 readers from the previous step continue to traverse their respective trees. For this test, the performance measurement starts after the writers have been left running for 30 seconds to fill up caches.

The benchmark executes many thousand filesystem operations and measures the time required to execute each one. It then builds min, max, average, median and stdev statistics from these numbers.
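
In outline, the procedure could look like the sketch below (not the real fsopbench implementation; paths, file sizes and tree layout are made up, and the per-operation timing shown earlier is left out for brevity):

import os
import subprocess

BASE = "/srv/fsopbench"          # placeholder location of the prepared trees

def drop_caches():
    """Step 1: flush dirty data and drop page/dentry/inode caches (needs root)."""
    subprocess.run(["sync"], check=True)
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")

def read_tree(n):
    """Reader: traverse file tree number n (per-call timing omitted here)."""
    for root, dirs, files in os.walk(f"{BASE}/tree-{n}"):
        for name in files:
            with open(os.path.join(root, name), "rb") as f:
                f.read()

def write_tree(n):
    """Writer: create a new file tree (sizes and layout heavily simplified)."""
    base = f"{BASE}/new-tree-{n}"
    os.makedirs(base, exist_ok=True)
    for i in range(1000):
        with open(os.path.join(base, f"file-{i}"), "wb") as f:
            f.write(os.urandom(64 * 1024))

def run_parallel(jobs):
    """Fork one child per (function, argument) pair and wait for all of them."""
    pids = []
    for fn, arg in jobs:
        pid = os.fork()
        if pid == 0:
            fn(arg)
            os._exit(0)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)

drop_caches()
run_parallel([(read_tree, 0)])                        # step 2: one reader
drop_caches()
run_parallel([(read_tree, n) for n in (1, 2, 3)])     # step 3: three readers
drop_caches()
# step 4: three writers next to three readers; in the real benchmark the
# step-3 readers keep running and measurement starts after a 30 s warm-up
run_parallel([(read_tree, n) for n in (1, 2, 3)] +
             [(write_tree, n) for n in (0, 1, 2)])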

The Test System

The tests have been conducted on a dual quad-core Intel Xeon E5520 system with an Areca 1222 RAID controller running a RAID6 configuration on seven SATA WD 1002FBYS drives. The tests were the only activity on the system.

Standard Deviation is WAY too high

Before looking at the individual results, there is one general observation: all configurations show an amazingly high standard deviation. Even in the simplest test, with a single reader process, there is often a factor of 1000 or more between the median and the slowest measurement.

1 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  31792   min 0.002 ms   max  112.924 ms   avg  0.063 ms   med 0.004 ms   std   1.121
B lstat file      cnt  29652   min 0.008 ms   max   10.614 ms   avg  0.046 ms   med 0.018 ms   std   0.377
C open file       cnt  23228   min 0.015 ms   max    0.137 ms   avg  0.018 ms   med 0.017 ms   std   0.005
D rd 1st byte     cnt  23228   min 0.170 ms   max  102.035 ms   avg  0.595 ms   med 0.270 ms   std   2.582
E read rate      57.546 MB/s (data)  22.278 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  10164   min 0.002 ms   max   93.665 ms   avg  0.129 ms   med 0.004 ms   std   1.566
B lstat file      cnt   9502   min 0.008 ms   max  109.577 ms   avg  0.105 ms   med 0.018 ms   std   1.360
C open file       cnt   7514   min 0.015 ms   max    0.112 ms   avg  0.018 ms   med 0.017 ms   std   0.006
D rd 1st byte     cnt   7514   min 0.175 ms   max  228.477 ms   avg  2.202 ms   med 0.286 ms   std   9.472
E read rate      19.244 MB/s (data)   7.009 MB/s (readdir + open + 1st byte + data)

Running the same test on a single hard disk shows pretty similar results. Only the standard deviation seems to be a tad lower for the RAID setup.

In the graph below I have plotted the values for D (time to read 1st byte).

While the majority of readings stay low, an increasing number bursts out on top. Even worse, at around 10 seconds into the run, the read process got stuck for more than a second.
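
The plot can be recreated with any plotting tool; the sketch below assumes the D samples were dumped to a two-column text file (elapsed seconds, latency in milliseconds), which is an assumed format rather than fsopbench's actual output.

import matplotlib.pyplot as plt

times, latencies = [], []
with open("rd-1st-byte.log") as f:       # hypothetical per-operation dump
    for line in f:
        t, ms = line.split()
        times.append(float(t))
        latencies.append(float(ms))

plt.scatter(times, latencies, s=4)
plt.yscale("log")                        # samples span ~0.2 ms to over 1000 ms
plt.xlabel("elapsed time [s]")
plt.ylabel("time to read 1st byte [ms]")
plt.savefig("rd-1st-byte.png")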

Analysis

Analysis of the data shows the following recipe for optimal performance on ext3 on LVM on Areca RAID6.

  1. use the cfq scheduler, since deadline effectively kills read performance under read/write competition.
  2. set data=ordered journaling, since data=journal puts write operations at a disadvantage by making open calls more expensive without helping read performance.

For the rest of the settings the benchmark does not give clear indications. I assume the following (a sketch of applying all of these settings follows the list below):

  • keep the journal on an external disk since it seems to lower the standard deviation of write operations.
  • set barrier=0 since the battery-buffered cache on the RAID controller should protect us from its negative data-integrity effects.
  • use relatime since fewer writes will give better performance.
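
A sketch of applying these settings (the block device, LVM volume and mount point are placeholders; run as root):

import subprocess

DEVICE = "/dev/vg0/home"     # placeholder LVM volume on the RAID
MOUNT  = "/srv/home"         # placeholder mount point

# select the cfq io-scheduler on the underlying physical device
with open("/sys/block/sda/queue/scheduler", "w") as f:
    f.write("cfq\n")

# mount with ordered journaling, relatime and write barriers off
subprocess.run(["mount", "-o", "data=ordered,relatime,barrier=0", DEVICE, MOUNT],
               check=True)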

The table below shows the results gathered when running this on 2.6.24.

Linux 2.6.24-24-server relatime barrier=0 fs=ext3 disk=areca RAID6 journal=int data=ordered scheduler=cfq
**************************************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  31375   min 0.002 ms   max   78.354 ms   avg  0.061 ms   med 0.005 ms   std   0.779
B lstat file      cnt  29251   min 0.008 ms   max   28.075 ms   avg  0.050 ms   med 0.020 ms   std   0.460
C open file       cnt  22874   min 0.015 ms   max    0.116 ms   avg  0.018 ms   med 0.017 ms   std   0.005
D rd 1st byte     cnt  22874   min 0.173 ms   max   95.596 ms   avg  0.612 ms   med 0.268 ms   std   2.674
E read rate      57.529 MB/s (data)  21.947 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  10105   min 0.002 ms   max   99.035 ms   avg  0.120 ms   med 0.004 ms   std   1.706
B lstat file      cnt   9448   min 0.008 ms   max   99.366 ms   avg  0.120 ms   med 0.019 ms   std   1.834
C open file       cnt   7479   min 0.015 ms   max    0.115 ms   avg  0.018 ms   med 0.017 ms   std   0.006
D rd 1st byte     cnt   7479   min 0.171 ms   max  136.923 ms   avg  2.157 ms   med 0.284 ms   std   8.937
E read rate      18.230 MB/s (data)   6.934 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open      cnt   3886   min 0.037 ms   max   41.745 ms   avg  0.078 ms   med 0.041 ms   std   0.872
G wr 1st byte     cnt   3886   min 0.007 ms   max    0.130 ms   avg  0.008 ms   med 0.007 ms   std   0.004
H write close     cnt   3886   min 0.012 ms   max    0.110 ms   avg  0.016 ms   med 0.016 ms   std   0.005
I mkdir           cnt    334   min 0.019 ms   max 1821.824 ms   avg 26.962 ms   med 0.025 ms   std 162.606
J write rate    522.051 MB/s (data) 218.912 MB/s (open + 1st byte + data + close)

A read dir        cnt   3847   min 0.002 ms   max   89.878 ms   avg  0.187 ms   med 0.004 ms   std   1.995
B lstat file      cnt   3604   min 0.008 ms   max   26.308 ms   avg  0.159 ms   med 0.019 ms   std   1.329
C open file       cnt   2878   min 0.015 ms   max    0.148 ms   avg  0.019 ms   med 0.018 ms   std   0.007
D rd 1st byte     cnt   2878   min 0.179 ms   max  925.522 ms   avg  7.404 ms   med 0.296 ms   std  42.786
E read rate      10.198 MB/s (data)   2.480 MB/s (readdir + open + 1st byte + data)

For 2.6.31.2 the results are a bit better. The number of writes completed rises dramatically, and the large delays and high standard deviation associated with 'H write close' seem to have gone.

Linux 2.6.31.2-test relatime barrier=0 fs=ext3 disk=areca RAID6 journal=ext data=ordered scheduler=cfq
**************************************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  33964   min 0.001 ms   max   84.159 ms   avg  0.056 ms   med 0.004 ms   std   0.666
B lstat file      cnt  31685   min 0.007 ms   max   19.915 ms   avg  0.051 ms   med 0.022 ms   std   0.402
C open file       cnt  24842   min 0.014 ms   max    0.677 ms   avg  0.021 ms   med 0.017 ms   std   0.011
D rd 1st byte     cnt  24842   min 0.175 ms   max   90.800 ms   avg  0.560 ms   med 0.270 ms   std   1.631
E read rate      64.667 MB/s (data)  24.137 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  11299   min 0.001 ms   max   20.214 ms   avg  0.099 ms   med 0.004 ms   std   0.866
B lstat file      cnt  10563   min 0.007 ms   max   87.001 ms   avg  0.112 ms   med 0.022 ms   std   1.424
C open file       cnt   8357   min 0.014 ms   max    0.430 ms   avg  0.021 ms   med 0.017 ms   std   0.016
D rd 1st byte     cnt   8357   min 0.176 ms   max  216.665 ms   avg  2.171 ms   med 0.290 ms   std  10.698
E read rate      25.330 MB/s (data)   7.764 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open      cnt  11669   min 0.036 ms   max   25.387 ms   avg  0.072 ms   med 0.042 ms   std   0.590
G wr 1st byte     cnt  11669   min 0.006 ms   max    0.637 ms   avg  0.008 ms   med 0.007 ms   std   0.010
H write close     cnt  11669   min 0.013 ms   max    0.354 ms   avg  0.020 ms   med 0.019 ms   std   0.015
I mkdir           cnt   1096   min 0.021 ms   max  685.149 ms   avg 24.203 ms   med 0.029 ms   std  94.218
J write rate    280.469 MB/s (data) 162.076 MB/s (open + 1st byte + data + close)

A read dir        cnt   7364   min 0.001 ms   max   50.615 ms   avg  0.115 ms   med 0.004 ms   std   1.068
B lstat file      cnt   6884   min 0.007 ms   max   51.400 ms   avg  0.118 ms   med 0.023 ms   std   1.068
C open file       cnt   5446   min 0.014 ms   max    1.265 ms   avg  0.021 ms   med 0.017 ms   std   0.022
D rd 1st byte     cnt   5446   min 0.180 ms   max  324.864 ms   avg  3.516 ms   med 0.294 ms   std  19.900
E read rate      16.190 MB/s (data)   4.828 MB/s (readdir + open + 1st byte + data)

Keeping the journal on an external device should improve performance, or so it would seem, since the journaled writes will not get in the way of the reads. In our testing we did not find all that much evidence for this theory. The median numbers are the same regardless. The only substantial improvement is the lower standard deviation for 'H write close'. The lower overall data rate is probably due to the 'I mkdir' calls taking twice as long on average.
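
For reference, moving an ext3 journal to an external device looks roughly like this (device names are placeholders, the filesystem must be unmounted while the journal is moved, and the journal device has to be created with the same block size as the filesystem):

import subprocess

JOURNAL_DEV = "/dev/journal/scratch_a"   # dedicated journal volume
FS_DEV      = "/dev/vg0/home"            # placeholder for the ext3 filesystem

def run(*cmd):
    print("#", " ".join(cmd))
    subprocess.run(cmd, check=True)

run("mke2fs", "-O", "journal_dev", "-b", "4096", JOURNAL_DEV)  # create journal device
run("tune2fs", "-O", "^has_journal", FS_DEV)                   # drop the internal journal
run("tune2fs", "-j", "-J", f"device={JOURNAL_DEV}", FS_DEV)    # attach the external one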

Linux 2.6.24-24-server relatime barrier=0 fs=ext3 disk=areca RAID6 journal=ext data=ordered scheduler=cfq
*********************************************************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  31227   min 0.002 ms   max   95.552 ms   avg  0.063 ms   med 0.004 ms   std   0.940
B lstat file      cnt  29114   min 0.008 ms   max   92.322 ms   avg  0.054 ms   med 0.021 ms   std   0.842
C open file       cnt  22770   min 0.015 ms   max    0.115 ms   avg  0.018 ms   med 0.017 ms   std   0.005
D rd 1st byte     cnt  22770   min 0.174 ms   max  105.196 ms   avg  0.614 ms   med 0.270 ms   std   2.749
E read rate      57.985 MB/s (data)  21.930 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  10136   min 0.002 ms   max  111.241 ms   avg  0.118 ms   med 0.004 ms   std   1.706
B lstat file      cnt   9479   min 0.008 ms   max  103.033 ms   avg  0.109 ms   med 0.018 ms   std   1.488
C open file       cnt   7506   min 0.015 ms   max    0.129 ms   avg  0.018 ms   med 0.017 ms   std   0.006
D rd 1st byte     cnt   7506   min 0.172 ms   max  133.411 ms   avg  2.190 ms   med 0.288 ms   std   9.303
E read rate      18.601 MB/s (data)   6.941 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open      cnt   2992   min 0.037 ms   max   23.378 ms   avg  0.059 ms   med 0.041 ms   std   0.534
G wr 1st byte     cnt   2992   min 0.007 ms   max    0.164 ms   avg  0.008 ms   med 0.007 ms   std   0.005
H write close     cnt   2992   min 0.013 ms   max  205.499 ms   avg  0.088 ms   med 0.017 ms   std   3.757
I mkdir           cnt    303   min 0.020 ms   max 1241.212 ms   avg 18.730 ms   med 0.025 ms   std 118.599
J write rate      4.203 MB/s (data)   4.129 MB/s (open + 1st byte + data + close)

A read dir        cnt   5514   min 0.002 ms   max   98.726 ms   avg  0.189 ms   med 0.004 ms   std   2.123
B lstat file      cnt   5157   min 0.008 ms   max   62.170 ms   avg  0.129 ms   med 0.019 ms   std   1.326
C open file       cnt   4089   min 0.015 ms   max    0.126 ms   avg  0.018 ms   med 0.017 ms   std   0.007
D rd 1st byte     cnt   4089   min 0.174 ms   max  651.012 ms   avg  4.577 ms   med 0.296 ms   std  21.041
E read rate       9.454 MB/s (data)   3.342 MB/s (readdir + open + 1st byte + data)

And the same for 2.6.31 (again faster writes):

Linux 2.6.31.2-test relatime barrier=0 fs=ext3 disk=areca RAID6 journal=/dev/journal/scratch_a data=ordered scheduler=cfq
**************************************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  33964   min 0.001 ms   max   84.159 ms   avg  0.056 ms   med 0.004 ms   std   0.666
B lstat file      cnt  31685   min 0.007 ms   max   19.915 ms   avg  0.051 ms   med 0.022 ms   std   0.402
C open file       cnt  24842   min 0.014 ms   max    0.677 ms   avg  0.021 ms   med 0.017 ms   std   0.011
D rd 1st byte     cnt  24842   min 0.175 ms   max   90.800 ms   avg  0.560 ms   med 0.270 ms   std   1.631
E read rate      64.667 MB/s (data)  24.137 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  11299   min 0.001 ms   max   20.214 ms   avg  0.099 ms   med 0.004 ms   std   0.866
B lstat file      cnt  10563   min 0.007 ms   max   87.001 ms   avg  0.112 ms   med 0.022 ms   std   1.424
C open file       cnt   8357   min 0.014 ms   max    0.430 ms   avg  0.021 ms   med 0.017 ms   std   0.016
D rd 1st byte     cnt   8357   min 0.176 ms   max  216.665 ms   avg  2.171 ms   med 0.290 ms   std  10.698
E read rate      25.330 MB/s (data)   7.764 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open      cnt  11669   min 0.036 ms   max   25.387 ms   avg  0.072 ms   med 0.042 ms   std   0.590
G wr 1st byte     cnt  11669   min 0.006 ms   max    0.637 ms   avg  0.008 ms   med 0.007 ms   std   0.010
H write close     cnt  11669   min 0.013 ms   max    0.354 ms   avg  0.020 ms   med 0.019 ms   std   0.015
I mkdir           cnt   1096   min 0.021 ms   max  685.149 ms   avg 24.203 ms   med 0.029 ms   std  94.218
J write rate    280.469 MB/s (data) 162.076 MB/s (open + 1st byte + data + close)

A read dir        cnt   7364   min 0.001 ms   max   50.615 ms   avg  0.115 ms   med 0.004 ms   std   1.068
B lstat file      cnt   6884   min 0.007 ms   max   51.400 ms   avg  0.118 ms   med 0.023 ms   std   1.068
C open file       cnt   5446   min 0.014 ms   max    1.265 ms   avg  0.021 ms   med 0.017 ms   std   0.022
D rd 1st byte     cnt   5446   min 0.180 ms   max  324.864 ms   avg  3.516 ms   med 0.294 ms   std  19.900
E read rate      16.190 MB/s (data)   4.828 MB/s (readdir + open + 1st byte + data)

Whether the journal was kept on an SSD or on a physical hard drive also did not make any notable difference in this setup. Keeping multiple journals on a single SSD is bound to be more efficient due to the lower seek time, but this was not tested.

The best HDD setup

The configuration that works best on the RAID performs really badly on a single hard drive, for both 2.6.24 and 2.6.31:

The chart below might be a bit misleading, since the results F to J for the writer are actually quite good, except that it managed to do only a single 'round' of writing in the 30 seconds allocated. The reason for this is that the benchmark sends a signal to the writer process to start measuring. After 30 seconds it gets a second signal to print the statistics. If the writer is blocked while receiving a signal, it will only act on it once the blockage is over. In this case it got both the start and the stop signal while blocked. Eventually it got one round of writing done, and this is what we can see in the table below.
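
The effect can be reproduced with a toy writer like the one below (a sketch of the mechanism, not the actual fsopbench code): the signal handlers only run once the process returns from its current blocking call, so a writer stuck in I/O ends up acting on both signals late and reports only one short burst.

import os
import signal
import time

measuring = False
durations = []

def start(signum, frame):        # sent by the benchmark driver at t = 0
    global measuring
    measuring = True

def stop(signum, frame):         # sent 30 seconds later to print the statistics
    print("writes measured:", len(durations))
    os._exit(0)

signal.signal(signal.SIGUSR1, start)
signal.signal(signal.SIGUSR2, stop)

os.makedirs("/srv/fsopbench/new-tree", exist_ok=True)   # placeholder path
i = 0
while True:
    t0 = time.perf_counter()
    # this write may block for seconds; neither handler runs until it returns
    with open(f"/srv/fsopbench/new-tree/file-{i}", "wb") as f:
        f.write(os.urandom(64 * 1024))
    if measuring:
        durations.append(time.perf_counter() - t0)
    i += 1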

Linux 2.6.31.2-test relatime barrier=0 fs=ext3 disk=hdd journal=int data=ordered scheduler=cfq
**********************************************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  31834   min 0.001 ms   max   10.841 ms   avg  0.087 ms   med 0.004 ms   std   0.692
B lstat file      cnt  29737   min 0.007 ms   max   16.426 ms   avg  0.066 ms   med 0.023 ms   std   0.460
C open file       cnt  23450   min 0.014 ms   max    0.238 ms   avg  0.022 ms   med 0.021 ms   std   0.011
D rd 1st byte     cnt  23450   min 0.173 ms   max   32.441 ms   avg  0.635 ms   med 0.277 ms   std   1.448
E read rate      79.409 MB/s (data)  23.406 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir        cnt   3111   min 0.001 ms   max   44.984 ms   avg  0.486 ms   med 0.004 ms   std   3.079
B lstat file      cnt   2917   min 0.007 ms   max   45.703 ms   avg  0.625 ms   med 0.024 ms   std   3.489
C open file       cnt   2333   min 0.014 ms   max    0.145 ms   avg  0.021 ms   med 0.018 ms   std   0.011
D rd 1st byte     cnt   2333   min 0.178 ms   max  124.612 ms   avg  8.013 ms   med 0.414 ms   std  13.877
E read rate       7.819 MB/s (data)   2.153 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open      cnt     57   min 0.039 ms   max    0.099 ms   avg  0.047 ms   med 0.043 ms   std   0.013
G wr 1st byte     cnt     57   min 0.006 ms   max    0.014 ms   avg  0.007 ms   med 0.006 ms   std   0.001
H write close     cnt     57   min 0.014 ms   max    0.077 ms   avg  0.021 ms   med 0.020 ms   std   0.009
I mkdir           cnt      1   min 0.065 ms   max    0.065 ms   avg  0.065 ms   med 0.065 ms   std   0.000
J write rate    370.348 MB/s (data) 204.618 MB/s (open + 1st byte + data + close)

A read dir        cnt  12795   min 0.001 ms   max   52.534 ms   avg  0.149 ms   med 0.004 ms   std   1.309
B lstat file      cnt  11941   min 0.007 ms   max   30.476 ms   avg  0.129 ms   med 0.023 ms   std   1.091
C open file       cnt   9382   min 0.014 ms   max    0.233 ms   avg  0.021 ms   med 0.017 ms   std   0.012
D rd 1st byte     cnt   9382   min 0.177 ms   max 6524.964 ms   avg  3.705 ms   med 0.297 ms   std  69.637
E read rate      15.355 MB/s (data)   4.385 MB/s (readdir + open + 1st byte + data)

When choosing the deadline scheduler instead, the fortunes get reversed. Now the writers starve the readers:

Linux 2.6.31.2-test relatime barrier=0 fs=ext3 disk=hdd journal=int data=ordered scheduler=deadline
**************************************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  30388   min 0.001 ms   max   85.786 ms   avg  0.088 ms   med 0.004 ms   std   0.839
B lstat file      cnt  28377   min 0.007 ms   max   36.928 ms   avg  0.067 ms   med 0.023 ms   std   0.521
C open file       cnt  22343   min 0.014 ms   max    0.248 ms   avg  0.022 ms   med 0.022 ms   std   0.011
D rd 1st byte     cnt  22343   min 0.173 ms   max   97.215 ms   avg  0.646 ms   med 0.278 ms   std   1.995
E read rate      69.505 MB/s (data)  22.314 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir        cnt   1703   min 0.001 ms   max   44.963 ms   avg  0.926 ms   med 0.004 ms   std   4.559
B lstat file      cnt   1597   min 0.007 ms   max   64.839 ms   avg  0.861 ms   med 0.025 ms   std   4.284
C open file       cnt   1277   min 0.014 ms   max    0.107 ms   avg  0.021 ms   med 0.018 ms   std   0.010
D rd 1st byte     cnt   1277   min 0.182 ms   max  150.754 ms   avg 14.381 ms   med 14.457 ms   std  13.847
E read rate       3.977 MB/s (data)   1.191 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open      cnt   8210   min 0.036 ms   max 1385.255 ms   avg  2.361 ms   med 0.041 ms   std  33.087
G wr 1st byte     cnt   8210   min 0.006 ms   max    0.121 ms   avg  0.007 ms   med 0.006 ms   std   0.003
H write close     cnt   8210   min 0.013 ms   max  128.845 ms   avg  0.035 ms   med 0.019 ms   std   1.422
I mkdir           cnt    758   min 0.019 ms   max 1481.164 ms   avg 11.147 ms   med 0.038 ms   std  88.370
J write rate    173.795 MB/s (data)  14.317 MB/s (open + 1st byte + data + close)

A read dir        cnt    130   min 0.001 ms   max  454.968 ms   avg 14.884 ms   med 0.004 ms   std  66.037
B lstat file      cnt    120   min 0.007 ms   max  300.857 ms   avg 12.510 ms   med 0.027 ms   std  54.107
C open file       cnt     92   min 0.015 ms   max    0.050 ms   avg  0.022 ms   med 0.021 ms   std   0.006
D rd 1st byte     cnt     92   min 0.541 ms   max  524.495 ms   avg 186.560 ms   med 217.043 ms   std 133.519
E read rate       0.264 MB/s (data)   0.088 MB/s (readdir + open + 1st byte + data)

The only way to get decent performance is to sacrifice some data integrity guarantees by switching to data=writeback journaling.
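
Switching an existing mount over is an unmount/mount cycle, since ext3 will generally refuse to change the data= mode on a plain remount; a sketch with placeholder device and mount point:

import subprocess

DEVICE, MOUNT = "/dev/sdb1", "/srv/home"   # placeholders for the single-disk setup

subprocess.run(["umount", MOUNT], check=True)
subprocess.run(["mount", "-o", "data=writeback,relatime,barrier=0", DEVICE, MOUNT],
               check=True)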

Linux 2.6.31.2-test relatime barrier=0 fs=ext3 disk=hdd journal=int data=writeback scheduler=cfq
**************************************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  30861   min 0.001 ms   max   65.866 ms   avg  0.090 ms   med 0.004 ms   std   0.794
B lstat file      cnt  28824   min 0.007 ms   max   74.237 ms   avg  0.067 ms   med 0.022 ms   std   0.625
C open file       cnt  22713   min 0.014 ms   max    0.313 ms   avg  0.022 ms   med 0.020 ms   std   0.012
D rd 1st byte     cnt  22713   min 0.170 ms   max  103.009 ms   avg  0.660 ms   med 0.281 ms   std   2.058
E read rate      77.399 MB/s (data)  22.723 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir        cnt   3355   min 0.001 ms   max  106.079 ms   avg  0.519 ms   med 0.004 ms   std   3.581
B lstat file      cnt   3139   min 0.007 ms   max   76.925 ms   avg  0.487 ms   med 0.024 ms   std   3.384
C open file       cnt   2487   min 0.014 ms   max    0.270 ms   avg  0.021 ms   med 0.018 ms   std   0.012
D rd 1st byte     cnt   2487   min 0.183 ms   max  128.951 ms   avg  7.814 ms   med 0.367 ms   std  14.997
E read rate       9.094 MB/s (data)   2.244 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open      cnt  10780   min 0.036 ms   max  473.607 ms   avg  0.109 ms   med 0.041 ms   std   4.610
G wr 1st byte     cnt  10780   min 0.006 ms   max   17.067 ms   avg  0.008 ms   med 0.006 ms   std   0.164
H write close     cnt  10780   min 0.012 ms   max    0.428 ms   avg  0.018 ms   med 0.018 ms   std   0.008
I mkdir           cnt   1004   min 0.018 ms   max 19592.233 ms   avg 29.447 ms   med 0.026 ms   std 644.768
J write rate    240.950 MB/s (data) 129.961 MB/s (open + 1st byte + data + close)

A read dir        cnt   9256   min 0.001 ms   max   32.343 ms   avg  0.146 ms   med 0.004 ms   std   1.222
B lstat file      cnt   8638   min 0.007 ms   max   45.528 ms   avg  0.138 ms   med 0.023 ms   std   1.309
C open file       cnt   6783   min 0.014 ms   max    0.266 ms   avg  0.021 ms   med 0.017 ms   std   0.013
D rd 1st byte     cnt   6783   min 0.178 ms   max  312.909 ms   avg  2.819 ms   med 0.297 ms   std  16.400
E read rate      14.757 MB/s (data)   5.163 MB/s (readdir + open + 1st byte + data)

Note the HUGE maximum delay for mkdir in this example: a hang of almost 20 seconds to get a single mkdir through. But at least the overall data throughput seems to be OK.

Deadline considered harmful

The deadline scheduler did not do well in any of the scenarios. While multiple competing readers are handled gracefully, they suffer a major performance impact as soon as the writers start. The sample below is one of the 'faster' variants. The behaviour is the same across the board, for RAID as well as for HDD. The situation does not change in 2.6.31 either.

Linux 2.6.24-24-server atime barrier=0 fs=ext3 disk=areca RAID6 journal=ext data=ordered scheduler=deadline
***********************************************************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  31695   min 0.002 ms   max   74.903 ms   avg  0.058 ms   med 0.005 ms   std   0.667
B lstat file      cnt  29559   min 0.008 ms   max   88.191 ms   avg  0.057 ms   med 0.022 ms   std   0.837
C open file       cnt  23155   min 0.015 ms   max    0.161 ms   avg  0.018 ms   med 0.017 ms   std   0.005
D rd 1st byte     cnt  23155   min 0.171 ms   max  100.230 ms   avg  0.581 ms   med 0.267 ms   std   2.224
E read rate      56.348 MB/s (data)  22.497 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  13642   min 0.002 ms   max   80.890 ms   avg  0.114 ms   med 0.004 ms   std   1.327
B lstat file      cnt  12763   min 0.008 ms   max   57.040 ms   avg  0.100 ms   med 0.019 ms   std   0.982
C open file       cnt  10131   min 0.015 ms   max    0.147 ms   avg  0.019 ms   med 0.017 ms   std   0.009
D rd 1st byte     cnt  10131   min 0.172 ms   max  118.771 ms   avg  1.498 ms   med 0.298 ms   std   4.785
E read rate      24.196 MB/s (data)   9.419 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open      cnt   7547   min 0.037 ms   max 2388.053 ms   avg  0.920 ms   med 0.041 ms   std  38.511
G wr 1st byte     cnt   7547   min 0.007 ms   max    7.754 ms   avg  0.010 ms   med 0.007 ms   std   0.103
H write close     cnt   7547   min 0.012 ms   max    1.143 ms   avg  0.018 ms   med 0.016 ms   std   0.028
I mkdir           cnt    706   min 0.018 ms   max 5190.903 ms   avg 33.380 ms   med 0.025 ms   std 349.304
J write rate    434.229 MB/s (data)  37.918 MB/s (open + 1st byte + data + close)

A read dir        cnt   1019   min 0.002 ms   max  125.373 ms   avg  0.261 ms   med 0.004 ms   std   4.132
B lstat file      cnt    947   min 0.008 ms   max   24.082 ms   avg  0.130 ms   med 0.019 ms   std   1.196
C open file       cnt    738   min 0.015 ms   max    0.133 ms   avg  0.020 ms   med 0.018 ms   std   0.008
D rd 1st byte     cnt    738   min 0.175 ms   max 3676.070 ms   avg 27.504 ms   med 0.303 ms   std 241.728
E read rate       1.733 MB/s (data)   0.612 MB/s (readdir + open + 1st byte + data)

While cfq is the default scheduler on most desktop setups, deadline is still popular on servers.

What about btrfs

With btrfs being in 2.6.31 and hailed as the solution to all our troubles, I gave it a whirl too and found the results quite amazing. As long as there is no read/write competition, the new filesystem puts everyone to shame.

Linux 2.6.31.2-test nobarrier fs=btrfs disk=areca RAID scheduler=cfq
**************************************************************************************
1 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  87133   min 0.001 ms   max   19.047 ms   avg  0.020 ms   med 0.003 ms   std   0.349
B lstat file      cnt  81349   min 0.006 ms   max   32.409 ms   avg  0.034 ms   med 0.023 ms   std   0.262
C open file       cnt  63997   min 0.013 ms   max    0.128 ms   avg  0.016 ms   med 0.016 ms   std   0.003
D rd 1st byte     cnt  63997   min 0.013 ms   max   24.931 ms   avg  0.175 ms   med 0.121 ms   std   0.642
E read rate     181.640 MB/s (data)  71.359 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  52157   min 0.001 ms   max   61.196 ms   avg  0.051 ms   med 0.003 ms   std   0.757
B lstat file      cnt  48669   min 0.006 ms   max   22.695 ms   avg  0.062 ms   med 0.026 ms   std   0.504
C open file       cnt  38205   min 0.013 ms   max    0.130 ms   avg  0.017 ms   med 0.016 ms   std   0.004
D rd 1st byte     cnt  38205   min 0.014 ms   max   41.009 ms   avg  0.349 ms   med 0.144 ms   std   1.313
E read rate     132.131 MB/s (data)  40.885 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open      cnt     20   min 0.065 ms   max   35.891 ms   avg  1.951 ms   med 0.126 ms   std   7.789
G wr 1st byte     cnt     20   min 0.006 ms   max    0.018 ms   avg  0.009 ms   med 0.008 ms   std   0.003
H write close     cnt     20   min 0.018 ms   max 1848.102 ms   avg 168.527 ms   med 2.059 ms   std 408.952
I mkdir           cnt      6   min 0.035 ms   max    0.085 ms   avg  0.051 ms   med 0.048 ms   std   0.018
J write rate      0.036 MB/s (data)   0.028 MB/s (open + 1st byte + data + close)

A read dir        cnt   3774   min 0.001 ms   max  157.739 ms   avg  0.134 ms   med 0.003 ms   std   3.215 
B lstat file      cnt   3536   min 0.007 ms   max  752.500 ms   avg  2.737 ms   med 0.029 ms   std  27.926
C open file       cnt   2821   min 0.015 ms   max    0.119 ms   avg  0.018 ms   med 0.017 ms   std   0.005
D rd 1st byte     cnt   2821   min 0.021 ms   max 2406.329 ms   avg  7.716 ms   med 0.149 ms   std  85.023
E read rate       9.633 MB/s (data)   2.480 MB/s (readdir + open + 1st byte + data)

Ext4 is different, not necessarily better

I also put ext4 through its paces. Its overall behaviour seems to be the same as with ext3: the same settings render the best performance. Overall, the single-reader scenario seems to suffer a performance drop of 20% to 30%, while the three-reader scenario gains about 30%. Large maximum latencies have become bigger, if anything.

Linux 2.6.31.2-test relatime barrier=0 fs=ext4 disk=areca RAID6 journal=ext data=ordered scheduler=cfq
**************************************************************************************

1 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  23537   min 0.001 ms   max   88.099 ms   avg  0.064 ms   med 0.004 ms   std   1.019
B lstat file      cnt  21968   min 0.007 ms   max   97.830 ms   avg  0.032 ms   med 0.025 ms   std   0.685
C open file       cnt  17263   min 0.014 ms   max    0.174 ms   avg  0.020 ms   med 0.017 ms   std   0.010
D rd 1st byte     cnt  17263   min 0.178 ms   max   96.332 ms   avg  0.877 ms   med 0.280 ms   std   3.350
E read rate      41.393 MB/s (data)  15.868 MB/s (readdir + open + 1st byte + data)

3 readers (30s)
----------------------------------------------------------------------
A read dir        cnt  15622   min 0.001 ms   max  108.698 ms   avg  0.095 ms   med 0.003 ms   std   1.361
B lstat file      cnt  14572   min 0.007 ms   max   26.369 ms   avg  0.033 ms   med 0.023 ms   std   0.354
C open file       cnt  11421   min 0.015 ms   max    0.266 ms   avg  0.020 ms   med 0.017 ms   std   0.013
D rd 1st byte     cnt  11421   min 0.178 ms   max  138.505 ms   avg  1.312 ms   med 0.295 ms   std   4.754
E read rate      24.273 MB/s (data)  10.016 MB/s (readdir + open + 1st byte + data)

3 readers, 3 writers (30s)
----------------------------------------------------------------------
F write open      cnt   5364   min 0.041 ms   max 1604.204 ms   avg  0.405 ms   med 0.047 ms   std  21.980
G wr 1st byte     cnt   5364   min 0.006 ms   max    0.419 ms   avg  0.007 ms   med 0.007 ms   std   0.008
H write close     cnt   5364   min 0.011 ms   max 2128.445 ms   avg  1.329 ms   med 0.017 ms   std  39.354
I mkdir           cnt    490   min 0.026 ms   max 1039.008 ms   avg  2.528 ms   med 0.033 ms   std  47.242
J write rate      8.110 MB/s (data)   5.698 MB/s (open + 1st byte + data + close)

A read dir        cnt   5916   min 0.001 ms   max  114.718 ms   avg  0.243 ms   med 0.004 ms   std   3.678
B lstat file      cnt   5531   min 0.008 ms   max   99.922 ms   avg  0.094 ms   med 0.025 ms   std   1.941
C open file       cnt   4382   min 0.015 ms   max    1.918 ms   avg  0.022 ms   med 0.018 ms   std   0.034
D rd 1st byte     cnt   4382   min 0.179 ms   max  632.008 ms   avg  3.857 ms   med 0.332 ms   std  20.912
E read rate       8.363 MB/s (data)   3.542 MB/s (readdir + open + 1st byte + data)

Conclusion

  • setups with data=ordered and the cfq scheduler provide the best performance balance for RAID6 on Areca

  • normal hard-disk systems only provide decent performance under r/w competition when running in data=writeback mode with cfq.

  • the major problem of the whole setup is the high number of outliers and the unpredictable service time of the IO calls. The standard deviation is way too high, and maximum wait times of several seconds are way too long.

  • upgrading from 2.6.24 to 2.6.31.2 brings some gains in throughput, maybe a factor of two or so. The random delays seem to stay the same though.

  • the deadline scheduler did not have advantages in any of our test cases.

  • btrfs promises major performance gains but competing readers and writers are not handled well either.

  • fsopbench allows a quick assessment of a system's ability to satisfy competing read and write requests.
