Monday, April 22, 2013

dd on devices: the difference between Linux and Solaris

People like to use dd for micro-benchmarks because it is simple. However, we should be careful about what dd is actually measuring; this can be more complicated than we imagine.

Let's take a look at a simple scenario: a single dd write to a device with bs=16k

    # dd if=/dev/zero of=dev_name bs=16k count=655360  

The difference between block I/O and raw I/O, and the difference between Solaris and Linux

The device name can be a block device or a raw device. For I/O on block devices, there is a cache layer in the OS kernel. For I/O on raw devices, the I/O bypasses the kernel's buffer cache.

The device names are different between Linux and Solaris:
- Solaris block device: /dev/dsk/...
- Solaris raw device: /dev/rdsk/...
- Linux block device: /dev/sdxxx, /dev/mapper/xxxx, etc.
- Linux raw device: /dev/raw/xxxx (raw binding on block device)
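
For reference, here is roughly how the two variants are invoked on each OS. This is just a sketch: /dev/sdb, /dev/raw/raw1 and the cXtYdZs0 names are placeholders for your own devices, and writing to them destroys whatever is on them.

    # Linux: bind a raw device to a block device first (util-linux raw(8)),
    # then point dd at either name
    raw /dev/raw/raw1 /dev/sdb
    dd if=/dev/zero of=/dev/sdb      bs=16k count=655360   # block device: goes through the page cache
    dd if=/dev/zero of=/dev/raw/raw1 bs=16k count=655360   # raw device: bypasses the page cache

    # Solaris: the block and raw names refer to the same LUN
    dd if=/dev/zero of=/dev/dsk/cXtYdZs0  bs=16k count=655360   # block device
    dd if=/dev/zero of=/dev/rdsk/cXtYdZs0 bs=16k count=655360   # raw device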

What do they look like in iostat? (Don't compare the performance numbers; I tested on different disks.) The outputs below illustrate the behavior of a single dd write on block devices and raw devices.

Solaris block device (bs=16k):
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 13006.1    0.0  101.6 133.5 256.4   10.3   19.7 100 100 c5t20360080E536D50Ed0

Linux block device (bs=16k):
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00 14098.00    0.00  115.00     0.00 57868.00 1006.40   140.86 1325.40    0.00 1325.40   8.70 100.00

Solaris raw device (bs=16k):
                     extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 2805.0    0.0   43.8  0.0  0.9    0.0    0.3   3  87 c5t20360080E536D50Ed0

Linux raw device (bs=8k):
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00 3645.00     0.00 29160.00 16.00     0.63    0.18    0.00    0.18   0.17  62.90

For "single dd write" on block devices, look at "actv" column (Solaris) and "avgqu-sz" column (Linux), which means that average number of transactions actively being serviced (removed from the wait queue but not yet completed) are much larger than 1. On Solaris, the "actv" limit is controlled by kernel parameter ssd_max_throttle or sd_max_throttle, the default is 256. If the limit is hit, the I/O request will be queued ("wait" in iostat). The upper layer (e.g. ZFS filesystem) may also limit the I/Os sent for each device .

For "single dd write" on raw devices,  the "actv" or  "avgqu-sz" is never larger than 1 (unless the backend of dev is not a real device, e.g. a regular file), which means that only when previous data transfer is completed, the next data can be send to the device. In this regard, Solaris and linux behave similarly. In addition, since this is I/O on raw devices, the I/O size in iostat is equal to application write size. While modern CPU and memory subsystem is very fast,  "single dd write" on raw devices becomes more like a disk subsystem latency testing.

On the other hand, Solaris and Linux have different caching implementations for dd writes on block devices. From the iostat output above, you can see that Solaris write() splits the data into 8k chunks before sending it to the I/O driver, while Linux can smartly merge small I/Os (the "wrqm" column in iostat). This means that for "dd writes on block devices" Linux usually performs better than Solaris. However, this case is not very common on Solaris: in most cases, application writes go to a file system, which can consolidate writes, or to raw devices, which are usually used and optimized by databases. Below is iostat during a write test on a file in a ZFS filesystem.

$ dd if=/dev/zero of=TEST bs=16k count=655360
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  154.0    0.0  154.0  0.0 10.0    0.1   64.7   2 100 c7t0d0
We can see that the average I/O write size is 154.0 Mw/s / 154.0 w/s = 1 MB per I/O.
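
The same throughput/IOPS arithmetic works for any of the iostat outputs above; a quick sketch using the figures already quoted:

    # Average I/O size = throughput / IOPS
    awk 'BEGIN {
        printf "Solaris ZFS write    : %.1f MB per I/O\n", 154.0/154.0        # Mw/s / w/s
        printf "Linux block device   : %.0f kB per I/O\n",  57868/115         # wkB/s / w/s, merged I/Os
        printf "Solaris block device : %.1f kB per I/O\n",  101.6*1024/13006  # split into ~8k chunks
    }'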

What you can expect for single dd write performance

Below are my test results on a modern server; yours may be different.

2.5" 10KPM internal disk without using internal hardware raid:
single dd writes on block device bs=16k: 160MB/s.
single dd writes on raw device bs=16k: 160MB/s (drive write-cache enabled), 1.3MB/s (drive write-cache disabled)

A low-end SAN storage:
single dd writes on block device bs=16k: depends on OS, LUN disk layout, raid level etc.
single dd writes on raw device bs=16k: a little more than 40MB/s (with some tweaks to the storage settings, 75MB/s is achievable)

Additional notes:
-----------------------
- On Linux, raw I/O is similar to O_DIRECT; GNU dd has an "oflag=direct" option for block devices (see the example after these notes).
- I also tested on a hardware RAID 0 volume built from internal disks; the performance is similar to the internal-disk results above. A benefit of internal hardware RAID shows up with "oflag=dsync" writes: if you configure the RAID controller cache as write-back, you get much better performance.
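
For example, on Linux the following variants avoid setting up a raw binding; /dev/sdb is again a placeholder, and these commands overwrite the device.

    # O_DIRECT: bypass the page cache on the block device (behaves much like raw I/O)
    dd if=/dev/zero of=/dev/sdb bs=16k count=655360 oflag=direct

    # O_DSYNC: every write must reach stable storage; this is where a write-back
    # RAID controller cache makes a big difference
    dd if=/dev/zero of=/dev/sdb bs=16k count=655360 oflag=dsync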

