Monday, April 22, 2013

dd on devices: the difference between Linux and Solaris

People like to use dd for micro-benchmarks because it is simple. However, we should be careful about what dd actually measures; it can be more complicated than we imagine.

Let's take a look at a simple scenario: a single dd writing to a device with bs=16k

    # dd if=/dev/zero of=dev_name bs=16k count=655360  

The difference between block I/O and raw I/O, and the difference between Solaris and Linux

The device name can refer to a block device or a raw device. I/O on a block device goes through a cache layer in the OS kernel; I/O on a raw device bypasses the kernel's buffer cache.

The device names are different between Linux and Solaris:
- Solaris block device: /dev/dsk/...
- Solaris raw device: /dev/rdsk/...
- Linux block device: /dev/sdxxx, /dev/mapper/xxxx, etc.
- Linux raw device: /dev/raw/xxxx (raw binding on block device)

What do they look like in iostat? (Don't compare the performance numbers; I tested on different disks.) I'm only illustrating the behavior of a single dd write on block devices and raw devices.

Solaris block device (bs=16k):
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 13006.1    0.0  101.6 133.5 256.4   10.3   19.7 100 100 c5t20360080E536D50Ed0

Linux block device (bs=16k):
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00 14098.00    0.00  115.00     0.00 57868.00 1006.40   140.86 1325.40    0.00 1325.40   8.70 100.00

Solaris raw device (bs=16k):
                     extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 2805.0    0.0   43.8  0.0  0.9    0.0    0.3   3  87 c5t20360080E536D50Ed0

Linux raw device (bs=8k):
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00 3645.00     0.00 29160.00 16.00     0.63    0.18    0.00    0.18   0.17  62.90

For "single dd write" on block devices, look at the "actv" column (Solaris) and the "avgqu-sz" column (Linux). They show that the average number of outstanding transactions (on Solaris, those removed from the wait queue but not yet completed) is much larger than 1. On Solaris, the "actv" limit is controlled by the kernel parameter ssd_max_throttle or sd_max_throttle; the default is 256. Once the limit is hit, further I/O requests are queued ("wait" in iostat). The upper layer (e.g. the ZFS filesystem) may also limit the number of I/Os sent to each device.
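As a sanity check on the "actv" numbers, Little's law says the average number of requests in service is roughly the request rate multiplied by the time each request spends in service. Below is a minimal sketch of that arithmetic; the function name expected_active is my own, and it assumes w/s and asvc_t are taken from the same iostat interval.

```c
/* Little's law applied to iostat columns.
 * iops   : requests per second (w/s, or r/s + w/s)
 * svc_ms : average active service time in milliseconds (asvc_t on Solaris)
 * Returns the expected average number of requests being serviced. */
double expected_active(double iops, double svc_ms)
{
    return iops * svc_ms / 1000.0;
}
```

With the Solaris block device sample above, expected_active(13006.1, 19.7) gives about 256.2, close to the reported actv of 256.4. With the Solaris raw device sample, expected_active(2805.0, 0.3) gives about 0.84, close to the reported 0.9.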

For "single dd write" on raw devices, "actv" or "avgqu-sz" is never larger than 1 (unless the backend of the device is not a real device, e.g. a regular file), which means that the next write is sent to the device only after the previous data transfer has completed. In this regard, Solaris and Linux behave similarly. In addition, since this is I/O on raw devices, the I/O size seen in iostat equals the application write size. Because modern CPUs and memory subsystems are very fast, a "single dd write" on raw devices is essentially a latency test of the disk subsystem.

On the other hand, Solaris and Linux have different caching implementations for dd writes on block devices. From the iostat output above, you can see that Solaris write() splits the data into 8k chunks and then sends them to the I/O driver, while Linux merges small I/Os into larger requests (the "wrqm/s" column in iostat). This means that for dd writes on block devices Linux usually performs better than Solaris. However, writing to block devices directly is not very common on Solaris: in most cases, applications write either to a file system, which can consolidate writes, or to raw devices, which are typically used and optimized by databases. Below is the iostat output while writing to a file in a ZFS filesystem.

$ dd if=/dev/zero of=TEST bs=16k count=655360
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  154.0    0.0  154.0  0.0 10.0    0.1   64.7   2 100 c7t0d0
We can see that the average I/O write size above is 154.0 MB/s / 154.0 writes/s = 1MB per write.
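The same division works for every iostat sample in this post. Here is a small sketch of the arithmetic; the function names are my own, and it assumes the Solaris Mw/s column is in MB (1024 KB) and the Linux avgrq-sz column is in 512-byte sectors.

```c
/* Average size of one write in KB, from throughput in MB/s and writes/s
 * (Solaris iostat: Mw/s and w/s). Returns 0 if there were no writes. */
double avg_write_kb_from_mb(double mw_per_s, double w_per_s)
{
    return w_per_s > 0.0 ? mw_per_s * 1024.0 / w_per_s : 0.0;
}

/* Average request size in KB from the Linux iostat avgrq-sz column,
 * which is reported in 512-byte sectors. */
double avg_request_kb_from_sectors(double avgrq_sz)
{
    return avgrq_sz * 512.0 / 1024.0;
}
```

For the samples above this gives about 8 KB per write for the Solaris block device (101.6 and 13006.1), about 16 KB for the Solaris raw device (43.8 and 2805.0), 1024 KB for the ZFS file, 8 KB for the Linux raw device (avgrq-sz 16.00) and about 503 KB for the Linux block device (avgrq-sz 1006.40).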

What you can expect for single dd write performance

Below are my test results on a modern server; yours may be different.

2.5" 10K RPM internal disk without using the internal hardware RAID:
single dd writes on block device bs=16k: 160MB/s.
single dd writes on raw device bs=16k: 160MB/s (drive write-cache enabled), 1.3MB/s (drive write-cache disabled)

A low-end SAN storage:
single dd writes on block device bs=16k: depends on OS, LUN disk layout, raid level etc.
single dd writes on raw device bs=16k: a little more than 40MB/s (with some tweaks to the storage settings it can reach 75MB/s)
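Because a single dd on a raw device has only one write outstanding at a time, its throughput translates directly into time per write. A rough sketch of that conversion follows; the function name is my own, and it assumes 1MB = 1024KB and ignores time spent in dd itself.

```c
/* Time per write in milliseconds implied by a throughput figure, assuming
 * exactly one write of bs_kb kilobytes is outstanding at any moment.
 * Returns 0 for non-positive inputs. */
double implied_latency_ms(double mb_per_s, double bs_kb)
{
    if (mb_per_s <= 0.0 || bs_kb <= 0.0)
        return 0.0;
    /* writes per second = bytes per second / bytes per write */
    double iops = mb_per_s * 1024.0 / bs_kb;
    return 1000.0 / iops;
}
```

With bs=16k, 1.3MB/s works out to about 12 ms per write and 160MB/s to about 0.1 ms per write, which shows how much of the raw-device number is decided by the drive write cache.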

Additional notes:
-----------------------
- On Linux, raw I/O is similar to O_DIRECT; GNU dd has an "oflag=direct" option for block devices.
- I also tested on a hardware RAID 0 volume built on the internal disk; the performance is similar to the internal disk results above. A benefit of the internal hardware RAID shows up with "oflag=dsync" writes: if you configure the RAID controller cache as write-back, you will get much better performance.
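To make the oflag notes more concrete, here is a rough sketch of what a dd-style writer does with such flags. This is not how GNU dd is implemented; dd_zero_write is a hypothetical helper, and the 4096-byte buffer alignment is an assumption that is commonly sufficient for O_DIRECT but depends on the device.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write `count` blocks of `bs` zero bytes to `path`, roughly like
 *   dd if=/dev/zero of=path bs=... count=...
 * extra_flags is OR-ed into the open() flags, e.g. O_DSYNC, or O_DIRECT on
 * Linux (O_DIRECT needs _GNU_SOURCE defined before the includes).
 * Returns the number of bytes written, or -1 on any error. */
long long dd_zero_write(const char *path, size_t bs, size_t count, int extra_flags)
{
    void *buf = NULL;
    long long total = 0;
    int fd;

    /* Direct I/O wants an aligned buffer; 4096 is an assumed alignment. */
    if (posix_memalign(&buf, 4096, bs) != 0)
        return -1;
    memset(buf, 0, bs);

    fd = open(path, O_WRONLY | O_CREAT | extra_flags, 0600);
    if (fd < 0) {
        free(buf);
        return -1;
    }
    for (size_t i = 0; i < count; i++) {
        ssize_t n = write(fd, buf, bs);
        if (n != (ssize_t)bs) {   /* this sketch treats short writes as errors */
            total = -1;
            break;
        }
        total += n;
    }
    if (close(fd) != 0)
        total = -1;
    free(buf);
    return total;
}
```

Passing O_DSYNC as extra_flags corresponds to "oflag=dsync", and passing O_DIRECT corresponds to "oflag=direct"; passing 0 gives ordinary buffered writes.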


Sunday, April 21, 2013

CLOSE_WAIT

This is not the first time I have heard people complain that connections stay in the CLOSE_WAIT state on their system.
This happens because the peer has closed the connection (it closed the socket explicitly or the peer process was terminated), but your side has not taken the correct action on this connection.

So, what happens if the peer closes the connection?

- If your side does not take the correct action, the peer will NOT end up in a bad state: the peer's connection state will be cleaned up after tcp_fin_wait_2_flush_interval.

- If your side is doing write() or send() on the connection, your side will receive the SIGPIPE signal ("broken pipe", see the signal.h man page). The default action for SIGPIPE is to terminate the application, so your application probably needs to change this default behavior.

- If your side is doing read() or recv() on the connection, the call will return 0. Your side must handle this situation.

- If your side is waiting for a POLLIN event via select(), poll() or port_get(), the event will fire and the subsequent recv() will return 0, so you can handle the situation in your application.

- If your side does not take any action on the connection, CLOSE_WAIT will remain on your side until you exit or restart your application. (The tcp_keepalive_interval and tcp_keepalive_abort_interval TCP/IP settings do not help with this.)
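For the write() or send() case, one common way to change the default SIGPIPE behavior is to ignore the signal and check errno instead. Below is a minimal sketch of that approach; ignore_sigpipe and send_checked are hypothetical helper names, and treating ECONNRESET the same as EPIPE is my own assumption about what an application would want.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Ignore SIGPIPE for the whole process so that a write to a closed
 * connection fails with EPIPE instead of killing the application.
 * Returns 0 on success, -1 on failure. */
int ignore_sigpipe(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = SIG_IGN;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGPIPE, &sa, NULL);
}

/* send() wrapper: sets *peer_gone to 1 when the error indicates that the
 * peer has closed or reset the connection, otherwise to 0. */
ssize_t send_checked(int fd, const void *buf, size_t len, int *peer_gone)
{
    ssize_t n = send(fd, buf, len, 0);
    *peer_gone = (n < 0 && (errno == EPIPE || errno == ECONNRESET));
    return n;
}
```

When peer_gone is set, the application should close() the socket; that close is what lets the connection leave CLOSE_WAIT.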

A typical mistake in applications looks like this:
....
recvbytes = recv(sockfd, buf, BUFSIZE, 0);
if (recvbytes < 0) {
    perror("recv error");
    close(sockfd);
    .....
}
....

This is buggy; the correct check is "if (recvbytes <= 0) {".
A return value of 0 means the peer has closed the connection.

It's strange that the Linux man page clearly says that recv() returning 0 means the peer has performed an orderly shutdown, while the current Solaris man page says nothing about it.
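Putting the points above together, a receive path that does not leave sockets in CLOSE_WAIT could look roughly like the sketch below. recv_or_close is a hypothetical helper name, and retrying on EINTR is my own addition rather than something from the example above.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Receive once from sockfd.
 * Returns >0 : number of bytes received
 *          0 : the peer closed the connection (sockfd is closed here)
 *         -1 : error (sockfd is closed here) */
ssize_t recv_or_close(int sockfd, void *buf, size_t len)
{
    ssize_t n;

    do {
        n = recv(sockfd, buf, len, 0);
    } while (n < 0 && errno == EINTR);   /* retry if interrupted by a signal */

    if (n <= 0) {                        /* note: <= 0, not < 0 */
        if (n < 0)
            perror("recv error");
        close(sockfd);                   /* our close() ends CLOSE_WAIT */
    }
    return n;
}
```

The caller must stop using sockfd once the return value is 0 or -1, since the descriptor has already been closed.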


Saturday, April 20, 2013

Be careful when using valloc with libmalloc on Solaris 10

If your application uses valloc(), be careful when linking it with libmalloc.so on Solaris 10.
Solaris 10's libmalloc (at least in the several versions I tested) does not implement valloc(), so valloc() is resolved from libc while free() is resolved from libmalloc; this mismatch will cause a core dump.

You can use DTrace to check it. For example:
~/tmp$ dtrace -qn 'pid$target::valloc:entry {ustack();}' -c ./a.out

To check whether your libmalloc has valloc:
$ elfdump /usr/lib/libmalloc.so|grep valloc


             

Four Years

Four years ago today, my colleague called me while I was washing dishes after dinner. He said: "our company is sold! check your email now..."
Yes, I will never forget it. The bitterness is still deep in my heart.