Sun will rise: 2013

Sunday, May 5, 2013

My Chinese Teacher

My high school Chinese teacher, Mr. Zhang Jun Qiu, passed away in his fifties this week. He was a role model to me and I respect him so much. I never expected that a few words in a QQ chat last year would be my last communication with him...

Rest In Peace, my teacher.

Monday, April 22, 2013

dd on devices: the difference between Linux and Solaris

People like to use dd for micro-benchmarks because it is simple. However, we should be careful about what dd is actually measuring. It can be more complicated than we imagine.

Let's take a look at a simple scenario: a single dd writing to a device with bs=16k

    # dd if=/dev/zero of=dev_name bs=16k count=655360

The difference between block I/O and raw I/O, and the difference between Solaris and Linux

The device name can refer to a block device or a raw device. For I/O on block devices, there is a cache layer in the OS kernel. I/O on raw devices bypasses the kernel's buffer cache.

The device names are different between Linux and Solaris:
- Solaris block device: /dev/dsk/...
- Solaris raw device: /dev/rdsk/...
- Linux block device: /dev/sdxxx, /dev/mapper/xxxx, etc.
- Linux raw device: /dev/raw/xxxx (raw binding on block device)

What do they look like in iostat? (Ignore the absolute performance numbers; I tested on different disks.) The outputs below illustrate the behavior of a single dd write on block devices and raw devices.

Solaris block device (bs=16k):
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 13006.1    0.0  101.6 133.5 256.4   10.3   19.7 100 100 c5t20360080E536D50Ed0

Linux block device (bs=16k):
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00 14098.00    0.00  115.00     0.00 57868.00 1006.40   140.86 1325.40    0.00 1325.40   8.70 100.00

Solaris raw device (bs=16k):
                     extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0 2805.0    0.0   43.8  0.0  0.9    0.0    0.3   3  87 c5t20360080E536D50Ed0

Linux raw device (bs=8k):
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00 3645.00     0.00 29160.00 16.00     0.63    0.18    0.00    0.18   0.17  62.90

For a "single dd write" on block devices, look at the "actv" column (Solaris) and the "avgqu-sz" column (Linux). Both report the average number of outstanding I/O requests (for "actv": transactions actively being serviced, removed from the wait queue but not yet completed), and both are much larger than 1. On Solaris, the "actv" limit is controlled by the kernel parameter ssd_max_throttle or sd_max_throttle; the default is 256. If the limit is hit, further I/O requests are queued ("wait" in iostat). The upper layer (e.g. the ZFS filesystem) may also limit the number of I/Os sent to each device.

For a "single dd write" on raw devices, "actv" or "avgqu-sz" is never larger than 1 (unless the backend of the device is not a real device, e.g. a regular file), which means that the next write can be sent to the device only after the previous data transfer has completed. In this regard, Solaris and Linux behave similarly. In addition, since this is I/O on raw devices, the I/O size in iostat equals the application write size. Because modern CPUs and memory subsystems are very fast, a "single dd write" on raw devices is essentially a latency test of the disk subsystem.

On the other hand, Solaris and Linux have different caching implementations for dd writes on block devices. From the iostat output above, you can see that Solaris write() splits the data into 8k chunks before sending it to the I/O driver, while Linux smartly merges small I/Os (the "wrqm/s" column in iostat). This means that for "dd writes on block devices" Linux usually performs better than Solaris. However, this workload is not very common on Solaris: in most cases, applications write either to a file system, which can consolidate writes, or to raw devices, which are typically used and optimized by databases. Below is the iostat output during a write test on a file in a ZFS filesystem.

$ dd if=/dev/zero of=TEST bs=16k count=655360
                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.0  154.0    0.0  154.0  0.0 10.0    0.1   64.7   2 100 c7t0d0

We can see that the average I/O write size above is 154.0 MB/s / 154.0 w/s = 1 MB per write.
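The same division works for every iostat sample above. Here is a minimal sketch in C; the function names avg_write_kb and print_samples are mine (not from any tool), and it assumes 1 MB = 1024 KB when converting the Solaris Mw/s column.

```c
#include <stdio.h>

/* Average write size in KB, given write throughput (KB/s) and write IOPS (w/s). */
double avg_write_kb(double write_kb_per_sec, double writes_per_sec)
{
    if (writes_per_sec <= 0.0)
        return 0.0;                 /* idle device: no meaningful average */
    return write_kb_per_sec / writes_per_sec;
}

/* Prints the average write size for the iostat samples quoted in this post. */
void print_samples(void)
{
    /* Solaris reports Mw/s, so convert to KB/s first (assumes 1 MB = 1024 KB). */
    printf("Solaris block: %.1f KB\n", avg_write_kb(101.6 * 1024.0, 13006.1));
    printf("Solaris raw:   %.1f KB\n", avg_write_kb(43.8 * 1024.0, 2805.0));
    printf("Solaris ZFS:   %.1f KB\n", avg_write_kb(154.0 * 1024.0, 154.0));
    /* Linux reports wkB/s directly. */
    printf("Linux block:   %.1f KB\n", avg_write_kb(57868.0, 115.0));
    printf("Linux raw:     %.1f KB\n", avg_write_kb(29160.0, 3645.0));
}
```

Calling print_samples() from main() shows roughly 8 KB per write for the Solaris block device, about 16 KB for the Solaris raw device, 8 KB for the Linux raw device, and about 500 KB for the Linux block device, where the merging is visible.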

What you can expect for single dd write performance

Below are my test results on a modern server; yours may be different.

2.5" 10K RPM internal disk without using internal hardware raid:
single dd writes on block device bs=16k: 160MB/s.
single dd writes on raw device bs=16k: 160MB/s (drive write-cache enabled), 1.3MB/s (drive write-cache disabled)

A low-end SAN storage:
single dd writes on block device bs=16k: depends on OS, LUN disk layout, raid level etc.
single dd writes on raw device bs=16k: a little more than 40MB/s (with some tweaks to the storage settings it can reach 75MB/s)
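Since a single dd on a raw device keeps only one write in flight, the throughput numbers above can be turned into an approximate per-write latency. This is a rough sketch only: implied_latency_ms is a name I made up, and it assumes 1 MB = 1024 KB and that the time spent in dd itself is negligible.

```c
/*
 * Implied latency of one write in milliseconds, assuming exactly one
 * write is outstanding at any time (queue depth 1).
 */
double implied_latency_ms(double block_kb, double throughput_mb_per_sec)
{
    double writes_per_sec;

    if (block_kb <= 0.0 || throughput_mb_per_sec <= 0.0)
        return 0.0;                 /* no meaningful latency to report */

    writes_per_sec = throughput_mb_per_sec * 1024.0 / block_kb;
    return 1000.0 / writes_per_sec;
}
```

For example, implied_latency_ms(16.0, 1.3) comes out at about 12 ms per write for the internal disk with write-cache disabled, while implied_latency_ms(16.0, 40.0) is roughly 0.4 ms for the SAN storage, which is in line with the asvc_t column of the Solaris raw device sample.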

Additional notes:
-----------------------
- On Linux, raw I/O is similar to O_DIRECT; GNU dd has an "oflag=direct" option for block devices.
- I also tested on an internal hardware raid0 volume of the internal disk; the performance is similar to the internal disk results above. A benefit of internal hardware raid shows up with "oflag=dsync" writes: if you configure the internal raid controller cache as write-back, you will get much better performance.

Sunday, April 21, 2013

CLOSE_WAIT

This is not the first time I have heard people complain that connections remain in the CLOSE_WAIT state on their system.
This happens because the peer has closed the connection (closed the socket explicitly, or the peer process terminated), but your side has not taken the correct action on this connection.

So, what happens if the peer closes the connection?

- If your side does not take the correct action, the peer is NOT left in a bad state: the peer's connection state will be cleaned up after tcp_fin_wait_2_flush_interval.

- If your side is doing write() or send() on the connection, your side will receive a SIGPIPE signal ("broken pipe", see the signal.h man page). The default action for SIGPIPE is to terminate the application, so your application probably needs to change this default behavior.

- If your side is doing read() or recv() on the connection, the call will return 0. Your side must handle this situation.

- If your side is waiting for a POLLIN event via select(), poll() or port_get(), the event will fire and your subsequent recv() will return 0, so you can handle the situation in your application.

- If your side does not take any action on the connection, CLOSE_WAIT will remain on your side until you exit or restart your application. (The tcp_keepalive_interval and tcp_keepalive_abort_interval TCP/IP settings do not help here.)

A typical mistake in applications looks like this:
....
recvbytes=recv(sockfd, buf, BUFSIZE, 0);
if (recvbytes < 0) {
             perror("recv error");
             close(sockfd);
             .....
}
....

This is buggy; the correct check is "if (recvbytes <= 0) {".
A return value of 0 means the peer has closed the connection.

It's strange that the Linux man page clearly says that recv() returning 0 means the peer has performed an orderly shutdown, but the current Solaris man page says nothing about it.
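To make the points above concrete, here is a minimal sketch of a receive loop that treats a return value of 0 as "peer closed" and then closes its own socket, so the connection does not stay in CLOSE_WAIT. The function names drain_until_closed and ignore_sigpipe are my own, and a real application would process the data instead of just counting bytes.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Reads from sockfd until the peer closes the connection or an error
 * occurs, then closes sockfd. Returns the total number of bytes
 * received, or -1 if recv() failed.
 */
long drain_until_closed(int sockfd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = recv(sockfd, buf, sizeof(buf), 0);
        if (n > 0) {
            total += n;             /* real code would process buf here */
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;               /* interrupted by a signal: retry */
        if (n < 0)
            total = -1;             /* genuine error */
        /* n == 0 means the peer performed an orderly shutdown */
        break;
    }
    close(sockfd);                  /* without this, CLOSE_WAIT remains */
    return total;
}

/*
 * Ignores SIGPIPE so that writing to a closed connection fails with
 * EPIPE instead of terminating the process. Returns 0 on success.
 */
int ignore_sigpipe(void)
{
    return signal(SIGPIPE, SIG_IGN) == SIG_ERR ? -1 : 0;
}
```

Calling ignore_sigpipe() once at startup is one common way to change the default SIGPIPE behavior; after that, write() or send() returns -1 with errno set to EPIPE and the application can close the socket itself.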

Saturday, April 20, 2013

Be careful when using valloc with libmalloc on Solaris 10

If your application uses valloc(), be careful about linking it with libmalloc.so on Solaris 10.
Solaris 10's libmalloc (at least in the several versions I tested) does not implement valloc(), so valloc() is resolved from libc while free() is resolved from libmalloc. This mismatch causes a core dump.

You can use DTrace to check it. For example:
~/tmp$ dtrace -qn 'pid$target::valloc:entry {ustack();}' -c ./a.out

To check whether your libmalloc has valloc:
$ elfdump /usr/lib/libmalloc.so|grep valloc

Four Years

Four years ago today, my colleague called me while I was washing dishes after dinner. He said: "Our company is sold! Check your email now..."
Yes, I will never forget it. The bitterness is still deep in my heart.

Wednesday, January 30, 2013

Build Mesos on Solaris

Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. I built it because I want to experiment with Spark, a popular framework for cluster computing (e.g. big data analysis).
Mesos integrates a lot of third-party software, so the build process on Solaris is not very smooth. At first I tried Solaris Studio (SunCC), but ran into some trouble. To save time, I decided to use gcc.
The build environment is: Solaris 11 11/11 sparc; gcc 4.5.2 from the Solaris 11 IPS repository; mesos 0.9.0.

Preparation Steps:

- build automake-1.11.6 (automake-1.11.2 from Solaris 11 is not new enough for mesos 0.9.0)
wget http://ftp.gnu.org/gnu/automake/automake-1.11.6.tar.gz
CC=cc CXX=CC ./configure --program-suffix=-1.11
gmake; gmake install (installs into /usr/local/bin)

- modify configure.ac
in "solaris" section:
    CC=cc
    CXX=CC
    CFLAGS="$CFLAGS"
    CXXFLAGS="$CXXFLAGS"
    LIBS="$LIBS -lsocket -lnsl -lproject -lproc -lresolv -lsendfile -lxnet"

then in the "JAVA_LDFLAGS" section, add a Solaris branch:
...
    elif test "$OS_NAME" = "solaris"; then
      JAVA_LDFLAGS=""
      for arch in sparc; do
        dir="$JAVA_HOME/jre/lib/$arch/server"
        if test -e "$dir"; then
          # Note that these are libtool specific flags.
          JAVA_LDFLAGS="-L$dir -R$dir -ljvm"
          break;
        fi
      done
    fi
...

- run autoconf.

- run configure:
./configure CC=gcc CXX=g++ CFLAGS="-m32 -pthreads" CXXFLAGS="-m32 -pthreads" JAVA_HOME=/usr/java --prefix=/opt/mesos

Porting issues and solutions:

1) process.cpp in libprocess
Compiling process.cpp and future.hpp failed because of a syntax error in the generated assembly.
Using CC -S and then cc -c process.s, we found that it uses the "pause" instruction.
Solution: use smt_pause()

2) process.cpp in libprocess
ssize_t length = sendfile(s, fd, offset, size);
=>
ssize_t length = sendfile(s, fd, &offset, size);

3) pid.cpp in libprocess
gethostbyname2_r is not available on Solaris, so the code has to be modified to use gethostbyname_r

4)port_posix.h, atomic_pointer.h
for macros and memory barriers.

5) getpwuid_r in zookeeper.c
getpwuid_r(uid, &pw, buf, sizeof(buf), &pwp)
similar to the gethostbyname2_r issue

6)recordio.c in zookeeper
redefined htonll:
recordio.h:int64_t htonll(int64_t v);
On solaris, the prototype is: uint64_t htonll(uint64_t hostlonglong);
Linux doesn't have htonll
Solution: comment out htonll

7)mt_adapter.c in zookeeper
atomic ops: fetch_and_add

8)cli.c in zookeeper
ctime_r(&tctime, tctimes)
Solaris requires: char *ctime_r(const time_t *clock, char *buf, int buflen);

9)mesos
./common/utils.hpp:359:17: error: ‘NAME_MAX’ was not declared
Solution: define it: #define NAME_MAX 255

10)slave/solaris_project_isolation_module.hpp
The Solaris project implementation is not complete, so it fails to compile.
Workaround: skip it by modifying macro definition

11) protobuf is compiled in 64-bit by default.
Solution: reconfigure in thirdparty/protobuf according to config.log and add -m64 in CFLAGS and CXXFLAGS: (it's better to add -m32 in top level of mesos configure).

12) /usr/lib/python2.6/pycc complained:
cc: No valid input files specified, no output generated
because the src file is c++: native/proxy_executor.cpp
Workaround (from looking at /usr/lib/python2.6/pycc; pycc and pyCC are the same):
in ~/Downloads/mesos-0.9.0/src/python
$ PYCC_CC=g++ PYCC_CXX=g++ LDFLAGS="-lnsl -lresolv -lsendfile -lsocket" python setup.py build
(It seems on Solaris 11.1 there's no such issue).

13) gmake test
need -lxnet

14) got runtime error:
if (errno != EINPROGRESS) {...
In fact, errno is 0 here.
Reason: libprocess was built without "-pthreads".
Solution: also add "-pthreads" in configure.ac so that it takes effect and each thread gets a private copy of errno.
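For issue 6, an alternative to commenting out htonll would be a helper with a name that cannot clash with the Solaris prototype. The sketch below is not what I used in the port; host_to_net64 is a made-up name, and it builds the big-endian byte order by hand so it does not depend on any platform's htonll.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical replacement for htonll: converts a 64-bit value from
 * host byte order to network (big-endian) byte order. It works the
 * same on big-endian (sparc) and little-endian (x64) hosts.
 */
uint64_t host_to_net64(uint64_t v)
{
    unsigned char bytes[8];
    uint64_t out;
    int i;

    /* Store the least significant byte last, i.e. big-endian layout. */
    for (i = 7; i >= 0; i--) {
        bytes[i] = (unsigned char)(v & 0xff);
        v >>= 8;
    }
    memcpy(&out, bytes, sizeof(out));
    return out;
}
```

Because the conversion is a byte swap on little-endian hosts and a no-op on big-endian hosts, applying the same function to a value received from the network converts it back to host order.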

Other notes:

- Building mesos 0.9.0 fails with gcc 4.7 on Linux, and most current Linux distributions ship this gcc version.
- I also built mesos 0.9.0 on Solaris 11.1 x64; fewer code modifications are needed because the platform is x64.
- The modified files are now on GitHub.