Monday, October 29, 2012

Micro Benchmark MongoDB 2.2 performance on Solaris


I fully agree with this statement from MongoDB:
“MongoDB does not publish any official benchmarks. We recommend running application performance tests on your application's workload to find bottlenecks and for performance tuning.”
However, I don't have a real-world workload, so I just ran some micro benchmarks to observe the behavior of MongoDB and the OS. Although the resulting numbers mean little on their own, I'd like to share some findings here.

1) Using the JS Benchmarking Harness
MongoDB provides the JS Benchmarking Harness as a QA baseline performance measurement tool; it is not designed to be a "benchmark". It is a good starting point for a first look at MongoDB performance. The harness is very easy to set up, but there are a few things to consider.

The sample code on that web page is truly a micro benchmark. I tested it against MongoDB 2.2 for Solaris x64 and got suboptimal results compared with the Linux version. After analyzing the workload characteristics, it looks more like a multi-threaded malloc test combined with small-packet TCP/IP ping-pong.
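To see why the libmtmalloc fix below helps, here is a tiny C stress sketch of that allocation pattern (my own toy code, not the harness): many threads hammering small malloc/free calls, which the default single-lock libc malloc on Solaris serializes. Run it with and without LD_PRELOAD_64=libmtmalloc.so to compare.

#include <pthread.h>
#include <stdlib.h>

#define THREADS 8
#define ITERS   1000000

/* Each thread allocates and frees small buffers, much like mongod
 * handling small documents on many connections. */
static void *hammer(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        free(malloc(128));
    return NULL;
}

int main(void)
{
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, hammer, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}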

By starting mongod with LD_PRELOAD_64=libmtmalloc.so, I got performance on Solaris on par with Linux. If the test client and server are on separate systems, the Nagle algorithm may also need to be disabled: $ sudo ndd -set /dev/tcp tcp_naglim_def 1
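For completeness, the per-socket equivalent of that system-wide ndd setting is the TCP_NODELAY socket option. A minimal C sketch (the function name is mine, for illustration only):

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Disable the Nagle algorithm on a single connected socket.
 * Returns 0 on success, -1 on error (check errno). */
int disable_nagle(int fd)
{
    int on = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}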

The harness also has an interesting feature: RAND_INT [ min, max, multiplier ]. It looks like it lets us touch only a fixed fraction of the data during testing. Two things need to be considered here:
  1. Looking at the current harness implementation, RAND_INT translates to rand(), which is not really random for big (millions of records) data sets. The fix is to use lrand48() instead; see the sketch after this list.
  2. MongoDB uses mmap to cache data; like many other databases, this is still a page-level cache rather than a row-level cache. So if your record size is small, RAND_INT [ 1, 10000000, 10 ] doesn't make you touch only 1/10 of the data; it makes you touch all of it. For example, with 4 KB pages and 100-byte records, about 40 records share each page, so even touching only ids that are multiples of 10 still lands on every page.
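A small C sketch of the rand() problem, assuming the harness maps RAND_INT to C rand() as described above. On Solaris, RAND_MAX is only 32767, so rand() can never produce most ids in a collection of millions, while lrand48() covers the whole range:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    srand(42);
    srand48(42);
    printf("RAND_MAX  = %d\n", RAND_MAX);               /* 32767 on Solaris */
    printf("rand()    = %d\n", rand() % 10000000);      /* always < 32768 */
    printf("lrand48() = %ld\n", lrand48() % 10000000);  /* full id range */
    return 0;
}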

2) Using YCSB.
YCSB is an extensible load-testing tool, but its test code for MongoDB is a little outdated; I had to modify it a bit to add more writeConcern types.
YCSB's test driver has some limitations:
  • You can set the read/write proportion, but reads and writes share the same thread context, which means writes can block reads. So I prefer to run them as separate simultaneous jobs when testing.

  • The “recordcount” parameter also implicitly sets the maximum id of the data to be tested. When testing MongoDB, a small value means only a few data files are mapped into memory during the transaction phase. So setting “recordcount” in the transaction phase is not the right way to test against only a small portion of the data.

3) Solaris-related stuff.

The Solaris version of MongoDB 2.2 has a large binary size compared to Linux. Although this barely affects performance, I don't like it. A quick check of its build info shows “GCC 4.4 on snv_89 January 2008”, which is too old. This should be fixable by adding the GCC options "-fno-function-sections" and "-fno-data-sections".

When starting mongod on Solaris, a warning message appears: “your operating system version does not support the method that MongoDB uses to detect impending page faults. This may result in slower performance for certain use cases”. After browsing the source code, I found that the processinfo support is simply missing. So I added the Solaris support; currently the functions that matter are ProcessInfo::blockInMemory() and ProcessInfo::blockCheckSupported().
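Solaris provides mincore(), which is enough to implement that page-residency check and is, as far as I can tell, what the Linux implementation uses as well. A minimal sketch of the idea, with my own function name; this is an illustration, not MongoDB's actual code:

#include <sys/types.h>
#include <sys/mman.h>
#include <stdint.h>
#include <unistd.h>

/* Return nonzero if the page containing addr is resident in memory. */
int block_in_memory(const void *addr)
{
    char vec;
    long pagesize = sysconf(_SC_PAGESIZE);
    /* Round the address down to its page boundary. */
    caddr_t page = (caddr_t)((uintptr_t)addr & ~((uintptr_t)pagesize - 1));
    if (mincore(page, (size_t)pagesize, &vec) != 0)
        return 1; /* on error, assume resident so callers don't yield */
    return vec & 0x1;
}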

The MongoDB source code says “madvise not supported on solaris yet”, which is funny: Solaris certainly supports madvise(). But madvise() is only useful when you understand your workload, so I don't think this particular madvise() call matters much.

ZFS and UFS.
==========
Since MongoDB uses mmap(), it leaves a lot of work to the OS file system. UFS is a traditional file system that uses the traditional page cache (cachelist) for caching file data. ZFS offers quite a lot of features beyond a plain file system and has its own ARC cache. Physical memory usage can be inspected with the mdb ::memstat command:

# echo "::memstat"|mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     293720              1147    7%
ZFS File Data               85347               333    2%
Anon                       138902               542    3%
Exec and libs                1638                 6    0%
Page cache                  27118               105    1%
Free (cachelist)          3036514             11861   73%
Free (freelist)            576129              2250   14%

Total                     4159368             16247
Physical                  4159367             16247


In my tests, ZFS showed very good performance during data loading. However, because ZFS has its own cache, data that is not yet mmapped is looked up first in the cachelist and then in the ZFS cache; if it's in neither, it is read from disk into the ARC and then mapped into the mongod process address space as page cache. Using ZFS therefore needs more memory, and when the whole data set cannot fit in physical memory, the cachelist and the ZFS cache fight over it. Tweaking ZFS parameters (manually setting the ARC size, adjusting the "primarycache" property, etc.) did not help in my tests. For read-intensive workloads, using an SSD as a second-level ARC (L2ARC) will help. In addition, depending on the workload and data characteristics, adjusting the ZFS recordsize or disabling ZFS prefetching may be worth a try.

An interesting madvise option for UFS is MADV_WILLNEED. With this option the system tries to pull all the data into memory (a quick warm-up), and during this period mongod cannot respond to clients. So if your whole data set fits into physical memory and you can tolerate a short period of unresponsiveness during startup, consider using it: it warms up fast and reaches peak performance quickly. A sketch of the idea follows.
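A minimal C sketch of that quick warm-up, assuming a single datafile; the path handling and function name are illustrative only:

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Map a datafile and ask the VM system to fault all of it in. */
int warm_file(const char *path)
{
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0 || fstat(fd, &st) != 0)
        return -1;
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;
    /* This is where startup stalls while pages are read in. */
    return madvise((caddr_t)p, st.st_size, MADV_WILLNEED);
}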

Wednesday, August 22, 2012

Build CouchDB 1.2 on Solaris 11 (SPARC)

I wanted to play a little with document-oriented databases. My first try was MongoDB; however, building MongoDB on SPARC is a horrible experience: MongoDB is written for x86/x64, and it uses scons as its build tool, which doesn't work with the IPS-based Solaris Studio 12.3 (I had to hack scons to make it work). In fact, MongoDB's source code never considers compilers other than VC and GCC. Therefore, I decided to try CouchDB. Although CouchDB has a lot of dependencies, they are friendly to porting.

1)  ICU (International Components for Unicode)
This is easy on Solaris 11:
# pkg install developer/icu (this will also cause library/icu to be installed)
Note: the ICU 4.6 library is built with Sun CC, so C++ code that links to it should also be built with Sun CC.


2) Build Erlang
The Solaris 11 11/11 IPS repository provides Erlang 5.6.5, which is too old: CouchDB 1.1.x or later requires Erlang >= 5.7.3. The Erlang website says they do daily builds for Solaris SPARC (maybe because Ericsson was a Sun shop), so why don't they publish binary packages for Solaris? Fortunately, the build procedure is not that difficult:

- Grab the source code from the Erlang website. I chose R14B04, whose version is 5.8.5.

- ./configure --prefix=/opt/local  (I used the default gcc-45 from the Solaris 11 IPS repository, since Erlang is said to use some GCC features, label variables perhaps.)
Ignore configure warnings unless configure fails.

- Fix erts/emulator/drivers/common/inet_drv.c:
ifreq.ifr_hwaddr.sa_data causes a compile error because ifreq does not have that member. Since Solaris 11 defines both SIOCGIFHWADDR and SIOCGENADDR, we only need SIOCGENADDR:
//#ifdef SIOCGIFHWADDR
//          if (ioctl(desc->s, SIOCGIFHWADDR, (char *)&ifreq) < 0)
//              break;
//          buf_check(sptr, s_end, 1+2+IFHWADDRLEN);
//          *sptr++ = INET_IFOPT_HWADDR;
//          put_int16(IFHWADDRLEN, sptr); sptr += 2;
//          /* raw memcpy (fix include autoconf later) */
//          sys_memcpy(sptr, (char*)(&ifreq.ifr_hwaddr.sa_data), IFHWADDRLEN);
//          sptr += IFHWADDRLEN;
//#elif defined(SIOCGENADDR)
#ifdef SIOCGENADDR


- After gmake install, prepend /opt/local/bin to PATH.


3) Build SpiderMonkey 1.8.5
- Download source code from here.
- It's interesting that Mozilla's project needs autoconf-2.13, so I had to build this old version.
Get the source from the GNU website, then:
./configure --prefix=/usr/local --program-suffix=-2.13

- Run autoconf in SpiderMonkey's source tree to generate the configure script:
 cd  js-1.8.5/js/src
 /usr/local/bin/autoconf-2.13


- This time I used Sun CC from Solaris Studio 12.3, which is freely available.
   CC=cc CXX=CC ./configure

- gmake; sudo gmake install  (in /usr/local)


4) Build CouchDB 1.2
Before proceeding, I needed to fix a few things:

 - apache-couchdb-1.2.0/src/couchdb/priv/Makefile.in:
    replace "-Wall -Werror" with "-v"  because I use Sun CC.

 - cd /usr/local/lib; ln -s libmozjs185.so.1.0 libmozjs185-1.0.so
(because configure.ac always chooses libmozjs185-1.0 instead of libmozjs185, due to the existence of libmozjs185-1.0.a)

- apache-couchdb-1.2.0/src/snappy/google-snappy/snappy-stubs-internal.h:
Solaris does not have byteswap.h, so I replaced it with sys/byteorder.h and defined a few macros:
...
#else
//#include <byteswap.h>
#include <sys/byteorder.h>
#define bswap_16(x) BSWAP_16(x)
#define bswap_32(x) BSWAP_32(x)
#define bswap_64(x) BSWAP_64(x)
...


- configure
 ./configure CC=cc CXX=CC LDFLAGS="-R /usr/local/lib" --with-erlang=/opt/local/lib/erlang/usr/include --with-js-lib=/usr/local/lib/ --with-js-include=/usr/local/include/js/ --prefix=/opt/couchdb1.2

- gmake

- Run test:
  $ export PATH=/usr/perl5/5.12/bin:$PATH
  $ gmake check   
Wait a while; all tests should pass.

- sudo gmake install

- create a script couchdb.sh in /opt/couchdb1.2/bin (since I don't want the couchdb output to end up in an arbitrary place):
#!/usr/bin/bash
BIN_DIR=$(dirname $0)
cd $BIN_DIR
./couchdb -o ../var/log/couchdb.stdout -e ../var/log/couchdb.stderr ${1+"$@"}

- modify /opt/couchdb1.2/etc/couchdb/default.ini, changing bind_address so couchdb is accessible from anywhere:
  bind_address  = 0.0.0.0

- start couchdb: /opt/couchdb1.2/bin/couchdb.sh -b

Now, enjoy CouchDB!

UPDATE (Sept 2012):

 - I also tested Erlang R15B01 and the latest R15B02; erts/emulator/drivers/common/inet_drv.c has already been fixed there. But to build couchdb, we need to add something to /opt/local/lib/erlang/usr/include/erl_driver.h:
#if defined(__sun)
#include <unistd.h>    /* for ssize_t */
#endif

- Running the couchdb test suite in a browser (Erlang R14B04 and R15B01) caused the Erlang process to crash. I posted the core dump analysis and a temporary workaround to the erlang mailing list. R15B02 doesn't have this problem, but you should still be careful when an Erlang application on Solaris reloads crypto.so; see the discussions here. One solution might be adding "-z nodelete" to LDFLAGS in lib/crypto/c_src/sparc-sun-solaris2.11/Makefile.

Wednesday, June 27, 2012

root my phone

Compared to PC users, smartphone users have far less freedom. I bought an Android 4.0.3 phone recently. It came pre-installed with many apps that I don't like and could not delete; in addition, I could not install apps from Google Play because this phone is sold in the China market. To solve these problems, I had to get root permission on the phone.

After a weekend of study, I realized that the common approach is to flash the phone with a third-party ROM. I don't like that approach. With further study, I came across a Linux security bug by chance. This bug also affected Android, and a hacker had already exploited it there. This was really good news for me. However, it was not easy to figure out the offsets for my phone: I know nothing about ARM assembly; run-as is statically linked, stripped, and its symbols are obfuscated, making the binary hard to understand; and although I installed binutils-arm-linux-gnueabi on my Linux desktop, arm-linux-gnueabi-objdump did not give me useful information.

In the end, I found two resources that helped me a lot:
- the android run-as source code
- the IDA Disassembler 6.2 demo for Linux (I should thank this great tool)
These two resources helped me understand the assembly code of run-as. In addition, analyzing the run-as binary of the Transformer Prime 4.0.3 showed me how to find the offsets for my phone, because the Transformer Prime's offsets are already known.

The remaining steps are simple:
- Download and install the Android SDK.
- run "android" from the command line, and add platform-tools to use adb.
- setup udev (for Linux), following this guide. After making changes, run /etc/init.d/udev restart.
- Enable debugging mode and disable fastboot on the phone, then connect to the phone using "adb shell" from Linux.
- push "mempodroid" to /data/local/tmp on the phone.
- run the magic "mempodroid" using the offsets I figured out, and become root!
- remount /system of the phone in read-write mode:
  mount -o remount,rw /system
- delete unwanted apps in /system/app and /system/delapp; be careful while root! (Before doing this, I had removed as many apps as possible from the phone's app manager.) For safety, I backed them up to the sdcard using "cat" in the adb shell.
- To make Google Play work, I copied the following apks to /system/app:
  GoogleLoginService.apk
  GoogleServicesFramework.apk
  OneTimeInitializer.apk
  Vending.apk
  These apks can be acquired from the cyanogenmod website.
- Type "reboot" from adb shell.

Things are much better now. I prefer "temporary root"; it's safer.