Monday, October 29, 2012

Micro Benchmark MongoDB 2.2 performance on Solaris

I agree 100% with this statement from MongoDB:
“MongoDB does not publish any official benchmarks. We recommend running application performance tests on your application's work-load to find bottleneck and for performance tuning.”
However, I don't have a real-world workload, so I tried some micro benchmarks to observe the behavior of MongoDB and the OS. Although the resulting numbers mean little by themselves, I'd like to share some findings here.

1) Using JS Benchmark Harness
MongoDB provides the JS Benchmarking Harness as a QA baseline performance measurement tool; it is not designed to be a "benchmark". It is a good starting point for a first look at MongoDB performance, and the harness is very easy to set up. However, there are a few things to be considered.

The sample code on that web page is truly a micro benchmark. I tested it against MongoDB 2.2 for Solaris x64 and got suboptimal results compared with the Linux version. After analyzing the workload characteristics, it looks more like a multi-threaded malloc test combined with small TCP/IP packet ping-pong.

By passing an extra option when starting mongod, I got Solaris performance on par with Linux. If the test client and server are on separate systems, I may also need to disable the Nagle algorithm: $ sudo ndd -set /dev/tcp tcp_naglim_def 1

The harness also has an interesting feature: RAND_INT [ min , max , multiplier ]. It looks like it lets us touch only a fixed fraction of the data during testing. Two things need to be considered here:
  1. Looking at the current harness implementation, RAND_INT is translated to rand(), which is not really random for big (millions of records) data sets. The fix is to use lrand48() instead.
  2. MongoDB uses mmap to cache data; like many other databases, this is still a page-level cache rather than a row-level cache. So if your record size is small, RAND_INT [ 1, 10000000, 10 ] doesn't make you touch only 1/10 of the data; rather, it makes you touch all of it.

2) Using YCSB.
YCSB is an extensible load testing tool, but its test code for MongoDB is a little outdated; I had to modify it a bit to add more writeConcern types.
YCSB's testing driver has some limitations:
  • You can set the read/write proportion, but reads and writes run in the same thread context, which means writes can block reads. So I prefer to run them as separate simultaneous jobs when testing.

  • The “recordcount” parameter also implicitly sets the maximum id of the data to be tested. When testing MongoDB, a small number means only a few data files are mapped into memory during the transaction phase, so setting “recordcount” in the transaction phase is not the right way to test against only a small portion of the data.

3) Solaris related stuff.

The Solaris version of MongoDB 2.2 has a large binary size compared to the Linux version. Although this has almost no effect on performance, I don't like it. A quick check of its build info shows “GCC 4.4 on snv_89 January 2008”, which is too old. This should be fixed by adding the GCC options "-fno-function-sections" and "-fno-data-sections".

When starting mongod on Solaris, a warning message shows: “your operating system version does not support the method that MongoDB uses to detect impending page faults. This may result in slower performance for certain use cases”. After browsing the source code, I found that the processinfo support for Solaris is missing, so I added it; currently the functions that count are ProcessInfo::blockInMemory() and ProcessInfo::blockCheckSupported().

The MongoDB source code says “madvise not supported on solaris yet”, which is funny: Solaris certainly supports madvise. But madvise() is only useful when you understand your workload, so I don't think this particular madvise() call is important.

ZFS and UFS.
Since mongodb uses mmap(), it leaves a lot to the OS file system. UFS is a traditional file system that caches file data in the traditional page cache (the cachelist). ZFS has quite a few features beyond those of a traditional file system, including its own ARC cache. Physical memory usage can be inspected with the mdb ::memstat command:

# echo "::memstat"|mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel                     293720              1147    7%
ZFS File Data               85347               333    2%
Anon                       138902               542    3%
Exec and libs                1638                 6    0%
Page cache                  27118               105    1%
Free (cachelist)          3036514             11861   73%
Free (freelist)            576129              2250   14%

Total                     4159368             16247
Physical                  4159367             16247

In my tests, ZFS delivers very good data-loading performance. However, because ZFS has its own cache, if the data is not mmapped it is searched first in the cachelist and then in the ZFS ARC; if it is not there, it is read from disk into the ARC and then mapped into the mongod process address space as page cache. Using ZFS therefore needs more memory, and when all the data cannot fit in physical memory there is a fight for memory between the cachelist and the ARC. Tweaking ZFS parameters (manually setting the ARC size, adjusting the "primarycache" property, etc.) did not help in my tests. For read-intensive workloads, using an SSD as a second-level ARC cache will help. In addition, depending on the workload and data characteristics, adjusting the ZFS recordsize or disabling ZFS prefetching may be worth a try.

An interesting madvise option for UFS is MADV_WILLNEED. When this option is set, the system tries to pull all the data into memory (a quick warm-up), and during this period mongod cannot respond to clients. So if your whole dataset fits in physical memory and you can stand a short period of unresponsiveness during startup, consider using it: it warms up fast and reaches peak performance quickly.