I fully agree with this statement from MongoDB:
“MongoDB does not publish any official benchmarks. We recommend running application performance tests on your application's workload to find bottlenecks and for performance tuning.”
However, I don't have a real-world workload, so I just ran some micro-benchmarks to observe the behavior of MongoDB and the OS. Although the resulting numbers mean little on their own, I would like to share some findings here.
1) Using JS Benchmark Harness
MongoDB provides the JS Benchmarking Harness as a QA baseline performance measurement tool; it is not designed to be a "benchmark". It is a good starting point for a first look at MongoDB performance, and the harness is very easy to set up. However, there are a few things to be considered.
The sample code on that web page is really just a micro-benchmark. I tested it against MongoDB 2.2 for Solaris x64 and got suboptimal results compared with the Linux version. After analyzing the workload characteristics, it looks more like a multi-threaded malloc and small TCP/IP packet ping-pong test.
By setting LD_PRELOAD_64=libmtmalloc.so when starting mongod, I got Solaris performance on par with Linux. If the test client and server are on separate systems, I may also need to disable the Nagle algorithm:
$ sudo ndd -set /dev/tcp tcp_naglim_def 1
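As a side note, Nagle can also be disabled per socket from application code with TCP_NODELAY instead of the system-wide ndd setting. Below is a minimal sketch of my own (not MongoDB code) just to illustrate what the setting controls:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <stdio.h>

/* Disable Nagle's algorithm on one socket: small request/response
 * packets are sent immediately instead of being coalesced, which is
 * exactly what a ping-pong style benchmark wants. */
int disable_nagle(int sockfd) {
    int one = 1;
    if (setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) != 0) {
        perror("setsockopt(TCP_NODELAY)");
        return -1;
    }
    return 0;
}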
The harness also has an interesting feature: RAND_INT [ min, max, multiplier ]. It looks like we can touch only a fixed fraction of the data during the test. Two things need to be considered here:
- Looking at the current harness implementation, RAND_INT is translated to rand(), which is not really random for big (millions of records) data sets. The fix is to use lrand48() instead; see the sketch after this list.
- MongoDB uses mmap to cache data. Like many other databases, this is still a page-level cache rather than a row-level cache. So if your record size is small, RAND_INT [ 1, 10000000, 10 ] does not make you touch only 1/10 of the data; rather, it makes you touch all of the data.
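To illustrate the rand() point, here is a small stand-alone demo (my own illustration, not harness code): on a libc where RAND_MAX is only 32767, as on Solaris, rand() % N can never reach most of a ten-million-record id space, while lrand48() returns 31-bit values and covers it.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long records = 10000000;      /* e.g. RAND_INT [ 1, 10000000 ] */
    long max_rand = 0, max_lrand48 = 0;

    srand(1);
    srand48(1);
    for (int i = 0; i < 1000000; i++) {
        long a = rand() % records;      /* capped by RAND_MAX */
        long b = lrand48() % records;   /* spans the whole id range */
        if (a > max_rand) max_rand = a;
        if (b > max_lrand48) max_lrand48 = b;
    }
    printf("RAND_MAX = %d\n", RAND_MAX);
    printf("largest id from rand():    %ld\n", max_rand);
    printf("largest id from lrand48(): %ld\n", max_lrand48);
    return 0;
}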
2) Using YCSB.
YCSB is an extensible load-testing tool, but its test code for MongoDB is a little outdated. I needed to modify it a bit to add more writeConcern types.
YCSB's testing driver has some limitations:
- You can set the read/write proportion, but reads and writes run in the same thread context, which means writes can block reads. So I prefer to put them in separate, simultaneous jobs when testing.
- The “recordcount” parameter also implicitly sets the max id of the data to be tested. When testing MongoDB, a small value means only a few data files are mapped into memory during the transaction phase, so setting “recordcount” in the transaction phase is not the right way to test against only a small portion of the data.
3) Solaris related stuff.
The Solaris build of MongoDB 2.2 has a large binary size compared to the Linux build. Although this hardly affects performance, I don't like it. A quick check of its build info shows “GCC 4.4 on snv_89 January 2008”, which is too old. The size issue should be fixed by adding the GCC options "-fno-function-sections" and "-fno-data-sections".
When starting mongod on Solaris, a warning message shows: “your operating system version does not support the method that MongoDB uses to detect impending page faults. This may result in slower performance for certain use cases”. After browsing the source code, I found that the processinfo support for Solaris is simply not there, so I added it. The functions that matter here are ProcessInfo::blockInMemory() and ProcessInfo::blockCheckSupported().
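As a rough idea of what such a residency check involves, here is a minimal sketch of my own, assuming a mincore()-style interface; it is not MongoDB's actual implementation:

#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>
#include <stdio.h>

/* Return non-zero if the page containing addr is resident in memory.
 * mincore() fills one status byte per page; bit 0 set means resident. */
static int block_in_memory(const void *addr) {
    long pagesize = sysconf(_SC_PAGESIZE);
    char *page = (char *)((uintptr_t)addr & ~((uintptr_t)pagesize - 1));
    char vec = 0;

    if (mincore(page, (size_t)pagesize, &vec) != 0) {
        perror("mincore");
        return 1;   /* on error, assume resident and just take the fault */
    }
    return vec & 1;
}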
The MongoDB source code says “madvise not supported on solaris yet”, which is funny: Solaris certainly supports madvise(). But madvise() is only useful when you understand your workload, so I don't think this piece of code that calls madvise() is important.
ZFS and UFS.
==========
Since mongodb uses mmap(), it leaves a lot of things to the OS file system. UFS is a traditional file system; it caches file data in the traditional page cache (the cachelist). ZFS has quite a lot of features beyond a plain file system, and it has its own ARC cache. Physical memory usage can be inspected with the mdb ::memstat command:
# echo "::memstat"|mdb -k Page Summary Pages MB %Tot ------------ ---------------- ---------------- ---- Kernel 293720 1147 7% ZFS File Data 85347 333 2% Anon 138902 542 3% Exec and libs 1638 6 0% Page cache 27118 105 1% Free (cachelist) 3036514 11861 73% Free (freelist) 576129 2250 14% Total 4159368 16247 Physical 4159367 16247
In my tests, ZFS gave very good performance during data loading. However, because ZFS has its own cache, data that is not yet mmapped is looked up first in the cachelist and then in the ZFS cache; if it is in neither, it is read from disk into the ARC cache and then mapped into the mongod process address space as page cache. Using ZFS therefore needs more memory, and when all the data cannot fit in physical memory, the cachelist and the ZFS cache fight over it. Tweaking ZFS parameters (manually setting the ARC cache size, adjusting the "primarycache" property, etc.) did not help in my tests. For read-intensive workloads, using an SSD as a second-level ARC cache will help. In addition, depending on the workload and data characteristics, adjusting the ZFS recordsize or disabling ZFS prefetching may be worth a try.
An interesting madvise option for UFS is MADV_WILLNEED. When this option is set, the system tries to pull all the data into memory (a quick warm-up), and during this period mongod cannot respond to clients. So if your whole data set fits in physical memory and you can tolerate a short period of unresponsiveness during startup, you can consider using it, because it warms up fast and reaches peak performance quickly.
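For the curious, here is a minimal sketch of that kind of warm-up: mmap a data file and ask the kernel to fault it in with madvise(MADV_WILLNEED). The file path is only an example, and this is my own illustration of the idea rather than anything mongod does for you.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    const char *path = "/data/db/test.0";    /* example data file path */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Ask the kernel to read the whole mapping ahead of time; the
     * process may look unresponsive while the warm-up is running. */
    if (madvise(p, (size_t)st.st_size, MADV_WILLNEED) != 0)
        perror("madvise");

    /* ... serve reads from the now-warm mapping ... */
    munmap(p, (size_t)st.st_size);
    close(fd);
    return 0;
}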