Mar 102015

This post took a while to write – when I started, it was my last day in MongoDB and finishing it I am over a week into my new role in Riot Games.  It was a bittersweet moment – I was very excited to move on to my new adventure in Riot Games, but I was sad to leave a company that has been such a great place to work for the last three years.  I decided to write this post partially as a record of what I enjoyed most, partly as a list of some of my accomplishments – memory fades and I have very much enjoyed reading similar posts from several years ago.  I often wish I had written more in fact.

So, here it is, some random reflection on three great years, in no particular order.  Due to the length, I’ll insert a jump, so click through if you would like to read on.

Continue reading »

 Posted by at 6:06 pm
Mar 022015

I was asked to do an “About Me” introduction for my new gig, and wasn’t sure how to mention the fact that I had recently become a parent for the first time.  Given the fact that I now work for a games company, I decided to have a bit of fun with it and translate it into more gamer-friendly format. I think it turned out pretty well:

I’ve just started playing a new game called Parenting, it’s in early beta, only at version 0.5 (first released in August 2014). It’s a co-op game with solo parts, usually much easier with 2 players (so I usually team up with my wife) but you can occasionally use other people to form larger parties for short periods of time. It’s a massive time sink and the learning curve is brutal, but very rewarding and not a little addictive.

The opposing AI in Parenting is developing rapidly and new strategies are required constantly to combat the adaptation and avoiding meltdown. Right now, the game requires constant attention, even at night, it’s expensive and there are no save points – it’s a little scary. The progression is not slowing down either, and looks to be open ended, will probably take decades to complete but we will persevere 🙂

I’m sure my daughter will appreciate me posting it on the internet for posterity.

 Posted by at 11:39 pm
Feb 122015

This will likely be my last post about MongoDB 3.0 performance, as I am leaving MongoDB (after 3 fantastic years) to join Riot Games – more about that in later posts. Before I go, some testing I was doing to verify WiredTiger performance with the journal disabled led to some eye-catching numbers.  While performing a restore using 3.0 mongorestore to a 3.0 mongod with 4 collections, I noticed I was pushing close to 300,000 inserts/sec.

NOTE: This testing would eventually have hit a bottleneck in terms of flushing data out to disk (I had relatively small, low IOPS EBS volumes attached), and I would expect the sustained insert rate to drop significantly after that kicked in.  Also, running without a journal is not generally a good idea for a production system.  This testing was more about showing off the capabilities of the new tools, and how quickly you can use mongorestore to get data into an instance now, rather than a commentary on mongod performance – there will be far more rigorous testing than mine done to show the sustainable, production capabilities there in the future.

For anyone used to the old mongorestore (single threaded, not terribly fast), this is an amazing improvement.  The new version, which has been completely rewritten in Go along with the rest of the tools for 3.0, allows you to restore multiple collections in parallel.  It defaults to four collections in parallel (if available), so I was curious as to what I could do to push it higher.

I have no doubt that there will be plenty of impressive testing and performance numbers to accompany the MongoDB 3.0 launch, but I doubt the revamped tools will get much attention in the initial announcements.  Given the improvement I noticed, I decided to give them some lone and once I had completed the journal testing, I went back to see how far I could push things with the now apparently not-so-humble mongorestore.

It turned out to be not very far, but the limiting factor was the CPU on my (3 year old) Macbook Pro which I was using as the client in the testing – the server with the mongod running was not even breathing hard (<50% CPU, <50% IO). It also looked like I might be getting close to the limits of what my network set up would allow (not quite full gigabit ethernet), but it was not yet the limiting factor.

Therefore, I went in search of more CPU and more network capacity, and settled on using c4.8xl instance in EC2.  For anyone not familiar, these are compute optimized hosts with the following stats:

  • 36 vCPUs
  • 60GB RAM
  • 10Gb Networking

In other words, plenty of horsepower and bandwidth for my test (my local test server has 32GB RAM, Intel SSD, 1Gb Networking so I knew the server side mongod would have no issues).  This also has the benefit of giving a more standard and reproducible test environment if anyone wants to try recreating the numbers (versus my laptop and desktop boxes).  I used 2 C4.8xl instances in the same availability zone running Amazon Linux for my testing:

  1. Client using mongorestore 3.0.0-rc8, with 20 collections to be restored in parallel (totaling 38GB of data, just over 200,000,000 documents)
  2. Server running mongod 3.0.0-rc8, with WiredTiger engine, journal disabled, no other load

The result was pretty impressive, over 2 runs I saw spikes of inserts touching the 1.5 million inserts/sec mark.  While not quite that high, the overall numbers for the full restores were still very respectable:

Run 1: 154.32 seconds, 204,800,000 docs = 1,327,112 inserts/sec

Run 2: 160.30 seconds, 204,800,000 docs = 1,277,604 inserts/sec

Average: 1,302,358 inserts/sec

As well as the eye-catching insert rate, the fact that I was able to load 38GB of data into the server in less than 3 minutes is pretty decent too.  I suspect I could push this even higher with a bit of tweaking for burst imports, but I will have to leave that to others, I have reports and such to finish before I move on to my new adventures (plus, those big instances are expensive) 🙂

For anyone wondering about the documents used, they were generated by a variant of this gist which I have used before in other testing, and the only index present on the collections was the default _id index.

Feb 052015

I re-ran my compression testing from November to show the on-disk savings that compression in WiredTiger allows in my last post.  The other part of that testing was to see if we could show a benefit to having compressed data from a read perspective.  As I explained back in November, one of the scenarios where compression can help is when there is only limited IO capacity available.

Once again, to simulate that, I used a relatively slow external USB 3.0 drive to simulate reasonable IO constraints and then recorded how long it took to read my entire data set from the on-disk testing into memory from that disk. The results show that benefit, though the results in WiredTiger are somewhat hampered by the issue described in SERVER-16444. I may attempt to re-run these tests with a more random, index driven read load – as well as being more representative of a normal workload, it should also show greater benefits by avoiding the table scan related overheads.

In any case, as you can see in the graph below, the benefits are clear with a 23.8% and 34.6% reduction in the time taken to read in the compressed data using snappy and zlib respectively (over the non-compressed data):

Time taken to load data from disk compressed versus uncompressed

Time taken to load data from disk compressed versus uncompressed

Once again, I think this reinforces the choice of snappy as a good default for the WiredTiger Storage engine in MongoDB.

Feb 042015

Back in November, with the newly minted RC0 version of MongoDB 3.0 (then 2.8 of course), I ran a series of tests to check out how well compression using the new WiredTiger storage engine worked with a ~16GB data sample.  Compression on disk worked fine, but the follow up tests to show the benefits of using compression when limited IO is available ran into issues and I ended up having to run the tests again with an updated version (and the appropriate bug fix).

I’ve decided to revisit the compression testing 3 months on – we are now on release candidate 7 and heading for a GA release in March.  There have been some new settings exposed to tweak compression as well as a whole host of bug fixes and improvements.  I will start with just the straight compression testing and leave the restricted IO piece for later.

My testing was done as follows:

  1. Create a ~16GB data set using this gist (on MMAPv1, but engine does not really matter)
  2. Dump out the data using mongodump
  3. Restore that same data into WiredTiger backed MongoDB with various compression options
  4. Create a pretty graph to illustrate/compare the resulting disk usage

The config files used for the processes can be found here and all testing was done on an Ubuntu 14.10 host with an ext4 filesystem.  The only real difference between the tests this time and last time is the use of mongodump/restore to ensure that all data is the same (previously I had regenerated data each time, so there was probably some variance).

The graph speaks for itself, so here it is:


On-Disk Usage with various storage engines and compression options

On-Disk Usage with various storage engines and compression options

I think that it nicely illustrates the savings on-disk that compression in WiredTiger offers – the savings versus a non-compressed WiredTiger configuration are impressive enough – taking up just 24% of the space, but when compared to the MMAPv1 they are huge, taking up just 16% of the space.  It also shows that the defaults for WiredTiger with MongoDB (represented by the WT/Snappy bar) are a good middle ground at just over 50% of the non-compressed WT total and 34.7% of the MMAPv1 usage.

It should be noted that the trade off for using zlib rather than snappy is significant, even with just a mongorestore happening.  The insert rate with snappy was similar to that with no compression and hovered between 65,000/sec and 67000/sec.  With zlib being used rather than snappy, that dropped to between 22,000/sec and 25,000/sec.  Now, insert performance was not the target for this testing, so I only mention it as an aside – the first thing I would do to see a proper comparison would be to increase the number of collections being restored in parallel to 4, rather than 1, and then looks to see what else I could tweak to improve things.

Hence, if anyone out there is concerned about disk usage with MongoDB, then I would recommend starting to test 3.0 with WiredTiger as soon as possible and see if it suits your use case.

Update: I re-ran the read based tests on limited IO also, see this post for details.


Nov 172014

My (rather popular) first post on this topic explained the benefits of compression (which comes as the default option with the new WiredTiger storage engine) for systems with lesser IO capabilities.  The intent was to first show that the new storage engine saved space on disk and then to show that this could be translated into a gain in terms of performance when reading that data (slowly) off disk.

The first part of that story worked out pretty well, the data was nicely compressed on disk and it was easy to show it in the graph.  The second part of that story did not work out as expected, the graph was a little off from expectations and my initial speculation that it was a non-optimal access pattern didn’t pan out.  In fact, I determined that the slowness I was seeing was independent of IO and was due to how slow the in-memory piece was when using WiredTiger to do a table scan.  Needless to say, I started to talk to engineering about the issue and tried tweaking various options – each one essentially reinforced the original finding.

It was soon obvious that we had a bug that needed to be addressed (one that was still present in the first release candidate 2.8.0-rc0). I gathered the relevant data and opened SERVER-16150 to investigate the problem. Thanks to the ever excellent folks in MongoDB engineering (this one in particular), we soon had the first patched build attempting to address the issue (more, with graphs after the jump).  Before that, for anyone looking to reproduce this testing, I would recommend waiting until SERVER-16150 has been closed and integrated into the next release candidate (2.8.0-rc1), you won’t see the same results from 2.8.0-rc0 (it will instead look like the first set of results).

Continue reading »

Nov 122014

CAVEAT: This post deals with a development version of MongoDB and represents very early testing. The version used was not even a release candidate – 2.7.9-pre to be specific, this is not even a release candidate.  Therefore any and all details may change significantly before the release of 2.8, so please be aware that nothing is yet finalized and as always, read the release notes once 2.8.0 ships.

Update (Nov 17th, 2014): Good news! I have re-tested with a patched version of 2.8.0-rc0 and the results are very encouraging compared to figure 2 below.  For full details (including an updated graph), see MongoDB 2.8: Improving WiredTiger Performance

Anyone that follows the keynotes from recent MongoDB events will know that we have demonstrated the concurrency performance improvements coming in version 2.8 several times now.  This is certainly the headline performance improvement for MongoDB 2.8, with concurrency constraints in prior versions leading to complex database/collection layouts, complex deployments and more to work around the per-database locking limitations.

However, the introduction of the new WiredTiger storage engine that was announced at MongoDB London also adds another capability with a performance component that has long been requested: compression

Eliot also gave a talk about the new storage engines at MongoDB London last week after announcing the availability of WiredTiger in the keynote.  Prior to that we were talking about what would be a good way to structure that talk and I suggested showing the effects and benefits of compression. Unfortunately there wasn’t enough time to put something meaningful together on the day, but the idea stuck with me and I have put that information together for this blog post instead.

It’s not a short post, and it has graphs, so I’ve put the rest after the jump.

Continue reading »

Jun 202014

Setting readahead (RA from now on) appropriately is a contentious subject.  There are a lot of variables involved, but in my particular case I am setting out to minimize those variables, get a baseline, and have a reasonable idea of what to expect out of this configuration:

  • Environment: Amazon EC2
  • Instance Size: m3.xlarge (4 vCPU, 15GiB RAM)
  • Disk Config: Single EBS Volume, 1000 PIOPS

The testing I am going to be doing is pretty detailed, and intended for use in a future whitepaper, so I wanted to get some prep done and figure out exactly what I was dealing with here before I moved forward.  The initial testing (which is somewhat unusual for MongoDB) involves a lot of sequential IO.  Normally, I am tweaking an instance for random IO and optimizing for memory utilization efficiency – a very different beast which generally means low RA settings.  For this testing, I figured I would start with my usual config (and the one I was using on a beefy local server) and do some tweaking to see what the impact was.

I was surprised to find a huge cliff in terms of operations per second hitting the volume when I dropped RA to 16.  I expected the larger readahead settings to help up to a certain point because of the sequential IO (probably up to the point that I saturate the bandwidth to the EBS volume or similar).  But I did not expect the “cliff” between RA settings of 32, and RA settings of 16.

To elaborate: one of the things I was keeping an eye on was the page faulting rate within MongoDB.  MongoDB only reports “hard” page faults, where the data is actually fetched off the disk.  Since I was wiping out the system caches between caching runs, all of the data I was reading had to come from the disk, so the fault rate should be pretty predictable, and IO was going to be my limiting factor.

With the RA settings at 32, my tests were taking longer than 64, 64 took longer than 128 etc. until the results for 256, 512 were close enough to make no difference and RA was not really a factor anymore.  At 32, the faulting rate was relatively normal – somewhere around 20 faults/sec at peak and well within the capacity of the PIOPS volume to satisfy, this was a little higher than the 64 RA fault rate which ran at ~15 faults/sec.  I was basically just keeping an eye on it, it did not seem to be playing too big a part.

With an RA of 16 though, things slowed down dramatically.  The faults spike to over 1000 faults/sec and stay there.  That’s a ~50x increase over the 32 RA setting and basically is pegging the max PIOPS I have on that volume.  Needless to say the test takes a **lot** longer to run with the IO pegged.  To show this graphically, here are the run completion times with the differing RA settings (click for larger view):

mongodump test runs, using various readahead settings

TL;DR I will be using RA settings of 128 for this testing, and will be very careful before dropping RA below 32 on EBS volumes in EC2 in future.

Update: A bit of digging revealed that the default size of an IO request on provisioned IOPS instances is 16K, which would mean that setting RA to 32 matches this well, whereas dropping below it by 50% is essentially a bad mismatch.  Here is the relevant paragraph (from the RDS documentation):

Provisioned IOPS works with an IO request size of 16KB. An IO request smaller than 16KB is handled as one IO; for example, 1000 8KB IO requests are treated the same as 1000 16KB requests.  IO requests larfer than 16KB consume more than one IO request; Provisioned IOPS consumption is a linear function of IO request size above 16KB. For example, a 24KB IO request consumes 1.5 OP requests of storage capacity; a 32KB IO request consumes 2 IO requests etc.

Hence, once you drop below RA of 32 (and you are doing small IO requests like a 4k page for example), every request is “wasting” the difference and consuming a full IOP – definitely something to bear in mind when running any memory mapped application on EC2.

Apr 152014

This tweet encourages people to read the timeline related to the Heartbleed discovery and dissemination and draw your own interesting conclusions – challenge accepted!

There is plenty of fodder in there for the conspiracy theorists, but taking a step back for a second I would draw a conclusion not based on who knew what, but rather how to be one of those entities that knew early.  Why take that approach?

Well, the companies that learned of this bug early (the timeline lists the ones that admit they did, there were likely others) had a significant advantage here.  They were able to take the necessary steps to protect their systems while the bug was largely unknown, they could evaluate the situation calmly and without their customers/shareholders/interested parties screaming for updates, exposure assessments, time lines for fixes and the like.

As an ex-operations guy, an ex-support guy and someone that’s had to deal with this stuff before, I would definitely like to be working for one of the companies with an early heads up here rather than the ones in the dark until the public announcement.

Hence, the question I would ask is this: How do I make sure that I am on the early notification list for such issues?

Now, that may seem way too broad a question, but let’s break it down another way:

  • What technologies are truly critical to my business?
  • How do I ensure that I am up to date immediately regarding significant developments?
  • How do I ensure I can prioritize features, get key issues for my business addressed?

Sometimes, with closed source technology, the answer is simple – you make sure you are an important customer for the owner of the technology, whether it is Microsoft, Oracle, Cisco or anyone else.  This might be a matter of paying them enough money, or it could be that you make sure you are a visible and valuable partner, provide a nice reference use case etc. – basically whatever it takes to make sure that you are at the top of their list for any vital information flow or feature decision.

What about open source software though?  How do you make sure you are similarly informed for the Apache HTTP web server, the HAProxy load balancer, OpenSSL, the Linux Kernel, or whatever OSS technology your business relies on?

Take a look at that timeline again, consider who learned about the issue early.  I haven’t crunched the numbers, but I’d bet good money that the companies that learned early (and were not part of discovery) either have a significant number of employees contributing to OpenSSL, or they have employees that know the main contributors well (and let’s face it, most of them will be contributing to other OSS projects – geeks and nerds gossip just like everyone else).

Therefore, the conclusion I draw from the Heartbleed timeline is this:

If I am using open source software that is critical to my business, I should be employing people that actively contribute to that software, that are known to the core developers, if not core developers themselves.  They will work for me, but devote a significant portion (perhaps all) of their time to the specific project(s) identified as critical.

There are many benefits to this – besides getting early notification of issues, you would have an expert on hand to answer those screaming for updates, to evaluate your exposure and perhaps even fix the issue internally before the public fix is available.  You also get a respected voice in terms of setting the direction of the project, have a way to prioritize key features and more.  Finally, you get the good will of the community, help make the product better for everyone, and become a possible destination for other smart contributors to work.

The key here is about actually committing resources.  It’s often amazing (to me) how quickly the commitment of actual resources will focus an otherwise overly broad discussion.  If you start by asking people to list all of the OSS technology that is critical to the business, you will likely end up with a massive list.  Now tell them that they are going to have to commit headcount, budget to support every piece of technology on the list (and justify it) – it will likely shrink rapidly.

Mar 282014

Note: I have also written this up in Q&A format over on StackOverflow for visibility.

When I am testing MongoDB, I often need to insert a bunch of data quickly into a collection so I can manipulate it, check performance, try out different indexes etc.  There’s nothing particularly complex about this data usually, so a simple for loop generally suffices.  Here is a basic example that inserts 100,000 docs:

for(var i = 0; i < 100000; i++){db.timecheck.insert({"_id" : i, "date" : new Date(), "otherID" : new ObjectId()})};

Generally, I would just copy and paste that into the mongo shell, and then go about using the data.  With 2.4 and below, this is pretty fast.  To test, I’ve simplified even more and kept it to a single field (_id) and added some very basic timing.  Here’s the result with the 2.4 shell:

> db.timecheck.drop();
> start = new Date(); for(var i = 0; i < 100000; i++){db.timecheck.insert({"_id" : i})}; end = new Date(); print(end - start);

A little over 2 seconds to insert 100,000 documents, not bad.  Now, let’s try the same thing with the 2.6.0-rc2 shell:

> db.timecheck.drop();
> start = new Date(); for(var i = 0; i < 100000; i++){db.timecheck.insert({"_id" : i})}; end = new Date(); print(end - start);

Oh dear – over 37 seconds to insert the same number of documents, that’s more than 15x slower!  You might be tempted to despair and think 2.6 performance is terrible, but in fact this is just a behavioral change in the shell (I will explain that shortly).  Just to make it clear that it’s not something weird caused by running things in a single line in the shell, let’s pass the same code in as a JavaScript snippet.  This time we’ll just use the time command to measure:

2.4 shell:

$ time mongo ~/mongo/insert100k.js --port 31100
MongoDB shell version: 2.4.6
connecting to:

real    0m2.253s
user    0m0.942s
sys    0m0.432s

2.6 shell:

$ time ./mongo ~/mongo/insert100k.js --port 31100
MongoDB shell version: 2.6.0-rc2
connecting to:

real    0m34.691s
user    0m22.203s
sys    0m2.272s

So, no real change, things are pretty slow with a 2.6 shell.  It should be noted that I ran both against a 2.6 mongod, only the shells I am using are different.  So, of course, you can work around it by using the 2.4 shell to connect to 2.6 but that is not exactly future proof.

(UPDATE: if anyone saw my original post, I had screwed up and run a 2.4 shell thanks to a PATH mix up, there is no difference between passing in the file and an interactive loop).

To explain: before 2.4 the interactive shell would run through the loop and only check the success (using getLastError) of the last operation in the loop (more specifically, it called getLastError after each carriage return, with the last operation being the last insert in the loop).  With 2.6, the shell will now check on the status of each individual operation within the loop.  Essentially that means that the “slowness” with 2.6 can be attributed to acknowledged versus unacknowledged write performance rather than an actual issue.

Acknowledged writes have been the default for some time now, and so I think the behavior in the 2.6 is more correct, though a little inconvenient for those of us used to the original behavior.  We have a workaround with 2.4 but ideally we want to use the latest shell with the latest server, so the question remains, how do I do a simple bulk insert from the 2.6 shell quickly if I truly don’t care about failures?

The answer is to use the new unordered bulk insert API:

> db.timecheck.drop();
> var bulk = db.timecheck.initializeUnorderedBulkOp(); start = new Date(); for(var i = 0; i < 100000; i++){bulk.insert({"_id" : i})}; bulk.execute({w:1}); end = new Date(); print(end - start);

Success!  And essentially the same performance at just over 2 seconds. Sure, it’s a little more bulky (pardon the pun), but you know exactly what you are getting, which I think is a good thing in general. There is also an upside here, when you are not looking for timing information. Let’s get rid of that and run the insert again:

> db.timecheck.drop();
> var bulk = db.timecheck.initializeUnorderedBulkOp(); for(var i = 0; i < 100000; i++){bulk.insert({"_id" : i})}; bulk.execute({w:1});
"writeErrors" : [ ],
"writeConcernErrors" : [ ],
"nInserted" : 100000,
"nUpserted" : 0,
"nMatched" : 0,
"nModified" : 0,
"nRemoved" : 0,
"upserted" : [ ]

Now we get a nice result document when we do the bulk insert. Because it is an unordered bulk operation, it will continue should it encounter an error and report on each one in this document. There are none to be seen here, but it’s easy to create a failure scenario, let’s just pre-insert a value we know will come up and hence cause a duplicate key error on the (default) unique _id index:

> db.timecheck.drop();
> db.timecheck.insert({_id : 500})
WriteResult({ "nInserted" : 1 })
> var bulk = db.timecheck.initializeUnorderedBulkOp(); for(var i = 0; i < 100000; i++){bulk.insert({"_id" : i})}; bulk.execute({w:1});
2014-03-28T16:19:40.923+0000 BulkWriteError({
"writeErrors" : [
"index" : 500,
"code" : 11000,
"errmsg" : "insertDocument :: caused by :: 11000 E11000 duplicate key error index: test.timecheck.$_id_ dup key: { : 500.0 }",
"op" : {
"_id" : 500
"writeConcernErrors" : [ ],
"nInserted" : 99999,
"nUpserted" : 0,
"nMatched" : 0,
"nModified" : 0,
"nRemoved" : 0,
"upserted" : [ ]

Now we can see how many were successful, which one failed (and why). It may be a little more complicated to set up, but overall I think we can call this a win.