MongoDB: A first look
The entire subject of two talks and mentioned in several others, MongoDB was
definitely a buzzword at TekX this year. It's long been a favorite in the tech
community in Lawrence and has been used for some data crunching on a few
projects at the local paper. Even with all of that exposure, I had yet to sit
down and actually explore it.
That changed Friday afternoon while I sat at O'Hare waiting on my flight back
to Lawrence (which subsequently got canceled). I had installed Mongo earlier in
the week and opened up a bunch of tabs on the various intros and tutorials
available on the Mongo wiki. The rest of this article is a mix of
stream-of-consciousness notes from my first time playing around with Mongo and
some of my reflections from this past week.
Note on typefaces
I use both Mongo and mongo throughout this article. The first, title-case
Mongo, refers to the software as a whole. Whenever you see mongo with a
lowercase m and in monospace, it's referring to the Mongo client program
you run from the command line.
Installation
On a Mac, it's a breeze. I use Homebrew to manage software on my Mac, so a
quick brew install mongodb was all I needed and a minute later I was ready to
go.
Starting Up the Server
Mongo is run by the mongod process. I don't know if it's pronounced
mongo-d or mon-god though. It's a fun play on words if the latter is the
case.
Homebrew includes a basic configuration to get up and running, so I use that
inside a screen session, which lets me leave mongod running in the background
while I use the mongo tool to interact with it.
Interacting with Mongo
I started out with the basic tutorial to get going. It looks like that
needs some love, though. It shows the version in the startup output as 0.9.8,
Homebrew ships 1.4.2, and I did find a few things that were out of date.
No, I haven't been a good open source community member and submitted fixes
yet.
The first thing that's different from a traditional RDBMS is that with Mongo
you don't have to explicitly create a database. It's pretty straightforward:
from within mongo, type use <database>. This creates a brand new database for
you and you're off. For the examples below, I'm using use mydb to select
mydb as my database.
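In the shell, that looks something like this:
> use mydb
switched to db mydb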
It's kind of nice to just be able to connect and go, but it feels odd. Not
good or bad, just odd. Sort of like the first time you run git checkout
inside a repository to switch branches when you're used to Subversion.
The shell feels like a JavaScript console. I don't have access to the source
code in my offline state, so I can't confirm that it is one, but the syntax is
remarkably similar, so it's at least JavaScript inspired.
Adding Records
Mongo stores documents, not rows of columns. This distinction allows Mongo to
ignore schema, continuing the theme of leaving things up to the developer.
Those documents can be made up of any number of key-value pairs that look
remarkably like JSON. Need to store a new data point? Just add it as a field
to a document and you're set.
Here's an example inspired by Mongo's tutorial for adding a few records:
> person = {name: "Travis Swicegood"}
> city = {city: "Lawrence", state: "KS"}
> db.things.save(person)
> db.things.save(city)
Here I created two new objects with various data attached to them, then saved
them both inside the things collection. Collections in Mongo are like tables
in the SQL world. You don't have to create a collection; you just declare it
on the db object and you're set.
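If you want to double-check that the collection actually exists now, show collections should list it alongside Mongo's internal system.indexes collection:
> show collections
system.indexes
things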
Comparing this to the same steps against a relational database, I've got to say
I love this. There's no boilerplate code to get going. I didn't have to create
a database, and no tables were created. I just started using them. This appeals
to my laziness (err, I mean my desire for efficiency), but it also looks very
promising for teaching someone new. Every abstract idea you can remove is one
less potential stumbling block for someone starting out.
Back to the data I entered. Notice that the two documents don't share any
fields. Documents inside Mongo are made up of a series of keys and values, and
they can be whatever you want them to be. This is perfect for lazy migrations:
migrating the data as it's requested instead of doing it all at once. Ming,
a Python wrapper around Mongo, already provides this. It's especially
useful for large sites with lots of data that may or may not ever be
requested again.
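To sketch the idea (the schema_version field here is made up purely for illustration), a lazy migration boils down to tacking the new field onto a document the next time your code touches it:
> db.things.update({name: "Travis Swicegood"}, {$set: {schema_version: 2}})
Documents that never get requested never get rewritten, which is the whole point.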
Finding Records
Now that the records are there, it's time to find them. The db.things object
comes back into play here; calling find with no arguments looks something like
this (the ObjectId values will differ):
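> db.things.find()
{ "_id" : ObjectId("..."), "name" : "Travis Swicegood" }
{ "_id" : ObjectId("..."), "city" : "Lawrence", "state" : "KS" }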
That gives me everything. The find method takes optional parameters to
filter the results. This is actually a good time to bring up the built-in help
in mongo. Entering just the name of a function (i.e., referencing it without
calling it) displays that function's implementation:
> db.things.find
function (query, fields, limit, skip) {
    return new DBQuery(
        this._mongo, this._db, this, this._fullName,
        this._massageObject(query), fields, limit, skip);
}
Note: I changed the formatting so it's more easily viewable online.
The parameters are optional (as with any JavaScript function), so you can pass
in as many or as few as you want. Filtering the results is done by providing a
hash for the query parameter (the first one). For example, a query like this
should pull up my record:
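> db.things.find({name: "Travis Swicegood"})
{ "_id" : ObjectId("..."), "name" : "Travis Swicegood" }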
One thing you can't do is full-text searching. I can't ask for all of the
records that begin with Travis or have a portion of my name in them. The
current recommendation (at least via the wiki) is to build your own list of
keywords as an array, then search against that array. A rough sketch of what
that might look like:
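> db.things.save({name: "Travis Swicegood", keywords: ["travis", "swicegood"]})
> db.things.find({keywords: "travis"})
Matching a single value against an array field pulls back any document whose array contains that value, so the keyword lookup is just an ordinary find.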
For something like a name, this can be useful. For full-text searching of an
article, it's probably best to delegate searching off to something like
Solr and let Mongo focus on storage and retrieval.
Querying for sub-objects
Of course, I had to try sub-objects to see if they would work. Something along
these lines does the trick:
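> db.things.save({person: {name_field: "Travis"}})
> db.things.find({person: {name_field: "Travis"}})
{ "_id" : ObjectId("..."), "person" : { "name_field" : "Travis" } }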
You can also query using the dot-notation to "reach through" an object and
look at its children. This returns the same result as the previous query:
> db.things.find({"person.name_field": "Travis"})
Limiting returned columns
This ability to dynamically add columns to a record definitely provides a
breeding ground for massive documents with lots of keys. Most of the time, a
small subset of those keys is all that's needed. The second parameter in find
provides us with that functionality; a query like this should hand back just
the name field:
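> db.things.find({name: "Travis Swicegood"}, {name: 1})
{ "_id" : ObjectId("..."), "name" : "Travis Swicegood" }
The _id field comes along by default no matter which fields you ask for.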
These examples bring up a syntax thing with Mongo that I'm not crazy about: the
use of the number one. It's the standard C style: 1 is true, 0 is false. I'd
love to see the client and the libraries adopt an intent-revealing name.
Granted, this is a minor niggle, but the little things are what make a good
system an amazing one.
A few issues
The docs, being community run and with Mongo still relatively new, are a
little loose. I've found a bunch of examples looking through them that don't
work the way they're documented.
Another potential issue (or at least something you need to be aware of) is that
Mongo's geospatial support isn't 100% there yet. They only provide 2D support,
and the math they use assumes that 1° of longitude is the same at the poles as
it is at the equator. For many applications this isn't a huge issue, but if
precision is important, Mongo's not ready for this type of use.
One thing that I'm looking forward to is Mongo's sharding. That is going
to allow Mongo to scale horizontally really well. Some of
the initial test results look amazing. What will be really interesting is
to see how well it scales down. It's one thing to post over 300,000 ops/sec
on a bigger box; it's another to manage that on something like a 1 GB
instance on Rackspace Cloud Servers.
Two Biggest Issues
First, Mongo's a master-slave system. It appears really robust, but whenever a
box takes on a special role I start to get nervous. One of the promises of
"NoSQL" is that it provides a tremendous amount of resilience. Any
time you start to add special nodes you're taking away from that.
For example, if you're running 5 homogeneous servers and one goes down, the
other 4 can pick up the slack—assuming you're not running 5 servers at
peak capacity. This makes failure planning easy: figure out the amount of CPU
time you need to handle your load, provision that many servers, then add enough
servers to be comfortable when they start failing. Need 3 servers? Provision 5
and you can have two failures before you peg your machines.
This isn't to say Mongo can't handle failures. Its current model is to
rebalance the load when one of the servers goes out; mongos is the tool to
read up on for handling this. Unfortunately, I haven't been able to dive into
it yet. The only way to know for sure is to build up a cluster and then start
killing servers. Of course, this type of testing is advisable for any data
storage system.
Second, the license. I'm not anti-AGPL, but there's some ambiguity. The Mongo
team has addressed this both on the
wiki and through an in-depth
blog post. According to
that, I can write up a service such as MongoHQ and as long as I don't
actually change the mongod or mongos code I'm fine.
On the other hand, most of the definitions I've read of the AGPL mean that code
that talks to it is subject to being hit with the AGPL. I don't have any
doubts about 10gen, but if they don't always own the copyright, their
interpretation isn't the only one that counts.
Of course, those last two paragraphs come with the caveat that I am not a lawyer.
I think Mongo is an amazingly compelling piece of software in the non-standard
database realm. With the upcoming sharding and what I would have to imagine is
an imminent fix to the geospatial queries, Mongo's definitely worth a look.