Understanding MongoDB storage structure

Recently, while trying to reduce the size of my ever-expanding Mongo database, I came across a lot of terms related to Mongo's storage structure. Terms like file size, storage size and extents sounded confusing at first, but with a little more reading I could make sense of them. Also, shrinking your Mongo database is not as easy as just deleting some records. Mongo is a hoarder, and its ideology is very similar to that of Captain Jack Sparrow from the Pirates of the Caribbean franchise.

If you have 5GB of data and you delete 2GB of it, you would expect that 2GB to be released back to the OS; but no, MongoDB will hold on to it for accommodating new data in the future.

In this post, I use an analogy to help you better understand the different storage-related concepts of MongoDB, and I discuss some techniques for getting back your precious unused disk space from the ravenous claws of Mongo.

Consider a city as a computer.
The total surface area of this city is analogous to the disk space on the computer.
The amount of area on which something is constructed is the used up disk space and the amount of area where there is no construction is the free space.

Let us consider there are houses of different sizes in this city. Each house is analogous to a database on the machine.

Let us consider a single house.

1. fileSize

The surface area covered by this house is the file size of this database.

The fileSize metric is equal to the sum of all the space occupied by the data, the indices and the yet unused space in the database.

The house can consist of rooms, a garden, parking space, some open space etc.

2. storageSize

The rooms in this house where people stay are the collections in the database.

The surface area of each room is the storage size of the collection.
The total surface area of the rooms in the house is the storage size of the database.

The storageSize metric is equal to the size of all the data containers in the database. It includes yet-unused space (in the data containers) and space vacated by deleted or moved documents within the containers.

3. dataSize

The occupants of the house are the documents of the database.
The total surface area occupied by the occupants is the dataSize of the database.

The dataSize metric is the sum of the sizes of all the documents stored in the database.
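To see how the three metrics relate to each other, here is a minimal Python sketch that works on a dict shaped like the reply of the dbStats command. The helper name `summarize_storage` and the sample numbers are made up for illustration. The field names `dataSize`, `storageSize`, `indexSize` and `fileSize` are the ones dbStats reports; `fileSize` is treated as optional, on the assumption that only the MMAPv1 storage engine reports it.

```python
def summarize_storage(stats):
    """Relate the size fields of a dbStats-style dict (all values in bytes)."""
    data_size = stats["dataSize"]          # the occupants
    storage_size = stats["storageSize"]    # the rooms
    index_size = stats.get("indexSize", 0)
    file_size = stats.get("fileSize")      # the house; may be absent

    summary = {
        # Room space not currently taken up by occupants.
        "unused_in_collections": storage_size - data_size,
        "fill_ratio": data_size / storage_size if storage_size else 0.0,
    }
    if file_size is not None:
        # Roughly the part of the house that is neither rooms nor indexes.
        summary["unallocated_in_files"] = file_size - storage_size - index_size
    return summary


# Invented numbers: 5GB of rooms holding 3GB of occupants in an 8GB house.
sample = {
    "dataSize": 3_000_000_000,
    "storageSize": 5_000_000_000,
    "indexSize": 500_000_000,
    "fileSize": 8_000_000_000,
}
print(summarize_storage(sample))
```

With pymongo, a dict like `sample` could come from `db.command("dbstats")`, where `db` is a `Database` handle.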

Suppose each room can house fans of only one football team. So, all the Arsenal fans stay in one room, all the Chelsea fans stay in another, and so on.

When a few Arsenal fans leave the house, the room where they were staying is not re-constructed to make it smaller. Also, the Arsenal fans would not allow the Chelsea fans to stay in their room. The room can only be used to accommodate more Arsenal fans, who may or may not come again to stay in this house.

Similarly, when some documents are deleted from a collection, the dataSize of the collection goes down but its storageSize does not. The space freed by the deleted documents can only be used to store new documents of the same collection. So deleting documents from a collection does little for the disk space available to other collections or to the machine.
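The rooms-and-fans behaviour can be played out in a few lines of Python. This is a toy model, not how MongoDB allocates space internally: the `Room` class and the sizes (think of them as GB) are invented, and it assumes freed space inside a room is perfectly reusable by that room, which ignores fragmentation.

```python
class Room:
    """Toy model of one collection: the room is storageSize, the fans are dataSize."""

    def __init__(self, team):
        self.team = team
        self.storage_size = 0  # space allocated to the room
        self.data_size = 0     # space actually occupied

    def insert(self, size):
        self.data_size += size
        # The room only grows when the occupants no longer fit.
        if self.data_size > self.storage_size:
            self.storage_size = self.data_size

    def delete(self, size):
        # Occupants leave, but the room is not rebuilt smaller.
        self.data_size -= min(size, self.data_size)


arsenal, chelsea = Room("Arsenal"), Room("Chelsea")
arsenal.insert(5)
chelsea.insert(3)

arsenal.delete(2)
print(arsenal.data_size, arsenal.storage_size)  # 3 5

chelsea.insert(2)            # cannot move into Arsenal's empty space
print(chelsea.storage_size)  # 5

arsenal.insert(2)            # new Arsenal fans reuse the empty space
print(arsenal.storage_size)  # 5
```

After the delete, Arsenal's room still takes up 5 units even though only 3 are occupied, and Chelsea's room has to grow on its own.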

So, how do you get back this unused space?

There are two ways of doing this:

1. Compacting Collections

You can compact individual collections by running the compact command.

When compact is run, the room gets rebuilt. Since the Arsenal fans are now few in number, their room becomes smaller, and the space from their previous large room becomes available to other rooms. So, more Chelsea fans can now be accommodated in this house.

Similarly, when compact is run on a collection, its storageSize reduces and other collections can now use this space to store their documents.

However, running compact does not necessarily reduce the fileSize, i.e., the total surface area of the house; this depends on the storage engine.[1]
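As a rough sketch of how you might check what compact did to a collection, the helper below reads collStats before and after running the command. The function name `compact_and_report`, the collection name and the `FakeDatabase` stand-in are all hypothetical; the only assumption about a real driver is that `db` behaves like a pymongo `Database`, whose `command(name, value)` method sends a database command and returns the reply as a dict.

```python
def compact_and_report(db, collection_name):
    """Run compact on one collection and return (before, after) sizes in bytes."""
    before = db.command("collstats", collection_name)
    db.command("compact", collection_name)
    after = db.command("collstats", collection_name)
    keys = ("size", "storageSize")  # data size and allocated size
    return ({k: before[k] for k in keys}, {k: after[k] for k in keys})


class FakeDatabase:
    """Stand-in so the sketch runs without a server; the numbers are invented."""

    def __init__(self):
        self.stats = {"size": 3_000, "storageSize": 5_000}
        self.calls = []

    def command(self, name, value):
        self.calls.append((name, value))
        if name == "compact":
            # Pretend the room was rebuilt to fit its occupants exactly.
            self.stats["storageSize"] = self.stats["size"]
            return {"ok": 1.0}
        return dict(self.stats)


before, after = compact_and_report(FakeDatabase(), "arsenal_fans")
print(before, after)
```

Against a real server the "after" storageSize will not necessarily equal the data size; the fake simply makes the expected direction of the change visible.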

2. Repair Database

When the repairDatabase command is run, the entire house gets rebuilt from scratch; the house is now sized according to the number of its occupants. The rooms are much smaller now. The space left unused from the previous large house is now available to the city.

Similarly, when repairDatabase is run on a database, each of the collections and the indices are rebuilt from scratch. The storageSize of each collection is now close to its dataSize. All the holes left by deleted documents in all the collections are removed. This reduces the storageSize of the database and, in turn, its fileSize. Finally, the disk space occupied by this database shrinks, and the newly freed disk space becomes available to the machine.
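The difference between the two options can be summed up in a toy model of the house. This only mirrors the analogy as described in this post for MMAPv1-style storage: the `House` class and the sizes are invented, indexes are ignored, and a real repair leaves storageSize close to dataSize rather than exactly equal.

```python
class House:
    """Toy model of a database: file_size is the house, rooms are collections."""

    def __init__(self, file_size, rooms):
        self.file_size = file_size
        # rooms maps a collection name to [storage_size, data_size]
        self.rooms = {name: list(sizes) for name, sizes in rooms.items()}

    @property
    def storage_size(self):
        return sum(storage for storage, _ in self.rooms.values())

    def compact(self, name):
        # Rebuild one room to fit its occupants; the house keeps its footprint.
        self.rooms[name][0] = self.rooms[name][1]

    def repair(self):
        # Rebuild every room, then shrink the house to what the rooms need.
        for name in self.rooms:
            self.compact(name)
        self.file_size = self.storage_size


house = House(file_size=10, rooms={"arsenal": (5, 3), "chelsea": (3, 3)})

house.compact("arsenal")
print(house.storage_size, house.file_size)  # 6 10

house.repair()
print(house.storage_size, house.file_size)  # 6 6
```

Compacting frees space for other rooms inside the same house, while only the repair hands land back to the city.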

That’s all folks! Hope this post gives you an insight into the MongoDB storage structure and helps you while evaluating options to reduce the size of your database.

Footnotes

1. From the MongoDB documentation:

compact has different impacts on available disk space depending on which storage engine is in use. On WiredTiger, compact will rewrite the collection and indexes to minimize disk space by releasing unused disk space to the system. On MMAPv1, compact defragments the collection’s data files and recreates its indexes. Unused disk space is not released to the system, but instead retained for future data.