Duplicate Keys in JSON Objects

We came across a curious problem when building a new RESTful API recently. If we sent duplicate keys in our JSON request, how should the API handle it? Shouldn’t that request be rejected straight away as invalid JSON? Are duplicate keys even allowed in JSON? I did a bit of digging around to clear up this debate, and this is what I found.

RFC 7159, the JSON standard published by the Internet Engineering Task Force (IETF) and since superseded by RFC 8259 with the same wording, states “The names within an object SHOULD be unique”. Sounds pretty clear, right? However, according to RFC 2119, which defines the terminology used in IETF documents, the word “should” in fact means “… there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course”. So, things just got a bit more confusing. What this essentially means is that while having unique keys is recommended, it is not a must. We can have duplicate keys in a JSON object, and it would still be valid.

Despite the firstName key being repeated, this is still valid JSON:

{
  "id" : 1,
  "firstName" : "John",
  "firstName" : "Jane",
  "lastName" : "Doe"
}

The validity of duplicate keys in JSON is an exception and not a rule, so this becomes a problem when it comes to actual implementations. In the typical object-oriented world, where a JSON object is usually mapped to a dictionary or a class with one value per key, it’s not easy to work with duplicate key-value pairs, and the standard does not define how a deserializer/parser should handle such objects. An implementation is free to choose its own path, so the behaviour can differ from one library to another.

For example, a parser may take only the last value present in the object for a particular key and ignore the previous ones. It could also return all the key-value pairs, or it may even reject the JSON with a parsing error, and all of these behaviours would be valid. That said, most popular implementations (including the ECMAScript specification, which is implemented in modern browsers) follow the rule of taking only the last key-value pair, but there is always the possibility of another library handling it in a different way. In our case we went with the last-key option, but there may be use cases where that is not acceptable and you may want to disallow duplicate keys altogether.
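To make the difference between the "last key wins" and "reject duplicates" options concrete, here is a minimal sketch in JavaScript. JSON.parse is the standard ECMAScript function and keeps the last value for a repeated key. The findDuplicateKeys helper is hypothetical (it is not part of any library) and is just one possible way to detect repeats so that a request could be rejected: it lets JSON.parse validate the syntax first, then scans the raw text and records the keys seen in each object.

```javascript
// Hypothetical helper: returns the keys that occur more than once
// within the same object of a JSON text.
function findDuplicateKeys(text) {
  JSON.parse(text); // throws SyntaxError if the text is not valid JSON at all

  const duplicates = [];
  const stack = []; // one Set of seen keys per open object, null per open array
  let i = 0;

  while (i < text.length) {
    const ch = text[i];
    if (ch === '{') {
      stack.push(new Set());
      i++;
    } else if (ch === '[') {
      stack.push(null);
      i++;
    } else if (ch === '}' || ch === ']') {
      stack.pop();
      i++;
    } else if (ch === '"') {
      // Find the closing quote, stepping over backslash escapes.
      let j = i + 1;
      while (text[j] !== '"') {
        j += text[j] === '\\' ? 2 : 1;
      }
      const raw = text.slice(i, j + 1);
      i = j + 1;

      // In valid JSON, a string followed by ':' is always an object key.
      let k = i;
      while (k < text.length && /\s/.test(text[k])) {
        k++;
      }
      if (text[k] === ':') {
        const key = JSON.parse(raw); // decode escapes so "a" and "\u0061" match
        const seen = stack[stack.length - 1];
        if (seen.has(key)) {
          duplicates.push(key);
        } else {
          seen.add(key);
        }
      }
    } else {
      i++;
    }
  }
  return duplicates;
}
```

With the example object above, JSON.parse would give firstName the value "Jane", while findDuplicateKeys would report firstName as a duplicate. An API that wants to disallow duplicates could return an error whenever the returned list is not empty.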

This kind of difference in behaviour can be particularly problematic in modern polyglot architectures, where the behaviour of different services should ideally be as consistent as possible. It may be unlikely that such a scenario would actually occur, but if and when it does, it would definitely help to know how your applications behave, and to have it documented as such for your consumers and fellow developers.


Reclaiming Disk Space from MongoDB

If you have used MongoDB, you would probably have noticed that it follows a default disk usage policy a bit like “take what you can, give nothing back”. Here’s a simple example – let’s say you had 10GB of data in a MongoDB database, and you delete 3GB of that data. However, even though that data is deleted and your database is holding only 7GB worth of data, that unused 3GB will not be released to the OS. MongoDB will keep holding on to the entire 10GB disk space it had before, so it can use that same space to accommodate new data. You can easily see this yourself by running a db.stats():

db.stats()
dataSize shows the size of the data in the database, while storageSize shows the size of data plus unused/freed space. The fileSize parameter (reported only by the MMAPv1 storage engine), which is essentially the space your database is taking up on disk, includes the size of data, indexes and unused/freed space.
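As a rough illustration of how these numbers relate, the sketch below estimates the allocated-but-unused space from a db.stats() result. The estimateUnusedBytes helper is hypothetical, and it assumes the relationship described above (storageSize = data plus unused/freed space). With WiredTiger compression enabled, storageSize can be smaller than dataSize, in which case the estimate is simply reported as zero and should not be relied on.

```javascript
// Hypothetical helper: estimates allocated-but-unused bytes from the
// document returned by db.stats(). Assumes storageSize covers the data
// plus unused/freed space, as described in the text.
function estimateUnusedBytes(stats) {
  const unused = stats.storageSize - stats.dataSize;
  // Compression can make storageSize smaller than dataSize; clamp to zero.
  return unused > 0 ? unused : 0;
}
```

In the mongo shell this could be called as estimateUnusedBytes(db.stats()).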

MongoDB is commonly used to store large quantities of data, often in read-heavy situations where data manipulation operations are relatively rare. In this kind of situation, it makes sense to anticipate that if you had to handle a certain amount of data before, then you might have to handle a similar amount again. Nevertheless, there will be situations (your development environment, for example) where you don’t want to let MongoDB keep hogging all your disk space. So how would you reclaim this disk space? Depending on your setup and the storage engine you’re using for MongoDB, you have a couple of choices.

Compact

The compact command works at the collection level, so each collection in your database will have to be compacted one by one. This completely rewrites the data and indexes to remove fragmentation. In addition, if your storage engine is WiredTiger, the compact command will also release unused disk space back to the system. You’re out of luck if your storage engine is the older MMAPv1 though; it will still rewrite the collection, but it will not release the unused disk space. Running the compact command places a block on all other operations at the database level, so you have to plan for some downtime.

Usage example:

db.runCommand({compact:'collectionName'})
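Since compact works on one collection at a time, compacting a whole database means looping over its collections. The following is a minimal sketch for the mongo shell. The compactAllCollections helper is hypothetical; it relies only on db.getCollectionNames() and db.runCommand(), and it assumes you have already planned for the blocking behaviour described above.

```javascript
// Hypothetical helper: runs compact on every collection of the given
// database object and returns the command result for each collection name.
function compactAllCollections(db) {
  const results = {};
  db.getCollectionNames().forEach(function (name) {
    // Each call blocks other operations while it runs.
    results[name] = db.runCommand({ compact: name });
  });
  return results;
}
```

In the mongo shell, compactAllCollections(db) would compact the current database. Checking the ok field of each result shows whether a particular collection failed to compact.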

Repair

If your storage engine is MMAPv1, this is your way forward. The repairDatabase command is used for checking and repairing errors and inconsistencies in your data. It performs a rewrite of your data, freeing up any unused disk space along with it. Like compact, it will block all other operations on your database. Running repairDatabase can take a lot of time depending on the amount of data in your db, and it will also completely remove any corrupted data it finds.

repairDatabase needs free space equal to the size of the data in your database plus an additional 2GB. It can be run either from the system shell or from within the mongo shell. Depending on the amount of data you have, it may be necessary to assign a separate volume for this using the --repairpath option.

Usage examples:

mongod --repair --repairpath /mnt/vol1

db.repairDatabase()

db.runCommand({repairDatabase:1})

Resync

In a replica set, unused disk space can be released by running an initial sync. This involves stopping the mongod instance, emptying the data directory and then restarting to allow it to reconstruct the data through replication.
