Stupid Tricks with MongoDB

Using aggregation framework to reshape schema

4/21/2013

When I first started playing with MongoDB aggregation framework queries, I quickly realized that the '$project' stage could be hugely useful for completely reshaping documents in cases where you structured them one way but need the results shaped differently.

I thought it would be particularly helpful if you could map keys (aka field names) to field values and vice versa. This feature did not exist (and still doesn't), so I filed a request for it in the MongoDB issue tracking system.

Meanwhile, as I show here, it is still possible to do this "projection" via aggregation if you know the names of the fields in advance.

But what if you get the names of fields dynamically, rather than always knowing they will be field1, field2?

I'm going to show how you can generate the appropriate aggregation framework pipeline programmatically, based on the set of field names passed in. I'm going to use JavaScript in the shell as the most general example, but you can translate this into your language of choice and adjust accordingly.

My sample documents and schema will be like this:

{
    "_id" : 1,
    "attr" : [
        {  "k": "firstName",
           "v": "Asya" },
        {  "k": "lastName",
           "v": "Kamsky" },
        {  "k": "employer",
           "v": "10gen, the MongoDB company" },
        {  "k": "URL",
           "v": "http://www.kamsky.org" }
    ]
}

This is a pretty standard way of storing dynamic attributes - properties of the document that can't be easily enumerated in advance, either because they are not all known, because they vary widely across different types of documents, or both. It makes it possible to index the attributes by creating a compound index on {"attr.k":1,"attr.v":1}, so that querying on something like {"attr.k":"color","attr.v":"blue"} will use the index. The alternative, storing keys as field names and values as field values, has other advantages, but it makes a good indexing strategy difficult: you may end up with a large number of indexes, and every time you add a new type of attribute you have to create a new index to support it. Sparse indexes can help, but they create their own challenges, worthy of a separate blog post.
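As a rough sketch of the indexing idea above (not a definitive recipe), the index spec and query can be written as plain documents. The names attrIndexSpec and attrQuery are hypothetical helpers introduced only for illustration. One detail worth keeping in mind: the dotted form {"attr.k":"color","attr.v":"blue"} is satisfied when some array element has that key and some element, possibly a different one, has that value, so this sketch uses '$elemMatch' to require both in the same element.

```javascript
// Sketch only: index spec and query documents for the "attr" key/value schema.
// attrIndexSpec and attrQuery are hypothetical names, not part of any MongoDB API.
const attrIndexSpec = { "attr.k": 1, "attr.v": 1 };

// Build a query that requires k and v to match within the same array element.
function attrQuery(key, value) {
  return { attr: { $elemMatch: { k: key, v: value } } };
}

// In the mongo shell these would be used roughly as:
//   db.collection.createIndex(attrIndexSpec);   // ensureIndex in older shells
//   db.collection.find(attrQuery("color", "blue"));
console.log(JSON.stringify(attrQuery("color", "blue")));
```

Running explain() on the find is a reasonable way to check that the index is actually being used for your data.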

Now imagine I want to output documents which have this shape:

{
    "_id" : 1,
    "firstName" : "Asya",
    "lastName" : "Kamsky",
    "employer" : "10gen, the MongoDB company"
}

Given that I will be passed an array called 'fields' which contains the values 'firstName', 'lastName' and 'employer', here is how I will build the pipeline stages for my aggregation.

/* my array of wanted fields */
var fields = [ "firstName", "lastName", "employer" ];
/* first I unwind the attributes array in each document */
var unwind = { "$unwind" : "$attr" };
/* I only keep the attributes I want to return */
var match = { "$match" : { "attr.k" : { "$in" : fields } } };
/* I create new fields by setting correct value if key  *
 * matches, or some known value I can "skip" later      */
var project = { "$project" : { } };
fields.forEach( function(f) {
    project["$project"][f] = { "$cond" :
                      [ { "$eq" : [ f, "$attr.k" ] },
                        "$attr.v", "  skip"
                      ] };
} );
/* I regroup the original document using $max to *
 * trick it into keeping only non-skip value     */
var group = { "$group" : { "_id" : "$_id" } };
fields.forEach( function(f) {
    group["$group"][f] = { "$max" : "$" + f };
} );
/* now run the aggregation, passing the stages as one array */
db.collection.aggregate( [ unwind, match, project, group ] );
{
    "result" : [ 
        {
            "_id" : 1,
            "firstName" : "Asya",
            "lastName" : "Kamsky",
            "employer" : "10gen, the MongoDB company"
        }
    ], 
    "ok" : 1
}
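If you need this in more than one place, the same steps can be wrapped in a function that takes the field names and returns the whole pipeline. This is only a sketch: buildReshapePipeline and its optional skipValue parameter are hypothetical names, and it assumes the field names contain no "$" or "." characters, which would change how the expressions are interpreted.

```javascript
// Sketch: build the unwind/match/project/group pipeline for any list of field names.
// buildReshapePipeline and skipValue are hypothetical names used for illustration.
function buildReshapePipeline(fields, skipValue) {
  // Placeholder used when an unwound attribute is not the field being projected.
  const skip = skipValue === undefined ? "  skip" : skipValue;
  const project = {};
  const group = { _id: "$_id" };
  fields.forEach(function (f) {
    // Keep the attribute value only when its key matches this field name.
    project[f] = { $cond: [ { $eq: [ f, "$attr.k" ] }, "$attr.v", skip ] };
    // $max later picks the real value over the placeholder.
    group[f] = { $max: "$" + f };
  });
  return [
    { $unwind: "$attr" },
    { $match: { "attr.k": { $in: fields } } },
    { $project: project },
    { $group: group }
  ];
}

// In the shell, roughly:
//   db.collection.aggregate(buildReshapePipeline(["firstName", "lastName", "employer"]));
const examplePipeline = buildReshapePipeline(["firstName", "lastName", "employer"]);
console.log(JSON.stringify(examplePipeline, null, 2));
```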

I did make some assumptions about the data:

no duplicate attribute names for a particular document
all attribute values would start with an alphanumeric character and so compare greater than the "space" character (ASCII 0x20)

The "space" character ordering was the trick that keeps the "real" attribute value when using the '$max' expression in the '$group' stage: the placeholder "  skip" starts with a space, and every alphanumeric value compares greater than the space character.
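To see why the placeholder loses to the real value, here is a plain-JavaScript emulation of the '$cond' and '$max' steps on the sample document. It is only an illustration of the ordering trick, not how MongoDB executes the pipeline; the reshape function and SKIP constant are hypothetical names, and it relies on JavaScript and MongoDB ordering these ASCII strings the same way.

```javascript
// Plain-JavaScript emulation of the $cond + $max trick (not MongoDB itself).
const SKIP = "  skip"; // placeholder from the post; it starts with a space (0x20)

const sampleDoc = {
  _id: 1,
  attr: [
    { k: "firstName", v: "Asya" },
    { k: "lastName", v: "Kamsky" },
    { k: "employer", v: "10gen, the MongoDB company" },
    { k: "URL", v: "http://www.kamsky.org" }
  ]
};

// reshape is a hypothetical helper that mimics unwind/match/project/group.
function reshape(doc, fields) {
  const out = { _id: doc._id };
  doc.attr
    .filter(function (a) { return fields.includes(a.k); }) // $unwind + $match
    .forEach(function (a) {
      fields.forEach(function (f) {
        // $cond in $project: real value if the key matches, placeholder otherwise
        const candidate = a.k === f ? a.v : SKIP;
        // $max in $group: keep whichever string compares greater
        if (out[f] === undefined || candidate > out[f]) out[f] = candidate;
      });
    });
  return out;
}

console.log(reshape(sampleDoc, ["firstName", "lastName", "employer"]));
```

Note that a requested field that is missing from a document which still has other requested fields comes out as the placeholder, which is the value to "skip" when reading the results.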

I hope this is helpful, if not for actual implementation, at least for thinking about how you want to structure your documents. In the future I will take a closer look at advantages and disadvantages of different ways of storing attributes in your schema design.

9 Comments

José Bonnet

10/15/2013 06:38:34 pm

Hi,
Isn't this just a 'relationalization' problem, i.e., if new attributes appear, why not just consider them in the document where they appear?

Thanks,
jb

andy

10/6/2014 05:17:03 pm

Hi Asya,

I found that MongoDB aggregation $match almost always scans the whole collection (or maybe does so after utilizing an index), and $limit after $match doesn't help much (I have a huge number of records, with indexes).

Is it possible for $match to return "$limit" records and then go on to the next stage of the pipeline? (I want to get results as fast as limit() in the find().limit(x) case.)

Thank you.

Asya Kamsky

10/8/2014 02:50:38 am

Match will utilize the same indexes that find does and when there is a limit with it they will be combined to use the available indexes. You can see it if you use explain as that shows the query plan for match and expected steps for whole pipeline.

What makes you think the whole collection scan is happening?

Ash

10/27/2014 01:59:45 pm

Wow! reminds me of penrose stairs.

sivaraj

1/31/2016 10:47:36 pm

I'm using it like this and it shows an error:

db.collection.aggregate( unwind, match, project, group )

error: wrong number of arguments (4 for 1..2)

I also tried it like this:

aggregate(unwind, match, project, group)

def aggregrate(unwind, match, project, group)

ServerLog.collection.aggregate([
{
"$project" => project
};
{
"$group" => group
}
])
end

error: Mongo::Error::OperationFailure: exception: the group aggregate field name '$group' cannot be an operator name (15950)

Please suggest a solution for this.

6/23/2016 12:51:08 am

Looks like instead of using some trick value you can just use {$literal:null}

ererer

1/13/2017 03:24:03 am

Vinod Kumar

5/1/2017 05:13:22 am

Hi.
I am working on a CMS project in which menu items are assigned to a particular user by role. I am able to retrieve all the menu items, but the problem is that I am not able to arrange them in their parent-child schema. Please contact me at
[email protected]

Ramez Rafla

3/10/2018 04:11:55 pm

Thank you! This solution saved our behinds


Asya Kamsky
