One of those problems stems from the need to analyze large collections where actions were written in append-only fashion with minimal key information: only an actor id, an action id and the time of the action. This can be the case when pre-aggregating by action and person isn't feasible because, rather than answering questions like "how many people did this during this hour", we instead want to ask questions about actions relative to each other for the same person. An example would be: "How many people first performed action1, then performed action2 and then performed action3?" We might also want to know how long, on average, it took from the first to the last action, or from the first to the second. This is the sort of problem that the following aggregation framework example addresses.
The following example was loosely inspired by some consulting work I did last year with a customer who is collecting a large number of "check-in" events from many devices and then needs to answer questions after the fact about devices which failed to check in on schedule, and which ones of them performed certain events immediately prior to the failure (like upgrading software, or sending a "battery low" alert). Recently a Stack Overflow question on a similar topic reminded me of this problem and I decided to write it up in detail.
In the example, I assume that each document in my collection (I'll call it "pats") has three fields in addition to the required _id: a "p_id" number representing a person identifier, an "action_id" string representing an action identifier, and a "ts" ISODate() representing the time of the action. It should be easy to modify my code to use the exact field names and types of your collection.
First, we need to have a list of actions that we are reporting on so that we can generate the report automatically. I'm going to assume that my input will come as an array of action strings representing the actions in the order I expect them to be performed. Let's assume that "a1", "a2" and "a3" represent actions which are expected to be performed in order (though each may be performed multiple times), and that I want to know how many people performed just "a1", how many performed "a1" followed by "a2", how many performed all three actions in order, and how long on average it took from "a1" to "a2" and from "a1" to "a3".
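As a sketch, the input could be captured in the shell like this (the variable name `actions` is my choice, not anything the aggregation framework requires):

```javascript
// The ordered list of actions we are reporting on.
// The order matters: later we will ask "did a1 happen before a2?", etc.
var actions = ["a1", "a2", "a3"];
```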
So I have a giant collection, looking something like:
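Here are a few hypothetical documents in the shape described above (the ids and timestamps are made up; `ISODate` is built into the mongo shell, so the shim below is only needed to run the snippet elsewhere):

```javascript
// ISODate is a mongo shell built-in; this shim lets the snippet run outside it.
function ISODate(s) { return new Date(s); }

// Three made-up documents for one person (p_id 71) who performed
// a1, then a2, then a3.
var sampleDocs = [
    { _id: 1, p_id: 71, action_id: "a1", ts: ISODate("2013-03-01T09:00:00Z") },
    { _id: 2, p_id: 71, action_id: "a2", ts: ISODate("2013-03-01T09:05:00Z") },
    { _id: 3, p_id: 71, action_id: "a3", ts: ISODate("2013-03-01T09:20:00Z") }
];
```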
We can also have various optional filters, such as date range to constrain the report, but for now I'm going to ignore all other filters.
Conceptually, we have four or five "phases" to put our data through. First, we want to get rid of all the documents which don't match the actions we care about. If we had a time range, we would apply that filter in the same step. This will be a `$match` pipeline stage, of course, and its syntax is simply:
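A minimal sketch of that stage, using the field names assumed above:

```javascript
// Hypothetical input; in a real report this would come from the caller.
var actions = ["a1", "a2", "a3"];

// Keep only documents whose action_id is one of the actions we report on.
// A date range filter, if we had one, would be added to this same $match
// as a condition on the "ts" field.
var matchStage = { $match: { action_id: { $in: actions } } };
```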
The next stage is all about transforming the data into the format we need. While there are several ways to do this, I'm going to use `$project` first to compute new fields for each different action: I will need the first (earliest) time each action in my list was performed, as well as a total of the number of times each was performed. Once I have transformed the action_id values into key names, I will aggregate them by person. An alternative would be to first aggregate by person and action, and then do the values-to-keys transformation.
Here is how I build the projection:
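A sketch of building that projection from the `actions` array (the `<action>_ts` and `<action>_count` field names are my own convention, not anything required):

```javascript
var actions = ["a1", "a2", "a3"];

// For each action, emit two fields per document:
//   <action>_ts    - the document's ts if it records this action, else null
//                    (a later $min ignores nulls, so these fold cleanly)
//   <action>_count - 1 if this document records this action, else 0
var projectFields = { p_id: 1 };
actions.forEach(function (a) {
    var isThisAction = { $eq: ["$action_id", a] };
    projectFields[a + "_ts"] = { $cond: [isThisAction, "$ts", null] };
    projectFields[a + "_count"] = { $cond: [isThisAction, 1, 0] };
});
var projectStage = { $project: projectFields };
```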
The result of this phase will be a document with two new fields for every action: one with the timestamp of the action, the other with 1 or 0 depending on whether this document records that action or not.
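For example, with the hypothetical field names above, a document recording an "a2" action would come out of the projection looking roughly like this:

```javascript
// ISODate is a mongo shell built-in; shimmed here so the snippet runs anywhere.
function ISODate(s) { return new Date(s); }

// A hypothetical "a2" document after the projection: it carries its own
// timestamp and a 1 under the a2 fields, and null/0 for the other actions.
var projected = {
    _id: 2, p_id: 71,
    a1_ts: null,                                a1_count: 0,
    a2_ts: ISODate("2013-03-01T09:05:00Z"),     a2_count: 1,
    a3_ts: null,                                a3_count: 0
};
```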
Now the grouping - that's quite simple: I group by person, calculating the earliest time of each action using the `$min` operator and summing up their action counts:
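A sketch of that stage, again built from the `actions` array under the same field-naming assumptions:

```javascript
var actions = ["a1", "a2", "a3"];

// Group by person: $min picks the earliest timestamp per action (it ignores
// the nulls projected for non-matching documents), $sum totals the 0/1 flags.
var groupFields = { _id: "$p_id" };
actions.forEach(function (a) {
    groupFields[a + "_ts"] = { $min: "$" + a + "_ts" };
    groupFields[a + "_count"] = { $sum: "$" + a + "_count" };
});
var groupStage = { $group: groupFields };
```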
We have only a couple of stages left (other than the "make it pretty" stage at the end). The key to this whole analysis is the next stage, where we create a bunch of new fields which tell us whether each person performed the key sequences of events that we care about. We compare the appropriate fields for each person and record true or false into new fields that we can later summarize.
First, the set-up:
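One way to sketch that set-up is a `$project` that compares each person's earliest timestamps pairwise down the action list (the `did_*` field names are my own invention):

```javascript
var actions = ["a1", "a2", "a3"];

// Flag, for each person, whether they performed the first i actions in order:
//   did_a1       - a1 was performed at least once
//   did_a1_a2    - ...and the earliest a2 came after the earliest a1
//   did_a1_a2_a3 - ...and the earliest a3 came after the earliest a2
var seqFields = {};
var conditions = [{ $gt: ["$" + actions[0] + "_count", 0] }];
var name = "did_" + actions[0];
seqFields[name] = conditions[0];
for (var i = 1; i < actions.length; i++) {
    conditions.push({ $gt: ["$" + actions[i] + "_count", 0] });
    conditions.push({ $lt: ["$" + actions[i - 1] + "_ts", "$" + actions[i] + "_ts"] });
    name += "_" + actions[i];
    seqFields[name] = { $and: conditions.slice() };  // snapshot of conditions so far
}
// Carry the timestamps through so a later stage can average the
// a1-to-a2 and a1-to-a3 durations.
actions.forEach(function (a) { seqFields[a + "_ts"] = 1; });
var seqStage = { $project: seqFields };
```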