One of those problems stems from the need to analyze large collections where actions were written in append-only fashion with minimal key information: only an actor id, an action id and the time of the action. This can be the case when pre-aggregating by action and person isn't feasible because, rather than answering questions like "how many people did this during this hour", we instead want to ask questions about actions relative to each other for the same person. An example would be: "How many people first performed action1, then performed action2 and then performed action3?" We might also want to know how long, on average, it took from the first to the last action, or from the first to the second. This is the sort of problem that the following aggregation framework example addresses.
The following example was loosely inspired by some consulting work I did last year with a customer who is collecting a large number of "check-in" events from many devices and then needs to answer questions after the fact about devices which failed to check in on schedule, and which ones of them performed certain events immediately prior to the failure (like upgrading software, or sending a "battery low" alert). Recently a Stack Overflow question on a similar topic reminded me of this problem and I decided to write it up in detail.
In the example, I assume that each document in my collection (I'll call it "pats") has three fields in addition to the required _id: a "p_id" number representing a person identifier, an "action_id" string representing an action identifier, and a "ts" ISODate() representing the time of the action. It should be easy to modify my code to use the exact field names and types of your collection.
First, we need to have a list of actions that we are reporting on so that we can generate the report automatically. I'm going to assume that my input will come as an array of action strings representing the actions in the order I expect them to be performed. Let's assume that "a1", "a2" and "a3" represent actions which are expected to be performed in order (though each may be performed multiple times), and that I want to know how many people performed just "a1", how many performed "a1" followed by "a2", how many performed all three actions in order, and how long on average it took from "a1" to "a2" and from "a1" to "a3".
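As a sketch, the input could be captured in the shell like this (the variable name `actions` is my choice, not anything the aggregation framework requires):

```javascript
// The ordered list of actions we are reporting on.
// The order matters: later we will ask "did a1 happen before a2?", etc.
var actions = ["a1", "a2", "a3"];
```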
So I have a giant collection, looking something like:
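Here are a few hypothetical documents in the shape described above (the ids and timestamps are made up; `ISODate` is built into the mongo shell, so the shim below is only needed to run the snippet elsewhere):

```javascript
// ISODate is a mongo shell built-in; this shim lets the snippet run outside it.
function ISODate(s) { return new Date(s); }

// Three made-up documents for one person (p_id 71) who performed
// a1, then a2, then a3.
var sampleDocs = [
    { _id: 1, p_id: 71, action_id: "a1", ts: ISODate("2013-03-01T09:00:00Z") },
    { _id: 2, p_id: 71, action_id: "a2", ts: ISODate("2013-03-01T09:05:00Z") },
    { _id: 3, p_id: 71, action_id: "a3", ts: ISODate("2013-03-01T09:20:00Z") }
];
```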
We can also have various optional filters, such as date range to constrain the report, but for now I'm going to ignore all other filters.
Conceptually, we have four or five "phases" to put our data through. First, we want to get rid of all the documents which don't match the actions we care about. If we had a time range, we would apply that filter in the same step. This will be a `$match` pipeline stage, of course, and its syntax is simply:
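A minimal sketch of that stage, using the field names assumed above:

```javascript
// Hypothetical input; in a real report this would come from the caller.
var actions = ["a1", "a2", "a3"];

// Keep only documents whose action_id is one of the actions we report on.
// A date range filter, if we had one, would be added to this same $match
// as a condition on the "ts" field.
var matchStage = { $match: { action_id: { $in: actions } } };
```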
The next stage is all about transforming the data into the format we need. While there are several ways to do this, I'm going to use `$project` first to compute new fields for each different action: I will need the first (earliest) time each action in my list was performed, as well as a total of the number of times each was performed. Once I have transformed the action_id values into key names, I will aggregate them by person. An alternative would be to first aggregate by person and action, and then do the values-to-keys transformation.
Here is how I build the projection:
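A sketch of building that projection from the `actions` array (the `<action>_ts` and `<action>_count` field names are my own convention, not anything required):

```javascript
var actions = ["a1", "a2", "a3"];

// For each action, emit two fields per document:
//   <action>_ts    - the document's ts if it records this action, else null
//                    (a later $min ignores nulls, so these fold cleanly)
//   <action>_count - 1 if this document records this action, else 0
var projectFields = { p_id: 1 };
actions.forEach(function (a) {
    var isThisAction = { $eq: ["$action_id", a] };
    projectFields[a + "_ts"] = { $cond: [isThisAction, "$ts", null] };
    projectFields[a + "_count"] = { $cond: [isThisAction, 1, 0] };
});
var projectStage = { $project: projectFields };
```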
The result of this phase will be a document with two new fields for every action: one with the timestamp of the action, the other with 1 or 0 depending on whether this document records that action or not.
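For example, with the hypothetical field names above, a document recording an "a2" action would come out of the projection looking roughly like this:

```javascript
// ISODate is a mongo shell built-in; shimmed here so the snippet runs anywhere.
function ISODate(s) { return new Date(s); }

// A hypothetical "a2" document after the projection: it carries its own
// timestamp and a 1 under the a2 fields, and null/0 for the other actions.
var projected = {
    _id: 2, p_id: 71,
    a1_ts: null,                                a1_count: 0,
    a2_ts: ISODate("2013-03-01T09:05:00Z"),     a2_count: 1,
    a3_ts: null,                                a3_count: 0
};
```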
Now the grouping - that's quite simple: I group by person, calculating the earliest time of each action using the `$min` operator and summing up their action counts:
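A sketch of that stage, again built from the `actions` array under the same field-naming assumptions:

```javascript
var actions = ["a1", "a2", "a3"];

// Group by person: $min picks the earliest timestamp per action (it ignores
// the nulls projected for non-matching documents), $sum totals the 0/1 flags.
var groupFields = { _id: "$p_id" };
actions.forEach(function (a) {
    groupFields[a + "_ts"] = { $min: "$" + a + "_ts" };
    groupFields[a + "_count"] = { $sum: "$" + a + "_count" };
});
var groupStage = { $group: groupFields };
```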
We have only a couple of stages left (other than the "make it pretty" stage at the end). The key to this whole analysis is the next stage, where we create a bunch of new fields which tell us whether each person performed the key sequences of events that we care about. We compare the appropriate fields for each person and record true or false into new fields that we can later summarize.
First, the set-up:
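One way to sketch that set-up is a `$project` that compares each person's earliest timestamps pairwise down the action list (the `did_*` field names are my own invention):

```javascript
var actions = ["a1", "a2", "a3"];

// Flag, for each person, whether they performed the first i actions in order:
//   did_a1       - a1 was performed at least once
//   did_a1_a2    - ...and the earliest a2 came after the earliest a1
//   did_a1_a2_a3 - ...and the earliest a3 came after the earliest a2
var seqFields = {};
var conditions = [{ $gt: ["$" + actions[0] + "_count", 0] }];
var name = "did_" + actions[0];
seqFields[name] = conditions[0];
for (var i = 1; i < actions.length; i++) {
    conditions.push({ $gt: ["$" + actions[i] + "_count", 0] });
    conditions.push({ $lt: ["$" + actions[i - 1] + "_ts", "$" + actions[i] + "_ts"] });
    name += "_" + actions[i];
    seqFields[name] = { $and: conditions.slice() };  // snapshot of conditions so far
}
// Carry the timestamps through so a later stage can average the
// a1-to-a2 and a1-to-a3 durations.
actions.forEach(function (a) { seqFields[a + "_ts"] = 1; });
var seqStage = { $project: seqFields };
```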