Combining results of two tables in mongoid/mongo - ruby-on-rails

Hi guys, what would be the best way to combine the results of two Mongoid queries?
My issue is that I would like to know the number of active users. A user can send a letter and a notification; both live in separate tables, and a user who sends either a letter or a notification is considered active. What I want to know is how many active users there were per month.
Right now, what I can think of is doing this:
Letter.collection.aggregate([
{ '$match': {}.merge(opts) },
{ '$sort': { 'created_at': 1 } },
{
'$group': {
_id: '$customer_id',
first_notif_sent: {
'$first': {
'day': { '$dayOfMonth': '$created_at' },
'month': { '$month': '$created_at' },
'year': { '$year': '$created_at' }
}
}
}
}])
Notification.collection.aggregate([
{ '$match': {}.merge(opts) },
{ '$sort': { 'created_at': 1 } },
{
'$group': {
_id: '$customer_id',
first_notif_sent: {
'$first': {
'day': { '$dayOfMonth': '$created_at' },
'month': { '$month': '$created_at' },
'year': { '$year': '$created_at' }
}
}
}
}])
What I am looking for is to get the minimum of the dates and then combine the results and get the count. Right now I can get the results and loop over each of them and create a new list. But I wanted to know if there is a way to do it in mongo directly.
EDIT
For letters
def self.get_active(tenant_id)
map = %{
function() {
emit(this.customer_id, new Date(this.created_at))
}
}
reduce = %{
function(key, values) {
return new Date(Math.min.apply(null, values))
}
}
where(tenant_id: tenant_id).map_reduce(map, reduce).out(reduce: "#{tenant_id}_letter_notification")
end
Notifications
def self.get_active(tenant_id)
map = %{
function() {
emit(this.customer_id, new Date(this.updated_at))
}
}
reduce = %{
function(key, values) {
return new Date(Math.min.apply(null, values))
}
}
where(tenant_id: tenant_id, transferred: true).map_reduce(map, reduce).out(reduce: "#{tenant_id}_outgoing_letter_standing_order_balance")
end
This is what I am thinking of going with; one of the reasons is that $lookup does not work with my version of MongoDB.

the customer created a new notification, or a new letter, and I would like to get the first created at of either.
Let's address this first as a foundation. Given examples of document schema as below:
Document schema in Letter collection:
{ _id: <ObjectId>,
customer_id: <integer>,
created_at: <date> }
And, document schema in Notification collection:
{ _id: <ObjectId>,
customer_id: <integer>,
created_at: <date> }
You can use the aggregation pipeline stage $lookup to join the two collections. For example, using the mongo shell:
db.letter.aggregate([
{"$group":{"_id":"$customer_id", tmp1:{"$max":"$created_at"}}},
{"$lookup":{from:"notification",
localField:"_id",
foreignField:"customer_id",
as:"notifications"}},
{"$project":{customer_id:"$_id",
_id:0,
latest_letter:"$tmp1",
latest_notification: {"$max":"$notifications.created_at"}}},
{"$addFields":{"latest":
{"$cond":[{"$gt":["$latest_letter", "$latest_notification"]},
"$latest_letter",
"$latest_notification"]}}},
{"$sort":{latest:-1}}
], {cursor:{batchSize:100}})
The output of the above aggregation pipeline is a list of customers sorted by their latest created_at value from either Letter or Notification, most recent first. Example output documents:
{
"customer_id": 0,
"latest_letter": ISODate("2017-12-19T07:00:08.818Z"),
"latest_notification": ISODate("2018-01-26T13:43:56.353Z"),
"latest": ISODate("2018-01-26T13:43:56.353Z")
},
{
"customer_id": 4,
"latest_letter": ISODate("2018-01-04T18:55:26.264Z"),
"latest_notification": ISODate("2018-01-25T02:05:19.035Z"),
"latest": ISODate("2018-01-25T02:05:19.035Z")
}, ...
What I want to know is how many active users were there per month
To achieve this, you can just replace the last stage ($sort) of the above aggregation pipeline with $group. For example:
db.letter.aggregate([
{"$group":{"_id":"$customer_id", tmp1:{$max:"$created_at"}}},
{"$lookup":{from:"notification",
localField:"_id",
foreignField:"customer_id",
as:"notifications"}},
{"$project":{customer_id:"$_id",
_id:0,
latest_letter:"$tmp1",
latest_notification: {"$max":"$notifications.created_at"}}},
{"$addFields":{"latest":
{"$cond":[{"$gt":["$latest_letter", "$latest_notification"]},
"$latest_letter",
"$latest_notification"]}}},
{"$group":{_id:{month:{"$month": "$latest"},
year:{"$year": "$latest"}},
active_users: {"$sum": 1}
}
}
],{cursor:{batchSize:10}})
Where the example output would be as below:
{
"_id": {
"month": 10,
"year": 2017
},
"active_users": 9
},
{
"_id": {
"month": 1,
"year": 2018
},
"active_users": 18
},
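If $lookup is not available on your MongoDB version, the combining step can also be done in Ruby after running the two per-collection aggregations from the question. Below is a minimal sketch of that step only. The two input hashes are hypothetical stand-ins for the aggregation results (customer_id mapped to first activity time), and both helper names are made up for illustration. Since the question asks for the first created_at of either collection, the sketch keeps the minimum; use max instead to mirror the "latest" logic of the pipeline above.

```ruby
# Hypothetical inputs: first activity per customer, as the two aggregations
# in the question would report them (customer_id => timestamp).
first_letters = {
  1 => Time.utc(2017, 10, 3),
  2 => Time.utc(2018, 1, 9)
}
first_notifications = {
  1 => Time.utc(2017, 12, 1),
  3 => Time.utc(2018, 1, 20)
}

# Keep the earlier timestamp whenever a customer appears in both hashes.
def merge_first_activity(letters, notifications)
  letters.merge(notifications) { |_customer_id, a, b| [a, b].min }
end

# Count customers per [year, month] of their first activity.
def active_users_per_month(first_activity)
  first_activity.values
                .group_by { |t| [t.year, t.month] }
                .transform_values(&:count)
end

merged = merge_first_activity(first_letters, first_notifications)
per_month = active_users_per_month(merged)
```

Unlike the $lookup pipeline, this also picks up customers who only ever sent notifications, because both inputs are treated symmetrically.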

Related

Group a Searchkick result?

I have a basic Searchkick system set up. I want to take the results and then group them by an attribute to sum another attribute, etc.
This question is close to my issue:
Elasticsearch + searckick
and the only answer was to use aggregations. I could do that but then I would be building an active record call for each of the agg keys returned.
Here is what I have so far:
BudgetItem.all.search("*", body_options: { aggs: { cbs_item_id: { terms: { field: "cbs_item_id" }, aggs: { "total": { "sum": { "field": "total" } } } } } } )
which results in:
"aggregations"=>{"cbs_item_id"=>{"doc_count_error_upper_bound"=>0, "sum_other_doc_count"=>0, "buckets"=>[{"key"=>5, "doc_count"=>2, "total"=>{"value"=>2956.0}}, {"key"=>6, "doc_count"=>2, "total"=>{"value"=>7734.0}}]}}}>
in my search_data I have a term 'cbs' which is a text value that relates to the 'cbs_item_id'. I am looking for this result:
"aggregations"=>
{"cbs_item_id"=>
{"doc_count_error_upper_bound"=>0, "sum_other_doc_count"=>0, "buckets"=>
[{"key"=>5, "doc_count"=>2, "total"=>{"value"=>2956.0}, "cbs"=>{"value"=>"MY CBS Related Field" }},
{"key"=>6, "doc_count"=>2, "total"=>{"value"=>7734.0}, "cbs"=>{"value"=>"MY OTHER CBS Related Field" }}]}}}
Think of this as having an inventory of cars and a separate table of car_colors ([id = 1, color = red], [id = 3, color = blue]). I want to search for the cars of a given color, then group them, sum them, etc.
I am sure I am perhaps missing something simple here.
UPDATE
Getting close:
BudgetItem.all.search("*", body_options: { aggs: { cbs_item_id: { terms: { field: "cbs_item_id" }, aggs: { cbs: { terms: { field: "cbs" } }, "total": { "sum": { "field": "total" } } } } } } )
which results:
"buckets"=>
[{"key"=>5, "doc_count"=>2, "total"=>{"value"=>2956.0}, "cbs"=>{"doc_count_error_upper_bound"=>0, "sum_other_doc_count"=>0, "buckets"=>[{"key"=>"001", "doc_count"=>2}]}},
{"key"=>6, "doc_count"=>2, "total"=>{"value"=>7734.0}, "cbs"=>{"doc_count_error_upper_bound"=>0, "sum_other_doc_count"=>0, "buckets"=>[{"key"=>"002", "doc_count"=>2}]}}]}}
The second "key"s 001 and 002 are the data I am looking for.
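Since the nested terms aggregation already carries the value, one option is to flatten the buckets in Ruby rather than issue an Active Record call per key. A minimal sketch, assuming the response hash has the shape shown above (trimmed here to the keys that are used) and that each cbs_item_id maps to exactly one cbs value; the variable names are illustrative only:

```ruby
# Response fragment with the same shape as the output above.
aggregations = {
  "cbs_item_id" => {
    "buckets" => [
      { "key" => 5, "doc_count" => 2, "total" => { "value" => 2956.0 },
        "cbs" => { "buckets" => [{ "key" => "001", "doc_count" => 2 }] } },
      { "key" => 6, "doc_count" => 2, "total" => { "value" => 7734.0 },
        "cbs" => { "buckets" => [{ "key" => "002", "doc_count" => 2 }] } }
    ]
  }
}

# One row per outer bucket; the first inner bucket key is taken as the cbs value.
rows = aggregations["cbs_item_id"]["buckets"].map do |bucket|
  {
    cbs_item_id: bucket["key"],
    cbs: bucket.dig("cbs", "buckets", 0, "key"),
    total: bucket.dig("total", "value")
  }
end
```

If a cbs_item_id could ever map to several cbs values, the inner buckets would need to be iterated instead of taking index 0.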

Mongo DB, delete redundant data, how to delete duplicate unique index from the collection

I have a collection which has redundant data.
Example Data:
{
unique_index : "1"
other_field : "whatever1"
},
{
unique_index : "2"
other_field : "whatever2"
},
{
unique_index : "1"
other_field : "whatever1"
}
I ran this query (I have to use allowDiskUse: true because there is a lot of data):
db.collection.aggregate([
{
$group: {
_id: "$unique_index",
count: { $sum: 1 }
}
},
{ $match: { count: { $gte: 2 } } }
], { allowDiskUse: true })
I get this output: (for example)
{ "_id" : "1", "count" : 2 }
.
.
Now the problem is that I want to keep only one copy of each document and delete all the redundant data. Please note that it's a lot of data, more than 100,000 records. I am looking for a fast and easy solution (in MongoDB or RoR, because I am using Ruby on Rails). Any help would be appreciated.
If you don't care about _id, the simplest way is to select distinct documents into a new collection, and then rename it:
db.collection.aggregate([
{$group: {
_id: "$unique_index",
other_field: {$first: "$other_field"}
}},
{$project: {
_id: 0,
unique_index: "$_id",
other_field:1
}},
{$out: "new_collection"}
]);
db.new_collection.renameCollection("collection", true);
Please bear in mind that you will need to recreate all indexes afterwards. Also, renameCollection does not work on sharded collections.
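If the _id values do need to be preserved, an alternative is to work out which documents are redundant and delete only those. Here is a minimal Ruby sketch of that selection step, run on in-memory hashes shaped like the example data. The _id values and the redundant_ids helper are made up for illustration; fetching the documents and issuing the actual delete are left to your driver or ODM.

```ruby
# Sample documents mirroring the example data, with hypothetical _id values.
docs = [
  { _id: "a", unique_index: "1", other_field: "whatever1" },
  { _id: "b", unique_index: "2", other_field: "whatever2" },
  { _id: "c", unique_index: "1", other_field: "whatever1" }
]

# For each unique_index keep the first document seen and
# return the _ids of every later duplicate.
def redundant_ids(docs)
  docs.group_by { |doc| doc[:unique_index] }
      .values
      .flat_map { |group| group.drop(1).map { |doc| doc[:_id] } }
end

ids_to_delete = redundant_ids(docs)
```

With a lot of data you would probably want to process the duplicates in batches rather than load the whole collection into memory.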

Elasticsearch getting top 5 results from an aggregation with a script

I am trying to get the top 5 products sold, ordered by revenue using elasticsearch in Rails.
Here is my query:
query = {
bool: {
filter: {
bool: {
must: [
{ term: { store_id: store.id } } # Limiting the products by store
]
}
}
}
}
aggs = {
by_revenue: {
terms: {
size: 5,
order: {revenue: "desc"}
},
aggs: {
revenue: {
max: {
script: "doc['price_as_float'].value * doc['quantity'].value"
}
}
}
}
}
response = OrderItem.search(query: query, aggs: aggs, size: 0)
I get the error: could not find the appropriate value context to perform aggregation [by_revenue]
Thanks!
You need to aggregate the orders by product reference, then sum price * quantity with a nested sum aggregation (not max) to get the revenue for each product:
aggs: {
products: {
terms: {
field: "product_ref",
order: { revenues: "desc" },
},
aggs: {
revenues: {
sum: { script: "doc['price_as_float'].value * doc['quantity'].value" }
}
}
}
}
Don't use the size option in the terms aggregation, because you're not sure all the orders for your top products are located in the same shard; you should get them from the response instead.
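To make the difference between sum and max concrete, here is a rough plain-Ruby equivalent of what the aggregation computes: group the order items by product, sum price * quantity within each group, sort by revenue descending, and keep the top entries. The sample rows and the top_products_by_revenue helper are hypothetical; only the field names come from the queries above.

```ruby
# Hypothetical order items using the field names from the aggregation.
order_items = [
  { product_ref: "A", price_as_float: 10.0, quantity: 2 },
  { product_ref: "B", price_as_float: 2.5,  quantity: 4 },
  { product_ref: "A", price_as_float: 10.0, quantity: 1 },
  { product_ref: "C", price_as_float: 50.0, quantity: 1 }
]

# Sum revenue per product, then return the best sellers first.
def top_products_by_revenue(items, limit = 5)
  items.group_by { |item| item[:product_ref] }
       .map { |ref, rows| [ref, rows.sum { |r| r[:price_as_float] * r[:quantity] }] }
       .sort_by { |_ref, revenue| -revenue }
       .first(limit)
end

top_two = top_products_by_revenue(order_items, 2)
```

With max instead of sum, product "A" would be ranked by its single largest order line (20.0) rather than its total revenue (30.0).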

How to present sort order-by query in Falcor?

Suppose the model is structured as
{
events: [
{
date: '2016-06-01',
name: 'Children Day'
},
{
date: '2016-01-01',
name: 'New Year Day'
},
{
date: '2016-12-25',
name: 'Christmass'
}
]
}
There could be a lot of events in our storage. From the client side, we want to issue a query to get 10 events ordered by date in ascending order.
How to present this kind of query in Falcor?
Here's what I would do: first, promote your events to entities, i.e. give them ids:
id | date | name
1 | 2016-06-01 | Children Day
2 | 2016-01-01 | New Year Day
3 | 2016-12-25 | Christmass
Then provide a route providing these events by id:
route: 'eventsById[{integers:ids}]["date","name"]'
which returns the original data. Now you can create a new route for orderings:
route: 'orderedEvents["date","name"]["asc","desc"][{ranges:range}]'
which returns references into the eventsById route. This way your client could even request the same data sorted in different ways within the same request!
model.get(
"orderedEvents.date.asc[0..2]",
"orderedEvents.date.desc[0..2]");
which would return
{
'eventsById': {
1: {
'date':'2016-06-01',
'name':'Children Day' },
2: {
'date':'2016-01-01',
'name':'New Year Day' },
3: {
'date':'2016-12-25',
'name':'Christmass' } },
'orderedEvents': {
'date': {
'asc': [
{ '$type':'ref', 'value':['eventsById',2] },
{ '$type':'ref', 'value':['eventsById',1] },
{ '$type':'ref', 'value':['eventsById',3] } ],
'desc': [
{ '$type':'ref', 'value':['eventsById',3] },
{ '$type':'ref', 'value':['eventsById',1] },
{ '$type':'ref', 'value':['eventsById',2] } ] } }
}
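The ordering route only has to produce references, so its core logic is a sort over the ids. A real Falcor router would be written in JavaScript; the Ruby sketch below only illustrates that logic on the sample events, and the ordered_refs helper is hypothetical:

```ruby
# The events table from above, keyed by id.
events_by_id = {
  1 => { date: "2016-06-01", name: "Children Day" },
  2 => { date: "2016-01-01", name: "New Year Day" },
  3 => { date: "2016-12-25", name: "Christmass" }
}

# Sort ids by the requested field, reverse for :desc, slice to the
# requested range and emit one reference into eventsById per id.
def ordered_refs(events_by_id, field, direction, range)
  ids = events_by_id.keys.sort_by { |id| events_by_id[id][field] }
  ids = ids.reverse if direction == :desc
  (ids[range] || []).map { |id| { "$type" => "ref", "value" => ["eventsById", id] } }
end

asc  = ordered_refs(events_by_id, :date, :asc, 0..2)
desc = ordered_refs(events_by_id, :date, :desc, 0..2)
```

Sorting the date strings directly works here only because they are in ISO 8601 format, which orders lexicographically.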

Filter result based on a count of inner data

I am building a search query for some listing data. As part of the search, people can ask for multiple rooms which each sleep a minimum number of people, i.e. two rooms which sleep 2 and 3 people.
I'm not sure how I can do that with a filter.
Here is a shortened search query so far.
{
"query":{
"filtered":{
"query":{
"match_all":{}
}
}
},
"filter":{
"and":
[
{
"term":{
"status":"live"
}
},
{
"geo_bounding_box":{
"location":{
"top_left":"60.856553, -8.64935719999994",
"bottom_right":"49.8669688, 1.76270959999999"
}
}
}
,{
"range":{
"bedrooms":{
"gte":"2"
}
}
}
]
}
,
"size":10
}
Test Data
{
"took":1,
"timed_out":false,
"_shards":{
"total":5,
"successful":5,
"failed":0
},
"hits":{
"total":3,
"max_score":1.0,
"hits":[
{
"_index":"listings",
"_type":"listing",
"_id":"1",
"_score":1.0,
"_source":{
"name:":"Listing One",
"address1":"Some Street",
"bedrooms":2,
"city":"A City",
"id":1,
"refno":"FI451",
"user_id":1,
"rooms":[
{
"bathroom":"Shared bathroom with bath",
"double_standard":null,
"id":5,
"single":2,
"sleeps":2,
"title":"Twinny",
},
{
"bathroom":"Ensuite with bath",
"double_king_size":1,
"double_standard":1,
"id":1,
"single":null,
"sleeps":2,
"title":"Double Ensuite Room",
}
]
}
},
{
"_index":"listings",
"_type":"listing",
"_id":"2",
"_score":1.0,
"_source":{
"name":"Listing Two",
"address1":"Some Street",
"bedrooms":2,
"city":"A City",
"id":2,
"refno":"BL932",
"user_id":1,
"rooms":[
{
"bathroom":"Ensuite with bath",
"double_standard":1,
"id":4,
"single":1,
"sleeps":3,
"title":"Family Room",
},
{
"bathroom":"Ensuite with shower",
"double_standard":1,
"id":2,
"single":null,
"sleeps":2,
"title":"Single Room",
}
]
}
},
{
"_index":"listings",
"_type":"listing",
"_id":"3",
"_score":1.0,
"_source":{
"name":"Listing Three",
"address1":"Another Address",
"bedrooms":1,
"city":"Your City",
"id":3,
"refno":"TE2116",
"user_id":1,
"rooms":[
{
"bathroom":"Ensuite with shower",
"double_king_size":null,
"double_standard":1,
"id":3,
"single":1,
"sleeps":3,
"title":"Family Room",
}
]
}
}
]
}
}
If you look at my data, I have 3 listings. Two of them have multiple rooms (Listing One and Two), but only Listing Two would match my search, because it has one room that sleeps two and another that sleeps three.
Is it possible to perform this query with elasticsearch?
If what you want is "Find all listings where a bedroom sleeps 2 AND another bedroom sleeps 3", this query will work. It makes one big assumption: that you are using inner objects, and not the Nested data type.
This query relies on the fact that inner objects are collapsed into a single field, causing "rooms.sleeps" to equal [2,3] for the desired listing. Since the field is collapsed into a single array, a simple Terms filter will match it. Changing the execution mode to "and" forces both 2 and 3 to be matched.
The caveat is that a listing whose rooms sleep [2,3,4] will also be matched.
I've omitted the geo and status portion since that data wasn't provided in the source documents.
{
"query": {
"filtered": {
"query": {
"match_all": {}
}
}
},
"filter": {
"and": [
{
"range": {
"bedrooms": {
"gte": "2"
}
}
},
{
"terms": {
"rooms.sleeps": [2,3],
"execution": "and"
}
}
]
},
"size": 10
}
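To see why this works and where the caveat comes from, the matching rule can be mimicked in plain Ruby. This is only a sketch of the semantics, not of how Elasticsearch evaluates the filter: each listing is reduced to the collapsed "rooms.sleeps" values from the test data, and the matches_all_terms? helper is made up for illustration.

```ruby
# Collapsed "rooms.sleeps" values per listing, taken from the test data.
listings = {
  "Listing One"   => [2, 2],
  "Listing Two"   => [3, 2],
  "Listing Three" => [3]
}

# Terms filter with execution "and": every required value must be present.
def matches_all_terms?(flattened_values, required)
  (required - flattened_values).empty?
end

matching = listings.select { |_name, sleeps| matches_all_terms?(sleeps, [2, 3]) }.keys

# The caveat: extra values do not prevent a match.
caveat = matches_all_terms?([2, 3, 4], [2, 3])
```

Only Listing Two has both a 2 and a 3 in its collapsed values, which is the result the question asks for.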
As far as I know the filter has to be a sibling of the query inside the filtered element. See: http://www.elasticsearch.org/guide/reference/query-dsl/filtered-query/
If you combine that with Zach's solution it should work.
{
"query":
{
"filtered":
{
"query":
{
"match_all":{}
},
"filter":
{
"put" : "your filter here"
}
}
}
}

Resources