Mongoid MapReduce giving irregular results for recursive reduce function - ruby-on-rails

I have an Item model which has an attribute category. I want the item count grouped by category. I wrote a map/reduce for this functionality, and it was working fine. I recently wrote a script to create 5000 items, and now I realize my map/reduce only gives the result for the last 80 records. The following is the code for the map/reduce functions:
map = %Q{
  function() {
    emit({}, { category: this.category });
  }
}
reduce = %Q{
  function(key, values) {
    var category_count = {};
    values.forEach(function(value) {
      if (category_count.hasOwnProperty(value.category))
        category_count[value.category]++;
      else
        category_count[value.category] = 1;
    });
    return category_count;
  }
}
Item.map_reduce(map,reduce).out(inline: true).first.try(:[],"value")
After researching a bit, I discovered that MongoDB may invoke the reduce function multiple times. How can I achieve the functionality I intended?

There is a rule you must follow when writing map-reduce code in MongoDB (a few rules, actually). One is that the value you emit (emit outputs key/value pairs) must have the same format as the value your reduce function returns.
If you emit(this.key, this.value) then reduce must return the exact same type that this.value has. If you emit({}, 1) then reduce must return a number. If you emit({}, {category: this.category}) then reduce must return a document of the format {category: "string"} (assuming category is a string).
That clearly can't be what you want, since you want totals. So let's look at what reduce should be returning and work out from that what you should be emitting.
It looks like at the end you want to accumulate a document with a key name for each category, whose value is a number representing the number of its occurrences. Something like:
{category_name1:total, category_name2:total}
If that's the case, then the correct map function should emit a document that uses the actual category value as the key name, with 1 as the value, in which case your reduce will need to add up the numbers for each key corresponding to a category.
Here is what the map should look like:
map = function() {
  var category = {};
  category[this.category] = 1;
  emit({}, category);
}
And here is the correct corresponding reduce:
reduce = function(key, values) {
  var category_count = {};
  values.forEach(function(value) {
    for (var cat in value) {
      if (!category_count.hasOwnProperty(cat)) category_count[cat] = 0;
      category_count[cat] += value[cat];
    }
  });
  return category_count;
}
Note that it satisfies two other requirements for MapReduce: it works correctly if the reduce function is never called (which will be the case if there is only one document in your collection), and it works correctly if the reduce function gets called multiple times with partial results (which is what happens when you have more than 100 documents).
A more conventional way to do this would be to emit the category name as the key and the number as the value. This simplifies both map and reduce:
map = function() {
  emit(this.category, 1);
}

reduce = function(key, values) {
  var count = 0;
  values.forEach(function(val) {
    count += val;
  });
  return count;
}
This will sum the number of times each category appears. It also satisfies the requirements for MapReduce: it works correctly if the reduce function is never called (which will be the case for any category that only appears once), and it works correctly if the reduce function gets called multiple times (which will happen if any category appears more than 100 times).
As others pointed out, the aggregation framework makes the same exercise much simpler:
db.collection.aggregate({$group: {_id: "$category", count: {$sum: 1}}})
Note that this matches the output format of the second mapReduce I showed, not your original format, which used category names as keys. However, the aggregation framework will generally be significantly faster than MapReduce.

I agree with Neil Lunn's comment.
From the info provided, if you are on a version of MongoDB greater than or equal to 2.2, you can use the aggregation framework instead of map-reduce:
db.items.aggregate([
  { $group: { _id: '$category', category_count: { $sum: 1 } } }
])
This is a lot simpler and more performant (see Map/Reduce vs. Aggregation Framework).
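Since the question is Mongoid-based, here is a minimal sketch of running that same pipeline from Ruby; it assumes a reasonably recent Mongoid where Item.collection exposes the driver's aggregate method:
# Hedged sketch: the pipeline mirrors the shell command above.
counts = Item.collection.aggregate([
  { "$group" => { "_id" => "$category", "count" => { "$sum" => 1 } } }
])
counts.each { |doc| puts "#{doc['_id']}: #{doc['count']}" }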

Related

Performance of accessing table via reference vs ipairs loop

I'm modding a game. I'd like to optimize my code if possible for a frequently called function. The function will look into a dictionary table (consisting of an estimated 10-100 entries). I'm considering 2 patterns: a) direct reference and b) lookup with ipairs:
PATTERN A
tableA = { ["moduleName.propertyName"] = { some stuff } } -- the key is a string with dot inside, hence the quotation marks
result = tableA["moduleName.propertyName"]
PATTERN B
function lookup(type)
  local result
  for i, obj in ipairs(tableB) do
    if obj.type == type then
      result = obj
      break
    end
  end
  return result
end
***
tableB = {
  [1] = {
    type = "moduleName.propertyName",
    ... some stuff ...
  }
}
result = lookup("moduleName.propertyName")
Which pattern should be faster on average? I'd expect the 'native' referencing to be faster (it is certainly much neater), but maybe this is a silly assumption? I'm able to sort tableB (to some extent) in order of lookup frequency, whereas (as I understand it) tableA will have a random internal order in Lua by default, even if I declare the keys in order.
A lookup table will always be faster than searching a table every time.
For 100 elements that's one indexing operation compared to up to 100 loop cycles, iterator calls, conditional statements...
It is questionable, though, whether you would experience a difference in your application with so few elements.
So if you build that data structure for this purpose only, go with a lookup table right away.
If you already have this data structure for other purposes and you just want to look something up once, traverse the table with a loop.
If you already have this structure and you need to look values up more than once, build a lookup table for that purpose, as sketched below.
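A minimal sketch of that last case, assuming the tableB layout from the question: pay the traversal cost once to build a hash, and every later lookup is a single indexing operation.
-- Build the lookup table once, keyed by each entry's `type` string.
local lookupTable = {}
for _, obj in ipairs(tableB) do
  lookupTable[obj.type] = obj
end

-- Every later lookup is then a single hash access:
local result = lookupTable["moduleName.propertyName"]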

Lua: Sort table of numbers with multiple dots

I have a table of strings like this:
{
"1",
"1.5",
"3.13",
"1.2.5.7",
"2.5",
"1.3.5",
"2.2.5.7.10",
"1.17",
"1.10.5",
"2.3.14.9",
"3.5.21.9.3",
"4"
}
And would like to sort that like this:
{
"1",
"1.2.5.7",
"1.3.5",
"1.5",
"1.10.5",
"1.17",
"2.2.5.7.10",
"2.3.14.9",
"2.5",
"3.5.21.9.3",
"3.13",
"4"
}
How do I sort this in Lua? I know that table.sort() will be used; I just don't know the function (second parameter) to use for comparison.
Given your requirements, you probably want something like natural sort order. I described several possible solutions, as well as their impact on the results, in a blog post.
The simplest solution may look like the code below, but the post lists 5 different solutions of varying complexity along with their results:
function alphanumsort(o)
  local function padnum(d) return ("%03d%s"):format(#d, d) end
  table.sort(o, function(a, b)
    return tostring(a):gsub("%d+", padnum) < tostring(b):gsub("%d+", padnum)
  end)
  return o
end
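A quick usage sketch with a few of the strings from the question (padnum prefixes each digit run with its length, so plain string comparison then yields natural order):
local t = { "1", "1.5", "3.13", "1.2.5.7", "2.5", "1.3.5" }
alphanumsort(t)
-- t is now { "1", "1.2.5.7", "1.3.5", "1.5", "2.5", "3.13" }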
table.sort sorts ascending by default, so you don't have to provide a second parameter then. But as you're sorting strings, Lua will compare the strings character by character. Hence you must implement a sorting function that tells Lua which string comes first.
I just don't know the function (second parameter) to use for comparison.
That's why people wrote the Lua Reference Manual:
table.sort (list [, comp])
Sorts the list elements in a given order, in-place, from list[1] to list[#list]. If comp is given, then it must be a function that receives two list elements and returns true when the first element must come before the second in the final order, so that, after the sort, i <= j implies not comp(list[j], list[i]). If comp is not given, then the standard Lua operator < is used instead.
The comp function must define a consistent order; more formally, the function must define a strict weak order. (A weak order is similar to a total order, but it can equate different elements for comparison purposes.)
The sort algorithm is not stable: Different elements considered equal by the given order may have their relative positions changed by the sort.
Think about how you would do it with pen and paper. You would compare each number segment; as soon as one segment is smaller than the other, you know that number comes first.
So a solution would probably require you to extract those segments from the strings and convert them to numbers so you can compare their values, as sketched below.
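A minimal sketch of that approach, assuming your table of strings is named list and every entry is a dot-separated run of integers:
-- Extract the numeric segments of a string like "1.10.5" into {1, 10, 5}.
local function segments(s)
  local t = {}
  for seg in s:gmatch("%d+") do
    t[#t + 1] = tonumber(seg)
  end
  return t
end

table.sort(list, function(a, b)
  local sa, sb = segments(a), segments(b)
  -- Compare segment by segment; the first difference decides the order.
  for i = 1, math.min(#sa, #sb) do
    if sa[i] ~= sb[i] then return sa[i] < sb[i] end
  end
  -- If one string is a prefix of the other, the shorter one sorts first.
  return #sa < #sb
end)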

Right way to execute two queries in sequence in r2dbc

I work with R2DBC and I need to execute a query which returns a Flux of my entities; after that I need to convert these entities to DTOs. But to create a DTO I need to make another query to the database for each entity, which returns some special info from other tables. For example, this code doesn't work when the total number of ids exceeds 512:
orderRepository.findByIds(listIds).flatMap { order ->
    eventRepository.findByOrderId(order.id).map { events ->
        entityToDtoMapper.map(order, events, OrderWithEventsDto::class.java)
    }
}
concatMap doesn't help.
But this code works:
orderRepository.findByIds(listIds).collectList().flatMapMany { orders ->
    Flux.fromIterable(orders)
}.flatMap { order ->
    eventRepository.findByOrderId(order.id).collectList().flatMapMany { events ->
        Flux.fromIterable(events).map { event ->
            entityToDtoMapper.map(order, events, OrderWithEventsDto::class.java)
        }
    }
}
I think there’s a better solution to this problem. How am I supposed to do these queries right?
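One possibility worth testing, assuming the failure comes from flatMap subscribing to too many inner queries at once: Reactor's flatMap accepts an explicit concurrency argument, which bounds how many findByOrderId queries run simultaneously. The value 8 below is an arbitrary placeholder; tune it to your connection pool size.
orderRepository.findByIds(listIds)
    .flatMap({ order ->
        eventRepository.findByOrderId(order.id).collectList().map { events ->
            entityToDtoMapper.map(order, events, OrderWithEventsDto::class.java)
        }
    }, 8) // at most 8 inner queries in flight at a time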

Dynamically fire different activerecord queries using procs in Rails?

Say I have many different classes that inherit from Tree, and each of them implements a method called grow!, but with a slightly different ActiveRecord implementation. Say each method begins with an ActiveRecord query to find the right trees to grow, something like:
trees = Tree
.joins(:fruits)
.where(land_id: land.id)
.where(fruits: { sweet: true })
.where(fruits: { season_id: season.id })
Say the part we want to swap out from query to query is this part:
.where(fruits: { sweet: true })
Say we then want to build a WinterTree class with its own grow method, but it only grows non-sweet fruits, so we want to return trees that only grow non-sweet fruits. Is there any way to avoid rewriting the rest of the query and only swap out that one piece, perhaps writing the rest of the query in the parent Tree class? Is there any way to call AR query segments dynamically?
I found it easy to build dynamic queries using where statements in SQL, such as Tree.joins(:fruits).where("land_id = ?", land.id), etc. Below is what I did yesterday to give you some idea of what I'm talking about, but you'll need to extrapolate it to fit your needs:
query = ''
counter = 1
sets_of_data_ill_query.each do |set|
  if counter == 1
    query += "district = '#{set[0]}' AND second_district = '#{set[1]}'"
  else
    query += " OR district = '#{set[0]}' AND second_district = '#{set[1]}'"
  end
  counter += 1
end
voters = Voter.where(query)
NOTE: I knew the data I was querying was safe, so I just interpolated the raw values, but you'll want to do it as I showed in the first paragraph, with ? placeholders escaping the values, if the data will be entered by users. Also, since you're chaining where statements, you would want to use an "AND" instead of the "OR" I used if you need to loop through sets, etc.
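To address the original question about swapping out a single predicate: a minimal sketch using a shared scope on the parent class plus a per-subclass scope for the piece that varies. The names growable and sweetness are illustrative, not from the original code.
class Tree < ApplicationRecord
  # The shared part of the query lives in one scope on the parent.
  scope :growable, ->(land, season) {
    joins(:fruits)
      .where(land_id: land.id)
      .where(fruits: { season_id: season.id })
  }

  # Default predicate; subclasses override just this piece.
  scope :sweetness, -> { where(fruits: { sweet: true }) }

  def grow!(land, season)
    trees = self.class.growable(land, season).merge(self.class.sweetness)
    # ... grow the trees ...
  end
end

class WinterTree < Tree
  scope :sweetness, -> { where(fruits: { sweet: false }) }
end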

How to sort documents using the Erlang map reduce in Riak

I'm using Riak to store JSON documents right now, and I want to sort them based on some attribute. Let's say there's an "order" key, i.e.
{
  "someAttribute": "whatever",
  "order": 1
}
So I want to sort the documents based on the "order".
I am currently retrieving the documents in Riak with the Erlang interface. I can retrieve a document back as a string, but I don't really know what to do after that. I'm thinking the map function just returns the JSON document itself, and in the reduce function I'd check whether the item I'm looking at has a higher "order" than the head of the rest of the list, and if so append it to the beginning, and then return a lists:reverse.
Despite my ideas above, I've had zero results after almost an entire day. I'm so confused by the Erlang interface in Riak. Can someone provide insight on how to write this map/reduce function, or just how to parse the JSON document?
As far as I know, you do not have access to the full input list in the map phase. From map you emit each document as a one-element list.
The flow is: Inputs (all the docs to handle, as {Bucket, Key}) -> Map (handles a single doc) -> Reduce (the whole list emitted from the maps).
Maps are executed per doc on many nodes, whereas Reduce is done once on the so-called coordinator node (the one where the query was called).
Solution:
Define Inputs (as a list or a bucket)
Retrieve the value in Map and emit the whole doc or {Id, Val_to_sort_by}
Sort in Reduce (using a regular lists:keysort), as sketched below
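A minimal Erlang sketch of that shape (module wrapper omitted; function names are illustrative, and it assumes the stored values are JSON decodable with mochijson2, which ships with Riak):
%% Map phase: runs once per object; emit a {Key, Order} pair as a one-element list.
map_order(RiakObject, _KeyData, _Arg) ->
    {struct, Fields} = mochijson2:decode(riak_object:get_value(RiakObject)),
    Order = proplists:get_value(<<"order">>, Fields),
    [{riak_object:key(RiakObject), Order}].

%% Reduce phase: receives the whole emitted list; sort by the second tuple element.
reduce_sort_by_order(Values, _Arg) ->
    lists:keysort(2, Values).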
This is not a map/reduce solution, but you should check out Riak Search.
so i "solved" the problem using javascript, still can't do it using erlang.
here is my query
{"inputs":"test",
"query":[{"map":{"language":"javascript",
"source":"function(value, keyData, arg){ var data = Riak.mapValuesJson(value)[0]; var obj = {}; obj[data.order] = data; return [ obj ];}"}},
{"reduce":{"language":"javascript",
"source":"function(values, arg){ return [ values.reduce(function(acc, item){ for(var order in item){ acc[order] = item[order]; } return acc; }) ];}",
"keep":true}}
]
}
So in the map phase, all I do is create a new object, obj, with the order as the key and the data itself as the value. Visually, obj looks like this:
{"1": {"firstName": "John", "order": 1}}
In the reduce phase, I just merge each item into the accumulator. That is basically the sort, if you think about it, because when you're done everything will be put in order for you. I put 2 JSON documents in for testing: one is above, the other is just firstName: Billie, order: 2. Here is my result for the query above:
[{"1":{"firstName":"John","order":1},"2":{"firstName":"Billie","order":2}}]
So it works! But I still need to do this in Erlang. Any insights?
