In a Rails app I have a collection of events, each with a list of invitees. I would like to get all of the invitees as a single flattened list.
A simple .map is not an option, as there are thousands of events with just as many invitees each...
class Event
  include Mongoid::Document
  include Mongoid::Timestamps

  has_many :invitees
end
I was trying to use Map/Reduce for that with the following code:
map = %Q{
  function() {
    var event = this;
    this.invitees.forEach(function(invitee) {
      emit(invitee.id, { event: event.title, id: invitee.id });
    });
  }
}
reduce = %Q{
  function(key, values) {
    var result = [];
    values.forEach(function(v) {
      result.push(v);
    });
    return { value: result };
  }
}
Event.map_reduce(map, reduce).out(replace: "invitees")
But Mongo returns the following error: Mongo::Error::OperationFailure (TypeError: this.invitees is undefined).
Is Map/Reduce the right way to achieve this? If so, what am I doing wrong?
These days map/reduce is deprecated, and using the aggregation pipeline is recommended instead.
To perform the equivalent of a join in MongoDB, use the $lookup aggregation stage, then flatten the resulting array into one document per element with $unwind: https://docs.mongodb.com/manual/reference/operator/aggregation/unwind/
You could then use $project to reduce the size of the result set to the fields/subdocuments that you care about: https://docs.mongodb.com/manual/reference/operator/aggregation/project/
Mongoid does not provide aggregation pipeline helpers, so you have to go through the driver to use it. Driver documentation: https://docs.mongodb.com/ruby-driver/current/tutorials/ruby-driver-aggregation/
To get a collection object in a Rails app using Mongoid:
ModelClass.collection
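Putting that together, here is a minimal sketch of the pipeline, assuming invitees live in their own collection (as the has_many suggests) with an event_id foreign key; the field names are guesses based on the model above:

pipeline = [
  # join each event's invitees from their own collection (assumed key: event_id)
  { '$lookup' => { 'from' => 'invitees', 'localField' => '_id',
                   'foreignField' => 'event_id', 'as' => 'invitees' } },
  # emit one document per invitee
  { '$unwind' => '$invitees' },
  # keep only the fields we care about
  { '$project' => { '_id' => '$invitees._id', 'event' => '$title' } }
]

Event.collection.aggregate(pipeline).each do |doc|
  # each doc is one flattened invitee together with its event title
end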
Below is some code that should be optimized:
def statistics
  blogs = Blog.where(id: params[:ids])
  results = blogs.map do |blog|
    {
      id: blog.id,
      comment_count: blog.blog_comments.select("DISTINCT user_id").count
    }
  end
  render json: results.to_json
end
Each SQL query costs around 200 ms. If I have 10 blog posts, this function takes 2 s because it runs synchronously. I could use GROUP BY to optimize the query, but I am putting that aside for now because the task could just as well be a third-party request, and I am interested in how Ruby deals with async work.
In JavaScript, when I want to dispatch multiple asynchronous pieces of work and wait for all of them to resolve, I can use Promise.all(). I wonder what the alternatives are in Ruby for solving this problem.
Do I need a thread for this case? And is it safe to do that in Ruby?
There are multiple ways to solve this in Ruby, including promises (enabled by gems).
JavaScript accomplishes asynchronous execution with an event loop and event-driven I/O. There are event libraries that accomplish the same thing in Ruby; one of the most popular is eventmachine.
As you mentioned, threads can also solve this problem. Thread safety is a big topic, further complicated by the different threading models in different flavors of Ruby (MRI, JRuby, etc.). In summary I'll just say that threads can of course be used safely... there are just times when that is difficult. However, when used with blocking I/O (such as an API call or a database request), threads can be very useful and fairly straightforward. A solution with threads might look something like this:
# run blocking IO requests simultaneously
thread_pool = [
  Thread.new { execute_sql_1 },
  Thread.new { execute_sql_2 },
  Thread.new { execute_sql_3 },
  # ...
]
# wait for the slowest one to finish
thread_pool.each(&:join)
You also have access to other concurrency models, like the actor model, async classes, and promises, enabled by gems like concurrent-ruby.
Finally, Ruby concurrency can take the form of multiple processes communicating through built-in mechanisms (DRb, sockets, etc.) or through distributed message brokers (Redis, RabbitMQ, etc.).
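For something with a Promise.all() feel, here is a minimal sketch using promises from concurrent-ruby, reusing the hypothetical execute_sql_* calls from the snippet above:

require 'concurrent'

# each future runs its blocking call on a background thread pool
futures = [
  Concurrent::Promises.future { execute_sql_1 },
  Concurrent::Promises.future { execute_sql_2 },
  Concurrent::Promises.future { execute_sql_3 }
]

# zip combines the futures into one; value! blocks until all have resolved
# (and raises if any of them failed), much like Promise.all()
results = Concurrent::Promises.zip(*futures).value!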
Sure, just do the count in one database call:
blogs = Blog
  .select('blogs.id, COUNT(DISTINCT blog_comments.user_id) AS comment_count')
  .joins('LEFT JOIN blog_comments ON blog_comments.blog_id = blogs.id')
  .where(blogs: { id: params[:ids] })
  .group('blogs.id')

results = blogs.map do |blog|
  { id: blog.id, comment_count: blog.comment_count }
end

render json: results.to_json
You might need to adjust the statements depending on how your tables are named in the database, because I just guessed based on the names of your associations.
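For reference, here is a sketch of the same aggregation using ActiveRecord's grouped count instead of raw SQL, assuming a BlogComment model with blog_id and user_id columns (again guessed from the association names):

# => { blog_id => number of distinct commenters, ... }
counts = BlogComment
  .where(blog_id: params[:ids])
  .group(:blog_id)
  .distinct
  .count(:user_id)

results = counts.map { |id, count| { id: id, comment_count: count } }

Note that blogs without any comments won't show up in the hash, unlike with the LEFT JOIN above.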
Okay, generalizing a bit:
You have a list of data and want to operate on that data asynchronously. Assuming the operation is the same for all entries in your list, you can do this:
data = [1, 2, 3, 4] # Example data
operation = ->(data_entry) { data_entry * 2 } # Our operation: multiply by two
results = data.map { |e| Thread.new(e, &operation) }.map { |t| t.value }
Taking it apart:
data = [1, 2, 3, 4]
This could be anything from database IDs to URIs. Using numbers for simplicity here.
operation = ->(data_entry) { data_entry * 2 }
Definition of a lambda that takes one argument and does some calculation on it. This could be an API call, an SQL query, or any other operation that takes some time to complete. Again, for simplicity, I'm just multiplying the number by 2.
results =
This array will contain the results of all the asynchronous operations.
data.map{ |e| Thread.new(e, &operation) }...
For every entry in the data set, spawn a thread that runs operation and pass the entry as argument. This is the data_entry argument in the lambda.
...map{ |t| t.value }
Extract the value from each thread. This will wait for the thread to finish first, so by the end of this line all your data will be there.
Lambdas
Lambdas are really just glorified blocks that raise an error if you pass in the wrong number of arguments. The syntax ->(arguments) { code } is just syntactic sugar for lambda { |arguments| code }.
When a method accepts a block, like Thread.new { do_async_stuff_here }, you can also pass a lambda or Proc object prefixed with &, and it will be treated the same way.
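A tiny sketch of that, using a made-up doubling lambda:

double = ->(n) { n * 2 }           # a lambda taking one argument
thread = Thread.new(21, &double)   # the lambda becomes the thread's block
thread.value                       # => 42 (waits for the thread, returns its result)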
I am using the plugin: Grails CSV Plugin in my application with Grails 2.5.3.
I need to implement concurrency, for example with GPars, but I don't know how to do it.
Right now the processing is sequential. Here is an example of my code fragment:
Thanks.
Implementing concurrency in this case may not give you much of a benefit. It really depends on where the bottleneck is. For example, if the bottleneck is reading the CSV file, then there would be little advantage, because a file can only be read in sequential order. With that out of the way, here's the simplest example I could come up with:
import groovyx.gpars.GParsPool

def tokens = csvFileLoad.inputStream.toCsvReader(['separatorChar': ';', 'charset': 'UTF-8', 'skipLines': 1]).readAll()
def failedSaves = GParsPool.withPool {
    tokens.parallel
        .map { it[0].trim() }
        .filter { !Department.findByName(it) }
        .map { new Department(name: it) }
        .map { customImportService.saveRecordCSVDepartment(it) }
        .map { it ? 0 : 1 }
        .sum()
}

if (failedSaves > 0) transactionStatus.setRollbackOnly()
As you can see, the entire file is read first; hence the main bottleneck. The majority of the processing is done concurrently with the map(), filter(), and sum() methods. At the very end, the transaction is rolled back if any of the Departments failed to save.
Note: I chose to go with a map()-sum() pair instead of anyParallel() to avoid having to convert the parallel array produced by map() into a regular Groovy collection, perform the anyParallel() (which creates another parallel array), and then convert it back to a Groovy collection.
Improvements
As I already mentioned, in my example the CSV file is read completely before the concurrent execution begins. The code also attempts to save all of the Department instances, even if one fails to save. You may want that (which is what you demonstrated) or not.
I have this piece of code on my accounts model.
scope :unverified, lambda { |limit|
  select('accounts.id, accounts.email')
    .joins('LEFT OUTER JOIN verifications v ON v.account_id = accounts.id')
    .where('v.account_id IS NULL')
    .limit(limit)
}
Because my team has rubocop with strict settings, I cannot write it the normal way rails recommends which would look like this:
scope :unverified, ->(limit = nil) {
  select('accounts.id, accounts.email')
    .joins('LEFT OUTER JOIN verifications v ON v.account_id = accounts.id')
    .where('v.account_id IS NULL')
    .limit(limit)
}
Writing it the normal way triggers a rubocop error. I have the code close to the way I want, but I can't figure out exactly how to pass a default argument to a lambda. Can someone provide just a little push?
You can simply provide the defaults to the block parameters:
scope :unverified, lambda { |limit = nil|
  select('accounts.id, accounts.email')
    .joins('LEFT OUTER JOIN verifications v ON v.account_id = accounts.id')
    .where('v.account_id IS NULL')
    .limit(limit)
}
But I'm not sure it makes sense to pass nil to .limit(). You may want to default it to an integer instead.
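For instance, a minimal sketch with an integer default (the 50 here is an arbitrary, made-up value):

scope :unverified, lambda { |limit = 50|
  select('accounts.id, accounts.email')
    .joins('LEFT OUTER JOIN verifications v ON v.account_id = accounts.id')
    .where('v.account_id IS NULL')
    .limit(limit)
}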
Domain.where {
    1 == 0
}.count()
This returned all the elements of the domain class. The more general case:
Domain.where {
    false
}.count()
This will also return all elements. However, if I use one of the fields to build a false condition, the result is as expected.
My question is: why does this happen (in the first case)? If this is too naive a question, please just suggest some reading material. Thanks!
The version of Grails I use is 2.3.6 (it may be different in newer versions).
I'm not sure what you are trying to achieve, but here is an explanation (maybe a bit general because of that).
What you pass to the where method is actually a DSL for specifying SQL criteria; it just uses normal Groovy syntax to appear more natural. When you write someProperty != 5 && someOtherProperty == 6, it is not evaluated directly, but transformed to end up in an SQL query like select * from domain where some_property != 5 and some_other_property = 6.
Since you are not referencing any property in your criterion (1 == 0), it gets ignored by the detached criteria DSL evaluator, and you get the result of select * from domain. You can try, for example, id != id to see that you get an empty list as the result. If you examine the resulting query, you'll see that a where id <> id clause is included.
You can learn more about the where method: https://grails.github.io/grails-doc/latest/guide/GORM.html#whereQueries
Bear in mind that what you pass to the where method is a Closure, so the code inside is not executed up front, and is not necessarily evaluated in the context where it was declared. You can learn more about Groovy Closures, and also about creating DSLs with Groovy, though that is a bit of an advanced topic.
(I simplified the SQL queries to make them more understandable; if you activate the query log of Hibernate or of MySQL/whatever other DB you are using, you'll see they are bigger.)
To illustrate Deigote's explanation, here's a very crude implementation of a WHERE clause builder using the criteria format:
class WhereBuilder {
    def conditions = []

    def eq(column, value) {
        conditions << [column: column, operator: '=', value: value]
    }

    def ne(column, value) {
        conditions << [column: column, operator: '!=', value: value]
    }

    String toString() {
        'WHERE ' <<
            conditions.collect {
                "${it.column} ${it.operator} '${it.value}'"
            }.join(' AND ')
    }
}
def builder = new WhereBuilder()

builder.with {
    1 == 0
    false
    eq 'firstName', 'John'
    ne 'lastName', 'Smith'
}

assert builder.toString() == "WHERE firstName = 'John' AND lastName != 'Smith'"
As you can see, the expressions 1 == 0 and false have no effect.
I have a test case like this.
subject(:report) { @report.data }

it { expect(report[0][:id]).to eq(@c1.id) }
it { expect(report[1][:id]).to eq(@c2.id) }
it { expect(report[2][:id]).to eq(@c3.id) }
it { expect(report[0][:title]).to eq("Announcement3") }
it { expect(report[1][:title]).to eq("Announcement2") }
it { expect(report[2][:title]).to eq("Announcement1") }
I feel this is not a really efficient way to write it. Is there another way to make it more concise, so that it reads like a one-line condition?
Test Behavior, Not Composition
Always test behavior, not composition. What you ought to be testing here is that, given a set of fixed inputs, a report generates the same fixed output every single time. You aren't doing that; you're introspecting individual report elements, which is about composition.
Instead, consider using fixtures or FactoryGirl (or even just a setup block) to define fixed inputs, and then check that:
it 'creates valid report data' do
  expect(@report.data).to eq @sample.data
end
More on Behavior
If each element of your report is coming from a different method, you ought to be testing the behavior of each of those methods separately, rather than decomposing the final report. That is another way to make your test clearer and more meaningful, and addresses the "unit" in unit testing.
I'd rather write something like this:
it { expect(report[0]).to include(id: @c1.id, title: "Announcement3") }
it { expect(report[1]).to include(id: @c2.id, title: "Announcement2") }
it { expect(report[2]).to include(id: @c3.id, title: "Announcement1") }
It does not dig as deep into the report structure, and to me it looks more readable.
You could map reports to the keys you care about and make expectations on that:
expect(report.map { |h| h[:title] }).to eq [array, of, values]
You could use include to check each report member's keys:
expect(report[0]).to include id: @c1.id, title: "Announcement3"
You could build an "expected" object and test equality:
expected = [{ id: @c1.id, title: "Announcement3" }, ...]
expect(report).to eq expected