Rails 4 - sum values grouped by foreign key

I know this must be simple but I'm really lost.
Three models: Job, Task and Operation, as follows:
Job
  has_many :tasks
Task
  belongs_to :job
  belongs_to :operation
Operation
  has_many :jobs
Job has an attribute, total_pieces, which tells me how many pieces you need.
For each Job, you can add a number of Tasks, which can belong to different Operations (cutting, drilling, etc.) and for every task you can set up a number of pieces.
I don't know in advance how many Operations will be needed for a single Job, but I need to alert the user to the number of pieces left for that Operation when a new Task is inserted.
Let us make an example:
Job 1: total_pieces=100
- Task 1: operation 1(cutting), pieces=20
- Task 2: operation 1(cutting), pieces=30
- Task 3: operation 2(drilling), pieces=20
I need to alert the user that they still need to cut 50 pieces and to drill 80.
Hypothetically, if I add:
- Task 4: operation 3(bending), pieces=20
I need to alert the user that they also still need to bend 80 pieces.
So far I've managed to list all kinds of Operations for each Job using map, but now I need to sum the pieces of all Tasks with the same Operation type in a Job, and only for those Operations present in the Tasks belonging to that Job.
Is there any way to do this using map? Or do I need to write a query manually?
EDIT: this is what I managed to patch together for the moment.
A method operations_applied in Job gives me a list of ids for all the Operations used in Tasks queued for the Job.
Then another method, pieces_remaining_for(operation), gives me the remaining pieces for a single operation.
Finally, in the Job views where I need it, I iterate through operations_applied, printing pieces_remaining_for each one.
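In code, the current approach looks roughly like this (a sketch reconstructed from the description above; pieces and total_pieces are the attributes from the example):
class Job < ActiveRecord::Base
  has_many :tasks

  # ids of all Operations used by Tasks queued for this Job
  def operations_applied
    tasks.pluck(:operation_id).uniq
  end

  # pieces still to be done for a single operation
  def pieces_remaining_for(operation_id)
    total_pieces - tasks.where(operation_id: operation_id).sum(:pieces)
  end
end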
I know this is not particularly elegant but so far it works, any ideas to improve this?
Thank you.

If I'm not misunderstanding, it is not possible to do what you want with map, since map always returns an array of the same size as the original (arr.size == arr.map {...}.size) and you want to reduce your array.
What you could do is something like this:
job = Job.first
operation_pieces = {}
job.tasks.each do |task|
  operation_pieces[task.operation.id] ||= { operation: task.operation }
  operation_pieces[task.operation.id][:pieces] ||= 0
  operation_pieces[task.operation.id][:pieces] += task.pieces
end
Now operation_pieces contains the sum of pieces for each operation, keyed by operation id. But I'm sure there is a more elegant version of this ;)
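For the example job above, the hash would look something like this (assuming operation ids 1 and 2 and a name attribute on Operation):
{ 1 => { operation: #<Operation id: 1, name: "cutting">, pieces: 50 },
  2 => { operation: #<Operation id: 2, name: "drilling">, pieces: 20 } }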
EDIT: changed the code example to a hash
EDIT: and here is the more elegant version:
job = Job.first
job.tasks
   .group_by(&:operation)
   .map { |op, tasks| [op, tasks.sum(&:pieces)] }
   .to_h
The group_by groups your array of tasks by the operation of each task (you may need group_by { |t| t.operation } instead of the &: shorthand, I'm not sure) and the map afterwards sums up the pieces of the tasks for each operation. Finally, you end up with a hash of the form OPERATION => PIECES_SUM (an Integer).
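To get the "pieces left" numbers from the question, you could subtract each sum from the job's total_pieces, e.g. (a small variation on the snippet above):
job.tasks
   .group_by(&:operation)
   .map { |op, tasks| [op, job.total_pieces - tasks.sum(&:pieces)] }
   .to_h
# => for the example job: cutting => 50, drilling => 80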

I assume the following attributes are required for your query:
Task: number_of_pieces
Job: name
Operation: name
Job.joins("LEFT JOIN tasks ON jobs.id = tasks.job_id").joins("LEFT JOIN operations ON operations.id = tasks.operation_id").select(" SUM(tasks.number_of_pieces) as number_of_pieces, operations.name, jobs.name").group("operations.id, jobs.id")
This will list all the jobs and the sum of pieces required for each operation under them.
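Each returned record responds to the selected columns as methods. Since jobs.name and operations.name collide, aliasing them (an illustrative tweak, not part of the original snippet) makes the result easier to read:
rows = Job.joins("LEFT JOIN tasks ON jobs.id = tasks.job_id")
          .joins("LEFT JOIN operations ON operations.id = tasks.operation_id")
          .select("SUM(tasks.number_of_pieces) AS number_of_pieces, operations.name AS operation_name, jobs.name AS job_name")
          .group("operations.id, jobs.id")
rows.each { |row| puts "#{row.job_name} / #{row.operation_name}: #{row.number_of_pieces}" }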
If you have the job_id for which you want the list of operations and pieces, use the code below (find returns a single record, so the joins have to go on a relation instead):
Job.where(id: params[:job_id])
   .joins("LEFT JOIN tasks ON jobs.id = tasks.job_id")
   .joins("LEFT JOIN operations ON operations.id = tasks.operation_id")
   .select("SUM(tasks.number_of_pieces) AS number_of_pieces, operations.name, jobs.name")
   .group("operations.id, jobs.id")
Please ask in the comments if you need any explanation.

Related

Rails: Group By join table

I am trying to group by a delegate join table. I have a tasks table and each task has a project_id. The following code works well in my controller for me to group by project:
@task = Task.joins(:project).joins(:urgency).where(urgencies: {urgency_value: 7}).group_by(&:project_id)
This returns a hash where the key is what I have joined by and then the index contains each tasks within that group. I can then loop through each task to retrieve its attributes.
However, each project belongs to a workspace (via a workspace_id). What I want is to have the same query but to group by the workspace. The final aim is for me to create a table which shows the workspace name in one column and the number of tasks for that workspace in the second column.
I have tried many combinations and searched many forums but after several hours still haven't been able to crack it.
If your only goal is to get the task counts per workspace, I think you want a different query.
@workspaces_with_task_counts =
  Workspace
    .joins(projects: :tasks)
    .select('workspaces.name, count(tasks.id) as task_count')
    .group('workspaces.id, workspaces.name')
Then you can access the count like this:
@workspaces_with_task_counts.each do |workspace|
  puts "#{workspace.name}: #{workspace.task_count}"
end
EDIT 1
I think this is what you want:
Workspace
  .joins(projects: { tasks: :urgencies })
  .where(urgencies: { urgency_value: 7 })
  .group(:name)
  .count
which results in a hash containing all of the workspaces with at least one task where the urgency_value is 7, by name, with the number of tasks in that workspace:
{"workspace1"=>4, "workspace2"=>1}
EDIT 2
A single straightforward SQL query can't return both detail and summary information at once. But we can get all the data, then summarize it in memory with Ruby's group_by method:
Task
  .joins(project: :workspace)
  .includes(project: :workspace)
  .group_by { |task| task.project.workspace.name }
This produces the following data structure:
{
  "workspace1": [task, task, task],
  "workspace2": [task, task],
  "workspace3": [task, task, task, task]
}
But, it does so at a cost. Grouping in memory is an expensive process. Running that query 10,000 times took ~15 seconds.
It turns out that executing two SQL queries is actually two orders of magnitude faster at ~0.2 seconds. Here are the queries:
tasks = Task.joins(project: :workspace).includes(project: :workspace)
counts = tasks.group('workspaces.name').count
The first query gets you all the tasks and preloads their associated project and workspace data. The second query uses ActiveRecord's group clause to construct the SQL statement to summarize the data. It returns this data structure:
{ "workspace1": 3, "workspace2": 2, "workspace3": 4 }
Databases are super efficient at set manipulation. It's almost always significantly faster to do that work in the database than in Ruby.
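For example, a view that needs both the per-workspace counts and the task details could combine the two queries like this (a sketch reusing the variables above):
tasks  = Task.joins(project: :workspace).includes(project: :workspace)
counts = tasks.group('workspaces.name').count
tasks.group_by { |task| task.project.workspace.name }.each do |name, ws_tasks|
  puts "#{name} (#{counts[name]} tasks)"
end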

Custom bulk indexer for searchkick : mapping options are ignored

I'm using Searchkick 3.1.0
I have to bulk index a certain collection of records. From what I read in the docs and what I have tried, I cannot pass a predefined array of ids to Searchkick's reindex method. I'm using the async mode.
If you do, for example, Klass.reindex(async: true), it will enqueue jobs with the batch_size specified in your options. The problem is that it loops through all of the model's ids and then determines whether each record has to be indexed. For example, if I have 10,000 records in my database and a batch size of 200, it will enqueue 50 jobs. Each job then loops over its ids and, if the search_import conditions are met, indexes the record.
This step is useless; I would like to enqueue a pre-filtered array of ids to avoid looping through all the records.
I tried writing the following job to override the normal behavior:
def perform(class_name, batch_size = 100, offset = 0)
  model = class_name.constantize
  ids = model
    .joins(:user)
    .where(users: { active: true, id: $rollout.get(:searchkick).users })
    .where("#{class_name.downcase.pluralize}.id > ?", offset)
    .pluck(:id)
  until ids.empty?
    ids_to_enqueue = ids.shift(batch_size)
    Searchkick::BulkReindexJob.perform_later(
      class_name: model.name,
      record_ids: ids_to_enqueue
    )
  end
end
The problem: the Searchkick mapping options are completely ignored when the records are inserted into Elasticsearch, and I can't figure out why. It doesn't use the specified match (text_middle) and instead creates a mapping with the default 'keyword' type.
Is there any clean way to bulk reindex an array of records without having to enqueue jobs containing unwanted records?
You should be able to reindex records based on a condition. From the Searchkick docs ("Reindex multiple records"):
Product.where(store_id: 1).reindex
You can put that in your own delayed job.
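For example (the job class name and the store_id filter are hypothetical):
class ReindexStoreProductsJob < ApplicationJob
  queue_as :default

  def perform
    # reindexes only the matching records, per the Searchkick docs
    Product.where(store_id: 1).reindex
  end
end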
What I have done, for some of our batch operations that already happen in a delayed job, is wrap the code in the job in a bulk block, also described in the Searchkick docs.
Searchkick.callbacks(:bulk) do
  # ... wrap some batch operations on models instrumented with Searchkick;
  # the bulk block should be outside of any transaction block
end
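A concrete (hypothetical) example of wrapping such a batch operation:
Searchkick.callbacks(:bulk) do
  Product.where(store_id: 1).find_each do |product|
    product.update!(name: product.name.strip) # some batch change to indexed records
  end
end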

I need advice in speeding up this rails method that involves many queries

I'm trying to display a table that counts webhooks and arranges the various counts into cells by date_sent, sending_ip, and esp (email service provider). Within each cell, the controller needs to count the webhooks that are labelled with the "opened" event, and the "sent" event. Our database currently includes several million webhooks, and adds at least 100k per day. Already this process takes so long that running this index method is practically useless.
I was hoping that Rails could break down the enormous model into smaller lists using a line like this:
@today_hooks = @m_webhooks.where(:date_sent => this_date)
I thought that the queries after this line would only look at the partial list, instead of the full model. Unfortunately, running this index method generates hundreds of SQL statements, and they all look like this:
SELECT COUNT(*) FROM "m_webhooks" WHERE "m_webhooks"."date_sent" = $1 AND "m_webhooks"."sending_ip" = $2 AND (m_webhooks.esp LIKE 'hotmail') AND (m_webhooks.event LIKE 'sent')
It appears that the date_sent condition is included in every one of these queries, which implies that SQL is searching through all 1M records on every single query.
I've read over a dozen articles about increasing performance in Rails queries, but none of the tips that I've found there have reduced the time it takes to complete this method. Thank you in advance for any insight.
m_webhooks_controller.rb
def index
  def set_sub_count_hash(thip)
    {
      gmail_hooks: { opened: a = thip.gmail.send(@event).size, total_sent: b = thip.gmail.sent.size, perc_opened: find_perc(a, b) },
      hotmail_hooks: { opened: a = thip.hotmail.send(@event).size, total_sent: b = thip.hotmail.sent.size, perc_opened: find_perc(a, b) },
      yahoo_hooks: { opened: a = thip.yahoo.send(@event).size, total_sent: b = thip.yahoo.sent.size, perc_opened: find_perc(a, b) },
      other_hooks: { opened: a = thip.other.send(@event).size, total_sent: b = thip.other.sent.size, perc_opened: find_perc(a, b) }
    }
  end
  @m_webhooks = MWebhook.select("date_sent", "sending_ip", "esp", "event", "email").all
  @event = params[:event] || "unique_opened"
  @m_list_of_ips = [# List of three ip addresses]
  end_date = Date.today
  start_date = Date.today - 10.days
  date_range = (end_date - start_date).to_i
  @count_array = []
  date_range.times do |n|
    this_date = end_date - n.days
    @today_hooks = @m_webhooks.where(:date_sent => this_date)
    @count_array[n] = { :this_date => this_date }
    @m_list_of_ips.each_with_index do |ip, index|
      thip = @today_hooks.where(:sending_ip => ip) # stands for "Today Hooks ip"
      @count_array[n][index] = set_sub_count_hash(thip)
    end
  end
end
Well, your problem is very simple, actually. You gotta remember that when you use where(condition), the query is not executed right away in the DB.
Rails is smart enough to detect when you need a concrete result (a list, an object, or a count or #size like in your case) and keeps chaining your queries until you need one. In your code, you keep chaining conditions to the main query inside a loop (date_range). And it gets worse: you start another loop inside this one, adding conditions to each query created in the first loop.
Then you pass the query (not concrete yet: it has not been executed and does not have results!) to the method set_sub_count_hash, which goes on to materialize the same query many times.
Therefore you have something like:
10 (date_range) * 3 (ip list) * 8 (times the query is materialized in the set_sub_count_hash method) = 240
and then you have a problem.
What you want to do is run the whole query at once and group it by date, ip and email. You should have a hash structure after that, which you would pass to the set_sub_count_hash method, doing some Ruby gymnastics to get the counts you're looking for.
I imagine the query something like:
main_query = @m_webhooks.where('date_sent > ?', 10.days.ago.to_date)
                        .where(sending_ip: @m_list_of_ips)
Ok, now you have one query, which is nice, but I think you should separate it into 4 (gmail, hotmail, yahoo and other), which gives you 4 queries (the first one, the main_query, will not be executed until you ask for materialized results, don't forget it). Still, that's like 100 times faster.
I think this is the result that should be grouped, mapped and passed to set_sub_count_hash, instead of passing the raw query and calling methods on it many times. It will be a little work to do the grouping, mapping and counting, for sure, but hey, it's faster. =)
In case this helps anybody else: I learned how to fill a hash with counts in a much simpler way. More importantly, this approach runs a single query (as opposed to the 240 queries I was running before).
@count_array[esp_index][j] = MWebhook.where('date_sent > ?', start_date.to_date)
                                     .group('date_sent', 'sending_ip', 'event', 'esp').count
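The hash returned by group(...).count is keyed by arrays of the grouped values, so reading one cell looks like this (a sketch using the variables from the original method):
counts = MWebhook.where('date_sent > ?', start_date.to_date)
                 .group('date_sent', 'sending_ip', 'event', 'esp').count
opened = counts[[this_date, ip, 'unique_opened', 'gmail']] || 0
sent   = counts[[this_date, ip, 'sent', 'gmail']] || 0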

How do I ensure a model always uses a transaction and locks (in Rails)?

I noticed that Rails can have concurrency issues with multiple servers and would like to force my model to always lock. Is this possible in Rails, similar to unique constraints to force data integrity? Or does it just require careful programming?
Terminal One
irb(main):033:0* Vote.transaction do
irb(main):034:1* v = Vote.lock.first
irb(main):035:1> v.vote += 1
irb(main):036:1> sleep 60
irb(main):037:1> v.save
irb(main):038:1> end
Terminal Two, while sleeping
irb(main):240:0* Vote.transaction do
irb(main):241:1* v = Vote.first
irb(main):242:1> v.vote += 1
irb(main):243:1> v.save
irb(main):244:1> end
DB Start
select * from votes where id = 1;
id | vote | created_at | updated_at
----+------+----------------------------+----------------------------
1 | 0 | 2013-09-30 02:29:28.740377 | 2013-12-28 20:42:58.875973
After execution
Terminal One
irb(main):040:0> v.vote
=> 1
Terminal Two
irb(main):245:0> v.vote
=> 1
DB End
select * from votes where id = 1;
id | vote | created_at | updated_at
----+------+----------------------------+----------------------------
1 | 1 | 2013-09-30 02:29:28.740377 | 2013-12-28 20:44:10.276601
Other Example
http://rhnh.net/2010/06/30/acts-as-list-will-break-in-production
You are correct that transactions by themselves don't protect against many common concurrency scenarios, incrementing a counter being one of them. There isn't a general way to force a lock; you have to ensure you use it everywhere necessary in your code.
For the simple counter incrementing scenario there are two mechanisms that will work well:
Row Locking
Row locking will work as long as you do it everywhere in your code where it matters. Knowing where it matters may take some experience to develop an instinct for :/. If, as in your code above, you have two places where a resource needs concurrency protection and you only lock in one, you will have concurrency issues.
You want to use the with_lock form; this does a transaction and a row-level lock (table locks are obviously going to scale much more poorly than row locks, although for tables with few rows there is no difference, as PostgreSQL (not sure about MySQL) will use a table lock anyway). It looks like this:
v = Vote.first
v.with_lock do
  v.vote += 1
  sleep 10
  v.save
end
The with_lock creates a transaction, locks the row the object represents, and reloads the object's attributes, all in one step, minimizing the opportunity for bugs in your code. However, this does not necessarily help you with concurrency issues involving the interaction of multiple objects. It can work if a) all possible interactions depend on one object, and you always lock that object, and b) the other objects each only interact with one instance of that object, e.g. locking a user row and doing stuff with objects which all belong_to (possibly indirectly) that user object.
Serializable Transactions
The other possibility is to use serializable transactions. Since 9.1, PostgreSQL has "real" serializable transactions. These can perform much better than locking rows (though it is unlikely to matter in the simple counter-incrementing use case).
The best way to understand what serializable transactions give you is this: if you take all the possible orderings of all the (isolation: :serializable) transactions in your app, what happens when your app is running is guaranteed to always correspond with one of those orderings. With ordinary transactions this is not guaranteed to be true.
However, what you have to do in exchange is to take care of what happens when a transaction fails because the database is unable to guarantee that it was serializable. In the case of the counter increment, all we need to do is retry:
begin
  Vote.transaction(isolation: :serializable) do
    v = Vote.first
    v.vote += 1
    sleep 10 # this is to simulate concurrency
    v.save
  end
rescue ActiveRecord::StatementInvalid => e
  sleep rand / 100 # this is NECESSARY in scalable real-world code,
                   # although the amount of sleep is something you can tune.
  retry
end
Note the random sleep before the retry. This is necessary because failed serializable transactions have a non-trivial cost, so if we don't sleep, multiple processes contending for the same resource can swamp the db. In a heavily concurrent app you may need to gradually increase the sleep with each retry. The random is VERY important to avoid harmonic deadlocks -- if all the processes sleep the same amount of time they can get into a rhythm with each other, where they all are sleeping and the system is idle and then they all try for the lock at the same time and the system deadlocks causing all but one to sleep again.
When the transaction that needs to be serializable involves interaction with a source of concurrency other than the database, you may still have to use row-level locks to accomplish what you need. An example of this would be when a state machine transition determines what state to transition to based on a query to something other than the db, like a third-party API. In this case you need to lock the row representing the object with the state machine while the third party API is queried. You cannot nest transactions inside serializable transactions, so you would have to use object.lock! instead of with_lock.
Another thing to be aware of is that any objects fetched outside the transaction(isolation: :serializable) should have reload called on them before use inside the transaction.
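A minimal sketch of that reload pattern:
v = Vote.first # fetched outside the transaction
Vote.transaction(isolation: :serializable) do
  v.reload # refresh the attributes inside the transaction before using them
  v.vote += 1
  v.save
end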
ActiveRecord always wraps save operations in a transaction.
For your simple case it might be best to just use a SQL update instead of performing the logic in Ruby and then saving. Here is an example which adds a model method to do this:
class Vote < ActiveRecord::Base
  def vote!
    self.class.where(id: id).update_all("vote = vote + 1")
  end
end
This method avoids the need for locking in your example. If you need more general database locking, see David's suggestion.
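Usage is then simply:
Vote.first.vote!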
You can do the following in your model:
class Vote < ActiveRecord::Base
  validate :handle_conflict, on: :update
  attr_accessible :original_updated_at
  attr_writer :original_updated_at

  def original_updated_at
    @original_updated_at || updated_at
  end

  def handle_conflict
    # If we want to use this across multiple models
    # then extract this to a module
    if @conflict || updated_at.to_f > original_updated_at.to_f
      @conflict = true
      @original_updated_at = nil
      # If two updates are made at the same time a validation error
      # is displayed and the changed fields are listed
      errors.add :base, 'This record changed while you were editing'
      changes.each do |attribute, values|
        errors.add attribute, "was #{values.first}"
      end
    end
  end
end
The original_updated_at is a virtual attribute that is set from the form. handle_conflict fires when the record is updated and checks whether the updated_at attribute in the database is later than the hidden one (defined on your page). By the way, you should define the following in your app/views/votes/_form.html.erb:
<%= f.hidden_field :original_updated_at %>
If there is a conflict, the validation error is raised.
And if you are using Rails 4 you won't have attr_accessible and will need to add :original_updated_at to your vote_params method in your controller.
Hopefully this sheds some light.
For a simple +1:
Vote.increment_counter :vote, Vote.first.id
Because vote was used both for the table name and the field, this is how the two are used:
TableName.increment_counter :field_name, id_of_the_row
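Under the hood this issues a single atomic UPDATE, roughly the following SQL (exact form varies by Rails version and adapter):
UPDATE votes SET vote = vote + 1 WHERE id = 1;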

Optimised solution to store order of rows in a database

I have a data model in Doctrine/symfony. I have a 'Course' which has many 'Lesson's. For each lesson I need to calculate the order (by date) that the lesson appears. For example, the Course 'Java for beginners' might have 10 lessons during October, I need to retrieve the order of these lessons so that the first one is called 'Java for beginners 1' etc.
Currently I have a getTitle() method in my Lesson model that queries the database to establish the number. This works fine. However, when there are 400 lessons on screen (this is a typical use case) it results in 400+ queries.
I have read about Doctrine behaviours and, as I understand it, I could add a behaviour so that each time a lesson is added, edited or deleted, all the sequence numbers are recalculated and stored in the database. However, I cannot get this to work efficiently.
Is there a more efficient method than the ones I have mentioned?
Cheers,
Matt
Doctrine_Query::create()
  ->from('Lesson l')
  ->leftJoin('l.Course c')
  ->leftJoin('l.Teacher t')
  ->leftJoin('l.Students sl')
  ->andWhere('l.date BETWEEN ? AND ?', array(date('Y-m-d', $start_date), date('Y-m-d', $end_date)))
  ->orderBy('l.date, l.time');
The above code returns all my lesson information (apart from the lesson number).
$q = Doctrine_Query::create()
  ->select('COUNT(l.id) as count')
  ->from('Lesson l')
  ->leftJoin('l.Course c')
  ->where('c.id = ?', $this->course_id)
  ->andWhere('TIMESTAMP(l.date, l.time) < ?', $this->date . ' ' . $this->time)
  ->orderBy('l.date, l.time');
$lessons = $q->fetchOne();
return $lessons->count + 1;
And this code is in the Lesson model as a function. It calculates the sequence number of a given lesson and returns it as an integer. This is the method that gets called 400+ times. I have tried adding this as a subquery to the first query, but with no success.
Behaviour Code
public function postInsert(Doctrine_Event $event) {
    $invoker = $event->getInvoker();
    $table = Doctrine::getTable('Lesson');
    // Course query
    $cq = Doctrine::getTable('Lesson')->createQuery();
    $cq->select('COUNT(l.id) as count')
       ->from('Lesson l')
       ->leftJoin('l.Course c')
       ->where('c.id = ?', $invoker->Course->id)
       ->andWhere('TIMESTAMP(l.date, l.time) < ?', $invoker->date . ' ' . $invoker->time)
       ->orderBy('l.date, l.time');
    $lessons = $cq->fetchOne();
    $q = $table->createQuery();
    $q->update()
      ->set('sequence_number', $lessons->count + 1)
      ->where('id = ?', $invoker->id)
      ->execute();
}
The obvious problem here is that it only updates the invoked lesson. If one lesson updates its sequence, all lessons should update their sequence numbers. However, the above code causes a memory problem when I try to populate the database from my fixtures (about 5000 lessons, it fails at ~1500).
Edit:
You're running out of memory because you're holding tons of objects in memory at once. You should try batching in groups of 1000 (since you're failing at ~1500).
Another thing you could do is not put the listener/behaviour on the model directly, but instead make a postTransactionCommit listener. In this listener you can check whether any records in the lesson table were part of the commit and then update them all with the proper sequence.
Edit:
So, as per our comments, let's go in the other direction:
Doctrine_Core::getTable('Course')->createQuery('c')
  ->leftJoin('c.Lessons l')
  ->orderBy('l.date_column')
  ->execute();
Well, normally to avoid the extra queries you would perform a join within the query so that you don't have the extra expense. For example:
Doctrine_Core::getTable('Lesson')->createQuery('l')
  ->leftJoin('l.Course')
  ->orderBy('l.date_column')
  ->execute();
This way the Course and Lesson info will come down in one query and the Course will be hydrated along with the Lesson.
This of course does nothing for your sequence number. That is really a separate issue, and I would think the best thing would be to use a behaviour for it, as you yourself suggested. If you post your behaviour code and a description of the problem(s), I'm sure we could help you with it.
