I am trying to group by a delegated join table. I have a tasks table, and each task has a project_id. The following code works well in my controller for grouping by project:
@task = Task.joins(:project).joins(:urgency).where(urgencies: { urgency_value: 7 }).group_by(&:project_id)
This returns a hash whose keys are the project ids I grouped by, with each value containing the tasks in that group. I can then loop through each task to retrieve its attributes.
However, each project belongs to a workspace (via a workspace_id). What I want is the same query, but grouped by workspace. The final aim is to build a table showing the workspace name in one column and the number of tasks for that workspace in the second column.
I have tried many combinations and searched many forums but after several hours still haven't been able to crack it.
If your only goal is to get the task counts per workspace, I think you want a different query.
@workspaces_with_task_counts =
  Workspace
    .joins(projects: :tasks) # tasks hang off projects, not workspaces directly
    .select('workspaces.name, count(tasks.id) as task_count')
    .group('workspaces.id, workspaces.name') # group by id as well so Postgres accepts the select
Then you can access the count like this:
@workspaces_with_task_counts.each do |workspace|
  puts "#{workspace.name}: #{workspace.task_count}"
end
EDIT 1
I think this is what you want:
Workspace
  .joins(projects: { tasks: :urgencies })
  .where(urgencies: { urgency_value: 7 })
  .group(:name)
  .count
which results in a hash containing all of the workspaces with at least one task where the urgency_value is 7, by name, with the number of tasks in that workspace:
{"workspace1"=>4, "workspace2"=>1}
EDIT 2
A single ordinary SQL query can't return both detail and summary information at once (window functions aside). But we can get all the data, then summarize it in memory with Ruby's group_by method:
Task
  .joins(project: :workspace)
  .includes(project: :workspace)
  .group_by { |task| task.project.workspace.name }
This produces the following data structure:
{
  "workspace1" => [task, task, task],
  "workspace2" => [task, task],
  "workspace3" => [task, task, task, task]
}
But, it does so at a cost. Grouping in memory is an expensive process. Running that query 10,000 times took ~15 seconds.
It turns out that executing two SQL queries is actually two orders of magnitude faster at ~0.2 seconds. Here are the queries:
tasks = Task.joins(project: :workspace).includes(project: :workspace)
counts = tasks.group('workspaces.name').count
The first query gets you all the tasks and preloads their associated project and workspace data. The second query uses ActiveRecord's group clause to construct the SQL statement to summarize the data. It returns this data structure:
{ "workspace1": 3, "workspace2": 2, "workspace3": 4 }
Databases are super efficient at set manipulation. It's almost always significantly faster to do that work in the database than in Ruby.
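To tie this back to the original goal (a table of workspace names and task counts), here is a minimal sketch reusing the two queries above; the urgency filter from EDIT 1 is omitted and can be merged back in:

tasks  = Task.joins(project: :workspace).includes(project: :workspace)
counts = tasks.group('workspaces.name').count

# One row per workspace: the name in one column, the task count in the other.
counts.each do |workspace_name, task_count|
  puts "#{workspace_name}\t#{task_count}"
end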
Related
I'm using Searchkick 3.1.0
I have to bulk index a certain collection of records. From what I read in the docs and have tried, I cannot pass a predefined array of ids to Searchkick's reindex method. I'm using the async mode.
If you do, for example, Klass.reindex(async: true), it will enqueue jobs with the batch_size specified in your options. The problem is that it loops through the entire model's ids and only then determines whether each record has to be indexed. For example, if I have 10,000 records in my database and a batch size of 200, it will enqueue 50 jobs. Each job then loops over its ids and indexes a record only if the search_import conditions are met.
This step is wasteful; I would like to enqueue a pre-filtered array of ids instead of looping through all the records.
I tried writing the following job to override the normal behavior:
def perform(class_name, batch_size = 100, offset = 0)
  model = class_name.constantize
  ids = model
          .joins(:user)
          .where(users: { active: true, id: $rollout.get(:searchkick).users })
          .where("#{model.table_name}.id > ?", offset)
          .pluck(:id)

  until ids.empty?
    ids_to_enqueue = ids.shift(batch_size)
    Searchkick::BulkReindexJob.perform_later(
      class_name: model.name,
      record_ids: ids_to_enqueue
    )
  end
end
The problem: the Searchkick mapping options are completely ignored when inserting records into Elasticsearch, and I can't figure out why. It doesn't take the specified match (text_middle) and instead creates a mapping with the default 'keyword' match.
Is there any clean way to bulk reindex an array of records without having to enqueue jobs containing unwanted records?
You should be able to reindex records based on a condition:
From the searchkick docs:
Reindex multiple records
Product.where(store_id: 1).reindex
You can put that in your own delayed job.
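For example, a minimal job sketch (the class and argument names are hypothetical) that feeds a pre-filtered array of ids into that relation-based reindex:

class FilteredReindexJob < ApplicationJob
  queue_as :default

  # ids is assumed to be your pre-filtered array of record ids.
  def perform(class_name, ids)
    model = class_name.constantize
    model.where(id: ids).reindex # relation-based reindex, as in the docs
  end
end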
What I have done, for some of our batch operations that already happen in a delayed job, is wrap the code in the job in the bulk block, also from the Searchkick docs:
Searchkick.callbacks(:bulk) do
  # wrap some batch operations on models instrumented with searchkick;
  # the bulk block should be outside of any transaction block
end
I've written a rake task to perform a PostgreSQL query. The task returns an object of class PG::Result.
Here's my task:
task export_products: :environment do
  results = execute "SELECT smth IN somewhere"
  if results.present?
    results
  else
    nil
  end
end

def execute(sql)
  ActiveRecord::Base.connection.execute(sql)
end
My further plan is to split the output in batches and save these batches one by one into a .csv file.
Here I get stuck. I cannot see how to call the find_in_batches method of the ActiveRecord::Batches module on a PG::Result.
How should I proceed?
Edit: I have a legacy SQL query to a legacy database.
If you look at how find_in_batches is implemented, you'll see that the algorithm is essentially:
1. Force the query to be ordered by the primary key.
2. Add a LIMIT clause to the query to match the batch size.
3. Execute the modified query from (2) to get a batch.
4. Do whatever needs to be done with the batch.
5. If the batch is smaller than the batch size, then the unlimited query has been exhausted, so we're done.
6. Get the maximum primary key value (last_max) from the batch you got in (3).
7. Add primary_key_column > last_max to the WHERE clause of the query from (2), run the query again, and go to step (4).
Pretty straightforward, and it could be implemented with something like this:
def in_batches_of(batch_size)
  last_max = 0 # This should be safe for any normal integer primary key.
  query = %Q{
    select whatever
    from table
    where what_you_have_now
      and primary_key_column > %{last_max}
    order by primary_key_column
    limit #{batch_size}
  }

  results = execute(query % { last_max: last_max }).to_a
  while results.any?
    yield results
    break if results.length < batch_size
    last_max = results.last['primary_key_column']
    results = execute(query % { last_max: last_max }).to_a
  end
end
in_batches_of(1000) do |batch|
  # Do whatever needs to be done with the `batch` array here
end
Where, of course, primary_key_column and friends have been replaced with real values.
If you don't have a primary key in your query then you can use some other column that sorts nicely and is unique enough for your needs. You could also use an OFFSET clause instead of the primary key but that can get expensive with large result sets.
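For the CSV goal mentioned in the question, here is a sketch of how in_batches_of could feed Ruby's CSV library; the column names are hypothetical and depend on your legacy query:

require 'csv'

CSV.open('export.csv', 'w') do |csv|
  csv << %w[id name] # header row; adjust to your query's columns

  in_batches_of(1000) do |batch|
    # execute(...).to_a yields each row as a hash with string keys.
    batch.each { |row| csv << row.values_at('id', 'name') }
  end
end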
I know this must be simple but I'm really lost.
Three models: Job, Task and Operation, as follows:
Job
  has_many :tasks

Task
  belongs_to :job
  belongs_to :operation

Operation
  has_many :tasks
Job has an attribute, total_pieces, which tells me how many pieces are needed.
For each Job you can add a number of Tasks, which can belong to different Operations (cutting, drilling, etc.), and for every Task you can set a number of pieces.
I don't know in advance how many Operations will be needed for a single Job, but I need to alert the user of the number of pieces left for each Operation when a new Task is inserted.
Let us make an example:
Job 1: total_pieces=100
- Task 1: operation 1(cutting), pieces=20
- Task 2: operation 1(cutting), pieces=30
- Task 3: operation 2(drilling), pieces=20
I need to alert the user that they still need to cut 50 pieces and to drill 80.
Hypothetically, if I add:
- Task 4: operation 3(bending), pieces=20
I need to alert the user that they also still need to bend 80 pieces.
So far I've managed to list all the kinds of Operations for each Job using map, but now I need to sum up the pieces of all Tasks with the same Operation type in a Job, and only for those Operations present in the Tasks belonging to that Job.
Is there any way to do this using map? Or do I need to write a query manually?
EDIT: this is what I managed to patch up at the moment.
A method operations_applied in Job gives me a list of ids for all the Operations used in Tasks queued for the Job.
Then another method, pieces_remaining_for(operation), gives me the remaining pieces for a single operation.
Finally, in the Job views where I need it, I iterate through all operations_applied, printing each pieces_remaining_for.
I know this is not particularly elegant, but so far it works. Any ideas to improve it?
Thank you.
If I'm not misunderstanding, it is not possible to do what you want with map alone, since map always preserves the array's size (arr.size == arr.map { ... }.size) and you want to reduce your array.
What you could do is something like this:
job = Job.first
operation_pieces = {}

job.tasks.each do |task|
  operation_pieces[task.operation.id] ||= { operation: task.operation }
  operation_pieces[task.operation.id][:pieces] ||= 0
  operation_pieces[task.operation.id][:pieces] += task.pieces
end
Now operation_pieces contains the sum of pieces for each operation, keyed by the operation's id. But I'm sure there is a more elegant version to do this ;)
EDIT: changed the code example to a hash
EDIT: and here is the more elegant version:
job = Job.first
job.tasks
   .group_by(&:operation)
   .map { |op, tasks| [op, tasks.sum(&:pieces)] }
   .to_h
The group_by groups your array of tasks by each task's operation (maybe you need to use group_by { |t| t.operation } instead; I'm not sure), and the map afterwards sums up the pieces of the tasks sharing the same operation. Finally, you end up with a hash of the type OPERATION => PIECES_SUM (INTEGER).
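For what it's worth, on Ruby 2.4+ Hash#transform_values produces the same OPERATION => PIECES_SUM hash a little more directly (a small variation on the above, not tested against your models):

job = Job.first
job.tasks
   .group_by(&:operation)
   .transform_values { |tasks| tasks.sum(&:pieces) }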
I assume the following attributes are required for your query:
- Task: number_of_pieces
- Job: name
- Operation: name
Job.joins("LEFT JOIN tasks ON jobs.id = tasks.job_id").joins("LEFT JOIN operations ON operations.id = tasks.operation_id").select(" SUM(tasks.number_of_pieces) as number_of_pieces, operations.name, jobs.name").group("operations.id, jobs.id")
This will list all the jobs and the sum of pieces required for each operation under them.
If you have the job_id for which you want the list of operations and pieces, use the code below (where(id: ...) rather than find, since find returns a single record that joins can't be chained onto):
Job.where(id: params[:job_id])
   .joins("LEFT JOIN tasks ON jobs.id = tasks.job_id")
   .joins("LEFT JOIN operations ON operations.id = tasks.operation_id")
   .select("SUM(tasks.number_of_pieces) AS number_of_pieces, operations.name, jobs.name")
   .group("operations.id, jobs.id")
Please ask in the comments if you need any explanation.
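Since the original goal was the number of pieces left per operation, a hypothetical follow-up sketch: also select jobs.total_pieces and subtract the summed pieces. The aliases here are illustrative, not from the original answer:

rows = Job.joins("LEFT JOIN tasks ON jobs.id = tasks.job_id")
          .joins("LEFT JOIN operations ON operations.id = tasks.operation_id")
          .select("jobs.name AS job_name, jobs.total_pieces, " \
                  "operations.name AS operation_name, " \
                  "SUM(tasks.number_of_pieces) AS number_of_pieces")
          .group("operations.id, jobs.id")

rows.each do |row|
  # remaining = the job's total minus pieces already assigned to this operation
  remaining = row.total_pieces - row.number_of_pieces.to_i
  puts "#{row.job_name} / #{row.operation_name}: #{remaining} pieces left"
end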
I have a habtm relationship between my Product and Category models.
I'm trying to write a query that searches for products with a minimum of 2 categories.
I got it working with the following code:
p = Product.joins(:categories).group("product_id").having("count(product_id) > 1")
p.length # 178
When iterating over it, though, every time I call product.categories it makes a new call to the database - not good. I want to prevent these calls and keep the same result. Doing more research, I've seen that I could include (includes) my categories table, which loads the whole table into memory so the database doesn't have to be called again while iterating. So I got it working with the following code:
p2 = Product.includes(:categories).joins(:categories).group("product_id").having("count(product_id) > 1")
p2.length # 178 - I compared and the objects are the same as last query
Here comes what I am confused about:
p.first.eql? p2.first # true
p.first.categories.eql? p2.first.categories # false
p.first.categories.length # 2
p2.first.categories.length # 1
Why do I get the right objects with the includes query, but not the right categories relationship?
It has something to do with the group method. Your p2 only contains the first category for each product.
You could break this up into two queries:
product_ids = Product.joins(:categories).group("product_id").having("count(product_id) > 1").pluck(:product_id)
result = Product.includes(:categories).find(product_ids)
Yeah, you hit the database twice, but at least you don't go to the database when you're iterating.
You must know that includes doesn't play well with joins (joins will just suppress the former).
Also, when you include an association, ActiveRecord figures out whether it'll use eager_load (with a LEFT JOIN) or preload (with a separate query); includes is just a wrapper for one of those two.
The thing is, preload plays well with joins! So you can do this:
products = Product.preload(:categories). # this will trigger a separate query
                   joins(:categories).   # this will build the relevant query
                   group("products.id").
                   having("count(product_id) > 1").
                   select("products.*")
Note that this will also hit the database twice, but you will not have any O(n) queries.
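A quick sanity check, assuming you watch the development log while running it: with the preload in place, the loop below issues no per-product queries.

products.each do |product|
  puts "#{product.id}: #{product.categories.size} categories" # no extra queries
end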
Is there a way to batch independent (i.e. not dependent on a previous value) queries into a single request to the database in order to prevent round trips?
I intend to use this to read data from several unrelated data models, so joins and views don't suffice.
Here's a very rough idea of what I'm attempting:
queries = {
  # What                Relation or SQL
  top_duck:    Duck.limit(1),          # We can't use first directly
  cow_count:   Cow.select('count(*)'), # We can't use count directly
  petals_lost: Flower.group('petals_lost').order('petals_lost')
                     .select('avg(petals_lost) as average_petals_lost'),
  avg_score:   "select avg(score), stddev(score) from games"
}
bq = BatchQuery.new(queries)
# And in one trip...
bq.execute # => Hash<DB::Result> suffices for now; next step: typecasting
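ActiveRecord has no BatchQuery, but for PostgreSQL one could sketch it on top of the pg gem, which can send several semicolon-separated statements in a single round trip (send_query) and read the results back one at a time (get_result). Everything below is a hypothetical sketch under those assumptions, not a drop-in implementation:

class BatchQuery
  def initialize(queries)
    @queries = queries # Hash of name => ActiveRecord::Relation or raw SQL
  end

  # Returns a Hash of name => PG::Result, fetched in one network round trip.
  def execute
    raw = ActiveRecord::Base.connection.raw_connection # PG::Connection
    sql = @queries.values
                  .map { |q| q.respond_to?(:to_sql) ? q.to_sql : q }
                  .join('; ')
    raw.send_query(sql) # one round trip, many statements

    @queries.keys.each_with_object({}) do |key, acc|
      acc[key] = raw.get_result # one PG::Result per statement, in order
    end
  ensure
    nil while raw && raw.get_result # drain so the connection stays usable
  end
end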