I have a Sidekiq worker that processes a series of tasks in batches. Once it completes a job, it updates a tracker table with the success/failure of the task. Each batch has a unique identifier that is passed to the worker, and the worker queries the tracker table for that unique id and updates the matching row with an ActiveRecord query similar to:
cpr = MODEL.find(tracker_unique_id)
cpr.update_attributes(:attempted => cpr[:attempted] + 1, :success => cpr[:success] + 1)
What I have noticed is that the tracker only records one completed task, even though I can see from the Sidekiq log and another results table that x tasks finished running.
Can anyone help me with this?
Your update_attributes call has a race condition: you cannot increment like that safely, because multiple threads read the same starting value and then stomp on each other's writes. You need an UPDATE statement that performs the increment in the database itself:
update models set attempted = attempted + 1 where tracker_unique_id = ?
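In ActiveRecord you can get the same atomic increment without hand-writing SQL. A minimal sketch, assuming the MODEL class and the column names from the question:
# Issues a single UPDATE that increments both columns in the database
# (attempted = COALESCE(attempted, 0) + 1, ...), so concurrent workers
# cannot overwrite each other's counts.
MODEL.update_counters(tracker_unique_id, attempted: 1, success: 1)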
I'm working on an app where users can un-bookmark a post they've bookmarked before. But I realized that if a particular user sends multiple requests to un-bookmark the same post, node properties get set multiple times. For example, if user 1 bookmarks post 1, noOfBookmarks (on both the user and the post) increases by 1, and when they un-bookmark it, noOfBookmarks decreases by 1. But during concurrent requests I sometimes get incorrect or negative noOfBookmarks values, depending on the number of requests. I'm using MATCH, which returns 0 rows when the pattern can't be found.
I think the problem is the isolation level Neo4j uses. During concurrent requests, the changes made by the first query to run are not visible to other transactions until the first transaction commits. So the MATCH in the other transactions still returns rows, which is why I end up with invalid properties. I think what I need is for the transactions to execute sequentially, or to take an exclusive read lock.
I've tried setting a property on the user and post nodes (before MATCHing the bookmark relationship) so that the first transaction takes a write lock on those nodes. I thought the other transactions would wait at that point for the write lock to be released before continuing, but it didn't work.
How do I ensure that only the first of the concurrent requests modifies the graph and that the other transactions stop at that MATCH (which is the behaviour with sequential requests)?
This is my cypher query:
MATCH (user:User { id: $userId })
MATCH (post:Post { id: $postId })
WITH user, post
MATCH (user)-[bookmarkRel:BOOKMARKED_POST]->(post)
WITH bookmarkRel, user, post
DELETE bookmarkRel
WITH post, user
SET post.noOfBookmarks = post.noOfBookmarks - 1,
user.noOfBookmarks = user.noOfBookmarks - 1
RETURN post { .* }
Thank you
I am trying to group by a delegated join table. I have a tasks table, and each task has a project_id. The following code works well in my controller for grouping by project:
@task = Task.joins(:project).joins(:urgency).where(urgencies: { urgency_value: 7 }).group_by(&:project_id)
This returns a hash where the key is the project_id I grouped by and the value is an array of the tasks in that group. I can then loop through each task to retrieve its attributes.
However, each project belongs to a workspace (via a workspace_id). What I want is the same query but grouped by workspace. The final aim is to build a table that shows the workspace name in one column and the number of tasks for that workspace in the second column.
I have tried many combinations and searched many forums, but after several hours I still haven't been able to crack it.
If your only goal is to get the task counts per workspace, I think you want a different query.
@workspaces_with_task_counts =
  Workspace
    .joins(projects: :tasks)
    .select('workspaces.id, workspaces.name, count(tasks.id) as task_count')
    .group('workspaces.id, workspaces.name')
Then you can access the count like this:
@workspaces_with_task_counts.each do |workspace|
  puts "#{workspace.name}: #{workspace.task_count}"
end
EDIT 1
I think this is what you want:
Workspace
.joins(projects: { tasks: :urgencies })
.where(urgencies: {urgency_value: 7})
.group(:name)
.count
which results in a hash of all the workspaces that have at least one task with an urgency_value of 7, keyed by workspace name, with the number of such tasks in each workspace:
{"workspace1"=>4, "workspace2"=>1}
EDIT 2
A single SQL query can't conveniently return both the detail rows and the summary counts in the shape we need here. But we can fetch all the data, then summarize it in memory with Ruby's group_by method:
Task
.joins(project: :workspace)
.includes(project: :workspace)
.group_by { |task| task.project.workspace.name }
This produces the following data structure:
{
"workspace1": [task, task, task],
"workspace2": [task, task],
"workspace3": [task, task, task, task]
}
But, it does so at a cost. Grouping in memory is an expensive process. Running that query 10,000 times took ~15 seconds.
It turns out that executing two SQL queries is actually two orders of magnitude faster at ~0.2 seconds. Here are the queries:
tasks = Task.joins(project: :workspace).includes(project: :workspace)
counts = tasks.group('workspaces.name').count
The first query gets you all the tasks and preloads their associated project and workspace data. The second query uses ActiveRecord's group clause to construct the SQL statement to summarize the data. It returns this data structure:
{ "workspace1": 3, "workspace2": 2, "workspace3": 4 }
Databases are super efficient at set manipulation. It's almost always significantly faster to do that work in the database than in Ruby.
I'm using Searchkick 3.1.0
I have to bulk index a certain collection of records. From what I've read in the docs and tried, I cannot pass a predefined array of ids to Searchkick's reindex method. I'm using the async mode.
If you do, for example, Klass.reindex(async: true), it will enqueue jobs using the batch_size specified in your options. The problem is that it loops through all of the model's ids and only then determines whether each record has to be indexed. For example, if I have 10,000 records in my database and a batch size of 200, it will enqueue 50 jobs. Each job then loops over its ids and indexes a record only if the search_import conditions are met.
This step is wasteful; I would like to enqueue a pre-filtered array of ids to avoid looping through all the records.
I tried writing the following job to override the normal behavior:
def perform(class_name, batch_size = 100, offset = 0)
  model = class_name.constantize
  ids = model
    .joins(:user)
    .where(user: { active: true, id: $rollout.get(:searchkick).users })
    .where("#{class_name.downcase.pluralize}.id > ?", offset)
    .pluck(:id)

  until ids.empty?
    ids_to_enqueue = ids.shift(batch_size)
    Searchkick::BulkReindexJob.perform_later(
      class_name: model.name,
      record_ids: ids_to_enqueue
    )
  end
end
The problem: the Searchkick mapping options are completely ignored when the records are inserted into Elasticsearch, and I can't figure out why. It doesn't use the specified match (text_middle) and instead creates a mapping with the default 'keyword' type.
Is there any clean way to bulk reindex an array of records without having to enqueue jobs containing unwanted records?
You should be able to reindex records based on a condition:
From the searchkick docs:
Reindex multiple records
Product.where(store_id: 1).reindex
You can put that in your own delayed job.
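For example, a minimal sketch of such a job (the job class name and the Product scope are illustrative, not from the question):
class ReindexFilteredProductsJob < ApplicationJob
  queue_as :default

  def perform(ids)
    # Relation-level reindex: only the pre-filtered records are reindexed
    # into the model's existing Searchkick index.
    Product.where(id: ids).reindex
  end
end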
What I have done, for some of our batch operations that already happen in a delayed job, is wrap the code in the job in the bulk block, which is also in the Searchkick docs.
Searchkick.callbacks(:bulk) do
  # wrap some batch operations on a model instrumented with Searchkick.
  # the bulk block should be outside of any transaction block
end
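As a rough illustration (the model and attribute here are made up, not from the question), that ends up looking something like:
# Illustrative only; keep the bulk block outside any database transaction.
Searchkick.callbacks(:bulk) do
  Product.where(store_id: 1).find_each do |product|
    product.update!(featured: true)  # hypothetical batch operation
  end
end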
Sidekiq will run 25 concurrent jobs in our scenario. We need to get a single integer as the result of each job and tally all of the results together. In this case we are querying an external API and returning counts. We want the total count from all of the API requests.
The Report object stores the final total. Postgresql is our database.
At the end of each job, we increment the report with the additional records found.
Report.find(report_id).increment(:total, api_response_total)
Is this a good approach to track the running total? Will there be Postgresql concurrency issues? Is there a better approach?
increment shouldn't lead to concurrency issues: at the SQL level it updates the column atomically with COALESCE(total, 0) + api_response_total. Race conditions only appear if you do the addition manually and then save the object:
report = Report.find(report_id)
report.total += api_response_total
report.save # NOT SAFE
Note: even with increment!, the value at the Rails level can be stale, but it will be correct at the database level:
# suppose initial `total` is 0
report = Report.find(report_id) # Thread 1 at time t0
report2 = Report.find(report_id) # Thread 2 at time t0
report.increment!(:total) # Thread 1 at time t1
report2.increment!(:total) # Thread 2 at time t1
report.total #=> 1 # Thread 1 at time t2
report2.total #=> 1 # Thread 2 at time t2
report.reload.total #=> 2 # Thread 1 at time t3, value was stale in object, but correct in db
Is this a good approach to track the running total? Will there be Postgresql concurrency issues? Is there a better approach?
I would prefer to do this with Sidekiq Batches. They allow you to run a batch of jobs and assign a callback to the batch, which executes once all jobs are processed. Example:
batch = Sidekiq::Batch.new
batch.description = "Batch description (this is optional)"
batch.on(:success, MyCallback, :to => user.email)
batch.jobs do
rows.each { |row| RowWorker.perform_async(row) }
end
puts "Just started Batch #{batch.bid}"
We need to get a single integer as the result of each job and tally all of the results together.
Note that a Sidekiq job doesn't do anything with its return value; the value is GC'ed and ignored. So with the batch strategy above, you will not have the jobs' results available in the callback. You can tailor the solution to work around that: for example, keep a LIST in Redis keyed by the batch id, push each job's value onto it when the job completes (in perform), and in the callback simply read the list and sum it, as sketched below.
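A rough sketch of that idea, assuming Sidekiq Pro batches (where bid is available inside the job); the worker body, the callback class, the Redis key format, and the report_id option are illustrative:
class RowWorker
  include Sidekiq::Worker

  def perform(row_id)
    count = fetch_count_from_external_api(row_id)  # hypothetical API call
    # Push this job's result onto a Redis list keyed by the batch id,
    # so the batch callback can read all the results later.
    Sidekiq.redis { |conn| conn.rpush("batch:#{bid}:counts", count) }
  end
end

class MyCallback
  def on_success(status, options)
    # Sum every per-job count collected for this batch and store the total.
    counts = Sidekiq.redis { |conn| conn.lrange("batch:#{status.bid}:counts", 0, -1) }
    Report.find(options["report_id"]).update!(total: counts.sum(&:to_i))
  end
end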
Foobar.find(1).votes_count returns 0.
In rails console, I am doing:
10.times { Resque.enqueue(AddCountToFoobar, 1) }
My resque worker:
class AddCountToFoobar
  @queue = :low

  def self.perform(id)
    foobar = Foobar.find(id)
    foobar.update_attributes(:votes_count => foobar.votes_count + 1)
  end
end
I would expect Foobar.find(1).votes_count to be 10, but instead it returns 4. If I run 10.times { Resque.enqueue(AddCountToFoobar, 1) } again, I see the same behaviour: votes_count only increases by 4, sometimes 5.
Can anyone explain this?
This is a classic race condition scenario. Imagine that only 2 workers exist and that they each run one of your vote incrementing jobs. Imagine the following sequence.
Worker 1: loads foobar (vote count == 1)
Worker 2: loads foobar (vote count == 1, in a separate Ruby object)
Worker 1: increments the vote count (now == 2) and saves
Worker 2: increments its copy of foobar (vote count now == 2) and saves, overwriting what Worker 1 did
Although 2 workers each ran 1 update job, the count only increased by 1, because each was operating on its own copy of foobar, unaware of the change the other worker was making.
To solve this, you could either do an in-place style update, i.e.
UPDATE foos SET count = count + 1
or use one of the two forms of locking that ActiveRecord supports (pessimistic locking and optimistic locking).
The former works because the database ensures that you don't have concurrent updates on the same row at the same time.
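A small sketch of the in-place approach from ActiveRecord, reusing the worker from the question (the only assumption is that the column is really named votes_count):
class AddCountToFoobar
  @queue = :low

  def self.perform(id)
    # Single atomic UPDATE (SET votes_count = votes_count + 1) executed by
    # the database, so concurrent workers cannot overwrite each other.
    Foobar.increment_counter(:votes_count, id)
  end
end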
Looks like ActiveRecord is not thread-safe in Resque (or rather redis, I guess). Here's a nice explanation.
As Frederick says, you're observing a race condition. You need to serialize access to the critical section from the time you read the value to the time you update it.
I'd try to use pessimistic locking:
http://api.rubyonrails.org/classes/ActiveRecord/Transactions/ClassMethods.html
http://api.rubyonrails.org/classes/ActiveRecord/Locking/Pessimistic.html
foobar = Foobar.find(id)
foobar.with_lock do
  foobar.update_attributes(:votes_count => foobar.votes_count + 1)
end