Mongoid 4 (GitHub master) creating documents with duplicate IDs - ruby-on-rails

I am running a high-traffic test with Sidekiq that creates MongoDB-based objects using Mongoid as my driver in a Rails 4 app. The issue I'm seeing is that when a PlayByPlay document is supposed to have a unique game_id, multiple PlayByPlay objects get created with the exact same game_id. I've enforced the unique constraint on MongoDB as well, and this is still happening.

Here's my document, its embedded document, and a glimpse into how I'm creating the documents. This is all happening in a threaded environment using Sidekiq, and I'm not sure if there is a way to work around it. My write concern is set to 1 in mongoid.yml, and it looks like the safe option was removed in master, as was persist_in_safe_mode. Code below -- any suggestions on how to handle this properly would be appreciated. This is not a replica set; it's a single MongoDB server performing all read/write requests at this time.
module MLB
  class Play
    include Mongoid::Document
    include Mongoid::Timestamps
    embedded_in :play_by_play
    field :batter #, type: Hash
    field :next_batter #, type: Hash
    field :pitchers #, type: Array
    field :pitches #, type: Array
    field :fielders #, type: Array
    field :narrative, type: String
    field :seq_id, type: Integer
    field :inning, type: Integer
    field :outs
    field :no_play
    field :home_team_score
    field :away_team_score
  end
  class PlayByPlay
    include Mongoid::Document
    include Mongoid::Timestamps
    embeds_many :plays, cascade_callbacks: true
    accepts_nested_attributes_for :plays
    field :sport
    field :datetime, type: DateTime
    field :gamedate, type: DateTime
    field :game_id
    field :home_team_id
    field :away_team_id
    field :home_team_score
    field :away_team_score
    field :season_year
    field :season_type
    field :location
    field :status
    field :home_team_abbr
    field :away_team_abbr
    field :hp_umpire
    field :fb_umpire
    field :sb_umpire
    field :tb_umpire

    index({game_id: 1})
    index({away_team_id: 1})
    index({home_team_id: 1})
    index({season_type: 1})
    index({season_year: 1})
    index({"plays.seq_id" => 1}, {unique: true, drop_dups: true})

    #validates 'play.seq_id', uniqueness: true
    validates :game_id, presence: true, uniqueness: true
    validates :home_team_id, presence: true
    validates :away_team_id, presence: true
    validates :gamedate, presence: true
    validates :datetime, presence: true
    validates :season_type, presence: true
    validates :season_year, presence: true
    def self.parse!(entry)
      @document = Nokogiri::XML(entry.data)
      xslt = Nokogiri::XSLT(File.read("#{$XSLT_PATH}/mlb_pbp.xslt"))
      transform = xslt.apply_to(@document)
      json_document = JSON.parse(transform)
      obj = find_or_create_by(game_id: json_document['game_id'])
      obj.sport = json_document['sport']
      obj.home_team_id = json_document['home_team_id']
      obj.away_team_id = json_document['away_team_id']
      obj.home_team_score = json_document['home_team_score']
      obj.away_team_score = json_document['away_team_score']
      obj.season_type = json_document['season_type']
      obj.season_year = json_document['season_year']
      obj.location = json_document['location']
      obj.datetime = DateTime.strptime(json_document['datetime'], "%m/%d/%y %H:%M:%S")
      obj.gamedate = DateTime.strptime(json_document['game_date'], "%m/%d/%Y %H:%M:%S %p")
      obj.status = json_document['status']
      obj.home_team_abbr = json_document['home_team_abbr']
      obj.away_team_abbr = json_document['away_team_abbr']
      obj.hp_umpire = json_document['hp_umpire']
      obj.fb_umpire = json_document['fb_umpire']
      obj.sb_umpire = json_document['sb_umpire']
      obj.tb_umpire = json_document['tb_umpire']
      p = obj.plays.build(seq_id: json_document['seq_id'])
      p.batter = json_document['batter']
      p.next_batter = json_document['next_batter'] if json_document['next_batter'].present? && json_document['next_batter'].keys.count >= 1
      p.pitchers = json_document['pitchers'] if json_document['pitchers'].present? && json_document['pitchers'].count >= 1
      p.pitches = json_document['pitches'] if json_document['pitches'].present? && json_document['pitches'].count >= 1
      p.fielders = json_document['fielders'] if json_document['fielders'].present? && json_document['fielders'].count >= 1
      p.narrative = json_document['narrative']
      p.seq_id = json_document['seq_id']
      p.inning = json_document['inning']
      p.outs = json_document['outs']
      p.no_play = json_document['no_play']
      p.home_team_score = json_document['home_team_score']
      p.away_team_score = json_document['away_team_score']
      obj.save
    end
  end
end
** NOTE **
This problem goes away if I limit sidekiq to 1 worker, which obviously in the real world I'd never do.

You already have an index on game_id, so why not make it unique? That way the DB will not allow a duplicate entry, even if Mongoid fails to do the validation correctly (@vidaica's answer describes how Mongoid could fail to validate the uniqueness).
Try adding a unique index
index({"game_id" => 1}, {unique: true})
and then
rake db:mongoid:create_indexes
to create them in MongoDB (please verify from a mongo shell that the index was actually created).
After that, MongoDB will not persist any records with a duplicate game_id, and all you have to do in the Ruby layer is handle the insert errors you receive from MongoDB.
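For example, a minimal sketch inside the parse! method above (the exact error class depends on your driver version; with Mongoid 4's Moped driver a unique-index violation typically surfaces as Moped::Errors::OperationFailure mentioning E11000):
begin
  obj.save!
rescue Moped::Errors::OperationFailure
  # Duplicate key error: another worker already inserted this game_id,
  # so load the existing document and apply the changes to it instead.
  obj = PlayByPlay.find_by(game_id: json_document['game_id'])
  # ...re-apply the attributes from json_document and save again...
end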

This is because many threads are inserting objects with the same game_id. Let me paraphrase it.
For example, you have two Sidekiq threads, t1 and t2. They run in parallel. Assume a document with game_id 1 has not yet been inserted into the database.
t1 enters the parse method, sees no document in the database with game_id 1, creates a document with game_id 1, and continues to populate other data, but it has not saved the document yet.
t2 enters the parse method, sees no document in the database with game_id 1 (because at this point t1 has not saved its document), and creates a document with the same game_id 1.
t1 saves its document.
t2 saves its document.
The result: you have two documents with the same game_id 1.
To prevent this you can use a Mutex to serialize access to the parsing code. To learn how to use a Mutex, read this: http://www.ruby-doc.org/core-2.0.0/Mutex.html
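A minimal sketch of that approach (assuming all jobs run as threads inside a single Sidekiq process; a Mutex does not serialize access across separate processes or machines):
PARSE_LOCK = Mutex.new

def self.parse!(entry)
  PARSE_LOCK.synchronize do
    # ... the existing find_or_create_by / build / save logic from the question ...
  end
end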

Whatever you do, you will want to solve this at the database level, because you will almost certainly do a worse job of implementing unique constraints than the MongoDB people did.
Assuming you will want to shard one day, or are considering MongoDB for its horizontal scalability features (you're doing high-volume testing, so I assume this is something you don't want to rule out by design), there may be no reliable way to do this (see Ramifications of working with a mongodb cluster and sharding concepts):
Suppose we were sharding on email and wanted to have a unique index on username. This is not possible to enforce with a cluster.
However, if you're sharding on game_id, or you're not considering sharding at all, then setting a unique index on game_id should prevent double records (see @xlembouras's answer).
However, that answer may not prevent exceptions when this index is violated due to race conditions, so be sure to rescue that exception and perform an update instead of a create in the rescue block (possibly by playing with @new_record (click 'Show source'); I will try to find time to give you exact code).
UPDATE, short fast answer
begin
  a = Album.new(name: 'foo', game_id: 3)
  a.save
rescue
  a.id = id_of_the_object_with_same_id_already_in_db
  a.instance_variable_set('@new_record', false)
  a.save
end

@vidaica's answer is helpful. If you were fetching and incrementing an ID from memory or a database, it might solve your problem.
However, your game_id is not being generated in parse, it is being passed into parse via the entry JSON object.
How / where is your game_id being generated?

Maybe you should do an upsert instead of an insert:
obj = new(game_id: json_document['game_id'])
obj.upsert

A naive approach is to change the last line of #parse to:
obj.save if where(game_id: obj.game_id).count == 0
Or if you want to handle it somehow:
if where(game_id: obj.game_id).count == 0
  # handle it here
end
Note however that this still leaves possibilities for duplicates.

Related

How is this validation error possible for this code?

The error is ActiveRecord::RecordInvalid: Validation failed: Route must exist.
This is the code:
new_route = Route.create!(new_route_data)

new_points.each_with_index do |point, index|
  new_point_data = { route_id: new_route.id,
                     latitude: point[0],
                     longitude: point[1],
                     timestamp: point[2] }
  new_point = Point.create!(new_point_data)
end
It is being reported for the new_point = Point.create!(new_point_data) line.
Related Details:
This code is running in a single Sidekiq worker as you see it above (so, the Route isn't being created in one worker, with the Points being created in another worker - this is all inline)
The routes table has almost 3M records
The points table has about 2.5B records
The Point model contains belongs_to :route, counter_cache: true
There are no validations on the Route model
In case it's relevant, the Route model does contain belongs_to :user, counter_cache: true
There are only about 5k records in the users table
Software versions:
Rails 5.1.5
Ruby 2.5.0
PostgreSQL 9.6.7
First of all, your code does not quite make sense: you are iterating over new_points and assigning new_point to some value inside the block. So I am assuming that you meant to iterate over some collection called data_points.
Try this.
In the Route model:
has_many :points
then:
new_route = ...
data_points.each do |point|
  point_data = { latitude: point[0], ... } # do not include route_id
  new_route.points.create(point_data) # idiomatic
end
You don't need the index, so don't use each_with_index.
It's tough to say what the issue is without seeing what type of validations you have in the Point model.
My guess is that your validation in point.rb is:
validates :route, presence: true
With ActiveRecord relations, you can use this shortcut to avoid explicitly assigning route_id:
new_point_data = { latitude: point[0], longitude: point[1], timestamp: point[2] }
new_route.points.create!(new_point_data)
where new_point_data does not have route_id.
You should also rename the new_point that you are assigning in the block since you are writing over the array that you're iterating.

How to avoid duplicate data in my database in ruby on rails

I have scraped data from another website and saved it in my database, which is working fine. However, any time I refresh my application the scraped data duplicates itself in my database. Any help would be highly appreciated. Below is my code:
require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("www.example.com"))
entries = doc.css('.block')
@entriesArray = []

entries.each do |row|
  Scrap.create!(
    title: title = row.css('h2>a').text,
    link: link = row.css('a')[0]['href'],
    day: days = row.css('time').text)
  @entriesArray << Entry.new(title, link, days)
end
You can use a model validation to raise an error on create! if any validation fails:
class Scrap < ApplicationRecord
  validates_uniqueness_of :title
end
And you can also use the first_or_create method to create a new record only if one does not already exist in the database:
entries.each do |row|
  title = row.css('h2>a').text
  link = row.css('a')[0]['href']
  day = row.css('time').text
  Scrap.where(title: title).first_or_create(
    title: title,
    link: link,
    day: day
  )
  @entriesArray << Entry.new(title, link, day)
end
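Equivalently, a sketch using ActiveRecord's find_or_create_by with a block (the block is only evaluated when a new record is being built; attribute names are taken from the code above):
entries.each do |row|
  scrap = Scrap.find_or_create_by(title: row.css('h2>a').text) do |s|
    # Only runs when no Scrap with this title exists yet.
    s.link = row.css('a')[0]['href']
    s.day  = row.css('time').text
  end
  @entriesArray << Entry.new(scrap.title, scrap.link, scrap.day)
end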
You should add a unique index on, for instance, the link column in your database (this optimizes the lookup and enforces that you won't have duplicates with the same link; it's not strictly needed, but it makes sense), since links should be unique (you could go with title too, but titles could repeat themselves - not sure, it depends on what you're fetching).
And then check whether you already have that link in the database, like so:
entries.each do |row|
  scrap = Scrap.create_with(title: row.css('h2>a').text, day: row.css('time').text).find_or_initialize_by(link: row.css('a')[0]['href'])
  @entriesArray << Entry.new(scrap.title, scrap.link, scrap.day) if scrap.new_record? && (scrap.save! if scrap.new_record?)
end
(The last if is there in case you only want to add the Entry when it's a new record; if you want to add it no matter what, just remove everything from if scrap.new_record? to the end.)
You want to add a uniqueness validator to your Scrap model like this:
validates :title, uniqueness: true
validates :link, uniqueness: true
This will prevent the same record from being saved twice.
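As with the Mongoid question above, validations alone are subject to race conditions, so you may also want unique indexes at the database level. A sketch, assuming an ActiveRecord migration against a scraps table (adjust the migration superclass version to your Rails version):
class AddUniqueIndexesToScraps < ActiveRecord::Migration[5.1]
  def change
    # Database-level guarantee; the model validations above only guard the application layer.
    add_index :scraps, :title, unique: true
    add_index :scraps, :link, unique: true
  end
end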

You cannot call create unless the parent is saved error when seeding in rails

I'm trying to populate my SQLite3 database with a simple seed file that is supposed to create a bunch of movie entries in the Film table and then create some comments on those movies, stored in the Comments table.
formats = %w(Beta VHS IMAX HD SuperHD 4K DVD BlueRay)

30.times do
  film = Film.create(title: "#{Company.bs}",
                     director: "#{Name.name}",
                     description: Lorem.paragraphs.join("<br/>").html_safe,
                     year: rand(1940..2015),
                     length: rand(20..240),
                     format: formats[rand(formats.length)]
                    )
  film.save
  (rand(0..10)).times do
    film.comments.create(author: "#{Name.name}",
                         title: "#{Company.bs}",
                         content: Lorem.sentences(3).join("<br/>").html_safe,
                         rating: rand(1..5)
                        )
  end
end
Once I execute rake db:seed, I inevitably get the error
ActiveRecord::RecordNotSaved: You cannot call create unless the parent is saved
and no records are added to either Films or Comments
My film.rb file is
class Film < ActiveRecord::Base
  has_many :comments
  validates_presence_of :title, :director
  validates_length_of :format, maximum: 5, minimum: 3
  validates_numericality_of :year, :length, greater_than: 0
  validates_uniqueness_of :title
  paginates_per 4
end
The length limit on 'format' raises the error: with validates_length_of :format, maximum: 5, minimum: 3, formats such as 'SuperHD' and 'BlueRay' (too long) or 'HD' and '4K' (too short) fail validation, so those Film records are never saved and the comments cannot be created on them.
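A quick console sketch showing which entries in the formats list violate the 3..5 length constraint:
formats = %w(Beta VHS IMAX HD SuperHD 4K DVD BlueRay)
formats.reject { |f| (3..5).cover?(f.length) }
# => ["HD", "SuperHD", "4K", "BlueRay"]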
ActiveRecord::RecordNotSaved: You cannot call create unless the parent is saved
This occurs when you try to save a child association (Comment) but the parent (Film) isn't saved yet.
It seems that film is not saved. Looking at the code, it appears that film = Film.create(...) is failing validations, and thus film.comments.create(...) cannot proceed.
Without knowing more about which validation is failing, that's all I can say.
I would recommend using create!(...) everywhere in seeds.rb. The bang version will raise an exception if the record isn't valid and help prevent silent failures.
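A sketch of the same seed loop using the bang variants (assuming the faker-style helpers from the question), so the first invalid record raises ActiveRecord::RecordInvalid with the failing validation in its message:
formats = %w(Beta VHS IMAX HD SuperHD 4K DVD BlueRay)

30.times do
  # create! raises instead of silently returning an unsaved object.
  film = Film.create!(title: "#{Company.bs}",
                      director: "#{Name.name}",
                      description: Lorem.paragraphs.join("<br/>").html_safe,
                      year: rand(1940..2015),
                      length: rand(20..240),
                      format: formats[rand(formats.length)])
  rand(0..10).times do
    # Same for comments: any validation failure surfaces immediately.
    film.comments.create!(author: "#{Name.name}",
                          title: "#{Company.bs}",
                          content: Lorem.sentences(3).join("<br/>").html_safe,
                          rating: rand(1..5))
  end
end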

Mongoid queries not returning anything on new model fields

I have a Rails application where I am trying to iterate over each object in a Model class depending on whether the object has been archived or not.
class Model
  include Mongoid::Document
  include Mongoid::Timestamps

  field :example_id, type: Integer
  field :archived, type: Boolean, default: false

  def archive_all
    Model.all.where(archived: false).each do |m|
      m.archive!
    end
  end
end
However, the where clause isn't returning anything. When I go into the console and enter these lines, here is what I get:
Model.where(example_id: 3).count #=> 23
Model.where(archived: false).count #=> 0
Model.all.map(&:archived) #=> [false, false, false, ...]
I have other where clauses throughout the application and they seem to work fine. If it makes any difference, the 'archived' field is one that I just recently added.
What is happening here? What am I doing wrong?
When you say:
Model.where(archived: false)
you're looking for documents in MongoDB where the archived field is exactly false. If you just added your archived field, then none of the existing documents in your database will have that field (and no, the :default doesn't matter), so there won't be any with archived: false. You're probably better off looking for documents where archived is not true:
Model.where(:archived.ne => true).each(&:archive!)
You might want to add a validation on archived to ensure that it is always true or false and that every document has that field.
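Alternatively, a sketch (assuming you want the field physically present on the old documents) that backfills archived so the original archived: false query starts matching:
# Set archived: false on every document that does not have the field yet.
Model.where(:archived.exists => false).update_all(archived: false)

Model.where(archived: false).count # now also counts the pre-existing documents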

Rails: use existing model validation rules against a collection instead of the database table

Rails 4, Mongoid instead of ActiveRecord (but this shouldn't change anything for the sake of the question).
Let's say I have a MyModel domain class with some validation rules:
class MyModel
  include Mongoid::Document
  field :text, type: String
  field :type, type: String
  belongs_to :parent

  validates :text, presence: true
  validates :type, inclusion: %w(A B C)
  validates_uniqueness_of :text, scope: :parent # important validation rule for the purpose of the question
end
where Parent is another domain class:
class Parent
  include Mongoid::Document
  field :name, type: String
  has_many :my_models
end
Also I have the related tables in the database populated with some valid data.
Now, I want to import some data from a CSV file, which can conflict with the existing data in the database. The easy thing to do is to create an instance of MyModel for every row in the CSV, verify whether it's valid, and then save it to the database (or discard it).
Something like this:
csv_rows.each do |data| # simplified
  my_model = MyModel.new(data) # data is the hash with the values taken from the CSV row
  if my_model.valid?
    my_model.save validate: false
  else
    # do something useful, but not interesting for the question's purpose
    # just know that I need to separate validation from saving
  end
end
Now, this works pretty smoothly for a limited amount of data. But when the CSV contains hundreds of thousands of rows, this gets quite slow, because (worst case) there's a write operation for every row.
What I'd like to do, is to store the list of valid items and save them all at the end of the file parsing process. So, nothing complicated:
valids = []
csv_rows.each do |data|
  my_model = MyModel.new(data)
  if my_model.valid? # THE INTERESTING LINE: this "if" checks only against the database; what happens if it conflicts with some other my_models not saved yet?
    valids << my_model
  else
    # ...
  end
end
if valids.size > 0
  # bulk insert of all data
end
That would be perfect, if I could be sure that the data in the CSV does not contain duplicated rows or data that goes against the validation rules of MyModel.
My question is: how can I check each row against the database AND the valids array, without having to repeat the validation rules defined in MyModel (avoiding duplicating them)?
Is there a different (more efficient) approach I'm not considering?
What you can do is validate the model, save its attributes in a hash pushed to the valids array, then do a bulk insert of the values using MongoDB's insert:
valids = []
csv_rows.each do |data|
  my_model = MyModel.new(data)
  if my_model.valid?
    valids << my_model.attributes
  end
end
MyModel.collection.insert(valids, continue_on_error: true)
This won't, however, prevent NEW duplicates... for that you could do something like the following, using a hash keyed on a compound key:
valids = {}
csv_rows.each do |data|
  my_model = MyModel.new(data)
  if my_model.valid?
    valids["#{my_model.text}_#{my_model.parent}"] = my_model.as_document
  end
end
Then either of the following will work, DB Agnostic:
MyModel.create(valids.values)
Or MongoDB'ish:
MyModel.collection.insert(valids.values, continue_on_error: true)
OR EVEN BETTER
Ensure you have a unique index on the collection:
class MyModel
  ...
  index({ text: 1, parent: 1 }, { unique: true, dropDups: true })
  ...
end
Then Just do the following:
MyModel.collection.insert(csv_rows, continue_on_error: true)
http://api.mongodb.org/ruby/current/Mongo/Collection.html#insert-instance_method
http://mongoid.org/en/mongoid/docs/indexing.html
TIP: if you anticipate thousands of rows, I recommend doing this in batches of 500 or so.
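A minimal sketch of that batching, assuming the valids hash built above:
# Insert in slices of 500 so one huge insert doesn't overwhelm memory or the driver.
valids.values.each_slice(500) do |batch|
  MyModel.collection.insert(batch, continue_on_error: true)
end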
