How to avoid duplicate data in my database in Ruby on Rails

I have scraped data from another website and saved it in my database, which works fine. However, every time I refresh my application, the scraped data is duplicated in my database. Any help would be highly appreciated. Below is my code:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(URI.open("http://www.example.com"))
entries = doc.css('.block')
#entriesArray = []
entries.each do |row|
Scrap.create!(
title: title = row.css('h2>a').text,
link: link = row.css('a')[0]['href'],
day: days =row.css('time').text)
#entriesArray << Entry.new(title,link,days)
end

You can add a model validation so that create! raises an error when a validation fails:
class Scrap < ApplicationRecord
validates_uniqueness_of :title
end
You can also use the first_or_create method to create a new record only if it does not already exist in the database:
entries.each do |row|
title = row.css('h2>a').text
link = row.css('a')[0]['href']
day = row.css('time').text
Scrap.where(title: title).first_or_create(
title: title,
link: link,
day: day
)
#entriesArray << Entry.new(title,link,days)
end

You should add a unique index on a column such as link (you could use title too, but titles might repeat, depending on what you're fetching). The index is not strictly required, but it speeds up the lookup and guarantees at the database level that no two rows share the same link.
Then check whether that link is already in the database, like so:
entries.each do |row|
scrap = Scrap.create_with(title: row.css('h2>a').text, day: row.css('time').text).find_or_initialize_by(link: row.css('a')[0]['href'])
is_new = scrap.new_record?
scrap.save! if is_new # find_or_initialize_by does not persist the record by itself
#entriesArray << Entry.new(scrap.title, scrap.link, scrap.day) if is_new
end
(The trailing if on the Entry line adds the Entry only when the record is new. To add it no matter what, remove the if is_new condition.)

You want to add a uniqueness validator to your Scrap model like this:
validates :title, uniqueness: true
validates :link, uniqueness: true
This will prevent the same record from being saved twice.
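As a rough illustration of what these answers have in common, the sketch below keys an in-memory store by link, so importing the same entries twice still leaves one record per link. ScrapStore and its find_or_create_by method are hypothetical stand-ins for the Scrap model and its table, not Rails APIs, and the entries are made-up sample data.

```ruby
# Hypothetical in-memory stand-in for the scraps table (not ActiveRecord).
class ScrapStore
  attr_reader :rows

  def initialize
    @rows = {} # link => attributes; the hash key plays the role of a unique index
  end

  # Returns the existing row for this link, or stores and returns a new one.
  def find_or_create_by(link:, **attrs)
    @rows[link] ||= attrs.merge(link: link)
  end
end

# Made-up sample entries standing in for the scraped rows.
scraped = [
  { title: 'First post',  link: '/posts/1', day: 'Monday' },
  { title: 'Second post', link: '/posts/2', day: 'Tuesday' }
]

store = ScrapStore.new
# Simulate refreshing the page twice: the same entries are imported again.
2.times do
  scraped.each { |entry| store.find_or_create_by(**entry) }
end

puts store.rows.size
```

The second pass finds every link already present, so nothing new is stored. In the real application the unique index on the column is what makes this guarantee hold across processes.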

Related

Rails Validation Error: ActiveRecord::RecordInvalid

I'm kinda struggling with validations in my Rails application.
I have the following Setup:
class MealDay < ApplicationRecord
belongs_to :meal
belongs_to :day
has_many :meal_day_canteens
has_many :canteens,
through: :meal_day_canteens
validates :meal_id, uniqueness: {scope: :day_id}
end
#MainController
helping_hash.each { |days, meal|
dayRef = Day.where(date: days).first
mealRef = Meal.where(name: meal)
dayRef.meals << mealRef #This line is obviously throwing the error because a record exists already
}
The error is: "Validation failed: Meal has already been taken"
But I'm not sure how to handle the error. I just want Rails to skip the record instead of inserting it into the database. If you need more information, just tell me.
Thanks.
Edit: Some more code which I can't get to work now.
Edit2: Forgot to add validation to that model. Works fine now
helping_hash.each { |days, meal|
dayRef = Day.where(date: days).first
mealRef = Meal.where(name: meal)
meal_day_ref = MealDay.where(day: dayRef, meal: mealRef)
#canteenNameRef.meal_days << meal_day_ref rescue next
}
How about rescuing the error?
helping_hash.each do |days, meal|
dayRef = Day.where(date: days).first
mealRef = Meal.where(name: meal)
dayRef.meals << mealRef rescue next
end
The Rails uniqueness validation works by running a query under the hood to look for an existing record in the DB. As a side note, this is not safe under concurrency, so I recommend also adding a unique constraint at the database level.
You need to do the manual work of skipping the records that already exist.
Something like:
helping_hash.each do |day, meal|
day_ref = Day.find_by!(date: day)
meal_ref = Meal.find_by!(name: meal)
unless day_ref.meal_ids.include?(meal_ref.id)
# Depending on the number of records you expect
# using day_ref.meals.exists?(id: meal_ref.id) may be a better choice
day_ref.meals << meal_ref
end
end
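The skip-if-present check can be sketched in plain Ruby with a Set of (day, meal) pairs. The names existing_pairs and incoming and the sample values are made up for illustration; in the Rails app the pairs live in the database.

```ruby
require 'set'

# Pairs already stored, standing in for rows of the join table (sample data).
existing_pairs = Set[['2024-01-01', 'Pasta']]

incoming = {
  '2024-01-01' => 'Pasta', # already present, should be skipped
  '2024-01-02' => 'Soup'   # new, should be added
}

added = []
incoming.each do |day, meal|
  # Set#add? returns nil when the element is already in the set.
  added << [day, meal] if existing_pairs.add?([day, meal])
end

puts added.inspect
```

Only the pair that was not already stored ends up in added, which mirrors the unless check in the loop above.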

Uploading csv file and save in database

I have one problem when I try to save some data into my database, imported from a CSV file (uploaded).
My environment is about a classroom reservation. I have the following code for my model Reservation:
class Reservation < ActiveRecord::Base
require 'csv'
belongs_to :classroom
validates :start_date, presence: true
validates :end_date, presence: true
validates :classroom_id, presence: true
validate :validate_room
scope :filter_by_room, ->(room_id) { where 'classroom_id = ?' % room_id }
def self.import(file)
CSV.foreach(file, headers: true ) do |row|
room_id = Classroom.where(number: row[0]).pluck(:id)
Reservation.create(classroom_id: room_id, start_date: row[1], end_date: row[2])
end
end
private
def validate_room
if Reservation.filter_by_room(classroom_id).nil?
errors.add(:classroom_id, ' has already booked')
end
end
end
The CSV file comes with these three headers: "classroom number", "start date", "end date".
Note that "classroom number" header came from a column of classroom table.
My job is to get the classroom.id using the "number" and create the row in the database of the reservation table.
The problem is that when I get the classroom_id in the self.import method and print it on the console, it exists. When I use the scope to filter by classroom_id, it is empty.
I hope I've expressed myself clearly.
Sorry for my bad English :/
Edit: I discovered that the classroom_id becomes nil when I pass it to Reservation.create. If I use row[0] it works, but I need to use classroom_id.
{ where 'classroom_id = ?' % room_id }
Should be
{ where 'classroom_id = ?', room_id }
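The reason the first form fails can be seen in plain Ruby, without Rails: String#% performs sprintf-style formatting, and '?' is not a format directive, so the id is never substituted into the string. The value 7 below is only sample data.

```ruby
room_id = 7

# String#% is sprintf-style formatting; '?' is not a directive, so the
# string comes back unchanged and the placeholder is never filled in.
formatted = 'classroom_id = ?' % room_id

# With a comma, the SQL fragment and the value are two separate arguments,
# which is the shape where('classroom_id = ?', room_id) expects.
where_args = ['classroom_id = ?', room_id]

puts formatted
```

So the scope was passing the literal text with an unfilled placeholder to where, instead of a fragment plus a value to bind.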
The answer is simple: I forgot to call .first after pluck(:id).
The pluck method returns its values wrapped in an array:
room_id = Classroom.where(number: row[0]).pluck(:id).first

Rails: use existing model validation rules against a collection instead of the database table

Rails 4, Mongoid instead of ActiveRecord (but this shouldn't change anything for the sake of the question).
Let's say I have a MyModel domain class with some validation rules:
class MyModel
include Mongoid::Document
field :text, type: String
field :type, type: String
belongs_to :parent
validates :text, presence: true
validates :type, inclusion: %w(A B C)
validates_uniqueness_of :text, scope: :parent # important validation rule for the purpose of the question
end
where Parent is another domain class:
class Parent
include Mongoid::Document
field :name, type: String
has_many :my_models
end
Also I have the related tables in the database populated with some valid data.
Now, I want to import some data from a CSV file, which can conflict with the existing data in the database. The easy thing to do is to create an instance of MyModel for every row in the CSV, check whether it's valid, and then save it to the database (or discard it).
Something like this:
csv_rows.each do |data| # simplified
my_model = MyModel.new(data) # data is the hash with the values taken from the CSV row
if my_model.valid?
my_model.save validate: false
else
# do something useful, but not interesting for the question's purpose
# just know that I need to separate validation from saving
end
end
Now, this works pretty smoothly for a limited amount of data. But when the CSV contains hundreds of thousands of rows, this gets quite slow, because (worst case) there's a write operation for every row.
What I'd like to do, is to store the list of valid items and save them all at the end of the file parsing process. So, nothing complicated:
valids = []
csv_rows.each do |data|
my_model = MyModel.new(data)
if my_model.valid? # THE INTERESTING LINE this "if" checks only against the database, what happens if it conflicts with some other my_models not saved yet?
valids << my_model
else
# ...
end
end
if valids.size > 0
# bulk insert of all data
end
That would be perfect, if I could be sure that the data in the CSV does not contain duplicated rows or data that goes against the validation rules of MyModel.
My question is: how can I check each row against the database AND the valids array, without having to repeat the validation rules defined into MyModel (avoiding to have them duplicated)?
Is there a different (more efficient) approach I'm not considering?
What you can do is validate each row as a model, save its attributes as a hash pushed onto the valids array, and then do a bulk insert of the values using MongoDB's insert:
valids = []
csv_rows.each do |data|
my_model = MyModel.new(data)
if my_model.valid?
valids << my_model.attributes
end
end
MyModel.collection.insert(valids, continue_on_error: true)
This won't however prevent NEW duplicates... for that you could do something like the following, using a hash and compound key:
valids = {}
csv_rows.each do |data|
my_model = MyModel.new(data)
if my_model.valid?
valids["#{my_model.text}_#{my_model.parent_id}"] = my_model.as_document
end
end
Then either of the following will work, DB Agnostic:
MyModel.create(valids.values)
Or MongoDB'ish:
MyModel.collection.insert(valids.values, continue_on_error: true)
OR EVEN BETTER
Ensure you have a uniq index on the collection:
class MyModel
...
index({ text: 1, parent: 1 }, { unique: true, drop_dups: true })
...
end
Then Just do the following:
MyModel.collection.insert(csv_rows, continue_on_error: true)
http://api.mongodb.org/ruby/current/Mongo/Collection.html#insert-instance_method
http://mongoid.org/en/mongoid/docs/indexing.html
TIP: I recommend if you anticipate thousands of rows to do this in batches of 500 or so.
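A minimal sketch of the batching tip, assuming the parsed rows are already in an array. The rows and the BATCH_SIZE constant are illustrative; the bulk insert itself is left as a comment because it depends on the driver version in use.

```ruby
# Sample rows standing in for parsed CSV data (hypothetical values).
rows = (1..1200).map { |i| { text: "row #{i}", type: 'A' } }

BATCH_SIZE = 500

batch_sizes = []
rows.each_slice(BATCH_SIZE) do |batch|
  # In the real import, the bulk insert of `batch` would go here.
  batch_sizes << batch.size
end

puts batch_sizes.inspect
```

Enumerable#each_slice hands over at most BATCH_SIZE rows at a time, with the remainder in the final batch.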

Mongoid 4 (GitHub master) creating documents with duplicate IDs

I am running a high-traffic test with Sidekiq that creates MongoDB-based objects using Mongoid as my driver in a Rails 4 app. The issue I'm seeing is that although a PlayByPlay document is supposed to have a unique game_id, multiple PlayByPlay objects are getting created with the exact same game_id. I've enforced the unique constraint on MongoDB as well, and this is still happening. Here's my document, its embedded document, and a glimpse into how I'm creating the documents. This is all happening in a threaded environment using Sidekiq, and I'm not sure if there is a way to work around it. My write concern is set to 1 in mongoid.yml, and it looks like the safe option was removed in master, as was persist_in_safe_mode. Code below; any suggestions on how to handle this properly would be appreciated. This is not a replica set; it's a single MongoDB server performing all read/write requests at this time.
module MLB
class Play
include Mongoid::Document
include Mongoid::Timestamps
embedded_in :play_by_play
field :batter#, type: Hash
field :next_batter#, type: Hash
field :pitchers#, type: Array
field :pitches#, type: Array
field :fielders#, type: Array
field :narrative, type: String
field :seq_id, type: Integer
field :inning, type: Integer
field :outs
field :no_play
field :home_team_score
field :away_team_score
end
class PlayByPlay
include Mongoid::Document
include Mongoid::Timestamps
embeds_many :plays, cascade_callbacks: true
accepts_nested_attributes_for :plays
field :sport
field :datetime, type: DateTime
field :gamedate, type: DateTime
field :game_id
field :home_team_id
field :away_team_id
field :home_team_score
field :away_team_score
field :season_year
field :season_type
field :location
field :status
field :home_team_abbr
field :away_team_abbr
field :hp_umpire
field :fb_umpire
field :sb_umpire
field :tb_umpire
index({game_id: 1})
index({away_team_id: 1})
index({home_team_id: 1})
index({season_type: 1})
index({season_year: 1})
index({"plays.seq_id" => 1}, {unique: true, drop_dups: true})
#validates 'play.seq_id', uniqueness: true
validates :game_id, presence: true, uniqueness: true
validates :home_team_id, presence: true
validates :away_team_id, presence: true
validates :gamedate, presence: true
validates :datetime, presence: true
validates :season_type, presence: true
validates :season_year, presence: true
def self.parse!(entry)
@document = Nokogiri::XML(entry.data)
xslt = Nokogiri::XSLT(File.read("#{$XSLT_PATH}/mlb_pbp.xslt"))
transform = xslt.apply_to(@document)
json_document = JSON.parse(transform)
obj = find_or_create_by(game_id: json_document['game_id'])
obj.sport = json_document['sport']
obj.home_team_id = json_document['home_team_id']
obj.away_team_id = json_document['away_team_id']
obj.home_team_score = json_document['home_team_score']
obj.away_team_score = json_document['away_team_score']
obj.season_type = json_document['season_type']
obj.season_year = json_document['season_year']
obj.location = json_document['location']
obj.datetime = DateTime.strptime(json_document['datetime'], "%m/%d/%y %H:%M:%S")
obj.gamedate = DateTime.strptime(json_document['game_date'], "%m/%d/%Y %H:%M:%S %p")
obj.status = json_document['status']
obj.home_team_abbr = json_document['home_team_abbr']
obj.away_team_abbr = json_document['away_team_abbr']
obj.hp_umpire = json_document['hp_umpire']
obj.fb_umpire = json_document['fb_umpire']
obj.sb_umpire = json_document['sb_umpire']
obj.tb_umpire = json_document['tb_umpire']
p=obj.plays.build(seq_id: json_document['seq_id'])
p.batter = json_document['batter']
p.next_batter = json_document['next_batter'] if json_document['next_batter'].present? && json_document['next_batter'].keys.count >= 1
p.pitchers = json_document['pitchers'] if json_document['pitchers'].present? && json_document['pitchers'].count >= 1
p.pitches = json_document['pitches'] if json_document['pitches'].present? && json_document['pitches'].count >= 1
p.fielders = json_document['fielders'] if json_document['fielders'].present? && json_document['fielders'].count >= 1
p.narrative = json_document['narrative']
p.seq_id = json_document['seq_id']
p.inning = json_document['inning']
p.outs = json_document['outs']
p.no_play = json_document['no_play']
p.home_team_score = json_document['home_team_score']
p.away_team_score = json_document['away_team_score']
obj.save
end
end
end
** NOTE **
This problem goes away if I limit sidekiq to 1 worker, which obviously in the real world I'd never do.
You already have an index on game_id, so why not make it unique? That way the DB will not allow a duplicate entry, even if Mongoid fails to validate correctly (@vidaica's answer describes how the Mongoid uniqueness validation can fail).
Try adding a unique index
index({"game_id" => 1}, {unique: true})
and then
rake db:mongoid:create_indexes
to create them in mongo (please verify from a mongo shell that the index was created).
After that, MongoDB should not persist any records with a duplicate game_id, and all you'll have to do in the Ruby layer is handle the insert errors that you'll receive from MongoDB.
This happens because multiple threads insert objects with the same game_id. Let me walk through it.
For example, you have two Sidekiq threads, t1 and t2, running in parallel. Assume a document with game_id 1 has not yet been inserted into the database.
t1 enters the parse method, sees no document in the database with game_id 1, creates a document with game_id 1 and continues to populate the other data, but has not saved the document yet.
t2 enters the parse method and sees no document in the database with game_id 1, because at this point t1 has not saved its document. t2 creates a document with the same game_id 1.
t1 saves the document.
t2 saves the document.
The result: you have two documents with the same game_id 1.
To prevent this you can use a Mutex to serialize the access of the parsing code. To know how to use a Mutex, read this: http://www.ruby-doc.org/core-2.0.0/Mutex.html
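A minimal sketch of the Mutex idea, using an in-memory array in place of the MongoDB collection. The documents array, the lock, and the find_or_create lambda are hypothetical names for illustration; the sleep only widens the window between the existence check and the insert.

```ruby
# Hypothetical in-memory "collection" standing in for the database.
documents = []
lock = Mutex.new

find_or_create = lambda do |game_id|
  # Only one thread at a time may run the check-then-insert sequence.
  lock.synchronize do
    existing = documents.find { |doc| doc[:game_id] == game_id }
    next existing if existing

    sleep 0.01 # pretend to populate the other fields
    doc = { game_id: game_id }
    documents << doc
    doc
  end
end

threads = 5.times.map { Thread.new { find_or_create.call(1) } }
threads.each(&:join)

puts documents.size
```

Because the check and the insert happen inside the same synchronize block, only the first thread creates the document. Note that a Mutex only serializes threads within a single Ruby process, so with several Sidekiq processes a unique index in the database is still needed.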
Whatever you do, you will want to solve this at the database level, because you will almost certainly do a worse job of implementing unique constraints than the MongoDB people did.
Assuming you will want to shard one day or consider mongo due to its horizontal scalability features (you're doing high volume testing so I assume this is something you don't want to rule out by design), there may be no reliable way to do this (see Ramifications of working with a mongodb cluster and sharding concepts):
Suppose we were sharding on email and wanted to have a unique index on username. This is not possible to enforce with a cluster.
However, if you're sharding on game_id, or you're not considering sharding at all, then setting a unique index on game_id should prevent duplicate records (see @xlembouras's answer).
That answer does not prevent the exceptions raised when the index is violated due to race conditions, though, so be sure to rescue that exception and perform an update instead of a create in the rescue block (possibly by playing with @new_record (click 'Show source'); I will try to find time to give you exact code).
UPDATE, short fast answer
begin
a = Album.new(name: 'foo', game_id: 3)
a.save
rescue
a.id = id_of_the_object_with_same_id_already_in_db
a.instance_variable_set('@new_record', false)
a.save
end
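The insert-then-rescue pattern can be sketched without MongoDB as follows. GameStore and DuplicateKeyError are hypothetical stand-ins for a collection with a unique index on game_id and for the driver's duplicate-key error; they are not part of Mongoid or the MongoDB driver.

```ruby
# Hypothetical error mimicking a unique-index violation.
class DuplicateKeyError < StandardError; end

# Hypothetical store that enforces uniqueness of game_id atomically.
class GameStore
  def initialize
    @docs = {}
    @lock = Mutex.new
  end

  # Raises when game_id already exists, as a unique index would.
  def insert(game_id, attrs)
    @lock.synchronize do
      raise DuplicateKeyError, "duplicate game_id #{game_id}" if @docs.key?(game_id)

      @docs[game_id] = attrs.dup
    end
  end

  def update(game_id, attrs)
    @lock.synchronize { @docs.fetch(game_id).merge!(attrs) }
  end

  def count
    @lock.synchronize { @docs.size }
  end
end

# Try to insert first; if another worker won the race, update instead.
def save_game(store, game_id, attrs)
  store.insert(game_id, attrs)
  :inserted
rescue DuplicateKeyError
  store.update(game_id, attrs)
  :updated
end

store = GameStore.new
threads = 4.times.map do |i|
  Thread.new { save_game(store, 3, { status: "update #{i}" }) }
end
results = threads.map(&:value)

puts results.inspect
```

Exactly one worker inserts and the others fall back to an update, so the store ends up with a single document for the game_id regardless of thread timing.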
@vidaica's answer is helpful. If you were fetching and incrementing an ID from memory or a database, it might solve your problem.
However, your game_id is not being generated in parse, it is being passed into parse via the entry JSON object.
How / where is your game_id being generated?
Maybe you should do an upsert instead of an insert:
obj = new(game_id: json_document['game_id'])
obj.upsert
A naive approach is to change the last line of #parse to:
obj.save if where(game_id: obj.game_id).count == 0
Or if you have to handle it somehow:
if where(game_id: obj.game_id).count == 0
# handle it here
end
Note however that this still leaves possibilities for duplicates.

Overriding id on create in ActiveRecord

Is there any way of overriding a model's id value on create? Something like:
Post.create(:id => 10, :title => 'Test')
would be ideal, but obviously won't work.
id is just attr_protected, which is why you can't use mass-assignment to set it. However, when setting it manually, it just works:
o = SomeObject.new
o.id = 8888
o.save!
o.reload.id # => 8888
I'm not sure what the original motivation was, but I do this when converting ActiveHash models to ActiveRecord. ActiveHash allows you to use the same belongs_to semantics in ActiveRecord, but instead of having a migration and creating a table, and incurring the overhead of the database on every call, you just store your data in yml files. The foreign keys in the database reference the in-memory ids in the yml.
ActiveHash is great for picklists and small tables that change infrequently and only change by developers. So when going from ActiveHash to ActiveRecord, it's easiest to just keep all of the foreign key references the same.
You could also use something like this:
Post.create({:id => 10, :title => 'Test'}, :without_protection => true)
Although as stated in the docs, this will bypass mass-assignment security.
Try
a_post = Post.new do |p|
p.id = 10
p.title = 'Test'
p.save
end
that should give you what you're looking for.
For Rails 4:
Post.create(:title => 'Test').update_column(:id, 10)
Other Rails 4 answers did not work for me. Many of them appeared to change when checking using the Rails Console, but when I checked the values in MySQL database, they remained unchanged. Other answers only worked sometimes.
For MySQL at least, assigning an id below the auto increment id number does not work unless you use update_column. For example,
p = Post.create(:title => 'Test')
p.id
=> 20 # 20 was the id the auto increment gave it
p2 = Post.create(:id => 40, :title => 'Test')
p2.id
=> 40 # 40 > the next auto increment id (21) so allow it
p3 = Post.create(:id => 10, :title => 'Test')
p3.id
=> 10 # Go check your database, it may say 41.
# Assigning an id to a number below the next auto generated id will not update the db
If you change create to use new + save you will still have this problem. Manually changing the id like p.id = 10 also produces this problem.
In general, I would use update_column to change the id even though it costs an extra database query because it will work all the time. This is an error that might not show up in your development environment, but can quietly corrupt your production database all the while saying it is working.
We can override attributes_protected_by_default:
class Example < ActiveRecord::Base
def self.attributes_protected_by_default
# default is ["id", "type"]
["type"]
end
end
e = Example.new(:id => 10000)
Actually, it turns out that doing the following works:
p = Post.new(:id => 10, :title => 'Test')
p.save(false)
As Jeff points out, id behaves as if it is attr_protected. To prevent that, you need to override the list of default protected attributes. Be careful doing this anywhere that attribute information can come from the outside. The id field is protected by default for a reason.
class Post < ActiveRecord::Base
private
def attributes_protected_by_default
[]
end
end
(Tested with ActiveRecord 2.3.5)
Post.create!(:title => "Test") { |t| t.id = 10 }
This doesn't strike me as the sort of thing that you would normally want to do, but it works quite well if you need to populate a table with a fixed set of ids (for example when creating defaults using a rake task) and you want to override auto-incrementing (so that each time you run the task the table is populated with the same ids):
post_types.each_with_index do |post_type, i|
PostType.create!(:name => post_type) { |t| t.id = i + 1 }
end
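For reference, here is a plain-Ruby sketch of how the index turns into a fixed id. each_with_index yields both the element and a zero-based index, so the block needs two parameters. The post_types values are sample names, not from the original answer.

```ruby
post_types = %w[article video gallery] # sample names for illustration

# each_with_index yields the element and a zero-based index.
fixed_ids = {}
post_types.each_with_index do |post_type, i|
  fixed_ids[post_type] = i + 1
end

# each.with_index(1) starts counting at 1 and avoids the manual offset.
same_ids = {}
post_types.each.with_index(1) { |post_type, id| same_ids[post_type] = id }

puts fixed_ids.inspect
```

Both loops assign the same ids on every run, which is what keeps the seeded table stable.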
Put this create_with_id function at the top of your seeds.rb and then use it to do your object creation where explicit ids are desired.
def create_with_id(clazz, params)
obj = clazz.send(:new, params)
obj.id = params[:id]
obj.save!
obj
end
and use it like this
create_with_id( Foo, {id:1,name:"My Foo",prop:"My other property"})
instead of using
Foo.create({id:1,name:"My Foo",prop:"My other property"})
This is a similar case, where it was necessary to overwrite the id with a value derived from a date:
# in app/models/calendar_block_group.rb
class CalendarBlockGroup < ActiveRecord::Base
...
before_validation :parse_id
def parse_id
self.id = self.date.strftime('%d%m%Y')
end
...
end
And then :
CalendarBlockGroup.create!(:date => Date.today)
# => #<CalendarBlockGroup id: 27072014, date: "2014-07-27", created_at: "2014-07-27 20:41:49", updated_at: "2014-07-27 20:41:49">
Callbacks work fine.
Good luck!
For Rails 3, the simplest way to do this is to use new with the without_protection refinement, and then save:
Post.new({:id => 10, :title => 'Test'}, :without_protection => true).save
For seed data, it may make sense to bypass validation which you can do like this:
Post.new({:id => 10, :title => 'Test'}, :without_protection => true).save(validate: false)
We've actually added a helper method to ActiveRecord::Base that is declared immediately prior to executing seed files:
class ActiveRecord::Base
def self.seed_create(attributes)
new(attributes, without_protection: true).save(validate: false)
end
end
And now:
Post.seed_create(:id => 10, :title => 'Test')
For Rails 4, you should be using StrongParams instead of protected attributes. If this is the case, you'll simply be able to assign and save without passing any flags to new:
Post.new(id: 10, title: 'Test').save # optionally pass `{validate: false}`
In Rails 4.2.1 with Postgresql 9.5.3, Post.create(:id => 10, :title => 'Test') works as long as there isn't a row with id = 10 already.
You can also insert the id with raw SQL:
arr = record_line.strip.split(",")
sql = "insert into records(id, created_at, updated_at, count, type_id, cycle, date) values(#{arr[0]},#{arr[1]},#{arr[2]},#{arr[3]},#{arr[4]},#{arr[5]},#{arr[6]})"
ActiveRecord::Base.connection.execute sql
