How to migrate images to new field in Rails?

I use Ruby on Rails 5.2.3, Mongoid, Attachinary and Cloudinary for images.
class User
  include Mongoid::Document

  has_attachment :image, accept: [:jpg, :png, :gif]
  field :pic, type: String

  before_update :migrate_images

  def migrate_images
    self.image_url = self.pic
  end
end
Images are saved in the pic field as links.
Right now I use the code below, but the problem is that it takes a very long time and not all images are saved.
User.where(:pic.exists => true).all.each &:update
log
irb(main):001:0> User.where(:pic.exists => true).all.each &:update
=> #<Mongoid::Contextual::Mongo:0x00007ffe5a3f98e0 #cache=nil, #klass=User, #criteria=#<Mongoid::Criteria
selector: {"pic"=>{"$exists"=>true}}
options: {}
class: User
embedded: false>
, #collection=#<Mongo::Collection:0x70365213493680 namespace=link_development.users>, #view=#<Mongo::Collection::View:0x70365213493380 namespace='link_development.users' #filter={"pic"=>{"$exists"=>true}} #options={"session"=>nil}>, #cache_loaded=true>

User.where(:pic.exists => true).all.each &:update
This is slow because .all.each loads all matching Users into memory. find_each is a bit easier on memory since it loads records in batches, but it is still a waste of time and memory to instantiate every object just to copy one attribute, and it still runs a separate update for each record.
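For contrast, here is a rough sketch of that batched per-record approach, written in ActiveRecord terms to match the rest of this answer (the update_column call and the batch_size value are illustrative assumptions, not code from the question):

# Still instantiates every record and issues one UPDATE per row,
# so it only saves memory, not time.
User.where.not(pic: nil).find_each(batch_size: 500) do |user|
  user.update_column(:image_url, user.pic)
end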
Instead, you can do this entirely in the database in a single query.
If the intent is to copy from User.pic to User.image_url, you can do this in a single statement.
# Find all the users who do not already have an image_url set
User.where(image_url: nil)
  # Set their image_url to be their pic.
  .update_all("image_url = pic")
This will run a single query:
update users
set image_url = pic
where image_url is null
There's no need to also check for users who lack a pic, because there's no harm in setting nil to nil, and a simpler query may be faster. But if you'd like to check, you can use where.not: User.where(image_url: nil).where.not(pic: nil)
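Note that the question uses Mongoid rather than ActiveRecord, so a SQL fragment like "image_url = pic" won't apply there. A minimal sketch of the same single-query idea against MongoDB directly, assuming MongoDB 4.2+ (which supports aggregation-pipeline updates) and that the target field really is named image_url:

# Copy pic into image_url for every user that has a pic,
# entirely on the server in one update_many call.
User.collection.update_many(
  { "pic" => { "$exists" => true }, "image_url" => nil },
  [ { "$set" => { "image_url" => "$pic" } } ]
)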

Related

update_columns update a string but it's nil when fetching on rails

I'm trying to save a file to a Google Cloud Storage bucket; I followed the official guide for Rails.
So I have this code for uploading my document:
after_create :upload_document, if: :path

def upload_document
  file = Document.storage_bucket.create_file \
    path.tempfile,
    "cover_images/#{id}/#{path.original_filename}",
    content_type: path.content_type,
    acl: "public"

  # Update the url to my path field on my database
  update_columns(path: file.public_url)
end
I can store my file in my bucket, retrieve the public_url, and update the path field in my table, but when I try to fetch the path string I get nil. Example in my Rails console:
Document.find(14)
=> #<Document id: 14, name: "Test", path: "https://storage.googleapis.com/xxxxxxxx-xxxx.x...", created_at: "2018-10-05 07:17:59", updated_at: "2018-10-05 07:17:59">
Document.find(14).path
=> nil
Document.find(14).name
=> "Test"
So I don't understand why I can't read my path field through the model after updating it with Rails' update_columns, even though the value is present in my SQL database.
Thanks a lot for your help.
You have some method defined on the Document class (or an included module) that is overriding the default attribute accessor.
To find out which one, run this in the console:
Document.find(14).method(:path).source_location
In any case, you can access the attribute directly with
Document.find(14)['path']
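For illustration, a hedged sketch of how this can happen: any instance method named path, whether hand-written or generated by a gem (the override below is hypothetical, not taken from the question), shadows the column reader, while read_attribute and [] still reach the stored value:

class Document < ActiveRecord::Base
  # Hypothetical override shadowing the "path" column.
  def path
    nil
  end
end

doc = Document.find(14)
doc.path                   # => nil (the override wins)
doc.read_attribute(:path)  # => "https://storage.googleapis.com/..."
doc[:path]                 # => same as read_attribute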

Rails reuse object to save memory in bulk import

I'm currently using SmarterCSV to do bulk CSV import via MongoDB's upsert commands. I have the following code excerpt:
SmarterCSV.process(csv, csv_options) do |chunk|
  chunk.each do |row|
    # creates a temporary user to store the object
    user = User.new
    # converts row info to populate user object
    # creates an array of commands that can be executed by MongoDB via user.as_document
    updates << {:q => {:email => user.email},
                :u => {:$set => user.as_document},
                :multi => false,
                :upsert => true}
    user = nil
  end
end
However, I'm noticing that memory usage keeps growing, as the garbage collector (using Rails 3.2.14 & Ruby 2.0.0p353) doesn't seem to clear the temporary user objects fast enough.
So I tried creating user = User.new outside of the SmarterCSV block (see below) and reusing the user object inside the loop. This saves memory; however, user.as_document then overwrites the previous elements of the updates array on each iteration. I was able to work around that by using user.as_document.to_json, but that doesn't serialize the User's relations correctly. For example, instead of saving a BSON reference for a relation's id, it only saves the id as a string.
Any ideas? Is there a way that I can optimize the bulk import process?
user = User.new
SmarterCSV.process(csv, csv_options) do |chunk|
  chunk.each do |row|
    # converts row info to populate & reuse user object
    # creates an array of commands that can be executed by MongoDB via user.as_document.to_json
    updates << {:q => {:email => user.email},
                :u => {:$set => user.as_document.to_json},
                :multi => false,
                :upsert => true}
  end
end
I ended up fixing this by using user.as_document.deep_dup.
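For completeness, a sketch of how that fix might slot into the reuse-one-object version above (the surrounding loop is unchanged from the question; deep_dup copies the nested document, so later mutations of user don't affect entries already pushed onto updates):

updates << {:q => {:email => user.email},
            :u => {:$set => user.as_document.deep_dup},
            :multi => false,
            :upsert => true}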

Mongoid 4 (GitHub master) creating documents with duplicate IDs

I am running a high-traffic test with Sidekiq that creates MongoDB-backed objects using Mongoid as my driver in a Rails 4 app. The issue I'm seeing is that when a PlayByPlay document is supposed to have a unique game_id, I see multiple PlayByPlay objects getting created with the exact same game_id. I've enforced the unique constraint on MongoDB as well, and this is still happening. Here's my document, its embedded document, and a glimpse into how I'm creating the documents. The issue is that this is all happening in a threaded environment using Sidekiq, and I'm not sure if there is a way to work around it. My write concern is set to 1 in mongoid.yml, and it looks like the safe option was removed in master, as was persist_in_safe_mode. Code below -- any suggestions on how to properly work this out would be appreciated. This is not a replica set; it's a single MongoDB server handling all read/write requests at this time.
module MLB
  class Play
    include Mongoid::Document
    include Mongoid::Timestamps

    embedded_in :play_by_play

    field :batter #, type: Hash
    field :next_batter #, type: Hash
    field :pitchers #, type: Array
    field :pitches #, type: Array
    field :fielders #, type: Array
    field :narrative, type: String
    field :seq_id, type: Integer
    field :inning, type: Integer
    field :outs
    field :no_play
    field :home_team_score
    field :away_team_score
  end

  class PlayByPlay
    include Mongoid::Document
    include Mongoid::Timestamps

    embeds_many :plays, cascade_callbacks: true
    accepts_nested_attributes_for :plays

    field :sport
    field :datetime, type: DateTime
    field :gamedate, type: DateTime
    field :game_id
    field :home_team_id
    field :away_team_id
    field :home_team_score
    field :away_team_score
    field :season_year
    field :season_type
    field :location
    field :status
    field :home_team_abbr
    field :away_team_abbr
    field :hp_umpire
    field :fb_umpire
    field :sb_umpire
    field :tb_umpire

    index({game_id: 1})
    index({away_team_id: 1})
    index({home_team_id: 1})
    index({season_type: 1})
    index({season_year: 1})
    index({"plays.seq_id" => 1}, {unique: true, drop_dups: true})

    #validates 'play.seq_id', uniqueness: true
    validates :game_id, presence: true, uniqueness: true
    validates :home_team_id, presence: true
    validates :away_team_id, presence: true
    validates :gamedate, presence: true
    validates :datetime, presence: true
    validates :season_type, presence: true
    validates :season_year, presence: true

    def self.parse!(entry)
      @document = Nokogiri::XML(entry.data)
      xslt = Nokogiri::XSLT(File.read("#{$XSLT_PATH}/mlb_pbp.xslt"))
      transform = xslt.apply_to(@document)
      json_document = JSON.parse(transform)
      obj = find_or_create_by(game_id: json_document['game_id'])
      obj.sport = json_document['sport']
      obj.home_team_id = json_document['home_team_id']
      obj.away_team_id = json_document['away_team_id']
      obj.home_team_score = json_document['home_team_score']
      obj.away_team_score = json_document['away_team_score']
      obj.season_type = json_document['season_type']
      obj.season_year = json_document['season_year']
      obj.location = json_document['location']
      obj.datetime = DateTime.strptime(json_document['datetime'], "%m/%d/%y %H:%M:%S")
      obj.gamedate = DateTime.strptime(json_document['game_date'], "%m/%d/%Y %H:%M:%S %p")
      obj.status = json_document['status']
      obj.home_team_abbr = json_document['home_team_abbr']
      obj.away_team_abbr = json_document['away_team_abbr']
      obj.hp_umpire = json_document['hp_umpire']
      obj.fb_umpire = json_document['fb_umpire']
      obj.sb_umpire = json_document['sb_umpire']
      obj.tb_umpire = json_document['tb_umpire']
      p = obj.plays.build(seq_id: json_document['seq_id'])
      p.batter = json_document['batter']
      p.next_batter = json_document['next_batter'] if json_document['next_batter'].present? && json_document['next_batter'].keys.count >= 1
      p.pitchers = json_document['pitchers'] if json_document['pitchers'].present? && json_document['pitchers'].count >= 1
      p.pitches = json_document['pitches'] if json_document['pitches'].present? && json_document['pitches'].count >= 1
      p.fielders = json_document['fielders'] if json_document['fielders'].present? && json_document['fielders'].count >= 1
      p.narrative = json_document['narrative']
      p.seq_id = json_document['seq_id']
      p.inning = json_document['inning']
      p.outs = json_document['outs']
      p.no_play = json_document['no_play']
      p.home_team_score = json_document['home_team_score']
      p.away_team_score = json_document['away_team_score']
      obj.save
    end
  end
end
** NOTE **
This problem goes away if I limit Sidekiq to 1 worker, which obviously I'd never do in the real world.
You already have an index on game_id, so why not make it unique? That way the DB will not allow a duplicate entry, even if Mongoid fails to validate uniqueness correctly (@vidaica's answer describes how Mongoid can fail that validation).
Try adding a unique index
index({"game_id" => 1}, {unique: true})
and then
rake db:mongoid:create_indexes
to create it in MongoDB (please verify from a mongo shell that the index was actually created).
After that, MongoDB will not persist any records with a duplicate game_id, and all you have to do on the Ruby layer is handle the insert errors you'll receive from MongoDB.
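As a rough sketch of that error handling (the error class and the E11000 duplicate-key check below are assumptions that depend on your Mongoid/driver version, so adjust accordingly):

begin
  obj.save
rescue Moped::Errors::OperationFailure => e
  # E11000 is MongoDB's duplicate-key error code.
  raise unless e.message.include?("E11000")
  # Another worker won the race: load the existing document and update it instead.
  existing = PlayByPlay.find_by(game_id: obj.game_id)
  # ... merge obj's data into existing and save ...
end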
This is because many threads are inserting objects with the same game_id. Let me paraphrase it.
For example, you have two sidekiq threads t1 and t2. They run in parallel. Assuming you have a document with game_id 1 and it has not been inserted into the database.
t1 enters the parse method, sees no document in the database with game_id 1, creates a document with game_id 1 and continues populating other data, but has not yet saved the document.
t2 enters the parse method, sees no document in the database with game_id 1 because at this point t1 has not saved its document, and creates a document with the same game_id 1.
t1 saves its document.
t2 saves its document.
The result: you have two documents with the same game_id 1.
To prevent this you can use a Mutex to serialize access to the parsing code. To learn how to use a Mutex, read this: http://www.ruby-doc.org/core-2.0.0/Mutex.html
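A minimal sketch of that idea, assuming all your Sidekiq workers run as threads inside a single process (a Mutex will not help across multiple processes or machines):

PARSE_LOCK = Mutex.new

def self.parse!(entry)
  PARSE_LOCK.synchronize do
    # ... the existing find_or_create_by / populate / save logic ...
  end
end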
Whatever you do, you will want to solve this at the database level, because you will almost certainly do a worse job of implementing unique constraints than the MongoDB people did.
Assuming you will want to shard one day, or chose Mongo for its horizontal scalability features (you're doing high-volume testing, so I assume this is something you don't want to rule out by design), there may be no reliable way to do this (see Ramifications of working with a mongodb cluster and sharding concepts):
Suppose we were sharding on email and wanted to have a unique index on username. This is not possible to enforce with a cluster.
However, if you're sharding on game_id, or not considering sharding at all, then setting a unique index on game_id should prevent duplicate records (see @xlembouras's answer).
However, that answer may not prevent exceptions when the index is violated due to race conditions, so be sure to rescue that exception and perform an update instead of a create in the rescue block (possibly by playing with @new_record (click 'Show source'); I'll try to find time to give you exact code).
UPDATE, short fast answer
begin
  a = Album.new(name: 'foo', game_id: 3)
  a.save
rescue
  a.id = id_of_the_object_with_same_id_already_in_db
  a.instance_variable_set('@new_record', false)
  a.save
end
@vidaica's answer is helpful. If you were fetching and incrementing an ID from memory or a database, it might solve your problem.
However, your game_id is not being generated in parse, it is being passed into parse via the entry JSON object.
How / where is your game_id being generated?
Maybe you should do an upsert instead of an insert:
obj = new(game_id: json_document['game_id'])
obj.upsert
A naive approach is to change the last line of #parse to:
obj.save if where(game_id: obj.game_id).count == 0
Or if you want to handle it somehow:
if where(game_id: obj.game_id).count == 0
  # handle it here
end
Note however that this still leaves possibilities for duplicates.

Getting couchrest and couch_potato to recognize existing couchdb documents

I'm trying to create a basic Rails CRUD app against a CouchDB database hosted on Cloudant.
I'm using couch_potato as my persistence layer and have it connecting properly to my Cloudant database.
The issue I'm having is that my first model won't see the existing documents in my CouchDB database unless I add a ruby_class field that equals the name of my model.
My simple User model:
class User
  include CouchPotato::Persistence

  property :id, :type => Fixnum
  property :FullName, :type => String

  view :all, :key => :FullName
end
Sample CouchDB document:
{
  "_id": 123456,
  "_rev": "4-b96f36763934ce7c469abbc6fa05aaf3",
  "ORGID": 400638,
  "MyOrgToken": "19fc342d50f9d8df1ecd5e5404f5e5f7",
  "FullName": "Jane Doe",
  "Phone": "555-555-5555",
  "MemberNumber": 123456,
  "Email": "jane@example.com",
  "LoginPWHash": "14a3ccc0e6a50135ef391608e786f4e8"
}
Now, when I use my all view from the rails console, I don't get any results back:
1.9.2-p290 :002 > CouchPotato.database.view User.all
=> []
If I add the field and value "ruby_class: User" to the above CouchDB document, then I get results back in the console:
1.9.2-p290 :003 > CouchPotato.database.view User.all
=> [#<User _id: "123456", _rev: "4-b96f36763934ce7c469abbc6fa05aaf3", created_at: nil,
updated_at: nil, id: "123456", FullName: "Jane Doe">]
I'm working with a large set of customer data, and I don't want to write any scripts to add the ruby_class field to every document (and I may not be permitted to).
How can I get my app to recognize these existing CouchDB documents without adding the ruby_class field?
I couldn't find much documentation for couch_potato and couchrest that shows how to work with existing CouchDB databases. Most of the examples assume you're starting your project and database(s) from scratch.
Thanks,
/floatnspace
When you look at the all view of your User, you will see something like ruby_class == 'User', so unless you add this property to your documents you will need to work around what couch_potato provides. You could, for example, use couchrest directly to retrieve your documents, but I don't think that is what you want.
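If you do go the CouchRest route, a minimal sketch might look like the following (the connection URL and database name are placeholders, not from the question):

require 'couchrest'

# Connect to the existing Cloudant/CouchDB database and fetch a document by _id.
db  = CouchRest.database("https://USER:PASS@ACCOUNT.cloudant.com/users")
doc = db.get("123456")
doc["FullName"]   # => "Jane Doe"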
If you start persisting or updating your own documents, couch_potato will add the ruby_class field anyway, so I think the simplest solution would be to just add it to the existing documents.
Another thing you can do is create a view that also emits the documents when they DON'T have the property set. This approach will only work if you have just one kind of document in your CouchDB database:
function(doc) {
  if (!doc.ruby_class || doc.ruby_class == 'User') {
    emit(doc.FullName, null);
  }
}

Saving a nested hash in Ruby on Rails

I'm trying to save a nested Hash to my database and retrieve it, but nested values are lost upon retrieval.
My model looks like this:
class User
  serialize :metadata, MetaData
end
The class MetaData looks like this:
class MetaData < Hash
  attr_accessor :availability, :validated
end
The code I'm using to store data looks something like this (the real data is coming from a HTML form, though):
user = User.find(id)
user.metadata.validated = true
user.metadata.availability = {'Sunday' => 'Yes', 'Monday' => 'No', 'Tuesday' => 'Yes'}
user.save
When I look at the data in the database, I see the following:
--- !map:MetaData
availability: !map:ActiveSupport::HashWithIndifferentAccess
  Sunday: "Yes"
  Monday: "No"
  Tuesday: "Yes"
validated: true
The problem occurs when I try to get the object again:
user = User.find(id)
user.metadata.validated # <- this is true
user.metadata.availability # <- this is nil
Any ideas? I'm using Rails 3.1 with Postgresql as my datastore.
If you look in the database you see "map:ActiveSupport::HashWithIndifferentAccess" for availability?
My approach would be to separate out the single instance of availability from the hash collection structure of days available.
You mean user.metadata.validated # <- this is true?
What DB columns are metadata and availability stored as? They need to be TEXT
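One possible workaround (an assumption on my part, not from the thread): since attr_accessor on a Hash subclass stores values in instance variables rather than hash entries, keeping everything as plain hash keys sidesteps the serialization issue:

class User < ActiveRecord::Base
  serialize :metadata, Hash   # the metadata column should be TEXT
end

user = User.find(id)
user.metadata ||= {}
user.metadata['validated']    = true
user.metadata['availability'] = {'Sunday' => 'Yes', 'Monday' => 'No', 'Tuesday' => 'Yes'}
user.save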
