Change type of field in Mongoid without losing data - ruby-on-rails

How can I run a migration to change the type of a field in Mongoid/MongoDB without losing any data?
In my case I'm trying to convert from a BigDecimal (stored as string) to an Integer to store some money. I need to convert the string decimal representation to cents for the integer. I don't want to lose the existing data.
I'm assuming the steps might be something like:
create new Integer field with a new name, say amount2
deploy to production and run a migration (or rake task) that converts each amount to the right value for amount2
(this whole time existing code is still using amount and there is no downtime from the users' perspective)
take the site down for maintenance, run the migration one more time to capture any amount fields that could have changed in the last few minutes
delete amount and rename amount2 to amount
deploy new code which expects amount to be an integer
bring site back up
It looks like Mongoid offers a rename method: http://mongoid.org/docs/persistence/atomic.html#rename
But I'm a little confused how this is used. If you have a field named amount2 (and you've already deleted amount), do you just run Transaction.rename :amount2, :amount? Then I imagine this immediately breaks the underlying representation, so you have to restart your app server after that? What happens if you run that while amount still exists? Does it get overwritten, fail, or try to convert on its own?
Thanks!

Ok I made it through. I think there is a faster way using the mongo console with something like this:
MongoDB: How to change the type of a field?
But I couldn't get the conversion working, so opted for this slower method in the rails console with more downtime. If anyone has a faster solution please post it.
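For anyone who wants to attempt the faster route, a driver-level pass (skipping Mongoid document instantiation entirely) would look roughly like this. This is only a sketch: it assumes a current Ruby driver API and that the amounts are stored as plain decimal strings.

require 'bigdecimal'

# Sketch only: convert each stored decimal string to integer cents at the
# driver level, without instantiating Mongoid documents.
coll = Transaction.collection
coll.find('amount2' => { '$exists' => false }).each do |doc|
  cents = (BigDecimal(doc['amount'].to_s) * 100).to_i
  coll.find('_id' => doc['_id']).update_one('$set' => { 'amount2' => cents })
end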
create new Integer field with a new name, say amount2
convert each amount to the right value for amount2 in a console or rake task
Mongoid.identity_map_enabled = false
Transaction.all.each_with_index do |t, i|
  puts i if i % 1000 == 0            # progress output
  t.amount2 = t.amount.to_money      # populate the new integer field
  break if !t.save
end
Note that .all.each works fine (you don't need to use .find_each or .find_in_batches as you would with regular ActiveRecord and MySQL) because of MongoDB cursors. It won't fill up memory as long as the identity map is off.
take the site down for maintenance, then run the migration one more time to capture any amount fields that could have changed in the last few minutes (something like Transaction.where(:updated_at.gt => 1.hour.ago).each_with_index...)
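Spelled out, that catch-up pass is just the same loop as above, scoped to recently touched documents (the one-hour window is arbitrary; size it to your maintenance window):

Mongoid.identity_map_enabled = false
Transaction.where(:updated_at.gt => 1.hour.ago).each_with_index do |t, i|
  puts i if i % 1000 == 0
  t.amount2 = t.amount.to_money
  break if !t.save
end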
comment out field :amount, type: BigDecimal in your model (you don't want Mongoid to know about this field anymore) and push this code
now run another script to rename your column (it overwrites any old BigDecimal string values in the process). You might need to comment out any validations on the model which expect the old field.
Mongoid.identity_map_enabled = false
Transaction.all.each_with_index do |t, i|
  puts i if i % 1000 == 0
  t.rename :amount2, :amount
end
This is atomic and doesn't require a save on the model.
update your model to reflect the new field type: field :amount, type: Integer
deploy and bring the site back up
As mentioned I think there is a better way, so if anyone has some tips please share. Thanks!

Related

Import CSV data faster in rails

I am building an import module to import a large set of orders from a csv file. I have a model called Order where the data needs to be stored.
A simplified version of the Order model is below
sku
quantity
value
customer_email
order_date
status
When importing the data, two things have to happen:
Any dates or currencies need to be cleaned up, i.e. dates are represented as strings in the CSV and need to be converted into Ruby Date objects, and currencies need to be converted to decimals by removing any commas or dollar signs.
If a row already exists, it has to be updated; uniqueness is checked based on two columns.
Currently I use simple CSV import code:
CSV.foreach("orders.csv") do |row|
order = Order.first_or_initialize(sku: row[0], customer_email: row[3])
order.quantity = row[1]
order.value= parse_currency(row[2])
order.order_date = parse_date(row[4])
order.status = row[5]
order.save!
end
parse_currency and parse_date are two helper functions used to extract the values from the strings; in the case of the date, parse_date is just a wrapper around Date.strptime.
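For reference, they look roughly like this (simplified; the currency cleanup and the date format here are illustrative, not the exact code):

require 'bigdecimal'
require 'date'

# Hypothetical versions of the helpers referenced above.
def parse_currency(str)
  # Drop dollar signs and thousands separators, then convert to a decimal.
  BigDecimal(str.to_s.gsub(/[$,]/, ''))
end

def parse_date(str)
  # The format string is an assumption; adjust it to match the CSV.
  Date.strptime(str, '%m/%d/%Y')
end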
I can add a check to see if the record already exists and do nothing in that case, and that should save a little time. But I am looking for something that is significantly faster. Currently importing around 100k rows takes around 30 minutes with an empty database, and it will get slower as the data size increases.
So I am basically looking for a faster way to import the data.
Any help would be appreciated.
Edit
After some more testing based on the comments here I have an observation and a question. I am not sure if they should go here or if I need to open a new thread for the questions. So please let me know if I have to move this to a separate question.
I ran a test using Postgres copy to import the data from the file and it took less than a minute. I just imported the data into a new table without any validations. So the import can be much faster.
The Rails overhead seems to be coming from 2 places
The multiple database calls that are happening i.e. the first_or_initialize for each row. This ends up becoming multiple SQL calls because it has to first find the record and then update it and then save it.
Bandwidth. Each time the SQL server is called, data flows back and forth, which adds up to a lot of time.
Now for my question. How do I move the update/create logic to the database? I.e. if an order already exists based on the sku and customer_email, it needs to update the record, otherwise a new record needs to be created. Currently with Rails I am using the first_or_initialize method to get the record in case it exists and update it, else I am creating a new one and saving it. How do I do that in SQL?
I could run a raw SQL query using the ActiveRecord connection's execute, but I do not think that would be a very elegant way of doing it. Is there a better way of doing that?
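For concreteness, I imagine something along these lines. This is just a sketch, and it assumes PostgreSQL 9.5+ for ON CONFLICT plus a unique index on (sku, customer_email):

# Rough sketch: one upsert per row, issued through the raw connection,
# inside the CSV.foreach |row| loop from above.
conn = ActiveRecord::Base.connection
values = [row[0], row[3], row[1], parse_currency(row[2]), parse_date(row[4]), row[5]]
           .map { |v| conn.quote(v) }.join(', ')
conn.execute(<<-SQL)
  INSERT INTO orders (sku, customer_email, quantity, value, order_date, status)
  VALUES (#{values})
  ON CONFLICT (sku, customer_email)
  DO UPDATE SET quantity   = EXCLUDED.quantity,
                value      = EXCLUDED.value,
                order_date = EXCLUDED.order_date,
                status     = EXCLUDED.status
SQL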
Since Ruby 1.9, FasterCSV is part of the Ruby core. You don't need to use a special gem; simply use CSV.
With 100k records in 30 minutes, Ruby is taking about 0.018 seconds per record. In my opinion most of your time is spent in Order.first_or_initialize: this part of your code takes an extra round trip to your database, and initializing an ActiveRecord object takes its time too. But to really be sure, I would suggest that you benchmark your code.
Benchmark.bm do |x|
  x.report("CSV eval")       { CSV.foreach("orders.csv") {} }
  x.report("Init: ")         { 1.upto(100_000) { Order.first_or_initialize(sku: rand(...), customer_email: rand(...)) } } # use rand query to prevent query caching
  x.report("parse_currency") { 1.upto(100_000) { parse_currency(...) } }
  x.report("parse_date")     { 1.upto(100_000) { parse_date(...) } }
end
You should also watch memory consumption during your import. Maybe the garbage collection does not run often enough or objects are not cleaned up.
To gain speed you can follow Matt Brictson's hint and bypass ActiveRecord.
You can try the activerecord-import gem, or you can start to go parallel, for instance multiprocessing with fork or multithreading with Thread.new.
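A sketch of the activerecord-import route (the on_duplicate_key_update options below follow the gem's documented PostgreSQL usage; check them against the version you install):

require 'csv'

# Build the objects in memory, then write them in a handful of bulk statements.
orders = CSV.foreach("orders.csv").map do |row|
  Order.new(sku: row[0], quantity: row[1], value: parse_currency(row[2]),
            customer_email: row[3], order_date: parse_date(row[4]), status: row[5])
end

Order.import(
  orders,
  validate: false,
  batch_size: 1_000,
  on_duplicate_key_update: {
    conflict_target: %i[sku customer_email],   # PostgreSQL form; MySQL takes only :columns
    columns: %i[quantity value order_date status]
  }
)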

Is it bad to change _id type in MongoDB to integer?

MongoDB uses ObjectId type for _id.
Will it be bad if I make _id an incrementing integer?
(With this gem, if you're interested)
No, it isn't bad at all. In fact the built-in ObjectId is quite sizeable within the index, so if you believe you have something better, you are more than welcome to change the default value of the _id field to whatever you like.
But, and this is a big but, there are some considerations when deciding to move away from the default formulated ObjectId, especially when using auto-incrementing _ids as shown here: https://docs.mongodb.com/v3.0/tutorial/create-an-auto-incrementing-field
Multithreading isn't such a big problem, because findAndModify and the atomic locks can actually take care of that, but then you run into your first problem: findAndModify is not the fastest function nor the lightest, and there have been significant performance drops noticed when using it regularly.
You also have to consider the overhead of doing this yourself anyway, even without findAndModify. For every insert you will need an extra query. Imagine having a unique id that you have to query the uniqueness of every time you want to insert. Eventually your insert rate will drop to a crawl and your lock time will build up.
Of course the ObjectId is really good at being unique without having to check or formulate its own uniqueness by touching the database prior to insertion, hence it doesn't have this overhead.
If you still feel an integer _id suits your scenario, then go for it, but bear in mind the overhead described above.
You can do it, but you are responsible to make sure that the integers are unique.
MongoDB doesn't support auto-increment fields like most SQL databases. When you have a distributed or multithreaded application which has multiple processes and/or threads which create new database entries, you have to make sure that they use the same counter. Otherwise it could happen that two threads try to store a document with the same _id in the database.
When that happens, one of them will fail. That means you have to wait for the database to return a success or error (by calling getLastError or by setting the write concern to acknowledged), which takes longer than just sending data in a fire-and-forget manner.
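The usual way to centralize that counter is the counters-collection pattern from the tutorial linked above. A sketch with the Ruby driver, where the connection string, the :counters collection and the seq field are all illustrative names:

require 'mongo'

client = Mongo::Client.new('mongodb://localhost:27017/shop_dev')  # illustrative connection

# Atomically fetch the next integer id from a shared counters document.
def next_sequence(client, name)
  doc = client[:counters].find_one_and_update(
    { _id: name },
    { '$inc' => { seq: 1 } },
    upsert: true,
    return_document: :after
  )
  doc['seq']
end

client[:orders].insert_one(_id: next_sequence(client, 'order_id'), status: 'new')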
I had a use case for this: replacing _id with a 64 bit integer that represented a simhash of a document index for searching.
Since I intended to "Get or create", providing the initial simhash, and creating a new record if one didn't exist was perfect. Also, for anyone Googling, MongoDB support explained to me that simhashes are absolutely perfect for sharding and scaling, and even better than the more generic ObjectId, because they will divide up the data across shards perfectly and intrinsically, and you get the key stored for negative space (a uint64 is much smaller than an objectId and would need to be stored anyway).
Also, for you Googlers, replacing a MongoDB _id with something other than an ObjectId is absolutely simple: just create a document with the _id already defined; use an integer if you like. That's it: Mongo will simply use it. If you try to create a document with the same _id you'll get an error (E11000, duplicate key). So like me, if you're using simhashing, this is ideal in all respects.
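To make that concrete, a minimal sketch with the Ruby driver (the connection string, collection name and the integer value are all made up):

require 'mongo'

client = Mongo::Client.new('mongodb://localhost:27017/search_dev')  # illustrative

# Any unique value works as _id; here, a made-up 64-bit integer simhash.
client[:documents].insert_one(_id: 1_234_567_890_123_456_789, body: 'indexed text')

# A second insert with the same _id fails with a duplicate key error:
client[:documents].insert_one(_id: 1_234_567_890_123_456_789, body: 'again')
# => raises Mongo::Error::OperationFailure (E11000 duplicate key error)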

Rails 3 Migration: Autoincrement on (Non-Primary-Key) Column?

I'm looking for a way to create a column that autoincrements the way the automatic :id column does. I could probably handle this somehow in the model, but that seems kludgey. I haven't found anything in stock Rails 3 that handles this; are there gems available that might handle this? I'm surprised it's not already an option, since Rails handles this behavior for primary key columns.
Normally auto-incrementing columns are implemented using database sequences. The advantage of using a sequence over calculating the next increment is that getting the next value from a sequence is atomic. So if you have multiple processes creating new elements, the sequence will make sure your numbers are really unique.
Sequences (or their equivalents, like MySQL's AUTO_INCREMENT) are available in PostgreSQL, Oracle, MySQL, ...
How to implement this if you are using Postgres, for instance:
select the next value from the sequence:
Integer(Operator.connection.select_value("SELECT nextval('#{sequence_name}')"))
create a sequence:
Operator.connection.execute("CREATE sequence #{sequence_name}")
set the start value of a sequence:
Operator.connection.execute("SELECT setval('#{sequence_name}', #{new_start_serial})")
Hope this helps.
If you really think you need this, you could create a before_create filter in the model to check the last record's attribute value and add 1 to it. Feels hacky though.
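Which would be something like this sketch (the Ticket model and position column are illustrative; note it is racy if two records are created at the same time):

class Ticket < ActiveRecord::Base
  # The "last value + 1" approach; not safe under concurrent creates.
  before_create do
    self.position = (self.class.maximum(:position) || 0) + 1
  end
end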

Data integrity when updating/incrementing values in Rails

I am working on an application in which one module accesses a log (an ActiveRecord model) to increment various values at several points of the execution. The problem is that the script is likely to take several seconds, and other instances of the same script is likely to run at the same time.
What I'm seeing is that while the script finishes correctly and all data is accounted for, the values never end up correct. This is likely because by the time the script gets to the point where it updates the value, the read value for the column (which is incremented) is already out of date.
The values are correct if I force it to only have one instance of the module at a time, but for performance reasons I can't keep doing this.
Currently, I've tried to solve the problem by querying the database for the record before each increment statement in a transaction, like this, where column is the column name as a symbol and value is 1 or higher:
Log.transaction do
  log = Log.find(@log_id)
  log.update_attribute(column, log.send(column) + value)
end
However, it still won't give me accurate numbers. I'm thinking caching is involved, and I suppose I could try something like:
Log.transaction do
  uncached do
    log = Log.find(@log_id)
    log.update_attribute(column, log.send(column) + value)
  end
end
But surely I can't be the first to come across this issue, so I am wondering if there is a better implementation/solution for my problem?
ActiveRecord provides an update_counters method. You can change your code as follows:
Log.update_counters @log_id, column.to_sym => value

Can one rely on the auto-incrementing primary key in your database?

In my present Rails application, I am resolving scheduling conflicts by sorting the models by the "created_at" field. However, I realized that when inserting multiple models from a form that allows this, all of the created_at times are exactly the same!
This is more a question of best programming practices: Can your application rely on your ID column in your database to increment greater and greater with each INSERT to get their order of creation? To put it another way, can I sort a group of rows I pull out of my database by their ID column and be assured this is an accurate sort based on creation order? And is this a good practice in my application?
The generated identification numbers will be unique.
Regardless of whether you use sequences, as in PostgreSQL and Oracle, or another mechanism such as MySQL's auto-increment.
However, sequence values are most often acquired in bulk, for example 20 numbers at a time.
So with PostgreSQL you cannot determine which row was inserted first, and there might even be gaps in the ids of inserted records.
Therefore you shouldn't use a generated id field for a task like that; it would make you rely on database implementation details.
Setting a created or updated field when the command executes is much better for sorting by creation or update time later on.
For example:
INSERT INTO A (data, created) VALUES ('something', NOW())
UPDATE A SET data = 'something', updated = NOW()
That depends on your database vendor.
MySQL, I believe, absolutely orders auto-increment keys. SQL Server I don't know for sure, but I believe it does too.
Where you'll run into problems is with databases that don't support this functionality, most notably Oracle, which uses sequences that are roughly but not absolutely ordered.
An alternative might be to go for created time and then ID.
I believe the answer to your question is yes. If I read between the lines, I think you are concerned that the system may re-use ID numbers that are 'missing' in the sequence. In all the implementations I know of, if you had used 1, 2, 3, 5, 6, 7 as ID numbers, the next ID will always be 8 (or possibly higher); I don't know of any DB that would try to figure out that record ID #4 is missing and attempt to re-use it.
Though I am most familiar with SQL Server, I don't know why any vendor would try to fill the gaps in a sequence; think of the overhead of keeping a list of unused IDs, as opposed to just always keeping track of the last ID used and adding 1.
I'd say you can safely rely on the next assigned ID always being higher than the last, not just unique.
Yes, the id will be unique, and no, you cannot and should not rely on it for sorting; it is there to guarantee row uniqueness only. The best approach is, as emktas indicated, to use a separate "updated" or "created" field for just this information.
For setting the creation time, you can just use a default value like this:
CREATE TABLE foo (
  id INTEGER UNSIGNED AUTO_INCREMENT NOT NULL,
  created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated TIMESTAMP NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB; -- whatever :P
Now, that takes care of the creation time. For the update time I would suggest a BEFORE UPDATE trigger like this one (of course you can do it in a separate query, but the trigger, in my opinion, is a better solution: more transparent):
DELIMITER $$
CREATE TRIGGER foo_before_update BEFORE UPDATE ON foo
FOR EACH ROW BEGIN
  SET NEW.updated = NOW();
END$$
DELIMITER ;
And that should do it.
EDIT:
Woe is me. Foolishly I've not specified that this is for MySQL; there might be some differences in the function names (namely NOW) and other subtle itty-bitty details.
One caveat to EJB's answer:
SQL does not give any guarantee of ordering if you don't specify an ORDER BY clause. E.g. if you delete some early rows, then insert new ones, the new rows may end up living in the same place in the db that the old ones did (albeit with new IDs), and that's what it may use as its default sort.
FWIW, I typically use ORDER BY id as an effective version of ORDER BY created_at. It's cheaper in that it doesn't require an index on a datetime field (which is bigger and therefore slower than a simple integer primary key index), it's guaranteed to be different, and I don't really care if a few rows that were added at about the same time sort in a slightly different order.
This is probably DB-engine dependent. I would check how your DB implements sequences and, if there are no documented problems, I would decide to rely on the ID.
E.g. a PostgreSQL sequence is OK unless you play with the sequence cache parameters.
There is a possibility that other programmer will manually create or copy records from different DB with wrong ID column. However I would simplify the problem. Do not bother with low probability cases where someone will manually destroy data integrity. You cannot protect against everything.
My advice is to rely on sequence generated IDs and move your project forward.
In theory, yes, the highest id number is the last created. Remember though that databases have the ability to temporarily turn off the insertion of the auto-generated value, insert some records manually, and then turn it back on. Such inserts are not typical on a production system but can happen occasionally when moving a large chunk of data from another system.
