Removing duplicates from array before saving - ruby-on-rails

I periodically fetch the latest tweets with a certain hashtag and save them locally. In order to prevent saving duplicates, I use the method below. Unfortunately, it does not seem to be working... so what's wrong with this code:
def remove_duplicates
before = #tweets.size
#tweets.delete_if {|tweet| !((Tweet.all :conditions => { :twitter_id => tweet.twitter_id}).empty?) }
duplicates = before - #tweets.size
puts "#{duplicates} duplicates found"
end
Where #tweets is an array of Tweet objects fetched from twitter. I'd appreciate any solution that works and especially one that might be more elegant...

you can validate_uniqueness_of :twitter_id in the Tweet model (where this code should be). This will cause duplicates to fail to save.

Since it sounds like you're using the Twitter search API, a better solution is to use the since_id parameter. Keep track of the last twitter status id you got from your previous query and use that as the since_id parameter on your next query.
More information is available at Twitter Search API Method: search

array.uniq!
Removes duplicate elements from self. Returns nil if no changes are made (that is, no duplicates are found).

Ok, turns out the problem was a bit of different nature: When looking closer into it, I found out that multipe Tweets were saved with the twitter_id 2147483647... This is the upper limit for integer fields :)
Changing the field to bigint solved the problem. It took me very long to figure out since MySQL did silently fail and just reverted to the maximum value as long as it could. (until I added the unique index). I quickly tried it out with postgres, which returned a nice "Integer out of range" error, which then pointed me to the real cause of the problem here.
Thanks Ben for the validation and indexing tips, as they lead to much cleaner code now!

Related

Mongoid identity_map and memory usage, memory leaks

When I executing query
Mymodel.all.each do |model|
# ..do something
end
It uses allot of memory and amount of used memory increases at all the time and at the and it crashes. I found out that to fix it I need to disable identity_map but when I adding to my mongoid.yml file identity_map_enabled: false I am getting error
Invalid configuration option: identity_map_enabled.
Summary:
A invalid configuration option was provided in your mongoid.yml, or a typo is potentially present. The valid configuration options are: :include_root_in_json, :include_type_for_serialization, :preload_models, :raise_not_found_error, :scope_overwrite_exception, :duplicate_fields_exception, :use_activesupport_time_zone, :use_utc.
Resolution:
Remove the invalid option or fix the typo. If you were expecting the option to be there, please consult the following page with repect to Mongoid's configuration:
I am using Rails 4 and Mongoid 4, Mymodel.all.count => 3202400
How can I fix it or maybe some one know other way to reduce amount of memory used during executing query .all.each ..?
Thank you very much for the help!!!!
I started with something just like you by doing loop through millions of record and the memory just keep increasing.
Original code:
#portal.listings.each do |listing|
listing.do_something
end
I've gone through many forum answers and I tried them out.
1st attempt: I try to use the combination of WeakRef and GC.start but no luck, I fail.
2nd attempt: Adding listing = nil to the first attempt, and still fail.
Success Attempt:
#start_date = 10.years.ago
#end_date = 1.day.ago
while #start_date < #end_date
#portal.listings.where(created_at: #start_date..#start_date.next_month).each do |listing|
listing.do_something
end
#start_date = #start_date.next_month
end
Conclusion
All the memory allocated for the record will never be released during
the query request. Therefore, trying with small number of record every
request does the job, and memory is in good condition since it will be
released after each request.
Your problem isn't the identity map, I don't think Mongoid4 even has an identity map built in, hence the configuration error when you try to turn it off. Your problem is that you're using all. When you do this:
Mymodel.all.each
Mongoid will attempt to instantiate every single document in the db.mymodels collection as a Mymodel instance before it starts iterating. You say that you have about 3.2 million documents in the collection, that means that Mongoid will try to create 3.2 million model instances before it tries to iterate. Presumably you don't have enough memory to handle that many objects.
Your Mymodel.all.count works fine because that just sends a simple count call into the database and returns a number, it won't instantiate any models at all.
The solution is to not use all (and preferably forget that it exists). Depending on what "do something" does, you could:
Page through all the models so that you're only working with a reasonable number of them at a time.
Push the logic into the database using mapReduce or the aggregation framework.
Whenever you're working with real data (i.e. something other than a trivially small database), you should push as much work as possible into the database because databases are built to manage and manipulate big piles of data.

Rails: Getting column value from query

Seems like it should be able to look at a simple tutorial or find an aswer with a quick google, but I can't...
codes = PartnerCode.find_by_sql "SELECT * from partner_codes where product = 'SPANMEX' and isused = 'false' limit 1"
I want the column named code, I want just the value. Tried everything what that seems logical. Driving me nuts because everything I find shows an example without referencing the actual values returned
So what is the object returned? Array, hash, ActiveRecord? Thanks in advance.
For Rails 4+ (and a bit earlier I think), use pluck:
Partner.where(conditions).pluck :code
> ["code1", "code2", "code3"]
map is inefficient as it will select all columns first and also won't be able to optimise the query.
You need this one
Partner.where( conditions ).map(&:code)
is shorthand for
Partner.where( conditions ).map{|p| p.code}
PS
if you are often run into such case you will like this gem valium by ernie
it gives you pretty way to get values without instantiating activerecord object like
Partner.where( conditions ).value_of :code
UPDATED:
if you need access some attribute and after that update record
save instance first in some variable:
instance=Partner.where( conditions ).first
then you may access attributes like instance.code and update some attribute
instance.update_attribute || instance.update_attributes
check documentation at api.rubyonrails.org for details

Rails - given an array of Users - how to get a output of just emails?

I have the following:
#users = User.all
User has several fields including email.
What I would like to be able to do is get a list of all the #users emails.
I tried:
#users.email.all but that errors w undefined
Ideas? Thanks
(by popular demand, posting as a real answer)
What I don't like about fl00r's solution is that it instantiates a new User object per record in the DB; which just doesn't scale. It's great for a table with just 10 emails in it, but once you start getting into the thousands you're going to run into problems, mostly with the memory consumption of Ruby.
One can get around this little problem by using connection.select_values on a model, and a little bit of ARel goodness:
User.connection.select_values(User.select("email").to_sql)
This will give you the straight strings of the email addresses from the database. No faffing about with user objects and will scale better than a straight User.select("email") query, but I wouldn't say it's the "best scale". There's probably better ways to do this that I am not aware of yet.
The point is: a String object will use way less memory than a User object and so you can have more of them. It's also a quicker query and doesn't go the long way about it (running the query, then mapping the values). Oh, and map would also take longer too.
If you're using Rails 2.3...
Then you'll have to construct the SQL manually, I'm sorry to say.
User.connection.select_values("SELECT email FROM users")
Just provides another example of the helpers that Rails 3 provides.
I still find the connection.select_values to be a valid way to go about this, but I recently found a default AR method that's built into Rails that will do this for you: pluck.
In your example, all that you would need to do is run:
User.pluck(:email)
The select_values approach can be faster on extremely large datasets, but that's because it doesn't typecast the returned values. E.g., boolean values will be returned how they are stored in the database (as 1's and 0's) and not as true | false.
The pluck method works with ARel, so you can daisy chain things:
User.order('created_at desc').limit(5).pluck(:email)
User.select(:email).map(&:email)
Just use:
User.select("email")
While I visit SO frequently, I only registered today. Unfortunately that means that I don't have enough of a reputation to leave comments on other people's answers.
Piggybacking on Ryan's answer above, you can extend ActiveRecord::Base to create a method that will allow you to use this throughout your code in a cleaner way.
Create a file in config/initializers (e.g., config/initializers/active_record.rb):
class ActiveRecord::Base
def self.selected_to_array
connection.select_values(self.scoped)
end
end
You can then chain this method at the end of your ARel declarations:
User.select('email').selected_to_array
User.select('email').where('id > ?', 5).limit(4).selected_to_array
Use this to get an array of all the e-mails:
#users.collect { |user| user.email }
# => ["test#example.com", "test2#example.com", ...]
Or a shorthand version:
#users.collect(&:email)
You should avoid using User.all.map(&:email) as it will create a lot of ActiveRecord objects which consume large amounts of memory, a good chunk of which will not be collected by Ruby's garbage collector. It's also CPU intensive.
If you simply want to collect only a few attributes from your database without sacrificing performance, high memory usage and cpu cycles, consider using Valium.
https://github.com/ernie/valium
Here's an example for getting all the emails from all the users in your database.
User.all[:email]
Or only for users that subscribed or whatever.
User.where(:subscribed => true)[:email].each do |email|
puts "Do something with #{email}"
end
Using User.all.map(&:email) is considered bad practice for the reasons mentioned above.

fullcalendar with rails - limit results to a range

I've got fullcalendar working with a small rails app (yeah) but it's sluggish because the find in my controller is finding ALL the records before it renders the calendar. I'm using a JSON approach. The field names I'm using are starts_at and ends_at. This (in the index method of the assignments_controller) works:
#assignments = Assignment.find(:all, :conditions => "starts_at IS NOT NULL")
But, as I said, it's pokey, and will only get worse as more records get added.
So this is clearly more of a rails question than a fullcalendar question: I can't figure out how to get fullcalendar to initially display the current week (when no parameters have been sent) and then accept parameters from next/previous buttons while, in either case, only looking up the relevant items from the database.
Oh - this is rails 2.x, NOT 3.
Thanks for any pointers.
Please ignore this question.
It turned out to be an issue with Date format inconsistencies between JavaScript (Epoch) and Ruby. At least that's what I think at the moment.
I'm still scratching my head, trying to figure out how exactly I "fixed" it, but it seems to be working.
I was aware of this project: http://github.com/bansalakhil/fullcalendar
but it took me ages to get the nuance of Time.at figured out.
I must say, Time is a tricky thing.
In real life as well as in code.
Thanks to everyone who gave my (misguided, as it turned out) question a glance.

Saving updates to objects in rails

I'm trying to update one of my objects in my rails app and the changes just don't stick. There are no errors, and stepping through with the debugger just reveals that it thinks everything is updating.
Anyway, here is the code in question...
qm = QuestionMembership.find(:first, :conditions => ["question_id = ? AND form_id = ?", q_id, form_id])
qm.position = x
qm.save
For reference sake, QuestionMembership has question_id, form_id, and position fields. All are integers, and have no db constraints.
That is basically my join table between Forms and Questions.
Stepping through the code, qm gets a valid object, the position of the object does get changed to the value of x, and save returns 'true'.
However, after the method exits, the object in the db is unchanged.
What am I missing?
You may not be finding the object that you think you are. Some experimenting in irb might be enlightening.
Also, as a general rule when changing only one attribute, it's better to write
qm.update_attribute(:position, x)
instead of setting and saving. Rails will then update only that column instead of the entire row. And you also get the benefit of the data being scrubbed.
Is there an after_save?
Is the correct SQL being emitted?
In development log, you can actually see the sql that is generated.
For something like this:
qm = QuestionMembership.find(:first, :conditions => ["question_id = ? AND form_id = ?", q_id, form_id])
qm.position = x
qm.save
You should see something to the effect of:
SELECT * FROM question_memberships WHERE question_id=2 AND form_id=6 LIMIT 1
UPDATE question_memberships SET position = x WHERE id = 5
Can you output what sql you are actually seeing so we can compare?
Either update the attribute or call:
qm.reload
after the qm.save
What is the result of qm.save? True or false? And what about qm.errors, does that provide anything that makes sense to you? And what does the development.log say?
I have run into this problem rather frequently. (I was about to say consistently, but I cannot, as that would imply that I would know when it was about to happen.)
While I have no solution to the underlying issue, I have found that it seems to happen to me only when I am trying to update mysql text fields. My workaround has been to set the field to do something like:
qm.position = ""
qm.save
qm.position = x
qm.save
And to answer everyone else... when I run qm.save! I get no errors. I have not tried qm.save?
When I run through my code in the rails console everything works perfectly as evidenced by re-finding the object using the same query brings the expected results.
I have the same issue when using qm.update_attribute(... as well
My workaround has gotten me limping this far, but hopefully someone on this thread will be able to help.
Try changing qm.save to qm.save! and see if you get an exception message.
Edit: What happens when you watch the log on the call to .save!? Does it generate the expected SQL?
Use ./script/console and run this script.. step by step..
see if the position field for the object is update or not when you run line 2
then hit qm.save or qm.save!... to test
see what happens. Also as mentioned by Tim .. check the logs
Check your QuestionMembership class and verify that position does not have something like
attr_readonly :position
Best way to debug this is to do
tail -f log/development.log
And then open another console and do the code executing the save statement. Verify that the actual SQL Update statement is executed.
Check to make sure your database settings are correct. If you're working with multiple databases (or haven't changed the default sqlite3 database to MySQL) you may be working with the wrong database.
Run the commands in ./script/console to see if you see the same behavior.
Verify that a similar object (say a Form or Question) saves.
If the Form or Question saves, find the difference between the QuestionMembership and Form or Question object.
Turns out that it was emitting the wrong SQL. Basically it was looking for the QuestionMembeship object by the id column which doesn't exist.
I was under the impression that that column was unnecessary with has_many_through relationships, although it seems I was misguided.
To fix, I simply added the id column to the table as a primary key. Thanks for all the pointers.

Resources