At first I was trying to do this:
Photo.find(:all, :conditions => ["id < 2000 AND id > 999"])
But then I realized that this isn't necessarily 1,000 objects. How do I select exactly a thousand objects, so that I can run a process that works on 1,000 objects at a time? Something like this:
Photo.find(:all, :conditions => ["id < 2000 AND id > 999"]).each{|instance| instance.photo.reprocess!(:tiny_thumb) }
You'd want to use :limit and :offset:
# First chunk
Photo.find(:all, :order => :id, :limit => 1000)
# Second chunk
Photo.find(:all, :order => :id, :limit => 1000, :offset => 1000)
You need to include the :order to ensure consistent results; otherwise the entries won't necessarily come out in the same order each time, which will mess up your chunking.
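If you want to walk the whole table this way, a rough sketch (reusing the reprocess! call from your question) is to keep bumping the offset until a chunk comes back empty:
offset = 0
loop do
  photos = Photo.find(:all, :order => :id, :limit => 1000, :offset => offset)
  break if photos.empty?
  # Process this chunk of up to 1,000 records, then move on to the next one.
  photos.each { |photo| photo.photo.reprocess!(:tiny_thumb) }
  offset += 1000
end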
Use find_in_batches: http://apidock.com/rails/ActiveRecord/Batches/find_in_batches
with a :batch_size of 1000, which also happens to be the default.
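Adapting the reprocessing loop from your question, that would look something like this:
# Fetches photos in batches of 1,000 (the default batch size) and yields each batch as an array.
Photo.find_in_batches(:batch_size => 1000) do |photos|
  photos.each { |photo| photo.photo.reprocess!(:tiny_thumb) }
end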
Related
I'm studying Sphinx and Thinking Sphinx and I need your opinion and help. What I want to do is the following:
I have a list of news items (noticias) and I want to order the results by date and relevance, because at the moment, no matter when a news item was created, the query doesn't take its date into consideration. If I could at least specify that the most recent year, or year and month, carries more relevance, my problem would be solved.
I've seen a lot of material, but nothing very conclusive, maybe because of my limited experience with Sphinx and Thinking Sphinx.
How can I solve this problem? What do you think is the best way? Thanks.
My model:
define_index do
  indexes :titulo
  indexes :chamada
  indexes :texto
  indexes :description
  indexes :keywords
  indexes :otimizador_de_busca
  indexes :created_at, :sortable => true
  indexes tags.nome, :as => :tag
  indexes usuario.nome, :as => :autor

  where "validacao = '1'"
end
My search function in the controller:
termo = params[:termo].first(50)
@noticias = Noticia.search termo,
  :field_weights => {:tag => 150, :autor => 120, :titulo => 100, :chamada => 80, :otimizador_de_busca => 65, :description => 50, :keywords => 50, :texto => 10},
  :match_mode => :all,
  :page => params[:pagina],
  :sort_mode => :extended,
  :order => "@relevance DESC, created_at DESC",
  :per_page => 15
A few things to note. Firstly, there's a difference between fields and attributes with Sphinx: there's not really much to be gained by having created_at as a field, but it's far more useful as an attribute (attributes are natively sortable). So, let's update the index definition:
define_index do
  indexes :titulo
  indexes :chamada
  indexes :texto
  indexes :description
  indexes :keywords
  indexes :otimizador_de_busca
  indexes tags.nome, :as => :tag
  indexes usuario.nome, :as => :autor

  has :created_at

  where "validacao = '1'"
end
And then run rake ts:rebuild so that the change is reflected in your index files and the Sphinx daemon is aware of it too.
As for how you're sorting... you've got a few options. In your example, you're sorting primarily by relevance, but anything with matching relevance scores has the newer items listed first. I think that'll work quite well.
If you want to use Sphinx's time_segments sorting, then that might also work well, as it'll group results first by their age (without being too specific), and then automatically order within each age group by relevance:
termo = params[:termo].first(50)
@noticias = Noticia.search termo,
  :field_weights => {:tag => 150, :autor => 120, :titulo => 100, :chamada => 80, :otimizador_de_busca => 65, :description => 50, :keywords => 50, :texto => 10},
  :match_mode => :extended,
  :page => params[:pagina],
  :sort_mode => :time_segments,
  :order => :created_at,
  :per_page => 15
I've also changed the match mode to extended, which I'd generally recommend.
Finally, as you've suggested, you could factor the created_at timestamp into the relevance in an expression - that's up to you. There are probably formulas out there that could help with that, but I think that's extra complexity you probably don't need.
If you think it's more important to have newer results first, then use time segments. If you think it's more important to have the results most relevant to the search query first, use the extended sort mode from your own example. I think the latter is better, but it's up to you.
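If you do want to try the expression approach mentioned above, a rough sketch might look like the following. Treat it as an assumption-laden starting point: Sphinx's expression sort mode does exist, but whether Thinking Sphinx expects the expression in :order or :sort_by, and what weighting formula suits your data, depends on your versions, so check the documentation first.
# Hypothetical sketch: blend Sphinx's relevance weight with recency.
# The option names (:sort_mode => :expr, :order carrying the expression) and
# the placeholder formula below are assumptions to verify and tune.
@noticias = Noticia.search termo,
  :match_mode => :extended,
  :sort_mode  => :expr,
  :order      => '@weight * 10 + created_at / 1000000',
  :page       => params[:pagina],
  :per_page   => 15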
I'm working in a Rails 2.3.11 environment.
I want to seed a social_activities table like so:
votes = Vote.find(:all, :limit => 10000, :order => 'created_at DESC')
for vote in votes do
  act = vote.activity_for!
  act.created_at = vote.created_at
  act.save!
end

comments = Comment.find(:all, :limit => 10000, :order => 'created_at DESC')
for comment in comments do
  act = comment.activity_for!
  act.created_at = comment.created_at
  act.save!
end
...and...so on...
As you can see, I'm processing a lot of records. How can I do so in the most memory- and performance-efficient way?
Instead of fetching 10,000 records at a time, you can reduce the number of objects in memory by making that number smaller (say 100) and using find_each to work your way through all the records. Note that find_each always walks the table in primary-key order and won't accept an :order option, but since you're seeding from every record anyway, the order shouldn't matter.
# find_each fetches fixed-size batches in primary-key order; no :order option here.
Vote.find_each(:batch_size => 100) do |vote|
  act = vote.activity_for!
  act.created_at = vote.created_at
  act.save!
end

Comment.find_each(:batch_size => 100) do |comment|
  act = comment.activity_for!
  act.created_at = comment.created_at
  act.save!
end
Now records will only be fetched 100 at a time, reducing the memory footprint.
Active Record isn't great for importing or moving large datasets around. This looks like something you should do with an SQL statement directly.
In SQL I think you would do this with an INSERT ... SELECT, or an update across an inner join, or something similar. Either way the database server would run your query directly, and this would be dramatically faster than Active Record.
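For example, a single INSERT ... SELECT per source model could populate the table in one statement. This is only a sketch: the social_activities column names below are assumptions, so substitute your real schema and whatever defaults activity_for! normally sets.
# Hypothetical sketch -- assumes social_activities has subject_type, subject_id
# and created_at columns; adjust to your actual table.
ActiveRecord::Base.connection.execute(<<-SQL)
  INSERT INTO social_activities (subject_type, subject_id, created_at)
  SELECT 'Vote', votes.id, votes.created_at
  FROM votes
SQL
You'd repeat the same pattern for comments and the other models.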
In Rails 2.3.8 I'm trying to order a query so that the post with the most comments AND votes comes first.
I've tried to add a new method to the Post model as:
def interestingness
  self.comments_count + self.votes_count
end
post_of_the_moment = find(:all,
  :conditions => ["submitted_at BETWEEN ? and ?", from, to],
  :order => :interestingness,
  :limit => 10
)
but this code gives me an "Unknown column" error.
I also tried this:
post_of_the_moment = find(:all,
  :conditions => ["submitted_at BETWEEN ? and ?", from, to],
  :order => "SUM(comments_count+votes_count) DESC",
  :limit => 10
)
This doesn't give me any errors, but it returns only one row, and that row has 0 comments and 0 votes.
What am I doing wrong?
Thanks,
Augusto
Try this:
post_of_the_moment = find(:all,
  :select => '*, comments_count + votes_count AS total',
  :conditions => ["submitted_at BETWEEN ? and ?", from, to],
  :order => "total DESC",
  :limit => 10)
I'd also see if you can optimize it by replacing the * above with only the fields you actually need. Also check that your MySQL indexes are OK, as you want to avoid a full table scan etc. when summing the counts.
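For example (the column list here is an assumption, pick whatever your view actually uses):
# Hypothetical sketch: fetch only the columns you need, plus the computed total.
post_of_the_moment = find(:all,
  :select => 'id, title, comments_count, votes_count, comments_count + votes_count AS total',
  :conditions => ["submitted_at BETWEEN ? and ?", from, to],
  :order => "total DESC",
  :limit => 10)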
Figured out the mistake I was making: the SUM() in the :order was grouping the result set.
This works:
post_of_the_moment = find(:all,
  :conditions => ["submitted_at BETWEEN ? and ?", from, to],
  :order => "(comments_count + votes_count) DESC",
  :limit => 10
)
I still don't know why I can't use the interestingness method I created as a sort field.
An example would be...
Pets.find(:all, :select => 'count(*) count, pet_type', :group => 'pet_type', :order => 'count')
returns the correct results, but the actual counts aren't included in the objects returned.
Pets.count(:all, :group => 'pet_type')
returns the counts, but they aren't sorted in descending fashion... how would I do this?
I think I'd prefer to use .find, but I'll take .count if I can sort it.
Pets.find(:all, :select => '*, count(*) AS count, pet_type', :group => 'pet_type', :order => 'count')
Pets.find(:all, :select => 'count(*) count, pet_type', :group => 'pet_type', :order => 'count DESC')
This works fine with MySQL but might not transfer well if you switch DBs:
Pets.count(:all, :group => 'pet_type', :order => 'count(*) DESC')
@pets = Pets.includes(:meals_per_days).sort do |a, b|
  a.meals_per_days.size <=> b.meals_per_days.size
end
Note: this will return an array of records, not an ActiveRecord::Relation.
Note 2: use size, not count, as count will execute SQL calls against the DB.
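To illustrate the difference, using the @pets array built above:
first_pet = @pets.first
first_pet.meals_per_days.size   # counts the records already loaded in memory by includes; no query
first_pet.meals_per_days.count  # fires an extra SELECT COUNT(*) against the database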
Afternoon,
Let's say I have gathered a random selection of users:
User.find(:all, :limit => 10, :order => "rand()")
Now, from these results, I want to see if the user with the ID of 3 was included. What would be the best way of finding this out?
I thought about Array.include? but that seems to be a dead end for me.
Thanks
JP
users = User.find(:all, :limit => 10, :order => "rand()")
users.any? {|u| u.id == 3}
assert random_users.include?(User.find 3), "Not found!"
Active Record objects are considered equal if they have equal ids, and Array#include? respects an object's defined equality via the == method.
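For example:
random_users = User.find(:all, :limit => 10, :order => "rand()")
user_three = User.find(3)
# Two Active Record objects of the same class with the same id compare equal,
# so include? finds the match without relying on object identity.
random_users.include?(user_three) # => true if a user with id 3 was picked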
User.find(:all, :limit => 10, :order => "rand()").any? { |u| u.id == 3 }
This will save you from doing another find.