I'm using the sunspot-rails gem for searching and filtering but am running into issues where Solr queries are not returning the correct results. For example, say I want to retrieve all the records of a model with state == Processing, and I do the following Sunspot search:
MyModel.search do
  # pagination and ordering stuff
  ...
  with('state', 'Processing')
  ...
end
Most of the time this returns the correct results. Sometimes (so far only in production; I can't reproduce the issue locally) the query returns records with state == In Review. If I do a regular ActiveRecord MyModel.where(state: 'Processing') I always get the correct results.
I thought it might have to do with my solrconfig.xml file, but changing those params hasn't seemed to change anything. Relevant portion of that file:
<autoCommit>
  <maxTime>15000</maxTime>
  <maxDocs>1000</maxDocs>
  <openSearcher>true</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>5000</maxTime>
</autoSoftCommit>
Does anyone have any pointers for why changes in my db aren't reflected in the Solr index, or how I can debug/log what's going on? This is on a small internal app with maybe 100 users or so. I shouldn't have to reindex Solr daily to keep the results up to date.
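One thing I'm considering (not sure whether it's the right fix) is forcing an explicit Solr commit right after a state change instead of relying on the autoSoftCommit timing. A rough sketch, assuming a searchable block along these lines:

class MyModel < ActiveRecord::Base
  searchable do
    string :state
  end

  # Force an immediate commit so the change is visible to searches right away.
  # This adds load on Solr, so it probably only makes sense for low-volume models.
  after_save :commit_to_solr

  private

  def commit_to_solr
    Sunspot.commit
  end
end

There also seems to be an auto_commit_after_request option for sunspot.yml, but I haven't confirmed whether it applies here.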
Thanks.
I have a status dashboard that shows the status of remote hardware devices that 'ping' the application every minute and log their status.
class Sensor < ActiveRecord::Base
  has_many :logs

  def most_recent_log
    logs.order("id DESC").first
  end
end

class Log < ActiveRecord::Base
  belongs_to :sensor
end
Given I'm only interested in showing the current status, the dashboard only shows the most recent log for all sensors. This application has been running for a long time now and there are tens of millions of Log records.
The problem I have is that the dashboard takes around 8 seconds to load. From what I can tell, this is largely because there is an N+1 Query fetching these logs.
Completed 200 OK in 4729.5ms (Views: 4246.3ms | ActiveRecord: 480.5ms)
I do have the following index in place:
add_index "logs", ["sensor_id", "id"], :name => "index_logs_on_sensor_id_and_id", :order => {"id"=>:desc}
My controller / lookup code is the following:
class SensorsController < ApplicationController
  def index
    @sensors = Sensor.all
  end
end
How do I make the load time reasonable?
Is there a way to avoid the N+1 and preload these logs?
I had thought of putting a latest_log_id reference on to Sensor and then updating this every time a new log for that sensor is posted - but something in my head is telling me that other developers would say this is a bad thing. Is this the case?
How are problems like this usually solved?
There are 2 relatively easy ways to do this:
Use ActiveRecord eager loading to pull in just the most recent logs
Roll your own mini eager loading system (as a Hash) for just this purpose
Basic ActiveRecord approach:
subquery = Log.group(:sensor_id).select("MAX(id)")
@sensors = Sensor.eager_load(:logs).where(logs: {id: subquery}).all
Note that you should NOT use your most_recent_log method for each sensor (that will trigger an N+1), but rather logs.first. Only the latest log for each sensor will actually be prefetched in the logs collection.
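In other words, a quick sketch of the difference:

@sensors.each do |sensor|
  sensor.most_recent_log  # fires a query per sensor (the N+1)
  sensor.logs.first       # uses the prefetched logs collection, no extra query
end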
Rolling your own may be more efficient from a SQL perspective, but more complex to read and use:
@sensors = Sensor.all
logs = Log.where(id: Log.group(:sensor_id).select("MAX(id)"))
@sensor_logs = logs.each_with_object({}) { |log, hash|
  hash[log.sensor_id] = log
}
@sensor_logs is a Hash, permitting a fast lookup of the latest log by sensor.id.
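A quick usage sketch, assuming the dashboard iterates over @sensors:

@sensors.each do |sensor|
  latest = @sensor_logs[sensor.id]  # constant-time hash lookup, no query per sensor
  # use latest for display; it will be nil for sensors that have no logs yet
end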
Regarding your comment about storing the latest log id - you are essentially asking whether you should build a cache. The answer is 'it depends'. There are many advantages and many disadvantages to caching, so it comes down to 'is the benefit worth the cost'. From what you describe, it isn't clear whether you are familiar with the difficulties caches introduce (Google 'cache invalidation') or whether one applies in your case. I'd recommend against it until you can demonstrate that a) it adds real value over a non-cache solution, and b) it can be safely applied to your scenario.
There are three options:
eager loading
joining
caching the current status
Eager loading is explained by PinnyM.
For joining, you can do a join from Sensor to just the latest Log record for each row, so everything gets fetched in one query. Not sure offhand how that will perform with the number of rows you have; it will likely still be slower than you want.
The thing you mentioned - caching the latest_log_id (or even caching just the latest_status, if that's all you need for the dashboard) - is actually OK. It's called denormalization, and it's a useful thing if used carefully. You've likely come across "counter cache" plugins for Rails, which are in the same vein: duplicating data in the interest of optimising read performance.
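A minimal sketch of that denormalized approach (the latest_log_id column and association name are just placeholders):

# in a migration
add_column :sensors, :latest_log_id, :integer

class Log < ActiveRecord::Base
  belongs_to :sensor

  # Keep the cached pointer fresh whenever a new log arrives.
  after_create :update_sensor_cache

  private

  def update_sensor_cache
    # update_column skips validations/callbacks, avoiding callback loops
    sensor.update_column(:latest_log_id, id)
  end
end

class Sensor < ActiveRecord::Base
  has_many :logs
  belongs_to :latest_log, class_name: "Log"
end

The dashboard can then load Sensor.includes(:latest_log) and never scan the big logs table per row.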
When I execute the query
Mymodel.all.each do |model|
  # ...do something
end
It uses a lot of memory, the amount of memory used keeps increasing the whole time, and at the end it crashes. I found out that to fix it I need to disable identity_map, but when I add identity_map_enabled: false to my mongoid.yml file I get the error
Invalid configuration option: identity_map_enabled.
Summary:
A invalid configuration option was provided in your mongoid.yml, or a typo is potentially present. The valid configuration options are: :include_root_in_json, :include_type_for_serialization, :preload_models, :raise_not_found_error, :scope_overwrite_exception, :duplicate_fields_exception, :use_activesupport_time_zone, :use_utc.
Resolution:
Remove the invalid option or fix the typo. If you were expecting the option to be there, please consult the following page with respect to Mongoid's configuration:
I am using Rails 4 and Mongoid 4, Mymodel.all.count => 3202400
How can I fix it, or does anyone know another way to reduce the amount of memory used when executing a query with .all.each?
Thank you very much for the help!!!!
I started with something just like you: looping through millions of records, and the memory just kept increasing.
Original code:
@portal.listings.each do |listing|
  listing.do_something
end
I've gone through many forum answers and tried them out.
1st attempt: I tried the combination of WeakRef and GC.start, but no luck; it failed.
2nd attempt: Adding listing = nil to the first attempt still failed.
Success Attempt:
@start_date = 10.years.ago
@end_date = 1.day.ago
while @start_date < @end_date
  @portal.listings.where(created_at: @start_date..@start_date.next_month).each do |listing|
    listing.do_something
  end
  @start_date = @start_date.next_month
end
Conclusion
All the memory allocated for the records is never released while a single query is being processed. Therefore, working with a small number of records per query does the job, and memory stays in good shape since it is released after each batch.
Your problem isn't the identity map; I don't think Mongoid 4 even has an identity map built in, hence the configuration error when you try to turn it off. Your problem is that you're using all. When you do this:
Mymodel.all.each
Mongoid will attempt to instantiate every single document in the db.mymodels collection as a Mymodel instance before it starts iterating. You say that you have about 3.2 million documents in the collection; that means Mongoid will try to create 3.2 million model instances before it even begins to iterate. Presumably you don't have enough memory to handle that many objects.
Your Mymodel.all.count works fine because that just sends a simple count command to the database and returns a number; it won't instantiate any models at all.
The solution is to not use all (and preferably forget that it exists). Depending on what "do something" does, you could:
Page through all the models so that you're only working with a reasonable number of them at a time.
Push the logic into the database using mapReduce or the aggregation framework.
Whenever you're working with real data (i.e. something other than a trivially small database), you should push as much work as possible into the database because databases are built to manage and manipulate big piles of data.
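For the first option, a minimal paging sketch (the batch size is arbitrary, and deep skips get slow on huge collections, so paging by _id ranges, similar to the date-range answer above, is an alternative):

batch_size = 1000
total = Mymodel.count

(0...total).step(batch_size) do |offset|
  # Fetch and instantiate only batch_size documents at a time.
  Mymodel.order_by(_id: :asc).skip(offset).limit(batch_size).each do |model|
    # ...do something
  end
  # The batch goes out of scope here, so its memory can be garbage collected.
end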
I am using Ruby on Rails 4.1, Ransack and attr_encrypted. I have sensitive data being stored in my database and I want to protect it using the gem attr_encrypted.
As I expected, I got zero results when searching encrypted test data with Ransack.
I tried the following solution, but it didn't seem to work for me. I was under the impression that the load method was used to return the decrypted values.
ReportsController
def index
  @report_list = Report.all.load
  @q = @report_list.search(params[:q])
  @reports = @q.result(distinct: true).order('created_at DESC')
end
Has anyone had any experience searching across encrypted data and could help me generate a working solution?
load will cause the Active Record collection to execute the query and retrieve the results matching it (in addition to running after_find/after_initialize callbacks, which I believe is where the decryption you were expecting happens).
def index
  # returns all records in the DB
  @report_list = Report.all.load
  # I'm surprised these aren't throwing "undefined method `search' for Array" (or something similar)
  @q = @report_list.search(params[:q])
  @reports = @q.result(distinct: true).order('created_at DESC')
end
I would like to preface this by saying that I normally do this sort of thing manually and am not familiar with attr_encrypted or Ransack, but I believe these concepts are general enough that they could be applied to any setup. So, as to your question, there are 2 possibilities.
If you're OK searching for exact values:
Model.where(encrypted_field: encrypt(params[:value])).first
where encrypt is a method that encrypts and returns the passed string.
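With attr_encrypted specifically, the gem generates encrypt_<attribute> class methods that could play the role of encrypt here. A sketch, using a hypothetical ssn attribute:

# `ssn` is a hypothetical encrypted attribute; attr_encrypted stores it in `encrypted_ssn`
# and generates an `encrypt_ssn` class method. This only matches rows if the gem is
# configured deterministically (fixed IV/salt); with a random per-record IV it cannot work.
ciphertext = Report.encrypt_ssn(params[:value])
Report.where(encrypted_ssn: ciphertext).first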
Secondly (and painfully)
Model.all.delete_if{|m| !m.encrypted_field.include?(params[:value]) }
This will literally pull, decrypt, and scan every entry in your database.
I would highly recommend not doing this, but you need to do what you need to do.
If you absolutely need to have the information encrypted but still need to be able to do searches like this, I would highly recommend adding tags of some sort to your model. This would allow you to keep the sensitive information out of the searchable fields but still search by some attributes.
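A rough sketch of that idea; search_tags, category, and region are hypothetical, and the tags must only ever contain non-sensitive values:

class Report < ActiveRecord::Base
  attr_encrypted :body, key: ENV['REPORT_ENCRYPTION_KEY']  # sensitive payload stays encrypted

  # Build a plain-text, non-sensitive tag string that Ransack can search
  # (e.g. with a search_tags_cont predicate).
  before_save :build_search_tags

  private

  def build_search_tags
    self.search_tags = [category, region].compact.join(' ')
  end
end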
I have a Rails 3 app that has several hundred records in a MySQL DB that need to be updated multiple times each hour. The actual updating is done through delayed_job, which is triggered in controller logic (checking whether enough time has passed since the last update; only then does anything happen).
Each update is slow; it can take up to a second in some cases (although it averages 3-5 updates/sec).
Code looks like this:
class Thing < ActiveRecord::Base
  ...
  def self.scheduled_update
    Thing.all.each do |t|
      ...
      t.some_property = new_value
      t.save
    end
  end
end
I've observed that the execution stalls after 300 - 400 records and then the delayed job just seems to hang and times out eventually (entries in delayed_job.log). After a while the next one starts, also fails, and so forth, so not all records get updated.
What is the proper way to do this?
How does Rails handle database-connections when used like that? Could it be some timeout issue that is not detected/handled properly?
There must be a standard way to do this, but I couldn't find anything so far.
Any help is appreciated.
Another option is update_all.
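For example, if the new value is the same for every record (or expressible in SQL), a single statement avoids instantiating any models at all; a sketch (the active column is hypothetical):

# Same value for every row: one UPDATE statement, no ActiveRecord objects instantiated
Thing.update_all(some_property: new_value)

# Scoped variant with a SQL expression
Thing.where(active: true).update_all("some_property = some_property + 1")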
Rails is a bad choice for mass-updating records. See if you can write a SQL stored procedure or use some other approach that avoids ActiveRecord.
Use object.save(:validate => false) (the Rails 3 replacement for save_with_validation(false)) if you are OK with skipping validations altogether.
When finding records, use :select => 'a,b,c,other_fields' to limit the query to just the fields you need ('a', 'b', 'c' and 'other_fields' in this example).
Use :include for eager loading when you are initially selecting and joining across multiple tables.
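If each record genuinely needs its own computed value, a batched loop keeps memory bounded; a sketch along these lines (compute_new_value_for is a stand-in for your update logic):

def self.scheduled_update
  # find_each loads records in batches instead of all at once (1000 per batch by default)
  Thing.find_each(:batch_size => 500) do |t|
    t.some_property = compute_new_value_for(t)
    t.save(:validate => false)  # skip validations if that's acceptable here
  end
end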
So I solved my problem.
There was some issue with the Rails version I was using (3.0.3); I suspect the timeout was caused by a bug. Updating to a later version of the 3.0.x branch solved it, and everything runs perfectly now.
I periodically fetch the latest tweets with a certain hashtag and save them locally. In order to prevent saving duplicates, I use the method below. Unfortunately, it does not seem to be working... so what's wrong with this code:
def remove_duplicates
  before = @tweets.size
  @tweets.delete_if { |tweet| !Tweet.all(:conditions => { :twitter_id => tweet.twitter_id }).empty? }
  duplicates = before - @tweets.size
  puts "#{duplicates} duplicates found"
end
Where @tweets is an array of Tweet objects fetched from Twitter. I'd appreciate any solution that works, and especially one that might be more elegant...
You can use validates_uniqueness_of :twitter_id in the Tweet model (which is where this code should live). This will cause duplicates to fail to save.
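A minimal sketch, together with a unique index so the database enforces it as well:

class Tweet < ActiveRecord::Base
  validates_uniqueness_of :twitter_id
end

# and, in a migration
add_index :tweets, :twitter_id, :unique => true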
Since it sounds like you're using the Twitter search API, a better solution is to use the since_id parameter. Keep track of the last twitter status id you got from your previous query and use that as the since_id parameter on your next query.
More information is available at Twitter Search API Method: search
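A rough sketch of that flow; fetch_tweets and the :text attribute are stand-ins for your actual client call and model fields:

# Only request statuses newer than the newest one already saved.
last_id = Tweet.maximum(:twitter_id)
fetch_tweets('#myhashtag', :since_id => last_id).each do |status|
  Tweet.create(:twitter_id => status.id, :text => status.text)
end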
array.uniq!
Removes duplicate elements from self. Returns nil if no changes are made (that is, no duplicates are found).
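Applied to the fetched batch, that could look like the line below (uniq! with a block needs Ruby 1.9+), though it only removes duplicates within the batch, not against what is already in the database:

@tweets.uniq! { |tweet| tweet.twitter_id }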
OK, turns out the problem was of a slightly different nature: when looking closer into it, I found that multiple Tweets were saved with the twitter_id 2147483647... This is the upper limit for signed 32-bit integer fields :)
Changing the field to bigint solved the problem. It took me a long time to figure out since MySQL failed silently and just clamped the value to the maximum for as long as it could (until I added the unique index). I quickly tried it with Postgres, which returned a nice "integer out of range" error, which then pointed me to the real cause of the problem.
Thanks Ben for the validation and indexing tips, as they led to much cleaner code!