Thinking Sphinx and lack of updated records - ruby-on-rails

We are running thinking sphinx on a utility instance in our server cluster. It is rerunning the index every minute. But, if you make a change to a record, it disappears from search results until the index is updated (up to 1 minute).
Is Thinking Sphinx only returning rows whose updated_at is earlier than their last index run?
If so, how can I get database changes to update the Thinking Sphinx index on the utility instance?

Instead of re-indexing every minute, try the Delayed Deltas approach. It is designed to keep your search results current until the next full re-index.
See:
http://freelancing-god.github.com/ts/en/deltas.html
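In the Thinking Sphinx 2.x syntax that guide covers, enabling delayed deltas looks roughly like this (it needs the ts-delayed-delta gem and a boolean delta column on the table; the model and fields here are only examples):

class Article < ActiveRecord::Base
  define_index do
    indexes title
    indexes content

    # queue delta re-indexing through a background job instead of
    # running it inline on every save
    set_property :delta => :delayed
  end
end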
Update:
It looks like the Sphinx team is trying to solve these problems with real-time indexes:
http://sphinxsearch.com/docs/current.html#rt-indexes

Related

rails find_or_create_by is slow when database contains 100k records

I noticed that find_or_create_by in Rails is slowing down data ingestion, although I have an index set on the SELECT fields. Any suggestions on how to speed this up? I'm using Postgres.
find_or_create_by is essentially a where query with limit 1; if no result is found, it fires an insert query and returns the new object.
If you have indexed the relevant columns properly, it will be about as fast as it can be.
But for a large database like the one you describe, I'd suggest running such operations in the background using Sidekiq.
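A minimal sketch of that idea (the worker, model, and column names are made up):

class IngestRowWorker
  include Sidekiq::Worker

  def perform(email, attributes)
    # one indexed SELECT, plus an INSERT only when the row is missing
    User.find_or_create_by(email: email) do |user|
      user.assign_attributes(attributes)
    end
  end
end

# from the ingestion loop, enqueue instead of writing inline:
# IngestRowWorker.perform_async(row['email'], row.except('email'))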

Rails 4.1 - Saving friendly_id slug for many records en masse

I have a database with a few million entities needing a friendly_id slug. Is there a way to speed up the process of saving the entities? find_each(&:save) is very slow. At 6-10 per second I'm looking at over a week of running this 24/7.
I'm just wondering if there is a method within friendly_id or parallel processing trick that can speed this process up drastically.
Currently I'm running about 10 consoles, incrementing the start value by 100k in each:
Model.where(slug: nil).find_each(start: value) do |e|
  puts e.id
  e.save
end
EDIT
Well, one of the biggest things that was causing the updates to go so insanely slow was the initial find query for the entity, not the actual saving of the record. I put the site live the other day, and looking at server database requests continually hitting 2000ms, the culprit was @entity = Entity.find(params[:id]) causing the most problems with 5+ million records. I didn't realize there was no index on the slug column, and Active Record was doing its SELECT statements on the slug column. After indexing properly, I get 20ms response times, and running the above query went from 1-2 entities per second to 1k per second. Running several of them in parallel got the job done quickly enough for this one-time operation.
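For reference, a migration along these lines adds the index described above (table name assumed from the example):

class AddIndexToEntitiesSlug < ActiveRecord::Migration
  def change
    # friendly_id looks records up by slug, so this column needs the index
    add_index :entities, :slug, unique: true
  end
end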
I think the fastest way to do this would be to go straight to the database, rather than using Active Record. If you have a GUI like Sequel Pro, connect to your database (the details are in your database.yml) and execute a query there. If you're comfortable on the command line you can run it straight in the database console window. Ruby and Active Record will just slow you down for something like this.
To update all the slugs of a hypothetical table called "users" where the slug will be a concatenation of their first name and last name you could do something like this in MySQL:
UPDATE users SET slug = CONCAT(first_name, "-", last_name) WHERE slug IS NULL

Is it possible in Ruby to set a specific Active Record call to read dirty

I am looking at a rather large database. Let's say I have an exported flag on the product records.
If I want an estimate of how many products I have with the flag set to false, I can do a call something like this
Product.where(:exported => false).count
The problem I have is that even the count takes a long time, because the table of 1 million products is being written to. More specifically, exports are happening, and the value I'm interested in counting is ever-changing.
So I'd like to do a dirty read on the table... Not a dirty read always. And I 100% don't want all subsequent calls to the database on this connection to be dirty.
But for this one call, dirty is what I'd like.
Oh, I should mention: Ruby 1.9.3, Heroku, and PostgreSQL.
Now.. if I'm missing another way to get the count, I'd be excited to try that.
OH SNOT one last thing.. this example is contrived.
PostgreSQL doesn't support dirty reads.
You might want to use triggers to maintain a materialized view of the count - but doing so will mean that only one transaction at a time can insert a product, because they'll contend for the lock on the product count in the summary table.
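A rough sketch of what that trigger-maintained count could look like, run from a Rails migration (the table, column, function, and trigger names are all assumptions, not anything from the question):

class AddProductCountSummary < ActiveRecord::Migration
  def up
    execute <<-SQL
      -- single-row summary table holding the count of unexported products
      CREATE TABLE product_counts (unexported bigint NOT NULL);
      INSERT INTO product_counts
        SELECT count(*) FROM products WHERE NOT exported;

      CREATE OR REPLACE FUNCTION maintain_product_count() RETURNS trigger AS $$
      BEGIN
        IF TG_OP = 'INSERT' AND NOT NEW.exported THEN
          UPDATE product_counts SET unexported = unexported + 1;
        ELSIF TG_OP = 'UPDATE' AND OLD.exported IS DISTINCT FROM NEW.exported THEN
          UPDATE product_counts
            SET unexported = unexported + CASE WHEN NEW.exported THEN -1 ELSE 1 END;
        ELSIF TG_OP = 'DELETE' AND NOT OLD.exported THEN
          UPDATE product_counts SET unexported = unexported - 1;
        END IF;
        RETURN NULL;
      END;
      $$ LANGUAGE plpgsql;

      CREATE TRIGGER product_count_trigger
        AFTER INSERT OR UPDATE OR DELETE ON products
        FOR EACH ROW EXECUTE PROCEDURE maintain_product_count();
    SQL
  end
end

Reading the count is then a single-row SELECT on product_counts, but as noted, every write to products now contends for the lock on that one summary row.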
Alternately, use system statistics to get a fast approximation.
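For example, PostgreSQL keeps a rough row count in its catalog; reading it from Rails might look like this (table name assumed, and this is a whole-table estimate kept fresh by autovacuum/ANALYZE, not a count of just the unexported subset):

estimate = ActiveRecord::Base.connection.select_value(
  "SELECT reltuples::bigint FROM pg_class WHERE relname = 'products'"
)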
Or, on PostgreSQL 9.2 and above, ensure there's a primary key (and thus a unique index) and make sure vacuum runs regularly. Then you should be able to do quite a fast count, as PostgreSQL should choose an index-only scan on the primary key.
Note that even if Pg did support dirty reads, the read would still not return perfectly up-to-date results, because rows would sometimes be inserted behind the read pointer in a sequential scan. The only way to get a perfectly up-to-date count is to prevent concurrent inserts: LOCK TABLE thetable IN EXCLUSIVE MODE.
As soon as a query begins to execute it's against a frozen read-only state because that's what MVCC is all about. The values are not changing in that snapshot, only in subsequent amendments to that state. It doesn't matter if your query takes an hour to run, it is operating on data that's locked in time.
If your queries are taking a very long time, it sounds like you need an index on your exported column, or whatever values you use in your conditions, as a COUNT against an indexed column is usually very fast.
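A migration sketch for that index (table name assumed):

class AddExportedIndexToProducts < ActiveRecord::Migration
  def change
    add_index :products, :exported
    # On Rails 4+ with PostgreSQL, a partial index covering only the rows
    # being counted would be even smaller:
    #   add_index :products, :exported, where: "NOT exported"
  end
end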

Thinking Sphinx & Rails questions

I'm building my first Rails app and have it working great with Thinking Sphinx. I'm understanding most of it but would love it if someone could help me clarify a few conceptual questions
When displaying search results after a Sphinx query, should I be using the sphinx_attributes that are returned from the Sphinx query? Or should my view use normal Rails objects, such as @property.title, @property.amenities.title etc.? If I use normal Rails objects, doesn't that mean it's doing extra queries?
In a forum, I'd like to display 'unread posts'. Obviously this is true/false for each user/topic combination, so I'm thinking I should be caching the 'reader' ids within the topic's Sphinx index. This way I can quickly do a query for all unread posts for a given user_id. I've got this working, but then realised it's pointless, as there is a time delay between Sphinx indexes. So if a user clicks on an unread post, it will still appear unread until the Sphinx DB is re-indexed.
I'm still in development, so I'm manually indexing/rebuilding, but in production, what is a standard interval between re-indexes?
I have a model with several text fields - should I concat these all into one column in the sphinx index for a keyword search? Surely this is quicker than indexing all the separate fields.
Slightly off-topic, but just wondering - when you access nested models, for example @property.agents.name, does this affect performance? Or does Rails automatically fetch all associated entries when a property is pulled from the database?
To answer each of your points:
For both of your examples, sphinx_attributes would not be helpful. Firstly, you've already loaded the property, so the title is available directly without an extra database hit. And for property.amenities.title you're dealing with an array of strings, which Sphinx has no concept of. Generally, I would only use sphinx_attributes for complicated calculated attributes, not standard column references.
Yes, you're right, there will be a delay with this value.
It depends on how often your data changes. I have some apps where I can index every day because changes are so rare, but others where we'll run it every 10 minutes. If the data is particularly volatile, I'll look at using deltas (usually via Sidekiq) to have changes reflected in Sphinx in a few seconds.
I don't think it makes much difference either way - unless you want to search on any of those columns separately? If so, each will need to be a separate field.
By default, as you use each property's agents, the agents for that property will be loaded from the database (one SQL call per property). You could look at the eager loading docs for how to manage this better when you're dealing with multiple records. Thinking Sphinx has the ability to pass through :include options to the underlying ActiveRecord call.
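As a rough sketch (association and param names assumed; in Thinking Sphinx v3 the option is nested under :sql instead):

# eager-load agents alongside the search results
@properties = Property.search params[:query], :include => :agents

# the plain ActiveRecord equivalent when you're not searching
@properties = Property.includes(:agents)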

Rails on Postgresql FullText Index has been building up for 20 hours. Is that normal?

I have a Ruby on Rails 3 Heroku application which needs to perform text search on a few models. Each model has a large dataset, and those datasets are expected to grow considerably.
I want to be able to do fast text search on columns like title and description. Simple queries, like give me all Articles having "postgresql" (case insensitive) in their title, or body. I need multilingual capability too.
Currently, my DB is not being used in production, and I'm using the Ronin plan, which gives a dedicated db using PostgreSQL.
In order to do that, I decided to go with a plugin called texticle. That plugin allows full text search using PostgreSQL's capabilities. However, it did not work smoothly, and I decided to build full text indexes.
I ran the following query on a table with 15 million entries. 20 hours later, it is still running.
create index on articles using gin(to_tsvector('english', title));
My questions :
1- Is it normal that it takes so long for this index to build?
2- Is there any way to find out the status of that index build-up? It doesn't show yet in my indexes usage table.
3- What about my approach? Am I looking at this the wrong way? Would you have other recommendations? I would like to keep my budget low for now, but be able to easily migrate to an effective, scalable, production-quality solution when the need arises.
Thanks
1- Is it normal that it takes so long for this index to build?
No.
This is on my postgres 9.0 server which runs on single-core AMD Athlon 64 3700+:
filip@filip=# create table articles as select i, md5('the ' || random()::text || ' feds took my ' || random()::text ) as title from generate_series(1,15000000) i;
SELECT 15000000
Time: 91851.97 ms
filip@filip=# create index on articles using gin(to_tsvector('english', title));
CREATE INDEX
Time: 340802.395 ms
As you can see, building the GIN index on 15 M rows took 340 seconds (BTW, the table size was 977 MB and the index size was 319 MB).
Turning text documents into tsvector and building a GIN (or GIST) index is CPU-intensive.
I don't know the exact specs of Heroku Ronin in terms of CPU power. Can you tell us what it compares to?
Index-building performance is also very sensitive to the maintenance_work_mem setting. The memory needed (and the size of the index) depends on the input data, and might be anywhere from 20% to 150% of the input data size.
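For example, maintenance_work_mem can be raised for just the session that builds the index; a migration sketch (the 512MB value is only a placeholder to tune against available RAM, and the index name is made up):

class AddTitleFtsIndexToArticles < ActiveRecord::Migration
  def up
    # session-level setting; only affects this migration's connection
    execute "SET maintenance_work_mem = '512MB'"
    execute "CREATE INDEX index_articles_on_title_fts ON articles USING gin(to_tsvector('english', title))"
  end

  def down
    execute "DROP INDEX index_articles_on_title_fts"
  end
end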
2- Is there any way to find out the status of that index build-up? It doesn't show yet in my indexes usage table.
Unfortunately, no. PostgreSQL does not have this kind of "introspection".
You could create the same index on a 10% sample and multiply to estimate.
3- What about my approach?
Nothing bad - it is OK; at least since PostgreSQL has built-in FTS, it's a good place to begin.
But if you need a faster solution (both indexing time and search speed), the only way is to go outside the database. External solutions like Sphinx or Lucene are faster (10x in my experience).

Resources