I've been reading up on the Sphinx search engine and the Thinking Sphinx gem. In the TS docs it says...
Sphinx has one major limitation when compared to a lot of other search services: you cannot update the fields [of] a single document in an index, but have to re-process all the data for that index.
If I understand correctly, that means when a user adds or edits something, the change is not reflected in the index. So if they add a record it won't come up in searches until the entire index is rebuilt. Or if they delete a record, it will come up in searches, and then cause some kind of error or frustrating behavior.
Moreover, while rebuilding the index, Sphinx is shut down. So your app's search functionality goes offline regularly (once an hour, or once every few hours), and anyone who tries to search during that window gets an error or a "try later" message.
OK, clearly none of that is acceptable in a real-world app. So you pretty much have to use delta indexing.
But apparently you still need to regularly shut down your search engine and do a full indexing...
Turning on delta indexing does not remove the need for regularly running a full re-index, as otherwise the delta index itself will grow to become just as large as the core indexes, and this removes the advantage of keeping it separate. It also slows down your requests to your server that make changes to the model records.
I don't really understand what the docs are saying here. Maybe someone can help me out. I thought the whole point of delta indexing was that you don't need to regularly rebuild the index. It's updated instantly whenever the data changes.
Because rebuilding the index every hour or every anything would be totally messed up, right?
If I understand correctly, that means when a user adds or edits something, the change is not reflected in the index. So if they add a record it won't come up in searches until the entire index is rebuilt. Or if they delete a record, it will come up in searches, and then cause some kind of error or frustrating behavior.
Moreover, while rebuilding the index Sphinx is shut down. ...
You don't need to rebuild your indexes - just reindex them, which means there's no need to stop the daemon. Rebuilding is only needed after changing the structure of the index, and that is not the case here.
And for the second part - again, you don't rebuild the index, ergo stopping the daemon isn't necessary. When using delta indexing there are actually two indexes used for searching - the main index (which should be reindexed once in a while) and the delta index (which is refreshed after each relevant operation on a record). If I understand it correctly, when the main index is reindexed (e.g. via a cron task), the delta index is simply merged back into it, so the delta won't take up much space and searches stay fast.
I'm trying to understand why my Core Spotlight indexes eventually stop showing up.
My strategy is to index the first time the user opens the app. After successful indexing, I never index again. Everything works great at first: the indexes appear in Spotlight. Then over time (I'm not sure how long, maybe weeks), the indexes stop appearing, even though I made absolutely no changes to them.
So I'm trying to understand how the system handles indexes. Does it rebuild them, thus wiping any previous indexes out, so that I would be responsible for re-indexing?
On CSSearchableItem, there is a property expirationDate
The date after which the searchable item should no longer exist.
Discussion
If you don’t set the expirationDate property appropriately, the system automatically expires the item after a period of time.
This should explain why your items disappear after some time.
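In practice that means setting expirationDate yourself when you index. A minimal sketch in Swift - the identifier, domain, and title here are made-up placeholders, not from your app:

import CoreSpotlight
import MobileCoreServices

let attributes = CSSearchableItemAttributeSet(itemContentType: kUTTypeText as String)
attributes.title = "Example team"                              // placeholder attribute

let item = CSSearchableItem(uniqueIdentifier: "team-42",       // hypothetical id
                            domainIdentifier: "teams",         // hypothetical domain
                            attributeSet: attributes)

// Push the expiration far into the future so the system's default
// expiry window doesn't silently drop the item after a few weeks.
item.expirationDate = Date.distantFuture

CSSearchableIndex.default().indexSearchableItems([item]) { error in
    if let error = error {
        print("Indexing failed: \(error)")
    }
}

If you'd rather not keep items around forever, periodically re-submitting the same items should also reset their expiration.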
Before you link me to a duplicate, please read what I'm asking...
I'm building an app which basically has a list of about 5000 teams. These teams are fairly static (they don't change very often), but I would like to observe any time one is changed, as it's essential it gets updated in the app ASAP.
If I include dbTeams.ref.observe(.childAdded, with: {}), it runs each time the app starts, loading all 5000 records again despite already having them in persistent storage (I have enabled persistence).
Now the documentation says this will happen, I know, but with 5000 records (and potentially way more in the future), I can't have this happen.
My options so far (from what I've found and tried) are:
Add a timestamp to each record and create a custom query to call .childAdded after the last timestamp (see the sketch after this list)... This is inefficient: storing a timestamp for soccer teams which will hardly ever change is silly, and it also means keeping a copy of the last time the check was made.
Create a sub-list within the Teams list. This too is silly as you may as well call .value and get the whole bunch of data in one go.
Just live with it... Fine - until it scales to tens of thousands of records. Not clever either.
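To make option 1 concrete, this is roughly what I mean - the lastUpdated child, the "lastTeamSync" key, and the "teams" path don't exist yet, I'd have to add them:

import Foundation
import FirebaseDatabase

let dbTeams = Database.database().reference(withPath: "teams")   // hypothetical path
let lastSync = UserDefaults.standard.double(forKey: "lastTeamSync")

dbTeams.queryOrdered(byChild: "lastUpdated")
       .queryStarting(atValue: lastSync + 1)
       .observe(.childAdded) { snapshot in
           // Only teams added or changed since the last sync arrive here,
           // instead of all 5000 records on every app start.
           print("changed team:", snapshot.key)
           UserDefaults.standard.set(Date().timeIntervalSince1970,
                                     forKey: "lastTeamSync")
       }

It would work, but it means adding and maintaining a timestamp on records that almost never change.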
It just seems weird that all the other event listeners only fire when they are "supposed to" except this one.
Any help would be appreciated - how do I achieve what I need?
I'm working on an enterprise sales app for iPad that uses SQLite as its internal database, and a strange behaviour recently showed up.
I have a huge table that is filled with information from several other tables (sort of like a "materialized view"), which can contain over 2 million rows, depending on how the user is set up. When the user wants to search for an item, the app performs a query on this huge table that has an indexed column and on other columns that are used as filters and/or metadata. I'll post the query and the basic idea below. Anyway, this query usually returns in 2~3 seconds on an iPad 4th gen, no more than that, and this is just fine. This table is dropped, re-created and filled every time the user taps a button to synchronize his data with our server.
However, recently the same query in the same table (with no relevant changes at all), randomly started to take 40~50 seconds. If you do the same thing later, on the same device, with the same filters (or even changing the filters!), the same query on the same table takes the 2~3 seconds again. I haven't found any specific situation that causes this slowdown, the app is the only one running at that time. The device is not the problem, we've seen this happen on at least 5 different iPads, one is an iPad 3 and the others are iPads 4th gen.
I don't think it is some sort of caching, since the app does not cache anything, and these times are rather random. Sometimes the query takes 40 seconds ten times in a row, then suddenly it starts taking only 2 seconds again, and the same the other way round. The only thing that is clear to me is that the slowdown only occurs after intensive use (1 - 2 days of work with the app), so I'm also having trouble reproducing the behaviour while debugging on the iPad I have with me.
What I've tried:
Attach Instruments to the process and check what resources are being used during the slowdown. The app makes INTENSIVE use of the iPad's 'disk' (flash memory) the whole time. I don't have that trace available to analyse again now, but I think the CPU usage was around 30%. The RAM usage is stable at 90~100MB, which is normal for our app.
Run VACUUM on the db; - reduced ~50MB on a database I had as an example. It went from ~600MB to ~550MB.
Run ANALYZE on the db; - didn't see any improvements
Run REINDEX on the db; - seems to be helping a little, but it's not solving the problem.
Kill the process and start over - nothing changes
The huge table is constructed as follows, and does NOT have any foreign keys or any other constraints:
CREATE TABLE FMV_CATALOG(
UNIQUE_ID TEXT,
PRODUCT_ID INTEGER,
<bunch of metadata/filtered columns - total of 20 columns>
);
And the query that is made to find the products is:
SELECT
PRODUCT_ID
,UNIQUE_ID
<all other required columns, ~20 columns>
FROM
FMV_CATALOG
WHERE
UNIQUE_ID = '<some id>_<other id>'
AND PRODUCT_NAME LIKE '%iPhone%'
<and other optional, rarely used, filters.>
I'm totally out of ideas, so any help will be appreciated.
Thanks!
UPDATE (more info):
Important information that I forgot to mention (Rob reminded me of it): my database connection is always open; it is closed only when the user logs out. We noticed a huge performance improvement in all parts of the app when we kept the connection open, since we have hundreds of small queries that are executed in other situations (but not while browsing/searching the products catalog).
The query used to create the index is below:
CREATE INDEX IDX_MV_CATALOG ON FMV_CATALOG(UNIQUE_ID);
Also, even though the column is named UNIQUE_ID, it is not unique. It was supposed to be originally, but now it is repeated N times. I know this is wrong, we'll change that ASAP.
This "UNIQUE_ID" (which is not really unique) is filled by joining the IDs of two other tables. This way, our "materialized view" removes the need of at least three joins when the user searches on our catalog, which improved our query times from ~20 seconds to ~2 seconds.
We don't call the sqlite3 API directly in our queries; we have developed a wrapper class around it and have been using it for at least 2 years now. This is the first time we've ever been in this situation, but then again it's also the first time we're handling this much data.
A couple of thoughts:
You're not showing us the creation of any index on FMV_CATALOG. If nothing else, if UNIQUE_ID is, as the name suggests, unique, then I'd be inclined to define the table with a PRIMARY KEY:
CREATE TABLE FMV_CATALOG(
UNIQUE_ID TEXT PRIMARY KEY,
PRODUCT_ID INTEGER,
<bunch of metadata/filtered columns - total of 20 columns>
);
You should use SQLite's EXPLAIN QUERY PLAN command to look at the query's plan and make sure it's availing itself of your index. Do this as the table stands now, and then again with the PRIMARY KEY (and, if that still doesn't do it, with an index on the fields in your WHERE clause), and make sure the final query is definitely using your index.
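A rough sketch of how you might run that from the app itself, using the same C API calls you're presumably already wrapping (the helper name and the sample query are mine, not from your code):

import SQLite3

// Prints SQLite's plan for a query so you can confirm it reports
// "SEARCH ... USING INDEX ..." rather than "SCAN TABLE FMV_CATALOG".
func printQueryPlan(_ db: OpaquePointer?, _ sql: String) {
    var stmt: OpaquePointer?
    guard sqlite3_prepare_v2(db, "EXPLAIN QUERY PLAN " + sql, -1, &stmt, nil) == SQLITE_OK else {
        print("prepare failed: \(String(cString: sqlite3_errmsg(db)))")
        return
    }
    defer { sqlite3_finalize(stmt) }   // always finalize, even on early return
    while sqlite3_step(stmt) == SQLITE_ROW {
        // The last column of each plan row holds the human-readable detail.
        print(String(cString: sqlite3_column_text(stmt, 3)))
    }
}

// Example: printQueryPlan(db, "SELECT PRODUCT_ID FROM FMV_CATALOG WHERE UNIQUE_ID = 'x_y'")

You can run the same EXPLAIN QUERY PLAN statement in the sqlite3 command-line shell against a copy of the database file, which is usually quicker while experimenting.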
I'm not sure why, if you have the unique id, you're also looking at the other fields. If adding the primary key (and possibly other indexes) doesn't solve the problem, I might try just retrieving the record based upon the unique id, and then checking for conformance with your other parameters in code. I don't believe you'll need to do this, but it's a worst-case fallback.
In terms of why it slows down, that's harder to guess without seeing the code (which I'm sure is too complicated to share in a simple S.O. question). I could imagine strange behavior if, for example, you fail to sqlite3_finalize after one of your sqlite3_prepare_v2 statements, or if you accidentally failed to close the database and then opened it again elsewhere. I could also imagine performance issues creeping in if the sequence of sqlite3 calls isn't precisely right. Using something like FMDB can minimize the chance of those sorts of issues occurring (as well as simplifying your SQLite code). Or, if that's too radical a step, try writing your own macros (or wrapper functions) that make the SQLite calls but also log the fact that each sqlite3 function was called, then pore through that log and double-check the sequence of your SQLite calls.
The only other thing I can suggest is to see whether you can construct a simplified project that reproduces the aberrant behavior. Tracking down a Heisenbug can be infuriating: unless you can consistently reproduce it, it's very hard to track down.
Because indexing is slow and resource-intensive, I have set Solr to re-index only every 12 hours. So when a new record is added before it's indexed, it cannot be searched - am I right?
If yes, should I switch to another search system?
Documents are only available for search after a commit. They are parsed and converted to internal formats as soon as they are submitted to Solr. That is indexing.
Your question doesn't quite make sense for a pure Solr system. Do you have an RDBMS and are you reindexing that? If so, then you are right, Solr is only updated when the records are fetched from the database, indexed, and committed.
For a shorter delay, there are a couple of options. If you have a timestamp for each record, you can periodically only reindex the changed records. They will replace the old version of the records. If you do this, you need to handle deleted records specially, usually by adding a "deleted" column and issuing a Solr delete command for those records.
The Data Import Handler has support for delta queries, though it is a bit complicated. You could also write a bit of code that reads the database and submits regular Solr updates in the dedicated Solr XML format.
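Recent Solr versions also accept the same updates as JSON on the /update handler, so the client can be anything that speaks HTTP. A hedged sketch in Swift, just to show the shape of the request - the host, the core name "catalog", and the field values are all placeholders:

import Foundation

var request = URLRequest(url: URL(string: "http://localhost:8983/solr/catalog/update?commit=true")!)
request.httpMethod = "POST"
request.setValue("application/json", forHTTPHeaderField: "Content-Type")

// One changed record and one delete, committed together.
let body: [String: Any] = [
    "add": ["doc": ["id": "42", "name": "Updated record"]],
    "delete": ["id": "17"]
]
request.httpBody = try? JSONSerialization.data(withJSONObject: body)

URLSession.shared.dataTask(with: request) { _, response, error in
    // After the commit, both changes are searchable immediately.
    print(error?.localizedDescription ?? "HTTP \((response as? HTTPURLResponse)?.statusCode ?? -1)")
}.resume()

Batching many such updates and committing once is kinder to Solr's caches than committing per document (see the caching note below).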
Updating frequently might hurt search performance. Solr gets a lot of performance from caching results. Those caches are cleared after a commit. Check the statistics page for the cache hit ratio in your query result cache. If that is high, frequent updates could cause performance problems.
I'm working on an application that works like a search engine, and all the time it has workers in the background searching the web and adding results to the Results table.
While everything works perfectly, lately I started getting huge response times while trying to browse, edit or delete the results. My guess is that the Results table is being constantly locked by the workers who keep adding new data, which means web requests must wait until the table is freed.
However, I can't figure out a way to lower that load on the Results table and get faster response times for my web requests. Has anyone had to deal with something like that?
The search bots are constantly reading and adding new stuff; each one adds new results as it finds them. I was wondering whether inserting the results in bulk only after a search finishes would help, or whether it would make things worse since it would take longer.
Anyway, I'm at a loss here and would appreciate any help or ideas.
I'm using RoR 2.3.8 and hosting my app on Heroku with PostgreSQL
PostgreSQL doesn't lock tables against reads or writes for ordinary inserts (readers and writers don't block each other). Start logging your queries and try to find out what is actually going on; guessing doesn't help, you have to dig into it.
To check the current activity:
SELECT * FROM pg_stat_activity;
Try the NOWAIT option on your explicit locking statements (e.g. SELECT ... FOR UPDATE NOWAIT). Since you're only adding new rows with your background workers, I'd assume there would be no lock conflicts when browsing/editing/deleting anyway.
You might want to put a cache in front of the database. On Heroku you can use memcached as a cache store very easily.
This'll take some load off your db reads. You could even have your search bots update the cache when they add new stuff so that you can use a very long expiration time and your frontend Rails app will very rarely (if ever) hit the database directly for simple reads.