I am currently using Whoosh to develop a website, and I'll need to choose something more powerful once the website is in production.
If any of you have used both of these engines, which one gave you the most meaningful results in the long run?
Solr is the best option. It's well documented and the community is huge. Almost a year ago I benchmarked Xapian against Solr:
My dataset had 8,000+ emails:
Solr
    index time: 3s
    index size: 5.2 MB
Xapian
    index time: 30s
    index size: 154 MB
Another good read on benchmarks comparing Xapian and Solr is this document: Cross-instance Search System - Search Engine Comparison
I have been really struggling to achieve acceptable performance for my application with Neo4J 3.0.3. Here is some background:
I am trying to replace Apache Solr with Neo4j for an application to extend its capabilities, while maintaining or improving performance.
In Solr I have documents that essentially look like this:
{
    "time": "2015-08-05T00:16:00Z",
    "point": "45.8300018311,-129.759994507",
    "sea_water_temperature": 18.49,
    "sea_water_temperature_depth": 4,
    "wind_speed": 6.48144,
    "eastward_wind": 5.567876,
    "northward_wind": -3.3178043,
    "wind_depth": -15,
    "sea_water_salinity": 32.19,
    "sea_water_salinity_depth": 4,
    "platform": 1,
    "mission": 1,
    "metadata": "KTDQ_20150805v20001_0016"
}
Since Solr stores flat documents of key-value fields, my initial translation to Neo4j was going to be simple, so I could get a feel for working with the API.
My method was essentially to have each Solr record equate to a Neo4J node, where every key-value would become a node-property.
Obviously a few tweaks were required (changing None to 'None' (Python), converting ISO times to epoch times (Neo4j doesn't support indexing datetimes), splitting point into lat/lon (for Neo4j spatial indexing), etc.).
My goal was to load up Neo4J using this model, regardless of how naive it might be.
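For illustration, the time and point conversions look roughly like this in Python (the field names come from the document above; the snippet itself is mine, not the original loader):

from datetime import datetime, timezone

record = {
    "time": "2015-08-05T00:16:00Z",
    "point": "45.8300018311,-129.759994507",
}

# ISO 8601 -> Unix epoch seconds (Neo4j 3.0 has no native datetime type)
epoch = int(datetime.strptime(record["time"], "%Y-%m-%dT%H:%M:%SZ")
            .replace(tzinfo=timezone.utc).timestamp())

# "lat,lon" string -> separate float properties for the spatial layer
lat, lon = (float(x) for x in record["point"].split(","))

print(epoch, lat, lon)  # 1438733760 45.8300018311 -129.759994507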
Here is an example of a REST call I make when loading a single record (using http://localhost:7474/db/data/cypher as my endpoint):
{
    "query": "CREATE (r:record {lat : {lat}, SST : {SST}, meta : {meta}, lon : {lon}, time : {time}}) RETURN id(r);",
    "params": {
        "lat": 40.1021614075,
        "SST": 6.521100044250488,
        "meta": "KCEJ_20140418v20001_1430",
        "lon": -70.8780212402,
        "time": 1397883480
    }
}
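For context, posting that payload from Python looks roughly like this (a sketch using the requests library against the legacy Cypher REST endpoint; the credentials are placeholders):

import requests

payload = {
    "query": ("CREATE (r:record {lat : {lat}, SST : {SST}, meta : {meta}, "
              "lon : {lon}, time : {time}}) RETURN id(r);"),
    "params": {
        "lat": 40.1021614075,
        "SST": 6.521100044250488,
        "meta": "KCEJ_20140418v20001_1430",
        "lon": -70.8780212402,
        "time": 1397883480,
    },
}

resp = requests.post('http://localhost:7474/db/data/cypher',
                     json=payload,
                     auth=('neo4j', 'neo4j'))  # placeholder credentials
print(resp.json()['data'])  # the returned node id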
Note that I have actually removed quite a few parameters for testing neo4j.
Currently I have serious performance issues. Loading a document like this into Solr for me takes about 2 seconds. For Neo4J it takes:
~20 seconds using REST API
~45 seconds using BOLT
~70 seconds using py2neo
I have ~50,000,000 records I need to load. Doing this in Solr usually takes 24 hours, so Neo4J could take almost a month!!
I recorded these times without a uniqueness constraint on my 'meta' attribute and without adding each node to the spatial index; with those enabled, the results were far worse.
Running into this issue, I tried searching for performance tweaks online. The following things have not improved my situation:
- increasing the open file limit from 1024 to 40000
- using ext4, and tweaking it as documented here
- increasing the page cache size to 16 GB (my system has 32 GB)
So far I have only addressed load times. After I had loaded about 50,000 nodes overnight, I attempted queries on my spatial index like so:
CALL spatial.withinDistance('my_layer', {lon: 34.0, lat: 20.0}, 1000)
as well as on my time index like so:
MATCH (r:record) WHERE r.time > {} AND r.time < {} RETURN r;
These simple queries would take literally several minutes just to return possibly a few nodes.
In Apache Solr, the spatial index is extremely fast and responds within 5 seconds (even with all 50,000,000 docs loaded).
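On the time queries specifically: for a label/property range predicate like the one above to avoid scanning every :record node, Neo4j 3.0 needs a schema index on :record(time), and the bounds should be passed as parameters. A minimal sketch with the same neo4j.v1 driver (the epoch bounds are arbitrary examples, not values from the post):

from neo4j.v1 import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687')
session = driver.session()

# one-time schema index so the range predicate is not a full label scan
session.run("CREATE INDEX ON :record(time)")

result = session.run(
    "MATCH (r:record) WHERE r.time > {start} AND r.time < {end} RETURN r",
    {"start": 1438732800, "end": 1438819200})  # example epoch bounds

for record in result:
    print(record["r"])

session.close()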
At this point, I am concerned about whether this performance lag is due to the nature of my data model, the configuration of my server, etc.
My goal was to extrapolate from this model, move several measurement types into their own classes of node, and create relationships from my base record node to them.
Is it possible that I am abusing Neo4j and need to recreate this model to use relationships and several different node types? Should I expect to see dramatic improvements?
As a side note, I originally planned to use a triple store (specifically Parliament) to store this data, and after struggling to work with RDF, decided that Neo4j looked promising and much easier to get up and running. Would it be worthwhile to go back to RDF?
Any advice, tips, comments are welcome. Thank you in advance.
EDIT:
As suggested in the comments, I have changed the behavior of my loading script.
Previously I was using python in this manner:
from neo4j.v1 import GraphDatabase

# the neo4j.v1 driver speaks Bolt
driver = GraphDatabase.driver('bolt://localhost:7687')
session = driver.session()

for tuple in mydata:
    statement = build_statement(tuple)
    session.run(statement)

session.close()
With this approach, the individual .run() calls return almost instantly; virtually all of the run time is spent in .close().
My modified approach:
transaction = ''
for tuple in mydata:
    statement = build_statement(tuple)
    transaction += ('\n' + statement)

with session.begin_transaction() as tx:
    tx.run(transaction)

session.close()
I'm a bit confused, because the behavior is pretty much the same: .close() still takes around 45 seconds, except now it doesn't commit. Since I am reusing the same identifier in each of my statements (CREATE (r:record {...}) .... CREATE (r:record {...}) ...), I get a CypherError about redeclaring the variable. I don't really know how to avoid this problem at the moment, and furthermore the run time did not seem to improve at all (I would expect an error to make it terminate much faster).
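For what it's worth, a common pattern that sidesteps both problems (the repeated identifier and the per-statement overhead) is to send one parameterized statement per batch and pass the rows as a parameter with UNWIND. A rough sketch against the same neo4j.v1 driver; mydata is the iterable of property dicts from the snippets above, and the batch size is an arbitrary choice:

from neo4j.v1 import GraphDatabase

driver = GraphDatabase.driver('bolt://localhost:7687')
session = driver.session()

# One statement per batch: the identifier `r` is declared only once, and the
# rows travel as parameters instead of concatenated Cypher text.
BATCH_QUERY = """
UNWIND {rows} AS row
CREATE (r:record)
SET r = row
"""

def load(records, batch_size=10000):
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) >= batch_size:
            session.run(BATCH_QUERY, {"rows": batch})  # one transaction per batch
            batch = []
    if batch:
        session.run(BATCH_QUERY, {"rows": batch})

load(mydata)
session.close()

SET r = row copies every key of the row map onto the node, so each dict should contain exactly the properties you want stored.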
I'm interested in Btdigg.org, which is called a "DHT search engine". According to this article, it doesn't store any content and doesn't even have a database. Then how does it work? Doesn't it need to gather meta info and store it in a database like other normal search engines? After a user submits a query, does it scan the DHT network and return the results in "real time"? Is this possible?
I don't have specific insight into BTDigg, but I believe the claim that there is no database (or something that acts like a database) is false. The author of that article might have been referring to something more specific that you would encounter in a traditional torrent site, where actual .torrent files are stored, for instance.
This is how a BTDigg-like site works:
You run a bunch of DHT nodes, specifically for the purpose of "eavesdropping" on DHT traffic, to be introduced to info-hashes that people talk about.
You join those swarms and download the metadata (the .torrent file) using the ut_metadata extension.
You index the information you find in there and map it to the info-hash.
You provide a front-end for that index.
If you want to luxury it up a bit, you can also periodically scrape the info-hashes you know about to gather statistics over time, and to figure out when swarms die out and should be removed from the index.
So the claim that you don't store .torrent files or any content is true.
It is not realistic to search the DHT in real time, because the DHT is not organized around keyword searches; you need to build and maintain the index continuously, "in the background".
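As a rough illustration of the kind of index such a site maintains (the structure and names here are illustrative, not BTDigg's actual implementation), keywords extracted from the torrent metadata map to the info-hashes that contain them:

from collections import defaultdict

# Toy in-memory inverted index: keyword -> set of info-hashes.
# A real site would back this with a database or a search engine.
index = defaultdict(set)

def add_torrent(info_hash, name, file_names):
    # index the torrent name and its file names under the info-hash
    for token in (name + ' ' + ' '.join(file_names)).lower().split():
        index[token].add(info_hash)

def search(query):
    # return the info-hashes that match every keyword in the query
    tokens = query.lower().split()
    return set.intersection(*(index[t] for t in tokens)) if tokens else set()

add_torrent('aa' * 20, 'Ubuntu 16.04 ISO', ['ubuntu-16.04.iso'])  # placeholder info-hash
print(search('ubuntu iso'))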
EDIT:
Since this answer was written, an optimization (BEP 51) has been implemented in some DHT clients that lets you query them for the info-hashes they are storing, significantly reducing the cost of indexing.
For a deeper understanding of the DHT and its applications, see Scott Wolchok's paper and presentation "Crawling BitTorrent DHTs for Fun and Profit". He presents the autonomous search engine idea as a side note to his study of the DHT's security:
PDF of his paper:
https://www.usenix.org/legacy/event/woot10/tech/full_papers/Wolchok.pdf
His presentation at DEFCON 18 (parts 1 & 2)
http://www.youtube.com/watch?v=v4Q_F4XmNEc
http://www.youtube.com/watch?v=mO3DfLtKPGs
The method used in Section 3 suggests that a database storing all the torrent data is required. While its performance is better, it may not be a true DHT search engine.
Section 8, while describing a less efficient approach, seems to be a true DHT search engine, as long as the keywords are the stored values.
From Section 3, Bootstrapping BitTorrent Search:
"The system handles user queries by treating the
concatenation of each torrent's filenames and description as a
document in the typical information retrieval model and using an
inverted index to match keywords to torrents. This has the advantage
of being well supported by popular open-source relational DBMSs. We
rank the search results according to the popularity of the torrent,
which we can infer from the number of peers listed in the DHT"
From Section 8, Related Work:
"The usual approach to distributing search using a DHT is with an inverted index, by storing each (keyword, list of matching documents) pair as a key-value pair in the DHT. Joung et al. [17] describe this approach and point out its performance problems: the Zipf distribution of keywords among files results in very skewed load balance, document information is replicated once for each keyword in the document, and it is difficult to rank documents in a distributed environment."
It is divided into two steps.
First, to get info-hashes via the DHT protocol (bep_0005), you do not need to implement the whole protocol; you only need find_node (request), get_peers (response), and announce_peer (response). Here's one of my open-source projects, dhtspider.
Second, fetch the metainfo via the metadata exchange protocol (bep_0009) and index it. Here is my own BitTorrent search engine; it collects 3,000,000+ unique info-hashes and 500,000+ valid metainfo records per day.
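For context, bep_0005 messages are bencoded dictionaries sent over UDP. A minimal sketch of building and sending a get_peers query (the node id and info-hash are random placeholders, and the bencoder below is a toy written for this example, not a library):

import os
import socket

def bencode(value):
    # toy bencoder covering ints, bytes, lists and dicts
    if isinstance(value, int):
        return b'i%de' % value
    if isinstance(value, bytes):
        return b'%d:%s' % (len(value), value)
    if isinstance(value, list):
        return b'l' + b''.join(bencode(v) for v in value) + b'e'
    if isinstance(value, dict):
        return b'd' + b''.join(bencode(k) + bencode(v)
                               for k, v in sorted(value.items())) + b'e'
    raise TypeError(value)

node_id = os.urandom(20)    # placeholder: a real node keeps a stable 20-byte id
info_hash = os.urandom(20)  # placeholder: normally learned from DHT traffic

# KRPC get_peers query as defined in bep_0005
query = {
    b't': b'aa',   # transaction id
    b'y': b'q',    # this message is a query
    b'q': b'get_peers',
    b'a': {b'id': node_id, b'info_hash': info_hash},
}

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(bencode(query), ('router.bittorrent.com', 6881))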
Similar to these: sphinx-not-indexing, thinking-sphinx returning empty results in console, similar here
The problem description is this: I can't get any results with any parameters from Sphinx. I have an Hdd scaffold, and when I perform a search in the Rails console with Sphinx I get [], even when I know there are items in the database. If I do Hdd.search() I should get the same as Hdd.all, but instead I get []. I saw a post about doing Hdd.search().to_a, but that doesn't make a difference for me; I get nil. Others seem to sometimes get different results on their webpage than in their Rails console, but it isn't that way for me: my site's search functionality produces [] as well.
The funny part is that all I've done between the time it was last working and now is make some small modifications, namely changing the model and view, running a few migrations, and installing a new plugin, CarrierWave. Somewhere in all of that, search just stopped working. I've restarted, rebuilt, and reindexed Thinking Sphinx, but to no avail. I've also restarted the server, and even ran it under WEBrick instead of Apache, but the problem has remained essentially the same.
Output of:
rake ts:rebuild
using config file '/home/adam/RailsForensicsHardDriveApp/ForensicsHDD/config/development.sphinx.conf'...
indexing index 'hdd_core'...
collected 6 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 6 docs, 292 bytes
total 0.006 sec, 47510 bytes/sec, 976.24 docs/sec
indexing index 'hdd_delta'...
collected 0 docs, 0.0 MB
total 0 docs, 0 bytes
total 0.003 sec, 0 bytes/sec, 0.00 docs/sec
skipping non-plain index 'hdd'...
total 4 reads, 0.000 sec, 3.8 kb/call avg, 0.0 msec/call avg
total 14 writes, 0.000 sec, 2.5 kb/call avg, 0.0 msec/call avg
Started successfully (pid 6700).
I notice that the output above includes a line that says "skipping non-plain index 'hdd'...". I don't know what that means, but it sounds like it's not doing something that it should.
I have tried to debug using ruby-debug, but that hasn't borne any fruit yet. I'm hoping somebody has some advice based on what I've posted.
I've run rake rails:update
I've uninstalled and reinstalled thinking-sphinx gem
In this gist I have included the pertinent files. You can see that in the model I've tried to rule out the concatenation space being too small by extending its size, but that hasn't made a difference either.
I've read the deltas documentation and the FAQ on setting up Sphinx with Passenger (yes, I am running Rails with Passenger), and I don't feel like either of those is the problem, because Sphinx was running quite well before it mysteriously (to me) stopped functioning correctly.
I honestly have no idea why it's not working. I have the entire project under git, and I've tried rolling back to earlier versions when it was working, rebuilding the index, and performing a search, but that doesn't seem to work. I've tried dropping the database and recreating it, re-migrating, and then rebuilding Sphinx, but that hasn't worked either.
I found that when I did sudo apt-get purge sphinxsearch and then reinstalled it, it somehow stopped the server from blindly continuing through an error. If you look at the model I posted, you'll see I had a special field, type, that I had moved in via a migration (type is a reserved column name in Rails). I think this was preventing it from working, because I got some really odd errors after I reinstalled thinking-sphinx (both as a gem and as an apt package). After reinstalling, dropping the database, and editing the migration so the column was hd_type instead of type, I was able to successfully reindex the hdd and populate search results. I know this isn't exactly precise, because I had fixed the issue before I wrote down any of the final errors that would have been pertinent to understanding precisely why it was malfunctioning.
See my answer here, which links back to this article, in case it provides any different context than this answer does (I don't think it does).
skipping non-plain index 'hdd' is okay - it is a distributed index and can't be indexed directly.
If you take a look at the indexer output:
collected 6 docs, 0.0 MB
it means you only have six documents in the index. If you don't get any errors from Sphinx, it may just be that you don't have any documents that match your search.
I read an article comparing the performance of Sunspot and Thinking Sphinx (http://www.vijedi.net/2010/ruby-full-text-search-performance-thinking-sphinx-vs-sunspot-solr/). According to the article, Sunspot lags far behind Thinking Sphinx because it uses XML to interact with the Java layer. These are the results reported there:
Runs     Thinking Sphinx    Sunspot
5000          38.49         1611.60
10000         38.54         1648.51
15000         39.06         1614.52
20000         38.86         1583.53
25000         39.78         1613.79
30000         38.83         1595.60
35000         38.34         1571.96
40000         38.06         1631.87
45000         37.57         1603.31
50000         38.23         1634.53
Total        385.80        16109.26
Is there really such a difference? Is Sunspot really slower, or is the article just biased? Which full-text search engine would you recommend?
If you look at the comments on that article, it seems that the author is not biased, but that the times aren't a reliable comparison of the two libraries.
I'm the author of Thinking Sphinx, so of course I think it's a viable option and should serve you well - but sometimes Solr (or a different option again) will be a better fit. Both Thinking Sphinx and Sunspot are well-maintained and used widely - certainly, Thinking Sphinx supports Rails 3 and 3.1 and won't be disappearing any time soon.
I would recommend trying one or the other out and seeing how it works. Unless you're dealing with a massive site, search is unlikely to be a bottleneck, so go with what you feel more comfortable with.
I am trying to index about 3 million text documents in Solr. About 1/3 of these files are emails that have about 1-5 paragraphs of text in them. The remaining 2/3 of the files only have a few words to a few sentences each.
It takes Lucid/Solr nearly 1 hour to fully index the entire dataset I'm working with, and I'm trying to find ways to optimize this. I have set up Lucid/Solr to commit only every 100,000 files, and it indexes the files in batches of 50,000 at a time. Memory isn't an issue anymore; it consistently stays around 1 GB because of the batching.
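For reference, that commit/batch strategy looks roughly like this from a Python client (a sketch using pysolr; the core name and batch size are placeholders):

import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/mycore')  # placeholder core name

def index_all(docs, batch_size=50000):
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) >= batch_size:
            solr.add(batch, commit=False)  # defer commits; they are expensive
            batch = []
    if batch:
        solr.add(batch, commit=False)
    solr.commit()  # a single hard commit at the end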
The entire dataset has to be indexed initially. It's like a legacy system that has to be loaded to a new system, so the data has to be indexed and it needs to be as fast as possible, but I'm not sure what areas to look into to optimize this time.
I'm thinking that maybe there are a lot of little words like "the, a, because, should, if, ..." that cause a lot of overhead and are just "noise" words. I am curious whether cutting them out would drastically speed up the indexing time. I have been looking at the Lucid docs for a while, but I can't seem to find a way to specify which words not to index. I came across the term "stop list" but didn't see much more than a passing reference to it.
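For what it's worth, stock Solr handles this with a stop-word filter in the field type's analyzer (a StopFilterFactory that reads a stopwords.txt file), so those words never reach the index. If you would rather strip them on the client before sending documents, a rough Python sketch (the stop-word list and field name are illustrative):

STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'because', 'should', 'if', 'of', 'to'}

def strip_stop_words(doc, field='body'):
    # drop common "noise" words from a text field before indexing
    tokens = doc[field].split()
    doc[field] = ' '.join(t for t in tokens if t.lower() not in STOP_WORDS)
    return doc

doc = {'id': '42', 'body': 'The quick brown fox should jump over the lazy dog'}
print(strip_stop_words(doc)['body'])  # quick brown fox jump over lazy dog

In my experience this tends to shrink the index more than it shortens indexing time, since the tokenizer still has to see every word.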
Are there other ways to make this indexing go faster or am I just stuck with a 1 hour indexing time?
We ran into a similar problem recently. We couldn't use SolrJ because the request and response have to go through some other applications, so we took the following steps:
Creating Custom Solr Type to Stream Large Text Field:
1. Use GZipOutputStream/GZipInputStream and Base64OutputStream/Base64InputStream to compress the large text. This reduces the size of the text by about 85%, which reduces the time needed to transfer the request/response.
2. To reduce memory usage on the client side:
2.1 Use a streaming API (Gson streaming or XML StAX) to read documents one by one.
2.2 Define a custom Solr field type, FileTextField, which accepts a FileHolder as its value. FileTextField eventually passes a Reader to Lucene, and Lucene uses the Reader to read the content and add it to the index.
2.3 When the text field is too big, first uncompress it to a temp file, create a FileHolder instance, then set the FileHolder instance as the field value.
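The same compression idea expressed in Python terms (a rough sketch; the actual size reduction depends entirely on the text):

import base64
import gzip

def pack(text):
    # gzip, then base64, so the field survives JSON/XML transport as plain text
    return base64.b64encode(gzip.compress(text.encode('utf-8'))).decode('ascii')

def unpack(packed):
    return gzip.decompress(base64.b64decode(packed)).decode('utf-8')

body = 'some very large email body ' * 10000
packed = pack(body)
print(len(body), len(packed))  # the packed form is a small fraction of the original
assert unpack(packed) == body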
It seems from your question that indexing time is really important for your application. Solr is a great search engine, but if you need super-fast indexing time and that is a very important criterion for you, you should go with the Sphinx search engine. It won't take you much time to set Sphinx up and benchmark your results.
There can be ways to optimize (like the ones you have mentioned, stop words, etc.), but whatever you do with respect to indexing time, Solr won't be able to beat Sphinx. I have done the benchmarking myself.
I too love Solr a lot because of its ease of use and its great out-of-the-box features like n-gram indexing, faceting, multi-core support, spelling correction, and its integration with other Apache products, but when it comes to optimized algorithms (be it index size, indexing time, etc.) Sphinx rocks!
Sphinx is open source too. Try it out.