Performance difference between sunspot and thinking sphinx - ruby-on-rails

I read an article comparing the performance of Sunspot and Thinking
Sphinx (http://www.vijedi.net/2010/ruby-full-text-search-performance-thinking-sphinx-vs-sunspot-solr/).
According to the article, Sunspot lags far behind Thinking Sphinx because
it uses XML to communicate with the Java layer. These are the results
reported there:
Runs     Thinking Sphinx    Sunspot
5000     38.49              1611.60
10000    38.54              1648.51
15000    39.06              1614.52
20000    38.86              1583.53
25000    39.78              1613.79
30000    38.83              1595.60
35000    38.34              1571.96
40000    38.06              1631.87
45000    37.57              1603.31
50000    38.23              1634.53
Total    385.80             16109.26
Is there really such a difference? Is Sunspot really that much slower, or is
the article just biased? Which full-text search engine would you recommend?

If you look at the comments on that article, it seems that the author is not biased, but that the times aren't a reliable comparison of the two libraries.
I'm the author of Thinking Sphinx, so of course I think it's a viable option and should serve you well - but sometimes Solr (or a different option again) will be a better fit. Both Thinking Sphinx and Sunspot are well-maintained and used widely - certainly, Thinking Sphinx supports Rails 3 and 3.1 and won't be disappearing any time soon.
I would recommend trying one or the other out and seeing how it works. Unless you're dealing with a massive site, search is unlikely to be a bottleneck, so go with whichever you feel more comfortable with.

Related

max-series-per-database limit exceeded clarification needed / how to calculate number of series in use

We recently started to encounter this error:
{"error":"partial write: max-series-per-database limit exceeded: (1000000) dropped=1"}
When writing metric data like this:
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=1103,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
I know that Influx recommends you keep your series cardinality low, and our impression was that series cardinality would mean keeping each tag individually to a small number of values. e.g. we felt comfortable sending instance_id=1103 as a tag, because we know that there will never be more than 2000 distinct instance_id tag values.
But after running into this error... I'm afraid maybe I was mistaken here. Do we actually need to keep the cardinality of all possible combinations of all tags low? e.g. do these two things count as two separate series towards the 1,000,000 default max, because the instance_id is different?
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=1111,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
resque_job,environment=beta,billing_status=active-current,billing_active=active,instance_id=2222,instance_testmode=0,instance_staging=0,server_addr=RESQUE,database_host=db11.msp1.our-domain.com,admin_sso_key=_EMPTY_,admin_is_internal=_EMPTY_,queue_priority=default seconds_spent_job=0.20966601371765,number_in_batch=1 1649203450783000002
If those count as two separate series... then is there a better way to structure this data in Influx? 1,000,000 total seems like a tiny amount if each separate combination of tags is a separate series...
Does InfluxDB 2.x help with this?
Is there a better tool that can handle a large number of tags and not bump into limits like this?
There is no way to figure out which data was not recorded. To stop dropping data, raise the max-series-per-database setting above 1,000,000.
Hitting this limit can be an indication that you are creating a lot of series. I have seen some documentation on why that isn't great.
Hope this helps!
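On the cardinality question itself: in InfluxDB 1.x a series is identified by the measurement plus the complete tag set, so the two example lines in the question, which differ only in instance_id, count as two series. The worst case is therefore closer to the product of the distinct values of every tag than to their sum. Below is a minimal sketch for counting the series in a sample of line-protocol points. The series_key and count_series helpers are hypothetical names, the sample points are abbreviated versions of the ones in the question, and the parser assumes there are no escaped commas or spaces in names or values.

```python
def series_key(line):
    """Return (measurement, sorted tag pairs) for one line-protocol point.

    Simplification: assumes no escaped commas or spaces in names or values.
    """
    head = line.split(" ", 1)[0]  # "measurement,tag1=a,tag2=b"
    measurement, *tags = head.split(",")
    return measurement, tuple(sorted(tuple(t.split("=", 1)) for t in tags))


def count_series(lines):
    """Count the distinct series among the given points."""
    return len({series_key(line) for line in lines if line.strip()})


# Abbreviated sample points (most tags from the question are omitted).
points = [
    "resque_job,environment=beta,instance_id=1111 seconds_spent_job=0.2 1649203450783000002",
    "resque_job,environment=beta,instance_id=2222 seconds_spent_job=0.2 1649203450783000002",
    "resque_job,instance_id=1111,environment=beta seconds_spent_job=0.5 1649203450783000003",
]
print(count_series(points))  # 2: the third point reuses the first point's tag set
```

Running something like this over a representative export shows which tags multiply the series count. Values that are unique per event or per request are usually better stored as fields than as tags.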

How does "DHT search engine" work?

I'm interested in Btdigg.org, which is called a "DHT search engine". According to this article, it doesn't store any content and doesn't even have a database. Then how does it work? Doesn't it need to gather metadata and store it in a database like other search engines? After a user submits a query, does it scan the DHT network and return the results in "real time"? Is this possible?
I don't have specific insight into BTDigg, but I believe the claim that there is no database (or something that acts like a database) is false. The author of that article might have been referring to something more specific that you would encounter on a traditional torrent site, where the actual .torrent files are stored, for instance.
This is how a BTDigg-like site works:
1. Run a bunch of DHT nodes, specifically for the purpose of eavesdropping on DHT traffic, to be introduced to the info-hashes that people talk about.
2. Join those swarms and download the metadata (.torrent file) using the ut_metadata extension.
3. Index the information you find in there and map it to the info-hash.
4. Provide a front-end for that index.
If you want to luxury it up a bit you can also periodically scrape the info-hashes you know about to gather stats over time and maybe also figure out when swarms die out and should be removed from the index.
So, the claim that you don't store .torrent files nor any content is true.
It is not realistic to search the DHT in real time, because the DHT is not organized around keyword searches. You need to build and maintain the index continuously, "in the background".
EDIT:
Since this answer, an optimization (BEP 51) has been implemented in some DHT clients that lets you query which info-hashes they are hosting, significantly reducing the cost of indexing.
For a deep understanding of DHT and its applications, see Scott Wolchok's paper and presentation "Crawling BitTorrent DHTs for Fun and Profit". He presents the autonomous search engine idea as a sidenote to his study of DHT's security:
PDF of his paper:
https://www.usenix.org/legacy/event/woot10/tech/full_papers/Wolchok.pdf
His presentation at DEFCON 18 (parts 1 & 2)
http://www.youtube.com/watch?v=v4Q_F4XmNEc
http://www.youtube.com/watch?v=mO3DfLtKPGs
https://www.usenix.org/legacy/event/woot10/tech/full_papers/Wolchok.pdf
The method used in Section 3 seems to suggest that a database storing all the torrent data is required. While its performance is better, it may not be a true DHT search engine.
The approach in Section 8, while less efficient, seems to be a DHT search engine, as long as the keywords are the stored values.
From Section 3, Bootstrapping BitTorrent Search:
"The system handles user queries by treating the
concatenation of each torrent's filenames and description as a
document in the typical information retrieval model and using an
inverted index to match keywords to torrents. This has the advantage
of being well supported by popular open-source relational DBMSs. We
rank the search results according to the popularity of the torrent,
which we can infer from the number of peers listed in the DHT"
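The scheme described in this passage can be sketched as follows. This is a toy in-memory version, not the paper's implementation (which relies on a relational DBMS). The TorrentIndex class, the sample info-hashes, and the peer counts are made up for illustration.

```python
import re
from collections import defaultdict


class TorrentIndex:
    """Toy inverted index: keyword -> set of info-hashes."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.peers = {}

    def add(self, info_hash, text, peer_count):
        """Index a torrent's concatenated filenames and description."""
        self.peers[info_hash] = peer_count
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[word].add(info_hash)

    def search(self, query):
        """Return info-hashes matching every query word, most peers first."""
        words = re.findall(r"[a-z0-9]+", query.lower())
        if not words:
            return []
        hits = set.intersection(*(self.postings.get(w, set()) for w in words))
        return sorted(hits, key=lambda h: self.peers[h], reverse=True)


index = TorrentIndex()
index.add("aaaa", "ubuntu 22.04 desktop amd64.iso", peer_count=120)
index.add("bbbb", "ubuntu 20.04 server amd64.iso", peer_count=300)
index.add("cccc", "debian 11 netinst.iso", peer_count=80)
print(index.search("ubuntu amd64"))  # ['bbbb', 'aaaa']
```

The peer count stands in for the popularity signal that the paper infers from the DHT. A real index would live in a database and be refreshed as swarms appear and die out.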
From Section 8, Related Work:
the usual approach to distributing search using a DHT is
with an inverted index, by storing each (keyword, list of matching
documents) pair as a key-value pair in the DHT. Joung et al. [17]
describe this approach and point out its performance problems: the
Zipf distribution of keywords among files results in very skewed load
balance, document information is replicated once for each keyword in
the document, and it is difficult to rank documents in a distributed
environment
It can be divided into two steps:
1. Implement the BEP 5 (bep_0005) protocol to collect info-hashes. You do not need to implement the whole protocol; only find_node (request), get_peers (response), and announce_peer (response) are required. Here's my open-source dhtspider.
2. Implement the BEP 9 (bep_0009) protocol to fetch the metainfo and index it. Here is my own BitTorrent search engine, which collects more than 3 million unique info-hashes and more than 500,000 valid metainfo records every day.
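For the first step, the find_node request from BEP 5 is a bencoded dictionary sent over UDP. Below is a minimal sketch of building one. The bencode and find_node_query helpers are written here for illustration and are not taken from dhtspider, and the node ID and target are the placeholder values from BEP 5's own example.

```python
def bencode(value):
    """Encode ints, byte strings, lists and dicts using bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("ascii")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = sorted(
            ((k.encode("ascii") if isinstance(k, str) else k, v)
             for k, v in value.items()),
            key=lambda kv: kv[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def find_node_query(node_id, target, transaction_id=b"aa"):
    """Build a KRPC find_node query as described in BEP 5."""
    return bencode({
        "t": transaction_id,  # transaction ID, echoed back in the response
        "y": "q",             # message type: query
        "q": "find_node",
        "a": {"id": node_id, "target": target},
    })


message = find_node_query(b"abcdefghij0123456789", b"mnopqrstuvwxyz123456")
print(message)
```

A crawler would send such messages to known nodes over UDP, parse the compact node lists in the replies, and record the info-hashes carried by the get_peers and announce_peer queries that other nodes send to it.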

Optimizing Lucid/Solr to index large text documents

I am trying to index about 3 million text documents in Solr. About 1/3 of these files are emails that have about 1-5 paragraphs of text in them. The remaining 2/3 contain only a few words to a few sentences each.
It takes Lucid/Solr nearly 1 hour to fully index the entire dataset I'm working with. I'm trying to find ways to optimize this. I have set up Lucid/Solr to commit only every 100,000 files, and it indexes the files in batches of 50,000 at once. Memory isn't an issue anymore, as usage consistently stays around 1GB because of the batching.
The entire dataset has to be indexed initially. It's like a legacy system that has to be loaded into a new system, so the data has to be indexed as fast as possible, but I'm not sure which areas to look into to reduce this time.
I'm thinking that maybe there are a lot of little words like "the, a, because, should, if, ..." that cause a lot of overhead and are just "noise" words. I am curious whether cutting them out would drastically speed up the indexing. I have been looking at the Lucid docs for a while, but I can't seem to find a way to specify which words not to index. I came across the term "stop list" but didn't see much more than a passing reference to it.
Are there other ways to make this indexing go faster or am I just stuck with a 1 hour indexing time?
We ran into a similar problem recently. We can't use SolrJ, because the request and response have to pass through some other applications, so we took the following steps:
Creating Custom Solr Type to Stream Large Text Field!
1. Use GZIPOutputStream/GZIPInputStream and Base64OutputStream/Base64InputStream to compress the large text. This reduces the size of the text by about 85%, which in turn reduces the time needed to transfer the request and response.
2. To reduce memory usage on the client side:
2.1 We use a streaming API (Gson streaming or XML StAX) to read the docs one by one.
2.2 We define a custom Solr field type, FileTextField, which accepts a FileHolder as its value. FileTextField eventually passes a reader to Lucene, which uses the reader to read the content and add it to the index.
2.3 When the text field is too big, we first uncompress it to a temp file, create a FileHolder instance, and then set that instance as the field value.
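The compress-then-encode idea in step 1 can be illustrated with a short sketch. The answer used Java streams; this is a Python analogue, and the pack_text and unpack_text helpers and the sample text are assumptions made for illustration. The actual savings depend on how repetitive the text is, and Base64 adds roughly a third back on top of the compressed size.

```python
import base64
import gzip


def pack_text(text):
    """Gzip-compress text, then Base64-encode it for transport in JSON/XML."""
    compressed = gzip.compress(text.encode("utf-8"))
    return base64.b64encode(compressed).decode("ascii")


def unpack_text(packed):
    """Reverse pack_text: Base64-decode, then decompress."""
    return gzip.decompress(base64.b64decode(packed)).decode("utf-8")


# Made-up, highly repetitive sample body; real emails compress less well.
body = "Please find the quarterly report attached. " * 200
packed = pack_text(body)
assert unpack_text(packed) == body
print(len(body), len(packed))
```

Comparing the two printed lengths on a sample of real documents is a quick way to check whether the transfer savings justify the extra CPU work on both ends.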
From your question, it seems that indexing time is really important for your application. Solr is a great search engine; however, if you need very fast indexing and that is an important criterion for you, then you should consider the Sphinx search engine. It won't take much time to set up Sphinx and benchmark your results.
There may be ways to optimize (like the ones you mentioned, stopwords etc.); however, whatever you do with respect to indexing time, Solr won't be able to beat Sphinx. I have done the benchmarking myself.
I too love Solr a lot because of its ease of use, its great out-of-the-box features such as N-gram indexing, faceting, multi-core, and spelling correction, and its integration with other Apache products. But when it comes to optimized algorithms (index size, index time etc.), Sphinx rocks!
Sphinx is open source too. Try it out.

solr vs xapian: which one gives you the most meaningful results?

I am currently using Whoosh to develop a website, and I'll need to choose something more powerful once the website is in production.
If any of you have used both of these engines, which one gave you the most meaningful results in the long run?
Solr is the best option. It's well documented and the community is huge. Almost a year ago I benchmarked Xapian against Solr:
My dataset had 8,000+ emails:
Solr
  index time: 3 s
  index size: 5.2 MB
Xapian
  index time: 30 s
  index size: 154 MB
Another good read on benchmarks between Xapian and Solr is this document: Cross-instance Search System - Search Engine Comparison

Is there a drop-in replacement for ActiveRecord to_xml that's faster?

I have a large-ish array (~400 elements) of ActiveRecord objects that I need to convert to XML. I've used array.to_xml for convenience, but it's very slow -- about 20 seconds when the server is busy, and about 5 seconds when idle.
I've just run a few benchmarks while the server was idle, and found that:
the ActiveRecord query (complete with two-level :include) takes about 0.3s on average.
converting that result set to XML takes about 4.9s on average. 4.86s of that is User CPU time.
Is there a drop-in replacement for Builder::XmlMarkup that will improve the speed of to_xml? Or will I have to hand-roll something?
The following link claims a 2-3x speed increase. It's not a drop-in replacement, but rather a technique for building a structure that to_xml will traverse faster: Faster alternatives to ActiveRecord::Base.to_xml (Rails Performance Series)
You might also want to check out http://github.com/rti/FastXml
This is a simple Rails plug-in which replaces Array#to_xml and ActiveRecord::Base#to_xml. It uses the 'libxml-ruby' gem (which is a native binding to libxml) to generate the documents.
