I have a website that runs on a third-party search provider that is expensive. I am going to roll my own.
Is Lucene.NET capable of handling ~25,000 products (or documents), each with maybe ten attributes used for filtering? I am looking to do a "narrow/drill-down" or "faceted" search.
Does that sound like too much to ask of Lucene.NET?
I've used it with millions of entries and the performance was excellent. It was reliable and easy to get up and running. Highly recommended if you need full text search.
I am planning to write a Node.js-powered RESTful web service that I will use for a mobile application providing some sort of location-based features. The most basic use case will look something like this:
the user can create a resource by sending a request to the web service containing the resource's name and the user's current location (latitude and longitude)
the web service will store the metadata about this resource internally in some sort of collection
the user can query the web service for a list of resources within 5km of his current location
One of the first problems that came to mind was scalability. Let's suppose that at some point in the future the server will hold metadata for 1 million resources. When a user queries for nearby results, looping through 1 million entries to compute distances will take forever.
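For concreteness, the distance in question is the great-circle (haversine) distance, and this is the per-entry computation a naive loop would repeat a million times per query. A small illustrative sketch (TypeScript, pure math, no libraries):

    // Great-circle (haversine) distance in kilometres between two lat/lon points.
    function haversineKm(lat1: number, lon1: number, lat2: number, lon2: number): number {
      const toRad = (deg: number) => (deg * Math.PI) / 180;
      const R = 6371; // mean Earth radius in km
      const dLat = toRad(lat2 - lat1);
      const dLon = toRad(lon2 - lon1);
      const a =
        Math.sin(dLat / 2) ** 2 +
        Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
      return 2 * R * Math.asin(Math.sqrt(a));
    }

The point of the spatial-indexing approaches discussed below is precisely to avoid evaluating this for every stored resource.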
There are many services out there that have the same flow, so I thought implementing something like this is not going to take me a lot of time. I might have been wrong.
I am now two days into researching proven methods and algorithms. By now I have read everything I could get my hands on about QuadTrees, Geohashes, databases with spatial indexing support, formulas and so on. However, I still can't get the whole picture of how everything fits together.
I was hoping that maybe someone who has worked on something similar could share his insight on what approach might be the most suitable considering this use case and the technologies that I am planning to use. Also, a short description of how it can be implemented would help me a lot!
For those who are also looking for more information on this topic out of curiosity, my answer might not provide much clarity. However, some answers here might help you understand how you could achieve proximity searches using Geohashes.
My approach, after doing a little research on Redis, will be to not overcomplicate things and just use the tools that are already out there. It has out-of-the-box support for geospatial indexing and will most probably meet all my persistence requirements for this project.
Apparently MongoDB also comes with built-in support for geodata. In fact, even RDBMSs like MySQL or SQLite come with such capabilities.
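For anyone who wants to see how little code the Redis route involves, here is a minimal sketch in TypeScript, assuming the node-redis v4 client and a Redis server recent enough for the GEO commands (6.2+ for GEOSEARCH); the key and member names are made up:

    import { createClient } from "redis";

    const client = createClient();   // local Redis assumed
    await client.connect();          // top-level await, i.e. an ES module context

    // Store a resource's position (GEOADD under the hood)
    await client.geoAdd("resources", {
      longitude: -73.9857,
      latitude: 40.7484,
      member: "resource:42",
    });

    // List all members within 5 km of the user's current position (GEOSEARCH)
    const nearby = await client.geoSearch(
      "resources",
      { longitude: -73.99, latitude: 40.75 },   // center = user's location
      { radius: 5, unit: "km" }
    );
    console.log(nearby);                        // e.g. [ "resource:42" ]

Redis stores the coordinates as geohash-encoded scores in a sorted set, so the radius query does not have to scan every member.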
I need some suggestions of what works well for caching an updatable news feed.
Please, no "Fanboy" answers either please - not looking for subjective opinions of what the "best" system, just seeking some suggestions of technologies that will fit the requirements below. So please, share what you have used in the real world, even if you prefer some other solution.
I have a Rails-based news feed (Neo4j database), and while performance is good, I would like to cache it so that the servers don't get bogged down serving live feeds.
REQUIREMENTS:
EASY FRAGMENT UPDATES: I'd like to easily update parts of a user's news feed in the cache based upon specific triggers. For example, when a user edits their status update, I don't want to regenerate the user's entire news feed in the cache; I just want to update that one "fragment", or section if you will, of that particular user's feed. And I don't want to jump through hoops to do so.
DELETION: If someone deletes an activity, I just want to remove that activity from their cached feed, rather than waiting for the system to eventually refresh the entire feed for that user.
EASY RETRIEVAL: I'd like to retrieve cached entries in such a way that the Rails controllers/models can easily read them and hand them off to the views without any modification of the views.
PERSISTENCE: If I need to restart the cache, it should load the cached entries back from disk, which means it needs to save them to disk in the first place.
SPEED: Given that it must be able to update fragments of cached news feeds, there is going to be a performance hit of some sort, but I need speed.
What cache technologies provide these capabilities? Will Redis, MongoDB, or Memcached fit these requirements? What other options are there (CouchDB, Tokyo Cabinet, etc.)?
In the spirit of Stack Overflow, I'm not asking for subjective opinions on what you like better and why; I'm just asking for candidate systems that you may have actually used in production to accomplish caching and updating a cached news feed (or anything similar).
Since this is mainly an opinion-based topic, this answer will be subjective, but I will try to remain factual anyway.
The first point to note is that your requirements tend to be mutually exclusive. As we say in France, you want the butter, the money for the butter, and the farmer's wife (OK, this is probably a lousy translation).
For example, to support easy fragment updates and proper deletion, you will need some kind of data structure in the cache. I have zero knowledge about Rails, but I guess this will have an impact on the data access patterns and the definitions of controllers/models. In other words, it will add complexity to data retrieval. You need speed, but at the same time you also require persistence and non-trivial data access patterns. You cannot get everything at once; you will have to make choices and prioritize these requirements.
My second point is that a cache is only useful when there is a significant performance difference between the cache and the underlying storage engine. Since you already use a NoSQL engine which is rather efficient (Neo4j), you should consider only engines that are truly designed for raw performance (i.e. low-latency stores): memcached, Redis, Couchbase, Aerospike, to name well-established open-source products. If you feel a bit more adventurous, you can also consider other projects like Tarantool or HyperDex.
There are a number of commercial products as well, but I'm not sure they provide a Ruby client (TIBCO ActiveSpaces, GigaSpaces, Red Hat Infinispan, etc.).
Other NoSQL engines (MongoDB, Cassandra, CouchDB, etc.) have other interesting properties, but they will not beat these solutions at raw performance for a mixed read/write workload. Here I'm only talking about raw performance (i.e. low latency at high throughput), not scalability.
Actually, memcached can be excluded because it does not support persistence. I would say you can probably implement what you want with Redis, Couchbase or Aerospike, but Aerospike 3 does not yet seem to have an officially supported Ruby client.
Supporting multiple data access paths (i.e. consistent indexing data structures) will be easier with Redis and Aerospike than with Couchbase. High availability will be easier with Couchbase or Aerospike than with Redis. Implementing cache behavior will be easier with Redis and Couchbase than with Aerospike.
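To make the "easy fragment updates" requirement a bit more concrete, here is one way it could look with Redis: keep each user's feed in a hash keyed by activity id, so a single entry can be rewritten or deleted without regenerating the rest. The sketch is TypeScript with the node-redis v4 client purely for illustration (key names are invented); the same HSET/HDEL/HGETALL commands map directly onto any Ruby Redis client:

    import { createClient } from "redis";

    const client = createClient();
    await client.connect();

    const feedKey = (userId: string) => `feed:${userId}`;   // invented key scheme

    // Write or update a single fragment (one activity) of a user's cached feed
    async function upsertFragment(userId: string, activityId: string, rendered: string) {
      await client.hSet(feedKey(userId), activityId, rendered);
    }

    // Delete one activity from the cached feed without touching the rest
    async function deleteFragment(userId: string, activityId: string) {
      await client.hDel(feedKey(userId), activityId);
    }

    // Retrieve the whole cached feed as { activityId: renderedFragment }
    async function readFeed(userId: string) {
      return client.hGetAll(feedKey(userId));
    }

A real feed would probably pair the hash with a sorted set (ZADD on a timestamp) to keep ordering, and Redis' RDB/AOF persistence covers the reload-from-disk requirement.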
Some general advice:
make sure you really have a performance or a scalability issue with Neo4j before adding the complexity of an extra layer. Complexity is like toothpaste: once it is out of the tube, you cannot put it back.
data access patterns should be listed at design time, and must be backed by corresponding data structures in the chosen engine.
the hardware footprint must be considered as well. If you have only a couple of boxes, pick a lightweight solution like Redis.
with persistence, you also need to consider HA. What happens if the caching layer is lost? Actually, I would say that for a cache, HA may be more important than persistence.
Finally, you also need to define the exact cache semantics you want (update behavior, invalidation behavior, cache-miss management, TTL policy if any, etc.). The three NoSQL engines I have listed provide tools to help implement the various strategies, but none of them supports such a strategy off the shelf; it will require some coding.
I am building out some reporting features for our website (a decent-sized site that gets several million pageviews a day), and I am wondering if there are any good free/open-source data warehousing systems out there.
Specifically, I am looking only for something to store the data--I plan to build a custom front end/UI for it so that it shows the information we care about. However, I don't want to have to build a customized database for this, and while I'm pretty sure an SQL database would not work here, I'm not sure what to use exactly. Any pointers to helpful articles would also be appreciated.
Edit: I should mention--one DB I have looked at briefly was MongoDB. It seems like it might work, but their "Use Cases" specifically mention data warehousing as "Less Well Suited": http://www.mongodb.org/display/DOCS/Use+Cases . Also, it doesn't seem to be specifically targeted towards data warehousing.
http://www.hypertable.org/ might be what you are looking for. Going by your description above, you need something to store large amounts of logged data (e.g. a visitor log) without much need for normalization.
Hypertable is based on Google's BigTable project.
See http://code.google.com/p/hypertable/wiki/PerformanceTestAOLQueryLog for benchmarks.
You lose the relational capabilities of SQL-based DBs, but you gain a lot in performance. You could easily use Hypertable to store millions of rows per hour (hard drive space permitting).
Hope that helps.
I may not understand the problem correctly -- however, if you find some time to (re)visit Kimball’s “The Data Warehouse Toolkit”, you will find that all it takes for a basic DW is a plain-vanilla SQL database; in other words, you could build a decent DW with MySQL using MyISAM as the storage engine. The question is only about the desired granularity of information – what you want to keep and for how long. If your reports are mostly periodic and you implement report storage or a cache, then you don’t need to store pre-calculated aggregations (no need for cubes). In other words, a Kimball star schema with cached reporting can provide decent performance in many cases.
You could also look at the community edition of “Pentaho BI Suite” (open source) to get a quick start with ETL, analytics and reporting -- and experiment a bit to evaluate the performance before diving into custom development.
Although this may not be what you were expecting, it may be worth considering.
Pentaho Mondrian
Open source
Uses a standard relational database
MDX (think pivot table)
ETL (via Kettle)
I use this.
In addition to Mike's answer of hypertable, you may want to take a look at Apache's Hadoop project:
http://hadoop.apache.org/
They provide a number of tools which may be useful for your application, including HBase, another implementation of the BigTable concept. I'd imagine that for reporting you might find their MapReduce implementation useful as well.
It all depends on the data and how you plan to access it. MonetDB is a column-oriented database engine from one of the most revolutionary teams in database technology. They just got VLDB's 10-year best paper award. The DB is open source and there are plenty of reviews online praising it.
Perhaps you should have a look at TPC and see which of their test datasets best matches your case, and work from there.
Also consider the need for concurrency; it adds a big overhead to any approach and sometimes is not really required. For example, you can pre-digest some summary or index data and only have that protected for high concurrency. Profiling your data queries is the next step.
About SQL: I don't like it either, but I don't think it's smart to rule out an engine just because of its front-end language.
I face a similar problem and am thinking of using plain MyISAM with http://www.jitterbit.com/ as the data access layer. Jitterbit (or another similar free tool) seems very nice for this sort of transformation.
Hope this helps a bit.
A lot of people just use MySQL or Postgres :)
Maybe I am misstating the problem and conflating the answer with the question, but please hear me out. I would like to think (communally, with you) about a site based on any of the MVC frameworks (something PHP, or ASP.NET MVC, whatever) that would use a search engine (Lucene/Solr, FAST ESP, whatever) as the back end of the Model. That is to say, there is no database per se in the project - just a giant index of documents that are semi-structured content.
I am looking to understand - and keep in mind the site is primarily read-only - where I am likely to run into trouble. What are the things that make you think this is a bad idea from the get-go? Also, please assume that there will be a robust infrastructure with caching surrounding the search engine, so while perf comments are welcome, we feel they are not the major problem.
Thanks!
In general, I'd use a tool like Lucene for searching content, and a database for retrieving it. That doesn't mean that it won't work. It's more a question of why you don't want to use a database. Yes, it can work, and it probably will work (depending on the functional requirements of the site, read on), but that still doesn't make a tool like Lucene the right tool for the job per se.
That being said, it does depend on the kind of site, however. Is it really a site with just a whole bunch of searchable data and nothing else, or is it something much more than that? If the answer is the former, then good! If it is the latter, there are some issues I can think of:
Updates to the data can be troublesome. "Instant updates" are usually a no-go, as Lucene would have to rebuild its index, which is time-consuming. If there aren't many updates to the data that's fine. You can just recreate the index a couple of times per day, or nightly, if that works.
Trying to stuff data into an index which is not really suited to be indexed is usually not a good idea. If the site lets users register, then that user data should really go in a database. It's not impossible to store it in a Lucene index, it's just not the right tool for the job. Use the index as a bunch of indexed documents, but don't use it as a database as well.
Possible Duplicate:
How do you stop scripters from slamming your website hundreds of times a second?
I am building a web application in RubyOnRails, which is based on a large body of data. The application makes for powerful navigation and intersection of the data, as well as a community model for adding more data.
In that respect one could compare it with StackOverflow.com: a big bunch of data, structured in a fairly simple way.
I intend to offer the content under a Creative Commons license, but if the site "hits it off", I need to discourage copycats. My biggest fear is screen-scraping scripters, not only leeching away the raw data, but also causing huge usage peaks on my servers.
I wonder if RubyOnRails offers any way to throttle (obviously automated) requests, e.g. to reduce their response time to the benefit of regular users. Perhaps this requires Apache or Phusion Passenger settings?
EDIT: My goal is not to recognize user types, but to reduce responsiveness for overly active users, e.g. to cap the number of requests handled per IP address per unit of time(?)
My suggestion would be to limit any easy iterative navigation of your website, which is the primary way I have seen harvesting programs work. Simply encrypting the ID numbers you use as GET variables would make strip-mining your info more difficult. You can only try to make getting your information onerous; you won't be able to prevent it completely.
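To illustrate what "encrypting the ID numbers" could look like in practice, here is a small sketch using Node's built-in crypto module (TypeScript is used just to keep the examples in this thread in one language; the equivalent is a few lines of OpenSSL in Ruby). The key handling and id scheme are made up:

    import { createCipheriv, createDecipheriv, randomBytes } from "crypto";

    // In practice this key would come from configuration, not be generated at boot.
    const KEY = randomBytes(16);

    // Turn a sequential numeric id into an opaque, non-guessable token for URLs.
    function encodeId(id: number): string {
      const cipher = createCipheriv("aes-128-ecb", KEY, null);
      const encrypted = Buffer.concat([cipher.update(String(id), "utf8"), cipher.final()]);
      return encrypted.toString("base64url");   // base64url needs Node 16+
    }

    // Reverse the mapping when the token comes back as a GET parameter.
    function decodeId(token: string): number {
      const decipher = createDecipheriv("aes-128-ecb", KEY, null);
      const decrypted = Buffer.concat([
        decipher.update(Buffer.from(token, "base64url")),
        decipher.final(),
      ]);
      return Number(decrypted.toString("utf8"));
    }

HMAC-signed ids or random slugs achieve the same "no easy iteration" goal if you would rather not decrypt on every request.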
You could present a captcha to the "overly active users", just like SO does when you edit too fast. That should effectively hinder automatic spider-like scraping.
You might also want to look into using some Rack middleware to do rate limiting, like this recent article covered for doing API limiting (such as what you'd want at Twitter or similar).
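If it helps, the core of the rate-limiting idea is small enough to sketch. This is a fixed-window counter per IP in TypeScript, framework-agnostic and with made-up limits; in the Rails case the equivalent logic would live in a piece of Rack middleware in front of the app, and in production the counters would typically live in Redis or memcached so all app servers share them:

    const WINDOW_MS = 60_000;   // 1 minute window (made-up value)
    const MAX_REQUESTS = 100;   // per IP per window (made-up value)

    const counters = new Map<string, { windowStart: number; count: number }>();

    // Returns true if the request should be served, false if it should be throttled.
    function allowRequest(ip: string, now = Date.now()): boolean {
      const entry = counters.get(ip);
      if (!entry || now - entry.windowStart >= WINDOW_MS) {
        counters.set(ip, { windowStart: now, count: 1 });
        return true;
      }
      entry.count += 1;
      return entry.count <= MAX_REQUESTS;
    }

    // Usage inside whatever HTTP layer you have:
    // if (!allowRequest(request.ip)) { respond with 429 or delay the response }

Throttling (delaying) rather than rejecting, as the question suggests, is just a matter of what you do when allowRequest returns false.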
I believe all you can do is put up hoops for the user to jump through. Ultimately there is no foolproof way to distinguish a regular user from a bot.