There is a great pattern for handling searches that return few or no results because the user has constrained the search with too many filters. It involves showing each component of the search query with the number of hits for that component alone, or for each combination of components.
For example, if I were searching a database of music and built my search query from the following criteria:
Producer - Martin Hannett
Genre - Electronica
Year - 1977
I would get no matches, as the three criteria never coincide. However, removing a component does allow for matches.
So rather than just displaying 'No Results', a much better handling of this situation would be to present the number of hits for each component, or combination of components, of the query:
Martin Hannett | Electronica (2 Results)
Electronica | 1977 (33 Results)
Given that a search might have multiple components, how would this be done efficiently in terms of queries? To get numbers for each component, a separate query would need to be performed for each one, which seems inefficient. I'm using Rails with Postgres and pg_search, but I think this question is much more general.
You can put a key-value store such as Redis in front of your PostgreSQL database and return the counts of search results from Redis. Redis is optimized for fast random reads/writes.
Ryan Bates did a RailsCasts episode on autocomplete search that avoids multiple queries to the main database. This case is similar.
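Here is a minimal sketch of that idea, assuming the redis gem and a hypothetical Record model standing in for whatever pg_search is querying:

require "redis"

REDIS = Redis.new

# Cache the hit count for each single component (or combination) in
# Redis so the "no results" page can display the counts without
# running every count query against PostgreSQL on each request.
def component_count(field, value)
  key = "search:count:#{field}:#{value}"
  cached = REDIS.get(key)
  return cached.to_i if cached

  count = Record.where(field => value).count # falls through to PostgreSQL
  REDIS.setex(key, 600, count)               # expire after 10 minutes
  count
end

# e.g. component_count(:producer, "Martin Hannett")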
I am using Neo4j Community Edition embedded in a Java application for recommendation purposes. I made a custom function which contains complex logic for comparing two entities, namely products and users. Both entities are present as nodes in the graph and have more than 20 properties each for comparison purposes. For example, I am calling this function in the following format:
MATCH (e:User {user_id: "some-id"}) WITH e
MATCH (f:Product {product_id: "some-id"}) WITH e, f
RETURN e, f, findComparisonValue(e, f) AS pref_value;
This function call takes about 4-5 ms to run on average. Now, to recommend the best product to a particular user, I wrote a Cypher query which iterates over all products, calculates the pref_value and ranks them. My Cypher query looks like this:
MATCH (source:User) WHERE id(source) = {id}
WITH source
MATCH (reco:Product) WHERE reco.is_active = 't'
WITH reco, source, findComparisonValue(source, reco) AS score_result
RETURN DISTINCT reco, score_result.score AS score, score_result.params AS params, score_result.matched_keywords AS matched_keywords
ORDER BY score DESC
Some insights on graph structure:
Total Number of nodes: 2 million
Total Number of relationships: 20 million
Total Number of Users: 0.2 million
Total Number of Products: 1.8 million
The above Cypher query is taking more than 10 seconds, as it iterates over all the products. On top of this query, I am using the graphaware-reco module for my recommendation needs (using precompute, filtering, post-processing, etc.). I thought of parallelising this, but Community Edition does not support clustering. Now, as the number of users in the system is increasing day by day, I need to think of a scalable solution.
Can anyone help me out here with how to optimize the query?
As others have commented, doing a significant calculation potentially millions of times in a single query is going to be slow, and does not take advantage of Neo4j's strengths. You should investigate modifying your data model and calculation so that you can leverage relationships and/or indexes.
In the meantime, there are a number of things to suggest with your second query:
Make sure you have created an index for :Product(is_active), so that it is not necessary to scan all products; see the statement after this list. (By the way, if that property is actually supposed to be a boolean, then consider making it a boolean rather than a string.)
The RETURN clause should not need the DISTINCT operator: every reco value is already distinct, so all the result rows should be distinct anyway. Removing that keyword should improve performance.
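For reference, the index can be created with a single Cypher statement (schema index syntax for Neo4j 2.x/3.x):

CREATE INDEX ON :Product(is_active)

With both suggestions applied, the second query would look like this:

MATCH (source:User) WHERE id(source) = {id}
WITH source
MATCH (reco:Product) WHERE reco.is_active = 't'
WITH reco, source, findComparisonValue(source, reco) AS score_result
RETURN reco, score_result.score AS score, score_result.params AS params, score_result.matched_keywords AS matched_keywords
ORDER BY score DESC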
I am working on a model using Wikipedia topic names for my experiments with full-text indexing.
I set up an index on 'topic' (legacy), and do a full-text search for 'united states':
start n=node:topic('name:(united states)') return n
The first results are not relevant at all:
'List of United States National Historic Landmarks in United States commonwealths and territories, associated states, and foreign states'
[...]
and the actual 'united states' is buried deep down the list.
As such, it raises the problem that, in order to find the best match among the results (e.g. with Levenshtein, bi-gram, and similar algorithms), you first must fetch all the items matching the pattern.
That would be a serious constraint, because just in this case I have 21K rows, taking ~4 seconds.
Which algorithm does Neo4j use to order the results of a full-text search (START)?
What rationale does it use to sort results, and how can I change it using Cypher?
The docs say to use the Java API to apply sort(); it would be very useful to have a tutorial pointing out which files to modify, and also to know which ranking rationale is used before any tweaking.
EDITED based on comments below - pagination of results is possible as:
start n=node:topic('name:(united states)') return n skip 10 limit 50;
(skip before limit), but I need to ensure the first results are meaningful before paginating.
I don't know which algorithm Lucene uses to order the results.
However, regarding pagination, if you change the order of limit and skip as follows, it should be fine.
start n=node:topic('name:(united states)') return n skip 10 limit 50 ;
I would also add that if you are performing full-text search, a solution like Solr may be more appropriate.
For just a lucene index lookup with scoring you might be better off with this:
http://neo4j.com/docs/stable/rest-api-indexes.html#rest-api-find-node-by-query
There are several possible ways I can think of to store and then query temporal data in Neo4j. Looking at an example of being able to search for recurring events and any exceptions, I can see two possibilities:
One easy option would be to create a node for each occurrence of the event. Whilst it would be easy to construct a Cypher query to find all events on a day, in a range, etc., this could create a lot of unnecessary nodes. On the other hand, it would make it very easy to change individual events' times, locations, etc., because there is already a node with the basic information.
The second option is to store the recurrence pattern as a property of the event node. This would greatly reduce the number of nodes within the graph. When searching for events on a specific date or within a range, all nodes that meet the start/end date (plus any other) criteria could be returned to the client. It then boils down to iterating through the results to pluck out the subset whose temporal pattern gives a date within the search range, then comparing that to any exceptions and merging (or ignoring) the results as necessary. Part of this could probably be achieved while pulling the initial result set as part of the query.
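A minimal sketch of that first pass in Cypher, assuming hypothetical start_date/end_date properties stored as sortable values:

MATCH (e:Event)
WHERE e.start_date <= {range_end} AND e.end_date >= {range_start}
RETURN e

The client would then expand each returned node's recurrence pattern and apply any exceptions.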
Whilst the second option is the one I would currently choose, it seems quite inefficient in that it processes the data twice, albeit on a smaller subset the second time. Even a plugin to Neo4j would probably result in two passes through the data, although the processing would be done on the database server rather than the requesting client.
What I would like to know is whether it is possible to use Cypher or Neo4j to do this processing as part of the initial query?
Whilst I'm not 100% sure I understand your requirement, I'd have a look at this blog post; perhaps you'll find a bit of inspiration there: http://graphaware.com/neo4j/2014/08/20/graphaware-neo4j-timetree.html
Hello Stack Overflow folks,
We have a Rails project which is growing and growing, and we are now hitting our first performance problems with search, because we don't know how to utilize Sphinx properly for our needs.
We have search queries like "Java PHP Software developer". Our problem is that the ranking should take multiple factors into account.
As search fields we have tag list, description and title.
If one of the terms is inside one of the fields, it should get, for example, 2 points; more points if it's in more fields, but not multiple points if it is in the same field more than once.
The next problem is that I have a big file of synonyms which should also be checked. It looks like this:
Java > Java
Java-EE > Java
...
So if Java-EE is found, it should get some points too, but with a penalty for being a synonym.
The maximum number of points would be 5, as in the 5 stars which get displayed.
Any speedy solution would be nice, because at the moment it's done in plain Ruby and it gets slow, since we can't rank properly in Sphinx.
A solution with another search engine would also be very nice, as it could be changed.
Thanks in advance for all efforts. All spelling corrections and clarifying questions are welcome.
Most of the performance issues can be solved by changing the way you use Sphinx. First you need to address how you index the data in Sphinx; doing some processing while indexing will make the search quicker and the results more relevant. Second, tackle the search terms, and last but not least, decide on the ranking algorithm to use.
I am going to use the "title" field as an example, but the logic can be replicated for all fields.
Indexing
Add two fields to Sphinx ("title" and "title_synonyms"). For each record in the database do the following :-
Perform a DISTINCT on the words to remove duplicates ("Ruby Developer / Java Developer" will become "Ruby Developer / Java"). This will stop records from getting two scores for duplicates when searching. This goes into "title".
Take the DISTINCT title from above and REPLACE all the words with their expanded synonym equivalents; I would suggest putting the synonyms in the DB to make the expansion easier. The text would then become "Ruby Developer / Java-EE". Each word must be replaced with all its synonyms: if Java has two synonyms, both must be in the field. This goes into "title_synonyms" (see the sketch after this list).
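A minimal Ruby sketch of both steps, assuming the "variant > canonical" file has been inverted into a hypothetical VARIANTS hash (canonical word => its variants):

VARIANTS = { "Java" => ["Java-EE"] }

# Step 1: de-duplicate the words for the "title" field.
def distinct_words(text)
  text.split.uniq.join(" ")
end

# Step 2: swap each word for its variants for the "title_synonyms" field.
def synonym_field(text)
  text.split.flat_map { |word| VARIANTS.fetch(word, [word]) }.join(" ")
end

title          = distinct_words("Ruby Developer / Java Developer") # => "Ruby Developer / Java"
title_synonyms = synonym_field(title)                              # => "Ruby Developer / Java-EE"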
Searching
Because there are now two fields in Sphinx, we can give each a different weight: "title" can get a weight of 10 and "title_synonyms" a weight of 3. That means a record has to match 4 synonyms before it ranks higher than one matching the original title. You can play around with the weights to suit your needs.
Let's assume a user was searching for "Java Developer". For the search phrase do the following :-
Remove duplicate words
Get synonyms for each word in the search phrase
Set Matching Mode in Sphinx to SPH_MATCH_EXTENDED
The above rules mean the search in Sphinx looks like this :-
@title "Java Developer" | @title_synonyms "Java-EE"
If you want to rank exact matches higher than lexemes, the search query would look like this :-
#title ("Java Developer" | "=Java =Developer") | #title_synonyms ("Java-EE" | "=Java-EE")
You will need to use SPH_RANK_PROXIMITY_BM25 or SPH_RANK_SPH04 to make this work properly though.
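In a Rails application this might be wired up roughly as follows; this is a hedged sketch assuming Thinking Sphinx (the question does not say which adapter is in use, and option names vary by version), with a hypothetical Job model:

results = Job.search(
  '@title ("Java Developer" | "=Java =Developer") | @title_synonyms ("Java-EE")',
  :match_mode    => :extended,          # enable the extended query syntax above
  :rank_mode     => :proximity_bm25,    # or :sph04 / a custom expression
  :field_weights => { :title => 10, :title_synonyms => 3 }
)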
Ranking
You can try any of the built-in ranking algorithms to see what the results look like. I recommend SPH_RANK_MATCHANY or SPH_RANK_WORDCOUNT as a start.
For proximity and exact-match ranking, use SPH_RANK_PROXIMITY_BM25, SPH_RANK_SPH04 or SPH_RANK_EXPR, where you can plug in your own algorithm.
Conclusion
You should now have a search that is both fast and accurate. Very little work has to be done by your Ruby application, and most of the work is done inside Sphinx (where it should be).
Hope this helps...
This performance problem is an algorithm problem.
If you cannot express the problem in a way that utilizes a backend tool, like Sphinx or the database engine, then you end up doing the processing in Ruby, and that makes it easy to run into a performance problem.
First, do as much as you can with Sphinx (or whatever other search engine) and the database. The more pre-digested the data coming into Ruby, the less you have to do in Ruby code, and that will likely be faster, since databases have been highly optimized over the last half century.
So, for example, run Sphinx on the keywords. Also run Sphinx on the synonyms. Limit both answers to the top results, and merge the results, as sketched below. That way your Ruby code will be limited to the likely high results instead of having to consider the whole database of entries.
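A hedged sketch of that merge step, with a hypothetical Job model and Thinking Sphinx-style pagination options:

keyword_hits = Job.search(keywords, :per_page => 200) # top keyword matches
synonym_hits = Job.search(synonyms, :per_page => 200) # top synonym matches
candidates   = (keyword_hits.to_a + synonym_hits.to_a).uniq(&:id)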
Once in Ruby, the most important thing is to avoid high-order algorithms; that is, make sure you are using a low-order algorithm.
As you process your raw data, if you hold your top results in an array and try to sort or scan the array, you are going to have an N-squared order: your cost will be the product of the number of raw entries and the number of elements you keep in your array.
The best algorithms for your problem are a priority queue implemented by a heap-like container, or a B-tree. Both give you an order of N times log M: the number of raw data records times the log of the number of items you will keep in your container.
A heap is a binary tree where each node in the tree (not just the leaves, but every node) holds a rated record, and the nodes below each record all have lower ranks (or all higher ranks, if the heap is inverted). This is called the heap condition.
There are algorithms for adding elements, taking the top-ranked element out, and replacing the lowest-ranked element, all of which maintain the heap condition. Look up binary heap on Wikipedia.
Let's say your site will display the top 100 ranked results. Maintain a heap where the root is the lowest ranked. Populate the heap by adding the first 100 raw records you process.
Now, for record 101 and after, compare its rank with the root. If the new record is ranked higher, use the delete algorithm to reduce your heap to 99 nodes (which will remove the lowest-ranked record in the heap) and add your new record to the heap.
Once you have gone through all your records, you will have the top 100 ranked results. The heap delete algorithm will pull them out in reverse order.
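Here is a minimal sketch of that top-100 pass in Ruby, assuming each record responds to a hypothetical rank method and records is whatever enumerable yields the raw entries:

class MinHeap
  # Binary min-heap keyed on each record's rank, so the root is always
  # the lowest-ranked record currently kept.
  def initialize
    @a = []
  end

  def size
    @a.size
  end

  def min
    @a.first
  end

  def push(rec)
    @a << rec
    i = @a.size - 1
    while i > 0
      parent = (i - 1) / 2
      break if @a[parent].rank <= @a[i].rank
      @a[parent], @a[i] = @a[i], @a[parent]
      i = parent
    end
  end

  def pop
    top = @a.first
    last = @a.pop
    unless @a.empty?
      @a[0] = last
      i = 0
      loop do
        left = 2 * i + 1
        right = 2 * i + 2
        smallest = i
        smallest = left if left < @a.size && @a[left].rank < @a[smallest].rank
        smallest = right if right < @a.size && @a[right].rank < @a[smallest].rank
        break if smallest == i
        @a[smallest], @a[i] = @a[i], @a[smallest]
        i = smallest
      end
    end
    top
  end
end

# Keep only the 100 highest-ranked records, in O(N log 100):
heap = MinHeap.new
records.each do |rec|
  if heap.size < 100
    heap.push(rec)
  elsif rec.rank > heap.min.rank
    heap.pop        # drop the lowest-ranked of the kept 100
    heap.push(rec)
  end
end
# Popping everything now yields the top 100 in ascending rank order.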
I am currently using Solr 1.4 (soon to upgrade to 3.3). The friendship table is pretty standard:
id | follower_id | user_id
I would like to perform a regular keyword Solr search and order the results by degrees of separation as well as by the standard score. Within the result set, if the keyword matched any of my immediate friends, they would show up first; second would be friends of my friends, and third, friends at the 3rd degree of separation. All other results would come after.
I am pretty sure Solr doesn't offer any 'pre-baked' way of doing this, so I would likely have to do a join in MySQL to properly order the results. I'm curious whether anyone has done this before and/or has any insights.
It's simply not possible in Solr. However, if you aren't too restricted and could use another platform for this, consider Neo4j.
These "connections" and degrees of separation are exactly where Neo4j steps in.
http://neo4j.org/
One way might be to create fields like degree_1, degree_2, etc., and store the list of friends at degree x in the field degree_x. Then you could fire multiple queries: the first restricting the results to those who have you in degree_1, the second restricting the results to those who have you in degree_2, and so on.
It is a bit complicated, but it is the only solution I could think of using Solr.
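For illustration, with a hypothetical user ID, the first two of those queries might look like this using Solr's q and fq parameters:

q=java&fq=degree_1:user42
q=java&fq=degree_2:user42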
I haven't represented a graph in Solr before, but I think at a high level this is what you could do. First, represent people as nodes and the social network as a graph in the database. Implement a transitive closure function in SQL to allow you to walk the graph. Then you would index the result into Solr with the social network info stored in payloads, for example.
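Walking the graph could also be done in Ruby at index time; a hedged sketch, assuming a hypothetical Friendship ActiveRecord model over the id | follower_id | user_id table and a reasonably modern Rails:

# Breadth-first walk over the friendship table, grouping friend IDs by
# degree of separation (up to 3), e.g. to fill degree_x fields/payloads.
def friends_by_degree(user_id, max_degree = 3)
  seen = { user_id => 0 }
  frontier = [user_id]
  degrees = Hash.new { |hash, key| hash[key] = [] }
  (1..max_degree).each do |degree|
    next_ids = Friendship.where(follower_id: frontier).pluck(:user_id)
    frontier = next_ids.uniq.reject { |id| seen.key?(id) }
    frontier.each do |id|
      seen[id] = degree
      degrees[degree] << id
    end
  end
  degrees
end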
I was able to achieve this by performing multiple queries, using the "with" scope to restrict results to the IDs of colleagues (and of 2nd- and 3rd-degree colleagues), with MySQL doing the select for those IDs.
@search_1 = perform_search(1, options)
@search_2 = perform_search(2, options)

# inside perform_search's Sunspot search block, scoped by degree:
if degree == 1
  with(:id).any_of(options[:colleague_ids])
elsif degree == 2
  with(:id).any_of(options[:second_degree_colleagues])
end
It's kind of a dirty solution, as I have to perform multiple Solr queries, but until I can use dynamic field sorting options (Solr 3.3, not currently supported by Sunspot) I really don't know any other way to achieve this.