solr join vs lucene join - join

I am trying to find how Solr join compares with respect to the Lucene joins. Specifically, if Lucene joins uses any filter cache during the JOIN operation. I looked into code and it seems that in the QParser there is a reference to cache, but I am not sure if it's a filter cache. If somebody has any experience on this, please do share, or please tell me how can I find that.
The Solr join wiki states
Fields or other properties of the documents being joined "from" are not available for use in processing of the resulting set of "to" documents (ie: you can not return fields in the "from" documents as if they were a multivalued field on the "to" documents).
I am finding it hard to understand the above limitation of solr join,does it means that unlike the traditional RDMS joins that can have columns from both the TO and FROM field, solr joins will only have fields from the TO documents ? Is my understanding correct ? If yes, then why this limitation ?
Also, there's some difference with respect to scoring too and towards that the wiki says
The Join query produces constant scores for all documents that match -- scores computed by the nested query for the "from" documents are not available to use in scoring the "to" documents
Does it mean the subquery's score is not available the main query? If so again why solr scoring took this approach ?
If there are any other differences that are worth considering when moving from Lucene join to Solr, please share.

this post is quite old, but I jump on it. Sorry if it's not active any more.
To tell the truth, it's far better to avoid the join strategy on solr/lucene. You have to think as object as a whole, joining is much an SQL approch that is not close to the phylosophy of SOLR.
Despite that, solr enables very limited joins operations. Take a look to this very good reference join solr lucene! And also this document about the block join support in solr

Related

How do a general search across string properties in my nodes?

Working with Neo4j in a Rails app.
I have nodes with several string properties containing long strings of user generated content. For example in my nodes of type: "Book", I might have properties, "review", and "summary", which would contain long-form string values.
I was trying to design queries that returned nodes which match those properties to general language search terms provided by a user in a search box. As my query got increasingly complicated, it occurred to me that I was trying to resolve natural language search.
I looked into some of the popular search gems in Rails, but they all seem to depend on ActiveRecord. What search solutions exist for Neo4j.rb?
There are a few ways that you could go about this!
As FrobberOfBits said, Neo4j has what are called "legacy indexes" which use Lucene it the background to provide indexing of generic things. It does support the new schema indexes. Unfortunately those are based on exact matches (though I'm pretty sure that will change in Neo4j 2.3.x somewhat).
Neo4j does support pattern matching on strings via the =~ operator, but those queries aren't indexed. So the performance depends on the size of your database.
We often recommend a gem called searchkick which lets you define indexes for Elasticsearch in your models. Then you can just call a Model.search method to do your searches and it will first query elasticsearch to get the node IDs and then load those nodes via Neo4j.rb. You can use that via the neo4j-searchkick gem: https://github.com/neo4jrb/neo4j-searchkick
Lastly, if you're doing NLP and are trying to extract important words from your text, you could create a Tag/Word label and create relationships from your nodes to these NLP extracted nodes so that you can search based on those nodes in the future. You could even build recommendations from one text node to another based on the number/type of common tag nodes.
I don't know if anything specific exists for neo4j.rb and activerecord. What I can say is that generally this stuff is handled through the use of legacy indexes that are implemented by Lucene.
The premise is that you create a lucene-managed index on certain properties, and that then gives you access to use the Lucene query language via cypher to get data from those indices. Relative to neo4j.rb, it doesn't look any different than running cypher queries, like this:
START item=node:node_auto_index("(title:'foo bar' AND body:baz*) OR title:'bat'")
RETURN item
Note that lucene indexes and that query language can only be used in a START block, not a MATCH block. Refer to the Lucene Query Syntax to discover more about what you can do with that query syntax (fuzzy matching, wildcards, etc -- quite a bit more extensive than what regex would give you).

How to model a relational database into a neo4j graph database?

I have a relational database (about 30 tables) and I would like to transpose it in a neo4j graph database, and I don't know where to start...
Is there a general way to transpose tables and/or tuples into a graph model ? (relations properties, one or more graphs ?) What are the best sources of documentation ?
Thanks for any help,
Best regards
First, if at all possible, I'd suggest NOT using your relational DB as your "reference" for transposing to a graph model. All too often, mistakes and pitfalls from relational modelling get transferred over to the graph model and introduce other oddities. In fact, if you have a source ER diagram, that might be an even better starting point as it's really already a graph. And maybe even consider a re-modelling exercise for your domain!
That said, from a basic point of view, you can think of most tables as representing a node type (e.g. "User" or "Movie") with join tables and keys representing relationship types.
A great starting point, from my perspective anyway, is to determine some questions your graph/data source should answer. Write those questions down, and try to come up with Cypher queries that represent the questions. Often times, a graph model naturally arises from such an effort, and it's really not that difficult.
If you haven't already, I'd strongly recommend picking up a (free) copy of the Graph Databases ebook from here: http://graphdatabases.com/
It's jam-packed with a lot of good info on where to start with modelling your domain and even things to consider when you're used to doing things in a relational manner. It also contains some material on Cypher, although the Neo4j site (neo4j.org) has a reference manual with plenty of up-to-date info on Cypher.
Hope this helps!
There's not going to be a one-stop-shop for this kind of conversion, as not all data models are appropriate for graph modeling, and every application is a unique special snowflake...but with that said.....
Generally, your 'base' tables (e.g. User, Role, Order, Product) would become nodes, and your 'join tables' (a.k.a. buster tables) would be candidates for your relationships (e.g. UserRole, OrderLineItem). The key thing to remember that in a graph, generally, you can only have one relationship of a given type between two specific nodes - so in the above example, if your system allows the same product to be in an order twice - it would cause issues.
Foreign keys are your second source of relationships, look to them to see if it makes sense to be a relationship or just a property.
Just keep in mind what you are trying to solve by your data model - if it's traversing your objects to find relationships and distance, etc... then graphs may be a good fit. If you are modeling an eCommerce app, where you are dealing with manipulating a single nested object (e.g. order -> line item -> product -> sku), then a relational model may be the right fit.
Hope my $0.02 helps...
As has been already said, there is no magical transformation from a relational database model to a graph database model.
You should look for the original entities and how they are related in order to find your nodes, properties and relations. And always keeping in mind what type of queries you are going to perform.
As BtySgtMajor said, "Graph Databases" is a good book to start, and it is free.

In Lucene/Solr what is the difference between Join and BlockJoin?

Join is described as pseudo-Join, because it's more equivalent to an SQL inner-query.
Whereas BlockJoin is described as more like a SQL join but requiring a sophisticated indexing schema, one that anticipates all the possible joins you'd want to make.
Could someone explain the difference between these features in terms of how to implement them at index time and query time. And what are the implications for performance?
I don't think blockjoinquery is a Solr function. I think its Lucene feature.
The solr join doesn't score documents in the from query and it doesn't return combined results. So its best used as a filter query. This will allow the main query.to score.
Block join on the other hand does use scoring and returns both results.( not 100% sure)
You can also use querytime join. This has serval scoring options. This is also a lucene feature but doesn't require special indexing blocks. I've used this in combination with a solr query parser plugin. The performance is a bit lower then blockjoin but it Works.
I have only used solr join and querytimejoin So I can't really say much about blockjoin.
As I understand, BlockJoin is for joining against nested/child documents within the same core. Join is for joining against a separate core.

SOLR : joining from separate indices

Is there a good way to join data from SOLR indices ? Of course, I assume server side support for this is limited, but i want to do it client side (right now, im manually doing this in java, using hashmaps and loops .... Im assuming there might be a better way to combine data from different indices).
If with join you mean relational joining a la SQL, then no.
If with join you mean merging then server side support is far from limited.
What you are looking for is index sharding.
This is not "fast" since searches are distributed and then merged, but it scales really well.
Give a read to the following articles:
http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding
http://wiki.apache.org/solr/DistributedSearch

Order Solr results by degrees of friendship

I am currently using Solr 1.4 (soon to upgrade to 3.3). The friendship table is pretty standard:
id | follower_id | user_id
I would like to perform a regular keyword solr search and order the results by degrees of separation as well as the standard score ordering. From the result set, given the keyword matched any of my immediate friends, they would show up first. Secondly would be the friends of my friends, and thirdly friends by 3rd degree of separation. All other results would come after.
I am pretty sure Solr doesn't offer any 'pre-baked' way of doing this therefore I would likely have to do a join on MySQL to properly order the results. Curious if anyone has done this before and/or has some insights.
It's simply not possible in Solr. However, if you aren't too restricted and could use another platform for this, consider neo4j?
This "connections" and degrees is exactly where Neo4j steps in.
http://neo4j.org/
One way might be to create fields like degree_1, degree_2 etc. and store the list of friends at degree x in the field degree_x. Then you could fire multiple queries - the first restricting the results to those who have you in degree_1, the second restricting the results to those who have you in degree_2 and so on.
It is a bit complicated, but the only solution I could think of using Solr.
I haven't represented a graph in solr before, but I think at a high level, this is what you could do. First, represent people as nodes and the social network as a graph in the database. Implement transitive closure function in sql to allow you to walk the graph. Then you would index the result into solr with the social network info stored into payloads, for example.
I was able to achieve this by performing multiple queries and with the scope "with" to restrict to the id's of colleagues, 2nd and 3rd degree colleagues, using the id's and using mysql to do the select.
#search_1 = perform_search(1, options)
#search_2 = perform_search(2, options)
if degree == 1
with(:id).any_of(options[:colleague_ids])
elsif degree == 2
with(:id).any_of(options[:second_degree_colleagues])
end
It's kinda of a dirty solution as I have to perform multiple solr queries, but until I can use dynamic field sorting options (solr 3.3, not currently supported by sunspot) I really don't know any other way to achieve this.

Resources