In Lucene/Solr what is the difference between Join and BlockJoin? - join

Join is described as pseudo-Join, because it's more equivalent to an SQL inner-query.
Whereas BlockJoin is described as more like a SQL join but requiring a sophisticated indexing schema, one that anticipates all the possible joins you'd want to make.
Could someone explain the difference between these features in terms of how to implement them at index time and query time. And what are the implications for performance?

I don't think blockjoinquery is a Solr function. I think its Lucene feature.
The solr join doesn't score documents in the from query and it doesn't return combined results. So its best used as a filter query. This will allow the main query.to score.
Block join on the other hand does use scoring and returns both results.( not 100% sure)
You can also use querytime join. This has serval scoring options. This is also a lucene feature but doesn't require special indexing blocks. I've used this in combination with a solr query parser plugin. The performance is a bit lower then blockjoin but it Works.
I have only used solr join and querytimejoin So I can't really say much about blockjoin.

As I understand, BlockJoin is for joining against nested/child documents within the same core. Join is for joining against a separate core.

Related

Is there any diferrence between includes(:associations).references(:associations) and eager_load(:associations) in Ruby on Rails 5?

It seems includes(:associations).references(:associations) and eager_load(:associations) execute exactly the same SQL (LEFT OUTER JOIN) in Rails 5. So when do I need to use includes(:associations).references(:associations) syntax?
For example,
Parent.includes(:children1, :children2).references(:children1).where('(some conditions of children1)')
can be converted to
Parent.eager_load(:children1).preload(:children2).where('(some conditions of children1)')
I think the latter (query using eager_load and preload) is simpler and looks better.
UPDATE
I found a strange behavior in my environment (rails 5.2.4.3).
Even when I includes several associations and references only one of them, all the associations I included are LEFT OUTER JOINed.
For example,
Parent.includes(:c1, :c2, :c3).references(:c1).to_sql
executes a SQL which LEFT OUTER JOINs all of c1, c2, c3.
I thought it joins only c1.
Indeed, includes + references ends up being the same as eager_load. Like many things in Rails, you have a few ways of accomplishing the same result, and here you are witnessing that first hand. If I were writing them in a single statement, I would always prefer eager_load because it is more explicit and it is a single function call.
I would also prefer eager_load because I consider references a kind of hack. It says to the SQL generator "Hey, I am referring to this object in a way that you would not otherwise detect, so include it in the JOIN statement" and is generally used when you use a String to pass a SQL fragment as part of the query.
The only time I would use the includes(:associations).references(:associations) syntax is when it was an artifact needed to make the query work and not a statement of intent. The Rails Guide gives this good example:
Article.includes(:comments).where("comments.visible = true").references(:comments)
As for why referencing 1 association causes 3 to be JOIN'ed, I do not know for sure. The code for includes uses heuristics to decide when it is faster to use a JOIN and when it is faster to use 2 separate queries, the first to retrieve the parent and the second to retrieve the associated objects. I was surprised to find how often it is faster to use 2 separate queries. It may be that since the query has to use 1 join anyway, the algorithm figures it will be faster to use 1 big join rather than 3 queries, or it may in general think 1 join is faster than 4 queries.
I would not in general use preload unless I had a strong reason to believe that it was faster than join. I would just use includes alone and let the algorithm decide.

graphQL join on common column

how we can do data join on a common column in graphQL.
For example in SQL : Select t.name and z.address where t.id=z.id;
how is this managed by graphQL query.
How is this managed by GraphQL query?
The short answer: It's not (managed). You have to write it in your resolvers.
Your GraphQL schema is basically just a declaration of types that represent nodes, and fields that represent edges in a graph. GraphQL is the language to query this graph. At each edge in your schema, you write a function for how to get the data when the query traverses that edge. This function is called a "resolver", and you can put pretty much any code in your resolvers as long as it returns valid data. This means GraphQL is completely, absolutely database-independent. Your resolvers are responsible for talking to your database(s).
So when it comes to SQL, and joins, the concern of making the queries is entirely the responsibility of the API developer. GraphQL knows nothing about SQL or joins, so this is essentially custom logic for your application which you must write. It turns out that building SQL queries to resolve data for GraphQL queries is very difficult without running into problems like eagerly fetching too much data or the "N+1" problem.
Some tools in the open-source community have emerged to help solve this problem. Here are a couple I would recommend in the Node.js space:
DataLoader - A database-agnostic tool that batches queries and caches individual records.
Join Monster - A SQL-tailored tool for efficient data-retrieval and query batching. It examines each query and your schema and generates SQL queries dynamically.
Disclaimer: I'm a co-creator of Join Monster.

solr join vs lucene join

I am trying to find how Solr join compares with respect to the Lucene joins. Specifically, if Lucene joins uses any filter cache during the JOIN operation. I looked into code and it seems that in the QParser there is a reference to cache, but I am not sure if it's a filter cache. If somebody has any experience on this, please do share, or please tell me how can I find that.
The Solr join wiki states
Fields or other properties of the documents being joined "from" are not available for use in processing of the resulting set of "to" documents (ie: you can not return fields in the "from" documents as if they were a multivalued field on the "to" documents).
I am finding it hard to understand the above limitation of solr join,does it means that unlike the traditional RDMS joins that can have columns from both the TO and FROM field, solr joins will only have fields from the TO documents ? Is my understanding correct ? If yes, then why this limitation ?
Also, there's some difference with respect to scoring too and towards that the wiki says
The Join query produces constant scores for all documents that match -- scores computed by the nested query for the "from" documents are not available to use in scoring the "to" documents
Does it mean the subquery's score is not available the main query? If so again why solr scoring took this approach ?
If there are any other differences that are worth considering when moving from Lucene join to Solr, please share.
this post is quite old, but I jump on it. Sorry if it's not active any more.
To tell the truth, it's far better to avoid the join strategy on solr/lucene. You have to think as object as a whole, joining is much an SQL approch that is not close to the phylosophy of SOLR.
Despite that, solr enables very limited joins operations. Take a look to this very good reference join solr lucene! And also this document about the block join support in solr

What is the default MapReduce join used by Apache Hive?

What is the default MapReduce join algorithm implemented by Hive? Is it a Map-Side Join, Reduce-Side, Broadcast-Join, etc.?
It is not specified in the original paper nor the Hive wiki on joins:
http://cs.brown.edu/courses/cs227/papers/hive.pdf
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
The 'default' join would be the shuffle join, aka. as common-join. See JoinOperator.java. It relies on M/R shuffle to partition the data and the join is done on the Reduce side. As is a size-of-data copy during the shuffle, it is slow.
A much better option is the MapJoin, see MapJoinOpertator.java. This works if you have only one big table and one or more small tables to join against (eg. typical star schema). The small tables are scanned first, a hash table is built and uploaded into the HDFS cache and then the M/R job is launched which only needs to split one table (the big table). Is much more efficient than shuffle join, but requires the small table(s) to fit in memory of the M/R map tasks. Normally Hive (at least since 0.11) will try to use MapJoin, but it depends on your configs.
A specialized join is the bucket-sort-merge join, aka. SMBJoin, see SMBJoinOperator.java. This works if you have 2 big tables that match the bucketing on the join key. The M/R job splits then can be arranged so that a map task gest only splits form the two big tables that are guaranteed to over overlap on the join key so the map task can use a hash table to do the join.
There are more details, like skew join support and fallback in out-of-memory conditions, but this should probably get you started into investigating your needs.
A very good presentation on the subject of joins is Join Strategies in Hive. Keep in mind that things evolve fast an a presentaiton from 2011 is a bit outdated.
Do an explain on the Hive query and you can see the execution plan.

SOLR : joining from separate indices

Is there a good way to join data from SOLR indices ? Of course, I assume server side support for this is limited, but i want to do it client side (right now, im manually doing this in java, using hashmaps and loops .... Im assuming there might be a better way to combine data from different indices).
If with join you mean relational joining a la SQL, then no.
If with join you mean merging then server side support is far from limited.
What you are looking for is index sharding.
This is not "fast" since searches are distributed and then merged, but it scales really well.
Give a read to the following articles:
http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding
http://wiki.apache.org/solr/DistributedSearch

Resources