I am trying to join three Cassandra tables using Solr. According to the DataStax documentation:
DataStax Enterprise 4.0.2 and later supports the OS Solr query time
join through a custom implementation. You can join Solr documents,
including those having different Solr cores under these conditions:
Solr cores need to have the same keyspace and the same Cassandra partition key.
Both Cassandra tables that back the Solr cores to be joined must be either Thrift-compatible or CQL-compatible. You cannot have one that is Thrift-compatible and one that is CQL-compatible.
The type of the unique key (the Cassandra key validator of the partition key) must be the same.
The order of the table partition keys and the schema unique keys must be the same.
I could join two tables as per the documentation. I was wondering whether I can join three tables together while satisfying these conditions. I couldn't find any documentation on joining more than two tables, and I really need to join three. Is it possible, or should I drop the idea now?
What I needed was a recursive (nested) join. The same documentation provides this example:
Use a nested join query to recursively join the songs and lyrics documents with the videos document, and to select the song that mentions love and also won a video award.
http://localhost:8983/solr/internet.songs/select/?q=
{!join+fromIndex=internet.lyrics}words:love AND _query_:
{!join+fromIndex=internet.videos}award:true&indent=true&wt=json
Output is:
{"response":{"numFound":1,"start":0,"docs":[
    {
      "song":"a3e64f8f-bd44-4f28-b8d9-6938726e34d4",
      "title":"Dangerous",
      "artist":"Big Data"}]
}}
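The `+` signs in the query above are URL-encoded spaces; when sending the nested join programmatically, it is easier to let a library do the escaping. A minimal sketch (the host, port, and core names are taken from the documentation example; the client code itself is an assumption):

```python
from urllib.parse import urlencode

# Build the nested (recursive) join query from the documentation example.
# urlencode takes care of escaping the {!join} local params and spaces.
q = ('{!join fromIndex=internet.lyrics}words:love AND '
     '_query_:{!join fromIndex=internet.videos}award:true')
params = urlencode({'q': q, 'wt': 'json', 'indent': 'true'})
url = 'http://localhost:8983/solr/internet.songs/select/?' + params
print(url)
```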
Related
Is a Sort Merge Bucket (SMB) join different from a Sort Merge Bucket Map (SMBM) join? If so, what hints should be added to enable an SMB join? How is an SMBM join superior to an SMB join?
Will "set hive.auto.convert.sortmerge.join=true" alone be sufficient for an SMB join, or should the hints below be included as well?
set hive.optimize.bucketmapjoin = true
set hive.optimize.bucketmapjoin.sortedmerge = true
The reason I ask is that the hint says bucket map join, but no MAP join is performed here. I am under the assumption that both map and reduce tasks are involved in SMB, while only map tasks are involved in SMBM. Please correct me if I am wrong.
If your table is large (as determined by "set hive.mapjoin.smalltable.filesize;"), you cannot do a map-side join. However, if your tables are bucketed and sorted and you turn on "set hive.optimize.bucketmapjoin.sortedmerge = true", then you can still do a map-side join on large tables. (Of course, you still need "set hive.optimize.bucketmapjoin = true".)
Make sure that your tables are truly bucketed and sorted on the same column; it's easy to make mistakes. To get a bucketed and sorted table, you need to:
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;
DDL script
CREATE table XXX
(
id int,
name string
)
CLUSTERED BY (id)
SORTED BY (id)
INTO XXX BUCKETS
;
INSERT OVERWRITE TABLE XXX
select * from XXX
CLUSTER BY id
;
Use describe formatted XXX and look for Num Buckets, Bucket Columns, and Sort Columns to make sure it's set up correctly.
Other requirements for the bucket join are that the two tables have:
Data bucketed on the same columns, and those columns are used in the ON clause.
A number of buckets for one table that is a multiple of the number of buckets for the other table.
If you meet all the requirements, the MAP join will be performed, and it will be lightning fast.
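The bucket-count requirement can be illustrated with a small sketch: Hive assigns a row to bucket `hash(key) % num_buckets`, so when one bucket count divides the other, every key in a big-table bucket can appear in exactly one small-table bucket. (Python's `hash` stands in for Hive's hashing here; this is an illustration, not Hive code.)

```python
# Why bucket counts must be multiples: with 4 buckets on the big table
# and 2 on the small table, every key in big-table bucket i can only
# land in small-table bucket i % 2, so each map task knows exactly
# which small-table bucket file to read alongside its split.
def bucket(key, num_buckets):
    return hash(key) % num_buckets

for k in range(100):
    big = bucket(k, 4)    # bucket in the 4-bucket table
    small = bucket(k, 2)  # bucket in the 2-bucket table
    assert small == big % 2
print("every 4-bucket assignment maps predictably onto a 2-bucket one")
```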
By the way, SMB map join is not well supported in Hive 1.x for the ORC format; you will get a NullPointerException. The bug has been fixed in 2.x.
A JOIN can be made between multiple cores (see "Is Solr 4.0 capable of using 'join' for multiple cores?"), but is it possible to JOIN two cores that are running on different ports?
For example:
Instance 1: http://example:8983/solrInd1/#/person/
Instance 2: http://example:9097/solrInd2/#/engineers/
I want to get age, qualification, etc. from the person index and engineering information from the engineers index.
Thanks
The short answer is no for Solr 4 out of the box, but this might change in the future. The longer answer is yes, by rolling your own join plugin or by performing the join on the client (as you would with MongoDB). In your example, it might be easier to host both cores in a single Solr instance.
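A client-side join of the two result sets is straightforward once both queries have returned. A minimal sketch, assuming each instance has already returned a list of documents sharing an `id` field (the field names and canned documents below are hypothetical, standing in for the responses from the two instances):

```python
# Client-side join across two Solr instances: query each instance
# separately (e.g. ports 8983 and 9097), then merge the results by a
# shared key. Canned lists stand in for the two query responses.
def client_side_join(person_docs, engineer_docs, key='id'):
    """Merge engineer fields onto person docs that share the join key."""
    by_key = {d[key]: d for d in engineer_docs}
    joined = []
    for person in person_docs:
        eng = by_key.get(person[key])
        if eng is not None:
            joined.append({**person, **eng})
    return joined

persons = [{'id': 1, 'age': 34, 'qualification': 'MSc'},
           {'id': 2, 'age': 41, 'qualification': 'BSc'}]
engineers = [{'id': 1, 'discipline': 'civil'}]
print(client_side_join(persons, engineers))
# one joined doc: id 1 with age, qualification and discipline
```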
I am trying to find out how the Solr join compares to the Lucene join. Specifically, does the Lucene join use any filter cache during the JOIN operation? I looked into the code, and it seems that in the QParser there is a reference to a cache, but I am not sure whether it's the filter cache. If somebody has experience with this, please share, or tell me how I can find out.
The Solr join wiki states
Fields or other properties of the documents being joined "from" are not available for use in processing of the resulting set of "to" documents (ie: you can not return fields in the "from" documents as if they were a multivalued field on the "to" documents).
I am finding it hard to understand the above limitation of the Solr join. Does it mean that, unlike traditional RDBMS joins that can return columns from both the TO and FROM sides, a Solr join will only return fields from the TO documents? Is my understanding correct? If so, why does this limitation exist?
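In other words, my understanding is that the join behaves like the following sketch: a filter on the "to" documents, with nothing from the "from" side attached (the field names here are hypothetical):

```python
# Sketch of the {!join} semantics as I understand them: the "from" docs
# only supply join values; the result set contains "to" docs unchanged,
# so no "from" fields (like 'color' below) ever come back.
def solr_style_join(from_docs, to_docs, from_field, to_field):
    join_values = {d.get(from_field) for d in from_docs}
    return [d for d in to_docs if d.get(to_field) in join_values]

from_docs = [{'some_id': 7, 'color': 'red'}]
to_docs = [{'another_id': 7, 'name': 'a'}, {'another_id': 8, 'name': 'b'}]
result = solr_style_join(from_docs, to_docs, 'some_id', 'another_id')
print(result)  # only the matching "to" doc, without the 'color' field
```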
Also, there is some difference with respect to scoring, about which the wiki says:
The Join query produces constant scores for all documents that match -- scores computed by the nested query for the "from" documents are not available to use in scoring the "to" documents
Does it mean the subquery's score is not available to the main query? If so, why did Solr scoring take this approach?
If there are any other differences that are worth considering when moving from Lucene join to Solr, please share.
This post is quite old, but I'll jump in; sorry if it's no longer active.
To tell the truth, it's far better to avoid the join strategy in Solr/Lucene. You have to think of a document as a whole object; joining is very much an SQL approach that doesn't fit the philosophy of Solr.
That said, Solr does support very limited join operations. Take a look at this very good reference on Solr/Lucene joins, and also at the documentation about block join support in Solr.
What is the default MapReduce join algorithm implemented by Hive? Is it a Map-Side Join, Reduce-Side, Broadcast-Join, etc.?
It is not specified in the original paper nor the Hive wiki on joins:
http://cs.brown.edu/courses/cs227/papers/hive.pdf
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins
The 'default' join is the shuffle join, a.k.a. the common join. See JoinOperator.java. It relies on the M/R shuffle to partition the data, and the join is done on the reduce side. Since the shuffle copies the full data set, it is slow.
A much better option is the map join, see MapJoinOperator.java. This works if you have only one big table and one or more small tables to join against (e.g. a typical star schema). The small tables are scanned first, a hash table is built and shipped to the map tasks via the distributed cache, and then the M/R job is launched, which only needs to split one table (the big table). It is much more efficient than the shuffle join, but it requires the small table(s) to fit in the memory of the M/R map tasks. Normally Hive (at least since 0.11) will try to use a map join, but it depends on your configs.
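The map-join mechanics can be sketched in a few lines, with in-memory lists standing in for the tables (this is an illustration of the hash-join idea, not Hive's implementation):

```python
# Hash (map-side) join sketch: build a hash table from the small table
# once, then stream the big table through it in a single pass -- the
# same idea as each Hive map task probing the broadcast hash table.
def map_join(big_rows, small_rows, key):
    hash_table = {}
    for row in small_rows:
        hash_table.setdefault(row[key], []).append(row)
    out = []
    for row in big_rows:  # one streaming pass over the big table
        for match in hash_table.get(row[key], []):
            out.append({**row, **match})
    return out

big = [{'k': 1, 'fact': 'x'}, {'k': 2, 'fact': 'y'}, {'k': 1, 'fact': 'z'}]
small = [{'k': 1, 'dim': 'a'}]
print(map_join(big, small, 'k'))
```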
A specialized join is the bucket sort-merge join, a.k.a. SMB join, see SMBJoinOperator.java. This works if you have two big tables whose bucketing matches on the join key. The M/R job splits can then be arranged so that a map task gets only splits from the two tables that are guaranteed to overlap on the join key, so the map task can merge the sorted buckets to do the join.
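The merge step an SMB map task performs can be sketched as follows: both bucket files arrive sorted on the join key, so one forward pass with two cursors joins them (again an illustration of the algorithm, not Hive code):

```python
# Sort-merge join sketch: both inputs are sorted on the join key, so a
# single pass with two cursors suffices -- no full hash table needed.
def sorted_merge_join(left, right, key):
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # join left[i] with the whole run of equal keys on the right
            j2 = j
            while j2 < len(right) and right[j2][key] == lk:
                out.append({**left[i], **right[j2]})
                j2 += 1
            i += 1
    return out

left = [{'k': 1, 'a': 'x'}, {'k': 2, 'a': 'y'}]
right = [{'k': 2, 'b': 'p'}, {'k': 2, 'b': 'q'}, {'k': 3, 'b': 'r'}]
print(sorted_merge_join(left, right, 'k'))
```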
There are more details, like skew-join support and fallback in out-of-memory conditions, but this should get you started on investigating your needs.
A very good presentation on the subject is Join Strategies in Hive. Keep in mind that things evolve fast, and a presentation from 2011 is a bit outdated.
Do an EXPLAIN on the Hive query and you can see the execution plan.
I'm running Solr 4 and run some join queries, for example: {!join from=some_id to=another_id}(a_id:55 AND some_type_id:3)
When I run a single instance of Solr 4 (not SolrCloud), this query returns 4 results, exactly as it should.
But when I run it on SolrCloud, with two shards and two replicas, it returns only one result, while the other 3 can be found in the index if searched directly by ID, for example.
Any ideas what is wrong and/or how to fix it?
thanks in advance!
Join works only within a shard; it will not work across shards. Most likely one shard has 3 documents matching the condition and the other shard has one. Complex joins across shards are yet to come.
If you need join as a mandatory feature, consider a single shard with multiple replicas as a workaround.
When creating shards in Solr, you can set the router to compositeId and, when indexing documents, insert an ID prefix into the ID attribute, which helps Solr select a shard for each document. In other words, all documents with the same ID prefix will be stored on a single shard. While you can't use this to tell Solr exactly which shard to use, you can mark docs that need to be stored on a single shard.
For example, if you index Posts and Comments then your ID attribute for a Post could look like POSTDATA123!Post 123, where 123 is the Post ID. When indexing a Comment that belongs to a Post with the ID 123 the ID attribute could be POSTDATA123!Comment 321, where 321 is the Comment ID. Solr will understand this prefix POSTDATA123! on both documents and will store both Post and its Comments on a single shard.
When indexing multiple Posts, Solr will still spread them evenly across the available shards, but joins will work because Comments are always stored on the same shard as their parent Post.
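The ID convention described above can be sketched in a couple of lines (the `POSTDATA123` prefix and IDs mirror the example in the text):

```python
# compositeId routing convention: everything before the '!' is the
# routing prefix, which Solr hashes to pick the shard, so documents
# sharing a prefix land on the same shard.
def composite_id(route_prefix, doc_id):
    return f"{route_prefix}!{doc_id}"

post_id = composite_id("POSTDATA123", "Post 123")
comment_id = composite_id("POSTDATA123", "Comment 321")
print(post_id)     # POSTDATA123!Post 123
print(comment_id)  # POSTDATA123!Comment 321

# Both share the routing prefix, so Solr co-locates them on one shard.
assert post_id.split("!")[0] == comment_id.split("!")[0]
```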
You can find more docs about the compositeId router here: https://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html