SOLR join query in distributed configuration - join

I'm running SOLR4 and run some join queries, for example - {!join from=some_id to=another_id}(a_id:55 AND some_type_id:3)
When I run single instance of SOLR4 (not cloud) this query returns 4 results, exactly how it is supposed to be.
But when I run it on SOLR cloud, with two shards and two replicas it returns only one result, while another 3 could be found in the index if searched directly by id, for example.
Any ideas what is wrong and/or how to fix it?
thanks in advance!

Join works only within the shard. Join will not work across the shards. I think one shard should having 3 documents matching the condition and another shard should have one. Complex joins across shards yet to come.
If you want join as mandatory feature, think about single shard with multiple replication as work around.

When creating shards in Solr you can set the router to compositeId and when indexing documents you can insert ID prefixes into the ID attribute which will help Solr to select a shard for documents. In other words, all documents with the same ID prefix will be stored on a single shard. While you can't use this to tell Solr which shard exactly to use, you can indicate docs that need to be stored on a single shard.
For example, if you index Posts and Comments then your ID attribute for a Post could look like POSTDATA123!Post 123, where 123 is the Post ID. When indexing a Comment that belongs to a Post with the ID 123 the ID attribute could be POSTDATA123!Comment 321, where 321 is the Comment ID. Solr will understand this prefix POSTDATA123! on both documents and will store both Post and its Comments on a single shard.
When indexing multiple Posts Solr will still use shards and spread your posts across available shards evenly but joins will work because Comments will be always stored on the same shard as their parent Post.
You can find more docs about the compositeId router here https://lucene.apache.org/solr/guide/6_6/shards-and-indexing-data-in-solrcloud.html

Related

Joining three tables in solr

I am trying to join three cassandra tables using Solr. According to datastax documentation ,
DataStax Enterprise 4.0.2 and later supports the OS Solr query time
join through a custom implementation. You can join Solr documents,
including those having different Solr cores under these conditions:
Solr cores need to have the same keyspace and same Cassandra partition key.
Both Cassandra tables that support the Solr cores to be joined have to be either Thrift- or CQL-compatible. You cannot have one
that is Thift-compatible and one that is CQL-compatible.
The type of the unique key (Cassandra key validator of the partition key) are the same.
The order of table partition keys and schema unique keys are the same.
I could join two tables as per the documentation. I was wondering whether I can join three tables together satisfying the conditions. Couldn't find any documentation on joining more than two tables. I desperately need to join three tables. Is it really possible or should I drop the idea right now?
What I needed was a recursive join. From the same documentation as the above example, I could find this example:
Use a nested join query to recursively join the songs and lyrics documents with the videos document, and to select the song that mentions love and also won a video award.
http://localhost:8983/solr/internet.songs/select/?q=
{!join+fromIndex=internet.lyrics}words:love AND _query_:
{!join+fromIndex=internet.videos}award:true&indent=true&wt=json
Output is:
"response":{"numFound":1,"start":0,"docs":[
{
"song":"a3e64f8f-bd44-4f28-b8d9-6938726e34d4",
"title":"Dangerous",
"artist":"Big Data"}]
}}

How to determine the Max property on a Relationship in Neo4j 2.2.3

How do you quickly get the maximum (or minimum) value for a property of all instances of a relationship? You can assume the machine I'm running this on is well within the recommended spec's for the cpu and memory size of graph and the heap size is set accordingly.
Facts:
Using Neo4j v2.2.3
Only have access to modify graph via Cypher query language which I'm hitting via PHP or in the web interfacxe--would love to avoid any solution that requires java coding.
I've got a relationship, call it likes that has a single property id that is an integer.
There's about 100 million of these relationships and growing
Every day I grab new likes from a MySQL table to add to the graph within in Neo4j
The relationship property id is actually the primary key (auto incrementing integer) from the raw MySQL table.
I only want to add new likes so before querying MySQL for the new entries I want to get the max id from the likes, so I can use it in my SQL query as SELECT * FROM likes_table WHERE id > max_neo4j_like_property_id
How can I accomplish getting the max id property from neo4j in a optimal way? Please indicate the create statement needed for any index as well as the query you'd used to get the final result.
I've tried creating an index as follows:
CREATE INDEX ON :likes(id);
After the index is online I've tried:
MATCH ()-[r:likes]-() RETURN r.i ORDER BY r.id DESC LIMIT 1
as well as:
MATCH ()-[r:likes]->() RETURN MAX(r.id)
They work but take freaking forever as the explain plan for both indicate no indexes being used.
UPDATE: Holy $?##$?!!!! It looks like the new schema indexes aren't functional for relationships even though you can create them and show them with :schema. It also looks as if there's no way with cypher directly to create Legacy Indexes which look like they might solve this issue.
If you need to query relationship properties, it is generally a sign of a model issue.
The need of this query reveals you that you would better extract these properties into a node, that you'll then be able to query faster.
I don't say it is 100% the case, but certainly 99% of the people seen so far with the same problem has been demonstrating this model concern.
What is your model right now ?
Also you don't use labels at all in your query, likes have a context bound to the nodes.

Same Cypher Query has different performance on different DBs

I have a fullDB, (a graph clustered by Country) that contains ALL countries and I have various single country test DBs that contain exactly the same schema but only for one given country.
My query's "start" node, is identified via a match on a given value for a property e.g
match (country:Country{name:"UK"})
and then proceeds to the main query defined by the variable country. So I am expecting the query times to be similar given that we are starting from the same known node and it will be traversing the same number of nodes related to it in both DBs.
But I am getting very difference performance for my query if I run it in the full DB or just a single country.
I immediately thought that I must have some kind of "Cartesian Relationship" issue going on so I profiled the query in the full DB and a single country DB but the profile is exactly the same for each step in the plan. I was assuming that the profile would reveal a marked increase in db hits at some point in the plan, but the values are the same. Am I mistaken in what profile is displaying?
Some sizing:
The fullDB would have 70k nodes, the test DB 672 nodes, the time in full db for the query to complete is 218764ms while the test db is circa 3407ms.
While writing this I realised that there will be an increase in the number of outgoing relationships on certain nodes (suppliers can supply different countries) which I think is probably the cause, but the question remains as to why I am not seeing any indication of this in the profiling.
Any thoughts welcome.
What version are you using?
Both query times are way too long for your dataset size.
So you might check your configuration / disk.
Did you create an index/constraint for :Country(name) and is that index online?
And please share your query and your query plans.

solr join vs lucene join

I am trying to find how Solr join compares with respect to the Lucene joins. Specifically, if Lucene joins uses any filter cache during the JOIN operation. I looked into code and it seems that in the QParser there is a reference to cache, but I am not sure if it's a filter cache. If somebody has any experience on this, please do share, or please tell me how can I find that.
The Solr join wiki states
Fields or other properties of the documents being joined "from" are not available for use in processing of the resulting set of "to" documents (ie: you can not return fields in the "from" documents as if they were a multivalued field on the "to" documents).
I am finding it hard to understand the above limitation of solr join,does it means that unlike the traditional RDMS joins that can have columns from both the TO and FROM field, solr joins will only have fields from the TO documents ? Is my understanding correct ? If yes, then why this limitation ?
Also, there's some difference with respect to scoring too and towards that the wiki says
The Join query produces constant scores for all documents that match -- scores computed by the nested query for the "from" documents are not available to use in scoring the "to" documents
Does it mean the subquery's score is not available the main query? If so again why solr scoring took this approach ?
If there are any other differences that are worth considering when moving from Lucene join to Solr, please share.
this post is quite old, but I jump on it. Sorry if it's not active any more.
To tell the truth, it's far better to avoid the join strategy on solr/lucene. You have to think as object as a whole, joining is much an SQL approch that is not close to the phylosophy of SOLR.
Despite that, solr enables very limited joins operations. Take a look to this very good reference join solr lucene! And also this document about the block join support in solr

Solr Join - getting data from different index

I'm working on a project where we have 2 million products and have 50 clients with different pricing scheme. Indexing 2M X 50 records is not an option at the moment. I have looked at solr join and cannot get it to work the way i want it too. I know it's like a self join so I'm kinda skeptical it would work but here it is anyway.
here is the sample schema
core0 - product
core1 - client
So given a client id, i wanted to display all bags manufactured by Samsonite sorted by lowest price.
If there's a better way of approaching this, I'm open to redesigning exciting schema.
Thank you in advance.
Solr is not a relational database. You should give a look at the sharding feature and split your indexes. Also, you could write your custom plugins to elaborate the price data based on client's id/name/whatever at index time (BAD you'll still get a product replicated for each client).
How we do (so you can get an example):
clients are handled by sqlite
products are stored in solr with their "base" price
each client has a "pricing rule" applied via custom query handler when they interrogate the db (it's just a value modifier)

Resources