Apache Solr - Count of subquery as a superquery parameter - join

I'm having a little trouble trying to make a query in Solr.
The problem is: I must be able to retrieve documents that have the same value for a specified field, but they should only be retrieved if this value appears more than X times for a specified user.
In pseudosql it would be something like:
select user_id from documents
where my_field="my_value"
and
(select count(*) from documents where my_field="my_value" and user_id=super.user_id) > X
I know that Solr returns a 'numFound' for each query you make, but I don't know how to retrieve this value in a subquery.
My Solr is organized in a way that a user is a document, and the properties of the user (such as name, age, etc) are grouped in another document with a 'root_id' field.
So let's suppose the following query, which gets all the root documents whose children have the prefix "some_prefix":
is_root:true AND _query_:"{!join from=root_id to=id}requests_prefix:\"some_prefix\""
Now, how can I get the root documents (users in some sense) that have more than X children matching 'requests_prefix:"some_prefix"' or any other condition?
Is it possible?
P.S. It must be done in a single query. Fields can be added at will, but the root/children structure should preferably be preserved.

As it turns out, Solr didn't match my needs and I ended up using ElasticSearch with its native parent-child mapping.
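For readers who stay on Solr, one possible workaround is to facet the child documents on root_id with facet.mincount, which keeps only root ids that have at least a given number of matching children. This is a two-step sketch, so it does not meet the single-query requirement from the question. The threshold X = 3, the helper names, and the sample response are made up for illustration.

```python
from urllib.parse import urlencode

def child_count_params(child_query, min_children):
    """Build Solr params that facet the matching children by root_id."""
    return urlencode({
        "q": child_query,
        "rows": 0,                       # only the facet counts are needed
        "facet": "true",
        "facet.field": "root_id",
        "facet.mincount": min_children,  # drop root ids with fewer matches
        "facet.limit": -1,               # return every qualifying root id
        "wt": "json",
    })

def root_ids_from_facets(response, field="root_id"):
    """Read Solr's default flat facet list: [value, count, value, count, ...]."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))

X = 3  # hypothetical threshold: "more than X children"
params = child_count_params('requests_prefix:"some_prefix"', X + 1)

# Hypothetical response body, shaped like Solr's JSON facet output.
sample = {"facet_counts": {"facet_fields": {"root_id": ["u1", 7, "u9", 4]}}}
qualifying = root_ids_from_facets(sample)
```

The ids in qualifying could then be sent in a second query such as is_root:true AND id:(u1 OR u9).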

Related

How to get an instant match to a node if I know its < id >

I'm trying to speed up this query:
LOAD CSV FROM 'file:///path/to/file' AS line
MATCH (n:Organization{rc:'2051061'})-[:Ap]->(a:Person{numDc: toint(line[1])})
CREATE (a)-[:Af]->(n)
The CSV has about 100k rows, the relationship (n:Organization)-[:Ap]->(a:Person) is unique between different a/b pairs. The number of nodes with label :Organization is 50, and those with :Person is 200k.
So basically I take a value from the CSV and check whether a :Person that has an :Ap relationship with the :Organization with the given rc (2051061) has that value as numDc. If so, I add another relationship between the Person and the Organization.
My query runs too slowly, even after I added indexes on :Person(numDc) and :Organization(rc).
I think the problem may be that I'm matching the organization for every row.
How can I get an instant match to that node if I know its < id >?
Thanks in advance.
Note: You may not actually need to create an Af relationship if it does not have any properties, since you can easily traverse an Ap relationship "backwards" from a to n.
If you really do need to create an Af relationship, you can improve your performance by forcing Cypher to use both of your indexes.
Using PROFILE on your query (with the 2 indexes), I see that the Cypher planner (I tried both planner types) uses the SchemaIndex operator (which takes advantage of an index) with only one of your indexes. In order to force Cypher to use both indexes, you can use the USING INDEX clause, like this:
LOAD CSV FROM 'file:///path/to/file' AS line
MATCH (n:Organization { rc:'2051061' })
USING INDEX n:Organization(rc)
MATCH (n)-[:Ap]->(a:Person { numDc: toint(line[1])})
USING INDEX a:Person(numDc)
CREATE (a)-[:Af]->(n);
The performance should be much improved.
It's better to use your own unique identifier instead of the node id, because you can't rely on the id. The node id is basically the address where the node is stored in the file of node records.
You can add unique id to your csv file and import it into database.
Or you can use GraphAware UUID module for creating UUID on the fly - https://github.com/graphaware/neo4j-uuid

How to determine the Max property on a Relationship in Neo4j 2.2.3

How do you quickly get the maximum (or minimum) value for a property of all instances of a relationship? You can assume the machine I'm running this on is well within the recommended specs for the CPU and memory for the size of the graph, and the heap size is set accordingly.
Facts:
Using Neo4j v2.2.3
I only have access to modify the graph via the Cypher query language, which I'm hitting via PHP or in the web interface. I would love to avoid any solution that requires Java coding.
I've got a relationship, call it likes that has a single property id that is an integer.
There's about 100 million of these relationships and growing
Every day I grab new likes from a MySQL table to add to the graph in Neo4j
The relationship property id is actually the primary key (auto incrementing integer) from the raw MySQL table.
I only want to add new likes so before querying MySQL for the new entries I want to get the max id from the likes, so I can use it in my SQL query as SELECT * FROM likes_table WHERE id > max_neo4j_like_property_id
How can I get the max id property from Neo4j in an optimal way? Please indicate the create statement needed for any index as well as the query you'd use to get the final result.
I've tried creating an index as follows:
CREATE INDEX ON :likes(id);
After the index is online I've tried:
MATCH ()-[r:likes]-() RETURN r.id ORDER BY r.id DESC LIMIT 1
as well as:
MATCH ()-[r:likes]->() RETURN MAX(r.id)
They work but take freaking forever, as the explain plan for both indicates that no indexes are being used.
UPDATE: Holy $?##$?!!!! It looks like the new schema indexes aren't functional for relationships even though you can create them and show them with :schema. It also looks as if there's no way with cypher directly to create Legacy Indexes which look like they might solve this issue.
If you need to query relationship properties, it is generally a sign of a model issue.
The need for this query suggests that you would be better off extracting these properties into a node, which you'll then be able to query faster.
I'm not saying it is 100% the case, but 99% of the people I have seen so far with the same problem turned out to have this modelling issue.
What is your model right now?
Also, you don't use labels at all in your query; likes have a context bound to the nodes.
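A different way to sidestep the scan, not taken from the answer above, is to remember the highest imported id after each run (for example on a single bookkeeping node or outside the graph) and read that one value back instead of aggregating over 100 million relationships. Below is a minimal sketch of the import side only. An in-memory SQLite table stands in for MySQL, and the columns user_id and item_id are assumptions.

```python
import sqlite3

def fetch_new_likes(conn, last_id):
    """Return rows with id > last_id plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, user_id, item_id FROM likes_table WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    # If nothing new arrived, keep the previous mark unchanged.
    new_last_id = rows[-1][0] if rows else last_id
    return rows, new_last_id

# In-memory SQLite stands in for the MySQL source table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE likes_table (id INTEGER PRIMARY KEY, user_id INTEGER, item_id INTEGER)"
)
conn.executemany(
    "INSERT INTO likes_table VALUES (?, ?, ?)",
    [(1, 10, 100), (2, 11, 100), (3, 10, 101)],
)

last_id = 1  # hypothetical value saved after the previous import
rows, last_id = fetch_new_likes(conn, last_id)
```

After the rows are written to the graph, last_id would be saved again so the next run starts from it.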

Solr join and non-Join queries give different results

I am trying to link two types of documents in my Solr index. The parent is named "house" and the child is named "available". So, I want to return a list of houses that have available documents with some filtering. However, the following query gives me around 18 documents, which is wrong. It should return 0 documents.
q=*:*
&fq={!join from=house_id_fk to=house_id}doctype:available AND discount:[1 TO *] AND start_date:[NOW/DAY TO NOW/DAY%2B21DAYS]
&fq={!join from=house_id_fk to=house_id}doctype:available AND sd_year:2014 AND sd_month:11
To debug it, I first checked whether there are any available documents that match the given filter queries. So, I tried the following query:
q=*:*
&fq=doctype:available AND discount:[1 TO *] AND start_date:[NOW/DAY TO NOW/DAY%2B21DAYS]
&fq=doctype:available AND sd_year:2014 AND sd_month:11
The query gives 0 results, which is correct. As you can see, both queries are the same; the only difference is the use of the join query parser. I am a bit confused about why the first query gives results. My understanding is that this should not happen, because the second query shows that there are no available documents that satisfy the given filter queries.
I have figured it out.
The reason is simply how the join works in Solr. Since both filter queries are executed separately, each join is resolved to its own set of houses and the two sets are then intersected. A house that has one available document with discount >= 1 in the date range and another available document with sd_year:2014 AND sd_month:11 will therefore be returned, even though my intention was to apply both conditions at the same time.
However, in the second case, both conditions are applied at the same time to find available documents, and houses are then returned based on the matching available documents. Since no available document satisfies both conditions, there is no matching house, which gives zero results.
It really took some time to figure this out. I hope this will help someone else.
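The difference can be reproduced with a small in-memory emulation. This is only a sketch of the semantics described above, not Solr code: the documents are invented, and the date-range condition is left out to keep it short.

```python
houses = {"h1", "h2"}

# Hypothetical child documents: (house_id_fk, discount, sd_year, sd_month)
available = [
    ("h1", 5, 2015, 1),    # matches the discount condition only
    ("h1", 0, 2014, 11),   # matches the year/month condition only
    ("h2", 0, 2013, 6),    # matches neither
]

def has_discount(doc):
    return doc[1] >= 1

def in_nov_2014(doc):
    return doc[2] == 2014 and doc[3] == 11

def join_to_houses(predicate):
    """Emulate {!join from=house_id_fk to=house_id}: parents of matching children."""
    return {doc[0] for doc in available if predicate(doc)} & houses

# Two join filter queries: each join is resolved first, then the house sets intersect.
separate = join_to_houses(has_discount) & join_to_houses(in_nov_2014)

# One join over both conditions: a single child document must satisfy both.
combined = join_to_houses(lambda d: has_discount(d) and in_nov_2014(d))
```

Here separate contains h1 because two different children each satisfy one condition, while combined is empty. Putting all conditions into a single join filter query is the form that matches the original intention.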

solr join - return parent and child document

I am using Solr's (4.0.0-beta) join capability to query an index that has documents with parent/child relationships. The join query works great, but I only get the parent documents in the search results. I believe this is the expected behavior.
Is it possible, though, to get both the parent and the child documents to be returned in the search results? (as separate search hits).
For example:
Parents:
SolrDocument{uid=m_1, media_id=1}
SolrDocument{uid=m_2, media_id=2}
SolrDocument{uid=m_3, media_id=3}
Children:
SolrDocument{uid=p_1, page_id=1, fk_media_id=[1], partNumber=[abc, def, xyz]}
SolrDocument{uid=p_2, page_id=2, fk_media_id=[1,2], partNumber=[123, 456]}
SolrDocument{uid=p_3, page_id=3, fk_media_id=[1,3], partNumber=[100, 101]}
I query by partNumber like this:
{!join from=fk_media_id to=media_id}partNumber:abc
and I get the parent document (uid=m_1) in the results, as expected. But I would like, in this case, both the parent and the child to be returned in the results. Is that possible?
No, it's not possible. According to the Solr Wiki:
For people who are used to SQL, it's important to note that Joins in Solr are not really equivalent to SQL Joins because no information about the table being joined "from" is carried forward into the final result. A more appropriate SQL analogy would be an "inner query".
http://wiki.apache.org/solr/Join
You have to denormalize all your data to do that, or run two different queries.
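The two-query route can be sketched as follows, using the documents from the question. The two search functions are hypothetical stand-ins for real Solr requests; they filter in-memory lists so the flow can be followed without a server.

```python
parents = [
    {"uid": "m_1", "media_id": 1},
    {"uid": "m_2", "media_id": 2},
    {"uid": "m_3", "media_id": 3},
]
children = [
    {"uid": "p_1", "page_id": 1, "fk_media_id": [1], "partNumber": ["abc", "def", "xyz"]},
    {"uid": "p_2", "page_id": 2, "fk_media_id": [1, 2], "partNumber": ["123", "456"]},
    {"uid": "p_3", "page_id": 3, "fk_media_id": [1, 3], "partNumber": ["100", "101"]},
]

def search_children(part_number):
    """Stand-in for the first query: partNumber:<value>."""
    return [c for c in children if part_number in c["partNumber"]]

def search_parents(media_ids):
    """Stand-in for the second query: media_id:(id1 OR id2 ...)."""
    return [p for p in parents if p["media_id"] in media_ids]

# Query 1 finds the children; their foreign keys drive query 2 for the parents.
child_hits = search_children("abc")
wanted = {media_id for c in child_hits for media_id in c["fk_media_id"]}
hits = search_parents(wanted) + child_hits
uids = [h["uid"] for h in hits]
```

For partNumber abc this yields the parent m_1 followed by the child p_1 as separate hits.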

How to get data back from Amazon SimpleDB in a specific column order

I'm using Amazon's SimpleDB Java client to read data from SimpleDB. The problem is that even though I specified the columns in a particular order in the SelectRequest, like the following:
SelectRequest req = new SelectRequest("SELECT TIMESTAMP, TYPE, APP, http_status, USER_ID from mydata");
SelectResult res = _sdb.select(req);
..
it returned the data in the following column order:
APP, TIMESTAMP, TYPE, USER_ID, http_status,
It seems the columns were automatically reordered in ascending order. Is there any way I can force the order I specified in the select clause?
The columns returned are not an ordered list but an unordered set of attributes. You can't control the order they come back in. SELECT is designed to work even in cases where some of the attributes in your query don't exist for every (or any) returned item. In those cases specifically you wouldn't be able to rely on order anyway. I realize that's small consolation if you have structured your data set so that the attributes are always present.
However, since you know the desired order ahead of time, it should be pretty easy to pull the data out of the result in the proper order. It's just XML after all, or in the case of the Java client, freshly parsed XML.
The Select operation returns a set of Attributes for ItemNames that match the select expression.
SimpleDB docs for SELECT
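Pulling the values out in a fixed order on the client side might look like the sketch below. It is not SimpleDB client code: an item is modelled as a plain list of (name, value) pairs, and the sample values are invented.

```python
def ordered_row(attributes, columns):
    """Return values in the requested column order; missing attributes become None."""
    by_name = {}
    for name, value in attributes:
        # Keep the first value if an attribute name appears more than once.
        by_name.setdefault(name, value)
    return [by_name.get(col) for col in columns]

# The order requested in the SELECT statement.
columns = ["TIMESTAMP", "TYPE", "APP", "http_status", "USER_ID"]

# Hypothetical item in the order it happened to come back; http_status is absent.
item = [
    ("APP", "web"),
    ("TIMESTAMP", "2012-01-01T00:00:00Z"),
    ("TYPE", "login"),
    ("USER_ID", "42"),
]

row = ordered_row(item, columns)
```

Looking values up by name also covers the case mentioned above where an attribute does not exist for an item: its slot is filled with None instead of shifting the other columns.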
