Solr join and non-join queries give different results

I am trying to link two types of documents in my Solr index. The parent is named "house" and the child is named "available". So, I want to return a list of houses that have available documents with some filtering. However, the following query gives me around 18 documents, which is wrong. It should return 0 documents.
q=*:*
&fq={!join from=house_id_fk to=house_id}doctype:available AND discount:[1 TO *] AND start_date:[NOW/DAY TO NOW/DAY%2B21DAYS]
&fq={!join from=house_id_fk to=house_id}doctype:available AND sd_year:2014 AND sd_month:11
To debug it, I first checked whether there are any available documents matching the given filter queries. So, I tried the following query:
q=*:*
&fq=doctype:available AND discount:[1 TO *] AND start_date:[NOW/DAY TO NOW/DAY%2B21DAYS]
&fq=doctype:available AND sd_year:2014 AND sd_month:11
The query gives 0 results, which is correct. As you can see, both queries are the same; the only difference is the use of the join query parser. I am a bit confused about why the first query gives results. My understanding is that this should not happen, because the second query shows that there are no available documents that satisfy the given filter queries.

I have figured it out.
The reason is simply the way Solr evaluates the joins. Each filter query is executed separately, so each join is resolved to its own set of houses and the two sets are then intersected. A house is therefore returned if it has some available document with discount >= 1 in the date range and some, possibly different, available document with (sd_year:2014 AND sd_month:11), even though my intention was to apply both conditions at the same time.
However, in the second case, both conditions are applied to the same available documents. Since no available document satisfies both conditions, there are no matches, which gives zero results.
It really took some time to figure this out. I hope this will help someone else.
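The difference can be illustrated outside Solr with a small Python model. This is only a sketch: the sample documents are made up, the date-range condition is left out for brevity, and `join_to_houses` is a hypothetical helper that mimics what a `{!join from=house_id_fk to=house_id}` filter does (return the houses that have at least one matching child).

```python
# Hypothetical sample data: each "available" document points at a house.
available_docs = [
    {"house_id_fk": 1, "discount": 5, "sd_year": 2015, "sd_month": 1},
    {"house_id_fk": 1, "discount": 0, "sd_year": 2014, "sd_month": 11},
    {"house_id_fk": 2, "discount": 0, "sd_year": 2014, "sd_month": 11},
]


def has_discount(doc):
    # Models discount:[1 TO *], which is inclusive of 1.
    return doc["discount"] >= 1


def in_nov_2014(doc):
    # Models sd_year:2014 AND sd_month:11.
    return doc["sd_year"] == 2014 and doc["sd_month"] == 11


def join_to_houses(docs, predicate):
    """Hypothetical model of a join filter: houses with at least one matching child."""
    return {doc["house_id_fk"] for doc in docs if predicate(doc)}


# Two separate join filters: each is resolved to houses first, then intersected.
separate = join_to_houses(available_docs, has_discount) & join_to_houses(
    available_docs, in_nov_2014
)

# One join over the combined condition: a single child must satisfy both.
combined = join_to_houses(
    available_docs, lambda d: has_discount(d) and in_nov_2014(d)
)

print(separate)  # {1}
print(combined)  # set()
```

House 1 is returned by the separate filters because two different available documents each satisfy one condition. In Solr terms, the combined variant would presumably correspond to putting all conditions into a single `{!join ...}` filter query.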

Related

Neo4j Query Optimization for Cartesian Product

I am trying to implement a user-journey analytics solution. Simply put, I want to analyze on which screens which users leave the application.
For this, I have modeled the data like this:
I modeled each activity as a separate node since I want to index some attributes. Relationship attributes cannot be indexed in Neo4j.
With this model, I am trying to follow three successive event types with the query below:
MATCH (eventType1:EventType {eventName:'viewStart-home'})<--(event:EventNode)
<--(eventType2:EventType{eventName:'viewStart-payment'})
WITH distinct event.deviceId as eUsers, event.clientCreationDate as eDate
MATCH((eventType2)<--(event2:EventNode)
<--(eventType3:EventType{eventName:'viewStart-screen1'}))
WITH distinct event2.deviceId as e2Users, event2.clientCreationDate as e2Date
RETURN e2Users limit 200000
And the execution plan is below:
I could not figure out the reason for this behavior. Can you help me?
Your query is doing a lot more work than it needs to.
The first WITH clause is not needed at all, since its generated eUsers and eDate variables are never used. And the second WITH clause does not need to generate the unused e2Date variable.
In addition, you could first add an index for :EventType(eventName) to speed up the processing:
CREATE INDEX ON :EventType(eventName);
With these changes, your query's profile could be simpler and the processing would be faster.
Here is an updated query (that should use the index to quickly find the EventType node at one end of the path, to kick off the query):
MATCH (:EventType {eventName:'viewStart-home'})<--(:EventNode)
<--(:EventType{eventName:'viewStart-payment'})<--(event2:EventNode)
<--(:EventType{eventName:'viewStart-screen1'})
RETURN distinct event2.deviceId as e2Users
LIMIT 200000;
Here is an alternate query that uses 2 USING INDEX hints to tell the planner to quickly find the :EventType nodes at both ends of the path to kick off the query. This might be even faster than the first query:
MATCH (a:EventType {eventName:'viewStart-home'})<--(:EventNode)
<--(:EventType{eventName:'viewStart-payment'})<--(event2:EventNode)
<--(b:EventType{eventName:'viewStart-screen1'})
USING INDEX a:EventType(eventName)
USING INDEX b:EventType(eventName)
RETURN distinct event2.deviceId as e2Users
LIMIT 200000;
Try profiling them both on your DB, and pick the best one or keep tweaking further.

Different results of two (synonymous) queries in Neo4j

I have noticed that some queries return fewer results than expected. I took one of the missing results and tried to force Neo4j to return it, and I succeeded with the following query:
match (q0),(q1),(q2),(q3),(q4),(q5)
where
q0.name='v4' and q1.name='v3' and q2.name='v5' and
q3.name='v1' and q4.name='v3' and q5.name='v0' and
(q1)-->(q0) and (q0)-->(q3) and (q2)-->(q0) and (q4)-->(q0) and
(q5)-->(q4)
return *
I assumed that the following query is semantically equivalent to the previous one. However, in this case, Neo4j returns no result at all.
match (q1)-->(q0), (q0)-->(q3), (q2)-->(q0), (q4)-->(q0), (q5)-->(q4)
where
q0.name='v4' and q1.name='v3' and q2.name='v5' and
q3.name='v1' and q4.name='v3' and q5.name='v0'
return *
I have also manually verified that the required edges among vertices v0, v1, v3, v4 and v5 are present in the database with right directions.
Am I missing some important difference between these queries or is it just a bug of Neo4j? (I have tested these queries on Neo4j 2.1.6 Community Edition.)
Thank you for any advice
/EDIT: Updating to newest version 2.2.1 was of no help.
This might not be a complete answer, but here's what I found out.
These queries aren't synonymous, if I understand correctly.
First of all, use EXPLAIN (or even PROFILE) to look under the hood. The first query will be executed as follows:
The second query:
As you can see (even without going deep down), those are different queries in terms of both efficiency and semantics.
Next, what's actually going on here:
The 1st query will look through all (single) nodes, filter them by name, then try to group them according to your pattern, which involves computing a Cartesian product (hence the enormous space complexity), then collect those groups into larger ones, and then evaluate your other conditions.
The 2nd query will first pick a pair of nodes connected by some relationship (and satisfying the condition on the name property), then throw in the third node and filter again, and so on till the end. The number of nodes is expected to decrease after every filter cycle.
By the way, is it possible that you accidentally used the same name twice (q1 and q4 are both 'v3')?
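One possible explanation, not confirmed in this thread: q1 and q4 both bind to the node named 'v3', so (q1)-->(q0) and (q4)-->(q0) would have to use the same relationship, and Cypher does not match the same relationship twice within a single MATCH clause. Pattern predicates in WHERE are plain existence checks and are not subject to that rule. The Python sketch below models this with a brute-force matcher; the toy graph (one relationship per required edge) and the function name `count_matches` are assumptions made for illustration.

```python
from itertools import product

# Hypothetical toy graph: one relationship per required edge, as (source, target).
EDGES = [("v3", "v4"), ("v4", "v1"), ("v5", "v4"), ("v0", "v3")]

# Pattern from the question: each pair (a, b) stands for (a)-->(b).
PATTERN = [("q1", "q0"), ("q0", "q3"), ("q2", "q0"), ("q4", "q0"), ("q5", "q4")]

# Name constraints from the question; note q1 and q4 are both 'v3'.
NAMES = {"q0": "v4", "q1": "v3", "q2": "v5", "q3": "v1", "q4": "v3", "q5": "v0"}


def count_matches(edges, pattern, names, unique_relationships):
    """Count assignments of graph edges to pattern steps that fit the names."""
    matches = 0
    for combo in product(range(len(edges)), repeat=len(pattern)):
        # With uniqueness on, no edge may be used by two pattern steps.
        if unique_relationships and len(set(combo)) < len(combo):
            continue
        if all(
            edges[i] == (names[a], names[b])
            for i, (a, b) in zip(combo, pattern)
        ):
            matches += 1
    return matches


# Models the WHERE-predicate query: edges may be reused across checks.
print(count_matches(EDGES, PATTERN, NAMES, unique_relationships=False))  # 1
# Models the single MATCH pattern: every step needs its own relationship.
print(count_matches(EDGES, PATTERN, NAMES, unique_relationships=True))  # 0
```

If this is the cause, a second, parallel relationship from v3 to v4 would make the single-MATCH query return a row as well.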

Neo4j 2.0: Indexing array-valued properties with schema indexing

I have nodes with multiple "sourceIds" in one array-valued property called "sourceIds", just because there could be multiple resources a node could be derived from (I'm assembling multiple databases into one Neo4j model).
I want to be able to look up nodes by any of their source IDs. With legacy indexing this was no problem, I would just add a node to the index associated with each element of the sourceIds property array.
Now I wanted to switch to indexing with labels and I'm wondering how that kind of index works here. I can do
CREATE INDEX ON :<label>(sourceIds)
but what does that actually do? I hoped it would just create index entries for each array element, but that doesn't seem to be the case. With
MATCH n:<label> WHERE "testid" in n.sourceIds RETURN n
the query takes between 300ms and 500ms which is too long for an index lookup (other schema indexes work three to five times faster). With
MATCH n:<label> WHERE n.sourceIds="testid" RETURN n
I don't get a result. That's expected because it's an array property, but I gave it a try since it would make sense if array properties were broken down into their elements for indexing purposes.
So, is there a way to handle array properties with schema indexing or are there plans or will I just have to stick to legacy indexing here? My problem with the legacy Lucene index was that I hit the max number of boolean clauses (1024). Another question thus would be: Can I raise this number? Lucene allows that, but can I do this with the Lucene index used by Neo4j?
Thanks and best regards!
Edit: A bit more elaboration on why I hit the boolean clauses max limit: I need to export specific parts of the database into custom file formats for text processing pipelines. These pipelines use components I cannot (be it for the sake of accessibility or time) change to query Neo4j directly, so I'd rather stay with the defined required file format(s). I do the export via the pattern "give me all IDs in the DB; now, for batches of IDs, query the desired information (e.g. specific paths) from Neo4j and store the results to file". Why do I use batches at all? Well, if I don't, things are slowed down significantly by the connection overhead. Thus, large batches are a kind of optimization here.
Schema indexes can only do exact matches right now. Your "testid" in n.sourceIds does not use the index (as shown by your query times). I think there are plans to make this behave better, but I'm waiting for them just as eagerly as you are.
I've actually hit a lower max in the Lucene query: 512. If there is a way to increase it, I'd love to hear of it. The way I got around it is by running more than one query in the rare cases that actually go over 512 IDs. What query are you doing where you need more?
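The workaround of splitting a large ID list into several queries can be sketched as follows. This is a minimal illustration in Python, not tied to any Neo4j or Lucene API; the helper name `batches` and the sample IDs are made up, and the default of 512 is simply the limit mentioned above.

```python
def batches(ids, max_clauses=512):
    """Yield consecutive slices of ids, each at most max_clauses long."""
    if max_clauses < 1:
        raise ValueError("max_clauses must be at least 1")
    for start in range(0, len(ids), max_clauses):
        yield ids[start:start + max_clauses]


# Hypothetical source IDs; each batch would become one index query.
source_ids = [f"id{i}" for i in range(1200)]
sizes = [len(batch) for batch in batches(source_ids)]
print(sizes)  # [512, 512, 176]
```

Each batch stays under the clause limit, at the cost of one extra round trip per batch.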

Can a Grails dynamic finder be broken by application code?

There is some code in the project I'm working on where a dynamic finder behaves differently in one code branch than it does in another.
This line of code returns all my advertisers (there are 8 of them), regardless of which branch I'm in.
Advertiser.findAllByOwner(ownerInstance)
But when I start adding conditions, things get weird. In branch A, the following code returns all of my advertisers:
Advertiser.findAllByOwner(ownerInstance, [max: 25])
But in branch B, that code only returns 1 advertiser.
It doesn't seem possible that changes in application code could affect how a dynamic finder works. Am I wrong? Is there anything else that might cause this not to work?
Edit
I've been asked to post my class definitions. Instead of posting all of it, I'm going to post what I think is the important part:
static mapping = {
owner fetch: 'join'
category fetch: 'join'
subcategory fetch: 'join'
}
static fetchMode = [
grades: 'eager',
advertiserKeywords: 'eager',
advertiserConnections: 'eager'
]
This code was present in branch B but absent from branch A. When I pull it out, things now work as expected.
I decided to do some more digging with this code present to see what I could observe. I found something interesting when I used withCriteria instead of the dynamic finder:
Advertiser.withCriteria{owner{idEq(ownerInstance.id)}}
What I found was that this returned thousands of duplicates! So I tried using listDistinct:
Advertiser.createCriteria().listDistinct{owner{idEq(ownerInstance.id)}}
Now this returns all 8 of my advertisers with no duplicates. But what if I try to limit the results?
Advertiser.createCriteria().listDistinct{
owner{idEq(ownerInstance.id)}
maxResults 25
}
Now this returns a single result, just like my dynamic finder does. When I cranked maxResults up to 100K, I got all 8 of my results.
So what's happening? It seems that the joins or the eager fetching (or both) generated SQL that returned thousands of duplicate rows. Grails dynamic finders must return distinct results by default, so when I wasn't limiting them, I didn't notice anything strange. But once I set a limit, since the records were ordered by ID, the first 25 rows were all duplicates of the same record, meaning that only one distinct record was returned.
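The interaction between the row limit and the de-duplication can be reproduced with a small Python model. It is only a sketch under assumptions: the figure of 40 duplicate rows per advertiser is invented, and `limited_distinct` is a hypothetical stand-in for "limit in SQL, then distinct in memory", which is the behavior inferred above rather than something taken from the Grails source.

```python
# Hypothetical joined result set, ordered by id: each of the 8 advertisers
# appears once per combination of eagerly fetched children (40 rows here).
rows = [adv_id for adv_id in range(1, 9) for _ in range(40)]


def distinct(seq):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(seq))


def limited_distinct(all_rows, max_results):
    """Apply the row limit first, then de-duplicate what is left."""
    return distinct(all_rows[:max_results])


print(len(distinct(rows)))                   # 8
print(limited_distinct(rows, 25))            # [1]
print(len(limited_distinct(rows, 100_000)))  # 8
```

With a limit of 25, every surviving row belongs to the first advertiser, so the distinct step collapses them to one record.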
As for the joins and eager fetching, I don't know what problem that code was trying to solve, so I can't say whether or not it's necessary; the question is, why does having this code in my class generate so many duplicates?
I found out that the eager fetching was added (many levels deep) in order to speed up the rendering of certain reports, because hundreds of queries were being made. Attempts were made to eager fetch on demand, but other developers had difficulty going more than one level deep using finders or Grails criteria.
So the general answer to the question above is: instead of eager fetching by default, which can cause huge nightmares in other places, we need to find a way to do eager fetching on a single query that can go more than one level down the tree.
The next question is, how? It's not very well supported in Grails, but it can be achieved by simply using Hibernate's Criteria class. Here's the gist of it:
def advertiser = Advertiser.createCriteria()
.add(Restrictions.eq('id', advertiserId))
.createCriteria('advertiserConnections', CriteriaSpecification.INNER_JOIN)
.setFetchMode('serpEntries', FetchMode.JOIN)
.uniqueResult()
Now the advertiser's advertiserConnections will be eager fetched, and the advertiserConnections' serpEntries will also be eager fetched. You can go as far down the tree as you need to. Then you can leave your classes lazy by default, which they definitely should be for hasMany scenarios.
Since your query is retrieving duplicates, there's a chance that the 25 rows within the limit all contain the same data, so the distinct step reduces them to one record.
Try to define equals() and hashCode() in your classes, especially for any that have a composite primary key or are used in a hasMany.
I also suggest eliminating the possibilities one at a time. Remove the fetch and the eager settings one by one to see how each affects your result data (without a limit).

Returning all relationships for a list of nodes

This is quite a general question but to make it more understandable I'll give it a bit of context.
In Neo4j I have a series of words (nodes) that are associated with one another. I want to specify a list of nodes and have the Cypher query return a list of any relationships between those nodes.
The nodes specified in the list are all guaranteed to have at least one relationship to another node specified in the list.
I created a query to do this and in certain circumstances it works fine - http://console.neo4j.org/?id=s30cbm
Unfortunately, when I add the words 'bark' and 'dog' to the list I get an 'unexpected traversal state encountered' error message. I presume this is because the database cursor has got to the fruit node and then there's no relationship between that and bark, even though there is a relationship from tree to bark. http://console.neo4j.org/?id=258d6g
I'm obviously doing the query slightly wrong and any advice would be appreciated on how I can rectify this.
This works in the latest console (your second link), btw, so it looks like they fixed it. Looks like it should be working in 1.9-M04+.
