I'm using Neo4j in a community e-commerce built in PHP and using the REST interface.
I need to get all categories related to the search results like Amazon. This feature is available in other engines like Solr (another implementation of Lucene) as Faceted Search
How can I do a Faceted Search in Neo4j? or What's the best way (performance grade) to recreate this feature?
All required modules related to this feature are excluded from the core package of neo4j. I want to know if someone try to do something like this without transverse all nodes in the graph, grab some properties, and make a groupCount of this values. If we have 200k nodes, the transverse took 10sec to only get the categories.
This is my Gremlin approach.
(new Neo4jVertexSequence(
g.getRawGraph().index().forNodes('products').query(
new org.neo4j.index.lucene.QueryContext('category:?')
), g
))._().groupBy{it.category}.cap.next();
Results in 90 rows and took 54 seconds.
Books = 12002
Movies_Music_Games = 19233
Electronics_Computers = 60540
Home_Garden_Tools = 9123
Grocery_Health_Beauty = 15643
Toys_Kids_Baby = 15099
Clothing_Shoes_Jewelry = 12543
Sports_Outdoors = 10342
Automotive_Industrial = 9638
... (more rows)
Of course, I can't put this results in cache, because, this is for "non input search". If the user makes a query like "Iphone", the query looks like
(new Neo4jVertexSequence(
g.getRawGraph().index().forNodes('products').query(
new org.neo4j.index.lucene.QueryContext('search:"iphone" AND category:?')
), g
))._().groupBy{it.category}.cap.next();
What about your domain model? Did you just put everything in the index? Usually you would model your categories as nodes and have your products being related to the category nodes.
(product)-[:HAS_CATEGORY]->(category)<-[:IS_CATEGORY]-(categories)
In your query you would just traverse this little tree and count the relationships of type :HAS_CATEGORY starting from each category node.
start categories=node(x)
match (product)-[:HAS_CATEGORY]->(category)<-[:IS_CATEGORY]-(categories)
return category.name, count(*)
Related
I am currently trying to build a search application on top of SNOMED CT international FULL RF 2 Release. The database is quite huge and I had decided to move on with a full text search index for optimal results. So there are primarily 3 types of nodes in a SNOMED CT database :
ObjectConcept
Descriptions
Role Group
There are multiple relationships between the nodes but I'm focussing on that later.
Currently I'm focussing on searching for ObjectConcept Nodes by a String property value called FSN which stands for fully specified name. For this I tried two things :
Create text indexes on FSN : After using this with MATCH queries the results were rather slow even when I was using the CONTAINS predicate and limiting return value to 15.
Create Full Text Indexes: According to the docs, the FT indexes are powered by Apache Lucene. I created this for FSN. After using the FT Indexes and using AND clauses in the search term, for example :
Search Term : head pain
Query Term: head AND pain
I observe quite impressive benefits in query time using the profiler in neo4j browser(around from 43ms to 10ms for some queries) however once I start querying the db using the apollo server, query times go as high as 2s - 3s. Sometimes upto 8-10s.
The query is as follows, implemented by a custom resolver in neo4j/graphql and apollo-server:
` const session = context.driver.session()
let words = args.name.split(" ");
let compoundQuery = "";
if (words.length === 1) compoundQuery = words[0];
else compoundQuery = words.join(" AND ");
console.log(compoundQuery)
compoundQuery+= AND (${args.type})
return session
.run(
`CALL db.index.fulltext.queryNodes('searchIndex',$name) YIELD node, score
RETURN node
LIMIT 10
`,
{ name: compoundQuery }
)
.then((res) => {
session.close()
return res.records.map((record) => {
return record.get('node').properties
})
})
} `
I have the following questions:
Am I utilising FT indexes as much as I can or am I missing important optimisations out?
I was trying to implement elasticsearch with neo4j but I read elasticsearch and the FT indexes are both powered by lucene. So am I likely to gain improvements from using elasticsearch? If so? how should I go about it considering that I am using neo4j aura db and my graphql server is on ec2. I am confused as to how to use elasticsearch overall with the GRANDStack. Any help would be appreciated.
Any other Suggestions for optimising search would be greatly appreciated.
Thanks in advance!
I am loading simple csv data into neo4j. The data is simple as follows :-
uniqueId compound value category
ACT12_M_609 mesulfen 21 carbon
ACT12_M_609 MNAF 23 carbon
ACT12_M_609 nifluridide 20 suphate
ACT12_M_609 sulfur 23 carbon
I am loading the data from the URL using the following query -
LOAD CSV WITH HEADERS
FROM "url"
AS row
MERGE( t: Transaction { transactionId: row.uniqueId })
MERGE(c:Compound {name: row.compound})
MERGE (t)-[r:CONTAINS]->(c)
ON CREATE SET c.category= row.category
ON CREATE SET r.price =row.value
Next I do the aggregation to count total orders for a compound and create property for a node in the following way -
MATCH (c:Compound) <-[:CONTAINS]- (t:Transaction)
with c.name as name, count( distinct t.transactionId) as ord
set c.orders = ord
So far so good. I can accomplish what I want but I have the following 2 questions -
How can I create the orders property for compound node in the first step itself? .i.e. when I am loading the data I would like to perform the aggregation straight away.
For a compound node I am also setting the property for category. Theoretically, it can also be modelled as category -contains-> compound by creating Categorynode. But what advantage will I have if I do it? Because I can execute the queries and get the expected output without creating this additional node.
Thank you for your answer.
I don't think that's possible, LOAD CSV goes over one row at a time, so at row 1, it doesn't know how many more rows will follow.
I guess you could create virtual nodes and relationships, aggregate those and then use those to create the real nodes, but that would be way more complicated. Virtual Nodes/Rels
That depends on the questions/queries you want to ask.
A graph database is optimised for following relationships, so if you often do a query where the category is a criteria (e.g. MATCH (c: Category {category_id: 12})-[r]-(:Compound) ), it might be more performant to create a label for it.
If you just want to get the category in the results (e.g. RETURN compound.category), then it's fine as a property.
I am server engineer in company that provide dating service.
Currently I am building a PoC for our new recommendation engine.
I try to use neo4j. But performance of this database does not meet our needs.
I have strong feeling that I am doing something wrong and neo4j can do much better.
So can someone give me an advice how to improve performance of my Cypher’s query or how to tune neo4j in right way?
I am using neo4j-enterprise-2.3.1 which is running on c4.4xlarge instance with Amazon Linux.
In our dataset each user can have 4 types of relationships with others users - LIKE, DISLIKE, BLOCK and MATCH.
Also he has a properties like countryCode, birthday and gender.
I made import of all our users and relationships from RDBMS to neo4j using neo4j-import tool.
So each user is a node with properties and each reference is a relationship.
The report from neo4j-import tool said that :
2 558 667 nodes,
1 674 714 539 properties and
1 664 532 288 relationships
were imported.
So it’s huge DB :-) In our case some nodes can have up to 30 000 outgoing relationships..
I made 3 indexes in neo4j :
Indexes
ON :User(userId) ONLINE
ON :User(countryCode) ONLINE
ON :User(birthday) ONLINE
Then I try to build online recommendation engine using this query :
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar:User)
USING INDEX me:User(userId)
USING INDEX similar:User(birthday)
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
here is the execution plan for one of the user :
plan
When I executed this query for list of users I had the result :
count=2391, min=4565.128849, max=36257.170065, mean=13556.750555555178, stddev=2250.149335254768, median=13405.409811, p75=15361.353029999998, p95=17385.136478, p98=18040.900481, p99=18426.811424, p999=19506.149138, mean_rate=0.9957385490980866, m1=1.2148195797996817, m5=1.1418078036067119, m15=0.9928564378521962, rate_unit=events/second, duration_unit=milliseconds
So even the fastest is too slow for Real-time recommendations..
Can you tell me what I am doing wrong?
Thanks.
EDIT 1 : plan with the expanded boxes :
I built an unmanaged extension to see if I could do better than Cypher. You can grab it here => https://github.com/maxdemarzi/social_dna
This is a first shot, there are a couple of things we can do to speed things up. We can pre-calculate/save similar users, cache things here and there, and random other tricks. Give it a shot, let us know how it goes.
Regards,
Max
If I'm reading this right, it's finding all matches for users by userId and separately finding all matches for users by your various criteria. It's then finding all of the places that they come together.
Since you have a case where you're starting on the left with a single node, my guess is that we'd be better served by following the paths and then filtering what it gotten via relationship traversal.
Let's see how starting like this works for you:
MATCH
(me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()
<-[:LIKE | :MATCH]-(similar:User)
WITH similar
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
[UPDATED]
One possible (and nonintuitive) cause of inefficiency in your query is that when you specify the similar:User(birthday) filter, Cypher uses an index seek with the :User(birthday) index (and additional tests for countryCode and gender) to find all possible DB matches for similar. Let's call that large set of similar nodes A.
Only after finding A does the query filter to see which of those nodes are actually connected to me, as specified by your MATCH pattern.
Now, if there are relatively few me to similar paths (as specified by the MATCH pattern, but without considering its WHERE clause) as compared to the size of A -- say, 2 or more orders of magnitude smaller -- then it might be faster to remove the :User label from similar (since I presume they are probably all going to be users anyway, in your data model), and also remove the USING INDEX similar:User(birthday) clause. In this case, not using the index for similar may actually be faster for you, since you will only be using the WHERE clause on a relatively small set of nodes.
The same considerations also apply to the recommendation node.
Of course, this all has to be verified by testing on your actual data.
I'm wanting to run some tests on neo4j, and compare its performance with other databases, in this case postgresql.
This postgres database have about 2000000 'content's distributed around 3000 'categories'. ( this means that there is a table 'content', one 'category' and a relation table 'content-to-category' since one content can be in more than 1 category).
So, mapping this to a neo4j db, i'm creating nodes 'content', 'category' and their relations ( content to category, and content to content, cause contents can have related contents).
category -> category ( categories can have sub-categories )
content -> category
content -> content (related)
Do you think this 'schema' is ok for this type of domain ?
migrating all data from postgresql do neo4j: it is taking forever ( about 4, 5 days ). This is just some search for nodes and creating/updating accordingly. (search is using indexes and the insert/update if taking 500ms for each node)
Am i doing something wrong ?
Migration is done, so i went to try some querying ...
i ended up with about 2000000 content nodes, 3000 category nodes, and more than 4000000 relationships.
(please note that i'm new to all this neo4j world, so i have no idea how to optimize cypher queries...)
One of the queries i wanted to test is: get the 10 latest published contents of a given 'definition' in a given category (this includes contents that are in sub categories of the given category)
experimenting a little, i ended up with something like this :
START
c = node : node_auto_index( 'type: category AND code: category_code' ),
n = node : node_auto_index( 'type: content AND state: published AND definitionCode: definition_name' )
MATCH (c) <- [ r:BELONGS_TO * ] - (n)
RETURN n.published_stamp, n.title
ORDER BY n.published_stamp DESC
LIMIT 6
this takes around 3 seconds, excluding the first run, that takes a lot more ... is this normal ?
What am i doing wrong ?
please note that i'm using neo4j 1.9.2, and auto indexing some node properties ( type, code, state, definitionCode and published_stamp included - title is not auto indexed )
also, returning 'c' on the previous query ( start c = node : node_auto_index( 'type: category AND code : category-code' ) return c; ) is fast (again, excluding the first run, that takes aroung 20-30ms)
also, i'm not sure if this is the right way to use indexes ...
Thank you in advance (sorry if something is not making sense - ask me and i'll try to explain better).
Have you looked at the batch import facilities: http://www.neo4j.org/develop/import? You really should look at that for the initial import - it will take minutes instead of days.
I will ask some of our technical folks to get back to you on some of the other stuff. You really should not be seeing this.
Rik
How many nodes are returned by this?
START
n = node : node_auto_index( 'type: content AND state: published AND definitionCode: definition_name' )
RETURN count(*)
I would try to let the graph do the work.
How deep are your hierarchies usually?
Usually you limit arbitrary length relationships to not have the combinatorial explosion:
I would also have a different relationship-type between content and category than the category tree.
Can you point out your current relationship-types?
START
c = node : node_auto_index( 'type: category AND code: category_code' ),
MATCH (c) <- [:BELONGS_TO*5] - (n)
WHERE n.type = 'content' AND n.state='published' and n.definitionCode = 'definition_name'
RETURN n.published_stamp, n.title
ORDER BY n.published_stamp DESC
LIMIT 6
Can you try that?
For import it is easiest to generate CSV from your SQL and import that using http://github.com/jexp/batch-import
Are you running Linux, maybe on an ext4 filesystem?
You might want to set the barrier=0 mount option, as described here: http://structr.org/blog/neo4j-performance-on-ext4
Further discussion of this topic: https://groups.google.com/forum/#!topic/neo4j/nflUyBsRKyY
I realise this may not be ideal usage, but apart from all the graphy goodness of Neo4j, I'd like to show a collection of nodes, say, People, in a tabular format that has indexed properties for sorting and filtering
I'm guessing the Type of a node can be stored as a Link, say Bob -> type -> Person, which would allow us to retrieve all People
Are the following possible to do efficiently (indexed?) and in a scalable manner?
Retrieve all People nodes and display all of their names, ages, cities of birth, etc (NOTE: some of this data will be properties, some Links to other nodes (which could be denormalised as properties for table display's and simplicity's sake)
Show me all People sorted by Age
Show me all People with Age < 30
Also a quick how to do the above (or a link to some place in the docs describing how) would be lovely
Thanks very much!
Oh and if the above isn't a good idea, please suggest a storage solution which allows both graph-like retrieval and relational-like retrieval
if you want to operate on these person nodes, you can put them into an index (default is Lucene) and then retrieve and sort the nodes using Lucene (see for instance How do I sort Lucene results by field value using a HitCollector? on how to do a custom sort in java). This will get you for instance People sorted by Age etc. The code in Neo4j could look like
Transaction tx = neo4j.beginTx();
idxManager = neo4j.index()
personIndex = idxManager.forNodes('persons')
personIndex.add(meNode,'name',meNode.getProperty('name'))
personIndex.add(youNode,'name',youNode.getProperty('name'))
tx.success()
tx.finish()
'*** Prepare a custom Lucene query context with Neo4j API ***'
query = new QueryContext( 'name:*' ).sort( new Sort(new SortField( 'name',SortField.STRING, true ) ) )
results = personIndex.query( query )
For combining index lookups and graph traversals, Cypher is a good choice, e.g.
START people = node:people_index(name="E*") MATCH people-[r]->() return people.name, r.age order by r.age asc
in order to return data on both the node and the relationships.
Sure, that's easily possible with the Neo4j query language Cypher.
For example:
start cat=node:Types(name='Person')
match cat<-[:IS_A]-person-[born:BORN]->city
where person.age > 30
return person.name, person.age, born.date, city.name
order by person.age asc
limit 10
You can experiment with it in our cypher console.