Neo4j suggestion on large scale

I need to implement a suggestion system for my project.
In this system we should recommend people based on parameters such as current city, education, friends of friends, etc.
I have designed this by creating (or updating) MAY_KNOW relationships whenever users edit their profile or become friends with someone, and I retrieve them with MATCH (u)-[r:MAY_KNOW]-(x) RETURN * ORDER BY r.weight so that people can find the people most similar to them.
But I think this is not best practice, because the number of MAY_KNOW relationships from/to each user can soon reach millions, and scanning and sorting them will be costly.
Do you have a better idea?

It depends a bit on the data structure. I assume there are relationships to cities, education facilities and friends, so you don't actually have MAY_KNOW relationships, as those are only inferred?
It also depends on whether you want to create a cross product between all your users (how many are there?) and how you would want to filter out unrelated people.
Perhaps check out this blog post from Max: http://maxdemarzi.com/2013/04/19/match-making-with-neo4j/
So something like this query might work (depending on the data volume I'd rewrite it in the Java API).
match (p:Person {id:{user_id}})
match (p)-[:LIVES_IN]->(:City)<-[:LIVES_IN]-(other)
match (p)-[:GRADUATED]->(:School)<-[:GRADUATED]-(other)
match (p)-[:KNOWS]->(:Person)<-[:KNOWS]-(other)
RETURN other
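To make the shape of that query concrete outside of Cypher, here is a minimal in-memory sketch in Python. The toy data (`lives_in`, `graduated`, `knows`) and the function name `suggest` are invented for illustration and are not part of Neo4j or of the answer above. Note one deliberate difference: the chained MATCH clauses only return people who share a city, a school and a mutual friend all at once, whereas this sketch counts how many signals each candidate shares, which is one possible way to obtain a weight to sort by.

```python
from collections import Counter

# Hypothetical toy data standing in for LIVES_IN, GRADUATED and KNOWS relationships.
lives_in = {"ann": "Berlin", "bob": "Berlin", "cid": "Paris", "dee": "Berlin"}
graduated = {"ann": "MIT", "bob": "MIT", "cid": "MIT", "dee": "ETH"}
knows = {"ann": {"eve"}, "bob": {"eve"}, "cid": {"eve"}, "dee": set(), "eve": set()}


def suggest(user):
    """Rank other users by the number of signals they share with `user`."""
    scores = Counter()
    for other in lives_in:
        if other == user or other in knows.get(user, set()):
            continue  # skip the user and people they already know
        if lives_in[other] == lives_in[user]:
            scores[other] += 1  # same city
        if graduated[other] == graduated[user]:
            scores[other] += 1  # same school
        # one point per mutual friend
        scores[other] += len(knows.get(user, set()) & knows.get(other, set()))
    # highest score first; drop candidates with nothing in common
    return [(name, s) for name, s in scores.most_common() if s > 0]
```

With this toy data, `suggest("ann")` ranks bob first (same city, same school, one mutual friend), then cid, then dee. Computing the ranking at query time like this avoids storing millions of precomputed MAY_KNOW relationships, at the cost of more work per request.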

Related

Can graph database query "nodes that a given node has no relationship with"?

I am working on a dating app where users can "like" or "dislike" other users and get matched.
As you can imagine the most important query of the app would be:
Give me a stack of nearby user profiles that I have NOT liked/disliked before.
I tried to work on this with a document database (Firestore) and figured it's simply not suitable for such kind of application and hence landed in the graph database world which is new and fascinating to me.
I understand that by nature a graph database retrieves data by tracing through the relationships and make relationships first-class citizens. My question now is that what if the nodes that I am trying to get are those with no relationship from the given node? What would the query look like? Can anyone provide an example query?
Edit:
- added nearby criteria to the query statement
This is definitely possible, here is a query example :
MATCH (me:Profile {name: "Chris"})
MATCH (other:Profile) WHERE NOT (other)-[:LIKES]->(me)
RETURN other
As stated in the comments on your original question, this might not scale well on a large dataset. That said, it is pretty uncommon to use only one criterion for matching. For example, the list of candidate profiles can be narrowed down by:
geolocation
profiles at depth 2 (who likes me, who else do those people like, and do those people like me?)
shared interests
age group
skin color
...
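The idea of narrowing the candidates first and only then excluding already-rated profiles can be sketched in plain Python. This is an in-memory illustration under assumptions, not Neo4j code: the `profiles` and `rated` structures, the 50 km default radius and the function names are all hypothetical. It follows the direction asked for in the question, i.e. profiles that the current user has not yet liked or disliked.

```python
import math

# Hypothetical profiles: name -> (latitude, longitude)
profiles = {
    "chris": (52.52, 13.40),
    "dana": (52.50, 13.45),
    "eli": (48.85, 2.35),
    "fay": (52.53, 13.38),
}
# Hypothetical LIKES / DISLIKES edges: who has already rated whom
rated = {"chris": {"fay"}}


def distance_km(a, b):
    """Great-circle distance between two (lat, lon) points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def unseen_nearby(me, radius_km=50):
    """Profiles within radius_km of `me` that `me` has not liked or disliked."""
    seen = rated.get(me, set())
    return sorted(
        other for other, pos in profiles.items()
        if other != me
        and other not in seen
        and distance_km(profiles[me], pos) <= radius_km
    )
```

Here `unseen_nearby("chris")` leaves only dana: eli is too far away and fay has already been rated. The cheap positive filter (location) shrinks the candidate set before the "no relationship" check runs, which is the point made above about combining criteria.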

how to get user-based recommendation with graphaware neo4j-reco

I need to get user-based recommendations with graphaware and I don't know how to do that. As far as I can see, all I seem to get from graphaware's neo4j-reco are item-similarities as in 'people who bought a also bought b'. But what I'm interested in is user-based recommendations like 'recommended for you, based on your previous purchases'. Any idea how to do that?
GraphAware-Reco is mainly a skeleton helping you build enterprise-grade recommendation engines atop a neo4j database.
This means that it provides base classes and an architecture that you need to extend yourself with your own logic.
Taking your requirement (recommendations based on purchase history), a very naive way to get started is to look at the characteristics of the products purchased.
Let's say user 1 purchased an iphone and an ipad, which have the following characteristics:
iphone brand : apple, category: electronics
ipad brand: apple, category: electronics
You can create a first engine that matches potential candidates based on those characteristics. This engine extends CypherEngine with the following query:
MATCH (n:User {id: 111})-[:PURCHASED]->(product)
WITH distinct product
MATCH (product)-[:HAS_CHARACTERISTIC]->(c)<-[:HAS_CHARACTERISTIC]-(reco)
RETURN reco, count(*) AS score
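As a rough illustration of what this query computes, here is an in-memory Python sketch. The catalog, the purchase data and the function name `content_scores` are hypothetical and have nothing to do with the GraphAware API; the sketch only mirrors the counting logic, where each shared characteristic between a purchased product and another product adds one point.

```python
from collections import Counter

# Hypothetical catalog: product -> set of characteristics
characteristics = {
    "iphone": {"brand:apple", "category:electronics"},
    "ipad": {"brand:apple", "category:electronics"},
    "macbook": {"brand:apple", "category:electronics"},
    "galaxy": {"brand:samsung", "category:electronics"},
    "novel": {"category:books"},
}
purchased = {111: {"iphone", "ipad"}}


def content_scores(user_id):
    """Score each product by characteristics shared with the user's purchases."""
    scores = Counter()
    for product in purchased[user_id]:
        for reco, traits in characteristics.items():
            if reco == product:
                continue  # a product is not compared with itself
            scores[reco] += len(traits & characteristics[product])
    return {reco: s for reco, s in scores.items() if s > 0}
```

For user 111 the macbook scores 4 (two shared characteristics with each of the two purchases) and the galaxy scores 2. Products the user already owns also receive a score, just as they would in the Cypher query, so they still need to be filtered out afterwards.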
Another approach, which you can combine with this one, is to find people who bought the same items as the user and see what else they bought. For that you create another engine with the following query:
MATCH (user:User {id: 111})-[:PURCHASED]->(product)
WITH DISTINCT product, user
MATCH (product)<-[:PURCHASED]-(collab)
WHERE collab <> user
MATCH (collab)-[:PURCHASED]->(reco)
RETURN reco, count(*) AS score
When using those two engines, GraphAware Reco will automatically combine the scores from each engine into one.
You can find an example of a CypherEngine in the tests : https://github.com/graphaware/neo4j-reco/blob/master/src/test/java/com/graphaware/reco/neo4j/engine/CypherEngineTest.java
You can also add a blacklist for not recommending items the user already bought.
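The collaborative engine, the score combination and the blacklist can be sketched together in Python. Everything here is an assumption made for illustration: the purchase data, the function names, and in particular the choice to combine engines by plain summation, since the answer does not say how GraphAware Reco merges scores internally.

```python
from collections import Counter

# Hypothetical purchase history: user id -> set of products
purchased = {
    111: {"iphone", "ipad"},
    222: {"iphone", "airpods"},
    333: {"ipad", "airpods", "kindle"},
    444: {"novel"},
}


def collaborative_scores(user_id):
    """Count how often co-purchasers of the user's products bought each item."""
    scores = Counter()
    for product in purchased[user_id]:
        for collab, basket in purchased.items():
            if collab != user_id and product in basket:
                scores.update(basket)  # one point per (product, collab, reco) path
    return scores


def recommend(user_id, other_engine_scores):
    """Sum both engines' scores (an assumption) and drop already-owned products."""
    total = collaborative_scores(user_id) + Counter(other_engine_scores)
    blacklist = purchased[user_id]
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(p, s) for p, s in ranked if p not in blacklist]
```

Calling `recommend(111, {"macbook": 4, "kindle": 1})` puts the macbook first with 4 points, followed by airpods and kindle with 2 points each, while the iphone and ipad are removed by the blacklist.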
As I said, this is a first step. If you have a big catalog with lots of purchases, you might consider doing background computations (e.g. computing similarity between products and only relating each product to its top k nearest neighbours, and likewise computing similarity between purchase histories to relate similar users to each other).
GraphAware-Reco offers facilities for background computation jobs. GraphAware-Reco-Enterprise (not open-sourced) comes with predefined algorithms for computing similarity between items, as well as an Apache Spark integration for moving the similarity computation outside of the Neo4j JVM and writing the results/relationships back to Neo4j.

Fuzzy neo4j relationships

I want to do something in neo4j that I hope will work ok: I want to make "fuzzy" path matches; the links will sometimes count as a relationship, and sometimes not, depending on the query.
Here's an example: let's say I have a (p:Person)-[:HAS]->(n:Name). A search has found a Person (say, by phone number). I want to go from this Person to other Persons with similar names, to get their phone numbers. Also, I want the similarity to be adjustable, so the user might ask to match very similar names, or not very similar names.
I could get the first person's name, and then do a search against other names with some lucene patterns - this is easy enough, but it means doing a full lucene search on the Name values, which in my use case is not ideal as I think it might be a bit slow (there are very many names - let's say a billion, remembering this is just an example). I hope there is a better way.
One approach I can imagine is having a "similarity" relationship between Names. Whenever a new Name node is added, we check for similar names and link them (creating these relationships would be slow, but we could push it onto a batch process, and it's ok if it takes some minutes). We would only link names that were fairly similar (so the number of links would hopefully not get too large). I suppose we could then craft a query on this, matching similarities greater than my threshold. Something like this:
MATCH (p1:Person {phone:"555-234234"})-->(n1:Name)-[s:SIMILAR]->(n2:Name)-->(p2:Person)
WHERE s.matchLevel >=2
RETURN p2.phone;
Is this approach better or worse than just doing the lucene search? Has anyone else wanted to do something like this?
Also, based on the suggestion at http://graphaware.com/neo4j/2013/10/24/neo4j-qualifying-relationships.html, I believe I'll be better off having many relationships (SIMILAR_1, SIMILAR_2 ..) instead of using a "match level" attribute on my relationship.
BTW, I know there are many similar questions to this (eg. Neo4j 2 Cypher fuzzy search), but afaik this exact question isn't on stackoverflow (and I have looked).
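The batch-linking idea described in this question can be sketched in Python to show the trade-off. This is only an illustration under assumptions: `difflib.SequenceMatcher` stands in for whatever similarity measure would really be used, and the thresholds in `LEVELS` and all function names are made up. The resulting levels could be stored either as a matchLevel property or as separate SIMILAR_1, SIMILAR_2, ... relationship types.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical thresholds: similarity ratio needed for each match level
LEVELS = [(0.9, 3), (0.75, 2), (0.6, 1)]


def match_level(a, b):
    """Map a 0..1 similarity ratio onto a coarse level (0 means 'do not link')."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    for threshold, level in LEVELS:
        if ratio >= threshold:
            return level
    return 0


def build_similar_edges(names):
    """Batch step: compare every pair once and keep only the similar ones."""
    edges = {}
    for a, b in combinations(sorted(names), 2):
        level = match_level(a, b)
        if level:
            edges[(a, b)] = level
    return edges


def similar_to(name, edges, min_level):
    """Query step: neighbours of `name` whose link meets the requested level."""
    return sorted(
        (b if a == name else a)
        for (a, b), level in edges.items()
        if name in (a, b) and level >= min_level
    )
```

The query step is a cheap lookup over precomputed links, which is the attraction of the SIMILAR relationship. The batch step as written compares every pair of names, which is quadratic and would not be feasible for a billion names without first restricting which pairs get compared.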

Handling long path patterns in neo4j

My database contains hotels, reviews of hotels, terms (i.e. words) in reviews and topics (e.g. there could be a topic talking "Staff" containing terms describing the hotel staff) as nodes. Indices on all nodes are present. Relationships as follows: Hotel<--Review-->Term-->Topic
I am currently trying to find an efficient way of querying for topics that have paths to two or more specified hotels. In other words, I am interested in the common topics of two hotels. If hotel A has paths to topics 1,2,3 and hotel B has paths to topics 2,3,4 then the result should be 2,3.
I tried the following below but this seems very inefficient which is very likely due to the amount of possible paths between hotels and topics. Basically each word in a review could create a new path that has to be checked.
// show all topics that two hotels have in common
MATCH (h2:Hotel)<--(r2:Review)-->(t2:Term)-->(to:Topic)<--(t1:Term)<--(r1:Review)-->(h1:Hotel)
WHERE h1.id IN ["id1","id2"] AND h2.id IN ["id1","id2"] AND NOT h1.id=h2.id
RETURN h1.id,to.topic, count(to) AS topic_mentions
I am wondering if there's a faster way of dealing with this, if I were to implement this in java or similar language I'd probably try doing a BFS starting at each hotel and then taking the overlap of what I find. I am fairly certain that adding the transitive edges as direct edges Hotel-->Topic would speed this up, but my limited database design knowledge told me that this might be unnecessarily redundant and not a good practice?
I tried to do the id matching before the pattern matching with another MATCH and WITH clause, but this didn't speed anything up; I think the problem really lies in the pattern matching itself.
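The "traverse from each hotel, then take the overlap" idea mentioned above can be sketched in Python. The dictionaries below are hypothetical stand-ins for the Hotel<--Review-->Term-->Topic structure, and the function names are invented for illustration.

```python
# Hypothetical adjacency data mirroring Hotel<--Review-->Term-->Topic
reviews_of = {"A": ["r1", "r2"], "B": ["r3"]}
terms_in = {"r1": ["friendly", "clean"], "r2": ["pool"], "r3": ["clean", "pool", "noisy"]}
topic_of = {"friendly": "Staff", "clean": "Room", "pool": "Facilities", "noisy": "Location"}


def topics_for(hotel):
    """Walk hotel -> reviews -> terms -> topics and collect each topic once."""
    return {
        topic_of[term]
        for review in reviews_of.get(hotel, [])
        for term in terms_in.get(review, [])
        if term in topic_of
    }


def common_topics(*hotels):
    """Intersect the per-hotel topic sets."""
    topic_sets = [topics_for(h) for h in hotels]
    return set.intersection(*topic_sets) if topic_sets else set()
```

Each hotel is expanded on its own and reduced to a set of distinct topics before the sets are compared, so the work grows with the sum of the two hotels' paths rather than with the number of path combinations that meet at a topic, which is what the single long pattern has to enumerate.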
I created something similar for searching KBs, and a direct relationship between Hotels and Topics will make this search dead easy and faster. For example, to search for all topics that more than one hotel has in common, you'd use:
MATCH (h1:Hotel)-[:TOPIC]->(t:Topic)
MATCH (h2:Hotel)-[:TOPIC]->(t:Topic)
WHERE h1 <> h2
RETURN h1.id, h2.id, t.topic, count(t) AS topic_mentions
Note that this returns one row per shared topic for every ordered pair of hotels, so each pair shows up twice (once in each direction), which may or may not be what you want.
"I am fairly certain that adding the transitive edges as direct edges Hotel-->Topic would speed this up, but my limited database design knowledge told me that this might be unnecessarily redundant and not a good practice?"
All that would be doing is making an implicit relationship explicit, which is one of the things that make graph DBs so powerful. There is the maintenance aspect to be concerned about: if someone updates the words in a review, you have to make sure that the (hotel)-[:TOPIC]->(topic) relationships are still valid. But you'd have to do that in your original design anyway, so no loss there.
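The maintenance concern can be illustrated with a small Python sketch. The data layout and the names `refresh_hotel` and `update_review` are assumptions for illustration only; the point is that the derived hotel-to-topic links are recomputed whenever a review changes.

```python
from collections import Counter

# Hypothetical source data: hotel -> review -> list of terms
reviews = {"A": {"r1": ["friendly", "clean"], "r2": ["clean"]}}
topic_of = {"friendly": "Staff", "clean": "Room", "pool": "Facilities"}

# Derived data: hotel -> Counter of topic mentions (the direct TOPIC edges)
hotel_topics = {}


def refresh_hotel(hotel):
    """Recompute the derived topic edges for one hotel from its reviews."""
    counts = Counter()
    for terms in reviews.get(hotel, {}).values():
        counts.update(topic_of[t] for t in terms if t in topic_of)
    hotel_topics[hotel] = counts


def update_review(hotel, review_id, terms):
    """Change a review, then refresh the derived edges so they stay valid."""
    reviews.setdefault(hotel, {})[review_id] = terms
    refresh_hotel(hotel)
```

Recomputing the whole hotel on every change is the simplest policy to reason about. An incremental update that only adjusts the affected topics would do less work but is easier to get wrong.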

Why do relationships as a concept exist in neo4j or graph databases in general?

I can't seem to find any discussion on this. I had been imagining a database that was schemaless, node-based and hierarchical, and one day I decided it was too common-sense not to exist, so I started searching around, and neo4j is about 95% of what I imagined.
What I didn't imagine was the concept of relationships. I don't understand why they are necessary. They seem to add a ton of complexity to all topics centered around graph databases, but I don't quite understand what the benefit is. Relationships seem to be almost exactly like nodes, except more limited.
To explain what I'm thinking, I was imagining starting a company, so I create myself as my first nodes:
create (u:User { name:"mindreader"});
create (c:Company { name:"mindreader Corp"});
One day I get a customer, so I put his company into my db.
create (c:Company { name:"Customer Company"});
create (u:User { name:"Customer Employee1" });
create (u:User { name:"Customer Employee2"});
I decide to link users to their customers
match (u:User) where u.name =~ "Customer.*"
match (c:Company) where c.name =~ "Customer.*"
create (u)-[:Employee]->(c);
match (u:User) where u.name = "mindreader"
match (c:Company) where c.name =~ "mindreader.*"
create (u)-[:Employee]->(c);
Then I hire some people:
match (c:Company) where c.name =~ "mindreader.*"
create (u1:User { name:"Employee1"})-[:Employee]->(c)
create (u2:User { name:"Employee2"})-[:Employee]->(c);
One day hr says they need to know when I hired employees. Okay:
match (c:Company)<-[r:Employee]-(u:User)
where c.name =~ "mindreader.*" and u.name =~ "Employee.*"
set r.hiredate = '2013-01-01';
Then hr comes back and says hey, we need to know which person in the company recruited a new employee so that they can get a cash reward for it.
Well now what I need is for a relationship to point to a user but that isn't allowed (:Hired_By relationship between :Employee relationship and a User). We could have an extra relationship :Hired_By, but if the :Employee relationship is ever deleted, the hired_by will remain unless someone remembers to delete it.
What I could have done in neo4j was just have a
(u:User)-[:hiring_info]->(hire_info:HiringInfo)-[:hired_by]->(recruiter:User)
In which case the relationships only confer minimal information, the name.
What I originally envisioned was that there would be nodes, and then each property of a node could be a datatype or it could be a pointer to another node. In my case, a user record would end up looking like:
User {
  name: "Employee1"
  hiring_info: {
    hire_date: "2013-01-01"
    hired_by: u:User # -> would point to a user
  }
}
Essentially it is still a graph. Nodes point to each other. The name of the relationship is just a field in the origin node. To query it you would just go
match (u:User) where ... return u.name, u.hiring_info.hire_date, u.hiring_info.hired_by.name
If you needed a one to many relationship of the same type, you would just have a collection of pointers to nodes. If you referenced a collection in return, you'd get essentially a join. If you delete hiring_info, it would delete the pointer. References to other nodes would not have to be a disorganized list at the toplevel of a node. Furthermore when I query each user I will know all of the info about a user without both querying for the user itself and also all of its relationships. I would know his name and the fact that he hired someone in the same query. From the database backend, I'm not sure much would change.
I see quite a few questions from people asking whether they should use nodes or relationships to model this or that, and occasionally people asking for a relationship between relationships. It feels like the XML problem, where you wonder whether a piece of information should be its own tag or just a property of its parent tag.
The query engine goes to great pains to handle relationships, so there must be some huge advantage to having them, but I can't quite see it.
Different databases are for different things. You seem to be looking for a noSQL database.
This is an extremely wide topic area that you've reached into, so I'll give you the short of it. There's a spectrum of database schemas, each of which has different use cases.
NoSQL aka Non-relational Databases:
Every object is a single document. You can have references to other documents, but any additional traversal means you're making another query. Use these when your data rarely has relationships and you usually just want to query once and get back a large amount of flexibly stored data as the returned document. (Note: these are not "nodes". Nodes have a very specific definition that implies there are edges.)
SQL aka Relational Databases:
This is table land, this is where foreign keys and one-to-many relationships come into play. Here you have strict schemas and very fast queries. This is honestly what you should use for your user example. Small amounts of data where the relationships between things are shallow (You don't have to follow a relationship more than 1-2 times to get to the relevant entry) are where these excel.
Graph Database:
Use this when relationships are key to what you're trying to do. The most common example of a graph is something like a social graph, where you're connecting different users together and need to follow relationships for many steps (for instance, figuring out whether two people are connected within a depth of 4).
Relationships exist in graph databases because that is the entire concept of a graph database. It doesn't really fit your application, but to be fair you could just keep more in the node part of your database. In general the whole idea of a database is something that lets you query a LOT of data very quickly. Depending on the intrinsic structure of your data there are different ways that that makes sense. Hence the different kinds of databases.
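The "connected within a depth of 4" example is the kind of question that amounts to following relationships step by step. A minimal Python sketch of that traversal, using a hypothetical in-memory friendship graph and an invented function name, looks like this:

```python
from collections import deque

# Hypothetical undirected friendship graph
friends = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
    "d": {"c", "e"}, "e": {"d", "f"}, "f": {"e"},
}


def connected_within(start, goal, max_depth=4):
    """Breadth-first search that gives up after max_depth hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        if person == goal:
            return True
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for friend in friends.get(person, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, depth + 1))
    return False
```

In this chain, "a" reaches "e" in exactly four hops, so `connected_within("a", "e")` is true, while "f" is five hops away and is not found. In a document store each hop would typically be a separate query, and in a relational store each hop would typically be another join.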
In strongly connected graphs, Neo4j is 1000x faster on 1000x the data than a SQL database. NoSQL would probably never be able to perform in a strongly connected graph scenario.
Take a look at what we're building right now: http://vimeo.com/81206025
Update: In reaction to mindreader's comment, we added the related properties to the picture:
RDBM systems are tabular and put more information in the tables than the relationships. Graph databases put more information in relationships. In the end, you can accomplish much the same goals.
However, putting more information in relationships can make queries smaller and faster.
Here's an example:
Graph databases are also good at storing human-readable knowledge representations, being edge (relationship) centric. RDF takes it one step further, where all information is stored as edges rather than nodes. This is ideal for working with predicate logic, propositional calculus, and triples.
Maybe the right answer is an object database.
Objectivity/DB, which now supports a full suite of graph database capabilities, allows you to design complex schema with one-to-one, one-to-many, many-to-one, and many-to-many reference attributes. It has the semantics to view objects as graph nodes and edges. An edge can be just the reference attribute from one node to another or an edge can exist as an edge object that sits between two nodes.
An edge object can have any number of attributes and can have references off to other objects, as shown in the diagram below.
Being able to "hang" complex objects off of an edge allows Objectivity/DB to support weighted queries where the edge weight can be calculated using a user-defined weight calculator operator. The weight calculator operator can build the weight from a static attribute on the edge or by digging down through the objects connected to the edge. In the picture above, we could create an edge-weight calculator that computes the sum of the CallDetail lengths connected to the Call edge.
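The general idea of an edge object with a computed weight can be sketched in plain Python. This is not Objectivity/DB's API: the classes `CallDetail` and `CallEdge`, the field `length_seconds` and both functions are hypothetical and only illustrate a weight that is derived from objects attached to an edge.

```python
from dataclasses import dataclass, field


@dataclass
class CallDetail:
    length_seconds: int


@dataclass
class CallEdge:
    """An edge object between two nodes with detail objects hanging off it."""
    caller: str
    callee: str
    details: list = field(default_factory=list)


def total_call_length(edge):
    """Weight calculator: sum of the CallDetail lengths attached to the edge."""
    return sum(d.length_seconds for d in edge.details)


def heaviest_edge(edges, weight=total_call_length):
    """Pick the edge with the largest computed weight."""
    return max(edges, key=weight)
```

Because the weight is a function rather than a stored number, a different calculator (for example, the number of calls instead of their total length) could be passed to `heaviest_edge` without changing the stored data.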

Resources