I'm attempting to create a simple Twitter-esque "follower / friend" graph using Neo4J and Python. The graph would look something like
user_1 FOLLOWS user_2
user_1 FOLLOWS user_3
user_2 FOLLOWS user_1
After a day of reading I thought it best to dive straight in using the REST interface and, since I'm using Python, py2neo. Here is my code:
from py2neo import neo4j

def main():
    g = neo4j.GraphDatabaseService()

    # Create an index for our user nodes
    index = g.get_or_create_index(neo4j.Node, "user")

    # Create a single node, User 1
    node = index.get_or_create("user", "User_1", {"id": "User_1"})

    # Populate the graph with some more users just for testing
    nodes = []
    for user in ["User_2", "User_3", "User_4", "User_5"]:
        nodes.append(index.get_or_create("user", user, {"id": user}))

    # Create a relationship between User_1 and User_2
    g.get_or_create_relationships((node, "FOLLOWS", nodes[0]))

if __name__ == '__main__':
    main()
As you can see, I'm using get_or_create_relationships to prevent duplicate relationships, and I'm assuming this will incur some kind of overhead when adding thousands of nodes.
Using a straight node.create_relationship_to(nodes[0], "FOLLOWS") seems to create duplicate relationships each time the script is run, which confuses me slightly as a graph DB newbie since the relationship is exactly the same.
The likelihood of creating duplicate relationships is very low, but if it did happen, would it cause issues with graph traversal? Should I be indexing my FOLLOWS relationships with some kind of uniqueness function?
I would use Cypher's CREATE UNIQUE to create a FOLLOWS relationship only if none exists yet, see http://docs.neo4j.org/chunked/milestone/query-create-unique.html
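For example, a minimal sketch of what that could look like, assuming the legacy "user" node index and the key/value pairs used in the code above (adjust the START lookup to however your nodes are actually indexed):

START a = node:user(user = "User_1"),
      b = node:user(user = "User_2")
CREATE UNIQUE a-[:FOLLOWS]->b
RETURN a, b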
Would that work?
Related
I have written some code in Python to obtain the list of my followers and the users I follow on Twitter. Once I have this information, I create nodes and relationships in Neo4j with py2neo by looping over the lists of followers and following users that I obtained.
The code seems to work fine, however not all nodes and relationships are created. I am trying to generate about 170 nodes, but only around 25 are created.
I am wondering if there is any kind of connection limit, uploading threshold or anything else that might be causing the problem.
I am using Python 3.6, py2neo 3.1.2 and neo4j Community distribution 3.1.3.
I am not a python expert, so please forgive my code:
import py2neo
from py2neo import Graph
from py2neo import Node, Relationship
from py2neo import authenticate
import tweepy
import time
auth = tweepy.OAuthHandler('...', '...')
auth.set_access_token('...', '...')
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
user = api.get_user(myUser)
def getFollowers(user_screen_name):
    follower_ids = []
    for page in tweepy.Cursor(api.followers, screen_name=user_screen_name).pages():
        time.sleep(60)
        follower_ids.extend(page)
    return follower_ids

def getFollowing(user_screen_name):
    following_ids = []
    for page in tweepy.Cursor(api.friends, screen_name=user_screen_name).pages():
        time.sleep(60)
        following_ids.extend(page)
    return following_ids

def createNode(screen_name):
    node = Node("User", screen_name=screen_name)
    gf.merge(node)
    return

def createRelationship(nodeA, nodeB, relationship):
    # creates relationship and nodes (if not existent)
    nodeA = Node("User", screen_name=nodeA)
    nodeB = Node("User", screen_name=nodeB)
    gf.merge(nodeA)
    gf.merge(nodeB)
    gf.merge(Relationship(nodeA, relationship, nodeB))
    return

authenticate("localhost:7474", myID, myPass)
gf = Graph()

# First time graph is created
gf.run("CREATE CONSTRAINT ON (u:User) ASSERT u.screen_name IS UNIQUE")

createNode(myUser)
user_followers = getFollowers(myUser)
user_following = getFollowing(myUser)

for followers in user_followers:
    createRelationship(followers.screen_name, sc, "FOLLOWS")
i = 1
for following in user_following:
    createRelationship(sc, following.screen_name, "FOLLOWS")
I cannot think of any reason why this would not work, but I believe it is more a problem related to Neo4j than to the code itself.
Any help would be very much appreciated,
Thanks in advance
I'm going to preface this by saying that I am a total database pleb. I have zero experience with any form of database, so I know that I'm in way over my head.
Background: I do Active Directory consulting for my company, so I routinely look at clients' group memberships on their Active Directory accounts. Currently, I have a PowerShell script that runs my analytics, however, I'm finding that it takes way too long in larger organizations. I'm thinking "There has to be a better way", so I have jumped into looking at databases. Neo4j seems to be a good possible solution, as I should be able to link a user account or group as a member of another group. However, after browsing documentation and forums, I have no idea how to create those links.
I have two CSVs that I have successfully imported with the following information:
Users = DistinguishedName, SAMACCOUNTNAME, MemberOf
Groups = DistinguishedName, SAMACCOUNTNAME, MemberOf, Members
What I want to do is match a string from all users and groups (DistinguishedName) to a string in the group node's Members property. Members is a concatenated string of all DistinguishedNames (whether user or group). So if a node's DistinguishedName matches part of a string in a group's Members property, I want to build a one-way relationship like so:
user -[memberof] - > group
The best I could come up with after racking my brain is the following code, but I have no idea if I'm even close:
Match(n)
Match(u:user) WHERE n.Members CONTAINS u.DN
Create (u)-[MS:Memberof]->((match)})
In PowerShell, I know how I would accomplish this (loosely translated to relate to the NEO4J world):
$groups = (all-groups)
$AllUsersAndGroups = (all-objs)
foreach ($line in $groups) {
    $line.relationship = $line | where {$_.members -contains $AllUsersAndGroups.DistinguishedName}
}
So, alas, I'm stuck right now. I will continue to look into it, but I figured I would ask the community as you guys have the experience and stuff.
Here is an example of how you should have imported your data (notice that the redundant Members column is not actually needed):
Import (in batches of 5000, to avoid resource issues) each user, and create a unique relationship to its group:
USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///users.csv" AS row
MERGE (u:User {DistinguishedName: row.DistinguishedName, SAMACCOUNTNAME: row.SAMACCOUNTNAME})
MERGE (g:Group {DistinguishedName: row.MemberOf})
MERGE (u)-[:Memberof]->(g);
Import each group, and create a unique relationship to its parent group, if any:
USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///groups.csv" AS row
MERGE (g1:Group {DistinguishedName: row.DistinguishedName, SAMACCOUNTNAME: row.SAMACCOUNTNAME})
MERGE (g2:Group {DistinguishedName: row.MemberOf})
MERGE (g1)-[:Memberof]->(g2);
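Depending on the size of the directory, it may also help to create uniqueness constraints (or at least indexes) on the DistinguishedName properties before running the two imports above, so each MERGE can look nodes up via an index instead of scanning every node. A sketch, run once before the imports:

CREATE CONSTRAINT ON (u:User) ASSERT u.DistinguishedName IS UNIQUE;
CREATE CONSTRAINT ON (g:Group) ASSERT g.DistinguishedName IS UNIQUE;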
I have the following python code to make a graph in neo4j. I am using py2neo version 2.0.3.
import json
from py2neo import neo4j, Node, Relationship, Graph
graph = neo4j.Graph("http://localhost:7474/db/data/")
with open("example.json") as f:
for line in f:
while True:
try:
file = json.loads(line)
break
except ValueError:
# Not yet a complete JSON value
line += next(f)
# Now creating the node and relationships
news, = graph.create(Node("Mainstream_News", id=unicode(file["_id"]), entry_url=unicode(file["entry_url"]),
title=unicode(file["title"]))) # Comma unpacks length-1 tuple.
authors, = graph.create(
Node("Authors", auth_name=unicode(file["auth_name"]), auth_url=unicode(file["auth_url"]),
auth_eml=unicode(file["auth_eml"])))
graph.create(Relationship(news, "hasAuthor", authors ))
I can create a graph with Mainstream_News and Authors nodes connected by a 'hasAuthor' relationship. My problem is that this creates a separate Authors node for every Mainstream_News node, but in reality one author has more than one Mainstream_News. I would like to use the auth_name property of the Authors nodes as an index to connect them with the Mainstream_News nodes. Any suggestions would be great.
You are creating a new Authors node each time through your loop, even if an Author node (with the same properties) already exists.
First, I think you should create uniqueness constraints on Authors(auth_name) and Mainstream_News(id), to enforce what seem to be your requirements. This only needs to be done once. A uniqueness constraint also creates an index for you automatically, which is a bonus.
graph.schema.create_uniqueness_constraint("Authors", "auth_name")
graph.schema.create_uniqueness_constraint("Mainstream_News", "id")
But you will probably have to empty out your DB first (at least of all Authors and Mainstream_News nodes and their relationships), since I presume it currently has a lot of duplicate nodes.
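If it helps, one way to do that cleanup is a single Cypher statement (a sketch; run it once via the Neo4j browser/shell or graph.cypher.execute, and only if that data can be rebuilt, since it deletes every Authors and Mainstream_News node along with their relationships):

MATCH (n)
WHERE n:Authors OR n:Mainstream_News
OPTIONAL MATCH (n)-[r]-()
DELETE r, n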
Then, you can use the merge_one and create_unique APIs to prevent duplicate nodes and relationships:
news = graph.merge_one("Mainstream_News", "id", unicode(file["_id"]))
news.properties["entry_url"] = unicode(file["entry_url"])
news.properties["title"] = unicode(file["title"])
authors = graph.merge_one("Authors", "auth_name", unicode(file["auth_name"]))
news.properties["auth_url"] = unicode(file["auth_url"])
news.properties["auth_eml"] = unicode(file["auth_eml"])
graph.create_unique(Relationship(news, "hasAuthor", authors))
This is what I normally do, as I find it easier to see what's happening. As far as I know there is a bug when you create_unique with only a Node, and there is no need to create the nodes separately when you also have to create an edge.
I don't have the database on this computer, so please bear with me if there are some typos, I'll correct them in the morning, but I guess you'd rather have a fast answer.. :-)
news = graph.cypher.execute_one('MATCH (m:Mainstream_News) '
                                'WHERE m.id = {id} '
                                'RETURN m',
                                {"id": unicode(file["_id"])})
if not news:
    news = Node("Mainstream_News")
    news.properties['id'] = unicode(file["_id"])
    news.properties['entry_url'] = unicode(file["entry_url"])
    news.properties['title'] = unicode(file["title"])

# You can make a for-loop here
authors = Node("Authors")
authors.properties['auth_name'] = unicode(file["auth_name"])
authors.properties['auth_url'] = unicode(file["auth_url"])
authors.properties['auth_eml'] = unicode(file["auth_eml"])

rel = Relationship(news, "hasAuthor", authors)
graph.create_unique(rel)
# For-loop should end here
I've included the first three lines to make it more generic. It returns a node object or None.
EDIT:
@cybersam's use of schema is cool, implement that too. I'll try to use it myself also.. :-)
You can read more about it here:
http://neo4j.com/docs/stable/query-constraints.html
http://py2neo.org/2.0/schema.html
I am analysing an author dataset and I would like to create a co-authorship graph from it. The author graph was created using Cypher, like this:
CREATE (N0{data:"2007-12-18", title:"ABC"}),
(N2 {data:"2007-10-20",title:"BBB"}),
(N3 {data:"2007-08-02",title:"CCC"}),
(N4 {name:"xxx"}),
(N5 {name:"yyy"}),
(N6 {name:"zzz"}),
N4-[R0:autor_de]->N0,
N5-[R1:autor_de]->N0,
N6-[R2:autor_de]->N2,
N5-[R3:autor_de]->N3;
I can't figure out how to create a new graph where the authors are linked by a new relationship such as "are_coauthors". Sorry if this is a very simple question; I know this can be solved using Java and (maybe) py2neo, but does someone have any hint?
In Cypher you can do something like the following to create a link between coauthors (assuming an autoindex on title):
start title=node:node_auto_index("title:*")
match a-[:autor_de]->title<-[:autor_de]-b
create unique a-[:coautores]-b
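If you would rather trigger this from Python than from the Neo4j shell, here is a minimal sketch that posts the same query to the legacy Cypher REST endpoint (http://localhost:7474/db/data/cypher, available in the Neo4j 1.x/2.x series, which is also what py2neo wraps); it uses Python 2's urllib2, and host/credentials are assumptions to adjust to your setup:

import json
import urllib2

query = """
start title=node:node_auto_index("title:*")
match a-[:autor_de]->title<-[:autor_de]-b
create unique a-[:coautores]-b
return count(*)
"""

# POST the statement to the legacy Cypher REST endpoint
req = urllib2.Request(
    "http://localhost:7474/db/data/cypher",
    data=json.dumps({"query": query, "params": {}}),
    headers={"Content-Type": "application/json"},
)
print(urllib2.urlopen(req).read())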
I'm using Neo4j in a community e-commerce site built in PHP, using the REST interface.
I need to get all the categories related to the search results, the way Amazon does. This feature is available in other engines like Solr (another implementation of Lucene) as Faceted Search.
How can I do a faceted search in Neo4j? Or what's the best way (performance-wise) to recreate this feature?
All the modules related to this feature are excluded from Neo4j's core package. I want to know if someone has tried to do something like this without traversing all the nodes in the graph, grabbing some properties and doing a groupCount of the values. With 200k nodes, the traversal took 10 seconds just to get the categories.
This is my Gremlin approach.
(new Neo4jVertexSequence(
g.getRawGraph().index().forNodes('products').query(
new org.neo4j.index.lucene.QueryContext('category:?')
), g
))._().groupBy{it.category}.cap.next();
Results in 90 rows and took 54 seconds.
Books = 12002
Movies_Music_Games = 19233
Electronics_Computers = 60540
Home_Garden_Tools = 9123
Grocery_Health_Beauty = 15643
Toys_Kids_Baby = 15099
Clothing_Shoes_Jewelry = 12543
Sports_Outdoors = 10342
Automotive_Industrial = 9638
... (more rows)
Of course, I can't just cache these results, because they are only for the "no input" search. If the user makes a query like "iphone", the query looks like:
(new Neo4jVertexSequence(
g.getRawGraph().index().forNodes('products').query(
new org.neo4j.index.lucene.QueryContext('search:"iphone" AND category:?')
), g
))._().groupBy{it.category}.cap.next();
What about your domain model? Did you just put everything in the index? Usually you would model your categories as nodes and have your products related to the category nodes.
(product)-[:HAS_CATEGORY]->(category)<-[:IS_CATEGORY]-(categories)
In your query you would just traverse this little tree and count the relationships of type :HAS_CATEGORY starting from each category node.
start categories=node(x)
match (product)-[:HAS_CATEGORY]->(category)<-[:IS_CATEGORY]-(categories)
return category.name, count(*)
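For the "iphone" case from the question, you could combine the existing legacy index lookup with the same traversal, so only matching products are counted per category. A sketch, assuming the products index and search field from the question's Gremlin code:

start product=node:products("search:iphone")
match (product)-[:HAS_CATEGORY]->(category)
return category.name, count(*)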