I just started learning py2neo and Neo4j, and I'm having a problem with duplicates. I'm writing a simple Python script that will build a database of scientific papers and authors. I only need to add the paper and author nodes and their relationships. I was using this code, which works fine but is very slow:
paper = Node('Paper', id=post_id)
graph.merge(paper)
paper['created_time'] = created_time
graph.push(paper)
for author_id, author_name in paper_dict['authors']:
    researcher = Node('Person', id=author_id)
    graph.merge(researcher)
    researcher['name'] = author_name
    graph.push(researcher)
    wrote = Relationship(researcher, 'author', paper)
    graph.merge(wrote)
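(An aside, not from the original post: merge without an index has to scan every node carrying the label, so much of the slowness may come from that. A sketch of the one-time constraint setup, using py2neo's schema API:)

# One-time setup: a uniqueness constraint also creates an index on the key,
# so graph.merge() can match by index lookup instead of a label scan.
graph.schema.create_uniqueness_constraint('Paper', 'id')
graph.schema.create_uniqueness_constraint('Person', 'id')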
So, in order to write multiple relationships at once, I'm trying to use a transaction. My problem is that if I run this multiple times for the same papers and authors, it treats them as different entities and duplicates each node and relationship in the database (I tried running the script multiple times). The same doesn't happen with the previous code.
This is the code that uses transactions:
tx = graph.begin()
paper = Node('Paper', id=post_id)
paper['created_time'] = created_time
tx.create(paper)
for author_id, author_name in paper_dict['authors']:
    researcher = Node('Person', id=author_id)
    researcher['name'] = author_name
    tx.create(researcher)
    wrote = Relationship(researcher, 'author', paper)
    tx.create(wrote)
tx.commit()
I believe you should use the merge function, not the create function, to avoid duplicates.
Consider the following source code:
import py2neo
from py2neo import Graph, Node, Relationship

def authenticateAndConnect():
    py2neo.authenticate('localhost:7474', 'user', 'password')
    return Graph('http://localhost:7474/default.graphdb/data/')

def createData():
    graph = authenticateAndConnect()
    tx = graph.begin()
    movie = Node('Movie', title='Answer')
    personDictionary = [{'name': 'Dan', 'born': 2001}, {'name': 'Brown', 'born': 2001}]
    for i in range(10):
        for entry in personDictionary:
            person = Node('Person', name=entry['name'], born=entry['born'])
            tx.merge(person)
            # merging the relationship also merges its endpoint nodes,
            # so the movie node is covered here as well
            actedIn = Relationship(person, 'ACTED_IN', movie)
            tx.merge(actedIn)
    tx.commit()

if __name__ == '__main__':
    for i in range(10):
        createData()
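If re-running still duplicates nodes, one subtlety is that merge on an unbound node matches on its whole property set. A minimal sketch of pinning the merge to an explicit key (assuming py2neo v3's Transaction.merge(subgraph, primary_label, primary_key) signature), applied to the transactional code from the question:

tx = graph.begin()
paper = Node('Paper', id=post_id, created_time=created_time)
tx.merge(paper, 'Paper', 'id')  # match existing :Paper nodes on id only
for author_id, author_name in paper_dict['authors']:
    researcher = Node('Person', id=author_id, name=author_name)
    tx.merge(researcher, 'Person', 'id')  # match on id only
    tx.merge(Relationship(researcher, 'author', paper))
tx.commit()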
I'm currently running the following query to update the properties on two nodes and their relationships.
I'd like to be able to update 1,000 nodes and the corresponding relationships in one query.
MATCH (p1:Person)-[r1:OWNS_CAR]->(c1:Car) WHERE id(r1) = 3018
MATCH (p2:Person)-[r2:OWNS_CAR]->(c2:Car) WHERE id(r2) = 3019
SET c1.serial_number = 'SERIAL027436', c1.signature = 'SIGNATURE728934',
r1.serial_number = 'SERIAL78765', r1.signature = 'SIGNATURE749532',
c2.serial_number = 'SERIAL027436', c2.signature = 'SIGNATURE728934',
r2.serial_number = 'SERIAL78765', r2.signature = 'SIGNATURE749532'
This query becomes problematic when run at larger scale. Is there a better way?
Thank you.
You could work with LOAD CSV. Your input would contain the keys (not the internal ids; using those is not recommended) for Person and Car and whatever properties you need to set. For example:
personId, carId, serial_number, signature
00001, 00045, SERIAL78765, SIGNATURE728934
00002, 00046, SERIAL78665, SIGNATURE724934
Your query would then be something like:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///input.csv' AS row
MATCH (p:Person {personId: row.personId})-[r:OWNS_CAR]->(c:Car {carId: row.carId})
SET r.serial_number = row.serial_number, c.signature = row.signature
Note that you should have unique constraints on Person and Car to make that work. You can do thousands (even millions) like that very quickly ...
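For completeness, the constraints could be created like this (a sketch; the personId/carId property names are taken from the sample CSV above):

CREATE CONSTRAINT ON (p:Person) ASSERT p.personId IS UNIQUE;
CREATE CONSTRAINT ON (c:Car) ASSERT c.carId IS UNIQUE;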
Hope this helps,
Tom
I have written some Python code to obtain the lists of my followers and the users I follow on Twitter. Once I have this information, I create nodes and relationships in Neo4j with py2neo by looping over the lists of followers and followed users that I obtained.
The code seems to work fine; however, not all nodes and relationships are created. I am trying to generate about 170 nodes, but only around 25 are created.
I am wondering if there is any kind of connection limit, an upload threshold, or anything else that might be causing the problem.
I am using Python 3.6, py2neo 3.1.2 and the Neo4j Community distribution 3.1.3.
I am not a Python expert, so please forgive my code:
import py2neo
from py2neo import Graph
from py2neo import Node, Relationship
from py2neo import authenticate
import tweepy
import time

myUser = '...'  # my Twitter screen name (placeholder)

auth = tweepy.OAuthHandler('...', '...')
auth.set_access_token('...', '...')
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
user = api.get_user(myUser)

def getFollowers(user_screen_name):
    follower_ids = []
    for page in tweepy.Cursor(api.followers, screen_name=user_screen_name).pages():
        time.sleep(60)
        follower_ids.extend(page)
    return follower_ids

def getFollowing(user_screen_name):
    following_ids = []
    for page in tweepy.Cursor(api.friends, screen_name=user_screen_name).pages():
        time.sleep(60)
        following_ids.extend(page)
    return following_ids

def createNode(screen_name):
    node = Node("User", screen_name=screen_name)
    gf.merge(node)

def createRelationship(nodeA, nodeB, relationship):
    # creates the relationship and the nodes (if they don't exist)
    nodeA = Node("User", screen_name=nodeA)
    nodeB = Node("User", screen_name=nodeB)
    gf.merge(nodeA)
    gf.merge(nodeB)
    gf.merge(Relationship(nodeA, relationship, nodeB))

authenticate("localhost:7474", myID, myPass)
gf = Graph()

# First time the graph is created
gf.run("CREATE CONSTRAINT ON (u:User) ASSERT u.screen_name IS UNIQUE")

createNode(myUser)
user_followers = getFollowers(myUser)
user_following = getFollowing(myUser)

for follower in user_followers:
    # 'sc' in the original; presumably my own screen name
    createRelationship(follower.screen_name, myUser, "FOLLOWS")
for following in user_following:
    createRelationship(myUser, following.screen_name, "FOLLOWS")
I cannot think of any reason why this would not work, but I believe it is more a problem related to Neo4j than to the code itself.
Any help would be very much appreciated.
Thanks in advance
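One way to rule out partial writes (an aside, assuming py2neo v3's Transaction.merge(subgraph, primary_label, primary_key) signature) is to batch all the merges into a single transaction: commit() is then one atomic round-trip, and any server-side failure surfaces as a single exception instead of disappearing among hundreds of individual merge calls.

tx = gf.begin()
me = Node("User", screen_name=myUser)
tx.merge(me, "User", "screen_name")
for follower in user_followers:
    node = Node("User", screen_name=follower.screen_name)
    tx.merge(node, "User", "screen_name")
    tx.merge(Relationship(node, "FOLLOWS", me))
tx.commit()  # atomic: all relationships are written, or none are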
I know how to get a sub-graph by using a Cypher query.
But since I use the py2neo.ogm model, I'd like to know how to get a sub-graph using the OGM. For example:
class Company(GraphObject):
    __primarykey__ = "firm_name"

    firm_name = Property()
    shareHolder = RelatedFrom("Company", "hold_by")
I have already created the relationships between companies. I want to get the sub-graph of a given company. I checked the py2neo documentation, and there seems to be no example...
Can anyone help?
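For what it's worth, a minimal OGM sketch (with a made-up firm name) walks one level of the hold_by relationships, but that is not a whole sub-graph:

company = Company.select(graph, "Acme Holdings").first()  # made-up firm name
for holder in company.shareHolder:
    print(holder.firm_name)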
Best regards
The following source code (partly copied from the py2neo v3 OGM docs) produces the movie titles listed below when run against a Neo4j Community Edition instance with the movies sample loaded (:play movies):
Something's Gotta Give
Johnny Mnemonic
The Replacements
The Matrix Reloaded
The Matrix Revolutions
The Matrix
The Devil's Advocate
A Few Good Men
Apollo 13
Frost/Nixon
A Few Good Men
Stand By Me
A Few Good Men
Top Gun
Jerry Maguire
import py2neo
from py2neo import Graph
from py2neo.ogm import GraphObject, Property, RelatedFrom, RelatedTo

class Movie(GraphObject):
    __primarykey__ = "title"

    title = Property()
    tag_line = Property("tagline")
    released = Property()

    actors = RelatedFrom("Person", "ACTED_IN")
    directors = RelatedFrom("Person", "DIRECTED")
    producers = RelatedFrom("Person", "PRODUCED")

class Person(GraphObject):
    __primarykey__ = "name"

    name = Property()
    born = Property()

    acted_in = RelatedTo(Movie)
    directed = RelatedTo(Movie)
    produced = RelatedTo(Movie)

def authenticateAndConnect():
    # Authenticate the user using py2neo.authenticate
    py2neo.authenticate('localhost:7474', '<username>', '<password>')
    # Connect to the graph and return the Graph instance
    return Graph('http://localhost:7474/default.graphdb/data/')

def foo():
    graph = authenticateAndConnect()
    for person in list(Person.select(graph).where("_.name =~ 'K.*'")):
        for movie in person.acted_in:
            print(movie.title)

if __name__ == '__main__':
    foo()
I have two dataframes and I would like to join them based on one column, with the caveat that this column is a timestamp, and that timestamp has to be within a certain offset (5 seconds) in order to join the records. More specifically, a record in dates_df with date=1/3/2015:00:00:00 should be joined with events_df with time=1/3/2015:00:00:01, because both timestamps are within 5 seconds of each other.
I'm trying to get this logic working with PySpark, and it is extremely painful. How do people do joins like this in Spark?
My approach is to add two extra columns to dates_df that hold the lower_timestamp and upper_timestamp bounds with a 5-second offset, and then perform a conditional join. This is where it fails; more specifically:
joined_df = dates_df.join(events_df,
    dates_df.lower_timestamp < events_df.time < dates_df.upper_timestamp)
joined_df.explain()
explain() captures only the last part of the condition (Python's chained comparisons don't compose Spark Column expressions):
Filter (time#6 < upper_timestamp#4)
CartesianProduct
....
and it gives me a wrong result.
Do I really have to do a full-blown cartesian join for each inequality, removing duplicates as I go along?
Here is the full code:
from datetime import datetime, timedelta

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import udf

master = 'local[*]'
app_name = 'stackoverflow_join'

conf = SparkConf().setAppName(app_name).setMaster(master)
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

def lower_range_func(x, offset=5):
    return x - timedelta(seconds=offset)

def upper_range_func(x, offset=5):
    return x + timedelta(seconds=offset)

lower_range = udf(lower_range_func, TimestampType())
upper_range = udf(upper_range_func, TimestampType())

dates_fields = [StructField("name", StringType(), True), StructField("date", TimestampType(), True)]
dates_schema = StructType(dates_fields)
dates = [('day_%s' % x, datetime(year=2015, day=x, month=1)) for x in range(1, 5)]
dates_df = sqlContext.createDataFrame(dates, dates_schema)
dates_df.show()

# extend dates_df with time ranges
dates_df = dates_df.withColumn('lower_timestamp', lower_range(dates_df['date'])).\
                    withColumn('upper_timestamp', upper_range(dates_df['date']))

event_fields = [StructField("time", TimestampType(), True), StructField("event", StringType(), True)]
event_schema = StructType(event_fields)
events = [(datetime(year=2015, day=3, month=1, second=3), 'meeting')]
events_df = sqlContext.createDataFrame(events, event_schema)
events_df.show()

# finally, join the data
joined_df = dates_df.join(events_df,
    dates_df.lower_timestamp < events_df.time < dates_df.upper_timestamp)
joined_df.show()
I get the following output:
+-----+--------------------+
| name| date|
+-----+--------------------+
|day_1|2015-01-01 00:00:...|
|day_2|2015-01-02 00:00:...|
|day_3|2015-01-03 00:00:...|
|day_4|2015-01-04 00:00:...|
+-----+--------------------+
+--------------------+-------+
| time| event|
+--------------------+-------+
|2015-01-03 00:00:...|meeting|
+--------------------+-------+
+-----+--------------------+--------------------+--------------------+--------------------+-------+
| name| date| lower_timestamp| upper_timestamp| time| event|
+-----+--------------------+--------------------+--------------------+--------------------+-------+
|day_3|2015-01-03 00:00:...|2015-01-02 23:59:...|2015-01-03 00:00:...|2015-01-03 00:00:...|meeting|
|day_4|2015-01-04 00:00:...|2015-01-03 23:59:...|2015-01-04 00:00:...|2015-01-03 00:00:...|meeting|
+-----+--------------------+--------------------+--------------------+--------------------+-------+
I ran the Spark SQL query with explain() to see how it is done, and replicated the same behavior in Python. First, here is how to do the same with Spark SQL:
dates_df.registerTempTable("dates")
events_df.registerTempTable("events")
results = sqlContext.sql("SELECT * FROM dates INNER JOIN events ON dates.lower_timestamp < events.time and events.time < dates.upper_timestamp")
results.explain()
This works, but the question was about how to do it in Python, so the solution seems to be a plain join followed by two filters:
joined_df = dates_df.join(events_df).filter(dates_df.lower_timestamp < events_df.time).filter(events_df.time < dates_df.upper_timestamp)
joined_df.explain() yields the same plan as the Spark SQL results.explain(), so I assume this is how things are done.
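For reference (a sketch, not from the original answer), the same condition can also be passed to join directly as a single Column expression; & composes Column predicates, and each comparison needs its own parentheses:

joined_df = dates_df.join(
    events_df,
    (dates_df.lower_timestamp < events_df.time) &
    (events_df.time < dates_df.upper_timestamp))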
Although a year later, this might help others..
As you said, a full cartesian product is insane in your case. Your matching records will be close in time (within 5 seconds), so you can take advantage of that and save a lot of time if you first group records into buckets based on their timestamp, then join the two dataframes on that bucket, and only then apply the filter. Using that method causes Spark to use a SortMergeJoin rather than a CartesianProduct, which greatly boosts performance.
There is a small caveat here: you must match against both the bucket and the next one.
It's explained in more detail in my blog post, with working code examples (Scala + Spark 2.0, but you can implement the same in Python too...):
http://zachmoshe.com/2016/09/26/efficient-range-joins-with-spark.html
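A rough PySpark sketch of that bucketing idea (not from the blog post; the column names come from the question's code, and the bucket is made as wide as the whole [date - 5s, date + 5s] window so each date row needs to appear in only two buckets):

from pyspark.sql import functions as F

offset = 5                # seconds, as in the question
bucket_size = 2 * offset  # one bucket spans the whole match window

# Each event lands in exactly one bucket.
events_b = events_df.withColumn(
    "bucket", (F.unix_timestamp("time") / bucket_size).cast("long"))

# A window as wide as one bucket overlaps at most two consecutive buckets,
# so duplicate every date row into both candidates.
dates_b = dates_df.withColumn(
    "b_lo", ((F.unix_timestamp("date") - offset) / bucket_size).cast("long"))
dates_b = dates_b.withColumn(
    "bucket", F.explode(F.array(F.col("b_lo"), F.col("b_lo") + 1)))

# Equi-join on the bucket (SortMergeJoin), then apply the exact range filter.
joined = (dates_b.join(events_b, "bucket")
          .filter((events_b.time > dates_b.lower_timestamp) &
                  (events_b.time < dates_b.upper_timestamp)))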
I have the following Python code to make a graph in Neo4j. I am using py2neo version 2.0.3.
import json
from py2neo import Graph, Node, Relationship

graph = Graph("http://localhost:7474/db/data/")

with open("example.json") as f:
    for line in f:
        while True:
            try:
                file = json.loads(line)
                break
            except ValueError:
                # Not yet a complete JSON value
                line += next(f)

        # Now create the nodes and the relationship
        news, = graph.create(Node("Mainstream_News", id=unicode(file["_id"]),
                                  entry_url=unicode(file["entry_url"]),
                                  title=unicode(file["title"])))  # Comma unpacks length-1 tuple.
        authors, = graph.create(Node("Authors", auth_name=unicode(file["auth_name"]),
                                     auth_url=unicode(file["auth_url"]),
                                     auth_eml=unicode(file["auth_eml"])))
        graph.create(Relationship(news, "hasAuthor", authors))
I can create a graph with Mainstream_News and Authors nodes and a hasAuthor relationship. My problem is that this gives me one Mainstream_News node per Authors node, when in reality one author can have more than one Mainstream_News. I would like to use the auth_name property of the Authors nodes as an index, to connect them to the Mainstream_News nodes. Any suggestions would be great.
You are creating a new Authors node each time through your loop, even if an Authors node with the same properties already exists.
First, I think you should create uniqueness constraints on Authors(auth_name) and Mainstream_News(id), to enforce what seem to be your requirements. This only needs to be done once. A uniqueness constraint also creates an index for you automatically, which is a bonus.
graph.schema.create_uniqueness_constraint("Authors", "auth_name")
graph.schema.create_uniqueness_constraint("Mainstream_News", "id")
But you will probably have to empty out your DB first (at least of all Authors and Mainstream_News nodes and their relationships), since I presume it currently has a lot of duplicate nodes.
Then, you can use the merge_one and create_unique APIs to prevent duplicate nodes and relationships:
news = graph.merge_one("Mainstream_News", "id", unicode(file["_id"]))
news.properties["entry_url"] = unicode(file["entry_url"])
news.properties["title"] = unicode(file["title"])
news.push()  # persist the local property changes

authors = graph.merge_one("Authors", "auth_name", unicode(file["auth_name"]))
authors.properties["auth_url"] = unicode(file["auth_url"])
authors.properties["auth_eml"] = unicode(file["auth_eml"])
authors.push()

graph.create_unique(Relationship(news, "hasAuthor", authors))
This is what I normally do, as I find it easier to see what's happening. As far as I know there is a bug when you call create_unique with only a node, and there is no need to create the nodes separately when you also have to create an edge.
I don't have the database on this computer, so please bear with me if there are some typos; I'll correct them in the morning, but I guess you'd rather have a fast answer.. :-)
news = graph.cypher.execute_one('MATCH (m:Mainstream_News) '
                                'WHERE m.id = {id} '
                                'RETURN m', {'id': unicode(file["_id"])})
if not news:
    news = Node("Mainstream_News")
news.properties['id'] = unicode(file["_id"])
news.properties['entry_url'] = unicode(file["entry_url"])
news.properties['title'] = unicode(file["title"])

# You can make a for-loop here
authors = Node("Authors")
authors.properties['auth_name'] = unicode(file["auth_name"])
authors.properties['auth_url'] = unicode(file["auth_url"])
authors.properties['auth_eml'] = unicode(file["auth_eml"])

rel = Relationship(news, "hasAuthor", authors)
graph.create_unique(rel)
# The for-loop should end here
I've included the first three lines to make it more generic; execute_one returns a node object or None.
EDIT:
cybersam's use of schema is cool; implement that too. I'll try to use it myself also.. :-)
You can read more about it here:
http://neo4j.com/docs/stable/query-constraints.html
http://py2neo.org/2.0/schema.html