py2neo: why do we need to do push/pull/commit, etc.? Is that something we need to do in the actual Python program?

On the Introduction to Py2neo 2.0 page
http://py2neo.org/2.0/intro.html
why do we need Pushing & Pulling?
Also, why graph.cypher.begin() and commit(), as below?
tx = graph.cypher.begin()
for name in ["Alice", "Bob", "Carol"]:
    tx.append("CREATE (person:Person {name:{name}}) RETURN person", name=name)
alice, bob, carol = [result.one for result in tx.commit()]
friends = Path(alice, "KNOWS", bob, "KNOWS", carol)
graph.create(friends)
I used the small py2neo program below, and it also works (at least I can see the result on localhost:7474). Please explain the two different methods, thanks.
alice = Node("Person", name="Alice")
bob = Node("Person", name="Bob")
alice_knows_bob = Relationship(alice, "KNOWS", bob)
graph.create(alice_knows_bob)

Pushing and pulling are explicitly drawn out into methods to allow full control over network traffic. Earlier versions of the library sent and received changes to and from the server as they were made on the client. This was inefficient as separate HTTP requests were made for each individual property change etc. The current push/pull API is modelled to some extent on git by passing synchronisation control to the application developer.
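For example, with the py2neo 2.0 API the intro page describes, the flow looks roughly like this (a minimal sketch, assuming a default local server; the URL is a placeholder):

from py2neo import Graph, Node

graph = Graph("http://localhost:7474/db/data/")
alice, = graph.create(Node("Person", name="Alice"))  # node is now bound to the server

alice.properties["age"] = 33  # this change exists only on the client so far
alice.push()                  # one HTTP call sends all local changes to the server
alice.pull()                  # conversely, refresh the local copy from the server

Nothing is lost by batching this way; you simply decide when the network round-trips happen.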

Related

Accuracy of the geolite_city_bq_b2 dataset

I believe there are inaccuracies in the BigQuery fh-bigquery.geocode.geolite_city_bq_b2 dataset, and am curious if others have noticed this too.
Background: I have the BigQuery code from Ramtin M. Seraj running, and his/my logic appears to be sound. However, there are IP addresses known to represent certain places, e.g. Tokyo @ 150.249.199.17, which Ramtin's query places in Rochester NY-USA or Ottawa ON-CA. If the query logic is sound, then the only conclusion is that the underlying GeoLite dataset is not.
To verify, look at results of this query:
SELECT *
FROM `fh-bigquery.geocode.geolite_city_bq_b2b`
WHERE classB = 38649
Note from these results that startIp = 150.245.0.0 and endIp = 150.249.255.255, therefore address 150.249.199.17 is within this IP range.
Now compare with the results from https://ipinfo.io/150.249.199.17, and also with the results from the following BigQuery. Notice that all computed values, such as IPV4_TO_INT64() of the IP address, fall within the ranges returned by the query above.
SELECT '150.249.199.17' as ipAddress
, NET.IPV4_TO_INT64(NET.IP_FROM_STRING('150.249.199.17')) AS clientIpNum_int
, TRUNC(NET.IPV4_TO_INT64(NET.IP_FROM_STRING('150.249.199.17'))/(256*256)) AS classB
, CAST(TRUNC(NET.IPV4_TO_INT64(NET.IP_FROM_STRING('150.249.199.17'))/(256*256)) as INT64) as client_classB_int
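As a sanity check, here is the same classB arithmetic in plain Python (a quick sketch; values computed by hand, not taken from the dataset):

# classB is just the top two octets of the address: 150*256 + 249 = 38649
octets = (150, 249, 199, 17)
ip_int = (octets[0] << 24) | (octets[1] << 16) | (octets[2] << 8) | octets[3]
print(ip_int)                 # 2532951825
print(ip_int // (256 * 256))  # 38649, matching the WHERE classB = 38649 filter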
p.s. I would upvote the first answer, or add a comment, but I don't have enough Reputons yet!
2019, much improved answer:
https://medium.com/@hoffa/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2
#standardSQL
# replace with your source of IP addresses
# here I'm using the same Wikipedia set from the previous article
WITH source_of_ip_addresses AS (
  SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0') ip, COUNT(*) c
  FROM `publicdata.samples.wikipedia`
  WHERE contributor_ip IS NOT NULL
  GROUP BY 1
)
SELECT country_name, SUM(c) c
FROM (
  SELECT ip, country_name, c
  FROM (
    SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
    FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
    WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
  )
  JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`
  USING (network_bin, mask)
)
GROUP BY 1
ORDER BY 2 DESC
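The heart of that query is the prefix-matching trick: every address is masked at each possible prefix length (GENERATE_ARRAY(9,32)) and joined against the network table on (network_bin, mask). The same idea in plain Python, as an illustrative sketch with a toy one-entry table (the real data comes from the GeoLite table above):

import ipaddress

# toy lookup table: {(network_as_int, prefix_length): location}
networks = {(int(ipaddress.IPv4Address("150.249.0.0")), 16): "Tokyo"}

def lookup(ip_str):
    ip_int = int(ipaddress.IPv4Address(ip_str))
    for mask in range(32, 8, -1):  # try longest prefixes first
        network = ip_int & ((0xFFFFFFFF << (32 - mask)) & 0xFFFFFFFF)
        if (network, mask) in networks:
            return networks[(network, mask)]
    return None

print(lookup("150.249.199.17"))  # "Tokyo" with the toy table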
I'm about to publish a much improved version of GeoLite in BigQuery. Stay tuned to https://twitter.com/felipehoffa and https://medium.com/@hoffa. And I'll update this answer then too.
With that said, to answer the accuracy part which titles this question, Maxmind says:
GeoLite2 databases are free IP geolocation databases comparable to, but less accurate than, MaxMind’s GeoIP2 databases
https://dev.maxmind.com/geoip/geoip2/geolite2/

Build relationships in NEO4J

I'm going to preface this that I am a total database pleb. I have 0 experience with any form of databases so I know that I'm in way over my head.
Background: I do Active Directory consulting for my company, so I routinely look at clients' group memberships of their Active Directory accounts. Currently I have a PowerShell script that runs my analytics; however, I'm finding that it takes way too long in larger organizations. I'm thinking "there has to be a better way", so I have jumped into looking at databases. Neo4j seems like a good possible solution, as I should be able to link a user account or group as a member of another group. However, after browsing documentation and forums, I have no idea how to create those links.
I have two CSVs that I have successfully imported with the following information:
Users = DistinguishedName, SAMACCOUNTNAME, MemberOf
Groups = DistinguishedName, SAMACCOUNTNAME, MemberOf, Members
What I want to do is match a string from all users and groups (DistinguishedName) to a string in the group node's Members property. Members is a concatenated string of all DistinguishedNames (whether user or group). So if a node's DistinguishedName matches part of a string in a group's Members property, I want to build a one-way relationship like so:
user -[memberof]-> group
The best I could rack my brain on this is the following code but I have no idea if I'm even close:
Match(n)
Match(u:user) WHERE n.Members CONTAINS u.DN
Create (u)-[MS:Memberof]->((match)})
In PowerShell, I know how I would accomplish this (loosely translated to relate to the NEO4J world):
$groups = (all-groups)
$AllUsersAndGroups = (all-objs)
foreach ($line in $groups) {
    $line.relationship = $line | where {$_.members -contains $AllUsersAndGroups.DistinguishedName}
}
So at last, I'm stuck right now. I will continue to look into it but I figure I would ask the community as you guys have the experience and stuff.
Here is an example of how you should have imported your data (notice that the redundant Members column is not actually needed):
Import (in batches of 5000, to avoid resource issues) each user, and create a unique relationship to its group:
USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///users.csv" AS row
MERGE (u:User {DistinguishedName: row.DistinguishedName, SAMACCOUNTNAME: row.SAMACCOUNTNAME})
MERGE (g:Group {DistinguishedName: row.MemberOf})
MERGE (u)-[:Memberof]->(g);
Import each group, and create a unique relationship to its parent group, if any:
USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM "file:///groups.csv" AS row
MERGE (g1:Group {DistinguishedName: row.DistinguishedName})
SET g1.SAMACCOUNTNAME = row.SAMACCOUNTNAME
MERGE (g2:Group {DistinguishedName: row.MemberOf})
MERGE (g1)-[:Memberof]->(g2);
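Once both files are loaded, a quick sanity check can count direct and transitive memberships per user; the variable-length [:Memberof*1..] pattern follows nested groups. A hedged sketch against the py2neo 2.0 API used elsewhere on this page:

from py2neo import Graph

graph = Graph("http://localhost:7474/db/data/")
results = graph.cypher.execute(
    "MATCH (u:User)-[:Memberof*1..]->(g:Group) "
    "RETURN u.SAMACCOUNTNAME AS user, count(DISTINCT g) AS groups "
    "ORDER BY groups DESC LIMIT 10")
for record in results:
    print(record[0], record[1])  # user, total number of groups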

py2neo return number of nodes and relationships created

I need to create a Python function that adds nodes and relationships to a graph and returns the number of nodes and relationships created.
I have added the nodes and relationships using graph.cypher.execute().
arr_len = len(dic_st[story_id]['PER'])
for j in dic_st[story_id]['PER']:
    # creating the nodes of PER in the story
    graph.cypher.execute("MERGE (n:PER {name:{name}})", name=j[0].upper())
    print j[0]
for j in range(0, arr_len):
    for k in range(j + 1, arr_len):
        # linking the edges for PER nodes
        graph.cypher.execute(
            "MATCH (p1:PER {name:{name1}}), (p2:PER {name:{name2}}) "
            "WHERE upper(p1.name)<>upper(p2.name) "
            "CREATE UNIQUE (p1)-[r:in_same_doc {st_id:{st_id}}]-(p2)",
            name1=dic_st[story_id]['PER'][j][0].upper(),
            name2=dic_st[story_id]['PER'][k][0].upper(),
            st_id=story_id)
What I need is to return the number of new nodes and relationships created.
From the Neo4j documentation I know that MERGE has ON CREATE and ON MATCH clauses, but that hasn't been very useful here.
The Neo4j browser interface does show the number of nodes and relationships updated. This is exactly what I need to return, but I can't find a way to access it.
Any help, please?
In case you need the exact counts of properties either created or updated, you have to use MATCH with CREATE, or MATCH with SET, and then count the size of the results. MERGE may not tell you which ones were updated and which were created.
When you post your query against the Cypher endpoint of the neo4j REST API without using py2neo, you can include the argument "includeStats": true in your post request to get the node/relationship statistics. See this question for an example.
As far as I can tell, py2neo currently does not support additional parameters for the Cypher query (even though it is using the same API endpoints under the hood).
In Python, you could do something like this (using the requests and json packages):
import requests
import json

payload = {
    "statements": [{
        "statement": "CREATE (t:Test) RETURN t",
        "includeStats": True
    }]
}
r = requests.post('http://your_server_host:7474/db/data/transaction/commit',
                  headers={'Content-Type': 'application/json'},  # the endpoint expects JSON
                  data=json.dumps(payload))
print(r.text)
The response will include statistics about the number of nodes created etc.
{
  "stats": {
    "contains_updates": true,
    "nodes_created": 1,
    "nodes_deleted": 0,
    "properties_set": 1,
    "relationships_created": 0,
    "relationship_deleted": 0,
    "labels_added": 1,
    "labels_removed": 0,
    "indexes_added": 0,
    "indexes_removed": 0,
    "constraints_added": 0,
    "constraints_removed": 0
  }
}
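To pull those numbers out in the script itself, note that in the full transactional-endpoint response the stats object sits inside the first element of results (a hedged sketch continuing the requests example above):

response = r.json()
stats = response["results"][0]["stats"]
print(stats["nodes_created"], stats["relationships_created"])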
After executing your query using x = session.run(...) you can use x.summary.counters to get the statistics noted in Martin Perusse's answer. See the documentation here.
In older versions the counters are available as a "private" field under x._summary.counters.
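For instance, with the official neo4j Python driver (a hedged sketch; the bolt URL and credentials are placeholders):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    result = session.run("MERGE (p:PER {name: $name})", name="ALICE")
    counters = result.consume().counters  # SummaryCounters object
    print(counters.nodes_created, counters.relationships_created)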

connecting the nodes with relationships in py2neo

I have the following python code to make a graph in neo4j. I am using py2neo version 2.0.3.
import json
from py2neo import neo4j, Node, Relationship, Graph

graph = neo4j.Graph("http://localhost:7474/db/data/")

with open("example.json") as f:
    for line in f:
        while True:
            try:
                file = json.loads(line)
                break
            except ValueError:
                # Not yet a complete JSON value
                line += next(f)
        # Now creating the node and relationships
        news, = graph.create(Node("Mainstream_News",
                                  id=unicode(file["_id"]),
                                  entry_url=unicode(file["entry_url"]),
                                  title=unicode(file["title"])))  # Comma unpacks length-1 tuple.
        authors, = graph.create(Node("Authors",
                                     auth_name=unicode(file["auth_name"]),
                                     auth_url=unicode(file["auth_url"]),
                                     auth_eml=unicode(file["auth_eml"])))
        graph.create(Relationship(news, "hasAuthor", authors))
I can create a graph with Mainstream_News and Authors nodes connected by a 'hasAuthor' relationship. My problem is that this pairs each Mainstream_News node with exactly one Authors node, but in reality one author has more than one Mainstream_News. I would like to use the auth_name property of the Authors nodes as an index to connect them with the Mainstream_News nodes. Any suggestions would be great.
You are creating a new Authors node each time through your loop, even if an Author node (with the same properties) already exists.
First, I think you should create uniqueness constraints on Authors(auth_name) and Mainstream_News(id), to enforce what seem to be your requirements. This only needs to be done once. A uniqueness constraint also creates an index for you automatically, which is a bonus.
graph.schema.create_uniqueness_constraint("Authors", "auth_name")
graph.schema.create_uniqueness_constraint("Mainstream_News", "id")
But you will probably have to empty out your DB first (at least of all Authors and Mainstream_News nodes and their relationships), since I presume it currently has a lot of duplicate nodes.
Then, you can use the merge_one and create_unique APIs to prevent duplicate nodes and relationships:
news = graph.merge_one("Mainstream_News", "id", unicode(file["_id"]))
news.properties["entry_url"] = unicode(file["entry_url"])
news.properties["title"] = unicode(file["title"])
authors = graph.merge_one("Authors", "auth_name", unicode(file["auth_name"]))
authors.properties["auth_url"] = unicode(file["auth_url"])
authors.properties["auth_eml"] = unicode(file["auth_eml"])
graph.create_unique(Relationship(news, "hasAuthor", authors))
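One caveat worth adding, per the push/pull discussion at the top of this page: in py2neo 2.0, property changes made through node.properties[...] stay local until pushed, so presumably a final push is needed (hedged sketch):

news.push()     # send the locally modified properties to the server
authors.push()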
This is what I normally do, as I find it easier to see what's happening. As far as I know, there is a bug when you create_unique with only a Node, and there is no need to create the nodes separately when you also have to create an edge anyway.
I don't have the database on this computer, so please bear with me; if there are some typos, I'll correct them in the morning, but I guess you'd rather have a fast answer. :-)
news = graph.cypher.execute_one('MATCH (m:Mainstream_News) '
                                'WHERE m.id = {id} '
                                'RETURN m', id=unicode(file["_id"]))
if not news:
    news = Node("Mainstream_News")
    news.properties['id'] = unicode(file["_id"])
    news.properties['entry_url'] = unicode(file["entry_url"])
    news.properties['title'] = unicode(file["title"])
# You can make a for-loop here
authors = Node("Authors")
authors.properties['auth_name'] = unicode(file["auth_name"])
authors.properties['auth_url'] = unicode(file["auth_url"])
authors.properties['auth_eml'] = unicode(file["auth_eml"])
rel = Relationship(news, "hasAuthor", authors)
graph.create_unique(rel)
# For-loop should end here
I've included the first three lines to make it more generic. execute_one returns a node object or None.
EDIT:
@cybersam's use of schema is cool; implement that too. I'll try to use it myself also. :-)
You can read more about it here:
http://neo4j.com/docs/stable/query-constraints.html
http://py2neo.org/2.0/schema.html

Can Neo4j be effectively used to show a collection of nodes in a sortable and filterable table?

I realise this may not be ideal usage, but apart from all the graphy goodness of Neo4j, I'd like to show a collection of nodes, say People, in a tabular format that has indexed properties for sorting and filtering.
I'm guessing the Type of a node can be stored as a Link, say Bob -> type -> Person, which would allow us to retrieve all People
Are the following possible to do efficiently (indexed?) and in a scalable manner?
Retrieve all People nodes and display all of their names, ages, cities of birth, etc. (NOTE: some of this data will be properties, some Links to other nodes, which could be denormalised as properties for the table display's and simplicity's sake)
Show me all People sorted by Age
Show me all People with Age < 30
Also a quick how to do the above (or a link to some place in the docs describing how) would be lovely
Thanks very much!
Oh and if the above isn't a good idea, please suggest a storage solution which allows both graph-like retrieval and relational-like retrieval
If you want to operate on these person nodes, you can put them into an index (the default is Lucene) and then retrieve and sort the nodes using Lucene (see for instance "How do I sort Lucene results by field value using a HitCollector?" for how to do a custom sort in Java). This will get you, for instance, People sorted by Age. The code in Neo4j could look like:
Transaction tx = neo4j.beginTx();
idxManager = neo4j.index()
personIndex = idxManager.forNodes('persons')
personIndex.add(meNode,'name',meNode.getProperty('name'))
personIndex.add(youNode,'name',youNode.getProperty('name'))
tx.success()
tx.finish()
'*** Prepare a custom Lucene query context with Neo4j API ***'
query = new QueryContext( 'name:*' ).sort( new Sort(new SortField( 'name',SortField.STRING, true ) ) )
results = personIndex.query( query )
For combining index lookups and graph traversals, Cypher is a good choice, e.g.
START people = node:people_index(name="E*") MATCH people-[r]->() return people.name, r.age order by r.age asc
in order to return data on both the node and the relationships.
Sure, that's easily possible with the Neo4j query language Cypher.
For example:
start cat=node:Types(name='Person')
match cat<-[:IS_A]-person-[born:BORN]->city
where person.age > 30
return person.name, person.age, born.date, city.name
order by person.age asc
limit 10
You can experiment with it in our cypher console.
