Traverse two lists at once - neo4j

I have an input CSV that will have the following format
email,skills,expertiseLevels
john1#xyz.com,"Oracle database;SSIS;SQL Server","5;4;3"
john2#xyz.com,"Python;Hadoop;SQL Server","1;2;4"
where for each row, expertiseLevels[i] signals the person's expertise in skills[i]
I would like to write a Cypher query to obtain a data set like so:
╒════════════════════════════════════╤═════════════════╤════════════════╕
│"email" │"skill" │"expertiseLevel"│
╞════════════════════════════════════╪═════════════════╪════════════════╡
│"john1#xyz.com" │"Oracle database"│"5" │
├────────────────────────────────────┼─────────────────┼────────────────┤
│"john1#xyz.com" │"SSIS" │"4" │
├────────────────────────────────────┼─────────────────┼────────────────┤
│"john1#xyz.com" │"SQL Server" │"3" │
├────────────────────────────────────┼─────────────────┼────────────────┤
│"john2#xyz.com" │"Python" │"1" │
├────────────────────────────────────┼─────────────────┼────────────────┤
│"john2#xyz.com" │"Hadoop" │"2" │
├────────────────────────────────────┼─────────────────┼────────────────┤
│"john2#xyz.com" │"SQL Server" │"4" │
└────────────────────────────────────┴─────────────────┴────────────────┘
The query I currently have works, but I was wondering whether there is a more straightforward way to accomplish this:
LOAD CSV WITH HEADERS FROM 'file:///test.csv' AS line
WITH line.email AS email, split(line.skills,";") AS skills, split(line.expertiseLevels,";") AS expertiseLevels
WITH email, reduce(x = "", idx in range(0,size(skills)-1) | x + skills[idx] + ":" + expertiseLevels[idx] + ";") AS se
WITH email, split(se,";") AS skillsWithExpertise
UNWIND skillsWithExpertise AS skillWithExpertise
WITH email AS email, split(skillWithExpertise,":") AS tokens
WHERE skillWithExpertise <> ""
WITH email, tokens[0] AS skill, tokens[1] AS expertiseLevel
RETURN email, skill, expertiseLevel;
Thanks

This will produce your desired output:
LOAD CSV WITH HEADERS FROM 'file:///test.csv' AS line
WITH
line.email AS email,
SPLIT(line.skills, ";") AS skills,
SPLIT(line.expertiseLevels, ";") AS expertiseLevels
UNWIND RANGE(0, SIZE(skills)-1) AS i
RETURN email, skills[i] AS skill, expertiseLevels[i] AS expertiseLevel
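For illustration only, the same index-based pairing can be sketched in plain Python. The function name `pair_skills` is hypothetical and not part of the answer; the shared index plays the role of `UNWIND RANGE(0, SIZE(skills)-1) AS i` in the Cypher above.

```python
def pair_skills(email, skills, expertise_levels):
    """Return one (email, skill, level) row per position in the two lists."""
    skill_list = skills.split(";")
    level_list = expertise_levels.split(";")
    # Walk a shared index over both lists, one output row per index.
    return [(email, skill_list[i], level_list[i]) for i in range(len(skill_list))]


rows = pair_skills("john1#xyz.com", "Oracle database;SSIS;SQL Server", "5;4;3")
for row in rows:
    print(row)
```

One difference worth keeping in mind: in Cypher an out-of-range list index yields null, whereas this Python sketch raises an IndexError if the two lists differ in length.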

Handle whitespace in Neo4j full text search

I need some help with full text search.
I have created an index like so:
CALL db.index.fulltext.createNodeIndex("ReasourceName",["Resource"],["name"])
I can query it and get results:
CALL db.index.fulltext.queryNodes('ReasourceName', 'bmc pumping station~') YIELD node, score
WITH node, score
RETURN node.name, score
limit 10;
output:
╒════════════════════════════════╤══════════════════╕
│"node.name" │"score" │
╞════════════════════════════════╪══════════════════╡
│"BMC Pumping Station" │8.143752098083496 │
├────────────────────────────────┼──────────────────┤
│"BMC Office" │2.944127082824707 │
├────────────────────────────────┼──────────────────┤
│"BMC Office" │2.944127082824707 │
├────────────────────────────────┼──────────────────┤
│"BMC Dispensary" │2.944127082824707 │
├────────────────────────────────┼──────────────────┤
│"BMC Office" │2.944127082824707 │
├────────────────────────────────┼──────────────────┤
│"BMC Dispensary" │2.944127082824707 │
├────────────────────────────────┼──────────────────┤
│"BMC Office" │2.944127082824707 │
├────────────────────────────────┼──────────────────┤
│"Police Station" │2.6569595336914062│
├────────────────────────────────┼──────────────────┤
│"Momo Station" │2.6569595336914062│
├────────────────────────────────┼──────────────────┤
│"BMC Shikshak Bhavan" │2.515393018722534 │
└────────────────────────────────┴──────────────────┘
However, it performs poorly if the input query differs in whitespace. For example, I would expect the query bmcpumpingstation or bmcpumpingstation~ to return a similar result set, but it returns nothing.
There does not appear to be an analyzer that works on Levenshtein distance.
(I also asked this question on the Neo4j community forum but didn't get a response.)

The underlying engine is Lucene: at index time it tokenizes the text and stores the tokens.
When you search, it tokenizes your search string and compares those tokens with the ones in the index.
I would suggest reading this article (Neo4j has made some changes since, but 90% of it is still valid today): https://graphaware.com/neo4j/2019/01/11/neo4j-full-text-search-deep-dive.html
So, if you search for bmcpumpingstation and your index contains the following tokens:
bmc, pumping, station
then there is simply no match.
If you want to hack a bit and get this type of search working, you can create a dedicated index for it, remove all whitespace from the names when you index them, and then search for bmcpumpingstation with some fuzziness.
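The token mismatch described above can be sketched in a few lines of Python. This is a rough stand-in rather than Lucene's actual analysis chain: `tokenize` and `strip_whitespace` are hypothetical helpers that only lowercase and split on whitespace, and fuzziness is left out entirely.

```python
def tokenize(text):
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return set(text.lower().split())


def strip_whitespace(text):
    """Key for a hypothetical second index that stores names without spaces."""
    return "".join(text.lower().split())


name = "BMC Pumping Station"
query = "bmcpumpingstation"

indexed_tokens = tokenize(name)  # {'bmc', 'pumping', 'station'}
token_match = bool(tokenize(query) & indexed_tokens)
stripped_match = strip_whitespace(query) == strip_whitespace(name)

print(token_match)     # False: the single query token matches no indexed token
print(stripped_match)  # True: both collapse to 'bmcpumpingstation'
```

With the whitespace removed on both sides, a fuzzy operator then only has to absorb real typos instead of missing spaces.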
Looks like you need to clean up your data!
match (m:Resource) set m.name=replace(m.name,"~","")
or do the cleanup before loading the data.

Neo4j: Sequence of Events as Nodes Not working

I am new to Cypher query syntax and have tried different syntax and relationship patterns to build a sequence graph. My data contains a group_id, and within each group_id a code occurs at a position given by 'number': the lowest number is the first in the sequence and the highest number is the last, per group_id. I am able to load the data from CSV and create nodes with properties, but I cannot work out how to link the 'code' nodes in numerical sequence. I am following this tutorial. Is there special Cypher syntax to achieve this result?
Sample Data:
group_id,code,date,number
123,abc,2/18/21,4
123,def,11/11/20,3
123,ghi,11/10/20,2
123,jkl,10/1/20,1
456,gtg,11/28/20,5
456,abc,10/30/20,4
456,def,10/5/20,3
456,jkl,10/1/20,2
456,uuu,10/1/20,1
My Code to load data:
LOAD CSV WITH HEADERS FROM "file:///sample2.csv" AS row
WITH row
WHERE row.group_id IS NOT NULL
MERGE (g:group_id {group_id: row.group_id});
LOAD CSV WITH HEADERS FROM "file:///sample2.csv" AS row
WITH row
WHERE row.code IS NOT NULL
MERGE (c:code {code: row.code})
ON CREATE SET c.number = row.number,
c.date = row.date;
Here is what I have tried:
// Building relationship
LOAD CSV WITH HEADERS FROM "file:///sample2.csv" AS row
WITH row
MATCH (g:group_id {group_id: row.group_id})
MATCH (c:code {code: row.code})
MERGE (g)-[:GROUPS]->(c) // Connects ALL codes to group id, but how to connect to 'code' and 'number' sequentially?
MERGE (c:{code: row.number})-[:NEXT]->(c) // Neo.ClientError.Statement.SyntaxError
I have gotten result:
I am trying to get this.

This will be a two-step process: first the initial loading of the data, as you have outlined, then an enhancement step in which you create the NEXT relationships. We do this in healthcare analytics of patient journeys or trajectories. By analogy, your yellow nodes might be patients and the blue ones encounters, so each patient has a sequence of encounters.
You can query and sort by the date or other ordering variable. For example, collect a sorted list of encounters:
match (e:encounter)
with e order by e.enc_date
with e.subjectId as sid, collect(distinct e.enc_date) as eo
return sid, size(eo) as ct, eo
I used this in some python code to then iterate through the collection to create the enc_seq edge, equivalent to your NEXT:
# Neo4jLib appears to be the answerer's own helper module, not a published package:
# CypherToPandas runs a query and returns a DataFrame, CypherNoReturn runs a write query.
dfeo = Neo4jLib.CypherToPandas("match (e:encounter) with e order by e.enc_date with e.subjectId as sid,collect(distinct e.enc_date) as eo return sid,size(eo) as ct,eo", 'ppmi')
cts = 0
with open("c:\\temp\\error.txt", "a") as sw:
    for i in range(len(dfeo)):
        sid = str(dfeo['sid'][i])
        eo = dfeo['eo'][i]
        # Link each encounter date to the next one in the sorted list.
        for j in range(len(eo) - 1):
            q = "match (e1:encounter{subjectId:" + sid + ",enc_date:date('" + str(eo[j]) + "')}) match (e2:encounter{subjectId:" + sid + ",enc_date:date('" + str(eo[j+1]) + "')}) merge (e1)-[r:enc_seq{subjectId:" + sid + ", seqCt:" + str(j) + "}]-(e2)"
            try:
                Neo4jLib.CypherNoReturn(q, 'ppmi')
            except Exception:
                # Count the failure and log the offending query.
                cts = cts + 1
                sw.write(str(i) + ':' + str(j) + "\n" + q + "\n")
print("exceptions: " + str(cts))
You can probably do this within a Cypher query using a WITH (for each row) followed by a CALL to a function similar to my Python code. For my purposes it was more convenient to use Python.
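Applied to the sample data from the question, the "sort, then link each item to the next" step might look like the sketch below. It only computes the pairs and does not talk to Neo4j; `next_pairs` is a hypothetical helper name, and each resulting pair would still have to be turned into a NEXT relationship by a separate write query.

```python
import csv
import io
from collections import defaultdict

SAMPLE = """group_id,code,date,number
123,abc,2/18/21,4
123,def,11/11/20,3
123,ghi,11/10/20,2
123,jkl,10/1/20,1
456,gtg,11/28/20,5
456,abc,10/30/20,4
456,def,10/5/20,3
456,jkl,10/1/20,2
456,uuu,10/1/20,1
"""


def next_pairs(csv_text):
    """Return (group_id, from_code, to_code) for each consecutive step."""
    by_group = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_group[row["group_id"]].append((int(row["number"]), row["code"]))

    pairs = []
    for group_id, events in by_group.items():
        events.sort()  # lowest number first
        # Pair every event with its successor in the sorted sequence.
        for (_, current), (_, following) in zip(events, events[1:]):
            pairs.append((group_id, current, following))
    return pairs


pairs = next_pairs(SAMPLE)
for pair in pairs:
    print(pair)
```

Note that the step from def to abc occurs in both groups, so if code nodes are shared across groups, a bare NEXT relationship will not tell the two sequences apart; the answer's approach of storing the subject id on the relationship addresses the same problem.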

Neo4J Subquery over same property to calculate ratio

I'm working with a tweets graph. I'm trying to get the ratio between tweets in Spanish and tweets in English.
When checking the number of tweets by language:
MATCH (twtEs:Tweet)<-[:HAS_WRITEN]-()-[:HAS_AS_PROFILE_LANGUAGE]->(l:Language)
RETURN DISTINCT l.languageCode, count(*)
// Result:
╒════════════════╤══════════╕
│"l.languageCode"│"count(*)"│
╞════════════════╪══════════╡
│"en" │165392 │
├────────────────┼──────────┤
│"es" │73693 │
└────────────────┴──────────┘
We can see the counts for each language.
But when trying to calculate the ratio directly:
MATCH (twtEs:Tweet)<-[:HAS_WRITEN]-()-[:HAS_AS_PROFILE_LANGUAGE]->(:Language{languageCode:'es'})
WITH count(twtEs) AS tweetsEs
MATCH (twtEn:Tweet)<-[:HAS_WRITEN]-()-[:HAS_AS_PROFILE_LANGUAGE]->(:Language{languageCode:'en'})
WITH count(twtEn) as tweetsEn, tweetsEs
RETURN tweetsEs/tweetsEn as RatioTweetsEsvsEn;
// Result:
╒═══════════════════╕
│"RatioTweetsEsvsEn"│
╞═══════════════════╡
│0 │
└───────────────────┘
That's what I get, when it should be 0.44557.
I've been checking the documentation and other answers on Stack Overflow but haven't found anything similar to use as an example. The second query is probably incorrect, but I'm struggling to resolve it.
Thanks in advance.
I'm running:
Neo4j Browser version: 3.2.20
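Neo4j Server version: 3.5.8 (community)

It's because count values are integers, and dividing one integer by another in Cypher gives an integer result: the fractional part is dropped.
For example: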
WITH 165392 AS v1, 73693 AS v2
RETURN v1/v2
╒═══════╕
│"v1/v2"│
╞═══════╡
│2 │
└───────┘
You can convert them to floats instead, either by multiplying by a float literal as below or, more explicitly, with toFloat(v1) / v2:
WITH 165392 AS v1, 73693 AS v2
RETURN v1*1.0f/v2*1.0f
╒══════════════════╕
│"v1*1.0f/v2*1.0f" │
╞══════════════════╡
│2.2443379968246644│
└──────────────────┘
Applied to your query, that gives:
MATCH (twtEs:Tweet)<-[:HAS_WRITEN]-()-[:HAS_AS_PROFILE_LANGUAGE]->(:Language{languageCode:'es'})
WITH count(twtEs) AS tweetsEs
MATCH (twtEn:Tweet)<-[:HAS_WRITEN]-()-[:HAS_AS_PROFILE_LANGUAGE]->(:Language{languageCode:'en'})
WITH count(twtEn) as tweetsEn, tweetsEs
RETURN tweetsEs*1.0f/tweetsEn*1.0f as RatioTweetsEsvsEn;
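As a quick arithmetic check outside Neo4j, the two counts from the question can be divided both ways in Python. This is only an analogy: Python's `//` stands in for Cypher's integer division (the two agree for positive numbers), and `/` stands in for dividing after converting to float.

```python
tweets_es = 73693
tweets_en = 165392

# Integer division drops the fractional part, which is why the query returned 0.
integer_ratio = tweets_es // tweets_en

# Floating-point division keeps it.
float_ratio = tweets_es / tweets_en

print(integer_ratio)          # 0
print(round(float_ratio, 5))  # 0.44557
```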

Find a number of photos inside a photo album

I have photo albums and their photos stored in Neo4j. I would like to be able to find one album and get a certain amount of photos. The goal is to lazily load photos as required (pagination).
Now I can do the following to achieve what I want:
match(p:Photo)-[bt:BELONGS_TO]->(a:Album) where a.name = "Summer 2019" return a, collect(p)[..4] as photos
However I would like to be able to sort the list of photos by different criteria such as their upload date or creation date. I'm not exactly sure whether this is the best approach to do this.
match(p:Photo)-[bt:BELONGS_TO]->(a:Album) where a.name = "Summer 2019" return a, collect(p)[4..] as photos order by p.file_name
This fails with the following error:
In a WITH/RETURN with DISTINCT or an aggregation, it is not possible to access variables declared before the WITH/RETURN: p
I would like to keep the exact same format of the result (one album, one page of photos) if possible so that I don't have to do complicated mapping inside my application code:
╒══════════════════════╤══════════════════════════════════════════════════════════════════════╕
│"a" │"photos" │
╞══════════════════════╪══════════════════════════════════════════════════════════════════════╡
│{"name":"Summer 2019"}│[{"file_name":"cat.jpeg"},{"file_name":"dog.jpeg"},{"file_name":"birdi│
│ │e.jpeg"},{"file_name":"bird.jpeg"}] │
└──────────────────────┴──────────────────────────────────────────────────────────────────────┘
Is there a clean way to get this format while being able to sort the photos?

You need to ORDER BY before collecting your p nodes:
MATCH (p:Photo)-[bt:BELONGS_TO]->(a:Album)
WHERE a.name = "Summer 2019"
WITH
a,
p
ORDER BY p.file_name
RETURN a, collect(p)[4..] as photos
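The order of operations matters here: sort first, then slice the collected list. A minimal Python sketch of that idea follows; the `page` helper and the extra `ant.jpeg` entry are made up for illustration, and `skip`/`limit` correspond to the bounds of a Cypher list slice such as `collect(p)[0..4]`.

```python
photos = [
    {"file_name": "dog.jpeg"},
    {"file_name": "cat.jpeg"},
    {"file_name": "birdie.jpeg"},
    {"file_name": "bird.jpeg"},
    {"file_name": "ant.jpeg"},  # hypothetical fifth photo, so a second page exists
]


def page(items, key, skip, limit):
    """Sort the whole list by `key`, then return one page of it."""
    ordered = sorted(items, key=lambda item: item[key])
    return ordered[skip:skip + limit]


first_page = page(photos, "file_name", 0, 4)
second_page = page(photos, "file_name", 4, 4)

print([p["file_name"] for p in first_page])
print([p["file_name"] for p in second_page])
```

Slicing before sorting would instead return an arbitrary four photos and only order those, which is not what pagination needs.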

Count relationships by types in Neo4j

I have many relationship types in the database. How do I count relationships by each type without using apoc?
Solution
MATCH ()-[relationship]->()
RETURN TYPE(relationship) AS type, COUNT(relationship) AS amount
ORDER BY amount DESC;
The first line specifies the pattern to define the relationship variable, which is used to determine type and amount in line two.
Example result
╒══════════════╤════════╕
│"type" │"amount"│
╞══════════════╪════════╡
│"BELONGS_TO" │1234567 │
├──────────────┼────────┤
│"IS_PART_OF" │947227 │
├──────────────┼────────┤
│"CONTAINS" │432552 │
├──────────────┼────────┤
│"HOLDS" │4 │
└──────────────┴────────┘
There's also a built-in procedure in 3.5.x that you can use to retrieve counts, but it takes a bit of filtering to get down to the ones you are interested in:
CALL db.stats.retrieve('GRAPH COUNTS') YIELD data
UNWIND [data IN data.relationships WHERE NOT exists(data.startLabel) AND NOT exists(data.endLabel)] as relCount
RETURN coalesce(relCount.relationshipType, 'all') as relationshipType, relCount.count as count
