Handle whitespace in Neo4j full text search - neo4j

I need some help with full text search.
I have created an index like so:
CALL db.index.fulltext.createNodeIndex("ReasourceName",["Resource"],["name"])
I can query it and get results:
CALL db.index.fulltext.queryNodes('ReasourceName', 'bmc pumping station~') YIELD node, score
WITH node, score
RETURN node.name, score
limit 10;
output:
╒════════════════════════════════╤══════════════════╕
│"node.name" │"score" │
╞════════════════════════════════╪══════════════════╡
│"BMC Pumping Station" │8.143752098083496 │
├────────────────────────────────┼──────────────────┤
│"BMC Office" │2.944127082824707 │
├────────────────────────────────┼──────────────────┤
│"BMC Office" │2.944127082824707 │
├────────────────────────────────┼──────────────────┤
│"BMC Dispensary" │2.944127082824707 │
├────────────────────────────────┼──────────────────┤
│"BMC Office" │2.944127082824707 │
├────────────────────────────────┼──────────────────┤
│"BMC Dispensary" │2.944127082824707 │
├────────────────────────────────┼──────────────────┤
│"BMC Office" │2.944127082824707 │
├────────────────────────────────┼──────────────────┤
│"Police Station" │2.6569595336914062│
├────────────────────────────────┼──────────────────┤
│"Momo Station" │2.6569595336914062│
├────────────────────────────────┼──────────────────┤
│"BMC Shikshak Bhavan" │2.515393018722534 │
└────────────────────────────────┴──────────────────┘
However, the search performs poorly if the input query differs in whitespace. For example, I would expect the query bmcpumpingstation or bmcpumpingstation~ to return a similar result set, but it returns nothing.
There does not appear to be an analyzer that works on Levenshtein distance.
(I also asked this question on neo4j community but didn't get a response)

The underlying engine is Lucene: it tokenizes the text and stores the resulting tokens in the index.
Then, when you search, it tokenizes your search string and compares those tokens with the tokens in the index.
I would suggest reading this article (Neo4j has made some changes since, but 90% of it is still valid today): https://graphaware.com/neo4j/2019/01/11/neo4j-full-text-search-deep-dive.html
So, if you search for bmcpumpingstation and your index contains the following tokens:
bmc, pumping, station
then there is simply no match, and fuzziness does not help, because it is applied per token and bmcpumpingstation is far more than a couple of edits away from any of them.
If you want to hack a bit and make this type of search work, you can create a dedicated index for it and remove all whitespace from the names when you index them; then you can search for bmcpumpingstation with some fuzziness.
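A minimal sketch of that workaround, assuming Neo4j 3.5 (the version the procedures in the question belong to). The property name nameNoSpaces and the index name ResourceNameNoSpaces are made up for illustration, and replace() here strips only plain spaces, not tabs or other whitespace:

```cypher
// Assumption: `nameNoSpaces` is a new, hypothetical property holding the name without spaces.
MATCH (r:Resource)
SET r.nameNoSpaces = replace(r.name, ' ', '');

// Assumption: `ResourceNameNoSpaces` is a hypothetical name for the dedicated index.
CALL db.index.fulltext.createNodeIndex("ResourceNameNoSpaces", ["Resource"], ["nameNoSpaces"]);

// The whole name is now a single token, so a fuzzy query can match it.
CALL db.index.fulltext.queryNodes("ResourceNameNoSpaces", "bmcpumpingstation~") YIELD node, score
RETURN node.name, score
LIMIT 10;
```

Run the three statements one at a time unless your client supports multi-statement input. The extra property has to be kept in sync whenever name changes, and on newer Neo4j versions the index would be created with CREATE FULLTEXT INDEX rather than the procedure.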

Looks like you need to clean up your data!
match (m:Resource) set m.name=replace(m.name,"~","")
Or do the cleanup before loading the data.

Related

Neo4J Subquery over same property to calculate ratio

I'm working with a tweets graph. I'm trying to get the ratio between tweets in Spanish and tweets in English.
When checking the number of tweets by language:
MATCH (twtEs:Tweet)<-[:HAS_WRITEN]-()-[:HAS_AS_PROFILE_LANGUAGE]->(l:Language)
RETURN DISTINCT l.languageCode, count(*)
// Result:
╒════════════════╤══════════╕
│"l.languageCode"│"count(*)"│
╞════════════════╪══════════╡
│"en" │165392 │
├────────────────┼──────────┤
│"es" │73693 │
└────────────────┴──────────┘
We can see the counts for each language.
But when trying to calculate the ratio directly:
MATCH (twtEs:Tweet)<-[:HAS_WRITEN]-()-[:HAS_AS_PROFILE_LANGUAGE]->(:Language{languageCode:'es'})
WITH count(twtEs) AS tweetsEs
MATCH (twtEn:Tweet)<-[:HAS_WRITEN]-()-[:HAS_AS_PROFILE_LANGUAGE]->(:Language{languageCode:'en'})
WITH count(twtEn) as tweetsEn, tweetsEs
RETURN tweetsEs/tweetsEn as RatioTweetsEsvsEn;
// Result:
╒═══════════════════╕
│"RatioTweetsEsvsEn"│
╞═══════════════════╡
│0 │
└───────────────────┘
That's what I get, when it should be 0.44557.
I've been checking the documentation and other answers on Stack Overflow but haven't found anything similar to use as an example. The second query is probably incorrect, but I'm struggling to resolve it.
Thanks in advance.
I'm running:
Neo4j Browser version: 3.2.20
Neo4j Server version: 3.5.8 (community)
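It's because the count values are integers, and dividing one integer by another in Cypher performs integer division, which discards the fractional part.
For example: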
WITH 165392 AS v1, 73693 AS v2
RETURN v1/v2
╒═══════╕
│"v1/v2"│
╞═══════╡
│2 │
└───────┘
You can convert them to floats instead:
WITH 165392 AS v1, 73693 AS v2
RETURN v1*1.0f/v2*1.0f
╒══════════════════╕
│"v1*1.0f/v2*1.0f" │
╞══════════════════╡
│2.2443379968246644│
└──────────────────┘
For your query, that would give:
MATCH (twtEs:Tweet)<-[:HAS_WRITEN]-()-[:HAS_AS_PROFILE_LANGUAGE]->(:Language{languageCode:'es'})
WITH count(twtEs) AS tweetsEs
MATCH (twtEn:Tweet)<-[:HAS_WRITEN]-()-[:HAS_AS_PROFILE_LANGUAGE]->(:Language{languageCode:'en'})
WITH count(twtEn) as tweetsEn, tweetsEs
RETURN tweetsEs*1.0f/tweetsEn*1.0f as RatioTweetsEsvsEn;
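As an alternative sketch (not part of the original answer), the built-in toFloat() function makes the conversion explicit. The literals below are the counts from the question, so the snippet runs without any data; converting one operand is enough, because a float divided by an integer yields a float:

```cypher
// Counts taken from the question; in the real query they come from count().
WITH 165392 AS tweetsEn, 73693 AS tweetsEs
RETURN toFloat(tweetsEs) / tweetsEn AS RatioTweetsEsvsEn;
```

In the full query, the same change applies to the last line only: RETURN toFloat(tweetsEs) / tweetsEn AS RatioTweetsEsvsEn.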

Find a number of photos inside a photo album

I have photo albums and their photos stored in Neo4j. I would like to be able to find one album and get a certain amount of photos. The goal is to lazily load photos as required (pagination).
Now I can do the following to achieve what I want:
match(p:Photo)-[bt:BELONGS_TO]->(a:Album) where a.name = "Summer 2019" return a, collect(p)[..4] as photos
However I would like to be able to sort the list of photos by different criteria such as their upload date or creation date. I'm not exactly sure whether this is the best approach to do this.
match(p:Photo)-[bt:BELONGS_TO]->(a:Album) where a.name = "Summer 2019" return a, collect(p)[4..] as photos order by p.file_name
This fails with the following error:
In a WITH/RETURN with DISTINCT or an aggregation, it is not possible to access variables declared before the WITH/RETURN: p
I would like to keep the exact same format of the result (one album, one page of photos) if possible so that I don't have to do complicated mapping inside my application code:
╒══════════════════════╤══════════════════════════════════════════════════════════════════════╕
│"a" │"photos" │
╞══════════════════════╪══════════════════════════════════════════════════════════════════════╡
│{"name":"Summer 2019"}│[{"file_name":"cat.jpeg"},{"file_name":"dog.jpeg"},{"file_name":"birdi│
│ │e.jpeg"},{"file_name":"bird.jpeg"}] │
└──────────────────────┴──────────────────────────────────────────────────────────────────────┘
Is there a clean way to get this format while being able to sort the photos?
You need to ORDER BY before collecting your p nodes:
MATCH (p:Photo)-[bt:BELONGS_TO]->(a:Album)
WHERE a.name = "Summer 2019"
WITH a, p
ORDER BY p.file_name
RETURN a, collect(p)[4..] as photos
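Note that the slice [4..] returns every photo from the fifth one onward, not a single page. A possible sketch for a fixed-size page, assuming two hypothetical query parameters $offset and $pageSize supplied by the application:

```cypher
// Assumption: $offset and $pageSize are hypothetical parameters, e.g. 4 and 4 for the second page.
MATCH (p:Photo)-[:BELONGS_TO]->(a:Album)
WHERE a.name = "Summer 2019"
WITH a, p
ORDER BY p.file_name
// collect() keeps the order established above, so the slice is a sorted page.
WITH a, collect(p) AS sorted
RETURN a, sorted[$offset..($offset + $pageSize)] AS photos
```

The result keeps the one-album, one-list shape from the question. Sorting by another criterion only requires changing the ORDER BY expression.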

Traverse two lists at once

I have an input CSV with the following format:
email,skills,expertiseLevels
john1#xyz.com,"Oracle database;SSIS;SQL Server","5;4;3"
john2#xyz.com,"Python;Hadoop;SQL Server","1;2;4"
where for each row, expertiseLevels[i] signals the person's expertise in skills[i]
I would like to write a Cypher query to obtain a data set like so:
╒════════════════════════════════════╤═════════════════╤════════════════╕
│"email" │"skill" │"expertiseLevel"│
╞════════════════════════════════════╪═════════════════╪════════════════╡
│"john1#xyz.com" │"Oracle database"│"5" │
├────────────────────────────────────┼─────────────────┼────────────────┤
│"john1#xyz.com" │"SSIS" │"4" │
├────────────────────────────────────┼─────────────────┼────────────────┤
│"john1#xyz.com" │"SQL Server" │"3" │
├────────────────────────────────────┼─────────────────┼────────────────┤
│"john2#xyz.com" │"Python" │"1" │
├────────────────────────────────────┼─────────────────┼────────────────┤
│"john2#xyz.com" │"Hadoop" │"2" │
├────────────────────────────────────┼─────────────────┼────────────────┤
│"john2#xyz.com" │"SQL Server" │"4" │
└────────────────────────────────────┴─────────────────┴────────────────┘
The query I currently have does work, but I was wondering whether there is a more straightforward way to accomplish what I am looking for:
LOAD CSV WITH HEADERS FROM 'file:///test.csv' AS line
WITH line.email AS email, split(line.skills,";") AS skills, split(line.expertiseLevels,";") AS expertiseLevels
WITH email, reduce(x = "", idx in range(0,size(skills)-1) | x + skills[idx] + ":" + expertiseLevels[idx] + ";") AS se
WITH email, split(se,";") AS skillsWithExpertise
UNWIND skillsWithExpertise AS skillWithExpertise
WITH email AS email, split(skillWithExpertise,":") AS tokens
WHERE skillWithExpertise <> ""
WITH email, tokens[0] AS skill, tokens[1] AS expertiseLevel
RETURN email, skill, expertiseLevel;
Thanks
This will produce your desired output:
LOAD CSV WITH HEADERS FROM 'file:///test.csv' AS line
WITH
  line.email AS email,
  SPLIT(line.skills, ";") AS skills,
  SPLIT(line.expertiseLevels, ";") AS expertiseLevels
UNWIND RANGE(0, SIZE(skills)-1) AS i
RETURN email, skills[i] AS skill, expertiseLevels[i] AS expertiseLevel
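To try the index-based UNWIND without a CSV file, here is a small self-contained sketch that uses the first row from the question as literals. The toInteger() call is an optional addition, not part of the answer above, in case numeric expertise levels are preferred over strings:

```cypher
// Literal stand-in for one CSV row from the question.
WITH "john1#xyz.com" AS email,
     split("Oracle database;SSIS;SQL Server", ";") AS skills,
     split("5;4;3", ";") AS expertiseLevels
// One row per index; the same index addresses both lists.
UNWIND range(0, size(skills) - 1) AS i
RETURN email, skills[i] AS skill, toInteger(expertiseLevels[i]) AS expertiseLevel;
```

This relies on both lists having the same length. If expertiseLevels is shorter than skills, the missing entries come back as null.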

Accessing map values from neo4j, apoc, Cypher

I am still rather new to Neo4j, Cypher and programming in general.
Is there a way to access the output posted below, i.e. the "count" value for every "item" (which has to be the pair), as well as the "item" values themselves? I need the number of times a pair (i.e. two specific neighboring nodes) occurs not just as information, but as values I can work with further in order to adjust my graph.
My last lines of code (in the preceding lines I just ordered the nodes sequentially):
...
WITH apoc.coll.pairs(a) as pairsOfa
WITH apoc.coll.frequencies(pairsOfa) AS giveBackFrequencyOfPairsOfa
UNWIND giveBackFrequencyOfPairsOfa AS x
WITH DISTINCT x
RETURN x
Output from the Neo4j Browser that I need to work with:
"x"
│{"count":1,"item":[{"aName":"Rob","time":1},{"aName":"Edwin","time":2}]},{"count":4,"item":[{"aName":"Edwin","time":2},{"aName":"Celesta","time":3}]}
...
Based on your code, your result should contain multiple x records (not a single record, as implied by the "output" provided in your question). Here is an example of what I would expect:
╒══════════════════════════════════════════════════════════════════════╕
│"x" │
╞══════════════════════════════════════════════════════════════════════╡
│{"count":1,"item":[{"aName":"Rob","time":1},{"aName":"Edwin","time":2}│
│]} │
├──────────────────────────────────────────────────────────────────────┤
│{"count":1,"item":[{"aName":"Edwin","time":2},{"aName":"Celesta","time│
│":3}]} │
└──────────────────────────────────────────────────────────────────────┘
If that is true, then you can just access the count and item properties of each x directly via x.count and x.item. To get each value within an item, you could use x.item[0] and x.item[1].
Asides: you probably want to use apoc.coll.pairsMin instead of apoc.coll.pairs, to avoid the generation of a spurious "pair" (whose second element is null) when the number of values to be paired is odd. Also, you probably do not need the DISTINCT step.
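A small sketch of that access pattern, assuming the APOC library is installed. The list a is a literal stand-in for the ordered values built in the elided part of the question's query:

```cypher
// Assumption: `a` stands in for the ordered list produced earlier in the real query.
WITH [{aName: "Rob", time: 1}, {aName: "Edwin", time: 2}, {aName: "Celesta", time: 3}] AS a
UNWIND apoc.coll.frequencies(apoc.coll.pairsMin(a)) AS x
// x.item is the pair (a two-element list); x.count is how often that pair occurs.
RETURN x.item[0].aName AS first, x.item[1].aName AS second, x.count AS occurrences;
```

Because the values come back as ordinary columns, they can be passed on with WITH and used in later MATCH, MERGE or SET clauses instead of only being returned.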

Count relationships by types in Neo4j

I have many relationship types in the database. How do I count relationships by each type without using apoc?
Solution
MATCH ()-[relationship]->()
RETURN TYPE(relationship) AS type, COUNT(relationship) AS amount
ORDER BY amount DESC;
The first line specifies the pattern to define the relationship variable, which is used to determine type and amount in line two.
Example result
╒══════════════╤════════╕
│"type" │"amount"│
╞══════════════╪════════╡
│"BELONGS_TO" │1234567 │
├──────────────┼────────┤
│"CONTAINS" │432552 │
├──────────────┼────────┤
│"IS_PART_OF" │947227 │
├──────────────┼────────┤
│"HOLDS" │4 │
└──────────────┴────────┘
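The same pattern can be narrowed if only part of the graph is of interest. As a sketch, this counts outgoing relationship types for nodes with one label; Album is used here purely as a placeholder label:

```cypher
// Assumption: `Album` is a placeholder; substitute any label from your own graph.
MATCH (:Album)-[relationship]->()
RETURN TYPE(relationship) AS type, COUNT(relationship) AS amount
ORDER BY amount DESC;
```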
There's also a built-in procedure in 3.5.x that you can use to retrieve counts, but it takes a bit of filtering to get down to the ones you are interested in:
CALL db.stats.retrieve('GRAPH COUNTS') YIELD data
UNWIND [data IN data.relationships WHERE NOT exists(data.startLabel) AND NOT exists(data.endLabel)] as relCount
RETURN coalesce(relCount.relationshipType, 'all') as relationshipType, relCount.count as count
