Properties aren't ordered alphabetically - neo4j

I noticed that the ORDER BY clause orders text according to ASCII order, not alphabetically, like, for example, MySQL does.
In other words, this is how Neo4J would order properties:
Apple
Carrot
banana
And MySQL would order them like this:
Apple
banana
Carrot
What is the best way to get Neo4J to sort alphabetically? One way is to use upper (or lower) like this:
MATCH (e) RETURN e.name ORDER BY upper(e.name) ASC;
Another idea is to create a new property, nameSort, which is the same as the name property, but in upper (or lower) case.
Any other ways to do it? I would prefer to do something simple, like the Cypher modification above, rather than creating a new property, but I don't know what the performance implications are.

Neo4j performs lexicographic string ordering, which is what you're seeing. This is documented here. To achieve case-insensitive ordering, you'd need to implement this on your own (such as your suggestion for converting the case either during the query or when storing the property value).

Related

How to do an ORDER BY ignoring the diacritical marks (use a collation) in Cypher?

I have a list of names where some names contain diacritics characters, like Á, Ê
For example:
Átila
André
Êlisa
Mercês
Sá
But when I run a simple query like this:
MATCH (p:Person)
ORDER BY p.Name
It returns the names out of alphabetical order, because of the diacritics:
André
Mercês
Sá
Átila
Êlisa
I would like that it return in alphabetical order, independently of the presence of diacritics (pt-BR / portuguese / Brazil).
I can do this in Microsoft SQL Server:
SELECT Name
FROM Person
ORDER BY Name COLLATE SQL_Latin1_General_CP1_CS_AS
How to do that in Cypher?
DISCLAIMER: I'm the co-founder and CTO of Memgraph.
I would say this goes under the proper support for Unicode. openCypher grammar has the support to parse various characters, but it's an implementation detail of how characters are stored and later interpreted. I'm not aware of a clause like COLLATE in openCypher.
When it comes to Memgraph, it just stores and interprets raw bytes (for now), which results in the wrong sorting order. Options when using Memgraph are:
implementing a user-defined function in C/C++/Python/Rust which will allow you to sort the items correctly
maybe a quick fix is to order on the client/application side (I'm aware of all the problems that might produce, large transferred data volume, and slow queries), but maybe it's a quick solution for you.
There is a related GitHub issue. Please contribute with more details or follow the discussion. At some point, we'll add native capability :) Also, we are looking for contributors!
Cypher doesn't have the support for specifying Collation and Character Set as of now, but you can try this for your use case:
MATCH (p:Person)
WITH p, apoc.text.clean(p.name) AS cleanedName
RETURN p.name
ORDER BY cleanedName
APOC is an external library, with some very useful functions.
apoc.text.clean is one of them, it only keeps alphanumeric characters in the string and converts them all to lowercase. Hence, there are two limitations, if you want non-alphanumeric characters to play a role in sorting, or the sorting should be case-sensitive, then this is not an exact solution, then you can basically write your own custom procedure and call it from within Cypher, as described here.
Please install the APOC library first, if not already installed.

Neo4j data modeling: correct way to specify a source for a statement?

I'm working on a scientific database that contains model statements such as:
"A possible cause of Fibromyalgia is Microglial hyperactivity, as supported by these 10 studies: [...] and contradicted by 1 study [...]."
I need to specify a source for statements in Neo4j and be able to do 2 ways operations, like:
Find all statements supported by a study
Find all studies supporting a statement
The most immediate idea I had is to use the DOI of studies as unique identifiers in the relationship property. The big con of this idea is that I have to scan all the relationships to find the list of all statements supported by a study.
So, since it is impossible to make a link between a study and a relationship, I had the idea to make 2 links, at each extremity of the relationship. The obvious con is that it does not give information about the relationship, like "support" or "contradict".
So, I came to the conclusion that I need a node for the hypothesis:
However, it overloads the graph and we are not anymore in the classical node -relationship-> node design that makes property graphs so easy to understand.
Using RDF, it is possible to add properties to relationships using subgraphs, however there we enter semantic graphs and quad stores, which is a more complex tool.
So I'm wondering if there is a "correct" design pattern for Neo4j to support this type of need that I may not have imagined instead?
Thanks
Based on your requirements, I think put support_study as property of edge will do the work:
Thus we could query the following as:
Find all statements supported by a study
MATCH ()-[e:has_cause{support_study: "doi_foo_bar"}]->()
RETURN e;
Find all studies supporting a statement
Given statement is “foo” is caused by “bar”
MATCH (v:disease{name: "foo"})-[e:has_cause]->(v1:sympton{name: "bar")
RETURN DISTINCT e.support_study;
While, this is mostly based on NebulaGraph, where:
It speaks cypher DQL(together with nGQL)
It supports properties in edge
It used 4-tuple(rather than a Key) to distingush an edge(src,dst,edge_type,rank), where rank is an unique design to enable multiple has_cause edge instance between one pair of disease-> sympton, you could put the hash of doi or other number as rank field(or omit, of cause, it will be 0)
It’s distributed and Open-Source(Apache 2.0)
Note:
In NebulaGraph, index should be created on has_cause(support_study) and disease(name), ref: https://www.siwei.io/en/nebula-index-explained/ and https://docs.nebula-graph.io/3.2.0/3.ngql-guide/14.native-index-statements/
But, I think it applies to neo4j, too :)

Display two separated records according to multiple values in properties neo4j

I faced a need to make a strange thing. I have some query which is can’t be changed. It’s a match query for getting record:
MATCH (j:journal) WHERE j.id in [12] RETURN j.`id` AS ID, j.`language` AS LANGUAGE
And I have some node that contains array as property: e.g. can be created like this: create (j:journal {id:12, language:[“English”, “Polish”]})
So, is there any possibility to display this node like two records with the same id, but with different language fields? Like the following:
ID | LANGUAGE
12 | English
12 | Polish
The important thing is that match query can’t be changed at all.
But the node can be changed.
I know that I can add UNWIND keyword for the language field in the source query. But there is a requirement to not to.
I didn’t find something like that in the documentation nor in the internet. I’m not sure if it’s even possible (but consumer wants it). Just I don’t have much experience with neo4j.
I understand that it can sound weird, but I need to understand if it can be implemented this way.
Thanks in advance.
If you can change the DB, you can change it so that each journal node contains a single language (as a scalar value, not in a list). However, this change might break any other queries that you might have.
If this conversion is acceptable, here is a query that should: (a) convert existing journal nodes to have a scalar language value, and (b) create new journal nodes as necessary for the remaining language values. The nodes that are spawned from an original journal node will share the same properties (except for language).
MATCH (j:journal)
WITH j, j.language[1..] AS langs
SET j.language = j.language[0]
WITH j, langs
UNWIND langs AS lang
CREATE (k:journal)
SET k = j, k.language = lang
If a node's language property had N values, you will end up with N nodes, each with the same properties -- except for the language property, which will contain a different language value (as a string). For efficiency, the original node is reused.

Create Unique Relationship is taking much amount of time

START names = node(*),
target=node:node_auto_index(target_name="TARGET_1")
MATCH names
WHERE NOT names-[:contains]->()
AND HAS (names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name,names,names.qualification
Iam consisting of nearly 1,80,000 names nodes, i had iterated the above process to create unique relationships above 100 times by changing the target. its taking too much amount of time.How can i resolve it..
i build the query with java and iterated.iam using neo4j 2.0.0.5 and java 1.7 .
I edited your cypher query because I think I understand it, but I can barely read the rest of your question. If you edit it with white spaces and punctuation it might be easier to understand what you are trying to do. Until then, here are some thoughts about your query being slow.
You bind all the nodes in the graph, that's typically pretty slow.
You bind all the nodes in the graph twice. First you bind universally in your start clause: names=node(*), and then you bind universally in your match clause: MATCH names, and only then you limit your pattern. I don't quite know what the Cypher engine makes of this (possibly it gets a migraine and goes off to make a pot of coffee). It's unnecessary, you can at least drop the names=node(*) from your start clause. Or drop the match clause, I suppose that could work too, since you don't really do anything there, and you will still need a start clause for as long as you use legacy indexing.
You are using Neo4j 2.x, but you use legacy indexing instead of labels, at least in this query. Without knowing your data and model it's hard to know what the difference would be for performance, but it would certainly make it much easier to write (and read) your queries. So, that's a different kind of slow. It's likely that if you had labels and label indices, the query performance would improve.
So, first try removing one of the universal bindings of nodes, then use the 2.x schema tools to structure your data. You should be able to write queries like
MATCH target:Target
WHERE target.target_name="TARGET_1"
WITH target
MATCH names:Name
WHERE NOT names-[:contains]->()
AND HAS (names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name,names,names.qualification
I have no idea if such a query would be fast on your data, however. If you put the "Name" label on all your nodes, then MATCH names:Name will still bind all nodes in the database, so it'll probably still be slow.
P.S. The relationships you create have a TYPE called contains, and you give them a property called type with value declared. Maybe you have a good reason, but that's potentially very confusing.
Edit:
Reading through your question and my answer again I no longer think that I understand even your cypher query. (Why are you returning both the bound nodes and properties of those nodes?) Please consider posting sample data on console.neo4j.org and explain in more detail what your model looks like and what you are trying to do. Let me know if my answer meets your question at all or I'll consider removing it.

Neo4j node property type

I'm playing around with neo4j, and I was wondering, is it common to have a type property on nodes that specify what type of Node it is? I've tried searching for this practice, and I've seen some people use name for a purpose like this, but I was wondering if it was considered a good practice or if indexes would be the more practical method?
An example would be a "User" node, which would have type: user, this way if the index was bad, I would be able to do an all-node scan and look for types of user.
Labels have been added to neo4j 2.0. They fix this problem.
You can create nodes with labels:
CREATE (me:American {name: "Emil"}) RETURN me;
You can match on labels:
MATCH (n:American)
WHERE n.name = 'Emil'
RETURN n
You can set any number of labels on a node:
MATCH (n)
WHERE n.name='Emil'
SET n :Swedish:Bossman
RETURN n
You can delete any number of labels on a node:
MATCH (n { name: 'Emil' })
REMOVE n:Swedish
Etc...
True, it does depend on your use case.
If you add a type property and then wish to find all users, then you're in potential trouble as you've got to examine that property on every node to get to the users. In that case, the index would probably do better- but not in cases where you need to query for all users with conditions and relations not available in the index (unless of course, your index is the source of the "start").
If you have graphs like mine, where a relation type implies two different node types like A-(knows)-(B) and A or B can be a User or a Customer, then it doesn't work.
So your use case is really important- it's easy to model graphs generically, but important to "tune" it as per your usage pattern.
IMHO you shouldn't have to put a type property on the node. Instead, a common way to reference all nodes of a specific "type" is to connect all user nodes to a node called "Users" maybe. That way starting at the Users node, you can very easily find all user nodes. The "Users" node itself can be indexed so you can find it easily, or it can be connected to the reference node.
I think it's really up to you. Some people like indexed type attributes, but I find that they're mostly useful when you have other indexed attributes to narrow down the number of index hits (search for all users over age 21, for example).
That said, as #Luanne points out, most of us try to solve the problem in-graph first. Another way to do that (and the more natural way, in my opinion) is to use the relationship type to infer a practical node type, i.e. "A - (knows) -> B", so A must be a user or some other thing that can "know", and B must be another user, a topic, or some other object that can "be known".
For client APIs, modeling the element type as a property makes it easy to instantiate the right domain object in your client-side code so I always include a type property on each node/vertex.
The "type" var name is commonly used for this, but in some languages like Python, "type" is a reserved word so I use "element_type" in Bulbs ( http://bulbflow.com/quickstart/#models ).
This is not needed for edges/relationships because they already contain a type (the label) -- note that Neo4j also uses the keyword "type" instead of label for relationships.
I'd say it's common practice. As an example, this is exactly how Spring Data Neo4j knows of which entity type a certain node is. Each node has "type" property that contains the qualified class name of the entity. These properties are automatically indexed in the "types" index, thus nodes can be looked up really fast. You could implement your use case exactly like this.
Labels have recently been added to Neo4j 2.0 ( http://docs.neo4j.org/chunked/milestone/graphdb-neo4j-labels.html ). They are still under development at the moment, but they address this exact problem.

Resources