Collection? Dictionary? List? - neo4j

Neo4j's manual does a very good job at explaining the meaning of the terms Node, Relationship, Label and a few others.
However, the real vocabulary of Cypher seems to include quite a few elusive terms, as well.
For instance, clause 3.3.15.1 of the manual says "Lists and paths are key concepts in Cypher". Fine, but what is a List in Cypher? I have all but given up trying to find a definition of that "key concept".
Similarly, the Cypher Reference Card mentions that "Cypher also supports maps and collections". Elsewhere, one can find that Cypher also "works with dictionaries".
Needless to say, I am in the dark as to how to spot and/or use those in Cypher.
Would really appreciate some illustrations.
Thanks.

The docs has a section on Composite types:
3.2.1.3. Composite types
✓ Can be returned from Cypher queries
✓ Can be used as parameters
❏ Cannot be stored as properties
✓ Can be constructed with Cypher literals
Composite types comprise:
Lists are heterogeneous, ordered collections of values, each of which has any property, structural or composite type.
Maps are heterogeneous, unordered collections of (key, value) pairs, where:
the key is a String
the value has any property, structural or composite type
You might also be interested in the development of openCypher. One of the goals of the openCypher project is to define the concepts of the Cypher language. As stated on its homepage:
The openCypher project aims to deliver a full and open specification of the industry’s most widely adopted graph database query language: Cypher.
Currently, openCypher is a work-in-progress. It has a draft document on the Property Graph Model, which itself does not discuss lists/maps in detail, but refers the CIP2015-06-16 - Public Type System and Type Annotations document, which is an accepted Cypher Improvement Proposal. It has a section on "Container Types", which defines how lists and maps work in Cypher.
I have not seen the term "dictionary" in the core Neo4j docs. It could be mentioned around drivers though, as some languages, e.g. Python, use this term for maps.
(Disclaimer: I am a regular participant at the openCypher Implementers Group meetings.)

According to wikipedia :
https://en.wikipedia.org/wiki/List_(abstract_data_type) : a list (or sequence, collection) is a data type that represents a countable number of ordered values.
https://en.wikipedia.org/wiki/List_(abstract_data_type) : a map (or dictionnary) is a data type composed of a collection of (key, value) pairs.
And in Cypher :
this is a list : RETURN ['Benoit', 'Simard'] AS list
this is a map : RETURN { firstname:'Benoit', lastname:'Simard'} AS map
Cheers

Related

Neo4j data modeling: correct way to specify a source for a statement?

I'm working on a scientific database that contains model statements such as:
"A possible cause of Fibromyalgia is Microglial hyperactivity, as supported by these 10 studies: [...] and contradicted by 1 study [...]."
I need to specify a source for statements in Neo4j and be able to do 2 ways operations, like:
Find all statements supported by a study
Find all studies supporting a statement
The most immediate idea I had is to use the DOI of studies as unique identifiers in the relationship property. The big con of this idea is that I have to scan all the relationships to find the list of all statements supported by a study.
So, since it is impossible to make a link between a study and a relationship, I had the idea to make 2 links, at each extremity of the relationship. The obvious con is that it does not give information about the relationship, like "support" or "contradict".
So, I came to the conclusion that I need a node for the hypothesis:
However, it overloads the graph and we are not anymore in the classical node -relationship-> node design that makes property graphs so easy to understand.
Using RDF, it is possible to add properties to relationships using subgraphs, however there we enter semantic graphs and quad stores, which is a more complex tool.
So I'm wondering if there is a "correct" design pattern for Neo4j to support this type of need that I may not have imagined instead?
Thanks
Based on your requirements, I think put support_study as property of edge will do the work:
Thus we could query the following as:
Find all statements supported by a study
MATCH ()-[e:has_cause{support_study: "doi_foo_bar"}]->()
RETURN e;
Find all studies supporting a statement
Given statement is “foo” is caused by “bar”
MATCH (v:disease{name: "foo"})-[e:has_cause]->(v1:sympton{name: "bar")
RETURN DISTINCT e.support_study;
While, this is mostly based on NebulaGraph, where:
It speaks cypher DQL(together with nGQL)
It supports properties in edge
It used 4-tuple(rather than a Key) to distingush an edge(src,dst,edge_type,rank), where rank is an unique design to enable multiple has_cause edge instance between one pair of disease-> sympton, you could put the hash of doi or other number as rank field(or omit, of cause, it will be 0)
It’s distributed and Open-Source(Apache 2.0)
Note:
In NebulaGraph, index should be created on has_cause(support_study) and disease(name), ref: https://www.siwei.io/en/nebula-index-explained/ and https://docs.nebula-graph.io/3.2.0/3.ngql-guide/14.native-index-statements/
But, I think it applies to neo4j, too :)

Neo4j OGM Filter query against array

I'm trying to build a small property searching system. I'm using spring boot with a neo4j database and I'm trying to query from the neo4j database using filters becasue i need the querying to be dynamic too. In my use case properties has features such as electricity, tap water, tilled roof, etc. Property & Feature are nodes, Feature node has an attribtue named 'key', a Property node is linked to many Feature nodes by rich relationships typed HAS_FEATURE, what i want to do is query properties for a given array of feature keys using Filters. Follwing is code,
featureKeys is a java List here,
filter = new Filter("key", ComparisonOperator.IN, featureKeys);
filter.setRelationshipType("HAS_FEATURE");
filter.setNestedPropertyName("hasFeatures");
filter.setNestedPropertyType(Feature.class);
filters.add(filter);
SESSION.loadAll(Property.class, filters, new Pagination(pageNumber, pageSize));
The problem is i want only the properties that related to all the given features keys to be returned, but even the properties that is related to one element of the given feature key list is also returned. What do i need to do to query only the properties that are linked to all the given list elements, i can change the rich relationship to a normal relationship if needed.

How do a general search across string properties in my nodes?

Working with Neo4j in a Rails app.
I have nodes with several string properties containing long strings of user generated content. For example in my nodes of type: "Book", I might have properties, "review", and "summary", which would contain long-form string values.
I was trying to design queries that returned nodes which match those properties to general language search terms provided by a user in a search box. As my query got increasingly complicated, it occurred to me that I was trying to resolve natural language search.
I looked into some of the popular search gems in Rails, but they all seem to depend on ActiveRecord. What search solutions exist for Neo4j.rb?
There are a few ways that you could go about this!
As FrobberOfBits said, Neo4j has what are called "legacy indexes" which use Lucene it the background to provide indexing of generic things. It does support the new schema indexes. Unfortunately those are based on exact matches (though I'm pretty sure that will change in Neo4j 2.3.x somewhat).
Neo4j does support pattern matching on strings via the =~ operator, but those queries aren't indexed. So the performance depends on the size of your database.
We often recommend a gem called searchkick which lets you define indexes for Elasticsearch in your models. Then you can just call a Model.search method to do your searches and it will first query elasticsearch to get the node IDs and then load those nodes via Neo4j.rb. You can use that via the neo4j-searchkick gem: https://github.com/neo4jrb/neo4j-searchkick
Lastly, if you're doing NLP and are trying to extract important words from your text, you could create a Tag/Word label and create relationships from your nodes to these NLP extracted nodes so that you can search based on those nodes in the future. You could even build recommendations from one text node to another based on the number/type of common tag nodes.
I don't know if anything specific exists for neo4j.rb and activerecord. What I can say is that generally this stuff is handled through the use of legacy indexes that are implemented by Lucene.
The premise is that you create a lucene-managed index on certain properties, and that then gives you access to use the Lucene query language via cypher to get data from those indices. Relative to neo4j.rb, it doesn't look any different than running cypher queries, like this:
START item=node:node_auto_index("(title:'foo bar' AND body:baz*) OR title:'bat'")
RETURN item
Note that lucene indexes and that query language can only be used in a START block, not a MATCH block. Refer to the Lucene Query Syntax to discover more about what you can do with that query syntax (fuzzy matching, wildcards, etc -- quite a bit more extensive than what regex would give you).

Graph Databases vs Triple Stores - when to use which?

I know that there are similar questions around on Stackoverflow but I don't feel they answer the following.
Graph Databases to my understanding store data following mostly this schema:
Table/Collection 1: store nodes with UID
Table/Collection 2: store relations referencing nodes via UID
This allows storing arbitrary types of graphs. Now as I understand triple stores store nothing but triples:
Triple/Collection 1: store triples (2 nodes, 1 relation)
Now I would see the following distinction regarding use cases:
Graph Databases: when you have known, static connections
Triple Stores: when you have loosely connected nodes and are often looking for new connections
I am confused by the fact that people do not seem to be discussing which one to use according to these criteria. Most article I find are talking about arguments like speed or compatibility. But is this not the most relevant point?
Put the other way round:
Imagine having a clearly connected, user defined graph. Why on earth would you want to store that as triples only, loosing all the info about connections? Or having to implement some custom solution storing IDs in the triple subject.
Imagine having loosely collected nodes that you want to query for unknown relations using SPARQL. Graph databases do support that. But for this they have to build another index I assume and would be slower?
EDIT:
I see that "loosing info about connections" is the wrong way to put it. If you do as shown in the accepted answer and insert several triples for 2 nodes + 1 relation then you keep all the info and specifically the info what exact nodes are connected.
The main difference between graph databases and triple stores is how they model the graph. In a triple store (or quad store), the data tends to be very atomic. What I mean is that the "nodes" in the graph tend to be primitive data types like string, integer, date, etc. Relationships link primitives together, and so the "unit of discourse" in a triple store is a triple, and not a node or a relationship, typically.
By contrast, other graph databases are often called "property stores" because nodes are data containers that correspond to objects in a domain. A node stands in for an object, and has properties; they act as rich data types specified by the graph modelers, more than just primitive data types. In these graph databases, nodes and relationships are the "unit of discourse".
Let's say I have a person named "Bob" who knows "Susan". In RDF, it would be something like this:
<http://example.org/person/1> :hasName "Bob".
<http://example.org/person/1> foaf:knows <http://example.org/person/2>.
<http://example.org/person/2> :hasName "Susan".
In a graph database like neo4j, it would be this:
(a:Person {name: "Bob"})-[:KNOWS]->(b:Person {name: "Susan"})
Notice that in RDF, it's 3 relationships but only one of those relationships actually expresses semantics between two entities. The other two relationships are just tracking properties of a single higher-level entity (the person). In neo4j, it's 1 relationship amongst two nodes, with each node having a property. In RDF you'll tend to identify things by URI, in neo4j it's a database object that gets a database ID automatically. That's what I mean about the difference between a more atomic/primitive store (triple stores) and a richer property graph.
RDF and triple stores are mostly built for the kinds of architectural challenges you'd run into with the semantic web. For example, XML namespacing is built in, on the architectural assumption that you'll be mixing and matching the use of many different vocabularies and namespaces. (That right there is a very "semantic web" assumption). So in SPARQL and RDF you'll see typically at least the use of xsd, rdf, and rdfs namespaces concurrently, and probably also owl, skos, and many others. SPARQL and RDF/RDFS also have many hooks and features that are there explicitly to make things like ontology inference easier. You'll tend to identify things with URIs as a way of "namespacing your identifiers" but also because some people may want to de-reference the URI...again the assumption here is a wide data sharing arrangement between many parties.
Property stores by contrast are keyed towards different use cases, like flexible modeling of data within one model/namespace, mappings between objects and graphs for persistence of enterprise applications, rapid evolvability, and so on. You'll tend to identify things with your own scheme (or an internal database ID). An auto-incrementing integer may not be best form of ID for any random consumer on the web, (and they certainly can't be de-referenced like URLs) but they might not be your first thought for a company internal application.
So which is better? The more atomic triple store format, or a rich property graph? Do you need to mix and match many different vocabularies in one query or data model? Do you need to create an OWL ontology or do inference? Do you need to serialize a bunch of java objects in memory to a database? Do you need to do fast traversal of long paths? Those types of questions would guide your selection.
Graphs are graphs, both of them do graphs, and so I don't think there's much difference in terms of what they can represent, or how you go about thinking about a problem in "graph terms". The differences boil down to the architecture underneath of the hood, and what sorts of use cases you think you'll need. I won't tell you one is better than the other, but choose wisely.
(in reply to the comments on this answer: https://stackoverflow.com/a/30167732 )
When an owl:inverseOf production rule is defined, the inverse property triple is inferred by the reasoner either when adding or updating the store, or when selecting from the store. This is a "materialized relation"
Schema.org - an RDFS vocabulary - defines, for example, https://schema.org/isPartOf as the inverse property of hasPart. If both are specified, it's not necessary to run another graph pattern query to traverse a directed relation in the other direction.
(:book1 schema:hasPart ?o)
(?o schema:isPartOf :book1)
(?s schema:hasPart :chapter2)
It's certainly possible to use RDFS and OWL to describe schema for and within neo4j property graphs; but there's no reasoner to e.g. infer inverse properties or do schema validation.
Is there any RDF graph that neo4j cannot store? RDF has datatypes and languages for objects: you'd need to reify properties where datatypes and/or languages are specified (and you'd be re-implementing well-defined semantics)
Can every neo4j graph be represented with RDF? Yes.
RDF is a representation for graphs for which there are very many store implementations that are optimized for various use cases like insert and query performance.
Comparing neo4j to a particular triplestore (with reasoning support) might be a more useful comparison given that all neo4j graphs can be expressed as RDF.

Neo4j node property type

I'm playing around with neo4j, and I was wondering, is it common to have a type property on nodes that specify what type of Node it is? I've tried searching for this practice, and I've seen some people use name for a purpose like this, but I was wondering if it was considered a good practice or if indexes would be the more practical method?
An example would be a "User" node, which would have type: user, this way if the index was bad, I would be able to do an all-node scan and look for types of user.
Labels have been added to neo4j 2.0. They fix this problem.
You can create nodes with labels:
CREATE (me:American {name: "Emil"}) RETURN me;
You can match on labels:
MATCH (n:American)
WHERE n.name = 'Emil'
RETURN n
You can set any number of labels on a node:
MATCH (n)
WHERE n.name='Emil'
SET n :Swedish:Bossman
RETURN n
You can delete any number of labels on a node:
MATCH (n { name: 'Emil' })
REMOVE n:Swedish
Etc...
True, it does depend on your use case.
If you add a type property and then wish to find all users, then you're in potential trouble as you've got to examine that property on every node to get to the users. In that case, the index would probably do better- but not in cases where you need to query for all users with conditions and relations not available in the index (unless of course, your index is the source of the "start").
If you have graphs like mine, where a relation type implies two different node types like A-(knows)-(B) and A or B can be a User or a Customer, then it doesn't work.
So your use case is really important- it's easy to model graphs generically, but important to "tune" it as per your usage pattern.
IMHO you shouldn't have to put a type property on the node. Instead, a common way to reference all nodes of a specific "type" is to connect all user nodes to a node called "Users" maybe. That way starting at the Users node, you can very easily find all user nodes. The "Users" node itself can be indexed so you can find it easily, or it can be connected to the reference node.
I think it's really up to you. Some people like indexed type attributes, but I find that they're mostly useful when you have other indexed attributes to narrow down the number of index hits (search for all users over age 21, for example).
That said, as #Luanne points out, most of us try to solve the problem in-graph first. Another way to do that (and the more natural way, in my opinion) is to use the relationship type to infer a practical node type, i.e. "A - (knows) -> B", so A must be a user or some other thing that can "know", and B must be another user, a topic, or some other object that can "be known".
For client APIs, modeling the element type as a property makes it easy to instantiate the right domain object in your client-side code so I always include a type property on each node/vertex.
The "type" var name is commonly used for this, but in some languages like Python, "type" is a reserved word so I use "element_type" in Bulbs ( http://bulbflow.com/quickstart/#models ).
This is not needed for edges/relationships because they already contain a type (the label) -- note that Neo4j also uses the keyword "type" instead of label for relationships.
I'd say it's common practice. As an example, this is exactly how Spring Data Neo4j knows of which entity type a certain node is. Each node has "type" property that contains the qualified class name of the entity. These properties are automatically indexed in the "types" index, thus nodes can be looked up really fast. You could implement your use case exactly like this.
Labels have recently been added to Neo4j 2.0 ( http://docs.neo4j.org/chunked/milestone/graphdb-neo4j-labels.html ). They are still under development at the moment, but they address this exact problem.

Resources