Self-join on DocumentDB: syntax error

I'm having trouble with an otherwise valid SQL self-join query on DocumentDB.
So the following query works:
SELECT * FROM c AS c1 WHERE c1.obj="car"
But this simple self-join query does not: SELECT c1.url FROM c AS c1 JOIN c AS c2 WHERE c1.obj="car" AND c2.obj="person" AND c1.url = c2.url. It fails with the error: Identifier 'c' could not be resolved.
It seems that DocumentDB supports self-joins within a document, but I'm asking at the collection level.
I looked at the official syntax doc and understand that the collection name is basically inferred; I tried explicitly changing c to my collection name, and to root, but neither worked.
Am I missing something obvious? Thanks!

A few things to clarify:
1.) Regarding Identifier 'c' could not be resolved
Queries are scoped to a single collection, and in the example above, c is an implicit alias for the collection, which is being re-aliased to c1 with the AS keyword.
You can fix the example query by changing the JOIN to reference c1:
SELECT c1.url
FROM c AS c1
JOIN c1 AS c2
WHERE c1.obj="car" AND c2.obj="person" AND c1.url = c2.url
This is also equivalent to:
SELECT c1.url
FROM c1
JOIN c1 AS c2
WHERE c1.obj="car" AND c2.obj="person" AND c1.url = c2.url
2.) Understanding JOINs and examining your data model
With that said, I don't think fixing the query syntax issue above will produce the behavior you are expecting. The JOIN keyword in DocumentDB SQL is designed for forming a cross product with a denormalized array of elements within a document (as opposed to forming cross products across other documents in the same collection). If you run into trouble here, it may be worth taking a step back and revisiting how to model your data for Azure Cosmos DB.
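For illustration, here is the kind of intra-document JOIN the language is built for; a minimal sketch assuming a hypothetical document shape with a denormalized tags array (url and tags are illustrative properties, not from your data):
SELECT c.url, tag
FROM c
JOIN tag IN c.tags
WHERE tag = "car"
Here JOIN forms a cross product between each document and the elements of its own tags array, which is the intended usage.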
In an RDBMS, you are trained to think entity-first and normalize your data model based on entities. You rely heavily on a query engine to optimize queries to fit your workload (and query engines typically do a good, but not always optimal, job of retrieving data). The challenges here are that many relational benefits are lost as scale increases, and scaling out to multiple shards/partitions becomes a requirement.
For a scale-out distributed database like Cosmos DB, you will want to start with understanding the workload first and optimize your data model to fit the workload (as opposed to thinking entity first). You'll want to keep in mind that collections are merely a logical abstraction composed of many replicas that live within partition sets. They do not enforce schema and are the boundary for queries.
When designing your model, you will want to incorporate the following questions into your thought process:
What is the scale, in terms of size and throughput, for the broader solution (an estimate of order of magnitude is sufficient)?
What is the ratio of reads vs writes?
For writes - what is the pattern for writes? Is it mostly inserts, or are there a lot of updates?
For reads - what do top N queries look like?
The above should influence your choice of partition key as well as what your data / object model should look like. For example:
The ratio of requests will help guide how you make tradeoffs (use the Pareto principle and optimize for the bulk of your workload).
For read-heavy workloads, commonly filtered properties become candidates for choice of partition key.
Properties that tend to be updated together frequently should be abstracted together in the data model, and away from properties that get updated with a slower cadence (to lower the RU charge for updates).
Don't be afraid to duplicate properties across different record types to enrich queryability, and to annotate documents with their types. For example, suppose we have two types of documents: cat and person.
{
   "id": "Andrew",
   "type": "Person",
   "familyId": "Liu",
   "employer": "Microsoft"
}
 
{
   "id": "Ralph",
   "type": "Cat",
   "familyId": "Liu",
   "fur": {
         "length": "short",
         "color": "brown"
   }
}
 
We can query both types of documents without needing a JOIN simply by running a query without a filter on type:
SELECT * FROM c WHERE c.familyId = "Liu"
And if we wanted to filter on type = “Person”, we can simply add a filter on type to our query:
SELECT * FROM c WHERE c.familyId = "Liu" AND c.type = "Person"

The answer above has the queries mentioned by @Andrew Liu. This will resolve your error, but note that Azure Cosmos DB does not support cross-item and cross-container joins. Use this link to read about joins: https://learn.microsoft.com/en-us/azure/cosmos-db/sql/sql-query-join

Related

Neo4j data modeling: correct way to specify a source for a statement?

I'm working on a scientific database that contains model statements such as:
"A possible cause of Fibromyalgia is Microglial hyperactivity, as supported by these 10 studies: [...] and contradicted by 1 study [...]."
I need to specify a source for statements in Neo4j and be able to do 2-way operations, like:
Find all statements supported by a study
Find all studies supporting a statement
The most immediate idea I had is to use the DOI of studies as unique identifiers in the relationship property. The big con of this idea is that I have to scan all the relationships to find the list of all statements supported by a study.
So, since it is impossible to make a link between a study and a relationship, I had the idea to make 2 links, at each extremity of the relationship. The obvious con is that it does not give information about the relationship, like "support" or "contradict".
So, I came to the conclusion that I need a node for the hypothesis.
However, it overloads the graph, and we are no longer in the classical node -relationship-> node design that makes property graphs so easy to understand.
Using RDF, it is possible to add properties to relationships using subgraphs, but there we enter the territory of semantic graphs and quad stores, which are more complex tools.
So I'm wondering: is there a "correct" design pattern for Neo4j that supports this type of need, which I may not have imagined?
Thanks
Based on your requirements, I think putting support_study as a property of the edge will do the job:
We could then express the queries as:
Find all statements supported by a study
MATCH ()-[e:has_cause{support_study: "doi_foo_bar"}]->()
RETURN e;
Find all studies supporting a statement
Given the statement "foo is caused by bar":
MATCH (v:disease{name: "foo"})-[e:has_cause]->(v1:sympton{name: "bar"})
RETURN DISTINCT e.support_study;
Note that this is mostly based on NebulaGraph, where:
It speaks Cypher DQL (together with nGQL)
It supports properties on edges
It uses a 4-tuple (src, dst, edge_type, rank) rather than a single key to distinguish an edge, where rank is a unique design that enables multiple has_cause edge instances between one pair of disease -> sympton; you could put the hash of the DOI or another number in the rank field (or omit it, of course, in which case it will be 0)
It's distributed and open-source (Apache 2.0)
Note:
In NebulaGraph, an index should be created on has_cause(support_study) and disease(name); ref: https://www.siwei.io/en/nebula-index-explained/ and https://docs.nebula-graph.io/3.2.0/3.ngql-guide/14.native-index-statements/
But, I think it applies to neo4j, too :)
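For Neo4j specifically, a minimal sketch of the equivalent supporting indexes (assuming Neo4j 4.3+, which added relationship property indexes; the index names are illustrative):
CREATE INDEX has_cause_support_study IF NOT EXISTS FOR ()-[e:has_cause]-() ON (e.support_study);
CREATE INDEX disease_name IF NOT EXISTS FOR (d:disease) ON (d.name);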

Is there a definite answer for the performance of 1st level depth query search in Neo4j (or any other graph db) vs. SQL/NoSQL?

I am testing the use of Neo4j for a social-like graph, but I also have many use cases that require 1st-level depth queries (e.g. get my likes / views). Consequently, I wish to decide whether I need another SQL/NoSQL database to support this type of query (in terms of performance).
Up until now, I was only able to find benchmarks and quantitative data concerning > 1st level searches (i.e. friends of my friends ...).
Is it common knowledge that SQL/NoSQL databases will have better performance for such queries? Is there any research, or are there benchmarks, on this?
A depth of 1 shouldn't result in much difference in performance, I'd think. However, that does depend on two factors: the indexes you've set up, and the depth of expansion.
For both Neo4j and your relational db, you would want a supporting index on the starting node in the graph (the person or post whose likes/views you want to get). For the relational db, you would also want an index to support the join operation being used to get at the connected nodes.
For Neo4j, the expansion to the connected nodes is directly proportional to the number of nodes you are expanding to, since this is just pointer chasing between the nodes and relationships forming your graph. No indexes are used for that.
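As a hedged sketch (the :Person label, :LIKES type, and name property are assumed for illustration), the supporting index plus the 1st-level query in Neo4j would look something like:
CREATE INDEX person_name IF NOT EXISTS FOR (p:Person) ON (p.name);
MATCH (p:Person {name: "alice"})-[:LIKES]->(post)
RETURN post;
The index is only used to find the starting node; the expansion itself is pointer chasing.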
For a relational database, the relationship would likely be modeled as a table join (which should be index-backed), and that cost will be proportional to the size of the tables being joined; so as more data is added (regardless of whether it is connected to the user you are querying for), it will impact your execution time.
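The relational equivalent, again as an illustrative sketch over an assumed users/likes/posts schema:
-- assumed schema: users(id, name), likes(user_id, post_id), posts(id, ...)
CREATE INDEX idx_likes_user ON likes (user_id);
SELECT p.*
FROM likes l
JOIN posts p ON p.id = l.post_id
WHERE l.user_id = 42;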
Thankfully, for your case only a single table join would be needed, so you may not see a big difference between a graph db and a relational db. Neo4j tends to shine when many traversals (possibly an unbounded number) are needed, as with friend-of-a-friend queries or longer patterns. If your use cases include longer patterns, especially if the types of the nodes expanded to are not known ahead of time, then Neo4j would be very helpful, especially as the data in your database grows, since traversal performance is proportional only to the directly connected data, not the total number of nodes of the given labels.

Graph Databases vs Triple Stores - when to use which?

I know that there are similar questions around on Stackoverflow but I don't feel they answer the following.
Graph Databases to my understanding store data following mostly this schema:
Table/Collection 1: store nodes with UID
Table/Collection 2: store relations referencing nodes via UID
This allows storing arbitrary types of graphs. Now, as I understand it, triple stores store nothing but triples:
Triple/Collection 1: store triples (2 nodes, 1 relation)
Now I would see the following distinction regarding use cases:
Graph Databases: when you have known, static connections
Triple Stores: when you have loosely connected nodes and are often looking for new connections
I am confused by the fact that people do not seem to be discussing which one to use according to these criteria. Most articles I find talk about arguments like speed or compatibility. But is this not the most relevant point?
Put the other way round:
Imagine having a clearly connected, user-defined graph. Why on earth would you want to store that as triples only, losing all the info about connections? Or having to implement some custom solution storing IDs in the triple subject.
Imagine having loosely collected nodes that you want to query for unknown relations using SPARQL. Graph databases do support that. But for this, I assume they have to build another index, and would be slower?
EDIT:
I see that "loosing info about connections" is the wrong way to put it. If you do as shown in the accepted answer and insert several triples for 2 nodes + 1 relation then you keep all the info and specifically the info what exact nodes are connected.
The main difference between graph databases and triple stores is how they model the graph. In a triple store (or quad store), the data tends to be very atomic. What I mean is that the "nodes" in the graph tend to be primitive data types like string, integer, date, etc. Relationships link primitives together, and so the "unit of discourse" in a triple store is a triple, and not a node or a relationship, typically.
By contrast, other graph databases are often called "property stores" because nodes are data containers that correspond to objects in a domain. A node stands in for an object, and has properties; they act as rich data types specified by the graph modelers, more than just primitive data types. In these graph databases, nodes and relationships are the "unit of discourse".
Let's say I have a person named "Bob" who knows "Susan". In RDF, it would be something like this:
<http://example.org/person/1> :hasName "Bob".
<http://example.org/person/1> foaf:knows <http://example.org/person/2>.
<http://example.org/person/2> :hasName "Susan".
In a graph database like neo4j, it would be this:
(a:Person {name: "Bob"})-[:KNOWS]->(b:Person {name: "Susan"})
Notice that in RDF, it's 3 relationships but only one of those relationships actually expresses semantics between two entities. The other two relationships are just tracking properties of a single higher-level entity (the person). In neo4j, it's 1 relationship amongst two nodes, with each node having a property. In RDF you'll tend to identify things by URI, in neo4j it's a database object that gets a database ID automatically. That's what I mean about the difference between a more atomic/primitive store (triple stores) and a richer property graph.
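To make the contrast concrete, here is a hedged sketch of the triple-store query for "whom does Bob know?", reusing the (assumed) prefixes from the snippet above:
SELECT ?name WHERE {
  ?bob :hasName "Bob" .
  ?bob foaf:knows ?friend .
  ?friend :hasName ?name .
}
The query is phrased triple-by-triple, whereas the Cypher equivalent matches the single (a:Person)-[:KNOWS]->(b:Person) pattern.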
RDF and triple stores are mostly built for the kinds of architectural challenges you'd run into with the semantic web. For example, XML namespacing is built in, on the architectural assumption that you'll be mixing and matching the use of many different vocabularies and namespaces. (That right there is a very "semantic web" assumption). So in SPARQL and RDF you'll see typically at least the use of xsd, rdf, and rdfs namespaces concurrently, and probably also owl, skos, and many others. SPARQL and RDF/RDFS also have many hooks and features that are there explicitly to make things like ontology inference easier. You'll tend to identify things with URIs as a way of "namespacing your identifiers" but also because some people may want to de-reference the URI...again the assumption here is a wide data sharing arrangement between many parties.
Property stores by contrast are keyed towards different use cases, like flexible modeling of data within one model/namespace, mappings between objects and graphs for persistence of enterprise applications, rapid evolvability, and so on. You'll tend to identify things with your own scheme (or an internal database ID). An auto-incrementing integer may not be the best form of ID for any random consumer on the web (and it certainly can't be de-referenced like a URL), but it may be perfectly fine for a company-internal application.
So which is better? The more atomic triple store format, or a rich property graph? Do you need to mix and match many different vocabularies in one query or data model? Do you need to create an OWL ontology or do inference? Do you need to serialize a bunch of java objects in memory to a database? Do you need to do fast traversal of long paths? Those types of questions would guide your selection.
Graphs are graphs, both of them do graphs, and so I don't think there's much difference in terms of what they can represent, or how you go about thinking about a problem in "graph terms". The differences boil down to the architecture under the hood, and what sorts of use cases you think you'll need. I won't tell you one is better than the other, but choose wisely.
(in reply to the comments on this answer: https://stackoverflow.com/a/30167732 )
When an owl:inverseOf production rule is defined, the inverse property triple is inferred by the reasoner either when adding or updating the store, or when selecting from the store. This is a "materialized relation"
Schema.org - an RDFS vocabulary - defines, for example, https://schema.org/isPartOf as the inverse property of hasPart. If both are specified, it's not necessary to run another graph pattern query to traverse a directed relation in the other direction.
(:book1 schema:hasPart ?o)
(?o schema:isPartOf :book1)
(?s schema:hasPart :chapter2)
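In Turtle, a minimal sketch of how such an inverse can be declared and what a reasoner materializes from it (the example.org prefix is illustrative; schema.org's own vocabulary expresses its inverses slightly differently):
@prefix : <http://example.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix schema: <https://schema.org/> .
schema:isPartOf owl:inverseOf schema:hasPart .
# asserted:
:book1 schema:hasPart :chapter2 .
# a reasoner materializes the inverse:
:chapter2 schema:isPartOf :book1 .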
It's certainly possible to use RDFS and OWL to describe schema for and within neo4j property graphs; but there's no reasoner to e.g. infer inverse properties or do schema validation.
Is there any RDF graph that neo4j cannot store? RDF has datatypes and languages for objects: you'd need to reify properties where datatypes and/or languages are specified (and you'd be re-implementing well-defined semantics)
Can every neo4j graph be represented with RDF? Yes.
RDF is a representation for graphs for which there are very many store implementations that are optimized for various use cases like insert and query performance.
Comparing neo4j to a particular triplestore (with reasoning support) might be a more useful comparison given that all neo4j graphs can be expressed as RDF.

How to model a relational database into a neo4j graph database?

I have a relational database (about 30 tables) and I would like to transpose it into a neo4j graph database, and I don't know where to start...
Is there a general way to transpose tables and/or tuples into a graph model? (Relation properties? One or more graphs?) What are the best sources of documentation?
Thanks for any help,
Best regards
First, if at all possible, I'd suggest NOT using your relational DB as your "reference" for transposing to a graph model. All too often, mistakes and pitfalls from relational modelling get transferred over to the graph model and introduce other oddities. In fact, if you have a source ER diagram, that might be an even better starting point as it's really already a graph. And maybe even consider a re-modelling exercise for your domain!
That said, from a basic point of view, you can think of most tables as representing a node type (e.g. "User" or "Movie") with join tables and keys representing relationship types.
A great starting point, from my perspective anyway, is to determine some questions your graph/data source should answer. Write those questions down, and try to come up with Cypher queries that represent the questions. Often times, a graph model naturally arises from such an effort, and it's really not that difficult.
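For instance, a hedged sketch: if one of your questions is "which users bought product X?", the Cypher you would write for it (the labels and the :BOUGHT relationship type here are hypothetical) already suggests the model:
MATCH (u:User)-[:BOUGHT]->(p:Product {sku: "X"})
RETURN u.name;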
If you haven't already, I'd strongly recommend picking up a (free) copy of the Graph Databases ebook from here: http://graphdatabases.com/
It's jam-packed with a lot of good info on where to start with modelling your domain and even things to consider when you're used to doing things in a relational manner. It also contains some material on Cypher, although the Neo4j site (neo4j.org) has a reference manual with plenty of up-to-date info on Cypher.
Hope this helps!
There's not going to be a one-stop-shop for this kind of conversion, as not all data models are appropriate for graph modeling, and every application is a unique special snowflake...but with that said.....
Generally, your 'base' tables (e.g. User, Role, Order, Product) would become nodes, and your 'join tables' (a.k.a. junction tables) would be candidates for your relationships (e.g. UserRole, OrderLineItem). The key thing to remember is that in a graph, generally, you can only have one relationship of a given type between two specific nodes; so in the above example, if your system allows the same product to be in an order twice, it would cause issues.
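As an illustrative sketch (the schema and names are assumed, not from the question), a row of an order_line_items join table could become a relationship carrying the row's non-key columns:
// order_line_items(order_id, product_id, quantity) becomes a relationship:
CREATE (o:Order {id: 1001})
CREATE (p:Product {sku: "ABC-123"})
CREATE (o)-[:CONTAINS {quantity: 2}]->(p);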
Foreign keys are your second source of relationships; look at them to see whether each makes more sense as a relationship or just a property.
Just keep in mind what you are trying to solve with your data model: if it's traversing your objects to find relationships and distance, etc., then graphs may be a good fit. If you are modeling an eCommerce app, where you are dealing with manipulating a single nested object (e.g. order -> line item -> product -> sku), then a relational model may be the right fit.
Hope my $0.02 helps...
As has been already said, there is no magical transformation from a relational database model to a graph database model.
You should look for the original entities and how they are related in order to find your nodes, properties, and relations. And always keep in mind what types of queries you are going to perform.
As BtySgtMajor said, "Graph Databases" is a good book to start, and it is free.

Why do relationships as a concept exist in neo4j or graph databases in general?

I can't seem to find any discussion on this. I had been imagining a database that was schemaless, node-based, and hierarchical, and one day I decided it was too common-sense to not exist, so I started searching around, and neo4j is about 95% of what I imagined.
What I didn't imagine was the concept of relationships. I don't understand why they are necessary. They seem to add a ton of complexity to all topics centered around graph databases, but I don't quite understand what the benefit is. Relationships seem to be almost exactly like nodes, except more limited.
To explain what I'm thinking, I was imagining starting a company, so I create my first nodes:
create (u:User {name: "mindreader"});
create (c:Company {name: "mindreader Corp"});
One day I get a customer, so I put his company into my db.
create (c:Company {name: "Customer Company"});
create (u:User {name: "Customer Employee1"});
create (u:User {name: "Customer Employee2"});
I decide to link users to their companies:
match (u:User) where u.name =~ "Customer.*"
match (c:Company) where c.name =~ "Customer.*"
create (u)-[:Employee]->(c);
match (u:User) where u.name = "mindreader"
match (c:Company) where c.name =~ "mindreader.*"
create (u)-[:Employee]->(c);
Then I hire some people:
match (c:Company) where c.name =~ "mindreader.*"
create (u1:User {name: "Employee1"})-[:Employee]->(c)
create (u2:User {name: "Employee2"})-[:Employee]->(c);
One day hr says they need to know when I hired employees. Okay:
match (c:Company)<-[r:Employee]-(u:User)
where c.name =~ "mindreader.*" and u.name =~ "Employee.*"
set r.hiredate = '2013-01-01';
Then hr comes back and says hey, we need to know which person in the company recruited a new employee so that they can get a cash reward for it.
Well, now what I need is for a relationship to point to a user, but that isn't allowed (a :Hired_By relationship between the :Employee relationship and a User). We could have an extra relationship :Hired_By, but if the :Employee relationship is ever deleted, the :Hired_By will remain unless someone remembers to delete it.
What I could have done in neo4j was just have a
(u:User)-[:hiring_info]->(hire_info:HiringInfo)-[:hired_by]->(u2:User)
In which case the relationships only confer minimal information, the name.
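Spelled out as a runnable sketch (reusing names from the examples above):
match (e:User) where e.name = "Employee1"
match (r:User) where r.name = "mindreader"
create (e)-[:hiring_info]->(hi:HiringInfo {hire_date: "2013-01-01"})-[:hired_by]->(r);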
What I originally envisioned was that there would be nodes, and then each property of a node could be a datatype or it could be a pointer to another node. In my case, a user record would end up looking like:
User {
name: "Employee1"
hiring_info: {
hire_date: "2013-01-01"
hired_by: u:User # -> would point to a user
}
}
Essentially it is still a graph. Nodes point to each other. The name of the relationship is just a field in the origin node. To query it you would just go
match (u:User) where ... return u.name, u.hiring_info.hiring_date, u.hiring_info.hired_by.name
If you needed a one-to-many relationship of the same type, you would just have a collection of pointers to nodes. If you referenced a collection in a return, you'd get essentially a join. If you deleted hiring_info, it would delete the pointer. References to other nodes would not have to be a disorganized list at the top level of a node. Furthermore, when I query each user, I will know all of the info about that user without both querying for the user itself and also for all of its relationships. I would know his name and the fact that he hired someone in the same query. From the database backend, I'm not sure much would change.
I see quite a few questions from people asking whether they should use nodes or relationships to model this or that, and occasionally people asking for a relationship between relationships. It feels like the XML problem where you are wondering whether a piece of information should be its own tag or just a property of its parent tag.
The query engine goes to great pains to handle relationships, so there must be some huge advantage to having them, but I can't quite see it.
Different databases are for different things. You seem to be looking for a noSQL database.
This is an extremely wide topic area that you've reached into, so I'll give you the short of it. There's a spectrum of database schemas, each of which have different use cases.
NoSQL aka Non-relational Databases:
Every object is a single document. You can have references to other documents, but any additional traversal means you're making another query. These fit cases where you don't often have relationships between your data, and you usually just want to query once and get back a large amount of flexibly-stored data as the returned document. (Note: these are not "nodes"; node has a very specific definition and implies that there are edges.)
SQL aka Relational Databases:
This is table land; this is where foreign keys and one-to-many relationships come into play. Here you have strict schemas and very fast queries. This is honestly what you should use for your user example. Small amounts of data, where the relationships between things are shallow (you don't have to follow a relationship more than 1-2 times to get to the relevant entry), are where these excel.
Graph Database:
Use this when relationships are key to what you're trying to do. The most common example of a graph is something like a social graph where you're connecting different users together and need to follow relationships for many steps. (Figure out whether two people are connected within a depth of 4, for instance.)
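That canonical case, as a hedged Cypher sketch (the :Person label and :KNOWS type are assumed):
MATCH path = (a:Person {name: "Alice"})-[:KNOWS*1..4]-(b:Person {name: "Bob"})
RETURN count(path) > 0 AS connected;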
Relationships exist in graph databases because that is the entire concept of a graph database. It doesn't really fit your application, but to be fair you could just keep more in the node part of your database. In general the whole idea of a database is something that lets you query a LOT of data very quickly. Depending on the intrinsic structure of your data there are different ways that that makes sense. Hence the different kinds of databases.
In strongly connected graphs, Neo4j is 1000x faster on 1000x the data than a SQL database. NoSQL would probably never be able to perform in a strongly connected graph scenario.
Take a look at what we're building right now: http://vimeo.com/81206025
Update: In reaction to mindreader's comment, we added the related properties to the picture.
RDBM systems are tabular and put more information in the tables than the relationships. Graph databases put more information in relationships. In the end, you can accomplish much the same goals.
However, putting more information in relationships can make queries smaller and faster.
Here's an example:
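A hedged Cypher sketch of the idea, reusing the hiredate-on-relationship model from the question above:
match (c:Company)<-[r:Employee]-(u:User)
where r.hiredate >= '2013-01-01'
return u.name, r.hiredate;
The filter lives on the relationship itself, so no intermediate node or extra join table is needed.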
Graph databases are also good at storing human-readable knowledge representations, being edge (relationship) centric. RDF takes it one step further, where all information is stored as edges rather than nodes. This is ideal for working with predicate logic, propositional calculus, and triples.
Maybe the right answer is an object database.
Objectivity/DB, which now supports a full suite of graph database capabilities, allows you to design complex schema with one-to-one, one-to-many, many-to-one, and many-to-many reference attributes. It has the semantics to view objects as graph nodes and edges. An edge can be just the reference attribute from one node to another or an edge can exist as an edge object that sits between two nodes.
An edge object can have any number of attributes and can have references off to other objects, as shown in the diagram below.
Being able to "hang" complex objects off of an edge allows Objectivity/DB to support weighted queries, where the edge weight can be calculated using a user-defined weight-calculator operator. The weight calculator can build the weight from a static attribute on the edge, or build it by digging down through the objects connected to the edge. In the picture above, we could create an edge-weight calculator that computes the sum of the CallDetail lengths connected to the Call edge.
