Compliance with a schema in Neo4j

I am thinking of using a graph DB to store IFC data. Ideally, the DB should provide a way to define all the rule types defined in the IFC schema. However, I don't think any such database exists, because some of the rule types in IFC are very complex and require querying the DB. Others are simple, like uniqueness of GUIDs, existence of mandatory attributes, or data validation. Neo4j seems to have a few constraint-enforcing mechanisms:
Neo4j helps enforce data integrity with the use of constraints. Constraints can be applied to either nodes or relationships. Unique node property constraints can be created, as well as node and relationship property existence constraints.
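For concreteness, here is roughly what those built-in constraints would look like for my IFC case -- a minimal sketch assuming the Neo4j 3.x embedded Java API, with the IfcRoot label and guid property as illustrative stand-ins (property existence constraints require the Enterprise edition):

import org.neo4j.graphdb.GraphDatabaseService;

public class IfcSchemaSetup {
    // Run once at startup; each execute() call runs in its own
    // implicit transaction if none is open.
    public static void createConstraints(GraphDatabaseService db) {
        // Uniqueness of GUIDs across all IfcRoot nodes
        db.execute("CREATE CONSTRAINT ON (e:IfcRoot) ASSERT e.guid IS UNIQUE");
        // Mandatory attribute (property existence; Enterprise edition only)
        db.execute("CREATE CONSTRAINT ON (e:IfcRoot) ASSERT exists(e.guid)");
    }
}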
Are there other methods that can ensure compliance of entered data with a predefined schema?
Or are there other graph DBs that are more suitable for this job?

You can achieve pretty much everything you want by creating Transaction Event handlers.
http://neo4j.com/docs/java-reference/current/javadocs/org/neo4j/graphdb/event/TransactionEventHandler.html
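As a hedged sketch of the idea (Neo4j 3.x Java API; the mandatory-guid rule is just an example, not something specific to your schema):

import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.event.TransactionData;
import org.neo4j.graphdb.event.TransactionEventHandler;

// Vetoes any commit that creates a node without a guid property.
public class MandatoryGuidHandler extends TransactionEventHandler.Adapter<Void> {
    @Override
    public Void beforeCommit(TransactionData data) throws Exception {
        for (Node created : data.createdNodes()) {
            if (!created.hasProperty("guid")) {
                // Throwing from beforeCommit rolls the whole transaction back
                throw new Exception("every new node needs a guid");
            }
        }
        return null;
    }
}

// Registered once at startup:
// graphDb.registerTransactionEventHandler(new MandatoryGuidHandler());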
You can also take a look at the GraphAware Framework and all its submodules for example use cases, and for an easier way to create and deploy Neo4j extensions.

Depends on whether you need the schema enforced by the database itself, or whether you're OK with that being done at the application layer.
I've just gotten Restagraph to the "working prototype" level, and my next trick is Dockerising it.
It's a framework of sorts that enables you to define a schema by creating nodes and relationships in Neo4j with specific labels, and that dynamically creates a REST API to enforce it.
It's also written in Common Lisp, so I'll understand if you wait for the Docker image :)

Related

Graph databases for modeling a specific domain

With a normal 'graph database' the data is broken up into nodes and edges, and there isn't much of a restriction/schema on the connections. As such, it seems great for modeling straightforward graphs where the relationships are relatively consistent -- movies with cast and crew; computer networks with IPs and devices; social networks with users and connections; etc.
Are there any graph-like databases that are more specialized? For example, to be able to model something like an electrical circuit where each component has a sort of 'schema' or well-defined inputs and outputs -- i.e., a resistor has two connections and various properties; a transistor has three connections and various properties; etc.
I'm not asking about particular circuit simulators, such as https://www.falstad.com/circuit/circuitjs.html, but more about whether it's possible in any graph (or pseudo-graph) databases to model and enforce very specific, well-defined relationships in a network, such as circuit design.
Definitely possible.
I've been working on this problem with Neo4j, and Restagraph is the result. It provides a REST API that enforces a schema on any updates to the database, and I've packaged it as a Docker image.
I haven't really promoted it so far, because it's only recently been mature enough for my own use, and I really need to improve the documentation. If you try it out, though, I'd love to hear any feedback you have.
TLDR: in general yes, but it depends.
This is a really broad question, so let me break it down.
It's a bit of an exaggeration to talk about all graph databases (which are not as standardized as SQL databases -- which in turn are not very standardized either), so take this answer with a grain of salt: yes, that is possible.
As in SQL databases, you can usually set up constraints to be checked before any change to the data is persisted.
Most graph databases incorporate something along the lines of a "type", similar to what a table represents in SQL databases. Some allow you to constrain relationships to only target specific types, so you could restrict relationships, e.g. between a node using a CAN bus and one using an I2C bus, to those specific types.
If a database does not provide these mechanisms, it's usually possible to constrain relationships based on the existence of specific keys and values in the model. To use an example other than your circuit one: imagine a node-based system with typed inputs and outputs -- an int-based output can only be connected to an int-based input, a float-based output only to a float-based input, etc. You could then add an output_type and an input_type field to the nodes and constrain relationships between the values.
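A minimal application-side sketch of that rule against Neo4j's embedded Java API (the CONNECTS_TO relationship type is my assumption, and the method presumes an open transaction):

import java.util.Objects;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;

public class TypedWiring {
    private static final RelationshipType CONNECTS_TO =
            RelationshipType.withName("CONNECTS_TO");

    // Refuses to wire an output port to an input port of a different type.
    public static Relationship connect(Node output, Node input) {
        Object outType = output.getProperty("output_type", null);
        Object inType = input.getProperty("input_type", null);
        if (outType == null || !Objects.equals(outType, inType)) {
            throw new IllegalArgumentException(
                    "port type mismatch: " + outType + " -> " + inType);
        }
        return output.createRelationshipTo(input, CONNECTS_TO);
    }
}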
As soon as you add the ability to write procedures (similar to SQL's stored procedures), you can write very complex data integrity constraints.
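In Neo4j, for instance, this corresponds to user-defined procedures (available since 3.0). A hedged sketch, where the integrity.orphans name and the "no relationships at all" rule are purely illustrative:

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;
import org.neo4j.procedure.Context;
import org.neo4j.procedure.Name;
import org.neo4j.procedure.Procedure;

public class IntegrityProcedures {
    @Context
    public GraphDatabaseService db;

    public static class Violation {
        public long nodeId;
        public Violation(long nodeId) { this.nodeId = nodeId; }
    }

    // Callable from Cypher as: CALL integrity.orphans('Component')
    @Procedure("integrity.orphans")
    public Stream<Violation> orphans(@Name("label") String label) {
        List<Violation> out = new ArrayList<>();
        Result res = db.execute(
                "MATCH (n:`" + label + "`) WHERE NOT (n)--() RETURN id(n) AS id");
        while (res.hasNext()) {
            out.add(new Violation((Long) res.next().get("id")));
        }
        return out.stream();
    }
}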
So while it is possible, the question is whether you should.
How much logic you actually want to put into your database is a decades-long heated argument. At some point in your application architecture, you will have to check the validity of the data you are handling. Handling data consistency in the database itself avoids a lot of the race conditions, as well as the performance cost of multiple round trips between the application and the database, that would occur if the consistency checks were done in the application layer.
Putting a lot of your logic into the database severely limits your ability to switch databases ("vendor lock-in"), might lead to code duplication between your application layer and your database, and spreads your logic across two (or more) layers of your architecture (which makes it harder to find bugs, introduces temporal coupling, and might re-introduce the very race conditions and performance problems where you then have to use transactions again).
My personal take is along the lines of Steve Wozniak - see your database as another service. If that service can provide you with everything you need to ensure data integrity, it might be a good idea to just use the database directly. But if this increases the problems I mentioned before, you might be better off putting a layer between your database and your business logic.
Objectivity/DB is an object/graph database that uses a schema. You can absolutely do what you are proposing. It supports complex object definitions, including type inheritance, and it has a full graph/navigational query language similar to Cypher. www.objectivity.com

User-defined data integrity constraints in Neo4j

I am relatively new to Neo4j, and I was wondering if it is possible to impose user-defined data integrity constraints on the stored data.
The manual says that it is possible to impose UNIQUE constraints, and here Michael Hunger pointed out that NOT NULL constraints have been added in the current RC.
I was wondering if it is possible, in some way, to define constraints like "every node with label X has to have a relationship to a node with label Y", or to impose some kind of type system, possibly with a type hierarchy and everything.
Such constraints should automatically be checked by the DBMS, like in many of the old school (relational) database systems.
Cheers!
No, it's not possible to have the same functionality that a traditional RDBMS has, at least not out of the box.
You can write your own Unmanaged Extensions, which could handle that for you. You can find basic information on how to do that in this article.
I'm not aware of any existing "plugin". In the future GraphAware Enterprise should bring "schema enforcement".
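For what it's worth, the "every X needs a Y" rule from the question can at least be checked after the fact with a Cypher query. A minimal sketch, runnable from an embedded setup or inside an unmanaged extension (when and how to trigger it is up to you):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Result;

public class RuleCheck {
    // Returns every node labelled X that has no relationship to any :Y node.
    public static Result xWithoutY(GraphDatabaseService db) {
        return db.execute("MATCH (x:X) WHERE NOT (x)--(:Y) RETURN x");
    }
}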

How can you define and enforce a Neo4j graph's schema?

I want to achieve, with a Neo4j graph, an RDBMS's ability to define and enforce a known schema. We know what our graph should look like (all the edge types and node types), so we simply want to prevent someone (a developer/user) from adding an edge or node type which is "invalid", i.e. not part of the defined graph schema. How can we enforce a graph's schema? Note that I am not asking how to enforce the properties of an edge or a node, but simply how to enforce that the graph is made up of a specific set of known edge and node types.
Please help
This should probably be done on the application side. Build a wrapper/API that enforces this sort of thing, and make the developers use it. Sorry for the short answer...
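To make the wrapper idea concrete, a minimal sketch in Java against the embedded API -- the GraphGateway name and the whitelist contents are assumptions, not an existing library:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;

public class GraphGateway {
    // Illustrative whitelist; in practice you would load this from
    // wherever your schema is defined.
    private static final Set<String> ALLOWED_RELS =
            new HashSet<>(Arrays.asList("OWNS", "LOCATED_IN", "PART_OF"));

    // All writes go through here; unknown edge types are rejected up front.
    public static Relationship createEdge(Node from, Node to, String type) {
        if (!ALLOWED_RELS.contains(type)) {
            throw new IllegalArgumentException(
                    "relationship type not in schema: " + type);
        }
        return from.createRelationshipTo(to, RelationshipType.withName(type));
    }
}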
Most of the language drivers or frameworks listed here provide means to define a schema:
http://www.neo4j.org/drivers
For Java we developed structr (https://github.com/structr/structr), where you define your schema in Java beans. You could start, e.g., with the simple Maven archetype as shown in this screencast: http://vimeo.com/53235075
Cheers
Axel
It has to happen in a layer above Neo4j. I've been building one of those layers (Restagraph), which puts a REST interface on top of it.
It's a mite less mature than Structr, but may be worth a look. I package it in a Docker image, and it's designed so you can easily define your own schema in YAML files.

Architecting a Neo4j-Based Application - stick to vanilla API using plain nodes & relationships or use Spring/GORM?

I'm hoping to hear from any of you who have architected and implemented a decent-sized Neo4j app (tens of millions of nodes/rels) -- and what your recommendations are, particularly w.r.t. modelling and the various APIs (vanilla Java/Groovy Neo4j vs Spring-Data-Neo4j vs Grails GORM/Neo4j).
I'm interested in whether it actually pays off to add the extra OGM (object-graph mapping) layer and its associated abstractions.
Has anyone's experience been that it is best to stick to 'plain' graph-modelling with nodes+properties, relationships+properties, traversals and (e.g.) Cypher to model and store their data?
My concern is that 'forcing' a particular OGM abstraction onto a graph database will affect future flexibility in adapting/changing the domain model and/or flexibility in querying the data.
We're a Grails shop, and I have experimented with GORM/Neo4J and also with spring-data-neo4j.
The primary purpose of the dataset will be to model and query the relationships amongst very large numbers of people, their aliases, their associates, and all sorts of criminal activity and history. There will be more than 50 main domain classes. There must be flexibility in the model (which will need to evolve rapidly in the early phases of the project) and in the speed and flexibility of querying.
I have to confess, I'm struggling to find a compelling reason to use an OGM layer when I can use (e.g.) POJOs or POGOs, a little Groovy magic, and some simple hand-rolled domain object <-> node/relationship mapping code. As far as I can tell, I would be happy just dealing with nodes & traversals & Cypher (aka KISS). But I would be very happy to hear others' experiences and recommendations.
Thanks for your time & thoughts,
TP
Since I'm the author of the Grails Neo4j plugin, I might be biased. The main reason for creating the plugin was to apply the ease of Grails domain classes, with their powerful out-of-the-box scaffolding, to Neo4j for ~80% of the use cases. For the other 20%, where specific requirements call for things like traversals etc., we use the Neo4j APIs (traversals/Cypher) directly and do not use the GORM API.
The current version of the Neo4j plugin suffers from a supernode issue, since each domain instance is connected to a subreference node. If multiple concurrent requests (aka threads) add new domain instances, there is a chance of getting a locking exception. I'm about to fix that, either with a sub-subreference approach or by using indexing.
Cypher can also be used in the Neo4j Grails plugin.
Spring-Data-Neo4j, on the other hand, is a more advanced approach with finer control over mapping details, but it requires the use of specific annotations. And I have found no easy way to integrate it into Grails in a way that keeps scaffolding working.
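For context, this is roughly what the SDN annotation style of that era looks like -- a hedged sketch, with the Person/Alias model and the KNOWN_AS type as illustrative stand-ins for the question's domain:

import java.util.Set;
import org.springframework.data.neo4j.annotation.GraphId;
import org.springframework.data.neo4j.annotation.Indexed;
import org.springframework.data.neo4j.annotation.NodeEntity;
import org.springframework.data.neo4j.annotation.RelatedTo;

@NodeEntity
public class Person {
    @GraphId
    private Long id;

    // Enforced as a unique index by SDN
    @Indexed(unique = true)
    private String nationalId;

    // Mapped to KNOWN_AS relationships in the graph
    @RelatedTo(type = "KNOWN_AS")
    private Set<Alias> aliases;
}

@NodeEntity
class Alias {
    @GraphId
    private Long id;
    private String name;
}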
We're using the predecessor version of the plugin in a production application with ~60k users and ~10^6 rels. Due to an NDA I cannot provide more details on that.
We do not use Grails, but we do use a hybrid plain Neo4j / Spring-Data-Neo4j solution. The reason is that some of our domain data has a fixed schema and some doesn't. SDN takes a lot of the burden away, and it can be mixed with plain Neo4j if the need arises.
We have classes that describe a data model; the objects for those classes we persist using SDN with no additional tricks -- we just use the basics of SDN. Then we have classes that contain the data for the parts of the model that are not known beforehand. These are stored in nodes that contain special properties describing which model type the data refers to. When Neo4j 2 gets released, we will probably move that info into labels. Between these nodes there can be relations, also described by the aforementioned data model managed by SDN. We also have relations from the generic nodes to SDN nodes, which works fine, as everything ends up being the same thing: nodes.
We have not encountered any issues with this approach yet. The thing we love most is that the data whose modelling we do not know in advance gets stored the way we would have wanted to store it had we known the model beforehand, making the data actually match the chosen model -- which is very hard to achieve with any other type of (non-graph) database.

Entity Framework 4: Does it make sense to create a single diagram for all entities?

I have written down a few assumptions regarding Entity Framework, followed by a few questions (so please correct me where I am wrong). I am trying to use POCOs with EF 4.
My assumptions:
Only one data context can exist for an EF diagram.
Data Contexts can refer to more than one entity.
If you have two data sources, say MS SQL Server and Oracle, EF requires two different diagrams to access the data.
The EF diagram's data context is the "Unit of Work", having a single Save() for anything on the diagram. (Sure, you could wrap it in a UnitOfWork class, but it essentially has the same duties.)
Assuming that's correct, here are my questions:
If you don't keep all entities on the same EF diagram, how do you maintain data integrity, like "Orders cannot exist without a Customer"? Is it solely the repository's job to load data just to verify integrity, or do we "try/catch" on database referential-integrity errors?
Wouldn't you create an EF diagram for each entity? For example, I wouldn't expect changes to a customer and changes to a product to be written together, as they have nothing to do with each other (having them on the same diagram would cause them to be written together). Or is the scope of an EF diagram meant to encompass all similar entities stored in the same storage medium?
Is it the norm to divide up the entities like that, or to have a single diagram holding all the entities? I would think the latter, but the thinking is getting the better of me.
Having one big EDM containing all the entities is generally NOT a good practice and is not recommended.
Using one large EDM will cause several issues, such as:
Performance Issue in Metadata Load Times:
As the size of the schema files increases, the time it takes to parse them and create an in-memory model for the metadata also increases.
Performance Issue in View Generation:
View generation is a process that compiles the declarative mapping provided by the user into client-side Entity SQL views, which are used to query and store entities to the database. The process runs the first time either a query or SaveChanges happens. The performance of the view-generation step depends not only on the size of your model but also on how interconnected the model is. If two entities are connected via an inheritance chain or an association, they are said to be connected. Similarly, if two tables are connected via a foreign key, they are connected. As the number of connected entities and tables in your schemas increases, the view-generation cost increases.
Cluttered Designer Surface:
When you generate an EDM model from a big database schema, the designer surface is cluttered with a lot of entities, and it is hard to get a sense of what your entity model looks like as a whole. If you don't have a good overview of the entity model, how are you going to customize it?
IntelliSense experience is not great:
When you generate an EDM model from a database with, say, 1000 tables, you will end up with 1000 different entity sets. Imagine what your IntelliSense experience will be like when you type "context." in the VS code window.
Cluttered CLR Namespaces:
Since a model schema has a single EDM namespace, the generated code will place all the classes in a single namespace.
For a more detailed discussion, have a look at Working With Large Models In Entity Framework – Part 1
Solution:
While there is no out-of-the-box solution for this, the suggestion is to find Naturally Disconnected Subsets in your model, meaning that, based on your domain model, you should come up with different sets of domain models, each containing related objects, with each set unrelated to and disconnected from the others. Having no foreign keys in between is a good sign of separation. This makes sense because, in a large model, your application usually does not require all the tables in a database to be mapped to one entity model in order to work.
Even if this kind of separation is not 100% possible -- meaning that some subsets of tables have outgoing foreign keys to other tables in the database -- it still encourages you to separate them. When you do this, you have to take on the responsibility of setting the foreign keys appropriately; there will be no navigation property that lets you fetch the entity that the foreign key represents. Of course, you could manually query for this entity in the other container if needed.
Also, for some tips and tricks on how you can split one large entity model into smaller ones while reusing types, take a look at: Working With Large Models In Entity Framework – Part 2
About your question: Order and Customer belong to the same natural domain and should be kept in the same EDM. Like I said, you can scatter them over two different entity data models, but then you have to take responsibility for setting the appropriate foreign keys yourself or you'll get runtime exceptions. By the same token, Customer and Product should be kept in separate entity data models. Following these rules, you can come up with a well-defined domain-set design in your data access layer.
I realize that this question was about EF4, but I am sure that many people who are just now "making the switch" will end up here via Google, read this question and the approved answer, and make decisions based on them even though they are using EF5 (or EF4.4, if you are stuck on .NET 4.0).
EF5 allows multiple diagrams per edmx. This is a big deal, at least to my team, because it allows us to visually separate entities without requiring separate edmx files. Dr. Zim's points are all still valid except (obviously) the "cluttered designer surface".
There are drawbacks to having multiple edmx files, the biggest one being that even if you create separate namespaces for each, you cannot duplicate entity names. Yes, if you truly are designing your system "code first", this should not be a problem. However, many (most) of us are adding EF to existing systems that are already built on top of normalized relational databases.
"But normalization is a good thing, right?" Well, if you are using a relational database yes. "But why does that matter if I am using EF?" A common "normalized" table is Address. Possible scenario: Company (location of business/office) and Contact (might be "remote" worker so they are not at the business location) and they both have a FK that points to Address. Using one edmx file for Company and one for Contact (even with different namespaces) that both include the Address table, the code will compile but at run time you will get this beauty:
Multiple types with the name 'Address' exist in the EdmItemCollection
in different namespaces. Convention based mapping requires unique names
without regard to namespace in the EdmItemCollection
You can change the mapping that EF uses, but then you run into other "issues" while working through the implementation; and since most people use the default mapping, forums like this one won't have many pertinent questions and answers.
You could also rename the model name of the Address table to "ContactAddress" and "CompanyAddress" respectively, but that gives the illusion that they are different types when they really aren't. OK, so they are different types in EF, but not in the database; and, as I said, most of us "live" in the world of tacking EF onto an existing system with an existing relational data store.
This is already a long-winded "answer", so I will stop here. I just wanted to make sure that people who landed here after searching for "multiple edmx", without realizing that there are significant differences between EF4 and EF5, were made aware that they may need to do some more investigating.
