I'm in the process of migrating a Neo4j database of genomics and biological data into Grakn. I have the data as CSV files, but I need an ETL tool to solve this problem in the simplest way.
I am following this template Python migrator:
https://blog.grakn.ai/loading-data-and-querying-knowledge-from-a-grakn-knowledge-graph-using-the-python-client-b764a476cda8
Am I correct in thinking this way -
Do nodes map to entities?
Do edges in Neo4j map to relationships in Grakn?
Do labels map to attributes?
While it is possible to map the property-graph model directly onto the entity-relationship model (used by Grakn), it is highly likely that the limitations and shortcomings of the property-graph model will be carried over. This is why Grakn does not provide or encourage a completely general migration tool. Every Grakn knowledge graph should be powered by a well-thought-out model (i.e. a schema) that is tailored to the intended domain.
To outline how one can easily (re)model a dataset in Grakn, the key is to create a schema that closely resembles how we perceive data in the real world in terms of things and their interactions. This easily maps onto the Entity-Relationship-Attribute model Grakn uses. It is common to iterate several times before settling on the final schema (though it can always be extended later).
Then we can:
ask intuitive questions (in the form of Graql queries) - using the defined Entities/Relationships/Attributes that map closely to our mental model
build an intelligent database that is capable of reasoning over data the same way we do, by adding logical, deductive rules that apply in our domain
I encourage you to check out this blog post on the challenges of working with graph databases, and for any domain-specific modeling questions, head over to the Grakn community forum.
Good luck and welcome to Grakn!
If you map your property graph directly to GRAKN, you will end up with relations that are most likely named as verbs and connect only two things (one of which will appear to be a subject and the other an object). GRAKN will be fine with this, but as mentioned previously, it may make leveraging all the goodness in GRAKN more difficult. In particular, converting existing graph structures to hyperedges may take some significant reengineering. But the good news is that the ETL would be straightforward.
A better solution would be to define your ideal schema first in GRAKN (taking advantage of hyperedges), then fashion an ETL to populate the schema. In such a case, the ETL might be simple or complex. It would depend on how complex your original data was and how complex the new schema was.
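As a minimal sketch of the "schema first, then ETL" approach, the snippet below turns CSV rows into Graql statements that populate a three-way relation (a hyperedge) rather than a binary edge. Everything domain-specific here is an assumption made for illustration: the CSV columns, the entity types gene, disease and publication, their id attributes, the relation gene-disease-association and its role names are all hypothetical, and the string escaping is assumed to be backslash-based. The generated strings would then be passed to whichever Grakn client transaction you use.

```python
import csv
import io

# Hypothetical CSV export: one row per gene/disease/publication association.
ASSOCIATIONS_CSV = """gene_id,disease_id,pubmed_id
G1,D1,P1
G2,D1,P7
"""


def quote(value: str) -> str:
    """Wrap a value as a string literal (backslash escaping is an assumption)."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def association_query(row: dict) -> str:
    """Build a match-insert that links three existing entities in one relation."""
    return (
        f"match $g isa gene, has gene-id {quote(row['gene_id'])}; "
        f"$d isa disease, has disease-id {quote(row['disease_id'])}; "
        f"$p isa publication, has pubmed-id {quote(row['pubmed_id'])}; "
        "insert (associated-gene: $g, associated-disease: $d, "
        "supporting-evidence: $p) isa gene-disease-association;"
    )


def build_queries(csv_text: str) -> list:
    """Read the CSV text and return one Graql statement per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [association_query(row) for row in reader]


queries = build_queries(ASSOCIATIONS_CSV)
for query in queries:
    print(query)
```

A direct property-graph mapping would need two or three binary edges per row here; with the schema designed up front, each row becomes a single relation with three role players.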
With a normal 'graph database' the data is broken up into nodes and edges, and there isn't much of a restriction/schema between the connections. With this, it seems great for modeling straightforward graphs where the relationships are relatively consistent -- Movies with cast and crew; Computer networks with IPs and devices; Social networks with users and connections; etc.
Are there any graph-like databases that can be more specialized? For example to be able to model something like an electrical circuit where each component has a sort of 'schema' or well defined input and output -- i.e., a Resistor has two connections and has various properties:
a Transistor has three connections and has various properties, etc.
I'm not asking about particular circuit simulators, such as https://www.falstad.com/circuit/circuitjs.html, but more about whether it's possible in any graph (or pseudo-graph) databases to model and enforce very specific, well-defined relationships in a network, such as circuit design.
Definitely possible.
I've been working on this problem with Neo4j, and Restagraph is the result. It provides a REST API that enforces a schema on any updates to the database, and I've packaged it as a Docker image.
I haven't really promoted it so far, because it's only recently been mature enough for my own use, and I really need to improve the documentation. If you try it out, though, I'd love to hear any feedback you have.
TLDR: in general yes, but it depends.
This is a really broad question, so let me break it down.
It's a bit of a stretch to talk about all graph databases at once (they are not as standardized as SQL databases, which in turn are not very standardized either), so take this answer with a grain of salt: yes, that is possible.
As in SQL databases, you can usually set up constraints that are checked before any change to the data is persisted.
Most graph databases incorporate something along the lines of a "type", similar to what a table represents in SQL databases. Some allow you to constrain relationships to only target specific types, so you could restrict relationships, e.g. between a node using a CAN bus and a node using an I2C bus, to the specific types.
If a database does not provide these mechanisms, it's usually possible to constrain relationships based on the existence of specific keys and values in the model. To give an example other than your circuit one: imagine a node-based system which has typed inputs and outputs, where an int-based output can only be connected to an int-based input, a float-based output only to a float-based input, etc. Then you could add the fields output_type and input_type to the nodes and constrain relationships based on their values.
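The typed input/output idea can be sketched in plain Python as follows. This is only an in-memory illustration of the constraint itself, not the API of any particular graph database; the class names and the node names in the usage part are made up, while output_type and input_type are the fields suggested above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    input_type: str   # type accepted on the node's input, e.g. "int"
    output_type: str  # type produced on the node's output, e.g. "float"


@dataclass
class TypedGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.name] = node

    def connect(self, source: str, target: str) -> None:
        """Add an edge only if the source's output type matches the target's input type."""
        src, dst = self.nodes[source], self.nodes[target]
        if src.output_type != dst.input_type:
            raise ValueError(
                f"cannot connect {source} ({src.output_type} output) "
                f"to {target} ({dst.input_type} input)"
            )
        self.edges.append((source, target))


graph = TypedGraph()
graph.add_node(Node("counter", input_type="int", output_type="int"))
graph.add_node(Node("adder", input_type="int", output_type="int"))
graph.add_node(Node("scaler", input_type="float", output_type="float"))

graph.connect("counter", "adder")       # int -> int: accepted
try:
    graph.connect("adder", "scaler")    # int -> float: rejected
except ValueError as error:
    print(error)
```

In a real database the same check would live in a constraint, trigger or stored procedure so that it runs inside the write transaction.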
As soon as you add the ability to write stored procedures (similar to those in SQL), you can write very complex data integrity constraints.
So, while it is possible, the question is whether you should.
How much logic you actually want to put into your database is a decades-long heated argument. At some point in your application architecture, you will have to check the validity of the data that you are handling. Handling the data consistency in the database itself solves a lot of problems with race conditions or performance issues through multiple round trips between the application and the database, which would occur if the consistency checks are done in the application layer.
On the other hand, putting a lot of your logic into the database severely limits your ability to switch databases ("vendor lock-in"), might lead to code duplication between your application layer and your database, and spreads your logic across two (or more) layers of your architecture (which makes it harder to find bugs, introduces temporal coupling, and might re-introduce race conditions and performance problems where you have to use transactions again).
My personal take is along the lines of Steve Wozniak - see your database as another service. If that service can provide you with everything you need to ensure data integrity, it might be a good idea to just use the database directly. But if this increases the problems I mentioned before, you might be better off putting a layer between your database and your business logic.
Objectivity/DB is an object/graph database that uses a schema. You can absolutely do what you are proposing. It supports complex object definitions including type inheritance, and it has a full graph/navigational query language similar to Cypher. www.objectivity.com
I recently came across an application which uses Neo4j as the backend. In my experience with SQL and other key-value based databases, I have developed an understanding (which could be refined) that other databases store data and your application derives the information while with NEO4J you store the information. This means that the logic of deriving the information is already captured in the model of NEO4J. I am not able to get my head around this because now I cannot have logic that can be composed and most importantly something that can be tested with unit tests. I can sure have component level tests using embedded Neo4j but then that's not the same. Can someone please help me understand the application development philosophy/methodology with NEO4J.
...other databases store data and your application derives the information while with NEO4J you store the information.
Hmmm.... Define data and define information. Mostly it goes: data is something that requires further processing to become information (that is, something informative, something you can derive some conclusion or insight from).
Anyhow, I doubt this has anything to do with graph databases vs relational/aggregate databases. A database, as the name suggests, stores data.
This means that the logic of deriving the information is already captured in the model of NEO4J.
I'm not sure what you mean by "the logic... is already captured". Some queries are much easier with Neo4j and Cypher than with, say, SQL, like "Find all the friends of my friends that live in Berlin", but I would hardly relate this to 'logic'.
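To make the "friends of my friends that live in Berlin" example concrete, here is a rough sketch in plain Python over an in-memory adjacency map. It only illustrates the two-hop traversal that a graph query language expresses as a single pattern; the people, friendships and cities are made up, and excluding direct friends from the result is a choice made for this sketch.

```python
# Made-up sample data: an undirected friendship graph and a city per person.
friends = {
    "me": {"ann", "bob"},
    "ann": {"me", "bob", "cara"},
    "bob": {"me", "ann", "dan"},
    "cara": {"ann"},
    "dan": {"bob"},
}
city = {"me": "Berlin", "ann": "Berlin", "bob": "Hamburg", "cara": "Berlin", "dan": "Paris"}


def friends_of_friends_in(person: str, target_city: str) -> set:
    """Return people exactly two hops away from `person` who live in `target_city`."""
    direct = friends.get(person, set())
    result = set()
    for friend in direct:
        for candidate in friends.get(friend, set()):
            # Skip the person themselves and anyone who is already a direct friend.
            if candidate == person or candidate in direct:
                continue
            if city.get(candidate) == target_city:
                result.add(candidate)
    return result


berlin_fofs = friends_of_friends_in("me", "Berlin")
print(berlin_fofs)
```

Note that the function holds no business rule beyond the traversal itself, which is the point made above: the query is easier to express on a graph, but the database is still only storing data.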
I cannot have logic that can be composed and most importantly something that can be tested with unit tests.
What do you mean by 'logic that can be composed'? And unit tests have nothing to do with this, I'm afraid: there's no logic being tested if you talk about graph vs other databases.
Can someone please help me understand the application development philosophy/methodology with NEO4J.
There's really not much to it. Neo4J is a database like any other database, only that it uses a different model from relational/aggregate databases.
To highlight two of its strengths:
No joins - That's a pain point with relational/aggregate databases, especially with complex queries. Essentially, nearly all systems involve a data model that is a graph (you only need one many-to-many relationship in your data model for that), and not using a graph database is a form of dimensionality reduction. The reason relational databases prevailed for so many years is nothing short of a set of historical coincidences.
Easier DB migrations - and that's thanks to it being a schema-less database. You reap the same benefits with any other schema-less database.
I strongly recommend you read the 'NOSQL Overview' appendix of the free book Graph Databases. It focuses on a lot of these points.
I need to import SNOMED CT ontology into a graph database, in this case Neo4J but it could be another choice eventually.
However, I could not find a clear depiction of SNOMED CT's underlying relational data model in order to achieve this, or at least simplified SQL views that expose the entity relationships in a way that can be mapped to a graph database.
I would greatly appreciate any guidance or previous experience with this matter.
Directly trying to serialise the relational data model is probably going to be quite difficult and will take you further away from your goal.
It is worth noting that SNOMED data is actually available in RDF format already. So you get a graph structure for "free".
For example, this project provides the data in an RDF format, and putting RDF data into a graph is quite simple regardless of whether you choose Titan or Neo4j.
Side Note:
A colleague of mine has actually worked on importing SNOMED data into a Grakn Graph, a semantic graph system we both work on. If you are interested, you can check out his work here. Grakn is a semantic graph solution which runs on top of Titan.
If you are looking for a sample of how to model the Concepts, Descriptions and Relationships in a graph database, I have a sample project on GitHub that can upload the SNOMED data into a Neo4j database.
https://github.com/pradeepvemulakonda/Snomed
Before you go into the implementation detail, I would suggest trying out the following Snomed data browser at
http://ontoserver.csiro.au/shrimp/
Once you get a feel of the concepts and relationships you can go through the implementation. You can use the following gist to understand how you can query the uploaded concepts and relationships in Neo4j.
https://neo4j.com/graphgist/95f4f165-0172-4b3d-981b-edcbab2e0a4b#listing_category=health-care-and-science
SNOMED can be loaded into MySQL using the UMLS (Unified Medical Language System) released by NIH. Once loaded, the table MRREL contains all the relations between SNOMED nodes. If you want to load it right away into Neo4j, you can skip the MySQL step entirely and work directly with the UMLS RRF files. The documentation of the RRF format is not great, but the files are easy-to-parse tabular text.
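As a rough sketch of working with the RRF files directly, the snippet below parses pipe-delimited MRREL rows and keeps only those from a SNOMED source. The column order is written from memory of the UMLS documentation and the source abbreviation SNOMEDCT_US is likewise an assumption, so check both against the MRFILES/MRCOLS files shipped with your release. The two sample rows use invented identifiers.

```python
import csv
import io

# Assumed MRREL.RRF column order; verify against your UMLS release.
MRREL_COLUMNS = [
    "CUI1", "AUI1", "STYPE1", "REL", "CUI2", "AUI2", "STYPE2", "RELA",
    "RUI", "SRUI", "SAB", "SL", "RG", "DIR", "SUPPRESS", "CVF",
]

# Invented sample rows; every RRF field, including the last, ends with a pipe.
SAMPLE_MRREL = (
    "C0000001|A0000001|AUI|PAR|C0000002|A0000002|AUI|inverse_isa|R0000001||SNOMEDCT_US|SNOMEDCT_US|||N||\n"
    "C0000003|A0000003|AUI|RO|C0000004|A0000004|AUI||R0000002||MSH|MSH|||N||\n"
)


def read_mrrel(text: str) -> list:
    """Parse pipe-delimited MRREL text into a list of dicts keyed by column name."""
    reader = csv.reader(io.StringIO(text), delimiter="|", quoting=csv.QUOTE_NONE)
    # zip() drops the empty trailing field produced by the final pipe.
    return [dict(zip(MRREL_COLUMNS, row)) for row in reader if row]


def snomed_edges(rows: list, source: str = "SNOMEDCT_US") -> list:
    """Keep rows asserted by the given source as (CUI1, REL, CUI2) triples."""
    return [(r["CUI1"], r["REL"], r["CUI2"]) for r in rows if r["SAB"] == source]


rows = read_mrrel(SAMPLE_MRREL)
edges = snomed_edges(rows)
print(edges)
```

The resulting triples can be written out as a CSV of relationships for a bulk import. The direction convention of REL between CUI1 and CUI2 is worth checking in the UMLS documentation before deciding which way the edges should point.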
There are in fact three tables: Concepts, Descriptions and Relationships.
You'll find them described here:
https://confluence.ihtsdotools.org/display/DOCTIG/3.1.+Components
Most important are the relations between Relationships and Concepts and Descriptions and Concepts.
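A minimal sketch of how the three tables could be turned into nodes and edges: concepts become nodes, descriptions attach terms to their concept, and relationships become typed edges between concepts. The column names follow my understanding of the tab-delimited release files and are trimmed to the ones used here, so treat them as assumptions; the concept ids and terms in the sample are invented, and 116680003 is assumed to be the identifier of the "is a" relationship type.

```python
import csv
import io

# Invented sample data, trimmed to the columns this sketch uses.
CONCEPTS_TSV = "id\tactive\n1001\t1\n1002\t1\n1003\t0\n"
DESCRIPTIONS_TSV = (
    "id\tactive\tconceptId\tterm\n"
    "2001\t1\t1001\tParent finding\n"
    "2002\t1\t1002\tChild finding\n"
    "2003\t0\t1002\tRetired synonym\n"
)
RELATIONSHIPS_TSV = (
    "id\tactive\tsourceId\tdestinationId\ttypeId\n"
    "3001\t1\t1002\t1001\t116680003\n"
    "3002\t1\t1003\t1001\t116680003\n"
)


def read_tsv(text: str) -> list:
    """Parse tab-delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE))


def build_graph(concepts: list, descriptions: list, relationships: list):
    """Return (nodes, edges) built from active rows only."""
    nodes = {c["id"]: {"terms": []} for c in concepts if c["active"] == "1"}
    for d in descriptions:
        if d["active"] == "1" and d["conceptId"] in nodes:
            nodes[d["conceptId"]]["terms"].append(d["term"])
    edges = [
        (r["sourceId"], r["typeId"], r["destinationId"])
        for r in relationships
        if r["active"] == "1" and r["sourceId"] in nodes and r["destinationId"] in nodes
    ]
    return nodes, edges


nodes, edges = build_graph(
    read_tsv(CONCEPTS_TSV), read_tsv(DESCRIPTIONS_TSV), read_tsv(RELATIONSHIPS_TSV)
)
print(nodes)
print(edges)
```

Filtering on the active flag up front keeps retired concepts, and any edges that touch them, out of the graph.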
Am I doing this correctly? There's no measure so this is throwing me off a bit.
I am designing my database to hold records of user profiles. Users can come in and edit their profile on a front-end portal that links to this DB, so records are edited/updated/deleted. The DB also needs to produce XML feeds for a public website.
The warehouse:
Yes, a fact table can exist without measures, it is called a factless fact table.
You can read more at http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/factless-fact-table/ and in other documentation.
While you absolutely can have a fact table without measures (RaduM has linked to an explanation of this), if you have no measures anywhere in your model I would question whether this database should use a dimensional model at all.
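As a rough illustration of a factless fact table, the sketch below uses Python's built-in sqlite3 module. The fact table holds only foreign keys to its dimensions, and the "measure" is simply the number of rows. The table and column names (dim_user, dim_date, fact_profile_edit) and the sample rows are hypothetical and not taken from the question's model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_user (user_key INTEGER PRIMARY KEY, user_name TEXT NOT NULL);
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, calendar_date TEXT NOT NULL);
    -- Factless fact table: only dimension keys, no numeric measure columns.
    CREATE TABLE fact_profile_edit (
        user_key INTEGER NOT NULL REFERENCES dim_user(user_key),
        date_key INTEGER NOT NULL REFERENCES dim_date(date_key)
    );
    """
)
conn.executemany("INSERT INTO dim_user VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?)",
    [(20240101, "2024-01-01"), (20240102, "2024-01-02")],
)
# Each row records that an event happened: a user edited their profile on a date.
conn.executemany(
    "INSERT INTO fact_profile_edit VALUES (?, ?)",
    [(1, 20240101), (1, 20240102), (2, 20240102)],
)

# Analysis is done by counting rows rather than summing a measure.
edits_per_user = conn.execute(
    """
    SELECT u.user_name, COUNT(*)
    FROM fact_profile_edit AS f
    JOIN dim_user AS u ON u.user_key = f.user_key
    GROUP BY u.user_name
    ORDER BY u.user_name
    """
).fetchall()
print(edits_per_user)
```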
Dimensional models are intended for BI functions - data analysis, reporting, feeding into cubes, etc. Your description in a later comment about the use of this database seems to suggest this database is actually just the back end database for a website? If so, I would suggest avoiding dimensional modelling altogether. A standard normalised data model is likely to be far more suitable.
Data warehouses are normally secondary datastores which are not your live application database. Data is pulled from your primary sources into the data warehouse for reporting and analytics needs.
Transactional databases - like the one you are describing - are generally modelled in a more standard and more highly normalised manner. The usual gold standard is third normal form or higher. If you're unclear on the rules of database normalisation and the concept of third normal form, then I would strongly suggest that you obtain some training on this (there are online tutorials around if you search), and then have a crack at remodelling your scenario in this way. If you get stuck, post up a new question with the problem(s) you're running into.
You might also find this previous question helpful - it describes the difference between OLTP and OLAP. While you're not using OLAP, dimensional models are often used as the RDBMS layer behind an OLAP database:
What are OLTP and OLAP. What is the difference between them?
I am developing a web-based application using Rails. I am debating between using a Graph Database, such as InfoGrid, or a Document Database, such as MongoDB.
My application will need to store both small sets of data, such as a URL, and very large sets of data, such as Virtual Machines. This data will be tied to a single user.
I am interested in learning about people's experiences with either graph or document databases and why they would use either of the options.
Thank you
I don't feel experienced enough with both worlds to properly and fully answer your question; however, I have been using a document database for some time, and here are some personal hints.
Document databases are based on the concept of key/value pairs and static views, and are pretty cool for finding a set of documents that have a particular value.
They don't conceptualize the relations between documents.
So if your software has to provide advanced "queries" where selection criteria act on several 'types of document', or if you simply need to perform a selection using several elements, the [key,value] concept is not appropriate.
There are also a number of other cases where document databases are inappropriate: presenting large datasets in "paged" tables that are sortable on several columns is one of the cases where performance is low and disk space usage is huge.
So in many cases you'll have to perform "server side" processing in order to pick up the pieces, and with Rails, or any other Ruby-based framework, you might run into performance issues.
Graph databases are based on the concept of a triplestore, meaning that they also conceptualize the relations between the entities.
The graph can be traversed using the relations (and entity roles), and might be more convenient when performing searches across relation-structured data.
As I have no experience with graph databases, I'm not aware whether a graph database can be easily queried/traversed with several criteria; however, if an informed reader has such information, I'd really appreciate any examples of such queries/traversals.
I'm currently reading about InfoGrid and trying to figure out if such databases could be handy for performing complex requests on a very large set of data, relations included.
From what I can read, InfoGrid should be considered a "data federator" able to search/mine the data from several sources (Stores), which can also be a NoSQL database such as Mongo.
Which means that you could use a Mongo store for updates and InfoGrid for data searching, and maybe spare a lot of CPU and disk when it comes to complex searches inside a NoSQL database.
Of course it might seem a little "overkill" if your app simply stores a large set of huge binary files in a database and all you need is to perform simple key queries and retrieve the result. In that case a NoSQL database such as Mongo or Couch would probably be handy.
Hope some of this helps ;)
When connecting related documents by edges, will you get a shallow or a deep graph? I think the answer to that question is important when deciding between graphdbs and documentdbs. See Square Pegs and Round Holes in the NOSQL World by Jim Webber for thoughts along these lines.