I have developed two different domain ontologies. Now I would like to merge them, because my application requirements have changed and now require knowledge from both ontologies. What are the available merging techniques, and how can I do it with an ontology framework, i.e. Jena, the OWL API, etc.?
If your ontologies follow the OWL 2 EL profile, you can do it with Brain:
Brain brain = new Brain();
brain.learn("path/to/first_ontology.owl");
brain.learn("path/to/second_ontology.owl");
//Do your logic here, like queries, etc...
brain.save("path/to/merged_ontologies.owl");
Entities with identical IRIs will be automatically merged.
Assume you have two ontologies, loaded as the models modelA and modelB. If you are using Jena, you can do the following to merge them:
modelA.add(modelB);
or
modelB.add(modelA);
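A minimal sketch of that approach with Apache Jena (the file paths are placeholders; note that a plain Model.add() is a triple-level union and does not resolve owl:imports or align entities that use different IRIs):

import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;

import java.io.FileOutputStream;

public class MergeOntologies {
    public static void main(String[] args) throws Exception {
        // Load both ontologies into separate in-memory models
        Model modelA = RDFDataMgr.loadModel("path/to/first_ontology.owl");
        Model modelB = RDFDataMgr.loadModel("path/to/second_ontology.owl");

        // add() copies every statement of modelB into modelA; statements that
        // use the same IRIs simply end up on the same resources
        modelA.add(modelB);

        // Write the merged model back out (RDF/XML here; any Jena syntax works)
        try (FileOutputStream out = new FileOutputStream("path/to/merged_ontologies.owl")) {
            modelA.write(out, "RDF/XML");
        }
    }
}

If the two ontologies describe the same things under different IRIs, that is an ontology alignment/matching problem, and a plain merge like this will not solve it.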
Related
I am currently looking at modelling tertiary courses and other such entities (MATH101, BIOL360, BSc etc.), and one of the options we're looking at is graph databases. I am not familiar with graph databases, other than in theory.
One of the things I am trying to model is requirements, for example "MATH201 requires the student previously completed MATH101". That one seems easy - I can create a vertex between the two. Others are more complicated: "Bachelor of Computer Science requires 40 points at 200 level or higher in science papers".
What I'd like to do here is name a bunch of sets, à la Neo4j labels, then create a relationship from one of the nodes to the set of nodes described by the labels, but I can't see a way to do this. Is this something that's possible in graph database engines, or am I basically running down an XY path and should be doing something else entirely?
I've tagged Neo4J because I'm leaning towards it as (from what I can tell) it's the most widely known/used graph dbms, but I'm open to solutions in other databases too (in fact, if it's possible in the very new SQL Server offering, that would probably be ideal as other infrastructure is on that).
Well, in the first case, I think you can make a :REQUIREMENT relationship between "MATH201" and "MATH101".
In the second case, you can make a :REQUIREMENT relationship between the "Bachelor of Computer Science" node and an intermediate node that groups all the science papers, as in the sketch below. You can also put some extra properties on the relationship to say what type of requirement the course has:
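A rough Cypher sketch of that shape (all labels, property names, and values here are invented for illustration):

CREATE (m101:Paper {code: "MATH101"}),
       (m201:Paper {code: "MATH201"}),
       (b360:Paper {code: "BIOL360"}),
       (sci:PaperGroup {name: "Science papers, 200 level or higher"}),
       (bcs:Degree {name: "Bachelor of Computer Science"}),
       // first case: a plain prerequisite between two papers
       (m201)-[:REQUIREMENT {type: "prerequisite"}]->(m101),
       // second case: the papers hang off an intermediate grouping node...
       (m201)-[:IN_GROUP]->(sci),
       (b360)-[:IN_GROUP]->(sci),
       // ...and the degree's requirement points at the group, with the rule stored on the relationship
       (bcs)-[:REQUIREMENT {type: "points", minPoints: 40, minLevel: 200}]->(sci)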
I am interested in finding out which ontologies are preloaded into GraphDB by default. This will help me identify which ontologies (.ttl files) I need to add along with my own ontology as part of the package, especially in situations where there is no Internet connection.
I know that some ontologies such as RDFS and OWL are preloaded into GraphDB, but I could not find any list of the preloaded ontologies.
Please keep in mind that OWL does not differentiate very clearly between ontology and instance triples. GraphDB also introduces another term, "axiomatic triple" (i.e. a statement that cannot be deleted with a normal user transaction), to separate the ontology statements from the normal RDF.
There are three ways of loading ontologies as axiomatic triples in GraphDB:
Ruleset - imports all statements from the beginning of a PIE file as axiomatic statements.
Add the "imports" initialisation parameter - this saves a configuration predicate in the SYSTEM repository; see the documentation of that configuration parameter.
Add a special predicate at the beginning of an RDF file - the system transaction will add all following statements as ontology statements.
Another approach is to load every file into a different named graph. This will allow you to see which graphs are currently stored in the repository.
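For instance, with standard SPARQL 1.1 (the file path and graph IRI below are placeholders, and whether LOAD may read local files depends on the server configuration):

# load one ontology file into its own named graph
LOAD <file:///path/to/my_ontology.ttl> INTO GRAPH <http://example.org/graphs/my-ontology>

# list the named graphs currently stored in the repository
SELECT DISTINCT ?g
WHERE { GRAPH ?g { ?s ?p ?o } }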
I know that there are similar questions around on Stackoverflow but I don't feel they answer the following.
Graph Databases to my understanding store data following mostly this schema:
Table/Collection 1: store nodes with UID
Table/Collection 2: store relations referencing nodes via UID
This allows storing arbitrary types of graphs. Now, as I understand it, triple stores store nothing but triples:
Triple/Collection 1: store triples (2 nodes, 1 relation)
Now I would see the following distinction regarding use cases:
Graph Databases: when you have known, static connections
Triple Stores: when you have loosely connected nodes and are often looking for new connections
I am confused by the fact that people do not seem to be discussing which one to use according to these criteria. Most articles I find talk about arguments like speed or compatibility. But is this not the most relevant point?
Put the other way round:
Imagine having a clearly connected, user-defined graph. Why on earth would you want to store that as triples only, losing all the info about the connections? Or have to implement some custom solution storing IDs in the triple subject?
Imagine having loosely collected nodes that you want to query for unknown relations using SPARQL. Graph databases do support that, but I assume they have to build another index for this and would be slower?
EDIT:
I see that "loosing info about connections" is the wrong way to put it. If you do as shown in the accepted answer and insert several triples for 2 nodes + 1 relation then you keep all the info and specifically the info what exact nodes are connected.
The main difference between graph databases and triple stores is how they model the graph. In a triple store (or quad store), the data tends to be very atomic. What I mean is that the "nodes" in the graph tend to be primitive data types like string, integer, date, etc. Relationships link primitives together, and so the "unit of discourse" in a triple store is a triple, and not a node or a relationship, typically.
By contrast, other graph databases are often called "property stores" because nodes are data containers that correspond to objects in a domain. A node stands in for an object, and has properties; they act as rich data types specified by the graph modelers, more than just primitive data types. In these graph databases, nodes and relationships are the "unit of discourse".
Let's say I have a person named "Bob" who knows "Susan". In RDF, it would be something like this:
<http://example.org/person/1> :hasName "Bob".
<http://example.org/person/1> foaf:knows <http://example.org/person/2>.
<http://example.org/person/2> :hasName "Susan".
In a graph database like neo4j, it would be this:
(a:Person {name: "Bob"})-[:KNOWS]->(b:Person {name: "Susan"})
Notice that in RDF, it's 3 relationships but only one of those relationships actually expresses semantics between two entities. The other two relationships are just tracking properties of a single higher-level entity (the person). In neo4j, it's 1 relationship amongst two nodes, with each node having a property. In RDF you'll tend to identify things by URI, in neo4j it's a database object that gets a database ID automatically. That's what I mean about the difference between a more atomic/primitive store (triple stores) and a richer property graph.
RDF and triple stores are mostly built for the kinds of architectural challenges you'd run into with the semantic web. For example, XML namespacing is built in, on the architectural assumption that you'll be mixing and matching the use of many different vocabularies and namespaces. (That right there is a very "semantic web" assumption). So in SPARQL and RDF you'll see typically at least the use of xsd, rdf, and rdfs namespaces concurrently, and probably also owl, skos, and many others. SPARQL and RDF/RDFS also have many hooks and features that are there explicitly to make things like ontology inference easier. You'll tend to identify things with URIs as a way of "namespacing your identifiers" but also because some people may want to de-reference the URI...again the assumption here is a wide data sharing arrangement between many parties.
Property stores by contrast are keyed towards different use cases, like flexible modeling of data within one model/namespace, mappings between objects and graphs for persistence of enterprise applications, rapid evolvability, and so on. You'll tend to identify things with your own scheme (or an internal database ID). An auto-incrementing integer may not be the best form of ID for any random consumer on the web (and it certainly can't be de-referenced like a URL), but URIs might not be your first thought for a company-internal application.
So which is better? The more atomic triple store format, or a rich property graph? Do you need to mix and match many different vocabularies in one query or data model? Do you need to create an OWL ontology or do inference? Do you need to serialize a bunch of java objects in memory to a database? Do you need to do fast traversal of long paths? Those types of questions would guide your selection.
Graphs are graphs, both of them do graphs, and so I don't think there's much difference in terms of what they can represent, or how you go about thinking about a problem in "graph terms". The differences boil down to the architecture under the hood, and what sorts of use cases you think you'll need. I won't tell you one is better than the other, but choose wisely.
(in reply to the comments on this answer: https://stackoverflow.com/a/30167732 )
When an owl:inverseOf production rule is defined, the inverse property triple is inferred by the reasoner either when adding or updating the store, or when selecting from the store. This is a "materialized relation".
Schema.org - an RDFS vocabulary - defines, for example, https://schema.org/isPartOf as the inverse property of hasPart. If both are specified, it's not necessary to run another graph pattern query to traverse a directed relation in the other direction; for example:
(:book1 schema:hasPart ?o)
(?o schema:isPartOf :book1)
(?s schema:hasPart :chapter2)
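As a small illustrative sketch (prefix declarations omitted, as above; the owl:inverseOf axiom is asserted explicitly here, and it only has an effect if the store applies OWL entailment):

# asserted
schema:isPartOf owl:inverseOf schema:hasPart .
:book1 schema:hasPart :chapter2 .

# inferred by the reasoner (materialized at load/update time, or derived at query time)
:chapter2 schema:isPartOf :book1 .

So a query in either direction finds the link without a second graph pattern.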
It's certainly possible to use RDFS and OWL to describe schema for and within neo4j property graphs; but there's no reasoner to e.g. infer inverse properties or do schema validation.
Is there any RDF graph that neo4j cannot store? RDF has datatypes and languages for objects: you'd need to reify properties where datatypes and/or languages are specified (and you'd be re-implementing well-defined semantics)
Can every neo4j graph be represented with RDF? Yes.
RDF is a representation for graphs for which there are very many store implementations that are optimized for various use cases like insert and query performance.
Comparing neo4j to a particular triplestore (with reasoning support) might be a more useful comparison given that all neo4j graphs can be expressed as RDF.
We are working on a project trying to map a structure like Java code connections with Neo4j 2.1.5. We have succeeded in connecting Applications-Jars-Classes-Methods and can for example get a Cypher answer resulting in:
App1-->Jar1-->Class1-->Method1-->Method2-->Method3<--Class22<--Jar2<--App1
Now we would like to get a condensed answer showing which Jars are connected like this, "hiding" the existing path above:
Jar1--Jar2
Is it possible with Cypher to get this result without creating a new Relationship like
Jar1-[:PATH_EXISTS]-Jar2
We can't find anything related to collapsing/hiding paths in the manual, nor here on Stack Overflow.
Regards
Christofer
There are basically two ways of going about this.
The first is to explicitly create the new relationship, but I won't talk about this much because it seems you've thought of that and rejected it. That method is easy, but more disk intensive (depending on the size of your graph).
The second is simply to query for the path when needed, with a variable length path like this:
MATCH (jar1 {myid: "something"})-[*]->(jar2 {myid: "somethingelse"})
RETURN jar2;
This will get you what you need, but it requires that this distant path be recomputed every time it's needed. So, it's easy, but it's compute intensive.
Now, more broadly, what it sounds like you want is something like a graph inference engine. In the OWL/RDF world, people will create ontologies that describe different types of entities, and the relationships between them. One of the consequences of these relationships is that they can be transitive and can have implications on them. A classic example is that a person is an entity, and things like motherOf and fatherOf are relationships between people. So if you have a path of fatherOf relationships between nodes, i.e. (A)-[:fatherOf]->(B)-[:fatherOf]->(C), the inference engine will return the "fact" that (A) and (C) are related by family. This would be a consequence of your ontological definition. That "fact" wouldn't actually be in the RDF store, it would simply be entailed by the facts.
In your case, you'd do something like writing an ontology that specified that all of the individual relationships you have in your graph are a specialization of some relationship type (like "related to"). You'd then ask the reasoner if a "related to" relationship exists between Jar1 and Jar2, and the answer would be yes because of your ontological definitions.
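A tiny sketch of the kind of axioms involved (namespaces and names invented for the example):

@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix :     <http://example.org/family#> .

:fatherOf  rdfs:subPropertyOf :relatedTo .
:relatedTo a owl:TransitiveProperty .

:A :fatherOf :B .
:B :fatherOf :C .
# a reasoner entails :A :relatedTo :B and :B :relatedTo :C (subproperty),
# and then, by transitivity, :A :relatedTo :C, even though that triple was never stored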
OK, so the bad news is that neo4j isn't RDF and doesn't do this. Also, doing this sort of thing is way harder than I'm making it sound; correct ontology modeling is an art unto itself, not unlike logic programming from the prolog world of the 1970s. But basically, that kind of inference is what it sounds like you're looking for.
What I think you might be able to hope for in some future release of neo4j is something akin to a database "view", or better schema support. I.e. it ought to be possible to specify that whenever a certain relationship pattern holds, some other relationship ought also be present.
We are working on a system where users can define their own nodes and connections, and can query them with arbitrary queries. A user can create a "branch" much like in SCM systems and later can merge back changes into the main graph.
Is it possible to create an efficient data model for that in Neo4j? What would be the best approach? Of course we don't want to duplicate all the graph data for every branch as we have several million nodes in the DB.
I have read Ian Robinson's excellent article on Time-Based Versioned Graphs and Tom Zeppenfeldt's alternative approach with Network versioning using relationnodes but unfortunately they are solving a different problem.
I would love to know what you guys think; any thoughts appreciated.
I'm not sure what your experience level is. Any insight into that would be helpful.
It would be my guess that this system would rely heavily on tags (labels) on the nodes. Maybe come up with 5-20 node types that are very broad, including the names and a few key properties. Then you could allow the users to select from those base categories and create their own spin-offs by adding tags.
Say you had your basic categories of (:Thing{Name:"",Place:""}) and (:Object{Category:"",Count:4})
Your users would have a drop-down or something with "Thing" and "Object". They'd select "Thing" for instance, and type a new label (Say "Cool"), values for "Name" and "Place", and add any custom properties (IsAwesome:True).
So now you've got a new node (:Thing:Cool {Name: "Rock", Place: "Here", IsAwesome: true}), which allows you to query by the broad categories or by a user's created categories. Hopefully this would keep each broad category to a proportional fraction of your overall node count.
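A rough Cypher sketch of that flow (using the made-up labels and properties from above):

// the user picked the broad "Thing" category and added their own "Cool" label
CREATE (:Thing:Cool {Name: "Rock", Place: "Here", IsAwesome: true});

// query by the broad base category...
MATCH (t:Thing) RETURN t;

// ...or by the user-created one
MATCH (c:Cool) RETURN c;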
Not sure if this is exactly what you're asking for. Good luck!
Hmm. While this isn't insane, first think about the type of system you're replacing: SQL. In SQL databases you wouldn't use branches, because it's data storage. If you're trying to get data from multiple sources into one DB, I'd suggest exporting them all to CSV files and using MERGE statements in Cypher to bring them all into your DB at once.
This could work similarly to branching: when you merge, each person runs a script on their own copy of the DB that takes all the nodes and edges in their copy and puts them into a CSV, e.g.
MATCH (n)-[e]->(n2)
RETURN n, e, n2
Then compare these CSVs as you pull them into your final DB to see what's already there from the other copies.
LOAD CSV WITH HEADERS FROM "file:///YourFile.csv" AS file
// each row is assumed to carry the properties of both endpoint nodes
MERGE (N:Node {Property1: file.Property1, Property2: file.Property2})
MERGE (N2:Node {Property1: file.Property3, Property2: file.Property4})
MERGE (N)-[:Edge]->(N2)
This will work, as long as you're using node types that you already know about and each person isn't creating new data structures that you don't know about until the merge.