Protégé: compare properties only - ontology

I want to see the difference between two ontologies in Protégé, so I used the 'Compare ontology' tool. However, I want to compare the properties of the two ontologies, rather than just seeing what was 'created', 'deleted', etc. How can I get that information?

Related

Model source information to maximize query performance

I am wondering about the best way (in terms of performance) to model data sources in Neo4j.
Consider the following scenario:
We are joining different datasets about the music domain into one graph. The data can range from different artists and styles to sales information. It is important to store the source of this information, e.g. whether the data comes from a public source like DBpedia or from some private source.
To be able to run queries on certain datasets only, we have to attach the source to each Node (and, ideally, to each Relation). Of course, one Node or Relation could have multiple sources.
There are three straightforward solutions:
1. Add a source property to each Node and Relation, index this property, and use it in a Cypher query. E.g.:
   MATCH (n:Artist) WHERE n.source = 'DBpedia' RETURN n
2. Add the source as a Label to each Node and as a Type to each Relation (can we have multiple types on one Relation? see the note after this list). E.g.:
   CREATE (n:Artist:DBpediaSource:CustomerSource)
3. Create a separate Node for each Source and link all other Nodes to the corresponding Source Node. E.g.:
   MATCH (n:Artist)-[:HASSOURCE]-(:DBpediaSource) RETURN n
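For reference, a Node created as in option 2 would be read back by label, e.g. (note that, as far as I know, a Node can carry multiple Labels, but a Relation has exactly one Type):
MATCH (n:Artist:DBpediaSource) RETURN n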
Of course, for these small examples the choice does not matter in terms of performance. However, when the source is used in more complex queries and on a bigger graph (let's say with a few million Nodes and Relations), the way we model this will have a significant influence on performance.
One more complex example where the sources are needed is the generation of a "subgraph".
We want to extract all Nodes and Relations from one or multiple Sources and, for example, export them to a new Neo4j instance, or restrict some graph algorithms such as PageRank to this "subgraph" without creating a separate Neo4j instance.
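For illustration, with the source-node model (option 3) such an extraction could start from a pattern like this (just a sketch, reusing the HASSOURCE example from above):
MATCH (n)-[:HASSOURCE]-(:DBpediaSource)
OPTIONAL MATCH (n)-[r]->(m)-[:HASSOURCE]-(:DBpediaSource)
RETURN n, r, m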
Does anyone in the community have experience with such a case? What is the best way to model this in terms of performance? Are there perhaps other solutions?
Thanks for your help.

Using ontology to infer labels for process model

I'm trying to implement a specific type of process mining that has been presented in this thesis [link]. It is based on HMMs and generates a process model in the form of a directed graph, where:
- Nodes are called intentions and correspond to hidden states
- Edges are called strategies and consist of different activities
- These activities correspond to the HMM's observable emissions
- Intentions can be fulfilled using different strategies
A user event log consisting of user IDs, timestamps and activities is used as input. The image below is an example of such a process model. The highlighted nodes and edges represent the path that has been predicted using the Viterbi algorithm.
You can see that the graph's nodes and edges carry only numeric labels, which merely distinguish the different strategies and intentions. In order to make these labels more meaningful to the human reader, I'd like to infer suitable labels.
My idea is to use an ontology to obtain those labels. After some research I figured out that I probably need to do something that is generally referred to as "ontology learning". For this I would need to create some axioms in RDF/OWL format and then use these as input for a reasoner, which would infer an ontology.
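For illustration, I imagine the hand-crafted axioms would look roughly like this (everything here, including the ex: namespace, the hasAction property, and all names, is made up):
@prefix ex:   <http://example.org/process#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:Strategy  a owl:Class .                    # strategies from the process model
ex:hasAction a owl:ObjectProperty ;
             rdfs:domain ex:Strategy .

ex:Checkout  a ex:Strategy ;                  # one concrete, labeled strategy
             ex:hasAction ex:clickBuy ;       # an activity token from the event log
             rdfs:label "Checkout" .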
Is this approach correct and reasonable to achieve my goal?
If this is the way to go, I will need some tool to generate the axioms in an automated way. So far I couldn't find any tool that would do that completely out of the box. Based on what I've seen so far, I conclude that I would need to define some kind of mapping between the original data and the desired axioms. I took a closer look at Protégé, which offers a plugin for spreadsheets. It seems to be based on the MappingMasterDSL project [link].
I've also found an interesting paper [link] on ontology learning where an RNN-based model is trained in an end-to-end fashion to translate definitional sentences into OWL formulae. BUT: my user event log data does not contain any natural-language sentences. Its activities are defined by tokens derived from HTML elements of the user interface. Therefore the RNN-based approach does not seem to be applicable here. (For the interested reader, the related project can be found here [link].)
Isn't there really any easier way than hand-crafting the axioms' schema(ta)?
Assuming that I have created my axioms and inferred an ontology, I would like to use the strategies' (edges') observable activities (emissions) to infer a suitable label. I guess I would need to query my ontology somehow. I could use the activity names as parameters for my query and look for some related entities that reveal the desired label. I'm expecting something like:
"I have a strategy with ID=3, that strategy can be executed with
actions a, b and c, give me all entities of the ontology, that
have these actions as property value and show and give me all related
labels for those entities"
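Expressed as a (hypothetical) SPARQL query, using the made-up ex: vocabulary from the sketch above, that would be something like:
PREFIX ex:   <http://example.org/process#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?entity ?label WHERE {
  ?entity ex:hasAction ex:a , ex:b , ex:c .   # entity must have all three actions
  ?entity rdfs:label ?label .
}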
But where would the data for the labels actually come from?
I think I'm missing some important step in the process of ontology learning. Where do I find an additional data source for the labels, and how do I relate this data to my ontology's entities?
Also, I'm wondering if there is a way to incorporate the inherent knowledge of the process model's topology into my ontology.

Is it normal that nodes with the same label have different properties?

In modeling, instances of the same label, e.g. Student, have the same set of properties. However, is it normal for instances of the same label to have different sets of properties? For example, I have a Product node:
(p:Product)-[:HAS_ATTRIBUTE]->(a:Attributes)
Different instances of Product result in different instances of Attributes. In this case, different Attributes nodes have very different properties.
Is this modeling normal? Different categories of products can have very different attributes.
It's very useful to have different properties. For instance, I have a Y-DNA project with single nucleotide polymorphism (SNP) nodes. Some are on the known haplotree and some are not. So I set a property InHGTree to Y or leave it blank to reflect this. Now I can more readily create queries using the haplotree branching.
BTW, relationships of the same type can also have different properties. DNA results from an individual are in a "kit." The kit is related to numerous SNPs. You want to be able to determine whether the kit is positive or negative for a given SNP. It is most logical to put this fact on the relationship between the kit and the SNP.
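As a sketch (all names here are illustrative), that model looks like:
// store the test outcome on the relationship itself
CREATE (:Kit {id: 'K123'})-[:TESTED {result: 'positive'}]->(:SNP {name: 'M269'})

// later: find all kits that are positive for a given SNP
MATCH (k:Kit)-[t:TESTED]->(:SNP {name: 'M269'})
WHERE t.result = 'positive'
RETURN k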
It's certainly allowed, as there is no table schema, as in relational DBs, to enforce homogeneous properties.
While this provides great flexibility, it may introduce complexity. It's up to the modelers and administrators of the database to provide any guidelines or implement restrictions, if needed.
While that would usually take the form of convention, APOC triggers (or kernel extensions, if you want to implement this yourself) could be used to enforce only certain properties for a node of a given label.
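For example, something along these lines could reject new Product nodes that lack a required property (a sketch only, assuming the APOC plugin is installed; the trigger name and the sku property are made up):
CALL apoc.trigger.add(
  'productsNeedSku',
  'UNWIND $createdNodes AS n
   WITH n WHERE "Product" IN labels(n)
   CALL apoc.util.validate(n.sku IS NULL, "Product nodes must have a sku property", [])
   RETURN count(*)',
  {phase: 'before'}
)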

Individuals from DBpedia

I have an exercise in Semantic Web. I must extract some individuals from DBpedia. These individuals must be inserted into an ontology that I must create. My question is: can I retrieve individuals from DBpedia?
Let me clarify!
When I send this SPARQL query
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT * WHERE
{
  ?album a dbo:Album .
} LIMIT 10
I get 10 URIs. Should I get whole instances? I mean label, object properties, data properties etc., in order to insert them into my ontology?
I want them as complete instances. I don't want to add more variables, e.g.
?album dbo:artist ?artist .
Can I use a Java API, e.g. Jena?
EDIT:
Let me give you an example. Suppose that you get an Album with URI
http://dbpedia.org/resource/...Baby_One_More_Time_(album)
This album also has some properties with their values, e.g.
dbo:artist dbr:Britney_Spears
dbo:releaseDate 1999-01-12 (xsd:date)
...
How could I get all of them in order to create an individual album for my ontology, with properties artist and releaseDate and values Britney_Spears and 1999-01-12 respectively?
Well, a good point to start from is always your requirements! What exactly do you need? There is a plethora of scientific research on Ontology Module Extraction (see for example here).
My rule of thumb is: the amount you extract must align with the required soundness and completeness of results, which in turn aligns with your requirements. To make this clear, consider the following: a DBpedia Artist is a subClassOf Person. Now suppose you extract all instances of Artist from DBpedia, without the piece of information that Artist is a subClassOf Person. If you then query your dataset asking for Person, you will get nothing. Is this a sound result? Yes. But is it complete? No! However, if you don't care about the fact that each Artist is a Person, then it's okay. Worth mentioning is that this also depends on the DBpedia endpoint itself and what kind of reasoning it performs.
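To illustrate with a sketch: if your extract contains the Artist instances but not the schema triple
dbo:Artist rdfs:subClassOf dbo:Person .
then the following query over your local copy returns nothing, even though every Artist is in fact a Person:
SELECT ?p WHERE { ?p a dbo:Person . }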
Concluding: specify what you really need. While a couple of classes with their instances may suffice, you could just as well extract the whole of DBpedia.
Regarding getting the data, there are multiple ways, again depending on your requirements. For simple purposes, you can use Jena TDB for triple storage and access the triples via Jena. You can even store your data simply in an RDF file. You can, for example, use a CONSTRUCT query on the DBpedia endpoint, specify the result format as RDF, and then insert the results into your RDF engine. Another option, for example this answer, states how to use an INSERT query to perform the insert into a local graph.
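As a sketch, a single SPARQL 1.1 update along these lines pulls album triples from DBpedia into a local named graph; it assumes your local store supports SPARQL Update and SERVICE (as e.g. Jena Fuseki does), and the graph URI is made up:
PREFIX dbo: <http://dbpedia.org/ontology/>
INSERT {
  GRAPH <http://example.org/myOntology> { ?album ?p ?o }
}
WHERE {
  SERVICE <https://dbpedia.org/sparql> {
    ?album a dbo:Album .
    ?album ?p ?o .
  }
}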
You can retrieve instances from DBpedia with whatever metadata you want, but it depends on the ontology that you would like to create. Please take a look at this document; it will help you understand some notions.
Should you get whole instances? I think you are asking whether you should take all the properties and objects belonging to the subject. Not necessarily. It depends on your ontology, as stated in the first step, and you decide what to take.
Should you use Jena? You can, but you don't have to! If you pose a CONSTRUCT query to the endpoint you can get the data, and since, as far as I understood, you don't want to enumerate specific properties as variables, you can pose a query like the following, asking for all the metadata of each instance:
PREFIX dbo: <http://dbpedia.org/ontology/>
CONSTRUCT { ?album ?p ?o } WHERE {
  ?album a dbo:Album .
  ?album ?p ?o
}
If you would like to get a limited number of instances, you can again add a LIMIT clause at the end of this query.

Graph Databases vs Triple Stores - when to use which?

I know that there are similar questions around on Stack Overflow, but I don't feel they answer the following.
Graph databases, to my understanding, mostly store data following this schema:
Table/Collection 1: store nodes with UID
Table/Collection 2: store relations referencing nodes via UID
This allows storing arbitrary types of graphs. Now, as I understand it, triple stores store nothing but triples:
Triple/Collection 1: store triples (2 nodes, 1 relation)
Now I would see the following distinction regarding use cases:
Graph Databases: when you have known, static connections
Triple Stores: when you have loosely connected nodes and are often looking for new connections
I am confused by the fact that people do not seem to be discussing which one to use according to these criteria. Most articles I find talk about arguments like speed or compatibility. But is this not the most relevant point?
Put the other way round:
Imagine having a clearly connected, user-defined graph. Why on earth would you want to store that as triples only, losing all the info about connections? Or having to implement some custom solution storing IDs in the triple subject.
Imagine having loosely connected nodes that you want to query for unknown relations using SPARQL. Graph databases do support that. But for this they have to build another index, I assume, and would be slower?
EDIT:
I see that "loosing info about connections" is the wrong way to put it. If you do as shown in the accepted answer and insert several triples for 2 nodes + 1 relation then you keep all the info and specifically the info what exact nodes are connected.
The main difference between graph databases and triple stores is how they model the graph. In a triple store (or quad store), the data tends to be very atomic. What I mean is that the "nodes" in the graph tend to be primitive data types like string, integer, date, etc. Relationships link primitives together, and so the "unit of discourse" in a triple store is a triple, and not a node or a relationship, typically.
By contrast, other graph databases are often called "property stores" because nodes are data containers that correspond to objects in a domain. A node stands in for an object, and has properties; they act as rich data types specified by the graph modelers, more than just primitive data types. In these graph databases, nodes and relationships are the "unit of discourse".
Let's say I have a person named "Bob" who knows "Susan". In RDF, it would be something like this:
<http://example.org/person/1> :hasName "Bob".
<http://example.org/person/1> foaf:knows <http://example.org/person/2>.
<http://example.org/person/2> :hasName "Susan".
In a graph database like neo4j, it would be this:
(a:Person {name: "Bob"})-[:KNOWS]->(b:Person {name: "Susan"})
Notice that in RDF it's 3 triples, and only one of those actually expresses a semantic relationship between two entities. The other two are just tracking properties of a single higher-level entity (the person). In neo4j it's 1 relationship between two nodes, with each node having a property. In RDF you'll tend to identify things by URI; in neo4j each node is a database object that gets a database ID automatically. That's what I mean about the difference between a more atomic/primitive store (triple stores) and a richer property graph.
RDF and triple stores are mostly built for the kinds of architectural challenges you'd run into with the semantic web. For example, XML namespacing is built in, on the architectural assumption that you'll be mixing and matching the use of many different vocabularies and namespaces. (That right there is a very "semantic web" assumption). So in SPARQL and RDF you'll see typically at least the use of xsd, rdf, and rdfs namespaces concurrently, and probably also owl, skos, and many others. SPARQL and RDF/RDFS also have many hooks and features that are there explicitly to make things like ontology inference easier. You'll tend to identify things with URIs as a way of "namespacing your identifiers" but also because some people may want to de-reference the URI...again the assumption here is a wide data sharing arrangement between many parties.
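For instance, even a small SPARQL query typically opens by mixing several of these vocabularies (the prefixes below are the standard W3C namespaces; the query itself is only an illustration):
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>

SELECT ?cls ?label WHERE {
  ?cls rdf:type owl:Class ;
       rdfs:label ?label .
  FILTER (lang(?label) = "en")
}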
Property stores by contrast are keyed towards different use cases, like flexible modeling of data within one model/namespace, mappings between objects and graphs for persistence of enterprise applications, rapid evolvability, and so on. You'll tend to identify things with your own scheme (or an internal database ID). An auto-incrementing integer may not be the best form of ID for any random consumer on the web (and it certainly can't be de-referenced like a URL), but for a company-internal application those concerns may not matter.
So which is better? The more atomic triple store format, or a rich property graph? Do you need to mix and match many different vocabularies in one query or data model? Do you need to create an OWL ontology or do inference? Do you need to serialize a bunch of java objects in memory to a database? Do you need to do fast traversal of long paths? Those types of questions would guide your selection.
Graphs are graphs; both of them do graphs, so I don't think there's much difference in terms of what they can represent, or how you go about thinking about a problem in "graph terms". The differences boil down to the architecture under the hood and what sorts of use cases you think you'll need. I won't tell you one is better than the other, but choose wisely.
(in reply to the comments on this answer: https://stackoverflow.com/a/30167732 )
When an owl:inverseOf production rule is defined, the inverse property triple is inferred by the reasoner either when adding or updating the store, or when selecting from the store. This is a "materialized relation"
Schema.org - an RDFS vocabulary - defines, for example, https://schema.org/isPartOf as the inverse property of hasPart. If both are specified, it's not necessary to run another graph pattern query to traverse a directed relation in the other direction:
(:book1 schema:hasPart ?o)
(?o schema:isPartOf :book1)
(?s schema:hasPart :chapter2)
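In Turtle, the axiom and the materialization look like this (a sketch; the :book1 and :chapter2 resources are illustrative):
@prefix :       <http://example.org/> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix schema: <https://schema.org/> .

schema:isPartOf owl:inverseOf schema:hasPart .

:chapter2 schema:isPartOf :book1 .     # asserted triple
:book1 schema:hasPart :chapter2 .      # triple a reasoner can materialize from the axiom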
It's certainly possible to use RDFS and OWL to describe schema for and within neo4j property graphs; but there's no reasoner to e.g. infer inverse properties or do schema validation.
Is there any RDF graph that neo4j cannot store? RDF has datatypes and language tags for objects: you'd need to reify properties where datatypes and/or language tags are specified (and you'd be re-implementing well-defined semantics)
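A sketch of such a reification for a language-tagged literal (the model here is made up, not a standard mapping):
// RDF:  :book1 rdfs:label "Der Prozess"@de .
// One possible neo4j encoding that keeps the language tag:
CREATE (:Resource {uri: 'http://example.org/book1'})
       -[:HAS_LABEL]->(:Literal {value: 'Der Prozess', lang: 'de'})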
Can every neo4j graph be represented with RDF? Yes.
RDF is a representation for graphs for which there are very many store implementations that are optimized for various use cases like insert and query performance.
Comparing neo4j to a particular triplestore (with reasoning support) might be a more useful comparison given that all neo4j graphs can be expressed as RDF.
