Ontology comparison in OWLAPI

I am using OWLAPI for a project, and I need to compare two ontologies for differences between them. This would ignore blank nodes so that, for instance, I can determine whether the same OWL restrictions are in both ontologies. Not only do I need to know whether there are differences, but I need to find out what those differences are. Does such functionality exist in the OWLAPI, or is there a relatively simple way to do this?

The equality between anonymous class expressions is not based on blank node ids: anonymous class expressions only have blank node ids in the textual output; in memory, the ids are ignored. So checking whether an axiom exists in an ontology will, by default, match expressions correctly for your diff.
This is not true for individuals: anonymous individuals will not be found to be the same across ontologies, and this is by specification. An anonymous individual in one ontology cannot be found in another, because anonymous individual ids are scoped to the containing ontology.
Note: the unit tests for OWLAPI have to carry out a very similar task, to verify that an ontology can be parsed, written and parsed again without change (i.e., roundtripped between input syntax and output syntax), so there is code you can look at for inspiration. See the equal() method in TestBase.java for more details. It includes code to deal with the differing ids of anonymous individuals.
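As a rough sketch of that first point (this is not from the OWLAPI test code, and the file names are placeholders), a structural diff can be as simple as comparing the axiom sets of the two ontologies; anonymous individuals would still need the extra handling mentioned above:

import java.io.File;
import java.util.HashSet;
import java.util.Set;

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.OWLAxiom;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyCreationException;
import org.semanticweb.owlapi.model.OWLOntologyManager;

public class OntologyDiff {
    public static void main(String[] args) throws OWLOntologyCreationException {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        // "left.owl" and "right.owl" are placeholder file names.
        OWLOntology left = manager.loadOntologyFromOntologyDocument(new File("left.owl"));
        OWLOntology right = manager.loadOntologyFromOntologyDocument(new File("right.owl"));

        // Axiom equality ignores blank node ids for anonymous class expressions,
        // so plain set operations are enough for a structural diff.
        Set<OWLAxiom> onlyInLeft = new HashSet<>(left.getAxioms());
        onlyInLeft.removeAll(right.getAxioms());

        Set<OWLAxiom> onlyInRight = new HashSet<>(right.getAxioms());
        onlyInRight.removeAll(left.getAxioms());

        onlyInLeft.forEach(ax -> System.out.println("Only in left:  " + ax));
        onlyInRight.forEach(ax -> System.out.println("Only in right: " + ax));
    }
}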

Related

Is there a well-defined difference between "normalizing" and "canonicalizing" data?

I understand canonicalization and normalization to mean removing any non-meaningful or ambiguous parts of a piece of data's presentation, turning effectively identical data into actually identical data.
For example, if you want to get the hash of some input data and it's important that anyone else hashing the canonically same data gets the same hash, you don't want one file indenting with tabs and the other using spaces (and no other difference) to cause two very different hashes.
In the case of JSON:
object properties would be placed in a standard order (perhaps alphabetically)
unnecessary white spaces would be stripped
indenting either standardized or stripped
the data may even be re-modeled in an entirely new syntax, to enforce the above
Is my definition correct, and the terms are interchangeable? Or is there a well-defined and specific difference between canonicalization and normalization of input data?
"Canonicalize" & "normalize" (from "canonical (form)" & "normal form") are two related general mathematical terms that also have particular uses in particular contexts per some exact meaning given there. It is reasonable to label a particular process by one of those terms when the general meaning applies.
Your characterizations of those specific uses are fuzzy. The formal meanings for general & particular cases are more useful.
Sometimes given a bunch of things we partition them (all) into (disjoint) groups, aka equivalence classes, of ones that we consider to be in some particular sense similar or the same, aka equivalent. The members of a group/class are the same/equivalent according to some particular equivalence relation.
We pick a particular member as the representative thing from each group/class & call it the canonical form for that group & its members. Two things are equivalent exactly when they are in the same equivalence class. Two things are equivalent exactly when their canonical forms are equal.
A normal form might be a canonical form or just one of several distinguished members.
To canonicalize/normalize is to find or use a canonical/normal form of a thing.
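A small worked example (not from the answer): take rational numbers with "represents the same number" as the equivalence relation, and lowest-terms fractions as the canonical forms. Then

\[ \tfrac{2}{4} \sim \tfrac{3}{6} \quad \text{because both have the canonical form } \tfrac{1}{2}. \]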
Canonical form.
The distinction between "canonical" and "normal" forms varies by subfield. In most fields, a canonical form specifies a unique representation for every object, while a normal form simply specifies its form, without the requirement of uniqueness.
Applying the definition to your example: do you have a bunch of values that you are partitioning, and are you picking some member(s) of each class over the other members of that class? You have JSON values, and short of re-modeling them you are partitioning them according to which same-class member they map to under a function. So you can reasonably call the resulting JSON values canonical forms of the inputs. If you characterize re-modeling as applicable to all inputs, then you can also reasonably call the post-re-modeling form of those canonical values canonical forms of the re-modeled input values. If not, people probably won't complain if you call the re-modeled values canonical forms of the input values, even though technically they wouldn't be.
Consider a set of objects, each of which can have multiple representations. From your example, that would be the set of JSON objects and the fact that each object has multiple valid representations, e.g., with different permutations of its members, more or less white space, etc.
Canonicalization is the process of converting any representation of a given object to one and only one representation, unique per object (a.k.a. its canonical form). To test whether two representations are of the same object, it suffices to test equality of their canonical forms; see also Wikipedia's definition.
Normalization is the process of converting any representation of a given object to a set of representations (a.k.a., "normal forms") that is unique per object. In such case, equality between two representations is achieved by "subtracting" their normal forms and comparing the result with a normal form of "zero" (typically a trivial comparison). Normalization may be a better option when canonical forms are difficult to implement consistently, e.g., because they depend on arbitrary choices (like ordering of variables).
Section 1.2 of the "A=B" book has some really good examples of both concepts.
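To make the JSON case concrete, here is a minimal hand-rolled sketch (the class and method names are made up, string escaping is deliberately ignored, and a production implementation would follow something like RFC 8785): member keys are sorted and insignificant whitespace is dropped, so effectively identical data yields identical canonical strings and therefore identical hashes.

import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Toy canonicalizer for JSON-like data built from Maps, Lists, Strings, Numbers,
// Booleans and null. Object members are emitted in sorted key order and all
// insignificant whitespace is omitted. Illustrative only: string escaping is
// skipped, and number formatting is left to String.valueOf.
public class JsonCanonicalizer {

    static String canonicalize(Object value) {
        if (value == null) {
            return "null";
        }
        if (value instanceof Map) {
            Map<String, Object> sorted = new TreeMap<>();
            ((Map<?, ?>) value).forEach((k, v) -> sorted.put(String.valueOf(k), v));
            return sorted.entrySet().stream()
                    .map(e -> "\"" + e.getKey() + "\":" + canonicalize(e.getValue()))
                    .collect(Collectors.joining(",", "{", "}"));
        }
        if (value instanceof List) {
            return ((List<?>) value).stream()
                    .map(JsonCanonicalizer::canonicalize)
                    .collect(Collectors.joining(",", "[", "]"));
        }
        if (value instanceof String) {
            return "\"" + value + "\"";   // simplified: no escaping of quotes etc.
        }
        return String.valueOf(value);     // numbers and booleans
    }

    public static void main(String[] args) {
        // Same data, different member order: both canonicalize to the same string.
        Map<String, Object> a = Map.of("name", "Ada", "age", 36);
        Map<String, Object> b = Map.of("age", 36, "name", "Ada");
        System.out.println(canonicalize(a)); // {"age":36,"name":"Ada"}
        System.out.println(canonicalize(b)); // {"age":36,"name":"Ada"}
    }
}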

Difference between rdf:seeAlso and rdfs:seeAlso

What is the difference between rdf:seeAlso and rdfs:seeAlso?
When can I use rdf:seeAlso and when can I use rdfs:seeAlso?
Can you give any examples?
First, note that rdf and rdfs are prefixes commonly used to reference the RDF syntax and RDF Schema vocabularies respectively. The rdf prefix is typically bound to http://www.w3.org/1999/02/22-rdf-syntax-ns#, so that rdf:seeAlso would expand to http://www.w3.org/1999/02/22-rdf-syntax-ns#seeAlso. However, if you follow the vocabulary reference, you won't find a term defined for seeAlso. The RDF syntax vocabulary is used for basic terms such as rdf:type, rdf:XMLLiteral, and rdf:langString. The RDF Schema vocabulary is typically bound to the rdfs prefix, and is at http://www.w3.org/2000/01/rdf-schema#. It is mostly used to define terms useful in performing simple reasoning over RDF graphs, such as rdfs:subClassOf, rdfs:domain, and rdfs:range.
In reality, the terms defined between the two vocabularies end up being in fairly arbitrary locations, and in retrospect there should probably have been just a single vocabulary definition at a more easily understood location (such as http://www.w3.org/ns/rdf#), but it's too late for that now.
When to use rdfs:seeAlso is less clear. The description says "Further information about the subject resource.", but there are no rules defined for how to use it. In Linked Data, it can be used to do just what it says, and a hypothetical Linked Data client might dereference the IRI values of rdfs:seeAlso to find more information that might be useful.
You can find out more in the RDF Concepts document and other publications of the RDF Working Group.
What is the difference between rdfs:seeAlso and rdfs:isDefinedBy?
These are defined pretty clearly in the specification:
5.4.1 rdfs:seeAlso
rdfs:seeAlso is an instance of rdf:Property that is used to indicate a resource that might provide additional information about the subject resource.
A triple of the form:
S rdfs:seeAlso O
states that the resource O may provide additional information about S. It may be possible to retrieve representations of O from the Web, but this is not required. When such representations may be retrieved, no constraints are placed on the format of those representations.
5.4.2 rdfs:isDefinedBy
rdfs:isDefinedBy is an instance of rdf:Property that is used to indicate a resource defining the subject resource. This property may be used to indicate an RDF vocabulary in which a resource is described.
A triple of the form:
S rdfs:isDefinedBy O
states that the resource O defines S. It may be possible to retrieve representations of O from the Web, but this is not required. When such representations may be retrieved, no constraints are placed on the format of those representations.
rdfs:isDefinedBy is a subproperty of rdfs:seeAlso.
When can I use rdfs:seeAlso and when can I use rdfs:isDefinedBy?
Can you give any examples?
You can use these whenever they're appropriate; just include the appropriate triples in your data. I don't think there's really a whole lot of need for examples in this case: if something's a related resource, add a seeAlso link; if something has a definition in another resource, add an isDefinedBy link. Note the last bit, "rdfs:isDefinedBy is a subproperty of rdfs:seeAlso". That means that whenever you assert "x rdfs:isDefinedBy y", you're implicitly asserting "x rdfs:seeAlso y".
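As a tiny concrete illustration (not from the answers; Apache Jena is just one RDF library you could use, and the IRIs are made up):

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDFS;

// Minimal sketch: attach rdfs:seeAlso and rdfs:isDefinedBy links to a resource.
// The IRIs are placeholders; any RDF toolkit would do equally well.
public class SeeAlsoExample {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        Resource term = model.createResource("http://example.org/vocab#Widget");

        // "More information lives over here" (no stronger semantics than that).
        term.addProperty(RDFS.seeAlso, model.createResource("http://example.org/docs/widget"));

        // "This term is defined by this vocabulary"; since rdfs:isDefinedBy is a
        // subproperty of rdfs:seeAlso, this also implies a seeAlso relationship.
        term.addProperty(RDFS.isDefinedBy, model.createResource("http://example.org/vocab"));

        model.write(System.out, "TURTLE");
    }
}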
rdfs:seeAlso is an instance of rdf:Property that is used to indicate a resource that might provide additional information about the subject resource.
rdf is the original Resource Description Framework vocabulary, whereas rdfs provides the schema vocabulary.

Why build an AST walker instead of having the nodes responsible for their own output?

Given an AST, what would be the reason behind making a Walker class that walks over the tree and does the output, as opposed to giving each Node class a compile() method and having it responsible for its own output?
Here are some examples:
Doctrine 2 (an ORM) uses a SQLWalker to walk over an AST and generate SQL from nodes.
Twig (a templating language) has the nodes output their own code (this is an if statement node).
Using a separate Walker for code generation avoids combinatorial explosion in the number of AST node classes as the number of target representations increases. When a Walker is responsible for code generation, you can retarget it to a different representation just by altering the Walker class. But when the AST nodes themselves are responsible for compilation, you need a different version of each node for each separate target representation.
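To make that trade-off concrete, here is a minimal, hypothetical sketch (the thread's real examples are PHP; this uses Java, and all class names are invented): the node classes stay fixed, and each target representation is a separate walker/visitor.

// Hypothetical AST for a tiny expression language, to contrast the two approaches.
interface Node {
    <T> T accept(Visitor<T> visitor);
}

final class NumberNode implements Node {
    final int value;
    NumberNode(int value) { this.value = value; }
    public <T> T accept(Visitor<T> v) { return v.visitNumber(this); }
}

final class AddNode implements Node {
    final Node left, right;
    AddNode(Node left, Node right) { this.left = left; this.right = right; }
    public <T> T accept(Visitor<T> v) { return v.visitAdd(this); }
}

interface Visitor<T> {
    T visitNumber(NumberNode node);
    T visitAdd(AddNode node);
}

// One walker per target representation; the node classes never change.
class SqlWalker implements Visitor<String> {
    public String visitNumber(NumberNode n) { return Integer.toString(n.value); }
    public String visitAdd(AddNode n) {
        return "(" + n.left.accept(this) + " + " + n.right.accept(this) + ")";
    }
}

class LispWalker implements Visitor<String> {
    public String visitNumber(NumberNode n) { return Integer.toString(n.value); }
    public String visitAdd(AddNode n) {
        return "(+ " + n.left.accept(this) + " " + n.right.accept(this) + ")";
    }
}

public class Demo {
    public static void main(String[] args) {
        Node ast = new AddNode(new NumberNode(1), new AddNode(new NumberNode(2), new NumberNode(3)));
        System.out.println(ast.accept(new SqlWalker()));  // (1 + (2 + 3))
        System.out.println(ast.accept(new LispWalker())); // (+ 1 (+ 2 3))
    }
}

If the nodes compiled themselves instead, adding the Lisp target would mean touching every node class rather than adding one new walker.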
Mostly because of old literature and available tools. Experimenting with both methods, you can easily find that AST traversal produces very slow and convoluted code. Moreover, code separated from the immediate syntax doesn't resemble it anymore. It's very much like supporting two synchronized code bases, which is always a bad idea. Debugging and maintenance become difficult.
Of course, it can also be difficult to process semantics on the nodes unless you have a well-designed state machine. In fact, you are never worse off than when traversing the AST after the fact, because that is just one particular case of processing semantics on nodes.
You can often hear that AST traversal allows multiple semantics to be implemented for the same syntax. In reality you would never want that, not only because it's rarely needed, but also for performance reasons. And frankly, there is no difficulty in writing a separate syntax for a different semantics. The results were always better when both were designed together.
And finally, in every non-trivial task, getting the syntax parsed is the easiest part; getting the semantics correct and processing actions fast is the challenge. Focusing on the AST is approaching the task backwards.
To have support for a feature that the "internal AST walker" doesn't have.
For example, there are several ways to traverse a "hierarchical" or "tree" structure, like "walk through the leaves first" or "walk through the branches first".
Or, if sibling nodes have a sort index, you may want to "walk" / "visit" them decrementally by their index, instead of incrementally.
If the AST class or structure you have only works with one method, you may want to use another method by writing your own custom "walker" / "visitor".

How to design a set of file readers and writers for different formats

Digging into a legacy project (C++) that needs to be extended, I realized that there are about 40 reader/writer/parser classes. They are used to read and write various types of data (different objects) in different file formats (binary, hdf5, xml, text, ...); one type of object is typically bound to one or two file formats. Most of the classes have no knowledge of the others. Interfaces and inheritance were apparently unknown to the writer, as well as design patterns.
It seems to me a horrendous mess. On the other hand, I am not exactly sure how to handle this situation. I will at least extract interfaces. I would also like to see if I can put common code in some parent classes, for example what is specific to an hdf5 reader/writer. I also thought that the Abstract Factory pattern could help, but the objects I get out of the readers are completely different.
How would you handle this situation? How would you design the classes? What design pattern would you use, if any? Would you keep the reading and writing parts split?
The Abstract Factory pattern isn't the right track. You usually only need interfaces if you anticipate multiple implementations for a given file type and want both to operate the same way.
Question: can one class be written to multiple file types? As in, object 'a' (of type Class A) potentially needs to be written to either/both XML or text formats?
If that is true, you need to decouple the classes from the readers/writers. Take a look at this question: What design pattern should I use for import/export?
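As a hedged sketch of what "decouple the classes from the readers/writers" can look like (all names are invented, and it is written in Java for brevity; the same shape translates directly to C++ with abstract base classes): the domain object knows nothing about formats, and each format gets its own writer implementation.

// Hypothetical sketch: the domain object is format-agnostic; each format has its own writer.
class Measurement {
    final String name;
    final double value;
    Measurement(String name, double value) { this.name = name; this.value = value; }
}

interface MeasurementWriter {
    String write(Measurement m);
}

class XmlMeasurementWriter implements MeasurementWriter {
    public String write(Measurement m) {
        return "<measurement name=\"" + m.name + "\" value=\"" + m.value + "\"/>";
    }
}

class TextMeasurementWriter implements MeasurementWriter {
    public String write(Measurement m) {
        return m.name + " = " + m.value;
    }
}

public class WriterDemo {
    public static void main(String[] args) {
        Measurement m = new Measurement("temperature", 21.5);
        for (MeasurementWriter w : new MeasurementWriter[] {
                new XmlMeasurementWriter(), new TextMeasurementWriter() }) {
            System.out.println(w.write(m));
        }
    }
}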

Alpha renaming in many languages

I have what I imagine will be a fairly involved technical challenge: I want to be able to reliably alpha-rename identifiers in multiple languages (as many as possible). This will require special consideration for each language, and I'm asking for advice for how to minimize the amount of work I need to do by sharing code. Something like a unified parsing or abstract syntax framework that already has support for many languages would be great.
For example, here is some python code:
def foo(x):
    def bar(y):
        return x+y
    return bar
An alpha renaming of x to y changes the x to a y and preserves semantics. So it would become:
def foo(y):
    def bar(y1):
        return y+y1
    return bar
See how we needed to rename y to y1 in order to keep from breaking the code? That is why this is a hard problem. It seems like the program would have to have a pretty good knowledge of what constitutes a scope, rather than just doing, say, a string search and replace.
I would also like to preserve as much of the formatting as possible: comments, spacing, indentation. But that is not 100% necessary, it would just be nice.
Any tips?
To do this safely, you need to be able to determine
all the identifiers (and the things that are not identifiers, e.g., the middle of a comment) in your code
the scope of validity for each identifier
how to substitute a new identifier for an old one in the text
whether renaming an identifier causes another name to be shadowed
To determine identifiers accurately, you need at least a language-accurate lexer. Identifiers in PHP look different than they do in COBOL.
To determine scopes of validity, you have to determine program structure in practice, since most "scopes" are defined by such structure. This means you need a language-accurate parser; scopes in PHP are different than scopes in COBOL.
To determine which names are valid in which scopes, you need to know the language's scoping rules. Your language may insist that the identifier X refers to different Xes depending on the context in which X is found (consider object constructors named X with different arguments). Now you need to be able to traverse the scope structures according to the naming rules. Single inheritance, multiple inheritance, overloading, and default types all pretty much require you to build a model of the scopes for the program, insert the identifiers and corresponding types into each scope, and then climb from the point of encounter of an identifier in the program text through the various scopes according to the language semantics. You will need symbol tables, inheritance linkages, ASTs, and the ability to navigate all of these. These structures differ between PHP and COBOL, but they share lots of common ideas, so you likely need a library with the common concept support.
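As a very rough illustration of the scope-model part (all names are invented, and real languages add overloading, inheritance, and namespaces on top of this), resolving an identifier means climbing a chain of nested scopes:

import java.util.HashMap;
import java.util.Map;

// Hypothetical, much-simplified scope model: each scope maps names to declarations
// and points to its enclosing scope. Resolving an identifier means climbing the
// chain from the scope where it occurs until a declaration is found.
class Scope {
    private final Scope parent;                       // null for the global scope
    private final Map<String, String> declarations = new HashMap<>();

    Scope(Scope parent) { this.parent = parent; }

    void declare(String name, String info) { declarations.put(name, info); }

    String resolve(String name) {
        for (Scope s = this; s != null; s = s.parent) {
            String decl = s.declarations.get(name);
            if (decl != null) {
                return decl;
            }
        }
        return null;                                  // unresolved identifier
    }

    public static void main(String[] args) {
        Scope global = new Scope(null);
        global.declare("x", "parameter of foo");
        Scope inner = new Scope(global);
        inner.declare("y", "parameter of bar");
        System.out.println(inner.resolve("x"));       // found in the outer scope
        System.out.println(inner.resolve("y"));       // found in the inner scope
    }
}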
To rename an identifier, you have to modify the text. In a million lines of code, you need to point carefully. Modifying an AST node is one way to point carefully. Actually, you need to modify all the identifiers that correspond to the one being renamed; you have to climb over the tree to find them all, or record in the AST where all the references exist so they can be found easily. After modifying the tree, you have to regenerate the source text. That's a lot of machinery; see my SO answer on how to prettyprint ASTs preserving all of the stuff you reasonably suggest should be preserved.
(Your other choice is to keep track in the AST of where the text for the identifier is, and then read/patch/write the file.)
Before you update the file, you need to check that you haven't shadowed something. Consider this code:
{ local x;
  x=1;
  { local y;
    y=2;
    { local z;
      z=y;
      print(x);
    }
  }
}
We agree this code prints "1". Now we decide to rename y to x.
We've broken the scoping: the print statement, which conceptually referred to the outer x, now refers to an x captured by the renamed y. The code now prints "2", so our rename broke it. This means that one must check all the other identifiers in scopes in which the renamed variable might be found, to see whether the new name "captures" some name we weren't expecting. (The rename would be legal if the print statement printed z.)
This is a lot of machinery.
Yes, there is a framework that has almost all of this, as well as a number of robust language front ends. See our DMS Software Reengineering Toolkit. It has parsers producing ASTs, prettyprinters to produce text back from ASTs, generic symbol table management machinery (including support for multiple inheritance), and AST visiting/modification machinery. It has front ends for C, C++, COBOL and Java that implement name and type resolution (e.g., instantiating symbol table scopes and identifier-to-symbol-table-entry mappings); it has front ends for many other languages that don't have scoping implemented yet.
We've just finished an exercise in implementing "rename" for Java. (All of the above issues of course appeared.) We are about to start one for C++.
You could try to create Xtext-based implementations for the involved languages. The Xtext framework provides reliable infrastructure for cross-language rename refactoring. However, you'll have to provide a grammar and at least a "good enough" scope resolution for each language.
Languages mostly guarantee tokens will be unique, whatever the context. A naive first approach (and this will break many, many pieces of code) would be:
cp file file.orig
sed -i 's/\bnewTokenName\b/TEMPTOKEN/g' file    # set aside any existing uses of the new name
sed -i 's/\boldTokenName\b/newTokenName/g' file # then rename old -> new
With GNU sed, this will break on PHP. Rewriting \b to a general token match, like ([^a-zA-Z~$-_][^a-zA-Z0-9~$-_]), would work on most C, Java, PHP, and Python, but not Perl (you would need to add # and % to the token characters). Beyond that, it would require a plugin architecture that works for any language you wanted to add. At some point, there will be two languages whose variable and function naming rules are incompatible, and at that point you'll need to do more and more in the plugin.
