Individuals from DBpedia - Jena

I have an exercise in Semantic Web. I must extract some individuals from DBpedia, and these individuals must be inserted into an ontology that I have to create. My question is: can I retrieve individuals from DBpedia?
Let me clarify!
When I send this SPARQL query
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT distinct * WHERE
{
?album a dbo:Album .
} LIMIT 10
I get 10 URIs. Should I get whole instances? I mean label, object properties, data properties, etc., in order to insert them into my ontology?
I want them as complete instances. I don't want to add more variables, e.g.
?album dbo:artist ?artist .
Can I use a Java API, e.g. Jena?
EDIT:
Let me give you an example. Suppose that you get an Album with URI
http://dbpedia.org/resource/...Baby_One_More_Time_(album)
This album also has some properties with their values, e.g.
dbo:artist dbr:Britney_Spears
dbo:releaseDate 1999-01-12 (xsd:date)
...
How could I get all of them in order to create an individual album for my ontology, with properties artist and releaseDate and values Britney_Spears and 1999-01-12, respectively?

Well, a good place to start is always your requirements! What exactly do you need? There is a plethora of scientific research on Ontology Module Extraction (see for example here).
My rule of thumb is that the amount you extract must align with the soundness and completeness you require of your results, which in turn aligns with your requirements. To make it clear, consider the following: a DBpedia Artist is a subClassOf Person. Now suppose you extract all the instances of Artist from DBpedia without the piece of information that Artist is a subClassOf Person. If you then query your dataset asking for Person, you will get nothing. Is this a sound result? Yes. But is it complete? No! However, if you don't care about the fact that each Artist is a Person, then it's okay. It is also worth mentioning that this depends on the DBpedia endpoint itself and what kind of reasoning it performs.
In conclusion: specify what you really need. A couple of classes with their instances may suffice, or you may just as well extract the whole of DBpedia.
Regarding getting the data, there are multiple ways, again depending on your requirements. For simple purposes, you can use Jena TDB for triple storage and access the data via Jena; you can even store your data simply in an RDF file. You can, for example, run a CONSTRUCT query against the DBpedia endpoint, request the results as RDF, and then insert them into your RDF engine (see the sketch below). Another option, described for example in this answer, is to use an INSERT query to insert the data into a local graph.
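To make the CONSTRUCT-then-store route concrete, here is a minimal Jena sketch, assuming Apache Jena (ARQ) on the classpath; the output file name is just a placeholder, and you could equally add the resulting model to a TDB dataset:
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.rdf.model.Model;
import java.io.FileOutputStream;

public class DBpediaAlbums {
    public static void main(String[] args) throws Exception {
        String query =
            "PREFIX dbo: <http://dbpedia.org/ontology/> " +
            "CONSTRUCT { ?album ?p ?o } WHERE { " +
            "  { SELECT DISTINCT ?album WHERE { ?album a dbo:Album } LIMIT 10 } " +
            "  ?album ?p ?o " +
            "}";
        // run the query against the public endpoint and collect the result as a Jena Model
        QueryExecution qexec =
            QueryExecutionFactory.sparqlService("https://dbpedia.org/sparql", query);
        Model albums = qexec.execConstruct();
        // store as a plain RDF file; alternatively, add the model to a TDB dataset
        albums.write(new FileOutputStream("albums.ttl"), "TURTLE");
        qexec.close();
    }
}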

You can retrieve instances from DBpedia with whatever metadata you want, but what to take depends on the ontology you would like to create. Please take a look at this document; it will help you understand some of the notions.
Should you get whole instances? I think you are asking whether you should take all the properties and objects attached to the subject. Not necessarily. It depends on your ontology, as stated above, and you decide what to take.
Should you use Jena? You can, but you don't have to! If you pose a CONSTRUCT query to the endpoint you can get the data, and as far as I understand you don't want to add variables. So you can pose a query as follows, asking for all the metadata related to the instance.
PREFIX dbo: <http://dbpedia.org/ontology/>
CONSTRUCT { ?album ?p ?o } WHERE {
  ?album a dbo:Album .
  ?album ?p ?o
}
If you want a limited number of instances, note that a LIMIT at the end of this query caps the number of returned triples rather than the number of albums; to limit the albums themselves, select them in a subquery as in the sketch below.
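For example, the following variant (a sketch; the LIMIT value is arbitrary) first picks ten distinct albums in a subquery and then constructs all of their triples:
PREFIX dbo: <http://dbpedia.org/ontology/>
CONSTRUCT { ?album ?p ?o } WHERE {
  { SELECT DISTINCT ?album WHERE { ?album a dbo:Album } LIMIT 10 }
  ?album ?p ?o
}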

Related

Query-document similarity with doc2vec

Given a query and a document, I would like to compute a similarity score using Gensim doc2vec.
Each document consists of multiple fields (e.g., main title, author, publisher, etc.).
For training, is it better to concatenate the document fields and treat each row as a unique document or should I split the fields and use them as different training examples?
For inference, should I treat a query like a document? Meaning, should I call the model (trained over the documents) on the query?
The right answer will depend on your data & user behavior, so you'll want to try several variants.
Just to get some initial results, I'd suggest combining all fields into a single 'document', for each potential query-result, and using the (fast-to-train) PV-DBOW mode (dm=0). That will let you start seeing results, doing either some informal assessment or beginning to compile some automatic assessment data (like lists of probe queries & docs that they "should" rank highly).
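As a minimal sketch of that starting point, assuming gensim 4.x; the field names and toy records here are made up purely for illustration:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# toy records; in practice these come from your own data
records = [
    {"id": "doc1", "title": "Graph Databases in Practice", "author": "Jane Doe", "publisher": "ACME"},
    {"id": "doc2", "title": "Semantic Search Basics", "author": "John Smith", "publisher": "ACME"},
]

# one TaggedDocument per record, with all fields concatenated into a single token list
corpus = [
    TaggedDocument(
        words=(r["title"] + " " + r["author"] + " " + r["publisher"]).lower().split(),
        tags=[r["id"]],
    )
    for r in records
]

model = Doc2Vec(corpus, dm=0, vector_size=100, epochs=40, min_count=1)  # PV-DBOW mode

# treat the query like a short document: infer a vector, then rank documents by similarity
query_vector = model.infer_vector("graph database search".split())
print(model.dv.most_similar([query_vector], topn=2))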
You could then try testing the idea of making the fields separate docs – either instead-of, or in addition-to, the single-doc approach.
Another option might be to create specialized word-tokens per field. That is, when 'John' appears in the title, you'd actually preprocess it to be 'title:John', and when in author, 'author:John', etc. (This might be in lieu of, or in addition to, the naked original token.) That could enhance the model to also understand the shifting senses of each token, depending on the field.
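A rough sketch of that preprocessing (the field names are hypothetical; whether to also keep the naked token is one of the variants to test):
def field_tokens(record, fields=("title", "author", "publisher")):
    """Prefix each token with its field, e.g. 'John' in the author field becomes 'author:John'."""
    tokens = []
    for field in fields:
        for tok in record.get(field, "").lower().split():
            tokens.append(field + ":" + tok)
            tokens.append(tok)  # optionally also keep the original, unprefixed token
    return tokens

# field_tokens({"title": "Graph Databases", "author": "John Smith"})
# -> ['title:graph', 'graph', 'title:databases', 'databases', 'author:john', 'john', 'author:smith', 'smith']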
Then, provided you have enough training data and choose other model parameters well, your search interface might also preprocess queries similarly when the user indicates a certain field, and get improved results. (Or maybe not: it's just an idea to be tried.)
In all cases, if you need precise results – exact matches of well-specified user queries – more traditional searches like exact DB matches/greps, or full-text reverse-indexes, will outperform Doc2Vec. But when queries are more approximate, and results need filling-out with near-in-meaning-even-if-not-in-literal-tokens results, a fuzzier vector document representation may be helpful.

Using ontology to infer labels for process model

I'm trying to implement a specific type of process mining that has been presented in this thesis [link]. It is based on HMMs and generates a process model in the form of a directed graph, where:
Nodes are called intentions and correspond to hidden states
Edges are called strategies and consist of different activities
These activities correspond to the HMM's observable emissions
Intentions can be fulfilled using different strategies
A user event log consisting of user IDs, timestamps and activities is used as input. The image below is an example of such a process model. The highlighted nodes and edges resemble the path that has been predicted using the Viterbi algorithm.
You can see that the graph's nodes and edges carry only numeric labels, which make it possible to distinguish between the different strategies and intentions. In order to make these labels more meaningful to a human reader, I'd like to infer some suitable labels.
My idea is to use an ontology to obtain those labels. After some research I figured out that I probably needed to do something that is generally referred to as "ontology learning". For this I would need to create some axioms in RDF/OWL format and then use these as input for a reasoner that would infer an ontology.
Is this approach correct and reasonable to achieve my goal?
If this is the way to go, I will need some tool to generate axioms in an automated way. So far I couldn't find any tool that would do that completely out of the box. Based on what I've seen so far, I conclude that I would need to define some kind of mapping between the original data and the desired axioms. I took a closer look at Protégé, which offers a plugin for spreadsheets. It seems to be based on the MappingMasterDSL project [link].
I've also found an interesting paper [link] on ontology learning where an RNN-based model is trained in an end-to-end fashion to translate definitory sentences into OWL formulae. BUT: my user event log data does not contain any natural sentences. Its activities are defined by tokens derived from HTML elements of the user interface. Therefore the RNN-based approach does not seem to be applicable here. (For the interested reader, the related project can be found here [link].)
Isn't there really any easier way than hand-crafting the axioms' schema(ta)?
Assuming that I have created my axioms and inferred an ontology, I would like to use the strategies' (edges') observable activities (emissions) to infer a suitable label. I guess I would need to query my ontology somehow. I could use the activity names as parameters for my query and look for some related entities that reveal the desired label. I'm expecting something like:
"I have a strategy with ID=3, that strategy can be executed with
actions a, b and c, give me all entities of the ontology, that
have these actions as property value and show and give me all related
labels for those entities"
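Expressed as SPARQL, that expectation might look roughly like the following sketch; the vocabulary (ex:hasAction and the action IRIs) is purely hypothetical:
PREFIX ex:   <http://example.org/process#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?entity ?label WHERE {
  ?entity ex:hasAction ex:a, ex:b, ex:c ;
          rdfs:label ?label .
}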
But where would the data for the labels actually come from?
I think I'm missing some important step during the process of ontology learning. Where do I find an additional data source for the labels and how do I relate this data to my ontology's entities?
Also I'm wondering if there is a way to incorporate the inherent knowledge of the process model's topology into my ontology.

Graph Databases vs Triple Stores - when to use which?

I know that there are similar questions around on Stackoverflow but I don't feel they answer the following.
Graph Databases to my understanding store data following mostly this schema:
Table/Collection 1: store nodes with UID
Table/Collection 2: store relations referencing nodes via UID
This allows storing arbitrary types of graphs. Now, as I understand it, triple stores store nothing but triples:
Triple/Collection 1: store triples (2 nodes, 1 relation)
Now I would see the following distinction regarding use cases:
Graph Databases: when you have known, static connections
Triple Stores: when you have loosely connected nodes and are often looking for new connections
I am confused by the fact that people do not seem to be discussing which one to use according to these criteria. Most articles I find talk about arguments like speed or compatibility. But is this not the most relevant point?
Put the other way round:
Imagine having a clearly connected, user-defined graph. Why on earth would you want to store that as triples only, losing all the information about the connections? Or having to implement some custom solution storing IDs in the triple subject.
Imagine having loosely connected nodes that you want to query for unknown relations using SPARQL. Graph databases do support that, but for this they would have to build another index, I assume, and would be slower?
EDIT:
I see that "loosing info about connections" is the wrong way to put it. If you do as shown in the accepted answer and insert several triples for 2 nodes + 1 relation then you keep all the info and specifically the info what exact nodes are connected.
The main difference between graph databases and triple stores is how they model the graph. In a triple store (or quad store), the data tends to be very atomic. What I mean is that the "nodes" in the graph tend to be primitive data types like string, integer, date, etc. Relationships link primitives together, and so the "unit of discourse" in a triple store is a triple, and not a node or a relationship, typically.
By contrast, other graph databases are often called "property stores" because nodes are data containers that correspond to objects in a domain. A node stands in for an object, and has properties; they act as rich data types specified by the graph modelers, more than just primitive data types. In these graph databases, nodes and relationships are the "unit of discourse".
Let's say I have a person named "Bob" who knows "Susan". In RDF, it would be something like this:
<http://example.org/person/1> :hasName "Bob".
<http://example.org/person/1> foaf:knows <http://example.org/person/2>.
<http://example.org/person/2> :hasName "Susan".
In a graph database like neo4j, it would be this:
(a:Person {name: "Bob"})-[:KNOWS]->(b:Person {name: "Susan"})
Notice that in RDF, it's 3 relationships, but only one of those relationships actually expresses semantics between two entities. The other two relationships are just tracking properties of a single higher-level entity (the person). In neo4j, it's 1 relationship between two nodes, with each node having a property. In RDF you'll tend to identify things by URI; in neo4j it's a database object that gets a database ID automatically. That's what I mean about the difference between a more atomic/primitive store (triple stores) and a richer property graph.
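To make the difference concrete, "whom does Bob know?" against each model might look roughly like this (prefix declarations omitted on the RDF side, as in the triples above):
In SPARQL:
SELECT ?friendName WHERE {
  ?person :hasName "Bob" ;
          foaf:knows ?friend .
  ?friend :hasName ?friendName .
}
In Cypher:
MATCH (:Person {name: "Bob"})-[:KNOWS]->(friend:Person)
RETURN friend.name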
RDF and triple stores are mostly built for the kinds of architectural challenges you'd run into with the semantic web. For example, XML namespacing is built in, on the architectural assumption that you'll be mixing and matching the use of many different vocabularies and namespaces. (That right there is a very "semantic web" assumption). So in SPARQL and RDF you'll see typically at least the use of xsd, rdf, and rdfs namespaces concurrently, and probably also owl, skos, and many others. SPARQL and RDF/RDFS also have many hooks and features that are there explicitly to make things like ontology inference easier. You'll tend to identify things with URIs as a way of "namespacing your identifiers" but also because some people may want to de-reference the URI...again the assumption here is a wide data sharing arrangement between many parties.
Property stores, by contrast, are keyed towards different use cases, like flexible modeling of data within one model/namespace, mappings between objects and graphs for persistence of enterprise applications, rapid evolvability, and so on. You'll tend to identify things with your own scheme (or an internal database ID). An auto-incrementing integer may not be the best form of ID for any random consumer on the web (and it certainly can't be de-referenced like a URL), but it may be perfectly fine for a company-internal application.
So which is better? The more atomic triple store format, or a rich property graph? Do you need to mix and match many different vocabularies in one query or data model? Do you need to create an OWL ontology or do inference? Do you need to serialize a bunch of java objects in memory to a database? Do you need to do fast traversal of long paths? Those types of questions would guide your selection.
Graphs are graphs, both of them do graphs, and so I don't think there's much difference in terms of what they can represent, or how you go about thinking about a problem in "graph terms". The differences boil down to the architecture under the hood, and what sorts of use cases you think you'll need. I won't tell you one is better than the other, but choose wisely.
(in reply to the comments on this answer: https://stackoverflow.com/a/30167732 )
When an owl:inverseOf production rule is defined, the inverse property triple is inferred by the reasoner either when adding or updating the store, or when selecting from the store. This is a "materialized relation".
Schema.org - an RDFS vocabulary - defines, for example, https://schema.org/isPartOf as the inverse property of hasPart. If both are specified, it's not necessary to run another graph pattern query to traverse a directed relation in the other direction.
(:book1 schema:hasPart ?o)
(?o schema:isPartOf :book1)
(?s schema:hasPart :chapter2)
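In a store without a reasoner, a SPARQL 1.1 inverse property path gives the same traversal without materializing the inverse triples, for example:
# find the ?whole that :chapter2 is part of, using only the hasPart direction
SELECT ?whole WHERE { :chapter2 ^schema:hasPart ?whole }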
It's certainly possible to use RDFS and OWL to describe schema for and within neo4j property graphs; but there's no reasoner to e.g. infer inverse properties or do schema validation.
Is there any RDF graph that neo4j cannot store? RDF has datatypes and languages for objects: you'd need to reify properties where datatypes and/or languages are specified (and you'd be re-implementing well-defined semantics).
Can every neo4j graph be represented with RDF? Yes.
RDF is a representation for graphs for which there are very many store implementations that are optimized for various use cases like insert and query performance.
Comparing neo4j to a particular triplestore (with reasoning support) might be a more useful comparison given that all neo4j graphs can be expressed as RDF.

Cypher query: Is it possible to "hide" an existing path with a "virtual relationship"?

We are working on a project trying to map a structure like Java code connections with Neo4j 2.1.5. We have succeeded in connecting Applications-Jars-Classes-Methods and can for example get a Cypher answer resulting in:
App1-->Jar1-->Class1-->Method1-->Method2-->Method3<--Class22<--Jar2<--App1
Now we would like to get the condensed answer of which Jars are connected like this, "hiding" the existing path above:
Jar1--Jar2
Is it possible with Cypher to get this result without creating a new Relationship like
Jar1-[:PATH_EXISTS]-Jar2
We can't find anything related to collapsing/hiding paths in the manual or here on Stack Overflow.
Regards
Christofer
There are basically two ways of going about this.
The first is to explicitly create the new relationship, but I won't talk about this much because it seems you've thought of that and rejected it. That method is easy, but more disk intensive (depending on the size of your graph).
The second is simply to query for the path when needed, with a variable length path like this:
MATCH (jar1 {myid: "something"})-[*]->(jar2 {myid: "somethingelse"})
RETURN jar2;
This will get you what you need, but it requires that this distant path be recomputed every time it's needed. So, it's easy, but it's compute intensive.
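If what you ultimately want is the condensed list of connected Jar pairs rather than a yes/no for two known Jars, a variant like the following might work, assuming your Jar nodes carry a :Jar label (and note that the unbounded variable-length match can be expensive on a large graph):
MATCH (jar1:Jar)-[*]->(jar2:Jar)
WHERE jar1 <> jar2
RETURN DISTINCT jar1, jar2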
Now, more broadly what it sounds like you want is something like a graph inference engine. In the OWL/RDF world, people will create ontologies that describe different types of entities, and the relationships between them. One of the consequences of these relationships is that they can be transitive and can have implications on them. A classic example is that a person is an entity, and things like motherOf and fatherOf are relationships between. So if you have a path of fatherOf relationships between nodes, i.e. (A)-[:fatherOf]->(B)-[:fatherOf]->(C), the inference engine will return the "fact" that (A) and (C) are related by family. This would be a consequence of your ontological definition. That "fact" wouldn't actually be in the RDF store, it would simply be entailed by the facts.
In your case, you'd do something like writing an ontology that specified that all of the individual relationships you have in your graph are a specialization of some relationship type (like "related to"). You'd then ask the reasoner if a "related to" relationship exists between Jar1 and Jar2, and the answer would be yes because of your ontological definitions.
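Sketched in Turtle with a made-up vocabulary, such an ontology could be as small as the following; a reasoner would then entail a :relatedTo link between any two nodes joined by a chain of these relationships:
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix :     <http://example.org/code#> .

# each concrete relationship type is declared a specialization of :relatedTo,
# and :relatedTo is transitive, so chains of edges lift to a single entailed link
:CONTAINS  rdfs:subPropertyOf :relatedTo .
:CALLS     rdfs:subPropertyOf :relatedTo .
:relatedTo a owl:ObjectProperty , owl:TransitiveProperty .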
OK, so the bad news is that neo4j isn't RDF and doesn't do this. Also, doing this sort of thing is way harder than I'm making it sound; correct ontology modeling is an art unto itself, not unlike logic programming from the Prolog world of the 1970s. But basically, that kind of inference is what it sounds like you're looking for.
What I think you might be able to hope for in some future release of neo4j is something akin to a database "view", or better schema support. I.e. it ought to be possible to specify that whenever a certain relationship pattern holds, some other relationship ought also be present.

Parsing Wikipedia countries, regions, cities

Is it possible to get a list of all Wikipedia countries, regions and cities with relations between them? I couldn't find any API appropriate for this task.
What would be the easiest way to parse all the information I need?
PS: I know that there are other data sources I could get this information from, but I am interested in Wikipedia...
[2020 update] This is now best done using the Wikidata Query Service; you can run very specific queries with a bit of SPARQL, for example: find all countries and their labels, as in the sketch below. See Wikidata Query Help.
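On query.wikidata.org, where the wd:/wdt:/wikibase: prefixes are predefined, that query looks like this:
SELECT ?country ?countryLabel WHERE {
  ?country wdt:P31 wd:Q6256 .   # instance of: country
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}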
It might be a bit tedious to get the whole graph but you can get most of the data from the experimental/non-official Wikidata Query API.
I suggest the following workflow:
Go to an instance of the kind of entities you want to work with, say Estonia (Q191), and look for its instance of (P31) properties; you will find: country, sovereign state, member of the UN, member of the EU, etc.
Use the Wikidata Query API claim command to output every entity that has the chosen P31 property. Let's try with country (Q6256):
http://wdq.wmflabs.org/api?q=claim[31:6256]
It outputs an array of numeric ids: those are your countries! (Notice that the result is still incomplete, as only 141 items are found: either countries are missing from Wikidata or, as suggested by Nemo in the comments, some countries are to be found in subclasses (P279) of country (Q6256).)
You may want more than ids, though, so you can ask the official Wikidata API for the entities' data:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q16&format=json&props=labels|claims&languages=en|fr
(here, Canada (Q16) data, in JSON, with only claims and labels, in English and French; look at the documentation to adapt the parameters to your needs)
You can query multiple entities at a time, with a limit of 50, as follows:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q16|Q17|Q20|Q27|Q28|Q29|Q30|Q31|Q32|Q33|Q34|Q35|Q36|Q37|Q38|Q39|Q40|Q41|Q43|Q45|Q77|Q79|Q96|Q114&format=json&props=labels|claims&languages=en|fr
From every country's data, you could look for entities registered as administrative subdivisions (P150) and repeat the process on those new entities.
Alternatively, you can get the whole tree of administrative subdivisions with the tree command. For instance, for France (Q142) that would be http://wdq.wmflabs.org/api?q=tree[142][150] Tadaaa, 36994 items! But that's way harder to refine given the different kinds of subdivision you can encounter from one country to another. And avoid doing this kind of query from a browser; it might crash.
You now just have to find cities by country by refining this last query with the claim command and the appropriate subclass (P279) of the municipality (Q15284) entity (all available here): for France, that's commune (Q484170), so your request looks like
http://wdq.wmflabs.org/api?q=tree[142][150] AND claim[31:484170]
then repeat for all the countries: have fun!
You should go with Wikidata and/or DBpedia.
Personally, I'd start with Wikidata as it runs directly on MediaWiki, with the same API, so you can use similar code. I would use pywikibot to get started. That way you can still request pages from Wikipedia where it makes sense (e.g. list pages or categories).
Here's a nice overview of ways to access Wikidata
