Using ontology to infer labels for process model - machine-learning

I'm trying to implement a specific type of process mining that has been presented in this thesis [link]. It is based on HMMs and generates a process model in the form of a directed graph, where:
Nodes are called intentions and correspond to hidden states
Edges are called strategies and consist of different activities
These activities correspond to the HMM's observable emissions
Intentions can be fulfilled using different strategies
A user event log consisting of user IDs, timestamps and activities is used as input. The image below is an example of such a process model. The highlighted nodes and edges represent the path that has been predicted using the Viterbi algorithm.
You can see that the graph's nodes and edges only carry numeric labels, which merely make it possible to distinguish between the different strategies and intentions. To make these labels more meaningful to a human reader, I'd like to infer suitable labels.
My idea is to use an ontology to obtain those labels. After some research I figured out that I probably need to do something that is generally referred to as "ontology learning". For this I would need to create some axioms in RDF/OWL format and then use these as input for a reasoner that would infer an ontology.
Is this approach correct and reasonable to achieve my goal?
If this is the way to go, I will need some tool to generate axioms in an automated way. So far I couldn't find any tool that would do that completely out of the box. Based on what I've seen so far, I conclude that I would need to define some kind of mapping between the original data and the desired axioms. I took a closer look at Protégé, which offers a plugin for spreadsheets. It seems to be based on the MappingMasterDSL project [link].
I've also found an interesting paper [link] on ontology learning where an RNN-based model is trained in an end-to-end fashion to translate definitory sentences into OWL formulae. BUT: My user event log data does not contain any natural-language sentences. Its activities are defined by tokens derived from HTML elements of the user interface. Therefore the RNN-based approach does not seem to be applicable here. (For the interested reader, the related project can be found here [link])
Isn't there really any easier way than hand-crafting the axioms' schema(ta)?
Assuming that I have created my axioms and inferred an ontology, I would like to use the strategies' (edges') observable activities (emissions) to infer a suitable label. I guess I would need to query my ontology somehow. I could use the activity names as parameters for my query and look for some related entities that reveal the desired label. I'm expecting something like:
"I have a strategy with ID=3, that strategy can be executed with
actions a, b and c. Give me all entities of the ontology that
have these actions as property values, and give me all related
labels for those entities"
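To make the kind of query I have in mind concrete, here is a minimal sketch in plain Python. It assumes the ontology is available as a set of (subject, predicate, object) triples and that label triples already exist, which is exactly the part I'm unsure about. The entity names, the `ex:hasAction` property and the labels are all made up for illustration.

```python
# Toy in-memory triple set; every name and label here is hypothetical.
TRIPLES = {
    ("ex:SearchProduct", "ex:hasAction", "a"),
    ("ex:SearchProduct", "ex:hasAction", "b"),
    ("ex:SearchProduct", "ex:hasAction", "c"),
    ("ex:SearchProduct", "rdfs:label", "Search for a product"),
    ("ex:Checkout", "ex:hasAction", "c"),
    ("ex:Checkout", "ex:hasAction", "d"),
    ("ex:Checkout", "rdfs:label", "Check out"),
}


def labels_for_actions(triples, actions):
    """Return the labels of all entities that have every given action."""
    wanted = set(actions)
    actions_by_entity = {}
    for s, p, o in triples:
        if p == "ex:hasAction":
            actions_by_entity.setdefault(s, set()).add(o)
    # Keep only entities whose action set covers all requested actions.
    matches = {s for s, acts in actions_by_entity.items() if wanted <= acts}
    return sorted(o for s, p, o in triples if p == "rdfs:label" and s in matches)


# Strategy with ID=3 can be executed with actions a, b and c.
strategy_3_actions = ["a", "b", "c"]
print(labels_for_actions(TRIPLES, strategy_3_actions))
```

In a real setup this lookup would presumably be a SPARQL query against the inferred ontology rather than a Python loop.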
But where would the data for the labels actually come from?
I think I'm missing some important step during the process of ontology learning. Where do I find an additional data source for the labels and how do I relate this data to my ontology's entities?
Also I'm wondering if there is a way to incorporate the inherent knowledge of the process model's topology into my ontology.

Related

How much data / context needed to train custom NER Spacy model?

I am trying to extract previous Job titles from a CV using spacy and named entity recognition.
I would like to train spacy to detect a custom named entity type : 'JOB'. For that I have around 800 job title names from https://www.careerbuilder.com/browse/titles/ that I can use as training data.
In my training data for spacy, do I need to integrate these job titles in sentences added to provide context or not?
In general, in a CV the job title kind of stands on its own and is not really part of a full sentence.
Also, if I need to provide coherent context for each of the 800 titles, it will be too time-consuming for what I'm trying to do, so maybe there are other solutions than NER?
Generally, Named Entity Recognition relies on the context of words, otherwise the model would not be able to detect entities in previously unseen words. Consequently, the list of titles would not help you to train any model. You could rather run string matching to find any of those 800 titles in CV documents and you will even be guaranteed to find all of them - no unknown titles, though.
If you could find 800 (or fewer) real CVs and replace the job titles in them with those in your list (or others!), then you are all set to train a model capable of NER. This would be the way to go, I suppose. Just download as many freely available CVs from the web as you can and see where this gets you. If it is not enough data, you can augment it, for example by exchanging the job titles in the data for some of the titles in your list.
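The string-matching route could look roughly like the sketch below. It uses only the standard `re` module; the three titles and the CV snippet are made-up stand-ins for the real list of 800 and a real document.

```python
import re


def build_title_pattern(titles):
    """Compile one case-insensitive regex that matches any known title."""
    # Longest titles first, so "Senior Software Engineer" wins over
    # "Software Engineer" when both could match at the same position.
    ordered = sorted(set(titles), key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def find_titles(text, pattern):
    """Return (start, end, matched_text) for every title found in text."""
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


# Hypothetical sample data.
JOB_TITLES = ["Software Engineer", "Senior Software Engineer", "Data Analyst"]
cv_text = (
    "2015-2018  Senior Software Engineer, Acme Corp\n"
    "2012-2015  data analyst, Initech"
)

pattern = build_title_pattern(JOB_TITLES)
for start, end, title in find_titles(cv_text, pattern):
    print(start, end, title)
```

The character offsets returned here are also the kind of span annotation you would need if you later decide to build NER training data from real CVs.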

Query-document similarity with doc2vec

Given a query and a document, I would like to compute a similarity score using Gensim doc2vec.
Each document consists of multiple fields (e.g., main title, author, publisher, etc)
For training, is it better to concatenate the document fields and treat each row as a unique document or should I split the fields and use them as different training examples?
For inference, should I treat a query like a document? Meaning, should I call the model (trained over the documents) on the query?
The right answer will depend on your data & user behavior, so you'll want to try several variants.
Just to get some initial results, I'd suggest combining all fields into a single 'document', for each potential query-result, and using the (fast-to-train) PV-DBOW mode (dm=0). That will let you start seeing results, doing either some informal assessment or beginning to compile some automatic assessment data (like lists of probe queries & docs that they "should" rank highly).
You could then try testing the idea of making the fields separate docs – either instead-of, or in addition-to, the single-doc approach.
Another option might be to create specialized word-tokens per field. That is, when 'John' appears in the title, you'd actually preprocess it to be 'title:John', and when in author, 'author:John', etc. (This might be in lieu of, or in addition to, the naked original token.) That could enhance the model to also understand the shifting senses of each token, depending on the field.
Then, providing you have enough training data, & choose other model parameters well, your search interface might also preprocess queries similarly, when the user indicates a certain field, and get improved results. (Or maybe not: it's just an idea to be tried.)
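As a rough illustration of the field-specific token idea, here is a small preprocessing sketch. The function name, the `keep_plain` switch and the sample record are my own assumptions, not anything Gensim provides.

```python
def field_tokens(record, keep_plain=True):
    """Turn a {field: text} record into one token list with field prefixes.

    With keep_plain=True the naked original token is emitted as well,
    so the model sees both 'title:John' and 'John'.
    """
    tokens = []
    for field, text in record.items():
        for word in text.split():
            tokens.append(f"{field}:{word}")
            if keep_plain:
                tokens.append(word)
    return tokens


# Hypothetical document with two fields.
doc = {"title": "John Adams", "author": "John Smith"}
print(field_tokens(doc, keep_plain=False))
print(field_tokens(doc))
```

The resulting list would then be used as the words of one training document, and a query restricted to a field would be run through the same function before inference.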
In all cases, if you need precise results – exact matches of well-specified user queries – more traditional searches like exact DB matches/greps, or full-text reverse-indexes, will outperform Doc2Vec. But when queries are more approximate, and results need filling-out with near-in-meaning-even-if-not-in-literal-tokens results, a fuzzier vector document representation may be helpful.

Detect common features in multidimensional data

I am designing a system for anomaly detection.
There are multiple approaches to building such a system. I chose to implement one facet of it by detecting features shared by the majority of samples. I acknowledge the possible insufficiencies of such a method, but for my specific use case: (1) It suffices to know that a new sample contains (or lacks) features shared by the majority of past data to make a quick decision. (2) I'm interested in the insights such a method will offer into the data.
So, here is the problem:
Consider a large data set with M data points, where each data point may include any number of {key:value} features. I choose to model a training dataset by grouping all the features observed in the data (the set of all unique keys) and setting it as the model's feature space. I define each sample by setting its values for existing keys and None for values in features it does not include.
Given this training data set, I want to determine which features recur in the data and, for such recurring features, whether they mostly share a single value.
My question:
A simple solution would be to count everything: for each of the N features, calculate the distribution of values. However, as M and N are potentially large, I wonder if there is a more compact way to represent the data or a more sophisticated method to make claims about features' frequencies.
Am I reinventing an existing wheel? If there's an online approach for accomplishing such a task, it would be even better.
If I understand your question correctly, you need to go over all the data anyway, so why not use hashing?
Actually two hash tables:
Inner hash table for the distribution of feature values.
Outer hash table for feature existence.
This way, the total of the counts in a feature's inner hash table indicates how common the feature is in your data, while the number of distinct values (the size of the inner table) and their counts indicate how much the values differ from one another. Another thing to notice is that you go over your data only once, and the average time complexity of (almost) every hash table operation is O(1), provided you allocate enough space from the beginning.
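A minimal sketch of the two-level hash table in Python, assuming every feature value is hashable. The function names, the summary fields and the sample data are only illustrative.

```python
from collections import Counter, defaultdict


def count_features(samples):
    """Single pass: outer table maps feature -> inner table of value counts."""
    value_counts = defaultdict(Counter)
    for sample in samples:
        for key, value in sample.items():
            value_counts[key][value] += 1
    return value_counts


def summarize(value_counts, n_samples):
    """Per feature: how often it occurs and how dominant its top value is."""
    summary = {}
    for key, counter in value_counts.items():
        total = sum(counter.values())
        top_value, top_count = counter.most_common(1)[0]
        summary[key] = {
            "coverage": total / n_samples,    # share of samples with the key
            "distinct_values": len(counter),  # size of the inner table
            "top_value": top_value,
            "top_share": top_count / total,   # share held by the top value
        }
    return summary


# Hypothetical samples; a missing key simply does not appear in the dict.
samples = [
    {"os": "linux", "port": 443},
    {"os": "linux", "port": 80},
    {"os": "linux"},
    {"os": "windows", "port": 443, "proxy": "yes"},
]

counts = count_features(samples)
for feature, stats in summarize(counts, len(samples)).items():
    print(feature, stats)
```

Because the counters are updated one sample at a time, the same structure also works online: feed each new sample through the inner loop and recompute the summary when needed.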
Hope it helps

What's the optimal structure for a multi-domain sentence/word graph in Neo4j?

I'm implementing abstractive summarization based on this paper, and I'm having trouble deciding on the best way to implement the graph so that it can be used for multi-domain analysis. Let's start with Twitter as an example domain.
For every tweet, each sentence would be graphed like this (ex: "#stackoverflow is a great place for getting help #graphsftw"):
(#stackoverflow)-[next]->(is)
-[next]->(a)
-[next]->(great)
-[next]->(place)
-[next]->(for)
-[next]->(getting)
-[next]->(help)
-[next]->(#graphsftw)
This would yield a graph similar to the one outlined in the paper:
To have a kind of domain layer for each word, I'm adding them to the graph like this (with properties including things like part of speech):
MERGE (w:Word:TwitterWord {orth: "word" }) ON CREATE SET ... ON MATCH SET ...
In the paper, they set a property on each word {SID:PID}, which describes the sentence id of the word (SID) and also the position of each word in the sentence (PID); so in the example sentence "#stackoverflow" would have a property of {1:1}, "is" would be {1:2}, "#graphsftw" {1:9}, etc. Each subsequent reference to the word in another sentence would add an element to the {SID:PID} property array: [{1:x}, {n:n}].
It doesn't seem like having sentence and positional information as an array of elements contained within a property of each node is efficient, especially when dealing with multiple word-domains and sub-domains within each word layer.
For each word layer or domain like Twitter, what I want to do is get an idea of what's happening around specific domain/layer entities like mentions and hashtags; in this example, #stackoverflow and #graphsftw.
What is the best way to add subdomain layers on top of, for example, a 'Twitter' layer, such that different words are directed towards specific domain entities like #hashtags and #mentions? I could use a separate label for each subdomain, like :Word:TwitterWord:Stackoverflow, but that would give my graph a ton of separate labels.
If I include the subdomain entities in a node property array, then it seems like traversal would become an issue.
Since all tweets and extracted entities like #mentions and #hashtags are being graphed as nodes/vertices prior to the word-graph step, I could have edges going from #hashtags and #mentions to words. Or, I could have edges going from tweets to words with the entities as an edge property. Basically, I'm looking for a structure that is the "cheapest" in terms of both storage and traversal.
Any input on how generally to structure this graph would be greatly appreciated. Thanks!
You could also put the domains / positions on the relationships (and perhaps also add a source-id).
OTOH you can also infer that information as long as your relationships represent the original sentence.
You could then either aggregate the relationships dynamically to compute the strengths or have a separate "composite" relationship that aggregates all the others into a counter or sum.
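To illustrate the idea outside of Neo4j, here is a small Python sketch that keeps domain, sentence id and position on each "next" relationship and then aggregates them into a per-pair counter. The dictionary keys and the second example sentence are assumptions made for the example.

```python
from collections import Counter


def sentence_edges(tokens, sentence_id, domain):
    """One 'next' relationship per adjacent word pair.

    pid is the 1-based position of the source word, so the target word
    sits at position pid + 1 in the same sentence.
    """
    return [
        {"src": a, "dst": b, "domain": domain, "sid": sentence_id, "pid": i}
        for i, (a, b) in enumerate(zip(tokens, tokens[1:]), start=1)
    ]


def aggregate(edges):
    """Collapse all relationships between the same pair into one count."""
    return Counter((e["src"], e["dst"]) for e in edges)


edges = []
edges += sentence_edges(
    "#stackoverflow is a great place for getting help #graphsftw".split(),
    1, "twitter",
)
# Hypothetical second sentence, to make some word pairs repeat.
edges += sentence_edges("#graphsftw is a great tag".split(), 2, "twitter")

strength = aggregate(edges)
print(strength[("is", "a")])
print(strength[("#stackoverflow", "is")])
```

The counter plays the role of the "composite" relationship; whether to store it or compute it on demand depends on how often the strengths are read compared to how often sentences are added.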

Graph Databases vs Triple Stores - when to use which?

I know that there are similar questions around on Stack Overflow, but I don't feel they answer the following.
Graph Databases to my understanding store data following mostly this schema:
Table/Collection 1: store nodes with UID
Table/Collection 2: store relations referencing nodes via UID
This allows storing arbitrary types of graphs. Now, as I understand it, triple stores store nothing but triples:
Triple/Collection 1: store triples (2 nodes, 1 relation)
Now I would see the following distinction regarding use cases:
Graph Databases: when you have known, static connections
Triple Stores: when you have loosely connected nodes and are often looking for new connections
I am confused by the fact that people do not seem to be discussing which one to use according to these criteria. Most articles I find talk about arguments like speed or compatibility. But is this not the most relevant point?
Put the other way round:
Imagine having a clearly connected, user-defined graph. Why on earth would you want to store that as triples only, losing all the info about connections? Or having to implement some custom solution storing IDs in the triple subject?
Imagine having loosely collected nodes that you want to query for unknown relations using SPARQL. Graph databases do support that. But for this they have to build another index I assume and would be slower?
EDIT:
I see that "losing info about connections" is the wrong way to put it. If you do as shown in the accepted answer and insert several triples for 2 nodes + 1 relation, then you keep all the info, specifically the info about exactly which nodes are connected.
The main difference between graph databases and triple stores is how they model the graph. In a triple store (or quad store), the data tends to be very atomic. What I mean is that the "nodes" in the graph tend to be primitive data types like string, integer, date, etc. Relationships link primitives together, and so the "unit of discourse" in a triple store is a triple, and not a node or a relationship, typically.
By contrast, other graph databases are often called "property stores" because nodes are data containers that correspond to objects in a domain. A node stands in for an object, and has properties; they act as rich data types specified by the graph modelers, more than just primitive data types. In these graph databases, nodes and relationships are the "unit of discourse".
Let's say I have a person named "Bob" who knows "Susan". In RDF, it would be something like this:
<http://example.org/person/1> :hasName "Bob".
<http://example.org/person/1> foaf:knows <http://example.org/person/2>.
<http://example.org/person/2> :hasName "Susan".
In a graph database like neo4j, it would be this:
(a:Person {name: "Bob"})-[:KNOWS]->(b:Person {name: "Susan"})
Notice that in RDF, it's 3 relationships but only one of those relationships actually expresses semantics between two entities. The other two relationships are just tracking properties of a single higher-level entity (the person). In neo4j, it's 1 relationship amongst two nodes, with each node having a property. In RDF you'll tend to identify things by URI, in neo4j it's a database object that gets a database ID automatically. That's what I mean about the difference between a more atomic/primitive store (triple stores) and a richer property graph.
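As a rough illustration of that difference, the sketch below flattens a tiny property graph into triples in plain Python. The data layout and the `to_triples` helper are assumptions for the example; it ignores labels, relationship properties, URIs and datatypes.

```python
def to_triples(nodes, relationships):
    """Flatten a property graph into (subject, predicate, object) triples."""
    triples = []
    # Every node property becomes its own triple.
    for node_id, props in nodes.items():
        for key, value in props.items():
            triples.append((node_id, key, value))
    # Every relationship becomes exactly one triple.
    for src, rel_type, dst in relationships:
        triples.append((src, rel_type, dst))
    return triples


# Property-graph view: two rich nodes and one relationship.
nodes = {"person/1": {"name": "Bob"}, "person/2": {"name": "Susan"}}
relationships = [("person/1", "KNOWS", "person/2")]

for triple in to_triples(nodes, relationships):
    print(triple)
```

Two nodes and one relationship come out as three triples, of which only one links two entities; the other two just carry the name properties.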
RDF and triple stores are mostly built for the kinds of architectural challenges you'd run into with the semantic web. For example, XML namespacing is built in, on the architectural assumption that you'll be mixing and matching the use of many different vocabularies and namespaces. (That right there is a very "semantic web" assumption). So in SPARQL and RDF you'll see typically at least the use of xsd, rdf, and rdfs namespaces concurrently, and probably also owl, skos, and many others. SPARQL and RDF/RDFS also have many hooks and features that are there explicitly to make things like ontology inference easier. You'll tend to identify things with URIs as a way of "namespacing your identifiers" but also because some people may want to de-reference the URI...again the assumption here is a wide data sharing arrangement between many parties.
Property stores, by contrast, are keyed towards different use cases, like flexible modeling of data within one model/namespace, mappings between objects and graphs for persistence of enterprise applications, rapid evolvability, and so on. You'll tend to identify things with your own scheme (or an internal database ID). An auto-incrementing integer may not be the best form of ID for any random consumer on the web (and it certainly can't be de-referenced like a URL), but URIs might not be your first thought for a company-internal application.
So which is better? The more atomic triple store format, or a rich property graph? Do you need to mix and match many different vocabularies in one query or data model? Do you need to create an OWL ontology or do inference? Do you need to serialize a bunch of java objects in memory to a database? Do you need to do fast traversal of long paths? Those types of questions would guide your selection.
Graphs are graphs, both of them do graphs, and so I don't think there's much difference in terms of what they can represent, or how you go about thinking about a problem in "graph terms". The differences boil down to the architecture under the hood, and what sorts of use cases you think you'll need. I won't tell you one is better than the other, but choose wisely.
(in reply to the comments on this answer: https://stackoverflow.com/a/30167732 )
When an owl:inverseOf production rule is defined, the inverse property triple is inferred by the reasoner either when adding to or updating the store, or when selecting from the store. This is a "materialized relation".
Schema.org - an RDFS vocabulary - defines, for example, https://schema.org/isPartOf as the inverse property of hasPart. If both are specified, it's not necessary to run another graph pattern query to traverse a directed relation in the other direction.
(:book1 schema:hasPart ?o)
(?o schema:isPartOf :book1)
(?s schema:hasPart :chapter2)
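What materializing the inverse amounts to can be sketched in a few lines of Python. This is a toy forward-chaining step over a set of tuples, not how any particular reasoner is implemented; the helper names and the :chapter1 triple are assumptions.

```python
# Declared inverse pairs (hasPart / isPartOf, as stated above).
INVERSE_OF = {
    "schema:hasPart": "schema:isPartOf",
    "schema:isPartOf": "schema:hasPart",
}


def materialize_inverses(triples, inverse_of=INVERSE_OF):
    """Return the asserted triples plus one inferred triple per inverse."""
    inferred = set(triples)
    for s, p, o in triples:
        if p in inverse_of:
            inferred.add((o, inverse_of[p], s))
    return inferred


def match(triples, s=None, p=None, o=None):
    """Tiny pattern matcher: None acts like a SPARQL variable."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


asserted = {
    (":book1", "schema:hasPart", ":chapter1"),
    (":book1", "schema:hasPart", ":chapter2"),
}
store = materialize_inverses(asserted)

# (?o schema:isPartOf :book1) now matches without a second traversal.
print(match(store, p="schema:isPartOf", o=":book1"))
```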
It's certainly possible to use RDFS and OWL to describe schema for and within neo4j property graphs; but there's no reasoner to e.g. infer inverse properties or do schema validation.
Is there any RDF graph that neo4j cannot store? RDF has datatypes and languages for objects: you'd need to reify properties where datatypes and/or languages are specified (and you'd be re-implementing well-defined semantics)
Can every neo4j graph be represented with RDF? Yes.
RDF is a representation for graphs for which there are very many store implementations that are optimized for various use cases like insert and query performance.
Comparing neo4j to a particular triplestore (with reasoning support) might be a more useful comparison given that all neo4j graphs can be expressed as RDF.

Resources