I am a beginner in the field of ontologies, ontology alignment, and ontology composition. What is the purpose of composing ontologies, on what basis is it performed, and how?
One of the main advantages of using ontologies is knowledge sharing. Different people from various backgrounds might develop ontologies for the same domain, which often results in different labels being used for the same concepts or relations. In order to take advantage of having multiple ontologies in the same domain, for example to build a more comprehensive and expressive domain ontology, ontology matching/alignment comes into play. In ontology matching, a mapping between the concepts and relations of the various ontologies is created.
For example, before the National Cancer Institute came up with the first version of its cancer ontology, there were already multiple ontologies modelling cancer. They started by combining the various available ontologies into a central, more reliable ontology.
There are various algorithms for ontology matching. They are normally categorised based on their:
input
process
output
Broadly speaking, you can match either on an element-to-element basis or based on structure. The tools used for matching can be linguistic resources such as WordNet for semantic matching, domain-specific resources, statistical approaches, taxonomies, various models, etc. There is a great deal of research in this area, so you should really consider searching Google Scholar.
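As a toy illustration of element-level matching with a linguistic resource, one could score pairs of concept labels with WordNet similarity. This is only a sketch (the two label lists are made up and the threshold is arbitrary; real matchers do far more):

    # Naive element-level matching: score pairs of concept labels
    # with WordNet similarity (illustrative only).
    # Requires the WordNet corpus: nltk.download('wordnet')
    from itertools import product
    from nltk.corpus import wordnet as wn

    labels_a = ["tumor", "physician"]      # hypothetical labels from ontology A
    labels_b = ["neoplasm", "doctor"]      # hypothetical labels from ontology B

    def label_similarity(a, b):
        """Best Wu-Palmer similarity over all noun senses of the two labels."""
        scores = [s1.wup_similarity(s2) or 0.0
                  for s1, s2 in product(wn.synsets(a, pos=wn.NOUN),
                                        wn.synsets(b, pos=wn.NOUN))]
        return max(scores, default=0.0)

    for a, b in product(labels_a, labels_b):
        score = label_similarity(a, b)
        if score > 0.8:                    # arbitrary cut-off
            print(f"candidate mapping: {a} <-> {b} ({score:.2f})")

Real systems combine several such element-level signals with structural evidence from the two hierarchies.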
I'm trying to find the best way to compare two text documents using AI and machine learning methods. I've used TF-IDF with cosine similarity and other similarity measures, but these compare the documents at the word (or n-gram) level.
I'm looking for a method that allows me to compare the meaning of the documents. What is the best way to do that?
You should start by reading about the word2vec model.
Use gensim and get Google's pretrained word-vector model.
For turning a whole document into a vector, use gensim's Doc2Vec.
After getting vectors for all your documents, use a distance metric such as cosine or Euclidean distance for comparison.
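A minimal gensim sketch of that recipe might look like the following (the toy corpus and the parameters are placeholders; in practice you would train on a much larger corpus or start from pretrained vectors):

    # Train a small Doc2Vec model and compare two documents by cosine distance.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from scipy.spatial.distance import cosine

    corpus = [
        "the cat sat on the mat",
        "a dog chased the cat",
        "stock markets fell sharply today",
    ]
    tagged = [TaggedDocument(words=doc.split(), tags=[i])
              for i, doc in enumerate(corpus)]

    model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

    vec_a = model.infer_vector("the cat sat on the mat".split())
    vec_b = model.infer_vector("a dog chased the cat".split())

    print("cosine distance:", cosine(vec_a, vec_b))   # smaller = more similar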
This is very difficult. There is actually no computational definition of "meaning". You should dive into text mining, summarization, and libraries such as gensim, spaCy or pattern.
In my opinion, if you are a newbie, the libraries with the highest return on investment (ROI), i.e. the most readily usable ones, are the tools built around chatbots: they aim to extract structured data from natural language, which is the closest thing to "meaning". One example of a free-software tool for this is Rasa NLU (natural language understanding).
The drawback of such tools is that they work reasonably well, but only in the domain they were trained and prepared for. In particular, they do not aim at comparing documents the way you want.
"I'm trying to find the best way to compare two text documents using AI"
You must come up with a more precise task and, from there, find out which technique applies best to your use case. Do you want to classify documents into predefined categories? Do you want to compute some similarity between two documents? Given an input document, do you want to find the most similar documents in a database? Do you want to extract important topics or keywords from the document? Do you want to summarize the document? Should it be an abstractive summary or key-phrase extraction?
In particular, there is no software that can extract some kind of semantic fingerprint from an arbitrary document. Depending on the end goal, the way to achieve it might be completely different.
You must narrow down the precise goal you are trying to achieve; from there, you will be able to ask another question (or improve this one) that describes your goal precisely.
Text understanding is AI-complete, so just telling the computer "tell me something about these two documents" doesn't work.
Like others have said, word2vec and other word embeddings are tools for achieving many goals in NLP, but they are only a means to an end. You must define the input and output of the system you are trying to design before you can start working on the implementation.
There are two other Stack Exchange communities that you might want to dig into:
Linguistics
Data Science
Given the TF-IDF value of each token in your corpus (or of the most meaningful ones), you can compute a sparse representation for a document.
This is implemented in sklearn's TfidfVectorizer.
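For example, a minimal sketch (the two example documents are placeholders):

    # Sparse TF-IDF vectors for two documents, compared with cosine similarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the quick brown fox jumps over the lazy dog",
        "a lazy dog sleeps while the fox runs away",
    ]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)          # sparse matrix, one row per document

    print(cosine_similarity(tfidf[0], tfidf[1]))    # word-level similarity only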
As other users have pointed out, this is not the best solution to your task.
You should instead consider embeddings.
The easiest solution consists in using an embedding at the word level, such as the one provided by the FastText framework.
You can then create an embedding for the whole document by summing (or averaging) the embeddings of the individual words that compose it.
An alternative consists in training an embedding directly at the document level, using a Doc2Vec implementation such as the one in gensim or DL4J.
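A rough sketch of the word-level variant with gensim's FastText implementation (the tiny training corpus is only for illustration, and averaging rather than summing is an arbitrary choice; normally you would load pretrained vectors):

    # Document embedding = average of FastText word vectors (illustrative only).
    import numpy as np
    from gensim.models import FastText

    sentences = [doc.split() for doc in [
        "the cat sat on the mat",
        "a dog chased the cat",
        "stock markets fell sharply today",
    ]]
    model = FastText(sentences, vector_size=50, min_count=1, epochs=20)

    def doc_vector(text):
        """Average the word vectors of all tokens in the document."""
        return np.mean([model.wv[w] for w in text.split()], axis=0)

    v1 = doc_vector("the cat sat on the mat")
    v2 = doc_vector("a dog chased the cat")
    similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    print("cosine similarity:", similarity)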
You can also use LDA or LSI models on your text corpus. These methods (and others such as word2vec and doc2vec) can summarize documents as fixed-length vectors that reflect their meaning and the topics they belong to.
Read more:
https://radimrehurek.com/gensim/models/ldamodel.html
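A rough gensim LDA sketch along those lines (the corpus, the number of topics, and the parameters are arbitrary):

    # Represent documents as topic-distribution vectors with LDA.
    from gensim import corpora
    from gensim.models import LdaModel

    texts = [doc.lower().split() for doc in [
        "the cat sat on the mat",
        "a dog chased the cat",
        "stock markets fell sharply today",
        "investors sold stocks as markets fell",
    ]]
    dictionary = corpora.Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(t) for t in texts]

    lda = LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)

    # Fixed-length topic vector for each document (one weight per topic).
    for bow in bow_corpus:
        print(lda.get_document_topics(bow, minimum_probability=0.0))

The resulting topic vectors can then be compared with cosine distance, as with the other document representations above.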
I heard there are three approaches from Dr. Golden:
- Cosine Angular Separation
- Hamming Distance
- Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI)
These methods are based on semantic similarity.
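Of the three, LSA is the easiest to sketch with scikit-learn; this is only an illustrative outline (the documents and the number of components are arbitrary):

    # LSA: truncated SVD on a TF-IDF matrix, then cosine similarity
    # in the reduced semantic space (illustrative only).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the cat sat on the mat",
        "a dog chased the cat",
        "stock markets fell sharply today",
        "investors sold stocks as markets fell",
    ]
    tfidf = TfidfVectorizer().fit_transform(docs)
    lsa_vectors = TruncatedSVD(n_components=2).fit_transform(tfidf)

    print(cosine_similarity(lsa_vectors[:1], lsa_vectors[1:]))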
I have also heard of a company that used a tool called spaCy to summarize documents in order to compare them to each other.
Both of them are widely used to type DBpedia resources, but it seems that YAGO has many more classes or concepts organized using the rdfs:subClassOf predicate. Despite this, it is not clear whether, for example, that class hierarchy is a DAG (as in DBpedia), how many classes make it up, etc.
DBpedia is a community effort to extract structured information from Wikipedia. In this sense, both YAGO and DBpedia share the same goal of generating a structured ontology. The projects differ in their foci. In YAGO, the focus is on precision, the taxonomic structure, and the spatial and temporal dimension. For a detailed comparison of the projects, see Chapter 10.3 of our AI journal paper "YAGO2: A Spatially and Temporally Enhanced Knowledge Base from Wikipedia".
[Link: http://resources.mpi-inf.mpg.de/yago-naga/yago/publications/aij.pdf]
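If you want to get a rough feeling for the size of such a hierarchy yourself, you can query the public endpoint. A sketch using Python's SPARQLWrapper (the endpoint URL, the namespace filter, and the idea of counting classes this way are my own assumptions, and the public endpoint may time out on heavy queries):

    # Count classes taking part in the rdfs:subClassOf hierarchy of the
    # DBpedia ontology, via the public SPARQL endpoint (rough sketch).
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT (COUNT(DISTINCT ?cls) AS ?n) WHERE {
            ?cls rdfs:subClassOf ?super .
            FILTER(STRSTARTS(STR(?cls), "http://dbpedia.org/ontology/"))
        }
    """)
    result = sparql.query().convert()
    print(result["results"]["bindings"][0]["n"]["value"])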
I have been doing research on feature selection and I'm failing to understand the difference between these two approaches.
According to most authors in the literature, feature selection algorithms fall into three categories. The first two, filter and wrapper, are easy to understand and there is general agreement on them. However, for the last category there seems to be some disagreement. Some authors, such as H. Liu, name the last category hybrid. In contrast, V. Kumar names it embedded. In addition, there are cases where authors define four categories, including both embedded and hybrid algorithms, as is the case with P. Abinaya.
Authors explain hybrid algorithms as the combination of a filter algorithm and a wrapper approach. The main idea behind these algorithms is to use a filter approach to reduce the search space for a wrapper approach.
On the other hand, the definition of embedded algorithms in the literature varies considerably depending on the source. Some use almost the same definition as for hybrid algorithms, as is the case on the Wikipedia page. Others give more abstract definitions, such as: methods that perform feature selection during the learning of optimal parameters, or methods that incorporate knowledge about the specific structure of the class of functions used by a certain learning machine.
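To make the abstract definition concrete, my current understanding is that an embedded method would be something like an L1-regularized model, where the selection happens as a by-product of fitting the model itself; a minimal scikit-learn sketch of what I mean (the dataset and threshold are arbitrary stand-ins):

    # Embedded selection as I understand it: the learner's own coefficients
    # decide which features are kept (L1 regularization drives some to zero).
    from sklearn.datasets import load_diabetes
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LassoCV

    X, y = load_diabetes(return_X_y=True)

    selector = SelectFromModel(LassoCV(cv=5), threshold=1e-5)
    selector.fit(X, y)

    print("selected features:", selector.get_support().nonzero()[0])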
So I would appreciate it if anyone could explain the difference between these two approaches or give a less abstract definition of embedded methods.
Thanks.
I'm trying to do some data mining with DBpedia. I now have a dataset with properties from the DBpedia ontology and from the DBpedia mappings, and I'm not sure about the difference between the two.
What is the difference between DBpedia ontology and DBpedia mapping?
In short, DBpedia is a very valuable resource for the Semantic Web community, but compared to Wikipedia it is quite small. Also, because many different people contribute to Wikipedia, the infobox information is not harmonised. Therefore, a mapping language has been created to define synonymy between infobox relations and DBpedia properties.
One of the challenges in extracting information from Wikipedia is that the same concepts can be expressed using different parameters in infobox and other templates, such as |birthplace= and |placeofbirth=. Because of this, queries about where people were born would have to search for both of these properties in order to get more complete results. As a result, the DBpedia Mapping Language has been developed to help in mapping these properties to an ontology while reducing the number of synonyms. Due to the large diversity of infoboxes and properties in use on Wikipedia, the process of developing and improving these mappings has been opened to public contributions.
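In practice this shows up as two property namespaces when querying DBpedia: dbo: properties come from the mapping-based (ontology) extraction, while dbp: properties come from the raw infobox extraction. A rough sketch with Python's SPARQLWrapper (the example resource and the exact raw property names are assumptions and vary from article to article):

    # Compare mapping-based (dbo:) and raw infobox (dbp:) birth-place values
    # for one resource (rough sketch; raw property names differ per article).
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX dbp: <http://dbpedia.org/property/>
        SELECT ?p ?place WHERE {
            <http://dbpedia.org/resource/Alan_Turing> ?p ?place .
            FILTER(?p IN (dbo:birthPlace, dbp:birthPlace, dbp:placeOfBirth))
        }
    """)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["p"]["value"], "->", row["place"]["value"])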
I have a bunch of text documents that describe diseases. Those documents are in most cases quite short and often only contain a single sentence. An example is given here:
Primary pulmonary hypertension is a progressive disease in which widespread occlusion of the smallest pulmonary arteries leads to increased pulmonary vascular resistance, and subsequently right ventricular failure.
What I need is a tool that finds all disease terms (e.g. "pulmonary hypertension" in this case) in the sentences and maps them to a controlled vocabulary like MeSH.
Thanks in advance for your answers!
Here are two pipelines that are specifically designed for medical document parsing:
Apache cTAKES
NLM's MetaMap
Both use UMLS, the Unified Medical Language System, and thus require that you have a (free) license. Both are Java-based and more or less easy to set up.
See http://www.ebi.ac.uk/webservices/whatizit/info.jsf
Whatizit is a text-processing system that allows you to perform text-mining tasks on text. The tasks are defined by the pipelines in the drop-down list on that page, and the text can be pasted into the text area.
You could also ask biostars: http://www.biostars.org/show/questions/
There are many tools to do that. Some popular ones:
NLTK (python)
LingPipe (java)
Stanford NER (java)
OpenCalais (web service)
Illinois NER (java)
Most of them come with predefined models, i.e. they have already been trained on general datasets (news articles, etc.). However, your texts are quite specific, so you might want to first build a corpus and re-train one of those tools in order to adapt it to your data.
More simply, as a first test, you can try a dictionary-based approach: build a list of entity names and perform exact or approximate matching. This operation is described, for instance, in LingPipe's tutorial.
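A minimal sketch of such a dictionary-based first pass in Python (the tiny lexicon and the MeSH-style identifiers are placeholders; a real run would load the terms from MeSH or a disease ontology, and only exact, case-insensitive matching is done here):

    # Naive dictionary-based tagger: scan the text for known disease terms.
    import re

    # Placeholder lexicon and IDs; in practice load these from MeSH or similar.
    lexicon = {
        "pulmonary hypertension": "MeSH:D0000001",      # placeholder identifier
        "right ventricular failure": "MeSH:D0000002",   # placeholder identifier
    }

    text = ("Primary pulmonary hypertension is a progressive disease in which "
            "widespread occlusion of the smallest pulmonary arteries leads to "
            "increased pulmonary vascular resistance, and subsequently right "
            "ventricular failure.")

    for term, mesh_id in lexicon.items():
        for match in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            print(f"{match.group(0)!r} at {match.start()}-{match.end()} -> {mesh_id}")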
Open Targets has a module for this as part of LINK. It's not meant to be used directly, so it might require some hacking and tinkering, but it's the most complete medical NER (named entity recognition) tool I've found for Python. For more info, read their blog post.
A bash script that includes, as an example, a lexicon generated from the Disease Ontology:
https://github.com/lasigeBioTM/MER