I have a list of problems and resolutions. The current search for existing problems is just a keyword search. To improve searching for existing solutions to problems:
Classify documents based on their semantic meaning using NLP. The user enters a search term, and documents that closely match this search are displayed alongside possible solutions.
The search will be based on semantic meaning. Is this a use case for NLP?
NLP stands for Natural Language Processing, so everything that involves processing natural language can be described as NLP. This is not a strict science; it is just an umbrella term for everything related to this field. Even if you just map your text to a vector space and forget about any language and semantics, you are still processing natural language, thus NLP.
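For what it's worth, here is a minimal sketch of that "map text to a vector space and rank by closeness" idea applied to the original question (search term vs. stored problem descriptions). It assumes the sentence-transformers library and the 'all-MiniLM-L6-v2' model are available; the example strings are purely illustrative:

```python
# A minimal sketch of semantic matching between a search term and stored
# problem descriptions. Assumes the sentence-transformers package and the
# 'all-MiniLM-L6-v2' model are available; all data below is illustrative.
from sentence_transformers import SentenceTransformer, util

problems = [
    "Printer does not respond after Windows update",
    "Application crashes when opening large CSV files",
    "VPN connection drops every few minutes",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
problem_vectors = model.encode(problems, convert_to_tensor=True)

query = "spreadsheet import makes the program close unexpectedly"
query_vector = model.encode(query, convert_to_tensor=True)

# Cosine similarity ranks documents by semantic closeness, not shared keywords.
scores = util.cos_sim(query_vector, problem_vectors)[0]
for score, text in sorted(zip(scores.tolist(), problems), reverse=True):
    print(f"{score:.2f}  {text}")
```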
Related
I'm curious about applying NLP to predict/evaluate someone's level of education (or adherence to correct grammar, spelling, etc.) by analyzing text written by them.
It would be something like: f(t) = s where t is a text and s is some score which rates the grammatical correctness of that text.
Does that exist? I don't know how to search for it. If it does, I'd like some references to relevant papers or algorithms.
It does not exist. "Grammatical correctness" is a vague concept anyway, as there is no complete grammatical description of any given language. Also, we all speak and write different variations of our language, which cannot be captured by a single grammar. A language is basically the union of all the individual variants that its speakers produce.
Leaving aside these linguistic philosophy issues, there is also no formal grammar of even a single variant of a language that you could use as a benchmark. I guess the nearest thing you could do is come up with a couple of heuristics and simple rules (which I assume commercial grammar checkers use), checking for example that "reads" always occurs after a third person singular noun. If you have a sufficient number of such heuristics, you can get an idea of whether a given text is grammatical, under the definition that grammaticality is equivalent to not breaking the rules you encoded.
However, language is very flexible, and hard to capture in rules. Sometimes a sentence might sound like an error, but then in a given context it is fine. If it was easy, someone would already have done it, and primary school teachers could focus their efforts on tasks other than teaching basic grammar...
You can probably capture a number of 'mistakes' easily, but I wouldn't want to guess what coverage you would get; there will be a lot of issues you cannot capture easily.
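To make the heuristic idea above concrete, here is a toy version of a single such rule (a third person singular verb should have a singular subject). It assumes spaCy and its 'en_core_web_sm' model are installed; it is only a sketch of one heuristic, not a grammar checker:

```python
# A toy version of a single heuristic rule: a verb tagged as third person
# singular present (VBZ) should not have a plural or non-third-person subject.
# Assumes spaCy and the 'en_core_web_sm' model are installed.
import spacy

nlp = spacy.load("en_core_web_sm")

def agreement_warnings(text):
    warnings = []
    for token in nlp(text):
        if token.tag_ == "VBZ":  # e.g. "reads", "runs"
            subjects = [c for c in token.children if c.dep_ == "nsubj"]
            for subj in subjects:
                if subj.tag_ in ("NNS", "NNPS") or subj.lower_ in ("i", "you", "we", "they"):
                    warnings.append(f"'{subj.text} {token.text}' may disagree in number")
    return warnings

print(agreement_warnings("The dogs runs fast and she reads a lot."))
# A crude score s for a text t could then be something like
# 1 - len(warnings) / number_of_sentences, in the spirit of f(t) = s above.
```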
I'm trying to find the best way to compare two text documents using AI and machine learning methods. I've used TF-IDF with cosine similarity and other similarity measures, but these compare the documents at a word (or n-gram) level.
I'm looking for a method that allows me to compare the meaning of the documents. What is the best way to do that?
You should start by reading about the word2vec model.
Use gensim and load Google's pretrained word vectors.
For vectorizing a whole document, use gensim's Doc2Vec model.
After getting vectors for all your documents, use a distance metric like cosine or Euclidean distance for comparison.
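As a minimal sketch of that pipeline with gensim's Doc2Vec (the corpus and hyperparameters below are placeholders; a real corpus and tuned settings would be needed for useful vectors):

```python
# Minimal gensim Doc2Vec sketch: train on a toy corpus, then compare two
# documents with cosine similarity. Corpus and hyperparameters are placeholders.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from scipy.spatial.distance import cosine

raw_docs = [
    "the printer stopped working after the update",
    "the print job fails since the latest update",
    "the database connection times out under load",
]
tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(raw_docs)]

model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=100)

vec_a = model.infer_vector(raw_docs[0].split())
vec_b = model.infer_vector(raw_docs[1].split())

# scipy's cosine() returns a distance; similarity = 1 - distance
print("similarity:", 1 - cosine(vec_a, vec_b))
```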
This is very difficult. There is actually no computational definition of "meaning". You should dive into text mining, summarization and libraries like gensim, spacy or pattern.
In my opinion, the tools with the highest return on investment (ROI), especially if you are a newbie, are the more readily usable libraries built around chatbots: they aim to extract structured data from natural language, which is the closest thing to "meaning". One example of a free-software tool for this is Rasa NLU (natural language understanding).
The drawback of such tools is that they somewhat work, but only in the domain where they were trained and prepared to work. In particular, they do not aim at comparing documents the way you want.
I'm trying to find the best way to compare two text documents using AI
You must come up with a more precise task and from there find out which technique applies best to your use case. Do you want to classify documents into predefined categories? Do you want to compute some similarity between two documents? Given an input document, do you want to find the most similar documents in a database? Do you want to extract important topics or keywords from the document? Do you want to summarize the document? Is it an abstractive summary or key-phrase extraction?
In particular, there is no software that can extract some kind of semantic fingerprint from any document. Depending on the end goal, the way to achieve it might be completely different.
You must narrow down the precise goal you are trying to achieve; from there, you will be able to ask another question (or improve this one) that describes your goal precisely.
Text understanding is AI-complete, so just telling the computer "tell me something about these two documents" doesn't work.
Like others have said, word2vec and other word embeddings are tools to achieve many goals in NLP, but they are only a means to an end. You must define the input and output of the system you are trying to design to be able to start working on the implementation.
There are two other Stack Exchange communities that you might want to dig into:
Linguistics
Data Science
Given the TF-IDF value for each token in your corpus (or the most meaningful ones), you can compute a sparse representation for a document.
This is implemented in scikit-learn's TfidfVectorizer.
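For example (the document strings are placeholders):

```python
# Sparse TF-IDF representation of documents with scikit-learn, plus cosine
# similarity between them. Document strings are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat was sitting on a mat",
    "stock markets fell sharply today",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # sparse matrix, shape (n_docs, n_terms)

print(cosine_similarity(tfidf[0], tfidf[1]))  # word-overlap based similarity
print(cosine_similarity(tfidf[0], tfidf[2]))
```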
As other users have pointed out, this is not the best solution to your task.
You should take embeddings into account instead.
The easiest solution consists in using an embedding at the word level, such as the one provided by the FastText framework.
Then you can create an embedding for the whole document by summing together the embeddings of the individual words that compose it.
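A rough sketch of that idea using gensim's FastText implementation (toy corpus and placeholder hyperparameters; in practice a pretrained FastText model and averaging rather than summing are common choices):

```python
# Word-level FastText embeddings summed into a single document vector.
# Toy corpus and hyperparameters; a pretrained model would work better.
import numpy as np
from gensim.models import FastText

corpus = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the garden".split(),
    "interest rates rose again this quarter".split(),
]

model = FastText(sentences=corpus, vector_size=100, window=3, min_count=1, epochs=20)

def doc_vector(tokens):
    # Sum (or average) the word vectors of the tokens in the document.
    return np.sum([model.wv[t] for t in tokens], axis=0)

v1 = doc_vector(corpus[0])
v2 = doc_vector(corpus[1])
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # cosine similarity
```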
An alternative consists in training an embedding directly at the document level, using a Doc2Vec framework such as the one in gensim or DL4J.
You can also use LDA or LSI models on your text corpus. These methods (and others like word2vec and doc2vec) can summarize documents as fixed-length vectors that reflect their meaning and the topics they belong to.
read more:
https://radimrehurek.com/gensim/models/ldamodel.html
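A minimal gensim LDA sketch (toy documents and an arbitrary topic count, just to show the mechanics):

```python
# Minimal gensim LDA sketch: build a dictionary and bag-of-words corpus,
# train an LdaModel, and get a fixed-length topic vector per document.
# Documents and the number of topics are arbitrary placeholders.
from gensim import corpora
from gensim.models import LdaModel

texts = [
    "machine learning models need training data".split(),
    "deep learning is a branch of machine learning".split(),
    "the match ended with a late goal".split(),
]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=2, passes=10)

# Topic distribution of the first document, usable as a document vector.
print(lda.get_document_topics(bow_corpus[0], minimum_probability=0.0))
```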
I heard there are three approaches from Dr. Golden:
- Cosine Angular Separation
- Hamming Distance
- Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI)
These methods are based on semantic similarity.
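For the LSA/LSI approach, one common way to try it is truncated SVD on top of a TF-IDF matrix; here is a sketch with placeholder documents and an arbitrary number of components:

```python
# LSA sketch: TF-IDF matrix reduced with truncated SVD, then cosine similarity
# in the reduced "semantic" space. Documents and dimensions are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the car broke down on the highway",
    "my automobile stopped working on the road",
    "the recipe calls for two cups of flour",
]

tfidf = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)

print(cosine_similarity([lsa[0]], [lsa[1]]))  # semantically related pair
print(cosine_similarity([lsa[0]], [lsa[2]]))  # unrelated pair
```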
I also heard that some companies use a tool called spaCy to summarize documents in order to compare them with each other.
I'm a newbie in the field of machine learning and supervised learning.
My task is the following: from the name of a movie file on disk, I'd like to retrieve some metadata about the file. I have no control over how the file is named, but it contains a title and one or more pieces of additional info, like a release year, a resolution, actor names, and so on.
Currently I have developed a rule/heuristic-based system, where I split the name into tokens and try to understand what each word could represent, either alone or together with adjacent ones. For detecting people's names, for example, I'm using a dataset of English names, and I score a word as a potential person's name if I find it in the dataset. If adjacent to it there is a word that I scored as a potential surname, I score the two words as being an actor. And so on. It works with decent accuracy, but changing heuristic scores manually to "teach" the system is tedious and unpredictable.
Such a rule-based system is hard to maintain or develop further, so, out of curiosity, I was exploring the field of machine learning. What I would like to know is:
Is there some kind of public literature about these kinds of problems?
Is ML a good way to approach the problem, given the limited data set available?
How would I proceed to debug or try to understand the results of such a machine? I already have problems with the "simplistic" heuristic engine I have developed.
Thanks, any advice would be appreciated.
You need to look into NLP (natural language processing). NLP deals with text processing and other things; for example entity recognition and tagging.
Here is an example of using Spacy library: https://spacy.io/usage/linguistic-features.
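As a rough illustration of what spaCy's pretrained entity recognizer gives you out of the box on a filename-like string (the filename, the 'en_core_web_sm' model, and the quality of the results on such fragmented text are all assumptions; a custom-trained model would likely do better):

```python
# Rough illustration: run spaCy's pretrained NER over a cleaned-up movie
# filename. Assumes the 'en_core_web_sm' model is installed; results on such
# fragmented text will be noisy, and a custom-trained model would do better.
import spacy

nlp = spacy.load("en_core_web_sm")

filename = "The.Shawshank.Redemption.1994.1080p.Morgan.Freeman.mkv"
text = filename.replace(".", " ").replace("mkv", "").strip()

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. PERSON, DATE, WORK_OF_ART
```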
Some time ago I did a similar thing, you can see it here: https://github.com/Erlemar/Erlemar.github.io/blob/master/Notebooks/Fate_Zero_explore.ipynb
I am looking for an approach in NLP where I can generate a concept tree from a set of keywords.
Here is the scenario: I have extracted a set of keywords from a research paper. Now I want to arrange these keywords in the form of a tree where the most general keyword comes at the top. The next level of the tree will contain keywords that are important for understanding the upper-level concept and are more specific than the upper-level keywords. The tree will grow in the same way.
Something like this:
I know there are many resources that can help me solve this problem, like the Wikipedia dataset or WordNet, but I do not know how to proceed with them.
My preferred programming language is Python. Do you know any Python library or package which generates this?
I am also very interested in seeing a machine learning approach to solving this problem.
I will really appreciate any kind of help.
One way of looking at the problem is, given a set of documents, identify topics from them and also the dependencies between the topics.
So, for example, if you have some research papers as input (large set of documents), the output would be what topics the papers are on and how those topics are related in a hierarchy/tree. One research area that tries to tackle this is Hierarchical topic modeling and you can read more about this here and here.
But if you are just looking at creating a tree out of a bunch of keywords (that are somehow obtained) and no other information is available, then it needs knowledge of real world relationships and can perhaps be a rule-based system where we define Math --> Algebra and so on.
There is no way for a system to understand that algebra comes under math other than by looking at a large number of documents and inferring that relationship (see the first suggestion) or by manually mapping that relationship (perhaps in a rule-based system). That is how even humans learn those relationships.
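Since WordNet was mentioned as a resource, here is a small sketch of using its hypernym relations to place a keyword under progressively more general concepts (assuming NLTK and the WordNet corpus are installed; coverage for domain-specific research-paper keywords will be limited):

```python
# Sketch: use WordNet hypernym chains to put a keyword under progressively
# more general concepts. Assumes NLTK and the WordNet corpus are installed;
# coverage for technical research-paper keywords is limited.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.corpus import wordnet as wn

keyword = "algebra"
synset = wn.synsets(keyword)[0]          # first sense of the keyword

# Each hypernym path goes from the most general concept down to the keyword.
for path in synset.hypernym_paths():
    print(" -> ".join(s.lemmas()[0].name() for s in path))
```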
Let's say I want to build a search engine that goes through a text and finds sentences or paragraphs that could be turned into an image, video or 3d-animation. So sentences that contain information that could be expressed visually.
Ideally, this search engine would get better over time.
Is there already a search engine that can do that?
If not, what types of things would I need to look at/consider? My point here is that I don't really know much about machine learning and search engines. I am trying to get a feeling for which areas of machine learning, information retrieval, and so forth I would need to look at.
I don't expect long answers here, just things like "well, take a look at this type of machine learning" or "this part of information retrieval theory may be relevant".
Just to get a broad overview of what I would need to look at.
Natural Language Understanding
I don't know about any existing search engine doing that. But this can be done with the help of Natural Language Understanding and Semantic Parsing.
Have a look at Stanford's Natural Language Understanding course (discussion of the text-to-scene problem can be found here) for further details.
How semantic search works is that it analyzes the data and maps it into a vector space. Once that is done, with the help of big data and a knowledge graph, the algorithm tries to find data points that connect to the article, the authority of the author, the relevance of the website, and a couple of other factors. Once these factors are taken into account, it correlates the data to create a layer of interconnected information. Once this information is gathered, it is used to decide how relevant the data is.
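One way to read that description concretely is as a ranking function that mixes embedding-based similarity with other signals such as author authority and site relevance. The signals, weights, and data structures below are purely illustrative assumptions, not how any real engine is implemented:

```python
# Illustrative only: combine an embedding-based similarity score with other
# relevance signals (author authority, freshness) into one ranking score.
# The signals, weights, and data structures are assumptions, not a real system.
from dataclasses import dataclass

@dataclass
class Candidate:
    similarity: float   # cosine similarity between query and document vectors
    authority: float    # e.g. author/site authority in [0, 1]
    freshness: float    # e.g. recency signal in [0, 1]

def relevance(c: Candidate, w_sim=0.6, w_auth=0.25, w_fresh=0.15):
    return w_sim * c.similarity + w_auth * c.authority + w_fresh * c.freshness

results = [Candidate(0.82, 0.4, 0.9), Candidate(0.65, 0.9, 0.2)]
for c in sorted(results, key=relevance, reverse=True):
    print(round(relevance(c), 3), c)
```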