I need to detect whether a given sentence has interrogative characteristics. Is there any available gem (Ruby on Rails) that implements this with NLP libraries, or any current state-of-the-art implementation?
In general, NLP code is written in Python or Java.
For this problem, though, I think you could hack together something basic: does the sentence start with a W-word (or 'Are', 'Is', etc.) and end with a question mark? More advanced: take a few thousand question sentences, create features similar to the quick hack (i.e. first word, final character), and train a machine learning model on them. You'd probably do this quickest in Python, and then you could write the model interpreter in Ruby (the interpreter is the easy part).
Or you could just write a simple Naive Bayes classifier!
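A minimal sketch of that quick hack in Python (the starter-word list below is my own and far from exhaustive):

```python
# Heuristic question detector: surface features only, no NLP library.
# The starter-word set is illustrative, not exhaustive.

QUESTION_STARTERS = {
    "who", "what", "when", "where", "why", "which", "whose", "how",
    "is", "are", "do", "does", "did", "can", "could", "will", "would",
}

def looks_interrogative(sentence):
    """Return True if the sentence has surface features of a question."""
    stripped = sentence.strip()
    if not stripped:
        return False
    if stripped.endswith("?"):
        return True
    first_word = stripped.split()[0].lower().strip(".,!")
    return first_word in QUESTION_STARTERS
```

This would be trivial to port to Ruby, and the same two features (first word, final character) are exactly what you'd feed a learned classifier instead.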
Related
I am trying to finetune gpt2 for a generative question answering task.
Basically I have my data in a format similar to:
Context : Matt wrecked his car today.
Question: How was Matt's day?
Answer: Bad
I was looking on the huggingface documentation to find out how I can finetune GPT2 on a custom dataset and I did find the instructions on finetuning at this address:
https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling
The issue is that they do not provide any guidance on how your data should be prepared so that the model can learn from it. They give different datasets that they have available, but none is in a format that fits my task well.
I would really appreciate if someone with more experience could help me.
Have a nice day!
Your task right now is ambiguous; it could be any of:
QnA via Classification (answer is categorical)
QnA via Extraction (answer is in the text)
QnA via Language Modeling (answer can be anything)
Classification
If all your examples have Answer: X, where X is categorical (i.e. always drawn from a fixed set such as "Good", "Bad", etc.), you can do classification.
In this setup, you'd have text-label pairs:
Text
Context: Matt wrecked his car today.
Question: How was Matt's day?
Label
Bad
For classification, you're probably better off just fine-tuning a BERT-style model (something like RoBERTa).
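As a sketch of this setup (the dict format and field names are my own assumptions, not any specific library's API), you could turn each example into a text-label pair like so:

```python
# Hypothetical data-prep sketch: turn (context, question, answer) triples
# into text-label pairs for a classifier. Field names are illustrative.

def to_classification_example(context, question, answer, label_map):
    """Concatenate context and question into one text; map the answer
    to an integer label, as most classification heads expect."""
    text = f"Context: {context}\nQuestion: {question}"
    return {"text": text, "label": label_map[answer]}

label_map = {"Good": 0, "Bad": 1}
example = to_classification_example(
    "Matt wrecked his car today.", "How was Matt's day?", "Bad", label_map
)
```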
Extraction
If all your examples have Answer: X, where X is a word (or a span of consecutive words) in the text, then it's probably best to do SQuAD-style fine-tuning with a BERT-style model. In this setup, your input is (basically) a set of text, start_pos, end_pos triplets:
Text
Context: In early 2012, NFL Commissioner Roger Goodell stated that the league planned to make the 50th Super Bowl "spectacular" and that it would be "an important game for us as a league".
Question: Who was the NFL Commissioner in early 2012?
Start Position, End Position
6, 8
Note: the start/end position values are, of course, positions of tokens, so these values will depend on how you tokenize your inputs.
In this setup, you're also better off using a BERT-style model. In fact, there are already models on huggingface hub trained on SQuAD (and similar datasets). They should already be good at these tasks out of the box (but you can always fine-tune on top of this).
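To make the note about token positions concrete, here is a rough sketch (naive whitespace tokenization, function name is my own) that maps a character-level answer span to token-level start/end positions. A real setup would use the model's own tokenizer, which shifts positions with subword splitting:

```python
# Map an answer string to (start_token, end_token) positions under
# naive whitespace tokenization. Only the first occurrence is used.

def char_span_to_token_span(text, answer):
    """Return (start_token, end_token) of `answer` in `text`,
    end exclusive; None if the answer is not found."""
    char_start = text.find(answer)
    if char_start == -1:
        return None
    # Record the character span of each whitespace token.
    spans, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        spans.append((start, start + len(tok)))
        pos = start + len(tok)
    char_end = char_start + len(answer)
    start_tok = next(i for i, (s, e) in enumerate(spans) if s <= char_start < e)
    end_tok = next(i for i, (s, e) in enumerate(spans) if s < char_end <= e) + 1
    return start_tok, end_tok
```

Run on the SQuAD example above ("Context: In early 2012, NFL Commissioner Roger Goodell stated ..."), this yields (6, 8) for "Roger Goodell", matching the triplet shown.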
Language Modeling
If all your examples have Answer: X, where X can basically be anything (it need not be contained in the text, and it is not categorical), then you'd need to do language modeling.
In this setup, you have to use a GPT-style model, and your input would just be the whole text as is:
Context: Matt wrecked his car today.
Question: How was Matt's day?
Answer: Bad
There is no need for labels, since the text itself is the label (we're asking the model to predict the next word, for each word). Larger models like GPT-3 and https://cohere.com (full disclosure: I work at Cohere) should be good at these tasks without any fine-tuning, if you give them the right prompt + examples, but of course these are accessed behind APIs. These platforms also allow you to fine-tune models (via language modeling), so you don't need to run any code yourself. I'm not sure how much mileage you'll get from fine-tuning a smaller model like GPT-2. If this project is for learning, then yeah, definitely go ahead and fine-tune a GPT-2 model! But if performance is key, I highly recommend using a solution like https://cohere.com, which will just work out of the box.
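As a rough sketch of data preparation for this language-modeling setup (the "Context/Question/Answer" template is an assumption on my part; only the end-of-text marker is GPT-2's actual EOS token), each example could be serialized into one training string:

```python
# Serialize an example into a single training string for causal LM
# fine-tuning. The template is illustrative; "<|endoftext|>" is GPT-2's
# end-of-text token.

def to_lm_text(context, question, answer, eos="<|endoftext|>"):
    return (
        f"Context: {context}\n"
        f"Question: {question}\n"
        f"Answer: {answer}{eos}"
    )

line = to_lm_text("Matt wrecked his car today.", "How was Matt's day?", "Bad")
```

Writing one such string per example to a plain text file should give you something the linked language-modeling scripts can consume (they accept a plain-text --train_file). At inference time you'd feed everything up to "Answer:" and let the model complete.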
I'm looking for test datasets to optimize my Word2Vec model. I have found a good one from gensim:
gensim/test/test_data/questions-words.txt
Does anyone know other similar datasets?
Thank you!
It is important to note that there isn't really a "ground truth" for word-vectors. There are interesting tasks you can do with them, and some arrangements of word-vectors will be better on a specific task than others.
But also, the word-vectors that are best on one task – such as analogy-solving in the style of the questions-words.txt problems – might not be best on another important task – like say modeling texts for classification or info-retrieval.
That said, you can make your own test data in the same format as questions-words.txt. Google's original word2vec.c release, which also included a tool for statistically combining nearby words into multi-word phrases, also included a questions-phrases.txt file, in the same format, that can be used to test word-vectors that have been similarly constructed for 'words' that are actually short multiple-word phrases.
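To make the analogy-solving task concrete, here's a toy sketch of how a questions-words.txt line "a b c d" is scored: predict d as the vocabulary word nearest to b - a + c by cosine similarity. The tiny 2-d vectors below are made up purely for illustration:

```python
import numpy as np

# Made-up toy vectors, for illustration only.
vectors = {
    "king":  np.array([0.9, 0.1]),
    "queen": np.array([0.9, 0.9]),
    "man":   np.array([0.1, 0.1]),
    "woman": np.array([0.1, 0.9]),
    "apple": np.array([0.5, 0.2]),
}

def solve_analogy(a, b, c, vectors):
    """Return the word (excluding a, b, c) closest to b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]

    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], target))
```

An evaluation over a questions-words.txt file is then just the fraction of lines where solve_analogy returns the fourth word.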
The Python gensim word-vectors support includes an extra method, evaluate_word_pairs() for checking word-vectors not on analogy-solving but on conformance to collections of human-determined word-similarity-rankings. The documentation for that method includes a link to an appropriate test-set for that method, SimLex-999, and you may be able to find other test sets of the same format elsewhere.
But, again, none of these should be considered the absolute test of word-vectors' overall quality. The best test, for your particular project's use of word-vectors, would be some repeatable domain-specific evaluation score you devise yourself, that's inherently correlated to your end goals.
Let's say Message1 = "your bill of amount 121.0 is due on 15 Feb.", and similarly Message2 = "bill amt 234.0 due on 11 Jun", and so on. I want to extract the bill amount and due date from similar messages. One way is to write a regular expression for every possible format, but that won't handle new formats.
What is the Machine Learning approach to solve this? How do I train a model and use it to extract amount, due date from newer messages?
To better answer your question, I need to know how the training data will be provided. Will you get a label for each training example? Do you want to use any advanced techniques involving deep neural networks?
For example, if you want to use sequence labeling, you can refer to Supervised Sequence Labelling with Recurrent Neural Networks by Alex Graves, chapter 2, for more details. For your task, though, I think you can try a simpler approach first.
For example, a pattern-mining or template-based approach should help you in this regard. Besides that, parsing techniques, e.g., dependency parsing, can help in this context. See the difference between dependency parsing and constituency parsing.
Finally, you can also consider well-known information extraction techniques in this scenario. See the usage of NLTK for this.
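As a concrete starting point for the template-based approach (the patterns below are illustrative and cover only the two example formats, which is exactly the limitation the question points out):

```python
import re

# Template-based extraction for the example messages. Each new message
# format would need its own pattern.
AMOUNT_RE = re.compile(r"(?:amount|amt)\s*:?\s*(\d+(?:\.\d+)?)", re.I)
DUE_RE = re.compile(r"due\s+on\s+(\d{1,2}\s+[A-Za-z]{3})", re.I)

def extract_bill_fields(message):
    """Return the bill amount and due date, or None where not found."""
    amount = AMOUNT_RE.search(message)
    due = DUE_RE.search(message)
    return {
        "amount": float(amount.group(1)) if amount else None,
        "due_date": due.group(1) if due else None,
    }
```

The machine-learning upgrade of this is sequence labeling: tag each token as AMOUNT, DUE_DATE, or OTHER, and train a model on labeled messages so it generalizes to formats the templates miss.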
I'm trying to use an HMM to do named entity recognition, but I found that most of the sentences containing the entities are very structured. For example:
"What's Apple's price today?" Then instead of teaching the model each word within the sentence, can I teach it to learn the structure of the sentence? For example, that every word after "What's" or "What is" should be the name of a kind of fruit?
Thanks!
Instead of using an HMM, consider using a conditional random field. They are very similar to HMMs, but are the discriminative version (in Ng and Jordan's terminology, HMMs and Linear Chain CRFs form a generative/discriminative pair).
The benefits of doing this are that you can define features of your word observation which are the POS tag of the current word, the POS tag of the previous word(s), etc, without making independence assumptions about these features. This would allow you to incorporate structural and lexical features into the same decision framework.
Edit: Here's the original paper. Here's a very comprehensive tutorial.
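As an illustration of the kind of features described above (the feature names and dict shape are my own; libraries like sklearn-crfsuite accept per-token feature dicts in roughly this form):

```python
# Sketch of a feature function for a linear-chain CRF, assuming each
# token comes with a POS tag. No independence assumptions are needed
# between these features, which is the CRF's advantage over an HMM.

def token_features(tokens, tags, i):
    """Features for token i: current word/POS plus previous word/POS."""
    feats = {
        "word": tokens[i].lower(),
        "pos": tags[i],
    }
    if i > 0:
        feats["prev_word"] = tokens[i - 1].lower()
        feats["prev_pos"] = tags[i - 1]
    else:
        feats["BOS"] = True  # beginning of sentence
    return feats

tokens = ["What's", "Apple's", "price", "today", "?"]
tags = ["WP", "NNP", "NN", "NN", "."]
feats = token_features(tokens, tags, 1)
```

Here the feature prev_word="what's" directly captures the structural cue from the question ("the word after What's"), alongside lexical features of the word itself.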
You could begin exploring that structure with something as simple as n-grams, or try something richer like grammar induction.
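For instance, a minimal n-gram-style sketch of that idea (the toy sentences are made up for illustration): counting which words follow a fixed prefix already exposes the slot structure.

```python
from collections import Counter

def words_after(prefix, sentences):
    """Count the words that immediately follow `prefix` in a corpus."""
    counts = Counter()
    for s in sentences:
        tokens = s.split()
        for i, tok in enumerate(tokens[:-1]):
            if tok == prefix:
                counts[tokens[i + 1]] += 1
    return counts

sentences = [
    "What's Apple's price today ?",
    "What's Banana's price today ?",
    "Show me Apple's chart",
]
counts = words_after("What's", sentences)
```

With enough data, the distribution of fillers after "What's" is exactly the structural regularity the question describes; grammar induction generalizes this beyond fixed prefixes.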
I'm not sure what's the best algorithm to use for classifying relationships between words. For example, in a sentence such as "The yellow sun" there is a relationship between "yellow" and "sun". The machine learning techniques I have considered so far are Bayesian statistics, rough sets, fuzzy logic, hidden Markov models, and artificial neural networks.
Any suggestions please?
thank you :)
It kind of sounds like you're looking for a dependency parser. Such a parser will give you the relationship between any word in a sentence and its semantic or syntactic head.
The MSTParser uses an online max-margin technique known as MIRA to classify the relationships between words. The MaltParser package does the same but uses SVMs to make parsing decisions. Both systems are trainable and provide similar classification and attachment performance, see table 1 here.
Like the user dmcer pointed out, dependency parsers will help you. There is tons of literature on dependency parsing you can read. This book and these lecture notes are good starting points to introduce the conventional methods.
The Link Grammar Parser, which is somewhat like a dependency parser, uses Sleator and Temperley's Link Grammar syntax to produce word-word linkages. You can find more information on the original Link Grammar page and on the more recent AbiWord page (AbiWord maintains the implementation now).
For an unconventional approach to dependency parsing, you can read this paper that models word-word relationships analogous to subatomic particle interactions in chemistry/physics.
The Stanford Parser does exactly what you want. There's even an online demo. Here's the results for your example.
Your sentence
The yellow sun.
Tagging
The/DT yellow/JJ sun/NN ./.
Parse
(ROOT
(NP (DT The) (JJ yellow) (NN sun) (. .)))
Typed dependencies
det(sun-3, The-1)
amod(sun-3, yellow-2)
Typed dependencies, collapsed
det(sun-3, The-1)
amod(sun-3, yellow-2)
From your question it sounds like you're interested in the typed dependencies.
Well, no one knows what the best algorithm for language processing is, because the problem hasn't been solved: to fully understand a human language is to create a full AI.
However, there have of course been attempts to process natural languages, and these might be good starting points for this sort of thing:
X-Bar Theory
Phrase Structure Rules
Noam Chomsky did a lot of foundational work on the structure of natural language, so I'd recommend looking up some of his work.