NLP Text classification Based on User comments - machine-learning

I am new to the machine learning and wanted to work on this problem statement.
I have got some of the user comments about products and based on those comments, my model should summarize and give me the output for that text sentence.
Example :-
User commented "Device Battery is heating up", based on this comment my model should summarize this to "Battery issue".
User commented "Cracked Screen", based on this comment my model should summarize this to "Display issue".
Can anyone suggest me which model should be the best fit for my problem statement or any model code samples would be really helpful.
I have tried with TF-IDF, and MB Naive bayes classifier but those are not helpful. I think topic modelling can help me here.

this sounds like a problem for an encoder-decoder DNN. You can translate the words e.g. with word2vec to vectors and feed the sentences into an encoder-model. This will give you another vector for a second model (decoder) which gives you the final classification.
For reference take a look at the deep learning course by Andrew Ng on Coursera (Sequence models).

Related

Fine tuning GPT2 for generative question anwering

I am trying to finetune gpt2 for a generative question answering task.
Basically I have my data in a format similar to:
Context : Matt wrecked his car today.
Question: How was Matt's day?
Answer: Bad
I was looking on the huggingface documentation to find out how I can finetune GPT2 on a custom dataset and I did find the instructions on finetuning at this address:
https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling
The issue is that they do not provide any guidance on how your data should be prepared so that the model can learn from it. They give different datasets that they have available, but none is in a format that fits my task well.
I would really appreciate if someone with more experience could help me.
Have a nice day!
Your task right now is ambiguous, it could be any of:
QnA via Classification (answer is categorical)
QnA via Extraction (answer is in the text)
QnA via Language Modeling (answer can be anything)
Classification
If all you're examples have Answer: X, where X is categorical (i.e. always "Good", "Bad", etc ...), you can do classification.
In this setup, you'd would have text-label pairs:
Text
Context: Matt wrecked his car today.
Question: How was Matt's day?
Label
Bad
For classification, you're probably better off just fine-tuning a BERT style model (something like RoBERTTa).
Extraction
If all you're examples have Answer: X, where X is a word (or consecutive words) in the text (for example), then it's probably best to do a SQuAD-style fine-tuning with a BERT-style model. In this setup, you're input is (basically) text, start_pos, end_pos triplets:
Text
Context: In early 2012, NFL Commissioner Roger Goodell stated that the league planned to make the 50th Super Bowl "spectacular" and that it would be "an important game for us as a league".
Question: Who was the NFL Commissioner in early 2012?
Start Position, End Position
6, 8
Note: The start/end position values of course positions of tokens, so these values will depend on how you tokenize your inputs
In this setup, you're also better off using a BERT-style model. In fact, there are already models on huggingface hub trained on SQuAD (and similar datasets). They should already be good at these tasks out of the box (but you can always fine-tune on top of this).
Language Modeling
If all you're examples have Answer: X, where X can basically be anything (it need not be contained in the text, and is not categorical), then you'd need to do language modeling.
In this setup, you have to use a GPT-style model, and your input would just be the whole text as is:
Context: Matt wrecked his car today.
Question: How was Matt's day?
Answer: Bad
There is no need for labels, since the text itself is the label (we're asking the model to predict the next word, for each word). Larger models like GPT-3 and https://cohere.com (full disclosure, I work at Cohere) should be good at these tasks without any finetuning (if you give it the right prompt + examples), but of course, these are accessed behind APIs. These platforms also allow you to fine-tune models (via language modeling), so you don't need to run any code yourself. Not sure how much mileage you'll get with finetuning a smaller model like GPT-2. If this project is for learning, then yeah, definitely go ahead and fine-tune a GPT-2 model! But if performance is key, I highly recommend using a solution like https://cohere.com, which will just work out of the box.

Finding contradictory semantic sentences through natural language processing

I'm working on a project that aims to find conflicting Semantic Sentences (NLP - Semantic Search )
For example
Our text is: "I ate today. The lunch was very tasty. I was an honest guest."
Query: "I had lunch with my friend"
Do we want to give the query model and find the meaning of the sentences with a certain point in terms of synonyms and antonyms?
The solution that came to my mind was to first find the synonymous sentences and extract the key words from the synonymous sentences and then get the semantic opposite words and then find the semantic synonymous sentences based on these opposite words.
Do you think this idea is possible? If you have a solution or experience in this area, please reply
Thanks
You have not mentioned the exact use case for your problem so I am not sure if the solution I know will help your cause. But there is an approach in NLP (using Deep learning) which helps to find whether two sentences are correlated, unrelated or contradictory.
Below is the information about the pretrained model which is trained specifically for this task ->
https://huggingface.co/facebook/bart-large-mnli
The dataset on which the above model is trained is given here ->
https://huggingface.co/datasets/glue/viewer/mnli/train
You can check the dataset to verify if your use case is related to the classification task performed on the dataset.
Since the model is already pretrained, you do not need to perform any training and can jump straight to evaluation. Once you can somewhat satisfied with the results, you can fine tune the model a bit for your specific problem.
We can talk in comments if you need more clarification.

Training and Testing Data set for classification text file

Suppose we have 10000 text file and We would like to classify as political ,health,weather,sports,Science ,Education,.........
I need training data set for classification of text documents and I am Naive Bayes classification Algorithm. Anyone can help to get data sets .
OR
Is there any another way to get classification done..I am new at Machine Learning Please explain your answer completely.
Example:
**Sentence** **Output**
1) Obama won election. ----------------------------------------------->political
2) India won by 10 wickets ---------------------------------------------->sports
3) Tobacco is more dangerous --------------------------------------------->Health
4) Newtons laws of motion can be applied to car -------------->science
Any way to classify these sentences into their respective categories
Have you tried to google it? There are tons and tons of datasets for text categorization. The classical one is Reuters-21578 (https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection), another famous one and mentioned almost in each ML book is 20 newsgroup: http://web.ist.utl.pt/acardoso/datasets/
But there are lots of other, one google query away from you. Just load them, slightly adjust if needed and train your classifier on that datasets.

Unsupervised Sentiment Analysis

I've been reading a lot of articles that explain the need for an initial set of texts that are classified as either 'positive' or 'negative' before a sentiment analysis system will really work.
My question is: Has anyone attempted just doing a rudimentary check of 'positive' adjectives vs 'negative' adjectives, taking into account any simple negators to avoid classing 'not happy' as positive? If so, are there any articles that discuss just why this strategy isn't realistic?
A classic paper by Peter Turney (2002) explains a method to do unsupervised sentiment analysis (positive/negative classification) using only the words excellent and poor as a seed set. Turney uses the mutual information of other words with these two adjectives to achieve an accuracy of 74%.
I haven't tried doing untrained sentiment analysis such as you are describing, but off the top of my head I'd say you're oversimplifying the problem. Simply analyzing adjectives is not enough to get a good grasp of the sentiment of a text; for example, consider the word 'stupid.' Alone, you would classify that as negative, but if a product review were to have '... [x] product makes their competitors look stupid for not thinking of this feature first...' then the sentiment in there would definitely be positive. The greater context in which words appear definitely matters in something like this. This is why an untrained bag-of-words approach alone (let alone an even more limited bag-of-adjectives) is not enough to tackle this problem adequately.
The pre-classified data ('training data') helps in that the problem shifts from trying to determine whether a text is of positive or negative sentiment from scratch, to trying to determine if the text is more similar to positive texts or negative texts, and classify it that way. The other big point is that textual analyses such as sentiment analysis are often affected greatly by the differences of the characteristics of texts depending on domain. This is why having a good set of data to train on (that is, accurate data from within the domain in which you are working, and is hopefully representative of the texts you are going to have to classify) is as important as building a good system to classify with.
Not exactly an article, but hope that helps.
The paper of Turney (2002) mentioned by larsmans is a good basic one. In a newer research, Li and He [2009] introduce an approach using Latent Dirichlet Allocation (LDA) to train a model that can classify an article's overall sentiment and topic simultaneously in a totally unsupervised manner. The accuracy they achieve is 84.6%.
I tried several methods of Sentiment Analysis for opinion mining in Reviews.
What worked the best for me is the method described in Liu book: http://www.cs.uic.edu/~liub/WebMiningBook.html In this Book Liu and others, compared many strategies and discussed different papers on Sentiment Analysis and Opinion Mining.
Although my main goal was to extract features in the opinions, I implemented a sentiment classifier to detect positive and negative classification of this features.
I used NLTK for the pre-processing (Word tokenization, POS tagging) and the trigrams creation. Then also I used the Bayesian Classifiers inside this tookit to compare with other strategies Liu was pinpointing.
One of the methods relies on tagging as pos/neg every trigrram expressing this information, and using some classifier on this data.
Other method I tried, and worked better (around 85% accuracy in my dataset), was calculating the sum of scores of PMI (punctual mutual information) for every word in the sentence and the words excellent/poor as seeds of pos/neg class.
I tried spotting keywords using a dictionary of affect to predict the sentiment label at sentence level. Given the generality of the vocabulary (non domain dependent), the results were just about 61%. The paper is available in my homepage.
In a somewhat improved version, negation adverbs were considered. The whole system, named EmoLib, is available for demo:
http://dtminredis.housing.salle.url.edu:8080/EmoLib/
Regards,
David,
I'm not sure if this helps but you may want to look into Jacob Perkin's blog post on using NLTK for sentiment analysis.
There are no magic "shortcuts" in sentiment analysis, as with any other sort of text analysis that seeks to discover the underlying "aboutness," of a chunk of text. Attempting to short cut proven text analysis methods through simplistic "adjective" checking or similar approaches leads to ambiguity, incorrect classification, etc., that at the end of the day give you a poor accuracy read on sentiment. The more terse the source (e.g. Twitter), the more difficult the problem.

Clarification How CRF(Conditional random Field) works using examples

I read different documents how CRF(conditional random field) works but all the papers puts the formula only. Is there any one who can send me a paper that describes about CRF with examples like if we have a sentence
"Mr.Smith was born in New York. He has been working for the last 20 years in Microsoft company."
if the above sentence is given as an input to train, how does the Model works during the training taking in to consideration for the formula for CRF?
Smith is tagged as "PER" New York is as "LOC" Microsoft Company as "ORG".
Moges.A
Here is a link to a set of slides made by Shasha Rush, a PhD student who is currently working on NLP at Google. One of the reasons I really like the slides is because they contain concrete examples and walk you through executions of important algorithms.
It is not a paper, but there is available whole online free course on probabilistic graphical models -- CRF is one of them.
It is very definitive and you'll get an intuitive level of understanding after completing it.
I don't think somebody will write such tutorial. You can check HMM tutorial which is easier to understand and can be explained by example. The problem with CRF is that it is global optimization with many dependencies, so it is very hard to show step by step how we optimize parameters and how we predict labels. But the idea is very simple - maximization of dependency(clique) graph using sparsity...

Resources