Weka POS tagging + tokenization - machine-learning

I'm new to Weka and am trying to classify movie reviews by sentiment. I understand the StringToWordVector filter, which tokenizes the text and turns word occurrences into attributes. I also want to add the part-of-speech (POS) tags to the attribute vocabulary, but I am stuck on how to do it.
Has anyone tried this before?
Please, can you guide me?
P.S. I am using OpenNLP for POS tagging and Weka's J48 classifier.

Trial and error approach:
Write the POS-tagged data to a text file and train word2vec on it. Then check the distance between a word and each POS tag; the nearest tag should be its POS.
One problem is that adjacent tags might be at the same distance.
Alternatively, you could run a regex over the tagged text afterwards; it is definitely worth a try.
But try the first approach first, and do share the results! :)
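A simpler alternative to the word2vec idea, which the answer itself does not propose: fuse each token with its tag (for example `good_JJ`) before building the bag-of-words vocabulary, so the tags become part of the attributes. The sketch below is plain Python rather than Weka or OpenNLP; the tagged input, the Penn Treebank style tags, and the helper names are all assumptions made for illustration.

```python
from collections import Counter

def fuse(tagged_tokens):
    """Turn (word, tag) pairs into 'word_TAG' strings."""
    return [f"{word.lower()}_{tag}" for word, tag in tagged_tokens]

def build_vocabulary(documents):
    """Map every fused token seen in the corpus to a column index."""
    vocab = sorted({tok for doc in documents for tok in fuse(doc)})
    return {tok: i for i, tok in enumerate(vocab)}

def to_count_vector(doc, vocab):
    """Count occurrences of each vocabulary entry in one document."""
    counts = Counter(fuse(doc))
    # Dict keys are iterated in insertion order, which matches the column indices.
    return [counts.get(tok, 0) for tok in vocab]

# Hypothetical POS tagger output; real tags would come from your tagger.
reviews = [
    [("A", "DT"), ("good", "JJ"), ("movie", "NN")],
    [("I", "PRP"), ("like", "VBP"), ("it", "PRP")],
    [("Looks", "VBZ"), ("like", "IN"), ("a", "DT"), ("good", "JJ"), ("movie", "NN")],
]
vocab = build_vocabulary(reviews)
vectors = [to_count_vector(doc, vocab) for doc in reviews]
```

Here "like" as a verb and "like" as a preposition end up as two separate attributes. In Weka, writing the fused tokens back into the review text before applying StringToWordVector should have a similar effect, provided the tokenizer does not split on the joining character.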

Related

Finding contradictory semantic sentences through natural language processing

I'm working on a project that aims to find semantically conflicting sentences (NLP, semantic search).
For example
Our text is: "I ate today. The lunch was very tasty. I was an honest guest."
Query: "I had lunch with my friend"
We want to give the query to a model and compare its meaning with the sentences of the text, in terms of synonyms and antonyms.
The solution that came to my mind is to first find the sentences synonymous with the query, extract the keywords from them, look up the semantic opposites of those keywords, and then find the sentences that are semantically close to those opposites.
Do you think this idea is feasible? If you have a solution or experience in this area, please reply.
Thanks
You have not mentioned the exact use case, so I am not sure whether the solution I know will help. But there is a deep learning approach in NLP (natural language inference) that determines whether two sentences are correlated (entailment), unrelated (neutral), or contradictory.
Below is a pretrained model that was trained specifically for this task:
https://huggingface.co/facebook/bart-large-mnli
The dataset on which the model was trained is here:
https://huggingface.co/datasets/glue/viewer/mnli/train
You can check the dataset to verify if your use case is related to the classification task performed on the dataset.
Since the model is already pretrained, you do not need to perform any training and can jump straight to evaluation. Once you are somewhat satisfied with the results, you can fine-tune the model a bit for your specific problem.
We can talk in comments if you need more clarification.

Does summing up word embedding vectors in ML destroy their meaning?

For example, I have a paragraph which I want to classify in a binary manner. But because the inputs have to have a fixed length, I need to ensure that every paragraph is represented by a uniform quantity.
One thing I've done is take every word in the paragraph, vectorize it using GloVe, and then sum all of the vectors to create a "paragraph" vector, which I then feed in as the input to my model. In doing so, have I destroyed any meaning the words might have possessed? After all, these two sentences would have the same vector:
"My dog bit Dave" and "Dave bit my dog". How do I get around this? Am I approaching this wrong?
What other way can I train my model? If I take every word and feed that into my model, how do I know how many words I should take? How do I input these words? In the form of a 2D array, where each word vector is a column?
I want to be able to train a model that can classify text accurately.
Surprisingly, I'm getting a high score (>90%) with a relatively simple model like RandomForestClassifier just by using this summing method. Any insights?
Edit: One suggestion I have received is to instead featurize my data as a 2D array where each word is a column, on which a CNN could work. Another suggestion I received was to use transfer learning through the huggingface transformer to get a vector for the whole paragraph. Which one is more feasible?
I want to be able to train a model that can classify text accurately. Surprisingly, I'm getting a high score (>90%) with a relatively simple model like RandomForestClassifier just by using this summing method. Any insights?
If you look up papers on aggregating word embeddings, you'll find that this does in fact happen, especially when the texts are short.
What other way can I train my model? If I take every word and feed that into my model, how do I know how many words I should take? How do I input these words? In the form of a 2D array, where each word vector is a column?
Have you tried keyword extraction? It can alleviate some of the problems with averaging.
In doing so, have I destroyed any meaning the words might have possessed?
As you remarked, you throw out information on word order. But that's not even the worst part: most of the time, for longer documents, if you embed everything the mean gets dominated by common words ("how", "like", "do", etc.). BTW, see my answer to this question.
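The loss of word order is easy to check with a small sketch. The 3-dimensional vectors below are made up for illustration (real GloVe vectors would be loaded from a file), and `sum_embedding` is a hypothetical helper name.

```python
import math

# Toy "embeddings", invented for this example only.
toy_vectors = {
    "my":   [0.1, 0.3, -0.2],
    "dog":  [0.7, -0.1, 0.4],
    "bit":  [-0.5, 0.2, 0.9],
    "dave": [0.3, 0.8, -0.6],
}

def sum_embedding(sentence, vectors):
    """Sum the vectors of all words in the sentence, component by component."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in sentence.lower().split():
        for i, value in enumerate(vectors[word]):
            total[i] += value
    return total

a = sum_embedding("My dog bit Dave", toy_vectors)
b = sum_embedding("Dave bit my dog", toy_vectors)

# Compare with a tolerance, since floating-point addition order can differ slightly.
same = all(math.isclose(x, y, abs_tol=1e-9) for x, y in zip(a, b))
```

Because addition is commutative, `same` is true for any reordering of the same words, so a model fed only the sum cannot tell the two sentences apart.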
Other than that, one trick I've seen is to average the word vectors but subtract the first principal component of a PCA on the embedding matrix. For details, see for example this repo, which also links to the paper (BTW, this paper suggests you can ignore the "Smooth Inverse Frequency" part, since the principal component removal does the useful work).
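A minimal sketch of that trick, under my reading of it: average the word vectors per paragraph, estimate the dominant direction of the resulting matrix (uncentered, by power iteration), and subtract each paragraph vector's projection onto it. The toy data and all function names are assumptions; with real data a library SVD or PCA routine would be the usual choice.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def first_component(rows, iterations=100):
    """Approximate the dominant right singular vector of `rows` by power iteration."""
    dim = len(rows[0])
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        # Apply X^T X to u without forming the matrix: sum over rows of (r.u) * r.
        nxt = [0.0] * dim
        for r in rows:
            w = dot(r, u)
            for i in range(dim):
                nxt[i] += w * r[i]
        norm = math.sqrt(dot(nxt, nxt))
        if norm == 0.0:
            break
        u = [x / norm for x in nxt]
    return u

def remove_component(rows, u):
    """Subtract from every row its projection onto the unit vector u."""
    return [[x - dot(r, u) * ui for x, ui in zip(r, u)] for r in rows]

# Made-up word vectors, grouped by paragraph.
paragraphs = [
    [[1.0, 0.2, 0.0], [0.8, 0.1, 0.3]],
    [[0.9, -0.4, 0.1], [1.1, 0.0, -0.2], [0.7, 0.5, 0.2]],
    [[1.2, 0.3, -0.5], [0.6, -0.2, 0.4]],
]
averaged = [mean_vector(p) for p in paragraphs]
u = first_component(averaged)
cleaned = remove_component(averaged, u)
```

After the subtraction every paragraph vector is orthogonal to `u`, which removes the shared direction that common words tend to contribute.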

Training a Text Detection System

I'm currently developing a system that detects text in a given image using logistic regression, and I need training data like the image below:
The first column shows positive examples of text (y=1), whereas the second column shows images without text (y=0).
Where can I get a labeled dataset of this kind?
Thanks in advance.
A good place to start for these sorts of things is the UC Irvine Machine Learning Repository:
http://archive.ics.uci.edu/ml/
You might also consider heading over to Cross Validated for machine learning-related questions:
https://stats.stackexchange.com/
You can get a similar dataset here.
Hope it helps.

Text Feature Representation As Vectors for SVM

I am learning the Semantic Role Labeling (SRL) task. I have read a lot, and I am now stuck on how to represent text features as vectors.
For example, for the sentence:
We like StackOverflow very much
given the predicate verb: like, a few features are:
the left 1st word: We
the right 1st word: StackOverflow
the POS tag of the left 1st word: Pronoun
the POS tag of the right 1st word: Noun
What is the right way to represent these features as vectors?
If possible, could you also give me some guidance on how to normalize these features?
I basically want to train SVM models on data with these types of features.
It doesn't matter which classifier you use (SVM or not); the feature generation for text is the same.
I suggest you take a look at this:
Binary Feature Extraction
Also this library would make your life much easier:
http://cogcomp.cs.illinois.edu/page/software_view/LBJ
A tutorial is here: http://cogcomp.cs.illinois.edu/page/tutorial.201310
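As a rough illustration of binary feature extraction for categorical features like these, each distinct (feature name, value) pair can get its own 0/1 column. The sketch below is plain Python and not the LBJ library; the sample feature values and the function names are hypothetical.

```python
def build_feature_index(samples):
    """Assign one column to every distinct (feature name, value) pair."""
    pairs = sorted({(name, value) for s in samples for name, value in s.items()})
    return {pair: col for col, pair in enumerate(pairs)}

def to_binary_vector(sample, index):
    """Set a 1 in the column of each (name, value) pair present in the sample."""
    vec = [0] * len(index)
    for pair in sample.items():
        col = index.get(pair)  # pairs never seen in training are ignored
        if col is not None:
            vec[col] = 1
    return vec

# Hypothetical context features around a predicate verb.
samples = [
    {"left_word": "They", "right_word": "Python", "left_pos": "Pronoun"},
    {"left_word": "Dogs", "right_word": "cats", "left_pos": "Noun"},
    {"left_word": "They", "right_word": "cats", "left_pos": "Pronoun"},
]
index = build_feature_index(samples)
vectors = [to_binary_vector(s, index) for s in samples]
```

Since every column is already 0 or 1, such vectors usually need no further normalization before being passed to an SVM, although that is worth checking against your own data.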

NLP text tagging

I am a newbie in NLP and am doing this for the first time.
I am trying to solve a problem.
I have some documents that are manually tagged like this:
doc1 - categoryA, categoryB
doc2 - categoryA, categoryC
doc3 - categoryE, categoryF, categoryG
.
.
.
.
docN - categoryX
Here I have a fixed set of categories and any document can have any number of tags associated with it.
I want to train the classifier using this input, so that this tagging process can be automated.
Thanks
What you are trying to do is called multi-way supervised text categorization (or classification). Knowing the right question to ask is half the problem.
As for how this can be done, here are two references:
RCV1: A New Benchmark Collection for Text Categorization Research
Improved Nearest Neighbor Methods For Text Classification With Language Modeling and Harmonic Functions
Most classifiers work on a bag-of-words model. There are several things you can try to get the expected result.
Start with the most general one, a multinomial Naive Bayes classifier, vary its input parameters, and check the results.
Try the Naive Bayes variants (http://scikit-learn.org/0.11/modules/naive_bayes.html).
You can also try a sentence classifier that takes sentence structure into account. Using n-grams, you can try 2-, 3-, 4-, and 5-gram models and check how the results vary. CountVectorizer supports n-grams; see this link for an example: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
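To make the n-gram idea concrete, here is a small sketch in plain Python that produces word n-grams from whitespace-split text. The function names are hypothetical; scikit-learn's CountVectorizer exposes the same idea through its `ngram_range` parameter, with its own tokenization rules.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(text, n_min=1, n_max=2):
    """Collect every n-gram of the text for n between n_min and n_max inclusive."""
    tokens = text.lower().split()
    features = []
    for n in range(n_min, n_max + 1):
        features.extend(ngrams(tokens, n))
    return features

bigrams = ngrams(["my", "dog", "bit", "dave"], 2)
features = ngram_features("My dog bit Dave", 1, 2)
```

Unlike single words, the bigrams "dog bit" and "bit dave" keep some local word order, which is why larger n can help at the cost of a much bigger vocabulary.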
Depending on the features of your dataset, no single classifier is guaranteed to be the best for your scenario; you have to try different options and see which one fits best.
The simplest initial approach is to start with a simple classifier using scikit-learn:
Treat each category as a training class and train the classifier on these classes.
For any input docX, run the classifier with the trained model.
You will get a probability for each category.
Now apply a threshold, for example on the probability difference between the three highest-scoring categories; the categories that satisfy the threshold are the result for that input document.
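The thresholding step can be read in more than one way; the sketch below takes one reading, keeping up to three top categories whose probability lies within a margin of the best one. The probabilities are made up, and the function name, `margin`, and `max_labels` are assumptions; in practice the probabilities would come from the trained classifier.

```python
def select_categories(probabilities, margin=0.15, max_labels=3):
    """Keep the top categories whose probability is within `margin` of the best one."""
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return []
    best = ranked[0][1]
    return [cat for cat, p in ranked[:max_labels] if best - p <= margin]

# Hypothetical per-category probabilities for one input document.
probs = {"categoryA": 0.46, "categoryB": 0.38, "categoryC": 0.10, "categoryE": 0.06}
tags = select_categories(probs)
```

With these numbers the document would be tagged with categoryA and categoryB. The margin is a tuning knob: a smaller value yields fewer tags per document, and it is worth choosing on held-out tagged documents.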
It's not clear what you have tried or which programming language you are using, but as most have suggested, try text classification with document vectors or bag of words (as long as the documents contain words that can help with classification).
Here are some simple tools that can help you get started:
Weka http://www.cs.waikato.ac.nz/ml/weka/ (GUI & Java)
NLTK http://www.nltk.org (Python)
Mallet http://mallet.cs.umass.edu/ (command line & Java)
NUML http://numl.net/ (C#)
