Can we select features without doing classification, and if I have a text, how can I know which features to choose? I need an example about text, not a real-world object example. Can anyone explain, please?
Text Classification is classifying the text based on its features. For example, you might classify a sentence as having a positive ("I am so happy") or negative ("I am so sad") sentiment.
Text Feature selection is effectively deciding how you want to encode the text so you can run it through the classifier. There are many ways of doing this. For example, you could use a bag of words representation, where each column represents a word in your vocabulary and each cell represents how many times the word appears in the document.
If you had two sentences, "I am so happy, so very happy" and "I am so sad", your encoding for the sentences might be
| I | am | so | happy | very | sad |
| 1 | 1  | 2  | 2     | 1    | 0   |
| 1 | 1  | 1  | 0     | 0    | 1   |
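Here is a minimal sketch of that encoding using scikit-learn's CountVectorizer (my choice of library; the answer above doesn't prescribe one). Note the columns come out lowercased and in alphabetical order.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I am so happy, so very happy", "I am so sad"]

# relax the default token pattern so one-letter words like "I" are kept
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['am' 'happy' 'i' 'sad' 'so' 'very']
print(X.toarray())
# [[1 2 1 0 2 1]
#  [1 0 1 1 1 0]]
```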
I am doing my first project in the NLP domain, which is sentiment analysis of a dataset with ~250 tagged English data points/sentences. The dataset consists of reviews of a pharmaceutical product tagged as positive, negative or neutral. I have worked with numeric data in supervised learning for 3 years, but NLP is uncharted territory for me. So I want to know the pre-processing techniques and steps best suited to my problem. A guideline from an NLP expert would be much appreciated!
Based on your comment on mohammad karami's answer, what you haven't understood is the paragraph or sentence representation (you said "converting to numeric is the real question"). With numerical data, suppose you have a table with 2 columns (features) and a label, maybe something like "work experience", "age", and a label "salary" (to predict a salary based on age and work experience). In NLP, the features are usually at the word level (they can sometimes be at the character or subword level too). These features are called tokens, and the columns are replaced with these tokens. The simplest way to build a paragraph representation is a bag of words: after preprocessing, every unique word is mapped to a column. So suppose our training data consists of the following 2 rows:
"I help you and you should help me"
"you and I"
The unique words become the columns, so the table might look like:
I | help | you | and | should | me
Now the two samples would have values as follows:
[1, 2, 2, 1, 1, 1]
[1, 0, 1, 1, 0, 0]
Notice that the first element of each array is 1, because both samples contain the word "I" exactly once. The second element is 2 in the first row and 0 in the second, because the word "help" occurs twice in the first sentence and never in the second. The logic learned on top of this would be something like "if word A, word B, ... exist and word H, word I, ... don't exist, then the label is positive".
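A small sketch of the same idea in plain Python, assuming simple whitespace tokenization and the two training sentences above:

```python
from collections import Counter

sentences = ["I help you and you should help me", "you and I"]

# build the vocabulary (the "columns") from all unique words, in order of appearance
vocab = []
for s in sentences:
    for w in s.split():
        if w not in vocab:
            vocab.append(w)
print(vocab)  # ['I', 'help', 'you', 'and', 'should', 'me']

# represent each sentence as word counts over the vocabulary
for s in sentences:
    counts = Counter(s.split())
    print([counts[w] for w in vocab])
# [1, 2, 2, 1, 1, 1]
# [1, 0, 1, 1, 0, 0]
```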
Bag of words works well most of the time, but it has problems: dimensionality (imagine four billion unique words; that is far too many features), it doesn't take word order into account, it has no notion that related words are similar (each word gets its own independent column), and there are many more. The current state of the art for NLP is called BERT; learn that if you want to use what's best.
First of all, you have to specify what features you want and then do the pre-processing (a sketch of a few of these steps follows the list). For example, you can:
1- Remove HTML tags
2- Remove extra whitespace
3- Convert accented characters to ASCII characters
4- Expand contractions
5- Remove special characters
6- Lowercase all text
7- Convert number words to numeric form
8- Remove numbers
9- Remove stopwords
10- Lemmatization
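A sketch of a few of these steps with NLTK (assumes the 'punkt', 'stopwords' and 'wordnet' data packages have been downloaded; the regexes and the example review are illustrative):

```python
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)         # 1. remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)  # 5. remove special characters
    text = re.sub(r"\s+", " ", text).strip()     # 2. collapse extra whitespace
    text = text.lower()                          # 6. lowercase
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t not in stop_words]   # 9. remove stopwords
    return [lemmatizer.lemmatize(t) for t in tokens]      # 10. lemmatize

print(preprocess("<p>The reviews   really helped my headaches!</p>"))
# e.g. ['review', 'really', 'helped', 'headache']
```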
Work on your own data first. I suggest looking at the NLTK package for NLP; NLTK has a sentiment analysis function that may help with your work. Then extract your features with tf-idf or any other feature extraction or feature selection algorithm, and feed them to the machine learning algorithm after scaling.
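A minimal sketch of the tf-idf + classifier route, using scikit-learn (the library, the classifier and the tiny inline dataset are my choices for illustration, not part of the answer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# tiny illustrative dataset; replace with your ~250 tagged reviews
reviews = [
    "this product really helped me",
    "no effect at all, waste of money",
    "works fine, nothing special",
]
labels = ["positive", "negative", "neutral"]

# TfidfVectorizer already outputs values on a comparable scale,
# so no separate scaling step is added here
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(reviews, labels)
print(model.predict(["helped me a lot"]))
```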
What ML algorithms can I use to learn action phrases in a given sentence?
Sentence1: I want to play cricket
Label1: play cricket
Sentence2: Need to wash my clothes
Label2: wash clothes
I have a dataset of ~2k sentences & corresponding action phrases (labels) and need to predict labels for another bunch of sentences based on them. Can someone guide me on how to do this using NLP/ML? Which algorithms should I use (preferably in Python)?
Here's the process of sentence classification:
1) Normalize the text - bring all text to lower case
2) Remove all stop words - ensures that only relevant features are left
3) Tokenize the sentences to unigram tokens
4) Apply stemming technique - try out different stemming models/ lemmatizer to bring the words to their base word. See which one works best for your case. For example: play, played, plays will be converted to base word "play". This step reduces the number of features.
5) Create a Term Document Matrix for all the sentences. Each row of the TDM corresponds to a sentence and each column corresponds to a token from the combined vocabulary. (There's another way of representing text as a matrix, called Tf-Idf.)
6) Now this term document matrix contains tokens as columns. You already have the labels in place. You can start training the ML models now. I'm assuming you know how to do this part.
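A compact sketch of steps 1-6, using NLTK for stop word removal and stemming and scikit-learn for the document-term matrix and the classifier. The two example sentences and labels come from the question; the specific libraries and the Naive Bayes choice are illustrative.

```python
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def normalize(text):
    # steps 1-4: lowercase, tokenize to unigrams, drop stop words, stem
    return [stemmer.stem(t) for t in word_tokenize(text.lower()) if t not in stop_words]

sentences = ["I want to play cricket", "Need to wash my clothes"]
labels = ["play cricket", "wash clothes"]

# step 5: CountVectorizer builds the document-term matrix from our tokens
# step 6: train a classifier on it
model = make_pipeline(CountVectorizer(tokenizer=normalize), MultinomialNB())
model.fit(sentences, labels)
print(model.predict(["I played cricket yesterday"]))
```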
Take a look at NLTK's Naive Bayes Classifier; it's multiclass, and you can feed it the sentence/label pairs directly. NaiveBayesClassifier.train() will want training features; I would start with the features simply being the words in each sentence. You can then modify the feature selection with more complex methods until you get the results you want.
You can use nltk.classify.util.accuracy to evaluate results. Remember to split your sentences into training and test data.
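A small sketch of that route: bag-of-words style boolean features ("this word is present") fed to NLTK's NaiveBayesClassifier. The two labelled sentences are from the question; the rest is illustrative.

```python
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

def word_features(sentence):
    # each word in the sentence becomes a boolean "word is present" feature
    return {word: True for word in sentence.lower().split()}

train_data = [
    (word_features("I want to play cricket"), "play cricket"),
    (word_features("Need to wash my clothes"), "wash clothes"),
]

classifier = NaiveBayesClassifier.train(train_data)
print(classifier.classify(word_features("we should play cricket today")))
print(accuracy(classifier, train_data))  # in practice, evaluate on held-out test data
```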
So I am trying (just for fun) to classify movies based on their descriptions. The idea is to "tag" movies, so a given movie might be "action" and "humor" at the same time, for example.
Normally when using a text classifier, what you get is the single class to which a given text belongs, but in my case I want to assign a text 1 to N tags.
Currently my training set would look like this
+--------------------------+---------+
| TEXT | TAG |
+--------------------------+---------+
| Some text from a movie | action |
+--------------------------+---------+
| Some text from a movie | humor |
+--------------------------+---------+
| Another text here | romance |
+--------------------------+---------+
| Another text here | cartoons|
+--------------------------+---------+
| And some text more | humor |
+--------------------------+---------+
What I am doing next is to train classifiers to tell me whether or not each tag belongs to a single text, so for example, if I want to figure out whether or not a text is classified as "humor" I would end up with the following training set
+--------------------------+---------+
| TEXT | TAG |
+--------------------------+---------+
| Some text from a movie | humor |
+--------------------------+---------+
| Another text here |not humor|
+--------------------------+---------+
| And some text more | humor |
+--------------------------+---------+
Then I train a classifier that learns whether or not a text is humor (the same approach is applied to the rest of the tags). After that I end up with a total of 4 classifiers:
action / no action
humor / no humor
romance / no romance
cartoons / no cartoons
Finally, when I get a new text, I run it through each of the 4 classifiers. For each classifier that gives a positive classification (that is, X instead of no-X) with a probability over a certain threshold (say 0.9), I assume the new text belongs to tag X. I repeat this for each of the classifiers.
In particular I am using Naive Bayes as algorithm, but the same could be applied with any algorithm that outputs a probability.
Now the question is, is this approach correct? Am I doing something terribly wrong here? From the results I get, things seem to make sense, but I would like a second opinion.
Yes, this makes sense. It is a well-known, basic technique for multilabel/multiclass classification known as the "one vs all" (or "one vs rest") classifier. It is very old and widely used. On the other hand, it is also quite naive, as you do not consider any relations between your classes/tags. You might be interested in reading about structured learning, which covers settings where there is some structure over the label space that can be exploited (and usually there is).
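A sketch of the same one-vs-rest setup using scikit-learn, which wires up one binary classifier per tag automatically. The library, the tiny dataset and the 0.5 threshold here are illustrative, not part of the question's setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MultiLabelBinarizer

texts = [
    "explosions car chases and one liners",
    "a lighthearted comedy about a wedding",
    "two strangers fall in love in paris",
]
tags = [["action", "humor"], ["humor"], ["romance"]]

mlb = MultiLabelBinarizer()                 # one binary column per tag
Y = mlb.fit_transform(tags)

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# one binary Naive Bayes classifier per tag, trained automatically
clf = OneVsRestClassifier(MultinomialNB()).fit(X, Y)

# predict_proba gives one probability per tag; keep tags above a threshold
probs = clf.predict_proba(vectorizer.transform(["a funny action packed movie"]))[0]
print([tag for tag, p in zip(mlb.classes_, probs) if p >= 0.5])
```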
The problem you've described can be addressed by Latent Dirichlet Allocation, a statistical topic model method to find the underlying ("latent") topics in a collection of documents. This approach is based on a model where each document is a mixture of these topics.
In general, you initially decide on the number of topics (in your case, one per tag) and then run a trainer. The LDA software will then output a probability distribution over the topics for each document.
Here is a good introduction: http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
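A minimal sketch of topic modelling with LDA, using scikit-learn's LatentDirichletAllocation (gensim would work equally well; the toy documents are illustrative). Note that plain LDA is unsupervised: you choose the number of topics and get a topic distribution per document; mapping topics to your tags is a separate step.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "explosions car chases and fights",
    "a lighthearted comedy about a wedding",
    "two strangers fall in love in paris",
    "jokes pranks and slapstick humor",
]

X = CountVectorizer(stop_words="english").fit_transform(docs)

# choose the number of topics up front; deciding which topic corresponds
# to which tag is a separate, manual step
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)   # one topic distribution per document
print(doc_topics.round(2))
```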
Yes, your approach is correct, and it is a well-known strategy for enabling classifiers designed for binary classification to handle multi-class classification tasks.
Andrew Ng (of Stanford University) explains this approach here. Even though it is explained for logistic regression, as you mentioned, the idea can be applied to any algorithm that outputs a probability.
I have a Naive Bayes classifier (implemented with WEKA) that looks for uppercase letters.
contains_A
contains_B
...
contains_Z
For a certain class the word LCD appears in almost every instance of the training data. When I get the probability for "LCD" to belong to that class it is something like 0.988. win.
When I get the probability for "L" I get a plain 0 and for "LC" I get 0.002. Since features are naive, shouldn't the L, C and D contribute to overall probability independently, and as a result "L" have some probability, "LC" some more and "LCD" even more?
At the same time, the same experiment with an MLP, instead of showing the above behaviour, gives probabilities of 0.006, 0.5 and 0.8.
So the MLP does what I would expect a Naive Bayes to do, and vice versa. Am I missing something? Can anyone explain these results?
I am not familiar with the internals of WEKA, so please correct me if you think I am not right.
When using a text as a "feature", the text is transformed into a vector of binary values. Each value corresponds to one concrete word. The length of the vector is equal to the size of the dictionary.
If your dictionary contains 4 words: LCD, VHS, HELLO, WORLD,
then for example the text HELLO LCD will be transformed to [1,0,1,0].
I do not know how WEKA builds its dictionary, but I think it might go over all the words present in the examples. Unless "L" is present in the dictionary (and therefore present in the examples), its probability is logically 0. Actually it should not even be considered a feature.
You cannot really reason over the probabilities of the features this way, and you cannot add them together; I think there is no such relationship between the features.
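A tiny sketch of the binary bag-of-words encoding described above, assuming the four-word dictionary from the example:

```python
# the four-word dictionary from the example above
dictionary = ["LCD", "VHS", "HELLO", "WORLD"]

def encode(text):
    words = set(text.split())
    # 1 if the dictionary word occurs in the text, 0 otherwise
    return [1 if word in words else 0 for word in dictionary]

print(encode("HELLO LCD"))  # [1, 0, 1, 0]
```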
Beware that in text mining, words (letters in your case) may be given weights different from their actual counts if you are using any sort of term weighting and normalization, e.g. tf.idf. In the case of tf.idf, for example, character counts are converted to a logarithmic scale, and characters that appear in every single instance may be penalized through idf normalization.
I am not sure what options you are using to convert your data into Weka features, but you can see here that Weka has parameters to be set for such weighting and normalization options
http://weka.sourceforge.net/doc.dev/weka/filters/unsupervised/attribute/StringToWordVector.html
-T
Transform the word frequencies into log(1 + fij),
where fij is the frequency of word i in the jth document (instance).
-I
Transform each word frequency into
fij * log(num of documents / num of documents containing word i),
where fij is the frequency of word i in the jth document (instance).
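To make the two quoted transformations concrete, here is a small plain-Python sketch over a toy document-term matrix (the counts are made up purely for illustration):

```python
import math

# toy document-term matrix: rows are documents, columns are words,
# counts[j][i] is the frequency f_ij of word i in document j
counts = [
    [3, 0, 1],
    [0, 2, 1],
]
num_docs = len(counts)
num_words = len(counts[0])

# -T: replace each frequency f_ij with log(1 + f_ij)
log_tf = [[math.log(1 + f) for f in row] for row in counts]

# -I: replace each frequency f_ij with f_ij * log(N / n_i), where n_i is the
# number of documents containing word i; a word present in every document
# (the last column here) gets weight 0
docs_containing = [sum(1 for row in counts if row[i] > 0) for i in range(num_words)]
idf_weighted = [[f * math.log(num_docs / docs_containing[i]) for i, f in enumerate(row)]
                for row in counts]

print(log_tf)
print(idf_weighted)
```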
I checked the Weka documentation and didn't see support for extracting letters as features. This implies the Weka filter may need a space or punctuation to delimit each feature from those adjacent. If so, then the search for "L", "C" and "D" would be interpreted as three separate one-letter words, which would explain why they were not found.
If you think this is it, you could try splitting the text into single characters delimited by \n or space, prior to ingestion.
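A quick sketch of that workaround: splitting the text into single space-delimited characters before handing it to the vectorizer, so "L", "C" and "D" become separate tokens. Illustrative only.

```python
def to_char_tokens(text):
    # put a space between every character so each letter becomes its own token
    return " ".join(ch for ch in text if not ch.isspace())

print(to_char_tokens("LCD VHS"))  # -> L C D V H S
```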
I want to classify sentences with Weka. My features are the sentence terms (words) and a part-of-speech tag for each term. I don't know how to define the attributes, because if each term is presented as one feature, the number of features per instance (sentence) differs. And if all the words in a sentence are presented as one feature, how do I relate the words to their POS tags?
Any ideas how I should proceed?
If I understand the question correctly, the answer is as follows: It is most common to treat words independently of their position in the sentence and represent a sentence in the feature space by the number of times each of the known words occurs in that sentence. I.e. there is usually a separate numerical feature for each word present in the training data. Or, if you're willing to use n-grams, a separate feature for every n-gram in the training data (possibly with some frequency threshold).
As for the POS tags, it might make sense to use them as separate features, but only if the classification you're interested in has to do with sentence structure (syntax). Otherwise you might want to just append the POS tag to the word, which would partly disambiguate those words that can represent different parts of speech.
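A small sketch of the "append the POS tag to the word" idea, using NLTK's tokenizer and tagger (assumes the 'punkt' and 'averaged_perceptron_tagger' data packages are available; the example sentences are illustrative):

```python
from nltk import pos_tag, word_tokenize

def word_pos_tokens(sentence):
    # e.g. "book" as a verb becomes "book_VB", as a noun "book_NN"
    return [f"{word.lower()}_{tag}" for word, tag in pos_tag(word_tokenize(sentence))]

print(word_pos_tokens("I will book a flight"))
print(word_pos_tokens("I read a good book"))
# these token lists can then be counted into features just like plain words
```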