I have data in an Excel file that I need to use to perform multi-label classification using SVM. It has two columns, as shown below: 'tweet' (A, B, C, D, E, F, G) and 'category' (X, Y, Z).
tweet category
A X
B Y
C Z
D X,Y
E Y,Z
F X,Y,Z
G X,Z
Given a tweet, I want to train my model to predict the category it belongs to. Both the tweets and the categories are text. I am trying to use Weka's LibSVM classifier to do the classification, as I read that it does multi-label classification. I converted the CSV file to an ARFF file and loaded it in Weka. I then ran the "LibSVM" classifier. However, I am getting very poor results, as shown below. Any idea what I am doing wrong? Is multi-label text classification even possible with "LibSVM"?
Correctly Classified Instances 82 25.9494 %
Incorrectly Classified Instances 234 74.0506 %
Kappa statistic 0
Mean absolute error 0.0423
Root mean squared error 0.2057
Relative absolute error 89.9823 %
Root relative squared error 134.3377 %
Total Number of Instances 316
SVM can definitely be used for multiclass classification.
I have not used Weka's LibSVM before, but if you haven't already, you will need to do some data cleaning before you input text for any sort of classification.
The type of cleaning also depends on your classification task, but you can look into the following techniques, which are used in practice for text analysis (a rough sketch follows the list):
1) Remove Twitter handles from your text.
2) Remove stop words, or words that you know for sure do not impact your classification. Maybe you could preserve only pronouns and remove all other words; you can use POS tagging to perform this task.
3) Remove punctuation.
4) Use n-grams to get contextual meaning out of your text. Essentially, this means that you would treat a sequence of words as a feature rather than using a single word as a data point in your model. Mind you, this might increase the amount of memory your model occupies while training.
5) Remove words that occur either too frequently or too rarely in your data set.
6) Balance your classes, or categories in your case. This means that before training your model, you make sure the training data has a similar number of X, Y, and Z categories. It is possible that your training data had a lot of tweets that classify to X and Y, but your test set had tweets that mostly mapped to the Z category.
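If you end up doing the cleaning in Python rather than inside Weka, steps 1-5 might look roughly like the following sketch (the file name, column name, and the min_df/max_df thresholds are assumptions, not taken from your setup):

import re
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical file and column names -- adjust to your own data.
df = pd.read_csv("tweets.csv")

def clean(text):
    text = re.sub(r"@\w+", "", text)      # 1) drop Twitter handles
    text = re.sub(r"[^\w\s]", "", text)   # 3) drop punctuation
    return text.lower()

df["tweet"] = df["tweet"].apply(clean)

# 2) stop-word removal, 4) n-grams (unigrams + bigrams), and
# 5) dropping words that are too rare (min_df) or too frequent (max_df).
vectorizer = CountVectorizer(stop_words="english",
                             ngram_range=(1, 2),
                             min_df=2,
                             max_df=0.9)
X = vectorizer.fit_transform(df["tweet"])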
I have a file of raw feedback that needs to be labeled (categorized) and then serve as the training input for an SVM classifier (or any classifier, for that matter).
But the catch is, I'm not assigning a whole feedback item to a certain category. One feedback item may belong to more than one category, based on the topics it talks about (noun n-grams are extracted). So I'm labeling the topics (terms), not the feedback items (documents). I've extracted the n-grams using TF-IDF while saving their features so I could train my model on them. The problem is that TF-IDF returns a document-term matrix, which becomes train_x, while on the other side I've got train_y: the labels assigned to each n-gram (not to the whole document). So I've ended up with a document-term matrix of x rows (the number of documents) against labels for y n-grams (the number of unique topics extracted).
Below is a sample of what the data looks like. Blue is the n-grams (extracted by TF-IDF), while red is the labels/categories (calculated for each n-gram with a function I wrote manually).
Instead of posting code, this is my strategy for implementing my concept:
The problem lies in the part where TF-IDF produces x_train = tf.transform(feedbacks), which is a document-term matrix, and it doesn't make sense for it to be the input to the classifier against y_train, which holds the labels for the terms and not for the documents. I've tried to transpose the matrix, but it gave me an error. I've tried to input a 1-D array that holds only the feature values for the terms directly, which also gave me an error because the classifier expects X to be in a (sample, feature) format. I'm using scikit-learn's SVM and TfidfVectorizer.
Simply put, I want to be able to use an SVM classifier on a list of terms (n-grams) against a list of labels to train the model, and then test new data (after cleaning it and extracting its n-grams) so the SVM can predict its labels.
The solution might be something very technical, like using another classifier that expects a different format, or not using TF-IDF since it is document-focused, or, more broadly, a whole change of approach and concept (if mine is wrong).
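One way to get the shapes to line up, purely as a sketch with hypothetical data (and not necessarily the right modelling choice), would be to vectorize the term strings themselves, so that each n-gram is one sample and the label array has the same length:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical data: each extracted n-gram is its own sample,
# labelled with the category that was assigned to it.
terms = ["battery life", "customer support", "delivery time"]
labels = ["hardware", "service", "logistics"]

# Vectorize the term strings themselves (character n-grams here), so that
# X has shape (n_terms, n_features) and lines up with len(labels).
vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vec.fit_transform(terms)

clf = LinearSVC()
clf.fit(X, labels)

# New terms extracted from unseen feedback go through the same vectorizer.
print(clf.predict(vec.transform(["slow delivery"])))

Whether term-level labels are the right granularity is a separate question; this only shows one input format the classifier will accept.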
I'd very much appreciate it if someone could help.
I've trained a Keras LSTM classification model on characters, saved its model architecture and weights, and now want to load it into a separate application that I can run separately from the training system, stick a REST endpoint on it, and then be able to make predictions via REST...
I haven't found - maybe poor googlefu - references to how other people are doing it, and the main vagueness I'm running into is how to load the original text index & the corresponding labels index.
i.e. the index of 1="a",2="g",3=" ",4="b", and the "original" labels of ["green","blue","red","orange"] prior to the label being 1-hot encoded...
So this is my understanding:
the weights are based on the numerical inputs that were given to the originally trained model
the numerical inputs & the generated index are based on the specific data set that was used for training
the outputs from the model that represent the classification are based on the order in which the original data set's labels were added - i.e. if green was in position 0 of the training labels, and is in position 1 of the actual "runtime" labels, then that's not gonna work... true?
which means that reusing the model + weights, not only requires the actual model architecture & weights, but it also requires the indices of the input & output data...
Is that correct? Or am I missing something major?
Because the thing then is... IF this is the case, is there a way to save & load the indices other than doing it manually?
Because if it needs to be done manually, then we kinda lose the benefits of Keras' preprocessing functionality (like the Tokenizer and np_utils.to_categorical) that we WERE able to use in the training system...
Does anybody have a pattern for doing this sort of activity?
I'm currently doing something along the lines of the following (a rough sketch appears after the list):
save the X index & the Y label array during training together with the model architecture & weights
in the prediction application, load and recreate the model with the architecture & weights
have a custom class to tokenise input words based on the X index, pad it to the right length, etc
make the prediction
take the prediction and map the highest probability item to the original Y labels array, and therefore figure out what the predicted label is
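A rough sketch of that save/load flow, assuming the fitted Tokenizer and the ordered label list from training are available as Python objects (the file names and MAXLEN are placeholders, and pickling the Tokenizer is just one common way to avoid redoing the indexing by hand):

import json
import pickle

import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAXLEN = 100  # assumed: the padded sequence length used during training

def save_artifacts(model, tokenizer, label_list):
    # Training side: persist everything the prediction app will need.
    model.save("model.h5")                # architecture + weights
    with open("tokenizer.pkl", "wb") as f:
        pickle.dump(tokenizer, f)         # the fitted X index
    with open("labels.json", "w") as f:
        json.dump(label_list, f)          # Y labels in their training order

def predict_label(text):
    # Prediction/REST side: load the artifacts and map a raw string to a label.
    model = load_model("model.h5")
    with open("tokenizer.pkl", "rb") as f:
        tokenizer = pickle.load(f)
    with open("labels.json") as f:
        label_list = json.load(f)

    seqs = tokenizer.texts_to_sequences([text])
    x = pad_sequences(seqs, maxlen=MAXLEN)
    probs = model.predict(x)[0]
    return label_list[int(np.argmax(probs))]  # highest probability -> original label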
Let's suppose I have a noisy 2D data set where a person looking at the data could easily draw a straight line through it so that the mean squared error is minimized.
The model of the line has the form y = mx + b, where x is the input value, y is the predicted value of the model and m and b are trained variables to minimize the cost.
My question is: if we plug some input x1 into the model, it will always output the same number, without taking into account how sparse the data is. How can a model like this predict different values from the same input?
Maybe this could be done by taking all the errors from the model line to the points, making a distribution of them, taking the expected value of that distribution, and then adding that value to y?
If the data is 2D and it can be perfectly modeled with a straight line, then there is no data-based or statistical reason not to claim that the process is fully deterministic, and you should output one value.
However, if you have many more dimensions, or your fit is not perfect (the error is minimised but not 0), then what you are after is either predicting a distribution of values or at least confidence bounds.
There are many probabilistic models that can model a distribution of outputs rather than a single value. In particular, linear regression does that: it assumes that you have a Gaussian error around your predictions, so effectively, once you obtain the MSE "A", you can draw predictions from N(mx+b, A), which, as you can easily see, degenerates to the deterministic model when A=0. These predictions are optimal in expectation, and they are simply your way of "simulating observations" according to the model.
There are also meta methods: if you treat your predictor as a black box, you can train multiple models on subsets of the data and treat their predictions as samples to fit a distribution (again, for simplicity, it could be a single Gaussian).
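As a small illustration of that idea, here is a sketch with made-up data that uses the fit's MSE as the variance of the Gaussian, as described above:

import numpy as np

rng = np.random.default_rng(0)

# Noisy 2-D data scattered around a "true" line y = 2x + 1 (made-up example).
x = np.linspace(0, 10, 200)
y = 2 * x + 1 + rng.normal(scale=1.5, size=x.size)

# Fit y = mx + b by least squares.
m, b = np.polyfit(x, y, 1)

# Residual variance A, i.e. the MSE of the fit.
A = np.mean((y - (m * x + b)) ** 2)

# Instead of the single deterministic value m*x1 + b,
# draw predictions from N(m*x1 + b, A).
x1 = 4.2
samples = rng.normal(loc=m * x1 + b, scale=np.sqrt(A), size=5)
print(samples)  # five different plausible y values for the same input x1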
I'm working on a text classification problem, and I have problems with missing values on some features.
I'm calculating class probabilities of words from labeled training data.
For example:
Suppose the word foo belongs to class A 100 times and to class B 200 times. In this case, I compute the class probability vector as [0.33, 0.67] and give it, along with the word itself, to the classifier.
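For concreteness, that computation is just the counts normalised to sum to one; a tiny sketch:

# Counts of how often "foo" appeared under each class (the numbers above).
counts = {"A": 100, "B": 200}
total = sum(counts.values())
prob_vector = [counts[c] / total for c in ("A", "B")]
print(prob_vector)  # roughly [0.33, 0.67]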
The problem is that, in the test set, there are some words that have not been seen in the training data, so they have no probability vectors.
What could I do about this problem?
I've tried giving the average class probability vector of all words for the missing values, but it did not improve accuracy.
Is there a way to make the classifier ignore some features during evaluation, just for the specific instances that do not have a value for a given feature?
Regards
There are many ways to achieve that:
1) Create and train classifiers for every subset of features you have. You can train each of these classifiers on its subset using the same data as the training of the main classifier. For each sample, just look at the features it has and use the classifier that fits it best. Don't try to do any boosting with those classifiers.
2) Just create a special class for samples that can't be classified, or for which the results you get with so few features are too poor. Sometimes humans can't successfully classify samples either. In many cases, samples that can't be classified should simply be ignored; the problem is not in the classifier but in the input, or it can be explained by the context.
3) From an NLP point of view, many words have a meaning/usage that is very similar across many applications, so you can use stemming/lemmatization to create classes of words (a sketch follows below). You can also use syntactic corrections, synonyms, and translations (does the word come from another part of the world?).
If this problem is important enough for you, then you will end up with a combination of the three previous points.
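For point 3, a minimal sketch of the stemming idea (NLTK's Porter stemmer is one common choice; the probability vectors and the fallback behaviour are made up):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Class-probability vectors learned from training data (made-up values),
# keyed by stemmed form so that related surface forms share one entry.
train_probs = {stemmer.stem("running"): [0.8, 0.2],
               stemmer.stem("happily"): [0.1, 0.9]}

def lookup(word, fallback=None):
    # Map an unseen surface form to its stem before looking up the vector.
    return train_probs.get(stemmer.stem(word), fallback)

print(lookup("runs"))   # found via the shared stem "run"
print(lookup("zebra"))  # still unseen -> None (or whatever fallback you choose)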
I'm working on sentiment analysis for text classification, and I want to classify tweets from Twitter into 3 categories: positive, negative, or neutral. I have 210 training examples, and I'm using Naive Bayes as the classifier. I'm implementing it in PHP, with MySQL as the database for the training data.
What I've done, in sequence:
I split my data, based on 10-fold cross-validation, into 189 training examples and 21 testing examples.
I insert my training data into the database, so my classifier can classify based on the training data.
Then I classify my testing data using my classifier, and I get 21 prediction results.
Repeat steps 2 and 3 ten times, following the 10-fold cross-validation.
I evaluate the accuracy of the classifier for each fold, so I get 10 accuracy results, and then take the average of the results.
What I want to know is:
Which is the learning process? What is the input, process, and output?
Which is the validation process? What is the input, process, and output?
Which is the testing process? What is the input, process, and output?
I just want to make sure that my understanding of these 3 processes (learning, validation, and testing) is correct.
In your example, I don't think there is a meaningful distinction between validation and testing.
Learning is when you train the model, which means that your outputs are, in general, parameters, such as coefficients in a regression model or connection weights in a neural network. In your case, the outputs are estimated probabilities of seeing a word w in a tweet given that the tweet is positive, P(w|+), negative, P(w|-), or neutral, P(w|*), and also the probabilities of not seeing words in the tweet given positive, negative, or neutral: P(~w|+), etc. The inputs are the training data, and the process is simply estimating the probabilities by measuring the frequencies with which words occur (or don't occur) in each of your classes, i.e. just counting!
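As a toy sketch of that counting step (made-up tweets; smoothing of zero counts is ignored here and comes up further down):

from collections import Counter, defaultdict

# Made-up labelled tweets: (class, set of words in the tweet).
training = [("+", {"great", "day"}), ("+", {"great", "friend"}),
            ("-", {"bad", "day"}), ("*", {"meh"})]

tweets_per_class = Counter(label for label, _ in training)
word_class_counts = defaultdict(Counter)  # word -> class -> tweet count
for label, words in training:
    for w in words:
        word_class_counts[w][label] += 1

# P(w | class): fraction of that class's tweets that contain the word.
def p(w, label):
    return word_class_counts[w][label] / tweets_per_class[label]

print(p("great", "+"))  # 1.0 -- "great" appears in both positive tweets
print(p("day", "+"))    # 0.5 -- "day" appears in one of the two positive tweets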
Testing is where you see how well your trained model does on data you haven't seen before. Training tends to produce outputs that overfit the training data, i.e. the coefficients or probabilities are "tuned" to noise in the training data, so you need to see how well your model does on data it hasn't been trained on. In your case, the inputs are the test examples, the process is applying Bayes theorem, and the outputs are classifications for the test examples (you classify based on which probability is highest).
I have come across cross-validation -- in addition to testing -- in situations where you don't know what model to use (or where there are additional, "extrinsic", parameters to estimate that can't be done in the training phase). You split the data into 3 sets.
So, for example, in linear regression you might want to fit a straight line model, i.e. estimate p and c in y = px + c, or you might want to fit a quadratic model, i.e. estimate p, c, and q in y = px + qx^2 + c. What you do here is split your data into three. You train the straight line and quadratic models using part 1 of the data (the training examples). Then you see which model is better by using part 2 of the data (the cross-validation examples). Finally, once you've chosen your model, you use part 3 of the data (the test set) to determine how good your model is. Regression is a nice example because a quadratic model will always fit the training data better than the straight line model, so you can't just look at the errors on the training data alone to decide what to do.
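A small sketch of that three-way split for the regression example (made-up, truly linear data; numpy's polynomial fitting stands in for the two regression models):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 300)
y = 1.5 * x + 2 + rng.normal(scale=1.0, size=x.size)  # truly linear data plus noise

# Split into training / cross-validation / test thirds.
idx = rng.permutation(x.size)
train, cv, test = idx[:100], idx[100:200], idx[200:]

def mse(degree, fit_idx, eval_idx):
    coeffs = np.polyfit(x[fit_idx], y[fit_idx], degree)  # fit on one part...
    pred = np.polyval(coeffs, x[eval_idx])
    return np.mean((pred - y[eval_idx]) ** 2)            # ...evaluate on another

# Choose between the straight-line (degree 1) and quadratic (degree 2) models
# on the cross-validation part, then report quality on the untouched test part.
best_degree = min((1, 2), key=lambda d: mse(d, train, cv))
print(best_degree, mse(best_degree, train, test))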
In the case of Naive Bayes, it might make sense to explore different prior probabilities, i.e. P(+), P(-), P(*), using a cross-validation set, and then use the test set to see how well you've done with the priors chosen using cross-validation and the conditional probabilities estimated using the training data.
As an example of how to calculate the conditional probabilities, consider 4 tweets, which have been classified as "+" or "-" by a human
T1, -, contains "hate", "anger"
T2, +, contains "don't", "hate"
T3, +, contains "love", "friend"
T4, -, contains "anger"
So for P(hate|-) you add up the number of times hate appears in negative tweets. It appears in T1 but not in T4, so P(hate|-) = 1/2. For P(~hate|-) you do the opposite, hate doesn't appear in 1 out of 2 of the negative tweets, so P(~hate|-) = 1/2.
Similar calculations give P(anger|-) = 1, and P(love|+) = 1/2.
A fly in the ointment is that any probability that is 0 will mess things up in the calculation phase, so instead of using a zero probability you use a very low number, like 1/n or 1/n^2, where n is the number of training examples. So you might put P(~anger|-) = 1/4 or 1/16.
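To make the classification step concrete, here is a rough sketch that plugs those numbers, with the 1/n = 1/4 smoothing for the zero probabilities, into a Bernoulli-style Naive Bayes score (the uniform priors simply reflect the two positive and two negative tweets above):

vocab = ["hate", "anger", "love", "friend", "don't"]
n = 4  # number of training tweets, used for the 1/n smoothing of zero probabilities

# P(word | class): fraction of that class's tweets containing the word,
# with zero probabilities replaced by 1/n as suggested above.
p_pos = {"hate": 1/2, "don't": 1/2, "love": 1/2, "friend": 1/2, "anger": 1/n}
p_neg = {"hate": 1/2, "anger": 1.0, "love": 1/n, "friend": 1/n, "don't": 1/n}
prior = {"+": 1/2, "-": 1/2}  # two positive and two negative training tweets

def score(tweet_words, p_word, p_class):
    # Bernoulli-style Naive Bayes score: multiply P(w|c) for words present
    # in the tweet and P(~w|c) = 1 - P(w|c) for words absent, times the prior.
    s = p_class
    for w in vocab:
        s *= p_word[w] if w in tweet_words else (1 - p_word[w])
    return s

tweet = {"hate", "anger"}
scores = {c: score(tweet, p, prior[c]) for c, p in [("+", p_pos), ("-", p_neg)]}
print(max(scores, key=scores.get))  # classify by the highest score: "-"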
(I put the maths of the calculation in this answer.)