I tried to play with libsvm and 3D descriptors in order to perform object recognition. So far I have 7 categories of objects, and for each category I have its number of objects (and its percentage):
Category 1. 492 (14%)
Category 2. 574 (16%)
Category 3. 738 (21%)
Category 4. 164 (5%)
Category 5. 369 (10%)
Category 6. 123 (3%)
Category 7. 1025 (30%)
So I have 3485 objects in total.
I have followed the practical guide of libsvm. Here it is, as a reminder:
A. Scaling the training and testing data
B. Cross-validation
C. Training
D. Testing
I separated my data into training and testing sets.
By doing 5-fold cross-validation, I was able to determine good values for C and gamma.
However, I obtained poor results (cross-validation accuracy is about 30-40% and my test accuracy is about 50%).
Then, thinking about my data, I saw that some of it is unbalanced (categories 4 and 6, for example). I discovered that libSVM has an option for class weights, which is why I would now like to set appropriate weights.
So far I'm doing this:
svm-train -c cValue -g gValue -w1 1 -w2 1 -w3 1 -w4 2 -w5 1 -w6 2 -w7 1
However, the results are the same. I'm sure this is not the right way to do it, which is why I'm asking for help.
I saw some topics on the subject, but they were related to binary classification, not multiclass classification.
I know that libSVM does "one against one" (so binary classifiers), but I don't know how to handle that when I have multiple classes.
Could you please help me?
Thank you in advance for your help.
I've run into the same problem before. I also tried giving the classes different weights, which didn't work.
I recommend training with a subset of the dataset.
Try to use approximately equal numbers of samples from the different classes. You can use all the category 4 and 6 samples, then pick about 150 samples from each of the other categories, as in the sketch below.
I used this method and the accuracy did improve. Hope this will help you!
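For illustration, here is a minimal sketch of that subsampling in Python (X and y are assumed to be numpy arrays of features and labels; the names and the helper are hypothetical):

import numpy as np

rng = np.random.default_rng(0)

def balanced_subset(X, y, n_per_class=150):
    # keep at most n_per_class samples per class; smaller classes are kept whole
    keep = []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        if len(idx) > n_per_class:
            idx = rng.choice(idx, size=n_per_class, replace=False)
        keep.append(idx)
    keep = np.concatenate(keep)
    return X[keep], y[keep]

# X_sub, y_sub = balanced_subset(X_train, y_train)  # then train libSVM on this subset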
I've been working through a Coursera course for extra practice and ran into an issue I don't understand.
Link to Colab
As far as I've worked on ML neural network problems, I've always been taught that the output layer of a multiclass classification problem should be Dense, with the number of nodes equal to the number of classes. E.g. dog, cat, horse - 3 classes = 3 nodes.
However, in the notebook there are 5 classes in the labels (checked using len(label_tokenizer.word_index)), yet with 5 nodes I had terrible results, and with 6 nodes the model worked properly.
Can anyone please explain why this is the case? I can't find any online example explaining this. Cheers!
I figured it out. A dense output layer with categorical cross-entropy loss expects the labels/targets to start from zero. For example:
cat - 0
dog - 1
horse - 2
In this case, the number of dense nodes is 3.
However, in the Colab, the labels were generated using the keras Tokenizer, which indexes starting from 1 (because 0 is usually reserved for padding).
from tensorflow.keras.preprocessing.text import Tokenizer  # assuming the usual import

label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(labels)
# label_tokenizer.word_index:
# {'business': 2, 'entertainment': 5, 'politics': 3, 'sport': 1, 'tech': 4}
This leads to a weird case: with 5 dense nodes, the output classes are 0-4, which doesn't match up with our labels, which run from 1 to 5.
I proved this empirically by rerunning the code with all labels reduced by 1; the model then trains successfully with 5 dense nodes, since the labels are now 0-4.
I suspect that labels 1-5 with 6 dense nodes work because the model simply learns that label 0 is never used and focuses on 1-5.
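For reference, a minimal sketch of the fix described above (the variable names are my assumptions about the notebook):

import numpy as np

# the tokenizer indexes from 1, but categorical cross entropy expects 0-based labels
label_seq = np.array(label_tokenizer.texts_to_sequences(labels)) - 1

# now 5 output nodes line up with labels 0-4:
# model.add(Dense(5, activation='softmax'))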
If anyone understands the inner workings of categorical cross entropy, do feel free to add on!
This is my first data mining project. I am using SAS Enterprise Miner to train and test a classifier.
I have 3 files at my disposal:
Training file: 85 input variables and 1 target variable, with 5800+ observations
Prediction file: 85 input variables, with 4000 observations
Verification file: 1 variable containing the correct predictions for the second file. Since this is an academic project, this file is there to tell us whether we are doing a good job or not.
My problem is that the dataset is unbalanced (95% 0s and 5% 1s for the target variable in the training file). So naturally, I tried to re-sample the data using the "sampling node", as described in the following link.
Here are the 2 approaches I used; they give slightly different results. But here is the general, unsatisfactory result I am getting:
Without resampling: the model predicts fewer than ten solicited individuals (target variable = 1) out of 4000 observations.
With resampling: the model predicts about 1500 solicited individuals out of 4000 observations.
I am looking for 100 to 200 solicited individuals for the model to be considered acceptable.
Why do you think our predictions are so far off, and how can we remedy this situation?
Here is a screenshot of both models.
There are some techniques to deal with unbalanced data. One approach that I remember from many years ago is this:
Say you have 100 solicited (minority-class) observations, which are 5% of all your observations.
Cluster the non-solicited (majority) class into 20 groups (each with about 100 non-solicited observations) using clustering algorithms like k-means, mean-shift, DBSCAN, etc.
Then, for each clustered group of majority observations, create a dataset that also contains all 100 solicited (minority) observations. This gives you 20 datasets, each balanced with 100 solicited and about 100 non-solicited observations.
Train on each balanced dataset, creating one model per dataset.
At prediction time, run all 20 models and take a majority vote; for example, if 15 out of 20 models say an observation is solicited, it is solicited.
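A rough sketch of this idea in Python with scikit-learn (the base classifier and function names are my own illustrative choices, not part of the original recipe):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def train_cluster_ensemble(X_majority, X_minority, n_groups=20):
    # split the majority class into n_groups clusters
    groups = KMeans(n_clusters=n_groups, n_init=10).fit_predict(X_majority)
    models = []
    for g in range(n_groups):
        # pair each majority cluster with all minority samples -> balanced dataset
        X = np.vstack([X_majority[groups == g], X_minority])
        y = np.concatenate([np.zeros((groups == g).sum()), np.ones(len(X_minority))])
        models.append(LogisticRegression(max_iter=1000).fit(X, y))
    return models

def predict_vote(models, X):
    # majority vote over the ensemble, e.g. 15 "solicited" votes out of 20 wins
    votes = np.array([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)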
I am having trouble with a classification problem.
I have almost 400k vectors in my training data with two labels, and I'd like to train an MLP that classifies the data into two classes.
However, the dataset is heavily imbalanced: 95% of the vectors have label 1, and the others have label 0. The accuracy grows as training progresses and stops after reaching 95%. I guess this is because the network predicts label 1 for all vectors.
So far, I have tried dropout layers with 0.5 probability, but the result is the same. Are there any ways to improve the accuracy?
I think the best way to deal with unbalanced data is to use class weights. For example, you can weight your classes so that the sum of the weights for each class is equal.
import pandas as pd

df = pd.DataFrame({'x': range(7),
                   'y': [0] * 2 + [1] * 5})
# weight each sample inversely to its class frequency,
# so that each class gets a total weight of len(df) / 2 = 3.5
df['weight'] = df['y'].map(len(df) / 2 / df['y'].value_counts())
print(df)
# dict aggregation on a grouped series was removed in pandas 1.0;
# named aggregation does the same thing
print(df.groupby('y')['weight'].agg(samples='count', weight='sum'))
output:
   x  y  weight
0  0  0    1.75
1  1  0    1.75
2  2  1    0.70
3  3  1    0.70
4  4  1    0.70
5  5  1    0.70
6  6  1    0.70

   samples  weight
y
0        2     3.5
1        5     3.5
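These per-sample weights can then be passed to most libraries. For example, a sketch continuing the toy df above with a scikit-learn classifier of my own choosing:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
# sample_weight makes the two classes contribute equally to the loss
clf.fit(df[['x']], df['y'], sample_weight=df['weight'])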
You could also try another classifier on a subset of the examples. SVMs may work well with small data, so you could take, say, only 10k examples, with a 5:1 proportion between the classes.
You could also oversample the small class and under-sample the other.
You can also simply weight your classes.
Think also about a proper metric. It's good that you noticed that your model predicts only one label; that failure, however, is not easily seen using accuracy.
Some nice ideas about unbalanced datasets here:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Remember not to change your test set.
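On the metric point, a confusion matrix or a per-class report exposes the "always predict 1" failure that plain accuracy hides. A self-contained toy example:

from sklearn.metrics import classification_report, confusion_matrix

y_true = [1] * 95 + [0] * 5   # imbalanced ground truth
y_pred = [1] * 100            # a model that always predicts label 1
print(confusion_matrix(y_true, y_pred))    # all class-0 samples are misclassified
print(classification_report(y_true, y_pred, zero_division=0))
# accuracy is 0.95, yet recall for class 0 is 0.00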
That's a common situation: the network learns a constant and can't get out of this local minimum.
When the data is very unbalanced, like in your case, one possible solution is a weighted cross-entropy loss function. For instance, in tensorflow, apply the built-in tf.nn.weighted_cross_entropy_with_logits function. There is also a good discussion of this idea in this post.
But I should say that getting more data to balance both classes (if that's possible) will always help.
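A minimal tensorflow sketch of that loss (the tensors are toy values; this assumes the minority class is coded as the positive label 1):

import tensorflow as tf

y_true = tf.constant([[0.], [0.], [0.], [1.]])    # toy batch, minority = 1
logits = tf.constant([[-2.], [-1.], [0.5], [-0.5]])

# pos_weight > 1 makes mistakes on the positive (minority) class cost more;
# a common choice is (number of negatives) / (number of positives)
loss = tf.nn.weighted_cross_entropy_with_logits(
    labels=y_true, logits=logits, pos_weight=3.0)
print(tf.reduce_mean(loss))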
I'm new to the machine learning field.
I'm trying to classify 10 people by their phone call logs.
The phone call logs look like this:
UserId  IsInboundCall  Duration  PhoneNumber(hashed)
1       false          23        1011112222
2       true           45        1033334444
Training an SVM from sklearn on 8700 logs of this kind gives an accuracy of 88%.
I have several questions about this result and about the proper way to use non-ordinal data (e.g. phone numbers):
I'm not sure about using a hashed phone number as a feature, but this multiclass classifier's accuracy is not bad. Is that just a coincidence?
How do I use non-ordinal data as a feature?
If this classifier had to classify more than 1000 classes (more than 1000 users), would SVM still work in that case?
Any advice is helpful. Thanks!
1) Try the SVM without the phone number as a feature to get a sense of how much impact it has.
2) To handle non-ordinal data, you can either transform it into a number or use a 1-of-K approach (see the sketch after this list). Say you added a Phone OS field with possible values {IOS, Android, Blackberry}: you could represent this as a single number 0, 1, 2, or as 3 features (1,0,0), (0,1,0), (0,0,1).
3) The SVM will still give good results as long as the data is approximately linearly separable. To achieve this, you might need to add more features and map into a different feature space (an RBF kernel is a good start).
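A quick sketch of the 1-of-K (one-hot) encoding from point 2 with scikit-learn; the Phone OS field is the hypothetical example above:

from sklearn.preprocessing import OneHotEncoder

phone_os = [['IOS'], ['Android'], ['Blackberry'], ['Android']]
enc = OneHotEncoder()   # categories are sorted: Android, Blackberry, IOS
print(enc.fit_transform(phone_os).toarray())
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]
#  [1. 0. 0.]]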
I am studying for my Machine Learning (ML) class and I have a question that I couldn't answer with my current knowledge. Assume that I have the following data-set:
att1  att2  att3  class
5     6     10    a
2     1     5     b
47    8     4     c
4     9     8     a
4     5     6     b
The above data-set is clear, and I think I can apply classification algorithms to new incoming data after training on it. Since each instance has a label, it is easy to see which class each instance belongs to. Now, my question is: what if we had a class consisting of several instances together, such as gesture recognition data? Each class would have multiple instances that jointly specify it. For example:
xcor  ycord  depth
45    100    10
50    20     45
10    51     12
The above three instances belong to class A, and the three below belong to class B as a group; I mean those three data instances together constitute the class. For gesture data, these would be the coordinates of the movement of your hand.
xcor  ycord  depth
45    100    10
50    20     45
10    51     12
Now, I want every incoming group of three instances to be classified as either A or B. Is it possible to label all of them together as A or B without labeling each instance independently? As an example, assume that the following group belongs to B; I want all of its instances to be labelled together as B, not individually according to each instance's independent similarity to class A or B. If this is possible, what is it called?
xcor  ycord  depth
45    10     10
5     20     87
10    51     44
I don't see a scenario where you would want to group an indeterminate number of rows in your dataset as features of a given class. Either each row is independently associated with a class, or they are all features of one unique row. Something like this: instead of
xcor  ycord  depth
45    10     10
5     20     87
10    51     44
you would have something like:
xcor1  ycord1  depth1  xcor2  ycord2  depth2  xcor3  ycord3  depth3
45     10      10      5      20      87      10     51      44
This is pretty much the same approach that is used to model time series.
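A minimal sketch of this flattening with the rows from the question:

import numpy as np

# one gesture = a group of three (xcor, ycord, depth) rows
group = np.array([[45, 10, 10],
                  [ 5, 20, 87],
                  [10, 51, 44]])

# flatten the group into a single 9-feature row:
# (xcor1, ycord1, depth1, xcor2, ycord2, depth2, xcor3, ycord3, depth3)
row = group.reshape(-1)
print(row)   # [45 10 10  5 20 87 10 51 44]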
It seems you may be confusing different types of machine learning.
The dataset given in your class is an example of a supervised classification problem: given some data and some classes, learn a classifier that can predict classes for new, unseen data. Classifiers that you can apply to this problem include
decision trees,
support vector machines,
artificial neural networks, etc.
The second problem you are describing is an example of an unsupervised classification (clustering) problem: given some data without labels, we want to find an automatic way to separate the different types of data (your A and B) algorithmically. Algorithms that solve this problem include
k-means clustering,
mixture models,
principal component analysis followed by some sort of clustering.
I would look into running a factor analysis or normalizing your data, then running k-means or a Gaussian mixture model. This should discover the A and B types in your data if they are distinguishable.
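A minimal sketch of that pipeline with scikit-learn, using the toy rows from the question:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[45, 100, 10], [50, 20, 45], [10, 51, 12],    # the "A" group
              [45, 10, 10], [ 5, 20, 87], [10, 51, 44]])    # the "B" group

X_norm = StandardScaler().fit_transform(X)                  # normalize each feature
labels_km = KMeans(n_clusters=2, n_init=10).fit_predict(X_norm)
labels_gm = GaussianMixture(n_components=2).fit_predict(X_norm)
print(labels_km, labels_gm)   # the discovered cluster ("A"/"B") assignments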
Take a peek at the use of neural networks for recognizing hand-written text. You can think of a gesture as a hand-written figure with an additional time component (so, give each pixel an "age"). If your training data also includes similar time data, then I think the technique should carry over well.