I am using Weka 3.8.3 multilayer perceptron on Iris dataset. I have 75 training instances and 75 test instances. The thing is no matter how I change the 'seed' parameter, it does not affect the accuracy that much. It's almost always the stats below. Is seed used to randomly initialize the weight? Could someone please help to explain why it behaves this way? Many thanks.
=== Summary ===
Correctly Classified Instances 70 93.3333 %
Incorrectly Classified Instances 5 6.6667 %
I tried the same thing (Training and test 50% percentage split using the radio button) and got 72 and 3 with a random seed for XVal / % Split of 1.
When I change the random seed to 777 (or 666 or 54321) I get 73 and 2, which is a different result, so I can't replicate what you are seeing.
With a random seed of 0 I get 71 and 4.
Related
This is my first data mining project. I am using SAS Enterprise miner to train and test a classifier.
I have 3 files at my disposal,
Training file : 85 input variables and 1 target variable, with 5800+ observations
Prediction file : 85 input variables with 4000 observations
Verification file : 1 variable containing the correct predictions for the second file. Since this is an academic project, this file is here to tell us if we are doing a good job or not.
My problem is that the dataset is unbalanced (95% of 0s and 5% of 1s for the target variable in the training file). So naturally, I tried to re-sample the model using the "sampling node" as described in the following link
Here are the 2 approaches I used, they give slightly different results. But here is the general unsatisfactory result I am getting:
Without resampling : The model predicts less than ten solicited individuals (target variable = 1) over 4000 observations
With the resampling : The model predicts about 1500 solicited individuals over 4000 observations.
I am looking for 100 to 200 solicited individuals to have a model that would be considered acceptable.
Why do you think our predictions are way off this way, and how can we remedy to this situation?
Here is a screen shot of both models
There are some Technics to deal with unbalanced data. One that I remember many years ago was this approach:
say you have 100 observation solicited(minority) that are 5% of all your observations
cluster other none solicited(maturity) class, to 20 groups(each of with have 100 observation of none solicited individuals) with clustering algorithms like KMEAN, MEANSHIF, DBSCAN and...
then for each group of maturity clustered observation, create a dataset with all 100 observation solicited(minority) class. It means that you have 20 group of dataset each of witch is balanced with 100 solicited and 100 none solicited observations
train each balanced group and create a model for each of them
at prediction, predict all 20 models. for example if 15 out of 20 models say it is solicited, it is solicited
I am having a trouble in classification problem.
I have almost 400k number of vectors in training data with two labels, and I'd like to train MLP which classifies data into two classes.
However, the dataset is so imbalanced. 95% of them have label 1, and others have label 0. The accuracy grows as training progresses, and stops after reaching 95%. I guess this is because the network predict the label as 1 for all vectors.
So far, I tried dropping out layers with 0.5 probabilities. But, the result is the same. Is there any ways to improve the accuracy?
I think the best way to deal with unbalanced data is to use weights for your class. For example, you can weight your classes such that sum of weights for each class will be equal.
import pandas as pd
df = pd.DataFrame({'x': range(7),
'y': [0] * 2 + [1] * 5})
df['weight'] = df['y'].map(len(df)/2/df['y'].value_counts())
print(df)
print(df.groupby('y')['weight'].agg({'samples': len, 'weight': sum}))
output:
x y weight
0 0 0 1.75
1 1 0 1.75
2 2 1 0.70
3 3 1 0.70
4 4 1 0.70
5 5 1 0.70
6 6 1 0.70
samples weight
y
0 2.0 3.5
1 5.0 3.5
You could try another classifier on subset of examples. SVMs, may work good with small data, so you can take let's say 10k examples only, with 5/1 proportion in classes.
You could also oversample small class somehow and under-sample the another.
You can also simply weight your classes.
Think also about proper metric. It's good that you noticed that the output you have predicts only one label. It is, however, not easily seen using accuracy.
Some nice ideas about unbalanced dataset here:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Remember not to change your test set.
That's a common situation: the network learns a constant and can't get out of this local minimum.
When the data is very unbalanced, like in your case, one possible solution is a weighted cross entropy loss function. For instance, in tensorflow, apply a built-in tf.nn.weighted_cross_entropy_with_logits function. There is also a good discussion of this idea in this post.
But I should say that getting more data to balance both classes (if that's possible) will always help.
I tried to play with libsvm and 3D descriptors in order to perform object recognition. So far I have 7 categories of objects and for each category I have its number of objects (and its pourcentage) :
Category 1. 492 (14%)
Category 2. 574 (16%)
Category 3. 738 (21%)
Category4. 164 (5%)
Category5. 369 (10%)
Category6. 123 (3%)
Category7. 1025 (30%)
So I have in total 3585 objects.
I have followed the practical guide of libsvm.
Here for reminder :
A. Scaling the training and the testing
B. Cross validation
C. Training
D. Testing
I separated my data into training and testing.
By doing a 5 cross validation process, I was able to determine the good C and Gamma.
However I obtained poor results (CV is about 30-40 and my accuracy is about 50%).
Then, I was thinking about my data and saw that I have some unbalanced data (categories 4 and 6 for example). I discovered that on libSVM there is an option about weight. That's why I would like now to set up the good weights.
So far I'm doing this :
svm-train -c cValue -g gValue -w1 1 -w2 1 -w3 1 -w4 2 -w5 1 -w6 2 -w7 1
However the results is the same. I'm sure that It's not the good way to do it and that's why I ask you some helps.
I saw some topics on the subject but they were related to binary classification and not multiclass classification.
I know that libSVM is doing "one against one" (so a binary classifier) but I don't know to handle that when I have multiple class.
Could you please help me ?
Thank you in advance for your help.
I've met the same problem before. I also tried to give them different weight, which didn't work.
I recommend you to train with a subset of the dataset.
Try to use approximately equal number of different class samples. You can use all category 4 and 6 samples, and then pick up about 150 samples for every other categories.
I used this method and the accuracy did improve. Hope this will help you!
I am trying to use Vowpal Wabbit to do a binary classification, i.e. given feature values vw will classify it either 1 or 0. This is how I have the training data formatted.
1 'name | feature1:0 feature2:1 feature3:48 feature4:4881 ...
-1 'name2 | feature1:1 feature2:0 feature3:5 feature4:2565 ...
etc
I have about 30,000 1 data points, and about 3,000 0 data points. I have 100 1 and 100 0 data points that I'm using to test on, after I create the model. These test data points are classified by default as 1. Here is how I format the prediction set:
1 'name | feature1:0 feature2:1 feature3:48 feature4:4881 ...
From my understanding of the VW documentation, I need to use either the logistic or hinge loss_function for binary classifications. This is how I've been creating the model:
vw -d ../training_set.txt --loss_function logistic/hinge -f model
And this is how I try the predictions:
vw -d ../test_set.txt --loss_function logistic/hinge -i model -t -p /dev/stdout
However, this is where I'm running into problems. If I use the hinge loss function, all the predictions are -1. When I use the logistic loss function, I get arbitrary values between 5 and 11. There is a general trend for data points that should be 0 to be lower values, 5-7, and for data points that should be 1 to be from 6-11. What am I doing wrong? I've looked around the documentation and checked a bunch of articles about VW to see if I can identify what my problem is, but I can't figure it out. Ideally I would get a 0,1 value, or a value between 0 and 1 which corresponds to how strong VW thinks the result is. Any help would be appreciated!
If the output should be just -1 and +1 labels, use the --binary option (when testing).
If the output should be a real number between 0 and 1, use --loss_function=logistic --link=logistic. The loss_function=logistic is needed when training, so the number can be interpreted as probability.
If the output should be a real number between -1 and 1, use --link=glf1.
If your training data is unbalanced, e.g. 10 times more positive examples than negative, but your test data is balanced (and you want to get the best loss on this test data), set the importance weight of the positive examples to 0.1 (because there are 10 times more positive examples).
Independently of your tool and/or specific algorithm you can use "learning curves" ,and train/cross validation/test splitting to diagnose your algorithm and determine whats your problem . After diagnosing your problem you can apply adjustments to your algorithm, for example if you find you have over-fitting you can apply some actions like:
Add regularization
Get more training data
Reduce the complexity of your model
Eliminate redundant features.
You can reference Andrew Ng. "Advice for machine learning" videos on YouTube to more details on this subject.
I am trying to learn RankSVM using OHSUMED dataset and SVM Rank library as explained in following link:
http://research.microsoft.com/en-s/um/beijing/projects/letor/Baselines/RankSVM-Struct.txt
I used same parameters as link suggests for OHSUMED dataset. i.e
OHSUMED/QueryLevelNorm/cv_l1_e0.001/fold1_l1_c0.0002_e0.001.log
OHSUMED/QueryLevelNorm/cv_l1_e0.001/fold2_l1_c0.002_e0.001.log
OHSUMED/QueryLevelNorm/cv_l1_e0.001/fold3_l1_c0.01_e0.001.log
OHSUMED/QueryLevelNorm/cv_l1_e0.001/fold4_l1_c0.02_e0.001.log
OHSUMED/QueryLevelNorm/cv_l1_e0.001/fold5_l1_c0.01_e0.001.log
But when I train my model & run "svm_rank_classify" command I get following result:
Reading model...done.
Reading test examples...done.
Classifying test examples...done
Runtime (without IO) in cpu-seconds: 0.00
Average loss on test set: 0.3864
Zero/one-error on test set: 100.00% (0 correct, 22 incorrect, 22 total)
NOTE: The loss reported above is the fraction of swapped pairs averaged over
all rankings. The zero/one-error is fraction of perfectly correct
rankings!
Total Num Swappedpairs : 31337
Avg Swappedpairs Percent: 38.64
Please suggest If any steps I am missing here?
Thanks.
The zero/one-error is the percentage of rankings (i.e. qid sets) where the model ranked at least one pair incorrectly. Your accuracy over all pairs is actually:
(100 - Avg Swappedpairs Percent) = 61.36%