Caffe, shuffle train.txt before creating LMDB - machine-learning

Unfortunately, I can't use --shuffle while creating LMDB.
So, I was advised to shuffle train.txt before creating LMDB.
After shuffling, train.txt looks like this:
n07747607/n07747607_28410.JPEG 950
n02111277/n02111277_55668.JPEG 256
n02091831/n02091831_4757.JPEG 176
n04599235/n04599235_10544.JPEG 911
n03240683/n03240683_14669.JPEG 540
After creating the LMDBs for TEST and TRAIN, I'm trying to train Caffe on bvlc_reference_caffenet.
There is only one problem: after more than 10 thousand iterations, accuracy = 0.001 and loss = 6.9. As I understand it, this means the network isn't learning and is just guessing.
Can you point out what I'm doing wrong? Thank you.
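The shuffling step described in the question can be sketched in Python as follows. This is a minimal sketch, not the asker's actual script: the helper names shuffle_listing and shuffle_file are hypothetical, and the sample lines are the ones shown above. The last line also computes the loss of a uniform guess, assuming the 1000 ILSVRC classes that bvlc_reference_caffenet is normally trained on; ln(1000) is about 6.9, which is consistent with the reading that the network is only guessing.

```python
import math
import random


def shuffle_listing(lines, seed=None):
    """Return a shuffled copy of 'path label' lines, keeping each pair on one line."""
    rng = random.Random(seed)
    shuffled = [line for line in lines if line.strip()]  # drop blank lines
    rng.shuffle(shuffled)
    return shuffled


def shuffle_file(src, dst, seed=None):
    """Read src (for example train.txt), shuffle its lines, and write them to dst."""
    with open(src) as f:
        lines = f.read().splitlines()
    with open(dst, "w") as f:
        f.write("\n".join(shuffle_listing(lines, seed)) + "\n")


sample = [
    "n07747607/n07747607_28410.JPEG 950",
    "n02111277/n02111277_55668.JPEG 256",
    "n02091831/n02091831_4757.JPEG 176",
    "n04599235/n04599235_10544.JPEG 911",
    "n03240683/n03240683_14669.JPEG 540",
]
shuffled = shuffle_listing(sample, seed=0)

# Softmax loss of a uniform guess over 1000 classes (an assumption about the label set)
chance_loss = math.log(1000)
print(round(chance_loss, 2))
```

Each line keeps its image path and label together, so the shuffle itself cannot mismatch labels. If the labels in train.txt do not match the label range the network expects, shuffling will not help.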

Related

Linear Regression test data violating training data. Please explain where I went wrong

This is part of a dataset containing 1000 entries of house rent prices at different locations.
After training the model, if I send the same training data as test data, I get incorrect results. How is this even possible?
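from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Select the feature columns with a list; recent pandas versions reject a set here
X_loc = df[['area', 'rooms', 'location']]
y_loc = df['price']
X_train, X_test, y_train, y_test = train_test_split(X_loc, y_loc, test_size=1/3, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# First row of the shuffled training split (not necessarily row 0 of df)
y_pred = regressor.predict(X_train[0:1])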
DATASET:
        price  rooms  area  location
0  0    22000      3  1339       140
1  1    45000      3  1580        72
3  3    72000      3  2310        72
4  4    40000      3  1800        41
5  5    35000      3  2100        57
The expected output (y_pred) should be 22000, but it's showing 29000. How can it violate the already trained input?
What you observed is exactly what is referred to as the "training error". Machine learning models are meant to find the "best" fit, which minimizes the "total error" (i.e. the error summed over all data points, not the error at each individual data point).
22000 is not very far from 29000, although it is not the exact number. This is because linear regression tries to compress all the variation in your data onto one straight line.
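The idea of a nonzero training error can be illustrated with a small least-squares fit. This is only a sketch under a simplifying assumption: it uses a single feature (area) from the five rows shown in the question, whereas the asker's model uses three features and the full dataset, so the numbers will not match the asker's prediction.

```python
# (area, price) pairs taken from the dataset excerpt in the question
data = [(1339, 22000), (1580, 45000), (2310, 72000), (1800, 40000), (2100, 35000)]

n = len(data)
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Closed-form ordinary least squares for one feature
slope = (sum((x - mean_x) * (y - mean_y) for x, y in data)
         / sum((x - mean_x) ** 2 for x, _ in data))
intercept = mean_y - slope * mean_x

# Residual = actual price minus fitted price, for every training point
residuals = [y - (slope * x + intercept) for x, y in data]
for (x, y), r in zip(data, residuals):
    print(f"area={x} price={y} fitted={y - r:.0f} residual={r:.0f}")
```

The residuals sum to (approximately) zero, which is what minimizing the total squared error with an intercept gives, yet the individual residuals are not zero. The fitted line does not pass through each training point.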
Possibly the underlying relationship is nonlinear, so applying linear regression yields bad results. There are other reasons why linear regression may fail; cf. https://stats.stackexchange.com/questions/393706/bad-linear-regression-results
Nonlinear data often appears when there are (statistical) interactions between features.
A generalization of linear regression is the Generalized Linear Model (GLM), which is able to handle nonlinearities through its nonlinear link functions: https://en.wikipedia.org/wiki/Generalized_linear_model
In scikit-learn you can use Support Vector Regression with a polynomial or RBF kernel for a nonlinear model: https://scikit-learn.org/stable/auto_examples/svm/plot_svm_regression.html
An alternative approach is to analyze the data for interactions and apply the methods described in https://en.wikipedia.org/wiki/Generalized_linear_model#Correlated_or_clustered_data; however, this is complex. Possibly try Ridge Regression in that case, because it can handle multicollinearity (strongly correlated features): https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Ridge_Regression.pdf
https://statisticsbyjim.com/regression/difference-between-linear-nonlinear-regression-models/

Unbalanced model, confused as to what steps to take

This is my first data mining project. I am using SAS Enterprise Miner to train and test a classifier.
I have 3 files at my disposal:
Training file : 85 input variables and 1 target variable, with 5800+ observations
Prediction file : 85 input variables with 4000 observations
Verification file : 1 variable containing the correct predictions for the second file. Since this is an academic project, this file is here to tell us if we are doing a good job or not.
My problem is that the dataset is unbalanced (95% of 0s and 5% of 1s for the target variable in the training file). So naturally, I tried to re-sample the data using the "sampling node" as described in the following link
Here are the 2 approaches I used, they give slightly different results. But here is the general unsatisfactory result I am getting:
Without resampling : The model predicts less than ten solicited individuals (target variable = 1) over 4000 observations
With the resampling : The model predicts about 1500 solicited individuals over 4000 observations.
I am looking for 100 to 200 solicited individuals to have a model that would be considered acceptable.
Why do you think our predictions are so far off, and how can we remedy this situation?
Here is a screen shot of both models
There are some techniques for dealing with unbalanced data. One that I remember from many years ago is this approach:
Say you have 100 solicited (minority) observations, which are 5% of all your observations.
Cluster the non-solicited (majority) class into about 20 groups (each with about 100 non-solicited observations) using a clustering algorithm such as k-means, mean shift, DBSCAN, etc.
Then, for each group of clustered majority observations, create a dataset that also contains all 100 solicited (minority) observations. This gives you 20 datasets, each of which is balanced, with 100 solicited and 100 non-solicited observations.
Train a model on each balanced dataset.
At prediction time, run all 20 models. For example, if 15 out of 20 models say an observation is solicited, it is solicited.
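The grouping and voting steps above can be sketched as follows. Note the swapped-in technique: this sketch splits the majority class by a random shuffle rather than by a clustering algorithm, simply to keep the example self-contained. The function names, the placeholder rows, and the 2000/100 class sizes are assumptions for illustration, and the model training itself is left out.

```python
import random


def make_balanced_groups(minority, majority, seed=0):
    """Pair every minority row with one equally sized chunk of majority rows.

    The majority rows are split by a random shuffle here, a stand-in
    for the clustering step described in the answer.
    """
    rng = random.Random(seed)
    pool = list(majority)
    rng.shuffle(pool)
    size = len(minority)
    groups = []
    for start in range(0, len(pool) - size + 1, size):
        groups.append((list(minority), pool[start:start + size]))
    return groups


def vote(predictions, min_agree):
    """Return 1 if at least min_agree of the 0/1 predictions are 1, else 0."""
    return 1 if sum(predictions) >= min_agree else 0


# Placeholder rows; real rows would be feature vectors
minority = [("solicited", i) for i in range(100)]
majority = [("not_solicited", i) for i in range(2000)]

groups = make_balanced_groups(minority, majority)
print(len(groups), "balanced datasets")

# One prediction per model for a single observation: 15 of 20 say "solicited"
print(vote([1] * 15 + [0] * 5, min_agree=15))
```

The agreement threshold (min_agree) controls how many observations end up predicted as solicited, so it is one knob for moving between the "fewer than ten" and "about 1500" extremes described in the question.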

Classification with imbalanced dataset using Multi Layer Perceptrons

I am having trouble with a classification problem.
I have almost 400k vectors in my training data with two labels, and I'd like to train an MLP that classifies the data into two classes.
However, the dataset is very imbalanced. 95% of the vectors have label 1, and the others have label 0. The accuracy grows as training progresses, and stops after reaching 95%. I guess this is because the network predicts label 1 for all vectors.
So far, I have tried dropout layers with probability 0.5, but the result is the same. Are there any ways to improve the accuracy?
I think the best way to deal with unbalanced data is to use weights for your classes. For example, you can weight your classes such that the sum of the weights for each class is equal.
import pandas as pd
df = pd.DataFrame({'x': range(7),
                   'y': [0] * 2 + [1] * 5})
# Weight each sample by N / (n_classes * class_count) so every class has the same total weight
df['weight'] = df['y'].map(len(df) / 2 / df['y'].value_counts())
print(df)
# Named aggregation; the dict-renaming form of agg was removed in pandas 1.0
print(df.groupby('y')['weight'].agg(samples='count', weight='sum'))
output:
   x  y  weight
0  0  0    1.75
1  1  0    1.75
2  2  1    0.70
3  3  1    0.70
4  4  1    0.70
5  5  1    0.70
6  6  1    0.70
   samples  weight
y
0        2     3.5
1        5     3.5
You could try another classifier on a subset of the examples. SVMs may work well with small data, so you could take, say, only 10k examples with a 5:1 class ratio.
You could also oversample the small class somehow and under-sample the other.
You can also simply weight your classes.
Think also about a proper metric. It's good that you noticed that your model predicts only one label. That is, however, not easily seen using accuracy.
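To illustrate why accuracy hides this failure, here is a small sketch that scores a classifier which always predicts label 1 on data with the 95/5 split from the question. The 100-sample size and the helper name confusion are assumptions for illustration.

```python
def confusion(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, treating 1 as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


y_true = [1] * 95 + [0] * 5   # 95% label 1, 5% label 0
y_pred = [1] * 100            # a model that always answers 1

tp, tn, fp, fn = confusion(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall_1 = tp / (tp + fn)     # share of label-1 samples found
recall_0 = tn / (tn + fp)     # share of label-0 samples found
balanced_accuracy = (recall_1 + recall_0) / 2

print(accuracy, recall_1, recall_0, balanced_accuracy)
```

Accuracy comes out at 0.95 even though no label-0 sample is ever recognized. The per-class recall and the balanced accuracy (0.5, the same as guessing) make the problem visible.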
Some nice ideas about unbalanced dataset here:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Remember not to change your test set.
That's a common situation: the network learns a constant and can't get out of this local minimum.
When the data is very unbalanced, like in your case, one possible solution is a weighted cross-entropy loss function. For instance, in TensorFlow, you can use the built-in tf.nn.weighted_cross_entropy_with_logits function. There is also a good discussion of this idea in this post.
But I should say that getting more data to balance both classes (if that's possible) will always help.
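A weighted cross-entropy loss can be sketched in plain Python to show what the weighting does. This is not TensorFlow code; it is a hand-written per-example version of the formula that function documents, pos_weight * label * -log(sigmoid(logit)) + (1 - label) * -log(1 - sigmoid(logit)). The pos_weight value of 5/95 is an illustrative assumption: in the question label 1 is the 95% majority, and pos_weight scales the label-1 term, so a value below 1 reduces the majority's influence.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce_with_logits(label, logit, pos_weight):
    """Per-example weighted binary cross-entropy computed from a raw logit.

    -log(sigmoid(z)) equals softplus(-z) and -log(1 - sigmoid(z)) equals softplus(z).
    """
    return pos_weight * label * softplus(-logit) + (1 - label) * softplus(logit)


# Label 1 is the majority class here, so its term is scaled down (illustrative value)
pos_weight = 5 / 95

loss_majority = weighted_bce_with_logits(1, 0.0, pos_weight)
loss_minority = weighted_bce_with_logits(0, 0.0, pos_weight)
print(loss_majority, loss_minority)
```

With this weighting, an undecided prediction (logit 0) costs 19 times more on a label-0 example than on a label-1 example, so always predicting label 1 is no longer a cheap solution. An equivalent option is to swap the labels so that the rare class is the positive one and use pos_weight = 95/5.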

Why does my CNN based on AlexNet fail at classification?

I'm trying to build a CNN to classify dogs. My dataset consists of 5 classes of dogs. I have 50 images of dogs, split into 40 images for training and 10 for testing.
I've trained my network, based on the pretrained AlexNet model, for 100,000 and 140,000 iterations, but the accuracy is always between 20% and 30%.
I adapted AlexNet to my problem as follows: I changed the name of the last fully connected layer and set its num_output to 5. I also changed the name of the first fully connected layer (fc6).
So why did this model fail even though I used data augmentation (cropping)?
Should I train a linear classifier on the top layer of my network, since I have little data and it is similar to the AlexNet dataset (as mentioned here: transfer learning), or is my dataset very different from the original AlexNet dataset, so that I should train a linear classifier on an earlier layer of the network?
Here is my solver:
net: "models/mymodel/train_val.prototxt"
test_iter: 1000
test_interval: 1000
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 100000
display: 20
max_iter: 200000
momentum: 0.9
weight_decay: 0.0005
snapshot: 1000
snapshot_prefix: "models/mymodel/my_model_alex_net_train"
solver_mode: GPU
Although you haven't given us much debugging information, I suspect that you've done some serious over-fitting. In general, a model's "sweet spot" for training is dependent on epochs, not iterations. Single-node AlexNet and GoogleNet, on an ILSVRC-style database, train in 50-90 epochs. Even if your batch size is as small as 1, you've trained for 2500 epochs with only 5 classes. With only 8 images per class, the AlexNet topology is serious overkill and has likely adapted to each individual photo.
Consider this: you have only 40 training photos, but 96 kernels in the first convolution layer and 256 in the second. This means that your model can spend over 2 kernels in conv1 and 6 in conv2 for each photograph! You get no commonality of features, no averaging ... instead of edge detection generalizing to finding faces, you're going to have dedicated filters tuned to the individual photos.
In short, your model is trained to find "Aunt Polly's dog on a green throw rug in front of the kitchen cabinet with a patch of sun to the left." It doesn't have to learn to discriminate a basenji from a basset, just to recognize whatever is randomly convenient in each photo.
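The arithmetic behind the answer above can be written out as a short sketch. The batch size of 1 is the answer's own lower-bound assumption (the actual batch size is set in train_val.prototxt, which is not shown), and the function name epochs_trained is hypothetical.

```python
def epochs_trained(iterations, batch_size, num_train_images):
    """Number of passes over the training set after the given number of iterations."""
    return iterations * batch_size / num_train_images


num_train = 40      # training images, from the question
num_classes = 5

# Lower bound from the answer: batch size 1, 100,000 iterations
epochs = epochs_trained(100_000, 1, num_train)
images_per_class = num_train / num_classes

# AlexNet kernel counts quoted in the answer
kernels_per_image_conv1 = 96 / num_train
kernels_per_image_conv2 = 256 / num_train

print(epochs, images_per_class, kernels_per_image_conv1, kernels_per_image_conv2)
```

Any batch size larger than 1 multiplies the epoch count further, so the 2500 epochs is a floor, far above the 50-90 epochs the answer cites for ILSVRC-scale training.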

LibSVM - Multi class classification with unbalanced data

I tried to play with libsvm and 3D descriptors in order to perform object recognition. So far I have 7 categories of objects, and for each category I have the number of objects (and its percentage):
Category 1. 492 (14%)
Category 2. 574 (16%)
Category 3. 738 (21%)
Category 4. 164 (5%)
Category 5. 369 (10%)
Category 6. 123 (3%)
Category 7. 1025 (30%)
So I have 3485 objects in total.
I have followed the practical guide of libsvm.
As a reminder:
A. Scaling the training and the testing
B. Cross validation
C. Training
D. Testing
I separated my data into training and testing.
By doing 5-fold cross-validation, I was able to determine good values for C and gamma.
However, I obtained poor results (the CV accuracy is about 30-40% and my test accuracy is about 50%).
Then I thought about my data and saw that it is unbalanced (categories 4 and 6, for example). I discovered that libSVM has an option for class weights. That's why I would now like to set up good weights.
So far I'm doing this :
svm-train -c cValue -g gValue -w1 1 -w2 1 -w3 1 -w4 2 -w5 1 -w6 2 -w7 1
However, the results are the same. I'm sure that this is not the right way to do it, and that's why I'm asking for help.
I saw some topics on the subject, but they were related to binary classification and not multiclass classification.
I know that libSVM uses "one against one" (so binary classifiers), but I don't know how to handle that when I have multiple classes.
Could you please help me?
Thank you in advance for your help.
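One common heuristic for choosing the weights asked about above is to make each weight inversely proportional to the class size. The sketch below computes such weights from the counts in the question and formats them as -wi options. The heuristic itself (largest class count divided by class count) is an assumption, not something libSVM prescribes, and cValue and gValue remain the placeholders from the question.

```python
# Object counts per category, from the question
counts = {1: 492, 2: 574, 3: 738, 4: 164, 5: 369, 6: 123, 7: 1025}

# Heuristic (an assumption): weight = largest class size / this class size,
# so the largest class gets weight 1 and rarer classes get larger weights
largest = max(counts.values())
weights = {label: largest / n for label, n in counts.items()}

# Format as libSVM -wi options, which scale C for class i
flags = " ".join(f"-w{label} {w:.2f}" for label, w in sorted(weights.items()))
command = f"svm-train -c cValue -g gValue {flags}"
print(command)
```

Because the weights change the effective C per class, the values of C and gamma found by cross-validation without weights may no longer be the best ones, so it may be worth repeating the grid search with the weights in place.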
I've run into the same problem before. I also tried giving the classes different weights, which didn't work.
I recommend training with a subset of the dataset.
Try to use an approximately equal number of samples from each class. You can use all the category 4 and 6 samples, and then pick about 150 samples from each of the other categories.
I used this method and the accuracy did improve. Hope this helps!
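The subsampling described in this answer can be sketched as follows. It is a minimal sketch: the function name subsample_indices and the fixed random seed are assumptions, and the labels are generated from the category counts in the question instead of being read from a real libSVM data file.

```python
import random
from collections import Counter


def subsample_indices(labels, per_class, keep_all, seed=0):
    """Pick per_class random samples from each class, keeping whole classes in keep_all."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    chosen = []
    for label, indices in by_class.items():
        if label in keep_all or len(indices) <= per_class:
            chosen.extend(indices)              # small class: keep everything
        else:
            chosen.extend(rng.sample(indices, per_class))
    return sorted(chosen)


# Rebuild a label list from the category counts in the question
counts = {1: 492, 2: 574, 3: 738, 4: 164, 5: 369, 6: 123, 7: 1025}
labels = [label for label, n in counts.items() for _ in range(n)]

subset = subsample_indices(labels, per_class=150, keep_all={4, 6})
subset_counts = Counter(labels[i] for i in subset)
print(sorted(subset_counts.items()))
```

Only the training split should be subsampled this way. The test split should keep its original class proportions so that the reported accuracy reflects the real data.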
