Specific amount of training data required for fine-tuning a ResNet-50 model

For decent generalization, how many images per class are needed to fine-tune a ResNet-50 model for ASL hand-sign classification (24 classes)? I have around 600 images per class and the model is overfitting very badly.

I can't give you a number, but I can give you a method to find it yourself. The technique is to plot a "learning curve", where the x-axis is the number of training samples and the y-axis is the score. You start with very few training samples and increase towards your 600, plotting two curves: the training error and the test error. You can then see how much more data alone, without any other change, would influence the result.
There are more details, and the example learning curve discussed below, in section 2.5.4 of my master's thesis.
In that example you can see that, up to about 20 training samples, each new example improves the test score a lot (the green curve drops sharply). After that point, simply throwing more data at the problem does not help much.
The curve will look different in your case, but the principle should be the same.
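As a minimal sketch of such a loop in Python, where `train_and_evaluate` is a hypothetical placeholder for your own routine (fine-tune ResNet-50 on a random subset of n images per class and return training and validation error):

```python
import matplotlib.pyplot as plt

# Subset sizes to try; adjust to your dataset.
samples_per_class = [10, 25, 50, 100, 200, 400, 600]
train_err, val_err = [], []

for n in samples_per_class:
    # train_and_evaluate is a placeholder you implement: fine-tune ResNet-50
    # on n images per class and return (training error, validation error).
    tr, va = train_and_evaluate(n)
    train_err.append(tr)
    val_err.append(va)

plt.plot(samples_per_class, train_err, label="training error")
plt.plot(samples_per_class, val_err, label="validation error")
plt.xlabel("training samples per class")
plt.ylabel("error")
plt.legend()
plt.show()
```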
Other analysis
Look at chapters 2.5 and 2.6 of my master's thesis. I especially recommend having a look at the confusion matrix and at confusion matrix ordering. This will give you an idea of which classes are confused. Maybe some classes are just inherently difficult to distinguish? Maybe you can add more features? Maybe there are labeling errors? Have a look at chapter 2.5 for more of those "maybe"s.
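For instance, here is a minimal sketch of plotting a confusion matrix with scikit-learn; the random `y_true`/`y_pred` arrays are placeholders for your model's actual labels and predictions on a held-out set:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Placeholders: replace with your ground-truth labels and model predictions.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 24, size=500)
y_pred = rng.integers(0, 24, size=500)

cm = confusion_matrix(y_true, y_pred)
ConfusionMatrixDisplay(cm).plot(cmap="Blues")
plt.show()

# Large off-diagonal entries point to class pairs your model confuses.
```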

Related

Autoencoder Failing to Capture Small Artifacts

tl;dr - I use an autoencoder to try to reduce the input dimensions for a reinforcement-learning (RL) agent learning to play Atari-KungFu. But it fails at encoding/decoding the thrown knives, because they are only a couple of pixels and getting them wrong probably has a negligible impact on the autoencoder's MSE loss (see the green arrows in the bottom left of the image). This will probably permanently hobble the results. I want to figure out if there is a way to solve this, preferably with a generalized solution, but for now I'd be happy with something specific to this problem.
Background:
I am working on Week 5 of the "Practical Reinforcement Learning" course on Coursera (National Research University HSE), and I decided to spend extra time trying to improve performance on the Atari-KungFu assignment using an actor-critic architecture. This post is not about actor-critic, but about an interesting sub-problem I ran into related to autoencoders.
I create an encoder which outputs a 64-neuron tanh layer, which is used as a common input to the decoder, the policy learner (actor), and the value learner (critic). During training, the simulator returns batches of four sequential frames (64 x 144 x 4) and the rewards from the last action. The images are first used to train the autoencoder, then used together with the rewards to train the actor and critic branches.
I display some metrics and example frames every 25000 iterations to see how it's doing. If the reconstructed images are accurate, then the inputs to the actor & critic branches should be getting good distilled information for efficient learning.
You can see below that the autoencoder is pretty good except for the thrown knives (see bottom left). Arguably this is because missing those couple of pixels barely increases the MSE loss of the reconstructed image, so there is little incentive to learn them (and there are also not many frames that contain knives). Yet seeing those knives is critical for the RL agent to learn how to survive.
I haven't seen this kind of problem addressed before. A tiny artifact in the input images is crucial for learning, but is unlikely to be learned by the autoencoder. Can we fix/improve this?
IMO your problem is loss-specific. A few things that would probably help the autoencoder reconstruct the knives as well:
Find knives in the input image using image-processing techniques, and give the regions where knives are present a higher weight in the MSE loss, say 10 times more. One way to find those regions semi-automatically could be a convolution with a fairly large kernel: a bright pixel at the exact centre adds to the response, while anything other than dark pixels around it reduces it (see the sketch after these points). Something along these lines should pick out regions containing only knives; the throwing enemies would not match, as they contain too many white pixels and holes. An empirically chosen threshold on the kernel response should be enough to find them.
Lower the loss for images in which no knife was found, say by half. This would make the autoencoder focus harder on the rarely seen frames in which a knife appears.
On the downside, I suppose this could introduce some artifacts. In that case you might consider using a pretrained encoder (such as some version of ResNet) to increase the model's capacity.
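A rough NumPy sketch of the first two ideas, assuming single-channel frames scaled to [0, 1]; the kernel shape, weights, and threshold are illustrative assumptions, not tested values:

```python
import numpy as np
from scipy.signal import convolve2d

def knife_mask(frame, threshold=0.5):
    """Rough detector for tiny bright specks in a [0, 1]-scaled grayscale frame."""
    # The centre pixel scores positively and its 7x7 neighbourhood negatively,
    # so isolated bright specks respond strongly while large bright sprites
    # (the throwing enemies) mostly cancel out.
    kernel = -np.ones((7, 7)) / 48.0
    kernel[3, 3] = 1.0
    response = convolve2d(frame, kernel, mode="same")
    return response > threshold  # tune the threshold empirically

def weighted_mse(frame, reconstruction, knife_weight=10.0, no_knife_scale=0.5):
    """MSE that up-weights suspected knife pixels and damps knife-free frames."""
    mask = knife_mask(frame)
    weights = np.where(mask, knife_weight, 1.0)
    loss = np.mean(weights * (frame - reconstruction) ** 2)
    if not mask.any():
        loss *= no_knife_scale  # halve the loss when no knife was found
    return loss
```

In an actual training loop you would express the same weighting inside your framework's loss function so gradients flow through it; the mask itself can stay non-differentiable, since it is computed from the input frame only.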

How can I re-train my logistic model using pymc3?

I have a binary classification problem with around 15 features, which I have chosen using another model. Now I want to perform Bayesian logistic regression on these features. My target classes are highly imbalanced (the minority class is 0.001%) and I have around 6 million records. I want to build a Bayesian logistic model that can be retrained nightly or at weekends.
Currently, I have divided the data into 15 parts. I train my model on the first part and test on the last part, then update my priors using the Interpolated method of pymc3 and rerun the model on the second part, and so on. I check the accuracy and other metrics (ROC, F1-score) after each run.
Problems:
My score is not improving.
Am I using the right approach?
This process is taking too much time.
If someone can guide me towards the right approach, with code snippets, that would be very helpful.
You can use variational inference (VI). It is faster than sampling and produces broadly similar results, and pymc3 itself provides methods for VI that you can explore; a minimal sketch is below.
I can only answer this part of the question. If you can elaborate on your problem a bit further, maybe I can help more.
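As a minimal, hypothetical sketch of Bayesian logistic regression fitted with ADVI in pymc3 (the priors, iteration count, and random placeholder data are my assumptions, not your actual pipeline):

```python
import numpy as np
import pymc3 as pm

# Placeholders: swap in one chunk of your real feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 15))
y = rng.integers(0, 2, size=1000)

with pm.Model() as model:
    intercept = pm.Normal("intercept", mu=0.0, sigma=10.0)
    coefs = pm.Normal("coefs", mu=0.0, sigma=1.0, shape=X.shape[1])
    p = pm.math.sigmoid(intercept + pm.math.dot(X, coefs))
    pm.Bernoulli("obs", p=p, observed=y)

    approx = pm.fit(n=20000, method="advi")  # variational inference instead of MCMC
    trace = approx.sample(1000)              # draws from the fitted approximation
```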

Overfitting my model on training data consisting of a single sample

I am trying to over-fit my model on training data that consists of only a single sample. The training accuracy comes out to be 1.00, but when I predict the output for my test data, which consists of the same single training sample, the results are not accurate. The model has been trained for 100 epochs and the loss is ~1e-4.
What could be the possible sources of error?
As mentioned in the comments of your post, it isn't possible to give specific advice without you first providing more details.
Generally speaking, your approach of overfitting a tiny batch (in your case one image) in essence provides three sanity checks, i.e. that:
backprop is functioning
the weight updates are doing their job
the learning rate is in the correct order of magnitude
As is pointed out by Andrej Karpathy in Lecture 5 of the CS231n course at Stanford - "if you can't overfit on a tiny batch size, things are definitely broken".
This means, given your description, that something in your implementation is incorrect. I would start by checking each of the three points listed above. For example, alter your test somehow by picking several different images, or a batch size of 5 images instead of one. You could also revisit your predict function, as that is likely where the discrepancy lies, given that you are getting essentially zero error during training (and presumably during validation?).
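As a minimal sanity-check sketch (tf.keras is assumed here, since the post does not say which framework is used; the toy model and random sample are placeholders): train on one sample, then predict on the exact same array with the exact same preprocessing, and check that the two agree.

```python
import numpy as np
import tensorflow as tf

x = np.random.rand(1, 32, 32, 3).astype("float32")  # single placeholder sample
y = np.array([1])                                    # its label

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, epochs=100, verbose=0)               # should reach ~zero loss

pred = model.predict(x).argmax(axis=1)               # same array, same preprocessing
print("trained label:", y[0], "predicted:", pred[0])  # these should agree
```

If the prediction disagrees here, the problem is in the predict path (different preprocessing, wrong output decoding, or a different model object), not in the training.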

How to choose feature selection method? By data or some rules?

I have been using some feature selection methods individually, e.g. RFE or SelectKBest, for multi-label classification. Is there a technique or method that can be used to choose a feature selection method dynamically, for instance according to the statistics of the test data or some rule-based approach?
This probably isn't the answer you're looking for, but you could try each one and cross-validate it against some test data. It should be fairly trivial to script this (see the sketch below).
I don't know of any better way of picking a feature selection algorithm than this, but it can bias you towards the test data you've used.
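For example, a scikit-learn sketch that compares two selectors by cross-validation; the single-label toy data, estimator, and k are placeholder choices (for multi-label data you would swap in a multi-label-capable estimator and scorer):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Placeholder data; replace with your own X, y.
X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                           random_state=0)

selectors = {
    "rfe": RFE(LogisticRegression(max_iter=1000), n_features_to_select=10),
    "kbest": SelectKBest(f_classif, k=10),
}

for name, selector in selectors.items():
    pipe = Pipeline([("select", selector),
                     ("clf", LogisticRegression(max_iter=1000))])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(name, scores.mean())
```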
These answers may help.
My assumption about the feature statistics is: a good feature has maximal distance between the class means and minimal variance within each class.
I start with a small training set, test this assumption, and increase the training set if the results look promising.
The final optimization is a comparison of the histograms of the feature means: features with similar histograms are removed. Those are redundant features, which decrease the accuracy considerably (5-10%, at least with an SVM).
With this approach I reach 95% accuracy on my dataset of 5 classes and 600 instances, and training takes less than an hour. Manual feature selection used to reach 98%, but with many days of experimenting. A rough sketch of the scoring heuristic is given below.
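A NumPy sketch of that heuristic as I understand it (essentially a Fisher-score-like ranking; the exact formula is my interpretation, not the author's code):

```python
import numpy as np

def feature_scores(X, y):
    """Higher score = larger spread of class means relative to within-class variance."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    between = means.var(axis=0)              # spread of class means, per feature
    within = variances.mean(axis=0) + 1e-12  # average within-class variance
    return between / within

# Usage: rank features and keep the best k, e.g.
# scores = feature_scores(X, y)
# top_k = np.argsort(scores)[::-1][:10]
```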

One versus rest classifier

I'm implementing a one-versus-rest classifier to discriminate between neural data corresponding (1) to moving a computer cursor up and (2) to moving it in any of the other seven cardinal directions or no movement. I'm using an SVM classifier with an RBF kernel (built with LIBSVM), and I did a grid search to find the best possible gamma and cost parameters for my classifier. I have tried using training data with 338 elements from each of the two classes (undersampling my large "rest" class), and I have also used 338 elements from my first class and 7218 from my second one with a weighted SVM.
I have also used feature selection to bring the number of features I'm using down from 130 to 10. I tried using the ten "best" features and the ten "worst" features when training my classifier. I have also used the entire feature set.
Unfortunately, my results are not very good, and moreover, I cannot find an explanation why. I tested with 37759 data points, of which 1687 came from the "one" (i.e. "up") class and the remaining 36072 came from the "rest" class. In all cases my classifier is 95% accurate, BUT everything that is predicted correctly falls into the "rest" class (i.e. all my data points are predicted as "rest", and everything that is predicted incorrectly falls into the "one"/"up" class). When I tried testing with 338 data points from each class (the same ones I used for training), I found that the number of support vectors was 666, which is ten fewer than the number of data points. In this case the accuracy is only 71%, which is surprising since my training and testing data are exactly the same.
Do you have any idea what could be going wrong? If you have any suggestions, please let me know.
Thanks!
If the test dataset is the same as the training data, then your training accuracy was 71%. There is nothing inherently wrong with that; the data may simply not be well separable by the kernel you used.
However, one point of concern is that the high number of support vectors suggests probable overfitting.
Not sure if this amounts to an answer - it would probably be hard to give one without actually seeing the data - but here are some ideas regarding the issue you describe:
In general, an SVM tries to find a hyperplane that best separates your classes. However, since you have opted for a single one-versus-rest classifier, you have no choice but to mix all negative cases together (your 'rest' class). This might make the 'best' separation much less suited to your problem; I'm guessing this is a major issue here.
To verify if that's the case, I suggest trying to use only one other cardinal direction as the negative set, and see if that improves results. In case it does, you can train 7 classifiers, one for each direction. Another option might be to use the multiclass option of libSVM, or a tool like SVMLight, which is able to classify one against many.
One caveat of most SVM implementations is that they do not cope well with large differences in size between the positive and negative sets, even with weighting. From my experience, weighting factors over 4-5 are problematic in many cases. On the other hand, since the variety on your negative side is large, taking equal sizes might also be less than optimal. Thus, I'd suggest using something like the 338 positive examples and around 1000-1200 random negative examples, with weighting (a rough sketch is below).
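A scikit-learn sketch of that suggestion (scikit-learn's SVC wraps LIBSVM; the Gaussian placeholder data, subsample size, and weights are illustrative, not tuned values):

```python
import numpy as np
from sklearn.svm import SVC

# Placeholders: replace with your real "up" and "rest" feature matrices.
rng = np.random.default_rng(0)
X_pos = rng.normal(1.0, 1.0, size=(338, 10))
X_neg = rng.normal(0.0, 1.0, size=(7218, 10))

# Subsample ~1000-1200 negatives instead of using all of them or only 338.
neg_idx = rng.choice(len(X_neg), size=1100, replace=False)
X = np.vstack([X_pos, X_neg[neg_idx]])
y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(neg_idx))])

# Weight the positive class more heavily to offset the remaining imbalance.
clf = SVC(kernel="rbf", class_weight={1: 3.0, 0: 1.0}, gamma="scale", C=1.0)
clf.fit(X, y)

# Evaluate on held-out data with precision/recall per class rather than raw
# accuracy, since accuracy is dominated by the majority "rest" class.
```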
A little off-topic for your question, but I would also consider other types of classifiers; to start with, I'd suggest trying kNN.
Hope it helps :)
