Does using softmax with large number of classes (for example 10k) have any conceptual problems or numerical stability problems?
Softmax itself will not cause any problem. However, there can be a problem due to the network's L2-norm error: if you really have 10k classes to classify, a small numerical perturbation of the weights can cause a huge difference in the last layer's output.
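For what it's worth, the usual numerical concern with softmax itself (overflow in the exponentials when logits are large) is independent of the number of classes and is handled by the standard max-subtraction trick. A minimal NumPy sketch:

```python
import numpy as np

def stable_softmax(logits):
    # Subtracting the max before exponentiating prevents overflow without
    # changing the result, since softmax is invariant to adding a constant.
    z = logits - np.max(logits, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

# Even with 10k classes and large logits, the probabilities still sum to ~1.0
print(stable_softmax(100.0 * np.random.randn(10000)).sum())
```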
I used the three algorithms with the same training and test set. However, I'm getting the exact same mean accuracy for all of them. Is there any good explanation of why this is the case? I read it may have something to do with the classes being easy to separate.
It's a bit weird that this happens, but I have to make some assumptions regarding your question:
1. The MLP architecture consists of n inputs, one hidden neuron, and m outputs.
2. You use a linear kernel for the SVM.
3. Your test and training data are linearly separable.
If 1. is true, your network is equivalent to logistic regression, and your data being separable by a line leads to 3. and also to 2., because you won't need a kernel function to separate the data.
So, the weird part is that the SVM turns the computation of your classifier's decision boundary into a convex optimization problem, solving for the optimal boundary. Neither logistic regression nor an MLP does this. Hence, your test data must be really easy to separate and must lie at a larger margin from the decision boundary than your training data. That way, an optimal margin between the classes is not necessary, and any boundary that separates the classes without error is sufficient.
They can all give identical performance if your problem is simple enough. There's nothing stopping them from giving you identical results.
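As a quick illustration (a sketch assuming sklearn and synthetic data; none of these settings come from the original question), all three kinds of model can end up with identical test accuracy when the classes are easy to separate:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Well-separated synthetic classes (class_sep chosen to make the task easy)
X, y = make_classification(n_samples=500, n_features=5, class_sep=5.0,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (SVC(kernel="linear"),
              LogisticRegression(max_iter=1000),
              MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                            random_state=0)):
    # All three typically reach the same (perfect) test accuracy here
    print(type(model).__name__, model.fit(X_tr, y_tr).score(X_te, y_te))
```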
Based on data that our business department supplied to us, I used the sklearn decision tree algorithm to determine the ROC_AUC for a binary classification problem.
The data consists of 450 rows and there are 30 features in the data.
I used 10 repetitions of a StratifiedKFold split into training and test data. As a result, I got the following ROC_AUC values:
0.624
0.594
0.522
0.623
0.585
0.656
0.629
0.719
0.589
0.589
0.592
As I am new to machine learning, I am unsure whether such a variation in the ROC_AUC values is to be expected (with a minimum of 0.522 and a maximum of 0.719).
My questions are:
Is such a big variation to be expected?
Could it be reduced with more data (=rows)?
Will the ROC_AUC variance get smaller, if the ROC_AUC gets better ("closer to 1")?
Well, you do k-fold splits to actually evaluate how well your model generalizes.
Therefore, from your current results I would assume the following:
This is a difficult problem; the AUCs are consistently low.
0.71 is an outlier; you were (probably) just lucky there.
Important questions that will help us help you:
What is the proportion of the binary classes? Are they balanced?
What are the features? Are they all continuous? If categorical, are they ordinal or nominal?
Why Decision Tree? Have you tried other methods? Logistic Regression for instance is a good start before you move on to more advanced ML methods.
You should run more iterations: instead of k-fold, use the ShuffleSplit function and run at least 100 iterations, then compute the average AUC with a 95% confidence interval. That will give you a better idea of how well the models perform.
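A minimal sketch of that procedure with sklearn (the synthetic data below only stands in for your real 450-row, 30-feature dataset, and the classifier and test-size settings are placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder data matching the shape described in the question
X, y = make_classification(n_samples=450, n_features=30, random_state=0)

cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0),
                         X, y, scoring="roc_auc", cv=cv)

mean_auc = scores.mean()
# Normal-approximation 95% confidence interval for the mean AUC
ci = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
print(f"mean ROC_AUC = {mean_auc:.3f} +/- {ci:.3f}")
```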
Hope this helps!
Is such a big variation to be expected?
This is a textbook case of high variance.
Depending on the difficulty of your problem, 405 training samples may not be enough for the model to generalize properly, and the decision tree may be too powerful (too flexible) for so little data.
Try adding some regularization by limiting the number of splits the tree is allowed to make. This should reduce the variance of your model, though possibly at the cost of somewhat lower average performance.
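For example, with sklearn's DecisionTreeClassifier this might look like the following (the parameter values are only illustrative starting points, not recommendations):

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=3,           # cap the depth of the tree
    min_samples_leaf=20,   # forbid very small leaves
    min_samples_split=40,  # require enough samples before splitting
    random_state=0,
)
```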
Could it be reduced with more data (=rows)?
Yes, adding data is the other popular way of lowering the variance of your model. If you're familiar with deep learning, you'll know that deep models usually need LOTS of samples to learn properly. That's because they are very powerful models with an intrinsically high variance, and therefore a lot of data is needed for them to generalize.
Will the ROC_AUC variance get smaller, if the ROC_AUC gets better ("closer to 1")?
Variance will decrease with regularization and with more data; it has no direct relation to the actual performance "number" that you get.
Cheers
I'm studying machine learning in depth, down to its mathematical foundations. It's perfectly clear to me that it works mathematically, but there is one thing that I cannot get.
My question is simple as that:
Why does a linear model work when training an image-to-character classification model (using, for example, the notMNIST dataset as the training source)? As far as I know, with a linear model we are saying that the output is a linear function of the input plus a bias parameter. But I also know that a linear model doesn't work well for other kinds of applications.
So, why does it work for this and not for others?
Model complexity varies with the problem being solved. MNIST is a very simple case, and happens to be susceptible to linear combination, due to the narrow range of both inputs (face-on numbers in gray scale) and outputs (one of 10 digits) and their inherent differences. For instance, 4 and 9 have different connectivity, a property that linear combination can discern. Given enough nodes, an MNIST model can be trained into the upper 90s with little problem.
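To make this concrete, the linear model in question is just one affine map followed by softmax, i.e. multinomial logistic regression, and even that reaches high accuracy on digit images. A minimal sketch (using sklearn's small digits dataset as a stand-in for notMNIST):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)   # softmax over the 10 digit classes
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))  # typically well above 0.9
```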
Consider instead the ILSVRC image set, where discrimination depends on color, stance, relative proportion of subject parts (e.g. wolfhound vs poodle), and other characteristics both small and large. These require scaling, generalization, adaptability to interfering objects (e.g. bushes in the foreground) and other properties. A sufficiently large linear network would likely differentiate ten classes reasonably well, but would not make the fine discrimination of 1000.
I just found this blog that helps highlight some of the complexity of MNIST ... and its simplification.
I tried batch normalization for the LSTM weights as per https://arxiv.org/abs/1603.09025 on a convolutional-RNN based network and got a notable improvement in training speed and performance. The features extracted from the CNN are fed into 2 layers of bidirectional LSTM.
In my first network I used few feature maps, so the input to the LSTM layers was 128. However, when I increase the input size (e.g. 256), I start getting NaNs in the LSTM output after some iterations (it works fine without batch normalization). I understand that this might be related to division by small numbers. I also used an epsilon of 10^-6, but I still get NaNs.
Any ideas on what I can do to get rid of the NaNs? Thanks.
For those who are having the same problem: using the float64 data type instead of float32 helps solve this issue. Of course this has memory implications, but I found it to be the only solution so far.
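The poster does not say which framework was used; purely as an illustration, in TensorFlow/Keras switching the model to float64 could look like this (the layer sizes and input shape are placeholders):

```python
import numpy as np
import tensorflow as tf

# Compute everything in float64; this costs memory but reduces the chance of
# the small variance terms in the normalization rounding toward zero.
tf.keras.backend.set_floatx("float64")

bilstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(256, return_sequences=True)
)
x = np.random.randn(8, 20, 256)   # (batch, time, features) dummy input
print(bilstm(x).dtype)            # float64
```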
I am relatively new to deep learning and its frameworks. Currently, I am experimenting with the Caffe framework and trying to fine-tune Vgg16_places_365.
I am using an Amazon EC2 g2.8xlarge instance with 4 GPUs (each with 4 GB of RAM). However, when I try to train my model (using a single GPU), I get this error:
Check failed: error == cudaSuccess (2 vs. 0) out of memory
After I did some research, I found that one of the ways to solve this out of memory problem is by reducing the batch size in my train.prototxt
Caffe | Check failed: error == cudaSuccess (2 vs. 0) out of memory.
Initially, I set the batch size to 50 and iteratively reduced it until it reached 10 (since training worked with batch_size = 10).
Now the model is being trained, and I am pretty sure it will take quite a long time. However, as a newcomer to this domain, I am curious about the relation between the batch size and other parameters such as the learning rate, the stepsize, and even the max iterations that we specify in the solver.prototxt.
How significantly will the batch size affect the quality of the model (e.g. its accuracy)? How can the other parameters be used to improve quality? Also, instead of reducing the batch size or scaling up my machine, is there another way to fix this problem?
To answer your first question regarding the relationship between parameters such as batch size, learning rate and maximum number of iterations, you are best off reading about the mathematical background. A good place to start might be this stats.stackexchange question: How large should the batch size be for stochastic gradient descent?. The answer briefly discusses the relation between batch size and learning rate (from your question I assume learning rate = stepsize) and also provides some references for further reading.
To answer your last question, with the dataset you are finetuning on and the model (i.e. the VGG16) being fixed (i.e. the input data of fixed size, and the model of fixed size), you will have a hard time avoiding the out of memory problem for large batch sizes. However, if you are willing to reduce the input size or the model size you might be able to use larger batch sizes. Depending on how (and what) exactly you are finetuning, reducing the model size may already be achieved by discarding learned layers or reducing the number/size of fully connected layers.
The remaining questions, i.e. how significantly the batch size influences quality/accuracy and how other parameters influence quality/accuracy, are hard to answer without knowing the concrete problem you are trying to solve. The influence of, e.g., the batch size on the achieved accuracy may depend on various factors such as the noise in your dataset, its dimensionality, its size, and other parameters such as the learning rate (= stepsize) or momentum. For these sorts of questions, I recommend the textbook by Goodfellow et al.; chapter 11, for example, provides some general guidelines on choosing these hyperparameters (i.e. batch size, learning rate, etc.).
Another way to solve your problem is to use all the GPUs on your machine. If you have 4x4 = 16 GB of RAM across your GPUs, that would be enough. If you are running Caffe in command-line mode, just add the --gpu argument as follows (assuming you have 4 GPUs indexed with the defaults 0,1,2,3):
build/tools/caffe train --solver=solver.prototxt --gpu=0,1,2,3
However if you are using the python interface, running with multiple GPUs is not yet supported.
I can point out some general hints to answer your question on the batch size:
- The smaller the batch size, the more stochastic your learning will be --> lower probability of overfitting on the training data, but a higher probability of not converging.
- Each iteration in Caffe fetches one batch of data, runs a forward pass, and ends with a backpropagation.
- Let's say your training data is 50'000 samples and your batch size is 10; then in 1000 iterations, 10'000 of your samples have been fed to the network. In the same scenario, if your batch size is 50, all your training data is seen by the network in 1000 iterations. This is called one epoch. You should design your batch size and maximum iterations so that your network is trained for a certain number of epochs (see the sanity check after this list).
- stepsize in Caffe is the number of iterations your solver will run before multiplying the learning rate by the gamma value (if you have set your learning rate policy to "step").
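As a quick sanity check of that epoch arithmetic (with the numbers from the example above):

```python
train_size = 50000     # number of training samples
batch_size = 50        # samples fetched per Caffe iteration
max_iter = 1000        # iterations specified in solver.prototxt

samples_seen = batch_size * max_iter        # 50'000 -> the whole training set
epochs = samples_seen / train_size          # 1.0 epoch
iters_per_epoch = train_size / batch_size   # 1'000 iterations per epoch
print(epochs, iters_per_epoch)
```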