Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
I am using a neural network for a classification problem and I am now at the point of tuning all the hyperparameters.
So far, I have found several hyperparameters that I have to tune:
Learning rate
Batch size
Number of iterations (epochs)
For now, my tuning is quite "manual" and I am not sure I am doing everything properly. Is there a particular order in which to tune the parameters? E.g. learning rate first, then batch size, then ... I am not sure that all these parameters are independent. Which ones are clearly independent and which ones are clearly not? Should the dependent ones be tuned together? Is there any paper or article that discusses how to tune all the parameters in a particular order?
There are even more than that! E.g. the number of layers, the number of neurons per layer, which optimizer to choose, etc.
So the real work in training a neural network is actually finding the best-suited parameters.
I would say there is no clear guideline because training a machine learning algorithm, in general, is always task-specific.
You see, there are many hyperparameters to tune, and you won't have time to try out every combination. For many hyperparameters, you will build up an intuition for what a good choice would be, but for now, a great starting point is to use what others have shown to work. So if you find a paper on the same or a similar task, you could try to use the same or similar parameters as they did.
Just to share with you some small experiences I've made:
I rarely vary the learning rate. I mostly choose the Adam optimizer and stick with it.
The batch size I try to choose as big as possible without running out of memory.
The number of iterations you could just set to, e.g., 1000. You can always look at the current loss and decide for yourself whether to stop, e.g. when the net isn't learning anymore.
Keep in mind these are in no way rules or strict guidelines. Just some ideas until you've got a better intuition yourself. The more papers you read and the more nets you train, the better you will understand what to choose when.
Hope this serves as a good starting point at least.
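The "look at the loss and stop when the net isn't learning anymore" advice can be automated. Below is a minimal sketch, not a standard recipe: the helper name `should_stop`, the `patience` and `min_delta` parameters, and the simulated loss values are all assumptions made for illustration.

```python
def should_stop(losses, patience=5, min_delta=0.0):
    """Return True once the loss has not improved for `patience` epochs.

    losses: one loss value per epoch, oldest first.
    The last `patience` epochs count as an improvement only if their best
    loss beats the best earlier loss by more than `min_delta`.
    """
    if len(losses) <= patience:
        return False  # not enough history yet
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_recent >= best_before - min_delta


# Hypothetical loss curve standing in for a real training loop.
simulated = [1.0, 0.6, 0.4, 0.35, 0.34, 0.34, 0.35, 0.34, 0.36, 0.35]

max_epochs = 1000  # generous upper bound, as suggested above
history = []
stopped_at = None
for epoch in range(max_epochs):
    # In real code this would be the loss measured after training one epoch.
    loss = simulated[min(epoch, len(simulated) - 1)]
    history.append(loss)
    if should_stop(history, patience=3):
        stopped_at = epoch
        break

print("stopped at epoch", stopped_at)
```

Checking a held-out validation loss rather than the training loss is the usual choice here, since the training loss can keep falling while the model overfits.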
Closed 2 years ago.
Classification problems can exhibit a strong label imbalance in the given dataset. This can be overcome by subsampling or by assigning class weights, which allow for balancing the label distribution at least during model training. Stratification, on the other hand, keeps a given label distribution the same in every fold.
For regression problems, this is not defined in standard libraries such as scikit-learn. There are a few approaches to cover stratification, and a well-written theoretical approach for regression subsampling by Scott Lowe here.
I am wondering why label balancing for regression, as opposed to classification, gets so little attention in the machine learning community. Regression problems also exhibit different characteristics that might be easier or harder to acquire in a data collection setting. And is there any framework or paper that further addresses this issue?
The complexity of the problem lies in the continuous nature of regression. With classification, it is very natural to split the data by class because it is basically already split into classes :) With regression, the number of possible splits is basically infinite and, most importantly, it is impossible to know what a good split would be. As in the article you sent, you might apply sorted or fractional approaches, but in the end you have no idea to what extent they would be correct. You can also split the target into intervals. This is what the verstack library does. In the documentation, it says: "For continuous target variable verstack uses binning and categoric split based on bins". That is, they first assign the continuous values to bins (classes) and then apply stratification to those bins.
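The bin-then-stratify idea described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the implementation of any particular library: the function name `binned_stratified_split`, the equal-frequency binning, and the default of 5 bins are all choices made here for the example.

```python
import random
from collections import defaultdict


def binned_stratified_split(y, test_fraction=0.2, n_bins=5, seed=0):
    """Split sample indices so train and test cover the target range evenly.

    The continuous targets are ranked and cut into `n_bins` equal-frequency
    bins; each bin is then split separately with the same test fraction.
    """
    order = sorted(range(len(y)), key=lambda i: y[i])
    bins = defaultdict(list)
    for rank, idx in enumerate(order):
        bins[rank * n_bins // len(y)].append(idx)  # bin id in [0, n_bins - 1]

    rng = random.Random(seed)
    train, test = [], []
    for members in bins.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


# Hypothetical skewed regression target: most values small, a few large.
y = [i ** 2 for i in range(100)]
train_idx, test_idx = binned_stratified_split(y, test_fraction=0.2, n_bins=5)
print(len(train_idx), len(test_idx))
```

The number of bins is itself a heuristic, which is the point made above: too few bins barely constrain the split, while too many leave bins too small to divide in the requested proportion.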
There are not many studies on this because everything you can come up with is going to be a heuristic. However, there can be exceptions if you can incorporate some domain knowledge. As an example, let's say that you are trying to predict the frequency of some electromagnetic waves from some set of features. In that case, you have prior knowledge of how the wave frequencies are split (https://en.wikipedia.org/wiki/Electromagnetic_spectrum). So now it is natural to split them into continuous intervals with respect to their wavelengths and do a regression stratification. But otherwise, it is hard to come up with something that would generalize.
I personally never encountered a study on this.
Closed 3 years ago.
Prior to 2017, it was relatively simple to tell which CNN was the best at classifying images by looking at the yearly ImageNet competition.
In 2017, the ImageNet competition was divided into different tasks with winners such as this. In 2018, the competition moved to Kaggle and became about 3D detection.
I am interested in image classification only and there no longer seems to be a competition for this.
Does anyone know what neural network was recognised as the best for image classification in 2018?
If I recall correctly, it is Google's NASNet. It comes out of a very cool (and compute-intensive) method for designing the model architecture, and it is good for transfer learning and prediction. I would recommend taking a look at the NASNet paper.
It should also be available through keras.applications.
This is a really good question. I was wondering about the same and played around with some of the models that are on TensorFlow Hub. So, here are my two cents.
The current best models in terms of performance on ImageNet are the ones which are obtained with Progressive Neural Architecture Search. On the other hand, these models are incredibly slow to train because they are huge. When it comes to the models such as InceptionNet, ResNet, and VGG, this is a good link to check out the performance compared to the training/inference speed.
My personal experience is that if you want to maximize performance, use ResNet152. If you want a relatively fast CNN that still achieves good performance, go with ResNet50. When it comes to the VGG nets, I played around with the TF-Slim implementation, but it was slower than ResNet50 with about the same performance. Finally, I can't say much about Inception because I didn't use it. In the end, I went with ResNet152 because it yielded the best performance for me (please note that I was using a pre-trained version and fine-tuning it to my task).
To summarize, I think that there is no general best CNN. I would avoid using VGG16/19, because it yields worse performance than ResNet50 while being slower. If you have access to a lot of computational power, go with ResNet152 or PNASNet. Again, this is my opinion based on my personal experience playing around with the pre-trained models on TF-Hub.
Closed 2 years ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
In machine learning, especially deep learning, what does it mean to warm-up?
I've heard sometimes that in some models, warming-up is a phase in training. But honestly, I don't know what it is because I'm very new to ML. Until now I've never used or come across it, but I want to know it because I think it might be useful for me.
What is learning rate warm-up and when do we need it?
If your data set is highly differentiated, you can suffer from a sort of "early over-fitting". If your shuffled data happens to include a cluster of related, strongly-featured observations, your model's initial training can skew badly toward those features -- or worse, toward incidental features that aren't truly related to the topic at all.
Warm-up is a way to reduce the primacy effect of the early training examples. Without it, you may need to run a few extra epochs to get the convergence desired, as the model un-trains those early superstitions.
Many models afford this as a command-line option. The learning rate is increased linearly over the warm-up period. If the target learning rate is p and the warm-up period is n, then the first batch iteration uses 1*p/n for its learning rate; the second uses 2*p/n, and so on: iteration i uses i*p/n, until we hit the nominal rate at iteration n.
This means that the first iteration gets only 1/n of the primacy effect. This does a reasonable job of balancing that influence.
Note that the ramp-up is commonly on the order of one epoch -- but is occasionally longer for particularly skewed data, or shorter for more homogeneous distributions. You may want to adjust, depending on how functionally extreme your batches can become when the shuffling algorithm is applied to the training set.
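The linear ramp described above (iteration i uses i*p/n until the nominal rate is reached at iteration n) can be written as a small function. This is a sketch only: the function name `linear_warmup_lr` and the example values for p and n are assumptions made for illustration.

```python
def linear_warmup_lr(i, p, n):
    """Learning rate for 1-based batch iteration i.

    p: target (nominal) learning rate
    n: length of the warm-up period in iterations
    """
    if i < 1:
        raise ValueError("iterations are counted from 1")
    # Ramp linearly up to p during warm-up, then stay at p.
    return p * min(i, n) / n


# Hypothetical settings: target rate 0.1, warm-up over 5 iterations.
for i in range(1, 8):
    print(i, linear_warmup_lr(i, p=0.1, n=5))
```

After iteration n the function simply returns p; any decay schedule applied later in training is a separate concern and is not modeled here.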
It means that if you specify your learning rate to be, say, 2e-5, then during training the learning rate will be linearly increased from approximately 0 to 2e-5 within the first, say, 10,000 steps.
There are actually two strategies for warmup, ref here.
constant: Use a learning rate lower than the base learning rate for the first few steps.
gradual: In the first few steps, the learning rate is set lower than the base learning rate and is increased gradually to approach it as the step number increases, as @Prune and @Patel suggested.
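The two strategies can be put side by side in one function. This is a hedged sketch, not the code from the referenced source: the name `warmup_lr`, the `warmup_factor` parameter (the fraction of the base rate used at the start), and its default of 0.1 are assumptions for illustration.

```python
def warmup_lr(step, base_lr, warmup_steps, strategy="gradual", warmup_factor=0.1):
    """Learning rate at 0-based `step` under a constant or gradual warm-up.

    warmup_factor: fraction of base_lr used at the start of warm-up
    (an assumed default, not taken from the source).
    """
    if step >= warmup_steps:
        return base_lr  # warm-up finished
    if strategy == "constant":
        # Hold one low rate for the whole warm-up, then jump to base_lr.
        return base_lr * warmup_factor
    if strategy == "gradual":
        # Interpolate from warmup_factor * base_lr towards base_lr.
        alpha = step / warmup_steps
        return base_lr * (warmup_factor * (1 - alpha) + alpha)
    raise ValueError(f"unknown strategy: {strategy!r}")


# Hypothetical comparison over a 4-step warm-up with base rate 1.0.
for step in range(6):
    print(step,
          warmup_lr(step, 1.0, 4, "constant"),
          warmup_lr(step, 1.0, 4, "gradual"))
```

The constant variant has a sudden jump in the learning rate at the end of the warm-up, which the gradual variant avoids by construction.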
Closed 2 years ago.
Can I use reinforcement learning on classification? Such as human activity recognition? And how?
There are two types of feedback. One is evaluative feedback, which is used in reinforcement learning. The other is instructive feedback, which is used in supervised learning, the approach most commonly used for classification problems.
When supervised learning is used, the weights of the neural network are adjusted based on the information of the correct labels provided in the training dataset. So, on selecting a wrong class, the loss increases and weights are adjusted, so that for the input of that kind, this wrong class is not chosen again.
However, in reinforcement learning, the system explores all the possible actions (in this case, class labels for the various inputs) and decides what is right and what is wrong by evaluating the reward. Until it finds the correct class label, it may keep giving a wrong class name because that is the best output it has found so far. So it doesn't make use of the specific knowledge we have about the class labels, which slows convergence significantly compared to supervised learning.
You can use reinforcement learning for classification problems, but it won't give you any added benefit and will instead slow down your convergence.
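The gap between the two kinds of feedback can be made concrete with a deliberately simplified toy. Everything here is an assumption for illustration: the learners just memorize a lookup table instead of training a network, and the function names, sample names, and activity labels are hypothetical. It only counts how many feedback signals each learner needs.

```python
def learn_with_instructive_feedback(true_labels):
    """Supervised style: the correct label is given for every input."""
    memory, feedback_used = {}, 0
    for x, label in true_labels.items():
        memory[x] = label      # told the answer directly
        feedback_used += 1
    return memory, feedback_used


def learn_with_evaluative_feedback(true_labels, classes):
    """RL style: the learner only sees a reward (1 or 0) for each guess."""
    memory, feedback_used = {}, 0
    for x, label in true_labels.items():
        for guess in classes:  # explore the actions one by one
            feedback_used += 1
            reward = 1 if guess == label else 0
            if reward:
                memory[x] = guess
                break
    return memory, feedback_used


# Hypothetical activity-recognition labels.
classes = ["walking", "sitting", "running", "standing"]
true_labels = {"sample_a": "running", "sample_b": "standing", "sample_c": "walking"}

_, n_instructive = learn_with_instructive_feedback(true_labels)
_, n_evaluative = learn_with_evaluative_feedback(true_labels, classes)
print(n_instructive, n_evaluative)
```

Both learners end up with the same table, but the evaluative one pays for exploration. A real RL agent faces the same cost in a noisier form, which is the slowdown described above.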
Short answer: Yes.
Detailed answer: yes, but it's overkill. Reinforcement learning is useful when you don't have a labeled dataset from which to learn the correct policy, so you need to develop the correct strategy based on rewards. It also allows you to backpropagate through non-differentiable blocks (which I suppose is not your case). The biggest drawback of reinforcement learning methods is that they typically take a VERY long time to converge. So, if you have labels, it would be a LOT faster and easier to use regular supervised learning.
You may be able to develop an RL model that chooses which classifier to use. The ground-truth (gt) labels would be used to train the classifiers, and the change in performance of those classifiers would be the reward for the RL model. As others have said, it would probably take a very long time to converge, if it ever does. This idea may also require many tricks and tweaks to make it work. I would recommend searching for research papers on this topic.
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
Let me just start by saying I only took the undergrad AI class at school so I know just enough to be dangerous.
Here's the problem I'm looking to solve...accurate credit scoring is a key part to the success of my business. Currently we rely on a team of actuaries and statistical analysis to suss out patterns in the few dozen variables we track about each individual that indicate that they may be a low or high credit risk. As I understand it this is exactly the type of job that neural nets are great at solving, that is, finding high order relationships across many inputs that a human would likely never spot and then rendering a decision or output that is on average more accurate than what a trained human could do. In short, I want to be able to input your name, address, marital status, what car you drive, where you work, hair color, favorite food, etc in and get a credit score back.
My question is what type of neural network architecture would be best for this particular problem. I've done a bit of research and it seems I'm generating questions faster than I'm finding answers at this point. The best I've been able to come up with is some kind of generative deep neural network with multiple hidden layers where each layer is able to abstract one level beyond the previous one. I'm assuming it's going to be feed-forward just because it seems to be the default. We have historical data on all previous customers including the information we used to make the initial score as well as data on what type of credit risk they actually turned out to be. This would seem to lend itself to unsupervised learning. Where I'm lost is in number of layers, how the layers are different from each other, size of each layer, connectedness of each of the perceptrons and so on. The more I dig the more I'm getting into research papers that are over my head, so I just need some smart person to point me in the right direction.
Does anyone have any ideas? Again, I don't need a thorough explanation just a general area I should focus on.
This is supervised learning, since you have actual data that can be labelled. It's also feedforward, since you're not predicting a time series but assigning scores. Further, you should probably just prepare your data (assigning credit scores manually or with some rough heuristic) and start experimenting with some tools before you invest time in implementing state-of-the-art architectures. A multi-layer perceptron (MLP) with 1 hidden layer is a sufficient starting point for such a problem. From there, you can train the network to generalize the credit assignment heuristic you began with.
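To show how small the suggested starting point is, here is the forward pass of a one-hidden-layer MLP in plain Python. It is a sketch under loud assumptions: the function name `mlp_score`, the ReLU and sigmoid activations, the three input features, and all weight values are hypothetical and untrained. In practice the weights would be learned from the labelled historical data with an ML library.

```python
import math


def mlp_score(features, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer MLP; returns a value in (0, 1)."""
    # Hidden layer: one weighted sum per hidden unit, followed by ReLU.
    hidden = [
        max(0.0, sum(w * x for w, x in zip(weights, features)) + bias)
        for weights, bias in zip(w_hidden, b_hidden)
    ]
    # Output layer: a single weighted sum squashed by a sigmoid.
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical applicant with 3 numeric (already scaled) features.
applicant = [0.5, -1.0, 2.0]

# Hypothetical, untrained weights: 2 hidden units, 3 inputs each.
w_hidden = [[0.2, -0.4, 0.1],
            [-0.3, 0.5, 0.8]]
b_hidden = [0.0, 0.1]
w_out = [1.5, -2.0]
b_out = 0.05

print(mlp_score(applicant, w_hidden, b_hidden, w_out, b_out))
```

Categorical inputs such as marital status or employer would first need to be encoded as numbers (for example one-hot) before they can be fed to a network like this.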
You should know that most "new" architectures you probably read about while researching deal with much more difficult problems than credit scoring (speech/image/character recognition/detection). There is a collection of papers on credit scoring / risk classification, so I'd recommend shifting your focus from architectures to actual case studies (see e.g. this paper). Just pick a recent paper that uses MLPs and apply its parameters. Start simple and improve the system incrementally (as @roganjosh stated).