Parameters in Weka Multilayer Perceptron Classifier - machine-learning

I'm doing some experiments with Weka Multilayer Perceptron, and I have some questions relating to its parameters. I've checked the help document but couldn't understand:
nominalToBinaryFilter: What is it, and how is it used?
normalizeAttribute: I think this scales feature values to the [-1, 1] range. But how is that done when the values are not numeric, for example with the weather dataset?
reset: This resets the network if the current training process diverges and starts again with a lower learning rate. By how much is the learning rate decreased (how is the next learning rate chosen)?
Initial weights: This isn't a parameter, but how are the initial weights set? Are they symmetric (something like values inside [-ε, +ε])?

It has been a while since I used WEKA, but here are my comments on bullets 2, 3 and 4, which may be useful to you:
Bullet 2: Normalization is not applicable to categorical (non-numerical) attributes, so you don't need to worry about this parameter.
Bullet 3: By default, reset halves the learning rate. How to adjust the learning rate depends on many factors, and I suggest searching scholarly articles if you think the default approach does not cover your case. From my experience, a rule of thumb is to alter the learning rate in steps of 0.1.
Bullet 4: Initial weights are small random numbers that are not identical.
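To make the reset behaviour from bullet 3 concrete, here is a minimal sketch. It is not Weka's code: plain gradient descent on a toy quadratic loss stands in for network training, and the function name `train_with_reset`, the divergence check and the toy loss are all assumptions made for illustration. Whenever the loss blows up, training restarts from the initial weight with half the learning rate.

```python
import math

def train_with_reset(lr, w0=1.0, steps=50, curvature=10.0):
    """Gradient descent on loss(w) = 0.5 * curvature * w**2 (a toy stand-in).

    If the loss grows beyond its starting value (treated here as divergence),
    restart from w0 with half the learning rate, mimicking a 'reset'.
    """
    def loss(w):
        return 0.5 * curvature * w * w

    while True:
        w = w0
        diverged = False
        for _ in range(steps):
            w -= lr * curvature * w              # gradient of the loss is curvature * w
            if not math.isfinite(loss(w)) or loss(w) > loss(w0):
                diverged = True
                break
        if not diverged:
            return lr, w
        lr /= 2.0                                # reset: halve the rate and start over

final_lr, final_w = train_with_reset(lr=1.0)
print(final_lr, final_w)
```

Starting from a learning rate of 1.0, the sketch halves the rate a few times before training stays stable, which is the same idea as letting reset pick the next learning rate for you.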

I know this is very old but wanted to add regarding bullet 4:
The seed parameter is used to seed a random number generator, which is then used to generate the random initial weights. Therefore, if you want to explore sensitivity to initial weights, you can use different values here.
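The effect of the seed can be sketched in a few lines of Python. This is only an illustration of seeded initialization in general, not Weka's implementation; the helper `init_weights` and the range `epsilon=0.05` are hypothetical choices.

```python
import random

def init_weights(n_weights, seed, epsilon=0.05):
    """Draw small random initial weights uniformly from [-epsilon, +epsilon].

    epsilon is an assumed value for illustration, not Weka's actual range.
    """
    rng = random.Random(seed)                    # the seed fully determines the sequence
    return [rng.uniform(-epsilon, epsilon) for _ in range(n_weights)]

same_a = init_weights(4, seed=0)
same_b = init_weights(4, seed=0)
other = init_weights(4, seed=1)

print(same_a == same_b)   # same seed gives the same initial weights
print(same_a != other)    # a different seed gives different initial weights
```

Re-running training with several seeds and comparing the results is a simple way to see how sensitive the network is to its starting point.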

Related

Random Forest - Max Features

I have a question and I need your support. I have a data set which I am analyzing, and I need to predict a target. To do this I did some data cleaning, including dropping highly (linearly) correlated features.
After preparing my data I applied a random forest regressor (it is a regression problem). I am a bit stuck, since I cannot grasp the meaning of max_features and thus which value to use for it.
I found the following answer, where it is written:
features=n_features for regression is a mistake on scikit's part. The original paper for RF gave max_features = n_features/3 for regression
I do get different results if I use max_features=sqrt(n) or max_features=n_features
Can anyone give me a good explanation of how to approach this parameter?
That would be really great.
max_features is a parameter that needs to be tuned. Values such as sqrt or n/3 are defaults and usually perform decently, but the parameter needs to be optimized for every dataset, as it will depend on the features you have, their correlations and importances.
Therefore, I suggest training the model many times with a grid of values for max_features, trying every possible value from 2 to the total number of your features. Train your RandomForestRegressor with oob_score=True and use oob_score_ to assess the performance of the Forest. Once you have looped over all possible values of max_features, keep the one that obtained the highest oob_score.
For safety, keep the n_estimators on the high end.
PS: this procedure is basically a grid search optimization for one parameter, and is usually done via Cross Validation. Since RFs give you OOB scores, you can use these instead of CV scores, as they are quicker to compute.
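The procedure described above could look roughly like this with scikit-learn. The synthetic data from `make_regression`, the number of trees and the variable names are assumptions for illustration; substitute your own X and y.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (an assumption); replace with your own X and y.
X, y = make_regression(n_samples=300, n_features=8, n_informative=4,
                       noise=10.0, random_state=0)

oob_by_k = {}
for k in range(2, X.shape[1] + 1):               # every value from 2 to n_features
    forest = RandomForestRegressor(
        n_estimators=200,                        # kept fairly high so OOB estimates are stable
        max_features=k,
        oob_score=True,                          # score each sample using trees that did not see it
        random_state=0,
    )
    forest.fit(X, y)
    oob_by_k[k] = forest.oob_score_              # R^2 estimated on out-of-bag samples

best_k = max(oob_by_k, key=oob_by_k.get)
print(oob_by_k)
print("best max_features:", best_k)
```

Differences between neighbouring values of max_features are often small, so it is worth looking at the whole dictionary of scores rather than only the single best value.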

Deep Neural Network - Order of the Parameters to tune

I am new to the DNN field and I am fed up with tuning hyperparameters and other parameters in a DNN, because there are a lot of parameters to tune and it is like a multivariable analysis without the help of a computer. How can a human move towards the highest accuracy achievable for a task, given the huge number of variables inside a DNN? And how will we know what accuracy is possible with a DNN, or do I have to give up on DNNs? I am lost. Help is appreciated.
Main problems I have:
1. What are the limits of DNNs / when do we have to give up on a DNN
2. What is the proper way of tuning without missing good parameter values
Here is the summary I got by learning theory in this field. Corrections are much appreciated if I am wrong or misunderstood. You can add anything I missed. Sorted by the importance according to my knowledge.
for overfitting -
1. reduce the number of layers
2. reduce the number of nodes of layers
3. add regularizers (l1/ l2/ l1-l2) - have to decide the factors
4. add dropout layers and -have to decide the dropout factor
5. reduce batch size
6. stop earlier
for underfitting
1. increase the number of layers
2. increase number of nodes of layers
3. Add different types of layers (Conv, LSTM, ...)
4. add learning rate decay (decide the type and parameters for the type)
5. reduce the learning rate
other than that generally we can do,
1. number of epochs (by seeing what is happening while model training)
2. Adjust Learning Rate
3. batch normalization -for fast learning
4. initializing techniques (zero/ random/ Xavier / he)
5. different optimization algorithms
auto tuning methods
- GridSearchCV - but for this, we have to choose what we want to change, and it takes a lot of time.
Short Answer: You should experiment a lot!
Long Answer: At first, you may be overwhelmed by having plenty of knobs that you can tweak, but you gradually become experienced. A very quick way to gain some intuition on how you should tune the hyperparameters of your model is trying to replicate what other researchers have published. By replicating the results (and trying to improve the state-of-the-art), you acquire the intuition about deep learning.
I, personally, follow no particular order in tuning the hyperparameters of the model. Instead, I try to implement a dirty model and try to improve it. For instance, if I see that there are overshoots in validation accuracy, which might be an indicator of the fact that the model is bouncing around the sweet spot, I divide the learning rate by ten and see how it goes. If I see the model begins to overfit, I use early stopping to save the best parameters before overfitting. I also play with dropout rates and weight decay to find the best combination of them in order to have the model fit enough while maintaining the regularization effect. And so on.
To correct some of your assumptions, adding different types of layers will not necessarily help your model not to overfit. Moreover, sometimes (especially when using transfer learning, which is a trend these days), you cannot simply add a convolutional layer to your neural network.
Assuming you are dealing with computer vision tasks, Data Augmentation is another useful approach to increase the amount of data available to train your model and improve its performance.
Also, note that Batch Normalization also has a regularization effect. Weight Decay is another implementation of l2 regularization that is widely used.
Another interesting technique that can improve the training of neural networks is the One Cycle policy for learning rate and momentum (if applicable). Check this paper out: https://doi.org/10.1109/WACV.2017.58
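As a rough idea of what such a learning-rate schedule looks like, here is a simplified single-cycle sketch: the rate rises linearly to a peak over the first half of training and falls back linearly over the second half. The function name `one_cycle_lr` and the default rates are hypothetical, and this is not the exact policy from the paper.

```python
def one_cycle_lr(step, total_steps, base_lr=0.001, max_lr=0.01):
    """Learning rate at `step` (0-based) for a single triangular cycle.

    base_lr and max_lr are illustrative values, not recommendations.
    """
    half = total_steps / 2.0
    if step <= half:
        frac = step / half                       # warm-up: goes from 0 to 1
    else:
        frac = (total_steps - step) / half       # cool-down: goes from 1 back to 0
    return base_lr + frac * (max_lr - base_lr)

schedule = [one_cycle_lr(s, total_steps=10) for s in range(11)]
print(schedule)
```

In a real training loop you would call the function once per batch and pass the result to the optimizer; if momentum is cycled too, it usually moves in the opposite direction to the learning rate.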

Neural network input with exponential decay

Often, to improve learning rates, inputs to a neural network are preprocessed by scaling and shifting them to lie between -1 and 1. I'm wondering, though, whether that's a good idea for an input whose distribution decays exponentially. For instance, suppose I had an input with integer values 0 to 100, where the majority of inputs are 0 and smaller values are more common than large ones, with 99 being very rare.
It seems that scaling and shifting them wouldn't be ideal, since the most common value would then be -1. How is this type of input best dealt with?
Suppose you're using a sigmoid activation function which is symmetric around the origin.
The trick to speed up convergence is to have the mean of the normalized data set be 0 as well. The choice of activation function is important because you're not only learning weights from the input to the first hidden layer, i.e. normalizing the input is not enough: the input to the second hidden layer/output is learned as well and thus needs to obey the same rule to be consistent. In the case of non-input layers this is done by the activation function. The much-cited Efficient BackProp paper by LeCun summarizes these rules and has some nice explanations as well, which you should look up, because there are other things like weight and bias initialization that one should consider too.
In chapter 4.3 he gives a formula to normalize the inputs so that the mean is close to 0 and the standard deviation is 1. If you need more sources, this is a great FAQ as well.
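A minimal sketch of that kind of normalization, applied to a skewed input like the one in the question. It uses the usual z-score (subtract the mean, divide by the standard deviation), which may differ in detail from the formula in the paper; the simulated data and the variable names are assumptions.

```python
import random
import statistics

rng = random.Random(0)
# Hypothetical skewed input: integers 0..100 where small values are far more common.
raw = [min(100, int(rng.expovariate(0.2))) for _ in range(1000)]

mean = statistics.mean(raw)
std = statistics.pstdev(raw)                     # population standard deviation
standardized = [(x - mean) / std for x in raw]

print(statistics.mean(standardized), statistics.pstdev(standardized))
```

After this transform the most common raw value (0) maps to a modest negative number rather than to the extreme -1 that plain min/max scaling would give.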
I don't know your application scenario, but if you're using symbolic data and 0-100 is meant to represent percentages, then you could also apply softmax to the input layer to get better input representations. It's also worth noting that some people prefer scaling to [.1,.9] instead of [0,1].
Edit: Rewritten to match comments.

What are the metrics to evaluate a machine learning algorithm

I would like to know what are the various techniques and metrics used to evaluate how accurate/good an algorithm is and how to use a given metric to derive a conclusion about a ML model.
One way to do this is to use precision and recall, as defined on Wikipedia.
Another way is to use the accuracy metric as explained here. So, what I would like to know is whether there are other metrics for evaluating an ML model?
I've compiled, a while ago, a list of metrics used to evaluate classification and regression algorithms, under the form of a cheatsheet. Some metrics for classification: precision, recall, sensitivity, specificity, F-measure, Matthews correlation, etc. They are all based on the confusion matrix. Others exist for regression (continuous output variable).
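As a small illustration of how these metrics follow from the confusion matrix, here is a sketch for the binary case. The function name and the example counts are made up, and the code assumes none of the denominators is zero.

```python
import math

def classification_metrics(tp, fp, fn, tn):
    """Binary-classification metrics from the four confusion-matrix counts.

    Assumes every denominator is non-zero (no degenerate confusion matrix).
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                      # also called sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )                                            # Matthews correlation coefficient
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }

m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(m)
```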
The technique is mostly to run an algorithm on some data to get a model, and then apply that model on new, previously unseen data, and evaluate the metric on that data set, and repeat.
Some techniques (actually resampling techniques from statistics):
Jackknife
Cross-validation
K-fold validation
Bootstrap
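Of the techniques listed above, k-fold validation is easy to sketch by hand. The helper `k_fold_indices` is hypothetical and splits the data into contiguous folds without shuffling; in practice you would usually shuffle the indices first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample ends up in the test set exactly once, so the metric of your choice can be computed on every fold and then averaged.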
ML in general is quite a vast field, but I'll try to answer anyway. The Wikipedia definition of ML is the following:
Machine learning, a branch of artificial intelligence, concerns the construction and study of systems that can learn from data.
In this context, learning can be defined as the parameterization of an algorithm. The parameters of the algorithm are derived using input data with a known output. When the algorithm has "learned" the association between input and output, it can be tested with further input data for which the output is also known.
Let's suppose your problem is to obtain words from speech. Here the input is some kind of audio file containing one word (not necessarily, but I assume this case to keep it simple). You'd record X words N times and then use (for example) N/2 of the repetitions to parameterize your algorithm, disregarding, for the moment, what your algorithm looks like.
Now, on the one hand, depending on the algorithm, if you feed it one of the remaining repetitions, it may give you some certainty estimate which can be used to characterize the recognition of just that one repetition. On the other hand, you may use all of the remaining repetitions to test the learned algorithm. You pass each repetition to the algorithm and compare the expected output with the actual output. In the end you'll have an accuracy value for the learned algorithm, calculated as the quotient of correct and total classifications.
Anyway, the actual accuracy will depend on the quality of your learning and test data.
A good start to read on would be Pattern Recognition and Machine Learning by Christopher M Bishop
There are various metrics for evaluating the performance of an ML model, and there is no rule that there are only 20 or 30 of them. You can create your own metrics depending on your problem. When solving real-world problems, there are many cases where you will need to create your own custom metric.
As for the existing ones, they are already listed in the first answer, so I will just highlight the merits and demerits of each metric to give a better understanding.
Accuracy is the simplest metric and is commonly used. It is the number of correctly classified points divided by the total number of points in your dataset. Consider a 2-class problem where some points belong to class 1 and some belong to class 2: accuracy is not preferred when the dataset is imbalanced, because a model that always predicts the majority class still scores highly, and it is not very interpretable.
Log loss is a metric that works on probability scores, which gives you a better understanding of how confidently a specific point is assigned to class 1. The best part of this metric is that it is built into logistic regression, which is a well-known ML technique.
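For reference, binary log loss can be computed in a few lines. This is a plain-Python sketch with made-up labels and probabilities; the clipping constant `eps` is an assumption used only to avoid taking log(0).

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; p_pred holds predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)            # clip so that log() stays finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1]
confident = log_loss(labels, [0.9, 0.1, 0.8])    # correct and fairly sure
hesitant = log_loss(labels, [0.6, 0.4, 0.55])    # correct but unsure
print(confident, hesitant)
```

Both sets of predictions have the same accuracy at a 0.5 threshold, but the more confident correct predictions get the lower (better) log loss.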
The confusion matrix is best used for a 2-class classification problem, where it gives four numbers; the diagonal numbers give an idea of how good your model is. From this matrix other metrics are derived, such as precision, recall and F1-score, which are interpretable.

Why do we have to normalize the input for an artificial neural network? [closed]

Why do we have to normalize the input for a neural network?
I understand that sometimes, for example when the input values are non-numerical, a certain transformation must be performed, but what about when we have a numerical input? Why must the numbers be in a certain interval?
What will happen if the data is not normalized?
It's explained well here.
If the input variables are combined linearly, as in an MLP [multilayer perceptron], then it is
rarely strictly necessary to standardize the inputs, at least in theory. The
reason is that any rescaling of an input vector can be effectively undone by
changing the corresponding weights and biases, leaving you with the exact
same outputs as you had before. However, there are a variety of practical
reasons why standardizing the inputs can make training faster and reduce the
chances of getting stuck in local optima. Also, weight decay and Bayesian
estimation can be done more conveniently with standardized inputs.
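The point in the quote that rescaling can be undone by the weights and biases can be checked numerically for a single linear unit. The numbers and names below are made up for illustration.

```python
def neuron(x, w, b):
    """Pre-activation of one linear unit: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

x = [35.0, 2_000_000.0]                          # e.g. age and salary (made-up values)
w, b = [0.4, 2e-6], 0.5

# Standardize each input with some mean and scale (arbitrary illustrative values).
mean, scale = [40.0, 1_000_000.0], [10.0, 500_000.0]
x_scaled = [(xi - m) / s for xi, m, s in zip(x, mean, scale)]

# Weights and bias that exactly compensate for the rescaling.
w_adj = [wi * s for wi, s in zip(w, scale)]
b_adj = b + sum(wi * m for wi, m in zip(w, mean))

original = neuron(x, w, b)
rescaled = neuron(x_scaled, w_adj, b_adj)
print(original, rescaled)
```

The two outputs agree, so in theory nothing is lost or gained; the practical benefit of standardizing comes from how easily training can find such weights.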
In neural networks, it is a good idea not just to normalize the data but also to scale it. This is intended to reach the global minimum of the error surface faster. See the following pictures:
The pictures are taken from the Coursera course about neural networks. The author of the course is Geoffrey Hinton.
Some inputs to a NN might not have a 'naturally defined' range of values. For example, the average value might be slowly but continuously increasing over time (for example, the number of records in a database).
In such a case, feeding this raw value into your network will not work very well. You will train your network on values from the lower part of the range, while the actual inputs will come from the higher part of the range (and quite possibly from above the range that the network has learned to work with).
You should normalize this value. You could, for example, tell the network by how much the value has changed since the previous input. This increment can usually be expected, with high probability, to fall within a specific range, which makes it a good input for the network.
There are two reasons why we have to normalize input features before feeding them to a neural network:
Reason 1: If a feature in the dataset is large in scale compared to the others, this large-scale feature dominates, and as a result the predictions of the neural network will not be accurate.
Example: In employee data with age and salary, age will be a two-digit number while salary can have 7 or 8 digits (1 million, etc.). In that case, salary will dominate the prediction of the neural network. But if we normalize those features, the values of both will lie in the range 0 to 1.
Reason 2: Forward propagation in a neural network involves the dot product of the weights with the input features. So, if the values are very high (for image and non-image data), calculating the output takes a lot of computation time as well as memory. The same is the case during back propagation. Consequently, the model converges slowly if the inputs are not normalized.
Example: In image classification, the size of the image data will be very large, as the value of each pixel ranges from 0 to 255. Normalization in this case is very important.
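The age and salary example can be sketched with simple min-max scaling. The helper name and the sample values are made up for illustration.

```python
def min_max_scale(values):
    """Linearly map values to the range 0 to 1 (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [25, 32, 47, 51]
salaries = [1_000_000, 2_500_000, 7_000_000, 10_000_000]

scaled_ages = min_max_scale(ages)
scaled_salaries = min_max_scale(salaries)
print(scaled_ages)
print(scaled_salaries)
```

After scaling, both features live on the same 0 to 1 range, so neither dominates purely because of its units.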
Mentioned below are the instances where normalization is very important:
K-Means
K-Nearest-Neighbours
Principal Component Analysis (PCA)
Gradient Descent
When you use unnormalized input features, the loss function is likely to have very elongated valleys. When optimizing with gradient descent, this becomes an issue because the gradient will be steep with respect to some of the parameters. That leads to large oscillations in the search space, as you are bouncing between steep slopes. To compensate, you have to stabilize optimization with small learning rates.
Consider features x1 and x2, which range from 0 to 1 and from 0 to 1 million, respectively. It turns out the ratio between the gradients with respect to the corresponding parameters (say, w1 and w2) will also be large.
Normalizing tends to make the loss function more symmetrical/spherical. These are easier to optimize because the gradients tend to point towards the global minimum and you can take larger steps.
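A toy experiment shows the effect. The loss here is a simple quadratic bowl, 0.5 * sum(c_i * w_i^2), whose curvatures c_i stand in for the feature scales; the curvatures, learning rates and function name are assumptions chosen for illustration.

```python
def steps_to_converge(curvatures, lr, tol=1e-6, max_steps=100_000):
    """Gradient descent on 0.5 * sum(c_i * w_i**2), starting from w_i = 1.

    Returns the number of steps until every |w_i| is below tol.
    """
    w = [1.0 for _ in curvatures]
    for step in range(1, max_steps + 1):
        # The gradient with respect to w_i is c_i * w_i.
        w = [wi - lr * c * wi for wi, c in zip(w, curvatures)]
        if max(abs(wi) for wi in w) < tol:
            return step
    return max_steps

# Elongated valley: the steep direction forces a small learning rate (lr must stay below 2/1000).
elongated = steps_to_converge([1.0, 1000.0], lr=0.001)
# Spherical bowl: a much larger learning rate is stable.
spherical = steps_to_converge([1.0, 1.0], lr=0.5)
print(elongated, spherical)
```

The elongated case needs orders of magnitude more steps, because the learning rate that keeps the steep direction stable makes progress along the flat direction very slow.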
Looking at the neural network from the outside, it is just a function that takes some arguments and produces a result. As with all functions, it has a domain (i.e. a set of legal arguments). You have to normalize the values that you want to pass to the neural net in order to make sure it is in the domain. As with all functions, if the arguments are not in the domain, the result is not guaranteed to be appropriate.
The exact behavior of the neural net on arguments outside of the domain depends on the implementation of the neural net. But overall, the result is useless if the arguments are not within the domain.
I believe the answer is dependent on the scenario.
Consider NN (neural network) as an operator F, so that F(input) = output. In the case where this relation is linear so that F(A * input) = A * output, then you might choose to either leave the input/output unnormalised in their raw forms, or normalise both to eliminate A. Obviously this linearity assumption is violated in classification tasks, or nearly any task that outputs a probability, where F(A * input) = 1 * output
In practice, normalisation allows non-fittable networks to be fittable, which is crucial to experimenters/programmers. Nevertheless, the precise impact of normalisation will depend not only on the network architecture/algorithm, but also on the statistical prior for the input and output.
What's more, NN is often implemented to solve very difficult problems in a black-box fashion, which means the underlying problem may have a very poor statistical formulation, making it hard to evaluate the impact of normalisation, causing the technical advantage (becoming fittable) to dominate over its impact on the statistics.
In a statistical sense, normalisation removes variation that is believed to be non-causal in predicting the output, so as to prevent the NN from learning this variation as a predictor (the NN does not see this variation, hence cannot use it).
The reason normalization is needed is because if you look at how an adaptive step proceeds in one place in the domain of the function, and you just simply transport the problem to the equivalent of the same step translated by some large value in some direction in the domain, then you get different results. It boils down to the question of adapting a linear piece to a data point. How much should the piece move without turning and how much should it turn in response to that one training point? It makes no sense to have a changed adaptation procedure in different parts of the domain! So normalization is required to reduce the difference in the training result. I haven't got this written up, but you can just look at the math for a simple linear function and how it is trained by one training point in two different places. This problem may have been corrected in some places, but I am not familiar with them. In ALNs, the problem has been corrected and I can send you a paper if you write to wwarmstrong AT shaw.ca
On a high level, if you look at where normalization/standardization is mostly used, you will notice that any time magnitude differences are used in the model building process, it becomes necessary to standardize the inputs so as to ensure that important inputs with small magnitude don't lose their significance midway through the model building process.
example:
√((3-1)^2 + (1000-900)^2) ≈ √((1000-900)^2)
Here, (3-1) contributes hardly a thing to the result and hence the input corresponding to these values is considered futile by the model.
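The same numbers can be checked directly. The per-feature scales used for the standardized version are hypothetical values chosen only so that both differences become comparable.

```python
import math

a = (3.0, 1000.0)
b = (1.0, 900.0)

full = math.dist(a, b)                           # Euclidean distance using both features
second_only = abs(a[1] - b[1])                   # distance if the first feature is ignored

# Hypothetical per-feature scales (e.g. standard deviations) for illustration.
scales = (2.0, 100.0)
a_std = tuple(v / s for v, s in zip(a, scales))
b_std = tuple(v / s for v, s in zip(b, scales))
standardized = math.dist(a_std, b_std)

print(full, second_only, standardized)
```

Before scaling, dropping the first feature changes the distance by only about 0.02; after scaling, both features contribute equally to it.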
Consider the following:
Clustering uses Euclidean or other distance measures.
NNs use an optimization algorithm to minimise a cost function (e.g. MSE).
Both the distance measure (clustering) and the cost function (NNs) use magnitude differences in some way, and hence standardization ensures that magnitude differences don't override important input parameters and that the algorithm works as expected.
Hidden layers are used in accordance with the complexity of our data. If the input data is linearly separable, then we do not need a hidden layer, e.g. for an OR gate, but if the data is not linearly separable, then we need to use a hidden layer, for example for an XOR logical gate.
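The OR versus XOR contrast can be demonstrated with a single-layer perceptron, i.e. a network with no hidden layer. This is a textbook perceptron update written from scratch; the function name, learning rate and epoch limit are arbitrary choices.

```python
def train_perceptron(samples, epochs=100, lr=0.1):
    """Return True if a perceptron without hidden layers reaches zero training errors."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), target in samples:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            delta = target - pred
            if delta != 0:                       # update only on mistakes
                errors += 1
                w[0] += lr * delta * x1
                w[1] += lr * delta * x2
                b += lr * delta
        if errors == 0:
            return True                          # a separating line was found
    return False

or_gate = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
xor_gate = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("OR learned: ", train_perceptron(or_gate))
print("XOR learned:", train_perceptron(xor_gate))
```

No single line separates the XOR outputs, so the perceptron never reaches zero errors on it no matter how long it trains; adding a hidden layer removes that limitation.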
The number of nodes used in any layer depends on how well the output performs under cross-validation.
