I have a dataset with 5 variables and 1 response. The variables are discrete. I want to find the key variable and the value of it that leads to a significant increase or decrease in the response.
You will need to perform some statistical tests in order to find which variables are the most significant.
If you are familiar with Python you could use SelectKBest from scikit-learn. It will give each feature a score: the higher the score, the stronger the link between the feature and the output.
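A minimal sketch of how that scoring might look (the column names and values are invented for illustration; for discrete features and a discrete response, mutual_info_classif or chi2 are common choices of scoring function):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Hypothetical dataset: 5 discrete variables and one response column.
df = pd.DataFrame({
    "v1": [0, 1, 2, 1, 0, 2, 1, 0],
    "v2": [1, 1, 0, 0, 1, 0, 1, 0],
    "v3": [2, 0, 1, 2, 0, 1, 2, 0],
    "v4": [0, 0, 1, 1, 0, 1, 1, 0],
    "v5": [1, 2, 2, 0, 1, 0, 2, 1],
    "response": [0, 1, 1, 0, 0, 1, 1, 0],
})
X, y = df.drop(columns="response"), df["response"]

# Score every feature against the response; higher score = stronger link.
selector = SelectKBest(score_func=mutual_info_classif, k="all").fit(X, y)
for name, score in zip(X.columns, selector.scores_):
    print(name, round(score, 3))
```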
Additionally, you can train an explainable ML model that is strong enough to converge and capture the patterns in the data, and then compute feature importances from it.
For example you could use DecisionTreeClassifier from scikit-learn. A fitted tree exposes a feature_importances_ attribute, computed from the Gini-based impurity decrease, and a decision_path method that returns the path each sample takes through the tree (plot_tree can be used to visualize the splits).
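A minimal sketch, reusing the hypothetical X and y from above (the tree depth is an arbitrary assumption):

```python
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Impurity-based (Gini) importances: one value per feature, summing to 1.
for name, importance in zip(X.columns, tree.feature_importances_):
    print(name, round(importance, 3))

# Optional: visualize the fitted tree to see which splits drive the response.
plot_tree(tree, feature_names=list(X.columns))
plt.show()
```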
Last but not least, you can use dimensionality-reduction techniques such as PCA. PCA finds the directions of greatest variance in the data and expresses them as new principal components, which are linear combinations of the original features; from the most explanatory components you can read off which features matter most. Check this Stack Overflow answer that explains everything you should know for that.
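A minimal PCA sketch along those lines, again reusing the hypothetical X (standardizing first is an assumption I'm making here, since PCA is variance-based):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

# How much variance each principal component explains.
print(pca.explained_variance_ratio_)

# Loadings of the original features on the first (most explanatory) component.
for name, loading in zip(X.columns, pca.components_[0]):
    print(name, round(loading, 3))
```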
I have a question and I need your support. I have a dataset which I am analyzing, and I need to predict a target. To do this I did some data cleaning, among other things dropping highly (linearly) correlated features.
After preparing my data I applied a random forest regressor (it is a regression problem). I am a bit stuck, since I really cannot grasp the meaning, and thus the right value, of max_features.
I found the following answer, where it is written:
max_features=n_features for regression is a mistake on scikit's part. The original paper for RF gave max_features = n_features/3 for regression
I do get different results if I use max_features=sqrt(n) or max_features=n_features
Can anyone give me a good explanation of how to approach this parameter?
That would be really great
max_features is a parameter that needs to be tuned. Values such as sqrt or n/3 are defaults and usually perform decently, but the parameter needs to be optimized for every dataset, as it will depend on the features you have, their correlations and importances.
Therefore, I suggest training the model many times with a grid of values for max_features, trying every possible value from 2 to the total number of your features. Train your RandomForestRegressor with oob_score=True and use oob_score_ to assess the performance of the Forest. Once you have looped over all possible values of max_features, keep the one that obtained the highest oob_score.
For safety, keep the n_estimators on the high end.
PS: this procedure is basically a grid search optimization for one parameter, and is usually done via Cross Validation. Since RFs give you OOB scores, you can use these instead of CV scores, as they are quicker to compute.
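A minimal sketch of that OOB-based search (the stand-in dataset and the n_estimators value are assumptions; substitute your own prepared X and y):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Stand-in data; replace with your own prepared feature matrix and target.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

best_score, best_m = -np.inf, None
for m in range(2, X.shape[1] + 1):
    rf = RandomForestRegressor(
        n_estimators=500,   # keep n_estimators on the high end for stable OOB estimates
        max_features=m,
        oob_score=True,
        random_state=0,
        n_jobs=-1,
    ).fit(X, y)
    if rf.oob_score_ > best_score:
        best_score, best_m = rf.oob_score_, m

print("best max_features:", best_m, "OOB R^2:", round(best_score, 3))
```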
Okay, I have a lot of confusion regarding the way likelihood functions are defined in the context of different machine learning algorithms. For the purposes of this discussion, I will reference Andrew Ng's CS229 lecture notes.
Here is my understanding thus far.
In the context of classification, we have two different types of algorithms: discriminative and generative. The goal in both cases is to determine the posterior probability p(C_k|x;w), where w is the parameter vector, x is the feature vector and C_k is the kth class. The approaches differ: in the discriminative case we model the posterior probability directly as a function of x, while in the generative case we model the class-conditional distributions p(x|C_k) and the class priors p(C_k), and use Bayes' theorem to obtain p(C_k|x;w).
From my understanding, Bayes' theorem takes the form p(parameters|data) = p(data|parameters)p(parameters)/p(data), where the likelihood function is p(data|parameters), the posterior is p(parameters|data) and the prior is p(parameters).
Now in the context of linear regression, we have the likelihood function:
p(y|X;w) where y is the vector of target values, X is design matrix.
This makes sense according to how we defined the likelihood function above.
Now moving over to classification, the likelihood is still defined as p(y|X;w). Will the likelihood always be defined this way?
The posterior probability we want is p(y_i|x;w) for each class, which is very strange since this is apparently the likelihood function as well.
When reading through a text, it just seems the likelihood is always defined in different ways, which confuses me profusely. Is there a difference in how the likelihood function should be interpreted for regression vs classification, or for generative vs discriminative models? E.g. the way the likelihood is defined in Gaussian discriminant analysis looks very different.
If anyone can recommend resources that go over this in detail, I would appreciate it.
A quick answer is that the likelihood function is a function proportional to the probability of seeing the data conditional on all the parameters in your model. As you said in linear regression it is p(y|X,w) where w is your vector of regression coefficients and X is your design matrix.
In a classification context, your likelihood would be proportional to P(y|X,w) where y is your vector of observed class labels. You do not have a y_i for each class, because your training data was observed to be in one particular class. Given your model specification and your model parameters, for each observed data point you should be able to calculate the probability of seeing the observed class. This is your likelihood.
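In symbols, under the usual assumption that the training points are conditionally independent given the parameters, the discriminative likelihood is the same product in both regression and classification:

```latex
\mathcal{L}(w) \;=\; p(y \mid X; w) \;=\; \prod_{i=1}^{n} p(y_i \mid x_i; w),
\qquad
\ell(w) \;=\; \log \mathcal{L}(w) \;=\; \sum_{i=1}^{n} \log p(y_i \mid x_i; w)
```

In regression each factor is typically a Gaussian density evaluated at the observed y_i; in classification it is the model's predicted probability of the observed label y_i.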
The posterior predictive distribution, p(y_new_i|X,y), is the probability you want in paragraph 4. This is distinct from the likelihood because it is the probability for some unobserved case, rather than the likelihood, which relates to your training data. Note that I removed w because typically you would want to marginalize over it rather than condition on it because there is still uncertainty in the estimate after training your model and you would want your predictions to marginalize over that rather than condition on one particular value.
As an aside, the goal of all classification methods is not to find a posterior distribution, only Bayesian methods are really concerned with a posterior and those methods are necessarily generative. There are plenty of non-Bayesian methods and plenty of non-probabilistic discriminative models out there.
Any function proportional to p(a|b) where a is fixed is a likelihood function for b. Note that p(a|b) might be called something else, depending on what's interesting at the moment. For example, p(a|b) can also be called the posterior for a given b. The names don't really matter.
There are two main probabilistic approaches to novelty detection: parametric and non-parametric. The non-parametric approach estimates the distribution or density function directly from the training data, as in kernel density estimation (e.g. Parzen windows), while the parametric approach assumes the data come from a known family of distributions.
I am not familiar with the parametric approach. Could anyone show me some well-known algorithms? Also, is MLE a kind of parametric approach (the form of the density is known, and we then find the parameter values that maximize the likelihood)?
Yes, MLE is by definition a parametric approach: you are estimating the parameters of a distribution that maximize the probability of observing the data. By the way, a non-parametric approach generally means a potentially infinite number of parameters rather than an absence of parameters, for example a Dirichlet process.
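As an illustration, a minimal sketch of a parametric approach to novelty detection via MLE (the Gaussian assumption, the simulated data and the 3-sigma threshold are all assumptions made purely for the example):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=1000)   # training data assumed Gaussian

# MLE for a Gaussian: norm.fit returns the maximum-likelihood mean and std.
mu_hat, sigma_hat = norm.fit(train)

# Novelty score: log-density under the fitted model; very low values = novel.
new_points = np.array([4.8, 5.5, 20.0])
log_density = norm.logpdf(new_points, loc=mu_hat, scale=sigma_hat)
threshold = norm.logpdf(mu_hat + 3 * sigma_hat, loc=mu_hat, scale=sigma_hat)
print(log_density < threshold)   # [False False  True] -> 20.0 is flagged as novel
```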
I think you're confused about a lot of terminology here.
Maximum likelihood makes sense for a parametric model (say a Gaussian distribution) because the number of parameters is fixed a priori, so it makes sense to ask what the 'best' estimates are. Maximum likelihood gives you one (of many) possible answers.
One way of understanding non-parametric methods is that the number of parameters they have grows as the number of samples increases. Thus a KDE has one parameter for each data point, plus a bandwidth parameter. What would MLE mean here? You increase the number of parameters as more data arrives, and thus there's no sense of best value for a fixed set of parameters.
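For contrast, a minimal KDE sketch in that spirit (the simulated data and the bandwidth are arbitrary assumptions): the fitted model effectively stores every training point, and only the bandwidth is set by hand or by cross-validation rather than by maximizing a likelihood over a fixed parameter set.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=1000).reshape(-1, 1)

# The fitted KDE keeps all 1000 training points; the parameter count grows with the data.
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(train)

new_points = np.array([[4.8], [20.0]])
print(kde.score_samples(new_points))   # log-density; very low for the outlying 20.0
```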
Why do we have to normalize the input for a neural network?
I understand that sometimes, when for example the input values are non-numerical, a certain transformation must be performed, but what about when we have numerical input? Why must the numbers be in a certain interval?
What will happen if the data is not normalized?
It's explained well here.
If the input variables are combined linearly, as in an MLP [multilayer perceptron], then it is rarely strictly necessary to standardize the inputs, at least in theory. The reason is that any rescaling of an input vector can be effectively undone by changing the corresponding weights and biases, leaving you with the exact same outputs as you had before. However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima. Also, weight decay and Bayesian estimation can be done more conveniently with standardized inputs.
In neural networks, it is a good idea not just to normalize the data but also to scale it. This gives faster convergence toward the global minimum of the error surface. See the following pictures:
The pictures are taken from the Coursera course on neural networks by Geoffrey Hinton.
Some inputs to a NN might not have a 'naturally defined' range of values. For example, the average value might be slowly but continuously increasing over time (for example the number of records in a database).
In such a case, feeding this raw value into your network will not work very well. You would train your network on values from the lower part of the range, while the actual inputs will come from the higher part of this range (and quite possibly above the range that the network has learned to work with).
You should normalize this value. You could, for example, tell the network by how much the value has changed since the previous input. This increment can usually be expected, with high probability, to lie in a specific range, which makes it a good input for the network.
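A minimal sketch of that 'feed the increment instead of the raw value' idea (the record counts are invented for illustration):

```python
import numpy as np

# Hypothetical raw input: number of records in a database, growing over time.
record_counts = np.array([10_000, 10_250, 10_430, 10_900, 11_050])

# Feed the change since the previous observation instead of the raw count;
# the increments live in a much more stable range than the raw values.
increments = np.diff(record_counts)
print(increments)   # [250 180 470 150]
```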
There are 2 reasons why we have to normalize input features before feeding them to a neural network:
Reason 1: If one feature in the dataset is big in scale compared to the others, then this big-scaled feature becomes dominant, and as a result the predictions of the neural network will not be accurate.
Example: In the case of employee data, if we consider Age and Salary, Age will be a two-digit number while Salary can be 7 or 8 digits (1 million, etc.). In that case, Salary will dominate the prediction of the neural network. But if we normalize those features, the values of both features will lie in the range from 0 to 1.
Reason 2: Forward propagation of a neural network involves the dot product of the weights with the input features. So, if the values are very high (for image or non-image data), calculating the output takes a lot of computation time as well as memory. The same is the case during backpropagation. Consequently, the model converges slowly if the inputs are not normalized.
Example: If we perform image classification, the image data will be very large, as the value of each pixel ranges from 0 to 255. Normalization is very important in this case.
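A minimal sketch of both kinds of scaling just described (the employee numbers are invented; scikit-learn's MinMaxScaler is one common way to map each feature to [0, 1], and dividing pixel values by 255 plays the same role for images):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical employee data: Age (two digits) vs. Salary (seven digits).
X = np.array([[25, 1_200_000],
              [32, 2_500_000],
              [47, 8_000_000]], dtype=float)

X_scaled = MinMaxScaler().fit_transform(X)   # both columns now lie in [0, 1]
print(X_scaled)

# For images, the equivalent is simply dividing the pixel values by 255.
image = np.random.randint(0, 256, size=(28, 28)).astype(float)
image_scaled = image / 255.0
```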
Mentioned below are the instances where Normalization is very important:
K-Means
K-Nearest-Neighbours
Principal Component Analysis (PCA)
Gradient Descent
When you use unnormalized input features, the loss function is likely to have very elongated valleys. When optimizing with gradient descent, this becomes an issue because the gradient will be steep with respect to some of the parameters. That leads to large oscillations in the search space, as you bounce between the steep slopes. To compensate, you have to stabilize optimization with a small learning rate.
Consider features x1 and x2, which range from 0 to 1 and from 0 to 1 million, respectively. It turns out that the ratio between the corresponding parameters (say, w1 and w2) will also be large.
Normalizing tends to make the loss function more symmetrical/spherical. These are easier to optimize because the gradients tend to point towards the global minimum and you can take larger steps.
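A small numerical sketch of that effect (the toy features are invented): with one feature in [0, 1] and one in [0, 1e6], the loss gradient with respect to the second weight dwarfs the first, which is what forces the tiny learning rate; after standardization the two components are comparable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(0, 1, n)            # small-scale feature
x2 = rng.uniform(0, 1_000_000, n)    # large-scale feature
y = 3 * x1 + 2e-6 * x2 + rng.normal(0, 0.1, n)

def mse_gradient(X, y, w):
    # Gradient of the mean squared error for a linear model y ≈ X @ w.
    return 2 * X.T @ (X @ w - y) / len(y)

X_raw = np.column_stack([x1, x2])
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

w0 = np.zeros(2)
print(mse_gradient(X_raw, y, w0))   # components differ by ~6 orders of magnitude
print(mse_gradient(X_std, y, w0))   # components are of comparable size
```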
Looking at the neural network from the outside, it is just a function that takes some arguments and produces a result. As with all functions, it has a domain (i.e. a set of legal arguments). You have to normalize the values that you want to pass to the neural net in order to make sure it is in the domain. As with all functions, if the arguments are not in the domain, the result is not guaranteed to be appropriate.
The exact behavior of the neural net on arguments outside of the domain depends on the implementation of the neural net. But overall, the result is useless if the arguments are not within the domain.
I believe the answer is dependent on the scenario.
Consider NN (neural network) as an operator F, so that F(input) = output. In the case where this relation is linear so that F(A * input) = A * output, then you might choose to either leave the input/output unnormalised in their raw forms, or normalise both to eliminate A. Obviously this linearity assumption is violated in classification tasks, or nearly any task that outputs a probability, where F(A * input) = 1 * output
In practice, normalisation allows non-fittable networks to be fittable, which is crucial to experimenters/programmers. Nevertheless, the precise impact of normalisation will depend not only on the network architecture/algorithm, but also on the statistical prior for the input and output.
What's more, NN is often implemented to solve very difficult problems in a black-box fashion, which means the underlying problem may have a very poor statistical formulation, making it hard to evaluate the impact of normalisation, causing the technical advantage (becoming fittable) to dominate over its impact on the statistics.
In statistical sense, normalisation removes variation that is believed to be non-causal in predicting the output, so as to prevent NN from learning this variation as a predictor (NN does not see this variation, hence cannot use it).
The reason normalization is needed is that if you look at how an adaptive step proceeds in one place in the domain of the function, and then simply transport the problem to the equivalent of the same step translated by some large value in some direction of the domain, you get different results. It boils down to the question of adapting a linear piece to a data point: how much should the piece move without turning, and how much should it turn in response to that one training point? It makes no sense to have a different adaptation procedure in different parts of the domain! So normalization is required to reduce the difference in the training result. I haven't got this written up, but you can just look at the math for a simple linear function and how it is trained by one training point in two different places. This problem may have been corrected in some places, but I am not familiar with them. In ALNs, the problem has been corrected and I can send you a paper if you write to wwarmstrong AT shaw.ca
At a high level, if you observe where normalization/standardization is mostly used, you will notice that whenever magnitude differences enter the model-building process, it becomes necessary to standardize the inputs, so that important inputs of small magnitude don't lose their significance midway through the model-building process.
example:
√((3-1)^2 + (1000-900)^2) ≈ √((1000-900)^2)
Here, (3-1) contributes almost nothing to the result, and hence the input corresponding to these values is effectively ignored by the model.
Consider the following:
Clustering uses Euclidean or other distance measures.
NNs use an optimization algorithm to minimize a cost function (e.g. MSE).
Both the distance measure (clustering) and the cost function (NNs) involve magnitude differences in some way, so standardization ensures that magnitude differences don't drown out important input parameters and the algorithm works as expected.
Hidden layers are used in accordance with the complexity of our data. If we have input data which is linearly separable, then we need not use a hidden layer, e.g. the OR gate; but if we have non-linearly separable data, then we need a hidden layer, for example the XOR logical gate.
The number of nodes at any layer is usually chosen empirically, by cross-validating the model's output.