Is Maximum Likelihood Estimation (MLE) a parametric approach? - machine-learning

There are two main probabilistic approaches to novelty detection: parametric and non-parametric. The non-parametric approach estimates the distribution or density function directly from the training data, as in Kernel Density Estimation (e.g. Parzen windows), while the parametric approach assumes the data come from a known family of distributions.
I am not familiar with the parametric approach. Could anyone point me to some well-known algorithms? Also, is MLE a kind of parametric approach (the form of the density is known, and we then find the parameter values that maximize the likelihood)?

Yes, MLE is by definition a parametric approach. You are estimating the parameters of a distribution that maximize the probability of observing the data. By the way, a non-parametric approach generally means an infinite number of parameters rather than an absence of parameters, for example a Dirichlet process.

I think you're confused about a lot of terminology here.
Maximum likelihood makes sense for a parametric model (say a Gaussian distribution) because the number of parameters is fixed a priori, so it makes sense to ask what the 'best' estimates are. Maximum likelihood gives you one (of many) possible answers.
One way of understanding non-parametric methods is that the number of parameters they have grows as the number of samples increases. Thus a KDE has one parameter for each data point, plus a bandwidth parameter. What would MLE mean here? You increase the number of parameters as more data arrives, and thus there's no sense of best value for a fixed set of parameters.
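To make the contrast concrete, here is a minimal sketch (assuming NumPy and scikit-learn; the data and bandwidth are illustrative): the Gaussian MLE always has exactly two parameters, while the KDE keeps every observation around.

    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.default_rng(0)
    data = rng.normal(loc=5.0, scale=2.0, size=1000)

    # Parametric: assume a Gaussian family; the MLE is the sample mean
    # and the (biased) sample standard deviation -- two parameters, fixed.
    mu_hat, sigma_hat = data.mean(), data.std()

    # Non-parametric: the KDE effectively has one "parameter" per data
    # point, plus the bandwidth; the count grows with the sample.
    kde = KernelDensity(kernel="gaussian", bandwidth=0.5)
    kde.fit(data.reshape(-1, 1))

    print(mu_hat, sigma_hat)           # MLE of the two Gaussian parameters
    print(kde.score_samples([[5.0]]))  # KDE log-density estimate at x = 5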

Related

Question about a new type of confidence interval

I came up with the following result, tested on many data sets, but I do not have a formal proof yet:
Theorem: The width $L$ of any confidence interval is asymptotically equal (as $n$ tends to infinity) to a power function of $n$, namely $L = A / n^{B}$, where $A$ and $B$ are two positive constants depending on the data set, and $n$ is the sample size.
See here and here for details. The exponent $B$ seems to be very similar to the Hurst exponent in time series, not only in terms of what it represents, but also in the values it takes: $B = 1/2$ corresponds to perfect data (no auto-correlation or other undesirable features) and $B = 1$ corresponds to "bad data", typically with strong auto-correlations.
Note that $B = 1/2$ is what everyone uses nowadays, assuming observations are independently and identically distributed with an underlying normal distribution. I also devised a method to make the interval width converge to zero faster: $O(1/n)$ rather than $O(1/\sqrt{n})$. This is also described in section 3.3 of my article on re-sampling (here), and my approach in this context seems closely related to what are called second-order accurate intervals (usually achieved with modern versions of bootstrapping, see here).
My question is whether my theorem is original, ground-breaking, and correct, and how one would prove it (or refute it).
Example of Confidence Interval
Perl code to produce confidence intervals for the correlation
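To make the claimed power law concrete, here is a minimal simulation sketch (NumPy/SciPy assumed; the Perl original is not reproduced here) that measures the width of the classical 95% confidence interval for the mean at several sample sizes and fits $\log L$ against $\log n$; for i.i.d. data the fitted exponent $B$ should come out near $1/2$:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    sizes = [100, 400, 1600, 6400, 25600]
    widths = []
    for n in sizes:
        x = rng.normal(size=n)
        # Classical 95% CI for the mean: width = 2 * z * s / sqrt(n)
        half = stats.norm.ppf(0.975) * x.std(ddof=1) / np.sqrt(n)
        widths.append(2 * half)

    # Fit log L = log A - B log n; the slope estimates -B.
    slope, intercept = np.polyfit(np.log(sizes), np.log(widths), 1)
    print("estimated B:", -slope)  # close to 0.5 for i.i.d. data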
The first problem is, what do you mean by confidence interval?
Let's say I do non-parametric estimation of a probability density function with a kernel density estimator.
A confidence interval has no meaning in this setting. However, you can compute the rate ("speed") of convergence of your kernel density estimator to your target function. Depending on the distance you choose between functions, you can get different rates of convergence; for example, the optimal rate in the $L^{\infty}$ distance involves a $\log(n)$ factor.
By the way, you give yourself a counterexample in your first article.
So for me your theorem cannot hold, for two reasons:
It is not clear: you need to specify exactly what you mean by a confidence interval, and what you mean by "depending on the data set" (does it depend on $n$, the number of observations?).
There are counterexamples, since the asymptotic rate of convergence of estimators can be more complicated than what you state.

Weighted features in machine learning

I am a beginner in machine learning. So any help or suggestion would be of great help.
I have read that manually putting weights on features for prediction is a very bad idea. But what if a few features need to be weighted?
In a classification problem, let's say it is common knowledge that age is the most influential feature; how do I give more weight to this feature? I was thinking of normalizing it, but to a variance of 1.5 or 2 (with the other features at variance 1); I believe this feature would then carry more weight. Is this fundamentally wrong? If so, is there another method?
Does it affect classification and regression problems differently?
If we are talking specifically about random forests (as you tagged), then you can use the Weighted Subspace Random Forest algorithm (the wsrf package in R). The algorithm determines a weight for each variable and then uses these weights during model building.
The informativeness of a variable with respect to the class is measured by an information gain ratio. The measure is used as the probability of that variable being selected for inclusion in the variable subspace when splitting a specific node during the tree building process. Therefore, variables with higher values by the measure are more likely to be chosen as candidates during variable selection and a stronger tree can be built.
Generally, if a feature is more important than the others and the model is dense enough, then with enough training samples your model will automatically give it more importance by optimizing the weight matrices accordingly: back-propagation computes a partial derivative for each connection, so the network learns on its own to give more importance to that feature. If you don't just normalize it but scale it up, you may have overstated its importance.
In practice a neural network works best if the inputs are centered and white, meaning their covariance is diagonal and their mean is the zero vector. This improves optimization of the neural net, since the hidden activation functions don't saturate as fast and thus do not give you near-zero gradients early on in learning.
If you scale just one feature up, it may or may not have the desired effect, but the more likely outcome is saturated gradients, so we avoid it.
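For completeness, the variance-scaling idea from the question looks like this in code (a sketch with scikit-learn; the column index and the factor 1.5 are the question's assumptions, not a recommendation):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    # Toy data: columns are age and some other feature.
    X = np.array([[25.0, 0.3], [40.0, 0.7], [60.0, 0.1]])

    # Standardize every feature to zero mean and unit variance ...
    X_std = StandardScaler().fit_transform(X)

    # ... then up-weight the "age" column (index 0) by scaling its
    # standard deviation to 1.5, as proposed in the question.
    X_std[:, 0] *= 1.5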

Are high values for c or gamma problematic when using an RBF kernel SVM?

I'm using WEKA/LibSVM to train a classifier for a term extraction system. My data is not linearly separable, so I used an RBF kernel instead of a linear one.
I followed the guide from Hsu et al. and iterated over several values for both c and gamma. The parameters which worked best for classifying known terms (test and training material differ of course) are rather high, c=2^10 and gamma=2^3.
So far the high parameters seem to work ok, yet I wonder if they may cause any problems further on, especially regarding overfitting. I plan to do another evaluation by extracting new terms, yet those are costly as I need human judges.
Could anything still be wrong with my parameters, even if both evaluations turn out positive? Do I perhaps need another kernel type?
Thank you very much!
In general you have to perform cross-validation to determine whether the parameters are all right or whether they lead to overfitting.
From the "intuition" perspective, it looks like a highly overfitted model. A high value of gamma means that your Gaussians are very narrow (condensed around each point), which combined with a high C value will result in memorizing most of the training set. If you check the number of support vectors, I would not be surprised if it were 50% of your whole data. Another possible explanation is that you did not scale your data. Most ML methods, and SVMs in particular, require the data to be properly preprocessed. This means, in particular, that you should normalize (standardize) the input data so it is more or less contained in the unit sphere.
RBF seems like a reasonable choice, so I would keep using it. A high value of gamma is not necessarily a bad thing; it depends on the scale at which your data lives. Similarly, while a high C value can lead to overfitting, it is also affected by the scale, so in some cases it might be just fine.
If you think your dataset is a good representation of the whole data, then you can use cross-validation to test your parameters and have some peace of mind.
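If you want to re-run such a search outside WEKA, here is a sketch of a cross-validated grid search with scikit-learn (whose SVC also wraps LibSVM); the exponential grids follow Hsu et al., and X, y stand in for your term-extraction features and labels:

    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Scale first, as recommended above, so that C and gamma operate
    # on comparable feature ranges.
    pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = {
        "svc__C": [2.0**k for k in range(-5, 16, 2)],
        "svc__gamma": [2.0**k for k in range(-15, 4, 2)],
    }
    search = GridSearchCV(pipe, grid, cv=5)
    # search.fit(X, y)
    # print(search.best_params_, search.best_score_)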

What is the best/preferred approach to implement Maximum Likelihood Estimation for large data sets in GBs

I have a data set measured in gigabytes (GB) and want to estimate the parameters for missing values in it.
There is a method called MLE (Maximum Likelihood Estimation) in machine learning that can be used for this.
Since R might not work on such a large data set, which library would be best to use?
From the Wikipedia entry on MLE:
In statistics, maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model. When applied to a data set and given a statistical model, maximum-likelihood estimation provides estimates for the model's parameters.
Generally you need two steps before you can apply MLE:
obtain a dataset
identify a statistical model
If you can obtain a closed-form solution for the MLE, just stream your data through the estimate calculation. For example, for a Gaussian distribution, to estimate the mean you just accumulate the running sum and keep the count; the sample mean is then your MLE.
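A minimal streaming version of that Gaussian example (plain Python; read_chunks is a hypothetical stand-in for however you iterate over your file):

    def streaming_gaussian_mle(chunks):
        """Accumulate count, sum and sum of squares over data chunks."""
        n, s, ss = 0, 0.0, 0.0
        for chunk in chunks:
            for x in chunk:
                n += 1
                s += x
                ss += x * x
        mean = s / n                  # MLE of the mean
        var = ss / n - mean * mean    # MLE (biased) of the variance
        return mean, var

    # Usage: mean, var = streaming_gaussian_mle(read_chunks("big_file.csv"))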
However, when the model involves many parameters and its pdf is highly non-linear, the MLE must be sought numerically using nonlinear optimization algorithms. If your data size is huge, try stochastic gradient descent, where the true gradient is approximated by the gradient at a single example. As the algorithm sweeps through the training set, it performs the update for each training example in turn, so you can still stream your data one example at a time over multiple sweeps. This way, memory constraints should not be a problem at all.
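As a sketch of the numerical case, here is a streaming SGD update for the log-likelihood of a logistic regression model (NumPy assumed; the learning rate and the stream() data source are placeholders):

    import numpy as np

    def sgd_logistic_mle(stream, dim, lr=0.01, sweeps=3):
        """Maximize the log-likelihood one example at a time."""
        w = np.zeros(dim)
        for _ in range(sweeps):
            for x, y in stream():      # y in {0, 1}; stream() re-reads the data
                p = 1.0 / (1.0 + np.exp(-x @ w))
                w += lr * (y - p) * x  # log-likelihood gradient at one example
        return w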

Why do we have to normalize the input for an artificial neural network? [closed]

Why do we have to normalize the input for a neural network?
I understand that sometimes, when for example the input values are non-numerical, a certain transformation must be performed, but when the input is numerical, why must the numbers lie in a certain interval?
What will happen if the data is not normalized?
It's explained well here.
If the input variables are combined linearly, as in an MLP [multilayer perceptron], then it is rarely strictly necessary to standardize the inputs, at least in theory. The reason is that any rescaling of an input vector can be effectively undone by changing the corresponding weights and biases, leaving you with the exact same outputs as you had before. However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima. Also, weight decay and Bayesian estimation can be done more conveniently with standardized inputs.
In neural networks, it is a good idea not just to normalize the data but also to scale it, so that you approach the global minima of the error surface faster. (The illustrative pictures are from Geoffrey Hinton's Coursera course on neural networks.)
Some inputs to NN might not have a 'naturally defined' range of values. For example, the average value might be slowly, but continuously increasing over time (for example a number of records in the database).
In such a case, feeding this raw value into your network will not work very well. You will have trained your network on values from the lower part of the range, while the actual inputs will come from the higher part of this range (and quite possibly above the range that the network has learned to work with).
You should normalize this value. You could, for example, tell the network by how much the value has changed since the previous input. This increment can usually be expected, with high probability, to fall within a specific range, which makes it a good input for the network.
There are two reasons why we have to normalize input features before feeding them to a neural network:
Reason 1: If one feature in the dataset is big in scale compared to the others, then this large-scale feature becomes dominating, and as a result the predictions of the neural network will not be accurate.
Example: In the case of employee data, if we consider age and salary, age will be a two-digit number while salary can be 7 or 8 digits (1 million, etc.). In that case, salary will dominate the prediction of the neural network. But if we normalize those features, the values of both features will lie in the range from 0 to 1 (see the sketch after these examples).
Reason 2: Forward propagation of neural networks involves the dot product of weights with input features. So, if the values are very high (for image and non-image data), the calculation of the output takes a lot of computation time as well as memory. The same is the case during back propagation. Consequently, the model converges slowly if the inputs are not normalized.
Example: If we perform image classification, the size of the image will be very large, as the value of each pixel ranges from 0 to 255. Normalization in this case is very important.
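A one-line sketch of that (0, 1) rescaling with scikit-learn (the age/salary numbers are the example's, purely illustrative):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # Columns: age (two digits), salary (seven digits), as in the example.
    X = np.array([[25, 1_200_000], [40, 3_500_000], [60, 9_000_000]], dtype=float)
    X01 = MinMaxScaler().fit_transform(X)  # both columns now lie in [0, 1]
    print(X01)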
Mentioned below are the instances where normalization is very important:
K-Means
K-Nearest-Neighbours
Principal Component Analysis (PCA)
Gradient Descent
When you use unnormalized input features, the loss function is likely to have very elongated valleys. When optimizing with gradient descent, this becomes an issue because the gradient will be steep with respect to some of the parameters. That leads to large oscillations in the search space, as you bounce between steep slopes. To compensate, you have to stabilize optimization with small learning rates.
Consider features x1 and x2, which range from 0 to 1 and from 0 to 1 million, respectively. It turns out the ratio of the corresponding parameters (say, w1 and w2) will also be large.
Normalizing tends to make the loss function more symmetrical/spherical. Such functions are easier to optimize because the gradients tend to point towards the global minimum and you can take larger steps.
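A small self-contained sketch of this effect (NumPy only; the data and learning rates are illustrative): gradient descent on a least-squares loss with one feature a million times larger than the other diverges unless the learning rate is around 1e-12, while the standardized version converges quickly with a generous step size.

    import numpy as np

    def descend(X, y, lr, steps=1000):
        # Plain batch gradient descent on the least-squares loss.
        w = np.zeros(X.shape[1])
        for _ in range(steps):
            grad = X.T @ (X @ w - y) / len(y)
            w -= lr * grad
        return w

    rng = np.random.default_rng(0)
    x1 = rng.uniform(0, 1, 500)
    x2 = rng.uniform(0, 1e6, 500)  # wildly different scale
    X = np.column_stack([x1, x2])
    y = 3 * x1 + 2e-6 * x2

    # Standardized features give a well-conditioned, nearly spherical loss.
    Xn = (X - X.mean(axis=0)) / X.std(axis=0)
    print(descend(Xn, y - y.mean(), lr=0.5))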
Looking at the neural network from the outside, it is just a function that takes some arguments and produces a result. As with all functions, it has a domain (i.e. a set of legal arguments). You have to normalize the values that you want to pass to the neural net in order to make sure it is in the domain. As with all functions, if the arguments are not in the domain, the result is not guaranteed to be appropriate.
The exact behavior of the neural net on arguments outside of the domain depends on the implementation of the neural net. But overall, the result is useless if the arguments are not within the domain.
I believe the answer is dependent on the scenario.
Consider the NN (neural network) as an operator F, so that F(input) = output. In the case where this relation is linear, so that F(A * input) = A * output, you might choose either to leave the input/output unnormalized in their raw forms, or to normalize both to eliminate A. Obviously this linearity assumption is violated in classification tasks, or in nearly any task that outputs a probability, where F(A * input) = 1 * output.
In practice, normalisation allows non-fittable networks to become fittable, which is crucial to experimenters/programmers. Nevertheless, the precise impact of normalisation will depend not only on the network architecture/algorithm, but also on the statistical priors of the input and output.
What's more, NNs are often used to solve very difficult problems in a black-box fashion, which means the underlying problem may have a very poor statistical formulation, making it hard to evaluate the impact of normalisation and causing the technical advantage (becoming fittable) to dominate over its impact on the statistics.
In the statistical sense, normalisation removes variation that is believed to be non-causal in predicting the output, so as to prevent the NN from learning this variation as a predictor (the NN does not see this variation, and hence cannot use it).
The reason normalization is needed is that if you look at how an adaptive step proceeds at one place in the domain of the function, and then simply transport the problem to the equivalent of the same step translated by some large value in some direction in the domain, you get different results. It boils down to the question of adapting a linear piece to a data point: how much should the piece move without turning, and how much should it turn, in response to that one training point? It makes no sense to have a different adaptation procedure in different parts of the domain! So normalization is required to reduce the difference in the training result. I haven't got this written up, but you can just look at the math for a simple linear function and how it is trained by one training point in two different places. This problem may have been corrected in some places, but I am not familiar with them. In ALNs, the problem has been corrected, and I can send you a paper if you write to wwarmstrong AT shaw.ca
On a high level, if you observe where normalization/standardization is mostly used, you will notice that, any time magnitude differences enter the model-building process, it becomes necessary to standardize the inputs so that important inputs with small magnitudes don't lose their significance midway through the model-building process.
Example:
$\sqrt{(3-1)^2 + (1000-900)^2} \approx \sqrt{(1000-900)^2}$
Here, $(3-1)$ contributes hardly anything to the result, and hence the input corresponding to these values is considered futile by the model.
Consider the following:
Clustering uses Euclidean or other distance measures.
NNs use an optimization algorithm to minimize a cost function (e.g., MSE).
Both the distance measure (clustering) and the cost function (NNs) use magnitude differences in some way, so standardization ensures that a magnitude difference does not drown out important input parameters and the algorithm works as expected.
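A quick numeric check of the distance example above (NumPy assumed; the per-feature scales are hypothetical):

    import numpy as np

    a = np.array([3.0, 1000.0])
    b = np.array([1.0, 900.0])

    # Raw Euclidean distance: ~100.02, the first feature is invisible.
    print(np.linalg.norm(a - b))

    # After dividing by per-feature standard deviations, both contribute.
    scale = np.array([1.0, 50.0])  # hypothetical per-feature std devs
    print(np.linalg.norm((a - b) / scale))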
Hidden layers are used in accordance with the complexity of our data. If we have input data that is linearly separable, then we need not use a hidden layer (e.g., the OR gate), but if we have non-linearly separable data, then we need to use a hidden layer (for example, the XOR logical gate).
The number of nodes taken at any layer depends upon the degree of cross-validation of our output.
