In the PyTorch docs, it is stated that:
torch.amp provides convenience methods for mixed precision, where some operations use the torch.float32 (float) datatype and other operations use lower precision floating point datatype (lower_precision_fp): torch.float16 (half) or torch.bfloat16. Some ops, like linear layers and convolutions, are much faster in lower_precision_fp. Other ops, like reductions, often require the dynamic range of float32. Mixed precision tries to match each op to its appropriate datatype.
Does that mean a linear layer will always run in float16?
I'm training a regression model, and the final layer is a linear layer. My model predicts inf for values above 65k. Am I missing anything?
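For reference, float16's largest finite value is 65504, which is likely why ~65k is the breaking point here. A minimal sketch (the tensors and layer below are illustrative, not your model), including the common workaround of running the final regression head in float32:

```python
import torch

# float16's largest finite value is 65504; anything bigger overflows to inf.
print(torch.finfo(torch.float16).max)              # 65504.0
print(torch.tensor(70000.0, dtype=torch.float16))  # tensor(inf, dtype=torch.float16)

# A common workaround for a regression head with large targets: keep the
# final linear layer in float32 even if the rest of the model ran in half
# precision.
head = torch.nn.Linear(8, 1)   # kept in float32
h = torch.randn(4, 8).half()   # pretend this came from a float16 body
out = head(h.float())          # upcast before the final layer
print(out.dtype)               # torch.float32
```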
Related
Is regularisation a subset of normalisation? I know normalisation is used when all the values are not on the same scale, but normalisation is also used to tone down the values, and so is regularisation. So what's the difference between the two?
Normalisation adjusts the data; regularisation adjusts the prediction function.
As you noted, if your data are on very different scales (esp. low-to-high range), you likely want to normalise the data: alter each column to have the same (or compatible) basic statistics, such as standard deviation and mean. This is helpful to keep your fitting parameters on a scale that the computer can handle without a damaging loss of accuracy.
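A minimal sketch of that, assuming plain z-score standardisation (the data are made up):

```python
import numpy as np

# Rescale each column to zero mean and unit standard deviation so that
# features on wildly different scales become comparable.
X = np.array([[1.0, 100_000.0],
              [2.0, 250_000.0],
              [3.0,  50_000.0]])

X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_norm.mean(axis=0))  # ~[0, 0]
print(X_norm.std(axis=0))   # ~[1, 1]
```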
One goal of model training is to identify the signal (important features) and ignore the noise (random variation not really related to classification). If you give your model free rein to minimize the error on the given data, you can suffer from overfitting: the model insists on predicting the data set exactly, including those random variations.
Regularisation imposes some control on this by rewarding simpler fitting functions over complex ones. For instance, it can encode the view that a simple log function with an RMS error of x is preferable to a 15th-degree polynomial with an error of x/2. Tuning the trade-off is up to the model developer: if you know that your data are reasonably smooth in reality, you can look at the output functions and fitting errors, and choose your own balance.
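A sketch of that trade-off, using ridge regression's squared-coefficient penalty as the regulariser (the data and alpha are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = np.linspace(0.1, 1, 20).reshape(-1, 1)
y = np.log(x).ravel() + rng.normal(scale=0.05, size=20)  # smooth signal + noise

unregularised = make_pipeline(PolynomialFeatures(degree=15), LinearRegression()).fit(x, y)
regularised   = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1.0)).fit(x, y)
# The unregularised 15th-degree fit chases the noise; Ridge's penalty
# (alpha times the sum of squared coefficients) pulls the fit back toward
# simpler, smoother curves.
```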
How do I apply the transformation to the test data when I have the trained SVM model in hand? I am trying to reproduce the SVM output from the mathematical equations and the trained SVM model (using an RBF kernel). How do I do that?
In SVM, some of the common kernels used are:
Linear: K(xi, xj) = xi . xj
Polynomial: K(xi, xj) = (xi . xj + 1)^p
RBF: K(xi, xj) = exp(-gamma * ||xi - xj||^2)
Here xi and xj represent two samples. Now if the data has, say, 5 samples, does this transformation include all the combinations of two samples to generate the transformed feature space, like x1 and x1, x1 and x2, x1 and x3, ..., x4 and x5, x5 and x5?
If the data has two features, then a polynomial transformation of order 2 transforms the input to 3 dimensions, as explained here on slide 15: http://www.robots.ox.ac.uk/~az/lectures/ml/lect3.pdf
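As a quick numeric check of that slide-15 claim (a sketch; this is the standard explicit map for the homogeneous degree-2 kernel K(x, y) = (x . y)^2):

```python
import numpy as np

# For two features, phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2) is 3-dimensional
# and satisfies K(x, y) = (x . y)^2 = <phi(x), phi(y)>.
def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

print(np.dot(x, y) ** 2)       # 121.0, the kernel value
print(np.dot(phi(x), phi(y)))  # 121.0, via the explicit 3-D map
```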
Now how can we find a similar explanation for the transformation using the RBF kernel? I am trying to write code for transforming the test data so that I can apply the trained SVM model to it.
This is way more complex than that. In short, you do not map your data directly into feature space; you simply change the dot product to the one induced by the kernel. What happens "inside" the SVM when you work with a polynomial kernel is that each point is actually (indirectly) transformed to an O(d^p)-dimensional space (where d is the input data dimension and p is the degree of the polynomial kernel). From a mathematical perspective you work with some (often unknown) projection phi_K(x) which has the property that K(x, y) = <phi_K(x), phi_K(y)>, and nothing more. SVM implementations do not need the actual data representation (as phi_K(x) is usually huge, sometimes even infinite-dimensional, as in the RBF case); instead, they need the vector of dot products of your point with each element of the training set.
Thus what you do (in implementations, not from a math perspective) is provide:
During training: the whole Gram matrix G, defined as G_ij = K(x_i, x_j), where x_i is the i'th training sample.
During testing: when you get a new point y, you provide it to the SVM as a vector of dot products H such that H_i = K(y, x_i), where again the x_i are your training points (in fact you just need the values for the support vectors, but many implementations, like libsvm, actually require a vector of the size of the training set; you can simply put 0 for K(y, x_j) whenever x_j is not a support vector).
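A minimal sketch of that workflow with scikit-learn's precomputed-kernel mode (toy data; gamma is arbitrary):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 5))
y_train = np.arange(20) % 2        # toy labels, both classes present
X_test  = rng.normal(size=(4, 5))
gamma = 0.5

G = rbf_kernel(X_train, X_train, gamma=gamma)  # Gram matrix: G_ij = K(x_i, x_j)
clf = SVC(kernel="precomputed").fit(G, y_train)

H = rbf_kernel(X_test, X_train, gamma=gamma)   # H_i = K(y, x_i) for each test point
print(clf.predict(H))
```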
Just remember that this is not the same as training a linear SVM "on top" of the above representation. This is just the way implementations usually accept your data, as they need a definition of the dot product (a function), and it is often easier to pass numbers than functions (but some of them, like scikit-learn's SVC module, actually accept a function as the kernel parameter).
So what is the RBF kernel? It is actually a mapping from points to a function space: each point maps to a Gaussian (normal-distribution) function centred at that point, and the dot product is just the integral from -inf to +inf of the product of two such functions. Sounds complex? It is at first sight, but it is a really nice trick, worth understanding!
An unsupervised dimensionality reduction algorithm takes as input a matrix NxC1, where N is the number of input vectors and C1 is the number of components of each vector (the dimensionality of the vectors). As a result, it returns a new matrix NxC2 (C2 < C1) where each vector has a lower number of components.
A fuzzy clustering algorithm takes as input a matrix NxC1 where N, here again, is the number of input vectors and C1 is the number of components of each vector. As a result, it returns a new matrix NxC2 (C2 usually lower than C1) where each component of each vector indicates the degree to which the vector belongs to the corresponding cluster.
I noticed that the input and output of both classes of algorithms have the same structure; only the interpretation of the results changes. Moreover, there is no fuzzy clustering implementation in scikit-learn, hence the following question:
Does it make sense to use a dimensionality reduction algorithm to perform fuzzy clustering?
For instance, is it nonsense to apply FeatureAgglomeration or TruncatedSVD to a dataset built from TF-IDF vectors extracted from textual data, and to interpret the results as a fuzzy clustering?
In some sense, sure. It kind of depends on how you want to use the results downstream.
Consider SVD truncation or the exclusion of principal components. We have projected into a new, variance-preserving space with few other restrictions on the structure of the new manifold. The new coordinate representations of the original data points could have large negative numbers for some elements, which is a little weird. But one could shift and rescale the data without much difficulty.
One could then interpret each dimension as a cluster membership weight. But consider a common use for fuzzy clustering, which is to generate a hard clustering; notice how easy this is with fuzzy cluster weights (e.g. just take the max). Consider a set of points in the new dimensionally-reduced space, say <0,0,1>, <0,1,0>, <0,100,101>, <5,100,99>. A thresholded fuzzy clustering would give something like {p1,p2}, {p3,p4}, but if we take the max here (i.e. treat the dimensionally-reduced axes as memberships), we get {p1,p3}, {p2,p4}, for k=2, for instance. Of course, one could use a better algorithm than max to derive hard memberships (say by looking at pairwise distances, which would work for my example); such algorithms are called, well, clustering algorithms.
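A tiny sketch of that contrast, using the four example points above (the choice of KMeans for the distance-based view is mine):

```python
import numpy as np
from sklearn.cluster import KMeans

P = np.array([[0., 0., 1.],
              [0., 1., 0.],
              [0., 100., 101.],
              [5., 100., 99.]])

# Treat the reduced axes as membership weights and take the max:
print(P.argmax(axis=1))  # [2 1 2 1] -> groups {p1, p3} and {p2, p4}

# A distance-based hard clustering groups the points the other way:
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(P))  # {p1, p2} vs {p3, p4}
```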
Of course, different dimensionality reduction algorithms may work better or worse for this (e.g. MDS, which focuses on preserving distances between data points rather than variances, is more naturally cluster-like). But fundamentally, many dimensionality reduction algorithms implicitly preserve data about the underlying manifold that the data lie on, whereas fuzzy cluster vectors only hold information about the relations between data points (which may or may not implicitly encode that other information).
Overall, the purpose is a little different. Clustering is designed to find groups of similar data. Feature selection and dimensionality reduction are designed to reduce the noise and/or redundancy of the data by changing the embedding space. Often we use the latter to help with the former.
Suppose I have a training set made of (x, y) samples.
To apply a generative algorithm, say Gaussian discriminant analysis, must I assume that
p(x|y) ~ Normal(mu, sigma) for every possible sigma,
or do I just need to know whether x ~ Normal(mu, sigma) given y?
How can I evaluate whether p(x|y) follows a multivariate normal distribution well enough (up to some threshold) for me to use a generative algorithm?
That's a lot of questions.
To apply a generative algorithm, say Gaussian discriminant analysis, must I assume that p(x|y) ~ Normal(mu, sigma) for every possible sigma
No, you must assume that is true for some (mu, sigma) pair. In practice you won't know what mu and sigma are, so you'll need to either estimate them (frequentist maximum likelihood / maximum a posteriori estimates) or, even better, incorporate the uncertainty about your parameter estimates into your predictions (the Bayesian methodology).
How can I evaluate if p(x|y) follows a multivariate Normal distribution?
Classically, using a goodness-of-fit test. If the dimensionality of x is more than a handful, though, this won't work, because standard tests involve counting items in bins, and the number of bins you need in high dimensions is astronomical, so you get very low expected counts.
A better idea is to ask: what are my options for modelling the (conditional) distribution of x? You can compare between these options using model comparison techniques. Read up on model checking and model comparison.
Finally, your last point:
well enough (up to some threshold) for me to use a generative algorithm?
The paradox of many generative methods, including Fisher's Linear Discriminant Analysis for example, as well as the Naive Bayes classifier, is that the classifier can work very well even though the model is poor for the data. There's no particularly sound reason why this should be the case, but many have observed it to be empirically true. Whether it works can be checked much more easily than whether the assumed distribution explains the data very well: just split your data into training and testing and find out!
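A sketch of that "just find out" approach, using GaussianNB as a stand-in generative classifier on synthetic data (both choices are mine, not from the question):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Score the generative classifier on held-out data instead of testing
# whether the Gaussian assumption actually holds.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB().fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # held-out accuracy
```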
I'm going through the ML Class on Coursera on Logistic Regression and also the Manning Book Machine Learning in Action. I'm trying to learn by implementing everything in Python.
I'm not able to understand the difference between the cost function and the gradient. There are examples on the net where people compute the cost function, and then there are places where they don't and just go with the gradient descent update w := w - alpha * ∇f(w).
What is the difference between the two if any?
Whenever you train a model on your data, you are actually producing some new (predicted) values for a target variable. However, that target variable already has real values in the dataset. We know that the closer the predicted values are to their corresponding real values, the better the model.
Now, we use a cost function to measure how close the predicted values are to their corresponding real values.
We should also consider that the weights of the trained model are responsible for accurately predicting the new values. Imagine that our model is y = 0.9*X + 0.1; then the predicted value is nothing but 0.9*X + 0.1 for different values of X.
[0.9 and 0.1 in the equation are just arbitrary values used for illustration.]
So, taking Y as the real value corresponding to this X, the cost function measures how close 0.9*X + 0.1 is to Y.
We are responsible for finding better weights (the 0.9 and 0.1) for our model so as to reach the lowest cost (that is, predicted values as close as possible to the real ones).
Gradient descent is an optimization algorithm (there are other optimization algorithms) whose responsibility is to find the minimum cost in the process of trying the model with different weights, or indeed, of updating the weights.
We first run our model with some initial weights; gradient descent then updates the weights and computes the cost of the model with those weights, over thousands of iterations, to find the minimum cost.
One point is that gradient descent is not minimizing the weights, it is just updating them. What this algorithm is looking for is the minimum cost.
A cost function is something you want to minimize. For example, your cost function might be the sum of squared errors over your training set. Gradient descent is a method for finding the minimum of a function of multiple variables. So you can use gradient descent to minimize your cost function. If your cost is a function of K variables, then the gradient is the length-K vector that defines the direction in which the cost is increasing most rapidly. So in gradient descent, you follow the negative of the gradient to the point where the cost is a minimum. If someone is talking about gradient descent in a machine learning context, the cost function is probably implied (it is the function to which you are applying the gradient descent algorithm).
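To make the separation concrete, here is a minimal sketch (a one-parameter least-squares fit; the data are made up):

```python
import numpy as np

# Fit y ~ w * x by least squares: `cost` is the function we minimize,
# and gradient descent is merely the loop that minimizes it.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

def cost(w):
    return np.mean((w * x - y) ** 2)     # mean squared error

def gradient(w):
    return np.mean(2 * (w * x - y) * x)  # d(cost)/dw

w, alpha = 0.0, 0.01
for _ in range(1000):
    w -= alpha * gradient(w)             # w := w - alpha * ∇cost(w)
print(w, cost(w))                        # w ends up close to 2
```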
It's strange to think about it, but there is more than one measure for how "accurately" a line fits to data points.
To assess how accurately a line fits the data, we have a "cost" function which can compare predicted vs. actual values and provide a "penalty" for how wrong it is.
penalty = cost_function(predicted, actual)
A naive cost function might just take the difference between the predicted and actual.
More sophisticated functions will square the value, since we'd rather have many small errors than one large error.
Additionally, each point has a different "sensitivity" to moving the line. Some points react very strongly to movement. Others react less strongly.
Often, you can make a tradeoff, and move TOWARD a point that is sensitive and AWAY from a point that is NOT sensitive. In that scenario, you gain more than you give up.
The "gradient" is a way of measuring how sensitive each point is to moving the line.
This article does a good job of describing WHY there is more than one measure, and WHY some points are more sensitive than others:
https://towardsdatascience.com/wrapping-your-head-around-gradient-descent-with-pictures-3fbd810235f5?source=friends_link&sk=7117e5de8c66bd4a4c2bb2a87a928773
Let's take the example of a logistic regression model for binary classification. During training, the output (predicted value) of the model for any given input will deviate from the actual output (expected value). So, the model needs to be trained to minimal error (loss) so that it can perform well with high accuracy.
The function used to find the parameter values (m and c in the case of the linear equation y = mx + c) at which the minimal error (loss) occurs is called the cost function / loss function. "Loss function" is the term used for the loss on a single row/record of the training sample, and "cost function" is the term used for the loss over the entire training dataset.
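A two-line sketch of that distinction (made-up numbers):

```python
import numpy as np

predicted = np.array([1.0, 2.5, 3.8])
expected  = np.array([1.2, 2.0, 4.0])

loss_per_row = (predicted - expected) ** 2  # one loss value per record
cost = loss_per_row.mean()                  # one number for the whole dataset
print(loss_per_row, cost)
```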
Now, how do we find the parameter values (m and c in our case) at which the minimum loss occurs? By using the gradient descent algorithm and its update equation, which help us find the point at which the minimum loss occurs; the parameter values at that point are used for the model (say y = 0.5x + 2, where m = 0.5 and c = 2 are the values at which the loss is minimum).
The cost function expresses the "cost" at which you are building your model; for a good model, that cost should be minimal. To find the minimum of the cost function we use the gradient descent method, which yields the coefficient values that minimize the cost function.