I was reading this paper on word2vec and came across the following description of a feedforward NNLM:
It consists of input, projection, hidden and output layers. At the input layer, N previous words are encoded using 1-of-V coding, where V is size of the vocabulary. The input layer is then projected to a projection layer P that has dimensionality N × D, using a shared projection matrix. As only N inputs are active at any given time, composition of the projection layer is a relatively cheap operation.
The following expression is given for the computational complexity per training example:
Q = N×D + N×D×H + H×V.
The last two terms make sense to me: N×D×H is roughly the number of parameters in a dense layer from the N×D-dimensional projection layer to the H hidden neurons, and analogously for H×V. The first term, however, I expected to be V×D, since the mapping from a one-hot encoded word to a D-dimensional vector is done via a V×D dimensional matrix. I came to that conclusion after reading this referenced paper and this SO post where the workings of the projection layer are explained in more detail.
Perhaps I have misunderstood what is meant by "training complexity".
If I understand this excerpt correctly, it is N×D because N previous words are taken into account: you don't look up just one word, nor all V words of the vocabulary. Your input consists of N words, each mapped to a D-dimensional vector, so N×D is the cost of building this layer.
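To make that concrete, here is a minimal numpy sketch (all names and sizes are made up) of why the projection step costs on the order of N×D rather than V×D: with 1-of-V coding, multiplying a one-hot word vector by the shared V×D matrix reduces to a row lookup, so only N rows of D numbers are ever touched.

import numpy as np

V, D, N = 10_000, 300, 4          # vocabulary size, embedding dim, context words
C = np.random.randn(V, D)         # shared V x D projection (embedding) matrix

context_ids = [17, 42, 256, 9]    # indices of the N previous words (1-of-V coding)

# Multiplying a one-hot vector by C is equivalent to a row lookup,
# so building the projection layer touches only N * D numbers:
projection = C[context_ids]                # shape (N, D)
projection_flat = projection.reshape(-1)   # the N*D-dimensional projection layer
print(projection_flat.shape)               # (1200,)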
I read the article from the link below.
https://towardsdatascience.com/principal-component-analysis-pca-with-scikit-learn-1e84a0c731b0
The author does a nice job describing the process of PCA decomposition. I feel like I understand everything except for one thing. How can we know which principal components were selected and thus preserved for the eventual improved performance of our ML algorithms? For instance, the author starts with this.
import pandas as pd
from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()
df = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
df.head()
Ok, I know what the features are. Great. Then all the fun happens, and we end up with this at the end.
df_new = pd.DataFrame(X_pca_95, columns=['PC1','PC2','PC3','PC4','PC5','PC6','PC7','PC8','PC9','PC10'])
df_new['label'] = cancer.target
df_new
Of the 30 features that we started with, how do we know what the final 10 columns consist of? It seems like there has to be some kind of last step to map the df_new to df?
To understand this, you need to know a little more about PCA. PCA returns all principal components that span the whole space of vectors, i.e., the eigenvectors of the covariance matrix of the features together with their eigenvalues. You can then select eigenvectors based on the size of their corresponding eigenvalues: pick the largest eigenvalues and keep their eigenvectors.
Now if you look at the documentation of PCA method in scikit learn, you find some useful properties like the following:
components_ : ndarray of shape (n_components, n_features)
Principal axes in feature space, representing the directions of maximum variance in the data. The components are sorted by explained_variance_.
explained_variance_ratio_ : ndarray of shape (n_components,)
Percentage of variance explained by each of the selected components. If n_components is not set then all components are stored and the sum of the ratios is equal to 1.0.
explained_variance_ratio_ is a very useful property that you can use to select principal components based on a desired threshold for the percentage of variance covered. For example, suppose the values in this array are [0.4, 0.3, 0.2, 0.1]. If we take the first three components, the covered variance is 90% of the whole variance of the original data.
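Here is a short scikit-learn sketch of that idea on the same breast-cancer data (the 95% threshold and variable names are my own choices, and standardizing first is a common preprocessing step I'm assuming here):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X = StandardScaler().fit_transform(cancer.data)    # PCA is scale-sensitive

pca = PCA().fit(X)                                  # keep all 30 components
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cum_var >= 0.95) + 1       # smallest k covering 95% variance
print(n_components, cum_var[:n_components])

# Equivalent shortcut: PCA(n_components=0.95) picks k for you.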
Almost surely the 10 resulting columns are all composed of pieces of all 30 original features. The PCA object has an attribute components_ that shows the coefficients defining the principal components in terms of the original features.
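For example, reusing pca, n_components and cancer from the sketch above (the names are mine), you could inspect the loadings of each principal component like this:

import pandas as pd

# Rows = principal components, columns = the 30 original features.
loadings = pd.DataFrame(
    pca.components_[:n_components],
    columns=cancer.feature_names,
    index=[f'PC{i+1}' for i in range(n_components)],
)

# The features with the largest absolute weight dominate PC1:
print(loadings.loc['PC1'].abs().sort_values(ascending=False).head())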
As decision trees are non-linear models, Random Forest should also be a non-linear method in my opinion. But in some articles I have read otherwise. Can anyone explain how they are non-linear, or not?
Or, in other words, is Random Forest for linear or non-linear data?
If I have a dependent variable A and independent variables B, C and so on, how would RF fit a regression on these variables in the data?
What RF does is divide your data into rectangular boxes.
When you then get a new data point, it follows the yes/no answers of the trees and ends up in a box.
In classification, it counts how many samples of each class are in each box, and the majority class is the prediction.
When doing regression, it takes the mean of the values in each box.
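To see that box-wise averaging in action, here is a minimal scikit-learn sketch (the data and hyperparameters are made up): a random forest regressor fitted to a noisy sine follows the curve as a piecewise-constant function, while a plain linear regression cannot.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)   # non-linear relationship

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
lin = LinearRegression().fit(X, y)

X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
print(rf.predict(X_test))    # follows the sine shape (box-wise averages)
print(lin.predict(X_test))   # a straight line, unable to capture the curvature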
In a regression setting you have the following equation:
y = b0 + x1*b1 + x2*b2 + ... + xn*bn
where xi is your feature i and bi is the coefficient of xi.
A linear regression is linear in the coefficients, but say we have the following regression:
y = b0 + x1*b1 + x2*cos(b2)
That is not a linear regression, since it is not linear in the coefficient b2.
To check whether it is linear, the derivative of y with respect to bi should be independent of bi for all bi. Take the first example (the linear one):
dy/db1 = x1
which is independent of b1 (and the same holds for every dy/dbi). But in the second example
# y = b0 + x1*b1 + x2*cos(b2)
dy/db2 = -x2*sin(b2)
which is not independent of b2, and thus it is not a linear regression.
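If you want to check this mechanically, here is a small sympy sketch of the derivative test described above (the symbols are mine):

import sympy as sp

x1, x2, b1, b2 = sp.symbols('x1 x2 b1 b2')

# Linear-in-coefficients model: the derivative w.r.t. b1 does not contain b1.
y_lin = x1*b1 + x2*b2
print(sp.diff(y_lin, b1))        # x1  -> independent of b1

# Non-linear-in-coefficients model: the derivative w.r.t. b2 still contains b2.
y_nonlin = x1*b1 + x2*sp.cos(b2)
print(sp.diff(y_nonlin, b2))     # -x2*sin(b2)  -> depends on b2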
As you can see, RF and linear regression are two different things, and the linearity of a regression has nothing to do with an RF (or the other way round, for that matter).
Why can't regularization accept Φ itself as the penalizing term, but accepts Φ^2 (L2) and |Φ| (L1)?
Are there any other forms of penalizing term?
The regularization term is usually a vector norm, because it outputs a scalar value that measures the length of the vector in a certain space. What is important here is that it is a scalar value, not a vector. You cannot use the vector itself as the regularization term, because the regularization value is added to the loss function, which is also a scalar. Thus, you need to compute some scalar from this vector, and that is exactly what a vector norm does. As you can imagine, having a scalar value as the regularization term is pretty intuitive: the higher it is, the more regularization.
And yes, there are some other regularization methods. For instance, the elastic net method, which combines the L1 and L2 norm functions. But usually the most popular ones are plain L1 (Lasso regression) and plain L2 (ridge regression). I encourage you also to look into the Bayesian assumptions behind the regularization terms; they are pretty interesting and a completely different point of view ;)
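As a minimal numpy sketch (the weights and coefficients are arbitrary), this is how the L1, L2 and elastic-net penalties each collapse the weight vector into a single scalar that can be added to the scalar loss:

import numpy as np

w = np.array([0.5, -1.2, 3.0, 0.0])   # model weights (phi), arbitrary values
data_loss = 0.8                        # some scalar loss, e.g. MSE on a batch
lam, l1_ratio = 0.1, 0.5               # regularization strength / L1-L2 mix

l1 = np.sum(np.abs(w))                 # |phi|  -> scalar, Lasso penalty
l2 = np.sum(w ** 2)                    # phi^2  -> scalar, ridge penalty
elastic = l1_ratio * l1 + (1 - l1_ratio) * l2   # elastic net combines both

total_loss = data_loss + lam * l2      # the penalty is simply added to the loss
print(l1, l2, elastic, total_loss)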
In other families of algorithms, such as neural networks, regularization can also be done by early stopping, or, in variational autoencoders, with the prior belief of an isotropic Gaussian on the latent variable distribution.
I'm currently working on CNNs and I want to know what the function of temperature is in the softmax formula, and why we should use high temperatures to get a softer probability distribution.
One reason to use the temperature is to change the output distribution computed by your neural net. It is applied to the logits vector according to this equation:
q_i = exp(z_i / T) / Σ_j exp(z_j / T)
where T is the temperature parameter.
You see, what this will do is change the final probabilities. You can choose T to be any positive value (the higher the T, the 'softer' the distribution will be; if it is 1, the output distribution will be the same as your normal softmax outputs). What I mean by 'softer' is that the model will basically be less confident about its prediction. As T gets closer to 0, the 'harder' the distribution gets.
a) Sample 'hard' softmax probs : [0.01,0.01,0.98]
b) Sample 'soft' softmax probs : [0.2,0.2,0.6]
'a' is a 'harder' distribution. Your model is very confident about its predictions. However, in many cases, you don't want your model to do that. For example, if you are using an RNN to generate text, you are basically sampling from your output distribution and choosing the sampled word as your output token (and next input). If your model is extremely confident, it may produce very repetitive and uninteresting text. You want it to produce more diverse text, but it will not, because during sampling most of the probability mass is concentrated in a few tokens, so your model keeps selecting the same small set of words over and over again. In order to give other words a chance of being sampled as well, you can plug in the temperature variable and produce more diverse text.
With regard to why higher temperatures lead to softer distributions, that has to do with the exponential function. The temperature parameter penalizes bigger logits more than smaller logits. The exponential function is an increasing function, so if a term is already big, penalizing it makes it much smaller (percentage-wise) than if that term were small.
Here's what I mean,
exp(6) ~ 403
exp(3) ~ 20
Now let's 'penalize' these terms with a temperature of, let's say, 1.5:
exp(6/1.5) ~ 54
exp(3/1.5) ~ 7.4
You can see that in % terms, the bigger the term is, the more it shrinks when the temperature is used to penalize it. When the bigger logits shrink more than your smaller logits, more probability mass (to be computed by the softmax) will be assigned to the smaller logits.
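Here is a tiny numpy sketch of the formula above (the logits are made up, the helper name is mine, and subtracting the max is only a numerical-stability detail that does not change the result):

import numpy as np

def softmax_with_temperature(logits, T=1.0):
    # Subtracting the max before exponentiating is a standard stability trick.
    z = (logits - np.max(logits)) / T
    e = np.exp(z)
    return e / e.sum()

logits = np.array([6.0, 3.0, 1.0])
print(softmax_with_temperature(logits, T=1.0))   # sharp: most mass on the first logit
print(softmax_with_temperature(logits, T=1.5))   # softer: mass spreads out
print(softmax_with_temperature(logits, T=5.0))   # close to uniform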
I am new to Machine Learning and I have started following Udacity's Intro to Machine Learning.
I was following Support Vector Machines when this concept of C and gamma came along. I did some digging around and found the following:
C - A high C tries to minimize the misclassification of training data and a low value tries to maintain a smooth classification. This makes sense to me.
Gamma - I am unable to understand this one.
Can someone explain this to me in layman terms?
When you are using an SVM, you are necessarily using one of the kernels: linear, polynomial, RBF = Radial Basis Function (also called the Gaussian kernel), or something else. The RBF kernel is
K(x,x') = exp(-gamma * ||x-x'||^2)
which explicitly contains your gamma. The larger the gamma, the narrower the Gaussian "bell" is.
I believe, as you go on with the course, you will learn more about this "kernel trick".
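If it helps, here is a quick numpy sketch of that kernel with two made-up points (the helper name is mine), showing how a larger gamma makes the similarity drop off faster with distance, i.e. a narrower bell:

import numpy as np

def rbf_kernel(x, x_prime, gamma):
    # K(x, x') = exp(-gamma * ||x - x'||^2)
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

x, x_prime = np.array([0.0, 0.0]), np.array([1.0, 1.0])
for gamma in (0.1, 1.0, 10.0):
    print(gamma, rbf_kernel(x, x_prime, gamma))
# Larger gamma -> the similarity between the two points drops much faster,
# so each training sample influences only its close neighbours.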
Intuitively, the gamma parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. The gamma parameters can be seen as the inverse of the radius of influence of samples selected by the model as support vectors.
The C parameter trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly by giving the model freedom to select more samples as support vectors.
http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html
- C parameter: C determines how many data samples are allowed to be placed on the wrong side of the decision boundary. If C is set to a low value, more outliers are tolerated and a more general decision boundary is found. If C is set high, the decision boundary is fitted to the training data more carefully. C is used in the soft margin, which requires an understanding of slack variables.
- Soft margin classifier: slack variables determine how much each sample is allowed to violate the margin.
- Gamma parameter: gamma determines how far the influence of a single data sample reaches. That is, the gamma parameter can be said to adjust the curvature of the decision boundary.
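As a rough illustration (the dataset and the grid of values are arbitrary), here is a small scikit-learn sketch that varies C and gamma with an RBF kernel; large C and large gamma tend to fit the training data very closely (and may overfit), while small values give a smoother, simpler boundary:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for C in (0.1, 1, 100):
    for gamma in (0.1, 1, 10):
        clf = SVC(kernel='rbf', C=C, gamma=gamma).fit(X_tr, y_tr)
        print(f"C={C:<5} gamma={gamma:<4} "
              f"train={clf.score(X_tr, y_tr):.2f} test={clf.score(X_te, y_te):.2f}")
# Compare train vs. test accuracy across the grid: high C and high gamma push
# training accuracy up, while moderate values usually generalize better.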