I try to get the importance weights of every feature from my dataframe.
I use this code from scikit documentation:
names=['Class label', 'Alcohol',
'Malic acid', 'Ash',
'Alcalinity of ash', 'Magnesium',
'Total phenols', 'Flavanoids',
'Nonflavanoid phenols',
'Proanthocyanins',
'Color intensity', 'Hue',
'OD280/OD315 of diluted wines',
'Proline']
df_wine = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data', header=None,names=names)
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=10000,
random_state=0,
n_jobs=-1)
forest.fit(X_train, y_train)
feat_labels = df_wine.columns[1:]
importances = forest.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
print("%2d) %-*s %f" % (f + 1, 30,feat_labels[f], importances[indices[f]]))
but despite I understand np.argsort method, I still don't comprehend this FOR loop.
Why do we use "indices" for indexing "importances" array? And why we can't simply use such code:
for f in range(X_train.shape[1]):
print("%2d) %-*s %f" % (f + 1, 30,feat_labels[f], importances[f]))
Output in case of using "importances[indices[f]]"(first 5 rows):
1) Alcohol 0.182483
2) Malic acid 0.158610
3) Ash 0.150948
4) Alcalinity of ash 0.131987
5) Magnesium 0.106589
Output in case of "importances[f]"(first 5 rows):
1) Alcohol 0.106589
2) Malic acid 0.025400
3) Ash 0.013916
4) Alcalinity of ash 0.032033
5) Magnesium 0.022078
This is not what is placed in the docs, look closely, it says
# FROM DOCS
for f in range(X.shape[1]):
print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
which is correct, and not
# FROM YOUR QUESTION
for f in range(X_train.shape[1]):
print("%2d) %-*s %f" % (f + 1, 30,feat_labels[f], importances[indices[f]]))
which is wrong. If you want to use feat_labels you should do
# CORRECT SOLUTION
for f in range(X_train.shape[1]):
print("%2d) %-*s %f" % (f + 1, 30,feat_labels[indices[f]], importances[indices[f]]))
Their approach is used because they want to iterate in decreasing order of the feature importances, not using "indices" would use ordering of features instead. Both are fine, the only incorrect one is the first one you proposed - which is a mix of both approaches and incorrectly assigns importance to features.
Related
Could anyone please explain me the below code:
import torch
import torch.nn as nn
input = torch.randn(5, 3, 10)
h0 = torch.randn(2, 3, 20)
c0 = torch.randn(2, 3, 20)
rnn = nn.LSTM(10,20,2)
output, (hn, cn) = rnn(input, (h0, c0))
print(input)
While calling rnn rnn(input, (h0, c0)) we gave arguments h0 and c0 in parenthesis. What is it supposed to mean? if (h0, c0) represents a single value then what is that value and what is the third argument passed here?
However, in the line rnn = nn.LSTM(10,20,2) we are passing arguments in LSTM function without paranthesis.
Can anyone explain me how this function call is working?
The assignment rnn = nn.LSTM(10, 20, 2) instanciates a new nn.Module using the nn.LSTM class. It's first three arguments are input_size (here 10), hidden_size (here 20) and num_layers (here 2).
On the other hand rnn(input, (h0, c0)) corresponds to actually calling the class instance, i.e. running __call__ which is roughly equivalent to the forward function of that module. The __call__ method of nn.LSTM takes in two parameters: input (shaped (sequnce_length, batch_size, input_size), and a tuple of two tensors (h_0, c_0) (both shaped (num_layers, batch_size, hidden_size) in the basic use case of nn.LSTM)
Please refer to the PyTorch documentation whenever using builtins, you will find the exact definition of the parameters list (the arguments used to initialize the class instance) as well as the input/outputs specifications (whenever inferring with that said module).
You might be confused with the notation, here's a small example that could help:
tuple as input:
def fn1(x, p):
a, b = p # unpack input
return a*x + b
>>> fn1(2, (3, 1))
>>> 7
tuple as output
def fn2(x):
return x, (3*x, x**2) # actually output is a tuple of int and tuple
>>> x, (a, b) = fn2(2) # unpacking
(2, (6, 4))
>>> x, a, b
(2, 6, 4)
I have two probability distributions. How should I find the KL-divergence between them in PyTorch? The regular cross entropy only accepts integer labels.
Yes, PyTorch has a method named kl_div under torch.nn.functional to directly compute KL-devergence between tensors. Suppose you have tensor a and b of same shape. You can use the following code:
import torch.nn.functional as F
out = F.kl_div(a, b)
For more details, see the above method documentation.
function kl_div is not the same as wiki's explanation.
I use the following:
# this is the same example in wiki
P = torch.Tensor([0.36, 0.48, 0.16])
Q = torch.Tensor([0.333, 0.333, 0.333])
(P * (P / Q).log()).sum()
# tensor(0.0863), 10.2 µs ± 508
F.kl_div(Q.log(), P, None, None, 'sum')
# tensor(0.0863), 14.1 µs ± 408 ns
compare to kl_div, even faster
If you have two probability distribution in form of pytorch distribution object. Then you are better off using the function torch.distributions.kl.kl_divergence(p, q). For documentation follow the link
If working with Torch distributions
mu = torch.Tensor([0] * 100)
sd = torch.Tensor([1] * 100)
p = torch.distributions.Normal(mu,sd)
q = torch.distributions.Normal(mu,sd)
out = torch.distributions.kl_divergence(p, q).mean()
out.tolist() == 0
True
If you are using the normal distribution, then the following code will directly compare the two distributions themselves:
p = torch.distributions.normal.Normal(p_mu, p_std)
q = torch.distributions.normal.Normal(q_mu, q_std)
loss = torch.distributions.kl_divergence(p, q)
p and q are two tensor objects.
This code will work and won't give any NotImplementedError.
I have a dataset consisting of student marks in 2 subjects and the result if the student is admitted in college or not. I need to perform a logistic regression on the data and find the optimum parameter θ to minimize the loss and predict the results for the test data. I am not trying to build any complex non linear network here.
The data looks like this
I have the loss function defined for logistic regression like this which works fine
predict(X) = sigmoid(X*θ)
loss(X,y) = (1 / length(y)) * sum(-y .* log.(predict(X)) .- (1 - y) .* log.(1 - predict(X)))
I need to minimize this loss function and find the optimum θ. I want to do it with Flux.jl or any other library which makes it even easier.
I tried using Flux.jl after reading the examples but not able to minimize the cost.
My code snippet:
function update!(ps, η = .1)
for w in ps
w.data .-= w.grad .* η
print(w.data)
w.grad .= 0
end
end
for i = 1:400
back!(L)
update!((θ, b))
#show L
end
You can use either GLM.jl (simpler) or Flux.jl (more involved but more powerful in general).
In the code I generate the data so that you can check if the result is correct. Additionally I have a binary response variable - if you have other encoding of target variable you might need to change the code a bit.
Here is the code to run (you can tweak the parameters to increase the convergence speed - I chose ones that are safe):
using GLM, DataFrames, Flux.Tracker
srand(1)
n = 10000
df = DataFrame(s1=rand(n), s2=rand(n))
df[:y] = rand(n) .< 1 ./ (1 .+ exp.(-(1 .+ 2 .* df[1] .+ 0.5 .* df[2])))
model = glm(#formula(y~s1+s2), df, Binomial(), LogitLink())
x = Matrix(df[1:2])
y = df[3]
W = param(rand(2,1))
b = param(rand(1))
predict(x) = 1.0 ./ (1.0+exp.(-x*W .- b))
loss(x,y) = -sum(log.(predict(x[y,:]))) - sum(log.(1 - predict(x[.!y,:])))
function update!(ps, η = .0001)
for w in ps
w.data .-= w.grad .* η
w.grad .= 0
end
end
i = 1
while true
back!(loss(x,y))
max(maximum(abs.(W.grad)), abs(b.grad[1])) > 0.001 || break
update!((W, b))
i += 1
end
And here are the results:
julia> model # GLM result
StatsModels.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial{Float64},GLM.LogitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}
Formula: y ~ 1 + s1 + s2
Coefficients:
Estimate Std.Error z value Pr(>|z|)
(Intercept) 0.910347 0.0789283 11.5338 <1e-30
s1 2.18707 0.123487 17.7109 <1e-69
s2 0.556293 0.115052 4.83513 <1e-5
julia> (b, W, i) # Flux result with number of iterations needed to converge
(param([0.910362]), param([2.18705; 0.556278]), 1946)
Thanks for this helpful example. However, it does not seem to run with my setup (Julia 1.1, Flux 0.7.1.), since the 1+ and 1- operations in the predict and loss function is not broadcast over the TrackedArray objects. Fortunately the fix is simple (note the dots!):
predict(x) = 1.0 ./ (1.0 .+ exp.(-x*W .- b))
loss(x,y) = -sum(log.(predict(x[y,:]))) - sum(log.(1 .- predict(x[.!y,:])))
I created a table to test my understanding
F1 F2 Outcome
0 2 5 1
1 4 8 2
2 6 0 3
3 9 8 4
4 10 6 5
From F1 and F2 I tried to predict Outcome
As you can see F1 have a strong correlation to Outcome,F2 is random noise
I tested
pca = PCA(n_components=2)
fit = pca.fit(X)
print("Explained Variance")
print(fit.explained_variance_ratio_)
Explained Variance
[ 0.57554896 0.42445104]
Which is what I expected and shows that F1 is more important
However when I do RFE (Recursive Feature Elimination)
model = LogisticRegression()
rfe = RFE(model, 1)
fit = rfe.fit(X, Y)
print(fit.n_features_)
print(fit.support_)
print(fit.ranking_)
1
[False True]
[2 1]
It asked me to keep F2 instead? It should ask me to keep F1 since F1 is a strong predictor while F2 is random noise... why F2?
Thanks
It is advisable to do a Recursive Feature Elimination Cross Validation (RFECV) before running the Recursive Feature Elimination (RFE)
Here is an example:
Having columns :
df.columns = ['age', 'id', 'sex', 'height', 'gender', 'marital status', 'income', 'race']
Use RFECV to identify the optimal number of features needed.
from sklearn.ensemble import RandomForestClassifier
rfe = RandomForestClassifier(random_state = 32) # Instantiate the algo
rfecv = RFECV(estimator= rfe, step=1, cv=StratifiedKFold(2), scoring="accuracy") # Instantiate the RFECV and its parameters
fit = rfecv.fit(features(or X), target(or y))
print("Optimal number of features : %d" % rfecv.n_features_)
>>>> Optimal number of output is 4
Now that the optimal number of features has been known, we can use Recursive Feature Elimination to identify the Optimal features
from sklearn.feature_selection import RFE
min_features_to_select = 1
rfc = RandomForestClassifier()
rfe = RFE(estimator=rfc, n_features_to_select= 4, step=1)
fittings1 = rfe.fit(features, target)
for i in range(features.shape[1]):
print('Column: %d, Selected %s, Rank: %.3f' % (i, rfe.support_[i], rfe.ranking_[i]))
output will be something like:
>>> Column: 0, Selected True, Rank: 1.000
>>> Column: 1, Selected False, Rank: 4.000
>>> Column: 2, Selected False, Rank: 7.000
>>> Column: 3, Selected False, Rank: 10.000
>>> Column: 4, Selected True, Rank: 1.000
>>> Column: 5, Selected False, Rank: 3.000
Now display the features to remove based on recursive feature elimination done above
columns_to_remove = features.columns.values[np.logical_not(rfe.support_)]
columns_to_remove
output will be something like:
>>> array(['age', 'id', 'race'], dtype=object)
Now create your new dataset by dropping the un-needed features and selecting the needed one
new_df = df.drop(['age', 'id', 'race'], axis = 1)
Then you can cross validation to know how well this newly selected features (new_df) predicts the target column.
# Check how well the features predict the target variable using cross_validation
cv = ShuffleSplit(n_splits=5, test_size=0.3, random_state=0)
scores = cross_val_score(RandomForestClassifier(), new_df, target, cv= cv)
print("%0.2f accuracy with a standard deviation of %0.2f" % (scores.mean(), scores.std()))
>>> 0.84 accuracy with a standard deviation of 0.01
Don't forget you can also read up on the best Cross-Validation (CV) parameters to use in this documentation
Recursive Feature Elimination (RFE) documentation to learn more and understand better
Recursive Feature Elimination Cross validation (RFECV) documentation
You are using LogisticRegression model. This is a classifier, not a regressor. So your outcome here is treated as labels (not numbers). For good training and prediction, a classifier needs multiple samples of each class. But in your data, only single row is present for each class. Hence the results are garbage and not to be taken seriously.
Try replacing that with any regression model and you will see the outcome which you thought would be.
model = LinearRegression()
rfe = RFE(model, 1)
fit = rfe.fit(X, y)
print(fit.n_features_)
print(fit.support_)
print(fit.ranking_)
# Output
1
[ True False]
[1 2]
I am trying to teach my SVM algorithm using data of clicks and conversion by people who see the banners. The main problem is that the clicks is around 0.2% of all data so it's big disproportion in it. When I use simple SVM in testing phase it always predict only "view" class and never "click" or "conversion". In average it gives 99.8% right answers (because of disproportion), but it gives 0% right prediction if you check "click" or "conversion" ones. How can you tune the SVM algorithm (or select another one) to take into consideration the disproportion?
The most basic approach here is to use so called "class weighting scheme" - in classical SVM formulation there is a C parameter used to control the missclassification count. It can be changed into C1 and C2 parameters used for class 1 and 2 respectively. The most common choice of C1 and C2 for a given C is to put
C1 = C / n1
C2 = C / n2
where n1 and n2 are sizes of class 1 and 2 respectively. So you "punish" SVM for missclassifing the less frequent class much harder then for missclassification the most common one.
Many existing libraries (like libSVM) supports this mechanism with class_weight parameters.
Example using python and sklearn
print __doc__
import numpy as np
import pylab as pl
from sklearn import svm
# we create 40 separable points
rng = np.random.RandomState(0)
n_samples_1 = 1000
n_samples_2 = 100
X = np.r_[1.5 * rng.randn(n_samples_1, 2),
0.5 * rng.randn(n_samples_2, 2) + [2, 2]]
y = [0] * (n_samples_1) + [1] * (n_samples_2)
# fit the model and get the separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)
w = clf.coef_[0]
a = -w[0] / w[1]
xx = np.linspace(-5, 5)
yy = a * xx - clf.intercept_[0] / w[1]
# get the separating hyperplane using weighted classes
wclf = svm.SVC(kernel='linear', class_weight={1: 10})
wclf.fit(X, y)
ww = wclf.coef_[0]
wa = -ww[0] / ww[1]
wyy = wa * xx - wclf.intercept_[0] / ww[1]
# plot separating hyperplanes and samples
h0 = pl.plot(xx, yy, 'k-', label='no weights')
h1 = pl.plot(xx, wyy, 'k--', label='with weights')
pl.scatter(X[:, 0], X[:, 1], c=y, cmap=pl.cm.Paired)
pl.legend()
pl.axis('tight')
pl.show()
In particular, in sklearn you can simply turn on the automatic weighting by setting class_weight='auto'.
This paper describes a variety of techniques. One simple (but very bad method for SVM) is just replicating the minority class(s) until you have a balance:
http://www.ele.uri.edu/faculty/he/PDFfiles/ImbalancedLearning.pdf