show feature names after feature selection - machine-learning

I need to build a text classifier, and I'm using TfidfVectorizer and SelectKBest to select the features, as follows:
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english', decode_error='strict')
X_train_features = vectorizer.fit_transform(data_train.data)
y_train_labels = data_train.target
ch2 = SelectKBest(chi2, k=1000)
X_train_features = ch2.fit_transform(X_train_features, y_train_labels)
I want to print out the names (text) of the selected features after selecting the k best. Is there any way to do that? I just need to print the selected feature names; maybe I should use CountVectorizer instead?

The following should work:
np.asarray(vectorizer.get_feature_names())[ch2.get_support()]
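Newer scikit-learn versions (1.0+) replace get_feature_names with get_feature_names_out, which already returns an array, so the boolean mask from get_support() can be applied directly:
selected_features = vectorizer.get_feature_names_out()[ch2.get_support()]
print(selected_features)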

To expand on @ogrisel's answer: the returned list of features is in the same order as the columns of the vectorized matrix. The code below gives you the top-ranked features sorted by their chi-squared scores in descending order (along with the corresponding p-values):
top_ranked_features = sorted(enumerate(ch2.scores_), key=lambda x: x[1], reverse=True)[:1000]
top_ranked_features_indices = [idx for idx, score in top_ranked_features]
for feature, pvalue in zip(np.asarray(vectorizer.get_feature_names())[top_ranked_features_indices],
                           ch2.pvalues_[top_ranked_features_indices]):
    print(feature, pvalue)
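A terser way to get the same indices with plain numpy (a sketch; ties and NaN scores may order differently than the sorted() version):
top_ranked_features_indices = np.argsort(ch2.scores_)[::-1][:1000]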

Related

How to detect contiguous images

I am trying to detect when two images are adjacent pieces of the same larger image, even though they do not overlap.
That is, suppose we have the Lenna image:
Someone has split it vertically in two, and I must determine whether the two pieces are connected (they may be independent images, or one may be a piece of the other).
A: (image of the first piece)
B: (image of the second piece)
The positive part is that I know the order of the pieces; the negative part is that there may be other images, and I must work out which of them fit together.
My first idea was to check whether the MAE between the last row of A and the first row of B is low:
def mae(a, b):
    min_mae = 256
    for i in range(-5, 5, 1):
        # try small horizontal shifts to tolerate misalignment
        a_s = np.roll(a, i, axis=1)
        # cast to float so uint8 subtraction cannot wrap around
        value_mae = np.mean(np.abs(a_s.astype(float) - b))
        min_mae = min(min_mae, value_mae)
    return min_mae

if mae(im_a[im_a.shape[0] - 1:im_a.shape[0], ...], im_b[0:1, ...]) < threshold:
    pass  # join images a and b
The problem is that this is not a very robust metric.
I have done the same using the horizontal derivative, as well as applying various smoothing filters, but I find myself in the same situation.
Is there a way to solve this problem?
Your method seems like a decent one. Even on visual inspection it looks reasonable:
Top (bottom row expanded): (image)
Bottom (top row expanded): (image)
Diff of the images: (image)
It might be even clearer if you also check neighboring columns, but these already look like the images are similar enough.
Code
import cv2
import numpy as np

# load images
top = cv2.imread("top.png")
bottom = cv2.imread("bottom.png")

# convert to grayscale
tgray = cv2.cvtColor(top, cv2.COLOR_BGR2GRAY)
bgray = cv2.cvtColor(bottom, cv2.COLOR_BGR2GRAY)

# expand the facing border rows into 100-row strips for easy viewing
trow = np.zeros_like(tgray)
brow = np.zeros_like(bgray)
trow[:] = tgray[-1, :]  # bottom row of the top image
brow[:] = bgray[0, :]   # top row of the bottom image
trow = trow[:100, :]
brow = brow[:100, :]

# absolute difference: for uint8, min(a - b, b - a) equals |a - b|,
# since the subtraction wraps around modulo 256
ldiff = trow - brow
rdiff = brow - trow
diff = np.minimum(ldiff, rdiff)

# show
cv2.imshow("top", trow)
cv2.imshow("bottom", brow)
cv2.imshow("diff", diff)
cv2.waitKey(0)

# save
cv2.imwrite("top_out.png", trow)
cv2.imwrite("bottom_out.png", brow)
cv2.imwrite("diff_out.png", diff)

Trouble understanding pseudocode for Two-Pass Connected Components

I'm having some trouble understanding Wikipedia's pseudocode for connected-component labeling using the two-pass algorithm. Here's the pseudocode:
algorithm TwoPass(data) is
    linked = []
    labels = structure with dimensions of data, initialized with the value of Background

    First pass
    for row in data do
        for column in row do
            if data[row][column] is not Background then
                neighbors = connected elements with the current element's value
                if neighbors is empty then
                    linked[NextLabel] = set containing NextLabel
                    labels[row][column] = NextLabel
                    NextLabel += 1
                else
                    Find the smallest label
                    L = neighbors labels
                    labels[row][column] = min(L)
                    for label in L do
                        linked[label] = union(linked[label], L)

    Second pass
    for row in data do
        for column in row do
            if data[row][column] is not Background then
                labels[row][column] = find(labels[row][column])

    return labels
My problem is with the line linked[NextLabel] = set containing NextLabel: NextLabel is never initialized, yet it is used. Also, what does "set containing NextLabel" mean? I'm really confused by this code.
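For what it's worth, here is one way to read it as runnable Python. NextLabel is just a counter whose initialization (to 1 here) the pseudocode leaves implicit, and "set containing NextLabel" means a fresh one-element equivalence class holding only the brand-new label. This sketch swaps the pseudocode's linked sets for a small union-find, which plays the same role of recording which provisional labels are equivalent:
import numpy as np

def two_pass(data, background=0):
    parent = {}  # union-find forest over provisional labels

    def find(lab):
        # follow parents up to the representative of lab's class
        while parent[lab] != lab:
            parent[lab] = parent[parent[lab]]  # path compression
            lab = parent[lab]
        return lab

    def union(a, b):
        # merge the two equivalence classes, keeping the smaller root
        ra, rb = find(a), find(b)
        parent[max(ra, rb)] = min(ra, rb)

    labels = np.zeros(data.shape, dtype=int)
    next_label = 1  # the pseudocode's NextLabel, initialized here

    # first pass: provisional labels plus recorded equivalences
    rows, cols = data.shape
    for r in range(rows):
        for c in range(cols):
            if data[r, c] == background:
                continue
            # 4-connectivity: only the already-visited neighbors count
            neigh = set()
            if r > 0 and data[r - 1, c] == data[r, c]:
                neigh.add(labels[r - 1, c])
            if c > 0 and data[r, c - 1] == data[r, c]:
                neigh.add(labels[r, c - 1])
            if not neigh:
                parent[next_label] = next_label  # the new class {NextLabel}
                labels[r, c] = next_label
                next_label += 1
            else:
                labels[r, c] = min(neigh)
                for lab in neigh:
                    union(lab, min(neigh))

    # second pass: collapse every label to its class representative
    for r in range(rows):
        for c in range(cols):
            if data[r, c] != background:
                labels[r, c] = find(labels[r, c])
    return labels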

Is it possible to vectorize this calculation in numpy?

Can the following expression of numpy arrays be vectorized for speed-up?
k_lin1x = [2*k_lin[i]*k_lin[i+1]/(k_lin[i]+k_lin[i+1]) for i in range(len(k_lin)-1)]
x1 = k_lin
x2 = np.roll(k_lin, -1)  # shift left by one position, so x2[i] == k_lin[i + 1]
s = len(k_lin) - 1
result1 = x1[:s] + x2[:s]  # your divisor: k_lin[i] + k_lin[i+1]
result2 = x1[:s] * x2[:s]  # your upper part: k_lin[i] * k_lin[i+1]
# in one line
result = 2 * x1[:s] * x2[:s] / (x1[:s] + x2[:s])
Note that np.roll returns a new array rather than shifting in place, so its result must be assigned to x2. The slice [:s] drops the last element, which after the roll has wrapped around to the front and is useless for the calculation. This is merely a demo of the approach; see the numpy.roll documentation for details.
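Equivalently, plain slicing avoids the roll altogether; a quick self-contained sketch:
import numpy as np

k_lin = np.array([1.0, 2.0, 3.0, 4.0])
# harmonic mean of each adjacent pair, vectorized with shifted slices
k_lin1x = 2 * k_lin[:-1] * k_lin[1:] / (k_lin[:-1] + k_lin[1:])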

K-means initialization with furthest-first traversal and k-means++

I am confused about k-means++ initialization. I understand that k-means++ tends to choose a far-away data point as the next center, but what about outliers? What is the difference between initialization with furthest-first traversal and k-means++?
I saw someone explain it this way:
Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1 - x||^2. So P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16). Suppose c2 = 4. Then P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
What if the array or list is [0, 1, 2, 4, 5, 6, 100]? Obviously 100 is the outlier in this case, and it will be chosen as a center at some point. Can someone give a better explanation?
k-means++ chooses points randomly, with probability proportional to the squared distance to the nearest existing center, whereas furthest-first traversal deterministically picks the furthest point.
But yes, with extreme outliers it is likely to choose the outlier.
That is fine, because so will k-means. Most likely the best SSQ solution has a one-element cluster containing only that point.
If you have such data, the k-means solutions tend to be rather useless, and you probably should choose another algorithm such as DBSCAN instead.
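To make the sampling concrete, here is a minimal numpy sketch of the k-means++ D^2 step on the example above (kmeanspp_next_center is a made-up helper name):
import numpy as np

rng = np.random.default_rng(0)

def kmeanspp_next_center(points, centers):
    # squared distance from each point to its nearest existing center
    d2 = np.min((points[:, None] - np.array(centers)[None, :]) ** 2, axis=1)
    return rng.choice(points, p=d2 / d2.sum())  # sample proportionally to D^2

points = np.array([0, 1, 2, 4, 5, 6, 100], dtype=float)
centers = [0.0]
centers.append(kmeanspp_next_center(points, centers))
# with 100 present, P(next center = 100) = 10000 / 10082, about 0.99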

How to estimate a "simple" nonlinear regression with parameter constraints and AR residuals

I am new to this site, so please bear with me. I want to estimate the nonlinear model shown in the link: https://i.stack.imgur.com/cNpWt.png while imposing the constraints a > 0, b > 0, and gamma1 in [0, 1] on the parameters.
In the nonlinear model, the dependent variable is x(t), the independent variables are R(t) and F(t), and ξ(t) is the error term.
An example of the dataset (68 rows of a time series) can be seen here: https://i.stack.imgur.com/2Vf0j.png
To estimate the nonlinear regression I use the nls() function with no problem, as shown below:
NLM1 = nls(Xt ~ (a*Rt - b*Ft)/(1 - gamma1*Rt), start = list(a = 10, b = 10, gamma1 = 0.5), algorithm = "port", lower = c(0, 0, 0), upper = c(Inf, Inf, 1), data = temp2)
I want to estimate NLM1 while also allowing for an AR(1) process in the residuals.
Basically I want the same step up as going from lm() to gls(). My problem is that in the gnls() function I don't know how to put constraints on the model parameters a, b, and gamma1, so the model estimates wrong values for them.
nls() has options for lower and upper bounds; I can't do the same in gnls().
In gnls() I need to add the constraints, something like nls()'s lower = c(0, 0, 0), upper = c(Inf, Inf, 1):
NLM1_AR1 = gnls(model = Xt ~ (a*Rt - b*Ft)/(1 - gamma1*Rt), data = temp2, start = list(a = 13, b = 10, gamma1 = 0.5), correlation = corARMA(p = 1))
Does anyone know how to do this?
Thank you
