I'm looking at the attributes of skbio's PCoA method (listed below). I am new to this API and I want to be able to get the eigenvectors and the original points projected onto the new axes, similar to .fit_transform in sklearn.decomposition.PCA, so I can create some PC_1 vs PC_2-style plots. I figured out how to get the eigvals and proportion_explained, but features comes back as None.
Is that because it's in beta?
If there are any tutorials that use this, that would be greatly appreciated. I am a huge fan of scikit-learn and would like to start using more of scikit's products.
| Attributes
| ----------
| short_method_name : str
| Abbreviated ordination method name.
| long_method_name : str
| Ordination method name.
| eigvals : pd.Series
| The resulting eigenvalues. The index corresponds to the ordination
| axis labels
| samples : pd.DataFrame
| The position of the samples in the ordination space, row-indexed by the
| sample id.
| features : pd.DataFrame
| The position of the features in the ordination space, row-indexed by
| the feature id.
| biplot_scores : pd.DataFrame
| Correlation coefficients of the samples with respect to the features.
| sample_constraints : pd.DataFrame
| Site constraints (linear combinations of constraining variables):
| coordinates of the sites in the space of the explanatory variables X.
| These are the fitted site scores
| proportion_explained : pd.Series
| Proportion explained by each of the dimensions in the ordination space.
| The index corresponds to the ordination axis labels
Here is my code to generate the principal coordinates analysis (PCoA) object.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid' : False})
import skbio
from scipy.spatial import distance
%matplotlib inline
np.random.seed(0)
# Iris dataset
DF_data = pd.DataFrame(load_iris().data,
index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
columns = load_iris().feature_names)
n,m = DF_data.shape
# print(n,m)
# 150 4
Se_targets = pd.Series(load_iris().target,
index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
name = "Species")
# Scaling mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data),
index = DF_data.index,
columns = DF_data.columns)
# Distance Matrix
Ar_dist = distance.squareform(distance.pdist(DF_standard.T, metric="braycurtis")) # (m x m) distance measure
DM_dist = skbio.stats.distance.DistanceMatrix(Ar_dist, ids=DF_standard.columns)
PCoA = skbio.stats.ordination.pcoa(DM_dist)
You can access the transformed sample coordinates with OrdinationResults.samples. This will return a pandas.DataFrame row-indexed by sample ID (i.e. the IDs in your distance matrix). Since principal coordinate analysis operates on a distance matrix of samples, transformed feature coordinates (OrdinationResults.features) are not available. Other ordination methods in scikit-bio accepting a sample x feature table as input will have the transformed feature coordinates available (e.g. CA, CCA, RDA).
Side note: the distance.squareform call is unnecessary because skbio.DistanceMatrix supports square- or vector-form arrays.
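As a minimal sketch (not from the scikit-bio docs), assuming the distance matrix is built between samples rather than between the transposed features, and assuming the ordination axes are labelled PC1, PC2, ... (check PCoA.samples.columns), something like this gives the PC_1 vs PC_2-style plot you describe:
# Sample-by-sample distances, PCoA, then a PC1 vs PC2 scatter coloured by species
Ar_dist = distance.squareform(distance.pdist(DF_standard, metric="braycurtis"))  # (n x n)
DM_dist = skbio.stats.distance.DistanceMatrix(Ar_dist, ids=DF_standard.index)
PCoA = skbio.stats.ordination.pcoa(DM_dist)
DF_coords = PCoA.samples           # DataFrame of sample coordinates, row-indexed by sample id
prop = PCoA.proportion_explained   # Series indexed by the axis labels
fig, ax = plt.subplots()
ax.scatter(DF_coords["PC1"], DF_coords["PC2"], c=Se_targets.values)
ax.set_xlabel("PC1 (%0.1f%% explained)" % (100 * prop["PC1"]))
ax.set_ylabel("PC2 (%0.1f%% explained)" % (100 * prop["PC2"]))
plt.show()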
My dataset has twelve columns: the first seven are input metrics, and the last five are outputs. The output is an array of 5 numbers, each zero or one. I am using the Keras functional API for this. Whenever I try to resample my data on individual columns, I get shape issues in the merging, even if I try to slice the rows.
Basically there's no "easy" approach to doing this. The only logical way is to use something like Label Powerset over your design matrix and resample based on the column created from that - though in that scenario it might be easier to "handcraft" such a transformation. Here is one approach:
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
import pandas as pd
X0, y = make_classification()
_, X1 = make_multilabel_classification(n_classes=5, random_state=0)
# transform X1 by creating a powerset...
df_x1 = pd.DataFrame(X1, columns=[f'c{x}' for x in range(X1.shape[1])])
df_x1 = pd.merge(df_x1, df_x1.drop_duplicates().reset_index()).rename(columns={"index":"dummy"})
print(df_x1['dummy'].value_counts()) # shows imbalance
df_x1 = df_x1.reset_index() # so that we know which rows are resampled
df_y1 = df_x1['dummy']
df_x1 = df_x1[[x for x in df_x1.columns if x != 'dummy']]
ros = RandomOverSampler()
X_sample, _ = ros.fit_resample(df_x1, df_y1) # this is the resampled index
X = np.hstack([X0, X1])
X_res, y_res = X[X_sample['index'], :], y[X_sample['index']]
Where the secret sauce really is this bit:
df_x1 = pd.merge(df_x1, df_x1.drop_duplicates().reset_index()).rename(columns={"index":"dummy"})
This re-indexes the rows based on the selected 5 columns, so that each unique combination of those columns gets its own "dummy" label. Then
df_x1 = df_x1.reset_index()
keeps track of the original row positions. The "dummy" column is what RandomOverSampler balances, which guarantees that the combinations of the 5 columns end up balanced.
Finally, we can select the indices of the resampling to generate a dataset and labels that have been resampled consistently across X0, X1, and y:
X = np.hstack([X0, X1])
X_res, y_res = X[X_sample['index'], :], y[X_sample['index']]
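As a quick sanity check (a sketch using the variables from the snippet above, not part of the approach itself), you can confirm that every unique combination of the five label columns now appears an equal number of times in the resampled data:
df_check = pd.DataFrame(X_res[:, X0.shape[1]:], columns=[f'c{x}' for x in range(X1.shape[1])])
print(df_check.groupby(list(df_check.columns)).size())  # each 5-column combination should now have the same count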
Here's my code:
# Load libraries
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
# Create text
text_data = np.array(['Tim is smart!',
'Joy is the best',
'Lisa is dumb',
'Fred is lazy',
'Lisa is lazy'])
# Create target vector
y = np.array([1,1,0,0,0])
# Create bag of words
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)
# Create feature matrix
X = bag_of_words.toarray()
mnb = MultinomialNB(alpha = 1, fit_prior = True, class_prior = None)
mnb.fit(X,y)
print(count.get_feature_names())
# output:['best', 'dumb', 'fred', 'is', 'joy', 'lazy', 'lisa', 'smart', 'the', 'tim']
print(mnb.feature_log_prob_)
# output
[[-2.94443898 -2.2512918 -2.2512918 -1.55814462 -2.94443898 -1.84582669
-1.84582669 -2.94443898 -2.94443898 -2.94443898]
[-2.14006616 -2.83321334 -2.83321334 -1.73460106 -2.14006616 -2.83321334
-2.83321334 -2.14006616 -2.14006616 -2.14006616]]
My question is:
Let's say for the word "best": the log probability for class 1 is -2.14006616.
What is the formula used to calculate this score?
I am using log(P("best" | y = class 1)) -> log(1/2), but I can't get the -2.14006616.
From the documentation we can infer that feature_log_prob_ corresponds to the empirical log probability of features given a class. Let's take the feature "best" as an illustration: the log probability of this feature for class 1 is -2.14006616 (as you pointed out). Converting it back into an actual probability gives np.exp(-2.14006616) = 0.11764. Let's take one more step back to see how and why the probability of "best" in class 1 is 0.11764. As per the documentation of Multinomial Naive Bayes, these probabilities are computed using the formula below:
theta_yi = (N_yi + alpha) / (N_y + alpha * n)
Here the numerator N_yi is the number of times the feature "best" appears in class 1 (the class of interest in this example) in the training set, and the denominator N_y is the total count of all features for class 1. We also add a small smoothing value, alpha, to prevent the probabilities from going to zero, and n is the total number of features, i.e. the size of the vocabulary. Computing these numbers for our example:
N_yi = 1 # "best" appears only once in class `1`
N_y = 7 # There are total 7 features (count of all words) in class `1`
alpha = 1 # default value as per sklearn
n = 10 # size of vocabulary
Required_probability = (1+1)/(7+1*10) = 0.11764
You can do the math in a similar fashion for any given feature and class.
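As a quick check (a sketch using the variables from the code above), you can recompute the smoothed log probabilities directly from the count matrix and compare them with mnb.feature_log_prob_:
alpha = 1.0
class_1_counts = X[y == 1].sum(axis=0)   # N_yi for every feature within class 1
N_y = class_1_counts.sum()               # total feature count in class 1 (7 here)
n = X.shape[1]                           # vocabulary size (10 here)
manual_log_prob = np.log((class_1_counts + alpha) / (N_y + alpha * n))
print(np.allclose(manual_log_prob, mnb.feature_log_prob_[1]))  # True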
I'm working on a classification problem with an imbalanced data set using sklearn. It looks like sklearn has calculated the false_positive_rate and true_positive_rate wrong: when I calculate the AUC score, the result is different from what I get from the confusion matrix.
From Sklearn I got the following confusion matrix:
confusion = confusion_matrix(y_test, y_pred)
array([[ 9100, 4320],
[109007, 320068]], dtype=int64)
Of course, I understand the output as:
+-----------------+--------------------------+------------------------+
|                 | Predicted positive       | Predicted negative     |
+-----------------+--------------------------+------------------------+
| Actual positive | True positive = 9100     | False negative = 4320  |
| Actual negative | False positive = 109007  | True negative = 320068 |
+-----------------+--------------------------+------------------------+
However, for FPR and TPR, I got the following result:
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_pred)
(false_positive_rate, true_positive_rate)
(array([0. , 0.3219076, 1. ]),
array([0. , 0.7459488, 1. ]))
The result is different from the confusion_matrix. According to my table, the reported FPR is actually the FNR, and the reported TPR is actually the TNR. Then I checked the confusion_matrix documentation and found:
Thus, in binary classification, the count of true negatives is C[0,0], false negatives is C[1,0], true positives is C[1,1] and false positives is C[0,1].
This means that, according to sklearn, the confusion_matrix should be read like this:
+-----------------+--------------------------+--------------------------+
|                 | Predicted positive       | Predicted negative       |
+-----------------+--------------------------+--------------------------+
| Actual positive | True positive = 320068   | False negative = 109007  |
| Actual negative | False positive = 4320    | True negative = 9100     |
+-----------------+--------------------------+--------------------------+
According to the theory, for binary classification, the rare class is denoted as the positive class.
Why does Sklearn treat the majority class as positive?
After some experiments, I found that when IsolationForest from sklearn is used on imbalanced data, the confusion_matrix shows that IsolationForest treats the majority (normal) class as the positive class, whereas the minority class should be the positive class in fraud/outlier/anomaly detection tasks.
To overcome this challenge there are two solutions:
1. Interpret the confusion matrix results the other way around (vice versa): FP instead of FN, and TP instead of TN.
2. If you want the results reported correctly despite this treatment of imbalanced data by IF, you can use the following trick. IF typically returns -1 for outliers and 1 for inliers, so if you replace 1 with -1 and -1 with 1 in the output of the IsolationForest, you can use the standard metric calculations correctly:
IF_model = IsolationForest(max_samples="auto",
random_state=11,
contamination = 0.1,
n_estimators=100,
n_jobs=-1)
IF_model.fit(X_train_sf, y_train_sf)
y_pred_test = IF_model.predict(X_test_sf)
counts = np.unique(y_pred_test, return_counts=True)
#(array([-1, 1]), array([44914, 4154]))
# replace 1 with -1 and -1 with 1 (i.e. flip the predicted labels)
if ((counts[1][0] < counts[1][1] and counts[0][0] == -1)
        or (counts[1][0] > counts[1][1] and counts[0][0] == 1)):
    y_pred_test = -y_pred_test
Considering the confusion_matrix documentation and the problem definition here, the above trick should work, and the right form of the confusion matrix for fraud/outlier/anomaly detection or binary classifiers, based on the literature (Ref. 1, Ref. 2, Ref. 3), is as follows:
+-----------------------------+----------------+----------------+
|                             | Predicted [1]  | Predicted [-1] |
+-----------------------------+----------------+----------------+
| Actual (positive class) [1] | TP             | FN             |
| Actual (negative class)[-1] | FP             | TN             |
+-----------------------------+----------------+----------------+
tn, fp, fn, tp = confusion_matrix(y_test_sf, y_pred_test).ravel()
print("TN: ",tn,"\nFP: ", fp,"\nFN: " ,fn,"\nTP: ", tp)
print("Number of positive class instances: ",tp+fn,"\nNumber of negative class instances: ", tn+fp)
check the evaluation:
print(classification_report(y_test_sf, y_pred_test, target_names=["Anomaly", "Normal"]))
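Side note (an addition beyond the trick above): for the roc_curve call in the question, sklearn treats the greater label (1, the majority class there) as positive by default. If you want the rare class treated as positive instead, you can pass pos_label explicitly:
from sklearn.metrics import roc_curve
# treat label 0 (the rare class in the question's data) as the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_pred, pos_label=0)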
I'm learning about clustering, KMeans and such, so my knowledge is very basic on the topic. What I have below is a bit of a self-study on how it works. Basically, if 'a' shows up in any of the columns, 'Binary' will equal 1. Essentially I am trying to teach it a pattern. I learned the following from a tutorial using the Titanic dataset, but I've adapted it to my own data.
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
My constructed data:
dataset = [
[0,'x','f','g'],[1,'a','c','b'],[1,'d','k','a'],[0,'y','v','w'],
[0,'q','w','e'],[1,'c','a','l'],[0,'t','x','j'],[1,'w','o','a'],
[0,'z','m','n'],[1,'z','x','a'],[0,'f','g','h'],[1,'h','a','c'],
[1,'a','r','e'],[0,'g','c','c']
]
df = pd.DataFrame(dataset, columns=['Binary','Col1','Col2','Col3'])
df.head()
df:
Binary Col1 Col2 Col3
------------------------
1 a b c
0 x t v
0 s q w
1 n m a
1 u a r
Encode the non-numeric columns to numbers:
labelEncoder = LabelEncoder()
labelEncoder.fit(df['Col1'])
df['Col1'] = labelEncoder.transform(df['Col1'])
labelEncoder.fit(df['Col2'])
df['Col2'] = labelEncoder.transform(df['Col2'])
labelEncoder.fit(df['Col3'])
df['Col3'] = labelEncoder.transform(df['Col3'])
Set clusters to two, because it's either 1 or 0?
X = np.array(df.drop(['Binary'], 1).astype(float))
y = np.array(df['Binary'])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
Test it:
correct = 0
for i in range(len(X)):
predict_me = np.array(X[i].astype(float))
predict_me = predict_me.reshape(-1, len(predict_me))
prediction = kmeans.predict(predict_me)
if prediction[0] == y[i]:
correct += 1
The result:
print(f'{round(correct/len(X) * 100)}% Accuracy')
>>> 71%
How can I get it more accurate, to the point where it knows with 99.99% certainty that 'a' means the Binary column is 1? More data?
K-means does not even try to predict this value, because it is an unsupervised method: it is not a prediction algorithm but a structure-discovery task. Don't mistake clustering for classification.
The cluster numbers have no meaning. They are 0 and 1 because these are the first two integers. K-means is randomized. Run it a few times and you will also score just 29% sometimes.
Also, k-means is designed for continuous input. You can apply it on binary encoded data, but the results will be pretty poor.
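To illustrate that the cluster ids are arbitrary (a small sketch using the variables above, not part of the answer itself), compare the naive accuracy with the accuracy after swapping the two cluster ids for the same fitted model:
raw_acc = np.mean(kmeans.labels_ == y)            # cluster ids taken at face value
swapped_acc = np.mean((1 - kmeans.labels_) == y)  # same clustering, ids swapped
print(raw_acc, swapped_acc)  # the two sum to 1; which one you observe depends on an arbitrary labelling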
I am getting different shapes for my PCA using sklearn. Why isn't my transformation resulting in an array of the dimensions the docs describe?
fit_transform(X, y=None)
Fit the model with X and apply the dimensionality reduction on X.
Parameters:
X : array-like, shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number of features.
Returns:
X_new : array-like, shape (n_samples, n_components)
Check this out with the iris dataset which is (150, 4) where I'm making 4 PCs:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns; sns.set_style("whitegrid", {'axes.grid' : False})
%matplotlib inline
np.random.seed(0)
# Iris dataset
DF_data = pd.DataFrame(load_iris().data,
index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
columns = load_iris().feature_names)
Se_targets = pd.Series(load_iris().target,
index = ["iris_%d" % i for i in range(load_iris().data.shape[0])],
name = "Species")
# Scaling mean = 0, var = 1
DF_standard = pd.DataFrame(StandardScaler().fit_transform(DF_data),
index = DF_data.index,
columns = DF_data.columns)
# Sklearn for Principal Component Analysis
# Dims
m = DF_standard.shape[1]
K = m
# PCA (How I tend to set it up)
M_PCA = decomposition.PCA()
A_components = M_PCA.fit_transform(DF_standard)
#DF_standard.shape, A_components.shape
#((150, 4), (150, 4))
But when I use the same exact approach on my actual dataset, which is (76, 1989), i.e. 76 samples and 1989 attributes/dimensions, I get a (76, 76) array instead of (76, 1989):
DF_centered = normalize(DF_mydata, method="center", axis=0)
m = DF_centered.shape[1]
# print(m)
# 1989
M_PCA = decomposition.PCA(n_components=m)
A_components = M_PCA.fit_transform(DF_centered)
DF_centered.shape, A_components.shape
# ((76, 1989), (76, 76))
normalize is just a wrapper I made that subtracts the mean from each dimension.
(Note: this answer is adapted from my answer on Cross Validated here: Why are there only n−1 principal components for n data points if the number of dimensions is larger or equal than n?)
PCA (as most typically run) creates a new coordinate system by:
shifting the origin to the centroid of your data,
squeezing and/or stretching the axes to make them equal in length, and
rotating your axes into a new orientation.
(For more details, see this excellent CV thread: Making sense of principal component analysis, eigenvectors & eigenvalues.) However, step 3 rotates your axes in a very specific way. Your new X1 (now called "PC1", i.e., the first principal component) is oriented in your data's direction of maximal variation. The second principal component is oriented in the direction of the next greatest amount of variation that is orthogonal to the first principal component. The remaining principal components are formed likewise.
With this in mind, let's examine a simple example (suggested by #amoeba in a comment). Here is a data matrix with two points in a three dimensional space:
X = [ 1 1 1
2 2 2 ]
Picture these points in a (pseudo) three dimensional scatterplot.
So let's follow the steps listed above. (1) The origin of the new coordinate system will be located at (1.5,1.5,1.5). (2) The axes are already equal. (3) The first principal component will go diagonally from what used to be (0,0,0) to what was originally (3,3,3), which is the direction of greatest variation for these data. Now, the second principal component must be orthogonal to the first, and should go in the direction of the greatest remaining variation. But what direction is that? Is it from (0,0,3) to (3,3,0), or from (0,3,0) to (3,0,3), or something else? There is no remaining variation, so there cannot be any more principal components.
With N=2 data points, we can fit (at most) N−1=1 principal components. The same logic applies to your data: 76 samples span at most a 75-dimensional subspace after centering, so PCA can return at most min(n_samples, n_features) = 76 columns in the transformed output, which is why you get a (76, 76) array rather than (76, 1989).
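A quick way to see this numerically (a sketch, not part of the original answer):
import numpy as np
from sklearn.decomposition import PCA
X_toy = np.array([[1.0, 1.0, 1.0],
                  [2.0, 2.0, 2.0]])     # two points in 3D
pca = PCA()                             # n_components defaults to min(n_samples, n_features) = 2
coords = pca.fit_transform(X_toy)
print(coords.shape)                     # (2, 2), not (2, 3)
print(pca.explained_variance_ratio_)    # approximately [1.0, 0.0]: only one meaningful component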