Error while passing a data-frame through k-means

Although my data-frame has float values everywhere, when I pass it through k-means it complains that it couldn't convert a string to float.
How do I convert any NaN values in the entire data-frame to float values?

The following would do the job and convert all string-typed columns to categorical codes; alternatively, you can one-hot encode the variables in those columns.
import pandas
from sklearn.cluster import KMeans
df = pandas.read_csv('zipIncome.csv')
print(df)
# convert every string (object) column to integer category codes
for col_name in df.select_dtypes(include='object').columns:
    df[col_name] = df[col_name].astype('category')
    df[col_name] = df[col_name].cat.codes
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=600).fit(df)
print(kmeans.labels_)
print(kmeans.cluster_centers_)
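If you prefer the one-hot route mentioned above, here is a minimal sketch; mean-imputing the NaNs is only one simple option, but KMeans does reject missing values, so they must be filled somehow:
import pandas
from sklearn.cluster import KMeans
df = pandas.read_csv('zipIncome.csv')
df = pandas.get_dummies(df)   # one-hot encode the string columns
df = df.fillna(df.mean())     # fill NaNs; k-means cannot handle missing values
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=600).fit(df)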

Based on your code, it seems that you only instantiated the KMeans object but haven't used it.
You'll need clean input data (i.e. no strings etc.); let's call it X.
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=600)
clusters = kmeans.fit_predict(X)
Now clusters holds the cluster number for each sample in X.
(Alternatively, you can call fit(X) and then predict(X) separately, but ultimately it is predict that outputs the cluster labels you need.)
If you later want clusters for new data, use kmeans.predict(new_data) rather than fit_predict(), so that KMeans applies what it learned from X to your new_data (or, depending on your needs, you might want to retrain it).
Hope this helps.
Finally, you can add another column to your pandas DataFrame by doing:
df['cluster'] = clusters
where 'cluster' is a string for your new column name; you can of course call it whatever you want.
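To make the fit/predict split concrete, a minimal sketch (new_data is a hypothetical array or DataFrame with the same columns as X):
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=600)
kmeans.fit(X)                          # learn the centroids from X
new_labels = kmeans.predict(new_data)  # assign new samples to those learned centroids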

Related

Normalize and de-normalizing data in prediction model

I have developed a Random Forest model which includes two inputs X and one output Y. I normalized both the X and Y values for the training process.
After the model was trained, I selected a dataset of unseen data as input for the model. The data comes from another source. I normalized the X values, fed them to the trained model, and got the normalized Y value as output. I wonder how the denormalizing process should work. I mean, by which value do I have to multiply the output to get the denormalized value?
I'd appreciate it if someone could help me in this regard.
You need to apply the preprocessing in reverse, and for that you need the mean and sd (standard deviation) values that were used for normalization.
With scikit-learn you can do it easily, with one line of code.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data = ...
scaled_data = scaler.fit_transform(data)         # stores the mean and sd internally
inverse = scaler.inverse_transform(scaled_data)  # recovers the original values
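Applied to the question, a sketch of the Y side (y_train, model, and X_new are hypothetical placeholders): fit a separate scaler on the training targets and call its inverse_transform on the model's predictions, which is equivalent to multiplying by the stored sd and adding the stored mean.
import numpy as np
from sklearn.preprocessing import StandardScaler
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(np.asarray(y_train).reshape(-1, 1))  # stores Y's mean and sd
# ... train the model on the normalized X and y_train_scaled ...
y_pred_scaled = model.predict(X_new).reshape(-1, 1)
y_pred = y_scaler.inverse_transform(y_pred_scaled)  # i.e. y_pred_scaled * sd + mean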

Linear regression on multidimensional dask arrays

I am trying to perform linear regression on multidimensional A and b arrays that are larger than memory. Using sklearn notation, my arrays have shapes:
A: (n_samples, n_features)
b: (n_samples, n_targets)
So far I have been using numpy arrays with sklearn.linear_model.ridge_regression.
Now I am switching to dask and need to find something equivalent. I've tried passing the dask arrays to ridge_regression, but I run out of memory when doing so. I've looked at dask_ml, but its LinearRegression model only takes a 1-D b/y with shape (n_samples,), and requires the data to be chunked correctly, which I am not sure how to do. I could use a standard matrix-inversion approach, but then I risk singular-matrix errors.
A small worked example, which shows how the ridge_regression method ends up with a numpy array, not a dask array:
import dask.array as da
import sklearn.linear_model
X = da.random.random((1024, 30))
y = da.random.random((1024, 10000))
fit = sklearn.linear_model.ridge_regression(X, y, alpha=0.0)
print(type(fit))  # numpy.ndarray, not a dask array
Does anyone have any suggestions for models that can perform the above regression, but for large data?
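For what it's worth, a hedged sketch of the "standard matrix inversion" route mentioned above: with a ridge penalty (alpha > 0) the n_features x n_features Gram matrix is nonsingular, and since it is only 30 x 30 here it fits easily in memory even though A and b do not.
import dask.array as da
import numpy as np
X = da.random.random((1024, 30), chunks=(256, 30))
y = da.random.random((1024, 10000), chunks=(256, 10000))
alpha = 1e-3
# ridge normal equations: (X^T X + alpha*I) w = X^T y;
# both products reduce over the samples axis chunk by chunk
G = (X.T @ X).compute()   # (30, 30) numpy array
c = (X.T @ y).compute()   # (30, 10000) numpy array
w = np.linalg.solve(G + alpha * np.eye(G.shape[0]), c)
print(w.shape)            # (n_features, n_targets) = (30, 10000)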

Does sklearn support feature selection on dynamic data?

sklearn contains implementations of different feature selection methods (filter/wrapper/embedded).
All of those methods are designed for static systems.
Does sklearn support feature selection on dynamic data (data which vary with time)?
With dynamic data, feature selection needs to be efficient enough to be repeated as the data change.
I found some methods on IEEE (incremental approaches for feature selection),
so is there any implementation in sklearn or another open-source library?
Couldn't you just re-run your process on a scheduled basis and load your data dynamically? I wouldn't expect the dependent variable to change at all, but I suppose the independent variables could change somewhat.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
# 1) load your dataframe (df)
# 2) copy your target variable into a new dataframe
y = df[['SeriousDlqin2yrs']]
# 3) drop your target variable from the features
x = df[df.columns[df.columns != 'SeriousDlqin2yrs']]
Finally, run this.
features = np.array(x.columns)   # feature names, used to label the plot
clf = RandomForestClassifier()
clf.fit(x, y.values.ravel())
# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()
I just tried that, and got a bar chart of the variable importances as the result.
If you expect to get non-numeric features, you will need to one-hot encode them first:
import pandas as pd
pd.get_dummies(df)
http://queirozf.com/entries/one-hot-encoding-a-feature-on-a-pandas-dataframe-an-example

Issue with the results of PCA component values

I am performing PCA on a dataset of (28 features + 1 class label) and 11M rows (samples) using the following simple code:
from sklearn.decomposition import PCA
import pandas as pd
df = pd.read_csv('HIGGS.csv', sep=',', header=None)
df_labels = df[df.columns[0]]
df_features = df.drop(df.columns[0], axis=1)
pca = PCA()
pca.fit(df_features.values)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.shape)
transformed_data = pca.transform(df_features.values)
The pca.explained_variance_ratio_ values (the eigenvalues normalized to sum to 1) are the following:
[0.11581302 0.09659324 0.08451179 0.07000956 0.0641502 0.05651781
0.055588 0.05446682 0.05291956 0.04468113 0.04248516 0.04108151
0.03885671 0.03775394 0.0255504 0.02181292 0.01979832 0.0185323
0.0164828 0.01047363 0.00779365 0.00702242 0.00586635 0.00531234
0.00300572 0.00135565 0.00109707 0.00046801]
Based on the explained_variance_ratio_, I don't know if there is something wrong here. The highest component explains only 11% of the variance, whereas I expected the values to start at around 99%. Does this imply that the dataset needs some preprocessing, such as ensuring the data follow a normal distribution?
99% for the first component would mean that the axis associated with the largest eigenvalue encodes 99% of the variance in your dataset. It is quite uncommon for any dataset to look like this; otherwise, the problem would shrink to a 1-D classification/regression problem.
There is nothing wrong with this output. Retain the first axes that encode around 80% of the variance and build your model.
Note: the PCA transformation is usually used to decrease the dimensionality of your problem space. Since you have only 28 variables, I recommend abandoning PCA altogether.
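As a concrete sketch of the 80% suggestion: scikit-learn's PCA accepts a float n_components, in which case it keeps just enough components to explain that fraction of the variance (reusing df_features from the question):
from sklearn.decomposition import PCA
pca = PCA(n_components=0.8)   # keep enough components to explain 80% of the variance
reduced = pca.fit_transform(df_features.values)
print(pca.n_components_)      # how many of the 28 axes were retained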

Scikit learn - fit_transform on the test set

I am struggling to use Random Forest in Python with scikit-learn. My problem is that I use it for text classification (in 3 classes: positive/negative/neutral), and the features I extract are mainly words/unigrams, so I need to convert these to numerical features. I found a way to do it with DictVectorizer's fit_transform:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)
rf = RandomForestClassifier(n_estimators=100)
trainFeatures1 = vec.fit_transform(trainFeatures)
# Fit the training data to the training output and create the decision trees
rf = rf.fit(trainFeatures1, LabelEncoder().fit_transform(trainLabels))
testFeatures1 = vec.fit_transform(testFeatures)
# Take the same decision trees and run on the test data
Output = rf.score(testFeatures1, LabelEncoder().fit_transform(testLabels))
print "accuracy: " + str(Output)
My problem is that the fit_transform method works on the train dataset, which contains around 8,000 instances, but when I try to convert my test set to numerical features too, which is around 80,000 instances, I get a memory error:
testFeatures1 = vec.fit_transform(testFeatures)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 143, in fit_transform
return self.transform(X)
File "C:\Python27\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 251, in transform
Xa = np.zeros((len(X), len(vocab)), dtype=dtype)
MemoryError
What could possibly cause this and is there any workaround? Many thanks!
You are not supposed to call fit_transform on your test data, only transform. Otherwise, you will get a different vectorization than the one used during training.
For the memory issue, I recommend TfidfVectorizer, which has numerous options for reducing the dimensionality (by removing rare unigrams etc.).
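In other words, fit the vectorizer once on the training data and only transform the test data; a minimal sketch with the question's own names:
vec = DictVectorizer(sparse=False)
trainFeatures1 = vec.fit_transform(trainFeatures)  # learns the vocabulary from the training set
testFeatures1 = vec.transform(testFeatures)        # reuses that vocabulary, no refitting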
UPDATE
If the only problem is transforming the test data, simply split it into small chunks. Instead of something like
x = vect.transform(test)
evaluate(x)   # placeholder for whatever evaluation you run
you can do
K = 10
size = len(test) // K
for i in range(K):
    x = vect.transform(test[i * size : (i + 1) * size])
    evaluate(x)
and record the results/stats for each chunk and analyze them afterwards.
In particular:
from sklearn.metrics import accuracy_score
predictions = []
K = 10
size = len(test) // K + 1   # round the chunk size up so no samples are dropped
for i in range(K):
    chunk = test[i * size : (i + 1) * size]
    if len(chunk) == 0:
        break
    x = vect.transform(chunk)
    predictions.extend(rf.predict(x))  # predict returns an array of labels
print accuracy_score(true_labels, predictions)
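Separately, the MemoryError in the traceback comes from the dense allocation np.zeros((len(X), len(vocab))), which sparse=False forces. A sketch worth trying, assuming a reasonably recent scikit-learn (whose RandomForestClassifier accepts sparse input): keep the default sparse output.
vec = DictVectorizer(sparse=True)                  # the default; returns a scipy.sparse matrix
trainFeatures1 = vec.fit_transform(trainFeatures)
testFeatures1 = vec.transform(testFeatures)        # ~80k rows stay cheap in sparse form
rf.fit(trainFeatures1, LabelEncoder().fit_transform(trainLabels))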
