How do you make a KMeans prediction more accurate? - machine-learning

I'm learning about clustering and KMeans, so my knowledge of the topic is very basic. What I have below is a bit of a self-study on how it works. Basically, if 'a' shows up in any of the columns, 'Binary' will equal 1. Essentially I am trying to teach it a pattern. I learned the following from a tutorial using the Titanic dataset, but I've adapted it to my own data.
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
import seaborn as sns
import matplotlib.pyplot as plt
My constructed data:
dataset = [
[0,'x','f','g'],[1,'a','c','b'],[1,'d','k','a'],[0,'y','v','w'],
[0,'q','w','e'],[1,'c','a','l'],[0,'t','x','j'],[1,'w','o','a'],
[0,'z','m','n'],[1,'z','x','a'],[0,'f','g','h'],[1,'h','a','c'],
[1,'a','r','e'],[0,'g','c','c']
]
df = pd.DataFrame(dataset, columns=['Binary','Col1','Col2','Col3'])
df.head()
df:
Binary Col1 Col2 Col3
------------------------
1 a b c
0 x t v
0 s q w
1 n m a
1 u a r
Encode the non-numeric columns as integers:
labelEncoder = LabelEncoder()
labelEncoder.fit(df['Col1'])
df['Col1'] = labelEncoder.transform(df['Col1'])
labelEncoder.fit(df['Col2'])
df['Col2'] = labelEncoder.transform(df['Col2'])
labelEncoder.fit(df['Col3'])
df['Col3'] = labelEncoder.transform(df['Col3'])
Set clusters to two, because it's either 1 or 0?
X = np.array(df.drop(['Binary'], axis=1).astype(float))
y = np.array(df['Binary'])
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
Test it:
correct = 0
for i in range(len(X)):
    predict_me = np.array(X[i].astype(float))
    predict_me = predict_me.reshape(-1, len(predict_me))
    prediction = kmeans.predict(predict_me)
    if prediction[0] == y[i]:
        correct += 1
The result:
print(f'{round(correct/len(X) * 100)}% Accuracy')
>>> 71%
How can I make it accurate enough that it knows with 99.99% certainty that 'a' means the Binary column is 1? More data?

K-means does not even try to predict this value, because it is an unsupervised method. It is not a prediction algorithm; clustering is a structure-discovery task. Don't mistake clustering for classification.
The cluster numbers have no meaning. They are 0 and 1 because these are the first two integers. K-means is randomized, so the same two clusters can come out with their numbers swapped. Run it a few times and you will also score just 29% sometimes.
Also, k-means is designed for continuous input. You can apply it on binary encoded data, but the results will be pretty poor.
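To make the difference concrete, here is a minimal sketch (not the asker's original method) that reuses the constructed data from the question. It assumes one-hot encoding via pd.get_dummies and a DecisionTreeClassifier as the supervised model; both are illustrative choices, and the accuracy shown is on the training data only.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

dataset = [
    [0, 'x', 'f', 'g'], [1, 'a', 'c', 'b'], [1, 'd', 'k', 'a'], [0, 'y', 'v', 'w'],
    [0, 'q', 'w', 'e'], [1, 'c', 'a', 'l'], [0, 't', 'x', 'j'], [1, 'w', 'o', 'a'],
    [0, 'z', 'm', 'n'], [1, 'z', 'x', 'a'], [0, 'f', 'g', 'h'], [1, 'h', 'a', 'c'],
    [1, 'a', 'r', 'e'], [0, 'g', 'c', 'c'],
]
df = pd.DataFrame(dataset, columns=['Binary', 'Col1', 'Col2', 'Col3'])
y = df['Binary']

# One-hot encode the letters so that no artificial ordering is implied
# (LabelEncoder would turn 'a' < 'b' < 'c' into numeric distances).
X = pd.get_dummies(df[['Col1', 'Col2', 'Col3']], dtype=float)

# Clustering never sees y. Cluster ids are arbitrary, so agreement with y
# is only meaningful up to swapping the two ids.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
agreement = (clusters == y).mean()
print(f"k-means agreement (best of both labelings): {max(agreement, 1 - agreement):.2f}")

# A classifier is trained on y, so it can actually learn the rule.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
train_accuracy = clf.score(X, y)
print(f"decision tree training accuracy: {train_accuracy:.2f}")
```

Training accuracy says nothing about unseen rows; with real data you would hold out a test set before drawing conclusions.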

Related

Why does my Linear Regression model give me an error when all of my inputs are integers?

I want to try all regression algorithms on my dataset and choose the best one. I decided to start with Linear Regression, but I get an error.
I tried scaling the data, but then I get a different error.
Here is my code:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
train_df = pd.read_csv('train.csv', index_col='ID')
train_df.head()
target = 'Result'
X = train_df.drop(target, axis=1)
y = train_df[target]
# Trying to scale and get even worse error
#ss = StandardScaler()
#df_scaled = pd.DataFrame(ss.fit_transform(train_df),columns = train_df.columns)
#X = df_scaled.drop(target, axis=1)
#y = df_scaled[target]
model = LogisticRegression()
model.fit(X, y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=10000,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=10,
warm_start=False)
print(X.iloc[10])
print(model.predict([X.iloc[10]]))
print(y[10])
Here is the error:
ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(
A 0
B -19
C -19
D -19
E 0
F -19
Name: 10, dtype: int64
[0]
-19
And here is an example of dataset:
ID,A,B,C,D,E,F,Result
0,-18,18,18,-2,-12,-3,-19
1,-19,-8,0,18,18,1,0
2,0,-11,18,0,-19,18,18
3,18,-15,-12,18,-11,-4,-17
4,-17,18,-11,-17,-18,-19,18
5,18,-14,-19,-14,-15,-19,18
6,18,-17,18,18,18,-2,-1
7,-1,-11,0,18,18,18,18
8,18,-19,-18,-19,-19,18,18
9,18,18,0,0,18,18,0
10,0,-19,-19,-19,0,-19,-19
11,-19,0,-19,18,-19,-19,-6
12,-6,18,0,0,0,18,-15
13,-15,-19,-6,-19,-19,0,0
14,0,-15,0,18,18,-19,18
15,18,-19,18,-8,18,-2,-4
16,-4,-4,18,-19,18,18,18
17,18,0,18,-4,-10,0,18
18,18,0,18,18,18,18,-19
What am I doing wrong?
You're using LogisticRegression, which, despite its name, is a classification model: it is a linear model for categorical dependent variables.
This is not necessarily wrong, as you might intend to do so, but it means every distinct value of Result is treated as a separate class, so you need sufficient data per category and enough iterations for the model to converge (the warning you posted says it has not).
I suspect, however, that what you intended to use is LinearRegression (used for continuous dependent variables) from sklearn library.
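A minimal sketch of that swap, assuming the first rows of the sample dataset from the question are loaded from an inline string instead of train.csv:

```python
from io import StringIO

import pandas as pd
from sklearn.linear_model import LinearRegression

# First rows of the sample data shown in the question (inline for self-containment).
csv_text = """ID,A,B,C,D,E,F,Result
0,-18,18,18,-2,-12,-3,-19
1,-19,-8,0,18,18,1,0
2,0,-11,18,0,-19,18,18
3,18,-15,-12,18,-11,-4,-17
4,-17,18,-11,-17,-18,-19,18
5,18,-14,-19,-14,-15,-19,18
6,18,-17,18,18,18,-2,-1
7,-1,-11,0,18,18,18,18
8,18,-19,-18,-19,-19,18,18
9,18,18,0,0,18,18,0
10,0,-19,-19,-19,0,-19,-19
"""
train_df = pd.read_csv(StringIO(csv_text), index_col='ID')

target = 'Result'
X = train_df.drop(target, axis=1)
y = train_df[target]

# LinearRegression treats Result as a continuous value, not as a set of classes.
model = LinearRegression()
model.fit(X, y)

# Keep the row as a one-row DataFrame so the feature names match those seen in fit().
pred = model.predict(X.iloc[[10]])
print("predicted:", pred[0], "actual:", y[10])
```

The prediction is now a float rather than one of the observed class labels, and no convergence warning is involved because ordinary least squares is solved directly.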

How to handle class imbalance of multiple columns?

My dataset is shown in a screenshot (image not included). The first seven columns are the input metrics and the last five columns are the outputs; each output is an array of 5 numbers, each zero or one. I am using the Keras functional API for this. Whenever I try to resample my data on individual columns, I get shape issues when merging, even if I try to slice the rows.
Basically there's no "easy" approach to doing this. The only logical way is to maybe use Label Powerset over your design matrix, and resample based on the created column off that - though in that scenario it might be easier to "handcraft" such a transformation.
Here is one approach
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
import pandas as pd
X0, y = make_classification()
_, X1 = make_multilabel_classification(n_classes=5, random_state=0)
# transform X1 by creating a powerset...
df_x1 = pd.DataFrame(X1, columns=[f'c{x}' for x in range(X1.shape[1])])
df_x1 = pd.merge(df_x1, df_x1.drop_duplicates().reset_index()).rename(columns={"index":"dummy"})
print(df_x1['dummy'].value_counts()) # shows imbalance
df_x1 = df_x1.reset_index() # so that we know which rows are resampled
df_y1 = df_x1['dummy']
df_x1 = df_x1[[x for x in df_x1.columns if x != 'dummy']]
ros = RandomOverSampler()
X_sample, _ = ros.fit_resample(df_x1, df_y1) # this is the resampled index
X = np.hstack([X0, X1])
X_res, y_res = X[X_sample['index'], :], y[X_sample['index']]
Where the secret sauce really is this bit:
df_x1 = pd.merge(df_x1, df_x1.drop_duplicates().reset_index()).rename(columns={"index":"dummy"})
This gives every distinct combination of the 5 output columns its own id, stored in the "dummy" column. Next,
df_x1 = df_x1.reset_index()
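keeps the original row number as a column, so we know which rows get resampled. The "dummy" ids are then used as the class labels in the RandomOverSampler, which guarantees that every combination of the 5 columns is equally represented.
Finally, we can use the sampled indices to generate a dataset and labels which have been successfully resampled across X0, X1 and y: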
X = np.hstack([X0, X1])
X_res, y_res = X[X_sample['index'], :], y[X_sample['index']]
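The same label-powerset idea can also be sketched with pandas alone. This is an alternative illustration, not the code above: it assumes random synthetic data (X0 and labels are made up), uses groupby(...).ngroup() to build the combination id, and oversamples by drawing row indices with replacement instead of calling RandomOverSampler.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: 7 input features and 5 binary output columns.
X0 = rng.normal(size=(200, 7))
labels = pd.DataFrame(rng.integers(0, 2, size=(200, 5)),
                      columns=[f"c{i}" for i in range(5)])

# Label powerset: one integer id per distinct combination of the 5 outputs.
powerset_id = labels.groupby(list(labels.columns)).ngroup()
print(powerset_id.value_counts())  # shows the imbalance between combinations

# Oversample every combination up to the size of the largest one.
target_size = powerset_id.value_counts().max()
resampled_idx = np.concatenate([
    rng.choice(idx.to_numpy(), size=target_size, replace=True)
    for idx in powerset_id.groupby(powerset_id).groups.values()
])

# Index inputs and outputs with the same row numbers so their shapes always match.
X_res = X0[resampled_idx]
Y_res = labels.to_numpy()[resampled_idx]
balanced_counts = powerset_id.iloc[resampled_idx].value_counts()
print(balanced_counts)
```

Because inputs and outputs are selected with one shared index array, the merge and shape problems described in the question do not arise.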

Found input variables with inconsistent numbers of samples: [2, 144]

I have a training data set of 144 feedback comments, 72 positive and 72 negative. There are two target labels, positive and negative. Consider the following code segment:
import pandas as pd
feedback_data = pd.read_csv('output.csv')
print(feedback_data)
data target
0 facilitates good student teacher communication. positive
1 lectures are very lengthy. negative
2 the teacher is very good at interaction. positive
3 good at clearing the concepts. positive
4 good at clearing the concepts. positive
5 good at teaching. positive
6 does not shows test copies. negative
7 good subjective knowledge. positive
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(feedback_data)
X = cv.transform(feedback_data)
X_test = cv.transform(feedback_data_test)
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
target = [1 if i<72 else 0 for i in range(144)]
# the below line gives error
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)
I do not understand what the problem is. Please help.
You are not using the count vectorizer correctly. This is what you have now:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(df)
X = cv.transform(df)
X
<2x2 sparse matrix of type '<class 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Row format>
So you see that you don't get what you want. Iterating over a DataFrame yields its column names, so the vectorizer sees only two "documents" ('data' and 'target'), which is where the 2x2 shape comes from. You are not transforming each comment, and you are not even fitting the count vectorizer on the right input, because you pass the entire DataFrame and not just the corpus of comments.
To fix this, first fit the vectorizer on the right corpus:
cv = CountVectorizer(binary = True)
cv.fit(df['data'].values)
X = cv.transform(df)
X
<2x23 sparse matrix of type '<class 'numpy.int64'>'
with 0 stored elements in Compressed Sparse Row format>
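You see that we are getting closer to what we want: the vocabulary now has 23 terms, but transform was still given the whole DataFrame, so the matrix has 2 empty rows. We just have to transform the comments themselves (each line):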
cv = CountVectorizer(binary = True)
cv.fit(df['data'].values)
X = df['data'].apply(lambda x: cv.transform([x])).values
X
array([<1x23 sparse matrix of type '<class 'numpy.int64'>'
with 5 stored elements in Compressed Sparse Row format>,
...
<1x23 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>], dtype=object)
We have a more suitable X! Now we just need to check if we can split:
target = [1 if i<72 else 0 for i in range(8)] # The example dataset here has only 8 rows
# the line below no longer raises the error
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)
And it works!
You need to be sure you understand what CountVectorizer does in order to use it the right way.
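For comparison, here is a minimal sketch of the more common pattern: pass the text column straight to fit_transform, which returns one sparse matrix with a row per comment. It assumes the eight example rows shown in the question; deriving the labels from the 'target' column rather than from row position is an illustrative choice, not part of the original code.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# The eight example rows printed in the question.
df = pd.DataFrame({
    "data": [
        "facilitates good student teacher communication.",
        "lectures are very lengthy.",
        "the teacher is very good at interaction.",
        "good at clearing the concepts.",
        "good at clearing the concepts.",
        "good at teaching.",
        "does not shows test copies.",
        "good subjective knowledge.",
    ],
    "target": ["positive", "negative", "positive", "positive",
               "positive", "positive", "negative", "positive"],
})

cv = CountVectorizer(binary=True)
# Pass the column of strings, not the DataFrame: one row per comment.
X = cv.fit_transform(df["data"])
# Build the labels from the target column so they always line up with X.
y = (df["target"] == "positive").astype(int)

print(X.shape)  # (number of comments, vocabulary size)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=0.50, random_state=0
)
```

A single sparse matrix like this can be passed directly to estimators such as svm.SVC, which an object array of 1-row matrices cannot.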

How to find which rules in a decision tree are causing misclassifications

I built a binary decision tree classifier. From the confusion matrix I found that class 0 is misclassified 495 times and class 1 is misclassified 134 times. I want to find which rules in the decision tree are actually causing the records to be misclassified.
In short: which record failed at which tree node?
Is there a machine learning method that can be used to find the rules in the decision tree that cause the misclassifications?
Confusion Matrix
[[14226 495]
[ 134 3271]]
Fitting the decision tree and plotting it
cv = CountVectorizer( max_features = 200,analyzer='word',ngram_range=(1, 3))
cv_addr = cv.fit_transform(data.pop('Clean_addr'))
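for i, col in enumerate(cv.get_feature_names_out()):  # get_feature_names() was removed in scikit-learn 1.2
    # pd.SparseSeries was removed in pandas 1.0; a SparseArray column is the closest replacement
    data[col] = pd.arrays.SparseArray(cv_addr[:, i].toarray().ravel(), fill_value=0)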
train = data.drop(['Resi'], axis=1)
Y = data['Resi']
X_train, X_test, y_train, y_test = train_test_split(train, Y, test_size=0.3,random_state =8)
rus = RandomUnderSampler(random_state=42)
X_train_res, y_train_res = rus.fit_resample(X_train, y_train)  # fit_sample() was renamed to fit_resample()
dt=DecisionTreeClassifier(class_weight="balanced", min_samples_leaf=30)
fit_decision=dt.fit(X_train_res,y_train_res)
from io import StringIO  # sklearn.externals.six was removed from scikit-learn
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(fit_decision, out_file=dot_data,
filled=True, rounded=True,
special_characters=True,feature_names=train.columns)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
Any help is appreciated.
Decision tree plot: [image not included]
Dataset
https://drive.google.com/open?id=1NhXfwBIB640wJ30AyPKFnbIECCdmpyi5
Resi is the target column. I am trying to predict it from the other data columns, and I have count-vectorized the Clean_addr column.
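One possible way to approach this, sketched on synthetic data because the original dataset is not reproduced here: DecisionTreeClassifier.apply returns the leaf each record ends in, so misclassifications can be counted per leaf, and decision_path lists the nodes a given record passes through. The data from make_classification and the report layout are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and the Resi target.
X, y = make_classification(n_samples=600, n_features=8, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=8)

dt = DecisionTreeClassifier(class_weight="balanced", min_samples_leaf=30, random_state=0)
dt.fit(X_train, y_train)

pred = dt.predict(X_test)
wrong = pred != y_test

# apply() gives the id of the leaf node each test record lands in.
leaf_ids = dt.apply(X_test)
report = (
    pd.DataFrame({"leaf": leaf_ids, "wrong": wrong})
    .groupby("leaf")["wrong"]
    .agg(records="size", misclassified="sum")
    .sort_values("misclassified", ascending=False)
)
print(report.head())  # leaves (i.e. rules) responsible for the most errors

# decision_path() lists every node a record passes through on its way to the leaf.
wrong_rows = np.flatnonzero(wrong)
if wrong_rows.size:
    path = dt.decision_path(X_test[wrong_rows[:1]])
    print("nodes visited by the first misclassified record:", path.indices.tolist())
```

The node ids in the report match the ids drawn by export_graphviz when it is called with node_ids=True, which makes it possible to look the rule up in the plot.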

increase accuracy of model in sklearn

The decision tree classifier gives an accuracy of 0.52, but I want to increase it. How can I increase the accuracy using any of the classification models available in sklearn?
I have used kNN, a decision tree, and cross-validation, but all of them give low accuracy.
Thanks
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
#read from the csv file and return a Pandas DataFrame.
nba = pd.read_csv('wine.csv')
# print the column names
original_headers = list(nba.columns.values)
print(original_headers)
#print the first three rows.
print(nba[0:3])
# "quality" is the class attribute we are predicting.
class_column = 'quality'
# All remaining columns are numeric measurements of the wine,
# so we use them as features.
feature_columns = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH','sulphates', 'alcohol']
#Pandas DataFrame allows you to select columns.
#We use column selection to split the data into features and class.
nba_feature = nba[feature_columns]
nba_class = nba[class_column]
print(nba_feature[0:3])
print(list(nba_class[0:3]))
train_feature, test_feature, train_class, test_class = \
train_test_split(nba_feature, nba_class, stratify=nba_class, \
train_size=0.75, test_size=0.25)
training_accuracy = []
test_accuracy = []
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=1)
knn.fit(train_feature, train_class)
prediction = knn.predict(test_feature)
print("Test set predictions:\n{}".format(prediction))
print("Test set accuracy: {:.2f}".format(knn.score(test_feature, test_class)))
train_class_df = pd.DataFrame(train_class,columns=[class_column])
train_data_df = pd.merge(train_class_df, train_feature, left_index=True, right_index=True)
train_data_df.to_csv('train_data.csv', index=False)
temp_df = pd.DataFrame(test_class,columns=[class_column])
temp_df['Predicted Pos']=pd.Series(prediction, index=temp_df.index)
test_data_df = pd.merge(temp_df, test_feature, left_index=True, right_index=True)
test_data_df.to_csv('test_data.csv', index=False)
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(train_feature, train_class)
print("Training set score: {:.3f}".format(tree.score(train_feature, train_class)))
print("Test set score Decision: {:.3f}".format(tree.score(test_feature, test_class)))
prediction = tree.predict(test_feature)
print("Confusion matrix:")
print(pd.crosstab(test_class, prediction, rownames=['True'], colnames=['Predicted'], margins=True))
cancer = nba.to_numpy()  # DataFrame.as_matrix() was removed in pandas 1.0
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
scores = cross_val_score(tree, train_feature,train_class, cv=10)
print("Cross-validation scores: {}".format(scores))
print("Average cross-validation score: {:.2f}".format(scores.mean()))
Usually the next step after a decision tree (DT) is a random forest (RF) or one of its relatives, or XGBoost (which is not part of sklearn). Try them. Also, decision trees overfit very easily.
Remove outliers. Check the classes in your dataset: if they are unbalanced, most of the errors may come from there. In that case you need to use class weights while fitting or in the metric function (or use F1).
You can attach your confusion matrix here; it would be great to see.
A neural network (even the one from sklearn) may also give better results.
Improve your preprocessing.
Methods such as DT and kNN can be sensitive to how you preprocess your columns. For example, a DT can benefit a lot from well-chosen thresholds on the continuous variables.
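A rough sketch of how these suggestions could be compared. It assumes scikit-learn's bundled load_wine dataset as a stand-in (it is not the wine.csv quality data from the question), and the model settings are illustrative rather than tuned.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset; replace with your own feature matrix and class column.
X, y = load_wine(return_X_y=True)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    # class_weight="balanced" is one way to apply weights for unbalanced classes.
    "random forest": RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=0
    ),
    "kNN (unscaled)": KNeighborsClassifier(n_neighbors=5),
    # Scaling inside a pipeline is re-fitted on each training fold.
    "kNN (scaled)": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

# Macro-averaged F1 weights every class equally, which helps with imbalance.
results = {
    name: cross_val_score(model, X, y, cv=5, scoring="f1_macro").mean()
    for name, model in models.items()
}
for name, score in results.items():
    print(f"{name:>15}: {score:.3f}")
```

Comparing the two kNN rows shows how much preprocessing alone can matter for distance-based methods.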
