K-Nearest neighbours
I am trying to run the KNN algorithm on a heart disease prediction dataset. When I try to pickle it and create model.pkl, it gives me a NotFittedError. When I run the code it gives me an accurate prediction, but when I pickle it, it shows the error. How should I fit this data? I am new to machine learning, so please help.
from sklearn.neighbors import KNeighborsClassifier
dataset = pd.get_dummies(df, columns = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal'])
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
columns_to_scale = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
dataset[columns_to_scale] = standardScaler.fit_transform(dataset[columns_to_scale])
y = dataset['target']
X = dataset.drop(['target'], axis = 1)
from sklearn.model_selection import cross_val_score
knn_scores = []
for k in range(1, 21):
    knn_classifier = KNeighborsClassifier(n_neighbors = k)
    score = cross_val_score(knn_classifier, X, y, cv=10)
    knn_scores.append(score.mean())
plt.plot([k for k in range(1, 21)], knn_scores, color = 'red')
for i in range(1, 21):
    plt.text(i, knn_scores[i-1], (i, knn_scores[i-1]))
plt.xticks([i for i in range(1, 21)])
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Scores')
plt.title('K Neighbors Classifier scores for different K values')
knn_classifier = KNeighborsClassifier(n_neighbors = 12)
score=cross_val_score(knn_classifier,X,y,cv=10)
score.mean()
0.8448387096774195
import pickle
pickle.dump(knn_classifier, open('model.pkl', 'wb'))
Heart_disease_detector_model = pickle.load(open('model.pkl', 'rb'))
y_pred = Heart_disease_detector_model.predict(X_test)
print('Accuracy of K – Nearest Neighbor model = ',accuracy_score(y_test, y_pred))
---------------------------------------------------------------------------
> NotFittedError Traceback (most recent call last)
> <ipython-input-79-c37bd716088c> in <module>
> 2 pickle.dump(knn_classifier, open('model.pkl', 'wb'))
> 3 Heart_disease_detector_model = pickle.load(open('model.pkl', 'rb'))
> ----> 4 y_pred = Heart_disease_detector_model.predict(X_test)
> 5 print('Accuracy of K – Nearest Neighbor model = ',accuracy_score(y_test, y_pred))
>
> c:\users\jahnavi padala\miniconda3\lib\site-packages\sklearn\neighbors\_classification.py
> in predict(self, X)
> 195 X = check_array(X, accept_sparse='csr')
> 196
> --> 197 neigh_dist, neigh_ind = self.kneighbors(X)
> 198 classes_ = self.classes_
> 199 _y = self._y
>
> c:\users\jahnavi padala\miniconda3\lib\site-packages\sklearn\neighbors\_base.py in
> kneighbors(self, X, n_neighbors, return_distance)
> 647 [2]]...)
> 648 """
> --> 649 check_is_fitted(self)
> 650
> 651 if n_neighbors is None:
>
> c:\users\jahnavi padala\miniconda3\lib\site-packages\sklearn\utils\validation.py in
> inner_f(*args, **kwargs)
> 61 extra_args = len(args) - len(all_args)
> 62 if extra_args <= 0:
> ---> 63 return f(*args, **kwargs)
> 64
> 65 # extra_args > 0
>
> c:\users\jahnavi padala\miniconda3\lib\site-packages\sklearn\utils\validation.py in
> check_is_fitted(estimator, attributes, msg, all_or_any)
> 1096
> 1097 if not attrs:
> -> 1098 raise NotFittedError(msg % {'name': type(estimator).__name__})
> 1099
> 1100
>
> NotFittedError: This KNeighborsClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this
> estimator.
The error is telling you that the classifier is not yet fitted, which is exactly what it sounds like--you need to fit the model before using it. Do something like this before getting the accuracy score:
knn_classifier.fit(X, y)
So you will end up with this:
knn_classifier = KNeighborsClassifier(n_neighbors = 12)
knn_classifier.fit(X, y)
You can't create a useful pickle without fitting the model first. Before the line pickle.dump(knn_classifier, open('model.pkl', 'wb')), write knn_classifier.fit(your_X, your_y).
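Putting it together, a minimal sketch of fit, pickle, load, predict; note that the train/test split below is an assumption, since the question's snippet never defines X_test or y_test:

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import pickle

# hypothetical hold-out split; X and y are the scaled features and target from above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

knn_classifier = KNeighborsClassifier(n_neighbors=12)
knn_classifier.fit(X_train, y_train)  # fit BEFORE pickling

# the pickle now stores a fitted estimator
with open('model.pkl', 'wb') as f:
    pickle.dump(knn_classifier, f)

with open('model.pkl', 'rb') as f:
    heart_disease_detector_model = pickle.load(f)

y_pred = heart_disease_detector_model.predict(X_test)
print('Accuracy of K-Nearest Neighbor model =', accuracy_score(y_test, y_pred))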
I am using the K-Fold method to train a classifier, using the KFold module of sklearn.
FK_split = KFold(n_splits=4, shuffle = True, random_state=0)

for epoch in range(num_epoch):
    train_loss = 0.0
    Acc_valid = 0.0
    for train_idx, valid_idx in FK_split.split(torch_trainDataset):
        train_sampler = SubsetRandomSampler(train_idx)
        valid_sampler = SubsetRandomSampler(valid_idx)
        train_dataloder = DataLoader(torch_trainDataset, batch_size=1, sampler=train_sampler)
        valid_dataloder = DataLoader(torch_testDataset, batch_size=1, sampler=valid_sampler)
        train_loss += train(model, train_dataloder, lossfunc, optimizer, train_loss)
        _, acc_valid = test(model, valid_dataloder, optimizer)
and the train function, test function and acc function are defined as follows,
def train(model, data_train, lossfunc, optimizer, train_loss):
    for x, y in data_train:
        optimizer.zero_grad()
        output = model(x)
        loss = lossfunc(output, y)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()*x.size(0)
    return train_loss

def get_acc(outputs, labels):
    """calculate acc"""
    _, predict = torch.max(outputs.data, 1)
    correct_num = (labels == predict).sum().item()
    return predict, correct_num

def test(model, data_test, optimizer):
    Predict = []
    Acc = 0.0
    for x, y in data_test:
        outputs = model(x)
        predict, acc = get_acc(outputs, y)
        Predict.append(predict.tolist())
        Acc += acc
    return Predict, Acc
However, the IndexError occurs in the test process while the same method works in the train process. Could you help me solve this problem? I attach the IndexError information below.
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In [38], line 34
30 valid_dataloder = DataLoader(torch_testDataset, batch_size=1, sampler=valid_sampler)
32 train_loss += train(model, train_dataloder, lossfunc, optimizer, train_loss)
---> 34 _, acc_valid = test(model, valid_dataloder, optimizer)
35 Acc_valid += acc_valid
37 Acc_valid = Acc_valid / len(valid_dataloder)
Cell In [30], line 20, in test(model, data_test, optimizer)
18 Predict = []
19 Acc = 0.0
---> 20 for i, data in enumerate(data_test, 0):
21 x, y = data
23 outputs = model(x)
File c:\Users\Ryan\anaconda3\envs\d2l\lib\site-packages\torch\utils\data\dataloader.py:681, in _BaseDataLoaderIter.__next__(self)
678 if self._sampler_iter is None:
679 # TODO(https://github.com/pytorch/pytorch/issues/76750)
680 self._reset() # type: ignore[call-arg]
--> 681 data = self._next_data()
682 self._num_yielded += 1
683 if self._dataset_kind == _DatasetKind.Iterable and \
684 self._IterableDataset_len_called is not None and \
685 self._num_yielded > self._IterableDataset_len_called:
...
File c:\Users\Ryan\anaconda3\envs\d2l\lib\site-packages\torch\utils\data\dataset.py:188, in <genexpr>(.0)
187 def __getitem__(self, index):
--> 188 return tuple(tensor[index] for tensor in self.tensors)
IndexError: index 106 is out of bounds for dimension 0 with size 27
Thank you so much if you can offer me help.
You must use a matching sampler and dataset.
Right now you are using torch_testDataset with valid_sampler, but valid_idx comes from splitting torch_trainDataset, so its indices can exceed the size of the test set (index 106 vs. size 27 in your traceback).
Pass the dataset you actually split to the validation DataLoader, or build a sampler from the test set's own indices.
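A minimal sketch of the corrected loop, assuming the validation folds are meant to come from torch_trainDataset (the dataset that FK_split.split actually splits):

# both loaders must draw from the dataset that produced the fold indices
for train_idx, valid_idx in FK_split.split(torch_trainDataset):
    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)
    train_dataloder = DataLoader(torch_trainDataset, batch_size=1, sampler=train_sampler)
    valid_dataloder = DataLoader(torch_trainDataset, batch_size=1, sampler=valid_sampler)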
I am working on a multi-label text classification problem (90 target labels in total). The data distribution has a long tail and class imbalance. I am working with a sample of 100k records using the OVR (One-Versus-Rest) strategy. Since the dataset is huge, I am trying out the partial_fit method. I came to know that there were some issues previously, and a similar question was asked back in 2017. I tried partial_fit and found the same issue still exists, or maybe I am not doing it correctly.
Scikit-learn version : 0.22.2.post1
Code
def stream_documents(data=None):
    """Iterate over documents of the dataset.
    Documents are represented as dictionaries.
    """
    for index, row in data.iterrows():
        tmp_dict = dict()
        tmp_dict['text'] = row[TEXT_FEAT]
        tmp_dict['target'] = row[TARGET_LABEL]
        yield tmp_dict

def get_minibatch(doc_iter, size, mlb):
    """Extract a minibatch of examples, return a tuple X_text, y.
    Note: size is before excluding invalid docs with no topics assigned.
    """
    data = [(doc['text'], doc['target'])
            for doc in itertools.islice(doc_iter, size)]
    if not len(data):
        return np.asarray([], dtype=int), np.asarray([], dtype=int)
    X_text, y = zip(*data)
    y = pd.Series(data=y)
    y_encoded = mlb.transform(y.str.split(','))
    # print("Y SHAPE : ", np.asarray(y_encoded, dtype=int).shape)
    return X_text, np.asarray(y_encoded, dtype=int)

def iter_minibatches(doc_iter, minibatch_size):
    """Generator of minibatches."""
    X_text, y = get_minibatch(doc_iter, minibatch_size, mlb)
    while len(X_text):
        yield X_text, y
        X_text, y = get_minibatch(doc_iter, minibatch_size, mlb)

def progress(cls_name, stats):
    """Report progress information, return a string."""
    duration = time.time() - stats['t0']
    s = "%20s classifier : \t" % cls_name
    s += "%(n_train)6d train docs " % stats
    s += "%(n_test)6d test docs " % test_stats
    s += "Acc: %(accuracy).3f " % stats
    s += "f1: %(f1).3f " % stats
    s += "P: %(p).3f " % stats
    s += "in %.2fs (%5d docs/s)" % (duration, stats['n_train'] / duration)
    return s
vectorizer = HashingVectorizer(decode_error='ignore', n_features=2 ** 18)
data_stream = stream_documents(data=df_sample_xs)  # X, y
partial_fit_classifiers = {
    'SGD': OneVsRestClassifier(SGDClassifier(max_iter=1000, tol=1e-3)),
    'Logistic': OneVsRestClassifier(LogisticRegression(solver='lbfgs', max_iter=500))
}
# test data statistics
test_stats = {'n_test': 0}
# First we hold out a number of examples to estimate accuracy
n_test_documents = 1000
tick = time.time()
X_test_text, y_test = get_minibatch(data_stream, 1000, mlb)
parsing_time = time.time() - tick
tick = time.time()
X_test = vectorizer.transform(X_test_text)
vectorizing_time = time.time() - tick
test_stats['n_test'] += len(y_test)
print("Test set is %d documents" % (len(y_test)))
cls_stats = {}
for cls_name in partial_fit_classifiers:
    stats = {'n_train': 0, 'n_train_pos': 0,
             'accuracy': 0.0,
             'accuracy_history': [(0, 0)],
             'f1': 0.0,
             'f1_history': [(0, 0)],
             'p': 0.0,
             'p_history': [(0, 0)],
             't0': time.time(),
             'runtime_history': [(0, 0)],
             'total_fit_time': 0.0}
    cls_stats[cls_name] = stats
get_minibatch(data_stream, n_test_documents, mlb)
minibatch_size = 2000
minibatch_iterators = iter_minibatches(data_stream, minibatch_size)
total_vect_time = 0.0
# Main loop: iterate on mini-batches of examples
for i, (X_train_text, y_train) in enumerate(minibatch_iterators):
    tick = time.time()
    X_train = vectorizer.transform(X_train_text)
    total_vect_time += time.time() - tick
    # print(X_train.shape, y_train.shape)
    for cls_name, cls in partial_fit_classifiers.items():
        tick = time.time()
        print(cls_name)
        # update estimator with examples in the current mini-batch
        # cls.partial_fit(X_train, y_train, classes=all_classes)
        cls.partial_fit(X_train, y_train, classes=mlb.transform(df_sample_xs[TARGET_LABEL].str.split(',')))
        # accumulate test accuracy stats
        cls_stats[cls_name]['total_fit_time'] += time.time() - tick
        cls_stats[cls_name]['n_train'] += X_train.shape[0]
        cls_stats[cls_name]['n_train_pos'] += sum(y_train)
        tick = time.time()
        cls_stats[cls_name]['accuracy'] = cls.score(X_test, y_test)
        cls_stats[cls_name]['f1'] = f1_score(y_test, cls.predict(X_test))
        cls_stats[cls_name]['p'] = precision_score(y_test, cls.predict(X_test))
        cls_stats[cls_name]['prediction_time'] = time.time() - tick
        acc_history = (cls_stats[cls_name]['accuracy'], cls_stats[cls_name]['n_train'])
        cls_stats[cls_name]['accuracy_history'].append(acc_history)
        f1_history = (cls_stats[cls_name]['f1'], cls_stats[cls_name]['n_train'])
        cls_stats[cls_name]['f1_history'].append(f1_history)
        p_history = (cls_stats[cls_name]['p'], cls_stats[cls_name]['n_train'])
        cls_stats[cls_name]['p_history'].append(p_history)
        run_history = (cls_stats[cls_name]['accuracy'],
                       cls_stats[cls_name]['f1'],
                       cls_stats[cls_name]['p'],
                       total_vect_time + cls_stats[cls_name]['total_fit_time'])
        cls_stats[cls_name]['runtime_history'].append(run_history)
        if i % 3 == 0:
            print(progress(cls_name, cls_stats[cls_name]))
    if i % 3 == 0:
        print('\n')
Error
SGD
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-87-cf38c633c6aa> in <module>
31 # update estimator with examples in the current mini-batch
32 # cls.partial_fit(X_train, y_train, classes=all_classes)
---> 33 cls.partial_fit(X_train, y_train, classes=mlb.transform(df_sample_xs[TARGET_LABEL].str.split(',')))
34 # accumulate test accuracy stats
35 cls_stats[cls_name]['total_fit_time'] += time.time() - tick
/opt/virtual_env/py3/lib/python3.6/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
114
115 # lambda, but not partial, allows help() to work with update_wrapper
--> 116 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
117 # update the docstring of the returned function
118 update_wrapper(out, self.fn)
/opt/virtual_env/py3/lib/python3.6/site-packages/sklearn/multiclass.py in partial_fit(self, X, y, classes)
287 self.classes_))
288
--> 289 Y = self.label_binarizer_.transform(y)
290 Y = Y.tocsc()
291 columns = (col.toarray().ravel() for col in Y.T)
/opt/virtual_env/py3/lib/python3.6/site-packages/sklearn/preprocessing/_label.py in transform(self, y)
478 y_is_multilabel = type_of_target(y).startswith('multilabel')
479 if y_is_multilabel and not self.y_type_.startswith('multilabel'):
--> 480 raise ValueError("The object was not fitted with multilabel"
481 " input.")
482
ValueError: The object was not fitted with multilabel input.
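One way to sidestep this, a suggestion on my part rather than something the traceback dictates, is to drop OneVsRestClassifier and train one binary SGDClassifier per label, since each column of the indicator matrix is an ordinary 0/1 target that partial_fit supports. A rough sketch using the question's mlb, vectorizer and minibatch_iterators:

import numpy as np
from sklearn.linear_model import SGDClassifier

# one independent binary classifier per target label
label_clfs = [SGDClassifier(max_iter=1000, tol=1e-3) for _ in range(len(mlb.classes_))]

for X_train_text, y_train in minibatch_iterators:
    X_train = vectorizer.transform(X_train_text)
    for j, clf in enumerate(label_clfs):
        # column j of the indicator matrix is a plain binary target
        clf.partial_fit(X_train, y_train[:, j], classes=np.array([0, 1]))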
My dataset is a set of system calls for both malware and benign software. I preprocessed it and now it looks like this:
NtQueryPerformanceCounter
NtProtectVirtualMemory
NtProtectVirtualMemory
NtQuerySystemInformation
NtQueryVirtualMemory
NtQueryVirtualMemory
NtProtectVirtualMemory
NtOpenKey
NtOpenKey
NtOpenKey
NtQuerySecurityAttributesToken
NtQuerySecurityAttributesToken
NtQuerySystemInformation
NtQuerySystemInformation
NtAllocateVirtualMemory
NtFreeVirtualMemory
Now I'm using TF-IDF to extract the features, with n-grams to capture sequences of calls:
from __future__ import print_function
import numpy as np
import pandas as pd
from time import time
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils import shuffle
from sklearn.svm import OneClassSVM
nGRAM1 = 8
nGRAM2 = 10
weight = 4
main_corpus_MAL = []
main_corpus_target_MAL = []
main_corpus_BEN = []
main_corpus_target_BEN = []
my_categories = ['benign', 'malware']
# feeding corpus the testing data
print("Loading system call database for categories:")
print(my_categories if my_categories else "all")
import glob
import os
malCOUNT = 0
benCOUNT = 0
for filename in glob.glob(os.path.join('C:\\Users\\alika\\Documents\\testingSVM\\sysMAL', '*.txt')):
    fMAL = open(filename, "r")
    aggregate = ""
    for line in fMAL:
        linea = line[:(len(line)-1)]
        aggregate += " " + linea
    main_corpus_MAL.append(aggregate)
    main_corpus_target_MAL.append(1)
    malCOUNT += 1

for filename in glob.glob(os.path.join('C:\\Users\\alika\\Documents\\testingSVM\\sysBEN', '*.txt')):
    fBEN = open(filename, "r")
    aggregate = ""
    for line in fBEN:
        linea = line[:(len(line) - 1)]
        aggregate += " " + linea
    main_corpus_BEN.append(aggregate)
    main_corpus_target_BEN.append(0)
    benCOUNT += 1
# weight as determined in the top of the code
train_corpus = main_corpus_BEN[:(weight*len(main_corpus_BEN)//(weight+1))]
train_corpus_target = main_corpus_target_BEN[:(weight*len(main_corpus_BEN)//(weight+1))]
test_corpus = main_corpus_MAL[(len(main_corpus_MAL)-(len(main_corpus_MAL)//(weight+1))):]
test_corpus_target = main_corpus_target_MAL[(len(main_corpus_MAL)-len(main_corpus_MAL)//(weight+1)):]
def size_mb(docs):
    return sum(len(s.encode('utf-8')) for s in docs) / 1e6
# size of datasets
train_corpus_size_mb = size_mb(train_corpus)
test_corpus_size_mb = size_mb(test_corpus)
print("%d documents - %0.3fMB (training set)" % (
len(train_corpus_target), train_corpus_size_mb))
print("%d documents - %0.3fMB (test set)" % (
len(test_corpus_target), test_corpus_size_mb))
print("%d categories" % len(my_categories))
print()
print("Benign Traces: "+str(benCOUNT)+" traces")
print("Malicious Traces: "+str(malCOUNT)+" traces")
print()
print("Extracting features from the training data using a sparse vectorizer...")
t0 = time()
vectorizer = TfidfVectorizer(ngram_range=(nGRAM1, nGRAM2), min_df=1, use_idf=True, smooth_idf=True) ##############
analyze = vectorizer.build_analyzer()
X_train = vectorizer.fit_transform(train_corpus)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, train_corpus_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_train.shape)
print()
print("Extracting features from the test data using the same vectorizer...")
t0 = time()
X_test = vectorizer.transform(test_corpus)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, test_corpus_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_test.shape)
print()
The output is:
Loading system call database for categories:
['benign', 'malware']
177 documents - 45.926MB (training set)
44 documents - 12.982MB (test set)
2 categories
Benign Traces: 72 traces
Malicious Traces: 150 traces
Extracting features from the training data using a sparse vectorizer...
done in 7.831695s at 5.864MB/s
n_samples: 177, n_features: 603170
Extracting features from the test data using the same vectorizer...
done in 1.624100s at 7.993MB/s
n_samples: 44, n_features: 603170
Now for the learning section I'm trying to use sklearn OneClassSVM:
print("==================\n")
print("Training: ")
classifier = OneClassSVM(kernel='linear', gamma='auto')
classifier.fit(X_test)
fraud_pred = classifier.predict(X_test)
unique, counts = np.unique(fraud_pred, return_counts=True)
print (np.asarray((unique, counts)).T)
fraud_pred = pd.DataFrame(fraud_pred)
fraud_pred= fraud_pred.rename(columns={0: 'prediction'})
main_corpus_target = pd.DataFrame(main_corpus_target)
main_corpus_target= main_corpus_target.rename(columns={0: 'Category'})
This is the output of fraud_pred and main_corpus_target:
prediction
0 1
1 -1
2 1
3 1
4 1
5 -1
6 1
7 -1
...
30 rows * 1 column
====================
Category
0 1
1 1
2 1
3 1
4 1
...
217 0
218 0
219 0
220 0
221 0
222 rows * 1 column
But when I try to calculate TP, TN, FP, and FN:
##Performance check of the model
TP = FN = FP = TN = 0
for j in range(len(main_corpus_target)):
    if main_corpus_target['Category'][j] == 0 and fraud_pred['prediction'][j] == 1:
        TP = TP+1
    elif main_corpus_target['Category'][j] == 0 and fraud_pred['prediction'][j] == -1:
        FN = FN+1
    elif main_corpus_target['Category'][j] == 1 and fraud_pred['prediction'][j] == 1:
        FP = FP+1
    else:
        TN = TN+1
print(TP, FN, FP, TN)
I get this error:
KeyError Traceback (most recent call last)
<ipython-input-32-1046cc75ba83> in <module>
7 elif main_corpus_target['Category'][j]== 0 and fraud_pred['prediction'][j] == -1:
8 FN = FN+1
----> 9 elif main_corpus_target['Category'][j]== 1 and fraud_pred['prediction'][j] == 1:
10 FP = FP+1
11 else:
c:\users\alika\appdata\local\programs\python\python36\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
1069 key = com.apply_if_callable(key, self)
1070 try:
-> 1071 result = self.index.get_value(self, key)
1072
1073 if not is_scalar(result):
c:\users\alika\appdata\local\programs\python\python36\lib\site-packages\pandas\core\indexes\base.py in get_value(self, series, key)
4728 k = self._convert_scalar_indexer(k, kind="getitem")
4729 try:
-> 4730 return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
4731 except KeyError as e1:
4732 if len(self) > 0 and (self.holds_integer() or self.is_boolean()):
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 30
1) I know the error is because the loop tries to access an index that isn't in fraud_pred (it only has 30 rows), but I can't just insert some numbers into fraud_pred to handle this issue. Any suggestions?
2) Am I doing anything wrong that makes the lengths not match?
3) I want to compare the results to other one-class classification algorithms. Given my method, which ones would be best to use?
Edit: before calculating the metrics, you could change your fit and predict calls to:
fraud_pred = classifier.fit_predict(X_test)
Also, your main_corpus_target and X_test should have the same length; can you post the code where you create main_corpus_target, please?
It's created right after benCOUNT += 1:
main_corpus_target = main_corpus_target_MAL
main_corpus_target.extend(main_corpus_target_BEN)
This means that you are creating a main_corpus_target that includes both MAL and BEN, and the error you would get when comparing it against the 30 predictions is:
ValueError: Found input variables with inconsistent numbers of samples: [30, 222]
The number of samples of fraud_pred is 30, so you should evaluate them with an array of 30. main_corpus_target contains 222.
Looking at your code, I see that you want to evaluate X_test, which comes from test_corpus (X_test = vectorizer.transform(test_corpus)). It would be better to compare your results to test_corpus_target, which is the target variable for that slice of the dataset and also has a length of 30.
These two lines that you have should output the same length:
test_corpus = main_corpus_MAL[(len(main_corpus_MAL)-(len(main_corpus_MAL)//(weight+1))):]
test_corpus_target = main_corpus_target_MAL[(len(main_corpus_MAL)-len(main_corpus_MAL)//(weight+1)):]
May I ask why you are calculating the TP, TN, etc. yourself?
You have a faster option:
Transform the fraud_pred series, replacing each -1 with 0.
Use the confusion_matrix function that sklearn offers.
Use ravel to extract the values of the confusion matrix.
An example, after transforming the -1 to 0. Note that confusion_matrix expects the true labels as the first argument, and both arguments must be the same length, so compare against test_corpus_target rather than the full main_corpus_target:
from sklearn.metrics import confusion_matrix
tn, fp, fn, tp = confusion_matrix(test_corpus_target, fraud_pred).ravel()
If your targets live in a pandas DataFrame instead, extract them with .values, or with .to_numpy() on recent pandas versions.
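A minimal sketch of that transformation; the mapping below assumes the 1/-1 outputs of OneClassSVM should line up with the 1/0 labels in the corpus targets, which depends on what the model was actually fitted on:

import numpy as np
from sklearn.metrics import confusion_matrix

# OneClassSVM predicts 1 for inliers and -1 for outliers;
# map -1 onto the 0 used by the corpus targets
fraud_pred_01 = np.where(np.asarray(fraud_pred).ravel() == -1, 0, 1)

tn, fp, fn, tp = confusion_matrix(test_corpus_target, fraud_pred_01).ravel()
print(tn, fp, fn, tp)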
I am new to ML and have been trying feature selection with the RFE approach. My dataset has 5K records and it's a binary classification problem. This is the code that I am following, based on a tutorial online:
#no of features
nof_list=np.arange(1,13)
high_score=0
#Variable to store the optimum features
nof=0
score_list =[]
for n in range(len(nof_list)):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
    model = RandomForestClassifier()
    rfe = RFE(model, nof_list[n])
    X_train_rfe = rfe.fit_transform(X_train, y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe, y_train)
    score = model.score(X_test_rfe, y_test)
    score_list.append(score)
    if(score > high_score):
        high_score = score
        nof = nof_list[n]
print("Optimum number of features: %d" %nof)
print("Score with %d features: %f" % (nof, high_score))
I encounter the below error. Can someone please help?
TypeError Traceback (most recent call last)
<ipython-input-332-a23dfb331001> in <module>
9 model = RandomForestClassifier()
10 rfe = RFE(model,nof_list[n])
---> 11 X_train_rfe = rfe.fit_transform(X_train,y_train)
12 X_test_rfe = rfe.transform(X_test)
13 model.fit(X_train_rfe,y_train)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
554 Training set.
555
--> 556 y : numpy array of shape [n_samples]
557 Target values.
558
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_selection\_base.py in transform(self, X)
75 X = check_array(X, dtype=None, accept_sparse='csr',
76 force_all_finite=not tags.get('allow_nan', True))
---> 77 mask = self.get_support()
78 if not mask.any():
79 warn("No features were selected: either the data is"
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_selection\_base.py in get_support(self, indices)
44 values are indices into the input feature vector.
45 """
---> 46 mask = self._get_support_mask()
47 return mask if not indices else np.where(mask)[0]
48
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_selection\_rfe.py in _get_support_mask(self)
269
270 def _get_support_mask(self):
--> 271 check_is_fitted(self)
272 return self.support_
273
TypeError: check_is_fitted() missing 1 required positional argument: 'attributes'
What is your sklearn version?
The following (using artificial data) should work fine:
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
X = np.random.rand(100,20)
y = np.ones((X.shape[0]))
#no of features
nof_list=np.arange(1,13)
high_score=0
#Variable to store the optimum features
nof=0
score_list =[]
for n in range(len(nof_list)):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
    model = RandomForestClassifier()
    rfe = RFE(model, nof_list[n])
    X_train_rfe = rfe.fit_transform(X_train, y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe, y_train)
    score = model.score(X_test_rfe, y_test)
    score_list.append(score)
    if(score > high_score):
        high_score = score
        nof = nof_list[n]
print("Optimum number of features: %d" %nof)
print("Score with %d features: %f" % (nof, high_score))
Optimum number of features: 1
Score with 1 features: 1.000000
Versions tested:
sklearn.__version__
'0.20.4'
sklearn.__version__
'0.21.3'
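As an aside (my note, not something the answer above confirms): this kind of signature mismatch usually points at a stale or mixed scikit-learn install, and on recent releases RFE's second parameter became keyword-only, so a defensive form is:

import sklearn
print(sklearn.__version__)  # confirm which install your notebook is importing

# the keyword form works across versions
rfe = RFE(model, n_features_to_select=nof_list[n])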
I have used DataParallel in my script for multiple GPUs. torch.nn.DataParallel is only using the GPU with id 0 and not utilizing the GPU with id 1, so the utilization is very low. Here is the full code:
from __future__ import print_function
from miscc.utils import mkdir_p
from miscc.utils import build_super_images
from miscc.losses import sent_loss, words_loss
from miscc.config import cfg, cfg_from_file
from datasets import TextDataset
from datasets import prepare_data
from model import RNN_ENCODER, CNN_ENCODER
import os
import sys
import time
import random
import pprint
import datetime
import dateutil.tz
import argparse
import numpy as np
from PIL import Image
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import torch.backends.cudnn as cudnn
import torchvision.transforms as transforms
dir_path = (os.path.abspath(os.path.join(os.path.realpath(__file__), './.')))
sys.path.append(dir_path)
UPDATE_INTERVAL = 200
def parse_args():
    parser = argparse.ArgumentParser(description='Train a DAMSM network')
    parser.add_argument('--cfg', dest='cfg_file',
                        help='optional config file',
                        default='cfg/bird.yml', type=str)
    parser.add_argument('--gpu', dest='gpu_id', type=int, default=0)
    parser.add_argument('--data_dir', dest='data_dir', type=str, default='')
    parser.add_argument('--manualSeed', type=int, help='manual seed')
    args = parser.parse_args()
    return args
def train(dataloader, cnn_model, rnn_model, batch_size, labels, optimizer, epoch, ixtoword, image_dir):
    torch.nn.DataParallel(cnn_model.train(), device_ids=[0,1], dim=1).cuda()
    torch.nn.DataParallel(rnn_model.train(), device_ids=[0,1], dim=1).cuda()
    s_total_loss0 = 0
    s_total_loss1 = 0
    w_total_loss0 = 0
    w_total_loss1 = 0
    count = (epoch + 1) * len(dataloader)
    start_time = time.time()
    for step, data in enumerate(dataloader, 0):
        # print('step', step)
        rnn_model.zero_grad()
        cnn_model.zero_grad()
        imgs, captions, cap_lens, \
            class_ids, keys = prepare_data(data)
        # words_features: batch_size x nef x 17 x 17
        # sent_code: batch_size x nef
        words_features, sent_code = cnn_model(imgs[-1])
        # --> batch_size x nef x 17*17
        nef, att_sze = words_features.size(1), words_features.size(2)
        # words_features = words_features.view(batch_size, nef, -1)
        hidden = rnn_model.init_hidden(batch_size)
        # words_emb: batch_size x nef x seq_len
        # sent_emb: batch_size x nef
        words_emb, sent_emb = rnn_model(captions, cap_lens, hidden)
        w_loss0, w_loss1, attn_maps = words_loss(words_features, words_emb, labels,
                                                 cap_lens, class_ids, batch_size)
        w_total_loss0 += w_loss0.data
        w_total_loss1 += w_loss1.data
        loss = w_loss0 + w_loss1
        s_loss0, s_loss1 = \
            sent_loss(sent_code, sent_emb, labels, class_ids, batch_size)
        loss += s_loss0 + s_loss1
        s_total_loss0 += s_loss0.data
        s_total_loss1 += s_loss1.data
        #
        loss.backward()
        #
        # `clip_grad_norm` helps prevent
        # the exploding gradient problem in RNNs / LSTMs.
        torch.nn.utils.clip_grad_norm(rnn_model.parameters(),
                                      cfg.TRAIN.RNN_GRAD_CLIP)
        optimizer.step()
        if step % UPDATE_INTERVAL == 0:
            count = epoch * len(dataloader) + step
            s_cur_loss0 = s_total_loss0[0] / UPDATE_INTERVAL
            s_cur_loss1 = s_total_loss1[0] / UPDATE_INTERVAL
            w_cur_loss0 = w_total_loss0[0] / UPDATE_INTERVAL
            w_cur_loss1 = w_total_loss1[0] / UPDATE_INTERVAL
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches | ms/batch {:5.2f} | '
                  's_loss {:5.2f} {:5.2f} | '
                  'w_loss {:5.2f} {:5.2f}'
                  .format(epoch, step, len(dataloader),
                          elapsed * 1000. / UPDATE_INTERVAL,
                          s_cur_loss0, s_cur_loss1,
                          w_cur_loss0, w_cur_loss1))
            s_total_loss0 = 0
            s_total_loss1 = 0
            w_total_loss0 = 0
            w_total_loss1 = 0
            start_time = time.time()
            # attention Maps
            img_set, _ = \
                build_super_images(imgs[-1].cpu(), captions,
                                   ixtoword, attn_maps, att_sze)
            if img_set is not None:
                im = Image.fromarray(img_set)
                fullpath = '%s/attention_maps%d.png' % (image_dir, step)
                im.save(fullpath)
    return count
def evaluate(dataloader, cnn_model, rnn_model, batch_size):
    cnn_model.eval().cuda()
    rnn_model.eval().cuda()
    s_total_loss = 0
    w_total_loss = 0
    for step, data in enumerate(dataloader, 0):
        real_imgs, captions, cap_lens, \
            class_ids, keys = prepare_data(data)
        words_features, sent_code = cnn_model(real_imgs[-1])
        # nef = words_features.size(1)
        # words_features = words_features.view(batch_size, nef, -1)
        hidden = rnn_model.init_hidden(batch_size)
        words_emb, sent_emb = rnn_model(captions, cap_lens, hidden)
        w_loss0, w_loss1, attn = words_loss(words_features, words_emb, labels,
                                            cap_lens, class_ids, batch_size)
        w_total_loss += (w_loss0 + w_loss1).data
        s_loss0, s_loss1 = \
            sent_loss(sent_code, sent_emb, labels, class_ids, batch_size)
        s_total_loss += (s_loss0 + s_loss1).data
        if step == 50:
            break
    s_cur_loss = s_total_loss[0] / step
    w_cur_loss = w_total_loss[0] / step
    return s_cur_loss, w_cur_loss
def build_models():
    # build model ############################################################
    text_encoder = RNN_ENCODER(dataset.n_words, nhidden=cfg.TEXT.EMBEDDING_DIM)
    image_encoder = CNN_ENCODER(cfg.TEXT.EMBEDDING_DIM)
    labels = Variable(torch.LongTensor(range(batch_size)))
    start_epoch = 0
    if cfg.TRAIN.NET_E != '':
        state_dict = torch.load(cfg.TRAIN.NET_E)
        text_encoder.load_state_dict(state_dict)
        print('Load ', cfg.TRAIN.NET_E)
        #
        name = cfg.TRAIN.NET_E.replace('text_encoder', 'image_encoder')
        state_dict = torch.load(name)
        image_encoder.load_state_dict(state_dict)
        print('Load ', name)
        istart = cfg.TRAIN.NET_E.rfind('_') + 8
        iend = cfg.TRAIN.NET_E.rfind('.')
        start_epoch = cfg.TRAIN.NET_E[istart:iend]
        start_epoch = int(start_epoch) + 1
        print('start_epoch', start_epoch)
    if cfg.CUDA:
        text_encoder = text_encoder.cuda()  # torch.nn.DataParallel(text_encoder, device_ids=[0,1], dim=1)
        image_encoder = image_encoder.cuda()  # torch.nn.DataParallel(image_encoder, device_ids=[0,1], dim=1)
        labels = labels.cuda()  # torch.nn.DataParallel(labels, device_ids=[0,1], dim=1)
    return text_encoder, image_encoder, labels, start_epoch
if __name__ == "__main__":
    args = parse_args()
    if args.cfg_file is not None:
        cfg_from_file(args.cfg_file)
    if args.gpu_id == -1:
        cfg.CUDA = False
    else:
        cfg.GPU_ID = args.gpu_id
    if args.data_dir != '':
        cfg.DATA_DIR = args.data_dir
    print('Using config:')
    pprint.pprint(cfg)
    if not cfg.TRAIN.FLAG:
        args.manualSeed = 100
    elif args.manualSeed is None:
        args.manualSeed = random.randint(1, 10000)
    random.seed(args.manualSeed)
    np.random.seed(args.manualSeed)
    torch.manual_seed(args.manualSeed)
    if cfg.CUDA:
        torch.cuda.manual_seed_all(args.manualSeed)
    ##########################################################################
    now = datetime.datetime.now(dateutil.tz.tzlocal())
    timestamp = now.strftime('%Y_%m_%d_%H_%M_%S')
    output_dir = '../output/%s_%s_%s' % \
        (cfg.DATASET_NAME, cfg.CONFIG_NAME, timestamp)
    model_dir = os.path.join(output_dir, 'Model')
    image_dir = os.path.join(output_dir, 'Image')
    mkdir_p(model_dir)
    mkdir_p(image_dir)
    # torch.cuda.set_device()
    cudnn.benchmark = True
    # Get data loader ##################################################
    imsize = cfg.TREE.BASE_SIZE * (2 ** (cfg.TREE.BRANCH_NUM - 1))
    batch_size = cfg.TRAIN.BATCH_SIZE
    image_transform = transforms.Compose([
        transforms.Scale(int(imsize * 76 / 64)),
        transforms.RandomCrop(imsize),
        transforms.RandomHorizontalFlip()])
    dataset = TextDataset(cfg.DATA_DIR, 'train',
                          base_size=cfg.TREE.BASE_SIZE,
                          transform=image_transform)
    print(dataset.n_words, dataset.embeddings_num)
    assert dataset
    dataloader = torch.utils.data.DataLoader(
        dataset, batch_size=batch_size, drop_last=True,
        shuffle=True, num_workers=int(cfg.WORKERS))
    # # validation data #
    dataset_val = TextDataset(cfg.DATA_DIR, 'test',
                              base_size=cfg.TREE.BASE_SIZE,
                              transform=image_transform)
    dataloader_val = torch.utils.data.DataLoader(
        dataset_val, batch_size=batch_size, drop_last=True,
        shuffle=True, num_workers=int(cfg.WORKERS))
    # Train ##############################################################
    text_encoder, image_encoder, labels, start_epoch = build_models()
    para = list(text_encoder.parameters())
    for v in image_encoder.parameters():
        if v.requires_grad:
            para.append(v)
    # optimizer = optim.Adam(para, lr=cfg.TRAIN.ENCODER_LR, betas=(0.5, 0.999))
    # At any point you can hit Ctrl + C to break out of training early.
    try:
        lr = cfg.TRAIN.ENCODER_LR
        for epoch in range(start_epoch, cfg.TRAIN.MAX_EPOCH):
            optimizer = optim.Adam(para, lr=lr, betas=(0.5, 0.999))
            epoch_start_time = time.time()
            count = torch.nn.DataParallel(train(dataloader, image_encoder, text_encoder,
                                                batch_size, labels, optimizer, epoch,
                                                dataset.ixtoword, image_dir), device_ids=[0,1], dim=1).cuda()
            print('-' * 89)
            if len(dataloader_val) > 0:
                s_loss, w_loss = torch.nn.DataParallel(evaluate(dataloader_val, image_encoder,
                                                                text_encoder, batch_size)).cuda()
                print('| end epoch {:3d} | valid loss '
                      '{:5.2f} {:5.2f} | lr {:.5f}|'
                      .format(epoch, s_loss, w_loss, lr))
            print('-' * 89)
            if lr > cfg.TRAIN.ENCODER_LR / 10.:
                lr *= 0.98
            if (epoch % cfg.TRAIN.SNAPSHOT_INTERVAL == 0 or
                    epoch == cfg.TRAIN.MAX_EPOCH):
                torch.save(image_encoder.state_dict(),
                           '%s/image_encoder%d.pth' % (model_dir, epoch))
                torch.save(text_encoder.state_dict(),
                           '%s/text_encoder%d.pth' % (model_dir, epoch))
                print('Save G/Ds models.')
    except KeyboardInterrupt:
        print('-' * 89)
        print('Exiting from training early')
I have read various articles regarding DataParallel, and according to them this should have worked. Can somebody help me find the solution to this problem?
You must wrap a model in torch.nn.DataParallel only once, immediately after creating the instances of RNN_ENCODER and CNN_ENCODER. Your mistakes are as follows:
Every time the train method is called, the instance of RNN_ENCODER (rnn_model in the train method) and the instance of CNN_ENCODER (cnn_model in the train method) are wrapped by torch.nn.DataParallel. This is wrong. You have to do it only once, after instantiating them in the build_models method; that is what replicates your models across the GPUs for parallel execution. Wrapping the instances again on every epoch (each time train is called) does not help PyTorch parallelize computations, and in your train method the wrapped module is never even assigned to anything, so the call has no effect.
The output of the train method is count, which is an integer variable. By wrapping the output of the train method in torch.nn.DataParallel you are not going to achieve data parallelism.
I suggest you read the official PyTorch documentation on DataParallel for a better understanding of how to use it.
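A minimal sketch of the fix (my illustration, not code from the answer above), applied inside build_models; the right device_ids, and whether the RNN needs a non-default dim, depend on your setup:

# inside build_models(), after the encoders are created and checkpoints loaded
if cfg.CUDA:
    # wrap each module exactly once; DataParallel replicates the module
    # across the listed GPUs and splits each input batch between them
    text_encoder = torch.nn.DataParallel(text_encoder, device_ids=[0, 1]).cuda()
    image_encoder = torch.nn.DataParallel(image_encoder, device_ids=[0, 1]).cuda()
    labels = labels.cuda()  # labels is a tensor, not a module, so it is never wrapped

# train() and evaluate() then use the wrapped encoders as before;
# their return values (count, the losses) must not be wrapped in DataParallel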