Testing Prediction Model with User Input - machine-learning

I am a beginner in ML. I am working on a college project and have successfully trained a model, but I am not sure how to test it on user input. My project checks whether the data entered for a person indicates diabetes or not.
Data CSV:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
6 148 72 35 0 33.6 0.627 50 1
1 85 66 29 0 26.6 0.351 31 0
8 183 64 0 0 23.3 0.672 32 1
1 89 66 23 94 28.1 0.167 21 0
0 137 40 35 168 43.1 2.288 33 1
5 116 74 0 0 25.6 0.201 30 0
3 78 50 32 88 31 0.248 26 1
10 115 0 0 0 35.3 0.134 29 0
2 197 70 45 543 30.5 0.158 53 1
Code:
from sklearn.ensemble import RandomForestClassifier
random_forest_model = RandomForestClassifier(random_state=10)
random_forest_model.fit(X_train, y_train.ravel())
predict_train_data = random_forest_model.predict(X_test)
from sklearn import metrics
print("Accuracy = {0:.3f}".format(metrics.accuracy_score(y_test, predict_train_data)))
Code for User Input:
print("Enter your own data to test the model:")
pregnancy = int(input("Enter Pregnancy:"))
glucose = int(input("Enter Glucose:"))
bloodpressure = int(input("Enter Blood Pressure:"))
skinthickness = int(input("Enter Skin Thickness:"))
insulin = int(input("Enter Insulin:"))
bmi = float(input("Enter BMI:"))
DiabetesPedigreeFunction = float(input("Enter DiabetesPedigreeFunction:"))
age = int(input("Enter Age:"))
userInput = [pregnancy, glucose, bloodpressure, skinthickness, insulin, bmi,
DiabetesPedigreeFunction, age]
I want it to return 1 if the person is diabetic or 0 if not.
EDIT - added x_train and y_train:
from sklearn.model_selection import train_test_split
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
predicted_class = ['Outcome']
X = data[feature_columns].values
y = data[predicted_class].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state=10)
from sklearn.ensemble import RandomForestClassifier
random_forest_model = RandomForestClassifier(random_state=10)
random_forest_model.fit(X_train, y_train.ravel())

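Try
result = random_forest_model.predict([userInput])[0]
because the model expects a 2D array (a list of observations, each one a list of feature values) and returns one prediction per observation. Wrapping the single input in a list makes it a one-row 2D array, and [0] picks out its prediction.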
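A minimal sketch of the same idea on synthetic data, in case it helps to see the shapes involved. The random features, the made-up labelling rule and the name `user_input` are assumptions for illustration only; with the real project, the fitted `random_forest_model` and the list built from `input()` would take their place.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the diabetes data: 8 numeric features, binary outcome.
rng = np.random.default_rng(10)
X_train = rng.random((200, 8))
y_train = (X_train[:, 1] > 0.5).astype(int)  # made-up rule, for illustration only

model = RandomForestClassifier(random_state=10)
model.fit(X_train, y_train)

# One observation, in the same column order that was used for training.
user_input = [0.2, 0.9, 0.5, 0.3, 0.1, 0.4, 0.6, 0.7]

# predict() expects shape (n_samples, n_features), so wrap the single row in a list.
predictions = model.predict([user_input])
result = int(predictions[0])  # the single prediction, 0 or 1

print(predictions.shape)
print(result)
```

The values must be given in the same order as the training columns, otherwise the model silently reads each number as the wrong feature.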

Related

ValueError: Input X contains NaN

I'm trying to classify my traffic using an SVM, as below:
import pandas as pd # for process the DataSet
import matplotlib.pyplot as plt
ds= pd.read_csv("dataset_sdn.csv") # to read the dataset with name (ds)
ds.fillna(0)
ds #
ds output
X = ds.iloc[: , [4,5,6,7,8,9,10,11,12,13,14,17,18,19,20,21]] # Input Features
Y = ds.iloc[:, 22] # OutPut
print (X)
print (Y)
X output
Y output
from sklearn.model_selection import train_test_split
X_Train, X_Test, Y_Train, Y_Test = train_test_split (X, Y, test_size=0.25, random_state=0)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_Train = sc_X.fit_transform(X_Train)
X_Test = sc_X.transform(X_Test)
from sklearn.svm import SVC
classifier = SVC (kernel='linear', random_state=0)
classifier.fit(X_Train, Y_Train)
Y_pred = classifier.predict(X_Test)
In this last step I get the following error message:
ValueError Traceback (most recent call last)
Input In [43], in <cell line: 3>()
1 from sklearn.svm import SVC
2 classifier = SVC (kernel='linear', random_state=0)
----> 3 classifier.fit(X_Train, Y_Train)
5 # The output predect
6 Y_pred = classifier.predict(X_Test)
File
~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\svm\_base.py:173,
in BaseLibSVM.fit(self, X, y, sample_weight)
171 check_consistent_length(X, y)
172 else:
--> 173 X, y = self._validate_data(
174 X,
175 y,
176 dtype=np.float64,
177 order="C",
178 accept_sparse="csr",
179 accept_large_sparse=False,
180 )
182 y = self._validate_targets(y)
184 sample_weight = np.asarray(
185 [] if sample_weight is None else sample_weight, dtype=np.float64
186 )
File
~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\base.py:596,
in BaseEstimator._validate_data(self, X, y, reset,
validate_separately, **check_params)
594 y = check_array(y, input_name="y", **check_y_params)
595 else:
--> 596 X, y = check_X_y(X, y, **check_params)
597 out = X, y
599 if not no_val_X and check_params.get("ensure_2d", True):
File
~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:1074,
in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order,
copy, force_all_finite, ensure_2d, allow_nd, multi_output,
ensure_min_samples, ensure_min_features, y_numeric, estimator)
1069 estimator_name = _check_estimator_name(estimator)
1070 raise ValueError(
1071 f"{estimator_name} requires y to be passed, but the target y is None"
1072 )
-> 1074 X = check_array(
1075 X,
1076 accept_sparse=accept_sparse,
1077 accept_large_sparse=accept_large_sparse,
1078 dtype=dtype,
1079 order=order,
1080 copy=copy,
1081 force_all_finite=force_all_finite,
1082 ensure_2d=ensure_2d,
1083 allow_nd=allow_nd,
1084 ensure_min_samples=ensure_min_samples,
1085 ensure_min_features=ensure_min_features,
1086 estimator=estimator,
1087 input_name="X",
1088 )
1090 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
1092 check_consistent_length(X, y)
File
~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:899,
in check_array(array, accept_sparse, accept_large_sparse, dtype,
order, copy, force_all_finite, ensure_2d, allow_nd,
ensure_min_samples, ensure_min_features, estimator, input_name)
893 raise ValueError(
894 "Found array with dim %d. %s expected <= 2."
895 % (array.ndim, estimator_name)
896 )
898 if force_all_finite:
--> 899 _assert_all_finite(
900 array,
901 input_name=input_name,
902 estimator_name=estimator_name,
903 allow_nan=force_all_finite == "allow-nan",
904 )
906 if ensure_min_samples > 0:
907 n_samples = _num_samples(array)
File
~\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\utils\validation.py:146,
in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name,
input_name)
124 if (
125 not allow_nan
126 and estimator_name (...)
130 # Improve the error message on how to handle missing values in
131 # scikit-learn.
132 msg_err += (
133 f"\n{estimator_name} does not accept missing values"
134 " encoded as NaN natively. For supervised learning, you might want" (...)
144 "#estimators-that-handle-nan-values"
145 )
--> 146 raise ValueError(msg_err)
148 # for object dtype data, we only check for NaNs (GH-13254)
149 elif X.dtype == np.dtype("object") and not allow_nan:
ValueError: Input X contains NaN. SVC does not accept missing values
encoded as NaN natively. For supervised learning, you might want to
consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor
which accept missing values encoded as NaNs natively. Alternatively,
it is possible to preprocess the data, for instance by using an
imputer transformer in a pipeline or drop samples with missing values.
See https://scikit-learn.org/stable/modules/impute.html You can find a
list of all estimators that handle NaN values at the following page:
https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
So, please, any advice on how to solve this error? As far as I can tell there aren't any NaN values in the dataset.
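You are not replacing the old DataFrame with the new one: fillna returns a new DataFrame and leaves ds unchanged unless you assign the result back or pass inplace=True.
Use this: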
ds = ds.fillna(0)
OR
ds.fillna(0, inplace=True)
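A small sketch of the difference, using a tiny made-up DataFrame (the column names `a` and `b` are placeholders, not columns from `dataset_sdn.csv`):

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with two missing values.
ds = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 5.0, 6.0]})

ds.fillna(0)  # returns a new DataFrame; the result is discarded, ds is unchanged
nan_count_before = int(ds.isna().sum().sum())

ds = ds.fillna(0)  # keep the returned DataFrame
nan_count_after = int(ds.isna().sum().sum())

print(nan_count_before, nan_count_after)  # 2 0
```

Checking `ds.isna().sum()` right before the train/test split is a quick way to confirm that no NaN values are left.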

Decision Tree - Exporting image via Graphviz error

I'm trying to build a Decision Tree using gridsearch and a pipeline, but I get an error when I try to export the image using graphviz. I looked online and couldn't find anything; one potential problem would've been if I didn't use the best_estimator_ instance, but I did in this case.
Everything works (getting accuracy and other metrics) except the exporting graph part.
def TreeOpt(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    std_scl = StandardScaler()
    dec_tree = tree.DecisionTreeClassifier()
    pipe = Pipeline(steps=[('std_slc', std_scl),
                           ('dec_tree', dec_tree)])
    criterion = ['gini', 'entropy']
    max_depth = list(range(1,15))
    parameters = dict(dec_tree__criterion=criterion,
                      dec_tree__max_depth=max_depth)
    tree_gs = GridSearchCV(pipe, parameters)
    tree_gs.fit(X_train, y_train)
    export_graphviz(
        tree_gs.best_estimator_,
        out_file=("dec_tree.dot"),
        feature_names=None,
        class_names=None,
        filled=True)
But I get
<ipython-input-2-bb91ec6ba0d9> in <module>
37 filled=True)
38
---> 39 DecTreeOptimizer(X = df.drop(['quality'], axis=1), y = df.quality)
40
<ipython-input-2-bb91ec6ba0d9> in DecTreeOptimizer(X, y)
30 print("Best score: " + str(tree_GS.best_score_))
31
---> 32 export_graphviz(
33 tree_GS.best_estimator_,
34 out_file=("dec_tree.dot"),
~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
64
65 # extra_args > 0
~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\tree\_export.py in export_graphviz(decision_tree, out_file, max_depth, feature_names, class_names, label, filled, leaves_parallel, impurity, node_ids, proportion, rotate, rounded, special_characters, precision)
767 """
768
--> 769 check_is_fitted(decision_tree)
770 own_file = False
771 return_string = False
~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
61 extra_args = len(args) - len(all_args)
62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
64
65 # extra_args > 0
~\AppData\Local\Programs\Python\Python39\lib\site-packages\sklearn\utils\validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
1096
1097 if not attrs:
-> 1098 raise NotFittedError(msg % {'name': type(estimator).__name__})
1099
1100
NotFittedError: This Pipeline instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
After a long search, I finally found the answer here: Plot best decision tree with pipeline and GridsearchCV
The best_estimator_ attribute returns the fitted Pipeline rather than the decision tree itself, so I just had to index into it to get the tree step: best_estimator_[1] (and then I found a whole lot of other problems with my code, but that's part 2).
I will leave this here in case anyone else needs it. Cheers!
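A minimal sketch of that fix, assuming the iris dataset as stand-in data and a shortened `max_depth` grid to keep it quick. The step can be fetched by position (`best_estimator_[1]`) or by the name given in the pipeline (`named_steps["dec_tree"]`); passing `out_file=None` makes `export_graphviz` return the DOT source as a string instead of writing a file.

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)  # stand-in data, not the wine-quality frame

pipe = Pipeline(steps=[("std_slc", StandardScaler()),
                       ("dec_tree", tree.DecisionTreeClassifier(random_state=0))])
parameters = {"dec_tree__criterion": ["gini", "entropy"],
              "dec_tree__max_depth": [1, 2, 3]}

tree_gs = GridSearchCV(pipe, parameters)
tree_gs.fit(X, y)

best_pipeline = tree_gs.best_estimator_             # a fitted Pipeline, not a tree
best_tree = best_pipeline.named_steps["dec_tree"]   # same object as best_pipeline[1]

# Export the tree step only; out_file=None returns the DOT source as a string.
dot_source = tree.export_graphviz(best_tree, out_file=None, filled=True)
print(type(best_tree).__name__)
print(len(dot_source))
```

Looking the step up by name keeps working if another preprocessing step is later inserted in front of the tree, whereas the positional index would then point at the wrong step.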

Train/Test split object detection

Is there a script or function that splits the data while counting the number of appearances of each class in every image and keeping them balanced across the splits?
I've tried sklearn's train_test_split in this way:
data = pd.read_csv('train_labels.csv')
data.head()
Class is what I want to predict, on one image I can have 0..n rectangles and each rectangle has a class.
data = data.drop_duplicates(subset="filename")
y = data['class']
X = data.drop('class',axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2)
But when I delete the duplicate filenames I lose information, and a file may end up in train or test while carrying many other classes that were dropped. If I don't delete them, the same file can end up in both train and test.
Thanks for your help.
There is a library, scikit-multilearn, which helps to split multi-label data.
Installation: pip install scikit-multilearn
Documentation: http://scikit.ml/stratification.html
Implementation:
Let's suppose that in the dataframe df, X1 and X2 are the feature columns and y is the target column.
Our data can be classified into the following classes: class1, class2, class3.
df = pd.DataFrame({
"X1": [1,2,3,4,5,6,7,8],
"X2": [6,7,8,9,10,11,12,13],
"y": ["class1", "class1", "class2", "class2", "class3", "class3", "class1", "class2"]})
After running above code we have dataframe:
X1 X2 y
0 1 6 class1
1 2 7 class1
2 3 8 class2
3 4 9 class2
4 5 10 class3
5 6 11 class3
6 7 12 class1
7 8 13 class2
But scikit-multilearn works with one-hot encoded target columns, so we have to transform our target column.
one_hot_classes = pd.get_dummies(df["y"])
Which will output:
class1 class2 class3
0 1 0 0
1 1 0 0
2 0 1 0
3 0 1 0
4 0 0 1
5 0 0 1
6 1 0 0
7 0 1 0
We will drop y column and concatenate one_hot_classes
df.drop("y", axis = 1, inplace=True)
df = pd.concat([df, one_hot_classes], axis=1)
After running above code:
X1 X2 class1 class2 class3
0 1 6 1 0 0
1 2 7 1 0 0
2 3 8 0 1 0
3 4 9 0 1 0
4 5 10 0 0 1
5 6 11 0 0 1
6 7 12 1 0 0
7 8 13 0 1 0
Now we put the feature and target columns in the X and y variables:
X = df[["X1", "X2"]]
y = df[["class1", "class2", "class3"]]
Now we will get splits:
from skmultilearn.model_selection import iterative_train_test_split
X_train, y_train, X_test, y_test = iterative_train_test_split(X.values, y.values, test_size = 0.5)
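For the object-detection case in the question, where each image has several boxes, the label matrix needs one row per image rather than one row per box. The sketch below shows one possible way to build it with pandas. The filenames and class names are invented, and only the `filename` and `class` column names are taken from the question.

```python
import pandas as pd

# Hypothetical annotation table: one row per bounding box.
data = pd.DataFrame({
    "filename": ["a.jpg", "a.jpg", "b.jpg", "c.jpg", "c.jpg", "c.jpg"],
    "class":    ["car",   "dog",   "car",   "dog",   "dog",   "cat"],
})

# One row per image, one column per class, values = number of boxes of that class.
counts = pd.crosstab(data["filename"], data["class"])

# Binary indicator matrix: does the image contain the class at all?
y = (counts > 0).astype(int)

# Filenames as a single-column 2D array, aligned row by row with y.
X = y.index.to_numpy().reshape(-1, 1)

print(counts)
print(X.shape, y.shape)
```

Here `X` and `y.values` are in the 2D array form used in the call above, so each filename would land in exactly one split. That last step is not run here and should be checked against the scikit-multilearn version in use.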

How to use Recursive Feature elimination?

I am new to ML and have been trying feature selection with the RFE approach. My dataset has 5K records and it is a binary classification problem. This is the code that I am following, based on an online tutorial:
#no of features
nof_list=np.arange(1,13)
high_score=0
#Variable to store the optimum features
nof=0
score_list =[]
for n in range(len(nof_list)):
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 0)
    model = RandomForestClassifier()
    rfe = RFE(model,nof_list[n])
    X_train_rfe = rfe.fit_transform(X_train,y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe,y_train)
    score = model.score(X_test_rfe,y_test)
    score_list.append(score)
    if(score>high_score):
        high_score = score
        nof = nof_list[n]
print("Optimum number of features: %d" %nof)
print("Score with %d features: %f" % (nof, high_score))
I encounter the error below. Can someone please help?
TypeError Traceback (most recent call last)
<ipython-input-332-a23dfb331001> in <module>
9 model = RandomForestClassifier()
10 rfe = RFE(model,nof_list[n])
---> 11 X_train_rfe = rfe.fit_transform(X_train,y_train)
12 X_test_rfe = rfe.transform(X_test)
13 model.fit(X_train_rfe,y_train)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\base.py in fit_transform(self, X, y, **fit_params)
554 Training set.
555
--> 556 y : numpy array of shape [n_samples]
557 Target values.
558
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_selection\_base.py in transform(self, X)
75 X = check_array(X, dtype=None, accept_sparse='csr',
76 force_all_finite=not tags.get('allow_nan', True))
---> 77 mask = self.get_support()
78 if not mask.any():
79 warn("No features were selected: either the data is"
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_selection\_base.py in get_support(self, indices)
44 values are indices into the input feature vector.
45 """
---> 46 mask = self._get_support_mask()
47 return mask if not indices else np.where(mask)[0]
48
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_selection\_rfe.py in _get_support_mask(self)
269
270 def _get_support_mask(self):
--> 271 check_is_fitted(self)
272 return self.support_
273
TypeError: check_is_fitted() missing 1 required positional argument: 'attributes'
What is your sklearn version?
The following (using artificial data) works fine on the versions listed at the end:
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
X = np.random.rand(100,20)
y = np.ones((X.shape[0]))
#no of features
nof_list=np.arange(1,13)
high_score=0
#Variable to store the optimum features
nof=0
score_list =[]
for n in range(len(nof_list)):
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 0)
    model = RandomForestClassifier()
    rfe = RFE(model,nof_list[n])
    X_train_rfe = rfe.fit_transform(X_train,y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe,y_train)
    score = model.score(X_test_rfe,y_test)
    score_list.append(score)
    if(score>high_score):
        high_score = score
        nof = nof_list[n]
print("Optimum number of features: %d" %nof)
print("Score with %d features: %f" % (nof, high_score))
Optimum number of features: 1
Score with 1 features: 1.000000
Versions tested:
sklearn.__version__
'0.20.4'
sklearn.__version__
'0.21.3'
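For anyone running a newer release than the two tested above: as far as I know, recent scikit-learn versions accept `n_features_to_select` only as a keyword argument, so the positional `RFE(model, nof_list[n])` call may need adjusting. A minimal sketch on artificial data follows; the target rule, the choice of 5 features and `n_estimators=20` are arbitrary assumptions to keep it quick.

```python
import numpy as np
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

print(sklearn.__version__)  # worth checking first when an internal TypeError appears

rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # made-up binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Keyword arguments work on both older and newer scikit-learn releases.
rfe = RFE(estimator=RandomForestClassifier(n_estimators=20, random_state=0),
          n_features_to_select=5)
X_train_rfe = rfe.fit_transform(X_train, y_train)
X_test_rfe = rfe.transform(X_test)

print(X_train_rfe.shape, X_test_rfe.shape)  # (70, 5) (30, 5)
print(int(rfe.support_.sum()))              # number of features kept
```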

Testing accuracy more than training accuracy

I am building a tuned random forest model for multiclass classification.
I'm getting the following results
Training accuracy(AUC) :0.9921996
Testing accuracy(AUC) :0.992237664
I saw a related question on this website, and the common answer seems to be that the dataset must be small and the model got lucky.
But in my case I have about 300k training data points and 100k testing data points.
Also, my classes are distributed in the same proportions in the train and test sets:
> summary(train$Bucket)
0 1 TO 30 121 TO 150 151 TO 180 181 TO 365 31 TO 60 366 TO 540 541 TO 730 61 TO 90
166034 32922 4168 4070 15268 23092 8794 6927 22559
730 + 91 TO 120
20311 11222
> summary(test$Bucket)
0 1 TO 30 121 TO 150 151 TO 180 181 TO 365 31 TO 60 366 TO 540 541 TO 730 61 TO 90
55344 10974 1389 1356 5090 7698 2932 2309 7520
730 + 91 TO 120
6770 3741
Is it possible for a model to fit this well on such a large test set? Please tell me if there is something I can do to cross-check that my model is indeed fitting really well.
My complete code
split = sample.split(Book2$Bucket,SplitRatio =0.75)
train = subset(Book2,split==T)
test = subset(Book2,split==F)
traintask <- makeClassifTask(data = train,target = "Bucket")
rf <- makeLearner("classif.randomForest")
params <- makeParamSet(makeIntegerParam("mtry",lower = 2,upper = 10),makeIntegerParam("nodesize",lower = 10,upper = 50))
#set validation strategy
rdesc <- makeResampleDesc("CV",iters=5L)
#set optimization technique
ctrl <- makeTuneControlRandom(maxit = 5L)
#start tuning
tune <- tuneParams(learner = rf ,task = traintask ,resampling = rdesc ,measures = list(acc) ,par.set = params ,control = ctrl ,show.info = T)
rf.tree <- setHyperPars(rf, par.vals = tune$x)
tune$y
r<- train(rf.tree, traintask)
getLearnerModel(r)
testtask <- makeClassifTask(data = test,target = "Bucket")
rfpred <- predict(r, testtask)
performance(rfpred, measures = list(mmce, acc))
The difference is below 1e-4 (0.992237664 - 0.9921996 is about 0.00004), so nothing is wrong: it is ordinary statistical noise (variance of the result). Nothing to worry about. Even a difference of 0.0001 would literally mean about 0.0001 * 100,000 = 10 samples ... 10 samples out of 100k.
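A rough back-of-the-envelope check of that argument, written in Python for convenience. It assumes the reported scores can be treated as plain accuracies on independent samples (the question labels them "accuracy(AUC)", so this is only an approximation), and takes the test size as the sum of the counts in summary(test$Bucket).

```python
import math

train_score = 0.9921996
test_score = 0.992237664
n_test = 105123  # sum of the class counts printed by summary(test$Bucket)

difference = test_score - train_score
samples_equivalent = difference * n_test

# Binomial standard error of a proportion p estimated from n independent samples.
standard_error = math.sqrt(test_score * (1 - test_score) / n_test)

print(f"difference:         {difference:.6f}")
print(f"equivalent samples: {samples_equivalent:.1f}")
print(f"standard error:     {standard_error:.6f}")
```

Under these assumptions the gap between the two scores is several times smaller than the sampling noise expected on a test set of this size, which is why it carries no information about which score is "really" higher.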

Resources