Not able to train models in sklearn (scikit-learn) using Python - machine-learning

I have a data file that contains data to predict admission to an MS program.
It contains 9 columns: 8 columns contain student data and the 9th column contains the student's chance of selection.
I am new to this and I don't understand the error that comes up when training the model.
import pandas
import numpy as np
import sklearn as sl
from sklearn.neural_network import MLPClassifier

classifier = MLPClassifier()
data = pandas.read_csv('Addmition.csv')
data_array = np.array(data)
X = data_array[:, 1:8]  # student data columns
y = data_array[:, 8]    # chance of selection
classifier.fit(X, y)
print(classifier)
Traceback (most recent call last):
  File "c.py", line 14, in <module>
    classifier.fit(X,y)
  File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 977, in fit
    hasattr(self, "classes_")))
  File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 324, in _fit
    X, y = self._validate_input(X, y, incremental)
  File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 920, in _validate_input
    self._label_binarizer.fit(y)
  File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\preprocessing\label.py", line 413, in fit
    self.classes_ = unique_labels(y)
  File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\multiclass.py", line 96, in unique_labels
    raise ValueError("Unknown label type: %s" % repr(ys))
ValueError: Unknown label type: (array

Try this:
import pandas
import numpy as np
import sklearn as sl
from sklearn.neural_network import MLPRegressor

# MLPRegressor fits a continuous target such as "chance of selection"
regressor = MLPRegressor()
data = pandas.read_csv('Addmition.csv')
data_array = np.array(data)
X = data_array[:, 1:8]
y = data_array[:, 8]
regressor.fit(X, y)
print(regressor)
Explanation:
In machine learning we may have two types of problems:
1) Classification: the label is discrete.
Ex: predict whether a person is male or female.
2) Regression: the label is continuous.
Ex: predict the age of a person.
With this in hand, look at your problem: your label (chance of selection) is continuous, so you have a regression problem. You are using MLPClassifier, which expects discrete class labels; that is what raises the 'Unknown label type' error. Use MLPRegressor instead.
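If you are unsure how scikit-learn will interpret a label array, its type_of_target utility reports the inferred problem type; a quick check with made-up example labels:
import numpy as np
from sklearn.utils.multiclass import type_of_target

# A continuous target (e.g. a chance of selection between 0 and 1) -> use a regressor
print(type_of_target(np.array([0.92, 0.76, 0.65])))  # 'continuous'
# A discrete target (e.g. admitted / not admitted) -> a classifier is fine
print(type_of_target(np.array([0, 1, 1])))           # 'binary'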

Related

ValueError: X has 5 features, but RandomForestClassifier is expecting 2607 features as input

This is how I am converting the text to a count vector:
from sklearn.feature_extraction.text import CountVectorizer
cv1 = CountVectorizer()
x_traincv = cv1.fit_transform(x_train)
a = x_traincv.toarray()
a  # display the encoded array (notebook cell output)
This is the model I am using for prediction:
from sklearn.ensemble import RandomForestClassifier as RFC
rfc_b = RFC()
rfc_b.fit(a,y_train)
y_pred = rfc_b.predict(a)
This is how I am passing the live details for prediction:
from sklearn.feature_extraction.text import CountVectorizer
document = ["Single Hargrave France Female Graduation",]
# Create a Vectorizer Object
vectorizer = CountVectorizer()
vectorizer.fit(document)
print("Vocabulary: ", vectorizer.vocabulary_)
vector = vectorizer.transform(document)
print("Encoded Document is:")
print(vector.toarray())
I am now using the model to predict:
rfc_b.predict(vector)
The error I am getting:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-62-7cc301d916e6> in <module>()
----> 1 rfc_b.predict(vector)

4 frames
/usr/local/lib/python3.7/dist-packages/sklearn/base.py in _check_n_features(self, X, reset)
    399         if n_features != self.n_features_in_:
    400             raise ValueError(
--> 401                 f"X has {n_features} features, but {self.__class__.__name__} "
    402                 f"is expecting {self.n_features_in_} features as input."
    403             )

ValueError: X has 5 features, but RandomForestClassifier is expecting 2607 features as input.
It works fine with the test set; I did get the output.
from sklearn.metrics import accuracy_score
print('Train accuracy score:',accuracy_score(y_train,y_pred))
print('Test accuracy score:', accuracy_score(y_test,rfc_b.predict(b)))
Train accuracy score: 0.987375
Test accuracy score: 0.773
But not when I use the code above to pass a single input and check the output.
You have to store the vectorizer used during training and just call .transform on it. If you create a new one, you lose the meaning of the dimensions used during training; in particular, you will lack many of them, but your new vectorizer has no idea about this (as it only has access to your one sample).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier as RFC

# Fit the vectorizer ONCE, on the training data
cv1 = CountVectorizer()
x_traincv = cv1.fit_transform(x_train)
a = x_traincv.toarray()

rfc_b = RFC()
rfc_b.fit(a, y_train)
y_pred = rfc_b.predict(a)

# Reuse the SAME fitted vectorizer for new documents
document = ["Single Hargrave France Female Graduation"]
vector = cv1.transform(document)
print("Encoded Document is:")
print(vector.toarray())
rfc_b.predict(vector)
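If the live prediction happens in a separate script or session, persist both the fitted vectorizer and the model and load them back; a minimal sketch using joblib (the file names are arbitrary):
import joblib

# After training: save the fitted vectorizer and the model
joblib.dump(cv1, 'vectorizer.joblib')
joblib.dump(rfc_b, 'model.joblib')

# In the prediction script: load them and transform with the SAME vectorizer
cv1 = joblib.load('vectorizer.joblib')
rfc_b = joblib.load('model.joblib')
vector = cv1.transform(["Single Hargrave France Female Graduation"])
print(rfc_b.predict(vector))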

Saving tensors to a .pt file in order to create a dataset

I was tasked with the creation of a dataset to test the functionality of the code we're working on.
The dataset must have a group of tensors that will be used later on in a generative model.
I'm trying to save the tensors to a .pt file, but I keep overwriting them, so I end up with a file that contains only one tensor. I've read about torch.utils.data.Dataset but I'm not able to figure out on my own how to use it.
Here is my code:
import torch
import numpy as np
from torch.utils.data import Dataset

# variables that will be used to create the size of the tensors:
num_jets, num_particles, num_features = 1, 30, 3

for i in range(100):
    # tensor from a gaussian dist with mean=5, std=1 and shape=size:
    tensor = torch.normal(5, 1, size=(num_jets, num_particles, num_features))
    # We will need the tensors to be of the cpu type
    tensor = tensor.cpu()
    # save the tensor to 'tensor_dataset.pt'
    torch.save(tensor, 'tensor_dataset.pt')

# open the recently created .pt file inside a list
tensor_list = torch.load('tensor_dataset.pt')
# prints the list. Just one tensor inside .pt file
print(tensor_list)
Reason: you overwrite tensor on every iteration, and torch.save overwrites the file as well, so you never build a list; at the end the file holds only the last tensor.
Solution: you know the size of each tensor, so you can preallocate one big tensor first and fill it inside the loop:
import torch

num_jets, num_particles, num_features = 1, 30, 3

# Preallocate one tensor that holds all 100 samples
lst_tensors = torch.empty(size=(100, num_jets, num_particles, num_features))
for i in range(100):
    lst_tensors[i] = torch.normal(5, 1, size=(num_jets, num_particles, num_features))
    lst_tensors[i] = lst_tensors[i].cpu()

torch.save(lst_tensors, 'tensor_dataset.pt')
tensor_list = torch.load('tensor_dataset.pt')
print(tensor_list.shape)  # torch.Size([100, 1, 30, 3])
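Since the question mentions torch.utils.data.Dataset: once all samples live in one tensor, the built-in TensorDataset wrapper is usually enough; a short sketch (the batch size is an arbitrary choice):
from torch.utils.data import TensorDataset, DataLoader

# Wrap the big tensor so that each row along dim 0 is one sample
dataset = TensorDataset(lst_tensors)
loader = DataLoader(dataset, batch_size=10, shuffle=True)

for (batch,) in loader:
    print(batch.shape)  # torch.Size([10, 1, 30, 3])
    break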

ValueError - Error when checking target - LSTM

About the dataset
The Reuters dataset contains 11,228 texts corresponding to news items classified into 46 categories. The texts are encoded in the sense that each word corresponds to an integer. I specify that we want to work with 2000 words.
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
num_words = 2000
(reuters_train_x, reuters_train_y), (reuters_test_x, reuters_test_y) = tf.keras.datasets.reuters.load_data(num_words=num_words)
n_labels = np.unique(reuters_train_y).shape[0]
print("labels: {}".format(n_labels))
# This is the first news item
print(reuters_train_x[0])
Implementing the LSTM
I need to implement a network with a single LSTM with 10 units. The input needs an embedding with 10 dimensions before entering the LSTM cell. Finally, a dense layer needs to be added to adjust the number of outputs with the number of categories.
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.utils import to_categorical

reuters_train_y = to_categorical(reuters_train_y, 46)
reuters_test_y = to_categorical(reuters_test_y, 46)

model = Sequential()
model.add(Embedding(input_dim=num_words, output_dim=10))
model.add(LSTM(10))
model.add(Dense(46, activation='softmax'))
Training
model.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])
history = model.fit(reuters_train_x,reuters_train_y,epochs=20,validation_data=(reuters_test_x,reuters_test_y))
The error message that I get is:
ValueError: Error when checking target: expected dense_2 to have shape (46,) but got array with shape (1,)
You need to one-hot-encode your y labels.
from tensorflow.keras.utils import to_categorical
reuters_train_y = to_categorical(reuters_train_y, 46)
reuters_test_y = to_categorical(reuters_test_y, 46)
Another bug I see is in the fit call: you are passing validation_data=(reuters_test_x, reuters_train_y), but it should be validation_data=(reuters_test_x, reuters_test_y).
Your x is a numpy array of lists with different lengths. You need to pad the sequences to get a fixed shape numpy array.
reuters_train_x = tf.keras.preprocessing.sequence.pad_sequences(
reuters_train_x, maxlen=50
)
reuters_test_x = tf.keras.preprocessing.sequence.pad_sequences(
reuters_test_x, maxlen=50
)
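Putting the fixes together, a minimal end-to-end sketch of the corrected pipeline (maxlen=50 is an arbitrary choice; tune it to your data):
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.utils import to_categorical

num_words = 2000
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.reuters.load_data(num_words=num_words)

# Pad the variable-length sequences to a fixed shape
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=50)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=50)

# One-hot-encode the 46 category labels
y_train = to_categorical(y_train, 46)
y_test = to_categorical(y_test, 46)

model = Sequential()
model.add(Embedding(input_dim=num_words, output_dim=10))
model.add(LSTM(10))
model.add(Dense(46, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, y_train, epochs=20, validation_data=(x_test, y_test))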

No error while fitting the model on train data, but NotFittedError when predicting on the test set

A NotFittedError comes up when using .predict; during fit there is no error.
I tried converting the dataframe into arrays, but I still get the same error.
Input:
rfg(n_estimators=500,random_state=42).fit(X=data_withoutnull1.iloc[:,1:8],y=data_withoutnull1['LotFrontage'])
rfg(n_estimators=500,random_state=42).predict(datawithnull1.iloc[:,1:8])
Output:
Traceback (most recent call last):
  File "<ipython-input-477-10c6d72bcc12>", line 2, in <module>
    rfg(n_estimators=500,random_state=42).predict(datawithnull1.iloc[:,1:8])
  File "/home/sinikoibra/miniconda3/envs/pv36/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 691, in predict
    check_is_fitted(self, 'estimators_')
  File "/home/sinikoibra/miniconda3/envs/pv36/lib/python3.6/site-packages/sklearn/utils/validation.py", line 914, in check_is_fitted
    raise NotFittedError(msg % {'name': type(estimator).__name__})
NotFittedError: This RandomForestRegressor instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
Try it like this. Note that rfg(n_estimators=500, random_state=42) builds a brand-new, unfitted estimator every time it is called, so the model fitted on the first line is discarded and .predict runs on a second instance that was never fitted. Store the estimator in a variable instead:
# Define X and y
X = data_withoutnull1.iloc[:, 1:8].values
y = data_withoutnull1['LotFrontage']

# Fitting Random Forest Regression to the dataset
from sklearn.ensemble import RandomForestRegressor
rfg = RandomForestRegressor(n_estimators=500, random_state=42)
rfg.fit(X, y)

# Predicting a new result with the same fitted instance
y_pred = rfg.predict(datawithnull1.iloc[:, 1:8])
You can also use train_test_split to split the data into a training set and a testing set, then pass X_train to fit and X_test to predict, as in the sketch below.
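A brief sketch of that train/test-split variant (test_size=0.2 is an arbitrary choice):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hold out part of the data to check the model on unseen rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rfg = RandomForestRegressor(n_estimators=500, random_state=42)
rfg.fit(X_train, y_train)
y_pred = rfg.predict(X_test)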

Memory error while doing Hierarchical Clustering

I have a large dataset (207989, 23), and I am trying to apply hierarchical clustering to just one column right now, to test whether it's suitable for the task at hand.
What I have tried:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
data = pd.read_csv('gpmd.csv', header = 0)
X = data.loc[:, ['ContextID', 'BacksGas_Flow_sccm']]
min_max_scaler = preprocessing.MinMaxScaler()
X_minmax = min_max_scaler.fit_transform(X.values[:,[1]])
import scipy.cluster.hierarchy as sch
dendrogram = sch.dendrogram(sch.linkage(X_minmax, method = 'ward'))
After doing this, I am getting the following error:
Traceback (most recent call last):
  File "<ipython-input-4-429f42b68112>", line 1, in <module>
    dendrogram = sch.dendrogram(sch.linkage(X_minmax, method = 'ward'))
  File "C:\Users\kashy\Anaconda3\envs\py36\lib\site-packages\scipy\cluster\hierarchy.py", line 708, in linkage
    y = distance.pdist(y, metric)
  File "C:\Users\kashy\Anaconda3\envs\py36\lib\site-packages\scipy\spatial\distance.py", line 1877, in pdist
    dm = np.empty((m * (m - 1)) // 2, dtype=np.double)
MemoryError
Can someone explain what exactly is the problem here?
Thanks in advance
Hierarchical clustering in most variants needs O(n²) memory.
Because of this, most implementations will fail at around 65,535 instances, when they hit the 32-bit mark (some may fail at 32k already). But just do the math: n * n * 8 bytes for double precision. How much memory would you need?
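For the 207,989 rows in the question, the condensed distance matrix that SciPy's pdist allocates (the np.empty call in the traceback above) makes this concrete:
n = 207_989
# pdist allocates n*(n-1)//2 doubles for the condensed distance matrix
bytes_needed = n * (n - 1) // 2 * 8
print(f"{bytes_needed / 1e9:.0f} GB")  # ~173 GB, far beyond typical RAM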
