I am working on an ML project to predict a particular boolean column.
I am using SVMs in scikit-learn. The descriptive features I am trying to use are label-encoded integers with at most three unique values.
The confusion matrix and classification report produced after I make the prediction are shown below. I can't understand why the model never predicts class 1. Any help?
The code below uses one descriptive feature that contains three unique integers, and the target feature is a boolean column.
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)
array([[9309,    0],
       [8896,    0]], dtype=int64)
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.51      1.00      0.68      9309
           1       0.00      0.00      0.00      8896

    accuracy                           0.51     18205
Here is the encoding and training code:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Label-encode the categorical columns (the same encoder object is re-fitted
# for each column, which is fine because it is not reused afterwards)
encoder = preprocessing.LabelEncoder()
df['Gaze'] = encoder.fit_transform(df['Gaze'])
df['Blink'] = encoder.fit_transform(df['Blink'])
df['Brows'] = encoder.fit_transform(df['Brows'])
df['QuestionType'] = encoder.fit_transform(df['QuestionType'])

y = df['QuestionType']  # target (boolean, encoded to 0/1); a 1-D target avoids a shape warning in fit()
X = df[['Blush']]       # single descriptive feature with three unique integer values

# The original snippet used X_train/X_test without showing the split;
# the parameters below are assumed, not taken from the question.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Scale the feature to [-1, 1], fitting the scaler on the training split only
scaling = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
X_train = scaling.transform(X_train)
X_test = scaling.transform(X_test)

clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
y_score = clf.decision_function(X_test)
y_pred = clf.predict(X_test)
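Given the all-zero predictions above, one quick check (not from the original post; it only reuses the column names shown in the snippet) is how balanced the target is and whether the single 'Blush' feature carries any signal about it. With a single feature, a linear SVM can only place one threshold, so if no feature value clearly favours class 1 the model will fall back to predicting a single class:

import pandas as pd

# Class balance of the target
print(df['QuestionType'].value_counts(normalize=True))

# Proportion of each target class for each of the three feature values;
# if every row is close to 50/50, this feature alone cannot separate the classes
print(pd.crosstab(df['Blush'], df['QuestionType'], normalize='index'))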
I have a numeric dataset with just 55 samples and 270 features. I'm trying to separate these samples into clusters, but clustering is hard in such a high-dimensional space, so I'm thinking about using an autoencoder for dimensionality reduction. However, I'm not sure whether that is feasible with such a small dataset. Note that the approach matters because the idea is to reuse it on other datasets with similar characteristics.
With the following code, using mean squared error as the loss function, I have reached a loss of 4.9, which I think is high. Note that the dataset is already normalized.
Is it possible to use autoencoders for dimensionality reduction in this case?
This is the source for building the autoencoder and training it:
import keras
import tensorflow as tf
from keras import layers
from keras import regularizers
from keras.datasets import mnist
import numpy as np
import matplotlib.pyplot as plt
from keras.callbacks import EarlyStopping
preservationRatio = 0.99
epochs = 500

data = loadData("dataset.csv")  # loadData is my own CSV-loading helper (not shown)
samples = len(data)
features = len(data[0])

# The same data is used for training and for validation (no held-out split)
x_train = data
x_test = data

es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, min_delta=0.001, patience=50)

# Size of the bottleneck layer as a fraction of the input dimensionality
encoding_dim = int(features * preservationRatio)
input_number = keras.Input(shape=(features,))
# Add a Dense layer with a L1 activity regularizer
encoded = layers.Dense(encoding_dim, activation='relu',
activity_regularizer=regularizers.l1(1e-7))(input_number)
decoded = layers.Dense(features, activation='sigmoid')(encoded)
autoencoder = keras.Model(input_number, decoded)
encoder = keras.Model(input_number, encoded)
# This is our encoded input
encoded_input = keras.Input(shape=(encoding_dim,))
# Retrieve the last layer of the autoencoder model
decoder_layer = autoencoder.layers[-1]
# Create the decoder model
decoder = keras.Model(encoded_input, decoder_layer(encoded_input))
autoencoder.compile(optimizer='adam', loss=tf.keras.losses.MeanSquaredError())
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))
print(x_train.shape)
print(x_test.shape)
history = autoencoder.fit(x_train, x_train,
epochs=epochs,
batch_size=20,
callbacks=[es],
shuffle=True,
validation_data=(x_test, x_test),
verbose = 1)
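One way to judge whether a reconstruction loss of 4.9 is actually high (a sanity-check sketch, not part of the original code) is to compare it against the trivial predictor that always outputs the per-feature mean of the data; an autoencoder that cannot beat this baseline is not learning a useful representation:

import numpy as np

# MSE of always predicting the per-feature mean: for data standardized to unit
# variance this is about 1.0, and for data min-max scaled to [0, 1] it is
# usually well below 1.0, so a loss of 4.9 would then be worse than the baseline.
baseline_mse = np.mean((data - data.mean(axis=0)) ** 2)
print("baseline (mean-predictor) MSE:", baseline_mse)

# Reconstruction MSE of the trained autoencoder on the same data
reconstruction = autoencoder.predict(data)
print("autoencoder reconstruction MSE:", np.mean((data - reconstruction) ** 2))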
I am trying to build a prediction model for flight fares. My data set has several categorical variables like class, hour, day of week, day of month, month of year, etc. I am using multiple algorithms like XGBoost and an ANN to fit the model.
Initially I one-hot encoded these variables, which led to a total of 90 variables. When I fitted a model to this data, the training r2_score was high, around 0.90, but the test score was relatively low (about 0.60).
I then used sine and cosine transformations for the temporal variables, which led to a total of only 27 variables. With this the training score dropped to 0.83 but the test score increased to 0.70.
I was thinking that my variables are sparse, so I tried PCA, but this drastically reduced the performance on both the train set and the test set.
So I have a few questions:
Why is PCA not helping, and instead reducing the performance of my model so badly?
Any suggestions on how to improve my model's performance?
Code:
from xgboost import XGBRegressor
import pandas as pd
import matplotlib.pyplot as plt
dataset = pd.read_excel('Airline Dataset1.xlsx',sheet_name='Airline Dataset1')
dataset = dataset.drop(columns = ['SL. No.'])
dataset['time'] = dataset['time'] - 24
import numpy as np
dataset['time'] = np.where(dataset['time']==24,0,dataset['time'])
cat_cols = ['demand', 'from_ind', 'to_ind']
cyc_cols = ['time','weekday','month','monthday']
def cyclic_encode(data, col, col_max):
    # Map a cyclic variable onto the unit circle so that, e.g., hour 23
    # ends up close to hour 0 instead of far away
    data[col + '_sin'] = np.sin(2 * np.pi * data[col] / col_max)
    data[col + '_cos'] = np.cos(2 * np.pi * data[col] / col_max)
    return data
cyclic_encode(dataset,'time',23)
cyclic_encode(dataset,'weekday',6)
cyclic_encode(dataset,'month',11)
cyclic_encode(dataset,'monthday',31)
dataset = dataset.drop(columns=cyc_cols)
ohe_dataset = pd.get_dummies(dataset,columns = cat_cols , drop_first=True)
X = ohe_dataset.iloc[:,:-1]
y = ohe_dataset.iloc[:,27:28]
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train_us, X_test_us, y_train_us, y_test_us = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_Y = StandardScaler()
X_train = sc_X.fit_transform(X_train_us)
X_test = sc_X.transform(X_test_us)
y_train = sc_Y.fit_transform(y_train_us)
y_test = sc_Y.transform(y_test_us)
# Applying PCA
from sklearn.decomposition import PCA
pca = PCA(n_components=2)            # keep only the first 2 principal components
X_train = pca.fit_transform(X_train) # PCA is unsupervised, so y is not passed
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
regressor = XGBRegressor()
model = regressor.fit(X_train,y_train)
# Predicting the test and train sets with the fitted regressor
y_pred = regressor.predict(X_test)
y_pred = sc_Y.inverse_transform(y_pred.reshape(-1, 1))  # predict() returns 1-D; the scaler expects 2-D
y_pred_train = regressor.predict(X_train)
y_pred_train = sc_Y.inverse_transform(y_pred_train.reshape(-1, 1))
y_train = sc_Y.inverse_transform(y_train)
y_test = sc_Y.inverse_transform(y_test)
#calculate r2_score
from sklearn.metrics import r2_score
score_train = r2_score(y_train,y_pred_train)
score_test = r2_score(y_test,y_pred)
Thanks
You don't really need PCA for such a low-dimensional problem. Decision trees perform very well even with thousands of variables.
Here are a few things you can try:
Pass a watchlist and train only until you stop improving on the validation set (see the sketch after this list). https://github.com/dmlc/xgboost/blob/2d95b9a4b6d87e9f630c59995403988dee390c20/demo/guide-python/basic_walkthrough.py#L64
Try the sine/cosine transformations and the one-hot encodings together in one model (again with a watchlist).
Look for more causal data. Seasonal patterns alone do not explain air fare fluctuations. To start, you can add flags for festivals, holidays and important dates, and engineer features for proximity to those days. Weather data is also easy to find and add.
PCA usually helps when you have extreme dimensionality (like genome data) or when the algorithm involved does not do well in high-dimensional spaces (like kNN).
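A minimal sketch of the watchlist idea from the first bullet, using the XGBRegressor wrapper the question already imports (the split and parameter values are placeholders, and depending on the installed xgboost version early_stopping_rounds is passed to fit() or to the constructor):

from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Hold out part of the training data as a validation ("watch") set
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

regressor = XGBRegressor(n_estimators=1000)
regressor.fit(
    X_tr, y_tr,
    eval_set=[(X_tr, y_tr), (X_val, y_val)],  # the watchlist reported every boosting round
    early_stopping_rounds=20,                 # stop once validation error stops improving
    verbose=True,
)
print("best iteration:", regressor.best_iteration)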
I used a support vector machine model for classification on the iris data set, using the train_test_split function to split the data into training and testing subsets.
When test_size was 0.3 the accuracy was lower; I then decreased the testing subset to 0.06 and now the accuracy is 1, i.e. 100%. The reason seems clear: with less testing data there is less noise and fluctuation in the estimate.
My question is: we want our model to be efficient, but what value of test_size is acceptable for that? At what value of test_size will the result be viable?
Here are some lines of code from my program:
from sklearn import datasets
from sklearn import svm
import numpy as np
from sklearn import metrics
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
C=1.0
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn versions
x_train, x_test, y_train ,y_test = train_test_split(X,y,test_size=0.06, random_state=4)
svc = svm.SVC(kernel='linear', C=C).fit(x_train,y_train)
y_pred = svc.predict(x_test)
print(metrics.accuracy_score(y_test,y_pred))
lin_svc = svm.LinearSVC(C=C).fit(x_train,y_train)
y_pred = lin_svc.predict(x_test)
print(metrics.accuracy_score(y_test,y_pred))
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(x_train,y_train)
y_pred =rbf_svc.predict(x_test)
print(metrics.accuracy_score(y_test,y_pred))
poly_svc = svm.SVC(kernel='poly',degree=3, C=C).fit(x_train,y_train)
y_pred = poly_svc.predict(x_test)
print(metrics.accuracy_score(y_test,y_pred))
The result is 100% accuracy for all four cases.
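With only 150 samples in iris, a 6% test set is just 9 examples, so a single split gives a very noisy accuracy estimate. One way to see how much the estimate depends on the particular split (a sketch, not part of the original post) is to score the same model with k-fold cross-validation instead of one train/test split:

from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target

# 10-fold cross-validation: every sample is used for testing exactly once,
# so the mean score is far less sensitive to how one particular split falls
svc = svm.SVC(kernel='linear', C=1.0)
scores = cross_val_score(svc, X, y, cv=10)
print(scores.mean(), scores.std())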
I'm trying to build a regression model, validate and test it and make sure it doesn't overfit the data. This is my code thus far:
from pandas import read_csv
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score, validation_curve
import numpy as np
import matplotlib.pyplot as plt
data = np.array(read_csv('timeseries_8_2.csv', index_col=0))
inputs = data[:, :8]
targets = data[:, 8:]
x_train, x_test, y_train, y_test = train_test_split(
inputs, targets, test_size=0.1, random_state=2)
rate1 = 0.005
rate2 = 0.1
mlpr = MLPRegressor(hidden_layer_sizes=(12,10), max_iter=700, learning_rate_init=rate1)
# trained = mlpr.fit(x_train, y_train) # should I fit before cross val?
# predicted = mlpr.predict(x_test)
scores = cross_val_score(mlpr, inputs, targets, cv=5)
print(scores)
scores prints an array of 5 numbers, where the first number is usually around 0.91 and is always the largest in the array.
I'm having a little trouble figuring out what to do with these numbers. If the first number is the largest, does that mean the model scored highest on the first cross-validation attempt and that the scores then decreased as it kept cross-validating?
Also, should I fit the model on the training data before I call the cross-validation function? I tried commenting it out and it gives me more or less the same results.
The cross validation function performs the model fitting as part of the operation, so you gain nothing from doing that by hand:
The following example demonstrates how to estimate the accuracy of a linear kernel support vector machine on the iris dataset by splitting the data, fitting a model and computing the score 5 consecutive times (with different splits each time):
http://scikit-learn.org/stable/modules/cross_validation.html#computing-cross-validated-metrics
And yes, the returned numbers reflect multiple runs:
Returns: Array of scores of the estimator for each run of the cross validation.
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score
Finally, there is no reason to expect that the first result is the largest:
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn.neural_network import MLPRegressor
boston = datasets.load_boston()
est = MLPRegressor(hidden_layer_sizes=(120,100), max_iter=700, learning_rate_init=0.0001)
cross_val_score(est, boston.data, boston.target, cv=5)
# Output
array([-0.5611023 , -0.48681641, -0.23720267, -0.19525727, -4.23935449])
I'm dealing with a simple logistic regression problem. Each sample contains 7423 features, with 4000 training samples and 1000 testing samples in total. scikit-learn takes 0.01 s to train the model and achieves 97% accuracy, but Keras (TensorFlow backend) takes 10 s to reach the same accuracy after 50 epochs (even one epoch is 20x slower than sklearn). Can anyone shed light on this huge gap?
Samples:
X_train: matrix of 4000*7423, 0.0 <= value <= 1.0
y_train: matrix of 4000*1, value = 0.0 or 1.0
X_test: matrix of 1000*7423, 0.0 <= value <= 1.0
y_test: matrix of 1000*1, value = 0.0 or 1.0
Sklearn code:
from sklearn.linear_model import LogisticRegression  # the 'sklearn.linear_model.logistic' path is private and removed in newer versions
from sklearn.metrics import accuracy_score
classifier = LogisticRegression()
# Finished in 0.01s
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print('test accuracy = %.2f' % accuracy_score(predictions, y_test))
[output]: test accuracy = 0.97
Keras code:
# Using TensorFlow as backend
from keras.models import Sequential
from keras.layers import Dense, Activation
model = Sequential()
model.add(Dense(1, input_dim=X_train.shape[1], activation='sigmoid'))
model.compile(loss='binary_crossentropy',
optimizer='rmsprop',
metrics=['accuracy'])
# Finished in 10s
model.fit(X_train, y_train, batch_size=64, epochs=50, verbose=0)  # 'epochs' was 'nb_epoch' in older Keras versions
result = model.evaluate(X_test, y_test, verbose=0)
print('test accuracy = %.2f' % result[1])
[output]: test accuracy = 0.97
It might be the optimizer or the loss. You use a non-linearity, and sklearn probably also uses a different batch size (or no mini-batches at all) under the hood.
But the way I see it, you have a specific task: one of the tools is tailored and made to solve it, while the other is a more general framework that can solve it but is not optimized for it and probably does a lot of work that is not needed for this problem, which slows everything down.
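As a rough illustration of the batch-size point (a sketch only, not from the original answer; the batch size and epoch count are placeholders), training the same one-neuron Keras model with much larger batches performs far fewer weight updates per epoch, which is usually where most of the per-epoch overhead goes:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(1, input_dim=X_train.shape[1], activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# batch_size=64 on 4000 samples means ~63 gradient updates per epoch;
# full-batch training does a single update per epoch, so each epoch is much
# cheaper (it may need more epochs to reach the same accuracy, though).
model.fit(X_train, y_train, batch_size=len(X_train), epochs=200, verbose=0)
print(model.evaluate(X_test, y_test, verbose=0))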