Use RBF Kernel with Chi-squared distance metric in SVM - machine-learning

How to achieve the title mentioned task. Do we have any parameter in RBF kernel to set the distance metric as chi-squared distance metric. I can see a chi2_kernel in the sk-learn library.
Below is the code that i have written.
import numpy as np
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.preprocessing import Imputer
from numpy import genfromtxt
from sklearn.metrics.pairwise import chi2_kernel
file_csv = 'dermatology.data.csv'
dataset = genfromtxt(file_csv, delimiter=',')
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=1)
dataset = imp.fit_transform(dataset)
target = dataset[:, [34]].flatten()
data = dataset[:, range(0,34)]
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3)
# TODO : willing to set chi-squared distance metric instead. How to do that ?
clf = svm.SVC(kernel='rbf', C=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f1_score(y_test, y_pred, average="macro"))
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))

Are you sure you want to compose rbf and chi2? Chi2 on its own defines a valid kernel, and all you have to do is
clf = svm.SVC(kernel=chi2_kernel, C=1)
since sklearn accepts functions as kernels (however this will require O(N^2) memory and time). If you would like to compose these two it is a bit more complex, and you will have to implement your own kernel to do that. For a bit more control (and other kernels) you might also try pykernels, however there is no support for composing yet.

Related

ERROR: Output of Logistic Regression is displayed as LogisticRegression()

I just tried to implement logistic regression on a very simple and small dataset at Jupyter notebook. But the output that I am getting at the end having applied the algorithm is unwanted and shocking. I am getting the output as LogisticRegression() only nothing but only this.
import numpy as np
import pandas as pd
df = pd.read_csv('placement.csv')
df.head()
df.info()
df = df.iloc[:,1:]
df.head()
import matplotlib.pyplot as plt
plt.scatter(df['cgpa'],df['iq'],c=df['placement'])
X = df.iloc[:,0:2]
y = df.iloc[:,-1]
X
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.1)
X_train
y_train
X_test
y_test
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_train
X_test = scaler.transform(X_test)
X_test
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train,y_train)
LogisticRegression() ## at the end I get this.
Please bear with me for the way I have uploaded the code.
How can I fix this output of logisticregression(), need help.
From my error I get to know that the scikit-learn model repr implementations were updated so they would not display default parameters when printing some time back.

Get 100% accuracy score on Decision tree model

I got 100% accuracy on my decision tree using decision tree algorithm but only got 75% accuracy on random forest
Is there something wrong with my model or is decision tree best suited for the dataset provide?
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state= 30)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test,y_pred)
print(cm)
At First it may look like your model is overfitted but it is not the case as you have put the test set aside.
The reason is Data Leak. Random Forest, randomly exludes some features for every tree. Now suppose you have the labels as one of the features: in some trees the label got excluded and the accuracy is reduced while in the Decission three the label is always among the featurs and predict the result perfectly.
How can you find if it is the case?
use the visualization for the Decision three and if my guess is true you will find that there a few number of decision nodes. You can also visualize the correlation between label and every feature and check out if there is any perfevt correlation or not.

Can I use autoencoders for dimensionality reduction with a very small dataset?

I have a numeric dataset with just 55 samples and 270 features. I'm trying to separate these samples into clusters, however, it is hard to perform clustering in such high-dimensional space. Thus, I'm thinking about using autoencoders for performance dimensionality reduction, but I'm not sure if it is possible to do that with such a small dataset. Notice that this application is useful because the idea is to allow it to deal with different datasets that have similar characteristics.
With the following code, using Mean Squared Error as the loss function, I have achieved a loss of 4.9, which I think that is high. Notice that the dataset is already normalized.
Is it possible to use autoencoders for dimensionality reduction in this case?
This is the source for building the autoencoder and training it:
import keras
import tensorflow as tf
from keras import layers
from keras import regularizers
from keras.datasets import mnist
import numpy as np
import matplotlib.pyplot as plt
from keras.callbacks import EarlyStopping
features = 0
preservationRatio = 0.99
epochs = 500
data = loadData("dataset.csv")
samples = len(data)
features = len(data[0])
x_train = data
x_test = data
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1,min_delta=0.001, patience=50)
encoding_dim = int(features*preservationRatio)
input_number = keras.Input(shape=(features,))
# Add a Dense layer with a L1 activity regularizer
encoded = layers.Dense(encoding_dim, activation='relu',
activity_regularizer=regularizers.l1(1e-7))(input_number)
decoded = layers.Dense(features, activation='sigmoid')(encoded)
autoencoder = keras.Model(input_number, decoded)
encoder = keras.Model(input_number, encoded)
# This is our encoded input
encoded_input = keras.Input(shape=(encoding_dim,))
# Retrieve the last layer of the autoencoder model
decoder_layer = autoencoder.layers[-1]
# Create the decoder model
decoder = keras.Model(encoded_input, decoder_layer(encoded_input))
autoencoder.compile(optimizer='adam', loss=tf.keras.losses.MeanSquaredError())
x_train = x_train.reshape((len(x_train), np.prod(x_train.shape[1:])))
x_test = x_test.reshape((len(x_test), np.prod(x_test.shape[1:])))
print(x_train.shape)
print(x_test.shape)
history = autoencoder.fit(x_train, x_train,
epochs=epochs,
batch_size=20,
callbacks=[es],
shuffle=True,
validation_data=(x_test, x_test),
verbose = 1)

How to build voting classifier in sklearn when the individual classifiers are being fit with different datasets?

I'm building a classifier using highly unbalanced data. The strategy I'm interesting in testing is ensembling a model using 3 different resampled datasets. In other words, each dataset will have all the samples from the rare class, but only n samples of the abundant class (technique #4 mentioned in this article).
I want to fit 3 different VotingClassifiers on each resampled dataset, and then combine the results of the individual models using another VotingClassifier (or similar). I know that building a single voting classifier looks like this:
# First Model
rnd_clf_1 = RandomForestClassifier()
xgb_clf_1 = XGBClassifier()
voting_clf_1 = VotingClassifier(
estimators = [
('rf', rnd_clf_1),
('xgb', xgb_clf_1),
],
voting='soft'
)
# And I can fit it with the first dataset this way:
voting_clf_1.fit(X_train_1, y_train_1)
But how to stack the three of them if they are fitted on different datasets? For example, if I had three fitted models (see code below), I could build a function that calls the .predict_proba() method on each of the models and then "manually" averages the individual probabilities.
But... is there a better way?
# Fitting the individual models... but how to combine the predictions?
voting_clf_1.fit(X_train_1, y_train_1)
voting_clf_2.fit(X_train_2, y_train_2)
voting_clf_3.fit(X_train_3, y_train_3)
Thanks!
Usually the #4 method shown in the article is implemented with same type of classifier. It looks like you want to try VotingClassifier on each sample dataset.
There is an implementation of this methodology already in imblearn.ensemble.BalancedBaggingClassifier, which is an extension from Sklearn Bagging approach.
You can feed the estimator as VotingClassifier and number of estimators as the number of times, which you want carry out the dataset sampling. Use sampling_strategy param to mention proportion of downsampling which you want on Majority class.
Working Example:
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from imblearn.ensemble import BalancedBaggingClassifier # doctest: +NORMALIZE_WHITESPACE
X, y = make_classification(n_classes=2, class_sep=2,
weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
print('Original dataset shape %s' % Counter(y))
X_train, X_test, y_train, y_test = train_test_split(X, y,
random_state=0)
rnd_clf_1 = RandomForestClassifier()
xgb_clf_1 = xgb.XGBClassifier()
voting_clf_1 = VotingClassifier(
estimators = [
('rf', rnd_clf_1),
('xgb', xgb_clf_1),
],
voting='soft'
)
bbc = BalancedBaggingClassifier(base_estimator=voting_clf_1, random_state=42)
bbc.fit(X_train, y_train) # doctest: +ELLIPSIS
y_pred = bbc.predict(X_test)
print(confusion_matrix(y_test, y_pred))
See here. May be you can reuse _predict_proba() and _collect_probas() functions after fitting your estimators manually.

In a SVM model, will results be viable when I decrease the test size to 0.06

I used support vector machine model for classification using iris data set. I used train test split function to split the data-set into training and testing subsets.
when test_size was 0.3 the accuracy was low and then I decreased the size of testing subset to 0.06 and now the accuracy is 1 ie. 100%. obviously, the reason is clear, its because with testing data the amount of noise and fluctuations as decreases.
My question is- we want our model to be efficient but what value of test_size is acceptable for that. at what value of test_size will the result be viable.
here is some line of code from my program-
from sklearn import datasets
from sklearn import svm
import numpy as np
from sklearn import metrics
iris = datasets.load_iris()
X = iris.data[:, :2]
y = iris.target
C=1.0
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train ,y_test = train_test_split(X,y,test_size=0.06, random_state=4)
svc = svm.SVC(kernel='linear', C=C).fit(x_train,y_train)
y_pred = svc.predict(x_test)
print(metrics.accuracy_score(y_test,y_pred))
lin_svc = svm.LinearSVC(C=C).fit(x_train,y_train)
y_pred = lin_svc.predict(x_test)
print(metrics.accuracy_score(y_test,y_pred))
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(x_train,y_train)
y_pred =rbf_svc.predict(x_test)
print(metrics.accuracy_score(y_test,y_pred))
poly_svc = svm.SVC(kernel='poly',degree=3, C=C).fit(x_train,y_train)
y_pred = poly_svc.predict(x_test)
print(metrics.accuracy_score(y_test,y_pred))
result is 100% accuracy for all 4 cases.

Resources