Masking Layer using Keras Functional API - time-series

I'm trying to use Keras' Functional API because I need to feed an additional input into my recurrent neural net between two layers. That part would be fine, but I also have inputs for the period I want to forecast, so I'm using a masking layer to keep the out-of-sample data from affecting the model. The masking layer throws an error when I pass it an Input(...), as it expects a list of integers, not tensor variables. Is there a specific masking layer for the Functional API? Here is my code:
import numpy
import pandas
import math
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import GRU
from keras.layers import SimpleRNN
from keras.layers.core import Masking
from keras.optimizers import Adam
from keras.optimizers import RMSprop
from keras.optimizers import Nadam
from keras.layers import TimeDistributed
from keras import initializations
from keras.layers import Input, merge
from keras.models import Model
dataframe = pandas.read_csv('C:/Users/RNCZF01/Documents/Cameron-Fen/Economic Ideas/LSTM/LSTM-data/GDP+stockmarket.csv', usecols=[1,3], header=None, engine='python')
dataset = dataframe.values
dataset = dataset.astype('float32')
dataframe1 = pandas.read_csv('C:/Users/RNCZF01/Documents/Cameron-Fen/Economic Ideas/LSTM/LSTM-data/GDP+stockmarket.csv', usecols=[0], header=None, engine='python')
dataset1 = dataframe1.values
dataset1 = dataset1.astype('float32')
train, test = dataset[start:train_size+test_size,:]*mult, dataset1[start:train_size+test_size,:]*mult
#set the masking to 0.0
for each in range(test_size):
    train[train_size + each,:] = [0.0,0.0]
train, test = train[:259], test[:259]
validx, validy = dataset[start:train_size+test_size,:]*mult, dataset1[start:train_size+test_size,:]
main_input = Input(shape=(259,2), name='main_input')
m = Masking(mask_value=0.0)(main_input)  # the error is here: Masking expects indices to be integers, not a tensor variable
Here is the data. Use column 0 as the test data and columns 1 and 3 as the training data in the code.
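To make the masking convention used above concrete: as the Keras documentation describes it, a timestep is skipped only when every feature at that timestep equals mask_value. The following is a minimal NumPy sketch of that rule, not Keras code itself; the toy array `x` is hypothetical and stands in for the zero-padded `train` array.

```python
import numpy as np

mask_value = 0.0

# Hypothetical toy batch: 1 sequence, 5 timesteps, 2 features.
x = np.array([[[1.2, 0.5],
               [0.9, 0.0],    # only one feature is 0.0, so the step is kept
               [1.1, 0.7],
               [0.0, 0.0],    # every feature equals mask_value, so the step is skipped
               [0.0, 0.0]]], dtype="float32")

# A timestep is kept if at least one feature differs from mask_value.
keep = np.any(x != mask_value, axis=-1)
print(keep.tolist())  # [[True, True, True, False, False]]
```

One consequence worth checking in the real data: a genuine observation whose features are all exactly 0.0 would be masked as well, so a sentinel value that cannot occur in the series may be a safer choice.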

Related

Clustering on 97 features of categorical data

I am trying to apply unsupervised learning to a dataset with 97 features and around 6500 rows/samples. All features are discrete (mostly values from 1 to 10), and some are binary (0/1). What are some of the best clustering algorithms to apply to this data? Thank you!
It's impossible to say in advance which clustering algorithm will perform best on your dataset. You have to try several methods and inspect the results you get. Here are several clustering algorithms that you can try.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
Here is a small sample.
import statsmodels.api as sm
import numpy as np
import pandas as pd
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df_cars = pd.DataFrame(mtcars)
df_cars.head()
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from matplotlib import pyplot
# define dataset
X = df_cars[['mpg','hp']]
# define the model
model = KMeans(n_clusters=8)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# store the labels on the original frame so both plots below can use them
df_cars['kmeans'] = yhat
# plot X & Y coordinates and color by cluster number
pyplot.scatter(df_cars['mpg'], df_cars['hp'], c=df_cars['kmeans'], cmap='rainbow', s=50, alpha=0.8)
import plotly.express as px
fig = px.scatter(df_cars, x="hp", y="mpg", color="kmeans", size='mpg', hover_data=['kmeans'])
fig.show()
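The advice to try several algorithms and inspect the results can be sketched as a small comparison loop. This is only an illustration under stated assumptions: the data is synthetic (`make_blobs`), the two candidate models and the silhouette score are choices made for this example, and the Euclidean distance it relies on may not suit discrete or binary features.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data; replace with your own feature matrix.
X_demo, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)

# Candidate algorithms to compare (an arbitrary pair for illustration).
candidates = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
}

# Fit each model and score its labels; silhouette ranges from -1 to 1, higher is better.
scores = {}
for name, model in candidates.items():
    labels = model.fit_predict(X_demo)
    scores[name] = silhouette_score(X_demo, labels)

for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

A single score is only a starting point; plotting the clusters, as in the sample above, is still worth doing.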

How to know if my data has been scaled by StandardScaler?

"I have scaled my dataset using StandardScaler. How can I verify that it has been scaled? I am sure it has been, but how can I see it?"
As @Coderji said, you can always compute the mean and standard deviation of each column, which should be approximately 0 and 1 respectively.
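The mean and standard deviation check can be sketched as follows, using the iris data as an example. The tolerance values are assumptions; floating-point rounding means the means come out very close to 0 rather than exactly 0.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X = datasets.load_iris().data
x_scaled = StandardScaler().fit_transform(X)

# Per-column statistics of the scaled data.
col_means = x_scaled.mean(axis=0)
col_stds = x_scaled.std(axis=0)  # population std (ddof=0), which is what StandardScaler uses

print(np.round(col_means, 6))
print(np.round(col_stds, 6))

# True when every column has mean ~0 and std ~1.
is_scaled = bool(np.allclose(col_means, 0.0, atol=1e-8) and np.allclose(col_stds, 1.0))
print(is_scaled)
```

Note that pandas' `DataFrame.std()` defaults to the sample standard deviation (ddof=1), so it reports values slightly above 1 for the same data.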
However, there is another method to visualize it.
from sklearn import datasets
import numpy as np
from sklearn.preprocessing import StandardScaler
I am using iris dataset for this example.
iris = datasets.load_iris()
X = iris.data
sc = StandardScaler()
sc.fit(X)
x = sc.transform(X)
import matplotlib.pyplot as plt
import seaborn as sns
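# distplot is deprecated in recent seaborn releases; histplot with kde=True is the replacement
sns.histplot(x[:,1], kde=True)
The resulting plot shows the distribution of the scaled sepal width (column index 1).
Similarly, you can plot every variable, or a simple pairplot will do the job.
This gives a visual indication that the data is standardised.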

How to specify the input_shape(input_dim) in the Keras Sequential model dynamically while using pipeline?

I create a Keras Sequential model by passing a list of layer instances to the constructor. To do that, I need to pass an input_shape argument to the first layer inside my create_model() function. Generally, I can get a shape tuple like this:
input_shape=(len(X_train.keys()),)
Meanwhile, I am using a pipeline to take care of my preprocessing steps, such as imputing, scaling, encoding, and feature selection. As a result, the number of features after preprocessing is not the same as before, and I cannot tell in advance how many input nodes the first hidden layer needs. Currently I run the model, get an error regarding dense_1_input, and only then update the shape accordingly.
Is there a way to specify the input_shape dynamically when using the pipeline?
Use pipelines to clean up modeling code
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_selection import SelectFromModel, RFE
from sklearn.linear_model import LassoCV
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('feature_selection', SelectFromModel(LassoCV(cv=5))),
    ('scaler', StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    # impute before encoding: after one-hot encoding there are no missing values left to fill
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ('feature_selection', SelectFromModel(LassoCV(cv=5))),
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
Initializing the ANN Model
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import RMSprop
from keras.callbacks import Callback, EarlyStopping
def create_model(optimizer='adagrad',
                 kernel_initializer='glorot_uniform',
                 dropout=0.2):
    model = Sequential()
    model.add(Dense(64, activation='relu', kernel_initializer=kernel_initializer,
                    input_shape=(len(X_train.keys()),)))  # len(X_train.keys()) is not correct here
    model.add(Dropout(dropout))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_absolute_error', optimizer=optimizer,
                  metrics=['mean_absolute_error'])
    return model
My desired output is to access the shape of the data after it has been preprocessed by the pipeline.
This is probably a similar unanswered question:
Keras + DataFrameMapper + make_pipeline, input_dim dilemma
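One possible approach, sketched under assumptions rather than taken from an accepted answer: fit the preprocessing step on its own first, read the number of output columns from the transformed array, and pass that number to the model builder. The toy DataFrame, its column names, and the `input_dim` argument mentioned in the final comment are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy training frame standing in for X_train.
X_train = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40.0, 55.5, 82.0, 61.0],
    "city": ["a", "b", "c", "a"],
})

# Simplified preprocessor: scaling for numeric columns, one-hot encoding for categorical ones.
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Fit the preprocessing alone and read the resulting feature count.
X_train_prepared = preprocessor.fit_transform(X_train)
n_features = X_train_prepared.shape[1]
print(X_train.shape[1], "->", n_features)  # 3 -> 5

# The count could then be handed to the model builder, for example
# create_model(input_dim=n_features), assuming create_model is changed to accept it.
```

With supervised steps such as SelectFromModel in the preprocessor, `fit_transform` also needs the target, so the call would become `preprocessor.fit_transform(X_train, y_train)`.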

How to prevent simple keras autoencoder from over compressing data?

I am trying to use the Keras frontend with the TensorFlow backend to build a simple autoencoder as a multidimensional scaling technique that plots multidimensional data in 2 dimensions. Many times when I run it (I'm not sure how to set the random seed for Keras, by the way), one of the dimensions collapses, yielding a 1-dimensional embedding (the plots should help explain). Why is this happening? How can I make sure both dimensions are preserved and used by the autoencoder? I realize this is the simplest, most basic form of an autoencoder, but I would like to build on it to make better autoencoders for this task.
from sklearn.datasets import load_iris
from sklearn import model_selection
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load data
X = load_iris().data
Y = pd.get_dummies(load_iris().target).to_numpy()  # as_matrix() was removed in pandas 1.0
X_tr, X_te, Y_tr, Y_te = model_selection.train_test_split(X,Y, test_size=0.3, stratify=Y.argmax(axis=1))
dims = X_tr.shape[1]
n_classes = Y_tr.shape[1]
# Autoencoder
encoding_dim = 2
# this is our input placeholder
input_data = tf.keras.Input(shape=(4,))
# "encoded" is the encoded representation of the input
encoded = tf.keras.layers.Dense(encoding_dim,
                                activation='relu',
                                )(input_data)
# "decoded" is the lossy reconstruction of the input
decoded = tf.keras.layers.Dense(4, activation='sigmoid')(encoded)
# this model maps an input to its reconstruction
autoencoder = tf.keras.models.Model(input_data, decoded)
# this model maps an input to its encoded representation
encoder = tf.keras.models.Model(input_data, encoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
network_training = autoencoder.fit(X_tr, X_tr,
                                   epochs=100,
                                   batch_size=5,
                                   shuffle=True,
                                   verbose=False,
                                   validation_data=(X_te, X_te))
# Plot data
embeddings = encoder.predict(X_te)
plt.scatter(embeddings[:,0], embeddings[:,1], c=Y_te.argmax(axis=1), edgecolor="black", linewidth=1)
[Figure: embedding after running the algorithm once]
[Figure: embedding after running the algorithm again]
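One plausible explanation, which nothing in the question confirms, is a "dead" ReLU unit in the 2-unit encoding layer. If a unit's pre-activation is negative for every sample, ReLU outputs 0 everywhere, the gradient through that unit is 0, and the dimension stays collapsed. The NumPy sketch below illustrates only this mechanism; the inputs and weights are hypothetical and not taken from the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-negative inputs, similar in range to the iris measurements.
X = rng.uniform(0.1, 8.0, size=(10, 4))

# Hypothetical encoder weights: unit 0 has positive weights, unit 1 has negative weights.
W = np.array([[0.5, -0.3],
              [0.2, -0.4],
              [0.4, -0.1],
              [0.1, -0.2]])
b = np.zeros(2)

pre_activation = X @ W + b
relu_codes = np.maximum(pre_activation, 0.0)  # what a ReLU encoding layer would output
linear_codes = pre_activation                 # what a linear encoding layer would output

# Spread of each embedding dimension: ReLU leaves the second one at exactly 0.
print(relu_codes.std(axis=0))
print(linear_codes.std(axis=0))
```

If this is what is happening, trying a linear (or otherwise non-saturating) activation on the encoding layer would be a reasonable experiment. Separately, a sigmoid output can only produce values between 0 and 1, so rescaling the inputs to that range before training may also be worth checking.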

Use RBF Kernel with Chi-squared distance metric in SVM

How can I achieve the task in the title? Is there a parameter in the RBF kernel that sets the distance metric to the chi-squared distance? I can see a chi2_kernel in the scikit-learn library.
Below is the code that I have written.
import numpy as np
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
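from sklearn.impute import SimpleImputer
from numpy import genfromtxt
from sklearn.metrics.pairwise import chi2_kernel
file_csv = 'dermatology.data.csv'
dataset = genfromtxt(file_csv, delimiter=',')
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer
# replaces it but always imputes column-wise (there is no axis argument)
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')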
dataset = imp.fit_transform(dataset)
target = dataset[:, [34]].flatten()
data = dataset[:, range(0,34)]
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3)
# TODO : willing to set chi-squared distance metric instead. How to do that ?
clf = svm.SVC(kernel='rbf', C=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(f1_score(y_test, y_pred, average="macro"))
print(precision_score(y_test, y_pred, average="macro"))
print(recall_score(y_test, y_pred, average="macro"))
Are you sure you want to compose rbf and chi2? Chi2 on its own defines a valid kernel, and all you have to do is
clf = svm.SVC(kernel=chi2_kernel, C=1)
since sklearn accepts functions as kernels (however, this will require O(N^2) memory and time). If you would like to compose the two, it is a bit more complex, and you will have to implement your own kernel to do that. For a bit more control (and other kernels), you might also try pykernels; however, it has no support for composing kernels yet.
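It may help to know that scikit-learn's chi2_kernel is already an exponentiated kernel: it computes exp(-gamma * sum((x - y)^2 / (x + y))), which has the same form as an RBF kernel with the squared Euclidean distance swapped for the chi-squared distance. The sketch below checks this against additive_chi2_kernel and fits an SVC with a chosen gamma. The random data, labels, and gamma value are hypothetical, and the features must be non-negative.

```python
from functools import partial

import numpy as np
from sklearn import svm
from sklearn.metrics.pairwise import additive_chi2_kernel, chi2_kernel

rng = np.random.default_rng(0)
X_demo = rng.random((80, 6))  # hypothetical data; chi2 kernels need non-negative features
y_demo = (X_demo[:, 0] > X_demo[:, 1]).astype(int)

gamma = 0.5

# additive_chi2_kernel returns the negated chi-squared distance, so exponentiating
# gamma times it reproduces chi2_kernel.
K_exp = chi2_kernel(X_demo, gamma=gamma)
K_from_distance = np.exp(gamma * additive_chi2_kernel(X_demo))
same = bool(np.allclose(K_exp, K_from_distance))
print(same)

# SVC accepts a callable kernel; partial fixes gamma for it.
clf = svm.SVC(kernel=partial(chi2_kernel, gamma=gamma), C=1)
clf.fit(X_demo, y_demo)
y_hat = clf.predict(X_demo)
```

Because the kernel is a Python callable, SVC evaluates the full kernel matrix, which is the O(N^2) cost mentioned above.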

Resources