Implementing sklearn PCA on limited number of variables in a pipeline - machine-learning

I'm setting up a machine learning pipeline to classify some data. One source of the data is a very good candidate for PCA and makes up the last n dimensions of the dataset. I would like to use PCA on these variables but not the preceding variables. From searching through stackexchange this also seems like a common issue faced --- that people want to apply PCA to just a portion of the data.
Obviously I could do the PCA first then concatenate the datasets and then pass that to the pipeline, but afaik the PCA should be part of the pipeline as otherwise information from test samples bleeds into the training data.
I want to use sklearn's PCA function (but am also open to suggestions) but it doesn't take any arguments to define which variables to do the PCA on and so it's difficult to incorporate into a pipeline.
My work around currently works by defining a new PCA which selects the desired features and looks like this:
class new_PCA(PCA):
#staticmethod
def reduce(X):
return X.iloc[:,NUMBER_NONPCA_FEATURES:]
# Define PCA for limited number of columns
def _fit(self, X):
part_X = self.reduce(X)
PCA._fit(self,part_X)
def transform(self, X):
part_X = self.reduce(X)
pca_part = PCA.transform(self, part_X)
X_new = np.concatenate([X.iloc[:,:NUMBER_NONPCA_FEATURES],pca_part],axis=1)
return X_new
def score_samples(self, X):
part_X = self.reduce(X)
PCA.score_samples(self,part_X)
def inverse_transform(self, X):
part_X = self.reduce(X)
PCA.inverse_transform(self,part_X)
So the base _fit() method and transform() are both restricted to the current variables. This seems to work and from inspecting the source code I think this covers all the methods which take the training data (X) as input.
I'm just slightly concerned that I may have overlooked something and that this may have some unintended consequences somewhere. Does this look OK?

Your workaround is not necessary, since this use case is already covered by sklearn. Different transformations for different features can be implemented by including a ColumnTransformer in the pipeline.
Consider the example below:
pipe = Pipeline(
[
('ct',
ColumnTransformer(
[("PCA", PCA(), [0, 1, 2]),
("pass", "passthrough", [3, 4, 5])])
),
('svc', SVC())
]
)
The ColumnTransformer stage of the pipeline can be seen as a junction, separating the columns coming in into subsets and applying the specified transformers to the passed columns (here passed as indices, but can be column names too).
In the example PCA is only applied to the columns with indices 0, 1 and 2, while the columns 3,4, and 5 are passed to the next stage of the pipeline without any transformation.

Related

Nominal and cyclical features encoding should be done before or after feature selection?

In a dataset I'm using to learn machine learning I have beside many others, one nominal and two cyclical features like bellow:
Location: "Orlando", "NewYork", "LosAngeles"...
Date: "2012-01-25", "2010-08-06", "2016-11-30"...
WinDir: "N", "S", "NW"...
Currently I'm at the feature engineering step of the pipeline and after that I'm gonna make a feature selection. The idea is using chi-squared or maybe mutual information statistic test for these features because the label is binary.
Here the transformations I would do to the mentioned features:
Location: OrdinalEncoder() and after that LeaveOneOutEncoder()
Date: pd.to_datetime() and split these dates in Year, Day, Month new features. After that I would do a cyclical encoding(sin/cos technique) in Day and Month.
WinDir: Same cyclical encoding as before with date
Well, here is my problem. I believe after these transformations(cyclical and leaveoneout) the features will loose the properties needed for the tests I intend to make. So, I was wondering doing a basic transformation making an OrdinalEncoding in "Location" and "WinDir" first, make the feature selection, and if I decide to keep these features make the cyclic encoding and LeaveOneOutEncoder after.
What do you think about this? Any suggestion?
Don't know if you need but here the code for cyclic encoding
class CatCyclicEncoder(BaseEstimator, TransformerMixin):
def __init__(self):
pass
def fit(self, X, y=None):
return self
def transform(self, X, y=None):
X_aux = []
cols = range(X.shape[1])
for index in cols:
column = X[:, index]
max_value = column.max()
sin_values = [math.sin((2 * math.pi * x) / max_value) for x in list(column)]
cos_values = [math.cos((2 * math.pi * x) / max_value) for x in list(column)]
X_aux.append(sin_values)
X_aux.append(cos_values)
X_encoded = np.array(X_aux).transpose()
return X_encoded
If you are using the features to train machine learning models, then, you need to apply the feature selection method to the features that will actually be used to train the model.
Thus, if you encode the categorical variables and the cyclical features, and plan to use those to train the model, then, you need to apply the selection after the encoding.
If the features do not fit the criteria for the selection tests after the transformation, then you can use different selection tests. More info on feature selection here and here.
There is also an open-source package that automatically creates cyclical features.

Specifying class or sample weights in Keras for one-hot encoded labels in a TF Dataset

I am trying to train an image classifier on an unbalanced training set. In order to cope with the class imbalance, I want either to weight the classes or the individual samples. Weighting the classes does not seem to work. And somehow for my setup I was not able to find a way to specify the samples weights. Below you can read how I load and encode the training data and the two approaches that I tried.
Training data loading and encoding
My training data is stored in a directory structure where each image is place in the subfolder corresponding to its class (I have 32 classes in total). Since the training data is too big too all load at once into memory I make use of image_dataset_from_directory and by that describe the data in a TF Dataset:
train_ds = keras.preprocessing.image_dataset_from_directory (training_data_dir,
batch_size=batch_size,
image_size=img_size,
label_mode='categorical')
I use label_mode 'categorical', so that the labels are described as a one-hot encoded vector.
I then prefetch the data:
train_ds = train_ds.prefetch(buffer_size=buffer_size)
Approach 1: specifying class weights
In this approach I try to specify the class weights of the classes via the class_weight argument of fit:
model.fit(
train_ds, epochs=epochs, callbacks=callbacks, validation_data=val_ds,
class_weight=class_weights
)
For each class we compute weight which are inversely proportional to the number of training samples for that class. This is done as follows (this is done before the train_ds.prefetch() call described above):
class_num_training_samples = {}
for f in train_ds.file_paths:
class_name = f.split('/')[-2]
if class_name in class_num_training_samples:
class_num_training_samples[class_name] += 1
else:
class_num_training_samples[class_name] = 1
max_class_samples = max(class_num_training_samples.values())
class_weights = {}
for i in range(0, len(train_ds.class_names)):
class_weights[i] = max_class_samples/class_num_training_samples[train_ds.class_names[i]]
What I am not sure about is whether this solution works, because the keras documentation does not specify the keys for the class_weights dictionary in case the labels are one-hot encoded.
I tried training the network this way but found out that the weights did not have a real influence on the resulting network: when I looked at the distribution of predicted classes for each individual class then I could recognize the distribution of the overall training set, where for each class the prediction of the dominant classes is most likely.
Running the same training without any class weight specified led to similar results.
So I suspect that the weights don't seem to have an influence in my case.
Is this because specifying class weights does not work for one-hot encoded labels, or is this because I am probably doing something else wrong (in the code I did not show here)?
Approach 2: specifying sample weight
As an attempt to come up with a different (in my opinion less elegant) solution I wanted to specify the individual sample weights via the sample_weight argument of the fit method. However from the documentation I find:
[...] This argument is not supported when x is a dataset, generator, or keras.utils.Sequence instance, instead provide the sample_weights as the third element of x.
Which is indeed the case in my setup where train_ds is a dataset. Now I really having trouble finding documentation from which I can derive how I can modify train_ds, such that it has a third element with the weight. I thought using the map method of a dataset can be useful, but the solution I came up with is apparently not valid:
train_ds = train_ds.map(lambda img, label: (img, label, class_weights[np.argmax(label)]))
Does anyone have a solution that may work in combination with a dataset loaded by image_dataset_from_directory?

Using sample_weights with fit_generator()

Inside an autoregressive continuous problem, when the zeros take too much place, it is possible to treat the situation as a zero-inflated problem (i.e. ZIB). In other words, instead of working to fit f(x), we want to fit g(x)*f(x) where f(x) is the function we want to approximate, i.e. y, and g(x) is a function which output a value between 0 and 1 depending if a value is zero or non-zero.
Currently, I have two models. One model which gives me g(x) and another model which fits g(x)*f(x).
The first model gives me a set of weights. This is where I need your help. I can use the sample_weights arguments with model.fit(). As I work with tremendous amount of data, then I need to work with model.fit_generator(). However, fit_generator() does not have the argument sample_weights.
Is there a work around to work with sample_weights inside fit_generator()? Otherwise, how can I fit g(x)*f(x) knowing that I have already a trained model for g(x)?
You can provide sample weights as the third element of the tuple returned by the generator. From Keras documentation on fit_generator:
generator: A generator or an instance of Sequence (keras.utils.Sequence) object in order to avoid duplicate data when using multiprocessing. The output of the generator must be either
a tuple (inputs, targets)
a tuple (inputs, targets, sample_weights).
Update: Here is a rough sketch of a generator that returns the input samples and targets as well as the sample weights obtained from model g(x):
def gen(args):
while True:
for i in range(num_batches):
# get the i-th batch data
inputs = ...
targets = ...
# get the sample weights
weights = g.predict(inputs)
yield inputs, targets, weights
model.fit_generator(gen(args), steps_per_epoch=num_batches, ...)

cnn max pooling - non consecutive sliding window (skip gram like)?

When using keras to build a simple cnn like the code below and when it is used on text-based problems such as document classification, I understand that this is as if we are extracting 4-grams from the text (kernel_size of 4) and use them as features.
model = Sequential()
model.add(embedding_layer)
model.add(Conv1D(filters=100, kernel_size=4, padding='same', activation='relu'))
model.add(MaxPooling1D(pool_size=4))
model.add(Dense(4, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
and in this case, the kernel size in the conv1D layer is like a sliding window of size 4 that walks over sequences of tokens in the text to emit 4-grams.
I wonder if here is a way such that we can create 'non-consecutive sliding window in the convolution, i.e., that would generate 'skip-gram' equivalent. So for example, given the following 1d vector:
[a, b, c, d, e, f]
a conv1d with a kernel_size=3 skip=1 will scan the following sequences:
[(a,c,d),(b,d,e),(c,e,f),(d,f,padding),(e,padding,padding)] union [(a,b,d),(b,c,e),(c,d,f),(d,e,padding),(e,f,padding),(f,padding,padding)]
The reason I say 'union' is simply because I suppose from the implementation point of view, it may be easier to generate either part 1 or part 2, giving another parameter for the revised conv1d layer. and if thhat's the case and doable, I can work around this by concatenating multiple layers. But the minimum is really to have an extended conv1d layer that would take additional parameters such that it does either the first or the second part of scanning.
The idea is not new as this paper already experimented it: http://www.aclweb.org/anthology/D/D16/D16-1085.pdf
But excuse my lack of in-depth knowledge of keras I do not know how to implement it. Any suggestions please,
Many thanks in advance
You can do this creating a custom convolutional layer where certain elements in the weight matrix are zero.
You can take the regular Conv1D layer as the base class.
But before doing this, notice that you can create a "dilated" convolution by passing the dilation_rate=n parameter when creating a regular convolutional layer. This will skip n-1 grams between each taken gram in the window. Your window will have fixed regular spaces.
Creating a custom layer for that:
import keras.backend as K
#a 1D convolution that skips some entries
class SkipConv1D(Conv1D):
#in the init, let's just add a parameter to tell which grams to skip
def __init__(self, validGrams, **kwargs):
#for this example, I'm assuming validGrams is a list
#it should contain zeros and ones, where 0's go on the skip positions
#example: [1,1,0,1] will skip the third gram in the window of 4 grams
assert len(validGrams) == kwargs.get('kernel_size')
self.validGrams = K.reshape(K.constant(validGrams),(len(validGrams),1,1))
#the chosen shape matches the dimensions of the kernel
#the first dimension is the kernel size, the others are input and ouptut channels
#initialize the regular conv layer:
super(SkipConv1D,self).__init__(**kwargs)
#here, the filters, size, etc, go inside kwargs, so you should use them named
#but you may make them explicit in this __init__ definition
#if you think it's more comfortable to use it like this
#in the build method, let's replace the original kernel:
def build(self, input_shape):
#build as the original layer:
super(SkipConv1D,self).build(input_shape)
#replace the kernel
self.originalKernel = self.kernel
self.kernel = self.validGrams * self.originalKernel
Be aware of some things that weren't taken care of in this answer:
The method get_weights() will still return the original kernel, not the kernel with the skipped mask. (It's possible to fix this, but there will be an extra work, if necessary, please tell me)
There are unused weights in this layer. This is a simple implementation. The focus here was to keep it the most similar possible to an existing Conv layer, with all its features. It's also possible to use only strictly necessary weights, but this will increase the complexity a lot, and require lots of rewriting of the keras original code for recreating all the original possibilities.
If your kernel_size is too long, it will be very boring to define the validGrams var. You may want to create a version that takes some skipped indices and then converts it in the type of list used above.
Different channels skipping different grams:
It's possible to do this inside a layer as well, if instead of using a validGrams with shape (length,), you use one with shape (length,outputFilters).
In this case, at the point where we create the validGrams matrix, we should reshape it like:
validGrams = np.asarray(validGrams)
shp = (validGrams.shape[0],1,validGrams.shape[1])
validGrams = validGrams.reshape(shp)
self.validGrams = K.constant(validGrams)
You can also simply use many parallel SkipConv1D with different parameters and then concatenate their results.
inputs = Input(yourInputShape)
out = embedding_layer(inputs)
out1 = SkipConv1D(filters=50,kernel_size=4,validGrams=[1,0,1,1])(out)
out2 = SkipConv1D(filters=50,kernel_size=4,validGrams=[1,1,0,1])(out)
out = Concatenate()([out1,out2]) #if using 'channels_first' use Concatenate(axis=1)
out = MaxPooling1D(pool_size=4)(out)
out = Dense(4, activation='softmax')(out)
model = Model(inputs,out)

How to split documents into training set and test set?

I am trying to build a classification model. I have 1000 text documents in local folder. I want to divide them into training set and test set with a split ratio of 70:30(70 -> Training and 30 -> Test) What is the better approach to do so? I am using python.
I wanted a approach programatically to split the training set and test set. First to read the files in local directory. Second, to build a list of those files and shuffle them. Thirdly to split them into a training set and test set.
I tried a few ways by using built in python keywords and functions only to fail. Lastly I got the idea of approaching it. Also Cross-validation is a good option to be considered for the building general classification models.
Not sure exactly what you're after, so I'll try to be comprehensive. There will be a few steps:
Get a list of the files
Randomize the files
Split files into training and testing sets
Do the thing
1. Get a list of the files
Let's assume that your files all have the extension .data and they're all in the folder /ml/data/. What we want to do is get a list of all of these files. This is done simply with the os module. I'm assuming you have no subdirectories; this would change if there were.
import os
def get_file_list_from_dir(datadir):
all_files = os.listdir(os.path.abspath(datadir))
data_files = list(filter(lambda file: file.endswith('.data'), all_files))
return data_files
So if we were to call get_file_list_from_dir('/ml/data'), we would get back a list of all the .data files in that directory (equivalent in the shell to the glob /ml/data/*.data).
2. Randomize the files
We don't want the sampling to be predictable, as that is considered a poor way to train an ML classifier.
from random import shuffle
def randomize_files(file_list):
shuffle(file_list)
Note that random.shuffle performs an in-place shuffling, so it modifies the existing list. (Of course this function is rather silly since you could just call shuffle instead of randomize_files; you can write this into another function to make it make more sense.)
3. Split files into training and testing sets
I'll assume a 70:30 ratio instead of any specific number of documents. So:
from math import floor
def get_training_and_testing_sets(file_list):
split = 0.7
split_index = floor(len(file_list) * split)
training = file_list[:split_index]
testing = file_list[split_index:]
return training, testing
4. Do the thing
This is the step where you open each file and do your training and testing. I'll leave this to you!
Cross-Validation
Out of curiosity, have you considered using cross-validation? This is a method of splitting your data so that you use every document for training and testing. You can customize how many documents are used for training in each "fold". I could go more into depth on this if you like, but I won't if you don't want to do it.
Edit: Alright, since you requested I will explain this a little bit more.
So we have a 1000-document set of data. The idea of cross-validation is that you can use all of it for both training and testing — just not at once. We split the dataset into what we call "folds". The number of folds determines the size of the training and testing sets at any given point in time.
Let's say we want a 10-fold cross-validation system. This means that the training and testing algorithms will run ten times. The first time will train on documents 1-100 and test on 101-1000. The second fold will train on 101-200 and test on 1-100 and 201-1000.
If we did, say, a 40-fold CV system, the first fold would train on document 1-25 and test on 26-1000, the second fold would train on 26-40 and test on 1-25 and 51-1000, and on.
To implement such a system, we would still need to do steps (1) and (2) from above, but step (3) would be different. Instead of splitting into just two sets (one for training, one for testing), we could turn the function into a generator — a function which we can iterate through like a list.
def cross_validate(data_files, folds):
if len(data_files) % folds != 0:
raise ValueError(
"invalid number of folds ({}) for the number of "
"documents ({})".format(folds, len(data_files))
)
fold_size = len(data_files) // folds
for split_index in range(0, len(data_files), fold_size):
training = data_files[split_index:split_index + fold_size]
testing = data_files[:split_index] + data_files[split_index + fold_size:]
yield training, testing
That yield keyword at the end is what makes this a generator. To use it, you would use it like so:
def ml_function(datadir, num_folds):
data_files = get_file_list_from_dir(datadir)
randomize_files(data_files)
for train_set, test_set in cross_validate(data_files, num_folds):
do_ml_training(train_set)
do_ml_testing(test_set)
Again, it's up to you to implement the actual functionality of your ML system.
As a disclaimer, I'm no expert by any means, haha. But let me know if you have any questions about anything I've written here!
that's quite simple if you use numpy, first load the documents and make them a numpy array, and then:
import numpy as np
docs = np.array([
'one', 'two', 'three', 'four', 'five',
'six', 'seven', 'eight', 'nine', 'ten',
])
idx = np.hstack((np.ones(7), np.zeros(3))) # generate indices
np.random.shuffle(idx) # shuffle to make training data and test data random
train = docs[idx == 1]
test = docs[idx == 0]
print(train)
print(test)
the result:
['one' 'two' 'three' 'six' 'eight' 'nine' 'ten']
['four' 'five' 'seven']
Just make a list of the filenames using os.listdir(). Use collections.shuffle() to shuffle the list, and then training_files = filenames[:700] and testing_files = filenames[700:]
You can use train_test_split method provided by sklearn. See documentation here:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Resources