Why does sklearn Imputer need to fit? - machine-learning

I'm really new to this whole machine learning thing and I'm taking an online course on the subject. In this course, the instructors showed the following piece of code:
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
I don't really get why this imputer object needs to fit. I mean, I'm just trying to get rid of missing values in my columns by replacing them with the column mean. From the little I know about programming, this is a pretty simple, iterative procedure and wouldn't require a model that has to train on data to be accomplished.
Can someone please explain how this imputer thing works and why it requires training to replace some missing values with the column mean?
I have read scikit-learn's documentation, but it just shows how to use the methods, not why they're required.
Thank you.

The Imputer fills missing values with some statistics (e.g. mean, median, ...) of the data.
To avoid data leakage during cross-validation, it computes the statistic on the training data during fit, stores it, and then applies it to the test data during transform.
import numpy as np
from sklearn.preprocessing import Imputer

obj = Imputer(strategy='mean')
obj.fit([[1, 2, 3], [2, 3, 4]])
print(obj.statistics_)
# array([ 1.5,  2.5,  3.5])
X = obj.transform([[4, np.nan, 6], [5, 6, np.nan]])
print(X)
# array([[ 4. ,  2.5,  6. ],
#        [ 5. ,  6. ,  3.5]])
You can do both steps in one call with fit_transform, if you want to fit and transform the same data.
X = obj.fit_transform([[1, 2, np.nan], [2, 3, 4]])
print(X)
# array([[ 1.,  2.,  4.],
#        [ 2.,  3.,  4.]])
This data leakage issue is important: the data distribution may change between the training data and the test data, and you don't want information from the test data to be present during the fit.
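For example, here is a minimal sketch of the leakage-safe pattern using SimpleImputer (which replaces Imputer in newer scikit-learn versions) inside a Pipeline, so the mean is recomputed on each training fold only; the tiny dataset is made up for illustration:
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# made-up data with missing entries
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [2.0, np.nan],
              [3.0, 5.0], [np.nan, 4.0], [6.0, 1.0], [5.0, 2.0]])
y = np.array([0, 0, 1, 0, 1, 1, 1, 0])

# the pipeline re-fits the imputer on each training fold, so a fold's test
# data never influences the imputed means
pipe = make_pipeline(SimpleImputer(strategy='mean'), LogisticRegression())
print(cross_val_score(pipe, X, y, cv=2))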
See the doc for more information about cross-validation.

Related

How to pass the cv argument for RandomizedSearchCV in sklearn if we have train, test, validate [duplicate]

Similar to Custom cross validation split sklearn, I want to define my own splits for GridSearchCV, for which I need to customize the built-in cross-validation iterator.
I want to pass my own set of train/test indices for cross-validation to the GridSearch instead of letting the iterator determine them for me. I went through the available cv iterators on the sklearn documentation page but couldn't find it.
For example, I want to implement something like this.
The data has 9 samples.
For 2-fold CV I create my own set of training/testing indices:
>>> train_indices = [[1, 3, 5, 7, 9], [2, 4, 6, 8]]   # 1st fold, 2nd fold
>>> test_indices  = [[2, 4, 6, 8], [1, 3, 5, 7, 9]]   # 1st fold, 2nd fold
>>> custom_cv = sklearn.cross_validation.customcv(train_indices,test_indices)
>>> clf = GridSearchCV(X,y,params,cv=custom_cv)
What can be used to work like customcv?
Actually, cross-validation iterators are just that: iterators. They yield a (train, test) tuple of fold indices at each iteration. This should then work for you:
custom_cv = zip(train_indices, test_indices)
Also, for the specific case you are mentioning, you can do
import numpy as np
labels = np.arange(0, 10) % 2
from sklearn.cross_validation import LeaveOneLabelOut
cv = LeaveOneLabelOut(labels)
Observe that list(cv) yields
[(array([1, 3, 5, 7, 9]), array([0, 2, 4, 6, 8])),
 (array([0, 2, 4, 6, 8]), array([1, 3, 5, 7, 9]))]
Note that the above solution returns each row as a fold; what one really needs is:
[(train_indices, test_indices)]   # for one fold
[(train_indices, test_indices),   # 1st fold
 (train_indices, test_indices)]   # 2nd fold, etc.
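In recent scikit-learn versions (where sklearn.cross_validation has been replaced by sklearn.model_selection), the cv argument of GridSearchCV directly accepts any iterable of (train, test) index arrays, so you can simply pass a list of such tuples. A minimal sketch, where the estimator, parameter grid and data are illustrative placeholders:
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X = np.random.rand(9, 4)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0])

# 0-based indices for the 9 samples; the two folds swap train/test roles
custom_cv = [
    (np.array([0, 2, 4, 6, 8]), np.array([1, 3, 5, 7])),  # 1st fold
    (np.array([1, 3, 5, 7]), np.array([0, 2, 4, 6, 8])),  # 2nd fold
]

clf = GridSearchCV(SVC(), param_grid={'C': [0.1, 1, 10]}, cv=custom_cv)
clf.fit(X, y)
print(clf.best_params_)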

How should I split my data for cross validation and grid search?

Should I split my data into two parts of similar size and use one half for each task, or should I do grid search on my whole data and then just do cross-validation again on my whole data to check my accuracy?
You need to split the data into train and test sets (80:20, e.g. with train_test_split in sklearn), then fit the model on the train data and check the accuracy. If it's not what you expect, you can try hyperparameter tuning.
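For instance, a minimal sketch of that initial split (the names X and y are assumed to hold your features and target):
from sklearn.model_selection import train_test_split

# hold out 20% of the data as the final test set
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=0)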
You can do the tuning with GridSearchCV, where you need to pass the desired estimator (depending on the type of problem) and the parameter values to search over.
Here is some sample code:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Create the parameter grid based on the results of random search
param_grid = {
    'bootstrap': [True],
    'max_depth': [50, 55, 60, 65],
    'max_features': ["auto", "sqrt", 2, 3],
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [2, 3, 4],
    'n_estimators': [60, 65, 70, 75]
}

# the estimator being tuned (assumed here to be a random forest regressor)
rfcv = RandomForestRegressor(random_state=0)
grid_search = GridSearchCV(estimator=rfcv, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, Y_train)
grid_search.best_params_
Based on the best parameter results, you can fine-tune the grid search.
E.g., if the best value for n_estimators is near 60, then change the candidate values so they surround 60, like [50, 55, 60, 65], to narrow down the exact value.
Then build the machine learning model with the best parameter values, evaluate the accuracy on the train data, and predict the result on the test data.
rf = RandomForestRegressor(n_estimators=70, random_state=0, min_samples_split=2,
                           min_samples_leaf=1, max_features='sqrt',
                           bootstrap=True, max_depth=65)
regressor = rf.fit(X_train, Y_train)
pred_tuned = regressor.predict(X_test)
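To quantify the result on the held-out test set, a quick check with regression metrics (a sketch, assuming the regression setup above):
from sklearn.metrics import mean_absolute_error, r2_score

print("MAE:", mean_absolute_error(Y_test, pred_tuned))
print("R^2:", r2_score(Y_test, pred_tuned))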
You should then see an improvement in your accuracy!

Constructing discrete table-based CPDs in tensorflow-probability?

I'm trying to construct the simplest example of Bayesian network with several discrete random variables and conditional probabilities (the "Student Network" from Koller's book, see 1)
Although a bit unwieldy, I managed to build this network using pymc3. In particular, creating the CPDs is not that straightforward in pymc3; see the snippet below:
import pymc3 as pm
import theano.tensor
...
with pm.Model() as basic_model:
    # parameters for categorical are indexed as [0, 1, 2, ...]
    difficulty = pm.Categorical(name='difficulty', p=[0.6, 0.4])
    intelligence = pm.Categorical(name='intelligence', p=[0.7, 0.3])
    grade = pm.Categorical(name='grade',
                           p=pm.math.switch(
                               theano.tensor.eq(intelligence, 0),
                               pm.math.switch(
                                   theano.tensor.eq(difficulty, 0),
                                   [0.3, 0.4, 0.3],    # I=0, D=0
                                   [0.05, 0.25, 0.7]   # I=0, D=1
                               ),
                               pm.math.switch(
                                   theano.tensor.eq(difficulty, 0),
                                   [0.9, 0.08, 0.02],  # I=1, D=0
                                   [0.5, 0.3, 0.2]     # I=1, D=1
                               )
                           ))
    letter = pm.Categorical(name='letter', p=pm.math.switch(
        ...
But I have no idea how to build this network using tensorflow-probability (versions: tfp-nightly==0.7.0.dev20190517, tf-nightly-2.0-preview==2.0.0.dev20190517).
For the unconditioned binary variables, one can use a categorical distribution, such as:
from tensorflow_probability import distributions as tfd
from tensorflow_probability import edward2 as ed

difficulty = ed.RandomVariable(
    tfd.Categorical(
        probs=[0.6, 0.4],
        name='difficulty'
    )
)
But how to construct the CPDs?
There are a few classes/methods in tensorflow-probability that might be relevant (in tensorflow_probability/python/distributions/deterministic.py or the deprecated ConditionalDistribution), but the documentation is rather sparse (one needs a deep understanding of tfp).
--- Updated question ---
Chris' answer is a good starting point. However, things are still a bit unclear even for a very simple two-variable model.
This works nicely:
jdn = tfd.JointDistributionNamed(dict(
    dist_x=tfd.Categorical([0.2, 0.8], validate_args=True),
    dist_y=lambda dist_x: tfd.Bernoulli(
        probs=tf.gather([0.1, 0.9], indices=dist_x), validate_args=True)
))
print(jdn.sample(10))
but this one fails
jdn = tfd.JointDistributionNamed(dict(
    dist_x=tfd.Categorical([0.2, 0.8], validate_args=True),
    dist_y=lambda dist_x: tfd.Categorical(
        probs=tf.gather_nd([[0.1, 0.9], [0.5, 0.5]], indices=[dist_x]))
))
print(jdn.sample(10))
(I'm trying to model the categorical explicitly in the second example just for learning purposes.)
-- Update: solved ---
Obviously, the last example wrongly used tf.gather_nd instead of tf.gather, as we only wanted to select the first or the second row based on the dist_x outcome. This code works now:
jdn = tfd.JointDistributionNamed(dict(
    dist_x=tfd.Categorical([0.2, 0.8], validate_args=True),
    dist_y=lambda dist_x: tfd.Categorical(
        probs=tf.gather([[0.1, 0.9], [0.5, 0.5]], indices=[dist_x]))
))
print(jdn.sample(10))
The tricky thing about this, and presumably the reason it's subtler than expected in PyMC, is -- as with almost everything in vectorized programming -- handling shapes.
In TF/TFP, the (IMO) nicest way to solve this is with one of the new TFP JointDistribution{Sequential,Named,Coroutine} classes. These let you naturally represent hierarchical PGM models, and then sample from them, evaluate log probs, etc.
I whipped up a colab notebook demoing all 3 approaches, for the full student network: https://colab.research.google.com/drive/1D2VZ3OE6tp5pHTsnOAf_7nZZZ74GTeex
Note the crucial use of tf.gather and tf.gather_nd to manage the vectorization of the various binary and categorical switching.
Have a look and let me know if you have any questions!
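As a small follow-up, here is a sketch (not from the original post) of sampling and evaluating joint log-probabilities on the two-variable model from the update above; it passes indices=dist_x directly to tf.gather, which keeps the batch shape of the conditional probs simpler, and uses the explicit probs= argument:
import tensorflow as tf
from tensorflow_probability import distributions as tfd

jdn = tfd.JointDistributionNamed(dict(
    dist_x=tfd.Categorical(probs=[0.2, 0.8]),
    dist_y=lambda dist_x: tfd.Categorical(
        probs=tf.gather([[0.1, 0.9], [0.5, 0.5]], indices=dist_x))
))

samples = jdn.sample(10)      # dict with keys 'dist_x' and 'dist_y'
print(jdn.log_prob(samples))  # joint log-probability of each of the 10 samples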

How to "remember" categorical encodings for actual predictions after training?

Suppose I wanted to train a machine learning algorithm on some dataset that includes categorical parameters. (I'm new to machine learning, but my thinking is...) Even if I converted all the categorical data to 1-hot-encoded vectors, how will this encoding map be "remembered" after training?
E.g., converting the initial dataset to use 1-hot encoding before training: say the universe of categories for some column c is {"good", "bad", "ok"}, so convert rows to
[1, 2, "good"] ---> [1, 2, [1, 0, 0]],
[3, 4, "bad"] ---> [3, 4, [0, 1, 0]],
...
Then, after training the model, all future prediction inputs would need to use the same encoding scheme for column c.
How then, during future predictions, will data inputs remember that mapping (where "good" maps to index 0, etc.), specifically when planning to use a Keras RNN or LSTM model? Do I need to save it somewhere (e.g. with a Python pickle), and if so, how do I get the explicit mapping? Or is there a way to have the model automatically handle categorical inputs internally, so I can just feed in the original label data during training and future use?
If anything in this question shows any serious confusion on my part about something, please let me know (again, very new to ML).
** Wasn't sure if this belongs on https://stats.stackexchange.com/, but I posted it here since I specifically wanted to know how to deal with the actual code implementation of this problem.
What I've been doing is the following:
after you use StringIndexer.fit(), you can save its metadata (which includes the actual encoder mapping, like "good" being the first column).
This is the code I use (Java, but it can be adapted to Python):
StringIndexerModel sim = new StringIndexer()
        .setInputCol(field)
        .setOutputCol(field + "_INDEX")
        .setHandleInvalid("skip")
        .fit(dataset);
sim.write().overwrite().save("IndexMappingModels/" + field + "_INDEX");
and later, when trying to make predictions on a new dataset, you can load the stored metadata:
StringIndexerModel sim = StringIndexerModel.load("IndexMappingModels/" + field + "_INDEX");
dataset = sim.transform(dataset);
I imagine you have already solved this issue, since it was posted in 2018, but I've not found this solution anywhere else, so I believe it's worth sharing.
My thought would be to do something like this on the training/testing dataset D (using a mix of Python and plain pseudo-code):
Do something like
# Before: D.schema == {num_col_1: int, cat_col_1: str, cat_col_2: str, ...}
# assign a unique index to each distinct label of a categorical column and store it in a new column
# http://spark.apache.org/docs/latest/ml-features.html#stringindexer
label_indexer = StringIndexer(inputCol="cat_col_i", outputCol="cat_col_i_index").fit(D)
D = label_indexer.transform(D)
# After: D.schema == {num_col_1: int, cat_col_1: str, cat_col_2: str, ..., cat_col_1_index: int, cat_col_2_index: int, ...}
for all the categorical columns
Then, for all of these categorical name and index columns in D, make a map of the form
map = {}
for all categorical column names colname in D:
    map[colname] = []
    # create a mapping list of all categorical values for the column
    # see https://spark.apache.org/docs/latest/sql-programming-guide.html#untyped-dataset-operations-aka-dataframe-operations
    for all rows r in D.select(colname, '%s_index' % colname).drop_duplicates():
        enc_from = r['%s' % colname]
        enc_to = r['%s_index' % colname]
        map[colname].append((enc_from, enc_to))
    # for categories that may appear later but have not yet been seen
    # (IDK if this is best practice; there may be another way, see https://medium.com/@vaibhavshukla182/how-to-solve-mismatch-in-train-and-test-set-after-categorical-encoding-8320ed03552f)
    map[colname].append(('NOVEL_CAT', len(map[colname])))
    # sort by index encoding
    map[colname].sort(key=lambda pair: pair[1])
to end up with something like
{
    'cat_col_1': [('orig_label_11', 0), ('orig_label_12', 1), ...],
    'cat_col_2': [(), (), ...],
    ...
    'cat_col_n': [('orig_label_n1', 0), ...]
}
which can then be used to generate 1-hot-encoded vectors for each categorical column in any later data sample row ds. E.g.:
for all categorical column names colname in ds:
    enc_from = ds[colname]
    # make a zero vector for the category's 1-hot encoding
    col_onehot = [0] * len(map[colname])
    for label, index in map[colname]:
        if label == enc_from:
            col_onehot[index] = 1
            # make a new column in the sample for the 1-hot vector
            ds['%s_onehot' % colname] = col_onehot
            break
You can then save this structure as a pickle, e.g. pickle.dump(map, open("cats_map.pkl", "wb")), and use it to encode the categorical column values when making actual predictions later.
** There may be a better way, but I think I would need to better understand this article first (https://medium.com/@satnalikamayank12/on-learning-embeddings-for-categorical-data-using-keras-165ff2773fc9). Will update the answer if I learn anything.
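Since the question mentions Keras/Python specifically, here is a minimal sketch of the same idea with plain scikit-learn (my own illustration, not part of the answers above): fit an encoder on the training categories, persist it to disk, and reload it at prediction time. The column values and file names are made up.
import joblib
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train_cats = np.array([["good"], ["bad"], ["ok"], ["good"]])

# handle_unknown='ignore' encodes unseen categories as an all-zero vector
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_cats)
joblib.dump(encoder, "cat_col_encoder.joblib")   # save alongside the model

# ... later, at prediction time ...
encoder = joblib.load("cat_col_encoder.joblib")
print(encoder.categories_)                                   # the learned mapping
print(encoder.transform([["good"], ["unseen"]]).toarray())   # same encoding as training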

TensorFlow Classification Using Dataset

I need to utilize TensorFlow for a project to classify items based on their attributes to a certain class (either 1, 2, or 3).
Only problem is almost every TF tutorial or example I find online is about image recognition or text classification. I can't find anything about classification based on numbers. I guess what I'm asking for is where to get started. If anyone knows of a relevant example, or if I'm just thinking about this completely wrong.
We are given 13 attributes for each item and need to use a TF neural network to classify each item correctly (or mark the margin of error). But nothing online shows me even how to start with this kind of dataset.
Example of dataset: (first value is class, other values are attributes)
2, 11.84, 2.89, 2.23, 18, 112, 1.72, 1.32, 0.43, 0.95, 2.65, 0.96, 2.52, 500
3, 13.69, 3.26, 2.54, 20, 107, 1.83, 0.56, 0.5, 0.8, 5.88, 0.96, 1.82, 680
3, 13.84, 4.12, 2.38, 19.5, 89, 1.8, 0.83, 0.48, 1.56, 9.01, 0.57, 1.64, 480
2, 11.56, 2.05, 3.23, 28.5, 119, 3.18, 5.08, 0.47, 1.87, 6, 0.93, 3.69, 465
1, 14.06, 1.63, 2.28, 16, 126, 3, 3.17, 0.24, 2.1, 5.65, 1.09, 3.71, 780
Suppose you have the data in a file, data.txt. You can use Numpy to read this (adding delimiter=',' since the values are comma-separated):
import numpy as np
xy = np.loadtxt('data.txt', delimiter=',', unpack=True, dtype='float32')
x_data = xy[1:]
y_data = xy[0]
More information: http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.loadtxt.html
Perhaps you may need np.transpose, depending on the shape of your weights and operations.
x_data = np.transpose(xy[1:])
Then, use placeholders and feed_dict to train/test your model:
X = tf.placeholder("float", ...
Y = tf.placeholder("float", ...
....
with tf.Session() as sess:
    ....
    sess.run(optimizer, feed_dict={X: x_data, Y: y_data})
For this kind of problem TensorFlow has an in-depth tutorial here,
and there is one on Towards Data Science here.
If you're looking for videos to start with, I think sentdex's tutorials on the Titanic dataset
are what you're looking for, although he uses k-means to do the classification
(actually I think his entire deep learning/machine learning playlist is great to start with);
you can find it here.
Otherwise, if you're looking for a basic way to start:
First, preprocessing:
try separating the data into class labels and inputs first (the pandas lib should be able to help you with this);
make your class labels into a one-hot array.
Then normalize the data:
it looks like your different data attributes have wildly different ranges, so make sure to get them all into the same range, between 0 and 1.
Build your model:
a simple fully connected net should do the trick;
remember to make the output layer the same size as the number of classes you have;
use an argmax function on the output of the final layer to decide which class the model thinks is the proper classification (see the sketch below).
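A minimal sketch of that recipe (my own illustration, not from the original answer), using tf.keras on the 13-attribute data; the file name and layer sizes are assumptions:
import numpy as np
import tensorflow as tf

# load the comma-separated data; first value is the class, the rest are attributes
data = np.loadtxt('data.txt', delimiter=',', dtype='float32')
y = data[:, 0].astype(int) - 1          # classes 1/2/3 -> 0/1/2
x = data[:, 1:]

# scale each attribute into the [0, 1] range
x = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
y_onehot = tf.keras.utils.to_categorical(y, num_classes=3)

# simple fully connected net with one output unit per class
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation='relu', input_shape=(13,)),
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x, y_onehot, epochs=50, batch_size=8, verbose=0)

# argmax over the softmax output gives the predicted class (back to 1/2/3)
preds = np.argmax(model.predict(x), axis=1) + 1
print(preds[:5])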
