Data Preparation for training - machine-learning

I am trying to prepare a data file by one-hot encoding lines of characters, which I can later use to train a classification model. My training data file consists of lines of characters; I first integer-encode them and then one-hot encode them.
For example, this is how the data file looks:
afafalkjfalkfalfjalfjalfjafajfaflajflajflajfajflajflajfjaljfafj
fgtfafadargggagagagagagavcacacacarewrtgwgjfjqiufqfjfqnmfhbqvcqvfqfqafaf
fqiuhqqhfqfqfihhhhqeqrqtqpocckfmafaflkkljlfabadakdpodqpqrqjdmcoqeijfqfjqfjoqfjoqgtggsgsgqr
This is how I am approaching it:
import pandas as pd
from sklearn import preprocessing
categorical_data = pd.read_csv('abc.txt', sep="\n", header=None)
labelEncoder = preprocessing.LabelEncoder()
X = categorical_data.apply(labelEncoder.fit_transform)
print("Afer label encoder")
print(X.head())
oneHotEncoder = preprocessing.OneHotEncoder()
oneHotEncoder.fit(X)
onehotlabels = oneHotEncoder.transform(X).toarray()
print("Shape after one hot encoding:", onehotlabels.shape)
print(onehotlabels)
I am getting an integer encoding for each line (0, 1, 2 in my case) and then the corresponding one-hot encoded vector.
My question is: how do I do this for each character of an individual line? For prediction, the model should learn from the characters in one line (which corresponds to a certain label). Can someone give me some insight on how to proceed from here?

Given your example I end up with a DataFrame like so:
0
0 0
1 1
2 2
From your description it sounds like you want each line to have its own independent one-hot encoding. So let's first look at line 1.
afafalkjfalkfalfjalfjalfjafajfaflajflajflajfajflajflajfjaljfafj
The reason you are getting the DataFrame shown above is that the whole line is read into the DataFrame and then passed to the labelEncoder and oneHotEncoder as a single value, not as an array of 63 values (the length of the string).
What you really want to do is pass the labelEncoder an array of size 63.
import numpy as np

# Split the first line into an array of individual characters
data = np.array([let for let in categorical_data[0][0]])
X = labelEncoder.fit_transform(data)
oneHotEncoder.fit(X.reshape(-1,1))
row_1_labels = oneHotEncoder.transform(X.reshape(-1,1)).toarray()
row_1_labels
array([[ 1., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 1., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 0., 1., 0.],
[ 0., 1., 0., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 0., 1., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 0., 1., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 0., 1., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 1., 0., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 0., 0., 1., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 1.],
[ 0., 0., 1., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 1., 0., 0., 0., 0.],
[ 0., 1., 0., 0., 0.],
[ 0., 0., 1., 0., 0.]])
You could repeat this for each row to get the independent one hot encodings. Like so:
one_hot_encodings = categorical_data.apply(lambda x: [oneHotEncoder.fit_transform(labelEncoder.fit_transform(np.array([let for let in x[0]])).reshape(-1,1)).toarray()], axis=1)
one_hot_encodings
0
0 [[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0....
1 [[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0,...
2 [[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0,...
If you wanted the rows to be one hot encoded based on the values found in all rows you would just first fit the labelEncoder to all of the unique letters and then do the transformations for each row. Like so:
unique_letters = np.unique(np.array([let for row in categorical_data.values for let in row[0]]))
labelEncoder.fit(unique_letters)
unique_nums = labelEncoder.transform(unique_letters)
oneHotEncoder.fit(unique_nums.reshape(-1,1))
cat_dat = categorical_data.apply(lambda x: [np.array([let for let in x[0]])], axis=1)
one_hot_encoded = cat_dat.apply(lambda x: [oneHotEncoder.transform(labelEncoder.transform(x[0]).reshape(-1,1)).toarray()], axis=1)
one_hot_encoded
0
0 [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
1 [[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0,...
2 [[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0,...
This will return you a DataFrame with each row containing the one hot encoded array of letters based on the letters from all rows.
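As a side note, if your scikit-learn version is 0.20 or newer (an assumption about your environment), OneHotEncoder accepts string categories directly, so the LabelEncoder step can be dropped. A minimal sketch of the shared-vocabulary variant along those lines:
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

categorical_data = pd.read_csv('abc.txt', sep="\n", header=None)

# Build the character vocabulary from every line so all rows share one encoding
all_chars = np.array([ch for line in categorical_data[0] for ch in line]).reshape(-1, 1)
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
encoder.fit(all_chars)

# One array per line, with one one-hot row per character
one_hot_per_line = [
    encoder.transform(np.array(list(line)).reshape(-1, 1))
    for line in categorical_data[0]
]
print(one_hot_per_line[0].shape)  # (line length, number of unique characters)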

Related

tf.one_hot returning array filled with 0's

I am new to deep learning, and I am trying to one-hot encode this tensor:
E = tf.constant(np.random.randint(1,100,size = 10))
E
Output
<tf.Tensor: shape=(10,), dtype=int64, numpy=array([48, 85, 75, 25, 28, 49, 3, 51, 47, 96])>
After using tf.one_hot() the returned array is
tf.one_hot(E,depth=10)
Output
<tf.Tensor: shape=(10, 10), dtype=float32, numpy=
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)>
In the returned tensor, most of the one-hot encoded values are
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]
As I understand it, one-hot encoding should produce a unique vector for each value, but here many values map to the same all-zero vector. Why is that?
If your random integers are in the range 1-100, you need to set depth=100 in the one_hot function. Alternatively, you can reduce the range of the random integers to 1-10.
The idea is that if your numbers range between 1 and 100, the one-hot encoding needs that many positions to represent them, which you haven't provided by restricting the depth to 10; any index greater than or equal to the depth is encoded as an all-zero vector.
import tensorflow as tf
import numpy as np
E = tf.constant(np.random.randint(1,10,size = 10))
tf.one_hot(E,depth=10)
Output:
<tf.Tensor: shape=(10, 10), dtype=float32, numpy=
array([[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]], dtype=float32)>
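The first option, keeping the original 1-100 range and widening the depth instead, would look roughly like this (a sketch, not part of the original answer):
import tensorflow as tf
import numpy as np

E = tf.constant(np.random.randint(1, 100, size=10))
# depth=100 covers every value that randint(1, 100) can produce,
# so no row collapses to the all-zero vector
one_hot = tf.one_hot(E, depth=100)
print(one_hot.shape)  # (10, 100)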

One hot encoding increases sizes of target data

I have multiclass data; the label column y contains the following values:
print(y.unique())
[5 6 7 4 8 3 9]
In this case the number of classes is 7 (for the deep learning model), but when I do one-hot encoding like this:
import keras
from keras.utils import np_utils
y_train =np_utils.to_categorical(y_train)
y_test =np_utils.to_categorical(y_test)
the dimension increases to 10:
print(y_train.shape) : (4547, 10)
Maybe this is because the labels go up to 9 and (0, 1, 2) are also counted (even though they do not appear in the original data). How can I fix this issue?
The function tf.keras.utils.to_categorical requires the inputs to be "integers from 0 to num_classes" (see the documentation). You have a set of labels {3, 4, 5, 6, 7, 8, 9}. That is a total of seven labels, which start at the value 3. To transform this to a set of labels in [0, 7), one can subtract 3 from each label.
y_ints = y - 3
The result can be passed to tf.keras.utils.to_categorical.
import numpy as np
import tensorflow as tf
y = np.array([3, 4, 5, 6, 7, 8, 9])
y_ints = y - 3 # [0, 1, 2, 3, 4, 5, 6]
tf.keras.utils.to_categorical(y_ints)
and output is
array([[1., 0., 0., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 0., 1.]], dtype=float32)
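To go back from the one-hot predictions to the original label values, take the argmax and add the offset again. A small sketch, under the same assumption that the labels are the consecutive integers 3 through 9:
import numpy as np
import tensorflow as tf

y = np.array([3, 4, 5, 6, 7, 8, 9])
one_hot = tf.keras.utils.to_categorical(y - 3)
# argmax recovers the 0-based class index; adding 3 restores the original label
recovered = np.argmax(one_hot, axis=1) + 3
print(recovered)  # [3 4 5 6 7 8 9]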
Another option is to use scikit-learn's extensive preprocessing methods, in particular sklearn.preprocessing.OneHotEncoder.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
y = np.array([3, 4, 5, 6, 7, 8, 9])
y = y.reshape(-1, 1) # reshape to (n_samples, n_labels).
encoder = OneHotEncoder(sparse=False, dtype="float32")
encoder.fit_transform(y)
The output is
array([[1., 0., 0., 0., 0., 0., 0.],
[0., 1., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 0., 0., 0., 1.]], dtype=float32)
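One advantage of the OneHotEncoder route is that the encoder remembers the original label values, so predictions can be mapped back without any manual offset. A short sketch mirroring the snippet above:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

y = np.array([3, 4, 5, 6, 7, 8, 9]).reshape(-1, 1)
encoder = OneHotEncoder(sparse=False, dtype="float32")
one_hot = encoder.fit_transform(y)

# inverse_transform maps one-hot rows back to the stored categories,
# which works even when the labels are not consecutive integers
print(encoder.inverse_transform(one_hot).ravel())  # [3 4 5 6 7 8 9]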

How does cropping an image affect camera calibration intrinsics?

I have the following camera matrices for resolution 1600x1300
M1 [3x3] =
[ 1.3964689860209282e+03, 0., 8.3190541322575655e+02,
0., 1.3964689860209282e+03, 5.9990987893769318e+02,
0., 0., 1. ]
D1 [1x14] =
[ 8.0832142609575899e-02, -8.0503813500794497e-02, -1.3722038479715831e-03, -6.9032844088890799e-04, 0., 0., 0., 0., 0., 0., 0., 0., 0., 0. ]
I need to change the resolution to 1280x720, but this resolution is cropped (not resized). I understand that I have to update cx and cy. Do the distortion coefficients change after the cropping operation?
No change, provided you adjust the (cx, cy) coordinates of the principal point to its new location in the cropped image. This is because the focal length does not change, and the nonlinear distortion model implemented by OpenCV is referenced to the principal point.
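A minimal sketch of the bookkeeping, assuming the 1280x720 window is cut out starting at a pixel offset (x0, y0) of the original 1600x1300 image (the offsets below are hypothetical placeholders for your actual crop):
import numpy as np

# Original intrinsics for 1600x1300
M1 = np.array([[1396.4689860209282, 0.0, 831.90541322575655],
               [0.0, 1396.4689860209282, 599.90987893769318],
               [0.0, 0.0, 1.0]])
D1 = np.array([8.0832142609575899e-02, -8.0503813500794497e-02,
               -1.3722038479715831e-03, -6.9032844088890799e-04] + [0.0] * 10)

x0, y0 = 160, 290  # hypothetical top-left corner of a centered 1280x720 crop

M1_cropped = M1.copy()
M1_cropped[0, 2] -= x0   # cx shifts by the horizontal crop offset
M1_cropped[1, 2] -= y0   # cy shifts by the vertical crop offset
D1_cropped = D1.copy()   # distortion coefficients are unchanged by cropping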

Can't get simple binary classifier to work

I've written a simple binary classifier using TensorFlow. But the only result I get for the optimized variables are NaN. Here's the code:
import tensorflow as tf
# Input values
x = tf.range(0., 40.)
y = tf.constant([0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
1., 0., 0., 1., 0., 1., 0., 1., 1., 1.,
1., 1., 0., 1., 1., 1., 0., 1., 1., 1.,
1., 1., 1., 0., 1., 1., 1., 1., 1., 1.])
# Variables
m = tf.Variable(tf.random_normal([]))
b = tf.Variable(tf.random_normal([]))
# Model and cost
model = tf.nn.sigmoid(tf.add(tf.multiply(x, m), b))
cost = -1. * tf.reduce_sum(y * tf.log(model) + (1. - y) * (1. - tf.log(model)))
# Optimizer
learn_rate = 0.05
num_epochs = 20000
optimizer = tf.train.GradientDescentOptimizer(learn_rate).minimize(cost)
# Initialize variables
init = tf.global_variables_initializer()
# Launch session
with tf.Session() as sess:
    sess.run(init)
    # Fit all training data
    for epoch in range(num_epochs):
        sess.run(optimizer)
    # Display results
    print("m =", sess.run(m))
    print("b =", sess.run(b))
I've tried different optimizers, learning rates, and test sizes. But nothing seems to work. Any ideas?
You initialize m and b with standard deviation 1, but given your data x and y, you can expect m to be significantly smaller than 1. You can initialize b to zero (this is quite common for bias terms) and m with a much smaller standard deviation (for example 0.0005), and reduce the learning rate at the same time (for example to 0.00000005). You can delay the NaN values by changing these settings, but they will probably still occur eventually, since in my opinion your data is not well described by a linear function.
import tensorflow as tf
import matplotlib.pyplot as plt
# Input values
x = tf.range(0., 40.)
y = tf.constant([0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
1., 0., 0., 1., 0., 1., 0., 1., 1., 1.,
1., 1., 0., 1., 1., 1., 0., 1., 1., 1.,
1., 1., 1., 0., 1., 1.,
1., 1., 1., 1.])
# Variables
m = tf.Variable(tf.random_normal([], mean=0.0, stddev=0.0005))
b = tf.Variable(tf.zeros([]))
# Model and cost
model = tf.nn.sigmoid(tf.add(tf.multiply(x, m), b))
cost = -1. * tf.reduce_sum(y * tf.log(model) + (1. - y) * (1. - tf.log(model)))
# Optimizer
learn_rate = 0.00000005
num_epochs = 20000
optimizer = tf.train.GradientDescentOptimizer(learn_rate).minimize(cost)
# Initialize variables
init = tf.global_variables_initializer()
# Launch session
with tf.Session() as sess:
    sess.run(init)
    # Fit all training data
    for epoch in range(num_epochs):
        _, xs, ys = sess.run([optimizer, x, y])
    ms = sess.run(m)
    bs = sess.run(b)
print(ms, bs)
plt.plot(xs, ys)
plt.plot(xs, ms * xs + bs)
plt.savefig('tf_test.png')
plt.show()
plt.clf()
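As a side observation, separate from the initialization point above: the cost expression uses (1. - y) * (1. - tf.log(model)), while standard binary cross-entropy uses (1. - y) * tf.log(1. - model). A numerically stabler sketch in the same TF 1.x style is to keep the raw logits and let TensorFlow combine the sigmoid and the log:
import tensorflow as tf

x = tf.range(0., 40.)
y = tf.constant([0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
                 1., 0., 0., 1., 0., 1., 0., 1., 1., 1.,
                 1., 1., 0., 1., 1., 1., 0., 1., 1., 1.,
                 1., 1., 1., 0., 1., 1., 1., 1., 1., 1.])
m = tf.Variable(tf.zeros([]))
b = tf.Variable(tf.zeros([]))

logits = x * m + b
# sigmoid_cross_entropy_with_logits applies the sigmoid and the log together,
# which avoids log(0) and the resulting NaNs
cost = tf.reduce_sum(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))
optimizer = tf.train.GradientDescentOptimizer(0.001).minimize(cost)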

Samples with no label assignment using multilabel random forest in scikit-learn

I am using Scikit-Learn's RandomForestClassifier to predict multiple labels of documents. Each document has 50 features, no document has any missing features, and each document has at least one label associated with it.
clf = RandomForestClassifier(n_estimators=20).fit(X_train,y_train)
preds = clf.predict(X_test)
However, I have noticed that after prediction there are some samples that are assigned no labels, even though the samples were not missing label data.
>>> y_test[0,:]
array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
>>> preds[0,:]
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0.])
The results of predict_proba align with those of predict.
>>> probas = clf.predict_proba(X_test)
>>> for label in probas:
...     print (label[0][0], label[0][1])
(0.80000000000000004, 0.20000000000000001)
(0.94999999999999996, 0.050000000000000003)
(0.94999999999999996, 0.050000000000000003)
(1.0, 0.0)
(1.0, 0.0)
(1.0, 0.0)
(0.94999999999999996, 0.050000000000000003)
(0.90000000000000002, 0.10000000000000001)
(1.0, 0.0)
(1.0, 0.0)
(0.94999999999999996, 0.050000000000000003)
(1.0, 0.0)
(0.94999999999999996, 0.050000000000000003)
(0.84999999999999998, 0.14999999999999999)
(0.90000000000000002, 0.10000000000000001)
(0.90000000000000002, 0.10000000000000001)
(1.0, 0.0)
(0.59999999999999998, 0.40000000000000002)
(0.94999999999999996, 0.050000000000000003)
(0.94999999999999996, 0.050000000000000003)
(1.0, 0.0)
Each output above shows that for each label, a higher marginal probability has been assigned to the label not appearing. My understanding of decision trees was that at least one label has to be assigned to each sample when predicting, so this leaves me a bit confused.
Is it expected behavior for a multilabel decision tree / random forest to be able to assign no labels to a sample?
UPDATE 1
The features of each document are probabilities of belonging to a topic according to a topic model.
>>>X_train.shape
(99892L, 50L)
>>>X_train[3,:]
array([ 5.21079651e-01, 1.41085893e-06, 2.55158446e-03,
5.88421331e-04, 4.17571505e-06, 9.78104112e-03,
1.14105667e-03, 7.93964896e-04, 7.85177346e-03,
1.92635026e-03, 5.21080173e-07, 4.04680406e-04,
2.68261102e-04, 4.60332012e-04, 2.01803955e-03,
6.73533276e-03, 1.38491129e-03, 1.05682475e-02,
1.79368409e-02, 3.86488757e-03, 4.46729289e-04,
8.82488825e-05, 2.09428702e-03, 4.12810745e-02,
1.81651561e-03, 6.43641626e-03, 1.39687081e-03,
1.71262909e-03, 2.95181902e-04, 2.73045908e-03,
4.77474778e-02, 7.56948497e-03, 4.22549636e-03,
3.78891036e-03, 4.64685435e-03, 6.18710017e-03,
2.40424583e-02, 7.78131179e-03, 8.14288762e-03,
1.05162547e-02, 1.83166124e-02, 3.92332202e-03,
9.83870257e-03, 1.16684231e-02, 2.02723299e-02,
3.38977762e-03, 2.69966332e-02, 3.43221675e-02,
2.78571022e-02, 7.11067964e-02])
The label data was formatted using MultiLabelBinarizer and looks like:
>>>y_train.shape
(99892L, 21L)
>>>y_train[3,:]
array([0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
UPDATE 2
The output of predict_proba above suggested that the assignment of no classes might be an artifact of the trees voting on labels (there are 20 trees and all probabilities are approximately multiples of 0.05). However, using a single decision tree, I still find that some samples are assigned no labels. The output looks similar to the predict_proba output above, in that for each sample there is a probability that a given label is or is not assigned to it. This seems to suggest that at some point the decision tree turns the problem into binary classification, even though the documentation says that the tree takes advantage of label correlations.
This can happen if the train and test data are scaled differently, or otherwise drawn from different distributions (e.g., if the tree learned to split on values that occur in train but don't occur in test).
You could inspect the trees to try to get a better understanding of what's happening. To do this, look at the DecisionTreeClassifier instances in clf.estimators_ and visualize their .tree_ properties (for example, using sklearn.tree.export_graphviz())
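A minimal sketch of that kind of inspection (the feature names are hypothetical placeholders for the 50 topic proportions):
from sklearn.tree import export_graphviz

feature_names = ["topic_%d" % i for i in range(50)]

# Dump the first few trees of the forest to .dot files for viewing with Graphviz
for i, tree in enumerate(clf.estimators_[:3]):
    export_graphviz(tree, out_file="tree_%d.dot" % i,
                    feature_names=feature_names, filled=True)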
