I am trying to train a deep learning model for a regression problem. I have 2,000 significant categorical inputs, each of which has 3 categories. If I convert them to dummy variables, I will have 6,000 dummy variables as input to the deep learning model, which makes optimization very hard since the inputs (6,000 dummy variables) are not zero-centered. Also, the variance in each dummy variable is small, so 6,000 dummy variables will have a hard time explaining the variance in the output. I was wondering whether I should z-score the dummy variables to help optimization. Also, is there a better way to deal with these 2,000 categorical inputs?
You should use embeddings, which translate large sparse vectors into a lower-dimensional space that preserves semantic relationships. Each categorical feature then gets a dense vector representation.
Here is pseudocode using TensorFlow:
import numpy as np
import tensorflow as tf

# number of distinct categories in each column
unique_amount_1 = len(np.unique(col1))
input_1 = tf.keras.layers.Input(shape=(1,), name='input_1')
embedding_1 = tf.keras.layers.Embedding(unique_amount_1, 50, trainable=True)(input_1)
col1_embedding = tf.keras.layers.Flatten()(embedding_1)

unique_amount_2 = len(np.unique(col2))
input_2 = tf.keras.layers.Input(shape=(1,), name='input_2')
embedding_2 = tf.keras.layers.Embedding(unique_amount_2, 50, trainable=True)(input_2)
col2_embedding = tf.keras.layers.Flatten()(embedding_2)

combined = tf.keras.layers.concatenate([col1_embedding, col2_embedding])
result = tf.keras.layers.Dense(1)(combined)  # single regression output
model = tf.keras.Model(inputs=[input_1, input_2], outputs=result)
Here, 50 is the size of the embedding vector.
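With 2,000 categorical columns you probably don't want to write this out by hand. Below is a minimal sketch of building the same structure in a loop, assuming each column is already integer-encoded as 0, 1, or 2; the layer names, embedding_dim, and the hidden Dense layer are illustrative choices, not part of the original answer:
import numpy as np
import tensorflow as tf

n_columns = 2000
n_categories = 3
embedding_dim = 2  # each column has only 3 categories, so a tiny embedding is enough

inputs, embedded = [], []
for i in range(n_columns):
    inp = tf.keras.layers.Input(shape=(1,), name='input_{}'.format(i))
    emb = tf.keras.layers.Embedding(n_categories, embedding_dim)(inp)
    inputs.append(inp)
    embedded.append(tf.keras.layers.Flatten()(emb))

combined = tf.keras.layers.concatenate(embedded)
hidden = tf.keras.layers.Dense(128, activation='relu')(combined)
output = tf.keras.layers.Dense(1)(hidden)  # single regression target
model = tf.keras.Model(inputs=inputs, outputs=output)
model.compile(optimizer='adam', loss='mse')
With this many columns it can also be more practical to offset the codes of column i by 3*i, feed a single Input of shape (2000,), and use one shared Embedding with input_dim 6000; flattening its output gives the same concatenated representation with far fewer layers in the graph.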
I have a set of inputs with roughly 5,000 features whose values vary from 0.005 to 9,000,000. Each individual feature has values on a consistent scale (a feature with values around 10 will not also have values around 0.1).
I am trying to apply linear regression to this data set, however, the wide range of input values is inhibiting effective gradient descent.
What is the best way to handle this variance? If normalization is best, please include details on the best way to implement this normalization.
Thanks!
Simply perform normalization as a pre-processing step. You can do it as follows:
1) Calculate the mean of each feature in the training set and store it. Be careful not to confuse the feature mean with the sample mean; you should end up with a vector of size [number_of_features] (about 5,000).
2) Calculate the standard deviation of each feature in the training set and store it. This is also a vector of size [number_of_features].
3) Update each training and testing entry as:
updated = (original_vector - mean_vector)/ std_vector
That's it!
The code will look like:
import numpy as np

# train_data shape: [train_length, 5000]
# test_data shape:  [test_length, 5000]
mean = np.mean(train_data, axis=0)  # per-feature mean, shape [5000]
std = np.std(train_data, axis=0)    # per-feature std, shape [5000]
normalized_train_data = (train_data - mean) / std
normalized_test_data = (test_data - mean) / std
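Equivalently, if you are already using scikit-learn, StandardScaler does the same thing and stores the training means and standard deviations for you. A minimal sketch, assuming train_data and test_data are the arrays above:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
normalized_train_data = scaler.fit_transform(train_data)  # fit on the training set only
normalized_test_data = scaler.transform(test_data)        # reuse the training mean/std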
While working on the credit card fraud dataset on Kaggle (link), I found that I can get a better model if I reduce the size of the dataset used for training. To explain: the dataset is composed of 284,807 records with 31 features, and there are only 492 frauds (just 0.17%).
I tried running PCA on the full dataset, keeping only the 3 most important dimensions so I could display it. The result is the following:
In this plot, it's impossible to find a pattern that determines whether a record is a fraud or not.
If I reduce only the non-fraud part of the dataset to increase the fraud/non-fraud ratio, this is what I get with the same plot:
Now, I don't know whether it makes sense to fit a PCA on a reduced dataset in order to get a better decomposition. For example, if I use PCA with 100,000 points, we can say that all entries with PCA1 > 5 are frauds.
Here is the code if you want to try it:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

dataset = pd.read_csv("creditcard.csv")
sample_size = 284807-492 # between 1 and 284807-492
a = dataset[dataset["Class"] == 1] # always keep all frauds
b = dataset[dataset["Class"] == 0].sample(sample_size) # reduce non fraud qty
dataset = pd.concat([a, b]).sample(frac=1) # concat with a shuffle
# Scaling of features for the PCA
y = dataset["Class"]
X = dataset.drop("Class", axis=1)
X_scale = StandardScaler().fit_transform(X)
# Doing PCA on the dataset
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scale)
pca1, pca2, pca3, c = X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], y
plt.scatter(pca1, pca2, s=pca3, c=y)
plt.xlabel("PCA1")
plt.ylabel("PCA2")
plt.title("{}-points".format(sample_size))
# plt.savefig("{}-points".format(sample_size), dpi=600)
Thanks for your help,
It makes sense, definitely.
The technique you are using is commonly known as random undersampling, and in ML it is useful in general when you are dealing with imbalanced data problems (such as the one you are describing). You can read more about it on this Wikipedia page.
There are, of course, many other methods to deal with class imbalance, but the beauty of this one is that it is quite simple and, sometimes, really effective.
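If you would rather not do the undersampling by hand with pandas, here is a minimal sketch using the imbalanced-learn package, assuming it is installed; the sampling_strategy value and random_state are illustrative:
from imblearn.under_sampling import RandomUnderSampler

# keep every fraud, downsample the non-fraud class to a 1:10 minority/majority ratio
rus = RandomUnderSampler(sampling_strategy=0.1, random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)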
I know that a decision tree isn't affected by scaling the data, but when I scale the data within my decision tree pipeline I get poor performance (poor recall, precision, and accuracy).
But when I don't scale, the decision tree gives me excellent results on all performance metrics. How can this be?
Note: I use GridSearchCV but I don't think that the cross validation is the reason for my problem. Here is my code:
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import tree
from sklearn.model_selection import GridSearchCV

scaled = MinMaxScaler()
pca = PCA()
bestK = SelectKBest()
combined_transformers = FeatureUnion([("scale", scaled),
                                      ("best", bestK),
                                      ("pca", pca)])
clf = tree.DecisionTreeClassifier(class_weight="balanced")
pipeline = Pipeline([("features", combined_transformers), ("tree", clf)])
param_grid = dict(features__pca__n_components=[1, 2, 3],
                  features__best__k=[1, 2, 3],
                  tree__min_samples_split=[4, 5],
                  tree__max_depth=[4, 5])
grid_search = GridSearchCV(pipeline, param_grid=param_grid, scoring='f1')
grid_search.fit(features, labels)
With MinMaxScaler() in the pipeline, my performance is:
f1 = 0.837209302326
recall = 1.0
precision = 0.72
accuracy = 0.948148148148
But without scaling:
f1 = 0.918918918919
recall = 0.944444444444
precision = 0.894736842105
accuracy = 0.977777777778
I am not familiar with scikit-learn, so excuse me if I misunderstand something.
First of all, does PCA standardize features? If it does not, it will give different results for scaled and non-scaled input.
Second, due to the randomness in splitting the samples, CV may give different results on each run; this especially affects the results for small sample sizes. In addition, if your sample size is small, the two results may not actually be that different after all.
I have the following suggestions:
Scaling can be treated as an additional hyperparameter, which can be optimized by CV (see the sketch after this list).
Perform an extra CV (called nested CV) or hold-out to estimate performance. This is done by keeping a test set, selecting your model using CV on the training data, and then evaluating its performance on the test set (in the case of nested CV you do this repeatedly for all folds and average the performance estimates). Of course, your final model should be trained on the whole dataset. In general, you should not use the performance estimate of the CV used for model selection, as it will be overly optimistic.
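For the first suggestion, here is a minimal sketch of letting the grid search decide whether to scale at all, using a plain Pipeline. It assumes a scikit-learn version that accepts "passthrough" for a pipeline step; the grid values mirror the question:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([("scale", MinMaxScaler()),
                 ("tree", DecisionTreeClassifier(class_weight="balanced"))])
param_grid = {
    "scale": [MinMaxScaler(), "passthrough"],  # CV chooses scaled vs. unscaled
    "tree__max_depth": [4, 5],
    "tree__min_samples_split": [4, 5],
}
grid_search = GridSearchCV(pipe, param_grid=param_grid, scoring="f1")
grid_search.fit(features, labels)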
I'm working on implementing an interface between a TensorFlow basic LSTM that's already been trained and a JavaScript version that can be run in the browser. The problem is that in all of the literature I've read, LSTMs are modeled as mini-networks (using only connections, nodes, and gates), while TensorFlow seems to have a lot more going on.
The two questions that I have are:
Can the TensorFlow model be easily translated into a more conventional neural network structure?
Is there a practical way to map the trainable variables that TensorFlow gives you to this structure?
I can get the 'trainable variables' out of TensorFlow; the issue is that they appear to have only one bias value per LSTM node, whereas most of the models I've seen include separate biases for the memory cell, the inputs, and the output.
Internally, the LSTMCell class stores the LSTM weights as one big matrix instead of 8 smaller ones for efficiency. It is quite easy to split it horizontally and vertically to get the more conventional representation. However, it might be easier and more efficient if your library applies a similar optimization.
Here is the relevant piece of code of the BasicLSTMCell:
concat = linear([inputs, h], 4 * self._num_units, True)
# i = input_gate, j = new_input, f = forget_gate, o = output_gate
i, j, f, o = array_ops.split(1, 4, concat)
The linear function performs the matrix multiplication that transforms the concatenated input and the previous h state into 4 matrices of shape [batch_size, self._num_units]. This linear transformation uses the single weight matrix and bias variables you're referring to in the question. The result is then split into the different gates used by the LSTM transformation.
If you'd like to explicitly get the transformations for each gate, you can split that matrix and bias into 4 blocks. It is also quite easy to implement it from scratch using 4 or 8 linear transformations.
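For example, once you have pulled the big weight matrix and bias out as NumPy arrays, here is a minimal sketch of recovering the per-gate blocks. It assumes the i, j, f, o ordering shown above and variables named kernel, bias, and input_size; other LSTM implementations may order the gates differently:
import numpy as np

# kernel has shape [input_size + num_units, 4 * num_units]
# bias   has shape [4 * num_units]
W_i, W_j, W_f, W_o = np.split(kernel, 4, axis=1)  # one block of columns per gate
b_i, b_j, b_f, b_o = np.split(bias, 4)

# each gate's matrix can be split further into input weights and recurrent weights
W_i_x, W_i_h = W_i[:input_size, :], W_i[input_size:, :]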
I developed an image processing program that identifies which digit is shown in an image of numbers. Each image is 27x27 pixels = 729 pixels. I take each R, G, and B value, which means I have 2,187 variables per image (+1 for the intercept, for a total of 2,188).
I used the below gradient descent formula:
Repeat {
    θj := θj − (α/m) · ∑ (hθ(x) − y) · xj
}
Here θj is the coefficient on variable j, α is the learning rate, hθ(x) is the hypothesis, y is the true value, and xj is the value of variable j; m is the number of training examples. hθ(x), y, and xj are evaluated for each training example (that's what the summation is over). The hypothesis itself is defined as:
hθ(x) = 1 / (1 + e^(−z))
z = θ0 + θ1·x1 + θ2·x2 + θ3·x3 + … + θn·xn
With this, and 3,000 training images, I was able to train my program in just over an hour, and when tested on a cross-validation set it identified the correct digit ~67% of the time.
I wanted to improve that so I decided to attempt a polynomial of degree 2.
However, the number of variables jumps from 2,188 to 2,394,766 per image! It now takes me an hour just to do one step of gradient descent.
So my question is: how is this vast number of variables handled in machine learning? On the one hand, I don't have enough space to even hold that many variables for each training example. On the other hand, I am currently storing 2,188 variables per training sample, but I have to perform O(n²) work just to compute each variable multiplied by every other variable (i.e. the degree-2 polynomial terms).
So any suggestions / advice is greatly appreciated.
Try some dimensionality reduction first (PCA, kernel PCA, or LDA if you are classifying the images).
Vectorize your gradient descent; with most math libraries, or in MATLAB etc., it will run much faster (see the sketch after this list).
Parallelize the algorithm and run it on multiple CPUs (though your library for multiplying vectors may already support parallel computation).
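On the second point, here is a minimal sketch of the same update written in fully vectorized NumPy; X is the m×(n+1) design matrix with an intercept column, y is the 0/1 label vector, and the learning rate and iteration count are illustrative:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)        # hypothesis for all m examples at once
        grad = X.T @ (h - y) / m      # vectorized sum over the training examples
        theta -= alpha * grad         # simultaneous update of every theta_j
    return theta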
Adding to Jirka-x1's answer, I would first say that this is one of the key differences between working with image data and, say, text data in ML: high dimensionality.
Second... this is a duplicate, see How to approach machine learning problems with high dimensional input space?