How do you actually apply a trained model? - machine-learning

I've been slowly going through the TensorFlow tutorials, and I assume I will have to go through them again. I don't have a background in ML but am slowly working my way up.
Anyway, after reading through the RNN tutorial and running the training code, I am confused.
How does one actually apply the trained model so that it can be used to make language predictions?
I know this is a terribly noobish and simple question, but I believe it will be of use to others, as I have seen it asked and not answered in a satisfactory way.

In general, when you train a model, you first do a forward pass and then a backward pass. The forward pass makes a prediction based on your input data, and the backward pass adjusts your model based on how correct that prediction was.
So when you want to apply your model, you just do a forward pass with your new data as input.
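As a minimal sketch of what "just do a forward pass" means in TensorFlow 1.x-style code (the tiny model, tensor names and data below are made up for illustration and are not taken from the PTB tutorial):

import numpy as np
import tensorflow as tf

# Hypothetical tiny model; in practice you would rebuild your trained graph
# (or reload it from a checkpoint) instead of initializing fresh weights.
x = tf.placeholder(tf.float32, shape=[None, 4])
logits = tf.layers.dense(x, 3)      # the forward-pass graph
probs = tf.nn.softmax(logits)       # turn logits into class probabilities

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # replace with saver.restore(...) for a trained model
    new_data = np.random.rand(2, 4).astype(np.float32)
    predictions = sess.run(probs, feed_dict={x: new_data})  # inference = forward pass only
    print(predictions)

The point is that inference only fetches the prediction tensor; no loss, optimizer, or gradient ops are run.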
In your particular example, using this code, you can see how it's done by looking at how they run the test set, starting at line 286.
# They instantiate the model with is_training=False
mtest = PTBModel(is_training=False, config=eval_config)
# Then they can do a forward pass
test_perplexity = run_epoch(session, mtest, test_data, tf.no_op())
print("Test Perplexity: %.3f" % test_perplexity)
And if you want the actual prediction rather than the perplexity, it is the state returned inside the run_epoch function:
cost, state, _ = session.run([m.cost, m.final_state, eval_op],
                             {m.input_data: x,
                              m.targets: y,
                              m.initial_state: state})

Related

How to find the optimal learning rate, number of epochs & decay strategy in torch.optim.Adam?

I am working on a model trained on the MNIST dataset. I am using the torch.optim.Adam optimizer and have been experimenting with tuning the hyperparameters. After running a lot of tests, I have found a combination of hyperparameters that gives 90% accuracy. However, since I am new to this, I feel there might be a more efficient way to find the optimal values of the hyperparameters. The brute-force approach seems to depend on trial and error, and I was wondering if there is a particular strategy for finding these values.
Example of the code being used is:
if __name__ == '__main__':
    end = time.time()
    model_ft = Net().to(device)
    print(model_ft.network)
    criterion = nn.CrossEntropyLoss()
    optimizer_ft = optim.Adam(model_ft.parameters(), lr=1e-3)
    exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=9, gamma=0.5)
    history, accuracy = train_test(model_ft, criterion, optimizer_ft, exp_lr_scheduler,
                                   num_epochs=15)
Here I would like to find the optimal values of:
Learning Rate
Step Size
Gamma
Number of Epochs
Any help is much appreciated!
It seems a similar question has already been answered in depth.
However, in short, you can use something called Grid Search. With Grid Search, you set the values you want to try for each hyperparameter, and then Grid Search will try every combination. This link shows how to do it with PyTorch.
The following Medium Post goes more in-depth about other methods and packages to try, but I think you should start with a simple grid search.
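As a minimal sketch of a plain grid search, assuming the Net, device, criterion, optim, lr_scheduler and train_test names from your code above are already defined and importable (the value grids themselves are arbitrary examples):

import itertools

grid = {
    "lr": [1e-2, 1e-3, 1e-4],
    "step_size": [5, 9, 15],
    "gamma": [0.1, 0.5],
    "num_epochs": [10, 15],
}

best_acc, best_params = 0.0, None
for lr, step_size, gamma, num_epochs in itertools.product(*grid.values()):
    model = Net().to(device)                                    # fresh model per combination
    optimizer = optim.Adam(model.parameters(), lr=lr)
    scheduler = lr_scheduler.StepLR(optimizer, step_size=step_size, gamma=gamma)
    _, accuracy = train_test(model, criterion, optimizer, scheduler, num_epochs=num_epochs)
    if accuracy > best_acc:
        best_acc = accuracy
        best_params = {"lr": lr, "step_size": step_size, "gamma": gamma, "num_epochs": num_epochs}

print(best_acc, best_params)

Keep in mind that the number of training runs is the product of all grid sizes (3 x 3 x 2 x 2 = 36 here), so keep the grids small or switch to random search when that becomes too expensive.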

How can I implement a custom loss function that takes into account multiple predictions of the network?

I am currently implementing a CNN with a custom error function.
The problem I am trying to solve is physics-based, so I can calculate the maximal achievable precision, or to put it another way, I know the best possible (i.e. minimal) standard deviation I can achieve. Those best possible precisions are calculated during the generation of the training data using the Cramér-Rao lower bound (CRLB).
Right now, my error function looks something like this (in Keras):
def customLoss(yTrue, yPred):
    STD = yTrue[:, 10:20]
    yTrue = yTrue[:, 0:10]
    dev = K.mean(K.abs(K.abs(yTrue - yPred) - STD))
    return dev
In this case, I have 10 parameters I want to estimate, and therefore 10 CRLBs. I put the CRLBs in the target vector just to be able to handle them in the error function.
Now to my question. This method works, but it is not what I want. The problem is that the error is calculated from a single prediction of the network, but to be correct the network would have to predict the same dataset/batch multiple times. That way I would be able to see the standard deviation of the predictions and use it to calculate the error (I'm using a Bayesian CNN).
Does someone have an idea how to implement such a function in Keras or TensorFlow (I would also not mind switching to PyTorch)?
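The question is left open here, but as a rough sketch of the "predict the same batch multiple times" idea, assuming a tf.keras (TF 2.x) model whose stochasticity comes from dropout kept active at prediction time (MC dropout); the toy model, shapes and CRLB values below are all placeholders:

import numpy as np
import tensorflow as tf

# Toy stand-in for the Bayesian CNN: a dropout model used as an MC-dropout approximation.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10),            # 10 parameters to estimate
])

x_batch = np.random.rand(32, 20).astype("float32")
crlb = np.random.rand(32, 10).astype("float32")    # placeholder for the CRLB targets

# Run the same batch T times with dropout active; the spread of the outputs
# approximates the prediction standard deviation.
T = 50
samples = np.stack([model(x_batch, training=True).numpy() for _ in range(T)])
pred_std = samples.std(axis=0)                     # shape (32, 10)

# Penalise the gap between the observed spread and the best achievable one.
loss = np.mean(np.abs(pred_std - crlb))
print(loss)

To train with such a quantity you would have to keep the repeated forward passes inside the differentiable graph (e.g. tf.stack over repeated calls in a custom training loop) rather than compute it in NumPy as above.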

How do the initial k-means points work in BigQuery ML?

I'm using BigQuery for machine learning, more specifically the k-means method on an unlabeled dataset where I'm trying to find clusters.
I'd like to know if someone has discovered how BigQuery ML initializes the centroids.
I already tried looking at the documentation, but either there is nothing or I couldn't find it.
CREATE MODEL `project.dataset.model_name`
OPTIONS(
  model_type = "kmeans",
  num_clusters = 3,
  distance_type = "euclidean",
  early_stop = TRUE,
  max_iterations = 20,
  standardize_features = TRUE)
AS
(SELECT * FROM `project.dataset.sample_date_to_train`)
The results differ a little each time I run it.
Has someone experience with that subject?
For someone who is still looking for an answer: recently there has been an update to BigQuery ML on this topic. Two new parameters have been added to the CREATE MODEL statement, i.e.:
KMEANS_INIT_METHOD
KMEANS_INIT_COL
Basically, you can set your own K observations (belonging to the data table) that will serve as initial centroids for your K-means algorithm. You can find the relevant documentation at this link. Maybe it's not the most exciting solution to your problem, but it's still something you can work with if you need reproducibility.
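A small sketch of what that could look like from Python using the google-cloud-bigquery client; the project, dataset and the is_init_centroid column are hypothetical placeholders, with the column assumed to be a BOOL flag marking the K rows to use as starting centroids (see the documentation linked above for the exact option values):

from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project are configured

sql = """
CREATE OR REPLACE MODEL `project.dataset.model_name`
OPTIONS(
  model_type = 'kmeans',
  num_clusters = 3,
  distance_type = 'euclidean',
  kmeans_init_method = 'CUSTOM',
  kmeans_init_col = 'is_init_centroid'
) AS
SELECT * FROM `project.dataset.sample_date_to_train`
"""

client.query(sql).result()  # blocks until the training job finishes

With fixed initial centroids, repeated training runs on the same data should give reproducible clusters.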
If I had to guess, it probably uses logic similar to TensorFlow's (BQML might be using TF under the hood as it is). Random partitioning seems to be the TensorFlow default, so that would be my guess.
The reason you are seeing different results each time you train the model is the random nature of the initial values assigned to the centroids. The K-means algorithm begins by randomly selecting a value (position) for each of the k centroids. If you review this documentation, it explains the exact process used by the K-means algorithm.

Continuous Regression in Machine Learning

Suppose we have a set of inputs (named x1, x2, ..., xn) that give us the output y. The goal is to predict y from some values of x1... xn that have not been seen yet. It's clear to me that this problem can be modelled as a Regression problem in the realm of Machine Learning.
However, let's say data keep coming in. I'm able to predict y from x1... xn. Furthermore, I'm able to check afterwards whether or not that prediction was a good one. If it was a good one, everything is fine. On the other hand, I would like to update my model in case the prediction deviates a lot from the real y. The only way I can see to do this is to insert the new data into my training set and train the regression algorithm again. Two problems arise from that. First, it may cost more than I can afford to recompute my model from scratch from time to time. Second, I may already have so much data in my training set that the newly arriving data is negligible. However, the new data might be more important than the older data due to the nature of my problem.
It seems that a good solution would be to compute a kind of continuous regression that is more related to the new data than to the older data. I have searched for such an approach but have not found anything relevant. Perhaps I'm looking in the wrong direction. Does anyone have a clue on how to do it?
If you want to consider the newer data more important, you have to use weights. In scikit-learn (if you use this library), this is usually called sample_weight in the fit() function.
Weights can be defined as 1 / (time passed since the observation).
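A minimal sketch of that idea with scikit-learn (the data and the age-based weighting scheme are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy data: 100 observations, the last rows are the most recent ones.
X = rng.random((100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(100)

age = np.arange(100, 0, -1)    # how long ago each observation arrived
weights = 1.0 / age            # newer observations get larger weights

model = LinearRegression()
model.fit(X, y, sample_weight=weights)
print(model.coef_)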
Now about the second problem. If the recalculation takes too much time, you can cut down your observations and use only the latest ones. Fit your model on the whole data and on the fresh data + some part of the old data, and check how much your weights change. I suppose that if you really have a dependence between {x_i} and {y}, you don't need the whole dataset.
Otherwise you can use weights again, but now you will weight the weights of the models themselves:
model for old data: w1*x1 + w2*x2 + ...
model for new data: ~w1*x1 + ~w2*x2 + ...
common model: (w1*a1_1 + ~w1*a1_2)*x1 + (w2*a2_1 + ~w2*a2_2)*x2 + ...
Here a1_1, a2_1 are the weights for the 'old' model, a1_2, a2_2 for the new one; w1, w2 are the coefficients of the old model, ~w1, ~w2 of the new one.
The parameters {a} can be estimated as in the first part (by hand), but you can also create another linear model to estimate them. My advice, though: don't use non-linear regression for {a}; you will overfit.
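As a tiny numeric sketch of that blending idea (all values are made up; a_old and a_new play the role of the per-feature a1_1/a2_1 and a1_2/a2_2 above):

import numpy as np

# Coefficients of a linear model fitted on old data and on new data (toy values).
w_old = np.array([1.2, -0.7, 3.1])
w_new = np.array([1.0, -0.9, 3.4])

# Per-feature blending parameters {a}; chosen by hand here, favouring the new model.
a_old = np.array([0.3, 0.3, 0.3])
a_new = np.array([0.7, 0.7, 0.7])

w_common = a_old * w_old + a_new * w_new   # (w1*a1_1 + ~w1*a1_2), ...

x = np.array([0.5, 1.5, -2.0])
print(w_common, w_common @ x)              # blended coefficients and a prediction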

How to implement a sequence classification LSTM network in CNTK?

I'm working on an implementation of an LSTM neural network for sequence classification. I want to design a network with the following parameters:
Input: a sequence of n one-hot vectors.
Network topology: a two-layer LSTM network.
Output: the probability that a given sequence belongs to a class (binary classification). I want to take into account only the last output from the second LSTM layer.
I need to implement this in CNTK, but I'm struggling because its documentation is not written very well. Can someone help me with that?
There is a sequence classification example that follows exactly what you're looking for.
The only difference is that it uses just a single LSTM layer. You can easily change this network to use multiple layers by changing:
LSTM_function = LSTMP_component_with_self_stabilization(
    embedding_function.output, LSTM_dim, cell_dim)[0]
to:
num_layers = 2  # for example
encoder_output = embedding_function.output
for i in range(0, num_layers):
    encoder_output = LSTMP_component_with_self_stabilization(
        encoder_output.output, LSTM_dim, cell_dim)
However, you'd be better served by using the new layers library. Then you can simply do this:
encoder_output = Stabilizer()(input_sequence)
for i in range(0, num_layers):
    encoder_output = Recurrence(LSTM(hidden_dim))(encoder_output.output)
Then, to get your final output that you'd put into a dense output layer, you can first do:
final_output = sequence.last(encoder_output)
and then
z = Dense(vocab_dim) (final_output)
Here you can find a straightforward approach; just add the additional layer like this:
Sequential([
    Recurrence(LSTM(hidden_dim), go_backwards=False),
    Recurrence(LSTM(hidden_dim), go_backwards=False),
    Dense(label_dim, activation=sigmoid)
])
train it, test it and apply it...
CNTK published a hands-on tutorial for language understanding that has an end-to-end recipe:
This hands-on lab shows how to implement a recurrent network to process text, for the Air Travel Information Services (ATIS) task of slot tagging (tag individual words to their respective classes, where the classes are provided as labels in the training data set). We will start with a straight-forward embedding of the words followed by a recurrent LSTM. This will then be extended to include neighboring words and run bidirectionally. Lastly, we will turn this system into an intent classifier.
I'm not familiar with CNTK. But since the question has been left unanswered for so long, perhaps I can offer some advice to help you with the implementation?
I'm not sure how experienced you are with these architectures, but before moving to CNTK (which seemingly has a less active community), I'd suggest looking at other popular frameworks (like Theano, TensorFlow, etc.).
For instance, a similar task in Theano is given here: the kyunghyuncho tutorials. Just look for "def lstm_layer" for the definitions.
A Torch example can be found in Karpathy's very popular tutorials.
Hope this helps a bit.
