CloudML Movielens recommender in production - google-cloud-dataflow

I have worked through the README file of the Movielens CloudML sample: https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/movielens
After the pre-processing step I'm left with a bunch of TFRecord files for training, evaluation, and prediction. I can take the prediction records and run the prediction successfully.
To use this recommender in production, how do I get the prediction record for a specific user? My hope was something along the lines of deploying a streaming Dataflow pipeline that produces the prediction records. How would I go about that?

Related

Big-query predict using sk-learn model

I created a scikit-learn model on my local machine and uploaded it to Google Cloud Storage. I then created a model and version in AI Platform from it, and online prediction works. Now I want to run batch prediction and store the results in BigQuery, so that the BigQuery table is updated every time I run a prediction.
Can someone suggest how to do this?
AI Platform does not support writing prediction results to BigQuery at the moment.
You can write the prediction results to BigQuery with Dataflow. There are two options here:
Create a Dataflow job that makes the predictions itself.
Create a Dataflow job that uses AI Platform to get the model's predictions. This would probably use online prediction.
In both cases you can define a BigQuery sink to insert new rows into your table.
Alternatively, you can use Cloud Functions to update a BigQuery table whenever a new file appears in GCS. This solution would look like:
Use gcloud to run the batch prediction (`gcloud ml-engine jobs submit prediction ... --output-path="gs://[My Bucket]/batch-predictions/"`).
Results are written to multiple files: gs://[My Bucket]/batch-predictions/prediction.results-*-of-NNNNN
A Cloud Function is triggered to parse the results and insert them into BigQuery. This Medium post explains how to set this up.
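The parsing step of such a Cloud Function could look roughly like the sketch below. It assumes each line of a results file is one JSON record; check the actual output of your job, as the exact shape depends on the model. The function name `parse_prediction_lines` and the columns `batch_id`, `line_number` and `prediction` are hypothetical, chosen only for illustration.

```python
import json


def parse_prediction_lines(lines, batch_id):
    """Turn newline-delimited JSON prediction output into row dictionaries."""
    rows = []
    for line_number, line in enumerate(lines):
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        rows.append({
            "batch_id": batch_id,              # hypothetical column naming the job
            "line_number": line_number,        # position of the record in the file
            "prediction": json.dumps(record),  # raw prediction kept as a JSON string
        })
    return rows


# Assumed sample content of one prediction.results-* file.
sample = ['{"prediction": 1}', '', '{"prediction": 0}']
rows = parse_prediction_lines(sample, batch_id="job-001")
```

Inside the function, the resulting list of dictionaries could then be handed to the BigQuery Python client (for example its `insert_rows_json` method), provided the target table has matching columns.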

Relationship between the number of runs in tensorboard and the configuration of google cloud machine learning job

When I use TensorBoard to show the data, I find that there is more than one curve. I think this is related to the configuration. Could someone tell me what each curve represents?
This is not related in any way to Cloud ML Engine. You can find all the configurable parameters for the Engine in the docs for its REST API (training input, training output, prediction input, prediction output, model resource, version resource).
The curves in your TensorBoard are something you configured in your TensorFlow code, probably the training cost for several different runs, logged as a summary scalar with the name "train_cost".

Training job hangs when restoring parameters from a Tensorflow ckpt file

I'm trying to train a YOLO model with the PASCAL VOC dataset using Tensorflow on Google Cloud ML Engine but my training job keeps hanging at "Restoring parameters from /root.../yolo_tiny.ckpt".
Is there a reason for this? I've been waiting for the past 4-5 hours.

machine learning in GATE tool

After running the machine learning algorithm (SVM) on training data using the GATE tool, I would like to test it on test data. My questions are: should I test on the same data I trained on? And how can the model extract entities from the test data when the test data is not annotated with the annotations that were learnt from the training data?
I followed the tutorial at this link http://gate.ac.uk/sale/talks/gate-course-may11/track-3/module-11-machine-learning/module-11.pdf but the end was a bit confusing, where it talks about splitting the dataset into training and testing.
In GATE you have 3 modes of the machine learning PR - for training, evaluation and application.
When you train, the ML PR looks at the selected annotation (let's say Token), collects its features and learns the target class (i.e. Person, Mention or whatever). Using the example docs, the ML PR creates a model which holds values for features and basically "learns" how to classify new Tokens (or sentences, or other annotations).
When testing, you provide the ML PR only the Tokens with all their features. Then the ML PR uses them as input for its model and decides if or what Mention to create. The ML PR actually needs everything that was there in the training corpus, except the label / target class / mention - the decision that should be made.
I think the GATE ML PR ignores the labels when in test mode, so it's not crucial to remove them.
Evaluation is a helpful option in which training and testing are done automatically: the corpus is split and the results are presented. It splits the corpus in two, trains on one part, applies the model to the other, and compares the gold standard with what it labeled. It then repeats this with different splits.
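To make "repeat with different splits" concrete, here is a minimal sketch of one common scheme, k-fold splitting, in plain Python. It is not GATE's own code; the function name `k_fold_splits` and the document names are made up for illustration.

```python
def k_fold_splits(documents, k):
    """Yield (train, test) pairs; each document lands in the test part exactly once."""
    # Deal the documents into k folds round-robin.
    folds = [documents[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # Everything outside the current fold is used for training.
        train = [doc for j, fold in enumerate(folds) if j != i for doc in fold]
        yield train, test


# Hypothetical corpus of six documents, evaluated with three different splits.
docs = ["doc%d" % n for n in range(6)]
splits = list(k_fold_splits(docs, 3))
```

Each pair would be used for one train-and-apply round, and the per-round scores averaged at the end.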
The usual sequence is to train and evaluate, check results, fix, add features, etc. and when you're happy with the evaluation results, switch to application and run on data that doesn't have labels.
It is crucial that you run the same pre-processing when you're training and testing. For instance if in training you've run a POS tagger and you skip this when testing, the ML PR won't have the "Token.category" feature and will calculate very different results.
Now to your questions :)
NO! Don't use the same data for testing. That is a very common mistake; if you get suspiciously good results, first check whether you're doing that.
In the tutorial, when you split the corpus both parts will have all the annotations as before, so the ML PR will have all the features it needed. In real life, you'll have to run some pre-processing first as docs will come without tokens or anything.
Splitting in their case is done very simply: just save all docs to files, split the files into two folders, and load them as two corpora.
Hope this helps :)

how to train a classifier using video datasets

If I have a video dataset of a specific action, how could I use it to train a classifier that could later be used to classify this action?
The question is very generic. In general, there is no foolproof way of training a classifier that will work for everything. It highly depends on the data you are working with.
Here is the 'generic' pipeline:
extract features from the video
label your features (positive for the action you are looking for; negative otherwise)
split your data into 2 (or 3) sets. One for training, one for testing and the other optionally for validation
train a classifier on the labeled examples (e.g. SVM, Neural Network, Nearest Neighbor ...)
validate the results on the validation data, if that is appropriate for the algorithm
test on data you haven't used for training.
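The splitting step of this pipeline can be sketched in a few lines of standard-library Python. This is only an illustration under assumptions: the function name `train_val_test_split`, the 60/20/20 proportions and the clip names are all made up, and real projects often split by video rather than by frame so that near-identical frames do not end up on both sides.

```python
import random


def train_val_test_split(examples, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle once, then cut into disjoint train, validation and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical labeled clips: (clip name, label), 1 = action present, 0 = absent.
clips = [("clip%03d" % i, i % 2) for i in range(100)]
train, val, test = train_val_test_split(clips)
```

The test list is set aside at this point and only read again for the final evaluation.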
You can start with some machine learning tools here http://www.cs.waikato.ac.nz/ml/weka/
Make sure you never touch the test data for any purpose other than testing.
Good luck
Almost 10 years later, here's an updated answer.
Set up a camera and collect raw video data
Save it somewhere in the form of single frames. Do this yourself locally or using a cloud bucket or use a service like Sieve API. Helpful repo linked here.
Export from Sieve or cloud bucket to get data labeled. Do this yourself or using some service like Scale Rapid.
Split your dataset into train, test, and validation.
Train a classifier on the labeled samples. Use transfer learning over some existing model and fine-tune just the last few layers.
Run your model over the test set after each training epoch and save the one with the best test set performance.
Evaluate your model at the end using the validation set.
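The "keep the best epoch" step can be sketched independently of any deep learning framework. The snippet below only shows the selection logic; the function name `select_best_epoch` and the accuracy values are hypothetical, and in a real training loop the model weights would be saved whenever the score improves.

```python
def select_best_epoch(scores_per_epoch):
    """Return (epoch index, score) of the highest score; the earliest epoch wins ties."""
    best_epoch, best_score = None, float("-inf")
    for epoch, score in enumerate(scores_per_epoch):
        if score > best_score:
            # A real loop would save the model checkpoint here.
            best_epoch, best_score = epoch, score
    return best_epoch, best_score


# Hypothetical accuracies measured on the held-out set after each epoch.
held_out_accuracy = [0.61, 0.74, 0.82, 0.80, 0.82]
best_epoch, best_score = select_best_epoch(held_out_accuracy)
```

The final evaluation should then use a set that played no part in this selection, so the reported number is not biased by the choice of epoch.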
There are many repos that can help you get started: https://github.com/weiaicunzai/awesome-image-classification
The two things that can help you ensure best results include 1. high quality labeled data and 2. a diverse, curated dataset. That's what Sieve can help with!
