Google Cloud ML Engine: Apply Custom Function Before Training / Predicting - machine-learning

I currently pre-process some data on my laptop before passing it to ML Engine for training. Is it possible to apply a custom pre-processing function to my data and then train, all within ML Engine?
So instead of these steps:
Pre-process data on laptop.
Send pre-processed data to ML engine for training.
I would do:
Define a pre-processing function for ML Engine.
Send raw data to ML Engine, where it will:
a) pre-process my data by applying the function I've specified and
b) train on that data
Is this possible and, if so, how would I do it? I don't see anything in the docs.
Thanks!

You can use some of the sample code here:
Pre-processing is done using Dataflow, and training then runs in ML Engine using the output generated during the pre-processing phase.
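As a rough illustration of that pattern, here is a minimal Apache Beam sketch (the bucket paths and the preprocess function are hypothetical) that could be run on Dataflow before kicking off an ML Engine training job:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def preprocess(line):
    # Hypothetical per-record transformation: normalize whitespace and lowercase.
    return " ".join(line.lower().split())

# With runner='DataflowRunner' (plus project/region options) the same pipeline
# runs as a Dataflow job; without them it defaults to the local runner.
options = PipelineOptions()
with beam.Pipeline(options=options) as p:
    (p
     | "Read raw data" >> beam.io.ReadFromText("gs://your-bucket/raw_data.csv")
     | "Preprocess" >> beam.Map(preprocess)
     | "Write preprocessed" >> beam.io.WriteToText("gs://your-bucket/preprocessed/part"))
```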

For preprocessing data for TensorFlow models, consider TensorFlow Transform (see its Getting Started guide).
You may be interested in the chicago_taxi example, which includes a script for integrating the preprocessing with classification on Cloud ML Engine.
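As a rough sketch of what a tf.Transform preprocessing_fn looks like (the feature names "x" and "s" are made up for illustration, and the exact helper names depend on your tf.Transform version):

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Preprocessing applied consistently at training and serving time."""
    return {
        # Scale a numeric feature to zero mean and unit variance.
        "x_scaled": tft.scale_to_z_score(inputs["x"]),
        # Map a string feature to integer ids from a generated vocabulary.
        "s_integerized": tft.compute_and_apply_vocabulary(inputs["s"]),
    }
```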

Related

Difference between spacy_sklearn and tensorflow_embedding pipelines

I want to know if there is any basic difference between how the spacy_sklearn and tensorflow_embedding pipelines operate under the hood. I mean, tensorflow_embedding must also be using the same concepts of word embeddings, reducing the dimensionality of data using PCA, etc. Is the only difference then that spacy_sklearn has some pre-trained data to draw upon in the form of pre-trained vectors, and the tensorflow pipeline does not? Is my understanding correct? Also, how is the tensorflow_embedding pipeline related to the TensorFlow framework offered by Google?
I tried looking up the TensorFlow framework on Google but could not get any specific answer. I also searched for it on the Rasa community page, but again found no help.
The spacy_sklearn pipeline uses pre-trained word vectors. This is useful if we don't have much training data.
The tensorflow_embedding pipeline doesn't use any pre-trained word vectors; it is fitted specifically to our dataset. The advantage of the tensorflow_embedding pipeline is that the word vectors will be customised for our domain.
For more information, please refer to the link below:
https://rasa.com/docs/nlu/choosing_pipeline/
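For a concrete picture, here is a rough training sketch against the older rasa_nlu Python API (the data and config file names are made up); the only thing that differs between the two pipelines is the config file, e.g. pipeline: "spacy_sklearn" versus pipeline: "tensorflow_embedding":

```python
from rasa_nlu.training_data import load_data
from rasa_nlu.model import Trainer
from rasa_nlu import config

# Hypothetical paths: nlu_examples.md holds training examples, and the YAML
# config selects the pipeline (spacy_sklearn or tensorflow_embedding).
training_data = load_data("data/nlu_examples.md")
trainer = Trainer(config.load("config_tensorflow_embedding.yml"))
interpreter = trainer.train(training_data)

print(interpreter.parse("book a table for two tonight"))
```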

Relationship between the number of runs in tensorboard and the configuration of google cloud machine learning job

When I use TensorBoard to show the data, I find that there is more than one curve. I think this is related to the configuration, so could someone tell me what each curve represents?
This is not related in any way to Cloud ML Engine. You can find all the configurable parameters for the Engine in the docs for its REST API (training input, training output, prediction input, prediction output, model resource, version resource).
The curves in your TensorBoard are something you configured in your TensorFlow code, probably the training cost for several different runs, logged as a scalar summary with the name "train_cost".
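A minimal sketch of how such curves typically come about (the tag "train_cost" and the log directories are just illustrative): each run writes its scalar summaries into its own log directory, and TensorBoard draws one curve per directory it finds.

```python
import tensorflow as tf

# One scalar summary tagged "train_cost"; the value is fed in at each step.
loss_value = tf.placeholder(tf.float32)
summary_op = tf.summary.scalar("train_cost", loss_value)

for run in ("run_1", "run_2"):
    writer = tf.summary.FileWriter("logs/%s" % run)
    with tf.Session() as sess:
        for step in range(100):
            fake_loss = 1.0 / (step + 1)  # stand-in for a real training loss
            summary = sess.run(summary_op, feed_dict={loss_value: fake_loss})
            writer.add_summary(summary, global_step=step)
    writer.close()

# Then inspect with: tensorboard --logdir logs
```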

Training and prediction via Tensorflow

I have just started coding with TensorFlow and I have classified images.
Is there any possibility of making a prediction based on testing data?
How can I predict a missing value based on the model?
Question
Does TensorFlow have the capability to read training data and test data from two separate files?
Answer
Yes! Here is an example of processing the Iris data, an introductory machine learning dataset, with TensorFlow.
If you look at the code you will see the following lines:
IRIS_TRAINING = "iris_training.csv"
IRIS_TEST = "iris_test.csv"
The data is clearly separated into training and test files. You don't have to separate your data into different files in TensorFlow, but it certainly supports it, and the linked tutorial shows how to do it.
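Roughly, the flow in that tutorial looks like the sketch below (based on the old tf.contrib.learn Iris tutorial; treat the exact helper names as assumptions that depend on your TensorFlow version):

```python
import numpy as np
import tensorflow as tf

IRIS_TRAINING = "iris_training.csv"
IRIS_TEST = "iris_test.csv"

# Load the two CSV files separately (helper from the old contrib tutorial).
training_set = tf.contrib.learn.datasets.base.load_csv_with_header(
    filename=IRIS_TRAINING, target_dtype=np.int, features_dtype=np.float32)
test_set = tf.contrib.learn.datasets.base.load_csv_with_header(
    filename=IRIS_TEST, target_dtype=np.int, features_dtype=np.float32)

feature_columns = [tf.feature_column.numeric_column("x", shape=[4])]
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns, hidden_units=[10, 20, 10], n_classes=3)

# Train on the training file ...
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": training_set.data}, y=training_set.target,
    num_epochs=None, shuffle=True)
classifier.train(input_fn=train_input_fn, steps=2000)

# ... and evaluate/predict on the separate test file.
test_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": test_set.data}, y=test_set.target, num_epochs=1, shuffle=False)
print(classifier.evaluate(input_fn=test_input_fn)["accuracy"])
```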

Is it possible to use Caffe Only for classification without any training?

Some users might see this as an opinion-based question, but if you look closely, I am trying to explore the use of Caffe as a purely testing platform as opposed to its currently popular use as a training platform.
Background:
I have installed all dependencies using Jetpack 2.0 on Nvidia TK1.
I have installed caffe and its dependencies successfully.
The MNIST example is working fine.
Task:
I have been given a convnet with all standard layers. (Not an opensource model)
The network weights and bias values etc are available after training. The training has not been done via caffe. (Pretrained Network)
The weights and bias are all in the form of MATLAB matrices. (Actually in a .txt file but I can easily write code to get them to be matrices)
I CANNOT do training of this network with caffe and must use the given weights and bias values ONLY for classification.
I have my own dataset in the form of 32x32 pixel images.
Issue:
In all tutorials, details are given on how to deploy and train a network, and then use the generated .proto and .caffemodel files to validate and classify. Is it possible to implement this network on caffe and directly use my weights/bias and training set to classify images? What are the available options here? I am a caffe-virgin so be kind. Thank you for the help!
The only issue here is:
How to initialize a caffe net from weights stored in text files?
I assume you have a 'deploy.prototxt' describing the net's architecture (layer types, connectivity, filter sizes etc.). The only issue remaining is how to set the internal weights of caffe.Net to pre-defined values saved as text files.
You can get access to caffe.Net internals, see net surgery tutorial on how this can be done in python.
Once you are able to set the weights according to your text file, you can net.save(...) the new weights into a binary caffemodel file to be used from now on. You do not have to train the net if you already have trained weights, and you can use it for generating predictions ("test").
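A minimal sketch of that net-surgery flow, assuming a 'deploy.prototxt' and one text file per blob (the file naming convention and the assumption that every parameterized layer has a weight and a bias blob are made up for illustration):

```python
import numpy as np
import caffe

# Build the net from the architecture description only; weights start uninitialized.
net = caffe.Net('deploy.prototxt', caffe.TEST)

for layer_name in net.params:
    # Hypothetical naming convention: '<layer>_weights.txt' and '<layer>_bias.txt',
    # values stored in row-major order so they can be reshaped to the blob shapes.
    weights = np.loadtxt('%s_weights.txt' % layer_name)
    bias = np.loadtxt('%s_bias.txt' % layer_name)
    net.params[layer_name][0].data[...] = weights.reshape(net.params[layer_name][0].data.shape)
    net.params[layer_name][1].data[...] = bias.reshape(net.params[layer_name][1].data.shape)

# Persist the populated weights as a regular caffemodel for later use.
net.save('converted_weights.caffemodel')
```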

Apache Spark (MLLib) for real time analytics

I have a few questions related to the use of Apache Spark for real-time analytics using Java. When the Spark application is submitted, the data stored in a Cassandra database are loaded and processed via a machine learning algorithm (Support Vector Machine). Through Spark's streaming extension, when new data arrive they are persisted in the database, the existing dataset is re-trained, and the SVM algorithm is executed. The output of this process is also stored back in the database.
Apache Spark's MLlib provides an implementation of linear support vector machines. If I wanted a non-linear SVM implementation, should I implement my own algorithm, or may I use existing libraries such as libsvm or jkernelmachines? These implementations are not based on Spark's RDDs; is there a way to do this without implementing the algorithm from scratch using RDD collections? If not, that would be a huge effort if I wanted to test several algorithms.
Does MLlib provide out-of-the-box utilities for data scaling before executing the SVM algorithm, as described in section 2.2 of http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf?
While new data is streamed in, do I need to re-train on the whole dataset? Is there any way I could just add the new data to the already trained model?
To answer your questions piecewise,
Spark provides the MLUtils class that lets you load data from the LIBSVM format into RDDs, so the data-loading portion won't stop you from utilizing that library (see the sketch after these answers). You could also implement your own algorithms if you know what you're doing, although my recommendation would be to take an existing one, tweak the objective function, and see how it runs. Spark basically provides you the functionality of a distributed Stochastic Gradient Descent process; you can do anything with it.
Not that I know of. Hopefully someone else knows the answer.
What do you mean by re-training when the whole data is streamed?
From the docs,
.. except fitting occurs on each batch of data, so that the model continually updates to reflect the data from the stream.
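As a minimal sketch of the loading-and-training flow mentioned in the first answer (the input path is hypothetical):

```python
from pyspark import SparkContext
from pyspark.mllib.util import MLUtils
from pyspark.mllib.classification import SVMWithSGD

sc = SparkContext(appName="svm-sketch")

# MLUtils reads LIBSVM-formatted data into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
training, test = data.randomSplit([0.8, 0.2], seed=42)

# Linear SVM trained with MLlib's distributed SGD.
model = SVMWithSGD.train(training, iterations=100)

# Evaluate on the held-out split.
labels_and_preds = test.map(lambda p: (p.label, model.predict(p.features)))
error = labels_and_preds.filter(lambda lp: lp[0] != lp[1]).count() / float(test.count())
print("Test error: %f" % error)
```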
