Can we predict image captions using google-ml-engine? - google-cloud-ml-engine

The census and flowers samples show how to predict class labels using Google's ML Engine.
Can we deploy our own model to generate image captions? If yes, how does the prediction work, and what will the format of the prediction response be?
To be more specific, in the attachment shown below, the probabilities sub-array gives the index and probability of each class. If we use an image captioning model, what will the prediction response look like?
Attachment: http://boaloysius.me/sites/default/files/inline-images/predict1_0.png

Cloud ML Engine allows you to deploy nearly any TensorFlow model you can export. For any such model, you define your inputs and your outputs, and that is what dictates the form of the requests and responses.
I think it might be useful to understand this by walking through an example. You could imagine exporting a model like so:
def my_model(image_in):
    # Construct an inference graph for predicting captions.
    #
    # image_in is a tensor/array with shape=(None,) and dtype=tf.string,
    # i.e. a batch of raw image bytes.

    # ... do your interesting stuff here ...

    # caption_out is a tensor/matrix with shape=(None, MAX_WORDS) and
    # dtype=tf.string; that is, you will be returning a batch of captions,
    # one per input image, one word per column, with padding when the
    # number of words to output is < MAX_WORDS.
    return caption_out

image_in = tf.placeholder(shape=(None,), dtype=tf.string)
caption_out = my_model(image_in)

inputs = {"image_bytes": tf.saved_model.utils.build_tensor_info(image_in)}
outputs = {"caption": tf.saved_model.utils.build_tensor_info(caption_out)}
signature = tf.saved_model.signature_def_utils.build_signature_def(
    inputs=inputs,
    outputs=outputs,
    method_name='tensorflow/serving/predict'
)
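For completeness, a minimal export sketch building on the snippet above might look like the following; the export directory and the session setup are placeholders I'm adding here, not part of the original post:

export_dir = "my_caption_model/1"  # hypothetical output directory
builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # or restore your trained weights
    builder.add_meta_graph_and_variables(
        sess,
        [tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature
        })
builder.save()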
After you export this model (cf this post), you would construct a JSON request like so:
{
  "instances": [
    {
      "image_bytes": {
        "b64": <base64_encoded_image1>
      }
    },
    {
      "image_bytes": {
        "b64": <base64_encoded_image2>
      }
    }
  ]
}
Let us analyze the request. First, we will be sending the service a batch of images. All requests are a JSON object with an array-valued attribute called "instances"; each entry in the array is an instance to feed the graph for prediction. Note that this is why we are required to set the outermost dimension to None on models that are exported -- they need to be able to handle variable-sized batches.
Each entry in the array is itself a JSON object, where the attributes are the keys of the input dict we defined when exporting the model. In this case, we only defined image_bytes. Because image_bytes is a byte string, we need to base64-encode the data, which we do by passing JSON objects of the form {"b64": <data>}. If we wanted to send more than one image to the service, we could add a similar entry for each image to the instances array.
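As a concrete illustration, here is one way you might build that request body in Python; the file names are hypothetical:

import base64
import json

def encode_image(path):
    # Base64-encode the raw image bytes, as required for "b64" values.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

request_body = json.dumps({
    "instances": [
        {"image_bytes": {"b64": encode_image("image1.jpg")}},
        {"image_bytes": {"b64": encode_image("image2.jpg")}},
    ]
})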
Now, a sample JSON response for this example might look like:
{
  "predictions": [
    {
      "caption": [
        "the",
        "quick",
        "brown",
        "",
        "",
        ""
      ]
    },
    {
      "caption": [
        "A",
        "person",
        "on",
        "the",
        "beach",
        ""
      ]
    }
  ]
}
All responses are JSON objects with an array-valued attribute called "predictions". Each element of the array is the prediction associated with the corresponding input in the instances array from the request.
Each entry in the array is a JSON object whose attributes are determined by the keys of the outputs dict that we exported earlier. In this case, we have a single output per input called caption. Note that the source tensor for caption, caption_out, is actually a matrix, where the number of rows equals the number of instances sent to the service and the number of columns is the constant MAX_WORDS. However, instead of returning the matrix, the service returns each row of the matrix independently as an entry in the predictions array. Since the second dimension is fixed, the model itself is expected to pad any unused word slots with the empty string (as depicted above).
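If it helps, a client could strip that padding with a few lines of Python; this is just a sketch over the sample response shown above, not part of the service itself:

# `response` is the parsed JSON response, e.g. from json.loads(...)
captions = [
    " ".join(word for word in prediction["caption"] if word)  # drop "" padding
    for prediction in response["predictions"]
]
# -> ["the quick brown", "A person on the beach"]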
One very important note: in the examples above, I showed the raw JSON request/response bodies. From your post, it is apparent that you are using Google's generic client, which parses the response and adds structure around it; specifically, the object you are printing wraps the predictions within the nested fields [data] and then [modelData:protected].
My personal recommendation is not to use that client for the service and to instead use a generic request/response library (along with Google's authentication library), but since you have things working, you're welcome to use whatever works for you.
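To make that recommendation concrete, here is a rough sketch using google-auth plus an authorized HTTP session; the project and model names are placeholders, and request_body is the JSON string built in the earlier snippet:

import google.auth
from google.auth.transport.requests import AuthorizedSession

PROJECT = "my-project"        # hypothetical project id
MODEL = "my_caption_model"    # hypothetical model name
url = "https://ml.googleapis.com/v1/projects/{}/models/{}:predict".format(
    PROJECT, MODEL)

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

response = session.post(url, data=request_body,
                        headers={"Content-Type": "application/json"})
print(response.json())  # {"predictions": [{"caption": [...]}, ...]}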

Related

Gensim doc2vec produce more vectors than given documents, when I pass unique integer id as tags

I'm trying to make document vectors from the gensim example using Doc2Vec.
I passed TaggedDocuments containing 9 docs and 9 tags.
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
idx = [0,1,2,3,4,5,6,7,100]
documents = [TaggedDocument(doc, [i]) for doc, i in zip(common_texts, idx)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)
and it produces 101 vectors like this image.
gensim doc2vec produced 101 vectors
What I want to know is:
How can I be sure that the tag I passed is attached to the right vector?
Where did the vectors for the tags I didn't pass (8 to 99 in my case) come from? Were they computed as blanks?
If you use plain ints as your document-tags, then the Doc2Vec model will allocate enough doc-vectors for every int up to the highest int you provide - even if you don't use some of those ints.
This assumption, that all ints up to the highest declared are used, allows the code to avoid creating a redundant {tag -> slot} dictionary, saving a little memory. That specific potential savings is the main reason for supporting plain ints (rather than unique strings) as tag names.
Any such doc-vectors that are allocated but never subject to any training will be randomly initialized the same as the others - but never adjusted by training.
If you want to use plain int tag names, you should either be comfortable with this over-allocation, or make sure you use contiguous int IDs from 0 to your max ID, with none unused. But unless your training data is very large, using unique string tags, and allowing the {tag -> slot} dictionary to be created, is straightforward and not too expensive in memory.
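For illustration, a version of the question's code with unique string tags might look like this; the tag names are made up, and model.dv is the gensim 4.x attribute (older releases call it model.docvecs):

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# One made-up string tag per document.
tags = ["doc_{}".format(i) for i in range(len(common_texts))]
documents = [TaggedDocument(doc, [tag]) for doc, tag in zip(common_texts, tags)]

# min_count=1 only because this toy corpus is tiny; see the note below.
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)

print(len(model.dv))      # 9 vectors - one per unique tag, no over-allocation
print(model.dv["doc_0"])  # look up a vector by its tag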
(Separately: min_count=1 is almost always a bad idea in these algorithms, as discarding rare tokens tends to give better results than letting their thin example usages interfere with other training.)

TFF: How to define the tff.simulation.ClientData.from_clients_and_fn function?

In the federated learning context, one classmethod that should work is tff.simulation.ClientData.from_clients_and_fn: if I pass a list of client_ids and a function which returns the appropriate dataset when given a client id, I will have a fully functional ClientData.
I think one approach for defining that function is to construct a Python dict which maps client IDs to tf.data.Dataset objects; I could then define a function which takes a client id, looks up the dataset in the dict, and returns it.
So I defined the function as below, but I think it is wrong. What do you think?
list = ["0","1","2"]
tab = {"0":ds, "1":ds, "2":ds}
def create_tf_dataset_for_client_fn(id):
    return ds
source = tff.simulation.ClientData.from_clients_and_fn(list, create_tf_dataset_for_client_fn)
I assume here that the 3 clients all have the same dataset, 'ds'.
Creating a dict of (client_id, dataset) key-value pairs is a reasonable way to set up a tff.simulation.ClientData. Indeed, the code in the question will result in all clients having the same dataset, since ds is returned for every value of the parameter id. One thing to watch out for when pre-constructing a dict of datasets is that it may require loading the entire contents of the data into memory, which may fail for large datasets.
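For reference, a corrected version of the dict lookup the question was aiming for might look like the following; ds0, ds1, and ds2 stand in for per-client tf.data.Dataset objects:

client_datasets = {"0": ds0, "1": ds1, "2": ds2}

def create_tf_dataset_for_client_fn(client_id):
    # Look the dataset up in the dict instead of always returning the same ds.
    return client_datasets[client_id]

source = tff.simulation.ClientData.from_clients_and_fn(
    list(client_datasets.keys()), create_tf_dataset_for_client_fn)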
Alternatively, constructing the dataset on-demand could reduce memory usage. One example might be to have a dict of (client_id, file path) key-value pairs. Something like:
dataset_paths = {
    'client_0': '/tmp/A.txt',
    'client_1': '/tmp/B.txt',
    'client_2': '/tmp/C.txt',
}

def create_tf_dataset_for_client_fn(id):
    path = dataset_paths.get(id)
    if path is None:
        raise ValueError(f'No dataset for client {id}')
    return tf.data.TextLineDataset(path)

source = tff.simulation.ClientData.from_clients_and_fn(
    dataset_paths.keys(), create_tf_dataset_for_client_fn)
This is similar to the approach used in tff.simulation.FilePerUserClientData. It may be useful to look at the code of that class as an example.
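As a quick sanity check, you could then pull a few examples back out of the resulting ClientData; this sketch assumes the `source` built in the snippet above:

for client_id in source.client_ids:
    client_ds = source.create_tf_dataset_for_client(client_id)
    for line in client_ds.take(2):  # peek at the first two lines per client
        print(client_id, line)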

DataLoader - shuffle implicit pairs

Is there a way to handle the DataLoader as a list? The idea is that I want to shuffle implicit pairs of images without setting shuffle to True.
Basically, I have for example 10 scenes, each containing let's say 100 sequences, so they are represented inside the directory as
'1_1.png', '1_2.png', '1_3.png', ..., '2_1.png', '2_2.png', '2_3.png', ..., '3_1.png', '3_2.png', '3_3.png', ..., ..., '10_1.png', '10_2.png', '10_3.png', ...
I don't want complete shuffling of the data; what I want is simply to shuffle while keeping pairs together, so they are represented in the data loader as
[ '1_3.png', '1_4.png', '2_2.png', '2_3.png', '10_1.png', '10_2.png', '1_2.png', '1_3.png', ...]
and so on
Please have a look at the question about shuffling an array of implicit pairs that I have already asked on Stack Overflow; it should make clear what I mean.
As an example:
if this is a list
L = [['1_1'],['1_2'],['1_3'],['1_4'],['1_5'],['1_6'],['2_1'],['2_2'],['2_3'],['2_4'],['2_5'],['2_6'],['3_1'],['3_2'],['3_3'],['3_4'],['3_5'],['3_6']]
then this is the output
[['1_2'], ['1_3'], ['2_1'], ['2_2'], ['2_4'], ['2_5'],
['2_2'], ['2_3'], ['1_3'], ['1_4'], ['3_4'], ['3_5'],
['3_3'], ['3_4'], ['3_2'], ['3_3'], ['1_6'], ['2_1'],
['2_5'], ['2_6'], ['2_6'], ['3_1'], ['1_4'], ['1_5'],
['1_1'], ['1_2'], ['2_3'], ['2_4'], ['1_5'], ['1_6'],
['3_1'], ['3_2'], ['3_5'], ['3_6']]
I want to achieve the same for a DataLoader
The main idea is that I want to train my network on sequential frames; it doesn't have to be the complete sequence, but at each step I need at least two consecutive frames.
I think you are looking for data.Sampler: instead of the completely random default shuffle of data.DataLoader, you can provide your own "sampler" that samples examples from your Dataset.
Looking at the input parameters of data.DataLoader:
sampler (Sampler, optional) – defines the strategy to draw samples
from the dataset. If specified, shuffle must be False.
I think a good starting point is to look at the code of data.SubsetRandomSampler.
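As a rough sketch of that idea (my own illustration, not tested against your data): a custom sampler that takes a precomputed list of (i, i+1) index pairs from the same scene, shuffles the pairs, and yields both indices of each pair back-to-back. Building the pair list from the file names is left to your dataset.

import random
from torch.utils.data import DataLoader, Sampler

class PairSampler(Sampler):
    # `pairs` is a list of (i, j) dataset-index tuples, e.g. consecutive
    # frames of the same scene; each pair stays together while the order
    # of the pairs is reshuffled on every iteration.
    def __init__(self, pairs):
        self.pairs = pairs

    def __iter__(self):
        order = list(range(len(self.pairs)))
        random.shuffle(order)
        for k in order:
            i, j = self.pairs[k]
            yield i
            yield j

    def __len__(self):
        return 2 * len(self.pairs)

# Hypothetical usage: batch_size=2 so each batch is exactly one pair.
# loader = DataLoader(dataset, batch_size=2, sampler=PairSampler(pairs))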

how to get the hash value when using StaticWordValueEncoder in Mahout

I'm looking at an example in the Mahout in Action book. It uses the StaticWordValueEncoder to encode text in the feature-hashing manner.
When encoding "text to magically vectorize" with a standard analyzer and probe = 1, the vector is {12:1.0, 54:1.0, 78:1.0}. However, I can't figure out which word each hash index refers to.
Is there any method to get [hash, original word] pairs, e.g. that hash 12 refers to the word "text"?
If you have read this paragraph from Mahout in Action:
"The value of a continuous
variable gets added directly to one or more locations that are allocated for the storage
of the value. The location or locations are determined by the name of the feature.
This hashed feature approach has the distinct advantage of requiring less memory
and one less pass through the training data, but it can make it much harder to reverse engineer
vectors to determine which original feature mapped to a vector location."
I am not sure how the reverse engineering can be done (it is certainly a difficult task, as the author has put it). Perhaps someone can shed some light on this.
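To illustrate why the mapping is one-way (this is a toy sketch in Python, not Mahout's actual hash function): the encoder only computes word -> index, so if you want index -> word you have to record it yourself while encoding, and collisions can map several words to one index.

import hashlib

NUM_FEATURES = 100  # size of the hashed feature vector (made-up value)

def slot(word, probe=0):
    # Toy stand-in for the encoder's hash: word (+ probe) -> vector index.
    digest = hashlib.md5((word + str(probe)).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_FEATURES

words = ["text", "to", "magically", "vectorize"]
forward = {w: slot(w) for w in words}      # easy: word -> index

reverse = {}                               # only exists if you build it yourself
for w, idx in forward.items():
    reverse.setdefault(idx, []).append(w)  # several words may share an index

print(forward)
print(reverse)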

How to read Mahout clustering output

I have run the k-Means clustering algorithm on the synthetic control data from the Mahout tutorial, and was wondering if someone could explain how to interpret the output. I ran clusterdump and received output that looks something like this (truncated to save space):
CL-592{n=57 c=[30.726, 29.813...] r=[3.528, 3.597...]}
Weight : [props - optional]: Point:
1.0 : [distance=27.453962995925863]: [24.672, 35.261, 30.486...]
1.0 : [distance=27.675053294846002]: [25.592, 29.951, 34.188...]
1.0 : [distance=28.97727289419493]: [30.696, 32.667, 34.223...]
1.0 : [distance=21.999685652862784]: [32.702, 35.219, 30.143...]
...
CL-598{n=50 c=[29.611, 29.769...] r=[3.166, 3.561...]}
Weight : [props - optional]: Point:
1.0 : [distance=27.266203490250472]: [27.679, 33.506, 23.594...]
1.0 : [distance=28.749781351838173]: [34.727, 28.325, 30.331...]
1.0 : [distance=32.635136046420186]: [27.758, 33.859, 29.879...]
1.0 : [distance=29.328974057024624]: [29.356, 26.793, 25.575...]
Could someone explain to me how to read this? From what I understand, CL-__ is a cluster ID, followed by n=number of points in the cluster, c=centroid as a vector, r=radius as a vector, and then each point in the cluster. Is this correct? Furthermore, how do I know which clustered point matches up with which input point? That is, are the points described as key-value pairs where the key is some kind of ID for the point and the value is the vector? If not, is there some way I can set it up so it is?
I believe your interpretation of the data is correct (I've only been working with Mahout for ~3 weeks, so someone more seasoned should probably weigh in on this).
As far as linking points back to the input that created them, I've used NamedVector, where the name is the key for the vector. When you read one of the generated points files (clusteredPoints), you can convert each row (point vector) back into a NamedVector and retrieve the name using .getName().
Update in response to comment
When you initially read your data into Mahout, you convert it into a collection of vectors, which you then write to a file (points) for use in the clustering algorithms later. Mahout gives you several Vector types to use, but it also gives you access to a Vector wrapper class called NamedVector, which allows you to identify each vector.
For example, you could create each NamedVector as follows:
NamedVector nVec = new NamedVector(
    new SequentialAccessSparseVector(vectorDimensions),
    vectorName
);
Then you write your collection of NamedVectors to file with something like:
SequenceFile.Writer writer = new SequenceFile.Writer(...);
VectorWritable writable = new VectorWritable();
// the next two lines will be in a loop, but I'm omitting it for clarity
writable.set(nVec);
writer.append(new Text(nVec.getName()), writable);
You can now use this file as input to one of the clustering algorithms.
After having run one of the clustering algorithms with your points file, it will have generated yet another points file, but it will be in a directory named clusteredPoints.
You can then read in this points file and extract the name you associated to each vector. It'll look something like this:
IntWritable clusterId = new IntWritable();
WeightedPropertyVectorWritable vector = new WeightedPropertyVectorWritable();
while (reader.next(clusterId, vector))
{
    NamedVector nVec = (NamedVector) vector.getVector();
    // you now have access to the original name using nVec.getName()
}
Try adding the option -of CSV in clusterdump; you will get output that is easier to use for further processing.
I have the same problem (using Mahout 0.6). I am also a beginner. I need to display the documents in the form of clusters to the users, so I will need document names rather than the words corresponding to clusters. I have been clustering the text documents from a shell script.
