Kubeflow - what is xxx_op and yyyOp? - kubeflow

In Kubeflow, what does op or Op indicate in xxx_op and ```yyyOp``?
def add(a: float, b: float) -> float:
return a + b
add_op = comp.func_to_container_op(add)
#dsl.pipeline(
name='Calculation pipeline',
description='A toy pipeline that performs arithmetic calculations.'
)
def calc_pipeline(a='a',b='7'):
add_task = add_op(a, 4) #Returns a dsl.ContainerOp class instance.
Does it mean it is an Operator as in Kubeflow Operator introduction?
Kubeflow Operator helps deploy, monitor and manage the lifecycle of Kubeflow. Built using the Operator Framework which offers an open source toolkit to build, test, package operators and manage the lifecycle of operators.
Or does it mean a Component instance as in Conceptual overview of components in Kubeflow Pipelines?
A pipeline component is self-contained set of code that performs one step in the ML workflow (pipeline), such as data preprocessing, data transformation, model training, and so on. A component is analogous to a function, in that it has a name, parameters, return values, and a body
My guess is that anything that is packaged into a container and executed in a Kubeflow pipeline workflow is marked as _op or Op. Is it correct?
The term Component and Operator are abused in Kubeflow so not sure what they are exactly.

Related

Does GraalVM use same thread and heap space when calling polyglot functions?

If I call R code from Java within GraalVM (using GraalVM's polyglot function), does the R code and the Java code run on the same Java thread (ie there's no switching between OS or Java threads etc?) Also, is it the same "memory/heap" space? That is, in the example code below (which I took from https://www.baeldung.com/java-r-integration)
public double mean(int[] values) {
Context polyglot = Context.newBuilder().allowAllAccess(true).build();
String meanScriptContent = RUtils.getMeanScriptContent();
polyglot.eval("R", meanScriptContent);
Value rBindings = polyglot.getBindings("R");
Value rInput = rBindings.getMember("c").execute(values);
return rBindings.getMember("customMean").execute(rInput).asDouble();
}
does the call rBindings.getMember("c").execute(values) cause the values object (an array of ints) to be copied? Or is GraalVM smart enough to consider it a pointer to the same memory space? If it's a copy, is the copying time the same (or similar, ie within say 20%) time as if I were to a normal java clone() operation? Finally, does calling a polyglot function (in this case customMean implemented in R) have the same overhead as calling a native Java function? Bonus question: can the GraalVM JIT compiler even compile accross the layers, eg say I had this:
final long sum = IntStream.range(0,10000)
.stream()
.map(x -> x+4)
.map(x -> <<<FastR version of the following inverse operation: x-4 >>>)
.sum();
would the GraalVM compiler be as smart as say a normal Java JIT compiler and realize that the whole above statement can be simply written without the two map operations (Since they cancel each other out)?
FYI: I'm considering using GraalVM to run both my Java code and my R code, once the issue I identified here is resolved (Why is FASTR (ie GraalVM version of R) 10x *slower* compared to normal R despite Oracle's claim of 40x *faster*?) and one of the motivitations is that I hope to eliminate the 50% of time that calling R (using RServe()) from Java is spent on network IO (because Java communicates with RServer over TCP/IP and RServe and Java are on different threads and memory spaces etc etc.)
does the R code and the Java code run on the same Java thread. Also, is it the same "memory/heap" space?
Yes and yes. You can even use GraalVM VisualVM to inspect the heap: it provides standard Java view where you can see instances of FastR internal representations like RIntVector mingled with the rest of the other Java objects, or R view where you can see integer vectors, lists, environments, ...
does the call rBindings.getMember("c").execute(values) cause the values object (an array of ints) to be copied?
In general yes: most objects are passed to R as-is. Inside R you have two choices:
Explicitly convert them to some concrete type, i.e., as.integer(arg), which does not make a copy, but tells R explicitly how you want that value to be treated as "native" R type including R's value semantics.
Leave it up to the default rules, which will be applied once your objects is passed to some R builtin, e.g., int[] is treated as integer vector (but note that treating it as a list would be also reasonable in some cases). Again no copies here. And the object itself keeps its reference semantics.
However, sometimes FastR needs to make a copy:
some builtin functions cannot handle foreign objects yet
R language often implicitly copies vectors, because of its value semantics, arguments coercion, etc.
when a vector is passed to native R extension, we need to move its data to off heap memory
I would say that if you happen to have a very large vector, say GBs of data, you need to be very careful about it even in regular R. Note: FastR vectors are by default backed by Java arrays, so their size limitations apply to FastR vectors too.
Finally, does calling a polyglot function (in this case customMean implemented in R) have the same overhead as calling a native Java function?
Mostly yes, except that the function cannot be pulled and inlined into the surrounding Java code(+). The call itself is as fast as regular Java call. For the example you give: it cannot be optimized as you suggest, because the R function cannot be inlined(+). However, I would be very skeptical that any compiler can optimize this as you suggest even if both functions where pure Java code. That being said, yes: some things that compiler can optimize, like eliminating some useless computations that it can analyze well, is not going to work because of the impossibility to inline code across the Java <-> R boundary(+).
(+) Unless you'd run the Java code with Espresso (Java on Truffle), but then you would not be using Context API but Espresso's interop support.

Beam.io.WriteToPubSub throws error "The given pcoll PDone[WriteToPubSub/Write/NativeWrite.None] is not a dict, an iterable or a PCollection"

I'm getting an error whenever I use "WriteToPubSub". The code below is me trying to debug the issue. My actual code is trying to take data from failures of WriteToBigQuery in order to push it to a deadletter pubsub topic. But when I tried to do that I kept encountering the error below.
I am running Apache Beam 2.27, Python 3.8
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from apache_beam.runners import DataflowRunner
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam.options import pipeline_options
from apache_beam.options.pipeline_options import GoogleCloudOptions
import google.auth
import json
import pytz
# Setting up the Apache Beam pipeline options.
options = pipeline_options.PipelineOptions(flags=[])
# Sets the project to the default project in your current Google Cloud environment.
_, options.view_as(GoogleCloudOptions).project = google.auth.default()
# Sets the Google Cloud Region in which Cloud Dataflow runs.
options.view_as(GoogleCloudOptions).region = 'asia-east1'
# Sets the job name
options.view_as(GoogleCloudOptions).job_name = 'data_ingest'
# IMPORTANT! Adjust the following to choose a Cloud Storage location.
dataflow_gcs_location = '[REDACTED]'
# Dataflow Staging Location. This location is used to stage the Dataflow Pipeline and SDK binary.
options.view_as(GoogleCloudOptions).staging_location = '%s/staging' % dataflow_gcs_location
# Dataflow Temp Location. This location is used to store temporary files or intermediate results before finally outputting to the sink.
options.view_as(GoogleCloudOptions).temp_location = '%s/temp' % dataflow_gcs_location
# The directory to store the output files of the job.
output_gcs_location = '%s/output' % dataflow_gcs_location
ib.options.recording_duration = '1m'
# The Google Cloud PubSub topic for this example.
topic = "[REDACTED]"
output_topic = "[REDACTED]"
subscription = "[REDACTED]"
deadletter_topic = "[REDACTED]"
class PrintValue(beam.DoFn):
def process(self, element):
print(element)
return [element]
p = beam.Pipeline(InteractiveRunner(),options=options)
data = p | beam.io.ReadFromPubSub(topic=topic) | beam.ParDo(PrintValue()) | beam.io.WriteToPubSub(topic=deadletter_topic)
ib.show(data, include_window_info=False)
The error given is
ValueError: The given pcoll PDone[WriteToPubSub/Write/NativeWrite.None] is not a dict, an iterable or a PCollection.
Can someone spot what the problem is?
No matter what I do, WriteToPubSub says it's receiving PDone.
EDIT:
If i use p.run(), I get the following error instead:
'PDone' object has no attribute 'to_runner_api'
In both cases, the pipeline does not try to run, it immediately errors out.
EDIT:
I've realised the problem
p = beam.Pipeline(InteractiveRunner(),options=options)
It is this line. If I remove the interactiverunner everything works. Not sure why
Beam Terminology
Apache Beam has some base concepts, that we should adhere to while leveraging the power of this programming model.
Pipeline
In simple terms, a pipeline is a series of tasks for a desired output. It can be as simple as a linear flow or could have a complex branching of tasks. The fundamental concept is read from input source(s), perform some transformations and emit to output(s).
Mathematically, beam pipeline is just a Directed Acyclic Graph of tasks.
PCollection
In simple terms, PCollections is an immutable bag of elements which could be distributed across machines. Each step in a beam pipeline will have it's input and output as a PCollection (apart from sources and sinks)
PCollection is a powerful distributed data structure that a beam pipeline operates on. It could be bounded or unbounded based on your source type.
PTransforms
In simple terms, Transforms are the operations of your pipleine. It provides processing logic and this logic is applied to each element of one or more input of PCollections.
Example : PTransform<PCollection<X>,PCollection<Y>> will transform X to Y.
Based on processing paradigm, beam provides us multiple core transforms - ParDo, GroupByKey, Flatten, Combine etc.
I/O Transforms
When you create a pipeline one need a data source to read data such as a file or a database. Likewise, you want to emit your result data to an external storage system such as topic or an object store. The transforms which deal with External Input and Output are I/O Transforms.
Usually for an external source, you will have the following
Source : A PTransform to read data from the external system. This will read from
an external system(like file, db). It excepts a PBegin (pipeline entry point) and return a PCollection.
PTransform<PBegin,PCollection>
This would be one of the entry points of your pipeline.
Sink : A PTransform that will output data to an external system. This will write to an external system(like topic, storage). It excepts a PCollection and return a PDone (pipeline entry point).
PTransform<PCollection,PDone>
This would be one of the exit points of your pipeline.
Combination of a source and sink is an I/O Connector like RedisIO, PubSubIO etc. Beam provides multiple in-built connectors and one can write a custom one also.
There are still various concepts and extenions of the above, that allow users to program complex requirements that could be run on different runners. This is what makes Beam so powerful.
Solution
In your case, ib.show(data, include_window_info=False) is throwing the below error
ValueError: The given pcoll PDone[WriteToPubSub/Write/NativeWrite.None] is not a dict, an iterable or a PCollection.
Source Code
Because your data contains result of beam.io.WriteToPubSub(topic=deadletter_topic) which is a sink and returns a PDone not a PCollection.
For your use case of BQ Writing Failures to PubSub, you could follow something below
data = beam.io.ReadFromPubSub(topic=topic) | 'Write to BQ' >> beam.io.WriteToBigQuery( ...)
(data['beam.io.gcp.bigquery.BigQueryWriteFn.FAILED_ROWS']
| 'publish failed' >> beam.io.WriteToPubSub(topic=deadletter_topic)
However, if this does not solve your issue posting the code would be useful or else you could write a custom PTransform with output tags for writing to BQ and to return failures(via tuple tags) for publising to PubSub.
P.S. : WriteToBigQuery is not a sink, but a custom PTransform that writes to big query and returns failures.

TensorFlowFederated: Passing tensor to tff.federated_computation

I have trialled TFF tutorial (MNIST) on my single machine and now I am trying to perform a multi-machine process using MNIST data.
Clearly, I cannot use create_tf_dataset_for_client so I have used GRPC to learn how to pass data from one machine to another.
My scenario is that Server will dispatch the initial model (with zeroes) to all the participating clients where the model will run on local data. Each client will dispatch the new weights to the server that will perform federated_mean.
I was thinking of using tff.learning.build_federated_averaging_process where I could hopefully customise the next function (2nd argument) but I failed... I am not even sure if we use this approach to send the model and get the weights back from remote clients.
Then I thought I could use tff.federated_mean under #tff.federated_computation decorator. However, since weights are arrays and I have a list of them (as I have a number of clients), I am unable to understand how do I create a tff.FederatedType that points to that a list of lists. Any help from someone who has modelled federation on distributed dataset will be handy to understand.
Regards,
Dev.
TFF computations are designed to be platform/runtime agnostic; a single computation can be executed by several different backends.
TFF's type system can be helpful here in reasoning about how data is expected to flow in you computation. See the custom federated algorithms part 1 tutorial for an intro to how TFF thinks about types.
The result of build_federated_averaging_process expects an argument of datasets which are placed at clients; for a dataset of element type T, in TFF's usual notation this would be denoted {T*}#C. This signature particular is agnostic with respect to how the datasets arrive at the clients, or indeed how the clients themselves are represented.
Materializing the data and representing the clients is really the job of the runtime. TFF provides a few so-called native options here.
For example, in the local Python runtime clients are represented by threads on your local machine. Datasets are simply eager tf.data.Dataset objects, and the threads pull data from the datasets during training.
In the remote Python runtime, clients are represented by (threads on) remote workers, so that a single remote worker could be running more than one client. In this case, as you note, data must be materialized on the remote worker in order to train.
There are several options for accomplishing this.
One, TFF will actually handle serialization and deserialization of eager datasets across this RPC connection for you, so you could use the identical pattern of specifying data as in the local runtime, and it should "just work". This pattern actually got significantly better in March of 2021, via the use of tf.raw_ops.DatasetToGraphV2.
Perhaps better mapping to the concepts of federated computation, however, is the use of some library functions to simply instantiate the datasets on the workers.
Suppose you have an iterative process ip, which accepts a state and data argument, where data is of type {T*}#C. Suppose further we have a TFF computation get_dataset_for_client_id, which accepts a string and returns a dataset of appropriate type (IE, its TFF type signature is tf.str -> T*).
Then we can compose these two computations into another:
#tff.federated_computation(STATE_TYPE, tff.FederatedType(tf.string, tff.CLIENTS))
def new_next(state, client_ids):
datasets_on_clients = tff.federated_map(get_dataset_for_client_id, client_ids)
return ip.next(state, datasets_on_clients)
new_next now requires the controller to only specify the ids of clients on which to train, and delegates responsibility for pointing to a data store to whoever is representing the clients.
This pattern I think is likely what you want; TFF provides some helper s like the dataset_computation attribute on tff.simulation.ClientData and tff.simulation.compose_dataset_computation_with_iterative_process, which will more or less perform the wiring we did above for you.
let's do this step by step. Please let us know if the explanation below answers your question.
Let's start with an example of TF (non-federated, just local) code that takes a dataset and does something with it, say add numbers:
#tff.tf_computation(tff.SequenceType(tf.int32))
def process_data(ds):
return ds.reduce(np.int32(0), lambda x, y: x + y)
This code takes a dataset of integer numbers at input, and returns a single integer with the sum at output.
You can confirm this by lookin at the type signature, like this:
str(process_data.type_signature)
You should see this:
(int32* -> int32)
So, process_data takes a set of integers, and returns an integer.
Now, using TFF's federated operators we can create a federated computation that does this on multiple clients, like this:
#tff.federated_computation(tff.FederatedType(tff.SequenceType(tf.int32), tff.CLIENTS))
def process_data_on_clients(federated_ds):
return tff.federated_map(process_data, federated_ds)
If you look at the type signature of this new computation (just like above), you will see this:
({int32*}#CLIENTS -> {int32}#CLIENTS)
It means process_data_on_clients takes a federated set of integers (one set per client), and returns a federated integer (one integer with the sum on each client).
What happens in the above is that, the TF logic in process_data will be executed once on each client. This is how the federated_map operator works.
Now, process_data_on_clients is a little bit like the the iterative process you are working with. It wants you to provide a federated dataset as an argument.
Let's see how we can make one by following the same pattern as above.
Here's some TF code that creates a single dataset with integers, say you supply an integer n and want to create a dataset with numbers from 1 up to n, i..e, {1, 2, ..., n}:
#tff.tf_computation(tf.int32)
def make_data(n):
return tf.data.Dataset.range(tf.cast(n, tf.int64)).map(lambda x: tf.cast(x + 1, tf.int32))
This is obviously a silly example, you could do something more along the lines of what you need (e.g., read data from a file specified by a name, etc.).
And here's what its type signature looks like:
(int32 -> int32*)
You can see the similarity to process_data.
And, just like with processing data, here's now we can make data on all clients by using the federated_map operator:
#tff.federated_computation(tff.FederatedType(tf.int32, tff.CLIENTS))
def make_data_on_clients(federated_n):
return tff.federated_map(make_data, federated_n)
This is the type signature:
({int32}#CLIENTS -> {int32*}#CLIENTS)
Great, so make_data_on_clients takes a federated integer (that tells us how many data items to produce on each client), and returns a federated dataset, just like what process_data_on_clients wants.
You can check that the two work together as intended:
federated_n = [2, 3, 4]
federated_ds = make_data_on_clients(federated_n)
result = process_data_on_clients(federated_ds)
result
You should get the sums 1+2, 1+2+3, and 1+2+3+4 on the 3 clients involved in this computation (note there were 3 numbers in the federated integer above, so there are 3 clients here):
[<tf.Tensor: shape=(), dtype=int32, numpy=3>,
<tf.Tensor: shape=(), dtype=int32, numpy=6>,
<tf.Tensor: shape=(), dtype=int32, numpy=10>]
Note that all TF code you have seen so far, including both dataset creation and dataset reduce, were being executed on the clients (using federated_map).
Now, you can put the two together:
#tff.federated_computation(tff.FederatedType(tf.int32, tff.CLIENTS))
def make_and_process_data_on_clients(federated_n):
federated_ds = make_data_on_clients(federated_n)
return process_data_on_clients(federated_ds)
And now, you can invoke the make and process data combo in one shot:
make_and_process_data_on_clients(federated_n)
Again, all TF code here is executing on clients, just like in the above.
So where does this leave you?
Going back to Keith's explanation, the iterative process you got from TFF wants a federated dataset at input, just like process_data_on_clients.
The function get_dataset_for_client_id in Keith's example is like our make_data in that it is assumed to contain TensorFlow code that you want to run on each client to physically construct a dataset on that client.
In out silly example, dataset construction logic used range, but it can be anything. For example, you could load data on each client from the same local file my_data, or using a custom TF op, or by whatever other means. Just like in our example, you can pass parameters to that function to give you more centralized control (similarly to whatever did above with the federated integer).
The code snipper new_next in Keith's example is just like our make_and_process_data_on_clients, in that it combines two federated computations: one that makes federated data on clients (supplied by you, just as discussed here), and one that processes that data (from tff.learning, the iterative process).
Does this help?
If still unclear, I would recommend to try the examples I included above on your distributed setup, since you already have one. You could inject some TF print ops to that code to confirm that the TF code you wrote is executing on the client machines in your system.
Once you get that part, it's simple tweak to replace the silly data set construction logic in make_data with one that loads a dataset on each client from whatever local data source you are using.
EDITS:
Re: how to print, any TensorFlow code that appears in the body of a #tff.tf_computation is executed in eager mode, and you can use standard TensorFlow mechanisms such as tf.print to print from within TensorFlow.
tensorflow.org/api_docs/python/tf/print
On how to configure a multi-machine system with multiple worker nodes, see the Kubernetes tutorial. Note that the machine that drives the process connects to worker nodes, not the other way round.
https://www.tensorflow.org/federated/tutorials/high_performance_simulation_with_kubernetes

Optimizing repeated transformations in Apache Beam/DataFlow

I wonder if Apache Beam.Google DataFlow is smart enough to recognize repeated transformations in the dataflow graph and run them only once. For example, if I have 2 branches:
p | GroupByKey() | FlatMap(...)
p | combiners.Top.PerKey(...) | FlatMap(...)
both will involve grouping elements by key under the hood. Will the execution engine recognize that GroupByKey() has the same input in both cases and run it only once? Or do I need to manually ensure that GroupByKey() in this case proceeds all branches where it gets used?
As you may have inferred, this behavior is runner-dependent. Each runner implements its own optimization logic.
The Dataflow Runner does not currently support this optimization.

Iterative processing in Dataflow

As shown here Dataflow pipelines are represented by a fixed DAG. I'm wondering if it's possible to implement a pipeline where the processing proceeds until a dynamically evaluated condition is satisfied based on the data computed so far.
Here's some pseudo code to illustrate what I'd like to implement:
PCollection pco = null
while(true):
pco = pco.apply(someTransform())
if (conditionSatisfied(pco)):
break
pco.Write()
It seems like you really want iterative computations. Right now Dataflow does not provide support for that, but we are aware that it is a very important use case and we are working on finding the right set of APIs to express it.
For now your workarounds are:
Iteratively run whole pipelines (run pipeline, inspect output, run again if the condition is not satisfied, etc). This has the obvious downside of pipeline setup and teardown overhead.
Build a pipeline with a hard-coded number of iterations by .apply()'ing in a loop unconditionally, then run the whole pipeline.
A combination of the two, e.g. run fixed 5-iteration pipelines until you're satisfied with the result.

Resources