I'm trying to write a Dataflow pipeline in Python that requires a large numpy matrix as a side input. The matrix is saved in cloud storage. Ideally, each Dataflow worker would load the matrix directly from cloud storage.
My understanding is that if I say matrix = np.load(LOCAL_PATH_TO_MATRIX), and then
p | "computation" >> beam.Map(computation, matrix)
the matrix gets shipped from my laptop to each Dataflow worker.
How could I instead direct each worker to load the matrix directly from cloud storage? Is there a beam source for "binary blob"?
Your approach is correct.
What Dataflow does, in this case, is handle the NumPy matrix as a side input. This means that it's uploaded once from your machine to the service, and the Dataflow service will send it to each worker.
Given that the matrix is large, this will make your workers use I/O to receive it from the service, and carry the burden of keeping the whole matrix in memory, but it should work.
If you want to avoid computing/loading the matrix on your machine, you can upload your matrix to GCS as a file, have each worker read that file in, and obtain the matrix. You can do something like this:
matrix_file = 'gs://mybucket/my/matrix'
p | beam.ParDo(ComputationDoFn(matrix_file))
And your DoFn could be something like:
class ComputationDoFn(beam.DoFn):
    def __init__(self, matrix_file):
        self._matrix_file = matrix_file
        self._matrix = None

    def start_bundle(self):
        # We check because one DoFn instance may be reused
        # for different bundles.
        if self._matrix is None:
            self.load_matrix(self._matrix_file)

    def process(self, element):
        # Now process the element using self._matrix
        yield element

    def load_matrix(self, matrix_file):
        # Load the file from GCS using the GCS API
        pass
I hope this makes sense. I can flesh out the functions if you feel like you need some more help.
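For example, load_matrix could be fleshed out roughly like this. This is only a sketch: it assumes the matrix was saved to GCS with np.save (an .npy blob) and uses Beam's FileSystems API, which understands gs:// paths.

import io

import apache_beam as beam
import numpy as np
from apache_beam.io.filesystems import FileSystems

class ComputationDoFn(beam.DoFn):
    # __init__, start_bundle and process as above.

    def load_matrix(self, matrix_file):
        # Read the serialized matrix from GCS and deserialize it
        # into a NumPy array held by this worker.
        with FileSystems.open(matrix_file) as f:
            self._matrix = np.load(io.BytesIO(f.read()))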
I'm experimenting with TFX for common ML pipeline work. I'm struggling somewhat to actually use the StatisticsGen component to inspect and analyze data statistics.
While with standalone TFDV I can access statistics in a straightforward manner:
import tensorflow_data_validation as tfdv
stats = tfdv.generate_statistics_from_csv('data.csv', delimiter=',')
stats # This gives a JSON-like output
with TFX itself, StatisticsGen generates a binary FeatureStats.pb file in artifacts/StatisticsGen/statistics/...
How can I extract the actual statistics from StatisticsGen to use them for checking data (or any other purpose)? I'm aware that the interactive context can visualize stats, but this is unhelpful in a production environment.
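(Presumably the FeatureStats.pb artifact is a serialized DatasetFeatureStatisticsList proto, so I imagine something like the following sketch could parse it directly, but I'd like to confirm the intended approach; the path below is just a placeholder.)

from tensorflow_metadata.proto.v0 import statistics_pb2

# Placeholder path to the StatisticsGen artifact.
stats_path = 'artifacts/StatisticsGen/statistics/.../FeatureStats.pb'

stats = statistics_pb2.DatasetFeatureStatisticsList()
with open(stats_path, 'rb') as f:
    stats.ParseFromString(f.read())

# Per-feature statistics are then available as ordinary proto fields.
for dataset in stats.datasets:
    for feature in dataset.features:
        print(feature.path, feature.num_stats)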
I'm getting an error whenever I use "WriteToPubSub". The code below is me trying to debug the issue. My actual code is trying to take data from failures of WriteToBigQuery in order to push it to a deadletter pubsub topic. But when I tried to do that I kept encountering the error below.
I am running Apache Beam 2.27, Python 3.8
import apache_beam as beam
from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
from apache_beam.io.gcp.bigtableio import WriteToBigTable
from apache_beam.runners import DataflowRunner
import apache_beam.runners.interactive.interactive_beam as ib
from apache_beam.options import pipeline_options
from apache_beam.options.pipeline_options import GoogleCloudOptions
import google.auth
import json
import pytz
# Setting up the Apache Beam pipeline options.
options = pipeline_options.PipelineOptions(flags=[])
# Sets the project to the default project in your current Google Cloud environment.
_, options.view_as(GoogleCloudOptions).project = google.auth.default()
# Sets the Google Cloud Region in which Cloud Dataflow runs.
options.view_as(GoogleCloudOptions).region = 'asia-east1'
# Sets the job name
options.view_as(GoogleCloudOptions).job_name = 'data_ingest'
# IMPORTANT! Adjust the following to choose a Cloud Storage location.
dataflow_gcs_location = '[REDACTED]'
# Dataflow Staging Location. This location is used to stage the Dataflow Pipeline and SDK binary.
options.view_as(GoogleCloudOptions).staging_location = '%s/staging' % dataflow_gcs_location
# Dataflow Temp Location. This location is used to store temporary files or intermediate results before finally outputting to the sink.
options.view_as(GoogleCloudOptions).temp_location = '%s/temp' % dataflow_gcs_location
# The directory to store the output files of the job.
output_gcs_location = '%s/output' % dataflow_gcs_location
ib.options.recording_duration = '1m'
# The Google Cloud PubSub topic for this example.
topic = "[REDACTED]"
output_topic = "[REDACTED]"
subscription = "[REDACTED]"
deadletter_topic = "[REDACTED]"
class PrintValue(beam.DoFn):
    def process(self, element):
        print(element)
        return [element]
p = beam.Pipeline(InteractiveRunner(),options=options)
data = p | beam.io.ReadFromPubSub(topic=topic) | beam.ParDo(PrintValue()) | beam.io.WriteToPubSub(topic=deadletter_topic)
ib.show(data, include_window_info=False)
The error given is
ValueError: The given pcoll PDone[WriteToPubSub/Write/NativeWrite.None] is not a dict, an iterable or a PCollection.
Can someone spot what the problem is?
No matter what I do, WriteToPubSub says it's receiving PDone.
EDIT:
If I use p.run(), I get the following error instead:
'PDone' object has no attribute 'to_runner_api'
In both cases, the pipeline does not try to run, it immediately errors out.
EDIT:
I've realised the problem
p = beam.Pipeline(InteractiveRunner(),options=options)
It is this line. If I remove the InteractiveRunner, everything works. Not sure why.
Beam Terminology
Apache Beam has some base concepts that we should understand while leveraging the power of this programming model.
Pipeline
In simple terms, a pipeline is a series of tasks producing a desired output. It can be as simple as a linear flow or can have complex branching of tasks. The fundamental concept is: read from input source(s), perform some transformations, and emit to output(s).
Mathematically, a Beam pipeline is just a Directed Acyclic Graph (DAG) of tasks.
PCollection
In simple terms, a PCollection is an immutable bag of elements which can be distributed across machines. Each step in a Beam pipeline has a PCollection as its input and output (apart from sources and sinks).
A PCollection is the powerful distributed data structure that a Beam pipeline operates on. It can be bounded or unbounded depending on your source type.
PTransforms
In simple terms, transforms are the operations of your pipeline. They provide the processing logic, and this logic is applied to each element of one or more input PCollections.
Example : PTransform<PCollection<X>,PCollection<Y>> will transform X to Y.
Based on the processing paradigm, Beam provides multiple core transforms - ParDo, GroupByKey, Flatten, Combine, etc.
I/O Transforms
When you create a pipeline, you need a data source to read from, such as a file or a database. Likewise, you want to emit your result data to an external storage system, such as a topic or an object store. The transforms which deal with external input and output are I/O transforms.
Usually, for an external system, you will have the following:
Source : A PTransform to read data from an external system (like a file or a database). It expects a PBegin (pipeline entry point) and returns a PCollection.
PTransform<PBegin,PCollection>
This would be one of the entry points of your pipeline.
Sink : A PTransform that will output data to an external system (like a topic or object storage). It expects a PCollection and returns a PDone (pipeline exit point).
PTransform<PCollection,PDone>
This would be one of the exit points of your pipeline.
The combination of a source and a sink is an I/O connector, like RedisIO, PubSubIO, etc. Beam provides multiple built-in connectors, and one can also write a custom one.
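As an illustration (a minimal sketch with placeholder gs:// paths), the pipeline below shows where each concept sits: ReadFromText is the source (PBegin to PCollection), the middle transforms map PCollections to PCollections, and WriteToText is the sink (PCollection to PDone).

import apache_beam as beam

with beam.Pipeline() as p:
    lines = p | 'Source' >> beam.io.ReadFromText('gs://my-bucket/input.txt')  # PBegin -> PCollection
    counts = (lines
              | 'Split' >> beam.FlatMap(str.split)                 # PCollection -> PCollection
              | 'Count' >> beam.combiners.Count.PerElement())      # core transform
    _ = (counts
         | 'Format' >> beam.MapTuple(lambda word, n: '%s,%d' % (word, n))
         | 'Sink' >> beam.io.WriteToText('gs://my-bucket/output')) # PCollection -> PDone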
There are various other concepts and extensions of the above that allow users to program complex requirements that can run on different runners. This is what makes Beam so powerful.
Solution
In your case, ib.show(data, include_window_info=False) is throwing the below error
ValueError: The given pcoll PDone[WriteToPubSub/Write/NativeWrite.None] is not a dict, an iterable or a PCollection.
Source Code
Your data contains the result of beam.io.WriteToPubSub(topic=deadletter_topic), which is a sink and returns a PDone, not a PCollection.
For your use case of publishing BigQuery write failures to Pub/Sub, you could do something like the following:
from apache_beam.io.gcp.bigquery import BigQueryWriteFn

data = (p
        | beam.io.ReadFromPubSub(topic=topic)
        | 'Write to BQ' >> beam.io.WriteToBigQuery( ...))
(data[BigQueryWriteFn.FAILED_ROWS]
 | 'publish failed' >> beam.io.WriteToPubSub(topic=deadletter_topic))
However, if this does not solve your issue, posting the code would be useful. Alternatively, you could write a custom PTransform with output tags that writes to BQ and returns failures (via tuple tags) for publishing to Pub/Sub.
P.S. : WriteToBigQuery is not a sink, but a composite PTransform that writes to BigQuery and returns the failures.
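A sketch of that tuple-tag pattern is below. The TryParse DoFn, the 'failed' tag and the JSON check are hypothetical stand-ins for whatever failure condition you actually have, and WriteToBigQuery( ...) is left as a placeholder.

import json

import apache_beam as beam
from apache_beam import pvalue

class TryParse(beam.DoFn):
    def process(self, element):
        try:
            # Good records go to the main output.
            yield json.loads(element)
        except Exception:
            # Bad records are routed to a separate tagged output.
            yield pvalue.TaggedOutput('failed', element)

results = (p
           | beam.io.ReadFromPubSub(topic=topic)
           | beam.ParDo(TryParse()).with_outputs('failed', main='parsed'))

results.parsed | 'Write to BQ' >> beam.io.WriteToBigQuery( ...)
results.failed | 'To deadletter' >> beam.io.WriteToPubSub(topic=deadletter_topic)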
I'm trying to distribute a large Dask Dataframe across multiple machines for (later) distributed computations on the dataframe. I'm using dask-distributed for this.
All the dask-distributed examples/docs I see populate the initial data load from a network resource (HDFS, S3, etc.) and do not appear to extend the DAG optimization to the load portion (they seem to assume that a network load is a necessary evil and just eat the initial cost). This is underscored by the answer to another question: Does Dask communicate with HDFS to optimize for data locality?
However, I can see cases where we would want this. For example, if we have a sharded database with dask workers co-located on the nodes of this DB, we would want to force records from only the local shard to be populated into the local dask workers. From the documentation/examples, network criss-cross seems like a necessarily assumed cost. Is it possible to force parts of a single dataframe to be obtained from specific workers?
The alternative, which I've tried, is to try and force each worker to run a function (iteratively submitted to each worker) where the function loads only the data local to that machine/shard. This works, and I have a bunch of optimally local dataframes with the same column schema -- however -- now I don't have a single dataframe but n dataframes. Is it possible to merge/fuse dataframes across multiple machines so there is a single dataframe reference, but portions have affinity (within reason, as decided by the task DAG) to specific machines?
You can produce dask "collections" such as a dataframe from futures and delayed objects, which inter-operate nicely with each other.
For each partition, where you know which machine should load it, you can produce a future as follows:
f = c.submit(make_part_function, args, workers={'my.worker.ip'})
where c is the dask client and the address is the machine you'd want to see it happen on. You can also give allow_other_workers=True if this is a preference rather than a requirement.
To make a dataframe, from a list of such futures, you could do
df = dd.from_delayed([dask.delayed(f) for f in futures])
and ideally provide a meta=, giving a description of the expected dataframe. Now, further operations on a given partition will prefer to be scheduled on the same worker which already holds the data.
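Putting those pieces together, a sketch might look like the following. The shard paths, worker addresses and column layout are all hypothetical; the key points are pinning each load with workers= and passing meta= so Dask knows the schema without sampling.

import pandas as pd
import dask
import dask.dataframe as dd
from dask.distributed import Client

c = Client('scheduler-address:8786')  # assumed scheduler address

def load_local_shard(shard_id):
    # Hypothetical helper: each worker reads only the shard stored on its own disk.
    return pd.read_parquet('/local/data/shard_%d.parquet' % shard_id)

# Pin each partition's load to the machine that holds that shard.
futures = [
    c.submit(load_local_shard, 0, workers={'10.0.0.1'}),
    c.submit(load_local_shard, 1, workers={'10.0.0.2'}),
]

# Describe the expected columns/dtypes so Dask does not have to guess.
meta = pd.DataFrame({'user_id': pd.Series(dtype='int64'),
                     'value': pd.Series(dtype='float64')})

df = dd.from_delayed([dask.delayed(f) for f in futures], meta=meta)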
I am also interested in being able to restrict computation to a specific node (and data localized to that node). I have tried to implement the above with a simple script (see below), but looking at the resulting dataframe results in the following error (from dask/dataframe/utils.py::check_meta()):
ValueError: Metadata mismatch found in `from_delayed`.
Expected partition of type `DataFrame` but got `DataFrame`
Example:
from dask.distributed import Client
import dask.dataframe as dd
import dask
client = Client(address='<scheduler_ip>:8786')
client.restart()
filename_1 = 'http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv'
filename_2 = 'http://samplecsvs.s3.amazonaws.com/SalesJan2009.csv'
future_1 = client.submit(dd.read_csv, filename_1, workers='w1')
future_2 = client.submit(dd.read_csv, filename_2, workers='w2')
client.has_what()
# Returns: {'tcp://<w1_ip>:41942': ('read_csv-c08b231bb22718946756cf46b2e0f5a1',),
# 'tcp://<w2_ip>:41942': ('read_csv-e27881faa0f641e3550a8d28f8d0e11d',)}
df = dd.from_delayed([dask.delayed(f) for f in [future_1, future_2]])
type(df)
# Returns: dask.dataframe.core.DataFrame
df.head()
# Returns:
# ValueError: Metadata mismatch found in `from_delayed`.
# Expected partition of type `DataFrame` but got `DataFrame`
Note: The dask environment has two worker nodes (aliased to w1 and w2) and a scheduler node, and the script is running on an external host.
dask==1.2.2, distributed==1.28.1
It is odd to call many dask dataframe functions in parallel. Perhaps you meant to call many Pandas read_csv calls in parallel instead?
# future_1 = client.submit(dd.read_csv, filename_1, workers='w1')
# future_2 = client.submit(dd.read_csv, filename_2, workers='w2')
import pandas

future_1 = client.submit(pandas.read_csv, filename_1, workers='w1')
future_2 = client.submit(pandas.read_csv, filename_2, workers='w2')
See https://docs.dask.org/en/latest/delayed-best-practices.html#don-t-call-dask-delayed-on-other-dask-collections for more information
I have a dynamic Dask Kubernetes cluster.
I want to load 35 parquet files (about 1.2GB) from Gcloud storage into a Dask Dataframe, process it with apply(), and then save the result back to a parquet file on Gcloud.
While loading files from Gcloud storage, the cluster's memory usage increases to about 3-4GB. Then the workers (each worker has 2GB of RAM) are terminated/restarted and some tasks get lost,
so the cluster starts computing the same things in a circle.
I removed the apply() operation and left only read_parquet() to test
whether my custom code causes the trouble, but the problem was the same, even with just a single read_parquet() operation. This is the code:
client = Client('<ip>:8786')
client.restart()

def command():
    client = get_client()
    df = dd.read_parquet('gcs://<bucket>/files/name_*.parquet', storage_options={'token':'cloud'}, engine='fastparquet')
    df = df.compute()

x = client.submit(command)
x.result()
Note: I'm submitting a single command function to run all necessary commands to avoid problems with gcsfs authentication inside a cluster
After some investigation, I understood that the problem could be .compute(), which returns all the data to a single process, and that process (my command function) is running on a worker. Because of that, the worker doesn't have enough RAM, crashes and loses all the computed tasks, which triggers the tasks to re-run.
My goal is:
to read from parquet files
perform some computations with apply()
and without even returning data from a cluster write it back to Gcloud storage in parquet format.
So, simply put, I want to keep the data on the cluster and not return it. Just compute and save the data somewhere else.
After reading the Dask distributed docs, I found the client.persist()/compute() and .scatter() methods. They look like what I need, but I don't really understand how to use them.
Could you please help me with the client.persist() and client.compute() methods for my example,
or suggest another way to do it? Thank you very much!
Dask version: 0.19.1
Dask distributed version: 1.23.1
Python version: 3.5.1
df = dd.read_parquet('gcs://<bucket>/files/name_*.parquet', storage_options={'token':'cloud'}, engine='fastparquet')
df = df.compute() # this triggers computations, but brings all of the data to one machine and creates a Pandas dataframe
df = df.persist() # this triggers computations, but keeps all of the data in multiple pandas dataframes spread across multiple machines
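For the stated goal of writing results back to GCS without pulling data to any single machine, a sketch along these lines should work. Here transform_row, some_column and the output path are placeholders; each worker writes its own partitions directly, so nothing is gathered on the client.

import dask.dataframe as dd
from dask.distributed import Client

client = Client('<ip>:8786')

df = dd.read_parquet('gcs://<bucket>/files/name_*.parquet',
                     storage_options={'token': 'cloud'}, engine='fastparquet')

# Lazily apply the per-row computation; meta tells Dask the output dtype.
# transform_row is a placeholder for your own function.
df['result'] = df['some_column'].apply(transform_row, meta=('result', 'f8'))

# Writing is also done on the workers; no compute() on the full result.
df.to_parquet('gcs://<bucket>/output/',
              storage_options={'token': 'cloud'}, engine='fastparquet')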
I have built a Microsoft Azure ML Studio workspace predictive web service, and have a scenario where I need to be able to run the service with different training datasets.
I know I can setup multiple web services via Azure ML, each with a different training set attached, but I am trying to find a way to do it all within the same workspace and passing a Web Input Parameter as the input value to choose which training set to use.
I have found this article, which describes almost exactly my scenario. However, that article relies on the training dataset pulled in by the Load Trained Data module having a static endpoint (or blob storage location). I don't see any way to dynamically (or conditionally) change this location based on a Web Input Parameter.
Basically, does Azure ML support a "conditional training data" loading?
Or, might there be a way to combine training datasets, then filter based on the passed Web Input Parameter?
This probably isn't exactly what you need, but hopefully it helps you out.
To combine data sets, you can use the Join Data module.
To filter, that may be accomplished by executing a Python script. Here's an example.
Using the Adult Census Income Binary Classification dataset, the age column has a minimum age of 17.
If I wanted to filter the dataset by age, I would connect it to an Execute Python Script module; here's the filtering code using the pandas query method.
# The script MUST contain a function named azureml_main
# which is the entry point for this module.
import pandas as pd

def azureml_main(dataframe1 = None, dataframe2 = None):
    # Return value must be of a sequence of pandas.DataFrame
    return dataframe1.query("age >= 25")
Looking at the output, the dataset is now filtered so that the minimum age is 25.
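The same pattern could drive your scenario: if the combined training data carried a column identifying which source dataset each row came from, the script could filter on that instead. In this sketch, dataset_id and 'set_a' are hypothetical names.

import pandas as pd

def azureml_main(dataframe1 = None, dataframe2 = None):
    # dataset_id is a hypothetical column marking which training set each row came from.
    return dataframe1.query("dataset_id == 'set_a'")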
Sure, you can do that. What you would want is to use an Execute R Script or SQL Transformation module to determine, based on your input data, which model to use. Something like this:
Notice that your input data is cleaned/updated/feature-engineered, then passed to two different SQL transforms which tell it to go down one of two paths.
Each path has its own training data.
Note: I am not exactly sure what your use case is, but if it were me, I would instead train two different models using the two different training datasets, then just use those models in my web service rather than actually training in the web service, as that would likely be quite slow.