Shuffle service now enabled by default in apache beam?

Without any changes on our side, our jobs using the python SDK for dataflow have started using the shuffle service:
According to the docs:
To use the service-based Dataflow Shuffle in your batch pipelines, specify the following parameter:
However, we have not enabled this flag.
One major effect of this is that the default disk size on our workers has gone from 250GB to 25GB. In one case, we ran out of disk space while the worker was starting up, which left the job hung and never finishing.
Questions are:
Yes, it is.
I couldn't find any announcement of this change. But it should be updated here. I'll make sure it's up-to-date.
Since Oct. 2020, batch jobs have begun to opt into using Dataflow Shuffle by default. To opt out of using it, specify --experiments=shuffle_mode=appliance.
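As a rough illustration of how the opt-out could be wired into a Python pipeline launcher, here is a minimal sketch. The helper name build_dataflow_args is hypothetical; the experiments flag is the one quoted above, and --disk_size_gb is assumed here to be the Beam Python worker option for pinning the disk size rather than relying on the default.

```python
from typing import List, Optional


def build_dataflow_args(
    use_shuffle_service: bool = False,
    disk_size_gb: Optional[int] = None,
) -> List[str]:
    """Build extra command-line flags for a Python batch pipeline (hypothetical helper)."""
    args: List[str] = []
    if not use_shuffle_service:
        # Opt out of service-based shuffle (flag quoted from the answer above).
        args.append("--experiments=shuffle_mode=appliance")
    if disk_size_gb is not None:
        # Pin the worker disk size explicitly instead of relying on the default.
        args.append(f"--disk_size_gb={disk_size_gb}")
    return args


extra_args = build_dataflow_args(use_shuffle_service=False, disk_size_gb=250)
print(extra_args)
```

The resulting list would be passed to the pipeline together with your usual options, for example by appending it to the arguments given to PipelineOptions.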


Job-based cloud processing solution

I would like to do some cloud processing on a very small cluster of machines (<5).
This processing should be based on 'jobs', where jobs are parameterized scripts that run in a certain docker environment.
As an example for what a job could be:
Run in docker image "my_machine_learning_docker"
Download some machine learning dataset from an internal server
Train some neural network on the dataset
Produce a result and upload it to a server again.
My use cases are not limited to machine learning however.
A job could also be:
Run in docker image "my_image_processing_docker"
Download a certain amount of images from some folder on a machine.
Run some image optimization algorithm on each of the images.
Upload the processed images to another server.
Now what I am looking for is some framework/tool that keeps track of the compute servers, receives my jobs, and dispatches them to an available server. Advanced prioritization, load management or similar is not really required.
It should be possible to query the status of jobs and of the servers via an API (I want to do this from NodeJS).
Potentially, I could imagine this framework/tool dynamically spinning up these compute servers in AWS, Azure or something similar. That would not be a hard requirement though.
I would also like to host this solution myself. So I am not looking for a commercial solution for this.
Now I have done some research, and what I am trying to do has similarities with many, many existing projects, but I have not "quite" found what I am looking for.
Similar things I have found were (selection):
CI/CD solutions such as Jenkins/Gitlab CI. Very similar, but it seems to be tailored very much towards the CI/CD case, and I am not sure whether it is such a good idea to abuse a CI/CD solution for what I am trying to do.
Kubernetes: Appears to be able to do this somehow, but is said to be very complex. It also looks like overkill for what I am trying to do.
Nomad: Appears to be the best fit so far, but it has some proprietary vibes that I am not very much a fan of. Also it still feels a bit complex...
In general, there are many many different projects and frameworks, and it is difficult to find out what the simplest solution is for what I am trying to do.
Can anyone suggest anything or point me in a direction?
Thank you
I would use Jenkins for this use case even if it appears to you as a “simple” one. You can start with the simplest pipeline, which can also deal with increasing complexity of your job. Jenkins has an API and lots of plugins, and it can be run as a container for spinning up in a cloud environment.
It's possible you're looking for something like AWS Batch or Google Dataflow. Out of the box they do scaling, distribution, monitoring, etc.
But if you want to roll your own ....
Option 1: Queues
For your job distribution you are really just looking for a simple message queue that all of the workers listen on. In most messaging platforms, a queue supports deliver-once semantics. For example
Active MQ:
Using queues for load distribution is a common pattern.
A queue-based solution works with both manual and automated load balancing: the more workers you spin up, the more consumers you have reading off the queue. The same messaging solution can be used to gather the results if you need to, using message reply semantics or a dedicated reply channel. You could use the result channel to post progress reports back, and then your main application would know the status of each worker. Alternatively, the workers could drop their status in a database. It probably depends on your preference for collecting results and how large the result sets would be. If they are large enough, you might even just drop results in an S3 bucket or some kind of filesystem.
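The queue pattern described above can be sketched in a few lines of Python. This is only an in-process illustration, assuming threads stand in for worker machines and queue.Queue stands in for a real message broker; run_jobs, the job names, and the status strings are all made up for the example.

```python
import queue
import threading
from typing import Callable, Dict


def run_jobs(jobs: Dict[str, Callable[[], str]], num_workers: int = 2) -> Dict[str, str]:
    """Dispatch jobs to worker threads via a shared queue and return each job's status."""
    job_queue: "queue.Queue[str]" = queue.Queue()
    status: Dict[str, str] = {job_id: "queued" for job_id in jobs}
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                # Each queued job is handed to exactly one worker.
                job_id = job_queue.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            with lock:
                status[job_id] = "running"
            try:
                outcome = f"done: {jobs[job_id]()}"
            except Exception as exc:  # record failures instead of losing them
                outcome = f"failed: {exc}"
            with lock:
                status[job_id] = outcome

    # All jobs are enqueued before any worker starts, so an empty queue means "finished".
    for job_id in jobs:
        job_queue.put(job_id)
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return status


def failing_job() -> str:
    raise RuntimeError("dataset missing")


final_status = run_jobs({
    "train-model": lambda: "model uploaded",
    "optimize-images": lambda: "images processed",
    "broken-job": failing_job,
})
print(final_status)
```

With a real broker, the status dictionary would be replaced by the reply channel or database mentioned above, and that is what your API would query.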
You could use something quite simple to manage the workers. Jenkins was already suggested and is definitely a solution I have seen used for running multiple instances across many servers, as you just need to install the Jenkins agent on each of the workers. This can work quite easily if you own or manage the physical servers it's running on. You could use TeamCity as well.
If you want something cloud hosted, it may depend on the technology you use. Kubernetes is probably overkill here, but certainly could be used to spin up N nodes and increase or decrease the number of workers. To autoscale, you could publish a single metric, the queue depth, and trigger an increase in the number of workers based on how deep the queue is and a threshold you work out from the cost of spinning up new nodes versus the rate at which jobs are processed.
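A minimal sketch of that queue-depth rule, assuming a fixed target backlog per worker. The function name, the jobs_per_worker threshold, and the default cap of 4 workers (chosen to match the "fewer than 5 machines" in the question) are illustrative assumptions, not part of any particular platform.

```python
import math


def desired_workers(
    queue_depth: int,
    jobs_per_worker: int,
    min_workers: int = 0,
    max_workers: int = 4,
) -> int:
    """Return how many workers to run for the current queue depth."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    # One worker per `jobs_per_worker` queued jobs, rounded up...
    needed = math.ceil(queue_depth / jobs_per_worker)
    # ...then clamped to the size of the cluster you are willing to pay for.
    return max(min_workers, min(max_workers, needed))


print(desired_workers(queue_depth=25, jobs_per_worker=10))
```

The jobs_per_worker value is where the trade-off between node start-up cost and processing rate would be encoded.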
You could also look at some of the lightweight managed container solutions like or Heroku which are both much easier to setup than K8s and would let you scale up easily.
Option 2: Web workers
Can you design your solution so that it can be run as a cloud function/web worker?
If so you could set them up so that scaling is fully automated. You would hit the cloud function end point to request each job. The hosting engine would take care of the distribution and scaling of the workers. The results would be passed back in the body of the HTTP response ... a json blob.
Your workload may be too large for these solutions, but if it's actually fairly lightweight and quick, it could be a simple option.
I don't think these solutions would let you query the status of tasks easily.
If this option seems appealing there are quite a few choices:
Option 3: Google Cloud Tasks
This is a bit of a hybrid option. Essentially, GCP has a queue distribution workflow where the endpoint is a cloud function or some other supported worker, including Cloud Run, which uses Docker images. I've not actually used it myself, but maybe it fits the bill.
When I look at a problem like this, I think through the entirety of the data paths: the map between source image and target image, and any metadata or status information that needs to be collected. Additionally, failure conditions need to be handled, especially if a production service is going to be built.
I prefer running Python, Pyspark with Pandas UDFs to perform the orchestration and image processing.
S3FS lets me access s3. If using Azure or Google, Databricks' DBFS lets me seamlessly read and write to cloud storage without 2 extra copy file steps.
Pyspark's binaryFile data source lets me list all of the input files to be processed. Spark lets me run this in batch or an incremental/streaming configuration. This design optimizes for end to end data flow and data reliability.
For a cluster manager I use Databricks, which lets me easily provision an auto-scaling cluster. The Databricks cluster manager lets users deploy docker containers or use cluster libraries or notebook scoped libraries.
The example below assumes the image is > 32MB and processes it out of band. If the image is in the KB range then dropping the content is not necessary and in-line processing can be faster (and simpler).
Pseudo code:
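import pandas as pd
from typing import Iterator
from pyspark.sql.functions import col, pandas_udf

df = (
    spark.read.format("binaryFile")  # spark: the active SparkSession
    .option("pathGlobFilter", "*.png")
    .load("/path/to/images")  # hypothetical input directory
    .drop("content")  # keep only file metadata; images are processed out of band
)

def do_image_xform(path: str) -> str:
    # Do image transformation, read from dbfs path, write to dbfs path
    # return xform status
    return "success"

@pandas_udf("string")
def do_image_xform_udf(iterator: Iterator[pd.Series]) -> Iterator[pd.Series]:
    for paths in iterator:  # each item is a batch (pd.Series) of paths
        yield paths.apply(do_image_xform)

df_status = df.withColumn("status", do_image_xform_udf(col("path")))
df_status.write.saveAsTable("status_table")  # triggers execution, saves status.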

Confluent Platform KSQL in headless mode

I've read through the KSQL deployment options here, and it is recommended to use headless KSQL for production deployment.
But I have not found any hints on how I can stop/change queries when in production (headless) mode when KSQL disables interactive access to server via REST/CLI. Does that mean that I need to shut down all KSQL servers in order to add/change one query?
You can deploy headless or interactive into production, depending on what meets your needs.
Headless is designed to allow you to run a known set of queries in a locked down fashion. This can be a requirement for production systems with strict SLAs, where you don't want someone connecting and kicking off an expensive query or dropping something that causes SLAs to be broken.
As you correctly identify, the Headless deployment mode doesn't allow you to change the DDL of your cluster through a CLI/API. Instead, it would be more normal to have some kind of automation around updating the SQL file and bouncing the cluster. We are aware there is much room for improvement here.
Keep in mind that KSQL does not, at the time of writing, support updating an existing table or stream. However, this is something we're actively working towards. Until that is supported, in general you should only add queries to the file. Any deletions or changes to existing queries would require careful testing, as there are many changes KSQL does not currently support. Always ensure changes are thoroughly tested before any prod deployment. Alternatively, some users spin up new clusters when changes need to be made (hopefully infrequently!). Once the new cluster has caught up, they fail over clients and turn off the old cluster. Again, this is an area in which KSQL will see improvements.
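As an illustration of the kind of automation that could guard the "only add queries to the file" rule before bouncing the cluster, here is a minimal Python sketch. It is not part of KSQL; the function names are hypothetical, and splitting on ';' is a naive assumption that breaks if a statement contains a semicolon inside a string literal.

```python
from typing import List


def split_statements(sql_text: str) -> List[str]:
    """Naively split a queries file on ';' and normalize whitespace."""
    return [" ".join(part.split()) for part in sql_text.split(";") if part.strip()]


def is_append_only(old_sql: str, new_sql: str) -> bool:
    """True if new_sql keeps every old statement unchanged, in order, and only adds more."""
    old_stmts = split_statements(old_sql)
    new_stmts = split_statements(new_sql)
    return new_stmts[: len(old_stmts)] == old_stmts


old_file = "CREATE STREAM s1 AS SELECT * FROM src;"
appended = old_file + "\nCREATE STREAM s2 AS SELECT * FROM s1;"
modified = "CREATE STREAM s1 AS SELECT id FROM src;"

print(is_append_only(old_file, appended))
print(is_append_only(old_file, modified))
```

A deployment script could refuse to roll out a queries file that fails this check and route such changes to the "new cluster and fail over" path instead.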
Hope this helps and thanks for using KSQL!

Setting node selector for spring cloud dataflow task and stream deployments on kubernetes

We want to fix all our spring cloud dataflow task and stream deployments to a particular set of nodes.
I have this working manually for a sample task eg
task launch test-timestamp --properties "deployer.*.kubernetes.deployment.nodeSelector=env:development"
(this wasn't obvious as the documentation here seems to imply the key is just nodeSelector not deployment.nodeSelector)
This correctly adds the node selector into the pod yaml for kubernetes.
But I want this to be set automatically ie using task.platform.kubernetes.accounts.default properties in the SCDF server config.
I've tried:
task.platform.kubernetes.accounts.default.deployment.nodeSelector: env:development
task.platform.kubernetes.accounts.default.nodeSelector: env:development
but neither seems to work. What is the correct way to configure this?
Same question for stream deployments via skipper.
Also how do I set this up for scheduled tasks?
Sorry that you've had to try a few options to get to the bottom of finding the right deployer property that actually works.
In general, from SCDF's Shell/UI, the deployer token is a short-form for property. It's a repetitive thing to supply when you have more deployer properties to configure in a stream/task, so we have a short-form for that reason.
However, nodeSelector is not a deployer-level property with a default. It is only available as a deployment-level property, which means it can only be set on a per-deployment basis.
To put it differently, it is not available as an option for "global" configuration, which is why task.platform.kubernetes.accounts.default.deployment.nodeSelector: env:development is not taken into account. The same is true for Streams through Skipper.
It can be improved, though. I created spring-cloud/spring-cloud-deployer-kubernetes#300 for tracking - feel free to subscribe to the notifications. Both Streams and Tasks should then be able to take advantage of it as a global configuration. Once the PR is merged, you should be able to try it with SCDF's 2.2.0.BUILD-SNAPSHOT image.
As for the K8s-scheduler implementation, we do not have support for nodeSelectors yet. I created spring-cloud/spring-cloud-scheduler-kubernetes#25 - we could collaborate on a PR if you want to port the functionality from K8s-deployer.
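Until a global option exists, the per-deployment property has to be supplied on every launch. A small Python sketch of generating the shell command from the question for each task; the helper names are hypothetical, and only the single key:value form shown in the question is assumed.

```python
def node_selector_property(label_key: str, label_value: str) -> str:
    """Build the per-deployment nodeSelector property used in the question."""
    return f"deployer.*.kubernetes.deployment.nodeSelector={label_key}:{label_value}"


def task_launch_command(task_name: str, label_key: str, label_value: str) -> str:
    """Build the SCDF shell command that launches a task pinned to labelled nodes."""
    prop = node_selector_property(label_key, label_value)
    return f'task launch {task_name} --properties "{prop}"'


command = task_launch_command("test-timestamp", "env", "development")
print(command)
```

Generating the command this way keeps the selector in one place, so switching to a global setting later means deleting one helper call.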

Google Cloud DataFlow Autoscaling not working

I'm running a dataflow job that has 800K files to process.
The job id is 2018-08-23_07_07_46-4958738268363865409.
It reports that it has successfully listed 800K files, but for some odd reason, the autoscaler only assigned 1 worker to it. Since its processing rate is 2/sec, this is going to take a loooong time.
I didn't touch the default scaler settings which to my knowledge means it can scale freely up to 100 workers.
Why doesn't it scale?
Following Neri's suggestion, I started a new job (id 2018-08-29_13_47_04-1454220104656653184) and set autoscaling_algorithm=THROUGHPUT_BASED even though according to the documentation it should default to that anyway. Same behavior: processing speed is at 1 element per second and I have only one worker.
What's the use of running in the cloud if you cannot scale?
In order to autoscale your Dataflow Job, be sure that you use autoscalingAlgorithm = THROUGHPUT_BASED.
If you use "autoscalingAlgorithm":"NONE", your Dataflow job will stay at a fixed number of workers even if it could autoscale. In that case, you need to specify the number of workers you want with numWorkers.
Also, to scale to the number of workers you want, be sure to specify (for numWorkers and maxNumWorkers) a number equal to or lower than your quota. Check your quota by using:
gcloud compute project-info describe
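To turn that quota output into an upper bound for maxNumWorkers, something like the following Python sketch could be used. The JSON shape (a quotas list with metric, limit, and usage fields) and the CPUS_ALL_REGIONS metric name are assumptions about the JSON form of the command's output; the sample values are made up, and other quotas (regional CPUs, IP addresses, disks) may be the binding limit in practice.

```python
import json
import math

# Made-up sample in the assumed shape of the command's JSON output.
SAMPLE_PROJECT_INFO = """
{"quotas": [
  {"metric": "SNAPSHOTS", "limit": 1000.0, "usage": 1.0},
  {"metric": "CPUS_ALL_REGIONS", "limit": 24.0, "usage": 4.0}
]}
"""


def max_workers_for_quota(
    project_info_json: str,
    cores_per_worker: int,
    metric: str = "CPUS_ALL_REGIONS",
) -> int:
    """Return how many workers still fit into the remaining quota for `metric`."""
    info = json.loads(project_info_json)
    for quota in info.get("quotas", []):
        if quota["metric"] == metric:
            free = quota["limit"] - quota["usage"]
            return max(0, math.floor(free / cores_per_worker))
    raise KeyError(f"quota metric not found: {metric}")


# Example: workers with 2 cores each.
print(max_workers_for_quota(SAMPLE_PROJECT_INFO, cores_per_worker=2))
```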

Cloud Dataflow Running really slow when reading/writing from Cloud Storage (GCS)

Since using the release of the latest build of Cloud Dataflow (0.4.150414) our jobs are running really slow when reading from cloud storage (GCS). After running for 20 minutes with 10 VMs we were only able to read in about 20 records when previously we could read in millions without issue.
It seems to be hanging, although no errors are being reported back to the console.
We received an email informing us that the latest build would be slower and that it could be countered by using more VMs but we got similar results with 50 VMs.
Here is the job id for reference: 2015-04-22_22_20_21-5463648738106751600
Instance: n1-standard-2
Region: us-central1-a
Your job seems to be using side inputs to a DoFn. Since there has been a recent change in how Cloud Dataflow SDK for Java handles side inputs, it is likely that your performance issue is related to that. I'm reposting my answer from a related question.
The evidence seems to indicate that there is an issue with how your pipeline handles side inputs. Specifically, it's quite likely that side inputs may be getting re-read from BigQuery again and again, for every element of the main input. This is completely orthogonal to the changes to the type of virtual machines used by Dataflow workers, described below.
This is closely related to the changes made in the Dataflow SDK for Java, version 0.3.150326. In that release, we changed the side input API to apply per window. Calls to sideInput() now return values only in the specific window corresponding to the window of the main input element, and not the whole side input PCollectionView. Consequently, sideInput() can no longer be called from startBundle and finishBundle of a DoFn because the window is not yet known.
For example, the following code snippet has an issue that would cause re-reading side input for every input element.
public void processElement(ProcessContext c) throws Exception {
  // The side input is re-read here for every element of the main input.
  Iterable<String> uniqueIds = c.sideInput(iterableView);
  for (String item : uniqueIds) {
    // ...
  }
}
This code can be improved by caching the side input to a List member variable of the transform (assuming it fits into memory) during the first call to processElement, and use that cached List instead of the side input in subsequent calls.
This workaround should restore the performance you were seeing before, when side inputs could have been called from startBundle. Long-term, we will work on better caching for side inputs. (If this doesn't help fully resolve the issue, please reach out to us via email and share the relevant code snippets.)
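The caching idea can be shown in isolation with a short sketch. It is written in Python rather than Java and does not use the Dataflow SDK: read_side_input is a hypothetical callable standing in for c.sideInput(iterableView), and the sketch ignores windowing, so it only applies when every element sees the same side input and that input fits in memory.

```python
from typing import Callable, Iterable, List, Optional


class CachingLookup:
    """Caches the side input on the first element instead of re-reading it each time."""

    def __init__(self, read_side_input: Callable[[], Iterable[str]]) -> None:
        self._read_side_input = read_side_input
        self._cached_ids: Optional[List[str]] = None

    def process_element(self, element: str) -> bool:
        if self._cached_ids is None:
            # First call only: materialize the side input into memory.
            self._cached_ids = list(self._read_side_input())
        return element in self._cached_ids


read_count = 0


def expensive_read() -> Iterable[str]:
    """Stand-in for an expensive side-input read; counts how often it is invoked."""
    global read_count
    read_count += 1
    return iter(["a", "b", "c"])


fn = CachingLookup(expensive_read)
results = [fn.process_element(e) for e in ["a", "x", "c"]]
print(results, read_count)
```

The counter makes the effect visible: three elements are processed, but the expensive read happens once.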
Separately, there was, indeed, an update to the Cloud Dataflow Service around 4/9/15 that changed the default type of virtual machines used by Dataflow workers. Specifically, we reduced the default number of cores per worker because our benchmarks showed it as cost effective for typical jobs. This is not a slowdown in the Dataflow Service of any kind -- it just runs with fewer resources per worker, by default. Users are still given the options to override both the number of workers as well as the type of the virtual machine used by workers.
We had a similar issue. It occurs when the side input reads from a BigQuery table that has had its data streamed in, rather than bulk loaded. When we copy the table(s) and read from the copies instead, everything works fine.
If your tables are streamed, try copying them and reading the copies instead. This is a workaround.
See: Dataflow performance issues
