Google Dataflow creates only one worker for large .bz2 file - google-cloud-dataflow

I am trying to process the Wikidata json dump using Cloud Dataflow.
I have downloaded the file from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 and hosted it into a GS bucket. It's a large (50G) .bz2 file containing a list of json dicts (one per line).
I understand that apache_beam.io.ReadFromText can handle .bz2 (I tested that on toy datasets) and that .bz2 is splittable. Therefore I was hoping that multiple workers would be created that would work in parallel on different blocks of that unique file (I'm not totally clear if/how blocks would res.
Ultimately I want to do some analytics on each line (each json dict) but as a test for ingestion I am just using the project's wordcount.py:
python -m apache_beam.examples.wordcount \
--input gs://MYBUCKET/wikidata/latest-all.json.bz2 \
--output gs://MYBUCKET/wikidata/output/entities-all.json \
--runner DataflowRunner \
--project MYPROJECT \
--temp_location gs://MYBUCKET/tmp/
At startup, autoscaling quickly increases the number of workers 1->6 but only one worker does any work and then autoscaling scales back 6->1 after a couple minutes (jobid: 2018-10-11_00_45_54-9419516948329946918)
If I disable autoscaling and set explicitly the number of workers, then all but one remain idle.
Can parallelism be achieved on this sort of input? Many thanks for any help.

Other than Hadoop, Apache Beam has not yet implemented bzip2 splitting: https://issues.apache.org/jira/browse/BEAM-683

Related

PBS array job parallelization

I am trying to submit a job on high compute cluster that needs to run a python code lets say 10000 times. I used gnu parallel but then IT team sent me a mail stating that my job is creating too many ssh login logs in their monitoring system. They asked me to use job arrays instead. My code takes about 12 seconds to run. I believe I need to use #PBS -J statement in my PBS script. Then, I am not sure if it will run in parallel. I need to execute my code lets say on 10 nodes 16 cores each i.e. 160 instances of my code running in parallel. How can I parallelize it i.e. run many instances of my code at a given time utilizing all the resources I have?
Below is the initial pbs script with gnu parallel:
#!/bin/bash
#PBS -P My_project
#PBS -N my_job
#PBS -l select=10:ncpus=16:mem=4GB
#PBS -l walltime=01:30:00
module load anaconda
module load parallel
cd $PBS_O_WORKDIR
JOBSPERNODE=16
parallel --joblog jobs.log --wd $PBS_O_WORKDIR -j $JOBSPERNODE --sshloginfile $PBS_NODEFILE --env PATH "python $PBS_O_WORKDIR/xyz.py" :::: inputs.txt
inputs.txt is a fie with integer values 0-9999 in each line which is fed to my python code as an argument. Code is highly independent and output of one instance does not affect another.
a little late but thought I'd answer anyway.
Arrays will run in parallel, but the number of jobs running at once will depend on the availability of nodes and the limit of jobs per user per queue. Essentially, each HPC will be slightly different.
Adding #PBS -J 1-10000 will create an array of 10000 jobs, and assuming the syntax is the same as the HPC I use, something like ID=$(sed -n "${PBS_ARRAY_INDEX}p" /path/to/inputs.txt) will then be the integers from inputs.txt whereby PBS array number 123 will return the 123rd line of inputs.txt.
Alternatively, since you're on an HPC, if the jobs are only taking 12 seconds each, and you have 10000 iterations, then a for loop will also complete the entire process in 33.33 hours.

Getting Dataflowrunner with --experiments=upload_graph to work

I have a pipeline that produces a dataflow graph (serialized JSON representation) that exceeds the allowable limit for the API, and thus cannot be launched via the dataflow runner for apache beam as one would normally do. And running dataflow runner with the instructed parameter --experiments=upload_graph does not work and fails saying there are no steps specified .
When getting notified about this size problem via an error, the following information is provided:
the size of the serialized JSON representation of the pipeline exceeds the allowable limit for the API.
Use experiment 'upload_graph' (--experiments=upload_graph)
to direct the runner to upload the JSON to your
GCS staging bucket instead of embedding in the API request.
Now using this parameter, does indeed result in dataflow runner uploading an additional dataflow_graph.pb file to the staging location beside the usual pipeline.pb file. Which I verified actually exists in gcp storage.
However the job in gcp dataflow then immediately fails after start with the following error:
Runnable workflow has no steps specified.
I've tried this flag with various pipelines, even apache beam example pipelines and see the same behaviour.
This can be reproduced by using word count example:
mvn archetype:generate \
-DarchetypeGroupId=org.apache.beam \
-DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
-DarchetypeVersion=2.11.0 \
-DgroupId=org.example \
-DartifactId=word-count-beam \
-Dversion="0.1" \
-Dpackage=org.apache.beam.examples \
-DinteractiveMode=false
cd word-count-beam/
Running it without the experiments=upload_graph parameter works:
(make sure to specify your project, and buckets if you want to run this)
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
--gcpTempLocation=gs://<your-gcs-bucket>/tmp \
--inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
-Pdataflow-runner
Running it with the experiments=upload_graph results in pipe failing with message workflow has no steps specified
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
--gcpTempLocation=gs://<your-gcs-bucket>/tmp \
--experiments=upload_graph \
--inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
-Pdataflow-runner
Now I would expect that dataflow runner would direct gcp dataflow to read the steps from the bucket specified as seen in the source code:
https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L881
However this seems not to be the case. Has anyone gotten this to work, or has found some documentation regarding this feature that can point me in the right direction?
The experiment has since been reverted and the messaging will be corrected in Beam 2.13.0
Revert PR
I recently ran into this issue and the solution was quite silly. I had quite a complex dataflow streaming job developed and it was working fine and the next day stopped working with error "Runnable workflow has no steps specified.". In my case, someone specified pipeline().run().waitUntilFinish() twice after creating options and due to that, I was getting this error. Removing the duplicate pipeline run resolved the issue. I still think there should be some useful error trace by beam/dataflowrunner in this scenario.

Make one instance of multiple uWSGI workers perform a extra function

I have a python flask app running on uWSGI with a config file that specifics it to spawn multiple workers (which I am assuming are identical processes).
Everything works well except for one part: the python app runs a bash command to download an update a database every day using a scheduler, which needs to run only once but multiple processes means that it runs multiple times at the same time, thus corrupting the downloaded file.
Is there a way to run this bash command on only one instance of uWSGI workers? I can't run the bash command as a separate cron job (the database update has to integrate seamlessly with the app).
Check The uWSGI cron-like interface
uWSGI’s master has an internal cron-like facility that can generate
events at predefined times. You can use it
You can set the options for example to:
[uwsgi]
; every two hours
cron = 0 -2 -1 -1 -1 /usr/bin/backup_my_home --recursive
Is that sufficient?

How does one write a Cluster Spec for Distributed YoutTube-8m challenge training?

Can someone please post a ClusterSpec for distributed training of the models defined in the YouTube-8m Challenge code?
The code tries to load a cluster spec from TF_CONFIG environment variable. However, I'm not sure what the value for TF_CONFIG should be. I have access to 2 GPUs on one machine and just want to run the model with data-level parallelism.
If you want to run YouTube 8m challenge code in a distributed manner, you have to write a yaml file (There is an example yaml file provided by Google) and then you need to pass as parameter where is located this yaml file. TF_CONFIG makes reference to the configuration variables used to train the model.
For example, for running on google cloud the starting code in a distributed manner, I have used:
JOB_NAME=yt8m_train_$(date +%Y%m%d_%H%M%S); gcloud --verbosity=debug ml-engine jobs \
submit training $JOB_NAME \
--package-path=youtube-8m --module-name=youtube-8m.train \
--staging-bucket=$BUCKET_NAME --region=us-east1 \
--config=youtube-8m/cloudml-gpu-distributed.yaml \
-- --train_data_pattern='gs://youtube8m-ml-us-east1/1/frame_level/train/train*.tfrecord' \
--frame_features=True --model=LstmModel --feature_names="rgb,audio" \
--feature_sizes="1024, 128" --batch_size=128 \
--train_dir=$BUCKET_NAME/${JOB_TO_EVAL}
The parameter config is pointing to the yaml file cloudml-gpu-distributed.yaml with the following specification:
trainingInput:
runtimeVersion: "1.0"
scaleTier: CUSTOM
masterType: standard_gpu
workerCount: 2
workerType: standard_gpu
parameterServerCount: 2
parameterServerType: standard

How to change memory in EMR hadoop streaming job

I am trying to overcome the following error in a hadoop streaming job on EMR.
Container [pid=30356,containerID=container_1391517294402_0148_01_000021] is running beyond physical memory limits
I tried searching for answers but the one I found isn't working. My job is launched as shown below.
hadoop jar ../.versions/2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
-input determinations/part-00000 \
-output determinations/aggregated-0 \
-mapper cat \
-file ./det_maker.py \
-reducer det_maker.py \
-Dmapreduce.reduce.java.opts="-Xmx5120M"
The last line above is supposed to do the trick as far as I understand, but I get the error:
ERROR streaming.StreamJob: Unrecognized option: -Dmapreduce.reduce.java.opts="-Xmx5120M"
What is the correct way change the memory usage ?
Also is there some documentation that explains these things to n00bs like me?
You haven't elaborated on what memory you are running low, physical or virtual.
For both problems, take a look at Amazon's documentation:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H2.html
Usually the solution is to increase the amout of memory per mapper, and possibly reduce the number of mappers:
s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m mapreduce.map.memory.mb=4000
s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m mapred.tasktracker.map.tasks.maximum=2

Resources