Airflow - List of downstream tasks for a task

Is there an Airflow command that lists all downstream tasks of a given task? For example, there are 4 tasks in a DAG - dummy1 > dummy2 > dummy3 > dummy4. I need the list of all downstream tasks of dummy2; the output should be dummy3 and dummy4. Such a command would help when a task has many downstream tasks and manual actions need to be performed only on those downstream tasks.
Dag - dummy1 > dummy2 > dummy3 > dummy4
Output : (downstream tasks list of dummy2)
dummy3
dummy4

There is no CLI command for this, but it is a one-liner in Python.
Let's assume your DAG script is named a_dag.py and the DAG object is referenced by the variable dag.
Then you can do something like this in the terminal:
$ cd airflow/dags
$ ls
a_dag.py
$ python
Python 3.8.2 ...
>>> from a_dag import dag
>>> dag.get_task('dummy2').get_flat_relative_ids()
{'dummy3', 'dummy4'}
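If the DAG is already deployed, the same information can be pulled from a DagBag in a small standalone script, which also lets you loop over the downstream tasks to act on each one. This is only a sketch assuming Airflow 1.10-style APIs; the exact method names and signatures may differ between Airflow versions, and the dag_id used here is assumed.

# list_downstream.py - a sketch, assuming Airflow 1.10-style APIs;
# method names may differ slightly between Airflow versions.
from airflow.models import DagBag

dagbag = DagBag()                 # parses the configured DAGs folder
dag = dagbag.get_dag('a_dag')     # dag_id assumed here; use your own
task = dag.get_task('dummy2')

# task ids only, e.g. {'dummy3', 'dummy4'}
print(task.get_flat_relative_ids(upstream=False))

# or the task objects themselves, if you want to act on each one
for downstream in task.get_flat_relatives(upstream=False):
    print(downstream.task_id)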

Running same DF template in parallel yields strange results

I have a dataflow job that extracts data from Cloud SQL and loads it into Cloud Storage. We've configured the job to accept parameters so we can use the same code to extract multiple tables. The dataflow job is compiled as a template.
When we create/run instances of the template in serial we get the results we expect. However, if we create/run instances in parallel, only a few files turn up on Cloud Storage. In both cases we can see that the Dataflow jobs are created and terminate successfully.
For example, we have 11 instances which produce 11 output files. Run in serial we get all 11 files; run in parallel we only get around 3 files. During the parallel run all 11 instances were running at the same time.
Can anyone offer some advice as to why this is happening? I'm assuming that temporary files created by the DF template are somehow overwritten during the parallel run?
The main motivation of running in parallel is extracting the data more quickly.
Edit
The pipeline is pretty simple:
PCollection<String> results = p
    .apply("Read from Cloud SQL", JdbcIO.<String>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration
            .create(dsDriver, dsConnection)
            .withUsername(options.getCloudSqlUsername())
            .withPassword(options.getCloudSqlPassword())
        )
        .withQuery(options.getCloudSqlExtractSql())
        .withRowMapper(new JdbcIO.RowMapper<String>() {
            @Override
            public String mapRow(ResultSet resultSet) throws Exception {
                return mapRowToJson(resultSet);
            }
        })
        .withCoder(StringUtf8Coder.of()));
When I compile the template I do
mvn compile exec:java \
-Dexec.mainClass=com.xxxx.batch_ingestion.LoadCloudSql \
-Dexec.args="--project=myproject \
--region=europe-west1 \
--stagingLocation=gs://bucket/dataflow/staging/ \
--cloudStorageLocation=gs://bucket/data/ \
--cloudSqlInstanceId=yyyy \
--cloudSqlSchema=dev \
--runner=DataflowRunner \
--templateLocation=gs://bucket/dataflow/template/BatchIngestion"
When I invoke the template I also provide "tempLocation". I can see the dynamic temp locations are being used. Despite this I'm not seeing all the output files when running in parallel.
Thanks
Solution
Add a unique tempLocation per job
Add a unique output path and filename per job
Move the output files to their final destination on Cloud Storage after Dataflow completes its processing
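One way to apply the first two points is to give every template launch its own tempLocation and output location at invocation time. The sketch below uses the google-api-python-client to call the Dataflow templates.launch endpoint; the runtime parameter names (cloudSqlExtractSql, cloudStorageLocation) and the table list are assumptions based on the pipeline options above, so substitute whatever parameters your template actually exposes.

# launch_unique.py - a sketch, assuming google-api-python-client and
# application-default credentials; parameter names below are assumptions.
from googleapiclient.discovery import build

dataflow = build('dataflow', 'v1b3')
project = 'myproject'
template = 'gs://bucket/dataflow/template/BatchIngestion'

for table in ['table_a', 'table_b', 'table_c']:   # hypothetical table list
    body = {
        'jobName': 'batch-ingestion-{}'.format(table),
        'parameters': {
            # assumed runtime parameters - use the ones your template defines
            'cloudSqlExtractSql': 'SELECT * FROM {}'.format(table),
            'cloudStorageLocation': 'gs://bucket/data/{}/'.format(table),
        },
        'environment': {
            # a unique temp location per job keeps parallel runs from
            # overwriting each other's temporary files
            'tempLocation': 'gs://bucket/dataflow/temp/{}/'.format(table),
        },
    }
    dataflow.projects().templates().launch(
        projectId=project, gcsPath=template, body=body).execute()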

Remove some main commands and/or default options from waf in wscript

I have a waf script which adds some options, so I use Options from waflib.
A minimal working example is:
from waflib import Context, Options
from waflib.Tools.compiler_c import c_compiler

def options(opt):
    opt.load('compiler_c')

def configure(cnf):
    cnf.load('compiler_c')
    cnf.env.abc = 'def'

def build(bld):
    print('hello')
This leads to a lot of options I do not support, alongside others I would like to or have to support. The full list of default commands and options is shown below. But how do I remove what is actually not supported, such as:
some main commands, e.g. dist, step and install, or
some options, e.g. --no-msvc-lazy, or
some configuration options, e.g. -t, or
the whole 'Installation and uninstallation options' section?
The full output of the options is then:
waf [commands] [options]

Main commands (example: ./waf build -j4)
  build    : executes the build
  clean    : cleans the project
  configure: configures the project
  dist     : makes a tarball for redistributing the sources
  distcheck: checks if the project compiles (tarball from 'dist')
  distclean: removes build folders and data
  install  : installs the targets on the system
  list     : lists the targets to execute
  step     : executes tasks in a step-by-step fashion, for debugging
  uninstall: removes the targets installed

Options:
  --version             show program's version number and exit
  -c COLORS, --color=COLORS
                        whether to use colors (yes/no/auto) [default: auto]
  -j JOBS, --jobs=JOBS  amount of parallel jobs (8)
  -k, --keep            continue despite errors (-kk to try harder)
  -v, --verbose         verbosity level -v -vv or -vvv [default: 0]
  --zones=ZONES         debugging zones (task_gen, deps, tasks, etc)
  -h, --help            show this help message and exit
  --msvc_version=MSVC_VERSION
                        msvc version, eg: "msvc 10.0,msvc 9.0"
  --msvc_targets=MSVC_TARGETS
                        msvc targets, eg: "x64,arm"
  --no-msvc-lazy        lazily check msvc target environments

  Configuration options:
    -o OUT, --out=OUT   build dir for the project
    -t TOP, --top=TOP   src dir for the project
    --prefix=PREFIX     installation prefix [default: 'C:\\users\\user\\appdata\\local\\temp']
    --bindir=BINDIR     bindir
    --libdir=LIBDIR     libdir
    --check-c-compiler=CHECK_C_COMPILER
                        list of C compilers to try [msvc gcc clang]

  Build and installation options:
    -p, --progress      -p: progress bar; -pp: ide output
    --targets=TARGETS   task generators, e.g. "target1,target2"

  Step options:
    --files=FILES       files to process, by regexp, e.g. "*/main.c,*/test/main.o"

  Installation and uninstallation options:
    --destdir=DESTDIR   installation root [default: '']
    -f, --force         force file installation
    --distcheck-args=ARGS
                        arguments to pass to distcheck
For options, the options context has a parser attribute which is a Python optparse.OptionParser. You can use the remove_option method of OptionParser:
def options(opt):
    opt.parser.remove_option("--top")
    opt.parser.remove_option("--no-msvc-lazy")
For commands, there is a metaclass in waf that automatically registers Context classes (see the waflib.Context sources).
All Context classes are therefore stored in the global variable waflib.Context.classes, and you can manipulate this variable to get rid of them. For instance, to get rid of StepContext and the like, you can do something like:
import waflib

def options(opt):
    all_contexts = waflib.Context.classes
    all_contexts.remove(waflib.Build.StepContext)
    all_contexts.remove(waflib.Build.InstallContext)
    all_contexts.remove(waflib.Build.UninstallContext)
The dist/distcheck commands are a special case defined in waflib.Scripting; it is not easy to get rid of them.
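To hide an entire help section such as "Installation and uninstallation options", one possibility is to filter the underlying optparse option groups by title. This is only a sketch relying on standard optparse attributes (parser.option_groups, group.title) rather than a documented waf API, so verify it against your waf version; it removes the group from the help output, but the options themselves may still be parsed if passed.

# wscript - a sketch; relies on optparse internals, not a documented waf API
def options(opt):
    opt.load('compiler_c')

    # remove individual unwanted options
    opt.parser.remove_option("--no-msvc-lazy")

    # drop a whole option group from the help output by its title
    unwanted = 'Installation and uninstallation options'
    opt.parser.option_groups = [
        grp for grp in opt.parser.option_groups if grp.title != unwanted
    ]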

Executing bash script on multiple lines inside multiple files in parallel using GNU parallel

I want to use GNU parallel for the following problem:
I have a few files each with several lines of text. I would like to understand how I can run a script (code.sh) on each line of text of each file and for each file in parallel. I should be able to write out the output of the operation on each input file to an output file with a different extension.
It seems this is a case of running parallel over all files and, for each file, running parallel again over all the lines inside it.
This is what I used:
ls mydata_* |
parallel -j+0 'cat {} | parallel -I ./explore-bash.sh > {.}.out'
I do not know how to do this using GNU parallel. Please help.
Your solution seems reasonable. You just need to remove -I:
ls mydata_* | parallel -j+0 'cat {} | parallel ./explore-bash.sh > {.}.out'
Depending on your setup this may be faster, as it will only run n jobs at a time, whereas the solution above will run up to n*n jobs in parallel (n = number of cores):
ls mydata_* | parallel -j1 'cat {} | parallel ./explore-bash.sh > {.}.out'

bash gnu parallel argfile syntax

I just discovered GNU parallel and I'm having some trouble running a simple parallel task. I have a simulation running over multiple values and I'd like to split it up to run in parallel using command-line args. From the docs, it seems you can run parallel mycommand :::: myargfile in which myargfile contains the various arguments you would like to feed your command, in parallel. However, I didn't see any information on how the args should be listed and assumed a myargfile like this would work:
--pmin 0 --pmax 0.1
--pmin 0.1 --pmax 0.2
...
mycommand --pmin 0 --pmax 0.1 executes no problem. But when I run parallel mycommand :::: myargfile I get error: unknown option pmin 0 --pmax 0.1 (caught and decoded courtesy of boost program options). parallel echo :::: myargfile correctly prints out the arguments. It's as if they are being wrapped in a single string the program can't read, rather than being passed as separate arguments as they would be from a standard bash script.
What's going on? How can I make this work?
Following @DmitriChubarov's link to https://stackoverflow.com/a/6258206/1328439 , I discovered that I was lacking the --colsep flag:
parallel --colsep ' ' mycommand :::: myargfile
successfully executes.
After digging through the manual and help pages I came up with this example. Perhaps it will save someone out there. :)
#!/usr/bin/env bash
COMMANDS=(
    "cnn -a mode=flat"
    "cnn -a mode=xxx"
    "cnn_x -a mode=extreme"
)
parallel --verbose --progress --colsep ' ' scrapy crawl {.} ::: "${COMMANDS[@]}"

Vectorizing a solr index with mahout using lucene.vector

I'm trying to run a clustering job on Amazon EMR using Mahout.
I have a Solr index that I uploaded to S3 and I want to vectorize it using Mahout's lucene.vector (this is the first step in the job flow).
The parameters for the step are:
Jar: s3n://mahout-bucket/jars/mahout-core-0.6-job.jar
MainClass: org.apache.mahout.driver.MahoutDriver
Args: lucene.vector --dir s3n://mahout-input/solr_index/ --field name --dictOut /test/solr-dict-out/dict.txt --output /test/solr-vectors-out/vectors
The error in the log is:
Unknown program 'lucene.vector' chosen.
I've done the same process locally with Hadoop and Mahout and it worked fine.
How should I call the lucene.vector function on EMR?
The program name, lucene.vector, should come immediately after bin/mahout:
/homes/cuneyt/trunk/bin/mahout lucene.vector --dir /homes/cuneyt/lucene/index --field 0 --output lda/vector --dictOut /homes/cuneyt/lda/dict.txt
I eventually figured out the answer. The problem was that I was using the wrong MainClass argument. Instead of
org.apache.mahout.driver.MahoutDriver
I should have used:
org.apache.mahout.utils.vectors.lucene.Driver
Therefore the correct arguments should have been:
Jar: s3n://mahout-bucket/jars/mahout-core-0.6-job.jar
MainClass: org.apache.mahout.utils.vectors.lucene.Driver
Args: --dir s3n://mahout-input/solr_index/ --field name --dictOut /test/solr-dict-out/dict.txt --output /test/solr-vectors-out/vectors
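If you add the step programmatically rather than through the console, the same Jar/MainClass/Args triple can be passed to EMR with boto3. This is only a sketch under assumed settings: the region, step name and cluster id are hypothetical placeholders, while the Jar, MainClass and Args mirror the values above.

# add_step.py - a sketch, assuming boto3 credentials and a running EMR cluster;
# region, step name and cluster id are hypothetical placeholders.
import boto3

emr = boto3.client('emr', region_name='us-east-1')

emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',          # hypothetical cluster id
    Steps=[{
        'Name': 'Vectorize solr index',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 's3n://mahout-bucket/jars/mahout-core-0.6-job.jar',
            'MainClass': 'org.apache.mahout.utils.vectors.lucene.Driver',
            'Args': [
                '--dir', 's3n://mahout-input/solr_index/',
                '--field', 'name',
                '--dictOut', '/test/solr-dict-out/dict.txt',
                '--output', '/test/solr-vectors-out/vectors',
            ],
        },
    }],
)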
