From https://stackoverflow.com/a/59455700/6162120:
cc_library produces several outputs, which are separated by output groups. If you want to get only the .so outputs, you can use a filegroup with the dynamic_library output group.
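For reference, that filegroup approach looks roughly like this; a minimal sketch with hypothetical target names:

cc_library(
    name = "mylib",
    srcs = ["mylib.cc"],
)

# Collects only the shared-library output of :mylib.
filegroup(
    name = "mylib_shared",
    srcs = [":mylib"],
    output_group = "dynamic_library",
)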
Where can I find the list of all the output groups produced by cc_library? And more generally, how can I list all the output groups of a bazel rule?
In the next Bazel release (after 3.7), or using Bazel@HEAD as of today, you can use cquery --output=starlark and the providers() function to do this:
$ bazel-dev cquery //:java-maven \
--output=starlark \
--starlark:expr="[p for p in providers(target)]"
["InstrumentedFilesInfo", "JavaGenJarsProvider", "JavaInfo", "JavaRuntimeClasspathProvider", "FileProvider", "FilesToRunProvider", "OutputGroupInfo"]
This isn't a replacement for documentation, but it's possible to get the output groups of targets using an aspect:
defs.bzl:
def _output_group_query_aspect_impl(target, ctx):
    for og in target.output_groups:
        print("output group " + str(og) + ": " + str(getattr(target.output_groups, og)))
    return []

output_group_query_aspect = aspect(
    implementation = _output_group_query_aspect_impl,
)
Then on the command line:
bazel build --nobuild Foo --aspects=//:defs.bzl%output_group_query_aspect
(--nobuild runs just the analysis phase and avoids running the execution phase if you don't need it)
For a java_binary this returns e.g.:
DEBUG: defs.bzl:3:5: output group _hidden_top_level_INTERNAL_: depset([<generated file _middlemen/Foo-runfiles>])
DEBUG: defs.bzl:3:5: output group _source_jars: depset([<generated file Foo-src.jar>])
DEBUG: defs.bzl:3:5: output group compilation_outputs: depset([<generated file Foo.jar>])
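Once the group names are known, a specific group can be requested directly with Bazel's --output_groups build flag; for example, using the _source_jars group listed above for the same Foo target:

bazel build Foo --output_groups=_source_jars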
Related
I am wondering if you would be able to give advice about defining a Snakemake rule to combine over one, but not all, wildcards. My data is organized so that I have runs and samples; most, but not all, samples were resequenced in every run. Therefore, I have pre-processing steps that are per-sample-run. Then, I have a step that combines BAM files for each run per sample. The issue I'm running into is that I'm confused about how to define a rule so that I can list as input all of the individual BAMs (from different runs) corresponding to a sample.
I'm putting my entire pipeline below, for clarity, but my real question is on rule combine_bams. How can I list all bams for a single sample in the input?
Any suggestions would be great! Thank you very much in advance!
import numpy as np

# Define samples and runs.
RUNS, SAMPLES = glob_wildcards("/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R1_001.fastq.gz")
print("runs are: ", RUNS)
print("samples are: ", SAMPLES)

rule all:
    input:
        #trim = ['process/trim/{run}_{samp}_trim_1.fq.gz'.format(samp=sample_id, run=run_id) for sample_id, run_id in zip(sample_ids, run_ids)],
        trim = expand(['process/trim/{run}_{samp}_trim_1.fq.gz'], zip, run = RUNS, samp = SAMPLES),
        kraken = expand('process/trim/{run}_{samp}_trim_kr_1.fq.gz', zip, run = RUNS, samp = SAMPLES),
        bams = expand('process/bams/{run}_{samp}_bwa_MTB_ancestor_reference_rg_sorted.bam', zip, run = RUNS, samp = SAMPLES), # fixed ref/mapper (expand with zip doesn't allow these to repeat)
        combined_bams = expand('process/bams/{samp}_bwa_MTB_ancestor_reference.merged.rmdup.bam', samp = np.unique(SAMPLES))

# Trim reads for quality.
rule trim_reads:
    input:
        p1 = '/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R1_001.fastq.gz', # update inputs so they only include those that exist; use zip.
        p2 = '/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R2_001.fastq.gz'
    output:
        trim1 = 'process/trim/{run}_{samp}_trim_1.fq.gz',
        trim2 = 'process/trim/{run}_{samp}_trim_2.fq.gz'
    log:
        'process/trim/{run}_{samp}_trim_reads.log'
    shell:
        '/labs/jandr/walter/tb/scripts/trim_reads.sh {input.p1} {input.p2} {output.trim1} {output.trim2} &>> {log}'

# Filter reads taxonomically with Kraken.
rule taxonomic_filter:
    input:
        trim1 = 'process/trim/{run}_{samp}_trim_1.fq.gz',
        trim2 = 'process/trim/{run}_{samp}_trim_2.fq.gz'
    output:
        kr1 = 'process/trim/{run}_{samp}_trim_kr_1.fq.gz',
        kr2 = 'process/trim/{run}_{samp}_trim_kr_2.fq.gz',
        kraken_stats = 'process/trim/{run}_{samp}_kraken.report'
    log:
        'process/trim/{run}_{samp}_run_kraken.log'
    threads: 8
    shell:
        '/labs/jandr/walter/tb/scripts/run_kraken.sh {input.trim1} {input.trim2} {output.kr1} {output.kr2} {output.kraken_stats} &>> {log}'

# Map reads.
rule map_reads:
    input:
        ref_path = '/labs/jandr/walter/tb/data/refs/{ref}.fasta.gz',
        kr1 = 'process/trim/{run}_{samp}_trim_kr_1.fq.gz',
        kr2 = 'process/trim/{run}_{samp}_trim_kr_2.fq.gz'
    output:
        bam = 'process/bams/{run}_{samp}_{mapper}_{ref}_rg_sorted.bam'
    params:
        mapper = '{mapper}'
    log:
        'process/bams/{run}_{samp}_{mapper}_{ref}_map.log'
    threads: 8
    shell:
        "/labs/jandr/walter/tb/scripts/map_reads.sh {input.ref_path} {params.mapper} {input.kr1} {input.kr2} {output.bam} &>> {log}"

# Combine reads and remove duplicates (per sample).
rule combine_bams:
    input:
        bams = 'process/bams/{run}_{samp}_bwa_MTB_ancestor_reference_rg_sorted.bam'
    output:
        combined_bam = 'process/bams/{samp}_{mapper}_{ref}.merged.rmdup.bam'
    log:
        'process/bams/{samp}_{mapper}_{ref}_merge_bams.log'
    threads: 8
    shell:
        "sambamba markdup -r -p -t {threads} {input.bams} {output.combined_bam}"
Create a dictionary to associate each sample with its list of runs.
Then for the combine_bams rule, use an input function to generate the input files for that sample using the dictionary.
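For example, a minimal sketch of building that dictionary, assuming RUNS and SAMPLES come from the glob_wildcards call shown in the question:

# Map each sample to the list of runs it appears in.
sample_dict = {}
for run, samp in zip(RUNS, SAMPLES):
    sample_dict.setdefault(samp, []).append(run)

The input function then looks up all runs for the requested sample: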
rule combine_bams:
    input:
        bams = lambda wildcards: expand('process/bams/{run}_{{samp}}_bwa_MTB_ancestor_reference_rg_sorted.bam', run = sample_dict[wildcards.samp])
    output:
        combined_bam = 'process/bams/{samp}_{mapper}_{ref}.merged.rmdup.bam'
    log:
        'process/bams/{samp}_{mapper}_{ref}_merge_bams.log'
    threads: 8
    shell:
        "sambamba markdup -r -p -t {threads} {input.bams} {output.combined_bam}"
In the function below, I want to return the important_col variable as well.
import apache_beam as beam
import pandas as pd

class FormatInput(beam.DoFn):
    def process(self, element):
        """Format the input to the desired shape."""
        df = pd.DataFrame([element], columns=element.keys())
        if 'reqd' in df.columns:
            important_col = 'reqd'
        elif 'customer' in df.columns:
            important_col = 'customer'
        elif 'phone' in df.columns:
            important_col = 'phone'
        else:
            raise ValueError('Important column not specified')
        output = df.to_dict('records')
        return output
with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
    clean_csv = (p
                 | 'Read input file' >> beam.dataframe.io.read_csv('raw_data.csv'))
    to_process = clean_csv | 'pre-processing' >> beam.ParDo(FormatInput())
In the above pipeline, I want to return the important_col variable from FormatInput.
Once I have that variable, I want to pass it as an argument to the next step in the pipeline.
I also want to dump to_process to a CSV file.
I tried the following, but none of it worked:
I converted to_process with to_dataframe and tried to_csv, but I got an error.
I also tried to dump the PCollection to CSV, but I can't work out how to do that. I referred to the official Apache Beam documentation, but I don't find anything similar to my use case.
Let's say I have a simple Java program including two classes:
Example, Example2
and another class that uses both of them:
ExamplesUsage
and I have corresponding Bazel build targets of kind java_library:
example, example2, examples_usage
so example and example2 need to be compiled before examples_usage is built.
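For concreteness, a hypothetical BUILD file for this setup could look roughly like the following (file and target names are assumptions):

java_library(
    name = "example",
    srcs = ["Example.java"],
)

java_library(
    name = "example2",
    srcs = ["Example2.java"],
)

# examples_usage depends on both libraries, so they are built first.
java_library(
    name = "examples_usage",
    srcs = ["ExamplesUsage.java"],
    deps = [
        ":example",
        ":example2",
    ],
)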
I want to accumulate information from all three targets using Bazel's aspect propagation technique. How do I go about doing that?
Here's an example for accumulating the number of source files in this build chain:
def _counter_aspect_impl(target, ctx):
    sources_count = len(ctx.rule.attr.srcs)
    print("%s: own amount - %s" % (target.label.name, sources_count))
    for dep in ctx.rule.attr.deps:
        sources_count = sources_count + dep.count
    print("%s: including deps: %s" % (target.label.name, sources_count))
    return struct(count = sources_count)

counter_aspect = aspect(
    implementation = _counter_aspect_impl,
    attr_aspects = ["deps"],
)
If we run it on the hypothetical Java program, we get the following output:
example2: own amount - 1
example2: including deps: 1
example: own amount - 1
example: including deps: 1
examples_usage: own amount - 1
examples_usage: including deps: 3
As you can see, the dependencies' aspects were run first, and only then the dependent target's aspect was run.
Of course, in order to actually utilize the information, some ctx.action or ctx.file_action needs to be called to persist the gathered data.
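For reference, such an aspect is applied from the command line with the --aspects flag; a sketch, assuming the aspect is defined in //:defs.bzl and the top-level target is //:examples_usage:

bazel build //:examples_usage --aspects=//:defs.bzl%counter_aspect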
The Google Bazel build tool makes it easy enough to explain that each CoffeeScript file in a particular directory tree needs to be compiled to a corresponding output JavaScript file:
# Runs "coffee" 100 times if there are 100 files:
# will run slowly if most of them need rebuilding.
[genrule(
    name = 'compile-' + f,
    srcs = [f],
    outs = [f.replace('src/', 'static/').replace('.coffee', '.js')],
    cmd = 'coffee --compile --map --output $$(dirname $@) $<',
) for f in glob(['src/**/*.coffee'])]
But given, say, 100 CoffeeScript files, this will invoke the coffee tool 100 separate times, adding many seconds to the compilation process.
Alternatively, this can be written as a single command that takes 100 files as input and produces 100 files as output:
# Runs "coffee" once on all the files:
# very slow in the case that only 1 file was edited.
coffee_files = glob(['src/**/*.coffee'])
genrule(
    name = 'compile-coffee-files',
    srcs = coffee_files,
    outs = [f.replace('src/', 'static/').replace('.coffee', '.js') for f in coffee_files],
    cmd = 'coffee --compile --map --output $(@D) $(SRCS)',
)
Is there any way to explain to Bazel that coffee can be invoked with many files at once, and that if N of the targets are out of date, then only the N source files should be supplied to the coffee command, instead of the full list of all targets whether they need rebuilding or not?
Are CoffeeScript files independent of one another? If the first approach works, where each file is run through coffee separately, then it would seem so. In that case, the first approach will actually give you the most parallelism and incrementality.
Even if running coffee 100 times is slower than running coffee once with 100 files, you'll only be paying that cost the first time you compile everything. When you change 1 file, the other 99 won't be recompiled. But if the startup time of coffee is so great that the cost of compiling the 100 files themselves is negligible by comparison, you might as well stick with compiling them all in one big genrule.
One way to compromise between the two extremes is to create a macro: http://bazel.io/docs/skylark/macros.html
def compile_coffee(name, srcs):
    native.genrule(
        name = name,
        srcs = srcs,
        outs = [f.replace('src/', 'static/').replace('.coffee', '.js') for f in srcs],
        cmd = 'coffee --compile --map --output $(@D) $(SRCS)',
    )
and then you can use the compile_coffee macro in your build files, organizing your build into appropriately sized targets:
load("//pkg/path/to:coffee.bzl", "compile_coffee")
compile_coffee(
name = "lib",
srcs = glob(["*.coffee"]))
There are also full Skylark rules: http://bazel.io/docs/skylark/rules.html but if CoffeeScript files don't really depend on each other, then this probably isn't necessary.
There are also persistent workers: http://bazel.io/blog/2015/12/10/java-workers.html which let you keep a running instance of coffee around so that you don't have to pay the startup cost, but the binary has to be well behaved, and this is a bit more of an investment because you typically have to write wrappers to wire everything up.
This will pass 20 files at a time to the CoffeeScript compiler:
BUILD:
load(":myrules.bzl", "coffeescript")
coffee_files = glob(["src/*.coffee"])
# 'coffeescript' is a macro, but it will make a target named 'main'
coffeescript(
name = "main",
srcs = coffee_files
)
myrules.bzl:
def _chunks(l, n):
    n = max(1, n)
    return [l[i:i + n] for i in range(0, len(l), n)]

def coffeescript(name, srcs):
    i = 0
    all_outs = []
    for chunk in _chunks(srcs, 20):
        chunk_name = "{}-{}".format(name, i)
        outs = [f.replace('src/', 'static/').replace('.coffee', '.js') for f in chunk] + \
               [f.replace('src/', 'static/').replace('.coffee', '.js.map') for f in chunk]
        all_outs += outs
        native.genrule(
            name = chunk_name,
            srcs = chunk,
            outs = outs,
            cmd = "coffee --compile --map --output $(@D)/static $(SRCS)",
        )
        i += 1

    # Make a filegroup with the original name that groups together all
    # of the output files.
    native.filegroup(
        name = name,
        srcs = all_outs,
    )
Then, bazel build :main will build all the CoffeeScript files, 20 at a time.
But this does have some weaknesses:
If one CoffeeScript file gets modified, then 20 will get recompiled. Not just one.
If a file gets added or deleted, then lots of files -- basically, from that point until the end of the list of files -- will get recompiled.
I've found that the best approach has been to do what @ahumesky suggested: break things up into reasonably sized Bazel "packages", and let each package do a single compilation.
Suppose I have the following in a test.lua file:
require 'torch'

-- parse command line arguments
if not opt then
    print '==> processing options'
    cmd = torch.CmdLine()
    cmd:text()
    cmd:text('SVHN Model Definition')
    cmd:text()
    cmd:text('Options:')
    cmd:option('-model', 'convnet', 'type of model to construct: linear | mlp | convnet')
    cmd:option('-visualize', 1, 'visualize input data and weights during training')
    cmd:text()
    opt = cmd:parse(arg or {})
end

if opt.visualize == 0 then
    -- Do something
end
Now assume I want to call test.lua with some different arguments from another Lua file, execute.lua:
dofile ('test.lua -visualize 0') --Gives an error
However, I am getting an error which indicates that the file 'test.lua -visualize 0' cannot be found when trying to call the function through execute.lua.
So, how can I correctly run another Lua file containing Torch code from a .lua file?
If you do not need to use the variables defined inside your 'test.lua', you can use os.execute:
os.execute("th test.lua -visiualize 0")