Snakemake: combine inputs over one wildcard - mapping

I am wondering if you could give advice on defining a Snakemake rule that combines over one, but not all, wildcards. My data are organized into runs and samples; most, but not all, samples were resequenced in every run. Therefore, I have pre-processing steps that are per sample per run. Then, I have a step that merges the BAM files from all runs for each sample. The issue I'm running into is that I'm not sure how to define a rule so that I can list as input all individual BAMs (from different runs) corresponding to a sample.
I'm putting my entire pipeline below for clarity, but my real question is about rule combine_bams: how can I list all BAMs for a single sample in the input?
Any suggestions would be great! Thank you very much in advance!
import numpy as np

# Define samples and runs
RUNS, SAMPLES = glob_wildcards("/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R1_001.fastq.gz")
print("runs are: ", RUNS)
print("samples are: ", SAMPLES)

rule all:
    input:
        #trim = ['process/trim/{run}_{samp}_trim_1.fq.gz'.format(samp=sample_id, run=run_id) for sample_id, run_id in zip(sample_ids, run_ids)],
        trim = expand('process/trim/{run}_{samp}_trim_1.fq.gz', zip, run = RUNS, samp = SAMPLES),
        kraken = expand('process/trim/{run}_{samp}_trim_kr_1.fq.gz', zip, run = RUNS, samp = SAMPLES),
        bams = expand('process/bams/{run}_{samp}_bwa_MTB_ancestor_reference_rg_sorted.bam', zip, run = RUNS, samp = SAMPLES), # ref/mapper are fixed here (expand with zip doesn't allow them to repeat)
        combined_bams = expand('process/bams/{samp}_bwa_MTB_ancestor_reference.merged.rmdup.bam', samp = np.unique(SAMPLES))
# Trim reads for quality.
rule trim_reads:
    input:
        p1='/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R1_001.fastq.gz', # update inputs so they only include those that exist; use zip.
        p2='/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R2_001.fastq.gz'
    output:
        trim1='process/trim/{run}_{samp}_trim_1.fq.gz',
        trim2='process/trim/{run}_{samp}_trim_2.fq.gz'
    log:
        'process/trim/{run}_{samp}_trim_reads.log'
    shell:
        '/labs/jandr/walter/tb/scripts/trim_reads.sh {input.p1} {input.p2} {output.trim1} {output.trim2} &>> {log}'
# Filter reads taxonomically with Kraken.
rule taxonomic_filter:
    input:
        trim1='process/trim/{run}_{samp}_trim_1.fq.gz',
        trim2='process/trim/{run}_{samp}_trim_2.fq.gz'
    output:
        kr1='process/trim/{run}_{samp}_trim_kr_1.fq.gz',
        kr2='process/trim/{run}_{samp}_trim_kr_2.fq.gz',
        kraken_stats='process/trim/{run}_{samp}_kraken.report'
    log:
        'process/trim/{run}_{samp}_run_kraken.log'
    threads: 8
    shell:
        '/labs/jandr/walter/tb/scripts/run_kraken.sh {input.trim1} {input.trim2} {output.kr1} {output.kr2} {output.kraken_stats} &>> {log}'
# Map reads.
rule map_reads:
    input:
        ref_path='/labs/jandr/walter/tb/data/refs/{ref}.fasta.gz',
        kr1='process/trim/{run}_{samp}_trim_kr_1.fq.gz',
        kr2='process/trim/{run}_{samp}_trim_kr_2.fq.gz'
    output:
        bam='process/bams/{run}_{samp}_{mapper}_{ref}_rg_sorted.bam'
    params:
        mapper='{mapper}'
    log:
        'process/bams/{run}_{samp}_{mapper}_{ref}_map.log'
    threads: 8
    shell:
        "/labs/jandr/walter/tb/scripts/map_reads.sh {input.ref_path} {params.mapper} {input.kr1} {input.kr2} {output.bam} &>> {log}"
# Combine reads and remove duplicates (per sample).
rule combine_bams:
    input:
        bams = 'process/bams/{run}_{samp}_bwa_MTB_ancestor_reference_rg_sorted.bam'
    output:
        combined_bam = 'process/bams/{samp}_{mapper}_{ref}.merged.rmdup.bam'
    log:
        'process/bams/{samp}_{mapper}_{ref}_merge_bams.log'
    threads: 8
    shell:
        "sambamba markdup -r -p -t {threads} {input.bams} {output.combined_bam}"

Create a dictionary that associates each sample with its list of runs.
Then, in the combine_bams rule, use an input function that looks the requested sample up in that dictionary to generate that sample's input files.
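For example, a minimal sketch of building such a dictionary from the RUNS and SAMPLES lists returned by glob_wildcards above (the sample_dict name and the use of defaultdict are just one way to do it):

from collections import defaultdict

# Map each sample to the list of runs in which it was sequenced.
sample_dict = defaultdict(list)
for run, samp in zip(RUNS, SAMPLES):
    sample_dict[samp].append(run)

The input function in combine_bams can then expand over the runs recorded for the requested sample: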
rule combine_bams:
    input:
        bams = lambda wildcards: expand(
            'process/bams/{run}_{samp}_bwa_MTB_ancestor_reference_rg_sorted.bam',
            run = sample_dict[wildcards.samp],
            samp = wildcards.samp)
    output:
        combined_bam = 'process/bams/{samp}_{mapper}_{ref}.merged.rmdup.bam'
    log:
        'process/bams/{samp}_{mapper}_{ref}_merge_bams.log'
    threads: 8
    shell:
        "sambamba markdup -r -p -t {threads} {input.bams} {output.combined_bam}"

Related

How to list the output groups of a bazel rule?

From https://stackoverflow.com/a/59455700/6162120:
cc_library produces several outputs, which are separated by output groups. If you want to get only the .so outputs, you can use a filegroup with the dynamic_library output group.
Where can I find the list of all the output groups produced by cc_library? And more generally, how can I list all the output groups of a bazel rule?
In the next Bazel release (after 3.7), or using Bazel#HEAD as of today, you can use cquery --output=starlark and the providers() function to do this:
$ bazel-dev cquery //:java-maven \
      --output=starlark \
      --starlark:expr="[p for p in providers(target)]"

["InstrumentedFilesInfo", "JavaGenJarsProvider", "JavaInfo", "JavaRuntimeClasspathProvider", "FileProvider", "FilesToRunProvider", "OutputGroupInfo"]
This isn't a replacement for documentation, but it's possible to get the output groups of targets using an aspect:
defs.bzl:
def _output_group_query_aspect_impl(target, ctx):
    for og in target.output_groups:
        print("output group " + str(og) + ": " + str(getattr(target.output_groups, og)))
    return []

output_group_query_aspect = aspect(
    implementation = _output_group_query_aspect_impl,
)
Then on the command line:
bazel build --nobuild Foo --aspects=//:defs.bzl%output_group_query_aspect
(--nobuild runs just the analysis phase and avoids running the execution phase if you don't need it)
For a java_binary this returns e.g.:
DEBUG: defs.bzl:3:5: output group _hidden_top_level_INTERNAL_: depset([<generated file _middlemen/Foo-runfiles>])
DEBUG: defs.bzl:3:5: output group _source_jars: depset([<generated file Foo-src.jar>])
DEBUG: defs.bzl:3:5: output group compilation_outputs: depset([<generated file Foo.jar>])

Google dataflow streaming pipeline is not distributing workload over several workers after windowing

I'm trying to set up a Dataflow streaming pipeline in Python. I have quite some experience with batch pipelines. Our basic architecture looks like this:
The first step does some basic processing and takes about 2 seconds per message to get to the windowing. We are using sliding windows of 3 seconds with a 3-second interval (this might change later so that we have overlapping windows). As the last step we have the SOG prediction, which takes about 15 seconds to process and is clearly our bottleneck transform.
The issue we seem to face is that the workload is perfectly distributed over our workers before the windowing, but the most important transform is not distributed at all. All the windows are processed one at a time, seemingly on one worker, while we have 50 available.
The logs show us that the SOG prediction step produces output once every 15 seconds or so, which should not be the case if the windows were being processed over more workers, so this builds up huge latency over time, which we don't want. With 1 minute of messages, we have a latency of 5 minutes for the last window. If the distribution worked, this should only be around 15 seconds (the SOG prediction time). So at this point we are clueless.
Does anyone see whether there is something wrong with our code, or how to prevent/circumvent this?
It seems like this is something happening in the internals of Google Cloud Dataflow. Does this also occur in Java streaming pipelines?
In batch mode everything works fine. There, one could try to do a reshuffle to make sure no fusion etc. occurs. But that is not possible after windowing in streaming.
args = parse_arguments(sys.argv if argv is None else argv)
pipeline_options = get_pipeline_options(project=args.project_id,
                                        job_name='XX',
                                        num_workers=args.workers,
                                        max_num_workers=MAX_NUM_WORKERS,
                                        disk_size_gb=DISK_SIZE_GB,
                                        local=args.local,
                                        streaming=args.streaming)

pipeline = beam.Pipeline(options=pipeline_options)

# Build pipeline
# pylint: disable=C0330
if args.streaming:
    frames = (pipeline | 'ReadFromPubsub' >> beam.io.ReadFromPubSub(
        subscription=SUBSCRIPTION_PATH,
        with_attributes=True,
        timestamp_attribute='timestamp'
    ))

    frame_tpl = frames | 'CreateFrameTuples' >> beam.Map(
        create_frame_tuples_fn)

    crops = frame_tpl | 'MakeCrops' >> beam.Map(make_crops_fn, NR_CROPS)
    bboxs = crops | 'bounding boxes tfserv' >> beam.Map(
        pred_bbox_tfserv_fn, SERVER_URL)

    sliding_windows = bboxs | 'Window' >> beam.WindowInto(
        beam.window.SlidingWindows(
            FEATURE_WINDOWS['goal']['window_size'],
            FEATURE_WINDOWS['goal']['window_interval']),
        trigger=AfterCount(30),
        accumulation_mode=AccumulationMode.DISCARDING)

    # GROUPBYKEY (per match)
    group_per_match = sliding_windows | 'Group' >> beam.GroupByKey()

    _ = group_per_match | 'LogPerMatch' >> beam.Map(lambda x: logging.info(
        "window per match per timewindow: # %s, %s", str(len(x[1])),
        x[1][0]['timestamp']))

    sog = sliding_windows | 'Predict SOG' >> beam.Map(predict_sog_fn,
                                                      SERVER_URL_INCEPTION,
                                                      SERVER_URL_SOG)

pipeline.run().wait_until_finish()
In Beam the unit of parallelism is the key: all the windows for a given key will be produced on the same machine. However, if you have 50+ keys they should get distributed among all workers.
You mentioned that you were unable to add a Reshuffle in streaming. This should be possible; if you're getting errors please file a bug at https://issues.apache.org/jira/projects/BEAM/issues . Does re-windowing into GlobalWindows make the issue with reshuffling go away?
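For reference, a minimal sketch of what adding that Reshuffle could look like in the pipeline above (sliding_windows, predict_sog_fn and the SERVER_URL_* names are taken from the question's snippet; whether this runs without errors in streaming is exactly what the answer is asking about):

import apache_beam as beam

# Break fusion before the expensive step so the SOG prediction
# can be scheduled across more than one worker.
sog = (sliding_windows
       | 'BreakFusion' >> beam.Reshuffle()
       | 'Predict SOG' >> beam.Map(predict_sog_fn,
                                   SERVER_URL_INCEPTION,
                                   SERVER_URL_SOG))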
It looks like you do not necessarily need the GroupByKey, because you are always grouping on the same key. Instead you could maybe use CombineGlobally to append all the elements inside the window instead of the GroupByKey (with always the same key).
combined = values | beam.CombineGlobally(append_fn).without_defaults()
combined | beam.ParDo(PostProcessFn())
I am not sure how the load distribution works when using CombineGlobally, but since it does not process key/value pairs I would expect another mechanism to do the load distribution.
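As a rough illustration of that suggestion, here is a hedged sketch of one way the combiner could be written (AppendFn is hypothetical; sliding_windows and PostProcessFn are names from the snippets above):

import apache_beam as beam

class AppendFn(beam.CombineFn):
    # Collects every element that falls into a window into a single list.
    def create_accumulator(self):
        return []
    def add_input(self, accumulator, element):
        accumulator.append(element)
        return accumulator
    def merge_accumulators(self, accumulators):
        merged = []
        for acc in accumulators:
            merged.extend(acc)
        return merged
    def extract_output(self, accumulator):
        return accumulator

combined = (sliding_windows
            | 'CollectWindow' >> beam.CombineGlobally(AppendFn()).without_defaults()
            | 'PostProcess' >> beam.ParDo(PostProcessFn()))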

Shift in the columns of spool file

I am using a shell script to extract data from the 'extr' table. The extr table is very big, with 410 columns and 61047 rows of data. The size of one record is around 5 KB.
The script is as follows:
#!/usr/bin/ksh
sqlplus -s \/ << rbb
set pages 0
set head on
set feed off
set num 20
set linesize 32767
set colsep |
set trimspool on
spool extr.csv
select * from extr;
/
spool off
rbb
#-------- END ---------
One fine day the extr.csv file had 2 records with an incorrect number of columns (i.e. one record with more columns and the other with fewer). Upon investigation I found that two duplicate records were repeated in the file. The primary key of the records should ideally be unique in the file, but in this case 2 records were repeated. Also, the shift in the columns was abrupt.
Small example of the output file:
5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|200|F
5003|A3A|AAB|153.33|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|258|G
5006|A6A|ABB|147.89|154|H
5003|A7A|AAB|249.67|AAB|153.33|205|R
5004|A8A|269|F
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|215|F
Here the primary key records for 5003 and 5004 have reappeared in place of 5007 and 5008. Also, the duplicate records have shifted the records of 5007 and 5008 by appending/cutting down their columns.
I need your help in analysing why this happened. Why were the 2 rows extracted multiple times? Why were the other 2 rows missing from the file? And why were the records shifted?
Note: This script has been working fine for the last two years and has never failed except for the one time mentioned above. It ran successfully during the next run. Recently we added one more program which accesses the extr table with a cursor (select only).
I reproduced a similar behaviour.
;-> cat input
5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|200|F
5003|A3A|AAB|153.33|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|258|G
5006|A6A|ABB|147.89|154|H
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|215|F
See the input file as your database.
Now I write a script that accesses "the database" and simulates some random freezes.
;-> cat writeout.sh
# Start this script twice
while IFS=\| read a b c d e f; do
   # I think you need \c for suppressing the \n, but I do it differently here
   echo "$a|$b|$c|$d|" | tr -d "\n"
   (( sleeptime = RANDOM % 5 ))
   sleep ${sleeptime}
   echo "$e|$f"
done < input >> output
EDIT: Removed cat input | in script above, replaced by < input
Start this script twice in the background
;-> ./writeout.sh &
;-> ./writeout.sh &
Wait until both jobs are finished and see the result
;-> cat output
5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|200|F
5003|A3A|AAB|153.33|5001|A1A|AAB|190.00|105|A
5002|A2A|ABB|180.00|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|200|F
5003|A3A|AAB|153.33|258|G
5006|A6A|ABB|147.89|154|H
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|205|R
5004|A4A|ABB|261.50|269|F
5005|A5A|AAB|243.00|258|G
5006|A6A|ABB|147.89|215|F
154|H
5009|A9A|AAB|368.00|358|S
5010|AAA|ABB|245.71|215|F
When I change the last line of writeout.sh into done > output I do not see the problem, but that might be due to buffering and the small amount of data.
I still don't know exactly what happened in your case, but it really looks like 2 programs writing simultaneously to the same file.
A job in TWS could have been restarted manually, 2 scripts in your master script might write to the same file, or something else.
Preventing this in the future can be done with some locking / checks (when the output file already exists, quit and return an error code to TWS).

Bad Result And Evaluation From Giza++

I have tried to work with GIZA++ on Windows (using the Cygwin compiler).
I used this code:
//Suppose source language is French and target language is English
plain2snt.out FrenchCorpus.f EnglishCorpus.e
mkcls -c30 -n20 -pFrenchCorpus.f -VFrenchCorpus.f.vcb.classes opt
mkcls -c30 -n20 -pEnglishCorpus.e -VEnglishCorpus.e.vcb.classes opt
snt2cooc.out FrenchCorpus.f.vcb EnglishCorpus.e.vcb FrenchCorpus.f_EnglishCorpus.e.snt >courpuscooc.cooc
GIZA++ -S FrenchCorpus.f.vcb -T EnglishCorpus.e.vcb -C FrenchCorpus.f_EnglishCorpus.e.snt -m1 100 -m2 30 -mh 30 -m3 30 -m4 30 -m5 30 -p1 o.95 -CoocurrenceFile courpuscooc.cooc -o dictionary
But after getting the output files from GIZA++ and evaluating the output, I observed that the results were very bad.
My evaluation result was:
RECALL = 0.0889
PRECISION = 0.0990
F_MEASURE = 0.0937
AER = 0.9035
Does anybody know the reason? Could the reason be that I have forgotten some parameters, or should I change some of them?
In other words:
First I wanted to train GIZA++ on a huge amount of data and then test it on a small corpus and compare its result with the desired alignment (gold standard), but I couldn't find any document or useful page on the web.
Can you point me to a useful document?
Therefore I ran it on a small corpus (447 sentences) and compared the result with the desired alignment. Do you think this is the right way?
Also, I changed my command as follows and got a better result, but it's still not good:
GIZA++ -S testlowsf.f.vcb -T testlowde.e.vcb -C testlowsf.f_testlowde.e.snt -m1 5 -m2 0 -mh 5 -m3 5 -m4 0 -CoocurrenceFile inputcooc.cooc -o dictionary -model1dumpfrequency 1 -model4smoothfactor 0.4 -nodumps 0 -nsmooth 4 -onlyaldumps 1 -p0 0.999 -diagonal yes -final yes
Result of evaluation:
// suppose A is the result of GIZA++ and G is the gold standard. As and Gs are the S links in the A and G files; Ap and Gp are the P links in the A and G files.
RECALL = As intersect Gs/Gs = 0.6295
PRECISION = Ap intersect Gp/A = 0.1090
FMEASURE = (2*PRECISION*RECALL)/(RECALL + PRECISION) = 0.1859
AER = 1 - ((As intersect Gs + Ap intersect Gp)/(A + S)) = 0.7425
Do you know the reason?
Where did you get those parameters? 100 iterations of model1?! Well, if you actually manage to run this, I strongly suspect that you have a very small parallel corpus. If so, you should consider adding more parallel data in training. And how exactly do you calculate the recall and precision measures?
EDIT:
With less than 500 sentences you're unlikely to get any reasonable performance. The usual way to do it is to find a larger (unaligned) parallel corpus, run GIZA++ on both together, and then evaluate the small part for which you have the manual alignments. Check Europarl or MultiUN; these are freely available corpora, and both contain a relatively large amount of English-French parallel data. The instructions on preparing the data can be found on the websites.
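On the question of how the measures are calculated: the commonly used definitions (due to Och and Ney), which may or may not match how the numbers above were computed, can be written as a short sketch. A, S and P here are hypothetical sets of (source_index, target_index) pairs for the system alignment, the gold "sure" links and the gold "possible" links (with S a subset of P):

def alignment_scores(A, S, P):
    # Precision uses the possible links, recall uses the sure links.
    precision = len(A & P) / len(A)
    recall = len(A & S) / len(S)
    fmeasure = 2 * precision * recall / (precision + recall)
    # Alignment Error Rate as defined by Och & Ney.
    aer = 1.0 - (len(A & S) + len(A & P)) / (len(A) + len(S))
    return precision, recall, fmeasure, aer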

LDA Mahout only one topic

I am trying to follow the example of using LDA on the Reuters data, as given in the Mahout in Action book. However, regardless of the number of times I run it, I always get only one topic.
I ran the command as indicated:
mahout lda -i reuters-vectors/tf-vectors -o reuters-lda-sparse -k 10 -v 34262 -x 20 -ow
I got the number (34262) from running seqdumper. After the command has run, I run LDAPrintTopics as indicated in the book and get the following:
Topic 0
===========
billion [p(billion|topic_0) = 0.04580929884162013
pct [p(pct|topic_0) = 0.043323700764985575
dlrs [p(dlrs|topic_0) = 0.031395871939373196
3 [p(3|topic_0) = 0.027311386657272094
1987 [p(1987|topic_0) = 0.025690077982656934
1 [p(1|topic_0) = 0.022727304049111215
reuter [p(reuter|topic_0) = 0.019572283708227903
mln [p(mln|topic_0) = 0.014569551610736616
april [p(april|topic_0) = 0.014453636611524965
march [p(march|topic_0) = 0.014359948846622552
Is there a way to get more topics out of LDA?
Thanks.
Your command says -k 10, which specifies that there should be 10 topics.
See https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
Try changing your data set; maybe it's too small to generate 10 different topics.

Resources