No reads mapped in proper pairs, in paired-end sequencing bamfile using samtools - alignment

I am working with a bamfile of paired-end whole genome sequencing, and want to filter out reads from a specific genomic region that are not mapped in a proper pair (these sometimes indicate a structural variant). I am using samtools, and tried to filter the reads with the "flag" option, to select for reads that are not mapped in a proper pair. If I'm correct, these reads should not have 2 in their flag value.
(https://broadinstitute.github.io/picard/explain-flags.html)
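For reference, the proper-pair flag is just bit 0x2 of the integer FLAG field, so it can be checked with plain bitwise arithmetic. A minimal Python sketch (the flag values 99 and 65 are illustrative, not taken from this BAM):

```python
# SAM flag bits (see the Picard explain-flags page linked above)
PAIRED = 0x1         # read is paired
PROPER_PAIR = 0x2    # read mapped in proper pair
MATE_UNMAPPED = 0x8  # mate is unmapped

def is_proper_pair(flag):
    """Mimics `samtools view -f 2`: keep reads whose FLAG has bit 0x2 set."""
    return bool(flag & PROPER_PAIR)

# A typical properly paired forward read has flag 99 (1 + 2 + 32 + 64)
print(is_proper_pair(99))  # True
# Flag 65 (paired + first in pair) lacks the proper-pair bit
print(is_proper_pair(65))  # False
```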
However, according to samtools none of my reads are mapped in a proper pair. When I count (-c) all the reads in the region I specified, without any filter, it gives me a total count of 179:
samtools view input.bam "8:113483114-113483213" -c
179
When I filter for reads that are properly paired, i.e. flag contains "2" (-f 2), the count is zero:
samtools view input.bam "8:113483114-113483213" -f 2 -c
0
I checked to see whether the reads are recognized as paired (-f 1) and whether the mates are mapped (-F 8), and all of them are:
samtools view input.bam "8:113483114-113483213" -f 1 -F 8 -c
179
I also tried other regions in the genome and encountered the same problem everywhere.
I checked the regions in IGV using the same BAM files, and IGV tells me most of the reads are properly paired, and only a few aren't.
Does anyone know what is going on here? Were the BAM files flagged in a way that doesn't account for proper mate mapping?
Any help is welcome! Thanks a lot.


What are the columns in perf-stat when run with -r and -x

I'm trying to interpret the results of perf-stat run on a program. I know that it was run with -r 30 and -x. From https://perf.wiki.kernel.org/index.php/Tutorial it says that if run with -r the stddev will be reported, but I'm not sure which of these columns that is, and I'm having trouble finding information on the output when run with -x. One example of the output I've received is this:
19987,,cache-references,0.49%,562360,100.00
256,,cache-misses,10.65%,562360,100.00
541747,,branches,0.07%,562360,100.00
7098,,branch-misses,0.78%,562360,100.00
60,,page-faults,0.43%,560411,100.00
0.560244,,cpu-clock,0.28%,560411,100.00
0.560412,,task-clock,0.28%,560411,100.00
My guess is that the % column is the standard deviation as a percentage of the first column but I'm not sure.
My question in summary is what do the columns represent? Which column is the standard deviation?
You are very close. Here are some blanks filled in. The columns are, in order:
1. Arithmetic mean of the measured values.
2. The unit, if known (e.g. on my system it shows 'msec' for 'cpu-clock').
3. Event name.
4. Standard deviation, scaled so that 100% = mean.
5. Time during which counting of this event was actually running.
6. Fraction of the enabled time during which this event was actually running (in %).
The last two are relevant for multiplexing: if more counters are selected than can be recorded concurrently, the denoted percentage will drop below 100.
On my system (Linux 5.0.5; I'm not sure since when this is available), there is also a shadow stat for some metrics, which computes a derived metric. For example, cpu-clock computes 'CPUs utilized', and branch-misses computes the fraction of all branches that are missed. This appends two more columns: the shadow stat value and the shadow stat description.
Note that this format changes with some other options. For example, if you display the metrics with a finer-grained grouping (e.g. per CPU), information about these groups will be prepended in additional columns.
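As a sketch, the fields of a `-x,` line can be pulled apart with a CSV reader; the column names here are my own labels based on the list above (older perf, without the shadow-stat columns):

```python
import csv
from io import StringIO

# My own labels for `perf stat -r N -x,` output fields, in order
FIELDS = ["mean", "unit", "event", "stddev_pct", "run_time", "enabled_pct"]

sample = "19987,,cache-references,0.49%,562360,100.00"

row = next(csv.reader(StringIO(sample)))
record = dict(zip(FIELDS, row))
print(record["event"])       # cache-references
print(record["stddev_pct"])  # 0.49%
```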

Find size contributed by each external library on iOS

I'm trying to reduce my App Store binary size, and we have lots of external libs that might be contributing to the size of the final ipa. Is there any way to find out how much each external static lib takes up in the final binary (other than removing each one in turn)?
All of this information is contained in the link map, if you have the patience for sifting through it (for large apps, it can be quite large). The link map has a listing of all the libraries, their object files, and all symbols that were packaged into your app, all in human-readable text. Normally, projects aren't configured to generate them by default, so you'll have to make a quick project file change.
From within Xcode:
Under 'Build Settings' for your target, search for "map"
In the results below, under the 'Linking' section, set 'Write Link Map File' to "Yes"
Make sure to make note of the full path and file name listed under 'Path to Link Map File'
The next time you build your app you'll get a link map dumped to that file path. Note that the path is relative to your app's location in the DerivedData folder (usually ~/Library/Developer/Xcode/DerivedData/<your-app-name>-<random-string-of-letters-and-numbers>/Build/Intermediates/..., but YMMV). Since it's just a text file, you can read it with any text editor.
The contents of the link map are divided into 3 sections, of which 2 will be relevant to what you're looking for:
Object Files: this section contains a listing of all of the object files included in your final app, including your own code and that of any third-party libraries you've included. Importantly, each object file also lists the library where it came from;
Sections: this section, not relevant to your question, contains a list of the processor segments and their sections;
Symbols: this section contains the raw data that you're interested in: a list of all symbols/methods with their absolute location (i.e. address in the processor's memory map), size, and most important of all, a cross-reference to their containing object module (under the 'File' column).
From this raw data, you have everything you need to do the required size calculation. From #1, you see that, for every library, there are N possible constituent object modules; from #2, you see that, for every object module, there are M possible symbols, each occupying size S. For any given library, then, your rough order of size will be something like O(N * M * S). That's only to give you an indication of the components that would go into your actual calculations, it's not any sort of a useful formula. To perform the calculation itself, I'm sorry to say that I'm not aware of any existing tools that will do the requisite processing for you, but given that the link map is just a text file, with a little script magic and ingenuity you can construct a script to do the heavy lifting.
For example, I have a little sample project that links to the following library: https://github.com/ColinEberhardt/LinqToObjectiveC (the sample project itself is from a nice tutorial on ReactiveCocoa, here: http://www.raywenderlich.com/62699/reactivecocoa-tutorial-pt1), and I want to know how much space it occupies. I've generated a link map, TwitterInstant-LinkMap-normal-x86_64.txt (it runs in the simulator). In order to find all object modules included by the library, I do this:
$ grep -i "libLinqToObjectiveC.a" TwitterInstant-LinkMap-normal-x86_64.txt
which gives me this:
[ 8] /Users/XXX/Library/Developer/Xcode/DerivedData/TwitterInstant-ecppmzhbawtxkwctokwryodvgkur/Build/Products/Debug-iphonesimulator/libLinqToObjectiveC.a(LinqToObjectiveC-dummy.o)
[ 9] /Users/XXX/Library/Developer/Xcode/DerivedData/TwitterInstant-ecppmzhbawtxkwctokwryodvgkur/Build/Products/Debug-iphonesimulator/libLinqToObjectiveC.a(NSArray+LinqExtensions.o)
[ 10] /Users/XXX/Library/Developer/Xcode/DerivedData/TwitterInstant-ecppmzhbawtxkwctokwryodvgkur/Build/Products/Debug-iphonesimulator/libLinqToObjectiveC.a(NSDictionary+LinqExtensions.o)
The first column contains the cross-references to the symbol table that I need, so I can search for those:
$ cat TwitterInstant-LinkMap-normal-x86_64.txt | grep -e "\[ 8\]"
which gives me:
0x100087161 0x0000001B [ 8] literal string: PodsDummy_LinqToObjectiveC
0x1000920B8 0x00000008 [ 8] anon
0x100093658 0x00000048 [ 8] l_OBJC_METACLASS_RO_$_PodsDummy_LinqToObjectiveC
0x1000936A0 0x00000048 [ 8] l_OBJC_CLASS_RO_$_PodsDummy_LinqToObjectiveC
0x10009F0A8 0x00000028 [ 8] _OBJC_METACLASS_$_PodsDummy_LinqToObjectiveC
0x10009F0D0 0x00000028 [ 8] _OBJC_CLASS_$_PodsDummy_LinqToObjectiveC
The second column contains the size of the symbol in question (in hexadecimal), so if I add them all up, I get 0x103, or 259 bytes.
Even better, I can do a bit of stream hacking to whittle it down to the essential elements and do the addition for me:
$ cat TwitterInstant-LinkMap-normal-x86_64.txt | grep -e "\[ 8\]" | grep -e "0x" | awk '{print $2}' | xargs printf "%d\n" | paste -sd+ - | bc
which gives me the number straight up:
259
Doing the same for "\[ 9\]" (13016 bytes) and "\[ 10\]" (5503 bytes), and adding them to the previous 259 bytes, gives me 18778 bytes.
You can certainly improve upon the stream hacking I've done here to make it a bit more robust (in this implementation, you have to make sure you get the exact number of spaces right and quote the brackets), but you at least get the idea.
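If you'd rather not chain shell tools, the same calculation can be done in a short Python script. This is only a sketch, assuming the two link-map line layouts shown above (the `[ N]` object-file index lines, and symbol lines whose second column is a hex size); the paths in the sample are made up:

```python
import re
from collections import defaultdict

def sizes_per_library(link_map_text):
    """Sum symbol sizes per library from an Xcode link map.

    Assumes the formats shown above:
      object-file lines:  [  8] /path/to/libFoo.a(Bar.o)
      symbol lines:       0x100087161  0x0000001B  [  8] some symbol
    """
    index_to_lib = {}
    totals = defaultdict(int)
    obj_re = re.compile(r"^\[\s*(\d+)\]\s+(\S+)")
    sym_re = re.compile(r"^0x[0-9A-Fa-f]+\s+(0x[0-9A-Fa-f]+)\s+\[\s*(\d+)\]")
    for line in link_map_text.splitlines():
        m = obj_re.match(line)
        if m:
            # strip the "(Bar.o)" member name so we group by library
            index_to_lib[m.group(1)] = m.group(2).split("(")[0]
            continue
        m = sym_re.match(line)
        if m:
            lib = index_to_lib.get(m.group(2))
            if lib:
                totals[lib] += int(m.group(1), 16)
    return dict(totals)

sample = """\
[  8] /path/libLinqToObjectiveC.a(LinqToObjectiveC-dummy.o)
0x100087161  0x0000001B  [  8] literal string: PodsDummy_LinqToObjectiveC
0x1000920B8  0x00000008  [  8] anon
"""
print(sizes_per_library(sample))  # {'/path/libLinqToObjectiveC.a': 35}
```

Run against a full link map, this sums every symbol's size under its owning library in one pass, which sidesteps having to grep for each `[ N]` index by hand.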
Make a .ipa file of your app and save it in your system.
Then open the terminal and execute the following command:
unzip -lv /path/to/your/app.ipa
It will return a table of data about your .ipa file. The size column has the compressed size of each file within your .ipa file.
I think you should be able to extract the information you need from this:
symbols -w -noSources YourFileHere
Ref: https://devforums.apple.com/message/926442#926442
IIRC, it isn't going to give you clear summary information on each lib, but you should find that the functions from each library are clustered together, so with a bit of effort you can calculate the approximate contribution from each lib.
Also make sure that you set Generate Debug Symbols to NO in your build settings. This can reduce the size of your static library by about 30%.
In case it's part of your concern, a static library is just the relevant .o files archived together plus some bookkeeping. So a 1.7mb static library — even if the code within it is the entire 1.7mb — won't usually add 1.7mb to your product. The usual rules about dead code stripping will apply.
Beyond that you can reduce the built size of your code. The following probably isn't a comprehensive list.
In your target's build settings look for 'Optimization Level'. By switching that to 'Fastest, Smallest -Os' you'll permit the compiler to sacrifice some speed for size.
Make sure you're building for thumb, the more compact ARM code. Assuming you're using LLVM that means making sure you don't have -mno-thumb anywhere in your project settings.
Also consider which architectures you want to build for. Apple doesn't allow submission of an app that supports both ARMv6 and the iPhone 5 screen, and has dropped ARMv6 support entirely from the latest Xcode. So there's probably no point in including it at this point.

weka 3.7 explorer cannot classify text

I am trying to do text classification using the weka 3.7 explorer. I converted 2 text files (separated into two directories, class1 and class2) into arff using the text loader. Before doing so, I standardized the case to lower. Now when I load the file into weka and apply the StringToWordVector filter (with stopwords, usewordcount, usestoplist, and stemmer - snowballstemmer), I do not see any change in my list of variables. All the variables (words) are given as 1 or 0 against each class.
Please help me.
Here is my filter command
weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 1000 -prune-rate -1.0 -C -N 0 -S -stemmer weka.core.stemmers.SnowballStemmer -M 1 -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \r\n\t.,;:\\'\\"()?!\""
That happened to me when I wanted to read from a .csv and use StringToWordVector.
My problem was that the text attribute was of type nominal and not String. I used the filter "NominalToString" to change the values to String, and then it worked.

Speaker adaptation with HTK

I am trying to adapt a monophone-based recogniser to a specific speaker. I am using the recipe given in HTKBook 3.4.1 section 3.6.2. I am getting stuck on the HHEd part, which I am invoking like so:
HHEd -A -D -T 1 -H hmm15/hmmdefs -H hmm15/macros -M classes regtree.hed monophones1eng
The error I end up with is as follows:
ERROR [+999] Components missing from Base Class list (2413 3375)
ERROR [+999] BaseClass check failed
The folder classes contains the file global which has the following contents:
~b "global"
<MMFIDMASK> *
<PARAMETERS> MIXBASE
<NUMCLASSES> 1
<CLASS> 1 {*.state[2-4].mix[1-25]}
The hmmdefs file within hmm15 had some mixture components missing (I am using 25 mixture components per state of each phone). I tried to "fill in the blanks" by adding mixture components with random mean and variance values but zero weights. This too has had no effect.
The hmms are left-right hmms with 5 states (3 emitting), each state modelled by a 25 component mixture. Each component in turn is modelled by an MFCC with EDA components. There are 46 phones in all.
My questions are:
1. Is the way I am invoking HHEd correct? Can it be invoked in the above manner for monophones?
2. I know that the base class list (rtree.base) must contain every single mixture component, but where do I find these missing mixture components?
NOTE: Please let me know in case more information is needed.
Edit 1: The file regtree.hed contains the following:
RN "models"
LS "stats_engOnly_3_4"
RC 32 "rtree"
Thanks,
Sriram
The way you invoke HHEd looks fine. The components are missing because they have become defunct. To deal with defunct components, read HTKBook-3.4.1 Section 8.4, page 137.
Questions:
- What does regtree.hed contain?
- How much data (in hours) are you using? 25 mixtures might be excessive.
You might want to use a more gradual increase in mixtures - MU +1 or MU +2 - and limit the number of mixtures (a guess: 3-8, depending on the amount of training data).

how much time does grid.py take to run?

I am using libsvm for binary classification. I wanted to try grid.py, as it is said to improve results. I ran this script for five files in separate terminals, and the script has been running for more than 12 hours.
This is the state of my 5 terminals now:
[root@localhost tools]# python grid.py sarts_nonarts_feat.txt>grid_arts.txt
Warning: empty z range [61.3997:61.3997], adjusting to [60.7857:62.0137]
line 2: warning: Cannot contour non grid data. Please use "set dgrid3d".
Warning: empty z range [61.3997:61.3997], adjusting to [60.7857:62.0137]
line 4: warning: Cannot contour non grid data. Please use "set dgrid3d".
[root@localhost tools]# python grid.py sgames_nongames_feat.txt>grid_games.txt
Warning: empty z range [64.5867:64.5867], adjusting to [63.9408:65.2326]
line 2: warning: Cannot contour non grid data. Please use "set dgrid3d".
Warning: empty z range [64.5867:64.5867], adjusting to [63.9408:65.2326]
line 4: warning: Cannot contour non grid data. Please use "set dgrid3d".
[root@localhost tools]# python grid.py sref_nonref_feat.txt>grid_ref.txt
Warning: empty z range [62.4602:62.4602], adjusting to [61.8356:63.0848]
line 2: warning: Cannot contour non grid data. Please use "set dgrid3d".
Warning: empty z range [62.4602:62.4602], adjusting to [61.8356:63.0848]
line 4: warning: Cannot contour non grid data. Please use "set dgrid3d".
[root@localhost tools]# python grid.py sbiz_nonbiz_feat.txt>grid_biz.txt
Warning: empty z range [67.9762:67.9762], adjusting to [67.2964:68.656]
line 2: warning: Cannot contour non grid data. Please use "set dgrid3d".
Warning: empty z range [67.9762:67.9762], adjusting to [67.2964:68.656]
line 4: warning: Cannot contour non grid data. Please use "set dgrid3d".
[root@localhost tools]# python grid.py snews_nonnews_feat.txt>grid_news.txt
Wrong input format at line 494
Traceback (most recent call last):
File "grid.py", line 223, in run
if rate is None: raise "get no rate"
TypeError: exceptions must be classes or instances, not str
I had redirected the outputs to files, but those files contain nothing so far.
The following files were created:
sbiz_nonbiz_feat.txt.out
sbiz_nonbiz_feat.txt.png
sarts_nonarts_feat.txt.out
sarts_nonarts_feat.txt.png
sgames_nongames_feat.txt.out
sgames_nongames_feat.txt.png
sref_nonref_feat.txt.out
sref_nonref_feat.txt.png
snews_nonnews_feat.txt.out (this one is empty)
There's just one line of information in the .out files.
The .png files are some gnuplot plots.
But I don't understand what the above gnuplot warnings convey. Should I re-run them?
Can anyone please tell me how much time this script might take if each input file contains about 144000 lines?
Thanks and regards
Your data is huge: 144 000 lines. So this will take some time. I used large data such as yours and it took up to a week to finish. If you are using images, which I suppose you are (hence the large data), try resizing your images before creating the data. You should get approximately the same results with your images resized.
The libSVM faq speaks to your question:
Q: Why grid.py/easy.py sometimes generates the following warning message?
Warning: empty z range [62.5:62.5], adjusting to [61.875:63.125]
Notice: cannot contour non grid data!
Nothing is wrong and please disregard the message. It is from gnuplot when drawing the contour.
As a side note, you can parallelize your grid.py operations. The libSVM tools directory README file has this to say on the matter:
Parallel grid search
You can conduct a parallel grid search by dispatching jobs to a
cluster of computers which share the same file system. First, you add
machine names in grid.py:
ssh_workers = ["linux1", "linux5", "linux5"]
and then setup your ssh so that the authentication works without
asking a password.
The same machine (e.g., linux5 here) can be listed more than once if
it has multiple CPUs or has more RAM. If the local machine is the
best, you can also enlarge the nr_local_worker. For example:
nr_local_worker = 2
In my Ubuntu 10.04 installation grid.py is actually /usr/bin/svm-grid.py
I guess grid.py is trying to find the optimal value for C (or Nu)?
I don't have an answer for the amount of time it will take, but you might want to try this SVM library, even though it's an R package: svmpath.
As described on that page there, it will compute the entire "regularization path" for a two class SVM classifier in about as much time as it takes to train an SVM using one value of your penalty param C (or Nu).
So, instead of training and doing cross-validation for an SVM with a value x for your C parameter, then doing all of that again for value x+1, x+2, etc., you can just train the SVM once, then query its predictive performance for different values of C post facto, so to speak.
Change:
if rate is None: raise "get no rate"
in line 223 in grid.py to:
if rate is None: raise ValueError("get no rate")
Also, try adding:
gnuplot.write("set dgrid3d\n")
after this line in grid.py:
gnuplot.write("set contour\n")
This should fix your warnings and errors, but I am not sure if it will work, since grid.py seems to think your data has no rate.
