dexseq_count.py vs htseq-count - analysis

I have some paired-end RNA-seq files which I aligned with TopHat:
~/tophat-2.0.7.Linux_x86_64/./tophat -p10 --segment-length 18
--no-coverage-search --segment-mismatches 1 -G
then sorted with samtools:
samtools sort -n /media//PRE_6_8_2/accepted_hits.bam
/media/PRE_6_8_2/PRE_6_8_2.sort.hg19.bam
and then converted to SAM:
samtools view /media/PRE_6_8_2/PRE_6_8_2.sort.hg19.bam.bam
/media/PRE_6_8_2/PRE_6_8_2.sort.hg19.sam
Then I am trying to use DEXSeq, running:
python2.7 /home/3.0/DEXSeq/python_scripts/dexseq_count.py -p yes -r name -s no
/media/PRE_6_8_2/hg19/hg19_IlluminaAnnotation_genes.gtf
/media/PRE_6_8_2/PRE_6_8_2.sort.hg19.sam
/media/PRE_6_8_2//PRE_6_8_2/untreated1fb.txt
which gives me:
_ambiguous 0
_empty 39845364
_lowaqual 0
_notaligned 0
without any reads for any exons: 0 for everything.
Whereas using htseq-count -r name -s no everything looks fine (reads in genes)...
Could you please advise me how to solve this? I really need to run DEXSeq, as I want reads per exon to detect splicing differences.
Thank you in advance for your help.

I had a similar problem. Use the correct gene annotation file: DEXSeq expects a flattened .gff file produced from the .gtf downloaded for your reference genome. With that, you will get counts for your exons.
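For reference, the flattening step uses dexseq_prepare_annotation.py, which ships in the same python_scripts/ directory as dexseq_count.py. This is a sketch using the paths from the question; the flattened output file name is a placeholder:

```shell
# Flatten the GTF into the exon-bin GFF that dexseq_count.py expects.
python2.7 /home/3.0/DEXSeq/python_scripts/dexseq_prepare_annotation.py \
    /media/PRE_6_8_2/hg19/hg19_IlluminaAnnotation_genes.gtf \
    /media/PRE_6_8_2/hg19/hg19_flattened.gff

# Then count against the flattened annotation instead of the raw GTF:
python2.7 /home/3.0/DEXSeq/python_scripts/dexseq_count.py -p yes -r name -s no \
    /media/PRE_6_8_2/hg19/hg19_flattened.gff \
    /media/PRE_6_8_2/PRE_6_8_2.sort.hg19.sam \
    /media/PRE_6_8_2/PRE_6_8_2/untreated1fb.txt
```

Counting against the raw .gtf is exactly what produces the all-zero output with everything in `_empty`, since dexseq_count.py only recognizes the exon-bin features that the prepare script generates.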

tshark returns 0 results for filter icmp.no_resp but wireshark returns 12 results with the same filter

I am trying to do packet capture analysis with tshark on about 30000 files, looking for a needle in the haystack. The files containing interesting needles contain ICMP failures. I wrote a script which iterates through these files with tshark, but they all return 0 results.
tshark -r <filename> -Y "icmp.no_resp"
tshark -r <filename> -Y "icmp.resp_not_found"
Both of these commands yield 0 results. However, when I open a specific file in Wireshark and use the display filter "icmp.no_resp" or "icmp.resp_not_found", I see results.
Is this a bug in tshark where it can't identify "response not found"?
I'm running tshark/wireshark v3.6.7 on Ubuntu.
I figured it out.
tshark requires a second pass to evaluate certain display filters. Adding the -2 option enables this:
tshark -r <filename> -Y "icmp.resp_not_found" -2
I hope this helps someone in the future.
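A sketch of the batch scan with the second pass enabled (the captures/ directory and .pcap glob are assumptions; adjust to your layout):

```shell
# Scan every capture for unanswered ICMP requests; -2 forces the
# second pass that response-tracking filters need.
for f in captures/*.pcap; do
    [ -e "$f" ] || continue   # glob matched nothing; skip
    count=$(tshark -r "$f" -Y "icmp.resp_not_found" -2 | wc -l)
    [ "$count" -gt 0 ] && echo "$f: $count unanswered ICMP requests"
done
```

Only files that actually contain unanswered requests are printed, which narrows 30000 captures down to the interesting ones.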

lcov remove option is not removing coverage data as expected

I'm using lcov to generate coverage reports. I have a tracefile (broker.info) with this content (relevant fragment shown):
$ lcov -r broker.info
...
[/var/lib/jenkins/jobs/ContextBroker-PreBuild-UnitTest/workspace/test/unittests/orionTypes/]
EntityTypeResponse_test.cpp | 100% 11| 100% 6| - 0
...
[/var/lib/jenkins/jobs/ContextBroker-PreBuild-UnitTest/workspace/test/unittests/parse/]
CompoundValueNode_test.cpp | 100% 82| 100% 18| - 0
...
[/var/lib/jenkins/jobs/ContextBroker-PreBuild-UnitTest/workspace/test/unittests/rest/]
OrionError_test.cpp |92.1% 38| 100% 6| - 0
...
[/var/lib/jenkins/jobs/ContextBroker-PreBuild-UnitTest/workspace/test/unittests/serviceRoutines/]
badVerbAllFour_test.cpp | 100% 24| 100% 7| - 0
...
I want to remove all the info corresponding to test/unittests files.
I have attempted to use the -r option which, according to the man page, is:
-r tracefile pattern
--remove tracefile pattern
Remove data from tracefile.
Use this switch if you want to remove coverage data for a particular set of files from a tracefile. Additional command line parameters will be interpreted as
shell wildcard patterns (note that they may need to be escaped accordingly to prevent the shell from expanding them first). Every file entry in tracefile
which matches at least one of those patterns will be removed.
The result of the remove operation will be written to stdout or the tracefile specified with -o.
Only one of -z, -c, -a, -e, -r, -l, --diff or --summary may be specified at a time.
Thus, I'm using
$ lcov -r broker.info 'test/unittests/*' -o broker.info2
As far as I understand, test/unittests/* matches the files under test/unittests. However, it's not working (note Deleted 0 files below):
Reading tracefile broker.info
Deleted 0 files
Writing data to broker.info2
Summary coverage rate:
lines......: 92.6% (58313 of 62978 lines)
functions..: 96.0% (6451 of 6718 functions)
branches...: no data found
I have also tried these variants (same result):
$ lcov -r broker.info "test/unittests/*" -o broker.info2
$ lcov -r broker.info "test/unittests/\*" -o broker.info2
$ lcov -r broker.info "test/unittests" -o broker.info2
So, maybe I'm doing something wrong?
I'm using lcov version 1.13 (just in case the data is relevant)
Thanks!
I have been testing other options, and the following one seems to work, using a wildcard in the prefix as well:
$ lcov -r broker.info "*/test/unittests/*" -o broker.info2
Maybe it is something new in version 1.13, because in version 1.11 it seems to work without a wildcard in the prefix...
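The likely explanation (an assumption based on the behaviour above) is that the patterns are matched against the full absolute path recorded in the tracefile, so a pattern without a leading wildcard can never match. The glob semantics, sketched in plain shell using one of the paths from the tracefile:

```shell
# Path as recorded in the tracefile (absolute):
path="/var/lib/jenkins/jobs/ContextBroker-PreBuild-UnitTest/workspace/test/unittests/parse/CompoundValueNode_test.cpp"
case "$path" in
  "test/unittests/"*)  echo "bare pattern matches" ;;
  */test/unittests/*)  echo "leading-wildcard pattern matches" ;;
esac
# → leading-wildcard pattern matches
```

The bare pattern would only ever match a path that literally starts with test/, which an absolute path never does.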
The following lcov command works fine, even with wildcards (lcov 1.14):
lcov --remove meson-logs/coverage.info '/home/builduser/external/*' '/home/builduser/unittest/*' -o meson-logs/sourcecoverage.info

extract a file from xz file

I have a huge file file.tar.xz containing many smaller text files with a similar structure. I want to quickly examine one file out of the archive to get a glimpse of the files' content structure. I don't have information about the names of the files within the archive. Is there any way to extract a single file, given the above scenario?
Thank you.
EDIT: I don't want to tar -xvf file.tar.xz.
Based on the discussion in the comments, I tried the following, which worked for me. It might not be the optimal solution, and the regex might need some improvement, but you'll get the idea.
I first created a demo archive:
cd /tmp
mkdir demo
for i in {1..100}; do echo $i > "demo/$i.txt"; done
cd demo && tar cfJ ../demo.tar.xz * && cd ..
demo.tar.xz now contains 100 txt files.
The following lists the contents of the archive, selects the first file and stores the path within the archive into the variable firstfile:
firstfile=`tar -tvf demo.tar.xz | grep -Po -m1 "(?<=:[0-9]{2} ).*$"`
echo $firstfile will output 1.txt.
You can now extract this single file from the archive:
tar xf demo.tar.xz $firstfile
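If the member names don't matter, the listing and extraction can also be combined without the regex. A sketch (it recreates a tiny archive first so it is self-contained, and assumes an xz-capable tar):

```shell
# Build a small demo archive (as above, but only 3 files).
mkdir -p demo2 && for i in 1 2 3; do echo "content $i" > "demo2/$i.txt"; done
tar cfJ demo2.tar.xz demo2

# Take the first regular member (skip directory entries ending in /)...
first=$(tar -tf demo2.tar.xz | grep -v '/$' | head -n 1)
# ...and extract just that one member:
tar xf demo2.tar.xz "$first"
head "$first"
```

Using -t instead of -tv avoids having to parse permissions and dates out of the verbose listing.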

What is the truly correct usage of -S parameter on weka classifier A1DE?

So I'm using Weka 3.7.11 on a Windows machine (and running bash scripts with Cygwin), and I found an inconsistency regarding the AODE classifier (which, in this version of Weka, comes from an add-on package).
Using Averaged N-Dependencies Estimators from the GUI, I get the following configuration (from an example that worked alright in the Weka Explorer):
weka.classifiers.meta.FilteredClassifier -F "weka.filters.unsupervised.attribute.Discretize -F -B 10 -M -1.0 -R first-last" -W weka.classifiers.bayes.AveragedNDependenceEstimators.A1DE -- -F 1 -M 1.0 -S
So I modified this to get the following command in my bash script:
java -Xmx60G -cp "C:\work\weka-3.7.jar;C:\Users\Oracle\wekafiles\packages\AnDE\AnDE.jar" weka.classifiers.meta.FilteredClassifier \
-t train_2.arff -T train_1.arff \
-classifications "weka.classifiers.evaluation.output.prediction.CSV -distribution -p 1 -file predictions_final_multi.csv -suppress" \
-threshold-file umbral_multi.csv \
-F "weka.filters.unsupervised.attribute.Discretize -F -B 10 -M -1.0 -R first-last" \
-W weka.classifiers.bayes.AveragedNDependenceEstimators.A1DE -- -F 1 -M 1.0 -S
But this gives me the error:
Weka exception: No value given for -S option.
Which is weird, since this was not a problem in the GUI. In the GUI, the information box says that -S is just a flag ("Subsumption Resolution can be achieved by using -S option"), so it shouldn't expect any number at all, which is consistent with what I got using the Explorer.
So then, what's the deal with the -S option when using the command line? Looking at the error text given by weka, I found this:
Options specific to classifier weka.classifiers.bayes.AveragedNDependenceEstimators.A1DE:
-D
Output debugging information
-F <int>
Impose a frequency limit for superParents (default is 1)
-M <double>
Specify a weight to use with m-estimate (default is 1)
-S <int>
Specify a critical value for specialization-generalization SR (default is 100)
-W
Specify if to use weighted AODE
So it seems that this class works in two different ways, depending on which method I use (GUI vs. Command Line).
The solution I found, at least for the meantime, was to write -S 100 on my script. Is this really the same as just putting -S in the GUI?
Thanks in advance.
JM
I've had a play with this classifier, and can confirm that what you are experiencing on your end is consistent with what I see here. From the GUI, the -S option (Subsumption Resolution) requires no parameters, while the command line does (specialization-generalization SR).
They don't sound like the same parameter, so you may need to raise this issue with the developer of the third-party package if you would like more information. You can find the contact details via Tools -> Package Manager -> AnDE, which will point you to the contacts for the library.

An easy way to diff log files, ignoring the time stamps?

I need to diff two log files but ignore the time stamp part of each line (the first 12 characters to be exact). Is there a good tool, or a clever awk command, that could help me out?
Depending on the shell you are using, you can turn the approach @Blair suggested into a one-liner:
diff <(cut -b13- file1) <(cut -b13- file2)
(+1 to @Blair for the original suggestion :-)
@EbGreen said:
I would just take the log files and strip the timestamps off the start of each line, then save the result out to different files. Then diff those files.
That's probably the best bet, unless your diffing tool has special powers.
For example, you could
cut -b13- file1 > trimmed_file1
cut -b13- file2 > trimmed_file2
diff trimmed_file1 trimmed_file2
See @toolkit's response for an optimization that makes this a one-liner and obviates the need for extra files, if your shell supports it. Bash 3.2.39 at least seems to...
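If you prefer awk over cut, the same one-liner idea works with substr. A sketch with two tiny sample logs (assumes the timestamp is exactly 12 characters and a shell with process substitution, e.g. bash):

```shell
# Two sample logs: identical except for the 12-character timestamp prefix.
printf '09:01:00.000 data 1\n09:02:00.000 data 2\n' > a.log
printf '11:00:01.000 data 1\n11:00:02.000 data 2\n' > b.log

# Drop the first 12 characters of each line on the fly, then diff.
diff <(awk '{ print substr($0, 13) }' a.log) \
     <(awk '{ print substr($0, 13) }' b.log) && echo "no differences"
# → no differences
```

substr($0, 13) keeps everything from the 13th character onward, so only the payload after the timestamp is compared.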
Answers using cut are fine, but sometimes keeping the timestamps within the diff output is desirable. As the OP's question is about ignoring the time stamps (not removing them), I share here my tricky command line:
diff -I '^#' <(sed -r 's/^((.){12})/#\1\n/' 1.log) <(sed -r 's/^((.){12})/#\1\n/' 2.log)
sed isolates the timestamps (# before and \n after) within a process substitution
diff -I '^#' ignores lines containing these timestamps (lines beginning with #)
example
Two log files having same content but different timestamps:
$> for ((i=1;i<11;i++)) do echo "09:0${i::1}:00.000 data $i"; done > 1.log
$> for ((i=1;i<11;i++)) do echo "11:00:0${i::1}.000 data $i"; done > 2.log
Basic diff command line says all lines are different:
$> diff 1.log 2.log
1,10c1,10
< 09:01:00.000 data 1
< 09:02:00.000 data 2
< 09:03:00.000 data 3
< 09:04:00.000 data 4
< 09:05:00.000 data 5
< 09:06:00.000 data 6
< 09:07:00.000 data 7
< 09:08:00.000 data 8
< 09:09:00.000 data 9
< 09:01:00.000 data 10
---
> 11:00:01.000 data 1
> 11:00:02.000 data 2
> 11:00:03.000 data 3
> 11:00:04.000 data 4
> 11:00:05.000 data 5
> 11:00:06.000 data 6
> 11:00:07.000 data 7
> 11:00:08.000 data 8
> 11:00:09.000 data 9
> 11:00:01.000 data 10
Our tricky diff -I '^#' does not display any difference (timestamps ignored):
$> diff -I '^#' <(sed -r 's/^((.){12})/#\1\n/' 1.log) <(sed -r 's/^((.){12})/#\1\n/' 2.log)
$>
Change 2.log (replace data by foo on the 6th line) and check again:
$> sed '6s/data/foo/' -i 2.log
$> diff -I '^#' <(sed -r 's/^((.){12})/#\1\n/' 1.log) <(sed -r 's/^((.){12})/#\1\n/' 2.log)
11,13c11,13
< #09:06:00.000
< data 6
< #09:07:00.000
---
> #11:00:06.000
> foo 6
> #11:00:07.000
=> the timestamps are kept in the diff output!
You can also use the side-by-side view with the -y or --side-by-side option:
$> diff -y -I '^#' <(sed -r 's/^((.){12})/#\1\n/' 1.log) <(sed -r 's/^((.){12})/#\1\n/' 2.log)
#09:01:00.000 #11:00:01.000
data 1 data 1
#09:02:00.000 #11:00:02.000
data 2 data 2
#09:03:00.000 #11:00:03.000
data 3 data 3
#09:04:00.000 #11:00:04.000
data 4 data 4
#09:05:00.000 #11:00:05.000
data 5 data 5
#09:06:00.000 | #11:00:06.000
data 6 | foo 6
#09:07:00.000 | #11:00:07.000
data 7 data 7
#09:08:00.000 #11:00:08.000
data 8 data 8
#09:09:00.000 #11:00:09.000
data 9 data 9
#09:01:00.000 #11:00:01.000
data 10 data 10
old sed
If your sed implementation does not support the -r option, you may have to count the twelve dots <(sed 's/^\(............\)/#\1\n/' 1.log) or use another pattern of your choice ;)
For a graphical option, Meld can do this using its text filters feature.
It allows ignoring lines based on one or more Python regexes. The differences still appear, but lines that don't have any other differences won't be highlighted.
Use KDiff3 and, at Configure > Diff, set the "Line-Matching Preprocessor command" to something like:
sed "s/[ 012][0-9]:[0-5][0-9]:[0-5][0-9]//"
This will filter out the timestamps from the comparison alignment algorithm.
KDiff3 also lets you manually align specific lines.
I want to propose a solution for Visual Studio Code:
Install this extension - https://marketplace.visualstudio.com/items?itemName=ryu1kn.partial-diff
Configure it like this - https://github.com/ryu1kn/vscode-partial-diff/issues/49#issuecomment-608299085
Run the extension command "Toggle Pre-Comparison Text Normalization Rules" and enable the rule added in step 2
Use the extension (here is an explanation of its UI quirk - https://github.com/ryu1kn/vscode-partial-diff/issues/11)
