Sort rps-blast results by position of the hit - parsing

I'm new to Biopython and have a question about parsing results. I followed a tutorial to get started, and this is the code I used:
from Bio.Blast import NCBIXML

for record in NCBIXML.parse(open("/Users/jcastrof/blast/pruebarpsb.xml")):
    if record.alignments:
        print("Query: %s..." % record.query[:60])
        for align in record.alignments:
            for hsp in align.hsps:
                print(" %s HSP,e=%f, from position %i to %i"
                      % (align.hit_id, hsp.expect, hsp.query_start, hsp.query_end))
Part of the result obtained is:
gnl|CDD|225858 HSP,e=0.000000, from position 32 to 1118
gnl|CDD|225858 HSP,e=0.000000, from position 1775 to 2671
gnl|CDD|214836 HSP,e=0.000000, from position 37 to 458
gnl|CDD|214836 HSP,e=0.000000, from position 1775 to 2192
gnl|CDD|214838 HSP,e=0.000000, from position 567 to 850
And what I want to do is to sort that result by position of the hit (Hsp_hit-from), like this:
gnl|CDD|225858 HSP,e=0.000000, from position 32 to 1118
gnl|CDD|214836 HSP,e=0.000000, from position 37 to 458
gnl|CDD|214838 HSP,e=0.000000, from position 567 to 850
gnl|CDD|225858 HSP,e=0.000000, from position 1775 to 2671
gnl|CDD|214836 HSP,e=0.000000, from position 1775 to 2192
My input file for rps-blast is a *.xml file.
Any suggestions on how to proceed?
Thanks!

The HSPs list is just a Python list, and can be sorted as usual. Try:
align.hsps.sort(key=lambda hsp: hsp.query_start)
However, you are dealing with a nested list (each match has its own list of HSPs), and you want to sort over all of them. Here, making your own list might be best. Something like this:
for record in ...:
    print("Query: %s..." % record.query[:60])
    # Flatten every HSP of every alignment into one list of tuples, then sort.
    # Tuples compare element by element, so query_start is the primary key.
    hits = sorted((hsp.query_start, hsp.query_end, hsp.expect, align.hit_id)
                  for align in record.alignments for hsp in align.hsps)
    for q_start, q_end, expect, hit_id in hits:
        print(" %s HSP,e=%f, from position %i to %i"
              % (hit_id, expect, q_start, q_end))
Peter
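The flatten-then-sort idea can be tried without a BLAST file. The sketch below is only an illustration: `HSP` and `Alignment` are hypothetical namedtuple stand-ins for Biopython's objects, filled with the five hits shown in the question. It sorts on `query_start` alone, which is an assumption about the desired tie-breaking rather than part of the answer above.

```python
from collections import namedtuple

# Hypothetical stand-ins for Biopython's alignment and HSP objects,
# carrying only the attributes used in the answer above.
HSP = namedtuple("HSP", "query_start query_end expect")
Alignment = namedtuple("Alignment", "hit_id hsps")

# The five hits printed in the question, in their original order.
alignments = [
    Alignment("gnl|CDD|225858", [HSP(32, 1118, 0.0), HSP(1775, 2671, 0.0)]),
    Alignment("gnl|CDD|214836", [HSP(37, 458, 0.0), HSP(1775, 2192, 0.0)]),
    Alignment("gnl|CDD|214838", [HSP(567, 850, 0.0)]),
]

# Flatten the nested lists: outer loop over alignments, inner loop over HSPs.
hits = [(hsp.query_start, hsp.query_end, hsp.expect, align.hit_id)
        for align in alignments
        for hsp in align.hsps]

# Sort on query_start only; list.sort is stable, so hits that share a start
# position keep their original relative order.
hits.sort(key=lambda hit: hit[0])

for q_start, q_end, expect, hit_id in hits:
    print(" %s HSP,e=%f, from position %i to %i"
          % (hit_id, expect, q_start, q_end))
```

With the key restricted to the start position, the two hits starting at 1775 stay in the order requested in the question. Sorting the whole tuples instead would break that tie by `query_end` and list the hit ending at 2192 first.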


Sci-kit Learn Mutual Information Classification- Dataframe and function issues

I've currently got the below set of smoothed data:
print(df_smooth.dropna())
mean std skew kurtosis peak2peak rms crestFactor \
4 0.247555 2.100961 0.001668 3.024679 20.628402 2.115862 5.066747
5 0.237015 2.062690 -0.000792 3.029156 20.314159 2.076466 5.043114
6 0.230783 2.044657 -0.001680 3.028746 20.219575 2.057846 5.030472
7 0.235838 1.986232 -0.001031 3.025417 19.497090 2.000425 4.960363
8 0.235062 1.984086 -0.001014 3.031342 19.817176 1.998209 4.989612
9 0.238660 1.968814 -0.001608 3.023882 19.340179 1.983427 4.998115
10 0.223305 1.975597 -0.000197 3.045224 19.701747 1.988305 5.135947
11 0.219480 2.007902 -0.002460 3.060428 20.252087 2.020074 5.117502
12 0.214518 2.071287 -0.002944 3.092217 21.489908 2.082439 5.302407
13 0.244281 2.122538 -0.003717 3.094335 21.792449 2.137164 5.271366
14 0.235806 2.161333 -0.003364 3.123866 23.128965 2.174895 5.472129
15 0.233630 2.175946 -0.002682 3.152740 24.045300 2.189226 5.610038
16 0.236764 2.188906 -0.000032 3.203623 24.745386 2.202420 5.772337
17 0.262289 2.205111 0.000350 3.192511 24.708587 2.221785 5.681394
18 0.229795 2.139946 0.001239 3.183109 23.745617 2.152940 5.564731
19 0.243538 2.150018 0.001071 3.170558 23.385026 2.164355 5.427326
20 0.266458 2.097468 -0.000830 3.144338 22.084817 2.115172 5.236667
21 0.280729 2.106302 -0.000618 3.101014 21.434129 2.125517 5.147621
22 0.252042 2.078190 0.000259 3.100911 20.991519 2.093988 5.231684
23 0.252297 2.097652 0.000383 3.126250 21.790854 2.113380 5.378267
24 0.250502 2.078781 0.000042 3.129014 21.559732 2.094428 5.340024
25 0.220506 2.070573 0.001974 3.110477 21.473643 2.082461 5.364519
26 0.204412 2.049979 -0.000306 3.227532 22.975315 2.060236 5.706146
27 0.215429 2.103150 -0.001421 3.275257 23.719901 2.114265 5.660891
28 0.216689 2.137870 -0.001783 3.298750 24.040561 2.148948 5.614089
29 0.208962 2.160487 0.000547 3.349068 24.546959 2.170628 5.732873
30 0.227231 2.267705 0.000101 3.413948 25.958169 2.279131 5.745555
31 0.221097 2.258519 0.001567 3.379193 25.424651 2.269446 5.662354
32 0.204962 2.224569 0.000951 3.458483 25.984242 2.234101 5.862379
33 0.224707 2.283631 0.000046 3.516125 27.410217 2.294934 6.024091
34 0.248792 2.354713 -0.001143 3.630634 29.159253 2.368248 6.197140
35 0.229501 2.339020 -0.000673 3.743356 30.695670 2.350898 6.613011
36 0.255474 2.454993 -0.001164 3.780962 32.480614 2.468843 6.627903
37 0.257979 2.530495 0.000630 3.962767 33.656646 2.544310 6.661273
38 0.232977 2.498537 0.001111 3.931879 32.754947 2.510044 6.557506
39 0.237025 2.392735 -0.000920 3.919665 31.277647 2.405969 6.494115
40 0.243630 2.368295 -0.001569 3.812383 29.306347 2.382131 6.077379
41 0.221252 2.305374 -0.000861 4.032235 29.548822 2.317355 6.292428
42 0.215262 2.254417 -0.002057 3.977328 28.970507 2.266098 6.353168
43 0.208581 2.240020 -0.001403 4.154288 30.121039 2.251270 6.630079
44 0.170230 2.302794 -0.001867 4.307822 31.556097 2.309174 6.838202
45 0.168889 2.353960 -0.001309 4.433633 32.825109 2.360053 6.977719
46 0.163156 2.337222 -0.001097 4.238485 31.344888 2.342934 6.658564
47 0.165685 2.369817 -0.002246 4.151915 31.154929 2.375626 6.438286
48 0.190677 2.552397 -0.003645 4.311166 33.473407 2.559565 6.428513
49 0.210200 2.667889 0.004168 4.495159 35.625185 2.676223 6.500683
I want to use scikit-learn's mutual information classification (mutual_info_classif) to test for monotonicity in this dataset, but I am having trouble with the syntax (specifically the X argument) and with splitting the full dataset into train and test sets.
I only want 40% of the dataset to be used as the test data.
Currently this is the command I have:
X_train, X_test, y_train, y_test = train_test_split(df_smooth.dropna(),
                                                    test_size=0.4,
                                                    random_state=0)
print(X_train)
This is the error I get:
ValueError: not enough values to unpack (expected 4, got 2)
from sklearn.feature_selection import mutual_info_classif
mutual_info = mutual_info_classif(X_train, y_train)
The output I want is something like this:
[Image: monotonicity bar chart, descending]
where the MIC values are ranked from highest to lowest, using the following command:
from sklearn.feature_selection import mutual_info_classif
mutual_info = mutual_info_classif(X_train, y_train)
mutual_info
I tried extracting the ordered numbers 1-49 from the DataFrame (which I believe is what goes in as the X input to the mutual information function), but they don't seem to be part of the DataFrame when called with iloc[:, 0] (which displays the values in the "mean" column instead). I also don't know how this accounts for the rows dropped as N/A.
If you're testing for something like "the degree of monotonicity between two variables," you're probably looking for Spearman's rank correlation coefficient, which is implemented in scipy.stats.spearmanr:
MRE:
from io import StringIO
import pandas as pd
from scipy import stats
data = StringIO("""mean,std,skew,kurtosis,peak2peak,rms,crestFactor
0.247555,2.100961,0.001668,3.024679,20.628402,2.115862,5.066747
0.237015,2.062690,-0.000792,3.029156,20.314159,2.076466,5.043114
0.230783,2.044657,-0.001680,3.028746,20.219575,2.057846,5.030472
0.235838,1.986232,-0.001031,3.025417,19.497090,2.000425,4.960363
0.235062,1.984086,-0.001014,3.031342,19.817176,1.998209,4.989612
""")
df = pd.read_csv(data)
for var in df.columns:
    print(f"{var} {stats.spearmanr(df[var], range(len(df))).correlation:.2f}")
Comparing the first five values of each column to the strictly increasing sequence range(len(df)) yields the following table, suggesting that most of these columns decrease over the first few samples (only kurtosis trends upward):
mean -0.70
std -1.00
skew -0.60
kurtosis 0.60
peak2peak -0.90
rms -1.00
crestFactor -0.90
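To make the numbers in that table less of a black box, here is a minimal pure-Python sketch of Spearman's coefficient using the rank-difference formula 1 - 6*sum(d^2) / (n*(n^2 - 1)). It assumes there are no tied values (true for the five rows above); `ranks` and `spearman_no_ties` are hypothetical helper names, not part of SciPy, and for real data `scipy.stats.spearmanr` remains the safer choice.

```python
def ranks(values):
    """Return 1-based ranks; assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_no_ties(xs, ys):
    """Spearman's rho from squared rank differences; valid only without ties."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# First five rows of two columns from the MRE above.
mean = [0.247555, 0.237015, 0.230783, 0.235838, 0.235062]
std = [2.100961, 2.062690, 2.044657, 1.986232, 1.984086]
position = list(range(len(mean)))

print(f"mean {spearman_no_ties(mean, position):.2f}")  # mean -0.70
print(f"std {spearman_no_ties(std, position):.2f}")    # std -1.00
```

A value of -1.00 for std reflects that its first five samples are strictly decreasing, while -0.70 for mean indicates a mostly decreasing but not perfectly ordered sequence.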

unmapped reads using bwa

I'm trying to use BWA MEM to align some WGS files, but I noticed something strange.
When I used samtools flagstat to check the resulting .bam files, I noticed that most reads were unmapped.
76124692 + 0 in total (QC-passed reads + QC-failed reads)
308 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
708109 + 0 mapped (0.93% : N/A)
76124384 + 0 paired in sequencing
38062192 + 0 read1
38062192 + 0 read2
0 + 0 properly paired (0.00% : N/A)
12806 + 0 with itself and mate mapped
694995 + 0 singletons (0.91% : N/A)
11012 + 0 with mate mapped to a different chr
1682 + 0 with mate mapped to a different chr (mapQ>=5)
Previously, I used Picard SamToFastq to convert my .bam file to .fastq. The head of this file looks like this:
#SRR1513845.100000000/1
AACGAAACGAAAAGAAAAGAAAAGAAAGAAAAAGAAAGGAACAGAAAAG
+
AAA?=>'2&)&)&&))2(-'(,.%)&31%%'6/6,(1,501046124&6
#SRR1513845.100000000/2
AATTAATTAAGCCCCGAAGGAAGCGAGAAACACTG
+
AAA?B=AB#A#A=?A>AA#?.#?8<.1;><*17?<
#SRR1513845.100000001/1
TATAACCATATAACAAATCCAAGCCCAACAGAGAAGAGAAACAAAAAGA
+
>27<#>&856;.'.&9.%>%::-5194&:+'5);;%1&'/%%999%5(8
#SRR1513845.100000001/2
TCCAACTGATATCGTAATT
+
#3<#A>:8;?:383>=3:=
#SRR1513845.100000003/1
TATCGGTCTTGTTTAG
+
=1;=6?(4>4A13?0A
#SRR1513845.100000003/2
TTCAGGTGCCTCGAAGTTGGATAAGG
+
==>>9#;?3<A5>7);)<9-<25<9?
#SRR1513845.100000004/1
GTCATTTAGCCCAAGAGAATGGC
+
BB#ABA##A?</A>>25A;#4:5
#SRR1513845.100000004/2
GGAGATCGAGTCAAATTTTATGCTAGGTAT
+
%A:<#7A##=4AA?7<A5>#;3&?>>:;:>
#SRR1513845.100000012/1
GCGTCGTTATCCAAAA
+
>A:9:?88=<=0&>>9
#SRR1513845.100000012/2
TGGAAATATTTATTACCCCCCCCCCCCCCCCCCCCCCCCCC
+
A;>#A;4;=??8=:#;-4<?632;=:67;>=):9>9%88=9
#SRR1513845.100000016/1
CGTGGAATGGGGTGTGATTTAATTATCGAATGGCGTCCGATCCAGATT
Are these characters (<.#;:) normal, and do they influence bwa's alignment?
Here is my bwa code:
bwa mem -M -t 38 -p hsa_GRCh38.fa SRR1513_fastqtosam.fq -o SRRR1513_aligned.bam
and my samtofastq code
java -Xmx8G -jar picard.jar SamToFastq \
I= SRR1513_fastqtosam.bam \
FASTQ= SRR1513_fastqtosam.fq \
CLIPPING_ATTRIBUTE=XT \
CLIPPING_ACTION=2 \
INTERLEAVE=true \
NON_PF=true TMP_DIR=./temp
I've been stuck on this for a few hours.
Thanks in advance!
UPDATE:
I just noticed these messages during the bwa mem alignment:
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation FR as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs

Getting a 2D histogram of a grayscale image in Julia

Using the Images package, I can open a color image and convert it to grayscale:
using Images
img_gld = imread("...path to some color jpg...")
img_gld_gs = convert(Image{Gray},img_gld)
#change from floats to Array of values between 0 and 255:
img_gld_gs = reinterpret(Uint8,data(img_gld_gs))
Now I've got a 1920X1080 array of Uint8's:
julia> img_gld_gs
1920x1080 Array{Uint8,2}
Now I want to get a histogram of the 2D array of Uint8 values:
julia> hist(img_gld_gs)
(0.0:50.0:300.0,
6x1080 Array{Int64,2}:
1302 1288 1293 1302 1297 1300 1257 1234 … 12 13 13 12 13 15 14
618 632 627 618 623 620 663 686 189 187 187 188 185 183 183
0 0 0 0 0 0 0 0 9 9 8 7 8 7 7
0 0 0 0 0 0 0 0 10 12 9 7 13 7 9
0 0 0 0 0 0 0 0 1238 1230 1236 1235 1230 1240 1234
0 0 0 0 0 0 0 0 … 462 469 467 471 471 468 473)
But, instead of 6x1080, I'd like 256 slots in the histogram to show total number of times each value has appeared. I tried:
julia> hist(img_gld_gs,256)
But that gives:
(2.0:1.0:252.0,
250x1080 Array{Int64,2}:
So instead of a 256x1080 Array, it's 250x1080. Is there any way to force it to have 256 bins (without resorting to writing my own hist function)? I want to be able to compare different images and I want the histogram for each image to have the same number of bins.
Assuming you want a histogram for the entire image (rather than one per row), you might want
hist(vec(img_gld_gs), -1:255)
which first converts the image to a 1-dimensional vector. (You can also use img_gld_gs[:], but that copies the data.)
Also note the range here: the hist function uses a left-open interval, so it will omit counting zeros unless you use something smaller than 0.
hist also accepts a vector (or range) as an optional argument that specifies the edge boundaries, so
hist(img_gld_gs, 0:256)
should work.
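The underlying requirement, one whole-image histogram with exactly 256 bins regardless of which values actually occur, can be sketched independently of Julia's `hist`. The Python snippet below is only an illustration of that counting scheme, not the Images package API; `histogram_256` and the tiny `image` are hypothetical.

```python
def histogram_256(pixels):
    """Count occurrences of each value 0..255 in a flat iterable of ints."""
    counts = [0] * 256
    for value in pixels:
        counts[value] += 1
    return counts


# Hypothetical tiny "image" as rows of 8-bit values, flattened before counting
# (the same role vec() plays in the answer above).
image = [[0, 0, 255], [17, 255, 255]]
flat = [value for row in image for value in row]

counts = histogram_256(flat)
print(len(counts), counts[0], counts[17], counts[255])  # 256 2 1 3
```

Because the bin count is fixed up front rather than derived from the data range, histograms of different images always have the same length and can be compared directly.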

Get value from string if it matches a regex

Imagine I have a list of strings like this:
Hello Word 132 132 132 GoodBye!! Should return 132132132
Hello Word 132 132 GoodBye!! Should return nil
132 132 132 GoodBye! Should return 132132132
132132132 GoodBye! Should return 132132132
1321321321GoodBye! Should return nil
132 132 1321 Should return nil
How can I check whether the phrase contains 9 consecutive digits, either together or separated by spaces in groups of three, and get that same number?
Thanks
You can use this regex:
(\d{3})(\s?\1){2}
and then remove any whitespace from the match.
If you don't want it to match inside a string such as
Some123 123 123Thing
you can add word boundaries:
\b(\d{3})(\s?\1){2}\b
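As a quick check of the word-boundary version against the examples in the question, here is a small sketch using Python's `re` module (the question does not name a language, so Python is an assumption, and `extract_number` is a hypothetical helper name).

```python
import re

# The answer's pattern with word boundaries: one three-digit group, then the
# same three digits twice more, each optionally preceded by one whitespace.
PATTERN = re.compile(r"\b(\d{3})(\s?\1){2}\b")


def extract_number(text):
    """Return the matched digits with whitespace removed, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    return re.sub(r"\s", "", match.group(0))


examples = [
    "Hello Word 132 132 132 GoodBye!!",
    "Hello Word 132 132 GoodBye!!",
    "132 132 132 GoodBye!",
    "132132132 GoodBye!",
    "1321321321GoodBye!",
    "132 132 1321",
]
for text in examples:
    print(repr(text), "->", extract_number(text))
```

Note that the backreference `\1` requires the same three digits to repeat, which fits every example in the question. An input such as "123 456 789" would not match; if arbitrary digits should be accepted, `\d{3}` would have to be repeated in place of `\1`.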

Can I write a reverse div operator?

I have a mathematical equation. How can I find its inverse?
My equation:
var
  x, y: integer;
begin
  // example: x = 1234
  y := x - (x div 100);
end;
After this code I know "y". How can I find "x" (1234)?
In general, you can't. Since div does integer division, there are potentially many inputs that produce the same result. Starting from that result, any of those inputs is an equally likely possibility as the original input. For example:
175 div 7 = 25
176 div 7 = 25
177 div 7 = 25
178 div 7 = 25
179 div 7 = 25
180 div 7 = 25
181 div 7 = 25
Starting from 25, any of those numbers from 175 to 181 would be an equally viable answer.
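The ambiguity can be made concrete with a short search. The sketch below is written in Python rather than Pascal (Python's `//` agrees with `div` for non-negative integers, which is the only case assumed here); `forward` and `candidates` are hypothetical helper names. It lists every non-negative x that produces a given y for the question's formula.

```python
def forward(x):
    """The question's formula, y = x - (x div 100), for non-negative x."""
    return x - x // 100


def candidates(y):
    """All non-negative x with forward(x) == y, found by a bounded search."""
    # Since x = y + x // 100 and x // 100 <= x / 100, x cannot exceed y + y // 99.
    return [x for x in range(y, y + y // 99 + 2) if forward(x) == y]


# Every input in this window that integer-divides by 7 to give 25.
print([n for n in range(170, 190) if n // 7 == 25])
# [175, 176, 177, 178, 179, 180, 181]

print(candidates(forward(1234)))  # [1234]
print(candidates(99))             # [99, 100]
```

For this particular formula most values of y happen to have a single candidate, as with 1234, but some have two (99 and 100 both give 99), so a reverse function can at best return a set of possible inputs.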
