The European Nucleotide Archive (ENA) provides annotated coding sequences (.cds) of many genomes at https://ftp.ebi.ac.uk/pub/databases/ena/coding/con-std_latest/con/.
An excerpt from one of the files:
ID BAM65753; SV 1; linear; genomic DNA; CON; PRO; 1074 BP.
XX
PA BA000057.1
XX
DT 02-NOV-2012 (Rel. 114, Created)
DT 07-NOV-2012 (Rel. 114, Last updated, Version 2)
XX
DE Ralstonia pickettii outer membrane protein (porin)
XX
KW .
XX
OS Ralstonia pickettii
OC Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales;
OC Burkholderiaceae; Ralstonia.
XX
RN [1]
RA Hatta T., Hara H., Takizawa N.;
RT ;
RL Submitted (11-OCT-2011) to the INSDC.
RL Contact:Takashi Hatta Okayama University of Science, Department of
RL Biomedical Engineering, Faculty of Engineering; Ridai-cho 1-1, Okayama,
RL Okayama 700-0005, Japan
XX
RN [2]
RX PUBMED; 22738955.
RA Hatta T., Fujii E., Takizawa N.;
RT "Analysis of two gene clusters involved in 2,4,6-trichlorophenol
RT degradation by Ralstonia pickettii DTP0602";
RL Biosci. Biotechnol. Biochem. 76(5):892-899(2012).
XX
DR MD5; f9c860c4130219abd3d574f26fa6df85.
XX
FH Key Location/Qualifiers
FH
FT source 1..1074
FT /organism="Ralstonia pickettii"
FT /strain="DTP0602"
FT /mol_type="genomic DNA"
FT /db_xref="taxon:329"
FT CDS BA000057.1:333324..334397
FT /codon_start=1
FT /transl_table=11
FT /product="outer membrane protein (porin)"
FT /db_xref="GOA:G9M5T3"
FT /db_xref="InterPro:IPR023614"
FT /db_xref="InterPro:IPR033900"
FT /db_xref="UniProtKB/TrEMBL:G9M5T3"
FT /protein_id="BAM65753.1"
FT /translation="MAKRPRNAALCTALLTAGLGFNANAQSSVTLYGQVDSYIGSTRAA
FT GGERALVVGAGGMQTSYWGMKGVEDLGSGMRAIFDLNGFYRVDTGRSGRSDTDGFFTRS
FT AFVGLQSNRYGTVKLGRNTTPYFLSTILFNPLVDSYAFGPSIFHTYKAATNGQVYDPGI
FT IGDSGWSNSVVYSTPTFGGLTANLIYAFGEQAGSTGQSKWGGNLTYFNGAFGATAAFQQ
FT VKFNATPGDVTAPSALVGFNKQNAAQVGLSYDFKVVKMFAQGQYIKTDINGGAGDIRHT
FT NAQLGASVPLGAGSVLLSYAYGRTRHGTNDFSRNTAAIAYDYNLSKRTDLYAAYFYDKL
FT TSQSHGDAFGVGMRHRF"
XX
SQ Sequence 1074 BP; 218 A; 340 C; 318 G; 198 T; 0 other;
atggccaaaa gaccgcgcaa cgctgcactg tgcaccgccc tgctgacagc gggactaggc 60
ttcaatgcca atgcgcaatc gagcgtgacg ctgtacgggc aagtcgattc ctacatcggc 120
agcacacgcg ccgcgggcgg ggaacgcgcc ttggtcgtcg gtgcaggcgg tatgcagacg 180
tcctactggg ggatgaaggg cgtcgaggat cttgggagcg gcatgcgtgc catcttcgac 240
ctgaacgggt tctaccgcgt cgatacgggg cgatccggca gatcggatac tgacggcttc 300
ttcacccgca gcgccttcgt gggcctgcag agcaatcgct acggtacggt caagctgggc 360
cgcaacacca cgccatactt cctgtcgacg atcctgttca acccgctggt cgattcgtac 420
gcgttcgggc catcgatctt tcatacctac aaggccgcca ccaacggaca ggtctacgac 480
cccggcatca ttggcgactc cggctggtcg aactccgtcg tgtactcgac gccgacgttc 540
ggcggcctga ccgccaacct catctacgcc ttcggcgagc aggccggcag taccggccag 600
agcaagtggg gcggaaacct gacctatttc aacggcgcat tcggagccac ggcagcgttc 660
cagcaagtca agttcaatgc gacaccagga gacgtcaccg ctcccagcgc cctggttggc 720
ttcaacaagc agaatgcggc ccaggtcgga ctgtcttacg atttcaaggt ggtcaagatg 780
tttgcccagg gtcagtacat caagaccgat atcaatgggg gcgcgggcga catcagacac 840
acgaacgccc agctcggcgc ctcggttccc cttggcgctg gcagcgtctt gctgtcatac 900
gcgtacggcc ggaccaggca tggcactaac gacttcagca ggaataccgc ggcaatcgcc 960
tatgactaca acctgtcaaa gcgcaccgac ttgtacgcgg cctactttta cgacaagctg 1020
acttcccaat cccatggcga tgcgttcggg gtggggatgc ggcatcgctt ctga 1074
//
How can I parse the file without losing any information? My goal is to map UniProtKB accessions to the nucleotide sequences.
I tried to use SeqIO from Biopython to parse this file. My code:
# Bio.__version__ = '1.79'
from Bio import SeqIO
cds_file = open("/data3/jsun/spgen/ena_data/CON_PRO_1.cds", 'r')
for record in SeqIO.parse(cds_file, "gb"):
    print(record.id)
    break
However, the db_xref information of CDS is missing in record.features. Is there any way I can get this information using the SeqIO parser? Thanks.
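ENA flat files use two-letter line codes (ID, FT, SQ, and // as the record terminator), so one fallback that does not depend on how SeqIO handles the CDS feature is to read those lines directly. The sketch below is only an illustration under stated assumptions: the function names iter_cds_records and uniprot_to_sequence are hypothetical, each db_xref qualifier is assumed to fit on a single FT line (as in the excerpt above), and everything after the SQ line up to // is treated as sequence.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple

# Matches e.g. /db_xref="UniProtKB/TrEMBL:G9M5T3" and captures the accession.
XREF_RE = re.compile(r'/db_xref="UniProtKB/[^:"]+:([^"]+)"')


def iter_cds_records(lines: Iterable[str]) -> Iterator[Tuple[str, List[str], str]]:
    """Yield (record_id, uniprot_accessions, sequence) for each flat-file entry."""
    record_id = None
    accessions: List[str] = []
    seq_parts: List[str] = []
    in_sequence = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            # End of the current entry: emit it and reset the state.
            if record_id is not None:
                yield record_id, accessions, "".join(seq_parts)
            record_id, accessions, seq_parts, in_sequence = None, [], [], False
        elif in_sequence:
            # Sequence lines hold blocks of bases plus a running count; keep letters only.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ID"):
            record_id = line[2:].split(";")[0].strip()
        elif line.startswith("FT"):
            accessions.extend(XREF_RE.findall(line))
        elif line.startswith("SQ"):
            in_sequence = True


def uniprot_to_sequence(lines: Iterable[str]) -> Dict[str, str]:
    """Map every UniProtKB accession found in the FT lines to its nucleotide sequence."""
    mapping: Dict[str, str] = {}
    for _record_id, accessions, sequence in iter_cds_records(lines):
        for accession in accessions:
            mapping[accession] = sequence
    return mapping
```

With a real file, an open handle can be passed directly, for example `with open(path) as handle: mapping = uniprot_to_sequence(handle)`, where `path` is the .cds file.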
This is an extension of my previous post.
I have the following dataframes (df1 and df2) that I'm trying to merge:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("Molly Homes, Jane Doe", "Sally", "David", "Laura", "John", "Kate")
df1 <- data.frame(year, state, name)
year <- c("2002", "1999")
state <- c("TN", "AL")
versus <- c("Homes (v. Vista)", "#laura v. dAvid")
df2 <- data.frame(year, state, versus)
And df4 is my ideal output:
year <- c("2002", "2002", "1999", "1999", "1997", "2002")
state <- c("TN", "TN", "AL", "AL", "CA", "TN")
name <- c("Molly Homes, Jane Doe", "Sally", "David", "Laura", "John", "Kate")
versus <- c("Homes (v. Vista)", "# george v. SALLY", "#laura v. dAvid", "#laura v. dAvid", NA, NA)
df4 <- data.frame(year, state, name, versus)
The kind responders on the last post suggested this (and a variation):
library(dplyr)
df3 <- left_join(df1,df2, by=c("year","state")) %>%
rowwise() %>%
mutate(versus:=if_else(grepl(name,versus,ignore.case=T), versus,as.character(NA)))
The problem with the above code is that it doesn't match subsets. Ideally, I'd like the match to work in both directions: if x is in y and/or y is in x, the result is TRUE and the row keeps the value in the "versus" column.
fuzzyjoin is meant for regex searches like this :-)
library(dplyr)
# library(tidyr) # unnest
# library(fuzzyjoin) # fuzzy_*_join
df1 %>%
mutate(
rn = row_number(),
ptn = strsplit(name, "[ ,]+")
) %>%
tidyr::unnest(ptn) %>%
fuzzyjoin::fuzzy_left_join(df2,
by = c("year" = "year", "state" = "state", "ptn" = "versus"),
match_fun = list(`==`, `==`, function(...) Vectorize(grepl)(..., ignore.case = TRUE))
) %>%
group_by(rn, year = year.x, state = state.x, name) %>%
summarize(versus = na.omit(versus)[1], .groups = "drop") %>%
select(-rn)
# # A tibble: 6 x 4
# year state name versus
# <chr> <chr> <chr> <chr>
# 1 2002 TN Molly Homes, Jane Doe Homes (v. Vista)
# 2 2002 TN Sally NA
# 3 1999 AL David #laura v. dAvid
# 4 1999 AL Laura #laura v. dAvid
# 5 1997 CA John NA
# 6 2002 TN Kate NA
We need a way to retrieve the series of whole words, and check if any of them appear (case-insensitive) within the versus column. Here is one simple way to do this:
Create a function f(n, v) that takes strings n and v, extracts the whole words (wrds) from n, and counts how many of them are found in v. Empty matches and single-character words are dropped first. It returns TRUE if the count exceeds 0:
f <- function(n,v) {
wrds = stringr::str_extract_all(n, "\\b\\w*\\b")[[1]]
sum(sapply(wrds[which(nchar(wrds)>1)], grepl,x=v,ignore.case=T))>0
}
Left join the original frames and apply f() by row, retaining versus if one or more whole words from name are found in versus, and setting it to NA otherwise:
left_join(df1,df2, by=c("year","state")) %>%
rowwise() %>%
mutate(versus:=if_else(f(name, versus), versus,NA_character_))
Output:
1 2002 TN Molly Homes, Jane Doe Homes (v. Vista)
2 2002 TN Sally NA
3 1999 AL David #laura v. dAvid
4 1999 AL Laura #laura v. dAvid
5 1997 CA John NA
6 2002 TN Kate NA
Input:
df1 = structure(list(year = c("2002", "2002", "1999", "1999", "1997",
"2002"), state = c("TN", "TN", "AL", "AL", "CA", "TN"), name = c("Molly Homes, Jane Doe",
"Sally", "David", "Laura", "John", "Kate")), class = "data.frame", row.names = c(NA,
-6L))
df2 = structure(list(year = c("2002", "1999"), state = c("TN", "AL"
), versus = c("Homes (v. Vista)", "#laura v. dAvid")), class = "data.frame", row.names = c(NA,
-2L))
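For readers who want to check the matching rule outside R, the logic of f() followed by the left join can be sketched in Python. This is a rough analogue, not a translation of the dplyr code: the helper name shares_word is hypothetical, the data frames are represented as plain lists and dictionaries, and the match is a case-insensitive substring test, as grepl performs for these word-only patterns.

```python
import re
from typing import Dict, List, Optional, Tuple


def shares_word(name: str, versus: Optional[str]) -> bool:
    """True if any word of `name` with 2+ characters occurs in `versus`, ignoring case."""
    if versus is None:
        return False
    words = [w for w in re.findall(r"\w+", name) if len(w) > 1]
    return any(w.lower() in versus.lower() for w in words)


# Same rows as df1 and df2 above.
df1: List[Tuple[str, str, str]] = [
    ("2002", "TN", "Molly Homes, Jane Doe"),
    ("2002", "TN", "Sally"),
    ("1999", "AL", "David"),
    ("1999", "AL", "Laura"),
    ("1997", "CA", "John"),
    ("2002", "TN", "Kate"),
]
df2: Dict[Tuple[str, str], str] = {
    ("2002", "TN"): "Homes (v. Vista)",
    ("1999", "AL"): "#laura v. dAvid",
}

result = []
for year, state, name in df1:
    versus = df2.get((year, state))  # left join on (year, state)
    result.append((year, state, name, versus if shares_word(name, versus) else None))

for row in result:
    print(row)
```

Rows without a partner in df2, or whose name shares no word with versus, end up with None in the last position, which plays the role of NA.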
Consider the following toy data:
input strL Country Population Median_Age Sex_Ratio GDP Trade year
"United States of America" 3999 55 1.01 5000 13.1 2012
"United States of America" 6789 43 1.03 7689 7.6 2013
"United States of America" 9654 39 1.00 7689 4.04 2014
"Afghanistan" 544 24 0.76 457 -0.73 2012
"Afghanistan" 720 19 0.90 465 -0.76 2013
"Afghanistan" 941 17 0.92 498 -0.81 2014
"China" 7546 44 1.01 2000 10.2 2012
"China" 10000 40 0.96 3400 14.3 2013
"China" 12000 38 0.90 5900 16.1 2014
"Canada" 7546 44 1.01 2000 1.2 2012
"Canada" 10000 40 0.96 3400 3.1 2013
"Canada" 12000 38 0.90 5900 8.5 2014
end
I run different regressions (using three different independent variables):
*reg1
local var "GDP Trade"
foreach ii of local var{
qui reg `ii' Population i.year
est table, b p
outreg2 Population using table, drop(i.year*) bdec(3) sdec(3) nocons tex(nopretty) append
}
*reg2
local var "GDP Trade"
foreach ii of local var{
qui reg `ii' Median_Age i.year
est table, b p
outreg2 Population using table2, drop(i.year*) bdec(3) sdec(3) nocons tex(nopretty) append
}
*reg3
local var "GDP Trade"
foreach ii of local var{
qui reg `ii' Sex_Ratio i.year
est table, b p
outreg2 Population using table3, drop(i.year*) bdec(3) sdec(3) nocons tex(nopretty) append
}
I use the append option to append different dependent variables that are to be regressed on the same set of independent variables. Hence, I obtain three different tables.
I wish to "merge" these tables when I compile in LaTeX, so that they appear as a single table, with three different panels, one below the other.
Table1
Table2
Table3
I can use the tex(frag) option of the community-contributed command outreg2, but that will not give me the desired outcome.
Here is a simple way of doing this, using the community-contributed command esttab:
clear
input strL Country Population Median_Age Sex_Ratio GDP Trade year
"United States of America" 3999 55 1.01 5000 13.1 2012
"United States of America" 6789 43 1.03 7689 7.6 2013
"United States of America" 9654 39 1.00 7689 4.04 2014
"Afghanistan" 544 24 0.76 457 -0.73 2012
"Afghanistan" 720 19 0.90 465 -0.76 2013
"Afghanistan" 941 17 0.92 498 -0.81 2014
"China" 7546 44 1.01 2000 10.2 2012
"China" 10000 40 0.96 3400 14.3 2013
"China" 12000 38 0.90 5900 16.1 2014
"Canada" 7546 44 1.01 2000 1.2 2012
"Canada" 10000 40 0.96 3400 3.1 2013
"Canada" 12000 38 0.90 5900 8.5 2014
end
local var "GDP Trade"
foreach ii of local var{
regress `ii' Population i.year
matrix I = e(b)
matrix A = nullmat(A) \ I[1,1]
local namesA `namesA' Population_`ii'
}
matrix rownames A = `namesA'
local var "GDP Trade"
foreach ii of local var{
regress `ii' Median_Age i.year
matrix I = e(b)
matrix B = nullmat(B) \ I[1,1]
local namesB `namesB' Median_Age_`ii'
}
matrix rownames B = `namesB'
local var "GDP Trade"
foreach ii of local var{
regress `ii' Sex_Ratio i.year
matrix I = e(b)
matrix C = nullmat(C) \ I[1,1]
local namesC `namesC' Sex_Ratio_`ii'
}
matrix rownames C = `namesC'
matrix D = A \ B \ C
Results:
esttab matrix(D), refcat(Population_GDP "Panel 1" ///
Median_Age_GDP "Panel 2" ///
Sex_Ratio_GDP "Panel 3", nolabel) ///
gaps noobs nomtitles ///
varwidth(20) ///
title(Table 1. Results)
Table 1. Results
---------------------------------
c1
---------------------------------
Panel 1
Population_GDP .3741343
Population_Trade .0009904
Panel 2
Median_Age_GDP 202.1038
Median_Age_Trade .429315
Panel 3
Sex_Ratio_GDP 18165.85
Sex_Ratio_Trade 27.965
---------------------------------
Using the tex option:
\begin{table}[htbp]\centering
\caption{Table 1. Results}
\begin{tabular}{l*{1}{c}}
\hline\hline
& c1\\
\hline
Panel 1 & \\
[1em]
Population\_GDP & .3741343\\
[1em]
Population\_Trade & .0009904\\
[1em]
Panel 2 & \\
[1em]
Median\_Age\_GDP & 202.1038\\
[1em]
Median\_Age\_Trade & .429315\\
[1em]
Panel 3 & \\
[1em]
Sex\_Ratio\_GDP & 18165.85\\
[1em]
Sex\_Ratio\_Trade & 27.965\\
\hline\hline
\end{tabular}
\end{table}
EDIT:
This preserves the original format:
local var "GDP Trade"
foreach ii of local var{
regress `ii' Population i.year
matrix I = e(b)
matrix A = (nullmat(A) , I[1,1])
local namesA `namesA' `ii'
}
matrix rownames A = Population
matrix colnames A = `namesA'
local var "GDP Trade"
foreach ii of local var{
regress `ii' Median_Age i.year
matrix I = e(b)
matrix B = nullmat(B) , I[1,1]
local namesB `namesB' `ii'
}
matrix rownames B = "Median Age"
matrix colnames B = `namesB'
local var "GDP Trade"
foreach ii of local var{
regress `ii' Sex_Ratio i.year
matrix I = e(b)
matrix C = nullmat(C) , I[1,1]
local namesC `namesC' `ii'
}
matrix rownames C = "Sex Ratio"
matrix colnames C = `namesC'
matrix D = A \ B \ C
Table 1. Results
--------------------------------------
GDP Trade
--------------------------------------
Population .3741343 .0009904
Median Age 202.1038 .429315
Sex Ratio 18165.85 27.965
--------------------------------------
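The second layout (one row per independent variable, one column per dependent variable) is just a small matrix of coefficients, so the same kind of LaTeX fragment could also be assembled outside Stata. The Python sketch below is only an illustration: the function name panel_table and the variable coefficients are hypothetical, and the numbers are the coefficients reported in the output above.

```python
from typing import Dict, List


def panel_table(panels: Dict[str, Dict[str, float]], columns: List[str]) -> str:
    """Build a LaTeX tabular with one row per panel and one column per outcome."""
    lines = [
        r"\begin{tabular}{l" + "c" * len(columns) + "}",
        r"\hline\hline",
        " & " + " & ".join(columns) + r" \\",
        r"\hline",
    ]
    for row_label, coefs in panels.items():
        cells = " & ".join(str(coefs[col]) for col in columns)
        lines.append(row_label + " & " + cells + r" \\")
    lines += [r"\hline\hline", r"\end{tabular}"]
    return "\n".join(lines)


# Coefficients copied from the Stata output above.
coefficients = {
    "Population": {"GDP": 0.3741343, "Trade": 0.0009904},
    "Median Age": {"GDP": 202.1038, "Trade": 0.429315},
    "Sex Ratio": {"GDP": 18165.85, "Trade": 27.965},
}

print(panel_table(coefficients, ["GDP", "Trade"]))
```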
With the code shown below, Spyder 3.3.4 returns the following error message:
AttributeError : module 'random' has no attribute 'randit'
The code is as follows:
import random, math
SV = 0  # sum of the waiting times
SC = 0  # sum of the squared waiting times
nb_simul = 5000  # number of simulations
for k in range(nb_simul):
    arr1 = random.randit(0, 60)  # arrival 1
    arr2 = random.randit(0, 60)  # arrival 2
    c = abs(arr1 - arr2)  # waiting time
    SV = SV + c
    SC = SC + c*c
MV = SV / nb_simul  # mean waiting time
MC = SC / nb_simul  # mean of the squared waiting times
e = sqrt((MC - MV*MV)/nb_simul)  # standard deviation of the mean waiting time
print("Point estimate:", MV)
print("95% confidence interval [", MV - 1.96*e, ";", MV + 1.96*e, "]")
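The AttributeError points at a spelling slip: the standard-library function is random.randint, not random.randit. In addition, sqrt is never imported by name, so with `import math` it has to be called as math.sqrt. A corrected sketch follows; wrapping the loop in a function, the name estimate_waiting_time, and the seed parameter are additions for illustration and are not part of the original script.

```python
import math
import random


def estimate_waiting_time(nb_simul=5000, seed=None):
    """Simulate two arrivals in [0, 60] and estimate the mean waiting time."""
    rng = random.Random(seed)  # seed is optional; it only makes runs repeatable
    sv = 0  # sum of the waiting times
    sc = 0  # sum of the squared waiting times
    for _ in range(nb_simul):
        arr1 = rng.randint(0, 60)  # arrival 1 (both bounds included)
        arr2 = rng.randint(0, 60)  # arrival 2
        c = abs(arr1 - arr2)       # waiting time
        sv += c
        sc += c * c
    mv = sv / nb_simul  # mean waiting time
    mc = sc / nb_simul  # mean of the squared waiting times
    e = math.sqrt((mc - mv * mv) / nb_simul)  # standard error of the mean
    return mv, (mv - 1.96 * e, mv + 1.96 * e)


mean, (low, high) = estimate_waiting_time()
print("Point estimate:", mean)
print("95% confidence interval [", low, ";", high, "]")
```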
function splitSat(str, pat, max, regex)
    pat = pat or "\n" -- search pattern
    max = max or #str
    local t = {}
    local c = 1
    if #str == 0 then
        return {""}
    end
    if #pat == 0 then
        return nil
    end
    if max == 0 then
        return str
    end
    repeat
        local s, e = str:find(pat, c, not regex) -- within the string str, search for the pattern pat starting at position c;
        -- store the start index in s and the end index in e
        max = max - 1
        if s and max < 0 then
            if #(str:sub(c)) > 0 then -- if the substring from c to the end has length greater than 0
                t[#t+1] = str:sub(c)
            else
                t[#t+1] = "" -- create a table with empty values
            end
        else
            if #(str:sub(c, s and s - 1)) > 0 then -- if the substring of str between c and s has length greater than 0
                t[#t+1] = str:sub(c, s and s - 1)
            else
                t[#t+1] = "" -- create a table with empty values
            end
        end
        c = e and e + 1 or #str + 1
    until not s or max < 0
    return t
end
I'd like to know what this function is doing. I know that it makes a kind of table taking a string and a pattern. Especially I want to know what *t[#t+1] = str:sub(c, s and s - 1)* is doing.
From what I get, it splits a long string at each occurrence of a pattern and returns the pieces in between, dropping the matched separators themselves. For example, splitting the string 11aa22 on the pattern aa results in the table {"11", "22"}.
t[#t+1] = <something> inserts a value at the end of table t; it is the same as table.insert(t, <something>)
#t returns the length of an array (that is, a table with consecutive numeric indices), for example, #{1, 2, 3} == 3
str:sub(c, s and s - 1) takes advantage of how Lua's and operator works. s and s - 1 evaluates to s-1 if s is not nil, and nil otherwise. Just s-1 would throw an error if s was nil. When s is nil (no further match), the second argument is nil and str:sub(c, nil) returns everything from c to the end of the string.
(10 and 10 - 1) == 9
10 - 1 == 9
(nil and nil - 1) == nil
nil - 1 -> throws an error
str:sub(a, b) just returns the substring starting at index a and ending at index b, both inclusive (a and b being numeric indices)
("abcde"):sub(2,4) == "bcd"
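To see the control flow without the Lua idioms, the function can be mirrored step by step in Python. This is a sketch under assumptions, not an exact port: the name split_sat and the parameter name max_splits are hypothetical, and when regex is true the pattern is interpreted as a Python regular expression rather than a Lua pattern (the two syntaxes differ, for example %d versus \d).

```python
import re


def split_sat(text, pat="\n", max_splits=None, regex=False):
    """Split text on pat, keeping the pieces between matches (mirrors splitSat)."""
    if max_splits is None:
        max_splits = len(text)
    if len(text) == 0:
        return [""]
    if len(pat) == 0:
        return None
    if max_splits == 0:
        return text  # like the Lua original, this returns the string itself
    pattern = re.compile(pat if regex else re.escape(pat))
    pieces = []
    pos = 0  # plays the role of c (0-based here, 1-based in Lua)
    while True:
        match = pattern.search(text, pos)
        max_splits -= 1
        if match and max_splits < 0:
            # Split budget used up: keep the whole remainder as the last piece.
            pieces.append(text[pos:])
        else:
            # Same idea as str:sub(c, s and s - 1): up to the match, or to the end.
            end = match.start() if match else len(text)
            pieces.append(text[pos:end])
        pos = match.end() if match else len(text)
        if not match or max_splits < 0:
            break
    return pieces


print(split_sat("11aa22", "aa"))
print(split_sat("a,b,c", ",", 1))
```

The two calls show the default behaviour (split at every separator) and the effect of limiting the number of splits to one.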