Is there a way to use Biopython to convert FASTA files to GenBank format? There are many answers on how to convert from GenBank to FASTA, but not the other way around.
Before converting, you must assign an alphabet (DNA or protein) to each sequence:
from Bio import SeqIO
from Bio.Alphabet import generic_dna, generic_protein
input_handle = open("test.fasta", "rU")
output_handle = open("test.gb", "w")
sequences = list(SeqIO.parse(input_handle, "fasta"))
# assign generic_dna or generic_protein
for seq in sequences:
    seq.seq.alphabet = generic_dna
count = SeqIO.write(sequences, output_handle, "genbank")
output_handle.close()
input_handle.close()
print("Converted %i records" % count)
For the input:
>I28Q9A102FII8J rank=0668881 x=2144.0 y=1105.0 length=418
ACGTCATGAGAGTTTGATCATGGCTCAGGACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGATGAA
GCTCCAGCTTGCTGGGGTGGATTAGTGGCGAACGGGTGAGTAACACGTGAGTAACCTGCCCTTGACTCTGGGAT
AAGCGTTGGAAACGACGTCTAATACCGGATATGACGACCGATGGCATCATCTGGTTGTGGAAAGAATTTTGGTC
AAGGATGGACTCGCGGCCTATCAGGTAGTTGGTGAGGTAATGGCTCACCAAGCCTACGACGGGTAGCCGGCCTG
AGAGGGTGACCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCA
CAATGGGCGAAAGCCTGATGCAGCAACGCCGCGTGAGGGATGACGGCC
>I28Q9A102JMH72 rank=0320459 x=3829.0 y=3120.0 length=512
ACGTCATGAGAGTTTGATCCTGGCTCAGGATGAACGCTAGCGGCAGGCTTAACACATGCAAGTCGAGGGTAGAA
ATAGCTTGCTATTTTGAGACCGGCGCACGGGTGCGTAACGCGTATGCAATCTGCCTTTTACAGGGGAATAGCCC
AGAGAAATTTGGATTAATGCCCCATAGCGCTGCAGGGCGGCATCGCCGAGCAGCTAAAGTCACAACGGTAAAGA
TGAGCATGCGTCCCATTAGCTAGTTGGTAAGGTAACGGCTTACCAAGGCGATGATGGGTAGGGTCCTGAGAGGG
AGATCCCCCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGG
GCGCAAGCCTGAACCAGCCATGCCGCGTGCAGGATGAAGGCCTTCGGGTTGTAAACTGCTTTTGACGGAACGAA
AAAGCT
you get:
LOCUS I28Q9A102FII8J 418 bp DNA UNK 01-JAN-1980
DEFINITION I28Q9A102FII8J rank=0668881 x=2144.0 y=1105.0 length=418
ACCESSION I28Q9A102FII8J
VERSION I28Q9A102FII8J
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
ORIGIN
1 acgtcatgag agtttgatca tggctcagga cgaacgctgg cggcgtgctt aacacatgca
61 agtcgaacga tgaagctcca gcttgctggg gtggattagt ggcgaacggg tgagtaacac
121 gtgagtaacc tgcccttgac tctgggataa gcgttggaaa cgacgtctaa taccggatat
181 gacgaccgat ggcatcatct ggttgtggaa agaattttgg tcaaggatgg actcgcggcc
241 tatcaggtag ttggtgaggt aatggctcac caagcctacg acgggtagcc ggcctgagag
301 ggtgaccggc cacactggga ctgagacacg gcccagactc ctacgggagg cagcagtggg
361 gaatattgca caatgggcga aagcctgatg cagcaacgcc gcgtgaggga tgacggcc
//
LOCUS I28Q9A102JMH72 450 bp DNA UNK 01-JAN-1980
DEFINITION I28Q9A102JMH72 rank=0320459 x=3829.0 y=3120.0 length=512
ACCESSION I28Q9A102JMH72
VERSION I28Q9A102JMH72
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
ORIGIN
1 acgtcatgag agtttgatcc tggctcagga tgaacgctag cggcaggctt aacacatgca
61 agtcgagggt agaaatagct tgctattttg agaccggcgc acgggtgcgt aacgcgtatg
121 caatctgcct tttacagggg aatagcccag agaaatttgg attaatgccc catagcgctg
181 cagggcggca tcgccgagca gctaaagtca caacggtaaa gatgagcatg cgtcccatta
241 gctagttggt aaggtaacgg cttaccaagg cgatgatggg tagggtcctg agagggagat
301 cccccacact ggtactgaga cacggaccag actcctacgg gaggcagcag tgaggaatat
361 tggtcaatgg gcgcaagcct gaaccagcca tgccgcgtgc aggatgaagg ccttcgggtt
421 gtaaactgct tttgacggaa cgaaaaagct
//
Here's an update of Jose's answer for Python 3 and recent Biopython, which no longer uses alphabets. Maybe it will save you a bit of time.
from Bio import SeqIO
input_handle = open("test.fasta", "r")
output_handle = open("test.gb", "w")
sequences = list(SeqIO.parse(input_handle, "fasta"))
# assign molecule type
for seq in sequences:
    seq.annotations['molecule_type'] = 'DNA'
count = SeqIO.write(sequences, output_handle, "genbank")
output_handle.close()
input_handle.close()
print("Converted {} records".format(count))
This also works for unsubmitted sequences, i.e. sequences that have not yet been submitted to NCBI and therefore have no accession numbers.
Related
The European Nucleotide Archive (ENA) provides annotated coding sequences (.cds) of many genomes at https://ftp.ebi.ac.uk/pub/databases/ena/coding/con-std_latest/con/.
A piece of the file:
ID BAM65753; SV 1; linear; genomic DNA; CON; PRO; 1074 BP.
XX
PA BA000057.1
XX
DT 02-NOV-2012 (Rel. 114, Created)
DT 07-NOV-2012 (Rel. 114, Last updated, Version 2)
XX
DE Ralstonia pickettii outer membrane protein (porin)
XX
KW .
XX
OS Ralstonia pickettii
OC Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales;
OC Burkholderiaceae; Ralstonia.
XX
RN [1]
RA Hatta T., Hara H., Takizawa N.;
RT ;
RL Submitted (11-OCT-2011) to the INSDC.
RL Contact:Takashi Hatta Okayama University of Science, Department of
RL Biomedical Engineering, Faculty of Engineering; Ridai-cho 1-1, Okayama,
RL Okayama 700-0005, Japan
XX
RN [2]
RX PUBMED; 22738955.
RA Hatta T., Fujii E., Takizawa N.;
RT "Analysis of two gene clusters involved in 2,4,6-trichlorophenol
RT degradation by Ralstonia pickettii DTP0602";
RL Biosci. Biotechnol. Biochem. 76(5):892-899(2012).
XX
DR MD5; f9c860c4130219abd3d574f26fa6df85.
XX
FH Key Location/Qualifiers
FH
FT source 1..1074
FT /organism="Ralstonia pickettii"
FT /strain="DTP0602"
FT /mol_type="genomic DNA"
FT /db_xref="taxon:329"
FT CDS BA000057.1:333324..334397
FT /codon_start=1
FT /transl_table=11
FT /product="outer membrane protein (porin)"
FT /db_xref="GOA:G9M5T3"
FT /db_xref="InterPro:IPR023614"
FT /db_xref="InterPro:IPR033900"
FT /db_xref="UniProtKB/TrEMBL:G9M5T3"
FT /protein_id="BAM65753.1"
FT /translation="MAKRPRNAALCTALLTAGLGFNANAQSSVTLYGQVDSYIGSTRAA
FT GGERALVVGAGGMQTSYWGMKGVEDLGSGMRAIFDLNGFYRVDTGRSGRSDTDGFFTRS
FT AFVGLQSNRYGTVKLGRNTTPYFLSTILFNPLVDSYAFGPSIFHTYKAATNGQVYDPGI
FT IGDSGWSNSVVYSTPTFGGLTANLIYAFGEQAGSTGQSKWGGNLTYFNGAFGATAAFQQ
FT VKFNATPGDVTAPSALVGFNKQNAAQVGLSYDFKVVKMFAQGQYIKTDINGGAGDIRHT
FT NAQLGASVPLGAGSVLLSYAYGRTRHGTNDFSRNTAAIAYDYNLSKRTDLYAAYFYDKL
FT TSQSHGDAFGVGMRHRF"
XX
SQ Sequence 1074 BP; 218 A; 340 C; 318 G; 198 T; 0 other;
atggccaaaa gaccgcgcaa cgctgcactg tgcaccgccc tgctgacagc gggactaggc 60
ttcaatgcca atgcgcaatc gagcgtgacg ctgtacgggc aagtcgattc ctacatcggc 120
agcacacgcg ccgcgggcgg ggaacgcgcc ttggtcgtcg gtgcaggcgg tatgcagacg 180
tcctactggg ggatgaaggg cgtcgaggat cttgggagcg gcatgcgtgc catcttcgac 240
ctgaacgggt tctaccgcgt cgatacgggg cgatccggca gatcggatac tgacggcttc 300
ttcacccgca gcgccttcgt gggcctgcag agcaatcgct acggtacggt caagctgggc 360
cgcaacacca cgccatactt cctgtcgacg atcctgttca acccgctggt cgattcgtac 420
gcgttcgggc catcgatctt tcatacctac aaggccgcca ccaacggaca ggtctacgac 480
cccggcatca ttggcgactc cggctggtcg aactccgtcg tgtactcgac gccgacgttc 540
ggcggcctga ccgccaacct catctacgcc ttcggcgagc aggccggcag taccggccag 600
agcaagtggg gcggaaacct gacctatttc aacggcgcat tcggagccac ggcagcgttc 660
cagcaagtca agttcaatgc gacaccagga gacgtcaccg ctcccagcgc cctggttggc 720
ttcaacaagc agaatgcggc ccaggtcgga ctgtcttacg atttcaaggt ggtcaagatg 780
tttgcccagg gtcagtacat caagaccgat atcaatgggg gcgcgggcga catcagacac 840
acgaacgccc agctcggcgc ctcggttccc cttggcgctg gcagcgtctt gctgtcatac 900
gcgtacggcc ggaccaggca tggcactaac gacttcagca ggaataccgc ggcaatcgcc 960
tatgactaca acctgtcaaa gcgcaccgac ttgtacgcgg cctactttta cgacaagctg 1020
acttcccaat cccatggcga tgcgttcggg gtggggatgc ggcatcgctt ctga 1074
//
I tried to parse this file with SeqIO in Biopython. My goal is to map the UniProtKB accessions to the nucleotide sequences. How can I parse the file without losing any information? My code:
# Bio.__version__ = '1.79'
from Bio import SeqIO
cds_file = open("/data3/jsun/spgen/ena_data/CON_PRO_1.cds", 'r')
for record in SeqIO.parse(cds_file, "gb"):
    print(record.id)
    break
However, the db_xref information of CDS is missing in record.features. Is there any way I can get this information using the SeqIO parser? Thanks.
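One fallback that does not depend on how SeqIO handles the CDS location is to read the flat file as plain text. The following is only a sketch: the helper name `uniprot_to_sequence` is made up for illustration, the toy record is an abbreviated stand-in for the entry above, and the code assumes every record follows the same `FT` / `SQ` / `//` layout and that only `/db_xref="UniProtKB/..."` qualifiers are of interest.

```python
import re

# Matches e.g. /db_xref="UniProtKB/TrEMBL:G9M5T3" and captures the accession.
XREF = re.compile(r'/db_xref="UniProtKB/[^:"]+:([^"]+)"')

def uniprot_to_sequence(lines):
    """Hypothetical helper: map UniProtKB accessions to the record's sequence."""
    mapping = {}
    accessions, seq_parts, in_seq = [], [], False
    for line in lines:
        if line.startswith("//"):        # end of record: store and reset
            seq = "".join(seq_parts)
            for acc in accessions:
                mapping[acc] = seq
            accessions, seq_parts, in_seq = [], [], False
        elif line.startswith("SQ"):      # sequence lines follow this header
            in_seq = True
        elif in_seq:
            # keep letters only: drops spaces and the trailing position counter
            seq_parts.append(re.sub(r"[^a-zA-Z]", "", line))
        elif line.startswith("FT"):
            accessions.extend(XREF.findall(line))
    return mapping

# Abbreviated toy record (not real data), just to exercise the function.
example = """\
ID   TOY0001; SV 1; linear; genomic DNA; CON; PRO; 12 BP.
FT   CDS             BA000057.1:333324..334397
FT                   /db_xref="GOA:G9M5T3"
FT                   /db_xref="UniProtKB/TrEMBL:G9M5T3"
SQ   Sequence 12 BP;
     atggccaaaa ga                                                            12
//
""".splitlines()

print(uniprot_to_sequence(example))
```

For a real file, an open file handle can be passed instead of the list, since the function only iterates over lines.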
I am having trouble using Biopython to count each base across a FASTA file and sum the counts, i.e. how many A's there are in total, how many T's, and so on.
I have two problems.
1.
handle2="/home/koreanraichu/sra_data_mo.fasta"
for record2 in SeqIO.parse(handle2,"fasta"):
    print(Seq(record2.seq).count("A"))
    print(type(Seq(record2.seq).count("A")))
This code reads each sequence and counts adenine successfully, but it never adds the counts together. I tried appending to a list and calling sum(), and also plain addition, but neither had any effect: each count is an int, yet the values are printed separately instead of being summed.
for record2 in SeqIO.parse(handle2,"fasta"):
    if len(record2.seq) > 100:
        i=0
        i=i+len(record2.seq)
    else:
        j=0
        j=j+len(record2.seq)
print(i,j)
The code above does not work either. It is meant to be a conditional sum that adds up DNA of 100 bp or more and DNA of less than 100 bp separately, but it only prints the last record's values.
What can I do to solve this?
Try this code for the first problem:
from Bio import SeqIO
# from Bio.Seq import Seq
handle2="Fasta.fa"
for record2 in SeqIO.parse(handle2,"fasta"):
    # print(record2.seq, type(record2.seq))
    # print(str(record2.seq), type(str(record2.seq)))
    print(record2.seq.count("A"))
    # print(type(record2.seq).count("A")) ### --> TypeError: count() missing 1 required positional argument: 'sub'
    summarize = 0
    for i in 'ATGC':
        x = record2.seq.count(i)
        print(i, ' : ', x)
        summarize += record2.seq.count(i)
    print(summarize)
Given my test FASTA:
>Rosalind_4402
GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
GAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCTGAGGGGCTATCTT
output:
27
A : 27
T : 32
G : 32
C : 29
120
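If the goal is one total per base across every record in the file, the per-record counts can be accumulated in a single counter. This is a minimal sketch using plain strings in place of `record2.seq`; the function name `total_base_counts` and the toy sequences are made up for illustration.

```python
from collections import Counter

def total_base_counts(sequences):
    """Sum per-base counts over an iterable of sequences."""
    totals = Counter()
    for seq in sequences:
        # str() so that Seq objects and plain strings are handled alike
        totals.update(str(seq).upper())
    return totals

# Toy sequences standing in for the records of a FASTA file.
toy = ["AACGT", "ATTG"]
totals = total_base_counts(toy)
for base in "ATGC":
    print(base, ":", totals[base])
```

With Biopython the same function could be fed a generator such as `(record.seq for record in SeqIO.parse(handle2, "fasta"))`.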
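For the second problem, initialise i and j once before the loop. Resetting them to 0 inside the loop is why only the last record's lengths were printed: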
from Bio import SeqIO
# from Bio.Seq import Seq
# handle2="/home/koreanraichu/sra_data_mo.fasta"
handle2="Fasta2.fa"
i=0
j=0
for record2 in SeqIO.parse(handle2,"fasta"):
    if len(record2.seq) > 100:
        print('>100 : ', len(record2.seq))
        i=i+len(record2.seq)
    else:
        print('else : ', len(record2.seq))
        j=j+len(record2.seq)
print('> 100 summarize : ', i, ' else summarize : ',j)
given test fasta:
>Rosalind_4402
GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
GAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCTGAGGGGCTATCTT
>Rosalind_4403
GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
GAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCTGAGGGGCTATCTT
GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
GAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCTGAGGGGCTATCTT
>Rosalind_4404
GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
>Rosalind_4405
GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATT
>Rosalind_4406
GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
GAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCTGAGGGGCTATCTT
CTTTCAGCTGTAAGAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCT
GAGGGGCTATCTT
>Rosalind_4407
GCAGCTAGCTAGCTAGCTGGGATT
>Rosalind_4408
GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
GAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCTGAGGGGCTATCTT
CTTTCAGCTGTAAGAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGC
output:
>100 : 120
>100 : 240
else : 60
else : 47
>100 : 193
else : 24
>100 : 179
> 100 summarize : 732 else summarize : 131
I am trying to generate a weblogo for the protein sequences provided. The following is my code:
from Bio.Seq import Seq
from Bio import motifs
from Bio.Alphabet import generic_protein
instances = [Seq("RWST"),
Seq("RTAG"),
Seq("RQGC"),
Seq("RMAA"),
]
m = motifs.create(instances)
m.weblogo("mymotif.png")
I get the following error:
counts[letter][position] += 1
KeyError: 'R'
Full stack trace:
<ipython-input-3-ee8922743152> in <module>()
10
11
---> 12 m = motifs.create(instances)
13 m.weblogo("mymotif.png")
lib/site-packages/Bio/motifs/__init__.py in create(instances, alphabet)
21 def create(instances, alphabet=None):
22 instances = Instances(instances, alphabet)
---> 23 return Motif(instances=instances, alphabet=alphabet)
24
25
lib/site-packages/Bio/motifs/__init__.py in __init__(self, alphabet, instances, counts)
236 self.instances = instances
237 alphabet = self.instances.alphabet
--> 238 counts = self.instances.count()
239 self.counts = matrix.FrequencyPositionMatrix(alphabet, counts)
240 self.length = self.counts.length
lib/site-packages/Bio/motifs/__init__.py in count(self)
192 for instance in self:
193 for position, letter in enumerate(instance):
--> 194 counts[letter][position] += 1
195 return counts
196
KeyError: 'R'
Motif takes an alphabet as a keyword (named) argument, and so does motifs.create. If none is given, Biopython assumes the sequences are DNA, and in your case R is not in the DNA alphabet.
For your example you need to use IUPAC.protein to make it work.
Note: Biopython uses the alphabet's letters attribute internally to see which characters are allowed, and generic_protein has no letters.
from Bio import motifs
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq
instances = [Seq("RWST", IUPAC.protein),
Seq("RTAG", IUPAC.protein),
Seq("RQGC", IUPAC.protein),
Seq("RMAA", IUPAC.protein),
]
m = motifs.create(instances, IUPAC.protein)
m.weblogo("mymotif.png")
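To see why the alphabet matters, here is a plain-Python imitation of the counting step shown in the traceback. It is not Biopython's actual code, and `count_matrix` is a name invented for this sketch. As far as I know, Biopython releases that dropped Bio.Alphabet take the alphabet as a plain string of letters instead, but check the documentation for your version.

```python
def count_matrix(instances, alphabet):
    """Count how often each letter of `alphabet` occurs at each position."""
    length = len(instances[0])
    counts = {letter: [0] * length for letter in alphabet}
    for instance in instances:
        for position, letter in enumerate(instance):
            # raises KeyError when the letter is not part of the alphabet
            counts[letter][position] += 1
    return counts

instances = ["RWST", "RTAG", "RQGC", "RMAA"]

try:
    count_matrix(instances, "ACGT")      # DNA letters only
except KeyError as err:
    print("KeyError:", err)

protein = "ACDEFGHIKLMNPQRSTVWY"         # the 20 standard amino acids
counts = count_matrix(instances, protein)
print(counts["R"])
```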
I have tried it this way first:
for model in structure:
    for residue in model.get_residues():
        if PDB.is_aa(residue):
            x += 1
and then that way:
len(structure[0][chain])
But none of them seem to work...
Your code should work and give you the correct results.
from Bio import PDB
parser = PDB.PDBParser()
pdb1 ='./1bfg.pdb'
structure = parser.get_structure("1bfg", pdb1)
model = structure[0]
res_no = 0
non_resi = 0
for model in structure:
    for chain in model:
        for r in chain.get_residues():
            if r.id[0] == ' ':
                res_no +=1
            else:
                non_resi +=1
print ("Residues: %i" % (res_no))
print ("Other: %i" % (non_resi))
res_no2 = 0
non_resi2 = 0
for model in structure:
    for residue in model.get_residues():
        if PDB.is_aa(residue):
            res_no2 += 1
        else:
            non_resi2 += 1
print ("Residues2: %i" % (res_no2))
print ("Other2: %i" % (non_resi2))
Output:
Residues: 126
Other: 99
Residues2: 126
Other2: 99
Your statement
print (len(structure[0]['A']))
gives you the total number of residues (225), in this case all amino acids plus the water molecules.
The numbers seem to be correct when compared to manual inspection using PyMol.
What is the specific error message you are getting or the output you are expecting? Any specific PDB file?
Since a PDB file mainly stores the coordinates of the resolved atoms, it is not always possible to recover the full sequence from it. Another approach is to use the mmCIF (.cif) files.
from Bio import PDB
parser = PDB.PDBParser()
pdb1 ='./1bfg.cif'
m = PDB.MMCIF2Dict.MMCIF2Dict(pdb1)
if '_entity_poly.pdbx_seq_one_letter_code' in m.keys():
    print ('Full structure:')
    full_structure = (m['_entity_poly.pdbx_seq_one_letter_code'])
    print (full_structure)
    print (len(full_structure))
Output:
Full structure:
PALPEDGGSGAFPPGHFKDPKRLYCKNGGFFLRIHPDGRVDGVREKSDPHIKLQLQAEERGVVSIKGVSANRYLAMKEDGRLLASKSVTDECFFFERLESNNYNTYRSRKYTSWYVALKRTGQYKLGSKTGPGQKAILFLPMSAKS
146
For multiple chains:
from Bio import PDB
parser = PDB.PDBParser()
pdb1 ='./4hlu.cif'
m = PDB.MMCIF2Dict.MMCIF2Dict(pdb1)
if '_entity_poly.pdbx_seq_one_letter_code' in m.keys():
    full_structure = m['_entity_poly.pdbx_seq_one_letter_code']
    chains = m['_entity_poly.pdbx_strand_id']
    for c in chains:
        print('Chain %s' % (c))
        print('Sequence: %s' % (full_structure[chains.index(c)]))
It's just:
from Bio.PDB import PDBParser
from Bio import PDB
pdb = PDBParser().get_structure("1bfg", "1bfg.pdb")
for chain in pdb.get_chains():
    print(len([_ for _ in chain.get_residues() if PDB.is_aa(_)]))
I appreciate Peters' answer, but I found that checking res.id[0] == " " is more robust (e.g. for HIE). PDB.is_aa() does not recognise HIE as an amino acid, even though HIE is histidine protonated at the ε-nitrogen. So I recommend:
from Bio import PDB
parser = PDB.PDBParser()
pdb1 ='./1bfg.pdb'
structure = parser.get_structure("1bfg", pdb1)
model = structure[0]
res_no = 0
non_resi = 0
for model in structure:
    for chain in model:
        for r in chain.get_residues():
            if r.id[0] == ' ':
                res_no +=1
            else:
                non_resi +=1
print ("Residues: %i" % (res_no))
print ("Other: %i" % (non_resi))
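For readers unfamiliar with the r.id[0] test: a Bio.PDB residue id is a tuple of hetero flag, sequence number and insertion code, and the flag is a blank for standard residues, "W" for water and "H_" plus the residue name for other hetero groups. The sketch below only illustrates that convention on hand-written tuples (the `classify` helper and the ids are invented, nothing is read from a file). Note that a blank flag also covers nucleotides, so it means "standard residue" rather than strictly "amino acid".

```python
def classify(residue_id):
    """Classify a Bio.PDB-style residue id by its hetero flag."""
    hetero_flag = residue_id[0]
    if hetero_flag == " ":
        return "residue"
    if hetero_flag == "W":
        return "water"
    return "hetero"              # flags such as "H_GLC"

# Hypothetical ids, written out by hand for illustration.
ids = [(" ", 10, " "), (" ", 11, " "), ("W", 301, " "), ("H_GLC", 200, " ")]

tally = {}
for rid in ids:
    kind = classify(rid)
    tally[kind] = tally.get(kind, 0) + 1
print(tally)
```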
I think you would actually want to do something like
import Bio.PDB.MMCIF2Dict
# pdb_cif_file is the path to your .cif file
m = Bio.PDB.MMCIF2Dict.MMCIF2Dict(pdb_cif_file)
if '_entity_poly.pdbx_seq_one_letter_code' in m.keys():
    full_structure = m['_entity_poly.pdbx_seq_one_letter_code']
    chains = m['_entity_poly.pdbx_strand_id']
    for c in chains:
        for ci in c.split(','):
            print('Chain %s' % (ci))
            print('Sequence: %s' % (full_structure[chains.index(c)]))
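The same pairing can be written with zip, which avoids the chains.index(c) lookup. This is a sketch under assumptions: the dictionary below is a hand-made stand-in for what MMCIF2Dict returns (both keys holding lists of strings, one entry per entity), the sequences in it are invented, and `chain_sequences` is a hypothetical helper name.

```python
# Hypothetical stand-in for the dictionary returned by MMCIF2Dict.
m = {
    "_entity_poly.pdbx_seq_one_letter_code": ["PALPEDGG\nSGAFPPGH", "MKT"],
    "_entity_poly.pdbx_strand_id": ["A,B", "C"],
}

def chain_sequences(cif_dict):
    """Map every chain id to the sequence of the entity it belongs to."""
    seqs = cif_dict["_entity_poly.pdbx_seq_one_letter_code"]
    strands = cif_dict["_entity_poly.pdbx_strand_id"]
    result = {}
    for seq, strand in zip(seqs, strands):
        clean = "".join(seq.split())     # drop line breaks inside the sequence
        for chain_id in strand.split(","):
            result[chain_id.strip()] = clean
    return result

for chain_id, seq in chain_sequences(m).items():
    print("Chain %s: %s" % (chain_id, seq))
```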
I need to calculate the entropy of a DNA sequence in a FASTA file, from base 10,000 to base 11,000.
Here is what I have so far; it computes the entropy over the whole sequence, but I need it only for the sequence between the 10,000th and the 11,000th base:
from math import log
from Bio import SeqIO
def logent(x):
    if x<=0:
        return 0
    else:
        return -x*log(x)
def entropy(lis):
    return sum([logent(elem) for elem in lis])
for i in SeqIO.parse("hsvs.fasta", "fasta"):
    lisfreq1=[i.seq.count(base)*1.0/len(i.seq) for base in ["A", "C","G","T"]]
    entropy(lisfreq1)
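A Seq object slices like a string, so you can simply take the sub-sequence before counting. Python indices start at 0 and the slice end is exclusive, so subtract 1 from both values below if your positions are counted from 1, e.g.
seq_start = 10000
seq_end = 11000 + 1
for i in SeqIO.parse("hsvs.fasta", "fasta"):
    sub_seq = i.seq[seq_start:seq_end]
    lisfreq1=[sub_seq.count(base)*1.0/len(sub_seq) for base in ["A", "C","G","T"]]
    print(entropy(lisfreq1))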
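Putting the slicing and the entropy formula together, a self-contained sketch could look like this. It assumes positions are 1-based and inclusive, uses the natural logarithm as in the question, and works on a plain string; the function name `window_entropy` and the toy sequence are made up for illustration.

```python
from math import log

def window_entropy(seq, start, end, alphabet="ACGT"):
    """Shannon entropy (natural log) of bases start..end, 1-based inclusive."""
    window = str(seq)[start - 1:end].upper()
    if not window:
        return 0.0
    entropy = 0.0
    for base in alphabet:
        p = window.count(base) / len(window)
        if p > 0:                      # skip absent bases, log(0) is undefined
            entropy -= p * log(p)
    return entropy

toy = "TTTT" + "ACGT" * 2 + "AAAA"
print(window_entropy(toy, 5, 12))     # window is "ACGTACGT"
print(window_entropy(toy, 13, 16))    # window is "AAAA"
```

With Biopython, `window_entropy(record.seq, 10000, 11000)` would cover the range asked about in the question.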