What can I do add this counts, and make right dataframe? - biopython

I have some problem for using Biopython, count and sum each base's numbers for parsing FASTA file. In FASTA file, total A is how much? and total T is?
but there's some problem.
1.
handle2="/home/koreanraichu/sra_data_mo.fasta"
for record2 in SeqIO.parse(handle2,"fasta"):
print(Seq(record2.seq).count("A"))
print(type(Seq(record2.seq).count("A")))
This is code, was it successfully read sequence and count adenine, but It never summarize each numbers. I tried it for list append and sum(), simply add but there's no effective. (each count type is int, but never added and printed separately)
for record2 in SeqIO.parse(handle2,"fasta"):
if len(record2.seq) > 100:
i=0
i=i+len(record2.seq)
else:
j=0
j=j+len(record2.seq)
print(i,j)
like upper, this code doesn't work. I meant this code for It is a conditional sum code that adds DNA of 100 bp or more and DNA of less than 100 bp separately. but it never works, too. it prints last record's data.
What can I do things for solve this?

try this code for first problem:
from Bio import SeqIO
# from Bio.Seq import Seq
handle2="Fasta.fa"
for record2 in SeqIO.parse(handle2,"fasta"):
# print(record2.seq, type(record2.seq))
# print(str(record2.seq), type(str(record2.seq)))
print(record2.seq.count("A"))
# print(type(record2.seq).count("A")) ### --> TypeError: count() missing 1 required positional argument: 'sub'
summarize = 0
for i in 'ATGC':
x = record2.seq.count(i)
print(i, ' : ', x)
summarize += record2.seq.count(i)
print(summarize)
given my test fasta :
>Rosalind_4402
GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
GAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCTGAGGGGCTATCTT
output:
27
A : 27
T : 32
G : 32
C : 29
120
second code :
from Bio import SeqIO
# from Bio.Seq import Seq
# handle2="/home/koreanraichu/sra_data_mo.fasta"
handle2="Fasta2.fa"
i=0
j=0
for record2 in SeqIO.parse(handle2,"fasta"):
if len(record2.seq) > 100:
print('>100 : ', len(record2.seq))
i=i+len(record2.seq)
else:
print('else : ', len(record2.seq))
j=j+len(record2.seq)
print('> 100 summarize : ', i, ' else summarize : ',j)
given test fasta:
>Rosalind_4402
GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
GAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCTGAGGGGCTATCTT
>Rosalind_4403
GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
GAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCTGAGGGGCTATCTT
GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
GAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCTGAGGGGCTATCTT
>Rosalind_4404
GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
>Rosalind_4405
GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATT
>Rosalind_4406
GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
GAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCTGAGGGGCTATCTT
CTTTCAGCTGTAAGAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCT
GAGGGGCTATCTT
>Rosalind_4407
GCAGCTAGCTAGCTAGCTGGGATT
>Rosalind_4408
GCAGCTAGCTAGCTAGCTGGGATTCGGATCGGCGCCCCGAGAGGATTCTTTCAGCTGTAA
GAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGCTGAGGGGCTATCTT
CTTTCAGCTGTAAGAATTTATCCTCGATCGGGCTATAAAACCTACGCATATCTGCTAGC
output:
>100 : 120
>100 : 240
else : 60
else : 47
>100 : 193
else : 24
>100 : 179
> 100 summarize : 732 else summarize : 131

Related

why max_samples does not accept float type?

I am doing machine learning.Here I want to find the best triple (max_samples, n_trees and threshold) that gives the greatest performance in terms of area under ROC curve and area under recall precison curve.
Here is the code:
def meilleur_triplet(x,classes):
for n_trees in np.arange(100,160,10):
for sample_size in np.arange(0.1,1,0.1):
for threshold in np.arange(0.4,1,0.1):
model=IforestLocal(sample_size,n_trees)
model.fit(x)
y_pred,y_score=model.predict(x,threshold)
auc=roc_auc_score(classes,y_pred)
auc_pr=average_precision_score(classes,y_pred)
Now when I use max_samples with a range of int I don't have an error however if it's in float I have the following error:
**
TypeError Traceback (most recent call last)
Input In [201], in <cell line: 1>()
----> 1 meilleur_triplet(X_glass,y_glass)
Input In [200], in meilleur_triplet(x, classes)
6 for threshold in np.arange(0.4,1,0.1):#(0.4,1,0.1)
8 model=IforestLocal(sample_size,n_trees)
----> 9 model.fit(x)
File ~\Desktop\THESE\Maurras\Code_Maurras\iforest_D.py:45, in IsolationForest.fit(self, X)
42 self.sample_size = len_x
44 for i in range(self.n_trees):
---> 45 sample_idx = random.sample(list(range(len_x)), self.sample_size)
46 # TODO: Must be deleted before compute the memory consumption of the methods
47 self.samples.append(sample_idx)
File ~\anaconda3\lib\random.py:450, in Random.sample(self, population, k, counts)
448 if not 0 <= k <= n:
449 raise ValueError("Sample larger than population or is negative")
--> 450 result = [None] * k
451 setsize = 21 # size of a small set minus size of an empty list
452 if k > 5:
TypeError: can't multiply sequence by non-int of type 'numpy.float64'
**
This is where I called the function
meilleur_triplet(X_glass,y_glass)
Thank you please help me

is there a way to split list to chunks?

I have a very large list (~1M items) and I want to unpack it.
I saw that it is impossible to unpack a list larger than 8000 items so I wanted to chunk it.
Basically this function in python
def chunks(lst, size):
for i in range(0, len(lst), size):
yield lst[i:i+size]
(for reference I am running the lua script on redis using eval)
You could write an iterator that gives you chunks of a certain size.
-- an iterator function that will traverse over lst
function chunks(lst, size)
-- our current index
local i = 1
-- our chunk counter
local count = 0
return function()
-- abort if we reached the end of lst
if i > #lst then return end
-- get a slice of lst
local chunk = table.move(lst, i, i + size -1, 1, {})
-- next starting index
i = i + size
count = count + 1
return count, chunk
end
end
-- test
local a = {1,2,3,4,5,6,7,8,9,10,11}
for i, chunk in chunks(a, 2) do
print(string.format("#%d: %s", i, table.concat(chunk, ",")))
end
Output:
#1: 1,2
#2: 3,4
#3: 5,6
#4: 7,8
#5: 9,10
#6: 11
Read this:
https://www.lua.org/pil/7.1.html
https://www.lua.org/manual/5.4/manual.html#3.3.5

Key error in biopython while creating motif

I am trying generate a weblogo for the protein sequences provided. The following is my code:
from Bio.Seq import Seq
from Bio import motifs
from Bio.Alphabet import generic_protein
instances = [Seq("RWST"),
Seq("RTAG"),
Seq("RQGC"),
Seq("RMAA"),
]
m = motifs.create(instances)
m.weblogo("mymotif.png")
I get the following error:
counts[letter][position] += 1
KeyError: 'R'
Full stack trace:
<ipython-input-3-ee8922743152> in <module>()
10
11
---> 12 m = motifs.create(instances)
13 m.weblogo("mymotif.png")
lib/site-packages/Bio/motifs/__init__.py in create(instances, alphabet)
21 def create(instances, alphabet=None):
22 instances = Instances(instances, alphabet)
---> 23 return Motif(instances=instances, alphabet=alphabet)
24
25
lib/site-packages/Bio/motifs/__init__.py in __init__(self, alphabet, instances, counts)
236 self.instances = instances
237 alphabet = self.instances.alphabet
--> 238 counts = self.instances.count()
239 self.counts = matrix.FrequencyPositionMatrix(alphabet, counts)
240 self.length = self.counts.length
lib/site-packages/Bio/motifs/__init__.py in count(self)
192 for instance in self:
193 for position, letter in enumerate(instance):
--> 194 counts[letter][position] += 1
195 return counts
196
KeyError: 'R'
Motif takes an alphabet as a keyword (named) argument, so does motifs.create. If there is none, BioPython assumes the sequence is a DNA and in your case R is not found in the alphabet.
For your example you would need to use IUPAC.protein to make it work.
Note: BioPython uses letters internally to see which characters are available, genericProtein has no letters.
from Bio import motifs
from Bio.Alphabet import IUPAC
from Bio.Seq import Seq
instances = [Seq("RWST", IUPAC.protein),
Seq("RTAG", IUPAC.protein),
Seq("RQGC", IUPAC.protein),
Seq("RMAA", IUPAC.protein),
]
m = motifs.create(instances, IUPAC.protein)
m.weblogo("mymotif.png")

How do I show the contents of a matrix?

Consider:
: cell-matrix
create ( width height "name" ) over , * cells allot
does> ( x y -- addr ) dup cell+ >r # * + cells r> + ;
It is the definition that makes the matrix and then you assign values like this:
5 5 cell-matrix test
And then you stuff the values into there.... There they're...
36 0 0 test !
(I think)
Nowhere on the Internet can you find anything to explain this. How do you show the contents of the matrix?
If you want to print the contents of the whole matrix, you can do something like:
: .row ( addr u -- addr' u ) tuck 0 do #+ . loop swap cr ;
: .matrix ( u addr -- ) >body #+ rot 0 do .row loop 2drop ;
Note that your cell-matrix doesn't save the number of rows, so you have to supply this number to .matrix. E.g. like this:
2 3 cell-matrix foo
3 ' foo .matrix
Logically simple:
100 0 0 test ! ok
400 1 0 test ! ok
0 0 test # . 100 ok
1 0 test # . 400 ok

Convert FASTA to GenBank

Is there a way to use BioPython to convert FASTA files to a Genbank format? There are many answers on how to convert from Genbank to FASTA, but not the other way around.
before convert, you must asign alphabet to sequence (DNA or Protein)
from Bio import SeqIO
from Bio.Alphabet import generic_dna, generic_protein
input_handle = open("test.fasta", "rU")
output_handle = open("test.gb", "w")
sequences = list(SeqIO.parse(input_handle, "fasta"))
#asign generic_dna or generic_protein
for seq in sequences:
seq.seq.alphabet = generic_dna
count = SeqIO.write(sequences, output_handle, "genbank")
output_handle.close()
input_handle.close()
print "Coverted %i records" % count
for input:
>I28Q9A102FII8J rank=0668881 x=2144.0 y=1105.0 length=418
ACGTCATGAGAGTTTGATCATGGCTCAGGACGAACGCTGGCGGCGTGCTTAACACATGCAAGTCGAACGATGAA
GCTCCAGCTTGCTGGGGTGGATTAGTGGCGAACGGGTGAGTAACACGTGAGTAACCTGCCCTTGACTCTGGGAT
AAGCGTTGGAAACGACGTCTAATACCGGATATGACGACCGATGGCATCATCTGGTTGTGGAAAGAATTTTGGTC
AAGGATGGACTCGCGGCCTATCAGGTAGTTGGTGAGGTAATGGCTCACCAAGCCTACGACGGGTAGCCGGCCTG
AGAGGGTGACCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCA
CAATGGGCGAAAGCCTGATGCAGCAACGCCGCGTGAGGGATGACGGCC
>I28Q9A102JMH72 rank=0320459 x=3829.0 y=3120.0 length=512
ACGTCATGAGAGTTTGATCCTGGCTCAGGATGAACGCTAGCGGCAGGCTTAACACATGCAAGTCGAGGGTAGAA
ATAGCTTGCTATTTTGAGACCGGCGCACGGGTGCGTAACGCGTATGCAATCTGCCTTTTACAGGGGAATAGCCC
AGAGAAATTTGGATTAATGCCCCATAGCGCTGCAGGGCGGCATCGCCGAGCAGCTAAAGTCACAACGGTAAAGA
TGAGCATGCGTCCCATTAGCTAGTTGGTAAGGTAACGGCTTACCAAGGCGATGATGGGTAGGGTCCTGAGAGGG
AGATCCCCCACACTGGTACTGAGACACGGACCAGACTCCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGG
GCGCAAGCCTGAACCAGCCATGCCGCGTGCAGGATGAAGGCCTTCGGGTTGTAAACTGCTTTTGACGGAACGAA
AAAGCT
you get:
LOCUS I28Q9A102FII8J 418 bp DNA UNK 01-JAN-1980
DEFINITION I28Q9A102FII8J rank=0668881 x=2144.0 y=1105.0 length=418
ACCESSION I28Q9A102FII8J
VERSION I28Q9A102FII8J
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
ORIGIN
1 acgtcatgag agtttgatca tggctcagga cgaacgctgg cggcgtgctt aacacatgca
61 agtcgaacga tgaagctcca gcttgctggg gtggattagt ggcgaacggg tgagtaacac
121 gtgagtaacc tgcccttgac tctgggataa gcgttggaaa cgacgtctaa taccggatat
181 gacgaccgat ggcatcatct ggttgtggaa agaattttgg tcaaggatgg actcgcggcc
241 tatcaggtag ttggtgaggt aatggctcac caagcctacg acgggtagcc ggcctgagag
301 ggtgaccggc cacactggga ctgagacacg gcccagactc ctacgggagg cagcagtggg
361 gaatattgca caatgggcga aagcctgatg cagcaacgcc gcgtgaggga tgacggcc
//
LOCUS I28Q9A102JMH72 450 bp DNA UNK 01-JAN-1980
DEFINITION I28Q9A102JMH72 rank=0320459 x=3829.0 y=3120.0 length=512
ACCESSION I28Q9A102JMH72
VERSION I28Q9A102JMH72
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
ORIGIN
1 acgtcatgag agtttgatcc tggctcagga tgaacgctag cggcaggctt aacacatgca
61 agtcgagggt agaaatagct tgctattttg agaccggcgc acgggtgcgt aacgcgtatg
121 caatctgcct tttacagggg aatagcccag agaaatttgg attaatgccc catagcgctg
181 cagggcggca tcgccgagca gctaaagtca caacggtaaa gatgagcatg cgtcccatta
241 gctagttggt aaggtaacgg cttaccaagg cgatgatggg tagggtcctg agagggagat
301 cccccacact ggtactgaga cacggaccag actcctacgg gaggcagcag tgaggaatat
361 tggtcaatgg gcgcaagcct gaaccagcca tgccgcgtgc aggatgaagg ccttcgggtt
421 gtaaactgct tttgacggaa cgaaaaagct
//
here's an update of Jose's answer for python3 and new biopython. Biopython doesn't use alphabets any longer. Maybe it will save you a bit of time.
from Bio import SeqIO
input_handle = open("test.fasta", "r")
output_handle = open("test.gb", "w")
sequences = list(SeqIO.parse(input_handle, "fasta"))
# assign molecule type
for seq in sequences:
seq.annotations['molecule_type'] = 'DNA'
count = SeqIO.write(sequences, output_handle, "genbank")
output_handle.close()
input_handle.close()
print("Converted {} records".format(count))
It is possible to convert the fasta to gb format for unsubmitted sequences, which dont have accession numbers. Yet to be submitted to NCBI.

Resources