Find sequence IDs of DNA subsequences in DNA-sequences from FASTA-file - biopython

I want to write a function that reads a FASTA file of DNA sequences (possibly ambiguous), takes a subsequence as input, and returns the IDs of all sequences that contain that subsequence.
To make the script more efficient, I tried to use nt_search to get all possibilities of the ambiguous sequence from the FASTA file. This seemed more efficient than generating every unambiguous possibility, especially for larger sequences and FASTA files.
Right now, I'm struggling to see how I can check whether the subsequence is part of the output given by nt_search.
I want to see if e.g. 'CGC' (the input subsequence) is among the possibilities given by nt_search: ['TA[GATC][AT][GT]GCGGT'], and return the IDs of all sequences for which this is true.
What I have so far:
def bonus_subsequence(file, unambiguous_sequence):
    seq_records = SeqIO.parse(file,'fasta', alphabet =ambiguous_dna)
    resultListOfSeqIds = []
    print(f'Unambiguous sequence {unambiguous_sequence} could be a subsequence of:')
    for record in seq_records:
        d = Seq.IUPAC.IUPACData.ambiguous_dna_values
        couldBeSubSequence = False;
        if unambiguous_sequence in nt_search(unambiguous_sequence,record):
            couldBeSubSequence = True;
        if couldBeSubSequence == True:
            print(f'{record.id}')
            resultListOfSeqIds.append({record.id})
In a second phase, I want to be able to also use this for ambiguous subsequences, but I'd be more than happy with help on this first question, thanks in advance!

I'm not sure I understood you correctly, but you can try this:
Example fasta file:
>seq1
ATGTACGTACGTACNNNNACTG
>seq2
NNNATCGTAGTCANNA
>seq3
NNNNATGNNN
Code:
from Bio import SeqIO

if __name__ == '__main__':
    sub_seq = input('Enter a subsequence: ')
    results = []
    with open('test.fasta', 'r') as fh:
        # Bio.Alphabet and the alphabet argument were removed in Biopython 1.78,
        # so the file is parsed without them here.
        for seq in SeqIO.parse(fh, 'fasta'):
            # Literal substring test: an 'N' only matches a literal 'N'.
            if sub_seq in seq.seq:
                results.append(seq.name)
    print('Results:')
    print(*results, sep='\n')
Results (console):
Enter a subsequence: ATG
Results:
seq1
seq3
Enter a subsequence: NNNA
Results:
seq1
seq2
seq3
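Note that the check above is a literal substring test, so an N in the FASTA sequence only matches an N in the query. If the goal is to ask whether a query could occur once the ambiguity codes are resolved, one possible approach is to compare the sets of bases each code stands for, position by position. The following is only a minimal sketch under some assumptions: the IUPAC table is hard-coded so that the example runs without Biopython (Biopython ships an equivalent table as ambiguous_dna_values), the helper names could_contain and matching_ids are made up for illustration, and records are plain (id, sequence) tuples rather than SeqRecord objects.

```python
# IUPAC nucleotide codes and the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def could_contain(sequence, query):
    """Return True if some resolution of the ambiguity codes in `sequence`
    and `query` makes `query` a substring of `sequence`.

    Raises KeyError for characters outside the IUPAC table (e.g. gaps).
    """
    seq_sets = [set(IUPAC[base]) for base in sequence.upper()]
    query_sets = [set(IUPAC[base]) for base in query.upper()]
    for start in range(len(seq_sets) - len(query_sets) + 1):
        window = seq_sets[start:start + len(query_sets)]
        # Every position must share at least one possible base.
        if all(s & q for s, q in zip(window, query_sets)):
            return True
    return False


def matching_ids(records, query):
    """IDs of all (id, sequence) pairs that could contain `query`."""
    return [seq_id for seq_id, sequence in records
            if could_contain(sequence, query)]


# The three records from the example FASTA file above.
records = [
    ("seq1", "ATGTACGTACGTACNNNNACTG"),
    ("seq2", "NNNATCGTAGTCANNA"),
    ("seq3", "NNNNATGNNN"),
]
print(matching_ids(records, "CCCCC"))
```

Because both sides are expanded to sets, the same function also accepts ambiguous queries, which covers the "second phase" mentioned in the question. It scans every window, so for very large files a regex or index-based approach may be preferable.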

Related

Why does the 'join' method for Seq object in Biopython not work on the last element of a list?

The code below is from the Biopython tutorial. I intend to add a spacer of ten N's after every contig. Why is the trailing spacer not present after the third contig "TTGCA"?
from Bio.Seq import Seq
contigs = [Seq("ATG"), Seq("ATCCCG"), Seq("TTGCA")]
spacer = Seq("N"*10)
spacer.join(contigs)
output
Seq('ATGNNNNNNNNNNATCCCGNNNNNNNNNNTTGCA')
expected output
Seq('ATGNNNNNNNNNNATCCCGNNNNNNNNNNTTGCANNNNNNNNNN')
Don't indexes in Python and Biopython both begin at 0?
Thank you
This has nothing to do with biopython.
This is just how string.join works:
contigs = ["ATG", "ATCCCG", "TTGCA"]
spacer = "N"*10
spacer.join(contigs)
Result:
ATGNNNNNNNNNNATCCCGNNNNNNNNNNTTGCA
As it should - according to help(str.join):
join(self, iterable, /)
Concatenate any number of strings.
The string whose method is called is inserted in between each given string.
The result is returned as a new string.
Example: '.'.join(['ab', 'pq', 'rs']) -> 'ab.pq.rs'
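If a trailing spacer is what you want, you have to add it yourself. A small sketch with plain strings (Seq objects also support + for concatenation, so the same idea should carry over; the variable names with_trailing and per_contig are just for illustration):

```python
contigs = ["ATG", "ATCCCG", "TTGCA"]
spacer = "N" * 10

# join() only places the spacer between items, so append one more for the tail.
with_trailing = spacer.join(contigs) + spacer

# Equivalent: attach the spacer to every contig, then concatenate.
per_contig = "".join(contig + spacer for contig in contigs)

print(with_trailing)
```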

How to query a region in a fasta file using biopython?

I have a fasta file with some reference genome.
I would like to obtain the reference nucleotides as a string given the chromosome, start and end indexes.
I am looking for a function which would look like this in code:
from Bio import SeqIO
p = '/path/to/reference.fa'
seqs = SeqIO.parse(p.open(), 'fasta')
string = seqs.query(id='chr7', start=10042, end=10252)
and string should look like: 'GGCTACGAACT...'
All I have found is how to iterate over seqs, and how to pull data from NCBI, which is not what I'm looking for.
What is the right way to do this in biopython?
AFAIK, biopython does not currently have this functionality. For random lookups using an index (please see samtools faidx), you'll probably want either pysam or pyfaidx. Here's an example using the pysam.FastaFile class which allows you to quickly 'fetch' sequences in a region:
import pysam
ref = pysam.FastaFile('/path/to/reference.fa')
seq = ref.fetch('chr7', 10042, 10252)
print(seq)
Or using pyfaidx and the 'get_seq' method:
from pyfaidx import Fasta
ref = Fasta('/path/to/reference.fa')
seq = ref.get_seq('chr7', 10042, 10252)
print(seq)
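For small files where installing an indexing library is not worth it, the same lookup can be sketched with the standard library alone. This is not how pysam or pyfaidx work: it reads the whole file into memory instead of using an index, so it is only an illustration. The function names read_fasta and fetch_region, the demo file contents, and the choice of 0-based, end-exclusive coordinates (like Python slicing) are all assumptions made for this example.

```python
import os
import tempfile


def read_fasta(path):
    """Parse a FASTA file into a dict mapping record ID to sequence string."""
    sequences = {}
    current_id = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # The ID is the first whitespace-delimited token of the header.
                current_id = (line[1:].split() or [""])[0]
                sequences[current_id] = []
            elif current_id is not None:
                sequences[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in sequences.items()}


def fetch_region(path, seq_id, start, end):
    """Return sequence[start:end] (0-based, end-exclusive) for one record."""
    return read_fasta(path)[seq_id][start:end]


# Tiny made-up FASTA file, written to a temporary location for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">chr7 demo\nACGTACGTAC\nGGGTTTAAAC\n>chr8\nTTTT\n")
    demo_path = tmp.name

region = fetch_region(demo_path, "chr7", 8, 13)
print(region)
os.remove(demo_path)
```

For a real reference genome, an indexed reader such as the ones shown above is the better choice, since it avoids loading every chromosome just to extract a short region.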

Convert Table Elements to Integers

I'm trying to create a list of integers, similar to Python, where one would write
x = input("Enter String").split() # 1 2 3 5
x = list(map(int,x)) # Converts x = "1","2","3","5" to x = 1,2,3,5
Here's my code, which asks for the input and then splits it into a table. I need help converting the contents of the table to integers, because they're referenced later in a function and I'm getting a string vs. integer comparison error. I've tried changing the split for-loop to take a number, but that doesn't work. I'm familiar with the conversion in Python but not in Lua, so I'm looking for some guidance on converting my table or handling this better.
function main()
    print("Hello Welcome the to Change Maker - LUA Edition")
    print("Enter a series of change denominations, separated by spaces")
    input = io.read()
    deno = {}
    for word in input:gmatch("%w+") do table.insert(deno,word) end
end
--Would This Work?:
--for num in input:gmatch("%d+") do table.insert(deno,num) end
Just convert your number strings to numbers using tonumber:
local number = tonumber("1")
So
for num in input:gmatch("%d+") do table.insert(deno,tonumber(num)) end
should do the trick.

Biopython: Can't use .count() for biopython

My goal here is to get the number of times 'g' appears in a DNA sequence.
I imported a DNA sequence via Biopython using a list comprehension
seq = [record for record in SeqIO.parse('sequences/hiv.gbk.rtf', 'fasta')]
I then tried using the .count() method on the newly created list
print(seq.count('g'))
I get an error that reads
NotImplementedError: SeqRecord comparison is deliberately not
implemented. Explicitly compare the attributes of interest.
Anyone know what the dealio is? Biopython's manual says all standard python methods should work.
You are trying to apply count to a list. You would need to apply it to the sequence of each element, e.g.
print(seq[0].seq.count('g'))
or, if you want the total across all sequences,
print(sum([s.seq.count('g') for s in seq]))
Here is a minimal working example
from Bio import SeqIO
txt = """>gnl|TC-DB|O60669|2.A.1.13.5 Monocarboxylate transporter 2 - Homo sapiens (Human).
MPPMPSAPPVHPPPDGGWGWIVVGAAFISIGFSYAFPKAVTVFFKEIQQIFHTTYSEIAW
>gnl|TC-DB|O60706|3.A.1.208.23 ATP-binding cassette sub-family C member 9 OS=Homo sapiens GN=ABCC9 PE=1 SV=2
MSLSFCGNNISSYNINDGVLQNSCFVDALNLVPHVFLLFITFPILFIGWGSQSSKVQIHH
>gnl|TC-DB|O60721|3.A.1.208.23 Sodium/potassium/calcium exchanger 1 OS=Homo sapiens GN=SLC24A1 PE=1 SV=1
MGKLIRMGPQERWLLRTKRLHWSRLLFLLGMLIIGSTYQHLRRPRGLSSLWAAVSSHQPI
>gnl|TC-DB|O60779|2.A.1.13.5 Thiamine transporter 1 (THTR-1) (ThTr1) (Thiamine carrier 1) (TC1) - Homo sapiens (Human).
MDVPGPVSRRAAAAAATVLLRTARVRRECWFLPTALLCAYGFFASLRPSEPFLTPYLLGP"""
filename = 'sequences.fa'
with open(filename, 'w') as f:
    f.write(txt)
seqs = [record for record in SeqIO.parse(filename, 'fasta')]
print(sum([s.seq.count('P') for s in seqs]))
# prints 21
print(seqs[0].seq.count('P'))
# prints 9
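The difference between the two kinds of count can be seen without Biopython. In this small sketch the sequences are made-up plain strings standing in for the .seq of each record. With plain strings list.count simply returns 0, whereas with SeqRecord objects it raises the NotImplementedError from the question, because comparing records is deliberately disabled.

```python
sequences = ["atggcg", "GGTTAG", "ccgta"]

# list.count compares whole elements, and no element equals 'g'.
whole_element_matches = sequences.count("g")

# str.count looks inside each string; sum over the list for a total.
lowercase_g = sum(s.count("g") for s in sequences)

# count() is case-sensitive, so normalise case to count 'g' and 'G' together.
any_case_g = sum(s.lower().count("g") for s in sequences)

print(whole_element_matches, lowercase_g, any_case_g)
```

The case sensitivity matters in practice: GenBank-derived sequences are often lowercase while many FASTA files are uppercase, so it is worth checking which one you have before counting.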

Can Z3 call python function during decision making of variables?

I am trying to solve a problem. For example, I have 4 points, and each pair of points has a cost between them. I want to find a sequence of nodes whose total cost is less than a bound. I have written some code, but it does not seem to work. The main problem is that I have defined a Python function and am trying to call it within a constraint.
Here is my code. I have a function def getVal(n1,n2): where n1 and n2 are of Int sort. The line Nodes = [ Int("n_%s" % (i)) for i in range(totalNodeNumber) ] defines 4 points of Int sort, and when I add the constraint s.add(getVal(Nodes[0], Nodes[1]) + getVal(Nodes[1], Nodes[2]) < 100), it calls the getVal function immediately. What I want is for the function to be called to get the cost between two points once Z3 has decided values for Nodes[0], Nodes[1], Nodes[2], Nodes[3].
from z3 import *
import random
totalNodeNumber = 4
Nodes = [ Int("n_%s" % (i)) for i in range(totalNodeNumber) ]
def getVal(n1,n2):
    # I need the n1 and n2 values assigned by Z3
    cost = random.randint(1,20)
    print(cost)
    return IntVal(cost)
s = Solver()
#constraint: Each Nodes value should be distinct
nodes_index_distinct_constraint = Distinct(Nodes)
s.add(nodes_index_distinct_constraint)
#constraint: Each Nodes value should be between 0 and totalNodeNumber
def get_node_index_value_constraint(i):
    return And(Nodes[i] >= 0, Nodes[i] < totalNodeNumber)
nodes_index_constraint = [ get_node_index_value_constraint(i) for i in range(totalNodeNumber)]
s.add(nodes_index_constraint)
#constraint: Problem with this constraint
# Here is the problem: the Python getVal function is just called right away, without Nodes[0], Nodes[1], Nodes[2] being assigned values
# But I want Z3 to call the Python function during its decision making of variables
s.add(getVal(Nodes[0], Nodes[1]) + getVal(Nodes[1], Nodes[2]) + getVal(Nodes[2], Nodes[3]) < 100)
if s.check() == sat:
    print("SAT")
    print("Model: ")
    m = s.model()
    nodeIndex = [ m.evaluate(Nodes[i]) for i in range(totalNodeNumber) ]
    print(nodeIndex)
else:
    print("UNSAT")
    print("No solution found !!")
If this is not the right way to solve the problem, could you please tell me what an alternative would be? Can I encode this kind of problem, finding an optimal sequence of waypoints, using the Z3 solver?
I don't understand what problem you need to solve. Definitely, the way getVal is formulated does not make sense. It does not use the arguments n1, n2. If you want to examine values produced by a model, then you do this after Z3 returns from a call to check().
I don't think you can use a Python function in your SMT logic. What you could do instead is define getVal as a Function like this:
getVal = Function('getVal',IntSort(),IntSort(),IntSort())
And constrain the edge weights as:
s.add(And(getVal(0,1)==1,getVal(1,2)==2,getVal(0,2)==3))
In the Function declaration, the first two IntSort arguments are the sorts of getVal's inputs (the node ids), and the last IntSort is the sort of its result (the weight).
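For an instance as small as 4 nodes, it can also help to have a plain-Python cross-check that does what the question originally wanted: look up the cost only after a concrete ordering has been chosen. The sketch below is not Z3; it simply enumerates every ordering with itertools. The cost table is made up for illustration, and the helper names edge_cost, path_cost and paths_under are hypothetical.

```python
from itertools import permutations

# Hypothetical symmetric edge costs between 4 nodes (made-up numbers).
COSTS = {
    (0, 1): 10, (0, 2): 45, (0, 3): 30,
    (1, 2): 20, (1, 3): 60, (2, 3): 15,
}


def edge_cost(a, b):
    """Look up the cost of the undirected edge between nodes a and b."""
    return COSTS[(min(a, b), max(a, b))]


def path_cost(path):
    """Total cost of visiting the nodes in the given order."""
    return sum(edge_cost(a, b) for a, b in zip(path, path[1:]))


def paths_under(bound, node_count=4):
    """Every ordering of all nodes whose total cost is below `bound`."""
    return [p for p in permutations(range(node_count)) if path_cost(p) < bound]


cheap = paths_under(50)
best = min(permutations(range(4)), key=path_cost)
print(cheap)
print(best, path_cost(best))
```

Enumeration grows factorially with the number of nodes, so this is only useful for checking a solver encoding on tiny inputs. In Z3 itself, the same cost table would be expressed as constraints on getVal, as in the answer above.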
