I want to write a function that reads a FASTA file with (possibly ambiguous) DNA sequences, takes a subsequence as input, and returns the IDs of all sequences that contain the given subsequence.
To make the script more efficient, I tried to use nt_search to get all possibilities of the ambiguous sequence from the FASTA file. This seemed more efficient than producing all unambiguous possibilities, especially for larger sequences and FASTA files.
Right now, I'm struggling to see how I can check whether the subsequence is part of the output given by nt_search.
I want to see if e.g. 'CGC' (the input subsequence) is part of the possibilities given by nt_search, ['TA[GATC][AT][GT]GCGGT'], and return the IDs of all sequences for which this is true.
What I have so far:
def bonus_subsequence(file, unambiguous_sequence):
    seq_records = SeqIO.parse(file,'fasta', alphabet =ambiguous_dna)
    resultListOfSeqIds = []
    print(f'Unambiguous sequence {unambiguous_sequence} could be a subsequence of:')
    for record in seq_records:
        d = Seq.IUPAC.IUPACData.ambiguous_dna_values
        couldBeSubSequence = False;
        if unambiguous_sequence in nt_search(unambiguous_sequence,record):
            couldBeSubSequence = True;
        if couldBeSubSequence == True:
            print(f'{record.id}')
            resultListOfSeqIds.append({record.id})
In a second phase, I want to be able to also use this for ambiguous subsequences, but I'd be more than happy with help on this first question, thanks in advance!
I'm not sure I understood you correctly, but you can try this:
Example fasta file:
>seq1
ATGTACGTACGTACNNNNACTG
>seq2
NNNATCGTAGTCANNA
>seq3
NNNNATGNNN
Code:
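from Bio import SeqIO
from Bio import SeqUtils
# Note: Bio.Alphabet was removed in Biopython 1.78; on newer versions,
# drop this import and the alphabet argument below.
from Bio.Alphabet.IUPAC import ambiguous_dna

if __name__ == '__main__':
    sub_seq = input('Enter a subsequence: ')
    results = []
    with open('test.fasta', 'r') as fh:
        for seq in SeqIO.parse(fh, 'fasta', alphabet=ambiguous_dna):
            # Literal substring test: an 'N' only matches a literal 'N'
            if sub_seq in seq:
                results.append(seq.name)
    # Print a header and then one ID per line
    print('Results:', *results, sep='\n')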
Results (console):
Enter a subsequence: ATG
Results:
seq1
seq3
Enter a subsequence: NNNA
Results:
seq1
seq2
seq3
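Note that the check above is a literal substring match, so an N in the FASTA sequence only matches an N in the input. If the goal is to ask whether an unambiguous subsequence could occur in an ambiguous sequence, one option is to compare each window against the IUPAC code table. The following is only a minimal sketch under that assumption: it uses plain strings instead of SeqRecord objects, the names could_contain and matching_ids are made up for illustration, and the table is typed in by hand rather than taken from Biopython.

```python
# IUPAC ambiguity codes mapped to the bases they can stand for
# (hand-written table; Biopython ships an equivalent mapping).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def could_contain(ambiguous_seq, sub_seq):
    """Return True if the unambiguous sub_seq fits at some position of ambiguous_seq."""
    ambiguous_seq = ambiguous_seq.upper()
    sub_seq = sub_seq.upper()
    n, m = len(ambiguous_seq), len(sub_seq)
    for start in range(n - m + 1):
        window = ambiguous_seq[start:start + m]
        # Every base of sub_seq must be allowed by the code at that position
        if all(base in IUPAC_DNA.get(code, "") for code, base in zip(window, sub_seq)):
            return True
    return False


def matching_ids(records, sub_seq):
    """records is an iterable of (sequence_id, sequence_string) pairs."""
    return [seq_id for seq_id, seq in records if could_contain(seq, sub_seq)]


# 'TA[GATC][AT][GT]GCGGT' from the question, written with IUPAC codes
print(could_contain("TANWKGCGGT", "TGC"))
print(could_contain("TANWKGCGGT", "CGC"))
```

With Biopython, the pairs could be built from the parsed records, e.g. (record.id, str(record.seq)). This scans every window, so it is simple rather than fast.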
I want to download, in FASTA format, all the peptide sequences in the NCBI protein database (i.e. > and the peptide name, followed by the peptide sequence). I saw there is a MeSH term describing what a peptide is here, but I can't work out how to incorporate it.
I wrote this:
import Bio
from Bio import Entrez
Entrez.email = 'test#gmail.com'
handle = Entrez.esearch(db="protein", term="peptide")
record = handle.read()
out_handle = open('myfasta.fasta', 'w')
out_handle.write(record.rstrip('\n'))
But it only writes 995 IDs to the file and no sequences. Could someone show me where I'm going wrong?
Note that a search for the term peptide in the NCBI protein database returns 8187908 hits, so make sure that you actually want to download the peptide sequences for all these hits into one big fasta file.
>>> from Bio import Entrez
>>> Entrez.email = 'test#gmail.com'
>>> handle = Entrez.esearch(db="protein", term="peptide")
>>> record = Entrez.read(handle)
>>> record["Count"]
'8187908'
The default number of records that Entrez.esearch returns is 20. This is to prevent overloading NCBI's servers.
>>> len(record["IdList"])
20
To get the full list of records, change the retmax parameter:
>>> count = record["Count"]
>>> handle = Entrez.esearch(db="protein", term="peptide", retmax=count)
>>> record = Entrez.read(handle)
>>> len(record['IdList'])
8187908
The way to download all the records is to use Entrez.epost.
From chapter 9.4 of the BioPython tutorial:
EPost uploads a list of UIs for use in subsequent search strategies; see the EPost help page for more information. It is available from Biopython through the Bio.Entrez.epost() function.
To give an example of when this is useful, suppose you have a long list of IDs you want to download using EFetch (maybe sequences, maybe citations – anything). When you make a request with EFetch your list of IDs, the database etc, are all turned into a long URL sent to the server. If your list of IDs is long, this URL gets long, and long URLs can break (e.g. some proxies don’t cope well).
Instead, you can break this up into two steps, first uploading the list of IDs using EPost (this uses an “HTML post” internally, rather than an “HTML get”, getting round the long URL problem). With the history support, you can then refer to this long list of IDs, and download the associated data with EFetch.
[...] The returned XML includes two important strings, QueryKey and WebEnv which together define your history session. You would extract these values for use with another Entrez call such as EFetch.
Read chapter 9.15, "Searching for and downloading sequences using the history", to learn how to use QueryKey and WebEnv.
A full working example would then be:
from Bio import Entrez
import time
from urllib.error import HTTPError

DB = "protein"
QUERY = "peptide"
Entrez.email = 'test#gmail.com'

# First search: only used to learn the total number of hits
handle = Entrez.esearch(db=DB, term=QUERY, rettype='fasta')
record = Entrez.read(handle)
# 'Count' comes back as a string, so convert it before using it in range()
count = int(record['Count'])

# Second search: request the IDs of all hits
handle = Entrez.esearch(db=DB, term=QUERY, retmax=count, rettype='fasta')
record = Entrez.read(handle)
id_list = record['IdList']

# Upload the ID list so it can be referred to via the history
post_xml = Entrez.epost(DB, id=",".join(id_list))
search_results = Entrez.read(post_xml)
webenv = search_results['WebEnv']
query_key = search_results['QueryKey']

batch_size = 200
with open('peptides.fasta', 'w') as out_handle:
    for start in range(0, count, batch_size):
        end = min(count, start+batch_size)
        print(f"Going to download record {start+1} to {end}")
        attempt = 0
        success = False
        # Retry each batch up to three times on server-side errors
        while attempt < 3 and not success:
            attempt += 1
            try:
                fetch_handle = Entrez.efetch(db=DB, rettype='fasta',
                                             retstart=start, retmax=batch_size,
                                             webenv=webenv, query_key=query_key)
                success = True
            except HTTPError as err:
                if 500 <= err.code <= 599:
                    print(f"Received error from server {err}")
                    print(f"Attempt {attempt} of 3")
                    time.sleep(15)
                else:
                    raise
        data = fetch_handle.read()
        fetch_handle.close()
        out_handle.write(data)
The first few lines of peptides.fasta then look like this:
>QGT67293.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
MLRKLQAQFLCHSLLLCNISAGSGD
>QGT67288.1 RepA leader peptide Tap (plasmid) [Klebsiella pneumoniae]
MLRKLQAQFLCHSLLLCNISAGSGD
>QGT67085.1 thr operon leader peptide [Klebsiella pneumoniae]
MNRIGMITTIITTTITTGNGAG
>QGT67083.1 leu operon leader peptide [Klebsiella pneumoniae]
MIRTARITSLLLLNACHLRGRLLGDVQR
>QGT67062.1 peptide antibiotic transporter SbmA [Klebsiella pneumoniae]
MFKSFFPKPGPFFISAFIWSMLAVIFWQAGGGDWLLRVTGASQNVAISAARFWSLNYLVFYAYYLFCVGV
FALFWFVYCPHRWQYWSILGTSLIIFVTWFLVEVGVAINAWYAPFYDLIQSALATPHKVSINQFYQEIGV
FLGIAIIAVIIGVMNNFFVSHYVFRWRTAMNEHYMAHWQHLRHIEGAAQRVQEDTMRFASTLEDMGVSFI
NAVMTLIAFLPVLVTLSEHVPDLPIVGHLPYGLVIAAIVWSLMGTGLLAVVGIKLPGLEFKNQRVEAAYR
KELVYGEDDETRATPPTVRELFRAVRRNYFRLYFHYMYFNIARILYLQVDNVFGLFLLFPSIVAGTITLG
LMTQITNVFGQVRGSFQYLISSWTTLVELMSIYKRLRSFERELDGKPLQEAIPTLR
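The batching arithmetic in the download loop can be checked without contacting NCBI. The following is a small sketch of just that part; the helper name batch_ranges is hypothetical and not part of Biopython. It also shows why the Count value, which Entrez.read returns as a string, has to be converted to an integer before it is passed to range().

```python
def batch_ranges(count, batch_size):
    """Yield (start, end) pairs; start is the 0-based retstart, end is exclusive."""
    count = int(count)  # Entrez returns the hit count as a string
    for start in range(0, count, batch_size):
        yield start, min(count, start + batch_size)


# Example with a made-up hit count of 450 and the batch size used above
for start, end in batch_ranges("450", 200):
    print(f"Going to download record {start + 1} to {end}")
```

Each start value would be passed as retstart and batch_size as retmax in the efetch call.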
I am following this document clustering tutorial. As input I use a text file which can be downloaded here. It is a combination of 3 other text files, separated by \n. After creating a tf-idf matrix I received this warning:
UserWarning: Your stop_words may be inconsistent with your preprocessing.
Tokenizing the stop words generated tokens ['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'cri', 'describ', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'forti', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'mani', 'meanwhil', 'moreov', 'nobodi', 'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'togeth', 'twelv', 'twenti', 'veri', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv'] not in stop_words.
'stop_words.' % sorted(inconsistent))
I guess it has something to do with the order of stemming and stop word removal, but as this is my first text processing project, I am a bit lost and I don't know how to fix this.
import pandas as pd
import nltk
from nltk.corpus import stopwords
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stopwords = stopwords.words('english')
stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as it's own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

totalvocab_stemmed = []
totalvocab_tokenized = []
with open('shortResultList.txt', encoding="utf8") as synopses:
    for i in synopses:
        allwords_stemmed = tokenize_and_stem(i) # for each item in 'synopses', tokenize/stem
        totalvocab_stemmed.extend(allwords_stemmed) # extend the 'totalvocab_stemmed' list
        allwords_tokenized = tokenize_only(i)
        totalvocab_tokenized.extend(allwords_tokenized)
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)
print ('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')
print (vocab_frame.head())

#define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))
with open('shortResultList.txt', encoding="utf8") as synopses:
    tfidf_matrix = tfidf_vectorizer.fit_transform(synopses) #fit the vectorizer to synopses
print(tfidf_matrix.shape)
The warning is trying to tell you that if your text contains "always" it will be normalised to "alway" before matching against your stop list which includes "always" but not "alway". So it won't be removed from your bag of words.
The solution is to make sure that you preprocess your stop list to make sure that it is normalised like your tokens will be, and pass the list of normalised words as stop_words to the vectoriser.
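As a rough illustration of that idea, the stop list can be passed through the same normalisation function as the tokens before it is handed to the vectoriser. This is only a sketch: normalise_stop_words is a made-up helper, and toy_stem is a deliberately crude stand-in so the example runs without NLTK. In the question's setup the stemmer would be stemmer.stem from SnowballStemmer("english").

```python
def normalise_stop_words(stop_words, stem):
    """Return the stop words after applying the same stemmer used for the tokens."""
    return sorted({stem(word) for word in stop_words})


def toy_stem(word):
    # Hypothetical stand-in for a real stemmer: drops one trailing "s" or "e"
    for suffix in ("s", "e"):
        if len(word) > 3 and word.endswith(suffix):
            return word[:-1]
    return word


stop_words = ["always", "above", "the", "because"]
print(normalise_stop_words(stop_words, toy_stem))
```

The resulting list would then be passed as stop_words to TfidfVectorizer in place of 'english', together with the same tokenizer.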
I had the same problem and for me the following worked:
1. Filter out the stop words inside the tokenize function.
2. Remove the stop_words parameter from TfidfVectorizer.
Like so:
stopwords = stopwords.words('english')
stemmer = SnowballStemmer("english")
def tokenize_and_stem(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    # exclude stopwords before stemming; compare in lower case because the stop list is lower case
    stems = [stemmer.stem(t) for t in filtered_tokens if t.lower() not in stopwords]
    return stems
Then delete the stop_words parameter from the vectorizer:
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.8, max_features=200000, min_df=0.2,
    use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3)
)
I ran into this problem with Brazilian Portuguese (PT-BR) text.
TL;DR: Remove the accents from your text.
# Special thanks for the user Humberto Diogenes from Python List (answer from Aug 11, 2008)
# Link: http://python.6.x6.nabble.com/O-jeito-mais-rapido-de-remover-acentos-de-uma-string-td2041508.html
# I found the issue by chance (I swear, haha) but this guy gave the tip before me
# Link: https://github.com/scikit-learn/scikit-learn/issues/12897#issuecomment-518644215
from unicodedata import normalize
import spacy
nlp = spacy.load('pt_core_news_sm')
# Define default stopwords list
stoplist = spacy.lang.pt.stop_words.STOP_WORDS
def replace_ptbr_char_by_word(word):
    """Strip the accents from a single token."""
    word = str(word)
    word = normalize('NFKD', word).encode('ASCII','ignore').decode('ASCII')
    return word
def remove_pt_br_char_by_text(text):
    """Drop stop words, then strip the accents from the remaining tokens of the text."""
    text = str(text)
    text = " ".join(replace_ptbr_char_by_word(word) for word in text.split() if word not in stoplist)
    return text
# df is assumed to be an existing pandas DataFrame with a 'text' column
df['text'] = df['text'].apply(remove_pt_br_char_by_text)
I put the solution and references in this gist.
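Manually adding those words to the 'stop_words' list can solve the problem.
# assuming safe_get_stop_words comes from the stop-words package
from stop_words import safe_get_stop_words
stop_words = safe_get_stop_words('en')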
stop_words.extend(['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'cri', 'describ', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'forti', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'mani', 'meanwhil', 'moreov', 'nobodi', 'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'togeth', 'twelv', 'twenti', 'veri', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv'])
from Bio.Blast import NCBIXML
from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast(
    "blastn",
    "nr",
    "CACTTATTTAGTTAGCTTGCAACCCTGGATTTTTGTTTACTGGAGAGGCC",
    entrez_query='"Beutenbergia cavernae DSM 12333" [Organism]')
blast_records = NCBIXML.parse(result_handle)
for blast_record in blast_records:
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            print(hsp.query[0:75] + '...')
            print(hsp.match[0:75] + '...')
            print(hsp.sbjct[0:75] + '...')
This does not give me any output, although the sequence is actually part of the genome, so I should get a result.
Where is the error? Is the query correct?
Your query isn't returning any results, and the default BLAST parameters are the cause. These parameters work better for short queries like this one:
result_handle = NCBIWWW.qblast(
    "blastn",
    "nr",
    "CACTTATTTAGTTAGCTTGCAACCCTGGATTTTTGTTTACTGGAGAGGCC",
    megablast=False,
    expect=1000,
    word_size=7,
    nucl_reward=1,
    nucl_penalty=-3,
    gapcosts="5 2",
    entrez_query='Beutenbergia cavernae DSM 12333 [Organism]')
Particularly the expect parameter plays a major role here.
I am fairly new to Python and I love it. However, I am stuck on this problem and I hope you can give me a hint about what I am missing.
I have a list of gene IDs in an Excel file and I am trying to use xlrd and Biopython to retrieve the sequences and save my results (in FASTA format) to a text document. So far, my code lets me see the results in the shell, but it only saves the last sequence to the document.
This is my code:
import xlrd
import re
book = xlrd.open_workbook('ids.xls')
sh = book.sheet_by_index(0)
for rx in range(sh.nrows):
    if sh.row(rx)[0].value:
        from Bio import Entrez
        from Bio import SeqIO
        Entrez.email = "mail#xxx.com"
        in_handle = Entrez.efetch(db="nucleotide", rettype="fasta", id=sh.row(rx)[0].value)
        record = SeqIO.parse(in_handle, "fasta")
        for record in SeqIO.parse(in_handle, "fasta"):
            print record.format("fasta")
        out_handle = open("example.txt", "w")
        SeqIO.write(record, out_handle, "fasta")
        in_handle.close()
        out_handle.close()
As I mentioned, the file "example.txt" only contains the last sequence (in FASTA format) that is shown in the shell.
Could anyone please help me get all the sequences I retrieve from NCBI into the same document?
Thank you very much,
Antonio
I am also fairly new to Python and also love it! This is my first attempt at answering a question, but maybe it is because of your loop structure and the 'w' mode? Perhaps try changing ("example.txt", "w") to append mode ("example.txt", "a"), as below:
import xlrd
import re
book = xlrd.open_workbook('ids.xls')
sh = book.sheet_by_index(0)
for rx in range(sh.nrows):
    if sh.row(rx)[0].value:
        from Bio import Entrez
        from Bio import SeqIO
        Entrez.email = "mail#xxx.com"
        in_handle = Entrez.efetch(db="nucleotide", rettype="fasta", id=sh.row(rx)[0].value)
        record = SeqIO.parse(in_handle, "fasta")
        for record in SeqIO.parse(in_handle, "fasta"):
            print record.format("fasta")
        out_handle = open("example.txt", "a")
        SeqIO.write(record, out_handle, "fasta")
        in_handle.close()
        out_handle.close()
Nearly there my friends!
The main problem is that your for loop reopens and closes the output file on every iteration, and opening it with "w" truncates it each time. I also fixed some minor issues that should speed up the code a little (e.g. you kept importing Bio on every iteration).
Use this new code:
import xlrd
from Bio import Entrez
from Bio import SeqIO

Entrez.email = "mail#xxx.com"
book = xlrd.open_workbook('ids.xls')
sh = book.sheet_by_index(0)
# Open the output file once, before the loop, so earlier records are not overwritten
out_handle = open("example.txt", "w")
for rx in range(sh.nrows):
    gene_id = sh.row(rx)[0].value
    if gene_id:
        # Pass the ID stored in the cell, not the row index
        in_handle = Entrez.efetch(db="nucleotide", rettype="fasta", id=gene_id)
        records = SeqIO.parse(in_handle, "fasta")
        SeqIO.write(records, out_handle, "fasta")
        in_handle.close()
# Close the output file only after all records have been written
out_handle.close()
If it still errors, it must be a problem in your Excel file. Send it to me if the error persists and I will help :)
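The effect of opening the output file once can be tried without Excel or NCBI. Below is a minimal sketch using made-up (identifier, sequence) pairs and a temporary directory; write_fasta is a hypothetical helper, not a Biopython function. The with statement closes the file after the loop has finished, so every record ends up in the same file.

```python
import os
import tempfile


def write_fasta(path, records):
    """Write (identifier, sequence) pairs to a single FASTA file, opened only once."""
    with open(path, "w") as out_handle:
        for identifier, sequence in records:
            out_handle.write(f">{identifier}\n{sequence}\n")


# Made-up records standing in for the sequences fetched from NCBI
records = [("id1", "ACGT"), ("id2", "GGCC"), ("id3", "TTAA")]
path = os.path.join(tempfile.mkdtemp(), "example.txt")
write_fasta(path, records)

with open(path) as handle:
    content = handle.read()
print(content)
```

Opening the file with "w" inside the loop instead would leave only the last record, which is the behaviour described in the question.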