Reading a FASTA file from a URL

I'm using Python 3.4.
I wrote some code to read a FASTA file from an internet site, but it didn't work.
http://www.uniprot.org/uniprot/B5ZC00.fasta
(I can download and read it as a text file, but I'm planning to read multiple FASTA files from the given site.)
(1) The first attempt
# read FASTA file
def read_fasta(filename_as_string):
    """
    open text file with FASTA format
    read it and convert it into string list
    convert the list to dictionary
    >>> read_fasta('sample.txt')
    {'Rosalind_0000':'GTAT....ATGA', ... }
    """
    f = open(filename_as_string,'r')
    content = [line.strip() for line in f]
    f.close()
    new_content = []
    for line in content:
        if '>Rosalind' in line:
            new_content.append(line.strip('>'))
            new_content.append('')
        else:
            new_content[-1] += line
    dict = {}
    for i in range(len(new_content)-1):
        if i % 2 == 0:
            dict[new_content[i]] = new_content[i+1]
    return dict
This code can read any FASTA file on my desktop computer, but it fails to read the same FASTA file from the internet site.
>>> from urllib.request import urlopen
>>> html = urlopen("http://www.uniprot.org/uniprot/B5ZC00.fasta")
>>> print (read_fasta(html))
TypeError: invalid file: <http.client.HTTPResponse object at 0x02A62EF0>
(2) The second attempt
>>> from urllib.request import urlopen
>>> html = urlopen("http://www.uniprot.org/uniprot/B5ZC00.fasta")
>>> lines = [x.strip() for x in html.readlines()]
>>> print (lines)
[b'>sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase OS=Ureaplasma urealyticum serovar 10 (strain ATCC 33699 / Western) GN=glyQS PE=3 SV=1', b'MKNKFKTQEELVNHLKTVGFVFANSEIYNGLANAWDYGPLGVLLKNNLKNLWWKEFVTKQ', b'KDVVGLDSAIILNPLVWKASGHLDNFSDPLIDCKNCKARYRADKLIESFDENIHIAENSS', b'NEEFAKVLNDYEISCPTCKQFNWTEIRHFNLMFKTYQGVIEDAKNVVYLRPETAQGIFVN', b'FKNVQRSMRLHLPFGIAQIGKSFRNEITPGNFIFRTREFEQMEIEFFLKEESAYDIFDKY', b'LNQIENWLVSACGLSLNNLRKHEHPKEELSHYSKKTIDFEYNFLHGFSELYGIAYRTNYD', b'LSVHMNLSKKDLTYFDEQTKEKYVPHVIEPSVGVERLLYAILTEATFIEKLENDDERILM', b'DLKYDLAPYKIAVMPLVNKLKDKAEEIYGKILDLNISATFDNSGSIGKRYRRQDAIGTIY', b'CLTIDFDSLDDQQDPSFTIRERNSMAQKRIKLSELPLYLNQKAHEDFQRQCQK']
I thought I could modify my code to read the online FASTA file as a list of strings, but I soon realized it was not easy.
>>> print (type(lines[0]))
<class 'bytes'>
I can't remove the 'b' prefix at the start of each list element.
>>> print (lines[0])
b'>sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase ...
>>> print (lines[0][1:])
b'sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase ...
(3) Questions
How can I remove the 'b' prefix?
Is there a better way to read a FASTA file from a given URL?
With some help, I think I can modify and complete my code.
Thanks.

I'm late, but here's an answer in case it's useful.
In Python 2:
import urllib2
url = "http://www.uniprot.org/uniprot/B5ZC00.fasta"
response = urllib2.urlopen(url)
fasta = response.read()
print fasta
In Python 3:
from urllib.request import urlopen
url = "http://www.uniprot.org/uniprot/B5ZC00.fasta"
response = urlopen(url)
fasta = response.read().decode("utf-8", "ignore")
print(fasta)
you get:
>sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase OS=Ureaplasma urealyticum serovar 10 (strain ATCC 33699 / Western) GN=glyQS PE=3 SV=1
MKNKFKTQEELVNHLKTVGFVFANSEIYNGLANAWDYGPLGVLLKNNLKNLWWKEFVTKQ
KDVVGLDSAIILNPLVWKASGHLDNFSDPLIDCKNCKARYRADKLIESFDENIHIAENSS
NEEFAKVLNDYEISCPTCKQFNWTEIRHFNLMFKTYQGVIEDAKNVVYLRPETAQGIFVN
FKNVQRSMRLHLPFGIAQIGKSFRNEITPGNFIFRTREFEQMEIEFFLKEESAYDIFDKY
LNQIENWLVSACGLSLNNLRKHEHPKEELSHYSKKTIDFEYNFLHGFSELYGIAYRTNYD
LSVHMNLSKKDLTYFDEQTKEKYVPHVIEPSVGVERLLYAILTEATFIEKLENDDERILM
DLKYDLAPYKIAVMPLVNKLKDKAEEIYGKILDLNISATFDNSGSIGKRYRRQDAIGTIY
CLTIDFDSLDDQQDPSFTIRERNSMAQKRIKLSELPLYLNQKAHEDFQRQCQK
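If you want the result as a dictionary like your read_fasta returns, you can split the decoded text yourself. A minimal sketch (parse_fasta_text is just an illustrative helper, not part of any library):
from urllib.request import urlopen

def parse_fasta_text(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith('>'):
            header = line[1:]        # drop the leading '>'
            records[header] = ''
        elif header is not None:
            records[header] += line  # append sequence lines to the current record
    return records

url = "http://www.uniprot.org/uniprot/B5ZC00.fasta"
fasta = urlopen(url).read().decode("utf-8")
print(parse_fasta_text(fasta))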
BONUS
It is better to use Biopython (example for Python 2):
from Bio import SeqIO
import urllib2
url = "http://www.uniprot.org/uniprot/B5ZC00.fasta"
response = urllib2.urlopen(url)
fasta_iterator = SeqIO.parse(response, "fasta")
for seq in fasta_iterator:
    print seq.format("fasta")

If you're only interested in the primary amino acid sequence (i.e. you want to ignore the header), try the following:
import sys
import urllib

link = str(sys.argv[1])  # FASTA file URL provided as a command-line argument
FASTA = urllib.urlopen(link).readlines()[1:]  # as a list without the header (">...")
FASTA = "".join(FASTA).replace("\n","")  # as a string free of newline markers
print FASTA
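Since the question uses Python 3, the same idea could look like this (a sketch using urllib.request and decoding the bytes before joining):
import sys
from urllib.request import urlopen

link = sys.argv[1]  # FASTA file URL provided as a command-line argument
lines = urlopen(link).read().decode().splitlines()[1:]  # drop the ">..." header line
FASTA = "".join(lines)  # a single string free of newline markers
print(FASTA)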

A little late to the party, but Jose's Biopython answer no longer works in Python 3. Here's an alternative:
from Bio import SeqIO
import requests
from io import StringIO
link = "http://www.uniprot.org/uniprot/B5ZC00.fasta"
data = requests.get(link).text
fasta_iterator = SeqIO.parse(StringIO(data), "fasta")
# Pretty print the fasta info
for seq in fasta_iterator:
    print(seq.format("fasta"))

Related

How to query a region in a fasta file using biopython?

I have a fasta file with some reference genome.
I would like to obtain the reference nucleotides as a string given the chromosome, start and end indexes.
I am looking for a function which would look like this in code:
from Bio import SeqIO
p = '/path/to/reference.fa'
seqs = SeqIO.parse(p, 'fasta')
string = seqs.query(id='chr7', start=10042, end=10252)
and string should be something like: 'GGCTACGAACT...'
All I have found is how to iterate over seqs, and how to pull data from NCBI, which is not what I'm looking for.
What is the right way to do this in biopython?
AFAIK, biopython does not currently have this functionality. For random lookups using an index (please see samtools faidx), you'll probably want either pysam or pyfaidx. Here's an example using the pysam.FastaFile class which allows you to quickly 'fetch' sequences in a region:
import pysam
ref = pysam.FastaFile('/path/to/reference.fa')
seq = ref.fetch('chr7', 10042, 10252)
print(seq)
Or using pyfaidx and the 'get_seq' method:
from pyfaidx import Fasta
ref = Fasta('/path/to/reference.fa')
seq = ref.get_seq('chr7', 10042, 10252)
print(seq)
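If loading a whole record into memory is acceptable, plain Biopython can also slice out a region via SeqIO.index, though this is not a true faidx-style indexed lookup (a sketch under that assumption):
from Bio import SeqIO

ref = SeqIO.index('/path/to/reference.fa', 'fasta')  # dict-like, lazy lookup by record id
region = str(ref['chr7'].seq[10042:10252])           # parses the whole 'chr7' record, then slices
print(region)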

User Warning: Your stop_words may be inconsistent with your preprocessing

I am following this document clustering tutorial. As input I give a txt file which can be downloaded here. It's a combination of 3 other txt files, joined with \n. After creating a tf-idf matrix I received this warning:
UserWarning: Your stop_words may be inconsistent with your preprocessing.
Tokenizing the stop words generated tokens ['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'cri', 'describ', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'forti', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'mani', 'meanwhil', 'moreov', 'nobodi', 'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'togeth', 'twelv', 'twenti', 'veri', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv'] not in stop_words.
'stop_words.' % sorted(inconsistent))
I guess it has something to do with the order of lemmatization and stop-word removal, but as this is my first project in text processing, I am a bit lost and don't know how to fix this...
import pandas as pd
import nltk
from nltk.corpus import stopwords
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stopwords = stopwords.words('english')
stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

totalvocab_stemmed = []
totalvocab_tokenized = []
with open('shortResultList.txt', encoding="utf8") as synopses:
    for i in synopses:
        allwords_stemmed = tokenize_and_stem(i)  # for each item in 'synopses', tokenize/stem
        totalvocab_stemmed.extend(allwords_stemmed)  # extend the 'totalvocab_stemmed' list
        allwords_tokenized = tokenize_only(i)
        totalvocab_tokenized.extend(allwords_tokenized)

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index=totalvocab_stemmed)
print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')
print(vocab_frame.head())

# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3))

with open('shortResultList.txt', encoding="utf8") as synopses:
    tfidf_matrix = tfidf_vectorizer.fit_transform(synopses)  # fit the vectorizer to synopses

print(tfidf_matrix.shape)
The warning is trying to tell you that if your text contains "always" it will be normalised to "alway" before matching against your stop list which includes "always" but not "alway". So it won't be removed from your bag of words.
The solution is to preprocess your stop list so that it is normalised the same way your tokens will be, and to pass the list of normalised words as stop_words to the vectoriser.
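A minimal sketch of that idea, reusing the SnowballStemmer and the tokenize_and_stem function from the question, would be to stem the stop words themselves before handing them to the vectoriser:
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = SnowballStemmer("english")

# Stem the stop words the same way the tokenizer stems the documents,
# so that e.g. "always" becomes "alway" and matches the stemmed tokens.
stemmed_stop_words = sorted({stemmer.stem(w) for w in stopwords.words('english')})

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000, min_df=0.2,
                                   stop_words=stemmed_stop_words,
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))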
I had the same problem and for me the following worked:
include stop-word removal in the tokenize function, and then
remove the stop_words parameter from TfidfVectorizer.
Like so:
1.
stopwords = stopwords.words('english')
stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    # exclude stopwords from stemmed words
    stems = [stemmer.stem(t) for t in filtered_tokens if t not in stopwords]
    return stems
2. Delete the stop_words parameter from the vectorizer:
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.8, max_features=200000, min_df=0.2,
    use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1,3)
)
I faced this problem because of the PT-BR language.
TL;DR: Remove the accents from your language's text.
# Special thanks for the user Humberto Diogenes from Python List (answer from Aug 11, 2008)
# Link: http://python.6.x6.nabble.com/O-jeito-mais-rapido-de-remover-acentos-de-uma-string-td2041508.html
# I found the issue by chance (I swear, haha) but this guy gave the tip before me
# Link: https://github.com/scikit-learn/scikit-learn/issues/12897#issuecomment-518644215
from unicodedata import normalize

import spacy

nlp = spacy.load('pt_core_news_sm')

# Define default stopwords list
stoplist = spacy.lang.pt.stop_words.STOP_WORDS

def replace_ptbr_char_by_word(word):
    """Remove the accents, token by token."""
    word = str(word)
    word = normalize('NFKD', word).encode('ASCII','ignore').decode('ASCII')
    return word

def remove_pt_br_char_by_text(text):
    """Remove the accents over the entire text."""
    text = str(text)
    text = " ".join(replace_ptbr_char_by_word(word) for word in text.split() if word not in stoplist)
    return text

df['text'] = df['text'].apply(remove_pt_br_char_by_text)
I put the solution and references in this gist.
Manually adding those words to the stop_words list can solve the problem.
from stop_words import safe_get_stop_words

stop_words = safe_get_stop_words('en')
stop_words.extend(['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'cri', 'describ', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'forti', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'mani', 'meanwhil', 'moreov', 'nobodi', 'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'togeth', 'twelv', 'twenti', 'veri', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv'])

Biopython: Can't use .count() with Biopython

My goal here is to get the number of times 'g' appears in a DNA sequence.
I imported a DNA sequence via Biopython using a list comprehension:
seq = [record for record in SeqIO.parse('sequences/hiv.gbk.rtf', 'fasta')]
I then tried using the .count() method on the newly created list variable:
print(seq.count('g'))
I get an error that reads
NotImplementedError: SeqRecord comparison is deliberately not
implemented. Explicitly compare the attributes of interest.
Anyone know what the dealio is? Biopython's manual says all standard python methods should work.
You are trying to apply count to a list. You would need to apply it to the sequence of each element, e.g.
print(seq[0].seq.count('g'))
or, if you want the total over all sequences:
print(sum([s.seq.count('g') for s in seq]))
Here is a minimal working example
from Bio import SeqIO
txt = """>gnl|TC-DB|O60669|2.A.1.13.5 Monocarboxylate transporter 2 - Homo sapiens (Human).
MPPMPSAPPVHPPPDGGWGWIVVGAAFISIGFSYAFPKAVTVFFKEIQQIFHTTYSEIAW
>gnl|TC-DB|O60706|3.A.1.208.23 ATP-binding cassette sub-family C member 9 OS=Homo sapiens GN=ABCC9 PE=1 SV=2
MSLSFCGNNISSYNINDGVLQNSCFVDALNLVPHVFLLFITFPILFIGWGSQSSKVQIHH
>gnl|TC-DB|O60721|3.A.1.208.23 Sodium/potassium/calcium exchanger 1 OS=Homo sapiens GN=SLC24A1 PE=1 SV=1
MGKLIRMGPQERWLLRTKRLHWSRLLFLLGMLIIGSTYQHLRRPRGLSSLWAAVSSHQPI
>gnl|TC-DB|O60779|2.A.1.13.5 Thiamine transporter 1 (THTR-1) (ThTr1) (Thiamine carrier 1) (TC1) - Homo sapiens (Human).
MDVPGPVSRRAAAAAATVLLRTARVRRECWFLPTALLCAYGFFASLRPSEPFLTPYLLGP"""
filename = 'sequences.fa'
with open(filename, 'w') as f:
    f.write(txt)
seqs = [record for record in SeqIO.parse(filename, 'fasta')]
print(sum([s.seq.count('P') for s in seqs]))
>>> 21
print(seqs[0].seq.count('P'))
>>> 9
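As a small aside (not part of the original answer), if you want counts for every letter at once, collections.Counter on the plain string of a sequence works too:
from collections import Counter

print(Counter(str(seqs[0].seq)))  # counts for every residue in the first record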

Can h5py load a file from a byte array in memory?

My Python code receives a byte array which represents the bytes of an HDF5 file.
I'd like to read this byte array into an in-memory h5py file object without first writing the byte array to disk. This page says that I can open a memory-mapped file, but it would be a new, empty file. I want to go from a byte array to an in-memory HDF5 file, use it, discard it, and not write to disk at any point.
Is it possible to do this with h5py? (or with hdf5 using C if that is the only way)
You could try to use Binary I/O to create a File object and read it via h5py:
f = io.BytesIO(YOUR_H5PY_STREAM)
h = h5py.File(f,'r')
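For example, a full in-memory round trip might look like this (a sketch: the bytes are built here just to have something to read, the dataset name 'dataset' is only illustrative, and passing file-like objects needs a reasonably recent h5py):
import io
import h5py
import numpy as np

# Build some HDF5 bytes in memory (this stands in for the byte array you receive).
buf = io.BytesIO()
with h5py.File(buf, 'w') as f:
    f['dataset'] = np.arange(10)
raw_bytes = buf.getvalue()

# Read the byte array back without ever touching disk.
with h5py.File(io.BytesIO(raw_bytes), 'r') as h:
    print(h['dataset'][:])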
You can use io.BytesIO or tempfile to create h5 objects, as shown in the official docs: http://docs.h5py.org/en/stable/high/file.html#python-file-like-objects
The first argument to File may be a Python file-like object, such as an io.BytesIO or tempfile.TemporaryFile instance. This is a convenient way to create temporary HDF5 files, e.g. for testing or to send over the network.
tempfile.TemporaryFile
>>> tf = tempfile.TemporaryFile()
>>> f = h5py.File(tf)
or io.BytesIO
"""Create an HDF5 file in memory and retrieve the raw bytes
This could be used, for instance, in a server producing small HDF5
files on demand.
"""
import io
import h5py
bio = io.BytesIO()
with h5py.File(bio) as f:
    f['dataset'] = range(10)
data = bio.getvalue() # data is a regular Python bytes object.
print("Total size:", len(data))
print("First bytes:", data[:10])
The following example uses PyTables (tables), which can read and manipulate the H5 format, in lieu of h5py.
import urllib.request
import tables
url = 'https://s3.amazonaws.com/<your bucket>/data.hdf5'
response = urllib.request.urlopen(url)
h5file = tables.open_file("data-sample.h5", driver="H5FD_CORE",
                          driver_core_image=response.read(),
                          driver_core_backing_store=0)

Save sequences from NCBI in FASTA format using a list of IDs in Excel

I am fairly new to Python and I love it. However, I am stuck on this problem and hope you can give me a hint about what I am missing.
I have a list of gene IDs in an Excel file and I am trying to use xlrd and Biopython to retrieve sequences and save my results (in FASTA format) to a text document. So far, my code lets me see the results in the shell, but it only saves the last sequence to the document.
This is my code:
import xlrd
import re

book = xlrd.open_workbook('ids.xls')
sh = book.sheet_by_index(0)
for rx in range(sh.nrows):
    if sh.row(rx)[0].value:
        from Bio import Entrez
        from Bio import SeqIO
        Entrez.email = "mail#xxx.com"
        in_handle = Entrez.efetch(db="nucleotide", rettype="fasta", id=sh.row(rx)[0].value)
        record = SeqIO.parse(in_handle, "fasta")
        for record in SeqIO.parse(in_handle, "fasta"):
            print record.format("fasta")
        out_handle = open("example.txt", "w")
        SeqIO.write(record, out_handle, "fasta")
        in_handle.close()
        out_handle.close()
As I mentioned, the file "example.txt" only contains the last sequence (in FASTA format) that the shell shows.
Could anyone please help me get all the sequences I retrieve from NCBI into the same document?
Thank you very much
Antonio
I am also fairly new to Python and also love it! This is my first attempt at answering a question, but maybe it is because of your loop structure and the 'w' mode? Perhaps try changing open("example.txt", "w") to append mode, open("example.txt", "a"), as below:
import xlrd
import re

book = xlrd.open_workbook('ids.xls')
sh = book.sheet_by_index(0)
for rx in range(sh.nrows):
    if sh.row(rx)[0].value:
        from Bio import Entrez
        from Bio import SeqIO
        Entrez.email = "mail#xxx.com"
        in_handle = Entrez.efetch(db="nucleotide", rettype="fasta", id=sh.row(rx)[0].value)
        record = SeqIO.parse(in_handle, "fasta")
        for record in SeqIO.parse(in_handle, "fasta"):
            print record.format("fasta")
        out_handle = open("example.txt", "a")
        SeqIO.write(record, out_handle, "fasta")
        in_handle.close()
        out_handle.close()
Nearly there my friends!
The main problem is that your for loop keeps reopening the file in 'w' mode and closing it on each iteration, so every pass overwrites the previous one. I also fixed some minor issues that should speed up the code (e.g. you kept importing Bio on each loop).
Use this new code:
import xlrd
import re
from Bio import Entrez
from Bio import SeqIO

out_handle = open("example.txt", "w")
book = xlrd.open_workbook('ids.xls')
sh = book.sheet_by_index(0)
for rx in range(sh.nrows):
    if sh.row(rx)[0].value:
        Entrez.email = "mail#xxx.com"
        in_handle = Entrez.efetch(db="nucleotide", rettype="fasta",
                                  id=sh.row(rx)[0].value)  # fetch by the ID in the cell, not the row index
        record = SeqIO.parse(in_handle, "fasta")
        SeqIO.write(record, out_handle, "fasta")
        in_handle.close()
out_handle.close()
If it still errors, it must be a problem in your Excel file. Send it to me if the error persists and I will help :)
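As an aside, Entrez.efetch also accepts several IDs in one request, so you could collect the IDs first and fetch them in a single call (a sketch, assuming the IDs sit in the first column of the sheet and keeping the question's file names):
import xlrd
from Bio import Entrez, SeqIO

Entrez.email = "mail#xxx.com"
book = xlrd.open_workbook('ids.xls')
sh = book.sheet_by_index(0)
ids = [str(sh.row(rx)[0].value) for rx in range(sh.nrows) if sh.row(rx)[0].value]

# One request for all IDs instead of one request per row.
in_handle = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text", id=",".join(ids))
records = list(SeqIO.parse(in_handle, "fasta"))
in_handle.close()
SeqIO.write(records, "example.txt", "fasta")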
