User Warning: Your stop_words may be inconsistent with your preprocessing - vectorization

I am following this document clustering tutorial. As input I give a txt file which can be downloaded here. It's a combination of 3 other txt files, separated by \n. After creating a tf-idf matrix I received this warning:
UserWarning: Your stop_words may be inconsistent with your preprocessing.
Tokenizing the stop words generated tokens ['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'cri', 'describ', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'forti', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'mani', 'meanwhil', 'moreov', 'nobodi', 'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'togeth', 'twelv', 'twenti', 'veri', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv'] not in stop_words.
I guess it has something to do with the order of lemmatization and stop word removal, but as this is my first text-processing project I am a bit lost and I don't know how to fix this...
import pandas as pd
import nltk
from nltk.corpus import stopwords
import re
import os
import codecs
from sklearn import feature_extraction
import mpld3
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stopwords = stopwords.words('english')
stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

totalvocab_stemmed = []
totalvocab_tokenized = []
with open('shortResultList.txt', encoding="utf8") as synopses:
    for i in synopses:
        allwords_stemmed = tokenize_and_stem(i)  # for each item in 'synopses', tokenize/stem
        totalvocab_stemmed.extend(allwords_stemmed)  # extend the 'totalvocab_stemmed' list
        allwords_tokenized = tokenize_only(i)
        totalvocab_tokenized.extend(allwords_tokenized)

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index=totalvocab_stemmed)
print('there are ' + str(vocab_frame.shape[0]) + ' items in vocab_frame')
print(vocab_frame.head())

# define vectorizer parameters
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))

with open('shortResultList.txt', encoding="utf8") as synopses:
    tfidf_matrix = tfidf_vectorizer.fit_transform(synopses)  # fit the vectorizer to synopses

print(tfidf_matrix.shape)

The warning is trying to tell you that if your text contains "always" it will be normalised to "alway" before matching against your stop list which includes "always" but not "alway". So it won't be removed from your bag of words.
The solution is to make sure that you preprocess your stop list to make sure that it is normalised like your tokens will be, and pass the list of normalised words as stop_words to the vectoriser.
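A minimal sketch of that, reusing stemmer and tokenize_and_stem from the question and sklearn's built-in English list (ENGLISH_STOP_WORDS), might look like this:

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Stem the built-in stop words with the same stemmer used in tokenize_and_stem,
# so the stop list is normalised the same way as the document tokens.
stemmed_stop_words = sorted({stemmer.stem(w) for w in ENGLISH_STOP_WORDS})

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000, min_df=0.2,
                                   stop_words=stemmed_stop_words,
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))

This way the tokens produced by the vectorizer can actually match the entries in the stop list.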

I had the same problem, and for me the following worked: include the stop words in the tokenize function and then remove the stop_words parameter from TfidfVectorizer. Like so:
1. Exclude the stop words inside tokenize_and_stem:
stopwords = stopwords.words('english')
stemmer = SnowballStemmer("english")

def tokenize_and_stem(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    # exclude stopwords from stemmed words
    stems = [stemmer.stem(t) for t in filtered_tokens if t not in stopwords]
    return stems
2. Delete the stop_words parameter from the vectorizer:
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.8, max_features=200000, min_df=0.2,
    use_idf=True, tokenizer=tokenize_and_stem, ngram_range=(1, 3)
)

I faced this problem because of the PT-BR (Brazilian Portuguese) language.
TL;DR: Remove the accents from your language's text.
# Special thanks to the user Humberto Diogenes from the Python List (answer from Aug 11, 2008)
# Link: http://python.6.x6.nabble.com/O-jeito-mais-rapido-de-remover-acentos-de-uma-string-td2041508.html
# I found the issue by chance (I swear, haha) but this guy gave the tip before me
# Link: https://github.com/scikit-learn/scikit-learn/issues/12897#issuecomment-518644215
from unicodedata import normalize

import spacy
nlp = spacy.load('pt_core_news_sm')

# Define the default stop word list
stoplist = spacy.lang.pt.stop_words.STOP_WORDS

def replace_ptbr_char_by_word(word):
    """Strip the accents, token by token."""
    word = str(word)
    word = normalize('NFKD', word).encode('ASCII', 'ignore').decode('ASCII')
    return word

def remove_pt_br_char_by_text(text):
    """Strip the accents (and stop words) over the entire text."""
    text = str(text)
    text = " ".join(replace_ptbr_char_by_word(word) for word in text.split() if word not in stoplist)
    return text

df['text'] = df['text'].apply(remove_pt_br_char_by_text)
I put the solution and references in this gist.

Manually adding those words to the stop_words list can solve the problem.
from stop_words import safe_get_stop_words  # from the third-party 'stop_words' package

stop_words = safe_get_stop_words('en')
stop_words.extend(['abov', 'afterward', 'alon', 'alreadi', 'alway', 'ani', 'anoth', 'anyon', 'anyth', 'anywher', 'becam', 'becaus', 'becom', 'befor', 'besid', 'cri', 'describ', 'dure', 'els', 'elsewher', 'empti', 'everi', 'everyon', 'everyth', 'everywher', 'fifti', 'forti', 'henc', 'hereaft', 'herebi', 'howev', 'hundr', 'inde', 'mani', 'meanwhil', 'moreov', 'nobodi', 'noon', 'noth', 'nowher', 'onc', 'onli', 'otherwis', 'ourselv', 'perhap', 'pleas', 'sever', 'sinc', 'sincer', 'sixti', 'someon', 'someth', 'sometim', 'somewher', 'themselv', 'thenc', 'thereaft', 'therebi', 'therefor', 'togeth', 'twelv', 'twenti', 'veri', 'whatev', 'whenc', 'whenev', 'wherea', 'whereaft', 'wherebi', 'wherev', 'whi', 'yourselv'])
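Then, keeping the other parameters from the question, pass the extended list to the vectorizer instead of the 'english' shortcut:

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000, min_df=0.2,
                                   stop_words=stop_words,
                                   use_idf=True, tokenizer=tokenize_and_stem,
                                   ngram_range=(1, 3))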

Related

How to print the best matching hit in the BLAST search? / BioPython

I'm trying to make a BLAST search with a nucleotide sequence and print the best matching hit, but I'm not sure which option/command I should use. There are options like max_hsps and best_hit_overhang; I don't know the difference between them, and I just want to print 1 hit (the best matching one). Should I use max_hsps 1?
I wrote this code but it's still not working. If you could tell me where I went wrong and what I should do, I would really appreciate it :) Thank you!
from Bio.Blast import NCBIWWW, NCBIXML
from Bio.Seq import Seq

seq = Seq("GTTGA......CT")

def best_matching_hit(seq):
    try:
        result_handle = NCBIWWW.qblast("blastn", "nt", seq)
    except:
        print('BLAST run failed!')
        return None
    blast_record = NCBIXML.read(result_handle)
    for hit in blast_record.alignments:
        for hsp in hit.hsps:
            if hsp.expect == max_hsps 1:
                print(hit.title)
                print(hsp.sbjct)

best_matching_hit(seq)
This returns just one hit (the first one, I suppose), as per
Limiting the number of hits in a Biopython NCBIWWW Search on Biostars:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 7 15:28:11 2021
#author: Pietro
https://stackoverflow.com/questions/67872118/how-to-print-the-best-matching-hit-in-the-blast-search-biopython
"""
from Bio.Blast import NCBIWWW
from Bio.Seq import Seq
seq = Seq("ATGGCGTGGAATGAGCCTGGAAATAACAACGGCAACAATGGCCGCGATAATGACCCTTGGGGTAATAA\
TAATCGTGGTGGCCAGCGTCCTGGTGGCCGAGATCAAGGTCCGCCAGATTTAGATGAAGTGTTCAACAA\
ACTGAGTCAAAAGCTGGGTGGCAAGTTTGGTAAAAAAGGCGGCGGTGGTTCCTCTATCGGCGGTGGCGG\
TGGTGCAATTGGCTTTGGTGTCATTGCGATCATTGCAATTGCGGTGTGGATTTTCGCTGGTTTTTACAC\
CATCGGTGAAGCAGAGCGTGGTGTTGTACTGCGTTTAGGTAAATACGATCGTATCGTAGACCCAGGCCT\
TAACTGGCGTCCTCGTTTTATTGATGAATACGAAGCGGTTAACGTACAAGCGATTCGCTCACTACGTGC\
ATCTGGTCTAATGCTGACGAAAGATGAAAACGTAGTAACGGTTGCAATGGACGTTCAATACCGAGTTGC\
TGACCCATACAAATACCTATACCGCGTGACCAATGCAGATGATAGCTTGCGTCAAGCAACAGACTCTGC\
GCTACGTGCGGTAATTGGTGATTCACTAATGGATAGCATTCTAACCAGTGGTCGTCAGCAAATTCGTCA\
AAGCACTCAAGAAACACTAAACCAAATCATCGATAGCTATGATATGGGTCTGGTGATTGTTGACGTGAA\
CTTCCAGTCTGCACGTCCGCCAGAGCAAGTAAAAGATGCGTTTGATGACGCGATTGCTGCGCGTGAGGA\
TGAAGAGCGTTTCATCCGTGAAGCAGAAGCTTACAAGAACGAAATCTTGCCGAAGGCAACGGGTCGTGC\
TGAACGTTTGAAGAAGGAAGCTCAAGGTTACAACGAGCGTGTAACTAACGAAGCATTAGGTCAAGTAGC\
ACAGTTTGAAAAACTACTACCTGAATACCAAGCGGCTCCTGGCGTAACACGTGACCGTCTGTACATTGA\
CGCGATGGAAGAGGTTTACACCAACACATCTAAAGTGTTGATTGACTCTGAATCAAGCGGCAACCTTTT\
GTACCTACCAATCGATAAATTGGCAGGTCAAGAAGGCCAAACAGACACTAAACGTAAATCGAAATCTTC\
TTCAACCTACGATCACATTCAACTAGAGTCTGAGCGTACACAAGAAGAAACATCGAACACGCAGTCTCG\
TTCAACAGGTACACGTCAAGGGAGATACTAA")
def best_matching_hit(seq):
    try:
        result_handle = NCBIWWW.qblast("blastn", "nt", seq, hitlist_size=1)
    except:
        print('BLAST run failed!')
        return None
    blast_record = result_handle.read()
    print(blast_record)

best_matching_hit(seq)
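If the goal is to print only the title of that single hit rather than the raw XML, a small variant of the same approach (a sketch, not run against the live NCBI service) is to parse the handle with NCBIXML instead of calling read() on it:

from Bio.Blast import NCBIWWW, NCBIXML

def best_matching_hit_title(seq):
    try:
        result_handle = NCBIWWW.qblast("blastn", "nt", seq, hitlist_size=1)
    except Exception:
        print('BLAST run failed!')
        return None
    # Parse the XML; with hitlist_size=1 the record holds at most one alignment
    blast_record = NCBIXML.read(result_handle)
    for alignment in blast_record.alignments:
        print(alignment.title)
        return alignment.title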

How to query a region in a fasta file using biopython?

I have a fasta file with some reference genome.
I would like to obtain the reference nucleotides as a string given the chromosome, start and end indexes.
I am looking for a function which would look like this in code:
from Bio import SeqIO
p = '/path/to/refernce.fa'
seqs = SeqIO.parse(p.open(), 'fasta')
string = seqs.query(id='chr7', start=10042, end=10252)
and string should be like : 'GGCTACGAACT...'
All I have found is how to iterate over seqs, and how to pull data from NCBI, which is not what I'm looking for.
What is the right way to do this in biopython?
AFAIK, biopython does not currently have this functionality. For random lookups using an index (please see samtools faidx), you'll probably want either pysam or pyfaidx. Here's an example using the pysam.FastaFile class which allows you to quickly 'fetch' sequences in a region:
import pysam
ref = pysam.FastaFile('/path/to/reference.fa')
seq = ref.fetch('chr7', 10042, 10252)
print(seq)
Or using pyfaidx and the 'get_seq' method:
from pyfaidx import Fasta
ref = Fasta('/path/to/reference.fa')
seq = ref.get_seq('chr7', 10042, 10252)
print(seq)

Find sequence IDs of DNA subsequences in DNA-sequences from FASTA-file

I want to make a function that reads a FASTA file with DNA sequences (possibly ambiguous), takes a subsequence as input, and returns all sequence IDs of the sequences that contain the given subsequence.
To make the script more efficient, I tried to use nt_search to give all possibilities of the ambiguous sequence from the FASTA. This seemed more efficient than producing all unambiguous possibilities, especially for larger sequences and FASTA files.
Right now, I'm struggling to see how I can check whether the subsequence is part of the output given by nt_search.
I want to see if e.g. 'CGC' (input subsequence) is part of the possibilities given by nt_search: ['TA[GATC][AT][GT]GCGGT'] and return all sequence IDs of sequences for which this is true.
What I have so far:
def bonus_subsequence(file, unambiguous_sequence):
    seq_records = SeqIO.parse(file, 'fasta', alphabet=ambiguous_dna)
    resultListOfSeqIds = []
    print(f'Unambiguous sequence {unambiguous_sequence} could be a subsequence of:')
    for record in seq_records:
        d = Seq.IUPAC.IUPACData.ambiguous_dna_values
        couldBeSubSequence = False
        if unambiguous_sequence in nt_search(unambiguous_sequence, record):
            couldBeSubSequence = True
        if couldBeSubSequence == True:
            print(f'{record.id}')
            resultListOfSeqIds.append({record.id})
In a second phase, I want to be able to also use this for ambiguous subsequences, but I'd be more than happy with help on this first question, thanks in advance!
I don't know if I understood you well, but you can try this:
Example fasta file:
>seq1
ATGTACGTACGTACNNNNACTG
>seq2
NNNATCGTAGTCANNA
>seq3
NNNNATGNNN
Code:
from Bio import SeqIO
from Bio import SeqUtils
from Bio.Alphabet.IUPAC import ambiguous_dna
if __name__ == '__main__':
    sub_seq = input('Enter a subsequence: ')
    results = []
    with open('test.fasta', 'r') as fh:
        for seq in SeqIO.parse(fh, 'fasta', alphabet=ambiguous_dna):
            if sub_seq in seq:
                results.append(seq.name)
    print(results, sep='\n')
Results (console):
Enter a subsequence: ATG
Results:
seq1
seq3
Enter a subsequence: NNNA
Results:
seq1
seq2
seq3
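If the subsequence itself may contain IUPAC ambiguity codes, one option (a sketch, assuming Bio.SeqUtils.nt_search, which expands the codes in the pattern into a regular expression) is:

from Bio import SeqIO
from Bio.SeqUtils import nt_search

def ids_containing(fasta_path, sub_seq):
    """Return IDs of records whose sequence matches sub_seq,
    expanding IUPAC ambiguity codes in sub_seq via nt_search."""
    hits = []
    for record in SeqIO.parse(fasta_path, 'fasta'):
        # nt_search returns [expanded_pattern, pos1, pos2, ...]; any position means a match
        result = nt_search(str(record.seq), str(sub_seq))
        if len(result) > 1:
            hits.append(record.id)
    return hits

print(ids_containing('test.fasta', 'CGC'))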

What is wrong with this CMAC computation?

I have an example of a CMAC computation, which I want to reproduce in Python, however I am failing. The example looks like this:
key = 3ED0920E5E6A0320D823D5987FEAFBB1
msg = CEE9A53E3E463EF1F459635736738962&cmac=
The expected (truncated) CMAC looks like this (note: truncated means that every second byte is dropped)
ECC1E7F6C6C73BF6
So I tried to reproduce this example with the following code:
from Crypto.Hash import CMAC
from Crypto.Cipher import AES
from binascii import hexlify, unhexlify

def generate_cmac(key, msg):
    """Generate a truncated CMAC message.
    Inputs:
        key: 1-dimensional bytearray of arbitrary length
        msg: 1-dimensional bytearray of arbitrary length
    Outputs:
        CMAC: the CMAC value
        CMAC_t: truncated CMAC"""
    # Generate CMAC via the CMAC algorithm
    cobj = CMAC.new(key=key, ciphermod=AES)
    cobj.update(msg)
    mac_raw = cobj.digest()
    # Truncate by initializing an empty array and assigning every second byte
    mac_truncated = bytearray(8 * b'\x00')
    it2 = 0
    for it in range(len(mac_raw)):
        if it % 2:
            mac_truncated[it2:it2+1] = mac_raw[it:it+1]
            it2 += 1
    return mac_raw, mac_truncated

key = unhexlify('3ED0920E5E6A0320D823D5987FEAFBB1')  # The key as in the example
msg = 'CEE9A53E3E463EF1F459635736738962&cmac='  # The msg as in the example
msg_utf = msg.encode('utf-8')
msg_input = hexlify(msg_utf)  # Trying to get the bytearray
mac, mact_calc = generate_cmac(key, msg_input)  # Calculate the CMAC and truncated CMAC
# However the calculated CMAC does not match the CMAC of the example
My function generate_cmac() works perfectly for other cases, why not for this example?
(If anybody is curious, the example stems from this document Page 18/Table 6)
Edit: An example for a successful cmac computation is the following:
mact_expected = unhexlify('94EED9EE65337086') # as stated in the application note
key = unhexlify('3FB5F6E3A807A03D5E3570ACE393776F') # called K_SesSDMFileReadMAC
msg = [] # zero length input
mac, mact_calc = generate_cmac(key, msg) # mact_expected and mact_calc are the same
assert mact_expected == mact_calc, "Example 1 failed" # This assertion passes
TLDR: overhexlification
Much to my stupefaction, the linked example indeed seems to mean CEE9A53E3E463EF1F459635736738962&cmac= when it writes that, since the box below contains 76 hex characters for the 38 bytes coding that in ASCII, that is 434545394135334533453436334546314634353936333537333637333839363226636d61633d.
However, I'm positive that this does not need to be further hexlified to the tune of 76 bytes as the code does. In other words, my bets are on
key = unhexlify('3ED0920E5E6A0320D823D5987FEAFBB1')
msg = 'CEE9A53E3E463EF1F459635736738962&cmac='.encode()
mac, mact_calc = generate_cmac(key, msg)
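If that diagnosis is right, the truncated result should line up with the ECC1E7F6C6C73BF6 value quoted in the question; a quick (untested) check against generate_cmac from above would be:

mact_expected = unhexlify('ECC1E7F6C6C73BF6')  # truncated CMAC stated in the example
key = unhexlify('3ED0920E5E6A0320D823D5987FEAFBB1')
msg = 'CEE9A53E3E463EF1F459635736738962&cmac='.encode()  # plain ASCII bytes, no extra hexlify
mac, mact_calc = generate_cmac(key, msg)
assert bytes(mact_calc) == mact_expected, "truncated CMAC still does not match the example"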

Reading a Fasta file from Url address

I'm using Python 3.4.
I wrote some code to read a FASTA file from an internet site, but it didn't work.
http://www.uniprot.org/uniprot/B5ZC00.fasta
(I can download and read it as a text file, but I'm planning to read multiple FASTA files from the given site.)
(1) The first attempt
# read FASTA file
def read_fasta(filename_as_string):
    """
    open text file with FASTA format
    read it and convert it into string list
    convert the list to dictionary
    >>> read_fasta('sample.txt')
    {'Rosalind_0000':'GTAT....ATGA', ... }
    """
    f = open(filename_as_string, 'r')
    content = [line.strip() for line in f]
    f.close()
    new_content = []
    for line in content:
        if '>Rosalind' in line:
            new_content.append(line.strip('>'))
            new_content.append('')
        else:
            new_content[-1] += line
    dict = {}
    for i in range(len(new_content)-1):
        if i % 2 == 0:
            dict[new_content[i]] = new_content[i+1]
    return dict
This code can read any FASTA file on my desktop computer, but it failed to read the same FASTA file from the internet site.
>>> from urllib.request import urlopen
>>> html = urlopen("http://www.uniprot.org/uniprot/B5ZC00.fasta")
>>> print (read_fasta(html))
TypeError: invalid file: <http.client.HTTPResponse object at 0x02A62EF0>
(2) The second attempt
>>> from urllib.request import urlopen
>>> html = urlopen("http://www.uniprot.org/uniprot/B5ZC00.fasta")
>>> lines = [x.strip() for x in html.readlines()]
>>> print (lines)
[b'>sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase OS=Ureaplasma urealyticum serovar 10 (strain ATCC 33699 / Western) GN=glyQS PE=3 SV=1', b'MKNKFKTQEELVNHLKTVGFVFANSEIYNGLANAWDYGPLGVLLKNNLKNLWWKEFVTKQ', b'KDVVGLDSAIILNPLVWKASGHLDNFSDPLIDCKNCKARYRADKLIESFDENIHIAENSS', b'NEEFAKVLNDYEISCPTCKQFNWTEIRHFNLMFKTYQGVIEDAKNVVYLRPETAQGIFVN', b'FKNVQRSMRLHLPFGIAQIGKSFRNEITPGNFIFRTREFEQMEIEFFLKEESAYDIFDKY', b'LNQIENWLVSACGLSLNNLRKHEHPKEELSHYSKKTIDFEYNFLHGFSELYGIAYRTNYD', b'LSVHMNLSKKDLTYFDEQTKEKYVPHVIEPSVGVERLLYAILTEATFIEKLENDDERILM', b'DLKYDLAPYKIAVMPLVNKLKDKAEEIYGKILDLNISATFDNSGSIGKRYRRQDAIGTIY', b'CLTIDFDSLDDQQDPSFTIRERNSMAQKRIKLSELPLYLNQKAHEDFQRQCQK']
I thought that I could modify my code to read online Fasta file as a string list, but soon I realized that it was not easy.
>>> print (type(lines[0]))
<class 'bytes'>
I can't remove the dirty 'b' character in the head of each element of list.
>>> print (lines[0])
b'>sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase ...
>>> print (lines[0][1:])
b'sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase ...
(3) Questions
How can I remove the dirty 'b' character?
Is there any better way to read Fasta file from given Url?
With some help, I think I can modify and complete my code.
Thanks.
I'm late, but I'll answer in case it's useful.
in python 2
import urllib2
url = "http://www.uniprot.org/uniprot/B5ZC00.fasta"
response = urllib2.urlopen(url)
fasta = response.read()
print fasta
in python 3
from urllib.request import urlopen
url = "http://www.uniprot.org/uniprot/B5ZC00.fasta"
response = urlopen(url)
fasta = response.read().decode("utf-8", "ignore")
print(fasta)
you get:
>sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase OS=Ureaplasma urealyticum serovar 10 (strain ATCC 33699 / Western) GN=glyQS PE=3 SV=1
MKNKFKTQEELVNHLKTVGFVFANSEIYNGLANAWDYGPLGVLLKNNLKNLWWKEFVTKQ
KDVVGLDSAIILNPLVWKASGHLDNFSDPLIDCKNCKARYRADKLIESFDENIHIAENSS
NEEFAKVLNDYEISCPTCKQFNWTEIRHFNLMFKTYQGVIEDAKNVVYLRPETAQGIFVN
FKNVQRSMRLHLPFGIAQIGKSFRNEITPGNFIFRTREFEQMEIEFFLKEESAYDIFDKY
LNQIENWLVSACGLSLNNLRKHEHPKEELSHYSKKTIDFEYNFLHGFSELYGIAYRTNYD
LSVHMNLSKKDLTYFDEQTKEKYVPHVIEPSVGVERLLYAILTEATFIEKLENDDERILM
DLKYDLAPYKIAVMPLVNKLKDKAEEIYGKILDLNISATFDNSGSIGKRYRRQDAIGTIY
CLTIDFDSLDDQQDPSFTIRERNSMAQKRIKLSELPLYLNQKAHEDFQRQCQK
BONUS
It is better using biopython (example for python 2)
from Bio import SeqIO
import urllib2
url = "http://www.uniprot.org/uniprot/B5ZC00.fasta"
response = urllib2.urlopen(url)
fasta_iterator = SeqIO.parse(response, "fasta")
for seq in fasta_iterator:
    print seq.format("fasta")
If you're only interested in the primary amino acid sequence (wanting to ignore the header), try the following:
import sys
import urllib

link = str(sys.argv[1])  # fasta file URL provided as command line argument
FASTA = urllib.urlopen(link).readlines()[1:]  # as list without header (">...")
FASTA = "".join(FASTA).replace("\n", "")  # as a string free of new line markers
print FASTA
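Since the question targets Python 3.4, a rough Python 3 equivalent of the same idea (a sketch, not tested) would be:

import sys
from urllib.request import urlopen

link = sys.argv[1]  # FASTA file URL provided as a command line argument
lines = urlopen(link).read().decode("utf-8").splitlines()[1:]  # drop the ">..." header line
fasta = "".join(lines)  # single string without newlines
print(fasta)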
A little late to the party, but Jose's Biopython answer no longer works in Python 3. Here's an alternative:
from Bio import SeqIO
import requests
from io import StringIO
link = "http://www.uniprot.org/uniprot/B5ZC00.fasta"
data = requests.get(link).text
fasta_iterator = SeqIO.parse(StringIO(data), "fasta")
# Pretty print the fasta info
for seq in fasta_iterator:
    print(seq.format("fasta"))
