Biopython: retrieving a particular CDS from a whole genome

I am new to Stack Overflow. I am trying to automate a search process using Biopython. I have two lists, one with protein GI numbers and one with the corresponding nucleotide GI numbers.
For example:
protein_GI=[588489721,788136950,409084506]
nucleo_GI=[588489708,788136846,409084493]
The second list was created using ELink. However, the nucleotide GIs correspond to whole genomes. I need to retrieve the particular CDS from each genome that matches the protein GI.
I tried ELink again with different link names ("protein_nucleotide_cds", "protein_nuccore"), but all I get are ID numbers for whole genomes. Should I try some other link names?
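For reference, my ELink attempts looked roughly like this (a sketch; the email address is a placeholder):
from Bio import Entrez
Entrez.email = "your.email@example.com"
# Map a protein GI to linked nuccore records; this only returns whole-genome IDs
handle = Entrez.elink(dbfrom="protein", db="nuccore", linkname="protein_nuccore", id="588489721")
print(Entrez.read(handle))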
I also tried the following EFetch code:
import Bio
from Bio import Entrez
Entrez.email = "your.email@example.com"  # NCBI requires a contact email
handle=Entrez.efetch(db="sequences",id="588489708,588489721",rettype="fasta",retmode="text")
print(handle.read())
This method gives me the nucleotide and protein sequences in FASTA format, but the nucleotide sequence is a whole genome.
I would be very grateful if somebody could help me.
Thanks in advance!

I hope this helps you:
from Bio import Entrez
from Bio import SeqIO

Entrez.email = "mail@example.com"
gi_protein = "GI:588489721"
gi_genome = "GI:588489708"

# Fetch the protein record as FASTA
handle = Entrez.efetch(db="sequences", id=gi_protein, rettype="fasta", retmode="text")
protein = next(SeqIO.parse(handle, "fasta"))

# Fetch the genome as GenBank, including all features
handle = Entrez.efetch(db="sequences", id=gi_genome, rettype="gbwithparts", retmode="text")
genome = next(SeqIO.parse(handle, "gb"))

# Extract the feature whose db_xref matches the protein GI
features = [f for f in genome.features
            if "db_xref" in f.qualifiers and gi_protein in f.qualifiers["db_xref"]]

# Get the location of the CDS
start = int(features[0].location.start)
end = int(features[0].location.end)
strand = features[0].location.strand
seq = genome[start:end]

if strand == 1:
    print(seq.seq)
else:
    # If the strand is -1, take the reverse complement
    print(seq.reverse_complement().seq)
print(protein.seq)
Then you get:
ATGGATTATATTGTTTCAGCACGAAAATATCGTCCCTCTACCTTTGTTTCGGTGGTAGGG
CAGCAGAACATCACCACTACATTAAAAAATGCCATTAAAGGCAGTCAACTGGCACACGCC
TATCTTTTTTGCGGACCGCGAGGTGTGGGAAAGACGACTTGTGCCCGTATCTTTGCTAAA
ACCATCAACTGTTCGAATATATCAGCTGATTTTGAAGCGTGTAATGAGTGTGAATCCTGT
AAGTCTTTTAATGAGAATCGTTCTTATAATATTCATGAACTGGATGGAGCCTCCAATAAC
TCAGTAGAGGATATCAGGAGTCTGATTGATAAAGTTCGTGTTCCACCTCAGATAGGTAGT
TATAGTGTATATATTATCGATGAGGTTCACATGTTATCGCAGGCAGCTTTTAATGCTTTT
CTTAAAACATTGGAAGAGCCACCCAAGCATGCCATCTTTATTTTGGCCACTACTGAAAAA
CATAAAATACTACCAACGATCCTGTCTCGTTGCCAGATTTACGATTTTAATAGGATTACC
ATTGAAGATGCGGTAGGTCATTTAAAATATGTAGCAGAGAGTGAGCATATAACTGTGGAA
GAAGAGGGGTTAACCGTCATTGCACAAAAAGCTGATGGAGCTATGCGGGATGCACTTTCC
ATCTTTGATCAGATTGTGGCTTTCTCAGGTAAAAGTATCAGCTATCAGCAAGTAATCGAT
AATTTGAATGTATTGGATTATGATTTTTACTTTAGGTTGGTGGATGCTTTTCTGGCAGAA
GATACTACACAAACACTATTGATTTTTGATGAGATATTGAAACGGGGATTTGATGCACAT
CATTTTATTTCCGGTTTAAGTTCTCATTTGCGTGATTTACTTGTATGTAAGGATGCAGCC
ACCATTCAGTTGCTGGATGTGGGTGCTAAAATTAAGGAGAAGTACGGTGTTCAGGCGCAA
AAAAGTACGATTGACTTTTTAATGGATGCTTTAAATATTACCAACGATTGCGATTTGCAA
TATAGGGTGGCTAAAAATAAGCGTTTGCATGTGGAGTTTGCTCTTCTTAAGATAGCACGT
GTATTAGATGAACAAAGAAAAAAGTAG
MDYIVSARKYRPSTFVSVVGQQNITTTLKNAIKGSQLAHAYLFCGPRGVGKTTCARIFAK
TINCSNISADFEACNECESCKSFNENRSYNIHELDGASNNSVEDIRSLIDKVRVPPQIGS
YSVYIIDEVHMLSQAAFNAFLKTLEEPPKHAIFILATTEKHKILPTILSRCQIYDFNRIT
IEDAVGHLKYVAESEHITVEEEGLTVIAQKADGAMRDALSIFDQIVAFSGKSISYQQVID
NLNVLDYDFYFRLVDAFLAEDTTQTLLIFDEILKRGFDAHHFISGLSSHLRDLLVCKDAA
TIQLLDVGAKIKEKYGVQAQKSTIDFLMDALNITNDCDLQYRVAKNKRLHVEFALLKIAR
VLDEQRKK
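As a side note, instead of slicing and reverse-complementing manually, you can let Biopython do the work: SeqFeature.extract handles coordinates and strand for you. A minimal sketch, assuming the same features list as above:
# Equivalent extraction; strand handling is automatic
cds = features[0].extract(genome.seq)
print(cds)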

Related

How to extract data from .txt files and put the data into a pandas dataframe by columns (like 'Report', 'Findings', 'Impression', 'Recommendation')

Radiology Report
I am trying to extract the data by subject ('findings', 'impression') and put it into a pandas dataframe.
Here is example code with two text strings (text1, text2) and subjects (Indication, Comparison, Findings, and Impression):
import re
import pandas as pd
text1 = '''FINAL REPORT EXAMINATION: CHEST (PORTABLE AP) INDICATION: ___ year old woman with cough neutropenic // r/o infection TECHNIQUE: Single frontal view of the chest COMPARISON: Chest radiograph from ___, ___. FINDINGS: Right subclavian catheter tip terminates in the lower SVC. Cardiac size is normal. The lungs are clear. There is no pneumothorax or pleural effusion. IMPRESSION: No evidence of pneumonia. '''
text2 = '''FINAL REPORT EXAMINATION: CHEST (PORTABLE AP) INDICATION: ___ year old woman with cough neutropenic // r/o infection TECHNIQUE: Single frontal view of the chest COMPARISON: Chest radiograph from ___, ___. FINDINGS: Right subclavian catheter tip terminates in the lower SVC. Cardiac size is normal. The lungs are clear. There is no pneumothorax or pleural effusion. IMPRESSION: No evidence of pneumonia. '''
subjects = ("INDICATION", "COMPARISON", "FINDINGS", "IMPRESSION")
data = [re.split('|'.join(subjects), text)[1:] for text in [text1, text2]]
data = pd.DataFrame(data, columns = subjects)
The resulting data is as follows:
INDICATION COMPARISON FINDINGS IMPRESSION
0 : ___ year old woman with cough neutropenic //... : Chest radiograph from ___, ___. : Right subclavian catheter tip terminates in ... : No evidence of pneumonia.
1 : ___ year old woman with cough neutropenic //... : Chest radiograph from ___, ___. : Right subclavian catheter tip terminates in ... : No evidence of pneumonia.
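Note that each field keeps a leading ': ' from the original text; you can strip it during the split, e.g.:
data = [[field.strip(' :') for field in re.split('|'.join(subjects), text)[1:]]
        for text in [text1, text2]]
data = pd.DataFrame(data, columns=subjects)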
To extract data from a .txt file and put it into a Pandas dataframe, you can use the following steps:
Import the Pandas library:
import pandas as pd
Open the .txt file and read its contents into a string:
with open('file.txt', 'r') as f:
    data = f.read()
Split the string into a list of lines:
lines = data.split('\n')
Create an empty dictionary to store the data:
data_dict = {}
Iterate over the list of lines and extract the data for each column:
for line in lines:
    # Map each report heading to its own column; split only on the first colon
    if 'Indication:' in line:
        data_dict['Indication'] = line.split(':', 1)[1].strip()
    elif 'Comparison:' in line:
        data_dict['Comparison'] = line.split(':', 1)[1].strip()
    elif 'Findings:' in line:
        data_dict['Findings'] = line.split(':', 1)[1].strip()
    elif 'Impression:' in line:
        data_dict['Impression'] = line.split(':', 1)[1].strip()
Create a Pandas dataframe from the dictionary:
df = pd.DataFrame.from_dict(data_dict, orient='index').transpose()
Display the dataframe:
print(df)
This will extract the data from the .txt file and create a pandas dataframe with one column per report section ('Indication', 'Comparison', 'Findings', 'Impression'). You can rename the columns or adjust the heading-to-column mapping to produce the exact columns you need, and otherwise adapt the code to your data structure.
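If you have many report files, the re.split approach from the question above scales naturally. A sketch, assuming one report string per file (the file names are hypothetical):
import re
import pandas as pd

subjects = ("INDICATION", "COMPARISON", "FINDINGS", "IMPRESSION")
paths = ["report1.txt", "report2.txt"]  # hypothetical file names

# Read each report and split it into its four sections
texts = []
for path in paths:
    with open(path) as f:
        texts.append(f.read())
rows = [[field.strip(' :') for field in re.split('|'.join(subjects), text)[1:]]
        for text in texts]
df = pd.DataFrame(rows, columns=subjects)
print(df)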

How to calculate cluster coherence/quality?

I created embeddings with fastText and I have clusters thanks to KMeans.
I would like to calculate similarities inside each cluster to check whether the sentences in it are well clustered. I want to keep the sentences with good similarity in each cluster; if the similarity is not good, I want to drop the sentences that do not belong to their cluster, and then group the similar sentences that do not belong to any cluster.
How can I do this in a good manner? I thought of using cosine similarity, but I don't know how to compare all the sentences inside a cluster.
Maybe something like this...
# Clustering words into similar groups:
import numpy as np
from sklearn.cluster import AffinityPropagation
import distance

words = 'XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL,DFKLKSLFD,ABC,DLFKFKDLD,XYZ,LDPELDKSL'.split(',')  # Replace this line
words = np.asarray(words)  # So that indexing with a list will work
lev_similarity = -1 * np.array([[distance.levenshtein(w1, w2) for w1 in words] for w2 in words])
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
Result:
- *LDPELDKSL:* LDPELDKSL
- *DFKLKSLFD:* DFKLKSLFD
- *XYZ:* ABC, XYZ
- *DLFKFKDLD:* DLFKFKDLD
See these links for additional guidance on how to cluster text.
https://towardsdatascience.com/applying-machine-learning-to-classify-an-unsupervised-text-document-e7bb6265f52
https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html
https://pythonprogramminglanguage.com/kmeans-text-clustering/
http://brandonrose.org/clustering
Here are a couple of examples using cosine similarity.
d1 = "plot: two teen couples go to a church party, drink and then drive."
d2 = "films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , but there's never really been a comic book like from hell before . "
d3 = "every now and then a movie comes along from a suspect studio , with every indication that it will be a stinker , and to everybody's surprise ( perhaps even the studio ) the film becomes a critical darling . "
d4 = "damn that y2k bug . "
documents = [d1, d2, d3, d4]
import nltk, string, numpy
nltk.download('punkt') # first-time use only
stemmer = nltk.stem.porter.PorterStemmer()
def StemTokens(tokens):
    return [stemmer.stem(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def StemNormalize(text):
    return StemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
nltk.download('wordnet') # first-time use only
lemmer = nltk.stem.WordNetLemmatizer()
def LemTokens(tokens):
    return [lemmer.lemmatize(token) for token in tokens]
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
def LemNormalize(text):
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
from sklearn.feature_extraction.text import CountVectorizer
LemVectorizer = CountVectorizer(tokenizer=LemNormalize, stop_words='english')
LemVectorizer.fit_transform(documents)
print(LemVectorizer.vocabulary_)
tf_matrix = LemVectorizer.transform(documents).toarray()
print(tf_matrix)
tf_matrix.shape
from sklearn.feature_extraction.text import TfidfTransformer
tfidfTran = TfidfTransformer(norm="l2")
tfidfTran.fit(tf_matrix)
print(tfidfTran.idf_)
import math
def idf(n, df):
    result = math.log((n + 1.0) / (df + 1.0)) + 1
    return result
print("The idf for terms that appear in one document: " + str(idf(4,1)))
print("The idf for terms that appear in two documents: " + str(idf(4,2)))
tfidf_matrix = tfidfTran.transform(tf_matrix)
print(tfidf_matrix.toarray())
cos_similarity_matrix = (tfidf_matrix * tfidf_matrix.T).toarray()
print(cos_similarity_matrix)
from sklearn.feature_extraction.text import TfidfVectorizer
TfidfVec = TfidfVectorizer(tokenizer=LemNormalize, stop_words='english')
def cos_similarity(textlist):
    tfidf = TfidfVec.fit_transform(textlist)
    return (tfidf * tfidf.T).toarray()
cos_similarity(documents)
https://sites.temple.edu/tudsc/2017/03/30/measuring-similarity-between-texts-in-python/
# Define the documents
doc_trump = "Mr. Trump became president after winning the political election. Though he lost the support of some republican friends, Trump is friends with President Putin"
doc_election = "President Trump says Putin had no political interference is the election outcome. He says it was a witchhunt by political parties. He claimed President Putin is a friend who had nothing to do with the election"
doc_putin = "Post elections, Vladimir Putin became President of Russia. President Putin had served as the Prime Minister earlier in his political career"
documents = [doc_trump, doc_election, doc_putin]
# Scikit Learn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Create the Document Term Matrix
count_vectorizer = CountVectorizer(stop_words='english')
sparse_matrix = count_vectorizer.fit_transform(documents)
# OPTIONAL: Convert Sparse Matrix to Pandas Dataframe if you want to see the word frequencies.
doc_term_matrix = sparse_matrix.todense()
df = pd.DataFrame(doc_term_matrix,
                  columns=count_vectorizer.get_feature_names_out(),  # get_feature_names() on older scikit-learn
                  index=['doc_trump', 'doc_election', 'doc_putin'])
print(df)
# Compute Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(df, df))
https://www.machinelearningplus.com/nlp/cosine-similarity/
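To score your KMeans clusters directly, a common approach is the mean pairwise cosine similarity within each cluster, plus an overall silhouette score. A minimal sketch, assuming X is your (n_sentences, dim) fastText embedding matrix and labels are the KMeans cluster labels:
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

def mean_intra_cluster_similarity(X, labels):
    scores = {}
    for c in np.unique(labels):
        members = X[labels == c]
        sim = cosine_similarity(members)
        n = len(members)
        # Average the off-diagonal entries, i.e. all pairwise similarities
        scores[c] = (sim.sum() - n) / (n * (n - 1)) if n > 1 else 1.0
    return scores

# Overall clustering quality using cosine distance:
# print(silhouette_score(X, labels, metric="cosine"))
Sentences whose similarity to the rest of their cluster falls below a threshold can then be removed and re-clustered, as you describe.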

How do I do multiple pairwise alignments from a FASTA file and print the percentage similarity?

I want to do multiple pairwise comparisons for every protein sequence contained in a FASTA file and then print the percentage sequence similarity (either an average or individually). I think I need to use itertools to create all of the combinations, align them, and then divide the number of matches by the aligned sequence length to get the % sequence similarity, but I am having trouble with the specific script to do this, preferably in Biopython if possible. Any help is appreciated.
My answer does not involve Biopython, but since no other answer has been posted yet, I will post it anyway:
The bioinformatics package Biotite (https://www.biotite-python.org/), a package I am currently developing, would solve your problem using the following script:
import numpy as np
import biotite
import biotite.sequence as seq
import biotite.sequence.io.fasta as fasta
import biotite.sequence.align as align
import biotite.database.entrez as entrez
# 5 example sequences (bacterial luciferase variants)
uids = [
    'Q7N575', 'P19839', 'P09140', 'P07740', 'P24113'
]
# Download these sequences as one file from NCBI
file_name = entrez.fetch_single_file(
    uids, biotite.temp_file("fasta"), db_name="protein", ret_type="fasta"
)
# Read each sequence in the file as 'ProteinSequence' object
fasta_file = fasta.FastaFile()
fasta_file.read(file_name)
sequences = list(fasta.get_sequences(fasta_file).values())
# BLOSUM62
substitution_matrix = align.SubstitutionMatrix.std_protein_matrix()
# Matrix that will be filled with pairwise sequence identities
identities = np.ones((len(sequences), len(sequences)))
# Iterate over sequences
for i in range(len(sequences)):
    for j in range(i):
        # Align sequences pairwise
        alignment = align.align_optimal(
            sequences[i], sequences[j], substitution_matrix
        )[0]
        # Calculate pairwise sequence identities and fill the matrix
        identity = align.get_sequence_identity(alignment)
        identities[i, j] = identity
        identities[j, i] = identity
print(identities)
The output:
[[1. 0.97214485 0.62921348 0.84225352 0.59776536]
[0.97214485 1. 0.62359551 0.85352113 0.60055866]
[0.62921348 0.62359551 1. 0.61126761 0.85393258]
[0.84225352 0.85352113 0.61126761 1. 0.59383754]
[0.59776536 0.60055866 0.85393258 0.59383754 1. ]]
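For a Biopython-only sketch along the lines the question suggests (itertools.combinations over the FASTA records, percent identity computed as matches divided by alignment length; the file name is hypothetical):
import itertools
from Bio import SeqIO, pairwise2

records = list(SeqIO.parse("proteins.fasta", "fasta"))  # hypothetical file name
for rec1, rec2 in itertools.combinations(records, 2):
    # globalxx: matches score 1, no mismatch or gap penalties
    aln = pairwise2.align.globalxx(str(rec1.seq), str(rec2.seq))[0]
    matches = sum(a == b for a, b in zip(aln.seqA, aln.seqB))
    identity = 100.0 * matches / len(aln.seqA)
    print(rec1.id, rec2.id, "%.1f%%" % identity)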

Is there any way to get abstracts for a given list of pubmed ids?

I have a list of PMIDs and I want to get the abstracts for all of them in a single URL hit:
pmids=[17284678,9997]
abstract_dict={}
url = https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=17284678,9997&retmode=text&rettype=xml
My requirement is to get the result in this format:
abstract_dict={"pmid1":"abstract1","pmid2":"abstract2"}
I can get the above format by fetching each ID separately and updating the dictionary, but to save time I want to pass all the IDs to the URL at once and extract only the abstract parts.
Using Biopython, you can pass the joined list of PubMed IDs to Entrez.efetch, which performs a single URL lookup:
from Bio import Entrez
Entrez.email = 'your_email@provider.com'
pmids = [17284678,9997]
handle = Entrez.efetch(db="pubmed", id=','.join(map(str, pmids)),
                       rettype="xml", retmode="text")
records = Entrez.read(handle)
abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
             for pubmed_article in records['PubmedArticle']]
abstract_dict = dict(zip(pmids, abstracts))
This gives the following result:
{9997: 'Electron paramagnetic resonance and magnetic susceptibility studies of Chromatium flavocytochrome C552 and its diheme flavin-free subunit at temperatures below 45 degrees K are reported. The results show that in the intact protein and the subunit the two low-spin (S = 1/2) heme irons are distinguishable, giving rise to separate EPR signals. In the intact protein only, one of the heme irons exists in two different low spin environments in the pH range 5.5 to 10.5, while the other remains in a constant environment. Factors influencing the variable heme iron environment also influence flavin reactivity, indicating the existence of a mechanism for heme-flavin interaction.',
17284678: 'Eimeria tenella is an intracellular protozoan parasite that infects the intestinal tracts of domestic fowl and causes coccidiosis, a serious and sometimes lethal enteritis. Eimeria falls in the same phylum (Apicomplexa) as several human and animal parasites such as Cryptosporidium, Toxoplasma, and the malaria parasite, Plasmodium. Here we report the sequencing and analysis of the first chromosome of E. tenella, a chromosome believed to carry loci associated with drug resistance and known to differ between virulent and attenuated strains of the parasite. The chromosome--which appears to be representative of the genome--is gene-dense and rich in simple-sequence repeats, many of which appear to give rise to repetitive amino acid tracts in the predicted proteins. Most striking is the segmentation of the chromosome into repeat-rich regions peppered with transposon-like elements and telomere-like repeats, alternating with repeat-free regions. Predicted genes differ in character between the two types of segment, and the repeat-rich regions appear to be associated with strain-to-strain variation.'}
Edit:
In the case of pmids without corresponding abstracts, watch out with the fix you suggested:
abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
             for pubmed_article in records['PubmedArticle']
             if 'Abstract' in pubmed_article['MedlineCitation']['Article'].keys()]
Suppose you have the list of Pubmed IDs pmids = [1, 2, 3], but pmid 2 doesn't have an abstract, so abstracts = ['abstract of 1', 'abstract of 3']
This will cause problems in the final step where I zip both lists together to make a dict:
>>> abstract_dict = dict(zip(pmids, abstracts))
>>> print(abstract_dict)
{1: 'abstract of 1',
2: 'abstract of 3'}
Note that abstracts are now out of sync with their corresponding Pubmed IDs, because you didn't filter out the pmids without abstracts and zip truncates to the shortest list.
Instead, do:
abstract_dict = {}
without_abstract = []
for pubmed_article in records['PubmedArticle']:
    pmid = int(str(pubmed_article['MedlineCitation']['PMID']))
    article = pubmed_article['MedlineCitation']['Article']
    if 'Abstract' in article:
        abstract = article['Abstract']['AbstractText'][0]
        abstract_dict[pmid] = abstract
    else:
        without_abstract.append(pmid)
print(abstract_dict)
print(without_abstract)
Alternatively, you can fall back to the article title for PMIDs that have no abstract:
from Bio import Entrez

Entrez.email = 'your_email@provider.com'
pmids = [29090559, 29058482, 28991880, 28984387, 28862677, 28804631, 28801717,
         28770950, 28768831, 28707064, 28701466, 28685492, 28623948, 28551248]
handle = Entrez.efetch(db="pubmed", id=','.join(map(str, pmids)),
                       rettype="xml", retmode="text")
records = Entrez.read(handle)
abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
             if 'Abstract' in pubmed_article['MedlineCitation']['Article'].keys()
             else pubmed_article['MedlineCitation']['Article']['ArticleTitle']
             for pubmed_article in records['PubmedArticle']]
abstract_dict = dict(zip(pmids, abstracts))
print(abstract_dict)

Biopython: Local alignment between DNA sequences doesn't find optimal alignment

I'm writing code to find local alignments between two sequences. Here is a minimal, working example I've been working on:
from Bio import pairwise2
from Bio.pairwise2 import format_alignment
seq1 = "GTGGTCCTAGGC"
seq2 = "GCCTAGGACCAC"
# Scores for the alignment
match = 1
mismatch = -2
gapopen = -2
gapext = 0
# See: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html
# 'localms' takes <seq1, seq2, match, mismatch, open, extend>
for a in pairwise2.align.localms(seq1, seq2, match, mismatch, gapopen, gapext):
    print(format_alignment(*a))
The code runs with the following output:
GTGGTCCTAGGC----
|||||
----GCCTAGGACCAC
Score=5
But a score of 6 should be possible by also matching the 'C' next to the five aligned bases, like so:
GTGGTCCTAGGC----
||||||
----GCCTAGGACCAC
Score=6
Any ideas on what's going on?
This seems to be a bug in the current implementation of local alignments in Biopython's pairwise2 module. There is a recent pull request (#782) on Biopython's GitHub, which should solve your problem:
>>> from Bio import pairwise2 # This is the version from the pull request
>>> seq1 = 'GTGGTCCTAGGC'
>>> seq2 = 'GCCTAGGACCAC'
>>> for a in pairwise2.align.localms(seq1, seq2, 1, -2, -2, 0):
...     print(pairwise2.format_alignment(*a))
GTGGTCCTAGGC----
||||||
----GCCTAGGACCAC
Score=6
If you are working with short sequences only, you can just download the code for pairwise2.py from the pull request mentioned above. In addition you need to 'inactivate' the respective C module (cpairwise2.pyd or cpairwise2.so), e.g. by renaming it or by removing the import of the C functions at the end of pairwise2.py (from .cpairwise2 import ...).
If you are working with longer sequences, you will need the speed enhancement of the C module. Thus you also have to download cpairwise2module.c from the pull request and compile it into cpairwise2.pyd (for Windows systems) or cpairwise2.so (Unix, Linux).
EDIT: The problem is solved in Biopython 1.68.
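For reference, newer Biopython versions provide Bio.Align.PairwiseAligner (pairwise2 is deprecated in its favor), which finds the score-6 alignment directly. A minimal sketch with the same scoring scheme:
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "local"
aligner.match_score = 1
aligner.mismatch_score = -2
aligner.open_gap_score = -2
aligner.extend_gap_score = 0

alignments = aligner.align("GTGGTCCTAGGC", "GCCTAGGACCAC")
print(alignments.score)  # expected: 6.0
print(alignments[0])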
