Writing and saving GenBank files with Biopython SeqIO module - biopython

I want to save some DNA sequences in GenBank file format to include information about genes, domains, etc. I know how to create SeqRecord objects and include all the information I want to have in the file:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.Alphabet import IUPAC

# my DNA sequence and encoded protein sequence of gene1
genome_seq = 'ATTTTGTGCAGCCGAGAGCGCGAGCGAAGCGCTTAAAAAATTCCCCCGCTCTGTTCTCCGGCAGGACACAAAGTCATGCCGTGGAGACCGCCGGTCCATAACGTGCCAGGTAGAGAGAATCAATGGTTTGCAGCGTTCTTTCACGGTCATGCTGCTTTCTGCGGGTGTGGTGACCCTGTTGGGCATCTTAACGGAAGC'
protein_seq = 'QQRILGVKLRLLFNQVQKIQQNQDP'
# position of gene1
start = 12
end = start + len(protein_seq)
# some information
name = 'my_contig'
bioproject = 'BodySites'
sample_type = 'blood'
taxonomy = ['Homo Sapiens']
reference_prot_ID = 'YP_92845z2093857'
# dictionaries with information for SeqFeature qualifiers and SeqRecord annotations
dict1 = {'gene': 'ORF1', 'ref_ID': reference_prot_ID, 'translation': protein_seq}
dict2 = {'SOURCE': sample_type, 'ORGANISM': 'Human', 'Taxonomy': taxonomy}
# create SeqFeature and SeqRecord
f1 = SeqFeature(FeatureLocation(start, end, strand=1), type='domain', qualifiers=dict1)
my_features = [f1]
record = SeqRecord(Seq(genome_seq, alphabet=IUPAC.unambiguous_dna), id=name, name=name,
                   description=bioproject, annotations=dict2, features=my_features)
print(record)
with open('/media/sf_Desktop/test.gb', 'w') as handle:
    SeqIO.write(record, handle, 'genbank')
What I get printed on the screen for the SeqRecord object looks like this, where everything seems to be included:
ID: my_contig
Name: my_contig
Description: BodySites
Number of features: 1
/SOURCE=blood
/ORGANISM=Human
/Taxonomy=['Homo Sapiens']
Seq('ATTTTGTGCAGCCGAGAGCGCGAGCGAAGCGCTTAAAAAATTCCCCCGCTCTGT...AGC', IUPACUnambiguousDNA())
But in the resulting file the information on SOURCE, ORGANISM and Taxonomy is missing:
LOCUS       my_contig                198 bp    DNA              UNK 01-JAN-1980
DEFINITION  BodySites.
ACCESSION   my_contig
VERSION     my_contig
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
     domain          13..37
                     /gene="ORF1"
                     /ref_ID="YP_92845z2093857"
                     /translation="QQRILGVKLRLLFNQVQKIQQNQDP"
ORIGIN
        1 attttgtgca gccgagagcg cgagcgaagc gcttaaaaaa ttcccccgct ctgttctccg
       61 gcaggacaca aagtcatgcc gtggagaccg ccggtccata acgtgccagg tagagagaat
      121 caatggtttg cagcgttctt tcacggtcat gctgctttct gcgggtgtgg tgaccctgtt
      181 gggcatctta acggaagc
//
Can anyone help me include the annotation information in the output file as well?
I found that for the GenBank.Record module it is possible to include all information and it looks very nice on the screen, but there is no information on how to save a Record object to a file...

OK, I found my mistake:
all annotation keys need to be in lowercase letters. So changing 'SOURCE' to 'source', 'ORGANISM' to 'organism' and so on did the job.
Cheers!
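For reference, a minimal sketch of the corrected dictionary and record creation, using the same variables as in the question; with lowercase keys the source/organism/taxonomy information shows up in the written GenBank file:

# Lowercase annotation keys ('source', 'organism', 'taxonomy') are the ones
# the GenBank writer in Bio.SeqIO picks up.
dict2 = {'source': sample_type, 'organism': 'Human', 'taxonomy': taxonomy}

record = SeqRecord(Seq(genome_seq, alphabet=IUPAC.unambiguous_dna), id=name, name=name,
                   description=bioproject, annotations=dict2, features=my_features)

with open('/media/sf_Desktop/test.gb', 'w') as handle:
    SeqIO.write(record, handle, 'genbank')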

Related

Lua indexed table access via named constants

I am using Lua as an embedded language on a µC project, so resources are limited. To save some cycles and memory I always use index-based table access (table[1]) instead of hash-based access (table.someMeaning = 1). This saves a lot of memory.
The clear drawback of this approach is the magic numbers throughout the code.
A cpp-like preprocessor would help here to replace the numbers with named constants.
Is there a good way to achieve this?
A preprocessor written in Lua itself (loading the script, editing the chunk, and then loading it) would be a variant, but I think this would exhaust the resources in the first place...
So, I found a simple solution: write your own preprocessor in Lua!
It's probably the easiest thing to do.
First, define your symbols globally:
MySymbols = {
    FIELD_1 = 1,
    FIELD_2 = 2,
    FIELD_3 = 3,
}
Then you write your preprocessing function, which basically just replaces the strings from MySymbols with their values.
function Preprocess (FilenameIn, FilenameOut)
    local FileIn = io.open(FilenameIn, "r")
    local FileString = FileIn:read("*a")
    for Name, Value in pairs(MySymbols) do
        FileString = FileString:gsub(Name, Value)
    end
    FileIn:close()
    local FileOut = io.open(FilenameOut, "w")
    FileOut:write(FileString)
    FileOut:close()
end
Then, if you try with this input file test.txt:
TEST FIELD_1
TEST FIELD_2
TEST FIELD_3
And call the following function:
Preprocess("test.txt", "test-out.lua")
You will get the fantastic output file:
TEST 1
TEST 2
TEST 3
I leave you the joy of integrating it with your scripts/toolchain.
If you want to avoid assigning the numbers manually, you could just add a wonderful closure:
function MakeCounter ()
    local Count = 0
    return function ()
        Count = Count + 1
        return Count
    end
end

NewField = MakeCounter()

MySymbols = {
    FIELD_1 = NewField(),
    FIELD_2 = NewField(),
    FIELD_3 = NewField()
}

Saving SEC 10-K annual report text to files (trouble with decoding)

I am trying to bulk-download the text visible to the "end user" from 10-K SEC EDGAR reports (I don't care about tables) and save it in a text file. I found the code below on YouTube; however, I am facing two challenges:
1. I am not sure if I am capturing all the text, and when I print the text fetched from the URL below, I get very weird output (special characters, e.g., at the very end of the print-out).
2. I can't seem to save the text in .txt files; I'm not sure if this is due to encoding (I am entirely new to programming).
import re
import requests
import unicodedata
from bs4 import BeautifulSoup

def restore_windows_1252_characters(restore_string):
    def to_windows_1252(match):
        try:
            return bytes([ord(match.group(0))]).decode('windows-1252')
        except UnicodeDecodeError:
            # No character at the corresponding code point: remove it.
            return ''
    return re.sub(r'[\u0080-\u0099]', to_windows_1252, restore_string)

# define the url to the specific html_text file
new_html_text = r"https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt"
# grab the response
response = requests.get(new_html_text)
page_soup = BeautifulSoup(response.content, 'html5lib')
page_text = page_soup.html.body.get_text(' ', strip=True)
# normalize the text, remove characters. Additionally, restore missing windows-1252 characters.
page_text_norm = restore_windows_1252_characters(unicodedata.normalize('NFKD', page_text))
# print: this works, however it gives me weird special characters (e.g., at the very end)
print(page_text_norm)
# save to file: this only gives me an empty text file
with open('testfile.txt', 'w') as file:
    file.write(page_text_norm)
Try this. If you take the data you expect as an example, it will be easier for people to understand your needs.
from simplified_scrapy import SimplifiedDoc,req,utils
url = 'https://www.sec.gov/Archives/edgar/data/796343/0000796343-14-000004.txt'
html = req.get(url)
doc = SimplifiedDoc(html)
# text = doc.body.text
text = doc.body.unescape() # Converting HTML entities
utils.saveFile("testfile.txt",text)
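If you would rather stay with the requests/BeautifulSoup code from the question, the empty file may simply be a UnicodeEncodeError raised while writing with the platform's default encoding. A minimal sketch, assuming that is the cause, is to pass an explicit encoding to open():
# Write with an explicit UTF-8 encoding so characters outside the platform's
# default codec don't abort the write and leave an empty file.
with open('testfile.txt', 'w', encoding='utf-8') as file:
    file.write(page_text_norm)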

Sphinx references to other sections containing section number and section title

I am using Sphinx to write a document with lots of references:
.. _human-factor:
The Human Factor
================
...
(see :ref:`human-factor` for details)
The compiled document contains something like this:
(see The Human Factor for details)
Instead I would like to have it formatted like this:
(see 5.1 The Human Factor for details)
I tried to Google for a solution and found out that the LaTeX hyperref package can do this, but I have no idea how to add this to the Sphinx build.
I resolved it by basically using numsec.py from here: https://github.com/jterrace/sphinxtr
I had to replace the doctree_resolved function with this one to get section number + title (e.g. "5.1 The Human Factor").
def doctree_resolved(app, doctree, docname):
    secnums = app.builder.env.toc_secnumbers
    for node in doctree.traverse(nodes.reference):
        if 'refdocname' in node:
            refdocname = node['refdocname']
            if refdocname in secnums:
                secnum = secnums[refdocname]
                emphnode = node.children[0]
                textnode = emphnode.children[0]
                toclist = app.builder.env.tocs[refdocname]
                anchorname = None
                for refnode in toclist.traverse(nodes.reference):
                    if refnode.astext() == textnode.astext():
                        anchorname = refnode['anchorname']
                if anchorname is None:
                    continue
                linktext = '.'.join(map(str, secnum[anchorname]))
                node.replace(emphnode, nodes.Text(linktext + ' ' + textnode))
To make it work one needs to include the numsec extension in conf.py and also to add :numbered: in the toctree like so:
.. toctree::
   :maxdepth: 1
   :numbered:
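For completeness, a minimal conf.py sketch, assuming numsec.py from sphinxtr is copied next to conf.py so Sphinx can import it as a local extension:
# conf.py (excerpt) -- hypothetical minimal setup for the numsec extension
import os
import sys

# make the directory containing numsec.py importable
sys.path.insert(0, os.path.abspath('.'))

extensions = [
    'numsec',  # provides the patched doctree_resolved handler shown above
]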

How can you join two or more dictionaries created by Bio.SeqIO.index?

I would like to be able to join the two "dictionaries" stored in "indata" and "pairdata", but this code,
indata = SeqIO.index(infile, infmt)
pairdata = SeqIO.index(pairfile, infmt)
indata.update(pairdata)
produces the following error:
indata.update(pairdata)
TypeError: update() takes exactly 1 argument (2 given)
I have tried using,
indata = SeqIO.to_dict(SeqIO.parse(infile, infmt))
pairdata = SeqIO.to_dict(SeqIO.parse(pairfile, infmt))
indata.update(pairdata)
which does work, but the resulting dictionaries take up too much memory to be practical for the sizes of infile and pairfile I have.
The final option I have explored is:
indata = SeqIO.index_db(indexfile, [infile, pairfile], infmt)
which works perfectly, but is very slow. Does anyone know how/whether I can successfully join the two indexes from the first example above?
SeqIO.index returns a read-only dictionary-like object, so update will not work on it (apologies for the confusing error message; I just checked in a fix for that to the main Biopython repository).
The best approach is to either use index_db, which will be slower but only needs to index the file once, or to define a higher-level object which acts like a dictionary over your multiple files. Here is a simple example:
from Bio import SeqIO

class MultiIndexDict:
    def __init__(self, *indexes):
        self._indexes = indexes
    def __getitem__(self, key):
        for idx in self._indexes:
            try:
                return idx[key]
            except KeyError:
                pass
        raise KeyError("{0} not found".format(key))

indata = SeqIO.index("f001", "fasta")
pairdata = SeqIO.index("f002", "fasta")
combo = MultiIndexDict(indata, pairdata)

print(combo['gi|3318709|pdb|1A91|'].description)
print(combo['gi|1348917|gb|G26685|G26685'].description)
print(combo["key_failure"])
If you don't plan to use the index again and memory isn't a limitation (which both appear to be true in your case), you can tell Bio.SeqIO.index_db(...) to use an in-memory SQLite3 index with the special index name ":memory:", like so:
indata = SeqIO.index_db(":memory:", [infile, pairfile], infmt)
where infile and pairfile are filenames, and infmt is their format type as defined in Bio.SeqIO (e.g. "fasta").
This is actually a general trick with Python's SQLite3 library. For a small set of files this should be much faster than building the SQLite index on disk.

Review of file harvesting / descent code and request for more efficient example

I have a structure from a data source that doesn't provide FTP access (i.e., I can't simply walk the folder structures as easily as I would with net/ftp), so I'm having to parse each HTTP file listing (the default Apache-generated folder-browse HTML), then descend, rinse, and repeat until I'm at the floor level that contains the files to harvest into a more appropriate file hierarchy.
I feel that I could probably use mechanize along with nokogiri to do something much more efficient. Could you provide an example of what would work better in this scenario?
Presently I'm doing something like:
doc = Nokogiri::HTML(open("some_base_url"))
link_list = []
doc.xpath('//a').each do |node|
  unless node.text =~ /Name|Last modified|Size|Description|Parent Directory/
    link_list.push(url + node.text)
  end
end

link_list.each do |sub_folder_url|
  doc = Nokogiri::HTML(open(sub_folder_url))
  # ... rinse and repeat until at the bottom level with files, then we have all file urls to pull
end
output_path = []
url_list.each do |url|
  filepath_elements = url.split('/')
  filename_elements = filepath_elements.last.split('_.')
  date = filename_elements[0]
  time = filename_elements[1]
  detail = filename_elements[2]
  data_type_and_filetype = filename_elements[3].split('.')
  data_type = data_type_and_filetype[0]
  file_type = data_type_and_filetype[1]
  output_path = [date, time, detail, data_type] * '_' + '.' + file_type
end
# pull in all final_urls_to_retrieve to its respective new output_path[location_in_final_urls_to_retrieve]
