How can you join two or more dictionaries created by Bio.SeqIO.index? - biopython

I would like to be able to join the two "dictionaries" stored in "indata" and "pairdata", but this code,
indata = SeqIO.index(infile, infmt)
pairdata = SeqIO.index(pairfile, infmt)
indata.update(pairdata)
produces the following error:
indata.update(pairdata)
TypeError: update() takes exactly 1 argument (2 given)
I have tried using,
indata = SeqIO.to_dict(SeqIO.parse(infile, infmt))
pairdata = SeqIO.to_dict(SeqIO.parse(pairfile, infmt))
indata.update(pairdata)
which does work, but the resulting dictionaries take up too much memory to be practical for the sizes of infile and pairfile I have.
The final option I have explored is:
indata = SeqIO.index_db(indexfile, [infile, pairfile], infmt)
which works perfectly, but is very slow. Does anyone know how/whether I can successfully join the two indexes from the first example above?

SeqIO.index returns a read-only dictionary-like object, so update will not work on it (apologies for the confusing error message; I just checked in a fix for that to the main Biopython repository).
The best approach is to either use index_db, which will be slower but
only needs to index the file once, or to define a higher level object
which acts like a dictionary over your multiple files. Here is a
simple example:
from Bio import SeqIO

class MultiIndexDict:
    def __init__(self, *indexes):
        self._indexes = indexes
    def __getitem__(self, key):
        for idx in self._indexes:
            try:
                return idx[key]
            except KeyError:
                pass
        raise KeyError("{0} not found".format(key))

indata = SeqIO.index("f001", "fasta")
pairdata = SeqIO.index("f002", "fasta")
combo = MultiIndexDict(indata, pairdata)

print(combo['gi|3318709|pdb|1A91|'].description)
print(combo['gi|1348917|gb|G26685|G26685'].description)
print(combo["key_failure"])

If you don't plan to reuse the index and memory isn't a limitation (both of which appear to be true in your case), you can tell Bio.SeqIO.index_db(...) to use an in-memory SQLite3 index with the special index name ":memory:", like so:
indata = SeqIO.index_db(":memory:", [infile, pairfile], infmt)
where infile and pairfile are filenames, and infmt is their format type as defined in Bio.SeqIO (e.g. "fasta").
This is actually a general trick with Python's SQLite3 library. For a small set of files this should be much faster than building the SQLite index on disk.
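As a quick usage sketch (assuming the same infile, pairfile and infmt variables as above, and with some_id standing in for a record ID that exists in one of the files), the combined index then behaves like a read-only dictionary:
from Bio import SeqIO
combined = SeqIO.index_db(":memory:", [infile, pairfile], infmt)
print(len(combined))                # total number of records across both files
record = combined[some_id]          # some_id is a placeholder, not a real identifier
print(record.id, len(record.seq))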

Related

Biopython: how to extract only relevant atoms and save a PDB file (not locally)?

Using Biopython, I have a list of atoms: rep_atoms = [CA, CB, CD3] (carbon atoms).
I want to keep only these atoms from any given PDB file. I don't want to save the result locally; I want to keep it in memory (there are lots of iterations).
I have arrived at the code below, but it saves the file locally and is very slow.
So my goal is: for each atom in the PDB structure, if it is present in rep_atoms, store only that information in a new_pdb object, so that when I use it later in my code it behaves like a PDB file without ever being saved to a local folder on my computer.
How do I append each atom? Printing all atoms is very fast, and I could append them somewhere, but that wouldn't be a PDB structure file. What should I do?
from Bio.PDB import PDBIO, Select

class rep_atom_Select(Select):
    def accept_atom(self, atom):
        if atom.get_name() in rep_atoms:
            return 1
        else:
            return 0

def rep_atoms_pdb(input_pdb):
    io = PDBIO()
    io.set_structure(input_pdb)
    for model in input_pdb:
        for chain in model:
            for residue in chain:
                for atom in residue:
                    if atom.get_name() in rep_atoms:
                        print(atom)
    # dnr_only = io.save("dnr_only.pdb", rep_atom_Select())
Save after the loop, once, instead of thousands of times inside the loop.
def rep_atoms_pdb(input_pdb):
    my_atoms = list()
    for model in input_pdb:
        for chain in model:
            for residue in chain:
                for atom in residue:
                    if atom.get_name() in rep_atoms:  # or: if rep_atom_Select().accept_atom(atom):
                        my_atoms.append(atom)  # or something like this
    # The function returns the list of extracted atoms
    return my_atoms
Your definition of rep_atom_Select() does not seem to be directly compatible with this design, nor am I sure receiving the atoms as a list is actually what you want, but this should at least give you a nudge in the right direction.
Brief reading of the Bio.PDB.PDBIO documentation suggests that you might simply want to return the actual PDBIO object. I think something like this:
class rep_atom_Select(Select):
    def accept_atom(self, atom):
        if atom.get_name() in rep_atoms:
            return 1
        else:
            return 0

def rep_atoms_pdb(input_pdb):
    io = PDBIO()
    io.set_structure(input_pdb)
    return io
This is based on a very cursory reading of the documentation, but at least demonstrates how you would use your overridden class to select only some of the atoms in the input_pdb structure.
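If the goal is to avoid touching the disk entirely, one option (a minimal sketch, assuming rep_atoms and the rep_atom_Select class above are already defined; rep_atoms_pdb_to_buffer is just a made-up helper name) is to hand io.save() an in-memory text buffer instead of a filename, since PDBIO.save accepts an open file handle:
from io import StringIO
from Bio.PDB import PDBIO

def rep_atoms_pdb_to_buffer(input_pdb):
    # Write the filtered structure into an in-memory buffer instead of a file on disk
    io = PDBIO()
    io.set_structure(input_pdb)
    buffer = StringIO()
    io.save(buffer, rep_atom_Select())
    buffer.seek(0)
    return buffer  # buffer.getvalue() gives the PDB text; the buffer can also be passed around like a file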

How broadcast variables are used in dask parallelization

I have some code applying a map function to a dask bag. I need a lookup dictionary to apply that function, and it doesn't work with client.scatter.
I don't know if I am doing the right thing, because the workers start but they don't do anything. I have tried different configurations based on different examples, but I can't get it to work. Any support will be appreciated.
I know that in Spark you define a broadcast variable and access its content via variable.value inside the function you want to apply. I don't see the same with dask.
# Function to map
def transform_contacts_add_to_historic_sin(data, historic_dict):
    raw_buffer = ''
    line = json.loads(data)
    if line['timestamp'] > historic_dict['timestamp']:
        raw_buffer = raw_buffer + line['vid']
    return raw_buffer

# main program
# historic_dict is a dictionary previously filled, which is the lookup variable for the map function
# file_records will be a list of json.dumps strings read from an S3 file
from distributed import Client
client = Client()

historic_dict_scattered = client.scatter(historic_dict, broadcast=True)

file_records = []
raw_data = s3_procedure.read_raw_file(... S3 file.......)
data = TextIOWrapper(raw_data)
for line in data:
    file_records.append(line)

bag_chunk = db.from_sequence(file_records, npartitions=16)
bag_transform = bag_chunk.map(lambda x: transform_contacts_add_to_historic(x), args=[historic_dict_scattered])
bag_transform.compute()
If your dictionary is small you can just include it directly
def func(partition, d):
    return ...

my_dict = {...}
b = b.map(func, d=my_dict)
If it's large then you might want to wrap it up in Dask delayed first
my_dict = dask.delayed(my_dict)
b = b.map(func, d=my_dict)
If it's very large then yes, you might want to scatter it first (though I would avoid this if things work out with either of the approaches above).
[my_dict] = client.scatter([my_dict])
b = b.map(func, d=my_dict)
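As an illustration of the first approach applied to a bag (a toy sketch with made-up data and a made-up lookup dictionary, not the poster's S3 pipeline):
import dask.bag as db

def add_offset(record, lookup):
    # the same lookup dict is passed to every call as a keyword argument
    return record + lookup.get(record, 0)

lookup = {1: 10, 2: 20}
bag = db.from_sequence([1, 2, 3], npartitions=2)
print(bag.map(add_offset, lookup=lookup).compute())  # [11, 22, 3]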

How to execute Report given the results of a previously executed Report in ABAP

My problem is the following:
I have one report called Y5000112.
My colleagues always execute it manually once with selection screen variant V1 and then execute it a second time with variant V2 adding the results of the first execution to the selection.
Those results in this case are PERNR.
My goal:
Automate this - execute that query twice with one click and automatically fill the PERNR selection of the second execution with the PERNR results of the first execution.
I found out how to trigger a report execution and then another one after it, and how to set it to a certain variant. [EDIT] After the first answer I got a bit further, but I still have no idea how to loop through my results and put them into the next report submit:
DATA: t_list TYPE TABLE OF abaplist.
*     lt_seltab TYPE TABLE OF rsparams,
*     ls_selline LIKE LINE OF lt_seltab.

SUBMIT Y5000114
  USING SELECTION-SET 'MA OPLAN TEST'
  EXPORTING LIST TO MEMORY
  AND RETURN.

CALL FUNCTION 'LIST_FROM_MEMORY'
  TABLES
    listobject = t_list
  EXCEPTIONS
    not_found  = 1
    OTHERS     = 2.

IF sy-subrc <> 0.
  WRITE 'Unable to get list from memory'.
ELSE.
* I want to fill ls_seltab here with all pernr (table pa0020) but I haven't got a clue how to do this
*  LOOP AT t_list.
*    WRITE /t_list.
*  ENDLOOP.
  SUBMIT Y5000114
*   WITH-SELECTION-TABLE ls_seltab
    USING SELECTION-SET 'MA OPLAN TEST2'
    AND RETURN.
ENDIF.
P.S.
I'm not very familiar with ABAP, so if I didn't provide enough information just let me know in the comments and I'll try to find out whatever you need to know in order to solve this.
Here's some imaginary JS code that expresses, very generally, what I'm trying to accomplish:
function submitAndReturnExport(Reportname, VariantName, OptionalPernrSelection)
{ ... return resultObject; }

var t_list = submitAndReturnExport("Y5000114", "MA OPLAN TEST");
var pernrArr = [];
for (var i in t_list)
{
    pernrArr.push(t_list[i]["pernr"]);
}
submitAndReturnExport("Y5000114", "MA OPLAN TEST2", pernrArr);
It's not as easy as it's supposed to be, so there won't be any one-line snippet. There is no standard way of getting results from a report. Try the EXPORTING LIST TO MEMORY clause, but consider that the report may need to be adapted:
SUBMIT [report_name]
  WITH SELECTION-TABLE [rspar_tab]
  EXPORTING LIST TO MEMORY
  AND RETURN.
The result of the above statement should be read from memory and adapted for output:
call function 'LIST_FROM_MEMORY'
  TABLES
    listobject = t_list
  EXCEPTIONS
    not_found = 1
    others    = 2.
if sy-subrc <> 0.
  message 'Unable to get list from memory' type 'E'.
endif.

call function 'WRITE_LIST'
  TABLES
    listobject = t_list
  EXCEPTIONS
    empty_list = 1
    others     = 2.
if sy-subrc <> 0.
  message 'Unable to write list' type 'E'.
endif.
Another (and more efficient, IMHO) approach is to gain access to the resulting grid via the class cl_salv_bs_runtime_info. See the example here.
P.S. Executing the same report with different parameters which are mutually dependent (output parameters of the 1st iteration = input parameters for the 2nd) is definitely a bad design, and those manipulations should be done internally. If it were me, I'd rethink the whole architecture of the report.

TypeError when attempting to parse pubmed EFetch

I'm new to this Python/Biopython stuff, so I am struggling to work out why the following code, pretty much lifted straight out of the Biopython Cookbook, isn't doing what I'd expect.
I'd have thought it would end up with the interpreter displaying two lists containing the same numbers, but all I get is one list and then a message saying TypeError: 'generator' object is not subscriptable.
I'm guessing something is going wrong with the Medline.parse step and the result of the efetch isn't being processed in a way that allows subsequent iteration to extract the PMID values. Or the efetch isn't returning anything.
Any pointers as to what I'm doing wrong?
Thanks
from Bio import Medline
from Bio import Entrez

Entrez.email = 'A.N.Other#example.com'

handle = Entrez.esearch(db="pubmed", term="biopython")
record = Entrez.read(handle)
print(record['IdList'])

items = record['IdList']
handle2 = Entrez.efetch(db="pubmed", id=items, rettype="medline", retmode="text")
records = Medline.parse(handle2)
for r in records:
    print(records['PMID'])
You're trying to print records['PMID'] which is a generator. I think you meant to do print(r['PMID']) which will print the 'PMID' entry in the current record dictionary object for each iteration. This is confirmed by the example given in the Bio.Medline.parse() documentation.
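In other words, the corrected loop would look something like this (same variables as in the question):
for r in records:
    print(r['PMID'])  # r is one parsed Medline record, a dict-like object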

DBF Large Char Field

I have a database file that I believe was created with Clipper, but I can't say for sure (I have .ntx files for the indexes, which I understand is what Clipper uses). I am trying to create a C# application that will read this database using the System.Data.OleDb namespace.
For the most part I can successfully read the contents of the tables, but there is one field that I cannot. This field, called CTRLNUMS, is defined as CHAR(750). I have read various articles found through Google searches suggesting that fields larger than 255 chars have to be read through a different process than the normal assignment to a string variable. So far I have not been successful with any approach I have found.
The following is a sample code snippet I am using to read the table, and it includes the two options I used to read the CTRLNUMS field. Both options resulted in 238 characters being returned even though there are 750 characters stored in the field.
Here is my connection string:
Provider=Microsoft.Jet.OLEDB.4.0;Data Source=c:\datadir;Extended Properties=DBASE IV;
Can anyone tell me the secret to reading larger fields from a DBF file?
using (OleDbConnection conn = new OleDbConnection(connectionString))
{
    conn.Open();
    using (OleDbCommand cmd = new OleDbCommand())
    {
        cmd.Connection = conn;
        cmd.CommandType = CommandType.Text;
        cmd.CommandText = string.Format("SELECT ITEM,CTRLNUMS FROM STUFF WHERE ITEM = '{0}'", stuffId);
        using (OleDbDataReader dr = cmd.ExecuteReader())
        {
            if (dr.Read())
            {
                stuff.StuffId = dr["ITEM"].ToString();

                // OPTION 1
                string ctrlNums = dr["CTRLNUMS"].ToString();

                // OPTION 2
                char[] buffer = new char[750];
                int index = 0;
                int readSize = 5;
                while (index < 750)
                {
                    long charsRead = dr.GetChars(dr.GetOrdinal("CTRLNUMS"), index, buffer, index, readSize);
                    index += (int)charsRead;
                    if (charsRead < readSize)
                    {
                        break;
                    }
                }
            }
        }
    }
}
You can find a description of the DBF structure here: http://www.dbf2002.com/dbf-file-format.html
What I think Clipper used to do was modify the Field structure so that, in Character fields, the Decimal Places held the high-order byte of the size, so Character field sizes were really 256*Decimals+Size.
I may have a C# class that reads DBFs natively (not via ADO/DAO); it could be modified to handle this case. Let me know if you're interested.
Are you still looking for an answer? Is this a one-off job or something that needs doing regularly?
I have a Python module that is primarily intended to extract data from all kinds of DBF files ... it doesn't yet handle the length_high_byte = decimal_places hack, but it's a trivial change. I'd be quite happy to (a) share this with you and/or (b) get a copy of such a DBF file for testing.
Added later: Extended-length feature added, and tested against files I've created myself. Offer to share code with anyone who would like to test it still stands. Still interested in getting some "real" files myself for testing.
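For anyone curious what that hack looks like in practice, here is a minimal Python sketch (a standalone illustration based on the DBF layout described at the link above, not the module mentioned; read_dbf_fields is a made-up name) that reads the field descriptors and applies the 256 * decimals + size rule for character fields:
import struct

def read_dbf_fields(path):
    # Read the DBF field descriptors, applying the Clipper long-character-field
    # convention: for 'C' fields the decimal-count byte holds the high byte of the size.
    with open(path, "rb") as f:
        header = f.read(32)
        header_size = struct.unpack("<H", header[8:10])[0]  # bytes 8-9: total header length
        n_fields = (header_size - 33) // 32                  # 32-byte descriptors plus 0x0D terminator
        fields = []
        for _ in range(n_fields):
            desc = f.read(32)
            name = desc[:11].split(b"\x00", 1)[0].decode("ascii", "replace")
            ftype = chr(desc[11])
            size, decimals = desc[16], desc[17]
            if ftype == "C":
                size, decimals = 256 * decimals + size, 0
            fields.append((name, ftype, size, decimals))
        return fields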
3 suggestions that might be worth a shot...
1 - Use Access to create a linked table to the DBF file, then use .NET to hit the table in the Access database instead of going directly to the DBF.
2 - Try the FoxPro OLE DB provider.
3 - Parse the DBF file by hand. An example is here.
My guess is that #1 should work the easiest, and #3 will give you the opportunity to fine-tune your cussing skills. :)
