Biopython: how to extract only relevant atoms and save a PDB file (not locally)?

I am using Biopython. I have a list of atom names, rep_atoms = ["CA", "CB", "CD3"] (carbon atoms).
I want to keep only these atoms from any given PDB file. I don't want to write the result to disk; I want to keep it in memory, because I iterate many times.
I have arrived at the code below, but it saves the file locally and is very slow.
So my goal is: for each atom in the PDB, if its name is in rep_atoms, store it in a new_pdb that holds only that information, so that when I use it later in my code it behaves like a PDB file without ever being saved to a local folder on my computer.
How do I append each atom? Printing all atoms is very fast. I could append them to a list, but the result wouldn't be a PDB structure. What should I do?
from Bio.PDB import .... PDBIO, Select ....
class rep_atom_Select(Select):
    def accept_atom(self, atom):
        if atom.get_name() in rep_atoms:
            return 1
        else:
            return 0
def rep_atoms_pdb(input_pdb):
    io = PDBIO()
    io.set_structure(input_pdb)
    for model in input_pdb:
        for chain in model:
            for residue in chain:
                for atom in residue:
                    if atom.get_name() in rep_atoms:
                        print(atom)
                        # dnr_only = io.save("dnr_only.pdb", rep_atom_Select())

Save after the loop, once, instead of thousands of times inside the loop.
def rep_atoms_pdb(input_pdb):
    my_atoms = list()
    for model in input_pdb:
        for chain in model:
            for residue in chain:
                for atom in residue:
                    if atom.get_name() in rep_atoms:  # or if rep_atom_Select().accept_atom(atom):
                        my_atoms.append(atom)  # or something like this
    # The function returns the list of extracted atoms
    return my_atoms
Your definition of rep_atom_Select() does not seem to be directly compatible with this design, nor am I sure receiving the atoms as a list is actually what you want, but this should at least give you a nudge in the right direction.
A brief reading of the Bio.PDB.PDBIO documentation suggests that you might simply want to return the actual PDBIO object. I think something like this:
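class rep_atom_Select(Select):
    def accept_atom(self, atom):
        # Keep the atom only if its name is one of the representative atoms
        if atom.get_name() in rep_atoms:
            return 1
        else:
            return 0
def rep_atoms_pdb(input_pdb):
    # set_structure() is a method of PDBIO, not of the Select subclass
    io = PDBIO()
    io.set_structure(input_pdb)
    # The selector is applied at save time: io.save(target, rep_atom_Select())
    return io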
This is based on a very cursory reading of the documentation, but at least demonstrates how you would use your overridden class to select only some of the atoms in the input_pdb structure.
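To illustrate the "keep it in memory" part of the question, here is a minimal sketch that uses io.StringIO as a stand-in for a file on disk. It deliberately avoids Biopython and filters the fixed-column ATOM/HETATM records of the PDB text directly; the names REP_ATOMS, filter_pdb_text and make_atom_line are hypothetical and only serve the demonstration. PDBIO.save is documented to accept an open handle as well as a filename, so the same StringIO object could presumably be passed there together with rep_atom_Select().

```python
import io

REP_ATOMS = {"CA", "CB"}  # hypothetical set of atom names to keep


def filter_pdb_text(pdb_text, keep_names=REP_ATOMS):
    """Return PDB text holding only ATOM/HETATM records whose atom name is kept."""
    out = io.StringIO()  # in-memory file object: nothing is written to disk
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # Columns 13-16 of a PDB coordinate record hold the atom name
            if line[12:16].strip() in keep_names:
                out.write(line + "\n")
    out.write("END\n")
    return out.getvalue()


def make_atom_line(serial, name):
    # Demo helper only: builds a fixed-column ATOM record with dummy coordinates
    return ("ATOM  {:>5} {:^4} ALA A   1      11.104   6.134  -6.504"
            "  1.00  0.00").format(serial, name)


sample = "\n".join(
    make_atom_line(i, n) for i, n in enumerate(["N", "CA", "C", "CB"], start=1)
)
filtered = filter_pdb_text(sample)
print(filtered)
```

The returned string can be wrapped in io.StringIO again whenever later code expects a file-like object.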

Related

Read data from XLSX provided as XSTRING

An Excel file (.xlsx) is uploaded on the frontend which is UI5 Fiori.
The file contents come to SAP ABAP backend via ODATA in XSTRING format.
I need to store that XSTRING into an internal table and then in a DDIC table. Eg: Suppose the Excel has 5 columns then I want to store that data of 5 columns in the corresponding columns in the DDIC table.
I have tried various Function Modules like:
SCMS_XSTRING_TO_BINARY
SCMS_BINARY_TO_STRING
and following Classes & methods:
cl_bcs_convert=>raw_to_string
cl_soap_xml_helper=>xstring_to_string
but none were able to convert the XSTRING to STRING.
Can you please suggest which function module or class/method can be used to solve the problem?
For most comfort, use abap2xlsx.
If you cannot or do not want to use that, you can alternatively parse the Excel file on your own. .xlsx files are basically .zip files with a different file ending. Use cl_abap_zip->load to open the xstring you receive and ->get to extract the individual files from the zip. Afterwards, use XML parsers like cl_ixml or transformations to parse the XML content of the files.
Note that Excel's XML is a complicated file format, with several files that work together to form the worksheets. Refer to Microsoft's File format reference for Word, Excel, and PowerPoint for details. It's non-trivial to interpret this, so you will usually be a lot happier with abap2xlsx.
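The zip-plus-XML layout described above can be illustrated outside ABAP. The following is a minimal Python sketch, not a full reader: it assumes the workbook contains the conventional parts xl/sharedStrings.xml and xl/worksheets/sheet1.xml, ignores styles, dates and formulas, and uses the hypothetical helper names read_sheet and build_demo_xlsx. The demo archive holds only those two parts, so it is not a file Excel itself would open.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
NS = {"m": MAIN_NS}


def read_sheet(xlsx_bytes, sheet_path="xl/worksheets/sheet1.xml"):
    """Return the rows of one worksheet of an in-memory .xlsx as lists of strings."""
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as archive:
        sst = ET.fromstring(archive.read("xl/sharedStrings.xml"))
        sheet = ET.fromstring(archive.read(sheet_path))
    # One entry per <si>; join all <t> fragments so rich-text runs also work
    shared = [
        "".join(t.text or "" for t in si.iter("{%s}t" % MAIN_NS))
        for si in sst.findall("m:si", NS)
    ]
    rows = []
    for row in sheet.findall("m:sheetData/m:row", NS):
        values = []
        for cell in row.findall("m:c", NS):
            raw = cell.findtext("m:v", default="", namespaces=NS)
            # t="s" marks an index into sharedStrings.xml; otherwise keep the literal
            values.append(shared[int(raw)] if cell.get("t") == "s" else raw)
        rows.append(values)
    return rows


def build_demo_xlsx():
    # Demo only: a stripped-down archive with just the two parts read above
    shared_xml = ('<sst xmlns="%s"><si><t>MATNR</t></si>'
                  '<si><t>WERKS</t></si></sst>' % MAIN_NS)
    sheet_xml = ('<worksheet xmlns="%s"><sheetData>'
                 '<row r="1"><c r="A1" t="s"><v>0</v></c><c r="B1" t="s"><v>1</v></c></row>'
                 '<row r="2"><c r="A2"><v>4711</v></c><c r="B2"><v>1000</v></c></row>'
                 '</sheetData></worksheet>' % MAIN_NS)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("xl/sharedStrings.xml", shared_xml)
        archive.writestr("xl/worksheets/sheet1.xml", sheet_xml)
    return buffer.getvalue()


rows = read_sheet(build_demo_xlsx())
print(rows)
```

The header cells are stored as indexes into the shared string table, while the numeric cells carry their values directly, which is why both parts have to be read and mapped together.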
abap2xlsx is the most powerful and feature-rich way of doing this; as Florian said, it supports styles, charts, and complex tables. However, it may not always be available, for example because of system limitations or restrictions on installing custom packages.
Here is how to accomplish this with pure standard means, without custom frameworks.
Since NetWeaver 7.02, SAP supports the Microsoft Open XML formats natively and provides classes for handling them: CL_XLSX_DOCUMENT, CL_DOCX_DOCUMENT and CL_PPTX_DOCUMENT. abap2xlsx is built on these classes too. So let's start a bit of reinventing the wheel.
An XLSX file is an Open XML archive of files, of which the most interesting are sheet1.xml and sharedStrings.xml. Let's build a sample based on MARC table fields.
Now you want to transfer this table to an internal table with the same structure. The steps would be:
1. Extract the needed files from the XLSX archive
2. Read the worksheet structure from sheet1.xml
3. Read the sheet values from sharedStrings.xml
4. Map them together and write the result to the internal table
Here is a sample class that does the job. I used cl_openxml_helper to load the XLSX file, but you can receive the XLSX as an XSTRING in whatever way you like.
CLASS xlsx_reader DEFINITION.
  PUBLIC SECTION.
    TYPES: BEGIN OF ty_marc,
             matnr TYPE char20,
             werks TYPE char20,
             disls TYPE char20,
             ekgrp TYPE char20,
             dismm TYPE char20,
           END OF ty_marc,
           tt_marc TYPE STANDARD TABLE OF ty_marc WITH EMPTY KEY.
    METHODS: read RETURNING VALUE(tab) TYPE tt_marc,
             extract_xml IMPORTING index TYPE i
                                   xstring TYPE xstring
                         RETURNING VALUE(rv_xml_data) TYPE xstring.
ENDCLASS.
CLASS xlsx_reader IMPLEMENTATION.
  METHOD read.
    TYPES: BEGIN OF ty_row,
             value TYPE string,
             index TYPE abap_bool,
           END OF ty_row,
           BEGIN OF ty_worksheet,
             row_id TYPE i,
             row TYPE TABLE OF ty_row WITH EMPTY KEY,
           END OF ty_worksheet,
           BEGIN OF ty_si,
             t TYPE string,
           END OF ty_si.
    DATA: data TYPE TABLE OF ty_si,
          sheet TYPE TABLE OF ty_worksheet.
    TRY.
        DATA(xstring_xlsx) = cl_openxml_helper=>load_local_file( 'C:\marc.xlsx' ).
      CATCH cx_openxml_not_found.
    ENDTRY.
    "Read the sheet XML (the parameter is declared as index, not iv_xml_index)
    DATA(xml_sheet) = extract_xml( EXPORTING xstring = xstring_xlsx index = 2 ).
    "Read the data XML
    DATA(xml_data) = extract_xml( EXPORTING xstring = xstring_xlsx index = 3 ).
    TRY.
        " transforming structure into ABAP
        CALL TRANSFORMATION zsheet
          SOURCE XML xml_sheet
          RESULT root = sheet.
        " transforming data into ABAP
        CALL TRANSFORMATION zxlsx_data
          SOURCE XML xml_data
          RESULT root = data.
      CATCH cx_xslt_exception.
      CATCH cx_st_match_element.
      CATCH cx_st_ref_access.
    ENDTRY.
    " mapping structure and data
    LOOP AT sheet ASSIGNING FIELD-SYMBOL(<fs_row>).
      APPEND INITIAL LINE TO tab ASSIGNING FIELD-SYMBOL(<line>).
      LOOP AT <fs_row>-row ASSIGNING FIELD-SYMBOL(<fs_cell>).
        ASSIGN COMPONENT sy-tabix OF STRUCTURE <line> TO FIELD-SYMBOL(<fs_field>).
        CHECK sy-subrc = 0.
        <fs_field> = COND #( WHEN <fs_cell>-index = abap_false THEN <fs_cell>-value ELSE VALUE #( data[ <fs_cell>-value + 1 ]-t OPTIONAL ) ).
      ENDLOOP.
    ENDLOOP.
  ENDMETHOD.
  METHOD extract_xml.
    TRY.
        DATA(lo_package) = cl_xlsx_document=>load_document( iv_data = xstring ).
        DATA(lo_parts) = lo_package->get_parts( ).
        CHECK lo_parts IS BOUND AND lo_package IS BOUND.
        DATA(lv_uri) = lo_parts->get_part( 2 )->get_parts( )->get_part( index )->get_uri( )->get_uri( ).
        DATA(lo_xml_part) = lo_package->get_part_by_uri( cl_openxml_parturi=>create_from_partname( lv_uri ) ).
        rv_xml_data = lo_xml_part->get_data( ).
      CATCH cx_openxml_format cx_openxml_not_found.
    ENDTRY.
  ENDMETHOD.
ENDCLASS.
zsheet transformation:
<?sap.transform simple?>
<tt:transform xmlns:tt="http://www.sap.com/transformation-templates" template="main">
  <tt:root name="root"/>
  <tt:template name="main">
    <worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:x14ac=
    "http://schemas.microsoft.com/office/spreadsheetml/2009/9/ac" xmlns:xr="http://schemas.microsoft.com/office/spreadsheetml/2014/revision" xmlns:xr2="http://schemas.microsoft.com/office/spreadsheetml/2015/revision2" xmlns:xr3=
    "http://schemas.microsoft.com/office/spreadsheetml/2016/revision3">
      <tt:skip count="4"/>
      <sheetData>
        <tt:loop name="row" ref="root">
          <row>
            <tt:attribute name="r" value-ref="row_id"/>
            <tt:loop name="cells" ref="$row.ROW">
              <c>
                <tt:cond><tt:attribute name="t" value-ref="index"/><tt:assign to-ref="index" val="C('X')"/></tt:cond>
                <v><tt:value ref="value"/></v>
              </c>
            </tt:loop>
          </row>
        </tt:loop>
      </sheetData>
      <tt:skip count="2"/>
    </worksheet>
  </tt:template>
</tt:transform>
zxlsx_data transformation
<?sap.transform simple?>
<tt:transform xmlns:tt="http://www.sap.com/transformation-templates" template="main">
  <tt:root name="ROOT"/>
  <tt:template name="main">
    <sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
      <tt:loop name="line" ref=".ROOT">
        <si>
          <t>
            <tt:value ref="t"/>
          </t>
        </si>
      </tt:loop>
    </sst>
  </tt:template>
</tt:transform>
Here is how to call it:
START-OF-SELECTION.
DATA(reader) = NEW xlsx_reader( ).
DATA(marc) = reader->read( ).
The code is pretty self-explanatory, but here are a couple of notes:
The file sheet1.xml contains a special attribute t in each cell, which denotes whether the value should be treated as a literal or as a reference into sharedStrings.xml.
I used two simple transformations, but XSLT can be used as well, possibly allowing you to reduce all the XML handling to a single transformation.
I deliberately used generic char20 types to be able to handle headers. If you want to preserve native types, you cannot read the table header (skip the first line in the sheet LOOP), because you would get a type violation and a dump. If you receive the table without headers, it is fine to declare the structure with native types.
If you don't want to use transformations, sXML is your friend. You can parse the XML with classes as well, but ST transformations are considerably faster.
With some additional effort you can make this snippet dynamic and parse XLSX files with any structure.
You can read more about this approach in this doc.

Confirming existence of a string in an xml table Lua

Good afternoon everyone,
My problem is that I have two XML lists:
<List1> <Agency>String</Agency> </List1>
and
<List2><Agency2>String</Agency2></List2>.
In Lua I need to create a program that parses these lists. When the user inputs a string, the program needs to tell the user whether the string belongs to List1 or List2, or whether it does not exist in either. I'm new to Lua and to programming in general, and I would be very grateful for your answers. I have LuaExpat as a plugin, but I can't seem to read from a file; I can only do some beginner tricks when the XML list is written in the code. At a later time this small program will be fed by an RSS feed.
require("lxp")
local stuff = {}
xmldata="<Top><A/> <B a='1'/> <B a='2'/><B a='3'/><C a='3'/></Top>"
function doFunc(parser, name, attr)
  if not (name == 'B') then return end
  stuff[#stuff+1]= attr
end
local xml = lxp.new{StartElement = doFunc}
xml:parse(xmldata)
xml:close()
print(stuff[3].a)
This code is from a tutorial on the web and works fine: it prints 3. Now I want to know how to do the same from an actual file. If I use io.read:(file, "r" or "rb" ) in place of the xmldata variable and run the same thing, it returns either an empty string or nil.

Modify Lua Chunk Environment: Lua 5.2

It is my understanding that in Lua 5.2 environments are stored in upvalues named _ENV. This has made it really confusing for me to modify the environment of a chunk after loading it but before running it.
I would like to load a file with some functions and use the chunk to inject those functions into various environments. Example:
chunk = loadfile( "file" )
-- Inject chunk's definitions
chunk._ENV = someTable -- imaginary syntax
chunk( )
chunk._ENV = someOtherTable
chunk( )
Is this possible from within Lua? The only examples I can find of modifying this upvalue are with the C api (another example from C api), but I am trying to do this from within Lua. Is this possible?
Edit: I'm unsure of accepting answers using the debug library. The docs state that the functions may be slow. I'm doing this for efficiency so that entire chunks don't have to be parsed from strings (or a file, even worse) just to inject variable definitions into various environments.
Edit: Looks like this is impossible: Recreating setfenv() in Lua 5.2
Edit: I suppose the best way for me to do this is to bind a C function that can modify the environment. Though this is a much more annoying way of going about it.
Edit: I believe a more natural way to do this would be to load all chunks into separate environments. These can be "inherited" by any other environment by setting a metatable that refers to a global copy of a chunk. This does not require any upvalue modification post-load, but still allows for multiple environments with those function definitions.
The simplest way to allow a chunk to be run in different environments is to make this explicit and have it receive an environment. Adding this line at the top of the chunk achieves this:
_ENV=...
Now you can call chunk(env1) and later chunk(env2) at your pleasure.
There, no debug magic with upvalues.
Although it will be clear if your chunk contains that line, you can add it at load time, by writing a suitable reader function that first sends that line and then the contents of the file.
I do not understand why you want to avoid using the debug library, while you are happy to use a C function (neither is possible in a sandbox.)
It can be done using debug.upvaluejoin:
function newEnvForChunk(chunk, index)
  local newEnv = {}
  local function source() return newEnv end
  debug.upvaluejoin(chunk, 1, source, 1)
  if index then setmetatable(newEnv, {__index=index}) end
  return newEnv
end
Now load any chunk like this:
local myChunk = load "print(x)"
It will initially inherit the enclosing _ENV. Now give it a new one:
local newEnv = newEnvForChunk(myChunk, _ENV)
and insert a value for 'x':
newEnv.x = 99
Now when you run the chunk, it should see the value for x:
myChunk()
=> 99
If you don't want to modify your chunk (per LHF's great answer) here are two alternatives:
Set up a blank environment, then dynamically change its environment to yours
function compile(code)
  local meta = {}
  local env = setmetatable({},meta)
  return {meta=meta, f=load('return '..code, nil, nil, env)}
end
function eval(block, scope)
  block.meta.__index=scope
  return block.f()
end
local block = compile('a + b * c')
print(eval(block, {a=1, b=2, c=3})) --> 7
print(eval(block, {a=2, b=3, c=4})) --> 14
Set up a blank environment, and re-set its values with your own each time
function compile(code)
  local env = {}
  return {env=env, f=load('return '..code, nil, nil, env)}
end
function eval(block, scope)
  for k,_ in pairs(block.env) do block.env[k]=nil end
  for k,v in pairs(scope) do block.env[k]=v end
  return block.f()
end
local block = compile('a + b * c')
print(eval(block, {a=1, b=2, c=3})) --> 7
print(eval(block, {a=2, b=3, c=4})) --> 14
Note that if micro-optimizations matter, the first option is about 2✕ as slow as the _ENV=... answer, while the second option is about 8–9✕ as slow.

How can you join two or more dictionaries created by Bio.SeqIO.index?

I would like to be able to join the two "dictionaries" stored in "indata" and "pairdata", but this code,
indata = SeqIO.index(infile, infmt)
pairdata = SeqIO.index(pairfile, infmt)
indata.update(pairdata)
produces the following error:
indata.update(pairdata)
TypeError: update() takes exactly 1 argument (2 given)
I have tried using,
indata = SeqIO.to_dict(SeqIO.parse(infile, infmt))
pairdata = SeqIO.to_dict(SeqIO.parse(pairfile, infmt))
indata.update(pairdata)
which does work, but the resulting dictionaries take up too much memory to be practical for the sizes of infile and pairfile I have.
The final option I have explored is:
indata = SeqIO.index_db(indexfile, [infile, pairfile], infmt)
which works perfectly, but is very slow. Does anyone know how/whether I can successfully join the two indexes from the first example above?
SeqIO.index returns a read-only dictionary-like object, so update will not work on it (apologies for the confusing error message; I just checked in a fix for that to the main Biopython repository).
The best approach is to either use index_db, which will be slower but only needs to index the file once, or to define a higher level object which acts like a dictionary over your multiple files. Here is a simple example:
from Bio import SeqIO
class MultiIndexDict:
    def __init__(self, *indexes):
        self._indexes = indexes
    def __getitem__(self, key):
        # Try each index in turn and return the first hit
        for idx in self._indexes:
            try:
                return idx[key]
            except KeyError:
                pass
        raise KeyError("{0} not found".format(key))
indata = SeqIO.index("f001", "fasta")
pairdata = SeqIO.index("f002", "fasta")
combo = MultiIndexDict(indata, pairdata)
print(combo['gi|3318709|pdb|1A91|'].description)
print(combo['gi|1348917|gb|G26685|G26685'].description)
print(combo["key_failure"])  # raises KeyError: the key is in neither index
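If membership tests, iteration and len() are also needed, the same idea can be sketched on top of collections.abc.Mapping, which derives those operations from three methods. This is an illustrative variant, not part of Biopython: the class name MultiIndexMapping is hypothetical, and plain dicts stand in for the objects returned by SeqIO.index, on the assumption that those behave like read-only mappings that yield their keys when iterated.

```python
from collections.abc import Mapping


class MultiIndexMapping(Mapping):
    """Read-only view over several dictionary-like indexes, searched in order."""

    def __init__(self, *indexes):
        self._indexes = indexes

    def __getitem__(self, key):
        # The first index that knows the key wins
        for idx in self._indexes:
            try:
                return idx[key]
            except KeyError:
                continue
        raise KeyError(key)

    def __iter__(self):
        # Yield every key once, even if several indexes contain it
        seen = set()
        for idx in self._indexes:
            for key in idx:
                if key not in seen:
                    seen.add(key)
                    yield key

    def __len__(self):
        return sum(1 for _ in self)


# Plain dicts stand in for SeqIO.index(...) results in this demo
first = {"seq1": "ACGT", "seq2": "GGCC"}
second = {"seq2": "TTTT", "seq3": "AAAA"}
combo = MultiIndexMapping(first, second)

print(combo["seq3"])    # found in the second index
print(combo["seq2"])    # the first index takes precedence
print("seq9" in combo)  # membership comes from Mapping for free
print(sorted(combo), len(combo))
```

Note that __len__ walks every key, so on large indexes it is only worth calling occasionally.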
If you don't plan to use the index again and memory isn't a limitation (which both appear to be true in your case), you can tell Bio.SeqIO.index_db(...) to use an in-memory SQLite3 index with the special index name ":memory:" like so:
indata = SeqIO.index_db(":memory:", [infile, pairfile], infmt)
where infile and pairfile are filenames, and infmt is their format type as defined in Bio.SeqIO (e.g. "fasta").
This is actually a general trick with Python's SQLite3 library. For a small set of files this should be much faster than building the SQLite index on disk.
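The general sqlite3 trick can be seen in isolation with the standard library. The sketch below is only an illustration: the table name offsets and its columns are made up for the demo and are not the schema Biopython uses internally.

```python
import sqlite3

# ":memory:" asks SQLite for a database that lives only in RAM
conn = sqlite3.connect(":memory:")

# Hypothetical layout: which file a record is in and where it starts
conn.execute(
    "CREATE TABLE offsets (key TEXT PRIMARY KEY, file_number INTEGER, offset INTEGER)"
)
conn.executemany(
    "INSERT INTO offsets VALUES (?, ?, ?)",
    [("seq1", 0, 0), ("seq2", 0, 120), ("seq3", 1, 0)],
)

row = conn.execute(
    "SELECT file_number, offset FROM offsets WHERE key = ?", ("seq3",)
).fetchone()
print(row)

# Closing the connection discards the in-memory database
conn.close()
```

Because nothing is persisted, such an index has to be rebuilt on every run, which matches the "don't plan to use the index again" condition above.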

Recursive method not returning to previous line position in caller

I have been trying to find a solution to this for some time. I have found questions and answers on recursion but nothing that seemed to fit this particular situation.
I have written a class which should go through the given folder and all subfolders and rename files and folders if a particular search pattern is found.
Everything works as expected the replaceAllInDir gets called, it replaces files and folders if needed. The next step then is to do the same for all subfolders within the given folder.
So a subfolder gets identified and replaceAllInDir gets called from within itself. Let's assume the particular subfolder does not contain any subfolders. I would then expect that we return to the parent folder and continue looking for other subfolders. But instead control is not returned to the calling method in the parent and the program ends.
I am aware of other ways of solving the actual use case, but I cannot explain the behaviour of ruby.
require 'fileutils'
class MultiFileAndFolderRename
  attr_accessor :rootDir, :searchPattern, :replacePattern
  def initialize(rootDir, searchPattern, replacePattern)
    @rootDir = rootDir
    @searchPattern = searchPattern
    @replacePattern = replacePattern
  end
  def execute
    replaceAllInDir(@rootDir)
  end
  def getValidDirEntries(dir)
    dirList = Dir.entries(dir)
    dirList.delete('.')
    dirList.delete('..')
    dirList
  end
  def replaceAllInDir(currentDir)
    Dir.chdir(currentDir)
    puts "Processing directory: " + Dir.pwd
    dirList = getValidDirEntries(currentDir)
    dirList.each { |dirEntry|
      attemptRename(dirEntry)
    }
    dirList = getValidDirEntries(currentDir)
    dirList.each { |dirEntry|
      if File.directory?(dirEntry)
        newDir = currentDir + '\\' + dirEntry
        rntemp = MultiFileAndFolderRename.new(newDir, 'searchString', 'replaceString')
        rntemp.replaceAllInDir(newDir)
      end
    }
  end
  def attemptRename(dirEntry)
    if dirEntry.match(@searchPattern)
      newname = dirEntry.to_s.sub(@searchPattern, @replacePattern)
      FileUtils.mv(dirEntry.to_s, newname)
    end
  end
end
You have a bug. The first line of replaceAllInDir() is Dir.chdir(). chdir() changes the directory of the current process on a global scale. It's not call-stack dependent. So later when you move into a subdirectory and change into that, the change becomes permanent even if you return from the recursion.
You need to change back to the correct directory after any call to replaceAllInDir(). For example:
...
dirList.each { |dirEntry|
  if File.directory?(dirEntry)
    ....
    rntemp.replaceAllInDir(newDir)
    Dir.chdir(currentDir) # <- Restore us back to the correct directory
  end
}
I have tried your code and found numerous errors in it. If you fix them, your idea should work.
At the end of a library like this you should include a part that allows it to be called from the shell: MultiFileAndFolderRename.new(ARGV[0], ARGV[1], ARGV[2]).execute if __FILE__ == $0. This ensures that when you run the code from the shell with ruby rename.rb test old new, your class is instantiated and the search and replace patterns are set accordingly.
You shouldn't change the current directory, because that stops the line getValidDirEntries(currentDir) from working. If, for example, you call it for the directory test and then change your current directory to test, then getValidDirEntries('test'), now evaluated inside that directory, will not work as expected.
You should use forward slashes instead of the doubled backslashes, so that your code works on Linux and Mac OS X as well.
When you create the new instance of MultiFileAndFolderRename (which is not necessary), the arguments to the initializer are the wrong ones. Instead, you should use your current instance and just call self.replaceAllInDir(newDir) instead of rntemp = MultiFileAndFolderRename.new(newDir, 'searchString', 'replaceString');rntemp.replaceAllInDir(newDir).
I think the wrong instantiation is the main reason why it does not work as expected, but the others should be fixed as well.
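The advice to avoid changing the working directory can be illustrated with a short sketch. It is written in Python rather than Ruby purely for illustration, and it simplifies the original: rename_all is a hypothetical name, and it matches a plain substring instead of a regular expression. The key point is that every path is built from the directory being visited, so no process-wide state has to be restored after a recursive step.

```python
import os
import tempfile


def rename_all(root_dir, search, replace):
    """Rename files and folders under root_dir whose names contain `search`."""
    renamed = []
    # topdown=False visits children before their parent, so renaming a
    # directory never invalidates a path that is still to be visited
    for current_dir, dir_names, file_names in os.walk(root_dir, topdown=False):
        for name in dir_names + file_names:
            if search in name:
                old_path = os.path.join(current_dir, name)
                new_path = os.path.join(current_dir, name.replace(search, replace))
                os.rename(old_path, new_path)
                renamed.append(new_path)
    return renamed


cwd_before = os.getcwd()
with tempfile.TemporaryDirectory() as root:
    # Build a small nested tree: old_a/old_b/old_file.txt
    os.makedirs(os.path.join(root, "old_a", "old_b"))
    with open(os.path.join(root, "old_a", "old_b", "old_file.txt"), "w"):
        pass
    rename_all(root, "old", "new")
    result_exists = os.path.exists(
        os.path.join(root, "new_a", "new_b", "new_file.txt")
    )

# The working directory was never touched
print(result_exists, os.getcwd() == cwd_before)
```

The same structure carries over to Ruby by joining paths with File.join and passing full paths to FileUtils.mv instead of calling Dir.chdir.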
