Biopython PDBIO assembly chain IDs

I am using Bio.PDB to parse structures in mmCIF and PDB format. I realised that PDBIO does not deal well with two-character chain identifiers (like ‘AA’ or ‘AB’) found in assembly structures. I have made a slight change to the code that works for me. Attached you will find the modified PDBIO module. Basically, it checks the length of the chain identifier string and adds a space in front of it if it is a single character. The formatting string is modified accordingly.
These are my changes in the Bio.PDB.PDBIO module. Please consider including them in a future update.
Modified:
_ATOM_FORMAT_STRING = "%s%5i %-4s%c%3s%s%4i%c %8.3f%8.3f%8.3f%s%6.2f %4s%2s%2s\n"
Modified:
for chain in model.get_list():
    if not select.accept_chain(chain):
        continue
    chain_id = chain.get_id()
    if len(chain_id) == 1:  # Added line
        chain_id = ' {}'.format(chain_id)  # Added line
Modified (the TER record's chain column changed from %c to %s):
fp.write("TER %5i %3s %s%4i%c \n" % ...)
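The padding idea can also be written with str.rjust; the snippet below is only an illustration of the logic (the function name is mine, not part of PDBIO):

```python
def pad_chain_id(chain_id):
    """Right-justify the chain id to two columns, so one-character ids
    keep their usual position in the fixed-width record and two-character
    ids ('AA', 'AB', ...) fit without shifting the rest of the line."""
    return chain_id.rjust(2)

print(repr(pad_chain_id("A")))   # ' A'
print(repr(pad_chain_id("AA")))  # 'AA'
```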

Stack Overflow is a site to ask questions. What you are proposing is a change to the Biopython software. Luckily, Biopython is open source, so you can create a pull request to get your change added to the software.
Go to https://github.com/biopython/biopython/blob/master/Bio/PDB/PDBIO.py
Click on the pencil (edit) icon in the top right corner. This will create a fork of the Biopython repository in your account.
Make the changes you mentioned above in your fork, and add a title and a description.
Click on "Propose file change". You can now visually compare your modifications side by side.
If everything looks OK, click on "Create pull request". This will send a pull request to the master branch of the Biopython repository, where it will be reviewed. If the authors of the Biopython software agree that this is a useful change, they will merge it into the software.

Related

Isabelle's document preparation

I would like to obtain the LaTeX code associated with this theory. Previous answers only provide links to the documentation. Let me describe what I did.
I went to the directory of Hales.thy and executed isabelle mkroot, followed by isabelle build -D ., which generated a directory named document and a *.pdf file that was suspiciously (nearly) empty. Modifying this command by adding Hales.thy as a parameter didn't succeed either.
I would appreciate it if someone could briefly describe the commands needed.
As a precaution, copy the file Hales.thy into a new directory that does not contain any other files and run isabelle mkroot again.
If I understand correctly, your theory contains sorry. In this case, for the build to succeed you need to enable the quick_and_dirty mode. For this, before the first occurrence of sorry in your theory file, you need to insert declare [[quick_and_dirty=true]].
Your theory contains raw text that is not suitably formatted. Try replacing the relevant lines with the following: text‹The case \<^text>‹t^2 = 1› corresponds to a product of intersecting lines which cannot be a group› and text‹The case \<^text>‹t = 0› corresponds to a circle which has been treated before›.
Once this is done, you should be able to use the ROOT file in the appendix below. As you can see, I have specified the theory file explicitly and also added the relevant imported sessions.
Appendix
session Hales = HOL +
  options [document = pdf, document_output = "output"]
  sessions
    "HOL-Library"
    "HOL-Algebra"
  theories
    "Hales"
  document_files
    "root.tex"

Jira import issues via CSV - Embedded newlines in text fields are not preserved

This is a cross-post from the Atlassian Jira forum.
I am migrating issues from one Jira instance (the source) to another Jira instance (the destination).
I cannot use the Project Configurator add-on since the source & destination versions are different.
I am exporting issues to a CSV and then importing the CSV to the destination.
I have several multi-line text fields in which the data contains newlines.
The CSV is created correctly (the data in the columns is enclosed in double quotes to protect the embedded CR/LF).
See the Jira reference
However, after the successful import via CSV to the destination Jira, the CR/LF are gone and the text field contains all lines concatenated.
The source field:
MyValue-1
MyValue-2
MyValue-3
Data in the export/import CSV:
"MyValue-1<CR><LF>
MyValue-2<CR><LF>
MyValue-3<CR><LF>
"
Destination field:
MyValue-1MyValue-2MyValue-3
Am I doing something wrong here?
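As a sanity check that the CSV itself is not the problem, the round trip can be simulated with Python's csv module (purely an illustration; Jira is not involved here):

```python
import csv
import io

# A field containing embedded CR/LF, as exported to the CSV.
value = "MyValue-1\r\nMyValue-2\r\nMyValue-3"

# Write one row; the csv module quotes the field because of the newlines.
buf = io.StringIO(newline="")
csv.writer(buf).writerow(["KEY-1", value])

# Read it back: a standards-compliant CSV parser returns the field intact.
row = next(csv.reader(io.StringIO(buf.getvalue(), newline="")))
assert row[1] == value  # the embedded newlines survive the round trip
```

Since any compliant parser preserves the quoted newlines, the concatenation must happen on the importing side.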
There is another add-on called Configuration Manager (by Botron); it's on the Marketplace. The add-on is helpful. Maybe you can give it a try.
Thanks
I reached out to Atlassian support and they came back with the answer that it's a known bug:
https://jira.atlassian.com/browse/JRASERVER-46365
In the bug description they suggest 2 workarounds. I checked the 1st one and it works!

Deedle - what's the schema format for readCsv

I was using Deedle in F# to read a txt file (no header) into a data frame, and cannot find any example of how to specify the schema.
let df= Frame.ReadCsv(datafile, separators="\t", hasHeaders=false, schema=schema)
I tried to give a string with names separated by ',', but it doesn't seem to work.
let schema = #"name, age, address";
I did some searching in the docs, but only found the following - I don't know where to find the info. :(
schema - A string that specifies CSV schema. See the documentation
for information about the schema format.
The schema format is the same as in the CSV type provider in F# Data.
The only problem (quite important!) is that the Deedle library had a bug where it completely ignored the schema parameter, so whatever you provided had no effect.
I just submitted a pull request that fixes the bug and also includes some examples (in the form of unit tests). See the pull request here (and click on "Files changed" to see the samples).
If you do not want to wait for a new release, just get the code from my GitHub fork and build it using build.cmd in the root (run this the first time to restore packages). The complete build requires a local installation of R (because it builds the R plugin too), but it should still produce Deedle.dll before failing. (After the first run of build.cmd, you can just use the Deedle.sln solution.)

Options for MeCab Japanese tokenizer on iOS?

I'm using the iPhone library for MeCab found at https://github.com/FLCLjp/iPhone-libmecab . I'm having some trouble getting it to tokenize all possible words. Specifically, I cannot tokenize "吉本興業" into two pieces "吉本" and "興業". Are there any options that I could use to fix this? The iPhone library does not expose anything, but it uses C++ underneath the objective-c wrapper. I assume there must be some sort of setting I could change to give more fine-grained control, but I have no idea where to start.
By the way, if anyone wants to tag this 'mecab' that would probably be appropriate. I'm not allowed to create new tags yet.
UPDATE: The iOS library is calling mecab_sparse_tonode2() defined in libmecab.cpp. If anyone could point me to some English documentation on that file it might be enough.
There is nothing iOS-specific in this. The dictionary you are using with mecab (probably ipadic) contains an entry for the company name 吉本興業. Although both parts of the name are listed as separate nouns as well, mecab has a strong preference to tag the compound name as one word.
Mecab lacks a feature that allows the user to choose whether or not compounds should be split into parts. Note that such a feature is generally hard to implement because not everyone agrees on which compounds can be split and which ones can't. E.g. is 容疑者 a compound made up of 容疑 and 者? From a purely morphological point of view perhaps yes, but for most practical applications probably no.
If you have a list of compounds you'd like to get segmented, a quick fix is to create a user dictionary for the parts they consist of, and make mecab use this in addition to the main dictionary.
There is Japanese documentation on how to do this here. For your particular example, it would involve the steps below.
Make a user dictionary with two entries, one for 吉本 and one for 興業:
吉本,,,100,名詞,固有名詞,人名,名,*,*,よしもと,ヨシモト,ヨシモト
興業,,,100,名詞,一般,*,*,*,*,こうぎょう,コウギョウ,コウギョウ
I suspect that both entries exist in the default dictionary already, but by adding them to a user dictionary and specifying a relatively low specificness indicator (I've used 100 for both -- the lower, the more likely to be split), you can get mecab to tend to prefer the parts over the whole.
Compile the user dictionary:
$> $MECAB/libexec/mecab/mecab-dict-index -d /usr/lib64/mecab/dic/ipadic -u mydic.dic -f utf-8 -t utf-8 ./mydic
You may have to adjust the command. The above assumes:
Mecab was installed from source in $MECAB. If you use mecab installed by a package manager, you might have difficulties finding the mecab-dict-index tool. Best install from source.
The default dictionary is in /usr/lib64/mecab/dic/ipadic. This is not part of the mecab package; it comes as a separate package (e.g. this), and you may have difficulties finding it, too.
mydic is the name of the user dictionary created in step 1. mydic.dic is the name of the compiled dictionary you'll get as output (needs not exist).
Both the system dictionary (-t option) and the user dictionary (-f option) are encoded in UTF-8. This may be wrong, in which case you'll get an error message later when you use mecab.
Modify the mecab configuration. In a system-wide installation, this is a file named /usr/lib64/mecab/dic/ipadic/dicrc or similar. In your case it may be located somewhere else. Add the following line to the end of the configuration file:
userdic = /home/myhome/mydic.dic
Make sure the absolute path to the dictionary compiled above is correct.
If you then run mecab against your input, it will split the compound into its parts (I tested it, using mecab 0.994 on a Linux system).
A more thorough fix would be to get the source of the default dictionary, manually remove all compound nouns you want split, and then recompile the dictionary. As a general remark, using a CJK tokenizer in a serious production application over a longer period of time usually involves a certain amount of regular dictionary maintenance (adding/removing entries).

How does one merge texts together? like a version control system would?

Wikipedia says this is pretty good: http://en.wikipedia.org/wiki/Merge_(revision_control)#Three-way_merge
But how does one implement that? or are there any gems / plugins for Ruby on Rails that will handle that for me?
My situation:
• I have base text
• changes from person A
• changes from person B
• both changes should be included and not overriding the other
any directions I could be pointed in? thanks!
I think you should look at the merge3 gem again [source].
This small example explains it:
require 'rubygems'
require 'merge3'
start = <<TEXT
This is the baseline.
The start.
The end.
TEXT
changed_A = <<TEXT
This is the baseline.
The start (changed by A).
The end.
TEXT
changed_B = <<TEXT
This is the baseline.
The start.
B added this line.
The end.
TEXT
result = Merge3::three_way(start, changed_A, changed_B)
puts result
The output it generates is:
This is the baseline.
The start (changed by A).
B added this line.
The end.
I am not sure how it handles merge conflicts, and since it is supposed to handle 3-way merges of files, it seems to be line-based. If that is a problem (as your example tries to compare simple strings), you could add a newline between every character.
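To illustrate what a line-based three-way merge does under the hood, here is a minimal sketch of my own (not merge3's actual implementation), built on Python's difflib; it assumes the two sides' edits do not overlap and raises otherwise:

```python
from difflib import SequenceMatcher

def edits(base_lines, other_lines):
    """Regions of base_lines that other_lines changed, as (start, end, replacement)."""
    matcher = SequenceMatcher(None, base_lines, other_lines)
    return [(i1, i2, other_lines[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]

def three_way_merge(base, ours, theirs):
    """Naive line-based three-way merge; raises if the two sides' edits overlap."""
    base_l = base.splitlines(keepends=True)
    changes = sorted(
        edits(base_l, ours.splitlines(keepends=True))
        + edits(base_l, theirs.splitlines(keepends=True))
    )
    out, pos = [], 0
    for i1, i2, rep in changes:
        if i1 < pos:
            raise ValueError("merge conflict: overlapping edits")
        out.extend(base_l[pos:i1])  # copy the unchanged region before this edit
        out.extend(rep)             # splice in the edited side's lines
        pos = i2
    out.extend(base_l[pos:])        # copy the unchanged tail
    return "".join(out)

base = "This is the baseline.\nThe start.\nThe end.\n"
version_a = "This is the baseline.\nThe start (changed by A).\nThe end.\n"
version_b = "This is the baseline.\nThe start.\nB added this line.\nThe end.\n"
print(three_way_merge(base, version_a, version_b))
```

Running it on the example from this question reproduces merge3's output: both the replacement from A and the inserted line from B land in the result.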
Hope this helps.
If you're storing versioned text that you want to be able to merge, then it sounds like you have a perfect use case for calling a version control system. Store the text in files and call the VCS for version control operations (perhaps the Git or Grit gems would be helpful).
