I am using Lua 5.2. I am receiving large tables (one-dimensional arrays) of size 800,000, and I want to dump these tables quickly. I found a wiki article titled Save Table To File and used it, but found it not up to the mark. A sample table saved using this method, i.e. table.save(table, filename), is shared in my Dropbox here. (The file is too large to put here; approx. 8 MB.)
Since my primary concern is speed, I am ready to adopt binary serialization if such exists.
Are you bound to Lua 5.2? 5.3 introduced bitwise operators and built-in binary pack/unpack functions, string.pack and string.unpack (see chapter 13, “Bits and Bytes”, of Programming in Lua, 4th edition). There are also specific algorithms and recommendations for serializing tables in chapter 15, “Data Files and Serialization”. These chapters will be your best source of information for a proper implementation.
I'm investigating using Biopython to process PDB files, but it looks to me like the CONECT records are not handled. For instance, I want to use it to extract a ligand from a PDB structure, and I'm getting the atoms (HETATM) written but the corresponding CONECT records are lost. Is there a way to include these?
https://github.com/biopython/biopython/issues/3468:
Feature Request: Support for CONECT Records #3468 (Open)
zacharyrs opened this issue on 15 Jan · 4 comments
zacharyrs commented on 15 Jan
Hey all,
This isn't so much a feature request but a query to gauge interest.
I noticed that in BioJava there is built-in support within the PDBParser for CONECT records (see here).
In contrast, BioPython lacks this, with the only related issue being here.
I'm just curious if there would be interest in me PR'ing with an implementation akin to BioJava, or if this is something I should just handle independently?
Cheers!
JoaoRodrigues (Member) commented on 15 Jan
Hi @zacharyrs
Thanks for opening the issue. CONECT records are a PDB-specific feature and have no correspondence to the newer (and now standard) mmCIF format. That said, I don't think there are any objections to adding them to the current parser, if you are so inclined! The CONECT record format parsed by Biojava is not standard either, by the way. There are no fields for the type of bond in the original format specification.
zacharyrs (Author) commented on 17 Jan
So I've read into the source for BioJava CONECT record handling, and also how it handles mmCIF with the bonds list. I'm thinking of implementing a connects instance attribute on the structure, and only reading/writing the CONECTs per the PDB spec (no bond type information).
The biggest hold-up in this would be the copy methods - if you copy the structure you should retain the CONECTs, but if you copy a chain, residue, or atom I think you'd lose them, though this is similar to BioJava.
Old approach thoughts...
With a second day's thought and an implementation attempt, I've realised a few issues with the above (hidden) approaches.
My current approach (at zacharyrs:biopython/feature/conect_record_support) implements a Conect object, that takes two atom serial numbers, and the bond order.
When parsing, for each record I iterate through the bonded atoms and create individual Conect instances under the connects list attribute on the structure. For records that share a prior record's atom serial numbers, I just increment the bond order. When writing, I construct a temporary dictionary, as shown below, then loop through it and repeat records according to the bond order.
{
atom_a_serial: {
bond_order: [atom_b_serial]
}
}
This works to read and then write a PDB whilst preserving the CONECTs; however, it does not behave well if you modify the structure, as there's no link between the connects and the actual atoms. I could store the bond on the actual atoms, akin to mmCIF under BioJava, but it might just be adding a lot of bloat for people who don't need the CONECT records...
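For illustration, a minimal Python sketch of the write-time grouping described in this comment; the Conect class and the write_conect_records helper are hypothetical names used for this example, not the actual branch code or Biopython API:

# Hypothetical sketch of the Conect approach described above.
from collections import defaultdict

class Conect:
    # One bonded pair of atoms, identified by their PDB serial numbers.
    def __init__(self, serial_a, serial_b, order=1):
        self.serial_a = serial_a
        self.serial_b = serial_b
        self.order = order  # incremented when a later record repeats the pair

def write_conect_records(connects, out):
    # Group by first atom serial and bond order, mirroring the temporary
    # dictionary shown above, then repeat each line 'order' times.
    # (A full writer would also split lines after four bonded partners.)
    grouped = defaultdict(lambda: defaultdict(list))
    for c in connects:
        grouped[c.serial_a][c.order].append(c.serial_b)
    for serial_a, by_order in sorted(grouped.items()):
        for order, partners in by_order.items():
            for _ in range(order):
                out.write("CONECT%5d%s\n"
                          % (serial_a, "".join("%5d" % s for s in partners)))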
JoaoRodrigues (Member) commented on 22 Mar
Hi @zacharyrs
Sorry for missing this follow-up. Implementing CONECT statements properly is hard because you'd need to keep track of who is connected to whom. The simplest way is to add a dictionary of pairs of atom indexes, but if you delete or add atoms, you risk mangling this info. Also, again, mmCIF files do not have an equivalent of CONECT statements.
You could, although I'm not sure how much I personally like that approach, just have a "bonds" attribute per atom that stores references to the atom objects it is bonded to. That way you wouldn't need to worry about indexes, but you'd have to have a way to clean up these lists when you delete atoms; otherwise you would have references left behind and the garbage collector wouldn't be happy about it.
zacharyrs (Author) commented on 22 Mar
Hi @JoaoRodrigues - no worries, it's been a busy time for me too!
I ended up actually shifting to mmtf files, given they hold the bond information, so this has been dropped a little from my mind. I implemented a custom parser that retains all information and can serialise and deserialise easily, but that's a closed source project right now.
For the mmCIF/PDB approach, the bonds/conects attribute on the atom seems like the only feasible way to handle it. This is basically what I've done for mmtf files in my parser. I have a Bond instance that stores bond order and atom references, akin to the Conect instance described/implemented above. There's a helper on the Bond class that deletes it from both atoms, which I can then call as needed.
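As a rough Python sketch of that bonds-on-the-atom idea (the Bond class and the atom.bonds attribute are illustrative names, not existing Biopython or MMTF API):

class Bond:
    # A bond holding direct references to both atoms plus the bond order.
    def __init__(self, atom_a, atom_b, order=1):
        self.atoms = (atom_a, atom_b)
        self.order = order
        for atom in self.atoms:
            # each atom keeps a list of the bonds it participates in
            if not hasattr(atom, "bonds"):
                atom.bonds = []
            atom.bonds.append(self)

    def delete(self):
        # Remove this bond from both atoms so no dangling references are
        # left behind when an atom is removed from the structure.
        for atom in self.atoms:
            if self in getattr(atom, "bonds", []):
                atom.bonds.remove(self)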
This is for the case of calling Saxon from a Java application. I understand that Saxon can use XPath 3.1 to run queries against JSON files. A couple of questions on this:
Where is there an example of how to do this? I've done searches and found lots of answers on details of doing this, but nothing on how to read in the file and perform queries. Is it the same as XML?
Is it possible to have a schema file for the JSON so returned values are correctly typed? If so, how?
Is XQuery also able to perform queries on JSON?
What version of Saxon supports this? (We are using 9.9.1.1 and want to know if we need to upgrade.)
Technically, you don't run queries against JSON files; you run them against the data structure that results from parsing a JSON file, which is a structure of maps and arrays. You can parse the JSON file using the parse-json() or json-doc() functions, and then query the result using operators that work on maps and arrays. Some of these (and examples of their use) are shown in the spec at
https://www.w3.org/TR/xpath-31/#id-maps-and-arrays
Googling for "query maps arrays JSON XPath 3.1" finds quite a lot of useful material. Or get Priscilla Walmsley's book: http://www.datypic.com/books/xquery/chapter24.html
Data types: the data types of string, number, and boolean that are intrinsic to JSON are automatically recognized by their form. There's no capability to do further typing using a schema.
XQuery is a superset of XPath, but as far as JSON/Maps/Arrays are concerned, I think the facilities in XPath and those in XQuery are exactly the same.
Saxon has added a bit of extra conformance and performance in each successive release. 9.9 is pretty complete in its coverage; 10.0 adds some optimizations (like a new internal data structure for maps whose keys are all strings, such as you get when you parse JSON). Details of changes in successive Saxon releases are described in copious detail at http://www.saxonica.com/documentation/index.html#!changes
I'm not too familiar with machine learning techniques, and I want to know if I can transfer a final trained model to another machine. More specifically, I'm trying to solve a sound classification problem by training a model on a regular PC and then implementing/transferring its output model to an embedded system where no libraries are allowed (C programming). The system does not support file reading either.
So my question is: are there learning methods with output models simple enough that they can be implemented easily on other systems? How would you implement it? (Something like Q-learning? Although Q-learning wouldn't be appropriate in my project.)
I would like some pointers, thanks in advance.
Any arbitrary "blob" of data can be converted into a C byte array and compled and linked directly with your code. A code generator is simple enough to write, but there are tools that will do that directly such a Segger Bin2C (and any number of other tools called "bin2c") or the swiss-army knife of embedded data converters SRecord.
Since SRecord can do so many things, getting it to do this one thing is less than obvious:
srec_cat mymodel.nn -binary -o model.c -C-Array model -INClude
will generate a model.c and model.h file defining a data array containing the byte content of mymodel.nn.
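If you would rather roll your own generator than install SRecord, a minimal sketch in Python (file names and the array name are placeholders) might look like this:

# Read an arbitrary binary blob and emit a C source file containing it as a
# const byte array that can be compiled and linked with the firmware.
def bin2c(in_path, out_path, name="model"):
    with open(in_path, "rb") as f:
        data = f.read()
    with open(out_path, "w") as out:
        out.write("#include <stddef.h>\n\n")
        out.write("const unsigned char %s[] = {\n" % name)
        for i in range(0, len(data), 12):
            chunk = ", ".join("0x%02x" % b for b in data[i:i + 12])
            out.write("    %s,\n" % chunk)
        out.write("};\n")
        out.write("const size_t %s_len = sizeof(%s);\n" % (name, name))

bin2c("mymodel.nn", "model.c")

The generated model.c then defines model[] and model_len for the rest of the C code to reference, much like the array produced by the SRecord command above.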
Is there any documentation of the moses.ini format for Moses? Running moses at the command line without arguments returns the available feature names but not their available arguments. Additionally, the structure of the .ini file is not specified in the manual, as far as I can see.
The main idea is that the file contains settings that will be used by the translation model. Thus, the documentation of values and options in moses.ini should be looked up in the Moses feature specifications.
Here are some excerpts I found on the Web about moses.ini.
In the Moses Core documentation, there are some details:
7.6.5 moses.ini
All feature functions are specified in the [feature] section. It should be in the format:
Feature-name key1=value1 key2=value2 ...
For example: KENLM factor=0 order=3 num-features=1 lazyken=0 path=file.lm.gz
Also, there is a hint on how to print basic statistics about all components mentioned in the moses.ini.
Run the script
analyse_moses_model.pl moses.ini
This can be useful to set the order of mapping steps to avoid explosion of translation options or just to check that the model components are as big/detailed as we expect.
In the Center for Computational Language and EducAtion Research (CLEAR) Wiki, there is a sample file with some documentation:
Parameters
It is recommended to make an .ini file to store all of your settings.
input-factors
- Whether a factored model is used or not
mapping
- Whether to use the LM in memory (T) or read the file from the hard disk directly (G)
ttable-file
- Indicates the number of source factors, the number of target factors, the number of scores, and the path to the translation table file
lmodel-file
- Indicates the LM type (0: SRILM, 1: IRSTLM), the factor number used, the order (n-gram) of the LM, and the path to the language model file
If that is not enough, there is another description on this page; see the "Decoder configuration file" section:
The sections [ttable-file] and [lmodel-file] contain pointers to the phrase table file and language model file, respectively. You may disregard the numbers on those lines. For the time being, it's enough to know that the last of the numbers in the language model specification is the order of the n-gram model.
The configuration file also contains some feature weights. Note that the [weight-t] section has 5 weights, one for each feature contained in the phrase table.
The moses.ini file created by the training process will not work with your decoder without modification because it relies on a language model library that is not compiled into our decoder. In order to make it work, open the moses.ini file and find the language model specification in the line immediately after the [lmodel-file] heading. The first number on this line will be 0, which stands for SRILM. Change it to 8 and leave the rest of the line untouched. Then your configuration should work.
Parsec is designed to parse textual information, but it occurs to me that Parsec could also be suitable for parsing complex binary file formats that involve conditional segments, out-of-order segments, etc.
Is there a way to do this, or a similar alternative package that does? If not, what is the best way in Haskell to parse binary file formats?
The key tools for parsing binary files are:
Data.Binary
cereal
attoparsec
Binary is the most general solution, Cereal can be great for limited data sizes, and attoparsec is perfectly fine for e.g. packet parsing. All of these are aimed at very high performance, unlike Parsec. There are many examples on hackage as well.
You might be interested in AttoParsec, which was designed for this purpose, I think.
I've used Data.Binary successfully.
It works fine, though you might want to use Parsec 3, Attoparsec, or Iteratees. Parsec's reliance on String as its intermediate representation may bloat your memory footprint quite a bit, whereas the others can be configured to use ByteStrings.
Iteratees are particularly attractive because it is easier to ensure they won't hold onto the beginning of your input, and they can be fed chunks of data incrementally as they become available. This prevents you from having to read the entire input into memory in advance and lets you avoid other nasty workarounds like lazy IO.
The best approach depends on the format of the binary file.
Many binary formats are designed to make parsing easy (unlike text formats that are primarily to be read by humans). So any union data type will be preceded by a discriminator that tells you what type to expect, all fields are either fixed length or preceded by a length field, and so on. For this kind of data I would recommend Data.Binary; typically you create a matching Haskell data type for each type in the file, and then make each of those types an instance of Binary. Define the "get" method for reading; it returns a "Get" monad action which is basically a very simple parser. You will also need to define a "put" method.
On the other hand if your binary data doesn't fit into this kind of world then you will need attoparsec. I've never used that, so I can't comment further, but this blog post is very positive.