How to target HDF5 datasets as Snakemake input/output?

How can I tell Snakemake that the input/output of a rule should be an HDF5 dataset (so with its own specific "path" within the actual real HDF5 file path)?

Snakemake treats a rule's input and output simply as file path strings, regardless of the file type and content; it has no notion of a dataset inside an HDF5 file. Use the HDF5 file paths themselves in the input/output directives, and resolve the dataset path within the file in your rule's code, possibly using a function-as-input to locate the right file.
For a more useful answer, you should add some example code of what you are trying to do to your question.
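As an illustration, here is a minimal sketch of the function-as-input idea in plain Python. The sample names, file pattern and dataset paths are made up for the example: Snakemake itself only tracks the .h5 file, while the dataset path inside the file is resolved by the lookup function and used by the rule's own code.

```python
from types import SimpleNamespace

# Hypothetical mapping from sample name to the dataset path *inside* its HDF5 file.
H5_DATASETS = {
    "sampleA": "/group1/counts",
    "sampleB": "/group2/counts",
}

def h5_dataset(wildcards):
    """Return the internal HDF5 dataset path for a sample.

    In a Snakefile this could feed a params: directive, while the
    input/output directives name only the .h5 file, e.g. "data/{sample}.h5".
    """
    return H5_DATASETS[wildcards.sample]

# Snakemake passes a wildcards object; SimpleNamespace mimics it here.
print(h5_dataset(SimpleNamespace(sample="sampleA")))  # prints /group1/counts
```

The rule's script would then open the .h5 file (e.g. with h5py) and read the dataset at that internal path itself.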

Related

Stanford NLP - Using Parsed or Tagged text to generate Full XML

I'm trying to extract data from the PennTreeBank, Wall Street Journal corpus. Most of it already has the parse trees, but some of the data is only tagged.
i.e. wsj_DDXX.mrg and wsj_DDXX.pos files.
I would like to use the already parsed trees and tagged data in these files so as not to use the parser and taggers within CoreNLP, but I still want the output file format that CoreNLP gives; namely, the XML file that contains the dependencies, entity coreference, and the parse tree and tagged data.
I've read many of the Javadocs but I cannot figure out how to get the output in the format I described.
For POS, I tried using the LexicalizedParser, which allows me to use the tags, but I can only generate an XML file with some of the information I want; there is no option for coreference or for generating the parse trees. To get it to generate even these sub-optimal XML files, I had to write a script to remove all of the brackets from the files. This is the command I use:
java -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat typedDependenciesCollapsed,wordsAndTags -outputFilesExtension xml -outputFormatOptions xml -writeOutputFiles -outputFilesDirectory my\dir -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz my\wsj\files\dir
I also can't generate the data I would like for the WSJ data that already has the trees. I tried using what is said here and I looked at the corresponding Javadocs. I used a command similar to what is described there, but I had to write a Python program to capture the stdout from analyzing each file and write it into a new file. The resulting data is only a text file with the dependencies and is not in the desired XML notation.
To summarize, I would like to use the POS and tree data from these PTB files to generate a CoreNLP parse corresponding to what would occur if I used CoreNLP on a regular text file. The pseudo-command would be like this:
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -useTreeFile wsj_DDXX.mrg
and
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -usePOSFile wsj_DDXX.pos
Edit: fixed a link.
Yes, this is possible, but it is a bit tricky and there is no out-of-the-box feature that can do this, so you will have to write some code. The basic idea is to replace the tokenize, ssplit and pos annotators (and, in case you also have trees, the parse annotator) with your own code that loads these annotations from your annotated files.
On a very high level you have to do the following:

1. Load your trees with MemoryTreebank.
2. Loop through all the trees and, for each tree, create a sentence CoreMap to which you add:
   - a TokensAnnotation
   - a TreeAnnotation and the SemanticGraphCoreAnnotations
3. Create an Annotation object with a list containing the CoreMap objects for all sentences.
4. Run the StanfordCoreNLP pipeline with the annotators option set to lemma,ner,dcoref and the option enforceRequirements set to false.
Take a look at the individual annotators to see how to add the required annotations. E.g. there is a method in ParserAnnotatorUtils that adds the SemanticGraphCoreAnnotations.
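The steps above can be sketched roughly in pseudocode; the class and method names follow the CoreNLP API described here but should be verified against the Javadocs, and this is not runnable code:

```
// Sketch only - verify exact signatures in the CoreNLP Javadocs.
treebank = new MemoryTreebank()
treebank.loadPath("wsj_DDXX.mrg")

sentences = empty list
for each tree in treebank:
    sentence = new sentence CoreMap
    sentence.set(TokensAnnotation, tokens derived from the tree's leaves)
    sentence.set(TreeAnnotation, tree)
    // a helper in ParserAnnotatorUtils can also fill in
    // the SemanticGraphCoreAnnotations from the tree
    add sentence to sentences

annotation = new Annotation(document text)
annotation.set(SentencesAnnotation, sentences)

props = { annotators: "lemma,ner,dcoref" }
pipeline = new StanfordCoreNLP(props, enforceRequirements = false)
pipeline.annotate(annotation)   // then write the XML output
```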

Documentation of Moses (statistical machine translation) moses.ini file format?

Is there any documentation of the moses.ini format for Moses? Running moses at the command line without arguments returns available feature names but not their available arguments. Additionally, the structure of the .ini file is not specified in the manual that I can see.
The main idea is that the file contains the settings used by the translation model, so the documentation of values and options in moses.ini should be looked up in the Moses feature specifications.
Here are some excerpts I found on the web about moses.ini.
The Moses Core manual gives some details:
7.6.5 moses.ini

All feature functions are specified in the [feature] section. Each should be in the format:

Feature-name key1=value1 key2=value2 ...

For example:

KENLM factor=0 order=3 num-features=1 lazyken=0 path=file.lm.gz
Also, there is a hint on how to print basic statistics about all components mentioned in the moses.ini.
Run the script
analyse_moses_model.pl moses.ini
This can be useful to set the order of mapping steps to avoid explosion of translation options or just to check that the model components are as big/detailed as we expect.
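Putting the excerpt together, a hypothetical minimal fragment in this format might look as follows; the feature line is copied from the example above, while the name= key and the [weight] section with its value are additional assumptions about the format, with placeholder values:

```ini
[feature]
KENLM name=LM0 factor=0 order=3 num-features=1 lazyken=0 path=file.lm.gz

[weight]
LM0= 0.5
```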
In the Center for Computational Language and EducAtion Research (CLEAR) Wiki, there is a sample file with some documentation:
Parameters
It is recommended to make an .ini file to store all of your settings.
input-factors
- Whether a factored model is used or not
mapping
- Whether to use the LM in memory (T) or read the file from hard disk directly (G)
ttable-file
- Indicates the number of source factors, number of target factors, number of scores, and the path to the translation table file
lmodel-file
- Indicates the LM type (0: SRILM, 1: IRSTLM), the factor number used, the order (n-gram) of the LM, and the path to the language model file
If that is not enough, there is another description on this page; see the "Decoder configuration file" section:
The sections
[ttable-file] and [lmodel-file] contain pointers to the phrase table
file and language model file, respectively. You may disregard the
numbers on those lines. For the time being, it's enough to know that
the last one of the numbers in the language model specification is the
order of the n-gram model.
The configuration file also contains some feature weights. Note that
the [weight-t] section has 5 weights, one for each feature contained
in the phrase table.
The moses.ini file created by the training process will not work with
your decoder without modification because it relies on a language
model library that is not compiled into our decoder. In order to make
it work, open the moses.ini file and find the language model
specification in the line immediately after the [lmodel-file] heading.
The first number on this line will be 0, which stands for SRILM.
Change it into 8 and leave the rest of the line untouched. Then your
configuration should work.
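For instance, after training you might see a line like 0 0 3 /path/to/language-model.lm.gz under [lmodel-file], where the fields are LM type, factor, n-gram order, and path. Following the instructions above, you would change only the first field, leaving the rest untouched; the order and path here are placeholders:

```ini
[lmodel-file]
8 0 3 /path/to/language-model.lm.gz
```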

Storing applicative version info in SPSS sav file

I'm using the C SPSS I/O library to write and read .sav files.
I need to store my own version number in the .sav file. The requirements are:
1) The version number should not be visible to the user when he/she uses regular SPSS programs.
2) Obviously, regular SPSS programs and the I/O module should not overwrite the number.
Please advise on a suitable place or function for this.
There is a header field in the .sav file that identifies the creator. However, that would be overwritten if the file is resaved. It would be visible with commands such as SYSFILE INFO.
Another approach would be to create a custom file attribute with a name that is unlikely to be used by anyone else. It would also be visible in a few system status commands such as DISPLAY DICT and, I think, CODEBOOK. It could be overwritten with the DATASET ATTRIBUTE command, but would not be changed just by resaving the file.

How can I view the content of a .bin file in opennlp

I am trying to use OpenNLP in a project I am working on, and I am very new to it. I tried out the Named Entity Recognition with the training data available at http://opennlp.sourceforge.net/models-1.5/
However, I want to see the training data that was used, i.e. to actually open the .bin file and see its contents in English. Can someone please point me in the right direction?
I have tried to use UltraISO to read the .bin file, but I was not successful.
Use the Unix file command to find the file type, e.g. file en-token.bin. For most OpenNLP .bin files, it will tell you that they are just ZIP files.
The .bin file is actually the bytes of a serialized Java object representing a TokenNameFinder implementation called NameFinderME (ME meaning maximum entropy, which is the main multinomial logistic regression (ish) algorithm used in OpenNLP). You will not be able to see the training data by doing anything to this file.
Correction: it's not the name finder itself that is serialized, but its model (the TokenNameFinderModel).
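Building on the observation that most OpenNLP model files are ZIP archives, here is a small Python sketch that lists the entries of such a file. The entry names below are illustrative; a real model (e.g. en-token.bin) typically contains a manifest and the serialized model, but never the training data:

```python
import io
import zipfile

def list_bin_entries(data: bytes):
    """If the model file is a ZIP archive, return its entry names, else None."""
    buf = io.BytesIO(data)
    if not zipfile.is_zipfile(buf):
        return None  # some other serialized format
    with zipfile.ZipFile(buf) as zf:
        return zf.namelist()

# Demo with a synthetic "model" archive instead of a real en-token.bin:
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("manifest.properties", "Language=en\n")
    zf.writestr("token.model", b"\x00\x01")  # stand-in for the maxent model
print(list_bin_entries(demo.getvalue()))  # prints ['manifest.properties', 'token.model']
```

You can also simply rename a copy of the .bin to .zip and open it with any archive tool.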

Hyper directory - Can a (Linux) directory, in addition to a list of subdirectories and files, also contain text itself?

Under Linux, are there ways to add comments, description (text, rich text, hypertext .. ) to a directory itself, rather than by means of auxiliary files in such a directory, like README.txt, INSTALL.txt, NOTE_ON_WHY_WE_DID_THIS_THIS_WAY.txt, .. ?
In such a generalized directory, a directory entry (subdirectory/file) would be represented as a (hyper)link, at least in one view of such a generalized directory. A "classical directory view" may also be available for generalized directories, in which the comments and description mentioned above would be omitted, or be available through an auxiliary file. I am aware this may require either special formatting of the storage medium, or a software layer on top of a classical disk-formatting structure. The views would have to be derived from the generalized directory and not vice versa (in order to avoid consistency problems between the views).
Not in general, but some file systems have extended file attributes. You could use getfattr(1) and setfattr(1); see attr(5), listxattr(2), setxattr(2), etc.
AFAIK, few utilities use these extended file attributes (which surprises me; I would imagine that desktop environments would use them to store e.g. the MIME type of files, but they usually don't). There is a significant (file-system-specific) size limit on these extended attributes, e.g. 255 bytes.
A more practical and traditional way would be to store your additional metadata in some hidden file or directory (with a name starting with a dot, like the .git/ used by git).
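A tiny sketch of that hidden-file convention in Python; the file name .dirnote is an arbitrary choice for the example, not a standard:

```python
import os
import tempfile

NOTE_FILE = ".dirnote"  # arbitrary hidden-file name, not a standard

def set_dir_note(directory: str, text: str) -> None:
    """Attach a free-text note to a directory via a hidden file inside it."""
    with open(os.path.join(directory, NOTE_FILE), "w", encoding="utf-8") as f:
        f.write(text)

def get_dir_note(directory: str):
    """Return the directory's note, or None if it has none."""
    try:
        with open(os.path.join(directory, NOTE_FILE), encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        return None

# Demo in a temporary directory:
with tempfile.TemporaryDirectory() as d:
    set_dir_note(d, "Why we did this this way: ...")
    print(get_dir_note(d))  # prints the note just stored
```

A "classical directory view" then simply ignores the hidden file, matching the two-view idea in the question.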
I can't speak for all filesystems, but at least in extX, a directory contains only the names of the files/dirs in that directory, their inode numbers, and the offset at which the next (dir/file, inode) pair starts. Generally, the data describing a directory are kept in the inode structure (not inside the directory itself): for instance the owner, atime, ctime, extended attributes, number of links and so on. You can look at this structure in the kernel source; there is no field that allows putting "labels" on a file/dir. In theory you could use some "unused" fields of this structure, but only in theory, since the space there is very limited.
Interesting question, but I believe not. From what I remember, directories are just pointers to files and other directories, so I don't think it is possible to store text in them. Maybe if you re-engineered the whole filesystem...
