Documentation of Moses (statistical machine translation) moses.ini file format? - machine-learning

Is there any documentation of the moses.ini format for Moses? Running moses at the command line without arguments returns available feature names but not their available arguments. Additionally, the structure of the .ini file is not specified in the manual that I can see.

The main idea is that the file contains settings that will be used by the translation model. Thus, the documentation of values and options in moses.ini should be looked up in the Moses feature specifications.
Here are some excerpts I found on the Web about moses.ini.
In the Moses Core, we have some details:
7.6.5 moses.ini All feature functions are specified in the [feature] section. It should be in the format:
* Feature-name key1=value1 key2=value2 .... For example, KENLM factor=0 order=3 num-features=1 lazyken=0 path=file.lm.gz
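The "Feature-name key1=value1 key2=value2" layout quoted above can be read with a few lines of Python. This is only a minimal sketch under stated assumptions: the function names are hypothetical, it assumes section headings are written in square brackets and that lines starting with # are comments, and it is not part of Moses itself.

```python
def parse_feature_line(line):
    """Split 'Name key1=value1 key2=value2' into (name, {key: value})."""
    name, *pairs = line.split()
    args = {}
    for pair in pairs:
        # partition() tolerates a stray token without '=' (its value is '')
        key, _, value = pair.partition("=")
        args[key] = value
    return name, args


def read_features(ini_text):
    """Collect every line of the [feature] section of a moses.ini text."""
    features = []
    section = None
    for raw in ini_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumed comment syntax
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "feature":
            features.append(parse_feature_line(line))
    return features


if __name__ == "__main__":
    sample = "[feature]\nKENLM factor=0 order=3 num-features=1 lazyken=0 path=file.lm.gz\n"
    for name, args in read_features(sample):
        print(name, args)
```

All values are kept as strings, since the excerpt does not say which types each feature expects.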
Also, there is a hint on how to print basic statistics about all components mentioned in the moses.ini.
Run the script
analyse_moses_model.pl moses.ini
This can be useful to set the order of mapping steps to avoid explosion of translation options or just to check that the model components are as big/detailed as we expect.
In the Center for Computational Language and EducAtion Research (CLEAR) Wiki, there is a sample file with some documentation:
Parameters
It is recommended to make an .ini file to store all of your settings.
input-factors
- Using factor model or not
mapping
- To use LM in memory (T) or read the file in hard disk directly (G)
ttable-file
- Indicate the num. of source-factor, num. of target-factor, num of score, and
the path to translation table file
lmodel-file
- Indicate the type using for LM (0:SRILM, 1:IRSTLM), using factor number, the order (n-gram) of LM, and the path to language model file
If it is not enough, there is another description on this page, see "Decoder configuration file" section
The sections
[ttable-file] and [lmodel-file] contain pointers to the phrase table
file and language model file, respectively. You may disregard the
numbers on those lines. For the time being, it's enough to know that
the last one of the numbers in the language model specification is the
order of the n-gram model.
The configuration file also contains some feature weights. Note that
the [weight-t] section has 5 weights, one for each feature contained
in the phrase table.
The moses.ini file created by the training process will not work with
your decoder without modification because it relies on a language
model library that is not compiled into our decoder. In order to make
it work, open the moses.ini file and find the language model
specification in the line immediately after the [lmodel-file] heading.
The first number on this line will be 0, which stands for SRILM.
Change it into 8 and leave the rest of the line untouched. Then your
configuration should work.
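The manual edit described in that excerpt (replace the first number on the line right after the [lmodel-file] heading) could be scripted roughly as follows. This is a sketch, not a Moses tool: the function name and the sample path are hypothetical, and it assumes the old-style layout the excerpt describes, with whitespace-separated fields on the line after the heading.

```python
def switch_lm_type(ini_text, new_type="8"):
    """Replace the first field on the line after [lmodel-file] with new_type."""
    lines = ini_text.splitlines()
    for i, line in enumerate(lines[:-1]):
        if line.strip() == "[lmodel-file]":
            fields = lines[i + 1].split()
            if fields:
                fields[0] = new_type  # e.g. 0 (SRILM) becomes 8
                lines[i + 1] = " ".join(fields)
            break
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file content; the path is a placeholder.
    sample = "[lmodel-file]\n0 0 3 /path/to/lm\n\n[weight-t]\n0.2\n"
    print(switch_lm_type(sample))
```

Only the first field is touched; the rest of the line is left as the excerpt advises.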

Related

Converting between M3 `loc` scheme and regular `loc` type?

The M3 Core module returns a sort of simplified loc representation in Rascal. For instance, a method in file MapParser might have the loc: |java+method:///MapParser/a()|.
However, this is evidently different from the other loc scheme I tend to see, which would look more or less like: |project://main-scheme/src/tests/MapParser.java|.
This wouldn't be a problem, except that some functions only accept one scheme or another. For instance, the function appendToFile(loc file, value V...) does not accept this scheme M3 uses, and will reject it with an error like: IO("Unsupported scheme java+method").
So, how can I convert between both schemes easily? I would like to preserve all information, like highlighted sections for instance.
Cheers.
There are two differences at play here.
Physical vs Logical Locations
java+method is a logical location, and project is a physical location. I think the best way to describe their difference is that a physical location describes the location of an actual file, or a subset of an actual file. A logical location describes the location of a certain entity in the context of a bigger model, for example a Java method in a Java class or project. Often logical locations can be mapped to a physical location, but that is not always true.
For m3 for example you can use resolveLocation from IO to get the actual offset in the file that the logical location points to.
Read-only vs writeable locations
Not all locations are writeable, and I don't think any logical location is. But there are also physical locations that are read-only. The error you are getting is generic in that sense.
Rascal does support writing in the middle of text files, but you most likely do not want to use appendToFile, as it will append after the location you point it to. Most likely you want to replace a section of the text with your new section, so a regular writeFile should work.
Some notes
Note that you would have to recalculate all the offsets in the file after every write. So the resolved physical locations for the logical locations would be outdated, as the file has changed since constructing the m3 model and its corresponding map between logical and physical locations.
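To see why the offsets go stale, here is a language-neutral illustration in Python (not Rascal, and not the M3 API). The names are hypothetical, and each span is an assumed (offset, length) pair into the file text. After a replacement, every span that starts after the edited region shifts by the change in length, and any span overlapping the edit can no longer be trusted.

```python
def replace_span(text, offset, length, new_text):
    """Replace text[offset:offset+length] with new_text."""
    return text[:offset] + new_text + text[offset + length:]


def shift_spans(spans, offset, length, new_length):
    """Recompute (offset, length) spans after an edit of the given region."""
    delta = new_length - length
    end = offset + length
    shifted = {}
    for name, (start, span_length) in spans.items():
        if start >= end:
            # entirely after the edit: move by the size difference
            shifted[name] = (start + delta, span_length)
        elif start + span_length <= offset:
            # entirely before the edit: unchanged
            shifted[name] = (start, span_length)
        # spans overlapping the edit are dropped as unreliable
    return shifted


if __name__ == "__main__":
    text = "aaa bbb ccc"
    spans = {"a": (0, 3), "b": (4, 3), "c": (8, 3)}
    new_text = replace_span(text, 4, 3, "BBBBB")
    new_spans = shift_spans(spans, 4, 3, len("BBBBB"))
    start, length = new_spans["c"]
    print(new_text[start:start + length])
```

Doing this bookkeeping by hand after every write is error-prone, which is one reason to rebuild the model or work on a parse tree instead.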
So for this use case, you might want to think of a better way. The nicest solution is to use a grammar, rewrite the parse trees of the file, and after rewriting overwrite the old file. Note that the most recent Java grammar shipped with Rascal is for Java 5, so this might be a bit more work than you would like. Perhaps frame your goal as a new Stack Overflow question, and we'll see what other options might be applicable.

How to set whitespace tokenizer on NER Model?

I am creating a custom NER model using CoreNLP 3.6.0.
My props are:
# location of the training file
trainFile = /home/damiano/stanford-ner.tsv
# location where you would like to save (serialize) your
# classifier; adding .gz at the end automatically gzips the file,
# making it smaller, and faster to load
serializeTo = ner-model.ser.gz
# structure of your training file; this tells the classifier that
# the word is in column 0 and the correct answer is in column 1
map = word=0,answer=1
# This specifies the order of the CRF: order 1 means that features
# apply at most to a class pair of previous class and current class
# or current class and next class.
maxLeft=1
# these are the features we'd like to train with
# some are discussed below, the rest can be
# understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
# word character ngrams will be included up to length 6 as prefixes
# and suffixes only
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
# the last 4 properties deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
I build with this command:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -prop /home/damiano/stanford-ner.prop
The problem appears when I use this model to retrieve the entities in a text file. The command is:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile file.txt
Where file.txt is:
Hello!
my
name
is
John.
The output is:
Hello/O !/O
my/O name/O is/O John/PERSON ./O
As you can see it split "Hello!" into two tokens. Same thing for "John."
I must use whitespace tokenizer.
How can I set it?
Why is CoreNLP splitting those words into two tokens?
You set your own tokenizer by specifying the classname to the tokenizerFactory flag/property:
tokenizerFactory = edu.stanford.nlp.process.WhitespaceTokenizer$WhitespaceTokenizerFactory
You can specify any class that implements the Tokenizer<T> interface, but the included WhitespaceTokenizer sounds like what you want. If the tokenizer has options, you can specify them with tokenizerOptions. For instance, here, if you also specify:
tokenizerOptions = tokenizeNLs=true
then the newlines in your input will be preserved in the output (for output options that don't always convert things into a one-token-per-line format).
Note: Options like tokenize.whitespace=true apply at the level of CoreNLP. They aren't interpreted (you get a warning saying that the option is ignored) if provided to individual components like CRFClassifier.
As Nikita Astrakhantsev notes, this isn't necessarily a good thing to do. Doing it at test time would only be correct if your training data is also whitespace separated; otherwise it will adversely affect performance. And having tokens like the ones you get from whitespace separation is bad for subsequent NLP processing such as parsing.
Upd. To use the whitespace tokenizer here, look at Christopher Manning's answer; tokenize.whitespace=true is a CoreNLP-level option and is ignored when passed to CRFClassifier directly.
However, to answer your second question (why CoreNLP splits those words into two tokens): the default tokenizer, PTBTokenizer, separates punctuation from words. I'd suggest keeping it, because it simply gives better results. The usual reason to switch to whitespace tokenization is a high demand for processing speed or (usually: and) a low demand for tokenization quality.
Since you are going to use it for further NER, I doubt that this is your case.
Even in your example, if you have the token John. after tokenization, it cannot be matched by a gazetteer or by training examples.
More details and reasons why tokenization isn't that simple can be found here.

Stanford NLP - Using Parsed or Tagged text to generate Full XML

I'm trying to extract data from the PennTreeBank, Wall Street Journal corpus. Most of it already has the parse trees, but some of the data is only tagged.
i.e. wsj_DDXX.mrg and wsj_DDXX.pos files.
I would like to use the already parsed trees and tagged data in these files so as not to use the parser and taggers within CoreNLP, but I still want the output file format that CoreNLP gives; namely, the XML file that contains the dependencies, entity coreference, and the parse tree and tagged data.
I've read many of the java docs but I cannot figure out how to get it the way I described.
For POS, I tried using the LexicalizedParser and it allows me to use the tags, but I can only generate an XML file with some of the information I want; there is no option for coreference or for generating the parse trees. To get it to correctly generate the sub-optimal XML files here, I had to write a script to get rid of all of the brackets within the files. This is the command I use:
java -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat typedDependenciesCollapsed,wordsAndTags -outputFilesExtension xml -outputFormatOptions xml -writeOutputFiles -outputFilesDirectory my\dir -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz my\wsj\files\dir
I also can't generate the data I would like to have for the WSJ data that already has the trees. I tried using what is said here and I looked at the corresponding Javadocs. I used the command similar to what is described. But I had to write a python program to retrieve the stdout data resulting from analyzing each file and wrote it into a new file. This resulting data is only a text file with the dependencies and is not in the desired XML notation.
To summarize, I would like to use the POS and tree data from these PTB files in order to generate a CoreNLP parse corresponding to what would occur if I used CoreNLP on a regular text file. The pseudo command would be like this:
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -useTreeFile wsj_DDXX.mrg
and
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -usePOSFile wsj_DDXX.pos
Edit: fixed a link.
Yes, this is possible, but a bit tricky and there is no out of the box feature that can do this, so you will have to write some code. The basic idea is to replace the tokenize, ssplit and pos annotators (and in case you also have trees the parse annotator) with your code that loads these annotations from your annotated files.
On a very high level you have to do the following:
1. Load your trees with MemoryTreebank.
2. Loop through all the trees and, for each tree, create a sentence CoreMap to which you add:
   - a TokensAnnotation
   - a TreeAnnotation and the SemanticGraphCoreAnnotations
3. Create an Annotation object with a list containing the CoreMap objects for all sentences.
4. Run the StanfordCoreNLP pipeline with the annotators option set to lemma,ner,dcoref and the option enforceRequirements set to false.
Take a look at the individual annotators to see how to add the required annotations. E.g. there is a method in ParserAnnotatorUtils that adds the SemanticGraphCoreAnnotations.

Storing applicative version info in SPSS sav file

I'm using C SPSS I/O library to write and read sav files.
I need to store my own version number in sav file. The requirements are:
1) That version should not be visible to user when he/she uses regular SPSS programs.
2) Obviously, regular SPSS programs and the I/O module should not overwrite the number.
Please advise on a suitable place or function.
Regards,
There is a header field in the sav file that identifies the creator. However, that would be overwritten if the file is resaved. It would be visible with commands such as sysfile info.
Another approach would be to create a custom file attribute using a name that is unlikely to be used by anyone else. It would also be visible in a few system status commands such as DISPLAY DICT and, I think, CODEBOOK. It could be overwritten with the DATAFILE ATTRIBUTE command, but it would not be changed just by resaving the file.

Extensible toolkits or approaches to sniffing file formats from messy data?

Are there any frameworks out there to support file format sniffing using declarative, fuzzy schema and/or syntax definitions for the valid formats? I'm looking for something that can handle dirty or poorly formatted files, potentially across multiple versions of file format definitions/schemas, and make it easy to write rules- or pattern-based sniffers that make a best guess at file types based on introspection.
I'm looking for something declarative that allows you to define formats descriptively, maybe a DSL, something like:
format A, v1.0:
  is tabular
  has a "id" and "name" column
  may have a "size" column
    with integer values in 1-10 range
  is tab-delimited
  usually ends in .txt or .tab
format A, v1.1:
  is tabular
  has a "id" column
  may have a "name" column
  may have a "size" column
    with integer values in 1-10 range
  is tab- or comma-separated
  usually ends in .txt, .csv or .tab
The key is that the incoming files may be mis-formatted, either due to user error or poorly implemented export from other tools, and the classification may be non-deterministic. So this would need to support multiple, partial matching to format definitions, along with useful explanations. A simple voting scheme is probably enough to rank guesses (i.e. the more problems found, the lower the match score).
For example, given the above definitions, a comma-delimited "test.txt" file with an "id" column and "size" column with no values would result in a sniffer log something like:
Probably format A, v1.1
- but "size" column is empty
Possibly format A, v1.0
- but "size" column is empty
- but missing "name" column
- but is comma-delimited
The Sniffer functionality in the Python standard library is heading in the right direction, but I'm looking for something more general and extensible (and not limited to tabular data). Any suggestions on where to look for something like this?
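To make the voting idea concrete, here is a minimal Python sketch of the kind of scoring described above. It is not an existing framework: the FORMATS table, the function names, and the problem messages are all hypothetical, it only checks columns and delimiters (not the 1-10 range or file extensions), and it assumes non-empty input whose first line is a header row.

```python
import csv
import io

# Hypothetical declarative definitions mirroring the DSL sketch above.
FORMATS = {
    "A, v1.0": {"required": ["id", "name"], "optional": ["size"], "delimiters": ["\t"]},
    "A, v1.1": {"required": ["id"], "optional": ["name", "size"], "delimiters": ["\t", ","]},
}
DELIMITER_NAMES = {"\t": "tab", ",": "comma"}


def guess_delimiter(header_line):
    """Pick whichever known delimiter occurs most often in the header."""
    return max(DELIMITER_NAMES, key=header_line.count)


def sniff(text):
    """Return (format name, problems) pairs, fewest problems first."""
    header = text.splitlines()[0]
    delimiter = guess_delimiter(header)
    columns = header.split(delimiter)
    rows = list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
    results = []
    for name, spec in FORMATS.items():
        problems = []
        for col in spec["required"] + spec["optional"]:
            if col not in columns:
                if col in spec["required"]:
                    problems.append(f'missing "{col}" column')
            elif not any(row[col] for row in rows):
                problems.append(f'"{col}" column is empty')
        if delimiter not in spec["delimiters"]:
            problems.append(f"is {DELIMITER_NAMES[delimiter]}-delimited")
        results.append((name, problems))
    # simple voting scheme: the more problems found, the lower the rank
    return sorted(results, key=lambda item: len(item[1]))


if __name__ == "__main__":
    for name, problems in sniff("id,size\n1,\n2,\n"):
        print("format", name)
        for problem in problems:
            print(" - but", problem)
```

Because each check only appends a human-readable problem, the same list serves both as the score and as the explanation in the log.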
First of all, I am glad I have found this question - I am thinking of something similar too (declarative solution to markup any file format and feed it, along with the file itself, to the tool that can verify the file).
What you are naming "sniffer" is widely known as "file carver" and this person is big at carving: http://en.wikipedia.org/wiki/Simson_Garfinkel
Not only has he developed an outstanding carver, he has also provided definitions of the different cases of incomplete files.
So, if you are working on a repair tool for some particular file format, check the aforementioned classification to find out how complex the problem is. For example, carving from an incompletely received data stream and carving from a disk image differ significantly. Carving from a fragmented disk image would be far more difficult, whereas padding a video file with meaningless data, just so that a video player can open it, is easy: you just have to provide the correct format.
Hope it helped.
Regards
