Stanford NLP - Using Parsed or Tagged text to generate Full XML - parsing

I'm trying to extract data from the PennTreeBank, Wall Street Journal corpus. Most of it already has the parse trees, but some of the data is only tagged.
i.e. wsj_DDXX.mrg and wsj_DDXX.pos files.
I would like to use the already parsed trees and tagged data in these files so as not to use the parser and taggers within CoreNLP, but I still want the output file format that CoreNLP gives; namely, the XML file that contains the dependencies, entity coreference, and the parse tree and tagged data.
I've read many of the java docs but I cannot figure out how to get it the way I described.
For POS, I tried using the LexicalizedParser and it allows me to use the tags, but I can only generate an XML file with the some of the information I want; there is no option for coreference or generating the parse trees. To get it to correctly generate the sub-optimal XML files here, I had to write a script to get rid of all of the brackets within the files. This is the command I use:
java -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat typedDependenciesCollapsed,wordsAndTags -outputFilesExtension xml -outputFormatOptions xml -writeOutputFiles -outputFilesDirectory my\dir -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz my\wsj\files\dir
I also can't generate the data I would like to have for the WSJ data that already has the trees. I tried using what is said here and I looked at the corresponding Javadocs. I used the command similar to what is described. But I had to write a python program to retrieve the stdout data resulting from analyzing each file and wrote it into a new file. This resulting data is only a text file with the dependencies and is not in the desired XML notation.
To summarize, I would like to use the POS and tree data from these PTB files in order to generate a CoreNLP parse corresponding to what would occur if I used CoreNLP on a regular text file. The pseudo command would be like this:
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -useTreeFile wsj_DDXX.mrg
and
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -usePOSFile wsj_DDXX.pos
Edit: fixed a link.

Yes, this is possible, but a bit tricky and there is no out of the box feature that can do this, so you will have to write some code. The basic idea is to replace the tokenize, ssplit and pos annotators (and in case you also have trees the parse annotator) with your code that loads these annotations from your annotated files.
On a very high level you have to do the following:
Load your trees with MemoryTreebank
Loop through all the trees and for each tree create a sentence CoreMap to which you add
a TokensAnnotation
a TreeAnnotation and the SemanticGraphCoreAnnotations
Create an Annotation object with a list containing the CoreMap objects for all sentences
Run the StanfordCoreNLP pipeline with the annotators option set to lemma,ner,dcoref and the option enforceRequirements set to false.
Take a look at the individual annotators to see how to add the required annotations. E.g. there is a method in ParserAnnotatorUtils that adds the SemanticGraphCoreAnnotations.

Related

Pentaho data integration spoon XML parsing

I have some 20k xml files in a path. I wanna access only one file at a item like a queue for XML stax step, not for XML input step. Is there any option to get one file at a time, process it and then go for next file. Example: like a for each loop.
You should use a job with two sub-transformations.
The first transformation has one “get file names” step followed by “copy rows to result”.
The parent job sets 2nd transformation to run for each input row and map the filename step from ktr 1 to filename parameter of ktr 2.
2nd transformation has a parameter filename, that is used to define which file to read with StaX parser.
You can Get the files names and give the result toe the Get data form XML, with the option XML source is define in a field.
One transformation with two steps, and automagically parallel. But if your solution needs to be scalable (more than just reading a item and writing it somewhere), I suggest you use nsouza's solution. In case an error happens, you'll have a chance to know which file is the guilty.

How to parse dynamic XML with a SAX parser

Scenario:
Large (dynamic) xml files being uploaded by users.
We need to map the xml to our own database structure.
We need to use a SAX parser (or something like it) because of memory issues when parsing large XML files.
We currently use https://github.com/craigambrose/sax_stream for parsing XML's that all have the same structure.
For a new feature, we need to parse XML with unknown contents.
How would one use a SAX parser when the xml nodes are different each time ?
I've tried using https://github.com/soulcutter/saxerator, especially the at_depth() function could come in handy to collect the elements at a certain depth, after that we could get the elements inside a node by using the for_tag() function. Based on this info we maybe could create a mapping on the fly
If a SAX parser isn't an option, are there any alternatives for parsing very large (dynamic) XML files?

Documentation of Moses (statistical machine translation) mose.ini file format?

Is there any documentation of the moses.ini format for Moses? Running moses at the command line without arguments returns available feature names but not their available arguments. Additionally, the structure of the .ini file is not specified in the manual that I can see.
The main idea is that the file contains settings that will be used by the translation model. Thus, the documentation of values and options in moses.ini should be looked up in the Moses feature specifications.
Here are some excerpt I found on the Web about moses.ini.
In the Moses Core, we have some details:
7.6.5 moses.ini All feature functions are specified in the [feature] section. It should be in the format:
* Feature-name key1=value1 key2=value2 .... For example, KENLM factor=0 order=3 num-features=1 lazyken=0 path=file.lm.gz
Also, there is a hint on how to print basic statistics about all components mentioned in the moses.ini.
Run the script
analyse_moses_model.pl moses.ini
This can be useful to set the order of mapping steps to avoid explosion of translation options or just to check that the model components are as big/detailed as we expect.
In the Center for Computational Language and EducAtion Research (CLEAR) Wiki, there is a sample file with some documentation:
Parameters
It is recommended to make an .ini file to storage all of your setting.
input-factors
- Using factor model or not
mapping
- To use LM in memory (T) or read the file in hard disk directly (G)
ttable-file
- Indicate the num. of source-factor, num. of target-factor, num of score, and
the path to translation table file
lmodel-file
- Indicate the type using for LM (0:SRILM, 1:IRSTLM), using factor number, the order (n-gram) of LM, and the path to language model file
If it is not enough, there is another description on this page, see "Decoder configuration file" section
The sections
[ttable-file] and [lmodel-file] contain pointers to the phrase table
file and language model file, respectively. You may disregard the
numbers on those lines. For the time being, it's enough to know that
the last one of the numbers in the language model specification is the
order of the n-gram model.
The configuration file also contains some feature weights. Note that
the [weight-t] section has 5 weights, one for each feature contained
in the phrase table.
The moses.ini file created by the training process will not work with
your decoder without modification because it relies on a language
model library that is not compiled into our decoder. In order to make
it work, open the moses.ini file and find the language model
specification in the line immediately after the [lmodel-file] heading.
The first number on this line will be 0, which stands for SRILM.
Change it into 8 and leave the rest of the line untouched. Then your
configuration should work.

Mahout: Importing CSV file to Sequence Files using regexconverter or arff.vector

I just started learning how to use mahout. I'm not a java programmer however, so I'm trying to stay away from having to use the java library.
I noticed there is a shell tool regexconverter. However, the documentation is sparse and non instructive. Exactly what does specifying a regex option do, and what does the transformer class and formatter class do? The mahout wiki is marvelously opaque. I'm assuming the regex option specifies what counts as a "unit" or so.
The example they list is of using the regexconverter to convert http log requests to sequence files I believe. I have a csv file with slightly altered http log requests that I'm hoping to convert to sequence files. Do I simply change the regex expression to take each entire row? I'm trying to run a Bayes classifier, similar to the 20 newsgroups example which seems to be done completely in the shell without need for java coding.
Incidentally, the arff.vector command seems to allow me to convert an arff file directly to vectors. I'm unfamiliar with arff, thought it seems to be something I can easily convert csv log files into. Should I use this method instead, and skip the sequence file step completely?
Thanks for the help.

to tag NE on multiple files using Stanford NER

I want to use Stanford NER to tag name entity in multiple files. In documentation it is said that we can use the option -testFiles with list of test files separated with commas but it does not work in my case like:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier
-loadClassifier ner-model.ser.gz -testFiles Test_file1.tsv,Test_file2.tsv
but it works when we input only one file.
Does system also have inline evaluation (FOR P, R) for all multiple files? I just wonder how it works in case of multiple files.
Thanks in advance.
Khadaka
You have to use prop.txt file to use multiple tsv files. Check this link
https://nlp.stanford.edu/software/crf-faq.html#mfiles
Below is the snippet from the NER FAQ page
How do I train one model from multiple files?
Instead of setting the trainFile property or flag, set the
trainFileList property or flag. Use a comma separated list of files.

Resources