to tag NE on multiple files using Stanford NER - named-entity-recognition

I want to use Stanford NER to tag named entities in multiple files. The documentation says that the -testFiles option accepts a comma-separated list of test files, but it does not work in my case:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier
-loadClassifier ner-model.ser.gz -testFiles Test_file1.tsv,Test_file2.tsv
It does work when I pass only one file.
Does the system also provide inline evaluation (precision and recall) across all of the files? I would like to know how it behaves with multiple files.
Thanks in advance.
Khadaka

You have to use a prop.txt file to work with multiple TSV files. See this link:
https://nlp.stanford.edu/software/crf-faq.html#mfiles
Below is the relevant snippet from the NER FAQ page:
How do I train one model from multiple files?
Instead of setting the trainFile property or flag, set the
trainFileList property or flag. Use a comma separated list of files.
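
Note that the FAQ entry quoted above is about training. Since the question reports that tagging a single file works, one possible workaround for testing is to run the classifier once per file. The following is a minimal Python sketch under that assumption: the jar and model names are taken from the question, `-testFile` is assumed to be the single-file flag that worked, and `build_commands` and `tag_files` are hypothetical helper names. Any evaluation figures the classifier reports would then be per file, not aggregated over all files.

```python
import subprocess

# Assumptions: jar and model names come from the question; -testFile is
# assumed to be the single-file flag that the question reports as working.
BASE_CMD = [
    "java", "-cp", "stanford-ner.jar",
    "edu.stanford.nlp.ie.crf.CRFClassifier",
    "-loadClassifier", "ner-model.ser.gz",
]


def build_commands(test_files):
    """Return one CRFClassifier command line per test file."""
    return [BASE_CMD + ["-testFile", path] for path in test_files]


def tag_files(test_files):
    """Run the classifier once per file and collect what it prints."""
    outputs = {}
    for path, cmd in zip(test_files, build_commands(test_files)):
        # check=True raises CalledProcessError if java exits with an error.
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        # Keep both streams: this sketch does not assume which one carries
        # the tagged text and which one carries the evaluation summary.
        outputs[path] = (result.stdout, result.stderr)
    return outputs


# Example call (requires Java, the jar, and the model on disk):
# results = tag_files(["Test_file1.tsv", "Test_file2.tsv"])
```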

Delimiter for CSV file in IIB

I am developing an integration in IIB, and one of the requirements for the output (multiple CSV files) is a comma delimiter instead of a semicolon. The input uses semicolons. I'm using two mapping nodes to produce separate files from one input, but I'm struggling to find the option for the delimiter.
The two mapping nodes use XSD schemas and .map files to produce the output.
The first mapping creates a canonical DFDL format that is ready to be parsed into multiple files in the second mapping node.
There is not much code, just setup in IIB.
I would like to produce comma-separated CSV instead of semicolon-separated.
Thanks in advance
I found a solution. You can simply open the XSD in a text editor and change the delimiter there.

How can I merge several files on SPSS by variable label?

I have 48 .sav data sets containing the results of a monthly survey. I need to merge the cases of all common variables from them in order to come up with a four-year aggregate. As I'm new to SPSS and not very proficient with syntax (although I can follow it), I would normally do this using Data - Merge files - Add Cases. However, most of these common variables have different variable names in each data set, because the questions are not always asked in the same order and some questions only appear in one or two data sets.
The variable labels, on the other hand, do not change from one data set to another. It would be great if someone knows a way to merge these data sets by variable label instead of variable name. Swapping variable names and variable labels would also do, as I could then use Data - Merge files - Add Cases without problems.
Many thanks beforehand!
The merge procedures such as ADD FILES (Data > Merge Files > Add Cases) provide a capability to rename variables in the input files before merging. However, if there are a lot of variables to merge, this would get pretty tedious and error-prone. Also, the dialog box supports merging only two files, while syntax allows up to 50.
Variable labels are generally not valid as variable names due to the typical presence of characters such as blanks and punctuation and length restrictions. If you have a rule that could be used to turn labels into valid variable names, that could be automated, or if the variables are always in the same order and are present in all the files, they could be renamed something like V1, V2, ...
The renaming could be done manually in syntax that you would craft for each file, or this could be done with a short Python program that you run on each file. I can write that for you if you provide details and, preferably, a sample dataset to test with (jkpeck AT gmail.com).
The Python code could loop over all the sav files in a directory and apply the renaming logic to each in one step.
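
As an illustration of the kind of rule mentioned above, here is a minimal Python sketch that turns labels into candidate variable names. It is only one possible rule, not the answerer's program: it does not read .sav files, it assumes the labels have already been obtained as a dictionary keyed by the current variable name, and the helper names `label_to_name`, `rename_map` and `rename_syntax` are hypothetical. The rule itself (ASCII letters, digits and underscores, a leading letter, at most 64 characters, no reserved keywords, numeric suffixes for duplicates) is an assumption chosen to stay well inside the SPSS naming restrictions.

```python
import re

# Reserved keywords that SPSS does not accept as variable names.
RESERVED = {"ALL", "AND", "BY", "EQ", "GE", "GT", "LE", "LT",
            "NE", "NOT", "OR", "TO", "WITH"}
MAX_LEN = 64  # names are ASCII-only here, so characters equal bytes


def label_to_name(label, used):
    """Turn a variable label into a unique, ASCII-only variable name."""
    # Replace every run of characters outside [A-Za-z0-9] with one underscore.
    name = re.sub(r"[^A-Za-z0-9]+", "_", label).strip("_")
    # Prefix names that are empty, start with a non-letter, or are reserved.
    if not name or not name[0].isalpha() or name.upper() in RESERVED:
        name = "v_" + name
    name = name[:MAX_LEN].rstrip("_")
    # Names are compared case-insensitively; add _2, _3, ... on collisions.
    candidate, n = name, 1
    while candidate.upper() in used:
        n += 1
        suffix = f"_{n}"
        candidate = name[:MAX_LEN - len(suffix)] + suffix
    used.add(candidate.upper())
    return candidate


def rename_map(labels_by_name):
    """Map each existing variable name to a new name derived from its label."""
    used = set()
    return {old: label_to_name(label, used)
            for old, label in labels_by_name.items()}


def rename_syntax(mapping):
    """Build a RENAME VARIABLES command from an old-name -> new-name mapping."""
    pairs = " ".join(f"({old} = {new})" for old, new in mapping.items())
    return f"RENAME VARIABLES {pairs}."
```

Because the same label always yields the same name, applying this to every file would give the common variables matching names before ADD FILES is run, provided that duplicate labels appear in the same order in each file.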

How can I cluster similar sentences based on their context and extract keywords from them?

I wanted to cluster sentences based on their context and extract common keywords from similar context sentences.
For example
1. I need to go to home
2. I am eating
3. He will be going home tomorrow
4. He is at restaurant
Sentences 1 and 3 would be similar, with keywords like "go" and "home" and maybe their synonyms, such as "travel" and "house".
A pre-existing API, such as IBM Watson, would be helpful.
This API does exactly what you are asking for (clustering sentences and returning keywords):
http://www.rxnlp.com/api-reference/cluster-sentences-api-reference/
Unfortunately, the algorithms used for clustering and for generating the keywords are not available.
Hope this helps.
You can use RapidMiner with the Text Processing Extension.
Put each sentence in a separate file and place all the files in one folder.
Add the operators and arrange them as in the design below.
Click on the Process Documents from Files operator and, in the right-hand bar, choose "Edit list" on the "Text directories" field. Then choose the folder that contains your files.
Double-click on the Process Documents from Files operator and, in the new window, add the operators as in the design below (just the ones you need).
Then run your process.
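
For readers who want to see the general idea in code, here is a minimal Python sketch. It does not reproduce the API or the RapidMiner process above. It swaps in a deliberately simple technique: a hand-picked stopword list, a crude suffix-stripping stemmer, Jaccard similarity on keyword sets, and greedy single-pass clustering. The stopword list, the threshold of 0.3, and all function names are assumptions made for this example, and synonyms such as "travel" or "house" are not handled.

```python
import re

# Tiny hand-picked stopword list, just enough for the example sentences.
STOPWORDS = {"i", "am", "is", "at", "to", "he", "will", "be", "the", "a"}


def stem(word):
    """Very crude stemmer: strip a trailing 'ing' from longer words."""
    return word[:-3] if word.endswith("ing") and len(word) > 4 else word


def keywords(sentence):
    """Lowercase, tokenize, drop stopwords, and stem the remaining words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return {stem(w) for w in words if w not in STOPWORDS}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(sentences, threshold=0.3):
    """Greedy single-pass clustering on Jaccard similarity of keyword sets."""
    clusters = []
    for sentence in sentences:
        kw = keywords(sentence)
        for c in clusters:
            if jaccard(kw, c["keywords"]) >= threshold:
                c["sentences"].append(sentence)
                c["shared"] &= kw    # keywords common to every member
                c["keywords"] |= kw  # all keywords seen in the cluster
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append({"sentences": [sentence],
                             "keywords": set(kw), "shared": set(kw)})
    return clusters
```

On the four example sentences from the question, this groups sentences 1 and 3 together with the shared keywords "go" and "home", and leaves sentences 2 and 4 in clusters of their own.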

Stanford NLP - Using Parsed or Tagged text to generate Full XML

I'm trying to extract data from the Penn Treebank Wall Street Journal corpus. Most of it already has parse trees, but some of the data is only tagged,
i.e. wsj_DDXX.mrg and wsj_DDXX.pos files.
I would like to use the existing parse trees and tagged data in these files rather than the parser and taggers within CoreNLP, but I still want the output format that CoreNLP gives: namely, the XML file that contains the dependencies, entity coreference, parse trees, and tagged data.
I've read many of the Javadocs, but I cannot figure out how to get the output I described.
For POS, I tried using the LexicalizedParser. It lets me use the tags, but I can only generate an XML file with some of the information I want; there is no option for coreference or for generating the parse trees. To get it to correctly generate these sub-optimal XML files, I had to write a script to get rid of all of the brackets within the files. This is the command I use:
java -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat typedDependenciesCollapsed,wordsAndTags -outputFilesExtension xml -outputFormatOptions xml -writeOutputFiles -outputFilesDirectory my\dir -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz my\wsj\files\dir
I also can't generate the data I would like for the WSJ files that already have trees. I tried what is described here and looked at the corresponding Javadocs, using a command similar to the one described. However, I had to write a Python program to capture the stdout produced for each file and write it to a new file. The result is only a text file with the dependencies, not the desired XML notation.
To summarize, I would like to use the POS and tree data from these PTB files in order to generate a CoreNLP parse corresponding to what would occur if I used CoreNLP on a regular text file. The pseudo command would be like this:
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -useTreeFile wsj_DDXX.mrg
and
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -usePOSFile wsj_DDXX.pos
Edit: fixed a link.
Yes, this is possible, but it is a bit tricky. There is no out-of-the-box feature that can do this, so you will have to write some code. The basic idea is to replace the tokenize, ssplit and pos annotators (and, if you also have trees, the parse annotator) with your own code that loads these annotations from your annotated files.
On a very high level, you have to do the following:
1. Load your trees with MemoryTreebank.
2. Loop through all the trees and, for each tree, create a sentence CoreMap to which you add:
   - a TokensAnnotation
   - a TreeAnnotation and the SemanticGraphCoreAnnotations
3. Create an Annotation object with a list containing the CoreMap objects for all sentences.
4. Run the StanfordCoreNLP pipeline with the annotators option set to lemma,ner,dcoref and the option enforceRequirements set to false.
Take a look at the individual annotators to see how to add the required annotations. E.g. there is a method in ParserAnnotatorUtils that adds the SemanticGraphCoreAnnotations.

Storing applicative version info in SPSS sav file

I'm using the SPSS I/O library for C to write and read sav files.
I need to store my own version number in the sav file. The requirements are:
1) The version should not be visible to users when they work with the regular SPSS programs.
2) Obviously, the regular SPSS programs and the I/O module should not overwrite the number.
Please advise on a suitable place or function for this.
Regards,
There is a header field in the sav file that identifies the creator. However, that would be overwritten if the file is resaved. It would also be visible with commands such as SYSFILE INFO.
Another approach would be to create a custom file attribute using a name that is unlikely to be used by anyone else. It would also be visible in a few system status commands such as DISPLAY DICTIONARY and, I think, CODEBOOK. It could be overwritten with the DATAFILE ATTRIBUTE command, but it would not be changed just by resaving the file.
