Storing applicative version info in SPSS sav file - spss

I'm using C SPSS I/O library to write and read sav files.
I need to store my own version number in sav file. The requirements are:
1) That version should not be visible to user when he/she uses regular SPSS programs.
2) Obviously, regular SPSS programs and the I/O module should not overwrite the number.
Please, advice about that place or function.
Regards,

There is a header field in the sav file that identifies the creator. However, that would be overwritten if the file is resaved. It would be visible with commands such as sysfile info.
Another approach would be to create a custom file attribute using a name that is unlikely to be used by anyone else. It would also be visible in a few system status commands such as DISPLAY DICT and I think, CODEBOOK. It could be overwritten, with the DATASET ATTRIBUTE command but would not be changed just by resaving the file.

Related

Stanford NLP - Using Parsed or Tagged text to generate Full XML

I'm trying to extract data from the PennTreeBank, Wall Street Journal corpus. Most of it already has the parse trees, but some of the data is only tagged.
i.e. wsj_DDXX.mrg and wsj_DDXX.pos files.
I would like to use the already parsed trees and tagged data in these files so as not to use the parser and taggers within CoreNLP, but I still want the output file format that CoreNLP gives; namely, the XML file that contains the dependencies, entity coreference, and the parse tree and tagged data.
I've read many of the java docs but I cannot figure out how to get it the way I described.
For POS, I tried using the LexicalizedParser and it allows me to use the tags, but I can only generate an XML file with the some of the information I want; there is no option for coreference or generating the parse trees. To get it to correctly generate the sub-optimal XML files here, I had to write a script to get rid of all of the brackets within the files. This is the command I use:
java -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat typedDependenciesCollapsed,wordsAndTags -outputFilesExtension xml -outputFormatOptions xml -writeOutputFiles -outputFilesDirectory my\dir -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz my\wsj\files\dir
I also can't generate the data I would like to have for the WSJ data that already has the trees. I tried using what is said here and I looked at the corresponding Javadocs. I used the command similar to what is described. But I had to write a python program to retrieve the stdout data resulting from analyzing each file and wrote it into a new file. This resulting data is only a text file with the dependencies and is not in the desired XML notation.
To summarize, I would like to use the POS and tree data from these PTB files in order to generate a CoreNLP parse corresponding to what would occur if I used CoreNLP on a regular text file. The pseudo command would be like this:
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -useTreeFile wsj_DDXX.mrg
and
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -usePOSFile wsj_DDXX.pos
Edit: fixed a link.
Yes, this is possible, but a bit tricky and there is no out of the box feature that can do this, so you will have to write some code. The basic idea is to replace the tokenize, ssplit and pos annotators (and in case you also have trees the parse annotator) with your code that loads these annotations from your annotated files.
On a very high level you have to do the following:
Load your trees with MemoryTreebank
Loop through all the trees and for each tree create a sentence CoreMap to which you add
a TokensAnnotation
a TreeAnnotation and the SemanticGraphCoreAnnotations
Create an Annotation object with a list containing the CoreMap objects for all sentences
Run the StanfordCoreNLP pipeline with the annotators option set to lemma,ner,dcoref and the option enforceRequirements set to false.
Take a look at the individual annotators to see how to add the required annotations. E.g. there is a method in ParserAnnotatorUtils that adds the SemanticGraphCoreAnnotations.

Documentation of Moses (statistical machine translation) mose.ini file format?

Is there any documentation of the moses.ini format for Moses? Running moses at the command line without arguments returns available feature names but not their available arguments. Additionally, the structure of the .ini file is not specified in the manual that I can see.
The main idea is that the file contains settings that will be used by the translation model. Thus, the documentation of values and options in moses.ini should be looked up in the Moses feature specifications.
Here are some excerpt I found on the Web about moses.ini.
In the Moses Core, we have some details:
7.6.5 moses.ini All feature functions are specified in the [feature] section. It should be in the format:
* Feature-name key1=value1 key2=value2 .... For example, KENLM factor=0 order=3 num-features=1 lazyken=0 path=file.lm.gz
Also, there is a hint on how to print basic statistics about all components mentioned in the moses.ini.
Run the script
analyse_moses_model.pl moses.ini
This can be useful to set the order of mapping steps to avoid explosion of translation options or just to check that the model components are as big/detailed as we expect.
In the Center for Computational Language and EducAtion Research (CLEAR) Wiki, there is a sample file with some documentation:
Parameters
It is recommended to make an .ini file to storage all of your setting.
input-factors
- Using factor model or not
mapping
- To use LM in memory (T) or read the file in hard disk directly (G)
ttable-file
- Indicate the num. of source-factor, num. of target-factor, num of score, and
the path to translation table file
lmodel-file
- Indicate the type using for LM (0:SRILM, 1:IRSTLM), using factor number, the order (n-gram) of LM, and the path to language model file
If it is not enough, there is another description on this page, see "Decoder configuration file" section
The sections
[ttable-file] and [lmodel-file] contain pointers to the phrase table
file and language model file, respectively. You may disregard the
numbers on those lines. For the time being, it's enough to know that
the last one of the numbers in the language model specification is the
order of the n-gram model.
The configuration file also contains some feature weights. Note that
the [weight-t] section has 5 weights, one for each feature contained
in the phrase table.
The moses.ini file created by the training process will not work with
your decoder without modification because it relies on a language
model library that is not compiled into our decoder. In order to make
it work, open the moses.ini file and find the language model
specification in the line immediately after the [lmodel-file] heading.
The first number on this line will be 0, which stands for SRILM.
Change it into 8 and leave the rest of the line untouched. Then your
configuration should work.

How can I view the content of a .bin file in opennlp

I am trying to use OpenNLP in a project I am working in and i am very new to it. I tried out using the Named Entity Recognition with the training data available at http://opennlp.sourceforge.net/models-1.5/
However I want to see the training data that have been used. i.e. to actually open the .bin file and see its content in English. Can some one pls point me in the correct direction.
I have tried to use UltraISO to read the .bin file but i was not successful.
PLs help !!
Thanx :)
Use the Unix file command to find the file type, like file en-token.bin. For most OpenNLP .bin files, it will tell you that these are just ZIP files.
the bin file is actually the bytes of a serialized java object representing a TokenNameFinder implementation called a NameFinderME (ME meaning Maximum entropy, which is the main multinomial logistic regression (ish) algorithm used in OpenNLP). You will not be able to see the training data by doing anything to this file.
Correction: it's not the name finder, it's the namefinderMODEL that is serialized.

retrieve txt content of as many file types as possible

I maintain a client server DMS written in Delphi/Sql Server.
I would like to allow the users to search a string inside all the documents stored in the db. (files are stored as blob, they are stored as zipped files to save space).
My idea is to index them on "checkin", so as i store a nwe file I extract all the text information in it and put it in a new DB field. So somehow my files table will be:
ID_FILE integer
ZIPPED_FILE blob
TEXT_CONTENT text field (nvarchar in sql server)
I would like to support "indexing" of at least most common text-like files, such as:pdf, txt, rtf, doc, docx,pdf, may be adding xls and xlsx, ppt, pptx.
For MS Office files I can use ActiveX since I alerady do it in my application, for txt files i can simply read the file, but for pdf and odt?
Could you suggest the best techinque or even a 3rd party component (not free too) that parses with "no fear" all file types?
Thanks
searching documents this way would leed to a very slow and inconvenient to use, I'd advice you produce two additional tables instead of TEXT_CONTENT field.
When you parse the text, you should extract valuable words and try to standardise them so that you
- get rid of lower/upper case problems
- get rid of characters that might be used interchangeably.
i.e. in Turkish we have รง character that might be entered as c.
- get rid of verbs that are common in the language you are dealing with.
i.e. "Thing I am looking for", "Thing" "Looking" might be in your interest
- get rid of whatever problem use face.
Each word, that has already an entry in the table should re-use the ID already given in the string_search table.
the records may look like this.
original_file_table
zip_id number
zip_file blob
string_search
str_id number
standardized_word text (or any string type with an appropriate secondary index)
file_string_reference
zip_id number
str_id number
I hope that I could give you the idea what I am thinking of.
Your major problem is zipping your files before putting them as a blob in your database which makes them unsearchable by the database itself. I would suggest the following.
Don't zip files you put in the database. Disk space is cheap.
You can write a query like this as long as you save the files in a text field.
Select * from MyFileTable Where MyFileData like '%Thing I am looking for%'
This is slow but it will work. This will work because the text in most of those file types is in plain text not binary (though some of the newer file types are now binary)
The other alternative is to use an indexing engine such as Apache Lucene or Apache Solr which will as you put it
parses with "no fear" all file types?

Mahout: Importing CSV file to Sequence Files using regexconverter or arff.vector

I just started learning how to use mahout. I'm not a java programmer however, so I'm trying to stay away from having to use the java library.
I noticed there is a shell tool regexconverter. However, the documentation is sparse and non instructive. Exactly what does specifying a regex option do, and what does the transformer class and formatter class do? The mahout wiki is marvelously opaque. I'm assuming the regex option specifies what counts as a "unit" or so.
The example they list is of using the regexconverter to convert http log requests to sequence files I believe. I have a csv file with slightly altered http log requests that I'm hoping to convert to sequence files. Do I simply change the regex expression to take each entire row? I'm trying to run a Bayes classifier, similar to the 20 newsgroups example which seems to be done completely in the shell without need for java coding.
Incidentally, the arff.vector command seems to allow me to convert an arff file directly to vectors. I'm unfamiliar with arff, thought it seems to be something I can easily convert csv log files into. Should I use this method instead, and skip the sequence file step completely?
Thanks for the help.

Resources