How can I view the content of a .bin file in OpenNLP?

I am trying to use OpenNLP in a project I am working on, and I am very new to it. I tried out Named Entity Recognition with the pre-trained models available at http://opennlp.sourceforge.net/models-1.5/
However, I want to see the training data that was used, i.e. to actually open the .bin file and see its content in English. Can someone please point me in the right direction?
I have tried to use UltraISO to read the .bin file, but I was not successful.
Please help!
Thanks :)

Use the Unix file command to find the file type, like file en-token.bin. For most OpenNLP .bin files, it will tell you that these are just ZIP files.
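Since the model is a ZIP archive, you can also list its entries programmatically. A minimal sketch, assuming a model such as en-ner-person.bin sits in the working directory (the file name is just an example):

import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ListModelEntries {
    public static void main(String[] args) throws Exception {
        // Open the OpenNLP model as a plain ZIP archive and print its entries.
        try (ZipFile zip = new ZipFile("en-ner-person.bin")) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                System.out.println(entries.nextElement().getName());
            }
        }
    }
}

You will typically see a manifest.properties entry and one or more serialized model entries, not the original training sentences.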

The .bin file actually contains the bytes of a serialized Java object used by a TokenNameFinder implementation called NameFinderME (ME stands for maximum entropy, the multinomial-logistic-regression-style algorithm that is the main one used in OpenNLP). You will not be able to see the training data by doing anything to this file.
Correction: it is not the name finder itself that is serialized, but the TokenNameFinderModel.
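What you can do with the .bin file is deserialize it and run it. A minimal sketch against the OpenNLP 1.5 API; the model file name and the sample tokens are just examples:

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class NameFinderDemo {
    public static void main(String[] args) throws Exception {
        // Deserializing the .bin file yields a model object, not training data.
        try (InputStream in = new FileInputStream("en-ner-person.bin")) {
            TokenNameFinderModel model = new TokenNameFinderModel(in);
            NameFinderME finder = new NameFinderME(model);
            String[] tokens = {"Pierre", "Vinken", "is", "a", "director", "."};
            for (Span span : finder.find(tokens)) {
                System.out.println(span); // spans such as [0..2) person
            }
        }
    }
}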

Related

Loading a trained model on embedded systems (no libraries)

I'm not too familiar with machine learning techniques, and I want to know whether I can transfer a final trained model to another machine. More specifically, I'm trying to solve a sound classification problem by training a model on a regular PC and then transferring the resulting model to an embedded system where no libraries are allowed (C programming). The system does not support file reading either.
So my question is:
Are there learning methods whose output models are simple enough to implement easily on other systems? How would you implement them? (Something like Q-learning? Although Q-learning wouldn't be appropriate for my project.)
I would like some pointers. Thanks in advance.
Any arbitrary "blob" of data can be converted into a C byte array and compiled and linked directly with your code. A code generator is simple enough to write (a rough sketch follows below), but there are tools that will do it directly, such as Segger's Bin2C (and any number of other tools called "bin2c"), or the Swiss-army knife of embedded data converters, SRecord.
Since SRecord can do so many things, getting it to do this one thing is less than obvious:
srec_cat mymodel.nn -binary -o model.c -C-Array model -INClude
will generate model.c and model.h files defining a data array containing the byte content of mymodel.nn.
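As noted above, rolling your own generator takes only a few lines. Here is a rough sketch in Java; the array name model and the hard-coded output file name are arbitrary choices, not anything the tools above mandate:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Bin2C {
    public static void main(String[] args) throws IOException {
        // Read the binary blob (e.g. mymodel.nn) given as the first argument.
        byte[] blob = Files.readAllBytes(Paths.get(args[0]));
        StringBuilder c = new StringBuilder("const unsigned char model[] = {\n");
        for (int i = 0; i < blob.length; i++) {
            c.append(String.format("0x%02x,", blob[i] & 0xff));
            c.append(i % 12 == 11 ? "\n" : " ");
        }
        c.append("\n};\nconst unsigned int model_len = ").append(blob.length).append("u;\n");
        Files.write(Paths.get("model.c"), c.toString().getBytes());
        // A matching model.h would declare:
        //   extern const unsigned char model[];
        //   extern const unsigned int model_len;
    }
}

The embedded code then accesses the data through those extern declarations, with no file I/O involved.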

Storing application version info in an SPSS sav file

I'm using the C SPSS I/O library to write and read sav files.
I need to store my own version number in the sav file. The requirements are:
1) The version should not be visible to the user when he/she uses regular SPSS programs.
2) Obviously, regular SPSS programs and the I/O module should not overwrite the number.
Please advise on a suitable place or function for this.
Regards,
There is a header field in the sav file that identifies the creator. However, that would be overwritten if the file is resaved. It would be visible with commands such as SYSFILE INFO.
Another approach would be to create a custom file attribute using a name that is unlikely to be used by anyone else. It would also be visible in a few system status commands such as DISPLAY DICT and, I think, CODEBOOK. It could be overwritten with the DATASET ATTRIBUTE command, but would not be changed just by resaving the file.

.names and .data in Weka

I am new to Weka.
I am trying to run some algorithms in Weka using the UCI ML repository, but I don't know how to use the .names and .data files in Weka.
Can anyone tell me how to convert .data and .names files to ARFF format?
Please look at the "creating a .arff file" section in http://storm.cis.fordham.edu/~gweiss/data-mining/weka.html
If you want a simpler solution using only the .data file (the .names file holds the data description): just edit the .data file, insert a header line (a first line giving a distinct name for each attribute, all separated by commas), save it, and rename it to .csv. Be aware that this approach will not handle non-basic data types properly and will run into problems with missing values.
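Once you have a .csv with a header line, Weka can also do the conversion for you. A short sketch using Weka's CSVLoader and ArffSaver converters; the file names are examples:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Load the CSV; the first line is treated as the attribute header.
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("iris.csv"));
        Instances data = loader.getDataSet();

        // Write the same instances out in ARFF format.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("iris.arff"));
        saver.writeBatch();
    }
}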
I'm not sure if this helps, but another way to use your data is to create an .arff file by hand, like this example:
http://www.let.rug.nl/~coltekin/ml08/weather.arff
If your dataset is simple, you can just write your own .arff; if you want to use a bigger dataset, going through .csv and then transforming it into .arff is more practical.

Mahout: Importing a CSV file to sequence files using regexconverter or arff.vector

I just started learning how to use Mahout. I'm not a Java programmer, however, so I'm trying to stay away from having to use the Java library.
I noticed there is a shell tool, regexconverter. However, the documentation is sparse and uninstructive. What exactly does specifying a regex option do, and what do the transformer class and formatter class do? The Mahout wiki is marvelously opaque. I'm assuming the regex option specifies what counts as a "unit", or so.
The example they list uses regexconverter to convert HTTP log requests to sequence files, I believe. I have a CSV file with slightly altered HTTP log requests that I'm hoping to convert to sequence files. Do I simply change the regex to take each entire row? I'm trying to run a Bayes classifier, similar to the 20 newsgroups example, which seems to be done completely in the shell without any need for Java coding.
Incidentally, the arff.vector command seems to allow me to convert an ARFF file directly to vectors. I'm unfamiliar with ARFF, though it seems to be something I can easily convert CSV log files into. Should I use this method instead and skip the sequence file step completely?
Thanks for the help.

What is appropriate for me: generateAllGrams(), or is generateCollocations() enough?

I am developing a WordNet-based document summarizer, and for that I need to extract collocations. I tried to research as much as I could, but since I have not worked with Mahout before, I am having difficulty understanding how CollocDriver.java works (in an API context).
While scouring the web, I landed on this:
Mahout Collocations
This is the problem: I have POS-tagged input text and need to identify collocations in it. I have the CollocDriver.java code; now I need to know how to use it. Is the generateAllGrams() method required, or is the generateCollocations() method enough for this subtask within my summarizer?
And most importantly, HOW do I use it? I raise this question because, I admit, I don't know the API well.
I also found a grepcode version of CollocDriver; the two implementations seem to be slightly different. The inputs are Strings in the grepcode version and Path objects in the original.
My questions: what is the Configuration object in the input parameters, and how do I use it? Will the source/destination be a String (as in grepcode) or a Path (as in the original)?
What will be the output?
I have done some further R&D on the CollocDriver program and found out that it uses a sequence file and then vector generation. I want to know how this sequence file / vector generation works. Please help.
To get collocations using Mahout, you need to follow a few simple steps:
1) Make a sequence file from your input text file:
/bin/mahout seqdirectory -i /home/developer/Desktop/colloc/ -o /home/developer/Desktop/colloc/test-seqdir -c UTF-8 -chunk 5
2) There are two ways to generate collocations from a sequence file:
a) Convert the sequence file to sparse vectors and find the collocations.
b) Find the collocations directly from the sequence file (without creating the sparse vectors).
3) Here I am taking choice b:
/bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i /home/developer/Desktop/colloc/test-seqdir -o /home/developer/Desktop/colloc/test-colloc -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3 -p
Just check the output folder; the files you need are there, in sequence file format.
/bin/mahout seqdumper -s /home/developer/Desktop/colloc/test-colloc/ngrams/part-r-00000 >> out.txt
will give you a text output.
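If you would rather read the ngrams output from Java than pipe it through seqdumper, a sketch like the following should work. I am assuming the keys are Text n-grams and the values are DoubleWritable LLR scores, which matches what seqdumper prints for this output; adjust the types if your Mahout version differs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpNgrams {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/home/developer/Desktop/colloc/test-colloc/ngrams/part-r-00000");
        // Iterate over (ngram, score) pairs stored in the sequence file.
        SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
        try {
            Text ngram = new Text();
            DoubleWritable score = new DoubleWritable();
            while (reader.next(ngram, score)) {
                System.out.println(ngram + "\t" + score.get());
            }
        } finally {
            reader.close();
        }
    }
}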
