weka 3.7 explorer cannot classify text - machine-learning

I am trying to do text classification using the Weka 3.7 Explorer. I converted two sets of text files (separated into two directories, class1 and class2) into ARFF using the text loader. Before doing so, I standardized the case to lower. Now when I load the file into Weka and apply the StringToWordVector filter (with options such as stop words, word counts, and a SnowballStemmer), I do not see any change in my list of attributes. All the variables (words) are given as 1 or 0 against each class.
Please help me.
Here is my filter command
weka.filters.unsupervised.attribute.StringToWordVector -R first-last -W 1000 -prune-rate -1.0 -C -N 0 -S -stemmer weka.core.stemmers.SnowballStemmer -M 1 -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \r\n\t.,;:\\'\\"()?!\""

That happened to me when I wanted to read from a .csv file and use StringToWordVector.
My problem was that the text attribute was of type nominal and not String. I applied the NominalToString filter to convert the values to String, and then it worked.
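For contrast, here is what the output should look like conceptually once the attribute really is of type String. This is not Weka code, just a plain-Python sketch (assuming simple lowercase word tokenization, with stemming, stop lists and pruning omitted) of the word-count vectors StringToWordVector produces with counts enabled:

```python
from collections import Counter
import re

def string_to_word_vector(docs):
    """Minimal sketch of Weka's StringToWordVector with word counts
    enabled (-C): lowercase, tokenize, and emit one numeric attribute
    per word in the vocabulary."""
    tokenized = [re.findall(r"[a-z']+", d.lower()) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    return vocab, [[Counter(toks)[w] for w in vocab] for toks in tokenized]

vocab, vectors = string_to_word_vector(["The cat sat", "the dog sat down"])
# one count attribute per word, per document
```

If every attribute comes back as a constant 0/1 instead of counts like these, the input attribute was almost certainly nominal rather than String.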

Related

A guide to convert_imageset.cpp

I am relatively new to machine learning/Python/Ubuntu.
I have a set of images in .jpg format, half of which contain a feature I want Caffe to learn and half of which don't. I'm having trouble finding a way to convert them to the required lmdb format.
I have the necessary text input files.
My question is: can anyone provide a step-by-step guide on how to use convert_imageset.cpp in the Ubuntu terminal?
Thanks
A quick guide to Caffe's convert_imageset
Build
The first thing you must do is build Caffe and Caffe's tools (convert_imageset is one of these tools).
After cloning Caffe and running make, make sure you also run make tools.
Verify that a binary file convert_imageset is created in $CAFFE_ROOT/build/tools.
Prepare your data
Images: put all images in a folder (I'll call it here /path/to/jpegs/).
Labels: create a text file (e.g., /path/to/labels/train.txt) with one line per input image. For example:
img_0000.jpeg 1
img_0001.jpeg 0
img_0002.jpeg 0
In this example the first image is labeled 1 while the other two are labeled 0.
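If your labels are determined programmatically rather than by hand, the listing above can be generated. A minimal sketch (write_label_file and label_of are names invented here for illustration, not Caffe tools):

```python
import os

def write_label_file(image_dir, label_of, out_path):
    """Write the 'filename label' listing convert_imageset expects,
    one line per image. label_of is a callback mapping a filename to
    its integer label (e.g. based on which half of the dataset the
    image came from)."""
    with open(out_path, "w") as f:
        for name in sorted(os.listdir(image_dir)):
            if name.lower().endswith((".jpg", ".jpeg")):
                f.write("%s %d\n" % (name, label_of(name)))
```

Non-image files in the folder are skipped, and sorting keeps the listing deterministic (--shuffle can randomize the order later).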
Convert the dataset
Run the binary from the shell:
~$ GLOG_logtostderr=1 $CAFFE_ROOT/build/tools/convert_imageset \
--resize_height=200 --resize_width=200 --shuffle \
/path/to/jpegs/ \
/path/to/labels/train.txt \
/path/to/lmdb/train_lmdb
Command line explained:
Setting GLOG_logtostderr=1 before calling convert_imageset tells the logging mechanism to redirect log messages to stderr.
--resize_height and --resize_width resize all input images to the same size, 200x200.
--shuffle randomly changes the order of the images; it does not preserve the order in the /path/to/labels/train.txt file.
Following are the path to the images folder, the labels text file, and the output name. Note that the output name should not exist prior to calling convert_imageset, otherwise you'll get a scary error message.
Other flags that might be useful:
--backend - allows you to choose between an LMDB or LevelDB dataset.
--gray - convert all images to grayscale.
--encoded and --encoded_type - keep image data in encoded (jpg/png) compressed form in the database.
--help - shows some help, see all relevant flags under Flags from tools/convert_imageset.cpp
You can check out $CAFFE_ROOT/examples/imagenet/create_imagenet.sh
for an example of how to use convert_imageset.

Problems in training text on AdaGram.jl

I'm a newbie to the Julia programming language. I am trying to install the Adaptive Skip-gram (AdaGram) model on my machine, and I'm facing the following problems. Before training a model we need a tokenized file and a dictionary file. My question is: what input should be given to tokenize.sh and dictionary.sh? Please let me know how the output files are generated, and what extension they have.
This is the website link I'm referring to : https://github.com/sbos/AdaGram.jl .
It is very similar to https://code.google.com/p/word2vec/
The package provides a few shell scripts to pre-process the data and fit the model:
you have to call them from the shell, i.e., outside Julia.
# Install the package
julia -e 'Pkg.clone("https://github.com/sbos/AdaGram.jl.git")'
julia -e 'Pkg.build("AdaGram")'
# Download some text
wget http://www.gutenberg.org/ebooks/100.txt.utf-8
# Tokenize the text, and count the words
~/.julia/v0.3/AdaGram/utils/tokenize.sh 100.txt.utf-8 text.txt
~/.julia/v0.3/AdaGram/utils/dictionary.sh text.txt dictionary.txt
# Train the model
~/.julia/v0.3/AdaGram/train.sh text.txt dictionary.txt model
You can then use the model, from Julia:
using AdaGram
vm, dict = load_model("model");
expected_pi(vm, dict.word2id["hamlet"])
nearest_neighbors(vm, dict, "hamlet", 1, 10)

Error while executing DetEval software to evaluate the performance of my text recognition algorithm

I have come up with a text recognition algorithm. This algorithm recognizes text in natural images. I am trying to test it against the groundtruth available for the dataset of ICDAR's robust reading challenge. For this, I have generated an xml file containing coordinates of text regions in a scene image, as recognized by my algorithm. A similar xml file is provided for the groundtruth data.
To generate quantitative results comparing the two xml files, I am required to use the DetEval software (as mentioned on the site). I have installed the command-line version on Linux.
The problem is: DetEval is not reading the input xml files. Specifically,
I run the following command (As per the instructions on the DetEval website):
rocplot /home/ekta/workspace/extract/result_ICDAR_2011/txt/GT2.xml { /home/ekta/workspace/extract/result_ICDAR_2011/txt/final.xml }
Here, GT2.xml is the groundtruth and final.xml is the file generated by my algorithm.
I get the following error message:
evaldetection -p 0.8,0.4,0.8,0.4,0.4,0.8,0,1 "{" "/home/ekta/workspace/extract/result_ICDAR_2011/txt/GT2.xml" | readdeteval -p 1 - >> /tmp/evaldetectioncurves20130818-21541-1kum9m9-0
evaldetection -p 0.8,0.4,0.8,0.4,0.4,0.8,0,1 "{" "/home/ekta/workspace/extract/result_ICDAR_2011/txt/GT2.xml"I/O warning : failed to load external entity "{"
Couldn't parse document {
-:1: parser error : Document is empty
^
-:1: parser error : Start tag expected, '<' not found
^
I/O error : Invalid seek
Couldn't parse document -
rocplot: ERROR running the command:
evaldetection -p 0.8,0.4,0.8,0.4,0.4,0.8,0,1 "{" "/home/ekta/workspace/extract/result_ICDAR_2011/txt/GT2.xml" | readdeteval -p 1 - >> /tmp/evaldetectioncurves20130818-21541-1kum9m9-0Error code: 256
What do I do? I am positive that there is no error in generating my xml file, because even the groundtruth file obtained from the website is not being parsed. Please help!
Regards
Ekta
So, I managed to solve this issue. Turns out I was giving the wrong commands. rocplot is to be used only when I need multiple runs on the ground truth and detection files with varying evaluation parameters. See this paper to learn more about the parameters involved.
Currently, I have one ground truth file and one detection file and I need to run it using just the default parameters used by DetEval. So, here is what needs to be done:
Go to the directory where you have the detevalcmd directory and enter it. Run the following commands there:
./evaldetection /path/to/detection/results/DetectionFilename.xml /path/to/ground/truth/file/GroundTruthFilename.xml > /path/where/you/want/to/store/results/result.xml
This will store the results in result.xml. Next, run the following command:
./readdeteval /path/where/you/stored/results/result.xml
This will give something like:
100% of the images contain objects.
Generality: xxx
Inverse-Generality: xxx
<evaluation noImages="xxx">
<icdar2003 r="xxx" p="xxx" hmean="xxx" noGT="xxx" noD="xxx"/>
<score r="xxx" p="xxx" hmean="xxx" noGT="xxx" noD="xxx"/>
</evaluation>
So, there you go! You get the recall, precision, etc. for your algorithm.
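If you need those numbers programmatically (say, to tabulate runs over many detection files), the result.xml shown above is easy to parse. A sketch assuming the element and attribute names printed by readdeteval; the sample string below is made up for illustration:

```python
import xml.etree.ElementTree as ET

def read_scores(xml_text):
    """Extract recall, precision and hmean from the <score> element
    of readdeteval's output."""
    score = ET.fromstring(xml_text).find("score")
    return {k: float(score.get(k)) for k in ("r", "p", "hmean")}

sample = ('<evaluation noImages="1">'
          '<icdar2003 r="0.60" p="0.80" hmean="0.69" noGT="10" noD="8"/>'
          '<score r="0.62" p="0.81" hmean="0.70" noGT="10" noD="8"/>'
          '</evaluation>')
```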

Speaker adaptation with HTK

I am trying to adapt a monophone-based recogniser to a specific speaker. I am using the recipe given in HTKBook 3.4.1, section 3.6.2. I am getting stuck on the HHEd part, which I am invoking like so:
HHEd -A -D -T 1 -H hmm15/hmmdefs -H hmm15/macros -M classes regtree.hed monophones1eng
The error I end up with is as follows:
ERROR [+999] Components missing from Base Class list (2413 3375)
ERROR [+999] BaseClass check failed
The folder classes contains the file global which has the following contents:
~b "global"
<MMFIDMASK> *
<PARAMETERS> MIXBASE
<NUMCLASSES> 1
<CLASS> 1 {*.state[2-4].mix[1-25]}
The hmmdefs file within hmm15 had some mixture components missing (I am using 25 mixture components per state of each phone). I tried to "fill in the blanks" by adding mixture components with random mean and variance values but zero weights. This has had no effect.
The HMMs are left-right HMMs with 5 states (3 emitting), each state modelled by a 25-component mixture. Each component in turn is modelled by MFCCs with E, D and A components. There are 46 phones in all.
My questions are:
1. Is the way I am invoking HHEd correct? Can it be invoked in the above manner for monophones?
2. I know that the base class list (rtree.base) must contain every single mixture component, but where do I find these missing mixture components?
NOTE: Please let me know in case more information is needed.
Edit 1: The file regtree.hed contains the following:
RN "models"
LS "stats_engOnly_3_4"
RC 32 "rtree"
Thanks,
Sriram
The way you invoke HHEd looks fine. The components are missing because they have become defunct. To deal with defunct components, read HTKBook 3.4.1, Section 8.4, page 137.
Questions:
- What does regtree.hed contain?
- How much data (in hours) are you using? 25 mixtures might be excessive.
You might want to use a more gradual increase in mixtures - MU +1 or MU +2 - and limit the number of mixtures (a guess: 3-8, depending on the amount of training data).
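To locate which components the base class complains about, one could scan the MMF for gaps in the mixture indices. This is a hedged sketch, not an HTK tool: it assumes a simplified plain-text hmmdefs layout in which each emitting state begins with `<STATE> n` and each surviving component with `<MIXTURE> i w`; real MMFs (tied parameters, macros) are better handled with HTK's own tools.

```python
import re

def missing_mixtures(mmf_text, nummixes):
    """List the (state, mixture) index pairs absent from a
    simplified text MMF, given the expected number of mixtures
    per state."""
    present, state = {}, None
    for line in mmf_text.splitlines():
        m = re.match(r"\s*<STATE>\s+(\d+)", line, re.I)
        if m:
            state = int(m.group(1))
            present[state] = set()
            continue
        m = re.match(r"\s*<MIXTURE>\s+(\d+)", line, re.I)
        if m and state is not None:
            present[state].add(int(m.group(1)))
    return [(s, i) for s in sorted(present)
            for i in range(1, nummixes + 1) if i not in present[s]]
```

The pairs it returns are the components you would need to account for (or exclude from the base class) before HHEd's base class check can pass.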

how much time does grid.py take to run?

I am using libsvm for binary classification. I wanted to try grid.py, as it is said to improve results. I ran this script for five files in separate terminals, and it has been running for more than 12 hours.
This is the state of my 5 terminals now:
[root@localhost tools]# python grid.py sarts_nonarts_feat.txt>grid_arts.txt
Warning: empty z range [61.3997:61.3997], adjusting to [60.7857:62.0137]
line 2: warning: Cannot contour non grid data. Please use "set dgrid3d".
Warning: empty z range [61.3997:61.3997], adjusting to [60.7857:62.0137]
line 4: warning: Cannot contour non grid data. Please use "set dgrid3d".
[root@localhost tools]# python grid.py sgames_nongames_feat.txt>grid_games.txt
Warning: empty z range [64.5867:64.5867], adjusting to [63.9408:65.2326]
line 2: warning: Cannot contour non grid data. Please use "set dgrid3d".
Warning: empty z range [64.5867:64.5867], adjusting to [63.9408:65.2326]
line 4: warning: Cannot contour non grid data. Please use "set dgrid3d".
[root@localhost tools]# python grid.py sref_nonref_feat.txt>grid_ref.txt
Warning: empty z range [62.4602:62.4602], adjusting to [61.8356:63.0848]
line 2: warning: Cannot contour non grid data. Please use "set dgrid3d".
Warning: empty z range [62.4602:62.4602], adjusting to [61.8356:63.0848]
line 4: warning: Cannot contour non grid data. Please use "set dgrid3d".
[root@localhost tools]# python grid.py sbiz_nonbiz_feat.txt>grid_biz.txt
Warning: empty z range [67.9762:67.9762], adjusting to [67.2964:68.656]
line 2: warning: Cannot contour non grid data. Please use "set dgrid3d".
Warning: empty z range [67.9762:67.9762], adjusting to [67.2964:68.656]
line 4: warning: Cannot contour non grid data. Please use "set dgrid3d".
[root@localhost tools]# python grid.py snews_nonnews_feat.txt>grid_news.txt
Wrong input format at line 494
Traceback (most recent call last):
File "grid.py", line 223, in run
if rate is None: raise "get no rate"
TypeError: exceptions must be classes or instances, not str
I had redirected the outputs to files, but those files currently contain nothing.
The following files were created:
sbiz_nonbiz_feat.txt.out
sbiz_nonbiz_feat.txt.png
sarts_nonarts_feat.txt.out
sarts_nonarts_feat.txt.png
sgames_nongames_feat.txt.out
sgames_nongames_feat.txt.png
sref_nonref_feat.txt.out
sref_nonref_feat.txt.png
snews_nonnews_feat.txt.out (empty)
There's just one line of information in the .out files.
The .png files are some gnuplot plots.
But I don't understand what the above gnuplot warnings convey. Should I re-run the scripts?
Can anyone please tell me how much time this script might take if each input file contains about 144,000 lines?
Thanks and regards
Your data is huge: 144,000 lines, so this will take some time. I used data as large as yours and it took up to a week to finish. If you are using images, which I suppose you are given the size of the data, try resizing your images before creating the data. You should get approximately the same results with the resized images.
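To get a feel for why these runs take so long: grid.py's default search sweeps log2(C) from -5 to 15 in steps of 2 and log2(gamma) from 3 to -15 in steps of -2, with 5-fold cross-validation per pair. A quick count (ranges are grid.py's documented defaults; actual wall-clock time also scales superlinearly with the 144,000-line dataset):

```python
def grid_search_cost(c_begin=-5, c_end=15, c_step=2,
                     g_begin=3, g_end=-15, g_step=-2, fold=5):
    """Count the (C, gamma) pairs grid.py's default log2 ranges
    cover, and the total number of SVM trainings once each pair's
    k-fold cross-validation is included."""
    n_c = len(range(c_begin, c_end + 1, c_step))   # values of log2(C)
    n_g = len(range(g_begin, g_end - 1, g_step))   # values of log2(gamma)
    return n_c * n_g, n_c * n_g * fold

pairs, trainings = grid_search_cost()  # 110 pairs, 550 SVM trainings
```

So each of your five terminals is training hundreds of SVMs on 144,000 rows, which easily explains 12+ hours.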
The libSVM faq speaks to your question:
Q: Why grid.py/easy.py sometimes generates the following warning message?
Warning: empty z range [62.5:62.5], adjusting to [61.875:63.125]
Notice: cannot contour non grid data!
Nothing is wrong and please disregard the message. It is from gnuplot when drawing the contour.
As a side note, you can parallelize your grid.py operations. The libSVM tools directory README file has this to say on the matter:
Parallel grid search
You can conduct a parallel grid search by dispatching jobs to a
cluster of computers which share the same file system. First, you add
machine names in grid.py:
ssh_workers = ["linux1", "linux5", "linux5"]
and then setup your ssh so that the authentication works without
asking a password.
The same machine (e.g., linux5 here) can be listed more than once if
it has multiple CPUs or has more RAM. If the local machine is the
best, you can also enlarge the nr_local_worker. For example:
nr_local_worker = 2
In my Ubuntu 10.04 installation grid.py is actually /usr/bin/svm-grid.py
I guess grid.py is trying to find the optimal value for C (or Nu)?
I don't have an answer for the amount of time it will take, but you might want to try this SVM library, even though it's an R package: svmpath.
As described on that page, it will compute the entire "regularization path" for a two-class SVM classifier in about as much time as it takes to train an SVM using a single value of your penalty parameter C (or Nu).
So, instead of training and cross-validating an SVM with a value x for your C parameter, then doing all of that again for values x+1, x+2, etc., you can just train the SVM once, then query its predictive performance for different values of C after the fact, so to speak.
Change:
if rate is None: raise "get no rate"
in line 223 in grid.py to:
if rate is None: raise ValueError("get no rate")
Also, try adding:
gnuplot.write("set dgrid3d\n")
after this line in grid.py:
gnuplot.write("set contour\n")
This should fix your warnings and errors, but I am not sure it will work, since grid.py seems to think your data has no rate.
