How much time does grid.py take to run?

I am using libsvm for binary classification. I wanted to try grid.py, as it is said to improve results. I ran this script on five files in separate terminals, and it has been running for more than 12 hours.
This is the state of my 5 terminals now:
[root@localhost tools]# python grid.py sarts_nonarts_feat.txt>grid_arts.txt
Warning: empty z range [61.3997:61.3997], adjusting to [60.7857:62.0137]
line 2: warning: Cannot contour non grid data. Please use "set dgrid3d".
Warning: empty z range [61.3997:61.3997], adjusting to [60.7857:62.0137]
line 4: warning: Cannot contour non grid data. Please use "set dgrid3d".
[root@localhost tools]# python grid.py sgames_nongames_feat.txt>grid_games.txt
Warning: empty z range [64.5867:64.5867], adjusting to [63.9408:65.2326]
line 2: warning: Cannot contour non grid data. Please use "set dgrid3d".
Warning: empty z range [64.5867:64.5867], adjusting to [63.9408:65.2326]
line 4: warning: Cannot contour non grid data. Please use "set dgrid3d".
[root@localhost tools]# python grid.py sref_nonref_feat.txt>grid_ref.txt
Warning: empty z range [62.4602:62.4602], adjusting to [61.8356:63.0848]
line 2: warning: Cannot contour non grid data. Please use "set dgrid3d".
Warning: empty z range [62.4602:62.4602], adjusting to [61.8356:63.0848]
line 4: warning: Cannot contour non grid data. Please use "set dgrid3d".
[root@localhost tools]# python grid.py sbiz_nonbiz_feat.txt>grid_biz.txt
Warning: empty z range [67.9762:67.9762], adjusting to [67.2964:68.656]
line 2: warning: Cannot contour non grid data. Please use "set dgrid3d".
Warning: empty z range [67.9762:67.9762], adjusting to [67.2964:68.656]
line 4: warning: Cannot contour non grid data. Please use "set dgrid3d".
[root@localhost tools]# python grid.py snews_nonnews_feat.txt>grid_news.txt
Wrong input format at line 494
Traceback (most recent call last):
File "grid.py", line 223, in run
if rate is None: raise "get no rate"
TypeError: exceptions must be classes or instances, not str
I had redirected the outputs to files, but those files currently contain nothing.
The following files were created:
sbiz_nonbiz_feat.txt.out
sbiz_nonbiz_feat.txt.png
sarts_nonarts_feat.txt.out
sarts_nonarts_feat.txt.png
sgames_nongames_feat.txt.out
sgames_nongames_feat.txt.png
sref_nonref_feat.txt.out
sref_nonref_feat.txt.png
snews_nonnews_feat.txt.out (this one is empty)
There is just one line of information in each .out file.
The .png files are gnuplot plots, but I don't understand what these plots and warnings convey. Should I re-run the scripts?
Can anyone please tell me how much time this script might take if each input file contains about 144,000 lines?
Thanks and regards

Your data is huge: 144,000 lines, so this will take some time. I have used data of a similar size and it took up to a week to finish. If you are using images, which I suppose you are given the size of the data, try resizing your images before creating the feature files. You should get approximately the same results with the resized images.
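For the resizing step, here is a minimal sketch, assuming the Pillow library; the directories and the 128x128 target size are placeholders, not taken from the question:
import glob
import os
from PIL import Image

src_dir = "/path/to/original/images"   # placeholder
dst_dir = "/path/to/resized/images"    # placeholder
os.makedirs(dst_dir, exist_ok=True)
for path in glob.glob(os.path.join(src_dir, "*.jpg")):
    img = Image.open(path)
    img.thumbnail((128, 128))          # shrink in place, preserving aspect ratio
    img.save(os.path.join(dst_dir, os.path.basename(path)))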

The libSVM FAQ speaks to your question:
Q: Why grid.py/easy.py sometimes generates the following warning message?
Warning: empty z range [62.5:62.5], adjusting to [61.875:63.125]
Notice: cannot contour non grid data!
Nothing is wrong and please disregard the message. It is from gnuplot when drawing the contour.
As a side note, you can parallelize your grid.py operations. The libSVM tools directory README file has this to say on the matter:
Parallel grid search
You can conduct a parallel grid search by dispatching jobs to a
cluster of computers which share the same file system. First, you add
machine names in grid.py:
ssh_workers = ["linux1", "linux5", "linux5"]
and then setup your ssh so that the authentication works without
asking a password.
The same machine (e.g., linux5 here) can be listed more than once if
it has multiple CPUs or has more RAM. If the local machine is the
best, you can also enlarge the nr_local_worker. For example:
nr_local_worker = 2
In my Ubuntu 10.04 installation, grid.py is actually /usr/bin/svm-grid.py.
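For reference, after such an edit the relevant settings near the top of grid.py might look like this (the host names are placeholders, not from the question):
ssh_workers = ["node1", "node2", "node2"]   # list a host twice if it has two CPUs
nr_local_worker = 2                         # also run two jobs on the local machine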

I guess grid.py is trying to find the optimal value for C (or Nu)?
I don't have an answer for the amount of time it will take, but you might want to try this SVM library, even though it's an R package: svmpath.
As described on that page, it will compute the entire "regularization path" for a two-class SVM classifier in about as much time as it takes to train an SVM using one value of your penalty parameter C (or Nu).
So, instead of training and cross-validating an SVM with a value x for your C parameter, then doing all of that again for x+1, then x+2, and so on, you can train the SVM once and then query its predictive performance for different values of C after the fact, so to speak.

Change:
if rate is None: raise "get no rate"
in line 223 in grid.py to:
if rate is None: raise ValueError("get no rate")
Also, try adding:
gnuplot.write("set dgrid3d\n")
after this line in grid.py:
gnuplot.write("set contour\n")
This should fix your warnings and errors, but I am not sure if it will work, since grid.py seems to think your data has no rate.
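Put together, and as a sketch rather than a verbatim patch (the surrounding lines are reconstructed from this answer, not copied from grid.py), the two edits would look like:
# around line 223 of grid.py
if rate is None: raise ValueError("get no rate")  # was: raise "get no rate" (raising a plain string is not allowed)
# in the gnuplot plotting code
gnuplot.write("set contour\n")
gnuplot.write("set dgrid3d\n")                    # added so gnuplot can contour scattered (non-grid) data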

Related

What are the columns in perf-stat when run with -r and -x

I'm trying to interpret the results of perf-stat run on a program. I know that it was run with -r 30 and -x. According to https://perf.wiki.kernel.org/index.php/Tutorial, running with -r reports the stddev, but I'm not sure which of these columns that is, and I'm having trouble finding information on the output format when run with -x. One example of the output I've received is this:
19987,,cache-references,0.49%,562360,100.00
256,,cache-misses,10.65%,562360,100.00
541747,,branches,0.07%,562360,100.00
7098,,branch-misses,0.78%,562360,100.00
60,,page-faults,0.43%,560411,100.00
0.560244,,cpu-clock,0.28%,560411,100.00
0.560412,,task-clock,0.28%,560411,100.00
My guess is that the % column is the standard deviation as a percentage of the first column, but I'm not sure.
In summary: what do the columns represent, and which column is the standard deviation?
You are very close. Here are some blanks filled in. The columns are:
1. Arithmetic mean of the measured values.
2. The unit, if known. E.g. on my system it shows 'msec' for 'cpu-clock'.
3. Event name.
4. Standard deviation, scaled so that 100% = mean.
5. Time during which counting this event was actually running.
6. Fraction of enabled time during which this event was actually running (in %).
The last two are relevant for multiplexing: if there are more counters selected than can be recorded concurrently, the denoted percentage will drop below 100.
On my system (Linux 5.0.5, not sure since when this is available), there is also a shadow stat for some metrics which computes a derived metric. For example, cpu-clock will compute "CPUs utilized" and branch-misses computes the fraction of all branches that are missed. This adds two more columns:
7. Shadow stat value.
8. Shadow stat description.
Note that this format changes with some other options. For example, if you display the metrics with a more fine-grained grouping (e.g. per CPU), information about these groups will be prepended in additional columns.
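For reference, a small Python sketch that splits such a line into the fields listed above; the field order is an assumption based on this answer and the quoted output, and other perf versions or options may add columns:
import csv

with open("perf_stat.csv") as f:                 # placeholder: file holding the -x, output
    for row in csv.reader(f):
        if len(row) < 6 or row[0].startswith("#"):
            continue                             # skip comment lines and short rows
        value, unit, event, stddev, running, enabled = row[:6]
        print(f"{event}: mean={value} {unit}, stddev={stddev}, "
              f"running={running}, enabled={enabled}%")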

How to change data length parameter in maxima software?

I need to use the Maxima software to process some data. I am trying to read data from a text file structured as
1 2 3
11 22 33
etc.
The following commands load the data successfully:
load(numericalio);
read_matrix("path to the file");
The problem arises when I apply them to a more realistic (larger) data set. In that case the message "Expression longer than allowed by the configuration setting" appears.
How can I overcome this problem? I cannot see any relevant option in the configuration menu. I would be grateful for advice.
I ran into the same error message today, and it seems to be related to the size of the output that wxMaxima receives from the Maxima executable.
If you wish to display the output regardless, you can change it in the configuration here:
Edit>Configure>Worksheet>Show long expressions
Note that showing a massive expression or amount of data may dramatically slow the program down, so consider hiding the output (use a $ instead of a ; at the end of your lines) if you don't need to visualize the data.

A guide to convert_imageset.cpp

I am relatively new to machine learning/python/ubuntu.
I have a set of images in .jpg format, where half contain a feature I want Caffe to learn and half don't. I'm having trouble finding a way to convert them to the required lmdb format.
I have the necessary text input files.
My question is: can anyone provide a step-by-step guide on how to use convert_imageset.cpp in the Ubuntu terminal?
Thanks
A quick guide to Caffe's convert_imageset
Build
The first thing you must do is build Caffe and Caffe's tools (convert_imageset is one of these tools).
After installing Caffe and building it, make sure you also ran make tools.
Verify that a binary file convert_imageset is created in $CAFFE_ROOT/build/tools.
Prepare your data
Images: put all images in a folder (I'll call it here /path/to/jpegs/).
Labels: create a text file (e.g., /path/to/labels/train.txt) with a line per input image. For example:
img_0000.jpeg 1
img_0001.jpeg 0
img_0002.jpeg 0
In this example the first image is labeled 1 while the other two are labeled 0.
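If it helps, here is a small Python sketch (not part of Caffe) for generating such a labels file; the rule that filenames containing "pos" get label 1 is purely a placeholder for however your positive images are identified:
import os

image_dir = "/path/to/jpegs/"                    # same folder as above
with open("/path/to/labels/train.txt", "w") as f:
    for name in sorted(os.listdir(image_dir)):
        if not name.lower().endswith((".jpg", ".jpeg")):
            continue
        label = 1 if "pos" in name else 0        # hypothetical naming convention
        f.write("%s %d\n" % (name, label))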
Convert the dataset
Run the binary in shell
~$ GLOG_logtostderr=1 $CAFFE_ROOT/build/tools/convert_imageset \
--resize_height=200 --resize_width=200 --shuffle \
/path/to/jpegs/ \
/path/to/labels/train.txt \
/path/to/lmdb/train_lmdb
Command line explained:
Setting GLOG_logtostderr to 1 before calling convert_imageset tells the logging mechanism to redirect log messages to stderr.
--resize_height and --resize_width resize all input images to the same size, 200x200.
--shuffle randomly changes the order of the images, so the order in /path/to/labels/train.txt is not preserved.
The remaining arguments are the path to the images folder, the labels text file, and the output name. Note that the output should not exist prior to calling convert_imageset, otherwise you'll get a scary error message.
Other flags that might be useful:
--backend - allows you to choose between an lmdb dataset and levelDB.
--gray - convert all images to grayscale.
--encoded and --encoded_type - keep image data in encoded (jpg/png) compressed form in the database.
--help - shows some help, see all relevant flags under Flags from tools/convert_imageset.cpp
You can check out $CAFFE_ROOT/examples/imagenet/convert_imagenet.sh for an example of how to use convert_imageset.
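To sanity-check the created database, here is a short sketch that reads the first record back; it assumes the lmdb Python package and Caffe's Python bindings (caffe.proto.caffe_pb2) are available:
import lmdb
from caffe.proto import caffe_pb2

env = lmdb.open("/path/to/lmdb/train_lmdb", readonly=True)
with env.begin() as txn:
    key, value = next(txn.cursor().iternext())   # first (key, value) pair
    datum = caffe_pb2.Datum()
    datum.ParseFromString(value)
    print(key, datum.label, datum.channels, datum.height, datum.width)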

Error while executing DetEval software to evaluate the performance of my text recognition algorithm

I have come up with a text recognition algorithm that recognizes text in natural images. I am trying to test it against the ground truth available for the dataset of ICDAR's Robust Reading challenge. For this, I have generated an XML file containing the coordinates of text regions in a scene image, as recognized by my algorithm. A similar XML file is provided for the ground truth data.
To generate quantitative results comparing the two XML files, I am required to use the DetEval software (as mentioned on the site). I have installed the command line version on Linux.
The problem is that DetEval is not reading the input XML files. Specifically,
I run the following command (as per the instructions on the DetEval website):
rocplot /home/ekta/workspace/extract/result_ICDAR_2011/txt/GT2.xml { /home/ekta/workspace/extract/result_ICDAR_2011/txt/final.xml }
Here, GT2.xml is the ground truth and final.xml is the file generated by my algorithm.
I get the following error message:
evaldetection -p 0.8,0.4,0.8,0.4,0.4,0.8,0,1 "{" "/home/ekta/workspace/extract/result_ICDAR_2011/txt/GT2.xml" | readdeteval -p 1 - >> /tmp/evaldetectioncurves20130818-21541-1kum9m9-0
evaldetection -p 0.8,0.4,0.8,0.4,0.4,0.8,0,1 "{" "/home/ekta/workspace/extract/result_ICDAR_2011/txt/GT2.xml"I/O warning : failed to load external entity "{"
Couldn't parse document {
-:1: parser error : Document is empty
^
-:1: parser error : Start tag expected, '<' not found
^
I/O error : Invalid seek
Couldn't parse document -
rocplot: ERROR running the command:
evaldetection -p 0.8,0.4,0.8,0.4,0.4,0.8,0,1 "{" "/home/ekta/workspace/extract/result_ICDAR_2011/txt/GT2.xml" | readdeteval -p 1 - >> /tmp/evaldetectioncurves20130818-21541-1kum9m9-0Error code: 256
What do I do? I am positive that there is no error in generating my XML file, because even the ground truth file obtained from the website is not being parsed. Please help!
Regards,
Ekta
So, I managed to solve this issue. It turns out I was giving the wrong commands. rocplot is to be used only when I need multiple runs on the ground truth and detection files with varying evaluation parameters. See this paper to learn more about the parameters involved.
Currently, I have one ground truth file and one detection file, and I need to run the evaluation with just the default parameters used by DetEval. So, here is what needs to be done.
Go to the directory that contains the detevalcmd directory and enter detevalcmd. Run the following commands there:
1. ./evaldetection /path/to/detection/results/DetectionFilename.xml /path/to/ground/truth/file/GroundTruthFilename.xml > /path/where/you/want/to/store/results/result.xml
This will store the results in result.xml. Next, run:
2. ./readdeteval /path/where/you/stored/results/result.xml
This will give something like:
100% of the images contain objects.
Generality: xxx
Inverse-Generality: xxx
<evaluation noImages="xxx">
<icdar2003 r="xxx" p="xxx" hmean="xxx" noGT="XXX" noD="xxx"/>
<score r="Xxx" p="xxx" hmean="xxx" noGT="xxx" noD="xxx"/>
</evaluation>
So, there you go! You now have the recall, precision, etc. for your algorithm.
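If you need to repeat this for several detection files, the two steps can be wrapped in a small Python helper. This is only a sketch: the paths are placeholders and it must be run from inside the detevalcmd directory:
import subprocess

detection = "/path/to/detection/results/DetectionFilename.xml"
ground_truth = "/path/to/ground/truth/file/GroundTruthFilename.xml"
result = "/path/where/you/want/to/store/results/result.xml"

with open(result, "w") as out:                          # step 1: evaluate detections
    subprocess.run(["./evaldetection", detection, ground_truth], stdout=out, check=True)
subprocess.run(["./readdeteval", result], check=True)   # step 2: print recall/precision/hmean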

How to increase the ipython qtconsole scrollback buffer limit

When I load ipython with any one of:
ipython qtconsole
ipython qtconsole --pylab
ipython qtconsole --pylab inline
The output buffer only holds the last 500 lines. To see this, run:
for x in range(0, 501):
    print x
Is there a configuration option for this?
I've tried adjusting --cache-size but this does not seem to make a difference.
Quickly:
ipython qtconsole --IPythonWidget.buffer_size=1000
Or you can set it permanently by adding:
c.IPythonWidget.buffer_size=1000
to your IPython config file.
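For example, assuming the default profile (with older IPython versions the qtconsole settings typically live in ~/.ipython/profile_default/ipython_qtconsole_config.py, which ipython profile create generates):
# ipython_qtconsole_config.py (location is an assumption; adjust for your setup)
c = get_config()                     # provided by IPython's configuration loader
c.IPythonWidget.buffer_size = 1000   # keep 1000 lines of scrollback instead of 500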
For discovering this sort of thing, a helpful trick is:
ipython qtconsole --help-all | grep PATTERN
For instance, you already had 'buffer', so:
$> ipython qtconsole --help-all | grep -C 3 buffer
...
--IPythonWidget.buffer_size=<Integer>
Default: 500
The maximum number of lines of text before truncation. Specifying a non-
positive number disables text truncation (not recommended).
If IPython used a different name than you expected and that first search turned up nothing, you could grep for 500 instead, since you knew the value you wanted to change; that would also find the relevant config option.
The accepted answer is no longer correct if you are using Jupyter. Instead, the command line option should be:
jupyter qtconsole --ConsoleWidget.buffer_size=5000
You can choose whatever value you want, just make it larger than the default of 500.
If you want to make this permanent, go to your home directory - C:\Users\username, /Users/username, or /home/username - then go into the .jupyter folder (create it if it doesn't exist), then create the file jupyter_qtconsole_config.py and open it up in your favorite editor. Add the following line:
c.ConsoleWidget.buffer_size=5000
Again, the number can be anything, just as long as it is an integer larger than 500. Don't worry that c isn't defined in this particular file, it is already defined elsewhere in the startup machinery.
Thanks to @firescape for the pointer in the right direction.
