The output of cvb in Mahout 0.7

I'm running Mahout 0.7 on Hadoop 1.0.4. I want to see the results of topic modeling on the Reuters dataset. However, I'm getting a seemingly useless result when I use the vectordump tool in Mahout.
I've followed this set of instructions for the example:
Run cvb in mahout 0.8.
However, after running the vectordump tool, I get a huge output file containing lines like the following: {0.01:5.726429339702471E-12,0.05:6.196569958376538E-9,...}
I'm not sure whether this is the actual output we are supposed to see for the Reuters dataset.

The same thing happened to me, and the solution is simple:
get the latest version from the Apache SVN server: http://svn.apache.org/repos/asf/mahout/trunk
This happens because of a vectorSize bug in Mahout 0.7.

I don't think they have provided the type of output you are looking for; see https://issues.apache.org/jira/browse/MAHOUT-1470


How to run a Python script in the background on Azure

I have a university project in which I have to run a number of machine learning algorithms (SVM, ME, Naive Bayes, etc.) and perform a grid search on them to find the optimal sets of hyper-parameters. Running all of these would take an exceedingly long time (48-168 hours total, but run in batches), and considering my computer becomes more or less unusable while I run them, I was trying to find a solution that allowed me to run my code externally. The scripts I have to run are in Python, and my plan was to run them on Azure to make use of its "Azure for Students" $100 credit.
My original plan was to use Azure's ML notebook section and then run the Python scripts in the terminal it provides. My problem with this route is that, as far as I can tell, when the browser closes, the computation stops, which is a problem. I looked into it and found some articles mentioning a combination of 'ctrl-z', 'bg', and 'disown' to disconnect the process from the shell, but I thought there should definitely be a better way to do it. (I also wasn't sure how this would work in my case, where there are 8 processes running at once via gridsearchcv's n_jobs=-1 feature.)
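For reference, the detach-from-shell approach mentioned above can be sketched as follows; here a long-running sleep stands in for the actual training script (the script and log names are placeholders):

```shell
# Stand-in for the real job; in practice it would be something like:
#   nohup python grid_search.py > grid_search.log 2>&1 &
nohup sleep 30 > /dev/null 2>&1 &
pid=$!
disown    # remove the job from the shell's job table
```

With nohup the process ignores the hangup signal sent when the terminal closes, and disown detaches it from the shell's job control, so it keeps running after logout. Whether it also survives the Azure notebook's compute session being recycled is a separate question.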
I then realized a better way to do this would be to use pipelines. My intent was to create a number of pipelines of the form:
(Import data in xlsx file) -> (python script to run ML) -> (export data to working directory)
And then run them until all the work is completed. In the first stage I used the parameters,
And I got the error,
My intention was to have the Excel file pipe into the Python script as a data frame, but this implementation (and all the others I've tried) isn't working.
My first question is: how do I get the Excel data to pipe into the Python script properly?
My second question is: is there a better way to go about doing this? Would running it in the shell be easier? If so, how do I ensure it keeps running while my browser is closed? Are there other services that would be better? My main metrics are price (cheap) and time limit (the ability to run for a long time), but any suggestions would be greatly appreciated.
I also tried using google colab, this worked but it felt slower than running on my computer.
To run a grid search with AzureML, you would use a Sweep job. The simplest way to kick off a Sweep is via the CLI. See here for an example.
$schema: https://azuremlschemas.azureedge.net/latest/sweepJob.schema.json
type: sweep
trial:
  command: >-
    python hello-sweep.py
    --A ${{inputs.A}}
    --B ${{search_space.B}}
    --C ${{search_space.C}}
  code: src
  environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
inputs:
  A: 0.5
compute: azureml:cpu-cluster
sampling_algorithm: random
search_space:
  B:
    type: choice
    values: ["hello", "world", "hello_world"]
  C:
    type: uniform
    min_value: 0.1
    max_value: 1.0
objective:
  goal: minimize
  primary_metric: random_metric
limits:
  max_total_trials: 4
  max_concurrent_trials: 2
  timeout: 3600
display_name: hello-sweep-example
experiment_name: hello-sweep-example
description: Hello sweep job example.
You can start that job using the AzureML v2 CLI with the following command:
az ml job create -f hello-sweep.yml
That will create up to max_total_trials jobs for different parameter combinations, as defined in the search_space and governed by the sampling_algorithm, which can be random, grid, or bayesian.
The actual job that is started is defined under trial. You need a program or script of some sort that you can execute via a command line and that can take parameters via that command line. command is the command that is executed, code is a folder on the local machine that contains the script/program you want to run, and environment is a registered environment in your workspace. azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest is one that is predefined in AzureML, but you can also create your own.
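The trial script itself only needs to accept the swept parameters on the command line. A minimal sketch of what hello-sweep.py might look like (argparse-based; the metric logging that the sweep's objective reads is omitted):

```python
import argparse

def parse_args(argv=None):
    # --A comes from inputs; --B and --C are filled in by the
    # sweep's search_space for each trial.
    parser = argparse.ArgumentParser()
    parser.add_argument("--A", type=float)
    parser.add_argument("--B")
    parser.add_argument("--C", type=float)
    return parser.parse_args(argv)

# Simulate one trial's command line as generated from the YAML above.
args = parse_args(["--A", "0.5", "--B", "hello", "--C", "0.3"])
print(args.A, args.B, args.C)
```

A real script would read sys.argv (parse_args(None)) and then train and log the metric named in the objective, e.g. via mlflow.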
If you prefer Python, here is the same thing done in Python.
See here for a blog post on How to do hyperparameter tuning using Azure ML.

Google Dataflow spending hours estimating input size

I'm fairly new to Google Dataflow and I am finding that the service spends several hours estimating the input file size before actually processing data, and will often do several recounts for large input collections before failing. I'm using Apache Beam 2.9 and the io.ReadFromText method.
The logs start with a comment about beginning estimation of input file size and continue to log an update every 10k files counted.
Is there a way to skip this step or to significantly increase the pace in which it does the count?
The Python ReadFromText source is based on FileBasedSource. If you look at the code for it, you can find that its estimate_size method is inefficient for a very large set of files.
As we discussed in the comments, you may be able to improve this bottleneck by manually partitioning the range of files. For example, if your files are gs://my_bucket/file001, gs://my_bucket/file002, ... gs://my_bucket/file999, you should be able to add 10 sources, something like so:
import apache_beam as beam
from apache_beam.io import ReadFromText

p = beam.Pipeline()
file_lines = (
    [p | 'Read%d' % i >> ReadFromText('gs://my_bucket/file%s*' % i)
     for i in range(10)]
    | beam.Flatten())
This should help your pipeline scale for a case like this.
As for a permanent solution... I imagine that one could try to propose improvements to the source itself, so that a future version may have better performance.
There are also transforms based on Java's FileIO that we are planning to implement. Those may be helpful as well, but they are not so near on the horizon.

Text clustering within a log file

I am working on a problem of finding similar content in a log file. Let's say I have a log file which looks like this:
show version
Operating System (OS) Software
Software
BIOS: version 1.0.10
loader: version N/A
kickstart: version 4.2(7b)
system: version 4.2(7b)
BIOS compile time: 01/08/09
kickstart image file is: bootflash:/m9500-sf2ek9-kickstart-mz.4.2.7b.bin
kickstart compile time: 8/16/2010 13:00:00 [09/29/2010 23:10:48]
system image file is: bootflash:/m9500-sf2ek9-mz.4.2.7b.bin
system compile time: 8/16/2010 13:00:00 [09/30/2010 00:46:36]
Hardware
xxxx MDS 9509 (9 Slot) Chassis ("xxxxxxx/xxxxx-2")
xxxxxxx, xxxx with 1033100 kB of memory.
Processor Board ID xxxx
Device name: xxx-xxx-1
bootflash: 1000440 kB
slot0: 0 kB (expansion flash)
To a human eye, it is easy to see that "Software" and the data below it form one section, and "Hardware" and the data below it form another. Is there a way I can model this using machine learning or some other technique to cluster similar sections based on a pattern? Also, I have shown 2 similar kinds of pattern, but the patterns between sections might vary, and those should be identified as different sections. I have tried to find similarity using cosine similarity, but it doesn't help much because the words aren't similar, only the pattern is.
I actually see two separate machine learning problems:
1) If I understood you correctly, the first problem you want to solve is splitting each log into distinct sections, one for Hardware, one for Software, etc.
One approach to achieve this is to try to extract the headings which mark the beginning of a new section. To do so, you could manually label a set of different logs, marking each row as heading=true or heading=false.
Then you could train a classifier which takes your labeled data as input, and the result would be a model.
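Before training a classifier, a hand-tailored baseline is worth trying. A sketch in Python (the "a heading is a single capitalized word on its own line" rule is my assumption, modelled on the "Software"/"Hardware" lines in the example log):

```python
import re

def split_into_sections(log_lines):
    """Group log lines under the most recent heading line.

    Heuristic: a heading is a line consisting of a single
    capitalized word, e.g. "Software" or "Hardware".
    """
    sections = {}
    current = None
    for line in log_lines:
        if re.fullmatch(r"[A-Z][A-Za-z]*", line.strip()):
            current = line.strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections

log = [
    "Software",
    "BIOS: version 1.0.10",
    "kickstart: version 4.2(7b)",
    "Hardware",
    "bootflash: 1000440 kB",
]
sections = split_into_sections(log)
# sections maps "Software" and "Hardware" to their body lines.
```

A learned classifier can later replace the regex test while keeping the same grouping loop.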
2) Now that you have these distinct sections, you can split each log into them and treat each section as a separate document.
I would first try straightforward document clustering using a standard NLP pipeline:
Tokenize each document to get the tokens
Normalize them (maybe stemming is not the best idea for logs)
Create a tf-idf vector for each document
Start with a simple clustering algorithm like k-means to cluster the different sections
After the clustering, sections similar to each other should end up in the same cluster.
I hope this helps. I think the first task in particular is quite hard, and maybe hand-tailored patterns will perform better.
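The pipeline above can be sketched with scikit-learn (assuming it is available; the section strings are toy stand-ins for real log sections):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Each entry is one section of a log, treated as a separate document.
sections = [
    "BIOS version 1.0.10 loader version N/A kickstart version 4.2(7b)",
    "system version 4.2(7b) kickstart image file bootflash",
    "Chassis with 1033100 kB of memory Processor Board ID",
    "bootflash 1000440 kB slot0 0 kB expansion flash",
]

# Tokenize and build one tf-idf vector per document (no stemming).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sections)

# Cluster the section vectors with k-means.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
```

Note that plain tf-idf still compares word overlap, which is the limitation raised in the question; replacing the tokens with coarse shape features (e.g. mapping numbers to a NUM token) before vectorizing is one way to cluster by pattern rather than by vocabulary.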

Error while executing DetEval software to evaluate the performance of my text recognition algorithm

I have come up with a text recognition algorithm. This algorithm recognizes text in natural images. I am trying to test it against the groundtruth available for the dataset of ICDAR's robust reading challenge. For this, I have generated an xml file containing coordinates of text regions in a scene image, as recognized by my algorithm. A similar xml file is provided for the groundtruth data.
To generate quantitative results comparing the two xml files, I am required to use the DetEval software (as mentioned on the site). I have installed a command-line version on Linux.
The problem is: DetEval is not reading the input xml files. Specifically,
I run the following command (as per the instructions on the DetEval website):
rocplot /home/ekta/workspace/extract/result_ICDAR_2011/txt/GT2.xml { /home/ekta/workspace/extract/result_ICDAR_2011/txt/final.xml }
Here, GT2.xml is the groundtruth and final.xml is the file generated by my algorithm.
I get the following error message:
evaldetection -p 0.8,0.4,0.8,0.4,0.4,0.8,0,1 "{" "/home/ekta/workspace/extract/result_ICDAR_2011/txt/GT2.xml" | readdeteval -p 1 - >> /tmp/evaldetectioncurves20130818-21541-1kum9m9-0
evaldetection -p 0.8,0.4,0.8,0.4,0.4,0.8,0,1 "{" "/home/ekta/workspace/extract/result_ICDAR_2011/txt/GT2.xml"I/O warning : failed to load external entity "{"
Couldn't parse document {
-:1: parser error : Document is empty
^
-:1: parser error : Start tag expected, '<' not found
^
I/O error : Invalid seek
Couldn't parse document -
rocplot: ERROR running the command:
evaldetection -p 0.8,0.4,0.8,0.4,0.4,0.8,0,1 "{" "/home/ekta/workspace/extract/result_ICDAR_2011/txt/GT2.xml" | readdeteval -p 1 - >> /tmp/evaldetectioncurves20130818-21541-1kum9m9-0Error code: 256
What do I do? I am positive that there is no error in generating my xml file, because even the groundtruth file obtained from the website is not being parsed. Please help!
So, I managed to solve this issue. It turns out I was giving the wrong commands. rocplot is to be used only when you need multiple runs on the ground truth and detection files with varying evaluation parameters. See this paper to learn more about the parameters involved.
Currently, I have one ground truth file and one detection file, and I need to run them using just the default parameters used by DetEval. So, here is what needs to be done:
Go to the directory containing the detevalcmd directory and enter it. Run the following commands there:
1. ./evaldetection /path/to/detection/results/DetectionFilename.xml /path/to/ground/truth/file/GroundTruthFilename.xml > /path/where/you/want/to/store/results/result.xml
This will store the results in result.xml. Next, run the following command:
2. ./readdeteval /path/where/you/stored/results/result.xml
This will give something like:
100% of the images contain objects.
Generality: xxx
Inverse-Generality: xxx
<evaluation noImages="xxx">
  <icdar2003 r="xxx" p="xxx" hmean="xxx" noGT="xxx" noD="xxx"/>
  <score r="xxx" p="xxx" hmean="xxx" noGT="xxx" noD="xxx"/>
</evaluation>
So, there you go! You have the recall, precision, etc. for your algorithm.

How do I plot benchmark data in a Jenkins matrix project

I have several Jenkins matrix projects where I output benchmark results (i.e. execution times) in a CSV file. I'd like to plot these execution times as a function of the build number, so I can see if my projects are regressing over time.
I can confirm the Plot Plugin is a correct and quite useful approach. By the way, it supports CSV as well: plot configuration example
I've been using it for several years without any problem. Benchmark results were generated as a property file, with the benchmark id (series id) as the key and the result as the value. One build produces one result for each benchmark. With that data, it is quite easy to create a plot configuration and track performance.
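As a hypothetical illustration of the CSV input (the series names are made up, and the exact layout depends on how the plot is configured, so check the plugin documentation): a header row of series labels, then one row of values produced per build:

```csv
benchmark_sort_ms,benchmark_parse_ms
12.4,45.0
```

The plugin then plots each column as one series against the build number.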
This may help you:
https://wiki.jenkins-ci.org/display/JENKINS/Plot+Plugin
It adds plotting capabilities to Jenkins.
