Verifying the model generated by the classifier - Mahout

I am using Mahout's Naive Bayes classification algorithm to classify input documents into known categories.
I am able to build the model using the following Mahout commands:
mahout seq2sparse
mahout split
mahout trainnb
mahout testnb
The test results look good.
Now I would like to verify my model with real data.
I am trying the command below to verify the output:
mahout org.apache.mahout.classifier.Classify \
-m /data/model/ \
--classify /data/input.txt \
--encoding UTF-8 \
--analyzer org.apache.mahout.vectorizer.DefaultAnalyzer \
--defaultCat unknown \
-ng 1 \
-type bayes \
-source hdfs
This command fails with "java.lang.ClassNotFoundException: org.apache.mahout.classifier.Classify".
I have put the Mahout core jar and the other Mahout jars on the classpath. I am using Mahout 0.9.
How do I run the classifier in Mahout 0.9?
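The org.apache.mahout.classifier.Classify driver belongs to the legacy Bayes implementation that no longer ships with 0.9, so the class is genuinely absent from the 0.9 jars rather than missing from your classpath. One workaround is to vectorize the new documents and score them with testnb. A rough sketch, assuming the new documents sit in /data/raw and reusing the model path from the question (the labelindex path is a hypothetical output of the trainnb step):

mahout seqdirectory -i /data/raw -o /data/raw-seq
mahout seq2sparse -i /data/raw-seq -o /data/raw-vectors -lnorm -nv -wt tfidf
mahout testnb \
    -i /data/raw-vectors/tfidf-vectors \
    -m /data/model \
    -l /data/labelindex \
    -ow -o /data/predictions

Note that seq2sparse builds a fresh dictionary, so for the scores to be meaningful the new documents have to be vectorized against the training run's dictionary; this sketch glosses over that step, and in practice people either re-vectorize with the saved dictionary or call the StandardNaiveBayesClassifier Java API directly.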

Related

Error: "argument --job-dir: expected one argument" while training model using AI Platform on GCP

Running macOS Mojave.
I am following the official getting-started documentation to train a model using AI Platform.
So far I have managed to train my model locally using:
# This is similar to `python -m trainer.task --job-dir local-training-output`
# but it better replicates the AI Platform environment, especially
# for distributed training (not applicable here).
gcloud ai-platform local train \
--package-path trainer \
--module-name trainer.task \
--job-dir local-training-output
I then proceed to train the model on AI Platform by going through the following steps:
Setting the environment variables export JOB_NAME="my_first_keras_job" and export JOB_DIR="gs://$BUCKET_NAME/keras-job-dir".
Running the following command, as indicated in the docs, to package the trainer/ directory and submit the training job:
gcloud ai-platform jobs submit training $JOB_NAME \
--package-path trainer/ \
--module-name trainer.task \
--region $REGION \
--python-version 3.5 \
--runtime-version 1.13 \
--job-dir $JOB_DIR \
--stream-logs
I get the error:
ERROR: (gcloud.ai-platform.jobs.submit.training) argument --job-dir: expected one argument
Usage: gcloud ai-platform jobs submit training JOB [optional flags] [-- USER_ARGS ...]
optional flags may be --async | --config | --help | --job-dir | --labels | ...
As far as I understand, --job-dir does indeed have one argument.
I am not sure what I'm doing wrong. I am running the above command from the trainer/ directory, as shown in the documentation. I tried removing all spaces as described here, but the error persists.
Are you running this command locally, or on an AI notebook VM in Jupyter? Based on your details I assume you're running it locally; I am not a Mac expert, but hopefully this is helpful.
I just worked through the same error on an AI notebook VM, and my issue was that even though I had assigned it a value in a previous Jupyter cell, the $JOB_NAME variable was passing along an empty string in the gcloud command. Try running the following to make sure your code is actually passing a value for $JOB_DIR when you make the gcloud ai-platform call.
echo $JOB_DIR
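Extending that check to every variable the command interpolates (names as used in the question) makes an empty one obvious before you submit:

echo "JOB_NAME=$JOB_NAME"
echo "JOB_DIR=$JOB_DIR"
echo "BUCKET_NAME=$BUCKET_NAME"
echo "REGION=$REGION"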

Training locally with ML Engine & GCloud

I would like to train my model locally using this command:
gcloud ml-engine local train \
    --module-name cloud_runner \
    --job-dir ./tmp/output
The issue is that it complains that --job-dir: Must be of form gs://bucket/object.
This is a local train so I'm wondering why it wants the output to be a gs storage bucket rather than a local directory.
As explained by others, gcloud's --job-dir expects the location to be on GCS. To get around that, you can pass it as a regular argument directly to your module:
gcloud ml-engine local train \
--package-path trainer \
--module-name trainer.task \
-- \
--train-files $TRAIN_FILE \
--eval-files $EVAL_FILE \
--job-dir $JOB_DIR \
--train-steps $TRAIN_STEPS
The --package-path argument to the gcloud command should point to a directory that is a valid Python package, i.e., a directory that contains an __init__.py file (often an empty file). Note that it should be a local directory, not one on GCS.
The --module-name argument should be the fully qualified name of a valid Python module within that package. You can organize your directories however you want, but for the sake of consistency, the samples all have a Python package named trainer with the module to be run named task.py.
-- Source
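For reference, the minimal layout those samples assume looks like this (a sketch; only the trainer package and task.py module names come from the docs):

trainer/
    __init__.py    # marks trainer/ as a Python package; may be empty
    task.py        # entry point, run as --module-name trainer.task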
So you need to change this block to use a valid path:
gcloud ml-engine local train \
    --module-name cloud_runner \
    --job-dir ./tmp/output
Specifically, your error comes from --job-dir ./tmp/output, because the flag expects a gs:// path.
Local training tries to emulate what happens when you run using the Cloud because the point of local training is to detect problems before submitting your job to the service.
Using a local job-dir when using the CMLE service is an error because the output wouldn't persist after the job finishes.
So local training with gcloud also requires that job-dir be a GCS location.
If you want to run locally and not use GCS you can just run your TensorFlow program directly and not use gcloud.
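For example, reusing the module name from the question (this assumes cloud_runner itself parses the --job-dir flag):

python -m cloud_runner --job-dir ./tmp/output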

vowpal wabbit for binary text classification setup

I am using vw-8.20170116 for a binary text classification problem. The text strings are concatenated from several short strings (5-20 words each). The input looks like:
-1 1.0 |aa .... ... ..... |bb ... ... .... .. |cc ....... .. ...
1 5.0 |aa .... ... ..... |bb ..... .. .... . |cc .... .. ...
The command that I am using for training is
./vw-8.20170116 -d train_feat.txt -k -c -f model.vw --ngram 2 --skips 2 --nn 10 --loss_function logistic --passes 100 --l2 1e-8 --holdout_off --threads --ignore bc
and for testing:
./vw-8.20170116 -d test_feat.txt -t --loss_function logistic --link logistic -i model.vw -p test_pred.txt
Question: how can I get vw to train in parallel on my 8-core machine? I thought --threads would help, but I am not seeing any speedup. And how do I control the number of cores used?
Using this link for reference.
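If --threads isn't buying you anything, one way to use several cores is vw's allreduce cluster mode, run entirely on localhost: start the spanning_tree coordinator that ships with vw, shard the training data, and launch one worker per shard. A rough sketch for two workers (shard names come from split; the --unique_id value is arbitrary but must match across workers):

spanning_tree                          # allreduce coordinator daemon
split -n l/2 train_feat.txt shard_     # produces shard_aa and shard_ab (GNU split)
./vw-8.20170116 -d shard_aa -k -c --span_server localhost --total 2 --node 0 \
    --unique_id 42 --loss_function logistic --passes 100 -f model.vw &
./vw-8.20170116 -d shard_ab -k -c --span_server localhost --total 2 --node 1 \
    --unique_id 42 --loss_function logistic --passes 100 &
wait

The number of workers (--total, with one --node per process) is how you control the core count in this mode.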

Creates package but no export

My job completes with no error. The logs show "accuracy", "auc", and other statistical measures of my model. ML Engine creates a package subdirectory, and a tar under that, as expected. But there's no export directory, checkpoint, eval, graph, or any other artifact that I'm accustomed to seeing when I train locally. Am I missing something simple in the command I'm using to call the service?
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir $OUTPUT_PATH \
--runtime-version 1.0 \
--module-name trainer.task \
--package-path trainer/ \
--region $REGION \
-- \
--model_type wide \
--train_data $TRAIN_DATA \
--test_data $TEST_DATA \
--train_steps 1000 \
--verbose-logging true
The logs show this: model directory = /tmp/tmpS7Z2bq
But I was expecting my model to go to the GCS bucket I defined in $OUTPUT_PATH.
I'm following the steps under "Run a single-instance trainer in the cloud" from the getting started docs.
Maybe you could show where and how you declare $OUTPUT_PATH?
Also, the model directory might be the directory within $OUTPUT_PATH where you can find the model for that specific job.
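For comparison, the getting-started docs define the output path roughly like this (the bucket name is whatever you created earlier):

OUTPUT_PATH=gs://$BUCKET_NAME/$JOB_NAME

If $OUTPUT_PATH is set that way, a logged model directory of /tmp/tmpS7Z2bq suggests the trainer code is falling back to a temp dir instead of wiring the --job-dir it receives into its model directory.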

Run cvb in Mahout 0.8

The current Mahout 0.8-SNAPSHOT includes a Collapsed Variational Bayes (cvb) implementation for topic modeling and removed the Latent Dirichlet Allocation (lda) approach, because cvb can be parallelized far better. Unfortunately, there is only documentation for lda on how to run an example and generate meaningful output.
Thus, I want to:
preprocess some texts correctly
run the cvb0_local version of cvb
inspect the results by looking at the top n words in each of the generated topics
So here are the Mahout commands I had to call in a Linux shell, in order, to do it.
$MAHOUT_HOME points to my mahout/bin folder.
$MAHOUT_HOME/mahout seqdirectory \
-i path/to/directory/with/texts \
-o out/sequenced
$MAHOUT_HOME/mahout seq2sparse -i out/sequenced \
-o out/sparseVectors \
--namedVector \
-wt tf
$MAHOUT_HOME/mahout rowid \
-i out/sparseVectors/tf-vectors/ \
-o out/matrix
$MAHOUT_HOME/mahout cvb0_local \
-i out/matrix/matrix \
-d out/sparseVectors/dictionary.file-0 \
-a 0.5 \
-top 4 -do out/cvb/do_out \
-to out/cvb/to_out
Inspect the output by showing the top 10 words of each topic:
$MAHOUT_HOME/mahout vectordump \
-i out/cvb/to_out \
--dictionary out/sparseVectors/dictionary.file-0 \
--dictionaryType sequencefile \
--vectorSize 10 \
-sort out/cvb/to_out
Thanks to JoKnopp for the detailed commands.
If you get:
Exception in thread "main" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
you need to add the command-line option maxIterations:
--maxIterations (-m) maxIterations
I use -m 20 and it works.
Refer to:
https://issues.apache.org/jira/browse/MAHOUT-1141
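With that fix applied, the cvb0_local call from above becomes (same paths as before, iteration count from this answer):

$MAHOUT_HOME/mahout cvb0_local \
    -i out/matrix/matrix \
    -d out/sparseVectors/dictionary.file-0 \
    -a 0.5 \
    -top 4 \
    -m 20 \
    -do out/cvb/do_out \
    -to out/cvb/to_out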
