Mahout trainAdaptiveLogistic - mahout

I'm trying to train a logistic regression model with mahout. The command line and the output look like this:
mahout trainAdaptiveLogistic --passes 100 --input /home/cloudera/Desktop/final.csv --features 20 --output /home/cloudera/Desktop/model/adaptivemodel --target Action --categories 2 --predictors Open High Close --types n n n
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.7-cdh4.7.1-job.jar
15/04/02 07:34:53 WARN driver.MahoutDriver: No trainAdaptiveLogistic.props found on classpath, will use command-line arguments only
20
Action ~ 0.000*Close + 0.000*High + 0.000*Open
Close 0.00003
High 0.00004
Open 0.00003
0.000000000 0.000033367 0.000036516 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000034630 0.000000000 0.000000000
15/04/02 07:38:30 INFO driver.MahoutDriver: Program took 216959 ms (Minutes: 3.6159833333333333)
The first few lines of a file I use are these:
Open,High,Low,Close,Volume,Adj Close,Action
59.30,60.05,58.88,59.41,3373800,59.41,BUY
59.64,60.26,58.88,59.83,3069100,59.83,BUY
58.91,59.25,58.21,59.03,3559500,59.03,SELL
59.57,60.44,58.67,58.68,3302000,58.68,BUY
Why are the coefficients so small and so close to zero? Is it wrong to try to predict a word with numeric predictors?

This looks like a case of selecting the wrong type for the target variable "Action". It should be categorical rather than plain text. Try recoding it as a binary variable (0 for SELL, 1 for BUY) and then use the appropriate feature encoder.
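As a rough illustration of the recoding suggested here, the following Python sketch rewrites the Action column as 0/1 before the file is handed to Mahout. The sample rows are copied from the question; the function name `encode_action` and the mapping dictionary are illustrative assumptions, not part of Mahout.

```python
import csv
import io

# Sample rows copied from the question; a real run would read final.csv instead.
SAMPLE = """Open,High,Low,Close,Volume,Adj Close,Action
59.30,60.05,58.88,59.41,3373800,59.41,BUY
59.64,60.26,58.88,59.83,3069100,59.83,BUY
58.91,59.25,58.21,59.03,3559500,59.03,SELL
"""

# Assumed mapping, following the answer: 0 for SELL, 1 for BUY.
ACTION_CODES = {"SELL": "0", "BUY": "1"}


def encode_action(csv_text, column="Action"):
    """Return the CSV text with the given column recoded to 0/1."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # A KeyError here flags a label other than BUY or SELL.
        row[column] = ACTION_CODES[row[column]]
        writer.writerow(row)
    return out.getvalue()


encoded = encode_action(SAMPLE)
print(encoded)
```

Whether this alone moves the coefficients away from zero is not something the sketch can show; it only covers the data-preparation step.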

Related

Multi-class classification in sparse dataset

I have a dataset of factory workstations.
Two kinds of error are recorded for the same time interval:
Errors that the user selects, together with a time interval (dependent variable, y)
Errors that the machines produce during production (independent variables, x)
There are 8 unique user-selected error types in total, so I tried to predict them from the machine-produced errors (188 types in total) and some other numerical features such as average machine speed, machine volume, etc.
Each row represents a user-selected error in a particular time interval.
For example, in the first row the user selected the time interval
2018-01-03 12:02:00 - 2018-01-03 12:05:37
and m_er_1 (machine error 1) also occurred 12 times in that interval.
m_er_1_dur (machine error 1 duration) is the total duration of that machine error in seconds.
So I joined the two tables, and the result looks like this:
user_error m_er_1 m_er_2 m_er_3 ... m_er_188 avg_m_speed .. m_er_1_dur
A 12 0 0 0 150 217
B 0 0 2 0 10 0
A 3 0 0 6 34 37
A 0 0 0 0 5 0
D 0 0 0 0 3 0
E 0 0 0 0 1000 0
In the end, I have 1900 rows and 390 columns (376 = 188 machine error counts + 188 machine error durations, plus 14 numerical features), and because of the machine errors it is a sparse dataset with lots of zeros.
There are no outliers and no NaN values. I normalized the data and tried several classification algorithms (SVM, Logistic Regression, MLPC, XGBoost, etc.).
I also tried PCA, but it didn't work well: with 165 components the cumulative explained_variance_ratio is around 0.95.
But the accuracy metrics are very low: for logistic regression the accuracy score is 0.55 and the MCC score is around 0.1; recall, F1 and precision are also very low.
Are there some steps that I am missing? What would you suggest for multiclass classification on a sparse dataset?
Thanks in advance

Vowpal Wabbit: Cannot retrieve latent factors with gd_mf_weights from a trained --rank model

I trained a rank 40 model on the movielens data, but cannot retrieve the weights from the trained model with gd_mf_weights. I'm following the syntax from the VW matrix factorization example but it is giving me errors. Please advise.
Model training call:
vw --rank 40 -q ui --l2 0.1 --learning_rate 0.015 --decay_learning_rate 0.97 --power_t 0 --passes 50 --cache_file movielens.cache -f movielens.reg -d train.vw
Weights generating call:
library/gd_mf_weights -I train.vw -O '/data/home/mlteam/notebooks/Recommenders-master/notebooks/Outputs/movielens' --vwparams '-q ui --rank 40 -i movielens.reg'
Error:
WARNING: model file has set of {-q, --cubic, --interactions} settings stored, but they'll be OVERRIDEN by set of {-q, --cubic, --interactions} settings from command line.
creating quadratic features for pairs: ui
finished run
number of examples = 0
weighted example sum = 0
weighted label sum = 0
average loss = -nan
total feature number = 0
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::program_options::multiple_occurrences> >'
what(): option '--rank' cannot be specified more than once
Aborted (core dumped)
If I just run it without specifying rank and interaction variables, it doesn't return the same trained model, since the parameters displayed are different from before.
library/gd_mf_weights -I train.vw -O '/data/home/mlteam/notebooks/Recommenders-master/notebooks/Outputs/movielens' --vwparams '-i movielens.reg'
creating quadratic features for pairs: ui
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
using no cache
Reading datafile =
num sources = 0
Segmentation fault (core dumped)
If I run weights generation with the entire set of model training parameters, it just ignores my extra parameters (and finishes much faster than 50 passes would take) and returns the same weights as a randomly initialized rank 40 model.
library/gd_mf_weights -I train.vw -0 '/data/home/mlteam/notebooks/Recommenders-master/notebooks/Outputs/movielens' --vwparams '--rank 40 -q ui --l2 0.1 --learning_rate 0.015 --decay_learning_rate 0.97 --power_t 0 --passes 50 --cache_file movielens.cache -f movielens.reg -d train.vw'

OpenCV Error: Bad argument (Can not get new positive sample

I am trying to train my own OpenCV Haar Classifier for cup detection.
I have 100 images that contain a cup and 400 images that do not, so:
No. of Positive Images = 100
No. of Negative Images = 400
First I created .dat files for both of them with
find ./Negative_Images -name '*.jpg' >negatives.dat
find ./Positive_Images -name '*.jpg' >positives.dat
Next, I ran the following command to generate samples (I set the number of samples to 100 because I have 100 positive images. Is that right?)
perl createtrainsamples.pl positives.dat negatives.dat samples 100 "opencv_createsamples -bgcolor 0 -bgthresh 0 -maxxangle 1.1 -maxyangle 1.1 -maxzangle 0.5 -maxidev 40 -w 80 -h 60"
Now 100 samples (*.jpg.vec) are created in the samples folder. Next, I ran the following command to generate samples.vec:
python ./tools/mergevec.py -v samples/ -o samples.vec
(mergevec.py comes from the tutorial by mrnugget.)
The next step is training, for which I ran opencv_traincascade:
opencv_traincascade -data classifier -vec samples.vec -bg negatives.dat -precalcValBufSize 2500 -precalcIdxBufSize 2500 -numPos 100 -numNeg 400 -numStages 15 -minhitrate 0.99 -maxfalsealarm 0.5 -w 80 -h 60
I am receiving the error "Can not get new positive sample".
Someone solved it by setting numPos = noOfPositiveImages*0.9, but that did not work for me.
From different sources, I found a formula for choosing the value of numPos:
vec-file has to contain >= (numPos + (numStages-1) * (1 - minHitRate) * numPos) + S
As far as I understand, in my case:
vec-file contains 100 samples (I had 100 positive images, and 100 samples were created from them)
numStages = 4 (or any other value I choose)
minHitRate = 0.99
S = count of samples from the vec-file (another source says: the count of all the skipped samples from the vec-file, over all stages)
I do not understand what value I should put for S.
Can anyone explain this formula with example? What value should I put in the command to solve this error?
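To make the arithmetic in the quoted formula concrete, here is a small Python sketch that solves the inequality for numPos. It only restates the rule of thumb from the question. The function names are illustrative, and S (the number of skipped samples) cannot be known before training, so the value 0 used below is an optimistic assumption rather than a recommendation.

```python
import math


def required_vec_samples(num_pos, num_stages, min_hit_rate, skipped=0):
    """Right-hand side of the rule of thumb quoted in the question."""
    return num_pos + (num_stages - 1) * (1 - min_hit_rate) * num_pos + skipped


def max_num_pos(vec_count, num_stages, min_hit_rate, skipped=0):
    """Largest numPos that satisfies the inequality for a given vec-file size."""
    per_pos = 1 + (num_stages - 1) * (1 - min_hit_rate)
    return math.floor((vec_count - skipped) / per_pos)


# Values from the question: 100 samples in the vec-file, minHitRate 0.99.
# skipped (S) is left at 0, which is an assumption; any skipped samples
# lower the usable numPos further.
for stages in (4, 15):
    n = max_num_pos(100, stages, 0.99)
    print(stages, n, round(required_vec_samples(n, stages, 0.99), 2))
```

Under this rule, passing -numPos 100 with a vec-file of exactly 100 samples leaves no headroom at all, which is consistent with the error. If the rule holds, a value somewhat below the computed maximum would leave room for S.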

Vowpal Wabbit - precision recall f-measure

How do you usually get precision, recall and f-measure from a model created in Vowpal Wabbit on a classification problem?
Are there any available scripts or programs that are commonly used for this with vw's output?
To make a minimal example using the following data in playtennis.txt :
2 | sunny 85 85 false
2 | sunny 80 90 true
1 | overcast 83 78 false
1 | rain 70 96 false
1 | rain 68 80 false
2 | rain 65 70 true
1 | overcast 64 65 true
2 | sunny 72 95 false
1 | sunny 69 70 false
1 | rain 75 80 false
1 | sunny 75 70 true
1 | overcast 72 90 true
1 | overcast 81 75 false
2 | rain 71 80 true
I create the model with:
vw playtennis.txt --oaa 2 -f playtennis.model --loss_function logistic
Then, I get predictions and raw predictions of the trained model on the training data itself with:
vw -t -i playtennis.model playtennis.txt -p playtennis.predict -r playtennis.rawp
Going from here, what scripts or programs do you usually use to get precision, recall and f-measure, given training data playtennis.txt and the predictions on the training data in playtennis.predict?
Also, if this were a multi-label classification problem (each instance can have more than one target label, which vw can also handle), would your proposed scripts or programs be able to process it?
Given that you have a pair of 'predicted vs actual' value for each example, you can use Rich Caruana's KDD perf utility to compute these (and many other) metrics.
In the case of multi-class, you should simply consider every correctly classified case a success and every class-mismatch a failure to predict correctly.
Here's a more detailed recipe for the binary case:
# get the labels into *.actual (correct) file
$ cut -d' ' -f1 playtennis.txt > playtennis.actual
# paste the actual vs predicted side-by-side (+ cleanup trailing zeros)
$ paste playtennis.actual playtennis.predict | sed 's/\.0*$//' > playtennis.ap
# convert original (1,2) classes to binary (0,1):
$ perl -pe 's/1/0/g; s/2/1/g;' playtennis.ap > playtennis.ap01
# run perf to determine precision, recall and F-measure:
$ perf -PRE -REC -PRF -file playtennis.ap01
PRE 1.00000 pred_thresh 0.500000
REC 0.80000 pred_thresh 0.500000
PRF 0.88889 pred_thresh 0.500000
Note that as Martin mentioned, vw uses the {-1, +1} convention for binary classification, whereas perf uses the {0, 1} convention so you may have to translate back and forth when switching between the two.
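If perf is not available, the same three numbers can be computed directly from the actual/predicted pairs. The following Python sketch is not the perf utility and does not reproduce its other metrics. It assumes hard 0/1 labels (the {0, 1} convention), and the sample lists are made up for illustration rather than taken from playtennis.predict.

```python
def precision_recall_f1(actual, predicted, positive=1):
    """Compute precision, recall and F1 for one class treated as positive."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels in the {0, 1} convention, not the playtennis predictions.
actual = [1, 1, 1, 1, 1, 0, 0]
predicted = [1, 1, 1, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(actual, predicted)
print(f"PRE {precision:.5f}  REC {recall:.5f}  PRF {f1:.5f}")
```

For a multi-class problem, one option is to call the function once per class, passing that class as `positive`, and then average the results.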
For binary classification, I would recommend using labels +1 (play tennis) and -1 (don't play tennis) and --loss_function=logistic (although --oaa 2 and labels 1 and 2 can be used as well). VW then reports the logistic loss, which may be a more informative/useful evaluation measure than accuracy/precision/recall/f1 (depending on the application). If you want 0/1 loss (i.e. "one minus accuracy"), add --binary.
For precision, recall, f1-score, auc and other measures, you can use the perf tool as suggested in arielf's answer.
For standard multi-class classification (one correct class for each example), use --oaa N --loss_function=logistic and VW will report the 0/1 loss.
For multi-label multi-class classification (more correct labels per example allowed), you can use --multilabel_oaa N (or convert each original example into N binary-classification examples).

optimize hive query for multitable join

INSERT OVERWRITE TABLE result
SELECT /*+ STREAMTABLE(product) */
i.IMAGE_ID,
p.PRODUCT_NO,
p.STORE_NO,
p.PRODUCT_CAT_NO,
p.CAPTION,
p.PRODUCT_DESC,
p.IMAGE1_ID,
p.IMAGE2_ID,
s.STORE_ID,
s.STORE_NAME,
p.CREATE_DATE,
CASE WHEN custImg.IMAGE_ID is NULL THEN 0 ELSE 1 END,
CASE WHEN custImg1.IMAGE_ID is NULL THEN 0 ELSE 1 END,
CASE WHEN custImg2.IMAGE_ID is NULL THEN 0 ELSE 1 END
FROM image i
JOIN PRODUCT p ON i.IMAGE_ID = p.IMAGE1_ID
JOIN PRODUCT_CAT pcat ON p.PRODUCT_CAT_NO = pcat.PRODUCT_CAT_NO
JOIN STORE s ON p.STORE_NO = s.STORE_NO
JOIN STOCK_INFO si ON si.STOCK_INFO_ID = pcat.STOCK_INFO_ID
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg ON i.IMAGE_ID = custImg.IMAGE_ID
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg1 ON p.IMAGE1_ID = custImg1.IMAGE_ID
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg2 ON p.IMAGE2_ID = custImg2.IMAGE_ID;
I have a join query that joins huge tables, and I am trying to optimize it in Hive. Here are some facts about the tables:
image table has 60m rows,
product table has 1b rows,
product_cat has 1000 rows,
store has 1m rows,
stock_info has 100 rows,
customizable_image has 200k rows.
A product can have one or two images (image1 and image2), and product-level information is stored only in the product table. I tried moving the join with product to the bottom, but I couldn't, because all the following joins require data from the product table.
Here is what I have tried so far:
1. I gave Hive the hint to stream the product table, as it's the biggest one.
2. I bucketed the table (during create table) into 256 buckets (on image_id) and then did the join. This didn't give me any significant performance gain.
3. I changed the input format from textfile (gzip files) to sequence file, so that the input is splittable and Hive can run more mappers if it wants to.
Here are some key logs from the Hive console. I ran this query on AWS. Can anyone help me understand the primary bottleneck here? This job is only processing a subset of the actual data.
Stage-14 is selected by condition resolver.
Launching Job 1 out of 11
Number of reduce tasks not specified. Estimated from input data size: 22
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Kill Command = /home/hadoop/bin/hadoop job -kill job_201403242034_0001
Hadoop job information for Stage-14: number of mappers: 341; number of reducers: 22
2014-03-24 20:55:05,709 Stage-14 map = 0%, reduce = 0%
.
2014-03-24 23:26:32,064 Stage-14 map = 100%, reduce = 100%, Cumulative CPU 34198.12 sec
MapReduce Total cumulative CPU time: 0 days 9 hours 29 minutes 58 seconds 120 msec
.
2014-03-25 00:33:39,702 Stage-30 map = 100%, reduce = 100%, Cumulative CPU 20879.69 sec
MapReduce Total cumulative CPU time: 0 days 5 hours 47 minutes 59 seconds 690 msec
.
2014-03-26 04:15:25,809 Stage-14 map = 100%, reduce = 100%, Cumulative CPU 3903.4 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 5 minutes 3 seconds 400 msec
.
2014-03-26 04:25:05,892 Stage-30 map = 100%, reduce = 100%, Cumulative CPU 2707.34 sec
MapReduce Total cumulative CPU time: 45 minutes 7 seconds 340 msec
.
2014-03-26 04:45:56,465 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3901.99 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 5 minutes 1 seconds 990 msec
.
2014-03-26 04:54:56,061 Stage-26 map = 100%, reduce = 100%, Cumulative CPU 2388.71 sec
MapReduce Total cumulative CPU time: 39 minutes 48 seconds 710 msec
.
2014-03-26 05:12:35,541 Stage-4 map = 100%, reduce = 100%, Cumulative CPU 3792.5 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 3 minutes 12 seconds 500 msec
.
2014-03-26 05:34:21,967 Stage-5 map = 100%, reduce = 100%, Cumulative CPU 4432.22 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 13 minutes 52 seconds 220 msec
.
2014-03-26 05:54:43,928 Stage-21 map = 100%, reduce = 100%, Cumulative CPU 6052.96 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 40 minutes 52 seconds 960 msec
MapReduce Jobs Launched:
Job 0: Map: 59 Reduce: 18 Cumulative CPU: 3903.4 sec HDFS Read: 37387 HDFS Write: 12658668325 SUCCESS
Job 1: Map: 48 Cumulative CPU: 2707.34 sec HDFS Read: 12658908810 HDFS Write: 9321506973 SUCCESS
Job 2: Map: 29 Reduce: 10 Cumulative CPU: 3901.99 sec HDFS Read: 9321641955 HDFS Write: 11079251576 SUCCESS
Job 3: Map: 42 Cumulative CPU: 2388.71 sec HDFS Read: 11079470178 HDFS Write: 10932264824 SUCCESS
Job 4: Map: 42 Reduce: 12 Cumulative CPU: 3792.5 sec HDFS Read: 10932405443 HDFS Write: 11812454443 SUCCESS
Job 5: Map: 45 Reduce: 13 Cumulative CPU: 4432.22 sec HDFS Read: 11812679475 HDFS Write: 11815458945 SUCCESS
Job 6: Map: 42 Cumulative CPU: 6052.96 sec HDFS Read: 11815691155 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 days 7 hours 32 minutes 59 seconds 120 msec
OK
The query is still taking longer than 5 hours in Hive, whereas in an RDBMS it takes only 5 hrs. I need some help optimizing this query so that it executes much faster. Interestingly, when I ran the task with 4 large core instances, the time taken improved by only 10 mins compared to the run with 3 large core instances, but when I ran the task with 3 medium core instances, it took 1 hr 10 mins more.
This brings me to the question: is Hive even the right choice for such complex joins?
I suspect the bottleneck is just in sorting your product table, since it seems much larger than the others. I think joins with Hive for tables over a certain size become untenable, simply because they require a sort.
There are parameters to optimize sorting, like io.sort.mb, which you can try setting so that more sorting occurs in memory rather than spilling to disk, re-reading and re-sorting. Look at the number of spilled records, and see if it is much larger than your inputs. There are a variety of ways to optimize sorting. It might also help to break your query up into multiple subqueries so it doesn't have to sort as much at one time.
For the stock_info and product_cat tables, you could probably keep them in memory, since they are so small. (Check out the 'distributed_map' UDF in Brickhouse: https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/dcache/DistributedMapUDF.java ) For the customizable_image table, you might be able to use a Bloom filter, if having a few false positives is not a big problem.
To remove the join completely, perhaps you could store the image info in a key-value store like HBase and do lookups instead. Brickhouse also has UDFs for HBase, like hbase_get and hbase_cached_get.
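To illustrate the Bloom filter idea mentioned for the customizable_image lookups, here is a minimal Python sketch of the data structure itself. It is not Hive or Brickhouse code. The class, the bit-array size, the number of hashes and the sample IDs are all illustrative assumptions. The point is only that membership can be tested from a small in-memory structure, at the cost of occasional false positives.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 22, num_hashes=5):
        # Sizes are illustrative; real values depend on row count and error budget.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one SHA-256 hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


# Hypothetical IDs standing in for CUSTOMIZABLE_IMAGE.IMAGE_ID values.
customizable_ids = BloomFilter()
for image_id in (101, 205, 999):
    customizable_ids.add(image_id)

# Mirrors CASE WHEN custImg.IMAGE_ID IS NULL THEN 0 ELSE 1 END, without a join.
flags = [1 if image_id in customizable_ids else 0 for image_id in (101, 102, 999)]
print(flags)
```

Every ID that was added is always reported as present. An ID that was never added is usually reported as absent but can occasionally be reported as present, which is the trade-off the answer refers to.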
