Vectorizing a Solr index with Mahout using lucene.vector

I'm trying to run a clustering job on Amazon EMR using Mahout.
I have a Solr index that I uploaded to S3, and I want to vectorize it using Mahout's lucene.vector (this is the first step in the job flow).
The parameters for the step are:
Jar: s3n://mahout-bucket/jars/mahout-core-0.6-job.jar
MainClass: org.apache.mahout.driver.MahoutDriver
Args: lucene.vector --dir s3n://mahout-input/solr_index/ --field name --dictOut /test/solr-dict-out/dict.txt --output /test/solr-vectors-out/vectors
The error in the log is:
Unknown program 'lucene.vector' chosen.
I've done the same process locally with hadoop and Mahout and it worked fine.
How should I call the lucene.vector function on EMR?

The program name, lucene.vector, should come immediately after bin/mahout:
/homes/cuneyt/trunk/bin/mahout lucene.vector --dir /homes/cuneyt/lucene/index --field 0 --output lda/vector --dictOut /homes/cuneyt/lda/dict.txt

I eventually figured out the answer. The problem was that I was using the wrong MainClass argument. Instead of
org.apache.mahout.driver.MahoutDriver
I should have used:
org.apache.mahout.utils.vectors.lucene.Driver
Therefore the correct arguments should have been:
Jar: s3n://mahout-bucket/jars/mahout-core-0.6-job.jar
MainClass: org.apache.mahout.utils.vectors.lucene.Driver
Args: --dir s3n://mahout-input/solr_index/ --field name --dictOut /test/solr-dict-out/dict.txt --output /test/solr-vectors-out/vectors
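For comparison, invoking that driver class directly through hadoop jar looks roughly like this (a sketch; it assumes the job jar is in the working directory and that the cluster can read the s3n:// path):
hadoop jar mahout-core-0.6-job.jar \
    org.apache.mahout.utils.vectors.lucene.Driver \
    --dir s3n://mahout-input/solr_index/ \
    --field name \
    --dictOut /test/solr-dict-out/dict.txt \
    --output /test/solr-vectors-out/vectors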

Related

snakemake: MissingOutputException within Docker

I am trying to run a pipeline within a Docker container using Snakemake. I am having a problem using the sortmerna tool to produce the {sample}_merged_sorted_mRNA and {sample}_merged_sorted outputs from the control_merged.fq and treated_merged.fq input files.
Here is my Snakefile:
SAMPLES = ["control","treated"]

for smp in SAMPLES:
    print("Sample " + smp + " will be processed")

rule final:
    input:
        expand('/output/{sample}_merged.fq', sample=SAMPLES),
        expand('/output/{sample}_merged_sorted', sample=SAMPLES),
        expand('/output/{sample}_merged_sorted_mRNA', sample=SAMPLES),

rule sortmerna:
    input: '/output/{sample}_merged.fq',
    output:
        merged_file='/output/{sample}_merged_sorted_mRNA',
        merged_sorted='/output/{sample}_merged_sorted',
    message: """---SORTING---"""
    shell:
        '''
        sortmerna --ref /usr/share/sortmerna/rRNA_databases/silva-bac-23s-id98.fasta,/usr/share/sortmerna/rRNA_databases/index/silva-bac-23s-id98: --reads {input} --paired_in -a 16 --log --fastx --aligned {output.merged_file} --other {output.merged_sorted} -v
        '''
When running this I get:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 57 of /input/Snakefile:
Missing files after 5 seconds:
/output/control_merged_sorted_mRNA
/output/control_merged_sorted
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /input/.snakemake/log/2018-11-05T091643.911334.snakemake.log
I tried to increase the latency with --latency-wait but I get the same result. The funny thing is that the two output files control_merged_sorted_mRNA.fq and control_merged_sorted.fq are produced, but the program fails and exits. The version of snakemake is 5.3.0. Any help?
Snakemake fails because the outputs declared by the rule sortmerna are not produced. This is not a latency problem; it is a problem with your outputs.
Your rule sortmerna expects as output:
/output/control_merged_sorted_mRNA
and
/output/control_merged_sorted
but the program you are using (I know nothing about sortmerna) is apparently producing
/output/control_merged_sorted_mRNA.fq
and
/output/control_merged_sorted.fq
Check whether the names you give to the --aligned and --other options on your program's command line are the real names of the files produced, or only basenames to which the program adds a .fq suffix. If you are in the latter case, I suggest you use the following (note that rule final must then also ask for the .fq file names):
rule final:
    input:
        expand('/output/{sample}_merged.fq', sample=SAMPLES),
        expand('/output/{sample}_merged_sorted.fq', sample=SAMPLES),
        expand('/output/{sample}_merged_sorted_mRNA.fq', sample=SAMPLES),

rule sortmerna:
    input:
        '/output/{sample}_merged.fq',
    output:
        merged_file='/output/{sample}_merged_sorted_mRNA.fq',
        merged_sorted='/output/{sample}_merged_sorted.fq'
    params:
        merged_file_basename='/output/{sample}_merged_sorted_mRNA',
        merged_sorted_basename='/output/{sample}_merged_sorted'
    message: """---SORTING---"""
    shell:
        """
        sortmerna --ref /usr/share/sortmerna/rRNA_databases/silva-bac-23s-id98.fasta,/usr/share/sortmerna/rRNA_databases/index/silva-bac-23s-id98: --reads {input} --paired_in -a 16 --log --fastx --aligned {params.merged_file_basename} --other {params.merged_sorted_basename} -v
        """

Unable to classify using Weka from the command line

I am using Weka 3.6.13 and trying to use a model to classify data:
java -cp weka-stable-3.6.13.jar weka.classifiers.Evaluation weka.classifiers.trees.RandomForest -l Parking.model -t Data_features_class_ques-2.arff
This fails with:
java.lang.Exception: training and test set are not compatible
The model works, though, when we use the GUI: Explorer -> Classify -> Supplied test set and load the arff file -> right-click on the result list and load the model -> right-click again -> re-evaluate model on current data set...
Any pointers would help.
If your data contains String attributes, first use StringToWordVector in batch mode, i.e. on both datasets in a single command (command 1), then use command 2 and command 3.
Command 1.
java weka.filters.unsupervised.attribute.StringToWordVector -b -R first-last -i training.arff -o training_s2w.arff -r test.arff -s test_s2w.arff
Command 2.
java weka.classifiers.trees.RandomForest -t training_s2w.arff -d model.model
Command 3.
java weka.classifiers.trees.RandomForest -T test_s2w.arff -l model.model -p 0 > result.txt
PS: add the classpath for weka.jar accordingly.
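For reference, with the classpath spelled out the full sequence might look like this (a sketch that assumes weka-stable-3.6.13.jar sits in the current directory and reuses the file names from the commands above):
java -cp weka-stable-3.6.13.jar weka.filters.unsupervised.attribute.StringToWordVector -b -R first-last -i training.arff -o training_s2w.arff -r test.arff -s test_s2w.arff
java -cp weka-stable-3.6.13.jar weka.classifiers.trees.RandomForest -t training_s2w.arff -d model.model
java -cp weka-stable-3.6.13.jar weka.classifiers.trees.RandomForest -T test_s2w.arff -l model.model -p 0 > result.txt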

Jenkins - Posting results to an external monitoring job is adding garbage to the build job log

I have an external monitoring job to which I'm pushing the result of another job with curl, based on this link:
Monitoring external jobs
After I create the job, I just need to run a curl command with the body encoded in hex to the specified URL; a build will then be created and the output added to it. What I get instead is part of my output in clear text and the rest in weird characters, like so:
Started
Asking akamai to purge this urls:
http://xxx/sites/all/modules/custom/uk.png http://aaaaaasites/all/modules/custom/flags/jp.png
<html><head><title>401 Unauthorized</title> </h�VC��&�G����CV�WF��&��VC�������R&R��BWF��&��VBF�66W72F�B&W6�W&6S�����&�G�����F����F�RW&�F �6�V6�7FGW2�bF�R&WVW7B�2��F�RF��RF�v�B�2��6�Ɩ�r&6�w&�V�B��"F�6�V6�7FGW2�bF�RF�6�W#�v�F��rf�"���F�W&vRF��6O request please keep in mind this is an estimated time
Waiting for another 60 seconds
Asking akamai to purge this urls:
...
..
..
This is how I'm doing it:
export output=`cat msg.out|xxd -c 256 -ps`
curl -k -X POST -d "<run><log encoding=\"hexBinary\">$output</log><result>0</result> <duration>2000</duration></run>" https://$jenkinsuser:$jenkinspass@127.0.0.1/jenkins/job/akamai_purge_results/postBuildResult -H'.crumb:c775f3aa15464563456346e'
If I cat that file it all looks fine, and even if I edit it with vi I can't see any problem with it.
Do you guys have any idea how to fix this?
Could it be a problem with the hex encoding? (I tried hex encode/decode pages with the result of xxd and they look fine.)
Thanks.
I had the same issue, and stumbled across this: http://blog.markfeeney.com/2010/01/hexbinary-encoding.html
From that page, you can get the encoding you need via this command:
echo "Hello world" | hexdump -v -e '1/1 "%02x"'
48656c6c6f20776f726c640a
An excerpt from the explanation:
So what the hell is that? -v means don't suppress any duplicate data
in the output, and -e is the format string. hexdump's very particular
about the formatting of the -e argument; so careful with the quotes.
The 1/1 means for every 1 byte encountered in the input, apply the
following formatting pattern 1 time. Despite this sounding like the
default behaviour in the man page, the 1/1 is not optional. /1 also
works, but the 1/1 is very very slightly more readable, IMO. The
"%02x" is just a standard-issue printf-style format code.
So in your case, you would do this (removing 'export' in favor of an inline variable):
OUTPUT=`cat msg.out | hexdump -v -e '1/1 "%02x"'`
curl -k -X POST -d "<run><log encoding=\"hexBinary\">$OUTPUT</log><result>0</result> <duration>2000</duration></run>" https://$jenkinsuser:$jenkinspass@127.0.0.1/jenkins/job/akamai_purge_results/postBuildResult -H'.crumb:c775f3aa15464563456346e'
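As a quick sanity check that the encoding round-trips cleanly, you can pipe the hex straight back through xxd -r -p (which converts plain hex back to bytes) and make sure the original text comes out:
echo "Hello world" | hexdump -v -e '1/1 "%02x"' | xxd -r -p
Hello world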

Monitoring URLs with Nagios

I'm trying to monitor actual URLs, and not only hosts, with Nagios, as I operate a shared server with several websites, and I don't think it's enough just to monitor the basic HTTP service (I'm including at the very bottom of this question a small explanation of what I'm envisioning).
(Side note: please note that I have Nagios installed and running inside a chroot on a CentOS system. I built nagios from source, and have used yum to install into this root all dependencies needed, etc...)
I first found check_url, but after installing it into /usr/lib/nagios/libexec, I kept getting a "return code of 255 is out of bounds" error. That's when I decided to start writing this question (but wait! There's another plugin I decided to try first!)
After reviewing This Question, which had practically the same problem I'm having with check_url, I decided to open a new question on the subject because:
a) I'm not using NRPE with this check
b) I tried the suggestions made on the earlier question to which I linked, but none of them worked. For example...
./check_url some-domain.com | echo $0
returns "0" (which indicates the check was successful)
I then followed the debugging instructions on Nagios Support to create a temp file called debug_check_url, and put the following in it (to then be called by my command definition):
#!/bin/sh
echo `date` >> /tmp/debug_check_url_plugin
echo $* >> /tmp/debug_check_url_plugin
/usr/local/nagios/libexec/check_url $*
Assuming I'm not in "debugging mode", my command definition for running check_url is as follows (inside command.cfg):
'check_url' command definition
define command{
    command_name    check_url
    command_line    $USER1$/check_url $url$
}
(Incidentally, you can also view what I was using in my service config file at the very bottom of this question)
Before publishing this question, however, I decided to give it one more shot at figuring out a solution. I found the check_url_status plugin, and decided to give that one a try. To do that, here's what I did:
mkdir /usr/lib/nagios/libexec/check_url_status/
downloaded both check_url_status and utils.pm
Per the user comment / review on the check_url_status plugin page, I changed "lib" to the proper directory of /usr/lib/nagios/libexec/.
Ran the following:
./check_url_status -U some-domain.com
When I ran the above command, I kept getting the following error:
bash-4.1# ./check_url_status -U mydomain.com
Can't locate utils.pm in @INC (@INC contains: /usr/lib/nagios/libexec/ /usr/local/lib/perl5 /usr/local/share/perl5 /usr/lib/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib/perl5 /usr/share/perl5) at ./check_url_status line 34.
BEGIN failed--compilation aborted at ./check_url_status line 34.
So at this point, I give up, and have a couple of questions:
Which of these two plugins would you recommend? check_url or check_url_status?
(After reading the description of check_url_status, I feel that this one might be the better choice. Your thoughts?)
Now, how would I fix my problem with whichever plugin you recommended?
At the beginning of this question, I mentioned I would include a small explanation of what I'm envisioning. I have a file called services.cfg which is where I have all of my service definitions located (imagine that!).
The following is a snippet of my service definition file, which I wrote to use check_url (because at that time, I thought everything worked). I'll build a service for each URL I want to monitor:
###
# Monitoring Individual URLs...
#
###
define service{
    host_name                {my-shared-web-server}
    service_description      URL: somedomain.com
    check_command            check_url!somedomain.com
    max_check_attempts       5
    check_interval           3
    retry_interval           1
    check_period             24x7
    notification_interval    30
    notification_period      workhours
}
I was making things WAY too complicated.
The built-in / installed by default plugin, check_http, can accomplish what I wanted and more. Here's how I have accomplished this:
My Service Definition:
define service{
    host_name                myers
    service_description      URL: my-url.com
    check_command            check_http_url!http://my-url.com
    max_check_attempts       5
    check_interval           3
    retry_interval           1
    check_period             24x7
    notification_interval    30
    notification_period      workhours
}
My Command Definition:
define command{
    command_name    check_http_url
    command_line    $USER1$/check_http -I $HOSTADDRESS$ -u $ARG1$
}
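To confirm the command expands the way you expect, you can run the plugin by hand from libexec and check its exit code (Nagios treats 0 as OK, 1 as WARNING, 2 as CRITICAL). A sketch, where /usr/local/nagios/libexec stands in for $USER1$ and 127.0.0.1 stands in for whatever $HOSTADDRESS$ resolves to for the myers host:
/usr/local/nagios/libexec/check_http -I 127.0.0.1 -u http://my-url.com
echo $?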
A better way to monitor URLs is WebInject, which can be used with Nagios.
The problem below is because you don't have the Perl utils package; try installing it.
bash-4.1# ./check_url_status -U mydomain.com Can't locate utils.pm in @INC (@INC contains:
You can write a script plugin. It is easy; you only have to check the URL with something like:
`curl -Is $URL -k| grep HTTP | cut -d ' ' -f2`
$URL is what you pass to the script as a parameter.
Then check the result: if the code is greater than 399 you have a problem, else everything is OK. Then return the right exit code and message for Nagios.
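A minimal sketch of such a plugin script, following that logic and using the standard Nagios exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN); the file name check_url_code.sh is just an example:
#!/bin/bash
# check_url_code.sh <url> - report the HTTP status code of a URL to Nagios
URL="$1"

# Grab the status code from the response headers (-I), ignoring TLS errors (-k)
CODE=$(curl -Is "$URL" -k | grep HTTP | cut -d ' ' -f2)

if [ -z "$CODE" ]; then
    echo "UNKNOWN - no HTTP response from $URL"
    exit 3
elif [ "$CODE" -gt 399 ]; then
    echo "CRITICAL - $URL returned HTTP $CODE"
    exit 2
else
    echo "OK - $URL returned HTTP $CODE"
    exit 0
fi
It can then be wired into a command definition the same way as check_url above.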

Can't read Mahout generated sequence files with hadoop streaming

I am trying to stream a sequence file generated by one of the Mahout examples to see its contents:
hadoop jar hadoop-streaming-0.20.2-cdh3u0.jar \
-input /tmp/mahout-work-me/20news-bydate/bayes-test-input-output/ \
-output /tmp/me/mm \
-mapper "cat" \
-reducer "wc -l" \
-inputformat SequenceFileAsTextInputFormat
The job starts successfully and eventually dies with:
11/11/30 21:08:39 INFO streaming.StreamJob: map 0% reduce 0%
11/11/30 21:09:17 INFO streaming.StreamJob: map 100% reduce 100%
java.lang.RuntimeException: java.io.IOException: WritableName can't load class: org.apache.mahout.common.StringTuple
I wonder if something is wrong with my streaming jar file, whether I need to point explicitly to the Mahout jar that has this class (I tried setting HADOOP_CLASSPATH to the location of mahout-core-0.5-cdh3u2.jar, but it did not work), or maybe something else entirely?
Any help is appreciated. Thanks.
Add this option:
-libjars mahout-core-0.5-cdh3u2.jar
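Putting it together, the full command would look something like this (a sketch; -libjars is a generic Hadoop option, so it has to come before the streaming-specific options, and the Mahout jar is assumed to be present locally at that path):
hadoop jar hadoop-streaming-0.20.2-cdh3u0.jar \
    -libjars mahout-core-0.5-cdh3u2.jar \
    -input /tmp/mahout-work-me/20news-bydate/bayes-test-input-output/ \
    -output /tmp/me/mm \
    -mapper "cat" \
    -reducer "wc -l" \
    -inputformat SequenceFileAsTextInputFormat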
