After I run:
mahout org.apache.mahout.cf.taste.example.jester.JesterRecommenderEvaluatorRunner
Running on hadoop, using HADOOP_HOME=/usr
HADOOP_CONF_DIR=/etc/hadoop/conf
11/04/23 23:52:18 WARN driver.MahoutDriver: No org.apache.mahout.cf.taste.example.jester.JesterRecommenderEvaluatorRunner.props found on classpath, will use command-line arguments only
11/04/23 23:52:18 INFO file.FileDataModel: Creating FileDataModel for file src/main/java/org/apache/mahout/cf/taste/example/jester/jester-data-1.csv
11/04/23 23:52:18 INFO file.FileDataModel: Reading file info...
11/04/23 23:52:18 INFO file.FileDataModel: Read lines: 7074
11/04/23 23:52:18 INFO model.GenericDataModel: Processed 7074 users
11/04/23 23:52:19 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation using 0.9 of FileDataModel[dataFile:/usr/local/mahout-distribution-0.4/examples/src/main/java/org/apache/mahout/cf/taste/example/jester/jester-data-1.csv]
11/04/23 23:52:19 INFO model.GenericDataModel: Processed 2155 users
11/04/23 23:52:19 INFO slopeone.MemoryDiffStorage: Building average diffs...
11/04/23 23:52:19 INFO eval.AbstractDifferenceRecommenderEvaluator: Beginning evaluation of 855 users
11/04/23 23:52:19 INFO eval.AbstractDifferenceRecommenderEvaluator: Starting timing of 855 tasks in 4 threads
11/04/23 23:52:19 INFO eval.AbstractDifferenceRecommenderEvaluator: Average time per recommendation: 2ms
11/04/23 23:52:19 INFO eval.AbstractDifferenceRecommenderEvaluator: Approximate memory used: 9MB / 56MB
11/04/23 23:52:19 INFO eval.AbstractDifferenceRecommenderEvaluator: Unable to recommend in 0 cases
11/04/23 23:52:19 INFO eval.AbstractDifferenceRecommenderEvaluator: Evaluation result: 154472.97849261735
11/04/23 23:52:19 INFO jester.JesterRecommenderEvaluatorRunner: 154472.97849261735
11/04/23 23:52:19 INFO driver.MahoutDriver: Program took 740 ms
I have no idea where to check for the results. Any pointers?
Thanks!
That is the result. You are running an evaluation of one recommender implementation: the evaluator scores how well that recommender predicts ratings, and the number shown is the average difference between actual and predicted ratings.
What result are you looking for?
However, something looks pretty wrong here: 154472.97849261735 is way too large. When I run it, I get an average difference of 3.41 (on a scale of 10).
I would ideally run with the latest code from Subversion; 0.4 is six months old, although I don't know of any bugs here. You also don't need to run this via the driver program, though that works.
Really, I suspect your jester-data-1.csv file is wrong somehow. Best to follow up on user@mahout.apache.org.
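In case it helps, here is roughly what you could run directly from Java to reproduce the evaluation without the driver program. This is only a sketch based on the 0.4 examples module: JesterDataModel and SlopeOneRecommender are what the example appears to use (per the slope-one log lines), the evaluator is the average-absolute-difference one, the file path is a placeholder for your local copy of jester-data-1.csv, and constructor signatures may differ slightly in your version.
import java.io.File;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.example.jester.JesterDataModel;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class JesterEvaluationSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder path: point this at your local jester-data-1.csv
    DataModel model = new JesterDataModel(new File("jester-data-1.csv"));

    // Scores the recommender by the average absolute difference between
    // predicted and actual ratings (lower is better)
    RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();

    RecommenderBuilder builder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        // Slope-one, which is what the example's log output shows it building
        return new SlopeOneRecommender(dataModel);
      }
    };

    // Train on 90% of each user's ratings, test on the remaining 10%,
    // evaluating across all users
    double score = evaluator.evaluate(builder, null, model, 0.9, 1.0);
    System.out.println("Average absolute difference: " + score);
  }
}
With a correctly parsed copy of the data, the printed score should be in the low single digits, roughly the 3.41 I mentioned above.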
I'm inserting relationships into Neo4j via the BatchInserter (as part of the initial data insertion). The process quickly slows down to about 1,000,000 relationships per hour. The rest of the data (12,390,251 relationships, 6 million nodes, 74 million properties) goes in within an hour.
I've identified the culprit as the RelationshipGroupStore: while the file initially grows in roughly 2 MB increments, it eventually slows to growing by only a few bytes at a time.
I'm curious whether this is related to memory mapping, but there doesn't seem to be an option for the RelationshipGroupStore in the BatchInserter (and probably not in the rest of the kernel either).
I've tried putting the following into the configuration, but it did not seem to have any effect:
neostore.relationshipgroupstore.db.mapped_memory=2G
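For reference, this is roughly the shape of the code that creates the inserter and passes that configuration. It's a stripped-down sketch, not my actual code: the store path, relationship type name, and node properties are placeholders, and the other map entries just mirror the settings listed below.
import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class BatchInsertSketch {
  public static void main(String[] args) {
    Map<String, String> config = new HashMap<String, String>();
    config.put("use_memory_mapped_buffers", "true");
    config.put("neostore.nodestore.db.mapped_memory", "1G");
    config.put("neostore.relationshipstore.db.mapped_memory", "2G");
    // The setting that does not appear to have any effect:
    config.put("neostore.relationshipgroupstore.db.mapped_memory", "2G");

    BatchInserter inserter = BatchInserters.inserter("data.db", config);
    RelationshipType correlates = DynamicRelationshipType.withName("CORRELATES");
    Map<String, Object> noProps = new HashMap<String, Object>();
    try {
      // Placeholder nodes; the real run creates ~6 million nodes first,
      // then ~50 million relationships between them
      long a = inserter.createNode(noProps);
      long b = inserter.createNode(noProps);
      inserter.createRelationship(a, b, correlates, noProps);
    } finally {
      inserter.shutdown();
    }
  }
}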
Any ideas? Should I not be doing this with the BatchInserter (50 million relationships for my current dataset, many more for others)? I also believe this was much faster with older versions of Neo4j. As you can imagine, 50 million relationships over 6 million nodes creates lots of highly connected nodes, which Neo4j 2.1.x has handled very well when it comes to query speed; I'm just trying to speed up insertion now.
Neo4j 2.1.5
Java: various versions (1.7, 1.8)
OS: Ubuntu, Mac OS X, CentOS
Neo4j Configuration:
neostore.propertystore.db.index.mapped_memory=1M
neostore.propertystore.db.strings.mapped_memory=1G
neostore.propertystore.db.index.keys.mapped_memory=1M
store_dir=data.db
dump_configuration=true
use_memory_mapped_buffers=true
neostore.propertystore.db.arrays.mapped_memory=512M
neostore.propertystore.db.mapped_memory=2G
neostore.relationshipstore.db.mapped_memory=2G
neostore.nodestore.db.mapped_memory=1G
JVM Settings:
"-Xms10G"
"-Xms10G"
"-XX:+UseConcMarkSweepGC"
"-XX:+UseBiasedLocking"
"-XX:+AggressiveOpts"
"-XX:+UseCompressedOops"
"-XX:+UseFastAccessorMethods"
"-XX:+DoEscapeAnalysis"
"-Xss4096k"
"-d64"
"-server"
Here are the logs showing the speed decrease:
2014-Oct-19 00:42:43 -0500 - 1000000 correlations recorded
2014-Oct-19 01:06:36 -0500 - 2000000 correlations recorded
2014-Oct-19 01:33:45 -0500 - 3000000 correlations recorded
2014-Oct-19 02:02:27 -0500 - 4000000 correlations recorded
2014-Oct-19 02:35:37 -0500 - 5000000 correlations recorded
2014-Oct-19 03:11:37 -0500 - 6000000 correlations recorded
2014-Oct-19 03:55:40 -0500 - 7000000 correlations recorded
2014-Oct-19 04:45:53 -0500 - 8000000 correlations recorded
2014-Oct-19 05:46:36 -0500 - 9000000 correlations recorded
2014-Oct-19 06:50:18 -0500 - 10000000 correlations recorded
2014-Oct-19 08:05:46 -0500 - 11000000 correlations recorded
At that point I aborted the run. I've tried forcing a GC after every 1,000,000 relationships and "throttling" with a 20-second sleep, but neither seemed to help. I've since removed the forced GC and kept the sleep, though it doesn't seem to help much either.
Any help is appreciated.
Thanks!
My Mahout recommender is returning no results, although from the looks of the evaluator output it seems like it should:
2014-10-15 18:33:36,704 INFO GenericDataModel - Processed 90 users
2014-10-15 18:33:36,735 INFO AbstractDifferenceRecommenderEvaluator - Beginning evaluation using 0.99 of GenericDataModel[users:1116,1117,1118...]
2014-10-15 18:33:36,767 INFO GenericDataModel - Processed 89 users
2014-10-15 18:33:36,767 INFO AbstractDifferenceRecommenderEvaluator - Beginning evaluation of 75 users
2014-10-15 18:33:36,767 INFO AbstractDifferenceRecommenderEvaluator - Starting timing of 75 tasks in 8 threads
2014-10-15 18:33:36,782 INFO StatsCallable - Average time per recommendation: 15ms
2014-10-15 18:33:36,782 INFO StatsCallable - Approximate memory used: 876MB / 1129MB
2014-10-15 18:33:36,782 INFO StatsCallable - Unable to recommend in 0 cases
2014-10-15 18:33:36,845 INFO AbstractDifferenceRecommenderEvaluator - Evaluation result: 1.0599938694784354
I'm assuming that "Unable to recommend in 0 cases" means that it was able to recommend in all cases.
I then iterate over the user ID set and only see
2014-10-15 18:33:36,923 DEBUG GenericUserBasedRecommender - Recommending items for user ID '1164'
2014-10-15 18:33:36,923 DEBUG Recommendations are: []
for each ID.
Am I reading the debug log correctly?
Thanks.
I'm not exactly sure what that log message means; it looks like that stat is printed every 1000 iterations, so it presumably refers to the past 1000 requests rather than to all time, but that's just a guess.
In any case, you will very seldom be able to recommend to all users. There will often be users who get no recommendations because there is not enough usage history for them, or because their usage history does not overlap with other users'. There will also be new users for whom you have no preference history, and they will get no collaborative-filtering recs either. Remember that you are only using preferences that have been expressed; that does not mean every user is represented.
You should always have some fallback method for making recommendations in this case; even recently popular or promoted items would be better than nothing.
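As a rough sketch of what I mean (getPopularItems() here is a hypothetical helper you would implement yourself, for example from precomputed popularity counts; it is not part of Mahout):
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class FallbackRecommender {

  // Hypothetical helper: return recently popular or promoted items,
  // precomputed from your own usage data. Not a Mahout API.
  static List<RecommendedItem> getPopularItems(int howMany) {
    throw new UnsupportedOperationException("supply your own popularity list");
  }

  static List<RecommendedItem> recommendWithFallback(Recommender recommender,
                                                     long userId,
                                                     int howMany) throws TasteException {
    // Normal collaborative-filtering path
    List<RecommendedItem> recs = recommender.recommend(userId, howMany);
    if (recs.isEmpty()) {
      // No CF recommendations for this user (new user, or history that
      // doesn't overlap with anyone else): fall back to something generic
      recs = getPopularItems(howMany);
    }
    return recs;
  }
}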
After a job has finished, how can I find out the maximum resident set size it reached at any point while running?
(I tried /usr/bin/time, but it is not installed on the server.)
Thank you!
The PBS MOM reports some statistics back, and they get recorded in the PBS server log.
A handy utility called tracejob parses the logs to extract all entries related to a specific job given a job ID.
For example, after job completion on PBS Pro 12.1, tracejob returns several lines, including the following:
07/11/2014 16:37:27 S Exit_status=0 resources_used.cpupercent=98
resources_used.cput=01:49:14 resources_used.mem=5368kb
resources_used.ncpus=1 resources_used.vmem=38276kb
resources_used.walltime=01:49:22
Here, 5368 kB corresponds to the maximum RSS.
Similarly, on Torque 3.0.5:
07/15/2014 03:45:12 S Exit_status=0 resources_used.cput=20:44:10
resources_used.mem=704692kb
resources_used.vmem=1110224kb
resources_used.walltime=20:44:30
Here, the maximum RSS was 704692 kB.
Simple question:
What is the memory limitation of the Pig LOAD statement?
More detailed question:
Is there any way to relate available physical resources (disk, RAM, CPU) to the maximum size of a directory that a Pig LOAD statement can handle?
Scenario:
A research project is using a Pig script that tries to load a directory containing 12,000+ files, with a total size of 891 GB, in a single Pig LOAD statement, copied below. The files are gzipped WAT files, which describe, in raw text, a collection of web pages. When run, the job appears to crash/hang/freeze our cluster every time. Since we are all new to Hadoop, the suspicion had been on resources and configuration until I was finally able to review the code.
-- load data from I_WATS_DIR
Orig = LOAD '$I_WATS_DIR' USING org.archive.hadoop.ArchiveJSONViewLoader('Envelope.ARC-Header-Metadata.Target-URI','var2...','var3...','var4...{1,2,3,4}') as
(src:chararray,timestamp:chararray,html_base:chararray,relative:chararray,path:chararray,text:chararray,alt:chararray);
Details:
CLUSTER
1 front end node, 16 cores, 64GB RAM, 128GB swap, NameNode
3 compute nodes, 16 cores, 128GB RAM, 128GB swap, DataNode
TEST JOB 1
Same script referenced above, loading a directory with 1 file
Resident memory reported 1.2GB
Input: 138MB
Output: 207MB
Reduce input records: 1,630,477
Duration: 4m 11s
TEST JOB 2
Same script, 17 files
Resident memory: 16.4GB
Input: 3.5GB
Output: 1.3GB
Reduce input records: 10,648,807
Duration: 6m 48s
TEST JOB 3
Same script, 51 files
Resident memory: 41.4GB
Input: 10.9GB
Output: not recorded
Reduce input records: 31,968,331
Duration: 6m 18s
Final thoughts:
This is a 4-node cluster with nothing else running on it, fully dedicated to Cloudera Hadoop CDH4 and running only this one job. I hope this is all the info people need to answer my original question! I strongly suspect that some sort of file-parsing loop that loads one file at a time is the solution, but I know even less about Pig than I do about Hadoop. I do have a programming/development background, but in this case I am the sysadmin, not the researcher or programmer.
Based on your description of your cluster and the amount of data you're pushing through it, it sounds like you are running out of space during the map/shuffle phase of the job. The intermediate data is sent over the network, uncompressed, and then written to disk on the reducers before being processed in the reduce phase. One thing you can try is to compress the output of the mappers by setting mapred.compress.map.output to true (and specifying the desired codec).
But with only four nodes, I suspect you're just trying to do too much at once. If you can, try splitting up your job into multiple steps. For example, if you are doing the standard word count example, do the word count over small portions of your data, and then run a second MR program that sums those counts.
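For what it's worth, here is roughly what those settings look like when set programmatically in a plain MapReduce driver. This is only a sketch using the old-style MRv1 property names; from a Pig script you would instead pass the equivalent properties via SET statements or -D options, and the codec here is just an example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;

public class CompressedMapOutputSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Compress intermediate map output before it is shuffled to the reducers
    conf.setBoolean("mapred.compress.map.output", true);
    // Gzip shown as an example; Snappy or LZO are usually faster if installed
    conf.setClass("mapred.map.output.compression.codec",
                  GzipCodec.class, CompressionCodec.class);

    Job job = Job.getInstance(conf, "compressed-map-output-sketch");
    // ... configure mapper, reducer, and input/output paths as usual, then submit ...
  }
}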
I am experimenting a bit with Mahout; I built everything and had a look at the examples. I am mostly interested in collaborative filtering, so I started with the example on finding recommendations from the BookCrossing dataset. I managed to get everything working, and the sample runs without errors. However, the output is something like this:
INFO: Creating FileDataModel for file /tmp/taste.bookcrossing.
INFO: Reading file info...
INFO: Read lines: 433647
INFO: Processed 10000 users
INFO: Processed 20000 users
INFO: Processed 30000 users
INFO: Processed 40000 users
INFO: Processed 50000 users
INFO: Processed 60000 users
INFO: Processed 70000 users
INFO: Processed 77799 users
INFO: Beginning evaluation using 0.9 of BookCrossingDataModel
INFO: Processed 10000 users
INFO: Processed 20000 users
INFO: Processed 22090 users
INFO: Beginning evaluation of 4245 users
INFO: Starting timing of 4245 tasks in 2 threads
INFO: Average time per recommendation: 296ms
INFO: Approximate memory used: 115MB / 167MB
INFO: Unable to recommend in 1 cases
INFO: Average time per recommendation: 67ms
INFO: Approximate memory used: 107MB / 167MB
INFO: Unable to recommend in 2363 cases
INFO: Average time per recommendation: 72ms
INFO: Approximate memory used: 146MB / 167MB
INFO: Unable to recommend in 5095 cases
INFO: Average time per recommendation: 71ms
INFO: Approximate memory used: 113MB / 167MB
INFO: Unable to recommend in 7596 cases
INFO: Average time per recommendation: 71ms
INFO: Approximate memory used: 130MB / 167MB
INFO: Unable to recommend in 10896 cases
INFO: Evaluation result: 1.0895580110095793
When I check the code, I can see that it does this:
RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
File ratingsFile = TasteOptionParser.getRatings(args);
DataModel model = ratingsFile == null
    ? new BookCrossingDataModel(true)
    : new BookCrossingDataModel(ratingsFile, true);
IRStatistics evaluation = evaluator.evaluate(
    new BookCrossingBooleanRecommenderBuilder(),
    new BookCrossingDataModelBuilder(),
    model,
    null,
    3,
    Double.NEGATIVE_INFINITY,
    1.0);
log.info(String.valueOf(evaluation));
So that seems to be correct, but I would like to see more detail on the generated suggestions and/or similarities. The object returned is of type IRStatistics, which exposes only a few aggregate numbers about the results. Should I look somewhere else? Is this recommender not intended for producing actual recommendations?
You are not actually generating recommendations here; you are just performing an evaluation.
This example from the Mahout in Action book (link) should give you an idea of how to actually get recommendations.
The example only requests recommendations for one user; in your case you would iterate through all the users, get each user's recommendations, and then decide what to do with them, such as writing them to a file.
Also, the example doesn't use the data model builder or the recommender builder, but it shouldn't be hard for you to figure that out by looking at the method signatures.
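For example, something along these lines (a sketch only; it reuses the example classes from your snippet, and the import package for them may differ slightly in your version):
import org.apache.mahout.cf.taste.example.bookcrossing.BookCrossingBooleanRecommenderBuilder;
import org.apache.mahout.cf.taste.example.bookcrossing.BookCrossingDataModel;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class BookCrossingRecommendAll {
  public static void main(String[] args) throws Exception {
    DataModel model = new BookCrossingDataModel(true);

    // Reuse the builder from the evaluation, but this time keep the recommender
    Recommender recommender =
        new BookCrossingBooleanRecommenderBuilder().buildRecommender(model);

    LongPrimitiveIterator userIds = model.getUserIDs();
    while (userIds.hasNext()) {
      long userId = userIds.nextLong();
      // Top 10 recommendations for this user; the list can be empty
      for (RecommendedItem item : recommender.recommend(userId, 10)) {
        // Replace with whatever output you need, e.g. writing to a file
        System.out.println(userId + "\t" + item.getItemID() + "\t" + item.getValue());
      }
    }
  }
}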