Mahout runs out of heap space

I am running NaiveBayes on a set of tweets using Mahout: two files, one 100 MB and one 300 MB. I changed JAVA_HEAP_MAX to JAVA_HEAP_MAX=-Xmx2000m (earlier it was 1000). But even then, Mahout ran for a few hours (2, to be precise) before it complained of a heap space error. What should I do to resolve this?
Some more info if it helps: I am running on a single node, my laptop in fact, and it has only 3 GB of RAM.
Thanks.
EDIT: I ran it a third time with less than half of the data I used the first time (the first time I used 5.5 million tweets, the second 2 million) and I still got a heap space error. I am posting the complete error for completeness:
17 May, 2011 2:16:22 PM
org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 50% reduce 0%
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.lang.AbstractStringBuilder.<init>(AbstractStringBuilder.java:62)
at java.lang.StringBuilder.<init>(StringBuilder.java:85)
at org.apache.hadoop.mapred.JobClient.monitorAndPrintJob(JobClient.java:1283)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1251)
at org.apache.mahout.classifier.bayes.mapreduce.common.BayesFeatureDriver.runJob(BayesFeatureDriver.java:63)
at org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesDriver.runJob(BayesDriver.java:44)
at org.apache.mahout.classifier.bayes.TrainClassifier.trainNaiveBayes(TrainClassifier.java:54)
at org.apache.mahout.classifier.bayes.TrainClassifier.main(TrainClassifier.java:162)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:184)
17 May, 2011 7:14:53 PM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0001
java.lang.OutOfMemoryError: Java heap space
at java.lang.String.substring(String.java:1951)
at java.lang.String.subSequence(String.java:1984)
at java.util.regex.Pattern.split(Pattern.java:1019)
at java.util.regex.Pattern.split(Pattern.java:1076)
at org.apache.mahout.classifier.bayes.mapreduce.common.BayesFeatureMapper.map(BayesFeatureMapper.java:78)
at org.apache.mahout.classifier.bayes.mapreduce.common.BayesFeatureMapper.map(BayesFeatureMapper.java:46)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
And here is the part of the bin/mahout script that I changed:
Original :
JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx1000m
if [ "$MAHOUT_HEAPSIZE" != "" ]; then
#echo "run with heapsize $MAHOUT_HEAPSIZE"
JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
#echo $JAVA_HEAP_MAX
fi
Modified :
JAVA=$JAVA_HOME/bin/java
JAVA_HEAP_MAX=-Xmx2000m
if [ "$MAHOUT_HEAPSIZE" != "" ]; then
#echo "run with heapsize $MAHOUT_HEAPSIZE"
JAVA_HEAP_MAX="-Xmx""$MAHOUT_HEAPSIZE""m"
#echo $JAVA_HEAP_MAX
fi

You're not specifying what process ran out of memory, which is important. You need to set MAHOUT_HEAPSIZE, not whatever JAVA_HEAP_MAX is.
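For example, a minimal sketch, assuming the stock bin/mahout launcher shown above (the training invocation and paths are placeholders; adjust to whatever command you actually ran):
export MAHOUT_HEAPSIZE=2000                                   # megabytes; bin/mahout turns this into -Xmx2000m
bin/mahout trainclassifier -i tweets-vectors -o tweets-model  # hypothetical command and paths
Note also that your stack trace shows the LocalJobRunner, so everything runs in the client JVM and this setting applies; on a real Hadoop cluster the map/reduce task heap would typically come from mapred.child.java.opts in the Hadoop configuration instead.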

Did you modify the heap size for the Hadoop environment or the Mahout one? See if this query on the Mahout mailing list helps. From personal experience, I can suggest that you reduce the size of the data you are trying to process. Whenever I tried to execute the Bayes classifier on my laptop, the heap space would get exhausted after running for a few hours.
I'd suggest that you run this off EC2. I think the basic S3/EC2 option is free to use.

When you start the Mahout process, you can run "jps"; it will show all the Java processes running on your machine under your user ID, along with their process IDs. Find the Mahout process and run "jmap -heap <process-id>" to see your heap space utilization.
With this approach you can estimate at which point of your processing the memory is exhausted and where you need to increase it.
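For example (standard JDK tools; <process-id> is whatever PID jps reports for the Mahout/Hadoop job):
jps -lm                    # lists the Java processes running under your user ID, with their PIDs
jmap -heap <process-id>    # prints heap configuration and current usage for that process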

Related

GNU Parallel -- How to understand "block-size" setting, and guess what to set it to?

How do I set the block-size parameter when running grep using GNU parallel on a single machine with multiple cores, based on the "large_file" filesize, the "small_file" filesize and the machine I'm using, to get the fastest performance possible (or please correct me if there is something else I'm missing here)? What are the performance issues/speed bottlenecks I'll run into when setting it too high or too low? I understand what block-size does, in that it splits large_file into chunks and sends those chunks to each job, but I'm still missing how and why that would impact the speed of execution.
The command in question:
parallel --pipepart --block 100M --jobs 10 -a large_file.csv grep -f small_file.csv
where large_file.csv has in it:
123456 1
234567 2
345667 22
and
where small_file.csv has in it:
1$
2$
and so on...
Thank you!
parallel --pipepart --block -1 --jobs 10 -a large_file.csv grep -f small_file.csv
--block -1 will split large_file.csv into one block per jobslot (here 10 chunks). The splitting will be done on the fly, so it will not be read into RAM to do the splitting.
Splitting into n evenly sized blocks (where n = number of jobs to run in parallel) often makes sense if the time spent per line is roughly the same. If it varies a lot (say, some lines take 100 times longer to process than others), then it may make more sense to chop into more bits. E.g. --block -10 will split into 10 times as many blocks as --block -1.
The optimal value can seldom be guessed in advance, because it may also depend on how fast your disk is. So try different values and identify where the bottleneck is. It is typically one of disk I/O, CPU, RAM, command startup time.
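A rough benchmarking sketch for that, reusing the command from the question (the candidate block values are arbitrary, output is thrown away, and the first run will be skewed by the page cache):
for b in -1 -2 -5 -10; do
    echo "block size $b"
    /usr/bin/time -f "elapsed: %e s" \
        parallel --pipepart --block $b --jobs 10 -a large_file.csv grep -f small_file.csv > /dev/null
done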

What is the difference between 'time -f "%M"' and 'valgrind --tool=massif'?

I want to see the peak memory usage of a command. I have a parametrized algorithm and I want to know when the program will crash due to an out-of-memory error on my machine (12GB RAM).
I tried:
/usr/bin/time -f "%M" command
valgrind --tool=massif command
The first one gave me 1414168 (1.4GB; thank you ks1322 for pointing out it is measured in KB!) and valgrind gave me
$ ms_print massif.out
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
75 26,935,731,596 22,420,728 21,956,875 463,853 0
I'm a bit confused which number I should take, but let's assume "total" (22MB).
And the massif-visualizer shows me about 206MB (screenshot omitted).
Now I have 3 different numbers for the same command:
valgrind --tool=massif command + ms_print: 22MB
valgrind --tool=massif command + massif-visualizer: 206MB (this is what I see in htop and I guess this is what I'm interested in)
time -f "%M" command: 1.4GB
Which is the number I should look at? Why are the numbers different at all?
/usr/bin/time -f "%M" measures the maximum RSS (resident set size), that is the memory used by the process that is in RAM and not swapped out. This memory includes the heap, the stack, the data segment, etc.
This measures the max RSS of the children processes (including grandchildren) taken individually (not the max of the sum of the RSS of the children).
valgrind --tool=massif, as the documentation says:
measures only heap memory, i.e. memory allocated with malloc, calloc, realloc, memalign, new, new[], and a few other, similar functions. This means it does not directly measure memory allocated with lower-level system calls such as mmap, mremap, and brk
This measures only the memory in the child (not grandchildren).
This does not measure the stack nor the text and data segments.
(options like --pages-as-heap=yes and --stacks=yes enable measuring more)
So in your case the differences are:
time takes into account the grandchildren, while valgrind does not
time does not measure the memory swapped out, while valgrind does
time measures the stack and data segments, while valgrind does not
You should now:
check if some children are responsible for the memory consumption
try profiling with valgrind --tool=massif --stacks=yes to check the stack
try profiling with valgrind --tool=massif --pages-as-heap=yes to check the rest of the memory usage
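Concretely, that could look like the following sketch (massif writes its snapshots to a file named massif.out.<pid>):
/usr/bin/time -v ./command 2>&1 | grep "Maximum resident set size"   # max RSS, grandchildren included
valgrind --tool=massif --stacks=yes ./command                        # heap plus stack
valgrind --tool=massif --pages-as-heap=yes ./command                 # page-level view (mmap, brk, ...)
ms_print massif.out.<pid>                                            # inspect the recorded snapshots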

Jenkins web ui is totally unresponsive

My Jenkins instance has been running for over two years without issue, but yesterday it quit responding to HTTP requests. No errors, just clocks and clocks.
I've restarted the service, then restarted the entire server.
There's been a lot of mention of a thread dump. I attempted to get one, but I'm not sure that what follows is actually it.
Heap
PSYoungGen total 663552K, used 244203K [0x00000000d6700000, 0x0000000100000000, 0x0000000100000000)
eden space 646144K, 36% used [0x00000000d6700000,0x00000000e4df5f70,0x00000000fde00000)
from space 17408K, 44% used [0x00000000fef00000,0x00000000ff685060,0x0000000100000000)
to space 17408K, 0% used [0x00000000fde00000,0x00000000fde00000,0x00000000fef00000)
ParOldGen total 194048K, used 85627K [0x0000000083400000, 0x000000008f180000, 0x00000000d6700000)
object space 194048K, 44% used [0x0000000083400000,0x000000008879ee10,0x000000008f180000)
Metaspace used 96605K, capacity 104986K, committed 105108K, reserved 1138688K
class space used 12782K, capacity 14961K, committed 14996K, reserved 1048576K
Ubuntu 16.04.5 LTS
I prefer looking in the Jenkins log file. There you can see the errors and then fix them.
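For reference, on a stock Ubuntu package install the log usually lives under /var/log/jenkins/ (your path may differ), and an actual thread dump, unlike the heap summary above, can be taken with the JDK's jstack:
sudo tail -n 200 /var/log/jenkins/jenkins.log                # recent errors and stack traces
sudo -u jenkins jstack <jenkins-pid> > /tmp/threaddump.txt   # thread dump of the running Jenkins JVM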

Issues with Profiling java application using jProbe

I'm currently doing dynamic memory analysis for our Eclipse-based application using JProbe. After starting the Eclipse application and JProbe, when I try to profile the Eclipse application, it closes abruptly, causing a fatal error. A fatal error log file is generated, and in it I can see that the PermGen space seems to be full. Below is a sample heap summary from the log file:
Heap
def new generation total 960K, used 8K [0x07b20000, 0x07c20000, 0x08000000)
eden space 896K, 0% used [0x07b20000, 0x07b22328, 0x07c00000)
from space 64K, 0% used [0x07c00000, 0x07c00000, 0x07c10000)
to space 64K, 0% used [0x07c10000, 0x07c10000, 0x07c20000)
tenured generation total 9324K, used 5606K [0x08000000, 0x0891b000, 0x0bb20000)
the space 9324K, 60% used [0x08000000, 0x08579918, 0x08579a00, 0x0891b000)
compacting perm gen total 31744K, used 31723K [0x0bb20000, 0x0da20000, 0x2bb20000)
the space 31744K, 99% used [0x0bb20000, 0x0da1af00, 0x0da1b000, 0x0da20000)
ro space 8192K, 66% used [0x2bb20000, 0x2c069920, 0x2c069a00, 0x2c320000)
rw space 12288K, 52% used [0x2c320000, 0x2c966130, 0x2c966200, 0x2cf20000)
I tried to increase the PermGen space using the option -XX:MaxPermSize=512m, but that doesn't seem to work. I would like to know how to increase the PermGen size via the command prompt: do I have to go to the Java installation on my computer and pass the option there, or should I increase the PermGen space specifically for the Eclipse application or JProbe? Please advise.
Any help on this is much appreciated.

Measure top memory consumption (linux program)

How can I measure the top (the maximum) memory usage of some program?
It does a lot of malloc/free and runs rather fast, so I can't see the max memory in top.
I want something like the time utility:
$ time ./program
real xx sec
user xx sec
sys xx sec
and
$ mem_report ./program
max memory used xx mb
shared mem xx mb
The time you are calling is a shell built-in. If you call /usr/bin/time, the program, you will get some information about resident memory usage. Note, however, that it may not count memory-mapped files, shared memory and other details which you may need.
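For example, with GNU time (%M reports the maximum resident set size in kilobytes):
/usr/bin/time -f "max RSS: %M KB" ./program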
If you are on Linux, you can wrap your program in a script that polls:
# for your current process
/proc/self/statm
# or a process you know the pid of
/proc/{pid}/statm
and writes out the results - you can aggregate them afterwards.
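A minimal wrapper sketch along those lines (the second field of /proc/<pid>/statm is the resident size in pages, converted here via the page size; the polling interval is arbitrary, so very short-lived peaks can be missed, and only the direct child is tracked):
#!/bin/sh
# usage: ./mem_report ./program [args...]
"$@" &                                    # start the program in the background
pid=$!
page_kb=$(( $(getconf PAGESIZE) / 1024 ))
max_kb=0
while kill -0 "$pid" 2>/dev/null; do
    rss_pages=$(awk '{print $2}' "/proc/$pid/statm" 2>/dev/null)
    if [ -n "$rss_pages" ]; then
        rss_kb=$(( rss_pages * page_kb ))
        [ "$rss_kb" -gt "$max_kb" ] && max_kb=$rss_kb
    fi
    sleep 0.1
done
echo "max resident memory: ${max_kb} KB"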
