How to change memory in EMR hadoop streaming job

I am trying to overcome the following error in a hadoop streaming job on EMR.
Container [pid=30356,containerID=container_1391517294402_0148_01_000021] is running beyond physical memory limits
I tried searching for answers but the one I found isn't working. My job is launched as shown below.
hadoop jar ../.versions/2.2.0/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
-input determinations/part-00000 \
-output determinations/aggregated-0 \
-mapper cat \
-file ./det_maker.py \
-reducer det_maker.py \
-Dmapreduce.reduce.java.opts="-Xmx5120M"
The last line above is supposed to do the trick as far as I understand, but I get the error:
ERROR streaming.StreamJob: Unrecognized option: -Dmapreduce.reduce.java.opts="-Xmx5120M"
What is the correct way to change the memory usage?
Also is there some documentation that explains these things to n00bs like me?

You haven't elaborated on which memory you are running low on, physical or virtual.
For both problems, take a look at Amazon's documentation:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/TaskConfiguration_H2.html
Usually the solution is to increase the amount of memory per mapper, and possibly reduce the number of mappers:
s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m mapreduce.map.memory.mb=4000
s3://elasticmapreduce/bootstrap-actions/configure-hadoop -m mapred.tasktracker.map.tasks.maximum=2
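If you launch the cluster from the AWS CLI, these bootstrap actions can be attached at cluster creation time. The following is only a rough sketch (the cluster name, AMI version and instance settings are placeholders, and the configure-hadoop bootstrap action applies to AMI-versioned clusters):
# hypothetical cluster launch; adjust name, AMI version and instance settings for your workload
aws emr create-cluster \
--name "streaming-cluster" \
--ami-version 3.1.0 \
--instance-type m1.xlarge \
--instance-count 3 \
--bootstrap-actions Path=s3://elasticmapreduce/bootstrap-actions/configure-hadoop,Args=["-m","mapreduce.map.memory.mb=4000","-m","mapred.tasktracker.map.tasks.maximum=2"]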

Related

Compare perf-stat results to that of likwid-perfctr results

I want to do some comparison between the output of perf-stat and that of likwid-perfctr. Is there a way to do that? I tried running two commands, one for perf-stat and the other for likwid-perfctr.
The commands are:
sudo perf stat -C 2 -e instructions,BR_INST_RETIRED.ALL_BRANCHES,branches,rc004,INST_RETIRED.ANY ./loop
sudo likwid-perfctr -C 2 -g MYLIST1 -f ./loop
The first command is for perf-stat, which importantly captures the branch and instruction counts (redundantly). The second command is for likwid-perfctr, which captures similar data. Just to mention, I wrote my own group called MYLIST1 for likwid-perfctr.
But when I compare the two results, they turn out to be quite different.
Output Comparison
So, looking at the output, INSTR_RETIRED_ANY is 15552 in perf stat versus 190594 in likwid-perfctr, and branches are 3168 vs 42744.
I'm not sure what I'm doing wrong. Is there a way to do this comparison properly?
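One possible source of the discrepancy (this is an assumption, not a confirmed diagnosis): likwid-perfctr -C 2 pins ./loop to core 2 and measures that core, whereas perf stat -C 2 only selects the CPU to count on and does not pin the workload. A sketch of a more directly comparable setup:
# pin the workload to core 2, matching what likwid-perfctr -C 2 does
taskset -c 2 ./loop &
# count core 2 system-wide for a fixed window while the loop runs
sudo perf stat -a -C 2 -e instructions,branches sleep 5
wait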

Getting Dataflowrunner with --experiments=upload_graph to work

I have a pipeline that produces a dataflow graph (serialized JSON representation) exceeding the allowable limit for the API, and thus it cannot be launched via the Dataflow runner for Apache Beam as one normally would. Running the Dataflow runner with the suggested parameter --experiments=upload_graph does not work either; it fails saying there are no steps specified.
When getting notified about this size problem via an error, the following information is provided:
the size of the serialized JSON representation of the pipeline exceeds the allowable limit for the API.
Use experiment 'upload_graph' (--experiments=upload_graph)
to direct the runner to upload the JSON to your
GCS staging bucket instead of embedding in the API request.
Using this parameter does indeed result in the Dataflow runner uploading an additional dataflow_graph.pb file to the staging location beside the usual pipeline.pb file, and I verified that it actually exists in GCS.
However, the job in GCP Dataflow then fails immediately after starting with the following error:
Runnable workflow has no steps specified.
I've tried this flag with various pipelines, even apache beam example pipelines and see the same behaviour.
This can be reproduced by using word count example:
mvn archetype:generate \
-DarchetypeGroupId=org.apache.beam \
-DarchetypeArtifactId=beam-sdks-java-maven-archetypes-examples \
-DarchetypeVersion=2.11.0 \
-DgroupId=org.example \
-DartifactId=word-count-beam \
-Dversion="0.1" \
-Dpackage=org.apache.beam.examples \
-DinteractiveMode=false
cd word-count-beam/
Running it without the experiments=upload_graph parameter works:
(make sure to specify your project, and buckets if you want to run this)
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
--gcpTempLocation=gs://<your-gcs-bucket>/tmp \
--inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
-Pdataflow-runner
Running it with the experiments=upload_graph parameter results in the pipeline failing with the message "workflow has no steps specified":
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
-Dexec.args="--runner=DataflowRunner --project=<your-gcp-project> \
--gcpTempLocation=gs://<your-gcs-bucket>/tmp \
--experiments=upload_graph \
--inputFile=gs://apache-beam-samples/shakespeare/* --output=gs://<your-gcs-bucket>/counts" \
-Pdataflow-runner
Now I would expect the Dataflow runner to direct GCP Dataflow to read the steps from the specified bucket, as seen in the source code:
https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L881
However this seems not to be the case. Has anyone gotten this to work, or has found some documentation regarding this feature that can point me in the right direction?
The experiment has since been reverted and the messaging will be corrected in Beam 2.13.0
Revert PR
I recently ran into this issue and the solution was quite silly. I had developed a fairly complex Dataflow streaming job that was working fine, and the next day it stopped working with the error "Runnable workflow has no steps specified.". In my case, someone had specified pipeline().run().waitUntilFinish() twice after creating the options, and because of that I was getting this error. Removing the duplicate pipeline run resolved the issue. I still think beam/dataflowrunner should produce a useful error trace in this scenario.

Google Dataflow creates only one worker for large .bz2 file

I am trying to process the Wikidata json dump using Cloud Dataflow.
I have downloaded the file from https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.bz2 and hosted it into a GS bucket. It's a large (50G) .bz2 file containing a list of json dicts (one per line).
I understand that apache_beam.io.ReadFromText can handle .bz2 (I tested that on toy datasets) and that .bz2 is splittable. Therefore I was hoping that multiple workers would be created to work in parallel on different blocks of that single file (though I'm not totally clear if/how blocks would be split up among workers).
Ultimately I want to do some analytics on each line (each json dict) but as a test for ingestion I am just using the project's wordcount.py:
python -m apache_beam.examples.wordcount \
--input gs://MYBUCKET/wikidata/latest-all.json.bz2 \
--output gs://MYBUCKET/wikidata/output/entities-all.json \
--runner DataflowRunner \
--project MYPROJECT \
--temp_location gs://MYBUCKET/tmp/
At startup, autoscaling quickly increases the number of workers 1->6 but only one worker does any work and then autoscaling scales back 6->1 after a couple minutes (jobid: 2018-10-11_00_45_54-9419516948329946918)
If I disable autoscaling and set explicitly the number of workers, then all but one remain idle.
Can parallelism be achieved on this sort of input? Many thanks for any help.
Unlike Hadoop, Apache Beam has not yet implemented bzip2 splitting: https://issues.apache.org/jira/browse/BEAM-683
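Until that is implemented, one workaround (just a sketch; the shard size, file names, and paths below are made up) is to re-shard the dump into many smaller .bz2 files yourself, so Dataflow can parallelize across files instead of within a single file:
# download, split into ~1M-line shards, recompress each shard, upload
gsutil cp gs://MYBUCKET/wikidata/latest-all.json.bz2 .
bzcat latest-all.json.bz2 | split -l 1000000 --filter='bzip2 > $FILE.json.bz2' - shard-
gsutil -m cp shard-*.json.bz2 gs://MYBUCKET/wikidata/shards/
# then point the pipeline at the shard pattern instead of the single file
python -m apache_beam.examples.wordcount \
--input "gs://MYBUCKET/wikidata/shards/shard-*.json.bz2" \
--output gs://MYBUCKET/wikidata/output/entities-all.json \
--runner DataflowRunner \
--project MYPROJECT \
--temp_location gs://MYBUCKET/tmp/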

Java Memory Usage, many contradictory numbers

I am running multiple instances of a java web app (Play Framework).
The longer the web apps run, the less memory is available until I restart them. Sometimes I get an OutOfMemoryError.
I am trying to find the problem, but I get a lot of contradictory information, so I am having trouble finding the source.
Here is the setup:
Ubuntu 14.04.5 LTS with 12 GB of RAM
OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
EDIT:
Here are the JVM settings:
-Xms64M
-Xmx128m
-server
(I am not 100% sure if these parameters are passed correctly to the JVM since I am using an /etc/init.d script with start-stop-daemon which starts the play framework script, which starts the JVM)
This is how I use it:
start() {
echo -n "Starting MyApp"
sudo start-stop-daemon --background --start \
--pidfile ${APPLICATION_PATH}/RUNNING_PID \
--chdir ${APPLICATION_PATH} \
--exec ${APPLICATION_PATH}/bin/myapp \
-- \
-Dinstance.name=${NAME} \
-Ddatabase.name=${DATABASE} \
-Dfile.encoding=utf-8 \
-Dsun.jnu.encoding=utf-8 \
-Duser.country=DE \
-Duser.language=de \
-Dhttp.port=${PORT} \
-J-Xms64M \
-J-Xmx128m \
-J-server \
-J-XX:+HeapDumpOnOutOfMemoryError \
>> \
$LOGFILE 2>&1
}
I am picking one instance of the web apps now:
htop shows 4615M of VIRT and 338M of RES.
When I create a heap dump with jmap -dump:live,format=b,file=mydump.dump <mypid> the file has only about 50MB.
When I open it in Eclipse MAT the overview shows "20.1MB" of used memory (with the "Keep unreachable objects" option set to ON).
So how can 338MB shown in htop shrink to 20.1MB in Eclipse MAT?
I don't think this is GC related: no matter how long I wait, htop always shows about this amount of memory, and it never goes down.
In fact, I would assume that my simple app does not use more than 20MB, maybe 30MB.
I compared two heap dumps taken 4 hours apart with Eclipse MAT and I don't see any significant increase in objects.
PS: I added the -XX:+HeapDumpOnOutOfMemoryError option, but I have to wait 5-7 days until it happens again. I hope to find the problem earlier with your help interpreting my numbers.
Thank you,
schube
The heap is the memory containing Java objects; htop surely doesn't know about the heap. Among the things that contribute to the used memory, as reported by VIRT, are:
The JVM’s own code and that of the required libraries
The byte code and meta information of the loaded classes
The JIT compiled code of frequently used methods
I/O buffers
thread stacks
memory allocated for the heap, but currently not containing live objects
When you dump the heap, it will contain the live Java objects plus meta information that allows tools to understand the content, like class and field names. When a tool calculates the used heap, it will count the objects only, so it will naturally be smaller than the heap dump file size. Also, this used-memory figure often does not include memory that is unusable due to padding/alignment; further, the tools sometimes assume the wrong pointer size, as the relevant information (32-bit architecture vs 64-bit architecture vs compressed oops) is not available in the heap dump. These errors may add up.
Note that there might be other reasons for an OutOfMemoryError than having too many objects in the heap. E.g. there might be too much meta information, due to a memory leak combined with dynamic class loading, or too many native I/O buffers…
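If you want to see where the non-heap part of those 338MB goes, one option (a sketch, assuming an OpenJDK 8 JVM with Native Memory Tracking available, passed via the same -J style as the other JVM flags in the start script) is to enable NMT and query the running JVM with jcmd:
# add to the start script's JVM flags (restart required); summary mode has low overhead
-J-XX:NativeMemoryTracking=summary
# then, while the app is running, break the footprint down by category
jcmd <mypid> VM.native_memory summary
# optionally record a baseline now and diff against it later to spot native growth
jcmd <mypid> VM.native_memory baseline
jcmd <mypid> VM.native_memory summary.diff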

Use all cores to make OpenCV 3 [duplicate]

Quick question: what is the compiler flag to allow g++ to spawn multiple instances of itself in order to compile large projects quicker (for example 4 source files at a time for a multi-core CPU)?
You can do this with make; with GNU make it is the -j flag (this will also help on a uniprocessor machine).
For example if you want 4 parallel jobs from make:
make -j 4
You can also run gcc in a pipe with
gcc -pipe
This will pipeline the compile stages, which will also help keep the cores busy.
If you have additional machines available too, you might check out distcc, which will farm compiles out to those as well.
There is no such flag, and having one runs against the Unix philosophy of having each tool perform just one function and perform it well. Spawning compiler processes is conceptually the job of the build system. What you are probably looking for is the -j (jobs) flag to GNU make, a la
make -j4
Or you can use pmake or similar parallel make systems.
People have mentioned make but bjam also supports a similar concept. Using bjam -jx instructs bjam to build up to x concurrent commands.
We use the same build scripts on Windows and Linux and using this option halves our build times on both platforms. Nice.
If using make, invoke it with -j. From man make:
-j [jobs], --jobs[=jobs]
Specifies the number of jobs (commands) to run simultaneously.
If there is more than one -j option, the last one is effective.
If the -j option is given without an argument, make will not limit the
number of jobs that can run simultaneously.
And most notably, if you want to script or identify the number of cores you have available (which can change a lot depending on your environment, if you run in many environments), you may use the ubiquitous Python function cpu_count():
https://docs.python.org/3/library/multiprocessing.html#multiprocessing.cpu_count
Like this:
make -j $(python3 -c 'import multiprocessing as mp; print(int(mp.cpu_count() * 1.5))')
If you're asking why 1.5, I'll quote user artless-noise from a comment above:
The 1.5 number is because of the noted I/O bound problem. It is a rule of thumb. About 1/3 of the jobs will be waiting for I/O, so the remaining jobs will be using the available cores. A number greater than the cores is better and you could even go as high as 2x.
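If you'd rather not shell out to Python for this, a shorter equivalent (assuming GNU coreutils, which provides nproc) is:
# one job per core
make -j"$(nproc)"
# or oversubscribe by the same 1.5x rule of thumb
make -j"$(( $(nproc) * 3 / 2 ))"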
make will do this for you. Investigate the -j and -l switches in the man page. I don't think g++ is parallelizable.
distcc can also be used to distribute compiles not only on the current machine, but also on other machines in a farm that have distcc installed.
I'm not sure about g++, but if you're using GNU Make then "make -j N" (where N is the number of threads make can create) will allow make to run multiple g++ jobs at the same time (so long as the files do not depend on each other).
GNU parallel
I was making a synthetic compilation benchmark and couldn't be bothered to write a Makefile, so I used:
sudo apt-get install parallel
ls | grep -E '\.c$' | parallel -t --will-cite "gcc -c -o '{.}.o' '{}'"
Explanation:
{.} takes the input argument and removes its extension
-t prints out the commands being run to give us an idea of progress
--will-cite removes the request to cite the software if you publish results using it...
parallel is so convenient that I could even do a timestamp check myself:
ls | grep -E '\.c$' | parallel -t --will-cite "\
if ! [ -f '{.}.o' ] || [ '{}' -nt '{.}.o' ]; then
gcc -c -o '{.}.o' '{}'
fi
"
xargs -P can also run jobs in parallel, but it is a bit less convenient to do the extension manipulation or run multiple commands with it: Calling multiple commands through xargs
Parallel linking was asked at: Can gcc use multiple cores when linking?
TODO: I think I read somewhere that compilation can be reduced to matrix multiplication, so maybe it is also possible to speed up single file compilation for large files. But I can't find a reference now.
Tested in Ubuntu 18.10.
