Spark Cloudera - Worker Memory Setting [duplicate] - memory

I am configuring an Apache Spark cluster.
When I run the cluster with 1 master and 3 slaves, I see this on the master monitor page:
Memory
2.0 GB (512.0 MB Used)
2.0 GB (512.0 MB Used)
6.0 GB (512.0 MB Used)
I want to increase the used memory for the workers but I could not find the right config for this. I have changed spark-env.sh as below:
export SPARK_WORKER_MEMORY=6g
export SPARK_MEM=6g
export SPARK_DAEMON_MEMORY=6g
export SPARK_JAVA_OPTS="-Dspark.executor.memory=6g"
export JAVA_OPTS="-Xms6G -Xmx6G"
But the used memory is still the same. What should I do to change used memory?

With Spark 1.0.0+, when using spark-shell or spark-submit, use the --executor-memory option. E.g.
spark-shell --executor-memory 8G ...
0.9.0 and under:
When you start a job or start the shell, change the memory. We had to modify the spark-shell script so that it would carry command-line arguments through as arguments for the underlying Java application. In particular:
OPTIONS="$@"
...
$FWDIR/bin/spark-class $OPTIONS org.apache.spark.repl.Main "$@"
Then we can run our spark shell as follows:
spark-shell -Dspark.executor.memory=6g
When configuring it for a standalone jar, I set the system property programmatically before creating the Spark context and pass the value in as a command-line argument (I can make it shorter than the long-winded system props that way).
System.setProperty("spark.executor.memory", valueFromCommandLine)
As for changing the default cluster wide, sorry, not entirely sure how to do it properly.
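That said, on 1.0.0+ a cluster-wide default can usually be set in conf/spark-defaults.conf, which spark-submit reads; a minimal sketch (the 6g figure is illustrative):
# conf/spark-defaults.conf on the machine you submit from
spark.executor.memory   6g
An explicit --executor-memory flag still overrides this default.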
One final point - I'm a little worried by the fact that you have two nodes with 2 GB and one with 6 GB. The memory you can use will be limited to the smallest node - so here 2 GB.

In Spark 1.1.1, to set the max memory of workers, write this in conf/spark-env.sh:
export SPARK_EXECUTOR_MEMORY=2G
If you have not used the config file yet, copy the template file
cp conf/spark-env.sh.template conf/spark-env.sh
Then make the change and don't forget to source it
source conf/spark-env.sh

In my case, I use an IPython notebook server to connect to Spark, and I want to increase the memory for the executor.
This is what I do:
from pyspark import SparkContext
from pyspark.conf import SparkConf
conf = SparkConf()
conf.setMaster(CLUSTER_URL).setAppName('ipython-notebook').set("spark.executor.memory", "2g")
sc = SparkContext(conf=conf)

According to the Spark documentation, you can change the memory per node with the command-line argument --executor-memory when submitting your application. E.g.
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://master.node:7077 \
--executor-memory 8G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
I've tested it and it works.

The default configuration for a worker is to allocate host memory - 1 GB to each worker. The configuration parameter to manually adjust that value is SPARK_WORKER_MEMORY, as in your question:
export SPARK_WORKER_MEMORY=6g
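Note that SPARK_WORKER_MEMORY only raises what a worker may hand out; the "Used" figure on the master page reflects what running executors actually request. A minimal sketch combining both knobs (sizes are illustrative and must fit in the host's physical RAM):
# conf/spark-env.sh on each worker host: the most the worker can hand out
export SPARK_WORKER_MEMORY=6g
# when starting a shell or submitting a job: what each executor asks for
spark-shell --executor-memory 4g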

Related

Optimizing OpenGrok on a large code base

I have a server instance here with 4 cores, 32 GB RAM, and Ubuntu 20.04.3 LTS installed. On this machine there is an OpenGrok instance running as a Docker container.
Inside the Docker container it uses AdoptOpenJDK:
OpenJDK Runtime Environment AdoptOpenJDK-11.0.11+9 (build 11.0.11+9)
Eclipse OpenJ9 VM AdoptOpenJDK-11.0.11+9 (build openj9-0.26.0, JRE 11 Linux amd64-64-Bit Compressed References 20210421_975 (JIT enabled, AOT enabled)
OpenJ9 - b4cc246d9
OMR - 162e6f729
JCL - 7796c80419 based on jdk-11.0.11+9)
The code base that the opengrok-indexer scans is 320 GB and indexing takes 21 hours.
What I figured out is that if I disable the history option, it takes less time. Is there a way to reduce this time when the history flag is set?
Here is my index command:
opengrok-indexer -J=-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -J=-Djava.util.logging.config.file=/usr/share/tomcat10/conf/logging.properties -J=-XX:-UseGCOverheadLimit -J=-Xmx30G -J=-Xms30G -J=-server -a /var/opengrok/dist/lib/opengrok.jar -- -R /var/opengrok/etc/read-only.xml -m 256 -c /usr/bin/ctags -s /var/opengrok/src/ -d /var/opengrok/data --remote on -H -P -S -G -W /var/opengrok/etc/configuration.xml --progress -v -O on -T 3 --assignTags --search --remote on -i *.so -i *.o -i *.a -i *.class -i *.jar -i *.apk -i *.tar -i *.bz2 -i *.gz -i *.obj -i *.zip
Thank you for your help in advance.
Kind Regards
Siegfried
You should try to increase the number of threads using the following options (a combined example follows the descriptions):
--historyThreads number
The number of threads to use for history cache generation on repository level. By default the number of threads will be set to the number of available CPUs.
Assumes -H/--history.
--historyFileThreads number
The number of threads to use for history cache generation when dealing with individual files.
By default the number of threads will be set to the number of available CPUs.
Assumes -H/--history.
-T, --threads number
The number of threads to use for index generation, repository scan
and repository invalidation.
By default the number of threads will be set to the number of available
CPUs. This influences the number of spawned ctags processes as well.
Take a look at the "renamedHistory" option too. Theoretically "off" is the default, but this has a huge impact on index time, so it's worth checking:
--renamedHistory on|off
Enable or disable generating history for renamed files.
If set to on, makes history indexing slower for repositories
with lots of renamed files. Default is off.
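With 4 CPUs, the thread-related flags could be folded into the existing command roughly like this (a sketch; the thread counts are illustrative and the remaining original options are elided):
opengrok-indexer -J=-Xms30G -J=-Xmx30G -a /var/opengrok/dist/lib/opengrok.jar -- \
    -c /usr/bin/ctags -s /var/opengrok/src/ -d /var/opengrok/data \
    -H -T 4 --historyThreads 4 --historyFileThreads 4 \
    --renamedHistory off \
    ...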

Increase Flume MaxHeap

Good Afternoon,
I'm having trouble increasing the Heap Size for Flume. As a result, I get:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
I've increased the heap defined in "flume-env.sh" as well as in Hadoop/YARN. No luck.
One thing to note: on starting Flume, the exec (ProcessBuilder?) seems to be setting the heap to 20 MB. Any ideas on how to override it?
Info: Including Hadoop libraries found via (/usr/local/hadoop/bin/hadoop) for HDFS access
Info: Including Hive libraries found via () for Hive access
+ exec /usr/lib/jvm/java-9-openjdk-amd64/bin/java -Xmx20m -cp 'conf:/usr/local/flume/lib/* :
........
Ultimately I'm trying to set the heap size to 1512 MB.
Increasing the heap in "flume-env.sh" should work. You can also try executing your Flume agent as follows:
flume-ng agent -n myagent -Xmx512m
Flume is able to read -D and -X options from the command line.
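A minimal flume-env.sh sketch, assuming the stock layout where the flume-ng launcher sources conf/flume-env.sh and picks up JAVA_OPTS (the 1512 MB target matches the question):
# conf/flume-env.sh
export JAVA_OPTS="-Xms512m -Xmx1512m"
If the startup line still shows -Xmx20m (which appears to be the launcher's built-in default), double-check that the agent is started with --conf pointing at the directory that contains this flume-env.sh.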

Java Memory Usage, many contradictory numbers

I am running multiple instances of a java web app (Play Framework).
The longer the web apps run, the less memory is available, until I restart the web apps. Sometimes I get an OutOfMemory exception.
I am trying to find the problem, but I get a lot of contradictory information, so I am having trouble finding the source.
This is the info:
Ubuntu 14.04.5 LTS with 12 GB of RAM
OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-3~14.04.1-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
EDIT:
Here are the JVM settings:
-Xms64M
-Xmx128m
-server
(I am not 100% sure if these parameters are passed correctly to the JVM since I am using an /etc/init.d script with start-stop-daemon which starts the play framework script, which starts the JVM)
This is how I use it:
start() {
echo -n "Starting MyApp"
sudo start-stop-daemon --background --start \
--pidfile ${APPLICATION_PATH}/RUNNING_PID \
--chdir ${APPLICATION_PATH} \
--exec ${APPLICATION_PATH}/bin/myapp \
-- \
-Dinstance.name=${NAME} \
-Ddatabase.name=${DATABASE} \
-Dfile.encoding=utf-8 \
-Dsun.jnu.encoding=utf-8 \
-Duser.country=DE \
-Duser.language=de \
-Dhttp.port=${PORT} \
-J-Xms64M \
-J-Xmx128m \
-J-server \
-J-XX:+HeapDumpOnOutOfMemoryError \
>> \
$LOGFILE 2>&1
I am picking one instance of the web apps now:
htop shows 4615M of VIRT and 338M of RES.
When I create a heap dump with jmap -dump:live,format=b,file=mydump.dump <mypid>, the file is only about 50 MB.
When I open it in Eclipse MAT the overview shows "20.1MB" of used memory (with the "Keep unreachable objects" option set to ON).
So how can the 338 MB shown in htop shrink to 20.1 MB in Eclipse MAT?
I don't think this is GC related, because no matter how long I wait, htop always shows about this amount of memory; it never goes down.
In fact, I would assume that my simple app does not use more than 20 MB, maybe 30 MB.
I compared two heap dumps with an age difference of 4 hours in Eclipse MAT and I don't see any significant increase in objects.
PS: I added the -XX:+HeapDumpOnOutOfMemoryError option, but I have to wait 5-7 days until it happens again. I hope to find the problem earlier with your help interpreting my numbers.
Thank you,
schube
The heap is the memory containing Java objects; htop surely doesn't know about the heap. Among the things that contribute to the used memory, as reported by VIRT, are:
The JVM’s own code and that of the required libraries
The byte code and meta information of the loaded classes
The JIT compiled code of frequently used methods
I/O buffers
thread stacks
memory allocated for the heap, but currently not containing live objects
When you dump the heap, it will contain the live Java objects plus meta information that allows tools to understand the contents, like class and field names. When a tool calculates the used heap, it counts the objects only, so the result will naturally be smaller than the heap dump file size. Also, this used-memory figure often does not include memory that is unusable due to padding/alignment; further, the tools sometimes assume the wrong pointer size, as the relevant information (32-bit architecture vs. 64-bit architecture vs. compressed oops) is not available in the heap dump. These errors may add up.
Note that there might be other reasons for an OutOfMemoryError than having too many objects in the heap. E.g. there might be too much meta information, due to a memory leak combined with dynamic class loading, or too many native I/O buffers…
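If you want to see where the non-heap part of RES goes, one option (a sketch, not something from the original setup) is the JVM's Native Memory Tracking, available since Java 8:
# add to the JVM options in the start script, in the same -J style as above
-J-XX:NativeMemoryTracking=summary
# then ask the running JVM for a breakdown: heap, metaspace, thread stacks, code cache, GC, ...
jcmd <pid> VM.native_memory summary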

azk - How to increase a VM memory in azk?

I am trying to increase the memory of the VM in azk. Is there some environment variable to do that? Can someone help me, please?
azk (http://azk.io/)
The amount of memory must be set before starting the azk agent. So, be sure the agent is down and run:
export AZK_VM_MEMORY=[memory size in MB]
azk agent start
As a shorthand, you can put the export command into your .profile, .bashrc or .zshrc file (depending on the shell you are using) to make that config persistent between different terminal sessions.
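For example (the 2048 MB value is illustrative):
# append the setting to ~/.bashrc so every new shell picks it up
echo 'export AZK_VM_MEMORY=2048' >> ~/.bashrc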
Note: by default, azk uses 1/6 of the total memory (or 512 MB, whichever is greater) for the VM.

How can I run Neo4j with larger heap size, specify -server and correct GC strategy

As someone who has never really messed with the JVM much, how can I ensure my Neo4j instances are running with all of the recommended JVM settings, e.g. heap size, server mode, and -XX:+UseConcMarkSweepGC?
Should these be set inside a config file? Can I set them dynamically at runtime? Are they set at a system level? Can I have different settings when running two instances of Neo4j on the same machine?
It is a bit fuzzy at what point all of these things get set.
I am running neo4j inside a docker container so that is something to consider as well.
The Dockerfile is as follows. I am starting Neo4j with the console command.
FROM dockerfile/java:oracle-java8
# INSTALL OS DEPENDENCIES AND NEO4J
ADD /files/neo4j-enterprise-2.1.3-unix.tar.gz /opt/neo
RUN rm /opt/neo/neo4j-enterprise-2.1.3/conf/neo4j-server.properties
ADD /files/neo4j-server.properties /opt/neo/neo4j-enterprise-2.1.3/conf/neo4j-server.properties
#RUN mv -f /files/neo4j-server.properties /opt/neo/neo4j-enterprise-2.1.3/conf/neo4j-server.properties
EXPOSE 7474
CMD ["console"]
ENTRYPOINT ["/opt/neo/neo4j-enterprise-2.1.3/bin/neo4j"]
Ok, so you are using the Neo4j server script. In this case you should configure the low-level JVM properties in neo4j.properties, which should also live in the conf directory. Basically, do the same thing for neo4j.properties as you already do for neo4j-server.properties: create the properties file in your Docker context and configure the properties you want to add. Then in the Dockerfile use:
ADD /files/neo4j.properties /opt/neo/neo4j-enterprise-2.1.3/conf/neo4j.properties
The syntax in the properties files is the following (from the documentation); a filled-in example follows the excerpt:
# initial heap size (in MB)
wrapper.java.initmemory=<value>
# maximum heap size (in MB)
wrapper.java.maxmemory=<value>
# additional literal JVM parameter, where N is a number for each
wrapper.java.additional.N=<value>
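Filled in for the settings asked about above, this could look like (the heap size is illustrative):
# initial and maximum heap size, in MB
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096
# additional JVM flags, numbered consecutively
wrapper.java.additional.1=-server
wrapper.java.additional.2=-XX:+UseConcMarkSweepGC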
See also http://docs.neo4j.org/chunked/stable/server-performance.html.
One way to test whether the settings are applied is to run jinfo <pid> in the Docker container, where <pid> is the process id of the Neo4j JVM. To enter the container, you can either change the entrypoint to /bin/bash at the command line when you run the container, or you can use nsenter. The latter would be my choice.
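A sketch of the nsenter route from the Docker host (the container name and PIDs are placeholders):
# get the PID of the container's main process on the host
PID=$(docker inspect --format '{{.State.Pid}}' <container-name>)
# enter the container's namespaces
sudo nsenter --target "$PID" --mount --uts --ipc --net --pid
# inside the container: inspect the flags of the Neo4j JVM
jinfo <neo4j-jvm-pid>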
