Spark Worker asking for absurd amounts of virtual memory

I am running a Spark job on a 2-node YARN cluster. My dataset is not large (< 100 MB), just for testing, and the worker is getting killed because it is asking for too much virtual memory. The amounts here are absurd: 2 GB out of 11 GB physical memory used, roughly 300 GB of virtual memory used.
16/02/12 05:49:43 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.1 (TID 22, ip-172-31-6-141.ec2.internal): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container marked as failed: container_1455246675722_0023_01_000003 on host: ip-172-31-6-141.ec2.internal. Exit status: 143. Diagnostics: Container [pid=23206,containerID=container_1455246675722_0023_01_000003] is running beyond virtual memory limits. Current usage: 2.1 GB of 11 GB physical memory used; 305.3 GB of 23.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1455246675722_0023_01_000003 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 23292 23213 23292 23206 (python) 15 3 101298176 5514 python -m pyspark.daemon
|- 23206 1659 23206 23206 (bash) 0 0 11431936 352 /bin/bash -c /usr/lib/jvm/java-7-openjdk-amd64/bin/java -server -XX:OnOutOfMemoryError='kill %p' -Xms10240m -Xmx10240m -Djava.io.tmpdir=/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1455246675722_0023/container_1455246675722_0023_01_000003/tmp '-Dspark.driver.port=37386' -Dspark.yarn.app.container.log.dir=/mnt/yarn/logs/application_1455246675722_0023/container_1455246675722_0023_01_000003 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler#172.31.0.92:37386 --executor-id 2 --hostname ip-172-31-6-141.ec2.internal --cores 8 --app-id application_1455246675722_0023 --user-class-path file:/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1455246675722_0023/container_1455246675722_0023_01_000003/app.jar 1> /mnt/yarn/logs/application_1455246675722_0023/container_1455246675722_0023_01_000003/stdout 2> /mnt/yarn/logs/application_1455246675722_0023/container_1455246675722_0023_01_000003/stderr
|- 23341 23292 23292 23206 (python) 87 8 39464374272 23281 python -m pyspark.daemon
|- 23350 23292 23292 23206 (python) 86 7 39463976960 24680 python -m pyspark.daemon
|- 23329 23292 23292 23206 (python) 90 6 39464521728 23281 python -m pyspark.daemon
|- 23213 23206 23206 23206 (java) 1168 61 11967115264 359820 /usr/lib/jvm/java-7-openjdk-amd64/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms10240m -Xmx10240m -Djava.io.tmpdir=/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1455246675722_0023/container_1455246675722_0023_01_000003/tmp -Dspark.driver.port=37386 -Dspark.yarn.app.container.log.dir=/mnt/yarn/logs/application_1455246675722_0023/container_1455246675722_0023_01_000003 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler#172.31.0.92:37386 --executor-id 2 --hostname ip-172-31-6-141.ec2.internal --cores 8 --app-id application_1455246675722_0023 --user-class-path file:/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1455246675722_0023/container_1455246675722_0023_01_000003/app.jar
|- 23347 23292 23292 23206 (python) 87 10 39464783872 23393 python -m pyspark.daemon
|- 23335 23292 23292 23206 (python) 83 9 39464112128 23216 python -m pyspark.daemon
|- 23338 23292 23292 23206 (python) 81 9 39463714816 24614 python -m pyspark.daemon
|- 23332 23292 23292 23206 (python) 86 6 39464374272 24812 python -m pyspark.daemon
|- 23344 23292 23292 23206 (python) 85 30 39464374272 23281 python -m pyspark.daemon
Container killed on request. Exit code is 143
Does anyone know why this might be happening? I've been trying to modify various YARN and Spark configurations, but I know something is deeply wrong for it to be asking for this much vmem.

The command I was running was
spark-submit --executor-cores 8 ...
It turns out the executor-cores flag doesn't do what I thought it did. It makes 8 copies of the pyspark.daemon process, i.e. 8 copies of the worker process that runs jobs. Each process was using 38 GB of virtual memory, which is unnecessarily large, but 8 * 38 ≈ 300, so that explains it.
It's actually a very poorly named flag. If I set executor-cores to 1, it makes one daemon, but that daemon will still use multiple cores, as seen via htop.
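For reference, a submission along these lines keeps the fan-out down to one worker per executor; this is only a sketch, and the script name and memory value are placeholders rather than values taken from the original job:
# one pyspark.daemon worker per executor instead of eight
spark-submit \
  --master yarn \
  --executor-cores 1 \
  --executor-memory 10g \
  my_job.py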

Related

What is the real memory available in a Docker container?

I've run a mongodb service via docker-compose like this:
version: '2'
services:
  mongo:
    image: mongo
    environment:
      MONGO_INITDB_ROOT_USERNAME: root
      MONGO_INITDB_ROOT_PASSWORD: example
    mem_limit: 4GB
If I run docker stats I can see 4 GB allocated:
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
cf3ccbd17464 michal_mongo_1 0.64% 165.2MiB / 4GiB 4.03% 10.9kB / 4.35kB 0B / 483kB 35
But if I run this command, I get the RAM of my laptop, which is 32 GB:
~$ docker exec michal_mongo_1 free -g
total used free shared buff/cache available
Mem: 31 4 17 0 9 24
Swap: 1 0 1
How does mem_limit affect the memory size then?
free (and other utilities like top) will not report correct numbers inside a memory-constrained container, because they gather their information from /proc/meminfo, which is not namespaced.
If you want the actual limit, you must use the entries populated by the cgroup pseudo-filesystem under /sys/fs/cgroup.
For example:
docker run --rm -i --memory=128m busybox cat /sys/fs/cgroup/memory/memory.limit_in_bytes
The real-time usage information is available under /sys/fs/cgroup/memory/memory.stat.
You will probably need the resident-set-size (rss), for example (inside the container):
grep -E -e '^rss\s+' /sys/fs/cgroup/memory/memory.stat
For a more in-depth explanation, see also this article
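As a quick check inside the limited container you can read those cgroup files directly; a small sketch, assuming cgroup v1 as in the paths above (on a cgroup v2 host the equivalents are /sys/fs/cgroup/memory.max and /sys/fs/cgroup/memory.current):
# configured limit in bytes (the 4 GB limit from the compose file above)
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
# current usage in bytes, as accounted by the cgroup
cat /sys/fs/cgroup/memory/memory.usage_in_bytes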

Why does my docker container with elasticsearch always have status restarting?

Ubuntu 16.04, 1 GB RAM, on an AWS instance
I had to run an old version of elasticsearch, so I wanted to use the docker image of elasticsearch 5.3.3. After going through multiple Stack Overflow links with the same title, I modified my installation of the docker-image-based elasticsearch as below:
sudo docker run -p 9200:9200 -p 9300:9300 -d -e "http.host=0.0.0.0" -e "transport.host=127.0.0.1" -e "xpack.security.enabled=false" --restart=unless-stopped --name careerassistant-elastic docker.elastic.co/elasticsearch/elasticsearch:5.3.3
The installation finished, but I have a problem accessing elasticsearch. Though I made multiple modifications as in the above command, I couldn't resolve the issue. When I run
sudo docker ps
the status is still --> Restarting (1) 48 seconds ago
When I checked the container logs I couldn't understand anything, as I am new to docker and its usage:
> docker logs --tail 50 --follow --timestamps careerassistant-elastic
I got the following output:
2020-05-04T09:36:00.552415247Z CmaTotal: 0 kB
2020-05-04T09:36:00.552418314Z CmaFree: 0 kB
2020-05-04T09:36:00.552421364Z HugePages_Total: 0
2020-05-04T09:36:00.552424343Z HugePages_Free: 0
2020-05-04T09:36:00.552427401Z HugePages_Rsvd: 0
2020-05-04T09:36:00.552430358Z HugePages_Surp: 0
2020-05-04T09:36:00.552433336Z Hugepagesize: 2048 kB
2020-05-04T09:36:00.552436334Z DirectMap4k: 67584 kB
2020-05-04T09:36:00.552439415Z DirectMap2M: 980992 kB
2020-05-04T09:36:00.552442390Z
2020-05-04T09:36:00.552445460Z
2020-05-04T09:36:00.552448777Z CPU:total 1 (initial active 1) (1 cores per cpu, 1 threads per core) family 6 model 63 stepping 2, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, avx, avx2, aes, clmul, erms, lzcnt, tsc, bmi1, bmi2
2020-05-04T09:36:00.552452312Z
2020-05-04T09:36:00.552455227Z /proc/cpuinfo:
2020-05-04T09:36:00.552458471Z processor : 0
2020-05-04T09:36:00.552461695Z vendor_id : GenuineIntel
2020-05-04T09:36:00.552464872Z cpu family : 6
2020-05-04T09:36:00.552467992Z model : 63
2020-05-04T09:36:00.552471311Z model name : Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
2020-05-04T09:36:00.552474616Z stepping : 2
2020-05-04T09:36:00.552477715Z microcode : 0x43
2020-05-04T09:36:00.552480781Z cpu MHz : 2400.040
2020-05-04T09:36:00.552483934Z cache size : 30720 KB
2020-05-04T09:36:00.552486978Z physical id : 0
2020-05-04T09:36:00.552490023Z siblings : 1
2020-05-04T09:36:00.552493103Z core id : 0
2020-05-04T09:36:00.552496146Z cpu cores : 1
2020-05-04T09:36:00.552511390Z apicid : 0
2020-05-04T09:36:00.552515457Z initial apicid : 0
2020-05-04T09:36:00.552518523Z fpu : yes
2020-05-04T09:36:00.552521677Z fpu_exception : yes
2020-05-04T09:36:00.552524702Z cpuid level : 13
2020-05-04T09:36:00.552527802Z wp : yes
2020-05-04T09:36:00.552531691Z flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single kaiser fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
2020-05-04T09:36:00.552535638Z bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
2020-05-04T09:36:00.552538954Z bogomips : 4800.08
2020-05-04T09:36:00.552545171Z clflush size : 64
2020-05-04T09:36:00.552548419Z cache_alignment : 64
2020-05-04T09:36:00.552551514Z address sizes : 46 bits physical, 48 bits virtual
2020-05-04T09:36:00.552554916Z power management:
2020-05-04T09:36:00.552558030Z
2020-05-04T09:36:00.552561090Z
2020-05-04T09:36:00.552564141Z
2020-05-04T09:36:00.552567135Z Memory: 4k page, physical 1014424k(76792k free), swap 0k(0k free)
2020-05-04T09:36:00.552570458Z
2020-05-04T09:36:00.552573441Z vm_info: OpenJDK 64-Bit Server VM (25.131-b11) for linux-amd64 JRE (1.8.0_131-b11), built on Jun 16 2017 13:51:29 by "buildozer" with gcc 6.3.0
2020-05-04T09:36:00.552576947Z
2020-05-04T09:36:00.552579894Z time: Mon May 4 09:36:00 2020
2020-05-04T09:36:00.552582956Z elapsed time: 0 seconds (0d 0h 0m 0s)
2020-05-04T09:36:00.552586052Z
Can someone help me figure out what could be causing the docker status to be restarting?
I run my docker container on an AWS EC2 t2.small, which has 2 GB RAM, since the t2.micro's memory (1 GB) isn't enough to run the Elasticsearch container; so it should be fine for you as well unless you have configured a lot more things. I looked into your logs but I don't see any error, so it is difficult to debug without your docker file.
Below is my docker-compose file for running Elasticsearch 7.6 in a docker container on an AWS t2.small instance. Let me know if it doesn't work for you; I'm happy to help further.
version: '2.2'
services:
  #Elasticsearch Docker Images: https://www.docker.elastic.co/
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.6.0
    container_name: elasticsearch
    environment:
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    cap_add:
      - IPC_LOCK
    volumes:
      - elasticsearch-data:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
      - 9300:9300
volumes:
  elasticsearch-data:
    driver: local
You can run it using docker-compose up -d (note that docker-compose up does not take an -e flag; set discovery.type=single-node in the environment section instead, or pass it with docker run as sketched below), and refer to my Elasticsearch docker container in non-prod mode to eliminate vm.max_map_count=262144 requirement answer if you face any memory-related issue such as the vm.max_map_count=262144 requirement.
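Equivalently, for a quick single-node test without compose, the same settings can be passed to docker run directly; a sketch where the container name is arbitrary and the image, ports, and environment values are the ones from the compose file above:
docker run -d --name elasticsearch \
  -p 9200:9200 -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
  docker.elastic.co/elasticsearch/elasticsearch:7.6.0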

What does the memory shown in docker stats really mean?

1) I start a container with:
docker run --name test -idt python:3 python -m http.server
2) Then I try to check the memory usage:
a)
root@shubuntu1:~# ps aux | grep "python -m http.server"
root 17416 3.0 0.2 27368 19876 pts/0 Ss+ 17:11 0:00 python -m http.server
b)
root@shubuntu1:~# docker exec -it test ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.9 0.2 27368 19876 pts/0 Ss+ 09:11 0:00 python -m http.
c)
root@shubuntu1:~# docker stats --no-stream test
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
d72f2ece6816 test 0.01% 12.45MiB / 7.591GiB 0.16% 3.04kB / 0B 0B / 0B 1
From both the docker host and the docker container you can see python -m http.server consuming 19876/1024 ≈ 19.4 MB of memory (RSS), but docker stats reports 12.45 MiB. Why does it show the container consuming even less memory than the PID 1 process inside the container?
From the ps man page: rss (RSS) is the resident set size, the non-swapped physical memory that a task has used, in kilobytes (alias rssize, rsz).
From the docker stats documentation: MEM USAGE / LIMIT is the total memory the container is using, and the total amount of memory it is allowed to use.
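To see where the docker stats figure comes from, you can compare it with the cgroup accounting that it is based on; a sketch assuming cgroup v1, matching the paths used earlier on this page (docker stats reads the cgroup memory counters rather than summing per-process RSS):
# run inside the container
cat /sys/fs/cgroup/memory/memory.usage_in_bytes             # cgroup accounting, the basis of docker stats
grep -E '^(rss|cache) ' /sys/fs/cgroup/memory/memory.stat   # anonymous memory vs page cache breakdown
ps -o rss= -p 1                                             # RSS of PID 1 in kB, for comparison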

"update --memory" can not work

Docker version: 17.04.0-ce
OS: Windows 7
I start the container using the command: docker run -it -memory 4096MB <container-id>
I check the memory using the command: docker stats --no-stream | grep <container-id>
The result is:
5fbc6df8f90f 0.23% 86.52 MB / 995.8 Mib 2.59% 648B / 0B 17.2G / 608 MB 31
When I update the memory, the result is still the same:
$ docker update -m 4500MB --memory-swap 4500MB --memory-reservation 4500MB 5fbc6df8f90f
5fbc6df8f90f
$ docker stats --no-stream | grep 5fbc6df8f90f
5fbc6df8f90f 0.23% 86.52 MB / 995.8 Mib 2.59% 648B / 0B 17.2G / 608 MB 31
why "--memory" can not work ,the memory is always the same 995.8Mib?
The docker stats command is showing you how much memory the entire docker host has, or, with Docker for Windows, how much memory the Linux VM has. To increase this threshold, go into Docker's settings and change the memory allocated to the VM. See this documentation for more details.
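To confirm how much memory the daemon (or its VM) actually sees, you can ask docker itself; a sketch, assuming the MemTotal template field and the "Total Memory" line that docker info normally reports:
docker info --format '{{.MemTotal}}'   # total memory visible to the docker daemon, in bytes
docker info | grep -i 'total memory'   # the same value in human-readable form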

Docker error at higher core counts on a multi-core machine

I am running a CentOS container using docker on a RHEL 6.5 machine. I am trying to run an MPI application (MILC) on 16 cores.
My server has 20 cores and 128 GB of memory.
My application runs fine up to 15 cores but fails with APPLICATION TERMINATED WITH THE EXIT STRING: Bus error (signal 7) when using 16 cores and up. At 16 cores and up, these are the messages I see in the logs:
Jul 16 11:29:17 localhost abrt[100668]: Can't open /proc/413/status: No such file or directory
Jul 16 11:29:17 localhost abrt[100669]: Can't open /proc/414/status: No such file or directory
Jul 16 11:29:17 localhost abrt[100670]: Can't open /proc/417/status: No such file or directory
A few details on the container:
kernel 2.6.32-431.el6.x86_64
Official centos from docker hub
Started container as:
docker run -t -i -c 20 -m 125g --name=test --net=host centos /bin/bash
I would greatly appreciate any and all feedback regarding this. Please do let me know if I can provide any further information.
Regards
