Is this a bug in Spark Streaming or a memory leak?

I submit my code to a Spark standalone cluster. The submit command is:
nohup ./bin/spark-submit \
--master spark://ES01:7077 \
--executor-memory 4G \
--num-executors 1 \
--total-executor-cores 1 \
--conf "spark.storage.memoryFraction=0.2" \
./myCode.py 1>a.log 2>b.log &
I specify 4G of executor memory in the command above, but when I monitor the executor process with top, I notice that its memory usage keeps growing. The current top output is:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12578 root 20 0 20.223g 5.790g 23856 S 61.5 37.3 20:49.36 java
My total memory is 16G, so 37.3% is already more than the 4G I specified, and it is still growing.
The ps command confirms that this is the executor process:
[root@ES01 ~]# ps -awx | grep spark | grep java
10409 ? Sl 1:43 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.master.Master --ip ES01 --port 7077 --webui-port 8080
10603 ? Sl 6:16 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms4G -Xmx4G -XX:MaxPermSize=256m org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://ES01:7077
12420 ? Sl 10:16 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms1g -Xmx1g -XX:MaxPermSize=256m org.apache.spark.deploy.SparkSubmit --master spark://ES01:7077 --conf spark.storage.memoryFraction=0.2 --executor-memory 4G --num-executors 1 --total-executor-cores 1 /opt/flowSpark/sparkStream/ForAsk01.py
12578 ? Sl 21:03 java -cp /opt/spark-1.6.0-bin-hadoop2.6/conf/:/opt/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.6.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/hadoop-2.6.2/etc/hadoop/ -Xms4096M -Xmx4096M -Dspark.driver.port=52931 -XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@10.79.148.184:52931 --executor-id 0 --hostname 10.79.148.184 --cores 1 --app-id app-20160511080701-0013 --worker-url spark://Worker@10.79.148.184:52660
The code is below. It is very simple, so I do not think it has a memory leak.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

if __name__ == "__main__":
    dataDirectory = '/stream/raw'
    sc = SparkContext(appName="Netflow")
    ssc = StreamingContext(sc, 20)  # 20-second batch interval
    # Read CSV File
    lines = ssc.textFileStream(dataDirectory)
    lines.foreachRDD(process)
    ssc.start()
    ssc.awaitTermination()
The code for the process function is below. Please note that I am using HiveContext here, not SQLContext, because SQLContext does not support window functions.
from pyspark.sql import HiveContext, Row
from pyspark.sql import functions as func
from pyspark.sql.window import Window

def getSqlContextInstance(sparkContext):
    # Lazily create a single HiveContext and reuse it across batches
    if ('sqlContextSingletonInstance' not in globals()):
        globals()['sqlContextSingletonInstance'] = HiveContext(sparkContext)
    return globals()['sqlContextSingletonInstance']

def process(time, rdd):
    if rdd.isEmpty():
        return sc.emptyRDD()
    sqlContext = getSqlContextInstance(rdd.context)
    # Convert CSV File to Dataframe
    parts = rdd.map(lambda l: l.split(","))
    rowRdd = parts.map(lambda p: Row(router=p[0], interface=int(p[1]), flow_direction=p[9], bits=int(p[11])))
    dataframe = sqlContext.createDataFrame(rowRdd)
    # Get the top 2 interface of each router
    dataframe = dataframe.groupBy(['router','interface']).agg(func.sum('bits').alias('bits'))
    windowSpec = Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
    rank = func.dense_rank().over(windowSpec)
    ret = dataframe.select(dataframe['router'],dataframe['interface'],dataframe['bits'], rank.alias('rank')).filter("rank<=2")
    ret.show()
    dataframe.show()
In fact, I found that the following code causes the problem:
# Get the top 2 interface of each router
dataframe = dataframe.groupBy(['router','interface']).agg(func.sum('bits').alias('bits'))
windowSpec = Window.partitionBy(dataframe['router']).orderBy(dataframe['bits'].desc())
rank = func.dense_rank().over(windowSpec)
ret = dataframe.select(dataframe['router'],dataframe['interface'],dataframe['bits'], rank.alias('rank')).filter("rank<=2")
ret.show()
If I remove these 5 lines, the code can run all night without any memory increase. With them, the memory usage of the executor grows to a very high number.
Basically, the code above is just a window function plus a groupBy in Spark SQL. So is this a bug?

Disclaimer: this answer isn't based on debugging, but on observations and the documentation Apache Spark provides.
I don't believe that this is a bug to begin with!
Looking at your configuration, we can see that you are focusing mostly on executor tuning, which isn't wrong, but you are forgetting the driver part of the equation.
According to the cluster overview in the Apache Spark documentation, each worker has an executor. In your case, however, the worker node is the same as the driver node, which is what happens when you run locally or on a standalone cluster on a single node.
Further, the driver takes 1G of memory by default unless tuned using the spark.driver.memory setting. You should also not forget the memory used by the JVM itself, and the Web UI, which is served by the driver too AFAIK!
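To put numbers on this, the heap caps of every JVM on the node can be added up. The snippet below is only a sketch: the -Xmx values are copied from the ps output in the question, while max_heap_mb and the dictionary labels are names made up for this illustration. Keep in mind that -Xmx caps only the Java heap, so the resident size that top reports for a JVM can be larger than its -Xmx.

```python
import re

# JVM flags copied from the `ps` output in the question (one entry per process).
JVM_COMMANDS = {
    "Master": "-Xms4G -Xmx4G -XX:MaxPermSize=256m",
    "Worker": "-Xms4G -Xmx4G -XX:MaxPermSize=256m",
    "SparkSubmit (driver)": "-Xms1g -Xmx1g -XX:MaxPermSize=256m",
    "Executor": "-Xms4096M -Xmx4096M -XX:MaxPermSize=256m",
}

UNIT_TO_MB = {"k": 1 / 1024, "m": 1, "g": 1024}


def max_heap_mb(command_line):
    """Return the -Xmx value of a JVM command line in MB, or None if it is absent."""
    match = re.search(r"-Xmx(\d+)([kKmMgG])", command_line)
    if match is None:
        return None
    value, unit = match.groups()
    return int(value) * UNIT_TO_MB[unit.lower()]


heap_caps = {name: max_heap_mb(cmd) for name, cmd in JVM_COMMANDS.items()}
total_heap_mb = sum(heap_caps.values())

for name, cap in heap_caps.items():
    print("%-22s %6d MB" % (name, cap))
print("total heap cap: %.1f GB on a 16 GB machine" % (total_heap_mb / 1024))
```

The four processes together are allowed 13 GB of heap alone, before any non-heap JVM memory or the Python workers are counted.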
When you delete the lines of code you mentioned, your code is left without actions, as map is just a transformation; hence there is no execution, and therefore you don't see any memory increase at all!
The same applies to groupBy: it is just a transformation (as is agg) and will not be executed until an action is called, which in your case is show further down the stream!
That said, try to minimize your driver memory and the overall number of cores, which is defined by spark.cores.max if you want to control the number of cores for this process, then cascade down to the executors. Moreover, I would add spark.python.profile.dump to your configuration so you can see a profile of your Spark job execution, which can help you understand the case better and tune your cluster to your needs.

Looking at your 5 lines, the groupBy may be the issue. Could you try reduceByKey instead and see how it performs?
See here and here.


When using a PySpark pandas_udf, the Python workers use too much memory and exceed the memory limit

The Spark version is 2.4.0. My cluster has four nodes, each with 16 CPUs and 128 GB of RAM.
I am using Jupyter Notebook connected to PySpark. The workflow reads Kudu data with Spark and then processes it with a pandas UDF. I start pyspark from the terminal with:
PYSPARK_DRIVER_PYTHON="/opt/anaconda2/envs/py3/bin/jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark2 --jars kudu-spark2_2.11-1.7.0-cdh5.15.0.jar
--conf spark.executor.memory=40g --conf spark.executor.memoryOverhead=5g --num-executors=4 --executor-cores=8 --conf yarn.nodemanager.vmem-check-enabled=false
My dataset is only 6 GB in 32 partitions. While the job runs, I can see that each node has one executor containing 8 Python workers, and each Python worker uses 6 GB of memory! The container is then killed by YARN for exceeding the memory limit:
Container killed by YARN for exceeding memory limits. 45.1 GB of 45 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead
I'm confused about why it takes up so much memory. The data size of each partition is only ~200 MB. Doesn't pandas_udf avoid serialization and deserialization by using pyarrow? Could Jupyter be causing the problem?
I would be very grateful for any help.
This is my code:
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf, PandasUDFType, rand, randn
from pyspark.sql.types import *

df = spark.range(0, 800000000)
df = df.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"),
               randn(seed=27).alias("normal1"), randn(seed=1).alias("normal3"))
# Assign each row a random flag between "0" and "22"
df = df.withColumn("flag",
    F.array(
        F.lit("0"),
        F.lit("1"),
        F.lit("2"),
        F.lit("3"),
        F.lit("4"),
        F.lit("5"),
        F.lit("6"),
        F.lit("7"),
        F.lit("8"),
        F.lit("9"),
        F.lit("10"),
        F.lit("11"),
        F.lit("12"),
        F.lit("13"),
        F.lit("14"),
        F.lit("15"),
        F.lit("16"),
        F.lit("17"),
        F.lit("18"),
        F.lit("19"),
        F.lit("20"),
        F.lit("21"),
        F.lit("22"),
    ).getItem(
        (F.rand()*23).cast("int")
    )
)

schema = StructType([
    StructField("flag", IntegerType()),
])

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def del_data(data):
    import os
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
    return data[["flag"]]

df.groupBy('flag').apply(del_data).write.csv('/tmp/')

Vivado Synthesis hangs in Docker container spawned by Jenkins

I'm attempting to move our large FPGA build into a Jenkins CI environment, but the build hangs at the end of synthesis when run in a Docker container spawned by Jenkins.
I've attempted to replicate the environment that Jenkins is creating, but when I spawn a Docker container myself, there's no issue with the build.
I've tried:
- reducing the number of jobs (aka threads) that Vivado uses, thinking that perhaps there was some thread collision occurring when writing out log files
- on the same note, used the -nolog -nojournal options on the vivado commands to remove any log file collisions
- taking control of the cloned/checked-out project and running commands as the local user in the Docker container
I also have an extremely small build that makes it through the entire build process in Jenkins with no issue, so I don't think there is a fundamental flaw with my Docker containers.
agent {
    docker {
        image "vivado:2017.4"
        args """
            -v <MOUNT XILINX LICENSE FILE>
            --dns <DNS_ADDRESS>
            --mac-address <MAC_ADDRESS>
        """
    }
}
steps {
    sh "chmod -R 777 ."
    dir(path: "${params.root_dir}") {
        timeout(time: 15, unit: 'MINUTES') {
            // Create HLS IP for use in Vivado project
            sh './run_hls.sh'
        }
        timeout(time: 20, unit: 'MINUTES') {
            // Create vivado project, add sources, constraints, HLS IP, generated IP
            sh 'source source_vivado.sh && vivado -mode batch -source tcl/setup_proj.tcl'
        }
        timeout(time: 20, unit: 'MINUTES') {
            // Create block designs from TCL scripts
            sh 'source source_vivado.sh && vivado -mode batch -source tcl/run_bd.tcl'
        }
        timeout(time: 1, unit: 'HOURS') {
            // Synthesize complete project
            sh 'source source_vivado.sh && vivado -mode batch -source tcl/run_synth.tcl'
        }
    }
}
The log below is from a run with 1 job and a 12-hour timeout. You can see that synthesis finished, and then the timeout occurred 8 hours later.
[2019-04-17T00:30:06.131Z] Finished Writing Synthesis Report : Time (s): cpu = 00:01:53 ; elapsed = 00:03:03 . Memory (MB): peak = 3288.852 ; gain = 1750.379 ; free physical = 332 ; free virtual = 28594
[2019-04-17T00:30:06.131Z] ---------------------------------------------------------------------------------
[2019-04-17T00:30:06.131Z] Synthesis finished with 0 errors, 0 critical warnings and 671 warnings.
[2019-04-17T08:38:37.742Z] Sending interrupt signal to process
[2019-04-17T08:38:43.013Z] Terminated
[2019-04-17T08:38:43.013Z]
[2019-04-17T08:38:43.013Z] Session terminated, killing shell... ...killed.
[2019-04-17T08:38:43.013Z] script returned exit code 143
Running the same commands in locally spawned Docker containers has no issues whatsoever. Unfortunately, the timeout Jenkins step doesn't appear to flush open buffers, as my post:unsuccessful step that prints out all log files doesn't find synth_1, though I wouldn't expect there to be anything different from the Jenkins capture.
Are there any known issues with Jenkins/Vivado integration? Is there a way to enter a Jenkins spawned container so I can try and duplicate what I'm expecting vs what I'm experiencing?
EDIT: I've since added in a timeout in the actual tcl scripts to move past the wait_on_runs command used in run_synth.tcl, but now I'm experiencing the same hanging behavior during implementation.
The problem lies in the way vivado deals (or doesn't deal...) with its forked processes. Specifically, I think this applies to parallel synthesis, which may be why you only see it in some of your projects. In the state you describe above (stuck after "Synthesis finished") I noticed a couple of abandoned zombie processes of vivado. To my understanding these are child processes that ended, but whose parent ended without collecting their exit status. Tracing with strace even reveals that vivado keeps signalling these processes with kill (signal 0, which only tests whether the PID still exists):
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
kill(319, SIG_0) = 0
kill(370, SIG_0) = 0
kill(422, SIG_0) = 0
kill(474, SIG_0) = 0
nanosleep({tv_sec=5, tv_nsec=0}, 0x7f86edcf4dd0) = 0
kill(319, SIG_0) = 0
kill(370, SIG_0) = 0
kill(422, SIG_0) = 0
kill(474, SIG_0) = 0
nanosleep({tv_sec=5, tv_nsec=0}, <detached ...>
But (as we all know) you can't kill zombies, they are already dead...
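The polling loop in the strace output can be reproduced in a few lines of Python. This is only a sketch of the mechanism, assuming a Linux host; it says nothing about how vivado is implemented internally, and pid_exists is a helper name made up for the illustration. A child that has exited but has not been reaped still answers a signal-0 probe, so a caller that waits for the probe to fail waits forever. The probe only fails once somebody calls wait on the child, which is the job an init process does for orphans.

```python
import os
import time


def pid_exists(pid):
    """Probe a PID with signal 0: nothing is delivered, only existence is checked."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


child = os.fork()
if child == 0:
    os._exit(0)  # the child exits at once and stays a zombie until it is reaped

time.sleep(0.5)  # give the child time to exit

exists_before_reap = pid_exists(child)  # the zombie still occupies its PID
os.waitpid(child, 0)                    # reap it, as an init process would
exists_after_reap = pid_exists(child)

print("before reaping:", exists_before_reap)
print("after reaping: ", exists_after_reap)
```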
Normally these processes would be adopted by the init process and handled there. But in the case of Jenkins Pipeline in Docker there is no init by default. The pipeline spawns the container and runs cat with no inputs to keep it alive. This way cat becomes pid 1 and takes the abandoned children of vivado. cat of course doesn't know what to do with them and ignores them (a tragedy really).
cat,1
|-(sh,16)
|-sh,30 -c ...
| |-sh,31 -c ...
| | `-sleep,5913 3
| `-sh,32 -xe /home/user/.jenkins/workspace...
| `-sh,35 -xe /home/user/.jenkins/workspace...
| `-vivado,36 /opt/Xilinx/Vivado/2019.2/bin/vivado -mode tcl ...
| `-loader,60 /opt/Xilinx/Vivado/2019.2/bin/loader -exec vivado -mode tcl ...
| `-vivado,82 -mode tcl ...
| |-{vivado},84
| |-{vivado},85
| |-{vivado},111
| |-{vivado},118
| `-{vivado},564
|-(vivado,319)
|-(vivado,370)
|-(vivado,422)
`-(vivado,474)
Luckily there is a way to have an init process in the docker container. Passing the --init argument with the docker run solves the problem for me.
agent {
    docker {
        image 'vivado:2019.2'
        args '--init'
    }
}
This creates the init process vivado seems to rely on and the build runs without problems.
Hope this helps you!
Cheers!

check_cpu + nsclient : set critical threshold only on 5min period

I am using Centreon (Nagios) to monitor the CPUs of some VMs using NSClient. In my case it only makes sense to set the critical state of the CPU probe if the average CPU load is > 95 over the 5m period. Is this achievable?
I cannot find documentation on how to specify that in the critical parameter.
Default command
check_cpu
Returns
CPU Load ok
'total 5m load'=0%;80;90 'total 1m load'=0%;80;90 'total 5s load'=7%;80;90
Command with specific threshold (but all time period can match)
check_cpu "critical=load > 90"
It is not exactly what I wanted to do but what I did is the following
check_nrpe -u -H XX.XXX.X.XXX -c check_cpu -a "crit=load > 95" "warn=load > 90" time=5m
Which limits the output to the 5m time period.
Note that to execute this from Centreon you have to set the following options in the nsclient.ini file (I wasted a lot of time on that one):
[/settings/NRPE/server]
allow nasty characters=true
[/settings/external scripts]
allow nasty characters=true
Check this script,
define service{
use generic-service
host_name xxx
service_description CPU Load
check_command check_nrpe!check_load
contact_groups sysadmin
}
---
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
You can try something like that
check_nrpe -u -H XX.XXX.X.XXX -c check_cpu -a "warning=time = '5m' and load > 80" "critical=time = '5m' and load > 90" show-all
You can also check the documentation for more info.

dask jobqueue worker failure at startup 'Resource temporarily unavailable'

I'm running dask over slurm via jobqueue and I have been getting 3 errors pretty consistently...
Basically my question is what could be causing these failures? At first glance the problem is that too many workers are writing to disk at once, or my workers are forking into many other processes, but it's pretty difficult to track that. I can ssh into the node but I'm not seeing an abnormal number of processes, and each node has a 500gb ssd, so I shouldn't be writing excessively.
Everything below this is just information about my configurations and such
My setup is as follows:
cluster = SLURMCluster(cores=1, memory=f"{args.gbmem}GB", queue='fast_q', name=args.name,
                       env_extra=["source ~/.zshrc"])
cluster.adapt(minimum=1, maximum=200)
client = await Client(cluster, processes=False, asynchronous=True)
I'm not even sure whether processes=False should be set.
I run this starter script via sbatch with 4 GB of memory, 2 cores (-c) (even though I expect to need only 1) and 1 task (-n). This sets off all of my jobs via the SLURMCluster config above. I dumped my Slurm submission scripts to files and they look reasonable.
Each job is not complex: it is a subprocess.call() of a compiled executable that takes 1 core and 2-4 GB of memory. I require the client call and further calls to be asynchronous because I have a lot of conditional computations. So each worker, when loaded, should consist of 1 Python process, 1 running executable, and 1 shell.
Imposed by the scheduler we have
>> ulimit -a
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8192
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes 512
-n: file descriptors 1024
-l: locked-in-memory size (kbytes) 64
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 1031203
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 0
-N 15: unlimited
Each node has 64 cores, so I don't really think I'm hitting any limits.
I'm using a jobqueue.yaml file that looks like this:
slurm:
  name: dask-worker
  cores: 1 # Total number of cores per job
  memory: 2 # Total amount of memory per job
  processes: 1 # Number of Python processes per job
  local-directory: /scratch # Location of fast local storage like /scratch or $TMPDIR
  queue: fast_q
  walltime: '24:00:00'
  log-directory: /home/dbun/slurm_logs
I would appreciate any advice at all! Full log is below.
FORK BLOCKING IO ERROR
distributed.nanny - INFO - Start Nanny at: 'tcp://172.16.131.82:13687'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/home/dbun/.local/share/pyenv/versions/3.7.0/lib/python3.7/multiprocessing/forkserver.py", line 250, in main
pid = os.fork()
BlockingIOError: [Errno 11] Resource temporarily unavailable
distributed.dask_worker - INFO - End worker
Aborted!
CANT START NEW THREAD ERROR
https://pastebin.com/ibYUNcqD
BLOCKING IO ERROR
https://pastebin.com/FGfxqZEk
EDIT:
Another piece of the puzzle:
It looks like dask_worker is running multiple multiprocessing.forkserver calls? does that sound reasonable?
https://pastebin.com/r2pTQUS4
This problem was caused by having ulimit -u too low.
As it turns out, each worker has a few processes associated with it, and the Python ones have multiple threads. In the end you have approximately 14 threads per worker that count against your ulimit -u. Mine was set to 512, and with a 64-core system I was likely hitting ~896 (64 × 14). With that limit, the maximum number of threads I could have had per worker would have been 8 (512 / 64).
Solution:
in .zshrc (.bashrc) I added the line
ulimit -u unlimited
Haven't had any problems since.
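The arithmetic behind this can be checked with a short script. It is a rough sketch: the figures (64 workers per node, about 14 threads each, a limit of 512) come from my case above, the two helper functions are names made up for the illustration, and it assumes Linux, where RLIMIT_NPROC is the limit that ulimit -u reports.

```python
import resource


def estimated_threads(workers_per_node, threads_per_worker):
    """Threads that count against `ulimit -u` when every core runs one worker."""
    return workers_per_node * threads_per_worker


def max_threads_per_worker(nproc_limit, workers_per_node):
    """Largest per-worker thread count that still fits under the limit."""
    return nproc_limit // workers_per_node


needed = estimated_threads(workers_per_node=64, threads_per_worker=14)
budget = max_threads_per_worker(nproc_limit=512, workers_per_node=64)
print("threads needed on a full node:", needed)
print("threads allowed per worker with the old limit:", budget)

# Compare against the limit of the current shell or job environment.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
if soft == resource.RLIM_INFINITY:
    print("ulimit -u is unlimited")
else:
    print("ulimit -u is %d, enough for a full node: %s" % (soft, soft >= needed))
```

Running the last part inside a job submitted through the scheduler shows the limit the workers actually inherit, which may differ from the one in an interactive shell.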

Snakemake memory limiting

In Snakemake, I have 5 rules. For each one I set the memory limit with the resources mem_mb option.
It looks like this:
rule assembly:
    input:
        file1 = os.path.join(MAIN_DIR, "1.txt"), \
        file2 = os.path.join(MAIN_DIR, "2.txt"), \
        file3 = os.path.join(MAIN_DIR, "3.txt")
    output:
        foldr = dir, \
        file4 = os.path.join(dir, "A.png"), \
        file5 = os.path.join(dir, "A.tsv")
    resources:
        mem_mb=100000
    shell:
        " pythonscript.py -i {input.file1} -v {input.file2} -q {input.file3} --cores 5 -o {output.foldr} "
I want to limit the memory usage of the whole Snakefile by doing something like:
snakemake --snakefile mysnakefile_snakefile --resources mem_mb=100000
So the jobs would not use 100GB each (with 5 rules, that would mean 500GB of memory allocated), but all of them together would use at most 100GB (5 jobs, 100GB allocated in total)?
The command line argument sets the total limit. The Snakemake scheduler will ensure that for the set of running jobs, the sum of the mem_mb resources will not exceed the total limit.
I think this is exactly what you want, isn't it? You just need to set the per-job expected memory in the rule itself. Note that Snakemake does not measure this for you. You have to define that value yourself in the rule. E.g., if you expect your job to use 100MB memory, put mem_mb=100 into that rule.
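The effect of the total limit can be illustrated with a toy admission check. This is a simplified sketch and not Snakemake's actual scheduler, which is more sophisticated; select_runnable and the rule names other than assembly are made up for the example. It only shows the invariant described above: the sum of mem_mb over the running jobs never exceeds the value given to --resources.

```python
def select_runnable(jobs, total_mem_mb):
    """Admit jobs in order while the summed mem_mb stays within the total limit.

    jobs is a list of (name, mem_mb) pairs; returns (admitted names, memory used).
    """
    running, used = [], 0
    for name, mem_mb in jobs:
        if used + mem_mb <= total_mem_mb:
            running.append(name)
            used += mem_mb
    return running, used


# Hypothetical rule names; only "assembly" appears in the question.
rule_names = ["assembly", "rule_b", "rule_c", "rule_d", "rule_e"]

# Every rule declares mem_mb=100000: with a 100000 total, jobs run one at a time.
greedy_rules = [(name, 100000) for name in rule_names]
print(select_runnable(greedy_rules, total_mem_mb=100000))

# Every rule declares mem_mb=20000: all five fit under the same total.
modest_rules = [(name, 20000) for name in rule_names]
print(select_runnable(modest_rules, total_mem_mb=100000))
```

So if the five jobs should run side by side within 100GB, each rule has to declare a realistic per-job value rather than the whole budget.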
