Jenkins Request Handling is Very Slow - jenkins

We have a Jenkins server (master/agents) hosted in AWS, and we spin agents up on demand (based on the job queue) and connect them through swarm-client.
The main problem: rebuilding or replaying a Jenkins job takes a long time, approximately 7 minutes.
Down-scaling agents: we have a Python script that removes agents when they become idle, and processing those requests also takes a long time, on average about 5 minutes when 10+ requests arrive at once (a minimal sketch of the idea appears after the agent-connect command below).
Scaling up agents: creating agents is a little faster than the above two, on average about 2 minutes.
Plain GET/POST requests are very fast and are served within 100 milliseconds.
Below is the command used to connect an agent to the Jenkins master:
====
java -jar /usr/share/jenkins/swarm-client.jar -fsroot /var/jenkins_agent_home -deleteExistingClients -disableClientsUniqueId -executors 1 -master https://xxxxxx/ -passwordEnvVariable PASSWORD -e PASSWORD=redacted -username user -fsroot /var/jenkins/0 -name xxl-xxxxxxx -labels xxl_linux -mode exclusive
====
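For reference, here is a minimal sketch of the approach such a down-scaling script can take. This is not our actual script: the URL and credentials are placeholders, and depending on the Jenkins security settings a CSRF crumb header may also be required.

import requests

JENKINS_URL = "https://xxxxxx"   # placeholder, same master as above
AUTH = ("user", "api-token")     # placeholder username / API token

# Ask the Jenkins REST API for every agent plus its idle/offline state.
resp = requests.get(
    JENKINS_URL + "/computer/api/json",
    params={"tree": "computer[displayName,idle,offline]"},
    auth=AUTH,
)
resp.raise_for_status()

for node in resp.json()["computer"]:
    name = node["displayName"]
    # Skip the built-in node and anything still busy; delete only idle agents.
    if name != "master" and node["idle"]:
        # Each of these doDelete POSTs is one of the slow requests described above.
        requests.post(JENKINS_URL + "/computer/" + name + "/doDelete", auth=AUTH)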
Jenkins server:
Memory: 72 GB
Cores: 32
Heap memory: 16 GB
Java Version: 11
Garbage collector used: G1GC
Please help us figure out how to improve performance. Thanks in advance!

Related

Jenkins super slow

I have a Jenkins server whose GUI is super slow; whenever I refresh, it takes at least a few minutes to respond. I am copying the entire Jenkins folder from one server to another and starting Jenkins there.
A couple of scenarios I have tested:
1. Setting up Jenkins on a high-performance server, i.e. 64 GB RAM, 16 vCPUs and a 300 GB hard drive.
2. Increasing the Java memory arguments of the server to 16 GB for both min and max heap (e.g. -Xms16g -Xmx16g) and making sure it is using the G1GC algorithm (-XX:+UseG1GC).
I am expecting Jenkins to be as fast as possible.

Airbnb Airflow using all system resources

We've set up Airbnb/Apache Airflow for our ETL using LocalExecutor, and as we've started building more complex DAGs, we've noticed that Airflow has started using up incredible amounts of system resources. This is surprising to us because we mostly use Airflow to orchestrate tasks that happen on other servers, so Airflow DAGs spend most of their time waiting for those tasks to complete; there's no actual execution that happens locally.
The biggest issue is that Airflow seems to use up 100% of CPU at all times (on an AWS t2.medium), and uses over 2GB of memory with the default airflow.cfg settings.
If relevant, we're running Airflow with docker-compose, running the container twice: once as the scheduler and once as the webserver.
What are we doing wrong here? Is this normal?
EDIT:
Here is the output from htop, ordered by % Memory used (since that seems to be the main issue now, I got CPU down):
I suppose in theory I could reduce the number of gunicorn workers (it's at the default of 4), but I'm not sure what all the /usr/bin/dockerd processes are. If Docker is complicating things I could remove it, but it's made deployment of changes really easy and I'd rather not remove it if possible.
I have also tried everything I could to get the CPU usage down and Matthew Housley's advice regarding MIN_FILE_PROCESS_INTERVAL was what did the trick.
At least until airflow 1.10 came around... then the CPU usage went through the roof again.
So here is everything I had to do to get Airflow to work well on a standard DigitalOcean droplet with 2 GB of RAM and 1 vCPU:
1. Scheduler File Processing
Prevent Airflow from reloading the DAGs all the time by setting:
AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=60
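To confirm the override is actually picked up, a quick sanity check from the same environment (assuming Airflow 1.10.x, where AIRFLOW__SECTION__KEY environment variables map onto airflow.cfg sections):

from airflow.configuration import conf

# Should print 60 once the variable above is exported in the scheduler's
# environment; with the default of 0 the scheduler re-parses every DAG file
# on every loop, which is what burns the CPU.
print(conf.getint("scheduler", "min_file_process_interval"))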
2. Fix the Airflow 1.10 scheduler bug
The AIRFLOW-2895 bug in Airflow 1.10 causes high CPU load, because the scheduler keeps looping without a break.
It's already fixed in master and will hopefully be included in Airflow 1.10.1, but it could take weeks or months until it's released. In the meantime this patch solves the issue:
--- jobs.py.orig    2018-09-08 15:55:03.448834310 +0000
+++ jobs.py 2018-09-08 15:57:02.847751035 +0000
@@ -564,6 +564,7 @@
         self.num_runs = num_runs
         self.run_duration = run_duration
+        self._processor_poll_interval = 1.0
         self.do_pickle = do_pickle
         super(SchedulerJob, self).__init__(*args, **kwargs)
@@ -1724,6 +1725,8 @@
             loop_end_time = time.time()
             self.log.debug("Ran scheduling loop in %.2f seconds",
                            loop_end_time - loop_start_time)
+            self.log.debug("Sleeping for %.2f seconds", self._processor_poll_interval)
+            time.sleep(self._processor_poll_interval)
             # Exit early for a test mode
             if processor_manager.max_runs_reached():
Apply it with patch -d /usr/local/lib/python3.6/site-packages/airflow/ < af_1.10_high_cpu.patch;
3. RBAC webserver high CPU load
If you upgraded to use the new RBAC webserver UI, you may also notice that the webserver is using a lot of CPU persistently.
For some reason the RBAC interface uses a lot of CPU on startup. If you are running on a low powered server, this can cause a very slow webserver startup and permanently high CPU usage.
I have documented this bug as AIRFLOW-3037. To solve it you can adjust the config:
AIRFLOW__WEBSERVER__WORKERS=2 # 2 * NUM_CPU_CORES + 1
AIRFLOW__WEBSERVER__WORKER_REFRESH_INTERVAL=1800 # Restart workers every 30 min instead of every 30 seconds
AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT=300 # Kill workers if they don't start within 5 min instead of 2 min
With all of these tweaks my Airflow is using only a few percent of CPU during idle time on a DigitalOcean standard droplet with 1 vCPU and 2 GB of RAM.
I just ran into an issue like this. Airflow was consuming roughly a full vCPU in a t2.xlarge instance, with the vast majority of this coming from the scheduler container. Checking the scheduler logs, I could see that it was processing my single DAG more than once a second even though it only runs once a day.
I found that the MIN_FILE_PROCESS_INTERVAL was set to the default value of 0, so the scheduler was looping over the DAG. I changed the process interval to 65 seconds, and Airflow now uses less than 10 percent of a vCPU in a t2.medium instance.
Try changing the config below in airflow.cfg:
# after how much time a new DAGs should be picked up from the filesystem
min_file_process_interval = 0
# How many seconds to wait between file-parsing loops to prevent the logs from being spammed.
min_file_parsing_loop_time = 1
The key point is HOW the DAG files are processed.
To reduce the scheduler's CPU usage from 80%+ to 30% on an 8-core server, I updated 2 config keys:
min_file_process_interval from 0 to 60.
max_threads from 1000 to 50.
For starters, you can use htop to monitor and debug your CPU usage.
I would suggest that you run the webserver and scheduler processes in the same Docker container, which would reduce the resources required to run two containers on an EC2 t2.medium. Airflow workers need resources for downloading data and reading it into memory, but the webserver and scheduler are pretty lightweight processes. Make sure that when you run the webserver you control the number of workers running on the instance using the CLI.
airflow webserver [-h] [-p PORT] [-w WORKERS]
[-k {sync,eventlet,gevent,tornado}]
[-t WORKER_TIMEOUT] [-hn HOSTNAME] [--pid [PID]] [-D]
[--stdout STDOUT] [--stderr STDERR]
[-A ACCESS_LOGFILE] [-E ERROR_LOGFILE] [-l LOG_FILE]
[-d]
I faced the same issue deploying Airflow on EKS. It was resolved by updating max_threads to 128 in the Airflow config.
max_threads: the scheduler will spawn multiple threads in parallel to schedule DAGs. This is controlled by max_threads, with a default value of 2. The user should increase this value to a larger value (e.g. the number of CPUs where the scheduler runs - 1) in production.
From here
https://airflow.apache.org/docs/stable/faq.html
I tried to run Airflow on an AWS t2.micro instance (1 vCPU, 1 GB of memory, eligible for the free tier), and had the same issue: the worker consumed 100% of the CPU and all available memory.
The EC2 instance was totally stuck and unusable, and of course Airflow wasn't working.
So I created a 4 GB swap file using the method described here. With the swap, no more issues; Airflow was fully functional.
Of course, with only one vCPU, you cannot expect incredible performance, but it runs.

How can I measure hardware specs for Continuous Integration server (Jenkins)?

Jenkins' webpage doesn't really say anything about server specifications. The matter is that I have to ask the systems team for a CI server, and I have to specify those requirements and of course justify them.
I have to decide the following things:
Hard disk capacity for the whole server, considering the OS. This spec is considered the most critical one by the hardware providers.
RAM.
Number of cores.
And these are the things to take into account:
The OS they'll provide me will probably be Ubuntu Server.
I'm not going to run more than 1 build simultaneously in 99.9% of cases.
I'm going to work with Moodle, so the source code is quite large (the whole repo is about 700 MB).
Regarding my experience with Jenkins and Linux, I would recommend the following configuration:
A CentOS machine or VM (Ubuntu Server is OK too)
Minimum 2 CPU
Minimum 2 GB of RAM
A 30 GB partition for the OS
Another partition for Jenkins (like /jenkins)
Regarding the partition size for Jenkins, it depends on the number of jobs (and the size of their workspaces).
My Jenkins partition is 100 GB (I have around 100 jobs and some large Git repos to clone).
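As a rough way to justify a number to the systems team, here is a back-of-the-envelope sizing sketch; every value in it is an assumption you would swap for your own:

# Back-of-the-envelope sizing for the Jenkins partition; all numbers are assumptions.
jobs = 100                 # number of jobs, as in the setup described above
avg_workspace_gb = 0.7     # e.g. one Moodle-sized checkout (~700 MB) per workspace
history_gb = 20            # build logs, archived artifacts, fingerprints
headroom = 1.5             # keep ~50% free for growth and temporary files

partition_gb = (jobs * avg_workspace_gb + history_gb) * headroom
print(f"~{partition_gb:.0f} GB")   # ~135 GB, the same order as the 100 GB above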

Spark: what's the advantages of having multiple executors per node for a Job?

I am running my job on an AWS EMR cluster. It is a 40-node cluster using cr1.8xlarge instances. Each cr1.8xlarge has 240 GB of memory and 32 cores. I can run with the following config:
--driver-memory 180g --driver-cores 26 --executor-memory 180g --executor-cores 26 --num-executors 40 --conf spark.default.parallelism=4000
or
--driver-memory 180g --driver-cores 26 --executor-memory 90g --executor-cores 13 --num-executors 80 --conf spark.default.parallelism=4000
From the job-tracker website, the number of tasks running simultaneously seems to be mainly just the number of cores (CPUs) available. So I am wondering: are there any advantages, or specific scenarios, where we would want more than one executor per node?
Thanks!
Yes, there are advantages of running multiple executors per node - especially on large instances like yours. I recommend that you read this blog post from Cloudera.
A snippet of the post that would be of particular interest to you:
To hopefully make all of this a little more concrete, here’s a worked example of configuring a Spark app to use as much of the cluster as possible: Imagine a cluster with six nodes running NodeManagers, each equipped with 16 cores and 64GB of memory. The NodeManager capacities, yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, should probably be set to 63 * 1024 = 64512 (megabytes) and 15 respectively. We avoid allocating 100% of the resources to YARN containers because the node needs some resources to run the OS and Hadoop daemons. In this case, we leave a gigabyte and a core for these system processes. Cloudera Manager helps by accounting for these and configuring these YARN properties automatically.
The likely first impulse would be to use --num-executors 6 --executor-cores 15 --executor-memory 63G. However, this is the wrong approach because:
63GB + the executor memory overhead won’t fit within the 63GB capacity of the NodeManagers.
The application master will take up a core on one of the nodes, meaning that there won’t be room for a 15-core executor on that node.
15 cores per executor can lead to bad HDFS I/O throughput.
A better option would be to use --num-executors 17 --executor-cores 5 --executor-memory 19G. Why?
This config results in three executors on all nodes except for the one with the AM, which will have two executors.
--executor-memory was derived as 63 GB / 3 executors per node = 21 GB; 21 * 0.07 = 1.47; 21 - 1.47 ≈ 19.
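Written out as a small worked calculation (the 0.07 factor is the memory-overhead fraction assumed in the quoted post; newer Spark versions default spark.yarn.executor.memoryOverhead to 10%):

# Executor sizing from the quoted Cloudera example: 6 nodes, 16 cores / 64 GB each.
usable_memory_gb = 63      # yarn.nodemanager.resource.memory-mb, 1 GB left for OS/daemons
executors_per_node = 3     # 15 usable cores / 5 cores per executor
overhead_fraction = 0.07   # assumed spark.yarn.executor.memoryOverhead fraction

per_executor_gb = usable_memory_gb / executors_per_node   # 21.0
overhead_gb = per_executor_gb * overhead_fraction         # ~1.47
print(per_executor_gb - overhead_gb)                      # ~19.5 -> --executor-memory 19G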

GNU Parallel: set remote server run on 1 job with all CPU

I have jobs (multiprocessing Python code) that ideally take 4 CPUs each to run on a remote machine. In GNU Parallel, how do I set up the arguments so that each remote server (assuming 4 cores) runs one job at a time, using all 4 of its cores for that job (instead of using its 4 cores to run 4 jobs, as it does by default)?
Run (25% of 4 cores =) 1 job at a time on each server, passing 4 arguments to each job:
-j 25% -N4
For example (host names and script are placeholders): parallel -S server1,server2 -j 25% -N4 python mycode.py ::: arg1 arg2 arg3 arg4 arg5 arg6 arg7 arg8
