Jenkins: 2 master nodes using NFS

I'm thinking about the following high-availability solution for my environment:
A datacenter with one powered-on Jenkins master node.
A disaster-recovery datacenter with one powered-off Jenkins master node.
Datacenter one is always powered on; the second is only for disasters. My idea is to install the two Jenkins masters using the same IP but with a shared NFS. If the first one goes down, the second starts with the same IP and my service keeps running.
My question is: can this solution work?
Thanks everyone for the help ;)

I don't see any reason why it should not work. But you still have to monitor the switch-over, because I have faced a situation where jobs that were running when Jenkins shut down abruptly were still in the queue after the service was recovered, but they never completed afterwards; I had to delete those builds manually using the script console.
On the Jenkins forums a lot of people have reported such bugs. Most of them seem to have been fixed, but there are still cases where this can happen, because every time Jenkins is started or restarted the configuration is reloaded from disk. So there can be inconsistency between the in-memory configuration that existed earlier and the reloaded configuration.
So in your case it might happen that an executor thread is still blocked when the service is recovered. You have to make sure that everything is running fine after recovery.
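The manual cleanup above was done in the script console; as a rough, hypothetical sketch of automating that post-failover check, the Jenkins queue REST API can be polled for items flagged as stuck and those items cancelled. The URL and credentials below are placeholders, and the `requests` package is assumed.

```python
# Hypothetical post-failover cleanup: list queue items Jenkins flags as "stuck"
# and cancel them via the REST API (an alternative to doing it by hand in the
# script console). JENKINS_URL and AUTH are placeholders.
import requests

JENKINS_URL = "https://jenkins.example.com"   # placeholder
AUTH = ("admin", "api-token")                 # placeholder user + API token

queue = requests.get(f"{JENKINS_URL}/queue/api/json", auth=AUTH).json()

for item in queue.get("items", []):
    # Jenkins sets "stuck" on queue items it thinks cannot proceed and
    # includes a human-readable "why".
    if item.get("stuck"):
        print(f"Cancelling stuck queue item {item['id']}: {item.get('why', '')}")
        requests.post(f"{JENKINS_URL}/queue/cancelItem",
                      params={"id": item["id"]}, auth=AUTH)
```

Running something like this (or the equivalent script-console check) as part of the failover procedure covers the "make sure everything is running fine after recovery" step.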

Related

What is the state of the art running Jenkins - dedicated server or containers?

I've been running Jenkins in a container for about 6 months, with only one controller/master and no additional nodes, because it's not needed in my case, I think. It works OK. However, I find it a hassle to make changes, not because I'm afraid it will crash, but because it takes a long time to build the image (15+ min), installing SDKs etc. (1.3 GB).
My question is: what is the state of the art for running Jenkins? Would it be better to move Jenkins to a dedicated server (VM) with a webserver (reverse proxy)?
What are the advantages of running Jenkins in a Docker container?
Is 15 mins a long time because you make a change, build, find out something is wrong and need to make another change?
I would look at how you are building the container: install all the SDKs in the early layers so that rebuilds can take those layers from cache, and move the layers that change most often as late as possible, so that as little of the image as possible needs rebuilding.
It is then worth looking at image layer caches if you clean your build environment regularly (I use Artifactory)
Generally, I would advocate doing as little building as possible on the controller and shipping it out to agents that are capable of running Docker.
This way you don't need to install loads of SDKs on the controller or change it that often, as you do all of that in containers as and when you need them.
I use the EC2 cloud plugin to dynamically spin up agents as and when they are needed. But you could have a static pool of agents if you are not working in a cloud provider.

Jenkins 2 clusters, agent connection issue on 2nd cluster

I'm sitting with a new issue that you might also face soon. I need a little help if possible. I've spent almost 2 working weeks on this.
I have 2 possible solutions for my problem.
CONTEXT
I have 2 kubernetes clusters called FS and TC.
The Jenkins I am using runs on TC.
The slaves do deploy in FS from the TC Jenkins; however, the slaves in FS will not connect to the Jenkins master in TC.
The slaves use a TCP connection that requires a HOST and PORT. However, the exposed JNLP service on TC is HTTP (http://jenkins-jnlp.tc.com/), which uses nginx to auto-generate the URL.
Even if I use
HOST: jenkins-jnlp.tc.com
PORT: 80
It will still complain that it's getting serial data instead of binary data.
For TC I made use of the local jnlp service HOST (jenkins-jnlp.svc.cluster.local) with PORT (50000). This works well for our current TC environment.
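To make the mismatch concrete, here is a small probe of my own (not from the post; hostnames are taken from the context above) that shows what each HOST/PORT pair actually speaks: the nginx-fronted endpoint answers HTTP, while the in-cluster service on port 50000 speaks the raw agent protocol, which is why the agent complains about serial versus binary data when pointed at port 80.

```python
# A quick probe to see what a HOST/PORT pair speaks. An nginx-fronted HTTP
# endpoint answers "HTTP/1.1 ...", while the Jenkins TCP agent listener does
# not, which is why pointing an inbound agent at port 80 fails.
import socket

def probe(host: str, port: int) -> None:
    with socket.create_connection((host, port), timeout=5) as sock:
        # Send a plain HTTP request and show the first bytes of the reply.
        sock.sendall(b"GET / HTTP/1.0\r\nHost: " + host.encode() + b"\r\n\r\n")
        print(f"{host}:{port} ->", sock.recv(64))

probe("jenkins-jnlp.tc.com", 80)                # answers like a web server
probe("jenkins-jnlp.svc.cluster.local", 50000)  # only resolvable inside the TC cluster
```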
SOLUTIONS
Solution #1
A possible solution would involve running an HTTP-to-TCP relay container between the slave and master on FS. It would be linked up to the HTTP URL in TC (http://jenkins-jnlp.tc.com/), encapsulating the TCP connection (localhost:50000) in HTTP and vice versa.
The slaves on FS can then connect to the TC master using that TCP port being exposed from that container in the middle.
Diagram to understand better
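For illustration only, here is a minimal asyncio sketch of the bidirectional forwarding loop such a relay container needs. The upstream leg is shown as plain TCP for simplicity; the actual relay described above would wrap that leg in HTTP toward http://jenkins-jnlp.tc.com/. All hostnames and ports are placeholders.

```python
# Minimal sketch of a relay sidecar's forwarding core: accept the agent's local
# TCP connection and pump bytes both ways to an upstream. The real relay would
# replace the plain-TCP upstream with an HTTP-encapsulated one.
import asyncio

UPSTREAM_HOST = "jenkins-jnlp.tc.com"   # placeholder
UPSTREAM_PORT = 50000                   # placeholder
LISTEN_PORT = 50000                     # what the local agent connects to

async def pump(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy bytes in one direction until EOF, then close the write side."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()

async def handle_agent(reader, writer):
    # One inbound agent connection maps to one upstream connection.
    up_reader, up_writer = await asyncio.open_connection(UPSTREAM_HOST, UPSTREAM_PORT)
    await asyncio.gather(pump(reader, up_writer), pump(up_reader, writer))

async def main():
    server = await asyncio.start_server(handle_agent, "0.0.0.0", LISTEN_PORT)
    async with server:
        await server.serve_forever()

asyncio.run(main())
```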
Solution #2
People kept complaining, and eventually new functionality was added to Jenkins around 20 Feb 2020: WebSocket support, which lets the agent connection run over HTTP and be converted to TCP on the slave.
I did set it up, but it seems too new and is not working for me: even though the slave on FS says it's connected, it's still not properly communicating with the Jenkins master on TC, which still sees the agent/slave pod as offline.
Here are the links I used
Original post
Update note on Jenkins
Details on Jenkins WebSocket
Jenkins inbound-agent github
DockerHub jenkins-inbound-agent
CONCLUSION
After a lot of fiddling, research and banging my head on the wall, I think the only solution is solution #1. The problem with solution #1 is that a simple tool or service to encapsulate HTTP to TCP and back does not exist (that I know of; I searched for days). This means I'll have to make one myself.
Solution #2 is still too new, with next to no docs to help me out or make setting it up easy, and it seems to come with some bugs. It seems the only way to fix these bugs would be to modify both Jenkins and the JNLP agent's code, and I have no idea where to even start.
UPDATE #1
I'm halfway done with the code for the intermediate container. I can now get a downstream connection from HTTP to TCP; I just have to set up the upstream TCP-to-HTTP leg.
Also, considering the amount of multi-threading required to run a single central Docker container converting the protocols, I figured on adding the HTTP-to-TCP container as a sidecar to the Jenkins agent when I'm done.
This way, every time a slave spins up in a different cluster, it will automatically be able to connect and I don't have to worry about multiple connections. That is the theory, but obviously I want results, and so do you.

HTTP 503 errors from Cloud Run app in one GCP project but not the other

The issue
I am using the same container (with similar resources) in 2 projects: production and staging. Both have custom domains set up with Cloudflare DNS and are in the same region. The container build is done in a completely different project, and IAM is used to handle access to these containers. Both projects use a concurrency of 80 and a 300-second timeout for all 5 services.
All was working well 3 days back, but since yesterday almost all Cloud Run services on staging (thankfully) started throwing 503 randomly and for most requests. Some services had not even been deployed for a week. The same containers are running fine in the production project, no issues.
Ruled out causes
anything to do with Cloudflare (I tried the URL Cloud Run gives; it has the 503 issue too)
anything with the build or containers (I tried the demo hello-world container with Go; it has the issue too)
Resources: I tried giving it 1 GB RAM and 2 CPUs, but the problem persisted
issues with deployment (deploying multiple branches didn't help)
issues in the code (I routed traffic to a 2-3 day old revision, but the issue was still there)
issues at the service level (I used the same container to create a completely new service; it also had the issue)
Possible causes
something on cloud run or cloud run load balancer
maybe some env vars, but that also doesn't seem to be the issue
Response Codes
I just ran a quick check with vegeta (30 seconds at 10 rps) against the same container on staging and production for a static file path; the responses are below:
Staging: (output omitted)
Production: (output omitted)
If anyone has any insights on this it would help greatly.
Based on your explanation, I cannot understand what's going on. You explained what doesn't work but didn't point out what works (does your app run locally? are you able to run a hello world sample application?)
So I'll recommend some debugging tips.
If you're getting an HTTP 5xx status code, first check your application's logs. Is it printing ANY logs? Are there logs for the requests? Was your application deployed with a "verbose" logging setting?
Try hitting your *.run.app domain directly (a quick comparison like the sketch below can help). If it's not working, then it's not a domain, DNS or Cloudflare issue; try debugging and/or redeploying your app, and deploy something that works first. If the *.run.app domain works, then the issue is not in Cloud Run.
Make sure you aren't using Cloudflare in proxy mode (i.e. your DNS should point to Cloud Run, not to Cloudflare), as there's currently a known issue with certificate issuance/renewals when domains are behind Cloudflare.
Beyond these, if a redeploy seems to solve your problem, try redeploying. It could very well be that some configuration recently became different between the two projects.
See Cloud Run Troubleshooting
https://cloud.google.com/run/docs/troubleshooting
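Following the "hit your *.run.app domain directly" tip, here is a rough sketch (not from the answer; both URLs and the path are placeholders) that tallies response codes from the *.run.app URL and the custom domain, which quickly shows whether the 503s come from Cloud Run itself or from the DNS/Cloudflare layer.

```python
# Compare status codes from the *.run.app URL and the custom domain.
# Both URLs are placeholders; only the standard library is used.
from collections import Counter
from urllib import request, error

URLS = {
    "run.app": "https://my-service-abc123-uc.a.run.app/",  # placeholder
    "custom":  "https://staging.example.com/",             # placeholder
}

def status(url: str) -> int:
    try:
        with request.urlopen(url, timeout=10) as resp:
            return resp.status
    except error.HTTPError as exc:   # 4xx/5xx replies still carry a status code
        return exc.code

for name, url in URLS.items():
    counts = Counter(status(url) for _ in range(50))
    print(name, dict(counts))   # e.g. {200: 41, 503: 9}
```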
Do you see 503 errors under high load?
The Cloud Run (fully managed) load balancer strives to distribute incoming requests over the necessary amount of container instances. However, if your container instances are using a lot of CPU to process requests, the container instances will not be able to process all of the requests, and some requests will be returned with a 503 error code.
To mitigate this, try lowering the concurrency. Start from concurrency = 1 and gradually increase it to find an acceptable value. Refer to Setting concurrency for more details.

Port allocation when running build job in Jenkins

My project is structured in such a way that the build job in Jenkins is triggered by a push to Git. As part of my application logic, I spin up Kafka and Elasticsearch instances to be used by my test cases downstream.
The issue I have right now is that when a developer pushes changes to Git, it triggers a build in Jenkins, which in turn runs our code and spawns a Kafka broker on localhost:9092 and Elasticsearch on localhost:9200.
When another developer, working on some other change at the same time, pushes their code, it triggers the build job again, which tries to spin up another instance of Kafka/Elasticsearch and fails with the exception "Port already in use".
I am looking at options on how to handle this scenario.
Will running these instances inside Docker containers help to some extent? How do I handle the port issue in that case?
Yes, dockerizing these instances can indeed help, as you can spawn them multiple times.
You could create a Docker container per component, including your application, and then let them talk to each other by linking them or using docker-compose.
That way you would not have to expose the ports to the "outside" world but keep them internal within the Docker environment.
That way you would not get the "Port already in use" error. The only problem in that case is memory: if, for example, 100 pushes are made to the Git repo, you might run out of memory...
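As a rough sketch of that approach (the image tag is a placeholder and the docker SDK for Python is assumed to be installed), each build can start its own Elasticsearch container and let Docker pick a free host port, so concurrent builds never collide on localhost:9200. Kafka can be handled the same way, though it needs extra environment configuration (advertised listeners) to be reachable on a random port.

```python
# Per-build Elasticsearch container on a random host port, using the docker SDK
# for Python. The image tag is a placeholder; adjust to whatever your tests use.
import docker

client = docker.from_env()

es = client.containers.run(
    "docker.elastic.co/elasticsearch/elasticsearch:7.17.9",  # placeholder tag
    detach=True,
    environment={"discovery.type": "single-node"},
    ports={"9200/tcp": None},   # None = let Docker pick a free host port
)

es.reload()   # refresh attributes so the published port mapping is visible
host_port = es.ports["9200/tcp"][0]["HostPort"]
print(f"Elasticsearch for this build is on localhost:{host_port}")

# ... run the tests against that port, then clean up:
es.remove(force=True)
```

The memory concern above still applies: each concurrent build pays for its own containers, so the agent pool needs to be sized accordingly.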

Tomcat service is going down

We are running test plans on Jenkins, and at times the Tomcat service goes down; no errors are seen in the logs. This occurs very rarely, but what could be the reason?
This was because of low memory assigned to the JVM; by increasing the memory we were able to run jobs peacefully without the service dropping.
