Tomcat service is going down
We are running test plans on Jenkins, and at times the Tomcat service goes down with no errors in the logs. This happens very rarely, but what could be the reason?
This was caused by too little memory being assigned to the JVM. After increasing the memory, we were able to run the jobs without the service going down.
The issue
I am using the same container (with similar resources) in two projects: production and staging. Both have custom domains set up with Cloudflare DNS and are in the same region. The container build is done in a completely different project, and IAM is used to handle access to these containers. In both projects, all 5 services have a concurrency of 80 and a timeout of 300 seconds.
Everything was working fine 3 days ago, but since yesterday almost all Cloud Run services on staging (thankfully) started returning 503 randomly and for most requests. Some of the services had not even been deployed for a week. The same containers are running fine in the production project, with no issues.
Ruled out causes
anything to do with Cloudflare (I tried the URL Cloud Run provides, and it returns 503 too)
anything to do with the build or the containers (I tried the demo hello world container in Go, and it has the issue too)
resources (I tried giving it 1 GB RAM and 2 CPUs, but the problem persisted)
issues with deployment (I deployed multiple branches, which didn't help)
an issue in the code (I routed traffic to an old revision from 2-3 days ago, but the issue was still there)
an issue at the service level (I used the same container to create a completely new service, and it had the issue too)
Possible causes
something in Cloud Run or the Cloud Run load balancer
maybe some env vars, but that also doesn't seem to be the issue
Response Codes
I just ran a quick check with vegeta (30 seconds at 10 rps) against the same container on staging and production for a static file path; the responses are below:
Staging
Production
If anyone has any insights on this it would help greatly.
Based on your explanation, I cannot understand what's going on. You explained what doesn't work but didn't point out what works (does your app run locally? are you able to run a hello world sample application?)
So I'll recommend some debugging tips.
If you're getting an HTTP 5xx status code, first check your application's logs. Is it printing ANY logs? Are there logs of a request? Does your application have a "verbose" logging setting, and was it deployed with that setting enabled?
Try hitting your *.run.app domain directly. If it's not working, then it's not a domain, DNS, or Cloudflare issue. Try debugging and/or redeploying your app. Deploy something that works first. If the *.run.app domain works, then the issue is not in Cloud Run.
Make sure you aren't using Cloudflare in proxy mode (i.e. your DNS should point to Cloud Run, not Cloudflare), as there is currently a known issue with certificate issuance/renewal when domains are behind Cloudflare.
Beyond these, if a redeploy seems to solve your problem, try redeploying. It is quite likely that some configuration recently diverged between the two projects.
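To compare the *.run.app URL against the custom domain, it can help to count the status codes each one returns over a handful of requests. The following is only a minimal sketch using the Python standard library; the function name tally_status_codes and the URL in the usage comment are hypothetical placeholders, not anything from the setup described above.

```python
import urllib.error
import urllib.request
from collections import Counter


def tally_status_codes(url, attempts=20, timeout=10.0):
    """Request `url` `attempts` times and count the outcomes.

    Keys are HTTP status codes (ints), or "error: <ExceptionName>" strings
    for failures that never produced an HTTP response (DNS, timeout, ...).
    """
    counts = Counter()
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                counts[resp.status] += 1
        except urllib.error.HTTPError as err:
            # 4xx/5xx responses are raised as HTTPError; record the code.
            counts[err.code] += 1
        except OSError as err:
            # URLError, socket timeouts and connection errors all derive
            # from OSError.
            counts["error: " + type(err).__name__] += 1
    return counts


# Usage (placeholder URL, replace with your own service):
#   print(tally_status_codes("https://YOUR-SERVICE.run.app/", attempts=50))
```

Running it once against the *.run.app URL and once against the custom domain gives a rough side-by-side view of where the 503s appear. Requests are sent one at a time, so this is not a load test.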
See Cloud Run Troubleshooting
https://cloud.google.com/run/docs/troubleshooting
Do you see 503 errors under high load?
The Cloud Run (fully managed) load balancer strives to distribute incoming requests over the necessary amount of container instances. However, if your container instances are using a lot of CPU to process requests, the container instances will not be able to process all of the requests, and some requests will be returned with a 503 error code.
To mitigate this, try lowering the concurrency. Start from concurrency = 1 and gradually increase it to find an acceptable value. Refer to Setting concurrency for more details.
I have created a number of Spring Boot applications, which all work like magic in isolation or when started up one after the other manually.
My challenge is that I want to deploy a stack with all the services in a Docker Swarm.
Initially I didn't understand what was going on, as it seemed like all my containers were hanging.
It turns out that starting a single Spring Boot application maxes out my CPU utilization for a good while (startup takes 20s+).
Now the issue is that Docker Swarm launches 10 of these containers simultaneously, my load average goes above 80, and the system grinds to a halt. The container HEALTHCHECKs start timing out, and eventually Docker restarts the containers. This is an endless cycle that may or may not stabilize, and if it does stabilize, it takes a minimum of 30 minutes. So much for microservices vs big fat Java EE applications :(
Is there any way to convince Docker to roll out the containers one by one? I'm sure this would help a lot.
There is a rolling update parameter - https://docs.docker.com/engine/swarm/swarm-tutorial/rolling-update/ - but it does not seem applicable to the initial deployment.
Your help will be greatly appreciated.
I've also tried systemd (which isn't ideal for distributed microservices). It worked slightly better than Docker, but it has the same issue when deploying all the applications at once.
Initially I wanted to try Kubernetes, but I've got enough on my plate and if I can get away with Docker Swarm, that would be awesome.
Thanks!
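One possible way to approximate the one-by-one rollout asked about above is to create every service with zero replicas and then scale the services up one at a time from a small script. The sketch below is an untested illustration, not a built-in Swarm feature: it assumes the stack was deployed with replicas set to 0, it relies only on the docker service scale command, and the function name, service names, and pause length are hypothetical.

```python
import subprocess
import time


def staggered_scale(services, replicas=1, pause_seconds=30,
                    run=subprocess.run, sleep=time.sleep):
    """Scale each service to `replicas`, one at a time, pausing in between.

    `run` and `sleep` are injectable so the function can be dry-run
    without a Docker daemon. Returns the list of commands that were issued.
    """
    issued = []
    for index, name in enumerate(services):
        cmd = ["docker", "service", "scale", f"{name}={replicas}"]
        run(cmd, check=True)  # raises CalledProcessError if docker fails
        issued.append(cmd)
        if index < len(services) - 1:
            # Give the JVM time to finish starting before the next service.
            sleep(pause_seconds)
    return issued


# Usage (hypothetical service names from a stack called "mystack"):
#   staggered_scale(["mystack_orders", "mystack_billing"], pause_seconds=45)
```

A fixed pause is the crudest possible pacing; waiting until each service reports healthy before starting the next would be more robust, at the cost of more code.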
I have a universal React app hosted in a Docker container in a minikube (Kubernetes) dev environment. I use VirtualBox, and I actually have more microservices on this VM.
In this react app, I use pm2 to restart my app on changes to server code, and webpack hmr to hot-reload client code on changes to client code.
Every 15-45 seconds or so, pm2 logs the message below, indicating that the app exited due to a SIGKILL.
App [development] with id [0] and pid [299], exited with code [0] via signal [SIGKILL]
I can't for the life of me figure out why it is happening. It is relatively frequent, but not so frequent that it happens every second. It's quite annoying because each time it happens, my webpack bundle has to recompile.
What are some reasons why pm2 might receive a SIGKILL in this type of dev environment? Also, what are some possible ways of debugging this?
I noticed that my services that use pm2 to restart on server changes do NOT have this problem when they are just backend services. I.e. when they don't have webpack. In addition, I don't see these SIGKILL problems in my prod version of the app. That suggests to me there is some problem with the combination of webpack hmr setup, pm2, and minikube / docker.
I've tried the app locally (not in docker /minikube) and it works fine without any sigkills, so it can't be webpack hmr on its own. Does kubernetes kill services that use a lot of memory? (Maybe it thinks my app is using a lot of memory). If that's not the case, what might be some reasons kubernetes or docker send SIGKILL? Is there any way to debug this?
Any guidance is greatly appreciated. Thanks
I can't quite tell from the error message you posted, but usually this is a result of the kernel OOM Killer (Out of Memory Killer) taking out your process. This can be either because your process is just using up too much memory, or you have a cgroup setting on your container that is overly aggressive and causing it to get killed. You may also have under-allocated memory to your VirtualBox instance.
Normally you'll see Docker reporting that the container exited with code 137 in the output of docker ps -a.
dmesg or your syslogs on the node in question may show the kernel OOM killer output.
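The 137 above follows the usual convention that an exit code greater than 128 means the process was terminated by signal number (code - 128), so 137 corresponds to signal 9, SIGKILL. A small sketch of that arithmetic (the helper name describe_exit_code is made up for illustration, and signal names are resolved for the platform the script runs on):

```python
import signal


def describe_exit_code(code):
    """Describe an exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with status {code}"


# Usage:
#   print(describe_exit_code(137))
```

A SIGKILL on its own does not prove the OOM killer was responsible, which is why checking dmesg or the syslogs is still worthwhile.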
I'm thinking about the following high availability solution for my environment:
A primary datacenter with one powered-on Jenkins master node.
A disaster-recovery datacenter with one powered-off Jenkins master node.
The first datacenter is always powered on; the second is only for disasters. My idea is to install both Jenkins instances with the same IP and a shared NFS. If the first one fails, the second starts with the same IP and my service stays available.
My question is: can this solution work?
Thanks all for the help ;)
I don't see any reason why it should not work. But you still have to monitor things after a switch-over, because I have faced a situation where jobs that were running when Jenkins shut down abruptly were still in the queue when the service recovered, but never completed afterwards. I had to delete the builds manually using the script console.
On the Jenkins forum a lot of people have reported such bugs. Most of them seem to have been fixed, but there are still cases where this might happen, because every time Jenkins is started or restarted the configuration is reloaded from disk. So at times there is an inconsistency between the in-memory configuration that existed earlier and the reloaded configuration.
So in your case, it might happen that an executor thread is still blocked when the service recovers. You therefore have to make sure that everything is running fine after recovery.
We're using Docker in Swarm mode to host a number of services. Recently we've hit an issue where we get connection timeouts intermittently (sometimes as much as every other request) when trying to access some services.
We've upgraded the environment to the latest version of Docker (currently Docker version 17.03.0-ce, build 3a232c8), done a staggered reboot of all servers (trying to maintain uptime if possible even though this environment is technically a test environment) and tried stopping / starting services as well, but the issue still persists.
I'm confident the issue is not related to the services running in Docker, as we're seeing it on various services that had been running without issue until recently. I think it's more likely an environmental issue, or some problem with Docker's internal routing in the overlay network, but I'm not sure how to prove or solve this.
Any advice on how to diagnose or solve this would be greatly appreciated!