Amazon Web Services 503 error - http-status-code-503

We have been troubleshooting a 503 error stating that the servers are at capacity.
We have removed the potential problems after monitoring the timestamp of the occurrence. Everything tests healthy in our dashboard. We stopped and restarted instances, and also created new instances to see if that would help. It is not a big website, so we are nowhere near capacity.
On to the question: has anybody run into this problem, and how did you go about troubleshooting beyond what we have already done?
Thanks,

Found the answer. When we created a new AMI in AWS for our new instance, we already had an AMI with the same image and no code deployment. The two AMIs cancelled each other out, and our load balancer just stopped and didn't serve up anything else.
We had to terminate the AMI, stop and restart the app instances, and, most importantly, go to code deployment and redeploy the code.
Hope this helps somebody else down the road.
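For anyone who wants to check this from code rather than the dashboard, it can be worth confirming what the load balancer itself thinks of its targets. The sketch below assumes an Application Load Balancer with a target group (an assumption, since the original setup is not described) and summarizes a response shaped like the one boto3's `elbv2` client returns from `describe_target_health`. The sample data and instance IDs are made up for illustration.

```python
from collections import Counter


def summarize_target_health(response):
    """Count targets per health state in a describe_target_health response."""
    states = Counter(
        description["TargetHealth"]["State"]
        for description in response.get("TargetHealthDescriptions", [])
    )
    return dict(states)


def has_healthy_target(response):
    """Return True if at least one registered target reports 'healthy'."""
    return summarize_target_health(response).get("healthy", 0) > 0


# Hypothetical sample shaped like the real response. With boto3 installed
# and credentials configured, the real one would come from:
#   boto3.client("elbv2").describe_target_health(TargetGroupArn=...)
sample = {
    "TargetHealthDescriptions": [
        {"Target": {"Id": "i-aaaa", "Port": 80},
         "TargetHealth": {"State": "healthy"}},
        {"Target": {"Id": "i-bbbb", "Port": 80},
         "TargetHealth": {"State": "unhealthy"}},
    ]
}

if __name__ == "__main__":
    print(summarize_target_health(sample))
    print(has_healthy_target(sample))
```

An empty target list also makes `has_healthy_target` return `False`, which covers the case where nothing is registered with the load balancer at all.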

Related

Application slow down due to zombie process?

We face application downtime while uploading files to Azure storage via Arc.
There is no specific code error, but we hit a timeout.
It gets resolved once the Azure web app is restarted.
It has happened intermittently.
Since we could not find the root cause, we asked whether there is an issue on the Azure side.
The Microsoft team says the system health is OK but points towards accumulated zombie processes: EPMD and inet_gethost.
On searching, I understand that these are created by the Erlang runtime.
Please let me know if we have some process to kill these zombie processes from time to time?
Also, do they contribute to the application downtime?
Thanks
"Please let me know if we have some process to kill these zombie processes from time to time?"
If you're running a sensible init process, these zombie processes should be correctly reaped. This can often be a problem if you run Erlang as the top-level process inside a container, for instance. Can you give more detail of your environment?
"Also, do they really contribute to the application downtime?"
Depends on how many of them there are, but probably not, no.
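To see how many zombies there actually are, something like the following sketch could help. It assumes a Linux environment with `/proc` mounted and reads the process state from `/proc/<pid>/stat`, where `Z` marks a zombie. The function names are illustrative. Note that a zombie cannot be killed as such; it disappears only once its parent (or init) reaps it.

```python
import os


def parse_stat(stat_text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the last ')' rather than on whitespace.
    head, _, tail = stat_text.rpartition(")")
    pid_text, _, command = head.partition("(")
    state = tail.split()[0]
    return int(pid_text), command, state


def list_zombies(proc_root="/proc"):
    """Return a list of (pid, command) for processes in state 'Z'."""
    if not os.path.isdir(proc_root):
        return []  # not Linux, or /proc is not mounted
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_stat(handle.read())
        except OSError:
            continue  # the process exited while we were scanning
        if state == "Z":
            zombies.append((pid, command))
    return zombies


if __name__ == "__main__":
    for pid, command in list_zombies():
        print(pid, command)
```

If the list keeps growing with `epmd` or `inet_gethost` entries, that points at the parent process not reaping its children, which is the situation described above.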

Production website becomes unresponsive on certain pages

I have a weird issue that just started popping up for our customers. The portal they've been using for years has started freezing on some of the pages that the user navigates to. I tried restarting the IIS server, the site within it, and the Application Pool under which the site is running. No difference.
In Chrome Dev Tools I can see that it is always one of these three calls that take time to complete:
When it happens, one of those three calls will report that the request is not finished, like this:
When eventually the call completes, I can see that the Content Download took 3.8 minutes. Not sure whether it is relevant or not, but it is always 3.8 minutes:
Did anyone else encounter a similar situation? Is there a suggestion on how to figure out what is suddenly triggering this type of behaviour?
TIA,
Ed
Edit: The resource that fails to load after 3.8 minutes always generates a net::ERR_CONNECTION_RESET error:
Edit2: Thanks to all of you trying to help. A little update: I was able to isolate the problem to the server not serving some of the files, either *.css or *.js. The setup is two identical servers placed behind a load balancer. Apparently, the load balancer software was recently updated, and right after that we started having these issues. I am working closely with the IT department of our client, trying to figure out what the impact is of the newer version that seems to have triggered all this drama.

Neo4j HA Servers keep failing

We have just put our system into production and we have a lot of users on the production system. Our servers keep failing and we are not sure why. It seems to start with one server, then the cluster elects a new master, and then a few minutes later all the servers in the cluster go down. I have it set up to send all the reads to the read databases and to leave the writes to the master. I have looked through the logs and cannot seem to find a root cause. Let me know what logs I should upload and/or where I should look. Today alone we have had to restart the servers 4 times; it fixes things for a bit, but it's not a cure for the issue.
All database servers have 16 GB RAM, 8 CPUs, and SSDs. I have them set up with the following settings in neo4j.properties:
neostore.nodestore.db.mapped_memory=1024M
neostore.relationshipstore.db.mapped_memory=2048M
neostore.propertystore.db.mapped_memory=6144M
neostore.propertystore.db.strings.mapped_memory=512M
neostore.propertystore.db.arrays.mapped_memory=512M
We are using New Relic to monitor the servers and we do not see the hardware getting above 50% CPU and 40% memory, so we are pretty sure that is not it.
Any help is appreciated :)
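As a quick sanity check on the settings above, it may help to add up how much memory they commit in total. This is only a sketch: the helper name is made up, it handles only `M` and `G` suffixes, and it assumes the mapped memory is allocated outside the JVM heap, so that the heap and the OS have to fit in whatever remains of the 16 GB.

```python
def total_mapped_memory_mb(properties_text):
    """Sum every *.mapped_memory setting in the given text, in megabytes."""
    total = 0
    for line in properties_text.splitlines():
        key, separator, value = line.strip().partition("=")
        if not separator or not key.endswith(".mapped_memory"):
            continue
        value = value.strip().upper()
        if value.endswith("G"):
            total += int(value[:-1]) * 1024
        elif value.endswith("M"):
            total += int(value[:-1])
        else:
            raise ValueError(f"unsupported size suffix in {line!r}")
    return total


settings = """
neostore.nodestore.db.mapped_memory=1024M
neostore.relationshipstore.db.mapped_memory=2048M
neostore.propertystore.db.mapped_memory=6144M
neostore.propertystore.db.strings.mapped_memory=512M
neostore.propertystore.db.arrays.mapped_memory=512M
"""

if __name__ == "__main__":
    print(total_mapped_memory_mb(settings))  # 10240
```

That is 10 GB out of 16 GB, leaving roughly 6 GB for the JVM heap and the operating system combined.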

After hitting "Power Cycle" on DigitalOcean, the app doesn't work

I needed to create an image of the current droplet on DigitalOcean, so I powered off the droplet from the dashboard, created the new image of the current droplet, and then wanted to turn the droplet on again. So I hit "Power Cycle".
The problem is that after 10 minutes, the app is still not up. What happened, and how can I fix it?
Thanks
It sounds like the application may not be configured to start on boot. The power cycle feature at DO is basically just a gentle reboot (it tries to wait for things to shut down instead of using harsher methods to stop the processes). Does the application start up if you SSH in and start it manually? If so, next check what is supposed to start it automatically at boot. It might have encountered an error (environment errors are common), which should have been reported somewhere (syslog, application log, etc.).
Just to help you out moving forward, DO supports live snapshots now, so you should be able to snapshot your droplet without powering it off. :)
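Before digging through logs, a quick way to confirm whether the app came back after the reboot is to check whether anything is listening on its port. A minimal sketch follows; the host and port in the example are placeholders, not values from the question.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder host and port; substitute the droplet's address and
    # whatever port the application is expected to listen on.
    print(port_is_open("127.0.0.1", 80))
```

If this returns `False` from the droplet itself, the application process most likely never started, which matches the "not configured to start on boot" theory.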

Windows Azure WebRole stuck in a deployment loop

I've been struggling with this one for a couple of days now. My current Windows Azure WebRole is stuck in a loop where the status keeps changing between Initializing, Busy, Stopping and Stopped.
It never goes live, and I can never see the website as a result. The WebRole is an "out of the box" MVC 2 application with Copy Local set to true on the MVC DLL. I haven't even tried hooking up storage or a WorkerRole yet, and there is nothing happening inside the Start method that I can see would crash.
I've really tried going back to basics to ensure nothing can complicate the process and the website launches without a problem on the Dev Fabric and yes it looks just like the standard "Home", "About" MVC app - just can't get it running in the cloud!
Funny thing is, a few days ago, this exact package worked on the staging area in the cloud, and I could even see it in the browser - but could never get it swapped over to production, so I deleted everything and started from scratch, and now I can't even get it running on staging...
Does anyone have any ideas on what I could do to diagnose this problem myself? Since I logged this problem on the forums 2 days ago, there has been no improvement or feedback.
Any help appreciated,
Regards,
Rob G
Turns out there are a number of things that can cause this to happen. A full thread on the Microsoft forums goes through most of them and details my adventures in the arena.
http://social.msdn.microsoft.com/Forums/en-US/windowsazure/thread/1482c1af-16e3-46ca-846e-14f511c35750
Hope this helps...
I think the best starting point is enabling remote desktop on all role instances.
It saves a lot of heartache wondering why the heck the diagnostics aren't logging anything.
By remoting in you can eyeball the event logs and find lots of reasons for Azure unhappiness.

Resources