During certain periods we experience many "Request timed out" exceptions (System.Web.HttpException) from a specific, frequently called endpoint.
It does not appear to be related to peak-load periods; it has occurred right after deployment and at random times, with no discernible pattern.
Increasing the execution timeout is not the solution, as the requests normally complete within seconds.
Neither the web server nor the backend SQL Server is stressed. We have even seen low CPU usage during an incident period.
From Application Insights I can identify the exact failing endpoint, which is a standard controller action. However, there is no additional information: no stack trace, no error code, nothing. The exception is thrown anywhere from 1 second to several minutes after the request starts.
From Application Insights I can also see that some requests to the failing endpoint do complete. However, the response time is extremely long (up to 8 minutes).
I have found nothing in the IIS logs. We have set up failed request tracing and are waiting for the next incident. However, we do not expect to get more information than we already have from Application Insights.
I'm uncertain whether this is an ASP.NET MVC application issue or an IIS configuration issue. It puzzles me that no stack trace is available.
Any suggestions on how to approach this challenge? Pointers to articles/blogs that can help me solve the issue are very much appreciated.
UPDATE
I was looking through our trace logs and realized that they were not complete, i.e., entries were missing. We use Application Insights (AI) for tracing. AI is configured to keep all traces, exceptions, and events, and it works flawlessly in DEV and STAGING.
We have two AI environments: AI-PROD and AI-TEST. The environment is selected in web.config via the instrumentation key. The entire AI configuration is in ApplicationInsights.config, and this file is identical in DEV, STAGING, and PROD.
I tried to connect STAGING to the AI-PROD environment to verify that it was not a problem with the environment. It worked flawlessly.
I disabled AI in PROD and the server started without throwing “Request timed out” errors during startup. When PROD is connected to either the AI-PROD or the AI-TEST environment I get “Request timed out” errors during startup.
Related
I have a weird issue that just started popping up for our customers. The portal they've been using for years has started freezing on some of the pages the user navigates to. I tried restarting the IIS server, the site within it, and the application pool under which the site is running. No difference.
In Chrome DevTools I can see that it is always one of these three calls that takes a long time to complete:
When it happens, one of those three calls will report that the request is not finished, like this:
When eventually the call completes, I can see that the Content Download took 3.8 minutes. Not sure whether it is relevant or not, but it is always 3.8 minutes:
Did anyone else encounter a similar situation? Does anyone have a suggestion on how to figure out what has suddenly started triggering this type of behaviour?
TIA,
Ed
Edit: The resource that fails to load after 3.8 minutes always generates a net::ERR_CONNECTION_RESET error:
Edit2: Thanks to all of you trying to help. A little update: I was able to isolate the problem to an issue with the server not serving some of the files, either *.css or *.js. The setup is two identical servers placed behind a load balancer. Apparently, the load balancer software was recently updated, and right after that we started having these issues. I am working closely with our client's IT department, trying to figure out what impact the newer version has that seems to have triggered all this drama.
We are investigating an issue on a deployed Cloud Run service, where requests made to the service occasionally fail with a StatusCodeError: 500, while no log of those requests appears in Cloud Run.
Served requests usually produce two log lines detailing the request, route, and status code (POST 200 on https://service-name.a.run.app/route/...)
One with log name projects/XXX/logs/run.googleapis.com/stdout is produced by our application to log the serving of every request
One with log name projects/XXX/logs/run.googleapis.com/requests is automatically produced by cloud run on every request
When the incident occurs, none of those are logged. The client (running in a GKE pod in the same project) has the only log of the failing requests, with the following message:
StatusCodeError: 500 - "\n<html><head>\n<meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\">\n<title>500 Server Error</title>\n</head>\n<body text=#000000 bgcolor=#ffffff>\n<h1>Error: Server Error</h1>\n<h2>The server encountered an error and could not complete your request.<p>Please try again in 30 seconds.</h2>\n<h2></h2>\n</body></html>\n"
Rough timeline of the last incident:
14:41 - Service is serving requests as expected, producing both log lines each time
14:44 to 14:56 - Cloud Run logs are empty; every request made to the service (~30) gets the 500 error message
14:56 - Cloud Run terminates the currently running container instance (as happens after some period of inactivity, for instance), which is correctly logged by the application ([INFO] Handling signal: term)
14:58 - Cloud Run instantiates a new container instance and starts serving incoming requests (which are logged normally)
The absence of logs during the incident makes it hard to investigate its cause, and at this stage we would be grateful for any kind of lead.
Our service has another known issue, which may or may not be related. The service is designed to avoid multiple replicas, as a single one should be able to handle the load and serve concurrent requests (Cloud Run concurrency = 80), but it has a relatively long cold start time (~30 s). This leads to 429 errors when a spike of requests arrives while no replica is available (because Cloud Run hard-caps concurrency to 1 during cold start). This issue was somewhat mitigated by allowing some replication (currently maxScale = 3), since each replica can put a request on hold during the cold start, but it will require some work on the client side to handle correctly (simple retries after the cold start, roughly as sketched below).
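For what it's worth, this is approximately what the client-side retry could look like. It's a minimal Python sketch assuming the requests library and a placeholder SERVICE_URL; our actual client runs in a GKE pod and may use a different stack, and the retry counts and delays here are illustrative only:

```python
import time
import requests

SERVICE_URL = "https://service-name.a.run.app/route"  # placeholder, not the real endpoint


def post_with_retry(payload, max_attempts=5, base_delay=5):
    """Retry on 429/500 so a request issued during a cold start eventually lands."""
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.post(SERVICE_URL, json=payload, timeout=60)
            if resp.status_code not in (429, 500):
                return resp
            print(f"attempt {attempt}: got {resp.status_code}, retrying")
        except requests.exceptions.RequestException as exc:
            print(f"attempt {attempt}: transport error ({exc}), retrying")
        # back off long enough to cover the ~30 s cold start
        time.sleep(base_delay * attempt)
    raise RuntimeError("service still unavailable after retries")
```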
I have found this PIT that describes the aforementioned behavior. It seems to happen because a part of Cloud Run thinks that there are already provisioned instances handling the traffic but there aren't. This issue is currently being worked on internally but there's no ETA for a fix at the moment.
The current workaround is to set a maximum number of instances to at least 4.
For the last couple of days I've been trying to debug a network error from d00m. I'm starting to run out of ideas/leads, and my hope is that other SO users have valuable experience that might be useful. I hope to be able to provide all relevant information, but I'm not personally in control of the server environments.
The whole thing started with users noticing a couple of "network errors" in our app. The errors seemed to occur randomly, without any noticeable pattern related to internet connectivity, iOS version, or backend updates. The two errors that occur behind the scenes are:
Error Domain=NSURLErrorDomain Code=-1001 "The request timed out."
and more frequently:
Error Domain=kCFErrorDomainCFNetwork Code=-1005 "The network connection was lost."
After debugging it for a couple of days, I've managed to reproduce these errors (occurring at random) by firing approximately 10 random (GET and POST) requests at our backend with a random sleep of 1-20 seconds between each request. However, it only occurs in periods. What I've experienced over the last couple of days is that when an "error period" starts, I get one of the two errors roughly every first or second run of the code (an error rate of about 1/10 or 1/20 requests). This error rate continues for a couple of hours, then the errors disappear for a couple of hours, and then it starts all over.
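For illustration, a rough Python equivalent of that reproduction setup is below. The base URL and endpoints are placeholders, and the real tests run on iOS with NSURLSession, so this only mirrors the request pattern, not the client stack:

```python
import random
import time
import requests

BASE_URL = "https://api.example.com"          # placeholder, not our real backend
GET_PATHS = ["/ping", "/items", "/profile"]   # hypothetical endpoints
POST_PATHS = ["/items", "/events"]            # hypothetical endpoints

for i in range(10):
    if random.random() < 0.5:
        path = random.choice(GET_PATHS)
        resp = requests.get(BASE_URL + path, timeout=30)
    else:
        path = random.choice(POST_PATHS)
        resp = requests.post(BASE_URL + path, json={"probe": i}, timeout=30)
    print(i, path, resp.status_code, f"{resp.elapsed.total_seconds():.2f}s")
    time.sleep(random.uniform(1, 20))  # random pause between requests, 1-20 s
```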
Some quick facts about the setup:
Happens on device and simulator
Happens on iOS 8.4 and iOS 7.1 - although v. 8.4 is the main one I use for testing.
We use NSURLSession for our network requests. We also have AFNetworking included (updated to latest version), but we only use the Security part for SSL Pinning. Even with SSL pinning totally turned off, the error still occurs.
Some findings I've written down during the last couple of days:
It seems to only happen on our production environments, which have a somewhat different configuration than our staging environments. This led me to think that it might be related to the keep-alive bug as discussed here and here. However, our ops department has set up a new staging environment sending the same keep-alive header as the production environments, but this did not make the error occur on staging.
The Android version of our app was unable to reproduce the error using the same setup of requests. Further, we have not received any customer reports of "network errors" in the Android app.
My gut feeling says that it's related to the server environment and the HTTP implementation in iOS. I'm however unable to track down a convincing pattern that proves anything. I've made the same setup using a simple Rails script, and when the next "error period" occurs, I'll be ready to try to reproduce it outside of iOS land. I'll update the question when this happens.
I'm not looking for solutions involving resetting Wi-Fi settings, shutting down the simulator, or similar, as I do not see these as feasible solutions in a production environment. I've also considered making the retry-loop fix as mentioned in the GitHub issue, but I see this as a last resort.
Please let me know if you need any more information.
In my experience, those sorts of problems usually point to massive packet loss, particularly over a cell network, where minor variations in multipath interference and other issues can make the difference between reliably passing traffic and not.
Another possibility that comes to mind is poor-quality NAT implementations, in the unlikely event that your server's timeout interval is long enough to cause the NAT to give up on the TCP connection.
Either way, the only way to know for sure what's happening is to take a packet trace. To do that, connect a Mac to the Internet via a wired connection, enable network sharing over Wi-Fi, and connect the iOS device to that Wi-Fi network. Then run Wireshark and tell it to monitor the bridge interface. Instructions here:
http://www.howtogeek.com/104278/how-to-use-wireshark-to-capture-filter-and-inspect-packets/
From there, you should be able to see exactly what is being sent and when. That will probably go a long way towards understanding why it is failing.
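If you'd rather script the capture than drive the Wireshark GUI, something like this scapy sketch captures the same traffic into a pcap you can inspect afterwards. The interface name bridge100 is an assumption (it's what macOS Internet Sharing typically creates); check ifconfig for the actual name, and run it with sudo:

```python
from scapy.all import sniff, wrpcap

# Capture TCP traffic on the Internet Sharing bridge and dump it to a
# pcap file that can be opened in Wireshark later. "bridge100" is an
# assumption; verify the interface name with `ifconfig` first.
packets = sniff(iface="bridge100", filter="tcp", count=2000)
wrpcap("ios_capture.pcap", packets)
```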
OK, I lost a lot of time investigating a similar issue.
The -1005 error can be caused by a known iOS bug, and there are a couple of fixes. For example, add the header "Connection" with the value "close".
More info
The -1001 error is a different story. In my case the problem was a strange (badly configured?) firewall on the server. It was banning the device when there were many (not even that many) requests in a short period of time.
I believe you can do an easy test to see if you are facing a similar issue (a rough Python sketch of it follows the steps below).
Send a lot of requests in a loop (how many depends on the firewall settings; let's say 50 in 1 second).
Close/kill the app (this closes the connections to the server).
(Optional) Wait a while (let's say 60 seconds).
Start the app again and try to send a request.
If you now get timeouts for all subsequent requests, you probably have the same issue and should talk to the server guys.
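If it's easier to run this test outside the app, here is a rough Python sketch of the same idea; the URL is a placeholder, and the burst size and timings should be tuned to whatever the firewall settings actually are:

```python
import time
import requests

URL = "https://api.example.com/ping"  # placeholder endpoint

# Step 1: a burst of requests in roughly one second.
for i in range(50):
    try:
        requests.get(URL, timeout=5)
    except requests.exceptions.RequestException:
        pass  # ignore failures here, the point is only to trip the firewall
    time.sleep(0.02)

# Steps 2-3: the "close the app and wait a while" equivalent.
time.sleep(60)

# Step 4: a single follow-up request. If the firewall has banned you,
# this hangs and ends in a timeout instead of a normal response.
try:
    resp = requests.get(URL, timeout=10)
    print("follow-up request:", resp.status_code)
except requests.exceptions.Timeout:
    print("follow-up request timed out - this looks like the ban described above")
```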
PS: If you don't have access to the server, you can tell the user to toggle Wi-Fi on the device to get out of that timeout loop. It can be a last resort in some cases.
My Heroku app occasionally experiences long run times, on the order of 8 seconds (which is the trigger point for receiving email warnings about long response times). I originally assumed the issue was related to dyno sleeping, but our new production environment has redundant dynos and shouldn't sleep.
The issue doesn't occur on any specific route -- even a route as simple as the 'ping' route used by the front end to keep a session alive can produce it. I don't think it changes anything, but the latest example occurred in an OPTIONS request -- and the follow-up request didn't experience any delay at all.
How can I further diagnose this issue? I've examined the logs around the request in question; there were few log entries in that time period, mostly chatter from the Postgres DB that -- if I'm reading it right -- was saying it was up, running, and had no connections currently running code. For some reason, the request just... randomly... took forever.
I've been working with TFS for months without a problem; now suddenly I can't check in "big" files (2 kB files are OK, but 50 kB files or multiple files are not). TFS is hosted on a server on the same network.
When I try to check in, it gives me an error like: "Check In: Operation not performed: The underlying connection was closed: An unexpected error occurred on a receive. Please refer to the Output window for more information." The "more information" in the Output window is just the same error.
The Event Viewer on the server shows nothing, and I've been searching Google for the past couple of hours and have turned up nothing yet.
The error message "The underlying connection was closed" is an indicator that something between the client and the server is dropping the connection unexpectedly.
Some things to investigate/try:
Is the application pool on the server restarting? Look at the Application Event Log on the AT server. Look for ASP.NET and W3SVC warnings/errors that indicate the app pool has restarted.
How is the client connecting to the server? Is there an HTTP proxy in the middle? Is the server behind a load balancer or firewall device? What is the idle timeout set to? Is it honoring HTTP keep-alive settings?
Is it failing for all clients? Can you check in the same file on the TFS server itself?
If none of those seem to lead you in the right direction, you'll need to set up a Fiddler or NetMon trace on your client and/or your server.
We had a similar problem in TFS 2008, where the lock table was getting big and causing some problems. Take a look at tbl_lock in your TfsVersionControl database. This should be a very small number of rows.
In our experience when this number approached or exceeded 1,000 rows, we started seeing significant issues with check-ins.
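A quick way to check that row count is sketched below with pyodbc; the connection string is a placeholder, so point it at the SQL Server instance hosting your TfsVersionControl database (and note the exact schema can differ between TFS versions):

```python
import pyodbc

# Placeholder connection string - adjust the server name and authentication.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=tfs-sql;DATABASE=TfsVersionControl;"
    "Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.execute("SELECT COUNT(*) FROM tbl_lock")
count = cursor.fetchone()[0]
print(f"tbl_lock rows: {count}")  # in our case, ~1,000+ rows correlated with check-in problems
```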
Our resolution to this was to turn on merging for the binary files that we were storing.