Random and occasional network error (NSURLErrorDomain Code=-1001 and NSURLErrorDomain Code=-1005) - ios

The last couple of days I've tried to debug a network error from d00m. I'm starting to run out of ideas/leads and my hope is that other SO users have valuable experience that might be useful. I hope to be able to provide all relevant information, but I'm not personally in control of the server environments.
The whole thing started with users noticing a couple of "network errors" in our app. The errors seemed to occur randomly, without any noticeable pattern related to internet connectivity, iOS version or backend updates. The two errors that occur behind the scenes are:
Error Domain=NSURLErrorDomain Code=-1001 "The request timed out."
and more frequently:
Error Domain=kCFErrorDomainCFNetwork Code=-1005 "The network connection was lost."
After debugging it for a couple of days, I've managed to reproduce these errors (occurring at random) by firing approx. 10 random (GET and POST) requests towards our backend with a random sleep timer between each request (set at 1-20 seconds). However, it only occurs in periods. What I've experienced the last couple of days is that when a "period of error" starts, I get one of the two errors every once or twice I run the code (meaning an error rate of 1/10 or 1/20 requests). This error rate continues for a couple of hours and then the error disappears for a couple of hours and then it starts all over.
Some quick facts about the setup:
Happens on device and simulator
Happens on iOS 8.4 and iOS 7.1 - although v. 8.4 is the main one I use for testing.
We use NSURLSession for our network requests. We also have AFNetworking included (updated to latest version), but we only use the Security part for SSL Pinning. Even with SSL pinning totally turned off, the error still occurs.
Some findings I've written down during the last couple of days:
It seems to happen only on our production environments, which are configured somewhat differently from our staging environments. This led me to think that it might be related to the keep-alive bug as discussed here and here. However, our ops department has set up a new staging environment sending the same keep-alive header as the production environments, and this did not make the error occur on the staging environment.
The Android version of our app was unable to reproduce the error using the same setup of requests. Further, we've not received any customer reports of "network errors" in the Android app.
My gut feeling says that it's related to the server environment and the HTTP implementation in iOS. However, I'm unable to track down a convincing pattern that proves anything. I've made the same setup using a simple Rails script, and when the next "error period" occurs, I'll be ready to try to reproduce it outside of iOS land. I'll update the question when this happens.
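For anyone who wants to run the same kind of test outside of iOS, here is a minimal sketch of the reproduction loop described above (about 10 random requests with a random 1-20 second pause between them). It is written in Python rather than Rails, and `send`, `endpoints` and `run_batch` are hypothetical names for illustration; `send` stands in for whatever performs one real request against the backend and raises on failure.

```python
import random
import time
from collections import Counter
from typing import Callable, Optional, Sequence


def run_batch(
    send: Callable[[str], None],
    endpoints: Sequence[str],
    requests_per_run: int = 10,
    min_sleep: float = 1.0,
    max_sleep: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> Counter:
    """Fire requests at random endpoints with random pauses and tally outcomes."""
    rng = rng or random.Random()
    outcomes: Counter = Counter()
    for i in range(requests_per_run):
        endpoint = rng.choice(list(endpoints))
        try:
            send(endpoint)  # hypothetical callable that performs one request
            outcomes["ok"] += 1
        except Exception as exc:
            # Tally by exception class so timeouts and dropped connections
            # show up as separate counts.
            outcomes[type(exc).__name__] += 1
        if i < requests_per_run - 1:
            sleep(rng.uniform(min_sleep, max_sleep))
    return outcomes
```

Running this on a schedule and logging the returned counts with a timestamp should make it easier to see whether the "error periods" also show up for a non-iOS client.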
I'm not looking for solutions involving resetting wifi settings, shutting down the simulator or similar, as I do not see these as feasible solutions in a production environment. I've also considered making the retry-loop fix mentioned in the GitHub issue, but I see this as a last resort.
Please let me know if you need any more information.

In my experience, those sorts of problems usually point to massive packet loss, particularly over a cell network, where minor variations in multipath interference and other issues can make the difference between reliably passing traffic and not.
Another possibility that comes to mind is poor-quality NAT implementations, in the unlikely event that your server's timeout interval is long enough to cause the NAT to give up on the TCP connection.
Either way, the only way to know for sure what's happening is to take a packet trace. To do that, connect a Mac to the Internet via a wired connection, enable network sharing over Wi-Fi, and connect the iOS device to that Wi-Fi network. Then run Wireshark and tell it to monitor the bridge interface. Instructions here:
http://www.howtogeek.com/104278/how-to-use-wireshark-to-capture-filter-and-inspect-packets/
From there, you should be able to see exactly what is being sent and when. That will probably go a long way towards understanding why it is failing.

OK, I lost a lot of time investigating a similar issue.
1005 could be caused by a known iOS bug, and there are a couple of fixes. For example, add the header "Connection" with the value "close".
More info
1001 is a different story. In my case the problem was a strange (bad?) firewall on the server. It was banning the device when there were many (actually not so many) requests in a short period of time.
I believe you can run an easy test to see whether you are facing a similar issue.
Send a lot of requests in a loop (how many depends on the firewall settings; let's say 50 in 1 second).
Close/kill the app (this will close the connections to the server).
(OPTIONAL) Wait a while (let's say 60 seconds).
Start the app again and try to send a request.
If you now get a timeout for all subsequent requests, you probably have the same issue and you should talk with the server guys.
PS: if you don't have access to the server, you can tell the user to restart wifi on the device to get out of that timeout loop. It could be a last resort in some cases.

Related

Request timed out (System.Web.HttpException)

In periods we are experiencing many "Request timed out" exceptions (System.Web.HttpException) from a specific endpoint that is called often.
It appears not to be related to high-peak periods and has been experienced right after deployment and at random times. No pattern.
The solution is not to increase the execution timeout as the requests are normally completed within seconds.
Neither the web server nor the backend SQL Server is stressed. We have even seen low CPU usage during an incident period.
From ApplicationInsights I got the exact endpoint failing, which is a standard controller action. However, there is no additional information. No stack trace. No error code. Nothing. The exception is thrown at any time between 1 second and minutes after the request start.
From ApplicationInsights I can see that some of the requests to the failing endpoint are completed. However, the response time is extremely long (up to 8 minutes).
I have found nothing in the IIS logs. We have set up failed request logging and are waiting for the next incident. However, we do not expect to get more information than we already got from ApplicationInsights.
I'm uncertain whether this is an ASP.NET MVC application issue or an IIS configuration issue. It puzzles me that no stack trace is available.
Any suggestions on how to approach this challenge? Pointers to articles/blogs that can help me solve the issue are very much appreciated.
UPDATE
I was looking through our trace logs and realized that they were not complete, i.e., entries were missing. We use ApplicationInsights (AI) for tracing. AI is configured to keep all traces, exceptions, and events, and it is working flawlessly in DEV and STAGING.
We have two AI environments: AI-PROD and AI-TEST. The environment is selected in web.config via instrumentation key. The entire AI config is in the ApplicationInsights.config and this file is the same in DEV, STAGING, and PROD.
I tried to connect STAGING to the AI-PROD environment to verify that it was not a problem with the environment. It worked flawlessly.
I disabled AI in PROD and the server started without throwing “Request timed out” errors during startup. When PROD is connected to either the AI-PROD or the AI-TEST environment I get “Request timed out” errors during startup.

iOS Network Connection Failure Policy suggestions

I'm looking for suggestions on the best way to handle network connection issues in an iPhone app (iOS9/Swift2/Xcode7), to give the best user experience given that mobile data networks are unreliable. I have my coding options in place but I'd like to know what's worked well for other experienced techs. There's lots of info out there but nothing I could find specific to a strategy to follow when there is a connection failure.
Here is my basic strategy dealing with failed connections I'd like to implement (along with questions):
App sends request to api.myserver.com and the request fails
Wait X second(s) and try request to api.myserver.com again (how many tries and at what time interval would you suggest?)
Try pinging some other server (i.e. google.com) to see if we can access a resource other than api.myserver.com
If we can successfully ping google.com then we know our internet is working, so we try once again to ping api.myserver.com
If this last ping fails then we alert the user that we can't communicate for some reason and to try again later
I'm using the philosophy outlined in this SO answer recommended by an Apple tech, which in general means you always check the connection to your server first, using Reachability as a separate check to ensure phone hardware is available.
At any time during this process if Reachability is false then we would put our request in a queue to be tried again when the phone hardware connection was restored.
I think I've got a handle on the code involved, but looking for insights like "this is what worked for our app and gives a good user experience during connection issues...and was approved for use in the Apple app store...". I understand the concepts of trying/retrying connections in the case of failure and alerting the user (currently my code already does this successfully), but still not solid on a good policy to use for how many times should I try to reconnect and at what intervals?
For most of the apps I have worked on it was useful to define a couple of categories of requests which have different rules. For each category consider if retries are appropriate and how long you can really afford to wait before considering the request(s) a failure.
At the most sensitive are blocking requests, things which the user must allow to complete before they can proceed. Sign in, checkout, some editing actions, etc. For these it is often not worth retrying(1) and failures need to be communicated to the user immediately: if the device is offline let the user decide when to try again, if the request fails you've probably already made the user wait too long. Since failures tend to block the user they usually also need to be communicated prominently.
Less sensitive are user-initiated but non-blocking actions: pull-to-refresh, loading details of a selected collection item, or performing a search. Your user might be waiting to see the results but is probably free to give up or navigate elsewhere in the app and check back later. Failures still need to be communicated so users can choose to try again or at least know to stop waiting, but the notification of those failures can be less prominent. Here retries start to make sense. I usually start by trying to define a time limit from the user's perspective (how long will they wait before the app feels broken?) and then let that be the limit for how long a request can wait for connectivity or make any number of retries in response to failed connections.
Even less sensitive are requests triggered only indirectly by your user; polling for updates, loading non-essential images, warming caches. These you might retry but the impact of failure is often so low that it may not matter.
Of all of those requests your retry policy really only impacts the second category (user-initiated, non-blocking requests), so I would make sure you actually have requests of that type before worrying about it. Assuming those do actually apply to your app...
Wait X second(s) and try request to api.myserver.com again (how many tries and at what time interval would you suggest?)
I would set some interval here (in the tens to hundreds of milliseconds, depending on your normal API performance) to avoid an accidental flood of requests. I don't want to suggest a precise number when I don't have a solid justification for it.
My experience has been that optimizing this value is unlikely to make a perceptible difference to your users because requests often take hundreds of milliseconds to fail and users are only willing to wait for a few thousand milliseconds so making 1 or 5 or 10 requests in that time doesn't really change the final outcome. If you are able to set different expectations with your users then your results may vary.
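To make the "time limit from the user's perspective" idea concrete, here is a rough sketch in Python (not Swift, and not code from any particular app). The function name `retry_within_budget` and the default values of 3 seconds and 200 ms are assumptions for illustration only; the point is that the loop is bounded by a time budget rather than by a retry count.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_within_budget(
    attempt: Callable[[], T],
    budget_seconds: float = 3.0,   # assumed: how long the user will wait
    pause_seconds: float = 0.2,    # assumed: small gap to avoid a request flood
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `attempt` until it succeeds or the time budget is spent."""
    deadline = clock() + budget_seconds
    while True:
        try:
            return attempt()
        except (ConnectionError, TimeoutError):
            # Give up (re-raise the last error) when another pause would
            # take us past the user-facing deadline.
            if clock() + pause_seconds >= deadline:
                raise
            sleep(pause_seconds)
```

Because each failed attempt usually takes a few hundred milliseconds on its own, the number of retries that fit in the budget falls out naturally and does not need to be tuned separately.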
Try pinging some other server (i.e. google.com) to see if we can access a resource other than api.myserver.com
If we can successfully ping google.com then we know our internet is working, so we try once again to ping api.myserver.com
I would not assume that this is true nor do I think that making an extra request to a third party will help you make useful predictions about when to attempt to reach your own systems. This seems like extra work to build and maintain and likely to be a source of misleading results more than valuable information. In what scenario do you imagine this provides useful information to your app or its user?
Maybe not the answer you're looking for, hopefully it's still useful.
Disclaimer: my experience is biased toward apps with a fairly simple set of REST or RPC style network requests. If you're working on a problem which calls for streaming data, P2P connections, or some other scenario then don't start with these assumptions.
(1) One end note here because I see it as a source of failures so often: These requests should really be idempotent. Yes, even those POSTs creating new resources, checking out your cart, or whatever. When you cannot safely repeat a request you'll eventually see cases where the request completed but the client never got the acknowledgement so it looked like a failure. It's much easier to recover through a retry (automatic or user triggered) of the same request than to detect and recover from duplicate requests.
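One common way to make a creating POST safe to repeat is an idempotency key: the client generates a unique key per user action and sends the same key on every retry, and the server returns the stored result when it sees a key again. The answer above does not prescribe this technique, so treat the following Python sketch as one possible illustration; `OrderService`, `create_order` and `new_idempotency_key` are made-up names, and the in-memory dictionary stands in for real storage.

```python
import uuid
from typing import Dict


class OrderService:
    """Toy in-memory server that deduplicates requests by idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}
        self.orders_created = 0

    def create_order(self, idempotency_key: str, cart: dict) -> dict:
        # A repeated key returns the stored result instead of creating
        # a duplicate order.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.orders_created += 1
        result = {"order_id": self.orders_created, "items": list(cart["items"])}
        self._results[idempotency_key] = result
        return result


def new_idempotency_key() -> str:
    """Client side: one key per user action, reused on every retry."""
    return str(uuid.uuid4())
```

With this in place, a retry after a lost acknowledgement gets back the original order rather than creating a second one.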
For better network behavior, in my application I ping a Google server before every API request. If it is reachable, I call my server API; otherwise I show a "no network" alert.
If you are on a wifi network you have to do the same, because wifi reachability only checks for wifi connectivity, not for actual network access.

Recreating Thrift clients in iOS

In short, our project uses a Thrift server and mobile clients with multiplexing.
While developing the iOS client, I encountered a strange problem.
When I first create the client and make calls, everything is OK and it works as expected.
Since there is no close method for the Cocoa Thrift client, I am hoping ARC will take care of it.
After some time, I create another client for the same service and do the same things, but this time, when I make a service call, the client hangs and after some time it throws a "'TTransportException', reason: 'Cannot read. Remote side has closed.'".
In the server, operation is successfully completed and the value returned.
Does anybody have an idea about what I am doing wrong?
Thanks in advance!
Reading your question, I remembered that we encountered a very similar problem in a very different environment. If ARC takes care of your client and closes the connection, and especially the port, then recreating the client with the same port might be the root of your problem. Opening the same port shortly after closing it can take a very long time (minutes), depending on timeouts.
Sorry, no real answer to your problem, but maybe a hint about where to look.

My app fails to connect to the server sometimes

I've been helplessly observing this problem for a couple months now, and have decided this is my best shot.
I'm not sure what the cause of the problem is, but I can list some of the things I'm doing. I have an iOS app that uses AFNetworking to connect to a remote server hosted by Google App Engine using HTTP POST requests.
Now, everything works great, but sometimes, very sporadically and at random, I get failed requests. The activity indicator spins and spins for about a minute, and I get no feedback at the end - just a failed request. I check my server logs, and I don't see any errors. After the failed request, I try again, and it works fine. It works fine for the whole day. And then at another random time the issue repeats itself, sometimes spinning for 10 seconds before failing, sometimes for a minute.
Generally, what can possibly be the cause of this? Is it normal to have some failed connections randomly? Is that something on my part?
But the weird thing is that while the app is running on my iPhone, with the indicator spinning as it tries to connect, I try connecting on the iOS simulator and the connection works just fine. I try again on the iPhone, and it doesn't work.
If I close the app completely and start again, then it works again. So it sounds like it may be a software issue rather than a connection issue, but then again I have no evidence or data whatsoever.
I know it's vague, but I'm hoping someone may have had a similar problem. Anything helps.
There is a known issue with instance start on GAE for Java. You can star the issue at http://code.google.com/p/googleappengine/issues/detail?id=7706.
The same problem was reported for Python, but there it is not such a big problem.
I think you should check the logging level you use on App Engine and monitor all your calls. Instance start usually takes more time, so you will be able to see how much time is spent on start and whether it really is a timeout problem.
For the Java version you could try lowering the log level to FINE (the java.util.logging counterpart of debug):
.level = FINE
in your logging.properties file. It will give you more information about the instance start process.

Slow request processing in rails, 'waiting for server' message

I have quite a big application, running from inside a Spree extension. The issue is that all requests are very slow, even locally. I am getting messages like "Waiting for localhost" or "Waiting for server" in my browser status bar for 3 - 4 seconds for each request issued, before it starts execution. I can see that the execution time logged in the log file is quite good, but the overall response time is poor because of the initial delay. Where can I start looking to improve this situation?
One possible root cause for this kind of problem is that the initial DNS name resolution is failing before eventually resolving. You can check if this is the case using tcpdump (if that's available for your platform) or Wireshark. Look for traffic to and from your client host on port 53 and see if the name responses are arriving in a timely fashion.
If it turns out that this is the problem, then you need to make sure that the client is configured such that the first resolver it tries knows about your server addresses (I'm guessing these are local LAN addresses that are failing). Different platforms have different ways of configuring this. A quick hack would be to put the address of your server in the client's hosts file to see if that fixes it.
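As a quicker first check than a packet capture, you can time the name lookup from the client machine. This is only a sketch, written in Python, and `time_lookup` is a hypothetical helper name; it uses the standard `socket.getaddrinfo` call, which goes through the system resolver (including the hosts file), so it measures what an application on that machine would experience rather than raw port 53 traffic.

```python
import socket
import time
from typing import Callable, Tuple


def time_lookup(
    hostname: str,
    resolver: Callable[..., list] = socket.getaddrinfo,
) -> Tuple[float, int]:
    """Return (elapsed seconds, number of addresses) for one name lookup."""
    start = time.perf_counter()
    try:
        results = resolver(hostname, None)
    except socket.gaierror:
        # The lookup failed; how long it took to fail is still informative.
        results = []
    elapsed = time.perf_counter() - start
    return elapsed, len(results)
```

If a call such as `time_lookup("localhost")` takes seconds rather than milliseconds, name resolution is a likely suspect for the delay.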
Once you send in your request, you will see "waiting for host" right up until the Ruby work is done and it starts sending a response. So if there is pretty much any processing work that is slowing you down, you'd see this message. What you'd want to do is start looking at the functions where you're seeing the behaviour, and break them down into pieces to see which pieces are slow. If EVERYTHING is slow, then you need to look at the things that are common to every function: before filters, Application Controller code, or something similar. What I do, when I'm just playing around to see what I need to fix, is put 'puts' statements in my code at different stages to print the current time; then I can see which stage is taking a long time.
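The same "print the time at each stage" idea can be wrapped in a small helper so the timings are collected in one place. The sketch below is in Python rather than Ruby, and the stage names and `handle_request` are placeholders, not anything from a real Rails or Spree code base; the sleeps stand in for actual work.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(stage: str, timings: Dict[str, float]) -> Iterator[None]:
    """Record how long the wrapped block takes under the name `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


def handle_request(timings: Dict[str, float]) -> str:
    # Placeholder stages; replace the sleeps with the real work being measured.
    with timed("before_filters", timings):
        time.sleep(0.005)
    with timed("query", timings):
        time.sleep(0.05)
    with timed("render", timings):
        body = "ok"
    return body
```

After one call, `timings` maps each stage name to its duration in seconds, so the slow stage stands out without reading through scattered log lines.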

Resources