iOS Network Connection Failure Policy suggestions

I'm looking for suggestions on the best way to handle network connection issues for an iPhone app (iOS9/Swift2/Xcode7), to give the best user experience since we know that mobile data networks are unreliable. I have my coding options in place, but I'd like to know what's worked well for other experienced techs. There's lots of info out there, but nothing I could find specific to a strategy to follow when a connection failure occurs.
Here is the basic strategy for dealing with failed connections I'd like to implement (along with questions):
App sends request to api.myserver.com and the request fails
Wait X second(s) and try request to api.myserver.com again (how many tries and at what time interval would you suggest?)
Try pinging some other server (e.g., google.com) to see if we can access a resource other than api.myserver.com
If we can successfully ping google.com then we know our internet is working, so we try once again to ping api.myserver.com
If this last ping fails then we alert the user that we can't communicate for some reason and to try again later
I'm using the philosophy outlined in this SO answer recommended by an Apple tech, which in general means you always check the connection to your server first, using Reachability only as a separate check that the phone's network hardware is available.
At any time during this process if Reachability is false then we would put our request in a queue to be tried again when the phone hardware connection was restored.
I think I've got a handle on the code involved, but I'm looking for insights like "this is what worked for our app and gives a good user experience during connection issues...and was approved for use in the Apple App Store...". I understand the concepts of trying/retrying connections in the case of failure and alerting the user (my code already does this successfully), but I'm still not solid on a good policy for how many times I should try to reconnect and at what intervals.

For most of the apps I have worked on it was useful to define a couple of categories of requests which have different rules. For each category consider if retries are appropriate and how long you can really afford to wait before considering the request(s) a failure.
The most sensitive are blocking requests, things which the user must allow to complete before they can proceed: sign in, checkout, some editing actions, etc. For these it is often not worth retrying(1) and failures need to be communicated to the user immediately: if the device is offline, let the user decide when to try again; if the request fails, you've probably already made the user wait too long. Since failures tend to block the user, they usually also need to be communicated prominently.
Less sensitive are usually user-initiated but non-blocking actions: pull-to-refresh, loading details of a selected collection item, or performing a search. Your user might be waiting to see the results but is probably free to give up or navigate elsewhere in the app and check back later. Failures still need to be communicated so users can choose to try again, or at least know to stop waiting, but the notification of those failures can be less prominent. Here retries start to make sense. I usually start by trying to define a time limit from the user's perspective (how long will they wait before the app feels broken?) and then let that be the limit for how long a request can wait for connectivity or make any number of retries in response to failed connections.
Even less sensitive are requests triggered only indirectly by your user: polling for updates, loading non-essential images, warming caches. These you might retry, but the impact of failure is often so low that it may not matter.
Of those, your retry policy really only impacts the second category, so I would make sure you actually have requests of that type before worrying about it. Assuming those do actually apply to your app...
Wait X second(s) and try request to api.myserver.com again (how many tries and at what time interval would you suggest?)
I would set some interval here (in the tens to hundreds of milliseconds depending on your normal api performance) to avoid an accidental flood of requests. I don't want to suggest a precise number when I don't have a solid justification for it.
My experience has been that optimizing this value is unlikely to make a perceptible difference to your users because requests often take hundreds of milliseconds to fail and users are only willing to wait for a few thousand milliseconds so making 1 or 5 or 10 requests in that time doesn't really change the final outcome. If you are able to set different expectations with your users then your results may vary.
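To make that concrete, here is a minimal sketch of such a policy in Python (the language is just for illustration since the idea is the same in Swift; `send_request`, the attempt count, and the delays are placeholders to tune against your own API and your users' patience):

```python
import random
import time

def request_with_retries(send_request, attempts=5, base_delay=0.1, deadline=4.0):
    """Retry a failed request with capped exponential backoff and jitter.

    send_request -- callable performing one attempt, raising on failure.
    deadline     -- total seconds a user will plausibly wait before we
                    stop retrying and surface the failure.
    """
    start = time.monotonic()
    last_error = None
    for attempt in range(attempts):
        try:
            return send_request()
        except Exception as exc:
            last_error = exc
            # Exponential backoff (0.1s, 0.2s, 0.4s, ...) with jitter so
            # many clients recovering at once don't stampede the server.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            if time.monotonic() - start + delay > deadline:
                break  # the user has waited long enough; stop retrying
            time.sleep(delay)
    raise last_error  # let the UI decide how prominently to report this
```

The jitter matters more than the exact base delay: it spreads out the retry traffic from many devices that all lost and regained connectivity at the same moment.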
Try pinging some other server (i.e. google.com) to see if we can access a resource other than api.myserver.com
If we can successfully ping google.com then we know our internet is working, so we try once again to ping api.myserver.com
I would not assume that this is true nor do I think that making an extra request to a third party will help you make useful predictions about when to attempt to reach your own systems. This seems like extra work to build and maintain and likely to be a source of misleading results more than valuable information. In what scenario do you imagine this provides useful information to your app or its user?
Maybe not the answer you're looking for, hopefully it's still useful.
Disclaimer: my experience is biased toward apps with a fairly simple set of REST or RPC style network requests. If you're working on a problem which calls for streaming data, P2P connections, or some other scenario then don't start with these assumptions.
(1) One end note here because I see it as a source of failures so often: These requests should really be idempotent. Yes, even those POSTs creating new resources, checking out your cart, or whatever. When you cannot safely repeat a request you'll eventually see cases where the request completed but the client never got the acknowledgement so it looked like a failure. It's much easier to recover through a retry (automatic or user triggered) of the same request than to detect and recover from duplicate requests.
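For illustration, a sketch of the usual way to get that safety: the client generates a unique key per logical operation and reuses it across retries, and the server deduplicates on it. This assumes your server implements that deduplication; the Idempotency-Key header is a common convention (Stripe's API uses it), not a standard, and the endpoint below is the hypothetical one from the question:

```python
import uuid

import requests  # third-party HTTP client: pip install requests

def checkout(cart):
    # Generate the key once per logical operation and reuse it across
    # retries, so the server can recognise and deduplicate repeats.
    idempotency_key = str(uuid.uuid4())
    for _ in range(3):
        try:
            resp = requests.post(
                "https://api.myserver.com/checkout",  # endpoint from the question
                json=cart,
                headers={"Idempotency-Key": idempotency_key},
                timeout=5,
            )
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            continue  # safe to retry: same key means no double-charge
    raise RuntimeError("checkout did not complete; retry later with the same key")
```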

For better network performance: in my application I ping a Google server before every API request; if it's reachable I call my server's API, otherwise I show a no-network alert.
You have to do the same even on a Wi-Fi network, because Wi-Fi reachability only checks for Wi-Fi connectivity, not for actual network access.

Related

Cloud services to notify when a script has not succeeded for a long time

I have a python script that resides on a VPS, reads (each hour) financial news from a public datafeed and emails me when certain keywords of interest appear. That can happen only a few times a week, but such events are very important and must not be missed. On any data fetching or parsing error, I should also be notified via email, and errors of course get recorded into the server's local log file.
But how do I know that my SMTP credentials haven't been blocked by the mail provider, or that my VPS hasn't been shut down by my hosting provider? In that case I would not be notified, and would remain unaware of important events (and of the failure to fetch/deliver them) until I decided to log into the VPS manually and look at the logs.
Even if I used a backup notification channel, e.g., SMS or Telegram, it still would not protect against a cloud provider service disruption, or my account being blocked due to temporary payment issues, as there would be no running instance of the script to deliver the message on any channel. That's why I suspect some third-party fault-tolerant service is needed, especially if I'm a freelance coder with lots of similar scripts running on a mixture of VPSes and serverless/Lambda functions, possibly for different end clients.
What best practices do you, dear developers, use to be notified when some script has not succeeded for long enough? I would like something reliable and ready-to-use; maybe you can recommend some existing monitoring services. At least I was not able to find one that solves my particular problem straight away.
To clarify, I don't want to spend time on manual checking until it's absolutely necessary (in this case I can tolerate up to 2 hours; if it does not self-heal within that period, then I need to be notified), and I obviously don't want regular, annoying reports that the service is doing fine and there simply was no interesting news detected. Plus, I of course want to keep costs reasonable.
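The pattern being described is essentially a dead man's switch: the script pings an external monitoring endpoint after every successful run, and the monitoring side alerts when pings stop arriving within the expected period. Hosted services built around exactly this exist (Healthchecks.io and Cronitor are two examples). A minimal sketch of the script side, with a hypothetical ping URL:

```python
import urllib.request

# Hypothetical check URL issued by a dead-man's-switch monitoring service,
# configured there with a 1-hour period plus grace so an alert arrives
# within the 2-hour tolerance. Uneventful runs never email you; the
# service only alerts when this ping stops arriving.
PING_URL = "https://hc-ping.example/your-check-uuid"

def fetch_and_scan_news():
    ...  # existing hourly fetch/parse/keyword logic goes here

def main():
    fetch_and_scan_news()  # any exception skips the ping below
    urllib.request.urlopen(PING_URL, timeout=10)  # "still alive" heartbeat

if __name__ == "__main__":
    main()
```

Because the alerting lives outside both the VPS and the SMTP account, it keeps working when either of those fails, which is exactly the gap described above.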

Load Test Application calling external http service

Thanks for looking at this question. I have an application which reads from a JMS queue, processes the messages, and POSTs the processed messages to an external HTTP service. What is the best way to load test this using Gatling?
I can simulate load on the queue using gatling.jms. How do I verify the POSTs to the external service?
Load testing with Gatling is fairly complex to do right. I've done it enough to know some of the pitfalls, so here is some insight that may be useful:
Test over the network, but with latency as low as possible, so that delays due to network latency are minimised and the results show how quickly incoming HTTP requests can be handled and responded to. For this reason, if your application is in the cloud in europe-east, say, you want to run your tests from the same location. If your requests were coming from us-west, there'd be a big delay in routing the requests from the wrong side of the US, which could introduce big variations in the response times to/from your application.
Remove all other load from your service. If you can't remove load because you're hoping to test against a live application, then you need to make another deployment, with no active load, to test against.
Load tests should run for (in my experience) 45 minutes as a minimum to verify your service can handle the load. The reason is that it can take time for an unbearable load to accumulate on the server, so you may run at 33 req/s, which is fine for 40 minutes, but when run for 45-60 minutes, that's just long enough that the balance between what your application can cope with and what causes catastrophic failure tips towards failure.
Notes:
You don't need to test to destruction but it is sometimes a useful metric to be aware of. I find using a binary search strategy works well here to get peak load relatively quickly.
What you should test is that your application can handle the load you expect it to receive in a worst-case scenario. Different organisations have different tolerances for how much load they expect their applications to cope with. At some places I've worked they used a lot of optimisations to minimise the load reaching their servers directly, but if those protections failed, the server was expected to handle 10x the usual load. At other places, those same optimisations were not in place; instead there were disaster recovery systems available, ready to pick up when the main app failed. In that case the application only needed to handle 2x the peak load (as observed by assessing logs/metrics for the past year).
I work predominantly with garbage-collected languages on the JVM. I'm aware there are now zero-garbage-collection designs/capabilities which could help minimise the effects of a buildup of GC tasks, so there are almost always optimisations you can make, whether in language/memory settings, database indexing, your application itself, or the strategies you employ to perform a task effectively, before you start changing the hardware.
Peak load can be assessed from logs/metrics systems.
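On the second part of the question, verifying the POSTs to the external service: Gatling only observes the requests it sends itself, so it can't see your application's outbound calls directly. The usual approach is to point the application at a stub of the external service during the test and have the stub record what it receives. A purpose-built tool like WireMock gives you this with request matching and verification built in; a minimal hand-rolled sketch (Python, standard library only, port number arbitrary):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # processed messages the application POSTed to us

class StubExternalService(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        received.append(body)  # record for post-run verification
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the load-test console quiet

if __name__ == "__main__":
    # Point the application's external-service URL at localhost:9090 for
    # the test, then compare len(received) with the number of messages
    # Gatling injected onto the JMS queue to verify nothing was dropped.
    HTTPServer(("", 9090), StubExternalService).serve_forever()
```

This also keeps the real external service out of your load path, which matters for the "remove all other load" point above.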

How to know whether my server is in good health?

How can I know that my server is working fine, i.e., in good health?
My situation: users are complaining that they cannot access the web application (web site); pages take a long time to load, and sometimes requests never complete.
I want to know whether my web site is in good condition before users notice problems, and to get an alert message when it isn't.
I want to know how to measure whether the server is responsive and users are not facing problems. Sometimes my site legitimately takes a long time because millions of data records have to be retrieved, so in that case I cannot rely on response time alone.
Please help me with this.
Monitoring response time without any third-party software can be done with scripts like WebInject. WebInject is a Perl script that executes a browsing scenario and tells you whether the result is acceptable.
Run a script at a regular interval, say every 10 minutes, that starts a WebInject scenario. If the scenario fails (check the return code of your WebInject call), your script can send you an email or an SMS, start a sound alarm, whatever is relevant to you.
You can also add some complexity by running a diagnostic script (check the network by pinging relevant hardware, check the CPU/RAM usage of your servers, check the number of sessions in your database, ...) and sending the diagnostic by email. You can also save the response times in a database (such as an RRD, round-robin database) to get a graphical view and be able to do some problem analysis on it.
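A stripped-down version of that loop, using only Python's standard library in place of WebInject (hostnames, addresses, and thresholds below are placeholders to adapt):

```python
import smtplib
import time
import urllib.request
from email.message import EmailMessage

CHECK_URL = "https://www.example.com/"  # a page a real user would hit
MAX_SECONDS = 5                         # what you consider "acceptable"

def alert(subject):
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = "monitor@example.com"
    msg["To"] = "oncall@example.com"
    msg.set_content("Automated site-check alert.")
    with smtplib.SMTP("smtp.example.com") as smtp:  # placeholder relay
        smtp.send_message(msg)

while True:
    start = time.monotonic()
    try:
        urllib.request.urlopen(CHECK_URL, timeout=30)
        elapsed = time.monotonic() - start
        if elapsed > MAX_SECONDS:
            alert(f"site slow: {elapsed:.1f}s")  # up, but unacceptably slow
    except Exception as exc:
        alert(f"site check failed: {exc}")       # down or erroring
    time.sleep(600)  # every 10 minutes, as suggested above
```

Note this script has the same blind spot discussed in the previous question: if the machine running it dies, the alerts stop too, so run it from a separate host.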

Cost of continuous replications vs one-shot replications (using TouchDB and Cloudant)

We have an app that uses Cloudant as a remote server. However, from previous experience Cloudant is not completely compatible with TouchDB's continuous replications, so our alternative for now is to trigger one-shot replications manually at a fixed frequency. We would like to know whether that approach is going to cost us more money than continuous replications, since continuous replication uses longpoll and doesn't need to query the server as often. In other words, does a one-shot pull replication from Cloudant cost us a GET request?
Thank you,
Paul
I think the issue you refer to is [1]. Cloudant's replication is 100% compatible with CouchDB. In this instance, TouchDB's logs indicate the iOS network stack passed on incomplete JSON to TouchDB. It's not clear who was to blame in this case for the replication failure.
[1] https://github.com/couchbaselabs/TouchDB-iOS/issues/241
For the cost question, a one-shot pull replication will result in a GET to the _changes feed each time it happens, plus the other requests required to replicate. This _changes request will be counted as a light HTTP request against your Cloudant account. However, whether this works out as more or fewer requests overall depends on the number of changes coming down from the remote server. It's also important to remember that the number of _changes calls is very small relative to the number of other calls involved (e.g., getting the content of the changes themselves, particularly if there are many attachments).
While this question is specific to TouchDB, and I mention specific behaviours of that codebase, this answer deals with the requests involved in replication between any two systems speaking the CouchDB replication protocol [2].
[2] http://www.dataprotocols.org/en/latest/couchdb_replication.html
Let's take a contrived example: 1 update per 10-second window to the source database for the replication, where a TouchDB database is the target. Let's take a 5-minute poll vs. a continuous replication. For simplicity of call-counting, let's also take attachments out of the picture. We'll also assume the device has a constant network connection.
For the continuous case, every 10 s TouchDB will receive an update in the _changes feed. This causes the longpoll connection to close. TouchDB then runs through the changes, requesting the updates from the source database: one or more GET requests on the remote server. While this is happening, TouchDB has to open up another longpoll request to _changes. So in a five-minute period, you'd end up with perhaps 30 calls to _changes, plus all the calls to get documents and record checkpoints.
Compare this with a one-shot replication every five minutes. You'd receive notification of the 30 updates in one _changes feed call. TouchDB implements an optimisation [3] whereby it will call _all_docs to get updated documents for 1- revs, so you might end up with a single call to get all 30 documents (not possible in the continuous case, where changes arrive one at a time). Then you have the checkpoint documents to record. At best that's fewer than 5 HTTP calls; at most, about a third of the continuous case, as you've avoided the extra _changes requests.
[3] https://github.com/couchbaselabs/TouchDB-iOS/wiki/Replication-Algorithm#performance
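To put rough numbers on that comparison (a back-of-the-envelope sketch only; real counts depend on batching and how often checkpoints are written):

```python
# Back-of-the-envelope call counts for the contrived example above:
# one update every 10 s to the source, compared over a 5-minute window.
updates = 5 * 60 // 10  # 30 changes in 5 minutes

# Continuous: each change closes the longpoll, so roughly one _changes
# call plus at least one document GET per change, before checkpoints.
continuous = updates + updates
print("continuous: ~%d calls + checkpoints" % continuous)  # ~60

# One-shot every 5 minutes: one _changes call, and (with the _all_docs
# optimisation) potentially one batched fetch for all 30 documents.
one_shot = 1 + 1
print("one-shot:   ~%d calls + checkpoints" % one_shot)    # ~2
```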
It comes down to the frequency of updates you expect to the source database. One-shot replication is likely to provide a smoother price curve, as you're in better control of the number of requests you make. A further question is how often connections will drop because of the network disconnects which happen regularly with mobile devices. TouchDB's continuous replications will fire back up each time the user comes online (if added via the _replicator database), which is a further source of unpredictable costs.
However, the benefit of more immediate visibility of changes may well be worth the uncertainty.

Deferring blocking Rails requests

I found a question that explains how Play Framework's await() mechanism works in 1.2. Essentially, if you need to do something that will block for a measurable amount of time (e.g., make a slow external HTTP request), you can suspend your request and free up that worker to work on a different request while it blocks. I am guessing that once your blocking operation is finished, your request gets rescheduled for continued processing. This is different from scheduling the work on a background processor and then having the browser poll for completion; I want to block the browser but not the worker process.
Regardless of whether or not my assumptions about Play are true to the letter, is there a technique for doing this in a Rails application? I guess one could consider this a form of long polling, but I didn't find much advice on that subject other than "use node".
I had a similar problem with long requests that block workers from taking other requests. It's a problem for all web applications. Even Node.js may not be able to solve the problem of a worker consuming too much time, and it could simply run out of memory.
A web application I worked on has a web interface that sends requests to a Rails REST API; the Rails controller then has to call a Node.js REST API that runs a heavy, time-consuming task to get some data back. A request from Rails to Node.js could take 2-3 minutes.
We are still trying out different approaches, but maybe the following could work for you, or you can adapt some of the ideas (I would love to get some feedback too):
The frontend makes a request to the Rails API with a generated identifier [A] within the same session (this identifier helps identify previous requests from the same user session).
The Rails API proxies the frontend request and the identifier [A] to the Node.js service.
The Node.js service adds this job to a queue system (e.g., RabbitMQ or Redis); the message contains the identifier [A]. (Adapt this to your own scenario; it also assumes something will consume the queued job and save the results.)
If the same request is sent again then, depending on your requirements, you can either kill the current job with the same identifier [A] and queue the latest request, ignore the latest request while waiting for the first one to complete, or make whatever other decision fits your business requirements.
The frontend sends REST requests at an interval to check whether the data processing for identifier [A] has completed; these requests are lightweight and fast.
Once Node.js completes the job, you can either push the result via a message subscription system or wait for the next status-check request and return the result to the frontend.
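A condensed sketch of the request-creation and status-polling pieces of that flow, in Python/Flask rather than Rails purely for illustration (the in-memory dict and thread stand in for the real queue system, worker, and result store):

```python
import threading
import time
import uuid

from flask import Flask, jsonify  # third-party: pip install flask

app = Flask(__name__)
jobs = {}  # job_id -> state; a real app would use Redis/RabbitMQ + a DB

def heavy_task(job_id):
    time.sleep(120)  # stands in for the 2-3 minute Node.js call
    jobs[job_id] = {"status": "done", "data": "result goes here"}

@app.route("/jobs", methods=["POST"])
def create_job():
    job_id = str(uuid.uuid4())              # the identifier [A] above
    jobs[job_id] = {"status": "pending"}
    threading.Thread(target=heavy_task, args=(job_id,)).start()
    return jsonify(id=job_id), 202          # respond immediately; don't block

@app.route("/jobs/<job_id>")
def job_status(job_id):
    # the lightweight endpoint the frontend polls at an interval
    return jsonify(jobs.get(job_id, {"status": "unknown"}))
```

The key property is that the worker handling the original request returns as soon as the job is queued, so no browser request ever holds a worker for the full 2-3 minutes.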
You can also use a load balancer, e.g., Amazon's load balancer or HAProxy. 37signals has a blog post and video about using HAProxy to offload some long-running requests so they don't block shorter ones.
GitHub uses a similar strategy to handle long requests for generating commit/contribution visualisations. They also set a limit on polling time; if it takes too long, GitHub displays a message saying it's taking too long and has been cancelled.
YouTube has a nice message for longer queued tasks: "This is taking longer than expected. Your video has been queued and will be processed as soon as possible."
This is just one solution; you can also take a look at the EventMachine gem, which helps improve performance by handling parallel or async requests.
Since this kind of problem may involve one or more services, think about the possibility of improving performance between those services (e.g., database, network, message protocol). If caching may help, try caching frequent requests or pre-calculating results.
