How to resolve timeout issues between two applications?

My application communicates with a service that provides user login, registration, and update functionality (an IAM service). Because this feature is critical and we don't want to hurt the user experience, we set the timeout to 500 milliseconds, given that both my application and the IAM service are in the same data center.
On analysis, we found that the IAM service takes 10-12 milliseconds on average, and my application, which simply sends the request, takes 1-2 milliseconds. The timeout does not happen for every request, only for a few.
The network engineer says the network is good and there are no leaks.
How should I proceed to analyze the root cause and identify which component is taking the time?

Make sure the application and the service have synchronized clocks, so that their timestamps are comparable.
Log the timestamp when the app sends the request.
Observe the timestamp when the request hits the wire.
Log the timestamp when the service receives the request.
Log the timestamp when the service sends out the response.
Observe the timestamp when the response hits the wire.
Log the timestamp when the app receives the response.
The next time the timeout occurs, check the log to find which two adjacent timestamps differ by more than 500 ms. Once you have this profile, focus on the segment that causes the timeout.
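The comparison of adjacent timestamps can be sketched in a few lines of Python. This is only an illustration: the checkpoint labels, the sample values, and the function name slow_segments are made up, and the timestamps are assumed to be milliseconds taken from synchronized clocks.

```python
from typing import List, Tuple

def slow_segments(checkpoints: List[Tuple[str, int]],
                  threshold_ms: int) -> List[Tuple[str, str, int]]:
    """Return (from_label, to_label, gap_ms) for each adjacent pair
    of checkpoints whose gap exceeds threshold_ms."""
    slow = []
    # Pair every checkpoint with the one that follows it.
    for (prev_label, prev_ts), (next_label, next_ts) in zip(checkpoints, checkpoints[1:]):
        gap = next_ts - prev_ts
        if gap > threshold_ms:
            slow.append((prev_label, next_label, gap))
    return slow

# Hypothetical trace of one timed-out request (milliseconds).
trace = [
    ("app_send", 0),
    ("request_on_wire", 2),
    ("service_receive", 3),
    ("service_send", 14),
    ("response_on_wire", 15),
    ("app_receive", 620),
]

print(slow_segments(trace, 500))
```

With this made-up trace, the only segment reported is the one between the response hitting the wire and the app receiving it, which is where the investigation would then focus.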

Service worker update period

I'm creating a simple web app that connects to a Bluetooth device, and I want to be able to use it offline, so I use a service worker to store the app in the web cache. I know the cache is only cleared if there is no more space, but what about the service worker?
I found that its lifespan is 24 hours. My question is: how long can I use the web app without connecting to the internet? Is the cache the only problem, or does the service worker "die" after some amount of time, so that I need to connect to the internet again?
No, it does not die. You can use it forever.
You're confusing two things. The 24 hour lifespan is actually an automatic update-check interval. In other words, when a site is used frequently, the browser automatically checks for an updated service worker (specifically an updated /serviceworker.js, or wherever you store it). Your code can of course check for service worker updates programmatically and more often. Apps usually check for a new service worker every time they're launched. But the device may be offline for, say, a month, and that doesn't prevent the use of the app.

Twilio WebRTC TURN relay randomly stops working after a few minutes

I am using the Twilio Network Traversal Service as part of a native application I am working on to perform peer-to-peer remote desktop connections. We implement a subset of the WebRTC protocol stack that is equivalent to the WebRTC data channels (not the WebRTC video and audio protocols). When using a TURN relay, the TURN allocation seems to be invalidated randomly somewhere between a few minutes and a maximum of 12 minutes from the session start. This issue looks very similar to this one, but the proposed workaround (sending silent audio) is not acceptable in my case, since I do not implement the WebRTC audio/video protocols.
I have been pulling my hair out over this problem for the last two weeks, and I have isolated the issue to the Twilio service itself. To compare, I have used a web-based WebRTC data channel demo using firefox and the Xirsys TURN server cloud. I have wireshark captures showing firefox getting disconnected with Twilio just like my native application, while the exact same firefox demo doesn't get disconnected when using the Xirsys servers.
I was using Xirsys originally, but I experienced some instability with their service that made me switch to Twilio, which is why I would rather have Twilio fix this issue than go back to Xirsys. At the bare minimum, I would like to have two WebRTC hosting providers to choose from that I know should work fine. This is why I am taking the time to explain the issue in detail so it can get fixed.
Here are two wireshark captures (with the peer-to-peer data messages filtered out) showing firefox using WebRTC data channels and the Twilio TURN relay servers:
The traffic stops being relayed after 4 minutes in the first capture, and after about 11 minutes in the second capture. In both captures, firefox detects that traffic stops being relayed (at the data channel level) and attempts a graceful disconnection by sending a Refresh request packet with a lifetime of zero. Both graceful disconnections result in a 437 Allocation Mismatch error, indicating that the server doesn't even know about the allocation firefox is trying to close gracefully.
With my native application, this would often take the form of a CreatePermission Request message that fails with a 438 "Wrong nonce" error, which is basically what should happen if a client tries to update the permission on an allocation that no longer exists. The error code 438 usually means "Stale nonce", which is not really an error, but an indication that the nonce has expired and the client should try again using the new nonce contained in the "error" message. It took me a while to figure out, but even if the error code is 438, the error string is not the same. I have observed a true stale nonce error with Xirsys and successfully updated my permission with the new nonce from the error response, so I know I can properly handle this case in my implementation.
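The distinction drawn above between the two kinds of 438 responses can be sketched as a small decision function. This is only an illustration of the interpretation given in this question, written in Python; the function name, the action labels, and the substring match on the reason phrase are assumptions and do not come from any TURN library.

```python
def next_action(error_code: int, reason: str) -> str:
    """Decide what a TURN client might do after an error response."""
    if error_code == 438 and "stale" in reason.lower():
        # Genuine stale nonce: retry the same request with the
        # nonce carried in the error response.
        return "retry_with_new_nonce"
    if error_code in (437, 438):
        # Allocation Mismatch, or a 438 with an unexpected reason
        # phrase such as "Wrong nonce": assume the server no longer
        # knows the allocation (the interpretation used above).
        return "allocate_again"
    # Anything else is outside the scope of this sketch.
    return "report_error"

print(next_action(438, "Stale Nonce"))
print(next_action(438, "Wrong nonce"))
print(next_action(437, "Allocation Mismatch"))
```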
Here is the source code for the WebRTC data channel demo I have used:
https://github.com/devolutions/webrtc-demo
For comparison, here is the same firefox data channel demo using the Xirsys TURN server cloud:
In this capture, I have let the demo run for about 16 minutes (it works for much longer than that, the longest I have tried is two hours). We can see that the traffic keeps getting relayed for the entire duration of the session, and CreatePermission requests keep getting sent by firefox with success. At the end, the graceful disconnection is triggered by firefox closing the WebRTC data channel (instead of being closed due to traffic no longer being relayed). As opposed to the Twilio captures, the Refresh request with a lifetime of zero is successful: the Xirsys TURN server still knows about the allocation and sends back a success response, as expected.
It should be noted that the ICMP unreachable errors are normal because I think in this case firefox is no longer listening on the given port when the response comes back. In other words, it sends the Refresh request with a lifetime of zero and doesn't wait for the answer to come back.
For the time being, I have no other choice but to go back to Xirsys, but I would really like it if the Twilio Network Traversal Service could be fixed. Let me know if you have more questions regarding the issue.
I have uploaded the wireshark captures here for reference.
EDIT: I have modified the webrtc demo page such that it doesn't close the connection when the ice connection state is set to 'disconnected'. Now I get the real disconnection when the ice connection state goes to 'failed'. However, it effectively didn't change anything, since in this case it takes just a few seconds more for the state to go from 'disconnected' to 'failed'.
Since I have new relevant screenshots and captures, I am updating the original question to clarify certain problems pointed out by Philipp Hancke:
First, here is a new capture with the ice connection state fix (the browser closes the connection only when the state goes to 'failed'):
It's interesting to see that this time, the session stayed up for a whole 18 minutes. This was taken on a Saturday morning, so I'm guessing that the issue could be related to the current workload on the Twilio servers. However, it failed in exactly the same way as it always has for me so far. As a bonus, we even have a valid stale nonce response that is correctly handled by firefox.
However, if we take a different view of the same capture, we can see that the traffic stops being relayed for a solid 30 seconds before firefox considers the connection as being dropped and sends the Refresh request with a lifetime of zero. As in previous captures, the server responds with an Allocation Mismatch error, indicating it doesn't know which allocation firefox is talking about.
The last eight packets being sent are of the same size, so my guess is that they are retransmissions. After 30 seconds of retransmissions, it is likely that SCTP considers the transport as being dropped.
With regards to the refresh request with a lifetime of zero, I did a test where I close the connection early on, from the browser. In this case, the server recognizes the allocation and returns a success response:
The allocation mismatch is the easiest symptom to observe, but in my testing with my native application, I have seen similar errors with Refresh requests for non-zero lifetimes, and with CreatePermission requests (438 "Wrong nonce" error). However, since the browser closes the connection after 30 seconds of data not being relayed, it is hard to observe these errors with the current webrtc demo. If we could change that timeout to 10 minutes, we would see those errors as well.
Excellent problem description!
Without the server log it is hard to determine what goes wrong. I tried with the appear.in TURN servers, which run an up-to-date version of coturn, and they show the same behaviour as the Twilio servers. Xirsys seems to be running a custom version of coturn (the software field says Coturn-0.5 'Xirsys Turn Services', but coturn never had such a version).
"In both captures, firefox detects that traffic stops being relayed (at the data channel level) and attempts a graceful disconnection by sending a Refresh request packet with a lifetime of zero."
Not quite. A refresh request with a lifetime of 0 is used to discard an allocation. At that point it does not matter what the server returns as the connection is beyond repair anyway.
This is caused by peerjs closing the peerconnection if the iceconnectionstate changes to disconnected, here in your bundled library version.
This is overly aggressive (and does not even fix things), and we've had a discussion about what the specification should do with respect to trying to fix things with an ice restart here, which also links to a great explanation of the disconnected state.
The disconnected state probably happens because a few packets get lost. But this is something that can happen when there is minor congestion. I'd recommend removing the pc.close() in the disconnected case.
If you are looking for other TURN providers, Tokbox provides the same service. For datachannels the latency of a properly run distributed TURN network does not matter as much as for VoIP so you might run your own servers in a single location instead.

Objective-C - How to prevent session id reusing when app terminated?

My main question is how to detect that the end user terminated the application while it was in the background (suspended), so that a logout request can be sent to the server.
We already have a timeout interval on the server that kills the session, but if that interval is 5 minutes, the session stays alive for 5 minutes after the user terminates the app, and anyone who sniffs the data can reuse it.
Notes:
We use an HTTPS connection and SSL certificate pinning.
We also implemented a heartbeat web service that the client app calls at a fixed interval to tell the server to keep the session alive for that interval. If this web service is not called for a specific session, the server kills that session.
Once your app is suspended you don't get any further notice before you are terminated. There is no way to do what you want.
Plus, the user could suspend your app to do something else (like play a game) and then not go back to your app for DAYS.
If you want to log out when the user leaves your app, do it when the app moves to the background (for example in applicationDidEnterBackground:). Ask for more background time and send a logout right then and there.
Mohamed Amer,
Here is an approach used by the Quickblox server. I feel it's pretty solid, though it involves a little overhead.
Once the client application (either iOS or Android) establishes a session with the Quickblox server, the server expects the client application to keep sending presence information at a regular interval.
Sending the presence information is simple. They provide an API that we keep hitting every 5 minutes with the session id we hold. They validate the session id and, if it is valid, extend the expiration time for the user associated with that id by 5 more minutes.
What I believe they do is one of the following:
Approach 1: They maintain the last hit time, and for every subsequent request they check whether the request arrives within the 5 minute time frame. If yes, they simply process it. If the request comes after 5 minutes, they delete the session id for the user and respond that the session has timed out.
Approach 2: Because they provide online and offline info as well, they can't simply depend on an incoming request to delete the session id from the server. So they probably run a background thread that sweeps over the DB, finds the entries whose last hit time is more than 5 minutes old, removes them from the DB, and declares those user sessions expired.
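Both approaches can be sketched together in Python. This is a guess at the general pattern, not Quickblox's actual implementation: the class name SessionRegistry, its methods, the in-memory dictionary standing in for the DB, and the injectable clock are all assumptions made for illustration.

```python
import time

class SessionRegistry:
    """In-memory stand-in for the server-side session table."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so it can be faked in tests
        self._last_seen = {}         # session id -> time of last hit

    def open(self, session_id):
        self._last_seen[session_id] = self._clock()

    def heartbeat(self, session_id):
        """Approach 1: check on each request; extend or expire."""
        last = self._last_seen.get(session_id)
        if last is None:
            return False             # unknown or already expired
        if self._clock() - last > self._ttl:
            del self._last_seen[session_id]
            return False             # request arrived too late
        self._last_seen[session_id] = self._clock()
        return True

    def sweep(self):
        """Approach 2: periodic cleanup of idle sessions."""
        now = self._clock()
        expired = [sid for sid, last in self._last_seen.items()
                   if now - last > self._ttl]
        for sid in expired:
            del self._last_seen[sid]
        return expired
```

In a real server, sweep would be called from a background thread or scheduled job, and the dictionary would be replaced by the session table in the database.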
This does mean that client apps continuously hit the server, which increases the load on it. But for an app like a chat application, in which presence information is vital, I believe this overhead is acceptable.
Hope I have provided you with some idea at least :)

POST request taking longer than timeout causing duplicates

I have an iOS application where I POST transactions to an API each time a transaction is completed. Once I get a 200 response code from the server I update an attribute on the transaction:
newTransaction.Synced = true
In case the network connection ever drops, I also POST every transaction where Synced = false when Reachability detects a network connection.
In perfect network conditions this works well. However, when I enable the Network Link Conditioner on my iPad and set packet loss to, say, 40%, I start to see duplicated transactions on my server. What I assumed was happening is that, due to the high packet loss, it was taking longer than 30 seconds (the client-side timeout on the request) to send my request and get the response from the server.
To confirm this, I made my API Sleep for 40 seconds for each web request and disabled Network Link Conditioner. As expected, the iOS app never set the Synced attribute to true as it was timing out before it got the response. However the server still created the entity for each POST request that was generated each time the iOS app launched or got network connectivity.
What's the best way to handle this situation so that duplicates never occur? I did think of adding a GUID to the transaction and then coding the API not to re-add the transaction if the GUID already exists. However the flip side is the iOS app would still never know the transaction has successfully synced. Is there a better way to handle this? Perhaps a timeout on the request which the server also adheres to?
Your idea of assigning a GUID to each transaction is good, but you may also need to maintain a table on the client side (in the app's local storage) that records every call you made to the server and never heard back from.
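A minimal sketch of the GUID idea, in Python for brevity: the server treats the GUID as an idempotency key and answers a repeated GUID with the same success response, so a retry after a lost response both avoids a duplicate and lets the client finally mark the transaction as synced. The names TransactionStore and sync_pending, and the dictionary-based records, are hypothetical stand-ins for the real API and the app's local table.

```python
import uuid

class TransactionStore:
    """Server-side stand-in: one record per client GUID."""

    def __init__(self):
        self._by_guid = {}

    def post(self, guid, payload):
        # A repeated GUID returns the stored record instead of
        # creating a duplicate.
        if guid in self._by_guid:
            return 200, self._by_guid[guid]
        self._by_guid[guid] = payload
        return 200, payload

    def count(self):
        return len(self._by_guid)

def sync_pending(store, pending):
    """Client side: retry every unsynced transaction and mark it
    synced once the server answers 200."""
    for tx in pending:
        if tx["synced"]:
            continue
        status, _ = store.post(tx["guid"], tx["payload"])
        if status == 200:
            tx["synced"] = True

store = TransactionStore()
tx = {"guid": str(uuid.uuid4()), "payload": {"amount": 10}, "synced": False}

# First attempt: the server stores the record, but assume the
# response is lost to a client-side timeout.
store.post(tx["guid"], tx["payload"])

# Retry when connectivity returns.
sync_pending(store, [tx])
print(store.count(), tx["synced"])
```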

Change timeout for Parse requests

In the iOS Developers Guide for Parse, it states "By default, all connections have a timeout of 10 seconds." I am looking to change this for all requests made from the app, but am not finding any information on how to do so.
The reason we'd like to modify this is that it's taking a long time for requests to fail when the user doesn't have Wi-Fi or Cellular enabled. We want to reduce the amount of time it takes to receive said error message, just a little. We don't want to implement our own reachability tests, as it will result in duplicate popup error messages and we have many requests in various view controllers throughout the app.
Can the timeout be modified, or is there some other way to obtain a better user experience than waiting 10 seconds for an error message?
There is no information on this, but the request timeout limits are most likely set by Parse, and a developer will not be able to change them. I think they kept the timeouts long to avoid rejecting a user's request when the connection suddenly becomes intermittent, the user goes into a tunnel, etc.
You can try to wrap Parse queries in a timer that uses, let's say, a 5 second timeout. If the response does not come in that time, cancel your query using the PFQuery cancel function and show the user a message.
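The timer-around-a-query pattern is language neutral. Here is a minimal Python sketch using asyncio, where slow_query is a made-up stand-in for the network call and has nothing to do with the Parse SDK; on iOS the equivalent would be a timer that calls cancel on the PFQuery.

```python
import asyncio

async def slow_query(delay_s):
    # Stand-in for a network query; not a Parse API.
    await asyncio.sleep(delay_s)
    return "result"

async def query_with_deadline(delay_s, deadline_s):
    """Wait for the query at most deadline_s seconds."""
    try:
        return await asyncio.wait_for(slow_query(delay_s), timeout=deadline_s)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the pending query here;
        # this is the place to show the user a message.
        return None

print(asyncio.run(query_with_deadline(0.01, 0.5)))
print(asyncio.run(query_with_deadline(0.5, 0.01)))
```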
If you want to avoid timing out, consider checking Reachability before making the call. You may want to show the user an alert if they're not connected and you need to do something.
A lot of people say you should just assume a connection and make the attempt without checking reachability; basically, just let the connection fail and handle the error that way. I think that as long as the failure isn't invisible to the user, so that they don't blame your app instead of their network, you're fine.
