We suddenly experience a problem that somehow seems to be related to iOS 14 as we haven't had those errors in prior versions.
At app start, we do quite a lot of network requests to different webservices. This sums up to 158 GET, POST and PUT requests until a user is fully logged in. The app uses 260MB of memory until then. When a user switches to a different account, the login process starts again and another 158 requests are sent out. Now if the user again decides to login with a new account, the login procedure starts yet again. But this time, network requests randomly start canceling with such error messages:
Error Domain=NSPOSIXErrorDomain Code=28 "No space left on device" UserInfo={_NSURLErrorFailingURLSessionTaskErrorKey=LocalDataTask <EA2DAE7D-F7AD-4979-8215-E716163FA725>.<1>, _kCFStreamErrorDomainKey=1, _NSURLErrorRelatedURLSessionTaskErrorKey=(
"LocalDataTask <EA2DAE7D-F7AD-4979-8215-E716163FA725>.<1>"), _kCFStreamErrorCodeKey=28}
So within a timeframe of two minutes and approximately 400 - 500 HTTP requests, the network layer starts canceling those due to a lack of memory space. Although it could use up 3GB.
The app network logic hasn't changed much before and after we started experiencing such errors. We are also using one SessionManager instance only. It seems to me as if the network stack would start drowning by the number of requests and therefore starts canceling them. Perhaps iOS 14 became more strict in such regards? Has anyone else possibly experienced a similar issue?
We use AFNetworking on our basic network layer.
Any help is much appreciated.
After a while of investigation, it turned out to be a library we're using for debugging purposes. This library, DBDebugToolkit, comes with a network logger feature which we had turned on by default. In situations of high traffic, network logger quickly increased memory usage until our requests got canceled. Now it's turned off by default, but can be switched on in our debugging menu.
Related
I'm currently developing a react native app in combination with a device. The device and the app communicate via BLE. So far everything works as expected but I'm having issues with the connection stability of the iOS app and the device. What would happen is that the device would connect and I can update some characteristics but it would regularly either disconnect with a CBErrorDomain 7 or the response for a write would timeout. The implementation on the app or device side does not seem to be the problem as Android works stable and the device also disconnects when connecting with the LightBlue app.
I've already updated the BLE connection parameters as suggested here:
https://developer.apple.com/library/archive/qa/qa1931/_index.html.
This has increased the stability but did not resolve the problems completely. I've tried playing around with the values but so far no luck.
The current set of parameters we are using are:
conn_min_interval: 15
conn_max_interval: 15
conn_latency: 0
supervision_timeout: 2000
adv_min_interval: 1285
adv_max_interval: 1285
My question now would be if somebody has an idea what other things I could check or which parameter to tune?
Are you checking the maximumWriteValueLength and making sure your writes are smaller than this? A likely cause of your problems is overwhelming the device and it fails to keep up with sending ACKs. What version of Bluetooth does your device support, and does it implement DLE (Data Length Extension)?
Your conn_min_interval and conn_max_interval are suspicious. Asking for 15ms with no leeway is likely to negotiate to 30ms instead. (See 41.6 Connection Parameters) Is your device comfortable with being re-negotiated to something other than 15ms? Can your device actually keep up with 15ms and no connection latency if it does get that? I'm betting it can't. Try setting your connection interval to 30ms (or even a bit slower), and you might even try setting your connection latency to 1 to make the connection a bit more forgiving (though I'd focus more on slowing the CI than increasing latency; increasing latency would be more of a hack in this case).
All my suspicions are around your peripheral not keeping up with its side of the connection. If you have any synchronous activities in response to the data, you need to make sure that it's not blocking your BLE stack from sending the required responses.
Finally, I've found the answer to my problem. The problem was that the pairing procedure of the BLE server was faulty and thus iOS was unable to have a stable connection. Now that this is fixed the connection is very stable.
I'm still unsure why iOS was able to have any communication at all without the pairing but I hope that this helps some people in the future.
My iOS app allows users to edit and save images and videos. The edited media are saved with calls to PHAssetChangeRequest.creationRequestForAssetFromVideo(atFileURL:) etc. within a PHPhotoLibrary's performChanges block.
These calls frequently fail with the error code 41002 (Domain: com.apple.photos.error). Its localizedDescription is:
Unable to obtain assetsd XPC proxy for getPhotoKitServiceWithReply:. assetsd could have crashed
What does the error mean? I tried to search for the error code, the domain and the keywords from the description, but couldn't find anything. Is there an official reference for the errors in this domain? What error message should I show to the user?
An Apple engineer has provided me the following explanation:
The app could not communicate with a helper application (i.e. the photo library). The most common reason that assetsd might crash is because the foreground app is using a lot of memory (or allocating memory very quickly), and the system decides to terminate assetsd to alleviate some of that memory pressure.
To prevent these errors, you should make sure that your app is not triggering a high memory pressure situation.
This should only be a temporary issue. In other words, if you tried to submit the same request later, it may work (assuming that any high memory pressure situation has been resolved). So, I recommend that you queue up requests that fail and try to submit them later, and inform your user of this in some way, so that they know the save has not gone through yet.
This same issue can also create an error with the code 4099, domain NSCocoaErrorDomain and the description:
Couldn’t communicate with a helper application.
I am using the Twilio Network Traversal Service as part of a native application I am working on to perform peer-to-peer remote desktop connections. We implement a subset of the WebRTC protocol stack that is equivalent to the WebRTC data channels (not the WebRTC video and audio protocols). When using a TURN relay, the TURN allocation seems to be invalidated randomly somewhere between a few minutes and a maximum of 12 minutes from the session start. This issue looks very similar to this one, but the proposed workaround (sending silent audio) is not acceptable in my case, since I do not implement the WebRTC audio/video protocols.
I have been pulling my hair on this problem for the last two weeks, and isolated the issue as being the Twilio service itself. To compare, I have used a web based WebRTC data channel demo using firefox and the Xirsys TURN server cloud. I have wireshark captures showing firefox getting disconnected with Twilio just like my native application, while the exact same firefox demo doesn't get disconnected when using the Xirsys servers.
I was using Xirsys originally, but I experienced some instability with their service that made me switch to Twilio, which is why I would rather have Twilio fix this issue instead of going back with Xirsys. At the bare minimum, I would rather have two WebRTC hosting providers I can choose from that I know should work fine. This is why I am taking the time to explain the issue in detail so it can get fixed.
Here are two wireshark captures (with the peer-to-peer data messages filtered out) showing firefox using WebRTC data channels and the Twilio TURN relay servers:
The traffic stops being relayed after 4 minutes in the first capture, and after about 11 minutes in the second capture. In both captures, firefox detects that traffic stops being relayed (at the data channel level) and attempts a graceful disconnection by sending a Refresh request packet with a lifetime of zero. Both graceful disconnections result in a 437 Allocation Mismatch error, indicating that the server doesn't even know about the allocation firefox is trying to close gracefully.
With my native application, this would often take the form of a CreatePermission Request message that fails with a 438 "Wrong nonce" error, which is basically what should happen if a client tries to update the permission on an allocation that no longer exists. The error code 438 usually means "Stale nonce", which is not really an error, but an indication that the nonce has expired and the client should try again using the new nonce contained in the "error" message. It took me a while to figure out, but even if the error code is 438, the error string is not the same. I have observed a true stale nonce error with Xirsys and successfully updated my permission with the new nonce from the error response, so I know I can properly handle this case in my implementation.
Here is the source code for the WebRTC data channel demo I have used:
https://github.com/devolutions/webrtc-demo
For comparison, here is the same firefox data channel demo using the Xirsys TURN server cloud:
In this capture, I have let the demo run for about 16 minutes (it works for much longer than that, the longest I have tried is two hours). We can see that the traffic keeps getting relayed for the entire duration of the session, and CreatePermission requests keep getting sent by firefox with success. At the end, the graceful disconnection is triggered by firefox closing the WebRTC data channel (instead of being closed due to traffic no longer being relayed). As opposed to the Twilio captures, the Refresh request with a lifetime of zero is successful: the Xirsys TURN server still knows about the allocation and sends back a success response, as expected.
It should be noted that the ICMP unreachable errors are normal because I think in this case firefox is no longer listening on the given port when the response comes back. In other words, it sends the Refresh request with a lifetime of zero and doesn't wait for the answer to come back.
For the time being, I have no other choice but to go back with Xirsys, but I would really like if the Twilio Network Traversal Service could be fixed. Let me know if you have more questions regarding the issue.
I have uploaded the wireshark captures here for reference.
EDIT: I have modified the webrtc demo page such that it doesn't close the connection when the ice connection state is set to 'disconnected'. Now I get the real disconnection when the ice connection state goes to 'failed'. However, it effectively didn't change anything, since in this case it takes just a few seconds more for the state to go from 'disconnected' to 'failed'.
Since I have new relevant screenshots and captures, I am updating the original question to clarify certain problems pointed out by Philipp Hancke:
First, here is a new capture with the ice connection state fix (the browser closes the connection only when the state goes to 'failed'):
It's interesting to see that this time, the session stayed up for a whole 18 minutes. This was taken on a saturday morning, so I'm guessing that the issue could be related to the current workload on the twilio servers. However, it failed in the exact same way as it always does so far for me. As a bonus, we even have a valid stale nonce response that is correctly handled by firefox.
However, if we take a different view of the same capture, we can see that the traffic stops being relayed for a solid 30 seconds before firefox considers the connection as being dropped and sends the Refresh request with a lifetime of zero. As in previous captures, the server responds with an Allocation Mismatch error, indicating it doesn't know which allocation firefox is talking about.
The last eight packets being sent are of the same size, so my guess is that they are retransmissions. After 30 seconds of retransmissions, it is likely that SCTP considers the transport as being dropped.
With regards to the refresh request with a lifetime of zero, I did a test where I close the connection early on, from the browser. In this case, the server recognizes the allocation and returns a success response:
The allocation mismatch is the easiest symptom to observe, but in my testing with my native application, I have seen similar errors with Refresh requests for non-zero lifetimes, and with CreatePermission requests (438 "Wrong nonce" error). However, since the browser closes the connection after 30 seconds of data not being relayed, it is hard to observe these errors with the current webrtc demo. If we could change that timeout to 10 minutes, we would see those errors as well.
Excellent problem description!
Without the server log this is hard to determine what goes wrong. I tried with the appear.in TURN servers which run an up-to-date version of coturn and show the same behaviour as the Twilio servers. Xirsys seems to be running a custom version of coturn (Coturn-0.5 'Xirsys Turn Services' from the software field but coturn never had such a version).
In both captures, firefox detects that traffic stops being relayed (at the data channel level) and attempts a graceful disconnection by sending a Refresh request packet with a lifetime of zero.
Not quite. A refresh request with a lifetime of 0 is used to discard an allocation. At that point it does not matter what the server returns as the connection is beyond repair anyway.
This is caused by peerjs closing the peerconnection if the iceconnectionstate changes to disconnected, here in your bundled library version.
This is overly aggressive (and does not even fix things) and we've had a discussion about what the specification should do wrt to trying to fix things with an ice restart here which also links to a great explanation of the disconnected state.
The disconnected state probably happens because a few packets get lost. But this is something that can happen when there is minor congestion. I'd recommend removing the pc.close() in the disconnected case.
If you are looking for other TURN providers, Tokbox provides the same service. For datachannels the latency of a properly run distributed TURN network does not matter as much as for VoIP so you might run your own servers in a single location instead.
My iOS app loads images from an nginx HTTP server. After I send 400+ such requests the networking 'gets stuck' and all subsequent HTTP requests result in "The request timed out" error. I can make the images load again only when I restart the app.
Details:
I am using NSURLSession.sharedSession().dataTaskWithURL to send four hundred HTTP GET requests to jpeg files.
Requests are sent sequentially, one after another. The interval between requests is 10 ms.
Each previous unfinished request is cancelled with cancel() method of NSURLSessionDataTask object.
Interestingly:
I can only have this issue with HTTPS requests and when SPDY is enabled on the server.
Non-secure HTTP requests work fine.
Non-SPDY HTTPS requests work fine. I tested it by turning SPDY off on the server side, in the nginx config.
Problem appears both on iOS 8 and 9, on physical device and in the simulator. Both on Wi-Fi and LTE.
When I look at nginx access logs, I can still see the 'stuck' requests coming in. Important nuance: the request log record appears at the exact moment when the iOS app is giving up on it after the time out period ends.
I was hoping to analyze HTTP requests with Charles Proxy but the problem cures itself when requests go through Charles. That is - everything works with Charles, much like effect in quantum mechanics when the fact of looking influences the outcome.
I was able to reproduce the issue when the iOS app connected to two different servers with vastly different nginx configurations. This probably means that the issue is not related to a particular nginx setup.
I analyzed the app using "Activity Monitor" instrument. The number of threads it is using during the bulk HTTP requests jumps from 5 to 10. In comparison, when I send just a single HTTP requests the number of threads jumps to 8. CPU load rarely goes above 30%.
What can be the cause of the issue? Can anyone recommend other ways or tools for analysing and debugging it?
Analysing with scheduling instrument
Demo app
This demo app reproduces the issue 100% of the time for me.
https://github.com/exchangegroup/ImageLoadDemo
Versions and settings
My nginx config: http://pastebin.com/pYYjdxfP
OS X: 10.10.4 (14E46), iOS: 8 and 9, Xcode: 7.0 (7A218), nginx: 1.9.4
Not ideal workaround
I managed to keep requests working only if I create a new NSURLSession for each individual request and clear the previous session with finishTasksAndInvalidate or invalidateAndCancel.
// Request 1
let configuration = NSURLSessionConfiguration.defaultSessionConfiguration()
let session = NSURLSession(configuration: configuration)
session.dataTaskWithURL ...
// Request 2
// clear the previous request
session.finishTasksAndInvalidate()
let session2 = NSURLSession(configuration: configuration)
session2.dataTaskWithURL ...
One possibility is that iOS started sending the request, and then packet loss prevented the headers and request body from being fully delivered.
Another possibility that comes to mind is that your server may not be logging the request until it actually finishes trying to deliver it, which would make the time stamps in the server logs line up with when the connection was closed, rather than when it was opened. (IIRC, that's what Apache does; I haven't worked with nginx, so I can't speak for its behavior.) If that's the case, then this is just a simple connection stall. As for why it is stalling, I couldn't guess.
Does the problem occur exclusively for HTTPS traffic? If you can reproduce it with HTTP, you don't need Charles Proxy; just use OS X's "Internet Sharing" feature, and capture the packets with tcpdump or wireshark, listening on the bridge interface. If you can't reproduce it with HTTP, my money would be on a problem with fetching the CRLs or performing the OCSP check while validating the server's certificate.
Is your app ending up with a huge number of threads as a result of excessive async dispatching to new queues, by any chance? Because that could easily cause all sorts of odd misbehavior.
How long is the timeout? If it is too short, your app might simply be running up against performance limitations of the hardware while processing the results of 400 requests delivered in only four seconds.
Also, are you trying to schedule these requests simultaneously? Because I seem to recall reading about a bug that causes NSURLSession to hit a brick wall if you start too many tasks in a single session at the same time. You might try adding tasks only after the number of tasks in a session drops below some threshold and see if that fixes the problem.
I've encountered a pretty large issue and have been trying to find a solution for 2 months now with no luck. I've submitted it as a bug, ( https://bugzilla.xamarin.com/show_bug.cgi?id=4910 ) but was hoping maybe someone here could shed some light on the cause of the problem, or suggest a work-around.
In a nutshell, to encounter the error:
Create a basic .Net socket connection between two devices
Create and initialize a GameKit.GKSession object on a least one device.
What occurs is the transfer of the data on the .NET socket becomes erratic and too slow to be usable. I've performed many tests across different devices (see link below) and it affects all of them (iPad 3 affected the least). I've tested it between an iPhone and a Windows PC and it still occurs. MonoTouch's GameKit code is somehow affecting the Socket code.
As you can see from the spreadsheet, speed drop from a few milliseconds to send 1 MB to several minutes to forever.
As soon as the GameKit.GKSession is set to null, any backedlogged data on the socket flows freely again and the sockets act normally once more.
Sample Windows and iOS/MonoTouch Apps demonstrating problem: https://dl.dropbox.com/u/8617393/SocketBug/SocketBug.zip
Test results across different devices (PDF Spreadsheet):
https://dl.dropbox.com/u/8617393/SocketBug/SocketBugTestResults.pdf
This issue seems so hard even Apple decided to circumvent it: http://developer.apple.com/library/ios/#qa/qa1753/_index.html#//apple_ref/doc/uid/DTS40011315
The revealing sentence is this: "This change was made to reduce interference with Wi-Fi."
To assist with anyone that comes across this problem, the issue only manifests itself when the GKSession.Available is set to true. This is required for the device to be discoverable, but not to maintain a connection. With this in mind, a temporary work-around can be used where the GKSession.Available is only set to true when required, and set to false as soon as the connection has been made.
In my scenario, I have a socket connection to the server on both devices. I use this connection to send a message from device A to device B, asking it to become discoverable via bluetooth for the next 10 seconds. As soon as the message is sent, device A begins looking for device B and connects as soon as it finds it. Once the connection is established (or 10 seconds elapses without a connection), device B turns off discoverability and the socket behaves as normal.
In my circumstances, this is an acceptable (pending no real fix) workaround to the problem.