HttpsURLConnection POST request of a large file throws IOException after several minutes

I've been working on this bug for several days now and haven't been able to solve it.
I wrote an HttpsURLConnection client to upload large files (>1 GB) using POST requests.
I also implemented the server side using com.sun.net.httpserver.HttpServer.
As the files are quite big, I have to use setFixedLengthStreamingMode or setChunkedStreamingMode on my connection
(the bug reproduces with either).
Please note that I'm using an HTTPS connection for the upload as well.
I'm uploading the file to several servers simultaneously (a separate thread for each HTTP client, each connected to a different server).
I have set a limit on concurrent uploads, so at any time only X threads have an open URLConnection (the bug was reproduced with X=[1..4]).
(The other threads wait on a semaphore.)
My problem is this:
When the uploads take less than 5 minutes (less than 4:50 minutes, to be accurate), everything works just fine.
If the first batch of threads takes more than 5 minutes to finish, an exception is thrown for every active thread:
java.io.IOException: Error writing request body to server
at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.checkError(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection$StreamingOutputStream.write(Unknown Source)
The exception is thrown while trying to write to the HttpURLConnection output stream.
[outputStream.write(buffer,0,len);]
The next batch of threads works just fine (even if they take more than 5 minutes).
Please note that the servers are completely identical, and the second batch does not fail, which leads me to think that the problem is not on the server side.
(If it were, the second batch would be expected to fail after 5 minutes as well...)
I have reproduced this issue with/without connect/read timeouts on the connection.
Furthermore, on the server side I've seen that the file is created and keeps growing until the exception occurs.
About 20-40 seconds after the client throws the exception, the server throws an IOException "read timeout".
I have captured a TCP/IP trace using Wireshark and saw that the server sends me a FIN packet at about the time of the client exception. I have no idea why.
(The connection seems to function normally prior to that.)
I have read many threads on similar issues but couldn't find a proper solution
(including Using java.net.URLConnection to fire and handle HTTP requests).
Any ideas on why this is happening?
How can I find the cause of it?
How can I solve it?
Many Thanks.
P.S.
I didn't publish the code because it is pretty long...
But if it would help in understanding my problem, I will be glad to do so.
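Since the original code was not posted, here is a minimal sketch of the kind of streaming upload loop the question describes, only to make the failing write call concrete. The class name StreamingUploadSketch, the methods copy and upload, and the 64 KB buffer size are all assumptions for illustration and are not the poster's code. It uses setFixedLengthStreamingMode; timeouts, TLS configuration and the semaphore that limits concurrent uploads are left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not the original poster's code.
class StreamingUploadSketch {

    // Copies everything from in to out in bufferSize chunks and returns the byte count.
    // This is the loop around outputStream.write(buffer, 0, len) mentioned in the question.
    static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
            total += len;
        }
        return total;
    }

    // Streams a file as the body of a POST request and returns the HTTP status code.
    // For an https:// URL, openConnection() returns an HttpsURLConnection,
    // which is a subclass of HttpURLConnection.
    static int upload(URL url, Path file) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            // Declare the exact body length up front so the body is not buffered in memory.
            conn.setFixedLengthStreamingMode(Files.size(file));
            try (InputStream in = Files.newInputStream(file);
                 OutputStream out = conn.getOutputStream()) {
                copy(in, out, 64 * 1024); // buffer size is an arbitrary assumption
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

In a setup like the one described, each upload thread would call something like StreamingUploadSketch.upload(url, file) after acquiring the semaphore, and the reported IOException would surface from the out.write call inside copy.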

Related

Random and occasional network error (NSURLErrorDomain Code=-1001 and NSURLErrorDomain Code=-1005)

The last couple of days I've tried to debug a network error from d00m. I'm starting to run out of ideas/leads and my hope is that other SO users have valuable experience that might be useful. I hope to be able to provide all relevant information, but I'm not personally in control of the server environments.
The whole thing started by users noticing a couple of "network errors" in our app. The error seemed to occur randomly, without any noticeable pattern related to internet connectivity, iOS version or backend updates. The two errors that occur behind the scenes are:
Error Domain=NSURLErrorDomain Code=-1001 "The request timed out."
and more frequently:
Error Domain=kCFErrorDomainCFNetwork Code=-1005 "The network connection was lost."
After debugging it for a couple of days, I've managed to reproduce these errors (occurring at random) by firing approx. 10 random (GET and POST) requests towards our backend with a random sleep timer between each request (set at 1-20 seconds). However, it only occurs in periods. What I've experienced the last couple of days is that when a "period of error" starts, I get one of the two errors every once or twice I run the code (meaning an error rate of 1/10 or 1/20 requests). This error rate continues for a couple of hours and then the error disappears for a couple of hours and then it starts all over.
Some quick facts about the setup:
Happens on device and simulator
Happens on iOS 8.4 and iOS 7.1 - although v. 8.4 is the main one I use for testing.
We use NSURLSession for our network requests. We also have AFNetworking included (updated to latest version), but we only use the Security part for SSL Pinning. Even with SSL pinning totally turned off, the error still occurs.
Some findings I've written down during the last couple of days:
It seems to only happen on our production environments, which have a somewhat different configuration from our staging environments. This led me to think that it might be related to the keep-alive bug as discussed here and here. However, our ops department has set up a new staging environment sending the same keep-alive header as the production environments, but this did not make the error occur on the staging environment.
The Android version of our app was unable to reproduce the error using the same setup of requests. Further, we've not received any customer reports of "network errors" in the Android app.
My gut feeling says that it's related to the server environment and the HTTP implementation in iOS. I'm however unable to track down a convincing pattern that proves anything. I've made the same setup using a simple Rails script, and when the next "error period" occurs, I'll be ready to try and reproduce it outside of iOS land. I'll update the question when this happens.
I'm not looking for solutions involving resetting wifi settings, shutting down the simulator or similar, as I do not see these as feasible solutions in a production environment. I've also considered implementing the retry-loop fix mentioned in the GitHub issue, but I see this as a last resort.
Please let me know if you need any more information.
In my experience, those sorts of problems usually point to massive packet loss, particularly over a cell network, where minor variations in multipath interference and other issues can make the difference between reliably passing traffic and not.
Another possibility that comes to mind is poor-quality NAT implementations, in the unlikely event that your server's timeout interval is long enough to cause the NAT to give up on the TCP connection.
Either way, the only way to know for sure what's happening is to take a packet trace. To do that, connect a Mac to the Internet via a wired connection, enable network sharing over Wi-Fi, and connect the iOS device to that Wi-Fi network. Then run Wireshark and tell it to monitor the bridge interface. Instructions here:
http://www.howtogeek.com/104278/how-to-use-wireshark-to-capture-filter-and-inspect-packets/
From there, you should be able to see exactly what is being sent and when. That will probably go a long way towards understanding why it is failing.
OK, I lost a lot of time investigating a similar issue.
-1005 could be caused by a known iOS bug, and there are a couple of fixes. For example, add the header "Connection" with the value "close".
More info
-1001 is a different story. In my case the problem was a strange (bad?) firewall on the server. It was banning the device when there were many (actually not that many) requests in a short period of time.
I believe you can run an easy test to check whether you are facing a similar issue:
Send a lot of requests in a loop (how many depends on the firewall settings; let's say 50 in 1 second).
Close/kill the app (this will close the connections to the server).
(OPTIONAL) Wait a while (let's say 60 seconds).
Start the app again and try to send a request.
If you now get a timeout for all subsequent requests, you probably have the same issue and should talk to the server guys.
PS: If you don't have access to the server, you can tell the user to restart Wi-Fi on the device to get out of that timeout loop. It could be a last resort in some cases.

Recreating Thrift clients in iOS

In short, our project uses a Thrift server and mobile clients with multiplexing.
While developing the iOS client, I encountered a strange problem:
When I first create the client and make calls, everything is OK and works as expected.
Since there is no close method for the Cocoa Thrift client, I am hoping ARC will take care of it.
After some time, I create another client for the same service and do the same things, but this time, when I make a service call, the client hangs and after some time it throws a "'TTransportException', reason: 'Cannot read. Remote side has closed.'".
On the server, the operation completes successfully and the value is returned.
Does anybody have an idea about what I am doing wrong?
Thanks in advance!
Reading your question, I remembered that we encountered a very similar problem in a very different environment. If ARC takes care of your client and closes the connection, and especially the port, then recreating the client with the same port might be the root of your problem. Reopening the same port shortly after closing it can take a very long time (minutes), depending on timeouts.
Sorry, no real answer to your problem, but maybe a hint on where to look.

end of file reached EOFError (Databasedotcom + Rails + Heroku)

After much frustration trying to figure it out myself, I am reaching for SO guys (you!) to help me trace this formidable error:
Message: end of file reached EOFError Backtrace:
["/app/vendor/ruby-1.9.3/lib/ruby/1.9.1/openssl/buffering.rb:174:in
`sysread_nonblock
Background: My app is a Rails 3 app hosted on Heroku and is 100% back end. It uses Redis/Resque workers to process its payload, received from Salesforce using the Chatter REST API.
Trouble: Unlike other similar EOF errors in HTTPS/OpenSSL in Ruby, my error happens very randomly (I can't yet predict when it will come up).
Usual Suspects: The error shows up quite frequently when I create 45 Resque workers and try to sync data from 45 different Salesforce Chatter REST API connections all at once! It's so frequent that 20% or more of my processing fails, all due to this error.
Remedy Steps:
I am using the Databasedotcom gem, which uses HTTPS and follows all the required steps to create a sane HTTPS connection.
So...
Use SSL set in HTTPS - checked
URI Encode - checked
Ruby 1.9.3 - checked
HTTP read timeout is set to 900 (15 minutes)
I retry on this EOF error a maximum of 30 times, sleeping 30 seconds before each retry!
Still, it fails for some of the data.
Any help here please?
Have you considered that Salesforce might not like so many connections at once from a single source, and that you are being blocked by a DDoS preventer?
Also, setting those long timeouts is quite useless. If the connection fails, drop it and reschedule a new one yourself. That is what Resque does fine, it will add up those waiting times if it keeps having problems...

Neo4j server seems to drop connection when processing for 200 seconds

I've been writing a Neo4j server extension (as described here, i.e. a managed server extension: http://docs.neo4j.org/chunked/stable/server-plugins.html). It should just get a string via POST (which in production would hold information the extension should process, but this is of no further concern here). I tried the extension with Neo4j 1.8.2 and 1.9.RC2; the outcome was the same.
Now my problem is that sometimes this extension does quite a lot of work which can take a couple of minutes. However, after exactly 200 seconds, the connection gets lost. I'm not absolutely sure what is happening, but it seems the server is dropping the connection.
To verify this behavior, I wrote a new, almost empty server extension which does nothing but wait 5 minutes (via Thread.sleep()). From a test client class, I POST some dummy data. I tested with Jersey, Apache HttpComponents and plain Java URL connections. Jersey and plain Java do a retry after exactly 200 seconds; HttpComponents throws "org.apache.http.NoHttpResponseException: The target server failed to respond".
I think it's a server issue, first because the exception seems to stand for that in this context (there's a comment saying so in the HttpComponents code) and second because when I set the connection timeout and/or socket timeout to values lower than 200 seconds, I just get normal timeout exceptions.
Now there's one more thing on top of that: I said I would POST some data. Apparently this whole behavior depends on the amount of data sent. I have pinned it down far enough to say that when sending a string of ca. 4500 characters, the described behavior does NOT happen; everything is fine and I get an HTTP 204 "No Content" response, which is correct.
As soon as I send ca. 6000 characters or more, the mentioned connection drop occurs. The string I'm sending here is only dummy data. The sent string is only a sequence of 'a', i.e. "aaaaaaaa..." created with a for loop with 4500 or 6000 iterations, respectively.
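For illustration, the plain-Java variant of the test client described above might look roughly like the following sketch. The class name DummyPostSketch, the method names, the text/plain content type and the timeout parameters are assumptions; the extension's URL is not given in the question, so it is passed in by the caller.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the test client, not the original code.
class DummyPostSketch {

    // Builds a string of the given length consisting only of 'a',
    // e.g. 4500 or 6000 characters as in the question.
    static String dummyPayload(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append('a');
        }
        return sb.toString();
    }

    // POSTs the body to the given URL and returns the HTTP status code.
    // Timeouts are in milliseconds; 0 means wait indefinitely.
    static int post(URL url, String body, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try {
            conn.setConnectTimeout(connectTimeoutMs);
            conn.setReadTimeout(readTimeoutMs);
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "text/plain; charset=UTF-8"); // assumed type
            try (OutputStream out = conn.getOutputStream()) {
                out.write(bytes);
            }
            // Blocks until the server responds or the read timeout expires.
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling post with dummyPayload(4500) and then with dummyPayload(6000) against the sleeping extension would be one way to compare the two cases. With a read timeout below 200 seconds, a java.net.SocketTimeoutException would be expected instead of the connection drop.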
In my production code I would really like to wait until the server operation has finished, but I don't know how to prevent the connection drop.
Is there an option to configure on the Neo4j server (I looked but didn't find anything), or is it not the server's fault and my clients are doing something wrong? A bug somewhere?
Thanks for reading and any help or hints!
Just to wrap this up: I eventually found out that there exists a default timeout constant in Jetty (version 6.x was used by Neo4j back then, I think) set to exactly 200 seconds. This could be changed using the Jetty API but the Neo4j server did not appear to offer any possibility to configure this.
Changing to Neo4j 2.x eventually solved the issue (why exactly is unknown). With those newer versions of Neo4j the issue did not come up anymore.

IIS7 "Not enough storage is available to process this command" intermittent error in HTTPHandler

So, I have a web site that serves videos via an HTTP handler, which is our security layer. Some clients have reported that the videos are not working intermittently. I was finally able to reproduce the issue, and our logging code reports success in validating the user, then the line:
Response.WriteFile(filename); // Where this is the path to a video of about 32 MB
throws the above exception. I found the actual error by viewing the request and response with Fiddler. But the server has 2 GB of memory free, and the videos started working again an hour or so later (which probably equates to fewer people using the server, but nothing was changed on it). We run two websites on this machine, and the other never has issues like this, but it also doesn't use a layer like this where .NET code is responsible for writing the file. I don't see any settings that allow me to change the available memory, nor has Google thrown up anything useful. Any suggestions appreciated.
I should add that I stopped and started, and then restarted, my site; in the past I've had issues that were solved in the short term by doing this. This time it did not help.
I just ran into this problem trying to Response.WriteFile a ~170 MB PDF.
In my case, using Response.TransmitFile instead worked, (maybe) because it "writes the specified file directly to an HTTP response output stream, without buffering it in memory."

Resources