I have a question that probably identifies me as a beginner in programming, which I am indeed.
I wrote a script that downloads many NetCDF files; each file is about 500 MB in size and there are several hundred of them. The script would run for several days if all files were downloaded. The problem is that the script regularly stops with the error message:
TimeoutError: [Errno 60] Operation timed out
This is annoying because I would like to start the script and come back a few days later when the downloading is done. As it is, I have to check at least every hour whether the script is still running.
I found in the manual that the timeout of the FTP connection can be set manually.
I set it really high:
ftp = FTP("rancmems.mercator-ocean.fr", timeout=10000)
My question is: is this the line where the problem comes from, and is there any chance to solve it?
Thank you for helping :-)
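For what it's worth, a minimal sketch of one way to make such a download loop survive dropped connections: reconnect and retry each file instead of relying on one long-lived session. This is only an illustration; the file list, login, and retry counts are placeholders, and note that the timeout value is a per-socket-operation limit in seconds, not a cap on the whole run.

from ftplib import FTP
import socket
import time

HOST = "rancmems.mercator-ocean.fr"
FILES = ["path/on/server/file1.nc", "path/on/server/file2.nc"]  # hypothetical list

def download(remote_path, retries=5):
    # Re-open the connection for every attempt so a timed-out socket
    # does not take the whole script down.
    for attempt in range(retries):
        try:
            ftp = FTP(HOST, timeout=600)   # seconds per blocking socket operation
            ftp.login()                    # or ftp.login(user, password)
            local_name = remote_path.split("/")[-1]
            with open(local_name, "wb") as f:
                ftp.retrbinary("RETR " + remote_path, f.write)
            ftp.quit()
            return
        except (OSError, socket.timeout) as err:
            print("attempt %d failed for %s: %s" % (attempt + 1, remote_path, err))
            time.sleep(30)                 # short pause before reconnecting
    raise RuntimeError("giving up on " + remote_path)

for path in FILES:
    download(path)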
I have a Bazel repo that builds some artifact. The problem is that it stops halfway through, hanging with this message:
[3 / 8] no action
What on earth could be causing this condition? Is this normal?
(No, the problem is not easily reducible; there is a lot of custom code, and if I could localize the issue I wouldn't be writing this question. I'm interested in a general answer: what in principle could cause this, and is it normal?)
It's hard to answer your question without more information, but yes, it is sometimes normal. Bazel does several things that are not actions.
One reason that I've seen this is if Bazel is computing digests of lots of large files. If you see getDigestInExclusiveMode in the stack trace of the Bazel server, it is likely due to this. If this is your problem, you can try out the --experimental_multi_threaded_digest flag.
Depending on the platform you are running Bazel on:
Windows: I've seen similar behavior, but I haven't yet been able to determine the reason. Every few runs, Bazel hangs at startup for about half a minute.
If this is mid-build during execution phase (as it appears to be, given that Bazel is already printing action status messages), then one possible explanation is that your build contains many shell commands. I measured that on Linux (a VM with an HDD) each shell command takes at most 1-2ms, but on my Windows 10 machine (32G RAM, 3.5GHz CPU, HDD) they take 1-2 seconds, with 0.5% of the commands taking up to 10 seconds. That's 3-4 orders of magnitude slower if your actions are heavy on shell commands. There can be numerous explanations for this (antivirus, slow process creation, MSYS being slow), none of which Bazel has control over.
Linux/macOS: Run top and see if the stuck Bazel process is doing anything at all. Try hitting Ctrl+\; that prints a JVM stack trace, which could help identify the problem. Maybe the JVM is stuck waiting on a lock -- that would mean a bug or a bad rule implementation.
There are other possibilities too; maybe you have a build rule that hangs.
Does Bazel eventually continue, or is it stuck for more than a few minutes?
So I had a job running to download some files, and it usually takes about 10 minutes. This one ran for more than an hour before it finally failed with the following error message (the only one):
Workflow failed. Causes: (3f03d0279dd2eb98): The Dataflow appears to be stuck. Please reach out to the Dataflow team at http://stackoverflow.com/questions/tagged/google-cloud-dataflow.
So here I am :-)
The jobId: 2017-08-29_13_30_03-3908175820634599728
Just out of curiosity, will we be billed for the hour of stuckness? And what was the problem?
I'm working with Dataflow-Version 1.9.0
Thanks Google Dataflow Team
It seems as though the job had all its workers spending all their time doing Java garbage collection (almost 100%, with full GCs of about 7 seconds occurring roughly every 7 seconds).
Your next best step is to get a heap dump of the job by logging into one of the machines and using jmap. Use a heap dump analysis tool to inspect where all the memory is allocated. It is best to compare the heap dump of a properly functioning job against the heap dump of a broken job. If you would like further help from Google, feel free to contact Google Cloud Support and share this SO question and the heap dumps. This would be especially useful if you suspect the issue is somewhere within Google Cloud Dataflow.
I stopped the SQL service before realizing the storage was full.
The UPDATE operation triggered by stopping did not complete cleanly.
Now any further operation, including START and attempting to change the storage size, gives only the error:
"Operation failed because another operation was already in progress."
Tried from both web cloud console and gcloud command line. Same error on both.
How can I clear this incomplete UPDATE OPERATION so I can then increase storage size and start the SQL server?
It's not a perfect solution, but apparently the operation does eventually complete after 2 to 3 hours. Each subsequent operation also takes 2 to 3 hours, including the one that increases the storage size; once that one finishes, the interface works normally again.
Posting this in case someone else runs into this same problem. There may still be a better solution, but giving it a lot of time does seem to work.
I'm performing a TFS Integration migration from tfs.visualstudio to an on-premise 2012 server. I'm running into a problem with a particular changeset that contains multiple binary files in excess of 1 MB, a few of which are 15-16 MB. [I'm working remotely (WAN) with the on-premise TFS]
From the TFSI logs, I'm seeing: Microsoft.TeamFoundation.VersionControl.Client.VersionControlException: C:\TfsIPData\42\******\Foo.msi: The request was aborted: The request was canceled.
---> System.Net.WebException: The request was aborted: The request was canceled.
---> System.IO.IOException: Cannot close stream until all bytes are written.
Doing some Googling, I've encountered others running into similar issues, not necessarily concerning TFS Integration. I'm confident this same issue would arise if I were just checking in a changeset like normal that met the same criteria. As I understand it, when uploading files (checking in), the default chunk size is 16MB and the timeout is 5 minutes.
My internet upload speed is only 1 Mbit/s at this site. (While sufficient upload bandwidth would mitigate the problem, it wouldn't actually solve it.)
Using TCPView, I've watched the connections to the TFS server from my client while the upload was in progress. What I see is 9 simultaneous connections. My bandwidth is thus getting shared among 9 file uploads. Sure enough, after about 5 minutes the connections crap out before the upload byte counts can finish.
My question is, how can I configure my TFS client to utilize fewer concurrent connections, and/or smaller chunk sizes, and/or increased timeouts? Can this be done globally somewhere to cover VS, TF.EXE, and TFS Integration?
After spending some time with IL DASM poking around in the Microsoft.TeamFoundation.VersionControl.Client.dll FileUploader, I discovered the string VersionControl.UploadChunkSize in the constructor. It looked like it could be used to override the default chunk size (DefaultUploadChunkSize = 0x01000000, i.e. 16 MB).
So, I added this to TfsMigrationShell.exe.config
<appSettings>
  <add key="VersionControl.UploadChunkSize" value="2097152" />
</appSettings>
and ran the VC migration again -- this time it got past the problem changeset!
Basically, the TFS client DLL will try to upload multiple files simultaneously (9 in my case). Your upload bandwidth will be split among the files, and if any individual file transfer cannot complete 16 MB within 5 minutes, the operation will fail. So you can see that with modest upload bandwidth, changesets containing multiple binary files can time out. The only thing you can control is the byte count of each 5-minute timeout chunk. The default is 16 MB, but you can reduce it. I reduced mine to 2 MB.
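To put rough numbers on it, ignoring protocol overhead: 1 Mbit/s is about 125 kB/s, which split across 9 connections is roughly 14 kB/s each, or about 4 MB per 5-minute window -- well short of the 16 MB default chunk. A 2 MB chunk fits comfortably inside that window, which is why lowering the value helps.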
I imagine the same thing could be done in devenv.exe.config to deal with this problem when performing developer code check-ins. Hopefully this information will help somebody else out and save them some time.
I'm using the check_yum plugin in my Icinga monitoring environment to check whether security-critical updates are available. This works quite well, but sometimes I get a "CHECK_NRPE: Socket timeout after xx seconds." error while executing the check. Currently my NRPE timeout is 30 seconds.
If I re-schedule the check a few times, or execute the check directly from my Icinga server with a higher NRPE timeout value, everything works fine, at least after a few executions of the check. No other checks via NRPE are throwing errors, so I don't think there is a general problem with my NRPE config or the plugins I'm using. Is there some explanation for this strange behaviour of the check_yum plugin? Maybe some caching issue on the monitored servers?
First, be sure you are using the 1.0 version of this check from: https://code.google.com/p/check-yum/downloads/detail?name=check_yum_1.0.0&can=2&q=
The changes I've seen in that version could fix this issue, depending on its root cause.
Second, if your server(s) are not configured to use all 'local' cache repos, then this check will likely time out before the 30-second deadline, because: 1) the amount of data from the refresh/update is pretty large and may take a long time to download from remote servers (including RH proper), and 2) most of the 'official' update servers tend to go offline a lot.
The best solution I've found is to have a cron job perform your update check at a set interval (I use weekly) and write a log file containing the security patches the system(s) require. Then use a Nagios check, via a simple shell script, to see whether that file has any new items in it; a rough sketch of such a check follows.
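As an illustration of that last step, here is a minimal sketch of the file-based check, written in Python rather than shell purely for readability. The exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL, 3 = UNKNOWN); the log file path and message text are assumptions, not part of the setup described above.

#!/usr/bin/env python3
# Minimal sketch: report CRITICAL if the cron-generated log of pending
# security updates contains any entries. The path is hypothetical.
import sys
from pathlib import Path

PENDING_LOG = Path("/var/log/yum_security_pending.log")  # assumed location

def main():
    if not PENDING_LOG.exists():
        print("UNKNOWN: pending-updates log not found")
        return 3                      # Nagios UNKNOWN
    pending = [line for line in PENDING_LOG.read_text().splitlines() if line.strip()]
    if pending:
        print("CRITICAL: %d security update(s) pending" % len(pending))
        return 2                      # Nagios CRITICAL
    print("OK: no security updates pending")
    return 0                          # Nagios OK

if __name__ == "__main__":
    sys.exit(main())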