Dask asyncio tornado TimeoutError - dask

I'm running a Dask-YARN job on a YARN cluster on a schedule. The job creates a list of Delayed Dask tasks, and submits it to the cluster using the following code:
import dask
from dask.distributed import Client
from dask_yarn import YarnCluster

cluster = YarnCluster()
cluster.scale(8)
app_id = cluster.application_client.id
client = Client(cluster)
dask.compute(dask_tasks)  # dask_tasks is the list of Delayed tasks built earlier
cluster.shutdown()
client.close()
Then, it logs the application worker logs using the command:
yarn logs -applicationId {app_id} -log_files dask.worker.log
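(As a side note, here is a minimal sketch of how the scheduled job might run that command programmatically. This is an illustration only, not code from the original job; it assumes the yarn CLI is on the PATH and reuses the app_id captured above.)

import subprocess

# Sketch only: shell out to the YARN CLI and capture the worker log text.
result = subprocess.run(
    ["yarn", "logs", "-applicationId", app_id, "-log_files", "dask.worker.log"],
    capture_output=True, text=True, check=True,
)
worker_logs = result.stdout
print(worker_logs)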
After printing all the worker logs, I see the following error message:
End of LogType:dask.worker.log
********************************************************************************
2019/11/28 11:16:24 - asyncio - ERROR - Future exception was never retrieved
future: <Future finished exception=TimeoutError('Timeout')>
tornado.util.TimeoutError: Timeout
This job runs on a schedule, and the error message above appears intermittently. The job also completes successfully in every case where it shows this error message. So does anyone have an idea of the reason for this error?

Logged warnings like these can sometimes occur if things didn't clean up perfectly. In practice it's not a big deal though. If your job completes successfully then I would probably ignore it.
If you're able to provide a minimal reproducible example then you might consider submitting an issue to the dask-yarn issue tracker.
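If the warning is a nuisance, one thing that sometimes helps is making sure the client is closed before the cluster is shut down, for example via context managers. Here is a minimal sketch under that assumption (dask_tasks is the list of Delayed tasks from the question; it assumes YarnCluster and Client can both be used as context managers, which is how they are typically documented):

import dask
from dask.distributed import Client
from dask_yarn import YarnCluster

with YarnCluster() as cluster:          # the cluster is shut down when the block exits
    cluster.scale(8)
    app_id = cluster.application_client.id
    with Client(cluster) as client:     # the client is closed before the cluster
        dask.compute(dask_tasks)        # dask_tasks: the list of Delayed tasks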

Related

Why are Jest tests SOMETIMES failing on CircleCI?

I have Jest tests that run against a dockerized Neo4j database, and sometimes they fail on CircleCI. The error message for all 25+ of them is:
thrown: "Exceeded timeout of 5000 ms for a hook.
#*******api: Use jest.setTimeout(newTimeout) to increase the timeout value, if this is a long-running test."
Since they fail only sometimes, like once in 25 runs, I am wondering if jest.setTimeout will solve the issue. I was able to make them fail locally by setting jest.setTimeout(10), but I am not sure how to debug this further, or whether something other than a too-small timeout (default 5000 ms) could be the issue here. I would understand if 1 in 25 runs or a few tests failed, or if all the other suites failed, but only a single file fails, with all the tests within that file failing. And it is always the same file, never some other file.
Additional information: locally, that single file runs in less than 1000 ms when connected to the staging database, which is huge compared to the dockerized one, which has only a few files at the time of running.
For anyone who sees this, I was able to solve this by adding the --maxWorkers=2 flag to the test command in my CircleCI config. See here for details: https://support.circleci.com/hc/en-us/articles/360005442714-Your-test-tools-are-smart-and-that-s-a-problem-Learn-about-when-optimization-goes-wrong-
Naman's answer is perfect! I couldn't believe it but it really solved my problem. Just to be extra clear on how to do it:
I changed the test script in my package.json from jest to jest --maxWorkers=2. Then I pushed and it solved my error.

Error when I click the View Log button in the task execution screen in the Spring Cloud Dataflow Dashboard

Good evening.
I have set up Spring Cloud Dataflow 2.7.0-SNAPSHOT to run Spring Batch tasks on OpenShift.
I registered an app using a valid Docker path, and created a task using this app.
When I execute the task from the SCDF dashboard using a specific platform from the drop-down, the task gets properly executed on OpenShift.
When I access the task execution screen and click the "View Logs" button, I get the following error on screen:
"Log could not be retrieved. Verify that deployments are still available."
In the SCDF log file I get:
2020-09-15 14:12:16.546 WARN 7 --- [nio-9376-exec-1] .s.c.d.s.s.i.DefaultTaskExecutionService : Failed to retrieve the log, returning verification message.
java.lang.IllegalStateException: No Launcher found for the platform named 'default'. Available platform names are [platform-test, platform-dev]
at org.springframework.cloud.dataflow.server.service.impl.DefaultTaskExecutionService.findTaskLauncher(DefaultTaskExecutionService.java:683)
at org.springframework.cloud.dataflow.server.service.impl.DefaultTaskExecutionService.getLog(DefaultTaskExecutionService.java:605)
I have seen that the REST API endpoint (http://localhost:9393/tasks/logs/<external_exec_id>?platformName=platform-dev) to get the log works properly, but from the dashboard we are invoking http://localhost:9393/tasks/logs/<external_exec_id>, not including platformName.
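For reference, a rough sketch of the working call expressed with Python's requests (the host, port, platform name, and placeholder execution id are the ones from the URLs above; this is just an illustration, not SCDF code):

import requests

external_exec_id = "<external_exec_id>"  # placeholder, as in the URLs above
response = requests.get(
    "http://localhost:9393/tasks/logs/" + external_exec_id,
    params={"platformName": "platform-dev"},  # the parameter the dashboard does not send
)
response.raise_for_status()
print(response.text)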
Should I configure anything, or add any attribute to the task execution, to make this work, or is this a bug?
Thanks and regards.

Xcodebuild - Skip "Finished requesting crash reports. Continuing with testing"

I'm running a CI machine with Xcode.
The tests are triggered using fastlane gym. I see this line in the output:
2019-05-27 16:04:28.417 xcodebuild[54605:1482269] [MT]
IDETestOperationsObserverDebug: (A72DBEA3-D13E-487E-9D04-5600243FF617)
Finished requesting crash reports. Continuing with testing.
This operation takes some time (about a minute) to complete. As far as I understand, Xcode requests crash reports from Apple to show in the "Organizer" window.
Since this is a CI machine, the crash reports will never be viewed on it, so this step could be skipped completely. How can I skip it?
Your mileage may vary, but after setting up a new machine with the following configuration, I encountered the same issue OP details:
macOS 10.15.2
Xcode 11.3
fastlane 2.139.0
Simulators # 13.3
When I ran my fastlane tests with 3 devices, I wound up at the following message and sat idle for about four minutes before I terminated it:
I then took the steps that I outlined in the comment to the OP:
fastlane scan init
Edited my Scanfile to look like this
I initially set disable_concurrent_testing(false), and when I ran the tests through fastlane, I got stuck again. Changing the value to disable_concurrent_testing(true) has allowed the tests to now run on my machine.
I think blaming "Finished requesting crash reports. Continuing with testing" may be a red herring. I was having several jobs stop at this step, but when I looked closer (I ran the lane locally and tailed the logs) I saw that my test was failing due to something else. It looks like fastlane doesn't correctly show how long this step takes; in fact, I think that if you're seeing that message, the process is already complete and your tests are running. The fact that changing concurrency fixes it for you may indicate your tests are failing due to a race condition.
So, anyway. Install fastlane locally, run your lane locally, tail -f the build output as well as the log file and see if the problem is revealed there. It was for me, but, as with everything, YMMV.

Job stuck after manual job deletion

I performed a manual cleanup by deleting job folders directly in the filesystem, and now I have a stuck running job that I cannot abort.
I've tried the answers here to force it to stop, but it doesn't work, as it is not able to find the existing job in the system.
Additionally, when I click on the running job I get a 404 error:
"Problem accessing <route_to_job_that_doesnt_exist_anymore>"
Reason: Not found
Is there something I can do to abort this running job without restarting the server?
A way to stop a build (as in actually aborting it) is by adding /stop to the end of the job URL, after the build number.
Example: http://myjenkins/project/123/stop
If this doesn't work, there is also the "Hard Kill": instead of adding /stop, you add /kill. I guess you need admin access for that POST action.
I don't know, though, whether it works for jobs that no longer exist on the Jenkins host due to the missing filesystem folders.
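Purely as an illustration, the same /stop and /kill endpoints can be hit as POST requests from a script; the Jenkins URL below is the example above, and the credentials are placeholders:

import requests

build_url = "http://myjenkins/project/123"      # example build URL from above
auth = ("admin", "my-api-token")                # placeholder user and API token

for action in ("stop", "kill"):                 # try the soft stop first, then the hard kill
    response = requests.post(build_url + "/" + action, auth=auth)
    print(action, response.status_code)
    if response.ok:
        break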

TFS build partially succeeded when calling a batch file, but no error in log

I'm building a solution which requires a batch file to be run after the build (there's a sequence in the workflow for this). TFS flags the build as partially succeeded, but there's no error in the log, even in full verbose ("diagnostic") mode. I'm checking the errorlevel after each line in the batch file and it's always 0. I've also tested redirecting stdout and stderr to a file after each line, and there's no clue there.
It’s got nothing to do with unit tests because I’m skipping them for the time being.
I’ve noticed that usually when an error occurs in a batch file (e.g. file not found) there’s a visual cue to indicate the error and this matches the partially succeeded status. But I don’t see any visual cue.
So how can TFS decide that the build is only partially succeeded?
Thank you,
Solved.
It turns out the GetImpactedTests activity is throwing an exception (I can see it in the event viewer of the TFS machine), but it doesn't show at all in the build log.
I'm guessing that this exception marks the build as partially succeeded (because the compilation part succeeded), but I couldn't see the assignment explicitly in the build log. When I bypass the impact analysis (either by setting Analyze Test Impact to False or by removing the GetImpactedTests activity altogether), the error does not occur.
We experienced something similar here using the Lab Workflow (to kick off our CodedUI tests). Different build template, same symptoms.
I have noticed that the build process reports that it partially succeeded, highlighting what seems to be a successful step in the deploy script (batch file).
The command in question is a command to install our mobile app on a mobile device (in order to test it at night):
adb install -d -r test.apk
I thought about checking the errorlevel right after running the adb command, but the errorlevel was 0.
Then I thought that maybe the command was sending its output to stderr, and found this article on the Android Open Source Project, which confirmed my hypothesis.
Following is my fix:
adb install -r -d test.apk 2>&1
Appending 2>&1 simply redirects stderr to stdout, and now my deploy script does not report an error anymore and the build succeeds (when all tests pass!).
Conclusion: when a script writes anything to stderr, the build workflow will report it as an error (a partial success, since it does not prevent execution of the workflow).
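To illustrate that behaviour outside of TFS, here is a rough sketch (not from the original post; it assumes adb is on the PATH) that runs the same kind of command with stdout and stderr captured separately. Anything that shows up in stderr is what the workflow flags, even when the exit code is 0:

import subprocess

proc = subprocess.run(
    ["adb", "install", "-d", "-r", "test.apk"],  # same command as above
    capture_output=True, text=True,
)
print("exit code:", proc.returncode)   # can be 0 even when the build is flagged
print("stderr:", proc.stderr)          # this output is what gets reported as an error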
I know this is not your particular issue but since we had the same symptoms, I thought the stderr information could help somebody else find out the reason why their build process is reporting a partial success even though everything seems to work.
