I am trying to update my Dataflow pipeline. I like developing with Jupyter notebooks on Google Cloud. However, I've run into this error when trying to update:
"The new job is missing steps [5]: read/Read."
I understand the reason is that I re-ran some cells in my notebook and added some new ones, so instead of "[5]: read/Read" the step is now "[23]: read/Read". But surely Dataflow doesn't need to care about the Jupyter notebook execution count. Is there some way to turn this off and just refer to the steps by their given names, without the numbers?
The Notebooks documentation recommends restarting the kernel and rerunning all cells to avoid that behavior.
"(Optional) Before using your notebook to run Dataflow jobs, restart the kernel, rerun all cells, and verify the output. If you skip this step, hidden states in the notebook might affect the job graph in the pipeline object."
Doing so resets the cell execution counts, so the step numbers stay consistent between runs.
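If it helps, explicitly labeling every transform is the usual way to keep the base step names stable as the code changes; below is a minimal sketch (the bucket paths and pipeline options are placeholders, not taken from the question). If a step really has been renamed, Dataflow's transform name mapping option for update jobs is meant for that case.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder options; a real update run would also pass --project,
    # --region, --temp_location, --update and the existing job name.
    options = PipelineOptions(["--runner=DataflowRunner"])

    with beam.Pipeline(options=options) as p:
        (p
         | "read" >> beam.io.ReadFromText("gs://my-bucket/input*.txt")   # stable label
         | "count" >> beam.combiners.Count.Globally()
         | "write" >> beam.io.WriteToText("gs://my-bucket/output"))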
Related
I have created a Spark standalone cluster on Docker which can be found here.
The issue that I'm facing is that when I run the first cell in JupyterLab to create a SparkContext I lose the ability to submit jobs (Python programs). I keep getting the message:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I'm not sure where the issue is, but it seems like the Driver is blocked?
I'm not even sure how to formulate the question, since I can submit PySpark jobs just fine as long as the app from Jupyter has not been submitted.
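Not an answer, but for illustration: on a standalone cluster this message usually means the notebook's SparkContext has already claimed all of the workers' cores, so any job submitted afterwards never gets executors. A rough sketch of capping the notebook session (the master URL and limits are assumptions, not taken from the question):

    from pyspark import SparkConf, SparkContext

    # Assumed master URL and limits: cap what the notebook application takes
    # so that apps submitted with spark-submit can still acquire executors.
    conf = (SparkConf()
            .setMaster("spark://spark-master:7077")
            .setAppName("jupyterlab-session")
            .set("spark.cores.max", "2")          # leave cores for other apps
            .set("spark.executor.memory", "1g"))  # leave memory on each worker

    sc = SparkContext(conf=conf)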
I am running a service on GCP Cloud Run.
I found this error in the logs. What does it mean, and how can I troubleshoot it?
Application exec likely failed
terminated: Application failed to start: not available
This error can occur when a container fails to deploy or start.
To troubleshoot the issue, you can follow the steps in the troubleshooting documentation.
As described there, if you build your container image on an ARM-based machine, it might not work as expected on Cloud Run. If that is the case, you can solve the issue by building your image with Cloud Build instead, as described in the documentation.
To get more detailed logs, I would suggest setting up Cloud Logging for your Cloud Run service. You can do so by following the documentation on setting up Cloud Logging.
This will allow you to have more control over the logs that appear for your Cloud Run application.
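For a Python service, a minimal sketch of that Cloud Logging setup (assuming the google-cloud-logging client library is added to the container image):

    import logging

    import google.cloud.logging

    # Attach the Cloud Logging handler to the root logger so ordinary
    # logging calls from the service end up in Cloud Logging.
    client = google.cloud.logging.Client()
    client.setup_logging()

    logging.info("service starting")  # shows up under the Cloud Run revision's logs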
I was using Cloud Run Second Generation.
According to the official documentation:
During Preview, although the second generation execution environment generally performs faster under sustained load, it has longer cold start times than the first generation.
Therefore, I switched back to First Generation.
I am running a pipeline in AI Platform Pipelines based on TFX. All components run fine until the Evaluator: it just does not run on Dataflow, it runs in the Kubeflow pod instead, and so it fails because there is not enough memory there.
The Apache Beam config is set to use Dataflow as the runner, and other components such as ExampleGen, StatisticsGen, and ExampleValidator all run fine on Dataflow.
When it comes to the Evaluator component, it just fails without even generating a log, complaining (in the Kubeflow UI):
"This step is in a Failed state with this message: The node was low on resource: memory. The container main was using 2093880Ki, which exceeds its request of 0. Container wait was using 13492Ki, which exceeds its request of 0."
I was able to resolve this issue by setting the TFX version to 0.25.0.
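For reference, this is roughly how the Dataflow settings are passed through beam_pipeline_args on the TFX pipeline; the project, bucket, and components list below are placeholders, not values from the original pipeline.

    from tfx.orchestration import pipeline

    # Placeholder GCP project and bucket; the real components list would hold
    # ExampleGen, StatisticsGen, ExampleValidator, ..., Evaluator.
    components = []

    beam_pipeline_args = [
        "--runner=DataflowRunner",
        "--project=my-gcp-project",
        "--region=us-central1",
        "--temp_location=gs://my-bucket/tmp",
    ]

    tfx_pipeline = pipeline.Pipeline(
        pipeline_name="my-pipeline",
        pipeline_root="gs://my-bucket/pipeline-root",
        components=components,
        beam_pipeline_args=beam_pipeline_args,
    )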
I am trying to set up a cluster on AWS to run distributed sklearn model training with Dask. To get started, I was following this tutorial, which I hope to tweak: https://towardsdatascience.com/serverless-distributed-data-pre-processing-using-dask-amazon-ecs-and-python-part-1-a6108c728cc4
I have managed to push the Docker container to AWS ECR and then launch a CloudFormation template to build a cluster on AWS Fargate. The next step in the tutorial is to launch an AWS SageMaker notebook. I have tried this, but something is not working: when I run the commands I get errors (see image). What might the problem be? Could it be related to the VPC/subnets? Is it related to SageMaker internet access? (I have tried enabling and disabling this.)
Expected Results: dask to update, scaling up of the Fargate cluster to work.
Actual Results: none of the above.
In my case, when running through the same tutorial, DaskSchedulerService took too long to complete. The creation was initiated but never finished in CloudFormation.
After 5-6 hours I got the following:
DaskSchedulerService CREATE_FAILED Dask-Scheduler did not stabilize.
The workers did not run, and, consequently, it was not possible to connect to the Client.
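For what it's worth, the quickest check from the SageMaker notebook is to connect a client and look at the registered workers; the scheduler address below is an assumption based on the tutorial's service-discovery naming, not something taken from my setup.

    from dask.distributed import Client

    # Assumed ECS service-discovery hostname from the tutorial; adjust it to
    # match the CloudFormation outputs. If the scheduler never stabilized,
    # this call will simply time out.
    client = Client("Dask-Scheduler.local-dask:8786")

    # An empty dict means the Fargate workers never registered with the scheduler.
    print(client.scheduler_info()["workers"])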
I have a working Pod for a Deployment in OpenShift 3 Starter. This is based on an image stream from a Docker image. However, I cannot get it to build in OpenShift with the built-in S2I.
The Docker option is not good, as I cannot find a setting anywhere to make an image stream update trigger a redeployment.
I tried setting it up so that a webhook would trigger an OpenShift build, but the server needs Python 3 with numpy and scipy, which makes the build get stuck. The best I could do, inelegantly, was to get a Python 3 cartridge to install numpy from requirements.txt and the rest via setup.py, but this still got stuck. I have a working webhook for a different app that runs on basically the same layout except for the requirements (Python 3 Pyramid with Waitress).
Github: https://github.com/matteoferla/pedel2
Docker: https://hub.docker.com/r/matteoferla/pedel2/
Openshift: http://pedel2-git-matteo-ferla.a3c1.starter-us-west-1.openshiftapps.com
UPDATE: I have made an OpenShift Pyramid starter template.
I would first suggest going back to using the built-in Python S2I builder. If you are doing anything with numpy/pandas, you will need to increase the amount of memory available during the build phase of your application, as the compiler runs out of memory when building those packages. See:
Pandas on OpenShift v3
See if that helps; if need be, we can look at what your other options are for using an externally built container image.