PartitionedDataSet not found when Kedro pipeline is run in Docker

I have multiple text files in an S3 bucket which I read and process, so I defined a PartitionedDataSet in the Kedro data catalog that looks like this:
raw_data:
  type: PartitionedDataSet
  path: s3://reads/raw
  dataset: pandas.CSVDataSet
  load_args:
    sep: "\t"
    comment: "#"
In addition, I implemented this solution to get all secrets from the credentials file via environment variables, including the AWS secret keys.
When I run things locally using kedro run, everything works just fine. But when I build a Docker image (using kedro-docker) and run the pipeline in the Docker environment with kedro docker run, providing all environment variables via the --docker-args option, I get the following error:
Traceback (most recent call last):
File "/usr/local/bin/kedro", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.7/site-packages/kedro/framework/cli/cli.py", line 724, in main
cli_collection()
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/home/kedro/kedro_cli.py", line 230, in run
pipeline_name=pipeline,
File "/usr/local/lib/python3.7/site-packages/kedro/framework/context/context.py", line 767, in run
raise exc
File "/usr/local/lib/python3.7/site-packages/kedro/framework/context/context.py", line 759, in run
run_result = runner.run(filtered_pipeline, catalog, run_id)
File "/usr/local/lib/python3.7/site-packages/kedro/runner/runner.py", line 101, in run
self._run(pipeline, catalog, run_id)
File "/usr/local/lib/python3.7/site-packages/kedro/runner/sequential_runner.py", line 90, in _run
run_node(node, catalog, self._is_async, run_id)
File "/usr/local/lib/python3.7/site-packages/kedro/runner/runner.py", line 213, in run_node
node = _run_node_sequential(node, catalog, run_id)
File "/usr/local/lib/python3.7/site-packages/kedro/runner/runner.py", line 221, in _run_node_sequential
inputs = {name: catalog.load(name) for name in node.inputs}
File "/usr/local/lib/python3.7/site-packages/kedro/runner/runner.py", line 221, in <dictcomp>
inputs = {name: catalog.load(name) for name in node.inputs}
File "/usr/local/lib/python3.7/site-packages/kedro/io/data_catalog.py", line 392, in load
result = func()
File "/usr/local/lib/python3.7/site-packages/kedro/io/core.py", line 213, in load
return self._load()
File "/usr/local/lib/python3.7/site-packages/kedro/io/partitioned_data_set.py", line 240, in _load
raise DataSetError("No partitions found in `{}`".format(self._path))
kedro.io.core.DataSetError: No partitions found in `s3://reads/raw`
Note: the pipeline works just fine in the Docker environment if I move the files to a local directory, point the PartitionedDataSet at that path, build the Docker image, and provide the environment variables through --docker-args.

The solution (at least in my case) was to also provide the AWS_DEFAULT_REGION environment variable in the kedro docker run command.
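For reference, a command along these lines passes the region alongside the keys (the quoting of --docker-args and the eu-west-1 value are illustrative only and may need adjusting for your kedro-docker version and bucket region):
kedro docker run --docker-args="-e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY -e AWS_DEFAULT_REGION=eu-west-1"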

Related

airflow worker crashing in helm upgrade with Temporary failure in name resolution for postgres

I have been trying to use a custom Dockerfile for mounting DAGs and plugins, as follows:
FROM apache/airflow:2.3.0-python3.7
COPY ./dags/ /opt/airflow/dags/
COPY ./plugins/ /opt/airflow/plugins/
COPY requirements.txt .
RUN pip install -r requirements.txt
EXPOSE 5555
which I am building as:
docker build -f base.dockerfile --pull --tag lqc-airflow:0.0.1 .
minikube image load lqc-airflow:0.0.1
and then doing a helm install
helm upgrade $RELEASE_NAME apache-airflow/airflow --namespace $NAMESPACE --set images.airflow.repository=lqc-airflow --set images.airflow.tag=0.0.1
which, however, makes just the airflow-worker-0 pod fail with the following error:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/airflow/.local/bin/airflow", line 8, in <module>
sys.exit(main())
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/__main__.py", line 38, in main
args.func(args)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/cli_parser.py", line 51, in command
return func(*args, **kwargs)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/utils/cli.py", line 99, in wrapper
return f(*args, **kwargs)
File "/home/airflow/.local/lib/python3.7/site-packages/airflow/cli/commands/celery_command.py", line 130, in worker
session = celery_app.backend.ResultSession()
File "/home/airflow/.local/lib/python3.7/site-packages/celery/backends/database/__init__.py", line 109, in ResultSession
**self.engine_options)
File "/home/airflow/.local/lib/python3.7/site-packages/celery/backends/database/session.py", line 88, in session_factory
self.prepare_models(engine)
File "/home/airflow/.local/lib/python3.7/site-packages/celery/backends/database/session.py", line 72, in prepare_models
ResultModelBase.metadata.create_all(engine)
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/sql/schema.py", line 4745, in create_all
ddl.SchemaGenerator, self, checkfirst=checkfirst, tables=tables
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 3007, in _run_ddl_visitor
with self.begin() as conn:
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2923, in begin
conn = self.connect(close_with_result=close_with_result)
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 3095, in connect
return self._connection_cls(self, close_with_result=close_with_result)
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 91, in __init__
else engine.raw_connection()
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 3174, in raw_connection
return self._wrap_pool_connect(self.pool.connect, _connection)
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 3145, in _wrap_pool_connect
e, dialect, self
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2004, in _handle_dbapi_exception_noconnection
sqlalchemy_exception, with_traceback=exc_info[2], from_=e
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 3141, in _wrap_pool_connect
return fn()
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 301, in connect
return _ConnectionFairy._checkout(self)
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 755, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 419, in checkout
rec = pool._do_get()
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/pool/impl.py", line 259, in _do_get
return self._create_connection()
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 247, in _create_connection
return _ConnectionRecord(self)
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 362, in __init__
self.__connect(first_connect_check=True)
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 605, in __connect
pool.logger.debug("Error on connect(): %s", e)
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py", line 72, in __exit__
with_traceback=exc_tb,
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/pool/base.py", line 599, in __connect
connection = pool._invoke_creator(self)
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/create.py", line 578, in connect
return dialect.connect(*cargs, **cparams)
File "/home/airflow/.local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 583, in connect
return self.dbapi.connect(*cargs, **cparams)
File "/home/airflow/.local/lib/python3.7/site-packages/psycopg2/__init__.py", line 122, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) could not translate host name "postgres" to address: Temporary failure in name resolution
I am just following the advice from the Airflow docs: https://airflow.apache.org/docs/helm-chart/stable/manage-dags-files.html
Please note that there are no such name resolution errors if I don't use my custom Dockerfile. Kindly help!
Thanks #Hussein for your support as well, but I was able to solve it myself. I saw that the logs of the airflow-migrations pod complained about a revision id not being found. That led me to https://airflow.apache.org/docs/apache-airflow/stable/migrations-ref.html, where I found the Airflow version that carries that alembic revision.
My revision id was ecb43d2a1842, and thus the changes to my Dockerfile were:
FROM apache/airflow:2.4.3
COPY ./dags/ /opt/airflow/dags/
COPY ./plugins/ /opt/airflow/plugins/
COPY requirements.txt .
RUN pip install -r requirements.txt
thus 2.4.3 was the catch.
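If you need to confirm which alembic revision the chart's migration job is complaining about, one way (a sketch; the pod name below is a placeholder that depends on your release name) is to read the migrations pod logs and compare the revision id there against the table in the migrations reference linked above:
kubectl get pods -n $NAMESPACE
kubectl logs -n $NAMESPACE <your-release-run-airflow-migrations-pod>
Then choose an apache/airflow image tag whose latest migration matches that revision.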

Use google default credentials on local docker run

I have the same problem asked in this question, but the provided solution does not work for me.
Basically, I want to run my Docker image, with entrypoint run_query.py, locally. I have issues with credentials when I try to run a BigQuery job.
When I try to run my container with
docker run -v ~/.config/:/root/.config my-image-name --param1 ...
I get this error:
Traceback (most recent call last):
File "run_query.py", line 97, in <module>
query_params=params)
File "run_query.py", line 54, in create_table
query_job = client.query(query, job_config=job_config)
File "/usr/local/lib/python3.7/dist-packages/google/cloud/bigquery/client.py", line 2467, in query
query_job._begin(retry=retry, timeout=timeout)
File "/usr/local/lib/python3.7/dist-packages/google/cloud/bigquery/job.py", line 3156, in _begin
super(QueryJob, self)._begin(client=client, retry=retry, timeout=timeout)
File "/usr/local/lib/python3.7/dist-packages/google/cloud/bigquery/job.py", line 638, in _begin
retry, method="POST", path=path, data=self.to_api_repr(), timeout=timeout
File "/usr/local/lib/python3.7/dist-packages/google/cloud/bigquery/client.py", line 558, in _call_api
return call()
File "/usr/local/lib/python3.7/dist-packages/google/api_core/retry.py", line 286, in retry_wrapped_func
on_error=on_error,
File "/usr/local/lib/python3.7/dist-packages/google/api_core/retry.py", line 184, in retry_target
return target()
File "/usr/local/lib/python3.7/dist-packages/google/cloud/_http.py", line 419, in api_request
timeout=timeout,
File "/usr/local/lib/python3.7/dist-packages/google/cloud/_http.py", line 277, in _make_request
method, url, headers, data, target_object, timeout=timeout
File "/usr/local/lib/python3.7/dist-packages/google/cloud/_http.py", line 315, in _do_request
url=url, method=method, headers=headers, data=data, timeout=timeout
File "/usr/local/lib/python3.7/dist-packages/google/auth/transport/requests.py", line 444, in request
self.credentials.before_request(auth_request, method, url, request_headers)
File "/usr/local/lib/python3.7/dist-packages/google/auth/credentials.py", line 133, in before_request
self.refresh(request)
File "/usr/local/lib/python3.7/dist-packages/google/oauth2/credentials.py", line 198, in refresh
self._scopes,
File "/usr/local/lib/python3.7/dist-packages/google/oauth2/_client.py", line 248, in refresh_grant
response_data = _token_endpoint_request(request, token_uri, body)
File "/usr/local/lib/python3.7/dist-packages/google/oauth2/_client.py", line 124, in _token_endpoint_request
_handle_error_response(response_body)
File "/usr/local/lib/python3.7/dist-packages/google/oauth2/_client.py", line 60, in _handle_error_response
raise exceptions.RefreshError(error_details, response_body)
I also tried to use -v ~/.config/gcloud/:/root/.config/gcloud, but I get the same result.
Keep in mind that using this image in a Kubeflow Pipeline works smoothly.
Did I misinterpret the solution from the previous question? What am I missing?
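For what it's worth, a pattern that is often suggested for this kind of setup (a sketch only, not verified against this exact image; /tmp/keys/adc.json is an arbitrary path chosen for illustration) is to mount the application default credentials file explicitly and point the client library at it via GOOGLE_APPLICATION_CREDENTIALS:
docker run \
  -v ~/.config/gcloud/application_default_credentials.json:/tmp/keys/adc.json:ro \
  -e GOOGLE_APPLICATION_CREDENTIALS=/tmp/keys/adc.json \
  my-image-name --param1 ...
This avoids relying on the container user's home directory being /root when the library searches for the well-known gcloud config path.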

Using google cloud registry with docker while offline

I'm using Google Container Registry (GCR), associating it with Docker using gcloud auth configure-docker.
https://cloud.google.com/sdk/gcloud/reference/auth/configure-docker
However, when my computer is offline and I run docker-compose up, I get an error where it tries to communicate/authenticate with Google.
How can I use Docker offline now that I've started using GCR?
$ docker-compose up --build --force-recreate -d
Building solr
ERROR: (gcloud.auth.docker-helper) There was a problem refreshing your current auth tokens: Unable to find the server at www.googleapis.com
Please run:
$ gcloud auth login
to obtain new credentials, or if you have already logged in with a
different account:
$ gcloud config set account ACCOUNT
to select an already authenticated account to use.
Traceback (most recent call last):
File "site-packages/dockerpycreds/store.py", line 74, in _execute
File "subprocess.py", line 336, in check_output
File "subprocess.py", line 418, in run
subprocess.CalledProcessError: Command '['/Users/me/google-cloud-sdk/bin/docker-credential-gcloud', 'get']' returned non-zero exit status 1.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "site-packages/docker/auth.py", line 129, in _resolve_authconfig_credstore
File "site-packages/dockerpycreds/store.py", line 35, in get
File "site-packages/dockerpycreds/store.py", line 87, in _execute
dockerpycreds.errors.StoreError: Credentials store docker-credential-gcloud exited with "".
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "docker-compose", line 6, in <module>
File "compose/cli/main.py", line 71, in main
File "compose/cli/main.py", line 127, in perform_command
File "compose/cli/main.py", line 1080, in up
File "compose/cli/main.py", line 1076, in up
File "compose/project.py", line 475, in up
File "compose/service.py", line 342, in ensure_image_exists
File "compose/service.py", line 1082, in build
File "site-packages/docker/api/build.py", line 251, in build
File "site-packages/docker/api/build.py", line 307, in _set_auth_headers
File "site-packages/docker/auth.py", line 96, in resolve_authconfig
File "site-packages/docker/auth.py", line 146, in _resolve_authconfig_credstore
docker.errors.DockerException: Credentials store error: StoreError('Credentials store docker-credential-gcloud exited with "".',)
[26419] Failed to execute script docker-compose
docker-compose reads your YAML file to configure your application's services, so if the YAML references an image (stock or custom) that is not available locally, Docker will try to download it from the registry configured for it, in this case Google Container Registry, and if you are offline that attempt fails with an error.
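The traceback above actually comes from Docker calling the docker-credential-gcloud helper that gcloud auth configure-docker registers in ~/.docker/config.json. A workaround sometimes used while offline (a sketch, assuming every image you need is already in the local cache) is to temporarily remove the gcr.io entries from the credHelpers section of that file so Docker skips the helper, e.g. changing
"credHelpers": { "gcr.io": "gcloud", "us.gcr.io": "gcloud" }
to
"credHelpers": {}
and re-running gcloud auth configure-docker once you are back online.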

JupyterHub - oauth_client_id not found

I am using Azure to run a Python notebook using JupyterHub. After spinning up the VM, I was able to access the notebooks just by using my username and password (just like ssh). However, a day later, when I switched to another network (I am not claiming that the network was the problem), I was unable to access the link. It gives me a "The site can't be reached" error.
So I tried rerunning the process again, and since then I have been struggling to make it run again. I have searched for similar issues on GitHub, but they aren't helpful either.
After killing the process using the kill pid command, I tried running JupyterHub with this command:
/anaconda/envs/py35/bin/python /anaconda/envs/py35/bin/jupyterhub-singleuser --port=50387 --notebook-dir="~/notebooks" --config=/etc/jupyterhub/jupyterhub_config.py
And it gives me the error:
JUPYTERHUB_API_TOKEN env is required to run jupyterhub-singleuser. Did you launch it manually?
So I searched through similar GitHub issues and tried generating a token manually using:
jupyterhub token username
And I added that token to JUPYTERHUB_API_TOKEN via export JUPYTERHUB_API_TOKEN=token. I also added token:username to c.Authenticator.tokens in jupyterhub_config.py. Now I get this error:
Traceback (most recent call last):
File "/anaconda/envs/py35/lib/python3.5/site-packages/traitlets/traitlets.py", line 528, in get
value = obj._trait_values[self.name]
KeyError: 'oauth_client_id'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/anaconda/envs/py35/bin/jupyterhub-singleuser", line 6, in <module>
main()
File "/anaconda/envs/py35/lib/python3.5/site-packages/jupyterhub/singleuser.py", line 455, in main
return SingleUserNotebookApp.launch_instance(argv)
File "/anaconda/envs/py35/lib/python3.5/site-packages/jupyter_core/application.py", line 267, in launch_instance
return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
File "/anaconda/envs/py35/lib/python3.5/site-packages/traitlets/config/application.py", line 657, in launch_instance
app.initialize(argv)
File "<decorator-gen-7>", line 2, in initialize
File "/anaconda/envs/py35/lib/python3.5/site-packages/traitlets/config/application.py", line 87, in catch_config_error
return method(app, *args, **kwargs)
File "/anaconda/envs/py35/lib/python3.5/site-packages/notebook/notebookapp.py", line 1296, in initialize
self.init_webapp()
File "/anaconda/envs/py35/lib/python3.5/site-packages/jupyterhub/singleuser.py", line 393, in init_webapp
self.init_hub_auth()
File "/anaconda/envs/py35/lib/python3.5/site-packages/jupyterhub/singleuser.py", line 388, in init_hub_auth
if not self.hub_auth.oauth_client_id:
File "/anaconda/envs/py35/lib/python3.5/site-packages/traitlets/traitlets.py", line 556, in __get__
return self.get(obj, cls)
File "/anaconda/envs/py35/lib/python3.5/site-packages/traitlets/traitlets.py", line 535, in get
value = self._validate(obj, dynamic_default())
File "/anaconda/envs/py35/lib/python3.5/site-packages/traitlets/traitlets.py", line 593, in _validate
value = self._cross_validate(obj, value)
File "/anaconda/envs/py35/lib/python3.5/site-packages/traitlets/traitlets.py", line 599, in _cross_validate
value = obj._trait_validators[self.name](obj, proposal)
File "/anaconda/envs/py35/lib/python3.5/site-packages/traitlets/traitlets.py", line 907, in __call__
return self.func(*args, **kwargs)
File "/anaconda/envs/py35/lib/python3.5/site-packages/jupyterhub/services/auth.py", line 439, in _ensure_not_empty
raise ValueError("%s cannot be empty." % proposal.trait.name)
ValueError: oauth_client_id cannot be empty.
I am not sure where I went wrong in this process. Anybody familiar with this issue?
Try running jupyterhub instead of jupyterhub-singleuser. The single-user server is normally launched by the Hub itself, which is why it expects JUPYTERHUB_API_TOKEN and the related OAuth settings to already be present in its environment.
For your specific use case, the command would be as follows:
sudo /anaconda/envs/py35/bin/python /anaconda/envs/py35/bin/jupyterhub --port=50387 --notebook-dir="~/notebooks" --config=/etc/jupyterhub/jupyterhub_config.py
Make sure that jupyterhub is installed (correctly) in the path you mentioned.

IPython notebook: how to set the correct path to the kernel

When running IPython Notebook on Windows 7 64-bit and launching a notebook with the Python 2 kernel, I get an error:
Traceback (most recent call last):
File "C:\Users\USER1\Anaconda2\lib\site-packages\notebook\base\handlers.py", line 436, in wrapper
result = yield gen.maybe_future(method(self, *args, **kwargs))
File "C:\Users\USER1\Anaconda2\lib\site-packages\notebook\services\sessions\handlers.py", line 56, in post
model = sm.create_session(path=path, kernel_name=kernel_name)
File "C:\Users\USER1\Anaconda2\lib\site-packages\notebook\services\sessions\sessionmanager.py", line 66, in create_session
kernel_name=kernel_name)
File "C:\Users\USER1\Anaconda2\lib\site-packages\notebook\services\kernels\kernelmanager.py", line 84, in start_kernel
**kwargs)
File "C:\Users\USER1\Anaconda2\lib\site-packages\jupyter_client\multikernelmanager.py", line 109, in start_kernel
km.start_kernel(**kwargs)
File "C:\Users\USER1\Anaconda2\lib\site-packages\jupyter_client\manager.py", line 244, in start_kernel
**kw)
File "C:\Users\USER1\Anaconda2\lib\site-packages\jupyter_client\manager.py", line 190, in _launch_kernel
return launch_kernel(kernel_cmd, **kw)
File "C:\Users\USER1\Anaconda2\lib\site-packages\jupyter_client\launcher.py", line 115, in launch_kernel
proc = Popen(cmd, **kwargs)
File "C:\Users\USER1\Anaconda2\lib\subprocess.py", line 710, in __init__
errread, errwrite)
File "C:\Users\USER1\Anaconda2\lib\subprocess.py", line 958, in _execute_child
startupinfo)
WindowsError: [Error 2] The system cannot find the file specified
I have investigated further and added the following print lines before proc = Popen(cmd, **kwargs) inside the launcher.py file:
print cmd
print kwargs
Now I see that proc = Popen(cmd, **kwargs) is called with cmd=
['C:\\Users\\USER1\\Anaconda2_32bit\\python.exe', '-m', 'ipykernel', '-f', 'C:\\Users\\USER1\\AppData\\Roaming\\jupyter\\runtime\\kernel-a3f46334-4491-4fef-aeb1-6772b8392954.json']
this is a problem because my python.exe is not in
C:\\Users\\USER1\\Anaconda2_32bit\\python.exe
but in
C:\\Users\\USER1\\Anaconda2\\python.exe
However, I have checked the paths in Computer/Advanced system settings/Advanced/Environment variables, and \\Anaconda2_32bit\\ is never specified there.
Thus I suspect that the false path is specified somewhere else. Where could this be, and how can I fix it?
Also, I previously had an installation of Anaconda in \\Anaconda2_32bit\\, but I have uninstalled it.
IPython registers its kernels in special configuration files (kernelspecs).
I have run the command:
ipython kernelspec list
the output was:
Available kernels:
python2 C:\ProgramData\jupyter\kernels\python2
I looked into the C:\ProgramData\jupyter\kernels\python2\kernel.json file, and there was a wrong path set for python2. I fixed the path and it works now.
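For reference, a fixed kernel.json would look roughly like this (a sketch; the exact argv entries can differ between ipykernel versions, the key change is pointing the interpreter path at the existing Anaconda2 install):
{
  "argv": ["C:\\Users\\USER1\\Anaconda2\\python.exe", "-m", "ipykernel", "-f", "{connection_file}"],
  "display_name": "Python 2",
  "language": "python"
}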
