ml-pipeline pod of kubeflow deployment is in CrashLoopBackOff - kubeflow

I have very limited hands-on experience with the Kubeflow setup. I have deployed a Kubeflow pipeline on an EKS cluster with S3 connectivity for the artifacts. All the pods are up and running except the ml-pipeline deployment; the ml-pipeline-persistenceagent deployment is also failing because it depends on ml-pipeline.
I see the following error in the pod logs:
I0321 19:19:49.514094 7 config.go:57] Config DBConfig.ExtraParams not specified, skipping
F0321 19:19:49.812472 7 client_manager.go:400] Failed to check if Minio bucket exists. Error: Access Denied.
Has anyone faced similar issues? I am not able to find many logs that could help me debug this.
Also, the credentials used by the ml-pipeline deployment to access the bucket have all the required permissions.

Check the S3 permissions assigned to the AWS credentials you set for MINIO_AWS_ACCESS_KEY_ID & MINIO_AWS_SECRET_ACCESS_KEY. That is what caused the same error for me.
Although the auto-rds-s3-setup.py setup program provided by the AWS distribution of Kubeflow can create the S3 bucket, the credentials passed in to MinIO have to grant access to that bucket, so they are primarily for reusing an existing S3 bucket.
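As a quick way to confirm whether the keys themselves are the problem, you can try the same credentials from any machine with the AWS CLI installed. This is only a sketch; <artifact-bucket> stands in for whatever bucket ml-pipeline is configured to use:
export AWS_ACCESS_KEY_ID=<value of MINIO_AWS_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<value of MINIO_AWS_SECRET_ACCESS_KEY>
aws s3api head-bucket --bucket <artifact-bucket>
aws s3 ls s3://<artifact-bucket>
If either command comes back with Access Denied, the keys lack permissions such as s3:ListBucket on that bucket, which lines up with the "Failed to check if Minio bucket exists" error.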

Related

Error Creating Vault - Missing S3 Bucket Flag?

I'm trying to create a new jenkinsx cluster using jx. This is the command I am running:
jx create cluster aws --ng
And this is the error I get:
error: creating the system vault: creating vault: Missing S3 bucket flag
It seems to fail out on creating the vault due to missing a bucket flag, and I'm not sure how to remedy that.
Did you try
jx create cluster aws --ng --state s3://<bucket_name>
Also, ensure you are using the latest release
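To check your current version and pick up the latest binary, something along these lines should work (a sketch; jx upgrade cli assumes a jx release recent enough to ship that subcommand):
jx version
jx upgrade cli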

Cannot access S3 bucket from WildFly running in Docker

I am trying to configure WildFly, using the Docker image jboss/wildfly:10.1.0.Final, to run in domain mode. I am using Docker for macOS 8.06.1-ce with aufs storage.
I followed the instructions in this link: https://octopus.com/blog/wildfly-s3-domain-discovery. It seems pretty simple, but I am getting the error:
WFLYHC0119: Cannot access S3 bucket 'wildfly-mysaga': WFLYHC0129: bucket 'wildfly-mysaga' could not be accessed (rsp=403 (Forbidden)). Maybe the bucket is owned by somebody else or the authentication failed.
But my access key, secret, and bucket name are correct. I can use them to connect to S3 with the AWS CLI.
What could I be doing wrong? The tutorial runs it on an EC2 instance, while my test is in Docker. Maybe it is a certificate problem?
I generated access keys from the admin user and it worked.
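If you would rather not hand WildFly full-admin keys, a bucket-scoped IAM policy along these lines is usually enough. This is only a sketch: the exact set of actions WildFly's S3 discovery needs may differ, and wildfly-mysaga is the bucket name from the error message:
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": ["s3:ListBucket"], "Resource": "arn:aws:s3:::wildfly-mysaga"},
    {"Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"], "Resource": "arn:aws:s3:::wildfly-mysaga/*"}
  ]
}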

Airflow: Could not send worker log to S3

I deployed the Airflow webserver, scheduler, worker, and flower on my Kubernetes cluster using Docker images.
The Airflow version is 1.8.0.
Now I want to send worker logs to S3, so I:
Created an S3 connection in Airflow from the Admin UI (I just set S3_CONN as the conn id and s3 as the type; because my Kubernetes cluster is running on AWS and all nodes have S3 access roles, that should be sufficient).
Set the Airflow config as follows:
remote_base_log_folder = s3://aws-logs-xxxxxxxx-us-east-1/k8s-airflow
remote_log_conn_id = S3_CONN
encrypt_s3_logs = False
First I tried creating a DAG that just raises an exception immediately after it starts running. This works; the log can be seen on S3.
Then I modified the DAG so that it creates an EMR cluster and waits for it to be ready (waiting status). To apply the change, I restarted all four Docker containers of Airflow.
Now the DAG appears to work: a cluster is started, and once it's ready the DAG is marked as success. But I can see no logs on S3.
There is no related error log on the worker or web server, so I cannot even see what might be causing the issue. The logs are just not sent.
Does anyone know if there is some restriction for remote logging of Airflow, other than this description in the official documentation?
https://airflow.incubator.apache.org/configuration.html#logs
In the Airflow Web UI, local logs take precedence over remote logs. If local logs can not be found or accessed, the remote logs will be displayed. Note that logs are only sent to remote storage once a task completes (including failure). In other words, remote logs for running tasks are unavailable.
I didn't expect this, but are the logs not sent to remote storage on success?
The boto version installed with Airflow is 2.46.1, and that version doesn't use IAM instance roles.
Instead, you will have to add an access key and secret for an IAM user that has access in the extra field of your S3_CONN configuration.
Like so:
{"aws_access_key_id":"123456789","aws_secret_access_key":"secret12345"}

Google Cloud Jenkins gcloud push access denied

I'm trying, via Jenkins, to push an image to the container repository. It was working at first, but now I get "access denied":
docker -- push gcr.io/xxxxxxx-yyyyy-138623/myApp:master.1
The push refers to a repository [gcr.io/xxxxxxx-yyyyy-138623/myApp]
bdc3ba7fdb96: Preparing
5632c278a6dc: Waiting
denied: Access denied.
The Jenkinsfile looks like:
sh("gcloud docker --authorize-only")
sh("docker -- push gcr.io/xxxxxxx-yyyyy-138623/hotelpro4u:master.1")
Remarks:
Jenkins is running in Google Cloud
If I try from Google Cloud Shell or from my computer, it works.
I followed this tutorial: https://github.com/GoogleCloudPlatform/continuous-deployment-on-kubernetes
I've been stuck for 12 hours. I need help.
That error means that the GKE node is not authorized to push to the GCS bucket that is backing your repository.
This could be because:
The cluster does not have the correct scopes to authenticate to GCS. Did you create the cluster with --scopes storage-rw? (You can check this with the command sketched below.)
The service account that the cluster is running as does not have permissions on the bucket. Check the IAM & Admin section on your project to make sure that the service account has the necessary role.
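To see which scopes the cluster's nodes were created with, something like this should work (a sketch; <cluster-name> and <zone> are placeholders for your own values):
gcloud container clusters describe <cluster-name> --zone <zone> --format='value(nodeConfig.oauthScopes)'
If storage-rw (https://www.googleapis.com/auth/devstorage.read_write) is not in the list, pushes to gcr.io will be denied even if the service account itself has the right IAM roles.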
Building on #cj-cullen's answer above, you have two options:
Destroy the node pool and then, from the CLI, recreate it with the missing https://www.googleapis.com/auth/projecthosting,storage-rw scope (see the command sketch after this list). The GKE console does not let you change the default scopes when creating a node pool.
Stop each instance in your cluster. In the console, click the edit button for the instance. You should now be able to add the appropriate https://www.googleapis.com/auth/projecthosting,storage-rw scope.
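For the first option, the commands look roughly like this (a sketch, assuming the pool is named default-pool and the cluster my-cluster; substitute your own names and zone, and note that deleting a node pool drains the workloads running on it):
gcloud container node-pools delete default-pool --cluster my-cluster
gcloud container node-pools create default-pool --cluster my-cluster --scopes https://www.googleapis.com/auth/projecthosting,storage-rw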

Invalid AMI option for Elastic Beanstalk

I am trying to set up a Rails server on AWS using Elastic Beanstalk. I am following the guide here. I managed to configure the EB CLI, and I am at the part where I am trying to deploy the app to an EB environment. However, I am getting an error that the AMI option I provided is invalid:
[rails-beanstalk$] eb create first-beanstalk-env -sr aws-beanstalk-service-role
WARNING: You have uncommitted changes.
Creating application version archive "app-8bc6-160112_090122".
Uploading rails-beanstalk/app-8bc6-160112_090122.zip to S3. This may take a while.
Upload Complete.
ERROR: Configuration validation exception: Invalid option value: 'ami- 48eb8128' (Namespace: 'aws:autoscaling:launchconfiguration', OptionName: 'ImageId'): No EC2 ImageId found with id: 'ami-48eb8128'
I don't remember ever setting an AMI (or what that even is), so I am very confused as to why I'm getting this error.
Not sure what's up with that error, but I find the EB CLI quite fragile. Try a different approach and create the environment via the AWS web console rather than via the CLI.
Once it is ready, use eb init to set up Beanstalk for your local project and then eb deploy <env-name> to push your project to the newly created environment.
If the error persists, try changing the deployment region. AWS sometimes has localized bugs in certain regions.
EDIT: this seems to be an issue with AWS. I tried it myself and it fails with the default setup in all AWS regions.
EDIT 2: this is now confirmed by Amazon:
Unfortunately we are experiencing an issue on our side related to Beanstalk and the default Ruby AMI in different regions. We are already investigating the issue and we plan to fix this as soon as possible. I will update you through this support case once we get any update from the Elastic Beanstalk service team.
'ami- 48eb8128'
This has a space in it and is not a valid AMI ID
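If you do want to pin the image explicitly rather than rely on the platform default, one place to set it is an .ebextensions config file. This is only a sketch: ami-48eb8128 is simply the ID from the error with the space removed, and it must actually exist in your region:
# .ebextensions/custom-ami.config
option_settings:
  aws:autoscaling:launchconfiguration:
    ImageId: ami-48eb8128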
