AWS OpsWorks - setup_failed and eternal pending logs

I'm trying to create a QA stack in OpsWorks. My knowledge of OpsWorks is very superficial, so I began by creating a stack with 1 layer and 1 instance. I used only AWS recipes to create a PHP Application layer:
[IMG]
When I try to boot my first instance, I get the error "start_failed". My problem is: I can't see any logs to find out what is going on, because it stays in pending status forever:
[IMG]
I have already tried to access it via SSH and the AWS CLI, but I still can't get any logs.

If your instance is in a start_failed state, this can indicate quite a few possible issues; a lot of them are covered in this specific troubleshooting documentation.
Since you appear to be able to SSH into the instance, you will want to check the OpsWorks agent logs for errors. These are available (with elevated privileges) in:
/var/log/aws/opsworks
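A minimal sketch of checking them over SSH (the user and instance address are placeholders, and the exact log file names can vary):
ssh ec2-user@<instance-public-ip>
sudo ls /var/log/aws/opsworks/
sudo tail -n 100 /var/log/aws/opsworks/opsworks-agent.log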

Related

Unable to pull logs from Airflow Worker

I've got a simple docker development setup for Airflow that includes separate containers for the Airflow UI and Worker. I'm encountering a 403 Forbidden error whenever I attempt to view the log for a task in the Airflow UI.
So far I've ensured they all have the same secret key (in fact, using Docker Volumes they're all reading the exact same configuration file) but this doesn't seem to help. I haven't done anything about time sync, but I'd expect that docker containers would effectively be sharing the system clock anyway so I don't see how they'd get out of sync in the first place.
I can find the log file on the Airflow worker, and the task has run successfully, but something is obviously missing that should allow the Airflow UI to display it (and it would be much more convenient for my workflow to see the logs in the UI rather than having to rummage around on the worker).
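One sanity check worth running, assuming Airflow 2.x and container names along the lines of airflow-webserver and airflow-worker: confirm both containers resolve the same webserver secret_key, since a mismatch is a common cause of 403s on the log endpoint.
docker exec airflow-webserver airflow config get-value webserver secret_key
docker exec airflow-worker airflow config get-value webserver secret_key
# both should print the same value; if not, the UI cannot authenticate its log requests to the worker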

HTTP 503 errors from Cloud Run app in one GCP project but not the other

The issue
I am using the same container (similar resources) in 2 projects, production and staging. Both have custom domains set up with Cloudflare DNS and are in the same region. The container build is done in a completely different project, and IAM is used to handle access to these containers. Both projects' services have a concurrency of 80 and a 300-second timeout for all 5 services.
Everything was working fine 3 days ago, but since yesterday almost all Cloud Run services on staging (thankfully) started throwing 503s randomly and for most requests. Some services had not even been deployed for a week. The same containers are running fine in the production project, no issues.
Ruled out causes
anything to do with Cloudflare (I tried the URL Cloud Run gives; it has the 503 issue too)
anything with the build or containers (I tried the demo hello world container in Go; it has the issue too)
Resources: I tried giving it 1 GB RAM and 2 CPUs, but the problem persisted
issues with deployment (deployed multiple branches; didn't work)
issue in the code (routed traffic to a revision from 2-3 days ago, but the issue was still there)
issue at the service level (I used the same container to create a completely new service; it also had the issue)
Possible causes
something in Cloud Run or the Cloud Run load balancer
maybe some env vars, but that also doesn't seem to be the issue
Response Codes
I just ran a quick check with vegeta (30 secs at 10 rps) against the same container on staging and production for a static file path, and below are the responses:
Staging
Production
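For reference, a vegeta run of that shape looks roughly like this (the service URL and path are placeholders):
echo "GET https://my-service-xyz-uc.a.run.app/static/app.css" | vegeta attack -rate=10 -duration=30s | vegeta report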
If anyone has any insights on this it would help greatly.
Based on your explanation, I can't tell what's going on. You explained what doesn't work, but didn't point out what does work (does your app run locally? Are you able to run a hello world sample application?).
So I'll recommend some debugging tips.
If you're getting an HTTP 5xx status code, first check your application's logs. Is it printing ANY logs? Are there logs for a request? Was your application deployed with a "verbose" logging setting?
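For example, recent errors can be pulled with the gcloud CLI (the project ID below is a placeholder):
gcloud logging read 'resource.type="cloud_run_revision" AND severity>=WARNING' --project=my-staging-project --freshness=1d --limit=20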
Try hitting your *.run.app domain directly. If that's not working either, then it's not a domain, DNS, or Cloudflare issue; try debugging and/or redeploying your app, and deploy something that works first. If the *.run.app domain works, then the issue is not in Cloud Run.
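For example (the service URL is a placeholder):
curl -s -o /dev/null -w '%{http_code}\n' https://my-service-abc123-uc.a.run.app/
# a 503 here, with Cloudflare out of the path, points at the service or Cloud Run itself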
Make sure you aren't using Cloudflare in proxy mode (i.e. your DNS should point to Cloud Run, not Cloudflare), as right now there's a known issue with certificate issuance/renewals when domains are behind Cloudflare.
Beyond these, if a redeploy seems to solve your problem, try redeploying. It's quite likely that some configuration recently diverged between the two projects.
See Cloud Run Troubleshooting
https://cloud.google.com/run/docs/troubleshooting
Do you see 503 errors under high load?
The Cloud Run (fully managed) load balancer strives to distribute incoming requests over the necessary amount of container instances. However, if your container instances are using a lot of CPU to process requests, the container instances will not be able to process all of the requests, and some requests will be returned with a 503 error code.
To mitigate this, try lowering the concurrency. Start from concurrency = 1 and gradually increase it to find an acceptable value. Refer to Setting concurrency for more details.
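A sketch of that experiment with the gcloud CLI (the service name and region are placeholders):
gcloud run services update my-service --concurrency=1 --region=us-central1
# re-run the load test, then raise concurrency step by step (1 -> 10 -> 40 -> 80) until errors reappear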

Jenkins: Cloud or AMI instance cap would be exceeded for: <name>

Using the ec2 plugin v1.39 to start worker nodes on EC2, I am faced with this error (and a huge stack trace) every time I start a new node.
Cloud or AMI instance cap would be exceeded for: <name>
I have set the (previously unset) Instance Cap to 10 in both fields in Configure System. This did not fix it.
Can anyone suggest what might be the problem? Thanks
EDIT 1:
I have tried changing the instance size, with no change (I went M3Medium -> M4Large).
See full stack trace here.
I can also launch an m4.large from the console. Turns out the m3.medium doesn't exist in Sydney... Hmm.
Setting all the log levels to ALL might give you extra information about the error; the endpoint is /log/levels.
Anyway, it seems like an issue we had previously with the private SSH key not being set properly: the slave can't connect, so failed launches keep counting toward the cap.
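One way to check that theory: take the private key configured for the EC2 cloud and try reaching a launched worker by hand (the key path, user, and IP are placeholders):
ssh -i /path/to/jenkins-ec2-worker.pem ec2-user@<worker-public-ip>
# if this fails, the plugin cannot connect either, and dead launches keep counting toward the instance cap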

Is there any way to obtain detailed logging info when executing 'docker stack deploy'?

In Docker 17.03, when executing
docker stack deploy -c docker-compose.yml [stack-name]
the only info that is output is:
Creating network <stack-name>_myprivatenet
Creating service <stack-name>_mysql
Creating service <stack-name>_app
Is there a way to have Docker output more detailed info about what is happening during deployment?
For example, the following information would be extremely helpful:
the image (e.g. the 'mysql' image) is being downloaded from the registry (and provide the registry's info)
if, say, the 'app' image cannot be downloaded from its private registry, an error message should be output (e.g. due to incorrect or omitted credentials; registry login required)
Perhaps it could be provided in either of the following ways:
docker stack deploy --logs
docker stack log
Thanks!
docker stack logs is actually a requested feature in issue 31458
request for a docker stack logs which can show the logs for a docker stack much like docker service logs work in 1.13.
docker-compose works similarly today, showing the interleaved logs for all containers deployed from a compose file.
This will be useful for troubleshooting any kind of errors that span across heterogeneous services.
This is still pending though, because, as Drew Erny (dperny) details:
there are some changes that have to be made to the API before we can pursue this, because right now we can only get the logs for 1 service at a time unless you make multiple calls (which is silly, because we can get the logs for multiple services in the same stream on swarmkit's side).
After I finish those API changes, this can be done entirely on the client side, and should be really straightforward. I don't know when the API changes will be in because I haven't started yet, but I can let you know as soon as I have them!
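Until that lands, a workable per-service approach with existing commands (the stack name is a placeholder; docker service logs needs 17.05+ experimental or 17.06+):
docker stack ps --no-trunc mystack            # task states plus pull errors such as "No such image"
docker service logs --tail 50 mystack_app     # one service at a time
docker service logs --tail 50 mystack_mysql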

Why does Amazon Elastic Beanstalk take so long to update my deployment?

I have Amazon EB with 64bit Amazon Linux 2014.09 v1.0.9 running Ruby 2.1 (Puma, Nginx).
Suddenly, when I deployed my project, I got the following error in my terminal:
ERROR: Timed out while waiting for command to Complete
Note: this didn't happen before.
I see the event in the console and this is the log:
Update environment operation is complete, but with command timeouts. Try increasing the timeout period. For more information, see troubleshooting documentation.
I've already tried increasing the timeout, without success:
option_settings:
  - namespace: aws:elasticbeanstalk:command
    option_name: Timeout
    value: 1800
The health status takes a long time to turn green (approx. 20 min), and then it takes another long while to update the instance with the new changes (approx. another 20 min). I have only one instance.
How can I see other logs?
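(For pulling more logs, assuming the EB CLI is configured for this environment, something along these lines should work:)
eb logs --all                        # downloads the full log bundle to .elasticbeanstalk/logs/
# on the instance itself, deployment command output is in /var/log/eb-activity.log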
This seems like a rather common problem with Elastic Beanstalk. In short, your EC2 instance is going haywire. What you can do is terminate the EC2 instance from the EC2 dashboard; the environment will launch a replacement instance, and that may solve your problem. To minimise any downtime, you can start the new instance first and then terminate the older one. Just be aware that you will lose any ephemeral data and may have to reinstall certain dependencies (if they are not in your .ebextensions).
Let me know if you need any more help. Do check out the AWS Elastic Beanstalk forum.
Cheers,
biobirdman
The problem was the RAM on the instance, so I had to replace that instance with a bigger one.
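A quick way to confirm memory pressure like that over SSH (standard Amazon Linux commands):
free -m                              # available RAM on the instance
dmesg | grep -i "out of memory"      # any OOM-killer activity during the failed deploys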
