Getting an error while running multiple Docker containers of SonarQube

I'm trying to run 2 tasks of the same SonarQube container using the AWS ECS service (EC2 instances, not Fargate). Only 1 ECS instance is running, and I'm using EBS volumes for storing the SonarQube data and extensions, like this:
/opt/sonarqube/data
/opt/sonarqube/extensions
If I run just 1 ECS task (1 Docker container of SonarQube), the SonarQube application runs perfectly and I can access it. However, if I scale the service to an additional task (i.e. 2 Docker containers of SonarQube on the same ECS instance), I get the locking error below and one of the tasks never reaches the 'RUNNING' state:
2021.03.18 05:09:19 ERROR es[][o.e.b.ElasticsearchUncaughtExceptionHandler] uncaught exception in thread [main]
org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: failed to obtain node locks, tried [[/opt/sonarqube/data/es7]] with lock id [0]; maybe these locations are not writable or multiple nodes were started without increasing [node.max_local_storage_nodes] (was [1])?
    at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:174) ~[elasticsearch-7.10.2.jar:7.10.2]
    at org.elasticsearch.node.Node.<init>(Node.java:289) ~[elasticsearch-7.10.2.jar:7.10.2]
How can I make sure this issue does not come up and I can scale the service as and when required using the autoscaling feature?
Cheers,
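For illustration only (a hedged sketch, not an official SonarQube or ECS recipe): the lock error suggests both tasks mount the same EBS-backed path at /opt/sonarqube/data, so both embedded Elasticsearch nodes try to lock data/es7. With plain docker run on the instance, the difference would look roughly like this (image tag and host paths are placeholders; ports and DB settings omitted):

# Both containers share one data directory -> the second one fails with "failed to obtain node locks"
docker run -d -v /mnt/ebs/sonarqube/data:/opt/sonarqube/data sonarqube:lts-community
docker run -d -v /mnt/ebs/sonarqube/data:/opt/sonarqube/data sonarqube:lts-community

# Each container gets its own data directory -> no node-lock conflict
docker run -d -v /mnt/ebs/sonarqube/data-1:/opt/sonarqube/data sonarqube:lts-community
docker run -d -v /mnt/ebs/sonarqube/data-2:/opt/sonarqube/data sonarqube:lts-community

Note that separate data directories only avoid the lock error; as far as I know, running several SonarQube instances against one database (true clustering) is a Data Center Edition feature, so scaling the Community image this way may not be supported.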

Related

Apache Spark: spark executor pod isn't able to pull docker image from a registry/repo

I'm new to Apache Spark.
I'm trying to run a Spark session using PySpark.
I have configured it to use 2 executor nodes.
Both executor nodes need to pull my custom-built Spark image, which is in a repo.
Below is the Python configuration for my Spark session/job:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("sparkpi-test1")
    .master("k8s://https://kubernetes.default:443")
    .config("spark.kubernetes.container.image", "<repo>")
    .config("spark.kubernetes.authenticate.caCertFile", "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt")
    .config("spark.kubernetes.authenticate.oauthTokenFile", "/var/run/secrets/kubernetes.io/serviceaccount/token")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark-driver-0")
    .config("spark.executor.instances", 2)
    .config("spark.driver.host", "test")
    .config("spark.driver.port", "20020")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.hive.convertMetastoreParquet", "false")
    .config("spark.jars.packages", "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.1,org.apache.spark:spark-avro_2.12:3.1.2")
    .config("spark.kubernetes.node.selector.testNodeCategory", "ondemand")
    .getOrCreate()
)
sparkpi-test1-2341a185c8144b60-exec-1   0/1   ImagePullBackOff   0   5h17m
sparkpi-test1-2341a185c8144b60-exec-2   0/1   ImagePullBackOff   0   5h17m
So, correct me if I'm doing anything wrong.
I'm trying to set up Spark in my existing Kubernetes cluster using my custom-built Spark image from a repo.
I referenced that image in the configuration in my Python file:
.config("spark.kubernetes.container.image", "<repo>")
According to the docs:
Container image to use for the Spark application. This is usually of the form example.com/repo/spark:v1.0.0. This configuration is required and must be provided by the user, unless explicit images are provided for each different container type.
Why is my executor node failing to pull the image from the registry?
How do I pull it manually on the executor node for the time being?
Just for reference, find the error messages below:
WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
I guess the above error message appears because my executor pods weren't created successfully.
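Setting the eventual root cause below aside, a hedged sketch of how one might inspect why the pull fails and try it by hand (pod name taken from the output above; <repo> is the image reference):

# Show the events Kubernetes recorded for the failing executor pod (includes the pull error)
kubectl describe pod sparkpi-test1-2341a185c8144b60-exec-1

# Or list recent events in the namespace, oldest first
kubectl get events --sort-by=.metadata.creationTimestamp

# On the node where the pod was scheduled, try pulling the image manually
docker pull <repo>          # or: crictl pull <repo> on containerd-based nodes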
I've got it.
I was using Terraform to build all the resources.
The .tfstate file had changed and was causing the pods to have these errors.
Clearing the Terraform cache solved my problem.
To clean the Terraform cache, run
rm -rf .terraform
in your Terraform directory.

AWS ECS - Task exited automatically with exit code 0

I am trying to run 1 service in AWS ECS and I am getting "essential container in task exited (exit code 0)" as the error.
When checking the logs, I don't see any logs either.
Troubleshooting done:
Checked CloudWatch logs/insights for logs (missing)
Tried deploying another service as a container (running successfully)
Tried deploying the same container manually on the ECS instance (running successfully)
Tried changing the task definition and changing the cluster (not working)
Still, only for this 1 service, I keep getting the same error again and again.
Even when checking for logs, I am not able to get any logs.
Can anyone suggest what to do here?
(A screenshot showing an example of the error was attached here.)
The problem got solved.
There was an issue with environment variables, as they were not loading in the Docker container.
Hence the container came up and then drained out.
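For anyone debugging something similar, a hedged sketch of how to check whether the expected environment variables actually reach the container (my-task and the container ID are placeholders):

# Inspect the environment section of the registered task definition
aws ecs describe-task-definition --task-definition my-task \
    --query 'taskDefinition.containerDefinitions[0].environment'

# On the ECS instance, inspect the (running or exited) container directly
docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' <container-id>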

Running a job in Apache Flink standalone mode on Zeppelin, I get this error: "TooLongFrameException: Adjusted frame length exceeds"

I'm trying to run a simple job in Apache Flink using Zeppelin.
I have created a Docker container with Zeppelin 0.8.0 running. In the same container I have Apache Flink 1.6.1 running. When I run the job using the Flink interpreter that Zeppelin has by default, it works well and I can see the result of the job. But when I do it in my own interpreter with Flink 1.6.1 running in standalone mode, I get the following error.
Configuration of my interpreter:
Properties set in flink-conf.yaml:
jobmanager.rpc.address: 0.0.0.0
jobmanager.rpc.port: 6134
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
taskmanager.numberOfTaskSlots: 1
parallelism.default: 1
The error I get in the Flink log (which means Flink does receive the job from Zeppelin):
2019-01-18 15:30:53,761 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [/172.17.0.2:54842] failed with org.apache.flink.shaded.akka.org.jboss.netty.handler.codec.frame.TooLongFrameException: Adjusted frame length exceeds 10485760: 2147549189 - discarded
I know Akka has the message frame size set to 10485760b by default, but I'm receiving the value 2147549189, which is far too big. I tried modifying the akka.framesize property (10485760b) in flink-conf.yaml, but I can't put such a huge size there; that doesn't make much sense to me.
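A hedged diagnostic sketch (not from the original post): an absurd frame length like this often shows up when something that isn't an Akka client connects to the Akka RPC port, so it may be worth confirming which ports the JobManager is actually listening on and that the interpreter points at jobmanager.rpc.port (the flink-conf.yaml path below is an assumption):

# Which ports is the JobManager actually listening on inside the container?
netstat -tlnp | grep java        # or: ss -tlnp | grep java

# Double-check the RPC address/port the interpreter should be pointed at
grep -E 'jobmanager\.rpc\.(address|port)' /opt/flink/conf/flink-conf.yaml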

spring-cloud-netflix zero downtime deployments on AWS ECS

We're running spring-cloud microservices using Eureka on AWS ECS. We're also doing continuous deployment, and we've run into an issue where rolling production deployments cause a short window of service unavailability. I'm focusing here on @LoadBalanced RestTemplate clients using Ribbon. I think I've gotten retry working adequately in my local testing environment, but I'm concerned about new service instance Eureka registration lag time and the way ECS rolling deployments work.
When we merge a new commit to master, if the build passes (compiles and tests pass) our Jenkins pipeline builds and pushes a new Docker image to ECR, then creates a new ECS task definition revision pointing to the updated Docker image, and updates the ECS service. As an example, we have an ECS service definition with desired task count set to 2, minimum percent available set to 100%, and maximum percent available set to 200%. The ECS service scheduler starts 2 new Docker containers using the new image, leaving the existing 2 Docker containers running on the old image. We use container health checks that pass once the actuator health endpoint returns 200, and as soon as that happens, the ECS service scheduler stops the 2 old containers running on the old Docker image.
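For context, a hedged sketch of what that pipeline step might look like with the AWS CLI (cluster, service, and task-definition names are placeholders):

# Register a new task definition revision pointing at the new image, then roll the service
aws ecs register-task-definition --cli-input-json file://taskdef.json
aws ecs update-service --cluster my-cluster --service my-service \
    --task-definition my-task \
    --deployment-configuration minimumHealthyPercent=100,maximumPercent=200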
My understanding here could be incorrect, so please correct me if I'm wrong about any of this. Eureka clients fetch the registry every 30 seconds, so there's up to 30 seconds where all the client has in the server list is the old service instances, so retry won't help there.
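A hedged way to watch this lag directly (host, port, and app name are placeholders; /eureka/apps is the standard Eureka REST endpoint):

# Poll the Eureka registry and watch when new instances appear and old ones disappear
watch -n 5 'curl -s http://eureka-host:8761/eureka/apps/MY-SERVICE | grep -E "<instanceId>|<status>"'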
I asked AWS support about how to delay ECS task termination during rolling deploys. When ECS services are associated with an ALB target group, there's a deregistration delay setting that ECS respects, but no such option exists when a load balancer is not involved. The AWS response was to run the java application via an entrypoint bash script like this:
#!/bin/bash
# On SIGTERM, wait 45 seconds before shutting down the child process group.
cleanup() {
    date
    echo "Received SIGTERM, sleeping for 45 seconds"
    sleep 45
    date
    echo "Killing child process"
    kill -- -$$
}
trap 'cleanup' SIGTERM
# Run the container command in the background and wait on it so the trap can fire.
"$@" &
wait $!
When ECS terminates the old instances, it sends SIGTERM to the Docker container; this script traps it, sleeps for 45 seconds, then continues with the shutdown. I'll also have to change an ECS config parameter in /etc/ecs that controls the grace period before ECS sends a SIGKILL after the SIGTERM; it defaults to 30 seconds, which is not quite long enough.
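For reference, a hedged sketch of the agent-level setting I believe this refers to (the 60s value is just an example):

# /etc/ecs/ecs.config on the container instance (default is 30s)
ECS_CONTAINER_STOP_TIMEOUT=60s

# Restart the ECS agent so the new value takes effect
sudo systemctl restart ecs       # older Amazon Linux AMIs: sudo stop ecs && sudo start ecs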
This feels dirty to me. I'm not sure that script isn't going to cause some other unforeseen issue; does it forward all signals appropriately? It feels like an unwanted complication.
Am I missing something? Can anyone spot anything wrong with AWS support's suggested entrypoint script approach? Is there a better way to handle this and achieve the desired result, which is zero downtime rolling deployments on services registered in eureka on ECS?

Mesos: Failed to get/update resource statistics for executor

We are having issues with mesos-agent logs filling up with messages like:
2018-06-19T07:31:05.247394+00:00 mesos-slave16 mesos-slave[10243]: W0619 07:31:05.244067 10249 slave.cpp:6750] Failed to get resource statistics for executor 'research_new-benchmarks_production_testbox-58-1529393461975-1-mesos_slave16' of framework Singularity-PROD: Failed to run 'docker -H unix:///var/run/docker.sock inspect mesos-7560fb72-28d3-4cce-8cb0-de889248cf93': exited with status 1; stderr='Error: No such object: mesos-7560fb72-28d3-4cce-8cb0-de889248cf93
or
2018-06-19T07:31:09.904414+00:00 mesos-slave16 mesos-slave[10243]: E0619 07:31:09.903687 10251 slave.cpp:4721] Failed to update resources for container b9a9f7f9-938b-4ec4-a245-331122471769 of executor 'hera_listening-api_production_checkAlert-93-1529393402085-1-mesos_slave16-us_west_2a' running task hera_listening-api_production_checkAlert-93-1529393402085-1-mesos_slave16 on status update for terminal task, destroying container: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/14447/cgroup: Failed to open file: No such file or directory
We are running 3x HA mesos-masters, the Marathon framework, and the Singularity framework; this is happening with tasks from both frameworks. Tasks are running, and crons (from Singularity) are running OK too, but I am confused by those messages. We have more than 600 long-running Marathon tasks and more than 30 crons starting every few minutes.
Docker version: 18.03.0-ce
Mesos version: 1.4.0-2.0.1
Marathon version: 1.4.2-1.0.647.ubuntu1604
Singularity version: 0.15.1
Masters and slaves running on Ubuntu 16.04 with AWS kernel - 4.4.0-1060-aws
I think the Mesos executor on the slave is deleted after the task finishes, but Mesos is still trying to get info from Docker, where the task is no longer visible.
Any ideas? Thanks
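A hedged way to confirm that hypothesis: run the same inspect command the agent runs (container ID taken from the first log line) and check whether Docker still knows about the container at all:

docker -H unix:///var/run/docker.sock inspect mesos-7560fb72-28d3-4cce-8cb0-de889248cf93
docker -H unix:///var/run/docker.sock ps -a | grep mesos-7560fb72

# "Error: No such object" here means the container really is gone by the time the agent polls it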
Marathon is a scheduler framework for permanent (long-running) tasks. Even when tasks exit successfully, it will still insist on re-scheduling them all the time.
Health checks are one of its important features. Maybe try Chronos; it's another framework that works on Apache Mesos.
