Docker pyspark cluster container not receiving kafka streaming from the host? - docker

I have created and deployed a Spark cluster which consists of 4 containers running:
spark master
spark-worker
spark-submit
data-mount-container: to access the script from the local directory
I added the required dependency JARs in all of these containers.
I also deployed Kafka on the host machine, where it produces a stream via a producer.
I launched Kafka following the exact steps in the document below:
https://kafka.apache.org/quickstart
I verified that the Kafka producer and consumer can exchange messages on port 9092, which works fine.
Below is the simple PySpark script that I want to run as a structured streaming job:
from pyspark import SparkContext
from pyspark.sql import SparkSession
print("Kafka App launched")
spark = SparkSession.builder.master("spark://master:7077").appName("kafka_Structured").getOrCreate()
df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "hostmachine:9092").option("subscribe", "session-event").option("maxOffsetsPerTrigger", 10).load()
converted_string=df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
print("Recieved Stream in String", converted_string)
and below is the spark-submit command I used to execute the script:
## containers
# pyspark_vol - container for volume mounting
# spark/stru_kafka - container for spark-submit
# I already linked the spark master and worker using the container 'master'
## spark-submit
docker run --add-host="localhost: myhost" --rm -it --link master:master --volumes-from pyspark_vol spark/stru_kafka spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1 --jars /home/spark/spark-2.1.1-bin-hadoop2.6/jars/spark-sql-kafka-0-10_2.11-2.1.1.jar --master spark://master:7077 /data/spark_session_kafka.py localhost 9092 session-event
After I ran the script, it executed fine, but it does not seem to be listening to the stream from the Kafka producer as a batch, and it stops execution.
I didn't observe any specific error, but the script does not produce any output.
I verified connectivity in receiving data from the host inside the Docker container using a socket program, which works fine.
I am not sure if I have missed any configuration.
Expected:
The above application, running on the Spark cluster, should print the stream coming from the Kafka producer.
Actual:
"id" : "f4e8829f-583e-4630-ac22-1d7da2eb80e7",
"runId" : "4b93d523-7b7c-43ad-9ef6-272dd8a16e0a",
"name" : null,
"timestamp" : "2020-09-09T09:21:17.931Z",
"numInputRows" : 0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"addBatch" : 1922,
"getBatch" : 287,
"getOffset" : 361,
"queryPlanning" : 111,
"triggerExecution" : 2766,
"walCommit" : 65
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "KafkaSource[Subscribe[session-event]]",
"startOffset" : null,
"endOffset" : {
"session-event" : {
"0" : 24
}
},
"numInputRows" : 0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "org.apache.spark.sql.execution.streaming.ConsoleSink#6a1b0b4b"
}
}

According to the Quick Example provided in the Spark documentation, you need to start your query and wait for its termination.
In your case that means you need to replace
print("Recieved Stream in String", converted_string)
with
query = df.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()

The issue was with my pyspark_stream script, where I had missed providing a batch processing interval and a print statement to view the logs...
Since it's not an aggregated stream, I had to use 'append' output mode here:
result = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
print("Kafka Streaming output is", result)
query = result.writeStream.outputMode("append").format("console").trigger(processingTime='30 seconds').start()
query.awaitTermination()
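For reference, here is a minimal end-to-end sketch putting the pieces above together (the broker address hostmachine:9092, the topic session-event and the 30-second trigger are taken from the question; adjust them to your environment):
from pyspark.sql import SparkSession
# Build the session against the cluster master used in the question
spark = SparkSession.builder.master("spark://master:7077").appName("kafka_Structured").getOrCreate()
# Read the Kafka topic as a streaming DataFrame
df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "hostmachine:9092").option("subscribe", "session-event").option("maxOffsetsPerTrigger", 10).load()
# Kafka keys/values arrive as binary, so cast them to strings
result = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
# Start the query against the console sink and block until it is stopped
query = result.writeStream.outputMode("append").format("console").trigger(processingTime='30 seconds').start()
query.awaitTermination()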

Related

Unable to run couchbase container with non-standard ports

Update 04/09/20 (dd/mm/yyyy)
I tried running my couchbase instance on custom ports by playing with couchbase configuration files.
docker run -d --privileged --memory 3200M --name bumblebase \
-v '$PWD/static_config:/opt/couchbase/etc/couchbase/static_config' \
-p 3456-3461:3456-3461 -p 6575-6576:6575-6576 \
couchbase:community-6.6.0
static_config is a file I created with the following content, following the instructions on this page (bottom section):
{rest_port, 3456}.
{query_port, 3458}.
{fts_http_port, 3459}.
{cbas_http_port, 3460}.
{eventing_http_port, 3461}.
{memcached_port, 6575}.
But then I cannot access my Couchbase instance at all (either the web UI or the REST API). I tried to point my local ports to both the custom and the default ports (-p 3456-3461:8091-8096) but neither worked, and the problem disappears only if I remove the -v option - which brings me back to the original post scenario.
On a side note, I'm still trying to play with setting-alternate-address without any success so far. For some reason, when I set the alternate hostname (which seems to be required for this to run), accessing the alternate address takes a long time to load and eventually fails with a timeout error.
Original post
Since I develop many applications running different Couchbase clusters, I have to run them in containers on different ports. I created my container with the following command:
docker run -d --memory 2048M --name "my-database" \
-p 3456-3461:8091-8096 \
-p 6210-6211:11210-11211 \
couchbase
I then added buckets, and accessing the web UI at localhost:3456 works fine.
Here is my server info:
In my code I have the following connect function:
var cluster *gocb.Cluster
func Cluster() *gocb.Cluster {
return cluster
}
func init() {
c, err := gocb.Connect(
"couchbase://127.0.0.1:3456",
gocb.ClusterOptions{
Username: "Administrator",
Password: "password",
TimeoutsConfig: gocb.TimeoutsConfig{
ConnectTimeout: 30 * time.Second,
},
},
)
if err != nil {
panic(err)
}
cluster = c
}
Panic doesn't trigger, but whenever I try to perform a KV operation, it fails with the following error:
ambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"Add","Opaque":"0x0","TimeObserved":2501547947,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"","LastDispatchedFrom":"","LastConnectionID":""}
And this error only comes when I try to connect to my Docker Couchbase instance. If I run the Couchbase Server application and connect to the appropriate port (8091), it works perfectly fine:
var cluster *gocb.Cluster
func Cluster() *gocb.Cluster {
return cluster
}
func init() {
c, err := gocb.Connect(
"couchbase://localhost",
gocb.ClusterOptions{
Username: "Administrator",
Password: "password",
TimeoutsConfig: gocb.TimeoutsConfig{
ConnectTimeout: 30 * time.Second,
},
},
)
if err != nil {
panic(err)
}
cluster = c
}
I checked the credentials and they are correct; also, replacing localhost with 0.0.0.0 or 127.0.0.1 didn't help at all.

How to add the docker name parameter into a Kubernetes cluster

I am deploying the xxl-job application in Kubernetes (v1.15.2). The application now deploys successfully, but the registry client service fails. If deployed in Docker, it would look like this:
docker run -e PARAMS="--spring.datasource.url=jdbc:mysql://mysql-service.example.com/xxl-job?Unicode=true&characterEncoding=UTF-8 --spring.datasource.username=root --spring.datasource.password=<mysql-password>" -p 8180:8080 -v /tmp:/data/applogs --name xxl-job-admin -d xuxueli/xxl-job-admin:2.0.2
and when starting the application, the server side gives me this warning:
22:33:21.078 logback [http-nio-8080-exec-7] WARN o.s.web.servlet.PageNotFound - No mapping found for HTTP request with URI [/xxl-job-admin/api/registry] in DispatcherServlet with name 'dispatcherServlet'
I searched the project's issues and found that the problem may be that I could not pass the project name, as in Docker, to be part of its URL, which produces this warning. The client side gives this error:
23:19:18.262 logback [xxl-job, executor ExecutorRegistryThread] INFO c.x.j.c.t.ExecutorRegistryThread - >>>>>>>>>>> xxl-job registry fail, registryParam:RegistryParam{registryGroup='EXECUTOR', registryKey='job-schedule-executor', registryValue='172.30.184.4:9997'}, registryResult:ReturnT [code=500, msg=xxl-rpc remoting fail, StatusCode(404) invalid. for url : http://xxl-job-service.dabai-fat.svc.cluster.local:8080/xxl-job-admin/api/registry, content=null]
So to solve the problem, I should run the command in Kubernetes as closely as possible to how it is run with Docker. The question is: how do I pass the docker command's --name to the Kubernetes environment? I have already tried this:
"env": [
{
"name": "name",
"value": "xxl-job-admin"
}
],
and also tried this:
"containers": [
{
"name": "xxl-job-admin",
"image": "xuxueli/xxl-job-admin:2.0.2",
}
]
Neither worked.

Vault Docker Image - Can't Get REST Response

I am deploying the Vault Docker image on Ubuntu 16.04. I can successfully initialize it from inside the container itself, but I can't get any REST responses, and even curl does not work.
I am doing the following:
Create the config file local.json:
{
"listener": [{
"tcp": {
"address": "127.0.0.1:8200",
"tls_disable" : 1
}
}],
"storage" :{
"file" : {
"path" : "/vault/data"
}
},
"max_lease_ttl": "10h",
"default_lease_ttl": "10h"
}
under the /vault/config directory.
Run the command to start the image:
docker run -d -p 8200:8200 -v /home/vault:/vault --cap-add=IPC_LOCK vault server
Enter a shell in the container:
docker exec -it containerId /bin/sh
Run the following inside the container:
export VAULT_ADDR='http://127.0.0.1:8200' and then vault init
It works fine, but when I try to send a REST request to check whether Vault is initialized:
GET request to the following URL: http://Ip-of-the-docker-host:8200/v1/sys/init
I get no response.
Even the curl command fails:
curl http://127.0.0.1:8200/v1/sys/init
curl: (56) Recv failure: Connection reset by peer
I didn't find a proper explanation anywhere online of what the problem is, or whether I am doing something wrong.
Any ideas?
If a server running in a Docker container binds to 127.0.0.1, it's unreachable from anything outside that specific container (and since containers usually only run a single process, that means it's unreachable by anyone). Change the listener address to 0.0.0.0:8200; if you need to restrict access to the Vault server, bind it to a specific host address in the docker run -p option.
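Once the listener is rebound, a quick way to verify reachability from the host is to hit the same endpoint the question uses; here is a minimal sketch using Python's standard library (docker-host is a placeholder for your Docker host's address, and it assumes port 8200 is published with -p 8200:8200):
# Reachability check against Vault's init-status endpoint
import json
import urllib.request
with urllib.request.urlopen("http://docker-host:8200/v1/sys/init", timeout=5) as resp:
    print(json.load(resp))  # e.g. {"initialized": true}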

How do I tail the logs of ALL my docker containers?

I can tail the logs of a single docker container by doing:
docker logs -f container1
But, how can I tail the logs of multiple containers on the same screen?
docker logs container1 container2
doesn’t work. It gives an error:
“docker logs” requires exactly 1 argument(s).
Thank you.
If you are using docker-compose, this will show all logs from the different containers:
docker-compose logs -f
If you have root access to the Docker server:
tail -f /var/lib/docker/containers/*/*.log
The docker logs command can't stream multiple log files.
Logging Drivers
You could use one of the logging drivers other than the default json-file to ship the logs to a common point. The systemd journald or syslog drivers would readily work on most systems. Any of the other centralised log systems would work too.
Note that configuring syslog on the Docker daemon means the docker logs command can no longer query the logs; they will only be stored wherever your syslog puts them.
A simple daemon.json for syslog:
{
"log-driver": "syslog",
"log-opts": {
"syslog-address": "tcp://10.8.8.8:514",
"syslog-format": "rfc5424"
}
}
Compose
docker-compose is capable of streaming the logs for all containers it controls under a project.
API
You could write a tool that attaches to each container via the API and streams the logs over a websocket. Two of the Java libraries are docker-client and docker-java.
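As a sketch of the same idea in Python, the Docker SDK for Python (pip install docker) can follow each running container's log stream over the API, assuming the daemon socket is accessible:
# Follow the logs of all running containers concurrently via the Docker API
import threading
import docker  # Docker SDK for Python
client = docker.from_env()
def follow(container):
    # logs(stream=True, follow=True) yields log output as a stream of bytes, like `docker logs -f`
    for line in container.logs(stream=True, follow=True, tail=1):
        print(container.name + ": " + line.decode().rstrip())
threads = [threading.Thread(target=follow, args=(c,), daemon=True) for c in client.containers.list()]
for t in threads:
    t.start()
for t in threads:
    t.join()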
Hack
Or run multiple docker logs and munge the output, in node.js:
const { spawn } = require('child_process')
// Spawn `docker logs --follow` for one container and prefix each output line with its id
function run(id){
let dkr = spawn('docker', [ 'logs', '--tail', '1', '-t', '--follow', id ])
dkr.stdout.on('data', data => console.log('%s: stdout', id, data.toString().replace(/\r?\n$/,'')))
dkr.stderr.on('data', data => console.error('%s: stderr', id, data.toString().replace(/\r?\n$/,'')))
dkr.on('close', exit_code => {
if ( exit_code !== 0 ) throw new Error(`Docker logs ${id} exited with ${exit_code}`)
})
}
// Follow every container id passed on the command line
let args = process.argv.splice(2)
args.forEach(arg => run(arg))
This dumps data as docker logs writes it:
○→ node docker-logs.js 958cc8b41cd9 1dad69882b3d db4b844d9478
958cc8b41cd9: stdout 2018-03-01T06:37:45.152010823Z hello2
1dad69882b3d: stdout 2018-03-01T06:37:49.392475996Z hello
db4b844d9478: stderr 2018-03-01T06:37:47.336367247Z hello2
958cc8b41cd9: stdout 2018-03-01T06:37:55.155137606Z hello2
db4b844d9478: stderr 2018-03-01T06:37:57.339710598Z hello2
1dad69882b3d: stdout 2018-03-01T06:37:59.393960369Z hello

IntelliJ docker integration can't open ports

The Docker integration has a weirdly proprietary config format, and it's very unpredictable and quite frustrating.
This is the command I want to run for my container:
docker run -p 9999:9999 mycontainer
Pretty much the simplest command. I can start my container with this command, see the ports open in Kitematic, and access it from the host.
I tried to do this in the Docker run config by clicking CLI and generating a JSON settings file (this alone is weird and convoluted).
It gave me this JSON:
{
"AttachStdin" : true,
"Tty" : true,
"OpenStdin" : true,
"Image" : "",
"Volumes" : { },
"ExposedPorts" : { },
"HostConfig" : {
"Binds" : [ ],
"PortBindings" : {
"9999/tcp" : [ {
"HostIp" : "",
"HostPort" : "9999"
} ]
}
},
"_comment" : ""
}
I then execute the run config, and according to IntelliJ the port is open (looking under the Port Bindings section of the Docker tab). But it's not open: it's not accessible from the host, and Kitematic doesn't show it open either.
How do I get this working as a run config? How do I see what docker command IntelliJ is actually running? Maybe it's just using the API programmatically.
It seems the IntelliJ Docker integration requires you to explicitly declare open ports with EXPOSE in the Dockerfile.
