Failed to find data source: kafka (Docker environment)

We are facing this issue at the moment, and none of the suggested "Similar questions" helped us solve the problem. We are new to both Docker and Spark.
We used the following Docker Compose file to set up our containers:
networks:
  spark_net:
volumes:
  shared-workspace:
    name: "hadoop-distributed-file-system"
    driver: local
services:
  jupyterlab:
    image: jupyterlab
    container_name: jupyterlab
    ports:
      - 8888:8888
    volumes:
      - shared-workspace:/opt/workspace
  spark-master:
    image: spark-master
    networks:
      - spark_net
    container_name: spark-master
    ports:
      - 8080:8080
      - 7077:7077
    volumes:
      - shared-workspace:/opt/workspace
  spark-worker-1:
    image: spark-worker
    networks:
      - spark_net
    container_name: spark-worker-1
    environment:
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=512m
    ports:
      - 8081:8081
    volumes:
      - shared-workspace:/opt/workspace
    depends_on:
      - spark-master
  spark-worker-2:
    image: spark-worker
    networks:
      - spark_net
    container_name: spark-worker-2
    environment:
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=512m
    ports:
      - 8082:8081
    volumes:
      - shared-workspace:/opt/workspace
    depends_on:
      - spark-master
  zookeeper:
    image: wurstmeister/zookeeper
    ports:
      - "2181:2181"
  kafka:
    image: wurstmeister/kafka
    ports:
      - "7575"
    environment:
      KAFKA_ADVERTISED_HOST_NAME: 127.0.0.1
      KAFKA_ADVERTISED_LISTENERS: INSIDE://kafka:9093,OUTSIDE://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
      KAFKA_LISTENERS: INSIDE://0.0.0.0:9093,OUTSIDE://0.0.0.0:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    volumes:
      - ./var/run/docker.sock
We also created two Python files to test whether Kafka streaming works:
Producer:
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers=['twitter-streaming_kafka_1:9093'],
                         api_version=(0, 11, 5),
                         value_serializer=lambda x: json.dumps(x).encode('utf-8'))

for e in range(1000):
    data = {'number': e}
    producer.send('corona', value=data)
    time.sleep(0.5)
Consumer:
import time
from kafka import KafkaConsumer, KafkaProducer
from datetime import datetime
import json

print('starting consumer')

consumer = KafkaConsumer(
    'corona',
    bootstrap_servers=['twitter-streaming_kafka_1:9093'],
    auto_offset_reset='earliest',
    enable_auto_commit=True,
    group_id='my-group',
    value_deserializer=lambda x: json.loads(x.decode('utf-8')))

print('printing messages')
for message in consumer:
    message = message.value
    print(message)
We executed both scripts in separate CLIs inside our jupyterlab container and they worked. However, when we try to connect to our producer stream via PySpark with the following code, we get the error mentioned in the title.
import random
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('KafkaStreaming').getOrCreate()

df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "kafka:9093").option("subscribe", "corona").load()
We also executed the following command in the spark-master CLI:
./bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 ...
stacktrace
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
<ipython-input-2-4dba09a73304> in <module>
6
7 spark = SparkSession.builder.appName('KafkaStreaming').getOrCreate()
----> 8 df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "twitter-streaming_kafka_1:9093").option("subscribe", "corona").load()
/usr/local/lib/python3.7/dist-packages/pyspark/sql/streaming.py in load(self, path, format, schema, **options)
418 return self._df(self._jreader.load(path))
419 else:
--> 420 return self._df(self._jreader.load())
421
422 #since(2.0)
/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py in deco(*a, **kw)
132 # Hide where the exception came from that shows a non-Pythonic
133 # JVM exception message.
--> 134 raise_from(converted)
135 else:
136 raise
/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py in raise_from(e)
AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;

Your Kafka container needs to be placed on the spark_net network in order for the Spark containers to resolve it by name.
The same goes for Jupyter if you want it to be able to launch jobs on the Spark cluster.
Also, you need to add the Kafka package to the Spark session that actually runs your code.
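As a minimal sketch of what that could look like on the PySpark side (my own illustration, not part of the original answer; the package coordinates assume Spark 3.0.1 built for Scala 2.12, matching the spark-shell command above, and must match your actual build):

from pyspark.sql import SparkSession

# Assumption: the spark-sql-kafka connector version must match the Spark
# version (3.0.1 here) and the Scala build (2.12 here).
spark = (SparkSession.builder
         .appName('KafkaStreaming')
         .config('spark.jars.packages',
                 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1')
         .getOrCreate())

df = (spark.readStream.format('kafka')
      # 'kafka:9093' only resolves if the kafka service is on the same
      # Docker network (spark_net) as the container running this code.
      .option('kafka.bootstrap.servers', 'kafka:9093')
      .option('subscribe', 'corona')
      .load())

Note that spark.jars.packages only takes effect when the JVM behind the session is launched, so it has to be configured when the first SparkSession of the Python process is created (or passed via PYSPARK_SUBMIT_ARGS / spark-submit --packages).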

Related

PySpark Failed to find data source: kafka in Docker environment

I'm putting together a prototype of streaming data ingestion from MySQL through Debezium and Kafka into Spark, using Docker.
My docker-compose.yml file looks like this:
version: '2'
services:
  zookeeper:
    image: quay.io/debezium/zookeeper:${DEBEZIUM_VERSION}
    ports:
      - 2181:2181
      - 2888:2888
      - 3888:3888
  kafka:
    image: quay.io/debezium/kafka:${DEBEZIUM_VERSION}
    ports:
      - 9092:9092
    links:
      - zookeeper
    environment:
      - ZOOKEEPER_CONNECT=zookeeper:2181
  mysql:
    image: quay.io/debezium/example-mysql:${DEBEZIUM_VERSION}
    ports:
      - 3306:3306
    environment:
      - MYSQL_ROOT_PASSWORD=debezium
      - MYSQL_USER=mysqluser
      - MYSQL_PASSWORD=mysqlpw
  connect:
    image: quay.io/debezium/connect:${DEBEZIUM_VERSION}
    ports:
      - 8083:8083
    links:
      - kafka
      - mysql
    environment:
      - BOOTSTRAP_SERVERS=kafka:9092
      - GROUP_ID=1
      - CONFIG_STORAGE_TOPIC=my_connect_configs
      - OFFSET_STORAGE_TOPIC=my_connect_offsets
      - STATUS_STORAGE_TOPIC=my_connect_statuses
  spark-master:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
  spark-worker:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    links:
      - kafka
  jupyter:
    image: jupyter/pyspark-notebook
    environment:
      - GRANT_SUDO=yes
      - JUPYTER_ENABLE_LAB=yes
      - JUPYTER_TOKEN=mysecret
    ports:
      - "8888:8888"
    volumes:
      - /Users/eugenegoldberg/jupyter_notebooks:/home/eugene
    depends_on:
      - spark-master
My Jupyter notebook (also served by the same docker-compose) has the following PySpark code, which attempts to connect Spark to Kafka:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import os

spark_version = '3.3.1'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:{}'.format(spark_version)

packages = [
    f'org.apache.kafka:kafka-clients:3.3.1'
]

# Create SparkSession
spark = SparkSession.builder \
    .appName("Kafka Streaming Example") \
    .config("spark.driver.host", "host.docker.internal") \
    .config("spark.jars.packages", ",".join(packages)) \
    .getOrCreate()

# Define the Kafka topic and Kafka server/port
topic = "dbserver1.inventory.customers"
kafkaServer = "kafka:9092"  # assuming kafka is running on a container named 'kafka'

# Read data from kafka topic
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafkaServer) \
    .option("subscribe", topic) \
    .load()
I'm getting this error:
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
Cell In[9], line 32
24 kafkaServer = "kafka:9092" # assuming kafka is running on a container named 'kafka'
26 # Read data from kafka topic
27 df = spark \
28 .readStream \
29 .format("kafka") \
30 .option("kafka.bootstrap.servers", kafkaServer) \
31 .option("subscribe", topic) \
---> 32 .load()
File /usr/local/spark/python/pyspark/sql/streaming.py:469, in DataStreamReader.load(self, path, format, schema, **options)
467 return self._df(self._jreader.load(path))
468 else:
--> 469 return self._df(self._jreader.load())
File /usr/local/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py:1321, in JavaMember.__call__(self, *args)
1315 command = proto.CALL_COMMAND_NAME +\
1316 self.command_header +\
1317 args_command +\
1318 proto.END_COMMAND_PART
1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
1324 for temp_arg in temp_args:
1325 temp_arg._detach()
File /usr/local/spark/python/pyspark/sql/utils.py:196, in capture_sql_exception.<locals>.deco(*a, **kw)
192 converted = convert_exception(e.java_exception)
193 if not isinstance(converted, UnknownException):
194 # Hide where the exception came from that shows a non-Pythonic
195 # JVM exception message.
--> 196 raise converted from None
197 else:
198 raise
AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
What do I need to change in order to fix this?
tl;dr Neither spark-master nor spark-worker has the necessary library on its CLASSPATH, hence the infamous AnalysisException: Failed to find data source: kafka.
See Why does format("kafka") fail with "Failed to find data source: kafka." (even with uber-jar)? for some background.
My guess is that you should add the jar to jupyter (so that when it executes Spark code it knows where to find the required classes). See https://stackoverflow.com/a/36295208/1305344 for a solution.
There are also the Jupyter Docker Stacks build manifests that come directly from the Jupyter project.
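For example, a minimal sketch of how the notebook could pull the connector in (my own addition, not part of the original answer; it assumes the jupyter/pyspark-notebook image ships Spark 3.3.1 built for Scala 2.12, so adjust the coordinates to your actual versions):

import os
from pyspark.sql import SparkSession

# Assumptions: Spark 3.3.1 / Scala 2.12 in the notebook image.
# The package that provides the "kafka" source is spark-sql-kafka-0-10;
# kafka-clients alone is not enough, and PYSPARK_SUBMIT_ARGS must end
# with "pyspark-shell" and be set before the first SparkSession is created.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1 pyspark-shell'
)

spark = SparkSession.builder.appName("Kafka Streaming Example").getOrCreate()

df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092")
      .option("subscribe", "dbserver1.inventory.customers")
      .load())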

Connecting Presto to Kafka fails - Catalog 'kafka' does not exist

I tried to do something similar to the instructions outlined here. In my case, I wanted to start Presto and Kafka in Docker using docker-compose.
So my docker-compose.yaml looks like this:
version: '2'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    ports:
      - 22181:2181
    networks:
      - shared
  kafka1:
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper
    ports:
      - 29092:29092
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka1:9092,PLAINTEXT_HOST://localhost:29092,LISTENER_DOCKER_INTERNAL://kafka1:19092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT,LISTENER_DOCKER_INTERNAL:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
    networks:
      - shared
  kafdrop:
    image: obsidiandynamics/kafdrop:latest
    restart: "no"
    ports:
      - "8089:9000"
    environment:
      KAFKA_BROKERCONNECT: "kafka1:19092"
    depends_on:
      - "kafka1"
    networks:
      - shared
  presto:
    image: ahanaio/prestodb-sandbox:latest
    ports:
      - 8087:8080
    volumes:
      - ./presto/kafka.properties:/etc/catalog/kafka.properties
    networks:
      - shared
networks:
  shared:
    name: kappa-playground
    driver: bridge
The mounted file kafka.properties has the following content:
connector.name=kafka
kafka.nodes=kafka1:19092
kafka.table-names=example_topic
I ensure Kafka has the topic created with the following little script:
# requires kafka-python
from kafka import KafkaClient
from kafka.admin import KafkaAdminClient, NewTopic

client = KafkaClient(bootstrap_servers='localhost:29092')
admin_client = KafkaAdminClient(
    bootstrap_servers="localhost:29092",
    client_id='setup'
)

future = client.cluster.request_update()
client.poll(future=future)

metadata = client.cluster
topics = metadata.topics()

if len(topics) > 0:
    print("topics: " + " ".join(topics))
else:
    print("no topics exist yet")

if "example_topic" not in topics:
    topic_list = []
    topic_list.append(NewTopic(name="example_topic", num_partitions=1, replication_factor=1))
    admin_client.create_topics(new_topics=topic_list, validate_only=False)
I can verify the topic "example_topic" exists with kafdrop.
Now I try to verify that Presto can read the topics from Kafka like this:
presto --server=localhost:8087 --catalog kafka --schema default
presto:default> SHOW TABLES;
Which shows the following error:
Query 20220622_080948_00005_t2k7a failed: line 1:1: Catalog 'kafka' does not exist
What is going wrong here?
Found the issue. The kafka.properties file was mounted to the wrong path.
It should rather be:
presto:
  image: ahanaio/prestodb-sandbox:latest
  ports:
    - 8087:8080
  volumes:
    - ./presto/kafka.properties:/opt/presto-server/etc/catalog/kafka.properties
  networks:
    - shared
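As an optional smoke test (my own addition, not part of the original answer), you can ask the Presto coordinator for its catalogs over the standard /v1/statement REST endpoint once the container has been recreated with the corrected mount; this assumes the 8087 port mapping from the compose file:

import time
import requests

# SHOW CATALOGS via Presto's REST protocol: POST the statement, then follow
# nextUri until the result set is complete. X-Presto-User can be any name.
headers = {"X-Presto-User": "test"}
result = requests.post("http://localhost:8087/v1/statement",
                       data="SHOW CATALOGS", headers=headers).json()
rows = []
while True:
    rows.extend(result.get("data", []))
    next_uri = result.get("nextUri")
    if not next_uri:
        break
    time.sleep(0.2)  # give the coordinator a moment before polling again
    result = requests.get(next_uri, headers=headers).json()

print(rows)  # 'kafka' should appear once kafka.properties is mounted to the right path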

kafka streams doesn't start up

I can't start my Kafka Streams application. It worked when I was depending on Confluent Cloud, but since switching to a local Kafka in Docker it doesn't start anymore.
docker-compose:
# https://docs.confluent.io/current/installation/docker/config-reference.html
# https://github.com/confluentinc/cp-docker-images
version: "3"
services:
  zookeeper:
    container_name: local-zookeeper
    image: confluentinc/cp-zookeeper:5.5.1
    ports:
      - 2181:2181
    hostname: zookeeper
    networks:
      - local_kafka_network
    environment:
      - ZOOKEEPER_CLIENT_PORT=2181
  kafka:
    container_name: local-kafka
    image: confluentinc/cp-kafka:5.5.1
    depends_on:
      - zookeeper
    ports:
      - 9092:9092
      - 29092:29092
    hostname: kafka
    networks:
      - local_kafka_network
    environment:
      - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
      - KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      - KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      - KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1
  schema-registry:
    container_name: local-schema-registry
    image: confluentinc/cp-schema-registry:5.5.1
    depends_on:
      - kafka
    ports:
      - 8081:8081
    hostname: schema-registry
    networks:
      - local_kafka_network
    environment:
      - SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL=zookeeper:2181
      - SCHEMA_REGISTRY_HOST_NAME=schema-registry
      - SCHEMA_REGISTRY_LISTENERS=http://schema-registry:8081
      - SCHEMA_REGISTRY_DEBUG=true
    command:
      - /bin/bash
      - -c
      - |
        # install jq
        curl -sL https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64 -o /usr/local/bin/jq && chmod u+x /usr/local/bin/jq
        # start
        /etc/confluent/docker/run
  schema-registry-ui:
    container_name: local-schema-registry-ui
    image: landoop/schema-registry-ui:latest
    depends_on:
      - schema-registry
    ports:
      - 8001:8000
    hostname: schema-registry-ui
    networks:
      - local_kafka_network
    environment:
      - SCHEMAREGISTRY_URL=http://schema-registry:8081
      - PROXY=true
  kafka-rest:
    container_name: local-kafka-rest
    image: confluentinc/cp-kafka-rest:5.5.1
    depends_on:
      - kafka
      - schema-registry
    ports:
      - 8082:8082
    hostname: kafka-rest
    networks:
      - local_kafka_network
    environment:
      - KAFKA_REST_ZOOKEEPER_CONNECT=zookeeper:2181
      - KAFKA_REST_LISTENERS=http://kafka-rest:8082
      - KAFKA_REST_SCHEMA_REGISTRY_URL=http://schema-registry:8081
      - KAFKA_REST_HOST_NAME=kafka-rest
  kafka-ui:
    container_name: local-kafka-ui
    image: landoop/kafka-topics-ui:latest
    depends_on:
      - kafka-rest
    ports:
      - 8000:8000
    hostname: kafka-ui
    networks:
      - local_kafka_network
    environment:
      - KAFKA_REST_PROXY_URL=http://kafka-rest:8082
      - PROXY=true
  # https://github.com/confluentinc/ksql/blob/4.1.3-post/docs/tutorials/docker-compose.yml#L85
  ksql-server:
    container_name: local-ksql-server
    # TODO update 5.5.1
    image: confluentinc/cp-ksql-server:5.4.2
    depends_on:
      - kafka
      - schema-registry
    ports:
      - 8088:8088
    hostname: ksql-server
    networks:
      - local_kafka_network
    environment:
      - KSQL_BOOTSTRAP_SERVERS=kafka:29092
      - KSQL_LISTENERS=http://ksql-server:8088
      - KSQL_KSQL_SCHEMA_REGISTRY_URL=http://schema-registry:8081
      - KSQL_KSQL_SERVICE_ID=local-ksql-server
  ksql-cli:
    container_name: local-ksql-cli
    # TODO update 5.5.1
    image: confluentinc/cp-ksql-cli:5.4.2
    depends_on:
      - ksql-server
    hostname: ksql-cli
    networks:
      - local_kafka_network
    entrypoint: /bin/sh
    tty: true
  # distributed mode
  kafka-connect:
    container_name: local-kafka-connect
    image: confluentinc/cp-kafka-connect:5.5.1
    depends_on:
      - kafka
      - schema-registry
    ports:
      - 8083:8083
    hostname: kafka-connect
    networks:
      - local_kafka_network
    environment:
      - CONNECT_BOOTSTRAP_SERVERS=kafka:29092
      - CONNECT_REST_ADVERTISED_HOST_NAME=kafka-connect
      - CONNECT_REST_PORT=8083
      - CONNECT_GROUP_ID=local-connect-group
      - CONNECT_CONFIG_STORAGE_TOPIC=local-connect-configs
      - CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR=1
      - CONNECT_OFFSET_FLUSH_INTERVAL_MS=10000
      - CONNECT_OFFSET_STORAGE_TOPIC=local-connect-offsets
      - CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR=1
      - CONNECT_STATUS_STORAGE_TOPIC=local-connect-status
      - CONNECT_STATUS_STORAGE_REPLICATION_FACTOR=1
      - CONNECT_KEY_CONVERTER=io.confluent.connect.avro.AvroConverter
      - CONNECT_KEY_CONVERTER_SCHEMA_REGISTRY_URL=http://schema-registry:8081
      - CONNECT_VALUE_CONVERTER=io.confluent.connect.avro.AvroConverter
      - CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL=http://schema-registry:8081
      - CONNECT_INTERNAL_KEY_CONVERTER=org.apache.kafka.connect.json.JsonConverter
      - CONNECT_INTERNAL_VALUE_CONVERTER=org.apache.kafka.connect.json.JsonConverter
      - CONNECT_PLUGIN_PATH=/usr/share/java
    volumes:
      - "./local/connect/data:/data"
    command:
      - /bin/bash
      - -c
      - |
        # install unzip
        apt-get update && apt-get install unzip -y
        # install plugin
        unzip /data/jcustenborder-kafka-connect-spooldir-*.zip 'jcustenborder-kafka-connect-spooldir-*/lib/*' -d /usr/share/java/kafka-connect-spooldir/
        mv /usr/share/java/kafka-connect-spooldir/*/lib/* /usr/share/java/kafka-connect-spooldir
        ls -la /usr/share/java
        # setup spooldir plugin
        mkdir -p /tmp/error /tmp/finished
        # start
        /etc/confluent/docker/run
  kafka-connect-ui:
    container_name: local-kafka-connect-ui
    image: landoop/kafka-connect-ui:latest
    depends_on:
      - kafka-connect
    ports:
      - 8002:8000
    hostname: kafka-connect-ui
    networks:
      - local_kafka_network
    environment:
      - CONNECT_URL=http://kafka-connect:8083
networks:
  local_kafka_network:
Main method:
package io.confluent.developer.time.solution;

import io.confluent.developer.StreamsUtils;
import io.confluent.developer.avro.ElectronicOrder;
import io.confluent.developer.time.TopicLoader;
import io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.processor.TimestampExtractor;

import java.io.IOException;
import java.time.Duration;
import java.util.Map;
import java.util.Properties;

public class StreamsTimestampExtractor {

    static class OrderTimestampExtractor implements TimestampExtractor {
        @Override
        public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
            ElectronicOrder order = (ElectronicOrder) record.value();
            System.out.println("Extracting time of " + order.getTime() + " from " + order);
            return order.getTime();
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        final Properties streamsProps = StreamsUtils.loadProperties();
        streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "extractor-windowed-streams");

        StreamsBuilder builder = new StreamsBuilder();
        final String inputTopic = streamsProps.getProperty("extractor.input.topic");
        final String outputTopic = streamsProps.getProperty("extractor.output.topic");
        final Map<String, Object> configMap = StreamsUtils.propertiesToMap(streamsProps);

        final SpecificAvroSerde<ElectronicOrder> electronicSerde =
                StreamsUtils.getSpecificAvroSerde(configMap);

        final KStream<String, ElectronicOrder> electronicStream =
                builder.stream(inputTopic,
                        Consumed.with(Serdes.String(), electronicSerde)
                                .withTimestampExtractor(new OrderTimestampExtractor()))
                        .peek((key, value) -> System.out.println("Incoming record - key " + key + " value " + value));

        electronicStream.groupByKey().windowedBy(TimeWindows.of(Duration.ofHours(1)))
                .aggregate(() -> 0.0,
                        (key, order, total) -> total + order.getPrice(),
                        Materialized.with(Serdes.String(), Serdes.Double()))
                .toStream()
                .map((wk, value) -> KeyValue.pair(wk.key(), value))
                .peek((key, value) -> System.out.println("Outgoing record - key " + key + " value " + value))
                .to(outputTopic, Produced.with(Serdes.String(), Serdes.Double()));

        KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), streamsProps);
        TopicLoader.runProducer();
        kafkaStreams.start();
    }
}
Running the code on my machine produces records but then exits immediately.
Note that I was able to process a continuous stream of data when running this exact code against Confluent Cloud.
To reproduce locally, all you need to do is get the code from this Confluent tutorial, modify the properties file to point to the local Kafka broker, and use the docker-compose I provided to set up Kafka.
Adding a shutdown hook and uncaught exception handler helped me diagnose and fix the issue:
KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), streamsProps);
TopicLoader.runProducer();

kafkaStreams.setUncaughtExceptionHandler(e -> {
    log.error("unhandled streams exception, shutting down.", e);
    return StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.SHUTDOWN_APPLICATION;
});

Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    log.info("Runtime shutdown hook, state={}", kafkaStreams.state());
    if (kafkaStreams.state().isRunningOrRebalancing()) {
        log.info("Shutting down started.");
        kafkaStreams.close(Duration.ofMinutes(2));
        log.info("Shutting down completed.");
    }
}));

kafkaStreams.start();
It turns out I had configured a replication factor of 1 on the broker while my properties file had 3, so the exception was: Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 1.
So the solution for me was to decrease the replication.factor from 3 to 1 in my properties file.
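For reference (my own addition, not part of the original answer), a small kafka-python sketch that shows why 3 cannot work against the single-broker compose file above: with one broker, any topic, including the Streams application's internal topics, can only be created with replication factor 1.

from kafka.admin import KafkaAdminClient, NewTopic

# Connect through the host listener published by the compose file (localhost:9092).
admin = KafkaAdminClient(bootstrap_servers="localhost:9092", client_id="rf-check")

# validate_only=True asks the broker to validate the request without actually
# creating the topic; requesting replication_factor=3 here would be rejected
# with INVALID_REPLICATION_FACTOR, while replication_factor=1 passes.
response = admin.create_topics(
    new_topics=[NewTopic(name="rf-check", num_partitions=1, replication_factor=1)],
    validate_only=True,
)
print(response)  # inspect the per-topic error codes (0 = OK)
admin.close()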

Spark Docker Java gateway process exited before sending its port number

I am fairly new to docker and am trying to get a docker-compose file running with both airflow and pyspark. Below is what I have so far:
version: '3.7'
services:
  master:
    image: gettyimages/spark
    command: bin/spark-class org.apache.spark.deploy.master.Master -h master
    hostname: master
    environment:
      MASTER: spark://master:7077
      SPARK_CONF_DIR: /conf
      SPARK_PUBLIC_DNS: localhost
    expose:
      - 7001
      - 7002
      - 7003
      - 7004
      - 7005
      - 7077
      - 6066
    ports:
      - 4040:4040
      - 6066:6066
      - 7077:7077
      - 8080:8080
    volumes:
      - ./conf/master:/conf
      - ./data:/tmp/data
  worker:
    image: gettyimages/spark
    command: bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077
    hostname: worker
    environment:
      SPARK_CONF_DIR: /conf
      SPARK_WORKER_CORES: 2
      SPARK_WORKER_MEMORY: 1g
      SPARK_WORKER_PORT: 8881
      SPARK_WORKER_WEBUI_PORT: 8081
      SPARK_PUBLIC_DNS: localhost
    links:
      - master
    expose:
      - 7012
      - 7013
      - 7014
      - 7015
      - 8881
    ports:
      - 8081:8081
    volumes:
      - ./conf/worker:/conf
      - ./data:/tmp/data
  postgres:
    image: postgres:9.6
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow
    logging:
      options:
        max-size: 10m
        max-file: "3"
  webserver:
    image: puckel/docker-airflow:1.10.9
    restart: always
    depends_on:
      - postgres
    environment:
      - LOAD_EX=y
      - EXECUTOR=Local
    logging:
      options:
        max-size: 10m
        max-file: "3"
    volumes:
      - ./dags:/usr/local/airflow/dags
      # Add this to have third party packages
      - ./requirements.txt:/requirements.txt
      # - ./plugins:/usr/local/airflow/plugins
    ports:
      - "8082:8080"
    command: webserver
    healthcheck:
      test: ["CMD-SHELL", "[ -f /usr/local/airflow/airflow-webserver.pid ]"]
      interval: 30s
      timeout: 30s
      retries: 3
And I am trying to run the following simple DAG just to confirm pyspark is operating correctly:
import pyspark
from airflow.models import DAG
from airflow.utils.dates import days_ago, timedelta
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
import random

args = {
    "owner": "ian",
    "start_date": days_ago(1)
}

dag = DAG(dag_id="pysparkTest", default_args=args, schedule_interval=None)

def run_this_func(**context):
    sc = pyspark.SparkContext()
    print(sc)

with dag:
    run_this_task = PythonOperator(
        task_id='run_this',
        python_callable=run_this_func,
        provide_context=True,
        retries=10,
        retry_delay=timedelta(seconds=1)
    )
When I do this, it fails with the error Java gateway process exited before sending its port number. I have found several posts that say to run export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell", which I have tried to add to the command like so:
version: '3.7'
services:
  master:
    image: gettyimages/spark
    command: >
      sh -c "bin/spark-class org.apache.spark.deploy.master.Master -h master
      && export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell""
    hostname: master
    ...
But I still get the same error. Any ideas what I am doing wrong?
I don't think you need to modify the master's command. Leave it as they did here.
In addition, how do you expect the Python code, which runs in a different container, to connect to the master container? I think you should pass the master URL to the SparkContext, something like:
def run_this_func(**context):
    sc = pyspark.SparkContext("spark://master:7077")
    print(sc)
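One more hedged note (my own addition): exporting PYSPARK_SUBMIT_ARGS in the master's command cannot help, because that shell runs inside the master container while the SparkContext is created inside the Airflow container. If you want to use PYSPARK_SUBMIT_ARGS at all, it has to be set in the process that creates the context, and the Airflow image also needs a Java runtime and pyspark installed for the gateway to start. A sketch:

import os
import pyspark

def run_this_func(**context):
    # Set before SparkContext is created, in the same process/container.
    # "spark://master:7077" assumes the compose service name above; use
    # "local[2]" instead if you only want an in-container test.
    os.environ["PYSPARK_SUBMIT_ARGS"] = "--master spark://master:7077 pyspark-shell"
    sc = pyspark.SparkContext()
    print(sc)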

Kafka Client Timeout of 60000ms expired before the position for partition could be determined

I'm trying to connect Flink to a Kafka consumer.
I'm using Docker Compose to build 4 containers: zookeeper, kafka, Flink JobManager and Flink TaskManager.
For zookeeper and Kafka I'm using wurstmeister images, and for Flink I'm using the official image.
docker-compose.yml
version: '3.1'
services:
  zookeeper:
    image: wurstmeister/zookeeper:3.4.6
    hostname: zookeeper
    expose:
      - "2181"
    ports:
      - "2181:2181"
  kafka:
    image: wurstmeister/kafka:2.11-2.0.0
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    hostname: kafka
    links:
      - zookeeper
    environment:
      KAFKA_ADVERTISED_HOST_NAME: kafka
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_PORT: 9092
      KAFKA_CREATE_TOPICS: 'pipeline:1:1:compact'
  jobmanager:
    build: ./flink_pipeline
    depends_on:
      - kafka
    links:
      - zookeeper
      - kafka
    expose:
      - "6123"
    ports:
      - "8081:8081"
    command: jobmanager
    environment:
      JOB_MANAGER_RPC_ADDRESS: jobmanager
      BOOTSTRAP_SERVER: kafka:9092
      ZOOKEEPER: zookeeper:2181
  taskmanager:
    image: flink
    expose:
      - "6121"
      - "6122"
    links:
      - jobmanager
      - zookeeper
      - kafka
    depends_on:
      - jobmanager
    command: taskmanager
    # links:
    #   - "jobmanager:jobmanager"
    environment:
      JOB_MANAGER_RPC_ADDRESS: jobmanager
And when I submit a simple job to the Dispatcher, the job fails with the following error:
org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition pipeline-0 could be determined
My Job code is:
public class Main {
    public static void main(String[] args) throws Exception {
        // get the execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // get input data by connecting to the socket
        Properties properties = new Properties();
        String bootstrapServer = System.getenv("BOOTSTRAP_SERVER");
        String zookeeperServer = System.getenv("ZOOKEEPER");

        if (bootstrapServer == null) {
            System.exit(1);
        }

        properties.setProperty("zookeeper", zookeeperServer);
        properties.setProperty("bootstrap.servers", bootstrapServer);
        properties.setProperty("group.id", "pipeline-analysis");

        FlinkKafkaConsumer kafkaConsumer = new FlinkKafkaConsumer<String>("pipeline", new SimpleStringSchema(), properties);
        // kafkaConsumer.setStartFromGroupOffsets();
        kafkaConsumer.setStartFromLatest();

        DataStream<String> stream = env.addSource(kafkaConsumer);

        // Defining Pipeline here

        // Printing Outputs
        stream.print();

        env.execute("Stream Pipeline");
    }
}
I know I'm late to the party, but I had the exact same error. In my case, I was not setting up the TopicPartitions correctly. My topic had 2 partitions and my producer was producing messages just fine, but it was my Spark streaming application, acting as the consumer, that wasn't really starting and gave up after 60 seconds complaining about the same error.
Wrong code that I had:
List<TopicPartition> topicPartitionList = Arrays.asList(new TopicPartition(topicName, Integer.parseInt(numPartition)));
Correct code:
List<TopicPartition> topicPartitionList = new ArrayList<TopicPartition>();
for (int i = 0; i < Integer.parseInt(numPartitions); i++) {
    topicPartitionList.add(new TopicPartition(topicName, i));
}
I had an error that looks the same.
17:34:37.668 [org.springframework.kafka.KafkaListenerEndpointContainer#1-0-C-1] ERROR o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=consumer-3, groupId=api.dev] User provided listener org.springframework.kafka.listener.KafkaMessageListenerContainer$ListenerConsumer$ListenerConsumerRebalanceListener failed on partition assignment
org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition aaa-1 could be determined
It turns out my hosts file had been changed, so the broker address was wrong.
Try these log settings to debug in more detail:
<logger name="org.apache.kafka.clients.consumer.internals.Fetcher" level="info" />
I was having issues with this error in a vSphere Integrated Containers environment. For me, the problem was that I was advertising the hostname and not the IP. I had to set the hostname and container name in my compose file.
Here are my settings that finally worked:
kafka:
  depends_on:
    - zookeeper
  image: wurstmeister/kafka
  ports:
    - "9092:9092"
  mem_limit: 10g
  container_name: kafka
  hostname: kafka
  environment:
    KAFKA_ADVERTISED_LISTENERS: OUTSIDE://kafka:9092
    KAFKA_LISTENERS: OUTSIDE://0.0.0.0:9092
    KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: OUTSIDE:PLAINTEXT
    KAFKA_INTER_BROKER_LISTENER_NAME: OUTSIDE
    KAFKA_BROKER_ID: 1
    KAFKA_ZOOKEEPER_CONNECT: <REPLACE_WITH_IP>:2181
I had the same problem; the issue was a wrong host entry for the Kafka node in my /etc/hosts file!
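A quick way to check broker reachability for cases like the last two answers (my own addition, using the kafka-python package; kafka:9092 is the address used in the compose files above, so substitute whatever address your client actually resolves):

from kafka import KafkaConsumer

# Fetching the topic list forces a metadata round trip through the advertised
# listener; a wrong hostname/IP in KAFKA_ADVERTISED_LISTENERS or /etc/hosts
# shows up here as a connection error or timeout instead of a topic set.
consumer = KafkaConsumer(bootstrap_servers="kafka:9092")
print(consumer.topics())
consumer.close()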
