PySpark: Failed to find data source: kafka in Docker environment

I'm putting together a prototype that streams data from MySQL through Debezium and Kafka into Spark, all running in Docker.
My docker-compose.yml wires the services together like this:
version: '2'
services:
  zookeeper:
    image: quay.io/debezium/zookeeper:${DEBEZIUM_VERSION}
    ports:
      - 2181:2181
      - 2888:2888
      - 3888:3888
  kafka:
    image: quay.io/debezium/kafka:${DEBEZIUM_VERSION}
    ports:
      - 9092:9092
    links:
      - zookeeper
    environment:
      - ZOOKEEPER_CONNECT=zookeeper:2181
  mysql:
    image: quay.io/debezium/example-mysql:${DEBEZIUM_VERSION}
    ports:
      - 3306:3306
    environment:
      - MYSQL_ROOT_PASSWORD=debezium
      - MYSQL_USER=mysqluser
      - MYSQL_PASSWORD=mysqlpw
  connect:
    image: quay.io/debezium/connect:${DEBEZIUM_VERSION}
    ports:
      - 8083:8083
    links:
      - kafka
      - mysql
    environment:
      - BOOTSTRAP_SERVERS=kafka:9092
      - GROUP_ID=1
      - CONFIG_STORAGE_TOPIC=my_connect_configs
      - OFFSET_STORAGE_TOPIC=my_connect_offsets
      - STATUS_STORAGE_TOPIC=my_connect_statuses
  spark-master:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - '8080:8080'
  spark-worker:
    image: docker.io/bitnami/spark:3.3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=2G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    links:
      - kafka
  jupyter:
    image: jupyter/pyspark-notebook
    environment:
      - GRANT_SUDO=yes
      - JUPYTER_ENABLE_LAB=yes
      - JUPYTER_TOKEN=mysecret
    ports:
      - "8888:8888"
    volumes:
      - /Users/eugenegoldberg/jupyter_notebooks:/home/eugene
    depends_on:
      - spark-master
My Jupyter notebook (served from the same docker-compose file) contains the following PySpark code, which attempts to connect Spark to Kafka:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import os
spark_version = '3.3.1'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:{}'.format(spark_version)
packages = [
f'org.apache.kafka:kafka-clients:3.3.1'
]
# Create SparkSession
spark = SparkSession.builder \
.appName("Kafka Streaming Example") \
.config("spark.driver.host", "host.docker.internal") \
.config("spark.jars.packages", ",".join(packages)) \
.getOrCreate()
# Define the Kafka topic and Kafka server/port
topic = "dbserver1.inventory.customers"
kafkaServer = "kafka:9092" # assuming kafka is running on a container named 'kafka'
# Read data from kafka topic
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", kafkaServer) \
.option("subscribe", topic) \
.load()
I'm getting this error:
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
Cell In[9], line 32
24 kafkaServer = "kafka:9092" # assuming kafka is running on a container named 'kafka'
26 # Read data from kafka topic
27 df = spark \
28 .readStream \
29 .format("kafka") \
30 .option("kafka.bootstrap.servers", kafkaServer) \
31 .option("subscribe", topic) \
---> 32 .load()
File /usr/local/spark/python/pyspark/sql/streaming.py:469, in DataStreamReader.load(self, path, format, schema, **options)
467 return self._df(self._jreader.load(path))
468 else:
--> 469 return self._df(self._jreader.load())
File /usr/local/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py:1321, in JavaMember.__call__(self, *args)
1315 command = proto.CALL_COMMAND_NAME +\
1316 self.command_header +\
1317 args_command +\
1318 proto.END_COMMAND_PART
1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
1324 for temp_arg in temp_args:
1325 temp_arg._detach()
File /usr/local/spark/python/pyspark/sql/utils.py:196, in capture_sql_exception.<locals>.deco(*a, **kw)
192 converted = convert_exception(e.java_exception)
193 if not isinstance(converted, UnknownException):
194 # Hide where the exception came from that shows a non-Pythonic
195 # JVM exception message.
--> 196 raise converted from None
197 else:
198 raise
AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
What do I need to change in order to fix this?

tl;dr Neither spark-master nor spark-worker has the necessary library on its CLASSPATH, hence the infamous AnalysisException: Failed to find data source: kafka.
See Why does format("kafka") fail with "Failed to find data source: kafka." (even with uber-jar)? for some background.
My guess is that you should add the jar to the jupyter container (so that when it executes Spark code it knows where to find the required classes). See https://stackoverflow.com/a/36295208/1305344 for a solution.
There are also the Jupyter Docker Stacks build manifests, which come directly from the Jupyter project.
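As a minimal sketch of that suggestion (an illustration, not the exact fix: it assumes Spark 3.3.1 built for Scala 2.12, which matches the bitnami/spark:3.3 images above), pull in the Kafka connector itself through spark.jars.packages before the session is created, rather than only kafka-clients:

from pyspark.sql import SparkSession

# spark-sql-kafka-0-10 is the artifact that provides the "kafka" data source;
# its Scala suffix and version must match the Spark build in use (assumed 2.12 / 3.3.1 here).
spark = (
    SparkSession.builder
    .appName("Kafka Streaming Example")
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1")
    .getOrCreate()
)

df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "dbserver1.inventory.customers")
    .load()
)

Note that spark.jars.packages is only honoured when the session (and its JVM) starts, so restart the notebook kernel if a SparkSession already exists.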

Related

Kafka Streams doesn't start up

I can't start my Kafka Streams application. It worked when I was using Confluent Cloud, but since switching to a local Kafka running in Docker it no longer starts.
docker-compose:
# https://docs.confluent.io/current/installation/docker/config-reference.html
# https://github.com/confluentinc/cp-docker-images
version: "3"
services:
  zookeeper:
    container_name: local-zookeeper
    image: confluentinc/cp-zookeeper:5.5.1
    ports:
      - 2181:2181
    hostname: zookeeper
    networks:
      - local_kafka_network
    environment:
      - ZOOKEEPER_CLIENT_PORT=2181
  kafka:
    container_name: local-kafka
    image: confluentinc/cp-kafka:5.5.1
    depends_on:
      - zookeeper
    ports:
      - 9092:9092
      - 29092:29092
    hostname: kafka
    networks:
      - local_kafka_network
    environment:
      - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
      - KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      - KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      - KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1
  schema-registry:
    container_name: local-schema-registry
    image: confluentinc/cp-schema-registry:5.5.1
    depends_on:
      - kafka
    ports:
      - 8081:8081
    hostname: schema-registry
    networks:
      - local_kafka_network
    environment:
      - SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL=zookeeper:2181
      - SCHEMA_REGISTRY_HOST_NAME=schema-registry
      - SCHEMA_REGISTRY_LISTENERS=http://schema-registry:8081
      - SCHEMA_REGISTRY_DEBUG=true
    command:
      - /bin/bash
      - -c
      - |
        # install jq
        curl -sL https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64 -o /usr/local/bin/jq && chmod u+x /usr/local/bin/jq
        # start
        /etc/confluent/docker/run
  schema-registry-ui:
    container_name: local-schema-registry-ui
    image: landoop/schema-registry-ui:latest
    depends_on:
      - schema-registry
    ports:
      - 8001:8000
    hostname: schema-registry-ui
    networks:
      - local_kafka_network
    environment:
      - SCHEMAREGISTRY_URL=http://schema-registry:8081
      - PROXY=true
  kafka-rest:
    container_name: local-kafka-rest
    image: confluentinc/cp-kafka-rest:5.5.1
    depends_on:
      - kafka
      - schema-registry
    ports:
      - 8082:8082
    hostname: kafka-rest
    networks:
      - local_kafka_network
    environment:
      - KAFKA_REST_ZOOKEEPER_CONNECT=zookeeper:2181
      - KAFKA_REST_LISTENERS=http://kafka-rest:8082
      - KAFKA_REST_SCHEMA_REGISTRY_URL=http://schema-registry:8081
      - KAFKA_REST_HOST_NAME=kafka-rest
  kafka-ui:
    container_name: local-kafka-ui
    image: landoop/kafka-topics-ui:latest
    depends_on:
      - kafka-rest
    ports:
      - 8000:8000
    hostname: kafka-ui
    networks:
      - local_kafka_network
    environment:
      - KAFKA_REST_PROXY_URL=http://kafka-rest:8082
      - PROXY=true
  # https://github.com/confluentinc/ksql/blob/4.1.3-post/docs/tutorials/docker-compose.yml#L85
  ksql-server:
    container_name: local-ksql-server
    # TODO update 5.5.1
    image: confluentinc/cp-ksql-server:5.4.2
    depends_on:
      - kafka
      - schema-registry
    ports:
      - 8088:8088
    hostname: ksql-server
    networks:
      - local_kafka_network
    environment:
      - KSQL_BOOTSTRAP_SERVERS=kafka:29092
      - KSQL_LISTENERS=http://ksql-server:8088
      - KSQL_KSQL_SCHEMA_REGISTRY_URL=http://schema-registry:8081
      - KSQL_KSQL_SERVICE_ID=local-ksql-server
  ksql-cli:
    container_name: local-ksql-cli
    # TODO update 5.5.1
    image: confluentinc/cp-ksql-cli:5.4.2
    depends_on:
      - ksql-server
    hostname: ksql-cli
    networks:
      - local_kafka_network
    entrypoint: /bin/sh
    tty: true
  # distributed mode
  kafka-connect:
    container_name: local-kafka-connect
    image: confluentinc/cp-kafka-connect:5.5.1
    depends_on:
      - kafka
      - schema-registry
    ports:
      - 8083:8083
    hostname: kafka-connect
    networks:
      - local_kafka_network
    environment:
      - CONNECT_BOOTSTRAP_SERVERS=kafka:29092
      - CONNECT_REST_ADVERTISED_HOST_NAME=kafka-connect
      - CONNECT_REST_PORT=8083
      - CONNECT_GROUP_ID=local-connect-group
      - CONNECT_CONFIG_STORAGE_TOPIC=local-connect-configs
      - CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR=1
      - CONNECT_OFFSET_FLUSH_INTERVAL_MS=10000
      - CONNECT_OFFSET_STORAGE_TOPIC=local-connect-offsets
      - CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR=1
      - CONNECT_STATUS_STORAGE_TOPIC=local-connect-status
      - CONNECT_STATUS_STORAGE_REPLICATION_FACTOR=1
      - CONNECT_KEY_CONVERTER=io.confluent.connect.avro.AvroConverter
      - CONNECT_KEY_CONVERTER_SCHEMA_REGISTRY_URL=http://schema-registry:8081
      - CONNECT_VALUE_CONVERTER=io.confluent.connect.avro.AvroConverter
      - CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL=http://schema-registry:8081
      - CONNECT_INTERNAL_KEY_CONVERTER=org.apache.kafka.connect.json.JsonConverter
      - CONNECT_INTERNAL_VALUE_CONVERTER=org.apache.kafka.connect.json.JsonConverter
      - CONNECT_PLUGIN_PATH=/usr/share/java
    volumes:
      - "./local/connect/data:/data"
    command:
      - /bin/bash
      - -c
      - |
        # install unzip
        apt-get update && apt-get install unzip -y
        # install plugin
        unzip /data/jcustenborder-kafka-connect-spooldir-*.zip 'jcustenborder-kafka-connect-spooldir-*/lib/*' -d /usr/share/java/kafka-connect-spooldir/
        mv /usr/share/java/kafka-connect-spooldir/*/lib/* /usr/share/java/kafka-connect-spooldir
        ls -la /usr/share/java
        # setup spooldir plugin
        mkdir -p /tmp/error /tmp/finished
        # start
        /etc/confluent/docker/run
  kafka-connect-ui:
    container_name: local-kafka-connect-ui
    image: landoop/kafka-connect-ui:latest
    depends_on:
      - kafka-connect
    ports:
      - 8002:8000
    hostname: kafka-connect-ui
    networks:
      - local_kafka_network
    environment:
      - CONNECT_URL=http://kafka-connect:8083
networks:
  local_kafka_network:
Main method:
package io.confluent.developer.time.solution;

import io.confluent.developer.StreamsUtils;
import io.confluent.developer.avro.ElectronicOrder;
import io.confluent.developer.time.TopicLoader;
import io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.processor.TimestampExtractor;

import java.io.IOException;
import java.time.Duration;
import java.util.Map;
import java.util.Properties;

public class StreamsTimestampExtractor {

    static class OrderTimestampExtractor implements TimestampExtractor {
        @Override
        public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
            ElectronicOrder order = (ElectronicOrder) record.value();
            System.out.println("Extracting time of " + order.getTime() + " from " + order);
            return order.getTime();
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        final Properties streamsProps = StreamsUtils.loadProperties();
        streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "extractor-windowed-streams");

        StreamsBuilder builder = new StreamsBuilder();
        final String inputTopic = streamsProps.getProperty("extractor.input.topic");
        final String outputTopic = streamsProps.getProperty("extractor.output.topic");
        final Map<String, Object> configMap = StreamsUtils.propertiesToMap(streamsProps);

        final SpecificAvroSerde<ElectronicOrder> electronicSerde =
                StreamsUtils.getSpecificAvroSerde(configMap);

        final KStream<String, ElectronicOrder> electronicStream =
                builder.stream(inputTopic,
                        Consumed.with(Serdes.String(), electronicSerde)
                                .withTimestampExtractor(new OrderTimestampExtractor()))
                        .peek((key, value) -> System.out.println("Incoming record - key " + key + " value " + value));

        electronicStream.groupByKey().windowedBy(TimeWindows.of(Duration.ofHours(1)))
                .aggregate(() -> 0.0,
                        (key, order, total) -> total + order.getPrice(),
                        Materialized.with(Serdes.String(), Serdes.Double()))
                .toStream()
                .map((wk, value) -> KeyValue.pair(wk.key(), value))
                .peek((key, value) -> System.out.println("Outgoing record - key " + key + " value " + value))
                .to(outputTopic, Produced.with(Serdes.String(), Serdes.Double()));

        KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), streamsProps);
        TopicLoader.runProducer();
        kafkaStreams.start();
    }
}
Running the code on my machine produces records, but the application exits immediately.
Note that I was able to process the continuous stream of data when running this exact code against Confluent Cloud.
To reproduce locally, all you need to do is take the code from this Confluent tutorial, modify the properties file to point to the local Kafka broker, and use the docker-compose file above to set up Kafka.
Adding a shutdown hook and uncaught exception handler helped me diagnose and fix the issue:
KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), streamsProps);
TopicLoader.runProducer();

kafkaStreams.setUncaughtExceptionHandler(e -> {
    log.error("unhandled streams exception, shutting down.", e);
    return StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.SHUTDOWN_APPLICATION;
});

Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    log.info("Runtime shutdown hook, state={}", kafkaStreams.state());
    if (kafkaStreams.state().isRunningOrRebalancing()) {
        log.info("Shutting down started.");
        kafkaStreams.close(Duration.ofMinutes(2));
        log.info("Shutting down completed.");
    }
}));

kafkaStreams.start();
Turns out I’ve configured a replication factor of 1 in the broker while in my properties file I had 3, so the exception was: Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 1.
So the solution for me was to decrease the replication.factor from 3 to 1 in my properties file.
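For reference, a minimal sketch of that change (an assumption about the file layout, since the properties file loaded by StreamsUtils.loadProperties() isn't shown here):

# Kafka Streams topic replication must not exceed the number of brokers (1 in the compose file above)
replication.factor=1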

Quarkus can't connect to Kafka from inside Docker [duplicate]

This question already has answers here: Connect to Kafka running in Docker (5 answers).
Closed 1 year ago.
I've created a Quarkus service that reads from a bunch of KStreams, joins them, and then posts the join result back into a Kafka topic.
During development, I was running Kafka and Zookeeper from a docker-compose file and running my Quarkus service in dev mode with:
mvn quarkus:dev
At that point everything was working fine: I was able to connect to the broker without problems and read/write the KStreams.
Then I tried to create a Docker container that runs this Quarkus service, but when the service runs inside the container it can't reach the broker.
I tried several different configs in my docker-compose, but none worked; it just can't connect to the broker.
Here is my Dockerfile:
####
# This Dockerfile is used in order to build a container that runs the Quarkus application in JVM mode
#
# Before building the container image run:
#
# mvn package
#
# Then, build the image with:
#
# docker build -f src/main/docker/Dockerfile.jvm -t connector .
#
# Then run the container using:
#
# docker run -i --rm -p 8080:8080 connector
#
# If you want to include the debug port into your docker image
# you will have to expose the debug port (default 5005) like this: EXPOSE 8080 5005
#
# Then run the container using:
#
# docker run -i --rm -p 8080:8080 -p 5005:5005 -e JAVA_ENABLE_DEBUG="true" connector
#
###
FROM docker.internal/library/quarkus-base:latest
ARG RUN_JAVA_VERSION=1.3.8
ENV LANG='en_US.UTF-8' LANGUAGE='en_US:en'
USER root
RUN apk update && apk add libstdc++
# Configure the JAVA_OPTIONS, you can add -XshowSettings:vm to also display the heap size.
ENV JAVA_OPTIONS="-Dquarkus.http.host=0.0.0.0 -Djava.util.logging.manager=org.jboss.logmanager.LogManager"
#ENV QUARKUS_LAUNCH_DEVMODE=true \
# JAVA_ENABLE_DEBUG=true
# -Dquarkus.package.type=mutable-jar
# We make four distinct layers so if there are application changes the library layers can be re-used
COPY --chown=1001 target/quarkus-app/lib/ ${APP_HOME}/lib/
COPY --chown=1001 target/quarkus-app/*-run.jar ${APP_HOME}/app.jar
COPY --chown=1001 target/quarkus-app/app/ ${APP_HOME}/app/
COPY --chown=1001 target/quarkus-app/quarkus/ ${APP_HOME}/quarkus/
EXPOSE 8080
USER 1001
#ENTRYPOINT [ "/deployments/run-java.sh" ]
And here is my docker-compose:
version: '2'
services:
  zookeeper:
    container_name: zookeeper
    image: confluentinc/cp-zookeeper
    ports:
      - "2181:2181"
      - "2888:2888"
      - "3888:3888"
    environment:
      - ZOOKEEPER_CLIENT_PORT=2181
      - ZOOKEEPER_TICK_TIME=2000
    networks:
      - kafkastreams-network
  kafka:
    container_name: kafka
    image: confluentinc/cp-kafka
    ports:
      - "9092:9092"
    depends_on:
      - zookeeper
    environment:
      - KAFKA_BROKER_ID=1
      - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
      - KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      - KAFKA_INTER_BROKER_LISTENER_NAME=PLAINTEXT
      - KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      - KAFKA_AUTO_CREATE_TOPICS_ENABLE=true
      - KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1
      - KAFKA_TRANSACTION_STATE_LOG_MIN_ISR=1
      - KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR=1
      - KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS=100
    networks:
      - kafkastreams-network
  connect:
    container_name: connect
    image: debezium/connect
    ports:
      - "8083:8083"
    depends_on:
      - kafka
    environment:
      - BOOTSTRAP_SERVERS=kafka:29092
      - GROUP_ID=1
      - CONFIG_STORAGE_TOPIC=my_connect_configs
      - OFFSET_STORAGE_TOPIC=my_connect_offsets
      - STATUS_STORAGE_TOPIC=my_connect_statuses
    networks:
      - kafkastreams-network
  schema-registry:
    image: confluentinc/cp-schema-registry:5.5.0
    container_name: schema-registry
    ports:
      - "8081:8081"
    depends_on:
      - zookeeper
      - kafka
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_CONNECTION_URL: zookeeper:2181
    networks:
      - kafkastreams-network
  kafdrop:
    image: obsidiandynamics/kafdrop
    container_name: kafdrop
    restart: "no"
    ports:
      - "9001:9000"
    environment:
      KAFKA_BROKERCONNECT: kafka:29092
      JVM_OPTS: "-Xms16M -Xmx48M -Xss180K -XX:-TieredCompilation -XX:+UseStringDeduplication -noverify"
    depends_on:
      - kafka
      - schema-registry
    networks:
      - kafkastreams-network
  connector:
    image: connector
    depends_on:
      - zookeeper
      - kafka
      - connect
    environment:
      QUARKUS_KAFKA_STREAMS_BOOTSTRAP_SERVERS: kafka:9092
    networks:
      - kafkastreams-network
networks:
  kafkastreams-network:
    name: ks
The error I'm getting is:
2021-08-05 11:52:35,433 WARN [org.apa.kaf.cli.NetworkClient] (kafka-admin-client-thread | connector-18d10d7d-b619-4715-a219-2557d70e0479-admin) [AdminClient clientId=connector-18d10d7d-b619-4715-a219-2557d70e0479-admin] Connection to node -1 (kafka/172.21.0.3:9092) could not be established. Broker may not be available.
Am I missing any config on either the Dockerfile or the docker compose?
I figured out that there were two problems:
In my docker-compose, I had to change the property KAFKA_ADVERTISED_LISTENERS to PLAINTEXT://kafka:29092,PLAINTEXT_HOST://kafka:9092
In my Quarkus application.properties, I had two properties pointing to the wrong place:
quarkus.kafka-streams.bootstrap-servers=localhost:9092
quarkus.kafka-streams.application-server=localhost:9999
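Presumably the fix was to point both at the Docker network names instead of localhost, e.g. something like the following (an assumption based on the compose file above: kafka:9092 matches the PLAINTEXT_HOST listener and the connector's env var, and application-server should advertise the connector container itself):

# hypothetical corrected application.properties
quarkus.kafka-streams.bootstrap-servers=kafka:9092
quarkus.kafka-streams.application-server=connector:9999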

Failed to find data source: kafka, using the Docker all-spark-notebook image (Spark 3.1.1)

I'm very new to Spark and Kafka and I'm trying to run some sample code in Python (Jupyter) using Docker images downloaded, configured and executed with docker-compose. I'm running Ubuntu 20.04 through Windows 10 WSL.
Here are the steps I've followed so far:
Using the docker-compose.yml file shown below, I run [docker-compose up -d]
All containers start successfully
I start a Jupyter notebook using http://127.0.0.1:8888/?token=
I execute the Python script line-by-line and it fails at
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "cloud") \
.option("startingOffsets", "earliest") \
.load()
with the error:
error preamble... AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".
I'm clearly missing a step and I've searched in vain for a solution. Other answers to this problem suggest adding:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 pyspark-shell'
which hasn't worked for me. What step am I missing?
docker-compose.yml
---
version: '2'
services:
  spark:
    image: jupyter/all-spark-notebook:latest
    ports:
      - "8888:8888"
    working_dir: /home/$USER/work
    volumes:
      - $PWD/work:/home/$USER/work
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    hostname: zookeeper
    container_name: zookeeper
    ports:
      - "2181:2181"
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
      KAFKA_OPTS: "-Dzookeeper.4lw.commands.whitelist=*"
  kafka:
    image: confluentinc/cp-kafka:latest
    hostname: kafka
    container_name: kafka
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
      - "29092:29092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092,PLAINTEXT_HOST://localhost:29092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
  connect:
    image: confluentinc/cp-kafka-connect:latest
    hostname: connect
    container_name: connect
    depends_on:
      - zookeeper
      - kafka
    ports:
      - "8083:8083"
    environment:
      CONNECT_BOOTSTRAP_SERVERS: kafka:9092
      CONNECT_REST_ADVERTISED_HOST_NAME: connect
      CONNECT_REST_PORT: 8083
      CONNECT_GROUP_ID: compose-connect-group
      CONNECT_CONFIG_STORAGE_TOPIC: docker-connect-configs
      CONNECT_CONFIG_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_OFFSET_FLUSH_INTERVAL_MS: 10000
      CONNECT_OFFSET_STORAGE_TOPIC: docker-connect-offsets
      CONNECT_OFFSET_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_STATUS_STORAGE_TOPIC: docker-connect-status
      CONNECT_STATUS_STORAGE_REPLICATION_FACTOR: 1
      CONNECT_KEY_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      CONNECT_VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      CONNECT_INTERNAL_KEY_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      CONNECT_INTERNAL_VALUE_CONVERTER: org.apache.kafka.connect.json.JsonConverter
      CONNECT_ZOOKEEPER_CONNECT: zookeeper:2181
      CONNECT_PLUGIN_PATH: /usr/share/java,/usr/share/confluent-hub-components,/data/connect-jars
      CONNECT_LOG4J_LOGGERS: org.apache.kafka.connect=DEBUG
    command:
      - bash
      - -c
      - |
        echo "Installing connector plugins"
        confluent-hub install --no-prompt confluentinc/kafka-connect-jdbc:latest
        cd /usr/share/confluent-hub-components/confluentinc-kafka-connect-jdbc/lib
        curl http://repo.odysseusinc.com/artifactory/community-libs-release-local/org/netezza/nzjdbc/1.0/nzjdbc-1.0.jar -o nzjdbc-1.0.jar
        /etc/confluent/docker/run &
        sleep infinity
Python Code
#!/usr/bin/env python
# coding: utf-8
# In[ ]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.1 pyspark-shell'
# In[ ]:
# Spark
import pyspark
import findspark
from pyspark import SparkContext
from pyspark.sql import SparkSession
# Spark Streaming
from pyspark.streaming import StreamingContext
# json parsing
import json
# In[ ]:
findspark.init()
# In[ ]:
spark = SparkSession \
    .builder \
    .appName("My App") \
    .getOrCreate()
# .config("spark.jars", "/path/to/jar.jar,/path/to/another/jar.jar") \
# In[ ]:
spark.version
# In[ ]:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "cloud") \
    .option("startingOffsets", "earliest") \
    .load()

Failed to find data source: kafka (Docker environment)

We are facing this issue at the moment, and none of the displayed "Similar questions" helped us solve it. We are new to both Docker and Spark.
We used the following docker-compose file to set up our containers:
networks:
  spark_net:
volumes:
  shared-workspace:
    name: "hadoop-distributed-file-system"
    driver: local
services:
  jupyterlab:
    image: jupyterlab
    container_name: jupyterlab
    ports:
      - 8888:8888
    volumes:
      - shared-workspace:/opt/workspace
  spark-master:
    image: spark-master
    networks:
      - spark_net
    container_name: spark-master
    ports:
      - 8080:8080
      - 7077:7077
    volumes:
      - shared-workspace:/opt/workspace
  spark-worker-1:
    image: spark-worker
    networks:
      - spark_net
    container_name: spark-worker-1
    environment:
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=512m
    ports:
      - 8081:8081
    volumes:
      - shared-workspace:/opt/workspace
    depends_on:
      - spark-master
  spark-worker-2:
    image: spark-worker
    networks:
      - spark_net
    container_name: spark-worker-2
    environment:
      - SPARK_WORKER_CORES=1
      - SPARK_WORKER_MEMORY=512m
    ports:
      - 8082:8081
    volumes:
      - shared-workspace:/opt/workspace
    depends_on:
      - spark-master
  zookeeper:
    image: wurstmeister/zookeeper
    ports:
      - "2181:2181"
  kafka:
    image: wurstmeister/kafka
    ports:
      - "7575"
    environment:
      KAFKA_ADVERTISED_HOST_NAME: 127.0.0.1
      KAFKA_ADVERTISED_LISTENERS: INSIDE://kafka:9093,OUTSIDE://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: INSIDE:PLAINTEXT,OUTSIDE:PLAINTEXT
      KAFKA_LISTENERS: INSIDE://0.0.0.0:9093,OUTSIDE://0.0.0.0:9092
      KAFKA_INTER_BROKER_LISTENER_NAME: INSIDE
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
    volumes:
      - ./var/run/docker.sock
We also created two Python files to test whether Kafka streaming works:
Producer:
from kafka import KafkaProducer
import json
import time

producer = KafkaProducer(bootstrap_servers=['twitter-streaming_kafka_1:9093'],
                         api_version=(0, 11, 5),
                         value_serializer=lambda x: json.dumps(x).encode('utf-8'))

for e in range(1000):
    data = {'number': e}
    producer.send('corona', value=data)
    time.sleep(0.5)
Consumer:
import time
from kafka import KafkaConsumer, KafkaProducer
from datetime import datetime
import json

print('starting consumer')
consumer = KafkaConsumer(
    'corona',
    bootstrap_servers=['twitter-streaming_kafka_1:9093'],
    auto_offset_reset='earliest',
    enable_auto_commit=True,
    group_id='my-group',
    value_deserializer=lambda x: json.loads(x.decode('utf-8')))

print('printing messages')
for message in consumer:
    message = message.value
    print(message)
We executed both scripts in separate CLIs inside our jupyterlab container and they worked. However, when we try to connect to the producer's stream via PySpark with the following code, we get the error shown in the stacktrace below.
import random
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('KafkaStreaming').getOrCreate()
df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "kafka:9093").option("subscribe", "corona").load()
We also executed the following command in the spark-master CLI:
./bin/spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 ...
stacktrace
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
<ipython-input-2-4dba09a73304> in <module>
6
7 spark = SparkSession.builder.appName('KafkaStreaming').getOrCreate()
----> 8 df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "twitter-streaming_kafka_1:9093").option("subscribe", "corona").load()
/usr/local/lib/python3.7/dist-packages/pyspark/sql/streaming.py in load(self, path, format, schema, **options)
418 return self._df(self._jreader.load(path))
419 else:
--> 420 return self._df(self._jreader.load())
421
422 #since(2.0)
/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py in __call__(self, *args)
1303 answer = self.gateway_client.send_command(command)
1304 return_value = get_return_value(
-> 1305 answer, self.gateway_client, self.target_id, self.name)
1306
1307 for temp_arg in temp_args:
/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py in deco(*a, **kw)
132 # Hide where the exception came from that shows a non-Pythonic
133 # JVM exception message.
--> 134 raise_from(converted)
135 else:
136 raise
/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py in raise_from(e)
AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
Your Kafka container needs to be placed on the spark_net network so that the Spark containers can resolve it by name.
The same goes for Jupyter if you want it to be able to launch jobs on the Spark cluster.
Also, you need to add the Kafka package (spark-sql-kafka-0-10) to the Spark session.
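A minimal sketch of the compose change being described (only the added networks entries are shown; everything else stays as in the file above):

  kafka:
    image: wurstmeister/kafka
    networks:
      - spark_net
    # (rest of the kafka service unchanged)
  jupyterlab:
    image: jupyterlab
    networks:
      - spark_net
    # (rest of the jupyterlab service unchanged)

The Kafka package can then be supplied either with --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 on spark-shell/spark-submit (as in the command already shown) or via the spark.jars.packages option when building the SparkSession.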

Spark in Docker container does not read Kafka input - Structured Streaming

When the Spark job is run locally without Docker via spark-submit, everything works fine. However, running it in a Docker container results in no output being generated.
To check whether Kafka itself was working, I extracted Kafka onto the Spark worker container and ran a console consumer against the same host, port and topic (kafka:9092, crypto_topic); that worked correctly and showed output. (There's a producer constantly pushing data to the topic in another container.)
Expected -
20/09/11 17:35:27 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.29.10:42565 with 366.3 MB RAM, BlockManagerId(driver, 192.168.29.10, 42565, None)
20/09/11 17:35:27 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.29.10, 42565, None)
20/09/11 17:35:27 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.29.10, 42565, None)
-------------------------------------------
Batch: 0
-------------------------------------------
+---------+-----------+-----------------+------+----------+------------+-----+-------------------+---------+
|name_coin|symbol_coin|number_of_markets|volume|market_cap|total_supply|price|percent_change_24hr|timestamp|
+---------+-----------+-----------------+------+----------+------------+-----+-------------------+---------+
+---------+-----------+-----------------+------+----------+------------+-----+-------------------+---------+
...
...
...
followed by more output
Actual
20/09/11 14:49:44 INFO BlockManagerMasterEndpoint: Registering block manager d7443d94165c:46203 with 366.3 MB RAM, BlockManagerId(driver, d7443d94165c, 46203, None)
20/09/11 14:49:44 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, d7443d94165c, 46203, None)
20/09/11 14:49:44 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, d7443d94165c, 46203, None)
20/09/11 14:49:44 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
no more output, stuck here
docker-compose.yml file
version: "3"
services:
  zookeeper:
    image: zookeeper:3.6.1
    container_name: zookeeper
    hostname: zookeeper
    ports:
      - "2181:2181"
    networks:
      - crypto-network
  kafka:
    image: wurstmeister/kafka:2.13-2.6.0
    container_name: kafka
    hostname: kafka
    ports:
      - "9092:9092"
    environment:
      - KAFKA_ADVERTISED_HOST_NAME=kafka
      - KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181
      - KAFKA_ADVERTISED_PORT=9092
      # topic-name:partitions:in-sync-replicas:cleanup-policy
      - KAFKA_CREATE_TOPICS="crypto_topic:1:1:compact"
    networks:
      - crypto-network
  kafka-producer:
    image: python:3-alpine
    container_name: kafka-producer
    command: >
      sh -c "pip install -r /usr/src/producer/requirements.txt
      && python3 /usr/src/producer/kafkaProducerService.py"
    volumes:
      - ./kafkaProducer:/usr/src/producer
    networks:
      - crypto-network
  cassandra:
    image: cassandra:3.11.8
    container_name: cassandra
    hostname: cassandra
    ports:
      - "9042:9042"
    #command:
    #  cqlsh -f /var/lib/cassandra/cql-queries.cql
    volumes:
      - ./cassandraData:/var/lib/cassandra
    networks:
      - crypto-network
  spark-master:
    image: bde2020/spark-master:2.4.5-hadoop2.7
    container_name: spark-master
    hostname: spark-master
    ports:
      - "8080:8080"
      - "7077:7077"
      - "6066:6066"
    networks:
      - crypto-network
  spark-consumer-worker:
    image: bde2020/spark-worker:2.4.5-hadoop2.7
    container_name: spark-consumer-worker
    environment:
      - SPARK_MASTER=spark://spark-master:7077
    ports:
      - "8081:8081"
    volumes:
      - ./sparkConsumer:/sparkConsumer
    networks:
      - crypto-network
networks:
  crypto-network:
    driver: bridge
spark-submit is run with:
docker exec -it spark-consumer-worker bash
/spark/bin/spark-submit --master $SPARK_MASTER --class processing.SparkRealTimePriceUpdates \
--packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.3,org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
/sparkConsumer/sparkconsumer_2.11-1.0-RELEASE.jar
Relevant parts of Spark code
val inputDF: DataFrame = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka:9092")
.option("subscribe", "crypto_topic")
.load()
...
...
...
val queryPrice: StreamingQuery = castedDF
.writeStream
.outputMode("update")
.format("console")
.option("truncate", "false")
.start()
queryPrice.awaitTermination()
val inputDF: DataFrame = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka:9092")
.option("subscribe", "crypto_topic")
.load()
This part of the code was actually
val inputDF: DataFrame = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS)
.option("subscribe", KAFKA_TOPIC)
.load()
Where KAFKA_BOOTSTRAP_SERVERS and KAFKA_TOPIC are read from a config file while packaging the JAR locally.
The best way to debug this, for me, was to make the logging more verbose.
Locally, the value of KAFKA_BOOTSTRAP_SERVERS was localhost:9092; in the Docker container the config file was changed to kafka:9092, but this didn't take effect because the JAR had already been packaged. So changing the value to kafka:9092 while packaging locally fixed it.
I would still appreciate any advice on how to have a JAR pick up its configuration dynamically, though. I don't want to package via SBT on the Docker container.
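One hedged option, assuming the job can read the config file location from a system property rather than from the packaged classpath (the config-loading code isn't shown, so the app.conf name and the config.file property below are hypothetical, Typesafe-Config-style placeholders): ship the file at submit time with --files and point the driver at it, so the same JAR works both locally and in Docker.

/spark/bin/spark-submit --master $SPARK_MASTER --class processing.SparkRealTimePriceUpdates \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.3,org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 \
  --files /sparkConsumer/app.conf \
  --conf spark.driver.extraJavaOptions=-Dconfig.file=app.conf \
  /sparkConsumer/sparkconsumer_2.11-1.0-RELEASE.jar
  # app.conf and -Dconfig.file are hypothetical placeholders; adapt them to however the job actually loads its configuration

Reading an environment variable at runtime and falling back to the packaged value is another common way to avoid baking the broker address in at package time.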
