kafka-spark streaming in docker

I am new and inexperienced in the world of containers and Docker. I want to write a producer/consumer in Kafka and set up some streaming and ETL using pyspark. I am well aware of the process and the technical background required.
All I want to know is: if I were to create a small demo of the above and share the files with my students as a Docker setup, so that all they would have to do is install it on their end and see how it works, is that even a possibility?

You can create a simple Docker Compose setup with the required services and share the Compose artifacts with the students.
Furthermore, if you are not bound to Spark, you can have a look at WSO2 Stream Processor, which provides an interactive UI for writing the streaming and ETL logic.
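For the Compose route, here is a minimal sketch of what such a shareable setup could look like. The image names, environment variables and ports below are assumptions (Bitnami images are used only as an example); check the documentation of whichever Kafka/Spark images you pick:

    # a rough sketch, not a tested deployment - adjust images, variables and ports
    cat > docker-compose.yml <<'EOF'
    version: "3.8"
    services:
      zookeeper:
        image: bitnami/zookeeper:latest
        environment:
          - ALLOW_ANONYMOUS_LOGIN=yes        # check the image docs for the exact variables
      kafka:
        image: bitnami/kafka:latest
        depends_on: [zookeeper]
        environment:
          - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
          - ALLOW_PLAINTEXT_LISTENER=yes
        ports:
          - "9092:9092"
      spark:
        image: bitnami/spark:latest          # where students run the pyspark producer/consumer and ETL code
        volumes:
          - ./demo:/opt/demo                 # your demo scripts, mounted into the container
    EOF
    docker-compose up -d

Your students would then only need Docker and Docker Compose installed, run docker-compose up, and exec into the Spark container to walk through the demo.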

Related

How to make separate CLI tools in Docker containers accessible from a "central" Docker container?

I want to create a tool that couples together a lot (~10, perhaps more) of other CLI tools to automate some stuff. This tool needs to be able to just be dropped onto any VPS and work, hence the Docker containers. "Work" in this case means running a central program (made by me) that orchestrates all the other tools and aggregates their results in a single database to browse/export later. The tools' containers need to have network access.
With my limited knowledge of Docker, I've concluded that a multi-stage build to fit all the tools into a single container is a bad design here, and very cumbersome. I've thought of networking the tools' containers to the central one and doing some sort of TCP piping, but that seems less than ideal too. What should the approach here be like? Are there some ready-made solutions to this issue?
Thanks
How about docker-compose?
You can use this tool to deploy all your dockerized tools inside a Docker network and then communicate with them from your orchestrator. Additionally, you can pack these composed containers into another container to create a Docker-in-Docker environment, exposing only your orchestrator as the gateway to your all-in-one tool.
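As a hedged illustration of that Compose idea (the service names, build paths and port below are made up for this sketch), each tool and the orchestrator become services on one user-defined network, and only the orchestrator publishes a port:

    cat > docker-compose.yml <<'EOF'
    version: "3.8"
    services:
      orchestrator:                # your central program, the only published entry point
        build: ./orchestrator
        ports:
          - "8080:8080"
        networks: [tools]
      tool-a:                      # one of the ~10 CLI tools, wrapped behind a thin shim
        build: ./tools/tool-a
        networks: [tools]
      tool-b:
        build: ./tools/tool-b
        networks: [tools]
    networks:
      tools: {}
    EOF

Inside that network the containers resolve each other by service name, so the orchestrator can reach tool-a directly (e.g. over HTTP or a plain socket) without any manual TCP piping.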
Cheers,

How to create a single project out of multiple docker images

I have been working on a project where I have had several docker containers:
Three OSRM routing servers
Nominatim server
Container where the webpage code is with all the needed dependencies
So, now I want to prepare a version that a user could download and run. What is the best practice to do such a thing?
Firstly, I thought about joining everything into one container, but I have read that it is not recommended to run several processes in one container. Secondly, I thought about wrapping everything up in a VM, but that is not really a "program" that a user can launch. And my third idea was to write a script that would download each container from Docker Hub separately and launch the webpage. But I am not sure if that is best practice, or whether there are better ideas.
When you need to deploy a full project composed of several containers, you may use a specialized tool.
A well-known one for single-server usage is docker-compose:
Compose is a tool for defining and running multi-container Docker applications
https://docs.docker.com/compose/
You could provide to your users:
a docker-compose file (see the sketch below)
your application Docker images (e.g. through Docker Hub)
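As a rough sketch of that bundle (the image names and port below are placeholders for your three OSRM servers, Nominatim, and the webpage image published on Docker Hub):

    cat > docker-compose.yml <<'EOF'
    version: "3.8"
    services:
      osrm-car:
        image: yourhub/osrm-car:latest      # one of the three OSRM routing servers
      osrm-bike:
        image: yourhub/osrm-bike:latest
      osrm-foot:
        image: yourhub/osrm-foot:latest
      nominatim:
        image: yourhub/nominatim:latest
      web:
        image: yourhub/webpage:latest       # the webpage code with its dependencies
        ports:
          - "80:80"
        depends_on: [osrm-car, osrm-bike, osrm-foot, nominatim]
    EOF
    docker-compose pull && docker-compose up -d    # all the user has to run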
Regarding clusters/cloud, we talk more about orchestrators like Docker Swarm, Kubernetes, or Nomad.
Kubernetes's documentation is the following:
https://kubernetes.io/

How to create docker image of existing WAS 8.5.5.14?

I have WebSphere Application Server 8.5.5.14 hosting my ERP. I want to dockerize the application and deploy it into a Kubernetes cluster. Can anyone provide me information on how to create an image out of my existing WAS 8.5.5.14?
In theory you could do this by creating a tarball of the filesystem and importing it into Docker to make an image via something like:
cat WAS.tar | docker import - appImage
but there are going to be a number of issues you'll need to avoid. For example, if you have resources (JDBC drivers, resource adapters, etc.), the tarball will need to have all of those included. You'll also need to expose all of the required ports for your app and its administration. A better way, and best practice, would be to start with an IBM-supported image of traditional WAS and build your system on top of it.
There are detailed instructions to do this at https://github.com/WASdev/ci.docker.websphere-traditional#docker-hub-image
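As a very rough sketch only (the base image tag and the paths below are assumptions; the linked repository documents the exact directories and customization hooks the image supports), building on the supported image could look like:

    cat > Dockerfile <<'EOF'
    # assumption: paths follow the pattern described in the WASdev/ci.docker.websphere-traditional README
    FROM ibmcom/websphere-traditional:latest
    COPY myErpApp.ear /work/app/                # your application archive
    COPY configure_resources.py /work/config/   # wsadmin (jython) script for JDBC drivers, resource adapters, ports, etc.
    EOF
    docker build -t my-erp-was .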
F Rowe's answer is good; if you follow their advice of using the official images, you will be using WebSphere v9.0 in the container. You can use this tool, which can help you figure out whether there are any changes you need to make to your application to get it working in the container. It also generates some of the wsadmin scripts to configure the server in the image.

Big Data project requirements using twitter streams

I am currently trying to break into data engineering, and I figured the best way to do this was to get a basic understanding of the Hadoop stack (I played around with the Cloudera QuickStart VM and went through the tutorial) and then try to build my own project. I want to build a data pipeline that ingests Twitter data, stores it in HDFS or HBase, and then runs some sort of analytics on the stored data. I would also prefer to use real-time streaming data, not historical/batch data. My data flow would look like this:
Twitter Stream API --> Flume --> HDFS --> Spark/MapReduce --> Some DB
Does this look like a good way to bring in my data and analyze it?
Also, how would you guys recommend I host/store all this?
Would it be better to have one instance on AWS EC2 for Hadoop to run on, or should I run it all in a local VM on my desktop?
I plan to have only a one-node cluster to start.
First of all, Spark Streaming can read from Twitter, and in CDH, I believe that is the streaming framework of choice.
Your pipeline is reasonable, though I might suggest using Apache NiFi (which is in the Hortonworks HDF distribution), or Streamsets, which is installable in CDH easily, from what I understand.
Note that these run completely independently of Hadoop. Hint: Docker works great with them. HDFS and YARN are really the only complex components that I would rely on a pre-configured VM for.
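For instance, both have official images on Docker Hub (the ports below are the usual defaults, but they can differ by version):

    docker run -d --name nifi -p 8443:8443 apache/nifi                  # recent NiFi versions serve the UI over HTTPS on 8443; older ones used 8080
    docker run -d --name sdc -p 18630:18630 streamsets/datacollector    # StreamSets Data Collector UI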
Both NiFi and StreamSets give you a drag-and-drop UI for hooking Twitter to HDFS and "other DBs".
Flume can work, and a single pipeline is easy, but it just hasn't matured to the level of the other streaming platforms. Personally, I like a Logstash -> Kafka -> Spark Streaming pipeline better, for example because Logstash configuration files are nicer to work with (a Twitter input plugin is built in). And Kafka works with a bunch of tools.
You could also try out Kafka with Kafka Connect, or use Apache Flink for the whole pipeline.
Primary takeaway: you can bypass Hadoop here, or at least have something like this:
Twitter -> Streaming Framework -> HDFS
                               -> Other DB
                               -> Spark
Regarding running locally or not, as long as you are fine with paying for idle hours on a cloud provider, go ahead.

Delivering software updates with Docker

I am using docker-compose to run multi-container software on premise (offline networks). What is the best way to deliver software updates when there is no direct connectivity to Docker registry?
I tried 2 options (neither is sufficient):
1. Delivering full Docker images created using docker save. Very inefficient - each image is above 1 GB, with no layer optimization.
2. Delivering the customized software in addition to the Docker images and mapping it into generic containers via host volumes. This way I can deliver updates to the custom software only and reduce how often the Docker images need to be updated. Yet it is far from optimal.
Is there any way to deliver only the updated Docker layer with custom software as a file? Any other ideas?
Thanks,
Meir
You have several options:
Set up a private registry within the private network and let the servers use that
Utilize the torrent protocol to go even faster; see the Docket project
If nothing else, gzip your save using the --rsyncable option and use rsync instead of scp
Use docker-slim to reduce your final image size
I hope a combination of a few of the methods above will be good enough.
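A hedged sketch of the registry and rsync options above (host names and image names are placeholders):

    # private registry inside the offline network
    docker run -d -p 5000:5000 --restart=always --name registry registry:2
    docker tag myapp:2.0 registry.internal:5000/myapp:2.0
    docker push registry.internal:5000/myapp:2.0       # servers then pull from registry.internal:5000

    # ship only the deltas of a saved image with rsync
    docker save myapp:2.0 | gzip --rsyncable > myapp-2.0.tar.gz
    rsync -avz --partial myapp-2.0.tar.gz admin@onprem-host:/opt/updates/
    ssh admin@onprem-host 'gunzip -c /opt/updates/myapp-2.0.tar.gz | docker load'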
