Big Data project requirements using twitter streams - twitter

I am currently trying to break into Data engineering and I figured the best way to do this was to get a basic understanding of the Hadoop stack(played around with Cloudera quickstart VM/went through tutorial) and then try to build my own project. I want to build a data pipeline that ingests twitter data, store it in HDFS or HBASE, and then run some sort of analytics on the stored data. I would also prefer that I use real time streaming data, not historical/batch data. My data flow would look like this:
Twitter Stream API --> Flume --> HDFS --> Spark/MapReduce --> Some DB
Does this look like a good way to bring in my data and analyze it?
Also, how would you guys recommend I host/store all this?
Would it be better to have one instance on AWS ec2 for hadoop to run on? or should I run it all in a local vm on my desktop?
I plan to have only one node cluster to start.

First of all, Spark Streaming can read from Twitter, and in CDH, I believe that is the streaming framework of choice.
Your pipeline is reasonable, though I might suggest using Apache NiFi (which is in the Hortonworks HDF distribution), or Streamsets, which is installable in CDH easily, from what I understand.
Note, these are running completely independently of Hadoop. Hint: Docker works great with them. HDFS and YARN are really the only complex components that I would rely on a pre-configured VM for.
Both Nifi and Streamsets give you a drop and drop UI for hooking Twitter to HDFS and "other DB".
Flume can work, and one pipeline is easy, but it just hasn't matured at the level of the other streaming platforms. Personally, I like a Logstash -> Kafka -> Spark Streaming pipeline better, for example because Logstash configuration files are nicer to work with (Twitter plugin builtin). And Kafka works with a bunch of tools.
You could also try out Kafka with Kafka Connect, or use Apache Flink for the whole pipeline.
Primary takeaway, you can bypass Hadoop here, or at least have something like this
Twitter > Streaming Framework > HDFS
.. > Other DB
... > Spark
Regarding running locally or not, as long as you are fine with paying for idle hours on a cloud provider, go ahead.

Related

Fluentd to Kafka on Kubernetes connection Issues

I'm having issues connecting Fluentd to Kafka for a centralized logging PoC I'm working on.
I'm currently using the following configuration:
Minikube
Fluentd
fluent/fluentd-kubernetes-daemonset:v1.14.3-debian-kafka2-1.0 (docker)
Configuration: I have the FLUENT_KAFKA2_BROKERS=<INTERNAL KAFKA BOOTSTRAP IP>:9092 and FLUENT_KAFKA2_DEFAULT_TOPIC=logs env set in my yaml for fluentd daemonset.
Kafka
I was sort of expecting to see the logs appear in a Kafka consumer running against the same broker listening on the "logs" topic. No dice.
Could anyone recommend next steps for troubleshooting and or a good reference? I've done a good bit of searching and have only found a few people posting about setting up with the fluentd-kafka plugin. Also would it make sense for me to explore Fluent Bit Kafka setup as an alternative?
In general, to configure forwarding of log events to Kafka topic you would definitely need to use output plugins for Fluentd.
Fluentd delivers fluent-plugin-kafka plugin, as specified in Fluentd docs, for both input and output use cases. For output case, this plugin has Kafka Producer functions to publishes messages into topics. kafka-connect-fluentd plugin can also be used as an alternative.
Fluent Bit - being the sub-project of Fluentd - a good lightweight alternative for Fluentd, but which one to use depends on your particular use case.
Fluent Bit has limited amount of filtering options, it is not as pluggable and flexible as Fluentd. The later has more configuration options and filters, it can be integrated with a much larger amount of input and output sources. It is essentially designed to deal with heavy throughput — aggregating from multiple inputs, processing data and routing to different outputs. More on comparison here and here.

How to ask KAFKA about it's status - number of consumers, etc.? Particularly when it's running in a docker container?

I am looking at how to get the information on the number of consumers from a Kafka server running in a docker container.
But I'll also take almost any info to help point me in a direction that is forward movement. I've been trying through Python ond URI requests, but I'm getting the feeling I need to get back to Java to ask Kafka questions on it's status?
In relation to the current threads I've seen, many handy scripts from $KAFKA_HOME are referenced but, the configured systems I have access to do not have $KAFKA_HOME defined - nor do they have the contents of that directory. My world is a docker container without a CLI access. So I haven't been able to apply the solutions requiring shell scripts or other tools from $KAFKA_HOME to my running system.
One of the things I have tried is a python script using requests.get(uri...)
where the uri looks like:
http://localhost:9092/connectors/
The code looks like:
r = requests.get("http://%s:%s/connectors" % (config.parameters['kafkaServerIPAddress'],config.parameters['kafkaServerPort']))
currentConnectors=r.json()
So far I get a "nobody's home at that address" response.
I'm really stuck and a pointer to something akin to "Beginners Guide to Getting KAFKA Monitoring Information" would be great. Also if there's a way to grab the helpful kafka shell scripts & tools, that would be great to - where do they come from?
One last thing - I'm new enough to Kafka that I don't know what I don't know.
Thanks.
running in a Docker container
That shouldn't matter, but Confluent maintains a few pages that go over how to configure the containers for monitoring
https://docs.confluent.io/platform/current/installation/docker/operations/monitoring.html
https://docs.confluent.io/platform/current/kafka/monitoring.html
number of consumers
Such a metric doesn't exist
Python and URI requests
You appear to be using the /connectors endpoint of the Kafka Connect REST API (which runs on port 8083, not 9092). It is not a monitoring endpoint for brokers or non-Connect-API consumers
way to grab the helpful kafka shell scripts & tools
https://kafka.apache.org/downloads > Binary downloads
You don't need container shell access, but you will need external network access, just as all clients outside of a container would.

kafka-spark streaming in docker

I am new and inexperienced in the world of containers and docker systems. I want to write a producer/consumer in Kafka and set up some streaming and ETL using pyspark. Now I am well aware of the process and the technical background required.
All I want to know is if I just had to create a small demo of the above and share the files with my students in a docker so that all they would have to do is install it on their end and see how it works, is that even a possibility?
You can create a simple docker compose with the required deployment and share the docker compose artifacts with the students.
Furthermore, you can have a look at WSO2 Stream Processor which provides an interactive UI to write the streaming and ETL related logic if you are not bound to Spark.

Docker containers monitoring system using SNMP

Can anyone tell me if it is a good idea to monitor Docker containers using SNMP? I mean, I'm thinking at installing SNMP agent on each container and collect data throug a Flink/Kafka stream, but I don't know if it's ok to proceed in this way like installing SNMP agent on each container or not.
Thank you!
There are docker APIs which many tools use to collect this info. You do not need to install anything inside the containers for these basic metrics. The most popular open source tool for this is Prometheus, but there are dozens of commercial tools which use the same method.
https://docs.docker.com/config/thirdparty/prometheus/#configure-docker

Start EC2 with Docker, run script and shut down

Hi Stackoverflow community, I have a question regarding using Docker with AWS EC2. I am comfortable with EC2 but am very new to Docker. I code in Python 3.6 and would like to automate the following process:
1: start an EC2 instance with Docker (Docker image stored in ECR)
2: run a one-off process and return results (let's call it "T") in a CSV format
3: store "T" in AWS S3
4: Shut down the EC2
The reason for using an EC2 instance is because the process is quite computationally intensive and is not feasible for my local computer. The reason for Docker is to ensure the development environment is the same across the team and the CI facility (currently using circle.ci). I understand that interactions with AWS can mostly be done using Boto3.
I have been reading about AWS's own ECS and I have a feeling that it's geared more towards deploying a web-app with Docker rather than running a one-off process. However, when I searched around EC2 + Docker nothing else but ECS came up. I have also done the tutorial in AWS but it doesn't help much.
I have also considered running EC2 with a shell script (i.e. downloading docker, pulling the image, building the container etc)but it feels a bit hacky? Therefore my questions here are:
1: Is ECS really the most appropriate solution in his scenario? (or in other words is ECS designed for such operations?)
2: If so are there any examples of people setting-up and running a one-off process using ECS? (I find the set-up really confusing especially the terminologies used)
3: What are the other alternatives (if any)?
Thank you so much for the help!
Without knowing more about your process; I'd like to pose 2 alternatives for you.
Use Lambda
Pending just how compute intensive your process is, this may not be a viable option. However, if it something that can be distributed, Lambda is awesome. You can find more information about the resource limitations here. This route, you would simply write Python 3.6 code to perform your task and write "T" to S3.
Use Data Pipeline
With Data Pipeline, you can build a custom AMI (EC2) and use that as your image. You can then specify the size of the EC2 resource that you need to run this process. It sounds like your process would be pretty simple. You would need to define:
EC2resource
Specify AMI, Role, Security Group, Instance Type, etc.
ShellActivity
Bootstrap the EC2 instance as needed
Grab your code form S3, GitHub, etc
Execute your code (Include in your code writing "T" to S3)
You can also schedule the pipeline to run at an interval/schedule or call it directly from boto3.

Resources