Web log parsing for Spark Streaming

I plan to create a system where I can read web logs in real time and use Apache Spark to process them. I am planning to use Kafka to pass the logs to Spark Streaming and aggregate statistics there. I am not sure whether I should do some data parsing (raw to JSON ...), and if so, where the appropriate place to do it is (the Spark script, Kafka, somewhere else ...). I would be grateful if someone could guide me. This is all quite new to me. Cheers

Apache Kafka is a distributed pub-sub messaging system. It does not provide any way to parse or transform data; that is not what it is for. But any Kafka consumer can process, parse, or transform the data published to Kafka and republish the transformed data to another topic, or store it in a database or file system.
There are many ways to consume data from Kafka; one of them is the one you suggested, a real-time stream processor (Apache Flume, Apache Spark, Apache Storm, ...).
So the answer is no, Kafka does not provide any way to parse the raw data. You can transform/parse the raw data with Spark, but you could just as well write your own consumer (there are Kafka client libraries for many languages) or use any other ready-made consumer such as Apache Flume or Apache Storm.
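
To make the "parse it in Spark" option concrete, here is a minimal sketch of a Spark Streaming job that reads raw log lines from a Kafka topic and aggregates them. The broker address, the topic name ("weblogs"), the batch interval, and the naive split-based parsing are all assumptions you would replace with your own setup and log format:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    import scala.Tuple2;

    public class WebLogStream {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("WebLogStream");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "weblog-parsers");

            JavaInputDStream<ConsumerRecord<String, String>> stream =
                    KafkaUtils.createDirectStream(
                            jssc,
                            LocationStrategies.PreferConsistent(),
                            ConsumerStrategies.<String, String>Subscribe(
                                    Collections.singletonList("weblogs"), kafkaParams));

            // Parse each raw log line inside the Spark job, then aggregate.
            stream.map(ConsumerRecord::value)                  // the raw log line
                  .map(line -> line.split(" "))                // naive split; a real log format needs a proper regex
                  .mapToPair(f -> new Tuple2<>(f.length > 6 ? f[6] : "unknown", 1L)) // e.g. the request path field
                  .reduceByKey(Long::sum)                      // hits per path in each 10-second batch
                  .print();

            jssc.start();
            jssc.awaitTermination();
        }
    }

The map step is where you would turn each line into JSON (or a POJO) before aggregating, so the parsing lives entirely inside the Spark job and Kafka only carries the raw lines.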

Related

How to send image data via different microservices with Redis

I wanted to ask what options make sense with Redis, as I am unsure about Redis Pub/Sub in particular. Suppose I have a service A (Java client) that processes images. Unfortunately it can't process all kinds of images (because the language/framework doesn't support it yet). This is where service B comes into play (Node.js).
Service A streams the image bytes to Redis. Service B should read these bytes from Redis and encode them into the correct format. Then it streams the result back to Redis, and service A is somehow notified to read the result from Redis.
There are two strategies I am considering for this:
1) Using the Pub/Sub feature of Redis. Service A streams the chunks to Redis (e.g. via writeStream) and then, as publisher, publishes some metadata to service B (and its replicas) as subscribers. Service B reads the stream (locking it against the other replicas), processes it, and streams the result back to Redis. It then publishes a message telling service A that the result can be fetched from Redis.
2) I put everything, metadata and bytes, directly into the pub/sub message and then proceed as in 1). But how do I then lock the message against the other replicas of B? I want to avoid them all processing the same image.
So my question is:
Does the pub/sub feature of Redis allow for strategy no. 2 in terms of performance, or is it intended exclusively for "lightweight" messages such as log data, metadata, and IDs?
And if Redis in general is not a good solution for this approach, which one would be? Async REST endpoints?
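
As a point of reference for the locking question, here is a minimal sketch of how one replica of service B could claim a job with a SETNX-style lock, assuming the Jedis client on the Java side; the key names, channel name, lock TTL, and the convert step are all made up for illustration:

    import redis.clients.jedis.Jedis;

    public class ImageHandoff {
        // Service B side: try to claim a job that was announced via pub/sub.
        public static void main(String[] args) {
            String jobId = "job:42"; // would normally arrive in the pub/sub message
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                // SETNX succeeds for exactly one replica, so only it processes this image.
                long acquired = jedis.setnx("lock:" + jobId, "replica-1");
                if (acquired == 1) {
                    jedis.expire("lock:" + jobId, 60);                 // avoid a stuck lock if this replica dies
                    byte[] raw = jedis.get(("image:" + jobId).getBytes());
                    byte[] converted = convert(raw);                   // placeholder for the actual re-encoding
                    jedis.set(("result:" + jobId).getBytes(), converted);
                    jedis.publish("results", jobId);                   // notify service A that the result is ready
                }
            }
        }

        private static byte[] convert(byte[] raw) { return raw; }      // stand-in for the real conversion
    }

Whether the bytes themselves should travel through pub/sub (strategy 2) is a separate question: pub/sub messages are not persisted, so large payloads are usually kept in regular keys or Redis Streams and only the job ID is published.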

Writing to memcache from within a streaming dataflow pipeline

Is it possible to write to Memcache from a streaming Dataflow pipeline, or do I need to write to Pub/Sub and create another Compute Engine or App Engine instance?
Yes, the Dataflow workers can communicate with any external services that you need; they are just VMs with no special restrictions or permissions.
If you are just writing out data to memcache, the Sink API will likely be useful.
For Redis I created a DoFn with a Redis client.
It is possible to do some tricks if you need batch writing. For example:
link
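
A rough sketch of that DoFn-with-a-client pattern, written against the Beam-style DoFn lifecycle and using Redis via Jedis; the host, the key scheme, and the per-bundle pipelining are assumptions, and the same shape works for a memcache client:

    import org.apache.beam.sdk.transforms.DoFn;

    import redis.clients.jedis.Jedis;
    import redis.clients.jedis.Pipeline;

    // One client per worker, commands buffered into a Redis pipeline and flushed per bundle.
    public class RedisWriteFn extends DoFn<String, Void> {
        private transient Jedis jedis;
        private transient Pipeline pipeline;

        @Setup
        public void setup() {
            jedis = new Jedis("localhost", 6379);   // assumed host; in practice pass it in as an option
        }

        @StartBundle
        public void startBundle() {
            pipeline = jedis.pipelined();           // batch the writes within a bundle
        }

        @ProcessElement
        public void processElement(ProcessContext c) {
            String element = c.element();
            pipeline.set("element:" + element.hashCode(), element); // hypothetical key scheme
        }

        @FinishBundle
        public void finishBundle() {
            pipeline.sync();                        // flush the buffered commands
        }

        @Teardown
        public void teardown() {
            if (jedis != null) {
                jedis.close();
            }
        }
    }

You would apply it with ParDo.of(new RedisWriteFn()) at the end of the pipeline; per-bundle batching is one common form of the batch-writing trick mentioned above.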

What would Kafka do if producer goes down?

I'm a bit confused about the Kafka architecture. We would like to capture the Twitter Streaming API, and we came across this Twitter producer: https://github.com/NFLabs/kafka-twitter/blob/master/src/main/java/com/nflabs/peloton2/kafka/producer/TwitterProducer.java
What I'm thinking about is how to design the system so it's fault tolerant.
If the producer goes down, does it mean we lose some of the data? How to prevent this from happening?
If the producer you linked to stops running, new data from the Twitter API will not make its way into Kafka. I'm not sure how the Twitter Streaming API works, but it may be possible to get historic data, allowing you to fetch all data back to the point when the producer failed.
Another option is to use Kafka Connect, which is a distributed, fault tolerant service for connecting data sources and sinks to Kafka. Connect exposes a higher-level API and uses the out-of-the-box producer/consumer API behind the scenes. The documentation explains Connect very thoroughly, so give that a read and go from there.
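
Producer configuration is the other half of "do we lose data": settings like acks and retries guard against broker-side hiccups during a send, though they do not help if the producer process itself dies, which is where something like Connect's distributed mode comes in. A minimal sketch of such a producer, where the broker address, topic name, and placeholder payload are assumptions:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TweetProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("acks", "all");    // wait for all in-sync replicas before a send counts as successful
            props.put("retries", "5");   // retry transient broker failures instead of dropping the record

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String tweetJson = "{\"text\": \"hello\"}"; // placeholder for a tweet from the Streaming API
                producer.send(new ProducerRecord<>("tweets", tweetJson));
            }
        }
    }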

How do I retrieve data from statsd?

I'm glossing over their documentation here:
http://www.rubydoc.info/github/github/statsd-ruby/Statsd
And there are methods for recording data, but I can't seem to find anything about retrieving recorded data. I'm adopting a project with an existing statsd addition. Its host is likely a defunct URL. Is that host perhaps where the stats are recorded?
The statsd server implementations that Mircea links to just take care of receiving and aggregating metrics and publishing them to a backend service. From Etsy's definition of statsd:
"A network daemon that runs on the Node.js platform and listens for statistics, like counters and timers, sent over UDP or TCP and sends aggregates to one or more pluggable backend services (e.g., Graphite)."
To retrieve the recorded data you have to query the backend. Check the list of available backends. The most common one is Graphite.
See also this question: How does StatsD store its data?
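
If the backend is Graphite, retrieving the recorded data is an HTTP call against its render API. A small sketch, where the Graphite host and the metric path are assumptions about your setup:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class GraphiteQuery {
        public static void main(String[] args) throws Exception {
            // Graphite's render API returns the aggregated series; host and metric path are assumptions.
            URL url = new URL("http://graphite.example.com/render"
                    + "?target=stats.counters.page.views.count&from=-1h&format=json");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // a JSON array of {target, datapoints} objects
                }
            }
        }
    }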
There are 2 parts to statsd: a client and a server.
What you're looking at is the client part. You will not see functionality for retrieving the data because it's not there; that normally lives on the server side.
Here is a list of statsd server implementations:
http://www.joemiller.me/2011/09/21/list-of-statsd-server-implementations/
Research and pick one that fits your needs.
Statsd originally started at Etsy: https://github.com/etsy/statsd/wiki
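
To make that split concrete: the client does little more than fire a UDP datagram in statsd's line format and never reads anything back, which is why no retrieval API exists on that side. A minimal sketch, where the host, port, and metric name are assumptions (8125 is the usual default port):

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;

    public class StatsdClientSketch {
        public static void main(String[] args) throws Exception {
            // "page.views:1|c" = increment the counter "page.views" by 1; fire-and-forget over UDP.
            byte[] payload = "page.views:1|c".getBytes(StandardCharsets.UTF_8);
            try (DatagramSocket socket = new DatagramSocket()) {
                socket.send(new DatagramPacket(payload, payload.length,
                        InetAddress.getByName("localhost"), 8125));
            }
        }
    }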

Benefit of Apache Flume

I am new to Apache Flume.
I understand that Apache Flume can help transport data.
But I still fail to see the ultimate benefit offered by Apache Flume.
If I can configure some software, or write my own, to decide which data goes where, why do I need Flume?
Maybe someone can explain a situation that shows Apache Flume's benefit?
Reliable transmission (if you use the file channel):
Flume sends batches of small events. Every time it sends a batch to the next node it waits for an acknowledgment before deleting it. The storage in the file channel is optimized to allow recovery after a crash.
I think the biggest benefit you get out of Flume is extensibility. Basically every component, from sources through interceptors to sinks, is extensible.
We use Flume and read data using a custom Kafka source; the data is in the form of JSON, we parse it in the custom Kafka source and then pass it on to an HDFS sink. It has been working reliably on 5 nodes. We only extended the Kafka source; the HDFS sink functionality we got out of the box.
At the same time, being from the Hadoop ecosystem, you get great community support and multiple options to use the tools in different ways.
