Best practices for Kafka Streams - machine learning

We have a predict service written in Python that provides the machine learning functionality: you send it a set of data, and it returns anomaly detection results, predictions, and so on.
I want to use Kafka streams to process the real-time data.
There are two options to choose from:
1. The Kafka Streams jobs only handle the ETL part: load the data, do a simple transform, and save it to Elasticsearch. A timer then periodically loads data from ES, calls the predict service to compute results, and saves them back to ES.
2. The Kafka Streams jobs do everything beyond the ETL as well: after completing the ETL they send the data to the predict service and save the computed result to Kafka, and a consumer forwards the result from Kafka to ES.
I think the second way is more real-time, but I don't know whether it's a good idea to run so many prediction tasks inside streaming jobs.
Are there any common patterns or advice for such an application?

Yes, I'd opt for the second option as well.
What you can do is use Kafka as the data pipeline between your ML training module and your prediction module. These modules could very well be implemented with Kafka Streams.
[Diagram omitted: Kafka as the data pipeline between the ML training and prediction modules.]
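For illustration, here is a minimal sketch of what the prediction step of the second option could look like outside of a Java Kafka Streams job, using a plain Python consumer and producer (kafka-python). The topic names and the predict service's HTTP endpoint are assumptions, not part of the original setup:

```python
# Minimal sketch of option 2: consume transformed records, call the predict
# service, and publish the prediction result to an output topic.
# The topic names and the /predict endpoint below are hypothetical.
import json

import requests
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "transformed-data",                        # hypothetical input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    group_id="prediction-service",
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for record in consumer:
    # Call the (hypothetical) predict service with the transformed record.
    response = requests.post("http://predict-service/predict", json=record.value)
    response.raise_for_status()
    # Publish the prediction; a separate sink forwards this topic to Elasticsearch.
    producer.send("prediction-results", value=response.json())
```

A Kafka Connect Elasticsearch sink (or a small dedicated consumer) can then forward the prediction-results topic to ES, which keeps the prediction step decoupled from indexing.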

Related

Real-time computation

I have an algorithm written in Python and MySQL which takes a CSV file and some properties as input and then runs for 20-25 minutes to produce output.
I want to make it real-time, such that if the input CSV is uploaded or a property is changed, the output is updated without needing to re-run the algorithm.
Note: the data the algorithm runs on can be very large.
I need help making the computation real-time.
I am trying to change MySQL to a NoSQL DB, but it still takes time to run and is not real-time.
You should try using one of the streaming services for event-driven updates. Instead of a CSV, write the data to a stream like Kafka or Kinesis. Write a consumer that reads the incoming events and updates the data without running the full algorithm. You might be able to use Apache Flink for aggregation against Kafka as well.
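As a rough sketch of that idea, assuming kafka-python and a simple running aggregate (the topic name, field names, and the incremental update itself are placeholders for your algorithm's logic):

```python
# Sketch: consume incoming events and update a running result incrementally,
# instead of re-running the full 20-25 minute algorithm on every change.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "input-events",                    # hypothetical topic replacing the CSV upload
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Running state that replaces the batch recomputation, e.g. count and sum per key.
state = {}

for record in consumer:
    event = record.value
    key = event["key"]                 # placeholder for your grouping column
    agg = state.setdefault(key, {"count": 0, "sum": 0.0})
    agg["count"] += 1
    agg["sum"] += event["value"]       # placeholder for the measure you aggregate
    # Persist or publish the updated result here (database, another topic, ...).
```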

How to send image data between different microservices with Redis

I wanted to ask what options make sense with Redis, as I am unsure about Redis Pub/Sub in particular. Suppose I have a service A (Java client) that processes images. Unfortunately it can't process all kinds of images (because the language/framework doesn't support it yet). This is where service B comes into play (Node.js).
Service A streams the image bytes to Redis. Service B should read these bytes from Redis and encode them into the correct format, then stream the result back to Redis, and Service A is somehow notified to read the result from Redis.
There are two strategies I'm considering for this:
1. Use the Pub/Sub feature of Redis. Service A streams the chunks to Redis (e.g. via writeStream) and then, as publisher, publishes some metadata to Service B (and its replicas) as subscribers. Service B then reads the stream (locking it for the other replicas), processes it, and streams the result back to Redis. It then sends a message, as publisher, to Service A that the result can be fetched from Redis.
2. Put everything, metadata and bytes, directly into Redis Pub/Sub and then proceed as in 1). But how do I then lock the message for the other replicas of B? I want to avoid them all processing the same image.
So my questions are:
Does the Pub/Sub feature of Redis allow strategy no. 2 in terms of performance, or is it exclusively intended for "lightweight" messages such as log data, metadata, and IDs?
And if Redis in general is not a good fit for this approach, what would be? Async REST endpoints?
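For reference, a minimal sketch of strategy 1 from Service B's side, using Redis Streams for the bytes, Pub/Sub for the notifications, and a SET NX key as a simple per-image lock so replicas don't process the same image. It is written in Python for brevity (the redis-py calls map directly onto the equivalent Node.js client commands), and all stream, channel, and key names are made up:

```python
# Sketch of strategy 1 from Service B's point of view, using redis-py.
import redis

def encode(raw_bytes):
    # Placeholder for the actual format conversion Service B performs.
    return raw_bytes

r = redis.Redis(host="localhost", port=6379)
pubsub = r.pubsub()
pubsub.subscribe("images:ready")              # hypothetical channel Service A publishes metadata to

for message in pubsub.listen():
    if message["type"] != "message":
        continue                              # skip subscribe confirmations
    image_id = message["data"].decode("utf-8")

    # Simple lock so only one replica of Service B handles this image.
    if not r.set(f"lock:{image_id}", "service-b", nx=True, ex=60):
        continue                              # another replica claimed it first

    # Read the chunks Service A wrote to a Redis Stream for this image.
    entries = r.xrange(f"image:{image_id}:in")
    data = b"".join(fields[b"chunk"] for _, fields in entries)

    result = encode(data)

    # Stream the result back and notify Service A that it can be fetched.
    r.xadd(f"image:{image_id}:out", {"chunk": result})
    r.publish("images:done", image_id)
```

Note that Redis Streams consumer groups (XREADGROUP/XACK) would be a more natural way to ensure each image is handled by exactly one replica of B, whereas plain Pub/Sub is fire-and-forget and delivers every message to every subscriber.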

What is a recommended solution for monitoring heterogeneous infrastructure?

I am looking for a monitoring tool for the following use cases:
Collect basic metrics about virtual machines (CPU usage, memory usage, I/O, available disk space)
Extract metrics from SQL Server (probably by running some queries)
Extract information from an external service about processing, i.e. how many processings are currently running and for how long. I am thinking about writing Python scripts, but I don't know how to combine them with a monitoring tool.
Have the ability to plot charts and manage alerts, and it would be nice to be able to send not only mails but also messages to Slack/MS Teams.
I was thinking about Prometheus, because it has wmi_exporter, node_exporter, a SQL exporter, and Alertmanager with the ability to send notifications to multiple destinations, but I don't know what to do about the external service and the Python scripts.
Any suggestions?
Prometheus can definitely do what you say you need done. Some of it may not be trivial, but you can definitely fill in the blanks yourself.
E.g. you can get machine metrics basically out of the box by firing up a node_exporter and having it scraped by Prometheus, but I don't think it has e.g. information on all running processes. The latter might require you to write an agent/exporter: a simple web server that exposes metrics on /metrics; there exists a Python client library to help with that. Or have said processes (assuming they're your code) push metrics to a Pushgateway instead, if they're short lived batch jobs.
Oh, and for charts/dashboards you probably want Grafana, as Prometheus' abilities in that area are rather limited and Grafana integrates rather well with Prometheus.
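For the external service, a custom exporter along the lines described above could look roughly like this, using the official prometheus_client library; the metric names and the query_external_service() helper are hypothetical:

```python
# Sketch of a tiny custom exporter: poll the external service and expose the
# numbers on /metrics for Prometheus to scrape. query_external_service() is a
# placeholder for however you reach that service (HTTP call, DB query, ...).
import time

from prometheus_client import Gauge, start_http_server

RUNNING = Gauge("external_processing_running", "Number of processings currently running")
LONGEST = Gauge("external_processing_longest_seconds", "Duration of the longest running processing")

def query_external_service():
    # Placeholder: return (running_count, longest_duration_seconds) from the service.
    return 0, 0.0

if __name__ == "__main__":
    start_http_server(8000)            # expose /metrics on this port
    while True:
        running, longest = query_external_service()
        RUNNING.set(running)
        LONGEST.set(longest)
        time.sleep(15)
```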

Difference between stream processing and message processing

What is the basic difference between stream processing and traditional message processing? People say that Kafka is a good choice for stream processing, but essentially Kafka is a messaging framework similar to ActiveMQ, RabbitMQ, etc.
Why do we generally not say that ActiveMQ is good for stream processing as well?
Is it the speed at which messages are consumed by the consumer that determines whether it is a stream?
In traditional message processing, you apply simple computations on the messages -- in most cases individually per message.
In stream processing, you apply complex operations on multiple input streams and multiple records (ie, messages) at the same time (like aggregations and joins).
Furthermore, traditional messaging systems cannot go "back in time" -- ie, they automatically delete messages after they got delivered to all subscribed consumers. In contrast, Kafka keeps messages for a configurable amount of time and uses a pull-based model (ie, consumers pull data out of Kafka). This allows consumers to "rewind" and consume messages multiple times -- or if you add a new consumer, it can read the complete history. This makes stream processing possible, because it allows for more complex applications. Furthermore, stream processing is not necessarily about real-time processing -- it's about processing infinite input streams (in contrast to batch processing, which is applied to finite inputs).
And Kafka offers Kafka Connect and Streams API -- so it is a stream-processing platform and not just a messaging/pub-sub system (even if it uses this in its core).
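As a small illustration of the "go back in time" point, a hedged sketch with kafka-python (the topic name is arbitrary):

```python
# Sketch: because Kafka retains messages for a configurable time, a consumer
# can start from the oldest retained message and re-read the full history --
# something a traditional queue deletes after delivery.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "some-topic",                      # arbitrary topic name
    bootstrap_servers="localhost:9092",
    group_id="replay-demo",
    auto_offset_reset="earliest",      # a new consumer group starts at the beginning of the retained log
    consumer_timeout_ms=5000,          # stop iterating when no new messages arrive
)

for record in consumer:
    print(record.offset, record.value)

# An already-running consumer can also rewind explicitly and read everything again:
consumer.seek_to_beginning()
```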
If you like splitting hairs:
Messaging is communication between two or more processes or components, whereas streaming is the passing of an event log as events occur. Messages carry raw data, whereas events contain information about the occurrence of an activity, such as an order.
So Kafka does both, messaging and streaming. A topic in Kafka can hold raw messages or an event log that is normally retained for hours or days. Events can further be aggregated into more complex events.
Although Rabbit supports streaming, it was actually not built for it (see Rabbit's web site).
Rabbit is a message broker and Kafka is an event streaming platform.
Kafka can handle a huge number of 'messages' compared to Rabbit.
Kafka is a log while Rabbit is a queue, which means that once consumed, Rabbit's messages are no longer there in case you need them.
However, Rabbit supports message priorities while Kafka doesn't.
It depends on your needs.
Message processing implies operations on and/or using individual messages. Stream processing encompasses operations on and/or using individual messages as well as operations on collections of messages as they flow into the system. For example, let's say transactions are coming in for a payment instrument -- stream processing can be used to continuously compute the hourly average spend. In this case, a sliding window can be imposed on the stream which picks up messages within the hour and computes the average of the amount. Such figures can then be used as inputs to fraud detection systems.
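A rough sketch of that hourly-average example, deliberately kept to plain Python over a kafka-python consumer rather than a full stream-processing framework; the topic and field names are made up:

```python
# Sketch: maintain a one-hour sliding window over incoming transactions and
# continuously recompute the average spend.
import json
import time
from collections import deque

from kafka import KafkaConsumer

WINDOW_SECONDS = 3600
window = deque()          # (timestamp, amount) pairs inside the last hour

consumer = KafkaConsumer(
    "transactions",                    # hypothetical topic of payment transactions
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    now = time.time()
    window.append((now, record.value["amount"]))   # "amount" is a placeholder field name

    # Drop transactions that have fallen out of the one-hour window.
    while window and window[0][0] < now - WINDOW_SECONDS:
        window.popleft()

    average_spend = sum(amount for _, amount in window) / len(window)
    print(f"hourly average spend: {average_spend:.2f}")   # e.g. feed this to fraud detection
```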
Apologies for the long answer, but I think a short answer would not do justice to the question.
Consider a queue system, like MQ, for:
Exactly-once delivery, and to participate in two-phase commit transactions
Asynchronous request/reply communication: the semantics of the communication is for one component to ask a second component to do something with its data. This is a command pattern with a delay on the response.
Recall that messages in a queue are kept until the consumer(s) have received them.
Consider a streaming system, like Kafka, as a pub/sub and persistence system for:
Publish events as immutable facts of what happened in an application
Get continuous visibility of the data Streams
Keep data once consumed, for future consumers, for replay-ability
Scale horizontally the message consumption
What are Events and Messages
There is a long history of messaging in IT systems. You can easily see an event-driven solution and events in the context of messaging systems and messages. However, there are different characteristics that are worth considering:
Messaging: Messages transport a payload and messages are persisted until consumed. Message consumers are typically directly targeted and related to the producer who cares that the message has been delivered and processed.
Events: Events are persisted as a replayable stream history. Event consumers are not tied to the producer. An event is a record of something that has happened and so can't be changed. (You can't change history.)
Now, messaging versus event streaming:
Messaging is there to support:
Transient Data: data is only stored until a consumer has processed the message, or it expires.
Request / reply most of the time.
Targeted reliable delivery: targeted to the entity that will process the request or receive the response. Reliable with transaction support.
Time-coupled producers and consumers: consumers can subscribe to the queue, but messages can be removed after a certain time or once all subscribers have received them. The coupling is still loose at the data model level and interface definition level.
Events are there to support:
Stream History: consumers are interested in historic events, not just the most recent.
Scalable Consumption: A single event is consumed by many consumers with limited impact as the number of consumers grow.
Immutable Data
Loosely coupled / decoupled producers and consumers: strong time decoupling, as consumers may come at any time. Some coupling at the message definition level, but schema management best practices and a schema registry reduce friction.
Hope this answer helps!
Basically, Kafka is a messaging framework similar to ActiveMQ or RabbitMQ. There have been some efforts to take Kafka towards streaming:
https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
Then why does Kafka come into the picture when talking about stream processing?
Stream processing frameworks differ in how the data arrives. In batch processing, you have some files stored in a file system and you want to continuously process them and store the results in some database. In stream processing, frameworks like Spark, Storm, etc. get continuous input from sources such as sensor devices and API feeds, and Kafka is used there to feed the streaming engine.
Recently, I came across a very good document that describes the usage of "stream processing" and "message processing":
https://developer.ibm.com/articles/difference-between-events-and-messages/
Taking asynchronous processing in context:
Messaging:
Consider it when there is a "request for processing", i.e. a client makes a request for the server to process.
Event streaming:
Consider it when "accessing enterprise data" i.e. components within the enterprise can emit data that describe their current state. This data does not normally contain a direct instruction for another system to complete an action. Instead, components allow other systems to gain insight into their data and status.
To facilitate this evaluation, here are some key selection criteria to consider when choosing the right technology for your solution:
Event history - Kafka
Fine-grained subscriptions - MQ
Scalable consumption - Kafka
Transactional behavior - MQ

Web log parsing for Spark Streaming

I plan to create a system where I can read web logs in real time and use Apache Spark to process them. I am planning to use Kafka to pass the logs to Spark Streaming to aggregate statistics. I am not sure whether I should do some data parsing (raw to JSON, ...), and if yes, where the appropriate place to do it is (Spark script, Kafka, somewhere else...). I would be grateful if someone could guide me. It's kind of new stuff to me. Cheers.
Apache Kafka is a distributed pub-sub messaging system. It does not provide any way to parse or transform data; it is not for that. But any Kafka consumer can process, parse, or transform the data published to Kafka and republish the transformed data to another topic or store it in a database or file system.
There are many ways to consume data from Kafka; one way is the one you suggested, real-time stream processors (Apache Flume, Apache Spark, Apache Storm, ...).
So the answer is no, Kafka does not provide any way to parse the raw data. You can transform/parse the raw data with Spark, but you can just as well write your own consumer, since Kafka clients exist for many languages, or use any other ready-made consumer such as Apache Flume or Apache Storm.
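As a rough illustration of doing the raw-to-structured parsing inside Spark, here is a sketch with PySpark Structured Streaming; the topic name and the (very simplified) log regex are assumptions you would adapt to your log format:

```python
# Sketch: read raw web log lines from Kafka and parse them into columns in Spark.
# Requires the spark-sql-kafka package on the classpath; the topic name and the
# common-log-style regex below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract

spark = SparkSession.builder.appName("weblog-parsing").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "weblogs")
    .load()
)

lines = raw.selectExpr("CAST(value AS STRING) AS line")

# Pull a few fields out of a common-log-style line: client IP, request path, status code.
parsed = lines.select(
    regexp_extract(col("line"), r"^(\S+)", 1).alias("ip"),
    regexp_extract(col("line"), r'"\S+\s+(\S+)', 1).alias("path"),
    regexp_extract(col("line"), r'"\s+(\d{3})', 1).alias("status"),
)

query = (
    parsed.groupBy("status").count()
    .writeStream.outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```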
