Is it possible to write to Memcache from a streaming Dataflow pipeline? Or do I need to write to Pub/Sub and create another Compute Engine or App Engine instance to do the writing?
Yes, the Dataflow workers can communicate with any external services that you need; they are just VMs with no special restrictions or permissions.
If you are just writing data out to Memcache, the Sink API will likely be useful.
For Redis, I created a DoFn with a Redis client.
It is possible to do some tricks if you need batch writing; see the sketch below for one approach.
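For illustration, here is a minimal sketch (not the original poster's code) of a batching write DoFn using the Beam Python SDK and the redis-py client; the host, port, batch size, and element shape are placeholder assumptions, and the same pattern would apply to a Memcache client:

    import apache_beam as beam
    import redis


    class BatchedRedisWriteFn(beam.DoFn):
        """Buffers (key, value) elements and flushes them in one round trip per batch."""

        def __init__(self, host='localhost', port=6379, batch_size=500):
            # Placeholder connection settings; adjust for your environment.
            self._host = host
            self._port = port
            self._batch_size = batch_size

        def setup(self):
            # One client per worker, reused across bundles.
            self._client = redis.Redis(host=self._host, port=self._port)

        def start_bundle(self):
            self._buffer = []

        def process(self, element):
            key, value = element
            self._buffer.append((key, value))
            if len(self._buffer) >= self._batch_size:
                self._flush()

        def finish_bundle(self):
            # Flush whatever is left at the end of the bundle.
            self._flush()

        def _flush(self):
            if not self._buffer:
                return
            pipe = self._client.pipeline()
            for key, value in self._buffer:
                pipe.set(key, value)
            pipe.execute()
            self._buffer = []

It would be applied at the end of the pipeline with something like beam.ParDo(BatchedRedisWriteFn(host='my-redis-host')).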
Can anyone explain the benefit of adopting the Google Cloud Pub/Sub service in a streaming pipeline?
I saw one of the event streaming pipeline examples showcased, and it was using Pub/Sub to ingest the event data before connecting to the Google Cloud Dataflow service to transform it. Why does it not connect to the event data directly through Dataflow?
Thanks.
Dataflow will need a source to get the data from. If you are using a streaming pipeline, you can use different options as a source, and each of them has its own characteristics that may fit your scenario.
With Pub/Sub you can easily publish events to a topic using a client library or the API directly, and it guarantees at-least-once delivery of each message.
When you connect it to a Dataflow streaming pipeline, you get a resilient architecture (Pub/Sub will keep redelivering a message until Dataflow acknowledges that it has processed it) and near real-time processing. In addition, Dataflow can use Pub/Sub metrics to scale up or down depending on the number of messages in the backlog.
Finally, the Dataflow runner uses an optimized version of the PubsubIO connector, which provides additional features. I suggest checking the documentation that describes some of these features.
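As a concrete illustration of the ingestion side, here is a minimal Beam Python sketch (the project and subscription names are placeholders) that reads from a Pub/Sub subscription in streaming mode; when run on the Dataflow runner it uses the optimized connector mentioned above:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as p:
        (
            p
            # Placeholder subscription path; Pub/Sub redelivers until acknowledged.
            | 'ReadEvents' >> beam.io.ReadFromPubSub(
                subscription='projects/my-project/subscriptions/events-sub')
            | 'Decode' >> beam.Map(lambda msg: msg.decode('utf-8'))
            # Replace with the actual transformation of the event data.
            | 'Transform' >> beam.Map(lambda event: event)
        )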
I am looking for a monitoring tool for the following use cases:
Collect basic metrics about virtual machine (cpu usage, memory usage, i/o, available space)
Extract metrics from SQL Server (probably running some queries)
Extract information from an external service about processing, i.e. how many processing jobs are currently running and for how long. I am thinking about writing Python scripts, but I don't know how to combine them with a monitoring tool.
Have the ability to plot charts and manage alerts; it would also be nice to be able to send not only emails but also messages to Slack/MS Teams.
I was thinking about Prometheus, because it has wmi_exporter, node_exporter, a SQL exporter, and Alertmanager with the possibility of sending notifications to multiple destinations, but I don't know what to do about this external service and the Python scripts.
Any suggestions?
Prometheus can definitely do what you say you need done. Some of it may not be trivial, but you can definitely fill in the blanks yourself.
E.g. you can get machine metrics basically out of the box by firing up node_exporter and having it scraped by Prometheus, but I don't think it has, for example, information on all running processes. The latter might require you to write an agent/exporter: a simple web server that exposes metrics on /metrics; there is a Python client library to help with that (see the sketch below). Or have said processes (assuming they're your code) push metrics to a Pushgateway instead, if they're short-lived batch jobs.
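For the external-service part, a minimal sketch of such an exporter with the official prometheus_client Python library might look like this (the metric names, port, and the collection call are placeholder assumptions):

    import time
    from prometheus_client import Gauge, start_http_server

    RUNNING_JOBS = Gauge('external_service_running_jobs',
                         'Number of processing jobs currently running')
    LONGEST_JOB = Gauge('external_service_longest_job_seconds',
                        'Duration of the longest-running job in seconds')

    def collect_from_external_service():
        # Placeholder: call the external service's API here and return
        # (running_count, longest_duration_seconds).
        return 3, 127.0

    if __name__ == '__main__':
        start_http_server(8000)  # exposes /metrics on port 8000
        while True:
            running, longest = collect_from_external_service()
            RUNNING_JOBS.set(running)
            LONGEST_JOB.set(longest)
            time.sleep(15)

Prometheus then scrapes this process like any other target, and the Slack/Teams notifications are handled on the Alertmanager side.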
Oh, and for charts/dashboards you probably want Grafana, as Prometheus' abilities in that area are rather limited and Grafana integrates rather well with Prometheus.
I'm using the Spring Cloud Data Flow local server and deploying 60+ streams, each with a Kafka topic and a custom sink. The memory/CPU cost is not currently scalable. I've set the Xmx to 64m for most streams.
Currently exploring my options.
Disable embedded Tomcat server. I tried this once and SCDF couldn't tell the deployment status of the stream.
Group multiple Kafka "source" topics into a single sink app. This is allowed by Kafka, but it's unclear whether SCDF will permit subscribing to multiple topics.
Switch to using Kubernetes deployer. Won't exactly reduce the memory/cpu usage but distribute it across multiple machines. Haven't pursued this option because Kubernetes isn't used in my org yet. Maybe this will force the issue.
Open to other ideas. Might also be able to tweak Kafka configs such as max.poll.records and reduce memory usage.
Thanks!
First, I'd like to clarify the differences between SCDF and Stream/Task apps in the data pipeline.
SCDF is a lightweight Spring Boot app that includes the DSL, REST-APIs, and the Dashboard. Simply put, it serves as the orchestrator to define and deploy stream and task/batch data pipelines made of stream and task applications respectively.
The actual business logic, its performance, and the underlying resource consumption are at the individual Stream/Task application level. SCDF doesn't interfere with the apps' operation, nor does it contribute to the resource load. Everything, in the end, consists of standalone Boot apps, i.e. standalone Java processes.
Now, to your exploratory steps.
Disable embedded Tomcat server. I tried this once and SCDF couldn't tell the deployment status of the stream.
SCDF is a REST server, and it requires the application container (in this case Tomcat); you cannot disable it.
Group multiple Kafka "source" topics to a single sink app. This is allowed by Kafka but unclear if SCDF will permit subscribing to multiple topics.
Again, there is no relation between SCDF and the apps. SCDF orchestrates full-blown Stream/Task applications (aka Boot apps) into a coherent data pipeline. If you have to produce to or consume from multiple Kafka topics, it is done at the application level. Check out the multi-io sample for more details.
There is also the facility to consume from multiple topics directly via a named destination. SCDF provides a DSL/UI capability to build fan-in and fan-out pipelines; refer to the docs for more details. This video could be useful, too.
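Purely to illustrate that multi-topic consumption is a client/application-level concern rather than something SCDF gates, here is a sketch with the plain kafka-python client (the broker and topic names are made up); in an actual SCDF stream app the equivalent is achieved through the app's binding configuration or named destinations, as described above:

    from kafka import KafkaConsumer

    # A single consumer (i.e. a single "sink" process) subscribed to several
    # source topics at once; Kafka itself has no objection to this.
    consumer = KafkaConsumer(
        bootstrap_servers='localhost:9092',            # placeholder broker
        group_id='my-sink-group',
        value_deserializer=lambda v: v.decode('utf-8'),
    )
    consumer.subscribe(['orders', 'payments', 'shipments'])

    for record in consumer:
        print(record.topic, record.value)              # replace with real sink logic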
Switch to using Kubernetes deployer.
SCDF's local server is generally recommended for development, primarily because there is no resiliency baked into the local-server implementation. For example, if the streaming apps crash for any reason, there is no mechanism to restart them automatically. This is exactly why we recommend either SCDF's Kubernetes or Cloud Foundry server implementations in production: the platform provides the resiliency and fault tolerance by automatically restarting the apps under fault scenarios.
From a resourcing standpoint, once again, it depends on each application. They are standalone microservice applications, each doing a specific operation at runtime, and it comes down to how many resources the business logic requires.
I would like to make a custom source & sink from a local server (files or DBs) to Dataflow directly, so I wonder whether it is possible.
If it is possible, what should I be careful about when building it?
FYI, I have never made a custom source & sink.
But I have used GCS and Dataflow once.
Dataflow's custom IO framework can read from arbitrary sources and write to arbitrary sinks. You can certainly write connectors to various types of files and databases.
However, when executing pipelines on a remote service like Google Cloud Dataflow, the workers may, depending on several factors, not be able to access services running on your local machine. Moreover, such local services may not scale well enough to yield a performant data-processing pipeline.
Thus, it might be better to move data to a cloud-based service, like Google Cloud Storage or Google BigQuery.
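For example, here is a minimal sketch of that approach (the bucket, project, dataset, and schema are placeholders): stage the files in Cloud Storage, read them with the built-in connector, and write to BigQuery, instead of reaching back to the local machine:

    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            | 'ReadFromGCS' >> beam.io.ReadFromText('gs://my-bucket/input/*.csv')
            # Replace with real parsing of each line into a row dictionary.
            | 'Parse' >> beam.Map(lambda line: {'raw': line})
            | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
                'my-project:my_dataset.my_table',
                schema='raw:STRING',
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )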
I plan to create a system where I can read web logs in real time and use Apache Spark to process them. I am planning to use Kafka to pass the logs to Spark Streaming to aggregate statistics. I am not sure whether I should do some data parsing (raw to JSON, ...), and if so, where the appropriate place to do it is (the Spark script, Kafka, somewhere else...). I would be grateful if someone could guide me. It's kind of new stuff to me. Cheers
Apache Kafka is a distributed pub-sub messaging system. It does not provide any way to parse or transform data; it is not for that. But any Kafka consumer can process, parse, or transform the data published to Kafka and republish the transformed data to another topic or store it in a database or file system.
There are many ways to consume data from Kafka; one way is the one you suggested, real-time stream processors (Apache Flume, Apache Spark, Apache Storm, ...).
So the answer is no, Kafka does not provide any way to parse the raw data. You can transform/parse the raw data with Spark, but you can also write your own consumer, as there are Kafka client ports for many languages, or use any other pre-built consumer such as Apache Flume, Apache Storm, etc.
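For example, here is a sketch of doing the parsing in Spark rather than in Kafka, using Structured Streaming's Kafka source (the broker, topic, and log format are placeholders, and the spark-sql-kafka connector package is assumed to be available):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, split

    spark = SparkSession.builder.appName('log-aggregation').getOrCreate()

    # Read the raw log lines from a Kafka topic.
    raw = (
        spark.readStream
        .format('kafka')
        .option('kafka.bootstrap.servers', 'localhost:9092')
        .option('subscribe', 'weblogs')
        .load()
    )

    # Kafka delivers the payload as bytes; the parsing happens here, in the consumer.
    lines = raw.selectExpr('CAST(value AS STRING) AS line')
    parsed = lines.select(
        split(col('line'), ' ').getItem(0).alias('ip'),     # assumes space-separated logs
        split(col('line'), ' ').getItem(6).alias('path'),
    )

    # Aggregate statistics over the parsed fields.
    counts = parsed.groupBy('path').count()

    query = (
        counts.writeStream
        .outputMode('complete')
        .format('console')   # replace with a real sink
        .start()
    )
    query.awaitTermination()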