Apache Flume vs Apache Flink difference

I need to read a stream of data from some source (in my case it's a UDP stream, but it shouldn't matter), transform each record, and write it to HDFS.
Is there any difference between using Flume or Flink for this purpose?
I know I can use Flume with the custom interceptor to transform each event.
But I am new to Flink, and it looks to me like Flink can do the same.
Which one is better to choose? Is there a difference in performance?
Please, help!

Disclaimer: I'm a committer and PMC member of Apache Flink. I do not have detailed knowledge about Apache Flume.
Moving streaming data from various sources into HDFS is one of the primary use cases for Apache Flume as far as I can tell. It is a specialized tool and I would assume it has a lot of related functionality built in. I cannot comment on Flume's performance.
Apache Flink is a platform for data stream processing and is more generic and feature-rich than Flume (e.g., support for event time, advanced windowing, high-level APIs, fault-tolerant and stateful applications, ...). You can implement and execute many different kinds of stream processing applications with Flink, including streaming analytics and CEP.
Flink features a rolling file sink to write data streams to HDFS files and lets you implement all kinds of custom behavior via user-defined functions. However, it is not a specialized tool for data ingestion into HDFS, so do not expect a lot of built-in functionality for this use case. Flink does provide very good throughput and low latency.
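To illustrate, here is a minimal sketch of such a job, assuming Flink 1.6+'s StreamingFileSink (older versions shipped a BucketingSink instead) and a hand-rolled UDP source, since Flink has no built-in one; the port, HDFS path, and uppercase transform are placeholders:

```java
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.net.DatagramPacket;
import java.net.DatagramSocket;

public class UdpToHdfsJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> records = env.addSource(new UdpSource(9999));

        StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(new Path("hdfs:///data/udp-events"),
                              new SimpleStringEncoder<String>("UTF-8"))
                .build();

        records
                .map(s -> s.toUpperCase())   // stand-in for your per-record transformation
                .addSink(sink);

        env.execute("udp-to-hdfs");
    }

    // Minimal non-parallel UDP source; note there is nothing to checkpoint,
    // since UDP has no offsets to replay from.
    static class UdpSource implements SourceFunction<String> {
        private final int port;
        private volatile boolean running = true;

        UdpSource(int port) { this.port = port; }

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            byte[] buf = new byte[65535];
            try (DatagramSocket socket = new DatagramSocket(port)) {
                while (running) {
                    DatagramPacket packet = new DatagramPacket(buf, buf.length);
                    socket.receive(packet);
                    ctx.collect(new String(packet.getData(), 0, packet.getLength(), "UTF-8"));
                }
            }
        }

        @Override
        public void cancel() { running = false; }
    }
}
```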
If you do not need more than simple record-level transformations, I'd first try to solve your use case with Flume. I would expect Flume to come with a few features that you would need to implement yourself when choosing Flink. If you expect to do more advanced stream processing in the future, Flink is definitely worth a look.

Disclaimer: I'm a committer of Apache Flume. I do not have detailed knowledge about Apache Flink.
For the use case you have described, Flume could be the right choice.
You could use the Exec Source until the netcat UDP source gets committed to the codebase.
For the transformation, it's hard to provide suggestions, but you might want to take a look at Morphline Interceptor.
Regarding the channel, I would recommend Memory Channel, because if the source is UDP, some negligible data loss should be acceptable.
Sink-wise, HDFS Sink probably covers your needs.
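If Morphline does not fit, writing the custom interceptor mentioned in the question is straightforward. A minimal sketch, with an uppercase transform standing in for your real logic and a hypothetical class name:

```java
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class UppercaseInterceptor implements Interceptor {

    @Override
    public void initialize() { /* no setup needed */ }

    @Override
    public Event intercept(Event event) {
        String body = new String(event.getBody(), StandardCharsets.UTF_8);
        event.setBody(body.toUpperCase().getBytes(StandardCharsets.UTF_8));
        return event;  // return null to drop the event instead
    }

    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> out = new ArrayList<>(events.size());
        for (Event e : events) {
            Event transformed = intercept(e);
            if (transformed != null) out.add(transformed);
        }
        return out;
    }

    @Override
    public void close() { /* no resources to release */ }

    // Flume instantiates interceptors through a Builder, which you name in the
    // agent config, e.g.:
    //   agent.sources.s1.interceptors.i1.type = com.example.UppercaseInterceptor$Builder
    public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() { return new UppercaseInterceptor(); }

        @Override
        public void configure(Context context) { /* read config keys here if needed */ }
    }
}
```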

Related

Apache Flink vs Twitter Heron?

There are a lot of questions comparing Flink vs Spark Streaming, Flink vs Storm and Storm vs Heron.
This question arises from the fact that both Apache Flink and Twitter Heron are true stream processing frameworks (not micro-batch, like Spark Streaming). Twitter decommissioned Storm last year and is using Heron instead (which is basically a reworked Storm).
There are nice presentations by Slim Baltagi on Flink and Flink vs Spark:
https://www.youtube.com/watch?v=G77m6Ou_kFA
Nice research by Ilya Ganelin on various streaming frameworks:
https://www.youtube.com/watch?v=KkjhyBLupvs
Pretty interesting thoughts on Flink vs Storm:
What is/are the main difference(s) between Flink and Storm?
But I haven't seen any comparison of new Storm/Heron vs Apache Flink.
Both projects are pretty young, and both support running previously written Storm applications, among many other things. Flink fits better into the Hadoop ecosystem, while Heron fits into Twitter's ecosystem stack.
Any thoughts?
All of the points in the referenced article comparing Apache Flink and Apache Storm also apply to Twitter's Heron. Heron provides exactly the same type of semantics and functionality as Storm. Heron is really best understood simply as a re-implementation of Storm that better fits Twitter's operational requirements.
Heron is a stream processing engine developed by Twitter and donated to Apache on 26 February 2018.
As per Twitter, its throughput is 10-14x higher than Storm's in all experiments, and its latency is 5-15x lower than Storm's.
Besides throughput and latency, it provides:
Easy debugging (every task runs in process-level isolation).
Handling of spikes and congestion (via a backpressure mechanism).
Full backward compatibility with Storm, which means only pom file changes are required (see the sketch below).
https://blog.twitter.com/engineering/en_us/a/2015/flying-faster-with-twitter-heron.html
https://apache.github.io/incubator-heron/
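To make the backward-compatibility point concrete, here is a minimal plain Storm-API topology; per Heron's claim, it should run on Heron after swapping the storm-core dependency for heron-storm in the pom (package names assume a recent Storm release; older ones used backtype.storm):

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ExclaimTopology {

    // A trivial bolt that appends "!" to each incoming word.
    public static class ExclaimBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            collector.emit(new Values(tuple.getString(0) + "!"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout(), 2);
        builder.setBolt("exclaim", new ExclaimBolt(), 2).shuffleGrouping("words");

        // In-process run for testing; use StormSubmitter (or Heron's submitter)
        // to deploy to a real cluster.
        new LocalCluster().submitTopology("exclaim", new Config(), builder.createTopology());
    }
}
```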

Couchbase or VoltDB for billion monitoring data storage and analysis?

I have a distributed monitoring system that collects and gathers monitoring data like CPU utilization, database performance metrics, and network performance into a backend store. Other applications need to consume these data, e.g. for real-time calculation (a resource scheduler), for system monitoring (a dashboard for the system administrator), and for historical analytics (operation and analyzer programs that model resource usage patterns for future capacity planning and business system activity analysis).
The dataset size is about 1.2 billion entries in the data store for 9 months. (all in OpenTSDB like format)
Previously I used an Elasticsearch cluster as the backend data store solution and have decided to find a better one.
I am looking at Couchbase and VoltDB clusters, but I am still in the investigation stage and need some input from those here who have similar experience.
Major questions are as below:
Which backend store solution is good for my scenario (Couchbase or VoltDB)?
I would have to rewrite my data aggregator code (which is in golang). Couchbase provides a good golang SDK client, but VoltDB's Go driver is only at community level with limited functionality. So are there any better implementations for communicating with VoltDB from golang?
Any suggestion or best practice on it?
There isn't too much in the way of usage patterns here, but it sounds like the kind of app people use VoltDB for.
As for the Golang client, we'd love some feedback as to how to make it more suitable if it's specifically missing something you need. You can also use the HTTP/JSON query interface from any language, including Golang. More info on that here:
http://docs.voltdb.com/UsingVoltDB/ProgLangJson.php
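As a sketch of that HTTP/JSON route (in Java here for brevity; the same single-request pattern ports directly to Go's net/http), with a hypothetical procedure name and the default port from the linked docs:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class VoltDbJsonQuery {
    public static void main(String[] args) throws Exception {
        // Invoke a (hypothetical) stored procedure "SelectMetrics" via VoltDB's
        // JSON interface. Format per the linked docs:
        //   /api/1.0/?Procedure=<name>&Parameters=<json-array>
        String params = URLEncoder.encode("[\"cpu.util\", 3600]", "UTF-8");
        URL url = new URL("http://localhost:8080/api/1.0/"
                + "?Procedure=SelectMetrics&Parameters=" + params);

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            // The response is a JSON document containing status and result tables.
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```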
If you would like to leverage your existing model, take a look at Axibase Time-Series Database (ATSD). It supports both the tcollector network protocol and HTTP. A rule engine and visualization are built in.
The fact that ATSD is based on HBase may be an asset or a liability depending on your prior experience with it :)
URL to tcollector integration: http://axibase.com/products/axibase-time-series-database/writing-data/tcollector/
Disclosure: I work for the company developing ATSD.

spark streaming to neo4j

I need to input Spark Streaming output to Neo4j as a graph in real time. Is there any way to do that? If so, can you share some example code? I have seen Mazerunner, but it only inputs graph data from Neo4j to Spark GraphX. Thank you.
Mazerunner also writes data back.
The easiest approach would be to use a Neo4j client to connect to the Neo4j server and write data back concurrently. Neo4j 2.2+ can sustain a (quite) high concurrent write load.
For Scala you can use AnormCypher, and for Python, py2neo.
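To make the write-back pattern concrete, here is a minimal Java sketch of the usual foreachRDD/foreachPartition shape (the same shape applies from Scala or Python); it assumes the Neo4j Bolt Java driver, which postdates this answer, and placeholder host, credentials, and Cypher:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class StreamToNeo4j {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("stream-to-neo4j").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Placeholder input; substitute your real stream (Kafka, socket, ...).
        JavaDStream<String> names = ssc.socketTextStream("localhost", 9999);

        names.foreachRDD(rdd ->
            rdd.foreachPartition(partition -> {
                // Create the driver inside the partition: it is not serializable,
                // so it must not be captured from the driver program.
                try (Driver driver = GraphDatabase.driver(
                        "bolt://localhost:7687", AuthTokens.basic("neo4j", "secret"));
                     Session session = driver.session()) {
                    while (partition.hasNext()) {
                        session.run("MERGE (p:Person {name: $name})",
                                Values.parameters("name", partition.next()));
                    }
                }
            })
        );

        ssc.start();
        ssc.awaitTermination();
    }
}
```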
I'm currently looking into spark integration for Neo4j, so it would help a lot if you could detail your use-case a bit. E.g. do you use plain Spark (RDD / DStream) or GraphX?

Moses - Online Integration

We're looking to integrate Moses into our localization workflow. Our application is in Java, and we're looking at using Moses' functionality via XML-RPC calls.
Specifically, we're looking at APIs for:
Incremental training (i.e. avoid having to retrain the model from scratch every time we wish to use some new training data),
Domain-specific training (i.e. it should maintain separate phrase tables for each domain that the input data belongs to), and
Decoding.
The tutorial says that these can be achieved via XML-RPC calls, but I don't find any examples or clear ways to do them. Can someone please provide some examples?
Also, I would like to know if the training and decoding phases can be done in a distributed manner.
Thanks!
This question is perfectly suitable for the Moses mailing list:
http://www.statmt.org/moses/?n=Moses.MailingLists
Moses server documentation (via XML-RPC):
http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc28
However, I have had better experiences with moses/contrib/web/bin/daemon.pl, which also provides a server, and you communicate with it via a TCP stream.
General examples are harder to find (everyone has a different environment, ...), but make your question more specific and send it to the Moses mailing list. (E.g. someone had a problem with server installation: http://comments.gmane.org/gmane.comp.nlp.moses.user/7242 )
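For the XML-RPC route, here is a minimal decoding call sketched with the Apache XML-RPC client; the "translate" method and "text" field follow the mosesserver sample clients, so verify them against your build:

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import org.apache.xmlrpc.client.XmlRpcClient;
import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

public class MosesClient {
    public static void main(String[] args) throws Exception {
        XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
        // mosesserver's RPC endpoint; adjust host and port to your setup.
        config.setServerURL(new URL("http://localhost:8080/RPC2"));

        XmlRpcClient client = new XmlRpcClient();
        client.setConfig(config);

        // The decoder expects a struct with a "text" member holding the source sentence.
        Map<String, Object> params = new HashMap<>();
        params.put("text", "das ist ein kleines haus");

        @SuppressWarnings("unchecked")
        Map<String, Object> result = (Map<String, Object>)
                client.execute("translate", new Object[] { params });

        System.out.println(result.get("text"));  // the decoded translation
    }
}
```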

What elements are needed to implement a remote, event driven system? - overview needed

I am trying to design an event driven system where the elements of the system communicate by generating events that are responded to by other components of the system. It is intended that the components be independent of each other - or as largely independent as I can make them. The system will initially be implemented on Windows 7, and is being written in Delphi. The events will be generated by the Delphi code. I understand how to implement a system of the type described on a single machine.
I wish to design the system so that it can readily be deployed on different machine architectures in particular with different components running on a distributed architecture, which may well be different to Windows 7. There is no requirement for the system ever to communicate with any systems external to itself.
I have tried investigating the architecture I need to consider and have looked at the questions mentioned below. These seem to point towards utilising named pipes as a mechanism for inter-hardware communications. As a result of these investigations I have sketched out the following to describe my system - the first part of the diagram is the system as I am developing it; the second part what I have deduced I would need for possible future implementations.
This leads to the following points:
Can you pass events via named pipes?
Is this an appropriate and sensible structure to tackle this problem?
Are there better alternatives?
What have I forgotten (at this level of granularity)?
How is event driven programming implemented?
How do I send a string from one instance of my Delphi program to another?
EDIT:
I had not given the points arising from "#I give crap answers"'s response sufficient consideration. My initial responses to his points are:
Synchronous v Asynchronous - mostly asynchronous
Events will always be in a FIFO queue.
Connection loss is not terribly important - I can afford to deal with this non-rigorously.
Unbounded queues are a perfectly good way of dealing with the events passed (if they can be passed) - there is no expectation of a large volume of event generation.
For maximum deployment flexibility (operating-system independence), I recommend taking a look at popular open source message brokers which run on the Java platform. Using standard protocols, they integrate well with Delphi and other programming languages, can be used with web applications, and have a large installed user base and an active community.
They are quite easy to install and configure in a few minutes, and free / commercial clients for Delphi are available.
Some examples are:
Apache ActiveMQ
OpenMQ
JBoss HornetQ
I also recommend the book "Enterprise Integration Patterns" by Gregor Hohpe and Bobby Woolf as an overview and introduction, with many simple recipes for handling specific problems.
Note that I am a developer of commercial Delphi clients for enterprise messaging systems, such as xmlBlaster, RabbitMQ, Amazon Simple Queue Service and the three brokers mentioned above.
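As a sense of scale for the broker approach, here is a minimal JMS publisher against ActiveMQ (in Java, the brokers' native API; the Delphi clients expose the same connection/session/producer concepts over standard protocols such as STOMP); the broker URL and topic name are illustrative:

```java
import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Topic;

import org.apache.activemq.ActiveMQConnectionFactory;

public class EventPublisher {
    public static void main(String[] args) throws Exception {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        try {
            // Publish/subscribe: every component subscribed to the topic receives
            // the event, so the components stay independent of each other.
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Topic topic = session.createTopic("system.events");   // illustrative name
            MessageProducer producer = session.createProducer(topic);

            TextMessage event = session.createTextMessage("sensor-42: threshold exceeded");
            producer.send(event);
        } finally {
            connection.close();
        }
    }
}
```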
I can only answer your point 4 here: You have not yet decided whether an event is synchronous or asynchronous. In the async case, you have to decide what to do when messages arrive. Do you have a queue? How big is the queue? Can one grab arbitrary elements in the queue, or is it strictly FIFO? What happens if a message is lost (somebody axes the network cable)?
In the sync variant, the advantage is that you get delivery guarantees, but then what do you do when connections are suddenly lost?
Connection loss is going to be a problem. The more machines you have, the greater the chance that it will occur. Decide how you will handle that.
Another trouble may be what you do if you have one large event and several small ones. Is the order of transfer FIFO or smallest-first? Can events be reordered? What are the assumptions here?
As an aside, I hack Erlang a lot. In Erlang, all the event handling is already solved, but that also means a specific model is chosen for you (async, unbounded queues, no guaranteed delivery, but detection of connection loss).
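Given the choices in the EDIT (asynchronous, strictly FIFO, unbounded queues), here is a minimal sketch of that model in Java terms, for illustration only; a LinkedBlockingQueue gives exactly FIFO ordering with no capacity bound, and the event type and handler are placeholders:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class EventLoop {
    // Unbounded FIFO queue: producers never block, and arrival order is preserved.
    private final BlockingQueue<String> events = new LinkedBlockingQueue<>();

    // Any component can post an event without waiting for the consumer (async).
    public void post(String event) {
        events.add(event);
    }

    // A single consumer thread dispatches events strictly in arrival order.
    public void start() {
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String event = events.take();  // blocks until an event arrives
                    handle(event);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();  // shut down cleanly
            }
        });
        consumer.setDaemon(true);
        consumer.start();
    }

    private void handle(String event) {
        System.out.println("handling: " + event);  // stand-in for real dispatch
    }

    public static void main(String[] args) throws Exception {
        EventLoop loop = new EventLoop();
        loop.start();
        loop.post("component A started");
        loop.post("component B: data ready");
        Thread.sleep(100);  // let the daemon consumer drain the queue before exit
    }
}
```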
I suggest looking at RabbitMQ, http://www.rabbitmq.com/. It has a server and clients; you just need some wrapper code in Delphi and you are ready to build your business logic.
Cheers
This is probably just an application for a message queue.
http://msdn.microsoft.com/en-us/library/ms632590(v=vs.85).aspx
