What would Kafka do if the producer goes down? - twitter

I'm a bit confused about Kafka's architecture. We would like to capture data from the Twitter Streaming API, and we came across this Twitter producer: https://github.com/NFLabs/kafka-twitter/blob/master/src/main/java/com/nflabs/peloton2/kafka/producer/TwitterProducer.java
What I'm thinking about is how to design the system so that it's fault-tolerant.
If the producer goes down, does that mean we lose some of the data? How can we prevent this from happening?

If the producer you linked to stops running, new data from the Twitter API will not make its way into Kafka. I'm not sure how the Twitter Streaming API works, but it may be possible to get historical data, allowing you to fetch everything back to the point when the producer failed.
Another option is to use Kafka Connect, which is a distributed, fault-tolerant service for connecting data sources and sinks to Kafka. Connect exposes a higher-level API and uses the out-of-the-box producer/consumer API behind the scenes. The documentation explains Connect very thoroughly, so give that a read and go from there.
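If the upstream source does let you request records newer than a given ID (an assumption; the answer above is unsure whether the Twitter API allows it), one way to limit loss across producer restarts is to checkpoint the last ID that was handed to Kafka. The following is only a minimal sketch of that idea: `fetch_since` and `send` are hypothetical callables standing in for the source client and a Kafka producer, not real library APIs.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the last forwarded record ID, or None if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)["last_id"]
    except FileNotFoundError:
        return None


def save_checkpoint(path, last_id):
    """Persist the last forwarded record ID via write-then-rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"last_id": last_id}, f)
    os.replace(tmp, path)  # replaces the old checkpoint in one step


def run_once(path, fetch_since, send):
    """Forward every record newer than the checkpoint, then advance it.

    fetch_since(last_id) -> iterable of (record_id, payload)   [hypothetical]
    send(payload)        -> hands the payload to Kafka          [hypothetical]
    """
    last_id = load_checkpoint(path)
    for record_id, payload in fetch_since(last_id):
        send(payload)
        save_checkpoint(path, record_id)  # checkpoint only after a successful send
        last_id = record_id
    return last_id
```

Because the checkpoint is written after the send, a crash between the two steps replays one record on restart, so this sketch gives at-least-once rather than exactly-once delivery.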

Related

Why and when do I need an MQTT broker for an IoT/M2M application?

Just asking one silly question, hope someone can answer this.
I'm a bit confused about the MQTT broker. There are so many tools used for storing, transferring and processing data (Flume, HDInsight, Spark, etc.). So when and why would I need an MQTT broker?
If I would like to use a Windows 10 IoT application with HiveMQ, where can I find details on how to use it? How do I benefit from this MQTT broker? Can't I send data from my IoT application directly to Azure or HDFS? How does an MQTT broker fit in, and what does it help me achieve?
I'm new to all of this and have tried to find tutorials, but I haven't found anything suitable. Could you explain this in more detail or point me to some tutorials?
MQTT is a client-server protocol for pub-sub based transport that has comparatively small overhead, which makes it well suited to mobile and IoT applications (unlike Flume, etc.). The MQTT broker is basically a server that handles messaging to and from MQTT clients and among them. The functionality pretty much stops at the transport layer, even though various MQTT add-ons exist.
If you are looking to implement a solution that reliably transfers data from your IoT devices to the back-end system for processing, I would suggest you take a look at the Kaa open-source IoT platform. It goes much further than MQTT by providing not only the transport layer, suitable for low-power IoT devices, but also a solid chunk of the application-level logic (including the object bindings for your application-level data structures, temporary data persistence, etc.).
Here is a link to a webinar that explains how to build a scalable IoT analytics system with Kaa and Spark in less than an hour.
This is an architectural choice. IoT applications are possible without MQTT but there are some advantages when using MQTT. If you are completely new to MQTT, take a look at this in-depth MQTT series: http://forkbomb-blog.de/2015/all-you-need-to-know-about-mqtt
Basically the main architectural advantage is publish / subscribe designed for low-latency, high throughput (mobile) communication with minimal protocol overhead (which is important if bandwidth is at a premium). You can completely decouple consumers and producers.
HDFS is the (distributed) Hadoop file system and is the foundation for MapReduce processing. It is not comparable to an MQTT broker. The MQTT broker could write to HDFS, though (in the case of HiveMQ, with a custom plugin).
Basically MQTT is a protocol while the products you are mentioning are, well, products which solve completely different problems:
Flume is basically used for log aggregation at scale. You won't use MQTT for that, at least there is not too much advantage because this is typically done in backend applications.
Spark and Hadoop shine at Big Data crunching. They are frameworks, not ready-to-use solutions, and they are not really comparable to MQTT. Often MQTT brokers like HiveMQ are used in conjunction with them: Spark / Hadoop for data processing and HiveMQ for communication.
I hope this helps you get started. It would be best to read about typical use cases of all these technologies; this is a bit too broad for a single SO answer.
MQTT is a data transport, so the usual thing I compare it with is HTTP. HTTP has two important characteristics: a) it goes from one point to another, and b) it is request/response, so only one end can start a data transfer. MQTT connects many end points to many end points, and either end can start a data transfer. So, if you have just one device and only one service or person that will ever access it, and only by polling, then HTTP is great. MQTT means many devices can post data to many services or people, AND the other way around. Your question assumes that your data is always going to end up in some sort of data store, but many interactions are about events and responding to them immediately, like ringing a doorbell or lowering the landing gear. In these cases you will often want both to record the data and to have an immediate action occur, like your phone making a doorbell noise.
Finally, you send data to MQTT semantically, rather than by IP address.
This means that your service subscribes to /mikeshouse/doorbell rather than polling 192.168.22.4, which is a huge gain once you have a number of devices.
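To make the "subscribe by topic, not by address" point concrete, here is a toy in-process sketch. It is not a real MQTT client or broker: `TinyBroker` and `topic_matches` are hypothetical names, and only the standard `+` (single-level) and `#` (multi-level) topic wildcards are modelled.

```python
from collections import defaultdict


def topic_matches(topic_filter, topic):
    """Return True if an MQTT-style topic filter matches a topic name.

    '+' matches exactly one level; '#' matches all remaining levels.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            return True  # multi-level wildcard: everything from here on matches
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


class TinyBroker:
    """Toy in-process broker: routes messages by topic, not by address."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic_filter, callback):
        self._subscribers[topic_filter].append(callback)

    def publish(self, topic, payload):
        # The publisher never learns who, if anyone, receives the message.
        for topic_filter, callbacks in self._subscribers.items():
            if topic_matches(topic_filter, topic):
                for callback in callbacks:
                    callback(topic, payload)
```

A logging service could subscribe to /mikeshouse/# while a phone app subscribes only to /mikeshouse/doorbell; the doorbell publishes once and neither side needs to know the other's address.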

How do I retrieve data from statsd?

I'm looking over their documentation here:
http://www.rubydoc.info/github/github/statsd-ruby/Statsd
There are methods for recording data, but I can't seem to find anything about retrieving recorded data. I'm adopting a project with an existing statsd integration. Its host is likely a defunct URL. Is the host where those stats are recorded?
The statsd server implementations that Mircea links just take care of receiving, aggregating metrics and publishing them to a backend service. Etsy's statsd definition (bold is mine):
A network daemon that runs on the Node.js platform and listens for
statistics, like counters and timers, sent over UDP or TCP and sends
aggregates to one or more pluggable backend services (e.g.,
Graphite).
To retrieve the recorded data you have to query the backend. Check the list of available backends. The most common one is Graphite.
See also this question: How does StatsD store its data?
There are 2 parts to statsd: a client and a server.
What you're looking at is the client part. You will not see functionality related to retrieving the data as it's not there - it normally is on the server side.
Here is a list of statsd server implementations:
http://www.joemiller.me/2011/09/21/list-of-statsd-server-implementations/
Research and pick one that fits your needs.
Statsd originally started at etsy: https://github.com/etsy/statsd/wiki
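The reason a client library has no "read" methods becomes clearer when you look at what it sends. The sketch below assumes the usual statsd line format (`name:value|type`, with an optional `|@sample_rate`) and the conventional UDP port 8125; the function names are hypothetical and this is not the statsd-ruby implementation.

```python
import socket


def format_metric(name, value, metric_type, sample_rate=None):
    """Build a statsd line: <name>:<value>|<type>[|@<sample_rate>]."""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget a metric over UDP; nothing is ever read back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

The client only writes datagrams to the configured host, so that host is where the statsd server would be listening; the aggregated values then live in whatever backend that server flushes to.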

Building a webRTC application with Ruby on Rails Backend

I want to implement a peer-to-peer video chat feature for a web application I am currently developing. After doing my research, I've decided that using webRTC's Javascript APIs is the way to go. The application uses AngularJS in the front end and Ruby on Rails in the back end. The main issue I'm encountering while conceptualizing this application is linking the front end with the backend, and creating and maintaining the connection between user streams.
For the signaling aspect of the network, I want to use ActionController::Live and the Ruby gem em-eventsource to push live messages from the server to users and indicate which of their connections are online. Then, when they are ready to make a connection, they will create a custom room and the URL will be sent to the user they wish to connect with, creating their offer. Once that user clicks on the link sent to them, they send back their answer. When the user responds, the ICE candidate process will begin for each of the users. Do you think that this is a sufficient signaling channel to set up the PeerConnection? What other major players am I missing?
From the research that I have done about WebRTC's RTCPeerConnection, once the initial connection is set up, and both users have public IP addresses corresponding to their stream, the connection is sustained through RTCPeerConnection, more specifically getPeerConnection(). Am I wrong? Are there other factors that I am not considering?
WebRTC makes the process of creating MediaStreams very simple with its getUserMedia method. Once these streams are created, they can be added to the RTCPeerConnection that was established, both as local and remote streams.
If you have any other suggestions for me, please let me know. I want to create this feature using WebRTC; it seems like so much fun.
There are certainly many ways to handle the call signaling, so I'm not going to comment specifically on your approach. I will say that if you plan on supporting ICE trickling, the ICE candidates will start flowing very early in the process, so you really need an open signaling channel between your peers almost immediately when trying to connect to a peer.
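As a rough illustration of why the channel has to be open early, here is a toy relay that forwards offer, answer and candidate messages and buffers anything addressed to a peer that has not connected yet. `SignalingRoom` is a hypothetical name and the buffering policy is an assumption of this sketch, not part of WebRTC or of any particular signaling server.

```python
from collections import defaultdict, deque


class SignalingRoom:
    """Toy relay: forwards signaling messages and buffers them for absent peers."""

    def __init__(self):
        self._handlers = {}                 # peer_id -> callable(message)
        self._pending = defaultdict(deque)  # peer_id -> messages awaiting delivery

    def join(self, peer_id, on_message):
        """Register a peer and flush anything sent before it connected."""
        self._handlers[peer_id] = on_message
        pending = self._pending.pop(peer_id, deque())
        while pending:
            on_message(pending.popleft())

    def send(self, sender, recipient, kind, data):
        """Relay an 'offer', 'answer', or 'candidate' message to the recipient."""
        message = {"from": sender, "type": kind, "data": data}
        handler = self._handlers.get(recipient)
        if handler is None:
            self._pending[recipient].append(message)  # keep order for late joiners
        else:
            handler(message)
```

With trickled candidates the caller may emit several candidate messages right after the offer, so a relay that drops messages for a not-yet-connected peer would lose them; buffering in order is one simple way around that.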
We developed our solution for WebSphere on top of MQTT which is an open, and very simple pub/sub protocol. You can use any open MQTT broker with the protocol and there are a number of open source components available to make WebRTC development extremely easy including an AngularJS WebRTC module (angular-rtcomm), a core pure JavaScript module and much more. We also released a simple JSON based protocol as part of this open source solution. You can take a look at the signaling protocol. You can also read more details about the overall solution here (www.wasdev.net/webrtc). Here you'll find the base JavaScript libraries as well as a number of open source sample solutions. All of these can be forked on github.
In general you want to build your signaling on a protocol that will allow you to grow over time. It should work well for both web and mobile apps. In our experience it took a lot of time to get all this to work well, and our goal was not only to support peer-to-peer calls but also to support media resources like Dialogic's XMS PowerMedia server on the backend for multiway calls, record/playback and more. We also needed to support federation via SIP trunking, so we wanted to make sure the protocol could be easily translated to SIP signaling while also supporting transcoding between media codecs like VP8 and H.264.
Note that if you're looking to only support peer-to-peer calling between WebRTC clients you can do that with these rtcomm open source components only, including an open MQTT broker and save yourself a ton of time. You can literally get something up and running in a matter of hours. The developer version of the WebSphere Liberty beta with the new rtcomm-1.0 service enabled also includes a built in MQTT broker and supports the open WebRTC signaling protocol linked above. You can use WebSphere for development and deploy a single server of this in production for free. You can also use Ruby on Rails with Liberty as well if you'd like.
Even if you decide not to use Liberty, you can use all the open source components along with something like Mosquitto (which is an open source MQTT broker) to get a solution off the ground quickly. There are also a number of MQTT clients available for many different programming languages, including JavaScript, Java, etc. Check out https://eclipse.org/paho/. If you decide to build your own signaling protocol, you might still find these open source components helpful to see how we approached integration with the WebRTC PeerConnection.

How to stress-test HTTP Live Streaming

We built a YouTube-like Rails application that serves videos using HTTP Live Streaming. The videos are hosted on our company's S3-like cloud service (actually the Ceph Object Gateway S3 API).
It's the first public application on that storage service, and we would like to know beforehand how many concurrent viewers it can handle.
We know that the network connection (10 Gbps) will become the bottleneck at a certain stage, but we have no idea how much load the actual storage cloud service is able to handle.
How would you stress-test the HTTP Live Streaming?
Is something similar to this (UDP) suggestion an option in this (TCP) case?
You can use either a JMeter SaaS or cloud servers to overcome the network issue, and for JMeter you can use this commercial plugin, which realistically simulates player behaviour and provides useful metrics:
http://www.ubik-ingenierie.com/blog/easy-and-realistic-load-testing-of-http-live-stream-hls-with-apache-jmeter/
Metrics provided by the plugin are:
Buffer fill time (time it took to start playing)
Lag time (how many seconds playback was paused)
Lag ratio (waiting time over watching time)
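To show how the three metrics relate to each other, here is a small sketch that computes them for one simulated viewer session. The function name, the inputs, and the exact definitions are assumptions based on the one-line descriptions above; this is not the plugin's implementation.

```python
def playback_metrics(request_time, first_frame_time, stalls, watched_seconds):
    """Summarise one simulated viewer session.

    request_time / first_frame_time: timestamps in seconds
    stalls: list of stall durations in seconds (playback paused to rebuffer)
    watched_seconds: seconds of video actually played
    """
    buffer_fill_time = first_frame_time - request_time
    lag_time = sum(stalls)
    # Lag ratio: time spent waiting relative to time spent watching.
    lag_ratio = lag_time / watched_seconds if watched_seconds else float("inf")
    return {
        "buffer_fill_time": buffer_fill_time,
        "lag_time": lag_time,
        "lag_ratio": lag_ratio,
    }
```

A session that starts two seconds after the request and stalls for three seconds during a minute of playback would have a lag ratio of three seconds over sixty.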
Disclaimer: We are behind the development of this solution.
If you're testing HTTP streams you might be able to test it using JMeter though you'd probably need a hosted JMeter solution to create enough traffic.
I'm not sure if you'd be able to get any helpful response time info, but you would at least be able to easily create and ramp up the load.
Let me know if you need help with the JMeter side.
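Before generating any load, it can help to estimate the ceiling that the 10 Gbps link mentioned in the question puts on concurrent viewers, so you know how much traffic the test needs to produce. This is a back-of-the-envelope sketch only: the per-viewer bitrate and the headroom percentage are assumed example values, not figures from the question.

```python
def max_viewers(link_bps, stream_bps, headroom_percent=0):
    """Upper bound on concurrent viewers a link can carry.

    link_bps: link capacity in bits per second
    stream_bps: average bitrate of one viewer's stream in bits per second
    headroom_percent: share of the link reserved for overhead and other traffic
    """
    usable_bps = link_bps * (100 - headroom_percent) // 100
    return usable_bps // stream_bps


# Assumed example values: a 2 Mbps stream and 20% headroom.
LINK_BPS = 10_000_000_000
STREAM_BPS = 2_000_000
print(max_viewers(LINK_BPS, STREAM_BPS, headroom_percent=20))
```

If the storage service saturates well below this number during the test, the storage backend rather than the network is the limiting factor.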

Implementing a sync feature similar to Evernote

The type of content isn't really important for this question, but let's just say I wanted to implement a (native mobile) shopping list app that allowed multiple users to collaborate on a shared list.
How are sync features like this usually implemented that work automatically (without explicit user interaction)? Is the preferred way to pull every few seconds to check for newer versions and update if necessary, or is it possible to push changes?
A polling solution would be (relatively) easy to implement I guess using something like AWS, Google App Engine or even from scratch on a LAMP stack and REST. But I'm worried about traffic resulting from continuous polling.
Would it be practical to try to implement this using push updates? If so, what technologies, services or design principles should I look into? Is something like this possible with AWS or Google App Engine? Or is pulling (and reducing traffic as much as possible) the way to go?
On App Engine, you should look into the Channel API. From the overview:
The Channel API creates a persistent connection between your application and Google servers, allowing your application to send messages to JavaScript clients in real time without the use of polling. This is useful for applications that are designed to update the user about new information immediately or where user input is immediately broadcast to other users. Some examples include collaborative applications, multi-player games, and chat rooms. In general, using Channel API is a better choice than polling in situations where updates can't be predicted or scripted, such as when relaying information between human users or from events not generated systematically.
You can combine a few Amazon Web Services offerings to create an effective and responsive service.
If you check out the iOS SDK, which you can download from the AWS site, you will find an example of an app that uses such services: S3_SimpleDB_SNS_SQS_Demo
First you can use SQS, which is the queueing service, which has long polling that will help you to lower the number of requests.
Second you can use SNS, which is the notification (pub/sub) service. It is integrated with SQS, and you can subscribe queues to listen to notifications.
These services (and others) are accessible through the iOS SDK, as well as with other SDKs (Java, .NET, Android...) and REST and SOAP APIs.
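Long polling sits between plain polling and true push: the client asks "anything newer than version N?" and the server holds the request open until there is, or until a timeout. The sketch below shows that idea in-process with a version counter; `SharedList` is a hypothetical name and this is not the SQS API.

```python
import threading


class SharedList:
    """Toy server-side list with a version counter and long-poll support."""

    def __init__(self):
        self._items = []
        self._version = 0
        self._changed = threading.Condition()

    def add(self, item):
        with self._changed:
            self._items.append(item)
            self._version += 1
            self._changed.notify_all()  # wake every waiting client

    def wait_for_changes(self, known_version, timeout):
        """Block until the list is newer than known_version or timeout expires.

        Returns (version, items); the version is unchanged on timeout.
        """
        with self._changed:
            self._changed.wait_for(lambda: self._version > known_version, timeout)
            return self._version, list(self._items)
```

A client that already has version 3 calls `wait_for_changes(3, 20)` in a loop: it gets an answer as soon as another user edits the list, and otherwise makes only one request every 20 seconds instead of one every few seconds.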
