Does anyone know whether Liaison's ECS product can aggregate messages based on filter criteria and process them once all of the messages have been received?
I need to listen for flat-file data messages, read the order number in each flat file along with an integer that indicates how many flat files will be generated for that order number, and then, once all of the flat files have been received, map them into a single outbound EDI transaction message (one outbound message per order number).
This is a standard aggregation pattern; please note that I am not asking about EDI batching, which is something different.
Is this something that can be done using ECS functionality, or does an external batching system need to be created?
I'm looking for a way to get a list of all topics known to a broker. There are some quite similar questions, but they didn't help me figure it out for my use case.
I've got three Raspberry Pis with multiple sensors (temperature, humidity) which are connected over an MQTT network. Every Pi has its own database containing time series of measurements and other system variables (like CPU load).
Now I'm looking for a way to handle the following scenario:
I want to monitor my system and detect anomalies. For that I want to get all sensor time series from the last x seconds and process them in a Python script. The monitoring calculations could run on any of the Pis.
Example: I'm on RPi2 and want to monitor the whole distributed network. There's no prior knowledge about which sensors are attached to which Pi. From my Python script running on RPi2 I would initialise an MQTT client and subscribe to all sensor data on the broker. I know about the wildcard #, but I'm not sure how to use it in this case. My magic command would look like the following pseudo code:
1) client subscribe to all sensor data - #/sensor/#
2) get list with all topics
3) client subscribe to all topics from given list list/#
4) analyse data for anomalies every x seconds
First, your wildcard topic patterns are not valid. A topic pattern can only contain a single '#' character, and it can only appear at the end of the pattern, e.g. foo/bar/# is valid but #/foo is not. You can also use the + character, which is a single-level wildcard.
This means a topic pattern of +/sensor/# will match each of the following:
rpi1/sensor/foo
rpi1/sensor/bar/temp
but not
rpi1/foo/sensor/bar
Next, brokers do not keep a list of topics that exist. A topic only really exists at the instant a message is published to it; the broker then compares that topic against the subscription patterns clients have registered and delivers the message to the clients that match.
Thirdly, when bridging brokers in loops like that, you have to be very careful with the bridge filters to make sure that messages don't end up in an endless loop.
The solution is probably to designate a "master" broker, bridge all the others one way to that broker, and then have the client subscribe either to '#' to get everything or to something more specific like '+/sensor/#' to see just the sensor readings.
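For illustration, a minimal sketch of such a monitoring client with the Python paho-mqtt package (1.x style API; the master broker address and the <pi-id>/sensor/<reading> topic layout are assumptions) could look like this:

    # Sketch: subscribe to all sensor readings on the "master" broker.
    # Assumes a <pi-id>/sensor/<reading> topic layout and paho-mqtt 1.x.
    import paho.mqtt.client as mqtt

    def on_connect(client, userdata, flags, rc):
        # '+' matches exactly one level (the Pi id), '#' matches everything after 'sensor/'.
        client.subscribe("+/sensor/#")

    def on_message(client, userdata, msg):
        # msg.topic tells you which Pi and which sensor the reading came from.
        print(msg.topic, msg.payload.decode())

    client = mqtt.Client()
    client.on_connect = on_connect
    client.on_message = on_message
    client.connect("master-broker.local", 1883)  # hypothetical master broker address
    client.loop_forever()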
One of the new features of MQTT 5 is shared subscriptions, which allow client-side load balancing between multiple workers, so that multiple workers can be responsible for handling messages, but every message is only ever delivered to a single one of them.
By default, this works with a round-robin approach, but I am in the need of a slightly more advanced scenario:
What I want is some kind of routing, so that one of the messages' properties gets used as some kind of routing key. I.e., I want multiple workers to be responsible for the messages, but all messages with value X in their routing key property should always go to the same worker, and all messages with Y should do as well. The workers for X and Y may be different, but all messages with X should always go to the same one.
Question 1: Is this even possible with MQTT 5? If so, what is the term I need to look for? I tried googling for this, but wasn't really successful (mainly, I guess, because I don't know exactly what to look for).
Now, supposing this is possible: how can I then handle cases where nodes join or leave? I still want only a single node to be responsible, so it would be great if the assignment were not static, but could be adjusted dynamically (or, even better, would adjust itself automatically). However, what I strictly need to avoid is that two messages with X ever go to different workers at the same time.
Question 2: Supposing this is not possible, what alternatives to MQTT 5 do I have?
You don't, at least not at the protocol level. The whole point of a shared subscription is to distribute the incoming messages evenly across all the subscribers.
This also goes against the pub/sub paradigm, in which messages are published to a topic, not to an individual subscriber.
If you want to route messages differently, publish them to different topics. There is nothing stopping you from republishing a message on a separate topic based on its metadata once it has been received by a client, if needed.
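As an illustration of that republishing idea, here is a hedged sketch using the Python paho-mqtt client (1.x style API) with MQTT v5; the "routing-key" user property, the topic names, and the shared-subscription group are all hypothetical:

    # Sketch: consume from a shared subscription and republish onto a per-key topic,
    # so that only the worker subscribed to work/<key> ever sees messages for that key.
    import paho.mqtt.client as mqtt

    def on_message(client, userdata, msg):
        # Read the routing key from the MQTT 5 user properties, if present.
        user_props = dict(getattr(msg.properties, "UserProperty", []) or [])
        key = user_props.get("routing-key", "default")
        client.publish("work/" + key, msg.payload, qos=1)

    client = mqtt.Client(client_id="router-1", protocol=mqtt.MQTTv5)
    client.on_message = on_message
    client.connect("broker.local", 1883)  # hypothetical broker address
    client.subscribe("$share/routers/ingest/#", qos=1)  # shared subscription group "routers"
    client.loop_forever()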
I would like to hear your insights about an IoT data ingestion case. In AWS IoT, thing shadows are virtual representations of physical things. What I understood from the figure below is that whenever a thing sends data to the platform via the message broker, the thing shadow and rules engine portions get the same sensor data concurrently and process it.
Are my conclusions correct?
The thing shadow system is subscribed to the message broker, gets the sensor data, and updates the shadow actors. The shadow side is also responsible for storing sensor data, acting as an event-sourcing mechanism.
The thing shadow system does not perform any rules; it just performs event sourcing and keeps the last known state in virtual thing actors.
The same sensor data is also inbound data to the rules engine. The rules engine is just an ECA (event-condition-action) type system that handles streaming data and decides what to do with it. This means every incoming data point will eventually be processed by the rules engine.
Below are my comments on your conclusions.
"What I understood from the figure below is that whenever a thing sends data to the platform via the message broker, the thing shadow and rules engine portions get the same sensor data concurrently and process it."
Changes in the thing shadow can trigger an action registered in the rules engine. There are specific topics associated with a thing shadow that you can subscribe the rules engine to, in order to perform one or more actions in response.
"The thing shadow system is subscribed to the message broker, gets the sensor data, and updates the shadow actors. The shadow side is also responsible for storing sensor data, acting as an event-sourcing mechanism."
You can update the device shadow by using the REST API or by publishing on dedicated MQTT shadow topics. The shadow does not constitute an event-sourcing system by itself; it is a representation of the data model associated with a physical device, as you said.
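For illustration, a reported-state update through the API could look like this minimal boto3 sketch (region and credentials are assumed to be configured; the thing name and payload fields are hypothetical):

    # Sketch: update a thing shadow's reported state via the AWS IoT data-plane API.
    # The thing name and payload fields are hypothetical.
    import json
    import boto3

    iot_data = boto3.client("iot-data")

    payload = {"state": {"reported": {"temperature": 21.4, "humidity": 48}}}
    iot_data.update_thing_shadow(
        thingName="my-sensor-01",
        payload=json.dumps(payload).encode("utf-8"),
    )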
You can however create a rule that listens for changes on one or more shadow instances, and register the changes into DynamoDB for instance, in a time-series manner. You'll then have an event-sourcing system allowing you to store the previous states, or changes, sent by a device during an arbitrary amount of time.
"The thing shadow system does not perform any rules; it just performs event sourcing and keeps the last known state in virtual thing actors."
The thing shadow keeps the desired and reported state of a physical device in the cloud. It does not execute rules, but emits messages on MQTT topics when events happen within the shadow. These messages can then be captured by the rules engine to execute actions.
"The same sensor data is also inbound data to the rules engine. The rules engine is just an ECA (event-condition-action) type system that handles streaming data and decides what to do with it. This means every incoming data point will eventually be processed by the rules engine."
The rules engine does not listen on any MQTT topic by default, and hence does not see the data sent by devices to the Device Gateway. You must register in the rules engine the topics you'd like to listen to, along with their associated actions.
Other than that, the rules engine allows you to describe your rules in ANSI SQL, meaning that you are able to specify the origin of your data (the FROM in your SQL statement), the specific fields in a JSON payload you are interested in capturing (SELECT), and an optional condition specifying when the rule should be triggered (WHERE).
An example of a rule listening on the hypothetical topic device/+/telemetry and capturing all the fields in the received payload would be:
SELECT * FROM 'device/+/telemetry'
Note how the + can be used as a placeholder for any device identifier, for instance.
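Such a rule can also be registered programmatically. Here is a hedged boto3 sketch of the same rule with a simple republish action; the rule name, IAM role ARN, and target topic are placeholders:

    # Sketch: register the example rule with a simple republish action via boto3.
    # The rule name, IAM role ARN, and target topic are placeholders.
    import boto3

    iot = boto3.client("iot")

    iot.create_topic_rule(
        ruleName="device_telemetry_fanout",
        topicRulePayload={
            "sql": "SELECT * FROM 'device/+/telemetry'",
            "actions": [
                {
                    "republish": {
                        "roleArn": "arn:aws:iam::123456789012:role/iot-republish-role",
                        "topic": "telemetry/processed",
                    }
                }
            ],
        },
    )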
What is the basic difference between stream processing and traditional message processing? People say that Kafka is a good choice for stream processing, but essentially Kafka is a messaging framework similar to ActiveMQ, RabbitMQ, etc.
Why do we generally not say that ActiveMQ is good for stream processing as well?
Is it the speed at which messages are consumed by the consumer that determines whether it is a stream?
In traditional message processing, you apply simple computations on the messages -- in most cases individually per message.
In stream processing, you apply complex operations on multiple input streams and multiple records (i.e., messages) at the same time (like aggregations and joins).
Furthermore, traditional messaging systems cannot go "back in time" -- i.e., they automatically delete messages after they have been delivered to all subscribed consumers. In contrast, Kafka keeps messages for a configurable amount of time and uses a pull-based model (i.e., consumers pull data out of Kafka). This allows consumers to "rewind" and consume messages multiple times -- or, if you add a new consumer, it can read the complete history. This makes stream processing possible, because it allows for more complex applications. Furthermore, stream processing is not necessarily about real-time processing -- it's about processing infinite input streams (in contrast to batch processing, which is applied to finite inputs).
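To make the "rewind" idea concrete, here is a hedged sketch using the kafka-python package; the topic name and broker address are assumptions:

    # Sketch: re-read a topic from the beginning with kafka-python.
    # The topic name and bootstrap server are assumptions.
    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092", enable_auto_commit=False)
    partition = TopicPartition("payments", 0)
    consumer.assign([partition])
    consumer.seek_to_beginning(partition)  # "go back in time" and replay the retained log

    for record in consumer:
        print(record.offset, record.value)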
And Kafka offers Kafka Connect and the Streams API -- so it is a stream-processing platform and not just a messaging/pub-sub system (even if it uses messaging at its core).
If you like splitting hairs:
Messaging is communication between two or more processes or components, whereas streaming is the passing of events through a log as they occur. Messages carry raw data, whereas events contain information about the occurrence of an activity, such as an order.
So Kafka does both messaging and streaming. A topic in Kafka can hold raw messages or an event log that is normally retained for hours or days. Events can further be aggregated into more complex events.
Although Rabbit supports streaming, it was actually not built for it (see Rabbit's website).
Rabbit is a message broker and Kafka is an event streaming platform.
Kafka can handle a far larger number of 'messages' than Rabbit.
Kafka is a log while Rabbit is a queue, which means that once consumed, Rabbit's messages are no longer there in case you need them again.
However, Rabbit can specify message priorities, while Kafka can't.
It depends on your needs.
Message processing implies operations on and/or using individual messages. Stream processing encompasses operations on and/or using individual messages as well as operations on collections of messages as they flow into the system. For example, let's say transactions are coming in for a payment instrument -- stream processing can be used to continuously compute the hourly average spend. In this case, a sliding window can be imposed on the stream which picks up messages within the hour and computes the average of the amounts. Such figures can then be used as inputs to fraud detection systems.
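As a toy illustration of such a window (amounts and timestamps are made up; a real deployment would use the windowing primitives of a stream processor such as Kafka Streams or Flink), a plain Python sketch could look like this:

    # Sketch: one-hour sliding-window average spend over a stream of transactions.
    # The transaction source is simulated; a real system would read from a broker.
    from collections import deque

    WINDOW_SECONDS = 3600
    window = deque()  # (timestamp, amount) pairs currently inside the window

    def observe(timestamp, amount):
        window.append((timestamp, amount))
        # Drop transactions that have fallen out of the one-hour window.
        while window and window[0][0] < timestamp - WINDOW_SECONDS:
            window.popleft()
        return sum(a for _, a in window) / len(window)

    # Simulated transactions: (seconds since start, spend amount)
    for ts, amount in [(0, 20.0), (600, 35.0), (1800, 5.0), (4000, 50.0)]:
        print("t=%ss hourly average spend: %.2f" % (ts, observe(ts, amount)))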
Apologies for the long answer, but I think a short answer would not do justice to the question.
Consider a queue system, like MQ, for:
Exactly-once delivery, and participation in two-phase commit transactions
Asynchronous request/reply communication: the semantics of the communication are for one component to ask a second component to do something with its data. This is a command pattern with a delayed response.
Recall that messages in a queue are kept until the consumer(s) have received them.
Consider a streaming system, like Kafka, as a pub/sub and persistence system for:
Publishing events as immutable facts of what happened in an application
Getting continuous visibility of the data streams
Keeping data after it is consumed, for future consumers and for replayability
Scaling message consumption horizontally
What are Events and Messages
There is a long history of messaging in IT systems, and it is easy to view an event-driven solution and events through the lens of messaging systems and messages. However, there are different characteristics that are worth considering:
Messaging: Messages transport a payload and messages are persisted until consumed. Message consumers are typically directly targeted and related to the producer who cares that the message has been delivered and processed.
Events: Events are persisted as a replayable stream history. Event consumers are not tied to the producer. An event is a record of something that has happened and so can't be changed. (You can't change history.)
Now, messaging versus event streaming:
Messaging is there to support:
Transient data: data is only stored until a consumer has processed the message, or until it expires.
Request/reply, most of the time.
Targeted reliable delivery: targeted to the entity that will process the request or receive the response. Reliable with transaction support.
Time-coupled producers and consumers: consumers can subscribe to a queue, but a message can be removed after a certain time or once all subscribers have received it. The coupling is still loose at the data model and interface definition level.
Event streaming is there to support:
Stream history: consumers are interested in historic events, not just the most recent.
Scalable consumption: a single event is consumed by many consumers with limited impact as the number of consumers grows.
Immutable data
Loosely coupled / decoupled producers and consumers: strong time decoupling, as consumers may come along at any time. Some coupling remains at the message definition level, but schema management best practices and a schema registry reduce friction.
Hope this answer helps!
Basically, Kafka is a messaging framework similar to ActiveMQ or RabbitMQ. There have been some efforts to take Kafka towards streaming:
https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/
Then why does Kafka come into the picture when talking about stream processing?
Stream processing frameworks differ in how they take their input data. In batch processing, you have files stored in a file system and you want to continuously process them and store the results in some database. In stream processing, frameworks like Spark, Storm, etc. get continuous input from sensor devices, API feeds, and so on, and Kafka is used there to feed the streaming engine.
Recently, I came across a very good document that describes the usage of "stream processing" and "message processing":
https://developer.ibm.com/articles/difference-between-events-and-messages/
Taking asynchronous processing into context:
Messaging:
Consider it when there is a "request for processing", i.e. a client makes a request for a server to process.
Event streaming:
Consider it when "accessing enterprise data", i.e. components within the enterprise can emit data that describes their current state. This data does not normally contain a direct instruction for another system to complete an action. Instead, components allow other systems to gain insight into their data and status.
To facilitate this evaluation, consider these key selection criteria when selecting the right technology for your solution:
Event history - Kafka
Fine-grained subscriptions - MQ
Scalable consumption - Kafka
Transactional behavior - MQ
I am new to MongoDB and have only a very basic knowledge of its sharding concepts. However, I was wondering whether it is possible to control the split of data yourself, for example so that a specific part of the records is stored on one particular shard?
This will be used together with a Rails app.
You can turn off the balancer to stop auto balancing:
sh.setBalancerState(false)
If you know the range of the key you are splitting on, you could also pre-split your data ranges onto the desired servers (see the pre-splitting example). The management of the shards would be done via the JavaScript shell rather than via your Rails application.
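Although the answer above points to the JavaScript shell, the same admin commands can be issued from a driver as well. Here is a hedged pymongo sketch, where the mongos address, database, collection, shard key, and shard names are all placeholders:

    # Sketch: pre-split a collection and pin chunks to specific shards via pymongo.
    # Run against a mongos; names and key values are placeholders.
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos.example.net:27017")
    admin = client.admin

    # Split the chunk at a chosen shard-key value.
    admin.command("split", "mydb.orders", middle={"customer_id": 50000})

    # Move the resulting ranges to the shards you want them on.
    admin.command("moveChunk", "mydb.orders", find={"customer_id": 1}, to="shard0000")
    admin.command("moveChunk", "mydb.orders", find={"customer_id": 99999}, to="shard0001")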
You should take care that no shard gets more load than the others (becomes hot); that is why auto-balancing is on by default. Using monitoring like the free MMS service will help you keep an eye on this.
The decision to shard is a complex decision and one that you should put a lot of thought into.
There's a lot to learn about sharding, and much of it is non-obvious. I'd suggest reviewing the information at the following links:
Sharding Introduction
Sharding Overview
FAQ
In the context of a sharded cluster, a chunk is a contiguous range of shard key values assigned to a particular shard. By default, chunks are 64 megabytes (unless configured otherwise). When a chunk grows beyond the configured chunk size, a mongos splits it into two chunks. MongoDB chunks are logical, and the data within them is NOT physically located together.
As I've mentioned, the balancer moves the chunks around; however, you can also do this manually. The balancer will take the decision to re-balance and request a chunk migration if there is a large enough difference (a minimum of 8) between the number of chunks on each shard. The actual moving of the chunks is coordinated between the "From" and "To" shards, and when this is finished, the original chunks are removed from the "From" shard and the config servers are informed.
Quite a lot of people also pre-split, which helps with their migration. See here for more information.
In order to see documents split among the two shards, you'll need to insert enough documents to fill up several chunks on the first shard. If you haven't changed the default chunk size, you'd need to insert a minimum of 512MB of data in order to see data migrated to a second chunk. It's often a good idea to test this, and you can do so by setting your chunk size to 1MB and inserting 10MB of data. Here is an example of how to test this.
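Concretely, a hedged sketch of that test with pymongo might look like the following; the mongos address, database, and collection names are placeholders, and note that the chunk-size setting in config.settings applies to the whole cluster:

    # Sketch: lower the chunk size to 1MB and insert ~10MB of documents to force splits.
    # Run against a mongos; names are placeholders and the collection is assumed to be sharded.
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos.example.net:27017")

    # Set the cluster-wide chunk size to 1MB via config.settings.
    client.config.settings.update_one(
        {"_id": "chunksize"}, {"$set": {"value": 1}}, upsert=True
    )

    # Insert roughly 10MB of data into the sharded collection.
    padding = "x" * 1024  # ~1KB per document
    client.mydb.users.insert_many(
        [{"user_id": i, "padding": padding} for i in range(10 * 1024)]
    )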
Tag-aware sharding (http://www.mongodb.org/display/DOCS/Tag+Aware+Sharding), available in v2.2, probably addresses your requirement.
Check out Kristina Chodorow's blog post too for a nice example: http://www.kchodorow.com/blog/2012/07/25/controlling-collection-distribution/
Why do you want to split the data yourself if MongoDB does it for you automatically? You can update your Rails application layer to talk to a mongos instance so that mongos routes any CRUD operation to the place where the data resides. This is achieved using the config servers.