How exactly does BroadcastHashJoin work in Spark?

I'm trying to understand how a BroadcastHashJoin is executed.
I know that the small table is broadcast to all nodes, but is the result then sent back to the driver?
I'm using the Spark UI to understand how the network traffic is managed, but I don't get any relevant results and the driver entry is always empty:
Why can't I see traffic to the driver?

The relation to be broadcast is collected to the driver.
The collected relation is hashed locally.
The hashed relation is used to create a broadcast variable.
The broadcast relation is used to compute the join in parallel on the executors.
The missing driver data you see most likely corresponds to the hashing part, which is not executed inside a job and therefore doesn't produce useful metrics.
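As a rough illustration, here is a minimal PySpark sketch (the table and column names are made up) that forces a broadcast hash join with the broadcast() hint; the physical plan printed by explain() should show a BroadcastHashJoin node:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

    # Hypothetical data: a large fact table and a small dimension table.
    orders = spark.range(1_000_000).withColumnRenamed("id", "customer_id")
    customers = spark.createDataFrame(
        [(0, "alice"), (1, "bob")], ["customer_id", "name"]
    )

    # The broadcast() hint asks Spark to collect `customers` to the driver,
    # hash it, wrap it in a broadcast variable and ship it to every executor,
    # so the join runs in parallel without shuffling the large side.
    joined = orders.join(broadcast(customers), on="customer_id", how="inner")

    joined.explain()  # the physical plan should contain BroadcastHashJoin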

Related

CAN Communication: Knowing Which Node Transmitted Data

I am new to CAN communication and one of my tasks is to use a CANalyzer to learn what message IDs are being used for a product and what data is being sent/received.
The product has multiple nodes that can send/receive CAN messages. I know CAN messages are broadcasted to all the nodes, but the part I'm having a hard time determining is which node transmitted the message and which nodes received it.
So, for example, if I have 3 CAN nodes, is there a way I can determine that Node 1 sent the message and Node 2/3 are receiving the message?
Thank you in advance.
Generally, you can't know this by listening to the CAN bus alone. The same old story whenever someone asks about data on "CAN bus" is: what application layer protocol is it using? "CAN bus" doesn't tell you jack, it's just the specification of the physical and data link layers. The concept of identifying individual nodes does not exist on the data link layer, only on the application layer.
There are two possible ways for you to tell:
If you know the application layer used on top of the physical CAN bus and know that it uses node ids, then you can tell which node is sending what data by decoding the application-layer protocol.
On each node, you can sniff the Tx signal between the MCU and CAN transceiver with an oscilloscope. That signal only goes active when the node is transmitting or ACKing. Most modern scopes have a CAN frame decoder feature, saving you the headache of decoding the frames manually.
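For the first option, here is a minimal Python sketch with the python-can library, assuming a SocketCAN interface named can0 and an application layer such as SAE J1939, where the transmitting node's source address occupies the lowest 8 bits of the 29-bit identifier; adapt it to whatever protocol your product actually uses:

    import can

    # Assumption: SocketCAN interface "can0" and a J1939-style application
    # layer where the sender's source address is the low byte of the 29-bit ID.
    bus = can.interface.Bus(channel="can0", bustype="socketcan")

    for msg in bus:
        if msg.is_extended_id:
            source_address = msg.arbitration_id & 0xFF  # J1939 source address
            print(f"node 0x{source_address:02X} sent id=0x{msg.arbitration_id:08X} "
                  f"data={msg.data.hex()}")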

Distributed computing in a network - Framework/SDK

I need to build a system that consists of:
Nodes, each node can accept one input.
The node that received the input shares it with all nodes in the network.
Each node does a computation on the input (the same computation, but each node has a different database, so the results are different for each node).
The node that received the input consolidates each node's result and applies logic to determine the overall result.
This result is returned to the caller.
It's very similar to a map-reduce use case, except there will only be a few nodes (maybe 10-20), so solutions like Hadoop seem like overkill.
Do you know of any simple framework/sdk to build:
Network (discovery, maybe gossip protocol)
Distribute a task/data to each node
Aggregate the results
Can be in any language.
Thanks very much
Regards,
fernando
OK, to begin with, there are many ways to do this. I would suggest the following if you are just starting to tackle this architecture:
Pub/Sub with Broker
Programs like RabbitMQ are meant to easily allow variable numbers of nodes to connect and speak to one another. Most importantly, they allow for transparency and observability. You can easily ask the broker which nodes are connected and even view messages in transit. Basically, they are a 'batteries included' means of dealing with a large number of clients.
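As a minimal sketch of the broker approach, assuming a RabbitMQ instance on localhost and using the pika Python client (the exchange name 'work' is only for illustration): the receiving node publishes the input once to a fan-out exchange, and every worker node consumes its own copy.

    import pika

    def _channel():
        conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        ch = conn.channel()
        ch.exchange_declare(exchange="work", exchange_type="fanout")
        return ch

    def publish_input(payload: bytes):
        # Run on the node that received the input: every worker gets a copy.
        _channel().basic_publish(exchange="work", routing_key="", body=payload)

    def run_worker():
        # Run on every worker node: bind an exclusive queue to the exchange.
        ch = _channel()
        queue = ch.queue_declare(queue="", exclusive=True).method.queue
        ch.queue_bind(exchange="work", queue=queue)
        ch.basic_consume(
            queue=queue,
            on_message_callback=lambda c, m, p, body: print("got", body),
            auto_ack=True,
        )
        ch.start_consuming()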
Brokerless (Update)
I was looking for a more 'symmetric' architecture where each node is the same and there is no centralized broker/queue manager.
You can use a brokerless Pub/Sub system, but I personally avoid them. While they have tooling, it is hard to understand their registration protocols if something odd happens. I generally just use multicast, as it is very straightforward, especially if each node has just one network interface, and you can extend/modify behavior just with routing infrastructure.
Here is how your scheme would work with multicast:
All nodes join a known multicast address (e.g. 239.1.2.3:8000)
All nodes would need to respond to a 'who's here' message
All nodes would need a 'do work' API, either via multicast or from consumer to node (node address grabbed from the 'who's here' response)
You would need to define these messages yourself, but given how short I expect them to be, it should be pretty simple.
The 'who's here' message from the consumer could just be a message with a binary zero.
The 'who's here' response could just be a 1 followed by the node's information (making it a TLV would probably be best, though).
Not sure if each node has unique arguments or not, so I don't know how to shape your 'do work' message or response.
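Here is a rough Python sketch of that 'who's here' exchange over UDP multicast, using the example group above; the framing (a zero byte for the query, a 0x01 byte plus the node's name for the reply) is purely illustrative:

    import socket
    import struct

    GROUP, PORT = "239.1.2.3", 8000

    def run_responder(node_name: bytes):
        # Each node joins the multicast group and answers "who's here" (a single
        # zero byte) with 0x01 followed by its own name.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", PORT))
        mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        while True:
            data, addr = sock.recvfrom(1024)
            if data == b"\x00":                       # "who's here"
                sock.sendto(b"\x01" + node_name, addr)

    def discover(timeout=1.0):
        # The consumer asks "who's here" and collects replies for a short window.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
        sock.settimeout(timeout)
        sock.sendto(b"\x00", (GROUP, PORT))
        nodes = []
        try:
            while True:
                data, addr = sock.recvfrom(1024)
                if data.startswith(b"\x01"):
                    nodes.append((addr[0], data[1:].decode()))
        except socket.timeout:
            return nodes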

How to get a list of all topics containing specific values known to MQTT broker?

I'm looking for a way to get a list of all topics known to a broker. There are some quite similar questions, but they didn't help me figure it out for my use case.
I've got 3 Raspberry Pis with multiple sensors (temperature, humidity) which are connected over an MQTT network. Every Pi has its own database containing time series of measurements and other system variables (like CPU load).
Now I'm looking for a way to handle the following scenario:
I want to monitor my system and detect anomalies. For that I want to get all sensor time series from the last x seconds and process them in a Python script. The monitoring calculations can run on any of the Pis.
Example: I'm on RPi2 and want to monitor the whole distributed network. There is no prior knowledge about the sensors attached to the Pis. From my Python script running on RPi2 I would initialise an MQTT client and subscribe to all sensor data on the broker. I know about the wildcard #, but I'm not sure how to use it in this case. My magic command would look like the following pseudo code:
1) client subscribe to all sensor data - #/sensor/#
2) get list with all topics
3) client subscribe to all topics from given list list/#
4) analyse data for anomalies every x seconds
First, your wildcard topic patterns are not valid. Topic patterns can only contain a single '#' character, and it can only appear at the end of a topic, e.g. foo/bar/# is valid, #/foo is not. You can use the + character, which is a single-level wildcard.
This means a topic pattern of +/sensor/# will match each of the following:
rpi1/sensor/foo
rpi1/sensor/bar/temp
but not
rpi1/foo/sensor/bar
Next, brokers do not keep a list of topics that exist. Topics only really exist at the instant a message is published to one; the broker then checks that topic against the patterns that subscribing clients have requested and delivers the message to the clients that match.
Thirdly, when bridging brokers in loops like that, you have to be very careful with the bridge filters to make sure that messages don't end up in a constant loop.
The solution is probably to designate a "master" broker and bridge all the others one way to that broker and then have the client subscribe to either '#' to get everything or something more like '+/sensor/#' to just see the sensor readings.
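For example, a minimal monitoring client with the paho-mqtt Python library (1.x API), assuming the designated "master" broker is reachable at the hypothetical host rpi2.local:

    import paho.mqtt.client as mqtt

    def on_message(client, userdata, msg):
        # msg.topic tells you which Pi/sensor published the reading.
        print(msg.topic, msg.payload.decode())

    client = mqtt.Client()           # paho-mqtt 1.x style constructor
    client.on_message = on_message
    client.connect("rpi2.local", 1883)
    client.subscribe("+/sensor/#")   # all sensor readings, e.g. rpi1/sensor/temp
    client.loop_forever()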

Per-partition GroupByKey in Beam

Beam's GroupByKey groups records by key across all partitions and outputs a single iterable per-key-per-window. This "brings associated data together into one location"
Is there a way I can group records by key locally, so that I still get a single iterable per key per window as output, but only over the local records in the partition instead of a global group-by-key over all locations?
If I understand your question correctly, you don't want to transfer data over the network if a part of it (a partition) was processed on the same machine and can then be grouped locally.
Normally, Beam doesn't give you details of where and how your code will run, since that may vary depending on the runner/engine/resource manager. However, if you can fetch some unique information about your worker (like hostname, IP or MAC address), then you can use it as part of your key and group all related data by it. Quite likely these data partitions won't be moved to other machines in that case, since all the needed input data is already sitting on the same machine and can be processed locally. Though, AFAIK, there is no 100% guarantee of that.
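As a sketch of that idea in the Beam Python SDK (the sample data is made up, and as noted above there is no guarantee the grouping stays local):

    import socket
    import apache_beam as beam

    # Fold the worker's hostname into the key so GroupByKey only merges records
    # that were processed on the same machine.
    with beam.Pipeline() as p:
        (
            p
            | "Read" >> beam.Create([("a", 1), ("b", 2), ("a", 3)])
            | "KeyByHost" >> beam.Map(
                lambda kv: ((socket.gethostname(), kv[0]), kv[1]))
            | "LocalGroup" >> beam.GroupByKey()
            | "Print" >> beam.Map(print)
        )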

iOS - Parse cloud match user according to its location

I'm working on a taxi booking service and using Parse as a backend.
- The client adds a new Order (which contains pick-up point coordinates and destination point coordinates), and Cloud Code should match the closest Driver (parameter 'currentLocation') to the pick-up point.
- Then the system offers the Order to that Driver; if the Driver does not respond or declines the Order, the system should find another Driver.
What algorithm should I have on the Cloud Code side? I think I need something like a matching session, but have no idea how to do this yet. Also: the driver should accept the order within 2 minutes, or the order will be declined and go to another Driver. Thank you!
