I am writing a lightweight program to monitor the received beams of a lidar. Because of its lightweight nature, I would prefer not to cache the entire UDP data packet or the point cloud data.
My question is: what data is contained in the ROS message velodyne_msgs/VelodynePacket? This message carries less data, but I do not know whether it is relevant to what I need.
I have read the ROS Wiki on this topic, but the link for VelodynePacket did not provide useful information about its contents.
Check the message definition to see what fields a message contains and their types. Message files usually either have field names that are self-explanatory or have comments (# text) describing the fields. You can look at the message definitions either online or locally. To look at them locally, use roscd to get to the package directory (roscd <package_name>/msg) and then use cat to see the contents of the message file. In your case, this would be:
roscd velodyne_msgs/msg
cat VelodynePacket.msg
cat VelodyneScan.msg
The relevant message files are available online from the page you linked to:
http://docs.ros.org/api/velodyne_msgs/html/msg/VelodyneScan.html
http://docs.ros.org/api/velodyne_msgs/html/msg/VelodynePacket.html
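At the time of writing, both definitions are only a couple of fields each (double-check against your installed version, since definitions can change between releases):

VelodynePacket.msg:
    time stamp              # packet timestamp
    uint8[1206] data        # raw contents of one UDP data packet

VelodyneScan.msg:
    Header header               # standard ROS message header
    VelodynePacket[] packets    # vector of raw packets

In other words, VelodynePacket is just a timestamp plus the raw 1206-byte UDP payload, and VelodyneScan bundles a set of those packets (typically one full revolution); neither contains a decoded point cloud.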
Regarding your specific question about creating a lightweight application, you have a few options.
1. Use the provided ROS message and subscribe to it. Most of the time, if you don't have a ton of large data traveling around, you'll be okay and will be able to keep up with real-time data. The majority of the overhead associated with ROS usually comes from the network transport, so if that becomes a problem, you'll need to avoid passing the data over ROS.
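For example, a minimal rospy monitor along these lines (a sketch only; it assumes the stock velodyne_driver publishing VelodyneScan on the velodyne_packets topic) could be:

#!/usr/bin/env python
import rospy
from velodyne_msgs.msg import VelodyneScan

def on_scan(scan):
    # Look only at lightweight metadata; the 1206-byte payloads in
    # scan.packets[i].data are left untouched and never cached.
    stamp = scan.packets[0].stamp if scan.packets else None
    rospy.loginfo("received %d packets, first stamp: %s", len(scan.packets), stamp)

if __name__ == "__main__":
    rospy.init_node("velodyne_monitor")
    rospy.Subscriber("velodyne_packets", VelodyneScan, on_scan)
    rospy.spin()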
2. Put your code in a ROS nodelet. This gives you the advantages of the ROS data abstractions while eliminating the network data transfer that occurs between nodes. This is akin to using a pointer to the data instead of copying it.
3. If you really don't want all the scan data but still want to use ROS, you can write your own driver node. That node would read only the data you want from the LIDAR and discard the rest. You can do the raw data processing in that node (no ROS message required) or publish just the data you care about and do the processing in another node.
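A rough sketch of that last option without any ROS at all (assuming the standard Velodyne data stream of 1206-byte packets on UDP port 2368; adjust for your sensor and network setup):

#!/usr/bin/env python
import socket

DATA_PORT = 2368      # default Velodyne data port (assumption; check your configuration)
PACKET_SIZE = 1206    # size of one Velodyne data packet

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", DATA_PORT))

while True:
    data, addr = sock.recvfrom(PACKET_SIZE + 42)  # a little headroom
    # Inspect only what you care about (arrival rate, packet length, sender),
    # then drop the buffer instead of caching payloads or building point clouds.
    print("got %d bytes from %s" % (len(data), addr[0]))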
Related
I'm facing a situation where I have multiple robots, most running full ROS stacks (complete with a Master), and I'd like to selectively route some topics through another messaging framework to the other robots (some of which are not running ROS).
The naive way to do this works, namely to set up a node that subscribes to the ROS topics in question and sends the data over the network, after which another node publishes it (if it's ROS). Great, but it seems odd to have to do this much serializing. Right now the message goes from its message type to the ROS serialization, back to the message type, then to a different serialization format (currently Pickle), across the network, then back to the message type, then back to the ROS serialization, and finally back to the message type.
So the question is, can I simplify this? How can I operate on the ROS-serialized data (i.e. subscribe without rospy automagically deserializing for me)? http://wiki.ros.org/rospy/Overview/Publishers%20and%20Subscribers suggests that I can access the connection information as a dict of strings, which may be half of the solution, but how can the other end take the connection information and republish it without first deserializing and then immediately reserializing?
Edit: I just found https://gist.github.com/wkentaro/2cd56593107c158e2e02 , which seems to solve half of this. It uses AnyMsg to avoid deserializing on the ROS subscriber side, but then when it republishes it still deserializes and immediately reserializes the message. Is what I'm asking impossible?
Just to close the loop on this, it turns out you can publish AnyMsgs, it's just that the linked examples chose not to.
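For future readers, a minimal relay along those lines in rospy (topic names are placeholders; this is a sketch, not a drop-in solution):

#!/usr/bin/env python
import rospy
from rospy.msg import AnyMsg

rospy.init_node("raw_relay")
# AnyMsg keeps the serialized bytes in msg._buff (with the type information
# in msg._connection_header), and publishing it writes that buffer back out
# unchanged, so no deserialize/reserialize round trip happens in this node.
pub = rospy.Publisher("topic_out", AnyMsg, queue_size=10)
rospy.Subscriber("topic_in", AnyMsg, pub.publish)
rospy.spin()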
Has anyone posted a response to this problem? There have been other posts with no answers. Our situation is that we are pushing messages onto a topic that backs a KTable in the first step of our stream process. We then pull a small amount of data from those messages and pass it along, and we do multiple computations on that smaller amount of data for grouping and aggregation. At the end of the streaming process, we simply want to join back to that original topic via a KTable to pick up the full message content again. The results of the join are only a subset of the data because it cannot find the entries in the KTable.
This is just the beginning of the problem. In another case, we are using KTables as indexes for lookups meant to enrich the incoming data. Think of these lookups as identifying whether we have seen a specific pattern in the streaming message before. If we have seen the pattern, we want to tag it with an ID (used for grouping) pulled from an existing KTable. If we have not seen the pattern before, we assign it an ID and place it back into the KTable to be used to tag future messages. What we have found is that there is no guarantee that the information will be present in the KTable for future messages. This lack of a guarantee seems to make KTables useless. We cannot figure out why there is so little discussion of this on the forums.
Finally, none of this seemed to be a problem when running a single instance of the streams application. However, as soon as our data got large and we were forced to run 10 instances of the app, everything broke. Also, there is no way we could use things like GlobalKTables, because there is too much data to load into a single machine's memory.
What can we do? We are currently planning to abandon KTables altogether and use something like Hazelcast to store the lookup data. Should we just move to Hazelcast Jet and drop Kafka Streams altogether?
Adding the flow:
[image: Kafka data flow]
I'm sorry for this non-answer answer, but I don't have enough points to comment...
The behavior you describe is definitely inconsistent with my understanding and experience with streams. If you can share the topology (or a simplified one) that is causing the problem, there might be a simple mistake we can point out.
Once we get more info, I can edit this into a "real" answer...
Thanks!
-John
To our streaming pipeline, we want to submit unique GCS files, each file containing information about multiple events, with each event also containing a key (for example, device_id). As part of the processing, we want to shuffle by this device_id so as to achieve some form of worker-to-device_id affinity (more background on why we want to do this is in this other SO question). Once all events from the same file are complete, we want to reduce (GroupBy) by their source GCS file (which we will make a property of the event itself, something like file_id) and finally write the output to GCS (could be multiple files).
The reason we want to do the final GroupBy is that we want to notify an external service once a specific input file has completed processing. The only problem with this approach is that since the data is shuffled by device_id and then grouped at the end by file_id, there is no way to guarantee that all data from a specific file_id has completed processing.
Is there something we could do about it? I understand that Dataflow provides exactly-once guarantees, which means all the events will eventually be processed, but is there a way to set a deterministic trigger that says all data for a specific key has been grouped?
EDIT
I wanted to highlight the broader problem we are facing here. The ability to mark file-level completeness would help us checkpoint different stages of the data as seen by external consumers. For example, this would allow us to trigger per-hour or per-day completeness, which is critical for us to generate reports for that window. Given that these stages/barriers (hour/day) are clearly defined on the input (the GCS files are date/hour partitioned), it is only natural to expect the same of the output. But with Dataflow's model, this seems impossible.
Similarly, although Dataflow guarantees exactly-once processing, there will be cases where the entire pipeline needs to be restarted because something went horribly wrong; in those cases, it is almost impossible to restart from the correct input marker, since there is no guarantee that what was already consumed has been completely flushed out. The DRAIN mode tries to achieve this but, as mentioned, if the entire pipeline is messed up and draining itself cannot make progress, there is no way to know which part of the source should be the starting point.
We are considering using Spark, since its micro-batch-based streaming model seems to fit better. We would still like to explore Dataflow if possible, but it seems that we won't be able to achieve this without storing these checkpoints externally from within the application. If there is an alternative way of providing these guarantees from Dataflow, it would be great. The idea behind broadening this question was to see if we are missing an alternate perspective which would solve our problem.
Thanks
This is actually tricky. Neither Beam nor Dataflow has a notion of a per-key watermark, and it would be difficult to implement that level of granularity.
One idea would be to use a stateful DoFn instead of the second shuffle. This DoFn would need to receive the number of elements expected in the file (from either a side-input or some special value on the main input). Then it could count the number of elements it had processed, and only output that everything has been processed once it had seen that number of elements.
This would be assuming that the expected number of elements can be determined ahead of time, etc.
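A hedged sketch of that idea with the Beam Python SDK (the (file_id, (record, expected_count)) element layout and the completion signal are assumptions, not a drop-in solution):

import apache_beam as beam
from apache_beam.transforms.userstate import CombiningValueStateSpec

class FileCompletionFn(beam.DoFn):
    # Running per-key (per-file) count of elements processed so far.
    SEEN = CombiningValueStateSpec('seen', sum)

    def process(self, element, seen=beam.DoFn.StateParam(SEEN)):
        # Assumed element layout: (file_id, (record, expected_count)),
        # where expected_count was attached upstream (side input or a
        # special value on the main input, as described above).
        file_id, (record, expected_count) = element
        seen.add(1)
        if seen.read() == expected_count:
            yield file_id  # every element of this file has been seen

Keyed by file_id and placed after the per-device work, the emitted file_id could then drive the notification to the external service.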
We have a group of users who need to see the payloads of packets in Wireshark captures. I'm looking for a way to remind those users that the data contained within may not represent the exact frames on the wire (because, by the time they get it, the capture will have been pre-processed to remove, e.g., security-sensitive IP addresses). A hook in the capture file that triggered a popup with a short message would be perfect. Is there any way to do this, short of wrapping Wireshark with another binary (which would be trivially bypassable anyway)?
I've searched the Wireshark lists but come up empty.
The only thing you could do would be to have the pre-processing program write out the file in pcapng format and add a comment to the initial Section Header Block giving that warning. That won't produce a popup - but, then, not all the capture file reading programs in the Wireshark suite are GUI programs that could produce a popup.
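For example, if the pre-processing ends with a capture file, something along these lines should attach such a comment (the exact option may vary with your Wireshark/editcap version, so treat this as a sketch):

editcap --capture-comment "Capture was pre-processed; addresses anonymized, frames may differ from the wire" original.pcap sanitized.pcapng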
What guidelines should be followed so that data can be previewed nicely in the CKAN Data Preview tool? I am working on CKAN and have been uploading data or linking it to external websites. Some of it could be previewed nicely, some not. I have been researching machine-readability online and could not find any resources pertaining to CKAN that state the correct way to structure data so that it can be previewed nicely in the CKAN Data Preview tool. I hope to gather responses from all of you on the do's and don'ts so that they will come in useful to CKAN publishers and developers in the future.
For example: data has to be in a tabular format with labelled rows and columns; data has to be stored on the first tab of the spreadsheet, as the other tabs cannot be previewed; and the spreadsheet cannot contain formulas or macros. Data also has to be stored in the correct file format (refer to another topic of mine: Which file formats can be previewed on the CKAN Data Preview tool?).
Thanks!
Since CKAN is an open-source data management system, it does not have specific guidelines on the machine readability of data. Instead, you might want to take a look at the current standard for data openness and machine readability here: http://5stardata.info
The UK's implementation of CKAN also includes a set of plugins which help to rate the openness of the data based on the 5-star open data scheme: https://github.com/ckan/ckanext-qa
Check Data Pusher Logs - When you host files in the CKAN DataStore, the tool that loads the data in provides logs; these will reveal problems with the format of the data.
Store Data Locally - Where possible, store the data locally, because data stored elsewhere has to go through the proxy process (https://github.com/okfn/dataproxy), which is slower and is of course subject to the external site maintaining availability.
Consider File Size and Connectivity - Keep the file size small enough for your installation and connectivity that it doesn't time out when loading into the CKAN Data Explorer. If the file is externally hosted, large, and slow to access (poor connectivity or too much load), you will end up with timeouts, since the proxy must read the entire file before it is presented for preview. Again, hosting data locally should mean better control over the load on compute resources and ensure that the Data Explorer works consistently.
Use Open File Formats - If you are using CKAN to publish open data, then the community generally holds that it is best to publish data in open formats (e.g. CSV, TXT) rather than proprietary ones (e.g. XLS). Beyond increasing access to the data for all users, and reducing the chance that the data is not properly structured for preview, this has other advantages. For example, it is harder to accidentally publish information that you didn't mean to.
Validate Your Data - Use tools like csvkit to check that your data is in good shape.
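For example (csvkit commands, with placeholder file names; assuming csvkit is installed):

in2csv data.xls > data.csv    # convert a spreadsheet to plain CSV
csvclean -n data.csv          # report structural problems without writing files
csvstat data.csv              # quick summary of column types and values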
The best way to get a good previewing experience is to start using the DataStore. When viewing remote data, CKAN has to use the DataProxy to do its best to guess data types and convert the data to a form it can preview. If you put the data into the DataStore, that isn't necessary, as the data will already be in a good structure and the types will have been set (e.g. you'll know a given column is a date rather than a number).