GNURADIO 3.7.8: identify a part of a byte stream - stream

I am feeling Stream Tags, Message Passing, Packet Data Transmission are a bit of overkill, and I have hard time to understand.
I have a simple wish: starting from a stream of bytes I would like to "extract" only a fixed number of bytes) starting from a known pattern. For example from a stream like this: ...01h 55h XXh YYh ZZh..., it should extract XXh YYh ZZh.
I utilized Correlate Access Code Tag block -- Tagged Stream Align -- Pack K Bits to convert a bit stream into a byte stream and synch to the desired Access Code (01h 55h), but how do I tell gnuradio to only process 3 bytes after every time the code is found? Likely OOT block would solve, but is it there some combinatino of standard GRC block to do this?

I think with correllate_access_code_tag_bb you can actually build this, with a bit of brain-twisting, from existing blocks alone. (Note: this does rely on stream tags, because those are the right tool to mark special points in a sample flow.)
However, your simple case might really not be worth it. Simply follow the guided tutorials up to the point where you can write your own python block.
Use self.set_history(len(preamble)+len_payload) in the constructor of your new block to make sure you always see the last samples of the previous iteration in your current call to work, and simply search for the preamble in your sample stream, outputting only the len_payload following bytes when you find it, not producing anything if you don't find it.

Related

OCaml Marshal very large data structure

I would like to send a very large (~8GB) datastructure through the network, so I use the Marshal module to transform it into Bytes.
My problem is that the memory doubles, because we need to store both representations (initial data and Marshaled data).
Is there a simple way to Marshal into a Stream instead ? This would avoid to have the full Marshalled representation of the initial datastructure.
I thought of Marshaling to an out_channel in which I opened a pipe with a second thread and reading from the pipe in the main thread into s Stream, but I guess there might be a simpler solution.
Thanks !
Edit: Answer to a comment:
In the toplevel :
let a = Array.make (1024*1024*1024) 0. ;; (* Takes 8GB of RAM *)
let data = Marshal.to_bytes a [Marshal.Closures] ;; (* Takes an extra 8GB *)
It's not possible. You would have to modify the Marshal module to stream the data as it marshals something and to reconstruct the data in place without buffering it all first.
In the short run it might be simpler to implement your own specialized marshal function specific to your data. For an 8GiB array you might want to switch to using BigArray so you can send/recv the data without having to copy it.
Note: A 8GiB array will use 16GiB if the GC ever copies it, at least temporary.
From what I understand, MPI only allows to send data packets with a known size, not a stream of data. You could implement a custom stream type that split an incoming flow of data to packets of constant, small size (on close, you flush whatever remains in the buffer).
Also, you only can marshall arbitrary long data to a channel, because otherwise you take up too many space.
And then, you need to have a way to connect the channel to the stream, which AFAIK is not easily possible. Maybe you could start antoer ocaml process: the process would convert the flow of bytes (you can wrap a custom stream over Stream.of_channel) and send it through MPI. The main process would marshall data to the process's input channel.

Concurrently parsing records in a binary file in Go

I have a binary file that I want to parse. The file is broken up into records that are 1024 bytes each. The high level steps needed are:
Read 1024 bytes at a time from the file.
Parse each 1024-byte "record" (chunk) and place the parsed data into a map or struct.
Return the parsed data to the user and any error(s).
I'm not looking for code, just design/approach help.
Due to I/O constraints, I don't think it makes sense to attempt concurrent reads from the file. However, I see no reason why the 1024-byte records can't be parsed using goroutines so that multiple 1024-byte records are being parsed concurrently. I'm new to Go, so I wanted to see if this makes sense or if there is a better (faster) way:
A main function opens the file and reads 1024 bytes at a time into byte arrays (records).
The records are passed to a function that parses the data into a map or struct. The parser function would be called as a goroutine on each record.
The parsed maps/structs are appended to a slice via a channel. I would preallocate the underlying array managed by the slice as the file size (in bytes) divided by 1024 as this should be the exact number of elements (assuming no errors).
I'd have to make sure I don't run out of memory as well, as the file can be anywhere from a few hundred MB up to 256 TB (rare, but possible). Does this make sense or am I thinking about this problem incorrectly? Will this be slower than simply parsing the file in a linear fashion as I read it 1024 bytes at a time, or will parsing these records concurrently as byte arrays perform better? Or am I thinking about the problem all wrong?
I'm not looking for code, just design/approach help.
Cross-posted on Software Engineering
This is an instance of the producer-consumer problem, where the producer is your main function that generates 1024-byte records and the consumers should process these records and send them to a channel so they are added to the final slice. There are a few questions tagged producer-consumer and Go, they should get you started. As for what is fastest in your case, it depends on so many things that it is really not possible to answer. The best solution may be anywhere from a completely sequential implementation to a cluster of servers in which the records are moved around by RabbitMQ or something similar.

Tracking address when writing to flash

My system needs to store data in an EEPROM flash. Strings of bytes will be written to the EEPROM one at a time, not continuously at once. The length of strings may vary. I want the strings to be saved in order without wasting any space by continuing from the last write address. For example, if the first string of bytes was written at address 0x00~0x08, then I want the second string of bytes to be written starting at address 0x09.
How can it be achieved? I found that some EEPROM's write command does not require the address to be specified and just continues from lastly written point. But EEPROM I am using does not support that. (I am using Spansion's S25FL1-K). I thought about allocating part of memory to track the address and storing the address every time I write, but that might wear out flash faster. What is widely used method to handle such case?
Thanks.
EDIT:
What I am asking is how to track/save the address in a non-volatile way so that when next write happens, I know what address to start.
I never worked with this particular flash, but I've implemented something similar. Unfortunately, without knowing your constrains / priorities (memory or CPU efficient, how often write happens etc.) it is impossible to give a definite answer. Here are some techniques that you may want to consider. I don't know if they are widely used though.
Option 1: Write X bytes containing string length before the string. Then on initialization you could parse your flash: read the length n, jump n bytes forward; read the next byte. If it's empty (all ones for your flash according to the datasheet) then you got your first empty bit. Otherwise you've just read the length of the next string, so do the same over again.
This method allows you to quickly search for the last used sector, since the first byte of the used sector is guaranteed to have a value. The flip side here is overhead of extra n bytes (depending on the max string length) each time you write a string, and having to parse it to get the value (although this can only be done once on boot).
Option 2: Instead of prepending the size, append the unique "end-of-string" sequence, and then parse on boot for the last sequence before ones that represent empty flash.
Disadvantage here is longer parse, but you possibly could get away with just 1 byte-long overhead for each string.
Option 3 would be just what you already thought of: allocating a separate sector that would contain the value you need. To reduce flash wear you could also write these values back-to-back and search for the last one each time you boot. Also, you might consider the expected lifetime of the device that you program versus 100,000 erases that your flash can sustain (again according to the datasheet) - is wearing even a problem? That of course depends on how often data will be saved.
Hope that helps.

Using PARSE on a PORT! value

I tried using PARSE on a PORT! and it does not work:
>> parse open %test-data.r [to end]
** Script error: parse does not allow port! for its input argument
Of course, it works if you read the data in:
>> parse read open %test-data.r [to end]
== true
...but it seems it would be useful to be able to use PARSE on large files without first loading them into memory.
Is there a reason why PARSE couldn't work on a PORT! ... or is it merely not implemented yet?
the easy answer is no we can't...
The way parse works, it may need to roll-back to a prior part of the input string, which might in fact be the head of the complete input, when it meets the last character of the stream.
ports copy their data to a string buffer as they get their input from a port, so in fact, there is never any "prior" string for parse to roll-back to. its like quantum physics... just looking at it, its not there anymore.
But as you know in rebol... no isn't an answer. ;-)
This being said, there is a way to parse data from a port as its being grabbed, but its a bit more work.
what you do is use a buffer, and
APPEND buffer COPY/part connection amount
Depending on your data, amount could be 1 byte or 1kb, use what makes sense.
Once the new input is added to your buffer, parse it and add logic to know if you matched part of that buffer.
If something positively matched, you remove/part what matched from the buffer, and continue parsing until nothing parses.
you then repeat above until you reach the end of input.
I've used this in a real-time EDI tcp server which has an "always on" tcp port in order to break up a (potentially) continuous stream of input data, which actually piggy-backs messages end to end.
details
The best way to setup this system is to use /no-wait and loop until the port closes (you receive none instead of "").
Also make sure you have a way of checking for data integrity problems (like a skipped byte, or erroneous message) when you are parsing, otherwise, you will never reach the end.
In my system, when the buffer was beyond a specific size, I tried an alternate rule which skipped bytes until a pattern might be found further down the stream. If one was found, an error was logged, the partial message stored and a alert raised for sysadmin to sort out the message.
HTH !
I think that Maxim's answer is good enough. At this moment the parse on port is not implemented. I don't think it's impossible to implement it later, but we must solve other issues first.
Also as Maxim says, you can do it even now, but it very depends what exactly you want to do.
You can parse large files without need to read them completely to the memory, for sure. It's always good to know, what you expect to parse. For example all large files, like files for music and video, are divided into chunks, so you can just use copy|seek to get these chunks and parse them.
Or if you want to get just titles of multiple web pages, you can just read, let's say, first 1024 bytes and look for the title tag here, if it fails, read more bytes and try it again...
That's exactly what must be done to allow parse on port natively anyway.
And feel free to add a WISH in the CureCode database: http://curecode.org/rebol3/

Buffering data for delimiter separated blocks

There is a question I have been wondering about for ages and I was hoping someone could give me an answer to rest my mind.
Let's assume that I have an input stream (like a file/socket/pipe) and want to parse the incoming data. Let's assume that each block of incoming data is split by a newline, like most common internet protocols. This application could just as well be parsing html, xml or any other smart data structure. The point is that the data is split into logical blocks by a delimiter rather than a fixed length. How can I buffer the data to wait for the delimiter to appear?
The answer seems simple enough: just have a large enough byte/char array to fit the entire thing.
But what if the delimiter comes after the buffer is full? This is actually a question about how to fit a dynamic block of data in a fixed size block. I can only really think of a few alternatives:
Increase the buffer size when needed. This may require heavy memory reallocation, and may lead to resource exhaustion from specially crafted streams (or perhaps even denial of service in the case of sockets where we want to protect ourselves against exhaustion attacks and drop connections that try to exhaust resources...and an attacker starts sending fake, oversized, packets to trigger the protection).
Start overwriting old data by using a circular buffer. Perhaps not an ideal method since the logical block would become incomplete.
Dump new data when the buffer is full. However, this way the delimiter will never be found, so this choice is obviously not a good option.
Just make the fixed size buffer damn large and assume all incoming logical data blocks is within its bounds...and if it ever fills, just interpret the full buffer as a logical block...
In either case I feel we must assume that the logical blocks will never exceed a certain size...
Any thoughts on this topic? Obviously there must be a way since the higher level languages offer some sort of buffering mechanisms with their readLine() stream methods.
Is there any "best way" to solve this or is there always a tradeoff? I really appreciate all thoughts and ideas on this topic since this question has been haunting me everytime I have needed to write a parser of some sort.
There are normally two techniques for this
1) What I think readline uses - if the buffer fills return the data without the delimiter on the end
2) When the buffer fills, remeber it filled, keep reading until you get the delimiter and report an error (or truncate the record at the buffer size)
Options (2) and (3) are out as you are losing data in both cases. Option (4) of a huge fixed size buffer would not solve the problem as it is just not possible to know what size is large enough? Is it all the physical memory + swap space + the free space available in all disks everywhere in the known universe?
Resizing the buffer looks like the best solution. Say realloc to twice the size and continue writing. There is always a chance of a specially constructed stream like a DoS that tries to bring down the system. My first thought was so set an arbitrarily large size as the max_size for the buffer. However, if we could do that, we could just set that as the size of the large buffer. So, resizing the buffer looks like the best option to me.
If the protocol or you do not define a upper bound for the length of each block then I don't see how you can prevent memory exhaustion edge cases.
Assuming that there is an upper bound using a fixed size block seems like a good approach for reasonably sized limits.
If the limits are high enough that a single fixed buffer will be inefficient then I would suggest using a data structure that is implemented internally as a linked list of fixed size buffers.
Why do you have to wait to start processing?
Generally alternative 4 is sound. It doesn't however, require an "assumption", rather a definition. You simply declare that blocks are smaller than 8K and be done with it. It's not difficult to do.
Further, there's alternative 5: Start processing partial buffers. This works unless you have designed a truly pathological protocol that sends critical data at the very end of the block.
HTML, XML, JSON/YAML, etc., can all be parsed incrementally. You don't require a delimeter to do useful processing.

Resources