When using stream's read-byte what kind of byte am I reading - stream

If I understood this correctly, Common Lisp was standardized at a time when there were many different architectures with different opinions on the size of a byte. To that end, Common Lisp allows us to define the size of a byte.
For example, I can create an array of 8-bit bytes like this:
(make-array 10 :element-type '(unsigned-byte 8))
This works great and so far this knowledge has been enough for whatever I've been doing.
Today though I've been getting into using binary streams and the read-byte function confuses me.
The CLHS says that read-byte reads and returns one byte from stream.
But what kind of byte is this? The default platform byte? Can I specify this in any way?
Thanks folks

For example, OPEN has an :element-type argument, which is implementation-defined. Your Common Lisp implementation's documentation has more information about it. As said in the comments, (unsigned-byte 8) describes a stream of octets, which happens to be the size of a byte in most (all?) implementations. Thanks @Xach.
See also flexi-streams, which has make-external-format, and binary-types for custom binary encodings.

It is whatever the element type of the stream you read from indicates.

Related

Running the huffman encoding algorithm on an empty file

This may be a stupid question, but compressing an empty file doesn't make any sense, right? The Huffman encoding algorithm on an empty file wouldn't work, because it relies on the fact that there have to be at least 2 nodes in the priority queue. If we run the algorithm on an empty file, the only node we would get is the one corresponding to EOF.
Ya, that's right, it doesn't make much sense to run the Huffman encoding on it. Depending on the details of the implementation, it may not crash.
But why would you try to compress an empty file?
You need to somehow encode, at the start of the compressed data, which symbols correspond to which Huffman codes. It is in that representation that the number of symbols would be indicated. If there is only one symbol, which has to be EOF per your description, then the Huffman code for it is implied to be zero bits long: with only one symbol, you need zero bits to represent it.

Calculate CRC 8 in objective c

I have an app in which I need to send a packet to an external device. This packet has a CRC before the end of the message. The CRC has to be separated into CRCH and CRCL.
For example: CRC = 0x5B so CRCH should be 0x35 (ASCII representation of 5) and CRCL should be 0x42 (ASCII representation of B).
I searched on the internet and found several functions, in C or other languages, to create a CRC32, but my device needs a CRC8. How can I create a CRC8 in Objective-C? Can you help me find a way to do this?
Surprising how this rather simple question is still not answered.
First, you need to separate the problems in your question. CRCH and CRCL are just hex conversion, and that's easy to do (there are lots of examples on the internet, too). In most cases, you just need to compare the CRC you received with the one you calculated. So, you just need to convert them to the same form, e.g. convert the CRC you calculated to text using sprintf and the %02X format and compare it with the CRC you received (as text).
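A minimal sketch of that hex split in plain C (which compiles unchanged inside an Objective-C .m file), reusing the 0x5B value from the question; the variable names here are just illustrative:

    #include <stdio.h>

    int main(void)
    {
        unsigned char crc = 0x5B;      /* example CRC value from the question */
        char hex[3];
        sprintf(hex, "%02X", crc);     /* "5B" */
        char crch = hex[0];            /* 0x35, ASCII '5' */
        char crcl = hex[1];            /* 0x42, ASCII 'B' */
        printf("CRCH=0x%02X CRCL=0x%02X\n", crch, crcl);
        return 0;
    }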
The second part is the actual CRC. This is a little bit trickier. Your options are as follows:
1) The easiest is to rename your .m file to .mm and use the CRC library from Boost (C++). It's just a header include, so it won't affect the rest of your code in any way, and you can even put it in a separate file, so you'll have a C function which uses Boost under the hood.
You might need to find the parameters for your CRC though. For that, see this excellent resource: http://reveng.sourceforge.net/crc-catalogue/
2) You can write your own implementation. Surprisingly, there are plenty of examples for particular algorithms on the internet, but they are often optimized for a particular CRC and are hard to adapt to other algorithms.
So your best bet is probably to start with the "A Painless Guide to CRC Error Detection Algorithms" article by Ross Williams. It also includes examples in C.
It can be complicated to get your head around all the technical stuff and explanations there, though.
So, as a shortcut, I'd suggest looking at my own implementation in Java here. It's obviously not Objective-C, but I looked through it and you should be able to just copy and paste it into your .m file and compile it, possibly adjusting a few types.
You'll need the public static long calculateCRC(Parameters crcParams, byte[] data) and private static long reflect(long in, int count) functions there, and the Parameters class, which looks scarier but should just become a struct in your case:
struct Parameters
{
    int  width;      // Width of the CRC expressed in bits
    long polynomial; // Polynomial used in this CRC calculation
    bool reflectIn;  // Refin: whether input bytes should be reflected
    bool reflectOut; // Refout: whether the final CRC value should be reflected
    long init;       // Initial value for the CRC calculation
    long finalXor;   // Value XORed into the CRC before returning the result
};
You might also want to adjust the types there to a shorter unsigned type (Java has no unsigned types), but it should work perfectly well as is.
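If you'd rather avoid both Boost and porting the Java code, a bit-by-bit CRC-8 is short enough to write directly. The following is only a sketch: it assumes the plain CRC-8 variant (polynomial 0x07, init 0x00, no reflection, no final XOR), so check the catalogue linked above for the parameters your device actually expects:

    #include <stdint.h>
    #include <stddef.h>

    /* Bit-by-bit CRC-8; assumes poly 0x07, init 0x00, no reflection, no final XOR.
       Swap in the parameters your device actually uses. */
    static uint8_t crc8(const uint8_t *data, size_t len)
    {
        uint8_t crc = 0x00;                              /* init value */
        for (size_t i = 0; i < len; i++) {
            crc ^= data[i];                              /* feed the next byte into the register */
            for (int bit = 0; bit < 8; bit++) {
                if (crc & 0x80)
                    crc = (uint8_t)((crc << 1) ^ 0x07);  /* shift out the top bit, apply the polynomial */
                else
                    crc <<= 1;
            }
        }
        return crc;                                      /* no final XOR in this variant */
    }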

ASCII Representation of Hexadecimal

I have a string from which, by using string.format("%02X", char) on each character, I've produced the following:
74657874000000EDD37001000300
In the end, I'd like that string to look like the following:
t e x t NUL NUL NUL í Ó p SOH NUL ETX NUL (spaces are there just to clarify the characters desired in the example).
I've tried \x..(hex#) and string.char(0x..(hex#)) (where (hex#) is the alphanumeric representation of my desired character), and I am still having issues getting the result I'm looking for. After reading another thread about this topic, what is the way to represent a unichar in lua, and the links provided in the answers, I still don't fully understand what I need to do in my final code to make this work.
I'm looking for some help in understanding an approach that would let me achieve the desired result described above.
ETA:
Well I thought that I had fixed it with the following code:
function hexToAscii(input)
    local convString = ""
    for char in input:gmatch("(..)") do
        convString = convString..(string.char("0x"..char))
    end
    return convString
end
It appeared to work, but I didn't think about characters above 127. Rookie mistake. Now I'm unsure how I can get the additional characters up to 255 to display their ASCII values.
I did the following to check since I couldn't truly "see" them in the file.
function asciiSub(input)
    input = input:gsub(string.char(0x00), "<NUL>") -- suggested by a coworker
    print(input)
end
I did a few gsub calls to substitute in other characters, and my file comes back with the replacement strings. But when I ran into characters in the extended ASCII table, that approach fell apart.
Can anyone assist me in understanding a fix or new approach to this problem? As I've stated before, I read other topics on this and am still confused as to the best approach towards this issue.
The simple way to transform a base16-encoded string is just to
function unhex( input )
    return (input:gsub( "..", function(c)
        return string.char( tonumber( c, 16 ) )
    end))
end
This is basically what you have, just a bit cleaner. (There's no need to say "(..)", ".." is enough – if you specify no captures, you'll automatically get the whole match. And while it might work if you write string.char( "0x"..c ), it's just evil – you concatenate lots of strings and then trigger the automatic conversion to numbers. Much better to just specify the base when explicitly converting.)
The resulting string should be exactly what went into the hex-dumper, no matter the encoding.
If you cannot correctly display the result, your viewer will also be unable to display the original input. If you used different viewers for the original input and the resulting output (e.g. a text editor and a terminal), try writing the output to a file instead and looking at it with the same viewer you used for the original input, then the two should be exactly the same.
Getting viewers that assume different encodings (e.g. one of the "old" 8-bit code pages or one of the many versions of Unicode) to display the same thing will require conversion between different formats, which tends to be quite complicated or even impossible. As you did not mention what encodings are involved (nor any other information like OS or programs used that might hint at the likely encodings), this could be just about anything, so it's impossible to say anything more specific on that.
You actually have a couple of problems:
First, make sure you know the meaning of the term character encoding, and that you know the difference between characters and bytes. A popular post on the topic is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then, what encoding was used for the bytes you just received? You need to know this, otherwise you don't know what byte 234 means. For example it could be ISO-8859-1, in which case it is U+00EA, the character ê.
The characters 0 to 31 are control characters (eg. 0 is NUL). Use a lookup table for these.
Then, displaying the characters on the terminal is the hard part. There is no platform-independent way to display ê on the terminal. It may well be impossible with the standard print function. If you can't figure this step out you can search for a question dealing specifically with how to print Unicode text from Lua.

Is The Effectiveness Of Huffman Coding Limited?

My problem is that I have 100,000+ different elements, and as I understand it, Huffman works by assigning the most common element a 0 code, the next 10, the next 110, then 1110, 11110 and so on. My question is: if the code for the nth element is n bits long, then surely once I have passed the 32nd term it is more space-efficient to just send 32-bit data types as they are, such as ints for example? Have I missed something in the methodology?
Many thanks for any help you can offer. My current implementation works by doing
code = (code << 1) + 2;
to generate each new code (which seems to be correct!), but the only way I could encode over 100,000 elements would be to have an int[] in a makeshift new data type, where to access the value we would read from the int array as one continuous long symbol... that's not as space-efficient as just transporting a 32-bit int? Or is it more a case of Huffman's usefulness being in its prefix codes, and being able to determine each unique value in a continuous bit stream unambiguously?
Thanks
Your understanding is a bit off - take a look at http://en.wikipedia.org/wiki/Huffman_coding. And you have to pack the encoded bits into machine words in order to get compression - Huffman encoded data can best be thought of as a bit-stream.
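To make the bit-stream point concrete, here is a minimal, hypothetical bit-writer sketch in C (the names are made up): variable-length codes are appended bit by bit and only whole packed bytes are written out, so a 3-bit code really does cost 3 bits rather than a whole int:

    #include <stdint.h>
    #include <stdio.h>

    /* Minimal bit writer: packs variable-length codes MSB-first into bytes. */
    typedef struct {
        FILE   *out;
        uint8_t buffer;   /* bits accumulated so far */
        int     count;    /* number of bits currently in the buffer */
    } BitWriter;

    static void put_bits(BitWriter *w, uint32_t code, int nbits)
    {
        for (int i = nbits - 1; i >= 0; i--) {
            w->buffer = (uint8_t)((w->buffer << 1) | ((code >> i) & 1));
            if (++w->count == 8) {            /* a full byte: flush it */
                fputc(w->buffer, w->out);
                w->buffer = 0;
                w->count = 0;
            }
        }
    }

    static void flush_bits(BitWriter *w)
    {
        if (w->count > 0)                     /* pad the final partial byte with zeros */
            fputc((uint8_t)(w->buffer << (8 - w->count)), w->out);
        w->buffer = 0;
        w->count = 0;
    }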
You seem to understand the principle of prefix codes.
Could you tell us a little more about these 100,000+ different elements you mention?
The fastest prefix codes -- universal codes -- do, in fact, involve a series of bit sequences that can be pre-generated without regard to the actual symbol frequencies. Compression programs that use these codes, as you mentioned, associate the most-frequent input symbol with the shortest bit sequence, the next-most-frequent input symbol with the next-shortest bit sequence, and so on.
What you describe is one particular kind of prefix code: unary coding.
Another popular variant of the unary coding system assigns elements in order of frequency to the fixed codes
"1", "01", "001", "0001", "00001", "000001", etc.
Some compression programs use another popular prefix code: Elias gamma coding.
The Elias gamma coding assigns elements in order of frequency to the fixed set of codewords
1
010
011
00100
00101
00110
00111
0001000
0001001
0001010
0001011
0001100
0001101
0001110
0001111
000010000
000010001
000010010
...
The 32nd Elias gamma codeword is 11 bits long, roughly a third as long as the 32nd unary codeword.
The 100,000th Elias gamma codeword will be around 32 bits long.
If you look carefully, you can see that each Elias gamma codeword can be split into 2 parts -- the first part is more or less the unary code you are familiar with. That unary code tells the decoder how many more bits follow afterward in the rest of that particular Elias gamma codeword.
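To see the two parts explicitly, here is a small illustrative C sketch (the function and variable names are made up) that builds the Elias gamma codeword for element number n >= 1 and reproduces the codewords listed above:

    #include <stdio.h>
    #include <stdint.h>

    /* Write the Elias gamma codeword for n (n >= 1) into out as '0'/'1' characters. */
    static void elias_gamma(uint32_t n, char *out)
    {
        int len = 0;
        for (uint32_t tmp = n; tmp > 1; tmp >>= 1)
            len++;                                     /* len = floor(log2(n)) */
        int pos = 0;
        for (int i = 0; i < len; i++)
            out[pos++] = '0';                          /* unary part: len leading zeros */
        for (int bit = len; bit >= 0; bit--)
            out[pos++] = ((n >> bit) & 1) ? '1' : '0'; /* binary part: n in len+1 bits */
        out[pos] = '\0';
    }

    int main(void)
    {
        char buf[72];
        for (uint32_t n = 1; n <= 18; n++) {           /* prints the codewords shown above */
            elias_gamma(n, buf);
            printf("%2u -> %s\n", (unsigned)n, buf);
        }
        return 0;
    }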
There are many other kinds of prefix codes.
Many people (confusingly) refer to all prefix codes as "Huffman codes".
When compressing some particular data file, some prefix codes do better at compression than others.
How do you decide which one to use?
Which prefix code is the best for some particular data file?
The Huffman algorithm -- if you neglect the overhead of the Huffman frequency table -- chooses exactly the best prefix code for each data file.
There is no singular "the" Huffman code that can be pre-generated without regard to the actual symbol frequencies.
The prefix code chosen by the Huffman algorithm is usually different for different files.
The Huffman algorithm doesn't compress very well when we really do have 100,000+ unique elements --
the overhead of the Huffman frequency table becomes so large that we often can find some other "suboptimal" prefix code that actually gives better net compression.
Or perhaps some entirely different data compression algorithm might work even better in your application.
The "Huffword" implementation seems to work with around 32,000 or so unique elements,
but the overwhelming majority of Huffman code implementations I've seen work with around 257 unique elements (the 256 possible byte values, and the end-of-text indicator).
You might consider somehow storing your data on a disk in some raw "uncompressed" format.
(With 100,000+ unique elements, you will inevitably end up storing many of those elements in 3 or more bytes).
Those 257-value implementations of Huffman compression will be able to compress that file;
they re-interpret the bytes of that file as 256 different symbols.
My question is, if the code for the nth element is n bits long then
surely once I have passed the 32nd term it is more space-efficient to
just send 32-bit data types as they are, such as ints for example?
Have I missed something in the methodology?
One of the more counter-intuitive features of prefix codes is that some symbols (the rare symbols) are "compressed" into much longer bit sequences. If you actually have 2^8 unique symbols (all possible 8 bit numbers), it is not possible to gain any compression if you force the compressor to use prefix codes limited to 8 bits or less. By allowing the compressor to expand rare values -- to use more than 8 bits to store a rare symbol that we know can be stored in 8 bits -- that frees up the compressor to use less than 8 bits to store the more-frequent symbols.
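As an illustrative, made-up example of why that trade-off pays off: suppose one byte value accounts for 90% of the input and the other 255 values share the remaining 10%. A prefix code can give the frequent value a 1-bit code and each rare value a 9-bit code (255 such codewords fit under a 1-bit prefix), so the expected cost is about 0.9 × 1 + 0.1 × 9 = 1.8 bits per symbol, well below the 8 bits of the raw encoding, even though every rare symbol individually got longer.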
related:
Maximum number of different numbers, Huffman Compression

Using Haskell's Parsec to parse binary files?

Parsec is designed to parse textual information, but it occurs to me that Parsec could also be suitable to do binary file format parsing for complex formats that involve conditional segments, out-of-order segments, etc.
Is there an ability to do this or a similar, alternative package that does this? If not, what is the best way in Haskell to parse binary file formats?
The key tools for parsing binary files are:
Data.Binary
cereal
attoparsec
Binary is the most general solution, Cereal can be great for limited data sizes, and attoparsec is perfectly fine for e.g. packet parsing. All of these are aimed at very high performance, unlike Parsec. There are many examples on hackage as well.
You might be interested in AttoParsec, which was designed for this purpose, I think.
I've used Data.Binary successfully.
It works fine, though you might want to use Parsec 3, Attoparsec, or Iteratees. Parsec's reliance on String as its intermediate representation may bloat your memory footprint quite a bit, whereas the others can be configured to use ByteStrings.
Iteratees are particularly attractive because it is easier to ensure they won't hold onto the beginning of your input, and they can be fed chunks of data incrementally as they become available. This prevents you from having to read the entire input into memory in advance and lets you avoid other nasty workarounds like lazy IO.
The best approach depends on the format of the binary file.
Many binary formats are designed to make parsing easy (unlike text formats that are primarily to be read by humans). So any union data type will be preceded by a discriminator that tells you what type to expect, all fields are either fixed length or preceded by a length field, and so on. For this kind of data I would recommend Data.Binary; typically you create a matching Haskell data type for each type in the file, and then make each of those types an instance of Binary. Define the "get" method for reading; it returns a "Get" monad action which is basically a very simple parser. You will also need to define a "put" method.
On the other hand if your binary data doesn't fit into this kind of world then you will need attoparsec. I've never used that, so I can't comment further, but this blog post is very positive.
