What does 'lew' stand for in 'lew2' or 'lew4'? - character-encoding

I'm seeing the term 'lew2' and 'lew4' being used in reference to character size in certain files. I know that the number represents how many bytes are used to store certain types of characters (maybe wide chars?), but I'm not sure what the 'lew' part stands for. My best guess is 'length of wide'. Can anyone enlighten me?

My guess would be Little Endian Word 2 Bytes (or 4 Bytes), as opposed to Big Endian.
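If that guess is right (and it is only a guess, nothing here confirms what 'lew' expands to), 'lew2' would be a 2-byte little-endian word and 'lew4' a 4-byte one. A minimal Python sketch of what the byte order changes, using the standard struct module; the value 0x1234 is just an illustration:

```python
import struct

value = 0x1234

# "<H": little-endian unsigned 16-bit word, least significant byte first.
le2 = struct.pack("<H", value)
# ">H": big-endian unsigned 16-bit word, most significant byte first.
be2 = struct.pack(">H", value)
# "<I": little-endian unsigned 32-bit word.
le4 = struct.pack("<I", value)

print(le2.hex())  # 3412
print(be2.hex())  # 1234
print(le4.hex())  # 34120000
```

Reading the bytes back with the wrong byte order gives a different number, which is why a file format has to state it.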

Related

How do I get a number from bytes?

I am currently working with Lua 5.1 bytecode. I've gotten pretty far and understand a lot, but I am stuck on a question about instructions and numbers. I understand that the sizes of the instruction and number types are defined in the header, but I am not sure how to get the actual number from the 4 bytes (or whatever size is specified in the header).
I've looked at output from ChunkSpy and I don't really understand how it went from those bytes to the number. I'd look in the source but I don't want to just copy it, I want to understand it. If anyone could tell me a bit about it or even point me in the right direction I'd be very grateful.
Thank you!
From A No-Frills Introduction to Lua 5.1 VM Instructions, numbers are stored in the constants pool.
The first byte is 3=LUA_TNUMBER.
The next bytes are the number, with the length as given in the header. Interpretation is based on the length, byte order and the integral flag as given in the header.
Typically, non-integral with 8 bytes means IEEE 754 64-bit double.
Deserializing bytes to double involves extracting the bits for the mantissa and exponent, and combining them with arithmetic operations. Perhaps you want that as a challenge and to start from a description of the standard: What Every Computer Scientist Should Know About Floating-Point Arithmetic, "Formats and Operations" section.
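As a rough sketch of that arithmetic, assuming the header says numbers are 8 bytes and non-integral (so IEEE 754 binary64), something like the following Python would do it. The function name decode_double and its little_endian parameter are made up for this example; in practice struct.unpack("<d", raw) does the same in one call.

```python
import math

def decode_double(raw, little_endian=True):
    """Decode 8 bytes as an IEEE 754 binary64 value by hand (illustrative helper)."""
    if len(raw) != 8:
        raise ValueError("expected exactly 8 bytes")
    bits = int.from_bytes(raw, "little" if little_endian else "big")
    sign = -1.0 if bits >> 63 else 1.0
    exponent = (bits >> 52) & 0x7FF       # 11 exponent bits, biased by 1023
    mantissa = bits & ((1 << 52) - 1)     # 52 fraction bits
    if exponent == 0x7FF:                 # all ones: infinity or NaN
        return sign * math.inf if mantissa == 0 else math.nan
    if exponent == 0:                     # zero and subnormals: no implicit leading 1
        return sign * math.ldexp(mantissa, -1074)
    # Normal numbers: implicit leading 1, then scale by 2^(exponent - 1023).
    return sign * math.ldexp(1 + mantissa / 2 ** 52, exponent - 1023)

# The little-endian bytes of 1.0: sign 0, exponent 0x3FF, fraction 0.
print(decode_double(bytes.fromhex("000000000000f03f")))  # 1.0
```

The byte order flag from the header decides whether the bytes are read as little or big endian before the bit fields are split out.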

How is data written to memory

When we store data in memory, how is it stored so that its type can be recognized when it is loaded again?
What I want to ask is how data types such as natural numbers, integers, characters, etc. are stored in memory so that they can be recognized easily later when read back.
When we look at memory, what we see are hex numbers.
How can we relate these hex numbers to an ASCII value, an integer value, or anything else?
Since all of your data is written in binary, there isn't much difference between how the char a is written and how the int 97 is written: they are the same binary string (at least in the last 8 bits). That said, when you read from memory you read a particular data type, and that type tells you how to interpret the data.
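A small Python illustration of that point (the variable names are just for this example): the same stored bytes come out as a number or as text depending only on how the reader chooses to interpret them.

```python
import struct

raw = bytes([0x61])              # one byte in "memory": 0110 0001

as_int = raw[0]                  # interpret it as an unsigned integer
as_char = raw.decode("ascii")    # interpret the same byte as an ASCII character

print(as_int)   # 97
print(as_char)  # a

# Four bytes can likewise be read as text or as one 32-bit integer.
word = b"ABCD"
as_text = word.decode("ascii")
as_uint32 = struct.unpack("<I", word)[0]   # little-endian unsigned 32-bit
print(as_text, hex(as_uint32))  # ABCD 0x44434241
```

Nothing in the bytes themselves records which reading is "right"; that information lives in the program doing the reading.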
Memory does not operate in terms of "character" or "integer", these are high-level concepts that assume an abstract machine.
Typically, but not necessarily, a character is just an integer with a smaller size, often 8 bits (but a character could as well be 32 bits!) which represents one symbol or letter, rather than a discrete number. In some cases, a character may even be encoded using a variable length.
Memory operates in terms of bits that are organized in bytes (smallest directly addressable unit) or words. These are -- unbeknownst to you -- organized in banks. The hardware typically allows access in units called "cache lines", but this is something that happens secretly behind your back.
In assembler language, you can typically access bytes and power-of-two multiples of these, sometimes with special alignment requirements (there's usually also bit operations, but while they only change one bit, they still work on whole bytes/words).
All of that is, however, not very interesting, and also widely irrelevant for you. It is first and foremost the compiler's (or interpreter's) job to make sure that when you speak of an integer or a character, that whatever you want comes out at the other end. It is also the tool's responsibility to convert one into another if possible, and produce an error if not possible.
You do not even know for certain whether the value of an integer or a character has a memory location at all (it may very well be stored in a register) unless you explicitly enforce that.
You cannot distinguish a byte at some memory location that came from a "character" from a byte that belongs to an "integer". They look just the same.
And while it is possible to read the raw bytes of one type as another type in most languages, this is not something you normally need to do (or should do).

What ASCII character uses up the most memory?

I've been thinking about ASCII and memory lately and couldn't find a solid answer to this question.
When a script is compiled, do ASCII characters use up different amounts of memory? And if so, which ASCII character uses up the most memory?
ASCII is a fixed-width character encoding in which each character is represented by 7 bits. So, to answer your question, all ASCII characters take the same amount of memory regardless of the implementation.
Because of the way processor architectures are designed, we typically store an ASCII character in a single byte (aligned memory access is a lot faster than the bitwise operations needed to pack 7-bit values; see tripleee's comment). This means that any ASCII character typically takes up one byte on common computing platforms.
In contrast to this are variable-width encodings such as UTF-8. For future readers who come across this page, it is worth noting that the ASCII characters 0 through 127 are represented by the same bytes in UTF-8 as in ASCII. This was done to maintain backwards compatibility. Therefore, in UTF-8, the ASCII characters 0 through 127 take up less space than any other character.
Furthermore, I haven't heard of a mainstream compiler/interpreter that compresses strings stored with ASCII characters. This would impose a runtime performance hit that many would find unacceptable. Such a space optimization would therefore be left to the user to perform.
The ASCII Wikipedia page has a good summary of the ASCII character set.
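To see the variable width in practice, here is a quick Python check; the sample characters are arbitrary picks for illustration:

```python
samples = ["A", "~", "é", "€", "😀"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Print the code point, the encoded size, and the raw bytes in hex.
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")

utf8_sizes = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_sizes)  # [1, 1, 2, 3, 4]

# For the ASCII range, the ASCII and UTF-8 encodings are byte-for-byte identical.
same_bytes = "A".encode("ascii") == "A".encode("utf-8")
print(same_bytes)  # True
```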
﷽ is probably the most space-consuming character. I'm not sure about the encoding, but it is a huge single character. It is called "Basmala" and it means "In the name of Allah, the Most Gracious, the Most Merciful."
According to a Reddit user who has now deleted their account: “It's an Arabic ligature commonly used in Urdu. It was added so someone using an Urdu keyboard can type it easier.”
I love to use this in Discord raids, because imagine 2000 Basmala characters, vs 2000 regular characters. It fills their server up a LOT. Glad I could help.
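For what it's worth, the encoded size can be checked directly. A small Python sketch (the choice of encodings to compare is mine, not from the answer above): the character is a single code point, U+FDFD, so its storage cost depends only on the encoding and not on how wide it is drawn on screen.

```python
basmala = "\ufdfd"  # ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM

utf8_size = len(basmala.encode("utf-8"))
utf16_size = len(basmala.encode("utf-16-le"))
utf32_size = len(basmala.encode("utf-32-le"))

print(len(basmala))                        # 1 (a single code point)
print(utf8_size, utf16_size, utf32_size)   # 3 2 4
```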

How much space does a tab take?

I want to quantify the saving of space I can get by changing the format of a file.
I have a sparse matrix stored in a text file (30% sparsity). Columns are separated by tabs.
Following an idea in an SO answer, I will change the format to row_id, col_id for the non zero terms only. I know how much space a float takes, but my question is: how much space does a tab take?
CouchDeveloper in his comment is correct. It's impossible to tell from the data you provide.
In a single byte character set encoding you'd save 1 byte per separator from the current ", ".
In a multibyte encoding it'd depend on the way each of those characters is encoded, you could theoretically even lose space. Say a tab is encoded as 4 bytes, a comma and space as 1 each, you'd end up taking 2 more bytes per separator.
Unless you have many separators and relatively very little data, I'd not worry one way or another, it'd be micro optimisation.
If you do, a binary encoding scheme might be more relevant.
1 byte, but significantly less if you're using compression (based on how common they will be, less than a bit on average). Use compression.
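Both claims are easy to check on a toy file. A rough Python sketch, where the matrix rows are made-up sample data and zlib stands in for whatever compressor you would actually use:

```python
import zlib

# A tab is one byte in ASCII and UTF-8, but two bytes in UTF-16.
tab_ascii = len("\t".encode("ascii"))
tab_utf8 = len("\t".encode("utf-8"))
tab_utf16 = len("\t".encode("utf-16-le"))
print(tab_ascii, tab_utf8, tab_utf16)  # 1 1 2

# Made-up tab-separated rows standing in for the matrix file.
rows = ["\t".join(["0.0", "1.5", "0.0", "2.25"]) for _ in range(1000)]
text = "\n".join(rows).encode("ascii")

compressed = zlib.compress(text)
print(len(text), len(compressed))  # compressed size is far smaller for repetitive data
```

Real matrix data is less repetitive than this sample, so the ratio will be worse, but the separators themselves compress very well.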

Is The Effectiveness Of Huffman Coding Limited?

My problem is that I have 100,000+ different elements and, as I understand it, Huffman works by assigning the most common element the code 0, the next 10, then 110, 1110, 11110 and so on. My question is: if the code for the nth element is n bits long, then surely once I have passed the 32nd term it is more space efficient to just send 32-bit data types as they are, such as ints? Have I missed something in the methodology?
Many thanks for any help you can offer. My current implementation works by doing
code = (code << 1) + 2;
to generate each new code (which seems to be correct!), but the only way I could encode over 100,000 elements would be to have an int[] in a makeshift new data type, where to access the value we would read from the int array as one continuous long symbol... that's not as space efficient as just transporting a 32-bit int? Or is it more a case of Huffman's use being with its prefix codes, and being able to determine each unique value in a continuous bit stream unambiguously?
Thanks
Your understanding is a bit off - take a look at http://en.wikipedia.org/wiki/Huffman_coding. And you have to pack the encoded bits into machine words in order to get compression - Huffman encoded data can best be thought of as a bit-stream.
You seem to understand the principle of prefix codes.
Could you tell us a little more about these 100,000+ different elements you mention?
The fastest prefix codes -- universal codes -- do, in fact, involve a series of bit sequences that can be pre-generated without regard to the actual symbol frequencies. Compression programs that use these codes, as you mentioned, associate the most-frequent input symbol with the shortest bit sequence, the next-most-frequent input symbol with the next-shortest bit sequence, and so on.
What you describe is one particular kind of prefix code: unary coding.
Another popular variant of the unary coding system assigns elements in order of frequency to the fixed codes
"1", "01", "001", "0001", "00001", "000001", etc.
Some compression programs use another popular prefix code: Elias gamma coding.
The Elias gamma coding assigns elements in order of frequency to the fixed set of codewords
1
010
011
00100
00101
00110
00111
0001000
0001001
0001010
0001011
0001100
0001101
0001110
0001111
000010000
000010001
000010010
...
The 32nd Elias gamma codeword is 11 bits long, about a third as long as the 32nd unary codeword (32 bits).
The 100,000th Elias gamma codeword is 33 bits long.
If you look carefully, you can see that each Elias gamma codeword can be split into 2 parts -- the first part is more or less the unary code you are familiar with. That unary code tells the decoder how many more bits follow afterward in the rest of that particular Elias gamma codeword.
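Here is a short Python sketch of both codes, to make the two-part structure concrete. The function names are mine, and the decoder assumes a well-formed bit string; a real implementation would pack the bits into bytes rather than use strings of "0" and "1".

```python
def unary(n):
    """n-th codeword (n >= 1) of the variant "1", "01", "001", ..."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return "0" * (n - 1) + "1"

def elias_gamma(n):
    """n-th Elias gamma codeword (n >= 1): a zero prefix, then n in binary."""
    if n < 1:
        raise ValueError("n must be >= 1")
    binary = bin(n)[2:]                      # always starts with a 1 bit
    return "0" * (len(binary) - 1) + binary  # prefix says how many bits follow the 1

def decode_elias_gamma(bits):
    """Split a concatenation of Elias gamma codewords back into integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                # count the unary-style prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values

print([elias_gamma(n) for n in range(1, 6)])  # ['1', '010', '011', '00100', '00101']
print(len(unary(32)), len(elias_gamma(32)))   # 32 11
print(len(elias_gamma(100000)))               # 33

stream = "".join(elias_gamma(n) for n in [5, 1, 17])
print(decode_elias_gamma(stream))             # [5, 1, 17]
```

No separators are needed between codewords in the stream; the prefix property is what lets the decoder find the boundaries.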
There are many other kinds of prefix codes.
Many people (confusingly) refer to all prefix codes as "Huffman codes".
When compressing some particular data file, some prefix codes do better at compression than others.
How do you decide which one to use?
Which prefix code is the best for some particular data file?
The Huffman algorithm -- if you neglect the overhead of the Huffman frequency table -- chooses exactly the best prefix code for each data file.
There is no singular "the" Huffman code that can be pre-generated without regard to the actual symbol frequencies.
The prefix code chosen by the Huffman algorithm is usually different for different files.
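To show how the code depends on the frequencies, here is a minimal sketch of the Huffman construction in Python. It is not taken from any particular implementation; huffman_code is a name I made up, and it returns bit strings instead of packed bits to keep the example short.

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Return {symbol: bitstring} built from the symbol frequencies in data."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:                        # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: partial_code}).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)     # the two least frequent subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

skewed = huffman_code("aaaabbc")
uniform = huffman_code("abcd")
print({s: len(c) for s, c in skewed.items()})   # {'c': 2, 'b': 2, 'a': 1}
print({s: len(c) for s, c in uniform.items()})  # every symbol gets 2 bits
```

With skewed frequencies the common symbol gets a short code; with uniform frequencies the result degenerates to a fixed-width code, so there is nothing to gain.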
The Huffman algorithm doesn't compress very well when we really do have 100,000+ unique elements --
the overhead of the Huffman frequency table becomes so large that we often can find some other "suboptimal" prefix code that actually gives better net compression.
Or perhaps some entirely different data compression algorithm might work even better in your application.
The "Huffword" implementation seems to work with around 32,000 or so unique elements,
but the overwhelming majority of Huffman code implementations I've seen work with around 257 unique elements (the 256 possible byte values, and the end-of-text indicator).
You might consider somehow storing your data on a disk in some raw "uncompressed" format.
(With 100,000+ unique elements, you will inevitably end up storing many of those elements in 3 or more bytes).
Those 257-value implementations of Huffman compression will be able to compress that file;
they re-interpret the bytes of that file as 256 different symbols.
My question is, if the code for the nth element is n-bits long then
surely once I have passed the 32nd term it is more space efficient to
just send 32-bit data types as they are, such as ints for example?
Have I missed something in the methodology?
One of the more counter-intuitive features of prefix codes is that some symbols (the rare symbols) are "compressed" into much longer bit sequences. If you actually have 2^8 unique symbols (all possible 8 bit numbers), it is not possible to gain any compression if you force the compressor to use prefix codes limited to 8 bits or less. By allowing the compressor to expand rare values -- to use more than 8 bits to store a rare symbol that we know can be stored in 8 bits -- that frees up the compressor to use less than 8 bits to store the more-frequent symbols.
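One way to see this trade-off is Kraft's inequality (my framing, not something from the answer above): a prefix code with codeword lengths l_1, ..., l_n exists only if the sum of 2^(-l_i) is at most 1. A small Python sketch with 256 symbols, where kraft_sum is a helper name made up for the example:

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Sum of 2^-length over all codewords; must be <= 1 for a prefix code."""
    return sum(Fraction(1, 2 ** n) for n in lengths)

# 256 symbols at exactly 8 bits each use up the whole budget.
fixed = [8] * 256
print(kraft_sum(fixed))            # 1

# Shortening one symbol to 7 bits with nothing else changed goes over budget,
# so no prefix code with these lengths exists.
too_short = [7] + [8] * 255
print(kraft_sum(too_short) > 1)    # True

# Letting two rare symbols grow to 9 bits pays for the one 7-bit symbol.
balanced = [7] + [8] * 253 + [9, 9]
print(kraft_sum(balanced))         # 1
```

So any symbol that gets fewer than 8 bits has to be paid for by other symbols getting more than 8, which only helps when the short codes go to the frequent symbols.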
related:
Maximum number of different numbers, Huffman Compression
