Size of characters in input sequence for Huffman encoding?

In Huffman compression we eliminate redundancy in a sequence of symbols by using variable-length codes for symbols with different frequencies.
The question is how to define the size (in bits) of the input symbols. Is it 7, 8, 9, 121? How is it defined?

The Huffman algorithm does not care how you choose to represent your symbols. All it cares about is how many symbols there are and what the frequency of each symbol is. It just takes a list of frequencies and produces a list of bit lengths. Your symbols can be represented as bytes, two-byte integers, Unicode characters, country flags, whatever.
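To make that concrete, here is a minimal Python sketch of the frequencies-in, bit-lengths-out view (the function name and the sample frequencies are purely illustrative):

import heapq
import itertools

def huffman_code_lengths(freqs):
    # Given {symbol: frequency}, return {symbol: code length in bits}.
    # The symbols themselves are opaque; only their frequencies matter.
    tie = itertools.count()  # tie-breaker so equal frequencies never compare symbols
    heap = [(f, next(tie), [(sym, 0)]) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        deeper = [(sym, depth + 1) for sym, depth in a + b]  # merged symbols move one level down
        heapq.heappush(heap, (f1 + f2, next(tie), deeper))
    return dict(heap[0][2])

# The "symbols" could be bytes, 16-bit integers, Unicode characters, anything hashable.
print(huffman_code_lengths({'a': 45, 'b': 13, 'c': 12, 'd': 16, 'e': 9, 'f': 5}))
# a gets a 1-bit code; b, c, d get 3-bit codes; e, f get 4-bit codes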

Related

Data Encoding for Training in Neural Network

I have converted 349,900 words from a dictionary file to MD5 hashes. Samples are below:
74b87337454200d4d33f80c4663dc5e5
594f803b380a41396ed63dca39503542
0b4e7a0e5fe84ad35fb5f95b9ceeac79
5d793fc5b00a2348c3fb9ab59e5ca98a
3dbe00a167653a1aaee01d93e77e730e
ffc32e9606a34d09fca5d82e3448f71f
2fa9f0700f68f32d2d520302906e65ce
1c9b32ff1b53bd892b87578a11cbd333
26a10043bba821303408ebce568a2746
c3c32ff3481e9745e10defa7ce5b511e
I want to train a neural network to decrypt a hash using just a simple architecture like a multilayer perceptron. Since every hash value has length 32, I was thinking that the number of input nodes should be 32, but the problem is the number of output nodes. Since the outputs are words in the dictionary, they don't have a specific length; they can be of various lengths. That is why I'm confused about how many output nodes I should have.
How should I encode my data so that I can have a fixed number of output nodes?
I found a paper in this link that claims to decrypt a hash using a neural network. The paper says:
The input to the neural network is the encrypted text that is to be decoded. This is fed into the neural network either in bipolar or binary format. This then traverses through the hidden layer to the final output layer which is also in the bipolar or binary format (as given in the input). This is then converted back to the plain text for further process.
How do I implement what is described in the paper? I am thinking of limiting the number of characters to decrypt. Initially, I can limit it to 4 characters only (just for test purposes).
My input layer will have 32 nodes, one for each character of the hash. Each input node will hold (ASCII value of each hash character / 256). My output layer will also have 32 nodes, in binary format. Since 8 bits / 8 nodes represent one character, my network will be able to decrypt only up to 4 characters, because 32/8 = 4. (I can increase it if I want to.) For the hidden layer I'm planning to use 33 nodes. Is my network architecture feasible, 32 x 33 x 32? If not, why? Please guide me.
You could map the words in the dictionary into a vector space (e.g. bag of words, word2vec, ...). In that case the words are encoded with a fixed length, and the number of neurons in the output layer will match that length.
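As an illustration of what "encoded with a fixed length" means, here is a deliberately simple Python sketch (not bag of words or word2vec; MAX_LEN, ALPHABET and the padding scheme are assumptions for lowercase a-z words):

# Pad every word to MAX_LEN characters and one-hot encode each character,
# so every word becomes a vector of the same length MAX_LEN * 26.
MAX_LEN = 12
ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

def encode_word(word):
    vec = [0.0] * (MAX_LEN * len(ALPHABET))
    for i, ch in enumerate(word[:MAX_LEN]):
        vec[i * len(ALPHABET) + ALPHABET.index(ch)] = 1.0
    return vec

print(len(encode_word('cat')), len(encode_word('encyclopedia')))  # 312 312: same length for every word

The output layer then has MAX_LEN * 26 neurons regardless of the word's actual length.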
There's a great discussion about the possibility of cracking SHA256 hashes using neural networks in another Stack Exchange forum: https://security.stackexchange.com/questions/135211/can-a-neural-network-crack-hashing-algorithms
The accepted answer was:
No.
Neural networks are pattern matchers. They're very good pattern matchers, but pattern matchers just the same. No more advanced than the biological brains they are intended to mimic. More thorough, more tireless, but not more sophisticated.
The patterns have to be there to be found. There has to be a bias in the data to tease out. But cryptographic hashes are explicitly and extremely carefully designed to eliminate any bias in the output. No one bit is more likely than any other, no one output is more likely to correlate to any given input. If such a correlation were possible, the hash would be considered "broken" and a new algorithm would take its place.
Flaws in hash functions have been found before, but never with the aid of a neural network. Instead it's been with the careful application of certain mathematical principles.
The following answer also makes a funny comparison:
SHA256 has an output space of 2^256, and an input space that's essentially infinite. For reference, the time since the big bang is estimated to be 5 billion years, which is about 1.577 x 10^27 nanoseconds, which is about 2^90 ns. So assuming each training iteration takes 1 ns, you would need 2^166 ages of the universe to train your neural net.

Difference between a Byte, Word, Long and a Long Word?

I'm aware that a Byte is 8 bits, but what do the others represent? I'm taking an assembly course that uses the Motorola 68k architecture, and I'm confused by the vocabulary used.
As mentioned on the first page of the operator's manual for the 68k Architecture, in your case a word is 16 bits and a long word is 32 bits.
In an assembly language, a word is the CPU's natural working size. Each instruction, as well as each address in memory, tends to be one word in length. Whereas a byte is always 8 bits, the size of a word depends on the architecture you're working with.
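If it helps to see those sizes concretely, here is a small Python illustration; struct's fixed-size format codes are only an analogy for the 68k's data sizes, not 68k code:

import struct
# Standard (machine-independent) sizes matching the 68k's byte / word / long word:
print(struct.calcsize('>B'))  # byte      -> 1 byte  (8 bits)
print(struct.calcsize('>H'))  # word      -> 2 bytes (16 bits)
print(struct.calcsize('>I'))  # long word -> 4 bytes (32 bits)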

One Hot encoding for large number of values

How do we use one-hot encoding if the number of values that a categorical variable can take is large?
In my case it is 56 values. So, following the usual method, I would have to add 56 columns (56 binary features) to the training dataset, which would greatly increase the complexity and hence the training time.
So how do we deal with such cases?
Use a compact encoding. This trades time for space, although in practice a compact one-hot encoding often carries only a very small time penalty.
The most accessible idea is a vector of 56 booleans, if your data format supports that. The most direct mapping is to use a 64-bit integer, with each bit acting as a boolean; this is how one-hot vectors are implemented in hardware design. Most 4GLs (and mature 3GLs) include fast routines for bit manipulation. You will need operations to get, set, clear, and find bits.
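For example, here is a minimal Python sketch of that bit-manipulation idea (the helper names are mine, just to show the four operations; Python integers are arbitrary precision, but the same expressions work on a fixed 64-bit word in C, Java, and similar languages):

def set_bit(v, i):   return v | (1 << i)    # mark category i as present
def clear_bit(v, i): return v & ~(1 << i)   # remove category i
def get_bit(v, i):   return (v >> i) & 1    # test whether category i is set
def find_bit(v):     return v.bit_length() - 1 if v else -1  # index of the highest set bit

code = set_bit(0, 37)        # the whole one-hot "vector" for category 37 is a single integer
print(get_bit(code, 37))     # 1
print(find_bit(code))        # 37
print(clear_bit(code, 37))   # 0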
Does that get you moving?

How are numbers represented in a computer and what is the role of floating-point and twos-complement?

I have a very general question about how computers work with numbers.
In general, computer systems only know binary: 0 and 1. So in memory, any number is a sequence of bits. It does not matter whether the number represented is an int or a float.
But when do things like floating-point numbers based on the IEEE 754 standard and two's complement enter the game? Is this only a matter for compilers (C/C++, ...) and VMs (.NET/Java)?
Is it true that all integers are represented using two's complement?
I have read about CPUs that use co-processors to perform floating-point arithmetic. To tell the CPU to use them, special assembler instructions exist, such as add.s (single precision) and add.d (double precision). When I have some C++ code where a float is used, will such assembler instructions be in the output?
I am totally confused at the moment. It would be great if you could help me with this.
Thank you!
Stefan
In general, computer systems only know binary: 0 and 1. So in memory, any number is a sequence of bits. It does not matter whether the number represented is an int or a float.
This is correct for the representation in memory. But computers execute instructions and keep the data currently being worked on in registers, and on a typical computer both instructions and registers are specialized: some for signed integers represented in two's complement, others for IEEE 754 binary32 and binary64 arithmetic.
So to answer your first question:
But when do things like floating-point numbers based on the IEEE 754 standard and two's complement enter the game? Is this only a matter for compilers (C/C++, ...) and VMs (.NET/Java)?
Two's complement and IEEE 754 binary floating-point are very much choices made by the ISA, which provides specialized instructions and registers to deal with these formats in particular.
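For instance, here is a small Python sketch showing the IEEE 754 binary32 layout the hardware works with (the value -6.25 is just an example):

import struct
# Reinterpret the 4 bytes of a binary32 float as an unsigned integer to see its bits:
bits = struct.unpack('>I', struct.pack('>f', -6.25))[0]
print(format(bits, '032b'))
# -> 11000000110010000000000000000000
#    sign = 1, exponent = 10000001 (129, i.e. 2 after removing the bias of 127),
#    fraction = 1001..., so the value is -1.5625 * 2**2 = -6.25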
Is it true that all integers are represented using two's complement?
You can represent integers however you want. But if you represent your signed integers using two's complement, the typical ISA will provide instructions to operate efficiently on them. If you make another choice, you will be on your own.
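And as a concrete illustration of two's complement itself (a Python sketch; the hardware does the equivalent with fixed-width registers):

BITS = 8
value = -5
pattern = value & ((1 << BITS) - 1)    # the 8-bit two's complement pattern, 251
print(format(pattern, '08b'))          # -> 11111011
# Reading those same 8 bits back as a signed number gives -5 again:
signed = pattern - (1 << BITS) if pattern >= (1 << (BITS - 1)) else pattern
print(signed)                          # -> -5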

Lua: subtracting decimal numbers doesn't return correct precision

I am using Lua 5.1
print(10.08 - 10.07)
Rather than printing 0.01, the above prints 0.0099999999999998.
Any idea how to get 0.01 from this subtraction?
You did get 0.01 from the subtraction, as closely as it can be represented; the printed form just shows the tiny amount of precision that was lost.
Lua uses the C type double to represent numbers. On nearly every platform you will use, this is a 64-bit binary floating-point value with approximately 15 to 16 decimal digits of precision. However, no amount of precision is sufficient to represent 0.01 exactly in binary. The situation is similar to attempting to write 1/3 in decimal.
Furthermore, you are subtracting two values that are very similar in magnitude. That all by itself causes an additional loss of precision.
The solution depends on what your context is. If you are doing accounting, then I would strongly recommend that you not use floating point values to represent account values because these small errors will accumulate and eventually whole cents (or worse) will appear or disappear. It is much better to store accounts in integer cents, and divide by 100 for display.
In general, the answer is to be aware of the issues that are inherent to floating point, and one of them is this sort of small loss of precision. It is easily handled by rounding answers to an appropriate number of digits for display, and never comparing results of calculations for equality.
Some resources for background:
The semi-official explanation at the Lua Users Wiki
This great page of IEEE floating point calculators where you can enter values in decimal and see how they are represented, and vice-versa.
Wikipedia on IEEE floating point.
Wikipedia on floating-point numbers in general.
What Every Computer Scientist Should Know About Floating-Point Arithmetic is the most complete discussion of the fine details.
Edit: Added the WECSSKAFPA document after all. It really is worth the study, although it will likely seem a bit overwhelming on the first pass. In the same vein, Knuth Volume 2 has extensive coverage of arithmetic in general and a large section on floating point. And, since lhf reminded me of its existence, I inserted the Lua wiki explanation of why floating point is ok to use as the sole numeric type as the first item on the list.
Use Lua's string.format function:
print(string.format('%.02f', 10.08 - 10.07))
Use an arbitrary precision library.
Use a Decimal Number Library for Lua.
