Diff between bit and byte, and exact meaning of byte - storage

This is just basic theoretical question. so I read that a bit consist of 0 or 1. and a byte consists of 8 bits. and in 8 bit we can store 2^8 nos.
similarly in 10 bits we store 2^10 (1024). but then why do we say that 1024 is 1 kilo bytes, its actually 10 bits which just 1.25 byte to be exact.
please share some knowledge on it
just a concrete explanation.

Bit means like there are 8 bits in 1 byte, bit is the smallest unit of any storage or you can say the system and 8 bits sums up to 1 byte.

A bit, short for binary digit, is the smallest unit of measurement used in computers for information storage. A bit is represented by a 1 or a 0 with the value true or false, also known as on or off. A single byte of information, also known as an octet, is made up of eight bits. The size, or amount of information stored, distinguishes a bit from a byte.
A kilobit is 1,000 bits, but it is designated as 1024 bits in the binary system due to the amount of space required to store a kilobit using common operating systems and storage schemes. Most people, however, think of kilo as referring to 1,000 in order to remember what a kilobit is. A kilobyte then, would be 1,000 bytes.

Related

HOW does a 8 bit processor interpret the 2 bytes of a 16 bit number to be a single piece of info?

Assume the 16 bit no. to be 256.
So,
byte 1 = Some binary no.
byte 2 = Some binary no.
But byte 1 also represents a 8 bit no.(Which could be an independent decimal number) and so does byte 2..
So how does the processor know that bytes 1,2 represent a single no. 256 and not two separate numbers
The processor would need to have another long type for that. I guess you could implement a software equivalent, but for the processor, these two bytes would still have individual values.
The processor could also have a special integer representation and machine instructions that handle these numbers. For example, most modern machines nowadays use twos-complement integers to represent negative numbers. In twos-complement, the most significant bit is used to differentiate negative numbers. So a twos-complement 8-bit integer can have a range of -128 (1000 0000) to 127 (0111 111).
You could easily have the most significant bit mean something else, so for example, when MSB is 0 we have integers from 0 (0000 0000) to 127 (0111 1111); when MSB is 1 we have integers from 256 (1000 0000) to 256 + 127 (1111 1111). Whether this is efficient or good architecture is another history.

Bit vs Byte in memory adress?

I know that 1 byte = 8 bits.
Can I access 1 bits in memory or it should be a full byte?
Let's suppose the memory starts at: 0x0000 then what's the next position? 0x0001 or 0x0008?
It depends on how many bit processer you have. I don't think there are any 1-bit processors around.
But a 8-bit processor would read a full byte, and then you can mask bits to actually read the only one you care about.
Therefore, considering the most common architectures, the next address would be 0x0008
Edit: Based on the comments, there are 1-bit processors

Gigabyte or Gibibyte (1000 or 1024)?

This may be a duplicate and I apologies if that is so but I really want a definitive answer as that seems to change depending upon where I look.
Is it acceptable to say that a gigabyte is 1024 megabytes or should it be said that it is 1000 megabytes? I am taking computer science at GCSE and a typical exam question could be how many bytes in a kilobyte and I believe the exam board, AQA, has the answer for such a question as 1024 not 1000. How is this? Are both correct? Which one should I go with?
Thanks in advance- this has got me rather bamboozled!
The sad fact is that it depends on who you ask. But computer terminology is slowly being aligned with normal terminology, in which kilo is 103 (1,000), mega is 106 (1,000,000), and giga is 109 (1,000,000,000).
This is reflected in the International System of Quantities and the International Electrotechnical Commission, which define gigabyte as 109 and use gibibyte for the computer-specific 1024 x 1024 x 1024 value.
The reason it "depends who you ask," is that for many years, specifically in relation to "bytes" of storage, the prefixes kilo, mega, and giga meant 1024, 10242, and 10243. But that flies in the face of normal convention with regard to these prefixes. So again, computer terminology is being aligned with non-computer terminology.
The term gigabyte is commonly used to mean either 10003 bytes or 10243 bytes depending on the context. Disk manufacturers prefer the decimal term while memory manufacturers use the binary.
Decimal definition
1 GB = 1,000,000,000 bytes (= 10003 B = 109 B)
Based on powers of 10, this definition uses the prefix as defined in the International System of Units (SI). This is the recommended definition by the International Electrotechnical Commission (IEC). This definition is used in networking contexts and most storage media, particularly hard drives, flash-based storage, and DVDs, and is also consistent with the other uses of the SI prefix in computing, such as CPU clock speeds or measures of performance.
Binary definition
1 GiB = 1,073,741,824 bytes (= 10243 B = 230 B).
The binary definition uses powers of the base 2, as is the architectural principle of binary computers. This usage is widely promulgated by some operating systems, such as Microsoft Windows in reference to computer memory (e.g., RAM). This definition is synonymous with the unambiguous unit gibibyte.
The difference between units based on decimal and binary prefixes increases as a semi-logarithmic (linear-log) function—for example, the decimal kilobyte value is nearly 98% of the kibibyte, a megabyte is under 96% of a mebibyte, and a gigabyte is just over 93% of a gibibyte value. This means that a 300 GB (279 GiB) hard disk might be indicated variously as 300 GB, 279 GB or 279 GiB, depending on the operating system.
The Wikipedia article https://en.wikipedia.org/wiki/Gigabyte has a good writeup of the confusion surrounding the usage of the term

Efficient way to create a bit mask from multiple numbers possibly using SSE/SSE2/SSE3/SSE4 instructions

Suppose I have 16 ascii characters (hence 16 8 bit numbers) in a 128 bit variable/register. I want to create a bit mask in which those bits will be high whose bit positions (indexes) are represented by those 16 characters.
For example, if the string formed from those 16 characters is "CAD...", in the bit mask 67th bit, 65th bit, 68th bit and so on should be 1. The rest of the bits should be 0. What is the efficient way to do it specially using SIMD instructions?
I know that one of the technique is addition like this: 2^(67-1)+2^(65-1)+2^(68-1)+...
But this will require a large number of operations. I want to do it in one/two operations/instructions if possible.
Please let me know a solution.
SSE4.2 contains one instruction, that performs almost what you want: PCMPISTRM with immediate operand 0. One of its operands should contain your ASCII characters, other - a constant vector with values like 32, 33, ... 47. You get the result in 16 least significant bits of XMM0. Since you need 128 bits, this instruction should be executed 8 times with different constant vectors (6 times if you need only printable ASCII characters). After each PCMPISTRM, use bitwise OR to accumulate the result in some XMM register.
There are 2 disadvantages of this method: (1) you need to read the Intel's architectures software developer's manual to understand PCMPISTRM's details because that's probably the most complicated SSE instruction ever, and (2) this instruction is pretty slow (throughput of 1/2 on Nehalem, 1/3 on Sandy Bridge, 1/4 on Bulldozer), so you'll hardly get any significant speed improvement over 'brute force' method.

Maximum number of different numbers, Huffman Compression

I want to compress many 32bit number using huffman compression.
Each number may appear multiple times, and I know that every number will be replaced with some bit sequences:
111
010
110
1010
1000
etc...
Now, the question: How many different numbers can be added to the huffman tree before the length of the binary sequence exceeds 32bits?
The rule of generating sequences (for those who don't know) is that every time a new number is added you must assign it the smallest binary sequence possible that is not the prefix of another.
You seem to understand the principle of prefix codes.
Many people (confusingly) refer to all prefix codes as "Huffman codes".
There are many other kinds of prefix codes -- none of them compress data into any fewer bits than Huffman compression (if we neglect the overhead of transmitting the frequency table), but many of them get pretty close (with some kinds of data) and have other advantages, such as running much faster or guaranteeing some maximum code length ("length-limited prefix codes").
If you have large numbers of unique symbols, the overhead of the Huffman frequency table becomes large -- perhaps some other prefix code can give better net compression.
Many people doing compression and decompression in hardware have fixed limits for the maximum codeword size -- many image and video compression algorithms specify a "length-limited Huffman code".
The fastest prefix codes -- universal codes -- do, in fact, involve a series of bit sequences that can be pre-generated without regard to the actual symbol frequencies. Compression programs that use these codes, as you mentioned, associate the most-frequent input symbol to the shortest bit sequence, the next-most-frequent input symbol to the next-shorted bit sequence, and so on.
For example, some compression programs use Fibonacci codes (a kind of universal code), and always associate the most-frequent symbol to the bit sequence "11", the next-most-frequent symbol to the bit sequence "011", the next to "0011", the next to "1011", and so on.
The Huffman algorithm produces a code that is similar in many ways to a universal code -- both are prefix codes.
But, as Cyan points out, the Huffman algorithm is slightly different than those universal codes.
If you have 5 different symbols, the Huffman tree will contain 5 different bit sequences -- however, the exact bit sequences generated by the Huffman algorithm depend on the exact frequencies.
One document may have symbol counts of { 10, 10, 20, 40, 80 }, leading to Huffman bit sequences { 0000 0001 001 01 1 }.
Another document may have symbol counts of { 40, 40, 79, 79, 80 }, leading to Huffman bit sequences { 000 001 01 10 11 }.
Even though both situations have exactly 5 unique symbols, the actual Huffman code for the most-frequent symbol is very different in these two compressed documents -- the Huffman code "1" in one document, the Huffman code "11" in another document.
If, however, you compressed those documents with the Fibonacci code, the Fibonacci code for the most-frequent symbol is always the same -- "11" in every document.
For Fibonacci in particular, the first 33-bit Fibonacci code is "31 zero bits followed by 2 one bits", representing the value F(33) = 3,524,578 .
And so 3,524,577 unique symbols can be represented by Fibonacci codes of 32 bits or less.
One of the more counter-intuitive features of prefix codes is that some symbols (the rare symbols) are "compressed" into much longer bit sequences.
If you actually have 2^32 unique symbols (all possible 32 bit numbers), it is not possible to gain any compression if you force the compressor to use prefix codes limited to 32 bits or less.
If you actually have 2^8 unique symbols (all possible 8 bit numbers), it is not possible to gain any compression if you force the compressor to use prefix codes limited to 8 bits or less.
By allowing the compressor to expand rare values -- to use more than 8 bits to store a rare symbol that we know can be stored in 8 bits -- or use more than 32 bits to store a rare symbol that we know can be stored in 32 bits -- that frees up the compressor to use less than 8 bits -- or less than 32 bits -- to store the more-frequent symbols.
In particular, if I use Fibonacci codes to compress a table of values,
where the values include all possible 32 bit numbers,
one must use Fibonacci codes up to N bits long, where F(N) = 2^32 -- solving for N I get
N = 47 bits for the least-frequently-used 32-bit symbol.
Huffman is about compression, and compression requires a "skewed" distribution to work (assuming we are talking about normal, order-0, entropy).
The worst situation regarding Huffman tree depth is when the algorithm creates a degenerated tree, i.e. with only one leaf per level. This situation can happen if the distribution looks like a Fibonacci serie.
Therefore, the worst distribution sequence looks like this : 1, 1, 1, 2, 3, 5, 8, 13, ....
In this case, you fill the full 32-bit tree with only 33 different elements.
Note, however, that to reach a 32 bit-depth with only 33 elements, the most numerous element must appear 3 524 578 times.
Therefore, since suming all Fibonacci numbers get you 5 702 886, you need to compress at least 5 702 887 numbers to start having a risk of not being able to represent them with a 32-bit huffman tree.
That being said, using an Huffman tree to represent 32-bits numbers requires a considerable amount of memory to calculate and maintain the tree.
[Edit] A simpler format, called "logarithm approximation", gives almost the same weight to all symbols. In this case, only the total number of symbols is required.
It computes very fast : say for 300 symbols, you will have some using 8 bits, and others using 9 bits. The formula to decide how many of each type :
9 bits : (300-256)*2 = 44*2 = 88 ;
8 bits : 300 - 88 = 212
Then you can distribute the numbers as you wish (preferably the most frequent ones using 8 bits, but that's not important).
This version scales up to 32 bits, meaning basically no restriction.

Resources