Is information stored in registers/memory structured as binary? - memory

Looking at this question on Quora HERE ("Are data stored in registers and memory in hex or binary?"), I think the top answer is saying that data persistence is achieved through physical properties of hardware and is not directly relatable to either binary or hex.
I've always thought of computers as 'binary', but have just realized that that only applies to the usage of components (magnetic up/down or an on/off transistor) and not necessarily the organisation of, for example, memory contents.
i.e. you could, theoretically, create an abstraction in memory that used 'binary components' but that wasn't binary, like this:
100000110001010001100
100001001001010010010
111101111101010100001
100101000001010010010
100100111001010101100
And then recognize that as the (badly-drawn) image of 'hello', rather than the ASCII encoding of 'hello'.
An answer on SO (What's the difference between a word and byte?) mentions that processors can handle 'words', i.e. several bytes at a time, so while information representation has to be binary I don't see why information processing has to be.
Can computers do arithmetic on hex directly? In this case, would the internal representation of information in memory/registers be in binary or hex?

Perhaps "digital computer" would be a good starting term and then from there "binary digit" ("bit"). Electronically, the terms for the values are sometimes "high" and "low". You are right, everything after that depends on the operation. Most of the time, groups of bits are operated on together. Commonly groups are 1, 8, 16, 32 and 64 bits. The meaning of the bits depends on the program but some operations go hand-in-hand with some level of meaning.
When the meaning of a group of bits is not known or important, humans like to be able to decern the value of each bit. Binary could be used but more than 8 bits is hard to read. Although it is rare to operate on groups of 4 bits, hexadecimal is much more readable and is generally used regardless of the number of bits. Sometimes octal is used but that's based on contexts where there is some meaning to a subgrouping of the 3 bits or an avoidance of digits beyond 9.
Integers can be stored in two's complement format and often CPUs have instructions for such integers. Once such operation is negation. For a group of 8 bits, it would map 1 to -1,… 127 to -127, and -1 to 1, … -127 to 127, and 0 to 0 and -128 to -128. Decimal is likely the most valuable to humans here, not base 256, base 2 or base 16. In unsigned hexadecimal, that would be 01 to FF, …, 00 to 00, 80 to 80.
For an intro to how a CPU might do integer addition on a group of bits, see adder circuits.
Other number formats include IEEE-754 floating point and binary-coded decimal.
I think you understand that digital circuits are binary. So, based on the above, yes, operations do operate on a higher conceptual level despite the actual storage.

Related

How many values can be stored per physical address in Memory?

I've read that you can only store one value per physical address in Ram. Now this data could be an instruction or data. Is this due to when the CPU reads in a Word from Ram, it can only deal with one value at a time? be that an instruction, int or a string. Is there a technical reason you can't fit more than one value per index. I've read about Scalar Processors but aren't they really old. Couldn't you fit two or more values in the width of a 64 bit Word for example? Or am i missing something really obvious here. I guess i'm asking is this a programming concept or is there an actual technical/hardware reason the cpu can't deal with more than one value per read of a Word from Ram..
Thanks
Rob
Most recent computers use addresses that point to a "Byte" location in memory.
Each machine instruction that includes "load (or store) from memory" functionality includes either an implicit or explicit specification of the number of bytes to be loaded/stored, starting at the target byte address. Common sizes are 1, 2, 4, 8 Bytes (corresponding to single data items of the most commonly supported sizes).
It is up to the application program to decide how to interpret the bytes and what operations to perform on them. It is certainly common to store the characters of a string in consecutive byte memory locations and process 4 or 8 characters at a time using 32-bit (4-Byte) or 64-bit (8-Byte) load and store instructions. Operation on the individual bytes (characters) may involves masking, shifting, and copying within the processor's general-purpose registers, but since the late 1990's, many/most microprocessors have included instructions specifically designed to treat the contents of a register as multiple independent (smaller) values.
"Packing" multiple data items into consecutive bytes of memory need not be limited to the sizes of registers for supported arithmetic types (1, 2, 4, 8 Bytes). Since about 2000, many processors have also included "Single Instruction Multiple Data" (SIMD) instructions to load bigger payloads into a set of "SIMD registers". (Common sizes are 16 and 32 Bytes, but some processors support 64 Byte registers.) Systems that include these SIMD load and store instructions typically also include instructions to operate on the SIMD registers "in parallel" -- treating the register contents as multiple independent values. It is common to provide instructions to treat the contents of a 256-bit (32-Byte) register as 32 1-Byte values, 16 2-Byte values, 8 4-Byte values, or 4 8-Byte values. The details vary by processor architecture and generation.

Meaning of a halfword and a word in microprocessor

I'm taking a microprocessor class and I'm a bit confused on the value of a halfword and word. In lectures, my professor keeps referring to a halfword or a word as though it's a fixed value, like Pi.
My understanding (if correct) is the value depends on the processor, 32bits, 64 bits, etc.. So, for a 32bit processor, a halfword would be 18bits. Would someone clarify this?
As the commenter noted, "word" is a loaded term. Now a days, processors have standardized on processors that use powers of as their "word" size (16, 32, 64, 128). This is why we talk in terms of bytes.
1) This was not so in ye olde days. The PDP-8 was a 12-bit processor. Sperry and the DEC TOPS systems were 36-bits. In these cases a "word" the amount of data the processor inherently worked with: in these examples, 12-bits or 36-bits.
2) In on the PDP-11 and VAX, the term "word" meant 16-bits (even though the VAX was a 32-bit processor).
[There may be other "word" meanings I have omitted.]
Thus a word could mean either the amount of data the processor works with or a specific data length. A half word is half of that amount.
You'd really have to ask to find out what your professor means when he says "word."

Endian-ness: Bits in a Byte vs. Bytes in Memory

When we say a specific architecture is either little-endian or big-endian, we are referring to the whether numerical significance is stored from left-to-right or right-to-left in memory. My question is: does this ordering refer to how bits or ordered in a byte, or how bytes are ordered in a memory?
For example, consider the number 6000=1770h=0001011101110000b. If both bits in a byte and byte in memory are little-endian, this would be stored as
00001110 11101000 = 0E E8,
if bits in a byte were big-endian, but bytes in memory were little-endian, this would be stored as (for what it's worth, this happens to be how Visual Studio seems to be telling me that memory is organized in x64 architecture)
01110000 00010111 = 70 17,
if bits were little-endian, but bytes were big-endian, this would be stored as
11101000 00001110 = 0E E8,
and finally, if bits were big-endian, but bytes were little-endian, this would be stored as
00010111 01110000 = 17 70
(Hopefully I did that right.)
So then, what do the terms "little-endian" and "big-endian" actually refer to? Do the terms refer to the ordering of bits in a byte, or the ordering of bytes in memory, or both? Furthermore, if VS tells me that, for example, 7C, is 'in' a given particular byte, do they mean that the bits that make up that byte in computer memory are literally 0111 1100, or do they just mean that the value stored in that byte is 7Ch=124, but or may not be actually represented as 7c=01111100 depending on whether or not the underlying architecture happens to be little-endian?
The ordering of bits in a byte is invisible. Since you can't address individual bits, there would be no difference between the two cases. However, you can address individual bytes, so there it does make a difference.
If we're expressing 6000 in byte-addressible memory, the high byte is 23 decimal (6000 divided by 256) and the low byte is 112 decimal (6000 mod 256). We could store this as 23,112 or 112,23. There are no other options. Only the ordering of bytes is an open choice, and this is what endianness refers to.
In memory, little-endian or big-endian is not so much left-to-right or right-to-left issue as one of addressing. In little endian, the least significant portion of data is stored in the lower addressed locations and the reverse with big endian.
The ordering occurs independently at 2 levels. As most machines address more than 1 bit at a time (recall some graphic CPUs that did have bit addresses), the address will locate a group of bits, typically 8 bits. So if the bits at address 10 are less significant than the bits at address 11, it is a little endian machine. This is generalized as byte endian-ness. The processor's characteristics define this.
The endian-ness of the group of bits, the bit endian-ness, is significant if there is a way to address them in some fashion. Some processors provide operations that use a bit level address within the byte (or word). If your programming language directly allows you to use that or hides that is another question.
In C there are bit fields such as
union u {
unsigned char uc;
struct s {
int a :1;
int b :7;
};
};
This code is non-portable because of bit endian-ness. u.uc = 7 may result in u.s.b also being 7 or something else. Typically the byte endian-ness and bit endian-ness are the same. But the compiler controls the above example.
Bit endian-ness is also significant in serial communication. As 1 bit is sent/received sequentially, its construction to/from memory needs endian-ness definition.
Conclude:
Big-endian and little endian most often refer to the "byte" level addressing. The endian-ness of the bit is typically either the same or of select importance to the programmer.
BTW, your example of "If both bits in a byte and byte in memory are little-endian, this would be stored as
00001110 11101000 = 0E E8
I would suggest is not correct as the left side and right side are using different endian-ness. Had you used the same endian-ness, you may conclude
00001110 11101000 = 07 71
For fun consider:
01000000 (big endian) has value "sixty-four" (A big endian word)
10110000 (little endian) has value "thirteen". (Thirteen is little endian word)

How much storage would be required to store a human genome?

I'm looking for the amount of storage in bytes (MB, GB, TB, etc.) required to store a single human genome. I read a few articles on Wikipedia about DNA, chromosomes, base pairs, genes, and have some rough guess, but before disclosing anything I'd like to see how others would approach this issue.
An alternative question would be how many atoms are there in human DNA, but that would be off topic for this site.
I understand that this will be an approximation, so I'm looking for the minimal value that would be able to store DNA of any human.
If you trust such things, here is what Wikipedia claims (from http://en.wikipedia.org/wiki/Human_genome#Information_content):
The 2.9 billion base pairs of the haploid human genome correspond to a
maximum of about 725 megabytes of data, since every base pair can be
coded by 2 bits. Since individual genomes vary by less than 1% from
each other, they can be losslessly compressed to roughly 4 megabytes.
You do not store all the DNA in one stream, rather most the time it is store by chromosomes.
A large chromosome take about 300 MB and a small one about 50 MB.
Edit:
I think the first reason why it is not saved in 2 bits per base pair is that it would cause an hurdle to work with the data. Most of the people would not know how to convert it. And even when a program for conversion would be given, a lot of people in large companies or research institutes are not allowed to/need to ask or do not know how to install programs...
1GB storage costs nothing, even the download of 3 GB takes only 4 minutes with 100 Mbitsps and most companies have faster speeds.
Another point is that the data isn't as simple as you get told.
e.g. The method for sequencing invented by Craig_Venter was a great breakthrough but has its down sides. It could not separate long chains of the same base pair, so it is not always 100% clear if there are 8 A's or 9 A's. Things you have to take care of later on...
Another example is the DNA methylation because you can't store this Information in a 2-bit representation.
Basically, each base pair takes 2 bits (you can use 00, 01, 10, 11 for T, G, C, and A). Since there are about 2.9 billion base pairs in the human genome, (2 * 2.9 billion) bits ~= 691 megabytes.
I'm no expert, however, the Human Genome page on Wikipedia states the following:
Raw MB:
Male (XY): 770MB
Female (XX): 756MB
I'm not sure where their variance comes from, but I'm sure you can figure it out.
Yes, the minimum storage space needed for whole human DNA is about 770 MB.
However, the 2-bit representation is impractical. It is hard to search through or do some computations on it. Therefore, some mathematicians designed more effective way to store those sequencies of bases and use them in searching and comparation algorithms. One such example is GARLI.
This application runs on my PC right now, and I have the human genome stored in 1563 MB.
The human genome contains over 3 billion base pairs. So if you represented each base pair as two bits then it would take over 6.15 × 10⁹ bits or approximately 770 MB.
just did it too. the raw sequence is ~700 MB. if one uses a fixed storage sequence or a fixed sequence storage algoritm - and the fact that the changes are 1% i calcuated ~120 MB with a perchromosome-sequenceoffset-statedelta storage. that's it for the storage.
There are 4 nucleotide bases that make up our DNA these are A,C,G,T therefore for each base in the DNA takes up 2bits. There are around 2.9billion bases so thats around 700 megabytes. The weird thing is that would fill a normal data cd! coincidence?!?
All answers are leaving off the fact that nuDNA is not the only DNA that defines a human genome. mtDNA is also inherited and it contributes an additional 16,500 base pairs to a human genome, bringing it more in line with the Wikipedia guess of 770MB for males, and 756MB for females.
This does not mean that a human genome can easily be stored on an 4GB USB stick. Bits do not represent information by themselves, it is the combination of bits that represent information. So in the case of nuDNA and mtDNA, the bits are encoded (not to be confused with compressed) to represent proteins and enzymes that in themselves would requires many MBs of raw data to represent, especially in terms of functionality.
Food for thought: 80% of the human genome is called "non-coding" DNA, so did you actually really believe that the entire human body and brain can be represented in a mere 151 to 154MBs of raw data?
Most answers except users slayton, rauchen, Paul Amstrong are dead wrong if its about pure storage one-on-one without compression techniques.
The human genome with 3Gb of nucleotides correspond with 3Gb of bytes and not ~750MB. The constructed "haploid" genome according to NCBI is currently 3436687kb or 3.436687 Gb in size. Check here for yourself.
Haploid = single copy of a chromosome.
Diploid = two versions of haploid.
Humans have 22 unique chromosomes x 2 = 44.
Male 23rd chromosome is X, Y and makes 46 in total.
Females 23rd chrom. is X, X and thus makes 46 in total.
For males it would be 23 + 1 chromosome in data storage on a HDD and for females 23 chromosomes, explaining the little differences mentioned now and then in answers. The X chrom. from males is equal to X chrom. from the females.
Thus loading the genome (23 + 1) into memory is done in parts via BLAST using constructed databases from fasta-files. Regardless of zipped versions or not nucleotides are hardly to be compressed. Back in the early days one of the tricks used was to replace tandem repeats (GACGACGAC with shorter coding e.g. "3GAC"; 9byte to 4byte). The reason was to save harddrive space (area of the 500bm-2GB HDDD platters with 7.200 rpm and SCSI connectors). For sequence searching this was also done with the query.
If "coded nucleotide" storage would be 2-bit per letter then you get for a byte:
A = 00
C = 01
G = 10
T = 11
Only this way you fully profit from positions 1,2,3,4,5,6,7 and 8 for 1 byte of coding. For example the combination 00.01.10.11 (as byte 00011011) would then correspond for "ACTG" (and show in a textfile as an unrecognizable character). This alone is responsible for a four times reduction in file-size as we see in other answers. Thus 3.4Gb will be downsized to 0.85917175 Gb... ~860MB including a then required conversion program (23kb-4mb).
But... in biology you want to be able to read something thus compression gzipped is more than enough. Unzipped you can still read it. If this byte filling was used it becomes harder to read the data. That's why fasta-files are plain-text files in reality.
There is only 2 types of base pairs, Cytosine can only bind to Guanine, and Adenine can only bind to thymine,
So each base pair can be considered a single bit.
This means that an entire strand of Human DNA ~3 billion "Bits" would be right around ~350 megabytes.
One base -- T, C, A, G (in the base-4 number system: 0, 1, 2, 3) -- is encoded as two bits (not one), so one base pair is encoded by four bits.

Why is the smallest value that can be stored is a Byte(8bit) & not a Bit(1bit)?

Why is the smallest value that can be stored a Byte(8bit) & not a Bit(1bit) in memory?
Even booleans are stored as Bytes. Will we ever bump the smallest number to 32 or 64bits like register's on the CPU?
EDIT: To clarify as many answers seemed confused about the nature of questing. This question is about why isn't a byte 7-bit, 1-bit, 32-bit, etc (not why lower bit primitives must fit within the hardware's byte at min). Is the 8-bit byte simply historical as some hardware has 10-bit bytes for example. Or is there a mathematical reason 8-bit is ideal vs say 10-bit for general processing?
The hardware is built to read data in blocks (bytes, later words and dwords). This provides greater efficiency, than accessing individual bits, and also offers more addressing range. So most data is aligned to at least byte boundary. There exist encodings that operate with bit sequences, rather than bytes, but they are quite rare.
Nowadays the data is most often aligned to dword (32-bits) boundary anyway. Moreover, some hardware (ARM, for example), can't access misaligned multibyte variables, i.e. 16-bit word can't "cross" dword boundary - exception will be thrown.
Because computers address memory at the byte level, so anything smaller than a byte is not addressable.
The underlying methods of processor access are limited to the size of the smallest usable register. On most architectures, that size is 8 bits. You can use smaller portions of these; for instance, C has the bitfield feature in structs that will allow combining fields that only need to be certain bit lengths. Access will still require that the whole byte be read.
Some older exotic architectures actually did have different a "word size." In these machines, 10 bits might be the common size.
Lastly, processors are almost always backwards compatible. Intel, for instance, has maintained complete instruction compatibility from the 386 on up. If you take a program compiled for the 386, it will still run on an i7 processor. Changing the word size would break compatibility. So while it is possible, no manufacturer will ever do it.
Assume that we have native language that consist of 2 character such as a , b
to distinguish two characters we need at least 1 bit for example 0 to represent char a and 1 to represent char b
so that if we count number of characters and special characters and symbols, there are 128 character and to distinguish one character from another, you need log2(128) = 7 bit and 8th bit for transmission

Resources