Why is the modulus operator slow?

Paraphrasing from the book "Programming Pearls" (about C on older machines, since the book is from the late 90s):
Integer arithmetic operations (+, -, *) can take around 10 nanoseconds, whereas the % operator takes up to 100 nanoseconds.
Why is there that much difference?
How does the modulus operator work internally?
Is it the same as division (/) in terms of time?

The modulus/modulo operation is usually understood as the integer equivalent of the remainder operation - a side effect or counterpart to division.
Except for some degenerate cases (where the divisor is a power of the operating base - i.e. a power of 2 for most number formats) this is just as expensive as integer division!
So the question is really, why is integer division so expensive?
I don't have the time or expertise to analyze this mathematically, so I'm going to appeal to grade school maths:
Consider the number of lines of working out in the notebook (not including the inputs) required for:
Equality: (Boolean operations) essentially none - in computer "big O" terms this is known as O(1)
Addition: two, working left to right, one line for the output and one line for the carry. This is an O(N) operation.
Long multiplication: n*(n+1) + 2: two lines for each of the digit products (one for the total, one for the carry), plus a final total and carry. So O(N^2), but with a fixed N (32 or 64), and it can be pipelined in silicon to less than that.
Long division: unknown - it depends on the argument sizes; it is a recursive descent, and some instances descend faster than others (1,000,000 / 500,000 requires fewer lines than 1,000 / 7). Each step is essentially a series of multiplications to isolate the closest factors (although multiple algorithms exist). It feels like O(N^3) with variable N.
So in simple terms, this should give you a feel for why division (and hence modulo) is slower: computers still have to do long division in the same stepwise fashion that you did in grade school.
If this makes no sense to you, you may have been brought up on school math a little more modern than mine (30+ years ago).
The Order/Big O notation used above, O(something), expresses the complexity of a computation in terms of the size of its inputs, and says something about its execution time. http://en.m.wikipedia.org/wiki/Big_O_notation
O(1) executes in constant (but possibly large) time. O(N) takes as much time as the size of its data: if the data is 32 bits, it takes 32 times the O(1) time of the step needed to calculate one of its N steps. O(N^2) takes N times N (N squared) times the time of its N steps (or possibly N times M*N for some constant M). Etc.
In the above working I have used O(N) rather than O(N^2) for addition since the 32 or 64 bits of the first input are calculated in parallel by the CPU. In a hypothetical 1 bit machine a 32 bit addition operation would be O(32^2) and change. The same order reduction applies to the other operations too.
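The power-of-two exception mentioned at the start can be sketched in Python. This is only an illustration (`mod_pow2` is a made-up helper name) of the bitmask identity that compilers apply for constant power-of-two divisors on non-negative integers:

```python
# For non-negative x and a power-of-two divisor d, x % d == x & (d - 1):
# the remainder is just the low bits of x, so no long division is needed.
def mod_pow2(x, d):
    assert x >= 0 and d > 0 and d & (d - 1) == 0, "d must be a power of two"
    return x & (d - 1)

for x in range(1000):
    assert mod_pow2(x, 8) == x % 8
    assert mod_pow2(x, 64) == x % 64
```

This is why a divisor like 8 is cheap while a divisor like 7 forces the full division circuitry.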


Quantum computing vs traditional base10 systems

This may show my naiveté, but it is my understanding that quantum computing's obstacle is stabilizing the qubits. I also understand that standard computers use binary (on/off); but it seems like it may be easier with today's tech to read electric states between 0 and 9. Binary was the answer because it was very hard to read varying amounts of electricity, components degrade over time, and maybe maintaining a clean electrical "signal" was challenging.
But wouldn't it be easier to try to solve the problem of reading varying levels of electricity so we can go from 2 inputs to 10 and thereby increasing the smallest unit of storage and exponentially increasing the number of paths through the logic gates?
I know I am missing quite a bit (sorry the puns were painful) so I would love to hear why or why not.
Thank you
"Exponentially increasing the number of paths through the logic gates" is exactly the problem. More possible states for each n-ary digit means more transistors, larger gates and more complex CPUs. That's not to say no one is working on ternary and similar systems, but the reason binary is ubiquitous is its simplicity. For storage, more possible states also means we need more sensitive electronics for reading and writing, and a much higher error frequency during these operations. There's a lot of hype around using DNA (base-4) for storage, but this is more on account of the density and durability of the substrate.
You're correct, though, that your question is missing quite a bit - qubits are entirely different from classical information, whether we use bits or digits. Classical bits and trits respectively correspond to vectors like
Binary: |0> = [1,0]; |1> = [0,1];
Ternary: |0> = [1,0,0]; |1> = [0,1,0]; |2> = [0,0,1];
A qubit, on the other hand, can be a linear combination of classical states
Qubit: |Ψ> = α |0> + β |1>
where α and β are arbitrary complex numbers such that |α|^2 + |β|^2 = 1.
This is called a superposition, meaning even a single qubit can be in one of an infinite number of states. Moreover, unless you prepared the qubit yourself or received some classical information about α and β, there is no way to determine the values of α and β. If you want to extract information from the qubit, you must perform a measurement, which collapses the superposition and returns |0> with probability |α|^2 and |1> with probability |β|^2.
We can extend the idea to qutrits (though, just like trits, these are even more difficult to effectively realize than qubits):
Qutrit: |Ψ> = α |0> + β |1> + γ |2>
These requirements mean that qubits are much more difficult to realize than classical bits of any base.
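The measurement rule can be illustrated with a toy classical simulation in Python (the amplitudes 0.6 and 0.8 and the `measure` helper are made up for the example; a real qubit cannot be reduced to this, which is rather the point):

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

# A qubit as a pair of amplitudes (alpha, beta) with |alpha|^2 + |beta|^2 = 1.
# A measurement returns 0 with probability |alpha|^2 and 1 otherwise.
def measure(alpha, beta):
    return 0 if random.random() < abs(alpha) ** 2 else 1

alpha, beta = 0.6, 0.8   # |0.6|^2 + |0.8|^2 = 0.36 + 0.64 = 1
zeros = sum(measure(alpha, beta) == 0 for _ in range(10_000))
print(zeros / 10_000)    # close to |alpha|^2 = 0.36
```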

Why are "random" numbers plugged from memory often extremely large?

Sometimes during development of e.g. C code, you may accidentally index an array beyond its last element, resulting in a read of an essentially "random" chunk of memory. I work a lot with arrays of doubles and have noticed that when this happens, the double produced from the "random" memory is often very large, as in larger than 1e+300. I wonder why this is.
If the 64 bits used for interpreting the double were truly random, I would expect the exponent of the double to be uniformly distributed from 0 to 308 (ignoring the sign of the exponent), due to the way floating point numbers are laid out in memory using scientific (exponential) notation. Of course the values of the randomly selected bits in memory are not themselves randomly distributed, but correspond to some meaningful state for whatever process set these values.
To investigate this effect I wrote the following Python 3 script, which plots the distribution of both truly randomly generated doubles and doubles taken from "random" but unused memory:
import random, struct
import numpy as np
import matplotlib.pyplot as plt

N = 10000

def random_floats(N=1):
    return np.array(struct.unpack('d'*N, bytes(random.randrange(256) for _ in range(8*N))))

def exp_hist(a, label=None):
    a = a[~np.isnan(a)]
    a = a[~np.isinf(a)]
    a = a[a != 0]
    if len(a) == 0:
        print('Zeros only!')
        return
    a = np.abs(np.log10(np.abs(a)))
    plt.hist(a, range=(0, 350), density=True, alpha=0.8, label=label)

# Floats generated from uniformly random bits
a = random_floats(N)
exp_hist(a, 'random')

# Floats generated from memory content
a = np.empty(N)
exp_hist(a, 'memory')

plt.xlabel('exponent')
plt.legend()
plt.savefig('plot.png')
A typical result of running this script is shown below:
The exponents of the truly randomly generated doubles are indeed uniformly distributed.
The exponents of the doubles interpreted from memory content are either very small or very large. In fact, much of unused memory is zeroed out, leading to a lot of 0 values, which makes sense. However, just as I so often experience from out-of-bounds memory access, a lot of values near 1e+300 show up as well.
I would like an explanation of this large amount of extremely large doubles.
Note on running the script
If you would like to try out the script yourself, be aware that you may have to run it several times for anything interesting to show up. It may happen that every single number read from memory content will be 0, in which case it will tell you so. If this happens repeatedly, try lowering N (the number of doubles used).
There are many different things you might find in memory, but a surprising number of them map to very big or very small floating point numbers, infinities, or NaNs. In what follows, "FP" means IEEE 754 64-bit binary floating point.
First, because they have already been discussed in comments on the question, consider addresses. A 64 bit address usually has all the exponent bits zero (low end of memory), or all the exponent bits high (high end of memory, often stack addresses). If all the exponent bits are high, it is an infinity or NaN, which the program seems to ignore. If all the exponent bits are zero, it is a sub-normal number or zero. Sub-normal numbers are all less than 2.3E-308, counted as exponent 308.
Now consider 32-bit integers, another very common form of data. Negative 32-bit two's complement integers that map to finite FP are -1048577 or less; numbers like -42 or -1 map to NaNs, ignored by the program. Similarly, moderate-value positive integers have all the exponent bits zero and so map to sub-normal numbers, which land at the large-exponent end of the histogram. Even the small normal numbers correspond to surprisingly large integers. For example, the leading 32 bits of 1e-300 have integer value 27,618,847.
For both pointers and integers, there is a strong bias towards all the exponent bits having the same value, either all zero or all one. All one is a NaN or infinity, not counted by the program. All zero is subnormal, counted as very large exponent.
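This bias can be checked directly by reinterpreting bit patterns as doubles. A small sketch (`bits_as_double` is a made-up helper; 27,618,847 is the integer quoted above):

```python
import math
import struct

# Reinterpret a 64-bit pattern, given as high and low 32-bit halves,
# as an IEEE 754 binary64 double.
def bits_as_double(high32, low32=0):
    return struct.unpack('>d', struct.pack('>II', high32, low32))[0]

# A moderate positive integer in the leading 32 bits: exponent bits all
# zero, so the double is subnormal (below 2.3E-308).
assert 0 < bits_as_double(1000) < 2.3e-308

# The integer 27,618,847 in the leading 32 bits gives roughly 1e-300.
assert 0.9e-300 < bits_as_double(27_618_847) < 1.1e-300

# A small negative two's complement integer (-42 as unsigned 0xFFFFFFD6):
# exponent bits all one, producing a NaN that the script filters out.
assert math.isnan(bits_as_double(0xFFFFFFD6))
```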

Bit encoding for vector of rational numbers

I would like to implement ultra compact storage for structures with rational numbers.
In the book "Theory of Linear and Integer Programming" by Alexander Schrijver, I found the definition of bit sizes (page. 15) of rational number, vector and matrix:
The representation of a rational number is clear: a single bit for the sign and a logarithmic number of bits for the quotient and fraction.
What I can't figure out is how a vector can be encoded with only n additional bits to distinguish between its elements.
For example, what if I would like to write a vector of two elements:
524 = 1000001100b, 42 = 101010b. How can I use only 2 additional bits to specify where 1000001100 ends and 101010 starts?
The same problem exists with matrix representation.
Of course, it is not possible just to append the integer representations to each other and add the information about the merging place, since this would take many more bits than given by the formula in the book, which I don't have access to.
I believe this is a problem from coding theory, where I am not an expert, but I found something that might point you in the right direction. In this post an "interpolative code" is described, among others. If you apply it to your example (524, 42), you get
f (the number of integers to be encoded, all in the range [1, N]) = 2
N = 524
The maximum bit length of the 2 encoded integers is then
f · (2.58 + log(N/f)) = 9.99…, i.e. 10 bits (log to base 10)
Thus, it is possible to have ultra-compact encoding, although one has to spend a lot of time on coding and decoding.
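For intuition about how concatenated integers can stay separable without a full explicit length field per element, here is a sketch of Elias gamma coding, a classical self-delimiting code (note this is not the interpolative code from the linked post, and the helper names are made up):

```python
# Elias gamma code: prefix the binary form of n (which starts with a 1)
# with len-1 zeros, so each codeword announces its own length.
def gamma_encode(n):
    assert n >= 1
    b = bin(n)[2:]                      # binary digits of n, MSB first
    return '0' * (len(b) - 1) + b

def gamma_decode_stream(bits):
    out, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i + z] == '0':       # count the zero prefix
            z += 1
        out.append(int(bits[i + z:i + 2 * z + 1], 2))
        i += 2 * z + 1
    return out

# 524 and 42 concatenated with no separator bits, yet still decodable:
stream = gamma_encode(524) + gamma_encode(42)
assert gamma_decode_stream(stream) == [524, 42]
```

The price is 2·len−1 bits per element, far more than the interpolative bound quoted above, but it shows the principle.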
It is impossible to use only two bits to specify where the quotient ends and the fraction starts. At a minimum you will need something as large as the length of the quotient and/or the length of the fraction. Another way is to use a fixed number of bits for both quotient and fraction, similar to IEEE 754.

A few questions about CRC basics

I am an electronic engineer and have not found it important to consider CRC from a purely mathematical perspective. However, I have the following questions:
Why do we add n zeros to the message when we calculate the CRC, where n is the degree of the generator polynomial? I have seen this in the modulo-2 long division as well as in the hardware implementation of CRC.
Why do we want that the generator polynomial be divisible by (x+1)?
Why do we want that the generator polynomial not be divisible by x?
We add n zeros when computing an n-bit CRC because, when appending the CRC to the message and sending the whole (a usual practice in telecoms):
It allows the receiving side to process the bits of the CRC just as it processes the rest of the message, leading to a known remainder for any error-free transmission. This is especially useful when the end of the message is indicated by something that follows the CRC (a common practice): on the receiving side it saves an n-bit buffer, and on the transmit side it adds virtually no complexity (the extra terms of x^n reduce to an AND gate forcing message bits to zero during CRC transmission, and the n extra reduction steps are performed as the CRC is transmitted).
Mathematically, the CRC sent is (M(x) * x^n) mod P(x) = R(x) (perhaps within some constant, and/or perhaps with some prescribed bits added at the beginning of M(x), corresponding to an initialization of the CRC register), and the CRC computed on the receiving side is over the concatenation of M(x) and R(x), that is
(M(x) * x^n + R(x)) mod P(x), which is zero (or said constant).
It ensures that bursts of errors affecting both the end of the message and the contiguous CRC benefit from the full level of protection afforded by the choice of polynomial. In particular, if we computed C(x) as M(x) mod P(x), flipping the last bit of M(x) and the last bit of C(x) would go undetected, whereas most polynomials used in error detection ensure that any two-bit error is detected up to some large message size.
It is common practice to have CRC polynomials used for error detection divisible by x+1, because that ensures that any error affecting an odd number of bits is detected. However, that practice is not universal, and it can sometimes prevent selection of a better polynomial, for some useful definitions of better, including maximizing the length of message such that m errors are always detected (assuming no synchronization loss), for some combinations of m and n. In particular, if we want to be able to detect any 2-bit error for the longest message possible (which will be 2^n - 1 bits including the n-bit CRC), we need the polynomial to be primitive, thus irreducible, thus (for n > 1) not divisible by x+1.
It is universal practice to have CRC polynomials used for error detection not divisible by x, because otherwise the last bit of CRC generated would be constant, and would not help towards detection of errors in the rest of the message+CRC.
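The append-n-zeros step and the zero-remainder check at the receiver can be sketched as mod-2 long division over bit lists (a toy illustration; the generator x^3 + x + 1 and the message here are made up):

```python
# Mod-2 polynomial long division; bits and poly are 0/1 lists, MSB first.
def crc_remainder(bits, poly):
    bits = bits[:]
    n = len(poly) - 1                 # degree of the generator polynomial
    for i in range(len(bits) - n):
        if bits[i]:                   # XOR the divisor in at this position
            for j, p in enumerate(poly):
                bits[i + j] ^= p
    return bits[-n:]

msg = [1, 0, 1, 1, 0, 1]
poly = [1, 0, 1, 1]                   # x^3 + x + 1, so n = 3
crc = crc_remainder(msg + [0] * 3, poly)      # append n zeros, then divide

# Dividing the transmitted message+CRC leaves a zero remainder...
assert crc_remainder(msg + crc, poly) == [0, 0, 0]

# ...while a single-bit error changes the remainder and is detected.
corrupted = msg + crc
corrupted[2] ^= 1
assert crc_remainder(corrupted, poly) != [0, 0, 0]
```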

why a good choice of mod is "a prime not too close to an exact power of 2"

To generate a hash function, map a key k into one of m slots by taking the remainder of k divided by m. That is, the hash function is
h(k) = k mod m.
I have read at several places that a good choice of m will be
A prime - I understand that we want to remove common factors, hence a prime number is chosen
not too close to an exact power of 2 - why is that?
From Introduction to Algorithms:
"When using the division method, we avoid certain values of m. For example, m should not be a power of 2, since if m = 2^p, then h(k) is just the p lowest-order bits of k. Unless it is known that all low-order p-bit patterns are equally likely, it is better to make the hash function depend on all the bits of the key."
For example, if I choose 2^3 (so p = 3 and m = 8), the hashed keys depend only on the lowest 3 (p) bits, which is bad, because when you hash you want to include as much of the data as possible for a good distribution.
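The low-bits effect is easy to demonstrate with a few made-up keys that differ only above their low 3 bits:

```python
def h(k, m):
    return k % m

# Illustrative keys that are identical in their low 3 bits:
keys = [0b10101000, 0b11111000, 0b00001000]

# With m = 8 = 2**3, only the low p = 3 bits matter, so all keys collide:
assert {h(k, 8) for k in keys} == {0}

# With a prime modulus, the high bits influence the slot as well:
assert len({h(k, 7) for k in keys}) == 3
```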
