How much storage to represent sparse matrix - memory

I don't know how to solve this problem from Fundamentals of Data Structures in C, 2nd ed., Section 2.5:
On a computer with w bits per word, how much storage is needed to represent a sparse matrix, A, with t nonzero terms?
I think the answer is 3*w*t, because in a sparse matrix we only store the row, column, and value of each nonzero term,
so 3 words (3*w bits) for each of the t terms. But someone says the answer is w^2 + t... I don't understand what they mean.

In the most common “general purpose” sparse matrix formats (CSR and CSC), for a matrix with t nonzero terms there are two integer arrays, one of length n+1 (where n is the number of rows for CSR, or of columns for CSC) and one of length t, plus one array of floating-point values of length t. In practice, the size in bytes depends on the sizes of the integer and floating-point representations. On a theoretical machine with one uniform word size for everything, the total would be 2t + n + 1 words, i.e. roughly 3t words when n is on the order of t.
I fail to see how w^2 + t could be correct, or even related.
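As a quick sanity check of those counts (my own sketch, not from the book), here is how the triplet (COO) storage the question assumes and a CSR-style storage compare, assuming one machine word per stored index or value; the matrix and helper names are illustrative only:

    # Sketch: count "words" of storage for two common sparse formats,
    # assuming every stored row index, column index, or value takes one word.

    def coo_words(matrix):
        """Triplet/COO storage: (row, col, value) per nonzero -> 3*t words."""
        t = sum(1 for row in matrix for v in row if v != 0)
        return 3 * t

    def csr_words(matrix):
        """CSR storage: row pointers (rows+1), column indices (t), values (t)."""
        t = sum(1 for row in matrix for v in row if v != 0)
        return (len(matrix) + 1) + t + t

    A = [
        [0, 0, 3],
        [0, 5, 0],
        [7, 0, 0],
    ]
    print(coo_words(A))  # 9  -> the 3*t count from the question (t = 3)
    print(csr_words(A))  # 10 -> rows + 1 + 2*t, i.e. roughly 3*t as well

Either way the count grows linearly in t with a factor of about 3, which is nothing like w^2 + t.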

How are quarter-precision motion vectors encoded

I need to understand how exactly motion vectors are encoded for non-integer precision (quarter-pel, 1/16-pel, or whatever).
In the code, the motion vector components are always integers, but I don't understand how non-integer precision is handled.
For example, if my motion vector's "actual values" are, say, (3.5, 2.75), how do I get the "int" values that appear in the code? Or, if the x and y components in the code are (114, 82) with quarter-pel precision, what are the actual values?
Thank you for helping
They are basically scaled to integers and then coded. For instance, MV = 2.75 is scaled to scaledMV = 2.75 x 4 = 11. Note that integer MVs have to be scaled too, so that they can be decoded consistently; for instance, MV = 1.0 becomes scaledMV = 4 x 1.0 = 4.
FYI, the MV coding of HEVC is way too complicated to be explained here. So, I would suggest that you take a look at this paper.
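A minimal sketch of the scaling idea above, assuming quarter-pel precision (scale factor 4); the function names are mine and not from any codec:

    MV_SCALE = 4  # quarter-pel precision: 4 integer steps per full pixel

    def to_coded_mv(mv_x, mv_y, scale=MV_SCALE):
        """Fractional motion vector -> the integer components seen in the code."""
        return round(mv_x * scale), round(mv_y * scale)

    def to_actual_mv(coded_x, coded_y, scale=MV_SCALE):
        """Integer components from the code -> fractional motion vector."""
        return coded_x / scale, coded_y / scale

    print(to_coded_mv(3.5, 2.75))   # (14, 11)
    print(to_actual_mv(114, 82))    # (28.5, 20.5)

For 1/16-pel precision the scale factor would simply be 16 instead of 4.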

Bit encoding for vector of rational numbers

I would like to implement ultra compact storage for structures with rational numbers.
In the book "Theory of Linear and Integer Programming" by Alexander Schrijver, I found the definition of the bit sizes (page 15) of a rational number, a vector, and a matrix.
The representation of a rational number is clear: a single bit for the sign, plus roughly the logarithm of the numerator and of the denominator.
I can't figure out how a vector can be encoded with only n additional bits to distinguish between its n elements.
For example, what if I would like to write a vector of two elements:
524 = 1000001100b, 42 = 101010b. How can I use only 2 additional bits to specify where 1000001100 ends and 101010 starts?
The same problem exists with the matrix representation.
Of course, it is not possible to simply append the integer representations to each other and add information about where they are joined, since that would take many more bits than given by the formula in the book, which I don't have access to.
I believe this is a problem from coding theory, where I am not an expert. But I found something that might point you in the right direction. In this post an "interpolative code" is described, among others. If you apply it to your example (524, 42), you get
f (the number of integers to be encoded, all in the range [1, N]) = 2
N = 524
The maximum bit length of the encoded 2 integers is then
f • (2.58 + log(N/f)) = 9.99…, i.e. 10 bits
Thus, it is possible to have ultra compact encoding, although one has to spend a lot of time on encoding and decoding.
It is impossible to use only two bits to specify where one number ends and the next starts. You will need at least something on the order of the length of the first and/or the second number to delimit them. Another way is to use a fixed number of bits for both parts, similar to IEEE 754.
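To make that trade-off concrete, here is a minimal sketch of one classic self-delimiting code (Elias gamma); this is my own illustrative choice, not something either answer above prescribes. Each value carries its own length as a unary prefix, so concatenated values can be split apart again without any separate delimiter:

    def elias_gamma_encode(n):
        """Self-delimiting code: floor(log2(n)) zeros, then n in binary (n >= 1)."""
        binary = bin(n)[2:]
        return "0" * (len(binary) - 1) + binary

    def elias_gamma_decode_all(bits):
        """Split a concatenation of gamma codes back into the original integers."""
        values, i = [], 0
        while i < len(bits):
            zeros = 0
            while bits[i + zeros] == "0":
                zeros += 1
            values.append(int(bits[i + zeros : i + 2 * zeros + 1], 2))
            i += 2 * zeros + 1
        return values

    encoded = elias_gamma_encode(524) + elias_gamma_encode(42)
    print(encoded, len(encoded))            # 30 bits for this pair
    print(elias_gamma_decode_all(encoded))  # [524, 42]

The delimiting information costs roughly as many bits again as each number itself, which is exactly the overhead described above: this pair takes 30 bits instead of the 16 raw binary digits.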

Dimensions of LSTM variant in Deep Mind's Differentiable Neural Computer (DNC)

I'm trying to implement DeepMind's DNC (Nature paper) with PyTorch 0.4.0.
When implementing the LSTM variant they use, I ran into some trouble with the dimensions.
To simplify, suppose BATCH = 1.
The equations they list in the paper are these:
where [x;h] means a concatenation of x and h into one single vector, and i, f and o are column vectors.
My question is about how the state s_t is computed.
The second addend is obtained by multiplying i with a column vector, so the result is either a scalar (if I transpose i first and take the dot product) or ill-defined (two column vectors multiplied).
So the state would end up being a single scalar...
By the same reasoning the hidden state h_t would be a scalar too, but it has to be a column vector.
Obviously I'm wrong somewhere, but I can't figure out where.
By looking at the Wikipedia LSTM article I think I figured it out.
This is the formal definition of the standard LSTM given in the article:
    f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
    i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
    o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
    c_t = f_t ∘ c_{t-1} + i_t ∘ tanh(W_c x_t + U_c h_{t-1} + b_c)
    h_t = o_t ∘ tanh(c_t)
The circle (∘) denotes the element-wise (Hadamard) product.
By using this product in the corresponding parts of the DNC equations (those for s_t and h_t, the latter involving o_t), the dimensions work out.
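A minimal PyTorch sketch of that fix for a plain single-layer LSTM step (not the full DNC controller), assuming BATCH = 1, hidden size H, and tensor names of my own choosing:

    import torch

    H, X = 4, 3                      # hidden size and input size (arbitrary here)
    W = torch.randn(4 * H, X + H)    # one combined weight matrix for i, f, s~, o
    b = torch.randn(4 * H)

    def lstm_step(x, h_prev, s_prev):
        z = W @ torch.cat([x, h_prev]) + b          # shape (4*H,)
        i, f, s_tilde, o = z.chunk(4)               # four (H,) column vectors
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        s = f * s_prev + i * torch.tanh(s_tilde)    # element-wise products -> (H,)
        h = o * torch.tanh(s)                       # still (H,), not a scalar
        return h, s

    h, s = lstm_step(torch.randn(X), torch.zeros(H), torch.zeros(H))
    print(h.shape, s.shape)   # torch.Size([4]) torch.Size([4])

With * (element-wise multiplication) instead of a matrix or dot product, both s_t and h_t stay column vectors of size H.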

svm scaling input values

I am using libSVM.
Say my feature values are in the following format:
instance1 : f11, f12, f13, f14
instance2 : f21, f22, f23, f24
instance3 : f31, f32, f33, f34
instance4 : f41, f42, f43, f44
..............................
instanceN : fN1, fN2, fN3, fN4
I think there are two kinds of scaling that can be applied:
1. Scale each instance (row) vector so that it has zero mean and unit variance, i.e. ((f11, f12, f13, f14) - mean(f11, f12, f13, f14)) / std(f11, f12, f13, f14).
2. Scale each column of the above matrix to a range, for example [-1, 1].
In my experiments with the RBF kernel (libSVM), I found that the second scaling (2) improves the results by about 10%, but I did not understand why.
Could anybody explain the reason for applying scaling and why the second option gives me better results?
The standard thing to do is to make each dimension (or attribute, or column in your example) have zero mean and unit variance.
This brings every dimension onto a comparable scale. From http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf:
The main advantage of scaling is to avoid attributes in greater numeric ranges
dominating those in smaller numeric ranges. Another advantage is to avoid
numerical difficulties during the calculation. Because kernel values usually depend on
the inner products of feature vectors, e.g. the linear kernel and the polynomial
kernel, large attribute values might cause numerical problems. We recommend linearly
scaling each attribute to the range [-1, +1] or [0, 1].
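For concreteness, here is a small NumPy sketch of the two options from the question: per-instance (row-wise) standardization and the per-column scaling to [-1, 1] that the guide above recommends. The array contents and variable names are mine:

    import numpy as np

    X = np.array([[100.0, 2.0, 0.5, 10.0],
                  [200.0, 4.0, 0.1, 30.0],
                  [150.0, 3.0, 0.9, 20.0]])   # rows = instances, columns = features

    # Option 1: per-instance (row-wise) standardization to zero mean, unit variance.
    row_scaled = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

    # Option 2: per-feature (column-wise) linear scaling to the range [-1, 1].
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    col_scaled = 2 * (X - col_min) / (col_max - col_min) - 1

    print(row_scaled.round(2))
    print(col_scaled.round(2))

Option 2 treats each feature separately, which is what keeps one large-valued feature from dominating the kernel.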
I believe it comes down a lot to your original data.
If your original data has SOME extreme values in some columns, then in my opinion you lose some resolution when scaling linearly, for example to the range [-1, 1].
Let's say you have a column where 90% of the values are between 100 and 500, and in the remaining 10% the values are as low as -2000 and as high as +2500.
If you scale this data linearly, then you'll have:
-2000 -> -1 ## <- The min in your scaled data
+2500 -> +1 ## <- The max in your scaled data
100 -> -0.06666666666666665
234 -> -0.007111111111111068
500 -> 0.11111111111111116
You could argue that the distinction between what was originally 100 and what was 500 is smaller in the scaled data than it was in the original data.
In the end, I believe it very much comes down to the specifics of your data, and the 10% improvement may well be coincidental; you will certainly not see a difference of this magnitude on every dataset you try both scaling methods on.
At the same time, in the paper linked in the other answer, you can clearly see that the authors recommend scaling the data linearly.
I hope someone finds this useful!
The accepted answer describes "standard scaling", which is not efficient for high-dimensional data stored in sparse matrices (text data is a typical use case); in such cases you may resort to "max scaling" and its variants, which work with sparse matrices.
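A small sketch of that suggestion with scikit-learn and SciPy (assuming both are installed); MaxAbsScaler divides each column by its maximum absolute value, so the data stays sparse:

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.preprocessing import MaxAbsScaler

    # A tiny sparse matrix standing in for, e.g., bag-of-words counts.
    X = csr_matrix(np.array([[0, 3, 0, 1],
                             [0, 0, 4, 0],
                             [2, 0, 0, 5]], dtype=float))

    scaled = MaxAbsScaler().fit_transform(X)   # stays sparse, columns scaled into [-1, 1]
    print(scaled.toarray().round(2))

Centering (as in standard scaling) would turn every stored zero into a nonzero and destroy the sparsity, which is why max scaling is preferred for this kind of data.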

How are matrices stored in memory?

Note: this may be more related to computer organization than software; I'm not sure.
I'm trying to understand something related to data compression, say for JPEG photos. Essentially, a very dense matrix is converted (via a discrete cosine transform) into a much sparser matrix. Supposedly it is this sparse matrix that is stored. Take a look at this link:
http://en.wikipedia.org/wiki/JPEG
Compare the original 8x8 sub-block image example to matrix "B", which has been transformed to have overall lower-magnitude values and many more zeros. How is matrix B stored so that it saves much more memory than the original matrix?
The original matrix clearly needs 8x8 entries x 8 bits/entry, since values can range from 0 to 255. OK, so I think it's pretty clear we need 64 bytes of memory for it. Matrix B, on the other hand, hmmm. The best case I can think of is that values range from -26 to +5, so an entry (like -26) needs at most 6 bits (5 bits for the magnitude 26 plus 1 bit for the sign, I guess). So then you could store 8x8x6 bits = 48 bytes.
The other possibility I see is that the matrix is stored in "zig-zag" order from the top left. Then we can specify a start and an end address and just keep storing along the diagonals until we're only left with zeros. Let's say it's a 32-bit machine; then two addresses (start + end) take 8 bytes; for the other nonzero entries at, say, 6 bits each, we have to go along almost all of the top diagonals, storing a total of 28 elements. In total this scheme would take 29 bytes.
To summarize my question: if JPEG and other image encoders claim to save space by using algorithms that make the image matrix less dense, how is this extra space actually realized on my hard disk?
Cheers
The DCT needs to be accompanied by other compression schemes that take advantage of the many zeros (especially among the high-frequency coefficients). A simple example is run-length encoding.
JPEG uses a variant of Huffman coding.
As it says under "Entropy coding", a zig-zag pattern is used together with RLE, which already reduces the size in many cases. However, as far as I know the DCT by itself doesn't produce a sparse matrix; it concentrates the energy so that, after quantization, the block has far lower entropy. The quantization step is also where the compression becomes lossy: the input matrix is transformed with the DCT, then the values are quantized, and then Huffman coding is applied.
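A toy sketch of that pipeline up to and including quantization, assuming NumPy and SciPy are available; the block contents and the flat quantization step of 16 are made up for illustration and are not JPEG's actual tables:

    import numpy as np
    from scipy.fftpack import dct

    # A smooth 8x8 block (level-shifted by 128, as JPEG does), so the energy ends
    # up concentrated in the low-frequency (top-left) DCT coefficients.
    block = np.add.outer(np.arange(8), np.arange(8)) * 4.0 - 128.0

    # 2-D DCT via two passes of the 1-D DCT.
    coeffs = dct(dct(block.T, norm="ortho").T, norm="ortho")

    # Coarse quantization: the lossy step that creates the zero runs which the
    # zig-zag scan, RLE and Huffman coding then exploit.
    quantized = np.round(coeffs / 16).astype(int)
    print(quantized)   # mostly zeros outside the top-left corner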
The simplest compression would take advantage of repeated sequences of symbols (zeros). A matrix in memory may look like this (written in decimal):
0000000000000100000000000210000000000004301000300000000004
After compression it may look like this
(0,13)1(0,11)21(0,12)43010003(0,11)4
(Symbol,Count)...
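A minimal sketch of that run-length idea in Python, collapsing only the zero runs to mirror the (Symbol, Count) notation above; the function name is mine:

    def rle_zero_runs(digits):
        """Collapse runs of '0' into '(0,count)'; copy other symbols through."""
        out, i = [], 0
        while i < len(digits):
            if digits[i] == "0":
                run = 1
                while i + run < len(digits) and digits[i + run] == "0":
                    run += 1
                out.append(f"(0,{run})")
                i += run
            else:
                out.append(digits[i])
                i += 1
        return "".join(out)

    print(rle_zero_runs("0000100020000003"))   # (0,4)1(0,3)2(0,6)3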
As I understand it, JPEG doesn't only compress, it also drops data. After the 8x8 block is transformed to the frequency domain, the insignificant (high-frequency) coefficients are dropped, which means only the significant 6x6 or even 4x4 portion has to be saved. That is why it can achieve a higher compression ratio than lossless methods (like GIF).
