Optimal layout of 2D array with least memory access time

Let us say I have a 2D array that I can read from a file
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
I would like to store them as a 1D array, arr[16].
I am aware of row-wise and column-wise storage.
Both break up the 2D structure of the array. Say I would like to convolve this with a 2x2 filter. Then at conv(1,1), I would be accessing the elements 1, 2, 5, 6.
Instead, can I store the data in a pattern such that the elements 1, 2, 5, 6 are next to each other rather than far apart?
This would reduce the memory latency.

It depends on your processor, but supposing you have a typical Intel cache line size of 64 bytes, then picking square subregions that are each 64 bytes in size feels like a smart move.
If your individual elements are a byte each then 8x8 subtiles make sense. So, e.g.
#define index(x, y) (((x)&7) | (((y)&7) << 3) | \
(((x)&~7) << 3) | (((y)&~7) ... shifted and/or multiplied as per size of x))
So in each full tile:
in 49 of every 64 cases all data is going to be within the same cache line;
in a further 14 it's going to lie across two cache lines; and
in one case in 64 it is going to need four.
So that's an average of 1.265625 cache lines touched per output pixel, versus 2.03125 in the naive case.
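A minimal Python sketch of the tiled layout, assuming one byte per element, 64-byte cache lines, a 64-byte-aligned buffer and a width that is a multiple of 8. The function names are illustrative, and the tile-row term is just one way to fill in the part the macro leaves open. It counts the cache lines touched by a 2x2 window for both layouts.

```python
LINE_BYTES = 64   # assumed cache line size, one byte per element
TILE = 8          # 8x8 tile = 64 bytes = one cache line

def tiled_index(x, y, width):
    """Offset of (x, y) when the image is stored as consecutive 8x8 tiles."""
    tiles_per_row = width // TILE
    within = (x & 7) | ((y & 7) << 3)           # position inside the tile
    tile = (x >> 3) + (y >> 3) * tiles_per_row  # which tile, row by row
    return tile * TILE * TILE + within

def row_major_index(x, y, width):
    """Offset of (x, y) in plain row-wise storage."""
    return y * width + x

def lines_touched(index, x, y, width):
    """Number of distinct cache lines read by a 2x2 window at (x, y)."""
    offsets = [index(x + dx, y + dy, width) for dy in (0, 1) for dx in (0, 1)]
    return len({off // LINE_BYTES for off in offsets})

def average_lines(index, width=128):
    """Average over a 64x64 block of window positions (whole periods of both layouts)."""
    total = sum(lines_touched(index, x, y, width)
                for y in range(64) for x in range(64))
    return total / (64 * 64)

print(average_lines(tiled_index), average_lines(row_major_index))
```

With these assumptions the two averages match the figures quoted above for the tiled and the naive case.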

I found what I was looking for: Morton ordering of an array, which has been shown to reduce memory access time. Another method is to use a Hilbert curve, which is reported to be more effective than Morton ordering.
Here is a link to an article explaining this:
https://insidehpc.com/2015/10/morton-ordering/
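As a rough illustration of Morton ordering (not taken from the linked article), the index of an element is obtained by interleaving the bits of its x and y coordinates. The helper names below are made up for this sketch, and it assumes a square array whose side is a power of two.

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of x and y (x in even positions, y in odd ones)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def to_morton_order(rows):
    """Flatten a square 2D list (side a power of two) into Morton order."""
    n = len(rows)
    out = [None] * (n * n)
    for y in range(n):
        for x in range(n):
            out[morton_index(x, y)] = rows[y][x]
    return out

grid = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(to_morton_order(grid))
```

For the 4x4 example from the question, the elements 1, 2, 5, 6 end up in the first four slots of the 1D array.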

Deflate and fixed Huffman codes

I'm trying to implement a deflate compressor and I have to decide whether to compress a block using the static Huffman code or create a dynamic one.
What is the rationale behind the lengths associated with the static code? (This is the table included in the RFC.)
Lit Value    Bits
---------    ----
  0 - 143     8
144 - 255     9
256 - 279     7
280 - 287     8
I thought the static code was more biased towards plain ASCII text; instead it looks like it slightly prefers the compression of the RLE lengths.
What is a good heuristic to decide whether to use the static code?
I was thinking of building a probability distribution from a sample of the input data and calculating a distance (maybe EMD?) from the probabilities implied by the static code.

I would guess that the creator of the code took a large sample of literals and lengths from compressed data, likely including executables along with text, and found typical code lengths over the large set. They were then approximated with the table shown. However the author passed away many years ago, so we'll never know for sure.
You don't need a heuristic. Once you have done the work to find matching strings, it is comparatively very fast to compute the number of bits in the block for both a dynamic and static representation. Then simply pick the smaller one. Or the static one if equal (decodes faster).
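A minimal sketch of that comparison, assuming the symbol counts for the block are already known and that the dynamic code lengths and the size of the dynamic header have been computed elsewhere. All function and parameter names here are hypothetical. Extra bits for lengths and distances are left out because they are the same in both block types.

```python
def fixed_litlen_bits(sym):
    """Code length of a literal/length symbol in the fixed code (RFC 1951 table)."""
    if sym <= 143:
        return 8
    if sym <= 255:
        return 9
    if sym <= 279:
        return 7
    return 8

FIXED_DIST_BITS = 5  # every distance symbol is 5 bits in the fixed code

def static_bits(lit_freq, dist_freq):
    """Bits for the symbols of one block; lit_freq/dist_freq map symbol -> count."""
    return (sum(n * fixed_litlen_bits(s) for s, n in lit_freq.items())
            + FIXED_DIST_BITS * sum(dist_freq.values()))

def dynamic_bits(lit_freq, dist_freq, lit_len, dist_len, header_bits):
    """Same count for a dynamic code; lit_len/dist_len map symbol -> code length."""
    return (header_bits
            + sum(n * lit_len[s] for s, n in lit_freq.items())
            + sum(n * dist_len[s] for s, n in dist_freq.items()))

def choose_block_type(lit_freq, dist_freq, lit_len, dist_len, header_bits):
    s = static_bits(lit_freq, dist_freq)
    d = dynamic_bits(lit_freq, dist_freq, lit_len, dist_len, header_bits)
    return "static" if s <= d else "dynamic"  # ties go to static
```

For a very short block the header cost dominates and the static code tends to win; for a long block with a skewed distribution the dynamic code does.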

I don't know about rationale, but there was a small amount of irrationale in choosing the static code lengths:
In the table in your question, the maximum static code number there is 287, but the DEFLATE specification only allows up to code 285, meaning code lengths have wastefully been assigned to two invalid codes. (And not even the longest ones either!) It's a similar story with the table for distance codes, with 32 codes having lengths assigned, but only 30 valid.
So there are some easy improvements that could have been made, but that said, without some prior knowledge of the data, it's not really possible to produce anything that's massively more efficient generally. The "flatness" of the table (no code longer than 9 bits) limits the worst-case performance to 1 extra bit per byte of incompressible data.
I think the main rationale behind the groupings is that by keeping group sizes to a multiple of 8, it's possible to tell which group a code belongs to by looking at its 5 most significant bits, which also tells you its length, along with what value to add to immediately get the code value itself:
00000 00   .. 00101 11    7 bits  + 256  -> (256..279)
00110 000  .. 10111 111   8 bits  -  48  -> (  0..143)
11000 000  .. 11000 111   8 bits  +  88  -> (280..287)
11001 0000 .. 11111 1111  9 bits  - 256  -> (144..255)
So in theory you could set up a lookup table with 32 entries to quickly read in the codes, but it's an uncommon case and probably not worth optimising for.
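A sketch of what such a 32-entry table could look like, assuming the next 9 bits of the stream have already been gathered into an integer with the first bit read as the most significant one. The names are illustrative, and the small encoder exists only to check the table against the RFC 1951 code assignments.

```python
# Indexed by the top 5 bits of the next 9 stream bits: (code length, offset)
PREFIX_TABLE = ([(7, 256)] * 6      # 00000 .. 00101 -> symbols 256..279
                + [(8, -48)] * 18   # 00110 .. 10111 -> symbols   0..143
                + [(8, 88)] * 1     # 11000          -> symbols 280..287
                + [(9, -256)] * 7)  # 11001 .. 11111 -> symbols 144..255

def decode_fixed(peek9):
    """Decode one fixed-Huffman literal/length symbol.

    Returns (symbol, bits_consumed); unused low bits of peek9 are ignored.
    """
    length, offset = PREFIX_TABLE[peek9 >> 4]
    code = peek9 >> (9 - length)
    return code + offset, length

def encode_fixed(sym):
    """Reference encoder from the RFC 1951 table, used here only for checking."""
    if sym <= 143:
        return 0x30 + sym, 8
    if sym <= 255:
        return 0x190 + (sym - 144), 9
    if sym <= 279:
        return sym - 256, 7
    return 0xC0 + (sym - 280), 8
```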
There are only really two cases (with some overlap) where Fixed Huffman blocks are likely to be the most efficient:
where the input size in bytes is very small, Static Huffman can be more efficient than Uncompressed, because Uncompressed uses a 32-bit header, while Fixed Huffman needs only a 7-bit footer, plus 1 bit potential overhead per byte.
where the output size is very small (ie. small-ish, highly compressible data), Static Huffman can be more efficient than Dynamic Huffman - again because Dynamic Huffman uses a certain amount of space for an additional header. (A practical minimum header size is difficult to calculate, but I'd say at least 64 bits, probably more.)
That said, I've found they are actually helpful from a developer's perspective, because it's very easy to implement a Deflate-compatible function using Static Huffman blocks, and to iteratively improve from there to get more efficient algorithms working.

JVM 64-bit different memory usages?

I've done some reading but I'm not entirely sure about one thing: how much memory would the following use on a 64-bit JVM? (Sorry if this is a stupid question, but I'm a bit confused and don't know much about this.)
MyObject[] myArray; - I know an array takes up 24 bytes, but how much does each element in this array take? Is every element an object reference, meaning 8 bytes per element? If not, how do I know how many bytes each element in this array needs?

Normally, that is when using heap sizes of less than 32 GB, the 64-bit JVM uses compressed oops which store object pointers as a 32-bit integer (scaled by three bits when used, since all objects are aligned to 8 bytes; see the link for details), so each element would actually only use 4 bytes.
If you use more than 32 GB of heap or otherwise turn off compressed oops, however, then each element will indeed use 8 bytes.
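The 32 GB limit follows directly from the scaling: a 32-bit value shifted left by three bits can address 2^32 * 8 bytes. A small sketch of that arithmetic, as a simplified model rather than HotSpot's actual code (the heap base parameter and the function names are assumptions):

```python
OOP_SHIFT = 3  # objects are aligned to 8 bytes = 2**3

def decode_oop(narrow, heap_base=0):
    """Turn a 32-bit compressed reference into a byte address."""
    return heap_base + (narrow << OOP_SHIFT)

def max_heap_bytes(ref_bits=32):
    """Largest heap addressable with compressed references."""
    return (1 << ref_bits) << OOP_SHIFT

def reference_bytes(heap_bytes):
    """Bytes per reference-typed array element under this simple model."""
    return 4 if heap_bytes < max_heap_bytes() else 8

print(max_heap_bytes() // 2**30)  # limit in GiB
```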
Also, I suspect that your statement on the array header being 24 bytes is wrong. To begin with, when compressing oops, the class reference in the header is also compressed, and the identity-hash-code and array length fields are 32-bit to begin with, so I suspect it is more likely to use 12 bytes. Even when using full-length oops, it should still only take 16 bytes. I can't find any hard source verifying either, however. In general, however, it should be said that Hotspot does not even use a fixed-size object header but one that varies in size depending on various circumstances of the object. This article describes some of those circumstances.
That is on the Hotspot JVM, at least. Since the JLS doesn't specify any primitive sizes, it could, theoretically, be anything on any given JVM, though 8 bytes are, of course, the most likely implementation choice.

Here is good information on how to calculate the memory usage of a Java array.
For example, let's consider a 10x10 int array. Firstly, the "outer" array has its 12-byte object header followed by space for the 10 elements. Those elements are object references to the 10 arrays making up the rows. That comes to 12+4*10=52 bytes, which must then be rounded up to the next multiple of 8, giving 56. Then, each of the 10 rows has its own 12-byte object header, 4*10=40 bytes for the actual row of ints, and again, 4 bytes of padding to bring the total for that row to a multiple of 8. So in total, that gives 11*56=616 bytes. That's a bit bigger than if you'd just counted on 10*10*4=400 bytes for the hundred "raw" ints themselves.
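The same arithmetic as a short sketch, using the figures from this answer as defaults (12-byte header, 4-byte references, 8-byte alignment). These are the answer's assumptions, not guaranteed values for every JVM, and the function names are made up.

```python
def align8(n):
    """Round up to the next multiple of 8."""
    return (n + 7) & ~7

def array_bytes(length, elem_bytes, header_bytes=12):
    """Size of one array object: header plus elements, padded to 8 bytes."""
    return align8(header_bytes + length * elem_bytes)

def int_2d_array_bytes(rows, cols, ref_bytes=4, header_bytes=12):
    """Outer array of references plus one int[] per row."""
    outer = array_bytes(rows, ref_bytes, header_bytes)
    row = array_bytes(cols, 4, header_bytes)
    return outer + rows * row

print(int_2d_array_bytes(10, 10))
```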

CUDA 128 bytes read in a single instruction

I am new to CUDA and am currently optimizing an existing application for molecular dynamics. It takes an array of double4 with coordinates and computes forces based on the neighbor list. I wrote a kernel with the following lines:
double4 mPos = d_arr_xyz[gid];
while (-1 != (id = d_neib_list[gid * MAX_NEIGHBORS + i])) {
    Calc(gid, mPos, AA, d_arr_xyz, id);
    i++;
}
Calc then takes d_arr_xyz[id] and calculates the force. That gives 1 read of double4 + 65 reads of (int + double4), one per call of Calc (65 is the average number of neighbors, i.e. entries not equal to -1, in d_neib_list for each particle).
Is it possible to reduce those reads? The neighbor lists of different particles, i.e. d_arr_xyz[gid] and d_arr_xyz[id], do not correlate, so I cannot use shared memory to cache d_arr_xyz for the block of threads.
What I see is that if I could somehow load the whole list of MAX_NEIGHBORS ints into shared memory in one or a few large transactions, that would remove 65 separate reads of int.
So the question is: is it possible to turn those 65 reads of int into several large transactions? I read in the documentation that reads can be as long as 128 bytes. What exactly should I write so that the generated assembly issues 1 large read?
Update:
Thank you for your replies. Following the answer from user talonmies below, I swapped the x and y dimensions of the neighbor matrix. Now consecutive threads load consecutive ints, which I guess may result in a 128-byte read. The program runs 8% faster.

All memory transactions are issued (where possible) on a per warp basis. So the 128 byte transaction you are asking about is when all 32 threads in a warp issue a memory load instruction which can be serviced in a single "coalesced" transaction. A single thread can't issue large memory transactions, only a warp of 32 threads can, and only when the memory coalescing requirements of whichever architecture you run the code on can be satisfied.
I couldn't really follow your description of what your code is actually doing, but from first principles alone, the answer would appear to be no.
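To see why swapping the dimensions of the neighbor matrix (as in the update to the question) helps, here is a deliberately simplified model of coalescing, written in Python rather than CUDA. It counts how many 128-byte segments the 32 threads of one warp touch when each reads its i-th neighbor index. The values for MAX_NEIGHBORS and the particle count are made-up assumptions, and real hardware rules vary by architecture.

```python
WARP = 32
SEGMENT = 128    # bytes per memory transaction in this simplified model
INT_BYTES = 4

def segments_touched(addresses):
    """Number of distinct 128-byte segments covered by a set of byte addresses."""
    return len({a // SEGMENT for a in addresses})

def particle_major(gid, i, max_neighbors, n_particles):
    """Original layout: d_neib_list[gid * MAX_NEIGHBORS + i]."""
    return (gid * max_neighbors + i) * INT_BYTES

def neighbor_major(gid, i, max_neighbors, n_particles):
    """Transposed layout: d_neib_list[i * N + gid]."""
    return (i * n_particles + gid) * INT_BYTES

def warp_segments(layout, i, max_neighbors=128, n_particles=4096, first_gid=0):
    """Segments touched when threads first_gid..first_gid+31 read neighbor i."""
    addrs = [layout(first_gid + t, i, max_neighbors, n_particles)
             for t in range(WARP)]
    return segments_touched(addrs)

print(warp_segments(particle_major, 0), warp_segments(neighbor_major, 0))
```

In the original layout each thread of the warp lands in its own segment; in the transposed layout the 32 ints are contiguous and fit in one.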

How much storage would be required to store a human genome?

I'm looking for the amount of storage in bytes (MB, GB, TB, etc.) required to store a single human genome. I read a few articles on Wikipedia about DNA, chromosomes, base pairs, genes, and have some rough guess, but before disclosing anything I'd like to see how others would approach this issue.
An alternative question would be how many atoms are there in human DNA, but that would be off topic for this site.
I understand that this will be an approximation, so I'm looking for the minimal value that would be able to store DNA of any human.

If you trust such things, here is what Wikipedia claims (from http://en.wikipedia.org/wiki/Human_genome#Information_content):
The 2.9 billion base pairs of the haploid human genome correspond to a
maximum of about 725 megabytes of data, since every base pair can be
coded by 2 bits. Since individual genomes vary by less than 1% from
each other, they can be losslessly compressed to roughly 4 megabytes.

You do not store all the DNA in one stream; most of the time it is stored per chromosome.
A large chromosome takes about 300 MB and a small one about 50 MB.
Edit:
I think the first reason why it is not saved at 2 bits per base pair is that it would make the data harder to work with. Most people would not know how to convert it. And even if a conversion program were provided, a lot of people in large companies or research institutes are not allowed to install programs, need to ask first, or do not know how...
1 GB of storage costs next to nothing, and even a 3 GB download takes only 4 minutes at 100 Mbit/s, and most companies have faster connections.
Another point is that the data isn't as simple as you are told.
E.g. the sequencing method invented by Craig Venter was a great breakthrough but has its downsides. It could not resolve long chains of the same base pair, so it is not always 100% clear whether there are 8 A's or 9 A's. These are things you have to take care of later on...
Another example is DNA methylation, because you can't store this information in a 2-bit representation.

Basically, each base pair takes 2 bits (you can use 00, 01, 10, 11 for T, G, C, and A). Since there are about 2.9 billion base pairs in the human genome, (2 * 2.9 billion) bits ~= 691 MiB (725 MB in decimal units).
I'm no expert, however, the Human Genome page on Wikipedia states the following:
Raw MB:
Male (XY): 770MB
Female (XX): 756MB
I'm not sure where their variance comes from, but I'm sure you can figure it out.
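The 691 and 725 figures quoted in this thread come from the same base-pair count; the difference is only binary versus decimal megabytes. A quick check (the function name is illustrative):

```python
def two_bit_size(base_pairs):
    """Size of a sequence stored at 2 bits per base pair, as (MB, MiB)."""
    size_bytes = 2 * base_pairs / 8
    return size_bytes / 1e6, size_bytes / 2**20

mb, mib = two_bit_size(2_900_000_000)
print(round(mb), round(mib))
```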

Yes, the minimum storage space needed for whole human DNA is about 770 MB.
However, the 2-bit representation is impractical. It is hard to search through or do computations on it. Therefore, some mathematicians designed more effective ways to store those sequences of bases and use them in search and comparison algorithms. One such example is GARLI.
This application runs on my PC right now, and I have the human genome stored in 1563 MB.

The human genome contains over 3 billion base pairs. So if you represented each base pair as two bits then it would take over 6.15 × 10⁹ bits or approximately 770 MB.

Just did it too. The raw sequence is ~700 MB. If one uses a fixed storage sequence or a fixed sequence storage algorithm, and the fact that the changes are 1%, I calculated ~120 MB with a per-chromosome, sequence-offset, state-delta storage. That's it for the storage.

There are 4 nucleotide bases that make up our DNA: A, C, G and T. Therefore each base in the DNA takes up 2 bits. There are around 2.9 billion bases, so that's around 700 megabytes. The weird thing is that this would fill a normal data CD! Coincidence?!?

All answers are leaving off the fact that nuDNA is not the only DNA that defines a human genome. mtDNA is also inherited and it contributes an additional 16,500 base pairs to a human genome, bringing it more in line with the Wikipedia guess of 770MB for males, and 756MB for females.
This does not mean that a human genome can easily be stored on a 4 GB USB stick. Bits do not represent information by themselves; it is the combination of bits that represents information. So in the case of nuDNA and mtDNA, the bits are encoded (not to be confused with compressed) to represent proteins and enzymes that in themselves would require many MBs of raw data to represent, especially in terms of functionality.
Food for thought: 80% of the human genome is called "non-coding" DNA, so did you actually really believe that the entire human body and brain can be represented in a mere 151 to 154MBs of raw data?

Most answers, except those by users slayton, rauchen and Paul Amstrong, are dead wrong if it's about pure one-to-one storage without compression techniques.
The human genome with 3 Gb of nucleotides corresponds to 3 GB of bytes and not ~750 MB. The constructed "haploid" genome according to NCBI is currently 3436687 kb or 3.436687 Gb in size. Check here for yourself.
Haploid = single copy of a chromosome.
Diploid = two versions of haploid.
Humans have 22 unique chromosomes x 2 = 44.
The male 23rd chromosome pair is X, Y, which makes 46 in total.
The female 23rd chromosome pair is X, X, which also makes 46 in total.
For males it would be 23 + 1 chromosomes in data storage on an HDD and for females 23 chromosomes, which explains the small differences mentioned now and then in the answers. The X chromosome of males is equal to the X chromosome of females.
Thus loading the genome (23 + 1) into memory is done in parts via BLAST using databases constructed from FASTA files. Zipped or not, nucleotide sequences are hard to compress. Back in the early days one of the tricks used was to replace tandem repeats with a shorter coding (e.g. GACGACGAC with "3GAC"; 9 bytes to 4 bytes). The reason was to save hard drive space (the era of 500 MB-2 GB HDDs with 7,200 rpm platters and SCSI connectors). For sequence searching this was also done with the query.
If "coded nucleotide" storage would be 2-bit per letter then you get for a byte:
A = 00
C = 01
G = 10
T = 11
Only this way do you fully profit from all 8 bit positions of a byte. For example the combination 00.01.10.11 (as byte 00011011) would then correspond to "ACGT" (and show up in a text file as an unrecognizable character). This alone is responsible for the fourfold reduction in file size that we see in other answers. Thus 3.4 Gb will be downsized to 0.85917175 GB... ~860 MB, including a then required conversion program (23 kb-4 MB).
But... in biology you want to be able to read something, so gzip compression is more than enough. Unzipped, you can still read it. If this byte packing were used, it would become harder to read the data. That's why FASTA files are plain-text files in reality.
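A minimal sketch of the 2-bit packing described above, using the A=00, C=01, G=10, T=11 coding from this answer. The function names are made up, and the sequence length has to be stored separately because the last byte is zero-padded.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASE = "ACGT"

def pack(seq):
    """Pack a string of A/C/G/T into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (6 - 2 * (i % 4))  # first base in the top bits
    return bytes(out)

def unpack(data, length):
    """Recover the first `length` bases from packed bytes."""
    return "".join(BASE[(data[i // 4] >> (6 - 2 * (i % 4))) & 3]
                   for i in range(length))

print(format(pack("ACGT")[0], "08b"))
```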

There are only 2 types of base pairs: cytosine can only bind to guanine, and adenine can only bind to thymine.
So each base pair can be considered a single bit.
This means that an entire strand of human DNA, ~3 billion "bits", would be right around ~350 megabytes.

One base -- T, C, A, G (in the base-4 number system: 0, 1, 2, 3) -- is encoded as two bits (not one), so one base pair is encoded by four bits.

Lookup table size reduction

I have an application in which I have to store a couple of million integers in a lookup table. Obviously I cannot keep that much data in memory, and my requirements are tight: the data has to live on an embedded system, so space is very limited. I would therefore like to ask about recommended methods for reducing the size of the lookup table. I cannot use function approximation such as neural networks; the values need to be in a table. The range of the integers is not known at the moment. When I say integers I mean 32-bit values.
Basically the idea is to use some compression method to reduce the amount of memory without losing much precision. This needs to run in hardware, so the computational overhead cannot be very high.
In my algorithm I have to access one value of the table, do some operations with it and then update the value. In the end what I should have is a function to which I pass an index and get back a value, and another function to write a value into the table.
I found one method called tile coding, which is based on several lookup tables. Does anyone know any other method?
Thanks.

I'd look at the types of numbers you need to store and pull out the information that's common for many of them. For example, if they're tightly clustered, you can take the mean, store it, and store the offsets. The offsets will have fewer bits than the original numbers. Or, if they're more or less uniformly distributed, you can store the first number and then store the offset to the next number.
It would help to know what your key is to look up the numbers.
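A rough sketch of the "store a base plus small offsets" idea. It uses the minimum rather than the mean as the base so that all offsets are non-negative, which is a substitution made for this sketch. The names are illustrative, and the read and write functions mirror the index-based access described in the question.

```python
def encode_offsets(values):
    """Store the minimum once, then each value as a non-negative offset."""
    base = min(values)
    offsets = [v - base for v in values]
    bits = max(1, max(offsets).bit_length())  # bits needed per offset
    return base, bits, offsets

def lookup(base, offsets, index):
    """Read back the original value at index."""
    return base + offsets[index]

def update(base, bits, offsets, index, value):
    """Write a value back; fails if it no longer fits in the chosen width."""
    off = value - base
    if not 0 <= off < (1 << bits):
        raise ValueError("value outside the range covered by the offsets")
    offsets[index] = off

base, bits, offsets = encode_offsets([1000000, 1000003, 1000255])
print(base, bits, lookup(base, offsets, 2))
```

How much this saves depends entirely on how tightly the values cluster, which is why the range of the integers matters.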

I need more detail on the problem. If you cannot store the real value of the integers but instead an approximation, that means you are going to reduce (throw away) some of the data (detail), correct? I think you are looking for a hash, which can be an art form in itself. For example, say you have 32-bit values: one hash would be to take the 4 bytes and XOR them together. This results in a single 8-bit value, reducing your storage by a factor of 4 but also discarding part of the original data. Typically you could go further and only use a few of those 8 bits, say the lower 4, and reduce the value further.
I think the real question is whether you need the data or not. If you need the data, you have to compress it or find more memory to store it. If you don't, then use a hash of some sort to reduce the number of bits until you fit in the amount of memory you have for storage.
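The XOR fold mentioned above could look like this (the function names are made up for the sketch). It is lossy: many different 32-bit values map to the same result.

```python
def xor_fold8(value):
    """XOR the four bytes of a 32-bit value into a single byte."""
    value &= 0xFFFFFFFF
    return (value ^ (value >> 8) ^ (value >> 16) ^ (value >> 24)) & 0xFF

def xor_fold4(value):
    """Keep only the lower 4 bits of the folded byte."""
    return xor_fold8(value) & 0x0F

print(hex(xor_fold8(0x12345678)))
```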

Read http://www.cs.ualberta.ca/~sutton/RL-FAQ.html
"Function approximation" refers to the use of a parameterized functional form to represent the value function (and/or the policy), as opposed to a simple table.
Perhaps that applies. Also, update your question with additional facts -- don't merely answer in the comments.
Edit.
A bit array can easily store a bit for each of your millions of numbers. Let's say you have numbers in the range of 1 to 8 million. In a single megabyte of storage you can have a 1 bit for each number in your set and a 0 for each number not in your set.
If you have numbers in the range of 1 to 32 million, you'll require 4 MB of memory for a big table of all 32M distinct numbers.
See my answer to Modern, high performance bloom filter in Python? for a Python implementation of a bit array of unlimited size.
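For illustration, a small fixed-size bit array along these lines (this is not the implementation from the linked answer, and the class name is made up). It uses one bit per possible number, indexed from 0, so a range starting at 1 would need an offset of one.

```python
class BitArray:
    """One bit per possible number in the range 0..size-1."""

    def __init__(self, size):
        self.bits = bytearray((size + 7) // 8)

    def add(self, n):
        self.bits[n >> 3] |= 1 << (n & 7)

    def discard(self, n):
        self.bits[n >> 3] &= ~(1 << (n & 7)) & 0xFF

    def __contains__(self, n):
        return bool(self.bits[n >> 3] & (1 << (n & 7)))

table = BitArray(8_000_000)
table.add(5)
print(len(table.bits), 5 in table, 6 in table)
```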

If you are merely checking for the presence of the number in question, a Bloom filter might be what you are looking for. Honestly though, your question is fairly vague and confusing. It would help to explain what Q values are, and what you do with them once you find them in the table.

If your set of integers is homogeneous, then you could try a hash table, because there is a trick you can use to cut the size of the stored integers, in your case, in half.
Assume the integer, n, can itself serve as the hash because its set is homogeneous. Assume you have 0x10000 (64k) buckets. The bucket index is iBucket = n & 0xFFFF. Each item in a bucket need only store 16 bits, since the other 16 bits are the bucket index. The other thing you have to do to keep the data small is to put the count of items in the bucket, and use an array to hold the items in the bucket. Using a linked list would be too large and slow. When you iterate over the array looking for a match, remember you only need to compare the 16 bits that are stored.
So assume a bucket is a pointer to the array and a count. On a 32-bit system, this is 64 bits max. If the number of ints was small enough we might be able to do some fancy things and use 32 bits for a bucket. 64k * 8 bytes = 524k, 2 million shorts = 4 MB. So this gets you a method to look up the ints and about 40% compression.
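A sketch of the bucket layout described above. The class name is made up, and Python's own object overhead means this only illustrates the structure, not the memory savings you would get in C on an embedded target.

```python
from array import array

class SplitSet:
    """Set of 32-bit ints: the low 16 bits pick a bucket, the bucket stores the high 16."""

    def __init__(self):
        # One compact array of unsigned 16-bit items per bucket
        self.buckets = [array("H") for _ in range(0x10000)]

    def add(self, n):
        bucket = self.buckets[n & 0xFFFF]
        high = n >> 16
        if high not in bucket:
            bucket.append(high)

    def __contains__(self, n):
        # Only the stored 16 bits need to be compared
        return (n >> 16) in self.buckets[n & 0xFFFF]

    def stored_item_bytes(self):
        """Bytes used by the stored halves, excluding per-bucket overhead."""
        return sum(len(b) * b.itemsize for b in self.buckets)

s = SplitSet()
s.add(0x12345678)
s.add(0xABCD5678)
print(0x12345678 in s, 0x56785678 in s)
```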
