Large allocation - anything like contiguous virtual memory?

In my program I create a large (~10 million elements) list of objects, where each object is about 500 bytes. Currently the allocation is like this:
const int N = 10000000;
object_type ** list = malloc( N * sizeof * list );
for (int i=0; i < N; i++)
list[i] = malloc( sizeof * list[i]);
This works OK - but I have discovered that with the high number of small allocations a significant part of the run time goes to the malloc() and subsequent free() calls. I am therefore about to change the implementation to allocate larger chunks. The simplest approach for me would be to allocate everything as one large chunk.
Now I know there is at least one level of virtualization between the user-space memory model and actual physical memory, but is there still a risk that I will run into problems getting such a large 'contiguous' block of memory?

Contiguous virtual does not imply contiguous physical. If your process can allocate N pages individually, it will also be able to allocate them all in one call (and it is actually better from many points of view to do it in one call). On old 32-bit architectures the limited size of the virtual address space was a problem, but on 64-bit it is no longer an issue. Besides, even on 32-bit, if you could allocate 10MM elements individually you should be able to allocate the same 10MM in one single call.
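As a minimal sketch of the single-chunk approach (object_type is stubbed out here as a 500-byte placeholder; the pointer array is optional and only kept so existing list[i] code still works):
#include <stdlib.h>

/* placeholder for the ~500-byte element type from the question */
typedef struct { char payload[500]; } object_type;

enum { N = 10000000 };

int main(void)
{
    /* one block holds every object; the pointer array just preserves the old list[i] interface */
    object_type *storage = malloc((size_t)N * sizeof *storage);
    object_type **list   = malloc((size_t)N * sizeof *list);
    if (storage == NULL || list == NULL) {
        free(storage);
        free(list);
        return 1;
    }
    for (int i = 0; i < N; i++)
        list[i] = &storage[i];      /* each entry points into the single contiguous block */

    /* ... use list[i] exactly as before ... */

    free(list);
    free(storage);                  /* one free() instead of 10 million */
    return 0;
}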
That being said, you probably need to carefully revisit your design and reconsider why you need to keep 10MM elements in memory.

Related

Why is there an alignment bigger than a word?

OK, I understand that storing data aligned to CPU word-sized chunks increases the speed of accessing it. But those chunks are normally 16, 32 or 64 bits; why are there other alignment values like 128 bits or 256 bits? I mean, there aren't any processors using such large registers in PCs anyway. I suppose this has something to do with the CPU cache? I have also seen such alignments in secondary storage, but there they are actually much larger - 10240 bits, for example.
Many processors do have 128-bit SIMD registers (e.g., x86 SSE registers, ARM Neon registers, MIPS SIMD Architecture registers); x86 AVX extends SSE registers to 256 bits and AVX-512 doubles the size again.
However, there are other reasons for desiring larger alignments. As you guessed, cache behavior is one motive for using larger alignments. Aligning a larger data structure to the size of a cache line (commonly 64 bytes for x86, usually not smaller than 32 bytes in modern systems) guarantees that an access to any member will bring the same other members into the cache. This can be used to reduce cache capacity use and miss rate by placing members that are frequently used (a.k.a. hot), or that are commonly used at about the same time, together in what will be the same cache block.
E.g., consider the following structure accessed with a cache having 32-byte cache blocks:
struct example {
int64_t hot1; // frequently used member
int64_t hot2; // frequently used member
int64_t hot3; // frequently used member
int64_t hot4; // frequently used member
// end of 32-byte cache block if 32-byte aligned
int64_t a; // always used by func1, func2
int64_t b; // always used by func2
int64_t c; // always used by func1, func3
int64_t d; // always used by func2, func3
// end of 32-byte cache block if 32-byte aligned
int64_t e; // used by func4
int64_t f; // used by func5
int64_t g; // used by func6
int64_t h; // used by func7
};
If the structure is 32-byte aligned:
an access to any of the hot members will bring all of the hot members into the cache
calling func1, func2, or func3 will bring a, b, c, and d into the cache; if these functions are called close together in time, then the data will still be in the cache
If the structure is 16-byte aligned but not 32-byte aligned (50% chance with 16-byte alignment):
an access to hot1 or hot2 will bring 16 bytes of unrelated data located before hot1 into the cache and will not automatically load hot3 and hot4
an access to hot3 or hot4 will bring a and b into the cache (likely unnecessarily)
a call to func1 or func2 is more likely to hit in the cache for a and b (since these would be in the same cache block as hot3 and hot4), but it will miss for c and d and will less usefully bring e and f into the cache
a call to func3 will less usefully bring e and f into the cache, but not a and b
Even for a small structure, alignment can prevent the structure (or just the hot or accessed-nearby-in-time portions) from crossing cache block boundaries. E.g., aligning a 24-byte structure with 16 bytes of hot data to 16 bytes can guarantee that the hot data will always be in the same cache block.
Cache block alignment can also be used to guarantee that two locks (or other data elements that are accessed by different threads and written by at least one) do not share the same cache block. This avoids false sharing issues. (False sharing is when unrelated data used by different threads share a cache block. A write by one thread will remove that cache block from all other caches. If another thread writes to unrelated data in that block, it removes the block from the first thread's cache. For ISAs using load-linked/store-conditional to implement locks, this can cause the store-conditional to fail even though there was no actual data conflict.)
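As an illustration of the false-sharing point, a minimal C11 sketch (the 64-byte cache-line size and the per-thread counters are assumptions, not something from the question):
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache-block size */

/* Each per-thread element gets its own cache block, so a write by one thread
   does not evict the block holding another thread's element. */
struct padded_counter {
    alignas(CACHE_LINE) uint64_t value;
    char pad[CACHE_LINE - sizeof(uint64_t)];
};

struct padded_counter counters[8];   /* e.g., one counter per thread */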
Similar alignment considerations apply with respect to virtual memory page size (typically 4KiB). By guaranteeing that data accessed nearby in time are in a smaller number of pages, the cache storing translations of virtual memory addresses (the translation lookaside buffer [TLB]) will not have as much capacity pressure.
Alignment can also be used in object caches to reduce cache conflict misses, which occur when items have the same cache index. (Caches are typically indexed simply by a selection of some least significant bits. At each index a limited number of blocks, called a set, are available. If more blocks want to share an index than there are blocks in the set—the associativity or number of ways—then one of the blocks in the set must be removed from the cache to make room.) A 2048-byte, fully aligned chunk of memory could hold 21 copies of the above structure with a 32-byte chunk of padding (which might be used for other purposes). This guarantees that hot members from different chunks will only have a 33.3% chance of using the same cache index. (Allocating in a chunk, even if not aligned, also guarantees that none of the 21 copies within a chunk will share a cache index.)
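For example, a sketch of such a chunk using C11 aligned_alloc (the helper names are made up; 96 bytes is the size of the structure above, and aligned_alloc requires the size to be a multiple of the alignment):
#include <stdlib.h>

#define CHUNK_SIZE      2048
#define ITEM_SIZE       96                        /* the 12 x int64_t structure above */
#define ITEMS_PER_CHUNK (CHUNK_SIZE / ITEM_SIZE)  /* 21 items; 32 bytes of padding remain */

/* Allocate a fully aligned chunk; item i then lives at a fixed offset, so the
   cache indices touched by each copy's hot members are known in advance. */
unsigned char *alloc_chunk(void)
{
    return aligned_alloc(CHUNK_SIZE, CHUNK_SIZE);
}

unsigned char *chunk_item(unsigned char *chunk, size_t i)
{
    return (i < ITEMS_PER_CHUNK) ? chunk + i * ITEM_SIZE : NULL;
}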
Large alignment can also be convenient in buffers since a simple bitwise and can produce the start address of the buffer or the number of bytes in the buffer.
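For example (a sketch assuming a power-of-two buffer size and a buffer allocated at that alignment):
#include <stddef.h>
#include <stdint.h>

#define BUF_SIZE 4096   /* power of two; the buffer itself must be BUF_SIZE-aligned */

/* Given any pointer into the buffer, a single mask recovers the buffer's start... */
static inline char *buffer_start(char *p)
{
    return (char *)((uintptr_t)p & ~(uintptr_t)(BUF_SIZE - 1));
}

/* ...and the complementary mask gives the byte offset within the buffer. */
static inline size_t buffer_offset(char *p)
{
    return (size_t)((uintptr_t)p & (BUF_SIZE - 1));
}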
Alignment can also be exploited to provide pointer compression (e.g., 64-byte alignment would allow a 32-bit pointer to address 256 GiB instead of 4 GiB, at the cost of a 6-bit left shift when loading the pointer). Similarly, the least significant bits of a pointer to an aligned object can be used to store metadata, requiring a bitwise AND to zero those bits before using the pointer.
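Both tricks in a short C sketch (the 64-byte alignment, i.e. a 6-bit shift, is an assumption; a real implementation would also handle a base address and range checks):
#include <stdint.h>

#define ALIGN_SHIFT 6
#define ALIGN_MASK  ((uintptr_t)((1u << ALIGN_SHIFT) - 1))   /* low 6 bits of a 64-byte-aligned pointer are zero */

/* Pointer compression: store a 32-bit "object number" instead of a full pointer. */
static inline uint32_t compress_ptr(void *p)      { return (uint32_t)((uintptr_t)p >> ALIGN_SHIFT); }
static inline void    *decompress_ptr(uint32_t c) { return (void *)((uintptr_t)c << ALIGN_SHIFT); }

/* Pointer tagging: stash up to 6 bits of metadata in the low bits of an aligned pointer. */
static inline uintptr_t tag_ptr(void *p, unsigned tag) { return (uintptr_t)p | ((uintptr_t)tag & ALIGN_MASK); }
static inline void     *untag_ptr(uintptr_t t)         { return (void *)(t & ~ALIGN_MASK); }
static inline unsigned  ptr_tag(uintptr_t t)           { return (unsigned)(t & ALIGN_MASK); }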
Here are the alignments I have used:
SSE: 16 bytes
AVX: 32 bytes
cache-line: 64 bytes
page: 4096 bytes
SSE and AVX both offer load and store instructions which require alignment to 16 bytes for SSE or 32 bytes for AVX. E.g.
SSE: _mm_load_ps() and _mm_store_ps()
AVX: _mm256_load_ps() and _mm256_store_ps()
However, they also offer instructions which don't require alignment:
SSE: _mm_loadu_ps() and _mm_storeu_ps()
AVX: _mm256_loadu_ps() and _mm256_storeu_ps()
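For instance, a small sketch with the SSE variants (assumes an x86 compiler; _mm_malloc/_mm_free are typically available through xmmintrin.h and return storage with the requested alignment):
#include <xmmintrin.h>   /* SSE intrinsics */

/* Aligned load/store: src and dst must be 16-byte aligned. */
void scale4_aligned(float *dst, const float *src)
{
    __m128 v = _mm_load_ps(src);
    _mm_store_ps(dst, _mm_mul_ps(v, _mm_set1_ps(2.0f)));
}

/* Unaligned load/store: any alignment (no penalty on aligned data since Nehalem). */
void scale4_unaligned(float *dst, const float *src)
{
    __m128 v = _mm_loadu_ps(src);
    _mm_storeu_ps(dst, _mm_mul_ps(v, _mm_set1_ps(2.0f)));
}

void example(void)
{
    float *buf = _mm_malloc(1024 * sizeof(float), 64);   /* 64-byte (cache-line) aligned */
    if (buf) {
        scale4_aligned(buf, buf);
        _mm_free(buf);
    }
}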
Before Nehalem the unaligned loads/stores had worse latency/throughput, even on aligned memory, than the instructions that required alignment. However, since Nehalem they have the same latency/throughput on aligned memory, which means there is no longer any reason to use the load/store instructions that require alignment. That does NOT mean that aligned memory no longer matters.
If a 16- or 32-byte load or store crosses a cache-line boundary, it can cause a stall, so it can also help to align to a cache line. In practice I usually align to 64 bytes.
On multi-socket systems with multiple processors, memory shared between the processors is slower to access than each processor's own local memory. For this reason it can also help to make sure that data is not split across a virtual page boundary (a page is usually, but not necessarily, 4096 bytes), since the two pages could end up in different processors' memory.

CUDA: are access times for texture memory similar to coalesced global memory?

My kernel threads access a linear character array in a coalesced fashion. If I map the array to texture I don't see any speedup. The running times are almost the same. I'm working on a Tesla C2050 with compute capability 2.0 and read somewhere that global accesses are cached. Is that true? Perhaps that is why I am not seeing a difference in the running time.
The array in the main program is
char *dev_database = NULL;
cudaMalloc( (void**) &dev_database, JOBS * FRAGMENTSIZE * sizeof(char) );
and I bind it to texture texture<char> texdatabase with
cudaBindTexture(NULL, texdatabase, dev_database, JOBS * FRAGMENTSIZE * sizeof(char) );
Each thread then reads a character ch = tex1Dfetch(texdatabase, p + id) where id is threadIdx.x + blockIdx.x * blockDim.x and p is an offset.
I'm binding only once and dev_database is a large array. Actually I found that if the size is too large the bind fails. Is there a limit on the size of the array to bind? Thanks very much.
There are several possibilities for why you don't see any difference in performance, but the most likely is that this memory access is not your bottleneck. If it is not your bottleneck, making it faster will have no effect on performance.
Regarding caching: for this case, since you are reading only bytes, each warp will read 32 bytes, which means each group of 4 warps will map to each cache line. So assuming few cache conflicts, you will get up to 4x reuse from the cache. So if this memory access is a bottleneck, it is conceivable that the texture cache might not benefit you more than the general-purpose cache.
You should first determine if you are bandwidth bound and if this data access is the culprit. Once you have done that, then optimize your memory accesses. Another tactic to consider is to access 4 to 16 chars per thread per load (using a char4 or int4 struct with byte packing/unpacking) rather than one per thread, to increase the number of memory transactions in flight -- this can help to saturate the global memory bus.
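A sketch of the several-chars-per-thread idea (hypothetical kernel; treating the char array as char4 assumes its size is a multiple of 4, and cudaMalloc already guarantees the required 4-byte alignment):
__global__ void process4(const char *db, int n4, int *out)
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id < n4) {
        /* one 4-byte transaction per thread instead of four 1-byte reads */
        char4 c = ((const char4 *)db)[id];
        out[id] = c.x + c.y + c.z + c.w;   /* placeholder for the real per-character work */
    }
}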
There is a good presentation by Paulius Micikevicius from GTC 2010 that you might want to watch. It covers both analysis-driven optimization and the specific concept of memory transactions in flight.

Is it possible to use cudaMemcpy with src and dest as different types?

I'm using a Tesla, and for the first time, I'm running low on CPU memory instead of GPU memory! Hence, I thought I could cut the size of my host memory by switching all integers to short (all my values are below 255).
However, I want my device memory to use integers, since the memory access is faster. So is there a way to copy my host memory (in short) to my device global memory (in int)? I guess this won't work:
short *buf_h = new short[100];
int *buf_d = NULL;
cudaMalloc((void **)&buf_d, 100*sizeof(int));
cudaMemcpy( buf_d, buf_h, 100*sizeof(short), cudaMemcpyHostToDevice );
Any ideas? Thanks!
There isn't really a way to do what you are asking directly. The CUDA API doesn't support "smart copying" with padding or alignment, or "deep copying" of nested pointers, or anything like that. Memory transfers require linear host and device memory, and alignment must be the same between source and destination memory.
Having said that, one approach to circumvent this restriction would be to copy the host short data to an allocation of short2 on the device. Your device code can retrieve a short2 containing two packed shorts, extract the value it needs and then cast it to int. This will give the code 32-bit memory transactions per thread, allowing for memory coalescing, and (if you are using Fermi GPUs) good L1 cache hit rates, because adjacent threads within a block would be reading the same 32-bit word. On non-Fermi GPUs, you could probably use a shared memory scheme to efficiently retrieve all the values for a block using coalesced reads.
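A sketch of that scheme under the question's sizes (hypothetical kernel and launch; the 100 host shorts are copied byte-for-byte into 50 short2 elements):
__global__ void use_shorts(const short2 *buf, int n, int *out)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        short2 pair = buf[i / 2];                      /* adjacent threads read the same 32-bit word */
        out[i] = (i & 1) ? (int)pair.y : (int)pair.x;  /* extract this thread's half and widen to int */
    }
}

void copy_and_launch(const short *buf_h, int *out_d)
{
    short2 *buf_d = NULL;
    cudaMalloc((void **)&buf_d, 50 * sizeof(short2));                       /* holds the 100 shorts */
    cudaMemcpy(buf_d, buf_h, 100 * sizeof(short), cudaMemcpyHostToDevice);  /* plain byte copy, no conversion */
    use_shorts<<<1, 100>>>(buf_d, 100, out_d);
}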

Working with large arrays - OutOfRam

I have an algorithm where I create two two-dimensional arrays like this:
TYPE
TPtrMatrixLine = array of byte;
TCurMatrixLine = array of integer;
TPtrMatrix = array of TPtrMatrixLine;
TCurMatrix = array of TCurMatrixLine;
function x
var
PtrsMX: TPtrMatrix;
CurMx : TCurMatrix;
begin
{ Try to allocate RAM }
SetLength(PtrsMX, RowNr+1, ColNr+1);
SetLength(CurMx , RowNr+1, ColNr+1);
for all rows do
for all cols do
FillMatrixWithData; <------- CPU intensive task. It could take up to 10-20 min
end;
The two matrices have always the same dimension.
Usually there are only 2000 lines and 2000 columns in the matrix, but sometimes it can go as high as 25000x6000, so for both matrices I need something like 146.5 + 586.2 = 732.8 MB of RAM.
The problem is that the two blocks need to be contiguous, so in most cases, even though 500-600 MB of free RAM doesn't seem like much to ask for on a modern computer, I run out of RAM.
The algorithm fills the cells of the array with data based on the neighbors of that cell. The operations are just additions and subtractions.
The TCurMatrixLine is the one that takes a lot of RAM since it uses integers to store data. Unfortunately, the stored values may be negative, so I cannot use Word instead of Integer. SmallInt is too small (my values are bigger than SmallInt but smaller than Word). I hope that if there is any other way to implement this, it will not add a lot of overhead, since processing a matrix with so many lines/columns already takes a lot of time. In other words, I hope that decreasing the memory requirements will not increase the processing time.
Any idea how to decrease the memory requirements?
[I use Delphi 7]
Update
Somebody suggested that each row of my array should be an independent uni-dimensional array.
I create as many rows (arrays) as I need and store them in a TList. Sounds very good. Obviously there will be no problem allocating such small memory blocks. But I am afraid it will have a gigantic impact on speed. I currently use
TCurMatrixLine = array of integer;
TCurMatrix = array of TCurMatrixLine;
because it is faster than TCurMatrix = array of array of integer (because of the way the data is placed in memory). So, breaking the array into independent lines may affect the speed.
The suggestion of using a signed 2 byte integer will greatly aid you.
Another useful tactic is to mark your exe as being LARGE_ADDRESS_AWARE by adding {$SetPEFlags IMAGE_FILE_LARGE_ADDRESS_AWARE} to your .dpr file. This will only help if you are running on 64 bit Windows and will increase your address space from 2GB to 4GB.
It may not work on Delphi 7 (I seem to recall you are using D7) and you must be using FastMM since the old Borland memory manager isn't compatible with large address space. If $SetPEFlags isn't available you can still mark the exe with EDITBIN.
If you still encounter difficulties then yet another trick is to allocate smaller sub-blocks of memory and use a wrapper class to handle mapping indices to the appropriate sub-block and offset within it. You can use a default property to make this transparent to the calling code.
Naturally, a block-allocated approach like this incurs some processing overhead, but it's your best bet if you are having trouble getting contiguous blocks.
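The index-mapping idea, sketched here in C for brevity (in Delphi the same mapping would sit behind a class with a default array property; the names and the 4096-element block size are made up):
#include <stdlib.h>

#define BLOCK_ELEMS 4096                 /* elements per sub-block (made-up size) */

typedef struct {
    int    **blocks;                     /* independently allocated sub-blocks */
    size_t   cols;                       /* matrix width, used to flatten (row, col) */
} BlockMatrix;

/* Map (row, col) to the sub-block that holds it and the offset within that block. */
static int *cell(BlockMatrix *m, size_t row, size_t col)
{
    size_t flat = row * m->cols + col;
    return &m->blocks[flat / BLOCK_ELEMS][flat % BLOCK_ELEMS];
}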
If the absolute values of the elements of CurMx fit in a Word, then you can store them in a Word array and use another array of Boolean for the sign. That saves 1 byte per element.
Have you considered manually allocating the data structure on the heap?
...and have you measured how this affects the memory usage and the performance?
Using the heap might actually increase speed and reduce memory usage, because you can avoid the whole array being copied from one memory segment to another (e.g. if your FillMatrixWithData is declared with a non-const open array parameter).

Memory access after ioremap very slow

I'm working on a Linux kernel driver that makes a chunk of physical memory available to user space. I have a working version of the driver, but it's currently very slow. So, I've gone back a few steps and tried making a small, simple driver to recreate the problem.
I reserve the memory at boot time using the kernel parameter memmap=2G$1G. Then, in the driver's __init function, I ioremap some of this memory, and initialize it to a known value. I put in some code to measure the timing as well:
#define RESERVED_REGION_SIZE (1 * 1024 * 1024 * 1024) // 1GB
#define RESERVED_REGION_OFFSET (1 * 1024 * 1024 * 1024) // 1GB
static int __init memdrv_init(void)
{
struct timeval t1, t2;
printk(KERN_INFO "[memdriver] init\n");
// Remap reserved physical memory (that we grabbed at boot time)
do_gettimeofday( &t1 );
reservedBlock = ioremap( RESERVED_REGION_OFFSET, RESERVED_REGION_SIZE );
do_gettimeofday( &t2 );
printk( KERN_ERR "[memdriver] ioremap() took %d usec\n", usec_diff( &t2, &t1 ) );
// Set the memory to a known value
do_gettimeofday( &t1 );
memset( reservedBlock, 0xAB, RESERVED_REGION_SIZE );
do_gettimeofday( &t2 );
printk( KERN_ERR "[memdriver] memset() took %d usec\n", usec_diff( &t2, &t1 ) );
// Register the character device
...
return 0;
}
I load the driver, and check dmesg. It reports:
[memdriver] init
[memdriver] ioremap() took 76268 usec
[memdriver] memset() took 12622779 usec
That's 12.6 seconds for the memset. That means the memset is running at 81 MB/sec. Why on earth is it so slow?
This is kernel 2.6.34 on Fedora 13, and it's an x86_64 system.
EDIT:
The goal behind this scheme is to take a chunk of physical memory and make it available to both a PCI device (via the memory's bus/physical address) and a user space application (via a call to mmap, supported by the driver). The PCI device will then continually fill this memory with data, and the user-space app will read it out. If ioremap is a bad way to do this (as Ben suggested below), I'm open to other suggestions that'll allow me to get any large chunk of memory that can be directly accessed by both hardware and software. I can probably make do with a smaller buffer also.
See my eventual solution below.
ioremap maps the region with uncacheable pages, as you'd want for access to a memory-mapped I/O device. That would explain your poor performance.
You probably want kmalloc or vmalloc. The usual reference materials will explain the capabilities of each.
I don't think ioremap() is what you want there. You should only access the result (what you call reservedBlock) with readb, readl, writeb, memcpy_toio etc. It is not even guaranteed that the return is virtually mapped (although it apparently is on your platform). I'd guess that the region is being mapped uncached (suitable for IO registers) leading to the terrible performance.
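For what it's worth, a minimal sketch of touching the same region only through the I/O accessors (this does not fix the uncached-mapping slowness; it just keeps the accesses within what ioremap guarantees; the function name is made up):
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/io.h>

static void __iomem *reservedBlock;

static int __init memdrv_init_io(void)
{
    reservedBlock = ioremap(RESERVED_REGION_OFFSET, RESERVED_REGION_SIZE);
    if (!reservedBlock)
        return -ENOMEM;
    memset_io(reservedBlock, 0xAB, RESERVED_REGION_SIZE);   /* instead of memset() */
    writel(0xAB, reservedBlock);                            /* instead of *(u32 *)reservedBlock = ... */
    printk(KERN_INFO "[memdriver] first word: 0x%x\n", readl(reservedBlock));
    return 0;
}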
It's been a while, but I'm updating since I did eventually find a workaround for this ioremap problem.
Since we had custom hardware writing directly to the memory, it was probably more correct to mark it uncacheable, but it was unbearably slow and wasn't working for our application. Our solution was to only read from that memory (a ring buffer) once there was enough new data to fill a whole cache line on our architecture (I think that was 256 bytes). This guaranteed we never got stale data, and it was plenty fast.
I have tried reserving a huge memory chunk with memmap.
The ioremapping of this chunk gave me a mapped address space that lies beyond a few terabytes.
For example, when you ask to reserve 128 GB of memory starting at 64 GB, you see the following in /proc/vmallocinfo:
0xffffc9001f3a8000-0xffffc9201f3a9000 137438957568 0xffffffffa00831c9 phys=1000000000 ioremap
Thus the address space starts at 0xffffc9001f3a8000 (which is way too large).
Secondly, your observation is correct: even memset_io results in extremely large delays (tens of minutes) to touch all of this memory.
So the time taken mainly comes from the address space conversion and uncacheable page loading.
