Where does the Linux kernel get the available DRAM size? - memory

I want to tweak the available DRAM size when it is passed to the kernel.
I have 8 GB of RAM mounted on the device and want to make the kernel believe it has only 4 GB, without actually soldering a 4 GB part.
Please let me know where to start:
The code where LK passes the memory information (DRAM start address and size) to the kernel.
The code where LK detects the DRAM size.
What I am trying to do is simply change the device tree file so that it describes only 4 GB.
from> (total 8 GB)
memory_0: memory@80000000 {
    device_type = "memory";
    reg = <0x0 0x80000000 0x80000000>;
};
memory@880000000 {
    device_type = "memory";
    reg = <0x00000008 0x80000000 0x80000000>;
};
memory@900000000 {
    device_type = "memory";
    reg = <0x00000009 0x00000000 0x80000000>;
};
memory@980000000 {
    device_type = "memory";
    reg = <0x00000009 0x80000000 0x80000000>;
};
to> (total 4 GB)
memory_0: memory@80000000 {
    device_type = "memory";
    reg = <0x0 0x80000000 0x80000000>;
};
memory@880000000 {
    device_type = "memory";
    reg = <0x00000008 0x80000000 0x80000000>;
};
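As a quick sanity check (assuming #size-cells = <1> here, so the last cell of each reg entry is the size in bytes, which matches the three-cell entries above), the two remaining nodes describe 2 GiB each, i.e. 4 GiB in total. A minimal C++ sketch of that arithmetic:

#include <cstdint>
#include <cstdio>

int main()
{
    // Size cell of each memory node kept in the trimmed tree (0x80000000 bytes = 2 GiB each).
    const uint64_t sizes[] = { 0x80000000, 0x80000000 };   // memory@80000000, memory@880000000

    uint64_t total = 0;
    for (uint64_t s : sizes)
        total += s;

    printf("total RAM described: %llu GiB\n", (unsigned long long)(total >> 30));   // prints 4
    return 0;
}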
Thanks

Related

Get total free memory (RAM) of an iOS/iPadOS device in Swift

I know this has been asked multiple times, and I have already spent several days trying to get the exact code for this, but it seems I am still far off and I need help.
I use the code below:
/**
 TOTAL DEVICE RAM MEMORY
 **/
let total_bytes = Float(ProcessInfo.processInfo.physicalMemory)
let total_megabytes = total_bytes / 1024.0 / 1024.0

/**
 FREE DEVICE RAM MEMORY
 **/
var usedMemory: Int64 = 0
var totalUsedMemoryInMB: Float = 0
var freeMemoryInMB: Float = 0

let hostPort: mach_port_t = mach_host_self()
var host_size: mach_msg_type_number_t = mach_msg_type_number_t(MemoryLayout<vm_statistics_data_t>.stride / MemoryLayout<integer_t>.stride)
var pagesize: vm_size_t = 0
host_page_size(hostPort, &pagesize)

var vmStat: vm_statistics = vm_statistics_data_t()
let capacity = MemoryLayout.size(ofValue: vmStat) / MemoryLayout<Int32>.stride
let status: kern_return_t = withUnsafeMutableBytes(of: &vmStat) {
    let boundPtr = $0.baseAddress?.bindMemory(to: Int32.self, capacity: capacity)
    return host_statistics(hostPort, HOST_VM_INFO, boundPtr, &host_size)
}

if status == KERN_SUCCESS {
    usedMemory = Int64(vm_size_t(vmStat.active_count + vmStat.inactive_count + vmStat.wire_count) * pagesize)
    totalUsedMemoryInMB = Float(usedMemory / 1024 / 1024)
    freeMemoryInMB = total_megabytes - totalUsedMemoryInMB
    print("free memory: \(freeMemoryInMB)")
}
And I got the results below (on real devices):
iPhone XR
free memory: 817.9844
Difference of about 150MB
iPhone13 Pro Max
free memory: 1384.2031
Difference of about 700MB
iPad 2021
free memory: 830.9375
Difference of about 170MB
I also tried the variants below, with even worse results:
//usedMemory = Int64(vm_size_t(vmStat.active_count + vmStat.inactive_count + vmStat.wire_count + vmStat.free_count) * pagesize)
//usedMemory = Int64(vm_size_t(vmStat.active_count + vmStat.wire_count) * pagesize)
//usedMemory = Int64(vm_size_t(vmStat.inactive_count + vmStat.wire_count) * pagesize)
//usedMemory = Int64(vm_size_t(vmStat.active_count + vmStat.inactive_count) * pagesize)
A difference of about 100 MB is OK, but I really do not understand why it depends on the device, and I am not sure how I can get a reliable value.
If that is not possible, will the difference between the real value and the one returned by the code be consistent for each device, so that I can pad the result to get the real value?
My app uses SceneKit and is hungry for resources; I need to reduce detail once I am exhausting memory.
Any help is appreciated.
I hope this method will help you -
var physicalMemory: UInt64 {
    return (ProcessInfo().physicalMemory / 1024) / 1024 // in MB
}

func deviceRemainingFreeSpace() -> Int64? {
    let documentDirectory = NSSearchPathForDirectoriesInDomains(.documentDirectory, .userDomainMask, true).last!
    guard
        let systemAttributes = try? FileManager.default.attributesOfFileSystem(forPath: documentDirectory),
        let freeSize = systemAttributes[.systemFreeSize] as? NSNumber
    else {
        return nil
    }
    return (freeSize.int64Value / 1024) / 1024 // free disk space, in MB
}

Meaning of a memory map declaration in a flattened device tree

I have a declaration for a memory map as follows:
memory@40000000 {
    device_type = "memory";
    reg = <0 0x40000000 0 0x20000000>;
};
memory@200000000 {
    device_type = "memory";
    reg = <2 0x00000000 0 0x20000000>;
};
What is the meaning of each number in reg (base, size)?
The two statements
reg = <0 0x40000000 0 0x20000000>;
reg = <2 0x00000000 0 0x20000000>;
mean that a 64-bit addressing scheme is used. However, each number (each device tree 'cell') is a 32-bit field, so two consecutive cells have to be read together as one 64-bit value:
Addr: 0x040000000 Size: 0x020000000
Addr: 0x200000000 Size: 0x020000000
Thus, you have two 512MiB RAM ranges at two distinct address segments.
Please look for a declaration in your dts/dtsi file like:
#address-cells = <2>;
#size-cells = <2>;
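To make the pairing concrete, here is a small C++ sketch (not from the original answer) that combines two 32-bit cells, high word first, into the 64-bit base and size of the second node:

#include <cstdint>
#include <cstdio>

// Combine two 32-bit device tree cells (high word first) into one 64-bit value.
static uint64_t combine_cells(uint32_t hi, uint32_t lo)
{
    return (static_cast<uint64_t>(hi) << 32) | lo;
}

int main()
{
    // reg = <2 0x00000000 0 0x20000000>;  ->  base 0x2_00000000, size 0x20000000
    uint64_t base = combine_cells(2, 0x00000000);
    uint64_t size = combine_cells(0, 0x20000000);

    printf("base 0x%llx, size 0x%llx (%llu MiB)\n",
           (unsigned long long)base, (unsigned long long)size,
           (unsigned long long)(size / (1024 * 1024)));   // base 0x200000000, 512 MiB
    return 0;
}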

Amount of local memory per CUDA thread

I read in the NVIDIA documentation (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications, table #12) that the amount of local memory per thread is 512 KB for my GPU (GTX 580, compute capability 2.0).
I tried unsuccessfully to check this limit on Linux with CUDA 6.5.
Here is the code I used (its only purpose is to test the local memory limit; it doesn't do any useful computation):
#include <iostream>
#include <stdio.h>

#define MEMSIZE 65000 // 65000 -> out of memory, 60000 -> ok

inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=false)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort)
            exit(code);
    }
}

inline void gpuCheckKernelExecutionError(const char *file, int line)
{
    gpuAssert(cudaPeekAtLastError(), file, line);
    gpuAssert(cudaDeviceSynchronize(), file, line);
}

__global__ void kernel_test_private(char *output)
{
    int c = blockIdx.x*blockDim.x + threadIdx.x; // absolute col
    int r = blockIdx.y*blockDim.y + threadIdx.y; // absolute row

    char tmp[MEMSIZE];
    for (int i = 0; i < MEMSIZE; i++)
        tmp[i] = 4*r + c; // dummy computation in local mem
    for (int i = 0; i < MEMSIZE; i++)
        output[i] = tmp[i];
}

int main(void)
{
    printf("MEMSIZE=%d bytes.\n", MEMSIZE);

    // allocate memory
    char output[MEMSIZE];
    char *gpuOutput;
    cudaMalloc((void**) &gpuOutput, MEMSIZE);

    // run kernel
    dim3 dimBlock(1, 1);
    dim3 dimGrid(1, 1);
    kernel_test_private<<<dimGrid, dimBlock>>>(gpuOutput);
    gpuCheckKernelExecutionError(__FILE__, __LINE__);

    // transfer data from GPU memory to CPU memory
    cudaMemcpy(output, gpuOutput, MEMSIZE, cudaMemcpyDeviceToHost);

    // release resources
    cudaFree(gpuOutput);
    cudaDeviceReset();
    return 0;
}
And the compilation command line:
nvcc -o cuda_test_private_memory -Xptxas -v -O2 --compiler-options -Wall cuda_test_private_memory.cu
The compilation is ok, and reports:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z19kernel_test_privatePc' for 'sm_20'
ptxas info : Function properties for _Z19kernel_test_privatePc
65000 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 21 registers, 40 bytes cmem[0]
I got an "out of memory" error at runtime on the GTX 580 when I reached 65000 bytes per thread. Here is the exact output of the program in the console:
MEMSIZE=65000 bytes.
GPUassert: out of memory cuda_test_private_memory.cu 48
I also did a test with a GTX 770 GPU (on Linux with CUDA 6.5). It ran without error for MEMSIZE=200000, but the "out of memory" error occurred at runtime for MEMSIZE=250000.
How can this behavior be explained? Am I doing something wrong?
It seems you are running into a stack size limitation rather than a local memory limitation:
ptxas info : Function properties for _Z19kernel_test_privatePc
65000 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
The variable that you had intended to be local is on the (GPU thread) stack, in this case.
Based on the information provided by @njuffa here, the available stack size limit is the lesser of:
The maximum local memory size (512KB for cc2.x and higher)
GPU memory/(#of SMs)/(max threads per SM)
Clearly, the first limit is not the issue. I assume you have a "standard" GTX 580, which has 1.5 GB of memory and 16 SMs. A cc2.x device has a maximum of 1536 resident threads per multiprocessor. This means we have 1536 MB / 16 SMs / 1536 threads = 65536 bytes of stack per thread. There is some overhead and other memory usage that subtracts from the total available memory, so the actual stack size limit is somewhat below 65536, apparently somewhere between 60000 and 65000 bytes in your case.
I suspect a similar calculation for your GTX 770 would yield a similar result, i.e. a maximum stack size somewhere between 200000 and 250000 bytes.
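For reference (not part of the original answer, just a sketch of the rule of thumb above), the per-device estimate and the stack size currently reserved by the runtime can be queried with the CUDA runtime API:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Rule of thumb from above: total device memory / #SMs / max resident threads per SM.
    size_t estimate = prop.totalGlobalMem / prop.multiProcessorCount / prop.maxThreadsPerMultiProcessor;
    printf("%s: rough upper bound of ~%zu bytes of stack per thread (before overhead)\n",
           prop.name, estimate);

    // Stack size the runtime currently reserves per thread; it can be raised with
    // cudaDeviceSetLimit(cudaLimitStackSize, ...) up to the limits discussed above.
    size_t stackSize = 0;
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);
    printf("current cudaLimitStackSize: %zu bytes\n", stackSize);
    return 0;
}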

Memory bandwidth measurement with memset, memcpy

I am trying to understand the performance of memory operations with memcpy/memset. I measure the time needed for a loop containing memset and memcpy; see the attached code (it is C++11, but in plain C the picture is the same). It is understandable that memset is faster than memcpy, but this is more or less the only thing I understand... The biggest question is:
Why is there such a strong dependence on the number of loop iterations?
The application is single-threaded! And the CPU is an AMD FX(tm)-4100 Quad-Core Processor.
And here are some numbers:
memset: iters=1 0.0625 GB in 0.1269 s : 0.4927 GB per second
memcpy: iters=1 0.0625 GB in 0.1287 s : 0.4857 GB per second
memset: iters=4 0.25 GB in 0.151 s : 1.656 GB per second
memcpy: iters=4 0.25 GB in 0.1678 s : 1.49 GB per second
memset: iters=16 1 GB in 0.2406 s : 4.156 GB per second
memcpy: iters=16 1 GB in 0.3184 s : 3.14 GB per second
memset: iters=128 8 GB in 1.074 s : 7.447 GB per second
memcpy: iters=128 8 GB in 1.737 s : 4.606 GB per second
The code:
/*
  -- Compilation and run:
  g++ -O3 -std=c++11 -o mem-speed mem-speed.cc && ./mem-speed
  -- Output example:
*/
#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <chrono>
#include <memory>
#include <string.h>

using namespace std;

const uint64_t _KB=1024, _MB=_KB*_KB, _GB=_KB*_KB*_KB;

std::pair<double,char> measure_memory_speed(uint64_t buf_size, int n_iters)
{
    // without returning something from the buffers, the compiler will optimize the memset() and memcpy() calls away
    char retval = 0;
    unique_ptr<char[]> buf1(new char[buf_size]), buf2(new char[buf_size]);

    auto time_start = chrono::high_resolution_clock::now();
    for( int i=0; i<n_iters; i++ )
    {
        memset(buf1.get(), 123, buf_size);
        retval += buf1[0];
    }
    auto t1 = chrono::duration_cast<std::chrono::nanoseconds>(chrono::high_resolution_clock::now() - time_start);

    time_start = chrono::high_resolution_clock::now();
    for( int i=0; i<n_iters; i++ )
    {
        memcpy(buf2.get(), buf1.get(), buf_size);
        retval += buf2[0];
    }
    auto t2 = chrono::duration_cast<std::chrono::nanoseconds>(chrono::high_resolution_clock::now() - time_start);

    printf("memset: iters=%d %g GB in %8.4g s : %8.4g GB per second\n",
           n_iters, n_iters*buf_size/double(_GB), (double)t1.count()/1e9, n_iters*buf_size/double(_GB) / (t1.count()/1e9) );
    printf("memcpy: iters=%d %g GB in %8.4g s : %8.4g GB per second\n",
           n_iters, n_iters*buf_size/double(_GB), (double)t2.count()/1e9, n_iters*buf_size/double(_GB) / (t2.count()/1e9) );
    printf("\n");

    // average of the two bandwidths (use floating-point division so small totals don't truncate to 0)
    double avr = n_iters*buf_size/double(_GB) * (1e9/t1.count() + 1e9/t2.count()) / 2;
    retval += buf1[0] + buf2[0];
    return std::pair<double,char>(avr, retval);
}

int main(int argc, const char **argv)
{
    uint64_t n = 64;
    if( argc==2 )
        n = atoi(argv[1]);
    for( int i=0; i<=10; i++ )
        measure_memory_speed(n*_MB, 1<<i);
    return 0;
}
Surely this is just down to caching: the instruction caches get loaded, so the code runs faster after the first iteration, and the data cache speeds up access for the memset/memcpy in further iterations. The cache memory is inside the processor, so it doesn't have to fetch or write the data to the slower external memory as often, and therefore runs faster.
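One way to separate such one-time warm-up costs from the steady-state bandwidth (a sketch of my own, not from the answer above) is to time the same operation twice over the same buffer and compare the two passes:

#include <cstdint>
#include <cstdio>
#include <cstring>
#include <chrono>
#include <memory>

int main()
{
    const uint64_t buf_size = 64ull * 1024 * 1024;   // 64 MB, the question's default chunk
    std::unique_ptr<char[]> buf(new char[buf_size]);
    char sink = 0;                                   // keep the stores observable

    for (int pass = 0; pass < 2; pass++)
    {
        auto t0 = std::chrono::high_resolution_clock::now();
        memset(buf.get(), 123, buf_size);
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                      std::chrono::high_resolution_clock::now() - t0).count();
        sink += buf[0];

        // If pass 0 is much slower than pass 1, a large share of the "slow" first
        // iterations is one-time cost (cold caches and, on freshly allocated memory,
        // page faults), not the sustainable memset bandwidth.
        printf("pass %d: %.4g GB/s\n", pass,
               buf_size / 1024.0 / 1024.0 / 1024.0 / (ns / 1e9));
    }
    printf("(sink=%d)\n", sink);
    return 0;
}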

XMM register values

I am finding it hard to interpret the value of the XMM registers in the register window of Visual Studio. The window displays the following:
XMM0 = 00000000000000004018000000000000 XMM1 = 00000000000000004020000000000000
XMM2 = 00000000000000000000000000000000 XMM3 = 00000000000000000000000000000000
XMM4 = 00000000000000000000000000000000 XMM5 = 00000000000000000000000000000000
XMM6 = 00000000000000000000000000000000 XMM7 = 00000000000000000000000000000000
XMM00 = +0.00000E+000 XMM01 = +2.37500E+000 XMM02 = +0.00000E+000
XMM03 = +0.00000E+000 XMM10 = +0.00000E+000 XMM11 = +2.50000E+000
XMM12 = +0.00000E+000 XMM13 = +0.00000E+000
From the code that I am running, the values of XMM0 and XMM1 should be 6 and 8 (or the other way round). The register value shown here is: XMM01 = +2.37500E+000.
What does this translate to?
Yes, it looks like:
XMM0 = { 6.0, 0.0 } // 6.0 = 0x4018000000000000 (double precision)
XMM1 = { 8.0, 0.0 } // 8.0 = 0x4020000000000000 (double precision)
The reason you are having problems interpreting this is that your debugger is displaying each 128-bit XMM register in hex and then, below that, as 4 x single-precision floats, whereas you are evidently using double-precision floats.
I'm not familiar with the Visual Studio debugger, but there should ideally be a way to change the representation of your XMM registers - you may have to look at the manual or online help for this.
Note that in general using double precision with SSE is rarely of any value, particularly if you have a fairly modern x86 CPU with two FPUs.
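To double-check the interpretation (a standalone C++ snippet, not part of the answer), the raw register bits can be reinterpreted on the host:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    // Low 64 bits of XMM0 and XMM1 from the debugger dump, reinterpreted as doubles.
    const uint64_t raw[2] = { 0x4018000000000000ull, 0x4020000000000000ull };
    for (uint64_t bits : raw)
    {
        double d;
        memcpy(&d, &bits, sizeof d);            // bit-for-bit reinterpretation
        printf("0x%016llx -> %g\n", (unsigned long long)bits, d);   // prints 6 and 8
    }

    // The debugger's single-precision view slices the same bytes into 32-bit lanes:
    // the upper half of 0x4018000000000000 is 0x40180000, which as a float is 2.375,
    // i.e. exactly the XMM01 value shown in the register window.
    const uint32_t lane1 = 0x40180000u;
    float f;
    memcpy(&f, &lane1, sizeof f);
    printf("0x%08x -> %g\n", lane1, f);
    return 0;
}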
