I'm trying to write a system call that returns the number of memory pages the current process is using but I have no idea where to start and which variables I should look at.
I saw two variables, sz and pgdir, in proc.h, but I don't know exactly what each of them represents.
Looking at proc.c, you have everything you need to understand the memory management:
// Grow current process's memory by n bytes.
// Return 0 on success, -1 on failure.
int
growproc(int n)
{
  uint sz;
  struct proc *curproc = myproc();

  sz = curproc->sz;
  if((sz = allocuvm(curproc->pgdir, sz, sz + n)) == 0)
    return -1;
  curproc->sz = sz;
  switchuvm(curproc);
  return 0;
}
growproc increases the process's memory by n bytes. It is used by the sbrk syscall, which is itself called by malloc.
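For reference, here is how the sbrk syscall invokes it, in xv6's sysproc.c:

int
sys_sbrk(void)
{
  int addr;
  int n;

  if(argint(0, &n) < 0)
    return -1;
  addr = myproc()->sz;
  if(growproc(n) < 0)
    return -1;
  return addr;
}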
From this, we can infer that the sz field of struct proc is the process's memory size in bytes.
Reading allocuvm from vm.c, you can see two macros:
PGROUNDUP(size), which rounds a memory size up to the next page boundary,
PGSIZE, which is the page size.
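In xv6 they are defined in mmu.h as:

#define PGSIZE          4096    // bytes mapped by a page
#define PGROUNDUP(sz)  (((sz)+PGSIZE-1) & ~(PGSIZE-1))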
So, the number of pages actually used by a process is PGROUNDUP(myproc()->sz) / PGSIZE.
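A minimal sketch of the syscall itself (the name sys_getpagecount is my own; like any other xv6 syscall, it would still need to be registered in syscall.h, syscall.c, user.h, and usys.S):

// sysproc.c: return the number of memory pages used by the
// current process. Hypothetical syscall, shown for illustration.
int
sys_getpagecount(void)
{
  return PGROUNDUP(myproc()->sz) / PGSIZE;
}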
I have recently been working on a soft-body physics simulation based on the following paper. The implementation uses points and springs and involves calculating the volume of the shape which is then used to calculate the pressure that is to be applied to each point.
On my MacBook Pro (2018, 13") I used the following code to calculate the volume for each soft-body in the simulation since all of the physics for the springs and mass points were being handled by a separate threadgroup:
// Gauss's theorem
shared_memory[threadIndexInThreadgroup] = 0.5 * fabs(x1 - x2) * fabs(nx) * (rAB);

// No memory fence is applied, and threadgroup_barrier
// acts only as an execution barrier.
threadgroup_barrier(mem_flags::mem_none);

threadgroup float volume = 0;

// Only do this calculation once on the first thread in the threadgroup.
if (threadIndexInThreadgroup == 0) {
    for (uint i = 0; i < threadsPerThreadgroup; ++i) {
        volume += shared_memory[i];
    }
}

// mem_none is probably all that is necessary here.
threadgroup_barrier(mem_flags::mem_none);

// Do calculations that depend on volume.
With shared_memory being passed to the kernel function as a threadgroup buffer:
threadgroup float* shared_memory [[ threadgroup(0) ]]
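For reference, a [[ threadgroup(n) ]] argument is not a device buffer; it is backed by threadgroup memory whose size the host reserves per dispatch with setThreadgroupMemoryLength. A Swift sketch, assuming an encoder named computeEncoder:

// Reserve threadsPerThreadgroup floats for the [[ threadgroup(0) ]]
// argument; Metal requires the length to be a multiple of 16 bytes.
let length = (threadsPerThreadgroup * MemoryLayout<Float>.stride + 15) & ~15
computeEncoder.setThreadgroupMemoryLength(length, index: 0)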
This worked well until, much later on, I ran the code on an iPhone and an M1 MacBook, where the simulation broke down completely, with the soft bodies disappearing fairly quickly after starting the application.
The solution to this problem was to store the result of the volume sum in a threadgroup buffer, threadgroup float* volume [[ threadgroup(2) ]], and do the volume calculation as follows:
// -*- Volume calculation -*-
shared_memory[threadIndexInThreadgroup] = 0.5 * fabs(x1 - x2) * fabs(nx) * (rAB);

threadgroup_barrier(mem_flags::mem_none);

if (threadIndexInThreadgroup == 0) {
    auto sum = shared_memory[0];
    for (uint i = 1; i < threadsPerThreadgroup; ++i) {
        sum += shared_memory[i];
    }
    *volume = sum;
}

threadgroup_barrier(mem_flags::mem_none);

float epsilon = 0.000001;
float pressurev = rAB * pressure * divide(1.0, *volume + epsilon);
My question is: why would the initial method work on my MacBook but not on other hardware, and is this now the correct way of doing it? If it is wrong to allocate a float in the threadgroup address space like this, then what is the point of being able to do so?
As a side note, I am using mem_flags::mem_none since it seems unnecessary to ensure the correct ordering of memory operations to threadgroup memory in this case. I just want to make sure each thread has written to shared_memory at this point, but the order in which they do so shouldn't matter. Is this assumption correct?
You should use mem_flags::mem_threadgroup, but I think the main problem is that Metal can't initialize threadgroup memory to zero like that; the spec is unclear about this.
Try:
threadgroup float volume;

// Only do this calculation once on the first thread in the threadgroup.
if (threadIndexInThreadgroup == 0) {
    volume = 0;
    for (uint i = 0; i < threadsPerThreadgroup; ++i) {
        volume += shared_memory[i];
    }
}
If you don't want to use a threadgroup buffer, the correct way to do this is the following:
// -*- Volume calculation -*-
threadgroup float volume = 0;

// Gauss's theorem
shared_memory[threadIndexInThreadgroup] = 0.5 * fabs(x1 - x2) * fabs(nx) * (rAB);

// No memory fence is applied, and threadgroup_barrier
// acts only as an execution barrier.
threadgroup_barrier(mem_flags::mem_none);

if (threadIndexInThreadgroup == 0) {
    volume = shared_memory[0];
    // Index memory using signed int types rather than unsigned.
    for (int i = 1; i < int(threadsPerThreadgroup); ++i) {
        volume += shared_memory[i];
    }
}

threadgroup_barrier(mem_flags::mem_none);
You can use either threadgroup_barrier(mem_flags::mem_none) or threadgroup_barrier(mem_flags::mem_threadgroup); it appears to make no difference.
I have a Metal compute kernel that takes input points from a buffer particles and populates a new buffer particlesOut. My compute kernel is defined as:
kernel void compute(device DrawingPoint *particles [[buffer(0)]],
                    device Particle *particlesOut [[buffer(1)]],
                    constant ComputeParameters *params [[buffer(2)]],
                    device atomic_int &counter [[buffer(3)]],
                    uint id [[thread_position_in_grid]]) {
This works fine, so long as the output buffer has room for the number of records populated.
So, for instance, if the input buffer has 10,000 records, and for each of those records I create 10 output records, and the output buffer has a length of 100,000, then all is fine. In other words, if the number of output records is fixed and is large enough, all is fine.
But for some input records, I would like a random number of output records to be populated. For instance, for one record I would like to populate 5, and for another 200 (and any number in between).
I am using an atomic_int for the output record's position in the buffer. Again, this works if I have a fixed number of records populated per input record.
I am populating the output buffer like this:
// Output buffer is 10 times the size of the input buffer
for (int i = 0; i < 10; i++) {
    int counterValue = atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    ...
    particlesOut[counterValue].position = finalPoint;
}
This works fine.
If I try to make it work with a variable number instead of the fixed value, the buffer is way underpopulated (instead of getting, say, 100,000 particles populated, maybe only 10,000 are populated).
For example:
int numberOfOutputPoints = someRandomValueBetweenFiveAndTwoHundred();

for (int i = 0; i < numberOfOutputPoints; i++) {
    int counterValue = atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    // particleCount is the size of the output buffer, so the last
    // valid index is particleCount - 1 (>=, not >, to avoid writing
    // one element past the end).
    if (counterValue >= params->particleCount) {
        return;
    }
    ...
    particlesOut[counterValue].position = finalPoint;
}
When I do that, only a small number of the particles in the output buffer are actually populated.
I looked at using different options for atomic_fetch_add_explicit, but only memory_order_relaxed will compile.
I tried using:
int counterValue = atomic_fetch_add(&counter, 1)
But the compiler reports that there is no matching function. Other than making the output buffer large enough for every input record to populate the maximum possible number of particles (e.g. 200 times 10,000), is there any way to make it dynamic?
In other words, I just want to stop populating the output buffer when it is full.
I'm working with a small Pololu 3pi robot. It has 32KB of flash, 2KB of SRAM and 1KB of EEPROM. I ran the following code on the robot:
#include <pololu/3pi.h>

int main() {
    int nums[1000];
    nums[0] = 50;
    nums[999] = 100;
    clear();
    print_long(nums[0]);    // prints 50
    lcd_goto_xy(0, 1);
    print_long(nums[999]);  // prints 100
    while (1);
}
My expectation was that it would crash because it had run out of RAM to store the entire nums array. But not only did it not crash, it also printed the numbers correctly, as if they were properly allocated.
How come? Isn't this using 4000 bytes of memory?
Here is some simple CUDA code. I am testing the time it takes to access global memory, for both reads and writes. Below is the kernel function (test1()).
__global__ void test1(int *direct_map)
{
    int index = 10;
    int index2;
    for (int j = 0; j < 1024; j++)
    {
        index2 = direct_map[index];
        direct_map[index] = -1;
        index = index2;
    }
}
direct_map is a 683*1024 linear matrix, and each pixel holds an offset value used to access another pixel. index and index2 are not contiguous addresses.
This kernel function takes about 600 microseconds. But if I delete the line
direct_map[index] = -1;
it takes just 27 microseconds.
I think the code has already read the value of direct_map[index] from global memory in
index2 = direct_map[index];
so it should then be resident in the L2 cache, and the write direct_map[index] = -1; should be fast.
I also tested random writes to global memory (test2()). It takes about 120 microseconds.
__global__ void test2(int *direct_map)
{
    int index = 10;
    for (int j = 0; j < 1024; j++)
    {
        direct_map[index] = -1;
        index = j*683 + j/3 - 1;
    }
}
So I don't know why test1() takes more than 600 microseconds. Thank you.
When you delete the code line:
direct_map[index] = -1;
your kernel isn't doing anything useful. The compiler can recognize this and eliminate most of the code associated with the kernel launch. That modification to the kernel code means that the kernel no longer affects any global state and the code is effectively useless, from the compiler's perspective.
You can verify this by dumping the assembly code that the compiler generates in each case, for example with cuobjdump -sass myexecutable.
Anytime you make a small change to the code and see a large change in timing, you should suspect that the change you made has allowed the compiler to make different optimization decisions.
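If you want to time the reads alone without the stores, one way is to keep a result that depends on every load so the compiler cannot eliminate the loop. A minimal sketch; the extra result output parameter is hypothetical:

__global__ void test1_reads_only(int *direct_map, int *result)
{
    int index = 10;
    // Chase the offsets as in test1(), but without the stores.
    for (int j = 0; j < 1024; j++)
    {
        index = direct_map[index];
    }
    // A single store that depends on every load prevents the
    // compiler from treating the loop as dead code.
    *result = index;
}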
I need to create Bitmap objects with direct access to their pixel data.
LockBits is too slow for my needs - it's no good for rapidly recreating (sometimes large) bitmaps.
So I have a custom FastBitmap object. It has a reference to a Bitmap object and an IntPtr that points to the bits in the bitmap.
The constructor looks like this:
public FastBitmap(int width, int height)
{
    unsafe
    {
        int pixelSize = Image.GetPixelFormatSize(PixelFormat.Format32bppArgb) / 8;
        _stride = width * pixelSize;
        int byteCount = _stride * height;
        _bits = Marshal.AllocHGlobal(byteCount);

        // Fill image with red for testing (BGRA byte order)
        for (int i = 0; i < byteCount; i += 4)
        {
            byte* pixel = ((byte*)_bits) + i;
            pixel[0] = 0;    // blue
            pixel[1] = 0;    // green
            pixel[2] = 255;  // red
            pixel[3] = 255;  // alpha
        }

        // All bits in this bitmap are now directly modifiable without LockBits.
        _bitmapObject = new Bitmap(width, height, _stride, PixelFormat.Format32bppArgb, _bits);
    }
}
The allocated memory is freed in a cleanup function which is called by the destructor (finalizer).
This works, but not for long. Somehow, without any further modification of the bits, the allocated memory gets corrupted, which corrupts the bitmap. Sometimes big parts of the bitmap are replaced by random pixels; other times the whole program crashes when I try to display it with Graphics.DrawImage - either one or the other, completely at random.
The reason the memory was being corrupted was that I was using Bitmap.Clone to copy the _bitmapObject after I was done with a FastBitmap.
Bitmap.Clone does not make a new copy of the pixel data when called, or at least such is the case when you create a Bitmap with your own allocated data.
Instead, cloning appears to use the exact same pixel data, which was problematic for me because I was freeing the pixel data memory after a clone operation, causing the cloned bitmap to become corrupt when the memory is used for other things.
The first and current solution I have found as an alternative to Bitmap.Clone is to use:
Bitmap clone = new Bitmap(bitmapToClone);
which does copy the pixel data elsewhere, making it okay to free the old memory.
There may be even better/faster ways to make a fully copied clone but this is a simple solution for now.
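For instance, since FastBitmap already owns the raw pixel memory, one faster option may be to copy the bits directly into a second FastBitmap. A sketch, assuming the fields and constructor shown above and that both bitmaps share the same 32bppArgb format and stride:

public FastBitmap Copy()
{
    var copy = new FastBitmap(_bitmapObject.Width, _bitmapObject.Height);
    unsafe
    {
        long byteCount = (long)_stride * _bitmapObject.Height;
        // Raw copy of the pixel data into the new allocation, so the
        // two bitmaps no longer share memory.
        Buffer.MemoryCopy((void*)_bits, (void*)copy._bits, byteCount, byteCount);
    }
    return copy;
}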