I have a memory map declaration as follows:
memory@40000000 {
device_type = "memory";
reg = <0 0x40000000 0 0x20000000>;
};
memory@200000000 {
device_type = "memory";
reg = <2 0x00000000 0 0x20000000>;
};
What is the meaning of each number in reg (base, size)?
The two statements
reg = <0 0x40000000 0 0x20000000>;
reg = <2 0x00000000 0 0x20000000>;
mean that a 64-bit addressing scheme is used. However, each number in a device tree 'cell' represents a 32-bit field, so the numbers have to be read together in pairs as:
Addr: 0x040000000 Size: 0x020000000
Addr: 0x200000000 Size: 0x020000000
Thus, you have two 512MiB RAM ranges at two distinct address segments.
Please look for a declaration in your dts/dtsi file like:
#address-cells = <2>;
#size-cells = <2>;
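If you ever need to combine the two 32-bit cells into one 64-bit value yourself, it is just a shift and an OR. A minimal C sketch (not from the question; the cell values are taken from the reg properties above):
#include <stdint.h>
#include <stdio.h>

/* With #address-cells = <2> and #size-cells = <2>, every address and
   size in reg is split into two 32-bit cells: <hi lo>. */
static uint64_t combine_cells(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    /* reg = <0 0x40000000 0 0x20000000>; */
    printf("addr 0x%09llx size 0x%09llx\n",
           (unsigned long long)combine_cells(0, 0x40000000),
           (unsigned long long)combine_cells(0, 0x20000000));
    /* reg = <2 0x00000000 0 0x20000000>; */
    printf("addr 0x%09llx size 0x%09llx\n",
           (unsigned long long)combine_cells(2, 0x00000000),
           (unsigned long long)combine_cells(0, 0x20000000));
    return 0;
}
This prints addr 0x040000000 size 0x020000000 and addr 0x200000000 size 0x020000000, matching the two ranges above.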
Related
I want to tweak the available DRAM size when it is passed to the kernel.
I have 8 GB of RAM mounted on the device and want to let the kernel know that it has only 4 GB, without soldering 4 GB of memory onto it.
Please let me know where to start:
The code where LK passes the memory information (DRAM start address and size) to the kernel.
The code where LK recognizes the DRAM size.
What I am trying to do is just change the device tree file to describe 4 GB only.
from (total 8 GB):
memory_0: memory@80000000 {
device_type = "memory";
reg = <0x0 0x80000000 0x80000000>;
};
memory@880000000 {
device_type = "memory";
reg = <0x00000008 0x80000000 0x80000000>;
};
memory@900000000 {
device_type = "memory";
reg = <0x00000009 0x00000000 0x80000000>;
};
memory@980000000 {
device_type = "memory";
reg = <0x00000009 0x80000000 0x80000000>;
};
to (total 4 GB):
memory_0: memory@80000000 {
device_type = "memory";
reg = <0x0 0x80000000 0x80000000>;
};
memory@880000000 {
device_type = "memory";
reg = <0x00000008 0x80000000 0x80000000>;
};
Thanks
I'm trying to figure out what exactly each of the metrics reported by "nvprof" is. More specifically, I can't figure out which transactions count as System Memory and Device Memory reads and writes. I wrote a very basic program just to help figure this out.
#define TYPE float
#define BDIMX 16
#define BDIMY 16
#include <cuda.h>
#include <cmath>   // for ceil()
#include <cstdlib> // for malloc()
#include <cstdio>
#include <iostream>
__global__ void kernel(TYPE *g_output, TYPE *g_input, const int dimx, const int dimy)
{
    __shared__ float s_data[BDIMY][BDIMX];
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int in_idx = iy * dimx + ix; // index for reading input
    int tx = threadIdx.x;        // thread's x-index into corresponding shared memory tile
    int ty = threadIdx.y;        // thread's y-index into corresponding shared memory tile
    s_data[ty][tx] = g_input[in_idx];
    __syncthreads();
    g_output[in_idx] = s_data[ty][tx] * 1.3;
}

int main(){
    int size_x = 16, size_y = 16;
    dim3 numTB;
    numTB.x = (int)ceil((double)(size_x)/(double)BDIMX);
    numTB.y = (int)ceil((double)(size_y)/(double)BDIMY);
    dim3 tbSize;
    tbSize.x = BDIMX;
    tbSize.y = BDIMY;
    float *a, *a_out;
    // host staging buffer (the _d suffix notwithstanding, this is host memory)
    float *a_d = (float *)malloc(size_x * size_y * sizeof(TYPE));
    cudaMalloc((void **)&a, size_x * size_y * sizeof(TYPE));
    cudaMalloc((void **)&a_out, size_x * size_y * sizeof(TYPE));
    for (int index = 0; index < size_x * size_y; index++) {
        a_d[index] = index;
    }
    cudaMemcpy(a, a_d, size_x * size_y * sizeof(TYPE), cudaMemcpyHostToDevice);
    kernel<<<numTB, tbSize>>>(a_out, a, size_x, size_y);
    cudaDeviceSynchronize();
    return 0;
}
Then I run nvprof --metrics all to see all the metrics. This is the part I'm interested in:
Metric Name Metric Description Min Max Avg
Device "Tesla K40c (0)"
Kernel: kernel(float*, float*, int, int)
local_load_transactions Local Load Transactions 0 0 0
local_store_transactions Local Store Transactions 0 0 0
shared_load_transactions Shared Load Transactions 8 8 8
shared_store_transactions Shared Store Transactions 8 8 8
gld_transactions Global Load Transactions 8 8 8
gst_transactions Global Store Transactions 8 8 8
sysmem_read_transactions System Memory Read Transactions 0 0 0
sysmem_write_transactions System Memory Write Transactions 4 4 4
tex_cache_transactions Texture Cache Transactions 0 0 0
dram_read_transactions Device Memory Read Transactions 0 0 0
dram_write_transactions Device Memory Write Transactions 40 40 40
l2_read_transactions L2 Read Transactions 70 70 70
l2_write_transactions L2 Write Transactions 46 46 46
I understand the shared and global accesses. The global accesses are coalesced, and since there are 8 warps (16 × 16 = 256 threads / 32 threads per warp), there are 8 transactions.
But I can't figure out the system memory and device memory write transaction numbers.
It helps if you have a model of the GPU memory hierarchy with both logical and physical spaces, such as the one here.
Referring to the "overview tab" diagram:
gld_transactions refer to transactions issued from the warp targeting the global logical space. On the diagram, this would be the line from the "Kernel" box on the left to the "global" box to the right of it, and the logical data movement direction would be from right to left.
gst_transactions refer to the same line as above, but logically from left to right. Note that these logical global transactions could hit in a cache and not go anywhere after that. From the metrics standpoint, those transaction types only refer to the indicated line on the diagram.
dram_write_transactions refer to the line on the diagram which connects device memory on the right with the L2 cache, and the logical data flow is from left to right on this line. Since the L2 cache line is 32 bytes (whereas the L1 cache line, and the size of a global transaction, is 128 bytes), the device memory transactions are also 32 bytes, not 128 bytes. So a global write transaction that passes through L1 (a write-through cache, if enabled) and L2 will generate 4 dram_write transactions (128 B / 32 B = 4). With 8 global store transactions, that accounts for 8 × 4 = 32 of the 40 transactions.
system memory transactions target zero-copy host memory. You don't seem to have that, so I can't explain those.
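For reference, sysmem traffic appears when a kernel dereferences mapped (zero-copy) host memory. A minimal sketch of what such an allocation would look like, reusing names from the question's code (this is an illustration, not something the posted code does):
float *h_buf, *d_alias;
// page-locked host memory that the device can access directly over PCIe
// (on older setups, cudaSetDeviceFlags(cudaDeviceMapHost) must be called first)
cudaHostAlloc((void **)&h_buf, size_x * size_y * sizeof(TYPE), cudaHostAllocMapped);
// device-side pointer aliasing the same host allocation
cudaHostGetDevicePointer((void **)&d_alias, h_buf, 0);
// a kernel reading through d_alias generates sysmem_read_transactions;
// writing through it would generate sysmem_write_transactions
kernel<<<numTB, tbSize>>>(a_out, d_alias, size_x, size_y);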
Note that in some cases, for some metrics, on some GPUs, the profiler may have some "inaccuracy" when launching very small numbers of threadblocks. For example, some metrics are sampled on a per-SM basis and scaled (device memory transactions are not in this category, however). If disparate work is being done on each SM (perhaps due to a very small number of threadblocks launched), then the scaling can be misleading or less accurate. Generally, if you launch a larger number of threadblocks, these effects become insignificant.
This answer may also be of interest.
I have some constraints like so:
interesting = 0x1
choked = 0x2
remote_interested = 0x4
remote_choked = 0x8
supports_extensions = 0x10
local_connection = 0x20
handshake = 0x40
connecting = 0x80
queued = 0x100
on_parole = 0x200
seed = 0x400
optimistic_unchoke = 0x800
rc4_encrypted = 0x100000
plaintext_encrypted = 0x200000
and the documentation tells me 'The flags attribute tells you in which state the peer is in. It is set to any combination of the enums above'. So basically, I call the DLL and it fills in the structure with a decimal number representing the flag values. A few examples:
2086227
170
2098227
106
How do I from the decimal determine the flags?
In order to determine which flags are set, you need to use the bitwise AND operation (bit32.band() in Lua 5.2). For example:
function hasFlags(int, ...)
  local all = bit32.bor(...)         -- OR the masks of interest together
  return bit32.band(int, all) == all -- true only if every one of them is set
end

if hasFlags(2086227, interesting, local_connection) then
  -- do something that requires interesting and local_connection
end
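Since the flags ultimately come from a DLL, the same check is also easy to express in C. A minimal sketch using only the masks listed in the question (the table layout and printout are illustrative):
#include <stdio.h>

struct flag { const char *name; unsigned mask; };

static const struct flag flags[] = {
    { "interesting",         0x1 },
    { "choked",              0x2 },
    { "remote_interested",   0x4 },
    { "remote_choked",       0x8 },
    { "supports_extensions", 0x10 },
    { "local_connection",    0x20 },
    { "handshake",           0x40 },
    { "connecting",          0x80 },
    { "queued",              0x100 },
    { "on_parole",           0x200 },
    { "seed",                0x400 },
    { "optimistic_unchoke",  0x800 },
    { "rc4_encrypted",       0x100000 },
    { "plaintext_encrypted", 0x200000 },
};

int main(void)
{
    unsigned value = 2086227; /* first example value from the question */
    for (size_t i = 0; i < sizeof flags / sizeof flags[0]; i++)
        if (value & flags[i].mask) /* bit set -> flag active */
            printf("%s\n", flags[i].name);
    return 0;
}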
I'm calling a function that returns an integer representing a bitfield of 16 binary inputs; each of the colors can be either on or off.
I'm trying to create a function to get the changes between the old state and the new state,
e.g.
function getChanges(oldColors, newColors)
  sampleOutput = {white = "", orange = "added", magenta = "removed", ...}
  return sampleOutput
end
I've tried subtracting the oldColors from the newColors and the newColors from the oldColors, but this seems to result in chaos should more than one value change.
This is to detect rising/falling edges on multiple inputs.
Edit: there appears to be a subset of the Lua bit API available.
From the ComputerCraft wiki:
Color              Dec    Hex     Binary
colors.white       1      0x1     0000000000000001
colors.orange      2      0x2     0000000000000010
colors.magenta     4      0x4     0000000000000100
colors.lightBlue   8      0x8     0000000000001000
colors.yellow      16     0x10    0000000000010000
colors.lime        32     0x20    0000000000100000
colors.pink        64     0x40    0000000001000000
colors.gray        128    0x80    0000000010000000
colors.lightGray   256    0x100   0000000100000000
colors.cyan        512    0x200   0000001000000000
colors.purple      1024   0x400   0000010000000000
colors.blue        2048   0x800   0000100000000000
colors.brown       4096   0x1000  0001000000000000
colors.green       8192   0x2000  0010000000000000
colors.red         16384  0x4000  0100000000000000
colors.black       32768  0x8000  1000000000000000
function getChanges(oldColors, newColors)
  -- bits set in newColors but not in oldColors: rising edges
  local added = bit.band(newColors, bit.bnot(oldColors))
  -- bits set in oldColors but not in newColors: falling edges
  local removed = bit.band(oldColors, bit.bnot(newColors))
  local color_names = {
    white = 1,
    orange = 2,
    magenta = 4,
    lightBlue = 8,
    yellow = 16,
    lime = 32,
    pink = 64,
    gray = 128,
    lightGray = 256,
    cyan = 512,
    purple = 1024,
    blue = 2048,
    brown = 4096,
    green = 8192,
    red = 16384,
    black = 32768
  }
  local diff = {}
  for cn, mask in pairs(color_names) do
    diff[cn] = bit.band(added, mask) ~= 0 and 'added'
      or bit.band(removed, mask) ~= 0 and 'removed' or ''
  end
  return diff
end
I am finding it hard to interpret the values of the XMM registers in the register window of Visual Studio. The window displays the following:
XMM0 = 00000000000000004018000000000000 XMM1 = 00000000000000004020000000000000
XMM2 = 00000000000000000000000000000000 XMM3 = 00000000000000000000000000000000
XMM4 = 00000000000000000000000000000000 XMM5 = 00000000000000000000000000000000
XMM6 = 00000000000000000000000000000000 XMM7 = 00000000000000000000000000000000
XMM00 = +0.00000E+000 XMM01 = +2.37500E+000 XMM02 = +0.00000E+000
XMM03 = +0.00000E+000 XMM10 = +0.00000E+000 XMM11 = +2.50000E+000
XMM12 = +0.00000E+000 XMM13 = +0.00000E+000
From the code that I am running, the values of XMM0 and XMM1 should be 6 and 8 (or the other way round). The register value shown here is XMM01 = +2.37500E+000.
What does this translate to?
Yes, it looks like:
XMM0 = { 6.0, 0.0 } // 6.0 = 0x4018000000000000 (double precision)
XMM1 = { 8.0, 0.0 } // 8.0 = 0x4020000000000000 (double precision)
The reason you are having problems interpreting this is that your debugger is displaying each 128-bit XMM register in hex and then below that as 4 × single-precision floats, but you are evidently using double-precision floats. The upper 32 bits of the double 6.0 (0x40180000) happen to decode to 2.375 when read as a single-precision float, which is exactly where the XMM01 value comes from.
I'm not familiar with the Visual Studio debugger, but there should ideally be a way to change the representation of your XMM registers - you may have to look at the manual or online help for this.
Note that in general using double precision with SSE is rarely of any value, particularly if you have a fairly modern x86 CPU with two FPUs.
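If you want to double-check such values by hand, here is a minimal C sketch (my own, not debugger output) that reinterprets a raw 64-bit pattern as a double:
#include <stdint.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    uint64_t bits = 0x4018000000000000ULL; /* low 64 bits of XMM0 above */
    double d;
    memcpy(&d, &bits, sizeof d); /* reinterpret the bits as an IEEE 754 double */
    printf("%g\n", d);           /* prints 6 */
    return 0;
}
Using memcpy rather than a pointer cast avoids strict-aliasing problems, and the same trick with a uint32_t and a float confirms the 0x40180000 = 2.375 reading above.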