I am testing alignment and I've noticed something strange with the iOS Simulator (Xcode 4.3.2 and Xcode 4.5).
On the Simulator, structures are aligned to an 8-byte boundary even when __attribute__ ((aligned (4))) is used to force a 4-byte boundary. Note in the memory dump below that the struct is padded with 0x00000001 at the end to reach the 8-byte boundary.
If the myStruct variable is defined at global scope, the Simulator aligns it to a 4-byte boundary, so this may be related to the stack.
The Simulator is i386, so it is 32-bit and should be aligning to a 4-byte boundary. What is the reason it aligns to a 64-bit boundary? Is it a feature or a bug?
(I know it isn't strictly necessary to worry about the Simulator, but this kind of difference can lead to subtle problems.)
typedef struct myStruct
{
    int a;
    int b;
} myStruct;
//} __attribute__ ((aligned (4))) myStruct;
-(void)alignmentTest
{
    // Offset 16*n (0x2fdfe2f0)
    int __attribute__ ((aligned (16))) force16ByteBoundary = 0x01020304;

    // Offset 16*n-4 (0x2fdfe2ec)
    int some4Byte = 0x09080706;

    // Offset 16*n-12 (0x2fdfe2e4)
    myStruct mys;
    mys.a = 0xa1b1c1d1;
    mys.b = 0xf2e28292;

    NSLog(@"&force16ByteBoundary: %p / &some4Byte: %p / &mys: %p",
          &force16ByteBoundary, &some4Byte, &mys);
}
(EDIT: Optimizations are off, -O0.)
Simulator (iOS 5.1) results:
(lldb) x `&mys` -fx
0xbfffda60: 0xa1b1c1d1 0xf2e28292 0x00000001 0x09080706
0xbfffda70: 0x01020304
&force16ByteBoundary: 0xbfffda70 / &some4Byte: 0xbfffda6c / &mys: 0xbfffda60
Device (iOS 5.1) results:
(lldb) x `&mys` -fx
0x2fdfe2e4: 0xa1b1c1d1 0xf2e28292 0x09080706 0x01020304
&force16ByteBoundary: 0x2fdfe2f0 / &some4Byte: 0x2fdfe2ec / &mys: 0x2fdfe2e4
(NEW FINDINGS)
- On Simulator and Device:
  - Building for Release or Debug makes no difference to the alignments.
  - Local and global variables of long long and double types are aligned to an 8-byte boundary, although I expected them to need only 4-byte alignment.
  - There is no problem with global variables of struct types.
- On Simulator:
  - Local variables of struct types are aligned to an 8-byte boundary even when the struct contains only a single char member.
(EDIT)
I could only find the "Data Types and Data Alignment" documentation for iOS here.
(They could also be inferred from the ILP32 alignments here.)
Typically, the alignment attributes only affect the relative alignment of items inside a struct. This allows backward compatibility with code that wants to bulk-copy data into a structure directly from the network or a binary file.
The alignment attributes won't affect the alignment of local variables allocated on the stack. The ordering and alignment of items on the stack is not guaranteed and will generally be chosen optimally for each item on the device. So, if an i386-based device can fetch a 64-bit long long from memory in a single operation by 8-byte aligning it, it will do so. Some processors lose dramatic amounts of performance if data is not fully aligned, and some processors can throw exceptions when attempting to read data that is not properly aligned.
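To make that distinction concrete, here is a minimal, hypothetical C sketch (GCC/Clang attribute syntax; the variable names are invented for illustration): aligned(4) pins the struct's declared alignment, which is what globals and members of larger aggregates follow, while a stack local may still be placed on a stricter boundary.

#include <stdio.h>

typedef struct
{
    int a;
    int b;
} __attribute__ ((aligned (4))) myStruct;

myStruct globalVar;    /* globals follow the declared 4-byte alignment */

int main(void)
{
    myStruct localVar; /* the compiler may round this up to an 8- or 16-byte boundary */

    printf("alignof(myStruct) = %lu\n", (unsigned long)__alignof__(myStruct));
    printf("&globalVar = %p\n", (void *)&globalVar);
    printf("&localVar  = %p\n", (void *)&localVar);
    return 0;
}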
Related
I'm trying to understand how Rust deals with memory, and I have a little program that prints some memory addresses:
fn main() {
    let a = &&&5;
    let x = 1;
    println!(" {:p}", &x);
    println!(" {:p} \n {:p} \n {:p} \n {:p}", &&&a, &&a, &a, a);
}
This prints the following (varies for different runs):
0x235d0ff61c
0x235d0ff710
0x235d0ff728
0x235d0ff610
0x7ff793f4c310
This is actually a mix of both 40-bit and 48-bit addresses. Why this mix? Also, can somebody please tell me why addresses (2, 3, 4) are not separated by 8 bytes (since std::mem::size_of_val(&a) gives 8)? I'm running Windows 10 on an AMD x64 processor (Phenom II X4) with 24 GB RAM.
All the addresses do have the same size; Rust is just not printing leading zero digits.
The actual memory layout is an implementation detail of your OS, but the reason that a prints a location in a different memory area than all the other variables is that what a refers to actually lives in your loaded binary, because it is a value that can already be computed by the compiler. All the other variables are computed at runtime and live on the stack.
See the compilation result on https://godbolt.org/z/kzSrDr:
.L__unnamed_4 contains the value 5; .L__unnamed_5, .L__unnamed_6 and .L__unnamed_1 are &5, &&5 and &&&5.
So .L__unnamed_1 is what is at 0x7ff793f4c310 on your system, while the 0x235d0ff??? addresses are on your stack, computed in the red and blue areas of the code.
This is actually a mix of both 40-bit and 48-bit addresses. Why this mix?
It's not really a mix, Rust just doesn't display leading zeroes. It's really about where the OS maps the various components of the program (data, bss, heap and stack) in the address space.
Also, can somebody please tell me why the addresses (2, 3, 4) do not fall in locations separated by 8-bytes (since std::mem::size_of_val(&a) gives 8)?
Because println! is a macro which expands to a bunch of stuff in the stack frame, your values are not defined next to one another in the final code's frame (https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=5b812bf11e51461285f51f95dd79236b). Though even if they were, there'd be no guarantee the compiler wouldn't e.g. be reusing now-dead memory to save on frame size.
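A small sketch (not from the original post; the variable names are illustrative) that separates the two effects described above: plain locals live on the stack with no guaranteed relative placement, while a reference to a literal is promoted into the binary's static memory.

fn main() {
    // Two plain locals: both live on the stack, but the compiler decides
    // their placement; there is no guarantee they sit 8 bytes apart.
    let x: u64 = 1;
    let y: u64 = 2;
    println!("{:p} {:p}", &x, &y);

    // A reference to a literal: the 5 is promoted to static memory inside
    // the binary, so the address it points to lies in a different region
    // than the stack address of r itself.
    let r: &i32 = &5;
    println!("{:p} (static) vs {:p} (stack)", r, &r);
}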
I'm trying to improve the performance of a Rust program, which requires me to reduce the size of some large enums. For example:
enum EE {
    A,          // 0
    B(i32),     // 4
    C(i64),     // 8
    D(String),  // 24
    E {         // 16
        x: i64,
        y: i32,
    },
}

fn main() {
    println!("{}", std::mem::size_of::<EE>()); // 32
}
This prints 32. But if I want to know the size of EE::A, I get a compile error:
error[E0573]: expected type, found variant `EE::A`
--> src/main.rs:14:40
|
14 | println!("{}", std::mem::size_of::<EE::A>());
| ^^^^^
| |
| not a type
| help: try using the variant's enum: `crate::EE`
error: aborting due to previous error
error: could not compile `play_rust`.
Is there a way to find out which variant takes the most space?
No, there is no way to get the size of just one variant of an enum. The best you can do is get the size of what the variant contains, as if it were a standalone struct:
println!("sizeof EE::A: {}", std::mem::size_of::<()>()); // 0
println!("sizeof EE::B: {}", std::mem::size_of::<i32>()); // 4
println!("sizeof EE::C: {}", std::mem::size_of::<i64>()); // 8
println!("sizeof EE::D: {}", std::mem::size_of::<String>()); // 24
println!("sizeof EE::E: {}", std::mem::size_of::<(i64, i32)>()); // 16
Even this isn't especially useful because it includes padding bytes that may be used to store the tag; as you point out, the size of the enum can be reduced to 16 if D is shrunk to a single pointer, but you can't know that from looking at just the sizes. If y were instead defined as i64, the size of each variant would be the same, but the size of the enum would need to be 24. Alignment is another confounding factor that makes the size of an enum more complex than just "the size of the largest variant plus the tag".
Of course, this is all highly platform-dependent, and your code should not rely on any enum having a particular layout (unless you can guarantee it with a #[repr] annotation).
If you have a particular enum you're worried about, it's not difficult to get the size of each contained type. Clippy also has a lint for enums with extreme size differences between variants. However, I don't recommend using size alone to make manual optimizations to enum layouts, or boxing things that are only a few pointers in size -- indirection suppresses other kinds of optimizations the compiler may be able to do. If you prioritize minimal space usage you may accidentally make your code much slower in the process.
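As a rough illustration of the point above (a sketch, not part of the original answer; the exact numbers depend on the target and compiler version): boxing the String shrinks variant D to a single pointer, which, per the reasoning above, typically lets the whole enum drop from 32 to 16 bytes on a 64-bit target.

#[allow(dead_code)]
enum EE {
    A,
    B(i32),
    C(i64),
    D(String),            // 24 bytes: pointer + capacity + length
    E { x: i64, y: i32 },
}

#[allow(dead_code)]
enum EEBoxed {
    A,
    B(i32),
    C(i64),
    D(Box<String>),       // now a single 8-byte pointer
    E { x: i64, y: i32 },
}

fn main() {
    println!("EE:      {}", std::mem::size_of::<EE>());      // 32
    println!("EEBoxed: {}", std::mem::size_of::<EEBoxed>()); // typically 16
}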
The structure is:
struct {
    char a;
    short b;
    short c;
    short d;
    char e;
} s1;
The size of short is given as 2 bytes and the size of char as 1 byte. It is a 32-bit little-endian processor.
As I see it, the layout should be:
1000 a[0]
1001 (padding)
1002 b[0]
1003 b[1]
1004 c[0]
1005 c[1]
1006 d[0]
1007 d[1]
1008 e[0]
so the size of s1 would be 9 bytes,
but according to the solution, the size of s1 is supposed to be 10 bytes.
The answer here is that the layout of the structure is entirely up to the compiler.
10 is likely to be the most common size of this structure.
The reason for the padding is that, if there is an array, it will keep all the members properly aligned. If the size were 9, every other array element would have misaligned structure members.
Unaligned accesses are not permitted on some systems. On most systems, they cause the processor to use extra cycles to access the data.
A compiler could allocate 4 bytes for each element in such a structure.
The C Standard says (sorry, not at my computer, so no quote) that a struct is aligned to the alignment of its largest (base type) member. Your largest member field is a short, 2 bytes, so the first element 'a' is aligned at an even address. 'a' takes up 1 byte. 'b' has to be aligned at an even address again, so one byte gets wasted. The last element of your struct, 'e', is also one byte, and the byte following it is wasted as trailing padding so that every element of an array of this struct stays properly aligned; that is where the 10th byte comes from. If you put 'a' at the end, i.e. rearrange the members, you are likely to find the size of your struct to be 8 bytes, which is as good as it gets.
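A small, hypothetical C sketch of the two layouts being discussed (the exact numbers are compiler- and ABI-dependent; the comments show the values you would typically see on a 32-bit target where short has 2-byte alignment):

#include <stdio.h>
#include <stddef.h>

struct s1 { char a; short b; short c; short d; char e; };   /* as asked */
struct s2 { short b; short c; short d; char a; char e; };   /* rearranged */

int main(void)
{
    /* typically: a=0, b=2, c=4, d=6, e=8, sizeof=10
       (1 byte of padding after 'a' plus 1 byte of trailing padding) */
    printf("s1: a=%zu b=%zu c=%zu d=%zu e=%zu sizeof=%zu\n",
           offsetof(struct s1, a), offsetof(struct s1, b),
           offsetof(struct s1, c), offsetof(struct s1, d),
           offsetof(struct s1, e), sizeof(struct s1));

    /* typically 8: no padding needed once the chars sit at the end */
    printf("s2: sizeof=%zu\n", sizeof(struct s2));
    return 0;
}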
I'm running the OpenCL kernel below with a two-dimensional global work size of 1000000 x 100 and a local work size of 1 x 100.
__kernel void myKernel(
    const int length,
    const int height,
    and a bunch of other parameters) {

    // declare some local arrays to be shared by all 100 work items in this group
    __local float LP [length];
    __local float LT [height];
    __local int bitErrors = 0;
    __local bool failed = false;

    // here come my actual computations which utilize the space in LP and LT
}
This, however, refuses to compile, since the parameters length and height are not known at compile time. But it is not clear to me at all how to do this correctly. Should I use pointers with malloc? How do I handle this so that the memory is only allocated once for the entire workgroup and not once per work item?
All I need is 2 arrays of floats, 1 int and 1 boolean that are shared among the entire workgroup (all 100 work items), but I can't find any method that does this correctly...
It's relatively simple, you can pass the local arrays as arguments to your kernel:
kernel void myKernel(const int length, const int height, local float* LP,
local float* LT, a bunch of other parameters)
You then set the kernel argument with a value of NULL and a size equal to the size you want to allocate for the argument (in bytes). Therefore it should be:
clSetKernelArg(kernel, 2, length * sizeof(cl_float), NULL);
clSetKernelArg(kernel, 3, height * sizeof(cl_float), NULL);
local memory is always shared by the workgroup (as opposed to private), so I think the bool and int should be fine, but if not you can always pass those as arguments too.
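For completeness, a hypothetical host-side sketch (variable names invented, matching the argument order of the kernel signature above) showing the scalar and the dynamically sized local arguments being set together; for the local buffers only the size matters and the data pointer stays NULL:

cl_int length = 1024, height = 100;   /* example values */
cl_int err;
err  = clSetKernelArg(kernel, 0, sizeof(cl_int), &length);
err |= clSetKernelArg(kernel, 1, sizeof(cl_int), &height);
err |= clSetKernelArg(kernel, 2, length * sizeof(cl_float), NULL);
err |= clSetKernelArg(kernel, 3, height * sizeof(cl_float), NULL);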
Not really related to your problem (and not necessarily relevant, since I do not know what hardware you plan to run this on), but GPUs don't particularly like work sizes which are not a multiple of a particular power of two (I think it was 32 for NVIDIA and 64 for AMD), meaning the hardware will probably create workgroups of 128 items, of which the last 28 are basically wasted. So if you are running OpenCL on a GPU, it might help performance if you directly use workgroups of size 128 (and change the global work size appropriately).
As a side note: I never understood why everyone uses the underscore variants of kernel, local and global; they seem much uglier to me.
You could also declare your arrays like this:
__local float LP[LENGTH];
And pass LENGTH as a define when compiling your kernel:
int lp_size = 128; // this is an example; could be dynamically calculated
char compileArgs[64];
sprintf(compileArgs, "-DLENGTH=%d", lp_size);
clBuildProgram(program, 0, NULL, compileArgs, NULL, NULL);
You do not have to allocate all your local memory outside the kernel, especially when it is a simple variable instead of an array.
The reason your code does not compile is that OpenCL does not support local memory initialization. This is specified in the documentation (https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/local.html). It is also not feasible in CUDA ("Is there a way of setting default value for shared memory array?").
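A minimal OpenCL C sketch of that point (assuming the kernel signature from the accepted answer above): __local variables can be declared inside the kernel but cannot carry initializers, so one work item sets them and the group synchronizes before anyone reads them.

__kernel void myKernel(const int length, const int height,
                       __local float *LP, __local float *LT)
{
    __local int bitErrors;
    __local bool failed;

    // let a single work item initialize the shared variables
    if (get_local_id(0) == 0 && get_local_id(1) == 0) {
        bitErrors = 0;
        failed = false;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // ... actual computation using LP, LT, bitErrors, failed ...
}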
PS: The answer from Grizzly is good enough; it would have been better to post this as a comment, but I am restricted by the reputation policy. Sorry.
There are 2 pointers to 2 unaligned 8-byte chunks that need to be loaded into an xmm register. If possible, using intrinsics. And if possible, without using an auxiliary register and without pinsrd. (SSSE3, Core 2.)
From the MSVC documentation, it looks like you can do the following:
__m128d xx; // an uninitialised xmm register
xx = _mm_loadh_pd(xx, ptra); // load the higher 64 bits from (unaligned) ptra
xx = _mm_loadl_pd(xx, ptrb); // load the lower 64 bits from (unaligned) ptrb
Loading from unaligned storage is (in my experience) very much slower than loading from aligned pointers, so you probably wouldn't want to be doing this type of operation too often if you really want high performance.
Hope this helps.
Unaligned access is much slower than aligned access (at least pre-Nehalem), so you may get better speed by loading the aligned 128-bit words that contain the desired unaligned 64-bit words, then shuffling them to produce the result you want.
This assumes:
- you have read access to the full 128-bit words
- the 64-bit words are aligned on at least 32-bit boundaries
e.g. (not tested)
int aoff = (int)((uintptr_t)ptra & 15);
int boff = (int)((uintptr_t)ptrb & 15);
__m128 va = _mm_load_ps( (const float*)((char*)ptra - aoff) );
__m128 vb = _mm_load_ps( (const float*)((char*)ptrb - boff) );
switch ( (aoff<<4) | boff )
{
    case 0: _mm_shuffle_ps(va, vb, ...
The number of cases depends on whether you can assume 64-bit alignment.
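For the stricter case where both pointers are known to be 8-byte aligned (so each 64-bit chunk sits entirely within one 16-byte line), the switch collapses to four cases. A rough, untested sketch along those lines (the helper name and the choice of placing ptrb's chunk in the low half and ptra's in the high half are my own, not from the question):

#include <xmmintrin.h>
#include <stdint.h>

static __m128 load_two_qwords(const void *ptra, const void *ptrb)
{
    /* offsets within the containing 16-byte lines: 0 or 8 by assumption */
    uintptr_t aoff = (uintptr_t)ptra & 15;
    uintptr_t boff = (uintptr_t)ptrb & 15;

    /* aligned loads of the lines that contain the two 64-bit chunks */
    __m128 va = _mm_load_ps((const float *)((uintptr_t)ptra - aoff));
    __m128 vb = _mm_load_ps((const float *)((uintptr_t)ptrb - boff));

    /* low two lanes come from vb, high two lanes from va */
    switch (((aoff >> 3) << 1) | (boff >> 3)) {
    case 0:  return _mm_shuffle_ps(vb, va, _MM_SHUFFLE(1, 0, 1, 0));
    case 1:  return _mm_shuffle_ps(vb, va, _MM_SHUFFLE(1, 0, 3, 2));
    case 2:  return _mm_shuffle_ps(vb, va, _MM_SHUFFLE(3, 2, 1, 0));
    default: return _mm_shuffle_ps(vb, va, _MM_SHUFFLE(3, 2, 3, 2));
    }
}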