SSE2: How to reduce a __m128i to a word

What's the best way (SSE2) to reduce a __m128i (4 words a, b, c, d) to one word?
I want the low byte of each of the four components:
int result = ( _m128.a & 0x000000ff ) << 24
           | ( _m128.b & 0x000000ff ) << 16
           | ( _m128.c & 0x000000ff ) << 8
           | ( _m128.d & 0x000000ff ) << 0;
Is there an intrinsic for that? Thanks!

FYI, the SSSE3 intrinsic _mm_shuffle_epi8 does the job (with the shuffle control 0x0004080c in this case).
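For illustration, here is a minimal sketch of that SSSE3 version (my code, not from the original poster): read little-endian, the control dword 0x0004080c selects bytes 12, 8, 4 and 0 of x into result bytes 0 through 3.

#include <tmmintrin.h> /* SSSE3: _mm_shuffle_epi8 */

unsigned reduce_ssse3(__m128i x)
{
    /* Control 0x0004080c: result byte 0 <- x byte 12, byte 1 <- x byte 8,
       byte 2 <- x byte 4, byte 3 <- x byte 0. The upper control bytes are
       zero, which is harmless since only the low dword is extracted. */
    __m128i ctrl = _mm_cvtsi32_si128(0x0004080c);
    return (unsigned)_mm_cvtsi128_si32(_mm_shuffle_epi8(x, ctrl));
}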

The SSE2 answer takes more than one instruction:
unsigned benoit(__m128i x)
{
    __m128i zero = _mm_setzero_si128(), mask = _mm_set1_epi32(255);
    return _mm_cvtsi128_si32(
               _mm_packus_epi16(
                   _mm_packus_epi16(
                       _mm_and_si128(x, mask), zero), zero));
}
The above amounts to 5 machine ops, given the input in %xmm1 and the output in %eax:
pxor %xmm0, %xmm0
pand MASK, %xmm1
packuswb %xmm0, %xmm1
packuswb %xmm0, %xmm1
movd %xmm1, %eax
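As a quick sanity check (my hypothetical test values, compiled together with benoit above): each lane's low byte lands in the corresponding byte of the result, lane 0 lowest.

#include <emmintrin.h> /* SSE2 */
#include <stdio.h>

/* benoit() as defined above */

int main(void)
{
    /* _mm_set_epi32 takes the highest lane first, so lanes 0..3 hold
       0x11, 0x22, 0x33, 0x44 */
    __m128i v = _mm_set_epi32(0x44, 0x33, 0x22, 0x11);
    printf("%08x\n", benoit(v)); /* prints 44332211 */
    return 0;
}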
If you want to see some unusual uses of SSE2, including high-speed bit-matrix transpose, string search and bitonic (GPGPU-style) sort, you might want to check my blog, Coding on the edges.
Anyway, hope that helps.

Related

Why does LLVM not remember an integer range when the variable has been constrained by an if statement?

There is a problem that has become quite popular recently regarding LLVM's (and GCC's) inability to optimize a trivial loop when the range is quite obvious.
Godbolt for all the examples below: https://godbolt.org/z/b3PzrsE5e
The code below will not be optimized, and the generated assembly contains a loop.
uint64_t sum1( uint64_t num ) {
    uint64_t sum = 0;
    for ( uint64_t j=0; j<=num; ++j ) {
        sum += 1;
    }
    return sum;
}
produces
sum1(unsigned long): # #sum1(unsigned long)
xorl %eax, %eax
.LBB0_1: # =>This Inner Loop Header: Depth=1
addq $1, %rax
cmpq %rdi, %rax
jbe .LBB0_1
retq
However, if one adds a limiter to the range of the variable, like an AND mask, the loop can be optimized away. You can also easily make it optimize if you change the condition j<=num to j<num+1 (a sketch of that variant follows the next example).
uint64_t sum2( uint64_t num ) {
    num &= 0xFFFFFFFFULL;
    uint64_t sum = 0;
    for ( uint64_t j=0; j<=num; ++j ) {
        sum += 1;
    }
    return sum;
}
produces
sum2(unsigned long): # #sum2(unsigned long)
movl %edi, %eax
addq $1, %rax
retq
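For illustration, here is the j<num+1 variant as a sketch (the function name sum1b is mine). The likely reason it folds: with j<=num the condition is always true when num == UINT64_MAX, so the backedge-taken count may be infinite, whereas num+1 gives a finite (possibly zero) trip count that the compiler can turn into return num + 1:
uint64_t sum1b( uint64_t num ) {
    uint64_t sum = 0;
    // num+1 wraps to 0 when num == UINT64_MAX, making the loop empty,
    // so the trip count is always the finite value num+1 (mod 2^64).
    for ( uint64_t j=0; j<num+1; ++j ) {
        sum += 1;
    }
    return sum;
}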
while curbing the range with an if statement does not have any effect
uint64_t sum3( uint64_t num ) {
    uint64_t sum = 0;
    if ( num <= 0xFF ) {
        for ( uint64_t j=0; j<=num; ++j ) {
            sum += 1;
        }
    }
    return sum;
}
This again produces assembly with a loop:
sum3(unsigned long): # #sum3(unsigned long)
xorl %eax, %eax
cmpq $255, %rdi
ja .LBB3_2
.LBB3_1: # =>This Inner Loop Header: Depth=1
addq $1, %rax
cmpq %rdi, %rax
jbe .LBB3_1
.LBB3_2:
retq
For that matter, even __builtin_assume( num < 0x100ULL ) has no effect on the result.
I have looked into the LLVM code and traced this to the failed statement at
// lib/Transforms/Scalar/IndVarSimplify.cpp:1430
const SCEV *MaxExitCount = SE->getSymbolicMaxBackedgeTakenCount(L);
if (isa<SCEVCouldNotCompute>(MaxExitCount)) {
    printf( "Could not compute\n");
    return false;
}
...
which then ends up in
// lib/Analysis/ScalarEvolution.cpp:7253
const SCEV *ScalarEvolution::getExitCount(const Loop *L,
                                          const BasicBlock *ExitingBlock,
                                          ExitCountKind Kind) {
  switch (Kind) {
  case Exact:
  case SymbolicMaximum:
    return getBackedgeTakenInfo(L).getExact(ExitingBlock, this);
  case ConstantMaximum:
    return getBackedgeTakenInfo(L).getConstantMax(ExitingBlock, this);
  };
  llvm_unreachable("Invalid ExitCountKind!");
}
What I don't understand is why the boundary cannot be inferred when the if statement makes it clear. Is this a feature that could be implemented? Am I on the right track?

Getting Beaglebone PRUs to work using PASM

I have been trying to get the PRU to work in a way that makes sense to me, and at this point I am completely clueless. I can get the examples to work, but any time I make a change or try to write things from scratch I just beat my head against the wall. As a start I just want to access any of the USR LEDs and turn them on or off at some speed, or as a first pass turn on an LED and leave it on. Here is a PASM code snippet I got off the internet (will post the link when I find it):
.origin 0
.entrypoint START
#define PRU0_ARM_INTERRUPT 19
#define AM33XX
#define GPIO1 0x4804c000 //Trying to access the GPIO1
#define GPIO_CLEARDATAOUT 0x190 //writing 1 to a bit here clears that bit in the GPIO_DATAOUT register (what does that mean?)
#define GPIO_SETDATAOUT 0x194 //set a value for GPIO output pins, but which pins am I even writing to? GPIO1?
#define GPIO_OE 0x134 //enable the pins' output capabilities
START:
    //clear that bit
    LBCO r0, c4, 4, 4 //This loads 4 bytes from constant entry c4 + offset 4 into r0, but why do you need that?
    CLR r0, r0, 4 //if you copied the data why do you need to clear it?
    SBCO r0, c4, 4, 4 //What is this for?
    //MOV r1, 10
    MOV r2, 0x00000000 //store value 0x00 in r2, why?
    MOV r3, GPIO1 //store GPIO1 address in r3
    MOV r4, GPIO_OE //place address of GPIO_OE into r4
    MOV r5, GPIO_SETDATAOUT //store address of GPIO_SETDATAOUT in r5
    MOV r6, GPIO_CLEARDATAOUT //store address of GPIO_CLEARDATAOUT in r6
    SBBO r2, r3, r4, 4 //What is this even doing? Copying 4 bytes from r2 to address r3+r4, but why do you want to copy that way, and if not, why not?
    MOV r1, 10
    MOV r2, 0xFFFFFFFF //Supposedly this turns the GPIO1 pins ON and OFF?
    SBBO r2, r3, r6, 4 //and again the storage stuff?
    HALT
I am also attaching the C code that I am using:
#include <stdio.h>
#include <pruss/prussdrv.h>
#include <pruss/pruss_intc_mapping.h>

#define PRU_NUM 0 //defining which PRU to use

int main() {
    int ret;
    tpruss_intc_initdata intc = PRUSS_INTC_INITDATA;
    //initialize the PRU by using init command from prussdrv.h
    ret = prussdrv_init();
    if(ret != 0) {
        printf("Error returned: %d\n",ret);
        printf("PRU unable to be initialized");
        return -1;
    }
    ret = prussdrv_open(PRU_EVTOUT_0);
    if(ret != 0) {
        printf("Error returned for prussdrv_open(): %d\n",ret);
        printf("PRU can't open PRU_EVTOUT_0");
        return -1;
    }
    //Map PRU's INTC
    ret = prussdrv_pruintc_init(&intc);
    if (ret != 0) {
        printf("Error returned for prussdrv_pruintc_init\n");
        printf("PRU doesn't work");
        return -1;
    }
    //load and execute binary on PRU
    prussdrv_exec_program(PRU_NUM, "./ashwini_test.bin");
    prussdrv_pru_wait_event(PRU_EVTOUT_0);
    prussdrv_pru_clear_event(PRU_EVTOUT_0, PRU0_ARM_INTERRUPT);
    /*Disable PRU and close memory mappings*/
    prussdrv_pru_disable(PRU_NUM);
    prussdrv_exit();
    //prussdrv_pru_wait_event(PRU_EVTOUT_0);
    return 0;
}
I have gone through the TRM, https://groups.google.com/forum/#!topic/beaglebone/98eF1wQE_QA, elinux, and Derek Molloy's site; I just feel like I am missing something very basic about how the addressing scheme works or how to think about these things. Thanks again for your help!
When you say that's your PASM code... do you mean it's code you got from somewhere else that you're trying to use? Because the comments on most lines asking what they do make it seem unlikely that it's actually your code...
Anyway, I can't really answer unless you have a specific question, but there's plenty of info out there about how to use the GPIO subsystem on the BeagleBone's AM335x processor. I talked about it some in a post a while back here: https://graycat.io/tutorials/beaglebone-io-using-python-mmap/
I've also got a few documented PRU assembly examples here: https://github.com/alexanderhiam/PRU-stuffs
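For reference, here is a minimal sketch of a PASM program that just turns one LED on and leaves it on. This is my own untested outline (it assumes USR0 sits on GPIO1 bit 21, as on the BeagleBone Black), not code from the links above:

.origin 0
.entrypoint START

#define PRU0_ARM_INTERRUPT 19
#define GPIO1           0x4804c000
#define GPIO_OE         0x134
#define GPIO_SETDATAOUT 0x194

START:
    // Enable the OCP master port so the PRU can reach L4 peripherals such
    // as GPIO1: SYSCFG lives at constant-table entry c4, offset 4, and
    // bit 4 is STANDBY_INIT, which must be cleared.
    LBCO r0, c4, 4, 4
    CLR  r0, r0, 4
    SBCO r0, c4, 4, 4

    MOV  r1, GPIO1            // GPIO1 base address
    MOV  r4, GPIO_OE          // offsets > 255 must be passed via a register
    MOV  r5, GPIO_SETDATAOUT

    // Configure GPIO1 pin 21 (USR0 on the BeagleBone Black) as an output:
    // read GPIO_OE, clear bit 21 (0 = output), write it back.
    LBBO r2, r1, r4, 4
    CLR  r2, r2, 21
    SBBO r2, r1, r4, 4

    // Drive the pin high by writing its bit to GPIO_SETDATAOUT.
    MOV  r2, 0x00200000       // bit 21
    SBBO r2, r1, r5, 4

    // Signal the host (pairs with prussdrv_pru_wait_event in the C code).
    MOV  r31.b0, PRU0_ARM_INTERRUPT+16
    HALT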

Calling a function crashes when the stack pointer is changed with inline assembly

I have written some code that changes the current stack used by modifying the stack pointer in inline assembly. Although I can call functions and create local variables, calls to println! and some functions from std::rt result in the application terminating abnormally with signal 4 (illegal instruction) in the playpen. How should I improve the code to prevent crashes?
#![feature(asm, box_syntax)]

#[allow(unused_assignments)]
#[inline(always)]
unsafe fn get_sp() -> usize {
    let mut result = 0usize;
    asm!("
        movq %rsp, $0
        "
        :"=r"(result):::"volatile"
    );
    result
}

#[inline(always)]
unsafe fn set_sp(value: usize) {
    asm!("
        movq $0, %rsp
        "
        ::"r"(value)::"volatile"
    );
}

#[inline(never)]
unsafe fn foo() {
    println!("Hello World!");
}

fn main() {
    unsafe {
        let mut stack = box [0usize; 500];
        let len = stack.len();
        stack[len-1] = get_sp();
        set_sp(std::mem::transmute(stack.as_ptr().offset((len as isize)-1)));
        foo();
        asm!("
            movq (%rsp), %rsp
            "
            ::::"volatile"
        );
    }
}
Debugging the program with rust-lldb on x86_64 OS X yields a stack trace some 300K frames deep, repeating these lines over and over:
frame #299995: 0x00000001000063c4 a`rt::util::report_overflow::he556d9d2b8eebb88VbI + 36
frame #299996: 0x0000000100006395 a`rust_stack_exhausted + 37
frame #299997: 0x000000010000157f a`__morestack + 13
morestack is implemented in assembly for each platform, like i386 and x86_64 — the i386 variant has more explanatory comments that I think you will want to read carefully. This piece stuck out to me:
Each Rust function contains an LLVM-generated prologue that compares the stack space required for the current function to the space remaining in the current stack segment, maintained in a platform-specific TLS slot.
Here are the first instructions of the foo method:
a`foo::h5f80496ac1ee3d43zaa:
0x1000013e0: cmpq %gs:0x330, %rsp
0x1000013e9: ja 0x100001405 ; foo::h5f80496ac1ee3d43zaa + 37
0x1000013eb: movabsq $0x48, %r10
0x1000013f5: movabsq $0x0, %r11
-> 0x1000013ff: callq 0x100001572 ; __morestack
As you can see, I am about to call into __morestack, so the comparison check failed.
I believe that this indicates that you cannot manipulate the stack pointer and attempt to call any Rust functions.
As a side note, let's look at your get_sp assembly:
movq %rsp, $0
Checking the Intel manual for the semantics of movq:
Copies a quadword from the source operand (second operand) to the destination operand (first operand).
Note that the manual uses Intel operand order (destination first), while the asm! block here uses AT&T syntax (source first), so the operand order is actually correct: %rsp is copied into the output operand $0.

How does CUDA constant memory allocation work?

I'd like to get some insight into how constant memory is allocated (using CUDA 4.2). I know that the total available constant memory is 64KB. But when is this memory actually allocated on the device? Does that limit apply to each kernel, to a CUDA context, or to the whole application?
Let's say there are several kernels in a .cu file, each using less than 64K of constant memory, but the total constant memory usage is more than 64K. Is it possible to call these kernels sequentially? What happens if they are called concurrently using different streams?
What happens if there is a large CUDA dynamic library with lots of kernels each using different amounts of constant memory?
What happens if there are two applications each requiring more than half of the available constant memory? The first application runs fine, but when will the second app fail? At app start, at cudaMemcpyToSymbol() calls or at kernel execution?
Parallel Thread Execution ISA Version 3.1, section 5.1.3, discusses constant banks:
Constant memory is restricted in size, currently limited to 64KB which
can be used to hold statically-sized constant variables. There is an
additional 640KB of constant memory, organized as ten independent 64KB
regions. The driver may allocate and initialize constant buffers in
these regions and pass pointers to the buffers as kernel function
parameters. Since the ten regions are not contiguous, the driver
must ensure that constant buffers are allocated so that each buffer
fits entirely within a 64KB region and does not span a region
boundary.
A simple program can be used to illustrate the use of constant memory.
__constant__ int kd_p1;
__constant__ short kd_p2;
__constant__ char kd_p3;
__constant__ double kd_p4;
__constant__ float kd_floats[8];

__global__ void parameters(int p1, short p2, char p3, double p4, int* pp1, short* pp2, char* pp3, double* pp4)
{
    *pp1 = p1;
    *pp2 = p2;
    *pp3 = p3;
    *pp4 = p4;
    return;
}

__global__ void constants(int* pp1, short* pp2, char* pp3, double* pp4)
{
    *pp1 = kd_p1;
    *pp2 = kd_p2;
    *pp3 = kd_p3;
    *pp4 = kd_p4;
    return;
}
Compile this for compute_30, sm_30 and run cuobjdump -sass <executable or obj> to disassemble it; you should see:
Fatbin elf code:
================
arch = sm_30
code version = [1,6]
producer = cuda
host = windows
compile_size = 32bit
identifier = c:/dev/constant_banks/kernel.cu
code for sm_30
Function : _Z10parametersiscdPiPsPcPd
/*0008*/ /*0x10005de428004001*/ MOV R1, c [0x0] [0x44]; // stack pointer
/*0010*/ /*0x40001de428004005*/ MOV R0, c [0x0] [0x150]; // pp1
/*0018*/ /*0x50009de428004005*/ MOV R2, c [0x0] [0x154]; // pp2
/*0020*/ /*0x0001dde428004005*/ MOV R7, c [0x0] [0x140]; // p1
/*0028*/ /*0x13f0dc4614000005*/ LDC.U16 R3, c [0x0] [0x144]; // p2
/*0030*/ /*0x60011de428004005*/ MOV R4, c [0x0] [0x158]; // pp3
/*0038*/ /*0x70019de428004005*/ MOV R6, c [0x0] [0x15c]; // pp4
/*0048*/ /*0x20021de428004005*/ MOV R8, c [0x0] [0x148]; // p4
/*0050*/ /*0x30025de428004005*/ MOV R9, c [0x0] [0x14c]; // p4
/*0058*/ /*0x1bf15c0614000005*/ LDC.U8 R5, c [0x0] [0x146]; // p3
/*0060*/ /*0x0001dc8590000000*/ ST [R0], R7; // *pp1 = p1
/*0068*/ /*0x0020dc4590000000*/ ST.U16 [R2], R3; // *pp2 = p2
/*0070*/ /*0x00415c0590000000*/ ST.U8 [R4], R5; // *pp3 = p3
/*0078*/ /*0x00621ca590000000*/ ST.64 [R6], R8; // *pp4 = p4
/*0088*/ /*0x00001de780000000*/ EXIT;
/*0090*/ /*0xe0001de74003ffff*/ BRA 0x90;
/*0098*/ /*0x00001de440000000*/ NOP CC.T;
/*00a0*/ /*0x00001de440000000*/ NOP CC.T;
/*00a8*/ /*0x00001de440000000*/ NOP CC.T;
/*00b0*/ /*0x00001de440000000*/ NOP CC.T;
/*00b8*/ /*0x00001de440000000*/ NOP CC.T;
...........................................
Function : _Z9constantsPiPsPcPd
/*0008*/ /*0x10005de428004001*/ MOV R1, c [0x0] [0x44]; // stack pointer
/*0010*/ /*0x00001de428004005*/ MOV R0, c [0x0] [0x140]; // pp1
/*0018*/ /*0x10009de428004005*/ MOV R2, c [0x0] [0x144]; // pp2
/*0020*/ /*0x0001dde428004c00*/ MOV R7, c [0x3] [0x0]; // kd_p1
/*0028*/ /*0x13f0dc4614000c00*/ LDC.U16 R3, c [0x3] [0x4]; // kd_p2
/*0030*/ /*0x20011de428004005*/ MOV R4, c [0x0] [0x148]; // pp3
/*0038*/ /*0x30019de428004005*/ MOV R6, c [0x0] [0x14c]; // pp4
/*0048*/ /*0x20021de428004c00*/ MOV R8, c [0x3] [0x8]; // kd_p4
/*0050*/ /*0x30025de428004c00*/ MOV R9, c [0x3] [0xc]; // kd_p4
/*0058*/ /*0x1bf15c0614000c00*/ LDC.U8 R5, c [0x3] [0x6]; // kd_p3
/*0060*/ /*0x0001dc8590000000*/ ST [R0], R7;
/*0068*/ /*0x0020dc4590000000*/ ST.U16 [R2], R3;
/*0070*/ /*0x00415c0590000000*/ ST.U8 [R4], R5;
/*0078*/ /*0x00621ca590000000*/ ST.64 [R6], R8;
/*0088*/ /*0x00001de780000000*/ EXIT;
/*0090*/ /*0xe0001de74003ffff*/ BRA 0x90;
/*0098*/ /*0x00001de440000000*/ NOP CC.T;
/*00a0*/ /*0x00001de440000000*/ NOP CC.T;
/*00a8*/ /*0x00001de440000000*/ NOP CC.T;
/*00b0*/ /*0x00001de440000000*/ NOP CC.T;
/*00b8*/ /*0x00001de440000000*/ NOP CC.T;
.....................................
I annotated the SASS on the right. On sm_30 you can see that kernel parameters are passed in constant bank 0 starting at offset 0x140, while user-defined __constant__ variables are placed in constant bank 3.
If you execute cuobjdump --dump-elf <executable or obj> you can find other interesting constant information.
32bit elf: abi=6, sm=30, flags = 0x1e011e
Sections:
Index Offset Size ES Align Type Flags Link Info Name
1 34 142 0 1 STRTAB 0 0 0 .shstrtab
2 176 19b 0 1 STRTAB 0 0 0 .strtab
3 314 d0 10 4 SYMTAB 0 2 a .symtab
4 3e4 50 0 4 CUDA_INFO 0 3 b .nv.info._Z9constantsPiPsPcPd
5 434 30 0 4 CUDA_INFO 0 3 0 .nv.info
6 464 90 0 4 CUDA_INFO 0 3 a .nv.info._Z10parametersiscdPiPsPcPd
7 4f4 160 0 4 PROGBITS 2 0 a .nv.constant0._Z10parametersiscdPiPsPcPd
8 654 150 0 4 PROGBITS 2 0 b .nv.constant0._Z9constantsPiPsPcPd
9 7a8 30 0 8 PROGBITS 2 0 0 .nv.constant3
a 7d8 c0 0 4 PROGBITS 6 3 a00000b .text._Z10parametersiscdPiPsPcPd
b 898 c0 0 4 PROGBITS 6 3 a00000c .text._Z9constantsPiPsPcPd
.section .strtab
.section .shstrtab
.section .symtab
index value size info other shndx name
0 0 0 0 0 0 (null)
1 0 0 3 0 a .text._Z10parametersiscdPiPsPcPd
2 0 0 3 0 7 .nv.constant0._Z10parametersiscdPiPsPcPd
3 0 0 3 0 b .text._Z9constantsPiPsPcPd
4 0 0 3 0 8 .nv.constant0._Z9constantsPiPsPcPd
5 0 0 3 0 9 .nv.constant3
6 0 4 1 0 9 kd_p1
7 4 2 1 0 9 kd_p2
8 6 1 1 0 9 kd_p3
9 8 8 1 0 9 kd_p4
10 16 32 1 0 9 kd_floats
11 0 192 12 10 a _Z10parametersiscdPiPsPcPd
12 0 192 12 10 b _Z9constantsPiPsPcPd
The kernel parameter constant bank is versioned per launch so that concurrent kernels can be executed. The compiler-generated and user constants are per CUmodule. It is the responsibility of the developer to manage the coherency of this data; for example, the developer has to ensure that a cudaMemcpyToSymbol is performed in a safe manner.
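To make the bank-3 usage concrete, here is a hedged host-side sketch of driving the constants kernel (my code; it assumes the __constant__ declarations and the constants kernel from the snippet above are in the same .cu file):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int p1 = 1; short p2 = 2; char p3 = 3; double p4 = 4.0;

    // These copies fill the per-module __constant__ bank (bank 3 above).
    cudaMemcpyToSymbol(kd_p1, &p1, sizeof(p1));
    cudaMemcpyToSymbol(kd_p2, &p2, sizeof(p2));
    cudaMemcpyToSymbol(kd_p3, &p3, sizeof(p3));
    cudaMemcpyToSymbol(kd_p4, &p4, sizeof(p4));

    int* dp1; short* dp2; char* dp3; double* dp4;
    cudaMalloc(&dp1, sizeof(int));
    cudaMalloc(&dp2, sizeof(short));
    cudaMalloc(&dp3, sizeof(char));
    cudaMalloc(&dp4, sizeof(double));

    constants<<<1, 1>>>(dp1, dp2, dp3, dp4);
    cudaDeviceSynchronize();

    int out = 0;
    cudaMemcpy(&out, dp1, sizeof(out), cudaMemcpyDeviceToHost);
    printf("kd_p1 read back: %d\n", out); // expect 1
    return 0;
}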

How do you get the ICC compiler to generate SSE instructions within an inner loop?

I have an inner loop such as this
for(i=0; i<n; i++){
    x[0] += A[i] * z[0];
    x[1] += A[i] * z[1];
    x[2] += A[i] * z[2];
    x[3] += A[i] * z[3];
}
The four statements in the loop body can easily be converted to SSE instructions by a compiler. Do current compilers do this? If they do, what do I have to do to force it?
From what you've provided, this can't be vectorized, because the pointers could alias each other, i.e. the x array could overlap with A or z.
A simple way to help the compiler out would be to declare x as __restrict. Another way would be to rewrite it like so:
for(i=0; i<n; i++)
{
    float Ai=A[i];
    float z0=z[0], z1=z[1], z2=z[2], z3=z[3];
    x[0] += Ai * z0;
    x[1] += Ai * z1;
    x[2] += Ai * z2;
    x[3] += Ai * z3;
}
I've never actually tried to get a compiler to auto-vectorize code, so I don't know if that will do it or not. Even if it doesn't get vectorized, it should be faster since the loads and stores can be ordered more efficiently and without causing a load-hit-store.
If you have more information than the compiler does (e.g. whether or not your pointers are 16-byte aligned), you should be able to use that to your advantage (e.g. by using aligned loads). Note that I'm not saying you should always try to beat the compiler, only when you know more than it does.
Further reading:
Load-hit-stores and the __restrict keyword
Memory Optimization (aliasing starts around slide 35)
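If auto-vectorization still doesn't kick in, hand-written intrinsics are an option. A minimal sketch of the same loop in SSE intrinsics (my code, assuming float data, x 16-byte aligned, and no overlap between the arrays):

#include <xmmintrin.h> /* SSE */

void axpy4(float *x, const float *A, const float *z, int n)
{
    __m128 acc = _mm_load_ps(x);   /* x[0..3]; requires 16-byte alignment */
    __m128 zv  = _mm_loadu_ps(z);  /* z[0..3]; unaligned load */
    for (int i = 0; i < n; i++) {
        /* broadcast A[i] to all four lanes and accumulate A[i] * z[0..3] */
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(A[i]), zv));
    }
    _mm_store_ps(x, acc);
}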
ICC auto-vectorizes the code snippet below for SSE2 by default:
void foo(float *__restrict__ x, float *__restrict__ A, float *__restrict__ z, int n){
    for(int i=0; i<n; i++){
        x[0] += A[i] * z[0];
        x[1] += A[i] * z[1];
        x[2] += A[i] * z[2];
        x[3] += A[i] * z[3];
    }
    return;
}
With the restrict keyword, the compiler no longer has to assume that the pointers may alias. The vectorization report generated is:
$ icpc test.cc -c -vec-report2 -S
test.cc(2): (col. 1) remark: PERMUTED LOOP WAS VECTORIZED
test.cc(3): (col. 2) remark: loop was not vectorized: not inner loop
To confirm that SSE instructions are generated, open the generated assembly (test.s) and you will find the following instructions:
..B1.13: # Preds ..B1.13 ..B1.12
movaps (%rsi,%r15,4), %xmm10 #3.10
movaps 16(%rsi,%r15,4), %xmm11 #3.10
mulps %xmm0, %xmm10 #3.17
mulps %xmm0, %xmm11 #3.17
addps %xmm10, %xmm9 #3.2
addps %xmm11, %xmm6 #3.2
movaps 32(%rsi,%r15,4), %xmm12 #3.10
movaps 48(%rsi,%r15,4), %xmm13 #3.10
movaps 64(%rsi,%r15,4), %xmm14 #3.10
movaps 80(%rsi,%r15,4), %xmm15 #3.10
movaps 96(%rsi,%r15,4), %xmm10 #3.10
movaps 112(%rsi,%r15,4), %xmm11 #3.10
addq $32, %r15 #2.1
mulps %xmm0, %xmm12 #3.17
cmpq %r13, %r15 #2.1
mulps %xmm0, %xmm13 #3.17
mulps %xmm0, %xmm14 #3.17
addps %xmm12, %xmm5 #3.2
mulps %xmm0, %xmm15 #3.17
addps %xmm13, %xmm4 #3.2
mulps %xmm0, %xmm10 #3.17
addps %xmm14, %xmm7 #3.2
mulps %xmm0, %xmm11 #3.17
addps %xmm15, %xmm3 #3.2
addps %xmm10, %xmm2 #3.2
addps %xmm11, %xmm1 #3.2
jb ..B1.13 # Prob 75% #2.1
# LOE rax rdx rsi r8 r9 r10 r13 r15 ecx ebp edi r11d r14d bl xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7 xmm8 xmm9
..B1.14: # Preds ..B1.13
addps %xmm6, %xmm9 #3.2
addps %xmm4, %xmm5 #3.2
addps %xmm3, %xmm7 #3.2
addps %xmm1, %xmm2 #3.2
addps %xmm5, %xmm9 #3.2
addps %xmm2, %xmm7 #3.2
lea 1(%r14), %r12d #2.1
cmpl %r12d, %ecx #2.1
addps %xmm7, %xmm9 #3.2
jb ..B1.25 # Prob 50% #2.1
