How to calculate the number of clock cycles in RISCV-clang - clang

I am using riscv64-unknown-elf-clang ("clang version 5.0.0") to compile my code, and then I run it with "spike" and "pk". I need to count the number of clock cycles the program takes. I tried both "__builtin_readcyclecounter()" and the ordinary "clock()", but neither seems to work.
The code below works with riscv64-unknown-elf-gcc but not with riscv64-unknown-elf-clang:
#define read_csr(reg) ({ unsigned long __tmp;asm volatile ("csrr %0, " #reg : "=r"(__tmp));__tmp; })
#define CSR_CYCLE 0xc00
#define CSR_TIME 0xc01
#define CSR_INSTRET 0xc02
#define CSR_MCYCLE 0xb00
Then from the main program I called
long cycles;
cycles=read_csr(cycle);

Clang 5.0 is just too old for the csrr pseudo-instruction, i.e. the pseudo-instruction support in Clang 5.0 is incomplete. Support for csrr was added in 2018 while Clang 5 was released in 2017.
Either you upgrade to a newer Clang version or you work around this issue by expanding csrr (control-and-status-register read) to csrrs (control-and-status-register read-and-set) directly in your code, i.e.
csrr dst, csr => csrrs dst, csr, x0
Note that there are even more specialized pseudo-instructions for reading performance-related counters, such as rdcycle dst and rdtime dst. Of course, they also expand to csrrs, but they might be more convenient in some use cases.
With a complete toolchain, you can also use the symbolic constants cycle, time, instret etc. (instead of 0xc00, 0xc01, 0xc02 etc.) directly in your assembler code.
Example that lists the equivalent ways to read out the cycle count:
extern __inline
unsigned long
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
rdcycle(void)
{
unsigned long dst;
// output into any register, likely a0
// regular instruction:
asm volatile ("csrrs %0, 0xc00, x0" : "=r" (dst) );
// regular instruction with symbolic csr and register names
// asm volatile ("csrrs %0, cycle, zero" : "=r" (dst) );
// pseudo-instruction:
// asm volatile ("csrr %0, cycle" : "=r" (dst) );
// pseudo-instruction:
//asm volatile ("rdcycle %0" : "=r" (dst) );
return dst;
}
I don't think it's worth using the C preprocessor (CPP) to read the different CSRs.
Note that even with asm volatile the compiler is free to reorder the different inline assembler statements relative to each other and to other instructions. For example, in
unsigned long a = rdcycle();
int r = i * i;
unsigned long b = rdcycle();
the second CSR access might be reordered before the multiplication or even before the first one.
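One way to narrow that reordering window is to tie the measured computation to the counter reads through data dependencies. The following is only a sketch: opaque() and time_square() are hypothetical helper names, the clock() fallback exists purely so the code also builds on non-RISC-V hosts, and it assumes the compiler keeps volatile asm statements in program order relative to one another, which is common behaviour rather than a guarantee.

```c
#include <time.h>

/* Read a cycle-like counter. On RISC-V this is the csrrs form shown above;
   elsewhere it falls back to clock() (an assumption made for portability). */
static inline unsigned long read_cycles(void)
{
#if defined(__riscv)
    unsigned long dst;
    __asm__ volatile ("csrrs %0, 0xc00, x0" : "=r"(dst));
    return dst;
#else
    return (unsigned long)clock();
#endif
}

/* Empty asm that claims to read and modify v. The compiler must have the
   value of v in a register here and cannot assume what it is afterwards. */
static inline int opaque(int v)
{
    __asm__ volatile ("" : "+r"(v));
    return v;
}

/* Hypothetical measurement of i * i; returns the counter difference. */
unsigned long time_square(int i, int *result)
{
    unsigned long a = read_cycles();
    int x = opaque(i);        /* input becomes opaque after the first read */
    int r = opaque(x * x);    /* product must exist before the second read */
    unsigned long b = read_cycles();
    *result = r;
    return b - a;
}
```

The empty asm statements emit no instructions. They only keep the compiler from computing the product ahead of time or deferring it past the second read.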

Related

clang __asm__ with labels in case statement gives error: invalid operand for instruction

I am trying to add labels to C source code (for instrumentation). I have little experience with assembly, and the compiler is clang. I get strange behavior with __asm__ and labels in case statements.
Here is what I have tried:
// Compiles successfully.
int main()
{
volatile unsigned long long a = 3;
switch(8UL)
{
case 1UL:
//lbl:;
__asm__ ("movb %%gs:%1,%0": "=q" (a): "m" (a));
a++;
}
return 0;
}
and this :
// Compiles successfully.
int main()
{
volatile unsigned long long a = 3;
switch(8UL)
{
case 1UL:
lbl:;
//__asm__ ("movb %%gs:%1,%0": "=q" (a): "m" (a));
a++;
}
return 0;
}
command:
clang -c examples/a.c
examples/a.c:5:14: warning: no case matching constant switch condition '8'
switch(8UL)
^~~
1 warning generated.
But this:
// Does not compile.
int main()
{
volatile unsigned long long a = 3;
switch(8UL)
{
case 1UL:
lbl:;
__asm__ ("movb %%gs:%1,%0": "=q" (a): "m" (a));
a++;
}
return 0;
}
the error:
^~~
examples/a.c:9:22: error: invalid operand for instruction
__asm__ ("movb %%gs:%1,%0": "=q" (a): "m" (a));
^
<inline asm>:1:21: note: instantiated into assembly here
movb %gs:-16(%rbp),%rax
^~~~
1 warning and 1 error generated.
I am using :
clang --version
clang version 9.0.0-2 (tags/RELEASE_900/final)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Important: this compiles successfully with gcc.
gcc --version
gcc (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I am working on 64-bit Ubuntu 19.
Any help, please.
EDIT
Based on the accepted answer below:
The __asm__ statement itself causes the error (mismatched operand sizes).
The __asm__ statement is unreachable.
Adding a label to the statement makes it reachable.
GCC ignores it.
Clang does not ignore it.
movb has an 8-bit operand size, but %rax is 64-bit because you used unsigned long long. Just use mov to do a load of the same width as the output variable, or use movzbl %%gs:%1, %k0 to zero-extend to 64-bit. (Explicitly to 32-bit with movzbl, and implicitly to 64-bit by writing the 32-bit low half of the 64-bit register (the k modifier in %k0).)
I'm surprised GCC doesn't reject that as well; maybe GCC removes it as dead code because of the unreachable case in switch(8). If you look at GCC's asm output, it probably doesn't contain that instruction.
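The two width fixes can be sketched in C as follows. The helper names load_byte_zext and load_qword are made up for illustration, the %gs segment override from the question is left out so the snippet can run in an ordinary process, and the plain C fallback for non-x86-64 targets is an addition.

```c
#include <stdint.h>

/* 8-bit load, zero-extended: movzbl writes the 32-bit half of the output
   register (%k0), which implicitly clears the upper 32 bits. */
static inline uint64_t load_byte_zext(const uint8_t *p)
{
#if defined(__x86_64__)
    uint64_t out;
    __asm__ ("movzbl %1, %k0" : "=r"(out) : "m"(*p));
    return out;
#else
    return *p;  /* fallback so the sketch builds on other targets */
#endif
}

/* Load of the same width as the 64-bit output variable. */
static inline uint64_t load_qword(const uint64_t *p)
{
#if defined(__x86_64__)
    uint64_t out;
    __asm__ ("movq %1, %0" : "=r"(out) : "m"(*p));
    return out;
#else
    return *p;  /* fallback so the sketch builds on other targets */
#endif
}
```

In both helpers the instruction suffix agrees with the register name the compiler substitutes, which is the mismatch Clang rejected in the original statement.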

Alternatives to type casting when formatting NS(U)Integer on 32 and 64 bit architectures?

With the 64-bit version of iOS we can't use %d and %u anymore to format NSInteger and NSUInteger, because on 64-bit those are typedef'd to long and unsigned long instead of int and unsigned int.
So Xcode will throw warnings if you try to format NSInteger with %d. Xcode is nice to us and offers a replacement for those two cases, which consists of an l-prefixed format specifier and a typecast to long. Then our code basically looks like this:
NSLog(@"%ld", (long)i);
NSLog(@"%lu", (unsigned long)u);
Which, if you ask me, is a pain in the eye.
A couple of days ago someone on Twitter mentioned the format specifiers %zd to format signed variables and %tu to format unsigned variables on 32- and 64-bit platforms.
NSLog(@"%zd", i);
NSLog(@"%tu", u);
Which seems to work. And which I like more than typecasting.
But I honestly have no idea why those work. Right now both are basically magic values for me.
I did a bit of research and figured out that the z prefix means that the following format specifier has the same size as size_t. But I have absolutely no idea what the prefix t means. So I have two questions:
What exactly do %zd and %tu mean?
And is it safe to use %zd and %tu instead of Apple's suggestion to typecast to long?
I am aware of similar questions and Apple's 64-Bit Transition guides, which all recommend the %lu (unsigned long) approach. I am asking for an alternative to type casting.
From http://pubs.opengroup.org/onlinepubs/009695399/functions/printf.html:
z
Specifies that a following [...] conversion specifier applies to a size_t or the corresponding signed integer type argument;
t
Specifies that a following [...] conversion specifier applies to a ptrdiff_t or the corresponding unsigned type argument;
And from http://en.wikipedia.org/wiki/Size_t#Size_and_pointer_difference_types:
size_t is used to represent the size of any object (including arrays) in the particular implementation. It is used as the return type of the sizeof operator.
ptrdiff_t is used to represent the difference between pointers.
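In plain C the two length modifiers can be tried without any Foundation types. This is a minimal sketch; the function name format_size_and_diff is hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats a size_t with %zu and a ptrdiff_t with %td. No casts are needed
   because each length modifier matches its argument type exactly. */
int format_size_and_diff(char *buf, size_t buf_len, size_t size, ptrdiff_t diff)
{
    return snprintf(buf, buf_len, "%zu %td", size, diff);
}
```

Calling it with a size of 42 and a difference of -7 writes "42 -7" into the buffer, whatever the widths of size_t and ptrdiff_t on the target.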
On the current OS X and iOS platforms we have
typedef __SIZE_TYPE__ size_t;
typedef __PTRDIFF_TYPE__ ptrdiff_t;
where __SIZE_TYPE__ and __PTRDIFF_TYPE__ are predefined by the
compiler. For 32-bit the compiler defines
#define __SIZE_TYPE__ long unsigned int
#define __PTRDIFF_TYPE__ int
and for 64-bit the compiler defines
#define __SIZE_TYPE__ long unsigned int
#define __PTRDIFF_TYPE__ long int
(This may have changed between Xcode versions. Motivated by @user102008's
comment, I have checked this with Xcode 6.2 and updated the answer.)
So ptrdiff_t and NSInteger are both typedef'd to the same type:
int on 32-bit and long on 64-bit. Therefore
NSLog(@"%td", i);
NSLog(@"%tu", u);
work correctly and compile without warnings on all current
iOS and OS X platforms.
size_t and NSUInteger have the same size on all platforms, but
they are not the same type, so
NSLog(@"%zu", u);
actually gives a warning when compiling for 32-bit.
But this relation is not fixed in any standard (as far as I know), therefore I would
not consider it safe (in the same sense as assuming that long has the same size
as a pointer is not considered safe). It might break in the future.
The only alternative to type casting that I know of is from the answer to "Foundation types when compiling for arm64 and 32-bit architecture", using preprocessor macros:
// In your prefix header or something
#if __LP64__
#define NSI "ld"
#define NSU "lu"
#else
#define NSI "d"
#define NSU "u"
#endif
NSLog(@"i=%"NSI, i);
NSLog(@"u=%"NSU, u);
I prefer to just use an NSNumber instead:
NSInteger myInteger = 3;
NSLog(@"%@", @(myInteger));
This does not work in all situations, but I've replaced most of my NS(U)Integer formatting with the above.
According to Building 32-bit Like 64-bit, another solution is to define the NS_BUILD_32_LIKE_64 macro, and then you can simply use the %ld and %lu specifiers with NSInteger and NSUInteger without casting and without warnings.

Storing data in RAM without avr-libc. How can I configure the proper memory sections using a custom linker script and initialization code?

Here's my linker script:
MEMORY {
text (rx) : ORIGIN = 0x000000, LENGTH = 64K
data (rw!x) : ORIGIN = 0x800100, LENGTH = 0xFFA0
}
SECTIONS {
.vectors : AT (0x0000) { entry.o (.vectors); }
.text : AT (ADDR (.vectors) + SIZEOF(.vectors)) { * (.text.startup); * (.text); * (.progmem.data); _etext = .; }
.data : AT (ADDR (.text) + SIZEOF (.text)) { PROVIDE (__data_start = .); * (.data); * (.rodata); * (.rodata.str1.1); PROVIDE (__data_end = .); } > data
.bss : AT (ADDR (.bss)) { PROVIDE (__bss_start = .); * (.bss); PROVIDE (__bss_end = .); } > data
__data_load_start = LOADADDR(.data);
__data_load_end = __data_load_start + SIZEOF(.data);
}
And this is my initialization code. init is called at reset.
.section .text,"ax",@progbits
/* Handle low level hardware initialization. */
.global init
init: eor r1, r1
out 0x3f, r1
ldi r28, 0xFF
ldi r29, 0x02
out 0x3e, r29
out 0x3d, r28
rjmp __do_copy_data
rjmp __do_clear_bss
jmp main
/* Handle copying data into RAM. */
.global __do_copy_data
__do_copy_data: ldi r17, hi8(__data_end)
ldi r26, lo8(__data_start)
ldi r27, hi8(__data_start)
ldi r30, lo8(__data_load_start)
ldi r31, hi8(__data_load_start)
rjmp .L__do_copy_data_start
.L__do_copy_data_loop: lpm r0, Z+
st X+, r0
.L__do_copy_data_start: cpi r26, lo8(__data_end)
cpc r27, r17
brne .L__do_copy_data_loop
rjmp main
/* Handle clearing the BSS. */
.global __do_clear_bss
__do_clear_bss: ldi r17, hi8(__bss_end)
ldi r26, lo8(__bss_start)
ldi r27, hi8(__bss_start)
rjmp .L__do_clear_bss_start
.L__do_clear_bss_loop: st X+, r1
.L__do_clear_bss_start: cpi r26, lo8(__bss_end)
cpc r27, r17
brne .L__do_clear_bss_loop
The problem is that the initialization code hangs sometime during the copying process. Here's an edited dump of my symbol table, if it's helpful to anyone.
00000000 a __tmp_reg__
...
00000000 t reset
...
00000001 a __zero_reg__
...
0000003d a __SP_L__
...
00000074 T main
0000009a T init
000000ae T __do_copy_data
000000c6 T __do_clear_bss
...
00000446 A __data_load_start
00000446 T _etext
0000045b A __data_load_end
00800100 D __data_start
00800100 D myint
00800115 B __bss_start
00800115 D __data_end
00800115 b foobar.1671
00800135 B ticks
00800139 B __bss_end
C is designed to work on von Neumann architectures, but AVR is Harvard based. This means that C expects strings to be in RAM. As a consequence, if you look at the disassembly of any ELF binary that would eventually be copied as a hex file to your AVR chip, you will see two routines: __do_copy_data and __do_clear_bss [when required]. These routines, which are added at the linking stage, take care of the basic needs of the C language. What you are seeing here with your pointers is therefore likely that they point to the wrong addresses: either they point to an address in program space while you are reading from data space [a different address bus], or you are purposely reading from data space but have not copied the strings over.
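What those two routines do can be modelled in C on a host machine. This is only a sketch: the function names and the use of plain arrays to stand in for flash and SRAM are assumptions, and on a real AVR the flash read needs lpm rather than an ordinary pointer dereference.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy the initial values of .data from its load address (flash on the AVR)
   to its run address (SRAM). Mirrors the lpm / st X+ loop of __do_copy_data. */
static void copy_data(uint8_t *data_start, const uint8_t *data_end,
                      const uint8_t *load_start)
{
    uint8_t *dst = data_start;
    const uint8_t *src = load_start;
    while (dst != data_end)
        *dst++ = *src++;
}

/* Zero .bss. Mirrors the st X+, r1 loop of __do_clear_bss. */
static void clear_bss(uint8_t *bss_start, const uint8_t *bss_end)
{
    for (uint8_t *p = bss_start; p != bss_end; ++p)
        *p = 0;
}

/* Hypothetical startup order: .bss is assumed to follow .data directly,
   and both steps finish before main would be entered. */
void startup_init(uint8_t *ram, size_t data_size, size_t bss_size,
                  const uint8_t *flash_image)
{
    copy_data(ram, ram + data_size, flash_image);
    clear_bss(ram + data_size, ram + data_size + bss_size);
}
```

Both steps return to the caller here. In assembly the equivalent is a call/ret pair or falling through from one routine into the next, because an rjmp to a routine never comes back.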
See:
avr/pgmspace.h
FAQ: ROM Array and scroll down for strings
and of course, the AVR instruction set as provided by Atmel, specifically the instruction that copies program memory over to data memory
Edited to reflect new question and comment: Your assembly for both sections looks OK to me. I will have to look at your linker script with closer scrutiny to check whether there is any funny business going on there. Since you are writing a bootloader, do you mind if I ask whether you have looked at the bootloader support in AVR-libc?
http://deans-avr-tutorials.googlecode.com/svn/trunk/Progmem/Output/Progmem.pdf
This document may help to clarify the use of flash/ram memories of AVR.
I actually got it to work. All I had to do was enable reading from and writing to external RAM in the SREG. It was mindblowingly obvious and simple, I know, but it was buried in the data sheet and it was poorly documented.
This resource was helpful, but didn't have any documentation for my chip. If you're having this problem, look in the data sheet for your AVR and see how it is related. The process is not the same for all variations of AVRs.
How to use external RAM.

How do I transfer an integer to __constant__ device memory?

I have a weird problem, so I thought I would ask and see if someone more experienced than me could see a solution.
I am writing a program with CUDA C/C++, and I have some constant integers that specify various things, like the coordinates of the bounds of the calculation, etc. Currently I just have those things in global device memory. They are accessed by every thread in every kernel call, so I figured that if they are in global memory, they are never cached or broadcast (right?). So these little integers carry a lot of (relative) overhead and have a lot of 'read redundancy.'
So I declare in a header:
__constant__ int* number;
I include that header, and, when I do memory stuff, I do:
cutilSafeCall( cudaMemcpyToSymbol(number, &(some_host_int), sizeof(int)) );
I then pass number into all my kernels:
__global__ void magical_kernel(int* number, ...){
//and I access 'number' like this
int data_thingy = big_array[ *number ];
}
My code crashes. With number in global memory, it is just fine. I have determined that it crashes sometime upon accessing number within the kernel. This means that either I am accessing or allocating it wrong. If it holds the wrong value, it will also cause a crash, because it is used to index into arrays.
To conclude, I will ask a few questions. First, what am I doing wrong? As a bonus: is there a better way than constant memory to accomplish this task - I don't know the value of number at compile time, so a simple #define won't work. Will constant memory even speed the code up at all, or has it been cached and broadcasted all along? Could I somehow put the data in shared memory for each threadblock and have it remain in shared memory through multiple kernel calls?
There are several problems here:
You have declared number as a pointer, but never assigned it a value which is a valid address in GPU memory.
You have a variable scope conflict: the argument int * number of magical_kernel is not the same variable as the __constant__ int * number declared at compilation unit scope.
The first argument of the cudaMemcpyToSymbol call is almost certainly incorrect.
If you don't understand why either of the first two points is true, you have some revision to do on pointers and scope in C++.
Based on your response to a now deleted answer, I suspect what you are actually trying to do is this:
__constant__ int number;
__global__ void magical_kernel(...){
int data_thingy = big_array[ number ];
}
// pass the symbol itself; string symbol names are no longer supported as of CUDA 5.0
cudaMemcpyToSymbol(number, &(some_host_int), sizeof(int));
i.e. number is intended to be an integer in constant memory, not a pointer, and not a kernel argument.
EDIT: here is an example which shows this in action:
#include <cstdio>
#include <cstdlib>
__constant__ int number;
__global__ void magical_kernel(int * out)
{
    // every thread reads the same value from constant memory
    out[threadIdx.x] = number;
}
int main()
{
    const int value = 314159;
    const size_t sz = size_t(32) * sizeof(int);
    // pass the symbol itself; string symbol names are no longer supported as of CUDA 5.0
    cudaMemcpyToSymbol(number, &value, sizeof(int));
    int * _out, * out;
    out = (int *)malloc(sz);
    cudaMalloc((void **)&_out, sz);
    magical_kernel<<<1,32>>>(_out);
    cudaMemcpy(out, _out, sz, cudaMemcpyDeviceToHost);
    for(int i=0; i<32; i++)
        fprintf(stdout, "%d %d\n", i, out[i]);
    cudaFree(_out);
    free(out);
    return 0;
}
You should be able to run this yourself and confirm it works as advertised.

arm asm/neon optimisation for image processing

I'm currently working on a painting app on iOS.
I draw directly into an NSMutableData buffer and apply blending with my brush like this:
- (void) combineColorDestination:(unsigned char*) dest source:(unsigned char*) src
{
const unsigned char sra = ((unsigned char *)src)[3];
const float oneminusalpha = 1.0f - (sra / 255.f);
int d[4];
for (int i=0;i<4;i++)
{
d[i] = oneminusalpha * ((unsigned char *)dest)[i] + ((unsigned char *)src)[i];
if (d[i]>255)
d[i] = 255;
((unsigned char *)dest)[i] = (unsigned char)d[i];
}
}
Any suggestions for optimisations?
I previously tried to use NEON, but I had a bug I wasn't able to fix (the bordering pixels were wrong).
I was iterating over pixels two at a time like this:
uint8x8_t va = vld1_u8(dest);
uint8x8_t vb = vld1_u8(src);
uint8x8_t res = vqadd_u8(va,vb);
vst1_u8(dest, res);
Suggestions? Alright. Note that these are valid whichever multimedia manipulation you are doing and are hardly restricted to your case.
First, before you even do NEON, you should change your code to have one function that changes a bunch of pixels (at least a row, a rectangle if you can) at once, instead of a function (or method - even worse) that changes one pixel and is called a bunch of times: somehow I doubt the brush is only 1x1 pixel.
Second, except for the column loop (and eventual row loop), there should be no branch (that is, flow control structures). No for (i=0;i<4;i++); just write the code for the four channels in sequence (use a macro if necessary). No if (d[i]>255); express that as an alternative: dest[i] = (temp>255?255:temp); at the very least, if not replacing it by a more efficient way to do saturation (tricks using subtractions, shifts, and masks exist).
Third, avoid any conversion between floating-point and integer; this is always valid advice, but float->int conversions are particularly devastating on ARM. Since you're manipulating integers, this means foregoing floating-point here.
And once you've done that, surprise, besides making your code faster you have in fact done the preparation work for NEON: NEON is only remotely useful if you process a bunch of pixels at once, if there is no branch, and if you don't convert between floating-point and integer all over the place. So only then will we talk about NEON, if it is even necessary at this point.
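The three suggestions can be combined into a C sketch like the one below. The names blend_channel and blend_row are hypothetical, the buffers are assumed to hold 4 bytes per pixel with alpha last as in the question, and the integer division only approximates the original float arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

/* One channel: dest * (255 - src_alpha) / 255 + src, saturated to 255.
   Integer only, and the clamp is a conditional expression, not an if. */
static inline uint8_t blend_channel(uint8_t d, uint8_t s, unsigned inv_alpha)
{
    unsigned t = (d * inv_alpha) / 255u + s;
    return (uint8_t)(t > 255u ? 255u : t);
}

/* Blend a whole row of pixels in one call instead of one call per pixel. */
void blend_row(uint8_t *dest, const uint8_t *src, size_t pixel_count)
{
    for (size_t p = 0; p < pixel_count; ++p, dest += 4, src += 4) {
        const unsigned inv_alpha = 255u - src[3];
        /* the four channels are written out in sequence, no inner loop */
        dest[0] = blend_channel(dest[0], src[0], inv_alpha);
        dest[1] = blend_channel(dest[1], src[1], inv_alpha);
        dest[2] = blend_channel(dest[2], src[2], inv_alpha);
        dest[3] = blend_channel(dest[3], src[3], inv_alpha);
    }
}
```

The division by 255 is kept for clarity. It can be replaced by a multiply and shift once the results have been checked against the float version.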

Resources