SET A, 0x1E vs SET A, 0x1F

This is my first attempt at DCPU; I'm checking the machine code generated from DCPU-16 assembly.
I am using this emulator: http://dcpu.ru/
I am trying to compare code generated by
SET A, 0x1E
SET A, 0x1F
The code generated is as follows:
fc01
7c01 001f
I don't get why the operand size changes between these two values.

That emulator appears to be using a newer version of the DCPU-16 spec, which specifies that the same-word literal value for a permits values from 0xFFFF (-1) to 0x1E (30). This means that to get any literal value outside this range, the assembler has to use the next-word literal form, which makes the instruction one word longer.

0x1F (decimal 31) is no longer a short literal (short literals cover -1 to 30), so it has to be encoded as a "next word" argument.
The opcodes are thus (bit layout aaaaaabbbbbooooo, where a is the source and b the destination):
SET A, 0x1E
SET = 00001
A = 00000
literal 0x1E = 111111 (short literals are encoded into the a field as 0x21 + value)
op = 111111 00000 00001 = fc01
SET A, 0x1F
SET = 00001
A = 00000
NW (next-word literal) = 011111
op = 011111 00000 00001 = 7c01, followed by the word 001f
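To make the rule concrete, here is a small C sketch (my own illustration, not part of any DCPU-16 toolchain) that encodes SET <register>, <literal> per the 1.7 layout and falls back to a next-word literal when the value is outside -1..30:

#include <stdint.h>
#include <stdio.h>

/* Encode "SET <reg>, <literal>" using the DCPU-16 1.7 layout aaaaaabbbbbooooo.
 * Returns the number of words written to out (1 or 2). */
static int encode_set_literal(uint16_t reg, int32_t value, uint16_t out[2]) {
    const uint16_t SET = 0x01;
    if (value >= -1 && value <= 0x1e) {
        /* Short form: the literal is packed into the a field as 0x21 + value. */
        uint16_t a = (uint16_t)(0x21 + value);
        out[0] = (uint16_t)((a << 10) | (reg << 5) | SET);
        return 1;
    }
    /* Long form: a = 0x1f means "the literal is in the next word". */
    out[0] = (uint16_t)((0x1f << 10) | (reg << 5) | SET);
    out[1] = (uint16_t)value;
    return 2;
}

int main(void) {
    uint16_t w[2];
    if (encode_set_literal(0 /* register A */, 0x1e, w) == 1)
        printf("SET A, 0x1E -> %04x\n", w[0]);            /* fc01 */
    if (encode_set_literal(0 /* register A */, 0x1f, w) == 2)
        printf("SET A, 0x1F -> %04x %04x\n", w[0], w[1]); /* 7c01 001f */
    return 0;
}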

SSE/AVX: Choose from two __m256 float vectors based on per-element min and max absolute value

I am looking for an efficient AVX (AVX512) implementation of
// Given
float u[8];
float v[8];
// Compute
float a[8];
float b[8];
// Such that
for ( int i = 0; i < 8; ++i )
{
    a[i] = fabs(u[i]) >= fabs(v[i]) ? u[i] : v[i];
    b[i] = fabs(u[i]) <  fabs(v[i]) ? u[i] : v[i];
}
I.e., I need to select element-wise into a from u and v based on mask, and into b based on !mask, where mask = (fabs(u) >= fabs(v)) element-wise.
I had this exact same problem just the other day. The solution I came up with (using AVX only) was:
// take the absolute value of u and v
__m256 sign_bit = _mm256_set1_ps(-0.0f);
__m256 u_abs = _mm256_andnot_ps(sign_bit, u);
__m256 v_abs = _mm256_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__m256 u_ge_v = _mm256_cmp_ps(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b, flipping the argument
// order for b to invert the sense of the mask
__m256 a = _mm256_blendv_ps(v, u, u_ge_v);
__m256 b = _mm256_blendv_ps(u, v, u_ge_v);
The AVX512 equivalent would be:
// take the absolute value of u and v
__m512 sign_bit = _mm512_set1_ps(-0.0f);
__m512 u_abs = _mm512_andnot_ps(sign_bit, u);
__m512 v_abs = _mm512_andnot_ps(sign_bit, v);
// get a mask indicating the indices for which abs(u[i]) >= abs(v[i])
__mmask16 u_ge_v = _mm512_cmp_ps_mask(u_abs, v_abs, _CMP_GE_OS);
// use the mask to select the appropriate elements into a and b, flipping the argument
// order for b to invert the sense of the mask
__m512 a = _mm512_mask_blend_ps(u_ge_v, v, u);
__m512 b = _mm512_mask_blend_ps(u_ge_v, u, v);
As Peter Cordes suggested in the comments above, there are other approaches as well like taking the absolute value followed by a min/max and then reinserting the sign bit, but I couldn't find anything that was shorter/lower latency than this sequence of instructions.
Actually, there is another approach using AVX512DQ's VRANGEPS via the _mm512_range_ps() intrinsic. Intel's intrinsic guide describes it as follows:
Calculate the max, min, absolute max, or absolute min (depending on control in imm8) for packed single-precision (32-bit) floating-point elements in a and b, and store the results in dst. imm8[1:0] specifies the operation control: 00 = min, 01 = max, 10 = absolute max, 11 = absolute min. imm8[3:2] specifies the sign control: 00 = sign from a, 01 = sign from compare result, 10 = clear sign bit, 11 = set sign bit.
Note that there appears to be a typo in the above; actually imm8[1:0] == 10 is "absolute min" and imm8[1:0] == 11 is "absolute max" if you look at the details of the per-element operation:
CASE opCtl[1:0] OF
    0: tmp[31:0] := (src1[31:0] <= src2[31:0]) ? src1[31:0] : src2[31:0]
    1: tmp[31:0] := (src1[31:0] <= src2[31:0]) ? src2[31:0] : src1[31:0]
    2: tmp[31:0] := (ABS(src1[31:0]) <= ABS(src2[31:0])) ? src1[31:0] : src2[31:0]
    3: tmp[31:0] := (ABS(src1[31:0]) <= ABS(src2[31:0])) ? src2[31:0] : src1[31:0]
ESAC
CASE signSelCtl[1:0] OF
    0: dst[31:0] := (src1[31] << 31) OR (tmp[30:0])
    1: dst[31:0] := tmp[31:0]
    2: dst[31:0] := (0 << 31) OR (tmp[30:0])
    3: dst[31:0] := (1 << 31) OR (tmp[30:0])
ESAC
RETURN dst
So you can get the same result with just two instructions:
auto a = _mm512_range_ps(v, u, 0x7); // 0b0111 = sign from compare result, absolute max
auto b = _mm512_range_ps(v, u, 0x6); // 0b0110 = sign from compare result, absolute min
The argument order (v, u) is a bit unintuitive, but it's needed in order to get the same behavior that you described in the OP in the event that the elements have equal absolute value (namely, that the value from u is passed through to a, and v goes to b).
On Skylake and Ice Lake Xeon platforms (probably any of the Xeons that have dual FMA units?), VRANGEPS has a throughput of 2 per clock, so the two checks can issue and execute simultaneously, with a latency of 4 cycles. This is only a modest latency improvement over the original approach, but the throughput is better and it requires fewer instructions/uops and less instruction cache space.
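If you want to double-check the immediates, a minimal test harness is easy to write. This is my own untested sketch (it assumes a CPU with AVX512DQ, or an emulator such as Intel SDE, and compilation with e.g. -mavx512dq); it compares the two-instruction version against the compare-and-blend version from the first answer:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m512 u = _mm512_setr_ps(1, -2, 3, -4, 5, -6, 7, -8,
                              -1, 2, -3, 4, -5, 6, -7, 8);
    __m512 v = _mm512_setr_ps(-2, 1, -4, 3, -6, 5, -8, 7,
                              2, -1, 4, -3, 6, -5, 8, -7);

    // Reference: strip the sign bits, compare magnitudes, blend both ways.
    __m512 sign_bit = _mm512_set1_ps(-0.0f);
    __m512 u_abs = _mm512_andnot_ps(sign_bit, u);
    __m512 v_abs = _mm512_andnot_ps(sign_bit, v);
    __mmask16 u_ge_v = _mm512_cmp_ps_mask(u_abs, v_abs, _CMP_GE_OS);
    __m512 a_ref = _mm512_mask_blend_ps(u_ge_v, v, u);  // abs-max, u on ties
    __m512 b_ref = _mm512_mask_blend_ps(u_ge_v, u, v);  // abs-min, v on ties

    // VRANGEPS version, two instructions total.
    __m512 a = _mm512_range_ps(v, u, 0x7);
    __m512 b = _mm512_range_ps(v, u, 0x6);

    // Each mask should come back as 0xffff (all 16 lanes equal).
    printf("a matches: %#x, b matches: %#x\n",
           (unsigned)_mm512_cmp_ps_mask(a, a_ref, _CMP_EQ_OQ),
           (unsigned)_mm512_cmp_ps_mask(b, b_ref, _CMP_EQ_OQ));
    return 0;
}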
clang does a pretty reasonable job of auto-vectorizing it with -ffast-math and the necessary __restrict qualifiers: https://godbolt.org/z/NMvN1u. It ANDs both inputs to ABS them, compares once, then uses vblendvps twice on the original inputs with the same mask but the sources in the opposite order, to get min and max.
That's pretty much what I was thinking before checking what compilers did, and looking at their output to firm up the details I hadn't thought through yet. I don't see anything more clever than that. I don't think we can avoid abs()ing both u and v separately; there's no cmpps compare predicate that compares magnitudes and ignores the sign bit.
// untested: I *might* have reversed min/max, but I think this is right.
#include <immintrin.h>
// returns min_abs
__m256 minmax_abs(__m256 u, __m256 v, __m256 *max_result) {
    const __m256 signbits = _mm256_set1_ps(-0.0f);
    __m256 abs_u = _mm256_andnot_ps(signbits, u);
    __m256 abs_v = _mm256_andnot_ps(signbits, v); // strip the sign bit
    __m256 maxabs_is_v = _mm256_cmp_ps(abs_u, abs_v, _CMP_LT_OS); // true where abs(u) < abs(v)
    *max_result = _mm256_blendv_ps(u, v, maxabs_is_v);
    return _mm256_blendv_ps(v, u, maxabs_is_v);
}
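For what it's worth, here is a quick scalar cross-check of the blend order (my own untested sketch; it assumes the minmax_abs function above is in scope and an AVX2 machine):

#include <immintrin.h>
#include <math.h>
#include <stdio.h>

int main(void) {
    float uf[8] = {1, -2, 3, -4, 5, -6, 7, -8};
    float vf[8] = {-2, 1, -4, 3, -6, 5, -8, 7};
    __m256 max_v;
    __m256 min_v = minmax_abs(_mm256_loadu_ps(uf), _mm256_loadu_ps(vf), &max_v);

    float mn[8], mx[8];
    _mm256_storeu_ps(mn, min_v);
    _mm256_storeu_ps(mx, max_v);
    for (int i = 0; i < 8; ++i) {
        // Scalar reference from the question.
        float a = fabsf(uf[i]) >= fabsf(vf[i]) ? uf[i] : vf[i]; // expected max-abs
        float b = fabsf(uf[i]) <  fabsf(vf[i]) ? uf[i] : vf[i]; // expected min-abs
        printf("%d: max %g (want %g), min %g (want %g)\n", i, mx[i], a, mn[i], b);
    }
    return 0;
}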
You'd do the same thing with AVX512 except you compare into a mask instead of another vector.
// returns min_abs
__m512 minmax_abs512(__m512 u, __m512 v, __m512 *max_result) {
    const __m512 absmask = _mm512_castsi512_ps(_mm512_set1_epi32(0x7fffffff));
    __m512 abs_u = _mm512_and_ps(absmask, u);
    __m512 abs_v = _mm512_and_ps(absmask, v); // strip the sign bit
    __mmask16 maxabs_is_v = _mm512_cmp_ps_mask(abs_u, abs_v, _CMP_LT_OS); // true where abs(u) < abs(v)
    *max_result = _mm512_mask_blend_ps(maxabs_is_v, u, v);
    return _mm512_mask_blend_ps(maxabs_is_v, v, u);
}
Clang compiles the return statement in an interesting way (Godbolt):
.LCPI2_0:
        .long 2147483647                # 0x7fffffff
minmax_abs512(float __vector(16), float __vector(16), float __vector(16)*): # @minmax_abs512
        vbroadcastss zmm2, dword ptr [rip + .LCPI2_0]
        vandps  zmm3, zmm0, zmm2
        vandps  zmm2, zmm1, zmm2
        vcmpltps k1, zmm3, zmm2
        vblendmps zmm2 {k1}, zmm1, zmm0
        vmovaps zmmword ptr [rdi], zmm2   ## store the blend result
        vmovaps zmm0 {k1}, zmm1           ## interesting choice: blend merge-masking
        ret
Instead of using another vblendmps, clang notices that zmm0 already holds one of the blend inputs, and uses merge-masking with a regular vector vmovaps. This has zero advantage on Skylake-AVX512 for 512-bit vblendmps (both are single-uop instructions for port 0 or 5), but if Agner Fog's instruction tables are right, vblendmps x/y/zmm only ever runs on port 0 or 5, while a masked 256-bit or 128-bit vmovaps x/ymm{k}, x/ymm can run on any of p0/p1/p5.
Both are single-uop / single-cycle-latency instructions, unlike AVX2 vblendvps based on a mask vector, which is 2 uops. (So AVX512 is an advantage even for 256-bit vectors.) Unfortunately, none of gcc, clang, or ICC turn the _mm256_cmp_ps into _mm256_cmp_ps_mask and optimize the AVX2 intrinsics to AVX512 instructions when compiling with -march=skylake-avx512.
s/512/256/ to make a version of minmax_abs512 that uses AVX512 for 256-bit vectors.
Gcc goes even further, and does the questionable "optimization" of
        vmovaps zmm2, zmm1      # tmp118, v
        vmovaps zmm2{k1}, zmm0  # tmp118, tmp114, tmp118, u
instead of using one blend instruction. (I keep thinking I'm seeing a store followed by a masked store, but no, neither compiler is blending that way).

Break up 32-bit hex value into 4 bytes [QB64]

I want to ask how you break up a 32-bit hex value (for example CEED6644) into 4 bytes (var1 = CE, var2 = ED, var3 = 66, var4 = 44) in QB64 or QBasic. I would use this to store several data bytes in one array address.
Something like this:
DIM Array(&HFFFF&) AS _UNSIGNED LONG
Array(&HAA00&) = &HCEED6644&
addr = &HAA00&
SUB PrintChar
    SHARED addr
    IF var1 = &HAA& THEN PRINT "A"
    IF var1 = &HBB& THEN PRINT "B"
    IF var1 = &HCC& THEN PRINT "C"
    IF var1 = &HDD& THEN PRINT "D"
    IF var1 = &HEE& THEN PRINT "E"
    IF var1 = &HFF& THEN PRINT "F"
    IF var1 = &H00& THEN PRINT "G"
    IF var1 = &H11& THEN PRINT "H"
And so on...
You could use integer division (\) and bitwise AND (AND) to accomplish this.
DIM x(0 TO 3) AS _UNSIGNED _BYTE
a& = &HCEED6644&
x(0) = (a& AND &HFF000000&) \ 2^24
x(1) = (a& AND &H00FF0000&) \ 2^16
x(2) = (a& AND &H0000FF00&) \ 2^8
x(3) = a& AND &HFF&
PRINT HEX$(x(0)); HEX$(x(1)); HEX$(x(2)); HEX$(x(3))
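For comparison, here is the same mask-then-divide idea written out in C (my own illustration; dividing by a power of two is exactly a right shift, which is the point of the RShift~& function below):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t a = 0xCEED6644u;

    // Mask first, then divide by a power of two, mirroring the QB64 code.
    uint8_t b0 = (a & 0xFF000000u) / 0x1000000u; // 2^24
    uint8_t b1 = (a & 0x00FF0000u) / 0x10000u;   // 2^16
    uint8_t b2 = (a & 0x0000FF00u) / 0x100u;     // 2^8
    uint8_t b3 =  a & 0xFFu;

    // Equivalent shift-then-mask form.
    uint8_t s0 = (a >> 24) & 0xFFu;

    printf("%02X %02X %02X %02X (shift check: %02X)\n", b0, b1, b2, b3, s0);
    return 0;
}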
Note that you could alternatively use a generic RShift~& function instead of raw integer division since what you're really doing is shifting bits:
x(0) = RShift~&(a& AND &HFF000000&, 24)
...
FUNCTION RShift~& (value AS _UNSIGNED LONG, shiftCount AS _UNSIGNED _BYTE)
    ' Raise illegal function call if the shift count is greater than the width of the type.
    ' If shiftCount is not _UNSIGNED, then you must also check that it isn't less than 0.
    IF shiftCount > 32 THEN ERROR 5
    RShift~& = value \ 2 ^ shiftCount
END FUNCTION
Building upon that, you might create another function:
FUNCTION ByteAt~%% (value AS _UNSIGNED LONG, position AS _UNSIGNED _BYTE)
    ' position must be in the range [0, 3].
    IF (position AND 3) <> position THEN ERROR 5
    ByteAt~%% = RShift~&(value AND LShift~&(&HFF&, 8 * position), 8 * position)
END FUNCTION
Note that an LShift~& function was used that shifts bits to the left (multiplication by a power of 2). A potentially better alternative would be to perform the right-shift first and just mask the lower 8 bits, eliminating the need for LShift~&:
FUNCTION ByteAt~%% (value AS _UNSIGNED LONG, position AS _UNSIGNED _BYTE)
    ' position must be in the range [0, 3].
    IF (position AND 3) <> position THEN ERROR 5
    ByteAt~%% = RShift~&(value, 8 * position) AND 255
END FUNCTION
Incidentally, another QB-like implementation known as FreeBASIC has an actual SHR operator, used like MOD or AND, to perform a shift operation directly instead of using division, which is potentially faster.
You could also use QB64's DECLARE LIBRARY facility to create functions in C++ that will perform the shift operations:
/*
 * Place in a separate "shift.h" file or something.
 */
unsigned int LShift (unsigned int n, unsigned char count)
{
    return n << count;
}

unsigned int RShift (unsigned int n, unsigned char count)
{
    return n >> count;
}
Here's the full corresponding QB64 code:
DECLARE LIBRARY "shift"
    FUNCTION LShift~& (value AS _UNSIGNED LONG, shiftCount AS _UNSIGNED _BYTE)
    FUNCTION RShift~& (value AS _UNSIGNED LONG, shiftCount AS _UNSIGNED _BYTE)
END DECLARE
x(0) = ByteAt~%%(a&, 0)
x(1) = ByteAt~%%(a&, 1)
x(2) = ByteAt~%%(a&, 2)
x(3) = ByteAt~%%(a&, 3)
END
FUNCTION ByteAt~%% (value AS _UNSIGNED LONG, position AS _UNSIGNED _BYTE)
    ' position must be in the range [0, 3].
    IF (position AND 3) <> position THEN ERROR 5
    ByteAt~%% = RShift~&(value, 8 * position) AND 255
END FUNCTION
If QB64 had a documented API, it might be possible to raise a QB64 error from the C++ code when the shift count is too high, rather than relying on the C++ behavior for over-wide shift counts (which is formally undefined, although x86 in practice masks the count). Unfortunately, this isn't the case, and it might actually cause more problems than it's worth.
This snippet gets the byte pairs of a hexadecimal value:
DIM Value AS _UNSIGNED LONG
Value = &HCEED6644&
S$ = RIGHT$("00000000" + HEX$(Value), 8)
PRINT "Byte#1: "; MID$(S$, 1, 2)
PRINT "Byte#2: "; MID$(S$, 3, 2)
PRINT "Byte#3: "; MID$(S$, 5, 2)
PRINT "Byte#4: "; MID$(S$, 7, 2)

How can I better parse variable time stamp information in Fortran?

I am writing code in gfortran to separate a variable time stamp into its separate parts: year, month, and day. I have written this code so the user can input what the time stamp format will be (i.e. YEAR/MON/DAY, DAY/MON/YEAR, etc.). This creates a total of 6 possible combinations. I have written code that attempts to deal with this, but I believe it to be ugly and poorly done.
My current code uses a slew of 'if' and 'goto' statements. The user provides 'tsfo', the time stamp format. 'ts' is a character array containing the time stamp data (as many as 100,000 time stamps). 'tsdelim' is the delimiter between the year, month, and day. I must loop from 'frd' (the first time stamp) to 'nlines' (the last time stamp).
Here is the relevant code.
*     Choose which case to go to.
      first = INDEX(tsfo, tsdelim)
      second = INDEX(tsfo(first+1:), tsdelim) + first
      if (INDEX(tsfo(1:first-1), 'YYYY') .ne. 0) THEN
         if (INDEX(tsfo(first+1:second-1), 'MM') .ne. 0) THEN
            goto 1001
         else
            goto 1002
         end if
      else if (INDEX(tsfo(1:first-1), 'MM') .ne. 0) THEN
         if (INDEX(tsfo(first+1:second-1), 'DD') .ne. 0) THEN
            goto 1003
         else
            goto 1004
         end if
      else if (INDEX(tsfo(1:first-1), 'DD') .ne. 0) THEN
         if (INDEX(tsfo(first+1:second-1), 'MM') .ne. 0) THEN
            goto 1005
         else
            goto 1006
         end if
      end if
      first = 0
      second = 0
*     Obtain the Julian Day number of each data entry.
*     Acquire the year, month, and day of the time stamp.
*     Find 'first' and 'second' and act accordingly.
*     Case 1: YYYY/MM/DD
 1001 do i = frd, nlines
         first = INDEX(ts(i), tsdelim)
         second = INDEX(ts(i)(first+1:), tsdelim) + first
         read (ts(i)(1:first-1), '(i4)') Y
         read (ts(i)(first+1:second-1), '(i2)') M
         read (ts(i)(second+1:second+2), '(i2)') D
*        Calculate the Julian Day number using a function.
         temp1(i) = JLDYNUM(Y, M, D)
      end do
      goto 1200
*     Case 2: YYYY/DD/MM
 1002 do i = frd, nlines
         first = INDEX(ts(i), tsdelim)
         second = INDEX(ts(i)(first+1:), tsdelim) + first
         read (ts(i)(1:first-1), '(i4)') Y
         read (ts(i)(second+1:second+2), '(i2)') M
         read (ts(i)(first+1:second-1), '(i2)') D
*        Calculate the Julian Day number using a function.
         temp1(i) = JLDYNUM(Y, M, D)
      end do
      goto 1200
*     Onto the next part of the code
 1200 blah blah blah
I believe this code will work, but I do not think it is a very good method. Is there a better way to go about this?
It is important to note that the indices 'first' and 'second' must be calculated for each time stamp, as the month and day can each be represented by 1 or 2 digits. The year is always represented by 4.
With only six permutations to handle, I would just build a look-up table with the whole tsfo string as the key and the positions of year, month and day (1st, 2nd or 3rd) as the values. Any unsupported format should produce an error, which I haven't coded below. When you subsequently loop through your ts list and split an item, you know which positions to cast to the year, month and day integer variables:
PROGRAM timestamp
  IMPLICIT NONE
  CHARACTER(len=10) :: ts1(3) = ["2000/3/4  ","2000/25/12","2000/31/07"]
  CHARACTER(len=10) :: ts2(3) = ["3/4/2000  ","25/12/2000","31/07/2000"]
  CALL parse("YYYY/DD/MM",ts1)
  print*
  CALL parse("DD/MM/YYYY",ts2)
CONTAINS
  SUBROUTINE parse(tsfo,ts)
    IMPLICIT NONE
    CHARACTER(len=*),INTENT(in) :: tsfo, ts(:)
    TYPE sti
      CHARACTER(len=10) :: stamp = "1234567890"
      INTEGER :: iy = -1, im = -1, id = -1
    END TYPE sti
    TYPE(sti),PARAMETER :: stamps(6) = [sti("YYYY/MM/DD",1,2,3), sti("YYYY/DD/MM",1,3,2),&
                                        sti("MM/DD/YYYY",2,3,1), sti("DD/MM/YYYY",3,2,1),&
                                        sti("MM/YYYY/DD",2,1,3), sti("DD/YYYY/MM",3,1,2)]
    TYPE(sti) :: thisTsfo
    INTEGER :: k, k1, k2
    INTEGER :: y, m, d
    CHARACTER(len=10) :: cc(3)
    DO k=1,SIZE(stamps)
      IF(TRIM(tsfo) == stamps(k)%stamp) THEN
        thisTsfo = stamps(k)
        EXIT
      ENDIF
    ENDDO
    print*,thisTsfo
    DO k=1,SIZE(ts)
      k1 = INDEX(ts(k),"/")
      k2 = INDEX(ts(k),"/",BACK=.TRUE.)
      cc(1) = ts(k)(:k1-1)
      cc(2) = ts(k)(k1+1:k2-1)
      cc(3) = ts(k)(k2+1:)
      READ(cc(thisTsfo%iy),'(i4)') y
      READ(cc(thisTsfo%im),'(i2)') m
      READ(cc(thisTsfo%id),'(i2)') d
      PRINT*,ts(k),y,m,d
    ENDDO
  END SUBROUTINE parse
END PROGRAM timestamp
I would encode the different cases in another way, like this:
module foo
  implicit none
  private
  public encode_datecode
contains
  integer function encode_datecode(datestr, sep)
    character(len=*), intent(in) :: datestr, sep
    integer :: first, second
    character(len=1) :: c1, c2, c3
    first = index(datestr, sep)
    second = index(datestr(first+1:), sep) + first
    c1 = datestr(1:1)
    c2 = datestr(first+1:first+1)
    c3 = datestr(second+1:second+1)
    encode_datecode = num(c1) + 3*num(c2) + 9*num(c3)
  end function encode_datecode

  integer function num(c)
    character(len=1), intent(in) :: c
    if (c == 'Y') then
      num = 0
    else if (c == 'M') then
      num = 1
    else if (c == 'D') then
      num = 2
    else
      stop "Illegal character"
    end if
  end function num
end module foo
and then handle the legal cases (21, 15, 19, 7, 11, 5) in a SELECT CASE statement.
This takes advantage of the fact that there won't be a 'YDDY/MY/YM' format.
If you prefer better binary or decimal readability, you can also multiply by 4 or by 10 instead of by 3.
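To see where those six case values come from, here is the same base-3 encoding worked out as a small standalone C program (my own illustration; the answer itself is Fortran):

#include <stdio.h>
#include <string.h>

/* Base-3 digit per field, as in the Fortran num() function: Y=0, M=1, D=2. */
static int num(char c) { return c == 'Y' ? 0 : (c == 'M' ? 1 : 2); }

int main(void) {
    const char *formats[] = {"YYYY/MM/DD", "YYYY/DD/MM", "MM/YYYY/DD",
                             "MM/DD/YYYY", "DD/YYYY/MM", "DD/MM/YYYY"};
    for (int i = 0; i < 6; ++i) {
        const char *f = formats[i];
        const char *first = strchr(f, '/');
        const char *second = strchr(first + 1, '/');
        /* Look at the first character of each of the three fields. */
        int code = num(f[0]) + 3 * num(first[1]) + 9 * num(second[1]);
        printf("%s -> %d\n", f, code); /* prints 21, 15, 19, 7, 11, 5 */
    }
    return 0;
}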

code snippet lack of understanding

I am wrestling with a C code snippet I need to convert.
One of the functions is as follows:
+ (float)calcTemp:(NSData *)data {
    char scratchVal[data.length];
    [data getBytes:&scratchVal length:data.length];
    UInt16 temp;
    temp = (scratchVal[0] & 0xff) | ((scratchVal[1] << 8) & 0xff00);
    return (float)temp;
}
This line I just can't seem to grasp:
temp = (scratchVal[0] & 0xff) | ((scratchVal[1] << 8) & 0xff00);
I know it's probably a newbie question (I am a noob), but if someone could explain that line to me I would greatly appreciate it, in particular the address references and the operator uses.
In the code snippet I don't see why they call the getBytes:length: method on data, since it's not being used. But mainly, I'm just trying to understand the line that I pointed out.
The line
temp = (scratchVal[0] & 0xff) | ((scratchVal[1] << 8) & 0xff00);
is creating an unsigned 16-bit integer value from two bytes originating in scratchVal. A single & in this context is not the address operator but bitwise AND. So the lower byte of temp is set from the first byte contained in scratchVal, and the upper byte of temp is set by left-shifting the second byte contained in scratchVal. The two resulting numbers are joined together using bitwise OR |. To avoid sign extension or other unwanted bits the masks 0xff and 0xff00 are used to ensure all undesirables are zero.
Presented visually, if scratchVal contains the bits aaaaaaaa bbbbbbbb in the first two bytes then temp will end up as an unsigned integer with the bit pattern bbbbbbbbaaaaaaaa.
The second question asked why they're calling -getBytes:length:. The line
[data getBytes:&scratchVal length:data.length];
reads the bytes from data into the scratchVal temporary buffer.
In response to the question in the comment
why it is needed to left shift the bits to concatenate them
A simple assignment won't work. Assuming again that scratchVal is a char buffer containing the bits aaaaaaaa bbbbbbbb, the code
temp = scratchVal[0];
would make temp equal to the UInt16 equivalent of the bits aaaaaaaa. You can't use addition because the result will be whatever value comes from adding the two bytes together (aaaaaaaa + bbbbbbbb).
Using real numbers as an example, suppose the first two bytes of scratchVal are equal to 0x7f 0x7f.
temp = scratchVal[0] + scratchVal[1];
Turns out to be 0x7f + 0x7f = 0xfe which is not the purpose of this code.
Building the value using OR can be better understood by breaking it down into steps.
The first part of the expression is scratchVal[0] & 0xff = 0x7f & 0xff = 0x7f
The second part is (scratchVal[1] << 8) & 0xff00 = (0x7f << 8) & 0xff00 = 0x7f00 & 0xff00 = 0x7f00
The final result in this case is 0x7f | 0x7f00 = 0x7f7f.
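The whole worked example fits in a few lines of standalone C if you want to run it yourself (my own sketch of the arithmetic above):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    char scratchVal[2] = {0x7f, 0x7f};

    // Low byte ORed with the high byte shifted into place.
    uint16_t temp = (scratchVal[0] & 0xff) | ((scratchVal[1] << 8) & 0xff00);
    printf("0x%04x\n", temp); // 0x7f7f

    // Why the masks matter: if char is signed and holds 0x80, it becomes
    // 0xffffff80 when promoted to int; & 0xff keeps only the byte we want.
    char negative = (char)0x80;
    temp = (negative & 0xff) | ((scratchVal[1] << 8) & 0xff00);
    printf("0x%04x\n", temp); // 0x7f80
    return 0;
}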

Attempt at assembly on iOS

Please note: I'm just trying to learn. Please do not yell at me for toying with assembly.
I have the following method:
uint32 test(int16 a, int16 b)
{
    return ( a + b ) & 0xffff;
}
I created a .s file based on details I found here.
My .s file contains the following:
.macro BEGIN_FUNCTION
    .align 2            // Align the function code to a 4-byte (2^n) word boundary.
    .arm                // Use ARM instructions instead of Thumb.
    .globl _$0          // Make the function globally accessible.
    .no_dead_strip _$0  // Stop the optimizer from ignoring this function!
    .private_extern _$0
_$0:                    // Declare the function.
.endmacro

.macro END_FUNCTION
    bx lr               // Jump back to the caller.
.endmacro

BEGIN_FUNCTION addFunction
    add r0, r0, r1      // Return the sum of the first 2 function parameters
END_FUNCTION

BEGIN_FUNCTION addAndFunction
    add r0, r0, r1      // Sum of the first 2 function parameters
    ands r0, r0, r2     // AND the result with the third parameter passed
END_FUNCTION
So if I call the following:
addFunction(10,20)
I get what I would expect. But then if I try
int addOne = addFunction(0xffff, 0xffff); // Result = -2
int addTwo = 0xffff + 0xffff;             // Result = 131070
My addOne does not end up being the same value as my addTwo. Any ideas on what I am doing wrong here?
When you pass the int16_t parameter 0xffff to addFunction, the compiler sign-extends it to fill the 32-bit register (as it's a signed integer), making it 0xffffffff. You add two of these together to get 0xfffffffe and return it. When you do 0xffff + 0xffff, both constants are already 32 bits, so there's no need to sign-extend; the result is 0x0001fffe.
What you need to understand is that when you're using a signed integer and passing the 16-bit value 0xffff, you're actually passing in -1, so it's no surprise that -1 + -1 = -2. It's also no surprise that 0xffff + 0xffff = 0xfffffffe.
If you change the (not shown) C declaration of addFunction to take two unsigned ints, then you'll get the result you desire (the compiler won't do the sign extension).
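You can watch the same sign extension happen in portable C (a minimal sketch of the point above, using the fixed-width types from <stdint.h>):

#include <stdint.h>
#include <stdio.h>

int main(void) {
    int16_t a = (int16_t)0xffff; // bit pattern 0xffff interpreted as -1
    int16_t b = (int16_t)0xffff;

    // int16_t values are sign-extended (promoted to int) before the add,
    // just like the 16-bit arguments are widened to 32-bit ARM registers.
    int32_t signed_sum = a + b;        // -1 + -1 = -2 (0xfffffffe)

    uint16_t ua = 0xffff, ub = 0xffff; // zero-extended instead
    int32_t unsigned_sum = ua + ub;    // 65535 + 65535 = 131070 (0x0001fffe)

    printf("%d 0x%08x\n", signed_sum, (uint32_t)signed_sum);
    printf("%d 0x%08x\n", unsigned_sum, (uint32_t)unsigned_sum);
    return 0;
}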
Your assembly code assumes a third passed parameter (in R2), but when you call the function you are only passing 2 parameters. I think the contents of R2 could be anything in your addAndFunction.
