How to get this sqrt inline assembly working for iOS - ios

I am trying to follow another SO post and implement sqrt14 within my iOS app:
double inline __declspec (naked) __fastcall sqrt14(double n)
{
_asm fld qword ptr [esp+4]
_asm fsqrt
_asm ret 8
}
I have modified this to the following in my code:
double inline __declspec (naked) sqrt14(double n)
{
__asm__("fld qword ptr [esp+4]");
__asm__("fsqrt");
__asm__("ret 8");
}
Above, I have removed the "__fastcall" keyword from the method definition since my understanding is that it is for x86 only. The above gives the following errors for each assembly line respectively:
Unexpected token in argument list
Invalid instruction
Invalid instruction
I have attempted to read through a few inline ASM guides and other posts on how to do this, but I am generally just unfamiliar with the language. I know MIPS quite well, but these commands/registers seem to be very different. For example, I don't understand why the original author never uses the passed in "n" value anywhere in the assembly code.
Any help getting this to work would be greatly appreciated! I am trying to do this because I am building an app where I need to calculate sqrt (ok, yes, I could do a lookup table, but for right now I care a lot about precision) on every pixel of a live-video feed. I am currently using the standard sqrt, and in addition to the rest of the computation, I'm running at around 8fps. Hoping to bump that up a frame or two with this change.
If it matters: I'm building the app to ideally be compatibly with any current iOS device that can run iOS 7.1 Again, many thanks for any help.

The compiler is perfectly capable of generating fsqrt instruction, you don't need inline asm for that. You might get some extra speed if you use -ffast-math.
For completeness' sake, here is the inline asm version:
__asm__ __volatile__ ("fsqrt" : "=t" (n) : "0" (n));
The fsqrt instruction has no explicit operands, it uses the top of the stack implicitly. The =t constraint tells the compiler to expect the output on the top of the fpu stack and the 0 constraint instructs the compiler to place the input in the same place as output #0 (ie. the top of the fpu stack again).
Note that fsqrt is of course x86-only, meaning it wont work for example on ARM cpus.

Related

How to avoid crash during stack buffer overflow exploit?

void deal_msg(unsigned char * buf, int len)
{
unsigned char msg[1024];
strcpy(msg,buf);
//memcpy(msg, buf, len);
puts(msg);
}
void main()
{
// network operation
sock = create_server(port);
len = receive_data(sock, buf);
deal_msg(buf, len);
}
As the pseudocode shows above, the compile environment is vc6 and running environment is windows xp sp3 en. No other protection mechanisms are applied, that is stack can be executed, no ASLR.
The send data is 'A' * 1024 + addr_of_jmp_esp + shellcode.
My question is:
if strcpy is used, the shellcode is generated by msfvenom, msfvenom -p windows/exec cmd=calc.exe -a x86 -b "\x00" -f python,
msfvenom attempts to encode payload with 1 iterations of x86/shikata_ga_nai
after data is sent, no calc pops up, the exploit won't work.
But if memcpy is used, shellcode generated by msfvenom -p windows/exec cmd=calc.exe -a x86 -f python without encoding works.
How to avoid the original program's crash after calc pops up, how to keep stack balance to avoid crash?
Hard to say. I'd use a custom payload (just copy the windows/exec cmd=calc.exe) and put a 0xcc at the start and debug it (or something that will be easily recognizable under the debugger like a ud2 or \0xeb\0xfe). If your payload is executed, you'll see it. Bypass the added instruction (just NOP it) and try to see what can possibly go wrong with the remainder of the payload.
You'll need a custom payload ; Since you're on XP SP3 you don't need to do crazy things.
Don't try to do the overflow and smash the whole stack (given your overflow it seems to be perfect, just enough overflow to control rIP).
See how the target function (deal_msg in your example) behave under normal conditions. Note the stack address when the ret is executed (and if register need to have certain values, this depend on the caller).
Try to replicate that in your shellcode: you'll most probably to adjust the stack pointer a bit at the end of your shellcode.
Make sure the caller (main) stack hasn't been affected when executing the payload. This might happen, in this case reserve enough room on the stack (going to lower addresses), so the caller stack is far from the stack space needed by the payload and it doesn't get affected by the payload execution.
Finally return to the ret of the target or directly after the call of the deal_msg function (or anywhere you see fit, e.g. returning directly to ExitProcess(), but this might be more interesting to return close to the previous "normal" execution path).
All in all, returning somewhere after the payload execution is easy, just push <addr> and ret but you'll need to ensure that the stack is in good shape to continue execution and most of the registers are correctly set.

Fastest way of storing non-adjacent d registers with NEON intrinsics

I am porting 32bit NEON asm code to NEON intrinsics, and I am wondering if this code can be written in a concise way using intrinsics:
vst4.32 {d0[0], d2[0], d4[0], d6[0]}, [%[v1]]!
1) The previous code operates on q registers, but when it comes to storage, instead of using q0, q1, q2 and q3, it has to recreate vectors which have each part in one of the d registers, e.g. v1[0] = d0[0], v1[1] = d2[0] ... v2[0] = d0[1], v2[1] = d2[1] ... v3[0] = d1[0], v3[1] = d3[0] ... etc.
This operation is a one-liner in asm, but with intrinsics I don't know if I can do that without first splitting high and low bits and building a new float32x4x4_t variable to feed to vst4_f32.
Is that possible?
2) I'm not entirely sure of what [%[v1]]! does (yes, I googled quite a bit): it should be a reference to a variable named v1 and the exclamation mark will do writeback, which should mean the pointer is increased by the same amount that was written by the instruction on the same line.
Correct? Any way of replicating that with intrinsics?
After some more investigation I found this specific instruction to store a specific lane of an array of 4 vectors, so no need to split into high and low bits variables:
float32x4x4_t u = { q0, q1, q2, q3 };
vst4q_lane_f32(v1, u, 0);
v1 += 4;
Writeback is just an increased pointer, as #charlesbaylis wrote.
In principle, a sufficiently smart compiler could use the instruction you want for the vst4_f32 intrinsic, but in practice, no compiler is that good.
To get the post-index writeback, you can write
vst4_f32(ptr, v);
ptr += 4;
Some compilers will recognise this. GCC 5.1 (when released) will do this in at least some cases.
[Edit: misread the question, vst4q_lane_f32 does map to the required instruction perfectly]
It seems to be inline assembly.
Anyway, the answers are:
1) No
2) Yes

ELEVENWORDINLINE when to use it?

I was always wondering what can I do with things like that:
ONEWORDINLINE(w1)
TWOWORDINLINE(w1, w2)
THREEWORDINLINE(w1, w2, w3)
up to
TENWORDINLINE(w1, w2, w3, w4, w5, w6, w7, w8, w9, w10)
ELEVENWORDINLINE(w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11)
TWELVEWORDINLINE(w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11, w12)
How to use this macros?
When to use them?
Why from 1 to 12 and
not to...100 for example?
That is long-dead technology of the ancient Mac programmers, sentry to a tomb best left untouched. But, if you're interested, off we go, brave adventurer!
Here are the relevant #define-s straight from Apple:
#define ONEWORDINLINE(w1) = w1
#define TWOWORDINLINE(w1,w2) = {w1,w2}
#define THREEWORDINLINE(w1,w2,w3) = {w1,w2,w3}
/* ...etc... */
#define TWELVEWORDINLINE(w1,w2,w3,w4,w5,w6,w7,w8,w9,w10,w11,w12) = {w1,w2,w3,w4,w5,w6,w7,w8,w9,w10,w11,w12}
Now to some explaining.
A little history lesson: back in the ancient times days when Mac was implemented for the (just as ancient) Motorola 68k, Apple set up its system-calls in the following highly compact and very usable way: they mapped them to words starting with 1010b (0xA), as they were reserved by the Motorola devs for this use. These sys-calls and their mappings were called A-Traps for this hex value (no relation to "IT'S A TRAP!", honestly). They looked like this in hex: 0xA869 (this example is the A-Trap for the FixRatio(short numer, short denom) system-call). This technology was originally created for the Mac Toolbox API.
When a Mac-on-68k-targeting compiler (these should set the TARGET_OS_MAC and TARGET_CPU_68K macros as 1 or TRUE and TARGET_RT_MAC_CFM as 0 or FALSE, BTW) saw a function prototype with an assignment (=) after it, it treated the prototype as referring to an A-Trap system call indicated by the A-Trap value to the right of the assignment operator, which could be a single integer literal one-word value starting with 0xA (0xA???). So, ONEWORDINLINE was basically a stylish macro-way of saying "it's an A-Trap!".
So, here's what a sys-call function prototype declaration for 68k would look like:
EXTERN_API(Fixed) FixRatio(short numer, short denom) ONEWORDINLINE(0xA869);
This would be preprocessed to something like this:
extern Fixed FixRatio(short numer, short denom) = 0xA869;
Now, you might be thinking: if we're indexing system calls by one word, and one-fourth that word is taken by the huge 0xA, that'd be only 4096 functions at maximum (there was much less in reality, as many A-Traps actually mapped to the same system-call subroutines, but with different parameters), how is that enough? Well, obviously, it isn't. That's where selectors come in.
A-Traps like _HFSDispatch (0xA260) were called "selectors" because they had the job of selecting and calling another subroutine determined by values on the stack. So, when a function prototype was "assigned" an "array" of one-word integer literals, all but the last one (called the selector code(s)) got pushed onto the stack, and the last one was treated as an A-Trap selector that grabbed the words pushed onto the stack and called the appropriate subroutine. The maximum number of words in such an array was 12 because that was enough for the Mac Toolbox.
The macros TWOWORDINLINE through TWELVEWORDINLINE handled selector A-Traps. For example:
EXTERN_API(OSErr) ActivateTSMDocument(TSMDocumentID idocID) TWOWORDINLINE(0x7002, 0xAA54);
would be preprocessed to something like
extern OSErr ActivateTSMDocument(TSMDocumentID idocID) = {0x7002, 0xAA54};
Here, 0x7002 is the selector code, and 0xAA54 is the selector A-Trap.
So, to sum it all up, you only need it if you want to do some coding for pre-1994 Mac-s running on a Motorola 68k. So ios isn't really in place here ;)
Disclaimer: I know this stuff in theory only and I may have made a mistake somewhere. If there are any old-timers experienced with this stuff, please correct me if I got something wrong!

how to get the available memory on the device

I'm trying to obtain how much free memory I have on the device. To do this I call the cuda function cuMemGetInfo from a fortran code, but it returns negative values for the free amount of memory, so there's clearly something wrong.
Does anyone know how I can do that?
Thanks
EDIT:
Sorry, in fact my question was not very clear. I'm using OpenACC in Fortran and I call the C++ cuda function cudaMemGetInfo. Finally I could fix the code, the problem was effectively the kind of variables that I was using. Switching to size_ fixed everything. This is the interface in fortran that I'm using:
interface
subroutine get_dev_mem(total,free) bind(C,name="get_dev_mem")
use iso_c_binding
integer(kind=c_size_t)::total,free
end subroutine get_dev_mem
end interface
and this the cuda code
#include <cuda.h>
#include <cuda_runtime.h>
extern "C" {
void get_dev_mem(size_t& total, size_t& free)
{
cuMemGetInfo(&free, &total);
}
}
There's one last question: I pushed an array on the gpu and I checked its size using cuMemGetInfo, then I computed it's size counting the number of bytes, but I don't have the same answer, why? In the first case it is 3052mb large, in the latter 3051mb. This difference of 1mb could be the size of the array descriptor? Here there's the code that I used:
integer, parameter:: long = selected_int_kind(12)
integer(kind=c_size_t) :: total, free1,free2
real(8), dimension(:),allocatable::a
integer(kind=long)::N, eight, four
allocate(a(four*N))
!some OpenACC stuff in order to init the gpu
call get_dev_mem(total,free1)
!$acc data copy(a)
call get_dev_mem(total,free2)
print *,"size a in the gpu = ",(free1-free2)/1024/1024, " mb"
print *,"size a in theory = ", (eight*four*N)/1024/1024, " mb"
!$acc end data
deallocate(a)
Right, so, like commenters have suggested, we're not sure exactly what you're running, but filling in the missing details by guessing, here's a shot:
Most CUDA API calls return a status code (or error code if you will); this is true both in C/C++ and in Fortran, as we can see in the Portland Group's CUDA Fortran Manual:
Most of the runtime API routines are integer functions that return an error code; they return a value of zero if the call was successful, and a nonzero value if there was an error. To interpret the error codes, refer to “Error Handling,” on page 48.
This is the case for cudaMemGetInfo() specifically:
integer function cudaMemGetInfo( free, total )
integer(kind=cuda_count_kind) :: free, total
The two integers for free and total are cuda_count_kind, which if I am not mistaken are effectively unsigned... anyway, I would guess that what you're getting is an error code. Have a look at the Error Handling section on page 48 of the manual.

Is there a safe way to clean up stack-based code when jumping out of a block?

I've been working on Issue 14 on the PascalScript scripting engine, in which using a Goto command to jump out of a Case block produces a compiler error, even though this is perfectly valid (if ugly) Object Pascal code.
Turns out the ProcessCase routine in the compiler calls HasInvalidJumps, which scans for any Gotos that lead outside of the Case block, and gives a compiler error if it finds one. If I comment that check out, it compiles just fine, but ends up crashing at runtime. A disassembly of the bytecode shows why. I've annotated it with the original script code:
[TYPES]
<SNIPPED>
[VARS]
Var [0]: 27 Class TFORM
Var [1]: 28 Class TAPPLICATION
Var [2]: 11 S32 //i: integer
[PROCS]
Proc [0] Export: !MAIN -1
{begin}
[0] ASSIGN GlobalVar[2], [1]
{ i := 1;}
[15] PUSHTYPE 11(S32) // 1
[20] ASSIGN Base[1], GlobalVar[2]
{ case i of}
[31] PUSHTYPE 25(U8) // 2
{ 0:}
[36] COMPARE into Base[2]: [0] = Base[1]
[57] COND_NOT_GOTO currpos + 5 Base[2] [72]
{ end;}
[67] GOTO currpos + 41 [113]
{ 1:}
[72] COMPARE into Base[2]: [1] = Base[1]
[93] COND_NOT_GOTO currpos + 10 Base[2] [113]
{ goto L1;}
[103] GOTO currpos + 8 [116]
{ end;}
[108] GOTO currpos + 0 [113]
{ end; //<-- case}
[113] POP // 1
[114] POP // 0
{ Exit;}
[115] RET
{L1:
Writeln('Label L1');}
[116] PUSHTYPE 17(WideString) // 1
[121] ASSIGN Base[1], ['????????']
[144] CALL 1
{end.}
[149] POP // 0
[150] RET
Proc [1]: External Decl: \00\00 WRITELN
The "goto L1;" statement at 103 skips the cleanup pops at 113 and 114, which leaves the stack in an invalid state.
Delphi doesn't have any trouble with this, because it doesn't use a calculation stack. PascalScript, though, is not as fortunate. I need some way to make this work, as this pattern is very common in some legacy scripts from a much simpler system with little in the way of control structures that I've translated to PascalScript and need to be able to support.
Anyone have any ideas how to patch the codegen so it'll clean up the stack properly?
IIRC the goto rules in classic pascals were:
jumps are only allowed out of a block (iow from a higher to a lower nesting level on the "same" branch of the tree)
from local procedures to their parents.
The later was afaik never supported by Borland derived Pascals, but the first still holds.
So you need to generate exiting code like Martin says, but possibly it can be for multiple block levels, so you can't have a could codegeneration for each goto, but must generate code (to exit the precise number of needed blocks).
A typical test pattern is to exit from multiple nested ifs (possibly within a loop) using a goto, since that was a classic microoptimization that was faster at least up to D7.
Keep in mind that the if evaluation(s) and the begin..end blocks of their branches might have generated temps that need cleanup.
---------- added later
I think the codegenerator needs a way to walk the scopes between the goto and its endpoint, generating the relevant exit code for blocks along the way. That way a fix works for the general case and not just this example.
Since you can only jump out of scopes, and not into it that might not that be that hard.
IOW generate something that is equivalent to (for a hypothetical double case block)
Lgoto1gluecode:
// exit code first block
pop x
pop y
// exit code first block
pop A
pop B
goto real_goto_destination
Additional analysis can be done. E.g. if there is only one scope, and it has already a cleanup exit label, you can jump directly. If you know for certain that the above pop's are only discarded values (and not saves of registers) you can do them at once with add $16,%esp (4*4 byte values) etc.
The straightforward solution would be:
When generating a GOTO for goto statement, prefix the GOTO with the same cleanup code that comes before RET.
It looks to me like the calculation of how far to jump forward is the problem. I would have to spend some time looking at the implementation of the parser to help further, but my guess would be that additional handling must be performed when using a goto and there are values on the stack AND the goto would be placed after those values would be removed from the stack. Of course to determine this you would need to save the current location being parsed (the goto) and the forward parse to the target location watching for stack changes, and if so then to either adjust the goto location backwards, or inject the code as Martin suggested.

Resources