As we all know, the ALU performs arithmetic operations, but does the computer understand post-fix notation or not?
Assuming you mean Arithmetic/Logic Unit, no. The ALU does not understand any notation. It only understands instructions. So, for example, the machine code might include an instruction to "add R10 to R11 and store the result in R9," say (disassembled) ADD R9, R10, R11, but the machine code "notation" is understood by the Control Unit, not the ALU.
By the time the ALU receives the information, it is encoded in the form of various control lines being asserted. For instance, in the above example, the CU might assert control lines for "add," "input A is R10," "input B is R11," and "store result in R9." These lines determine how the ALU and the register file behave, and result in the operation desired.
Textual notation, such as 5 + 8 or (+ x 19) or x 19 15 + * or indeed ADD R9, R10, R11, is understood by software, doing processing at a much higher level than the ALU does. It is that software that interprets, say, postfix notation, and issues the instructions that cause the ALU to execute the desired operations.
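For concreteness, here is a minimal sketch of the kind of software layer that does understand postfix notation: a tiny RPN evaluator. The function name and the fixed-size stack are illustrative choices; the point is that by the time the ALU is involved, only ordinary add/multiply instructions and values remain.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Evaluate a space-separated postfix expression such as "5 8 +". */
int eval_rpn(char *expr)
{
    int stack[64], sp = 0;
    for (char *tok = strtok(expr, " "); tok; tok = strtok(NULL, " ")) {
        if (strcmp(tok, "+") == 0) {            /* becomes an ADD the ALU executes */
            int b = stack[--sp], a = stack[--sp];
            stack[sp++] = a + b;
        } else if (strcmp(tok, "*") == 0) {     /* becomes a MUL */
            int b = stack[--sp], a = stack[--sp];
            stack[sp++] = a * b;
        } else {
            stack[sp++] = atoi(tok);            /* operand: just a value pushed on a stack */
        }
    }
    return stack[0];
}

int main(void)
{
    char expr[] = "5 8 +";
    printf("%d\n", eval_rpn(expr));             /* prints 13 */
    return 0;
}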
I've run into a roadblock and I'm not sure where to go from here. I know MOV can only store a maximum of 0-65535 in decimal. However, the value I'm trying to store is 30,402,460.
I was going to use MVN, as some sources mention it can store a bigger number in this format:
MVN X9, #(num), #(num). However, 30402460 / 110 only equals 276,386, which is still above the maximum of 65535, so I can't use MVN.
My question is: how do we store 30402460 in X9?
MOV is not a real instruction. It's an alias that will be turned into either MOVZ, MOVN or ORR as appropriate. But each of those has its own constraints:
MOVZ can load an arbitrary 16-bit value, left-shifted by 0, 16, 32 or 48 bits. So you can only use it for values in which everything outside a single aligned 16-bit field is zero.
MOVN does the same as MOVZ, but inverts the register value. So you can only use it for values in which everything outside a single aligned 16-bit field is all ones.
ORR can construct a fairly complicated bitmask which, as best as I can tell, can be any sequence of consecutive runs of zeroes and ones (zeroes, ones, zeroes or ones, zeroes, ones), replicated at any power-of-two element size that divides the register width. So you can load values like 0xaaaaaaaaaaaaaaaa, 0xfffffff00fffffff and 0x18, but not values like 0x0 or 0x5.
The value 30402460 matches none of these constraints. The usual practice for loading such values is to use MOVZ followed by MOVK, the latter of which allows replacing 16 bits of a register without changing the other bits. So in your case:
movz x9, 0x1cf, lsl 16
movk x9, 0xe79c
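To see why those two immediates are the right ones: 30402460 is 0x01CFE79C, so the upper halfword is 0x01CF and the lower halfword is 0xE79C, which is exactly what the MOVZ/MOVK pair reassembles. A quick check in C (plain arithmetic, nothing ARM-specific):
#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint64_t value = 30402460;              /* the value from the question */
    uint64_t hi = (value >> 16) & 0xFFFF;   /* 0x01CF: what MOVZ ..., lsl 16 loads */
    uint64_t lo = value & 0xFFFF;           /* 0xE79C: what MOVK patches in */
    assert(((hi << 16) | lo) == value);     /* the two halves rebuild the original */
    return 0;
}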
I have devised a hypothetical instruction set that I believe is Turing-complete. I cannot think of any computational operation that this instruction set is not able to complete. I would just like to verify that this hypothetical instruction set is indeed Turing-complete.
Registers
IP = Instruction Pointer
FL = Flags
Memory
Von Neumann
Other Info
Instructions have two forms (the conditional/immediate jumps being the outliers):
Acting on memory using a scalar
Acting on memory using other memory
All instructions act on unsigned integers of a fixed size
Each instruction consists of a one-byte opcode with three trailing one-byte arguments (not all arguments need to be used; a sketch of this encoding follows the flag list below).
Assume a byte is able to hold any memory address
Instructions
Moving = MOV
Basic Arithmetic = ADD, SUB, MUL, DIV, REM
Binary Logic = AND, NOT, OR, NOR, XOR, XNOR, NAND, SHR, SHL
Comparison = CMP (this instruction sets the flags (excluding the unsigned integer overflow flag) by comparing two values)
Conditional Jumps = JMP, various conditional jumps based on the flags
Flags
"Unsigned Integer Overflow"
">"
"<"
">="
"<="
"=/="
"="
I am porting 32-bit NEON asm code to NEON intrinsics, and I am wondering if this code can be written in a concise way using intrinsics:
vst4.32 {d0[0], d2[0], d4[0], d6[0]}, [%[v1]]!
1) The previous code operates on q registers, but when it comes to storage, instead of using q0, q1, q2 and q3, it has to recreate vectors which have each part in one of the d registers, e.g. v1[0] = d0[0], v1[1] = d2[0] ... v2[0] = d0[1], v2[1] = d2[1] ... v3[0] = d1[0], v3[1] = d3[0] ... etc.
This operation is a one-liner in asm, but with intrinsics I don't know if I can do that without first splitting high and low bits and building a new float32x4x4_t variable to feed to vst4_f32.
Is that possible?
2) I'm not entirely sure of what [%[v1]]! does (yes, I googled quite a bit): it should be a reference to a variable named v1 and the exclamation mark will do writeback, which should mean the pointer is increased by the same amount that was written by the instruction on the same line.
Correct? Any way of replicating that with intrinsics?
After some more investigation I found this specific instruction to store a specific lane of an array of 4 vectors, so no need to split into high and low bits variables:
float32x4x4_t u = { q0, q1, q2, q3 };
vst4q_lane_f32(v1, u, 0);
v1 += 4;
Writeback is just an increased pointer, as #charlesbaylis wrote.
In principle, a sufficiently smart compiler could use the instruction you want for the vst4_f32 intrinsic, but in practice, no compiler is that good.
To get the post-index writeback, you can write
vst4_f32(ptr, v);
ptr += 4;
Some compilers will recognise this. GCC 5.1 (when released) will do this in at least some cases.
[Edit: misread the question, vst4q_lane_f32 does map to the required instruction perfectly]
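Putting the pieces together, a minimal self-contained version of the lane store plus manual writeback might look like this (the argument names echoing the q registers are purely illustrative):
#include <arm_neon.h>

/* Stores q0[0], q1[0], q2[0], q3[0] interleaved at v1, mimicking
   "vst4.32 {d0[0], d2[0], d4[0], d6[0]}, [%[v1]]!", and returns the
   advanced pointer as the manual equivalent of the writeback. */
float *store_lane0(float *v1, float32x4_t q0, float32x4_t q1,
                   float32x4_t q2, float32x4_t q3)
{
    float32x4x4_t u = { { q0, q1, q2, q3 } };
    vst4q_lane_f32(v1, u, 0);
    return v1 + 4;   /* four floats were written, so advance by four */
}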
It seems to be inline assembly.
Anyway, the answers are:
1) No
2) Yes
While reading this question and its answer I couldn't help but think why is it obligatory for the hardware to support virtual memory?
For example, can't I simulate this behavior with software only (e.g. the OS would represent all the memory as some table, intercept all memory-related actions and do the mapping itself)?
Is there any OS that implements such techniques?
As far as I know, no.
Intercept all memory-related actions? It doesn't look impossible, but it would be very, very slow.
For example, suppose this code:
int f(int *a1, int *b1, int *c1, int *d1)
{
    const int n = 100000;
    for (int j = 0; j < n; j++) {
        a1[j] += b1[j];
        c1[j] += d1[j];
    }
}
(From here >o<)
This simple loop is compiled into the following by gcc -std=c99 -O3 with gcc 4.8.3:
push %edi ; 1
xor %eax,%eax
push %esi ; 2
push %ebx ; 3
mov 0x10(%esp),%ecx ; 4
mov 0x14(%esp),%esi ; 5
mov 0x18(%esp),%edx ; 6
mov 0x1c(%esp),%ebx ; 7
mov (%esi,%eax,4),%edi ; 8
add %edi,(%ecx,%eax,4) ; 9
mov (%ebx,%eax,4),%edi ; 10
add %edi,(%edx,%eax,4) ; 11
add $0x1,%eax
cmp $0x186a0,%eax
jne 15 <_f+0x15> ; 12
pop %ebx ; 13
pop %esi ; 14
pop %edi ; 15
ret ; 16
Even this really simple function contains 16 machine instructions that access memory (numbered in the listing above). The OS's simulation code would probably run to hundreds of instructions per access, so we can expect memory-accessing code to slow down by a factor of hundreds at the very least.
Moreover, that assumes you could watch only the memory-accessing instructions. Your processor probably doesn't have such a feature, so you would have to single-step every instruction, for example using x86's Trap Flag, and check each one every time.
Things get worse: it isn't enough to check data accesses. You probably want the IP (Instruction Pointer) to follow your OS's virtual memory rules as well, so you must check whether the IP has crossed a page boundary after each instruction runs. You also have to deal very carefully with instructions that change the IP, such as jmp, call, ret, ...
I don't think it can be implemented efficiently; in fact, it cannot. Speed is one of the most important parts of an operating system: if the OS becomes even a little slower, the whole system is affected, and in this case it's not a little, your computer would become enormously slow. Moreover, implementing this is very difficult, as described above. I'd rather write an emulator for a processor with hardware-supported virtual memory than take on this crazy job!
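To get a feel for the cost, here is a minimal sketch of what a software-only MMU would have to do on every single load or store; the 4 KiB page size and the flat table layout are assumptions for illustration:
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12                  /* assume 4 KiB pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

typedef struct {
    uint8_t *frame;                    /* physical frame backing this page, or NULL */
} pte_t;

static pte_t page_table[1u << 20];     /* one entry per page of a 32-bit address space */

/* Every emulated load or store would have to go through something like this first. */
static uint8_t *translate(uint32_t vaddr)
{
    pte_t *pte = &page_table[vaddr >> PAGE_SHIFT];
    if (pte->frame == NULL)
        return NULL;                   /* "page fault": the OS must map a frame first */
    return pte->frame + (vaddr & (PAGE_SIZE - 1));
}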
I have always wondered what I can do with things like these:
ONEWORDINLINE(w1)
TWOWORDINLINE(w1, w2)
THREEWORDINLINE(w1, w2, w3)
up to
TENWORDINLINE(w1, w2, w3, w4, w5, w6, w7, w8, w9, w10)
ELEVENWORDINLINE(w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11)
TWELVEWORDINLINE(w1, w2, w3, w4, w5, w6, w7, w8, w9, w10, w11, w12)
How do I use these macros?
When should I use them?
Why do they go from 1 to 12 and not up to, say, 100?
That is long-dead technology of the ancient Mac programmers, sealed in a tomb best left untouched. But if you're interested, off we go, brave adventurer!
Here are the relevant #define-s straight from Apple:
#define ONEWORDINLINE(w1) = w1
#define TWOWORDINLINE(w1,w2) = {w1,w2}
#define THREEWORDINLINE(w1,w2,w3) = {w1,w2,w3}
/* ...etc... */
#define TWELVEWORDINLINE(w1,w2,w3,w4,w5,w6,w7,w8,w9,w10,w11,w12) = {w1,w2,w3,w4,w5,w6,w7,w8,w9,w10,w11,w12}
Now to some explaining.
A little history lesson: back in the ancient days when the Mac was implemented on the (just as ancient) Motorola 68k, Apple set up its system calls in the following highly compact and very usable way: they mapped them to words starting with 1010b (0xA), as those were reserved by the Motorola devs for this use. These sys-calls and their mappings were called A-Traps after this hex value (no relation to "IT'S A TRAP!", honestly). In hex they looked like this: 0xA869 (this example is the A-Trap for the FixRatio(short numer, short denom) system call). This technology was originally created for the Mac Toolbox API.
When a compiler targeting the Mac on 68k (such a compiler should set the TARGET_OS_MAC and TARGET_CPU_68K macros to 1 or TRUE and TARGET_RT_MAC_CFM to 0 or FALSE, BTW) saw a function prototype with an assignment (=) after it, it treated the prototype as referring to the A-Trap system call indicated by the value to the right of the assignment operator, which could be a single one-word integer literal starting with 0xA (0xA???). So ONEWORDINLINE was basically a stylish macro way of saying "it's an A-Trap!".
So, here's what a sys-call function prototype declaration for 68k would look like:
EXTERN_API(Fixed) FixRatio(short numer, short denom) ONEWORDINLINE(0xA869);
This would be preprocessed to something like this:
extern Fixed FixRatio(short numer, short denom) = 0xA869;
Now, you might be thinking: if we're indexing system calls by one word, and one-fourth of that word is taken up by the leading 0xA, that allows only 4096 functions at most (there were far fewer in reality, as many A-Traps actually mapped to the same system-call subroutines, only with different parameters); how is that enough? Well, obviously, it isn't. That's where selectors come in.
A-Traps like _HFSDispatch (0xA260) were called "selectors" because they had the job of selecting and calling another subroutine determined by values on the stack. So, when a function prototype was "assigned" an "array" of one-word integer literals, all but the last one (called the selector code(s)) got pushed onto the stack, and the last one was treated as an A-Trap selector that grabbed the words pushed onto the stack and called the appropriate subroutine. The maximum number of words in such an array was 12 because that was enough for the Mac Toolbox.
The macros TWOWORDINLINE through TWELVEWORDINLINE handled selector A-Traps. For example:
EXTERN_API(OSErr) ActivateTSMDocument(TSMDocumentID idocID) TWOWORDINLINE(0x7002, 0xAA54);
would be preprocessed to something like
extern OSErr ActivateTSMDocument(TSMDocumentID idocID) = {0x7002, 0xAA54};
Here, 0x7002 is the selector code, and 0xAA54 is the selector A-Trap.
So, to sum it all up, you only need these if you want to do some coding for pre-1994 Macs running on a Motorola 68k. So iOS isn't really relevant here ;)
Disclaimer: I know this stuff in theory only and I may have made a mistake somewhere. If there are any old-timers experienced with this stuff, please correct me if I got something wrong!