x86: accessing unaligned pixel bytes of BMP image - image-processing

I am working on a program in C + x86 assembly (NASM) which performs rotating and scaling of an image. In order to do that it goes through pixels of the destination image one by one and calculates the corresponding pixel in the source image.
That part of the assembly code:
; buffer operations
push ebx
fistp dword loc ; store the number of src pixel
mov dword ebx, loc ; move it to ebx
imul ebx, 3 ; 3 bytes per pixel, so multiply pixel number by 3
mov dword eax, [ebx+esi]; store that pixel's color bytes ; ERROR, SEGSEV
mov dword [edi], eax ; draw a pixel
pop ebx
particularly the line marked 'ERROR, SEGSEV' generates a segmentation fault.
I reckon that is due to the fact that I'm trying to access the unaligned memory address.
That said, the bmp file pixel buffer is organised in a way that each pixel has B, G, R bytes stored one after another, so 3 bytes per pixel and each pixel's first byte can have a position in memory that is not divisible by 4 (eg: pixel one: 0.B, 1.G, 2.R; pixel two: 3.B, 4.G, 5.R - so I must access the address 3 to get to the second pixel).
The question is: how then can I access pixel's data if I'm not allowed to access unaligned memory location and how is it usually done when working with bmp files?

The OP assumed that the x86 architecture cannot access unaligned data (data at an address which is not a multiple of the size of the register being used), which is a common problem in Reduced Instruction Set Computing (RISC) processors. For example:
mov ebx, 0x12345677 ; Note the odd address
mov eax, [ebx] ; Load a 32-bit register from an odd address
In a RISC architecture like ARM or PowerPC, this can indeed cause problems: either always (older ARM cores) or unless the check is disabled (PowerPC). On x86, unaligned accesses have always been possible, albeit sometimes at a speed penalty. Alignment checking only became available with the 486, and that check is almost always turned off.
It turned out that the problem was elsewhere:
mov dword eax, [ebx+esi]; store that pixel's color bytes ; ERROR, SEGSEV
The OP hadn't confirmed that esi held the desired value.
Note, though, that an unaligned access with the x86 can still cause problems. All segments are defined to be a certain size (even if that size is 4GiB). An access near the top of that segment can "overflow" the segment - and unaligned accesses are the easiest way to suffer this.
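Separately from the alignment question, the code in the question loads and stores a full dword for a 3-byte pixel, so the last pixel of the buffer reads one byte past its end. A minimal C sketch of a byte-exact copy follows. The helper name `copy_bgr_pixel` and its parameters are hypothetical, and it assumes a tightly packed B, G, R buffer as described in the question (real BMP rows are padded to a multiple of 4 bytes, which this sketch ignores).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy one 24-bit BGR pixel from src to dst.
 * src_size is the size of the source pixel buffer in bytes. The copy is
 * refused (returns 0) if the pixel would lie outside that buffer. */
int copy_bgr_pixel(uint8_t *dst, const uint8_t *src, size_t src_size,
                   size_t pixel_index)
{
    assert(dst != NULL && src != NULL);
    size_t offset = pixel_index * 3;       /* 3 bytes per pixel: B, G, R */
    if (src_size < 3 || offset > src_size - 3)
        return 0;                          /* would read past the buffer */
    memcpy(dst, src + offset, 3);          /* exactly 3 bytes, any alignment */
    return 1;
}
```

In assembly the equivalent is to move a word plus a byte (or three bytes) instead of a dword, and to check the computed pixel index against the buffer size before using it.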

Related

How does this arm64 instruction allocate enough space on the stack for 2 register values?

I am referring to this ARM64 documentation: https://developer.arm.com/documentation/102374/0101/Loads-and-stores---load-pair-and-store-pair
It has this instruction:
STP X0, X1, [SP, #-16]!
The description is:
Load and store pair instructions are often used for pushing, and popping off the stack.
This first instruction pushes X0 and X1 onto the stack
If these registers in arm64 are 128-bits (16 bytes), I assume we'd need 32 bytes total to store 2 of them on the stack, but the instruction above only subtracts 16 bytes from the stack pointer.
I must be misunderstanding the SP, #-16. Does this actually make enough space for the 2 registers to be copied in?
The general-purpose registers are 64 bits wide (not 128 bits, as assumed in the question), so the 16 bytes allocated on the stack are exactly enough to hold x0 and x1.
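To make the arithmetic concrete, here is a toy C model of the pre-indexed store pair. `ToyStack` and `toy_stp_push` are made-up names for illustration only; the model just mirrors "subtract 16 from SP, then store two 8-byte registers at SP and SP + 8".

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of "STP X0, X1, [SP, #-16]!" on a byte-addressed stack.
 * words[] and sp are stand-ins for real memory and the SP register. */
typedef struct {
    uint64_t words[8];   /* 64 bytes of pretend stack */
    unsigned sp;         /* byte offset of the stack top within words[] */
} ToyStack;

void toy_stp_push(ToyStack *s, uint64_t x0, uint64_t x1)
{
    assert(s->sp >= 16 && s->sp % 8 == 0);
    s->sp -= 16;                    /* pre-index: SP = SP - 16 */
    s->words[s->sp / 8]     = x0;   /* first register at [SP]      */
    s->words[s->sp / 8 + 1] = x1;   /* second register at [SP + 8] */
}
```

Two 8-byte registers occupy 2 * 8 = 16 bytes, which is exactly the amount the instruction subtracts from SP.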

LC-3: BLKW how to specify memory location to store data at?

In LC-3 when you use BLKW how do you initialize the block of memory to be at location x3102 instead of the next available memory location?
First, let's make a side note that all the memory of the 16-bit address space is there for you to use, according to the memory map (e.g. at least in the range x3000-xFDFF), and, it is initialized to zero; you can use it at will.
Generally speaking, the LC-3 assemblers don't allow multiple .ORIG directives throughout the file, instead, they require one at the beginning of the file. If they did allow subsequent .ORIG directives, this would be a way to accomplish what you're asking about.
But even if they did, frequently, we'd run into instruction offset encoding limitations. So there's an alternate solution I'll show below.
But first, let's look at the instruction offset/immediate encoding limitations.
The usual data memory access instruction formats have a very limited offset: a signed 9-bit field (-256 to +255 words), and the offset is PC-relative, measured from the already-incremented PC. So, for example, the following won't work:
.ORIG x3000
LEA R0, TESTA
LD R1, TESTA
LEA R2, TESTB ; will get an error due to instruction offset encoding limitation
LD R3, TESTB ; will get an error due to instruction offset encoding limitation
HALT
TESTA
.FILL #1234
.BLKW x1FA ; exactly enough padding to relocate TESTB to x3200
TESTB
.FILL #4321 ; which we can initialize with a non-zero value
.END
This is illustrative: while this will successfully place TESTB at x3200, neither the LEA nor the LD instruction can reach it, because the target lies 509 and 508 words past the respective incremented PC, and the 9-bit PC-relative displacement only spans -256 to +255 words. (With padding of only xFA, TESTB would land at x3100, the offsets would be 253 and 252 words, and both instructions would still assemble.)
(There is also the other practical limitation that as instructions are added the .BLKW operand has to shrink in size, which is clearly painful — this aspect would have been eliminated by supporting a .ORIG directive within.)
So, the alternative for large blocks and the like is to use zero-initialized memory elsewhere and to reference it through pointer variables placed nearby: use LD rather than LEA to load an address, and LDI rather than LD to access a value.
.ORIG x3000
LEA R0, TESTA
LD R1, TESTA
LD R2, TESTBRef ; will put x3100 into R2
LDI R3, TESTBRef ; will access the memory value at address x3100
HALT
TESTA
.FILL #1234
TESTBRef ; a nearby data pointer, that points using the full 16-bits
.FILL x3100
.END
In the latter, above, there is no declaration to reserve storage at x3100, nor can we initialize that storage at x3100 with non-zero initialization (e.g. no strings, no pre-defined arrays).
A data-to-data pointer, TESTBRef, is used. Unlike code-to-code/data references (i.e. instructions referencing code or data), data-to-code/data pointers (i.e. data referencing code or data) have all 16 bits available for pointing.
So, once we use this approach, of simply using other memory, we forgo the automatic placement of labels after other labels (for those other areas), and also forgo non-zero initialization.
Some LC-3 assemblers will allow multiple files, and each file can have its own .ORIG directive, so by using multiple files we can place code and data at varied locations in the address space. However, the code-to-data instruction offset encoding limits still apply, so you'll likely end up managing such memory areas manually anyway, and using data pointers as well.
Note that the JSR instruction has an 11-bit offset so code-to-code references can reach farther than code-to-data references.
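The reach of these PC-relative fields can be checked with a few lines of C. This is only a sketch: `lc3_offset_fits` is a made-up helper, it takes the field width as a parameter (9 for LD/LEA/LDI, 11 for JSR), and it ignores wraparound at the ends of the 16-bit address space.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if an instruction at instr_addr can reach target with a signed
 * PC-relative offset of the given width (9 for LD/LEA/LDI, 11 for JSR). */
int lc3_offset_fits(uint16_t instr_addr, uint16_t target, int bits)
{
    assert(bits >= 2 && bits <= 16);
    int32_t pc     = (int32_t)instr_addr + 1;   /* PC is already incremented */
    int32_t offset = (int32_t)target - pc;
    int32_t min    = -(1 << (bits - 1));
    int32_t max    =  (1 << (bits - 1)) - 1;
    return offset >= min && offset <= max;
}
```

For an instruction at x3000, a label at x3050 is within the 9-bit range, a label at x4000 is not, and a label at x3300 is out of range for LD or LEA but within range for JSR.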

Assembly memory addressing in big endian format

Kinda stuck here and was hoping for a pointer on memory addressing.
In theory, these represent R1 through R4. I assume 0x60 is R1, and 0x6C is R4, incrementing by a word each time. Is that the case?
If I wanted to run
ADD R1, R2
Would it store the result of the addition of 0x60 and 0x6C in memory location 0x60? Or am I looking at this wrong?
ARM registers do not correspond to any memory location. In some contexts ("spill slots" on the stack, "task state" used for multitasking) there will be memory locations reserved to save the contents of some or all registers, but they must be explicitly copied back and forth.
The problem you're trying to do is poorly worded, but I think the table gives the values of memory locations 0x60 through 0x6C, and, separately, the text ("[R1] = ..., [R2] = ..., etc") gives the values of the registers. If I'm reading this right, the instruction labeled (a) will copy the low byte of the value at memory location 0x62, which is either 0x9A or 0x90, I'm not sure which, into register R1, sign-extending it. I hope that's enough to get you unstuck.
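Sign extension of a loaded byte can be sketched in C as follows. The function name is made up, and the sketch only models the arithmetic, not the memory access or the choice of byte within the word.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend an 8-bit value to 32 bits, as a signed byte load does. */
int32_t sign_extend_byte(uint8_t b)
{
    int32_t result = (int32_t)(b ^ 0x80u) - 0x80;  /* flip bit 7, then re-bias */
    assert(result >= -128 && result <= 127);
    return result;
}
```

With this, a byte of 0x9A becomes 0xFFFFFF9A in the register and 0x90 becomes 0xFFFFFF90, while a byte with bit 7 clear, such as 0x7F, is left as 0x0000007F.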

SIMD zero vector test

Does there exist a quick way to check whether a SIMD vector is a zero vector (all components equal +-zero). I am currently using an algorithm, using shifts, that runs in log2(N) time, where N is the dimension of the vector. Does there exist anything faster? Note that my question is broader (tags), than the proposed answer and it refers to vectors of all types (integer, float, double, ...).
How about this straightforward AVX code? I think it's O(N), and I don't know how you could possibly do better without making assumptions about the input data: you have to actually read every value to know whether it is 0, so it's about doing as much of that as possible per cycle.
You should be able to massage the code to your needs. Should treat both +0 and -0 as zero. Will work for unaligned memory addresses but aligning to 32 byte addresses will make the loads faster. You may need to add something to deal with remaining bytes if size isn't a multiple of 8.
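#include <immintrin.h>
#include <stdint.h>

// Counts the floats that are neither +0.0 nor -0.0; size must be a multiple of 8.
uint64_t num_non_zero_floats(const float *mem_addr, int size) {
    uint64_t num_non_zero = 0;
    __m256 zeros = _mm256_setzero_ps();
    for (int i = 0; i != size; i += 8) {
        __m256 vec = _mm256_loadu_ps(mem_addr + i);
        // NEQ_UQ sets a lane when the value differs from zero (NaN counts as non-zero)
        __m256 comparison_out = _mm256_cmp_ps(zeros, vec, _CMP_NEQ_UQ); // 3 cycles latency, throughput 1
        uint64_t bits_non_zero = (uint64_t)_mm256_movemask_ps(comparison_out); // 2-3 cycles latency
        num_non_zero += (uint64_t)__builtin_popcountll(bits_non_zero);
    }
    return num_non_zero;
}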
If you want to test floats for +/- 0.0, then you can check for all the bits being zero, except the sign bit. Any set-bits anywhere except the sign bit mean the float is non-zero. (http://www.h-schmidt.net/FloatConverter/IEEE754.html)
Agner Fog's asm optimization guide points out that you can test a float or double for zero using integer instructions:
; Example 17.4b
mov eax, [rsi]
add eax, eax ; shift out the sign bit
jz IsZero
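The same integer trick can be written in C. This is a sketch under the assumption that float is a 32-bit IEEE 754 type (checked with a static assertion); `is_zero_float` is a made-up name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "assumes 32-bit IEEE 754 float");

/* Returns 1 for +0.0f and -0.0f, 0 for everything else (including NaN). */
int is_zero_float(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);      /* well-defined way to read the bit pattern */
    return (uint32_t)(bits << 1) == 0;   /* shift out the sign bit, like add eax, eax */
}
```

Masking with 0x7fffffff instead of shifting gives the same answer; the shift simply mirrors the add in the assembly above.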
For vectors, though, using ptest with a sign-bit mask is better than using paddd to get rid of the sign bit. Actually, test dword [rsi], 0x7fffffff may be more efficient than Agner Fog's load/add sequence, but a 32-bit immediate probably stops the load from micro-fusing on Intel, and it may have a larger code size.
x86 PTEST (SSE4.1) does a bitwise AND and sets flags based on the result.
movdqa xmm0, [mask]
.loop:
ptest xmm0, [rsi+rcx]
jnz nonzero
add rcx, 16 ; count up towards zero
jl .loop ; with rsi pointing to past the end of the array
...
nonzero:
Or cmov could be useful to consume the flags set by ptest.
IDK if it'd be possible to use a loop-counter instruction that didn't set the zero flag, so you could do both tests with one jump instruction or something. Probably not. And the extra uop to merge the flags (or the partial-flags stall on earlier CPUs) would cancel out the benefit.
@Iwillnotexist Idonotexist: re one of your comments on the OP: you can't just movemask without doing a pcmpeq first, or a cmpps. The non-zero bit might not be in the high bit! You probably knew that, but one of your comments seemed to leave it out.
I do like the idea of ORing together multiple values before actually testing. You're right that sign-bits would OR with other sign-bits, and then you ignore them the same way you would if you were testing one at a time. A loop that PORs 4 or 8 vectors before each PTEST would probably be faster. (PTEST is 2 uops, and can't macro-fuse with a jcc.)
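The OR-before-test idea can be sketched in portable scalar C. This is not the SIMD version: it ORs four 32-bit patterns where a real loop would POR whole vectors, and `all_zero_floats` is a made-up name. It assumes 32-bit IEEE 754 floats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if every float in v[0..n) is +0.0 or -0.0.
 * Scalar sketch: OR four bit patterns together, then do one masked test. */
int all_zero_floats(const float *v, size_t n)
{
    assert(v != NULL || n == 0);
    const uint32_t not_sign = 0x7fffffffu;   /* every bit except the sign bit */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t w[4];
        memcpy(w, v + i, sizeof w);
        uint32_t acc = w[0] | w[1] | w[2] | w[3];  /* sign bits only OR with sign bits */
        if (acc & not_sign)
            return 0;
    }
    for (; i < n; i++) {                     /* leftover elements */
        uint32_t w;
        memcpy(&w, v + i, sizeof w);
        if (w & not_sign)
            return 0;
    }
    return 1;
}
```

Because the sign bits are masked off only once per group, the number of tests and branches drops by the group size, which is the saving the PTEST version is after.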

XOR instruction not working as thought (Intel 8086)

I am studying a topic that fascinates me: reverse engineering. But I have run into a little speed bump. I know the bitwise operator xor and what it does to the bits, but it doesn't seem to be working correctly when I watch it in action in the disassembler. The small segment of code I am dealing with is:
MOV EAX, 0040305D
XOR DWORD PTR [EAX], 1234567
Before the xor has taken place, the number that resides at the location 0040305D is 1234 or 31323334 hexadecimal (It is represented as ASCII because it was taken from a user input and it firmly resides as 31323334 in memory). When I looked up a xor calculator on the internet to check to make sure I was doing everything alright on paper I got the result of the xor calculation as 30117653 hexadecimal. But when I run the operation in disassembler it replaced the memory location held in EAX with 56771035.
What just happened? Am I missing something here? I checked the xor calculation on many calculators and I am not able to get the answer of 56771035. Can someone give me a hand and tell me what I am doing wrong?
-Dan
The numbers displayed are all in hex, and you have forgotten to account for endianness. If the user input was ASCII 1234, memory contains the bytes 31 32 33 34. Since x86 is little-endian, the operand 1234567 is stored as the byte sequence 67 45 23 01. XORing byte by byte gives the sequence 56 77 10 35, which is what you see.
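The byte-by-byte calculation can be reproduced in C. This sketch spells out the little-endian byte order explicitly, so it gives the same result on any host; `xor_dword_le` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XOR a 32-bit immediate into 4 bytes of memory the way a little-endian
 * CPU does: the least significant byte of imm meets the lowest address. */
void xor_dword_le(uint8_t *mem, uint32_t imm)
{
    assert(mem != NULL);
    for (size_t i = 0; i < 4; i++)
        mem[i] ^= (uint8_t)(imm >> (8 * i));
}
```

Applied to the bytes 31 32 33 34 with the immediate 0x01234567, this leaves 56 77 10 35 in memory, matching what the debugger shows.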
