x86 32bit Assembly Parser | logical problem - parsing

I'm currently working on an obfuscator for assembled x86 code (working with the raw bytes).
To do that I first need to build a simple parser, to "understand" the bytes.
I'm using a database that I created for myself, mostly with the website: https://defuse.ca/online-x86-assembler.htm
Now my question:
Some bytes can be interpreted in two ways, for example (Intel syntax):
1. f3 00 00 repz add BYTE PTR [eax],al
2. f3 repz
My idea was to loop through the bytes and handle every instruction individually,
but when I reach the byte 0xf3 I have two ways of interpreting it.
I know there are working x86 disassemblers out there, so how do I know which case it is?

Prefixes, including the repz prefix, are not meaningful without the subsequent instruction. The subsequent instruction may incorporate the prefix (repz nop is pause), change its meaning (repz acts as xrelease when used before certain interlocked instructions), or the prefix may simply be invalid.
The decoding is always unambiguous, otherwise the CPU could not execute instructions. It can only appear ambiguous if you don't know the exact byte offset at which to begin decoding (since x86 uses variable-length instructions).
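As an illustration, here is a minimal sketch (Python, not a full x86 decoder; the prefix table and function names are mine, for illustration only) of the usual approach: collect any prefix bytes first, then attach them to the instruction whose opcode follows.

# Legacy prefix bytes: lock, repne, rep/repe, segment overrides, operand/address size.
PREFIXES = {0xF0, 0xF2, 0xF3, 0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65, 0x66, 0x67}

def split_prefixes(code, offset=0):
    """Return (prefix_bytes, opcode_offset) for the instruction starting at offset."""
    prefixes = []
    while offset < len(code) and code[offset] in PREFIXES:
        prefixes.append(code[offset])
        offset += 1
    return prefixes, offset

# f3 00 00: the f3 prefix belongs to the "add BYTE PTR [eax],al" that follows it.
prefixes, opcode_at = split_prefixes(bytes([0xF3, 0x00, 0x00]))
print([hex(p) for p in prefixes], opcode_at)   # ['0xf3'] 1

In other words, 0xf3 is never a complete instruction on its own; the decoder keeps reading until it finds the opcode and only then decides what the prefix means.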

Related

How should assemblers distinguish between symbol and all-alpha hex value?

I'm learning some 8080 assembly, which uses the older suffix H to indicate hexadecimal constants (vs modern prefix 0x or $). I'm also noodling around with a toy assembler and thinking about how to tokenize source code.
It's possible to write a valid hex constant (say) BEEFH, which contains only alphabetical characters. It's also possible to define a label called BEEFH. So when I write:
ORG 0800H
START: ...
JMP BEEFH ; <--- how is this resolved?
....
BEEFH: ...
...
This should be syntactically valid based on the old Intel docs: BEEFH meets the naming rules for labels, and of course is also a valid 16-bit address. The ambiguity of whether the operand to JMP here is an address constant or an identifier seems like a problem.
I don't have access to the original 8080 assembler to see what it does with this example. Here's an online 8080 assembler that appears to parse the operand to JMP as a label reference in all cases, but obviously a proper assembler should be able to target an absolute address with a JMP instruction.
Can anyone shed light on what the conventions around this actually are/should be? Am I missing something obvious? Thanks.
Someone left a comment that they then deleted, but I looked again and they were right on. Apparently I missed the note in the old Intel manual that says this about hex constants:
Hex constants must begin with a decimal digit. So the constant form of this value would be written 0BEEFH, which can never be mistaken for a label, and that's certainly how you avoid the semantic ambiguity when parsing. It seems a bit inelegant as a solution, but I guess you could also just use a modern prefix.
Thanks, anonymous commenter!
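For what it's worth, here's a toy sketch of how that rule removes the ambiguity at tokenization time (Python; the identifier pattern is only my assumption about the label syntax):

import re

# Intel rule: a hex constant must begin with a decimal digit, so BEEFH can only
# ever be an identifier, while the address constant is written 0BEEFH.
HEX_CONST = re.compile(r"[0-9][0-9A-Fa-f]*H")
IDENTIFIER = re.compile(r"[A-Za-z?@][A-Za-z0-9?@]*")   # assumed label rules

def classify(token):
    if HEX_CONST.fullmatch(token):
        return ("hex constant", int(token[:-1], 16))
    if IDENTIFIER.fullmatch(token):
        return ("label reference", None)
    raise ValueError("unrecognized token: " + token)

print(classify("0BEEFH"))   # ('hex constant', 48879)
print(classify("BEEFH"))    # ('label reference', None)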

Lua/LuaJIT decompilation challenge

I have stumbled upon a Lua script which I am trying to decompile. So far I have tried all the different versions of the standard Lua decompilers, such as unluac and luadec. I either get a "not a precompiled chunk" or a "bad header in precompiled chunk" error.
I have also tried different Lua versions and 32-bit and 64-bit architectures for the decompilers.
I have looked at the header and it reads something like this in hex: 1b 4c 4a 01 02 52 20 20. It looks almost correct, but it seems to me like the signature starts with one extra byte and the 6th and 7th bytes are wrong. Also, there is a 52 in there, which I assume is the Lua version.
As the signature is one byte too long, the normal decompilers don't work. I suspect this might be LuaJIT bytecode, since converting those bytes to ASCII reveals the ESC, L, J sequence, which is the LuaJIT bytecode dump header.
In case this is a LuaJIT binary, where do I start with decompiling it? I have tried some decompilers, but they all seem to be extremely out of date or fail at some point during execution (I am talking about LJD and its derivatives).
Any suggestions on how to analyze, view opcodes or decompile it would be greatly appreciated.
Here is the file I am talking about, in case someone wants to have a go (apologies, it's too long to post here):
https://pastebin.com/eeLHsiXk
https://filebin.net/r0hszoeh8zscp8dh
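For reference, the signature check I'm describing would look something like this as a quick sketch in Python (assuming the usual "\x1bLua" header for PUC-Rio Lua and "\x1bLJ" for LuaJIT):

LUA_SIG = b"\x1bLua"     # PUC-Rio Lua: ESC 'L' 'u' 'a', then a version byte (0x51 = 5.1)
LUAJIT_SIG = b"\x1bLJ"   # LuaJIT: ESC 'L' 'J', then the bytecode dump format version

def identify(path):
    with open(path, "rb") as f:
        head = f.read(5)
    if head.startswith(LUAJIT_SIG):
        return "LuaJIT bytecode, dump version %d" % head[3]
    if head.startswith(LUA_SIG):
        return "Lua bytecode, version 0x%02X" % head[4]
    return "not a recognized Lua dump"

With my file starting 1b 4c 4a 01, this reports LuaJIT bytecode with dump version 1, which matches my suspicion.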

How do I get a number from bytes?

I am currently trying to work with Lua 5.1 bytecode. I've gotten pretty far and understand a lot. However, I am stuck on a question about instructions and numbers. I understand that the sizes of instructions and numbers are defined in the header, but I am not sure how to get the actual number from the 4 bytes (or whatever size is specified in the header).
I've looked at output from ChunkSpy and I don't really understand how it went from those bytes to the number. I'd look in the source, but I don't want to just copy it, I want to understand it. If anyone could tell me a bit about it, or even point me in the right direction, I'd be very grateful.
Thank you!
From A No-Frills Introduction to Lua 5.1 VM Instructions, numbers are stored in the constants pool.
The first byte is 3=LUA_TNUMBER.
The next bytes are the number, with the length as given in the header. Interpretation is based on the length, byte order and the integral flag as given in the header.
Typically, non-integral with 8 bytes means IEEE 754 64-bit double.
Deserializing the bytes to a double involves extracting the bits of the sign, exponent, and mantissa, and combining them with arithmetic operations. Perhaps you want to take that on as a challenge and start from a description of the standard: What Every Computer Scientist Should Know About Floating-Point Arithmetic, "Formats and Operations" section.
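If you do want to decode it by hand, here is a minimal sketch of that bit-extraction in Python (little-endian byte order assumed, and the infinity/NaN case left out), checked against the library routine:

import struct

def double_from_bytes(raw):
    # Reassemble the 64-bit pattern (little-endian byte order assumed here).
    bits = int.from_bytes(raw, "little")
    sign = -1.0 if bits >> 63 else 1.0
    exponent = (bits >> 52) & 0x7FF          # 11-bit biased exponent
    mantissa = bits & ((1 << 52) - 1)        # 52-bit fraction
    if exponent == 0:                        # subnormal numbers
        return sign * mantissa * 2.0 ** -1074
    return sign * (1 + mantissa / 2.0 ** 52) * 2.0 ** (exponent - 1023)

raw = struct.pack("<d", 3.14)                # the 8 constant bytes as they might appear in a chunk
assert double_from_bytes(raw) == 3.14

In practice struct.unpack("<d", raw)[0] does the same job in one call; the manual version just shows where the sign, exponent, and mantissa bits live.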

How to interpret unicode encodings

I just finished reading the article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky.
I'd really appreciate clarification on this part of the article.
OK, so say we have a string: Hello which, in Unicode, corresponds to these five code points:
U+0048 U+0065 U+006C U+006C U+006F...That’s where encodings come in.
The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let’s just store those numbers in two bytes each. So Hello becomes
00 48 00 65 00 6C 00 6C 00 6F
Right? Not so fast! Couldn’t it also be:
48 00 65 00 6C 00 6C 00 6F 00 ?
Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.
My questions:
Why could the two zeros at the beginning of 0048 be moved to the end?
What is FE FF and FF FE, what's the difference between them and how were they used? (Yes I tried googling those terms, but I'm still confused)
Why did he then say "Phew. Not every Unicode string in the wild has a byte order mark at the beginning."?
Also, I'd appreciate any recommended resources to learn more about this stuff.
Summary: the 0xFEFF (byte-order mark) character is used to solve the endianness problem for some character encodings. However, most of today's character encodings are not prone to the endianness problem, and thus the byte-order mark is not really relevant for today.
Why could the two zeros at the beginning of 0048 be moved to the end?
If two bytes are used for all characters, then each character is saved in a 2-byte data structure in the memory of the computer. Bytes (groups of 8 bits) are the basic addressable units in most computer memories, and each byte has its own address. On systems that use the big-endian format, the character 0x0048 would be saved in two 1-byte memory cells in the following way:
n n+1
+----+----+
| 00 | 48 |
+----+----+
Here, n and n+1 are the addresses of the memory cells. Thus, on big-endian systems, the most significant byte is stored in the lowest memory address of the data structure.
On a little-endian system, on the other hand, the character 0x0048 would be stored in the following way:
n n+1
+----+----+
| 48 | 00 |
+----+----+
Thus, on little-endian systems, the least significant byte is stored in the lowest memory address of the data structure.
So, if a big-endian system sends you the character 0x0048 (for example, over the network), it sends you the byte sequence 00 48. On the other hand, if a little-endian system sends you the character 0x0048, it sends you the byte sequence 48 00.
So, if you receive a byte sequence like 00 48, and you know that it represents a 16-bit character, you need to know whether the sender was a big-endian or a little-endian system. In the first case, 00 48 would mean the character 0x0048; in the second case, it would mean the totally different character 0x4800.
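A two-line illustration of that ambiguity (Python, purely for demonstration):

raw = bytes([0x00, 0x48])
print(hex(int.from_bytes(raw, "big")))      # 0x48   -> U+0048, 'H'
print(hex(int.from_bytes(raw, "little")))   # 0x4800 -> U+4800, a CJK ideograph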
This is where the FE FF sequence comes in.
What is FE FF and FF FE, what's the difference between them and how were they used?
U+FEFF is the Unicode byte-order mark (BOM), and in our example of a 2-byte encoding, this would be the 16-bit character 0xFEFF.
The convention is that all systems (big-endian and little-endian) save the character 0xFEFF as the first character of any text stream. Now, on a big-endian system, this character is represented as the byte sequence FE FF (assume memory addresses increasing from left to right), whereas on a little-endian system, it is represented as FF FE.
Now, if you read a text stream that has been created by following this convention, you know that the first character must be 0xFEFF. So, if the first two bytes of the text stream are FE FF, you know that this text stream has been created by a big-endian system. On the other hand, if the first two bytes are FF FE, you know that the text stream has been created by a little-endian system. In either case, you can now correctly interpret all the 2-byte characters of the stream.
Why did he then say "Phew. Not every Unicode string in the wild has a byte order mark at the beginning."?
Placing the byte-order mark (BOM) character 0xFEFF at the beginning of each text stream is just a convention, and not all systems may follow it. So, if the BOM is missing, you have the problem of not knowing whether to interpret the 2-byte characters as big-endian or little-endian.
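In code, the convention boils down to a check like this (a sketch in Python; the fallback byte order at the end is just an assumption you are forced to make when the BOM is absent):

def decode_utf16(data):
    if data[:2] == b"\xfe\xff":
        return data[2:].decode("utf-16-be")   # produced by a big-endian system
    if data[:2] == b"\xff\xfe":
        return data[2:].decode("utf-16-le")   # produced by a little-endian system
    return data.decode("utf-16-le")           # no BOM: guess (here: little-endian)

print(decode_utf16(b"\xfe\xff\x00H\x00e\x00l\x00l\x00o"))   # Hello
print(decode_utf16(b"\xff\xfeH\x00e\x00l\x00l\x00o\x00"))   # Hello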
Also, I'd appreciate any recommended resources to learn more about this stuff.
https://en.wikipedia.org/wiki/Endianness
https://en.wikipedia.org/wiki/Unicode
https://en.wikibooks.org/wiki/Unicode/Character_reference
https://en.wikipedia.org/wiki/Byte_order_mark
https://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes
Notes:
Today, the most widely used Unicode-compatible encoding is UTF-8. UTF-8 has been designed to avoid the endianness problem, so the whole byte-order mark 0xFEFF business is not relevant for UTF-8.
The byte-order mark is, however, relevant to the other Unicode-compatible encodings UTF-16 and UTF-32, which are prone to the endianness problem. If you browse through the list of available encodings, for example in the settings of a text editor or terminal, you will see that there are big-endian and little-endian versions of UTF-16 and UTF-32, typically called UTF-16BE and UTF-16LE, or UTF-32BE and UTF-32LE, respectively. However, UTF-16 and UTF-32 are rarely used for files or network transmission in practice.
Other popular encodings used today include those from the ISO 8859 series, such as ISO 8859-1 (and the derived Windows-1252), known as Latin-1, as well as plain ASCII. However, all of these are single-byte encodings, that is, each character is encoded to 1 byte and saved in a 1-byte data structure. Thus, the endianness problem doesn't apply, and the byte-order mark story is also not relevant for these cases.
All in all, the endianness problem for character encodings that you struggled to understand is thus mostly of historical value and not really relevant anymore in today's world.
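A short demonstration of why UTF-8 sidesteps the issue while UTF-16 does not (Python):

text = "Hello"
print(text.encode("utf-8").hex())       # 48656c6c6f            - identical on every machine
print(text.encode("utf-16-be").hex())   # 00480065006c006c006f  - byte order matters
print(text.encode("utf-16-le").hex())   # 480065006c006c006f00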
This is all to do with the internal storage of data in the computer's memory - in this example (00 48), some computers will store the most significant byte first and the least significant byte second (known as big-endian), and others will store the least significant byte first (little-endian). So, depending on your computer, when you read the bytes out of memory you'll get either the 00 first or the 48 first, and you need to know which way round it's going to be to make sure you interpret the bytes correctly. For a more in-depth introduction to the topic, see Endianness on Wikipedia (https://en.wikipedia.org/wiki/Endianness).
These days, most compilers and interpreters will take care of this low-level stuff for you, so you will rarely (if ever) need to worry about it.

Checksum calculation in visual basic

I want to communicate with a medical analyzer and send some messages to it. But the analyzer requests a control character or checksum at the end of the message to validate it.
Please excuse my limited English, but according to the manual, here is how to calculate that checksum:
The control character is the exclusion logic sum (exclusive OR), in character units, from the start of the text to the end of the text and consists of a 1 byte binary. Moreover, since this control character becomes the value of 00~7F by hexadecimal, please process not to mistake for the control code used in transmission.
So please, can you tell me how to compute this control character based on this information? I did not understand well what is written because of my limited English.
I'm using Visual Basic for programming.
Thanks
The manual isn't in very good English either. My understanding:
The message you will send to the device can be represented as a byte array. You want to XOR together every single byte, which will leave you with a one byte checksum, to be added to the end of the message. (byte 1 XOR byte 2 XOR byte 3 XOR ....)
Everything after "Moreover" appears to say "If you are reading from the device, the final character is a checksum, not a command, so do not treat it as a command, even if it looks like one.". You are writing to the device, so you can ignore that part.
To XOR (bear in mind I don't know VB): have a checksum variable set to 0, loop over each byte of the message, XOR the checksum variable with that byte, and store the result back in the checksum.
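As a sketch of the idea in Python (the Visual Basic version is the same loop using its Xor operator; the message content below is made up):

def checksum(message: bytes) -> int:
    value = 0
    for b in message:
        value ^= b          # XOR every byte into the running one-byte result
    return value

msg = b"EXAMPLE MESSAGE"    # hypothetical text between start-of-text and end-of-text
print("%02X" % checksum(msg))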
