I'm learning some 8080 assembly, which uses the older suffix H to indicate hexadecimal constants (vs modern prefix 0x or $). I'm also noodling around with a toy assembler and thinking about how to tokenize source code.
It's possible to write a valid hex constant (say) BEEFH, which contains only alphabetical characters. It's also possible to define a label called BEEFH. So when I write:
ORG 0800H
START: ...
JMP BEEFH ; <--- how is this resolved?
....
BEEFH: ...
...
This should be syntactically valid based on the old Intel docs: BEEFH meets the naming rules for labels, and of course is also a valid 16-bit address. The ambiguity of whether the operand to JMP here is an address constant or an identifier seems like a problem.
I don't have access to the original 8080 assembler to see what it does with this example. Here's an online 8080 assembler that appears to parse the operand to JMP as a label reference in all cases, but obviously a proper assembler should be able to target an absolute address with a JMP instruction.
Can anyone shed light on what the conventions around this actually are/should be? Am I missing something obvious? Thanks.
Someone left a comment that they then deleted, but I looked again and it was right on. Apparently I missed the note in the old Intel manual that says this about hex constants:
Hex constants must begin with a decimal digit. So that's certainly how you avoid the semantic ambiguity when parsing. It seems a bit inelegant to me as a solution but I guess then you should just use a modern prefix.
Thanks, anonymous commenter!
Related
Why such code is used in some applications instead of a MOVE?
add 16 to ZERO giving SOME-RESULT
I spotted this in professionally written code at several spots.
Sorce is on this page
Why such code is used in some applications instead of a MOVE?
add 16 to ZERO giving SOME-RESULT
Without seeing more of the code, it appears that it could be a translation of IBM Assembler to COBOL. In particular, the ZAP (Zero and Add Packed) instruction may be literally translated to the above instruction, particularly if SOME-RESULT is COMP-3. Thus, someone checking the translation could see that the ZAP instruction was faithfully translated.
Or, it could be an assembler programmer's idea of a joke.
Having seen the code, I also note the use of
subtract some-data-item from some-data-item
which is used instead of
move zero to some-data-item
This is consistent with operations used with packed decimal fields in IBM Assembly, where there are no other instructions to accomplish "flexible" moves. By flexible, I mean that the packed decimal instructions contain a length field so that specific size MVC instructions need not be used.
This particular style, being unusual, may be related to catching copyright violations.
From my experience, I'm pretty sure I know the reason why the programmer would have done this. It has something to do with the binary representation of the number.
I bet SOME-RESULT is a packed-decimal (or COMP-3) format number. Let's assume the field is defined like this
05 SOME-RESULT PIC S9(5) COMP-3.
This results in a 3-byte field with a hex representation like this
x'00016C'
The decimal number is encoded as a binary encoded decimal (BCD, one decimal digit per half-byte), and the last half-byte holds the sign.
Let's take a look at how the sign is defined:
if it is one of x'C', x'A', x'F', x'E' (café), then the number is positive
if it is one of x'B', x'D', then the number is negative
any of x'0'..'x'9' are not valid signs, so we can distinguish signed packed-decimals from unsigned.
However, a zoned number (PIC 9(5) DISPLAY) - as in the source code - looks like this:
x'F0F0F0F1F6'
As you can see, each decimal digit is an EBCDIC character with the 'zone' part (the first half-byte) always being x'F'.
Now we get closer to your question!
What happens when we use
MOVE 16 TO SOME-RESULT
If you just MOVE a number to such a field, this results in being compiled into a PACK instruction on the machine code level.
PACK SOME-RESULT,=C'16'
A pack instruction takes a zoned number and packs it by picking only the second half-byte of each byte and storing it in the half-bytes of the packed number - with one exception! When it comes to the last byte, it simply flips the two half-bytes and stores them in the last half-byte of the decimal.
This means that the zone of the last byte of the zoned decimal becomes the sign in the packed decimal:
x'00016F'
So now we have an x'F' as the sign – which is a valid positive sign.
However, what happens if use this Cobol instruction instead
ADD 16 TO ZERO GIVING SOME-RESULT
This compiles into multiple machine level instructions
PACK SOME_RESULT,=C'0'
PACK TEMP,=C'16'
AP SOME_RESULT,TEMP
(or similar - the key point is that is needs an AP somewhere)
This makes a slight difference in the result, because the AP (add packed) instruction always sets the resulting sign to either x'C' for a positive or x'D' for a negative result.
So the difference lies in the sign
x'00016C'
Finally, the question is why would one make this difference? After all, both x'F' and x'C' are valid positive signs. So why care?
There is one situation when this slight difference can cause big problems: When the packed decimal is part of an index key, then we would not get a match, even though the numbers are semantically identical!
Because this situation occurred quite often in older databases like VSAM and DL/I (later: IMS/DB), it became good practice to "normalize" packed decimals if they were part of an index key.
However, some programmers adopted the practice without knowing why, so you may come across code that uses this "normalization" even though the data are not used for index keys.
You might also wonder why a compiler does not optimize out the ADD 16 TO ZERO. I'm pretty sure it once did, but that broke a lot of applications, so this specific optimization was removed again or at least made a non-default option with warnings.
Additional useful info
Note that at least the Enterprise Cobol for z/OS compiler allows you to see exactly the machine code that is produced from your source code if use the LIST compile option (see this example output). I recommend to always compile with options LIST, MAP, OFFSET, XREF because these options enable you find the exact problem in your Cobol source even when you only have a program dump from an abend.
Anyway, good programming practice is not to care about the compiler or the machine code, but about the other programmers who will have to maintain, and thus read and understand the code. Good practice would be to always prefer simple and readable instructions, and to document the reasons (right in the code) when deviating from this rule.
Some programmers like to do things "just because they can". I have a feeling that is what you are seeing here. It makes about as much sense as doing
a := 0 + b
would in go.
From my Lua knowledge (and according to what I have read in Lua manuals), I've always been under impression that an identifier in Lua is only limited to A-Z & a-z & _ & digits (and can not start using a digit nor be a reserved keyword i.e. local local = 123).
And now I have run into some (obfuscated) Lua program which uses all kind of weird characters for an identifier:
https://i.imgur.com/HPLKMxp.png
-- Most likely, copy+paste won't work. Download the file from https://tknk.io/7HHZ
print(_VERSION .. " " .. (jit and "JIT" or "non-JIT"))
local T = {}
T.math = T.math or {}
T.math.​â®â€‹âŞâ®â€‹ď»żâ€Śâ€âŽ = math.sin
T.math.â¬â€‹ââ¬ââ«â®â€â€¬ = math.cos
for k, v in pairs(T.math) do print(k, v) end
Output:
Lua 5.1 JIT
â¬â€‹ââ¬ââ«â®â€â€¬ function: builtin#45
​â®â€‹âŞâ®â€‹ď»żâ€Śâ€âŽ function: builtin#44
It is unclear to me, why is this set of characters allowed for an identifier?
In other words, why is it a completely valid Lua program?
Unlike some languages, Lua is not really defined by a formal specification, one which covers every contingency and entirely explains all of Lua's behavior. Something as simple as "what character set is a Lua file encoded in" isn't really explain in Lua's documentation.
All the docs say about identifiers is:
Names (also called identifiers) in Lua can be any string of letters, digits, and underscores, not beginning with a digit and not being a reserved word.
But nothing ever really says what a "letter" is. There isn't even a definition for what character set Lua uses. As such, it's essentially implementation-dependent. A "letter" is... whatever the implementation wants it to be.
So, let's say you're writing a Lua implementation. And you want users to be able to provide Unicode-encoded strings (that is, strings within the Lua text). Lua 5.3 requires this. But you also don't want them to have to use UTF-16 encoding for their files (also because lua_load gets sequences of bytes, not shorts). So your Lua implementation assumes the byte sequence it gets in lua_load is encoded in UTF-8, so that users can write strings that use Unicode characters.
When it comes to writing the lexer/parser part of this implementation, how do you handle this? The simplest, easiest way to handle UTF-8 is to... not handle UTF-8. Indeed, that's the whole point of that encoding. Since everything that Lua defines with specific symbols are encoded in ASCII, and ASCII text is also UTF-8 text with the same meaning, you can basically treat a UTF-8 string like an ASCII string. For in-Lua strings, you just copy the sequence of bytes between the start and end characters of the string.
So how do you go about lexing identifiers? Well, you could ask the question above. Or you could ask a much simpler question: is the character a space, control character, digit, or symbol? A "letter" is merely something that isn't one of those.
Lua defines what things it considers to be "symbols". ASCII can tell you what is a control character, space, and a digit. In such an implementation, any UTF-8 code unit with a value outside of ASCII is a letter. Even if technically, those code units decode into something Unicode thinks of as a "symbol", your lexer just threats it as a letter.
This simple form of UTF-8 lexing gives you fast performance and low memory overhead. You don't have to decode UTF-8 into Unicode codepoints, and you don't need a giant Unicode table to tell you whether a codepoint is a "symbol" or "space" or whatever. And of course, it's also something that would naturally fall out of many ASCII-based Lua implementations.
So most Lua implementations will do it this way, if only by accident. Doing something more would require deliberate effort.
It also allows a user to use Unicode character sequences as identifiers. That means that someone can easily write code in their native language (outside of keywords).
But it also means that obfuscators have lots of ways to create "identifiers" that are just strings of nonsensical bytes. Indeed, because there are multiple ways in Unicode to "spell" the same apparent Unicode string (unless you examine the bytes directly), obfuscators can rig up identifiers that appear when rendered in a text editor to all be the same text, while actually being different strings.
To clarify there is only one identifier T
T.math is sugar syntax for T["math"] this also extends to the obfuscate strings. It is perfectly valid to have a key contain any characters or even start with a number.
Now being able to use the . rather then [ ] does not work with a string that don't conform to the identifier's limitations. See Nicol Bolas' answer for a great break down of those limitations.
How do I use CharInSet() to get rid of this warning?
lastop:=c;
{OK - if op is * or / and prev op was + or -, put the whole thing in parens}
If (c in ['*','/']) and (NextToLastOp in ['+','-']) then
Result.text:='('+Result.text+')';
Result.text:=Result.text+c;
[dcc32 Warning] U_Calc1.pas(114): W1050 WideChar reduced to byte char in set expressions. Consider using 'CharInSet' function in 'SysUtils' unit.
CharInSet is necessary due to the fact that WideChar values are too large to be used in actual set operations.
To use it in your code you would use it like this:
If (CharInSet(c, ['*','/']) and (CharInSet(NextToLastOp, ['+','-']) then
However CharInSet doesn't actually do anything special or even useful other than silence the warning.
The code in the function is identical to the code it replaces. The warning is silenced only because the RTL units are pre-compiled. The compiler does not recompile them even if you do a full "Rebuild all". i.e. that code in SysUtils is not actually ever re-compiled. If it were it would emit the exact same warning.
So, you could use that function to avoid the warning but if you can be certain that your character data is ASCII (or extended ASCII), that is with ordinal values in the range #0 to #255, then you could also avoid the warning by typecasting your character variable explicitly to an ANSIChar:
If (ANSIChar(c) in ['*','/']) and (ANSIChar(NextToLastOp) in ['+','-']) then
To reiterate: You should only do this if you are absolutely certain that the character data that you are working with consists only of ASCII characters.
But this same caveat applies equally to using CharInSet since this still requires that you test for membership in an ANSIChar set (i.e. single byte characters).
Explicit type-casting has the advantage (imho) of making it explicit and apparent that your code is intended deliberately to expect and work only with ASCII characters, rather than giving a perhaps misleading impression of being more fully equipped to handle Unicode.
I have a string that, by using string.format("%02X", char), I've received the following:
74657874000000EDD37001000300
In the end, I'd like that string to look like the following:
t e x t NUL NUL NUL í Ó p SOH NUL ETX NUL (spaces are there just for clarification of characters desired in example).
I've tried to use \x..(hex#), string.char(0x..(hex#)) (where (hex#) is alphanumeric representation of my desired character) and I am still having issues with getting the result I'm looking for. After reading another thread about this topic: what is the way to represent a unichar in lua and the links provided in the answers, I am not fully understanding what I need to do in my final code that is acceptable for this to work.
I'm looking for some help in better understanding an approach that would help me to achieve my desired result provided below.
ETA:
Well I thought that I had fixed it with the following code:
function hexToAscii(input)
local convString = ""
for char in input:gmatch("(..)") do
convString = convString..(string.char("0x"..char))
end
return convString
end
It appeared to work, but didnt think about characters above 127. Rookie mistake. Now I'm unsure how I can get the additional characters up to 256 display their ASCII values.
I did the following to check since I couldn't truly "see" them in the file.
function asciiSub(input)
input = input:gsub(string.char(0x00), "<NUL>") -- suggested by a coworker
print(input)
end
I did a few gsub strings to substitute in other characters and my file comes back with the replacement strings. But when I ran into characters in the extended ASCII table, it got all forgotten.
Can anyone assist me in understanding a fix or new approach to this problem? As I've stated before, I read other topics on this and am still confused as to the best approach towards this issue.
The simple way to transform a base16-encoded string is just to
function unhex( input )
return (input:gsub( "..", function(c)
return string.char( tonumber( c, 16 ) )
end))
end
This is basically what you have, just a bit cleaner. (There's no need to say "(..)", ".." is enough – if you specify no captures, you'll automatically get the whole match. And while it might work if you write string.char( "0x"..c ), it's just evil – you concatenate lots of strings and then trigger the automatic conversion to numbers. Much better to just specify the base when explicitly converting.)
The resulting string should be exactly what went into the hex-dumper, no matter the encoding.
If you cannot correctly display the result, your viewer will also be unable to display the original input. If you used different viewers for the original input and the resulting output (e.g. a text editor and a terminal), try writing the output to a file instead and looking at it with the same viewer you used for the original input, then the two should be exactly the same.
Getting viewers that assume different encodings (e.g. one of the "old" 8-bit code pages or one of the many versions of Unicode) to display the same thing will require conversion between different formats, which tends to be quite complicated or even impossible. As you did not mention what encodings are involved (nor any other information like OS or programs used that might hint at the likely encodings), this could be just about anything, so it's impossible to say anything more specific on that.
You actually have a couple of problems:
First, make sure you know the meaning of the term character encoding, and that you know the difference between characters and bytes. A popular post on the topic is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then, what encoding was used for the bytes you just received? You need to know this, otherwise you don't know what byte 234 means. For example it could be ISO-8859-1, in which case it is U+00EA, the character ê.
The characters 0 to 31 are control characters (eg. 0 is NUL). Use a lookup table for these.
Then, displaying the characters on the terminal is the hard part. There is no platform-independent way to display ê on the terminal. It may well be impossible with the standard print function. If you can't figure this step out you can search for a question dealing specifically with how to print Unicode text from Lua.
I cant see what encoding Lua uses for its strings.
Im using
string.byte (s [, i [, j]])
which has the doc
Returns the internal numerical codes of the characters s[i], s[i+1],
···, s[j]. The default value for i is 1; the default value for j is i.
Note that numerical codes are not necessarily portable across
platforms.
Reading around people suggest it uses ASCII - which is fine for me - but I dont get the changing across platforms - I thought the very nature of using a single encoding (like ASCII) is that this wouldnt happen - or is it just saying this as ASCII does not define for over 126 (or 127) and therefore different countries / OEMS / OSs etc may be using custom ASCII extensions from decades ago for the upper range?
Its important for me to know that [a-zA-Z] will have the same char values on all platforms im running on.
The Lua doc could be a bit more specific here!
Any light anyone can shed on this would be great thx
I'm fairly sure you can safely assume an ASCII-derived encoding. So the minuscule set of characters you're interested in stays the same.
The note about the code changing between platforms likely means that Lua doesn't know anything about the character encoding at all and thus just uses whatever bytes the OS hands out. On Linux this is likely UTF-8, which means you'd have to deal with individual code units when stepping outside ASCII. On Windows I could imagine it being the system's legacy codepage, which means sort-of Latin 1 (CP 1252) in much of the Western world.