Can BitConverter be used to reliably extract multi-byte values from an IL byte stream (as returned by MethodBody.GetILAsByteArray)? - clr

I am working on some code that parses IL byte arrays as returned by MethodBody.GetILAsByteArray.
Let's say I want to read a metadata token or a 32-bit integer constant from such an IL byte stream. At first I thought that BitConverter.ToInt32(byteArray, offset) would make this easy. However, I'm now worried that this won't work on big-endian machines.
As far as I know, IL always uses little-endian encoding for multi-byte values:
"All argument numbers are encoded least-significant-byte-at-smallest-address (a pattern commonly termed 'little-endian')." — The Common Language Infrastructure Annotated Standard, Partition III, ch. 1.2 (p. 482).
Since BitConverter's conversion methods honour the computer architecture's endianness (which can be discovered through BitConverter.IsLittleEndian), I conclude that BitConverter should not be used to extract multi-byte values from an IL byte stream, because this would give wrong results on big-endian machines.
Is this conclusion correct?
If yes: Is there any way to tell BitConverter which endianness to use for conversions, or is there any other class in the BCL that offers this functionality, or do I have to write my own conversion code?
If no: Where am I wrong? What is the proper way of extracting e.g. a Int32 operand value from an IL byte array?

If the array is little-endian, you should always check the machine's endianness before passing it (this assumes the array holds just the bytes of the value being read):
// Array is little-endian. Are we on a big-endian machine?
if (!BitConverter.IsLittleEndian)
{
    // Then flip the bytes before converting
    Array.Reverse(array);
}
int val = BitConverter.ToInt32(array, 0);
However, as you mention, this is an IL stream. The bytecode layout is (AFAIK):
(OPCODE:(1|2):little) (VARIABLES:x:little)
So I would read a byte, check its opcode, then read the appropriate number of operand bytes and flip that slice if necessary using the code above. Can I ask what you are doing?
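Alternatively, you can avoid BitConverter entirely and compose the value byte by byte, which is endianness-independent by construction. A minimal sketch (ReadInt32LE is a hypothetical helper name, not part of the BCL):
static int ReadInt32LE(byte[] il, int offset)
{
    // Assemble the value least-significant byte first, so the result
    // is the same on little- and big-endian machines.
    return il[offset]
         | il[offset + 1] << 8
         | il[offset + 2] << 16
         | il[offset + 3] << 24;
}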


What is Lua number type length in bytes?

What is the number format length in bytes?
This is a "multi type" data format. Is it 4 bytes? 8 bytes? How much? How can I get it programmatically? Does the length depend on the OS/processor type?
Here https://www.lua.org/pil/2.3.html the documentation says this is a double precision type. That is, it has 64 bits. Am I right?
Like @Roddy said, it's slightly complicated by the integer type. Moreover, it depends on how your Lua is compiled.
Basically, in Lua 5.3 there are two types: the integer type lua_Integer and the number type lua_Number. You can get their lengths programmatically from within Lua by parsing a chunk header:
local chunk = string.dump(function() end)
-- In a Lua 5.3 bytecode header, bytes 16 and 17 hold
-- sizeof(lua_Integer) and sizeof(lua_Number), respectively.
print("lua_Integer", chunk:byte(16))
print("lua_Number", chunk:byte(17))
Typically both lengths will be 8 bytes. However, on some embedded platforms you can find Luas where the lua_Number type is a float (4 bytes), a 32-bit integer, or even weirder things.
It depends on the version of Lua, and of course, how it's compiled.
5.3 has true integers, typically 64 bits. https://www.lua.org/manual/5.3/manual.html
The type number uses two internal representations, or two subtypes,
one called integer and the other called float.
...
Standard Lua uses 64-bit integers and double-precision (64-bit)
floats, but you can also compile Lua so that it uses 32-bit integers
and/or single-precision (32-bit) floats.
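In Lua 5.3 and later you can also inspect the subtype of a value directly; a small illustration:
-- math.type reports the subtype of a number (Lua 5.3+)
print(math.type(1))      --> integer
print(math.type(1.0))    --> float
print(math.type("x"))    --> nil (not a number)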
Earlier versions always use 64-bit double-precision floating point, which exactly represents integers of up to 53 bits. Your link... https://www.lua.org/pil/2.3.html
according to the Lua reference (for integers)
In case of overflows in integer arithmetic, all operations wrap around, according to the usual rules of two-complement arithmetic. (In other words, they return the unique representable integer that is equal modulo 2^64 to the mathematical result.)
and for floating point
With the exception of exponentiation and float division, the arithmetic operators work as follows: If both operands are integers, the operation is performed over integers and the result is an integer. Otherwise, if both operands are numbers or strings that can be converted to numbers (see §3.4.3), then they are converted to floats, the operation is performed following the usual rules for floating-point arithmetic (usually the IEEE 754 standard), and the result is a float.
Lua as a language does not define what you ask for. The data type used for representing numbers may differ from version to version (note that the link to the free online version of "Programming in Lua" is about Lua 5.0), but primarily this is defined by the way Lua is compiled, as others already said.
Look at luaconf.h for all the details.
Regarding your actual problem (converting hex-string to numbers), you could look at the result of tonumber() on various input strings, compared to known results:
function hexConvertibleBytes()
  local i, s = 0, ''
  repeat
    i, s = i + 1, s .. 'FF'
    local n = tonumber( s, 16 )
  until n ~= 256^i - 1
  return i - 1
end
We can use string.pack as follows:
s = string.pack("J",0)
number_of_bytes = #s
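The same trick works for the float type, using the "n" format, which packs a lua_Number:
s = string.pack("n", 0.0)
print(#s)   -- size of lua_Number in bytes (typically 8)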

How is data written to memory

When we store data in memory, how does it get stored so that the type of the data can be recognized when it is loaded?
What I want to ask is how data types like natural numbers, integers, characters, etc. are stored in memory so that they can be recognized easily later when extracted from memory.
When we look at memory, what we see are hex numbers. How can we tell whether these hex numbers represent an ASCII value, an integer value, or anything else?
Since all of your data is written in binary, there isn't much difference between how the char 'a' is written and how the int 97 is written, since they represent the same binary string (at least in the last 8 bits). That being said, when you read from memory, you read as a particular data type, and it is that type which tells you how to interpret the data.
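A minimal illustration in C# (assuming the usual ASCII-compatible mapping): the same byte yields 'a' or 97 depending purely on the type you read it as:
byte[] memory = { 0x61 };                  // one byte in memory
char asChar = (char)memory[0];             // 'a': read as a character
int asInt = memory[0];                     // 97: read as a number
Console.WriteLine($"{asChar} {asInt}");    // same byte, two interpretations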
Memory does not operate in terms of "character" or "integer", these are high-level concepts that assume an abstract machine.
Typically, but not necessarily, a character is just an integer with a smaller size, often 8 bits (but a character could as well be 32 bits!) which represents one symbol or letter, rather than a discrete number. In some cases, a character may even be encoded using a variable length.
Memory operates in terms of bits that are organized in bytes (smallest directly addressable unit) or words. These are -- unbeknownst to you -- organized in banks. The hardware typically allows access in units called "cache lines", but this is something that happens secretly behind your back.
In assembler language, you can typically access bytes and power-of-two multiples of these, sometimes with special alignment requirements (there are usually also bit operations, but while they only change one bit, they still work on whole bytes/words).
All of that is, however, not very interesting, and also widely irrelevant for you. It is first and foremost the compiler's (or interpreter's) job to make sure that when you speak of an integer or a character, that whatever you want comes out at the other end. It is also the tool's responsibility to convert one into another if possible, and produce an error if not possible.
You do not even know for certain whether the value of an integer or a character has a memory location at all (it may very well be stored in a register) unless you explicitly enforce that.
You cannot distinguish a byte at some memory location that came from a "character" from a byte that belongs to an "integer". They look just the same.
And while it is possible to read the raw bytes of one type as another type in most languages, this is not something you normally need to do (or should do).

Avoiding ASCII conversion in serial communication using luars232

I am using the luars232 library for serial communication from Lua. I need to send data bytes without converting them to ASCII form, but the write function of luars232 converts the data into ASCII before transmission even if I pass it to the function as a number. Please advise.
I have worked around the issue by using escape sequences in the string data type, e.g. '\2' passes 0x02 to the serial port via luars232's write function. But this makes it awkward to perform mathematical operations on the data before transmission. Further suggestions are welcome.
The library takes the data argument and coerces it to a string via luaL_checklstring using standard Lua rules. So, if you want complete control over the data, you should pass a string. A Lua string is a counted sequence of bytes.
Certainly, as you have found, a literal escaped character sequence will work.
You can also use the string.char(...) function, which takes a list of zero or more values in the range 0-255 and creates a string with those byte values.
If you have a table sequence of bytes, you can unpack them into a list:
local bytes = { 27, 76, 117, 97 }
port:write(string.char(table.unpack(bytes)))   -- on Lua 5.1 use unpack(bytes)
So, yes, you do have to convert to a string. But, you can defer that until just before the write call.
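For example, nothing stops you from doing the arithmetic on the numbers first and converting only at the last moment. A sketch (the port variable and the checksum scheme are assumptions for illustration):
local bytes = { 0x02, 0x10, 0x20 }
local sum = 0
for _, b in ipairs(bytes) do
    sum = (sum + b) % 256        -- simple additive checksum, made up for illustration
end
bytes[#bytes + 1] = sum          -- append it as one more data byte
port:write(string.char(table.unpack(bytes)))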

Why does Delphi warn when assigning ShortString to string?

I'm converting some legacy code to Delphi 2010.
There are a fair number of old ShortStrings, like string[25]
Why does the assignment below:
var
  S: String;
  ShortS: String[25];
...
S := ShortS;
cause the compiler to generate this warning:
W1057 Implicit string cast from 'ShortString' to 'string'.
There's no data loss that is occurring here. In what circumstances would this warning be helpful information to me?
Thanks!
Tomw
It's because your code is implicitly converting a single-byte character string to a UnicodeString. It's warning you in case you might have overlooked it, since that can cause problems if you do it by mistake.
To make it go away, use an explicit conversion:
S := string(ShortS);
The ShortString type has not changed. It continues to be, in effect, an array of AnsiChar.
By assigning it to a string type, you are taking a group of AnsiChars (one byte each) and putting it into a group of WideChars (two bytes each). The compiler can do that just fine, and is smart enough not to lose data, but the warning is there to let you know that such a conversion has taken place.
The warning is very important because you may lose data. The conversion is done using the current Windows 8-bit character set, and some character sets do not define all values between 0 and 255, or are multi-byte character sets, and thus cannot convert all byte values.
The data loss can occur on a standard computer in a country with specific standard character sets, or on a computer in USA that has been set up for a different locale, because the user communicates a lot with people in other languages.
For instance, if the local code page is 932, the byte values 129 and 130 will both convert to the same value in the Unicode string.
In addition to this, the conversion involves a Windows API call, which is an expensive operation. If you do a lot of these, it can slow down your application.
It's safe (as long as you're using the ShortString for its intended purpose: to hold a string of characters, not a collection of bytes, some of which may be 0), but it may have performance implications if you do it a lot. As far as I know, Delphi has to allocate memory for the new Unicode string, extract the characters from the ShortString into a null-terminated string (that's why it's important that it's a properly formed string), and then call something like the Windows API MultiByteToWideChar() function. Not rocket science, but not a trivial operation either.
ShortStrings don't have a code page associated with them, AnsiStrings do (since D2009).
The conversion from ShortString to UnicodeString can only be done on the assumption that ShortStrings are encoded in the default ANSI encoding which is not a safe assumption.
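If you do know the actual encoding of the bytes, you can make it explicit before converting. A sketch (code page 1252 is an assumption for illustration; SetCodePage exists since Delphi 2009):
var
  ShortS: string[25];
  Bytes: AnsiString;
  S: string;
begin
  ShortS := 'Hello';
  Bytes := ShortS;                                 // byte-for-byte copy, no Unicode conversion yet
  SetCodePage(RawByteString(Bytes), 1252, False);  // re-tag the encoding without converting (1252 assumed)
  S := string(Bytes);                              // now converts using code page 1252
end;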
I don't really know Delphi, but if I remember correctly, ShortStrings are essentially a sequence of characters on the stack, whereas a regular string (AnsiString) is actually a reference to a location on the heap. This may have different implications.
Here's a good article on the different string types:
http://www.codexterity.com/delphistrings.htm
I think there might also be a difference in terms of encoding but I'm not 100% sure.

Delphi 2009 + Unicode + Char-size

I just got Delphi 2009 and have previously read some articles about modifications that might be necessary because of the switch to Unicode strings.
Mostly, it is mentioned that sizeof(char) is not guaranteed to be 1 anymore.
But why would this be interesting regarding string manipulation?
For example, if I use AnsiString := 'Test' and do the same with a String (which is Unicode now), then I get Length() = 4, which is correct in both cases.
Without having tested it, I'm sure all other string manipulation functions behave the same way and decide internally if the argument is a unicode string or anything else.
Why would the actual size of a char be of interest for me if I do string manipulations?
(Of course if I use strings as strings and not to store any other data)
Thanks for any help!
Holger
With Unicode, SizeOf(SomeChar) is no longer 1, so the size of a string in bytes is no longer equal to the number of its Chars. As long as you don't assume SizeOf(Char) = 1 or SizeOf(SomeString[x]) = 1 (both are FALSE now), and don't try to interchange bytes with chars, you shouldn't have any trouble. Any place you are doing something creative stuffing bytes into Chars or Strings, you will need to use AnsiString.
(SizeOf(SomeString) is still 4 no matter the length, since a string is essentially a pointer with some compiler magic.)
People often implicitly convert from characters to bytes in old Delphi code without really thinking about it, for example when writing to a stream. When you write a string to a stream, you have to specify the number of bytes you write, but people often pass the character count instead. See this post from Chris Bensen for another example.
Another way people often make this implicit conversion in older code is by using a "string" to store binary data. In this case, they actually want bytes, but the data type expects characters. D2009 has a better type for this.
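A hedged sketch of the stream case (WriteStringToStream is a made-up helper name; the point is multiplying by SizeOf(Char) rather than passing Length(S) alone):
procedure WriteStringToStream(Stream: TStream; const S: string);
begin
  // Length(S) counts Chars, and each Char is 2 bytes in Delphi 2009,
  // so the byte count is Length(S) * SizeOf(Char).
  if S <> '' then
    Stream.WriteBuffer(Pointer(S)^, Length(S) * SizeOf(Char));
end;
For genuinely binary data, D2009's TBytes (a dynamic array of bytes) is the better fit.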
I haven't tried Delphi 2009, but I am using FPC, which is also switching to Unicode slowly. I'm 95% sure that everything below also holds for Delphi 2009.
In FPC (when supporting Unicode), functions like Length take the code page into consideration. Thus it returns the length of the string as a human would see it. If there are, for example, two Chinese characters that each take two bytes of memory in Unicode, Length will return 2, since there are two characters in the string. But the string will take 4 bytes of memory (plus the memory for the reference count and the trailing #0, but that aside).
What you can not do anymore is this:
var
  s: string;
  p: pchar;
  i: integer;
begin
  p := @s[1];
  for i := 0 to length(s) - 1 do
  begin
    write(p^);
    inc(p);
  end;
end;
Because this code will, in the two Chinese-character example, write the wrong two characters: namely the two bytes which are part of the first 'real' character.
In short: Length() no longer returns the number of bytes allocated for the string, but the number of characters. (Before the switch to Unicode, those two values were equal to each other.)
The actual size of a character shouldn't matter, unless you are doing the manipulation at the byte level.
(Of course if I use strings as strings and not to store any other data)
That's the key point, YOU don't use strings for other purposes, but some people do. They use strings just like arrays, so they (and that's including me) would need to check all such uses to make sure nothing is broken...
Let's not forget that there are times when this conversion is not really desired. Say, for storing a GUID in a record. The GUID can only contain hexadecimal characters plus the '-' and brackets, so making them take up twice the space can have quite an impact on existing code. Sure, the simple solution is to change them to AnsiString and deal with the compiler warnings if you do any string manipulation on them.
It can be an issue if you make Windows API calls. Or if you have legacy code that does inc or dec of str[0] to change its length.
