Scan function with m modifier not working as intended - parsing

The scan function does not seem to work as I want it.
data test;
do i=1 to 5;
text="ABC¤¤ABC¤ABC¤ABC";
scan = scan(text,i,"¤","m");
output;
end;
run;
Results:
It is working for i=2 but I don't understand why i=3 and i=4 are blank...
What I want is scan=blank for only i=2 where there is a consecutive delimiter.
However, if my delimiter is a comma, it works...
data test;
do i=1 to 5;
text="ABC,,ABC,ABC,ABC";
scan = scan(text,i,",","m");
output;
end;
run;
Results:
What am I doing wrong ???

To go into more detail on Tom's (correct) answer, SAS is a language that existed well before Unicode. It maintains backwards compatibility, for the most part, and that means that many SAS functions aren't compatible with Unicode.
SAS has a page, Internationalization Compatibility with SAS String functions, which goes into detail as to which functions are compatible with character sets that are not Single-Byte Character Sets (for example UTF-8, a Multi-Byte Character Set).
Functions that are listed as "I18N Level 0" (I18N is short for Internationalization - 18 characters between the I and the last n) are not compatible with non-single byte character sets. SCAN is one of those functions. "I18N Level 1" might work or might not, and "I18N Level 2" are designed to work with MBCSs, like UTF-8.
For the most part, the functions designed with UTF-8 in mind start with 'k' and are otherwise similar to the base SAS functions. However, in a few cases they had to make variants.
For your use, kscanx is the function you'll want. That allows the m modifier to be used.
It's still possible you'll have issues, if your SAS session and your SAS data are not in the exact same encoding. Consider the UNICODE or UNICODEC functions, or the KCVT function, to modify the character set of one or the other to match.

Your SAS session is using Unicode, so the symbol you are trying to use requires more than one byte. The SCAN() function treats those bytes as two separate delimiter characters, so the M modifier sees the two different bytes next to each other as delimiting a missing value.
Use the KSCAN() function instead.
To use the M modifier you will need to use KSCANX() function. (I have asked SAS to update the documentation of these three functions so they reference each other.)
You could try replacing the two-byte character with some single-byte character so that you can use the SCAN() function, but then you could also have issues with the delimiter being seen as one of the bytes in some other multi-byte character in the string.
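The byte-level effect described above can be illustrated outside SAS. The following is a minimal Python sketch, not SAS's implementation: `bytewise_scan` is a hypothetical stand-in for a byte-oriented scanner that keeps empty tokens (as the M modifier does), and it assumes the session encoding is UTF-8.

```python
# Sketch only: mimics a byte-oriented scanner that keeps empty tokens
# (like the M modifier). Assumes the text is UTF-8 encoded.

def bytewise_scan(data: bytes, delims: bytes) -> list:
    """Split data at every byte found in delims, keeping empty tokens."""
    tokens, current = [], bytearray()
    for byte in data:
        if byte in delims:
            tokens.append(bytes(current))
            current = bytearray()
        else:
            current.append(byte)
    tokens.append(bytes(current))
    return tokens

text = "ABC\u00a4\u00a4ABC\u00a4ABC\u00a4ABC"  # the string from the question
delim = "\u00a4"                                # the delimiter character

delim_bytes = delim.encode("utf-8")             # two bytes: C2 A4
byte_tokens = bytewise_scan(text.encode("utf-8"), delim_bytes)
char_tokens = text.split(delim)                 # character-aware split

print(delim_bytes.hex())
print(byte_tokens[:5])
print(char_tokens)
```

In this sketch the first five byte-level tokens are ABC, three empty tokens, then ABC, which matches the blanks reported for i=2 through i=4. The character-aware split gives a single empty token at position 2.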

I ran the program
data test;
do i=1 to 5;
text="ABC¤¤ABC¤ABC¤ABC";
scan = scan(text,i,"¤","m");
output;
end;
run;
and got this


Why is the following piece of Lua code, completely valid?

From my Lua knowledge (and according to what I have read in the Lua manuals), I've always been under the impression that an identifier in Lua is limited to A-Z, a-z, _ and digits (and cannot start with a digit or be a reserved keyword, so local local = 123 is invalid).
And now I have run into some (obfuscated) Lua program which uses all kind of weird characters for an identifier:
https://i.imgur.com/HPLKMxp.png
-- Most likely, copy+paste won't work. Download the file from https://tknk.io/7HHZ
print(_VERSION .. " " .. (jit and "JIT" or "non-JIT"))
local T = {}
T.math = T.math or {}
T.math.​â®â€‹âŞâ®â€‹­ď»żâ€Śâ€­âŽ­ = math.sin
T.math.â¬â€‹â­â¬â­â«â®â€­â€¬ = math.cos
for k, v in pairs(T.math) do print(k, v) end
Output:
Lua 5.1 JIT
â¬â€‹â­â¬â­â«â®â€­â€¬ function: builtin#45
​â®â€‹âŞâ®â€‹­ď»żâ€Śâ€­âŽ­ function: builtin#44
It is unclear to me, why is this set of characters allowed for an identifier?
In other words, why is it a completely valid Lua program?
Unlike some languages, Lua is not really defined by a formal specification, one which covers every contingency and entirely explains all of Lua's behavior. Something as simple as "what character set is a Lua file encoded in" isn't really explained in Lua's documentation.
All the docs say about identifiers is:
Names (also called identifiers) in Lua can be any string of letters, digits, and underscores, not beginning with a digit and not being a reserved word.
But nothing ever really says what a "letter" is. There isn't even a definition for what character set Lua uses. As such, it's essentially implementation-dependent. A "letter" is... whatever the implementation wants it to be.
So, let's say you're writing a Lua implementation. And you want users to be able to provide Unicode-encoded strings (that is, strings within the Lua text). Lua 5.3 requires this. But you also don't want them to have to use UTF-16 encoding for their files (also because lua_load gets sequences of bytes, not shorts). So your Lua implementation assumes the byte sequence it gets in lua_load is encoded in UTF-8, so that users can write strings that use Unicode characters.
When it comes to writing the lexer/parser part of this implementation, how do you handle this? The simplest, easiest way to handle UTF-8 is to... not handle UTF-8. Indeed, that's the whole point of that encoding. Since everything that Lua defines with specific symbols are encoded in ASCII, and ASCII text is also UTF-8 text with the same meaning, you can basically treat a UTF-8 string like an ASCII string. For in-Lua strings, you just copy the sequence of bytes between the start and end characters of the string.
So how do you go about lexing identifiers? Well, you could ask the question above. Or you could ask a much simpler question: is the character a space, control character, digit, or symbol? A "letter" is merely something that isn't one of those.
Lua defines what things it considers to be "symbols". ASCII can tell you what is a control character, a space, and a digit. In such an implementation, any UTF-8 code unit with a value outside of ASCII is a letter. Even if, technically, those code units decode into something Unicode thinks of as a "symbol", your lexer just treats it as a letter.
This simple form of UTF-8 lexing gives you fast performance and low memory overhead. You don't have to decode UTF-8 into Unicode codepoints, and you don't need a giant Unicode table to tell you whether a codepoint is a "symbol" or "space" or whatever. And of course, it's also something that would naturally fall out of many ASCII-based Lua implementations.
So most Lua implementations will do it this way, if only by accident. Doing something more would require deliberate effort.
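The lexing rule described above can be sketched in a few lines. This is a hypothetical Python illustration, not the code of any actual Lua implementation: `lex_names` and its helpers are made-up names, and the sketch ignores keywords, string literals and comments.

```python
import string

# ASCII bytes that may appear in a name; everything >= 0x80 is also
# treated as a "letter" without ever decoding UTF-8.
ASCII_LETTERS = set(string.ascii_letters.encode("ascii")) | {ord("_")}
ASCII_DIGITS = set(string.digits.encode("ascii"))

def is_name_start(byte: int) -> bool:
    return byte in ASCII_LETTERS or byte >= 0x80

def is_name_part(byte: int) -> bool:
    return is_name_start(byte) or byte in ASCII_DIGITS

def lex_names(source: bytes) -> list:
    """Return every name-like byte run in source (sketch only)."""
    names = []
    i, n = 0, len(source)
    while i < n:
        if is_name_start(source[i]) or source[i] in ASCII_DIGITS:
            j = i + 1
            while j < n and is_name_part(source[j]):
                j += 1
            if is_name_start(source[i]):  # drop runs that begin with a digit
                names.append(source[i:j])
            i = j
        else:
            i += 1  # space, control character, or symbol
    return names

# Zero-width and directional characters: invisible, but all non-ASCII bytes.
weird = "\u200b\u202e\u200d"
source = ("T.math." + weird + " = math.sin").encode("utf-8")
names = lex_names(source)
print(names)
```

Under this rule the invisible character sequence is accepted as a name simply because none of its bytes fall in the ASCII range.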
It also allows a user to use Unicode character sequences as identifiers. That means that someone can easily write code in their native language (outside of keywords).
But it also means that obfuscators have lots of ways to create "identifiers" that are just strings of nonsensical bytes. Indeed, because there are multiple ways in Unicode to "spell" the same apparent Unicode string (unless you examine the bytes directly), obfuscators can rig up identifiers that appear when rendered in a text editor to all be the same text, while actually being different strings.
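As a small illustration of the last point, here is a Python sketch (using the standard `unicodedata` module) of two strings that usually render identically but are different byte sequences. The example word is an assumption chosen for illustration, not taken from the obfuscated file.

```python
import unicodedata

precomposed = "caf\u00e9"   # "é" as one code point (U+00E9)
decomposed = "cafe\u0301"   # "e" followed by COMBINING ACUTE ACCENT

same_text = precomposed == decomposed
same_bytes = precomposed.encode("utf-8") == decomposed.encode("utf-8")
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

print(same_text, same_bytes, same_after_nfc)
```

A byte-oriented lexer never normalizes, so it would see these as two distinct identifiers.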
To clarify, there is only one identifier here: T.
T.math is syntactic sugar for T["math"], and this also extends to the obfuscated strings. It is perfectly valid for a key to contain any characters, or even to start with a number.
However, using . rather than [ ] does not work with a string that doesn't conform to the identifier limitations. See Nicol Bolas' answer for a great breakdown of those limitations.

ASCII Representation of Hexadecimal

I have a string that, by using string.format("%02X", char), I've received the following:
74657874000000EDD37001000300
In the end, I'd like that string to look like the following:
t e x t NUL NUL NUL í Ó p SOH NUL ETX NUL (spaces are there just for clarification of characters desired in example).
I've tried to use \x..(hex#), string.char(0x..(hex#)) (where (hex#) is alphanumeric representation of my desired character) and I am still having issues with getting the result I'm looking for. After reading another thread about this topic: what is the way to represent a unichar in lua and the links provided in the answers, I am not fully understanding what I need to do in my final code that is acceptable for this to work.
I'm looking for some help in better understanding an approach that would help me to achieve my desired result provided below.
ETA:
Well I thought that I had fixed it with the following code:
function hexToAscii(input)
local convString = ""
for char in input:gmatch("(..)") do
convString = convString..(string.char("0x"..char))
end
return convString
end
It appeared to work, but I didn't think about characters above 127. Rookie mistake. Now I'm unsure how to get the additional characters up to 255 to display as their ASCII values.
I did the following to check since I couldn't truly "see" them in the file.
function asciiSub(input)
input = input:gsub(string.char(0x00), "<NUL>") -- suggested by a coworker
print(input)
end
I did a few gsub calls to substitute in other characters, and my file comes back with the replacement strings. But once I ran into characters in the extended ASCII table, it all fell apart.
Can anyone assist me in understanding a fix or new approach to this problem? As I've stated before, I read other topics on this and am still confused as to the best approach towards this issue.
The simple way to decode a base16-encoded string is just this:
function unhex( input )
return (input:gsub( "..", function(c)
return string.char( tonumber( c, 16 ) )
end))
end
This is basically what you have, just a bit cleaner. (There's no need to say "(..)", ".." is enough – if you specify no captures, you'll automatically get the whole match. And while it might work if you write string.char( "0x"..c ), it's just evil – you concatenate lots of strings and then trigger the automatic conversion to numbers. Much better to just specify the base when explicitly converting.)
The resulting string should be exactly what went into the hex-dumper, no matter the encoding.
If you cannot correctly display the result, your viewer will also be unable to display the original input. If you used different viewers for the original input and the resulting output (e.g. a text editor and a terminal), try writing the output to a file instead and looking at it with the same viewer you used for the original input, then the two should be exactly the same.
Getting viewers that assume different encodings (e.g. one of the "old" 8-bit code pages or one of the many versions of Unicode) to display the same thing will require conversion between different formats, which tends to be quite complicated or even impossible. As you did not mention what encodings are involved (nor any other information like OS or programs used that might hint at the likely encodings), this could be just about anything, so it's impossible to say anything more specific on that.
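To make the bytes-versus-display distinction concrete, here is a minimal Python sketch of the same decoding step applied to the hex string from the question. Interpreting the bytes as ISO-8859-1 (Latin-1) is an assumption made for illustration; the question does not say which encoding is in use.

```python
hex_dump = "74657874000000EDD37001000300"  # string from the question

raw = bytes.fromhex(hex_dump)        # the original bytes, encoding-agnostic
as_latin1 = raw.decode("latin-1")    # assumption: bytes are ISO-8859-1 text

print(len(raw))
print(raw)
print(as_latin1[7], as_latin1[8])
```

The byte values are recovered exactly regardless of encoding. Only the last step, turning bytes into displayable characters, depends on which encoding the viewer assumes.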
You actually have a couple of problems:
First, make sure you know the meaning of the term character encoding, and that you know the difference between characters and bytes. A popular post on the topic is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then, what encoding was used for the bytes you just received? You need to know this, otherwise you don't know what byte 234 means. For example it could be ISO-8859-1, in which case it is U+00EA, the character ê.
The characters 0 to 31 are control characters (eg. 0 is NUL). Use a lookup table for these.
Then, displaying the characters on the terminal is the hard part. There is no platform-independent way to display ê on the terminal. It may well be impossible with the standard print function. If you can't figure this step out you can search for a question dealing specifically with how to print Unicode text from Lua.
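The lookup-table idea can be sketched as follows in Python. The function name `render` is hypothetical, and ISO-8859-1 is assumed as the encoding for bytes 128 to 255; with a different encoding the high bytes would map to other characters.

```python
# Standard ASCII names for the control characters 0..31.
CONTROL_NAMES = [
    "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
    "BS", "HT", "LF", "VT", "FF", "CR", "SO", "SI",
    "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
    "CAN", "EM", "SUB", "ESC", "FS", "GS", "RS", "US",
]

def render(data: bytes, encoding: str = "latin-1") -> str:
    """Show each byte as a control-character name or a decoded character."""
    parts = []
    for byte in data:
        if byte < 32:
            parts.append(CONTROL_NAMES[byte])
        elif byte == 127:
            parts.append("DEL")
        else:
            # Assumption: a single-byte encoding such as ISO-8859-1.
            parts.append(bytes([byte]).decode(encoding))
    return " ".join(parts)

rendered = render(bytes.fromhex("74657874000000EDD37001000300"))
print(rendered)
```

Whether the terminal then shows the accented characters correctly still depends on the terminal's own encoding.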

How to get the number of characters (as opposed to the number of bytes) of a text in Delphi?

I have a Delphi 7 application where I deal with ANSI strings and I need to count their number of characters (as opposed to the number of bytes). I always know the Charset (and thus the code page) associated with the string.
So, knowing the Charset (code page), I'm currently using MultiByteToWideChar to get the number of characters. It's useful when the Charset is one of the Chinese, Korean, or Japanese charsets where most of the characters are 2 bytes in length and simply using the Length function won't give me what I want.
However, it still counts composite characters as two characters, and I need them counted as one. Now, some composite characters have precomposed versions in Unicode, those would be counted correctly as one character since the MB_PRECOMPOSED is used by default. But many characters simply don't exist as precomposed, for example characters in Hebrew, Arabic, Thai, etc, and those are counted as two.
So the question really is: How to count composite characters as single characters? I don't mind converting the ANSI strings to Wide strings to count the number of characters, I'm already doing it with MultiByteToWideChar anyway.
You can count the Unicode code points like this:
function CodePointCount(P: PWideChar): Integer;
var
Count: Integer;
begin
Count := 0;
while Word(P^)<>0 do
begin
if (Word(P^)>=$D800) and (Word(P^)<=$DFFF) then
// part of surrogate pair
inc(Count)
else
inc(Count, 2);
inc(P);
end;
Result := Count div 2;
end;
This covers the issue that you did not mention. Namely that UTF-16 is a variable width encoding.
However, this will not tell you the number of glyphs represented by a UTF-16 string. That's because some code points represent combining characters. These combining characters combine with their neighbours to form a single equivalent character. So, multiple code-points, single glyph. More information can be found here: http://en.wikipedia.org/wiki/Unicode_equivalence
This is the harder issue. To solve it your code needs to fully understand the meaning of each Unicode code point. Is it a combining character? How does it combine? Really you need a dedicated Unicode library. For instance ICU.
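The three levels of counting discussed here (UTF-16 code units, code points, user-perceived characters) can be compared in a short Python sketch. The last function is only a rough approximation that skips code points with a non-zero canonical combining class; it is not full grapheme cluster segmentation, and some marks (for example certain Thai vowel signs) have combining class 0 and would still be counted. A dedicated library such as ICU remains the robust option.

```python
import unicodedata

def utf16_unit_count(s: str) -> int:
    """Number of 16-bit code units in the UTF-16 encoding."""
    return len(s.encode("utf-16-le")) // 2

def code_point_count(s: str) -> int:
    """Number of Unicode code points (Python strings index by code point)."""
    return len(s)

def approx_user_char_count(s: str) -> int:
    """Rough approximation: ignore code points with a combining class."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

clef = "\U0001D11E"              # outside the BMP: one surrogate pair
hebrew = "\u05e9\u05c1\u05b8"    # shin + shin dot + qamats

print(utf16_unit_count(clef), code_point_count(clef))
print(code_point_count(hebrew), approx_user_char_count(hebrew))
```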
The other suggestion I have for you is to give up using ANSI code pages. If you really care about internationalisation then you need to use Unicode.

How to get a single Arabic letter in a string with its Unicode transformation value in DELPHI?

Considering this Arabic word(جبل) made of 3 letters .
-the first letter is جـ,
-name is (ǧīm),
-its Unicode value is FE9F when its in the beginning,
-its basic value is 062C and
-its isolated value is FE9D but the last two values return the same shape drawing ج .
Now, whenever I try to get it as a single character (I have tried many different ways), Delphi returns the basic Unicode value.
Well, that makes sense, but what happens to the char with the transformation? It is a single char too. It looks like it takes the transformed value only when it is within a string, but where? How do I extract it? When, and which process, decides these values?
Again the MAIN QUESTION:
How can I get the Arabic letter or its Unicode value as it is within a string?
Just for information: unlike English, which has two cases for its letters (capital and small), Arabic has four cases (isolated, beginning, middle and end), with different rules as well.
I'm not sure I understand the question. If you want to know how to write U+FE9F in Delphi source code, in a modern Unicode version of Delphi, do it simply like so:
Char($FE9F)
If you want to read individual characters from جبل then do it like this:
const
MyWord = 'جبل';
var
c: Char;
....
c := MyWord[1];//this is U+062C
Note that the code above is fine for your particular word because each code point can be encoded with a single UTF-16 WideChar character element. If the code point required multiple elements, then it would be best to transform to UTF-32 for code point level processing.
Now, let's look at the string that you included in the question. I downloaded this question using wget and the file that came down the wires was UTF-8 encoded. I used Notepad++ to convert to UTF16-LE and then picked out the three UTF-16 characters of your string. They are:
U+062C
U+0628
U+0644
You stated:
The first letter is جـ, name is (ǧīm), its Unicode value is U+FE9F.
But that is simply incorrect. As can be seen from the above, the actual character you posted was U+062C. So the reason why your attempts to read the first character yield U+062C is that U+062C really is the first character of your string.
The bottom line is that nothing in your Delphi code is transforming your character. When you do:
S[1] := Char($FE9F);
the compiler performs a simple two byte copy. There is no context aware transformation that occurs. And likewise when reading S[1].
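The same point can be checked outside Delphi. The following Python sketch uses the standard `unicodedata` module to list the code points actually stored in the word and to show how the presentation form U+FE9F relates to the base letter U+062C. It is an illustration only and says nothing about how Delphi or Windows render text.

```python
import unicodedata

word = "\u062c\u0628\u0644"      # the three letters of the word in the question

stored = [f"U+{ord(ch):04X}" for ch in word]

initial_form = "\ufe9f"          # presentation form, not present in the word
form_name = unicodedata.name(initial_form)
decomposition = unicodedata.decomposition(initial_form)
base_letter = unicodedata.normalize("NFKC", initial_form)

print(stored)
print(form_name)
print(decomposition)
print(f"U+{ord(base_letter):04X}")
```

The string holds only base letters. The initial, medial and final shapes are chosen later, when the text is shaped for display.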
Let's look at how these characters are displayed, using this simple code on a VCL forms application that contains a memo control:
Memo1.Clear;
Memo1.Lines.Add(StringOfChar(Char($FE9F), 2));
Memo1.Lines.Add(StringOfChar(Char($062C), 2));
The output looks like this:
As you can see, the rendering layer knows what to do with a U+062C character that appears at the beginning of the string.
Shaping of Arabic characters for presentation in Windows is served by the Uniscribe services (USP10.dll).
UniScribe
You may find the following blog post useful:
Roozbeh's Programming Blog
I don't think you can do it using string/char related methods. But using a PChar, maybe you can access the memory and read the PWord values directly.
EDIT: After discussing with David, I think that you will always get the basic/isolated value of the letter. The fact that a beginning or end glyph is used is probably just handled by the display framework of the OS.

Delphi 2009 + Unicode + Char-size

I just got Delphi 2009 and have previously read some articles about modifications that might be necessary because of the switch to Unicode strings.
Mostly, it is mentioned that sizeof(char) is not guaranteed to be 1 anymore.
But why would this be interesting regarding string manipulation?
For example, if I use an AnsiString:='Test' and do the same with a String (which is unicode now), then I get Length() = 4 which is correct for both cases.
Without having tested it, I'm sure all other string manipulation functions behave the same way and decide internally if the argument is a unicode string or anything else.
Why would the actual size of a char be of interest for me if I do string manipulations?
(Of course if I use strings as strings and not to store any other data)
Thanks for any help!
Holger
With Unicode, SizeOf(SomeChar) <> Length(SomeChar). Essentially, the length of a string is less than the sum of the sizes of its chars. As long as you don't assume SizeOf(Char) = 1 or SizeOf(SomeString[x]) = 1 (since both are FALSE now), and don't try to interchange bytes with chars, you shouldn't have any trouble. Anywhere you are doing something creative, such as stuffing Bytes into Chars or Strings, you will need to use AnsiString.
(SizeOf(SomeString) is still 4 no matter the length since it is essentially a pointer with some compiler magic.)
People often implicitly convert from characters to bytes in old Delphi code without really thinking about it. For example, when writing to a stream. When you write a string to a stream, you have to specify the number of bytes you write, but people often pass the character count instead. See this post from Chris Bensen for another example.
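The stream-writing mistake can be reproduced in miniature. This Python sketch assumes UTF-16 little-endian storage with two bytes per character for this ASCII text, and uses an in-memory stream; it only illustrates the character-count versus byte-count confusion and is not Delphi code.

```python
import io

text = "Test"
encoded = text.encode("utf-16-le")   # assumption: 2 bytes per char here

char_count = len(text)               # number of characters
byte_count = len(encoded)            # number of bytes

# Bug: passing the character count where a byte count is expected
# writes only half of the data.
buggy = io.BytesIO()
buggy.write(encoded[:char_count])
truncated = buggy.getvalue().decode("utf-16-le")

# Correct: pass the byte count.
fixed = io.BytesIO()
fixed.write(encoded[:byte_count])
complete = fixed.getvalue().decode("utf-16-le")

print(char_count, byte_count, truncated, complete)
```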
Another way people often make this implicit conversion in older code is by using a "string" to store binary data. In this case, they actually want bytes, but the data type expects characters. D2009 has a better type for this.
I didn't try Delphi 2009, but I am using fpc, which is also switching to Unicode slowly. I'm 95% sure that everything below also holds for Delphi 2009.
In fpc (when supporting Unicode), functions like 'length' will take the codepage into consideration. Thus it will return the length of the string as a 'human' would see it. If there are, for example, two Chinese characters that both take two bytes of memory in Unicode, length will return 2, since there are two characters in the string. But the string will take 4 bytes of memory (plus the memory for the reference count and the trailing #0, but that aside).
What you can not do anymore is this:
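var
  p: pchar;
  i: integer;
begin
  p := @s[1]; // pointer to the first element of s
  for i := 0 to length(s) - 1 do
  begin
    write(p^); // writes one element per step, not one 'real' character
    inc(p);
  end;
end;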
Because this code will - in the two chinese-character example - write the wrong two characters. Namely the two bytes which are part of the first 'real' character.
In short: Length() doesn't return the number of bytes allocated for the string anymore, but the number of characters. (Before the switch to Unicode, those two values were equal to each other.)
The actual size of a character shouldn't matter, unless you are doing the manipulation at the byte level.
(Of course if I use strings as strings and not to store any other data)
That's the key point, YOU don't use strings for other purposes, but some people do. They use strings just like arrays, so they (and that's including me) would need to check all such uses to make sure nothing is broken...
Let's not forget that there are times when this conversion is not really desired. Say, for storing a GUID in a record, for instance. The GUID can only contain hexadecimal characters plus the - and brackets, so making them take up twice the space can have quite an impact on existing code. Sure, the simple solution is to change them to AnsiString and deal with the compiler warnings if you do any string manipulation on them.
It can be an issue if you make Windows API calls. Or if you have legacy code that does inc or dec of str[0] to change its length.
