Parsing \"–\" with Erlang re - erlang

I've parsed an HTML page with mochiweb_html and want to parse the following text fragment
0 – 1
Basically I want to split the string on the spaces and dash character and extract the numbers in the first characters.
Now the string above is represented as the following Erlang list
[48,32,226,128,147,32,49]
I'm trying to split it using the following regex:
{ok, P}=re:compile("\\xD2\\x80\\x93"), %% characters 226, 128, 147
re:split([48,32,226,128,147,32,49], P, [{return, list}])
But this doesn't work; it seems the \xD2 character is the problem [if I remove it from the regex, the split occurs]
Could someone possibly explain
what I'm doing wrong here ?
why the '–' character seemingly requires three integers for representation [226, 128, 147]
Thanks.

226,128,147 is E2,80,93 in hex.
> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).
["0 "," 1"]

As to your second question, about why a dash takes 3 bytes to encode, it's because the dash in your input isn't an ASCII hyphen (hex 2D), but is a Unicode en-dash (hex 2013). Your code is recieving this in UTF-8 encoding, rather than the more obvious UCS-2 encoding. Hex 2013 comes out to hex E28093 in UTF-8 encoding.
If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.

Related

AES encrypt string appear \r\n

When I use AES128 encrypt string, if the encrypted string is too long then it will contain \r\n in it. like this
Now I have to use empty string to replace it. Why does the encrypt-string contain the \r\n and any better way to avoid it or fix it.
Thanks.
Answers: it's caused by the Base64 encoding process, every 64 characters will insert a \r\n .
That is a Base64 encoded string.
Actual encryption output is an array of 8-bit bytes, not characters. The code is Base64 encoding the encrypted data with an option to insert line breaks every 64 characters, this is sometimes to allow better printing of the output. When it is decoded use the NSDataBase64DecodingIgnoreUnknownCharacters option to remove line breaks .
In particular for Objective-C the to create a Base64 string from NSData is:
- (NSString *)base64EncodedStringWithOptions:(NSDataBase64EncodingOptions)options
The options include:
NSDataBase64Encoding64CharacterLineLength
Set the maximum line length to 64 characters, after which a line ending is inserted.
Which inserts "\r\n" (carriage return, new line) characters each 64 characters.
If that is not what you want pass 0 as the option value.
To decode Base64 use the Objective-C method:
- (instancetype)initWithBase64EncodedString:(NSString *)base64String options:(NSDataBase64DecodingOptions)options
With the option: NSDataBase64DecodingIgnoreUnknownCharacters.
Apple code:
The default implementation of this method will reject non-alphabet characters, including line break characters. To support different encodings and ignore non-alphabet characters, specify an options value of NSDataBase64DecodingIgnoreUnknownCharacters.
The thing that gives it away as a Base64 string is a length that is a multiple of 4, the characters used "a-zA-Z0-9+/" and the trailing "=" characters.
Historic note: These days on OSX and iOS line breaks are a single "\n" (0x0a) line feed character. Back when we used teletypes as terminals "\r" (0x0d) carriage return moved the carriage or print head back but did not move the paper up to the next line. "\n" newline moved the paper up one line but did move the carriage or print head back. They were two distinct mechanical operations. Later some systems used either "\r\n", "\n\r", "\r" or "\n". Unix choose "\n" and thus OSX and iOS.

Best Ansi Escape beginning

Which Ansi escape sequence is the most portable and/or simply best and why?
1. "\u001B[32;1mThis is bright green\u001B[0m"
2. "\x1B[33;1mThis is bright yellow\x1B[0m"
3. "\e[35;4;1mThis is bright purple underlined\e[0m"
I have been using printf "\x1B[32;1mgreen\x1B[0m" (that's an example in unix bash script for example) out of habit, but I was wondering if there were any reasons to use one over the other. Is one more portable than the others? That would be my assumption.
Also, if you know of any other Ansi Escape sequence feel free to share it in the comments or at the end of your answer.
If you don't know what an Ansi Escape sequence is or want to become more familiar with it, then here you go: http://en.wikipedia.org/wiki/ANSI_escape_code
NOTE:
All of the escape sequences above have worked on all of the Unix systems I have been on, however one must still rely on the system itself to interpret the escape codes. Windows, for example, does not permit any sort of escape codes except four (BEL, L-F or linefeed, C-R or carriage return and, of course, BS or backspace), so Ansi escape sequences will not work.
Short answer: It depends on the host string parser.
Long answer:
It depends on the string parser; that is, the piece of code that actually takes in your string ("\x1b[1mSome string\x1b[0m") as a literal and parses the escape characters using the backslash ANSI escape sequence.
For parsers that support hexadecimal escapes (\x), then \x1b (character 0x1B) should work.
For parsers that support octal escapes (\ddd), then \033 (octal 33) should work.
For parsers that support unicode escapes (\u), then \u001B should work.
Quick elaboration: \x and \u are similar; \x usually refers to a single character, 0-255, in hexadecimal radix. \u means the same (as it is represented in hexadecimal), but supports two bytes (in most parsers) and generally refers to 16-bit unicode characters.
A lesser used/supported escape character, as you mentioned, is \e. This escape is most commonly used with parsers/languages that expect a lot of ANSI escaping to happen, such as bash (and most other shells).
For instance, Node.js does not support \e:
> console.log("\x1b[31mhello\x1b[0m")
hello
undefined
> console.log("\e[31mhello\e[0m")
e[31mhelloe[0m
undefined
Neither does Lua:
> print('\x1b[31mhello\x1b[0m')
hello
> print('\e[31mhello\e[0m')
stdin:1: invalid escape sequence near '\e'
Or even Python:
>>> print("\x1b[31mhello\x1b[0m")
hello
>>> print("\e[31mhello\e[0m")
\e[31mhello\e[0m
>>>
Though PHP does:
<?php
echo "\x1b[31mhello\x1b[0m\n"; // hello
echo "\e[31mhello\e[0m\n"; // hello

Showing wrong character for an unicode value in iOS

I am now working with an iOS app that handle unicode characters, but it seems there is some problem with translating unicode hex value (and int value too) to character.
For example, I want to get character 'đ' which has Unicode value of c491, but after this code:
NSString *str = [NSString stringWithUTF8String:"\uc491"];
The value of str is not 'đ' but '쓉' (a Korean word) instead.
I also used:
int c = 50321; // 50321 is int value of 'đ'
NSString *str = [NSString stringWithCharacters: (unichar *)&c length:1];
But the results of two above pieces of code are the same.
I can't understand what is problem here, please help!
The short answer
To specify đ, you can specify it in the following ways (untested):
#"đ"
#"\u0111"
#"\U00000111"
[NSString stringWithUTF8String: "\u0111"]
[NSString stringWithUTF8String: "\xc4\x91"]
Note that the last 2 lines uses C string literal instead of Objective-C string object literal construct #"...".
As a short explanation, \u0111 is the Unicode escape sequence for đ, where U+0111 is the code point for the character đ.
The last example shows how you would specify the UTF-8 encoding of đ (which is c4 91) in a C string literal, then convert the bytes in UTF-8 encoding into proper characters.
The examples above are adapted from this answer and this blog post. The blog also covers the tricky situation with characters beyond Basic Multilingual Plane (Plane 0) in Unicode.
Unicode escape sequences (Universal character names in C99)
According to this blog1:
Unicode escape sequences were added to the C language in the TC2 amendment to C99, and to the Objective-C language (for NSString literals) with Mac OS X 10.5.
Page 65 of C99 TC2 draft shows that \unnnn or \Unnnnnnnn where nnnn or nnnnnnnn are "short-identifier as defined by ISO/IEC 10646 standard", it roughly means hexadecimal code point. Note that:
A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (#), or 0060 (`), nor one in the range D800 through DFFF inclusive.
Character set vs. Character encoding
It seems that you are confused between code point U+0111 and UTF-8 encoding c4 91 (representation of the character as byte). UTF-8 encoding is one of the encoding for Unicode character set, and code point is a number assigned to a character in a character set. This Wikipedia article explains quite clearly the difference in meaning.
A coded character set (CCS) specifies how to represent a repertoire of characters using a number of (typically non-negative) integer values called code points. [...]
A character encoding form (CEF) specifies the conversion of a coded character set's integer codes into a set of limited-size integer code values that facilitate storage in a system that represents numbers in binary form using a fixed number of bits [...]
There are other encoding, such as UTF-16 and UTF-32, which may give different byte representation of the character on disk, but since UTF-8, UTF-16 and UTF-32 are all encoding for Unicode character set, the code point for the same character is the same between all 3 encoding.
Footnote
1: I think the blog is correct, but if anyone can find official documentation from Apple on this point, it would be better.

Lua pattern match around comma

I have several small place marks such as 'א,א' 'א,ב'. If we use the comma as the center point, i need at most 2 characters before the comma, and up to the next space after the comma.
I have (.-,.-)%s but its not doing what I need. Any idea?
Also as you can see there not latin letters so using %l will not work.
There are couple of issues here. First, a minor issue: .-, will match as little as possible before the coma, that is zero characters. You should anchor the beginning of the matched string.
The more complicated issue is that you use Hebrew letters. The problem is that Lua has no concept of multi-byte characters.
If you use a 8-bit encoding such as Windows-1255, or ISO-8859-8, then you probably can simply match against a character class [ת-א]. If you have properly set Hebrew locale, %l should work fine for you.
If you use UTF-8 or any other encoding that uses multi-byte characters, then you must construct a regex that has all the Hebrew alphabet escaped as a sequence of octets. The aleph is U+05D0x, which in UTF-8 will be represented as 0xD7 0x90. The tav is U+05EA, which will be encoded as 0xD7 0xAA.
In Lua you can escape any 8-bit character with a backslash + decimal code. All the hebrew characters encoded in UTF-8 have the first byte the same -- 0xD7, that is "\215". The second character can be anything from "\144" to "\170". Thus, the regex that will match a single Hebrew letter is: "\215[\144-\170]". Put that in your original regex, where you had single dots that match any character.
Of course, the above reasoning must be modified for encodings different than UTF-8. Right-to-left writing direction in Hebrew is another thing to keep in mind.

Load testing tool URL encoding system

I have a load testing tool (Borland's SilkPerformer) that is encoding / character as \x252f. Can anyone please tell me what encoding system the tool might be using?
Two different escape sequences have been combined:
a C string hexadecimal escape sequence.
an URL-encoding scheme (percent encoding).
See this little diagram:
+---------> C escape in hexadecimal notation
: +------> hexadecimal number, ASCII for '%'
: : +---> hexadecimal number, ASCII for '/'
\x 25 2f
Explained step for step:
\ starts a C string escape sequence.
x is an escape in hexadecimal notation. Two hex digits are expected to follow.
25 is % in ASCII, see ASCII Table.
% starts an URL encode, also called Percent-encoding. Two hex digits are expected to follow.
2f is the slash character (/) in ASCII.
The slash is the result.
Now I don't know why your software chooses to encode the slash character in such a weird way. Slash characters in urls need to be url encoded if they don't denote directory separators (the same thing the backslash does for Windows). So you will often find the slash character being encoded as %2f. That's normal. But I find it weird and a bit suspicious that the percent character is additionally encoded as a hexadecimal escape sequence for C strings.

Resources