Where should my brackets be in relation to the text for Arabic languages?

Our application automatically modifies the layout of Arabic text when it is followed by a bracket, and I was wondering whether this is the correct behaviour.
The application shows items in the following format:
[ID of structure](version)
So version 1.5 of the English structure "stackoverflow" would be displayed as:
stackoverflow(1.5)
Note: the brackets need to be displayed. There is no space between the ID and the first bracket. The brackets simply encompass the version. The brackets could have been any character but it's far too late to switch to a different character now!
This works fine for left-to-right languages, but for Arabic the structures appear in the form:
ستاكوفيرفلوو(1.0)
I am not an Arabic speaker and I need to know if this is actually correct. Is the Arabic format the equivalent of the English format or has something gone horribly wrong?

The text in Arabic should be shown like this:
ستاكوفيرفلوو(1.0) ‏
I added the HTML entity for the RLM (Right-to-Left Mark) to fix the text. You should do the same if your application doesn't support bidi natively. You can add the RLM in any of these ways:
HTML Entity (decimal): &#8207;
HTML Entity (hex): &#x200F;
HTML Entity (named): &rlm;
How to type in Microsoft Windows: Alt+200F
UTF-8 (hex): 0xE2 0x80 0x8F (e2808f)
UTF-8 (binary): 11100010:10000000:10001111
UTF-16 (hex): 0x200F (200f)
UTF-16 (decimal): 8,207
UTF-32 (hex): 0x0000200F (200f)
UTF-32 (decimal): 8,207
C/C++/Java source code: "\u200F"
Python source code: u"\u200F"
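If you build the label programmatically, appending the mark is trivial. A minimal sketch in Python (the variable names and f-string layout are just for illustration; only the trailing "\u200F" matters):

# Append U+200F (RLM) after the closing bracket so the whole
# "ID(version)" token keeps its right-to-left context when displayed.
structure_id = "ستاكوفيرفلوو"  # the Arabic structure ID from the question
version = "1.5"
RLM = "\u200F"
print(f"{structure_id}({version}){RLM}")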
(Note: the correct transliteration of Stack Overflow is ستاك-أوفرفلو.)


Is there something wrong with CommonMark?

If you try **b.**a on https://spec.commonmark.org/dingus/, you will see that the b. is not bolded.
However, if you just omit the trailing a, it works.
What is the CommonMark syntax to bold it correctly (that is, a bolded b. followed by an unbolded a)?
According to section 6.2 of the CommonMark specification, that behaviour is by design:
A delimiter run is either a sequence of one or more * characters [...]
A right-flanking delimiter run is a delimiter run that is (1) not preceded by Unicode whitespace, and either (2a) not preceded by a Unicode punctuation character, or (2b) preceded by a Unicode punctuation character and followed by Unicode whitespace or a Unicode punctuation character.
You can get the desired visual appearance with **b.**&#8205;a, where &#8205; is the HTML entity for the zero-width joiner, thus:
**b.**&#8205;a
...and we've just found a bug in the rendering here on Stack Overflow: in edit mode, the preview shows it correctly.
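If you want to check the flanking rules yourself, here is a sketch using the commonmark reference package for Python (assuming its documented commonmark.commonmark(text) helper, which returns rendered HTML):

import commonmark  # pip install commonmark

# The closing ** is preceded by '.' and followed by 'a': not right-flanking,
# so no emphasis is parsed and the asterisks stay literal.
print(commonmark.commonmark("**b.**a"))
# With the entity, the closing ** is followed by a literal '&' (punctuation),
# which satisfies rule (2b), so "b." is bolded.
print(commonmark.commonmark("**b.**&#8205;a"))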

Escape Unicode Characters for iOS

There are some Unicode arrangements that I want to use in my app. I am having trouble properly escaping them for use.
For instance this Unicode sequence: 🅰
If I escape it using an online tool I get: \ud83c\udd70
But of course this is an invalid sequence per the compiler:
var str = NSString.stringWithUTF8String("\ud83c\udd70")
Also if I do this:
var str = NSString.stringWithUTF8String("\ud83c")
I get an error "Invalid Unicode Scalar"
I'm trying to use these Unicode "fonts":
http://www.panix.com/~eli/unicode/convert.cgi?text=abcdefghijklmnopqrstuvwxyz
If I view the source of this website I see sequences like this:
&#x1D552
I'm struggling to wrap my head around the "proper" way to work with and escape Unicode.
I simply need to figure out a way to get these working on iOS.
Any thoughts?
\ud83c\udd70 is a UTF-16 surrogate pair which encodes the Unicode character 🅰 (U+1F170). Swift string literals do not use UTF-16, so that escape sequence doesn't make sense. However, since 1F170 has five digits, you can't use a \uXXXX escape sequence (which only accepts four hexadecimal digits). Instead, use a \UXXXXXXXX sequence (note the capital U), which accepts eight:
var str = "\U0001F170" // returns "🅰"
You can also just paste the character itself into your string:
var str = "🅰" // returns "🅰"
Swift is an early beta and is broken in many ways. This issue is a Swift bug.
let ringAboveA: String = "\u0041\u030A" is Å and is accepted
let negativeSquaredA: String = "\uD83C\uDD70" is 🅰 and produces an error
Both are multi-unit UTF-16 sequences that Objective-C accepts. The difference is that the composed character 🅰 is in plane 1, outside the Basic Multilingual Plane.
Note: to get the UTF-32 code point, either use the OS X Character Viewer or a code snippet:
NSLog(@"utf32: %@", [@"🅰" dataUsingEncoding:NSUTF32BigEndianStringEncoding]);
utf32: <0001f170>
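For comparison, the same check sketched in Python:

print("🅰".encode("utf-32-be").hex())  # 0001f170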
To get the Character Viewer, open "System Preferences" from the Apple menu, go to "Keyboard", the "Keyboard" tab, and select the checkbox "Show Keyboard & Character Viewers in menu bar". The "Character Viewer" item will be in the menu bar just to the left of the date.
After entering the character, right-click (or Control-click) it in Favorites to copy the search results.
Copied information:
🅰
NEGATIVE SQUARED LATIN CAPITAL LETTER A
Unicode: U+1F170 (U+D83C U+DD70), UTF-8: F0 9F 85 B0
Better yet: add Unicode to the list on the left and select the character there.

Showing the wrong character for a Unicode value in iOS

I am working on an iOS app that handles Unicode characters, but there seems to be a problem translating a Unicode hex value (and its int value) to a character.
For example, I want to get character 'đ' which has Unicode value of c491, but after this code:
NSString *str = [NSString stringWithUTF8String:"\uc491"];
The value of str is not 'đ' but '쓉' (a Korean word) instead.
I also used:
int c = 50321; // 50321 is int value of 'đ'
NSString *str = [NSString stringWithCharacters: (unichar *)&c length:1];
But the results of the two pieces of code above are the same.
I can't understand what is problem here, please help!
The short answer
To specify đ, you can use any of the following ways (untested):
#"đ"
#"\u0111"
#"\U00000111"
[NSString stringWithUTF8String: "\u0111"]
[NSString stringWithUTF8String: "\xc4\x91"]
Note that the last two lines use C string literals instead of the Objective-C string literal construct @"...".
As a short explanation, \u0111 is the Unicode escape sequence for đ, where U+0111 is the code point for the character đ.
The last example shows how you would specify the UTF-8 encoding of đ (which is c4 91) in a C string literal, then convert the bytes in UTF-8 encoding into proper characters.
The examples above are adapted from this answer and this blog post. The blog also covers the tricky situation with characters beyond Basic Multilingual Plane (Plane 0) in Unicode.
Unicode escape sequences (Universal character names in C99)
According to this blog [1]:
Unicode escape sequences were added to the C language in the TC2 amendment to C99, and to the Objective-C language (for NSString literals) with Mac OS X 10.5.
Page 65 of the C99 TC2 draft shows that the forms are \unnnn or \Unnnnnnnn, where nnnn or nnnnnnnn is a "short identifier as defined by ISO/IEC 10646", which roughly means the hexadecimal code point. Note that:
A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.
Character set vs. Character encoding
It seems that you are confused between the code point U+0111 and the UTF-8 encoding c4 91 (the representation of the character as bytes). UTF-8 is one of the encodings for the Unicode character set, and a code point is a number assigned to a character in a character set. This Wikipedia article explains the difference in meaning quite clearly:
A coded character set (CCS) specifies how to represent a repertoire of characters using a number of (typically non-negative) integer values called code points. [...]
A character encoding form (CEF) specifies the conversion of a coded character set's integer codes into a set of limited-size integer code values that facilitate storage in a system that represents numbers in binary form using a fixed number of bits [...]
There are other encodings, such as UTF-16 and UTF-32, which may give a different byte representation of the character on disk, but since UTF-8, UTF-16, and UTF-32 are all encodings of the Unicode character set, the code point for a given character is the same across all three.
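To see the distinction concretely, here is an illustrative Python sketch that reproduces the original mistake: reading the UTF-8 bytes c4 91 as if they were a code point yields the Korean syllable from the question:

print(hex(ord("đ")))                # 0x111 -- the code point, U+0111
print("đ".encode("utf-8").hex())    # c491  -- the UTF-8 byte representation
print(b"\xc4\x91".decode("utf-8"))  # đ     -- decoding the bytes properly
print("\uc491")                     # 쓉    -- misreading the bytes as U+C491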
Footnote
[1]: I think the blog is correct, but if anyone can find official documentation from Apple on this point, that would be better.

Lua pattern match around comma

I have several small place marks such as 'א,א' and 'א,ב'. Using the comma as the center point, I need at most 2 characters before the comma, and everything up to the next space after the comma.
I have (.-,.-)%s but it's not doing what I need. Any ideas?
Also, as you can see, these are not Latin letters, so using %l will not work.
There are a couple of issues here. First, a minor one: .-, will match as little as possible before the comma, that is, zero characters. You should anchor the beginning of the matched string.
The more complicated issue is that you use Hebrew letters. The problem is that Lua has no concept of multi-byte characters.
If you use an 8-bit encoding such as Windows-1255 or ISO-8859-8, then you can probably simply match against a character class [ת-א]. If you have a properly set Hebrew locale, %l should work fine for you.
If you use UTF-8 or any other encoding with multi-byte characters, then you must construct a pattern with the whole Hebrew alphabet escaped as a sequence of octets. Alef is U+05D0, which UTF-8 represents as 0xD7 0x90. Tav is U+05EA, which is encoded as 0xD7 0xAA.
In Lua you can escape any 8-bit character with a backslash plus its decimal code. All the Hebrew letters encoded in UTF-8 share the same first byte, 0xD7, that is "\215". The second byte can be anything from "\144" to "\170". Thus, the pattern that matches a single Hebrew letter is "\215[\144-\170]". Put that in your original pattern where you had single dots matching any character.
Of course, the above reasoning must be adapted for encodings other than UTF-8. The right-to-left writing direction of Hebrew is another thing to keep in mind.
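As a sanity check of those byte values, an illustrative sketch in Python (handy for dumping bytes, whatever language the matching itself is done in):

# Every Hebrew letter alef..tav (U+05D0..U+05EA) encodes in UTF-8 as
# 0xD7 (215) followed by a byte in the range 0x90..0xAA (144..170).
for cp in range(0x05D0, 0x05EA + 1):
    first, second = chr(cp).encode("utf-8")
    assert first == 0xD7 and 0x90 <= second <= 0xAA
print('all Hebrew letters match "\\215[\\144-\\170]"')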

Parsing \"–\" with Erlang re

I've parsed an HTML page with mochiweb_html and want to parse the following text fragment
0 – 1
Basically I want to split the string on the spaces and the dash character and extract the numbers.
Now the string above is represented as the following Erlang list
[48,32,226,128,147,32,49]
I'm trying to split it using the following regex:
{ok, P}=re:compile("\\xD2\\x80\\x93"), %% characters 226, 128, 147
re:split([48,32,226,128,147,32,49], P, [{return, list}])
But this doesn't work; it seems the \xD2 character is the problem [if I remove it from the regex, the split occurs]
Could someone possibly explain
what I'm doing wrong here?
why the '–' character seemingly requires three integers for representation [226, 128, 147]
Thanks.
226, 128, 147 are E2, 80, 93 in hex; your pattern has \xD2 where it needs \xE2.
> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).
["0 "," 1"]
As to your second question, about why the dash takes 3 bytes to encode: it's because the dash in your input isn't an ASCII hyphen (hex 2D) but a Unicode en-dash (hex 2013). Your code is receiving it in UTF-8 encoding rather than the more obvious UCS-2 encoding. Hex 2013 comes out to hex E28093 in UTF-8 encoding.
If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.
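The byte-level view is easy to reproduce; an illustrative sketch in Python:

s = bytes([48, 32, 226, 128, 147, 32, 49])  # the list from the question
print("\u2013".encode("utf-8").hex())       # e28093 -- en-dash in UTF-8
print(s.split("\u2013".encode("utf-8")))    # [b'0 ', b' 1']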
