Unknown character ı̸̸̸̸̸̸̸̸̸̸̸̸̸̸̸̸̸̸̨̨̨̨̨̨̨̨ [closed] - character-encoding

This is a bit of a silly question, but I stumbled upon this strange "character" today: ı̸̸̸̸̸̸̸̸̸̸̸̸̸̸̸̸̸̸̨̨̨̨̨̨̨̨ Try to copy it into a text editor; you will see that you have to press backspace several times in order to erase it, so I suppose it actually consists of several characters, but I have no idea how to analyze it further.
Any ideas?
Thanks

Use a hex editor to view the raw character data. Your example consists of three multibyte characters with special meanings. Here you go:
◌̸
U+0338 COMBINING LONG SOLIDUS OVERLAY
General Character Properties
In Unicode since: 1.1
Unicode category: Mark, Non-Spacing
Various Useful Representations
UTF-8: 0xCC 0xB8
UTF-16: 0x0338
C octal escaped UTF-8: \314\270
XML decimal entity: ̸
Annotations and Cross References
Alias names:
• long slash overlay
----------------------
◌̨
U+0328 COMBINING OGONEK
General Character Properties
In Unicode since: 1.1
Unicode category: Mark, Non-Spacing
Various Useful Representations
UTF-8: 0xCC 0xA8
UTF-16: 0x0328
C octal escaped UTF-8: \314\250
XML decimal entity: ̨
Annotations and Cross References
Alias names:
• nasal hook
Notes:
• Americanist: nasalization
• Polish, Lithuanian
See also:
• U+02DB OGONEK
----------------------
ı
U+0131 LATIN SMALL LETTER DOTLESS I
General Character Properties
In Unicode since: 1.1
Unicode category: Letter, Lowercase
Various Useful Representations
UTF-8: 0xC4 0xB1
UTF-16: 0x0131
C octal escaped UTF-8: \304\261
XML decimal entity: ı
Annotations and Cross References
Notes:
• Turkish, Azerbaijani
• uppercase is U+0049 LATIN CAPITAL LETTER I
See also:
• U+0069 LATIN SMALL LETTER I
I found this out using a hex editor and a program for displaying a character map. Probably you could have done it yourself. The first two are combining overlay characters, and that's why you have to hit backspace several times: they don't occupy a position of their own in the text; they modify the preceding character's appearance.
What the characters are doing in your text nobody here can tell you. You have to find it out yourself. (Maybe random binary data in a text file?)
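If you would rather not reach for a hex editor, a few lines of Python (my own sketch, not from the thread; unicodedata is in the standard library) can name each code point for you:
import unicodedata
# The mystery string, written with escapes so the combining marks are visible:
# dotless i + combining long solidus overlay + combining ogonek.
s = "\u0131\u0338\u0328"
for ch in s:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
# U+0131 LATIN SMALL LETTER DOTLESS I
# U+0338 COMBINING LONG SOLIDUS OVERLAY
# U+0328 COMBINING OGONEK
The original "character" simply repeats U+0338 and U+0328 many times over the single base letter.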

Related

MALLET default token does not remove bracket

In Java MALLET, the default token should be one or more characters in [A-Za-z], according to their website. However, when I have a text such as:
lower(location select testing) top
it thinks "lower(location" is one word. But the default token should consist only of letters. How can I deal with this situation?
The documentation had not been updated for the most recent version of MALLET; thank you for pointing this out. Here's a current version:
As of version 2.0.8, the default token expression is '\p{L}[\p{L}\p{P}]+\p{L}', which works for all Unicode letters and supports typical English non-letter patterns such as hyphens, apostrophes, and acronyms. Note that this expression also implicitly drops one- and two-letter words. Other options include:
For non-English text, a good choice is --token-regex '[\p{L}\p{M}]+', which means Unicode letters and marks (marks are required for Indic scripts). MALLET currently does not support Chinese or Japanese word segmentation.
To include short words, use '\p{L}+' (letters only) or '\p{L}[\p{L}\p{P}]*\p{L}|\p{L}' (letters, possibly with internal punctuation).
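As a quick sanity check of these patterns (a Python sketch rather than MALLET's own Java tokenizer; the third-party regex module is assumed here because the standard re module does not support \p{L} classes):
import regex  # pip install regex
default_token = r"\p{L}[\p{L}\p{P}]+\p{L}"   # the 2.0.8 default
letters_marks = r"[\p{L}\p{M}]+"             # the non-English-friendly option
text = "lower(location select testing) top"
print(regex.findall(default_token, text))  # ['lower(location', 'select', 'testing', 'top']
print(regex.findall(letters_marks, text))  # ['lower', 'location', 'select', 'testing', 'top']
Note how the default pattern keeps '(' inside a token because \p{P} (punctuation) is allowed between the first and last letter, which is exactly the behavior the question describes.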

String comparison (>) returns different results on different platforms? [duplicate]

Consider the following predicate:
print("S" > "g")
Running this in Xcode yields false, whereas running it on the online compiler of tutorialspoint or, e.g., the IBM Swift Sandbox (Swift Dev. 4.0 (Sep 5, 2017) / Platform: Linux (x86_64)) yields true.
How come the predicate gives a different result on the online compilers (Linux?) compared to Xcode?
This is a known open "bug" (or perhaps rather a known limitation):
SR-530 - [String] sort order varies on Darwin vs. Linux
Quoting Dave Abrahams' comment to the open bug report:
This will mostly be fixed by the new string work, wherein String's
default sort order will be implemented as a lexicographical ordering
of FCC-normalized UTF16 code units.
Note that on both platforms we rely on ICU for normalization services,
and normalization differences among different implementations of ICU
are a real possibility, so there will never be a guarantee that two
arbitrary strings sort the same on both platforms.
However, for Latin-1 strings such as those in the example, the new
work will fix the problem.
Moreover, from the String Manifesto:
Comparing and Hashing Strings
...
Following this scheme everywhere would also allow us to make sorting
behavior consistent across platforms. Currently, we sort String
according to the UCA, except that--only on Apple platforms--pairs of
ASCII characters are ordered by unicode scalar value.
Most likely, for the OP's particular example (covering solely ASCII characters), comparison according to the UCA (Unicode Collation Algorithm) is used on Linux platforms, whereas on Apple platforms these single-ASCII-character Strings (or String instances starting with ASCII characters) are sorted by Unicode scalar value.
// Unicode scalar values (decimal)
print("S".unicodeScalars.first!.value) // 83
print("g".unicodeScalars.first!.value) // 103

// Unicode scalar values (hexadecimal)
print(String(format: "%04X", "S".unicodeScalars.first!.value)) // 0053
print(String(format: "%04X", "g".unicodeScalars.first!.value)) // 0067

print("S" < "g") // 'true' on Apple platforms (comparison by Unicode scalar value),
                 // 'false' on Linux platforms (comparison according to the UCA)
See also the excellent accepted answer to the following Q&A:
What does it mean that string and character comparisons in Swift are not locale-sensitive?
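The split can also be reproduced outside Swift. A minimal Python sketch, assuming the third-party PyICU package for the UCA side:
from icu import Collator, Locale  # pip install PyICU
print("S" > "g")  # False: plain code-point order, U+0053 < U+0067 (like Apple platforms)
coll = Collator.createInstance(Locale("root"))  # language-neutral UCA collator
print(coll.compare("S", "g") > 0)  # True: "S" sorts after "g" under the UCA (like Linux)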

How to read a text file in ancient encoding?

There is a public project called Moby containing several word lists. Some files contain European alphabet symbols and were created in pre-Unicode times. The readme, dated 1993, reads:
"Foreign words commonly used in English usually include their
diacritical marks, for example, the acute accent e is denoted by ASCII
142."
Wikipedia says that the last ASCII symbol has number 127.
For example, this file: http://www.gutenberg.org/files/3203/files/mobypos.txt contains symbols that I couldn't read in any of various Latin encodings. (There are plenty of such symbols at the very end of the section of words beginning with B, just before the letter C.)
Could someone please advise what encoding should be used for reading this file, or how it can be converted to some readable modern encoding?
A little research suggests that the encoding for this page is Mac OS Roman, which has é at position 142. Viewing the page you linked and changing the encoding (in Chrome, View → Encoding → Western (Macintosh)) seems to display all the words correctly (the server incorrectly reports the encoding as ISO-8859-1).
How you deal with this depends on the language / tools you are using. Here's an example of how you could convert it into UTF-8 with Ruby:
require 'open-uri'
# Fetch the raw bytes, tag them as Mac OS Roman, then transcode to UTF-8.
# (On Ruby 3.0+ use URI.open; plain Kernel#open no longer accepts URLs.)
s = URI.open('http://www.gutenberg.org/files/3203/files/mobypos.txt').read
s.force_encoding('macRoman')
s.encode!('utf-8')
You are right in that ASCII only goes up to position 127 (it's a 7-bit encoding), but there are a large number of 8-bit encodings that are supersets of ASCII, and people sometimes refer to those as "Extended ASCII". It appears that whoever wrote the readme you refer to didn't know about the variety of encodings and thought the one they happened to be using at the time was universal.
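To make the ambiguity concrete, here is how the readme's "ASCII 142" decodes under a few common 8-bit code pages (a Python sketch; the Ruby above does the actual conversion):
b = bytes([142])                # the byte the readme calls "ASCII 142"
print(b.decode("mac_roman"))    # é  (Mac OS Roman, the Moby files' encoding)
print(b.decode("cp437"))        # Ä  (original IBM PC code page)
print(b.decode("cp1252"))       # Ž  (Windows Western European)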
There isn’t a general solution to problems like this, as there is no guaranteed way to determine the encoding of some text from the text itself. In this case I just used Wikipedia to look through a few until I found one that matched. Joel Spolsky’s article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is a good place to start reading about character sets and encodings if you want to learn more.

How to distinguish a WCHAR is Chinese, Japanese or ASCII?

For example, this Delphi code:
wchar_IsASCii : array[0..1] of WCHAR;
wchar_IsASCii[0] := 'A';
wchar_IsASCii[1] := 'じ';
How can I tell that wchar_IsASCii[0] belongs to ASCII and that wchar_IsASCii[1] does not?
Actually, I only need to know whether a Unicode character belongs to ASCII - that's all it takes to distinguish whether a WCHAR is Chinese, Japanese, or ASCII.
I don't know Delphi, but what I can tell you is that you need to determine what range the character falls into in Unicode. Here is a link about finding CJK characters in Unicode: What's the complete range for Chinese characters in Unicode?
And unless Delphi has some nice library for distinguishing Chinese and Japanese characters, you're going to have to determine that yourself. Here is a good answer on SO for how to do that:
Testing for Japanese/Chinese Characters in a string
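As a sketch of the range checks both links describe (Python for brevity; the same comparisons translate directly to Ord on a Delphi WCHAR, and the block boundaries below are only the basic ones, not a complete classifier):
def classify(ch: str) -> str:
    cp = ord(ch)
    if cp <= 0x7F:
        return "ASCII"
    if 0x3040 <= cp <= 0x30FF:   # Hiragana and Katakana blocks
        return "Japanese kana"
    if 0x4E00 <= cp <= 0x9FFF:   # CJK Unified Ideographs
        return "CJK ideograph"
    return "other"
print(classify("A"))   # ASCII
print(classify("じ"))  # Japanese kana
print(classify("中"))  # CJK ideograph
Keep in mind that a CJK ideograph alone cannot tell Chinese from Japanese; kana are the reliable Japanese signal, as the linked answer explains.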
The problem is: what do you mean by ASCII? The original ASCII standard is a 7-bit code (US-ASCII) - it does not even fill a whole byte.
Then if you come to so-called "extended ASCII" - one-byte items - the upper half can be next to anything. It can be Greek on one machine, Western European diacritics on another, Cyrillic on a third one, etc.
So I think that if all you need is to test whether you have a 7-bit US-ASCII character - ruling out the extended characters used in the French, German, Spanish and all the Scandinavian alphabets - then, since Unicode was designed as a superset of Latin-1 (which is itself a superset of ASCII), what you need to check is that (Ord(ch) >= 0) and (Ord(ch) <= $7F), where ch is your character variable.
However, if you really need to tell languages apart - if you consider Greek and Cyrillic somewhat ASCII-like but the Japanese syllabaries (there are two, by the way: Hiragana and Katakana) not, or if you consider French and German more or less ASCII-like but Russian not - you would have to look at the Unicode ranges.
http://www.unicode.org/charts/index.html
To obtain the 32-bit code point of the UCS-4 standard you can use http://docwiki.embarcadero.com/Libraries/XE3/en/System.Character.ConvertToUtf32
There is also ICU (which started as IBM's Classes for Unicode), but it looks like no good Delphi translation exists; see Has anyone used ICU with Delphi?
You can use the Jedi Code Library, but its tables are (the comments contradict each other) either from Unicode 4.1 or 5.0, not from the current 6.2, though for Japanese, version 5.0 should be enough.
http://wiki.delphi-jedi.org/wiki/JCL_Help:TUnicodeBlock
http://wiki.delphi-jedi.org/wiki/JCL_Help:CodeBlockFromChar
http://wiki.delphi-jedi.org/wiki/JCL_Help:CodeBlockName#TUnicodeBlock
You can also use the Microsoft MLang interface to query Internet-style language codes (RFC 1766):
http://msdn.microsoft.com/en-us/library/aa741220.aspx
http://msdn.microsoft.com/en-us/library/aa767880.aspx
http://msdn.microsoft.com/en-us/library/aa740986.aspx
http://www.transl-gunsmoker.ru/2011/05/converting-between-lcids-and-rfc-1766.html
http://www.ietf.org/rfc/rfc1766.txt
Generally, a character belongs to ASCII if its code is in the range 0x0000..0x007F; see http://www.unicode.org/charts/PDF/U0000.pdf. Newer Delphi versions have the class function TCharacter.IsAscii, but for some strange reason it is declared as private.
ASCII characters have a decimal value less than 128.
However, unless you are running a teletype machine from the 1960s, ASCII characters may not be sufficient: they only cover English-language text. If you actually need to support "Western European" characters such as umlauted vowels, graves, etc., found in German, French, Spanish, Swedish, and so on, then testing for a Unicode char value <= 127 won't suffice. You might get away with testing for a char value <= 255, as long as you don't need to work with Eastern European scripts.

What's the proper technical term for "high ascii" characters?

What is the technically correct way of referring to "high ascii" or "extended ascii" characters? I don't just mean the range of 128-255, but any character beyond the 0-127 scope.
Often they're called diacritics, accented letters, sometimes casually referred to as "national" or non-English characters, but these names are either imprecise or they cover only a subset of the possible characters.
What is a correct, precise term that programmers will immediately recognize? And what would be the best English term to use when speaking to a non-technical audience?
"Non-ASCII characters"
ASCII character codes above 127 are not defined. Many different equipment and software suppliers developed their own character sets for the values 128-255. Some chose drawing symbols, some chose accented characters, others chose other characters.
Unicode is an attempt to make a universal set of character codes which includes the characters used in most languages. This includes not only the traditional Western alphabets, but Cyrillic, Arabic, Greek, and even a large set of characters from Chinese, Japanese and Korean, as well as many other languages, both modern and ancient.
There are several encodings of Unicode. One of the most popular is UTF-8. A major reason for that popularity is that it is backwards compatible with ASCII: character codes 0 to 127 are the same in both ASCII and UTF-8.
That means it is better to say that ASCII is a subset of UTF-8. Character codes 128 and above are not ASCII. They can be UTF-8 (or another Unicode encoding), or they can be a custom implementation by a hardware or software supplier.
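The backwards compatibility is easy to verify: ASCII text produces identical bytes in both encodings, while anything above 127 does not (a quick Python illustration):
s = "hello"
print(s.encode("ascii") == s.encode("utf-8"))  # True: identical bytes for pure ASCII
print("é".encode("utf-8"))                     # b'\xc3\xa9': two bytes, not one "extended ASCII" byte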
You could coin a term like “trans-ASCII,” “supra-ASCII,” “ultra-ASCII” etc. Actually, “meta-ASCII” would be even nicer since it alludes to the meta bit.
A bit sequence that doesn't represent an ASCII character is not necessarily a Unicode character.
Depending on the character encoding you're using, it could be either:
an invalid bit sequence
a Unicode character
an ISO-8859-x character
a Microsoft 1252 character
a character in some other character encoding
a bug, binary data, etc
The one definition that would fit all of these situations is:
Not an ASCII character
To be highly pedantic, even "a non-ASCII character" wouldn't precisely fit all of these situations, because sometimes a bit sequence outside this range may be simply an invalid bit sequence, and not a character at all.
"Extended ASCII" is the term I'd use, meaning "characters beyond the original 0-127".
Unicode is one possible set of "Extended ASCII" characters in that sense, and it is quite, quite large.
UTF-8 is the way to represent Unicode characters that is backwards-compatible with the original ASCII.
Words taken from an online resource (cool website, though), because I found them useful and appropriate for an answer.
At first ASCII included only capital letters and numbers, but in 1967 lowercase letters and some control characters were added, forming what is known as US-ASCII, i.e. the characters 0 through 127.
This set of only 128 characters was published as a standard in 1967, containing all you need to write in the English language.
In 1981, IBM developed an 8-bit extension of the ASCII code, called "code page 437". In this version some obsolete control characters were replaced with graphic characters, and 128 characters were added, with new symbols, signs, graphics and Latin letters: all the punctuation signs and characters needed to write texts in other languages, such as Spanish.
In this way the characters ranging from 128 to 255 were added to ASCII.
IBM included support for this code page in the hardware of its model 5150, known as the "IBM-PC", considered the first personal computer.
The operating system of this model, "MS-DOS", also used this extended ASCII code.
Non-ASCII Unicode characters.
If you say "High ASCII", you are by definition in the range 128-255 decimal. ASCII itself is defined as a one-byte (actually 7-bit) character representation; the use of the high bit to allow for non-English characters happened later and gave rise to the Code Pages that defined particular characters represented by particular values. Any multibyte (> 255 decimal value) is not ASCII.
