How to identify a character's language in Ruby/Rails? - ruby-on-rails

Given a character (one letter of a string), how can I identify which language it belongs to? The options are English, Russian, and Hebrew.
Background: this character was entered by user in a form and then stored in a database.
It can be for example the first letter in one of these words:
Hello
Привет
שלום

The Unicode standard is divided into "blocks". Go here:
http://www.unicode.org/charts/
http://en.wikipedia.org/wiki/Unicode_block
http://www.unicode.org/versions/Unicode6.0.0/
and find the Unicode block (interval) for each language.
My guess:
English: Basic Latin, U+0000..U+007F
Hebrew: the Hebrew block, U+0590..U+05FF
Russian: Cyrillic, U+0400..U+04FF
So for you it's a matter of a simple numeric comparison for each character (its Unicode ordinal value). Very simple.
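A minimal Ruby sketch of that comparison (the char_language helper and its return symbols are made up here for illustration; Ruby's regex engine also understands Unicode script names, which saves hard-coding the ranges):

# Classify a single character by Unicode block/script.
def char_language(char)
  case char
  when /\A[\u0000-\u007F]\z/ then :english   # Basic Latin (also matches digits and punctuation)
  when /\A\p{Cyrillic}\z/    then :russian   # Cyrillic block, U+0400..U+04FF and friends
  when /\A\p{Hebrew}\z/      then :hebrew    # Hebrew block, U+0590..U+05FF
  end
end

char_language("H")  # => :english
char_language("П")  # => :russian
char_language("ש")  # => :hebrew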

Related

What is the meaning of 'Swift are Unicode correct and locale insensitive' in Swift's String document?

I found this sentence in Swift's String document
(https://developer.apple.com/documentation/swift/string)
Overview
A string is a series of characters, such as "Swift", that forms a collection. Strings in Swift are Unicode correct and locale insensitive, and are designed to be efficient. The String type bridges with the Objective-C class NSString and offers interoperability with C functions that work with strings.
But, I can't understand this one hundred percent and I don't know where to start.
To expand on #matt's answer a little:
The Unicode Consortium maintains certain standards for interoperation of data, and one of the most well-known is the Unicode string standard. This standard defines a huge list of characters and their properties, along with rules for how those characters interact with one another. (As Matt notes: letters, emoji, combining characters [letters with diacritics, like é], and so on.)
Swift strings being "Unicode-correct" means that Swift strings conform to this Unicode standard, offering the same characters, rules, and interactions as any other string implementation which conforms to the same standard. These days, being the main standard that many string implementations already conform to, this largely means that Swift strings will "just work" the way that you expect.
However, along with the character definitions, Unicode also defines many rules for how to perform certain common string actions, such as uppercasing and lowercasing strings, or sorting them. These rules can be very specific, and in many cases, depend entirely on context (e.g., the locale, or the language and region the text might belong to, or be displayed in). For instance:
Case conversion:
In English, the uppercase form of i ("LATIN SMALL LETTER I" in Unicode) is I ("LATIN CAPITAL LETTER I"), and vice versa
In Turkish, however, the uppercase form of i is actually İ ("LATIN CAPITAL LETTER I WITH DOT ABOVE"), and the lowercase form of I ("LATIN CAPITAL LETTER I") is ı ("LATIN SMALL LETTER DOTLESS I")
Collation (sorting):
In English, the letter Å ("LATIN CAPITAL LETTER A WITH RING ABOVE") is largely considered the same as the letter A ("LATIN CAPITAL LETTER A"), just with a modifier on it. Sorted in a list, words starting with Å would appear along with other A words, but before B words
In certain Scandinavian languages, however, Å is its own letter, distinct from A. In Danish and Norwegian, Å comes at the end of the alphabet: ... X, Y, Z, Æ, Ø, Å. In Swedish and Finnish, the alphabet ends with: ... X, Y, Z, Å, Ä, Ö. For these languages, words starting with Å would come after Z words in a list
In order to perform many string operations in a way that makes sense to users in various languages, those operations need to be performed within the context of their language and locale.
In the context of the documentation's description, "locale-insensitive" means that Swift strings do not offer locale-specific rules like these, and default to Unicode's default case conversion, case folding, and collation rules (effectively: English). So, in contexts where these need to be handled correctly (e.g. you are writing a localized app), you'll want to use the Foundation extensions to String, whose methods do take a Locale:
localizedUppercase/uppercased(with locale: Locale?) over just uppercased()
localizedLowercase/lowercased(with locale: Locale?) over just lowercased()
localizedStandardCompare(_:)/compare(_:options:range:locale:) over just <
among others.
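That Turkish casing rule is easy to see outside Swift as well; for comparison, here is a minimal Ruby sketch (assuming Ruby 2.4+, where upcase/downcase accept a :turkic option for the same Unicode special casing):

"i".upcase            # => "I"   default, locale-insensitive mapping
"i".upcase(:turkic)   # => "İ"   Turkish/Azeri special casing
"I".downcase(:turkic) # => "ı"   dotless lowercase i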
It basically just means that Swift strings are Unicode strings. A Swift string "character" is a character in the Unicode sense: a letter, an emoji, a combined letter-and-diacritic, whatever. A string can also be viewed not merely as a character sequence but as a sequence of UTF-8, UTF-16, or UTF-32 code units. The "locale insensitive" part means they don't have a locale-dependent encoding, as strings did in the bad old days before Unicode.
This is delightful but it has some downsides, most notably that strings qua character-sequence are not directly indexable by integers.
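The "character in the Unicode sense" point is not Swift-specific either; a quick Ruby sketch of the same distinction between code points, user-perceived characters and bytes:

s = "e\u0301"              # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
s.chars.size               # => 2  code points
s.grapheme_clusters.size   # => 1  one user-perceived character, displayed as "é"
s.bytes.size               # => 3  UTF-8 bytes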

MALLET default tokenizer does not remove brackets

In Java Mallet, the default token should be one or more characters in [A-Za-z], according to the website. However, when I have text such as:
lower(location select testing) top
it treats "lower(location" as one word. But the default token should consist of letters only. How can I deal with this situation?
The documentation had not been updated for the most recent version of Mallet; thank you for pointing this out. Here's a current version:
As of version 2.0.8, the default token expression is '\p{L}[\p{L}\p{P}]+\p{L}', which is valid for all Unicode letters, and supports typical English non-letter patterns such as hyphens, apostrophes, and acronyms. Note that this expression also implicitly drops one- and two-letter words. Other options include:
For non-English text, a good choice is --token-regex '[\p{L}\p{M}]+', which means Unicode letters and marks (required for Indic scripts). MALLET currently does not support Chinese or Japanese word segmentation.
To include short words, use \p{L}+ (letters only) or '\p{L}[\p{L}\p{P}]*\p{L}|\p{L}' (letters possibly including punctuation).
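To see concretely why "lower(location" survives the default expression while the letters-and-marks pattern splits it, here is a small Ruby sketch that applies both regexes to the example text (Ruby is used here only to exercise the patterns; Mallet itself takes them via --token-regex):

text = "lower(location select testing) top"

# Default expression: letters, optionally with punctuation inside,
# at least three characters long.
text.scan(/\p{L}[\p{L}\p{P}]+\p{L}/)
# => ["lower(location", "select", "testing", "top"]

# Letters and marks only: punctuation acts as a separator.
text.scan(/[\p{L}\p{M}]+/)
# => ["lower", "location", "select", "testing", "top"]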

How to distinguish whether a WCHAR is Chinese, Japanese or ASCII?

For example, this Delphi code:
wchar_IsASCii: array[0..1] of WCHAR;
wchar_IsASCii[0] := 'A';
wchar_IsASCii[1] := 'じ';
How can I tell that wchar_IsASCii[0] belongs to ASCII and that wchar_IsASCii[1] does not?
Actually, all I need to know is whether a Unicode character belongs to ASCII; that's all. Hence the title: how do I distinguish whether a WCHAR is Chinese, Japanese or ASCII?
I don't know Delphi, but what I can tell you is that you need to determine what range the character fits into in Unicode. Here is a link about finding CJK characters in Unicode: What's the complete range for Chinese characters in Unicode?
And unless Delphi has some nice library for distinguishing Chinese and Japanese characters, you're going to have to determine that yourself. Here is a good answer on SO for how to do that:
Testing for Japanese/Chinese Characters in a string
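In regex terms (shown here in Ruby only to illustrate the ranges the linked answers describe; most regex engines support the same Unicode script names), the check looks roughly like this:

def classify(ch)
  case ch
  when /\A[\x00-\x7F]\z/                       then :ascii
  when /\A\p{Han}\z/                           then :cjk_ideograph  # shared by Chinese and Japanese
  when /\A\p{Hiragana}\z/, /\A\p{Katakana}\z/  then :japanese_kana
  else :other
  end
end

classify("A")   # => :ascii
classify("じ")  # => :japanese_kana
classify("中")  # => :cjk_ideograph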
The problem is: what do you mean by ASCII? The original ASCII standard is a 7-bit code - it does not even fill a byte. (Latin-1 is a later 8-bit superset of it.)
Then, if you take so-called "extended ASCII" - one-byte items - the upper half can be almost anything: Greek on one machine, Western European diacritics on another, Cyrillic on a third, and so on.
So I think, if all you need is to test for 7-bit ASCII characters - ruling out the extended characters from the French, German, Spanish and Scandinavian alphabets - then, since Unicode was designed as yet another superset of Latin-1, what you need is to check that (0 <= Ord(char_var)) and ($7F >= Ord(char_var)).
However, if you really need to tell languages apart - if you consider Greek and Cyrillic somewhat ASCII-like but the Japanese alphabets (there are two, by the way: Hiragana and Katakana) not, or if you consider French and German more or less ASCII-like but Russian not - you will have to look at the Unicode ranges.
http://www.unicode.org/charts/index.html
To get the 32-bit code point (UCS-4) you can use http://docwiki.embarcadero.com/Libraries/XE3/en/System.Character.ConvertToUtf32
There is also the near-standard ICU library (IBM's International Components for Unicode), but it looks like no good Delphi translation exists: Has anyone used ICU with Delphi?
You can use the Jedi Code Library (JCL), but its tables are (the comments contradict each other) from either Unicode 4.1 or 5.0, not from the current 6.2, though for Japanese version 5.0 should be enough.
http://wiki.delphi-jedi.org/wiki/JCL_Help:TUnicodeBlock
http://wiki.delphi-jedi.org/wiki/JCL_Help:CodeBlockFromChar
http://wiki.delphi-jedi.org/wiki/JCL_Help:CodeBlockName#TUnicodeBlock
You can also use Microsoft MLang interface to query internet-like character codes (RFC 1766)
http://msdn.microsoft.com/en-us/library/aa741220.aspx
http://msdn.microsoft.com/en-us/library/aa767880.aspx
http://msdn.microsoft.com/en-us/library/aa740986.aspx
http://www.transl-gunsmoker.ru/2011/05/converting-between-lcids-and-rfc-1766.html
http://www.ietf.org/rfc/rfc1766.txt
Generally, a character belongs to ASCII if its code is in the range 0x0000..0x007F; see http://www.unicode.org/charts/PDF/U0000.pdf. Newer Delphi versions have the class function TCharacter.IsAscii, but for some strange reason it is declared as private.
ASCII characters have a decimal value of 127 or less.
However, unless you are running a teletype machine from the 1960s, ASCII characters may not be sufficient: they only cover English-language characters. If you actually need to support "Western European" characters such as umlauted vowels, graves, etc., found in German, French, Spanish, Swedish and so on, then testing for a Unicode char value <= 127 won't suffice. You might get away with testing for a char value <= 255, as long as you don't need to work with Eastern European scripts.
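Expressed outside Delphi, the range tests from the answers above are just integer comparisons on the code point; a Ruby sketch (the 0x7F and 0xFF bounds come straight from the discussion above):

def ascii?(ch)
  ch.ord <= 0x7F        # 7-bit ASCII: U+0000..U+007F
end

def latin1_range?(ch)
  ch.ord <= 0xFF        # also covers Western European letters such as é, ä, ñ
end

ascii?("A")          # => true
ascii?("じ")         # => false
latin1_range?("é")   # => true

# Ruby also has a built-in shortcut for whole strings:
"Hello".ascii_only?  # => true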

Can someone explain ja_JP.UTF8?

I know UTF-8, but what do the *.utf8 locale names mean?
From the answer to my post
Locale = ja_JP
Encoding = UTF-8
Before Unicode, handling non-English characters was done using tricks like code pages (like this) and special character sets (like this: Shift_JIS). UTF-8 covers a much larger range of characters with a completely different mapping system (i.e. the way each character is addressed by number).
When setting ja_JP.UTF8 as the locale, the "UTF8" part signifies the encoding for the special characters needed. For example, when you output a currency amount in the Japanese locale, you will need the ¥ character. The encoding information defines which character set to use to display the ¥.
I'm assuming there could exist a ja_JP.Shift_JIS locale. One difference to the UTF8 one - among others - would be that the ¥ sign is displayed in a way that works in this specific encoding.
Why ja_JP?
The two codes in ja_JP signify the language (I think based on this ISO norm) and the country (based on this one). This is important if a language is spoken in more than one country. In the German-speaking area, for example, the Swiss format numbers differently from the Germans: 1'000'000 vs. 1.000.000. The country code serves to define these distinctions within the same language.
In which context? ja_JP tells us that the string is in the Japanese language. That does not have anything to do with the character encoding, but is probably used - depending on context - for sorting, keyboard input and the language of text displayed by the program.
At a guess, I'd say each utf8 file with that naming convention contains a language definition for translating your site.
It's a locale name. The basic format is language_COUNTRY. ja = Japanese language, JP = Japan.
In addition to a date format, currency symbol, etc., each locale is associated with a character encoding. This is a historical legacy from the days when every language had its own encoding. Now, UTF-8 provides a common encoding for every locale.
The reason .UTF8 is part of the locale name is to distinguish it from older locales with a different encoding. For example, there's a ja_JP.EUC-JP locale available on my system. And for Germany, there's the choice of de_DE (obsolete pre-Euro locale with ISO-8859-1 encoding), de_DE@euro (ISO-8859-15 encoding, to provide the € sign), and de_DE.UTF-8.
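The practical effect of the encoding part is that the same Japanese text is stored as different bytes depending on which encoding the locale names; a quick Ruby illustration (byte counts only):

word = "日本語"                      # three Japanese characters
word.bytesize                        # => 9  (3 bytes per character in UTF-8)
word.encode("EUC-JP").bytesize       # => 6  (2 bytes per character in EUC-JP)
# The ja_JP part is orthogonal to this: it selects Japanese-language,
# Japan-region conventions (dates, currency, sorting); the part after
# the dot selects the byte encoding.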

What's the proper technical term for "high ascii" characters?

What is the technically correct way of referring to "high ascii" or "extended ascii" characters? I don't just mean the range of 128-255, but any character beyond the 0-127 scope.
Often they're called diacritics, accented letters, sometimes casually referred to as "national" or non-English characters, but these names are either imprecise or they cover only a subset of the possible characters.
What is a correct, precise term that programmers will immediately recognize? And what would be the best English term to use when speaking to a non-technical audience?
"Non-ASCII characters"
ASCII character codes above 127 are not defined. Many different equipment and software suppliers developed their own character sets for the values 128-255. Some chose drawing symbols, some chose accented characters, others chose other characters.
Unicode is an attempt to make a universal set of character codes which includes the characters used in most languages. This includes not only the traditional Western alphabets, but Cyrillic, Arabic, Greek, and even a large set of characters from Chinese, Japanese and Korean, as well as many other languages, both modern and ancient.
There are several encodings of Unicode. One of the most popular is UTF-8. A major reason for that popularity is that it is backwards compatible with ASCII: character codes 0 to 127 are the same in both ASCII and UTF-8.
That means it is better to say that ASCII is a subset of UTF-8. Character codes 128 and above are not ASCII. They can be UTF-8 (or another Unicode encoding), or they can be a custom implementation by a hardware or software supplier.
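That subset relationship is easy to verify; a small Ruby sketch:

"A".encode("US-ASCII").bytes   # => [65]
"A".encode("UTF-8").bytes      # => [65]        identical: ASCII is a subset of UTF-8
"é".encode("UTF-8").bytes      # => [195, 169]  two bytes, so not ASCII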
You could coin a term like “trans-ASCII,” “supra-ASCII,” “ultra-ASCII” etc. Actually, “meta-ASCII” would be even nicer since it alludes to the meta bit.
A bit sequence that doesn't represent an ASCII character is not definitively a Unicode character.
Depending on the character encoding you're using, it could be either:
an invalid bit sequence
a Unicode character
an ISO-8859-x character
a Microsoft 1252 character
a character in some other character encoding
a bug, binary data, etc
The one definition that would fit all of these situations is:
Not an ASCII character
To be highly pedantic, even "a non-ASCII character" wouldn't precisely fit all of these situations, because sometimes a bit sequence outside this range may be simply an invalid bit sequence, and not a character at all.
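To make that concrete, the very same byte can decode to different characters, or to nothing valid at all, depending on which encoding you assume; a small Ruby sketch:

byte = "\xE9".b                                      # one byte, value 0xE9 (233)
byte.force_encoding("ISO-8859-1").encode("UTF-8")    # => "é"
byte.force_encoding("Windows-1251").encode("UTF-8")  # => "й"
byte.force_encoding("UTF-8").valid_encoding?         # => false (truncated UTF-8 sequence)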
"Extended ASCII" is the term I'd use, meaning "characters beyond the original 0-127".
Unicode is one possible set of Extended ASCII characters, and is quite, quite large.
UTF-8 is the way to represent Unicode characters that is backwards-compatible with the original ASCII.
These words are taken from an online resource (a cool website, by the way) because I found them useful and appropriate for an answer.
At first ASCII only included capital letters and numbers, but in 1967 lowercase letters and some control characters were added, forming what is known as US-ASCII, i.e. the characters 0 through 127.
So this set of only 128 characters was published in 1967 as a standard, containing all you need to write in the English language.
In 1981, IBM developed an 8-bit extension of the ASCII code, called "code page 437", in which some obsolete control characters were replaced with graphic characters. 128 characters were also added, with new symbols, signs, graphics and Latin letters - all the punctuation signs and characters needed to write text in other languages, such as Spanish.
In this way the extended ASCII characters ranging from 128 to 255 were added.
IBM included support for this code page in the hardware of its model 5150, known as the "IBM PC", considered the first personal computer.
The operating system of this model, "MS-DOS", also used this extended ASCII code.
Non-ASCII Unicode characters.
If you say "high ASCII", you are by definition in the range 128-255 decimal. ASCII itself is defined as a one-byte (actually 7-bit) character representation; the use of the high bit to allow for non-English characters happened later and gave rise to the code pages that defined particular characters represented by particular values. Anything multi-byte (with a value above 255 decimal) is not ASCII at all.
