What is the usual method of encoding documents in foreign alphabets?

What is the usual method of encoding documents in foreign alphabets for the purposes of programs that do terminal to terminal communications? There are two parts to this question: Latin alphabets and non-Latin.
I know that 8859-1 can handle most European languages, so is the usual practice in, say, Danish just to set your computer to 8859-1 and be done? What about French and Polish?
For non-Latin alphabets like Russian, Armenian and Korean, obviously you cannot use 8859-1. Do they just write documents in some other code page and set their computers to that code page, or do they use Unicode or UTF-8, or all three? What is the standard practice?
I am only interested in alphabetic systems. I know how the non-alphabetic systems (Chinese/Japanese) work, so no need to explain what they do.
My need here is to understand what kind of support to build into a terminal-based communication system that will be used by people talking to each other in different countries. For example, imagine you are writing an instant messaging system and need it to be interoperable between people in different countries.

For any system set up in this decade, you should expect and demand Unicode (though not necessarily UTF-8) and be done.
Historically, you would see all three of:
(1) use a legacy codepage or even (gasp) an official character set for your locale (much depending on your OS and vendor -- Windows and Mac would traditionally gravitate to their own proprietary code tables, while Linux would use ISO-8859-x where available and applicable);
(2) use something "close enough" and just wing it in corner cases (for example, ISO-8859-1 is in principle insufficient for Finnish, but people would just refrain from using the handful of words where it matters, or write them unaccented); and
(3) use a local convention such as pr"efix"ed "accents or LaTeX \"acc\"ents or uetterly uenreaedaeble digraphs (these have some basis in tradition, e.g. in Germany, where it is still okay to write "umlaeute" as a variant of "umläute").
It is not really correct to say that ISO-8859-1 "can handle most European languages". It is sufficient for most of the official national languages of Western Europe (especially if you are willing to compromise a bit, like the French grudgingly did) but completely inadequate for the majority of European languages. There are ISO-8859-2, ISO-8859-3, etc., which cater to the needs of groups of other European languages, but in many settings interoperability with ISO-8859-1 was also desirable, so these were always a bit problematic.
For the specific languages you ask about, there is ArmSCII for Armenian and a variety of Cyrillic encodings for Russian (depending on where you look and who you ask, Windows code page 1251 or KOI8-R would be regarded as the dominant one), and similarly a variety of Korean standards, though KSC 5601 seems to dominate, at least for email.
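To make the Cyrillic ambiguity concrete, here is a minimal Python 3 sketch (standard library only) showing that the same Russian word comes out as entirely different bytes in KOI8-R and in code page 1251, and that decoding with the wrong table silently produces garbage:

    word = "привет"  # "hello" in Russian

    koi8 = word.encode("koi8_r")
    cp1251 = word.encode("cp1251")

    print(koi8.hex())    # d0d2c9d7c5d4
    print(cp1251.hex())  # eff0e8e2e5f2

    # Decoding with the wrong table raises no error, it just yields nonsense:
    print(cp1251.decode("koi8_r"))  # 'ОПХБЕР'

Without out-of-band information, a receiver literally cannot tell which of the two tables the sender used.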
While Korean is nominally analyzable as a roughly alphabetic writing system, the traditional encoding approach has been to create glyphs for each possible combined syllable, resulting in a large character set which has more in common with Chinese or Japanese than with typical 8-bit alphabetic encodings. I believe the composing jamo characters only became available for practical use once they were included in Unicode.
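As a small illustration of that last point, Unicode normalization in Python can decompose a precomposed Hangul syllable into its conjoining jamo and recompose it again (a generic sketch, not tied to any Korean legacy encoding):

    import unicodedata

    syllable = "한"  # U+D55C, a single precomposed Hangul syllable

    # NFD decomposition splits it into its conjoining jamo:
    # U+1112 (hieuh), U+1161 (a), U+11AB (nieun)
    jamo = unicodedata.normalize("NFD", syllable)

    print(len(syllable), [hex(ord(c)) for c in syllable])  # 1 ['0xd55c']
    print(len(jamo), [hex(ord(c)) for c in jamo])          # 3 ['0x1112', '0x1161', '0x11ab']

    # NFC recombines the jamo into the precomposed syllable.
    assert unicodedata.normalize("NFC", jamo) == syllable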
For a messaging system in particular, you have two choices, only one of which really makes proper sense: design the protocol to tag the encoding of every transmitted message and implement transcoding in all clients, or just use Unicode everywhere.
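A minimal sketch of the "Unicode everywhere" option in Python, with a hypothetical length-prefixed framing over a socket (the framing is an assumption made up for the example, not part of any particular protocol): keep text as Unicode strings internally and touch UTF-8 only at the wire.

    import socket

    def send_message(sock: socket.socket, text: str) -> None:
        """Encode the message as UTF-8 only when it hits the wire."""
        payload = text.encode("utf-8")
        # Hypothetical framing: 4-byte big-endian length prefix, then the payload.
        sock.sendall(len(payload).to_bytes(4, "big") + payload)

    def recv_message(sock: socket.socket) -> str:
        """Read one length-prefixed frame and decode it back to a str."""
        length = int.from_bytes(_recv_exact(sock, 4), "big")
        return _recv_exact(sock, length).decode("utf-8")

    def _recv_exact(sock: socket.socket, n: int) -> bytes:
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("peer closed the connection")
            buf += chunk
        return buf

With this shape, no client ever needs to know or negotiate what code page another client happens to use.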
The one remaining challenge is to ensure that each client has the necessary fonts to display the glyphs they receive. Things are slowly improving, but this is a complex matter.

There is no out-of-band information. With the legacy systems you are expected to know the usual encodings and try them yourself until one works. Which is why anything but Unicode is dumb today (and anything but UTF-8 as the Unicode encoding is a fail too). It is not true that no one uses Unicode today: UTF-8 is the default XML encoding, the default encoding in IETF and W3C specifications, the default Linux encoding, etc. Building a new multilingual system around anything but UTF-8 today is a big mistake.
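If you are stuck with legacy data, the "try the usual encodings yourself" approach looks roughly like this minimal Python sketch (the candidate list and its order are just an example, not an authoritative recipe):

    def guess_decode(data: bytes) -> tuple[str, str]:
        """Try a few likely encodings in order and return (text, encoding).

        UTF-8 goes first because arbitrary legacy-encoded text is unlikely
        to be valid UTF-8 by accident; the single-byte codecs almost never
        fail, so their order is essentially a guess about the sender's locale.
        """
        for encoding in ("utf-8", "cp1251", "koi8_r", "iso-8859-2", "iso-8859-1"):
            try:
                return data.decode(encoding), encoding
            except UnicodeDecodeError:
                continue
        # iso-8859-1 maps every byte value, so this line is never reached.
        raise AssertionError("unreachable")

The fact that this function can only ever guess is exactly the argument for UTF-8 everywhere.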

Related

Extracting data from Chinese language document

Does WorkFusion have support for extracting data from Chinese-language documents by using OCR and machine learning? Please advise.
In case you are looking for a generic answer: Yes, Chinese language is supported.
A bit more details:
The WorkFusion OCR module supports about 200 languages (including Chinese Simplified and Chinese Traditional).
The WorkFusion ML module is language-agnostic. The training set may need to be larger for less common languages, and the configuration may need to include language-specific features for better results.
The WorkFusion RPA module is language-agnostic. It can interact with applications in pretty much any language of the user interface. To be more technically precise: it is character-encoding-agnostic. More than 100 character encodings are supported, including all widely used ones (different versions of Unicode/UTF, ISO, ASCII, IBM, Windows, and many more).

What is the correct character encoding for .txt files on a website?

I am developing a website that includes some text files (saved with .txt file extension).
Should they be UTF-8 (with BOM), or is ANSI (1252) O.K.?
(Windows adds a 3-byte BOM when I save as UTF-8).
I would like to do whatever is considered to be best practice.
UTF-8 is generally preferred on the web, though in the specifications, this seems to relate to HTML resources, formally speaking.
There is hardly any practical problem with windows-1252, if it is properly declared in HTTP headers sent by the server and all the data can be written using the restricted repertoire supported by that encoding.
Using UTF-8 with BOM, you practically guarantee that user agents get the encoding right. You might still have problems with your authoring tools, such as PHP. But if you create and save the resources yourself, using UTF-8 capable tools, there is hardly any objection to UTF-8.
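As an illustration of both halves of that advice, here is a hedged Python sketch: writing a .txt file as UTF-8 with a BOM (the "utf-8-sig" codec produces the same 3-byte BOM that Windows adds), and declaring the charset explicitly when serving it (overriding guess_type is just one way to do it with the standard-library server; any real web server or framework has an equivalent setting):

    from http.server import HTTPServer, SimpleHTTPRequestHandler
    from pathlib import Path

    # Writing: the "utf-8-sig" codec prepends the 3-byte BOM (EF BB BF),
    # which is what Windows Notepad adds when you save as "UTF-8".
    Path("notes.txt").write_text("Grüße, привет, 안녕\n", encoding="utf-8-sig")

    # Serving: whichever server you use, send an explicit
    # "Content-Type: text/plain; charset=utf-8" header for .txt files.
    class TxtHandler(SimpleHTTPRequestHandler):
        def guess_type(self, path):
            if str(path).endswith(".txt"):
                return "text/plain; charset=utf-8"
            return super().guess_type(path)

    if __name__ == "__main__":
        HTTPServer(("localhost", 8000), TxtHandler).serve_forever()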
Which languages is your website using?
I'm tempted to say there is no absolute best practice (as with many questions). If you're in a 100% English environment and it is meant to stay that way, you don't really need to bother about encoding.
My current project is using Asian languages and European languages, so ANSI was out of the question. If you don't target old browsers and if your application handles UTF-8 without any problem, I suggest going straight for UTF-8, because if you realize later that an encoding change is required, that is not fun...
For further reading, there is plenty of material about character encoding for websites.

Could lex/flex be used to parse binary format source files?

When I learned the lex tool, I found that it helps to parse source files in text format, for example when building a new programming language. I also wish to use it to build a tool to analyse some binary input streams, such as codecs/decoders.
Do lex/flex/yacc/bison support such requirements? Do they have special command-line options and syntax to enable this?
Thanks!
Flex (and the other lex implementations I'm familiar with) have no problem with non-ASCII characters, including the NUL character. You may have to use the 8bit option, although it is the default unless you request fast state tables.
However, most binary formats use length-prefixed variable-length fields, which cannot be expressed in a regular expression. Moreover, it is quite common for fixed-length fields to be context-dependent; you can build a state machine in flex using start conditions, but that is a lot of work and is likely to be a waste of your time and of flex's features.
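To illustrate why a hand-rolled parser is usually the better fit for such formats, here is a hedged Python sketch that reads length-prefixed records with the struct module (the record layout is made up for the example):

    import struct
    from typing import Iterator, Tuple

    def iter_records(blob: bytes) -> Iterator[Tuple[int, bytes]]:
        """Parse a made-up binary layout: 1-byte type tag, 2-byte big-endian
        length, then `length` bytes of payload, repeated until the end."""
        offset = 0
        while offset < len(blob):
            tag, length = struct.unpack_from(">BH", blob, offset)
            offset += 3
            payload = blob[offset:offset + length]
            if len(payload) != length:
                raise ValueError("truncated record")
            offset += length
            yield tag, payload

    # Example: one record of type 0x01 carrying the payload b"hi".
    print(list(iter_records(b"\x01\x00\x02hi")))  # [(1, b'hi')]

The "read a length, then consume exactly that many bytes" step is trivial here, but it is precisely what a regular-expression-driven scanner cannot express.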

How do I specify language when storing strings?

I'm currently developing a system that supports several languages. I want to specify these languages as precisely as possible in the database in case of future integrations. (Yes I know it's a bit YAGNI)
I've found several ways to define a language:
nb-NO
nb_NO
nb-no
nb_no
nb
These can all mean "Norwegian Bokmål". Which one, if any, is the most correct?
The Locale article on the ArchLinux Wiki specifies a locale as language[_territory][.codeset][@modifier]. I guess the codeset and modifier are only relevant for input. But language is a minimum, and territory may be nice to have should we implement cultural differences regarding currency, decimal points, etc.
Am I overthinking it?
Look at BCP 47
https://tools.ietf.org/html/bcp47
In this day and age you would need to support at least language, script, and region (only language being mandatory).
It depends a lot on what you use these tags for.
If it is spoken content you might care about dialect (for instance Cantonese vs Mandarin Chinese), but not script. In written form you will care about script (Traditional vs. Simplified Chinese), but not dialect.
The complete stack you use to process things also matters a lot. You might use a hyphen as the separator, use grandfathered IDs, or use the -u- extension (see BCP 47), then discover that your programming language "chokes" on it. Or you use "he" for Hebrew, but your language (cough Java cough) wants the deprecated "iw".
So you might decide to use the same locale IDs as your tech stack, or have a "conversion layer".
If you want things accessible from several technologies, then a conversion layer is your only (reasonable) option.
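A hedged Python sketch of such a conversion layer, normalizing POSIX-style locale IDs to BCP 47 tags and mapping a couple of Java's legacy codes (the mapping table is deliberately tiny; a real system would lean on something like ICU or the langcodes library):

    # Legacy language codes that some stacks (notably older Java) still emit.
    LEGACY_TO_BCP47 = {"iw": "he", "in": "id", "ji": "yi"}

    def posix_to_bcp47(locale_id: str) -> str:
        """Turn e.g. 'nb_NO.UTF-8' or 'iw_IL' into a BCP 47 tag like 'nb-NO'."""
        # Strip the codeset and modifier: language[_territory][.codeset][@modifier]
        locale_id = locale_id.split(".", 1)[0].split("@", 1)[0]
        parts = locale_id.replace("_", "-").split("-")
        language = LEGACY_TO_BCP47.get(parts[0].lower(), parts[0].lower())
        # 2-letter subtags are treated as regions, longer ones as scripts.
        rest = [p.upper() if len(p) == 2 else p.title() for p in parts[1:]]
        return "-".join([language] + rest)

    assert posix_to_bcp47("nb_NO.UTF-8") == "nb-NO"
    assert posix_to_bcp47("iw_IL") == "he-IL"
    assert posix_to_bcp47("zh_Hant_TW") == "zh-Hant-TW"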

Resources for character and text processing (encoding, regular expressions, NLP)

I'd like to learn the foundations of encodings, characters and text. Understanding these is important for dealing with large sets of text, whether they are log files or text sources for building algorithms for collective intelligence. My current knowledge is pretty basic: something like "As long as I use UTF-8, I'm okay."
I don't say I need to learn about advanced topics right away. But I need to know:
Bit and bytes level knowledge of encodings.
Characters and alphabets not used in English.
Multi-byte encodings. (I understand some Chinese and Japanese. And parsing them is important.)
Regular expressions.
Algorithms for text processing.
Parsing natural languages.
I also need an understanding of mathematics and corpus linguistics. The current and future web (semantic, intelligent, real-time) needs the processing, parsing and analysis of large amounts of text.
I'm looking for some resources (maybe books?) that get me started on some of these bullets. (I find many helpful discussions on regular expressions here on Stack Overflow, so you don't need to suggest resources on that topic.)
In addition to Wikipedia, Joel Spolsky's article on encoding is really good too.
This free character map is a nice resource for all Unicode characters.
This regular expression tutorial can be helpful.
Specifically on NLP and Japanese, you could take a look at this Japanese NLP project.
On text processing, this Open Source project can be useful.
As is usual for most general "I want to learn about X topic" questions, Wikipedia is a good place to start:
http://en.wikipedia.org/wiki/Character_encoding
http://en.wikipedia.org/wiki/Natural_language_processing
