Can tab characters appear in an iso-8859-8 file? - character-encoding

I have a file that I believe to be in the ISO-8859-8 format. However, it has tabs in it, which doesn't seem to appear in this character set:
https://en.wikipedia.org/wiki/ISO/IEC_8859-8
Does this mean that the file isn't in the ISO-8859-8 format after all? Can ISO-8859-8 encoded characters be combined with tabs?

Yes.
The tab (\t) character is one of the standard C0 control codes, along with Null (\0), Bell/Alert (\a), Backspace (\b), Line Feed (\n), Vertical Tab (\v), Form Feed (\f), Carriage Return (\r), Escape (\x1B), etc.
According to Wikipedia's page on ISO/IEC 8859:
The ISO/IEC 8859 standard parts only define printable characters, although they explicitly set apart the byte ranges 0x00–1F and 0x7F–9F as "combinations that do not represent graphic characters" (i.e. which are reserved for use as control characters) in accordance with ISO/IEC 4873; they were designed to be used in conjunction with a separate standard defining the control functions associated with these bytes, such as ISO 6429 or ISO 6630. To this end a series of encodings registered with the IANA add the C0 control set (control characters mapped to bytes 0 to 31) from ISO 646 and the C1 control set (control characters mapped to bytes 128 to 159) from ISO 6429, resulting in full 8-bit character maps with most, if not all, bytes assigned. These sets have ISO-8859-n as their preferred MIME name or, in cases where a preferred MIME name is not specified, their canonical name. Many people use the terms ISO/IEC 8859-n and ISO-8859-n interchangeably.
In other words, even though the official character chart only lists the printable characters, the C0 control characters, including Tab, are for all practical purposes part of the ISO-8859-n encodings.
Your linked article even says so explicitly:
ISO-8859-8 is the IANA preferred charset name for this standard when supplemented with the C0 and C1 control codes from ISO/IEC 6429.
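As a quick illustration (a minimal Python sketch, using Python only for convenience), decoding bytes that mix Hebrew letters with a tab raises no error, because 0x09 passes through as the C0 Tab control:
# Hebrew alef and bet (0xE0, 0xE1 in ISO-8859-8), a tab byte, then ASCII text.
data = b"\xe0\xe1\x09abc"
text = data.decode("iso-8859-8")   # no error: 0x09 maps to U+0009 (TAB)
print(repr(text))                  # 'אב\tabc'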

An "ISO 8859-8 file" is interpreted usually as a file which contain standard C0, C1, and DEL. So you can use control characters without problems.
But technically, ISO 8859-8 is just defining the characters as show in Wikipedia. Remember that files were not so relevant in such times, but the transmission of data between different systems (you may have the transmitted data stored as native file, so transcoded, but so the concept of "file encoding" was not so important). So we have ISO 2022 and ISO 4873 which defines the basic idea of encodings, to transmit data, and they define the "ANSI escape sequences". With such sequences you can redefine how the C0, C1, and the two letter blocks (G0, G1) are used.
So your system may decide to use ASCII for initial communication, then switch C0 for better control between the systems, and then load G0 and G1 with ISO 8859-8, so you can transmit your text file (and then maybe an other encoding, for a second stream of data in an other language).
So, technically, Wikipedia tables are correct, but now we use to share files without transcoding them, and so without changing encoding with ANSI escape characters, and so we use "ISO 8859-8 file" as a way to describe a ISO 8859-8 graphical characters (G0 and G1), and we allow we extra control characters (usually TAB, NL (and CR, LF), sometime also NUL, VT). This is also embedded in the string iso-8859-8 used by IANA, and so web browsers and email. But note: usually you cannot use all C0 and C1 control codes (some are forbidden by standards, and some should not be used (usually) in files, e.g. ANSI escape sequences, NUL bytes and so may be misinterpreted, or discarded (and possibly this will give a security problem).
In short: ISO 8859-8 technically do not define control codes. But usually we allow some of them in files (TAB is one of them). Check the file protocol to know which control codes are allowed (please no BEL, and ANSI escape characters)
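For instance, here is a minimal sketch (Python, with a purely illustrative whitelist; adjust it to whatever your file format actually permits) that reports which control bytes a file contains:
# List the C0/C1 control bytes found in a file and flag those outside a whitelist.
# The whitelist (TAB, LF, FF, CR) is only an example, not a standard.
ALLOWED = {0x09, 0x0A, 0x0C, 0x0D}

def report_control_bytes(path):
    with open(path, "rb") as f:
        data = f.read()
    # 0x80-0x9F are C1 controls only if the file really is ISO 8859-x;
    # in other encodings those bytes may be ordinary characters.
    found = {b for b in data if b < 0x20 or b == 0x7F or 0x80 <= b <= 0x9F}
    for b in sorted(found):
        print(f"0x{b:02X}: {'ok' if b in ALLOWED else 'suspicious'}")

# report_control_bytes("example.txt")  # hypothetical filename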

Related

How to detect if user selected .txt file is Unicode/UTF-8 format and Convert to ANSI

My non-Unicode Delphi 7 application allows users to open .txt files.
Sometimes users try to open UTF-8/Unicode .txt files, which causes a problem.
I need a function that detects whether the user is opening a txt file with UTF-8 or Unicode encoding and converts it to the system's default code page (ANSI) encoding automatically, when possible, so that it can be used by the app.
In cases when converting is not possible, the function should return an error.
The ReturnAsAnsiText(filename) function should open the txt file, then perform detection and conversion in steps like this:
If the byte stream has no byte values over 0x7F, it is ANSI; return it as is.
If the byte stream has byte values over 0x7F, convert from UTF-8.
If the stream has a BOM, try a Unicode conversion.
If conversion to the system's current code page is not possible, return NULL to indicate an error.
It is an acceptable limitation for this function that users can only open files that match their region/code page (the Control Panel regional setting for non-Unicode apps).
The conversion function ReturnAsAnsiText, as you designed it, will have a number of issues:
The Delphi 7 application may not be able to open files whose filenames use UTF-8 or UTF-16.
UTF-8 (and Unicode in general) usage has increased significantly since 2019. Current web pages are between 98% and 100% UTF-8, depending on the language.
Your design will incorrectly translate some text that a standards-compliant converter would handle.
Creating ReturnAsAnsiText is beyond the scope of an answer; you should look for a library you can use instead of creating a new function. I haven't used Delphi 2005 (I believe that is 7), but I found this MIT-licensed library that may get you there. It has a number of caveats:
It doesn't support all forms of BOM.
It doesn't support all encodings.
There is no universal "best-fit" behavior for single-byte character sets.
There are other issues that are tangentially described in this question. You wouldn't use an external command, but I used one here to demonstrate the point:
% iconv -f utf-8 -t ascii//TRANSLIT < hello.utf8
^h'elloe
iconv: (stdin):1:6: cannot convert
% iconv -f utf-8 -t ascii < hello.utf8
iconv: (stdin):1:0: cannot convert
Enabling TRANSLIT in standards-based libraries supports converting characters like é to an ASCII e, but it still fails on characters like π, since there is no ASCII character of similar form.
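The same limitation is easy to reproduce outside iconv. Here is a rough Python sketch (it uses NFKD decomposition rather than iconv's TRANSLIT tables, so results are similar but not identical): é keeps only its base letter, while π has no ASCII fallback at all.
import unicodedata

def ascii_approx(text):
    # Decompose, then drop combining marks and anything else that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(ascii_approx("héllo"))  # 'hello' (the accent is dropped)
print(ascii_approx("π"))      # ''      (nothing survives; there is no ASCII pi)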
Your required answer would need massive UTF-8 and UTF-16 translation tables for every supported code page and BMP, and would still be unable to reliably detect the source encoding.
Notepad has trouble with this issue.
The solution as requested would probably entail more effort than you put into the original program.
Possible solutions
Add a text editor into your program. If you write it, you will be able to read it.
The following solution pushes the translation to established tables provided by Windows.
Use native Win32 API calls such as WideCharToMultiByte to translate strings, but even this has its drawbacks (from the referenced page, the Note is more relevant to the topic, but the Caution is important for security):
Caution  Using the WideCharToMultiByte function incorrectly can compromise the security of your application. Calling this function can easily cause a buffer overrun because the size of the input buffer indicated by lpWideCharStr equals the number of characters in the Unicode string, while the size of the output buffer indicated by lpMultiByteStr equals the number of bytes. To avoid a buffer overrun, your application must specify a buffer size appropriate for the data type the buffer receives.
Data converted from UTF-16 to non-Unicode encodings is subject to data loss, because a code page might not be able to represent every character used in the specific Unicode data. For more information, see Security Considerations: International Features.
Note  The ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page, unless legacy standards or data formats prevent the use of Unicode. If using Unicode is not possible, applications should tag the data stream with the appropriate encoding name when protocols allow it. HTML and XML files allow tagging, but text files do not.
This solution still has the guess the encoding problem, but if a BOM is present, this is one of the best translators possible.
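For the "if a BOM is present" part, a minimal sketch of the detection step (Python here purely for illustration; the helper names and the cp1252 fallback are assumptions, not part of any library) could look like this:
import codecs

def sniff_bom(raw):
    # Return the codec implied by a leading BOM, or None if there is no BOM.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"                 # strips the BOM while decoding
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"                    # this codec consumes the BOM itself
    return None

def guess_decode(raw):
    enc = sniff_bom(raw)
    if enc:
        return raw.decode(enc)
    try:
        return raw.decode("utf-8")         # strict: fails on most legacy 8-bit text
    except UnicodeDecodeError:
        return raw.decode("cp1252")        # assumption: cp1252 stands in for the ANSI code page

It is still a guess when no BOM is present, which is exactly the problem described above.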
Simply require the text file to be saved in the local code page.
Other thoughts:
ANSI, ASCII, and UTF-8 are all different encodings above 127, and control characters may be handled differently in each.
In UTF-16, every other byte of ASCII-encoded text is 0 (the zero byte comes first in big-endian, second in little-endian). This is not covered by your "rules".
You simply have to search for the Turkish i to understand the complexities of Unicode translations and comparisons.
Leverage any expectations about the file contents to establish a baseline for comparison and make an educated guess.
For example, if it is a .csv file, find a comma in the various formats...
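As a concrete instance of that idea (a sketch only; the byte patterns are the point, not the robustness of the heuristic), a comma looks different enough in each family to support a guess:
for enc in ("ascii", "utf-8", "utf-16-le", "utf-16-be"):
    print(enc, ",".encode(enc))
# ascii      b','
# utf-8      b','
# utf-16-le  b',\x00'
# utf-16-be  b'\x00,'
# In a .csv known to contain commas, b',\x00' or b'\x00,' pairs suggest UTF-16,
# while bare 0x2C bytes are consistent with ASCII, UTF-8, or an ANSI code page.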
Bottom Line
There is no perfect general solution, only specific solutions tailored to your specific needs, which were extremely broad in the question.

Japanese characters interpreted as control character

I have several files which include various strings in different written languages. The files I am working with are in the .inf format which is somewhat similar to .ini files.
I am inputting the text from these files into a parser which considers the [ symbol as the beginning of a 'category'. Therefore, it is important that this character does not accidentally appear in string sequences or parsing will fail because it interprets these as "control characters".
For example, this string contains some Japanese writings:
iANSProtocol_HELP="�C���e��(R) �A�h�o���X�g�E�l�b�g���[�N�E�T�[�r�X Protocol �̓`�[���������щ��z LAN �Ȃǂ̍��x�#�\�Ɏg�����܂��B"
DISKNAME ="�C���e��(R) �A�h�o���X�g�E�l�b�g���[�N�E�T�[�r�X CD-ROM �܂��̓t���b�s�[�f�B�X�N"
In my text editor's (Atom) default UTF-8 encoding this gives me garbage text, which would not be an issue; however, the 0x5B byte is interpreted as [, which causes the parser to fail because it assumes this signals the beginning of a new category.
If I change the encoding to Japanese (CP 932), these characters are interpreted correctly as:
iANSProtocol_HELP="インテル(R) アドバンスト・ネットワーク・サービス Protocol はチーム化および仮想 LAN などの高度機能に使われます。"
DISKNAME ="インテル(R) アドバンスト・ネットワーク・サービス CD-ROM またはフロッピーディスク"
Of course I cannot encode every file to be Japanese because they may contain Chinese or other languages which will be written incorrectly.
What is the best course of action for this situation? Should I edit the code of the parser to escape characters inside string literals? Are there any special types of encoding that would allow me to see all special characters and languages?
Thanks
If the source file is in shift-jis, then you should use a parser that can support it, or convert the file to UTF-8 before you parse it.
I believe this character set also uses ASCII as its base, but it uses two bytes for certain characters, so 0x5B probably doesn't appear as the first byte of a multi-byte character. (Note: this is conjecture based on how I think Shift-JIS works.)
So yes, you need to modify your parser to understand Shift-JIS, or you need to convert the file to UTF-8 before parsing. I imagine converting is the easiest.
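To make the 0x5B problem concrete, here is a small Python check (cp932 is Python's name for the Windows Shift-JIS variant; the word is taken from the strings above): the long-vowel mark ー in サービス encodes to a two-byte sequence whose second byte is 0x5B, which is exactly what a byte-oriented parser mistakes for [.
word = "サービス"
raw = word.encode("cp932")
print(raw)                  # contains the 0x5B trail byte: b'\x83T\x81[\x83r\x83X'
print(b"[" in raw)          # True: a byte-level parser sees a spurious '['
print(raw.decode("cp932"))  # decoding first restores 'サービス'; the '[' disappears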

Unicode filenames in iOS

Is it possible to use the full range of (let's say) the Chinese language in filenames of assets (images) within iOS? If not, what portions of big languages are supported in filenames, string searches and other file handling activities?
iOS and Mac OS currently use the HFS+ filesystem, which supports full Unicode in filenames. This means essentially any character, including Chinese and other human languages. The filesystem allows up to 255 characters per name, which for most languages is about 255 code points. (Note that the length is counted in UTF-16 code units; characters that need more than 16 bits to encode, such as emoji, can also be used, but they eat into the limit faster.)
The file APIs on iOS (NSFileManager, etc) should accommodate Unicode strings without any extra work. Do note that Unicode sequences are canonicalized in a particular way: e.g. an é character can be represented in multiple different ways in Unicode, but will be decomposed in a standardized way as a filename.
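Both points (the 255-unit budget counted in UTF-16 and the decomposed storage form) can be demonstrated outside iOS; here is a small Python illustration (Python is used only to make the Unicode details visible, not because iOS filename handling involves it):
import unicodedata

def utf16_units(name):
    # Number of 16-bit code units a filesystem with a UTF-16 limit would count.
    return len(name.encode("utf-16-le")) // 2

print(utf16_units("中"))    # 1 (BMP character: one 16-bit unit)
print(utf16_units("😀"))    # 2 (outside the BMP: a surrogate pair)

# Precomposed vs. decomposed 'é'; HFS+ stores a fully decomposed (roughly NFD) form.
print([hex(ord(c)) for c in "é"])                                # ['0xe9']
print([hex(ord(c)) for c in unicodedata.normalize("NFD", "é")])  # ['0x65', '0x301']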
The bottom line is, you can feel free to use Unicode strings as your filenames as long as they are of reasonable length. Because superlong Unicode names will start running into length issues in a slightly unpredictable way (really just complicated and unnecessary to compute), you should probably set some sane self-imposed length limits.
APFS is the next-gen filesystem that Apple is developing, and will appear on iOS at some point soon. I can't find info on file name encoding but it's a fair assumption that it will support anything HFS+ supports, if not more so.
The iOS filesystem uses case-sensitive HFSX, which is a variant of HFS Plus and uses the same rules for filenames and character encodings.
Those rules are laid out in several sections of Apple Technote 1150.
The important considerations are:
You may use up to 255 16-bit Unicode characters per file or folder name as described in the HFS Plus Names section of Technote 1150.
The filesystem at its base level uses Unicode v2.0 (this is fixed) and strings must be stored in fully decomposed, canonical order. This precludes the use of some "equivalent forms" -- i.e. they must be converted to decomposed form. This is described in detail in the Unicode Subtleties section of Technote 1150. This section details other issues and should be read carefully.
A list of illegal characters can be found in this Decomposition Table.
The colon character ':' is used as a directory separator and is invalid in file and folder names.

Are code pages and code charts the same thing?

Based on what I have gathered so far from reading information available online:
character set is a bunch of characters that we want to use (like an interface)
character encoding is a method of encoding some character set (like an implementation)
What is the relationship between code charts and code pages and how do they fit into the overall context? I am not sure if these two terms are synonyms or if they are referring to distinct concepts.
Do code charts/code pages define character sets through large tables and also provide a method of encoding, making them a part of character encoding? Or, do they only define character sets and leave encoding implementation to another aspect? Additionally, is a locale simply a type of code chart/code page or is it a separate concept altogether?
In the majority of cases, character sets and character encodings are one and the same. For example, ISO-8859-1 defines the character set for Western Europe AND the encoding, using an 8-bit scheme.
See the specification for ISO-8859-1: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf, which includes the encoding implementation.
Unicode, on the other hand, separates encoding from the character definitions, albeit within a bunch of related documents. In Unicode, just about all current and a good deal of historic characters, symbols and modifiers are mapped to a "code point" (an integer that fits in 32 bits). The encodings UTF-32, UTF-16 and UTF-8 are then documented separately, to define how a Unicode code point is encoded.
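A quick way to see that separation (a Python sketch; the encodings shown are standard codec names) is to take a single code point and look at its bytes under several encodings:
ch = "é"  # one abstract character, code point U+00E9 in the code chart

for enc in ("iso-8859-1", "utf-8", "utf-16-le", "utf-32-le"):
    print(enc, ch.encode(enc))
# iso-8859-1  b'\xe9'               (charset and 8-bit encoding coincide)
# utf-8       b'\xc3\xa9'
# utf-16-le   b'\xe9\x00'
# utf-32-le   b'\xe9\x00\x00\x00'
# Same code point, four different byte encodings.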

How to read a text file in ancient encoding?

There is a public project called Moby containing several word lists. Some files contain European alphabet symbols and were created in pre-Unicode times. The readme, dated 1993, reads:
"Foreign words commonly used in English usually include their diacritical marks, for example, the acute accent e is denoted by ASCII 142."
Wikipedia says that the last ASCII symbol has number 127.
For example, this file: http://www.gutenberg.org/files/3203/files/mobypos.txt contains symbols that I couldn't read in any of various Latin encodings. (There are plenty of such symbols at the very end of the section of words beginning with B, just before the letter C.)
Could someone advise please what encoding should be used for reading this file or how can it be converted to some readable modern encoding?
A little research suggests that the encoding for this page is Mac OS Roman, which has é at position 142. Viewing the page you linked and changing the encoding (in Chrome, View → Encoding → Western (Macintosh)) seems to display all the words correctly (it is incorrectly reporting ISO-8859-1).
How you deal with this depends on the language / tools you are using. Here’s an example of how you could convert into UTF-8 with Ruby:
require 'open-uri'
s = open('http://www.gutenberg.org/files/3203/files/mobypos.txt').read
s.force_encoding('macroman')
s.encode!('utf-8')
You are right in that ASCII only goes up to position 127 (it’s a 7-bit encoding), but there are a large number of 8 bit encodings that are supersets of ASCII and people sometimes refer to those as “Extended ASCII”. It appears that whoever wrote the readme you refer to didn’t know about the variety of encodings and thought the one he happened to be using at the time was universal.
There isn’t a general solution to problems like this, as there is no guaranteed way to determine the encoding of some text from the text itself. In this case I just used Wikipedia to look through a few until I found one that matched. Joel Spolsky’s article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) is a good place to start reading about character sets and encodings if you want to learn more.
