Display Arabic data with PB11 and MS Sans Serif font

We have a SQL Server 2000 database whose collation is set to SQL_Latin1_General_CP1_CI_AS.
When exploring the table data with SQL Server, we can't distinguish the Arabic characters (e.g. ÇæÇãÑ ÇáãÔÑæØÉ).
When exploring the table data with a PB7 datawindow using the MS Sans Serif font, the Arabic data displays well.
When exploring the table data with a PB11 datawindow, using MS Sans Serif or any other font, the Arabic data does not display well (e.g. ÇæÇãÑ ÇáãÔÑæØÉ), so we can't migrate to PB11.
Could anyone advise me on how to handle the migration from PB7 to PB11 so that it deals correctly with the Latin-1 database encoding and Arabic data?

As for your other question about unreadable old data stored in SQL Server when read with PB10.5: it looks like a Unicode vs. non-Unicode data reading issue.
Was the data written by your PB7 application? If so, then PB7 was not Unicode aware (neither were PB8 and PB9; native Unicode support was introduced with PB10) and probably sent data to the database in your local Windows encoding.
You either have to migrate your existing data in the database, or configure PB and/or your database connection to use the previous encoding.
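If migrating the data is the route you take, one common SQL Server technique is to round-trip the bytes through varbinary so they can be re-tagged with the collation the ANSI client actually used, and then converted to Unicode. A minimal sketch, assuming the PB7 client wrote Windows-1256 (Arabic) bytes; the table and column names are placeholders:

-- Assumption: the ANSI (PB7) client stored Windows-1256 bytes in a varchar
-- column tagged with a Latin1 collation; mytable/mycol are placeholders.
ALTER TABLE mytable ADD mycol_u nvarchar(200);

UPDATE mytable
SET mycol_u = CONVERT(nvarchar(200),
    CONVERT(varchar(200), CONVERT(varbinary(200), mycol))
    COLLATE Arabic_CI_AS);

The varbinary hop copies the raw bytes unchanged, the COLLATE clause tells the server to read them as code page 1256, and the final conversion to nvarchar produces proper Unicode that a Unicode-aware client such as PB11 can display with any Arabic-capable font.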

Related

How to detect if user selected .txt file is Unicode/UTF-8 format and Convert to ANSI

My non-Unicode Delphi 7 application allows users to open .txt files.
Sometimes users try to open UTF-8/Unicode .txt files, which causes problems.
I need a function that detects whether the user is opening a .txt file with UTF-8 or Unicode encoding, and converts it automatically to the system's default code page (ANSI) encoding when possible, so that it can be used by the app.
In cases where conversion is not possible, the function should return an error.
The ReturnAsAnsiText(filename) function should open the .txt file and perform detection and conversion in steps like this:
If the byte stream has no byte values over $7F, it's ANSI; return as is.
If the byte stream has byte values over $7F, convert from UTF-8.
If the stream has a BOM, try a Unicode conversion.
If conversion to the system's current code page is not possible, return NULL to indicate an error.
It would be an acceptable limitation for this function that the user can only open files that match their region/code page (the Control Panel regional setting for non-Unicode apps).
The conversion function ReturnAsAnsiText, as you designed it, will have a number of issues:
The Delphi 7 application may not be able to open files whose filenames require UTF-8 or UTF-16.
UTF-8 (and other Unicode) usage has increased significantly since 2019; current web pages are between 98% and 100% UTF-8, depending on the language.
Your design will incorrectly translate some text that a standards-compliant implementation would handle.
Creating ReturnAsAnsiText is beyond the scope of an answer, but you should look at locating a library you can use instead of creating a new function. I haven't used Delphi since 2005 (I believe that was Delphi 7), but I found this MIT-licensed library that may get you there. It has a number of caveats:
It doesn't support all forms of BOM.
It doesn't support all encodings.
There is no universal "best-fit" behavior for single-byte character sets.
There are other issues that are tangentially described in this question. You wouldn't use an external command, but I used one here to demonstrate the point:
% iconv -f utf-8 -t ascii//TRANSLIT < hello.utf8
^h'elloe
iconv: (stdin):1:6: cannot convert
% iconv -f utf-8 -t ascii < hello.utf8
iconv: (stdin):1:0: cannot convert
Enabling TRANSLIT in standards-based libraries supports converting characters like é to the ASCII e, but it still fails on characters like π, since there is no ASCII character similar in form.
Your required answer would need massive UTF-8 and UTF-16 translation tables covering every supported code page and the BMP, and would still be unable to reliably detect the source encoding.
Notepad has trouble with this issue.
The solution as requested would probably entail more effort than you put into the original program.
Possible solutions
Add a text editor into your program. If you write it, you will be able to read it.
The following solution pushes the translation to established tables provided by Windows.
Use native Win32 API calls to translate strings, using functions like WideCharToMultiByte; but even this has its drawbacks (from the referenced page; the note is more relevant to the topic, but the caution is important for security):
Caution  Using the WideCharToMultiByte function incorrectly can compromise the security of your application. Calling this function can easily cause a buffer overrun because the size of the input buffer indicated by lpWideCharStr equals the number of characters in the Unicode string, while the size of the output buffer indicated by lpMultiByteStr equals the number of bytes. To avoid a buffer overrun, your application must specify a buffer size appropriate for the data type the buffer receives.
Data converted from UTF-16 to non-Unicode encodings is subject to data loss, because a code page might not be able to represent every character used in the specific Unicode data. For more information, see Security Considerations: International Features.
Note  The ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 or UTF-16, instead of a specific code page, unless legacy standards or data formats prevent the use of Unicode. If using Unicode is not possible, applications should tag the data stream with the appropriate encoding name when protocols allow it. HTML and XML files allow tagging, but text files do not.
This solution still has the guess-the-encoding problem, but if a BOM is present, this is one of the best translators possible (a rough Delphi sketch follows these options).
Simply require the text file to be saved in the local code page.
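As a concrete starting point for the Win32 API option above, here is a rough Delphi sketch; it is a hypothetical illustration, not a drop-in implementation. It assumes Windows XP or later (so CP_UTF8 accepts MB_ERR_INVALID_CHARS) and Delphi 7-era AnsiString/WideString types, and it signals failure through an Ok flag, since a Delphi string cannot be NULL:

uses Windows, SysUtils, Classes;

const
  WC_NO_BEST_FIT_CHARS = $00000400; // missing from some older Windows.pas versions

// Strict UTF-16 -> ANSI conversion; Ok is False if any character is lost.
function WideToAnsiStrict(const W: WideString; out Ok: Boolean): AnsiString;
var
  UsedDefault: BOOL;
  Len: Integer;
begin
  Result := '';
  Ok := W = '';
  if W = '' then Exit;
  Len := WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS,
    PWideChar(W), Length(W), nil, 0, nil, @UsedDefault);
  if Len = 0 then Exit;
  // Size the output buffer in bytes, per the MSDN caution quoted above.
  SetLength(Result, Len);
  Len := WideCharToMultiByte(CP_ACP, WC_NO_BEST_FIT_CHARS,
    PWideChar(W), Length(W), PAnsiChar(Result), Len, nil, @UsedDefault);
  Ok := (Len > 0) and not UsedDefault;
  if not Ok then Result := '';
end;

function ReturnAsAnsiText(const FileName: string; out Ok: Boolean): AnsiString;
var
  FS: TFileStream;
  Raw: AnsiString;
  W: WideString;
  I, WLen: Integer;
begin
  Ok := False;
  Result := '';
  FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Raw, FS.Size);
    if FS.Size > 0 then FS.ReadBuffer(Raw[1], FS.Size);
  finally
    FS.Free;
  end;
  // UTF-16LE BOM: reinterpret the remaining bytes as WideChars.
  if (Length(Raw) >= 2) and (Raw[1] = #$FF) and (Raw[2] = #$FE) then
  begin
    SetLength(W, (Length(Raw) - 2) div 2);
    if W <> '' then Move(Raw[3], W[1], Length(W) * SizeOf(WideChar));
    Result := WideToAnsiStrict(W, Ok);
    Exit;
  end;
  // UTF-8 BOM: strip it and fall through to the UTF-8 path.
  if (Length(Raw) >= 3) and (Raw[1] = #$EF) and (Raw[2] = #$BB) and (Raw[3] = #$BF) then
    Delete(Raw, 1, 3)
  else
  begin
    // No BOM: if no byte exceeds $7F it is plain ASCII, return as is.
    I := 1;
    while (I <= Length(Raw)) and (Raw[I] <= #$7F) do Inc(I);
    if I > Length(Raw) then
    begin
      Result := Raw;
      Ok := True;
      Exit;
    end;
  end;
  // Strict UTF-8 -> UTF-16; MB_ERR_INVALID_CHARS rejects malformed sequences.
  WLen := MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
    PAnsiChar(Raw), Length(Raw), nil, 0);
  if WLen = 0 then Exit;
  SetLength(W, WLen);
  MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
    PAnsiChar(Raw), Length(Raw), PWideChar(W), WLen);
  Result := WideToAnsiStrict(W, Ok);
end;

Note that, per the caveats above, this still mishandles BOM-less UTF-16 and any 8-bit code page whose bytes happen to form valid UTF-8.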
Other thoughts:
ANSI, ASCII, and UTF-8 are all distinct encodings above 127, and the control characters are handled differently.
In UTF-16, every other byte of ASCII-encoded text is zero (which byte depends on endianness). This is not covered by your "rules"; see the heuristic sketch after this list.
You simply have to search for the Turkish i to understand the complexities of Unicode translations and comparisons.
Leverage any expectations of the file contents to establish a coherent baseline comparison to make an educated guess.
For example, if it is a .csv file, find a comma in the various formats...
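As a rough illustration of the zero-byte heuristic mentioned above (a hypothetical Delphi helper, assuming UTF-16LE as used on Windows; Raw holds the file's raw bytes):

function LooksLikeUtf16LE(const Raw: AnsiString): Boolean;
var
  I, Zeros, Pairs: Integer;
begin
  Pairs := Length(Raw) div 2;
  Zeros := 0;
  I := 2; // high byte of the first UTF-16LE code unit
  while I <= Length(Raw) do
  begin
    if Raw[I] = #0 then Inc(Zeros);
    Inc(I, 2);
  end;
  // Guess UTF-16LE when at least three quarters of the high bytes are zero.
  Result := (Pairs >= 2) and (Zeros * 4 >= Pairs * 3);
end;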
Bottom Line
There is no perfect general solution, only specific solutions tailored to your specific needs, which were extremely broad in the question.

Loading data in hive table with multiple charsets

I am facing an issue where I have multiple files with different charsets; say one file has Chinese charsets and the other has French charsets. How can I load them into a single Hive table? I searched online and found this:
ALTER TABLE mytable SET SERDEPROPERTIES ('serialization.encoding'='SJIS');
With this I can handle the charset for one of the files, either the Chinese or the French one. Is there a way to handle both charsets at once?
[UPDATE]
Okay, I am using RegexSerDe for a fixed-width file, and the encoding scheme being used is ISO-8859-1. It seems RegexSerDe does not take this encoding scheme into account and splits the characters assuming the default UTF-8 encoding. Is there a way to take the encoding scheme into account with RegexSerDe?
I am not sure if this is possible (I think it isn't, based on https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/AbstractEncodingAwareSerDe.java). A workaround could be to create two tables with different encodings and create a view on top of them.
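A rough HiveQL sketch of that workaround (assuming LazySimpleSerDe honours serialization.encoding via AbstractEncodingAwareSerDe; the table names, columns, paths, and the GBK choice for the Chinese file are all hypothetical):

CREATE EXTERNAL TABLE mytable_cn (col1 STRING, col2 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.encoding'='GBK')
LOCATION '/data/mytable/cn';

CREATE EXTERNAL TABLE mytable_fr (col1 STRING, col2 STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('serialization.encoding'='ISO-8859-1')
LOCATION '/data/mytable/fr';

-- Each underlying table decodes its own files; the view presents one table.
CREATE VIEW mytable AS
SELECT * FROM mytable_cn
UNION ALL
SELECT * FROM mytable_fr;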

What should I use? UTF8 or UTF16?

I have to distribute my app internationally.
Let's say I have a control (like a memo) where the user enters some text. The user can be Japanese, Russian, Canadian, etc.
I want to save the string to disk as TXT file for later use. I will use MY OWN function to write the text and not something like TMemo.SaveToFile().
How do I want to save the string to disk? In UTF8 or UTF16 format?
The main difference between them is that UTF8 is backwards compatible with ASCII. As long as you only use the first 128 characters, an application that is not Unicode aware can still process the data (which may be an advantage or disadvantage, depending on your scenario). In particular, when switching to UTF16 every API function needs to be adjusted for 16-bit strings, while with UTF8 you can often leave old API functions untouched if they don't do any string processing.
Also UTF8 does not depend on endianess, while UTF16 does, which may complicate string I/O.
A common misconception is that UTF16 is easier to process because each character always occupies exactly two bytes. That is, unfortunately, not true. UTF16 is a variable-length encoding where a character may either take up 2 or 4 bytes. So any difficulties associated with UTF8 regarding variable-length issues apply to UTF16 just as well.
Finally, storage sizes: Another common myth about UTF16 is that it is more storage-efficient than UTF8 for most foreign languages. UTF8 takes less storage for all European languages, which can be encoded with one or two bytes per character. Non-BMP characters take up 4 bytes in both UTF8 and UTF16. The only case in which UTF16 takes less storage is if your text mainly consists of characters from the range U+0800 through U+FFFF, where the characters for Chinese, Japanese and Hindi are stored.
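For a quick concrete check of those storage sizes, a small console sketch (Delphi 2009+, where TEncoding is available; TEncoding.Unicode is UTF-16LE):

program EncodingSizes;
{$APPTYPE CONSOLE}
uses SysUtils;
begin
  // Bytes needed per sample string in UTF-8 vs. UTF-16.
  Writeln(Length(TEncoding.UTF8.GetBytes('abc')));    // 3: ASCII is 1 byte each in UTF-8
  Writeln(Length(TEncoding.Unicode.GetBytes('abc'))); // 6: always 2 bytes each in UTF-16
  Writeln(Length(TEncoding.UTF8.GetBytes('é')));      // 2: European accented letter
  Writeln(Length(TEncoding.Unicode.GetBytes('é')));   // 2
  Writeln(Length(TEncoding.UTF8.GetBytes('中')));     // 3: U+0800..U+FFFF range
  Writeln(Length(TEncoding.Unicode.GetBytes('中')));  // 2
end.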
James McNellis gave an excellent talk at BoostCon 2014, discussing the various trade-offs between different encodings in great detail. Even though the talk is titled Unicode in C++, the entire first half is actually language agnostic. A video recording of the full talk is available on BoostCon's YouTube channel, while the slides can be found on GitHub.
Depends on the language of your data.
If your data is mostly in western languages and you want to reduce the amount of storage needed, go with UTF-8, as for those languages it will take about half the storage of UTF-16. You will pay a penalty when reading the data, as it will need to be converted to UTF-16, which is the Windows default and is used by Delphi's (Unicode) string.
If your data is mostly in non-western languages, UTF-8 can take more storage than UTF-16, as it may take up to 4 bytes per character for some (see the comment by @KennyTM).
Basically: do some tests with representative samples of your users' data and see which performs better, both in storage requirements and load times. We have had some surprises with UTF-16 being slower than we thought. The performance gain of not having to transform from UTF-8 to UTF-16 was lost because of disk access as the data volume in UTF-16 is greater.
First of all, be aware that the standard encoding under Windows is UCS2 (until Windows 2000) or UTF-16 (since XP), and that Delphi's native "string" type has used the same native format since Delphi 2009 (string=UnicodeString, char=WideChar).
In all cases, it is unsafe to assume 1 WideChar == 1 Unicode character - this is the surrogate problem.
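To see the surrogate problem concretely, a tiny console sketch (Delphi 2009+):

program SurrogateDemo;
{$APPTYPE CONSOLE}
var
  S: string;
begin
  S := #$D834#$DD1E;  // U+1D11E (musical G clef) stored as a UTF-16 surrogate pair
  Writeln(Length(S)); // prints 2: Length counts WideChars, not Unicode characters
end.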
About UTF-8 or UTF-16 choice, it depends on the storage itself:
If your file is a plain text file (including XML), you may use either UTF-8 or UTF-16, but you will have to put a BOM at the beginning of the file, otherwise applications (like Notepad) may be confused when opening it; for XML this is handled by your library (if it is not, change to another library);
If you are sure that your content is mostly 7-bit ASCII, use UTF-8 and the associated BOM;
If your file is some kind of database or a custom binary format, certainly the best format is UTF-16/UCS2, i.e. the default Delphi 2009+ string layout, and certainly the default database API layout;
Some file formats require or prefer UTF-8 (like JSON or even SQLite3), even if UTF-8 files can be bigger than UTF-16 for Asiatic characters.
For instance, we used UTF-8 for our Client-Server framework, since we use JSON as the exchange format (which requires UTF-8), and since SQLite3 likes UTF-8. Of course, we had to write some dedicated functions and classes to avoid conversion to/from string (which is slow for the string=UnicodeString type since Delphi 2009, and may lose some data when used with the string=AnsiString type before Delphi 2009. See this post and this unit). The easiest is to rely on the string=UnicodeString type, use the RTL functions which handle UTF-16 encoding directly, and avoid conversions. And do not forget about your previous question.
If disk space and read/write speed are a problem, consider using compression instead of changing the encoding. There are real-time compression algorithms around (faster than ZIP), like LZO or our SynLZ.
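As an illustration of the plain-text-with-BOM advice above, a minimal sketch (Delphi 2009+; the procedure name is illustrative) that writes a string to disk as UTF-8 with its BOM; swapping TEncoding.UTF8 for TEncoding.Unicode would produce UTF-16LE instead:

uses SysUtils, Classes;

procedure SaveTextAsUtf8(const FileName, Text: string);
var
  Preamble, Bytes: TBytes;
  FS: TFileStream;
begin
  Preamble := TEncoding.UTF8.GetPreamble; // the EF BB BF byte-order mark
  Bytes := TEncoding.UTF8.GetBytes(Text);
  FS := TFileStream.Create(FileName, fmCreate);
  try
    FS.WriteBuffer(Preamble[0], Length(Preamble));
    if Length(Bytes) > 0 then
      FS.WriteBuffer(Bytes[0], Length(Bytes));
  finally
    FS.Free;
  end;
end;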

Some unicode characters are represented as "?" mark when inserting to Oracle from Delphi

I have written an application in Delphi 2010 that imports data from one database into another. I've done this many times before: from Access to Access, Access to SQL Server. But now I have to import data from SQL Server 2005 to Oracle 10g. I do this by selecting all the rows from a table in the SQL Server database and inserting them one by one into a table with the same structure in the Oracle database. The import performs normally, except that I get question marks for some Unicode characters. When I insert those characters into the database manually, it shows them properly. It's something between Delphi and Oracle. I use the UniDAC component set for this purpose. Does anybody know the reason for those question marks?
Basically two possibilities: either the character encoding is wrong, or the software used to display the text is using a font (or set of fonts) that does not contain all the characters. To check this, copy some of the displayed text containing the problem characters into another program, like MS Word, and see if it displays them. Set Word to use Arial Unicode MS if needed.
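If the encoding turns out to be the culprit, one setting worth checking (an assumption on my part, based on UniDAC's documented Oracle-specific options; the Oracle client's NLS_LANG setting also matters) is whether the connection exchanges data as Unicode:

// Hypothetical sketch: ask the UniDAC Oracle provider to use Unicode,
// so national characters are not narrowed to '?' on insert.
UniConnection1.ProviderName := 'Oracle';
UniConnection1.SpecificOptions.Values['UseUnicode'] := 'True';
UniConnection1.Open;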

Delphi TBytesField - How to see the text properly - Source is HIT OLEDB AS400

We are connecting to a multi-member AS400 iSeries table via HIT OLEDB and HIT ODBC.
You connect to this table via an alias to access a specific member. We create the alias on the AS400 this way:
CREATE ALIAS aliasname FOR table(membername)
We can then query each member of the table this way:
SELECT * FROM aliasname
We are testing this in Delphi 6 first, but will move it to D2010 later.
We are using HIT OLEDB for the AS400.
We are pulling down records from a table and the field is being seen as a TBytesField. I have also tried the ODBC driver and it sees it as a TBytesField as well.
Directly on the AS400 I can query the data and see readable text. I can use the iSeries Navigation tool and see readable text as well.
However, when I bring it down to the Delphi client via HIT OLEDB or HIT ODBC and try to view it via AsString, I just see unreadable text, something like this:
ñðð#ðõñððððñ÷#õôððõñòøóóöøñðÂÁÕÒ#ÖÆ#ÁÔÅÙÉÃÁ########ÂÈÙÉâãæÁðòñè#ÔK#k#ÉÕÃK#########ç
I jumbled up the text above, but that is the character types that show up.
When I did a test in D2010 the text looks like Japanese or Chinese characters, but if I display it as an AnsiString then it looks like it does in Delphi 6.
I am thinking this may have something to do with code pages or character sets, but I have no experience in this area, so it is new to me if it is related. When I look at the Coded Character Set on the AS400, it is set to 65535.
What do I need to do to make this text readable?
We do have a third-party component suite (Delphi400) that makes things behave in a more native AS400 manner. When I use its AS400 connection and AS400 query components, it shows the field as a TStringField and displays it just fine. BUT we are phasing out this product (for a number of reasons) and would really like to get the OLEDB with the ADO components working.
Just for clarification, the HIT OLEDB with TADOQuery does have some fields showing as TStringFields for many of the other tables we use... not sure why it is showing as a TBytesField in this case. I am not an AS400 expert, but looking at the field definitions on the AS400, the ones showing up as TBytesFields look the same as the ones showing up as TStringFields... but there must be a difference. Maybe it is due to being multi-member?
So... does anyone have any guidance on how to get the correct string data that is readable?
If you need more info please ask.
Greg
One problem is that your client doesn't know that it ought to convert the data from EBCDIC to ASCII because the CCSID on the server's table was set incorrectly.
A CCSID of 65535 is supposed to mean that the field contains binary data. Your client doesn't know that the column contains an EBCDIC encoded string, and therefore doesn't try to convert it.
On my servers, all of our character fields have a CCSID of 37, which is EBCDIC.
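If the column definition on the server cannot be corrected, a per-query workaround (hypothetical column name and length, and assuming your DB2 for i release accepts a CCSID clause on character casts) is to re-tag the binary column as EBCDIC in the SELECT, so the driver performs the conversion:

-- Assumption: MYFIELD is CHAR(50) CCSID 65535 but actually holds
-- EBCDIC (CCSID 37) text; the cast re-tags it so the client converts it.
SELECT CAST(MYFIELD AS CHAR(50) CCSID 37) AS MYFIELD_TEXT
FROM aliasname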
I found the answer: on both HIT ODBC 400 and HIT OLEDB 400 there is a property called "Convert CCSID 65535=True"; in the OLEDB UDL it looks like "Binary Characters=True".
Don't know how I missed those, but that did the trick!
Thanks for the feedback.
