Informix with en_US.57372 character set gives an error for a LATIN CAPITAL LETTER A WITH CIRCUMFLEX - informix

I am trying to read data from Informix DB using an ODBC driver.
Everything works fine until I try to read certain characters, such as 'Â' (LATIN CAPITAL LETTER A WITH CIRCUMFLEX).
The error message I get from the driver is error -21005:
"Invalid byte in codeset conversion input.".
Is there a reason this character set cannot handle those characters? If so, is there a website (I haven't found one) where I can see all the characters this codeset supports?

Error -21005 can also mean that invalid characters were inserted into your database because your CLIENT_LOCALE was wrong, and Informix did not detect this because CLIENT_LOCALE was set to the same value as DB_LOCALE, which prevented any conversion from being done.
Then, when you try to read the data containing invalid characters, Informix raises error -21005 to warn that some characters have been replaced by a placeholder, and therefore the data is not reversible.
See https://www.ibm.com/support/pages/error-21005-when-using-odbc-select-data-database for a detailed explanation on how an incorrect CLIENT_LOCALE can produce error -21005 when querying data.
CLIENT_LOCALE should always be set to the locale of the PC where your queries are generated, and DB_LOCALE must match the locale with which the database was created. You can find the database locale with "SELECT tabname, site FROM systables WHERE tabid IN (90, 91)". Beware that, for example, en_US.57372 really means en_us.utf8; you need to look in gls\cm3\registry to see the mappings.
EDIT: The answer on Queries problems using IBM Informix ODBC Driver in C# also explains in great detail the misery a wrong CLIENT_LOCALE can bring, and how to fix it.
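As a sketch of what "setting the locales explicitly" looks like from a client, here is a minimal Python helper that builds an Informix ODBC connection string with both locales spelled out. The DSN name and locale values are illustrative assumptions, not taken from the question:

```python
# Build an Informix ODBC connection string with explicit locales.
# CLIENT_LOCALE describes the client machine; DB_LOCALE must match the
# locale the database was created with (en_US.57372 == en_us.utf8).
def informix_conn_str(dsn, client_locale, db_locale):
    parts = {
        "DSN": dsn,
        "CLIENT_LOCALE": client_locale,
        "DB_LOCALE": db_locale,
    }
    return ";".join(f"{k}={v}" for k, v in parts.items())

conn_str = informix_conn_str("myifx", "en_us.utf8", "en_us.utf8")
print(conn_str)  # DSN=myifx;CLIENT_LOCALE=en_us.utf8;DB_LOCALE=en_us.utf8
# With pyodbc you would then pass it as: pyodbc.connect(conn_str)
```

Making the two values explicit (and correct) is what lets the driver perform, or correctly skip, the codeset conversion instead of silently storing invalid bytes.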

Related

Impala Column Name Issue

We are facing a problem with the Impala column naming convention, which seems unclear to us.
The CDH Impala documentation (http://www.cloudera.com/documentation/archive/impala/2-x/2-0-x/topics/impala_identifiers.html) says in its third bullet point: An identifier must start with an alphabetic character. The remainder can contain any combination of alphanumeric characters and underscores. Quoting the identifier with backticks has no effect on the allowed characters in the name.
Now, due to a dependency on upstream SAP systems, we had to give a column a name starting with a zero (0). Impala shows no error while defining the table or extracting records from it. However, when connecting Impala to SAP HANA through SDA (Smart Data Access), the extraction fails for this particular column starting with a leading zero (0), and works fine for the rest of the columns, which start with a letter. The error shows as "... ^ Encountered: DECIMAL LITERAL".
I have two points.
If the documentation says an identifier cannot start with anything other than a letter, how does the Impala query run without any issues?
Why is the error only raised when the data is extracted from SAP HANA?
Any insight will be highly appreciated.
Ok, I can only say something about the SAP HANA side here, so you will have to check the Impala side somehow.
The error message you get while accessing an external table via SDA typically comes from the 3rd party client software, in this case the ODBC driver you use to connect to Impala.
So, SAP HANA tries to access the table through the Impala ODBC driver and that driver returns the error message.
I assume that the object-name check for Impala is implemented in the client in this case. I am not sure whether the way you run the query in Impala also goes through that driver.
But even if Impala has this naming limitation in place, I fail to see why it would force you to use that name in SAP HANA. If the upstream data access requires the leading 0, just create a view on top of the table and you're good to go.
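The view workaround can be sketched in Python with sqlite3 as a stand-in for the remote system; the table and column names here are made up, but the pattern is the same: quote the awkward identifier once, and expose a clean alias through a view so downstream consumers never see the leading digit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A column whose name starts with a digit needs quoting everywhere.
cur.execute('CREATE TABLE sap_feed ("0MATERIAL" TEXT, plant TEXT)')
cur.execute("INSERT INTO sap_feed VALUES (?, ?)", ("M-100", "P01"))

# Hide the awkward name behind a view with a valid identifier.
cur.execute('CREATE VIEW sap_feed_v AS '
            'SELECT "0MATERIAL" AS material_0, plant FROM sap_feed')

rows = cur.execute("SELECT material_0, plant FROM sap_feed_v").fetchall()
print(rows)  # [('M-100', 'P01')]
```

Consumers that choke on the leading-zero identifier (as the SDA path apparently does) only ever query the view.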

tFuzzyMatch apparently not working on Arabic text strings

I have created a job in talend open studio for data integration v5.5.1.
I am trying to find matches between two customer names columns, one is a lookup and the other contain dirty data.
The job runs as expected when the customer names are in English. However, for Arabic names, only exact matches are found, regardless of the matching algorithm I used (Levenshtein, Metaphone, Double Metaphone), even with loose bounds for the Levenshtein algorithm (min 1, max 50).
I suspect this has to do with character encoding. How should I proceed? Is there any way I can operate on the Unicode or UTF-8 interpretation in Talend?
I am using Excel data sources through tFileInputExcel.
I got it resolved by moving the data to MySQL with a UTF-8 collation. Somehow the Excel input wasn't preserving the encoding.

What should I do with emails using charset ansi_x3.110-1983?

My application is parsing incoming emails. I try to parse them as well as possible, but every now and then I get one with puzzling content. This time it is an email that looks to be ASCII, but the specified charset is ansi_x3.110-1983.
My application handles it correctly by defaulting to ASCII, but it throws a warning which I'd like to stop receiving, so my question is: what is ansi_x3.110-1983 and what should I do with it?
According to this page on the IANA's site, ANSI_X3.110-1983 is also known as:
iso-ir-99
CSA_T500-1983
NAPLPS
csISO99NAPLPS
Of those, only the name NAPLPS seems interesting or informative. If you can, consider getting in touch with the people sending those mails. If they're really using Prodigy in this day and age, I'd be amazed.
The IANA site also has a pointer to RFC 1345, which contains a description of the bytes and the characters that they map to. Compared to ISO-8859-1, the control characters are the same, as are most of the punctuation, all of the numbers and letters, and most of the remaining characters in the first 7 bits.
You could possibly use the guide in the RFC to write a tool to map the characters over, if someone hasn't written a tool for it already. To be honest, it may be easier to simply ignore the whines about the weird character set given that the character mapping is close enough to what is expected anyway...
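In Python, for example, the codec registry does not know this charset at all, so one pragmatic approach, mirroring what the question's application already does, is to try the declared charset and fall back to ASCII with replacement. The fallback choice here is an assumption for illustration, not part of any standard:

```python
def decode_body(raw: bytes, declared_charset: str) -> str:
    """Decode an email body, falling back when the charset is unknown."""
    try:
        return raw.decode(declared_charset)
    except LookupError:
        # Python has no codec for e.g. ansi_x3.110-1983; since most of
        # its 7-bit range matches ASCII, this is usually good enough.
        return raw.decode("ascii", errors="replace")

print(decode_body(b"Hello, world", "ansi_x3.110-1983"))  # Hello, world
```

Any byte outside the ASCII overlap comes back as the replacement character, which matches the "close enough" reasoning above.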

unexpected character while parsing XML from batch script

I am getting this error when trying to parse XML from a batch script.
error :
< was unexpected at this time.
xml:
<driver type=".dbdriver">
<attributes>localhost;1521;XE;false</attributes>
<driverType>Oracle thin</driverType>
</driver>
<password>7ECE6B7E7D2AF514C55BAE8B3A6B51E7</password>
<user>JR</user>
batch script:
for /f "tokens=3 delims=><" %%j in ('type %SETTINGSPATH% ^| find "<user>"') do set user=%%j
This code is supposed to read the user value from the XML, which is just "JR". On some machines I get this value, but other machines do not show the value and show this error instead.
Please guide.
Parsing XML with batch is often problematic and always risky. A valid XML document could be legitimately reformatted in any number of ways that would break your parser. But if you really want to continue to use batch...
That error message occurs when you have an unescaped and unquoted < character in your IN() clause. The "<user>" is already quoted, so that normally should not be a problem. The problem must stem from the value contained in %SETTINGSPATH%. Either the value must have an unquoted and unescaped <, or there must be an odd number of quotes in the value. The odd number of quotes would cause the <user> to no longer be quoted.
The only other possibility is that you have not shown us all your code, and the error is occurring somewhere else.
This will never work reliably, because you are trying to process XML using the wrong tools. There is an infinite number of textual representations of an XML document that have the same semantic meaning. As a result, a space here or a new line there will not change the meaning of your document, but will break your script, even though all tools that process the input as XML will continue to work correctly. Use PowerShell or VBScript/JScript, where you can use real XML capabilities; otherwise you will always have problems like this. You should not use a brush to drive screws.
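For illustration, here is the same extraction with a real XML parser in Python (the same point applies to PowerShell or VBScript). The &lt;settings&gt; root element is an assumption, since the fragment in the question has no visible root:

```python
import xml.etree.ElementTree as ET

# The question's fragment, wrapped in an assumed <settings> root.
doc = """<settings>
  <driver type=".dbdriver">
    <attributes>localhost;1521;XE;false</attributes>
    <driverType>Oracle thin</driverType>
  </driver>
  <password>7ECE6B7E7D2AF514C55BAE8B3A6B51E7</password>
  <user>JR</user>
</settings>"""

root = ET.fromstring(doc)
user = root.findtext("user")
print(user)  # JR
```

A parser like this is indifferent to whitespace, line breaks, and attribute order, which is exactly what the token-splitting batch approach cannot cope with.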

Delphi TBytesField - How to see the text properly - Source is HIT OLEDB AS400

We are connecting to a multi-member AS400 iSeries table via HIT OLEDB and HIT ODBC.
You connect to this table via an alias to access a specific multi-member. We create the alias on the AS400 this way:
CREATE ALIAS aliasname FOR table(membername)
We can then query each member of the table this way:
SELECT * FROM aliasname
We are testing this in Delphi 6 first, but will move it to D2010 later.
We are using HIT OLEDB for the AS400.
We are pulling down records from a table and the field is being seen as a TBytesField. I have also tried the ODBC driver, and it sees it as a TBytesField as well.
Directly on the AS400 I can query the data and see readable text. I can use the iSeries Navigation tool and see readable text as well.
However, when I bring it down to the Delphi client via HIT OLEDB or HIT ODBC and try to view it via AsString, I just see unreadable text, something like this:
ñðð#ðõñððððñ÷#õôððõñòøóóöøñðÂÁÕÒ#ÖÆ#ÁÔÅÙÉÃÁ########ÂÈÙÉâãæÁðòñè#ÔK#k#ÉÕÃK#########ç
I jumbled up the text above, but that is the character types that show up.
When I did a test in D2010, the text looks like Japanese or Chinese characters, but if I display it as an AnsiString then it looks like it does in Delphi 6.
I am thinking this may have something to do with code pages or character sets, but I have no experience in this area, so I do not know whether it is related. When I look at the Coded Character Set on the AS400, it is set to 65535.
What do I need to do to make this text readable?
We do have a third-party component (Delphi400) that makes things behave in a more native AS400 manner. When I use its AS400 connection and AS400 query components, it shows the field as a TStringField and displays just fine. BUT we are phasing out this product (for a number of reasons) and would really like the OLEDB with the ADO components to work.
Just for clarification: HIT OLEDB with TADOQuery does show some fields as TStringFields for many of the other tables we use... not sure why this one shows as a TBytesField. I am not an AS400 expert, but looking at the field definitions on the AS400, the ones showing up as TBytesField look the same as the ones showing up as TStringFields... but there must be a difference. Maybe due to being multi-member?
So... does anyone have any guidance on how to get the correct string data that is readable?
If you need more info please ask.
Greg
One problem is that your client doesn't know that it ought to convert the data from EBCDIC to ASCII because the CCSID on the server's table was set incorrectly.
A CCSID of 65535 is supposed to mean that the field contains binary data. Your client doesn't know that the column contains an EBCDIC encoded string, and therefore doesn't try to convert it.
On my servers, all of our character fields have a CCSID of 37, which is EBCDIC.
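What the missing conversion looks like can be illustrated in Python: the very same bytes read as accented gibberish when the client treats them as "binary" single-byte data, but as plain text when decoded with an EBCDIC code page (cp037 here, matching CCSID 37):

```python
# EBCDIC (CCSID 37 / cp037) bytes for the word "HELLO".
raw = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])

# Interpreted as Latin-1 (what happens when CCSID 65535 data is passed
# through unconverted), the gibberish from the question appears:
print(raw.decode("latin-1"))  # ÈÅÓÓÖ

# Decoded with the EBCDIC code page, it is readable text:
print(raw.decode("cp037"))    # HELLO
```

This is exactly the translation the driver's "Convert CCSID 65535" option mentioned below turns back on.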
I found the answer... on both HIT ODBC 400 and HIT OLEDB 400 there is a property called: "Convert CCSID 65535=True" or in the OLEDB UDL it looks like "Binary Characters=True".
Don't know how I missed those, but that did the trick!
Thanks for the feedback.
