Tableau with Vertica accented characters not displaying from VARCHAR field - character-encoding

I've created a data connection to a Vertica table from Tableau and have a 'surname1' field in the rows. This field is a VARCHAR in Vertica, and if I run a SELECT from the command line I can see the accented characters with no problem.
The problem is that in Tableau these are not represented correctly, and I can't find any way to change the field encoding in Tableau to recognise them.
Does anybody know how to solve this?
Below is an example of a SELECT from Vertica on the command line; these are the values that do not display correctly in Tableau:
surname1
---------------
Mérida
Fernández
Villadóniga
Muñoz
López
Thanks in advance,
James

Just leaving this in case it helps anybody in the future:
The cause of the problem was that the Vertica database was being fed by a MySQL database through a mysqli connection. This connection's character encoding was configured as latin1 / 8859-1, whereas Vertica was configured for UTF-8.
The problem was then further compounded because the PuTTY window I was using to access Vertica from Windows was also configured for latin1 / 8859-1, which effectively masked the fact that the data wasn't stored correctly in Vertica as UTF-8.
To solve this, I reconfigured the mysqli connection that fed Vertica to use UTF-8 encoding, with the following line of code:
$mysqli->set_charset("utf8");
Note, to find out that the character set was latin1 in the first place, I used the following:
echo $CMySQLI->character_set_name();
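Also, if you want to check whether any rows loaded while the connection was still latin1 are sitting in Vertica as invalid UTF-8, something along these lines should flag them (ISUTF8 is a built-in Vertica string function; customer_table is just a placeholder name):
-- list rows whose surname1 bytes are not valid UTF-8
SELECT surname1 FROM customer_table WHERE NOT ISUTF8(surname1);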
In summary, if you find an accented-character problem with Tableau and you're accessing your DB through PuTTY, ensure the character encoding is aligned between PuTTY and the DB so that errors aren't masked in this way.
Regards,
James

Related

Informix with en_US.57372 character set gives an error for a LATIN CAPITAL LETTER A WITH CIRCUMFLEX

I am trying to read data from an Informix DB using an ODBC driver.
Everything is just fine until I try to read a few characters such as 'Â'.
The error message I am getting from the driver is error -21005, which is:
"Invalid byte in codeset conversion input.".
Is there a reason this character set is not able to read those characters? If so, is there a website (I haven't found one) where I can see all of the characters this codeset supports?
This error -21005 could also mean that you have inserted invalid characters into your database because your CLIENT_LOCALE was wrong, but Informix did not detect this because it was set the same as DB_LOCALE, which prevented any conversion from being done.
Then, when you try to read the data containing invalid characters, Informix produces error -21005 to warn that some of the characters have been replaced by a placeholder, and therefore the data will not be reversible.
See https://www.ibm.com/support/pages/error-21005-when-using-odbc-select-data-database for a detailed explanation on how an incorrect CLIENT_LOCALE can produce error -21005 when querying data.
CLIENT_LOCALE should always be set to the locale of the PC where your queries are being generated, and DB_LOCALE must match the locale with which the database was defined. You can find the latter with:
SELECT tabname, site FROM systables WHERE tabid IN (90, 91)
Beware that, for example, en_US.57372 really means en_US.utf8; you need to look in gls\cm3\registry to see the mappings.
EDIT: The answer on Queries problems using IBM Informix ODBC Driver in C# also explains in great detail the misery a wrong CLIENT_LOCALE can bring, and how to fix it.

How to correctly read Latin 1 characters from a Postgres database using C++

I have imported a shapefile into a Postgres database that uses Latin1 character encoding. (The database could not import it using UTF-8.) When I retrieve a value using the PQgetvalue() method, some special characters come back incorrectly. For example, a field value "STURDEEÿAVENUE" is incorrectly converted to "STURDEEÃ¿AVENUE".
Since you are getting the data back as UTF-8, your client_encoding is probably wrong. It can be set per connection and controls the encoding in which strings are sent back to the client. By setting the variable to Latin1 immediately after connecting, you can retrieve the strings in the desired encoding.
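For example, issuing the following right after connecting should make the server return text in Latin-1 for that session (this is standard PostgreSQL; from libpq you can equally call PQsetClientEncoding on the connection):
-- ask the server to convert outgoing text to Latin-1 for this session
SET client_encoding TO 'LATIN1';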

Strange character encoding issue

I have some data which has been imported into Postgres, for use in a Rails application. However, somehow the foreign accents have become strangely encoded:
ä appears as â§
á appears as â°
é appears as â©
ó appears as ââ¥
I'm pretty sure the problem is with the integrity of the data, rather than any problem with Rails. It doesn't seem to match any encoding I try:
# Replace "cp1252" with any other encoding, to no effect
"Trollâ§ttan".encode("cp1252").force_encoding("UTF-8") #-> junk
If anyone was able to identify what kind of encoding mixup I'm suffering from, that would be great.
As a last resort, I may have to manually replace each corrupted accent character, but if anyone can suggest a programmatic solution (or even a starting point for fixing this - I've found it very hard to debug), I'd be very grateful.
It's hardly possible with recent versions of PostgreSQL to have invalid UTF8 inside a UTF8 database. There are other plausible possibilities that may lead to that output, though.
In the typical case of é appearing as Ã©, either:
The contents of the database are valid, but some client-side layer is interpreting the bytes from the database as if they were iso-latin-something whereas they are UTF8.
The contents are valid and the SQL client-side layer is valid, but the terminal/software/webpage with which you're looking at this is configured for iso-latin1 or a similar mono-bytes encoding (win1252, iso-latin9...).
The contents of the database consist of the wrong characters with a valid UTF8 encoding. This is what you end up with if you take iso-latin-something bytes, convert them to their UTF8 representation, then take the resulting byte stream as if it was still iso-latin, reconvert it once again to UTF8, and insert that into the database.
Note that while the Ã© sequence is typical of UTF8 versus iso-latin confusion, the presence of an additional â in all your sample strings is uncommon. It may be the result of another misinterpretation on top of the primary one. If you're in case #3, that may mean that an automated fix based on search-and-replace will be harder than the normal case, which is already tricky.
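If you do end up in case #3 (wrong characters stored with a valid UTF8 encoding), one sketch of a repair in PostgreSQL is to re-encode the stored text back to latin1 bytes and reinterpret those bytes as UTF8. Treat it as a starting point only and run it as a SELECT first: it reverses a single round of misinterpretation, and the extra â suggests your data may have gone through more than one; the table and column names here are placeholders:
-- re-encode the stored (wrong) characters to their latin1 bytes,
-- then reinterpret those bytes as UTF8
SELECT convert_from(convert_to(name, 'LATIN1'), 'UTF8') FROM people;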

Some unicode characters are represented as "?" mark when inserting to Oracle from Delphi

I have written an application in Delphi 2010 that imports data from one database to another. I've done this many times before: from Access to Access, and from Access to SQL Server. But now I have to import data from SQL Server 2005 to Oracle 10g. I do this by selecting all the rows from a table in the SQL Server database and inserting them one by one into a table with the same structure in the Oracle database. The import performs normally, except that I get question marks for some Unicode characters. When I insert those characters into the database manually, it shows them properly. It's something between Delphi and Oracle. I use the UniDAC component set for this purpose. Does anybody know the reason for those question marks?
Basically there are two possibilities: either the character encoding is wrong, or the software used to display the text is using a font (or set of fonts) that does not contain all the characters. To check this, copy some of the displayed text containing the problem characters into another program, such as MS Word, and see if it displays them; set Word to use Arial Unicode MS if needed.
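If it turns out to be the encoding rather than the font, it is also worth checking which character set the Oracle database was created with, because a non-Unicode database character set will silently replace unsupported characters with question marks on insert. A query along these lines against the standard data dictionary view shows it:
-- AL32UTF8 can hold any Unicode character; a single-byte set such as WE8MSWIN1252 cannot
SELECT value FROM nls_database_parameters WHERE parameter = 'NLS_CHARACTERSET';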

Delphi TBytesField - How to see the text properly - Source is HIT OLEDB AS400

We are connecting to a multi-member AS400 iSeries table via HIT OLEDB and HIT ODBC.
You connect to this table via an alias to access a specific multi-member. We create the alias on the AS400 this way:
CREATE ALIAS aliasname FOR table(membername)
We can then query each member of the table this way:
SELECT * FROM aliasname
We are testing this in Delphi 6 first, but will move it to D2010 later.
We are using HIT OLEDB for the AS400.
We are pulling down records from a table and the field is being seen as a TBytesField. I have also tried the ODBC driver and it sees it as a TBytesField as well.
Directly on the AS400 I can query the data and see readable text. I can use the iSeries Navigation tool and see readable text as well.
However, when I bring it down to the Delphi client via HIT OLEDB or HIT ODBC and try to view it via AsString, I just see unreadable text, something like this:
ñðð#ðõñððððñ÷#õôððõñòøóóöøñðÂÁÕÒ#ÖÆ#ÁÔÅÙÉÃÁ########ÂÈÙÉâãæÁðòñè#ÔK#k#ÉÕÃK#########ç
I jumbled up the text above, but that is the character types that show up.
When I did a test in D2010 the text looks like Japanese or Chinese characters, but if I display it as an AnsiString then it looks like it does in Delphi 6.
I am thinking this may have something to do with code pages or character sets, but I have no experience in this area, so it is new to me if it is related. When I look at the Coded Character Set on the AS400 it is set to 65535.
What do I need to do to make this text readable?
We do have a third-party component (Delphi400) that makes things behave in a more native AS400 manner. When I use its AS400 connection and AS400 query components it shows the field as a TStringField and displays just fine. BUT we are phasing out this product (for a number of reasons) and would really like to make OLEDB with the ADO components work.
Just for clarification, the HIT OLEDB with TADOQuery does show some fields as TStringFields for many of the other tables we use... not sure why it is showing as a TBytesField in this case. I am not an AS400 expert, but looking at the field definitions on the AS400, the ones showing up as TBytesFields look the same as the ones showing up as TStringFields... but there must be a difference. Maybe it is due to being a multi-member?
So... does anyone have any guidance on how to get the correct string data that is readable?
If you need more info please ask.
Greg
One problem is that your client doesn't know that it ought to convert the data from EBCDIC to ASCII because the CCSID on the server's table was set incorrectly.
A CCSID of 65535 is supposed to mean that the field contains binary data. Your client doesn't know that the column contains an EBCDIC encoded string, and therefore doesn't try to convert it.
On my servers, all of our character fields have a CCSID of 37, which is EBCDIC.
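If you want to see which columns are tagged with CCSID 65535, the catalog should tell you; something along these lines works on DB2 for i (the library and table names are placeholders):
-- columns tagged 65535 are treated as binary and are not converted from EBCDIC
SELECT column_name, ccsid FROM qsys2.syscolumns
WHERE table_schema = 'MYLIB' AND table_name = 'MYTABLE';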
I found the answer... on both HIT ODBC 400 and HIT OLEDB 400 there is a property called: "Convert CCSID 65535=True" or in the OLEDB UDL it looks like "Binary Characters=True".
Don't know how I missed those, but that did the trick!
Thanks for the feedback.
