Mysqldump error on unique columns using extended charsets - character-encoding

I'm doing a backup replication of some phpBB forums from one server to another using mysqldump, using some basic options:
mysqldump -h[server] --create-options --add-drop-database -R -E -B [database]
While doing so, I got an error like this:
ERROR 1062 (23000) at line 9322: Duplicate entry '?????' for key 'wrd_txt'
In phpBB forums that is a UNIQUE key on a table in which every word posted is registered and counted. The problem seems to be this one:
When mysqldump dumps a DOUBLE value, it uses insufficient precision to
distinguish between some close values (and, presumably, insufficient
precision to recreate the exact values from the original database). If
the DOUBLE value is a primary key or part of a unique index, restoring
the database from this output fails with a duplicate key error.
It seems to be caused by some posts in the Cyrillic alphabet on our forums; mysqldump appears to treat Cyrillic characters as a single value and truncate them, so every Cyrillic character ends up as the same value (represented as ? in this case). That results in repeated values for strings of the same length in a UNIQUE key column.
Is there any way to perform a dump with full precision using other options or another tool? Or any way to avoid this problem when dumping?
Just for the record: since the Cyrillic words in that table were only there because of spam, and we were only interested in Latin characters, I got rid of them using this command (maybe it will be useful for someone).
delete from [table] where NOT HEX([column]) REGEXP '^([0-C][0-9A-F])*$';
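For reference, one option worth trying (I have not verified that it avoids the duplicate-key error here) is forcing the character set the dump is performed in, so the Cyrillic text is not transliterated to ?:
mysqldump -h[server] --default-character-set=utf8 --create-options --add-drop-database -R -E -B [database]
and then making sure the file is restored with the same character set on the target server.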
Thanks a lot in advance!

Related

Informix with en_US.57372 character set gives an error for a LATIN CAPITAL LETTER A WITH CIRCUMFLEX

I am trying to read data from Informix DB using an ODBC driver.
Everything is just fine until I try to read a few characters such as 'Â'.
The error message I get from the driver is Error -21005, which is:
"Invalid byte in codeset conversion input."
Is there a reason this character set is not able to read those characters? If so, is there a website (I haven't found one) where I can see the full set of supported characters for this codeset?
Error -21005 can also mean that you have inserted invalid characters into your database because your CLIENT_LOCALE was wrong, but Informix did not detect this because it was set the same as DB_LOCALE, which prevented any conversion from being done.
Then, when you try to read the data containing invalid characters, Informix produces error -21005 to warn that some of the characters have been replaced by a placeholder, and that the data is therefore not reversible.
See https://www.ibm.com/support/pages/error-21005-when-using-odbc-select-data-database for a detailed explanation on how an incorrect CLIENT_LOCALE can produce error -21005 when querying data.
CLIENT_LOCALE should always be set to the locale of the PC where your queries are generated, and DB_LOCALE must match the locale with which the database was defined. You can find the latter with "SELECT tabname, site FROM systables WHERE tabid IN (90, 91)", but beware that, for example, en_US.57372 really means en_us.utf8; you would need to look in gls\cm3\registry to see the mappings.
EDIT: The answer on Queries problems using IBM Informix ODBC Driver in C# also explains in great detail the misery a wrong CLIENT_LOCALE can bring, and how to fix it.
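As an illustration only (the host, database and locale values here are made up, and keyword spelling can vary by driver version), an ODBC connection string that sets both locales explicitly might look roughly like this:
Driver={IBM INFORMIX ODBC DRIVER};Host=myhost;Server=my_ifx_server;Service=9088;Protocol=onsoctcp;Database=mydb;Uid=myuser;Pwd=mypassword;CLIENT_LOCALE=en_us.utf8;DB_LOCALE=en_us.utf8
with en_us.utf8 standing in for whatever locale the database was actually created with.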

tFuzzyMatch apparently not working on Arabic text strings

I have created a job in Talend Open Studio for Data Integration v5.5.1.
I am trying to find matches between two customer name columns; one is a lookup and the other contains dirty data.
The job runs as expected when the customer names are in English. However, for Arabic names, only exact matches are found regardless of the matching algorithm I used (Levenshtein, Metaphone, Double Metaphone), even with loose bounds for the Levenshtein algorithm (min 1, max 50).
I suspect this has to do with character encoding. How should I proceed? Is there any way I can operate on the Unicode or even UTF-8 interpretation in Talend?
I am using Excel data sources through tFileInputExcel.
I got it resolved by moving the data to MySQL with a UTF-8 collation. Somehow the Excel input wasn't preserving the collation.
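For anyone trying the same workaround, a minimal sketch of the kind of staging table I mean (the table and column names are placeholders):
CREATE TABLE customer_names (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  customer_name VARCHAR(255)
) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
Loading both the lookup column and the dirty column into tables like this before the tFuzzyMatch step keeps everything in one explicit Unicode encoding.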

What is uuid-ossp in Postgres

I've seen this in a migration
enable_extension 'uuid-ossp'
As far as I know, a UUID is a long unique string based on some RFCs, and this enables the db (in this case pg) to have a column of type uuid.
My question is: why is this type of column needed and not just a string column?
Is it to replace the regular integer id column and have a UUID as the id instead?
Is there any advantage to using a UUID as the id instead of just having a string column contain a UUID?
I was hoping to see some more people chime in here, but I think the idea of the UUID is to replace the id column with a more unique id, which is especially useful when you've got a distributed database or are dealing with replication.
Pros:
Easier to merge data
Better scaling when/if you have to move to a distributed system
Avoids Postgres sequence problems which often occur when merging or copying data
You can generate them from other platforms (other than just the database, if you need)
It obfuscates your record identifiers (e.g. accessing users/1, an obviously sequential id, might prompt a curious user to try users/2 to see if they can access someone else's information). Obviously there are other ways of dealing with this particular issue, however.
Cons:
Requires a larger key length than a typical id
Is usually non-sequential (which can lead to strange behavior if you're ordering on it, which you probably shouldn't be doing generally anyhow)
Harder to reference when troubleshooting (finding by a long UUID rather than a simple integer id)
Here are some more resources which I found valuable:
Peter van Hardenberg's (of Heroku) argument for UUIDs (among other things, this is an amazing presentation and you should watch all of it)... Here's the part on using UUID's rather than ids: http://vimeo.com/61044807#t=15m04s
Jeff Atwood's (formerly of StackOverflow) argument for GUIDs: http://www.codinghorror.com/blog/2007/03/primary-keys-ids-versus-guids.html
http://rny.io/rails/postgresql/2013/07/27/use-uuids-in-rails-4-with-postgresql.html
http://blog.crowdint.com/2013/10/09/using-postgres-uuids-as-primary-keys-on-rails.html
It is not necessary to install that extension to use the uuid type. There are two advantages of using the uuid type instead of a text type. The first is the automatic input validation:
select 'a'::uuid;
ERROR: invalid input syntax for uuid: "a"
Second is storage space. UUID only uses 16 bytes while the hex representation takes 33:
select
  pg_column_size('0123456789abcdef0123456789abcdef'),
  pg_column_size('0123456789abcdef0123456789abcdef'::uuid);

 pg_column_size | pg_column_size
----------------+----------------
             33 |             16
The uuid-ossp extension just adds functions to generate UUIDs.
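For completeness, a rough sketch of how the extension is typically used (the table here is just an example):
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

CREATE TABLE users (
  id   uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
  name text
);

SELECT uuid_generate_v4();
If you only need random (version 4) UUIDs, gen_random_uuid() is also available (from pgcrypto, and built into core from Postgres 13), so uuid-ossp is only needed for its generator functions, not for the column type itself.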

reading and sorting a variable length CSV file

We are using an OpenVMS system, and I believe it is using the COBOL from HP.
I have a data file with a lot of records (500 MB or more) of variable length. The records are comma delimited. I would like to parse each record and extract the corresponding fields for processing. After that, I might want to sort it by some particular fields. Is this possible with COBOL?
I've seen sorting with fixed-length records only.
Variable length is no problem. I'm not sure exactly how this is done in VMS COBOL, but the IBMese for this is:
FILE SECTION.
FD  THE-FILE RECORD IS VARYING DEPENDING ON REC-LENGTH.
01  THE-RECORD PICTURE X(5000).
WORKING-STORAGE SECTION.
01  REC-LENGTH PICTURE 9(5) COMPUTATIONAL.
When you read the file, REC-LENGTH will contain the record length; when you write a record, it will write a record of length REC-LENGTH.
To handle the delimited records you will probably need to use the UNSTRING verb to convert them into a fixed format. This is pretty verbose (but then, this is COBOL).
UNSTRING THE-RECORD DELIMITED BY ","
    INTO FIELD1, FIELD2, FIELD3, FIELD4, FIELD5 etc....
END-UNSTRING
Once the record is in fixed format you can use the SORT as normal.
The COBOL SORT verb will do what you need.
If the SD file contains variable-length records, all of the KEY data-items must be contained within the first n character positions of the record, where n equals the minimum record size specified for the file. In other words, they have to be in the fixed part.
However, you can get around this easily by using an input procedure. This lets you create a virtual file that has its keys in the right place. In your input procedure, you reformat your variable-length, comma-delimited record into one that has its keys at the front, then RELEASE it to the sort, as sketched below.
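As an illustration only (the data names, lengths and the EOF-FLAG switch are made up, and VMS COBOL syntax may differ slightly from this IBM-flavoured sketch):
SD  SORT-FILE.
01  SORT-REC.
    05 SR-KEY   PICTURE X(30).
    05 SR-DATA  PICTURE X(4970).

    SORT SORT-FILE ON ASCENDING KEY SR-KEY
        INPUT PROCEDURE IS BUILD-SORT-RECORDS
        GIVING SORTED-FILE.

BUILD-SORT-RECORDS SECTION.
    PERFORM UNTIL EOF-FLAG = "Y"
        READ THE-FILE
            AT END MOVE "Y" TO EOF-FLAG
            NOT AT END
                UNSTRING THE-RECORD DELIMITED BY ","
                    INTO SR-KEY, SR-DATA
                END-UNSTRING
                RELEASE SORT-REC
        END-READ
    END-PERFORM.
The SORT statement itself sits in the main line of the PROCEDURE DIVISION; the input procedure is only run by the sort, and each RELEASE hands it one reformatted record with the key at the front.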
If my memory is correct, VMS has a SORT/MERGE utility that you could use after you have processed the file into a fixed format (variable may also be possible). Typically a standalone SORT utility performs better than an in-line COBOL SORT, and it can be a better design if the sort criteria change in the future.
There's no need to write a solution in COBOL, at least not to sort the file. The Unix sort utility should do it just fine; just call sort -t ',' -n with maybe a couple of other options.
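For instance (the field number and file names are placeholders), to sort numerically on the third comma-separated field:
sort -t ',' -k 3,3 -n input.csv -o sorted.csv
Here -k 3,3 restricts the sort key to field 3 only; drop -n for a plain text sort instead of a numeric one.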

Stored procedure parameter data type that allows alphanumeric values

I have a stored procedure for SQL Server 2000 that has an input parameter with a data type of varchar(17) to handle a vehicle identification number (VIN), which is alphanumeric. However, whenever I execute it with a parameter value that has a numerical digit in it, it gives me an error. It appears to only accept alphabetic characters. What am I doing wrong here?
Based on comments, there is a subtle "feature" of SQL Server that allows the letters a-z to be used as stored proc parameter values without delimiters. It's been there forever (since 6.5 at least).
I'm not sure of the full rules, but it's demonstrated in MSDN (the rename SQL Server example, etc.): there are no delimiters around the "local" parameter. And I just found this KB article on it.
In this case, it could be the value starting with a number that breaks it. I assume it works for a contained number (but, as I said, I'm not sure of the full rules).
Edit: confirmed by Martin as "breaks with leading number", OK for "containing number".
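A minimal sketch of the behaviour (the procedure name and VIN values here are made up, and the exact error text varies by version):
CREATE PROCEDURE dbo.GetVehicle @vin varchar(17) AS
    SELECT @vin AS vin;
GO

EXEC dbo.GetVehicle ABC123XYZ456789AB    -- works: unquoted value starting with a letter
EXEC dbo.GetVehicle 1HGCM82633A004352    -- fails: unquoted value starting with a digit
EXEC dbo.GetVehicle '1HGCM82633A004352'  -- works: quoted as a normal string literal
The workaround is simply to always quote the value (or pass it as a typed parameter from the application layer).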
This doesn't help much, but somewhere you have a bug, typo, or oversight in your code. I spent 2+ years working with VINs as parameters, and other than regretting not having made it char(17) instead of varchar(17), we never had any problems passing in alphanumeric VIN values. Somewhere, and I'd guess it's in the application layer, something is not liking digits -- perhaps a filter looking only for alphabetical characters?
