Why is there no NVarCiChar field type in ADS? - advantage-database-server

I was wondering why there is no case-insensitive Unicode character field type in ADS?
While you can set the collation of indexes on NVarChar fields to be case-insensitive, a simple query using WHERE field = 'HeLlO WoRlD' doesn't find the value 'Hello World'.
I know that WHERE field = 'HeLlO WoRlD' COLLATE ads_default_ci works, but doing that for every single comparison is not an option.
The CiChar field type is not Unicode capable (unless you store UTF-8 strings in there which causes other problems).

Fundamentally, unlike a regular character field, a Unicode field can store characters from all languages, so there is no specific collation/language associated with it. The collation comes from how the field is used, indexed, or sorted. If an NVarCiChar field type were to be defined, a language/locale (English has different case rules from French or German) would need to be associated with that field type, and that would introduce unnecessary complexity into the system (what to do when an English case-insensitive field is compared with a German one?).
Although the ciChar type is easier to use in some respects, it has drawbacks as well. The main one is that it is not standard, so it is not portable to other databases, and it requires some special handling in code. It is also less flexible. It causes problems when comparing a ciChar field with a regular char field -- the COLLATE clause is required for such a comparison. Since the relatively standard COLLATE clause supports case-insensitive comparison in a clearer manner while being more flexible, we decided that a case-insensitive Unicode field type was not necessary. It is also easy to do case-insensitive comparisons of Unicode strings by specifying a case-insensitive Unicode collation for the SQL statement handle, which avoids repeating the COLLATE clause everywhere.
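For a one-off comparison, the COLLATE clause from the question is the supported route; a minimal sketch (the table and field names here are made up for illustration):

SELECT * FROM customer
WHERE name = 'HeLlO WoRlD' COLLATE ads_default_ci;

With the ads_default_ci collation applied, the predicate also matches 'Hello World'.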

Related

Is there ever reason to use string data type when using rails 6 with postgres database?

I see here that the :text data type seems to perform exactly the same as :string
This comment particularly:
PostgreSQL implementation prefers text. The only difference for pg string/text is constraint on length for string. No performance differences.
Is there any good reason to ever use :string when making a rails app using a postgresql database?
No difference in performance, difference in semantics.
Several libraries (for example SimpleForm) look at the data type of the field in the database and behave differently depending on it. SimpleForm will render a number input if it's a number, a checkbox if it's a boolean, and so on. In this case, it will render a single-line text field for string and a multiline text box for text.
Since there is no difference in performance, you can use either, but it's still useful to denote semantics.
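As a minimal sketch of what the two types become in PostgreSQL (the table name is hypothetical; a :string column only gets a varchar length limit if you specify one):

CREATE TABLE articles (
    title varchar(255),  -- what a Rails :string column with limit: 255 maps to
    body  text           -- what a Rails :text column maps to
);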

Big Integers and Custom Validation

I'm somewhat new to Rails and I'm trying to learn about custom validations.
One common requirement in Brazil is CPF/CNPJ/RG fields. They are a type of identification number and follow a specific format.
For example:
CPFs are 11-digit numbers. They follow this pattern: xxx.xxx.xxx-xx
I'm trying to store them in an Integer field but I'm getting (Using Postgres):
PG::Error: ERROR: value "xxxxxxxxxxx" is out of range for type integer
What is the proper way to store this? Bigint (How?)? A string?
My second question is:
How can I specify a custom validation (a method) for this field that could be called somewhat like this:
class User < AR::Base
  validates :cpf, presence: true, uniqueness: true, cpf: true
end
Assuming performance is not critical, strings are fine. That way you can keep the dots and dashes. As mentioned by others in this thread, bigint or numeric may be far more performant if that's a concern.
If you keep the field a string, you can easily validate it with regex:
validates_format_of :cpf, with: /\A[0-9]{3}\.[0-9]{3}\.[0-9]{3}-[0-9]{2}\z/
For small tables, just store as text to preserve the format.
For big tables, performance and storage size may be an issue. If your pattern is guaranteed, you may very well store the number as bigint and format it on retrieval with to_char():
Write:
SELECT translate('111.222.333-55', '.-', '')::bigint
This also serves as a partial validation: only digits, . and - are allowed in the string. The pattern might still be violated; you have to check that explicitly, with something like what @Michael provided.
Read:
SELECT to_char(11122233355, 'FM000"."000"."000"-"00')
Returns:
111.222.333-55
Don't forget the leading FM in the pattern to remove the leading whitespace (where a negative sign might go for numbers).
A bigint occupies 8 bytes on disk and can easily store 11-digit numbers.
text (or varchar) needs 1 byte plus the actual string, which amounts to 15 bytes in your case.
Plus, processing bigint is generally a bit faster than processing text of equal length.
Personally I would always store these values as bigint and apply formatting on input/output (as Erwin suggests) or in the application.
The main reasons are storage efficiency (as Erwin mentions) and efficiency of comparison. When you compare 11111111112 to 11111111113 as text, PostgreSQL uses language-specific collation rules that are correct for text but may not be what you want for numbers. They're also slow; a recent question on SO reported a five-fold speed-up in text comparisons by using the COLLATE "C" option to force plain POSIX collation, and numeric comparisons are faster still.
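A rough sketch of the two approaches (the five-fold figure above comes from the linked discussion, not from these lines):

-- text comparison goes through collation rules; COLLATE "C" forces a plain byte-wise compare
SELECT '11111111112' < '11111111113' COLLATE "C";
-- bigint comparison avoids collation entirely
SELECT 11111111112 < 11111111113;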
Most of these standard numbers have their own internal check-sums, often a variant of the Luhn algorithm. Validating these in a CHECK constraint is likely to be a good idea. You can implement a Luhn algorithm check against integers pretty easily in PL/PgSQL or plain SQL; I wrote some samples on the PostgreSQL wiki.
Whatever you do, make sure you have a CHECK constraint on the column that validates the number on storage, so you don't get invalid and nonsensical values stored.
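As a minimal sketch of such a constraint (assuming the bigint storage discussed above and a hypothetical users table with a cpf column), a range check at least rejects values that cannot be 11-digit numbers; a full CPF check-digit validation would live in a PL/pgSQL or plain SQL function called from the constraint:

ALTER TABLE users
  ADD CONSTRAINT cpf_in_range
  CHECK (cpf BETWEEN 0 AND 99999999999);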

StrLComp vs AnsiStrLComp when called with Unicode strings

I'm having a bit of confusion regarding the "Ansi" vs "regular" RTL string functions when called with Unicode strings. I understand that under older versions of Delphi (when AnsiString was the default) the "Ansi" versions handled multibyte characters. Does this mean anything when dealing with Unicode strings? Assuming that I need to handle Korean characters and also that my code does not have to be compatible with older Delphi versions, which RTL functions should be used?
The 'Ansi' prefix of the string compare functions really never signified anything other than that the locale was taken into account when comparing strings instead of doing "just" a simple binary comparison. In the Unicode world this is still the case. The Ansi* family of functions also take (Unicode) strings as their parameters and take the locale into account when doing the comparison.
From the AnsiCompareStr doc (D2009):
Most locales consider lowercase characters to be less than the
corresponding uppercase characters. This is in contrast to ASCII
order, in which lowercase characters are greater than uppercase
characters. Thus, setting S1 to 'a' and S2 to 'A' causes
AnsiCompareStr to return a value less than zero, while CompareStr,
with the same arguments, returns a value greater than zero.
What the effect of "taking the locale into account" is differs per locale. It may or may not have to do with accented characters. In Unicode versions it may actually take into account how the characters are composed: for example, an accented e (é) may be encoded as a single code point, but may also be encoded as two separate elements, the e and a combining accent.
Both the Ansi* and the "normal" string compare functions are included in the SysUtils unit. They all take strings as their parameters and in Unicode Delphi that does indeed mean UnicodeStrings.
If you need to work with AnsiStrings then you need to use the AnsiStrings unit. It has the same set of string compare functions, but in this unit they all take AnsiStrings as their parameters.
Now, if you don't need compatibility with older versions: use the standard functions from SysUtils. Use the normal ones if byte comparison is enough. Use the Ansi ones if you need to take locale considerations into account.
Not sure what exactly you want to do, but...
if you want to compare two strings by your current user locale rules, use AnsiStrLComp for case-sensitive comparison or AnsiStrLIComp for case-insensitive comparison. Internally these functions use the CompareString function with the LOCALE_USER_DEFAULT locale set
if you want to compare two strings using the Delphi internal comparison mechanism, use the StrLComp function for case-sensitive comparison or StrLIComp for case-insensitive comparison
So if you compare the same two strings with AnsiStrLComp or AnsiStrLIComp on machines with different user locale settings, you may get different results, but on the other hand you get natural sorting for the user's language settings in your application.
The StrLComp and StrLIComp will work on all machines the same way, locale independently.
The simple answer is that when it comes to Delphi string routines you should use the ANSI...() functions for Unicode strings.
However, if you are comparing strings (among other things) then you may also need to consider normalising those strings first, depending on the nature and needs (and the source of the strings) in your application, to deal with Unicode Equivalence.

Why is my query returning the wrong string type?

According to the official Firebird documentation, columns containing Unicode strings (what SQL Server calls NVARCHAR) should be declared as VARCHAR(x) CHARACTER SET UNICODE_FSS. So I did that, but when I query the table with DBExpress, the result I get back is a TStringField, which is AnsiString only, not the TWideStringField I was expecting.
How do I get DBX to give me a Unicode string result from a Unicode string column?
With Firebird, your only option is to set the whole database connection to a Unicode char set, for example to utf8.
That way, all the VarChar columns will result in fields of type TWideStringField. The fields will always be TWideStringFields regardless of the particular char set declared when creating the column.
(Screenshots from an example project I created while teaching Delphi a few months ago show the connection's character set property being set and the resulting TWideStringField persistent fields.) You have to set this property before creating any persistent fields, if that's your case.
It looks like the driver does not support the UNICODE_FSS charset; my first action was to create a new project, set the property, and then create some fields. IMHO it's better to declare the whole database as utf8 (or another charset supported by the driver) in the CREATE DATABASE statement, and then match the database charset in Delphi to avoid string conversions.
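For illustration, this is roughly what declaring the default character set at database-creation time looks like in Firebird SQL (the file path and credentials are placeholders):

CREATE DATABASE 'C:\data\mydb.fdb'
  USER 'SYSDBA' PASSWORD 'masterkey'
  DEFAULT CHARACTER SET UTF8;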

What is binary character set?

I'm wondering what the binary character set is and how it differs from, let's say, the ISO/IEC 8859-1 (Latin-1) character set?
There's a page in the MySQL documentation about The _bin and binary Collations.
Nonbinary strings (as stored in the CHAR, VARCHAR, and TEXT data types) have a character set and collation. A given character set can have several collations, each of which defines a particular sorting and comparison order for the characters in the set. One of these is the binary collation for the character set, indicated by a _bin suffix in the collation name. For example, latin1 and utf8 have binary collations named latin1_bin and utf8_bin.
Binary strings (as stored in the BINARY, VARBINARY, and BLOB data types) have no character set or collation in the sense that nonbinary strings do. (Applied to a binary string, the CHARSET() and COLLATION() functions both return a value of binary.) Binary strings are sequences of bytes and the numeric values of those bytes determine sort order.
And so on. Maybe that makes more sense? If not, I'd recommend looking further in the documentation for descriptions of these things. If it's a concept, it should be explained. It usually is :)
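A quick way to see the difference in practice (assuming a connection whose default collation is case-insensitive, which is the usual MySQL default):

SELECT 'a' = 'A';                                    -- 1: nonbinary strings compare by collation rules
SELECT BINARY 'a' = 'A';                             -- 0: binary strings compare byte by byte
SELECT CHARSET(BINARY 'a'), COLLATION(BINARY 'a');   -- both report 'binary'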