What is a binary character set? - character-encoding

I'm wondering what the binary character set is and how it differs from, say, the ISO/IEC 8859-1 (aka Latin-1) character set?

There's a page in the MySQL documentation about The _bin and binary Collations.
Nonbinary strings (as stored in the CHAR, VARCHAR, and TEXT data types) have a character set and collation. A given character set can have several collations, each of which defines a particular sorting and comparison order for the characters in the set. One of these is the binary collation for the character set, indicated by a _bin suffix in the collation name. For example, latin1 and utf8 have binary collations named latin1_bin and utf8_bin.
Binary strings (as stored in the BINARY, VARBINARY, and BLOB data types) have no character set or collation in the sense that nonbinary strings do. (Applied to a binary string, the CHARSET() and COLLATION() functions both return a value of binary.) Binary strings are sequences of bytes and the numeric values of those bytes determine sort order.
And so on. Does that make more sense? If not, I'd recommend looking further in the documentation for descriptions of these concepts. If it's a concept, it's usually explained somewhere. :)
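If it helps, here is a small Python sketch (an illustration of the idea, not MySQL's actual implementation) of the difference: a binary collation compares nothing but the raw byte values, while a character-set collation such as latin1_general_ci knows which bytes are letters and can, for example, ignore case:

    # Binary "collation": compare raw byte values and nothing else.
    def compare_binary(a: bytes, b: bytes) -> bool:
        return a == b

    # A case-insensitive character collation knows the character set,
    # so it can treat 'A' and 'a' as the same letter.
    def compare_case_insensitive(a: str, b: str) -> bool:
        return a.casefold() == b.casefold()

    print(compare_binary(b"ABC", b"abc"))          # False: 0x41 != 0x61
    print(compare_case_insensitive("ABC", "abc"))  # True: same letters, different case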

Related

How do I keep my rails integer from being converted to binary?

As you may be able to see in the image, I have a User model and #user.zip is stored as an integer for validation purposes (i.e., so only digits are stored, etc.). I was troubleshooting an error when I discovered that my sample zip code (00100) was automatically being converted to binary, ending up as the number 64.
Any ideas on how to keep this from happening? I am new to Rails, and it took me a few hours to figure out the cause of this error, as you might imagine :)
I can't imagine any other information would be helpful here, but please inform me if otherwise.
This is not binary, this is octal.
In Ruby, any number literal starting with 0 will be treated as an octal number. You should check the Ruby documentation on number literals to learn more about this; here's a quote:
You can use a special prefix to write numbers in decimal, hexadecimal, octal or binary formats. For decimal numbers use a prefix of 0d, for hexadecimal numbers use a prefix of 0x, for octal numbers use a prefix of 0 or 0o, for binary numbers use a prefix of 0b. The alphabetic component of the number is not case-sensitive.
In your case, you should not store zip codes as numbers. Don't treat them as numeric values even in variables, let alone in the database. Instead, store and treat them as strings.
The zip should probably be stored as a string since you can't have a valid integer with leading zeroes.
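To see the arithmetic behind the 64, here is a quick sketch (in Python rather than Ruby, purely as an illustration): read as base 8, "00100" is 1 * 8^2 = 64, whereas keeping the zip code as a string preserves the leading zeros and still lets you validate that it contains only digits:

    print(int("00100", 8))   # 64 -- "00100" read as base 8 is 1*8**2 = 64
    print(0o100)             # 64 -- the same value Ruby produces for the literal 00100

    zip_code = "00100"       # stored as a string, the leading zeros survive
    print(zip_code, len(zip_code))   # 00100 5
    print(zip_code.isdigit())        # True -- digits-only validation still works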

Are code pages and code charts the same thing?

Based on what I have gathered so far from reading information available online:
a character set is a bunch of characters that we want to use (like an interface)
a character encoding is a method of encoding some character set (like an implementation)
What is the relationship between code charts and code pages and how do they fit into the overall context? I am not sure if these two terms are synonyms or if they are referring to distinct concepts.
Do code charts/code pages define character sets through large tables and also provide a method of encoding, making them a part of character encoding? Or, do they only define character sets and leave encoding implementation to another aspect? Additionally, is a locale simply a type of code chart/code page or is it a separate concept altogether?
In the majority of cases, character sets and character encodings are one and the same. For example, ISO-8859-1 defines the character set for Western Europe AND the encoding using an 8-bit scheme.
See the specification for ISO-8859-1: ftp://std.dkuug.dk/JTC1/sc2/wg3/docs/n411.pdf, which includes the encoding implementation.
Unicode, on the other hand, separates the encoding from the character definitions, albeit within a bunch of related documents. In Unicode, just about all current and a good deal of historic characters, symbols and modifiers are mapped to a numeric "code point" (an integer in the range 0 to 0x10FFFF, commonly held in a 32-bit value). The encodings UTF-32, UTF-16 and UTF-8 are then documented separately, defining how a Unicode code point is encoded as bytes.
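A short Python sketch of that separation (illustrative only): the character set assigns the code point, and each encoding then turns that same code point into different bytes:

    ch = "é"                        # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    print(hex(ord(ch)))             # 0xe9 -- the code point (character set level)
    print(ch.encode("iso-8859-1"))  # b'\xe9'     -- one byte in Latin-1
    print(ch.encode("utf-8"))       # b'\xc3\xa9' -- two bytes in UTF-8
    # A legacy code page bundles both steps: the table that puts é at byte 0xE9
    # is simultaneously the character set and the encoding.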

Which NLS_LENGTH_SEMANTICS for WE8MSWIN1252 Character Set

We have a database where the character set is set to WE8MSWIN1252, which I understand is a single-byte character set.
We created a schema and its tables by running a script with the following:
ALTER SYSTEM SET NLS_LENGTH_SEMANTICS=CHAR
Could we possibly lose data since we are using VARCHAR2 columns with character semantics while the underlying character set is single byte?
If you are using a single-byte character set like Windows-1252, it is irrelevant whether you are using character or byte semantics. Each character occupies exactly one byte so it doesn't matter whether a column is declared VARCHAR2(10 CHAR) or VARCHAR2(10 BYTE). In either case, up to 10 bytes of storage for up to 10 characters will be allocated.
Since you gain no benefit from changing the NLS_LENGTH_SEMANTICS setting, you ought to keep the setting at the default (BYTE) since that is less likely to cause issues with other scripts that you might need to run (such as those from Oracle).
Excellent question. Multi-byte characters will take up the number of bytes required, which could use more storage than you expect. If you store a 4-byte character in a varchar2(4) column, you have used all 4 bytes. If you store a 4-byte character in a varchar2(4 char) column, you have only used 1 character.
Many foreign languages and special characters use 2-byte character sets, so it's best to "know your data" and define your database columns accordingly.
Oracle does NOT recommend changing NLS_LENGTH_SEMANTICS to CHAR, because it will affect every new column defined as CHAR or VARCHAR2, possibly including your catalog tables when you do an in-place upgrade. You can see why this is probably not a good idea. Other Oracle toolsets and interfaces may present issues as well.
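As a rough illustration of why this only matters for multi-byte character sets (a Python sketch, not Oracle itself):

    s = "müller"                           # 6 characters
    print(len(s))                          # 6 characters
    print(len(s.encode("windows-1252")))   # 6 bytes -- single-byte charset: 1 char == 1 byte
    print(len(s.encode("utf-8")))          # 7 bytes -- 'ü' needs 2 bytes in UTF-8
    # With WE8MSWIN1252, VARCHAR2(6 BYTE) and VARCHAR2(6 CHAR) hold the same data;
    # with a multi-byte charset like AL32UTF8, VARCHAR2(6 BYTE) may be too small
    # for 6 characters.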

Why can Lua strings contain characters with any numeric value?

I read something about strings here:
http://www.lua.org/pil/2.4.html
Lua is eight-bit clean and so strings may contain characters with any numeric value, including embedded zeros.
What does "eight-bit clean" mean?
Why can it contain characters with any numeric value? (unlike basic C strings)
There are two common ways to store strings:
Characters and Terminator
Length and Characters
When you use #1, you need to "sacrifice" one character to serve as the terminator; when you use #2, you do not have such a limitation.
C uses the first method of storing strings. It uses character zero to serve as the terminator; the other 255 characters can be used to represent characters of the string.
Lua uses the second method of storing strings. All 256 possible character values, including zero, can be used in Lua strings. For example, you can construct a three-character string from the characters 'A', 0, 'B', and Lua will treat it as a three-character string. You can construct the same string in C, but its string-processing libraries will treat it as a single-character string: strlen will return 1, puts will write the character A and stop, and so on.
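Here is a small Python sketch of the two schemes (illustrative, not Lua's or C's actual code): a counted string keeps an embedded zero as ordinary data, while a terminator-based scan stops at it:

    data = b"A\x00B"          # three bytes: 'A', 0, 'B'
    print(len(data))          # 3 -- the length is stored, so the zero byte is just data

    def c_strlen(buf: bytes) -> int:
        # what C's strlen does: count bytes up to the first zero
        n = 0
        while n < len(buf) and buf[n] != 0:
            n += 1
        return n

    print(c_strlen(data))     # 1 -- the embedded zero is mistaken for the terminator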
The Lua string type is a counted sequence of bytes. A byte can hold any value between 0 and 255.
The string type is used for character strings, but you are right that few character-set encodings allow every byte value or every sequence of byte values. Code page 437 is one that does: it maps 256 characters to 256 values, one byte per character. Windows-1252 does not: it maps 251 characters to 251 values, one byte per character. UTF-8 maps 1,112,064 characters to sequences of one to four bytes, where some byte values are not used and some sequences of values are not used.
The Lua string library does have functions that treat bytes as characters. Their behavior is influenced by the implementation's libraries, which typically use the C runtime along with its locale features.
There are specialized libraries for Lua to explicitly handle various character set encodings.
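A quick Python illustration of that coverage point (the byte values are just examples): code page 437 accepts every byte, Windows-1252 leaves a few byte values unassigned, and UTF-8 rejects byte sequences that don't follow its rules:

    raw = b"\x81"                       # 0x81 is one of the unassigned bytes in Windows-1252
    print(raw.decode("cp437"))          # fine: code page 437 maps all 256 byte values
    try:
        raw.decode("cp1252")
    except UnicodeDecodeError as e:
        print("cp1252:", e)             # fails: no character at 0x81 in Windows-1252
    try:
        b"\xc3\x28".decode("utf-8")
    except UnicodeDecodeError as e:
        print("utf-8:", e)              # fails: not a valid UTF-8 byte sequence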

How long can numeric EDIFACT data elements be?

In EDIFACT there are numeric data elements, specified e.g. as format n..5. We want to store those fields in a database table (in alphanumeric columns, so we can check them). How long must the database fields be so that we can be sure to store every possible valid value? I know it's at least two additional characters (for the decimal point, or comma or whatever, and possibly a leading minus sign).
We are building our tables after the UN/EDIFACT standard we use in our message, not the specific guide involved, so we want to be able to store everything matching that standard. But the documentation on the numeric data elements isn't really straightforward (or at least I could not find that part).
Thanks for any help
I finally found the information on the UNECE web site, in the documentation on UN/EDIFACT rules, Part 4: UN/EDIFACT rules, Chapter 2.2: Syntax Rules. They don't say it directly, but when you put all the parts together, you get it. See TOC entry 10: REPRESENTATION OF NUMERIC DATA ELEMENT VALUES.
Here's what it basically says:
10.1: Decimal Mark
The decimal mark must be transmitted (if needed) as specified in UNA (comma or point, but always a single character). It shall not be counted as a character of the value when computing the maximum field length of a data element.
10.2: Triad Separator
Triad separators shall not be used in interchange.
10.3: Sign
[...] If a value is to be indicated to be negative, it shall in transmission be immediately preceded by a minus sign e.g. -112. The minus sign shall not be counted as a character of the value when computing the maximum field length of a data element. However, allowance has to be made for the character in transmission and reception.
To put it together:
Other than the digits themselves there are only two (optional) characters allowed in a numeric field: the decimal separator and a minus sign (no blanks are permitted between any of the characters). These two extra characters are not counted against the maximum length of the value in the field.
So the maximum number of characters in a numeric field is the maximum length of the numeric field plus 2. If you want your database to be able to store every syntactically correct value transmitted in a field specified as n..17, your column would have to be 19 characters long (something like varchar(19)). Any EDIFACT message that has a value longer than 19 characters in a field specified as n..17 does not need to be stored in the DB for semantic checking, because it is already syntactically wrong and can be rejected.
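For instance, a small helper along these lines (a sketch; the format-string parsing is deliberately simplified) turns an EDIFACT numeric format into the column length described above:

    def column_length(edifact_format: str) -> int:
        # "n..17" -> up to 17 digits, plus the decimal separator and a minus sign,
        # neither of which counts against the element's maximum length
        max_digits = int(edifact_format.split("..")[-1].lstrip("n"))
        return max_digits + 2

    print(column_length("n..17"))   # 19 -> e.g. varchar(19)
    print(column_length("n..5"))    # 7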
I used EDI Notepad from Liaison to solve a similar challenge. https://liaison.com/products/integrate/edi/edi-notepad
I recommend anyone looking at EDI to at least get their free (express) version of EDI Notepad.
The "high end" version (EDI Notepad Productivity Suite) of their product comes with a "Dictionary Viewer" tool that you can export the min / max lengths of the elements, as well as type. You can export the document to HTML from the Viewer tool. It would also handle ANSI X12 too.
