'Unconvertable character' exception in HTTPRequest node when sending a HTTPRequestHeader containing a special character - messagebroker

I'm getting an 'Unconvertable character' exception from the HTTPRequest node when one of the HTTPRequestHeader values contains a special character.
Error in the ExceptionList when debugging:
RecoverableException
File:CHARACTER:/jenkins/slot0/product-build/WMB/src/CommonServices/ImbConverter.cpp
Line:INTEGER:733
Function:CHARACTER:ImbConverterCPP::
Type:CHARACTER:
Name:CHARACTER:
Label:CHARACTER:
Catalog:CHARACTER:BIPmsgs
Severity:INTEGER:3
Number:INTEGER:2136
Text:CHARACTER:Unconvertable character
Insert
Type:INTEGER:5
Text:CHARACTER:1920
Insert
Type:INTEGER:5
Text:CHARACTER:4c0061006c0069006100192073002000420075007200690061006c002000460075006e006400200061006e006400200043006100720065002000450078007000 ...data truncated to first 64 chars
Insert
Type:INTEGER:2
Text:CHARACTER:819
Snippet of my ESQL code:
SET OutputRoot.HTTPRequestHeader."Content-Type" = 'application/octet-stream';
SET OutputRoot.HTTPRequestHeader."X-IntegrationServer-Resource-Name" = rInputDocData.*:DocumentName;
--Setting the content
SET OutputRoot.BLOB.BLOB = BASE64DECODE(rInputDocAttachment.*:AttachmentData64Binary);
The special character is coming in the field rInputDocData.*:DocumentName. Some of the values are:
Printing….PDF
Mark’s Agreement .pdf
Notice the … and ’ in the above two values which are not recognized as part of UTF-8.
Is there a way to convert these values to ISO 8859-1 in ESQL as doing the same conversion in Notepad++ leads to the values getting accepted?
I have tried the below steps but none have worked and I'm still getting the same error:
Setting OutputRoot.Properties.CodedCharSetId to 1208 and OutputRoot.Properties.Encoding to 546, matching the properties of the request as it was received at the input node.
Setting OutputRoot.Properties.CodedCharSetId to 819.
Setting Content-Type HTTPRequestHeader to 'application/octet-stream; charset=utf-8'.
Setting Content-Type HTTPRequestHeader to 'application/octet-stream; charset=iso-8859-1'.
Casting the HTTPRequestHeader 'X-IntegrationServer-Resource-Name' in the below ways:
SET OutputRoot.HTTPRequestHeader."X-IntegrationServer-Resource-Name" = CAST(rInputDocData.*:DocumentName AS CHARACTER CCSID 1208 ENCODING 546);
SET OutputRoot.HTTPRequestHeader."X-IntegrationServer-Resource-Name" = CAST(rInputDocData.*:DocumentName AS CHARACTER CCSID 819);
SET OutputRoot.HTTPRequestHeader."X-IntegrationServer-Resource-Name" = CAST(CAST(rInputDocData.*:DocumentName AS CHARACTER CCSID 1208 ENCODING 546) AS CHARACTER CCSID 819);
The source vendor has refused to handle/convert the special characters, so it is up to us to handle this in ACE. Any help would be much appreciated.

HTTP headers should only use characters in the US-ASCII range. Some other characters may be tolerated, but web server implementations are not consistent in this area.
See the discussions here for more details: what characters are allowed in HTTP header values?
One of the participants in that conversation is an author of the HTTP/1.1 specification, so I think we can rely on his input.
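One common workaround, assuming the receiving server will percent-decode header values (which is not guaranteed and should be verified with the vendor), is to percent-encode the document name so that only US-ASCII bytes ever appear in the header. A Python sketch of the idea, illustrative only; the equivalent would need to be written in ESQL or Java inside the flow:

```python
from urllib.parse import quote

# Percent-encode the document name so the header value stays in US-ASCII.
# Whether the receiving server decodes %hh sequences back into the original
# filename is an assumption that must be verified with the vendor.
name = "Mark’s Agreement .pdf"
safe = quote(name)
print(safe)  # Mark%E2%80%99s%20Agreement%20.pdf
```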
re:
Notice the … and ’ in the above two values which are not recognized as part of UTF-8.
UTF-8 can represent any Unicode character, so that statement cannot be true. If you look carefully at the error, the quoted CCSID is 819, which is the IBM CCSID for ISO-8859-1.

It seems that the … character (U+2026, horizontal ellipsis) is just not in the CCSID 819 charset. It is in the Windows version of Latin-1, CCSID 1252, but as the previous answer said, that may not work with the target web server.
You can try these, but not sure any will work:
SET OutputRoot.HTTPRequestHeader."X-IntegrationServer-Resource-Name" = CAST(ASBITSTREAM(rInputDocData.*:DocumentName CCSID InputRoot.Properties.CodedCharSetId) AS CHARACTER CCSID 819);
SET OutputRoot.HTTPRequestHeader."X-IntegrationServer-Resource-Name" = CAST(ASBITSTREAM(rInputDocData.*:DocumentName CCSID InputRoot.Properties.CodedCharSetId) AS CHARACTER CCSID 1252);
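You can confirm the difference between the two charsets outside ACE; a quick Python sketch (illustrative only, not ESQL) shows that U+2026 has no mapping in ISO-8859-1 / CCSID 819 but does map in CP1252:

```python
s = "Printing….PDF"  # contains U+2026 HORIZONTAL ELLIPSIS

try:
    s.encode("iso-8859-1")            # CCSID 819: no mapping for U+2026
except UnicodeEncodeError as e:
    print("819 cannot encode:", repr(e.object[e.start]))

print(s.encode("cp1252"))             # CCSID 1252 maps it to 0x85
```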

The below worked for me, but as the top answer predicted, the request then failed with an encoding error at the target server:
SET OutputRoot.HTTPRequestHeader."X-IntegrationServer-Resource-Name" = CAST(CAST(rInputDocData.*:DocumentName AS BLOB CCSID InputRoot.Properties.CodedCharSetId) AS CHARACTER CCSID 819);

Related

Sybase ASE: show the character encoding used by the database

I am working on a Sybase ASE database and would like to know the character encoding (UTF-8 or ASCII or whatever) used by the database.
What's the command to show which character encoding the database uses?
The command you're looking for is actually a system stored procedure:
1> sp_helpsort
2> go
... snip ...
Sort Order Description
------------------------------------------------------------------
Character Set = 190, utf8
Unicode 3.1 UTF-8 Character Set
Class 2 Character Set
Sort Order = 50, bin_utf8
Binary sort order for the ISO 10646-1, UTF-8 multibyte encoding character set (utf8).
... snip ...
From this output we see this particular ASE dataserver has been configured with a default character set of utf8 and default sort order of binary (bin_utf8). This means all data is stored as utf8 and all indexing/sort operations are performed using a binary sort order.
Keep in mind that ASE can perform character set conversions (for reads and writes) based on the client's character set configuration, though the success of those conversions depends on the character sets in question (e.g., a client connecting with utf8 may find that many characters cannot be converted for storage in a dataserver configured with a default character set of iso_1).
With a query:
select
    cs.name as server_character_set,
    cs.description as character_set_description
from
    master..syscharsets cs
    left outer join master..sysconfigures cfg
        on cs.id = cfg.value
where
    cfg.config = 131
Example output:
server_character_set character_set_description
utf8 Unicode 3.1 UTF-8 Character Set

Postgres invalid byte sequence for encoding "UTF8": 0xc3 0x2f

I work with a payment API and it returns some XML. For logging I want to save the API response in my database.
One word in the API response is "manhã", but the API returns "manh�". Other chars like á or ç are returned correctly, so I guess this is a bug in the API.
But when trying to save this in my DB I get:
Postgres invalid byte sequence for encoding "UTF8": 0xc3 0x2f
How can I solve this?
I tried things like
response.encode("UTF-8") and also force_encode but all I get is:
Encoding::UndefinedConversionError ("\xC3" from ASCII-8BIT to UTF-8)
I need to either remove this wrong character or convert it somehow.
You're on the right track - you should be able to solve the problem with the encode method. When the source encoding is known, you can simply use:
response.encode('UTF-8', 'ISO-8859-1')
There may be times where there are invalid characters in the source encoding, and to get around exceptions, you can instruct ruby how to handle them:
# This will transcode the string to UTF-8 and replace any invalid/undefined characters with '' (empty string)
response.encode('UTF-8', 'ISO-8859-1', invalid: :replace, undef: :replace, replace: '')
This is all laid out in the Ruby docs for String - check them out!
Note, many people incorrectly assume that force_encoding will somehow fix encoding problems. force_encoding simply tags the string as the specified encoding - it does not transcode and replace/remove the invalid characters. When you're converting between encodings, you must transcode so that characters in one character set are correctly represented in the other character set.
As pointed out in the comment section, you can also transcode with force_encoding followed by encode: response.force_encoding('ISO-8859-1').encode('UTF-8') (which is equivalent to the first encode example above).
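The failure is easy to reproduce outside the API; a small Python sketch (illustrative, not the Ruby code above) shows why the byte pair 0xc3 0x2f is invalid UTF-8 and what lossy replacement does:

```python
raw = b"manh\xc3/"  # 0xC3 opens a 2-byte UTF-8 sequence, but 0x2F ('/')
                    # is not a valid continuation byte

try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("invalid byte at offset", e.start)  # offset 4

print(raw.decode("utf-8", errors="replace"))  # manh�/
```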

What scheme is used to encode unicode characters in a .url shortcut?

What scheme is used to encode unicode characters in a windows url shortcut?
For example, a new shortcut for url "http://Ψαℕ℧▶" produces a .url file with the text:
[{000214A0-0000-0000-C000-000000000046}]
Prop3=19,2
[InternetShortcut]
IDList=
URL=http://?aN??/
[InternetShortcut.A]
URL=http://?aN??/
[InternetShortcut.W]
URL=http://+A6gDsSEVIScltg-/
What is the algorithm to decode "+A6gDsSEVIScltg-" to "Ψαℕ℧▶"?
I am not asking for API code, but I would like to know the encoding scheme details.
Note: The encoding scheme is not utf-8 nor utf-16 nor ucs-2 and no %encoding.
+A6gDsSEVIScltg- is the UTF-7 encoded form of Ψαℕ℧▶.
The correct way to process a .url file is to use the IUniformResourceLocator and IPropertyStorage interfaces from the CLSID_InternetShortcut COM object. See Internet Shortcuts on MSDN for details.
The answer (UTF-7) allowed me to successfully develop the URL conversion routine.
Let me summarize the steps to obtain the Unicode URL from an InternetShortcut.W entry found in a .url file:
- Pass ASCII chars through until CRLF, after making them internet-safe.
- A non-escaped '+' character starts a UTF-7 formatted Unicode sequence:
- Collect 6-bit groups from the base64-coded ASCII.
- For each 16 bits collected, convert them to UTF-8 (1, 2, or 3 bytes).
- Emit the generated UTF-8 bytes as %hh escapes.
- Continue until a '-' character occurs.
- The bit collector should end at zero.
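Those steps can be checked against a stock UTF-7 codec; a short Python sketch (the .url file handling itself is omitted):

```python
from urllib.parse import quote

# Python's standard codec set includes UTF-7, so the decode collapses to one call.
url = b"http://+A6gDsSEVIScltg-/".decode("utf-7")
print(url)  # http://Ψαℕ℧▶/

# The %hh step from the summary corresponds to percent-encoding the UTF-8 bytes:
print(quote(url, safe=":/"))
```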

Mime encoded headers with extra '=' (==?utf-8?b?base64string?=)

This might be a silly question but... here it goes!
I wrote my own MIME parser in native C++. It's a nightmare with the encodings! It was stable for the last 3 months or so but recently I noticed this Subject: header.
Subject: =?UTF-8?B?T2ZpY2luYSBkZSBJbmZvcm1hY2nDs24sIEluaWNpYXRpdmFzIHkgUmVjbGFt?===?UTF-8?B?YWNpb25lcw==?=
which should decode to this:
Subject: Oficina de Información, Iniciativas y Reclamaciones
The problem is that there is one extra = (equal) in there joining the two encoded words, which I can't figure out (and why two words?). In theory the format should be =?charset?encoding?encoded_string?=, but I found another subject that starts with two =.
==?UTF-8?B?blahblahlblah?=
How should I handle the extra =?
I could replace ==? with =? (which I am) before doing anything (and it works)... but I'm wondering if there's any kind of spec regarding this so I don't hack my way into proper functionality.
PS: How much I hate these relic protocols! All text communications should be UTF-8 and XML :)
In MIME headers encoded words are used (RFC 2047 Section 2.).
... (why 2?)
To overcome the 75-character limit on a single encoded word, which exists because of the 78-character line length limit (or to mix two different encodings, for example Chinese and Polish).
RFC 2047:
An 'encoded-word' may not be more than 75 characters long,
including 'charset', 'encoding', 'encoded-text', and delimiters.
If it is desirable to encode more text than will fit in an
'encoded-word' of 75 characters, multiple 'encoded-word's
(separated by CRLF SPACE) may be used.
Here's the example from RFC2047 (note there is no '=' in between):
Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
=?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
Your subject should be decoded as:
"Oficina de Información, Iniciativas y Reclam=aciones"
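Once the stray '=' is normalized away as you describe, a standard RFC 2047 decoder handles the pair of encoded words. A Python sketch using the stdlib email package:

```python
from email.header import decode_header, make_header

# The two encoded words, separated by folding whitespace as RFC 2047 requires;
# the whitespace between adjacent encoded words is discarded on decode.
subject = ("=?UTF-8?B?T2ZpY2luYSBkZSBJbmZvcm1hY2nDs24sIEluaWNpYXRpdmFzIHkgUmVjbGFt?="
           " =?UTF-8?B?YWNpb25lcw==?=")
decoded = str(make_header(decode_header(subject)))
print(decoded)  # Oficina de Información, Iniciativas y Reclamaciones
```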
mraq's answer is incorrect. Soft line breaks apply only to the 'Quoted-Printable' Content-Transfer-Encoding, which can be used in a MIME body.
It is called the "Soft Line Break" and it is the heritage of the SMTP protocol.
Quoting page 20 of RFC2045
(Soft Line Breaks) The Quoted-Printable encoding
REQUIRES that encoded lines be no more than 76
characters long. If longer lines are to be encoded
with the Quoted-Printable encoding, "soft" line breaks
must be used. An equal sign as the last character on an
encoded line indicates such a non-significant ("soft")
line break in the encoded text.
And also Wikipedia on Quoted-printable
A soft line break consists of an "=" at the end of an encoded line,
and does not appear as a line break in the decoded text.
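For contrast, the body-level soft line break described in those quotes can be demonstrated with Python's stdlib quopri module (illustrative; it has nothing to do with header encoded words):

```python
import quopri

# A '=' at the end of an encoded line is a soft break and vanishes on decode.
print(quopri.decodestring(b"Reclam=\r\naciones"))  # b'Reclamaciones'
```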
From what I can see in the MIME RFC double equal signs are not valid input (for encoding), but keep in mind you could interpret the first equal sign as what it is and then use the following stuff for decoding. But seriously, those extra equal signs look like artifacts, maybe from an incorrect encoder.

In what charset is 0xE1 an "a" with an umlaut?

I am trying to identify an extended ascii charset where 0xE1 is an "a" with an umlaut (8859-1 character E4)
and 0xF5 is a "u" with an umlaut (8859-1 character FC).
Has anyone seen this charset before? It quite possibly dates back to the 80's.
To my knowledge, there are no standard ASCII character sets which use those symbols. Here is a list of the standard character sets: http://www.columbia.edu/kermit/csettables.html
However, the character set you referenced is in fact used for interfacing with some LEDs, for example HT1632-compliant LEDs use the same character set: http://blog.thiseldo.co.uk/wp-filez/USB_HT1632_Matrix.pde
I hope this helps.
These are in the commonly used ISO-8859-1 character set, also known as latin-1.
