How can I get/set TTntRichEdit RTF content in Unicode (UTF-8/UTF-16) format?
I use the TRichEdit.LoadFromStream/SaveToStream methods with TStringStreams to get and set the RTF content, but they emit only locale-dependent ANSI code page escapes for characters outside standard ASCII (e.g. \'f5).
This will get me into trouble if the user carries his or her project to another computer with a different locale: the national characters will be lost.
The SF_UNICODE flag of the EM_STREAMIN/EM_STREAMOUT messages can only be combined with SF_TEXT, not with SF_RTF.
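For reference, a minimal sketch of the stream pattern described above (C++Builder/VCL style; the function names and details are illustrative, not the exact code in question):

#include <Classes.hpp>
#include <ComCtrls.hpp>

// Round-trip the rich edit's content as RTF through a TStringStream.
// Non-ASCII characters come back as locale-dependent escapes such as \'f5.
AnsiString GetRtf(TRichEdit *edit)
{
    AnsiString rtf;
    TStringStream *stream = new TStringStream("");
    try {
        edit->PlainText = false;            // stream RTF, not plain text
        edit->Lines->SaveToStream(stream);
        rtf = stream->DataString;
    }
    __finally {
        delete stream;
    }
    return rtf;
}

void SetRtf(TRichEdit *edit, const AnsiString &rtf)
{
    TStringStream *stream = new TStringStream(rtf);
    try {
        edit->PlainText = false;
        edit->Lines->LoadFromStream(stream);
    }
    __finally {
        delete stream;
    }
}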
You have no problem. You are using a Unicode-compliant component and you will not suffer data loss. From the Wikipedia article on RTF:
A standard RTF file can consist of only 7-bit ASCII characters, but can encode characters beyond ASCII by escape sequences. The character escapes are of two types: code page escapes and, starting with RTF 1.5, Unicode escapes. In a code page escape, two hexadecimal digits following a backslash and typewriter apostrophe are used for denoting a character taken from a Windows code page. For example, if the code page is set to Windows-1256, the sequence \'c8 will encode the Arabic letter bāʼ (ب).
For a Unicode escape the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode UTF-16 code unit number. For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter bāʼ ب, specifying that older programs which do not have Unicode support should render it as a question mark instead.
You are observing a code page escape, and that is fine: that is what \'f5 is. The character is found in the document's code page, and hence a code page escape can be used. If you include characters outside the document's code page, then the control will use a Unicode escape.
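As a rough illustration of the Unicode escape convention quoted above (a hypothetical sketch, not part of any RichEdit API): the \u control word carries the UTF-16 code unit as a signed 16-bit decimal number, followed by a fallback character for readers without Unicode support.

#include <cstdio>

// Emit the RTF Unicode escape for one UTF-16 code unit, using '?' as the
// fallback for readers that do not understand \u. Code units above 0x7FFF
// come out negative, because the control word takes a signed 16-bit value.
void EmitRtfUnicodeEscape(unsigned short codeUnit)
{
    std::printf("\\u%d?", (short)codeUnit);
}

int main()
{
    EmitRtfUnicodeEscape(0x0628); // Arabic ba' (U+0628)        -> prints \u1576?
    EmitRtfUnicodeEscape(0x0151); // o with double acute (U+0151) -> prints \u337?
    return 0;
}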
Solved (by necessity) using Borland C++ 6. Same code pattern applies for Borland Delphi.
(NOTE: TTntRichEdit loads UTF-8 text as UTF-8 ONLY when it explicitly has the BOM header "\357\273\277" or [0xEF, 0xBB, 0xBF])
// This only works with BOM explicit files
// (it will fail on BOM-less UTF-8 files)
TTntRichEdit *myTntRichEdit = ...{some init code}...
myTntRichEdit->Lines->LoadFromFile(UTF8_filename);
So here is my working production code:
(Note: TRESource declaration is TTntRichEdit *TRESource;)
void TFormMyExample::LoadJavascriptFromFile(AnsiString myFile) {
    // This method will load a UTF-8 text file (with or without BOM).
    // Replaces the old direct load: TRESource->Lines->LoadFromFile(myFile);
    TMemoryStream *JSMemoryStream = NULL;
    TMemoryStream *JSBOM_MemoryStream = NULL;
    AnsiString BOM = "\357\273\277"; // [0xEF, 0xBB, 0xBF]
    try {
        JSMemoryStream = new TMemoryStream();
        JSMemoryStream->LoadFromFile(myFile);
        // Check for the BOM.
        char BOMHeader[4];
        JSMemoryStream->Seek(0, soFromBeginning);
        JSMemoryStream->ReadBuffer(BOMHeader, 3);
        JSMemoryStream->Seek(0, soFromBeginning); // reset
        BOMHeader[3] = 0;
        if (strcmp(BOM.c_str(), BOMHeader) == 0) {
            // We have the BOM header, so load the stream as-is.
            TRESource->Lines->LoadFromStream(JSMemoryStream);
        } else {
            // We need the BOM header, so add it.
            try {
                JSBOM_MemoryStream = new TMemoryStream;
                JSBOM_MemoryStream->Write(BOM.c_str(), BOM.Length());
                JSBOM_MemoryStream->Seek(0, soFromEnd);
                JSBOM_MemoryStream->CopyFrom(JSMemoryStream, 0);
                JSBOM_MemoryStream->Seek(0, soFromBeginning);
                TRESource->Lines->LoadFromStream(JSBOM_MemoryStream);
            }
            __finally {
                delete JSBOM_MemoryStream;
            }
        }
    }
    __finally {
        delete JSMemoryStream;
    }
}
When I write the processed file, it's done in this manner.
(Note: TREProcessed declaration is TTntRichEdit *TREProcessed; also: AnsiString outputFileName;)
#include <fstream>
using std::ofstream;

ofstream SaveFile(outputFileName.c_str());
TREProcessed->PlainText = true;
SaveFile << "\357\273\277"; // add UTF-8 BOM [0xEF, 0xBB, 0xBF]
for (int i = 0, max = TREProcessed->Lines->Count; i < max; i++) {
    SaveFile << UTF8Encode(TREProcessed->Lines->Strings[i]).c_str();
    if (i < max - 1) {
        SaveFile << UTF8Encode(_WS "\n").c_str();
    }
}
SaveFile.close();
I have the following code:
buff=esp.flash_read(esp.flash_user_start(),50)
print(buff)
I get the following output from print:
bytearray(b'{"ssid": "mySSID", "password": "myPASSWD"}\xff\xff\xff\xff\xff\xff')
What I want to do is get the json in buff. What is the correct "Python-way" to do that?
buff is a Python bytes-like object (here a bytearray), as shown by the print output beginning with bytearray(b'. To convert this into a string you need to decode it.
In standard Python you could use
buff.decode(errors='ignore')
Note that without specifying errors='ignore' you would get a UnicodeDecodeError, because the \xff bytes aren't valid in the default encoding, which is UTF-8; presumably they're padding and you want to ignore them.
If that works on the ESP8266, great! However, the MicroPython docs suggest the keyword syntax might not be implemented - I don't have an ESP8266 to test it. If not, then you may need to remove the padding characters yourself:
textLength = buff.find(b'\xff')
text = buff[0:textLength].decode()
or simply:
text = buff[0:buff.find(b'\xff')].decode()
If decode isn't implemented either, which it isn't in the online MicroPython interpreter, you can use str:
text = str(buff[0:buff.find(b'\xff')], 'utf-8')
Here you have to specify explicitly that you're decoding from UTF-8 (or whatever encoding you specify).
However if what you're really after is the values encoded in the JSON, you should be able to use the json module to retrieve them into a dict:
import json
j = json.loads(buff[0:buff.find(b'\xff')])
ssid = j['ssid']
password = j['password']
I'm returning a character vector from a function in R to C# using R.NET. The only problem is that Unicode characters, such as Greek letters, are being lost. The following line gives an example of the code I'm using:
CharacterVector cvAll = results[5].AsList().AsCharacter();
Where results is a list of results returned by the R function. The characters are also written by R to a text file and they display fine in notepad and other editors. Can I get R.Net to return the characters correctly?
Looks like you ran into an open issue with RDotNet: https://github.com/jmp75/rdotnet/issues/25
Unicode characters don't seem to be supported yet. I ran into the same issue while calling the engine.CreateDataFrame() method: it returned a DataFrame with all my accented strings wrong.
There seems to be a workaround, though: when calling RDotNet functions, if I pass strings encoded in my computer's default encoding (Windows ANSI), converted from UTF-8 (important), R takes them and gives back correctly interpreted accented strings to C#. I don't know exactly why this works... It might have something to do with .NET's default string encoding being UTF-16 (cf. http://csharpindepth.com/Articles/General/Strings.aspx), hence the conversion from UTF-8 to the default ANSI encoding that seems to do the trick.
Here is an ugly example: when I'm building an RDotNet DataFrame, I convert all strings in a CharacterVector to ANSI-encoded ones (from UTF-8):
try
{
    string[] colAsStrings = null;
    colAsStrings = Array.ConvertAll<object, string>(uneColonne, s => StringEncodingHelper.EncodeToDefaultFromUTF8((string)s));
    correctedDataArray[i] = colAsStrings;
    columnConverted = true;
}
Here is the static method used for the conversion:
public static string EncodeToDefaultFromUTF8(string stringToEncode)
{
    // Reinterpret the string's UTF-8 bytes in the system default (ANSI) code page.
    byte[] utf8EncodedBytes = Encoding.UTF8.GetBytes(stringToEncode);
    return Encoding.Default.GetString(utf8EncodedBytes);
}
CONCLUSION:
For some reason the flow wouldn't let me convert the incoming message to a BLOB by changing the Message Domain property of the Input Node, so I added a Reset Content Descriptor node before the Compute Node with the code from the accepted answer. On the line that parses the XML and creates the XMLNSC child for the message I was getting a 'CHARACTER: Invalid wire format received' error, so I took that line out and added another Reset Content Descriptor node after the Compute Node instead. Now it parses and replaces the Unicode characters with spaces, so it no longer crashes.
Here is the code for the added Compute Node:
CREATE FUNCTION Main() RETURNS BOOLEAN
BEGIN
    DECLARE NonPrintable BLOB X'0001020304050607080B0C0E0F101112131415161718191A1B1C1D1E1F7F808182838485868788898A8B8C8D8E8F909192939495969798999A9B9C9D9E9FA0A1A2A3A4A5A6A7A8A9AAABACADAEAFB0B1B2B3B4B5B6B7B8B9BABBBCBDBEBFC0C1C2C3C4C5C6C7C8C9CACBCCCDCECFD0D1D2D3D4D5D6D7D8D9DADBDCDDDEDFE0E1E2E3E4E5E6E7E8E9EAEBECEDEEEFF1F2F3F4F5F6F7F8F9FAFBFCFDFEFF';
    DECLARE Printable BLOB X'20202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020';
    DECLARE Fixed BLOB TRANSLATE(InputRoot.BLOB.BLOB, NonPrintable, Printable);
    SET OutputRoot = InputRoot;
    SET OutputRoot.BLOB.BLOB = Fixed;
    RETURN TRUE;
END;
UPDATE:
The message is being parsed as XML using XMLNSC. I thought that would cause a problem, but it does not appear to be the case.
Now I'm using PHP. I've created a node to plug into the legacy flow. Here's the relevant code:
class fixIncompetence {
    function evaluate ($output_assembly, $input_assembly) {
        $output_assembly->MRM  = $input_assembly->MRM;
        $output_assembly->MQMD = $input_assembly->MQMD;
        $tmp = htmlentities($input_assembly->MRM->VALUE_TO_FIX, ENT_HTML5|ENT_SUBSTITUTE, 'UTF-8');
        if (!empty($tmp)) {
            $output_assembly->MRM->VALUE_TO_FIX = $tmp;
        }
        // Ensure there are no null MRM fields. MessageBroker is strict.
        foreach ($output_assembly->MRM as $key => $val) {
            if (empty($val)) {
                $output_assembly->MRM->$key = '';
            }
        }
    }
}
Right now I'm getting a vague error about read only messages, but before that it wasn't working either.
Original Question:
For some reason I am unable to impress upon the senders of our MQ messages that smart quotes, en dashes, em dashes, and such crash our XML parser.
I managed to make a working solution with SQL queries, but it wasted too many resources. Here's the last thing I tried, but it didn't work either:
CREATE FUNCTION CLEAN(IN STR CHAR) RETURNS CHAR BEGIN
    SET STR = REPLACE('–',STR,'&ndash;');
    SET STR = REPLACE('—',STR,'&mdash;');
    SET STR = REPLACE('·',STR,'&middot;');
    SET STR = REPLACE('“',STR,'&ldquo;');
    SET STR = REPLACE('”',STR,'&rdquo;');
    SET STR = REPLACE('‘',STR,'&lsqo;');
    SET STR = REPLACE('’',STR,'&rsquo;');
    SET STR = REPLACE('•',STR,'&bull;');
    SET STR = REPLACE('°',STR,'&deg;');
    RETURN STR;
END;
As you can see I'm not very good at this. I have tried reading about various ESQL string functions without much success.
So in ESQL you can use the TRANSLATE function.
The following is a snippet I use to clean up a BLOB containing non-ASCII low hex values so that it can then be cast into a usable character string.
You should be able to modify it to change your undesired characters into something more benign. Basically, each hex value in NonPrintable gets translated into its positional equivalent in Printable, in this case always a full stop, i.e. X'2E' in ASCII. You'll need to make your BLOBs long enough to cover the desired range of hex values.
DECLARE NonPrintable BLOB X'000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F202122232425262728292A2B2C2D2E2F303132333435363738393A3B3C3D3E3F';
DECLARE Printable BLOB X'2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E';
SET WorkBlob = TRANSLATE(WorkBlob, NonPrintable, Printable);
BTW if messages with invalid characters only come in every now and then I'd probably specify BLOB on the input node and then use something similar to the following to invoke the XMLNSC parser.
CREATE LASTCHILD OF OutputRoot DOMAIN 'XMLNSC'
PARSE(InputRoot.BLOB.BLOB CCSID InputRoot.Properties.CodedCharSetId ENCODING InputRoot.Properties.Encoding);
With the exception terminal wired up, you can then correct the BLOBs of any messages containing parser-breaking invalid characters before attempting to reparse them.
Finally, my best wishes, as I've had a number of battles over the years with being forced to correct invalid message content in the "Integration Layer"; after all, that's what it's meant to do.
I'm using com.adobe.granite.xss for encoding strings in JSP. It seems to work with most characters, except for Ã, which is displayed as Ã�.
It happens when using the xssAPI.encodeForHTML() method. I have tried <cq:text> with escapeXml="true" and it has the same behaviour.
The characters are stored properly in the repository, and I have also set content="text/html; charset=utf-8" in the JSP.
Is there a way to encode or filter the input for XSS without the charset breaking in such situations?
I have tried it with different non-Latin characters and most of them are not affected by the XSS API.
It looks like an issue with owasp-esapi-java, which is used in CQ's XSSAPI, because it iterates through the string using the charAt() method. But the character in question is outside the BMP, so the right way of iterating would be:
final int length = s.length();
for (int offset = 0; offset < length; ) {
    final int codepoint = s.codePointAt(offset);
    // do something with the codepoint
    offset += Character.charCount(codepoint);
}
(from "How can I iterate through the unicode codepoints of a Java String?")
So I think that it's an issue of this library.
Try using xssAPI.filterHTML(); it may solve your issue.
I have two URLs with parameters
http://localhost:8041/Reforge.aspx?name=CyanГ
http://localhost:8041/Reforge.aspx?name=Cyanì
In the first URL, Firefox encodes the last character (Г) as %D0%93 (correctly, as UTF-8).
In the second URL, Firefox encodes the last character (ì) as %EC (correctly, as ISO-8859-1).
ASP.NET MVC can be configured, using the <globalization> element in web.config, to assume either UTF-8 or ISO-8859-1. But Firefox flips between encodings depending on the context.
Note that UTF-8 can be unambiguously distinguished from Latin-1 encoding.
Is there a way to teach ASP.NET MVC to decode parameter values using either one of the formats?
EDIT: Is there a class that I could use to decode the raw query string that would handle the encoding correctly? Note that Firefox uses either UTF-8 or Latin-1 encoding, but not both at the same time. So my plan is to try decoding manually as UTF-8 and then look for the "invalid" character (U+FFFD); if one is found, try a Latin-1 decode.
Example:
Firefox encodes as follows:
http://localhost:8041/Reforge.aspx?name=ArcânisГ
is turned into
http://localhost:8041/Reforge.aspx?name=Arc%C3%A2nis%D0%93
Notice that UTF-8 encoding is used for both non-ASCII characters (â and Г).
http://localhost:8041/Reforge.aspx?name=Arcâ
is turned into
http://localhost:8041/Reforge.aspx?name=Arc%E2
Notice that ISO-8859-1 (Latin-1) encoding is used for the non-ASCII character (â).
Here is my working solution; is there any way to improve on it? Specifically, I would rather extend the framework than handle this inside the action itself.
private string DecodeNameParameterFromQuery(string query) {
    string nameUtf8 = HttpUtility.ParseQueryString(query, Encoding.UTF8)["name"];
    const char invalidUtf8Character = (char) 0xFFFD; // U+FFFD replacement character
    if (nameUtf8.Contains(invalidUtf8Character)) {
        const int latin1 = 0x6FAF; // 28591: the ISO-8859-1 code page
        var nameLatin1 = HttpUtility.ParseQueryString(query, Encoding.GetEncoding(latin1))["name"];
        return nameLatin1;
    }
    return nameUtf8;
}