Charset issue with XSS API in CQ5: à being displayed as �

I'm using com.adobe.granite.xss to encode strings in JSP. It works for most characters, but not for à, which is displayed as Ã�.
It happens when using the xssAPI.encodeForHTML() method. I have tried <cq:text> with escapeXml="true" and it behaves the same way.
The characters are stored properly in the repository, and I have also set content="text/html; charset=utf-8" in the JSP.
Is there a way to encode or filter the input for XSS without breaking the charset in such situations?
I have tried it with different non-Latin characters and most of them are not affected by the XSS API.

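It looks like it could be an issue in owasp-esapi-java, which CQ's XSSAPI uses, because it iterates through the string using the charAt() method. That breaks for characters outside the BMP (à itself is U+00E0 and lies inside the BMP, so this may not be the whole story). The right way of iterating would be: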
final int length = s.length();
for (int offset = 0; offset < length; ) {
    final int codepoint = s.codePointAt(offset);
    // do something with the codepoint
    offset += Character.charCount(codepoint);
}
(from How can I iterate through the unicode codepoints of a Java String?)
So I think the issue is in this library.
Try xssAPI.filterHTML(); it may solve your issue.
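A quick way to check both hypotheses is sketched below in Java. It assumes the symptom may come from a charset mismatch rather than from surrogate handling; the class and method names are made up for illustration and are not part of CQ5 or ESAPI. It shows that à is a single char inside the BMP, and that its UTF-8 bytes read as ISO-8859-1 become Ã followed by a no-break space, which resembles the reported output.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of CQ5 or ESAPI.
class MojibakeCheck {
    // True if s contains no surrogate pairs, i.e. every code point fits in one char.
    static boolean isAllBmp(String s) {
        return s.codePointCount(0, s.length()) == s.length();
    }

    // Simulates a consumer that reads UTF-8 bytes as ISO-8859-1.
    static String utf8ReadAsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String aGrave = "\u00E0"; // à
        System.out.println("BMP only: " + isAllBmp(aGrave));
        // Print the code points of the mis-decoded string.
        for (char c : utf8ReadAsLatin1(aGrave).toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

If the output of the real page matches the second part, the response charset (for example the JSP pageEncoding or the Content-Type header) is a more likely culprit than the iteration style.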

Related

Unicode characters returning from R.NET

I'm returning a character vector from a function in R to C# using R.NET. The only problem is that Unicode characters, such as Greek letters, are being lost. The following line gives an example of the code I'm using:
CharacterVector cvAll = results[5].AsList().AsCharacter();
Where results is a list of results returned by the R function. The characters are also written by R to a text file and they display fine in notepad and other editors. Can I get R.Net to return the characters correctly?
Looks like you ran into an open issue with RDotNet: https://github.com/jmp75/rdotnet/issues/25
Unicode characters don't seem to be supported yet. I ran into the same issue while calling the engine.CreateDataFrame() method. It returned a DataFrame with all my accented strings wrong.
There seems to be a workaround, though: when calling RDotNet functions, if I pass strings whose UTF-8 bytes have been reinterpreted in my computer's default encoding (Windows ANSI), R accepts them and returns correctly interpreted accented strings to C#. I don't know exactly why it works. It might have something to do with .NET strings being UTF-16 by default (see http://csharpindepth.com/Articles/General/Strings.aspx), hence the conversion from UTF-8 to default ANSI that seems to work.
Here is an ugly example: when building an RDotNet DataFrame, I convert all strings in a CharacterVector to ANSI-encoded ones (from UTF-8):
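try
{
    // Convert every string in the column from UTF-8 to the default ANSI encoding.
    string[] colAsStrings = Array.ConvertAll<object, string>(uneColonne, s => StringEncodingHelper.EncodeToDefaultFromUTF8((string)s));
    correctedDataArray[i] = colAsStrings;
    columnConverted = true;
}
// (the catch/finally part is not shown in the original snippet)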
Here is the static method used for conversion:
public static string EncodeToDefaultFromUTF8(string stringToEncode)
{
    // Take the UTF-8 bytes and reinterpret them in the system default (ANSI) encoding.
    byte[] utf8EncodedBytes = Encoding.UTF8.GetBytes(stringToEncode);
    return Encoding.Default.GetString(utf8EncodedBytes);
}
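To see what this helper does at the byte level, here is a minimal Java sketch. It uses ISO-8859-1 as a stand-in for the Windows ANSI code page, which is an assumption, and the names are hypothetical. The point is that the converted string, written out one byte per character, carries exactly the original UTF-8 bytes. Whether that is what R.NET's native side relies on is a guess, not something the answer confirms.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; ISO-8859-1 stands in for the Windows ANSI code page.
class ReinterpretSketch {
    // Mirrors the C# helper: take the UTF-8 bytes and decode them with a single-byte charset.
    static String encodeToSingleByteFromUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // True if writing the converted string one byte per char reproduces the UTF-8 bytes.
    static boolean carriesUtf8Bytes(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        byte[] seenByConsumer =
                encodeToSingleByteFromUtf8(original).getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(utf8, seenByConsumer);
    }

    public static void main(String[] args) {
        // "été αβ" written with escapes to keep the source file ASCII-only.
        System.out.println(carriesUtf8Bytes("\u00E9t\u00E9 \u03B1\u03B2"));
    }
}
```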

How to remove non-ascii char from MQ messages with ESQL

CONCLUSION:
For some reason the flow wouldn't let me convert the incoming message to a BLOB by changing the Message Domain property of the Input Node so I added a Reset Content Descriptor node before the Compute Node with the code from the accepted answer. On the line that parses the XML and creates the XMLNSC Child for the message I was getting a 'CHARACTER:Invalid wire format received' error so I took that line out and added another Reset Content Descriptor node after the Compute Node instead. Now it parses and replaces the Unicode characters with spaces. So now it doesn't crash.
Here is the code for the added Compute Node:
CREATE FUNCTION Main() RETURNS BOOLEAN
BEGIN
DECLARE NonPrintable BLOB X'0001020304050607080B0C0E0F101112131415161718191A1B1C1D1E1F7F808182838485868788898A8B8C8D8E8F909192939495969798999A9B9C9D9E9FA0A1A2A3A4A5A6A7A8A9AAABACADAEAFB0B1B2B3B4B5B6B7B8B9BABBBCBDBEBFC0C1C2C3C4C5C6C7C8C9CACBCCCDCECFD0D1D2D3D4D5D6D7D8D9DADBDCDDDEDFE0E1E2E3E4E5E6E7E8E9EAEBECEDEEEFF1F2F3F4F5F6F7F8F9FAFBFCFDFEFF';
DECLARE Printable BLOB X'20202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020202020';
DECLARE Fixed BLOB TRANSLATE(InputRoot.BLOB.BLOB, NonPrintable, Printable);
SET OutputRoot = InputRoot;
SET OutputRoot.BLOB.BLOB = Fixed;
RETURN TRUE;
END;
UPDATE:
The message is being parsed as XML using XMLNSC. Thought that would cause a problem, but it does not appear to be.
Now I'm using PHP. I've created a node to plug into the legacy flow. Here's the relevant code:
class fixIncompetence {
    function evaluate ($output_assembly,$input_assembly) {
        $output_assembly->MRM = $input_assembly->MRM;
        $output_assembly->MQMD = $input_assembly->MQMD;
        $tmp = htmlentities($input_assembly->MRM->VALUE_TO_FIX, ENT_HTML5|ENT_SUBSTITUTE,'UTF-8');
        if (!empty($tmp)) {
            $output_assembly->MRM->VALUE_TO_FIX = $tmp;
        }
        // Ensure there are no null MRM fields. MessageBroker is strict.
        foreach ($output_assembly->MRM as $key => $val) {
            if (empty($val)) {
                $output_assembly->MRM->$key = '';
            }
        }
    }
}
Right now I'm getting a vague error about read only messages, but before that it wasn't working either.
Original Question:
For some reason I am unable to impress upon the senders of our MQ
messages that smart quotes, endashes, emdashes, and such crash our XML
parser.
I managed to make a working solution with SQL queries, but it wasted
too many resources. Here's the last thing I tried, but it didn't work
either:
CREATE FUNCTION CLEAN(IN STR CHAR) RETURNS CHAR BEGIN
    SET STR = REPLACE('–',STR,'&ndash;');
    SET STR = REPLACE('—',STR,'&mdash;');
    SET STR = REPLACE('·',STR,'&middot;');
    SET STR = REPLACE('“',STR,'&ldquo;');
    SET STR = REPLACE('”',STR,'&rdquo;');
    SET STR = REPLACE('‘',STR,'&lsqo;');
    SET STR = REPLACE('’',STR,'&rsquo;');
    SET STR = REPLACE('•',STR,'&bull;');
    SET STR = REPLACE('°',STR,'&deg;');
    RETURN STR;
END;
As you can see I'm not very good at this. I have tried reading about
various ESQL string functions without much success.
In ESQL you can use the TRANSLATE function.
The following is a snippet I use to clean up a BLOB containing non-ASCII low hex values so that it can then be cast into a usable character string.
You should be able to modify it to change your undesired characters into something more benign. Basically, each hex value in NonPrintable gets translated into its positional equivalent in Printable, in this case always a full stop, i.e. X'2E' in ASCII. You'll need to make your BLOBs long enough to cover the desired range of hex values.
DECLARE NonPrintable BLOB X'000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F202122232425262728292A2B2C2D2E2F303132333435363738393A3B3C3D3E3F';
DECLARE Printable BLOB X'2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E2E';
SET WorkBlob = TRANSLATE(WorkBlob, NonPrintable, Printable);
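The positional mapping described above can be sketched outside ESQL as well. The following Java version is only an illustration of the idea, not of the broker's implementation. It assumes both lists have the same length, and the class name is hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration of a positional byte translation, similar in spirit to TRANSLATE.
class ByteTranslate {
    // Replaces each byte found in `from` with the byte at the same index in `to`.
    static byte[] translate(byte[] input, byte[] from, byte[] to) {
        if (from.length != to.length) {
            throw new IllegalArgumentException("from and to must have the same length");
        }
        // 256-entry lookup table, identity by default.
        byte[] table = new byte[256];
        for (int i = 0; i < 256; i++) {
            table[i] = (byte) i;
        }
        for (int i = 0; i < from.length; i++) {
            table[from[i] & 0xFF] = to[i];
        }
        byte[] out = new byte[input.length];
        for (int i = 0; i < input.length; i++) {
            out[i] = table[input[i] & 0xFF];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] message = {0x41, 0x01, 0x42, (byte) 0x93};
        byte[] nonPrintable = {0x01, (byte) 0x93};
        byte[] printable = {0x2E, 0x2E}; // full stops
        System.out.println(Arrays.toString(translate(message, nonPrintable, printable)));
    }
}
```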
BTW, if messages with invalid characters only come in every now and then, I'd probably specify BLOB on the input node and then use something similar to the following to invoke the XMLNSC parser:
CREATE LASTCHILD OF OutputRoot DOMAIN 'XMLNSC'
PARSE(InputRoot.BLOB.BLOB CCSID InputRoot.Properties.CodedCharSetId ENCODING InputRoot.Properties.Encoding);
With the exception terminal wired up, you can then correct the BLOBs of any messages containing parser-breaking invalid characters before attempting to reparse.
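The "parse first, repair only on failure" flow can be sketched in plain Java with the JDK's XML parser. This is an analogy for the pattern, not broker code. The names are hypothetical, and the repair step here only blanks the control bytes that XML 1.0 forbids.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration of "parse, and repair the bytes only when parsing fails".
class ParseThenRepair {
    // Returns the parsed document, or null if the bytes are not well-formed XML.
    static Document tryParse(byte[] xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // rethrow fatal errors without logging
            return builder.parse(new ByteArrayInputStream(xml));
        } catch (SAXException | IOException parseFailure) {
            return null;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Replaces bytes below 0x20 (except tab, LF, CR) with spaces.
    static byte[] blankControlBytes(byte[] in) {
        byte[] out = in.clone();
        for (int i = 0; i < out.length; i++) {
            int b = out[i] & 0xFF;
            if (b < 0x20 && b != 0x09 && b != 0x0A && b != 0x0D) {
                out[i] = 0x20;
            }
        }
        return out;
    }

    // First attempt on the raw bytes; second attempt on the cleaned bytes.
    static Document parseWithRepair(byte[] xml) {
        Document doc = tryParse(xml);
        return doc != null ? doc : tryParse(blankControlBytes(xml));
    }

    public static void main(String[] args) {
        byte[] bad = "<msg>a\u0001b</msg>".getBytes(StandardCharsets.UTF_8);
        System.out.println(tryParse(bad) == null);
        System.out.println(parseWithRepair(bad).getDocumentElement().getTextContent());
    }
}
```

Well-formed messages pay no extra cost this way; only the failing ones go through the cleanup.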
Finally, my best wishes: I've had a number of battles over the years with being forced to correct invalid message content in the "Integration Layer" because, after all, that's what it's meant to do.

Unable to decode Base-64 URLs

I have an application that builds an HTML email. Included in the content is an encoded URL parameter that might, for example, contain a promotional code or product reference. The email is generated by a Windows service (essentially a console application) and the link, when clicked is handled by an MVC web site. Here is the code for creating the email link:
string CreateLink(string domain, string code) {
    // code == "xyz123"
    string encrypted = DES3Crypto.Encrypt(code); // H3uKbdyzrUo=
    string urlParam = encrypted.EncodeBase64(); // SDN1S2JkeXpyVW890
    return domain + "/" + urlParam;
}
The action method on the MVC controller is constructed as follows:
public ActionResult Index(string id) {
    string decoded = id.DecodeBase64();
    string decrypted = DES3Crypto.Decrypt(decoded);
    ...
}
In all our testing, this mechanism has worked as expected, however, now we have gone live we are seeing around a 4% error rate where the conversion from base-64 fails with the following exception:
The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or a non-white space character among the padding characters.
The id parameter from the url 'looks' OK. The problem appears to be with the EncodeBase64/DecodeBase64 methods that are failing as DecodeBase64 method returns a 'garbled' string such as "�nl����□��7y�b�8�sJ���=" on the failed links.
Furthermore, most of the errors are from IE6 user agents leading me to think this is a character encoding problem but I don't see why.
For reference, here is the code for my base-64 URL encoding:
public static string EncodeBase64(this string source)
{
    byte[] bytes = Encoding.UTF8.GetBytes(source);
    string encodedString = HttpServerUtility.UrlTokenEncode(bytes);
    return encodedString;
}
public static string DecodeBase64(this string encodedString)
{
    byte[] bytes = HttpServerUtility.UrlTokenDecode(encodedString);
    string decodedString = Encoding.UTF8.GetString(bytes);
    return decodedString;
}
Any advice would be much appreciated.
To recap, I was creating a URL that used a base-64 encoded parameter which was itself a Triple DES encrypted string. So the URL looked like http://[Domain_Name]/SDN1S2JkeXpyVW890 The link referenced a controller action on an MVC web site.
The URL was then inserted into an HTML formatted email. Looking at the error log, we saw that around 5% of the public users that responded to the link were throwing an "invalid base-64 string error". Most, but not all, of these errors were related to the IE6 user agent.
After trying many possible solutions based around character and URL encoding, we discovered that somewhere in the client's process the URL was being converted to lower case. This, of course, broke the base-64 encoding (as it uses both upper- and lower-case encoding characters).
Whether the case corruption was caused by the client's browser, email client or perhaps local anti-virus software, I have not been able to determine.
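The effect is easy to reproduce. The sketch below uses Java's java.util.Base64 URL-safe encoder rather than the .NET UrlTokenEncode method from the question, so the token lacks the trailing padding-count digit; the class name is hypothetical. A lower-cased token still consists of valid base-64 characters, but it decodes to different bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;
import java.util.Locale;

// Hypothetical illustration: lower-casing a base-64 token changes the decoded bytes.
class LowercasedBase64 {
    static String encode(String s) {
        return Base64.getUrlEncoder().withoutPadding()
                .encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    // True if the token still decodes to the original bytes after being lower-cased.
    static boolean survivesLowercasing(String s) {
        String token = encode(s);
        byte[] decoded = Base64.getUrlDecoder().decode(token.toLowerCase(Locale.ROOT));
        return Arrays.equals(decoded, s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String encrypted = "H3uKbdyzrUo="; // sample value from the question
        System.out.println(encode(encrypted));
        System.out.println(survivesLowercasing(encrypted));
    }
}
```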
The Solution
Do not use any of the standard base-64 encoding methods; use a base-32 or zBase-32 encoding instead, both of which are case-insensitive.
See the following links for more details
Base-32 - Wikipedia
MyTenPennies Base-32 .NET Implementation
The moral of the story is, Base-64 URL encoding can be unreliable in some public environments. Base-32, whilst slightly more verbose, is a better choice.
Hope this helps.
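For illustration, here is a minimal Java sketch of an unpadded Base-32 codec using the RFC 4648 alphabet, with decoding made case-insensitive by upper-casing the input first. It is not the implementation linked above, and the class name is hypothetical. A production version would also want to validate trailing bits and handle padding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical, minimal Base-32 codec (RFC 4648 alphabet, no padding).
class Base32Sketch {
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

    static String encode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int buffer = 0; // holds the bits not yet emitted
        int bits = 0;   // number of valid bits in buffer
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 5) {
                sb.append(ALPHABET.charAt((buffer >> (bits - 5)) & 31));
                bits -= 5;
            }
            buffer &= (1 << bits) - 1; // keep only the leftover bits
        }
        if (bits > 0) {
            // Pad the final group with zero bits on the right.
            sb.append(ALPHABET.charAt((buffer << (5 - bits)) & 31));
        }
        return sb.toString();
    }

    static byte[] decode(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int buffer = 0;
        int bits = 0;
        // Upper-casing first is what makes the decoder case-insensitive.
        for (char c : text.toUpperCase(Locale.ROOT).toCharArray()) {
            int value = ALPHABET.indexOf(c);
            if (value < 0) {
                throw new IllegalArgumentException("Not a Base-32 character: " + c);
            }
            buffer = (buffer << 5) | value;
            bits += 5;
            if (bits >= 8) {
                out.write((buffer >> (bits - 8)) & 0xFF);
                bits -= 8;
            }
            buffer &= (1 << bits) - 1;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String token = encode("H3uKbdyzrUo=".getBytes(StandardCharsets.UTF_8));
        String mangled = token.toLowerCase(Locale.ROOT); // what a client might do to the URL
        System.out.println(new String(decode(mangled), StandardCharsets.UTF_8));
    }
}
```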
It looks like you were really close. You had an extra zero coming back from your encrypted.EncodeBase64() function.
Try this:
string data = "H3uKbdyzrUo=";
string b64str = Convert.ToBase64String(UTF8Encoding.UTF8.GetBytes(data));
string clearText = UTF8Encoding.UTF8.GetString(Convert.FromBase64String(b64str));
This is an interesting issue. My guess is that IE 6 is eating some of the characters.
For example, the length of the string that you included, "ywhar0xznxpjdnfnddc0yxzbk2jnqt090", is not a multiple of four (which is a requirement for FromBase64String to work; see http://msdn.microsoft.com/en-us/library/system.convert.frombase64string.aspx)
But if you pad that string until its length is a multiple of four ("ywhar0xznxpjdnfnddc0yxzbk2jnqt090" + "a12") then it works.
The MSDN documentation says that one ("=") or two ("==") equals characters are used as padding by the ToBase64String/FromBase64String methods, and I suspect IE 6 is truncating them from the string that you send.
This is total speculation, but I hope it helps.

How to show Persian numbers on ASP.NET MVC page?

I'm building a site which needs to support both English and Persian language. The site is built with ASP.NET MVC 3 and .NET 4 (C#). All my controllers inherit from a BaseController, which sets culture to "fa-IR" (during test):
Thread.CurrentThread.CurrentCulture = new CultureInfo("fa-IR");
Thread.CurrentThread.CurrentUICulture = new CultureInfo("fa-IR");
In my views, I'm using static helper classes to convert into right timezone and to format date. Like:
DateFormatter.ToLocalDateAndTime(model.CreatedOnUtc)
I'm doing the same for money with a MoneyFormatter. For money, the currency prints correctly in Persian (or, at least I think so as I don't know any Persian. It will be verified later on though by our translators). The same goes for dates, where the month is printed correctly. This is what it displays as currently:
For money:
قیمت: ريال 1,000
For dates:
21:01 ديسمبر 3
In these examples, I want the numbers to also print in Persian. How do you accomplish that?
I have also tried adding:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1256" />
in the HEAD tag, as recommended on some forums. Does not change anything though (from what I can see). I have also added "fa-IR" as the first language in my browser languages (both IE and Firefox). Didn't help either.
Anyone got any ideas on how to solve this? I'd rather avoid creating translation tables between English numbers and Persian numbers, if possible.
Best regards,
Eric
After some further research I found an answer from a Microsoft employee stating that they don't translate numbers, though it's fully possible to do it yourself using the array of digits for a specific culture, that is provided by the NativeDigits property (see http://msdn.microsoft.com/en-us/library/system.globalization.numberformatinfo.nativedigits.aspx).
To find out if text that is submitted in a form is Arabic or Latin I'm now doing:
public bool IsContentArabic(string content)
{
    string pattern = @"\p{IsArabic}";
    return Regex.IsMatch(
        content,
        pattern,
        RegexOptions.RightToLeft | RegexOptions.IgnoreCase | RegexOptions.Multiline);
}
public bool IsContentLatin1(string content)
{
    string pattern = @"\p{IsBasicLatin}";
    return Regex.IsMatch(
        content,
        pattern,
        RegexOptions.IgnoreCase | RegexOptions.Multiline);
}
And to convert Persian digits into their "Latin" equivalents, I wrote this helper:
public static class NumberHelper
{
    private static readonly CultureInfo arabic = new CultureInfo("fa-IR");
    private static readonly CultureInfo latin = new CultureInfo("en-US");

    // Replaces each Latin digit with the native digit of the fa-IR culture.
    public static string ToArabic(string input)
    {
        var arabicDigits = arabic.NumberFormat.NativeDigits;
        for (int i = 0; i < arabicDigits.Length; i++)
        {
            input = input.Replace(i.ToString(), arabicDigits[i]);
        }
        return input;
    }

    // Replaces each fa-IR native digit with its Latin equivalent.
    public static string ToLatin(string input)
    {
        var latinDigits = latin.NumberFormat.NativeDigits;
        var arabicDigits = arabic.NumberFormat.NativeDigits;
        for (int i = 0; i < latinDigits.Length; i++)
        {
            input = input.Replace(arabicDigits[i], latinDigits[i]);
        }
        return input;
    }
}
I've also hooked in before model binding takes place and there I convert digit-only input from forms into Latin digits, if applicable.
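The same digit mapping can be sketched without a culture table, assuming the Persian digits are the contiguous Unicode range U+06F0 to U+06F9 (extended Arabic-Indic digits). The Java version below is only an illustration with hypothetical names; it converts in a single pass and leaves separators such as "," and ":" untouched.

```java
// Hypothetical illustration: map ASCII digits to Persian digits and back.
class PersianDigits {
    private static final char PERSIAN_ZERO = '\u06F0'; // EXTENDED ARABIC-INDIC DIGIT ZERO

    static String toPersian(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Shift '0'..'9' into the Persian digit range; copy everything else unchanged.
            sb.append(c >= '0' && c <= '9' ? (char) (PERSIAN_ZERO + (c - '0')) : c);
        }
        return sb.toString();
    }

    static String toLatin(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            sb.append(c >= PERSIAN_ZERO && c <= '\u06F9' ? (char) ('0' + (c - PERSIAN_ZERO)) : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPersian("1,000"));
        System.out.println(toLatin(toPersian("21:01")));
    }
}
```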
Also, as far as direction goes, we managed to solve quite a few issues using the 'dir' attribute (dir => direction) on the html tag of our web page. Like this:
<html dir="rtl">
The 'dir' attribute takes either "rtl" (right-to-left) or "ltr" (left-to-right).
Hope this helps someone!
Maybe you have to change the font.
If you know your client supports a particular font, say Badr or Nazanin, then you can change the font-family for those numbers, and I think that will work for you.
You can also supply the font with your page, but I think this only works in modern browsers and fails in old ones. You can check this.
Hope that helps.
How are you printing the numbers to the output? I don't have much experience with localization but you might have to use the appropriate NumberFormatInfo or something similar to format the number.
Also for greatest portability you should probably be using UTF8 as the encoding.
First of all, you should use a font face that supports Persian numbers.
I use this technique and I can show Persian numbers on my web site.
Fonts like BNazanin.

ASP.NET MVC does not understand mixed url encoding (UTF-8/Latin-1)

I have two URLs with parameters
http://localhost:8041/Reforge.aspx?name=CyanГ
http://localhost:8041/Reforge.aspx?name=Cyanì
In the first URL, Firefox encodes the last character (Г) as %D0%93 (correctly, in UTF-8).
In the second URL, Firefox encodes the last character (ì) as %EC (correctly, in ISO-8859-1).
ASP.NET MVC can be configured, using the <globalization> element in web.config, to assume either UTF-8 or ISO-8859-1. But Firefox flips between encodings depending on the context.
Note that UTF-8 can be unambiguously distinguished from Latin-1 encoding.
Is there a way to teach ASP.NET MVC to decode parameter values using either one of the formats?
EDIT: Is there a class that I could use to decode raw query string that would handle encoding correctly? Note - Firefox uses either UTF-8 or Latin-1 encoding - but not both at the same time. So my plan is to try decode manually using UTF-8 and then look for "invalid" character (FFFD), if one is found - try Latin-1 decode.
Example:
Firefox encodes as follows:
http://localhost:8041/Reforge.aspx?name=ArcânisГ
Firefox turns into
http://localhost:8041/Reforge.aspx?name=Arc%C3%A2nis%D0%93
Notice that UTF-8 encoding is used for both non-ASCII characters (â and Г).
http://localhost:8041/Reforge.aspx?name=Arcâ
Firefox turns into
http://localhost:8041/Reforge.aspx?name=Arc%E2
Notice that ISO-8859-1 (Latin-1) encoding is used for the non-ASCII character (â).
Here is my working solution, any way to improve on it? Specifically I would rather extend framework instead of handling it inside an action itself.
private string DecodeNameParameterFromQuery(string query) {
    string nameUtf8 = HttpUtility.ParseQueryString(query, Encoding.UTF8)["name"];
    // U+FFFD is the replacement character produced for invalid UTF-8 input.
    const char invalidUtf8Character = (char) 0xFFFD;
    if (nameUtf8.Contains(invalidUtf8Character)) {
        const int latin1 = 0x6FAF; // code page 28591 (ISO-8859-1)
        var nameLatin1 = HttpUtility.ParseQueryString(query, Encoding.GetEncoding(latin1))["name"];
        return nameLatin1;
    }
    return nameUtf8;
}
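One possible variation, sketched here in Java with hypothetical names, is to validate the bytes strictly as UTF-8 instead of searching the decoded text for U+FFFD. That way a value that legitimately contains U+FFFD is not mistaken for Latin-1. The sketch assumes Java 10 or later (for the URLDecoder overload that takes a Charset) and a raw, still percent-encoded ASCII value.

```java
import java.net.URLDecoder;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: try strict UTF-8 first, fall back to ISO-8859-1.
class MixedQueryDecoder {
    static String decode(String rawValue) {
        // Percent-decode as ISO-8859-1 first: every byte becomes exactly one char, so nothing is lost.
        String latin1 = URLDecoder.decode(rawValue, StandardCharsets.ISO_8859_1);
        byte[] bytes = latin1.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // A fresh decoder reports malformed input instead of substituting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException notUtf8) {
            return latin1;
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("Arc%C3%A2nis%D0%93"));
        System.out.println(decode("Arc%E2"));
    }
}
```

As with the original solution, short Latin-1 sequences that happen to form valid UTF-8 would still be read as UTF-8, so this remains a heuristic.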

Resources