ascii character not showing in browser - asp.net-mvc

I have an MVC Razor view
@{
ViewBag.Title = "Index";
var c = (char)146;
var c2 = (short)'’';
}
<h2>@c --- @c2 --’-- ‘Why Oh Why’ & ’</h2>
@String.Format("hi {0} there", (char)146)
Characters stored in varchar fields in my database are not rendering in the browser.
This example demonstrates how character 146 doesn't show up.
How do I make them render?
[EDIT]
When I do this, character 146 gets converted to Unicode 8217, but if I try to render 146 directly in the browser it fails:
public ActionResult Index()
{
using (var context = new DataContext())
{
var uuuuuggghhh = (from r in context.Projects
where r.bizId == "D11C6FD5-D084-43F0-A1EB-76FEED24A28F"
select r).FirstOrDefault();
if (uuuuuggghhh != null)
{
var ca = uuuuuggghhh.projectSummaryTxt.ToCharArray();
ViewData.Model = ca[72]; // this is the character in question
return View();
}
}
return View();
}

@Html.Raw(((char)146).ToString())
or
@Html.Raw(String.Format("hi {0} there", (char)146))
both appear to work. I was testing this in Chrome and kept getting blank output; after viewing it in Firefox I can confirm the character was being emitted (however, 146 doesn't appear to be a printable character).
This is confirmed with a readable character, '¶', below:
@Html.Raw(((char)182).ToString())
Not sure why you would want this though. But best of luck!

You do not want to use character 146. Character 146 is U+0092 PRIVATE USE TWO, an obscure and useless control character that typically renders as invisible, or a missing-glyph box/question mark.
If you want the character ’: that is U+2019 RIGHT SINGLE QUOTATION MARK, which may be written directly or using the character references &#x2019; or &rsquo;.
146 is the byte number of the encoding of U+2019 into the Windows Western code page (cp1252), but it is not the Unicode character number. The bottom 256 Unicode characters are ordered the same as the bytes in the ISO-8859-1 encoding; ISO-8859-1 is similar to cp1252 but not the same.
Bytes 128–159 in cp1252 encode various typographical niceties like smart quotes, whereas bytes 128–159 in ISO-8859-1 (and hence characters 128–159 in Unicode) are seldom-used control characters. For web applications, you usually want to filter out the control characters (0–31 and 128–159 amongst a few others) as they come in, so they never get as far as the database.
If you are getting character 146 out of your database where you expect to have a smart quote, then you have corrupt data and you need to fix it up before continuing, or possibly you are reading the database using the wrong encoding (quite how this works depends what database you're talking to).
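A minimal Python sketch of the distinction described above, assuming Python's standard `cp1252` and `latin-1` codecs stand in for whatever encodings the database and application actually use. `fix_c1_controls` is a hypothetical helper name, not part of any library; it shows one way stray characters in the range 128–159 could be mapped back to the cp1252 characters they were probably meant to be.

```python
import unicodedata

raw = bytes([146])

# Interpreted as cp1252, byte 146 is the right single quotation mark U+2019.
as_cp1252 = raw.decode("cp1252")

# Interpreted as ISO-8859-1 (whose bytes match Unicode's first 256 code
# points), the same byte is the C1 control character U+0092.
as_latin1 = raw.decode("latin-1")


def fix_c1_controls(text):
    """Reinterpret characters U+0080..U+009F as cp1252 bytes (hypothetical helper)."""
    out = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # a few bytes in this range are unassigned in cp1252
        out.append(ch)
    return "".join(out)


print(hex(ord(as_cp1252)), hex(ord(as_latin1)), unicodedata.category(as_latin1))
print(fix_c1_controls("Why Oh Why\x92"))
```

Repairing the stored data once, rather than on every read, is probably the safer use of a helper like this.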
Now here's the trap. If you write:
&#146;
as a character reference, the browser actually displays the smart quote U+2019 ’, and, confusingly, not the useless control character that actually owns that code point!
This is an old browser quirk: character references in the range &#128; to &#159; are converted to the character that maps to that number in cp1252, instead of the real character with that number.
This was arguably a bug, but the earliest browsers did it back before they grokked Unicode properly, and everyone else was forced to follow suit to avoid breaking pages. HTML5 now documents and sanctions this. (Though not in the XHTML serialisation; browsers in XHTML parsing mode won't do this because it's against the basic rules of XML.)
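The quirk can be observed outside a browser too. The sketch below uses Python's standard `html.unescape`, which is documented to follow the HTML5 rules for character references; it is only an illustration of the remapping, not of how any particular browser implements it.

```python
import html

# A numeric reference in the range 128-159 is remapped through cp1252.
quirk = html.unescape("&#146;")

# The proper ways to write the same character.
decimal = html.unescape("&#8217;")
named = html.unescape("&rsquo;")

print(quirk == decimal == named == "\u2019")

# The two ends of the remapped range: the euro sign and Y with diaeresis.
print(html.unescape("&#128;"), html.unescape("&#159;"))
```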

We finally agreed that the data was corrupt, and we have asked the users who can't see this character rendered to fix the source data.


Where should my brackets be in relation to the text for Arabic languages?

Our application automatically modifies the layout of Arabic text when it is followed by a bracket and I was wondering whether this was the correct behaviour or not?
The application shows items in the following format:
[ID of structure](version)
So version 1.5 of the English structure "stackoverflow" would be displayed as:
stackoverflow(1.5)
Note: the brackets need to be displayed. There is no space between the ID and the first bracket. The brackets simply encompass the version. The brackets could have been any character but it's far too late to switch to a different character now!
This works fine for left to right languages, but for Arabic languages the structures appear in the form:
ستاكوفيرفلوو(1.0)
I am not an Arabic speaker and I need to know if this is actually correct. Is the Arabic format the equivalent of the English format or has something gone horribly wrong?
The text in Arabic should be shown like:
ستاكوفيرفلوو(1.0) ‏
I added the HTML entity for RLM (Right-to-Left Mark), &rlm;, in order to fix the text. You should do so if your application doesn't support bidi natively. You can add the RLM in these ways:
HTML Entity (decimal) &#8207;
HTML Entity (hex) &#x200f;
HTML Entity (named) &rlm;
How to type in Microsoft Windows Alt +200F
UTF-8 (hex) 0xE2 0x80 0x8F (e2808f)
UTF-8 (binary) 11100010:10000000:10001111
UTF-16 (hex) 0x200F (200f)
UTF-16 (decimal) 8,207
UTF-32 (hex) 0x0000200F (200f)
UTF-32 (decimal) 8,207
C/C++/Java source code "\u200F"
Python source code u"\u200F"
(note: StackOverflow right transliteration is ستاك-أوفرفلو)
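A small Python sketch of the same fix done in code rather than markup. `with_rlm` is a hypothetical helper name, and whether the mark is needed at all depends on how the displaying application handles bidirectional text; the checks simply confirm the code point and byte values listed above.

```python
import html
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def with_rlm(label):
    """Append an RLM after the closing bracket (hypothetical helper)."""
    return label + RLM


text = with_rlm("ستاكوفيرفلوو(1.0)")

print(unicodedata.name(RLM), ord(RLM), RLM.encode("utf-8").hex())
print(html.unescape("&rlm;") == RLM)
```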

Escape Unicode Characters for iOS

There are some Unicode arrangements that I want to use in my app. I am having trouble properly escaping them for use.
For instance this Unicode sequence: 🅰
If I escape it using an online tool I get: \ud83c\udd70
But of course this is an invalid sequence per the compiler:
var str = NSString.stringWithUTF8String("\ud83c\udd70")
Also if I do this:
var str = NSString.stringWithUTF8String("\ud83c")
I get an error "Invalid Unicode Scalar"
I'm trying to use these Unicode "fonts":
http://www.panix.com/~eli/unicode/convert.cgi?text=abcdefghijklmnopqrstuvwxyz
If I view the source of this website I see sequences like this:
&#x1D552
I'm struggling to wrap my head around the "proper" way to work with and escape Unicode.
I simply need to figure out a way to get these working on iOS.
Any thoughts?
\ud83c\udd70 is a UTF-16 surrogate pair which encodes the Unicode character 🅰 (U+1F170). Swift string literals do not use UTF-16, so that escape sequence doesn't make sense. And since 1F170 has five hex digits, you can't use a \uXXXX escape sequence (which only accepts four hexadecimal digits). Instead, use a \UXXXXXXXX sequence (note the capital U), which accepts eight. (This is the syntax of the early Swift betas; Swift 1.0 and later write the same scalar as "\u{1F170}".)
var str = "\U0001F170" // returns "🅰"
You can also just paste the character itself into your string:
var str = "🅰" // returns "🅰"
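The relationship between the surrogate pair and the code point is plain arithmetic. The following Python sketch (the function names are illustrative, not from any library) converts in both directions and prints the UTF-8 and UTF-16 bytes of the result.

```python
def from_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def to_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


code_point = from_surrogates(0xD83C, 0xDD70)
char = chr(code_point)

print(hex(code_point))
print(char.encode("utf-8").hex(), char.encode("utf-16-be").hex())
```

So an online tool that reports \ud83c\udd70 is showing the UTF-16 form; the value to put in a code-point escape is 1F170.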
Swift is in early beta and is broken in many ways. This issue is a Swift bug.
let ringAboveA: String = "\u0041\u030A" is Å and is accepted
let negativeSquaredA: String = "\uD83C\uDD70" is 🅰 and produces an error
Both are sequences of UTF-16 code units that Objective-C accepts. The difference is that the character 🅰 is in plane 1, so it needs a surrogate pair.
Note: to get the UTF32 code point either use the OSX Character Viewer or a code snippet:
NSLog(@"utf32: %@", [@"🅰" dataUsingEncoding:NSUTF32BigEndianStringEncoding]);
utf32: <0001f170>
To get the Character Viewer in the Apple Menu go to the "System Preferences", "Keyboard", "Keyboard" tab and select the checkbox: "Show Keyboard & Character Viewers in menu bar". The "Character View" item will be in the menu bar just to the left of the Date.
After entering the character right (control) click on the character in favorites to copy the search results.
Copied information:
🅰
NEGATIVE SQUARED LATIN CAPITAL LETTER A
Unicode: U+1F170 (U+D83C U+DD70), UTF-8: F0 9F 85 B0
Better yet: Add unicode in the list on the left and select it.

Mime encoded headers with extra '=' (==?utf-8?b?base64string?=)

This might be a silly question but... here it goes!
I wrote my own MIME parser in native C++. It's a nightmare with the encodings! It was stable for the last 3 months or so but recently I noticed this Subject: header.
Subject: =?UTF-8?B?T2ZpY2luYSBkZSBJbmZvcm1hY2nDs24sIEluaWNpYXRpdmFzIHkgUmVjbGFt?===?UTF-8?B?YWNpb25lcw==?=
which should decode to this:
Subject: Oficina de Información, Iniciativas y Reclamaciones
The problem is that there is one extra = (equals sign) in there, joining the two encoded elements, which I can't figure out. (And why two? I don't understand why they are separated.) In theory the format should be =?charset?encoding?encoded_string?=, but I found another subject that starts with two =:
==?UTF-8?B?blahblahlblah?=
How should I handle the extra =?
I could replace ==? with =? (which I am) before doing anything (and it works)... but I'm wondering if there's any kind of spec regarding this so I don't hack my way into proper functionality.
PS: How much I hate these relic protocols! All text communications should be UTF-8 and XML :)
In MIME headers, encoded words are used (RFC 2047, Section 2).
... (why 2?)
To stay within the 75-character limit on an encoded word, which exists because of the 78-character line length limit (or to use two different charsets, for example for Chinese and Polish).
RFC 2047:
An 'encoded-word' may not be more than 75 characters long,
including 'charset', 'encoding', 'encoded-text', and delimiters.
If it is desirable to encode more text than will fit in an
'encoded-word' of 75 characters, multiple 'encoded-word's
(separated by CRLF SPACE) may be used.
Here's the example from RFC2047 (note there is no '=' in between):
Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
=?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=
Your subject should be decoded as:
"Oficina de Información, Iniciativas y Reclam=aciones"
mraq's answer is incorrect. Soft line breaks apply to the 'Quoted-Printable' Content-Transfer-Encoding only, which can be used in the MIME body.
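As a rough illustration of that reading of RFC 2047, here is a minimal Python sketch that decodes each encoded word on its own and leaves everything else, including the stray =, as literal text. The regular expression and the name `decode_encoded_words` are assumptions made for this example; it is not a full RFC 2047 parser.

```python
import base64
import binascii
import re

# =?charset?encoding?encoded-text?=  (encoding is B or Q)
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def decode_encoded_words(header):
    """Decode RFC 2047 encoded words; other text passes through unchanged."""
    # Whitespace between two adjacent encoded words is not part of the text.
    header = re.sub(r"(\?=)\s+(?==\?)", r"\1", header)

    def decode_one(match):
        charset, encoding, payload = match.groups()
        if encoding.upper() == "B":
            raw = base64.b64decode(payload)
        else:
            # header=True makes "_" decode as a space, as the Q encoding requires
            raw = binascii.a2b_qp(payload, header=True)
        return raw.decode(charset)

    return ENCODED_WORD.sub(decode_one, header)


subject = ("=?UTF-8?B?T2ZpY2luYSBkZSBJbmZvcm1hY2nDs24sIEluaWNpYXRpdmFzIHkgUmVjbGFt?="
           "==?UTF-8?B?YWNpb25lcw==?=")
print(decode_encoded_words(subject))
```

Whether to keep or drop the leftover = afterwards is a policy decision for the parser; the sketch only shows that the two encoded words decode cleanly without any pre-replacement.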
It is called the "Soft Line Break" and it is the heritage of the SMTP protocol.
Quoting page 20 of RFC2045
(Soft Line Breaks) The Quoted-Printable encoding
REQUIRES that encoded lines be no more than 76
characters long. If longer lines are to be encoded
with the Quoted-Printable encoding, "soft" line breaks
must be used. An equal sign as the last character on a
encoded line indicates such a non-significant ("soft")
line break in the encoded text.
And also Wikipedia on Quoted-printable
A soft line break consists of an "=" at the end of an encoded line,
and does not appear as a line break in the decoded text.
From what I can see in the MIME RFC double equal signs are not valid input (for encoding), but keep in mind you could interpret the first equal sign as what it is and then use the following stuff for decoding. But seriously, those extra equal signs look like artifacts, maybe from an incorrect encoder.
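For comparison, this is what a genuine soft line break looks like in a Quoted-Printable body. The sketch uses Python's standard `quopri` module on a made-up body line; it is meant only to show that the trailing = belongs to the body encoding, not to header encoded words.

```python
import quopri

# "=C3=B3" is the UTF-8 encoding of "ó"; "=" followed by a newline is a soft break.
encoded = b"Oficina de Informaci=C3=B3n, Iniciativas y Reclam=\naciones"

decoded = quopri.decodestring(encoded).decode("utf-8")
print(decoded)
```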

list of garbage characters like â€™

I am using librets to retrieve data from my RETS server. Somehow the librets encoding method is not working and I am receiving some weird characters in my output. I noticed that characters like '’' are replaced with â€™. I am unable to find a fix for librets, so I decided to replace such garbage characters with their actual values after downloading the data. What I need is a list of such garbage strings and their equivalent characters. I googled for this but found no resource. Can anyone point me to a list of such garbage letters and their actual values, or a piece of code which can generate such letters?
thanx
Search for the term "UTF-8", because that's what you're seeing.
UTF-8 is a way of representing Unicode characters as a sequence of bytes. ("Unicode characters" are the full range of letters and symbols used in all human languages.) Typically, one Unicode character becomes 1, 2, or 3 bytes in UTF-8. When those bytes (numbers from 0 to 255) are displayed using the character set normally used by Windows, they appear as "garbage" -- in this case, 3 "garbage letters" which are really the 3 bytes of a UTF-8 encoding.
In your example, you started with the smart quote character ’. Its representation in Unicode is the number 8217, or U+2019 (2019 is the hexadecimal for 8217). (Search for "Unicode" for a complete list of Unicode characters and their numbers.) The UTF-8 representation of the number 8217 is the three byte sequence 226, 128, 153. And when you display those three bytes as characters, using the Windows "CP-1252" character encoding (the ordinary way of displaying text on Windows in the USA), they appear as â€™. (Search for "CP-1252" to see a table of bytes and characters.)
I don't have any list for you. But you could make one if you wrote a program in a language that has built-in support for Unicode and UTF-8. All I can do is explain what you are seeing.
If there is a way to tell librets to use UTF-8 when downloading, that might automatically solve your problem. I don't know anything about librets, but now that you know the term "UTF-8" you might be able to make progress.
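A program of the kind described above can be quite short. This Python sketch assumes the garbage comes from UTF-8 bytes being read as cp1252, as in the example; the function names and the sample characters are chosen for illustration only.

```python
def to_mojibake(text):
    """Show how UTF-8 text looks when its bytes are misread as cp1252."""
    # errors="replace" covers the few byte values cp1252 leaves unassigned
    return text.encode("utf-8").decode("cp1252", errors="replace")


def repair_mojibake(garbled):
    """Reverse the misreading: back to bytes via cp1252, then decode as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


def build_table(chars):
    """Map each garbage string to the character it stands for."""
    return {to_mojibake(ch): ch for ch in chars}


print(list("’".encode("utf-8")))  # the three bytes behind the smart quote

table = build_table("’‘“—é")
for garbage, actual in table.items():
    print(repr(garbage), "->", actual)

print(repair_mojibake("â€™"))
```

If the whole string was garbled the same way, `repair_mojibake` on the full text may be simpler than a lookup table, though fixing the decoding at the source remains the better option.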
Question reminder:
"...I noticed characters like '’' is replaced with â€™... i decided to
replace such garbage characeters with actual values after downloading
data. What I need is a list of such garbage string and their
equivalent characters."
Strictly dealing with this part:
"What I need is a list of such garbage string and their equivalent
characters."
Using PHP, you can generate these characters and their equivalents. Working with all 1,111,998 Unicode code points or 109,449 UTF-8 symbols is impractical. In the following loop you may use the range between &#128 and &#258, or another range that is more relevant to your context.
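<?php
$tmp1 = ''; // accumulates the table rows
for ($i = 128; $i < 258; $i++) {
    // columns: the entity as text, its UTF-8 bytes (seen as "garbage"), the rendered symbol
    $tmp1 .= "<tr><td>".htmlentities("&#$i;")."</td><td>".html_entity_decode("&#".$i.";",ENT_NOQUOTES,"utf-8")."</td><td>&#".$i.";</td></tr>";
}
echo "<table border=1>
<tr><td>&#</td><td>\"Garbage\"</td><td>symbol</td></tr>";
echo $tmp1;
echo "</table>";
?>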
From experience, in an ASCII context, most "garbage" symbols originate in the range &#128 to &#257 + (seldom) &#8129 to &#8246.
In order for the "garbage" symbols to display, the HTML page charset must be set to iso-8859-1 or whichever other charset caused the problem in the first place. They will not show if the charset is set to utf-8.
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
"i decided to replace such garbage characeters with actual values
after downloading data"
You CANNOT undo the "garbage" with php utf8_decode(), which would actually create more "garbage" on already "garbage". But, you may use the simple and fast search and replace php str_replace() function.
First, generate 2 arrays for each set of "garbage" symbols you wish to replace. The first array is the Search term:
<?php
//ISO 8859-1 (Latin-1) special chars are found in the range 128 to 257
$tmp1 = "\$SearchArr = array(";
for ($i=128; $i<258; $i++)
$tmp1 .= "\"".html_entity_decode("&#".$i.";",ENT_NOQUOTES,"utf-8")."\", ";
$tmp1 = substr($tmp1,0,strlen($tmp1)-2);//erases last comma
$tmp1 .= ");";
$tmp1 = htmlentities($tmp1,ENT_NOQUOTES,"utf-8");
?>
The second array is the replace term:
<?php
//Adapt for your relevant range.
$tmp2 = "\$ReplaceArr = array(\n";
for ($i=128; $i<258; $i++)
$tmp2 .= "\"&#".$i.";\", ";
$tmp2 = substr($tmp2,0,strlen($tmp2)-2);//erases last comma
$tmp2 .= ");";
echo $tmp1."\n<br><br>\n";
echo $tmp2."\n";
?>
Now, you've got 2 arrays that you can copy and paste to use and reuse to clean any of your infected strings like this:
$InfectedString = str_replace($SearchArr,$ReplaceArr,$InfectedString);
Note: utf8_decode() is of no help for cleaning up "garbage" symbols. But, it can be used to prevent further contamination. Alternatively a mb_ function can be useful.

Problem with ord () and string

I am having this problem. If I have:
mychr = ' ';
where the 'space' in mychr is equivalent to #255 (typed manually with ALT+255), and I write:
myord = ord (mychr)
then myord gets the value 160 and not 255. Of course, the same problem occurs with the character ALT+254, etc.
How can I solve this problem? I have tested this on Delphi XE in console mode.
Note: if I use:
mychar = #255;
then the function ord() returns the value correctly.
I think the problem is that the Windows Alt+Num shortcuts insert characters according to the local codepage, whereas modern Delphi uses Unicode characters, and these differ (unless the value is less than or equal to 127, I think). The solution is to enter the value #255 explicitly in code. In addition, it is a very bad habit to include 'invisible' special characters in code, because you cannot tell what character it is without copying it into an external tool! You would also have to trust the text encoding of the .pas file. It is much better to use constants like #255. Even better, do
const
MY_PRECIOUS_VALUE = #255;
and use this constant every time you need it.
Update
According to the English Wikipedia article on Alt code:
If the number typed has a leading 0
(zero), the character set used is the
Windows code page that matches the
current input locale. For most systems
using the Latin alphabet, this is
Windows-1252. For a complete list, see
code page. If the number does not have
a leading 0 (zero), DOS compatibility
is invoked. The character set used is
the DOS code page for the current
input locale. For systems using
English, this is code page 437. For
most other systems using the Latin
alphabet, this is code page 850. For a
complete list, see code page.
So, if you really, really want to continue entering Alt keycodes, you'd better type Alt and 0255 with the leading zero.
If you type ALT+255, DOS codepage is used; for 437 and 850 DOS codepages (one of which you probably use) #255 is NBSP (non-breaking space). In Unicode, NBSP is $A0 (160). That explains why you obtain Ord 160.
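The code page lookup can be checked with a few lines of Python, assuming its built-in `cp437`, `cp850` and `cp1252` codecs match the Windows tables (the variable names are illustrative):

```python
# Alt+255 without a leading zero goes through the DOS (OEM) code page.
alt_255 = bytes([255]).decode("cp437")

# Alt+0255 with a leading zero goes through the Windows code page.
alt_0255 = bytes([255]).decode("cp1252")

print(ord(alt_255), ord(alt_0255))
print(ord(bytes([255]).decode("cp850")))
```

This is the same 160 that ord() returns in the question: the character stored in the source file is already U+00A0, not character 255.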
AFAIK console mode uses the OEM Ansi char set. And under Delphi XE, you're not in the Ansi world, but in the UCS-2 / Unicode world.
var MyChar: char;
MyWideChar: WideChar;
MyAnsiChar: AnsiChar;
begin
MyChar := #255;
MyWideChar := #255;
MyAnsiChar := #255;
The first two variables are the same, i.e. a character with Unicode code 255 = $00FF, since in Delphi XE, char = WideChar. For the first Unicode Page, see this article.
But MyAnsiChar is what will be displayed on the console, after conversion from the current code page into the OEM console code page.
In the Unicode chart, this $00FF is a minuscule y with trema:
U+00FF ÿ Latin Small Letter Y with diaeresis
Under the console, you'll use the OEM char set, i.e. Code Page 437. So in your case $FF is NOT a character, but a special code
FF NBSP Non Breaking SPace
which is converted into U+00A0 when converted back to Unicode:
U+00A0 NBSP Non Breaking SPace
It is very likely that you are in a Windows-1252 code page, so normally the Delphi XE AnsiString will map #255 into a minuscule y with trema:
FF ÿ Latin Small Letter Y with diaeresis
You can use low-level e.g. CharToOemBuff windows functions to perform the conversion to or from OEM, or use an OEM AnsiString type:
type
TOemString = AnsiString(437);
In all cases, the console is not the best way of entering accented text under modern Windows and Unicode Delphi XE.
Using the InputQuery function, for example, should be safer, since it will return a Unicode string variable. ;)
