list of garbage characters like ’ - character-encoding

I am using librets to retrieve data form my RETS Server. Somehow librets Encoding method is not working and I am receiving some weird characters in my output. I noticed characters like '’' is replaced with ’. I am unable to find a fix for librets so i decided to replace such garbage characeters with actual values after downloading data. What I need is a list of such garbage string and their equivalent characters. I googled for this but not found any resource. Can anyone point me to the list of such garbage letters and their actual values or a piece of code which can generate such letter.
thanx

Search for the term "UTF-8", because that's what you're seeing.
UTF-8 is a way of representing Unicode characters as a sequence of bytes. ("Unicode characters" are the full range of letters and symbols used all in human languages.) Typically, one Unicode character becomes 1, 2, or 3 bytes in UTF-8. When those bytes (numbers from 0 to 255) are displayed using the character set normally used by Windows, they appear as "garbage" -- in this case, 3 "garbage letters" which are really the 3 bytes of a UTF-8 encoding.
In your example, you started with the smart quote character ’. Its representation in Unicode is the number 8217, or U+2019 (2019 is the hexadecimal for 8217). (Search for "Unicode" for a complete list of Unicode characters and their numbers.) The UTF-8 representation of the number 8217 is the three byte sequence 226, 128, 153. And when you display those three bytes as characters, using the Windows "CP-1252" character encoding (the ordinary way of displaying text on Windows in the USA), they appear as ’. (Search for "CP-1252" to see a table of bytes and characters.)
I don't have any list for you. But you could make one if you wrote a program in a language that has built-in support for Unicode and UTF-8. All I can do is explain what you are seeing.
If there is a way to tell librets to use UTF-8 when downloading, that might automatically solve your problem. I don't know anything about librets, but now that you know the term "UTF-8" you might be able to make progress.

Question reminder:
"...I noticed characters like '’' is replaced with ’... i decided to
replace such garbage characeters with actual values after downloading
data. What I need is a list of such garbage string and their
equivalent characters."
Strictly dealing with this part:
"What I need is a list of such garbage string and their equivalent
characters."
Using php, you can generate these characters and their equivalence. Working with all 1,111,998 Unicode points or 109,449 Utf8 symbols is impractical. You may use the ASCII range in the following loop between &#128 and &#258 or another range that is more relevant to your context.
<?php
for ($i=128; $i<258; $i++)
$tmp1 .= "<tr><td>".htmlentities("&#$i;")."</td><td>".html_entity_decode("&#".$i.";",ENT_NOQUOTES,"utf-8")."</td><td>&#".$i.";</td></tr>";
echo "<table border=1>
<tr><td>&#</td><td>"Garbage"</td><td>symbol</td></tr>";
echo $tmp1;
echo "</table>";
?>
From experience, in an ASCII context, most "garbage" symbols originate in the range &#128 to &#257 + (seldom) &#8129 to &#8246.
In order for the "garbage" symbols to display, the html page charset must be set to iso-1 or whichever other charset that caused the problem in the first place. They will not show if the charset is set to utf-8.
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
.
"i decided to replace such garbage characeters with actual values
after downloading data"
You CANNOT undo the "garbage" with php utf8_decode(), which would actually create more "garbage" on already "garbage". But, you may use the simple and fast search and replace php str_replace() function.
First, generate 2 arrays for each set of "garbage" symbols you wish to replace. The first array is the Search term:
<?php
//ISO 8859-1 (Latin-1) special chars are found in the range 128 to 257
$tmp1 = "\$SearchArr = array(";
for ($i=128; $i<258; $i++)
$tmp1 .= "\"".html_entity_decode("&#".$i.";",ENT_NOQUOTES,"utf-8")."\", ";
$tmp1 = substr($tmp1,0,strlen($tmp1)-2);//erases last comma
$tmp1 .= ");";
$tmp1 = htmlentities($tmp1,ENT_NOQUOTES,"utf-8");
?>
The second array is the replace term:
<?php
//Adapt for your relevant range.
$tmp2 = "\$ReplaceArr = array(\n";
for ($i=128; $i<258; $i++)
$tmp2 .= "\"&#".$i.";\", ";
$tmp2 = substr($tmp2,0,strlen($tmp2)-2);//erases last comma
$tmp2 .= ");";
echo $tmp1."\n<br><br>\n";
echo $tmp2."\n";
?>
Now, you've got 2 arrays that you can copy and paste to use and reuse to clean any of your infected strings like this:
$InfectedString = str_replace($SearchArr,$ReplaceArr,$InfectedString);
Note: utf8_decode() is of no help for cleaning up "garbage" symbols. But, it can be used to prevent further contamination. Alternatively a mb_ function can be useful.

Related

How to convert a formatted string into plain text

User copy paste and send data in following format: "𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖"
I need to convert it into plain txt (we can say ascii chars) like 'jovy debbie'
It comes in different font and format:
ex:
'𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔'
'𝙶𝚎𝚟𝚒𝚎𝚕𝚢𝚗 𝙽𝚒𝚌𝚘𝚕𝚎 𝙻𝚞𝚖𝚋𝚊𝚐'
Any Help will be Appreciated, I already refer other stack overflow question but no luck :(
Those letters are from the Mathematical Alphanumeric Symbols block.
Since they have a fixed offset to their ASCII counterparts, you could use tr to map them, e.g.:
"𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖".tr("𝕒-𝕫", "a-z")
#=> "jovy debbie"
The same approach can be used for the other styles, e.g.
"𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔".tr("𝒂-𝒛𝑨-𝒁", "a-zA-Z")
#=> "Jenica Dugos"
This gives you full control over the character mapping.
Alternatively, you could try Unicode normalization. The NFKC / NFKD forms should remove most formatting and seem to work for your examples:
"𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖".unicode_normalize(:nfkc)
#=> "jovy debbie"
"𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔".unicode_normalize(:nfkc)
#=> "Jenica Dugos"

AES encrypt string appear \r\n

When I use AES128 encrypt string, if the encrypted string is too long then it will contain \r\n in it. like this
Now I have to use empty string to replace it. Why does the encrypt-string contain the \r\n and any better way to avoid it or fix it.
Thanks.
Answers: it's caused by the Base64 encoding process, every 64 characters will insert a \r\n .
That is a Base64 encoded string.
Actual encryption output is an array of 8-bit bytes, not characters. The code is Base64 encoding the encrypted data with an option to insert line breaks every 64 characters, this is sometimes to allow better printing of the output. When it is decoded use the NSDataBase64DecodingIgnoreUnknownCharacters option to remove line breaks .
In particular for Objective-C the to create a Base64 string from NSData is:
- (NSString *)base64EncodedStringWithOptions:(NSDataBase64EncodingOptions)options
The options include:
NSDataBase64Encoding64CharacterLineLength
Set the maximum line length to 64 characters, after which a line ending is inserted.
Which inserts "\r\n" (carriage return, new line) characters each 64 characters.
If that is not what you want pass 0 as the option value.
To decode Base64 use the Objective-C method:
- (instancetype)initWithBase64EncodedString:(NSString *)base64String options:(NSDataBase64DecodingOptions)options
With the option: NSDataBase64DecodingIgnoreUnknownCharacters.
Apple code:
The default implementation of this method will reject non-alphabet characters, including line break characters. To support different encodings and ignore non-alphabet characters, specify an options value of NSDataBase64DecodingIgnoreUnknownCharacters.
The thing that gives it away as a Base64 string is a length that is a multiple of 4, the characters used "a-zA-Z0-9+/" and the trailing "=" characters.
Historic note: These days on OSX and iOS line breaks are a single "\n" (0x0a) line feed character. Back when we used teletypes as terminals "\r" (0x0d) carriage return moved the carriage or print head back but did not move the paper up to the next line. "\n" newline moved the paper up one line but did move the carriage or print head back. They were two distinct mechanical operations. Later some systems used either "\r\n", "\n\r", "\r" or "\n". Unix choose "\n" and thus OSX and iOS.

ascii character not showing in browser

I have an MVC Razor view
#{
ViewBag.Title = "Index";
var c = (char)146;
var c2 = (short)'’';
}
<h2>#c --- #c2 --’-- ‘Why Oh Why’ & ’</h2>
#String.Format("hi {0} there", (char)146)
characters stored in my database in varchar fields are not rendering to the browser.
This example demonstrates how character 146 doesn't show up
How do I make them render?
[EDIT]
When I do this the character 146 get converted to UNICODE 8217 but if 146 is attempted to be rendered directly on the browser it fails
public ActionResult Index()
{
using (var context = new DataContext())
{
var uuuuuggghhh = (from r in context.Projects
where r.bizId == "D11C6FD5-D084-43F0-A1EB-76FEED24A28F"
select r).FirstOrDefault();
if (uuuuuggghhh != null)
{
var ca = uuuuuggghhh.projectSummaryTxt.ToCharArray();
ViewData.Model = ca[72]; // this is the character in question
return View();
}
}
return View();
}
#Html.Raw(((char)146).ToString())
or
#Html.Raw(String.Format("hi {0} there", (char)146))
both appear to work. I was testing this in Chrome and kept getting blank data, after viewing with FF I can confirm the representation was printing (however 146 doesn't appear to be a readable character).
This is confirmed with a readable character '¶' below:
#Html.Raw(((char)182).ToString())
Not sure why you would want this though. But best of luck!
You do not want to use character 146. Character 146 is U+0092 PRIVATE USE TWO, an obscure and useless control character that typically renders as invisible, or a missing-glyph box/question mark.
If you want the character ’: that is U+2019 SINGLE RIGHT QUOTATION MARK, which may be written directly or using ’ or ’.
146 is the byte number of the encoding of U+2019 into the Windows Western code page (cp1252), but it is not the Unicode character number. The bottom 256 Unicode characters are ordered the same as the bytes in the ISO-8859-1 encoding; ISO-8859-1 is similar to cp1252 but not the same.
Bytes 128–159 in cp1252 encode various typographical niceties like smart quotes, whereas bytes 128–159 in ISO-8859-1 (and hence characters 128–159 in Unicode) are seldom-used control characters. For web applications, you usually want to filter out the control characters (0–31 and 128–159 amongst a few others) as they come in, so they never get as far as the database.
If you are getting character 146 out of your database where you expect to have a smart quote, then you have corrupt data and you need to fix it up before continuing, or possibly you are reading the database using the wrong encoding (quite how this works depends what database you're talking to).
Now here's the trap. If you write:
’
as a character reference, the browser actually displays the smart quote U+2019 ’, and, confusingly, not the useless control character that actually owns that code point!
This is an old browser quirk: character references in the range € to Ÿ are converted to the character that maps to that number in cp1252, instead of the real character with that number.
This was arguably a bug, but the earliest browsers did it back before they grokked Unicode properly, and everyone else was forced to follow suit to avoid breaking pages. HTML5 now documents and sanctions this. (Though not in the XHTML serialisation; browsers in XHTML parsing mode won't do this because it's against the basic rules of XML.)
We finally agreed that the data was corrupt we have asked users who can't see this character rendered to fix the source data

Is it possible to use double quotes in CSV files on iPad?

I am having issues with the special CSV interpreter (no idea what its called) on iPad mobile browser.
iPad appears to reserve the character " as reserved or special. When this character appears the string is treated as a literal instead of seperated as a CSV.
INPUT:
1111,64-1111-11,Some Tool 12", 112233
Give the input above, the CSV mobile-safari display shows ([] represents a column)
[1111] [64-1111-11] [Some Tool 12, 112233]
Note that the " is missing. Also note that 112233 is not in its own column like it should be.
Question 2:
How can I get the CSV display tool in safari to not treat a six digit number as a phone number?
1234567
Shows up as a hyperlink and asks to "Add Contact" when I click it. I do not want the hyperlink.
UPDATE
iPad is ignoring the escape character (or backslash is not the escape character) for double quotes in CSV files. I am looking at the hex version of the file and I have
\" or 5C 22 (in hex with UTF-8 encoding).
Unfortuntely, the iPad displays the backslash and still treats " as a special character, thereby corrupting my data formatting. Anybody know how I can possibly use " on iPad CSV?
With regards the quotes, have you tried escaping them in the output?
EDIT: conventional escaping doesn't work for CSV files, my apologies. Most specifications state the following:
Fields that contain a special character (comma, newline, or double quote), must be enclosed in double quotes.
So, testing this on your CSV snippet, a file formatted like this:
1111,64-1111-11,"Some Tool 12""", 112233
or even like this:
1111,64-1111-11,Some Tool 12"""", 112233
… opens in Mobile Safari OK. How good or bad that looks in Excel you'd need to check.
Moving to the second issue, to prevent Mobile Safari from presenting numbers as phone numbers, add this to your page's head element:
<meta name="format-detection" content="telephone=no" />

Parsing \"–\" with Erlang re

I've parsed an HTML page with mochiweb_html and want to parse the following text fragment
0 – 1
Basically I want to split the string on the spaces and dash character and extract the numbers in the first characters.
Now the string above is represented as the following Erlang list
[48,32,226,128,147,32,49]
I'm trying to split it using the following regex:
{ok, P}=re:compile("\\xD2\\x80\\x93"), %% characters 226, 128, 147
re:split([48,32,226,128,147,32,49], P, [{return, list}])
But this doesn't work; it seems the \xD2 character is the problem [if I remove it from the regex, the split occurs]
Could someone possibly explain
what I'm doing wrong here ?
why the '–' character seemingly requires three integers for representation [226, 128, 147]
Thanks.
226,128,147 is E2,80,93 in hex.
> {ok, P} = re:compile("\xE2\x80\x93").
...
> re:split([48,32,226,128,147,32,49], P, [{return, list}]).
["0 "," 1"]
As to your second question, about why a dash takes 3 bytes to encode, it's because the dash in your input isn't an ASCII hyphen (hex 2D), but is a Unicode en-dash (hex 2013). Your code is recieving this in UTF-8 encoding, rather than the more obvious UCS-2 encoding. Hex 2013 comes out to hex E28093 in UTF-8 encoding.
If your next question is "why UTF-8", it's because it's far easier to retrofit an old system using 8-bit characters and null-terminated C style strings to use Unicode via UTF-8 than to widen everything to UCS-2 or UCS-4. UTF-8 remains compatible with ASCII and C strings, so the conversion can be done piecemeal over the course of years, or decades if need be. Wide characters require a "Big Bang" one-time conversion effort, where everything has to move to the new system at once. UTF-8 is therefore far more popular on systems with legacies dating back to before the early 90s, when Unicode was created.

Resources