Why do quotes turn into funny characters when submitted in an HTML form? - character-encoding

I have an HTML form, and some users are copy/pasting text from MS Word. When there are single quotes or double quotes, they get translated into funny characters like:
â€™ and ’
The database column is collation utf8_general_ci.
How do I get the appropriate characters to show up?
Edit:
Problem solved. Here's how I fixed it:
Ran mysql_query("SET NAMES 'utf8'"); before adding/retrieving from the database (thanks to Donal's comment below).
Somewhat oddly, the PHP function urlencode($text) was being applied when displaying the text, so that had to be removed.
I also made sure that the headers for the page and the AJAX request/response were all UTF-8.

This looks like a classic case of Unicode (UTF-8, most likely) characters being interpreted as ISO-8859-1. There are a couple of places along the way where the characters can get corrupted. First, the client's browser has to send the data; it might corrupt the data if it can't properly convert the characters to the page's character encoding. Then the server reads the data and decodes the bytes into characters; if the client and server disagree about the encoding used, the characters will be corrupted. Then the data is stored in the database, where again there is potential for corruption. Finally, when the data is written to the page for display, the browser may misinterpret the bytes if the page doesn't adequately indicate its encoding.
You need to ensure that you are using UTF-8 throughout. The default for web pages is ISO-8859-1, so your web pages should be served with the Content-Type header or the meta tag
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
(make sure you really are serving the text in that encoding).
By using UTF-8 at every step of the process you will avoid problems with all working web browsers and databases.
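As a minimal sketch of what the asker's fix looks like in practice (the connection details and table/column names here are placeholders; mysqli's set_charset() is the modern equivalent of the SET NAMES query mentioned in the edit above):
<?php
// Declare UTF-8 to the browser before any output.
header('Content-Type: text/html; charset=utf-8');

// Tell MySQL that this connection sends and expects UTF-8 bytes.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');
$db->set_charset('utf8');

// Store the submitted form text without any re-encoding
// (and no urlencode() when displaying it later, either).
$stmt = $db->prepare('INSERT INTO comments (body) VALUES (?)');
$stmt->bind_param('s', $_POST['body']);
$stmt->execute();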

Check the encoding that the page uses. Encode it using UTF-8 as well, and add a meta tag describing the encoding:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

We have a PHP function that tries to clean up the mess with smart quotes. It's a bit of a mess itself, since it grew organically as cases popped up during prototype development. It may be of some help, though:
function convert_smart_quotes($string) {
    // UTF-8 byte sequences for Word's "smart" punctuation, plus some
    // double-encoded (mojibake) forms we've seen pasted into forms.
    $search = array(chr(0xe2) . chr(0x80) . chr(0x98), // U+2018 left single quote
                    chr(0xe2) . chr(0x80) . chr(0x99), // U+2019 right single quote
                    chr(0xe2) . chr(0x80) . chr(0x9c), // U+201C left double quote
                    chr(0xe2) . chr(0x80) . chr(0x9d), // U+201D right double quote
                    chr(0xe2) . chr(0x80) . chr(0x93), // U+2013 en dash
                    chr(0xe2) . chr(0x80) . chr(0x94), // U+2014 em dash
                    chr(226) . chr(128) . chr(153),    // U+2019 again, in decimal
                    'â€™', 'â€œ',                      // double-encoded single/double quotes
                    'â€' . chr(0x9d),                  // double-encoded right double quote (0x9d is unprintable)
                    'â€"',                             // double-encoded dash
                    'Â ');                             // double-encoded non-breaking space
    $replace = array("'", "'", '"', '"', ' - ', ' - ',
                     "'", "'", '"', '"', ' - ', ' ');
    return str_replace($search, $replace, $string);
}
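For example, with the curly quotes written out as raw UTF-8 bytes (a made-up input):
echo convert_smart_quotes("\xe2\x80\x98Hi\xe2\x80\x99 \xe2\x80\x9cthere\xe2\x80\x9d");
// prints: 'Hi' "there"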

Related

Using Umlaut or special characters in ibm-doors from batch

We have a link module that looks something like this:
const string lMod = "/project/_admin/somethingÜ" // Umlaut
We later use lMod like this to loop through the outlinks:
for a in obj->lMod do {}
But this only works when executed directly from DOORS, not from a batch script: the batch run for some reason doesn't recognize the umlaut, so the body of the loop is never run. Replacing lMod with "*" works, and also shows the objects linked to via lMod.
We are already using UTF-8 encoding for the file:
pragma encoding, "UTF-8"
Any solutions are welcome.
Encode the file as UTF-8 in Notepad++ by going to Encoding > Convert to UTF-8. (Make sure it's not already set to UTF-8 before you do it).

What encoding is this and how do I turn it into something I can see properly?

I'm writing a script that will operate on the subtitle files of a popular streaming service (Netfl*x).
The subtitle files have strange characters in them, and I can't get my text editors or web browser to display them readably. The XML declaration says UTF-8, but some characters are not readable.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<tt xmlns:tt="http://www.w3.org/ns/ttml" xmlns:ttm="http://www.w3.org/ns/ttml#metadata" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" xmlns:tts="http://www.w3.org/ns/ttml#styling" ttp:tickRate="10000000" ttp:timeBase="media" xmlns="http://www.w3.org/ns/ttml">
<p>de 15 % la nuit dernière.</span></p>
<p>if youâve got things to doâ¦</span></p>
It looked the same in Vim and in the browser (screenshots not reproduced here).
How can I convert this into something I can use?
I'll go out on a limb and say that the file is UTF-8 encoded just fine, and you're merely looking at it using the wrong encoding. The character À encoded in UTF-8 is C3 80. C3 in ISO-8859-1 is Ã, which in your screenshot is followed by an 80. So it looks like you're viewing a UTF-8 file using the (wrong) ISO-8859 encoding.
Use the correct encoding when opening the file.
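A quick way to reproduce that diagnosis (a PHP sketch; the byte values are the ones discussed above):
$utf8 = "\xc3\x80";                      // the UTF-8 bytes for "À"
// Interpreting those bytes as ISO-8859-1 (re-encoded to UTF-8 for display)
// yields "Ã" followed by the invisible control character U+0080,
// which is exactly the kind of mojibake shown in the subtitles.
echo iconv('ISO-8859-1', 'UTF-8', $utf8);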
My terminal is set to en_US.UTF-8, but it was also rendering this supposedly UTF-8 encoded file incorrectly (sonné appeared as sonnÃ©). I was able to solve this by using iconv to encode the file as ISO8859-1.
iconv original.xml -t ISO8859-1 -o converted.xml
In the new file, the characters were properly rendered, although I don't quite understand why.
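The likely reason this works is that the file was double-encoded: its text had at some point been misread as ISO-8859-1 and re-encoded as UTF-8. Converting it back to ISO-8859-1 strips one layer, leaving plain, valid UTF-8. A PHP sketch of the same round trip (the sample string is invented):
$double = "sonn\xc3\x83\xc2\xa9";           // double-encoded "sonné", shown by a UTF-8 terminal as "sonnÃ©"
// Decoding as UTF-8 and emitting ISO-8859-1 bytes undoes one layer:
// the result is the bytes C3 A9, i.e. valid single-encoded UTF-8 "é".
echo iconv('UTF-8', 'ISO-8859-1', $double); // prints "sonné" on a UTF-8 terminal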

Strange URL containing 'A=0 or '0=A in web server logs

Over the last weekend, some of my sites logged errors implying incorrect usage of our URLs:
...news.php?lang=EN&id=23'A=0
or
...news.php?lang=EN&id=23'0=A
instead of
...news.php?lang=EN&id=23
I found only one page that mentioned this (https://forums.adobe.com/thread/1973913), where they speculated that the additional query string comes from GoogleBot or an encoding error.
I recently changed my sites to use PDO instead of mysql_*. Maybe this change caused the errors? Any hints would be useful.
Additionally, all of the requests come from the same user-agent shown below.
Mozilla/5.0 (Windows; U; Windows NT 5.1; pt-PT; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)
This led me to find the following threads:
pt-BR
and
Strange parameter in URL - what are they trying?
It is a bot testing for SQL injection vulnerabilities by closing a query with an apostrophe, then setting a variable. There are also similar injects that deal with shell commands and/or file path traversals. Whether it's a "good bot" or a bad bot is unknown, but if the inject works, you have bigger issues to deal with. There's a 99% chance your site is not generating these links, and there is nothing you can do to stop bots from crafting those URLs, unless you block the requests with a simple regex string or a more complex WAF such as ModSecurity.
Blocking based on user agent is not an effective angle. You need to look at the request heuristics and block based on those instead. Some examples of things to look for in the URL/request/POST/referrer, as both UTF-8 and hex characters:
double apostrophes
double periods, especially followed by a slash in various encodings
words like "script", "etc" or "passwd"
paths like dev/null used with piping/echoing shell output
%00 null-byte-style characters used to initiate a new command
http in the url more than once (unless your site uses it)
anything regarding cgi (unless your site uses it)
random "enterprise" paths for things like coldfusion, tomcat, etc
If you aren't using a WAF, here is a regex concat that should capture many of those within a URL. We use it in PHP apps, so you may (or will) need to tweak some of the escapes depending on where you are using it. Note that this has .cgi, WordPress, and wp-admin along with a bunch of other stuff in the regex; remove them if you need to.
$invalid = "(\(\))"; // lets not look for quotes. [good]bots use them constantly. looking for () since technically parenthesis arent valid
$period = "(\\002e|%2e|%252e|%c0%2e|\.)";
$slash = "(\\2215|%2f|%252f|%5c|%255c|%c0%2f|%c0%af|\/|\\\)"; // http://security.stackexchange.com/questions/48879/why-does-directory-traversal-attack-c0af-work
$routes = "(etc|dev|irj)" . $slash . "(passwds?|group|null|portal)|allow_url_include|auto_prepend_file|route_*=http";
$filetypes = $period . "+(sql|db|sqlite|log|ini|cgi|bak|rc|apk|pkg|deb|rpm|exe|msi|bak|old|cache|lock|autoload|gitignore|ht(access|passwds?)|cpanel_config|history|zip|bz2|tar|(t)?gz)";
$cgis = "cgi(-|_){0,1}(bin(-sdb)?|mod|sys)?";
$phps = "(changelog|version|license|command|xmlrpc|admin-ajax|wsdl|tmp|shell|stats|echo|(my)?sql|sample|modx|load-config|cron|wp-(up|tmp|sitemaps|sitemap(s)?|signup|settings|" . $period . "?config(uration|-sample|bak)?))" . $period . "php";
$doors = "(" . $cgis . $slash . "(common" . $period . "(cgi|php))|manager" . $slash . "html|stssys" . $period . "htm|((mysql|phpmy|db|my)admin|pma|sqlitemanager|sqlite|websql)" . $slash . "|(jmx|web)-console|bitrix|invoker|muieblackcat|w00tw00t|websql|xampp|cfide|wordpress|wp-admin|hnap1|tmunblock|soapcaller|zabbix|elfinder)";
$sqls = "((un)?hex\(|name_const\(|char\(|a=0)";
$nulls = "(%00|%2500)";
$truth = "(.{1,4})=\1"; // catch OR always-true (1=1) clauses via sql inject - not used atm, its too broad and may capture search=chowder (ch=ch) for example
$regex = "/$invalid|$period{1,2}$slash|$routes|$filetypes|$phps|$doors|$sqls|$nulls/i";
Using it, at least with PHP, is pretty straightforward with preg_match_all(). Here is an example of how you can use it: https://gist.github.com/dhaupin/605b35ca64ca0d061f05c4cf423521ab
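As a minimal sketch of that usage (variable names ours; the linked gist is the fuller version):
// Test the decoded request URI against the $regex built above.
$uri = rawurldecode($_SERVER['REQUEST_URI']);
if (preg_match_all($regex, $uri, $matches)) {
    http_response_code(403);   // block it; or log/ban per your policy
    exit;
}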
WARNING: Be careful if you set this to autoban (i.e., a fail2ban filter). MS/Bing DumbBots (and others) often muck up URLs by entering things like strange triple dots from following truncated URLs, or by trying to hit a tel: link as a URI. I don't know why. Here is what I mean: a link with the text www.example.com/link-too-long...truncated.html may point to a correct URL, but Bing may try to access it "as it looks" instead of following the href, resulting in a WAF hit due to double dots.
Since this is a very old version of Firefox, I blocked it in my .htaccess file:
RewriteCond %{HTTP_USER_AGENT} Firefox/3\.5\.2 [NC]
RewriteRule .* err404.php [R,L]

Rails and html - encoding error

In my template (index.html.erb) there is no line specifying an encoding like UTF-8 or anything, so Rails of course has a problem with that:
Your template was not saved as valid UTF-8. Please either specify UTF-8 as the encoding for your template in your text editor, or mark the template with its encoding by inserting the following as the first line of the template:
# encoding: <name of correct encoding>.
So I tried to paste this into my HTML:
# encoding: < meta charset=utf-8 />.
Did I write something wrong, or can I use some other code?
The answer from Rails is: unknown encoding name - <
Thanks for answering.
Check if the file itself is saved as UTF-8; this was my problem.
Using "e" as a text editor, or Notepad++ (or any other Windows tool) with the wrong configuration, may be the cause.
e (and, I think, Notepad++) was configured to save the files as "Windows DOS OEM (CP 437)".
I changed this in the settings to UTF-8, saved all files (without changes), and it works.
In your application.rb file, paste this line:
config.encoding = "utf-8"
Or in your application.html.erb file, paste the following line:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The meta charset is for HTML. You need to specify the charset for Ruby; you can do that with a comment like this:
# encoding: utf-8
Using RubyMine 4, I accidentally right-clicked and didn't see what I had pressed... BANG! Code gone, plus one error! I had re-encoded the page into some crazy encoding. Just right-click in the page and choose to save the '[your crazy encoding here]' file in another encoding.
RubyMine: too easy! ;)
That solved it for me. It was my file that was not being saved as UTF-8 after all.
https://superuser.com/questions/581553/sublime-text-2-encoding-utf-8

HTML decoding in C/C++

I'm using libcurl for getting HTML pages.
I have some problems with Hebrew characters.
For example, this: סלקום
comes out as gibberish.
How do I get Hebrew characters and not gibberish?
Do I need some HTML decoder?
Does libcurl support such operation?
Does libiconv support such operation?
I appreciate any help.
Thanks
Edit: OK, so what you're seeing is UTF-8 data being decoded as Windows-1255 (so the numeric character references were a red herring). Here's a demonstration in Python:
>>> u = ''.join(map(unichr, [1505, 1500, 1511, 1493, 1501]))
>>> s = u.encode('utf-8')
>>> print s.decode('cp1255', 'replace')
׳¡׳�׳§׳•׳�
The solution to this problem depends on the environment in which the output is displayed. Merely outputting the bytes received and expecting them to be interpreted as characters leads to problems like this.
An HTML document typically contains a tag in its head like <meta charset=utf-8> to indicate to the browser what its encoding is. A document served by a web server is also accompanied by an HTTP header like Content-Type: text/html; charset=utf-8.
You should ask libcurl for the Content-Type HTTP header to know the encoding of the document, and then convert it to the system encoding using iconv. While in your case that would be codepage 1255, it depends on the user’s system and so you should look up the appropriate functions to detect that.
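The question is tagged C/C++, but as a rough sketch of the same flow using PHP's curl and iconv bindings (the URL and the fallback charset are placeholders), it looks like this; in C you would use curl_easy_getinfo(CURLINFO_CONTENT_TYPE) and iconv(3) in the same way:
$ch = curl_init('http://example.com/page.html');    // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);
// Ask for the Content-Type the server sent, e.g. "text/html; charset=windows-1255".
$contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
curl_close($ch);

$charset = 'windows-1255';                          // fallback guess
if (preg_match('/charset=([\w-]+)/i', (string) $contentType, $m)) {
    $charset = $m[1];
}
// Convert the document to UTF-8 (or your system encoding) before display.
echo iconv($charset, 'UTF-8//TRANSLIT', $body);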
(Read Unicode and Character Sets and the character-encoding tag on this site for a wealth of further information.)
