What is the typical Chinese language code for the Accept-Language header?

Unfortunately, I have no way to check this personally, so I wanted to ask the community about it.
According to RFC 5646, Chinese can be represented as follows: zh-Hans for Simplified Chinese and zh-Hant for Traditional Chinese, or more specifically, e.g., zh-Hans-SG for Simplified Chinese as used in Singapore and zh-Hant-MO for Traditional Chinese as used in Macau. This is not an exhaustive list; there are many more options.
One thing I know for sure: Chinese cannot be represented as zh, or zh-CN, or zh-TW and the like.
However, how are things in reality? If the site is visited by a user who speaks Chinese, what can I expect in the Accept-Language header?

Well, I got Windows Sandbox installed and was able to install whatever I wanted there.
I checked two browsers:
QQ Browser (Chinese is selected by default; I'm not sure which script).
Google Chrome (added all supported Chinese languages and made them first in the list).
QQ Browser sends the following content in the Accept-Language header: zh-CN,zh;q=0.9.
Google Chrome sends the following content in the Accept-Language header: zh-CN,zh-TW;q=0.9,zh-HK;q=0.8,zh;q=0.7,en;q=0.6. I also figured out what Chrome means by the listed codes:
zh-CN - Chinese (Simplified)
zh-TW - Chinese (Traditional)
zh-HK - Chinese (Hong Kong)
zh - Chinese
To be honest, this is strange, but it is a fact.
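In practice, then, a server should expect region-style tags such as zh-CN or zh-TW at least as often as script-style tags such as zh-Hans. Handling usually starts by parsing the header's quality values; here is a minimal sketch in Python (it assumes the tag;q=value form shown above and ignores whitespace variants and other parameters):

def parse_accept_language(header):
    # Parse an Accept-Language value into (tag, q) pairs, highest q first.
    pairs = []
    for part in header.split(","):
        piece = part.strip()
        if not piece:
            continue
        if ";q=" in piece:
            tag, q = piece.split(";q=", 1)
            pairs.append((tag.strip(), float(q)))
        else:
            pairs.append((piece, 1.0))
    return sorted(pairs, key=lambda item: item[1], reverse=True)

# The header observed from Chrome above:
print(parse_accept_language("zh-CN,zh-TW;q=0.9,zh-HK;q=0.8,zh;q=0.7,en;q=0.6"))
# [('zh-CN', 1.0), ('zh-TW', 0.9), ('zh-HK', 0.8), ('zh', 0.7), ('en', 0.6)]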

Related

R/exams unicode chars in *.Rnw question files are not properly displayed: é displayed as <U+00E9> in final PDF

I am struggling to produce an exam sheet in French using exams2nops. There are accents in the text provided in the intro and title arguments of this function and also in the Rnw files containing the questions. The former are correctly displayed in the resulting PDF, but not the latter; for example, é from an Rnw file is displayed as <U+00E9>.
The call to exams2nops looks like this:
exams2nops(file = myexam, n = N.students, dir = '.',
           name = paste0('exam-', exam.date),
           title = "Examen écrit",
           course = course.id,
           institution = "",
           logo = paste(exams.dir, 'input/logo.jpg', sep = '/'),
           date = exam.date,
           replacement = TRUE,
           intro = intro,
           blank = round(length(myexam)/4),
           duplex = TRUE, pages = NULL,
           usepackage = NULL,
           language = "fr",
           encoding = "UTF-8",
           startid = 1,
           points = c(1), showpoints = TRUE,
           samepage = TRUE,
           twocolumn = FALSE,
           reglength = 9,
           header = NULL)
Note that "Examen écrit" is correctly displayed in the final PDF, the problem is with the accent in the Rnw files. The function call yields no error.
The *.tex files by generated by exams2nops, already have the problem. For example, the sentense 'Quarante patients ont été inscrits' in the original Rnw file, becomes 'Quarante patients ont <U+00E9>t<U+00E9> inscrits' in the tex file.
I use exams_2.4-0 with R 4.2.2 with TeXShop 4.70 on OSX 11.6.
I checked that Rnw are utf-8 encoded, for example:
$ file -I question1.Rnw
question1.Rnw: text/x-tex; charset=utf-8
They do seem to be UTF-8-encoded, indeed. These files were translated with DeepL or Google Translate, then edited in Emacs.
I tried setting the encoding parameter of exams2nops to latin-1. It did not help.
Any idea?
The problem disappeared after setting the R locale properly, a recurrent problem with R installs on OSX. The symptom is:
During startup - Warning messages:
1: Setting LC_CTYPE failed, using "C"
2: Setting LC_COLLATE failed, using "C"
3: Setting LC_TIME failed, using "C"
4: Setting LC_MESSAGES failed, using "C"
5: Setting LC_MONETARY failed, using "C"
at startup. This thread explains how to fix it: Installing R on Mac - Warning messages: Setting LC_CTYPE failed, using "C".
I'm collecting a few further comments here in addition to the existing answer:
The only encoding (beyond ASCII) supported by R/exams, starting from version 2.4-0, is UTF-8. Support for other encodings like latin1 etc. has been discontinued.
As only UTF-8 is supported, the encoding does not have to be specified in R/exams function calls anymore (as might still be advised in older tutorials).
To leverage this support of UTF-8, R has to be configured with a suitable locale. A "C" locale (see the answer by #vdet) is not sufficient.
When using R/LaTeX (Rnw) exercises, all issues with encodings can also be avoided entirely by using LaTeX commands for special characters, e.g., {\'e}t{\'e} instead of été. The latter is of course more convenient, but the former can be more robust, especially when working with teams of instructors on different operating systems with different locale settings.
When using LaTeX commands instead of special characters in R strings (as opposed to the exercise files), remember that the backslash has to be escaped. For example, the argument title = "Examen écrit" becomes title = "Examen {\\'e}crit".

Strange URL containing 'A=0 or '0=A in web server logs

During the last weekend some of my sites logged errors implying wrong usage of our URLs:
...news.php?lang=EN&id=23'A=0
or
...news.php?lang=EN&id=23'0=A
instead of
...news.php?lang=EN&id=23
Originally I found only one page which mentioned this (https://forums.adobe.com/thread/1973913), where they speculated that the additional query string comes from GoogleBot or an encoding error.
I recently changed my sites to use PDO instead of mysql_*. Maybe this change caused the errors? Any hints would be useful.
Additionally, all of the requests come from the same user-agent shown below.
Mozilla/5.0 (Windows; U; Windows NT 5.1; pt-PT; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)
This led me to the following threads: pt-BR and Strange parameter in URL - what are they trying?
It is a bot testing for SQL injection vulnerabilities by closing a query with an apostrophe, then setting a variable. There are also similar injects that deal with shell commands and/or file path traversals. Whether it's a "good bot" or a bad bot is unknown, but if the inject works, you have bigger issues to deal with. There's a 99% chance your site is not generating these links itself, and there is nothing you can do to stop bots from crafting those URLs unless you block the request(s) with a simple regex string or a more complex WAF such as ModSecurity.
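To make the probe concrete: the appended 'A=0 only matters if the parameter is spliced directly into the SQL text. A small illustration in Python with sqlite3, chosen purely to keep the example self-contained (the site in question uses PHP/PDO, where bound parameters behave the same way):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (id INTEGER, lang TEXT, body TEXT)")
conn.execute("INSERT INTO news VALUES (23, 'EN', 'hello')")

user_id = "23'A=0"  # the probe value seen in the logs

# Vulnerable pattern: the value is concatenated into the SQL text, so the
# stray apostrophe changes the statement (here it simply breaks the syntax;
# other payloads could change the logic instead).
try:
    conn.execute("SELECT body FROM news WHERE id = '%s'" % user_id)
except sqlite3.OperationalError as err:
    print("injected quote broke the query:", err)

# Safe pattern: the driver sends the value separately from the SQL text,
# so the probe is treated as a literal, non-matching value.
print(conn.execute("SELECT body FROM news WHERE id = ?", (user_id,)).fetchall())  # []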
Blocking based on user agent is not an effective angle. You need to look at request heuristics and block based on those instead. Some examples of things to look for in the URL/request/POST/referrer, as both UTF-8 and hex-encoded characters:
double apostrophes
double periods, especially followed by a slash in various encodings
words like "script", "etc" or "passwd"
paths like dev/null used with piping/echoing shell output
%00 null-byte-style characters used to initiate a new command
http in the URL more than once (unless your site uses it)
anything regarding cgi (unless your site uses it)
random "enterprise" paths for things like ColdFusion, Tomcat, etc.
If you aren't using a WAF, here is a regex concat that should capture many of those within a URL. We use it in PHP apps, so you may/will need to tweak some escapes/looks depending on where you are using this. Note that this has .cgi, wordpress, and wp-admin along with a bunch of other stuff in the regex; remove them if you need to.
$invalid = "(\(\))"; // lets not look for quotes. [good]bots use them constantly. looking for () since technically parenthesis arent valid
$period = "(\\002e|%2e|%252e|%c0%2e|\.)";
$slash = "(\\2215|%2f|%252f|%5c|%255c|%c0%2f|%c0%af|\/|\\\)"; // http://security.stackexchange.com/questions/48879/why-does-directory-traversal-attack-c0af-work
$routes = "(etc|dev|irj)" . $slash . "(passwds?|group|null|portal)|allow_url_include|auto_prepend_file|route_*=http";
$filetypes = $period . "+(sql|db|sqlite|log|ini|cgi|bak|rc|apk|pkg|deb|rpm|exe|msi|bak|old|cache|lock|autoload|gitignore|ht(access|passwds?)|cpanel_config|history|zip|bz2|tar|(t)?gz)";
$cgis = "cgi(-|_){0,1}(bin(-sdb)?|mod|sys)?";
$phps = "(changelog|version|license|command|xmlrpc|admin-ajax|wsdl|tmp|shell|stats|echo|(my)?sql|sample|modx|load-config|cron|wp-(up|tmp|sitemaps|sitemap(s)?|signup|settings|" . $period . "?config(uration|-sample|bak)?))" . $period . "php";
$doors = "(" . $cgis . $slash . "(common" . $period . "(cgi|php))|manager" . $slash . "html|stssys" . $period . "htm|((mysql|phpmy|db|my)admin|pma|sqlitemanager|sqlite|websql)" . $slash . "|(jmx|web)-console|bitrix|invoker|muieblackcat|w00tw00t|websql|xampp|cfide|wordpress|wp-admin|hnap1|tmunblock|soapcaller|zabbix|elfinder)";
$sqls = "((un)?hex\(|name_const\(|char\(|a=0)";
$nulls = "(%00|%2500)";
$truth = "(.{1,4})=\1"; // catch OR always-true (1=1) clauses via sql inject - not used atm, its too broad and may capture search=chowder (ch=ch) for example
$regex = "/$invalid|$period{1,2}$slash|$routes|$filetypes|$phps|$doors|$sqls|$nulls/i";
Using it, at least with PHP, is pretty straightforward with preg_match_all(). Here is an example of how you can use it: https://gist.github.com/dhaupin/605b35ca64ca0d061f05c4cf423521ab
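Outside of PHP the same approach carries over directly. Here is a deliberately simplified sketch in Python that checks a raw request URL against a small subset of the patterns above; it is an illustration of the idea, not a replacement for the full regex or a real WAF:

import re

# A small subset of the patterns above, for illustration only.
SUSPICIOUS = re.compile(
    r"\(\)"                                  # bare parentheses
    r"|(%2e|%252e|\.){2,}(%2f|%252f|/|\\)"   # dot-dot-slash traversal in several encodings
    r"|(etc|dev)(%2f|/)(passwds?|null)"      # sensitive paths
    r"|(un)?hex\(|char\(|a=0"                # common SQL-injection fingerprints
    r"|%00|%2500",                           # null bytes
    re.IGNORECASE,
)

def looks_malicious(url):
    # True if the raw request URL matches any suspicious pattern.
    return bool(SUSPICIOUS.search(url))

print(looks_malicious("/news.php?lang=EN&id=23'A=0"))  # True (matches the a=0 fingerprint)
print(looks_malicious("/news.php?lang=EN&id=23"))      # False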
WARNING: Be careful if you set this to autoban (i.e., a fail2ban filter). MS/Bing DumbBots (and others) often muck up URLs by entering things like strange triple dots from following truncated URLs, or by trying to hit a tel: link as a URI. I don't know why. Here is what I mean: a link with the text www.example.com/link-too-long...truncated.html may point to a correct URL, but Bing may try to access it "as it looks" instead of following the href, resulting in a WAF hit due to the double dots.
Since this is a very old version of Firefox, I blocked it in my .htaccess file:
RewriteCond %{HTTP_USER_AGENT} Firefox/3\.5\.2 [NC]
RewriteRule .* err404.php [R,L]

Nexmo VoiceXML not working in language other than en-US

I have a running VoiceXML application that works OK in Nexmo. If I set any language other than en-US, the calls won't get answered. I just change the xml:lang attribute, as in:
<vxml application="/dialogue/root/50b9bab0-9ce8-4d7a-9389-09f06aa8f9ee" version="2.1" xml:lang="es-es">
I have tried it in the vxml tag above and also in the prompt tag. Any language like es-es, es-ES... even en-UK will make my VXML stop working in Nexmo.
I am sure the script is OK, as I can switch between the en-US female and male voices with en-us-female and en-us-male. That works.
Am I missing something?
(I don't think it makes a difference, but I use the great Rivr Java library to generate the VXML.)
for me "fr-ca" doesnt work but "fr-ca-female" does work

Read site html from a site in a different geo region

I am using Python and Beautiful Soup to read HTML pages. Unfortunately, some sites redirect based on my geo region (AU), so I can't retrieve the target country's version, i.e. UK, US, FR, NZ...
I have tried using a VPN service, but this requires me to manually change the region, so I can't automate the process. I have also tried using the Python quartz.Coregraphics library to click the options on screen, but this is temperamental.
Is there a way I can achieve this programmatically?
I managed to nut this one out myself. It's best answered by example, for reading a UK-based site:
import urllib2
url = 'Some-uk-url'
req = urllib2.Request(url)
req.add_header('Accept-Language', 'en-gb')
req.add_header('X-Forwarded-For', [a uk proxy ipaddress here])
htmltext = urllib2.urlopen(req).read()
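For what it's worth, urllib2 is Python 2 only; on Python 3 the same idea can be written with the requests library. The URL and proxy address below are placeholders:

import requests

url = "https://some-uk-url.example"           # placeholder
proxies = {
    "http": "http://uk-proxy.example:8080",   # placeholder UK proxy
    "https": "http://uk-proxy.example:8080",
}
headers = {"Accept-Language": "en-gb"}

# Routing through a proxy located in the target region is generally more
# reliable than setting X-Forwarded-For, which many sites simply ignore.
response = requests.get(url, headers=headers, proxies=proxies, timeout=30)
htmltext = response.text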

HTML decoding in C/C++

I'm using libcurl for getting HTML pages.
I have some problems with Hebrew characters.
For example, this: סלקום
comes out as gibberish.
How do I get Hebrew characters and not gibberish?
Do I need some HTML decoder?
Does libcurl support such operation?
Does libiconv support such operation?
I appreciate any help.
Thanks
Edit: OK, so what you're seeing is UTF-8 data being decoded as Windows-1255 (so the numeric character references were a red herring). Here's a demonstration in Python 2:
>>> u = ''.join(map(unichr, [1505, 1500, 1511, 1493, 1501]))
>>> s = u.encode('utf-8')
>>> print s.decode('cp1255', 'replace')
׳¡׳�׳§׳•׳�
The solution to this problem depends on the environment in which the output is displayed. Merely outputting the bytes received and expecting them to be interpreted as characters leads to problems like this.
An HTML document typically contains a header tag like <meta charset=utf-8> to indicate to the browser what its encoding should be. A document served by a web server contains an HTTP header like Content-Type: text/html; charset=utf-8.
You should ask libcurl for the Content-Type HTTP header to know the encoding of the document, and then convert it to the system encoding using iconv. While in your case that would be codepage 1255, it depends on the user’s system and so you should look up the appropriate functions to detect that.
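The same idea, sketched here in Python rather than C (with libcurl you would read the header via curl_easy_getinfo and CURLINFO_CONTENT_TYPE and then convert with iconv; the charset parsing below is deliberately simplified):

import re

def decode_body(body_bytes, content_type):
    # content_type is the raw header value, e.g. "text/html; charset=utf-8".
    # Falls back to UTF-8 when no charset is declared; a real client should
    # also honour a <meta charset=...> tag in the document itself.
    match = re.search(r"charset=([\w-]+)", content_type, re.IGNORECASE)
    charset = match.group(1) if match else "utf-8"
    return body_bytes.decode(charset, errors="replace")

# The Hebrew word from the question, round-tripped through UTF-8:
raw = "סלקום".encode("utf-8")
print(decode_body(raw, "text/html; charset=utf-8"))   # סלקום
print(decode_body(raw, "text/html; charset=cp1255"))  # the gibberish from the edit above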
(Read Unicode and Character Sets and the character-encoding tag on this site for a wealth of further information.)
