When would it be appropriate to localize a single ascii character?
for instance /, or | ?
is it ever necessary to add these "strings" to the localization effort?
just want to give some people the benefit of the doubt and make sure there's not something I didn't think of.
Generally it wouldn't be appropriate to use something like that except as a graphic element (which of course wouldn't be I18N'd in the first place, much less L10N'd). If you are trying to use it to e.g. indicate a ratio then you should have something like "%d / %d" instead, and localize the whole thing.
Yes, there are cases where these individual characters change in localization. This is not a comprehensive list, just examples I happen to know.
Not every locale uses , to separate thousands and . for the decimal. (However, these will usually be handled by your number formatter. If you do so yourself, you're probably doing it wrong. See this MSDN blog post by Michael Kaplan, Number format and currency format are not always the same.)
Not every language uses the same quotation marks (“, ”, ‘ and ’). See Wikipedia on Non-English Uses of Quotation Marks. (Many of these are only easy to replace if you use full quote marks. If you use the " and ' on your keyboard to mark both the start and end of sentences, you won't know which of two symbols to substitute.)
In Spanish, a question or exclamation is preceded by an inverted ? or !. ¿Question? ¡Exclamation! (Obviously, you can't fix this with a locale substitution for a single character. Any questions or exclamations in your application should be entire strings anyway, unless you're writing some stunningly intelligent natural language generator.)
If you do find a circumstance where you need to localize these symbols, be extra cautious not to accidentally localize a symbol like / used as a file separator, " to denote a string literal or ? for a search wildcard.
However, this has already happened with CSV files. These may be separated by ,, or may be separated by the local list separator. See What would happen if you defined your system's CSV delimiter as being a quotation mark?
In Greek, questions end with a semicolon rather than ?, so essentially the ? is replaced with ; ... however, you should aim to always translate the question as a complete string including question mark anyway.
Related
It seems that it is best to use the & escape, instead of simply typing the ampersand (&).
However, should we be using X/HTML character entity references for dashes and other common typographical characters when writing blog posts on CMSs like WordPress or hard-coding websites by hand?
For example:
– is an en dash (–)
— is an em dash (—)
What is the risk if we do not?
Why is the hyphen (-) never written as - but simply typed directly from the keyboard in HTML? (Assuming that it the hyphen, and not a minus sign.)
The W3C released an official response about when to use and when not to use character escapes which you can find here. As they are also the group that is in charge of the HTML specification, I think it's best to follow their advice.
From the section "When to Use Escapes"
Syntax characters. There are three characters that should always appear in content as escapes, so that they do not interact with the syntax of the markup. These are part of the language for all documents based on XML and for HTML.
< (<)
> (>)
& (&)
They also mention using characters that might not be supported in the current encoding.
From the section "When Not to Use Escapes"
It is almost always preferable to use an encoding that allows you to represent characters in their normal form, rather than using character entity references or NCRs.
Using escapes can make it difficult to read and maintain source code, and can also significantly increase file size.
http://www.w3.org/International/questions/qa-escapes
Those entities are there to help you, the author, with characters not usually typable on your average keyboard. (The em dash is an example —, as well as © and ).
You only need to escape those characters that have meaning in (X)HTML < > and &.
I am in need of matching Unicode letters, similarly to PCRE's \p{L}.
Now, since Dart's RegExp class is based on ECMAScript's, it doesn't have the concept of \p{L}, sadly.
I'm looking into perhaps constructing a big character class that matches all Unicode letters, but I'm not sure where to start.
So, I want to match letters like:
foobar
מכון ראות
But the R symbol shouldn't be matched:
BlackBerry®
Neither should any ASCII control characters or punctuation marks, etc. Essentially every letter in every language Unicode supports, whether it's å, ä, φ or ת, they should match if they are actual letters.
I know this is an old question. But RegExp now supports unicode categories (since Dart 2.4) so you can do something like this:
RegExp alpha = RegExp(r'\p{Letter}', unicode: true);
print(alpha.hasMatch("f")); // true
print(alpha.hasMatch("ת")); // true
print(alpha.hasMatch("®")); // false
I don't think that complete information about classification of Unicode characters as letters or non-letters is anywhere in the Dart libraries. You might be able to put something together that would mostly work using things in the Intl library, particularly Bidi. I'm thinking that, for example,
isLetter(oneCharacterString) => Bidi.endsWithLtr(oneLetterString) || Bidi.endsWithRTL(oneLetterString);
might do a plausible job. At least it seems to have a number of ranges for valid characters in there. Or you could put together your own RegExp based on the information in _LTR_CHARS and _RTL_CHARS. It explicitly says it's not 100% accurate, but good for most practical purposes.
Looks like you're going to have to iterate through the runes in the string and then check the integer value against a table of unicode ranges.
Golang has some code to generate these tables directly from the unicode source. See maketables.go, and some of the other files in the golang unicode package.
Or take the lazy option, and file a Dart bug, and wait for the Dart team to implement it ;)
There's no support for this yet in Dart or JS.
The Xregexp JS library has support for generating fairly large character class regexps to support something like this. You may be able to generate the regexp, print it and cut and paste it into your app.
I need to know how to properly use "OR" when it comes to individual characters and whole phrases... For example I have code that is checking for any number of characters OR words that are found in an array...
I want to check for some unicode characters and also some html lines of code.
I'm currently just checking for the characters using this:
([\u200b\u200c\u200d\0\1\2\3\4\5\6\7]*)
(the backslashes are representing the unicode characters u+200b - u+200d and the special characters in my software \0-\7 (They are all individual characters), these are valid escape sequences in Objective-C.)
Now what if I wanted to check for these characters AND check for phrases like <b> or <font color="#FF0000">
I found stuff while doing research that said to use pipelines | but I'm not sure if I put them only in-between the words or also in-between the individual characters and I'm not sure if I put quotes around the words or what not... I need help before I screw this up badly haha!
(p.s., not sure if it will be any different but I'm also doing it for this:
([^\u200b\u200c\u200d\0\1\2\3\4\5\6\7])
it's be someting like
/([^....]|\<b\/\>|\<font color .... \>)/
though, the usual caveats about regexes and html apply here.
As for the confusion about where to put the |, consider this this hackneyed example: You want to find the word color, but also want to accommodate the british spelling, colour:
/(color|colour)/
/(colou?r)/
/(colo(r|ur))/
are all basically equivalent.
As I understand it, Europeans(*) write numbers with a comma for a decimal separator, so one-and-a-quarter is written as 1,25
Europeans also use commas to separate lists, so how do you write a list of decimal numbers? I, as an Englishman, would write one-and-a-quarter, one-and-a-half, one-and-three-quarters like this:
1.25, 1.5, 1.75
How do you do that in Europe?
(Why is this a programming question? Because I'm writing a program that will ask European users for a list of numbers!)
* For the purposes of this question, there are no English-speaking countries in Europe. :-)
I'm European (french), and in almost all programs here we have to use semicolons ';' as a separator, even if the numbers are only integers because the comma doesn't look like a separator for us. In mathematics, semicolons are the only right way here to separate a list of numbers.
The most common example is when we have to enter the page numbers we want to print on a PDF, all programs ask for a semicolon-separated list and I clearly found it intuitive. I think they would have changed it if it was uncomfortable for some.
This varies by culture, and within a culture. The CLDR data contains the “list” element that specifies the list separator character, and it is the semicolon for most cultures, see the chart of number symbols (element “list”). The definition is very implicit though, and there is variation inside locales. Some people regard 1,25, 1,5, 1,75 as acceptable, while others prefer 1,25; 1,5; 1,75. There are also people who seriously think that in a strongly mathematical or numeric context, one should deviate from the locale practices and use the Anglo-Saxon notation with decimal point, hence with comma as separator.
On the practical side, I think it would not be very wrong to use ”;” as number list separator when decimal comma is used, or even when decimal point is used. So you might even consider using ”;” in all locales.
But when it comes to user input, it’s trickier. In principle, you be liberal in what you accept, but since the comma can be meant to be a decimal comma, a thousands separator, or a list item separator, there is such a thing as being too liberal.
If possible, prompt for each number separately, avoiding the separator issue. If this is not possible, the crucial thing is to make it very, very clear to the use which separator is expected. I would go as far as saying that requiring for the semicolon ”;” is the most reliable thing to do.
Why ask about Europeans in general ? I don't think there is one European way of doing so, and if it happens to be the case then it would be sheer luck. Europe is comprised of different cultures and each has its own rules.
You don't mention what platform you are using but you might be able to rely on your plaform to get this information. In the case of .NET, you can get this information through Textinfo.ListSeparator. For example this would give you the French one (result: a semicolon):
string listSeparator = new CultureInfo("fr-FR").TextInfo.ListSeparator;
I don't think there is one way to do it. White space separating the numbers would works just the same, or you could use a semicolon (';') to separate the numbers
In designing of a (mini)language:
When there are certain characters that should be escaped to lose special meanings (like quotes in some programming languages), what should be done, especially from a security perspective, when characters that are not escapable (e.g. normal characters which never have special meaning) are escaped? Should an error be "error"ed, or should the character be discarded, or should it be in the output the same as if it was not escaped?
Example:
In a simple language where strings are delimited by double-quotes("), and any quotes in a given string are escaped with a back-slash(\): for input "We \said, \"We want Moshiach Now\"" -- what would should be done with the letter s in said which is escaped?
I prefer the lexer to whine when this occurs. A lexer/parser should be tight about syntax; one can always loosen it up later. If you are sloppy, you'll find you can't retract a decision you didn't think you made.
Assume that you initially decide to treat " backslash not-an-escape " as that pair of characters, and the "T" is
not-an-escape today. Sometime later you decide to extend the language, and want "\T" to mean something special, and you change your language.
You'll find an angry mob of programmers storming your design castle,
because for them, "\T" means "\" "T" (or "T" depending on your default decision),
and you just broke their code. You hang your head in shame, retract the decision,
and then realize... oops, there are no more available escape characters!
This lesson goes for any piece of syntax that isn't well defined in your language. If it isn't explicitly legal, it should be implicitly illegal and your compiler should check it. Or you'll never be able to extend your successful language.
If your language isn't going to be successful, you may not care as much.
Well, one way to solve the problem is for the backslash to just mean backslash when it precedes a non-escapable character. That's what Python does:
>>> print "a\tb"
a b
>>> print "a\tb\Rc"
a b\Rc
Obviously, most systems take the escape character to mean "take the next character verbatim", so escaping a "non-escapable" character is usually harmless. The problem later happens when you get to comparisons and such, where the literal text does not represent the actual value (that's where you see a lot of issues securitywise, especially with things like URLs).
So on the one hand, you can only accept a limited number of escaped characters. In that sense, you have an "escape sequence", rather than an escaped character (the \x is the entire sequence rather than a \ followed by an x). That's like the most safe mechanism, and it's not really burdensome to write.
The other option is to ensure that you you "canonicalizing" everything you compare, through some ruleset. This typically means removing all of the escape sequences properly up front, before comparison and comparing only the final values rather than the literals.
Most systems interpret the slash as Will Hartung says, except for alphanumerics which are variously used as aliases for control codes, character classes, word boundaries, the start of hex sequences, case region markers, hex or octal digits, etc. \s in particular often means white-space in perl5 style regexs. JavaScript, which interprets it as 's' in one context and as whitespace in another suffers from subtle bugs because of this choice. Consider /foo\sbar/ vs new RegExp('foo\sbar').