How do Europeans write a list of numbers with decimals? - localization

As I understand it, Europeans(*) write numbers with a comma for a decimal separator, so one-and-a-quarter is written as 1,25
Europeans also use commas to separate lists, so how do you write a list of decimal numbers? I, as an Englishman, would write one-and-a-quarter, one-and-a-half, one-and-three-quarters like this:
1.25, 1.5, 1.75
How do you do that in Europe?
(Why is this a programming question? Because I'm writing a program that will ask European users for a list of numbers!)
* For the purposes of this question, there are no English-speaking countries in Europe. :-)

I'm European (French), and in almost all programs here we have to use a semicolon (';') as the separator, even when the numbers are all integers, because the comma doesn't look like a list separator to us. In mathematics, the semicolon is the only correct way here to separate a list of numbers.
The most common example is entering the page numbers to print from a PDF: all programs ask for a semicolon-separated list, and I find that intuitive. I think they would have changed it if it were uncomfortable for some users.

This varies by culture, and within a culture. The CLDR data contains the “list” element that specifies the list separator character, and it is the semicolon for most cultures, see the chart of number symbols (element “list”). The definition is very implicit though, and there is variation inside locales. Some people regard 1,25, 1,5, 1,75 as acceptable, while others prefer 1,25; 1,5; 1,75. There are also people who seriously think that in a strongly mathematical or numeric context, one should deviate from the locale practices and use the Anglo-Saxon notation with decimal point, hence with comma as separator.
On the practical side, I think it would not be very wrong to use “;” as the number list separator when a decimal comma is used, or even when a decimal point is used. So you might even consider using “;” in all locales.
But when it comes to user input, it’s trickier. In principle, you should be liberal in what you accept, but since the comma can be meant as a decimal comma, a thousands separator, or a list item separator, there is such a thing as being too liberal.
If possible, prompt for each number separately, avoiding the separator issue. If this is not possible, the crucial thing is to make it very, very clear to the user which separator is expected. I would go as far as saying that requiring the semicolon “;” is the most reliable thing to do.
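As a rough illustration of the strict approach, here is a minimal Python sketch that accepts only the semicolon as list separator and the comma as decimal separator. The function name `parse_number_list` and its parameters are made up for this example, and a real program would probably want friendlier error messages.

```python
from decimal import Decimal, InvalidOperation


def parse_number_list(text, list_sep=";", decimal_sep=","):
    """Split text on list_sep and parse every item as a Decimal.

    Hypothetical helper: deliberately strict, so a comma is never
    guessed to be a list separator or a thousands separator.
    """
    numbers = []
    for item in text.split(list_sep):
        item = item.strip()
        if not item:
            raise ValueError("empty item in list")
        if decimal_sep != ".":
            if "." in item:
                raise ValueError(f"unexpected '.' in {item!r}")
            # Decimal() expects "." as the decimal separator.
            item = item.replace(decimal_sep, ".")
        try:
            numbers.append(Decimal(item))
        except InvalidOperation:
            raise ValueError(f"not a number: {item!r}") from None
    return numbers


print(parse_number_list("1,25; 1,5; 1,75"))
```

Input such as `1,25, 1,5` is rejected with a `ValueError` rather than silently reinterpreted, which matches the advice not to be too liberal.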

Why ask about Europeans in general? I don't think there is one European way of doing this, and if there happened to be one, it would be sheer luck. Europe comprises many different cultures, and each has its own rules.
You don't mention what platform you are using, but you might be able to rely on your platform to get this information. In the case of .NET, you can get it through TextInfo.ListSeparator. For example, this would give you the French one (result: a semicolon):
string listSeparator = new CultureInfo("fr-FR").TextInfo.ListSeparator;

I don't think there is one way to do it. White space separating the numbers would work just the same, or you could use a semicolon (';') to separate the numbers.

Related

Is it ever appropriate to localize a single ASCII character?

When would it be appropriate to localize a single ASCII character?
For instance, / or | ?
Is it ever necessary to add these "strings" to the localization effort?
I just want to give some people the benefit of the doubt and make sure there's not something I didn't think of.
Generally it wouldn't be appropriate to use something like that except as a graphic element (which of course wouldn't be I18N'd in the first place, much less L10N'd). If you are trying to use it to e.g. indicate a ratio then you should have something like "%d / %d" instead, and localize the whole thing.
Yes, there are cases where these individual characters change in localization. This is not a comprehensive list, just examples I happen to know.
Not every locale uses , to separate thousands and . for the decimal. (However, these will usually be handled by your number formatter. If you do so yourself, you're probably doing it wrong. See this MSDN blog post by Michael Kaplan, Number format and currency format are not always the same.)
Not every language uses the same quotation marks (“, ”, ‘ and ’). See Wikipedia on Non-English Uses of Quotation Marks. (Many of these are only easy to replace if you use full quote marks. If you use the " and ' on your keyboard to mark both the start and end of sentences, you won't know which of two symbols to substitute.)
In Spanish, a question or exclamation is preceded by an inverted ? or !. ¿Question? ¡Exclamation! (Obviously, you can't fix this with a locale substitution for a single character. Any questions or exclamations in your application should be entire strings anyway, unless you're writing some stunningly intelligent natural language generator.)
If you do find a circumstance where you need to localize these symbols, be extra cautious not to accidentally localize a symbol like / used as a file separator, " to denote a string literal or ? for a search wildcard.
However, this has already happened with CSV files. These may be separated by ,, or may be separated by the local list separator. See What would happen if you defined your system's CSV delimiter as being a quotation mark?
In Greek, questions end with a semicolon rather than ?, so essentially the ? is replaced with ; ... however, you should aim to always translate the question as a complete string including question mark anyway.

Localization and Lists of Decimal Numbers

I'm working on localizing some strings in our application and we have text that looks something like:
Factor f (1.0, 1.2, or 1.5)
In a locale that uses a comma for the decimal point, would this be written as:
Factor f (1,0, 1,2, or 1,5)
Maybe it's just not what I'm accustomed to, but that looks crazy hard to read quickly.
I'm also wondering about text like version numbers. Would Firefox 3.5.1 be Firefox 3,5,1?
If I understand what you are looking for, there are two things to consider with regard to internationalization here:
Decimal separator
List separator
Obviously these separators are tightly coupled: in a locale that uses the comma as decimal separator, the list separator must be something else. Usually this is a semicolon, and only a few locales use something other than a comma or semicolon as list separator.
To summarize:
In locales that use the dot as decimal separator, the comma is usually used as list separator, so in free-form text you might expect something like Factor f (1.0, 1.2, or 1.5).
In locales that use the comma as decimal separator, the semicolon is typically used as list separator: Faktor f (1,0; 1,2; oder 1,5) is something you should expect.
I am not sure what you are up to (the technology does matter in the advice), but you can leave the number format as well as the list separator to the translators to decide. In .NET the list separator is given, though (no need to ask translators for input, just use the appropriate property of the CultureInfo class).
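The rule of thumb above can be sketched in a few lines of Python. The helper `format_number_list` is hypothetical, and it hard-codes the "comma decimal means semicolon list separator" heuristic instead of consulting real locale data, so treat it as an illustration only.

```python
def format_number_list(values, decimal_sep=".", conjunction="or"):
    """Format numbers as free-form text, e.g. '1.0, 1.2, or 1.5'.

    Assumption: the list separator is ';' when the decimal separator
    is ',' and ',' otherwise. Real locales can differ.
    """
    list_sep = ";" if decimal_sep == "," else ","
    # One fixed decimal place, to mirror the examples in the text.
    items = [f"{value:.1f}".replace(".", decimal_sep) for value in values]
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return f"{items[0]} {conjunction} {items[1]}"
    head = f"{list_sep} ".join(items[:-1])
    return f"{head}{list_sep} {conjunction} {items[-1]}"


print(format_number_list([1.0, 1.2, 1.5]))
print(format_number_list([1.0, 1.2, 1.5], decimal_sep=",", conjunction="oder"))
```

In a real application the conjunction and the overall pattern would still go to the translators; only the separators are derived here.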
Sorry to say, but I don't know about your first question. However, as far as version numbers go, they are generally left untranslated. End users typically attribute little meaning to a version's numeric value (version numbers are in fact NOT numeric in nature: 3.90 < 3.100). They are simply discrete numbers joined by a universally accepted separator, not natural numbers with natural "grouping/decimal" separators.
Beyond the end-user experience, developers often parse version numbers in the standard format {major}.{minor}.{revision}, using . as the well-known separator character.
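The claim that version strings are not decimal numbers is easy to check. The small Python sketch below (with a hypothetical helper `parse_version`) splits on the dot and compares the parts as integers, assuming every part is purely numeric.

```python
def parse_version(version):
    """Turn '3.5.1' into (3, 5, 1) so versions compare part by part."""
    return tuple(int(part) for part in version.split("."))


# Compared as versions, 3.90 comes before 3.100 ...
print(parse_version("3.90") < parse_version("3.100"))
# ... but read as decimal numbers the order flips (3.9 versus 3.1).
print(float("3.90") > float("3.100"))
```

Because the dot is a field separator here and not a decimal mark, replacing it with a locale's decimal comma would break this kind of parsing.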
I did find this link that talks about your first question (sort of). I don't know how authoritative or credible it is; but it doesn't look dubious.

Locale-specific lookup table

I'm using a lookup table to optimize an algorithm that works on single characters. Currently I'm adding a..z, A..Z, 0..9 to the lookup table. This works fine in European countries, but in Asian countries it doesn't make much sense.
My idea was that I could perhaps use the characters in the windows default code page as an alphabet for the lookup table.
Pseudocode:
for Ch in DefaultCodePage.Characters do
  LookupTable.Add (Ch, ComputeValue (Ch));
What do you think and how could this be achieved? Any alternative suggestions?
As you mentioned, it does not make much sense for different scripts. It may only make some sense for alphabet-based languages.
By the way, A-Z is not enough for most European languages.
I don't quite know what you are doing or what you need this lookup table for, but it seems that what you are looking for are Index Characters. You can find such information in CLDR; look for indexCharacters. The resources for various languages are available here.
The only problem you'll face is that for some languages the Index Characters tend to be Latin-based. That is simply because these languages do not actually have them... In that case you might want to use the so-called Exemplar Characters instead, but please be warned that they might not be enough for some use cases.
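For comparison, the code-page idea from the question could be sketched in Python roughly as follows. This is only an illustration: `code_page_alphabet` and `compute_value` are hypothetical names, the code page is passed explicitly instead of being read from the system, and the single-byte loop finds nothing useful beyond ASCII for multi-byte code pages, which is exactly where the approach breaks down for many Asian scripts.

```python
def code_page_alphabet(code_page):
    """Return the letters and digits a single-byte code page can encode."""
    chars = []
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(code_page)
        except UnicodeDecodeError:
            # Unassigned byte, or a lead byte of a multi-byte sequence.
            continue
        if ch.isalnum():
            chars.append(ch)
    return chars


def compute_value(ch):
    """Placeholder for whatever the real algorithm precomputes."""
    return ord(ch)


lookup_table = {ch: compute_value(ch) for ch in code_page_alphabet("cp1252")}
print(len(lookup_table))
```

With `cp1252` the table picks up accented letters such as é and ß in addition to a..z, A..Z and 0..9.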

Avoiding real English words in "short URLs", without sacrificing too much headroom

Assuming here that the language in question is English, and the character sets used are basic ASCII / latin alphabet.
When generating "Short URLs", the first thought is often to use a large "code set"/alphabet to convert an integer (possibly an ID referencing the long URL in your database) to a high "base" (URL-friendly Base-64, for example). In my specific case, I first opted to normalize to Base-36 (numbers, latin letters, not case-sensitive).
However, upon closer inspection, one might find their Short URL generator eventually spitting out naughty words, or other common words, which may be quite undesirable.
One option to avoid generating "real words" would be to just strip out all of the common vowels.
Are there other/better workarounds that don't sacrifice too much headroom?
I think your idea to strip out the vowels will be your best bet here.
Anything else, like blacklists, dictionary lookups, etc., will just be incredibly tedious, require a lot of maintenance and, ultimately, be fallible.
You could normalize to base-30 [0-9bcdfghj-np-tvwxz], which will simply never generate vowels and thus not generate real words.
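A minimal Python sketch of that base-30 idea is shown below. The alphabet is the character class above written out in full (digits plus the 20 consonants other than y); the function names `encode` and `decode` are made up for the example.

```python
# 0-9 plus b c d f g h j k l m n p q r s t v w x z: no vowels and no 'y'.
ALPHABET = "0123456789bcdfghjklmnpqrstvwxz"
BASE = len(ALPHABET)  # 30


def encode(number):
    """Convert a non-negative integer ID to a vowel-free short code."""
    if number < 0:
        raise ValueError("number must be non-negative")
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode(code):
    """Convert a short code back to the integer ID."""
    number = 0
    for ch in code:
        number = number * BASE + ALPHABET.index(ch)
    return number


print(encode(123456789), decode(encode(123456789)))
```

Compared with base-36, each character carries slightly less information (log2 30 ≈ 4.9 bits instead of log2 36 ≈ 5.2 bits), so codes get a little longer only for large IDs.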
You could separate your vowels and consonants (xxxddd_eeeaaa). If it's always longer than three letters, you're probably safe from curse words.
Or you could insert numbers randomly.
Or you could create a filter.
Of the three, I'd probably stick with the first.
To sacrifice only a little information per digit while still preventing as much meaning as possible, you should probably leave out the most frequent letters in English. This will be slightly more efficient than simply skipping all vowels.

Is it possible to create a mask to handle non-North American phone numbers?

For North American phone numbers, (999) 999-9999 works pretty well as an input mask.
However, I can't find a good example that will handle non-North American numbers. I know that the number of digits can vary, so other than restricting it to digits only, is there a good example anywhere?
There is no generic mask, really: There are too many combinations.
The only thing that is fixed is the international country code, usually prefixed by +.
According to the Wikipedia Article on telephone numbering plans, most countries conform with the E.164 numbering plan.
If I read E.164 correctly, you can safely make the following assumptions:
Country code: 1-3 digits
Network / Area code and Number: Up to 14 digits (E.164 caps the full number, country code included, at 15 digits)
I would ask for the country code, and have the "area code + number" field as an input of up to 14 digits.
You can deduce the country code with a simple RegEx such as:
^(?:(?:0(?:0|11)\s?)|\+)([17]|2([07]|[1-689]\d)|3([0-469]|[578]\d)|4([013-9]|2\d)|5([1-8]|[09]\d)|6([0-6]|[789]\d)|8([12469]|[03578]\d)|9([0-58]|[679]\d))
Followed by
(([\s\(\).-]{0,2}\d){4,13})$
to extract the national number.
For validating the national number length and validity, you'd need libphonenumber or similar.
The long RegEx above allows +, 00 or 011 before the country code and a selection of punctuation in the number which will also have to be stripped.
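To show how the two expressions fit together, here is a small Python sketch that applies them one after the other (with the leading plus sign escaped as `\+`, which regex engines require). The function `split_number` is a hypothetical helper, and, as noted above, it does not validate the national number for a given country.

```python
import re

# Country-code pattern from the answer, split over several lines.
COUNTRY_CODE = re.compile(
    r"^(?:(?:0(?:0|11)\s?)|\+)"
    r"([17]|2([07]|[1-689]\d)|3([0-469]|[578]\d)|4([013-9]|2\d)"
    r"|5([1-8]|[09]\d)|6([0-6]|[789]\d)|8([12469]|[03578]\d)|9([0-58]|[679]\d))",
    re.ASCII,
)
# National-number pattern: 4 to 13 digits with optional punctuation.
NATIONAL_NUMBER = re.compile(r"(([\s\(\).-]{0,2}\d){4,13})$", re.ASCII)


def split_number(raw):
    """Return (country_code, national_digits), or None if nothing matches."""
    country = COUNTRY_CODE.match(raw)
    if country is None:
        return None
    national = NATIONAL_NUMBER.match(raw, country.end())
    if national is None:
        return None
    digits = re.sub(r"[^0-9]", "", national.group(1))  # strip punctuation
    return country.group(1), digits


print(split_number("+49 30 123456"))
print(split_number("011 44 20 7946 0958"))
```

A number without an international prefix, such as `030 123456`, yields `None`, because the first expression insists on +, 00 or 011.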
You don't mention your application but this is certainly possible using regular expressions. You might want to take a look here.
Not easily. Take a look at this page for an example why: if you only look at the German phone numbers, you'll note that there are different formats depending on where you're calling the number from. Which one do you pick? And that's just for German phone numbers; they differ from continent to continent, and from country to country.
Going with "numbers-only" is probably your safest bet.
I would allow for spaces, dashes, slashes and all that, but actually only care for numbers and the optional leading + sign. Everything else, such as assuming certain blocks of a certain length is just asking for trouble.
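That lenient approach might look like the following Python sketch: accept whatever punctuation the user types, then keep only the digits and an optional leading plus sign. The name `normalize_phone` is invented for the example, and no assumption is made about block lengths.

```python
import re


def normalize_phone(raw):
    """Keep only the digits and an optional leading '+' sign.

    Returns None when the input contains no digits at all.
    """
    raw = raw.strip()
    plus = "+" if raw.startswith("+") else ""
    digits = re.sub(r"[^0-9]", "", raw)  # drop spaces, dashes, slashes, ...
    if not digits:
        return None
    return plus + digits


print(normalize_phone("+49 (0)30 / 123-456"))
print(normalize_phone("(999) 999-9999"))
```

The normalized string is what gets stored or passed on for validation; how the number is displayed again is a separate formatting concern.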
Maybe it is bad form to answer an old question, but libphonenumber seems like a good solution to your problem.

Resources