I seek a source for the world's main language names, country names, and territory names, localised into a long list of languages.
Example of localised names of languages:
EN EN English
EN ES Inglés
ES EN Spanish
ES ES Español
Example of localised names of a certain country in south-west Europe:
ES ES España
ES FR Espagne
ES EN Spain
Any idea where I can get or build that from?
You can find the information you are looking for in the Unicode Common Locale Data Repository (CLDR) here: http://cldr.unicode.org/
Data is supplied in XML so you will need to import this into your database.
The CLDR publishes human-readable charts for language names and territory (country, continent, etc.) names. In each, there is a section for each language or territory, identified by a standardised code. Then the rows of each section give a localised name, and codes for the languages which use that localised name to refer to the language or territory.
The underlying CLDR data is in XML form. The language and territory names you seek are in the directory repos/cldr/trunk/common/main/, with one XML file per language, containing the names of various languages and territories localised into that language. For instance, the file es.xml has the Spanish-language names for languages ("español", "inglés") and countries ("España").
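As a rough illustration of reading those per-language files, here is a minimal Python sketch. It assumes the usual LDML layout (localeDisplayNames, then languages/language and territories/territory, each keyed by a type attribute). The XML below is a tiny hand-written excerpt shaped like es.xml, not the real file, and load_display_names is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written excerpt shaped like CLDR's common/main/es.xml (NOT the real file).
SAMPLE_ES_XML = """\
<ldml>
  <identity><language type="es"/></identity>
  <localeDisplayNames>
    <languages>
      <language type="en">inglés</language>
      <language type="es">español</language>
    </languages>
    <territories>
      <territory type="ES">España</territory>
      <territory type="US">Estados Unidos</territory>
      <territory type="US" alt="short">EE. UU.</territory>
    </territories>
  </localeDisplayNames>
</ldml>
"""


def load_display_names(xml_text):
    """Return {'languages': {code: name}, 'territories': {code: name}}."""
    root = ET.fromstring(xml_text)
    names = {"languages": {}, "territories": {}}
    for kind, tag in (("languages", "language"), ("territories", "territory")):
        for el in root.findall(f"./localeDisplayNames/{kind}/{tag}"):
            # Skip alternate forms (e.g. alt="short"); keep the default name only.
            if el.get("alt") is None:
                names[kind][el.get("type")] = el.text
    return names


names = load_display_names(SAMPLE_ES_XML)
print(names["languages"]["en"])    # Spanish name for English
print(names["territories"]["ES"])  # Spanish name for Spain
```

Running the same function over every file in the directory, and storing (target language, code, name) rows, would give a table like the one in the question.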
For country names, you can use GeoNames; for example, here is a list of alternate names for the south-west European country. GeoNames has an API and a data dump that you can use in your programs.
For language names, this list could be useful.
Here is a database in multiple formats.
Best of luck learning all those languages in their local slang.
I can retrieve all language codes like so:
Locale.isoLanguageCodes
And all region codes:
Locale.isoRegionCodes
But I would like a list of all language IDs. There is a short example list in the Apple Language and Locale IDs reference:
en-AU for English as used in Australia
en-GB for English as used in the United Kingdom
fr-FR for French as used in France
fr-CA for French as used in Canada
de-AT for German as used in Austria
de-CH for German as used in Switzerland
I'm hoping there's some standard list, so that I don't have to generate every possible permutation of language + region; otherwise I'll get combinations like Italian as used in Mongolia. Does anyone know of a source or method to produce a standard list of language IDs?
Bonus Question - Does anyone know how to get a list of the full spelling of the languages or regions and not just the code?
Is there a list with complete information about characteristics like:
currency
date and time format (including whether it is 12-hour or 24-hour)
measurement units (distance, speed, temperature...)
preferred language
masks for phone and local documents
timezones (at least the main ones / variations if daylight saving time is applicable)
decimal and thousand separators
for countries around the world?
I am doing it myself; however, as it takes too long to gather the data, I thought maybe someone has already done it.
Don't reinvent the wheel.
Start with CLDR, the Common Locale Data Repository (http://cldr.unicode.org/)
Or, if you want to honor the locale preferences in your application, use standard I18N APIs (from your platform, whatever that is, or from a popular library like ICU, http://site.icu-project.org/).
For currencies you can rely on international standard ISO 4217. It also refers to the country code of each currency code. This website provides this dataset for download.
For date formats, the best reference seems to be Wikipedia.
Measurement units are a very complex domain, because you need to know which dimension you are measuring (speed, distance, volume, ...) and the units (paper size in cm is not the same as road distance in km). Here you have some lists per type of unit, but not per country. This website shows a list of the systems of measurement in use per country. You'll see that, fortunately, many of them share the metric system, so that you could use an approach "by exception", documenting yourself only on the remaining ones.
For languages, you have the international standard ISO 639 or IANA, but these are country-independent. You can look at reference lists for locales such as here: they associate a language code with a country code, so that you can complete the standard information. Note that some countries have several languages, and you cannot and should not decide which one is preferred.
For telephone masks, there is only an international list of prefixes. Usage varies greatly across countries. Some have a fixed format, some use variable formats, some have zone prefixes and some do not. Sometimes there is not even a clear standard in the country, and several usages coexist. I'm not aware of any global list of these.
For timezones around the world, you could have a look at IANA which is extremely comprehensive.
For decimal and thousand separators, there is no international standard. Again, I'd suggest referring to Wikipedia.
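To make the timezone point concrete, here is a small Python sketch that queries the IANA time zone database through the standard-library zoneinfo module. It assumes Python 3.9 or later and that the tz database is available (from the operating system or the tzdata package). The helper name january_july_offsets is made up for this illustration. Comparing the UTC offset in January and in July is only a rough way to detect whether a zone observes daylight saving time in a given year.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo


def january_july_offsets(zone_name, year):
    """Return the UTC offsets of an IANA zone in mid-January and mid-July."""
    tz = ZoneInfo(zone_name)
    january = datetime(year, 1, 15, 12, tzinfo=tz).utcoffset()
    july = datetime(year, 7, 15, 12, tzinfo=tz).utcoffset()
    return january, july


for zone in ("Europe/Madrid", "Asia/Tokyo"):
    january, july = january_july_offsets(zone, 2024)
    status = "DST observed" if january != july else "no DST"
    print(zone, january, july, status)
```

The mapping from a country to its zones is not part of this sketch; the tz database ships that separately in its zone tables.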
I am currently working on a project that would benefit from localized locale codes. For example, RFC 5646 and the parent standard BCP 47 define locale codes for various locales, such as en-GB for British English and zh-Hans-SG for Singaporean Chinese using simplified Chinese characters. Unfortunately, these codes use only a small subset of the Latin alphabet.
I am looking for a similar standard or commonly used system that defines a set of language codes in the respective writing system of each language (somewhat akin to an autoglossonym).
EDIT: I am strictly seeking localized locale codes since in the problem's context (URI i18n/l10n), it would be unreasonable to use an autoglossonym or other verbose equivalent.
Locale codes as specified by RFC 5646 and BCP 47 are meant to be machine-parseable. Thus, en-GB is "English (Great Britain)" and zh-Hans-SG is "Chinese (Singapore, Simplified Chinese Script)".
They are designed so that web pages, e-books and other documents can specify, in a standard way, the language and script they are written in.
Thus, each language, script and country is given a unique code from the respective standards and collated in the IANA Language Subtag Registry (http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry).
For a localized version of this, you are better off mapping the codes to a localized name (e.g. localizing the Description field of the subtag registry database, or using a project like iso-codes) and formatting that in a presentable way, keeping the locale code as an internal representation.
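A minimal Python sketch of that mapping idea follows. It assumes the registry's record format (records separated by %% lines, each with Type, Subtag and Description fields). The sample text is a trimmed, hand-written excerpt with a placeholder File-Date, not the real registry, and parse_registry and describe are hypothetical helper names. Subtags are classified only by shape (4 letters for a script, 2 letters or 3 digits for a region), which ignores variants, extensions and grandfathered tags.

```python
# Trimmed, hand-written excerpt in the registry's record format (NOT the real file).
SAMPLE_REGISTRY = """\
File-Date: 2024-01-01
%%
Type: language
Subtag: en
Description: English
%%
Type: language
Subtag: zh
Description: Chinese
%%
Type: script
Subtag: Hans
Description: Han (Simplified variant)
%%
Type: region
Subtag: GB
Description: United Kingdom
%%
Type: region
Subtag: SG
Description: Singapore
"""


def parse_registry(text):
    """Map (type, lowercased subtag) to the first Description of each record."""
    index = {}
    for record in text.split("%%"):
        fields = {}
        for line in record.strip().splitlines():
            if line[:1].isspace():  # folded continuation line: ignored here
                continue
            key, sep, value = line.partition(": ")
            if sep and key not in fields:
                fields[key] = value
        if {"Type", "Subtag", "Description"} <= fields.keys():
            index[(fields["Type"], fields["Subtag"].lower())] = fields["Description"]
    return index


def describe(tag, index):
    """Turn a language[-script][-region] tag into a presentable name."""
    parts = tag.split("-")
    language = index[("language", parts[0].lower())]
    details = []
    for sub in parts[1:]:
        if len(sub) == 4 and sub.isalpha():
            details.append(index[("script", sub.lower())])
        elif (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit()):
            details.append(index[("region", sub.lower())])
    return f"{language} ({', '.join(details)})" if details else language


index = parse_registry(SAMPLE_REGISTRY)
print(describe("en-GB", index))
print(describe("zh-Hans-SG", index))
```

For a localized display, the English Description values would be replaced by a translated table keyed in the same way, while the tag itself stays the internal representation.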
(changed title so as not to confuse future readers)
Is this an authoritative list of languages I can use for my application?
From: http://en.wikipedia.org/wiki/ISO_639-1 --> http://www.infoterm.info/standardization/iso_639_1_2002.php
ISO Language Codes
ISO 639-1:2002
Codes for the representation of names
of languages -- Part 1: Alpha-2 code
Infoterm has been designated the
Registration Authority (ISO 639-1/RA)
for the language alpha-2 language code
contained in ISO 639-1:2002 "Codes for
the representation of names of
languages - Part 1: Alpha-2 code /
Codes pour la représentation des noms
de langue - Partie 1 : Code alpha-2".
ISO 639-1 is old. ISO 639-3 is newer and more complete.
I would love to be able to localize my Android apps using ISO 639-3. However, the underlying system permits localization only for languages in the ISO 639-1 list. This means that I can't use the built-in localisation for minor languages like Yolngu Matha and other Australian languages.
So it really depends on the underlying system you are developing for whether you can use ISO 639-3 or not. I have not found much support for 639-3, except in linguistics software.
What products support 3-digit region subtags, e.g., es-419 for Latin-American Spanish?
Are web browsers, translation tools and translators familiar with these numeric codes in addition to the more common "es" or "es-ES"?
I've already visited the following pages:
W3C Choosing a Language Tag
W3C Language tags in HTML and XML
RFC 5646 Tags for Identifying Languages
Microsoft National Language Support (NLS) API Reference
I doubt that many such products exist. It seems that some mainstream programming languages (I have tested C# and Java) do not support these tags, so it would be quite hard to develop programs that do.
BTW, the NLS API Reference that you linked does not contain a region tag for any of the LCID definitions. And if you think about it for a moment, knowing how a Locale Identifier is built, there is actually no way to support it now. An implementation change would be required (they would have to use some reserved bits, I suppose).
I don't think we will see support for region tags in the foreseeable future.
Edit
I saw that Microsoft assigned LCID of value -1 and -2 to "European Union 1" and "European Union 2" respectively. However I don't think it is related.
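For reference, the shape of the tags under discussion can be sketched in a few lines of Python. In RFC 5646 a region subtag is either two letters (an ISO 3166-1 code such as ES) or three digits (a UN M.49 code such as 419). The pattern below is a deliberately simplified assumption that handles only language[-script][-region]. It says nothing about which products accept such tags, and split_tag is a made-up helper name.

```python
import re

# Simplified pattern: language[-script][-region]. Variants, extensions,
# private-use and grandfathered tags are deliberately not handled.
TAG_RE = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def split_tag(tag):
    """Split a simple language tag into language, script and region parts."""
    match = TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"unsupported or malformed tag: {tag!r}")
    return match.groupdict()


for tag in ("es", "es-ES", "es-419", "zh-Hans-SG"):
    print(tag, split_tag(tag))
```

A tag such as es-419 therefore parses with the same structure as es-ES; only the region part is numeric.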