How do I specify language when storing strings?

I'm currently developing a system that supports several languages. I want to specify these languages as precisely as possible in the database in case of future integrations. (Yes, I know it's a bit YAGNI.)
I've found several ways to define a language:
nb-NO
nb_NO
nb-no
nb_no
nb
These can all mean "Norwegian Bokmål". Which one, if any, is the most correct?
The Locale article on the ArchLinux Wiki specifies a locale as language[_territory][.codeset][#modifier]. I guess the codeset and modifier are only relevant for input. But language is the minimum, and territory may be nice to have should we implement cultural differences regarding currency, decimal points, etc.
Am I overthinking it?

Look at BCP 47
https://tools.ietf.org/html/bcp47
These days you would need to support at least language, script, and region (only language being mandatory).
It depends a lot on what you use these tags for.
If it is spoken content you might care about dialect (for instance Cantonese vs Mandarin Chinese), but not script. In written form you will care about script (Traditional vs. Simplified Chinese), but not dialect.
The complete stack you use to process things also matters a lot. You might use a hyphen as separator, use grandfathered IDs, or the -u- extension (see BCP 47), then discover that you use a programming language that "chokes" on it. Or you use "he" for Hebrew, but your language (cough Java cough) wants the deprecated "iw".
So you might decide to use the same locale ID as your tech stack, or have a "conversion layer".
If you want things accessible from several technologies, then a conversion layer is your only (reasonable) option.
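As a rough illustration, here is a minimal sketch in Python of what such a conversion layer might look like; the deprecated-tag table is only a small sample, not the full IANA registry:

    # Sample of deprecated language codes and their modern replacements
    # (illustrative only, not a complete registry).
    DEPRECATED = {"iw": "he", "in": "id", "ji": "yi"}

    def normalize_tag(tag):
        """Normalize a BCP 47 tag: lowercase language, Title-case script,
        uppercase region. Accepts '_' as a separator and converts it to '-'."""
        subtags = tag.replace("_", "-").split("-")
        lang = DEPRECATED.get(subtags[0].lower(), subtags[0].lower())
        out = [lang]
        for sub in subtags[1:]:
            if len(sub) == 4 and sub.isalpha():    # script subtag, e.g. Latn
                out.append(sub.title())
            elif len(sub) == 2 and sub.isalpha():  # region subtag, e.g. NO
                out.append(sub.upper())
            else:                                  # variants, extensions, ...
                out.append(sub.lower())
        return "-".join(out)

    print(normalize_tag("nb_no"))       # nb-NO
    print(normalize_tag("iw"))          # he
    print(normalize_tag("zh-hant-tw"))  # zh-Hant-TW

Storing tags in this canonical form, and converting at the boundary of each technology, keeps the database consistent no matter which stack reads it.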

Related

Command line argument / program option parsing Styles and Specification?

I am curious if there is an extensive overview, preferably specifications or technical reports, of the GNU style and other commonly used styles for parsing command line arguments.
As far as I know, there are many catches, and it's not completely trivial to write a parsing library that would be as compliant as, for example, C++'s boost::program_options, Python's argparse, or GNU getopt.
On the other hand, there might be libraries that are too liberal in accepting certain options, or too restrictive. So if one wants to aim for good compatibility / conformance with a de facto standard (if such exists), is there a better way than simply reading a number of mature libraries' source code and/or test cases?
Posix provides guidelines for the syntax of utilities in Chapter 12 of XBD (the Base Definitions). It's certainly worth a read. As is noted there, backwards compatibility has meant that many standardized utilities do not conform to these guidelines, but the standard nonetheless recommends
... that all future utilities and applications use these guidelines to enhance user portability. The fact that some historical utilities could not be changed (to avoid breaking existing applications) should not deter this future goal.
You can also read the rationale for the syntax guidelines.
Posix provides a basic syntax, but it's insufficient for utilities with a large number of arguments, and single-letter options are somewhat lacking in self-documentation. Some utilities -- test, find and tcpdump spring to mind -- essentially implement domain-specific languages. Others -- ls and ps, for example -- have a bewildering pantheon of invocation options. To say nothing of compilers...
Over the years, a number of possible extension methods have been considered, and probably all of them are still in use in at least one common (possibly even standard) utility. Posix recommends the use of -W as an extension mechanism, but there are few uses of that. X Windows and TCL/Tk popularized the use of spelled-out multicharacter options, but those utilities expect long option names to still start with a single dash, which renders it impossible to coalesce non-argument options [Note 1]. Other utilities -- dd, make and awk, to name a few -- special-case arguments which have the form {id}={val}, with no hyphens at all. The GNU approach of using a double hyphen seems to have largely won, partly for this reason, but GNU-style option reordering is not universally appreciated.
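To make the difference between the styles concrete, here is a small sketch using Python's standard getopt module, which supports both the classic coalescible single-dash short options and GNU-style double-dash long options:

    import getopt

    # Short options 'a' and 'b' coalesce after one dash; the GNU-style
    # long option takes a double dash and an '=' for its argument.
    opts, rest = getopt.gnu_getopt(
        ["-ab", "--output=result.txt", "file1"],
        "ab",
        ["output="],
    )
    print(opts)  # [('-a', ''), ('-b', ''), ('--output', 'result.txt')]
    print(rest)  # ['file1']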
A brief discussion of GNU style is found in the GNU style guide (see also the list of long options), and a slightly less brief discussion is in Eric Raymond's The Art of Unix Programming [Note 2].
Google code takes command-line options to a new level; the internal library has now been open-sourced as gflags so I suppose it is now not breaking confidentiality to observe how much of Google's server management tooling is done through command-line options. Google flags are scattered indiscriminately throughout the code, so that library functions can define their own options without the calling program ever being aware of them, making it possible to tailor the behaviour of key libraries independently of the application. (It's also possible to modify the value of a gflag on the fly at runtime, another interesting tool for service management.) From a syntactic viewpoint, gflags allows both single- and double-hyphen long option presentation, indiscriminately, and it doesn't allow coalesced single-character-option calls. [Note 3]
It's worth highlighting the observation in The Unix Programming Environment (Kernighan & Pike) that because the shell "must satisfy both the interactive and programming aspects of command execution, it is a strange language, shaped as much by history as by design." The requirements of these two aspects -- the desire for a concise interactive language and a precise programming language -- are not always compatible.
Syntax flexibility, while handy for the interactive user, can be disastrous for the script author. As an example, last night I typed -env=... instead of --env=... which resulted in my passing nv=... to the -e option rather than passing ... to the --env option, which I didn't notice until someone asked me why I was passing that odd string as an EOF indicator. On the other hand, my pet bugbear -- the fact that some prefer --long-option and others prefer --long_option and sometimes you find both styles in the same program (I'm looking at you, gcc) -- is equally annoying as an interactive user and as a scripter.
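That pitfall is easy to reproduce; a quick sketch with Python's standard getopt module (option names chosen to mirror the anecdote) shows the single-dash form silently turning into -e with argument nv=...:

    import getopt

    # Single dash: 'e' is taken as a short option, 'nv=prod' as its argument.
    print(getopt.getopt(["-env=prod"], "e:", ["env="])[0])
    # [('-e', 'nv=prod')]

    # Double dash: parsed as the long option --env with argument 'prod'.
    print(getopt.getopt(["--env=prod"], "e:", ["env="])[0])
    # [('--env', 'prod')]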
Sadly, I don't know of any resource which would serve as an answer to this question, and I'm not sure that the above serves the need either. But perhaps we can improve it over time.
Notes:
Obviously a bad idea, since it would make impossible the pastime of constructing useful netstat invocations whose argument is a readable word.
The book and its author are commonly known as TAOUP and ESR, respectively.
It took me a while to get used to this, and very little time to revert to my old habits. So you can see where my biases lie.

Internationalization and Localization for Programs (i18n)

I have several projects I've worked on that are set up for internationalization.
From the programming perspective, I have everything pretty much set up and have put all of the strings into an XML file or properties file. I wish to get these files translated into other languages, such as: Italian (it), Spanish (es), German (de), Brazilian Portuguese (pt-BR), Chinese Simplified (zh-CN), Chinese Traditional (zh-TW), Japanese (ja), Russian (ru), Hungarian (hu), Polish (pl), and French (fr).
I've considered using services like Google Translate; however, I feel that these automatic translation tools are still a bit weak.
In summary, I'm curious whether others have used professional translation services for their programs. If so, which ones would people recommend, and how did you coordinate the translation updates with the translation teams? Any idea on what I should expect to pay? Or is there a better way of doing this that I'm not aware of?
Machine translation services like Google, Bing, etc. are not a good choice. As you mention, these services are in reality still in their infancy, and more importantly, using them will most likely give your non-English customers a bad impression of your application.
If you want top quality translation, you will need to employ the services of a professional translation agency. Translators need to understand your application in order to translate the text correctly, so providing them with the application itself or screen captures of the English product will help.
You will pay per word - the rates vary from agency to agency, and also from language to language.
The other alternative is using crowd-sourced translations, from GetLocalization for example.
To summarize, proper localization is not just a matter of translating the text. You need to build a relationship with your translators and ensure they understand your application and the context of the strings they are translating; otherwise you will end up with a linguistically poor application that will reflect badly on your company.

How to translate (internationalize, localize) application?

I need to translate a Delphi application. Right now all the strings in the interface are in Russian.
What tools are there to quickly find and parse all the .pas files for string constants?
How do people translate large applications?
GetText should be able to do that; search for "extract" at http://dxgettext.po.dk/documentation/how-to
If all you need is to translate the GUI and maybe the resourcestrings in the sources, and the inline string constants in the Pascal sources are not needed for translation, then you can try the Delphi built-in method. However, forums say that the ITE is buggy. But at least that is the official Delphi way.
http://edn.embarcadero.com/article/32974
http://docwiki.embarcadero.com/RADStudio/en/Creating_Resource_DLLs
To translate sources with ITM, manual preparation is needed, as shown in the source sheets at http://www.gunsmoker.ru/2010/06/delphi-ite-integrated-translation.html
I remember I translated Polaris texts for the JediVCL team - so they did some extraction. But I think they just extracted all characters > #127 into a text file - there was no structure; constants and comments were all mixed together.
Still, there is a component, though I doubt it can be used the way you need: http://wiki.delphi-jedi.org/wiki/JVCL_Help:TJvTranslator
There are also commercial tools. But I don't know if their features would help you with your initial extraction and translation tasks. They would probably be of much help when you need to maintain your large application translated into many languages, but not when you need to do a one-time conversion. But maybe I am wrong; check their trial versions if you wish. By reviews, these suites are considered among the best of the commercial ones:
TsiLang Suite http://www.tsilang.com/?siteid2=7
Korzh Localizer http://devtools.korzh.com/localization-delphi/
First, I'd recommend moving all localizable string constants into resourcestring sections within their unit files. For example,
raise Exception.Create('Error: что-то пошло не так (in Russian language)');
will be converted to
resourcestring
  rsSomeErrorMessage = 'Error: что-то пошло не так (in Russian language)';
...
raise Exception.Create(rsSomeErrorMessage);
More about Resource Strings.
This process can be accelerated by using the corresponding Delphi IDE refactoring command, or with third-party utilities such as ModelMaker Tools.
Then you can use any available localizer to translate or even internationalize your program. I'd recommend my Delphi localizer - it's free.
Basically, you have two ways:
Resource-based localization tools (Delphi ITE, Multilizer, etc.)
Database-based localization tools (GetText, TsiLang, etc.)
The former takes advantage of the Windows resource support: resource loading can be redirected to a different resource stored in a DLL when the application is started. The advantages are that whole forms can be localized, including images, colors, control sizes, etc., and not only strings. Moreover, no code change is required. The disadvantages are that end-user localization is not usually possible, and changing language without restarting the application may be trickier. Microsoft applications, including Windows itself, use this technique. It will work with any Delphi library that stores strings into resources and DFMs properly.
The latter stores strings in an external "database" (it could even be a text file...). The advantage is usually that users can add or modify translations, and switching language on the fly is easier. The disadvantages are that this technique is more intrusive (it has to hook string loading/display) and may require code changes; the tools are usually limited to string localization and don't offer broader control (images, sizes, etc.), and they may not work with unknown controls/libraries they cannot hook correctly. Cross-platform applications usually use this technique because Windows-like resource support is not available on all operating systems.
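To illustrate the database-based approach in miniature, here is a Python sketch (the catalog contents and helper name are invented for the example); note how the language can change at runtime and untranslated strings fall back to the original text:

    # Translations live in plain data, so users can edit them and the
    # language can be switched at runtime without restarting.
    CATALOG = {
        "de": {"File": "Datei", "Edit": "Bearbeiten"},
        "ru": {"File": "Файл", "Edit": "Правка"},
    }

    current_language = "de"

    def tr(text):
        """Look up a translation; fall back to the original string."""
        return CATALOG.get(current_language, {}).get(text, text)

    print(tr("File"))   # Datei
    current_language = "ru"
    print(tr("File"))   # Файл
    print(tr("Help"))   # Help (untranslated strings pass through unchanged)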
You should choose the technique that suits you and your application best. Moreover, some tools ease the collaboration with an external translator, while others don't. I prefer the resource-based approach because it doesn't require code changes and doesn't tie me to a given library.
We are using dxgettext (GNU gettext for Delphi and C++ Builder) and Gorm (from the same author). Mind you, most tools require you to use English as the primary language and translate from that only. dxgettext allows other languages, but there are bound to be unknown problems with that. Be prepared that internationalizing a large application will be more work than you currently think.

Real world usage of concatenative programming languages

What are some real-world projects done in concatenative languages like Forth, Factor, Joy, etc.?
factorcode.org, concatenative.org and tinyvid.tv are powered by Furnace, a Factor web server and framework.
PostScript is concatenative, and there's obviously a huge number of applications of PostScript. It's just not a general purpose programming language.
As Greg wrote, PostScript is the mammoth example.
Concatenative languages pop up everywhere, quite naturally, because of the trivial nature of the language runtime. They are a favourite for firmware: I first encountered Forth "in the flesh" in the bootloader for a Sun SPARCstation. It powers the firmware of the OLPC.
OCaml's parent, Caml, was based on realising the semantics of functional programming as the Categorical Abstract Machine (the CAM in Caml).
BibTeX uses a concatenative language to compile style files.
There is the somewhat-obsolete but very cool Quartus Forth for Palm which allowed full compiled application development on the Palm device (Forth as a minimalist language works rather well in those circumstances). Their home page lists several Palm apps.
This FIG page has a list of mostly-embedded projects including a reference to the very cool use of Forth by NASA.
I met a guy at an Apple conference in Queensland back in about 1991 who had retailed a road planning application written in MacForth.
Christopher Diggins was talking about his Cat language being used inside Microsoft to help optimise compilers but I don't know if that went anywhere.
I suspect PowerMOPS (the successor to Neon) may elude the definition of concatenative because its big deal is adding object-orientation, which implies instances.
Take a look at FORTH, Inc. They list several projects that they and their customers did using their Forth.
Eserv and nncron are written in SP-Forth.
The Bitcoin protocol, and most of the other cryptocoins, use pubkey scripts and signature scripts for validation of transactions:
Pubkey scripts and signature scripts combine secp256k1 pubkeys and signatures with conditional logic, creating a programmable authorization mechanism.
These scripts are written in a concatenative language:
The script language is a Forth-like stack-based language deliberately designed to be stateless and not Turing complete. Statelessness ensures that once a transaction is added to the block chain, there is no condition which renders it permanently unspendable. Turing-incompleteness (specifically, a lack of loops or gotos) makes the script language less flexible and more predictable, greatly simplifying the security model.
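To give a flavour of what a Forth-like stack language looks like, here is a toy evaluator in Python (the three-word vocabulary is invented for the example and is not the real Bitcoin Script opcode set):

    def run(program):
        """Evaluate a tiny concatenative program: numbers are pushed onto
        the stack, words operate on it. No variables, no loops - like
        Bitcoin Script, execution is a single stateless pass."""
        stack = []
        words = {
            "DUP": lambda s: s.append(s[-1]),
            "ADD": lambda s: s.append(s.pop() + s.pop()),
            "EQUAL": lambda s: s.append(s.pop() == s.pop()),
        }
        for token in program.split():
            if token in words:
                words[token](stack)
            else:
                stack.append(int(token))
        return stack

    # '2 DUP ADD 4 EQUAL' pushes 2, duplicates it, adds (-> 4), compares to 4.
    print(run("2 DUP ADD 4 EQUAL"))  # [True]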
Part of the firmware on Macs (at least in the older PowerPC models) was written in Forth.
See: Link

Process for localization of Delphi 2009 app by volunteer translators?

I have a freeware scientific app that is used by thousands of people in nearly 100 countries. Many have offered to translate it for free. Now that D2009 makes this easier (with integrated and external localization tools, plus native Unicode support), I'd like to make this happen for a few languages and steadily add as many as user energy will support.
I'm thinking that I'll distribute a spreadsheet with a list of strings (dozens but not hundreds) to be translated, have them return it, and compare submissions in the same language from 2-3 users then work to resolve discrepancies by consensus. Then I'll incorporate the localizations using the Integrated Translation Environment, and distribute localized updates.
Has anyone delegated translation to users? Any gotchas, D2009-specific or otherwise?
EDIT: Has anyone compared the localization support built into D2009 versus dxgettext?
I have never been a fan of proprietary localization tools for freeware or open source applications. Using dxgettext, the Delphi port of GNU gettext, looks like a much better option to me:
Integration into the program (even much later than its development) is easy.
Extraction of translatable strings can be done by command line programs and is therefore easily introduced into an automated build.
A new translation can be added simply by creating a new directory with the correct structure, copying the empty translation file into it, and starting to translate the strings. This is something each user can do for themselves, there's no need to involve the original author for creation of a new translation. There is also instant gratification with this process - once the program is restarted the new translations are shown immediately.
Changing an existing translation is even easier than creating a new one. Thus if a user finds spelling or other errors or needs for improvement in the translation they can correct them easily and send the changes to the author.
New program versions work with old translations, the system degrades very gracefully - new and untranslated strings are simply shown unmodified.
Translations can be made using nothing but Notepad, but there are several free tools for creating and managing translation files too; see the links on the dxgettext page. They are localized themselves, and they have some advantages over a spreadsheet as well:
The location of the strings in the source code can be shown (makes sense only for Open Source apps, of course).
The percentage of translated strings is shown.
Modifications to already translated strings are highlighted too.
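For reference, a translation unit in a gettext po file looks like this (file path and strings invented for the example); the "#:" comment is what lets these tools show the source location mentioned above:

    #: src/mainform.pas:42
    msgid "File not found"
    msgstr "Datei nicht gefunden"

An untranslated entry simply has an empty msgstr, which is also how the tools compute the percentage of translated strings.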
The whole system is mature and future-proof - I have used dxgettext for Delphi 4 programs, and there should be no changes necessary even for Delphi 2009: translation files have always been UTF-8 encoded.
Using a spreadsheet for the translation doesn't seem a workable solution to me once you have more than a few languages. Suppose a new program version adds 2 new strings and changes 10 strings only slightly - wouldn't you need to add the new strings to, and highlight the changed strings in, all of the several dozen spreadsheet files, and send them again to your translators? Using dxgettext you just mail the changed po file to all of them.
Edit:
There is an interesting comment about the problems there may be with dxgettext and libraries. I have never experienced this, as I have stopped using resource strings altogether. The biggest part of our programs is in German, and only a few are in English or translated into several languages.
Our internal libraries use "_(...)" around all translatable strings. There are defines ENGLISH and USEGETTEXT that are set on a per-project basis. If ENGLISH or USEGETTEXT is defined, then the English texts are compiled into the DCUs; otherwise the German texts are compiled into the DCUs. If USEGETTEXT is not defined, "_()" is compiled as a function that returns its parameter as-is; otherwise the dxgettext translation lookup is used.
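The same pattern is easy to express in other stacks; here is a minimal Python sketch (the USE_GETTEXT flag, the "myapp" domain and the locale directory are invented for the example) of a _() that either resolves through gettext or passes strings through unchanged:

    import gettext

    USE_GETTEXT = False  # plays the role of the per-project define

    if USE_GETTEXT:
        # Standard gettext layout: locale/<lang>/LC_MESSAGES/myapp.mo;
        # fallback=True degrades gracefully when no catalog is found.
        _ = gettext.translation("myapp", "locale", fallback=True).gettext
    else:
        def _(text):
            return text  # pass-through: source-language strings as-is

    print(_("File not found"))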
I have... There can be some challenges.
A string does not mean much in itself; it needs a context.
As a corollary, the same string may need more than one translation.
Screen real estate: beware of varying lengths depending on the language; for instance, French tends to be more verbose than English.
Unless you are proficient in a given language, you won't be able to evaluate the discrepancies.
I've used the TsiLang Translation Suite for enabling end users to translate. I modified the code to allow encryption so that if someone does a really good job they can protect their name against a translation file, but in general the idea is that people can share their translations and add or edit any small part they wish to. Given that it all happens within the app, and with instant visibility, it works really nicely.
As you have mentioned, D2009 comes with localization tools. Why not simply use them? AFAIK you can distribute the external translation manager (etm.exe). Do you need anything else?
Also, localization is more than just translating text. ETM also supports translation of .dfm resources.
For completeness, here is another Delphi localization tool called Delphi Localizer that I recently found, which looks to be well designed and polished. The tool is free for commercial use, with the exception of government projects (not exactly sure why the exception).
FWIW, I have used the TsiLang Translation Suite in the past and am currently working on another project using the localization tools shipped with DevExpress VCL. The latter integrates nicely with their components as well as third-party components.
