Authoring and maintaining multilingual documentation - localization

I need to author and maintain multilingual technical documentation, where each document is made up of some "standard" portions and some that are specific to the product (industrial equipment): standard portions could be warnings, quotes from laws or regulations, common sentences, or the like.
Each "portion" (i.e. a sentence, a paragraph, an image with its caption, an annotation, or a paragraph with an icon) is translated into 13 languages, and each translation should be versioned and carry a reference to its author.
Hence each document is a "composition" of those portions: a language-specific instance of the document is one that uses the portions in that specific translation.
Are there any specific technologies, standards, or tools for doing that?

The standard is DITA, and since it is essentially XML-based there are plenty of tools to work with it: standalone XML editors (Oxygen XML, XMetaL, Adobe FrameMaker; for a full reference, see the maintained list of editors) as well as extensions to well-known CMSs (Drupal, Joomla) and ECMs (Alfresco and MS SharePoint already have them; others like Nuxeo can handle it by configuring the XML grammar according to the DITA specs).
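To make the "composition of portions" concrete, here is a minimal sketch of how a reusable standard portion and a product-specific topic referencing it could look in DITA (the file names, ids, and wording are invented for illustration):

<!-- warnings.dita: a library topic holding "standard" portions -->
<topic id="warnings" xml:lang="en">
  <title>Standard warnings</title>
  <body>
    <note id="hot_surface" type="warning">Hot surface. Allow the equipment
      to cool down before servicing.</note>
  </body>
</topic>

<!-- maintenance.dita: a product-specific topic composed from portions -->
<topic id="maintenance" xml:lang="en">
  <title>Routine maintenance</title>
  <body>
    <note conref="warnings.dita#warnings/hot_surface"/>
    <p>Replace the filter cartridge every 500 operating hours.</p>
  </body>
</topic>

Each language then gets its own translated copy of these topics (same ids, a different xml:lang), and a language-specific instance of a document is produced by resolving the conref references against that language's files; version and author information can be recorded in each topic's <prolog> element.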

How to translate (internationalize, localize) an application?

I need to translate a Delphi application. Currently all the strings in the interface are in Russian.
What tools can quickly find string constants by parsing all the .pas files?
How do people translate large applications?
GetText should be able to do this; search for "extract" at http://dxgettext.po.dk/documentation/how-to
If all you need is to translate the GUI and maybe the resourcestring declarations in the sources, and the inline string constants in the Pascal sources do not need translation, then you can try the built-in Delphi method. However, forums say that the ITE is buggy. But at least that is the official Delphi way.
http://edn.embarcadero.com/article/32974
http://docwiki.embarcadero.com/RADStudio/en/Creating_Resource_DLLs
To translate sources with the ITE, manual preparation is needed, as shown in the source samples at http://www.gunsmoker.ru/2010/06/delphi-ite-integrated-translation.html
I remember I translated Polaris texts for the JediVCL team, so they did some extraction. But I think they just extracted all characters > #127 into a text file; there was no structure, and the constants and comments were all mixed together.
Still, there is a component, though I doubt it can be used the way you need: http://wiki.delphi-jedi.org/wiki/JVCL_Help:TJvTranslator
There are also commercial tools, but I don't know if their features would help you with your initial extraction and translation tasks. They would probably be of much help when you need to maintain a large application translated into many languages, but not when you need a one-time conversion. But maybe I am wrong; check their trial versions if you wish. By reviews, these suites are considered among the best of the commercial ones:
TsiLang Suite http://www.tsilang.com/?siteid2=7
Korzh Localizer http://devtools.korzh.com/localization-delphi/
Firstly, I'd recommend moving all localizable string constants into resourcestring sections within their unit files, i.e.
raise Exception.Create('Error: что-то пошло не так (in Russian language)');
will be converted to
resourcestring
rsSomeErrorMessage = 'Error: что-то пошло не так (in Russian language)';
...
raise Exception.Create(rsSomeErrorMessage);
More about Resource Strings.
This process can be accelerated by using the corresponding Delphi IDE refactoring command, or with third-party utilities such as ModelMaker Tools.
Then you can use any available localizer to translate or even internationalize your program. I'd recommend my Delphi localizer - it's free.
Basically, you have two options:
Resource-based localization tools (Delphi ITE, Multilizer, etc.)
Database-based localization tools (GetText, TsiLang, etc.)
The former takes advantage of Windows resource support: resource loading can be redirected to resources stored in a DLL when the application starts. The advantages are that whole forms can be localized, including images, colors, control sizes, etc., not only strings, and that no code change is required. The disadvantages are that end-user localization is usually not possible, and that changing the language without restarting the application may be trickier. Microsoft applications, including Windows itself, use this technique. It will work with any Delphi library that stores strings in resources and DFMs properly.
The latter stores strings in an external "database" (it could even be a text file...). The advantages are usually that users can add or modify translations, and that switching languages on the fly is easier. The disadvantages are that this technique is more intrusive (it has to hook string loading/display) and may require code changes, that the tools are usually limited to string localization and don't offer broader control (images, sizes, etc.), and that they may not work with unknown controls/libraries they cannot hook correctly. Cross-platform applications usually use this technique because Windows-like resource support is not available on all operating systems.
You should choose the technique that suits you and your application best. Moreover, some tools ease the collaboration with an external translator, while others don't. I prefer the resource-based approach because it doesn't require code changes and doesn't tie me to a given library.
We are using dxgettext (GNU gettext for Delphi and C++ Builder) and Gorm (from the same author). Mind you, most tools require you to use English as the primary language and to translate only from that. dxgettext allows other languages, but there are bound to be unknown problems with that. Be prepared that internationalizing a large application will be more work than you currently think.
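As an illustration of the gettext-style approach, here is a minimal sketch of how dxgettext is typically wired into a form. The unit and routines (gnugettext, UseLanguage, TranslateComponent, _) are the ones shipped with dxgettext; the form and the strings are invented for the example, and the translations themselves live in .po/.mo catalogues produced by the dxgettext extraction tools:

uses
  gnugettext; // runtime unit shipped with dxgettext

procedure TMainForm.FormCreate(Sender: TObject);
begin
  UseLanguage('ru');        // select the catalogue, e.g. from user settings
  TranslateComponent(Self); // translate captions, hints, etc. on this form
end;

procedure TMainForm.SaveButtonClick(Sender: TObject);
begin
  // _() looks the msgid up in the .mo file for the active language
  ShowMessage(_('File saved successfully'));
end;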

What are the best practices (business-optimal) for localization of a website?

For a website like a marketplace or similar, what is the best approach to localization if the majority of the content is in one language, but some user-generated content is in other languages?
There are so many approaches to this that I am getting confused; I am interested in the most cost-effective, business-optimal approach.
Some typical approaches:
Website in one language, accept content in many languages
Website in one language, only accept content in one language (reject other content)
Website in one language, content translated into the main language when it is not already in the main language
Website in multiple languages, content is outputted as is for each localized version of the website, that is, content is duplicated for each language version of the website
Website in multiple languages, content belongs to the language version of the website that matches the content's language. That is, English content for the English version of the website, German content for the German version of the website, and so on.
TLD vs. subdomain vs. directory for localization?
As others have commented, this is not really a technical question, so it is probably not the best fit here.
However, I will say that, if you are providing a service that has clients/users from different countries and different languages, basic politeness alone would dictate that you provide a website that can adapt to the client's language.
The content provided by the users in their own language should at least have an automated translation link (e.g. Google Translate).
If you don't do both of these things, you are locking segments of your possible audience out.
You also need to consider legislation. If you are providing services in some countries, it is mandated that you provide a number of base languages.

Looking for an information retrieval / text mining application or library

We extract various pieces of information from e-mails: flights, car rentals, hotels, and more. The method is to extract the body of the mail, usually in HTML form but sometimes as plain text, or to use the information in a PDF/Word/RTF attachment. We then apply regular expressions (sometimes in several steps) in order to get the information, which is provided in tabular form (you can think of a flight table, hotel table, etc.). Note that even though we parse HTML, this is not web scraping.
Currently we are using QL2's WebQL engine, but we are looking to replace it for business reasons. Can you recommend another engine? It must run on Linux and be accessible from Java (a Java API would be best, but a web service is a good solution as well). It must also support regular expressions for text extraction, not just extraction based on the HTML structure.
I recommend that you have a look at R. It has an extensive number of text mining packages: have a look at the Natural Language Processing view. In particular, look at the tm package. Here are some relevant links:
Paper about the package in the Journal of Statistical Software: http://www.jstatsoft.org/v25/i05/paper. The paper includes a nice example analysis of postings from the R-devel mailing list (https://stat.ethz.ch/pipermail/r-devel/) from 2006.
Package homepage: http://cran.r-project.org/web/packages/tm/index.html
Look at the introductory vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
In addition, R provides many tools for parsing HTML or XML. Have a look at this question for an example using the RCurl and XML packages.
Edit: You can integrate R with Java using JRI. It's a very widely used package, with many examples. You can also see these related questions.
Have a look at:
LingPipe - LingPipe is a suite of Java libraries for the linguistic analysis of human language.
Lucene - Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Just wanted to update: our final decision was to implement the parsing in Groovy, and to add some required functionality (HTML to text, PDF to text, whitespace cleanup, etc.) either by implementing it in Java or by relying on third-party libraries.
I use a custom parser made with Flex and C++ for similar purposes. I'd suggest you take a look at parser generators in Java (JavaCC .jj files; see the javacc-faq). Nutch does it this way (NutchAnalysis.jj).

LaTeX vs DocBook [closed]

I have only a little knowledge of LaTeX: basic formatting, basic math formulae, etc. I found that LaTeX is hard to configure to my own taste. Recently I've heard about DocBook, which is also a typesetting mechanism, but supposedly much easier since it uses XML. So, if my main job with LaTeX/DocBook is writing a simple document (not a class book) with some mathematics, and I want easy configuration and a highly customizable application, which one is better, and is there any good book on DocBook?
DocBook isn't "a typesetting mechanism". DocBook is all about separating presentation from content. DocBook only deals with content; it's used to create an abstract representation of a book, article, etc. There are numerous tools out there which lay out DocBook according to predefined templates. Some of these tools use LaTeX. AFAIK, O'Reilly uses a slightly modified version of the DocBook language to author their content, then they feed this XML into custom scripts that integrate with Adobe FrameMaker to lay out their books.
LaTeX is essentially an attempt to separate presentation from content within TeX, but it doesn't quite achieve that goal IMO. Presentation is still mixed with the content in most cases. I think LaTeX is currently the best open source tool for laying out paginated documents. However, proprietary tools like InDesign have many features (like good OpenType support) that TeX doesn't have (XeTeX kind of adds OpenType support). Either way, if you're writing a book, I highly recommend using DocBook to author your content rather than LaTeX.
That said, it sounds like you're writing short, one-off documents with a bit of math. I think LaTeX is probably your best choice. If you need lots of customizability, you might need to use Plain TeX as opposed to LaTeX, but it's going to require quite a bit of work on your part.
Well, I haven't used DocBook, but from a quick look on wikipedia and google:
DocBook does not have elements to describe mathematics.
DocBook is XML, as you say. To me, that makes it a horrible thing to write by-hand (or, rather, with a basic text editor). Maybe you enjoy writing XML, or have a good IDE. I guess you could look at this question.
DocBook's Wikipedia page lists a couple of books on it which you may want to look at, though I obviously can't say whether they are "good" books.
I would suggest going with LaTeX. Get someone to give you a basic template, then writing LaTeX is as simple as:
\section{Introduction}
This is my introduction.
\section{Stuff}
Here is some stuff.
\subsection{Particular stuff}
A particular type of stuff. With maths:
$\int_{x=1}^n 3x^2$
% etc.
Google is your friend for finding basic templates that you can start from:
One
Two
Three
To go from source code to a document, you'll need a working install of LaTeX (which is beyond the scope of this answer, but is pretty easy if you're on Linux). Ideally your LaTeX install will include pdflatex. Then you just run:
pdflatex source.tex
(there's a bit more work if you have a bibliography – but that's a topic for a different question)
The great thing about DocBook is that it is XML-based, so a chapter is a full subtree, a section is a full subtree, etc. In LaTeX, separation is only determined by the structure of the document during a linear scan.
The worst thing about DocBook is that it is XML-based: lower-level stuff is extremely dirty and annoying to code manually.
I'm not really familiar with DocBook, though I have used LaTeX fairly extensively. The idea of LaTeX is not to produce a customized document, it's to produce a readable, attractive document. It's a set of libraries, templates, macros, and so forth around TeX, set up by people who know what they are doing when it comes to document design. Of course, you have special needs that they can't anticipate, so you're going to have to do some tweaking, too. It is a very high-level, declarative language that is meant to reflect the content and structure of a document, rather than what it should look like, the idea being that your ideas and how they are organized is what you should concern yourself with, not the layout of your text on the page. If you need more control, there exists a HUGE library of additional styles and macros and so forth (CTAN), and some of them (memoir comes to mind) give you back a lot of that control.
If you are shoving a lot of complicated formatting stuff into the body of your LaTeX document, you're doing it wrong. What you need to do is get your content in there, and your document structured into chapters and sections and subsections semantically, then go back in and worry about formatting. You shouldn't have to go into the body of your document much at this point; it should all be general stuff that applies to the whole document, preferably in a reusable way. This ensures consistency.
Yes, LaTeX is kind of difficult to configure to produce exactly the kind of layout you want. I suggest you take a look at the manual of the LaTeX class memoir to see what kinds of layouts it enables you to produce.
There is a book on DocBook available online. Take a look at that too, to see what kind of layouts you can produce and if you can easily format the math content you want with DocBook.
My suggestion is to go with LaTeX if you have to write any nontrivial math, but of course it depends on which format you find it easier to work with.
About two years ago, I tried to like and use DocBook; however, I returned to LaTeX because, at least at the time, LaTeX produced better quality output (PDFs). I never managed to get the DocBook to LaTeX to PDF translation working. My problems were likely "operator error", but I suggest trying DocBook (and LaTeX) for a few simple documents before choosing one.
Here are a few points that led me to choose LaTeX:
BibTeX for bibliographies with JabREF as a GUI
Excellent quality PDF output
Lots of examples on the Internet, including several similar to my preferred format
Good books, like "A Guide to LaTeX"
If you like GUIs, take a look at LyX.
The real reasons to use DocBook center on having your document marked up meaningfully, being able to validate it, and transform it for many purposes, not only publishing. LaTeX and other macro sets add a layer of semantic markup, but you're always free to introduce TeX code, and add macros from other sources. Fundamentally, a TeX document is a computer program that can only be parsed by a TeX processor.
For maths and DocBook: since DocBook is XML, it allows you to use other XML technologies where appropriate, in this case MathML. The XMLmind XML Editor already mentioned provides a GUI maths editor, and includes stylesheets to format the formulas for web and print along with the DocBook contents.
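As a sketch of what that looks like in the markup, a formula can be embedded roughly like this (DocBook 5 with the MathML namespace; the formula itself is only an example):

<equation>
  <title>Kinetic energy</title>
  <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML">
    <mml:mrow>
      <mml:mi>E</mml:mi>
      <mml:mo>=</mml:mo>
      <mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac>
      <mml:mi>m</mml:mi>
      <mml:msup><mml:mi>v</mml:mi><mml:mn>2</mml:mn></mml:msup>
    </mml:mrow>
  </mml:math>
</equation>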
There are also tools available that enable translation of XML documents into other languages (xml2po is a simple one, http://heartsome.net/EN/home.html is a whole suite).
I don't want to go down the "easier" or "better" route, as I regard this as a matter of taste and of what you are used to. I see DocBook being XML as an advantage, because it can therefore be morphed into almost anything you like by using XSLT. Combined with its self-containedness, it feels more like structuring content than LaTeX does. Especially for documenting open source software, DocBook is really widely used. You can easily grab the templates and stylesheets of e.g. Hibernate and/or Spring and tweak them to your needs.
Another aspect I'd like to highlight is integration into build systems. For Maven there is a plugin called docbkx available that just spits out PDF, HTML, and whatever you like based on the contents and an appropriate XSLT. No further installation is needed. The only way I have seen to get this done with LaTeX is installing a few packages on the build OS and building your own script around them. IMHO that's not a feasible way to go, especially if you build cross-platform.
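For reference, the docbkx plugin is configured in the pom.xml roughly like this (the coordinates and options are quoted from memory as an illustrative sketch; check the plugin documentation for the exact current values):

<plugin>
  <groupId>com.agilejava.docbkx</groupId>
  <artifactId>docbkx-maven-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>generate-html</goal>
        <goal>generate-pdf</goal>
      </goals>
    </execution>
  </executions>
  <configuration>
    <!-- DocBook sources are looked up under src/docbkx by default -->
    <sourceDirectory>src/docbkx</sourceDirectory>
    <includes>manual.xml</includes>
  </configuration>
</plugin>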
Regarding the editor, I can recommend XMLmind XML Editor, which takes away a lot of the pain and provides quite a nice WYSIWYG approach to DocBook.
If you rely on mathematical expressions, I would also rather choose LaTeX, as there is nothing with the same power available in DocBook.
FWIW, I use DocBook via XMLmind (http://xmlmind.com/) to produce HTML and .chm files. I've also set FOP up to produce PDFs, but they aren't pretty.
Once the DocBook source is done, I cook it with xsltproc and the docbook.xsl files. This is protracted and painful to set up, but once it's working it's sweet.
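For example, generating HTML from a DocBook source is a single call once the stylesheets are installed (the stylesheet path below is an assumption; it depends on where your distribution puts the docbook-xsl files):

xsltproc --xinclude -o manual.html /usr/share/xml/docbook/stylesheet/docbook-xsl/html/docbook.xsl manual.xml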
Another approach would be to use pandoc (an extended Markdown-type tool) to get from Markdown to DocBook. This would cut the XML editor out, but you still have to do the transformation(s) to your output format.
Anyone who has had to create a professional, scientific document (research paper, book, technical guide, etc.) will know why TeX is a better choice.
For those who are not aware of some facts, here is a perfect example: at good colleges, a student's work may be completely refused if they did not properly reference other people's work. There are, I believe, hundreds of "official" ways of citing and referencing: Harvard has its own style, the ACM their own, and among computer scientists the numeric (Vancouver) notation is the most common. Many professional organisations have their own styles, and they stick to them. As far as I know, TeX is the only typesetting system that is aware of this, and with the help of BibTeX it becomes an extremely powerful tool for authors. It can save hours, if not days, of work.
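For example, with BibTeX the bibliographic entry is written once and the citation style becomes a one-line switch (a minimal sketch; the entry and the style names are illustrative):

% refs.bib
@article{knuth1984,
  author  = {Donald E. Knuth},
  title   = {Literate Programming},
  journal = {The Computer Journal},
  volume  = {27},
  number  = {2},
  year    = {1984}
}

% in the .tex document
As Knuth argues~\cite{knuth1984}, ...
\bibliographystyle{acm}  % or plain, ieeetr, apalike, ...
\bibliography{refs}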
If I were a novelist, or the author of some non-technical document, I might choose DocBook.
Have you looked at ConTeXt? It is more flexible and much easier to configure than LaTeX.
Arbortext supports native LaTeX. You can send the publishing engine or print composer LaTeX and it'll pass it through. It also supports a lot of other composition languages.
