I want to improve my gettext translation file process so that I have only one .po file for the application, to avoid duplicated translation work when I send my files to translation agencies and to reduce the merging workload.
The challenge is that the initial parsing will always create two separate files (one parsed from PHP using poedit, one parsed from Smarty using tsmarty2c.php). So I looked into msgcat and came up with this process:
1. Parse from the PHP files (using poedit)
2. $ php tsmarty2c.php templates > templates.c
3. $ xgettext templates.c -d templates -p /path/locale/en_GB/LC_MESSAGES -j
4. $ msgcat templates.po controllerClasses.po complete.po -o complete.po
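For reference, the steps above can be collected into a single script (a sketch; the paths and file names are the ones from my setup and may need adjusting):

#!/bin/sh
# Step 2: flatten the Smarty templates into a C-like file xgettext can read
php tsmarty2c.php templates > templates.c
# Step 3: join (-j) the extracted strings into the existing templates.po
xgettext templates.c -d templates -p /path/locale/en_GB/LC_MESSAGES -j
# Step 4: concatenate both catalogues with the latest agency translations
msgcat templates.po controllerClasses.po complete.po -o complete.po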
This works fine with one exception: strings that have been removed from the sources in steps 1/2 are now deprecated but still kept, since I need to include complete.po in the merging process (it has the latest translations from the agency).
Any tips on how I can create a similar process but also ensure that old strings are removed from complete.po?
I'm writing an ID3 parser and editor. It already supports ID3v1 and v2.1-2.3. Are there any other widely used ID3 versions or extensions? For example, I've read about the Enhanced ID3v1 tag (which goes right before ID3v1 and starts with "TAG+"), but I've never seen it inside MP3 files. Should I implement support for it anyway?
"ID3v2.1" never existed.
Yes, the Enhanced TAG identifies itself by "TAG+" and extends ID3v1.
For a list of all metadata systems to be expected in MP3 files see https://stackoverflow.com/a/62366354 - top priority should be ID3v2.4, as you will encounter it most often aside from ID3v2.3. Then go for the informal and/or legacy ones, because those can still be encountered (just because files become old doesn't mean they cease to exist).
Keep the following things in mind when parsing files:
A file can have both ID3v1 and ID3v2 tags.
A file can have multiple ID3v2 tags (e.g. ID3v2.3 and ID3v2.4). Although it shouldn't occur, it should pose no problem to your parser to also accept multiple tags of the same version.
ID3v2 is not limited to MP3 files (but ID3v1 and all its informal extensions are).
Consider the following parsing order in an MP3 file:
Check for ID3v1 at the end of the file.
Check for ID3v1.2 in front of ID3v1.
Check for Enhanced TAG in front of ID3v1.
Check for multiple ID3v2 tags at the start of the file and, in the case of ID3v2.4, for a footer at the end of the file in front of all ID3v1-like tags.
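To illustrate, here is a minimal sketch of that detection order in Python. It is not a complete parser: the "EXT" signature I use for ID3v1.2 is my assumption based on the informal proposal and should be double-checked, and an ID3v2.4 footer is noted but not handled.

# Minimal sketch of the detection order above (Python 3, not a full parser).
def detect_mp3_tags(path):
    tags = []
    with open(path, "rb") as f:
        data = f.read()

    # 1. ID3v1: fixed 128-byte block at the very end, starting with "TAG".
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        tags.append("ID3v1")
        # 2. ID3v1.2: informal 128-byte block right before ID3v1
        #    (the "EXT" signature is an assumption, verify it).
        if len(data) >= 256 and data[-256:-253] == b"EXT":
            tags.append("ID3v1.2")
        # 3. Enhanced TAG: 227-byte block right before ID3v1, starts "TAG+".
        if len(data) >= 355 and data[-355:-351] == b"TAG+":
            tags.append("Enhanced TAG")

    # 4. One or more ID3v2.x tags at the start of the file; the size field
    #    is a 4-byte synchsafe integer (7 bits per byte) and excludes the
    #    10-byte header. An ID3v2.4 footer ("3DI" at the end) is not handled.
    pos = 0
    while len(data) >= pos + 10 and data[pos:pos + 3] == b"ID3":
        major = data[pos + 3]
        size = ((data[pos + 6] << 21) | (data[pos + 7] << 14)
                | (data[pos + 8] << 7) | data[pos + 9])
        tags.append("ID3v2.%d" % major)
        pos += 10 + size
    return tags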
I've got a python program under active development, which uses gettext for translation.
I've got a .POT file with translations, but it is slightly out of date. I've got a script to generate an up-to-date .PO file. Is there a way to check how much of the new .PO file is covered by the .POT file?
I've got a .POT file with translations, but it is slightly out of date. I've got a script to generate an up-to-date .PO file
I think you mean the other way around: POT files are generated from your source code, and PO files contain the translations.
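For illustration, the POT is normally (re)generated from the sources with xgettext (the file names here are just examples) and then merged into the existing PO files as shown below:

$ xgettext -o messages.pot *.py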
Is there a way to check how much of the new .PO file is covered by the .POT file?
The Gettext command line msgmerge program can be used for syncing your out-of-date PO files with your latest source strings. To create a new PO file from an updated POT you would issue this command:
msgmerge old.po new.pot > updated.po
The new file will contain all the existing translations that are still valid and add any new source strings. Open it in your favourite PO editor and you should see how many strings now remain untranslated.
Update
As pointed out in the comments, you can see how many strings remain untranslated with the "statistics" option of the msgfmt program (normally used for compiling to .mo), e.g.
msgfmt --statistics updated.po
Or without bothering with the interim file:
msgmerge old.po new.pot | msgfmt --statistics -
This would produce a synopsis like:
123 translated messages, 77 untranslated messages.
I'm using the iPhone library for MeCab found at https://github.com/FLCLjp/iPhone-libmecab . I'm having some trouble getting it to tokenize all possible words. Specifically, I cannot tokenize "吉本興業" into two pieces "吉本" and "興業". Are there any options that I could use to fix this? The iPhone library does not expose anything, but it uses C++ underneath the objective-c wrapper. I assume there must be some sort of setting I could change to give more fine-grained control, but I have no idea where to start.
By the way, if anyone wants to tag this 'mecab' that would probably be appropriate. I'm not allowed to create new tags yet.
UPDATE: The iOS library is calling mecab_sparse_tonode2() defined in libmecab.cpp. If anyone could point me to some English documentation on that file it might be enough.
There is nothing iOS-specific in this. The dictionary you are using with mecab (probably ipadic) contains an entry for the company name 吉本興業. Although both parts of the name are listed as separate nouns as well, mecab has a strong preference to tag the compound name as one word.
Mecab lacks a feature that allows the user to choose whether or not compounds should be split into parts. Note that such a feature is generally hard to implement because not everyone agrees on which compounds can be split and which ones can't. E.g. is 容疑者 a compound made up of 容疑 and 者? From a purely morphological point of view perhaps yes, but for most practical applications probably no.
If you have a list of compounds you'd like to get segmented, a quick fix is to create a user dictionary for the parts they consist of, and make mecab use this in addition to the main dictionary.
There is Japanese documentation on how to do this here. For your particular example, it would involve the steps below.
Make a user dictionary with two entries, one for 吉本 and one for 興業:
吉本,,,100,名詞,固有名詞,人名,名,*,*,よしもと,ヨシモト,ヨシモト
興業,,,100,名詞,一般,*,*,*,*,こうぎょう,コウギョウ,コウギョウ
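(For reference, the columns are: surface form, left and right context IDs (left empty here so that mecab-dict-index assigns them automatically), the word cost, and then part-of-speech and reading information.)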
I suspect that both entries exist in the default dictionary already, but by adding them to a user dictionary and specifying a relatively low word cost (I've used 100 for both; the lower the cost, the more likely the entry is to be used), you can get mecab to tend to prefer the parts over the whole.
Compile the user dictionary:
$> $MECAB/libexec/mecab/mecab-dict-index -d /usr/lib64/mecab/dic/ipadic -u mydic.dic -f utf-8 -t utf-8 ./mydic
You may have to adjust the command. The above assumes:
Mecab was installed from source in $MECAB. If you use mecab installed by a package manager, you might have difficulties finding the mecab-dict-index tool. Best install from source.
The default dictionary is in /usr/lib64/mecab/dic/ipadic. This is not part of the mecab package; it comes as a separate package, and you may have difficulties finding this, too.
mydic is the CSV file with the entries created in step 1. mydic.dic is the name of the compiled dictionary you'll get as output (it need not exist beforehand).
Both the input CSV (-f option) and the compiled dictionary (-t option) are encoded in UTF-8. If that doesn't match your data, you'll get an error message later when you use mecab.
Modify the mecab configuration. In a system-wide installation, this is a file named /usr/lib64/mecab/dic/ipadic/dicrc or similar. In your case it may be located somewhere else. Add the following line to the end of the configuration file:
userdic = /home/myhome/mydic.dic
Make sure the absolute path to the dictionary compiled above is correct.
If you then run mecab against your input, it will split the compound into its parts (I tested it, using mecab 0.994 on a Linux system).
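For a quick test without editing dicrc, you can also pass the user dictionary directly on the command line (using the example path from above):

$> echo 吉本興業 | mecab -u /home/myhome/mydic.dic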
A more thorough fix would be to get the source of the default dictionary and manually remove all compound nouns you want to get split, then recompile the dictionary. As a general remark, using a CJK tokenizer for a serious application in production over a longer period of time usually involves a certain amount of regular dictionary maintenance (adding/removing entries).
I'm having a slight annoyance with the javascript_I18n plugin, which generates js-friendly versions of the I18n translation tables so that you can localise your javascript. It all works fine, but it works by calling to_json on each locale's translations hash and outputting the result into a file. When you call to_json on a hash, the resulting string is all on a single line, which means you end up with files with an enormous line at the end.
This in turn is stopping git from being able to merge any changes, because the merge works on a line-by-line basis and can't deal with a single massive line with changes somewhere in the middle. It's not a massive problem, because I can always just re-generate the js-friendly translation file with a rake task that comes with the plugin (which replaces the mid-merge file with a brand new one that I can just commit), but it's a bit annoying. It occurred to me that if the json was output on separate lines instead of all on one line, then it wouldn't be a problem, and it wouldn't even make the file that much bigger, just a newline character per line.
Before I try to hack the resulting string with a gsub to split it onto separate lines, is there a nicer way to call to_json on a hash and output the results onto separate lines? Or a nicer way to solve this problem generally? (I can't find much that's useful in the documentation for javascript_I18n.)
Grateful for any advice - max
Not to answer your question, but to give a suggestion:
It would probably be easier to just ignore all the generated js translation files.
This is because even if you separate the translation into multiple lines, there are still chances of merge conflicts, which you probably already resolved once in your yaml file.
I would set it up this way:
1. gitignore all the js translation files.
2. In ActionController, add a before filter to generate the js files automatically on each load, in development mode only.
3. Tweak your deploy.rb file to generate the js files after each code update.
No more merge conflicts! (at least for js translation files) :D
Aaron Qian
We have two .po files, each from different branches of a piece of software.
We need to combine these into a single .po file.
There are duplicates between the two files, and the ideal handling would be for one file's strings to be favoured (consistently).
We have a SUSE system, and its --output-file option doesn't seem to have the behaviour of ignoring/merging duplicates which the Sun version has, according to a man page I found in a web search. (We do not have a Sun machine handy!)
What you are looking for is the msgcat util; it concatenates and merges the specified PO dictionaries.
This is part of gettext utils; for more information, please consult the gettext manual page on msgcat.
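For example, to concatenate two catalogues while consistently favouring the first file's translation wherever both contain the same string (the file names are illustrative):

msgcat --use-first branch-a.po branch-b.po -o merged.po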
You can use Poedit.
To merge your current po-file, open it and click:
Catalog > Update from POT-file.
Set the filter to all files and select your second .po file.
Poedit will show you the new & obsolete strings.
I use msgmerge:
msgmerge [old_file.po] [new_file.po] > output.po
It works for me, but be aware that it does a silly merge; that is, it discards the entries in the old file (items from the new file overwrite items from the old one).