Why msgmerge marked some of my translation as fuzzy?

Why msgmerge marked some of my translation as fuzzy? - localization

I use msgmerge to merge my existing po file with an updated pot file, e.g.
msgmerge test-zh_TW.po test.pot > test.po
I've found that after the msgmerge, some of the fields are marked as fuzzy, why is that?
(I want to know the reason, I know I can turnoff them by -N, but why it is the default in the 1st place?)

Quoting the documentation for the manual
Fuzzy entries, even if they account for translated entries for most other purposes, usually call for revision by the translator. Those may be produced by applying the program msgmerge to update an older translated PO files according to a new PO template file, when this tool hypothesises that some new msgid has been modified only slightly out of an older one, and chooses to pair what it thinks to be the old translation for the new modified entry. The slight alteration in the original string (the msgid string) should often be reflected in the translated string, and this requires the intervention of the translator. For this reason, msgmerge might mark some entries as being fuzzy.
In short it is because the fuzzy matching algorithm in msgmerge, finds some of the new messages to be close enough to the old one, to warrant associating it with the old translation, but it marks it is a fuzzy, in order to prompt the translator to revise the translation because it is only a fuzzy or partial match.
The reason this is the default behaviour is because the implementation of msgmerge.c has the following lines.
/* Determines whether to use fuzzy matching. */
static bool use_fuzzy_matching = true;
References
Fuzzy Entries from gettext manual
Invoking the msgmerge Program
Source Code For msgmerge

Related

How do I merge POT and PO files so that I exclude entries that are not in the POT file?

In short, I am trying to find a way to create a new PO file from a new POT and an existing PO file - but I want to exclude any strings (and their translations) that are not in the POT file.
Every time we change the wording on our cakePHP site, we generate a new POT file that contains all the translatable strings in the site. But when we merge it with the existing PO file (using POEdit), the merge process only adds the POT entries to the PO file. It doesn't remove the translations we no longer need. We have over 12k unneeded translations in our PO files. This makes our translator very unhappy. She has taken to just looking at the site and sending me translations to add manually, which makes me very unhappy.
I've looked around for tools that do this destructive merge, but I haven't been successful finding one. Before I head off to write one...is there something I missed?
(Sorry if this belongs on a different exchange, I will move this post to a better exchange if anyone tells me which one).

What you describe as "destructive" merge is the standard, normal merge operation in gettext and what everybody wants — you'd have to go out of your way to accomplish non-destructive versions, and I'm not even sure how.
From this it's safe to conclude that (1) you must be doing some weird steps not described above, or (2) your POT file contains more than you think it does (e.g. because you append to it instead of replacing it), or (3) you or the tools you use misinterpret the resulting PO file.
To merge using GNU gettext command line tools:
msgmerge -U your_old_translation.po latest_strings.pot
To merge using Poedit (notice the spelling):
Open PO file with the (now outdated) translations.
Use Catalog → Update from POT file…
Choose the newly regenerated POT file.
Notice that by default, outdated translations are kept in the PO file as backup. In Poedit, you can purge them (see Catalog → Purge deleted translations). However, these obsolete entries are stored in a different way in the PO file, as specially formatted comments, and are not visible or editable in Poedit or any conforming PO editing tool.
If I were to bet, I'd say (3) is the most likely cause (in which case, use a better editor like, ahem, Poedit), or perhaps (2) (should be easy to review by searching the POT for now-unused strings).
But merging really does the right thing that you expect it to do.

How to properly do custom markdown markup

I currently work on a personal writing project which has ended up with me maintaining a few different versions due to the differences of the relevant platforms and output formats I want to support that are not trivially solved. After several instances of me glancing at pandoc and the sheer forest that it represents, I have concluded mere templates don't do what I need, and worse, that I seem to need a combination of a custom filter and writer... suffice to say: messing with the AST is where I feel way out of my depth. Enough so that, rather than asking specific questions of 'how do I do X' here, this is a question of 'is X the right way to go about it, or what is the proper way to do it, and can you give an example of how it ties together?'... so if this question is rather lengthy: my apologies.
My current goal is to have custom markup like the following which is supposed to 'track' which character says something:
<paul|"Hi there">
If I convert to HTML, I'd want something similar to:
<span class="speech paul">"Hi there"</span>
to pop out (and perhaps the <p> tags), whereas if it is just pure markdown / plain text, I'd want it to silently disappear:
"Hi there"
Looking at the JSON AST structures I've studied, it would make sense that I'd want a new structure type similar to the 'Emph' tag called 'Speech' which allows whole blobs of text to be put inside of it with a bit of extra information attached (the person speaking). So something like this:
{"t":"Speech","speaker":"paul","c":[ ... ] }
Problem #1: At the point a lua-filter sees the document, it is obviously already distilled to an AST. This means replacing the items in a manner similar to what most macro expander samples do cannot really work since it would require reading forward. With this method, I just replace bits and pieces in place (<NAME| becomes a StartSpeech and the first solitary > that follows becomes an EndSpeech, but that would make malformed input a bigger potential problem because of silent-ish failures. Additionally, these tags would be completely out of sorts with how an AST is supposed to look.
To complicate matters even further, some of my characters end up learning a secondary language throughout the story, for which I apply a different format that contains a simplified understanding of the spoken text with perspective-characters understanding of what was said. Example:
<paul|"Heb je goed geslapen?"|"Did you ?????">
I could probably add a third 'UnderstoodSpeech' group to my filter, but (problem #2) at this point, the relationship between the speaker, the original speech, and the understood translation is completely gone. As long as the final documents need these values in these respective orders and only in these orders, it is fine... but what if I want my HTML version to look like
"Did you?????"
with a tool-tip / hover-over effect containing the original speech? That would be near impossible to achieve because the AST does not contain that kind of relational detail.
Whatever kind of AST I create in the filter is what I need to understand in my custom writer. Ideally, I want to re-use as much stock functionality of pandoc as possible for the writer, but I don't even know if that is feasible at this point.
So now my question: could someone with great pandoc understanding please give me an example on how to keep relevant data-bits together and apply them in the correct manner? By this I mean show a basic example of what needs to be put in the lua-filter and lua-writer scripts in the following toolchain
[CUSTOMIZED MARKDOWN INPUT] -> lua-filter -> lua-writer -> [CUSTOMIZED HTML5 OUTPUT]

How to walk whole Parse Tree and print it's content with slight changes in ANTLR4?

So as stated in the title, my task is to traverse the Parse Tree generated for code written in Java (grammar is a standard Java grammar), print most of it unchanged and modify only some words, for example type declarations.
My current approach was to create ParseTreeListener and implement the logic in the enterEveryRule method, but unfortunately it doesn't appear to work even for basic printing. The output is very messy and there are a lot of repetitions, as if every node was visited multiple times.
My another try was to implement appropriate methods in BaseListener that would do the changes to the type declarations I need, but from there I see no possibility to print the rest of the code unchanged.
Looking forward to your help!

You could use ANTLR's string templates to produce code from the ASTs.
In general, you start with set of "standard" string templates that can regenerate source code corresponding to the underlying tree.
To get the effect you want, you judiciously choose the standard string templates on AST nodes where you don't want changes, and variant templates where you do want changes.
IMHO, it is better to modify the AST, and then simply apply the standard templates.

Combining keys and full text when working with gettext and .po files

I am looking at gettext and .po files for creating a multilingual application. My understanding is that in the .po file msgid is the source and msgstr is the translation. Accordingly I see 2 ways of defining msgid:
Using full text (e.g. "My name is %s.\n") with the following advantages:
when calling gettext you can clearly see what is about to be
translated
it's easier to translate .po files because they
contain the actual content to be translated
Using a key (e.g. my-name %s) with the following advantages:
when the source text is long (e.g. paragraph about company), gettext calls are more concise which makes your views cleaner
easier to maintain several .po files and views, because the key is less likely to change (e.g. key of company-description far less likely to change than the actual company description)
Hence my question:
Is there a way of working with gettext and .po files that allows combining the advantages of both methods, that is:
-usage of a keys for gettext calls
-ability for the translator to see the full text that needs to be translated?

gettext was designed to translate English text to other languages, and this is the way you should use it. Do not use it with keys. If you want keys, use some other technique such as an associative array.
I have managed two large open-source projects (50 languages, 5000 translations), one using the key approach and one using the gettext approach - and I would never use the key approach again.
The cons include propagating changes in English text to the other langauges. If you change
msg_no_food = "We had no food left, so we had to eat the cats"
to
msg_no_food = "We had no food left, so we had to eat the cat's"
The new text has a completely different meaning - so how do you ensure that other translations are invalidated and updated?
You mentioned having long text that makes your scripts hard to read. The solution to this might be to put these in a separate script. For example, put this in the main code
print help_message('help_no_food')
and have a script that just provides help messages:
switch ($help_msg) {
...
case 'help_no_food': return gettext("We had no food left, so we had to eat the cat's");
...
}
Another problem for gettext is when you have a full page to translate. Perhaps a brochure page on a website that contains lots of embedded images. If you allow lots of space for languages with long text (e.g. German), you will have lots of whitespace on languages with short text (e.g. Chinese). As a result, you might have different images/layout for each language.
Since these tend to be few in number, it is often easier to implement these outside gettext completely. e.g.
brochure-view.en.php
brochure-view.de.php
brochure-view.zh.php

I just answered a similar (much older) question here.
Short version:
The PO file format is very simple, so it is possible to generate PO/MO files from another workflow that allows the flexibility you're asking for. (your devs want identifiers, your translators want words)
You could roll this solution yourself, or use a cloud-based app like Loco to manage your translations and export a Gettext file with identifiers when your devs need them.

Determine Cobol coding style

I'm developing an application that parses Cobol programs. In these programs some respect the traditional coding style (programm text from column 8 to 72), and some are newer and don't follow this style.
In my application I need to determine the coding style in order to know if I should parse content after column 72.
I've been able to determine if the program start at column 1 or 8, but prog that start at column 1 can also follow the rule of comments after column 72.
So I'm trying to find rules that will allow me to determine if texts after column 72 are comments or valid code.
I've find some but it's hard to tell if it will work everytime :
dot after column 72, determine the end of sentence but I fear that dot can be in comments too
find the close character of a statement after column 72 : " ' ) }
look for char at columns 71 - 72 - 73, if there is not space then find the whole word, and check if it's a key word or a var. Problem, it can be a var from a COPY or a replacement etc...
I'd like to know what do you think of these rules and if you have any ideas to help me determine the coding style of a Cobol program.
I don't need an API or something just solid rules that I will be able to rely on.

I think you need to know the COBOL compiler for each program. Its documentation should tell you what conventions/configurations/switches it uses to decide if the source code ends at column 72 or not.
So.... which compiler(s)?
And if you think the column 72 issue is a pain, wait till you get around to actually parsing the COBOL itself. If you are not well prepared to handle the lexical issues of the language, you are probably very badly prepared to handle the syntactic ones.

There is no absolutely reliable way to determine if a COBOL program
is in fixed or free format based only on the source code. Heck it is sometimes difficult to identify
the programming language based only on source code. Check out
this classic polyglot - it is valid under 8 different language compilers. That
said, you could try a few heuristics that might yield
the correct answer more often than not.
Compiler directives imbedded in source code
Watch for certain compiler directives that determine code format.
Unfortunately, every compiler vendor uses their own flavour of directive.
For example, Microfocus COBOL uses the
SOURCEFORMAT directive. This directive will appear near the top of the program so a short pre-scan
could be used to find it. On the other hand, OpenCobol uses >>SOURCE FORMAT IS FREE and
>>SOURCE FORMAT IS FIXED to toggle between free and fixed format, different parts of the same program
could be formatted differently!
The bottom line here is that you will have to support the conventions of multiple COBOL compilers.
Compiler switches
Source code format can be also be specified using a compiler switch. In this case, there are no concrete
clues to go on. However, you can be reasonably sure that the entire source program will be either
fixed or free. All you can do here is guess. Unless the programmer is out to "mess with
your head" (and some will), a program in free format will have the keywords IDENTIFICATION DIVISION or ID DIVISION, starting before column 8.
Every COBOL program will begin with these keywords so you can use them as the anchor point for determining code format in the
absence of imbedded compiler directives.
Warning - this is far from fool proof, but might be a good start.

There won't be an algorithm to do this with 100% certainty, because if comments can be anything, they can also be compilable COBOL code. So you could theoretically write a program that means one thing if the comments are ignored, and something else entirely if the comments are treated as part of the COBOL.
But that's extremely unlikely. What's most likely to happen is that if you try to compile the code under the wrong convention, it will simply fail. So the only accurate way to do this is to try compiling/parsing the program one way, and if you come to a line that can't make sense, switch to the other style. You could also support passing an argument to the compiler when the style is already known.
You can try using heuristics like what you've described, but that will never be totally accurate. The most they can give you is a probability that the code is one or the other style, which will increase as they examine more and more lines of code. They could be useful for helping you guess the style before you start compiling, or for figuring out when the problem is really just a typo in the code.
EDIT:
Regarding ideas for heuristics, it's hard to say. If there were a standard comment sigil like // or # in other languages, this would be a lot easier (actually, there is, but it sounds like your code doesn't follow this convention). The only thing I can think of would be to check whether every line (or maybe 99% of lines, and not counting empty lines or lines commented with *) has a period somewhere before position 72.
One thing you DON'T want to do is apply any heuristics to the part after position 72. That is, you don't want to be checking the comments to see if they're valid COBOL. You want to check what you know is COBOL first, and see if that works by itself. There are several reasons for this:
Comments written in English are likely to have periods and quotes in them, so your first and second bullet points are out.
Natural languages are WAY harder to parse than something like COBOL.
The comments could easily have COBOL in them (maybe someone commented out the previous version of the line).
An important rule for comments is that they should never affect what the program does. If changing the comments can change how the program is compiled, you violate that.
All that in mind, my opinion is that you shouldn't use heuristics at all. You should always try to compile the program under both conventions unless one is explicitly specified. There's a chance that code will compile successfully under both conventions, and then you'll have two different programs and no way to tell which one is correct.
If that happens, you need to compare the two results (perhaps with a hash or something) to see if they're the same program. If they're the same, great, but if not, you'll need to force the user to explicitly choose a convention.

Most COBOL compilers will allow you to generate and analyze the post text manipulation phase.
The text preprocessor output can be seen (using OpenCOBOL for the example)
cobc -E program.cob
The text manipulation processor deals with any COPY ... REPLACING compiler directives, as well as converting SOURCE FORMAT IS FIXED (with line continuations, string literal concatenations, comment line removal, among other things) to the actual free format that the compiler lexical analyzer needs. A lot of the OpenCOBOL toolkits (Cross referencer and Animator, to name two) use source code AFTER the preprocessor pass. I don't think you'll lose any street cred if your parser program relies on post processed source code files.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart