Max size for PO file strings - localization

I know that PO / MO files are meant to be used for small strings like button names, labels, etc., not long text like an About page.
But lately I am encountering a lot of situations that fall in the middle. For example, a two-sentence call to action, or a short paragraph.
Is there a best practice or "rule of thumb" for when a string is too long to put in a PO file?
Update:
For "long" text I use partials and include the correct language version. My question is WHEN is it optimal to use one vs the other. I've heard that PO files are "inefficient" for "long" pieces of text. But what does that mean and when is it too "long"? Or is this not a concern?

Use one entry for a self-contained chunk of text; e.g. a sentence as you say.
Two sentences that belong together and don't make sense without each other should be one entry. Why? Because otherwise the translator wouldn't have the context necessary to translate it well. Same goes for a short paragraph, e.g. explaining a setting: if it's inseparable in the code, it should be one entry.
If you encounter a situation where you have lots of long texts regularly (e.g. entire pages or paragraphs of pages), that's usually a sign that you are using an ill-fitting tool. Some people do it, using Gettext for entire articles, but you're better off having separate documents in such cases. But that doesn't seem to be the case here.
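As an illustration (the strings below are made up), a two-sentence call to action that only makes sense as a whole would go into the catalog as a single entry, so the translator sees it in one piece:
msgid "Ready to get started? Create a free account today."
msgstr ""
Splitting those two sentences into separate entries would hand the translator two fragments with no context.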

Related

How to properly do custom markdown markup

I currently work on a personal writing project which has ended up with me maintaining a few different versions, due to differences between the platforms and output formats I want to support that are not trivially solved. After several instances of me glancing at pandoc and the sheer forest that it represents, I have concluded that mere templates don't do what I need, and worse, that I seem to need a combination of a custom filter and writer... suffice it to say: messing with the AST is where I feel way out of my depth. Enough so that, rather than asking specific questions of 'how do I do X' here, this is a question of 'is X the right way to go about it, or what is the proper way to do it, and can you give an example of how it ties together?'... so if this question is rather lengthy: my apologies.
My current goal is to have custom markup like the following which is supposed to 'track' which character says something:
<paul|"Hi there">
If I convert to HTML, I'd want something similar to:
<span class="speech paul">"Hi there"</span>
to pop out (and perhaps the <p> tags), whereas if it is just pure markdown / plain text, I'd want it to silently disappear:
"Hi there"
From the JSON AST structures I've studied, it would make sense for me to want a new structure type, similar to the 'Emph' tag, called 'Speech', which allows whole blobs of text to be put inside it with a bit of extra information attached (the person speaking). So something like this:
{"t":"Speech","speaker":"paul","c":[ ... ] }
Problem #1: By the time a lua-filter sees the document, it has obviously already been distilled to an AST. This means that replacing the items in a manner similar to what most macro expander samples do cannot really work, since it would require reading forward. With that method I would just replace bits and pieces in place (<NAME| becomes a StartSpeech and the first solitary > that follows becomes an EndSpeech), but that would make malformed input a bigger potential problem because of silent-ish failures. Additionally, these tags would be completely out of step with how an AST is supposed to look.
To complicate matters even further, some of my characters end up learning a secondary language throughout the story, for which I apply a different format that carries a simplified rendering of the spoken text: the perspective character's understanding of what was said. Example:
<paul|"Heb je goed geslapen?"|"Did you ?????">
I could probably add a third 'UnderstoodSpeech' group to my filter, but (problem #2) at this point, the relationship between the speaker, the original speech, and the understood translation is completely gone. As long as the final documents need these values in these respective orders and only in these orders, it is fine... but what if I want my HTML version to look like
"Did you?????"
with a tool-tip / hover-over effect containing the original speech? That would be near impossible to achieve because the AST does not contain that kind of relational detail.
Whatever kind of AST I create in the filter is what I need to understand in my custom writer. Ideally, I want to re-use as much stock functionality of pandoc as possible for the writer, but I don't even know if that is feasible at this point.
So now my question: could someone with a good understanding of pandoc please give me an example of how to keep the relevant data bits together and apply them in the correct manner? By this I mean a basic example of what needs to go in the lua-filter and lua-writer scripts in the following toolchain:
[CUSTOMIZED MARKDOWN INPUT] -> lua-filter -> lua-writer -> [CUSTOMIZED HTML5 OUTPUT]
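For comparison, a hypothetical encoding that piggybacks on pandoc's built-in Span node (whose JSON form already carries an id, a class list, and a key-value attribute list) could keep the speaker and the original speech together on a single node; the class and attribute names below are purely illustrative:
{"t":"Span","c":[["",["speech","paul"],[["original","Heb je goed geslapen?"]]],[{"t":"Str","c":"\"Did"},{"t":"Space"},{"t":"Str","c":"you"},{"t":"Space"},{"t":"Str","c":"?????\""}]]}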

Add forbidden words to TexStudio / LaTeX

I have some words in my language (German) that are considered valid by TexStudio's spellchecker.
However, they must not be used in my thesis (and, for me at least, anywhere else).
Is it possible to add words to a list that triggers an (ideally huge) "DO NOT USE THIS!" warning, or even prevents compilation in LaTeX, when such words are used?
I'm looking for something like a negative dictionary.
I've seen files like "badwords" or "stopwords" but don't know when/how they are used. I can still use such words freely even though "check for bad words" is on.
In case anyone else has this problem: badword files are named after the main dictionary. In my case the dictionary was set to "de_DE_frami", hence "de_DE.badwords" was not used.
For good highlighting, one can change the appearance in the options dialog (syntaxhighlighting -> badwords) and make it, e.g., background red, size 200%.
I'd still like to have a distinction between "bad" words and "impossible" words, as you sometimes cannot avoid "bad" words, or they are not bad in all contexts.

Combining keys and full text when working with gettext and .po files

I am looking at gettext and .po files for creating a multilingual application. My understanding is that in the .po file msgid is the source and msgstr is the translation. Accordingly I see 2 ways of defining msgid:
Using full text (e.g. "My name is %s.\n") with the following advantages:
when calling gettext you can clearly see what is about to be translated
it's easier to translate .po files because they contain the actual content to be translated
Using a key (e.g. my-name %s) with the following advantages:
when the source text is long (e.g. paragraph about company), gettext calls are more concise which makes your views cleaner
easier to maintain several .po files and views, because the key is less likely to change (e.g. key of company-description far less likely to change than the actual company description)
Hence my question:
Is there a way of working with gettext and .po files that allows combining the advantages of both methods, that is:
- usage of keys for gettext calls
- the ability for the translator to see the full text that needs to be translated?
gettext was designed to translate English text to other languages, and this is the way you should use it. Do not use it with keys. If you want keys, use some other technique such as an associative array.
I have managed two large open-source projects (50 languages, 5000 translations), one using the key approach and one using the gettext approach - and I would never use the key approach again.
The cons include propagating changes in the English text to the other languages. If you change
msg_no_food = "We had no food left, so we had to eat the cats"
to
msg_no_food = "We had no food left, so we had to eat the cat's"
The new text has a completely different meaning - so how do you ensure that other translations are invalidated and updated?
You mentioned having long text that makes your scripts hard to read. The solution to this might be to put these in a separate script. For example, put this in the main code
print help_message('help_no_food')
and have a script that just provides help messages:
function help_message($help_msg) {
    switch ($help_msg) {
        // ...
        case 'help_no_food':
            return gettext("We had no food left, so we had to eat the cat's");
        // ...
    }
}
Another problem for gettext is when you have a full page to translate. Perhaps a brochure page on a website that contains lots of embedded images. If you allow lots of space for languages with long text (e.g. German), you will have lots of whitespace on languages with short text (e.g. Chinese). As a result, you might have different images/layout for each language.
Since these tend to be few in number, it is often easier to implement them outside gettext completely, e.g.
brochure-view.en.php
brochure-view.de.php
brochure-view.zh.php
I just answered a similar (much older) question here.
Short version:
The PO file format is very simple, so it is possible to generate PO/MO files from another workflow that allows the flexibility you're asking for (your devs want identifiers, your translators want words).
You could roll this solution yourself, or use a cloud-based app like Loco to manage your translations and export a Gettext file with identifiers when your devs need them.
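As a rough sketch of that kind of workflow (the file name and strings below are made up), a small script can emit a PO template whose msgids are your identifiers while the English source text rides along in "#." extracted comments, which PO editors typically display to the translator:
# generate_pot.py - sketch: identifiers become msgids, English text becomes "#." comments
source = {
    "company-description": "We build software that helps teams ship faster.",
    "greeting": "My name is %s.",
}
with open("messages.pot", "w", encoding="utf-8") as po:
    po.write('msgid ""\nmsgstr ""\n"Content-Type: text/plain; charset=UTF-8\\n"\n\n')
    for key, english in source.items():
        for line in english.splitlines():
            po.write("#. " + line + "\n")   # the translator sees the real English here
        po.write('msgid "%s"\n' % key)      # the identifier used in gettext() calls
        po.write('msgstr ""\n\n')           # translation to be filled in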

Profanity filter import

I am looking to write a basic profanity filter in a Rails-based application. This will use a simple search-and-replace mechanism whenever the appropriate attribute gets submitted by a user. My question is, for those who have written these before: is there a CSV file or some database out there from which a list of profane words can be imported into my database? We will supply the replacement words ourselves. We more or less need a database of profanities, racial slurs, and anything that's not exactly rated PG-13 to trigger on.
As the Tin Man suggested, this problem is difficult, but it isn't impossible. I've built a commercial profanity filter named CleanSpeak that handles everything mentioned above (leet speak, phonetics, language rules, whitelisting, etc.). CleanSpeak is capable of filtering 20,000 messages per second on a low-end server, so it is possible to build something that works well and performs well. I will mention, though, that CleanSpeak is the result of about 3 years of ongoing development.
There are a few things I tell everyone who is looking to tackle a language filter.
Don't use regular expressions unless you have a small list and don't mind a lot of things getting through. Regular expressions are relatively slow overall and hard to manage.
Determine if you want to handle conjugations, inflections and other language rules. These often add a considerable amount of time to the project.
Decide what type of performance you need and whether or not you can make multiple passes on the String. The more passes you make, the slower your filter will be.
Understand the Scunthorpe and clbuttic problems and determine how you will handle them. This usually requires some form of language intelligence and whitelisting.
Realize that whitespace has a different meaning now. You can't use it as a word delimiter any more (b e c a u s e of this)
Be careful with your handling of punctuation because it can be used to get around the filter (l.i.k.e th---is)
Understand how people use ASCII art and Unicode to replace characters (\/ = V - those are slashes). There are a lot of Unicode characters that look like English characters, and you will want to handle those appropriately.
Understand that people make up new profanity all the time by smashing words together (likethis) and figure out if you want to handle that.
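To make a couple of those points concrete (leet substitutions, punctuation tricks, spaced-out letters), here is a minimal sketch in Python, just for illustration, with a made-up word list; it ignores most of the harder points above (conjugations, Unicode look-alikes, compound words) and is nowhere near production quality:
import re
# Sketch only: undo a few common evasions before matching against a word list.
LEET = str.maketrans({"4": "a", "@": "a", "3": "e", "1": "i", "0": "o", "5": "s", "$": "s"})
BANNED = {"badword", "slur"}  # hypothetical list; load a real one from a file or database
def normalize(text):
    text = text.lower().translate(LEET)
    text = re.sub(r"[^\w\s]", "", text)             # strip punctuation: "l.i.k.e" -> "like"
    text = re.sub(r"\b(\w) (?=\w\b)", r"\1", text)  # re-join spaced letters: "b a d" -> "bad"
    return text
def contains_banned(text):
    return any(word in BANNED for word in normalize(text).split())
print(contains_banned("b 4 d w o r d"))  # True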
You can search around StackOverflow for my comments on other threads as I might have more information on those threads that I've forgotten here.
Here's one you could use: Offensive/Profane Word List from CMU site
Based on personal experience, you do understand that it's an exercise in futility?
If someone wants to inject profanity, there's a slew of words that are innocent in one context and profane in another, so you'll have to write a context parser to avoid blacklisting clean words. A quick glance at CMU's list shows words I'd never consider rude/crude/socially unacceptable. You'll see there are many words that could be proper names or nouns, countries, terms of endearment, etc. And there are myriad ways to throw your algorithm off using L33T speak and such. Search Wikipedia and the internets and you can build tables of variations of letters.
Look at CMU's list and imagine how long the list would be if, in addition to the correct letter, every a could also be 4, o could be 0 or p, e could be 3, s could be 5. And, that's a very, very, short example.
I was asked to do a similar task and wrote code to generate L33T variations of the words, and generated a hit-list of words based on several profanity/offensive lists available on the internet. After running the generator, and being a little over 1/4 of the way through the file, I had over one million entries in my DB. I pulled the plug on the project at that point, because the time spent searching, even using Perl's Regexp::Assemble, was going to be ridiculous, especially since it'd still be so easy to fool.
I recommend you have a long talk with whoever requested this, and ask if they understand the programming issues involved, the low likelihood of accuracy and success (especially over the long term), and the possible customer backlash when they realize you're censoring them.
I have one that I've added to (obfuscated a bit) but here it is: https://github.com/rdp/sensible-cinema/blob/master/lib/subtitle_profanity_finder.rb

When you write TeX source, how do you use your editor's word wrap?

Do you use "hard wrapping" (either yourself or automatically by your editor) by inserting newlines into your source document at a certain line length, or do you write your paragraphs in one continual line and let your editor "soft-wrap" for you?
Also, what editor do you use for this?
Note: I'm interested in how you wrap lines in your TeX source code (.tex file, general prose), not how TeX wraps lines for the final document.
I recently switched to hard-wrapping per sentence (i.e., newline after sentence end only; one-to-one mapping between lines and sentences) for two reasons:
soft-wrapping a whole paragraph makes typos hard to spot in version control diffs, because the entire paragraph shows up as one changed line.
hard-wrapped paragraphs look nice until you start to edit them, and if you re-flow a hard-wrapped paragraph you end up with a whole bunch of lines changed in the diff for what may be a one-word change.
Only wrapping per sentence fixes these two problems:
Small changes are comparatively easy to spot in a diff.
No re-flowing of text, only changes to, insertions of, or removal of single lines.
Looks a bit weird when you first look at it, but is the only compromise I've seen that addresses the two problems of soft and hard wrapping.
Of course, if you're working collaboratively, the answer is to use whatever the other people are using.
I use Emacs (with AUCTeX). After editing or writing a paragraph, I hit M-q to hard-wrap it. It also handles indenting items, and it also formats commented paragraphs. I don't like soft wraps, because they are visually indistinguishable from real newline characters, but behave differently.
I generally let my LaTeX editor softwrap the lines. I think part of it is due to the fact that I had some bad experiences with significant whitespace when I was first learning LaTeX, and part of it is because I don't like heavily-jagged right-margins when I'm editing the text file.
Depending on what OS you use, I recommend WinEdt (Windows) or Kile (Linux). Both of these soft-wrap, and there is no need for hard wraps (that is, I leave my paragraphs as long lines in the source). LaTeX sorts out line breaks in the output, and when I read the source, I rely on my editor's wrapping.
The only possible reason to use hard line breaks is to make it easier to find errors in the code (which the compiler indicates by line number), but they are generally not hard to find; if it's mainly text, errors are rare anyway.
Typically I have my editor insert newlines. That is, I try not to hit the Enter key for a new line, but when the editor wraps the line, it actually inserts a newline character.
I use vim to accomplish this, and I don't know if other editors have this feature or how they work. Specifically, I use the wrapmargin feature.
I typically try to keep my lines of code (TeX or otherwise) at n characters long for clarity and consistency. I tend to go with 80 characters, but that is up to you.
More vim-related line-breaking docs:
http://www.vim.org/htmldoc/usr_25.html
http://www.vim.org/htmldoc/options.html#%27textwidth%27
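For reference, the options mentioned here look like this in a vimrc (the values are just examples):
" hard-wrap as you type at 80 columns
set textwidth=80
" or, when 'textwidth' is 0, wrap a fixed distance from the window's right edge
set wrapmargin=8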
I tend to do hard-wrapping with TeX, but that's rooted more in my obsession with text formatting than any real gain of efficiency. One major thing that I don't like about soft-wrapping is that it tends (in my opinion, obviously) to make things harder to read by wrapping in semantically-random places.
Although I would prefer to use soft wrapping, I end up using hard wrapping for one practical reason: all of my collaborators do the same. So, when I work on an article with someone, it would be a big pain for me to soft wrap while the other person hard wraps. A second reason is that, until recently, Emacs could not handle soft wrapping properly. Emacs 23, which I currently use, changes this, but it will be a long time before everybody upgrades to 23 so that I can sneak soft-wrapped texts to them.
The way I actually use hard wrapping is to have auto-fill-mode turned on. Furthermore, M-q is bound to LaTeX-fill-paragraph (in AUCTeX mode - but I don't remember if this is a standard binding or one of my bindings; I'm pretty sure it's the latter). Combining these two, I manage to keep my TeX source more or less decently formatted.
By the way, I have heard the suggestion to always start a new sentence at the beginning of a line. In other words a period at the end of a sentence should be followed by a hard return. The benefit is that it works well with version control systems since changes to a sentence can remain localized. I think that this is in principle a nice idea but I have not managed to use it because of my obsessive-compulsive usage of M-q.
I use Kile under Linux with hard wrapping (called static word wrap in Kile) because apparently that is how everybody in my work environment does it. Soft wrapping makes much more sense to me, so if I could choose I would use that rather than hard wrapping.
I work mostly in joe. From time to time I press Enter out of habit, and if it doesn't look good I press auto-format (Ctrl-K J).
Joe has autowrap modes, but I don't even bother.
I use AUCTeX with automatic line breaking switched off, and insert line breaks by hand. I avoid auto-formatting, since I want as few changes as possible to where line breaks occur between edits to the document, which keeps diffs less cluttered.
Using a smarter diff, one that doesn't care about TeX-irrelevant whitespace, would be better, but plain diff is what I use.
I like Will's suggestion of hard wrapping per sentence. I thought about it before, but I am fixed in my habits.

Resources