I'm going to translate our website (written in PHP) into other languages, and I'm thinking of using gettext to do it. I've also seen that English is normally used as the placeholder, or msgid.
My website is currently in Spanish, and I wonder whether I can use Spanish as the msgid instead of English. Is there any problem with doing that (apart from English being more international)?
This would help us when we add new strings that haven't been translated yet: since we will keep writing the site in Spanish, we would rather have a mixed site (Spanish fragments in otherwise English pages) than a Spanish version with missing strings (missing because the English msgid would not exist yet).
It appears that gettext allows it, so yes, it is possible.
What you need to keep in mind is the encoding, both when extracting the strings and when manipulating the .po and .pot files, so that they don't get corrupted. For example, if you use xgettext to extract strings from a PHP file encoded in UTF-8, pass the --from-code option, like this:
xgettext --from-code UTF-8 prueba.php
Apart from this "technically, you can do it", the practice is discouraged: English is far more widely supported as a source language for translation, and its characters are all ASCII, so they are much harder to corrupt.
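That said, the fallback behaviour is exactly what you want here: when no translation exists for a msgid, gettext returns the msgid itself, so untranslated strings simply appear in Spanish. A minimal sketch, assuming the standard PHP gettext extension (the 'miweb' domain and the directory layout are made-up names for illustration):

<?php
// Made-up domain name and catalog layout, for illustration only.
putenv('LC_ALL=en_US.UTF-8');                 // the visitor's requested locale
setlocale(LC_ALL, 'en_US.UTF-8');
bindtextdomain('miweb', __DIR__ . '/locale'); // expects locale/en_US/LC_MESSAGES/miweb.mo
bind_textdomain_codeset('miweb', 'UTF-8');    // keep accented msgids intact
textdomain('miweb');

// If the English catalog translates this msgid, the translation is shown;
// otherwise gettext returns the msgid itself, i.e. the Spanish text.
echo _('Bienvenido a nuestra web');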
Until now, I had always stuck to lowercase alphanumerics and hyphens for the slug part of any URL.
I'm currently working on a website that supports both English and Arabic, and so far I have used transliteration.
However, feedback from Arabic speakers is that the Latin transliteration is "horrible".
After searching the web, I found out that Arabic characters can be used nowadays (and are even recommended for SEO), but I don't know exactly what the rules and best practices are (and I don't speak or read Arabic).
More specifically, I would like to know:
What is the recommended character length?
What is the recommended number of words?
Should I turn spaces into hyphens, as is usually done for Latin-script languages? (See the sketch after this list.)
Anything notable that a non-Arabic speaker should be aware of?
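For reference, here is a minimal slug routine that keeps Arabic (or any Unicode) letters instead of transliterating them. It is only a sketch, assuming PHP with the mbstring extension and PCRE Unicode support; make_slug is a made-up helper name:

<?php
// Hypothetical helper: keep Unicode letters and digits, hyphenate everything else.
function make_slug(string $title): string {
    $slug = mb_strtolower($title, 'UTF-8');
    // \p{L} matches letters in any script, so Arabic characters are preserved.
    $slug = preg_replace('/[^\p{L}\p{N}]+/u', '-', $slug);
    return trim($slug, '-');
}

echo make_slug('What is new?'); // what-is-new; an Arabic title keeps its Arabic letters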
I have an application that currently supports the 'en' and 'fr' locales, and I maintain one language file for each locale, i.e. 'en.json' and 'fr.json'.
Now, for a user from the USA the locale comes in as 'en_US', from Canada as 'en_CA', from Britain as 'en_GB', and so on.
So, as a best practice, is it recommended that I maintain different files for the different English locales, or should I treat all English locales (en_CA, en_US, en_GB) as the 'en' locale and point them all at one file?
As usual, it depends.
Typically, you will have only one English file, containing international-English messages. In that case you don't maintain separate files for each variant; instead you fall back to international English (for a request for en-US, en-CA, etc., you serve the messages from en.json).
Judging by your nickname, you probably know that it is sometimes better to maintain separate messages for specific cultures, for which English-US messages (typically used as international English) might simply be way too direct.
If you do maintain a separate locale version (e.g. en-IN), you serve messages from that specific file (en-IN.json), but fall back to en for every other variant (en-GB, en-AU, etc.).
Resource fall-back (which is the term specialists use for what I described above) can be quite painful to implement. Sure, usually you fall back to the base language (en for any en-XX), but there are some corner cases you need to know about: Portuguese/Brazilian Portuguese, Norwegian and Chinese. In the case of Portuguese you should use pt-BR (i.e. pt-BR.json) and fall back to it for requests for pt or pt-XX, as Brazilian Portuguese is now the standard variant. Alternatively, it can be done by simply creating one file, pt.json, and letting anything Portuguese fall back to pt.
This is not the case for either Norwegian or Chinese.
There are two versions of the Norwegian language:
Norwegian Nynorsk (locale nn-NO)
Norwegian Bokmål (locale nb-NO)
There is also a so-called macrolanguage (locale no).
Unless you maintain two separate versions of Norwegian (which is unlikely), you should use one resource file (no.json <- sounds funny, doesn't it?) and fall back to it for any request for nn-NO, nb-NO, nb, nn or no-NO (plain no will be covered as well).
Chinese is even more complicated. You may have heard of Simplified Chinese (locale zh-Hans) and Traditional Chinese (zh-Hant). If you ever need to localize into Chinese, it makes sense to maintain two separate Chinese files (zh-Hans.json and zh-Hant.json) and fall back as follows:
zh, zh-CN and zh-SG to zh-Hans
zh-HK, zh-MO and zh-TW to zh-Hant
I hope this gives you a better understanding. It is worth thinking about your future localization plans now, so you can keep the resource fall-back mechanism as simple as possible (but no simpler). If you only ever need to support languages like English, French, Italian, German and Spanish, there is no point in implementing complex rules: simply check whether xx-XX.json exists and serve it if it does; if not, check whether xx.json exists (and serve it), or fall back to the default application language (en.json, I guess?). A sketch of such a resolver follows.
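For illustration, here is roughly what that fall-back logic could look like in PHP (a sketch; the function name, the special-case table and the file layout are my own assumptions, not a standard API):

<?php
// Hypothetical resolver: maps a requested locale to the message file to serve.
function resolve_locale_file(string $requested, string $dir): string {
    $tag = str_replace('_', '-', $requested);
    // Corner cases described above: Portuguese, Norwegian, Chinese.
    $special = [
        '/^pt(-.*)?$/i'             => 'pt-BR',
        '/^(no|nn|nb)(-.*)?$/i'     => 'no',
        '/^zh(-(CN|SG|Hans.*))?$/i' => 'zh-Hans',
        '/^zh-(HK|MO|TW|Hant.*)$/i' => 'zh-Hant',
    ];
    foreach ($special as $pattern => $file) {
        if (preg_match($pattern, $tag)) {
            return "$dir/$file.json";
        }
    }
    // General rule: exact match first, then the base language, then the default.
    if (is_file("$dir/$tag.json")) {
        return "$dir/$tag.json";
    }
    $base = strtok($tag, '-');
    if (is_file("$dir/$base.json")) {
        return "$dir/$base.json";
    }
    return "$dir/en.json"; // default application language
}

// Serves lang/en.json when lang/en-CA.json does not exist.
echo resolve_locale_file('en_CA', __DIR__ . '/lang');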
We're implementing a blog for a site that supports six different languages, five of which have non-Latin characters in their alphabets. We are not sure whether we should keep the characters percent-encoded (which is what we're doing at the moment):
Létání s potravinami: Co je dovoleno? becomes l%c3%a9t%c3%a1n%c3%ad-s-potravinami-co-je-dovoleno and the browser displays it as létání-s-potravinami-co-je-dovoleno.
or whether we should replace them with their Latin "counterparts" (similar-looking letters):
Létání s potravinami: Co je dovoleno? becomes letani-s-potravinami-co-je-dovoleno.
I can't find a definitive answer as to which is better from an SEO perspective. Search engine optimization is very important to us. Which approach would you suggest?
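For context, this is roughly how the two variants are produced in PHP (a sketch; the iconv transliteration result can vary with the system's iconv implementation):

<?php
$title = 'Létání s potravinami: Co je dovoleno?';
$slug  = trim(preg_replace('/[^\p{L}\p{N}]+/u', '-', mb_strtolower($title, 'UTF-8')), '-');

// Option 1: keep the accented letters; they are percent-encoded on the wire.
echo rawurlencode($slug);
// l%C3%A9t%C3%A1n%C3%AD-s-potravinami-co-je-dovoleno

// Option 2: transliterate to plain ASCII first.
$ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $slug);
echo preg_replace('/[^a-z0-9]+/', '-', strtolower($ascii));
// letani-s-potravinami-co-je-dovoleno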
Most of the time, search engines deal with the Latin counterparts well, although the results for, say, "létání" and "letani" sometimes differ slightly.
So, in terms of SEO, almost no harm is done - once your site has good content, good markup and all that other stuff, it won't suffer from having Latin URLs.
You never know what combination of system, browser and plugins your users run, so make URLs as easy as possible - most websites use plain Latin in URLs, because non-Latin symbols can choke anything from the server through the browser to some plugin, breaking the user's experience.
And I can't stress this enough: users before SEO!
"what's better from SEO perspective"
Who's your audience? Americans who think all those extra letters are a mistake?
Or folks who read (and search) for "non-ASCII" letters because those non-ASCII letters are part of their language?
SEO is a bad thing to chase. Complete, correct, consistent and usable is what you want to build first.
Well, I suggest you replace them with their Latin counterparts, because it's user-friendly and your website will be usable on every single computer (keyboards change from one computer to another, but all of them have Latin letters). From an SEO perspective, I don't think it's going to be a problem.
Pawel, first of all, you should decide whether you're going to optimize for the global Google (google.com) or the Polish one.
According to the URI specification, RFC 3986, only 7-bit ASCII characters are allowed, and characters that the specification designates as reserved must be properly percent-escaped when used as data. If you want to represent other characters, you should be using an IRI, per RFC 3987. Keep in mind, however, that HTTP is not compatible with IRIs.
When in doubt, RTFM.
Another issue is that there are Unicode code points whose glyphs look very much alike in most fonts, which is absolutely ideal for phishers. Stick to ASCII and the glyphs are visibly different when the characters are.
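For instance, Latin "a" (U+0061) and Cyrillic "а" (U+0430) render almost identically in most fonts but are different characters; a quick check (a PHP sketch):

<?php
// Latin 'a' versus the look-alike Cyrillic 'а': distinct code points.
var_dump('a' === "\u{0430}"); // bool(false)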
I'm creating an English translation for a program written in German (i.e. all strings within tr("...") are in German). Users in a non-English, non-German locale will probably want to see the English translation, but with the program as it is now they will see German.
There are some ways to solve this problem:
Check whether the locale is German, and force English otherwise.
Present an option to the user.
Make the programmers change their source code to English.
What is considered best-practice for internationalizing where the source code is not in English?
These are two separate questions.
The best practice is not to use any kind of hard-coded strings in the sources.
Strings should be stored in external files and loaded by ID.
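A minimal sketch of what that can look like (the strings/ layout and the load_strings helper are made-up examples, not a specific library):

<?php
// Hypothetical ID-based loading: one JSON file per language, e.g.
// strings/de.json = {"greeting": "Hallo", "farewell": "Tschüss"}
function load_strings(string $lang): array {
    $file = __DIR__ . "/strings/$lang.json";
    return is_file($file) ? json_decode(file_get_contents($file), true) : [];
}

// '+' keeps the left-hand values, so missing German IDs fall back to English.
$strings = load_strings('de') + load_strings('en');
echo $strings['greeting'];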
But what you have there does not sound like that best practice, and it might be too much work to get it there.
What you describe (the tr("...") stuff) sounds like gettext (or something similar).
The approach taken by gettext (and similar libraries) is that the strings in the sources are the ultimate fallback, used if the strings for the desired language are not present.
In this case I would go with "Present an option to the user."
You can't assume the user knows English.
Real example: in Switzerland the official languages are Italian, German, French and Romansh. If I ask for French and it is not present, then the next best option is probably German, not English. In Canada the official languages are French and English, so if I ask for French and it is not available, the next best option is probably English.
I think the best option is to ask the user (probably during installation).
Changing the source to English is too costly and not worth it. I live in Brazil; we have tons of code written in Portuguese, and translating it to English has never once been necessary (and we do make software for English speakers) - unless you have a client that requires you to do so (usually when you are selling the source as well).
Hope it helps
OK, so I guess the three options are:
Recompile the program with translated strings.
This is fraught with danger as you'll end up with two copies of the source. Bug-fixes in one will need to be done in the other. And then, what happens if you need French? Italian? Spanish? The only advantage of this approach is that it's feasible for a non-developer to do the work. (Just about.)
Resource out the strings, and automatically check what the UI locale is on load.
Here the strings are replaced with GetResource("key") or similar. On load, the program automatically translates itself to the user's culture. This might work, but I know plenty of German speakers who have an English-language culture installed on their PCs yet would still prefer German-language programs in some cases.
Resource out the strings and give the user the choice on load.
In general it's always best to give the user control. This might be a prompt on load, although if the application is used often this can become an annoyance. Perhaps a balance is to ask the user during installation for their preference and then give them an option in a dialog to change this setting later.
Note, by the way, that translation is not localisation. For instance, number formats are quite different in German (e.g. 1.233,44) and in English (e.g. 1,233.44). Icons and suchlike often have national characteristics.
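To make the distinction concrete, here is the same number formatted for both cultures (a sketch using PHP's intl extension, assuming it is installed):

<?php
// Same value, locale-specific formatting.
$n = 1233.44;
echo (new NumberFormatter('de_DE', NumberFormatter::DECIMAL))->format($n); // 1.233,44
echo "\n";
echo (new NumberFormatter('en_US', NumberFormatter::DECIMAL))->format($n); // 1,233.44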
As the title says: how SEO-friendly is a URL containing Unicode characters?
Edit: To clarify, I meant a URL with non-ASCII characters that is still valid Unicode.
If I were an authority at Google or another search engine, I wouldn't consider Unicode URLs an advantage. I have been using Unicode URLs on my Persian website for more than two years, but believe me, I only did it because I felt forced to. We know Google handles Unicode URLs very well, but I can't see the Unicode words in the URLs when I'm working with them in Google Webmaster Tools. Here is an example:
http://www.learnfast.ir/%D9%88%D8%A8%D9%84%D8%A7%DA%AF-%D8%A2%DA%AF%D9%87%DB%8C
There are only two Farsi words in that messy, lengthy URL.
I believe other users of Unicode URLs don't like this either, but they do it only for SEO, not for categorizing their content or directing their users to the right address. Of course, Unicode is excellent for crawling content, but there should be other means of indexing URLs. Moreover, English is our international language, isn't it? It can be used to good effect in URLs. (Sorry for so many words from an amateur webmaster.)
All URLs can be represented in Unicode. Unicode just defines a range of code points, from U+0000 to U+10FFFF, that allows you to represent any character.
If what you mean is "how SEO-friendly are URLs containing characters above U+007F?", then they should be as good as anything else, as long as the words are correct. However, they won't be very easy for most users to type, if that's a concern, and they may not be supported by every browser, library, proxy, etc., so I'd tend to steer clear.
FWIW, Amazon (Japan) uses Unicode URLs for their product pages.
http://www.amazon.co.jp/任天堂-193706011-Wiiスポーツ-リゾート-「Wiiモーションプラス」1個同梱/dp/B001DLXXCC/ref=pd_bxgy_vg_img_a
(As you can see, it causes trouble with systems like the Stack Overflow wiki formatter.)
If we consider that URLs containing the searched keywords get higher placement in the search results, and you're targeting Unicode search terms, then it may actually help.
But of course this is hardly the most important thing when it comes to position in search results.
I would guess that from an SEO point of view it would be a really bad idea, unless you are specifically targeting Unicode search terms.