As the title says: how SEO-friendly is a URL containing Unicode characters?
Edit: To clarify, I meant URLs with non-ASCII characters that are still valid Unicode.
If I were an authority at Google or another search engine, I wouldn't consider Unicode URLs an advantage. I have been using Unicode URLs for more than two years on my Persian website, but believe me, I only did it because I felt forced to. We know Google handles Unicode URLs very well, but I can't see the Unicode words in the URLs when I'm working with them in Google Webmaster Tools. Here is an example:
http://www.learnfast.ir/%D9%88%D8%A8%D9%84%D8%A7%DA%AF-%D8%A2%DA%AF%D9%87%DB%8C
There are only two Farsi words behind such a messy and lengthy URL.
I believe other Unicode URL users don't like this either, but they do it only for SEO, not for categorizing their content or directing their users to the right address. Of course Unicode is excellent for crawling content, but there should be other means of indexing URLs. Moreover, English is our international language, isn't it? It could be put to good use in URLs. (Sorry for so many words from an amateur webmaster.)
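For what it's worth, the messy form is just the UTF-8 percent-encoding of the Farsi words. A minimal sketch (assuming Python and its standard library; the path is taken from the example URL above) shows the round trip:

```python
from urllib.parse import unquote, quote

messy = "%D9%88%D8%A8%D9%84%D8%A7%DA%AF-%D8%A2%DA%AF%D9%87%DB%8C"
readable = unquote(messy)     # -> وبلاگ-آگهی (the two Farsi words)
print(readable)
print(quote(readable))        # -> back to the percent-encoded form the tools display
```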
All URLs can be represented as Unicode. Unicode just defines a range of code points from U+0000 to U+10FFFF, which can represent any character.
If what you mean is "How SEO-friendly are URLs containing characters above U+007F?", then they should rank as well as anything else, as long as the words themselves are relevant. However, they won't be very easy for most users to type, if that's a concern, and they may not be supported by every browser, library, proxy, etc., so I'd tend to steer clear.
FWIW, Amazon (Japan) uses Unicode URL for their product pages.
http://www.amazon.co.jp/任天堂-193706011-Wiiスポーツ-リゾート-「Wiiモーションプラス」1個同梱/dp/B001DLXXCC/ref=pd_bxgy_vg_img_a
(As you can see, it causes trouble with systems like the Stack Overflow wiki formatter.)
If we accept that URLs containing the searched keywords rank higher in the search results, and you're targeting Unicode search terms, then it may actually help.
But of course this is hardly the most important thing when it comes to position in search results.
I would guess that from an SEO point of view it would be a really bad idea, unless you are specifically looking to target Unicode search terms.
Related
I need to check whether the strings coming from my form contain special characters (#$% & *!/), including characters from countries like China, Bulgaria, Russia, Greece and others that are not in the Latin alphabet.
The intention is to find a solution that does not harm the user experience and that brings security to the system.
I found some solutions using regex, but they don't cover characters from these other countries.
Is there any library or solution for this?
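One possibility, sketched below with Python's standard unicodedata module (the whitelist and function name are made up for illustration, not a drop-in library), is to classify each character by its Unicode general category instead of enumerating alphabets:

```python
import unicodedata

ALLOWED_PUNCTUATION = set("#$%&*!/ ")   # the special characters mentioned in the question

def is_allowed(text: str) -> bool:
    """Accept letters, digits and combining marks from any script
    (Chinese, Cyrillic, Greek, ...) plus whitelisted punctuation."""
    for ch in text:
        if ch in ALLOWED_PUNCTUATION:
            continue
        # Unicode general categories: L* = letters, N* = numbers, M* = combining marks
        if unicodedata.category(ch)[0] in ("L", "N", "M"):
            continue
        return False
    return True

print(is_allowed("Привет #1"))   # True  (Cyrillic letters, digit, '#', space)
print(is_allowed("hello <b>"))   # False ('<' and '>' are not whitelisted)
```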
If I use a domain such as www.äöü.com, is there any way to avoid it being displayed as www.xn--4ca0bs.com in users’ browsers?
Domains such as www.xn--4ca0bs.com cause a lot of confusion among average internet users, I guess.
This is entirely up to the browser. In fact, IDNs are pretty much a browser-only technology. Domain names cannot contain non-ASCII characters, so the actual domain name is always the Punycode encoded xn--... form. It's up to the browser to prettify this, but many choose to not do so to avoid domain name spoofing using lookalike Unicode characters.
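To see the mapping concretely, here is a small sketch assuming Python, whose standard library ships an IDNA 2003 codec (the newer third-party idna package implements IDNA 2008 if you need it):

```python
domain = "www.äöü.com"
ascii_form = domain.encode("idna")       # Punycode-encode each non-ASCII label
print(ascii_form)                        # b'www.xn--4ca0bs.com' (the form from the question)
print(ascii_form.decode("idna"))         # back to 'www.äöü.com'
```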
From a security perspective, Unicode domains can be problematic because many Unicode characters are difficult to distinguish from common ASCII characters (or indeed other Unicode characters).
It is possible to register domains such as "xn--pple-43d.com", which is equivalent to "аpple.com". It may not be obvious at first glance, but "аpple.com" uses the Cyrillic "а" (U+0430) rather than the ASCII "a" (U+0061). This is known as a homograph attack.
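A tiny sketch (Python, purely for illustration) makes the difference visible even though the rendered strings look identical in most fonts:

```python
latin = "apple.com"                        # ASCII 'a' (U+0061)
cyrillic = "\u0430pple.com"                # Cyrillic 'а' (U+0430)
print(latin == cyrillic)                   # False: different code points
print(hex(ord(latin[0])), hex(ord(cyrillic[0])))   # 0x61 0x430
```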
Fortunately, modern browsers have mechanisms in place to limit IDN homograph attacks. The IDN Policy page for Chrome highlights the conditions under which an IDN is displayed in its native Unicode form. Generally speaking, the Unicode form will be hidden if a domain label contains characters from multiple different languages. The "аpple.com" domain described above will appear in its Punycode form as "xn--pple-43d.com" to limit confusion with the real "apple.com".
For more information see this blog post by Xudong Zheng.
Internet Explorer 8.0 on Windows 7 displays your UTF-8 domain just fine.
Google Chrome 19 on the other hand doesn't.
Read more here: An Introduction to Multilingual Web Addresses #phishing.
Different browsers do things differently, possibly because some use the system codepage/locale/encoding/whatever, while others use their own settings or a list of allowed characters.
Read that article carefully; it explains how each browser decides.
If you are targeting a specific language, you can get away with it and make it work.
Until now, I had always stuck to lowercase alphanumerics and hyphens for the slug part of any URL.
I'm currently working on a website that supports both English and Arabic, and so far I have used transliteration.
However, feedback from Arabic speakers is that the Latin transliteration is "horrible".
After searching the web, I found out that Arabic characters can be used nowadays (and are even recommended for SEO), but I don't know exactly what the rules and best practices are (and I don't even speak or read Arabic).
More specifically, I would like to know:
What is the recommended character length?
What is the recommended number of words?
Should I turn spaces into hyphens, as is usually done for Latin languages (see the sketch after this list)?
Is there anything notable that a non-Arabic speaker should be aware of?
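For the spaces-to-hyphens question, a minimal sketch of what an Arabic-friendly slug function could look like (Python; the function name and example title are hypothetical, and the exact rules are of course up to you):

```python
import re
from urllib.parse import quote

def slugify_ar(title: str) -> str:
    """A sketch: keep the Arabic words themselves, drop punctuation,
    and turn runs of whitespace into single hyphens."""
    title = re.sub(r"[^\w\s-]", "", title)     # str patterns are Unicode-aware, so \w keeps Arabic letters
    return re.sub(r"\s+", "-", title.strip())  # spaces -> hyphens

slug = slugify_ar("مرحبا بالعالم")             # hypothetical title ("hello world")
print(slug)                                    # مرحبا-بالعالم
print(quote(slug))                             # the percent-encoded form that actually travels in the URL
```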
All,
I ran into a problem where, for a UITextField that has secureTextEntry=YES, I cannot get any UTF-8 keyboards (Japanese, Arabic, etc.) to show; only non-UTF-8 ones (English, French, etc.) do. I did a lot of searching on Google, on this site, and on the Apple dev forums and see others with the same problem, but short of implementing my own UITextField, nobody seems to have a reasonable solution or an answer as to whether this is a bug or intended behavior.
And if this is intended behavior, why? Is there a standard, a white paper, SOMETHING someplace that I can look at and then point to when I go to my Product Manager and say we cannot support UTF-8 passwords?
Thanks,
I was unable to find anything in Apple's documentation to explain why this should be the case, but after creating a test project it does indeed appear to be so. At a guess, I imagine secure text entry is disallowed for any language using composite characters because it would make character input difficult.
For instance, for Japanese input, should each kana character be hidden after it is typed? Or just kanji characters? If the latter, the length of time characters remain onscreen is long enough to make secure input almost moot. The same applies to other languages using composite input methods.
This post includes code for manually implementing your own secure input behaviour.
We're implementing a blog for a site which supports six different languages, and five of them have non-Latin characters in their alphabets. We are not sure whether we should keep them encoded (that is what we're doing at the moment):
Létání s potravinami: Co je dovoleno? becomes l%c3%a9t%c3%a1n%c3%ad-s-potravinami-co-je-dovoleno and the browser displays it as létání-s-potravinami-co-je-dovoleno.
or whether we should replace them with their Latin "counterparts" (similar-looking letters):
Létání s potravinami: Co je dovoleno? becomes letani-s-potravinami-co-je-dovoleno.
I can't find a definitive answer as to what's better from an SEO perspective. Search engine optimization is very important for us. Which approach would you suggest?
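For what it's worth, both variants are mechanical transformations of the same slug. A small sketch of the two options, assuming Python and its standard library:

```python
import unicodedata
from urllib.parse import quote

title = "létání-s-potravinami-co-je-dovoleno"

# Option 1: keep the Czech letters; what travels on the wire is the percent-encoded form
print(quote(title))    # l%C3%A9t%C3%A1n%C3%AD-... (same as above, just uppercase hex)

# Option 2: strip the accents to get the Latin "counterparts"
decomposed = unicodedata.normalize("NFKD", title)            # é -> e + combining accent
ascii_slug = decomposed.encode("ascii", "ignore").decode()   # drop the combining marks
print(ascii_slug)      # letani-s-potravinami-co-je-dovoleno
```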
Most of the time, search engines deal with Latin counterparts well, although sometimes results for e.g. "létání" and "letani" differ slightly.
So, in terms of SEO, almost no harm is done - once your site has good content, good markup and all that other stuff, it won't suffer from having Latin URLs.
You don't always know what combination of system, browser and plugins your users have, so keep URLs as simple as possible - most websites use standard Latin characters in URLs, because non-Latin symbols can choke anything from the server through the browser to any plugin and break the user's experience.
And I can't stress this enough: users before SEO!
"what's better from SEO perspective"
Who's your audience? Americans who think all those extra letters are a mistake?
Or folks who read (and search) for "non-ASCII" letters because those non-ASCII letters are part of their language?
SEO is a bad thing to chase. Complete, correct, consistent and usable is what you want to build first.
Well, I suggest you replace them with their Latin counterparts because it's user-friendly and your website will be accessible on every computer (keyboards change from one computer to another, but all of them have Latin letters). From an SEO perspective, I don't think it's going to be a problem.
Pawel, first of all, you should decide whether you're going to optimize for global Google (google.com) or the Polish one.
In accordance with the URI specification, RFC 3986, only 7-bit ASCII characters are allowed, and the characters the specification reserves as delimiters must be properly percent-escaped when used as data. If you want to represent other characters, you should be using an IRI, per RFC 3987. Keep in mind that HTTP is not compatible with IRIs directly, however: an IRI has to be mapped to a URI before it goes on the wire.
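A sketch of that IRI-to-URI mapping (Python; the path is a shortened piece of the Amazon example earlier):

```python
from urllib.parse import quote

# RFC 3987 maps an IRI to a URI by UTF-8-encoding the non-ASCII characters
# and percent-escaping the resulting bytes.
iri_path = "/任天堂-193706011-Wiiスポーツ"
uri_path = quote(iri_path)     # '/' is left alone by default
print(uri_path)                # ASCII-only: every non-ASCII character becomes %XX escapes
```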
When in doubt RTFM.
Another issue is that there are Unicode code points whose glyphs look very much alike in most fonts, which is absolutely ideal for phishers. Stick to ASCII and the glyphs are visibly different when the characters are.