Bingbot converts Unicode characters in URLs to unreadable symbols

I get a lot of errors from my site when Bing tries to index pages that contain Unicode characters.
For example:
http://www.example.com/kjøp
Bing is trying to index
http://www.example.com/kjÃ¸p
Then I get an error, "System.NullReferenceException: Object reference not set to an instance of an object.", because there is no such controller.
Google works fine with such links. How can I help Bing understand Norwegian letters?

You can confirm that Bing does not index these URLs correctly by doing an "INURL:" search like this... https://www.bing.com/search?q=inurl%3A%C3%B8
Only 6 pages are indexed which cannot be correct.
Unfortunately you won't be able to fix Bing. You may be able to compensate for its shortcoming by making some changes to your site, however. It is a burden that you shouldn't have to deal with, but the alternative is to do nothing and continue having those pages improperly indexed.
Bing will likely have issues with URLs containing characters in this list...
https://www.i18nqa.com/debug/utf8-debug.html
Your webserver needs to look for URL requests containing these characters. You will then replace the wrong characters with the correct ones and do a 301 redirect to the correct page. The specifics depend on what kind of server and programming language you are using. In your case it is most likely IIS and MVC so you would most likely look into Microsoft's URL Rewrite extension. https://www.iis.net/downloads/microsoft/url-rewrite
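To illustrate the wrong-to-right mapping, here is a minimal Python sketch. It assumes the usual failure mode behind the i18nqa table: the garbled path is UTF-8 bytes that were mis-decoded as Windows-1252. A real IIS deployment would express the same mapping as URL Rewrite rules rather than application code.

```python
# Sketch: recover the intended path from a mojibake URL path, assuming
# the garbled form is UTF-8 bytes mis-decoded as Windows-1252.

def fix_mojibake_path(path: str) -> str:
    """Re-encode as cp1252 and decode as UTF-8; return the input
    unchanged if it is not actually mojibake."""
    try:
        return path.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return path

# A 301 handler would redirect to the fixed path when it differs:
assert fix_mojibake_path("/kjÃ¸p") == "/kjøp"   # garbled form -> correct
assert fix_mojibake_path("/kjøp") == "/kjøp"    # correct path is untouched
```

If the fixed path differs from the requested one, issue the 301 to the fixed path; otherwise serve the page normally.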
Before doing this however I would see what errors Bing's webmaster tools might provide.
https://www.bing.com/toolbox/webmaster
The other option is to not use those characters in your URLs. My recommendation is to take the time to implement the wrong-to-right translation. Bing will eventually fix this, but it could take quite a while.

Related

What are URL codes called?

I came across a website with a blog post explaining how to bypass the cache for web development purposes. My personal favourite is appending /? to the end of a web address in the URL bar.
Are there any more little tricks like that? If so, what are they, and where can I find a cheat sheet?
Appending /? may work for some URLs, but not for all.
It works if the server/site is configured in a way that, for example, http://example.com/foo and http://example.com/foo/? deliver the same document. But this is not the case for all servers/sites, and the defaults can be changed anyway.
There is no name for this. You just manipulate the canonical URL, hoping to craft a URL that points to the same document, without getting redirected.
Other common variants?
I’d expect appending ? would work even more often than /? (both, of course, only work if the URL has no query component already).
http://example.com/foo
http://example.com/foo?
You’ll also find sites that allow any number of additional slashes where only one slash used to be.
http://example.com/foo/bar
http://example.com/foo////bar
Not sure if it affects the cache, but specifying the domain as a FQDN, by adding a dot after the TLD, works for many sites, too.
http://example.com/foo
http://example.com./foo
Some sites might not have case-sensitive paths.
http://example.com/foo
http://example.com/fOo
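The variants above can be generated mechanically. This is a hypothetical Python sketch; whether any variant actually reaches the same document (or busts a cache) depends entirely on the server's configuration:

```python
# Generate the equivalent-URL variants discussed above for a URL
# that has no existing query component.

def url_variants(url: str) -> list[str]:
    scheme, rest = url.split("://", 1)
    host, _, path = rest.partition("/")
    variants = [url]
    if "?" not in url:
        variants.append(url + "?")       # bare empty query
        if not url.endswith("/"):
            variants.append(url + "/?")  # trailing slash plus empty query
    variants.append(f"{scheme}://{host}./{path}")  # trailing dot: FQDN form
    return variants

print(url_variants("http://example.com/foo"))
```

The case-variant trick is deliberately left out: unlike the others, changing letter case produces a genuinely different URL on case-sensitive servers.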

Canonical url and localization

In my application I have localized urls that look something like this:
http://example.com/en/animals/elephant
http://example.com/nl/dieren/olifant
http://example.com/de/tiere/elefant
This question is mainly for Facebook Likes, but I guess I will hit similar problems when I start thinking about search engine crawlers.
What kind of URL would you expect as the canonical URL? I don't want to use the exact English URL, because I want people clicking the link to be forwarded to their own language (based on browser settings or IP).
The IP lookup is not something that I want to do on every page hit. Besides that, I would need to incorporate more 'state' in my application, because I have to check whether a user has already been forwarded to his own locale, or is browsing the English version on purpose.
I guess it is going to be something like:
http://example.com/something/animals/elephant
or maybe without any language identifier at all:
http://example.com/animals/elephant
but that is a bit harder to implement, with a bigger chance of URL clashes in the future (in the rare case I get a category called en or de).
Summary
What kind of URL would you expect as the canonical URL? Is there already a standard for this?
I know this question is a bit old, but I was facing the same issue.
I found this:
Different language versions of a single page are considered duplicates only if the main content is in the same language (that is, if only the header, footer, and other non-critical text is translated, but the body remains the same, then the pages are considered to be duplicates).
That can be found here: https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls
From this I can conclude that we should add locales to canonicals.
I did find one resource that recommends not using the canonical tag with localized addresses. However, Google's documentation does not specify and only mentions subdomains in another context.
There is more than just language that you need to think of.
It's typically a 3-tuple: {region, language, property}.
If you only have one website then you have {region, language} only.
Every piece of content can be different in this 3-dimensional space, or at least presented differently. But it is the same piece of content, so you'd like to centralize the management of editorial signals, promotions, tracking, etc. Think about search systems: you'd like PageRank to be merged across all instances of the article, not spread out thinly.
I think there is a standard solution: Canonical URL
Put language/region into the domain name
example.com
uk.example.com
fr.example.com
Now you have a choice of whether to attach a cookie to the subdomain (for language/region) or to the domain (for user tracking)!
On every html page add a link to canonical URL
<link rel="canonical" href="http://example.com/awesome-article.html" />
Now you are done.
There certainly is no "standard" beyond the fact that it has to be a URL. What you certainly do see on many commercial websites is exactly what you describe:
<protocol>://<server>/<language>/<more-path>
For the language tag you may follow the RFCs as well (BCP 47 defines language tags); your 2-letter abbreviation is quite fine.
I only disagree on the <more-path> part of the URL. If I understand you right, you are thinking about translating each page's path into the local language? I would not do that. Maybe I am not the standard user, but I personally like to manually tinker with URLs: if the URL shown is http://example.com/de/tiere/elefant but I don't trust the content to be translated well, I would manually try http://example.com/en/tiere/elefant, and that would not bring me to the expected page. And since I also dislike URLs like http://ex.com/with-the-whole-title-in-the-url-so-the-page-will-be-keyworded-by-search-engines, my favourite would be to just exchange the <language> part and use generic English (or any other language) for <more-path>. E.g.:
http://example.com/en/animals/elephant
http://example.com/nl/animals/elephant
http://example.com/de/animals/elephant
If your site is something like Wikipedia, then I would agree with translating the <more-path> as well.
Maybe Google's guidelines can help with your issue: https://support.google.com/webmasters/answer/189077?hl=en
They say that many websites serve users across the world with content targeted to a certain region, and advise using the rel="alternate" hreflang="x" attributes to serve the correct language or regional URL in search results.
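As a sketch of that advice, the alternate/hreflang tags for the question's three localized URLs could be generated like this (Python used purely for illustration; the x-default entry is an assumption about which version to treat as the fallback for unmatched languages):

```python
# Emit rel="alternate" hreflang link tags for a set of localized URLs
# (paths taken from the question's example).

LOCALIZED = {
    "en": "http://example.com/en/animals/elephant",
    "nl": "http://example.com/nl/dieren/olifant",
    "de": "http://example.com/de/tiere/elefant",
}

def hreflang_links(urls: dict[str, str], default: str = "en") -> str:
    lines = [
        f'<link rel="alternate" hreflang="{lang}" href="{url}" />'
        for lang, url in urls.items()
    ]
    # x-default tells search engines which page to use when no language matches.
    lines.append(
        f'<link rel="alternate" hreflang="x-default" href="{urls[default]}" />'
    )
    return "\n".join(lines)

print(hreflang_links(LOCALIZED))
```

Each localized page would carry the full set of tags, so every version points at all the others.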

Why is IIS 7.5 / Coldfusion 9 adding a weird character to URL string?

We have built a "redirect" engine into our product so our customers can add/edit/delete custom redirects without us having to maintain a bunch of rewrite rules on the server.
Some issues are arising in the URLs that get passed into our code. We are pulling these from the CGI.QUERY_STRING property populated by Coldfusion (it picks up on 404's thrown by IIS/Coldfusion, which appends the bad URL as a query string like ?404;http://www.mysite.com:80/nonexistent-file.cfm).
What we see is that some special characters are getting an additional character (an "Â") thrown in there. Take this URL (%A9 is the copyright symbol):
http://www.mysite.com/%A9/
The CGI.QUERY_STRING is reporting this as:
http://www.mysite.com:80/Â©/
I have no idea where this extra "Â" is coming from. I have a hunch that this is being brought in by IIS, but it could also be with Coldfusion as it has to populate the CGI variable.
Any ideas as to why this is happening and how to fix it? It appears that not all percent-encoded/special characters do this...
EDIT:
I am probably giving up on my exact problem, however, it would be beneficial still to know why either IIS or Coldfusion is throwing in this extra character (especially for certain escape sequences over others).
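One plausible explanation (an assumption, not confirmed for this particular IIS/ColdFusion stack): %A9 is decoded as the Latin-1/Windows-1252 copyright sign, re-encoded internally as UTF-8 (bytes C2 A9), and the query string is then read back as Latin-1, which surfaces the leading 0xC2 byte as "Â". A quick Python demonstration of that byte-level round trip:

```python
# Reproduce the stray "Â": UTF-8-encode the copyright sign,
# then mis-read the resulting bytes as Latin-1.

copyright_sign = "\u00a9"                  # © (percent-encoded as %A9 in Latin-1)
utf8_bytes = copyright_sign.encode("utf-8")
assert utf8_bytes == b"\xc2\xa9"           # UTF-8 needs two bytes for U+00A9

misread = utf8_bytes.decode("latin-1")
assert misread == "\u00c2\u00a9"           # "Â©": the Â is the stray 0xC2 byte
print(misread)
```

This would also explain why only some escape sequences are affected: ASCII characters encode to a single identical byte in UTF-8, so only codepoints above U+007F gain the extra lead byte.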
Wow... that's a tough one. Usually folks design sites to use alphanumerics plus the tilde (~) and dash (-). I'm not even sure the RFC allows a copyright symbol in that part of a URL. This article might shed some light on it for you. As for IIS, you might be able to add a specific rewrite rule that takes care of the issue. Personally I would avoid these characters in the path part of the URL.

Unicode url of website and seo issue

I am working on a Persian website. I want to change the URL structure of the pages to be more SEO-friendly, but I don't know whether using Unicode URLs will have a positive effect on the site's SEO or not.
The pages are encoded as UTF-8. When I copy the link location in Firefox and paste it into the address bar, something like this (for example) appears:
http://mysite.com/pages/36161-%D8%B4%DB%8C%D9%85%D9%89.html
Is this OK with search engines and SEO?
I encountered a similar problem on my site. After a few tests and a long time, I concluded that Google deals well with these addresses and you have no reason to worry.
In my case the URLs were in Hebrew, and there is not much difference between the two languages as far as Googlebot is concerned.
The major problem I had was with the URLs in the sitemap: they looked really bad, but Google still indexed them.
Will this transition be good for SEO? I guess it will, but do not let friendly URLs mislead you: they are only one criterion, and there is no reason to rely on them alone.
You get +1 on friendly URLs, but there's no reason to forget about the rest of your on-site SEO.
It is very important that you redirect the old URLs to the new ones with 301 redirects, so that you don't serve 404 errors that will cause you to be penalized by the search engines.
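For what it's worth, the escapes in the URL above are nothing exotic: they are the standard UTF-8 percent-encoding that browsers display as the original characters. A quick Python check using the encoded word from the question's URL:

```python
# Round-trip the percent-encoded Persian word from the question's URL.

from urllib.parse import quote, unquote

encoded = "%D8%B4%DB%8C%D9%85%D9%89"
decoded = unquote(encoded)            # the original Persian characters
assert quote(decoded) == encoded      # encodes back losslessly
print(decoded)
```

Search engines see the same equivalence, which is why the encoded sitemap entries still index correctly.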

Google sees something that it shouldn't see. Why?

For some mysterious reason, Google has indexed both of these addresses, which lead to the same page:
/something/some-text-1055.html
and
/index.php?pg=something&id=1055
(a short note: the site has had friendly URLs since its launch, and I have no idea how Google found the "index.php?" URL; there are "unfriendly" URLs only in the content management system, which is password-restricted)
What can I do to resolve the situation? (I have around 1000 pages that are double-indexed.) Somebody told me to use "Disallow: /index.php?" in the robots.txt file.
Right or wrong? Any other suggestions?
You'd be surprised at how pervasive and quick the Google bots are at indexing site content. That, combined with many CMS systems creating unintended pages/links, makes it likely that at some point those links were exposed. It's also possible your administration area isn't as secure as you think, and the Google bot got through that way.
The well-behaved, and Google-recommended, things to do here are:
If possible, create 301 redirects from your query-string-style URLs to your canonical-style URLs. That's you saying "hey there, web bot/browser, the content that used to be at this URL is now at this other URL"
Block the query string content in your robots.txt. That's like asking the spiders or other automated programs "Hey, please don't look at this stuff. These aren't the URLs you're looking for"
Google apparently allows you to specify a canonical URL now via a <link /> tag in the top of your page. Consider adding these in.
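For the robots.txt option, a minimal rule might look like this (assuming all query-string URLs are served through index.php; robots.txt matches by path prefix, so this also covers /index.php?pg=...):

```
User-agent: *
Disallow: /index.php
```

One caveat worth knowing: a URL blocked in robots.txt cannot be crawled, so Google will never see a 301 or canonical tag placed on it. If you want the duplicates consolidated rather than merely hidden, let the redirects do their work first.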
As to whether doing the well-behaved things is the "right" thing to do re: Google rankings... who knows. Only "Google" knows how their algorithms work now and will work in the future, and by Google, I mean a bunch of engineers and executives with conflicting goals on how search should work.
Google now offers a way to specify a page's canonical URL. You can use the following code in your HTML to tell Google your canonical URL:
<link rel="canonical" href="http://www.example.com/product.php?item=swedish-fish" />
You can read more about canonical URLs on Google on their blog post on the subject, here: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html
According to the blog post, Ask.com, Microsoft Live Search and Yahoo! all support the canonical tag.
If you use sitemap generators to submit to search engines, you'll want to disallow in them as well. They are likely where Google got your links, from crawling your folder and from checking your logs.
Better check what URI has been requested ($_SERVER['REQUEST_URI']) and redirect if it was /index.php.
Changing robots.txt will not help, since the page is already indexed.
The best is to use a permanent redirect (301).
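That check-and-redirect can be sketched as follows. This is a hypothetical, framework-agnostic Python version: the pg/id parameters come from the question, while the slug lookup table is a made-up stand-in for the real CMS lookup.

```python
# Sketch: map an old query-string URL to its friendly form and answer
# with a permanent redirect; serve normally for anything else.

from urllib.parse import urlparse, parse_qs

SLUGS = {"1055": "some-text-1055"}   # id -> friendly slug (stand-in data)

def redirect_for(request_uri: str):
    """Return (status, location) for old-style URLs, or None to serve normally."""
    parsed = urlparse(request_uri)
    if parsed.path != "/index.php":
        return None
    qs = parse_qs(parsed.query)
    pg = qs.get("pg", [""])[0]
    slug = SLUGS.get(qs.get("id", [""])[0])
    if pg and slug:
        return (301, f"/{pg}/{slug}.html")
    return None

assert redirect_for("/index.php?pg=something&id=1055") == (301, "/something/some-text-1055.html")
assert redirect_for("/something/some-text-1055.html") is None
```

In PHP the same check would be driven by $_SERVER['REQUEST_URI'], as the answer above suggests, followed by a header("Location: ...", true, 301).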
If you want to remove a page once indexed by Google the only way, more or less, is to make it return a 404 not found message.
Is it possible you're posting a form to a similar URL and Google is simply picking it up from the source?
