HTTrack gives 404 on unicode urls with german special characters - url

I've noticed that HTTrack can't download files whose URLs contain special characters, such as the German ß: the server returns a 404 response.
The errors look like those in the screenshot:
Is there any setting in HTTrack that makes it handle such characters?
PS: I found a similar thread, but it has no answer:
Httrack faulty when encountering japanese encoded URLS

HTTrack seems able to fetch files without errors from URLs with special characters only if you don't run a "real" domain crawl, but instead:
first create a URL list,
save it as ISO-8859-1,
then let HTTrack crawl this list.
If HTTrack explores URLs on its own, it runs into 404 errors on URLs with special characters; at least I wasn't able to fetch them without errors. Maybe somebody will provide a magic setting ;)
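The "save it as ISO-8859-1" step can be sketched in Python as follows. This is only an illustration of the encoding step, not something taken from HTTrack's documentation; the function names, the file name urls.txt and the example URL are hypothetical.

```python
from pathlib import Path


def encode_url_list(urls, encoding="iso-8859-1"):
    """Return the URL list as bytes, one URL per line.

    Raises UnicodeEncodeError if a URL contains a character that the
    chosen encoding cannot represent (Japanese text, for example).
    """
    return ("\n".join(urls) + "\n").encode(encoding)


def write_url_list(urls, path, encoding="iso-8859-1"):
    """Write the encoded URL list to `path` (hypothetical helper)."""
    Path(path).write_bytes(encode_url_list(urls, encoding))


if __name__ == "__main__":
    # Hypothetical example URL containing a German ß.
    write_url_list(["http://www.example.com/straße.html"], "urls.txt")
```

In ISO-8859-1 the ß is stored as the single byte 0xDF, whereas UTF-8 would store it as the two bytes 0xC3 0x9F. That difference may be why the list approach behaves differently from a normal crawl, but this is a guess.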

Related

Bingbot converts unicode characters to not understandable symbols

I get a lot of errors from my site when Bing tries to index pages whose URLs contain Unicode characters.
For example:
http://www.example.com/kjøp
Bing is trying to index
http://www.example.com/kjÃ¸p
Then I get an error, "System.NullReferenceException: Object reference not set to an instance of an object.", because there is no such controller.
Google handles such links fine. How can I help Bing understand Norwegian letters?
You can confirm that Bing does not index these URLs correctly by doing an "INURL:" search like this... https://www.bing.com/search?q=inurl%3A%C3%B8
Only 6 pages are indexed which cannot be correct.
Unfortunately you won't be able to fix Bing. You may be able to compensate for its shortcoming by making some changes to your site, however. It is a burden you shouldn't have to deal with, but the other option is to do nothing and continue not getting pages properly linked.
Bing will likely have issues with URLs containing characters in this list...
https://www.i18nqa.com/debug/utf8-debug.html
Your webserver needs to look for URL requests containing these characters. You will then replace the wrong characters with the correct ones and do a 301 redirect to the correct page. The specifics depend on what kind of server and programming language you are using. In your case it is most likely IIS and MVC so you would most likely look into Microsoft's URL Rewrite extension. https://www.iis.net/downloads/microsoft/url-rewrite
Before doing this however I would see what errors Bing's webmaster tools might provide.
https://www.bing.com/toolbox/webmaster
The other option is to not use those characters in your URLs. My recommendation is to take the time to implement the wrong-to-right translation. Bing will eventually fix this, but it could be quite a while.
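The "wrong to right translation" can be sketched as below. The asker's site is probably ASP.NET MVC, so this Python version only illustrates the idea. It assumes the framework has already percent-decoded the request path into a string, and that the garbling is UTF-8 bytes misread as Windows-1252, which is the case the linked chart covers. The function name is hypothetical.

```python
def repair_mojibake(path):
    """Return the repaired path if `path` looks like UTF-8 text that was
    misread as Windows-1252; return None if no repair applies."""
    try:
        # Undo the wrong decoding: back to the original bytes, then
        # decode those bytes as UTF-8, which is what they really were.
        repaired = path.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None  # not this kind of mojibake, leave the request alone
    return repaired if repaired != path else None


if __name__ == "__main__":
    target = repair_mojibake("/kjÃ¸p")
    if target is not None:
        # A real handler would answer with a 301 redirect to `target`.
        print("301 ->", target)
```

A correctly encoded path such as /kjøp fails the UTF-8 decoding step and is returned as None, so valid requests are not redirected.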

Is '|' a recommended separator for semantic URLs?

After researching on Google and SO, there seem to be conflicting opinions on this.
We have run into a problem with Google Chrome substituting %7C for the | separator, whereas Firefox and Safari do not.
Here's an example:
http://www.example.com/page1|sub-page2|sub-page-3
Are there any strict rules to follow when choosing a separator character for semantic URLs and are there any strong arguments against (or workarounds when) using |?
| is not a valid character in a URL. Modern browsers will silently encode it to %7C when sending, and may or may not display this change in the address bar. Similarly, servers will silently decode the character for you.
This would have been a problem last millennium, when browsers would crash just because you didn't specify http://, but today you can use whatever you want and the browser will take care of it. However, automatic parsers such as Markdown may not accept it as a valid URL, for example http://example.com/test|fish. In this case it looks like it does, but try that on my forums and it will complain at you.
Internet Explorer and Chrome use URL encoding when displaying the URL in the address bar after a page request has been made; %7C is the safe way of displaying a pipe ('|'),
so it's not a problem that Chrome is doing this.
As a cheeky fix to make all browsers behave the same way, why not use %7C as your separator from the get-go instead of a pipe? All browsers should then interpret this as a pipe for you behind the scenes, but display it as %7C in the address bar.
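The encoding the browsers apply can be reproduced with Python's standard library. This is a small sketch using the example path from the question; it shows that | and %7C are two spellings of the same character once the server decodes the path.

```python
from urllib.parse import quote, unquote

path = "/page1|sub-page2|sub-page-3"

# quote() leaves "/" and unreserved characters alone and
# percent-encodes everything else, including the pipe.
encoded = quote(path)

# unquote() reverses the percent-encoding, as a server would.
decoded = unquote(encoded)

if __name__ == "__main__":
    print(encoded)
    print(decoded)
```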

Characters with accents from a MySQL DB showing correctly on PHP pages but not on HTML

I have searched and searched and applied the obvious fixes, but it seems I have another variant of the problem. I have PHP pages that display which song is currently playing, which songs are coming next and which were played recently on my web radio station; the info comes from MySQL. The characters are displayed correctly on the PHP pages.
This is where it gets tricky. I also have HTML pages which load two divs from a PHP page, so that the upcoming songs also display on those HTML pages, and that is where the accented characters don't show correctly. I have the correct meta tag in the header on those pages and have also used the .htaccess file trick (although I was not sure how important the location of the line in the file was, so I tried various places). I even opened my .htaccess in Notepad++ to change the encoding to UTF-8 without BOM. I even added a meta tag for UTF-8 in the PHP page header, and then the characters didn't work on PHP either; probably you're not supposed to do that.
As you can see, I have spent a lot of time on this. Interestingly, the characters display correctly on the iPad; it's in the PC browsers that it doesn't work. Maybe no one has ever tried loading divs from PHP into HTML with special characters before. If anyone is interested in having a think, that would be great, but it's not a vital problem, just a nice-to-have fix. The server side of my stuff is hosted on a hosting site.
Thanks

Internet Explorer does not display Chinese characters from the URL

I am working on a requirement to display (make readable) characters from the URL.
When I use Google Chrome, it displays the parameters in Chinese - even though they are encoded to UTF-8.
When I use Mozilla Firefox, it displays the parameters in Chinese - even though they are encoded to UTF-8.
When I use Internet Explorer, it displays the parameters encoded in UTF-8.
N.B. The URL is encoded to UTF-8; I know that because when I copy the URL from the three of them and paste it to Notepad++ the three of them display the following:
/%E6%89%93%E5%BC%80%E7%9B%AE%E5%BD%95/%E7%9B%B8%E6%9C%BA/%E6%95%B0%E7%A0%81%E7%9B%B8%E6%9C%BA/%E5%B0%8F%E5%9E%8B%E6%95%B0%E7%A0%81%E7%9B%B8%E6%9C%BA/PowerShot-A480/p/1934793
Could it be that Mozilla Firefox and Google Chrome guys have this improvement that can make an encoded String readable and perhaps the IE guys do not support that? Or, is there any way to activate that with IE?
By the way... Going to View >> Encoding >> Unicode (UTF-8) takes care of the text inside of the page but does not make any difference for the text in the URL.
Any help will be greatly appreciated!
I've written a blog post about Internet Explorer not displaying the decoded version of non-ASCII characters and using IRIs to solve the problem.
As of today, we have the following situation:
HTML5 supports IRIs, i.e. URIs with Unicode character support
HTTP does not support IRIs, but all major browsers take care of converting IRIs to valid (encoded) URIs to retrieve the specified resource (page).
IE supports IRIs in the href attribute of anchor tags and properly displays them in its address bar just like when you enter your URL by hand (keyboard ;-)).
If you choose to percent-encode your IRI thus making it a URI, IE will not decode that URI back into an IRI.
So you could try the following:
Save your HTML files using UTF-8. This allows you to insert any Unicode character into it.
Do not percent-encode your URLs inside your HTML pages' links. Just use links like this: 亦思巴奚兵乱
A great article on the topic can also be found at the W3C: An Introduction to Multilingual Web Addresses.
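The IRI-to-URI conversion that browsers perform, and its reverse, can be sketched with Python's standard library. The helper names are hypothetical, and the sample path is a shortened form of the one in the question.

```python
from urllib.parse import quote, unquote


def iri_path_to_uri_path(path):
    """Percent-encode the UTF-8 bytes of every non-ASCII character,
    keeping "/" as the path separator."""
    return quote(path, safe="/")


def uri_path_to_iri_path(path):
    """Decode percent-escapes as UTF-8 to get the readable form."""
    return unquote(path)


if __name__ == "__main__":
    uri_path = "/%E6%89%93%E5%BC%80%E7%9B%AE%E5%BD%95/p/1934793"
    print(uri_path_to_iri_path(uri_path))
```

Both forms identify the same resource. The difference the asker sees between browsers is only in which form the address bar chooses to display.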

Why is IIS 7.5 / Coldfusion 9 adding a weird character to URL string?

We have built a "redirect" engine into our product so our customers can add/edit/delete custom redirects without us having to maintain a bunch of rewrite rules on the server.
Some issues are arising in the URLs that get passed into our code. We are pulling these from the CGI.QUERY_STRING property populated by Coldfusion (it picks up on 404's thrown by IIS/Coldfusion, which appends the bad URL as a query string like ?404;http://www.mysite.com:80/nonexistent-file.cfm).
What we see is that some special characters are getting an additional character (an Â) thrown in. Take this URL (%A9 is the copyright symbol):
http://www.mysite.com/%A9/
The CGI.QUERY_STRING is reporting this as:
http://www.mysite.com:80/Â©/
I have no idea where this extra "Â" is coming from. I have a hunch that this is being brought in by IIS, but it could also be with Coldfusion as it has to populate the CGI variable.
Any ideas as to why this is happening and how to fix it? It appears that not all percent-encoded/special characters do this...
EDIT:
I am probably giving up on my exact problem, however, it would be beneficial still to know why either IIS or Coldfusion is throwing in this extra character (especially for certain escape sequences over others).
Wow... that's a tough one. Usually folks design sites to use alphanumerics plus the tilde (~) and dash (-). I'm not even sure the RFC allows a copyright symbol as part of the host header. I'm not positive that it should be allowed in the scheme portion of the URL. This article might shed some light on it for you. As for IIS, you might be able to add a specific rewrite rule that takes care of the issue. Personally, I would avoid these characters in the scheme part of the URL.
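One plausible explanation for the extra Â, not confirmed anywhere in this thread, is a double conversion: %A9 is decoded as the Latin-1 byte for ©, the character is then written out as UTF-8 (two bytes, 0xC2 0xA9), and those bytes are finally displayed as Latin-1 again. The sketch below simulates that chain; the function name is hypothetical, and which of IIS or ColdFusion performs each step is an assumption.

```python
from urllib.parse import unquote


def simulate_double_decoding(percent_encoded):
    """Simulate: decode escapes as Latin-1, re-encode as UTF-8,
    then misread the UTF-8 bytes as Latin-1."""
    text = unquote(percent_encoded, encoding="latin-1")
    return text.encode("utf-8").decode("latin-1")


if __name__ == "__main__":
    for escape in ("%A9", "%E9", "%7E"):
        print(escape, "->", simulate_double_decoding(escape))
```

Under this assumption, only bytes 0x80 to 0xBF pick up an Â, bytes 0xC0 to 0xFF pick up an Ã instead, and ASCII escapes pass through unchanged. That would match the observation that not all percent-encoded characters are affected.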
