Are extended ASCII chars [äöå] allowed in URLs?

I have seen some websites that use äöå-characters in their slugs (they don't convert them to aoo). So what's the case?

Åå (as in affOrd), Ää (as in Air), and Öö (as in gIrl) are our lovely Swedish additions to the Latin alphabet. Nowadays, these can even be used in domain names (at least under the .SE top-level domain). See, for instance, http://www.linköping.se. Of course, this is an example of an IDN (internationalized domain name).

Yes, they are!
But see here: Should I use accented characters in URLs?
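
To make the distinction concrete: what is actually sent over the wire is ASCII only, so non-ASCII characters in a path are percent-encoded and non-ASCII host names are converted to an "xn--" form. The following is a minimal Python sketch, assuming UTF-8 for the path and Python's built-in idna codec (which implements the older IDNA 2003 rules) for the host. The slug value is made up for illustration.

```python
from urllib.parse import quote, unquote

# Hypothetical slug containing Swedish letters.
slug = "räksmörgås"

# Path segments: each non-ASCII character becomes its UTF-8 bytes, percent-encoded.
encoded_slug = quote(slug)
print(encoded_slug)           # r%C3%A4ksm%C3%B6rg%C3%A5s
print(unquote(encoded_slug))  # räksmörgås

# Host names: an IDN is converted to an ASCII "xn--" (Punycode) form instead.
ascii_host = "www.linköping.se".encode("idna").decode("ascii")
print(ascii_host)             # www.xn--linkping-q4a.se
```

Browsers usually display the readable form and do this conversion behind the scenes.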

Related

Is it necessary to use &mdash; and &ndash; in XHTML or HTML5?

It seems that it is best to use the &amp; escape, instead of simply typing the ampersand (&).
However, should we be using X/HTML character entity references for dashes and other common typographical characters when writing blog posts on CMSs like WordPress or hard-coding websites by hand?
For example:
&ndash; is an en dash (–)
&mdash; is an em dash (—)
What is the risk if we do not?
Why is the hyphen (-) never written as &#45; but simply typed directly from the keyboard in HTML? (Assuming that it is the hyphen, and not a minus sign.)

The W3C released an official response about when to use and when not to use character escapes which you can find here. As they are also the group that is in charge of the HTML specification, I think it's best to follow their advice.
From the section "When to Use Escapes"
Syntax characters. There are three characters that should always appear in content as escapes, so that they do not interact with the syntax of the markup. These are part of the language for all documents based on XML and for HTML.
< (&lt;)
> (&gt;)
& (&amp;)
They also mention using characters that might not be supported in the current encoding.
From the section "When Not to Use Escapes"
It is almost always preferable to use an encoding that allows you to represent characters in their normal form, rather than using character entity references or NCRs.
Using escapes can make it difficult to read and maintain source code, and can also significantly increase file size.
http://www.w3.org/International/questions/qa-escapes

Those entities are there to help you, the author, with characters not usually typable on your average keyboard. (The em dash, &mdash;, is an example, as are &copy; and &nbsp;.)
You only need to escape those characters that have meaning in (X)HTML < > and &.
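
As a rough illustration of that advice, here is a small Python sketch using the standard library's html module. The sample strings are made up. It escapes only the syntax characters and leaves dashes and other typographic characters as literal text, which is fine as long as the page is served in an encoding such as UTF-8.

```python
import html

# Syntax characters must be escaped so they are not parsed as markup.
text = "if a < b && b > c"
escaped = html.escape(text, quote=False)
print(escaped)  # if a &lt; b &amp;&amp; b &gt; c

# Dashes and the copyright sign have no meaning in the markup,
# so escaping leaves them exactly as typed.
typographic = "pages 10–12 — © notice"
untouched = html.escape(typographic, quote=False)
print(untouched == typographic)  # True

# Named entities still decode to the same characters if someone used them.
decoded = html.unescape("&ndash; &mdash; &copy;")
print(decoded)  # – — ©
```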

Is there known URI scheme or URN namespace for Unicode characters?

I need to refer to a Unicode character with a URI. The following IANA references list multiple schemes and namespaces but do not mention anything about identifiers for Unicode characters. Does anyone know if something like this exists already?
http://www.iana.org/assignments/uri-schemes.html
http://www.iana.org/assignments/urn-namespaces/urn-namespaces.xml
I hoped to find something like
unicode://U+0394
urn:unicode://0394
http://unicode.org/unicode/0394
for the Greek capital letter delta Δ.
If someone wonders, this is for a semantic web like application that uses URIs as identifiers for concepts, including concepts of the Unicode characters.

I’m afraid there is no URL or URN for referring to authoritative information on a Unicode character in general. In the Unicode Standard, information about individual characters is partly in the so-called character database (mostly plain text files in specific formats), partly in the Code Charts (PDF files). Neither of them offers a way to point at an individual character. Moreover, the information there is not exhaustive: there are important remarks on individual characters scattered around the standard.
The Decodeunicode site has individually addressable items, such as
http://www.decodeunicode.org/en/u+0394
but its information content varies a lot and is generally very limited. It is not official, and it currently contains Unicode 5.0 only.
The Fileformat.info site is much more systematic, but it, too, is unofficial. It is basically limited to formal properties and data derivable from them, plus comments extracted from the Code Charts, plus instructions on typing the character in Windows, plus information about support in fonts—but that’s quite a lot! Example:
http://www.fileformat.info/info/unicode/char/0394/

[EDIT]: found this URL matching your needs: http://unicode.org/cldr/utility/character.jsp?a=1F40F
Well, there is a URL referencing the authoritative information in the Unicode database, even though it does not describe (as said in the other answer) all the information on one specific character.
You have the following URL, pointing to the latest Unicode database. This is a simple list of existing valid Unicode characters. Some upcoming characters are missing (㋿), and you should expect it to be mutable.
https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
The contents look like the following, which isn't so practical to use as-is.
$ grep -ai kangaroo UnicodeData.txt -C 7
1F991;SQUID;So;0;ON;;;;;N;;;;;
1F992;GIRAFFE FACE;So;0;ON;;;;;N;;;;;
1F993;ZEBRA FACE;So;0;ON;;;;;N;;;;;
1F994;HEDGEHOG;So;0;ON;;;;;N;;;;;
1F995;SAUROPOD;So;0;ON;;;;;N;;;;;
1F996;T-REX;So;0;ON;;;;;N;;;;;
1F997;CRICKET;So;0;ON;;;;;N;;;;;
1F998;KANGAROO;So;0;ON;;;;;N;;;;;
1F999;LLAMA;So;0;ON;;;;;N;;;;;
1F99A;PEACOCK;So;0;ON;;;;;N;;;;;
1F99B;HIPPOPOTAMUS;So;0;ON;;;;;N;;;;;
1F99C;PARROT;So;0;ON;;;;;N;;;;;
1F99D;RACCOON;So;0;ON;;;;;N;;;;;
1F99E;LOBSTER;So;0;ON;;;;;N;;;;;
1F99F;MOSQUITO;So;0;ON;;;;;N;;;;;
You could build up a hacky « hash-based » namespace with a suffix like this, but that's definitely non-standard.
https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt#1F998
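
If you go down that route, each line of UnicodeData.txt is a semicolon-separated record, so looking up a code point is easy to script. A minimal Python sketch follows. The UcdRecord type and parse_ucd_line helper are hypothetical names for illustration, and only the first three of the 15 fields are kept.

```python
from typing import NamedTuple


class UcdRecord(NamedTuple):
    """The first three fields of a UnicodeData.txt line (illustrative subset)."""
    code_point: int
    name: str
    general_category: str


def parse_ucd_line(line: str) -> UcdRecord:
    """Split one semicolon-separated UnicodeData.txt line into a record."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return UcdRecord(int(fields[0], 16), fields[1], fields[2])


record = parse_ucd_line("1F998;KANGAROO;So;0;ON;;;;;N;;;;;")
print(record.name)                   # KANGAROO
print(f"U+{record.code_point:04X}")  # U+1F998
```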

Since this is also tagged semantic-web, I will try to pick URIs that are easily (and permanently) dereferenceable and cannot be mistaken for a document describing that character: the data: scheme. Not only can that refer to a character in Unicode, but any encoding, and also any string thereof.
data:;charset=utf-8,%CE%94
Attempting to open this URI should result in a text/plain file with the single character as its content.
If the system accepts IRIs (as many semantic web applications do), the character can be included directly:
data:;charset=utf-8,Δ
This is mapped to the same URI as shown above, and your browser may convert it directly. Specifying UTF-8 is necessary in this case, since the mapping is not defined for other encodings.
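
Building such a URI programmatically is a one-liner. Here is a small Python sketch, where char_to_data_uri is a hypothetical helper name, that percent-encodes the UTF-8 bytes of the text and decodes them back:

```python
from urllib.parse import quote, unquote


def char_to_data_uri(text: str) -> str:
    """Return a data: URI whose payload is the UTF-8, percent-encoded text."""
    return "data:;charset=utf-8," + quote(text, safe="")


uri = char_to_data_uri("Δ")
print(uri)  # data:;charset=utf-8,%CE%94

# Everything after the first comma is the payload; decoding it restores the character.
payload = uri.split(",", 1)[1]
print(unquote(payload))  # Δ
```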

Why do you need to encode URLs?

Why do you need to encode urls? Is there a good reason why you have to change every space in the GET data to %20?

Because some characters have special meanings.
For instance, in a query string, the ampersand (&) is used as a separator between key-value pairs. If you were to put an ampersand into one of those values, it would look like the separator between the end of a value and the beginning of the next key. So for special characters like this, we use percent encoding so that we can be sure that the data is unambiguously encoded.
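
The ambiguity is easy to reproduce. In this Python sketch, with parameter names and values made up for illustration, a naively joined query string loses part of the value, while the encoded one round-trips:

```python
from urllib.parse import parse_qs, urlencode

params = {"q": "fish & chips", "page": "2"}

# Naive concatenation: the ampersand inside the value looks like a separator.
naive = "&".join(f"{key}={value}" for key, value in params.items())
print(naive)            # q=fish & chips&page=2
print(parse_qs(naive))  # the "q" value is no longer "fish & chips"

# Proper encoding: the ampersand becomes %26 and survives the round trip.
encoded = urlencode(params)
print(encoded)            # q=fish+%26+chips&page=2
print(parse_qs(encoded))  # {'q': ['fish & chips'], 'page': ['2']}
```

Note that urlencode writes spaces as + rather than %20. Both forms are decoded as a space in a query string.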

From RFC 2396, section 2.4.3:
The space character is excluded because significant spaces may disappear and insignificant spaces may be introduced when URI are transcribed or typeset or subjected to the treatment of word-processing programs. Whitespace is also used to delimit URI in many contexts.

Originally, older browsers could get confused by the spaces (not really an issue anymore).
Now, if someone copies the URL to send as a link, the space can break the hyperlink, e.g.:
Hey! Check out this derping cat playing a piano!
http://www.mysite.com/?video=funny cat plays piano.
See how the link breaks?
Now look at this:
http://www.mysite.com/?video=funny%20cat%20plays%20piano.
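
In code you would rarely write the %20 by hand. A minimal Python sketch, reusing the example site from above:

```python
from urllib.parse import quote, quote_plus

title = "funny cat plays piano"

# quote() percent-encodes the spaces, so the link stays in one piece.
url = "http://www.mysite.com/?video=" + quote(title)
print(url)  # http://www.mysite.com/?video=funny%20cat%20plays%20piano

# quote_plus() uses the form-style "+" for spaces instead.
print(quote_plus(title))  # funny+cat+plays+piano
```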

Let's break down your question.
Why do you need to encode URL?
Only a limited set of characters can appear as plain data in a URL: digits (0-9), letters (A-Z, a-z), and a few special characters ("-", ".", "_", "~"). Other characters, such as "/", "?", "&" and "=", are reserved as delimiters.
So does it mean that we cannot use any other character?
The answer to this question is "YES". But wait a minute, there is a hack, and the hack is URL encoding, also called percent-encoding. So if you want to transmit any character which is not a member of the above-mentioned set (digits, letters, and special chars), then you need to encode it. And that is why we need to encode "space" as "%20".
OK? Is this enough for URL encoding? No, this is not enough. There's a lot more to URL encoding, but I'm not going to make this a big, boring technical answer. If you want to know more, you can read about it here: https://www.urlencoder.io/learn/ (Credit goes to this writer)
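
To show what the encoding actually does, here is a hand-rolled Python sketch of percent-encoding under the rule described above: keep the unreserved characters, and replace every other byte of the UTF-8 form with % plus two hex digits. The percent_encode helper is a hypothetical name; in practice urllib.parse.quote does the same job.

```python
import string
from urllib.parse import quote

# Letters, digits and the four special characters that never need encoding.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def percent_encode(text: str) -> str:
    """Percent-encode every byte that is not an unreserved character."""
    pieces = []
    for byte in text.encode("utf-8"):
        char = chr(byte)
        pieces.append(char if char in UNRESERVED else f"%{byte:02X}")
    return "".join(pieces)


sample = "a b&c~d"
print(percent_encode(sample))  # a%20b%26c~d
print(percent_encode("é"))     # %C3%A9

# The standard library gives the same result when no extra characters are marked safe.
print(percent_encode(sample) == quote(sample, safe=""))  # True
```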

Well, you do so because not every browser knows how the string that makes up the URL is encoded. Converting the space to %20, etc., makes that URL/URI portable. The string could be Latin-1, or it could be Unicode. It needs to be normalized to something that is understood universally. Take a look at RFC 3986: https://www.rfc-editor.org/rfc/rfc3986#section-2.1

What's the best character to represent blank spaces in a URL?

When you are building URLs that should be legible for users and search engines and you do it automatically from the content, what's the best way to represent blank spaces? Hyphens (this is what StackOverflow uses)? Underscores? Any other? Does any of those make a difference for SEO?

Both are valid URL characters and both have their pros and cons.
Pro dash
Google recommends dashes, and here is what Matt Cutts from Google has to say about dashes vs. underscores:
If you have a url like word1-word2, that page can be returned for the searches word1, word2, and even “word1 word2”.
That’s why I would always choose dashes instead of underscores.
Dashes seem to be what major blogs do: The Huffington Post, TechCrunch, Engadget, ...
Dashes seem to be what major CMS do. Not sure about that one anymore, can anyone comment?
As mentioned by Kazar, underscores can clash with the underlining of links.
I find underscores awkward to type.
Rene Saarsoo pointed out that dashes take less space than underscores in proportional fonts.
Ionut G. Stan mentioned that underscores are not allowed in hostnames. If you strive for consistency you should opt for dashes.
Pro underscore
Dashes are not allowed in ISO9660 file systems. This can be a problem if your content is also shipped on DVD or CD (e.g. help files or eLearning content).
In some languages (e.g. German) dashes can be word characters and are not generally considered word separators.

Another advantage of dashes is that in a proportional font they take less space than underscores. Compare:
https://stackoverflow.com/../whats-the-best-character-to-represent-blank-spaces-in-a-url
https://stackoverflow.com/../whats_the_best_character_to_represent_blank_spaces_in_a_url
It's not a lot, but every little helps :)

Again, personal preference - personally I think hyphens work better than underscores, because underscores can clash with the underlining a tags add (by default), so http://someurl.com/this_is_a_address looks like there are no underscores there. (as this is stack overflow, roll over the link). http://someurl.com/this-is-a-address looks fine.

You know, if you buy a domain name, you're allowed to use hyphens inside that name, but no underscores. This is an additional reason for which I believe hyphens are better than underscores.

I'd say dashes. I used to use underscores for pretty much every such purpose (representing spaces) but nowadays, with all the visual thingies blinking all round, you often find underlining that makes them normally invisible.

This may answer your question. It looks like things changed at Google a few years ago regarding - and _.
See this article:
http://www.blog-tutorials.com/marketing-and-seo/linking/google-oks-underscores-as-word-separators-in-urls-and-more-seo-tips/

I think that depends on your favorite. My favourites are underscores, but I don't see any (dis-)advantages if using hyphens or other valid URL characters instead. And everything looks better than %20 :)
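
Whichever separator you pick, generating the slug automatically from a title looks much the same. Here is a minimal Python sketch, assuming ASCII-only slugs and a hypothetical slugify helper; it is not how Stack Overflow or any particular CMS actually does it.

```python
import re


def slugify(title: str, separator: str = "-") -> str:
    """Lowercase the title and join its words with the chosen separator."""
    lowered = title.lower()
    # Drop apostrophes so "what's" becomes "whats" rather than "what-s".
    without_apostrophes = re.sub(r"['’]", "", lowered)
    # Collapse every run of other non-alphanumeric characters into one separator.
    joined = re.sub(r"[^a-z0-9]+", separator, without_apostrophes)
    return joined.strip(separator)


title = "What's the best character to represent blank spaces in a URL?"
print(slugify(title))
# whats-the-best-character-to-represent-blank-spaces-in-a-url
print(slugify(title, separator="_"))
# whats_the_best_character_to_represent_blank_spaces_in_a_url
```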

URLs: Dash vs. Underscore [closed]

Is it better convention to use hyphens or underscores in your URLs?
Should it be /about_us or /about-us?
From a usability point of view, I personally think /about-us is much better for the end user, yet Google and most other websites (and JavaScript frameworks) use the underscore naming pattern. Is it just a matter of style? Are there any compatibility issues with dashes?

From Google Webmaster Central
Consider using punctuation in your URLs. The URL http://www.example.com/green-dress.html is much more useful to us than http://www.example.com/greendress.html. We recommend that you use hyphens (-) instead of underscores (_) in your URLs.

Here are a few points in favor of the dashes:
Dashes are recommended by Google over underscores (source).
Dashes are more familiar to the end user.
Dashes are easier to write on a standard keyboard (no need to Shift).
Dashes don't hide behind underlines.
Dashes feel more native in the context of URLs as they are allowed in domain names.

It's not just dash vs. underscore:
text with spaces
textwithoutspaces
encoded%20spaces%20in%20URL
underscore_means_space
dash-means-space
plus+means+space
camelCase
PascalCase
" quoted text with spaces" (and single quote vs. double quote)
slash/means/space
dot.means.space

Google did not treat underscore as a word separator in the past, which I thought was pretty crazy, but apparently it does now. Because of this history, dashes are preferred. Even though underscores are now permissible from an SEO point of view, I still think that dashes are best.
One benefit is that your average semi-computer-illiterate web surfer is much more likely to be able to type a dash on the keyboard, they may not even know what the underscore is.

This is just a guess, but it seems they picked the one that people most probably wouldn't use in a name. This way you can have a name that includes a hyphenated word, and still use the underbar as a word delimiter, e.g. UseTwo-wayLinks could be converted to use_two-way_links.
In your example, /about-us would be a directory named the hyphenated word "about-us" (if such a word existed), and /about_us would be a directory named the two-word phrase "about us" converted to a single string of non-white characters.
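
The conversion described here, keeping existing hyphens and using the underscore only at word boundaries, can be sketched in a few lines of Python. The camel_to_underscore helper is a hypothetical name for illustration.

```python
import re


def camel_to_underscore(name: str) -> str:
    """Insert "_" at each lowercase-to-uppercase boundary, then lowercase."""
    # The pattern matches the empty position between e.g. "e" and "T";
    # hyphens already present in the name are left untouched.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()


print(camel_to_underscore("UseTwo-wayLinks"))  # use_two-way_links
print(camel_to_underscore("AboutUs"))          # about_us
```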

I used to use underscores all the time, now I only use them for parts of a web site that I don't want anyone to directly link, js files, css, ... etc.
From an SEO point of view, dashes seem to be the preferred way of handling it. For a detailed explanation from the horse's mouth, see http://www.mattcutts.com/blog/dashes-vs-underscores/.
The other problem that seems to occur, more with the general public than programmers, is that when a hyperlink with underscores is underlined, you can't see the underscore. Advanced users will work it out, but Joe Public probably won't.
Still use underscores in code in preference to dashes though - programmers understand them, most other people don't.

Jeff has some thoughts on this: https://blog.codinghorror.com/of-spaces-underscores-and-dashes/
There are drawbacks to both. I would suggest that you pick one and be consistent.

I'm more comfortable with underscores. First of all, they match in with my regular programming experience of variable_names_are_not-subtraction, second of all, and I believe this was mentioned already, words can have hyphens, but they do not ever have underscores. To pick a really stupid example, "Nation-state country" is different from "nation state country". The former translates something like "the land of nation-states" (think "this here is gun country! Best move along, y'hear?"), whereas the latter looks like a list of sometime-synonyms. http://example.com/nation-state-country/ doesn't appear to mean the same as http://example.com/nation-state_country/, and yet, if hyphens are delimiters/"space"s in addition to characters in words, it can. The latter seems more clear as to the actual purpose, whereas the former looks more like that list, if anything.

The SEO guru Jim Westergren tested this back in 2005 from a strict SEO perspective and came to the conclusion that + (plus) was actually the best word delimiter. However, this doesn't seem reasonable and may be due to a bug in the search engines' algorithms. He recommends - (dash) for both readability and SEO.

Underscores replace spaces where whitespace is not allowed. Dashes (hyphens) can be part of a word, thus joining words with hyphens that already include hyphens is ugly/confusing.
Bad:
/low-budget-movies
Good:
/low-budget_movies

I think dash is better from a user perspective and it will not interfere with SEO.
Not sure where or why the underscore convention started.
A little more knowledgeable debate

I prefer dashes on the basis that an underscore might be obscured to an extent by a link underline. Textual URLs are primarily for being recognised at a glance rather than being grammatically correct so the argument for preserving dashes for use in hyphenated words is limited.
Where the accuracy of a textual URL is important is when reading it out to someone, in which case you don't want to confuse an underscore for a space (or vice-versa).
I also find dashes more aesthetically pleasing, if that counts for anything.

From the end user's point of view, I prefer "about-us" or "about us", not "about_us".

Personally, I'd avoid using about-us or about_us, and just use about.

Some older web hosting and DNS servers actually have problems parsing underscores for URLs, so that may play a part in conventions like these.

I personally would avoid all dashes and underscores and opt for camelCase or PascalCase if it's in code.
The Wikipedia article on camelCase explains a bit of the reasoning behind its origins. They amount to:
Lazy programmers who didn't like reaching for the _ key
Potential confusion about readability
The "Alto" keyboard at Xerox PARC that had no underscore key.
If the user is to see the string then I'd do none of the above and use "About us", or "AboutUs" if I had to, as camelCase has spread to common usage in some areas such as product names, e.g. ThinkPad, TiVo.

Spaces can be used in URLs as long as they are percent-encoded, so you can just use "/about us" in a link (although that will be encoded to "/about%20us"). But to be honest, this will always be personal preference, so there is no real answer to be given here.
I would go with the convention that dashes can appear in words, so spaces should be converted to underscores.

Better use . - / as separators, because _ seems not to be a separator.
http://www.sistrix.com/blog/832-how-long-may-a-linktext-be.html
