Why use - instead of _ in a URL?

Why use - instead of _ in a URL?
A URL containing '_' seems to have no bad effects.

Underscores are not allowed in a host name. Thus some_place.com is not a valid URL because the host name is not valid. Underscores are permissible in the rest of a URL, such as the path. Thus some-place.com/which_place/ is perfectly legitimate, other concerns aside.
From RFC 1738:
host
[...] Fully qualified domain names take the form as described
in Section 3.5 of RFC 1034 [13] and Section 2.1 of RFC 1123
[5]: a sequence of domain labels separated by ".", each domain
label starting and ending with an alphanumerical character and
possibly also containing "-" characters. The rightmost domain
label will never start with a digit, though, which
syntactically distinguishes all domain names from the IP
addresses.

When you read a_long_sentence_with_many_underscores, because you are reading it by letter or word recognition, your eye tracks along the middle of the line, but when you reach an underscore, your eye is more likely to track down a bit and back up for the next word.
When you read a-long-sentence-with-many-dashes, your eye keeps tracking along the same horizon, and by sight, it is easier for your brain to try and ignore them.
Another good reason is that Google and other search engines rank URLs that match search terms higher when the word separator is a dash.

One main reason is that most anchor tags have text-decoration:underline which effectively hides your underscore.
And a non-tech-savvy user won't automatically assume that there is an underscore :)

By the way, it seems several Java network libraries cannot interpret a URL correctly when the host name contains an underscore:
import java.net.URI;
URI dashed = URI.create("http://www.google-plus.com/");
System.out.println(dashed.getHost()); // prints www.google-plus.com
URI underscored = URI.create("http://www.google_plus.com/");
System.out.println(underscored.getHost()); // prints null

It's easier to type (at least on my German keyboard) and to see.

Related

Unicode URLs shown in wrong order

I have enabled Unicode URLs on my Joomla site.
My language is Persian, which is a right-to-left language, but
URLs written in Persian appear in the wrong order. For example:
Mysite.com/محصولات/محصول-اول
It translates to:
Mysite.com/first-product/products
Which should have been:
Mysite.com/products/first-product
This is only a matter of how the text is displayed. I know that the actual text the server receives is in the correct order because the URL-encoded version has the correct order.
(If you don't get the idea, type "something.com/" in your URL bar. Now copy/paste this at the end of the URL:
محصولات
Now type a slash and copy/paste this at the end
محصول
You see? The last one should have gone to the right but goes to the left)
I have two questions regarding this issue:
1. Is there anything I can do to display URLs in the correct order?
2. Can it affect how Google indexes my pages? Can it misdirect Google?
The behaviour of the URL display is totally correct in the Unicode sense, as the slash is defined as bidirectionally neutral:
http://www.fileformat.info/info/unicode/char/002f/index.htm
Thus, standing between two Arabic (right-to-left) words, the slash has to adapt to the writing direction of the surrounding words. Within a right-to-left neighborhood, the slash never adopts the writing direction of the whole line.
To answer your questions:
(1) It is not possible to influence this behaviour if you do not change the URL, as Jukka K. Korpela already assumed.
(2) As long as the order of the words is correctly encoded, I do not see any bad consequences for search engine indexing.
If you want to change it anyway, and assuming that your URLs are artificial and do not represent real paths, I can see the following workarounds:
(a) Substitute the slash with another "strong" symbol which influences the writing direction.
(b) Insert a "pseudo-strong" character (U+200E, the left-to-right mark) before the slash, which will enforce LTR for the slash.
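Workaround (b) could be sketched roughly as follows. This is only an illustration: the class and method names are made up, and the result is meant for displayed link text only, since a mark inserted into the real link target would become part of the URL.

```java
final class BidiPathDisplay {
    // U+200E LEFT-TO-RIGHT MARK: an invisible character with strong LTR direction.
    static final char LRM = '\u200E';

    // Returns a display-only copy of the path with an LRM in front of every slash.
    // Hypothetical helper: do not use the result as the actual href.
    static String forDisplay(String path) {
        return path.replace("/", LRM + "/");
    }

    public static void main(String[] args) {
        // The escapes spell the Persian word for "products" from the question.
        String path = "/\u0645\u062D\u0635\u0648\u0644\u0627\u062A/item";
        String shown = forDisplay(path);
        // Two slashes in the path, so two marks are added.
        System.out.println(shown.length() - path.length());
    }
}
```

Whether this actually changes the rendering depends on where the text is shown; the browser's address bar is not under the page's control.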
Hope this helps.

Is there a character that is illegal in all parts of a URI?

I need a character to separate two or more URIs in one string. Later I will split the string to get each URI separately.
The problem is I'm not sure what character to pick here. Is there a good character to choose here that definitely can't be part of a URI itself? Or is ultimately pretty much all characters allowed in a URI?
I know certain characters are illegal in certain parts of the URI, but I'm talking about a URI as a whole, like this:
scheme://username:password@domain.tld/path/to/file.ext?key=value#blah
I'm thinking maybe space, although technically I suppose that could be part of the password, or would it be escaped as %20 in that case?
Any of the control characters should be good for this, such as TAB, FF and so on.
RFC3986 (a) controls the URI specification and Appendix A of that RFC states that the characters are limited to:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
0123456789-._~:/?#[]@!$&'()*+,;=
(and the % encoding character, of course, for all other characters not listed above).
So, basically, any other character should be okay as a delimiter.
(a) This has actually been augmented by RFC6874 which has to do with changes to the IPv6 part of the URI, adding a zone identifier. Since the zone ID consists of % and "unreserved" characters already included above, it doesn't change the set of characters allowed.
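As a rough sketch of this approach, the following joins and splits URIs using TAB as the delimiter. The class name, method names and example URIs are invented for illustration, and the code assumes the input URIs are already properly encoded.

```java
import java.util.Arrays;
import java.util.List;

final class UriListCodec {
    // Characters RFC 3986 allows in a URI, plus '%' for percent-encoding.
    static final String URI_CHARS =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
        + "0123456789-._~:/?#[]@!$&'()*+,;=%";

    // TAB is a control character and does not appear in URI_CHARS.
    static final String DELIMITER = "\t";

    static String join(List<String> uris) {
        for (String uri : uris) {
            // Guard against malformed input that contains a raw TAB.
            if (uri.contains(DELIMITER)) {
                throw new IllegalArgumentException("URI contains the delimiter: " + uri);
            }
        }
        return String.join(DELIMITER, uris);
    }

    static List<String> split(String joined) {
        // Limit -1 keeps trailing empty strings instead of silently dropping them.
        return Arrays.asList(joined.split(DELIMITER, -1));
    }

    public static void main(String[] args) {
        String joined = join(List.of("http://a.example/x?y=1,2", "ftp://b.example/file.ext"));
        System.out.println(split(joined).size()); // prints 2
    }
}
```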

Can using commas in URLs sometimes break the URL?

Is anyone aware of any problems with using commas in SEO-friendly URLs? I'm working with some software that uses a lot of commas in its SEO-friendly URLs, but I am 100% certain I have seen some instances where some programs/platforms don't recognize the URL correctly and cut the "linking" of the URL off after the first comma.
I just tested this out with Thunderbird, Gmail, Hotmail and on an SMF forum with no problems; however, I know I have seen the issue before.
So my question is: is there anything in particular that would cause some platforms to stop linking URLs with a comma, such as a certain character after the comma?
There will be countless implementations that cut the automatic linking at that point, as they do with many other characters, too. The problem is not the use of these characters, though, but a wrong or incomplete implementation.
See for example this very site, Stack Overflow. It will cut off the link at the * when manually entering/pasting this URL (see bug; in case it gets fixed, here’s a screenshot of it):
http://wayback.archive.org/web/*/http://www.example.com/
But when using the hyperlink syntax, it works fine:
http://wayback.archive.org/web/*/http://www.example.com/
The * character is allowed in an HTTP URL path, so the link detection should have recognized the first URL instead of breaking it at the occurrence of *.
Regarding the comma:
The comma is a reserved character and its meaning is relevant for the URL path (bold emphasis mine):
Aside from dot-segments in hierarchical paths, a path segment is
considered opaque by the generic syntax. URI producing applications
often use the reserved characters allowed in a segment to delimit
scheme-specific or dereference-handler-specific subcomponents. For
example, the semicolon (";") and equals ("=") reserved characters are
often used to delimit parameters and parameter values applicable to
that segment. The comma (",") reserved character is often used for
similar purposes. For example, one URI producer might use a segment
such as "name;v=1.1" to indicate a reference to version 1.1 of
"name", whereas another might use a segment such as "name,1.1" to
indicate the same.
So, if you don't intend to use the comma for the function it has as a reserved character, you may want to percent-encode it as %2C. Users copying such a URL from their browser's address bar would paste it in the encoded form, so it should work almost everywhere.
However, especially because it’s a reserved character, the unencoded form should work, too.
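A small sketch of the percent-encoding option, using only java.net.URI from the standard library. The helper name, example.com and the segment "name,1.1" are placeholders chosen for illustration.

```java
import java.net.URI;

final class CommaEncoding {
    // Percent-encodes every literal comma in a single path segment.
    static String encodeCommas(String segment) {
        return segment.replace(",", "%2C");
    }

    public static void main(String[] args) {
        URI uri = URI.create("http://example.com/" + encodeCommas("name,1.1"));
        System.out.println(uri.getRawPath()); // prints /name%2C1.1
        System.out.println(uri.getPath());    // prints /name,1.1
    }
}
```

getRawPath() shows what is sent on the wire, while getPath() shows the decoded form, so the comma survives a round trip either way.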

Valid URL separators

I have a long URL with several values.
Example 1:
http://www.domain.com/list?seach_type[]=0&search_period[]=1&search_min=3000&search_max=21000&search_area=6855%3B7470%3B7700%3B7730%3B7741%3B7742%3B7752%3B7755%3B7760%3B7770%3B7800%3B7840%3B7850%3B7860%3B7870%3B7884%3B7900%3B7950%3B7960%3B7970%3B7980%3B7990%3B8620%3B8643%3B8800%3B8830%3B8831%3B8832%3B8840%3B8850%3B8860%3B8881%3B9620%3B9631%3B9632
My variable search_area contains only four-digit numbers (for example 4000, 5000), but it can contain a lot of them. Right now I separate these in the URL by using ; as the separator symbol. However, as seen in Example 1, the ; is converted into %3B. This makes me believe that this is a bad symbol to use.
What is the best URL separator?
Moontear, I think you have misread the linked document. That limitation only applies to the "scheme" portion of the URL. In the case of WWW URLs, that is the "http".
The next section of the document goes on to say:
Thus, only alphanumerics, the special characters "$-_.+!*'(),", and
reserved characters used for their reserved purposes may be used
unencoded within a URL.
I'd personally use comma (,). However, plus (+) and dash (-) are both reasonable options as well.
BTW, that document also mentions that semi-colon (;) is reserved in some schemes.
Well, according to RFC1738, valid URLs may only contain the letters a-z, the plus sign (+), period and hyphen (-).
Generally I would go with a plus to separate your search areas. So your URL would become http://www.domain.com/list?seach_type=0&search_period=1&search_min=3000&search_max=21000&search_area=6855+7470+7700+...
--EDIT--
As GinoA pointed out I misread the document. Hence "$-_.+!*'()," are valid characters too. I'd still go with the + sign though.
If there are only numbers to separate, you have a large choice of separators. You can choose any letter for example.
A space is probably a good choice. It will be transformed into a + character in the URL, so it will be more readable than a letter.
Example: search_area=4000+5000+6000
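To illustrate the space-to-plus transformation, here is a short sketch using the standard URLEncoder and URLDecoder, which implement application/x-www-form-urlencoded rules. The class and method names are made up for the example.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class PlusAsSpace {
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    static String decode(String value) {
        return URLDecoder.decode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encode("4000 5000 6000")); // prints 4000+5000+6000
        System.out.println(decode("4000+5000+6000")); // prints 4000 5000 6000
        // A literal plus sign has to be escaped, otherwise it decodes as a space.
        System.out.println(encode("1+1"));            // prints 1%2B1
        // The semicolon from the question is escaped by this encoder as well.
        System.out.println(encode("6855;7470"));      // prints 6855%3B7470
    }
}
```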
I'm very late to the party, but a valid query string can repeat variables so instead of...
http://x.y.z/list?type=0&period=1&min=3000&max=21000&area=6855+7470+7700
...you could also use...
http://x.y.z/list?type=0&period=1&min=3000&max=21000&area=6855&area=7470&area=7700
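On the receiving side, repeated variables can be collected into a list per key. Many server frameworks already do this for you; the following is just a hand-rolled sketch with an invented class name, to show the idea.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class QueryParams {
    // Parses a raw query string into key -> list of values, keeping repeats in order.
    static Map<String, List<String>> parse(String rawQuery) {
        Map<String, List<String>> params = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.computeIfAbsent(decode(key), k -> new ArrayList<>()).add(decode(value));
        }
        return params;
    }

    private static String decode(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        URI uri = URI.create(
            "http://x.y.z/list?type=0&period=1&min=3000&max=21000&area=6855&area=7470&area=7700");
        System.out.println(parse(uri.getRawQuery()).get("area")); // prints [6855, 7470, 7700]
    }
}
```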
"+" is to be interpreted as a space " " when the content-type is application/x-www-form-urlencoded (standard for HTML forms). This may be handled by your server software.
I prefer "!". It doesn't get URL encoded (at least not in Chrome) and it reserves "+" for use as a real space character in the typical case.

What are the valid characters that can show up in a URL host?

I'm writing some code that processes URLs, and I want to make sure I'm not leaving some strange case out...
Are there any valid characters for a host other than: A-Z, 0-9, "-" and "."?
(This includes anything that can be in subdomains, etc. Essentially, anything between :// and the first /)
Thanks!
Please see Restrictions on valid host names:
Hostnames are composed of series of
labels concatenated with dots, as are
all domain names. For example,
"en.wikipedia.org" is a hostname. Each
label must be between 1 and 63
characters long, and the entire
hostname has a maximum of 255
characters.
RFCs mandate that a hostname's labels
may contain only the ASCII letters 'a'
through 'z' (case-insensitive), the
digits '0' through '9', and the
hyphen. Hostname labels cannot begin
or end with a hyphen. No other
symbols, punctuation characters, or
blank spaces are permitted.
No, that is all that is allowed.
Here is a reference if you would like to read it:
http://www.ietf.org/rfc/rfc1034.txt
It depends on the level at which you do the validation (before or after the URL escaping).
If you are validating user input, then it can go way beyond ASCII (with big chunks of Unicode).
See http://en.wikipedia.org/wiki/Internationalized_domain_name
If you try to validate after all the escaping and the "punycode" is done, there is no point in validation, since that is already guaranteed to only contain valid characters by the old RFC.
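The conversion from a Unicode host name to its ASCII ("punycode") form can be sketched with java.net.IDN from the standard library. The wrapper class and the example host are illustrative only.

```java
import java.net.IDN;

final class IdnHosts {
    // Converts an internationalized host name to its ASCII-compatible form.
    static String toAsciiHost(String unicodeHost) {
        return IDN.toASCII(unicodeHost);
    }

    public static void main(String[] args) {
        // "b\u00FCcher" is "bucher" with a u-umlaut, written as an escape to keep the source ASCII.
        System.out.println(toAsciiHost("b\u00FCcher.example")); // prints xn--bcher-kva.example
    }
}
```

After this step the host contains only letters, digits, hyphens and dots, which is why validating user input and validating the converted form are different tasks.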
Keep in mind that besides the hostname rules of the Internet, DNS systems are free to create any names that they like. DNS servers could accept and reply to 8-bit binary requests: the DNS wire protocol does not forbid it.
This means that for internal LAN URLs you may have different rules, such as the underscore appearing in a host name.
A valid URL host consists of ASCII letters, digits, the dot ( . ) and the hyphen ( - ), with a maximum length of 255 characters and dot-separated labels of at most 63 characters each. The hyphen can delimit alphanumeric sequences, e.g. one-two.net, but it cannot appear at the beginning or end of a dot-separated label, e.g. -one.two.com, one.two.com- and one-.two.com are invalid hosts.
See https://www.rfc-editor.org/rfc/rfc1123#page-79 and Assumptions part 1 of https://www.rfc-editor.org/rfc/rfc952
Also, here is a link to an online regex tool for validating a URL host, which worked as of 5/28/2019: https://www.regextester.com/23
Also, when validating a host, per https://www.rfc-editor.org/rfc/rfc1123#page-13 you should check the host syntactically for a dotted-decimal number before looking it up in the DNS.
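The rules above could be checked along these lines. This is a minimal sketch, not a complete validator: the class and method names are invented, the 255 limit is taken from the text above, and a trailing dot or an internationalized name is simply rejected.

```java
import java.util.regex.Pattern;

final class HostNames {
    // One label: 1 to 63 characters, letters/digits/hyphen, no leading or trailing hyphen.
    private static final Pattern LABEL =
        Pattern.compile("[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?");

    // Four dot-separated groups of 1 to 3 digits (syntactic check only, no range check).
    private static final Pattern DOTTED_DECIMAL =
        Pattern.compile("\\d{1,3}(?:\\.\\d{1,3}){3}");

    static boolean isValidHostName(String host) {
        if (host == null || host.isEmpty() || host.length() > 255) {
            return false;
        }
        // Limit -1 keeps empty labels, so "a..b" and "a.b." are rejected.
        for (String label : host.split("\\.", -1)) {
            if (!LABEL.matcher(label).matches()) {
                return false;
            }
        }
        return true;
    }

    static boolean looksLikeDottedDecimal(String host) {
        return DOTTED_DECIMAL.matcher(host).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidHostName("one-two.net"));       // prints true
        System.out.println(isValidHostName("one-.two.com"));      // prints false
        System.out.println(isValidHostName("some_place.com"));    // prints false
        System.out.println(looksLikeDottedDecimal("192.168.0.1")); // prints true
    }
}
```

A caller would typically run looksLikeDottedDecimal first and only treat the input as a DNS name when it returns false.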
If you want to write URL-parsing code that perfectly matches the official W3C spec, see the document at www.w3.org/TR/url-1/ . See section 3 (Hosts) for specific information on hosts in URLs.
