URL grammar rules

I made a pastebin site where each entry gets a random string. For example
example.com/ds34
example.com/sdf-2zA
example.com/234+_2
My question is: what is the grammar rule for these strings?
Can they start with anything? Which characters are and aren't allowed?

See RFC 3986 and the w3.org documentation. In short: any ASCII character is allowed except the reserved characters ! * ' ( ) ; : @ & = + $ , / ? # [ ] and the escape character % itself. Those (and any non-ASCII characters) can still be included by percent-encoding them.
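If you just want identifiers that never need escaping, the simplest rule of thumb is to build them only from the RFC 3986 "unreserved" characters. Here is a minimal Python sketch of that idea (the function name and the length of 7 are arbitrary choices for illustration, not anything the RFC prescribes):

import secrets
import string

# RFC 3986 "unreserved" characters never need percent-encoding in a path segment.
UNRESERVED = string.ascii_letters + string.digits + "-._~"

def random_slug(length: int = 7) -> str:
    """Generate a pastebin-style identifier from characters that never need escaping."""
    return "".join(secrets.choice(UNRESERVED) for _ in range(length))

print(random_slug())  # e.g. 'sdf-2zA'

An identifier like 234+_2 is still legal in a path, but + is decoded as a space by many form/query decoders, so sticking to the unreserved set avoids surprises.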

Premature end of char-class

I have a regular expression for email validation that follows these rules:
The local-part of the e-mail address may use any of these ASCII characters:
Uppercase and lowercase English letters (a-z, A-Z)
Digits 0 to 9
Characters ! # $ % & ' * + - / = ? ^ _ ` { | } ~
Character . (dot, period, full stop) provided that it is not the first or last character, and provided also that it does not appear two or more times consecutively.
/^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/i
It works in JavaScript, but in Ruby (http://rubular.com/) it gives the error "premature end of char-class".
How can I resolve this?
Brackets are part of regex syntax. If you want to match a literal bracket (or any other special symbol, for that matter), escape it with a backslash.
This should work:
/^(([^<>()\[\]\\.,;:\s@\"]+(\.[^<>()\[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/i
You should escape opening square brackets as well as closing ones inside the character class:
# ⇓ ⇓
/^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)…/
This should be:
/^(([^<>()\[\]\\.,;:\s@\"]+(\.[^<>()\[\]\\.,;:\s@\"]+)*)…/
Hope it helps.
irb(main):016:0> /[[e]/
SyntaxError: (irb):16: premature end of char-class: /[[e]/
from /ms/dist/ruby/PROJ/core/2.0.0-p195/bin/irb:12:in `<main>'
In the JavaScript regular expression engine you don't need to escape [ inside a character class []. In a Ruby regular expression, however, you have to write \[:
/^(([^<>()\[\]\\.,;:\s@\"]+(\.[^<>()\[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/i

Is array syntax using square brackets in URL query strings valid?

Is it actually safe/valid to use multidimensional array syntax in a URL query string?
http://example.com?abc[]=123&abc[]=456
It seems to work in every browser and I always thought it was OK to use, but according to a comment on this article it is not: http://www.456bereastreet.com/archive/201008/what_characters_are_allowed_unencoded_in_query_strings/#comment4
I would like to hear a second opinion.
The answer is not simple.
The following is extracted from section 3.2.2 of RFC 3986:
A host identified by an Internet Protocol literal address, version 6
[RFC3513] or later, is distinguished by enclosing the IP literal
within square brackets ("[" and "]"). This is the only place where
square bracket characters are allowed in the URI syntax.
This seems to answer the question by flatly stating that square brackets are not allowed anywhere else in the URI. But there is a difference between a square bracket character and a percent encoded square bracket character.
The following is extracted from the beginning of section 3 of RFC 3986:
Syntax Components
The generic URI syntax consists of a hierarchical sequence of
components referred to as the scheme, authority, path, query, and
fragment.
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
So the "query" is a component of the "URI".
The following is extracted from section 2.2 of RFC 3986:
2.2. Reserved Characters
URIs include components and subcomponents that are delimited by
characters in the "reserved" set. These characters are called
"reserved" because they may (or may not) be defined as delimiters by
the generic syntax, by each scheme-specific syntax, or by the
implementation-specific syntax of a URI's dereferencing algorithm.
If data for a URI component would conflict with a reserved
character's purpose as a delimiter, then the conflicting data must
be percent-encoded before the URI is formed.
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
So square brackets may appear in a query string, but only if they are percent-encoded. Unless they aren't, as explained further down in section 2.2:
URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component. If a reserved character is found in a URI component and
no delimiting role is known for that character, then it must be
interpreted as representing the data octet corresponding to that
character's encoding in US-ASCII.
So because square brackets are only allowed in the "host" subcomponent, they "should" be percent-encoded in other components and subcomponents, in this case the "query" component, unless RFC 3986 explicitly allows unencoded square brackets to represent data in the query component, which it does not.
However, if a "URI producing application" fails to do what it "should" do, by leaving square brackets unencoded in the query, then readers of the URI are not to reject the URI outright. Instead, the square brackets are to be considered as belonging to the data of the query component, since they are not used as delimiters in that component.
This is why, for example, it is not a violation of RFC 3986 when PHP accepts both unencoded and percent encoded square brackets as valid characters in a query string, and even assigns to them a special purpose. However, it would appear that authors who try to take advantage of this loophole by not percent encoding square brackets are in violation of RFC 3986.
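To see that "treat it as data" behavior in practice, here is a small illustration using Python rather than PHP (purely because it is compact; the parser choice is mine, not something the RFC mandates). A typical query-string parser accepts both forms and ends up with the same key:

from urllib.parse import parse_qs

# Unencoded brackets are accepted and kept as part of the key...
print(parse_qs("abc[]=123&abc[]=456"))
# {'abc[]': ['123', '456']}

# ...and percent-encoded brackets decode to exactly the same key.
print(parse_qs("abc%5B%5D=123&abc%5B%5D=456"))
# {'abc[]': ['123', '456']}

PHP goes one step further and turns the abc[] key into an actual array; Python simply leaves the brackets in the key name, which is the "interpret it as data" behavior described above.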
According to RFC 3986, the query component of a URL has the following grammar:
*( pchar / "/" / "?" )
From appendix A of the same RFC:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
[...]
pct-encoded = "%" HEXDIG HEXDIG
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
[...]
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
My interpretation of this is that anything that isn't:
ALPHA / DIGIT / "-" / "." / "_" / "~" /
"!" / "$" / "&" / "'" / "(" / ")" /
"*" / "+" / "," / ";" / "=" / ":" / "@" / "/" / "?"
...should be pct-encoded, i.e. percent-encoded. Thus [ and ] should be percent-encoded to follow RFC 3986.
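One way to apply that interpretation in code is to percent-encode everything outside that character set. A hedged Python sketch (the QUERY_SAFE constant is just the list derived above, not something exported by urllib):

from urllib.parse import quote

# Characters that may appear unencoded in a query component per the grammar above.
QUERY_SAFE = "-._~!$&'()*+,;=:@/?"

def encode_query_component(value: str) -> str:
    """Percent-encode anything outside the RFC 3986 query character set."""
    return quote(value, safe=QUERY_SAFE)

print(encode_query_component("abc[]"))  # abc%5B%5D
print(encode_query_component("a=b&c"))  # a=b&c (= and & are legal in a query, though they act as delimiters)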
David N. Jafferian's answer is fantastic. I just want to add a couple updates and practical notes:
For many years, every browser has left square brackets in query strings unencoded when submitting the request to the server. (Source: https://bugzilla.mozilla.org/show_bug.cgi?id=1152455#c6). As such, I imagine a huge portion of the web has come to rely on this behavior, which makes it extremely unlikely to change.
My reading of the WHATWG URL standard which, at least for web purposes, can be seen as superseding RFC 3986, is that it codifies this behavior of not encoding [ and ] in query strings.
Edit: Based on the comments and other answers, a more correct reading of the WHATWG URL standard is that unencoded [ and ] are invalid, but they should be tolerated when received and parsed, and, once parsed that way, they should even be re-serialized without encoding.
I'd ideally like to comment on Ethan's answer really, but don't have sufficient reputation to do it.
I'm not sure that the relevant part of the WHATWG URL standard is being referenced here. I think the correct part might be the definition of a valid URL-query string, which it describes as being composed of URL units that are themselves formed from URL code points and percent-encoded bytes. Square brackets are not listed among the URL code points and therefore fall into the percent-encoded bytes category.
Thus, in answer to the original question, multidimensional array syntax (i.e. using square brackets to represent array indexing) within the query part of the URL is valid, provided the square brackets are percent encoded (as %5B for [ and %5D for ]).
My understanding is that square brackets are not first-class citizens anyway. Here is the quote:
https://www.rfc-editor.org/rfc/rfc1738
Other characters are unsafe because gateways and other transport
agents are known to sometimes modify such characters. These
characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`".
I was always tempted to go for that sort of query when I had to pass an array, but I steered away from it. The reasons being:
It is not clearly defined in the RFC.
Different languages may interpret it differently.
You have a couple of options to pass an array:
Encode a string representation of the array (JSON, maybe?)
Have parameters like "val1=blah&val2=blah&..." or something like that.
And if you are sure about the language you are using, you can safely go for the kind of query string you have (just remember that you need to percent-encode [ and ] as well). These options are sketched below.
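For what it is worth, here is a rough Python sketch of those approaches (the abc parameter name is made up for the example):

import json
from urllib.parse import urlencode, parse_qs

values = ["123", "456"]

# Option 1: JSON-encode the array into a single parameter.
q1 = urlencode({"abc": json.dumps(values)})     # abc=%5B%22123%22%2C+%22456%22%5D

# Option 2: repeat the parameter name once per value.
q2 = urlencode([("abc", v) for v in values])    # abc=123&abc=456

# Bracket-style keys, with the brackets percent-encoded by urlencode.
q3 = urlencode([("abc[]", v) for v in values])  # abc%5B%5D=123&abc%5B%5D=456

print(parse_qs(q2))  # {'abc': ['123', '456']}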

What character encoding uses 2 underscores and a letter?

I'm currently parsing what looks to be a proprietary file format from a third-party commercial application. They seem to use a funny character encoding system and I need some help determining what it is, assuming it's not a proprietary encoding system as well.
I don't have a whole lot of different characters to analyze from but here is what I have so far:
__b -> blank space
__f -> forward slash
So, for example, "Hello World" becomes "Hello__bWorld".
Does anybody have any idea what this is?
If not, do you know of a resource on the web that can help me? Maybe there is a tool out there that can help in identifying the character encoding?
It seems to be a proprietary encoding used by Numara FootPrints. This list of mappings comes from the FootPrints User Group forum. There is also a Perl script for decoding it.
Code Character
__b (space)
__a ' (single quote)
__q " (double quote)
__t ` (backquote)
__m @ (at-sign)
__d . (period)
__u - (hyphen-minus)
__s ;
__c :
__p )
__P (
__3 #
__4 $
__5 %
__6 ^
__7 &
__8 *
__0 ~ (tilde)
__f / (slash)
__F \ (backslash)
__Q ?
__e ]
__E [
__g >
__G <
__B !
__W {
__w }
__C =
__A +
__I | (vertical line)
__M , (comma)
__Ux_ Unicode character with value 'x'
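If you cannot track down the Perl script, here is a minimal Python sketch of a decoder built from the table above. The handling of the __Ux_ form assumes x is a decimal code point value, which is a guess rather than something the table confirms; the rest is a direct transcription of the mapping.

import re

FOOTPRINTS_MAP = {
    "b": " ", "a": "'", "q": '"', "t": "`", "m": "@", "d": ".",
    "u": "-", "s": ";", "c": ":", "p": ")", "P": "(", "3": "#",
    "4": "$", "5": "%", "6": "^", "7": "&", "8": "*", "0": "~",
    "f": "/", "F": "\\", "Q": "?", "e": "]", "E": "[", "g": ">",
    "G": "<", "B": "!", "W": "{", "w": "}", "C": "=", "A": "+",
    "I": "|", "M": ",",
}

def decode_footprints(text):
    """Replace __x escape codes with the characters from the mapping table."""
    def repl(m):
        if m.group("uni") is not None:
            return chr(int(m.group("uni")))          # assumed decimal code point
        return FOOTPRINTS_MAP.get(m.group("code"), m.group(0))
    return re.sub(r"__(?:U(?P<uni>\d+)_|(?P<code>[A-Za-z0-9]))", repl, text)

print(decode_footprints("Hello__bWorld"))     # Hello World
print(decode_footprints("path__fto__ffile"))  # path/to/file

Note that the table has no code for a literal underscore, so treat this strictly as a decoder for data produced by FootPrints, not as a reversible encoding.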

regex validation - grails constraints

I'm pretty new to Grails and I'm having a problem with regex validation using the matches constraint. What I want is for my field to accept a combination of alphanumeric characters and specific special characters like period (.), comma (,) and dash (-). It may contain numbers only (099) or letters only (alpha), but it should not accept input that consists solely of special characters (".-,"). Is it possible to filter this kind of input using a regex?
Please help. Thank you for sharing your knowledge.
^[0-9a-zA-Z,.-]*?[0-9a-zA-Z]+?[0-9a-zA-Z,.-]*$
meaning:
/
^ beginning of the string
[...]*? 0 or more characters from this class (lazy matching)
[...]+? 1 or more characters from this class (lazy matching)
[...]* 0 or more characters from this class
$ end of the string
/
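The pattern itself is engine-agnostic, so you can sanity-check it outside Grails before wiring it into a constraint. A quick check in Python (the sample inputs are mostly the ones from the question):

import re

# At least one alphanumeric character somewhere; everything else limited to , . -
PATTERN = re.compile(r"^[0-9a-zA-Z,.-]*?[0-9a-zA-Z]+?[0-9a-zA-Z,.-]*$")

for sample in ["alpha", "099", "a.b-c,d", ".-,", ""]:
    print(repr(sample), bool(PATTERN.match(sample)))
# 'alpha' True, '099' True, 'a.b-c,d' True, '.-,' False, '' False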
I think you could match that with a regular expression like this:
".*[0-9a-zA-Z]+.*"
That means:
"." Begin with any character
"*" Have zero or more of these characters
"[0-9a-zA-Z]" Have a character in the range 0-9, a-z or A-Z
"+" Have one or more of this kind of character (so it's mandatory to have at least one alphanumeric character)
"." End with any character
"*" Have zero or more of these characters
This is working ok for me, hope it helps!

Can . (period) be part of the path part of an URL?

Is the following URL valid?
http://www.example.com/module.php/lib/lib.php
According to https://www.rfc-editor.org/rfc/rfc1738, the hpath element of a URL cannot contain a '.' (period). In the above case there is a '.' after "module", which is not allowed according to RFC 1738.
Am I reading the RFC wrong, or has this RFC been superseded by another RFC? Some other RFCs allow '.' in URLs (https://www.rfc-editor.org/rfc/rfc1808).
I don't see where RFC1738 disallows periods (.) in URLs. Here are some excerpts from there:
hpath = hsegment *[ "/" hsegment ]
hsegment = *[ uchar | ";" | ":" | "@" | "&" | "=" ]
uchar = unreserved | escape
unreserved = alpha | digit | safe | extra
safe = "$" | "-" | "_" | "." | "+"
So the answer to your question is: Yes, http://www.example.com/module.php/lib/lib.php is a valid URL.
As others have noted, periods are allowed in URLs, but be careful. If a single or double period appears as its own segment in a URL's path, the browser treats it as a dot-segment and normalizes the path, so you may not get the behavior you want.
For example:
www.example.com/foo/./ resolves to www.example.com/foo/
www.example.com/foo/../ resolves to www.example.com/
Whereas the following are left unchanged:
www.example.com/foo/bar.biz/
www.example.com/foo/..biz/
www.example.com/foo/biz../
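You can reproduce that normalization with any RFC 3986 resolver. For instance, Python's urljoin removes dot-segments while resolving a reference (the base URL below is an arbitrary choice for the example):

from urllib.parse import urljoin

base = "http://www.example.com/index.html"

# Dot-segments are removed during reference resolution:
print(urljoin(base, "foo/./"))        # http://www.example.com/foo/
print(urljoin(base, "foo/../"))       # http://www.example.com/

# Ordinary periods inside a segment are left alone:
print(urljoin(base, "foo/bar.biz/"))  # http://www.example.com/foo/bar.biz/
print(urljoin(base, "foo/..biz/"))    # http://www.example.com/foo/..biz/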
Periods are allowed. See section "2.3 Unreserved Characters" in this document:
https://www.rfc-editor.org/rfc/rfc3986
"Characters that are allowed in a URI but do not have a reserved purpose are called unreserved. These include uppercase and lowercase letters, decimal digits, hyphen, period, underscore, and tilde".
Nothing wrong with a period in a URL. If you look at the grammar in the link you provided, a period is permitted via the 'safe' group, which is included in 'unreserved' and therefore in 'uchar'.
Ignore my answer, Adams is better
