JFlex maximum read length - flex-lexer

Given a positional language like the old IBM RPG, we can have a line such as
CCCCCDIDENTIFIER E S 10
Where characters
1-5: comment
6: specification type
7-21: identifier name
...And so on
Now, given that JFlex is based on RegExp, we would have a RegExp such as:
[a-zA-Z][a-zA-Z0-9]{0,14} {0,14}
for the identifier name token.
This RegExp, however, can match tokens longer than the 15 characters allowed for the identifier name, requiring yypushbacks.
Thus, is there a way to limit how many characters JFlex reads for a particular token?

Regular expression based lexical analysis is really not the right tool to parse fixed-field inputs. You can just split the input into fields at the known character positions, which is way easier and a lot faster. And it doesn't require fussing with regular expressions.
Anyway, [a-zA-Z][a-zA-Z0-9]{0,14}[ ]{0,14} wouldn't be the right expression even if it did properly handle the token length, since the token is really the word at the beginning, without space characters.
In the case of fixed-length fields which contain something more complicated than a single identifier, you might want to feed the field into a lexer, using a StringReader or some such.
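For instance, here's a minimal Java sketch of the position-based split (the column boundaries follow the layout in the question; the Yylex name in the comment is just a placeholder for whatever class JFlex generates for you):

public class RpgFieldSplit {
    public static void main(String[] args) {
        String line = "CCCCCDIDENTIFIER     E S 10";
        String comment    = line.substring(0, 5);          // columns 1-5
        char specType     = line.charAt(5);                // column 6
        String identifier = line.substring(6, 21).trim();  // columns 7-21
        System.out.println(specType + " " + identifier);   // prints: D IDENTIFIER

        // If one field needs real tokenizing, hand just that field to the
        // generated lexer ("Yylex" is a placeholder for your JFlex class):
        // Yylex fieldLexer = new Yylex(new java.io.StringReader(identifier));
    }
}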
Although I'm sure it's not useful, here's a regular expression which matches 15 characters which start with a word and are completed with spaces:
[a-zA-Z][ ]{14} |
[a-zA-Z][a-zA-Z0-9][ ]{13} |
[a-zA-Z][a-zA-Z0-9]{2}[ ]{12} |
[a-zA-Z][a-zA-Z0-9]{3}[ ]{11} |
[a-zA-Z][a-zA-Z0-9]{4}[ ]{10} |
[a-zA-Z][a-zA-Z0-9]{5}[ ]{9} |
[a-zA-Z][a-zA-Z0-9]{6}[ ]{8} |
[a-zA-Z][a-zA-Z0-9]{7}[ ]{7} |
[a-zA-Z][a-zA-Z0-9]{8}[ ]{6} |
[a-zA-Z][a-zA-Z0-9]{9}[ ]{5} |
[a-zA-Z][a-zA-Z0-9]{10}[ ]{4} |
[a-zA-Z][a-zA-Z0-9]{11}[ ]{3} |
[a-zA-Z][a-zA-Z0-9]{12}[ ]{2} |
[a-zA-Z][a-zA-Z0-9]{13}[ ] |
[a-zA-Z][a-zA-Z0-9]{14}
(That might have to be put on one very long line.)

Related

ANTLR4 match any not-matched sections into one single STRING token

I am trying to create a Lexer/Parser with ANTLR that can parse plain text with 'tags' scattered in between.
These tags are denoted by opening ({) and closing (}) brackets, and they represent Java objects that can evaluate to a string, which is then substituted into the original input to create a dynamic template of sorts.
Here is an example:
{player:name} says hi!
The {player:name} should be replaced by the name of the player, resulting in output such as Mark says hi! for a player named Mark.
Now I can recognize and parse the tags just fine; what I have problems with is the text that comes after.
This is the grammar I use:
grammar : content+ ;
content : tag
| literal
;
tag : player_tag
| <...>
| <other kinds of tags, not important for this example>
| <...>
;
player_tag : BRACKET_OPEN player_identifier SEMICOLON player_string_parameter BRACKET_CLOSE ;
player_string_parameter : NAME
| <...>
;
player_identifier : PLAYER ;
literal : NUMBER
| STRING
;
BRACKET_OPEN : '{';
BRACKET_CLOSE : '}';
PLAYER : 'player' ;
NAME : 'name' ;
NUMBER : <...> ;
STRING : (.+)? ; /* <- THIS IS THE PROBLEMATIC PART! */
Now this STRING lexer rule should match anything that is not an empty string, but the problem is that it is too greedy and also consumes the { } bracket tokens needed for the tag rule.
I have tried setting it to ~[{}]+, which is supposed to match anything that does not include the { } brackets, but that screws with the tag parsing, which I don't understand either.
I could set it to something like [ a-zA-Z0-9!"§$%&/()= etc...]+ but I really don't want to restrict it to only the characters available on a British keyboard (German umlauts, French accents, and all the other special characters of other languages must work!)
The only thing that somewhat works, though I really dislike it, is to force strings to have a prefix and a suffix, like so:
STRING : '\'' ~[}{]+ '\'' ;
This forces me to alter the form from "{player:name} says hi!" to "{player:name}' says hi!'", and I really desperately want to avoid such restrictions, because I would then have to account for literal ' characters in the string itself, and it's just ugly to work with.
The two solutions I have in mind are the following:
- Is there any way to match any number of characters that has not been matched by the lexer as a STRING token and pass it to the parser? That way I could match all the tags and say the rest of the input is just plain text, give it back to me as a STRING token or whatever...
- Does ANTLR support lookahead and lookbehind regex expressions with which I could match any number of characters before the first '{', after the last '}', and anything in between '}' and '{'?
I have tried
STRING : (?<=})(.+)?(?={) ;
but I can't seem to get the syntax right because that won't compile at all, which leads me to believe that ANTLR does not support lookahead and lookbehind syntax, but I could not find a definitive answer on the internet to that question.
Any advice on what to do?
Antlr does not support lookahead or lookbehind. It does support non-greedy wildcard matches, but only when the non-greedy wildcard is followed in the rule by the termination sequence (which, as you say, is then also contained in the match, although you could push it back into the input stream).
So ~[{}]* is correct. But there's a little problem: lexer rules are (normally) always active. So that lexer rule will be active inside the braces as well, which means that it will swallow the entire contents between the braces (unless there are nested braces or braces inside quotes or some such, and that's even worse).
So you need to define different lexical contexts, called "lexical modes" in Antlr. There's a publicly viewable example in the Antlr Definitive Reference, which shows a solution to a very similar problem: parsing HTML.
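As a rough sketch of what that could look like for the question's template language (untested; the mode layout and the TEXT rule are mine, the other token names follow the question):

lexer grammar TemplateLexer;

BRACKET_OPEN : '{' -> pushMode(TAG) ;  // entering a tag switches context
TEXT         : ~[{]+ ;                 // everything outside tags is plain text

mode TAG;
BRACKET_CLOSE : '}' -> popMode ;       // back to the plain-text context
PLAYER        : 'player' ;
NAME          : 'name' ;
SEMICOLON     : ':' ;                  // the question's name for ':'

Outside the braces everything lexes as TEXT; once '{' is seen, only the TAG mode's rules apply, so 'player' and ':' are keywords inside a tag but ordinary text outside one.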

whitespace in flex patterns leads to "unrecognized rule"

The flex info manual allows whitespace in regular expressions via the "x" modifier in the (?r-s:pattern) form. It specifically offers a simple example (without whitespace):
(?:foo) same as (foo)
but the following program fails to compile with the error "unrecognized rule":
BAD (?:foo)
%%
{BAD} {}
I cannot find any form of (? that is acceptable as a rule pattern. Is the manual in error, or do I misunderstand?
The example in your question does not seem to reflect the question itself, since it shows neither the use of whitespace nor a x flag. So I'm going to assume that the pattern which is failing for you is something like
BAD (?x:two | lines |
of | words)
%%
{BAD} { }
And, indeed, that will not work. Although you can use extended format in a pattern, you can only use it in a definition if it doesn't contain a newline. The definition terminates at the last non-whitespace character on the definition line.
Anyway, definitions are overused. You could write the above as
%%
(?x:two | lines |
of | words ) { }
That saves anyone reading your code from having to search for a definition.
I do understand that you might want to use a very long pattern in a rule, which is awkward, particularly if you want to use it twice. Regardless of the issue with newlines, this tends to run into problems with Flex's definition length limit (2047 characters). My approach has been to break the very long pattern into a series of definitions, and then define another symbol which concatenates the pieces.
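For example (hypothetical pattern names, just to show the shape):

PART1   [A-Za-z]+
PART2   [0-9]+
PART3   "-"[A-Za-z0-9]+
WHOLE   {PART1}{PART2}{PART3}
%%
{WHOLE}   { /* one rule, built from shorter definitions */ }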
Before v2.6, Flex did not chop whitespace off the end of the definition line, which also leads to mysterious "unrecognized rule" errors. The manual seems to still reflect the v2.5 behaviour:
The definition is taken to begin at the first non-whitespace character following the name and continuing to the end of the line.

How can I verify URLs are correct, and/or extract valid URLs from arbitrary text?

Sometimes I've had a text entry form where I want to disable the "Accept" button until a user has entered a valid URL. Searching here or the web turns up a huge number of regular expressions, but given the complexity of the URL specification (RFC-3986), it's nigh impossible to write your own verification test suite for them. Once my app is in the App Store, how would I ever know how many false negatives I got due to defects in the regular expression?
Other times I've had the need to extract all the valid URLs from a web site or some other text, and want to get an array of them so I can filter it down to, say, just those that point to an image file. Faulty regular expressions are less likely to be a problem in this case, since if I miss an image or two, or get a bogus URL, it's not a major problem. In any case, the better the regular expression, the more correct the returned list of images.
So, how can I with virtual certainty validate a presented string as a proper URL? Also it would be nice to have the means to extract valid URLs out of arbitrary text.
There are a huge number of regular expressions on the web that claim to verify URLs. The problem with most is that, while they may work, they have no credentials - that is, there is no way to prove their correctness one way or the other.
The reference spec on URLs is RFC-3986, and while on a long search for the best regular expression I tripped over Jeff Roberson's regular expression page. What he did was start from the spec, constructing small regular expressions to match the low level parts of the RFC, and gradually building them up into a full expression.
For instance, this is how one gets the full scheme:
# From http://jmrware.com/articles/2009/uri_regexp/URI_regex.html
# Copyright Jeff Roberson
(⌽[A-Za-z][A-Za-z0-9+\-.]*)
# DFH Addition: change ⌽ from "?:" to "" to get capture groups of the various components
The Unicode character after the first "(" gets changed to either "?:", meaning a non-capture group, or "" to turn it into a capture group. Note that this matches a single letter followed by zero or more of the characters contained in the second "[]" group.
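For example, with the placeholder replaced by "" and the fragment anchored (the leading "^" and trailing ":" are my additions, since a scheme is always followed by ":"), a quick Java check might look like:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SchemeCheck {
    public static void main(String[] args) {
        // The scheme component above, as a capture group
        Pattern scheme = Pattern.compile("^([A-Za-z][A-Za-z0-9+\\-.]*):");
        Matcher m = scheme.matcher("svn+ssh://example.com/repo");
        if (m.find()) {
            System.out.println(m.group(1)); // prints: svn+ssh
        }
    }
}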
The full authority is found using this expression:
# RFC-3986 URI component: relative-part
(?: // # ( "//"
(?: (⌽(?:[A-Za-z0-9\-._~!$&'()*+,;=:]|%[0-9A-Fa-f]{2}☯)* ) @)? # userinfo DFH modified to grab the userinfo without '@'
(⌽
\[
(?:
(?:
(?: (?:[0-9A-Fa-f]{1,4}:){6}
| :: (?:[0-9A-Fa-f]{1,4}:){5}
| (?: [0-9A-Fa-f]{1,4})? :: (?:[0-9A-Fa-f]{1,4}:){4}
| (?: (?:[0-9A-Fa-f]{1,4}:){0,1} [0-9A-Fa-f]{1,4})? :: (?:[0-9A-Fa-f]{1,4}:){3}
| (?: (?:[0-9A-Fa-f]{1,4}:){0,2} [0-9A-Fa-f]{1,4})? :: (?:[0-9A-Fa-f]{1,4}:){2}
| (?: (?:[0-9A-Fa-f]{1,4}:){0,3} [0-9A-Fa-f]{1,4})? :: [0-9A-Fa-f]{1,4}:
| (?: (?:[0-9A-Fa-f]{1,4}:){0,4} [0-9A-Fa-f]{1,4})? ::
) (?:
[0-9A-Fa-f]{1,4} : [0-9A-Fa-f]{1,4}
| (?: (?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) \.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
)
| (?: (?:[0-9A-Fa-f]{1,4}:){0,5} [0-9A-Fa-f]{1,4})? :: [0-9A-Fa-f]{1,4}
| (?: (?:[0-9A-Fa-f]{1,4}:){0,6} [0-9A-Fa-f]{1,4})? ::
)
| [Vv][0-9A-Fa-f]+\.[A-Za-z0-9\-._~!$&'()*+,;=:]+
)
\]
| (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)
|
(?:[A-Za-z0-9\-._~!$&'()*+,;=]|%[0-9A-Fa-f]{2}☯)*
)
(?: : (⌽[0-9]*) )? # DFH addition to grab just the port
(⌽ # DFH addition to get one capture group
(⌽ / (?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2}☯)* )* # path-abempty
| / # / path-absolute
(⌽: (?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2}☯)+
(?:/ (?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2}☯)* )*
)?
| (⌽ (?:[A-Za-z0-9\-._~!$&'()*+,;=@] |%[0-9A-Fa-f]{2}☯)+ # / path-noscheme
(?:/ (?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2}☯)* )*
) # DFH Wrapper
| # / path-empty
(⌽) # DFH addition so constant number of capture groups
)
) # )
# DFH Addition: change ☯ to "|[\u0080-\U0010ffff]" to get inline Unicode detection (making this an IRI, not a URI, but you can later hex encode it), or "" for standard behavior
# DFH Addition: change ⌽ from "?:" to "" to get capture groups of the various components
If you read the above, you can see that this expression can be extended to find Unicode characters by the addition of "|[\u0080-\U0010ffff]" in just a few places.
Because he actually started with the RFC, and all portions of his expression fully reference the ABNF specification, I have great confidence in them.
However, when I started testing, I found that the verifier passed "http://" as a valid URL! It turns out that the spec allows virtually every component to be an empty string. That's sort of hard to use for a UI form verifier.
So I took his expressions, and made some small additions. First, I found I could change the path specifier from a '*' to a '?', so that in form entry, a user would be forced to type at least one '/' after 'http://'. This makes the validator more stringent than it needs to be, but a more realistic one.
Jeff's regular expressions just use non-capturing groups, so I looked at ways to support capturing groups, so all components of a URL could be extracted if need be.
Also, think of non-USA users, who often need to enter non-ASCII characters into a URL - they want to enter an accented character - but the normal validator would reject a Unicode character. It would be nice to validate a string containing Unicode characters, then convert the Unicode to '%' encoded hex before actually using it. This requires extending the expressions to accept Unicode characters by adding |[\\u0080-\\U0010ffff] to the sections accepting ASCII.
The whole problem begged for a test harness that could construct one or more regular expressions with the options a given app might need, and that could test those against various test strings; thus was born URLFinderAndVerifier.
The test harness uses extended expression strings taken from Jeff's page, with all their spaces and comments intact, and with additional comments made by me. Those make the expressions much easier to read and understand. The test app reads the text files and removes all comments and spaces, preprocesses them based on the options selected in the UI, and then sets those for use or for pasting (so you can use them in your app). The test app also lets you use it in an interactive mode, where it validates as you modify the input text.
Options:
look for http/https, http/https/ftp, or any scheme
for form entry, require a "/" after "scheme://"; it makes the toggling of an "Accept" button more realistic (this also requires at least one character after the query's "?" and the fragment's "#")
enable capture groups, so for each URL you can extract the scheme, userinfo, host, port, path, and optionally the query and/or the fragment
in extract mode, include or exclude the query and/or fragment
USAGE:
clone the project, and determine what regular expression you want, then paste it into the results window and use it in your app (suitable for a text file or NSString in code)
copy the URLFinder interface and implementation files into your project
instantiate a URLFinder and supply it the regular expression from the first step.
Surely the easiest way to validate a URL is by constructing an NSURL object.
NSURL *url = [NSURL URLWithString:urlString];
According to the documentation:
Must be a URL that conforms to RFC 2396.
If the string was malformed, returns nil.
Ultimately you'll likely want to convert the URL into an NSURL object anyway, so it's probably in the best position to decide whether your string is valid or not.
Then, to find URLs in a block of text, you can perform a very simple regex search, just looking for potential candidates. For example, something like this:
[^\s]+://[^\s]+
Then use the NSURL construction technique described above to validate whether those candidates are genuine matches or not.
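In Java, the same two-step approach might look like this sketch; java.net.URI plays the role NSURL plays above (note that URI is more lenient than a strict RFC 3986 validator, so treat it as an approximation):

import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlExtractor {
    // Rough candidate pattern from above; real validation happens in URI
    private static final Pattern CANDIDATE = Pattern.compile("[^\\s]+://[^\\s]+");

    public static List<URI> extract(String text) {
        List<URI> urls = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            try {
                urls.add(new URI(m.group())); // throws if malformed
            } catch (URISyntaxException ignored) {
                // candidate wasn't a syntactically valid URI; skip it
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        // keeps https://example.com/a%20b, drops ht!tp://bad (invalid scheme)
        System.out.println(extract("see https://example.com/a%20b and ht!tp://bad"));
    }
}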

URL encoding the space character: + or %20?

When is a space in a URL encoded to +, and when is it encoded to %20?
From Wikipedia (emphasis and link added):
When data that has been entered into HTML forms is submitted, the form field names and values are encoded and sent to the server in an HTTP request message using method GET or POST, or, historically, via email. The encoding used by default is based on a very early version of the general URI percent-encoding rules, with a number of modifications such as newline normalization and replacing spaces with "+" instead of "%20". The MIME type of data encoded this way is application/x-www-form-urlencoded, and it is currently defined (still in a very outdated manner) in the HTML and XForms specifications.
So, real percent-encoding uses %20, while form data in URLs is in a modified form that uses +. You're most likely to see + in URLs only in the query string, after a ?.
This confusion is because URLs are still 'broken' to this day.
From a blog post:
Take "http://www.google.com" for instance. This is a URL. A URL is a Uniform Resource Locator and is really a pointer to a web page (in most cases). URLs actually have a very well-defined structure since the first specification in 1994.
We can extract detailed information about the "http://www.google.com" URL:
+--------+----------------+
| Part   | Data           |
+--------+----------------+
| Scheme | http           |
| Host   | www.google.com |
+--------+----------------+
If we look at a more complex URL such as:
"https://bob:bobby#www.lunatech.com:8080/file;p=1?q=2#third"
we can extract the following information:
+----------------+------------------+
| Part           | Data             |
+----------------+------------------+
| Scheme         | https            |
| User           | bob              |
| Password       | bobby            |
| Host           | www.lunatech.com |
| Port           | 8080             |
| Path           | /file;p=1        |
| Path parameter | p=1              |
| Query          | q=2              |
| Fragment       | third            |
+----------------+------------------+
https://bob:bobby@www.lunatech.com:8080/file;p=1?q=2#third
\___/   \_/ \___/ \______________/ \__/\_______/ \_/ \___/
  |      |    |          |          |      | \_/  |    |
Scheme User Password   Host        Port  Path |   | Fragment
        \_____________________________/       | Query
                       |               Path parameter
                   Authority
The reserved characters are different for each part.
For HTTP URLs, a space in a path fragment part has to be encoded to "%20" (not, absolutely not "+"), while the "+" character in the path fragment part can be left unencoded.
Now in the query part, spaces may be encoded to either "+" (for backwards compatibility: do not try to search for it in the URI standard) or "%20" while the "+" character (as a result of this ambiguity) has to be escaped to "%2B".
This means that the "blue+light blue" string has to be encoded differently in the path and query parts:
"http://example.com/blue+light%20blue?blue%2Blight+blue".
From there you can deduce that encoding a fully constructed URL is impossible without a syntactical awareness of the URL structure.
This boils down to:
You should have %20 before the ? and + after.
Source
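For illustration, here's how the "blue+light blue" example plays out with Java's standard library (a sketch, not a full encoder: URLEncoder does form-style encoding for the query, and the multi-argument java.net.URI constructor percent-encodes the path):

import java.net.URI;
import java.net.URLEncoder;

public class SpaceEncoding {
    public static void main(String[] args) throws Exception {
        String s = "blue+light blue";

        // Query part: form encoding, space -> '+', literal '+' -> "%2B"
        System.out.println(URLEncoder.encode(s, "UTF-8"));
        // prints: blue%2Blight+blue

        // Path part: space -> "%20", the literal '+' is left alone
        URI uri = new URI("http", "example.com", "/" + s, null);
        System.out.println(uri.toASCIIString());
        // prints: http://example.com/blue+light%20blue
    }
}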
I would recommend %20.
Are you hard-coding them?
This is not very consistent across languages, though.
If I'm not mistaken, in PHP urlencode() treats spaces as + whereas Python's urlencode() treats them as %20.
EDIT:
It seems I'm mistaken. Python's urlencode() (at least in 2.7.2) uses quote_plus() instead of quote() and thus encodes spaces as "+".
It also seems that the W3C recommendation is "+", per http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1
And in fact, you can follow this interesting debate on Python's own issue tracker about what to use to encode spaces: http://bugs.python.org/issue13866.
EDIT #2:
I understand that the most common way of encoding " " is as "+", but just a note (it may be just me), I find this a bit confusing:
>>> import urllib
>>> urllib.urlencode({' ': '+ '})
'+=%2B+'
A space may only be encoded to "+" in the "application/x-www-form-urlencoded" key-value pairs in the query part of a URL. In my opinion, this is a MAY, not a MUST. In the rest of the URL, it is encoded as %20.
In my opinion, it's better to always encode spaces as %20, not as "+", even in the query part of a URL, because it is the HTML specification (RFC 1866) that specified that space characters should be encoded as "+" in "application/x-www-form-urlencoded" key-value pairs (see paragraph 8.2.1, subparagraph 1).
This way of encoding form data is also given in later HTML specifications. For example, look for the relevant paragraphs about application/x-www-form-urlencoded in the HTML 4.01 Specification, and so on.
Here is a sample string in a URL where the HTML specification allows encoding spaces as pluses: "http://example.com/over/there?name=foo+bar". So, only after the "?" can spaces be replaced by pluses. In other cases, spaces should be encoded to %20. But since it's hard to determine the context correctly, it is best practice never to encode spaces as "+".
I would recommend percent-encoding all characters except the "unreserved" set defined in RFC 3986, section 2.3:
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
The implementation depends on the programming language that you chose.
If your URL contains national characters, first encode them to UTF-8 and then percent-encode the result.
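A minimal Java sketch of that recommendation (UTF-8 first, then percent-encode every byte outside the unreserved set above):

import java.nio.charset.StandardCharsets;

public class PercentEncoder {
    // Keep only RFC 3986 "unreserved" characters; encode everything else
    static String encode(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean unreserved =
                (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                (c >= '0' && c <= '9') || c == '-' || c == '.' ||
                c == '_' || c == '~';
            if (unreserved) out.append((char) c);
            else out.append('%').append(String.format("%02X", c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("münchen café"));
        // prints: m%C3%BCnchen%20caf%C3%A9
    }
}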
To summarize the (somewhat conflicting) answers here, I think it can be boiled down to:
| standard     | +   | %20 |
|--------------+-----+-----|
| URL          | no  | yes |
| query string | yes | yes |
| form params  | yes | no  |
| mailto query | no  | yes |
So historically I think what happened is:
The RFC specified a pretty clear standard for the form of URLs and how they are encoded. In this context the query is just a "string"; there is no specification of how key/value pairs should be encoded.
The HTTP guys put out a standard for how key/value pairs are to be encoded in form params, and borrowed from the URL encoding standard, except that spaces should be encoded as +.
The web guys said: cool, we have a way to encode key/value pairs, so let's put that into the URL query string.
Result: we end up with two different ways to encode spaces in a URL, depending on which part you're talking about. But it doesn't even violate the URL standard: from the URL perspective the "query" is just a black box. If you want to use other encodings there besides percent-encoding, knock yourself out.
But as the email example shows, it can be problematic to borrow from the form-params implementation for a URL query string. So ultimately using %20 is safer, but there might not be out-of-the-box library support for it.

In a URL, should spaces be encoded using %20 or +? [duplicate]

This question already has answers here:
URL encoding the space character: + or %20?
(5 answers)
Closed 6 years ago.
In a URL, should I encode the spaces using %20 or +? For example, in the following example, which one is correct?
www.mydomain.com?type=xbox%20360
www.mydomain.com?type=xbox+360
Our company is leaning to the former, but using the Java method URLEncoder.encode(String, String) with "xbox 360" (and "UTF-8") returns the latter.
So, what's the difference?
Form data (for GET or POST) is usually encoded as application/x-www-form-urlencoded: this specifies + for spaces.
URLs are encoded as RFC 1738 which specifies %20.
In theory I think you should have %20 before the ? and + after:
example.com/foo%20bar?foo+bar
According to the W3C (and they are the official source on these things), a space character in the query string (and in the query string only) may be encoded as either "%20" or "+". From the section "Query strings" under "Recommendations":
Within the query string, the plus sign is reserved as shorthand notation for a space. Therefore, real plus signs must be encoded. This method was used to make query URIs easier to pass in systems which did not allow spaces.
According to section 3.4 of RFC2396 which is the official specification on URIs in general, the "query" component is URL-dependent:
3.4. Query Component
The query component is a string of information to be interpreted by
the resource.
query = *uric
Within a query component, the characters ";", "/", "?", ":", "#",
"&", "=", "+", ",", and "$" are reserved.
It is therefore a bug in the other software if it does not accept URLs with spaces in the query string encoded as "+" characters.
As for the third part of your question, one way (though slightly ugly) to fix the output from URLEncoder.encode() is to then call replaceAll("\\+","%20") on the return value.
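For example (a sketch using the standard URLEncoder and the question's "xbox 360"):

import java.net.URLEncoder;

public class EncodeDemo {
    public static void main(String[] args) throws Exception {
        String encoded = URLEncoder.encode("xbox 360", "UTF-8");
        System.out.println(encoded);                          // xbox+360
        System.out.println(encoded.replaceAll("\\+", "%20")); // xbox%20360
    }
}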
It shouldn't matter, any more than if you encoded the letter A as %41.
However, if you're dealing with a system that doesn't recognize one form, it seems like you're just going to have to give it what it expects regardless of what the "spec" says.
You can use either; most people opt for "+" as it's more human-readable.
When encoding query values, either form, plus or percent-20, is valid; however, since the bandwidth of the internet isn't infinite, you should use plus: it's two fewer bytes.
