Understanding 'strictness' in regex grammar - parsing

In writing a grammar/parser for a regex, I'm wondering why the following constructions are both syntactically and semantically valid in the regex syntax (at least as far as I can understand it)?
Repetitions of a character class, such as:
Repetitions of a zero-width assertion, such as:
Assertions at a bogus position, such as:
To me, this sort of seems like having a syntax that would allow having a construction like SELECT SELECT SELECT SELECT SELECT col FROM tbl. In other words, why isn't the regex syntax defined as more strict than it is in practice?

To start with, that's not a very good analogy. Your statement with multiple SELECT keywords is not, as far as I know, part of the SQL grammar, so it's simply ungrammatical. Repeated elements in a character class are more like the SQL construct:
SELECT * FROM table WHERE Value IN (1, 2, 3, 2, 3, 2, 3)
I think most (if not all) SQL processors would allow that. You could argue that it would be nice if a warning message were issued, but the usual SQL interface (where a query is sent from client to server and a result returned) does not leave space for a warning.
It's certainly the case that repeated characters in a character class are often an indication that the regular expression was written by a novice with only a fuzzy idea of what a character class is. If you hang out in SO's flex-lexer long enough, you'll see how often students write regular expressions like [a-z|A-Z|0-9], or even [begin|end]. If flex detected duplicate characters in character classes, those mistakes would receive warnings, which might or might not be useful to the student coder. (Reading and understanding warning messages is not, apparently, an innate skill.) But it needs to be asked, Who is the target audience for a tool like Flex, and I think the answer is not "impatient beginners who won't read documentation". Certainly, a non-novice programmer might also make that kind of mistake, usually as a result of a typo, but it's not common and it will probably easily be detected during debugging.
If you've already started to work on a parser for regular expressions, you should already know why these rules are not usually a feature of regular expression libraries. Every rule you place on a syntax must be:
precisely defined,
documented,
implemented,
acted upon appropriately (which may mean complicating the interface in order to allow for warnings).
and all of that needs to be thoroughly tested.
That's a lot of work to prevent something which is not even technically an error. And catching those errors (if they are errors) will probably make the "grammar" for the regular expressions context-sensitive. (In other words, you can't write a context-free grammar which forbids duplication inside a character class.)
Moreover, practically all of those expressions might show up in the wild, typically in the case that the regular expression was programmatically generated. For example, suppose you have a set of words, and you want to write a regular expression which will match any sequence of characters not in any of the words. The simple solution is [^firstsecondthirdwordandtherest]+. Of course, you could have gone to the trouble of deduping the individual letters in the words, -- a simple task in some programming languages, but more complicated in others -- but should it be necessary?
With respect to your other two examples, with repeated and non-terminal $, there are actual regex libraries in which those interpreted in some different way. Many older regex libraries (including (f)lex) only treat $ as a zero-length end-of-string assertion if it appears at the very end of the regex; such libraries would treat all of those $s other than the last one in a$$$$$$$$ as matching a literal $. Others treat $ as an end-of-line assertion, rather than an end-of-string assertion. But I don't know of any which treat those as errors.
There's a pretty good argument that a didactic tool, such as regex101, might usefully issue warnings for doubtful regular expressions, a task usually called "linting". But for a production regex library, there is little to be gained, and a lot of pain.

Related

Is Pug context free?

I was thinking to make a Pug parser but besides the indents are well-known to be context-sensitive (that can be trivially hacked with a lexer feedback loop to make it almost context-free which is adopted by Python), what otherwise makes it not context-free?
XML tags are definitely not context-free, that each starting tag needs to match an end tag, but Pug does not have such restriction, that makes me wonder if we could just parse each starting identifier as a production for a tag root.
The main thing that Pug seems to be missing, at least from a casual scan of its website, is a formal description of its syntax. Or even an informal description. Perhaps I wasn't looking in right places.
Still, based on the examples, it doesn't look awful. There will be some challenges; in particular, it does not have a uniform tokenisation context, so the scanner is going to be complicated, not just because of the indentation issue. (I got the impression from the section on whitespace that the indentation rule is much stricter than Python's, but I didn't find a specification of what it is exactly. It appeared to me that leading whitespace after the two-character indent is significant whitespace. But that doesn't complicate things much; it might even simplify the task.)
What will prove interesting is handling embedded JavaScript. You will at least need to tokenise the embedded JS, and the corner cases in the JS spec make it non-trivial to tokenise without parsing. Anyway, just tokenising isn't sufficient to know where the embedded code terminates. (For the lexical challenge, consider the correct identification of regular expression literals. /= might be the start of a regex or it might be a divide-and-assign operator; how a subsequent { is tokenised will depend on that decision.) Template strings present another challenge (recursive embedding). However, JavaScript parsers do exist, so you might be able to leverage one.
In other words, recognising tag nesting is not going to be the most challenging part of your project. Once you've identified that a given token is a tag, the nesting part is trivial (and context-free) because it is precisely defined by the indentation, so a DEDENT token will terminate the tag.
However, it is worth noting that tag parsing is not particularly challenging for XML (or XML-like HTML variants). If you adopt the XML rule that close tags cannot be omitted (except for self-closing tags), then the tagname in a close tag does not influence the parse of a correct input. (If the tagname in the close tag does not match the close tag in the corresponding open tag, then the input is invalid. But the correspondence between open and close tags doesn't change.) Even if you adopt the HTML-5 rule that close tags cannot be omitted except in the case of a finite list of special-case tagnames, then you could theoretically do the parse with a CFG. (However, the various error recovery rules in HTML-5 are far from context free, so that would only work for input which did not require rematching of close tags.)
Ira Baxter makes precisely this point in the cross-linked post he references in a comment: you can often implement context-sensitive aspects of a language by ignoring them during the parse and detecting them in a subsequent analysis, or even in a semantic predicate during the parse. Correct matching of open- and close tagnames would fall into this category, as would the "declare-before-use" rule in languages where the declaration of an identifier does not influence the parse. (Not true of C or C++, but true in many other languages.)
Even if these aspects cannot be ignored -- as with C typedefs, for example -- the simplest solution might be to use an ambiguous CFG and a parsing technology which produces all possible parses. After the parse forest is generated, you could walk the alternatives and reject the ones which are inconsistent. (In the case of C, that would include an alternative parse in which a name was typedef'd and then used in a context where a typename is not valid.)

How can I determine if two mathematical expressions are 'sort of' the same?

Sorry about the vagueness of this question, and maybe I want a different exchange? Not sure. Anyway, here goes:
I'd like to be able to determine if two mathematical expressions are 'the same'. I don't need full equivalence testing, which I understand is impossible anyway. Basically, I'd like to make a function that looks like this:
areTheSame(expression1, expression2, [testing methods])
where [testing methods] might include: 'exact', 'allow commutativity', 'allow distribution', ...
'exact' would be easy: expression1 == expression2 if their strings
are exactly equal
'allow commutativity' would be more difficult. For instance, if
expression1 is y=3*x and expression2 is y=x*3, then under 'allow
commutativity', they would be the same. Ditto here for 'y=x' and 'x=y'.
'allow distribution' would allow 'y=2*(x-3)' to be equal to
'y=2*x-6'.
others?
Ideally, I'd like to be lazy! I'd love to find some library that
supports parsing expressions from latex representations (or MathML, which is xml already, maybe it's easier to parse)
supports equivalence testing with some flags or something that
govern how exact the comparison should be, as above.
is written in c or c++ (or Objective-C - this is for an iOS project)
is not GPL.
4 rules out SymbolicC++ and GiNaC afaik. Mathomatic is LGPL, which I'm not sure about in the context of apple's app-Store (and I would really not like to have to give out object files)
Any ideas? Thanks!
This may not be an answer, but it's too long to be a comment. :)
My first thought: this is not an easy or common problem, so you probably won't have much luck finding a library to do it for you. Finding a library that does some of the non-critical parts, on the other hand, shouldn't be difficult.
What you probably want to do is parse the expressions to create an abstract syntax tree (you'll probably find lots of libraries for that), then recursively analyze the AST yourself to test whatever definition of sameness you're after.
In iOS, a decent place to start might be (horribly ab)using NSExpression and NSPredicate. These have constructor methods that parse a string and return a structure of expression and predicate objects.
Recursively walk that structure. For each predicate, check to see if the predicateOperatorType matches... if it doesn't, the predicates aren't the same. If it does, look at the left predicate's leftExpression and the right predicate's leftExpression. Each expression has a function that tells you what its operator is (add, subtract, etc). If they don't match, the expressions aren't the same. (Do the same check for the other side.) If they do, recurse: look at each expression's sub-expressions and do a similar check, and so on until you get to expressions that are constant values or variables.
That's a rough sketch of how to see if two predicates (and the expressions they contain) "match". For "sort of the same", just relax each check you perform while recursively walking the tree and/or add more checks; e.g. if you get to an expression whose function is add, check for commutativity by comparing its sub-expressions to the corresponding ones in the other predicate in either order. (Also, there are probably other libraries that'll parse basic math expressions and get you an AST you can walk however you like.)
That still won't get you everything you're after — "allow distribution" gets you into the realm of full-fledged CAS software. Maybe look into whether the likes of Wolfram Alpha have web service APIs?

Lexical Analysis of a Scripting Language

I am trying to create a simple script for a resource API. I have a resource API mainly creates game resources in a structured manner. What I want is dealing with this API without creating c++ programs each time I want a resource. So we (me and my instructor from uni) decided to create a simple script to create/edit resource files without compiling every time. There are also some other irrelevant factors that I need a command line interface rather than a GUI program.
Anyway, here is script sample:
<path>.<command> -<options>
/Graphics[3].add "blabla.png"
I didn't design this script language, the owner of API did. The part before '.' as you can guess is the path and part after '.' is actual command and some options, flags etc. As a first step, I tried to create grammar of left part because I thought I could use it while searching info about lexical analyzers and parser. The problem is I am inexperienced when it comes to parsing and programming languages and I am not sure if it's correct or not. Here is some more examples and grammar of left side.
dir -> '/' | '/' path
path -> object '/' path | object
object -> number | string '[' number ']'
Notation if this grammar can be a mess, I don't know. There is 5 different possibilities, they are:
String
"String"
Number
String[Number]
"String"[Number]
It has to start with '/' symbol and if it's the only symbol, I will accept it as Root.
Now my problem is how can I lexically analyze this script? Is there a special method? What should my lexical analyzer do and do not(I read some lexical analysers also do syntactic analysis up to a point). Do you think grammar, etc. is technically appropriate? What kind of parsing method I should use(Recursive Descent, LL etc.)? I am trying to make it technically appropriate piece of work. It's not commercial so I have time thus I can learn lexical analysis and parsing better. I don't want to use a parser library.
What should my lexical analyzer do and not do?
It should:
recognize tokens
ignore ignorable whitespace and comments (if there are such things)
optionally, keep track of source location in order to produce meaningful error messages.
It should not attempt to parse the input, although that will be very tempting with such a simple language.
From what I can see, you have the following tokens:
punctuation: /, ., linear-white-space, new-line
numbers
unquoted strings (often called "atoms" or "ids")
quoted strings (possibly the same token type as unquoted strings)
I'm not sure what the syntax for -options is, but that might include more possibilities.
Choosing to return linear-white-space (that is, a sequence consisting only of tabs and spaces) as a token is somewhat questionable; it complicates the grammar considerably, particularly since there are probably places where white-space is ignorable, such as the beginning and end of a line. But I have the intuition that you do not want to allow whitespace inside of a path and that you plan to require it between the command name and its arguments. That is, you want to prohibit:
/left /right[3] .whimper "hello, world"
/left/right[3].whimper"hello, world"
But maybe I'm wrong. Maybe you're happy to accept both. That would be simpler, because if you accept both, then you can just ignore linear-whitespace altogether.
By the way, experience has shown that using new-line to separate commands can be awkward; sooner or later you will need to break a command into two lines in order to avoid having to buy an extra monitor to see the entire line. The convention (used by bash and the C preprocessor, amongst others) of putting a \ as the last character on a line to be continued is possible, but can lead to annoying bugs (like having an invisible space following the \ and thus preventing it from really continuing the line).
From here down is 100% personal opinion, offered for free. So take it for what its worth.
I am trying to make it technically appropriate piece of work. It's not commercial so I have time thus I can learn lexical analysis and parsing better. I don't want to use a parser library.
There is a contradiction here, in my opinion. Or perhaps two contradictions.
A technically appropriate piece of work would use standard tools; at least a lexical generator and probably a parser generator. It would do that because, properly used, the lexical and grammatical descriptions provided to the tools document precisely the actual language, and the tools guarantee that the desired language is what is actually recognized. Writing ad hoc code, even simple lexical recognizers and recursive descent parsers, for all that it can be elegant, is less self-documenting, less maintainable, and provides fewer guarantees of correctness. Consequently, best practice is "use standard tools".
Secondly, I disagree with your instructor (if I understand their proposal correctly, based on your comments) that writing ad hoc lexers and parsers aids in understanding lexical and parsing theory. In fact, it may be counterproductive. Bottom-up parsing, which is incredibly elegant both theoretically and practically, is almost impossible to write by hand and totally impossible to read. Consequently, many programmers prefer to use recursive-descent or Pratt parsers, because they understand the code. However, such parsers are not as powerful as a bottom-up parser (particularly GLR or Earley parsers, which are fully general), and their use leads to unnecessary grammatical compromises.
You don't need to write a regular expression library to understand regular expressions. The libraries abstract away the awkward implementation details (and there are lots of them, and they really are awkward) and let you concentrate on the essence of creating and using regular expressions.
In the same way, you do not need to write a compiler in order to understand how to program in C. After you have a good basis in C, you can improve your understanding (maybe) by understanding how it translates into machine code, but unless you plan a career in compiler writing, knowing the details of obscure optimization algorithms are not going to make you a better programmer. Or, at least, they're not first on your agenda.
Similarly, once you really understand regular expressions, you might find writing a library interesting. Or not -- you might find it incredibly frustrating and give up after a couple of months of hard work. Either way, you will appreciate existing libraries more. But learn to use the existing libraries first.
And the same with parser generators. If you want to learn how to translate an idea for a programming language into something precise and implementable, learn how to use a parser generator. Only after you have mastered the theory of parsing should you even think of focusing on low-level implementations.

Is parsing a language with an ending delimiter (e.g. ';') more efficient than having none?

I am wondering about the effect of having an ending delimiter about performance (if it's a scripting language) and ease to parse languages.
Is it easier to parse a language that has one?
If it is and the language is a scripting language, does it make the language run commands faster?
Any difference is going to be trivial.
This is because a language that doesn't use such end delimiters, do infact have them in the form of newline characters!! These will then use special characters to mark that teh statement continues on to the next line.
A few do neither, but there, the end of the statement is implicit in the definition - eg. the close ')' at the end of a parameter list. If there's a comma (or nothing) then more parameters or a closed ')' will be expected on the next line.
In the big scheme of things, these small differences have a negligible effect on processing time. For script languages, you're going to have much bigger differences according to whether (or how much) a particular statement construct can be internally optimized.
Performance-wise It doesn't matter.
If you think about it, there's always a delimiter, if it's not the ; then it's EndOfLine (\n)
So you're still parsing the code script the same way. The parser won't mind if it's a delimiter that's a visible character or invisible one.
The only thing that a visible delimiter is good for, is enabling the programmer to write multi-line script lines, which can be useful. Multi-lined script lines enable the programmer to write more coherent code in some cases. (Just parse the EndOfLine as a white space)
All languages I'm aware of that don't require an explicit typed-out statement delimiter or terminator use line breaks instead - in a few cases, with the exception that line breaks inside parens, braces, brackets, etc. are ignored (implicit line continuation). A small difference, and its effect on parsing speed will heavily depend on the parser (or parser generator). It may be slightly harder to write a parser for if you allow implicit line continuations (because you have to distinguish between newlines and statement seperators - but this can be sorted out relatively early in the parsing process unless you invent incredibly complex rules for it).
But: Even if there was a huge difference, it wouldn't affect runtime performance. Unless for programs where it doesn't matter anyway since they're rather short and/or I/O-bound (or especially stupid implementations that refuse to build any IR and instead interpret from source as they go, using a poorly-written parser), parsing takes place at most once at startup (today's "scripting" languages mostly compile to bytecode and mostly cache that bytecode between runs) and is then shadowed by the time the program spends doing its actual pass. Don't make tradeoffs at language design for speed; or if so, do it properly as e.g. C.
A more interesting issue is explicit block delimiters vs. offside rule. The tokenizer can already solve this, but most parsing generators/libraries don't (easily) give you such a parser. A few do, and in that case the difference is negible, but it's not as widely supported as "free-form" langauges.

Good grammar for date data type for recursive descent parser LL(1)

I'm building a custom expression parser and evaluator for production environment to provide a limited DSL to the users. The parser itself as the DSL, need to be simple. The parser is going to be built in an exotic language that doesn't support dynamic expression parsing nor has any parser generator tools available.
My current decision is to go for recursive descent approach with LL(1) grammar, so that even programmers with no previous experience in evaluating expression could quickly learn how the code works.
It has to handle mixed expressions made up of several data types: decimals, percentages, strings and dates. And dates in the format of dd/mm/yyyy are easy to confuse with a string of division ops.
Is where a good solution to this problem?
My own solution that is aimed at keeping the parser simple and involves prefixing dates with a special symbol, let's say apostrophe:
<date> ::= <apostr><digit><digit>/<digit><digit>/<digit><digit><digit><digit>
<apostr> ::= '
<digit> ::= '0'..'9'
First off, I'm a fan of LL parsers, so I approve of your approach heartily. Note that one of the newer popular parser generators (ANTLR) is LL. If you allow more look-ahead, rather that restricting yourself to LL(1), you can do pretty much anything you'd ever want to do with an LR(1) parser, but the code will be far clearer, more reliable, and easier to debug.
I don't know enough about your overall grammar to be able to tell. It is possible you might be able to design things so that the LL parser can always tell from context if it is an integer expression or a date constant. However, assuming you can't, yeah you'd need some kind of way to tell the difference. The only other thing I can think of would be to use backslash as a separator instead of slash, but that's kinda ugly.
An LL-like lexerless parser with an infinite lookahead is what you need. And, namely, it is PEG.
http://en.wikipedia.org/wiki/Parsing_expression_grammar
With an ordered choice it is quite easy to avoid this date vs. constant literals division confusion.
When a language is intended for human input, defining it is as much a matter of
adding syntax constraints to ensure unambiguous and easy parsing
removing/bending syntax to ensure that the language feels intuitive, "natural", to the intended human audience.
Satisfying the second requirement is much harder than the first one, and requires insight into the
intended use cases of the language
Which type of keyboard/input device is available? Are there some characters among the allowed characters which are hard to produce or to see on the display?
Which tokens / expression will be frequently used, which will be required only occasionally?
Are users frequently inputting short, ad-hoc code snippets, or are the programs meant to be reused and modified over long periods
... etc.
background/culture of the intended audience
Which common practices/idioms from other regular (and plain natural) languages can or should be reused if possible?
Should one favor a terse but cryptic style, or a more explicit, but more verbose style?
... etc.
Basically, it is hard to make suggestions about a language syntax, without a good grasp of the intended usage and users.
Nevertheless, I'd like to suggest the following for the date format question:
Use an alternative format for date values altogether; one that would be "natural" enough to users but distinctive enough that a regular grammar can describe.
For example, one that uses a 3 letters abbreviation for the month (downside DSL becomes tied to English or other tongue, but also advantage, the ambiguity for humans about which is day and which is month is removed). Tentatively:
dd-mmm-yyyy (may seem unnatural in cultures where the prevailing date order
starts with the month maybe yyyy-mmm-dd then ?)
mmm-dd-yyyy (better for the above mentioned cultures)
ddmmmyyyy (avoid the dashes, but impose leading zeros)
MnnDnnYyyyy (using "M", "D" and "Y" (or others) as explicit prefixes; now,
this is completely culture neutral, but maybe a bit awkward...)
Anyway, just ideas... Applicability will vary with the human/cultural factors mentioned, and with the rest of the syntax. For example the above may imply that variables be explicitly marked (that's one of the reasons many languages use the $ prefix for example), to avoid possible conflict with [odd, but possible] variable identifiers.
In a nutshell the idea is to substitute the need for a special character prefix (which may then collide with the use these characters for mathematical and other expressions), by making the 12 months tag an good enough discriminator for the parser.

Resources