ParseKit greedy matching mode

I am making something like a formula validator, and I am using the ParseKit framework to accomplish that. My approach is to create a proper grammar; when the didMatchFormula callback method is called on a sample string, I assume a formula has been found and the string is therefore valid.
There is one difficulty, however: a formula is detected in the sample string even if the string contains other characters after the formula part. I would need something like a greedy matching mode, where the entire string is matched against the formula grammar, so that didMatchFormula is called only if the string contains a formula and no other characters.
Can you give me some hints on how to accomplish that with ParseKit, or in some other way?
I cannot use regular expressions, since my formulas use recursion, and regexes are not a good tool for handling that.

Developer of ParseKit here.
Probably the simplest and most elegant way to do this with ParseKit (or any parsing toolkit) is to design your formula language to have a terminator character after every statement. This is the same concept as ; terminating statements in most C-like programming languages.
Here's an example toy formula language which uses . as the statement terminator:
#start = lang;
lang = statement+;
statement = Word+ terminator;
terminator = '.';
Notice how I have designed the language so that your "greedy" requirement is an inherent feature of the language. Think about it: if the input string ends with any junk content that is not a valid statement ending in a ., my lang production will not find a match and the parse will fail.
With this type of design, you won't need any "greedy" features in the parsing toolkit you use. Rather, your requirement will be naturally met by your language design.
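To make the idea concrete outside ParseKit (which is Objective-C), here is a rough LPeg sketch of the same terminator-based design; the P(-1) at the end is my own addition, and simply insists that the whole input be consumed:
local lpeg = require "lpeg"
local P, R, S = lpeg.P, lpeg.R, lpeg.S

local space     = S" \t\n"^0
local word      = R("az", "AZ")^1 * space
local statement = word^1 * P"." * space
local lang      = space * statement^1 * P(-1)  -- P(-1) succeeds only at end of input

print(lang:match("foo bar. baz."))  --> 14 (whole input consumed: valid)
print(lang:match("foo bar. junk"))  --> nil (trailing junk with no terminator: invalid)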

Related

ANTLR4 - Parse subset of a language (e.g. just query statements)

I'm trying to figure out how I can best parse just a subset of a given language with ANTLR. For example, say I'm looking to parse U-SQL. Really, I'm only interested in parsing certain parts of the language, such as query statements. I couldn't be bothered with parsing the many other features of the language. My current approach has been to design my lexer / parser grammar as follows:
// ...
statement
: queryStatement
| undefinedStatement
;
// ...
undefinedStatement
: (.)+?
;
// ...
UndefinedToken
: (.)+?
;
The gist is, I add a fall-back parser rule and lexer rule for undefined structures and tokens. I imagine later, when I go to walk the parse tree, I can simply ignore the undefined statements in the tree, and focus on the statements I'm interested in.
This seems like it would work, but is this an optimal strategy? Are there more elegant options available? Thanks in advance!
Parsing a subpart of a grammar is super easy. Usually you have a top-level rule which you call to parse the full input with the entire grammar.
For the subpart, call the function that parses only that subrule, like:
const expression = parser.statement();
I use this approach frequently when I want to parse stored procedures or data types only.
Keep in mind, however, that subrules usually are not terminated with the EOF token (as the top-level rule should be). This means you get no syntax error if there is more than the subelement in the token stream (the parser just stops when the subrule has matched completely). If that's a problem for you, then add a copy of the subrule you want to parse, give it a dedicated name, and end it with EOF, like this:
dataTypeDefinition: // For external use only. Don't reference this in the normal grammar.
dataType EOF
;
dataType: // type in sql_yacc.yy
type = (
...
Check the MySQL grammar for more details.
This general idea, parsing the interesting bits of an input while ignoring the sea of surrounding tokens, is usually called "island parsing". There's an example of an island parser in the ANTLR reference book, although I don't know if it is directly applicable.
The tricky part of island parsing is getting the island boundaries right. If you miss a boundary, or recognise as a boundary something which isn't, then your parse will fail disastrously. So you need to understand the input at least well enough to be able to detect where the islands are. In your example, that might mean recognising a SELECT statement, for example. However, you cannot blindly recognise the string of letters SELECT because that string might appear inside a string constant or a comment or some other context in which it was never intended to be recognised as a token at all.
I suspect that if you are going to parse queries, you'll basically need to be able to recognise any token. So it's not going to be a sea of uninspected input characters; you can view it as a sea of recognised but unparsed tokens. In that case, it should be reasonably safe to parse a non-query statement as a keyword followed by arbitrary tokens other than ; and ending with a ;. (But you might need to recognise nested blocks; I don't really know what the possibilities are.)
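As a toy illustration of the island idea (an LPeg sketch of my own, not the example from the ANTLR book): treat ;-terminated statements as the unit, capture the ones that start with SELECT, and silently consume the rest:
local lpeg = require "lpeg"
local P, S, C = lpeg.P, lpeg.S, lpeg.C

local ws     = S" \t\n"^0
local island = C(P"SELECT" * (1 - P";")^0 * P";")  -- a statement we care about: capture it
local water  = (1 - P";")^0 * P";"                 -- any other statement: consume and drop
local script = ws * ((island + water) * ws)^0

-- Deliberately naive: a ';' inside a string literal or comment would break
-- this, which is exactly the boundary problem described above.
print(script:match("CREATE TABLE t (x INT); SELECT x FROM t; DROP TABLE t;"))
--> SELECT x FROM t;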

Predictive editor for Rascal grammar

I'm trying to write a predictive editor for a grammar written in Rascal. The heart of this would be a function taking as input a list of symbols and returning as output a list of symbol types, such that an instance of any of those types would be a syntactically legal continuation of the input symbols under the grammar. So if the input list was [4,+] the output might be [integer]. Is there a clever way to do this in Rascal? I can think of imperative programming ways of doing it, but I suspect they don't take proper advantage of Rascal's power.
That's a pretty big question. Here are some leads toward an answer; the full answer would amount to implementing it for you completely :-)
Reify the original grammar for the language you are interested in as a value using the # operator, so that you have a concise representation of the grammar which can be queried easily. The representation is defined in the modules Type, ParseTree (which extends Type), and Grammar.
Construct the same representation for the input query. This could be done in many ways. A kick-ass, language-parametric way would be to extend Rascal's parser algorithm to return partial trees for partial input, but I believe this would be too much hassle for now. An easier solution would entail writing a grammar for a set of partial inputs, i.e. the language grammar with shorter rules at specific points. The grammar will be ambiguous, but that is not a problem in this case.
Use tags to tag the "short" rules so that you can find them easily later: syntax E = #short E "+";
Parse with the extended and now ambiguous grammar;
The resulting parse trees will contain the same representation as in ParseTree that you used to reify the original grammar, except that in the original grammar the rules are longer, as in prod(E, [E,+,E],...).
Then select the trees which serve you best for the goal of completion (the ones which use the #short tag), and extract their productions, which look like this: prod(E,[E,+],...). For example, using the / operator: [candidate | /candidate:prod(_,_,/"short") := trees]. You could use the cursor position to find candidates which are close by, instead of all short trees in there.
Use list matching to find prefixes in the original grammar, as in if (/match:prod(_,[*prefix, predicted, *postfix],_) := grammar) .... Here prefix is your query as extracted from the #short rules, predicted is your answer, and postfix is whatever would come after.
Yield the predicted symbol back as a type for the user to read: "<type(predicted, ())>" (this will pretty-print it nicely, even if it's some complex regexp type, and does the quoting right, etc.).

Making a Lua pattern case insensitive with LPeg

I have an app that (among other things) supports plain-text searches and searches using Lua patterns. As a convenience, the app supports case-insensitive searches.
The code that transforms a given Lua pattern into a case-insensitive Lua pattern isn't too pretty. It basically worries about whether or not a character is preceded by an odd or even number of escapes (%) and whether or not it is located inside of square brackets. For example, a pattern like %abb%%c[%ab%%c] becomes %a[bB][bB]%%[cC][%abB%%cC].
I haven't had a chance to learn LPeg yet, and I suppose this could be my motivator.
My question is whether this is something that LPeg could handle easily.
Yes; but for an easier entry into the LPeg world, consider LPeg's "re" module, which gives you a regex-like syntax in which you can specify a set of rules, as in a grammar (think Yacc, etc.). You'd basically write rules for escaped characters, bracket groups, and regular characters. Then you could associate functions with the rules that would emit either the same text they consumed as input, or the case-insensitive modified version.
The structure of your rules would take care of the even/odd distinction, the bracket context, etc., automatically. LPeg uses "ordered choice", so if you add your escape rule first, it will handle %[ correctly and avoid mixing it up with the brackets rule, for example.
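A minimal sketch of that approach (the rule names and the expand/double helpers are mine, not an established API); run on the pattern from the question, it reproduces %a[bB][bB]%%[cC][%abB%%cC]:
local re = require "re"  -- ships with LPeg

local caseless = re.compile([[
  pattern  <- {~ item* ~}
  item     <- escaped / bracket / letter / .
  escaped  <- '%' .                 -- %a, %%, %[ ... pass through untouched
  bracket  <- '[' '^'? inner* ']'   -- (ignores the leading-']' edge case)
  inner    <- escaped / iletter / (!']' .)
  letter   <- [a-zA-Z] -> expand    -- b outside brackets  ->  [bB]
  iletter  <- [a-zA-Z] -> double    -- b inside brackets   ->  bB
]], {
  expand = function(c) return "[" .. c:lower() .. c:upper() .. "]" end,
  double = function(c) return c:lower() .. c:upper() end,
})

print(caseless:match("%abb%%c[%ab%%c]"))  --> %a[bB][bB]%%[cC][%abB%%cC]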

What is the proper Lua pattern for quoted text?

I've been playing with this for an hour or two and have found myself at a roadblock with the Lua pattern-matching utilities. I am attempting to match all quoted text in a string and replace it if needed.
The pattern I have come up with so far is: (\?[\"\'])(.-)%1
This works in some cases but, not all cases:
Working: "This \"is a\" string of \"text to\" test with"
Not Working: "T\\\"his \"is\' a\" string\" of\' text\" to \"test\" wit\\\"h"
In the not-working example, I would like it to match the following (I made a function that gets the matches I desire; I'm just looking for a pattern to use with gsub, and I'm curious whether a Lua pattern can do this):
string
a" string" of
is' a" string" of' text
test
his "is' a" string" of' text" to "test" wit
I'm going to continue to use my function for the time being, but I am curious whether there is a pattern I could/should be using and I'm just missing something with patterns.
(a few edits because I forgot about Stack Overflow's formatting)
(another edit to make a non-HTML example, since it was leading to assumptions that I was attempting to parse HTML)
Trying to match escaped, quoted text using regular expressions is like trying to remove the daisies (and only the daisies) from a field using a lawnmower.
I made a function that gets the matches I desire
This is the correct move.
I'm curious if a Lua pattern can do this
From a practical point of view, even if a pattern can do this, you don't want to. From a theoretical point of view, you are trying to find a double quote that is preceded by an even number of backslashes. This is definitely a regular language, and the regular expression you want would be something like the following (using Lua quoting conventions):
[[[^\](\\)*"(.-[^\](\\)*)"]]
And the quoted string would be result #2. But Lua patterns are not full regular expressions; in particular, you cannot put a * after a parenthesized pattern.
So my guess is that this problem cannot be solved using Lua patterns; but since Lua patterns are not a standard object of automata theory, I'm not aware of any body of proof techniques that you could use to prove it.
The issue with escaped quotes is that, in general, if there's an odd number of backslashes before the quote, then it's escaped, and if there's an even number, it's not. I do not believe that Lua pattern-matching is powerful enough to represent this condition, so if you need to parse text like this, then you should seek another way. Perhaps you can iterate through the string and parse it, or you could find each quote in turn and read backwards, counting the backslashes until you find a non-backslash character (or the beginning of the string).
If you absolutely must use patterns for some reason, you could try a multi-step process. First, gsub all occurrences of two backslashes in a row, replacing them with some sentinel value. This must be a value that does not already occur in the string; you could try something like "\001" if you know the string doesn't contain non-printable characters. Once you've replaced all sequences of two backslashes in a row, any backslashes left are escaping the following character. Now you can apply your original pattern, and finally replace all instances of your sentinel value with two backslashes again.
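Here is a sketch of that sentinel idea (my own code; it also hides escaped quotes behind sentinels, which the recipe above leaves to the "original pattern", and it assumes "\1" through "\3" never occur in the input):
-- Apply repl to the body of every (unescaped) quoted run in s.
local function map_quoted(s, repl)
  local t = s:gsub("\\\\", "\1")   -- hide literal backslash pairs
             :gsub("\\\"", "\2")   -- hide escaped double quotes
             :gsub("\\'",  "\3")   -- hide escaped single quotes
  t = t:gsub("([\"'])(.-)%1", function(q, body)
        return q .. repl(body) .. q
      end)
  return (t:gsub("\3", "\\'"):gsub("\2", "\\\""):gsub("\1", "\\\\"))
end

print(map_quoted([[He said \"hi\" and "left the room".]], string.upper))
--> He said \"hi\" and "LEFT THE ROOM".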
Lua's pattern language is adequate for many simple cases, and it has at least one trick you don't find in a typical regular expression package: a way to match balanced parentheses. But it has its limits as well.
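For instance, a quick illustration of the %b item:
print(("f(a(b)c) + g(d)"):match("%b()"))  --> (a(b)c)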
When those limits are exceeded, I reach for LPeg. LPeg is an implementation of a Parsing Expression Grammar for Lua, and was implemented by one of Lua's original authors, so the adaptation to Lua is done quite well. A PEG allows specification of anything from simple patterns through complete language grammars. LPeg compiles the grammar to a bytecode and executes it extremely efficiently.
You should NOT be trying to parse HTML with regular expressions. HTML and XML are NOT regular languages and cannot be successfully manipulated with regular expressions; you should use a dedicated HTML parser. There are lots of explanations of why elsewhere.

Good grammar for date data type for recursive descent parser LL(1)

I'm building a custom expression parser and evaluator for a production environment, to provide a limited DSL to the users. The parser itself, like the DSL, needs to be simple. It is going to be built in an exotic language that doesn't support dynamic expression parsing and has no parser generator tools available.
My current decision is to go for a recursive descent approach with an LL(1) grammar, so that even programmers with no previous experience in evaluating expressions could quickly learn how the code works.
It has to handle mixed expressions made up of several data types: decimals, percentages, strings and dates. And dates in the format dd/mm/yyyy are easy to confuse with a sequence of division operations.
Is there a good solution to this problem?
My own solution, aimed at keeping the parser simple, involves prefixing dates with a special symbol, let's say an apostrophe:
<date>   ::= <apostr> <digit> <digit> '/' <digit> <digit> '/' <digit> <digit> <digit> <digit>
<apostr> ::= "'"
<digit>  ::= '0'..'9'
First off, I'm a fan of LL parsers, so I approve of your approach heartily. Note that one of the newer popular parser generators (ANTLR) is LL. If you allow more lookahead, rather than restricting yourself to LL(1), you can do pretty much anything you'd ever want to do with an LR(1) parser, but the code will be far clearer, more reliable, and easier to debug.
I don't know enough about your overall grammar to be able to tell. It is possible you might be able to design things so that the LL parser can always tell from context whether it is looking at an integer expression or a date constant. However, assuming you can't, yes, you'd need some way to tell the difference. The only other thing I can think of would be to use a backslash as the separator instead of a slash, but that's kinda ugly.
What you need is an LL-like lexerless parser with infinite lookahead; namely, a PEG.
http://en.wikipedia.org/wiki/Parsing_expression_grammar
With an ordered choice, it is quite easy to avoid this confusion between date constants and division expressions.
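For example (an LPeg sketch of my own; LPeg is a PEG implementation for Lua): list the date alternative first, and the unbounded lookahead plus ordered choice does the rest:
local lpeg = require "lpeg"
local P, R, C, Cc = lpeg.P, lpeg.R, lpeg.C, lpeg.Cc

local digit  = R"09"
local d2, d4 = digit * digit, digit * digit * digit * digit
local date   = d2 * P"/" * d2 * P"/" * d4
local number = digit^1

-- Ordered choice: try the date interpretation first, fall back to number.
local value = C(date) * Cc"date" + C(number) * Cc"number"

print(value:match("12/06/2021"))  --> 12/06/2021  date
print(value:match("12"))          --> 12          number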
When a language is intended for human input, defining it is as much a matter of:
- adding syntax constraints to ensure unambiguous and easy parsing
- removing/bending syntax to ensure that the language feels intuitive, "natural", to the intended human audience.
Satisfying the second requirement is much harder than the first one, and requires insight into:
- the intended use cases of the language:
  - Which type of keyboard/input device is available? Are some of the allowed characters hard to produce, or hard to see on the display?
  - Which tokens/expressions will be used frequently, and which only occasionally?
  - Are users frequently inputting short, ad-hoc code snippets, or are the programs meant to be reused and modified over long periods?
  - ...etc.
- the background/culture of the intended audience:
  - Which common practices/idioms from other regular (and plain natural) languages can or should be reused, if possible?
  - Should one favor a terse but cryptic style, or a more explicit but more verbose style?
  - ...etc.
Basically, it is hard to make suggestions about a language's syntax without a good grasp of its intended usage and users.
Nevertheless, I'd like to suggest the following for the date format question:
Use an alternative format for date values altogether: one that would be "natural" enough to users, but distinctive enough that a regular grammar can describe it.
For example, one that uses a 3-letter abbreviation for the month (downside: the DSL becomes tied to English or another tongue; but also an advantage: the ambiguity for humans about which part is the day and which is the month is removed). Tentatively:
dd-mmm-yyyy (may seem unnatural in cultures where the prevailing date order starts with the month; maybe yyyy-mmm-dd then?)
mmm-dd-yyyy (better for the above-mentioned cultures)
ddmmmyyyy (avoids the dashes, but imposes leading zeros)
MnnDnnYyyyy (using "M", "D" and "Y" (or others) as explicit prefixes; now, this is completely culture-neutral, but maybe a bit awkward...)
Anyway, just ideas... Applicability will vary with the human/cultural factors mentioned, and with the rest of the syntax. For example, the above may imply that variables be explicitly marked (that's one of the reasons many languages use the $ prefix, for example), to avoid possible conflicts with [odd, but possible] variable identifiers.
In a nutshell, the idea is to replace the need for a special prefix character (which might collide with the use of that character in mathematical and other expressions) by making the 12 month tags a good enough discriminator for the parser.
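In the question's own BNF, that suggestion might look something like this (my sketch):
<date>  ::= <digit> <digit> '-' <month> '-' <digit> <digit> <digit> <digit>
<month> ::= 'jan' | 'feb' | 'mar' | 'apr' | 'may' | 'jun'
          | 'jul' | 'aug' | 'sep' | 'oct' | 'nov' | 'dec'
An LL(1) recursive descent parser can commit to the date production as soon as it sees a digit followed by a month tag, with no risk of confusing it with division.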
