PEG parsing match at least one preserving order


Given the PEG rule:
rule = element1:'abc' element2:'def' element3:'ghi' ;
How do I rewrite this such that it matches at least one of the elements but possibly all while enforcing their order?
I.e. I would like to match all of the following lines:
abc def ghi
abc def
abc ghi
def ghi
abc
def
ghi
but not an empty string or misordered expressions, e.g. def abc.
Of course with three elements, I could spell out the combinations in separate rules, but as the number of elements increases, this becomes error prone.
Is there a way to specify this in a concise manner?

You can use optionals:
rule = [element1:'abc'] [element2:'def'] [element3:'ghi'] ;
You would use a semantic action for rule to check that at least one token was matched:
def rule(self, ast):
    if not (ast.element1 or ast.element2 or ast.element3):
        raise FailedSemantics('Expecting at least one token')
    return ast
Another option is to use several choices:
rule
    =
      element1:'abc' [element2:'def'] [element3:'ghi']
    | [element1:'abc'] element2:'def' [element3:'ghi']
    | [element1:'abc'] [element2:'def'] element3:'ghi'
    ;
Caching will make the latter as efficient as the former.
Then, you can add cut elements for additional efficiency and more meaningful error messages:
rule
    =
      element1:'abc' ~ [element2:'def' ~] [element3:'ghi' ~]
    | [element1:'abc' ~] element2:'def' ~ [element3:'ghi' ~]
    | [element1:'abc' ~] [element2:'def' ~] element3:'ghi' ~
    ;
or:
rule = [element1:'abc' ~] [element2:'def' ~] [element3:'ghi' ~] ;

The answer is: one precondition on the disjunct, and then a sequence of optionals.
rule = &(e1 / e2 / e3) e1? e2? e3?
This is standard PEG: & is the and-predicate ('must match here, but consumes nothing') and ? means 'optional'. Most PEG parsers provide both features, though not always with these symbols.
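The same shape (one lookahead over the alternatives, then a run of optionals) can be tried outside a PEG tool. Below is a minimal sketch using a Ruby regular expression rather than a PEG grammar; the constant and method names are hypothetical, and allowing optional whitespace between elements is an assumption based on the sample lines in the question.

```ruby
# Regex analogue of:  &(e1 / e2 / e3) e1? e2? e3?
# (?=...) is a lookahead: it must match at this position but consumes nothing.
# ORDERED_AT_LEAST_ONE and ordered_at_least_one? are hypothetical names.
ORDERED_AT_LEAST_ONE = /\A(?=abc|def|ghi)(?:abc\s*)?(?:def\s*)?(?:ghi)?\z/

def ordered_at_least_one?(line)
  ORDERED_AT_LEAST_ONE.match?(line)
end

ordered_at_least_one?('abc ghi')  # => true
ordered_at_least_one?('def abc')  # => false (wrong order)
ordered_at_least_one?('')         # => false (lookahead fails)
```

The lookahead rejects the empty input, and the fixed sequence of optionals is what enforces the order.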


Make lexer consider parser before determining tokens?

I'm writing a lexer and parser in ocamllex and ocamlyacc as follows. function_name and table_name are the same regular expression, i.e., a string containing only English letters. The only way to determine whether a string is a function_name or a table_name is to check its surroundings. For example, if such a string is surrounded by [ and ], then we know that it is a table_name. Here is the current code:
In lexer.mll,
... ...
let function_name = ['a'-'z' 'A'-'Z']+
let table_name = ['a'-'z' 'A'-'Z']+
rule token = parse
| function_name as s { FUNCTIONNAME s }
| table_name as s { TABLENAME s }
... ...
In parser.mly:
... ...
main:
| LBRACKET TABLENAME RBRACKET { Table $2 }
... ...
Because I wrote | function_name as s { FUNCTIONNAME s } before | table_name as s { TABLENAME s }, the above code fails to parse [haha]: the lexer first treats haha as a function_name, and then the parser cannot find any matching rule for it. If the lexer treated haha as a table_name, the parser would match [haha] as a table.
One workaround is to be more precise in the lexer. For example, we could define let table_name_with_brackets = '[' ['a'-'z' 'A'-'Z']+ ']' and | table_name_with_brackets as s { TABLENAMEWITHBRACKETS s } in the lexer. But I would like to know whether there are any other options. Is it not possible to make the lexer and parser work together to determine the tokens and the reductions?

You should avoid trying to get the lexer to do the parser's work. The lexer should just identify lexemes; it should not try to figure out where a lexeme fits into the syntax. So in your (simplified) example, there should be only one lexical type, name. The parser will figure it out from there.
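To make the division of labour concrete, here is a minimal sketch in Ruby rather than ocamllex/ocamlyacc. The token kinds, the method names, and the rule that a bare name is a function are all assumptions made for illustration; only the bracketed-table rule comes from the question.

```ruby
require 'strscan'

# The lexer emits one :name kind for every identifier; it never guesses
# whether the name is a function or a table.
def tokenize(src)
  ss = StringScanner.new(src)
  tokens = []
  until ss.eos?
    if ss.skip(/\s+/)
      next
    elsif ss.scan(/\[/)
      tokens << [:lbracket, '[']
    elsif ss.scan(/\]/)
      tokens << [:rbracket, ']']
    elsif (s = ss.scan(/[A-Za-z]+/))
      tokens << [:name, s]
    else
      raise ArgumentError, "unexpected character at offset #{ss.pos}"
    end
  end
  tokens
end

# The parser decides what a :name means from the tokens around it.
def classify(tokens)
  kinds = tokens.map(&:first)
  if kinds == [:lbracket, :name, :rbracket]
    [:table, tokens[1][1]]
  elsif kinds == [:name]
    [:function, tokens[0][1]] # assumption: a bare name is a function
  else
    raise ArgumentError, 'syntax error'
  end
end

classify(tokenize('[haha]')) # => [:table, "haha"]
```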
But it seems, from the comments, that in the unsimplified original, the two patterns are overlapping rather than identical. That's more annoying, although it's only slightly more complicated. Basically, you need to separate out the common pattern as one lexical type, and then add the additional matches as one or two other lexical types (depending on whether or not one pattern is a strict superset of the other).
That might not be too difficult, depending on the precise relationship between the two patterns. You might be able to find a very simple solution by writing the patterns in the correct order, for example, because of the longest match rule:
If several regular expressions match a prefix of the input, the “longest match” rule applies: the regular expression that matches the longest prefix of the input is selected. In case of tie, the regular expression that occurs earlier in the rule is selected.
Most of the time, that's all it takes: first define the intersection of the two patterns as a base lexeme, and then add the full lexical patterns of each contextual type to provide additional matches. Your parser will then have to match name | function_name in one context and name | table_name in the other context. But that's not too bad.
Where it will fail is when an input stream cannot be unambiguously divided into lexemes. For example, suppose that in a function context, a name could include a ? character, but in a table context the ? is a valid postfix operator. In that case, you have to actively prevent foo? from being analysed as a single token in the table context, which means that the lexer does have to be aware of parser context.

split large regular expression in different lines

I have this regular expression:
INVALID_NAMES = /\b(bib$|costumes$|httpanties?|necklace|cuff link|cufflink|scarf|pendant|apron|buckle|beanie|hat|ring|blanket|polo|earrings?|plush|pacifier|tie$|panties|boxers?|slippers?|pants?|leggings|ibattz|dress|bodysuits?|charm|battstation|tea|pocket ref|pajamas?|boyshorts?|mimopowertube|coat|bathrobe)\b/i
and it works as written, but I want to write something like this:
INVALID_NAMES = /\b(bib$|costumes$|httpanties?|necklace|cuff link|
cufflink|scarf|pendant|apron|buckle|beanie|hat|ring|blanket|
polo|earrings?|plush|pacifier|tie$|panties|boxers?|
slippers?|pants?|leggings|ibattz|dress|bodysuits?|charm|
battstation|tea|pocket ref|pajamas?|boyshorts?|
mimopowertube|coat|bathrobe)\b/i
but if I use the second option, the words cufflink, polo, slippers?, battstation and mimopowertube are not matched because of the spaces in front of each of them at the start of a line, for example:
(this space before the word)cufflink
I'd be very grateful for any help.

You may use something like this
INVALID_NAMES = [
"bib$",
"costumes$",
"httpanties?",
"necklace"
]
INVALID_NAMES_REGEX = /\b(#{INVALID_NAMES.join '|'})\b/i
p INVALID_NAMES_REGEX

Construct Your Regex with the Space-Insensitive Flag
You can use the space-insensitive flag (/x, extended mode) to ignore whitespace and comments in your regular expression. Note that once you enable this flag you will need to use \s or another explicit escape to match whitespace, since /x causes literal spaces in the pattern to be ignored.
Consider the following example:
INVALID_NAMES =
/\b(bib$ |
costumes$ |
httpanties? |
necklace |
cuff\slink |
cufflink |
scarf |
pendant |
apron |
buckle |
beanie |
hat |
ring |
blanket |
polo |
earrings? |
plush |
pacifier |
tie$ |
panties |
boxers? |
slippers? |
pants? |
leggings |
ibattz |
dress |
bodysuits? |
charm |
battstation |
tea |
pocket\sref |
pajamas? |
boyshorts? |
mimopowertube |
coat |
bathrobe
)\b/ix
Note that you can format it in many other ways, but having one expression per line makes it easier to sort and edit your sub-expressions. If you want it to have multiple alternatives per line, you could certainly do that.
Making Sure It Works
You can see that the expression above works as intended with the following examples:
'cufflink'.match INVALID_NAMES
#=> #<MatchData "cufflink" 1:"cufflink">
'cuff link'.match INVALID_NAMES
#=> #<MatchData "cuff link" 1:"cuff link">

When you add a newline in the middle of a regex literal, the newline becomes a part of the regular expression. Look at this example:
"ab" =~ /ab/ # => 0
"ab" =~ /a
b/ # => nil
"a\nb" =~ /a
b/ # => 0
You can suppress the newline by appending a backslash at the end of the line:
"ab" =~ /a\
b/ # => 0
Applied to your regex (leading spaces also removed):
INVALID_NAMES = /\b(bib$|costumes$|httpanties?|necklace|cuff link|\
cufflink|scarf|pendant|apron|buckle|beanie|hat|ring|blanket|\
polo|earrings?|plush|pacifier|tie$|panties|boxers?|\
slippers?|pants?|leggings|ibattz|dress|bodysuits?|charm|\
battstation|tea|pocket ref|pajamas?|boyshorts?|\
mimopowertube|coat|bathrobe)\b/i

Your patterns are inefficient and will cause the Regexp engine to thrash badly.
I'd recommend you investigate what Perl's Regexp::Assemble can do to help your Ruby code:
"How do I ignore file types in a web crawler?"
"Is there an efficient way to perform hundreds of text substitutions in Ruby?"

You might do it like this:
INVALID_NAMES = ['necklace', 'cuff link', 'cufflink', 'scarf', 'tie?', 'bib$']
r = Regexp.union(INVALID_NAMES.map { |n| /\b#{n}\b/i })
str = 'cat \n cufflink bib cuff link. tie Scarf\n cow necklace? \n ti. bib'
str.scan(r)
#=> ["cufflink", "cuff link", "tie", "Scarf", "necklace", "ti", "bib"]

Operator associativity using Scala Parsers

So I've been trying to write a calculator with Scala's parser, and it's been fun, except that I found that operator associativity is backwards, and that when I try to make my grammar left-recursive, even though it's completely unambiguous, I get a stack overflow.
To clarify, if I have a rule like:
def subtract: Parser[Int] = num ~ "-" ~ add ^^ { x => x._1._1 - x._2 }
then evaluating 7 - 4 - 3 comes out to be 6 instead of 0.
The way I have actually implemented this is that I am composing a binary tree where operators are non-leaf nodes, and leaf nodes are numbers. The way I evaluate the tree is left child (operator) right child. When constructing the tree for 7 - 4 - 5, what I would like for it to look like is:
        -
    -       5
  7   4  NULL NULL
where - is the root, its children are - and 5, and the second -'s children are 7 and 4.
However, the only tree I can construct easily is
        -
    7       -
NULL NULL  4   5
which is different, and not what I want.
Basically, the easy parenthesization is 7 - (4 - 5) whereas I want (7 - 4) - 5.
How can I hack this? I feel like I should be able to write a calculator with correct operator precedence regardless. Should I tokenize everything first and then reverse my tokens? Is it ok for me to just flip my tree by taking all left children of right children and making them the right child of the right child's parent and making the parent the left child of the ex-right child? It seems good at a first approximation, but I haven't really thought about it too deeply. I feel like there must just be some case that I'm missing.
My impression is that I can only make an LL parser with the scala parsers. If you know another way, please tell me!

Scala's standard implementation of parser combinators (the Parsers trait) does not support left-recursive grammars. You can, however, use PackratParsers if you need left recursion. That said, if your grammar is a simple arithmetic expression parser, you most definitely do not need left recursion.
Edit
There are ways to use right recursion and still keep left associativity, and if you are keen on that, just look up arithmetic expressions and recursive descent parsers. And, of course, as I said, you can use PackratParsers, which allow left recursion.
But the easiest way to handle associativity without using PackratParsers is to avoid using recursion. Just use one of the repetition operators to get a List, and then foldLeft or foldRight as required. Simple example:
import scala.util.parsing.combinator.RegexParsers
trait Tree
case class Node(op: String, left: Tree, right: Tree) extends Tree
case class Leaf(value: Int) extends Tree
object P extends RegexParsers {
  // One term followed by zero or more (operator, term) pairs: no recursion needed.
  def expr = term ~ (("+" | "-") ~ term).* ^^ mkTree
  def term = "\\d+".r ^^ (_.toInt)
  // Fold from the left so that 7 - 4 - 5 becomes (7 - 4) - 5.
  def mkTree(input: Int ~ List[String ~ Int]): Tree = input match {
    case first ~ rest => rest.foldLeft(Leaf(first): Tree)(combine)
  }
  def combine(acc: Tree, next: String ~ Int) = next match {
    case op ~ y => Node(op, acc, Leaf(y))
  }
}
You can find other, more complete, examples on the scala-dist repository.

I'm interpreting your question as follows:
If you write rules like def expression = number ~ "-" ~ expression and then evaluate on each node of the syntax tree, then you find that in 3 - 5 - 4, the 5 - 4 is computed first, giving 1 as a result, and then 3 - 1 is computed giving 2 as a result.
On the other hand, if you write rules like def expression = expression ~ "-" ~ number, the rules are left-recursive and overflow the stack.
There are four solutions to this problem:
1. Post-process the abstract syntax tree to convert it from a right-associative tree to a left-associative tree. If you're using actions on the grammar rules to do the computation immediately, this won't work for you.
2. Define the rule as def expression = repsep(number, "-") and then when evaluating the computation, loop over the parsed numbers (which will appear in a flat list) in whichever direction provides you the associativity you need. You can't use this if more than one kind of operator will appear, since the operator will be thrown away.
3. Define the rule as def expression = number ~ ( "-" ~ number) *. You'll have an initial number, plus a set of operator-number pairs in a flat list, to process in any direction you want (though left-to-right is probably easier here).
4. Use PackratParsers as Daniel Sobral suggested. This is probably your best and simplest choice.
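The post-processing option is the "flip my tree" idea from the question. Here is a minimal, language-independent sketch of it, written in Ruby rather than Scala. The tree representation (Integers as leaves, [op, left, right] arrays as nodes) and both method names are hypothetical. It assumes a purely right-nested chain of same-precedence operators with no parentheses.

```ruby
# Convert a right-nested chain such as 7 - (4 - 5) into (7 - 4) - 5 by
# walking down the right spine and re-attaching each operator on the left.
def to_left_assoc(tree)
  return tree unless tree.is_a?(Array)
  op, left, right = tree
  acc = left
  while right.is_a?(Array)
    acc = [op, acc, right[1]]      # combine what we have with the next operand
    op, right = right[0], right[2] # move one step down the right spine
  end
  [op, acc, right]
end

# Evaluate a tree as: left child, operator, right child.
def evaluate(tree)
  return tree unless tree.is_a?(Array)
  op, left, right = tree
  a = evaluate(left)
  b = evaluate(right)
  op == '-' ? a - b : a + b
end

right_nested = ['-', 7, ['-', 4, 3]]
evaluate(right_nested)                 # => 6, i.e. 7 - (4 - 3)
evaluate(to_left_assoc(right_nested))  # => 0, i.e. (7 - 4) - 3
```

With mixed precedence levels or parenthesised sub-expressions a plain spine walk like this is no longer enough, which is one reason the flat-list approaches are usually simpler.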

string format check

Suppose I have string variables like following:
s1="10$"
s2="10$ I am a student"
s3="10$Good"
s4="10$ Nice weekend!"
As you can see above, s2 and s4 have white space after 10$.
In general, I would like a way to check whether a string starts with 10$ and has one or more white-space characters after 10$. For example, the rule should find s2 and s4 in the case above. How do I define a rule that checks whether a string starts with '10$' followed by white space?
What I mean is that something like s2.RULE? should return true or false to tell whether the string matches.
---------- update -------------------
please also tell the solution if 10# is used instead of 10$

You can do this using Regular Expressions (Ruby has Perl-style regular expressions, to be exact).
# For ease of demonstration, I've moved your strings into an array
strings = [
"10$",
"10$ I am a student",
"10$Good",
"10$ Nice weekend!"
]
p strings.find_all { |s| s =~ /\A10\$[ \t]+/ }
The regular expression breaks down like this:
The / at the beginning and the end tell Ruby that everything in between is part of the regular expression
\A matches the beginning of a string
The 10 is matched verbatim
\$ means to match a $ verbatim. We need to escape it since $ has a special meaning in regular expressions.
[ \t]+ means "match at least one blank and/or tab"
So this regular expression says "Match every string that starts with 10$ followed by at least one blank or tab character". Using =~ you can test strings in Ruby against this expression. =~ returns the index of the match (a non-nil value, which counts as true in a conditional like if) or nil when there is no match.
Edit: Updated white space matching as per Asmageddon's suggestion.

this works:
"10$ " =~ /^10\$ +/
and returns nil when there is no match or 0 when there is one. Since only nil and false are falsy in Ruby, 0 counts as true, so you can use the result directly in a conditional.

Use a regular expression like this one (\A anchors it to the start of the string, as the question requires):
/\A10\$\s+/
EDIT
If you use =~ for matching, note that:
"The =~ operator returns the character position in the string of the start of the match."
So it might return 0 to denote a match. Only a return of nil means no match.
See for example http://www.regular-expressions.info/ruby.html for a regular expression tutorial for Ruby.
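The return values of =~ are easy to check directly. This is a small sketch; the variable names are arbitrary, and the pattern is anchored with \A on the assumption that only matches at the start of the string are wanted.

```ruby
pattern = /\A10\$\s+/

hit  = '10$ I am a student' =~ pattern # => 0 (the match starts at index 0)
miss = '10$Good' =~ pattern            # => nil (no white space after 10$)

# 0 is truthy in Ruby, so the result can be used directly as a condition.
label = hit ? 'matched' : 'no match'   # => "matched"
```

When only a yes/no answer is needed, String#match? returns true or false and avoids the 0-versus-nil question altogether.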

If you want to proceed to cases with $ and # then try this regular expression:
/^10[\$#] +/
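To get something close to the s2.RULE? form asked for in the question, the check can be wrapped in a predicate method. This is a sketch only: the method name prefixed_with_space? is hypothetical, and taking the prefix as an argument (so that both 10$ and 10# work) is an assumption about what is wanted.

```ruby
# Returns true when s starts with the literal prefix followed by at least
# one blank or tab. Regexp.escape makes characters such as $ match literally.
def prefixed_with_space?(s, prefix)
  s.match?(/\A#{Regexp.escape(prefix)}[ \t]+/)
end

strings = ['10$', '10$ I am a student', '10$Good', '10$  Nice weekend!']
strings.select { |s| prefixed_with_space?(s, '10$') }
# => ["10$ I am a student", "10$  Nice weekend!"]

prefixed_with_space?('10# hello', '10#') # => true
```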

Regular Expression Attack Vector?

How does one "parameterize" variable input into a Regex in Ruby? For example, I'm doing the following:
q = params[:q]
all_values.collect { | col | [col.name] if col.name =~ /(\W|^)#{q}/i }.compact
Since it (#{q}) is a variable from an untrusted source (the query string), I have to assume it could be an attack vector. Any best practices here?

Try Regexp.escape:
>> Regexp.escape('foo\bar\baz$+')
=> "foo\\\\bar\\\\baz\\$\\+"
So your code would look something like:
q = params[:q]
re = Regexp.escape(q)
all_values.collect { | col | [col.name] if col.name =~ /(\W|^)#{re}/i }.compact
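The effect of escaping can be seen on input that contains regex metacharacters. In this sketch the helper name name_matcher and the sample names are made up for illustration; only Regexp.escape and the (\W|^) prefix come from the code above.

```ruby
# Build the matcher from untrusted text, treating every character literally.
def name_matcher(q)
  /(\W|^)#{Regexp.escape(q)}/i
end

names = ['price (usd)', 'price', 'a.b', 'axb']

names.grep(name_matcher('(usd')) # => ["price (usd)"]
names.grep(name_matcher('a.b'))  # => ["a.b"]  (the dot no longer matches "x")
```

Without the escape, a query such as (usd would not even compile: Regexp.new('(\W|^)(usd') raises RegexpError because of the unmatched parenthesis.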

So, do you want the user to be able to provide an arbitrary regular expression, or just some literal text that can safely be quoted? If the user can provide both a regular expression and the text it will be matched against, it's not hard to mount a DoS attack by providing an expression with exponential runtime.
