How to combine Regexp and keywords in Scala parser combinators - parsing

I've seen two approaches to building parsers in Scala.
The first is to extend RegexParsers and define your own lexical patterns. The issue I see with this is that I don't really understand how it deals with keyword ambiguities. For example, if a keyword matches the same pattern as identifiers, it is processed as an identifier.
To counter that, I've seen posts like this one that show how to use the StandardTokenParsers to specify keywords. But then, I don't understand how to specify the regexp patterns! Yes, StandardTokenParsers comes with "ident" but it doesn't come with the other ones I need (complex floating point number representations, specific string literal patterns and rules for escaping, etc).
How do you get both the ability to specify keywords and the ability to specify token patterns with regular expressions?

I've written only RegexParsers-derived parsers, but what I do is something like this:
val name: Parser[String] = "[A-Z_a-z][A-Z_a-z0-9]*".r
val kwIf: Parser[String] = "if\\b".r
val kwFor: Parser[String] = "for\\b".r
val kwWhile: Parser[String] = "while\\b".r
val reserved: Parser[String] = ( kwIf | kwFor | kwWhile )
val identifier: Parser[String] = not(reserved) ~> name
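The same idea can be tried outside Scala. The following is only a sketch in Python's `re` module, not the Scala parser itself: the keyword patterns end in `\b` so they do not match a prefix of a longer name, and the hypothetical `identifier` helper rejects reserved words before matching a name, which is the role `not(reserved) ~> name` plays above.

```python
import re

NAME = re.compile(r"[A-Z_a-z][A-Z_a-z0-9]*")
# Each keyword ends with \b so that "iffy" is not read as "if" followed by "fy".
RESERVED = re.compile(r"(?:if|for|while)\b")

def identifier(text, pos=0):
    """Return the identifier starting at pos, or None for keywords and non-names."""
    if RESERVED.match(text, pos):   # plays the role of not(reserved)
        return None
    m = NAME.match(text, pos)       # plays the role of ~> name
    return m.group(0) if m else None
```

With this, `identifier("format")` yields `"format"` because `for\b` does not match inside a longer word, while `identifier("for")` yields `None`.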

Similar to the answer from @randall-schulz, but using an explicit negative lookahead in the regular expression itself.
Here, empty is a keyword but empty? should be an identifier. The negative lookahead fails the match (without consuming the characters) if empty is followed by anything in nameCharsRE. The kw helper function is used for multiple such keywords:
val nameCharsRE = "[^\\s\",'`()\\[\\]{}|;#]"
private def kw(kw: String, token: Token) = positioned {
(s"${kw}(?!${nameCharsRE})").r ^^ { _ => token }
}
private def empty = kw("empty", EMPTY_KW())
private def and = kw("and", AND())
private def or = kw("or", OR())
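To see why the lookahead is used here instead of `\b`, here is a small sketch in Python (an illustration only; `keyword` is a hypothetical helper mirroring `kw` above, without the token and position handling). A word boundary would accept `empty` in `empty?`, whereas the negative lookahead rejects it because `?` is a name character.

```python
import re

# Same character class as nameCharsRE above: anything that may appear in a name.
NAME_CHARS = r"""[^\s",'`()\[\]{}|;#]"""

def keyword(word):
    """Match `word` only when it is not followed by another name character."""
    return re.compile(re.escape(word) + "(?!" + NAME_CHARS + ")")

EMPTY = keyword("empty")
```

`EMPTY.match("empty)")` succeeds, while `EMPTY.match("empty?")` and `EMPTY.match("emptyish")` both fail, leaving those inputs to the identifier rule.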

Related

How to handle 'line-continuation' using parser combinators

I'm trying to write a small parser using the Sprache parser combinator library. The parser should be able to parse lines ended with a single \ as insignificant white space.
Question
How can I create a parser that can parse the values after the = sign that may contain a line-continuation character \?
For example
a = b\e,\
c,\
d
Should be parsed as (KeyValuePair (Key, 'a'), (Value, 'b\e, c, d')).
I'm new to using this library and parser combinators in general. So any pointers in the right direction are much appreciated.
What I have tried
Test
public class ConfigurationFileGrammerTest
{
[Theory]
[InlineData("x\\\n y", @"x y")]
public void ValueIsAnyStringMayContinuedAccrossLinesWithLineContinuation(
string input,
string expectedKey)
{
var key = ConfigurationFileGrammer.Value.Parse(input);
Assert.Equal(expectedKey, key);
}
}
Production
Attempt one
public static readonly Parser<string> Value =
from leading in Parse.WhiteSpace.Many()
from rest in Parse.AnyChar.Except(Parse.Char('\\')).Many()
.Or(Parse.String("\\\n")
.Then(chs => Parse.Return(chs))).Or(Parse.AnyChar.Except(Parse.LineEnd).Many())
select new string(rest.ToArray()).TrimEnd();
Test output
Xunit.Sdk.EqualException: Assert.Equal() Failure
↓ (pos 1)
Expected: x y
Actual: x\
↑ (pos 1)
Attempt two
public static readonly Parser<string> SingleLineValue =
from leading in Parse.WhiteSpace.Many()
from rest in Parse.AnyChar.Many().Where(chs => chs.Count() < 2 || !(string.Join(string.Empty, chs.Reverse().Take(2)).Equals("\\\n")))
select new string(rest.ToArray()).TrimEnd();
public static readonly Parser<string> ContinuedValueLines =
from firsts in ContinuedValueLine.AtLeastOnce()
from last in SingleLineValue
select string.Join(" ", firsts) + " " + last;
public static readonly Parser<string> Value = SingleLineValue.Once().XOr(ContinuedValueLines.Once()).Select(s => string.Join(" ", s));
Test output
Xunit.Sdk.EqualException: Assert.Equal() Failure
↓ (pos 1)
Expected: x y
Actual: x\\n y
↑ (pos 1)
You must not include the line continuation in the output. That is the only issue in the last unit test. When you parse the continuation \\\n, you must drop it from the result and return the empty string instead. Sorry, I don't know how to do that with Sprache in C#. Maybe something like this:
Parse.String("\\\n").Then(chs => Parse.Return(""))
I solved the problem using the combinatorix Python library. It's a parser combinator library. The API uses functions instead of chained methods, but the idea is the same.
Here is the full code with comments:
# `apply` returns a parser that doesn't consume the input stream. It
# applies a function (or lambda) to the output of another parser.
# The following parser removes whitespace from the beginning
# and the end of what is parsed.
strip = apply(lambda x: x.strip())
# parse a single equal character
equal = char('=')
# parse the key part of a configuration line. Since the API is
# functional, it reads "inside-out". Note the use of the special
# `unless(predicate, parser)` parser, which is sometimes missing from
# parser combinator libraries. It runs `parser` on the input stream
# only if the `predicate` parser fails, so parsing can be made
# conditional. It's similar in spirit to negation in Prolog.
# Here it parses *anything until an equal sign*, "joins" the characters
# into a string and strips any space at the start or end of the string.
key = strip(join(one_or_more(unless(equal, anything))))
# parse a single newline character
eol = char('\n')
# returns a parser that returns the empty string; this is a constant
# parser (i.e. it always outputs the same thing).
return_empty_space = apply(lambda x: '')
# This parses a full continuation (i.e. including the spaces that
# start the new line). It parses *the continuation string, then
# zero or more spaces* and returns the empty string.
continuation = return_empty_space(sequence(string('\\\n'), zero_or_more(char(' '))))
# `value` is the parser for the value part. Unless the current char
# is an `eol` (i.e. \n), it tries to parse a continuation and otherwise
# parses anything. It does that at least once, i.e. the value cannot be
# empty. Then it "joins" all the chars into a single string and
# "strips" any space at the start or end of the value.
value = strip(join(one_or_more(unless(eol, either(continuation, anything)))))
# this basically removes the element at index 1 and keeps only the
# elements at 0 and 2 in the result. See below.
kv_apply = apply(lambda x: (x[0], x[2]))
# This is the final parser for a given kv pair. A kv pair is:
#
# - a key part (see key parser)
# - an equal part (see equal parser)
# - a value part (see value parser)
#
# Those are used to parse the input stream in sequence (one after the
# other). It will return three values: key, a '=' char and a value.
# `kv_apply` will only keep the key and value part.
kv = kv_apply(sequence(key, equal, value))
# This is syntactic sugar that turns the string into a stream of chars
# and executes the `kv` parser on it.
parser = lambda string: combinatorix(string, kv)
input = 'a = b\\e,\\\n c,\\\n d'
assert parser(input) == ('a', 'b\\e,c,d')
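If the combinatorix library is not at hand, the core rule (a backslash followed by a newline is dropped, a bare newline ends the value) can also be checked with the standard `re` module. This is a sketch using plain regular expressions rather than parser combinators; `parse_key_value` is a hypothetical name. Unlike the `continuation` parser above, it keeps the spaces that start a continued line.

```python
import re

# A value runs up to the first newline that is not escaped by a backslash.
VALUE = re.compile(r"(?:\\\n|[^\n])*")
# A backslash immediately followed by a newline is a line continuation.
CONTINUATION = re.compile(r"\\\n")

def parse_key_value(text):
    """Split 'key = value' and drop line continuations from the value."""
    key, sep, rest = text.partition("=")
    if not sep:
        raise ValueError("expected '=' in " + repr(text))
    raw = VALUE.match(rest).group(0)
    return key.strip(), CONTINUATION.sub("", raw).strip()
```

For the input from the question, `parse_key_value('a = b\\e,\\\n c,\\\n d')` gives `('a', 'b\\e, c, d')`: the lone backslash in `b\e` survives because it is not followed by a newline.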

Parsing a string containing any chars

I'm trying to get the following to work. I have strings that are inside parentheses. The strings can contain any characters, so the string that I want to parse can itself contain parentheses. I think the regex currently also matches the last parenthesis, which is supposed to be matched by <~ ")", and thus the parsing fails. What am I missing here?
private def parser: Parser[Any] = a ~ b ~ c ^^ {
<do stuff here>
}
private def a: Parser[String] = "\"[^\"]*\"".r | "[^(),>]*".r
private def b: Parser[String] = opt("(" ~> ".*".r <~ ")") ^^ {
case Some(y) => y.trim
case None => ""
}
private def c: Parser[String] = rep(".#" ~> "[^>.]*".r) ^^ (new String(_).trim)
This is supposed to parse following kind of strings:
test0
test1.#attr
"test2"
"test3".#attr
test4..
test5..#attr
"test6..".#attr
"test7.#attr".#attr
test8(icl>uw)
test9(icl>uw).#attr
"test10..().#"(icl>uw).#attr
test11(icl>uw(agt>uw2,obj>uw3),icl>uw4(agt>uw5))
test12(icl>uw1(agt>uw2,obj>uw3),icl>uw4).#attr1.#attr2
test13(agt>thing,obj>role>effect)
So the "a" parser parses the string until open parentheses or .#attr part. "b" parser parses the characters inside optional parentheses. "c" parses the optional .#attrs.
Currently I get a similar error on all test strings that contain a parenthesized part:
11:07:44.662 [main] DEBUG - Parsed: test8()
11:07:44.667 [main] ERROR - FAILURE parsing: test8(icl>uw) -- `)' expected but `i' found
So I assume that the parser parsed the first part correctly, but failed when it saw the parentheses part.
The right solution to parse nested structures is to use recursion, for example in the following fashion:
import scala.annotation.tailrec
// placeholder pattern; the extractor below needs exactly one capturing group
val parser = "regex".r
@tailrec
def extract(string: String, foundTokens: List[String] = List.empty): List[String] = {
  parser.findFirstMatchIn(string) match {
    case Some(parser(matchedValue)) => extract(matchedValue, matchedValue :: foundTokens)
    case None => foundTokens
  }
}
Basically, at each call to the function you prepend the found token to the list of results and call the function again on the matched value. When nothing more matches, you return the tokens found so far.
If multiple matches are possible inside each subtoken, then you should look for a procedure like this one:
def extract(string: String): Iterator[String] = {
  parser.findAllIn(string).flatMap { item =>
    // emit the match itself, then everything found inside it
    Iterator(item) ++ extract(item)
  }
}
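For the concrete inputs in the question, the nesting can also be handled by tracking parenthesis depth instead of using a greedy `.*`. The sketch below is in Python and is only an illustration: `split_parenthesized` is a hypothetical helper, and it assumes a quoted headword contains no embedded double quote. It splits a string into the part before the first balanced group, the group's contents, and the remainder.

```python
def split_parenthesized(text):
    """Split 'head(inner)rest' at the first balanced parenthesis group."""
    search_from = 0
    if text.startswith('"'):
        # Skip a leading quoted headword so parentheses inside it are ignored.
        closing = text.find('"', 1)
        if closing != -1:
            search_from = closing + 1
    start = text.find("(", search_from)
    if start == -1:
        return text, "", ""
    depth = 0
    for i in range(start, len(text)):
        ch = text[i]
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return text[:start], text[start + 1:i], text[i + 1:]
    raise ValueError("unbalanced parentheses in " + repr(text))
```

The inner part can be passed through the same function again to descend into nested groups such as `uw(agt>uw2,obj>uw3)`.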

Unable to parse a complex language with regex and Scala parser combinators

I'm trying to write a parser for a certain language as part of my research. Currently I have problems getting the following code to work in a way I want:
private def _uw: Parser[UW] = _headword ~ _modifiers ~ _attributes ^^ {
case hw ~ mods ~ attrs => new UW(hw, mods, attrs)
}
private def _headword[String] = "\".*\"".r | "[^(),]*".r
private def _modifiers: Parser[List[UWModifier]] = opt("(" ~> repsep(_modifier, ",") <~ ")") ^^ {
case Some(mods) => mods
case None => List[UWModifier]()
}
private def _modifier: Parser[UWModifier] = ("[^><]*".r ^^ (RelTypes.toRelType(_))) ~ "[><]".r ~ _uw ^^ {
case (rel: RelType) ~ x ~ (uw: UW) => new UWModifier(rel, uw)
}
private def _attributes: Parser[List[UWAttribute]] = rep(_attribute) ^^ {
case Nil => List[UWAttribute]()
case attrs => attrs
}
private def _attribute: Parser[UWAttribute] = ".#" ~> "[^>.]*".r ^^ (new UWAttribute(_))
The above code covers just one part of the language, and to save time and space I won't go into much detail about the whole language. The _uw method is supposed to parse a string that consists of three parts, although only the first part must exist in the string.
_uw should be able to parse these test strings correctly:
test0
test1.#attr
"test2"
"test3".#attr
test4..
test5..#attr
"test6..".#attr
"test7.#attr".#attr
test8(urel>uw)
test9(urel>uw).#attr
"test10..().#"(urel>uw).#attr
test11(urel1>uw1(urel2>uw2,urel3>uw3),urel4>uw4).#attr1.#attr2
So if the headword starts and ends with ", everything inside the double quotes is considered to be part of the headword. All words starting with .#, if they are not inside the double quotes, are attributes of the headword.
E.g. in test5, the parser should parse test5. as the headword and attr as an attribute. Only the .# is omitted, and all dots before it should be contained in the headword.
So, after the headword there CAN be attributes and/or modifiers. The order is strict, so attributes always come after modifiers. If there are attributes but no modifiers, everything until .# is considered part of the headword.
The main problem is "[^#(]*".r. I've tried all kinds of creative alternatives, such as "(^[\\w\\.]*)((\\.\\#)|$)".r, but nothing seems to work. How does lookahead or lookbehind even affect parser combinators? I'm not an expert on parsing or regex, so all help is welcome!
I don't think "[^#(]*".r has anything to do with your problem. I see this:
private def _headword[String] = "\".*\"".r | "[^(),]*".r
which is the first thing in _uw (and, by the way, using underscores in names in Scala is not recommended), so when it tries to parse test5..#attr, the second regexp will match all of it!
scala> "[^(),]*".r findFirstIn "test5..#attr"
res0: Option[String] = Some(test5..#attr)
So there will be nothing left for the remaining parsers. The first regex in _headword is also problematic, because .* will accept quotes, which means that something like this becomes valid:
"test6 with a " inside of it..".#attr
As for look-ahead and look-behind, they don't affect parser combinators at all. Either the regex matches, or it doesn't -- that's all the parser combinators care about.
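The over-matching is easy to reproduce outside Scala. The sketch below uses Python's `re` module. The `HEADWORD` pattern is my own assumption about what the question wants, not something from the original code: a quoted headword may not contain quotes, and an unquoted one stops before the first `.#`.

```python
import re

# The pattern from the question: it swallows the attribute part as well.
NAIVE = re.compile(r"[^(),]*")

# Assumed alternative: quoted text without embedded quotes, or characters
# taken one at a time as long as ".#" does not start at that position.
HEADWORD = re.compile(r'"[^"]*"|(?:(?!\.#)[^(),])*')

def headword(text):
    """Return the headword prefix of text under the assumed pattern."""
    return HEADWORD.match(text).group(0)
```

`NAIVE.match("test5..#attr").group(0)` returns the whole string, while `headword("test5..#attr")` returns `"test5."`, leaving `.#attr` for the attribute parser. The lookahead only changes what the regex itself consumes. The combinators still just see a match or a failure.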

How to define syntax

I am new to language processing and I want to create a parser with Irony for the following syntax:
name1:value1 name2:value2 name3:value ...
where name1 is the name of an XML element and value1 is the value of that element, which can also include spaces.
I have tried to modify included samples like this:
public TestGrammar()
{
var name = CreateTerm("name");
var value = new IdentifierTerminal("value");
var queries = new NonTerminal("queries");
var query = new NonTerminal("query");
queries.Rule = MakePlusRule(queries, null, query);
query.Rule = name + ":" + value;
Root = queries;
}
private IdentifierTerminal CreateTerm(string name)
{
IdentifierTerminal term = new IdentifierTerminal(name, "!##$%^*_'.?-", "!##$%^*_'.?0123456789");
term.CharCategories.AddRange(new[]
{
UnicodeCategory.UppercaseLetter, //Ul
UnicodeCategory.LowercaseLetter, //Ll
UnicodeCategory.TitlecaseLetter, //Lt
UnicodeCategory.ModifierLetter, //Lm
UnicodeCategory.OtherLetter, //Lo
UnicodeCategory.LetterNumber, //Nl
UnicodeCategory.DecimalDigitNumber, //Nd
UnicodeCategory.ConnectorPunctuation, //Pc
UnicodeCategory.SpacingCombiningMark, //Mc
UnicodeCategory.NonSpacingMark, //Mn
UnicodeCategory.Format //Cf
});
//StartCharCategories are the same
term.StartCharCategories.AddRange(term.CharCategories);
return term;
}
but this doesn't work if the values include spaces. Can this be done (using Irony) without modifying the syntax (like adding quotes around values)?
Many thanks!
If newlines were included between key-value pairs, it would be easily achievable. I have no knowledge of "Irony", but my initial feeling is that almost no parser/lexer generator is going to deal with this given only a naive grammar description. This requires essentially unbounded lookahead.
Conceptually (because I know nothing about this product), here's how I would do it:
Tokenise based on spaces and colons (i.e. every contiguous sequence of characters that isn't a space or a colon is an "identifier" token of some sort).
You then need to make it such that every "sentence" is described from colon-to-colon:
sentence = identifier_list
| : identifier_list identifier : sentence
That's not enough to make it work, but you get the idea at least, I hope. You would need to be very careful to distinguish an identifier_list from a single identifier such that they could be parsed unambiguously. Similarly, if your tool allows you to define precedence and associativity, you might be able to get away with making ":" bind very tightly to the left, such that your grammar is simply:
sentence = identifier : identifier_list
And the behaviour of that needs to be (identifier :) identifier_list.
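To make the lookahead requirement concrete, here is a sketch in Python rather than Irony (I don't know how to express this in Irony). It uses a single regular expression, and `parse_pairs` is a hypothetical helper. A value is extended lazily until the next thing in the input is either another `name:` or the end of the string. It assumes that names contain no spaces or colons and that no word inside a value ends in a colon.

```python
import re

# name ':' value, where the value stops right before the next "name:" or the end.
PAIR = re.compile(r"([^\s:]+):(.*?)(?=\s+[^\s:]+:|\s*$)")

def parse_pairs(text):
    """Return (name, value) tuples; values may contain spaces."""
    return [(name, value) for name, value in PAIR.findall(text)]
```

For example, `parse_pairs("name1:value1 name2:value two name3:value")` returns `[("name1", "value1"), ("name2", "value two"), ("name3", "value")]`.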

<< and >> symbols in Erlang

First of all, I'm an Erlang rookie here. I need to interface with a MySQL database and I found the erlang-mysql-driver. I'm trying that out, and am a little confused by some of the syntax.
I can get a row of data from the database with this (greatly oversimplified for brevity here):
Result = mysql:fetch(P1, ["SELECT column1, column2 FROM table1 WHERE column2='", Key, "'"]),
case Result of
{data, Data} ->
case mysql:get_result_rows(Data) of
[] -> not_found;
Res ->
%% Now 'Res' has the row
So now here is an example of what `Res' has:
[[<<"value from column1">>, <<"value from column2">>]]
I get that it's a list of records. In this case, the query returned 1 row of 2 columns.
My question is:
What do the << and >> symbols mean? And what is the best (Erlang-recommended) syntax for turning a list like this into a record that I have defined like this:
-record(
my_record,
{
column1 = ""
,column2 = ""
}
).
Just a small note: the results are not bit string comprehensions per se, they are just bit strings. However, you can use bit string comprehensions to produce a sequence of bit strings (as described in the other answer, with generators and so on), much like lists and list comprehensions.
You can use erlang:binary_to_list/1 and erlang:list_to_binary/1 to convert between binaries and strings (lists).
The reason the MySQL driver returns bit strings is probably that they are much faster to manipulate.
In your specific example, you can do the conversion by matching on the returned column values, and then creating a new record like this:
case mysql:get_result_rows(Data) of
[] ->
not_found;
[[Col1, Col2]] ->
#my_record{column1 = Col1, column2 = Col2}
end
These are bit string comprehensions.
Bit string comprehensions are analogous to List Comprehensions. They are used to generate bit strings efficiently and succinctly.
Bit string comprehensions are written with the following syntax:
<< BitString || Qualifier1,...,QualifierN >>
BitString is a bit string expression, and each Qualifier is either a generator, a bit string generator or a filter.
• A generator is written as:
Pattern <- ListExpr.
ListExpr must be an expression which evaluates to a list of terms.
• A bit string generator is written as:
BitstringPattern <= BitStringExpr.
BitStringExpr must be an expression which evaluates to a bitstring.
• A filter is an expression which evaluates to true or false.
The variables in the generator patterns shadow variables in the function clause surrounding the bit string comprehensions.
A bit string comprehension returns a bit string, which is created by concatenating the results of evaluating BitString for each combination of bit string generator elements for which all filters are true.
Example:
1> << << (X*2) >> ||
<<X>> <= << 1,2,3 >> >>.
<<2,4,6>>
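For readers more familiar with Python, a rough analogue of the example above is a generator expression over a `bytes` value. This is only an analogy, not Erlang semantics in general, and `double_each` is a hypothetical name. The mask mirrors the way an 8-bit integer segment keeps only the low 8 bits.

```python
def double_each(data: bytes) -> bytes:
    """Double every byte, keeping only the low 8 bits of each result."""
    return bytes((x * 2) & 0xFF for x in data)
```

`double_each(bytes([1, 2, 3]))` gives `bytes([2, 4, 6])`, matching the `<<2,4,6>>` result in the shell session.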
