How to handle 'line-continuation' using parser combinators

I'm trying to write a small parser using the Sprache parser combinator library. The parser should be able to parse lines ended with a single \ as insignificant white space.
Question
How can I create a parser that can parse the values after the = sign that may contain a line-continuation character \?
For example
a = b\e,\
c,\
d
Should be parsed as (KeyValuePair (Key, 'a'), (Value, 'b\e, c, d')).
I'm new to using this library and parser combinators in general. So any pointers in the right direction are much appreciated.
What I have tried
Test
public class ConfigurationFileGrammerTest
{
[Theory]
[InlineData("x\\\n y", @"x y")]
public void ValueIsAnyStringMayContinuedAccrossLinesWithLineContinuation(
string input,
string expectedKey)
{
var key = ConfigurationFileGrammer.Value.Parse(input);
Assert.Equal(expectedKey, key);
}
}
Production
Attempt one
public static readonly Parser<string> Value =
from leading in Parse.WhiteSpace.Many()
from rest in Parse.AnyChar.Except(Parse.Char('\\')).Many()
.Or(Parse.String("\\\n")
.Then(chs => Parse.Return(chs))).Or(Parse.AnyChar.Except(Parse.LineEnd).Many())
select new string(rest.ToArray()).TrimEnd();
Test output
Xunit.Sdk.EqualException: Assert.Equal() Failure
↓ (pos 1)
Expected: x y
Actual: x\
↑ (pos 1)
Attempt two
public static readonly Parser<string> SingleLineValue =
from leading in Parse.WhiteSpace.Many()
from rest in Parse.AnyChar.Many().Where(chs => chs.Count() < 2 || !(string.Join(string.Empty, chs.Reverse().Take(2)).Equals("\\\n")))
select new string(rest.ToArray()).TrimEnd();
public static readonly Parser<string> ContinuedValueLines =
from firsts in ContinuedValueLine.AtLeastOnce()
from last in SingleLineValue
select string.Join(" ", firsts) + " " + last;
public static readonly Parser<string> Value = SingleLineValue.Once().XOr(ContinuedValueLines.Once()).Select(s => string.Join(" ", s));
Test output
Xunit.Sdk.EqualException: Assert.Equal() Failure
↓ (pos 1)
Expected: x y
Actual: x\\n y
↑ (pos 1)

The only problem in your last unit test is that the line continuation ends up in the output. When you parse the continuation \\\n, you must drop it from the result, for example by returning the empty string in its place. Sorry, I don't know exactly how to do that with Sprache in C#. Maybe something like this:
Parse.String("\\\n").Then(chs => Parse.Return(""))
I solved the problem using the combinatorix Python library, which is also a parser combinator library. Its API uses functions instead of chained methods, but the idea is the same.
Here is the full code with comments:
# `apply` returns a parser that doesn't consume the input stream. It
# applies a function (or lambda) to the output of another parser.
# The following parser removes whitespace from the beginning
# and the end of what is parsed.
strip = apply(lambda x: x.strip())
# parse a single equal character
equal = char('=')
# parse the key part of a configuration line. Since the API is
# functional it reads "inside-out". Note the use of the special
# `unless(predicate, parser)` parser. It is sometimes missing from
# parser combinator libraries. It applies `parser` to the input
# stream only if the `predicate` parser fails, which allows
# conditional parsing. It's similar in spirit to negation in Prolog.
# Here it parses *anything until an equal sign*, "joins" the characters
# into a string and strips any leading or trailing space.
key = strip(join(one_or_more(unless(equal, anything))))
# parse a single newline character
eol = char('\n')
# returns a parser that returns the empty string; this is a constant
# parser (i.e. it always outputs the same thing).
return_empty_space = apply(lambda x: '')
# This parses a full continuation (i.e. including the spaces
# starting the new line). It parses *the continuation string, then
# zero or more spaces* and returns the empty string.
continuation = return_empty_space(sequence(string('\\\n'), zero_or_more(char(' '))))
# `value` is the parser for the value part. Unless the current char
# is an `eol` (i.e. \n), it tries to parse a continuation, otherwise it
# parses anything. It does that at least once, i.e. the value cannot be
# empty. Then it "joins" all the chars into a single string and
# "strips" any space that starts or ends the value.
value = strip(join(one_or_more(unless(eol, either(continuation, anything)))))
# this basically, remove the element at index 1 and only keep the
# elements at 0 and 2 in the result. See below.
kv_apply = apply(lambda x: (x[0], x[2]))
# This is the final parser for a given kv pair. A kv pair is:
#
# - a key part (see key parser)
# - an equal part (see equal parser)
# - a value part (see value parser)
#
# Those are used to parse the input stream in sequence (one after the
# other). It will return three values: key, a '=' char and a value.
# `kv_apply` will only keep the key and value part.
kv = kv_apply(sequence(key, equal, value))
# This is syntactic sugar, which turns the string into a stream of chars
# and executes the `kv` parser on it.
parser = lambda string: combinatorix(string, kv)
input = 'a = b\\e,\\\n c,\\\n d'
assert parser(input) == ('a', 'b\\e,c,d')
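For readers without combinatorix installed, the same idea can be sketched with a few hand-written combinators in plain Python. This is only a minimal illustration, not the combinatorix API: the names `continuation`, `either`, `many_joined` and `parse_value` are made up for this sketch. It also assumes that a continuation should collapse to a single space, which is what the question's test ("x y") expects, whereas the combinatorix version above drops it entirely.

```python
from typing import Callable, Optional, Tuple

# A parser maps (text, position) to (value, new_position), or None on failure.
Parser = Callable[[str, int], Optional[Tuple[str, int]]]


def continuation(text: str, pos: int) -> Optional[Tuple[str, int]]:
    """Match backslash-newline plus any following spaces; yield one space."""
    if not text.startswith("\\\n", pos):
        return None
    pos += 2
    while pos < len(text) and text[pos] == " ":
        pos += 1
    return " ", pos


def any_char_except_newline(text: str, pos: int) -> Optional[Tuple[str, int]]:
    """Match any single character that is not a newline."""
    if pos < len(text) and text[pos] != "\n":
        return text[pos], pos + 1
    return None


def either(first: Parser, second: Parser) -> Parser:
    """Try `first`; if it fails, try `second` at the same position."""
    def parse(text: str, pos: int) -> Optional[Tuple[str, int]]:
        result = first(text, pos)
        return result if result is not None else second(text, pos)
    return parse


def many_joined(parser: Parser) -> Parser:
    """Apply `parser` zero or more times and join the pieces into a string."""
    def parse(text: str, pos: int) -> Optional[Tuple[str, int]]:
        pieces = []
        while True:
            result = parser(text, pos)
            if result is None:
                break
            piece, pos = result
            pieces.append(piece)
        return "".join(pieces), pos
    return parse


# The continuation must be tried first, otherwise the backslash is
# consumed as an ordinary character and the newline ends the value.
value = many_joined(either(continuation, any_char_except_newline))


def parse_value(text: str) -> str:
    parsed, _ = value(text, 0)
    return parsed.strip()
```

The ordering inside `either` is the important design choice: a lone backslash (as in `b\e`) still falls through to the ordinary-character branch, while a backslash directly followed by a newline is swallowed by `continuation`.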

Related

Creating a Cipher Code in Ruby

I'm tasked with creating a Caesar cipher for a project I am working on. A Caesar cipher takes each letter in a string of text and replaces it with a letter a fixed number of places away from it (dictated by the cipher key). For instance if my text is "cat" and my cipher key is 3, my new word would be "fdw" (I'm assuming positive numbers move the letters to the right). I've been able to get my code to solve correctly for most strings of text, but I am finding that if my string includes > ? or @ it will not work. Their ASCII codes are 62, 63 and 64 if that helps. Any input is appreciated!
def caesar_cipher(str, num)
strArray = str.split('')
cipherArray = strArray.collect do |letter|
letter = letter.ord
if (65...90).include?(letter + num) || (97...122).include?(letter + num)
letter = letter + num
elsif (91...96).include?(letter + num) || (123...148).include?(letter + num)
letter = (letter - 26) + num
else
letter
end
end
cipherArray = cipherArray.collect {|x| x.chr}
cipher = cipherArray.join('')
end
caesar_cipher("Am I ill-prepared for this challenge?", 3)
# ord 62-64 DON'T work: >, ?, @
You should create an alphabet variable. If you work directly with character codes at both ends of the range, you run into two problems: negative numbers, and ASCII codes that don't correspond to a letter. You can handle this with the modulo operator % or with a single subtraction.
alphabet = "abcde"
text_to_cipher = "aaee" => 0044 # each letter's position in the alphabet variable
key = 3
The result will be 3377 => "dd¡?" or some other symbols, since 7 is past the end of the string "abcde". The same thing happens with ASCII at its ends.
With the modulo operator, you can restrict that:
size_of_your_alphabet = 5 # For this case
7 % size_of_your_alphabet = 2
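The alphabet-plus-modulo idea can be sketched in a few lines of Python (used here only for illustration; the same arithmetic carries over to Ruby). The function name and the decision to leave every non-letter untouched are assumptions of this sketch, not part of the original code.

```python
import string


def caesar_cipher(text: str, key: int) -> str:
    """Shift letters by `key` positions, wrapping around with modulo."""
    result = []
    for ch in text:
        if ch in string.ascii_lowercase:
            alphabet = string.ascii_lowercase
        elif ch in string.ascii_uppercase:
            alphabet = string.ascii_uppercase
        else:
            # Punctuation, digits and spaces are copied unchanged.
            result.append(ch)
            continue
        # The modulo keeps the index inside the alphabet, also for
        # negative keys.
        shifted = (alphabet.index(ch) + key) % len(alphabet)
        result.append(alphabet[shifted])
    return "".join(result)
```

Because the index is always reduced modulo the alphabet length, characters such as > ? and @ never get confused with shifted letters, and a negative key decodes what a positive key encoded.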
The Ruby builtin tr is ideal to implement substitution ciphers.
Step 1: assemble the characters you wish to transform.
chars = ["A".."Z", "a".."z", ">".."@"].flat_map(&:to_a)
Step 2: create a 1:1 array of the transformed characters
transformed = chars.map{|c| (c.ord + 3).chr}
Step 3: apply tr to transform the string.
str.tr(chars.join, transformed.join)
Full working example:
def caesar_cipher(str, num)
chars = ["A".."Z", "a".."z", ">".."@"].flat_map(&:to_a)
transformed = chars.map{|c| (c.ord + num).chr}
str.tr(chars.join, transformed.join)
end
Output:
> caesar_cipher("Am I ill-prepared for this challenge?", 3)
#=> "Dq L mpp-tvitevih jsv xlmw gleppirkiC"
Notes:
Most substitution ciphers actually rely on letter rotation, not ASCII values. My initial assumption was that you wanted a rotation, e.g. ("a".."z").to_a.rotate(num). See prior edit for a working example of that.
You can use ranges in tr() to create a really simple Caesar cipher: str.tr('A-Za-z','B-ZAb-za')
Edit: Also, because of the range specification feature, the \ is an escape character so that you can use literals like -. See this SO answer for details. I think the above exhibits a bug due to this, because it contains a \ which should be escaped by another \.

Parsing a string containing any chars

I'm trying to get the following to work. I have strings that are inside parentheses. The strings can contain any characters, so the string that I want to parse can itself contain parentheses. I think the regex currently also matches the last parenthesis, which is supposed to be matched by <~ ")", and thus the parsing fails. What am I missing here?
private def parser: Parser[Any] = a ~ b ~ c ^^ {
<do stuff here>
}
private def a: Parser[String] = "\"[^\"]*\"".r | "[^(),>]*".r
private def b: Parser[String] = opt("(" ~> ".*".r <~ ")") ^^ {
case Some(y) => y.trim
case None => ""
}
private def c: Parser[String] = rep(".@" ~> "[^>.]*".r) ^^ (new String(_).trim)
This is supposed to parse following kind of strings:
test0
test1.@attr
"test2"
"test3".@attr
test4..
test5..@attr
"test6..".@attr
"test7.@attr".@attr
test8(icl>uw)
test9(icl>uw).@attr
"test10..().@"(icl>uw).@attr
test11(icl>uw(agt>uw2,obj>uw3),icl>uw4(agt>uw5))
test12(icl>uw1(agt>uw2,obj>uw3),icl>uw4).@attr1.@attr2
test13(agt>thing,obj>role>effect)
So the "a" parser parses the string until open parentheses or .@attr part. "b" parser parses the characters inside optional parentheses. "c" parses the optional .@attrs.
Currently I get similar error on all test strings containing parentheses part:
11:07:44.662 [main] DEBUG - Parsed: test8()
11:07:44.667 [main] ERROR - FAILURE parsing: test8(icl>uw) -- `)' expected but `i' found
So I assume that the parser parsed the first part correctly, but failed when it saw the parentheses part.
The right solution to parse nested structures is to use recursion, for example in the following fashion:
import scala.annotation.tailrec
val parser = "regex".r
@tailrec
def extract(string: String, foundTokens: List[String] = List.empty): List[String] = {
parser.findFirstMatchIn(string) match {
case Some(parser(matchedValue)) => extract(matchedValue, matchedValue :: foundTokens)
case None => foundTokens
}
}
At each call, you prepend the found token to the list of results and call the function again on the matched text. When nothing more matches, you return the accumulated tokens.
If multiple matches are possible inside each subtoken, then you should look for a procedure like this one:
def extract(string:String):Iterator[String]={
parser.findAllIn(string).flatMap{
item => extract(item)
}
}
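A regular expression alone cannot tell which closing parenthesis belongs to which opening one. As a small illustration of the alternative, here is a Python sketch that finds the first balanced group by counting nesting depth. It is a swapped-in technique (a depth counter instead of the recursive regex extraction above), the function name is hypothetical, and it deliberately ignores quoted headwords that contain parentheses themselves.

```python
from typing import Optional, Tuple


def split_parenthesized(text: str) -> Optional[Tuple[str, str, str]]:
    """Split 'head(inner)rest' at the first balanced pair of parentheses.

    Returns (head, inner, rest), or None if the parentheses are unbalanced.
    """
    start = text.find("(")
    if start == -1:
        # No parenthesized part at all.
        return text, "", ""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                # Found the parenthesis that closes the first one.
                return text[:start], text[start + 1:i], text[i + 1:]
    return None
```

The `inner` part can be passed through the same function again (after splitting on top-level commas) to walk deeper into the nesting.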

Unable to parse a complex language with regex and Scala parser combinators

I'm trying to write a parser for a certain language as part of my research. Currently I have problems getting the following code to work in a way I want:
private def _uw: Parser[UW] = _headword ~ _modifiers ~ _attributes ^^ {
case hw ~ mods ~ attrs => new UW(hw, mods, attrs)
}
private def _headword[String] = "\".*\"".r | "[^(),]*".r
private def _modifiers: Parser[List[UWModifier]] = opt("(" ~> repsep(_modifier, ",") <~ ")") ^^ {
case Some(mods) => mods
case None => List[UWModifier]()
}
private def _modifier: Parser[UWModifier] = ("[^><]*".r ^^ (RelTypes.toRelType(_))) ~ "[><]".r ~ _uw ^^ {
case (rel: RelType) ~ x ~ (uw: UW) => new UWModifier(rel, uw)
}
private def _attributes: Parser[List[UWAttribute]] = rep(_attribute) ^^ {
case Nil => List[UWAttribute]()
case attrs => attrs
}
private def _attribute: Parser[UWAttribute] = ".@" ~> "[^>.]*".r ^^ (new UWAttribute(_))
The above code contains just one part of the language, and to spare time and space, I won't go much into details about the whole language. _uw method is supposed to parse a string that consists of three parts, although just the first part must exist in the string.
_uw should be able to parse these test strings correctly:
test0
test1.@attr
"test2"
"test3".@attr
test4..
test5..@attr
"test6..".@attr
"test7.@attr".@attr
test8(urel>uw)
test9(urel>uw).@attr
"test10..().@"(urel>uw).@attr
test11(urel1>uw1(urel2>uw2,urel3>uw3),urel4>uw4).@attr1.@attr2
So if the headword starts and ends with ", everything inside the double quotes is considered to be part of the headword. All words starting with .@, if they are not inside the double quotes, are attributes of the headword.
E.g. in test5, the parser should parse test5. as headword, and attr as an attribute. Just .@ is omitted, and all dots before that should be contained in the headword.
So, after headword there CAN be attributes and/or modifiers. The order is strict, so attributes always come after modifiers. If there are attributes but no modifiers, everything until .@ is considered as part of the headword.
The main problem is "[^@(]*".r. I've tried all kind of creative alternatives, such as "(^[\\w\\.]*)((\\.\\@)|$)".r, but nothing seems to work. How does lookahead or lookbehind even affect parser combinators? I'm not an expert on parsing or regex, so all help is welcome!
I don't think "[^@(]*".r has anything to do with your problem. I see this:
private def _headword[String] = "\".*\"".r | "[^(),]*".r
which is the first thing in _uw (and, by the way, using underscores in names in Scala is not recommended), so when it tries to parse test5..@attr, the second regexp will match all of it!
scala> "[^(),]*".r findFirstIn "test5..@attr"
res0: Option[String] = Some(test5..@attr)
So there will be nothing left for the remaining parsers. Also, the first regex in _headword is also problematic, because .* will accept quotes, which means that something like this becomes valid:
"test6 with a " inside of it..".@attr
As for look-ahead and look-behind, it doesn't affect parser combinators at all. Either the regex matches, or it doesn't -- that's all the parser combinators care about.
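The greedy-match problem is easy to reproduce outside Scala. The Python sketch below first shows the original headword pattern swallowing the whole input, then one possible alternative: a quoted form that forbids inner quotes, or a lazy match that stops before the attribute marker, an opening parenthesis, or the end of input. The alternative pattern is only a suggestion under the assumption that attributes start with `.@`; it has not been checked against the full language.

```python
import re

ATTR_MARKER = ".@"  # assumed attribute marker

# The unquoted pattern from the question: it happily consumes ".@attr" too.
GREEDY_HEADWORD = re.compile(r'[^(),]*')

# Quoted headword without inner quotes, or a lazy run of characters that
# must be followed by the attribute marker, "(" or the end of the input.
HEADWORD = re.compile(
    r'"[^"]*"|[^(),]*?(?=' + re.escape(ATTR_MARKER) + r'|\(|$)'
)


def headword(text: str) -> str:
    """Return the headword prefix of `text` according to HEADWORD."""
    match = HEADWORD.match(text)
    return match.group(0) if match else ""
```

The lookahead only decides where the regex match ends; from the parser combinator's point of view the regex still simply matches a prefix, and the next parser continues from there.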

How to retrieve value from optional parser in Parsec?

Sorry if it's a novice question - I want to parse something defined by
Exp ::= Mandatory_Part Optional_Part0 Optional_Part1
I thought I could do this:
proc::Parser String
proc = do {
;str<-parserMandatoryPart
;str0<-optional(parserOptionalPart0) --(1)
;str1<-optional(parserOptionalPart1) --(2)
;return str++str0++str1
}
I want to get str0/str1 if optional parts are present, otherwise, str0/str1 would be "".
But (1) and (2) won't work since optional() doesn't allow extracting result from its parameters, in this case, parserOptionalPart0/parserOptionalPart1.
Now What would be the proper way to do it?
Many thanks!
Billy R
The function you're looking for is optionMaybe. It returns Nothing if the parser fails without consuming input, and wraps the result in Just if the parser succeeds.
From the docs:
option x p tries to apply parser p. If p fails without consuming input, it returns the value x, otherwise the value returned by p.
So you could do:
proc :: Parser String
proc = do
str <- parserMandatoryPart
str0 <- option "" parserOptionalPart0
str1 <- option "" parserOptionalPart1
return (str++str0++str1)
Watch out for the "without consuming input" part. You may need to wrap either or both optional parsers with try.
I've also adjusted your code style to be more standard, and fixed an error on the last line. return isn't a keyword; it's an ordinary function. So return a ++ b is (return a) ++ b, i.e. almost never what you want.
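To make the behaviour of option concrete for readers who don't use Haskell, here is a rough Python model. It is not Parsec: the `literal`, `option` and `proc` helpers are hypothetical, and because a failing parser here never moves the position, this `option` behaves like `option x (try p)` rather than plain `option x p`.

```python
from typing import Callable, Optional, Tuple

# A parser maps (text, position) to (value, new_position), or None on failure.
Parser = Callable[[str, int], Optional[Tuple[str, int]]]


def literal(expected: str) -> Parser:
    """Parser that matches exactly `expected`."""
    def parse(text: str, pos: int) -> Optional[Tuple[str, int]]:
        if text.startswith(expected, pos):
            return expected, pos + len(expected)
        return None
    return parse


def option(default: str, parser: Parser) -> Parser:
    """Run `parser`; on failure, succeed with `default` and consume nothing."""
    def parse(text: str, pos: int) -> Optional[Tuple[str, int]]:
        result = parser(text, pos)
        return result if result is not None else (default, pos)
    return parse


def proc(text: str) -> Optional[str]:
    """Mandatory part followed by two optional parts, concatenated."""
    parsers = (
        literal("main"),
        option("", literal("-opt0")),
        option("", literal("-opt1")),
    )
    pos = 0
    parts = []
    for parser in parsers:
        result = parser(text, pos)
        if result is None:
            return None
        piece, pos = result
        parts.append(piece)
    return "".join(parts)
```

Since a missing optional part contributes the empty string, the final concatenation needs no special cases.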

<< and >> symbols in Erlang

First of all, I'm an Erlang rookie here. I need to interface with a MySQL database and I found the erlang-mysql-driver. I'm trying that out, and am a little confused by some of the syntax.
I can get a row of data from the database with this (greatly oversimplified for brevity here):
Result = mysql:fetch(P1, ["SELECT column1, column2 FROM table1 WHERE column2='", Key, "'"]),
case Result of
{data, Data} ->
case mysql:get_result_rows(Data) of
[] -> not_found;
Res ->
%% Now 'Res' has the row
So now here is an example of what `Res' has:
[[<<"value from column1">>, <<"value from column2">>]]
I get that it's a list of records. In this case, the query returned 1 row of 2 columns.
My question is:
What do the << and >> symbols mean? And what is the best (Erlang-recommended) syntax for turning a list like this into a record which I have defined like:
-record(
my_record,
{
column1 = ""
,column2 = ""
}
).
Just a small note: the results are not bit string comprehensions per se, they are just bit strings. However, you can use bit string comprehensions to produce a sequence of bit strings (as described in the other answer, with its generators and filters), much like lists and list comprehensions.
You can use erlang:binary_to_list/1 and erlang:list_to_binary/1 to convert between binaries and strings (lists).
The reason the mysql driver returns bit strings is probably because they are much faster to manipulate.
In your specific example, you can do the conversion by matching on the returned column values, and then creating a new record like this:
case mysql:get_result_rows(Data) of
[] ->
not_found;
[[Col1, Col2]] ->
#my_record{column1 = Col1, column2 = Col2}
end
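If it helps to think in terms of another language, the relationship between an Erlang binary and an Erlang string (a list of character codes) is loosely similar to the relationship between `bytes` and a list of integers in Python. The snippet below is only an analogy, not Erlang behaviour, and the UTF-8 decoding is an assumption about the column encoding.

```python
# A value as the driver might return it: raw bytes, like <<"value from column1">>.
raw = b"value from column1"

# Roughly what binary_to_list/1 does: one integer per byte.
as_list = list(raw)

# Roughly what list_to_binary/1 does: pack the integers back into bytes.
round_trip = bytes(as_list)

# Turning the bytes into text requires choosing an encoding (assumed UTF-8 here).
text = raw.decode("utf-8")
```

The byte form is compact and cheap to pass around, while the list or text form is convenient for character-level processing.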
These are bit string comprehensions.
Bit string comprehensions are analogous to List Comprehensions. They are used to generate bit strings efficiently and succinctly.
Bit string comprehensions are written with the following syntax:
<< BitString || Qualifier1,...,QualifierN >>
BitString is a bit string expression, and each Qualifier is either a generator, a bit string generator or a filter.
• A generator is written as:
Pattern <- ListExpr.
ListExpr must be an expression which evaluates to a list of terms.
• A bit string generator is written as:
BitstringPattern <= BitStringExpr.
BitStringExpr must be an expression which evaluates to a bitstring.
• A filter is an expression which evaluates to true or false.
The variables in the generator patterns shadow variables in the function clause surrounding the bit string comprehensions.
A bit string comprehension returns a bit string, which is created by concatenating the results of evaluating BitString for each combination of bit string generator elements for which all filters are true.
Example:
1> << << (X*2) >> ||
<<X>> <= << 1,2,3 >> >>.
<<2,4,6>>
