How can I tell pest.rs to flatten grammar?

Let's say I have a rule like,
key = { ASCII_ALPHA ~ ( ASCII_ALPHA | "_" )+ }
value = { (!NEWLINE ~ ANY)+ }
keyvalue = { key ~ "=" ~ value? }
option = { key }
This supports

K=V
K=
K

which lets me set or unset a key, and specify an option. What I don't like is that the syntax for an option produces an AST like this:
Pair {
    rule: option,
    span: Span {
        str: "check_local_user",
        start: 302,
        end: 318,
    },
    inner: [
        Pair {
            rule: key,
            span: Span {
                str: "check_local_user",
                start: 302,
                end: 318,
            },
            inner: [],
        },
    ],
}
I don't like that my option has an inner pair for key; I just want option to have the same grammar as key. Is there any method in Pest.rs to write the grammar such that

inner = { myStuff }
outer = { inner }

gets flattened to

outer = { myStuff }

Using an atomic rule (the @ modifier), I could accomplish this:

option = @{ key }

It's documented as:

Any rules called by atomic rules do not generate token pairs.
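
To illustrate, here is a minimal sketch (assuming a standard pest_derive setup; the parser struct name and grammar file are hypothetical) showing that the flattened option pair has no children:

use pest::Parser;
use pest_derive::Parser;

#[derive(Parser)]
#[grammar = "grammar.pest"] // the rules above, with option = @{ key }
struct ConfigParser;

fn main() {
    let pair = ConfigParser::parse(Rule::option, "check_local_user")
        .expect("parse failed")
        .next()
        .unwrap();
    // The atomic modifier suppresses the inner key pair:
    assert_eq!(pair.as_rule(), Rule::option);
    assert_eq!(pair.into_inner().count(), 0);
}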

Related

How to write a map to a YAML file in Dart

I have a map of key-value pairs in Dart, and I want to convert it to YAML and write it into a file.
I tried using the yaml package from the Dart libraries, but it only provides methods to load YAML data from a file; nothing is mentioned about how to write it back to a YAML file.
Here is an example:
void main() {
  var map = {
    "name": "abc",
    "type": "unknown",
    "internal": {
      "name": "xyz"
    }
  };
  print(map);
}
Expected output:
example.yaml
name: abc
type: unknown
internal:
  name: xyz
How to convert the dart map to YAML and write it to a file?
It's a bit of a late response, but for anyone else looking at this question, I have written this class. It may not be perfect, but it works for what I'm doing, and I haven't found anything wrong with it yet. I might make it a package eventually, after writing tests.
class YamlWriter {
  /// The amount of spaces for each level.
  final int spaces;

  /// Initialize the writer with the amount of [spaces] per level.
  YamlWriter({
    this.spaces = 2,
  });

  /// Write a dart structure to a YAML string. [yaml] should be a [Map] or [List].
  String write(dynamic yaml) {
    return _writeInternal(yaml).trim();
  }

  /// Write a dart structure to a YAML string. [yaml] should be a [Map] or [List].
  String _writeInternal(dynamic yaml, { int indent = 0 }) {
    String str = '';

    if (yaml is List) {
      str += _writeList(yaml, indent: indent);
    } else if (yaml is Map) {
      str += _writeMap(yaml, indent: indent);
    } else if (yaml is String) {
      str += "\"${yaml.replaceAll("\"", "\\\"")}\"";
    } else {
      str += yaml.toString();
    }

    return str;
  }

  /// Write a list to a YAML string.
  /// Pass the list in as [yaml] and indent it to the [indent] level.
  String _writeList(List yaml, { int indent = 0 }) {
    String str = '\n';

    for (var item in yaml) {
      str += "${_indent(indent)}- ${_writeInternal(item, indent: indent + 1)}\n";
    }

    return str;
  }

  /// Write a map to a YAML string.
  /// Pass the map in as [yaml] and indent it to the [indent] level.
  String _writeMap(Map yaml, { int indent = 0 }) {
    String str = '\n';

    for (var key in yaml.keys) {
      var value = yaml[key];
      str += "${_indent(indent)}${key.toString()}: ${_writeInternal(value, indent: indent + 1)}\n";
    }

    return str;
  }

  /// Create an indented string for the level with the spaces config.
  /// [indent] is the level of indent whereas [spaces] is the
  /// amount of spaces that the string should be indented by.
  String _indent(int indent) {
    return ''.padLeft(indent * spaces, ' ');
  }
}
Usage:
final writer = YamlWriter();

String yaml = writer.write({
  'string': 'Foo',
  'int': 1,
  'double': 3.14,
  'boolean': true,
  'list': [
    'Item One',
    'Item Two',
    true,
    'Item Four',
  ],
  'map': {
    'foo': 'bar',
    'list': ['Foo', 'Bar'],
  },
});

File file = File('/path/to/file.yaml');
file.createSync();
file.writeAsStringSync(yaml);
Output:
string: "Foo"
int: 1
double: 3.14
boolean: true
list:
- "Item One"
- "Item Two"
- true
- "Item Four"
map:
foo: "bar"
list:
- "Foo"
- "Bar"
package:yaml does not have YAML writing features. You may have to look for another package that does that – or write your own.
As a stopgap, remember that JSON is valid YAML, so you can always write out JSON to a .yaml file and it should work with any YAML parser.
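For instance, a minimal sketch of that stopgap (the file name is just an example):

import 'dart:convert';
import 'dart:io';

void main() {
  var map = {
    'name': 'abc',
    'internal': {'name': 'xyz'}
  };
  // JSON is a subset of YAML 1.2, so any YAML parser can read this file.
  File('example.yaml').writeAsStringSync(jsonEncode(map));
}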
I ran into the same issue and ended up hacking together a simple writer:
// Save the updated configuration settings to the config file
void saveConfig() {
  var file = _configFile;

  // truncate existing configuration
  file.writeAsStringSync('');

  // Write out new YAML document from JSON map
  final config = configToJson();
  config.forEach((key, value) {
    if (value is Map) {
      file.writeAsStringSync('\n$key:\n', mode: FileMode.writeOnlyAppend);
      value.forEach((subkey, subvalue) {
        file.writeAsStringSync(' $subkey: $subvalue\n',
            mode: FileMode.writeOnlyAppend);
      });
    } else {
      file.writeAsStringSync('$key: $value\n',
          mode: FileMode.writeOnlyAppend);
    }
  });
}

protobuf text format parsing maps

This answer clearly shows some examples of proto text parsing, but does not have an example for maps.
If a proto has:
map<int32, string> aToB
I would guess something like:
aToB {
  123: "foo"
}
but it does not work. Does anyone know the exact syntax?
I initially tried extrapolating from an earlier answer, which led me astray, because I incorrectly thought multiple k/v pairs would look like this:
aToB {            # (this example has a bug)
  key: 123
  value: "foo"
  key: 876        # WRONG!
  value: "bar"    # NOPE!
}
That led to the following error:
libprotobuf ERROR: Non-repeated field "key" is specified multiple times.
Proper syntax for multiple key-value pairs:
(Note: I am using the "proto3" version of the protocol buffers language)
aToB {
  key: 123
  value: "foo"
}
aToB {
  key: 876
  value: "bar"
}
The pattern of repeating the name of the map variable makes more sense after re-reading this relevant portion of the proto3 Map documentation, which explains that maps are equivalent to defining your own "pair" message type and then marking it as "repeated".
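In other words, the map field from the question behaves as if it had been declared with a hand-written pair message along these lines (the entry-message name and field numbers here are illustrative, not actual generated code):

message AToBEntry {
  int32 key = 1;
  string value = 2;
}

repeated AToBEntry aToB = 1;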
A more complete example:
proto definition:
syntax = "proto3";
package myproject.testing;
message UserRecord {
string handle = 10;
bool paid_membership = 20;
}
message UserCollection {
string description = 20;
// HERE IS THE PROTOBUF MAP-TYPE FIELD:
map<string, UserRecord> users = 10;
}
message TestData {
UserCollection user_collection = 10;
}
text format ("pbtxt") in a config file:
user_collection {
  description: "my default users"
  users {
    key: "user_1234"
    value {
      handle: "winniepoo"
      paid_membership: true
    }
  }
  users {
    key: "user_9b27"
    value {
      handle: "smokeybear"
    }
  }
}
C++ that would generate the message content programmatically:

myproject::testing::UserRecord user_1;
user_1.set_handle("winniepoo");
user_1.set_paid_membership(true);

myproject::testing::UserRecord user_2;
user_2.set_handle("smokeybear");
user_2.set_paid_membership(false);

using pair_type =
    google::protobuf::MapPair<std::string, myproject::testing::UserRecord>;

myproject::testing::TestData data;
data.mutable_user_collection()->mutable_users()->insert(
    pair_type(std::string("user_1234"), user_1));
data.mutable_user_collection()->mutable_users()->insert(
    pair_type(std::string("user_9b27"), user_2));
The text format is:

aToB {
  key: 123
  value: "foo"
}

Why are my syntax errors in Jison not being "propagated"?

This is the code that I have:
%lex
%options flex

%{
// Used to store the parsed data
if (!('regions' in yy)) {
    yy.regions = {
        settings: {},
        tables: [],
        relationships: []
    };
}
%}

text                [a-zA-Z][a-zA-Z0-9]*

%%

\n\s*               return 'NEWLINE';
[^\S\n]+            ; // ignore whitespace other than newlines
"."                 return '.';
","                 return ',';
"-"                 return '-';
"="                 return '=';
"=>"                return '=>';
"<="                return '<=';
"["                 return '[';
"settings]"         return 'SETTINGS';
"tables]"           return 'TABLES';
"relationships]"    return 'RELATIONSHIPS';
"]"                 return ']';
{text}              return 'TEXT';
<<EOF>>             return 'EOF';

/lex

%left ','

%start source

%%

source
    : content EOF
        {
            console.log(yy.regions);
            console.log("\n" + JSON.stringify(yy.regions));
            return yy.regions;
        }
    | NEWLINE content EOF
        {
            console.log(yy.regions);
            console.log("\n" + JSON.stringify(yy.regions));
            return yy.regions;
        }
    | NEWLINE EOF
    | EOF
    ;

content
    : '[' section content
    | '[' section
    ;

section
    : SETTINGS NEWLINE settings_content
    | TABLES NEWLINE tables_content
    | RELATIONSHIPS NEWLINE relationships_content
    ;

settings_content
    : settings_line NEWLINE settings_content
    | settings_line NEWLINE
    | settings_line
    ;

settings_line
    : text '=' text
        { yy.regions.settings[$1] = $3; }
    ;

tables_content
    : tables_line NEWLINE tables_content
    | tables_line NEWLINE
    | tables_line
    ;

tables_line
    : table_name
        { yy.regions.tables.push({ name: $table_name, fields: [] }); }
    | field_list
        {
            var tableCount = yy.regions.tables.length;
            var tableIndex = tableCount - 1;
            yy.regions.tables[tableIndex].fields.push($field_list);
        }
    ;

table_name
    : '-' text
        { $$ = $text; }
    ;

field_list
    : text
        { $$ = []; $$.push($text); }
    | field_list ',' text
        { $field_list.push($text); $$ = $field_list; }
    ;

relationships_content
    : relationships_line NEWLINE relationships_content
    | relationships_line NEWLINE
    | relationships_line
    ;

relationships_line
    : relationship_key '=>' relationship_key
        {
            yy.regions.relationships.push({
                pkTable: $1,
                fkTable: $3
            });
        }
    | relationship_key '<=' relationship_key
        {
            yy.regions.relationships.push({
                pkTable: $3,
                fkTable: $1
            });
        }
    ;

relationship_key
    : text '.' text
        { $$ = { name: $1, field: $3 }; }
    | text
        { $$ = { name: $1 }; }
    ;

text
    : TEXT
        { $$ = $TEXT; }
    ;
It's used to parse this kind of code:
[settings]
DefaultFieldType = string
[tables]
-table1
id, int, PK
username, string, NULL
password, string
-table2
id, int, PK
itemName, string
itemCount, int
[relationships]
table1 => table2
foo.test => bar.test2
Into this kind of JSON:
{ settings: { DefaultFieldType: 'string' },
  tables:
   [ { name: 'table1', fields: [Object] },
     { name: 'table2', fields: [Object] } ],
  relationships:
   [ { pkTable: [Object], fkTable: [Object] },
     { pkTable: [Object], fkTable: [Object] } ] }
However, I don't get syntax errors. When I go to the Jison demo and try to parse 5*PI 3^2, I get the following error:
Parse error on line 1:
5*PI 3^2
-----^
Expecting 'EOF', '+', '-', '*', '/', '^', ')', got 'NUMBER'
which is expected. But when I change the last line of the code which I wish to parse from:
foo.test => bar.test2
to something like
foo.test => a bar.test2
I get the following error:
throw new _parseError(str, hash);
^
TypeError: Function.prototype.toString is not generic
I traced this to the generated parser code which looks like this:
if (hash.recoverable) {
    this.trace(str);
} else {
    function _parseError (msg, hash) {
        this.message = msg;
        this.hash = hash;
    }
    _parseError.prototype = Error;
    throw new _parseError(str, hash);
}
So this leads me to believe that there is something wrong with how I structured my code and how I handled parsing, but I have no idea what that might be.
It seems like it might have something to do with error recovery. If that is correct, how is it supposed to be used? Am I supposed to add the error rule upwards through every element, all the way to the source root?
Your grammar seems to work as expected on the Jison demo page, at least with the browser I'm using (Firefox 46.0.1). Judging from the amount of activity in the git repository around the code that you cite, I suspect that the version of jison you are using has one of these bugs:
https://github.com/zaach/jison/issues/328
https://github.com/zaach/jison/issues/318
I think the jison version on the demo page is older, not newer, so if grabbing the current code from github doesn't work, you could try using an older version.
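If switching versions doesn't help, a possible workaround (my own suggestion, not part of the answer above) is to shadow the generated parseError, whose broken default body is exactly the snippet quoted in the question, before calling parse:

// Hypothetical workaround: replace the generated parser's parseError so
// unrecoverable errors throw a plain Error instead of the broken
// _parseError constructor shown above.
var parser = require('./parser').parser; // the Jison-generated module
parser.parseError = function (str, hash) {
    var err = new Error(str);
    err.hash = hash; // keep the diagnostics (line, expected, got)
    throw err;
};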

What grammar is this?

I have to parse a document containing groups of variable-value pairs, serialized to a string, e.g. like this:
4^26^VAR1^6^VALUE1^VAR2^4^VAL2^^1^14^VAR1^6^VALUE1^^
Here are the different elements:
Group IDs: 4 and 1
Length of the string representation of each group: 26 and 14
One of the groups: VAR1^6^VALUE1^ (the content of the second group)
Variables: VAR1 and VAR2
Length of the string representation of the values: 6, 4, and 6
The values themselves: VALUE1, VAL2, and VALUE1
Variables consist only of alphanumeric characters.
No assumption is made about the values, i.e. they may contain any character, including ^.
Is there a name for this kind of grammar? Is there a parsing library that can handle this mess?
So far I am using my own parser, but since I need to detect and handle corrupt serializations, the code looks rather messy; hence my question about a parser library that could lift that burden.
The simplest way to approach it is to note that there are two nested levels that work the same way. The pattern is extremely simple:
id^length^content^
At the outer level, this produces a set of groups. Within each group, the content follows exactly the same pattern, only here the id is the variable name, and the content is the variable value.
So you only need to write that logic once and you can use it to parse both levels. Just write a function that breaks a string up into a list of id/content pairs. Call it once to get the groups, and then loop through them calling it again for each content to get the variables in that group.
Breaking it down into these steps, first we need a way to get "tokens" from the string. This function returns an object with three methods, to find out if we're at "end of file", and to grab the next delimited or counted substring:
var tokens = function(str) {
    var pos = 0;
    return {
        eof: function() {
            return pos == str.length;
        },
        delimited: function(d) {
            var end = str.indexOf(d, pos);
            if (end == -1) {
                throw new Error('Expected delimiter');
            }
            var result = str.substr(pos, end - pos);
            pos = end + d.length;
            return result;
        },
        counted: function(c) {
            var result = str.substr(pos, c);
            pos += c;
            return result;
        }
    };
};
Now we can conveniently write the reusable parse function:
var parse = function(str) {
    var parts = {};
    var t = tokens(str);
    while (!t.eof()) {
        var id = t.delimited('^');
        var len = t.delimited('^');
        var content = t.counted(parseInt(len, 10));
        var end = t.counted(1);
        if (end !== '^') {
            throw new Error('Expected ^ after counted string, instead found: ' + end);
        }
        parts[id] = content;
    }
    return parts;
};
It builds an object where the keys are the IDs (or variable names). I'm assuming, since they have names, that the order isn't significant.
Then we can use that at both levels to create the function to do the whole job:
var parseGroups = function(str) {
    var groups = parse(str);
    Object.keys(groups).forEach(function(id) {
        groups[id] = parse(groups[id]);
    });
    return groups;
};
For your example, it produces this object:
{
    '1': {
        VAR1: 'VALUE1'
    },
    '4': {
        VAR1: 'VALUE1',
        VAR2: 'VAL2'
    }
}
I don't think it's a trivial task to create a grammar for this. But on the other hand, a simple, straightforward approach is not that hard: you know the corresponding string length for every critical string, so you can just chop your string apart according to those lengths.
Where do you see problems?

PEG for Python style indentation

How would you write a Parsing Expression Grammar in any of the following parser generators (PEG.js, Citrus, Treetop) that can handle Python/Haskell/CoffeeScript style indentation:
Examples of a not-yet-existing programming language:
square x =
    x * x

cube x =
    x * square x

fib n =
    if n <= 1
        0
    else
        fib(n - 2) + fib(n - 1) # some cheating allowed here with brackets
Update:
Don't try to write an interpreter for the examples above. I'm only interested in the indentation problem. Another example might be parsing the following:
foo
  bar = 1
  baz = 2
tap
  zap = 3

# should yield (ruby style hashmap):
# {:foo => { :bar => 1, :baz => 2}, :tap => { :zap => 3 } }
Pure PEG cannot parse indentation.
But peg.js can.
I did a quick-and-dirty experiment (being inspired by Ira Baxter's comment about cheating) and wrote a simple tokenizer.
For a more complete solution (a complete parser) please see this question: Parse indentation level with PEG.js
/* Initializations */
{
  function start(first, tail) {
    var done = [first[1]];
    for (var i = 0; i < tail.length; i++) {
      done = done.concat(tail[i][1][0]);
      done.push(tail[i][1][1]);
    }
    return done;
  }

  var depths = [0];

  function indent(s) {
    var depth = s.length;
    if (depth == depths[0]) return [];
    if (depth > depths[0]) {
      depths.unshift(depth);
      return ["INDENT"];
    }
    var dents = [];
    while (depth < depths[0]) {
      depths.shift();
      dents.push("DEDENT");
    }
    if (depth != depths[0]) dents.push("BADDENT");
    return dents;
  }
}

/* The real grammar */
start   = first:line tail:(newline line)* newline? { return start(first, tail) }
line    = depth:indent s:text { return [depth, s] }
indent  = s:" "* { return indent(s) }
text    = c:[^\n]* { return c.join("") }
newline = "\n" {}
depths is a stack of indentations. indent() gives back an array of indentation tokens and start() unwraps the array to make the parser behave somewhat like a stream.
peg.js produces for the text:
alpha
  beta
  gamma
    delta
epsilon
  zeta
 eta
theta
  iota
these results:
[
  "alpha",
  "INDENT",
  "beta",
  "gamma",
  "INDENT",
  "delta",
  "DEDENT",
  "DEDENT",
  "epsilon",
  "INDENT",
  "zeta",
  "DEDENT",
  "BADDENT",
  "eta",
  "theta",
  "INDENT",
  "iota",
  "DEDENT",
  "",
  ""
]
This tokenizer even catches bad indents.
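To try it out, here is a minimal driver sketch (assuming the grammar above is in the string grammarSource; note that the PEG.js API name varies by version, with buildParser in older releases and generate in newer ones):

// Build a parser from the tokenizer grammar and run it on a tiny input.
var PEG = require("pegjs");
var parser = PEG.buildParser(grammarSource);
console.log(parser.parse("alpha\n  beta\n  gamma\n"));
// → [ 'alpha', 'INDENT', 'beta', 'gamma', 'DEDENT', '' ]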
I think an indentation-sensitive language like that is context-sensitive. I believe PEG can only do context-free languages.
Note that, while nalply's answer is certainly correct that PEG.js can do it via external state (i.e. the dreaded global variables), it can be a dangerous path to walk down (worse than the usual problems with global variables). Some rules can initially match (and then run their actions), but parent rules can fail, thus causing the action run to be invalid. If external state is changed in such an action, you can end up with invalid state. This is super awful, and could lead to tremors, vomiting, and death. Some issues and solutions to this are in the comments here: https://github.com/dmajda/pegjs/issues/45
So what we are really doing here with indentation is creating something like C-style blocks, which often have their own lexical scope. If I were writing a compiler for a language like that, I think I would try to have the lexer keep track of the indentation. Every time the indentation increases, it could insert a '{' token. Likewise, every time it decreases, it could insert a '}' token. Then writing an expression grammar with explicit curly braces to represent lexical scope becomes more straightforward.
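As a rough stand-alone sketch of that idea (the function name and the spaces-only assumption are mine, not from any particular compiler):

// Turn indentation changes into explicit '{' / '}' tokens,
// assuming indentation uses spaces only.
function braceTokens(source) {
    var depths = [0];
    var out = [];
    source.split('\n').forEach(function (line) {
        if (line.trim() === '') return; // skip blank lines
        var depth = line.match(/^ */)[0].length;
        if (depth > depths[depths.length - 1]) {
            depths.push(depth);
            out.push('{');
        }
        while (depth < depths[depths.length - 1]) {
            depths.pop();
            out.push('}');
        }
        out.push(line.trim());
    });
    while (depths.length > 1) { // close blocks still open at EOF
        depths.pop();
        out.push('}');
    }
    return out;
}

// braceTokens("foo\n  bar = 1\n  baz = 2\ntap\n  zap = 3")
// → ["foo", "{", "bar = 1", "baz = 2", "}", "tap", "{", "zap = 3", "}"]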
You can do this in Treetop by using semantic predicates. In this case you need a semantic predicate that detects the closing of a whitespace-indented block due to the occurrence of another line with the same or less indentation. The predicate must count the indentation from the opening line, and return true (block closed) if the current line's indentation ends at the same or a shorter length. Because the closing condition is context-dependent, it must not be memoized.
Here's the example code I'm about to add to Treetop's documentation. Note that I've overridden Treetop's SyntaxNode inspect method to make it easier to visualise the result.
grammar IndentedBlocks
  rule top
    # Initialise the indent stack with a sentinel:
    &{|s| @indents = [-1] }
    nested_blocks
    {
      def inspect
        nested_blocks.inspect
      end
    }
  end

  rule nested_blocks
    (
      # Do not try to extract this semantic predicate into a new rule.
      # It will be memo-ized incorrectly because @indents.last will change.
      !{|s|
        # Peek at the following indentation:
        save = index; i = _nt_indentation; index = save
        # We're closing if the indentation is less or the same as our enclosing block's:
        closing = i.text_value.length <= @indents.last
      }
      block
    )*
    {
      def inspect
        elements.map{|e| e.block.inspect}*"\n"
      end
    }
  end

  rule block
    indented_line       # The block's opening line
    &{|s|               # Push the indent level to the stack
      level = s[0].indentation.text_value.length
      @indents << level
      true
    }
    nested_blocks       # Parse any nested blocks
    &{|s|               # Pop the indent stack
      # Note that under no circumstances should "nested_blocks" fail, or the stack will be mis-aligned
      @indents.pop
      true
    }
    {
      def inspect
        indented_line.inspect +
          (nested_blocks.elements.size > 0 ? (
              "\n{\n" +
              nested_blocks.elements.map { |content|
                content.block.inspect+"\n"
              }*'' +
              "}"
            )
            : "")
      end
    }
  end

  rule indented_line
    indentation text:((!"\n" .)*) "\n"
    {
      def inspect
        text.text_value
      end
    }
  end

  rule indentation
    ' '*
  end
end
Here's a little test driver program so you can try it easily:
require 'polyglot'
require 'treetop'
require 'indented_blocks'

parser = IndentedBlocksParser.new

input = <<END
def foo
  here is some indented text
    here it's further indented
    and here the same
      but here it's further again
    and some more like that
  before going back to here
    down again
back twice
and start from the beginning again
  with only a small block this time
END

parse_tree = parser.parse input

p parse_tree
I know this is an old thread, but I just wanted to add some PEG.js code to the answers. This code will parse a piece of text and "nest" it into a sort of "AST-ish" structure. It only goes one level deep, and it looks ugly; furthermore, it does not really use the return values to create the right structure, but keeps an in-memory tree of your syntax, which it returns at the end. This might well become unwieldy and cause some performance issues, but at least it does what it's supposed to.
Note: make sure you have tabs instead of spaces!
{
  var indentStack = [],
      rootScope = {
        value: "PROGRAM",
        values: [],
        scopes: []
      };

  function addToRootScope(text) {
    // Here we wiggle with the form and append the new
    // scope to the rootScope.
    if (!text) return;

    if (indentStack.length === 0) {
      rootScope.scopes.unshift({
        text: text,
        statements: []
      });
    }
    else {
      rootScope.scopes[0].statements.push(text);
    }
  }
}

/* Add some grammar */

start
  = lines: (line EOL+)*
    {
      return rootScope;
    }

line
  = line: (samedent t:text { addToRootScope(t); }) &EOL
  / line: (indent t:text { addToRootScope(t); }) &EOL
  / line: (dedent t:text { addToRootScope(t); }) &EOL
  / line: [ \t]* &EOL
  / EOF

samedent
  = i:[\t]* &{ return i.length === indentStack.length; }
    {
      console.log("s:", i.length, " level:", indentStack.length);
    }

indent
  = i:[\t]+ &{ return i.length > indentStack.length; }
    {
      indentStack.push("");
      console.log("i:", i.length, " level:", indentStack.length);
    }

dedent
  = i:[\t]* &{ return i.length < indentStack.length; }
    {
      for (var j = 0; j < i.length + 1; j++) {
        indentStack.pop();
      }
      console.log("d:", i.length + 1, " level:", indentStack.length);
    }

text
  = numbers: number+ { return numbers.join(""); }
  / txt: character+ { return txt.join(""); }

number
  = $[0-9]

character
  = $[ a-zA-Z->+]

__
  = [ ]+

_
  = [ ]*

EOF
  = !.

EOL
  = "\r\n"
  / "\n"
  / "\r"
