Implement heredocs with trim indent using PEG.js - parsing

I working on a language similar to ruby called gaiman and I'm using PEG.js to generate the parser.
Do you know if there is a way to implement heredocs with proper indentation?
xxx = <<<END
hello
world
END
the output should be:
"hello
world"
I need this because this code doesn't look very nice:
def foo(arg) {
if arg == "here" then
return <<<END
xxx
xxx
END
end
end
this is a function where the user wants to return:
"xxx
xxx"
I would prefer the code to look like this:
def foo(arg) {
if arg == "here" then
return <<<END
xxx
xxx
END
end
end
If I trim all the lines user will not be able to use a string with leading spaces when he wants. Does anyone know if PEG.js allows this?
I don't have any code yet for heredocs, just want to be sure if something that I want is possible.
EDIT:
So I've tried to implement heredocs and the problem is that PEG doesn't allow back-references.
heredoc = "<<<" marker:[\w]+ "\n" text:[\s\S]+ marker {
return text.join('');
}
It says that the marker is not defined. As for trimming I think I can use location() function

I don't think that's a reasonable expectation for a parser generator; few if any would be equal to the challenge.
For a start, recognising the here-string syntax is inherently context-sensitive, since the end-delimiter must be a precise copy of the delimiter provided after the <<< token. So you would need a custom lexical analyser, and that means that you need a parser generator which allows you to use a custom lexical analyser. (So a parser generator which assumes you want a scannerless parser might not be the optimal choice.)
Recognising the end of the here-string token shouldn't be too difficult, although you can't do it with a single regular expression. My approach would be to use a custom scanning function which breaks the here-string into a series of lines, concatenating them as it goes until it reaches a line containing only the end-delimiter.
Once you've recognised the text of the literal, all you need to normalise the spaces in the way you want is the column number at which the <<< starts. With that, you can trim each line in the string literal. So you only need a lexical scanner which accurately reports token position. Trimming wouldn't normally be done inside the generated lexical scanner; rather, it would be the associated semantic action. (Equally, it could be a semantic action in the grammar. But it's always going to be code that you write.)
When you trim the literal, you'll need to deal with the cases in which it is impossible, because the user has not respected the indentation requirement. And you'll need to do something with tab characters; getting those right probably means that you'll want a lexical scanner which computes visible column positions rather than character offsets.
I don't know if peg.js corresponds with those requirements, since I don't use it. (I did look at the documentation, and failed to see any indication as to how you might incorporate a custom scanner function. But that doesn't mean there isn't a way to do it.) I hope that the discussion above at least lets you check the detailed documentation for the parser generator you want to use, and otherwise find a different parser generator which will work for you in this use case.

Here is the implementation of heredocs in Peggy successor to PEG.js that is not maintained anymore. This code was based on the GitHub issue.
heredoc = "<<<" begin:marker "\n" text:($any_char+ "\n")+ _ end:marker (
&{ return begin === end; }
/ '' { error(`Expected matched marker "${begin}", but marker "${end}" was found`); }
) {
const loc = location();
const min = loc.start.column - 1;
const re = new RegExp(`\\s{${min}}`);
return text.map(line => {
return line[0].replace(re, '');
}).join('\n');
}
any_char = (!"\n" .)
marker_char = (!" " !"\n" .)
marker "Marker" = $marker_char+
_ "whitespace"
= [ \t\n\r]* { return []; }
EDIT: above didn't work with another piece of code after heredoc, here is better grammar:
{ let heredoc_begin = null; }
heredoc = "<<<" beginMarker "\n" text:content endMarker {
const loc = location();
const min = loc.start.column - 1;
const re = new RegExp(`^\\s{${min}}`, 'mg');
return {
type: 'Literal',
value: text.replace(re, '')
};
}
__ = (!"\n" !" " .)
marker 'Marker' = $__+
beginMarker = m:marker { heredoc_begin = m; }
endMarker = "\n" " "* end:marker &{ return heredoc_begin === end; }
content = $(!endMarker .)*

Related

lua tables - string representation

as a followup question to lua tables - allowed values and syntax:
I need a table that equates large numbers to strings. The catch seems to be that strings with punctuation are not allowed:
local Names = {
[7022003001] = fulsom jct, OH
[7022003002] = kennedy center, NY
}
but neither are quotes:
local Names = {
[7022003001] = "fulsom jct, OH"
[7022003002] = "kennedy center, NY"
}
I have even tried without any spaces:
local Names = {
[7022003001] = fulsomjctOH
[7022003002] = kennedycenterNY
}
When this module is loaded, wireshark complains "}" is expected to close "{" at line . How can I implement a table with a string that contains spaces and punctuation?
As per Lua Reference Manual - 3.1 - Lexical Conventions:
A short literal string can be delimited by matching single or double quotes, and can contain the (...) C-like escape sequences (...).
That means the short literal string in Lua is:
local foo = "I'm a string literal"
This matches your second example. The reason why it fails is because it lacks a separator between table members:
local Names = {
[7022003001] = "fulsom jct, OH",
[7022003002] = "kennedy center, NY"
}
You can also add a trailing separator after the last member.
The more detailed description of the table constructor can be found in 3.4.9 - Table Constructors. It could be summed up by the example provided there:
a = { [f(1)] = g; "x", "y"; x = 1, f(x), [30] = 23; 45 }
I really, really recommend using the Lua Reference Manual, it is an amazing helper.
I also highly encourage you to read some basic tutorials e.g. Learn Lua in 15 minutes. They should give you an overview of the language you are trying to use.

What am I doing incorrectly to my approach to this function & its call?

I have a fairly simple nested table Aurora64.Chat containing a couple of functions (where the main Aurora64 class is initialized elsewhere but I inserted it here for completeness):
Aurora64 = {};
Aurora64.Chat = {
entityId = 0;
Init = function()
local entity; --Big long table here.
if (g_gameRules.class == "InstantAction") then
g_gameRules.game:SetTeam(3, entity.id); --Spectator, aka neutral.
end
entityId = Entity.id;
self:LogToSystem("Created chat entity '" .. entity.name .. "' with ID '" .. entity.id .. "'. It is now available for use.");
end
LogToSystem = function(msg)
System.LogAlways("$1[Aurora 64]$2 " .. msg);
end
}
The above code fails (checked with the Lua Demo) with the following:
input:14: '}' expected (to close '{' at line 3) near 'LogToSystem'
I have tracked it down to the LogToSystem function and its usage (if I remove the function and the one time it is used, the code compiles perfectly), and I thought it was to do with my use of use of concatenation (it wasn't).
I'm thinking I might have missed something simple, but I checked the documentation on functions and the function & its call seem to be written properly.
What exactly am I doing wrong here?
You are missing a comma before LogToSystem and you need to define it a bit differently (by adding self as an explicit parameter):
end,
LogToSystem = function(self, msg)
System.LogAlways("$1[Aurora 64]$2 " .. msg);
end
}
It's not possible to use the form obj:method with anonymous functions assigned to table fields; you can only use it with function obj:method syntax.

How to implement for loop in javacc

I am implementing a parser based on javacc which will be able to GW Basic programs.
I implemented for loop like this
void forloop(Token line):
{
Token toV;
Token toI;
Token step;
Token next;
Token var;
}
{
<FOR> var=<VARIABLE> "=" Expression() { instructions.add("STORE " + var.image); } <TO> toV=<INTEGER> <STEP> step=<INTEGER>
{
instructions.add("LABEL "+labelsCntr);
instructions.add("LOAD "+var.image);
instructions.add("CONST "+toV.image);
instructions.add("SUB");
instructions.add("CONST 0");
}
( Line() )*
next = <INTEGER> <NEXT> <VARIABLE>
{
instructions.add("LINE "+next.image);
instructions.add("LOAD "+step.image);
instructions.add("LOAD "+var.image);
instructions.add("ADD");
instructions.add("JMP LABEL "+(labelsCntr));
labelsCntr++;
}
}
But it does not work.
How can I implement for loop so that it works.
Or where I am doing wrong.
Thanks in advance.
There are two things I don't see in your machine code that I would have expected. One is a conditional jump out of the loop. When the variable exceeds toV, control should jump to the first instruction after the loop. Second I don't see where the var is changed. At the end of the loop you add the value of the variable to the step value, but you don't store the result back into the variable.
There are also a few checks I expect need to be done at compile time that are missing: That the variable in the NEXT statement matches that in the VAR and that the step is positive.

ANTLR best way to include meta-data in lexing/parsing (custom objects, kind of annotation)

I plan to include text metadata (like bold, font-size, etc.) in the process of parsing to achieve better recognition.
For instance, I have a given structure, where a word on its own line word/r/n which is bold and sized 24px, is the title for some article. In order to get better recognition results, I want to take the characters as well as the metadata in account. In terms of ANTRL I'm not sure how this could be done best. I'd like to do something like:
Wrap each character of the original text into a custom object with fields for the metadata and pass that to ANTLR.
Preprocess the text and insert at specific places annotations for the metadata which is considered by the grammer.
I really like to take option 1. but I'm not sure which part from ANTLR I need to subclass etc. Do I have to start at the ANTLRInputStream-Object, in order to get a proper stream for a subclassed Lexer to get custom Tokens for a subclassed Parser etc. Is there a more elegant way, especially in querying the tokens while parsing with actions in a {} block ?
If anyone has some hints and/or experiences this would be great!
EDIT:
Here is a more specific simple example: I have a file wich includes the encoding of metadata which I parse forehand. the actual text including newline look like the following:
entryOne
Here is some content one.
entryTwo
Here is some content two.
Where the titlesentryOneand entryTwo are originally font-size of 24px and the content is font-size of 12px (as exemplary given values). Char by char I create a new instance of a custom object encapsulating the character as String and the font-size.
I initialize respective objects for each of the characters with fields of the font-size, e.g for the first letter of entryOne like
MyChar aTitelChar = new MyChar("e", 24);
For the content, like the second line Here is some content one. I create instances of MyChar like:
MyChar aContentChar= new MyChar("H", 12);
All characters of the texts are wrapped in instances of the below MyChar-Class and added to a List<MyChar> in order to produce a new input for ANTLR.
below is the Java Class for the characters:
public class MyChar {
private int fontSizePx;
private String text;
public MyChar(String text, int fontSizePx) {
this.text = text;
this.fontSizePx = fontSizePx;
}
public int getFontSizePx() {
return fontSizePx;
}
public String getText() {
return text;
}
}
I want that my grammar matches the above two entries (or more formatted this way) which in turn consist each of a title and a content which is terminated with a fullstop. This grammar could look like this:
rule: entry+ NEWLINE
;
entry:
title
content
;
title:
letters NEWLINE
;
content:
(letters)+ '.' NEWLINE
;
letters:
LETTERS
;
LETTERS:
('a'..'z' | 'A'..'Z')+
;
WS:
(' ' | '\t' | 'f' ) + {$channel = HIDDEN;};
NEWLINE:'\r'? '\n';
Now, for instance, what I want to do is to find out if it's really a title of an entry by checking the font-size of all letters encompassing the title-token before titel-rule returns. In case the input conforms to the grammar but is actually some kind of mistake (the original metadata-encoded file starts with something that conforms to the title-rule but its actually the content) the author of the grammar could sort that out if he knows that the original font-size for titles is 24 and check this. If one of the letter-tokens doesn't equal to font-size 24 throw an exception/don't return/do smthg. appropriate.
The thing I'm pondering on is where to plug in the List<MyChar> to provide this functionality (to query kinds of metadata while parsing in context of ANTLR). I'm experimenting with ANTLR's Classes but as I'm new to ANTLR I thought probably some of the experienced users can point me in the right direction, like where would be a good insertion points for custom objects? should I start by implenting CharStream and override some methods? Probably there is something which ANTLR provides which I haven't found yet?
Here's one way to accomplish what I think you're going for, using the parser to manage matching input to metadata. Note that I made whitespace significant because it's part of the content and can't be skipped. I also made periods part of content to simplify the example, rather than using them as a marker.
SysEx.g
grammar SysEx;
#header {
import java.util.List;
}
#parser::members {
private List<MyChar> metadata;
private int curpos;
private boolean isTitleInput(String input) {
return isFontSizeInput(input, 24);
}
private boolean isContentInput(String input){
return isFontSizeInput(input, 12);
}
private boolean isFontSizeInput(String input, int fontSize){
List<MyChar> sublist = metadata.subList(curpos, curpos + input.length());
System.out.println(String.format("Testing metadata for input=\%s, font-size=\%d", input, fontSize));
int start = curpos;
//move our metadata pointer forward.
skipInput(input);
for (int i = 0, count = input.length(); i < count; ++i){
MyChar chardata = sublist.get(i);
char c = input.charAt(i);
if (chardata.getText().charAt(0) != c){
//This character doesn't match the metadata (ERROR!)
System.out.println(String.format("Content mismatch at metadata position \%d: metadata=(\%s,\%d); input=\%c", start + i, chardata.getText(), chardata.getFontSizePx(), c));
return false;
} else if (chardata.getFontSizePx() != fontSize){
//The font is wrong.
System.out.println(String.format("Format mismatch at metadata position \%d: metadata=(\%s,\%d); input=\%c", start + i, chardata.getText(), chardata.getFontSizePx(), c));
return false;
}
}
//All characters check out.
return true;
}
private void skipInput(String str){
curpos += str.length();
System.out.println("\t\tMoving metadata pointer ahead by " + str.length() + " to " + curpos);
}
}
rule[List<MyChar> metadata]
#init {
this.metadata = metadata;
}
: entry+ EOF
;
entry
: title content
{System.out.println("Finished reading entry.");}
;
title
: line {isTitleInput($line.text)}? newline {System.out.println("Finished reading title " + $line.text);}
;
content
: line {isContentInput($line.text)}? newline {System.out.println("Finished reading content " + $line.text);}
;
newline
: (NEWLINE{skipInput($NEWLINE.text);})+
;
line returns [String text]
#init {
StringBuilder builder = new StringBuilder();
}
#after {
$text = builder.toString();
}
: (ANY{builder.append($ANY.text);})+
;
NEWLINE:'\r'? '\n';
ANY: .; //whitespace can't be skipped because it's content.
A title is a line that matches the title metadata (size 24 font) followed by one or more newline characters.
A content is a line that matches the content metadata (size 12 font) followed by one or more newline characters. As mentioned above, I removed the check for a period for simplification.
A line is a sequence of characters that does not include newline characters.
A validating semantic predicate (the {...}? after line) is used to validate that the line matches the metadata.
Here is the code I used to test the grammar (minus imports, for brevity):
SysExGrammar.java
public class SysExGrammar {
public static void main(String[] args) throws Exception {
//Create some metadata that matches our input.
List<MyChar> matchingMetadata = new ArrayList<MyChar>();
appendMetadata(matchingMetadata, "entryOne\r\n", 24);
appendMetadata(matchingMetadata, "Here is some content one.\r\n", 12);
appendMetadata(matchingMetadata, "entryTwo\r\n", 24);
appendMetadata(matchingMetadata, "Here is some content two.\r\n", 12);
parseInput(matchingMetadata);
System.out.println("Finished example #1");
//Create some metadata that doesn't match our input (negative test).
List<MyChar> mismatchingMetadata = new ArrayList<MyChar>();
appendMetadata(mismatchingMetadata, "entryOne\r\n", 24);
appendMetadata(mismatchingMetadata, "Here is some content one.\r\n", 12);
appendMetadata(mismatchingMetadata, "entryTwo\r\n", 12); //content font size!
appendMetadata(mismatchingMetadata, "Here is some content two.\r\n", 12);
parseInput(mismatchingMetadata);
System.out.println("Finished example #2");
}
private static void parseInput(List<MyChar> metadata) throws Exception {
//Test setup
InputStream resource = SysExGrammar.class.getResourceAsStream("SysExTest.txt");
CharStream input = new ANTLRInputStream(resource);
resource.close();
SysExLexer lexer = new SysExLexer(input);
CommonTokenStream tokens = new CommonTokenStream(lexer);
SysExParser parser = new SysExParser(tokens);
parser.rule(metadata);
System.out.println("Parsing encountered " + parser.getNumberOfSyntaxErrors() + " syntax errors");
}
private static void appendMetadata(List<MyChar> metadata, String string,
int fontSize) {
for (int i = 0, count = string.length(); i < count; ++i){
metadata.add(new MyChar(string.charAt(i) + "", fontSize));
}
}
}
SysExTest.txt (note this uses Windows newlines (\r\n)
entryOne
Here is some content one.
entryTwo
Here is some content two.
Test output (trimmed; the second example has deliberately-mismatched metadata):
Parsing encountered 0 syntax errors
Finished example #1
Parsing encountered 2 syntax errors
Finished example #2
This solution requires that each MyChar corresponds to a character in the input (including newline characters, although you can remove that limitation if you like -- I would remove it if I didn't already have this answer written up ;) ).
As you can see, it's possible to tie the metadata to the parser and everything works as expected. I hope this helps.

Scala parser combinators and Reader infinite loop

I am a little confused by the Scala parser combinators.
I'm using a custom implementation of Reader to directly read a list of tokens:
private class Token_Reader(tokens: List[Token], val pos: Token_Pos) extends Reader
{
def first = if(atEnd) null else tokens.head
def rest = if(atEnd) this else new Token_Reader(tokens.tail, new Token_Pos(pos.p + 1))
def atEnd = tokens.isEmpty
}
What puzzles me is that atEnd seems to be completely ignored by the actual parsers, resulting in an infinite loop / infinite recursion when using */rep.
I don't know that it will fix this issue, but in the Reader implementations I see in the Scala source, the first method returns an end of file character rather than null when at the end. And I believe it's generally good to avoid nulls...
For example, in CharSequenceReader it looks like
/** Returns the first element of the reader, or EofCh if reader is at its end
*/
def first =
if (offset < source.length) source.charAt(offset) else EofCh
And this character is defined in the companion object:
object CharSequenceReader {
final val EofCh = '\032'
}

Resources