Parsing unique but unordered named blocks

I have a DSL where a file consists of multiple named blocks.
Ideally, each block should occur only once, but the order doesn't matter.
How do I write a parser that ignores block order, but gives syntax errors if the same block is repeated?

One option is to detect the error after parsing, perhaps with a walker.
If you need to detect the error during parsing, add a semantics class that records the block identifiers and raises SemanticError when a block name is repeated.
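For instance, with a TatSu-style parser, a minimal sketch of such a semantics class could look like this (assuming a grammar rule named block whose AST exposes the block's name; both names are placeholders for whatever your DSL actually uses):

    from tatsu.exceptions import SemanticError

    class UniqueBlockSemantics:
        def __init__(self):
            self.seen = set()

        def block(self, ast):
            # Called each time the hypothetical "block" rule matches.
            if ast.name in self.seen:
                raise SemanticError(f"duplicate block: {ast.name!r}")
            self.seen.add(ast.name)
            return ast

Because the exception is raised from inside the semantic action, the error is reported during the parse, at the offending block, instead of in a later pass.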

Related

Antlr4 Is there a way to walk on a node with a listener

I am working with antlr4 to parse some inner language.
The way I do it is the standard one: I create a stream of tokens, then have the parser build a parse tree from those tokens.
Now that I have a tree, I can use the walk command to walk over it.
The walk command takes a listener that extends the generated BaseListener class.
For my project I need a lot of lookahead, and I mostly use visitors as a sort of lookahead.
A visitor function can be applied to a node of the parse tree (assuming you implement a visitor class).
I have tried creating a smaller listener class to act as my lookahead on a specific node, but I can't figure out whether this is even possible.
Or is there another / better / smarter way to create a lookahead?
To make the question shorter:
Can a listener class be used on a node of the ANTLR4 parse tree, the way a visitor can?
There’s no real reason to do anything fancy with a listener here. To handle your example of resolving the type of an expression, this can be done in a couple of steps. You’ll need a listener that builds a symbol table, presumably a scoped symbol table, so that each scope in your parse tree has a table plus a reference to the “parent” scope it can search to resolve the data types of variables. Depending upon your needs, you can build these scoped symbol tables in your type-resolution listener, pushing and popping them on a stack as you enter/exit scopes, or you can do it in a separate listener that attaches a scope reference to the scope nodes of your parse tree. The symbol table is required for type resolution, so if your language allows variables to be referenced before their declaration, you’ll need a first pass over the tree with a listener that builds and retains the symbol table structure.
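A minimal sketch of such a scoped symbol table, in plain Python with illustrative names (no ANTLR API involved):

    class Scope:
        def __init__(self, parent=None):
            self.parent = parent    # enclosing scope, or None for the global scope
            self.symbols = {}       # variable name -> declared type

        def define(self, name, typ):
            self.symbols[name] = typ

        def resolve(self, name):
            if name in self.symbols:
                return self.symbols[name]
            if self.parent is not None:
                return self.parent.resolve(name)
            raise KeyError(f"undeclared identifier: {name}")

Your listener would push a new Scope (with the current one as its parent) in the enter* method of a scope-introducing rule and pop it in the matching exit* method.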
With the symbol table structure available, your listener can resolve expression types. If you always use the exit* method overrides, then the data types of all sub-expressions will already have been resolved by the time a parent expression is exited. Knowing the sub-expression types, you can infer the type of the expression you’re about to “exit”. Evaluating the types of child expressions covers all of the “lookahead” you need to resolve an expression’s type.
There’s really nothing about this approach that rules out a visitor either; it would just be up to you to visit each child, determine its type, and then, at the end of the visit method, resolve the expression type from the types of its children.
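As a rough sketch of the exit* approach in Python, a listener can annotate each parse-tree node with its resolved type; the rule names (intLiteral, ident, addExpr) are hypothetical stand-ins for whatever your grammar defines:

    from antlr4 import ParseTreeWalker
    from ExprListener import ExprListener   # generated from a hypothetical Expr.g4

    class TypeResolver(ExprListener):
        def __init__(self, scope):
            self.scope = scope    # scoped symbol table built in an earlier pass
            self.types = {}       # parse-tree node -> resolved type

        def exitIntLiteral(self, ctx):
            self.types[ctx] = 'int'

        def exitIdent(self, ctx):
            self.types[ctx] = self.scope.resolve(ctx.getText())

        def exitAddExpr(self, ctx):
            # Both children have already been exited, so their types are known.
            left, right = self.types[ctx.expr(0)], self.types[ctx.expr(1)]
            self.types[ctx] = 'float' if 'float' in (left, right) else 'int'

Walking the tree with ParseTreeWalker().walk(resolver, tree) then leaves every expression's type, including the root's, in resolver.types.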

reference to where the parser is at when calling a rule listener in ANTLR4

I'm generating listeners in Python, but any language is ok for answers or comments.
I need to know if there's some reference to where in the parse tree, or even better, in the token stream or in the source file, the parser is at when calling a specific listener method.
I get a context object, which has a reference to the parser itself; I looked through it but don't seem to find any such reference.
This is for debugging only.
def enterData_stmt(self, ctx:fassParser.Data_stmtContext):
I know the parser doesn't traverse the source file but rather the abstract syntax tree, and I could look at the tree and work out where the parser is, but I'm wondering if I can get a little context for quick debugging without having to do a tree traversal.
Every ParserRuleContext object has the fields start and stop, which contain the first and last token matched by the rule, respectively. The token objects have the methods getLine and getCharPositionInLine to find the line and column where each token starts. There are no methods telling you where a token ends as a line and column (only as an absolute index), so if you need that, you'll have to calculate it yourself from the start position and the token's length.
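In the Python runtime those Java getters surface as plain attributes, so a quick debugging version of the listener method from the question might be:

    def enterData_stmt(self, ctx: fassParser.Data_stmtContext):
        tok = ctx.start   # first token matched by this rule
        print(f"Data_stmt at line {tok.line}, column {tok.column}")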
I know the parser doesn't traverse the source file but rather the abstract syntax tree
Of course the parser traverses the source file - how else could it parse it? The parser goes through the source file to generate the (not very abstract) parse tree. If you're using a visitor or ParseTreeWalker with a listener, the visitor/listener will then walk the generated parse tree. If you're using addParseListener, the listener will be invoked with the partially-constructed tree while the parser is still parsing the file.

How to parse a very large file in F# using FParsec

I'm trying to parse a very large file using FParsec. The file's size is 61GB, which is too big to hold in RAM, so I'd like to generate a sequence of results (i.e. seq<'Result>), rather than a list, if possible. Can this be done with FParsec? (I've come up with a jerry-rigged implementation that actually does this, but it doesn't work well in practice due to the O(n) performance of CharStream.Seek.)
The file is line-oriented (one record per line), which should make it possible in theory to parse in batches of, say, 1000 records at a time. The FParsec "Tips and tricks" section says:
If you’re dealing with large input files or very slow parsers, it might also be worth trying to parse multiple sections within a single file in parallel. For this to be efficient there must be a fast way to find the start and end points of such sections. For example, if you are parsing a large serialized data structure, the format might allow you to easily skip over segments within the file, so that you can chop up the input into multiple independent parts that can be parsed in parallel. Another example could be a programming language whose grammar makes it easy to skip over a complete class or function definition, e.g. by finding the closing brace or by interpreting the indentation. In this case it might be worth not to parse the definitions directly when they are encountered, but instead to skip over them, push their text content into a queue and then to process that queue in parallel.
This sounds perfect for me: I'd like to pre-parse each batch of records into a queue, and then finish parsing them in parallel later. However, I don't know how to accomplish this with the FParsec API. How can I create such a queue without using up all my RAM?
FWIW, the file I'm trying to parse is here if anyone wants to give it a try with me. :)
The "obvious" thing that comes to mind, would be pre-processing the file using something like File.ReadLines and then parsing one line at a time.
If this doesn't work (your PDF looked, like a record is a few lines long), then you can make a seq of records or 1000 records or something like that using normal FileStream reading. This would not need to know details of the record, but it would be convenient, if you can at least delimit the records.
Either way, you end up with a lazy seq that the parser can then read.
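The question is F#-specific, but the batching shape itself is language-neutral; here is the idea sketched in Python for brevity (parse_records stands in for whatever per-batch parsing you do):

    from itertools import islice

    def batches(path, size=1000):
        # Lazily yield fixed-size batches of lines; the 61 GB file is
        # never held in memory all at once.
        with open(path) as lines:
            while True:
                batch = list(islice(lines, size))
                if not batch:
                    break
                yield batch

    for batch in batches('records.txt'):
        parse_records(batch)   # hypothetical: hand each batch to the parser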

Multiple parsers calling each other?

I am working on a complicated system that uses a number of XML schemas and associated parsers. One of the schemas is used to hold general data that are accessed by all of the other schemas. I would like to maintain this division in the (flex and bison) parsers. So, if I parse the main XML file and get to, say, the tag <matrix>, I would like to call a <matrix> parser as a subroutine, return its content to the calling program and continue parsing there after the </matrix> tag. I have been looking around the net, but I have not found anything useful. Is it even possible to do this?
It seems easiest to maintain the common pieces in a separate file and to split each individual parser into two more files: part 1 has the prologue and the individual grammar rules, part 2 has the epilogue. Then the three files can be concatenated (in a Makefile) before running the parser generator:
parser.y: parser.part1 common.inc parser.part2
	cat parser.part1 common.inc parser.part2 >parser.y
Your approach is wrong. You shouldn't need a special parser for each distinctive tag. You should parse all tags regardless of their properties and link them to a tree. Afterwards you can validate the tree to ensure a correct consistency of nested tags. If the markup language you're talking about is really that special, then you could create a parser that takes rules describing each tag. In this case parsing and checking are done at the same time, most HTML parsers are implemented like this.
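A sketch of that parse-first, validate-later shape, using Python's standard XML parser for brevity (the nesting rules and file name are invented placeholders):

    import xml.etree.ElementTree as ET

    # Hypothetical rules: which child tags each tag may contain.
    ALLOWED_CHILDREN = {'matrix': {'row'}, 'row': {'cell'}}

    def validate(node):
        allowed = ALLOWED_CHILDREN.get(node.tag)
        for child in node:
            if allowed is not None and child.tag not in allowed:
                raise ValueError(f"<{child.tag}> not allowed inside <{node.tag}>")
            validate(child)

    validate(ET.parse('main.xml').getroot())

Parsing builds the whole tree without caring which tag is which; consistency checks happen in a single walk afterwards.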

Is it acceptable to store the previous state as a global variable?

One of the biggest problems with designing a lexical analyzer/parser combination is overzealousness in designing the analyzer: (f)lex isn't designed to carry parser logic, which can sometimes interfere with the design of mini-parsers (built by means of yy_push_state(), yy_pop_state(), and yy_top_state()).
My goal is to parse a document of the form:
CODE1 this is the text that might appear for a 'CODE' entry
SUBCODE1 the CODE group will have several subcodes, which
may extend onto subsequent lines.
SUBCODE2 however, not all SUBCODEs span multiple lines
SUBCODE3 still, however, there are some SUBCODES that span
not only one or two lines, but any number of lines.
this makes it a challenge to use something like \r\n
as a record delimiter.
CODE2 Moreover, it's not guaranteed that a SUBCODE is the
only way to exit another SUBCODE's scope. There may
be CODE blocks that accomplish this.
In the end, I've decided that this section of the project is better left to the lexical analyzer, since I don't want to create a pattern that matches each line (and identifies continuation records). Part of the reason is that I want the parser to have knowledge of the contents of each line without incorporating its own tokenizing logic. That is to say, if I match ^SUBCODE[ ][ ].{71}\r\n (all records are blocked into 80-character records), I would not be able to harness the power of flex to tokenize the structured data residing in .{71}.
Given these constraints, I'm thinking about doing the following:
1. Entering a CODE1 state from the <INITIAL> start condition results in calls to:
   yy_push_state(CODE_STATE)
   yy_push_state(CODE_CODE1_STATE)
   (do something with the contents of the CODE1 state identifier, if such contents exist)
   yy_push_state(SUBCODE_STATE) (to tell the analyzer to expect SUBCODE states belonging to CODE_CODE1_STATE; this is where the analyzer begins to masquerade as a parser)
2. The <SUBCODE1_STATE> start condition is nested as follows: <CODE_STATE>{ <CODE_CODE1_STATE>{ <SUBCODE_STATE>{ <SUBCODE1_STATE>{ (perform actions based on the matching patterns) } } } }. It also sets the global previous_state variable to yy_top_state(), to wit SUBCODE1_STATE.
3. Within <SUBCODE1_STATE>'s scope, \r\n calls yy_pop_state(). If a continuation record is present (a pattern at the highest scope, against which all text is matched), yy_push_state(continuation_record_states[previous_state]) is called, bringing us back to the scope in 2. continuation_record_states[] maps each state to its continuation-record state, which is used by the parser.
As you can see, this is quite complicated, which leads me to conclude that I'm massively over-complicating the task.
Questions
For states lacking an extremely clear token signifying the end of their scope, is my proposed solution acceptable?
Given that I want to tokenize the input using flex, is there any way to do so without start conditions?
The biggest problem I'm having is that each record (beginning with the (SUB)CODE prefix) is unique, but the information appearing after the (SUB)CODE prefix is not. Therefore, it almost appears mandatory to have multiple states like this, and the abstract CODE_STATE and SUBCODE_STATE states would act as groupings for each of the concrete SUBCODE[0-9]+_STATE and CODE[0-9]+_STATE states.
I would look at how the OMeta parser handles these things.
