I have a file in which each line is a series of concatenated strings, like this:
302007030064201410241
30210704006426141
1021070400642614134
Each line starts with an operation code, and each operation has known rules for parsing the remaining part of the line.
What would be a good strategy for parsing these numbers? Any sample to start from would be great.
IMO, ANTLR won't be much use here, since the different pieces of information to parse all look alike: every token is just a digit.
Write a little state machine by hand.
Read digits in a loop until the digits read so far form a known "operation code" (this is simpler if all codes have the same length; you could wrap that step in a function).
Then, depending on that code (e.g. in a switch), call its specific decoding logic in a dedicated function.
Your resulting parser will look like a recursive descent parser.
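A minimal sketch of that shape in Haskell (the three-digit codes "302"/"102" and the seven-digit device field are invented for illustration; substitute your real codes and layouts):

data Operation
  = Op302 { device :: String, payload :: String }
  | Op102 { device :: String, payload :: String }
  deriving Show

-- Dispatch on the operation code, then hand the rest of the line to
-- that code's dedicated decoder (the "switch" described above).
parseLine :: String -> Either String Operation
parseLine line = case splitAt 3 line of      -- assuming 3-digit codes
  ("302", rest) -> decode302 rest
  ("102", rest) -> decode102 rest
  (code,  _)    -> Left ("unknown operation code: " ++ code)

-- Hypothetical layouts: a 7-digit device id followed by the payload.
decode302, decode102 :: String -> Either String Operation
decode302 rest = let (dev, rest') = splitAt 7 rest in Right (Op302 dev rest')
decode102 rest = let (dev, rest') = splitAt 7 rest in Right (Op102 dev rest')

For example, parseLine "302007030064201410241" yields Op302 {device = "0070300", payload = "64201410241"} under these made-up layouts.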
I'm trying to figure out how I can best parse just a subset of a given language with ANTLR. For example, say I'm looking to parse U-SQL. Really, I'm only interested in parsing certain parts of the language, such as query statements. I couldn't be bothered with parsing the many other features of the language. My current approach has been to design my lexer / parser grammar as follows:
// ...
statement
: queryStatement
| undefinedStatement
;
// ...
undefinedStatement
: (.)+?
;
// ...
UndefinedToken
: (.)+?
;
The gist is, I add a fall-back parser rule and lexer rule for undefined structures and tokens. I imagine later, when I go to walk the parse tree, I can simply ignore the undefined statements in the tree, and focus on the statements I'm interested in.
This seems like it would work, but is this an optimal strategy? Are there more elegant options available? Thanks in advance!
Parsing a subpart of a grammar is super easy. Usually you have a top level rule which you call to parse the full input with the entire grammar.
For the subpart, call the function that parses only that subrule, like:
const expression = parser.statement();
I use this approach frequently when I want to parse stored procedures or data types only.
Keep in mind, however, that subrules usually are not terminated with the EOF token (as the top-level rule should be). That means you get no syntax error if there is more than the subelement in the token stream (the parser just stops when the subrule has matched completely). If that's a problem for you, add a copy of the subrule you want to parse, give it a dedicated name, and end it with EOF, like this:
dataTypeDefinition: // For external use only. Don't reference this in the normal grammar.
dataType EOF
;
dataType: // type in sql_yacc.yy
type = (
...
Check the MySQL grammar for more details.
This general idea -- to parse the interesting bits of an input and ignore the sea of surrounding tokens -- is usually called "island parsing". There's an example of an island parser in the ANTLR reference book, although I don't know if it is directly applicable.
The tricky part of island parsing is getting the island boundaries right. If you miss a boundary, or recognise as a boundary something which isn't, then your parse will fail disastrously. So you need to understand the input at least well enough to be able to detect where the islands are. In your example, that might mean recognising a SELECT statement, for example. However, you cannot blindly recognise the string of letters SELECT because that string might appear inside a string constant or a comment or some other context in which it was never intended to be recognised as a token at all.
I suspect that if you are going to parse queries, you'll basically need to be able to recognise any token. So it's not going to be a sea of uninspected input characters. You can view it as a sea of recognised but unparsed tokens. In that case, it should be reasonably safe to parse a non-query statement as a keyword followed by arbitrary tokens other than ; and ending with a ;. (But you might need to recognise nested blocks; I don't really know what the possibilities are.)
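To make that last idea concrete, here is a hedged sketch that treats the input as a list of already-recognised tokens (plain strings for brevity; BEGIN/END stand in for whatever block delimiters the language actually has) and skips one uninteresting statement up to its terminating ";":

-- Splits one skipped statement off the front of the token stream,
-- tracking block depth so a nested ";" doesn't end the statement early.
skipStatement :: [String] -> ([String], [String])
skipStatement = go (0 :: Int)
  where
    go _ [] = ([], [])                          -- ran out of input
    go depth (t : ts)
      | t == "BEGIN"           = keep t (go (depth + 1) ts)
      | t == "END"             = keep t (go (depth - 1) ts)
      | t == ";" && depth == 0 = ([t], ts)      -- statement ends here
      | otherwise              = keep t (go depth ts)
    keep t (skipped, rest) = (t : skipped, rest)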
I'm currently in the process of creating a programming language. I've laid out my entire design and am in progress of creating the Lexer for it. I have created numerous lexers and lexer generators in the past, but have never come to adopt the "standard", if one exists.
Is there a specific way a lexer should be created to maximise capability to use it with as many parsers as possible?
Because of the way I design mine, they look like the following:
Code:
int main() {
printf("Hello, World!");
}
Lexer:
[
KEYWORD:INT, IDENTIFIER:"main", LEFT_ROUND_BRACKET, RIGHT_ROUND_BRACKET, LEFT_CURLY_BRACKET,
IDENTIFIER:"printf", LEFT_ROUND_BRACKET, STRING:"Hello, World!", RIGHT_ROUND_BRACKET, SEMICOLON,
RIGHT_CURLY_BRACKET
]
Is this the way lexers should be made? Also, as a side note, what should my next step be after creating a lexer? I don't really want to use something such as ANTLR or Lex+Yacc or Flex+Bison, etc. I'm doing it from scratch.
If you don't want to use a parser generator [Note 1], then it is absolutely up to you how your lexer provides information to your parser.
Even if you do use a parser generator, there are many details which are going to be project-dependent. Sometimes it is convenient for the lexer to call the parser with each token; other times it is easier if the parser calls the lexer; in some cases, you'll want a driver which interacts separately with each component. And clearly, the precise datatype(s) of your tokens will vary from project to project, which can have an impact on how you communicate as well.
Personally, I would avoid use of global variables (as in the original yacc/lex protocol), but that's a general style issue.
Most lexers work in streaming mode, rather than tokenizing the entire input and then handing the vector of tokens to some higher power. Tokenizing one token at a time has a number of advantages, particularly if the tokenization is context-dependent, and, let's face it, almost all languages have some impurity somewhere in their syntax. But, again, that's entirely up to you.
Good luck with your project.
Notes:
1. Do you also forgo the use of compilers and write all your code from scratch in assembler or even binary?
Is there a specific way a lexer should be created to maximise capability to use it with as many parsers as possible?
In the lexers I've looked at, the canonical API is pretty minimal. It's basically:
Token readNextToken();
The lexer maintains a reference to the source text and its internal pointers into where it is currently looking. Then, every time you call that, it scans and returns the next token.
The Token type usually has:
A "type" enum for which kind of token it is: string, operator, identifier, etc. There are usually special kinds for "EOF", meaning a special terminator token that is produced after the end of the input, and "ERROR" for the rare cases where a syntax error comes from the lexical grammar. This is mainly just unterminated string literals or totally unknown characters in the source.
The source text of the token.
Sometimes literals are converted to their proper value representation during lexing in which case you'll have that value too. So a number token would have "123" as text but also have the numeric value 123. Or you can do that during parsing/compilation.
Location within the source file of the token. This is for error reporting. Usually 1-based line and column, but can also just be start and end byte offsets. The latter is a little faster to produce and can be converted to line and column lazily if needed.
Depending on your grammar, you may need to be able to rewind the lexer too.
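A sketch of that Token shape in Haskell (field and constructor names are illustrative, not any standard):

-- Which kind of token this is; EOF and ERROR are the special kinds
-- described above.
data TokenType
  = TKeyword | TIdentifier | TOperator | TString | TNumber
  | TEof | TError
  deriving (Show, Eq)

-- Literal values, if you convert them during lexing rather than later.
data Literal = LInt Integer | LStr String
  deriving Show

data Token = Token
  { tokType  :: TokenType
  , tokText  :: String        -- source text of the token
  , tokValue :: Maybe Literal -- e.g. the number 123 for the text "123"
  , tokLine  :: Int           -- 1-based line, for error reporting
  , tokCol   :: Int           -- 1-based column (or use byte offsets)
  } deriving Show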
I'm trying to interface Haskell with a command line program that has a read-eval-print loop. I'd like to put some text into an input handle, and then read from an output handle until I find a prompt (and then repeat). The reading should block until a prompt is found, but no longer. Instead of coding up my own little state machine that reads one character at a time until it constructs a prompt, it would be nice to use Parsec or Attoparsec. (One issue is that the prompt changes over time, so I can't just check for a constant string of characters.)
What is the best way to read the appropriate amount of data from the output handle and feed it to a parser? I'm confused because most of the handle-reading primitives require me to decide beforehand how much data I want to read. But it's the parser that should decide when to stop.
You seem to have two questions wrapped up in here. One is about incremental parsing, and one is about incremental reading.
Attoparsec supports incremental parsing directly. See the IResult type in Data.Attoparsec.Text. Parsec, alas, doesn't. You can run your parser on what you have, and if it gives an error, add more input and try again, but you really don't know whether the error was an unrecoverable parse error or just a need for more input.
In your case, usually REPLs read one line at a time, so you can use hGetLine to read a line and pass it to Attoparsec; if it parses, evaluate the result, and if not, read another line.
If you want to see all this in action, I do this kind of thing in Plush.Job.Output, but with three small differences: 1) I'm parsing byte streams, not strings. 2) I've set it up to pull as much as is available from the input and parse as many items as I can. 3) I'm reading directly from file descriptors. But the same structure should help you do it in your situation.
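If you do want the incremental route, here is a sketch of the feed loop using Attoparsec's IResult (the prompt parser itself is left abstract; hGetChunk blocks until some input is available):

import qualified Data.Attoparsec.Text as A
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.IO (Handle)

-- Feed chunks from the handle into the parser until it either
-- succeeds (Done) or fails (Fail); Partial means "need more input".
readUntil :: A.Parser a -> Handle -> IO (Either String (T.Text, a))
readUntil p h = go (A.parse p)
  where
    go k = do
      chunk <- TIO.hGetChunk h
      step (k chunk)
    step (A.Partial k')      = go k'
    step (A.Done leftover r) = pure (Right (leftover, r))
    step (A.Fail _ _ msg)    = pure (Left msg)

Note that Done hands back the leftover input after the match; you need to keep it and prepend it before the next read, since a chunk may run past the prompt.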
I got interested in parser generators. But I don't have the theoretical background. I just read a few things on the internet.
Currently I'm trying to do something with ANTLR.
Here's my situation: I have a special format for my data frames:
The first byte of a frame is a tag that describes the nature of the data
The second byte contains the length (number of bytes) of the data itself
Then follows the data itself
The data can itself contain data frames, and data frames can be listed one after the other
I hope my description is clear. My questions:
Can I create such a parser with ANTLR that reads the length of the frame and then knows when the frame ends?
In ANTLR can I load the different tags I use from a generated file?
Thank you!
I'm not 100% sure about this, but:
Parser generators like ANTLR require a grammar that is context-free
Using length fields in your data makes your grammar non-context-free (context-sensitive, I think)
It is the latter point I'm not sure about; maybe you want to research that some more.
You probably have to write a packet "parser" yourself (which then has to be a parser for your context-sensitive packet grammar).
Alternatively, you could drop the length field and use something like S-expressions, JSON, or XML; these would be parseable by something generated with ANTLR.
I think you will be better off to create a hand written binary parser instead of using ANTLR because ANTLR is primarily intended to read and make sense of a text file and not binary data. The lexer part is focused on tokenizing text so trying to make it read binary data instead would be an uphill battle.
It sounds as if your structure would need some kind of recursive way of reading the data, although it could be done more simply by building a tree structure and filling it as you read your file.
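As a hedged sketch of such a hand-written reader (assuming the one-byte tag and one-byte length described above):

import qualified Data.ByteString as B
import Data.Word (Word8)

data Frame = Frame { frameTag :: Word8, framePayload :: B.ByteString }
  deriving Show

-- Read a flat sequence of tag-length-value frames; since the payload can
-- itself contain frames, apply parseFrames recursively to framePayload
-- wherever the tag calls for it.
parseFrames :: B.ByteString -> Either String [Frame]
parseFrames bs
  | B.null bs           = Right []
  | B.length bs < 2     = Left "truncated frame header"
  | B.length body < len = Left "truncated frame payload"
  | otherwise           = (Frame tag payload :) <$> parseFrames rest
  where
    tag  = B.index bs 0
    len  = fromIntegral (B.index bs 1)
    body = B.drop 2 bs
    (payload, rest) = B.splitAt len body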
I have implemented combinatorial GLR parsers. Among them there are:
a char(·) parser, which consumes a specified character or range of characters;
a many(·) combinator, which repeats a specified parser from zero to infinitely many times.
Example: "char('a').many()" will match a string with any number of "a"-s.
But the many(·) combinator is greedy, so, for example, char('{') >> char('{') >> char('a'..'z').many() >> char('}') >> char('}') (where ">>" is sequential chaining of parsers) will successfully consume the whole "{{foo}}some{{bar}}" string.
I want to implement the lazy version of many(·) which, being used in previous example, will consume "{{foo}}" only. How can I do that?
Edit:
Maybe I confused you all. In my program, a parser is a function (or "functor" in C++ terms) which accepts a "step" and returns a forest of "steps". A "step" may be of OK type (meaning the parser has consumed part of the input successfully) or of FAIL type (meaning the parser has encountered an error). There are more types of steps, but they are auxiliary.
Parser = f(Step) -> Collection of TreeNodes of Steps.
So when I parse input, I:
Compose simple predefined Parser functions to get a complex Parser function representing the required grammar.
Form initial Step from the input.
Give the initial Step to the complex Parser function.
Filter TreeNodes with Steps, leaving only the OK ones (or those with the fewest FAILs if there were errors in the input).
Gather information from the Steps that remain.
I have implemented and have been using GLR parsers for 15 years as language front ends for a program transformation system.
I don't know what a "combinatorial GLR parser" is, and I'm unfamiliar with your notation, so I'm not quite sure how to interpret it. I assume this is some kind of curried function notation? I'm imagining your combinator rules are equivalent to defining a grammar in terms of terminal characters, where "char('a').many" corresponds to grammar rules:
char = "a" ;
char = char "a" ;
GLR parsers, indeed, produce all possible parses. The key insight to GLR parsing is its pseudo-parallel processing of all possible parses. If your "combinators" can propose multiple parses (that is, they produce grammar rules sort of equivalent to the above), and you indeed have them connected to a GLR parser, they will all get tried, and only those sequences of productions that tile the text (meaning all valid parses, including ambiguous ones) will survive.
If you have indeed implemented a GLR parser, this collection of all possible parses should have been extremely clear to you. The fact that it is not hints that what you have implemented is not a GLR parser.
Error recovery with a GLR parser is possible, just as with any other parsing technology. What we do is keep the set of live parses before the point of the error; when an error is found, we try (in pseudo-parallel, which the GLR parsing machinery makes easy if it is bent properly) all of the following: a) deleting the offending token, b) inserting all tokens that are essentially FOLLOW(x), where x is a live parse. In essence: delete the token, or insert one expected by a live parse. We then turn the GLR parser loose again. Only the valid parses (e.g., repairs) will survive. If the current token cannot be processed, the parser processing the stream with the token deleted survives. In the worst case, the GLR parser's error recovery ends up throwing away all tokens to EOF. A serious downside to this is that the GLR parser's running time grows pretty radically while parsing errors; if there are many in one place, the error-recovery time can go through the roof.
Won't a GLR parser produce all possible parses of the input? Then resolving the ambiguity is a matter of picking the parse you prefer. To do that, I suppose the elements of the parse forest need to be labeled according to what kind of combinator produced them, eager or lazy. (You can't resolve the ambiguity incrementally before you've seen all the input, in general.)
(This answer based on my dim memory and vague possible misunderstanding of GLR parsing. Hopefully someone expert will come by.)
Consider the regular expression <.*?> and the input <a>bc<d>ef. This should find <a>, and no other matches, right?
Now consider the regular expression <.*?>e with the same input. This should find <a>bc<d>e, right?
This poses a dilemma. For the user's sake, we want the behavior of the combinator >> to be understood in terms of its two operands. Yet there is no way to produce the second parser's behavior in terms of what the first one finds.
One answer is for each parser to produce a sequence of all parses, ordered by preference, rather than an unordered set of all parses. Greedy matching would return matches sorted longest to shortest; non-greedy, shortest to longest.
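One way to picture that is a toy "list of successes" parser where each parser returns all of its parses in preference order; the lazy many simply puts the empty match first (this is a sketch of the ordering idea, not of the GLR machinery):

-- Each parser returns every (result, remaining input) pair, best first.
newtype P a = P { runP :: String -> [(a, String)] }

char :: Char -> P Char
char c = P match
  where
    match (x:xs) | x == c = [(x, xs)]
    match _               = []

-- Greedy many: longer matches come first.
manyGreedy :: P a -> P [a]
manyGreedy p = P (\s ->
  [ (x:xs, s'') | (x, s') <- runP p s, (xs, s'') <- runP (manyGreedy p) s' ]
  ++ [([], s)])

-- Lazy many: the empty match comes first, so shorter matches are preferred.
manyLazy :: P a -> P [a]
manyLazy p = P (\s ->
  ([], s) : [ (x:xs, s'') | (x, s') <- runP p s
                          , (xs, s'') <- runP (manyLazy p) s' ])

Running manyLazy (char 'a') on "aa" yields [("","aa"), ("a","a"), ("aa","")], shortest first, while manyGreedy yields the reverse order; a chain of parsers that consumes results in this order realises the non-greedy preference.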
Non-greedy functionality is nothing more than a disambiguation mechanism. If you truly have a generalized parser (which does not require disambiguation to produce its results), then "non-greedy" is meaningless; the same results will be returned whether or not an operator is "non-greedy".
Non-greedy disambiguation behavior could be applied to the complete set of results provided by a generalized parser. Working left-to-right, filter the ambiguous sub-groups corresponding to a non-greedy operator to use the shortest match which still led to a successful parse of the remaining input.