How to write an Antlr4 grammar that matches X number of characters - parsing

I want to use Antlr4 to parse a format that stores the length of segments in the serialised form.
For example, to parse:
"6,Hello 5,World"
I tried to create a grammar like this
grammar myGrammar;
(LEN ',' TEXT)*;
LEN: [0-9]+;
TEXT: // I don't know what to put in here, but it should match LEN number of chars
Is this even possible with Antlr?
A real world example of this would be parsing the messagePack binary format which has several types that serialise the length of the data into the serialised form.
For example there is the str8:
str 8 stores a byte array whose length is upto (2^8)-1 bytes:
| 0xd9 |YYYYYYYY| data |
And str16 type
str16 stores a byte array whose length is upto (2^16)-1 bytes:
| 0xda |ZZZZZZZZ|ZZZZZZZZ| data |
In these examples the first byte identifies the type, then we have 1 byte for str8 and 2 bytes for str16 which contain the length of the data. Then finally there is the data.
I think a rule might look something like this, but I don't know how to match the right amount of data:
str8 : '\u00d9' BYTE DATA ;
str16: '\u00da' BYTE BYTE DATA ;
BYTE : '\u0000'..'\u00FF' ;
DATA : ???

The data format you describe is usually called TLV (tag/type–length–value). TLV cannot be recognised with a regular expression (or even with a context-free grammar) so it's not usually supported by standard tokenisers.
Fortunately, it's easy to tokenise. Standard libraries may exist for particular formats, and some formats even have automated code generators for more efficient parsing. But you should be able to write a simple tokeniser for a particular format in a few lines of code.
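As a rough illustration of how small such a tokeniser can be, here is a minimal sketch for the "LEN,TEXT" format from the question. The function name tokenise_length_prefixed and the error handling are my own assumptions, not part of any library; the sketch also assumes LEN counts characters, not bytes.

```python
def tokenise_length_prefixed(text):
    """Split a string of 'LEN,DATA' segments, where LEN is a decimal character count."""
    segments = []
    pos = 0
    while pos < len(text):
        comma = text.find(",", pos)
        if comma == -1:
            raise ValueError(f"missing ',' after length field at offset {pos}")
        length_field = text[pos:comma]
        if not (length_field.isascii() and length_field.isdigit()):
            raise ValueError(f"bad length field {length_field!r} at offset {pos}")
        start = comma + 1
        end = start + int(length_field)
        if end > len(text):
            raise ValueError(f"segment at offset {pos} runs past the end of the input")
        # The length field, not a delimiter, decides where the segment stops.
        segments.append(text[start:end])
        pos = end
    return segments


segments = tokenise_length_prefixed("6,Hello 5,World")
```

Note that the first segment of the sample input is "Hello " with its trailing space, because the length says 6. The resulting list of segments is what you would hand to a parser, if you need one at all.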
Once you have written the datastream tokeniser, you could use a parser generator like Antlr to build a data structure from the parse, but it's rarely necessary. Most TLV-encoded streams are simple sequences of components, although you occasionally run into formats (like Google protobufs or ASN.1) which include nested subsequences. Even with those, the parse is straightforward (although for both of those examples, standard tools exist).
In any event, using context-free grammar tools like Antlr is rarely the simplest solution, because TLV formats are mostly order-independent. (If the order were fixed, the tags wouldn't be necessary.) Context-free grammars do not have any way of handling a language such as "at most one each of A, B, C, D, and E, in any order" other than enumerating the alternatives, of which there are an exponential number.
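The same idea carries over to the binary str 8 and str 16 layouts quoted in the question. The sketch below is a hand-written illustration, not the API of any MessagePack library; the name read_str is hypothetical, and it handles only those two tags, reading the length field as big-endian.

```python
import struct


def read_str(buf, pos=0):
    """Read one str 8 / str 16 value from buf at pos; return (data, next_pos)."""
    tag = buf[pos]
    if tag == 0xD9:    # str 8: one length byte follows the tag
        (length,) = struct.unpack_from(">B", buf, pos + 1)
        start = pos + 2
    elif tag == 0xDA:  # str 16: two length bytes (big-endian) follow the tag
        (length,) = struct.unpack_from(">H", buf, pos + 1)
        start = pos + 3
    else:
        raise ValueError(f"unsupported tag 0x{tag:02x} at offset {pos}")
    end = start + length
    if end > len(buf):
        raise ValueError("data is shorter than its length field claims")
    return buf[start:end], end


data8, next8 = read_str(b"\xd9\x05Hello")
data16, next16 = read_str(b"\xda\x00\x05World!")
```

Returning the next offset alongside the data lets the caller loop over a stream of values, dispatching on the tag each time, which is where the order-independence mentioned above is handled in ordinary code instead of in a grammar.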


How to also get how many characters read in parse?

I'm using Numeric.readDec to parse numbers and reads to parse Strings. But I also need to know how many characters were read.
For example, readDec "52 rest" returns [(52," rest")], and reads 2 characters. But there isn't a great way that I can find to know that it read 2 characters.
You could check the string length of show 52, but if the input was 052 that would give you the wrong answer (this solution also wouldn't work for the string parsing which has escape characters). You also could use the length of the post parsed string subtracted from the length of the input string. But this is very inefficient for long strings with many parses.
How can this be done correctly and efficiently (preferably without just writing your own parser)?
With just base, instead of readDec, you can use readDecP from Text.Read.Lex, which uses a ReadP parser:
readDecP :: (Eq a, Num a) => ReadP a
The gather combinator in Text.ParserCombinators.ReadP returns the parse result along with the actual characters parsed:
gather :: ReadP a -> ReadP (String, a)
You can run the parser with readP_to_S, which gives back a ReadS parser, which is a function that accepts a string and produces a list of possible parses with the remainder of the string.
readP_to_S :: ReadP a -> ReadS a
type ReadS a = String -> [(a, String)]
An example in GHCi:
> import Text.ParserCombinators.ReadP (gather, readP_to_S)
> import Text.Read.Lex (readDecP)
> readP_to_S (gather readDecP) "52 rest"
[(("52",52)," rest")]
> readP_to_S (gather readDecP) "0644 permissions"
[(("0644",644)," permissions")]
You can simply check that there is only one valid parse if you want the result to be unambiguous, and then take the length of the first component to find the number of Char code points parsed.
These parsers are fairly limited, however; if you want something easier to use, faster, or able to produce more detailed error messages, then you should check out a more fully featured parsing package such as regex-applicative (regular grammars) or megaparsec (context-sensitive grammars).

How to understand ANTLRWorks 1.5.2 MismatchedTokenException(80!=21)

I'm testing a simple grammar (shown below) with simple input strings and get the following error message from the Antlrworks interpreter: MismatchedTokenException(80!=21).
My input (abc45{r24}) means "repeat the keys a, b, c, 4 and 5, 24 times."
ANTLRWorks 1.5.2 Grammar:
expr : '(' (key)+ repcount ')' EOF;
key : KEY | digit ;
repcount : '{' 'r' count '}';
count : (digit)+;
digit : DIGIT;
DIGIT : '0'..'9';
KEY : ('a'..'z'|'A'..'Z') ;
(abc4{r4}) - ok
(abc44{r4}) - fails NoViableAltException
(abc4 4{r4}) - ok
(abc4{r45}) - fails MismatchedTokenException(80!=21)
(abc4{r4 5}) - ok
The parse succeeds with input (abc4{r4}) (single digits only).
The parse fails with input (abc44{r4}) (NoViableAltException).
The parse fails with input (abc4{r45}) (MismatchedTokenException(80!=21)).
The parse errors go away if I put a space between 44 or 45 to separate the individual digits.
Q1. What does NoViableAltException mean? How can I interpret it to look for a problem in the grammar/input pair?
Q2. What does the expression 80!=21 mean? Can I do anything useful with the information to look for a problem in the grammar/input pair?
I don't understand why the grammar has a problem reading successive digits. I thought my expressions (key)+ and (digit)+ specify that successive digits are allowed and would be read as successive individual digits.
If someone could explain what I'm doing wrong, I would be grateful. This seems like a simple problem, but hours later, I still don't understand why and how to solve it. Thank you.
Further down in my simple grammar file I had a lexer rule for FLOAT copied from another grammar. I did not think to include it above (or check it as a source of the errors) because it was not used by any parser rule and would never match my input characters. Here is the FLOAT grammar rule (which contains sequences of DIGITs):
FLOAT
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
If I delete the whole rule, all my test cases above parse successfully. If I leave any one of the three FLOAT clauses in the grammar/lexer file, the parses fail as shown above.
Q3. Why does the FLOAT rule cause failures in the parse? The DIGIT lexer rule appears first, and so should "win" and be used in preference to the FLOAT rule. Besides, the FLOAT rule doesn't match the input stream.
I hazard a guess that the lexer is skipping the DIGIT rule and getting stuck in the FLOAT rule, even though FLOAT comes after DIGIT in the input file.
I took these two screenshots after Bart's comment below to show the parse failures that I am experiencing. Not that it matters, but ANTLRWorks 1.5.2 will not accept the regular expression syntax SPACE : [ \t\r\n]+; used in Bart's kind replies. Maybe the screenshots will help. They show all the rules in my grammar file.
The only difference between the two screenshots is that one input has two sets of multiple digits and the other input string has only one set of multiple digits. Maybe this extra info will help somehow.
If I remember correctly, ANTLR's v3 lexer is less powerful than v4's. When the lexer gets the input "123x", the first 3 chars (123) are consumed by the lexer rule FLOAT, but when the lexer then encounters the x, it knows it cannot complete the FLOAT rule. However, the v3 lexer does not give up on its partial match; it only tries to find another rule, below it, that matches these same 3 chars (123). Since there is no such rule, the lexer throws an exception. Again, I'm not 100% sure; this is how I remember it.
ANTLR v4's lexer will give up on the partial 123 match and will return 23 to the char stream to create a single DIGIT token for the input 1.
I highly suggest you move away from v3 and opt for the more powerful v4 version.
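To make the "give up on the partial match" behaviour concrete, here is a small longest-match tokeniser sketch. It does not model ANTLR's internals; it only shows the outcome described above for v4, where a FLOAT attempt that cannot be completed leaves the digits to be matched one at a time by DIGIT. The EXPONENT definition is an assumption (the question does not show it), and the LITERAL rule stands in for the quoted literals used in the parser rules.

```python
import re

# Assumption: EXPONENT is not shown in the question; this definition is a guess.
EXPONENT = r"[eE][+-]?[0-9]+"

# Rules in priority order: the longest match wins, and on a tie the earlier rule wins.
RULES = [
    ("LITERAL", re.compile(r"[(){}r]")),
    ("DIGIT", re.compile(r"[0-9]")),
    ("KEY", re.compile(r"[a-zA-Z]")),
    ("FLOAT", re.compile(
        rf"[0-9]+\.[0-9]*(?:{EXPONENT})?"
        rf"|\.[0-9]+(?:{EXPONENT})?"
        rf"|[0-9]+{EXPONENT}"
    )),
]


def tokenise(text):
    """Return (rule name, matched text) pairs using longest-match tokenisation."""
    tokens = []
    pos = 0
    while pos < len(text):
        best = None
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # A rule that cannot complete at this position is simply not a candidate.
            if match and (best is None or match.end() > best[1].end()):
                best = (name, match)
        if best is None:
            raise ValueError(f"no rule matches {text[pos]!r} at offset {pos}")
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens


names = [name for name, _ in tokenise("(abc44{r45})")]
```

With this scheme "44" becomes two DIGIT tokens because FLOAT never matches without a '.' or an exponent, while an input such as "4.5" would still come out as a single FLOAT.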

"Lexeme" vs "Token" Terminology

I'm trying to understand the difference between "lexeme" and "token" in compilers.
If the lexer part of my compiler encounters the following sequence of characters in the source code to be compiled:
"abc"
is it correct to say that the above is a lexeme that is 5 characters long?
If my compiler is implemented in C, and I allocate space for a token for this lexeme, the token will be a struct. The first member of the struct will be an int which will have the type from some enum, in this case STRING_LITERAL. The second member of the struct will be a char * that points to some (dynamically allocated) memory that has 4 bytes. The first byte is 'a', the second 'b', the third 'c', and the fourth is NUL ('\0') to terminate the string.
The lexeme is 5 characters of the source code text.
The token is a total of 6 bytes in memory.
Is that the correct way to use the terminology?
(I'm ignoring tokens tracking meta data like filename, line number, and column number.)
Sort of related question:
Is it uncommon practice to have the lexer convert an integer lexeme into an integer value in a token? Or is it better (or more standard) to store the characters of the lexeme in a token and let the parser stage convert those characters to an integer node to be attached to the AST?
A "lexeme" is a literal character in the source, for example 'a' is a lexeme in "abc". It is the smallest unit. The "lexer" or lexical analysis stage converts lexemes into tokens (such as keywords, identifiers, literals, operators, etc.), which are the smallest units the parser can use to create ASTs. So if we have the statement
int x = 0;
The lexer would output
<type:int> <id: x> <operator: = > <literal: 0> <semicolon>
The lexer is typically a collection of regular expressions that define groups of characters as what would be terminals in the language's grammar. These are turned into tokens, which are fed into the parser as a stream.
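A minimal sketch of such a regular-expression lexer, just large enough to reproduce the token stream shown above for int x = 0; (the token names follow that example; the rule set and the lex function are illustrative assumptions, not a complete C lexer):

```python
import re

# One named group per token kind; earlier alternatives take priority.
TOKEN_SPEC = [
    ("type", r"\bint\b"),
    ("id", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("literal", r"[0-9]+"),
    ("operator", r"="),
    ("semicolon", r";"),
    ("skip", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def lex(source):
    """Turn source text into a list of (token kind, matched text) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "skip":  # whitespace separates tokens but is not one
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = lex("int x = 0;")
```

Each pair keeps the token kind next to the matched source text, so a later stage can still see the characters that were matched.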
However, most people use lexeme and token interchangeably, and it usually doesn't cause confusion. For your question about converting the int literal, you would want a wrapper class for your AST. Just having an integer alone might not be enough information.

Append text file to lexicon in Rascal

Is it possible to append terminals retrieved from a text file to a lexicon in Rascal? This would happen at run time, and I see no obvious way to achieve this. I would rather keep the data separate from the Rascal project. For example, if I had read in a list of countries from a text file, how would I add these to a lexicon (using the lexical keyword)?
In the data-dependent version of the Rascal parser this is even easier and faster but we haven't released this yet. For now I'd write a generic rule with a post-parse filter, like so:
rascal>set[str] lexicon = {"aap", "noot", "mies"};
set[str]: {"noot","mies","aap"}
rascal>lexical Word = [a-z]+;
rascal>syntax LexiconWord = word: Word w;
rascal>LexiconWord word(Word w) { // called when the LexiconWord.word rule is used to build a tree
>>>>>>> if ("<w>" notin lexicon)
>>>>>>> filter; // remove this parse tree
>>>>>>> else fail; // just build the tree
>>>>>>>}
rascal>[Sentence] "hello"
|prompt:///|(0,18,<1,0>,<1,18>): ParseError(|prompt:///|(0,18,<1,0>,<1,18>))
at $root$(|prompt:///|(0,64,<1,0>,<1,64>))
rascal>[Sentence] "aap"
Sentence: (Sentence) `aap`
Because the filter function removed all possible derivations for hello, the parser eventually returns a parse error on hello. It does not do so for aap, which is in the lexicon, so hurray. Of course you can make interestingly complex derivations with this kind of filtering. People sometimes write ambiguous grammars and use filters like this to make them unambiguous.
Parsing and filtering in this way is in cubic worst-case time in terms of the length of the input, if the filtering function is in amortized constant time. If the grammar is linear, then of course the entire process is also linear.
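Outside Rascal, the generic-rule-plus-filter idea looks roughly like the following Python sketch. It is only an analogy for the approach, not Rascal's implementation; load_lexicon and parse_lexicon_word are hypothetical names, and the lexicon is read from any iterable of lines so that a text file opened at run time would work the same way.

```python
import io
import re

WORD = re.compile(r"[a-z]+")  # the generic rule, like: lexical Word = [a-z]+;


def load_lexicon(lines):
    """Build the lexicon at run time from an iterable of lines (for example an open file)."""
    return {line.strip() for line in lines if line.strip()}


def parse_lexicon_word(text, lexicon):
    """Accept text only if it matches the generic rule and survives the lexicon filter."""
    if WORD.fullmatch(text) is None:
        raise ValueError(f"{text!r} is not a Word")
    if text not in lexicon:  # the post-parse filter: reject words outside the lexicon
        raise ValueError(f"{text!r} is not in the lexicon")
    return text


# io.StringIO stands in for a text file kept outside the project.
lexicon = load_lexicon(io.StringIO("aap\nnoot\nmies\n"))
accepted = parse_lexicon_word("aap", lexicon)
```

Keeping the lexicon in a set makes the membership test constant time on average, in line with the assumption about the filtering function above.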
A completely different answer would be to dynamically update the grammar and generate a parser from this. This involves working against the internal grammar representation of Rascal like so:
set[str] lexicon = {"aap", "noot", "mies"};
syntax Word = ; // empty definition
typ = #Word;
grammar = typ.definitions;
grammar[sort("Word")] = { prod(sort("Word"), lit(x), {}) | x <- lexicon };
newType = type(sort("Word"), grammar);
This newType is a reified grammar + type for the definition of the lexicon, which can now be used like so:
import ParseTree;
if (type[Word] staticGrammar := newType) {
parse(staticGrammar, "aap");
}
Now having written all this, two things:
I think this may trigger unknown bugs since we did not test dynamic parser generation, and
For a lexicon with a reasonable size, this will generate an utterly slow parser since the parser is optimized for keywords in programming languages and not large lexicons.

bison and grammar: replaying the parse stack

I have not messed with building languages or parsers in a formal way since grad school and have forgotten most of what I knew back then. I now have a project that might benefit from such a thing but I'm not sure how to approach the following situation.
Let's say that in the language I want to parse there is a token that means "generate a random floating point number" in an expression.
{$$ = $1;}
{$$ = $1 + $3;}
| R PLUS exp
{$$ = random() + $3;}
I also want a "list" generating operator that will reevaluate an "exp" some number of times. Maybe like:
listExp: NUMBER COLON exp
{
for (int i = 0; i < $1; i++) {
print $3;
}
}
The problem I see is that "exp" will have already been reduced by the time the loop starts. If I have the input
2 : R + 2
then I think the random number will be generated as the "exp" is parsed and 2 added to it -- let's say the result is 2.0055. Then in the list expression I think 2.0055 would be printed out twice.
Is there a way to mark the "exp" before evaluation and then parse it as many times as the list loop count requires? The idea being to get a different random number in each evaluation.
Your best bet is to build an AST and evaluate the entire AST at the end of the parse. In-line evaluation is only possible for very simple (i.e. "calculator-like") projects.
Instead of an AST, you could construct code for a stack- or three-address- virtual machine. That's generally more efficient, particularly if you intend to execute the code frequently, but the AST is a lot simpler to construct, and executing it is a single depth-first scan.
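A minimal sketch of the AST approach, written in Python for brevity instead of as bison actions. The node classes (Num, Rand, Add, ListExp) are hypothetical names; in a bison grammar each action would build one of these nodes instead of computing a value, and evaluation would happen after the parse.

```python
import random
from dataclasses import dataclass


@dataclass
class Num:
    value: float

    def evaluate(self, rng):
        return self.value


@dataclass
class Rand:
    def evaluate(self, rng):
        return rng.random()  # a fresh number on every evaluation


@dataclass
class Add:
    left: object
    right: object

    def evaluate(self, rng):
        return self.left.evaluate(rng) + self.right.evaluate(rng)


@dataclass
class ListExp:
    count: int
    body: object

    def evaluate(self, rng):
        # Re-evaluate the same subtree once per element.
        return [self.body.evaluate(rng) for _ in range(self.count)]


# The tree a parser might build for the input: 2 : R + 2
tree = ListExp(2, Add(Rand(), Num(2)))
values = tree.evaluate(random.Random(0))
```

Because the Rand node is evaluated each time the list body runs, the two elements differ, which is the behaviour the question asks for. The seeded random.Random(0) is there only to make the sketch repeatable.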
Depending on your language design there are at least 5 different points at which a token in the language could be bound to a value. They are:
1. Pre-processor (like C #define)
2. Lexer: recognise tokens
3. Parser: recognise token structure, output AST
4. Semantic analysis: analyse AST, assign types and conversions etc
5. Code generation: output executable code or execute code directly.
If you have a token that can occur multiple times and you want to assign it a different random value each time, then phase 4 is the place to do it. If you generate an AST, walk the tree and assign the values. If you go straight to code generation (or an interpreter) do it then.
