How to also get how many characters read in parse?

I'm using Numeric.readDec to parse numbers and reads to parse Strings, but I also need to know how many characters were read.
For example, readDec "52 rest" returns [(52," rest")], having read 2 characters. But I can't find a good way to learn that it read 2 characters.
You could take the length of show 52, but if the input was 052 that would give you the wrong answer (this approach also wouldn't work for the string parsing, which has escape characters). You could also subtract the length of the leftover string from the length of the input string, but that is very inefficient for long strings with many parses.
How can this be done correctly and efficiently (preferably without writing your own parser)?

With just base, instead of readDec, you can use readDecP from Text.Read.Lex, which uses a ReadP parser:
readDecP :: (Eq a, Num a) => ReadP a
The gather combinator in Text.ParserCombinators.ReadP returns the parse result along with the actual characters parsed:
gather :: ReadP a -> ReadP (String, a)
You can run the parser with readP_to_S, which gives back a ReadS parser: a function that accepts a string and produces a list of possible parses, each paired with the remainder of the string.
readP_to_S :: ReadP a -> ReadS a
type ReadS a = String -> [(a, String)]
An example in GHCi:
> import Text.ParserCombinators.ReadP (gather, readP_to_S)
> import Text.Read.Lex (readDecP)
> readP_to_S (gather readDecP) "52 rest"
[(("52",52)," rest")]
> readP_to_S (gather readDecP) "0644 permissions"
[(("0644",644)," permissions")]
You can simply check that there is only one valid parse if you want the result to be unambiguous, and then take the length of the first component to find the number of Char code points parsed.
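Putting the pieces together, a small helper might look like this (readDecWithLength is a hypothetical name, not part of base; it insists on an unambiguous parse and returns the character count, the value, and the leftover input):

import Text.ParserCombinators.ReadP (gather, readP_to_S)
import Text.Read.Lex (readDecP)

-- Parse a leading decimal number, reporting how many characters it consumed.
readDecWithLength :: String -> Maybe (Int, Integer, String)
readDecWithLength s =
  case readP_to_S (gather readDecP) s of
    [((consumed, n), rest)] -> Just (length consumed, n, rest)
    _                       -> Nothing  -- no parse, or ambiguous

For example, readDecWithLength "052 rest" evaluates to Just (3,52," rest"), correctly reporting 3 characters even though show 52 has length 2.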
These parsers are fairly limited, however; if you want something easier to use, faster, or able to produce more detailed error messages, then you should check out a more fully featured parsing package such as regex-applicative (regular grammars) or megaparsec (context-sensitive grammars).

Related

How to write an Antlr4 grammar that matches X number of characters

I want to use Antlr4 to parse a format that stores the length of segments in the serialised form
For example, to parse:
"6,Hello 5,World"
I tried to create a grammar like this
grammar myGrammar;
sequence:
(LEN ',' TEXT)*;
LEN: [0-9]+;
TEXT: // I don't know what to put in here but it should match LEN number of chars
Is this even possible with Antlr?
A real-world example of this would be parsing the MessagePack binary format, which has several types that serialise the length of the data into the serialised form.
For example, there is str8:
str8 stores a byte array whose length is up to (2^8)-1 bytes:
+--------+--------+========+
| 0xd9 |YYYYYYYY| data |
+--------+--------+========+
And the str16 type:
str16 stores a byte array whose length is up to (2^16)-1 bytes:
+--------+--------+--------+========+
| 0xda |ZZZZZZZZ|ZZZZZZZZ| data |
+--------+--------+--------+========+
In these examples the first byte identifies the type; then we have 1 byte for str8, or 2 bytes for str16, containing the length of the data; and finally there is the data itself.
I think a rule might look something like this, but I don't know how to match the right amount of data:
str8 : '\u00d9' BYTE DATA ;
str16: '\u00da' BYTE BYTE DATA ;
BYTE : '\u0000'..'\u00FF' ;
DATA : ???
The data format you describe is usually called TLV (tag/type–length–value). TLV cannot be recognised by a regular expression (or even by a context-free grammar), so it's not usually supported by standard tokenisers.
Fortunately, it's easy to tokenise by hand. Standard libraries may exist for particular formats, and some formats even have automated code generators for more efficient parsing. But you should be able to write a simple tokeniser for a particular format in a few lines of code, as in the sketch below.
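To make that concrete, here is a minimal sketch of such a tokeniser in Haskell for the textual "LEN,TEXT" format from the question (segments is an illustrative name, not a library function):

import Data.Char (isDigit)

-- Split "6,Hello 5,World" into its length-prefixed segments.
segments :: String -> Maybe [String]
segments [] = Just []
segments s =
  case span isDigit s of
    (digits@(_:_), ',' : rest) ->
      let n = read digits
          (seg, rest') = splitAt n rest
      in if length seg == n
           then (seg :) <$> segments rest'  -- keep tokenising the tail
           else Nothing                     -- input ended early
    _ -> Nothing                            -- missing length or comma

Here segments "6,Hello 5,World" evaluates to Just ["Hello ","World"]; the six characters after "6," include the trailing space, which is how the example input delimits segments.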
Once you have written the data-stream tokeniser, you could use a parser generator like Antlr to build a data structure from the parse, but it's rarely necessary. Most TLV-encoded streams are simple sequences of components, although you occasionally run into formats (like Google protobufs or ASN.1) which include nested subsequences. Even with those, the parse is straightforward (although for both of those examples, standard tools exist).
In any event, using context-free grammar tools like Antlr is rarely the simplest solution, because TLV formats are mostly order-independent. (If the order were fixed, the tags wouldn't be necessary.) Context-free grammars do not have any way of handling a language such as "at most one of A, B, C, D, and E in any order" other than enumerating the alternatives, of which there are an exponential number.

"Lexeme" vs "Token" Terminology

I'm trying to understand the difference between "lexeme" and "token" in compilers.
If the lexer part of my compiler encounters the following sequence of characters in the source code to be compiled:
"abc"
is it correct to say that the above is a lexeme that is 5 characters long?
If my compiler is implemented in C, and I allocate space for a token for this lexeme, the token will be a struct. The first member of the struct will be an int holding the type from some enum, in this case STRING_LITERAL. The second member will be a char * that points to some (dynamically allocated) memory of 4 bytes. The first byte is 'a', the second 'b', the third 'c', and the fourth is the terminating NUL ('\0').
So...
The lexeme is 5 characters of the source-code text.
The token is a total of 6 bytes in memory.
Is that the correct way to use the terminology?
(I'm ignoring tokens tracking meta data like filename, line number, and column number.)
Sort of related question:
Is it uncommon practice to have the lexer convert an integer lexeme into an integer value in a token? Or is it better (or more standard) to store the characters of the lexeme in a token and let the parser stage convert those characters to an integer node to be attached to the AST?
A "lexeme" is a literal character in the source, for example 'a' is a lexeme in "abc". It is the smallest unit. The "lexer" or lexical analysis stage converts lexemes into tokens(such as keywords, identifiers, literals, operators etc) which are the smallest units the parser can use to create ASTs. So if we have the statement
int x = 0;
The lexer would output
<type:int> <id: x> <operator: = > <literal: 0> <semicolon>
The lexer is typically a collection of regular expressions that simply define which sequences of characters form the terminals in the language's grammar. These are turned into tokens, which are fed into the parser as a stream.
However, most people use lexeme and token interchangeably, and it usually doesn't cause confusion. For your question about converting the int literal: you would want a wrapper class for your AST, since just having an integer alone might not be enough information.
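To make the conversion question concrete, here is a sketch of a token type in Haskell (the names are illustrative) that converts the integer in the lexer while also keeping the original lexeme around for error messages:

data Token
  = TType  String          -- e.g. "int"
  | TIdent String          -- e.g. "x"
  | TOp    String          -- e.g. "="
  | TInt   Integer String  -- converted value plus its lexeme, e.g. 0 and "0"
  | TSemi
  deriving Show

The statement int x = 0; would then tokenise to [TType "int", TIdent "x", TOp "=", TInt 0 "0", TSemi].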

Erlang: Strange chars in a generated list

Trying to generate a list through a comprehension, and at some point I start seeing strange character strings. I'm unable to explain their presence at this point (I'm guessing the escaped chars are ASCII codes, but why?):
45> [[round(math:pow(X,2))] ++ [Y]|| X <- lists:seq(5,10), Y <- lists:seq(5,10)].
[[25,5],
[25,6],
[25,7],
[25,8],
[25,9],
[25,10],
[36,5],
[36,6],
[36,7],
"$\b","$\t","$\n",
[49,5],
[49,6],
[49,7],
"1\b","1\t","1\n",
[64,5],
[64,6],
[64,7],
"#\b","#\t","#\n",
[81,5],
[81,6],
[81,7],
"Q\b",
[...]|...]
In Erlang, all strings are just lists of small integers (like chars in C), and the shell, to help you out a little, tries to print any list of printable character codes as a string. So what you got are the numbers you computed; they are just printed in a way you did not expect.
If you would like to change this behaviour, you can look at this answer.
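As a cross-check of the codes (done in Haskell here, but the arithmetic is the same anywhere): 36 is '$', 8 is '\b', 9 is '\t' and 10 is '\n', so [36,8] really is the two-character string "$\b".

import Data.Char (chr)

main :: IO ()
main = print (map chr [36, 8, 9, 10])  -- prints "$\b\t\n"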

Haskell command system parser

I've written this simple parser that takes its input from the command line (ps auxww | ./myparser) and parses the output of the ps command in order to insert it into the Process data structure I created.
I've succeeded in parsing one line of the result String, but now I'm stuck trying to parse the whole string and return a [Process] instead of a single Process. The problem is how to implement parsePS: it has to call myParser many times in order to parse every single line, return a list of Process, and print it to the terminal.
Can someone help me?
I'm not sure what's failing for you, but I'm guessing the spacing is killing you. If so, I have two ideas that might help.
First, modify myParser to consume spaces at the end, and the many combinator should work:
myParser = do
...
spaces
command <- pCommand
spaces -- CONSUME END OF LINE
return Entry{ ... }
Then many myParser should work.
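As a toy illustration of why this works (a self-contained sketch, not the question's actual myParser): consuming trailing whitespace inside the item parser is what lets many chain cleanly.

import Text.Parsec
import Text.Parsec.String (Parser)

-- Each "entry" here is just one word; spaces eats the separators.
word :: Parser String
word = do
  w <- many1 (noneOf " \n")
  spaces  -- consume trailing separators, including newlines
  return w

-- parse (many word) "" "a b\nc\n"  ==  Right ["a","b","c"]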
Alternatively, you could split the input into lines first and call parse on each:
argLines <- fmap lines getContents
(I take it you mean to burn the first line via getLine before the hGetContents?)
It sounds to me like you're looking for a way to parse each line in sequence and return a list of parsed results. How about mapM from the Prelude?
If myParser :: Parser Process, then parse myParser "" :: String -> Either ParseError Process, so mapM (parse myParser "") :: [String] -> Either ParseError [Process] (using Parsec's names). So if you have a list of lines (call it lns) that you want to parse in sequence, mapM (parse myParser "") lns gives you all the parsed results, or else the first error.
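A complete minimal sketch along those lines (the Process fields and the details of myParser are hypothetical stand-ins for the question's own definitions):

import Text.Parsec
import Text.Parsec.String (Parser)

data Process = Process { user :: String, pid :: Int } deriving Show

-- Hypothetical per-line parser: a user name, whitespace, a pid,
-- then the rest of the line, which is ignored.
myParser :: Parser Process
myParser = do
  u <- many1 (noneOf " ")
  _ <- many1 space
  p <- many1 digit
  _ <- many anyChar
  return (Process u (read p))

main :: IO ()
main = do
  input <- getContents
  let lns = drop 1 (lines input)  -- skip the ps header row
  case mapM (parse myParser "ps") lns of
    Left err    -> print err
    Right procs -> mapM_ print procs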

String splitting problems in Erlang

I've been playing around with the splitting of atoms and have a problem with strings. The input data will always be an atom that consists of some letters and then some numbers, for instance ms444, r64 or min1. Since the function lists:splitwith/2 takes a list, the atom is first converted into a list:
24> lists:splitwith(fun (C) -> is_atom(C) end, [m,s,4,4,4]).
{[m,s],[4,4,4]}
25> lists:splitwith(fun (C) -> is_atom(C) end, atom_to_list(ms444)).
{[],"ms444"}
26> atom_to_list(ms444).
"ms444"
I want to separate the letters from the numbers, and I've succeeded in doing that when using a list, but since I start out with an atom I get a "string" as the result of atom_to_list to feed into my splitwith function...
Is it interpreting each item in the list as a string, or what is going on?
You might want to have a look at the string module documentation:
http://www.erlang.org/doc/man/string.html
The following function might interest you:
tokens(String, SeparatorList) -> Tokens
Since strings in Erlang are just a list() of integer()s, the test in the fun checks whether each item is an atom() when it is in fact an integer(). If the test is changed to look for letters, it works:
29> lists:splitwith(fun (C) -> (C >= $a) and (C =< $z) end, atom_to_list(ms444)).
{"ms","444"}
An atom in Erlang is a named constant, not a variable (or at least not what a variable is in an imperative language).
You should really not create atoms in a dynamic fashion (that is, don't convert things to atoms at runtime).
They are used more in pattern matching and send/receive code:
Pid ! {matchthis, X},
receive
    {foobar, Y} -> doY(Y);
    {matchthis, X} -> doX(X);
    Other -> doother(Other)
end
A variable like X could be set to an atom, for example X = if 1 == 1 -> ok; true -> fail end. I may just suffer from poor imagination, but I can't think of a reason why you would want to parse an atom. You should be in charge of which atoms you write, and not use list_to_atom(CharIntegerList).
Can you perhaps give a broader overview of what you would like to accomplish?
A "string" in Erlang is not a primitive type: it is just a list() of integers(). So if you want to "separate" the letters from the digits, you'll have to do comparison with the integer representation of the characters.
