haskell identifier recognition - parsing

my working in file parsing using Haskell, and I'm using both Data.Attoparsec.Char8 and Data.ByteString.Char8. I want to parse an expression which can contains symbols like : - / [ ] _ . (minus, slashes, braquets and underscore).
I've write the following parser
import qualified Data.ByteString.Char8 as B
import qualified Data.Attoparsec.Char8 as A
identifier' :: Parser B.ByteString
identifier' = A.takeWhile $ A.inClass "A-Za-z0-9_//- /[/]"
... but it's not works like expected.
ghc> A.parse identifier' (B.pack "EMBXSHM-PortClo")
Done "-PortClo" "EMBXSHM"
ghc> A.parse identifier' (B.pack "AU_D[1].PCMPTask")
Done ".PCMPTask" "AU_D[1]"
can someone help me.
Thanks for your time.

Take a look at the documentation: http://hackage.haskell.org/packages/archive/attoparsec/0.10.1.0/doc/html/Data-Attoparsec-ByteString-Char8.html#g:9
To add a "-" to a set, place it a the beginning or end of a string.
The latter doesn't parse because you don't have dots in your class listing.

You want to allow '-' characters in identifiers, but A.inClass uses '-' for ranges. You have to put it at the start or end of the range string:
To add a literal '-' to a set, place it at the beginning or end of the string.
— attoparsec documentation

Related

ANTLR4 match any not-matched sections into one single STRING token

I am trying to create a Lexer/Parser with ANTLR that can parse plain text with 'tags' scattered inbetween.
These tags are denoted by opening ({) and closing (}) brackets and they represent Java objects that can evaluate to a string, that is then replaced in the original input to create a dynamic template of sorts.
Here is an example:
{player:name} says hi!
The {player:name} should be replaced by the name of the player and result in the output i.e. Mark says hi! for the player named Mark.
Now I can recognize and parse the tags just fine, what I have problems with is the text that comes after.
This is the grammar I use:
grammar : content+
content : tag
| literal
;
tag : player_tag
| <...>
| <other kinds of tags, not important for this example>
| <...>
;
player_tag : BRACKET_OPEN player_identifier SEMICOLON player_string_parameter BRACKET_CLOSE ;
player_string_parameter : NAME
| <...>
;
player_identifier : PLAYER ;
literal : NUMBER
| STRING
;
BRACKET_OPEN : '{';
BRACKET_CLOSE : '}';
PLAYER : 'player'
NAME : 'name'
NUMBER : <...>
STRING : (.+)? /* <- THIS IS THE PROBLEMATIC PART !*/
Now this STRING Lexer definition should match anything that is not an empty string but the problem is that it is too greedy and then also consumes the { } bracket tokens needed for the tag rule.
I have tried setting it to ~[{}]+ which is supposed to match anything that does not include the { } brackets but that screws with the tag parsing which I don't understand either.
I could set it to something like [ a-zA-Z0-9!"§$%&/()= etc...]+ but I really don't want to restrict it to parse only characters available on the british keyboard (German umlaute or French accents and all other special characters other languages have must to work!)
The only thing that somewhat works though I really dislike it is to force strings to have a prefix and a suffix like so:
STRING : '\'' ~[}{]+ '\'' ;
This forces me to alter the form from "{player:name} says hi!" to "{player:name}' says hi!'" and I really desperately want to avoid such restrictions because I would then have to account for literal ' characters in the string itself and it's just ugly to work with.
The two solutions I have in mind are the following:
- Is there any way to match any number of characters that has not been matched by the lexer as a STRING token and pass it to the parser? That way I could match all the tags and say the rest of the input is just plain text, give it back to me as a STRING token or whatever...
- Does ANTLR support lookahead and lookbehind regex expressions with which I could match any number of characters before the first '{', after the last '}' and anything inbetween '}' and '{' ?
I have tried
STRING : (?<=})(.+)?(?={) ;
but I can't seem to get the syntax right because that won't compile at all, which leads me to believe that ANTLR does not support lookahead and lookbehind syntax, but I could not find a definitive answer on the internet to that question.
Any advice on what to do?
Antlr does not support lookahead or lookbehind. It does support non-greedy wildcard matches, but only when the .* non-greedy wildcard is followed in the rule with the termination sequence (which, as you say, is also contained in the match, although you could push it back into the input stream).
So ~[{}]* is correct. But there's a little problem: lexer rules are (normally) always active. So that lexer rule will be active inside the braces as well, which means that it will swallow the entire contents between the braces (unless there are nested braces or braces inside quotes or some such, and that's even worse).
So you need to define different lexical contents, called "lexical modes" in Antlr. There's a publically viewable example in the Antlr Definitive Reference, which shows a solution to a very similar problem: parsing HTML.

Is there a guide/convention to the markup being used in the lua official documentation?

I consistently have a really hard time reading official documentation when it's related to coding. I generally don't understand it unless it's paired with an example. I am seeking clarification on what kind of conventions are inplace when reading docs, if any. Take the example below from the lua manual(https://www.lua.org/manual/5.1/manual.html#2.1)
:
stat ::= if exp then block {elseif exp then block} [else block] end
The first word, Stat, is defined as a statement and "this set includes assignments, control structures, function calls, and variable declarations."
::= Is not defined in the docs, it can be googled thankfully.
Exp is linked and explained.
Block has a section as well.
But then they do {} and []. They literally stated "Square brackets are used to index a table" just a few lines above. And that squiggly brackets are for writing a table. So what am I supposed to deduce from this? That {} and [] are being used to denote separate sections as a markup to make it easier to see certain components? Or that {elseif exp then block} is a table with those values inside of itself and [else block] is a key-value indexing a table? If I was writing a doc where that was indeed the case, wouldn't I write it this way?
Then I see
var ::= prefixexp `[´ exp `]´`
' ' defines a string, but I have to make the assumption that '[' ']' is used as a way to highlight the fact that because they were talking about what square brackets do in the previous section they are simply highlighting their position and this should not be included in the code. I only know to make this assumption though cause I know it doesn't work when you put them in there.
But then I see this:
chunk ::= {stat [`;´]}
Similarly they are talking about the placement of the semicolon before listing that code, but the entirety of the line of code was also newly explained and being talked about. Why would I assume that its written without the parenthesis if its written with the parenthesis? And I see they are using {} and [] again, and I have no idea what they are referencing because its not stated explicitly that we're talking about a table...its simply using the code itself to explain whether its talking about a table or not with the {}, but we have that first set of code where {} is being used and its not talking about a table.
What is the convention being used? What are they actually trying to do/show by using {} and [] in the first line of code?
As stated both at the beginning of the Lua documentation and in the section on the Lua grammar, Lua presents its grammar in extended BNF format.
EBNF has its own punctuation with its own meaning, like ::= as you discovered. But as a grammar, there needs to be a distinction between the EBNF meaning of a piece of punctuation and "this punctuation appears in the language defined by the grammar". The former meaning is therefore always assumed; the latter meaning can only be achieved by quoting the punctuation.
So this:
var ::= prefixexp `[´ exp `]´`
Means a prefixexp followed by an open bracket followed by exp followed by a close bracket.
By contrast, this:
funcname ::= Name {`.´ Name} [`:´ Name]
Means Name followed by zero or more sub-sequences of . followed by Name, followed by an optional sub-sequence of : followed by Name. Because those are what {} and [] mean to EBNF.

Alphabetic characters not recognized in tatsu parse

I have defined a very simple grammar, but tatsu does not behave as expected.
I have added a "start" rule and terminated it with a "$" character, but I still see the same behavior.
If I define the "fingering" rule with a regular expression (digit = /[1-5x]/) instead of the individual terminal symbols, the problem disappears. But shouldn't the old-school BNF-like syntax below work?
from pprint import pprint
from tatsu import parse
GRAMMAR = """
##grammar :: test
##nameguard :: False
start = sequence $ ;
sequence = {digit}+ ;
digit = 'x' | '1' | '2' | '3' | '4' | '5' ;"""
test = "23"
ast = parse(GRAMMAR, test)
pprint(ast) # Prints ['2', '3']
test = "xx"
ast = parse(GRAMMAR, test)
pprint(ast) # Throws tatsu.exceptions.FailedParse: (1:1) no available options :
The "xx" test should produce "['x', 'x']" and not throw an exception.
What am I missing?
You probably need to check interactions with ##nameguard, which is turned on by default.
For the first version of the grammar, use:
##nameguard :: False
You can also consider the definitions of ##whitespace and ##namechars that best suite the language and grammar.
Okay, I think there is a problem with ##nameguard. See https://github.com/neogeny/TatSu/issues/95. The easy workaround for the time being is to use a pattern expression in lieu of individual alphabetic terminals. Also, when ##nameguard is fixed, the documentation should clarify that it only relates to alphanumerics that begin with an alphabetic. Clearly, we did not need ##nameguard for the numeric terminals here.

Haskell/Parsec: how do I use Text.Parsec.Token with Text.Parsec.Indent (from the indents package)

The indents package for Haskell's Parsec provides a way to parse indentation-style languages (like Haskell and Python). It redefines the Parser type, so how do you use the token parser functions exported by Parsec's Text.Parsec.Token module, which are of the normal Parser type?
Background
Parsec is a parser combinator library, whatever that means.
IndentParser 0.2.1 is an old package providing the two modules Text.ParserCombinators.Parsec.IndentParser and Text.ParserCombinators.Parsec.IndentParser.Token
indents 0.3.3 is a new package providing the single module Text.Parsec.Indent
Parsec comes with a load of modules. most of them export a bunch of useful parsers (e.g. newline from Text.Parsec.Char, which parses a newline) or parser combinators (e.g. count n p from Text.Parsec.Combinator, which runs the parser p, n times)
However, the module Text.Parsec.Token would like to export functions which are parametrized by the user with features of the language being parsed, so that, for example, the braces p function will run the parser p after parsing a '{' and before parsing a '}', ignoring things like comments, the syntax of which depends on your language.
The way that Text.Parsec.Token achieves this is that it exports a single function makeTokenParser, which you call, giving it the parameters of your specific language (like what a comment looks like) and it returns a record containing all of the functions in Text.Parsec.Token, adapted to your language as specified.
Of course, in an indentation-style language, these would need to be adapted further (perhaps? here's where I'm not sure – I'll explain in a moment) so I note that the (presumably obsolete) IndentParser package provides a module Text.ParserCombinators.Parsec.IndentParser.Token which looks to be a drop-in replacement for Text.Parsec.Token.
I should mention at some point that all the Parsec parsers are monadic functions, so they do magic things with state so that error messages can say at what line and column in the source file the error appeared
My Problem
For a couple of small reasons it appears to me that the indents package is more-or-less the current version of IndentParser, however it does not provide a module that looks like Text.ParserCombinators.Parsec.IndentParser.Token, it only provides Text.Parsec.Indent, so I am wondering how one goes about getting all the token parsers from Text.Parsec.Token (like reserved "something" which parses the reserved keyword "something", or like braces which I mentioned earlier).
It would appear to me that (the new) Text.Parsec.Indent works by some sort of monadic state magic to work out at what column bits of source code are, so that it doesn't need to modify the token parsers like whiteSpace from Text.Parsec.Token, which is probably why it doesn't provide a replacement module. But I am having a problem with types.
You see, without Text.Parsec.Indent, all my parsers are of type Parser Something where Something is the return type and Parser is a type alias defined in Text.Parsec.String as
type Parser = Parsec String ()
but with Text.Parsec.Indent, instead of importing Text.Parsec.String, I use my own definition
type Parser a = IndentParser String () a
which makes all my parsers of type IndentParser String () Something, where IndentParser is defined in Text.Parsec.Indent. but the token parsers that I'm getting from makeTokenParser in Text.Parsec.Token are of the wrong type.
If this isn't making much sense by now, it's because I'm a bit lost. The type issue is discussed a bit here.
The error I'm getting is that I've tried replacing the one definition of Parser above with the other, but then when I try to use one of the token parsers from Text.Parsec.Token, I get the compile error
Couldn't match expected type `Control.Monad.Trans.State.Lazy.State
Text.Parsec.Pos.SourcePos'
with actual type `Data.Functor.Identity.Identity'
Expected type: P.GenTokenParser
String
()
(Control.Monad.Trans.State.Lazy.State Text.Parsec.Pos.SourcePos)
Actual type: P.TokenParser ()
Links
Parsec
IndentParser (old package)
indents, providing Text.Parsec.Indent (new package)
some discussion of Parser types with example code
another example of using Text.Parsec.Indent
Sadly, neither of the examples above use token parsers like those in Text.Parsec.Token.
What are you trying to do?
It sounds like you want to have your parsers defined everywhere as being of type
Parser Something
(where Something is the return type) and to make this work by hiding and redefining the Parser type which is normally imported from Text.Parsec.String or similar. You still need to import some of Text.Parsec.String, to make Stream an instance of a monad; do this with the line:
import Text.Parsec.String ()
Your definition of Parser is correct. Alternatively and equivalently (for those following the chat in the comments) you can use
import Control.Monad.State
import Text.Parsec.Pos (SourcePos)
type Parser = ParsecT String () (State SourcePos)
and possibly do away with the import Text.Parsec.Indent (IndentParser) in the file in which this definition appears.
Error, error on the wall
Your problem is that you're looking at the wrong part of the compiler error message. You're focusing on
Couldn't match expected type `State SourcePos' with actual type `Identity'
when you should be focusing on
Expected type: P.GenTokenParser ...
Actual type: P.TokenParser ...
It compiles!
Where you "import" parsers from Text.Parsec.Token, what you actually do, of course (as you briefly mentioned) is first to define a record your language parameters and then to pass this to the function makeTokenParser, which returns a record containing the token parsers.
You must therefore have some lines that look something like this:
import qualified Text.Parsec.Token as P
beetleDef :: P.LanguageDef st
beetleDef =
haskellStyle {
parameters, parameters etc.
}
lexer :: P.TokenParser ()
lexer = P.makeTokenParser beetleDef
... but a P.LanguageDef st is just a GenLanguageDef String st Identity, and a P.TokenParser () is really a GenTokenParser String () Identity.
You must change your type declarations to the following:
import Control.Monad.State
import Text.Parsec.Pos (SourcePos)
import qualified Text.Parsec.Token as P
beetleDef :: P.GenLanguageDef String st (State SourcePos)
beetleDef =
haskellStyle {
parameters, parameters etc.
}
lexer :: P.GenTokenParser String () (State SourcePos)
lexer = P.makeTokenParser beetleDef
... and that's it! This will allow your "imported" token parsers to have type ParsecT String () (State SourcePos) Something, instead of Parsec String () Something (which is an alias for ParsecT String () Identity Something) and your code should now compile.
(For maximum generality, I'm assuming that you might be defining the Parser type in a file separate from, and imported by, the file in which you define your actual parser functions. Hence the two repeated import statements.)
Thanks
Many thanks to Daniel Fischer for helping me with this.

Create a Print Function

I'm learning Bison and at this time the only thing that I do was the rpcalc example, but now I want to implement a print function(like printf of C), but I don't know how to do this and I'm planning to have a syntax like this print ("Something here");, but I don't know how to build the print function and I don't know how to create that ; as a end of line. Thanks for your help.
You first need to ask yourself:
What are the [sub-]parts of my 'print ("something");' syntax ?
Once you identify these parts, "simply" describe them in the form of grammar syntax rules, along with applicable production rules. And then let Bison generate the parser for you; that's about it.
To put you on your way:
The semi-column is probably a element you will use to separate statemements (such a one "call" to print from another).
'print' itself is probably a keyword, or preferably a native function name of your language.
The print statement appears to take a literal string as [one of] its arguments. a literal string starts and ends with a double quote (and probably allow for escaped quotes within itself)
etc.
The bolded and italic expressions above are some of the entities (the 'symbols' in parser lingo) you'll likely need to define in the syntax for your language. For that you'll use Bison grammar rules, such as
stmt : print_stmt ';' | input_stmt ';'| some_other_stmt ';' ;
prnt_stmt : print '(' args ')'
{ printf( $3 ); }
;
args : arg ',' args;
...
Since the question asked about the semi-column, maybe some confusion was from the different uses thereof; see for example above how the ';' belong to your language's syntax whereby the ; (no quotes) at the end of each grammar rule are part of Bison's language.
Note: this is of course a simplistic implementation, aimed at showing the essential. Also the Bison syntax may be a tat off (been there / done it, but a long while back ;-) I then "met" ANTLR never to return to Bison, although I do see how its lightweight and fully self contained nature can make it appropriate in some cases)

Resources