Parsing newlines in Jison

Hi, I am new to Jison and am trying to learn it. I tried the online Jison calculator example at http://techtonik.github.io/jison/try/. It works fine for the expression
5*PI^2.
But when I add a second expression on a new line, the parser ignores the newline and tries to parse the second expression as if it were on the same line.
Input:
5*PI^2
23+56
The parser takes it as:
5*PI^223+56
This fails, so I would like to know how to parse newlines in a Jison parser.

The problem here is that the calculator grammar expects a single expression, and the parser checks whether the ENTIRE text is valid as a whole. Its lexer skips newlines like any other whitespace. What you've given it in this case is TWO separate expressions that don't form one valid expression together, which is why it fails. If, for example, you evaluate
5*PI^2
+
23+56
Then it has no problems. This is because Jison is trying to parse the entire value it's given, and it allows you to break complex expressions up over multiple lines.
However, that doesn't stop you from evaluating lines individually if you want to. Instead of passing the parse function the entire text from the field, just split the text into an array using JavaScript's string split method (splitting on the new-line character, '\n'), then loop through and pass each line of the content to the parse function separately.
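The line-by-line approach could be sketched roughly as follows. The helper name parseLines and the stand-in parser are assumptions made for illustration; with a parser generated by Jison you would pass something like (line) => parser.parse(line) instead.

```javascript
// Split the input into lines and hand each non-empty line to `parse`.
// `parse` is any function that takes one line of text and returns a result.
function parseLines(text, parse) {
  return text
    .split(/\r?\n/)                       // handle both \n and \r\n line endings
    .map((line) => line.trim())
    .filter((line) => line.length > 0)    // skip blank lines
    .map((line) => parse(line));
}

// Stand-in parser (hypothetical): only understands sums such as "23+56".
const standInParse = (line) =>
  line.split("+").map(Number).reduce((a, b) => a + b, 0);

console.log(parseLines("1+2\n23+56", standInParse));
```

Each line is parsed independently, so a syntax error on one line does not have to prevent the others from being evaluated if you wrap the call to parse in a try/catch.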

Related

FParsec alternatives getting parser which parsed input

When using alternative parsers, is there a way to find out which parser matched the input?
My input string has the format below:
{firstPart_number} {secondPart_operator_symbol} {thirdPart}
Here firstPart is always a number, the second part is an alternative parser for the operator, and thirdPart is also an alternative (a number or a list of strings).
Sample input:
1 + 2
5 * 3
1 in {2,45,6}
Since my discriminated union cases are of different types, I want to know which parser matched the input so that I can create the corresponding instance of my discriminated union type.
How do I handle this situation in FParsec, where the first part is the same across parsers but the second and third parts differ, and instantiate the right type using |>>?
My problem was solved by using the attempt parser with alternatives. attempt backtracks if its parser doesn't match, and the next alternative then parses the input again from the same position and matches.
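The idea behind that solution can be illustrated outside FParsec with a few hand-rolled combinators. This is only a sketch: regex, seq, choice, map, and the result shapes are invented here and are not FParsec's API. Every alternative restarts from the same position, which is the behavior attempt provides, and each alternative builds its own tagged value, so the result itself tells you which parser matched.

```javascript
// A parser is a function (input, pos) => { ok: true, value, pos } or { ok: false }.

// Match a sticky regular expression exactly at the current position.
const regex = (re) => (input, pos) => {
  re.lastIndex = pos;
  const m = re.exec(input);
  return m ? { ok: true, value: m[0], pos: pos + m[0].length } : { ok: false };
};

// Run parsers in order; if any fails, the whole sequence fails and
// nothing is consumed, because positions are plain values.
const seq = (...parsers) => (input, pos) => {
  const values = [];
  for (const p of parsers) {
    const r = p(input, pos);
    if (!r.ok) return { ok: false };
    values.push(r.value);
    pos = r.pos;
  }
  return { ok: true, value: values, pos };
};

// Try each alternative from the same starting position.
const choice = (...parsers) => (input, pos) => {
  for (const p of parsers) {
    const r = p(input, pos);
    if (r.ok) return r;
  }
  return { ok: false };
};

// Transform a successful result (the role |>> plays in FParsec).
const map = (p, f) => (input, pos) => {
  const r = p(input, pos);
  return r.ok ? { ok: true, value: f(r.value), pos: r.pos } : r;
};

// A token is a regex match followed by optional whitespace.
const token = (re) => map(seq(regex(re), regex(/\s*/y)), ([text]) => text);

const number = map(token(/\d+/y), Number);
const operator = token(/[-+*\/]/y);
const inKeyword = token(/in\b/y);
const numberSet = map(token(/\{\s*\d+(?:\s*,\s*\d+)*\s*\}/y), (text) =>
  text.slice(1, -1).split(",").map((s) => Number(s.trim()))
);

// Each alternative tags its result, so the caller knows which one matched.
const arithmetic = map(seq(number, operator, number), ([left, op, right]) =>
  ({ kind: "Arithmetic", left, op, right }));
const membership = map(seq(number, inKeyword, numberSet), ([value, , items]) =>
  ({ kind: "Membership", value, items }));

const expression = choice(arithmetic, membership);

function parseLine(line) {
  const r = expression(line, 0);
  if (!r.ok || r.pos !== line.length) throw new Error(`cannot parse: ${line}`);
  return r.value;
}
```

For "1 in {2,45,6}" the arithmetic alternative reads the number, fails on the operator, and the membership alternative then starts again from the beginning of the line.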

Grammar for parsing numbers

I have a file in which each line is a concatenated series of values, like this:
302007030064201410241
30210704006426141
1021070400642614134
Each line starts with an operation code, and each operation has known rules for parsing the remaining part of the line.
What would be a good strategy for parsing these numbers? Any sample to start from would be great.
IMO, ANTLR won't be much use if all the different pieces of information to parse look like identical tokens.
Write a small state machine by hand.
Read digits in a loop until the digits read so far form a known "operation code" (this is simpler if all codes have the same length; you could wrap that in a function),
then, depending on that code (e.g. in a switch), call its specific decoding logic in a dedicated function.
Your resulting parser will look like a recursive descent parser.
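A minimal sketch of that approach follows. The operation codes "30" and "102" and their field layouts are invented for illustration, since the question does not give the real ones; the sketch also assumes that no code is a prefix of another code.

```javascript
// Hypothetical decoders, keyed by operation code. Each receives the rest
// of the line after the code and returns a decoded record.
const DECODERS = new Map([
  ["30", (rest) => ({ op: "30", id: rest.slice(0, 4), payload: rest.slice(4) })],
  ["102", (rest) => ({ op: "102", payload: rest })],
]);

function decodeLine(line) {
  let code = "";
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (ch < "0" || ch > "9") {
      throw new Error(`unexpected character '${ch}' at position ${i}`);
    }
    code += ch;
    // As soon as the digits read so far form a known code,
    // dispatch to that code's dedicated decoding function.
    const decode = DECODERS.get(code);
    if (decode) return decode(line.slice(i + 1));
  }
  throw new Error(`no known operation code in: ${line}`);
}
```

A Map from code to function plays the role of the switch here; a real decoder would validate field lengths and could call further helper functions for nested fields.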

Syntax Highlighting when using special characters

I'm currently finishing up a mathematical DSL based on LaTeX code in Rascal. This means that I have a lot of special characters ({,},). For instance, in the syntax shown below, the sum doesn't get highlighted unless I remove the \ and _{ from the syntax.
syntax Expression = left sum: '\\sum_{' Assignment a '}^{' Expression until '}' Expression e
I've noticed that keywords that contain either \ or { and } do not get highlighted. Is there a way to overcome this?
Edit: I accidentally used data instead of syntax in this example
There are at least two solutions, one is based on changing the grammar, one is based on a post-parse tree traversal. Pick your poison :-)
The cause of the behavior is the default highlighting rules, which heuristically detect what a "keyword" to be highlighted is by matching every literal against the regular expression [A-Za-z][A-Za-z0-9\-]*. Besides these heuristic defaults, the highlighting is fully programmable via @category tags in the grammar and @category annotations in the parse tree.
If you change the grammar like so, you can influence highlighting via tags:
syntax Expression = left sum: SumKw Assignment a '}^{' Expression until '}' Expression e;
syntax SumKw = @category="MetaKeyword" '\\sum_{';
Or, another grammar-based solution is to split the definition up (which is not a language-preserving grammar refactoring, since it allows spaces between the parts):
syntax Expression = left sum: "\\" 'sum' "_{" Assignment a '}^{' Expression until '}' Expression e;
(The latter solution will trigger the heuristic for keywords again.)
If you don't want to hack the grammar to accommodate highlighting, the other way is to add an annotation via a tree traversal, like so:
visit(yourTree) {
  case t:appl(prod(cilit("\\sum_{"),_,_),_) => t[@category="MetaKeyword"]
}
}
The code is somewhat hairy because you have to match on and replace a tree which can usually be ignored while thinking of your own language. The pattern matches the syntax rule generated for each (case-insensitive) literal and its application to the individual characters it consists of. See ParseTree.rsc from the standard library for a detailed and formal definition of what parse trees look like under the hood.
To make the latter solution have effect, when you instantiate the IDE using the registerLanguage function from util::IDE, make sure to wrap the call to the parser with some function which executes this visit.

antlr: how to [sometimes] parse things in quotes?

I have a situation where my language allows quoted strings, but sometimes I want to interpret the contents of a quoted string as language constructs. Think of it as, say, an eval function.
So to support quoted strings I need a lexer rule, and it overrides my attempts to have a grammar rule that evaluates things in quotes when they are prefixed with 'eval'. Is there any way to deal with this in the grammar?
IMO you should not try to handle this case directly through the lexer.
I think I would leave the string as it is in the lexer and add some code in the eval rule of the parser that calls a sub-parser on the string content.
If you want to implement an eval function, you're really looking for a runtime interpreter.
The only time you need an "eval" function is when you want to build up the content to compile at runtime. If you have the content available at compile-time, you can parse it without it being a string...
So... keep it as a string, and then use the same parser at runtime to parse/compile its contents.
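The "keep it as a string, then run the same parser on its contents" advice can be sketched with a toy language. This is not ANTLR: tokenize, evaluate, and the mini grammar (numbers, quoted strings, + and eval "...") are invented for illustration, and strings cannot contain escaped quotes.

```javascript
// Lexer: a quoted string is always a single STRING token, whatever it contains.
function tokenize(src) {
  const text = src.trim();
  const re = /\s*(?:(\d+)|"([^"]*)"|(eval)\b|(\+))/y;
  const tokens = [];
  let pos = 0;
  while (pos < text.length) {
    re.lastIndex = pos;
    const m = re.exec(text);
    if (!m) throw new Error(`unexpected input at position ${pos}`);
    if (m[1] !== undefined) tokens.push({ type: "number", value: Number(m[1]) });
    else if (m[2] !== undefined) tokens.push({ type: "string", value: m[2] });
    else if (m[3] !== undefined) tokens.push({ type: "eval" });
    else tokens.push({ type: "plus" });
    pos += m[0].length;
  }
  return tokens;
}

// Parser/evaluator for: expr := term ('+' term)* ; term := NUMBER | STRING | 'eval' STRING
function evaluate(src) {
  const tokens = tokenize(src);
  let i = 0;

  function term() {
    const t = tokens[i++];
    if (!t) throw new Error("unexpected end of input");
    if (t.type === "number" || t.type === "string") return t.value;
    if (t.type === "eval") {
      const arg = tokens[i++];
      if (!arg || arg.type !== "string") throw new Error("eval expects a quoted string");
      // The eval rule calls the same parser again on the string's contents.
      return evaluate(arg.value);
    }
    throw new Error(`unexpected token: ${t.type}`);
  }

  let result = term();
  while (i < tokens.length) {
    if (tokens[i].type !== "plus") throw new Error(`unexpected token: ${tokens[i].type}`);
    i++;
    result = result + term();
  }
  return result;
}
```

Here evaluate('"2 + 3"') yields the string unchanged, while evaluate('1 + eval "2 + 3"') re-parses the string contents as an expression. The lexer never needs to know about eval's special treatment of strings.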

Recursive Descent vs Lex/Parse?

I think I understand (roughly) how recursive descent parsers (e.g. Scala's Parser Combinators) work: you parse the input string with one parser, and that parser calls other, smaller parsers for each "part" of the whole input, and so on, until you reach the low-level parsers which directly generate the AST from fragments of the input string.
I also think I understand how Lexing/Parsing works: you first run a lexer to break the whole input into a flat list of tokens, and you then run a parser to take the token list and generate an AST.
However, what I do not understand is how the Lex/Parse strategy deals with cases where exactly how you tokenize something depends on the tokens that were tokenized earlier. For example, if I take a chunk of XML:
"<tag attr='moo' omg='wtf'>attr='moo' omg='wtf'</tag>"
A recursive descent parser may take this and break it down (each subsequent indent represents the decomposition of the parent string)
"<tag attr='moo' omg='wtf'>attr='moo' omg='wtf'</tag>"
    -> "<tag attr='moo' omg='wtf'>"
        -> "<tag"
        -> "attr='moo'"
            -> "attr"
            -> "="
            -> "moo"
        -> "omg='wtf'"
            -> "omg"
            -> "="
            -> "wtf"
        -> ">"
    -> "attr='moo' omg='wtf'"
    -> "</tag>"
And the small parsers which individually parse <tag, attr="moo", etc. would then construct a representation of an XML tag and add attributes to it.
However, how does a single-step Lex/Parse work? How does the Lexer know that the string after <tag and before > must be tokenized into separate attributes, while the string between > and </tag> does not need to be? Wouldn't it need the Parser to tell it that the first string is within a tag body, and the second case is outside a tag body?
EDIT: Changed the example to make it clearer
Typically the lexer will have a "mode" or "state" setting, which changes according to the input. For example, on seeing a < character, the mode would change to "tag" mode, and the lexer would tokenize appropriately until it sees a >. Then it would enter "contents" mode, and the lexer would return all of attr='moo' omg='wtf' as a single string. Programming language lexers, for example, handle string literals this way:
string s1 = "y = x+5";
The y = x+5 would never be handled as a mathematical expression and then turned back into a string. It's recognized as a string literal, because the " changes the lexer mode.
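A lexer with modes, as described above, might look roughly like this. It is a toy sketch, not a real XML lexer: the mode names, token types, and the function lex are invented for illustration, and comments, entities, and CDATA are ignored.

```javascript
// Rules that apply only while the lexer is in "tag" mode.
const TAG_RULES = [
  ["ws", /\s+/y],
  ["close", />/y],
  ["slash", /\//y],
  ["equals", /=/y],
  ["string", /'[^']*'|"[^"]*"/y],
  ["name", /[A-Za-z_][\w-]*/y],
];

function lex(input) {
  const tokens = [];
  let mode = "content"; // switches between "content" and "tag"
  let pos = 0;
  while (pos < input.length) {
    if (mode === "content") {
      if (input[pos] === "<") {
        tokens.push({ type: "open", text: "<" });
        pos += 1;
        mode = "tag"; // '<' switches to tag mode
      } else {
        // Everything up to the next '<' is one undivided text token.
        let end = input.indexOf("<", pos);
        if (end === -1) end = input.length;
        tokens.push({ type: "text", text: input.slice(pos, end) });
        pos = end;
      }
    } else {
      let matched = false;
      for (const [type, re] of TAG_RULES) {
        re.lastIndex = pos;
        const m = re.exec(input);
        if (!m) continue;
        if (type !== "ws") tokens.push({ type, text: m[0] });
        pos += m[0].length;
        if (type === "close") mode = "content"; // '>' switches back
        matched = true;
        break;
      }
      if (!matched) throw new Error(`unexpected character at position ${pos}`);
    }
  }
  return tokens;
}
```

For the XML snippet from the question, the attributes inside the opening tag come out as separate name, equals, and string tokens, while the attr='moo' omg='wtf' between the tags comes out as a single text token, because the lexer is in "content" mode at that point.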
For languages like XML and HTML, it's probably easier to build a custom parser than to use one of the parser generators like yacc, bison, or ANTLR. They have a different structure than programming languages, which are a better fit for the automatic tools.
If your parser needs to turn a list of tokens back into the string it came from, that's a sign that something is wrong in the design. You need to parse it a different way.
How does the Lexer know that the string after <tag and before > must be tokenized into separate attributes, while the string between > and </tag> does not need to be?
It doesn't.
Wouldn't it need the Parser to tell it that the first string is within a tag body, and the second case is outside a tag body?
Yes.
Generally, the lexer turns the input stream into a sequence of tokens. A token has no context - that is, a token has the same meaning no matter where it occurs in the input stream. Once the lexing process has completed, each token is treated as a single unit.
For XML, a generated lexer would typically identify integers, identifiers, string literals and so on, as well as the control characters like '<' and '>', but not a whole tag. The work of understanding what is an open tag, close tag, attribute, element, etc., is left to the parser proper.
