I know there are many questions about mathematics expression parsing out there. I have researched and learned the algorithm to convert an infix string to postfix, and use the postfix string to evaluate the value of the expression.
But all of the examples I have found deal only with the case where each operand of the expression is a single character, for example "1+2".
What do you do if the expression is "1 + 123"? The postfix string would become "1123+", which cannot be evaluated.
The method I have thought of is to read each character of an operand from the infix string and keep it temporarily in a tempStack. Then, when an operator is read, convert the operand in the tempStack to an integer and push it into the postfix array.
But then a problem follows: my operands would be of integer type while my operators are of character type, so I can't put them in the same array.
Please suggest the right way to do this. I know there are APIs for this work, but I want to learn it to strengthen my knowledge.
Thank you very much.
You don't transform the input into a 'postfix string' unless you separate the tokens by whitespace. The input "1 + 123" would then become e.g. "1 123 +". But it's better to push the tokens on a stack, for example an array of strings. If the language you use supports algebraic data types, you would create a Token type and push onto a stack of Tokens.
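A minimal sketch in C of that tokenizing step; the tokenize function and the fixed-size buffers are illustrative choices, not any standard API:

#include <ctype.h>
#include <stdio.h>

/* Split an infix string into string tokens so that multi-digit
   operands such as "123" stay intact. */
int tokenize(const char *s, char tokens[][32], int max) {
    int n = 0;
    while (*s && n < max) {
        if (isspace((unsigned char)*s)) { s++; continue; }
        if (isdigit((unsigned char)*s)) {           /* read a whole number */
            int len = 0;
            while (isdigit((unsigned char)*s) && len < 31)
                tokens[n][len++] = *s++;
            tokens[n][len] = '\0';
        } else {                                    /* single-character operator */
            tokens[n][0] = *s++;
            tokens[n][1] = '\0';
        }
        n++;
    }
    return n;
}

int main(void) {
    char toks[16][32];
    int n = tokenize("1 + 123", toks, 16);
    for (int i = 0; i < n; i++)
        printf("[%s] ", toks[i]);    /* prints: [1] [+] [123] */
    putchar('\n');
    return 0;
}

From here, both the infix-to-postfix conversion and the postfix evaluation operate on whole tokens rather than single characters, which also resolves the int-operand versus char-operator problem: everything on the stack is a string until you actually need its numeric value.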
Let's say I have the following statement:
SELECT "hi\n
there";
Notice there is a literal newline in there, and the escape \n. The string that antlr4 picks up for me is:
String_Literal: "hi\n\nthere"
In other words, not differentiating between the literal newline and the \n one. Is there a way to differentiate the two, or what's the usual process to do that?
My guess is that the output you pasted into your question comes from a call to the Antlr4 runtime method tree.toStringTree(parser) (or equivalent in whatever target language you've chosen).
That function calls escapeWhitespace in the utilities class/module/file, and that function does what its name suggests: it converts (some) whitespace characters to C-like backslash escape sequences. (Specifically, it handles newline, carriage return, and tab characters.) It does not escape backslash characters, which makes its output ambiguous; there's no way to distinguish between the two-character escape sequence \n and the escaped conversion of a newline character in the message.
They are different in the actual character string, because the Antlr4 lexer does not transform the string value of the matched token in any way. That's your responsibility.
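A rough sketch of the kind of post-processing this implies; the unescape function is invented for illustration and handles only \n and \\:

#include <stdio.h>

/* Translate backslash escapes in a raw matched lexeme into the
   characters they denote. Only \n and \\ are handled, for brevity;
   a real implementation would cover the language's full escape set. */
void unescape(const char *src, char *dst) {
    while (*src) {
        if (src[0] == '\\' && src[1] == 'n')       { *dst++ = '\n'; src += 2; }
        else if (src[0] == '\\' && src[1] == '\\') { *dst++ = '\\'; src += 2; }
        else                                       { *dst++ = *src++; }
    }
    *dst = '\0';
}

int main(void) {
    char buf[64];
    unescape("hi\\n\nthere", buf);   /* an escaped \n plus a literal newline */
    printf("%s\n", buf);             /* both now come out as real newlines */
    return 0;
}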
In computing, it is very often the case that what you see is not what you got. What you see is just what you see, and a lot of computational power has gone into creating that vision for you. By the same token, nothing guarantees that the vision is an unambiguous, or even useful, representation of the actual values. The best you can say for it is that it's probably more useful than trying to read the data as individual bits. (And, indeed, the individual bits are not physical objects either; despite the common refrain, you could completely disassemble a computer and examine it with an arbitrarily powerful microscope, and you will not see a single 1 or 0.)
That might seem like irrelevant philosophizing, but it has a real consequence: when you're debugging and you see something that makes you think, "that looks wrong", you need to consider two possibilities: maybe the underlying data is incorrect, but maybe it's the process which rendered the representation that is at fault. In this case, I'd say that the failure of escapeWhitespace to convert backslash characters into pairs of backslashes is a bug, but that's a value judgement on my part. Anyway, the function is not critical to the operation of Antlr4, and you could easily replace it.
Can we write a scanf call to parse input on the # sign and read two numbers?
Since # is not used in a standard way, I am a bit confused.
The scanf pattern "%d#%d" matches, in order:
optional whitespace
an integer
the character #
more optional whitespace
another integer
It returns 2 if both integers were found.
If there is anything (even a space character) between the first integer and the #, the scan will stop after the first integer and return 1. It will also stop and return 1 if whatever follows the # is not an integer possibly preceded by whitespace. This imprecision is one of the reasons scanf is often discouraged.
You could allow whitespace before the # by adding a space into the pattern: "%d #%d". But whitespace includes newline characters, which can lead to unexpected behaviour. That's another reason scanf is often discouraged.
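A small self-contained demonstration of both patterns (using sscanf so the input is visible in the program; the behaviour is the same as scanf reading from stdin):

#include <stdio.h>

int main(void) {
    int a, b;

    int rc = sscanf("12#34", "%d#%d", &a, &b);
    printf("rc=%d a=%d b=%d\n", rc, a, b);    /* rc=2 a=12 b=34 */

    rc = sscanf("12 #34", "%d#%d", &a, &b);
    printf("rc=%d\n", rc);                    /* rc=1: space before the # */

    rc = sscanf("12 #34", "%d #%d", &a, &b);
    printf("rc=%d a=%d b=%d\n", rc, a, b);    /* rc=2: the pattern allows space */

    return 0;
}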
In summary:
Yes, you can get scanf to read a line consisting of two numbers separated by a #. Not a problem; scanf does not have a concept of "legal delimiter".
If the input is from an error-prone source, such as a human typist, scanf makes it difficult to produce good error messages, which can be a frustrating experience. But if you know the input is error-free, scanf will work just fine.
In the above patterns, if you want to accept floating point numbers, change the %d to %lf (for doubles, generally recommended) or %f for floats.
Context
I've recently run into an issue that I couldn't solve by myself in a parser I'm writing.
This parser is a component of a compiler I'm building, and the question concerns the expression parsing needed when parsing a programming language.
My parser uses recursive descent to parse expressions.
The problem
I parse expressions using normal recursive descent parsing rules; I've eliminated left recursion in all my rules, but there is one syntactic "ambiguity" which my parser simply can't handle, and it involves generics.
comparison → addition ( ( ">" | ">=" | "<" | "<=" ) addition )* ;
is the rule I use for parsing comparison nodes in the expression
On the other hand I decided to parse generic expressions this way:
generic → primary ( "<" arguments ">" ) ;
where
arguments → expression ( "," expression )* ;
Now, because generic expressions have higher precedence (they are language constructs, not mathematical expressions), this causes a scenario where the generic parser will attempt to parse expressions when it shouldn't.
For example, in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, parse that, and fail as it can't find the closing tag.
What is the solution to such a scenario? Especially in languages like C++, where generics can also contain expressions; if I'm not mistaken, arr<1<2> might be legal syntax.
Is this a special edge case, or does it require a modification to the syntax definition that I'm not aware of?
Thank you
For example, in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, parse that, and fail as it can't find the closing tag.
This particular case could be solved with backtracking or unbounded lookahead. As you said, the parser will eventually fail when interpreting this as a generic, so when that happens, you can go back and parse it as a relational operator instead. The lookahead variant would be to look ahead when seeing a < to check whether the < is followed by comma-separated type names and a > and only go into the generic rule if that is the case.
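A toy illustration of the backtracking variant, with an invented token array and a deliberately simplified generic rule (primary "<" primary ">"); none of these names come from a real parser:

#include <stdio.h>
#include <string.h>

static const char *toks[8];
static int ntoks, pos;

static int eat(const char *t) {                 /* consume an expected token */
    if (pos < ntoks && strcmp(toks[pos], t) == 0) { pos++; return 1; }
    return 0;
}

static int primary(void) {                      /* any token except < or > */
    if (pos < ntoks && strcmp(toks[pos], "<") != 0 && strcmp(toks[pos], ">") != 0) {
        pos++;
        return 1;
    }
    return 0;
}

static int generic(void) {                      /* primary "<" primary ">" */
    return primary() && eat("<") && primary() && eat(">");
}

static int expression(void) {
    int saved = pos;                            /* remember where we started */
    if (generic()) { puts("parsed as generic"); return 1; }
    pos = saved;                                /* backtrack and retry */
    if (primary() && eat("<") && primary()) {
        puts("parsed as comparison");
        return 1;
    }
    return 0;
}

int main(void) {
    const char *generic_input[]    = { "a", "<", "T", ">" };  /* a<T> */
    const char *comparison_input[] = { "a", "<", "2" };       /* a<2  */

    memcpy(toks, generic_input, sizeof generic_input);
    ntoks = 4; pos = 0; expression();           /* parsed as generic */

    memcpy(toks, comparison_input, sizeof comparison_input);
    ntoks = 3; pos = 0; expression();           /* parsed as comparison */
    return 0;
}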
However, that approach no longer works if both interpretations are syntactically valid (meaning the syntax actually is ambiguous). One example of that would be x<y>z, which could either be a declaration of a variable z of type x<y> or two comparisons. This example is somewhat unproblematic, since the latter meaning is almost never the intended one, so it's okay to always interpret it as the former (this happens in C#, for example).
Now if we allow expressions, it becomes more complicated. For x<y>z it's easy enough to say that it should never be interpreted as two comparisons, as it makes no sense to compare the result of a comparison with something else (in many languages using relational operators on Booleans is a type error anyway). But for something like a<b<c>() there are two interpretations that might both be valid: either a is a generic function called with the generic argument b<c, or b is a generic function with the generic argument c (and a is compared to the result of calling that function). At this point it is no longer possible to resolve the ambiguity with syntactic rules alone:
In order to support this, you'll need to either check whether the given primary refers to a generic function and make different parsing decisions based on that or have your parser generate multiple trees in case of ambiguities and then select the correct one in a later phase. The former option means that your parser needs to keep track of which generic functions are currently defined (and in scope) and then only go into the generic rule if the given primary is the name of one of those functions. Note that this becomes a lot more complicated if you allow functions to be defined after they are used.
So, in summary: supporting expressions as generic arguments requires you either to keep track of which functions are in scope while parsing and use that information to make your parsing decisions (meaning your parser is context-sensitive), or to generate multiple possible ASTs. Without expressions you can keep it context-free and unambiguous, but you will need backtracking or arbitrary lookahead (meaning it's LL(*)).
Since neither of those are ideal, some languages change the syntax for calling generic functions with explicit type parameters to make it LL(1). For example:
Java puts the generic argument list of a method before the method name, i.e. obj.<T>foo() instead of obj.foo<T>().
Rust requires :: before the generic argument list: foo::<T>() instead of foo<T>().
Scala uses square brackets for generics and for nothing else (array subscripts use parentheses): foo[T]() instead of foo<T>().
In Python, the word in denotes an operator in an expression such as 1 in [1,2,3]. But in the statement for i in range(10), it is a keyword of the for statement. I wrote a lexer based on regular expressions. I use the rule (\+|-|\*|/|is|in) to match operators and (for|in|if|elif|else) to match keywords. I don't know whether I should put in in the operator rule or the keyword rule; either way, it loses one of its meanings. It seems that I should resolve this during parsing, but I still need to give in a label during tokenizing. What should I do?
Call it "token_in" :) It's usually better not to categorize in your lexer; the parser is responsible for analyzing the syntactic purpose of a token.
In any case, I don't see the point of the lexer producing a single token type for different keywords. if and else are syntactically distinct tokens, and the parser wants to know that it is seeing an if; the fact that it is presented with a "keyword" is not particularly useful to it.
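A sketch of what that looks like in practice; the TOK_* names are invented for illustration:

#include <stdio.h>
#include <string.h>

/* One token type per keyword; any other word is an identifier. The
   parser, not the lexer, decides whether TOK_IN is acting as an
   operator or as part of a for-statement. */
typedef enum { TOK_IN, TOK_IF, TOK_ELIF, TOK_ELSE, TOK_FOR, TOK_IDENT } TokenType;

TokenType classify_word(const char *lexeme) {
    if (strcmp(lexeme, "in") == 0)   return TOK_IN;
    if (strcmp(lexeme, "if") == 0)   return TOK_IF;
    if (strcmp(lexeme, "elif") == 0) return TOK_ELIF;
    if (strcmp(lexeme, "else") == 0) return TOK_ELSE;
    if (strcmp(lexeme, "for") == 0)  return TOK_FOR;
    return TOK_IDENT;
}

int main(void) {
    printf("%d %d\n", classify_word("in"), classify_word("range"));
    /* prints: 0 5 -- TOK_IN for "in", TOK_IDENT for "range" */
    return 0;
}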
I am writing my first parser and have a few questions concerning the tokenizer.
Basically, my tokenizer exposes a nextToken() function that is supposed to return the next token. These tokens are distinguished by a token-type. I think it would make sense to have the following token-types:
SYMBOL (such as <, :=, ( and the like)
WHITESPACE (tab, newlines, spaces...)
REMARK (a comment between /* ... */ or after // through the new line)
NUMBER
IDENT (such as the name of a function or a variable)
STRING (Something enclosed between "....")
Now, do you think this makes sense?
Also, I am struggling with the NUMBER token-type. Do you think it makes more sense to split it further into a NUMBER and a FLOAT token-type? Without a FLOAT token-type, I'd receive a NUMBER (e.g. 402), a SYMBOL (.) followed by another NUMBER (e.g. 203) if I were about to parse a float.
Finally, what do you think makes more sense for the tokenizer to return when it encounters a -909? Should it return the SYMBOL - first, followed by the NUMBER 909 or should it return a NUMBER -909 right away?
It depends upon your target language.
The point behind a lexer is to return tokens that make it easy to write a parser for your language. Suppose your lexer returns NUMBER when it sees a symbol that matches "[0-9]+". If it sees a non-integer number, such as "3.1415926", it will return NUMBER . NUMBER. While you could handle that in your parser, if your lexer is doing an appropriate job of skipping whitespace and comments (since they aren't relevant to your parser), you could end up incorrectly parsing things like "123 /* comment */ . /* other comment */ 456" as floating point numbers.
As for lexing "-[0-9]+" as a NUMBER vs MINUS NUMBER again, that depends upon your target language, but I would usually go with MINUS NUMBER, otherwise you would end up lexing "A = 1-2-3-4" as SYMBOL = NUMBER NUMBER NUMBER NUMBER instead of SYMBOL = NUMBER MINUS NUMBER MINUS NUMBER MINUS NUMBER.
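A minimal sketch of that numeric lexing rule, with an invented lex_number helper: a digit run, optionally followed by "." and more digits, becomes one token, and a leading minus is left for the parser to treat as its own token:

#include <ctype.h>
#include <stdio.h>

/* Maximal munch: consume the longest match and report whether a
   fractional part made it a float. Returns a pointer one past the
   number's last character. */
const char *lex_number(const char *s, int *is_float) {
    *is_float = 0;
    while (isdigit((unsigned char)*s)) s++;
    if (s[0] == '.' && isdigit((unsigned char)s[1])) {   /* fractional part */
        *is_float = 1;
        s++;
        while (isdigit((unsigned char)*s)) s++;
    }
    return s;
}

int main(void) {
    int f;
    const char *src = "3.1415926";
    const char *end = lex_number(src, &f);
    printf("%.*s -> %s\n", (int)(end - src), src, f ? "FLOAT" : "NUMBER");

    src = "909";                       /* "-909" would be MINUS then this */
    end = lex_number(src, &f);
    printf("%.*s -> %s\n", (int)(end - src), src, f ? "FLOAT" : "NUMBER");
    return 0;
}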
While we're on the topic, I'd strongly recommend the book Language Implementation Patterns, by Terence Parr, the author of ANTLR.
You are best served by making your token types closely match your grammar's terminal symbols.
Without knowing the language/grammar, I expect you would be better served by having token types for "LESS_THAN", "LESS_THAN_OR_EQUAL" and also "FLOAT", "DOUBLE", "INTEGER", etc.
From my experience with actual lexers:
Make sure to check if you actually need comment / whitespace tokens. Compilers typically don't need them, while IDEs often do (to color comments green, for example).
Usually there's no single "operator" token; instead, there's a token for each distinct operator. So there's a PLUS token and an AMPERSAND token and a LESSER_THAN token, etc. This means that you only care about the lexeme (the actual text matched) when the token is an identifier or some sort of literal.
Avoid splitting literals. If "hello world" is a string literal, parse it as a single token. If -3.058e18 is a float literal, parse it as a single token as well. Lexers usually rely on regular expressions, which are expressive enough for all these things, and more. Of course, if the literals are complex enough you have to split them (e.g. the block literal in Smalltalk).
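As an illustration of the last point, a sketch of scanning a double-quoted string literal as one token (no escape handling; lex_string is invented for illustration):

#include <stdio.h>

/* Scan a double-quoted string literal as a single token instead of
   splitting it at spaces. Returns a pointer one past the closing
   quote, or NULL if the literal is malformed. */
const char *lex_string(const char *s, char *out, int cap) {
    if (*s != '"') return NULL;
    s++;                                  /* skip the opening quote */
    int i = 0;
    while (*s && *s != '"' && i + 1 < cap)
        out[i++] = *s++;
    if (*s != '"') return NULL;           /* unterminated literal */
    out[i] = '\0';
    return s + 1;
}

int main(void) {
    char buf[64];
    const char *rest = lex_string("\"hello world\" + x", buf, 64);
    if (rest)
        printf("STRING token: [%s], rest: [%s]\n", buf, rest);
    return 0;
}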
I think that the answer to your question is strictly tied to the semantics of NUMBER.
What should a NUMBER be? An always-positive integer, a float...?
I'd like to suggest that you look up the flex and yacc (a.k.a. lex & bison) tools of the U**x operating systems: these are powerful scanner and parser generators that take a grammar and output a compilable and readily usable program.
It depends on how you are taking in tokens. If you are doing it character by character, it might be a bit tricky, but if you are doing it word by word, e.g.
int a = a + 2.0
then the tokens would be (discarding whitespace)
int
a
=
a
+
2.0
So you wouldn't run into the situation where you interpret the . as a token, but rather take the whole string in, which is where you can determine whether it's a FLOAT or a NUMBER or whatever you want.