General query for parsing in c using special character - parsing

Can we write a scanf function to parse input on # sign and read
two numbers?
since # is not used in aa standard way i am a bit confused

The scanf pattern "%d#%d" matches, in order:
optional whitespace
an integer
the character #
more optional whitespace
another integer
It returns 2 if both integers were found.
If there is anything (even a space character) between the first integer and the #, the scan will stop after the first integer and return 1. It will also stop and return 1 if whatever follows the # is not an integer possibly preceded by whitespace. This imprecision is one of the reasons scanf is often discouraged.
You could allow whitespace before the # by adding a space into the pattern: "%d #%d". But whitespace includes newline characters, which can lead to unexpected behaviour. That's another reason scanf is often discouraged.
In summary:
Yes, you can get scanf to read a line consisting of two numbers separated by a #. Not a problem; scanf does not have a concept of "legal delimiter".
If the input is from an error-prone source, such as a human typist, scanf makes it difficult to produce good error messages, which can be a frustrating experience. But if you know the input is error-free, scanf will work just fine.
In the above patterns, if you want to accept floating point numbers, change the %d to %lf (for doubles, generally recommended) or %f fir floats.

Related

Antlr differentiating a newline from a \n

Let's say I have the following statement:
SELECT "hi\n
there";
Notice there is a literal newline in there, and the escape \n. The string that antlr4 picks up for me is:
String_Literal: "hi\n\nthere"
In other words, not differentiating between the literal newline and the \n one. Is there a way to differentiate the two, or what's the usual process to do that?
My guess is that the output you pasted into your question comes from a call to the Antlr4 runtime method tree.toStringTree(parser) (or equivalent in whatever target language you've chosen).
That function calls escapeWhitespace in the utilities class/module/file, and that function does what it's name suggests: it converts (some) whitespace characters to C-like backslash escape sequences. (Specifically, it handles newline, carriage return, and tab characters.) It does not escape backslash characters, which makes its output ambiguous; there's no way to distinguish between the two character escape sequence \n and the escaped conversion of a newline character in the message.
They are different in the actual character string, because the Antlr4 lexer does not transform the string value of the matched token in any way. That's your responsibility.
In computing, it is very often the case that what you see is not what you got. What you see is just what you see, and a lot of computational power has gone into creating that vision for you. By the same token, nothing guarantees that the vision is an unambiguous, or even useful, representation of the actual values. The best you can say for it is that it's probably more useful than trying to read the data as individual bits. (And, indeed, the individual bits are not physical objects either; despite the common refrain, you could completely disassemble a computer and examine it with an arbitrarily powerful microscope, and you will not see a single 1 or 0.)
That might seem like irrelevant philosophizing, but it has a real consequence: when you're debugging and you see something that makes you think, "that looks wrong", you need to consider two possibilities: maybe the underlying data is incorrect, but may it's the process which rendered the representation which is at fault. In this case, I'd say that the failure of escapeWhitespace to convert backslash characters into pairs of backslashes is a bug, but that's a value judgement on my part. Anyway, the function is not critical to the operation of Antlr4, and you could easily replace it.

Why is the following piece of Lua code, completely valid?

From my Lua knowledge (and according to what I have read in Lua manuals), I've always been under impression that an identifier in Lua is only limited to A-Z & a-z & _ & digits (and can not start using a digit nor be a reserved keyword i.e. local local = 123).
And now I have run into some (obfuscated) Lua program which uses all kind of weird characters for an identifier:
https://i.imgur.com/HPLKMxp.png
-- Most likely, copy+paste won't work. Download the file from https://tknk.io/7HHZ
print(_VERSION .. " " .. (jit and "JIT" or "non-JIT"))
local T = {}
T.math = T.math or {}
T.math.​â®â€‹âŞâ®â€‹­ď»żâ€Śâ€­âŽ­ = math.sin
T.math.â¬â€‹â­â¬â­â«â®â€­â€¬ = math.cos
for k, v in pairs(T.math) do print(k, v) end
Output:
Lua 5.1 JIT
â¬â€‹â­â¬â­â«â®â€­â€¬ function: builtin#45
​â®â€‹âŞâ®â€‹­ď»żâ€Śâ€­âŽ­ function: builtin#44
It is unclear to me, why is this set of characters allowed for an identifier?
In other words, why is it a completely valid Lua program?
Unlike some languages, Lua is not really defined by a formal specification, one which covers every contingency and entirely explains all of Lua's behavior. Something as simple as "what character set is a Lua file encoded in" isn't really explain in Lua's documentation.
All the docs say about identifiers is:
Names (also called identifiers) in Lua can be any string of letters, digits, and underscores, not beginning with a digit and not being a reserved word.
But nothing ever really says what a "letter" is. There isn't even a definition for what character set Lua uses. As such, it's essentially implementation-dependent. A "letter" is... whatever the implementation wants it to be.
So, let's say you're writing a Lua implementation. And you want users to be able to provide Unicode-encoded strings (that is, strings within the Lua text). Lua 5.3 requires this. But you also don't want them to have to use UTF-16 encoding for their files (also because lua_load gets sequences of bytes, not shorts). So your Lua implementation assumes the byte sequence it gets in lua_load is encoded in UTF-8, so that users can write strings that use Unicode characters.
When it comes to writing the lexer/parser part of this implementation, how do you handle this? The simplest, easiest way to handle UTF-8 is to... not handle UTF-8. Indeed, that's the whole point of that encoding. Since everything that Lua defines with specific symbols are encoded in ASCII, and ASCII text is also UTF-8 text with the same meaning, you can basically treat a UTF-8 string like an ASCII string. For in-Lua strings, you just copy the sequence of bytes between the start and end characters of the string.
So how do you go about lexing identifiers? Well, you could ask the question above. Or you could ask a much simpler question: is the character a space, control character, digit, or symbol? A "letter" is merely something that isn't one of those.
Lua defines what things it considers to be "symbols". ASCII can tell you what is a control character, space, and a digit. In such an implementation, any UTF-8 code unit with a value outside of ASCII is a letter. Even if technically, those code units decode into something Unicode thinks of as a "symbol", your lexer just threats it as a letter.
This simple form of UTF-8 lexing gives you fast performance and low memory overhead. You don't have to decode UTF-8 into Unicode codepoints, and you don't need a giant Unicode table to tell you whether a codepoint is a "symbol" or "space" or whatever. And of course, it's also something that would naturally fall out of many ASCII-based Lua implementations.
So most Lua implementations will do it this way, if only by accident. Doing something more would require deliberate effort.
It also allows a user to use Unicode character sequences as identifiers. That means that someone can easily write code in their native language (outside of keywords).
But it also means that obfuscators have lots of ways to create "identifiers" that are just strings of nonsensical bytes. Indeed, because there are multiple ways in Unicode to "spell" the same apparent Unicode string (unless you examine the bytes directly), obfuscators can rig up identifiers that appear when rendered in a text editor to all be the same text, while actually being different strings.
To clarify there is only one identifier T
T.math is sugar syntax for T["math"] this also extends to the obfuscate strings. It is perfectly valid to have a key contain any characters or even start with a number.
Now being able to use the . rather then [ ] does not work with a string that don't conform to the identifier's limitations. See Nicol Bolas' answer for a great break down of those limitations.

Parsing expressions when an operand contain more than one character?

I know there are many questions about mathematics expression parsing out there. I have researched and learned the algorithm to convert an infix string to postfix, and use the postfix string to evaluate the value of the expression.
But all of examples I have found deal only with the case that operands of the expression contain only one character. For example "1+2".
How do you do if the expression is "1 + 123"? The postfix string would become "1123+", so it's unable to be evaluated.
The method I have thought is to read each character of an operand from the infix string and temporarily keep them in a tempStack. And, when an operator is read, convert the operand in the tempStack to an integer, then push it into the postfix array.
But then the problem follows, my operands would be integer type but my operators are character type. So I can't put them in the same array.
Please suggest me the right way to do this. I know that there are APIs to do this work, but I want to learn this to strengthen my knowledge.
Thank you very much.
You don't transform the input into a 'postfix string' unless you separate the tokens by whitespace. The input "1 + 123" would then become e.g. "1 123 +". But it's better to push the tokens on a stack, for example an array of strings. If the language you use supports algebraic data types, you would create a Token type and push onto a stack of Tokens.

What to do when unescapable character(s) are escaped?

In designing of a (mini)language:
When there are certain characters that should be escaped to lose special meanings (like quotes in some programming languages), what should be done, especially from a security perspective, when characters that are not escapable (e.g. normal characters which never have special meaning) are escaped? Should an error be "error"ed, or should the character be discarded, or should it be in the output the same as if it was not escaped?
Example:
In a simple language where strings are delimited by double-quotes("), and any quotes in a given string are escaped with a back-slash(\): for input "We \said, \"We want Moshiach Now\"" -- what would should be done with the letter s in said which is escaped?
I prefer the lexer to whine when this occurs. A lexer/parser should be tight about syntax; one can always loosen it up later. If you are sloppy, you'll find you can't retract a decision you didn't think you made.
Assume that you initially decide to treat " backslash not-an-escape " as that pair of characters, and the "T" is
not-an-escape today. Sometime later you decide to extend the language, and want "\T" to mean something special, and you change your language.
You'll find an angry mob of programmers storming your design castle,
because for them, "\T" means "\" "T" (or "T" depending on your default decision),
and you just broke their code. You hang your head in shame, retract the decision,
and then realize... oops, there are no more available escape characters!
This lesson goes for any piece of syntax that isn't well defined in your language. If it isn't explicitly legal, it should be implicitly illegal and your compiler should check it. Or you'll never be able to extend your successful language.
If your language isn't going to be successful, you may not care as much.
Well, one way to solve the problem is for the backslash to just mean backslash when it precedes a non-escapable character. That's what Python does:
>>> print "a\tb"
a b
>>> print "a\tb\Rc"
a b\Rc
Obviously, most systems take the escape character to mean "take the next character verbatim", so escaping a "non-escapable" character is usually harmless. The problem later happens when you get to comparisons and such, where the literal text does not represent the actual value (that's where you see a lot of issues securitywise, especially with things like URLs).
So on the one hand, you can only accept a limited number of escaped characters. In that sense, you have an "escape sequence", rather than an escaped character (the \x is the entire sequence rather than a \ followed by an x). That's like the most safe mechanism, and it's not really burdensome to write.
The other option is to ensure that you you "canonicalizing" everything you compare, through some ruleset. This typically means removing all of the escape sequences properly up front, before comparison and comparing only the final values rather than the literals.
Most systems interpret the slash as Will Hartung says, except for alphanumerics which are variously used as aliases for control codes, character classes, word boundaries, the start of hex sequences, case region markers, hex or octal digits, etc. \s in particular often means white-space in perl5 style regexs. JavaScript, which interprets it as 's' in one context and as whitespace in another suffers from subtle bugs because of this choice. Consider /foo\sbar/ vs new RegExp('foo\sbar').

Tokenizing numbers for a parser

I am writing my first parser and have a few questions conerning the tokenizer.
Basically, my tokenizer exposes a nextToken() function that is supposed to return the next token. These tokens are distinguished by a token-type. I think it would make sense to have the following token-types:
SYMBOL (such as <, :=, ( and the like
WHITESPACE (tab, newlines, spaces...)
REMARK (a comment between /* ... */ or after // through the new line)
NUMBER
IDENT (such as the name of a function or a variable)
STRING (Something enclosed between "....")
Now, do you think this makes sense?
Also, I am struggling with the NUMBER token-type. Do you think it makes more sense to further split it up into a NUMBER and a FLOAT token-type? Without a FLOAT token-type, I'd receive NUMBER (eg 402), a SYMBOL (.) followed by another NUMBER (eg 203) if I were about to parse a float.
Finally, what do you think makes more sense for the tokenizer to return when it encounters a -909? Should it return the SYMBOL - first, followed by the NUMBER 909 or should it return a NUMBER -909 right away?
It depends upon your target language.
The point behind a lexer is to return tokens that make it easy to write a parser for your language. Suppose your lexer returns NUMBER when it sees a symbol that matches "[0-9]+". If it sees a non-integer number, such as "3.1415926" it will return NUMBER . NUMBER. While you could handle that in your parser, if your lexer is doing an appropriate job of skipping whitespace and comments (since they aren't relevant to your parser) then you could end up incorrectly parsing things like "123 /* comment / . \n / other comment */ 456" as floating point numbers.
As for lexing "-[0-9]+" as a NUMBER vs MINUS NUMBER again, that depends upon your target language, but I would usually go with MINUS NUMBER, otherwise you would end up lexing "A = 1-2-3-4" as SYMBOL = NUMBER NUMBER NUMBER NUMBER instead of SYMBOL = NUMBER MINUS NUMBER MINUS NUMBER MINUS NUMBER.
While we're on the topic, I'd strongly recommend the book Language Implementation Patterns, by Terrance Parr, the author of ANTLR.
You are best served by making your token types closely match your grammar's terminal symbols.
Without knowing the language/grammar, I expect you would be better served by having token types for "LESS_THAN", "LESS_THAN_OR_EQUAL" and also "FLOAT", "DOUBLE", "INTEGER", etc.
From my experience with actual lexers:
Make sure to check if you actually need comment / whitespace tokens. Compilers typically don't need them, while IDEs often do (to color comments green, for example).
Usually there's no single "operator" token; instead, there's a token for each distinct operator. So there's a PLUS token and AMPERSAND token and LESSER_THAN token etc.. This means that you only care about the lexeme (the actual text matched) when the token is an identifier or some sort of literal.
Avoid splitting literals. If "hello world" is a string literal, parse it as a single token. If -3.058e18 is a float literal, parse it as a single token as well. Lexers usually rely on regular expressions, which are expressive enough for all these things, and more. Of course, if the literals are complex enough you have to split them (e.g. the block literal in Smalltalk).
I think that the answer to your question is strictly tied to the semantic of NUMBER.
What a NUMBER should be? An always positive integer, a float...
I'd like to suggest you to lookup to the flex and yacc (aka lex & bison) tools of the U**x operating systems: these are powerful parsers and scanners generators that take a grammar and output a compilable and readily usable program.
It depends on how you are taking in tokens, if you are doing it character by character, then it might be a bit tricky, but if you are doing it word by word i.e.
int a = a + 2.0
then the tokens would be (discarding whitespace)
int
a
=
a
+
2.0
So you wouldn't run into the situation where you interpret the . as at token but rather take the whole string in - which is where you can determine if it's a FLOAT or NUMBER or whatever you want.

Resources