How flex handle ambiguous patterns - flex-lexer

I want to use flex to handle patterns. In this case, both constant and function name are alphabetical strings that begin with an uppercase letter.
For example, in
Mother(Liz, Bob), how can I differentiate Mother and Liz?
I want ( to be a single token, so I can not regard Mother( as a pattern.

Normally, it would be unnecessary to generate different token types for different kinds of identifier. The parser shouldn't need that distinction if the different uses can be distinguished syntactically. (If you need semantic information to differentiate, and a sentence could be ambiguous without that information, then you might need semantic feedback but that does not appear to be the case here.)
If you don't have a parser, you would need to do some syntactic analysis. Say, for example, that function names are always followed by a ( -- which means that your language doesn't allow higher order functions. Then you could write a wrapper around yylex which reads one token in advance and emits a FUNCTION_NAME or CONSTANT_NAME, depending on the following token.

Related

How to force no whitespace in dot notation

I'm attempting to implement an existing scripting language using Ply. Everything has been alright until I hit a section with dot notation being used on objects. For most operations, whitespace doesn't matter, so I put it in the ignore list. "3+5" works the same as "3 + 5", etc. However, in the existing program that uses this scripting language (which I would like to keep this as accurate to as I can), there are situations where spaces cannot be inserted, for example "this.field.array[5]" can't have any spaces between the identifier and the dot or bracket. Is there a way to indicate this in the parser rule without having to handle whitespace not being important everywhere else? Or am I better off building these items in the lexer?
Unless you do something in the lexical scanner to pass whitespace through to the parser, there's not a lot the parser can do.
It would be useful to know why this.field.array[5] must be written without spaces. (Or, maybe, mostly without spaces: perhaps this.field.array[ 5 ] is acceptable.) Is there some other interpretation if there are spaces? Or is it just some misguided aesthetic judgement on the part of the scripting language's designer?
The second case is a lot simpler. If the only possibilities are a correct parse without space or a syntax error, it's only necessary to validate the expression after it's been recognised by the parser. A simple validation function would simply check that the starting position of each token (available as p.lexpos(i) where p is the action function's parameter and i is the index of the token the the production's RHS) is precisely the starting position of the previous token plus the length of the previous token.
One possible reason to require the name of the indexed field to immediately follow the . is to simplify the lexical scanner, in the event that it is desired that otherwise reserved words be usable as member names. In theory, there is no reason why any arbitrary identifier, including language keywords, cannot be used as a member selector in an expression like object.field. The . is an unambiguous signal that the following token is a member name, and not a different syntactic entity. JavaScript, for example, allows arbitrary identifiers as member names; although it might confuse readers, nothing stops you from writing obj.if = true.
That's a big of a challenge for the lexical scanner, though. In order to correctly analyse the input stream, it needs to be aware of the context of each identifier; if the identifier immediately follows a . used as a member selector, the keyword recognition rules must be suppressed. This can be done using lexical states, available in most lexer generators, but it's definitely a complication. Alternatively, one can adopt the rule that the member selector is a single token, including the .. In that case, obj.if consists of two tokens (obj, an IDENTIFIER, and .if, a SELECTOR). The easiest implementation is to recognise SELECTOR using a pattern like \.[a-zA-Z_][a-zA-Z0-9_]*. (That's not what JavaScript does. In JavaScript, it's not only possible to insert arbitrary whitespace between the . and the selector, but even comments.)
Based on a comment by the OP, it seems plausible that this is part of the reasoning for the design of the original scripting language, although it doesn't explain the prohibition of whitespace before the . or before a [ operator.
There are languages which resolve grammatical ambiguities based on the presence or absence of surrounding whitespace, for example in disambiguating operators which can be either unary or binary (Swift); or distinguishing between the use of | as a boolean operator from its use as an absolute value expression (uncommon but see https://cs.stackexchange.com/questions/28408/lexing-and-parsing-a-language-with-juxtaposition-as-an-operator); or even distinguishing the use of (...) in grouping expressions from their use in a function call. (Awk, for example). So it's certainly possible to imagine a language in which the . and/or [ tokens have different interpretations depending on the presence or absence of surrounding whitespace.
If you need to distinguish the cases of tokens with and without surrounding whitespace so that the grammar can recognise them in different ways, then you'll need to either pass whitespace through as a token, which contaminates the entire grammar, or provide two (or more) different versions of the tokens whose syntax varies depending on whitespace. You could do that with regular expressions, but it's probably easier to do it in the lexical action itself, again making use of the lexer state. Note that the lexer state includes lexdata, the input string itself, and lexpos, the index of the next input character; the index of the first character in the current token is in the token's lexpos attribute. So, for example, a token was preceded by whitespace if t.lexpos == 0 or t.lexer.lexdata[t.lexpos-1].isspace(), and it is followed by whitespace if t.lexer.lexpos == len(t.lexer.lexdata) or t.lexer.lexdata[t.lexer.lexpos].isspace().
Once you've divided tokens into two or more token types, you'll find that you really don't need the division in most productions. So you'll usually find it useful to define a new non-terminal for each token type representing all of the whitespace-context variants of that token; then, you only need to use the specific variants in productions where it matters.

What is a valid character in an identifier called?

Identifiers typically consist of underscores, digits; and uppercase and lowercase characters where the first character is not a digit. When writing lexers, it is common to have helper functions such as is_digit or is_alnum. If one were to implement such a function to scan a character used in an identifier, what would it be called? Clearly, is_identifier is wrong as that would be the entire token that the lexer scans and not the individual character. I suppose is_alnum_or_underscore would be accurate though quite verbose. For something as common as this, I feel like there should be a single word for it.
Unicode Annex 31 (Unicode Identifier and Pattern Syntax, UAX31) defines a framework for the definition of the lexical syntax of identifiers, which is probably as close as we're going to come to a standard terminology. UAX31 is used (by reference) by Python and Rust, and has been approved for C++23. So I guess it's pretty well mainstream.
UAX31 defines three sets of identifier characters, which it calls Start, Continue and Medial. All Start characters are also Continue characters; no Medial character is a Continue character.
That leads to the simple regular expression (UAX31-D1 Default Identifier Syntax):
<Identifier> := <Start> <Continue>* (<Medial> <Continue>+)*
A programming language which claims conformance with UAX31 does not need to accept the exact membership of each of these sets, but it must explicitly spell out the deviations in what's called a "profile". (There are seven other requirements, which are not relevant to this question. See the document if you want to fall down a very deep rabbit hole.)
That can be simplified even more, since neither UAX31 nor (as far as I know) the profile for any major language places any characters in Medial. So you can go with the flow and just define two categories: identifier-start and identifier-continue, where the first one is a subset of the second one.
You'll see that in a number of grammar documents:
Pythonidentifier ::= xid_start xid_continue*
RustIDENTIFIER_OR_KEYWORD : XID_Start XID_Continue*
| _ XID_Continue+
C++identifier:
identifier-start
identifier identifier-continue
So that's what I'd suggest. But there are many other possibilities:
SwiftCalls the sets identifier-head and identifier-characters
JavaCalls them JavaLetter and JavaLetterOrDigit
CDefines identifier-nondigit and identifier-digit; Continue would be the union of the two sets.

ANTLR4 - Parse subset of a language (e.g. just query statements)

I'm trying to figure out how I can best parse just a subset of a given language with ANTLR. For example, say I'm looking to parse U-SQL. Really, I'm only interested in parsing certain parts of the language, such as query statements. I couldn't be bothered with parsing the many other features of the language. My current approach has been to design my lexer / parser grammar as follows:
// ...
statement
: queryStatement
| undefinedStatement
;
// ...
undefinedStatement
: (.)+?
;
// ...
UndefinedToken
: (.)+?
;
The gist is, I add a fall-back parser rule and lexer rule for undefined structures and tokens. I imagine later, when I go to walk the parse tree, I can simply ignore the undefined statements in the tree, and focus on the statements I'm interested in.
This seems like it would work, but is this an optimal strategy? Are there more elegant options available? Thanks in advance!
Parsing a subpart of a grammar is super easy. Usually you have a top level rule which you call to parse the full input with the entire grammar.
For the subpart use the function that parses only a subrule like:
const expression = parser.statement();
I use this approach frequently when I want to parse stored procedures or data types only.
Keep in mind however, that subrules usually are not termined with the EOF token (as the top level rule should be). This will cause no syntax error if more than the subelement is in the token stream (the parser just stops when the subrule has matched completely). If that's a problem for you then add a copy of the subrule you wanna parse, give it a dedicated name and end it with EOF, like this:
dataTypeDefinition: // For external use only. Don't reference this in the normal grammar.
dataType EOF
;
dataType: // type in sql_yacc.yy
type = (
...
Check the MySQL grammar for more details.
This general idea -- to parse the interesting bits of an input and ignore the sea of surrounding tokens -- is usually called "island parsing". There's an example of an island parser in the ANTLR reference book, although I don't know if it is directly applicable.
The tricky part of island parsing is getting the island boundaries right. If you miss a boundary, or recognise as a boundary something which isn't, then your parse will fail disastrously. So you need to understand the input at least well enough to be able to detect where the islands are. In your example, that might mean recognising a SELECT statement, for example. However, you cannot blindly recognise the string of letters SELECT because that string might appear inside a string constant or a comment or some other context in which it was never intended to be recognised as a token at all.
I suspect that if you are going to parse queries, you'll basically need to be able to recognise any token. So it's not going to be sea of uninspected input characters. You can view it as a sea of recognised but unparsed tokens. In that case, it should be reasonably safe to parse a non-query statement as a keyword followed by arbitrary tokens other than ; and ending with a ;. (But you might need to recognise nested blocks; I don't really know what the possibilities are.)

Antlr: lookahead and lookbehind examples

I'm having a hard time figuring out how to recognize some text only if it is preceded and followed by certain things. The task is to recognize AND, OR, and NOT, but not if they're part of a word:
They should be recognized here:
x AND y
(x)AND(y)
NOT x
NOT(x)
but not here:
xANDy
abcNOTdef
AND gets recognized if it is surrounded by spaces or parentheses. NOT gets recognized if it is at the beginning of the input, preceded by a space, and followed by a space or parenthesis.
The trouble is that if I include parentheses as part of the definition of AND or NOT, they get consumed, and I need them to be separate tokens.
Is there some kind of lookahead/lookbehind syntax I can use?
EDIT:
Per the comments, here's some context. The problem is related to this problem: Antlr: how to match everything between the other recognized tokens? My working solution there is just to recognize AND, OR, etc. and skip everything else. Then, in a second pass over the text, I manually grab the characters not otherwise covered, and run a totally different tokenizer on it. The reason is that I need a custom, human-language-specific tokenizer for this content, which means that I can't, in advance, describe what is an ID. Each human language is different. I want to combine, in stages, a single query-language tokenizer, and then apply a human-language tokenizer to what's left.
ANTLR is not the right tool for this task. A normal parser is designed for a specific language, that is, a set of sentences consisting of elements that are known at parser creation time. There are ways to make this more flexible, e.g. by using a runtime function in a predicate to recognize words not defined in the grammar, but this has other (negative) implications.
What you should consider is NLP for a different approach to process natural language. It's more than just skipping things between two known tokens.

Designing a Language Lexer

I'm currently in the process of creating a programming language. I've laid out my entire design and am in progress of creating the Lexer for it. I have created numerous lexers and lexer generators in the past, but have never come to adopt the "standard", if one exists.
Is there a specific way a lexer should be created to maximise capability to use it with as many parsers as possible?
Because the way I design mine, they look like the following:
Code:
int main() {
printf("Hello, World!");
}
Lexer:
[
KEYWORD:INT, IDENTIFIER:"main", LEFT_ROUND_BRACKET, RIGHT_ROUNDBRACKET, LEFT_CURLY_BRACKET,
IDENTIFIER:"printf", LEFT_ROUND_BRACKET, STRING:"Hello, World!", RIGHT_ROUND_BRACKET, COLON,
RIGHT_CURLY_BRACKET
]
Is this the way Lexer's should be made? Also as a side-note, what should my next step be after creating a Lexer? I don't really want to use something such as ANTLR or Lex+Yacc or Flex+Bison, etc. I'm doing it from scratch.
If you don't want to use a parser generator [Note 1], then it is absolutely up to you how your lexer provides information to your parser.
Even if you do use a parser generator, there are many details which are going to be project-dependent. Sometimes it is convenient for the lexer to call the parser with each token; other times is is easier if the parser calls the lexer; in some cases, you'll want to have a driver which interacts separately with each component. And clearly, the precise datatype(s) of your tokens will vary from project to project, which can have an impact on how you communicate as well.
Personally, I would avoid use of global variables (as in the original yacc/lex protocol), but that's a general style issue.
Most lexers work in streaming mode, rather than tokenizing the entire input and then handing the vector of tokens to some higher power. Tokenizing one token at a time has a number of advantages, particularly if the tokenization is context-dependent, and, let's face it, almost all languages have some impurity somewhere in their syntax. But, again, that's entirely up to you.
Good luck with your project.
Notes:
Do you also forgo the use of compilers and write all your code from scratch in assembler or even binary?
Is there a specific way a lexer should be created to maximise capability to use it with as many parsers as possible?
In the lexers I've looked at, the canonical API is pretty minimal. It's basically:
Token readNextToken();
The lexer maintains a reference to the source text and its internal pointers into where it is currently looking. Then, every time you call that, it scans and returns the next token.
The Token type usually has:
A "type" enum for which kind of token it is: string, operator, identifier, etc. There are usually special kinds for "EOF", meaning a special terminator token that is produced after the end of the input, and "ERROR" for the rare cases where a syntax error comes from the lexical grammar. This is mainly just unterminated string literals or totally unknown characters in the source.
The source text of the token.
Sometimes literals are converted to their proper value representation during lexing in which case you'll have that value too. So a number token would have "123" as text but also have the numeric value 123. Or you can do that during parsing/compilation.
Location within the source file of the token. This is for error reporting. Usually 1-based line and column, but can also just be start and end byte offsets. The latter is a little faster to produce and can be converted to line and column lazily if needed.
Depending on your grammar, you may need to be able to rewind the lexer too.

Resources