Is it possible to remove the parser's internal control of the lexer when parsing heredocs in shell?

To deal with heredocs in shell (e.g., bash), the grammar rule changes the variable need_here_doc via push_heredoc():
  | LESS_LESS WORD
      {
        source.dest = 0;
        redir.filename = $2;
        $$ = make_redirection (source, r_reading_until, redir, 0);
        push_heredoc ($$);
      }
http://git.savannah.gnu.org/cgit/bash.git/tree/parse.y#n539
static void
push_heredoc (r)
     REDIRECT *r;
{
  if (need_here_doc >= HEREDOC_MAX)
    {
      last_command_exit_value = EX_BADUSAGE;
      need_here_doc = 0;
      report_syntax_error (_("maximum here-document count exceeded"));
      reset_parser ();
      exit_shell (last_command_exit_value);
    }
  redir_stack[need_here_doc++] = r;
}
http://git.savannah.gnu.org/cgit/bash.git/tree/parse.y#n2794
need_here_doc is used in read_token(), which is called by yylex(). This makes the behavior of yylex() non-autonomous.
Is it normal to design a parser that can change how yylex() behaves?
Is it because the shell language is not LALR(1), so there is no way to avoid changing the behavior of yylex() by the grammar actions?
  if (need_here_doc)
    gather_here_documents ();
http://git.savannah.gnu.org/cgit/bash.git/tree/parse.y#n3285
current_token = read_token (READ);
http://git.savannah.gnu.org/cgit/bash.git/tree/parse.y#n2761

Is it normal to design a parser that can change how yylex() behaves?
Sure. It might not be ideal, but it's extremely common.
The Posix shell syntax is far from the ideal candidate for a flex/bison parser, and about the only thing you can say for the bash implementation using flex and bison is that it demonstrates how flexible those tools can be if pushed to their respective limits. Here-docs are not the only place where "lexical feedback" is necessary.
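To make "lexical feedback" concrete, here is a minimal sketch of the heredoc mechanism in Python; all names are invented and the structure is drastically simplified compared to bash's parse.y:

class HeredocLexer:
    def __init__(self, lines):
        self.lines = list(lines)   # remaining input lines
        self.pending = []          # delimiters pushed by grammar actions
        self.bodies = {}           # delimiter -> gathered heredoc body

    def push_heredoc(self, delim):
        # the analogue of push_heredoc(): called from the parser's
        # action for 'LESS_LESS WORD'
        self.pending.append(delim)

    def end_of_line(self):
        # the analogue of gather_here_documents(): called by the lexer
        # when it reaches a newline and finds pending heredocs
        while self.pending:
            delim = self.pending.pop(0)
            body = []
            while self.lines and self.lines[0] != delim:
                body.append(self.lines.pop(0))
            if self.lines:
                self.lines.pop(0)  # consume the delimiter line
            self.bodies[delim] = "\n".join(body)

lexer = HeredocLexer(["hello", "EOF"])
lexer.push_heredoc("EOF")          # parser just saw: cat <<EOF
lexer.end_of_line()                # lexer reached the end of the command line
print(lexer.bodies)                # {'EOF': 'hello'}

The point, as in bash, is that what the lexer does at the newline depends on state deposited earlier by a grammar action.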
But even in more disciplined languages, lexical feedback can be useful. Or its alternative: writing partial parsing logic into the lexical scanner in order for it to know when the parse would require a different set of lexical rules.
Possibly the most well-known (or most frequently-commented) lexical feedback is the parsing of C-style cast expressions, which require the lexer to know whether the foo in (foo) is a typename or not. (This is usually implemented by way of a symbol table shared between the parser and the lexer but the precise implementation details are tricky.)
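A toy version of that shared-table arrangement (often called "the lexer hack"; the names here are invented) might look like:

typedef_names = set()              # shared between parser and lexer

def classify(identifier):
    # the lexer cannot make this distinction on its own
    kind = "TYPE_NAME" if identifier in typedef_names else "IDENTIFIER"
    return (kind, identifier)

typedef_names.add("foo")           # parser action for: typedef int foo;

print(classify("foo"))             # ('TYPE_NAME', 'foo'): so (foo)x is a cast
print(classify("bar"))             # ('IDENTIFIER', 'bar'): so (bar) is an expression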
Here are a few other examples, which might be considered relatively benign uses of lexical feedback, although they certainly increase the coupling between lexer and parser.
Python (and Haskell) require the lexical scanner to reformulate leading whitespace into INDENT or DEDENT tokens. But if the line break occurs within parentheses, the whitespace handling is suppressed (including the NEWLINE token itself); a sketch of this mechanism follows these examples.
Ecmascript (Javascript) and other languages allow regular expression literals to be written surrounded by /s. But the / could also be a division operator or the first character in a /= mutation operator. The lexical decision depends on the parse context. (This could be guessed by the lexical scanner from the recent token history, which would count as reproducing part of the parsing logic in the lexical scanner.)
Similar to the above, many languages overload < in ways which complicate the logic in the lexical scanner. The use as a template bracket rather than a comparison operator might be dealt with in the scanner -- in C++, for example, it will depend on features like whether the preceding identifier was a template or not -- but that doesn't actually change lexical context. However, the use of an angle bracket to indicate the start of an X/HTML literal (or template) definitely changes lexical context. As with the regex example above, it will be necessary to know whether a comparison operator would be syntactically valid at that point.
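Here is the promised sketch of the first example: Python-style INDENT/DEDENT generation, suppressed inside brackets. (Heavily simplified: the real tokenizer also handles tabs, comments, backslash continuations and inconsistent dedents, and emits real tokens rather than whole lines.)

def tokens(lines):
    indents = [0]                  # stack of active indentation widths
    depth = 0                      # bracket nesting: ( [ {
    for line in lines:
        stripped = line.lstrip(' ')
        if not stripped:
            continue               # blank lines produce no tokens
        if depth == 0:             # leading whitespace is significant
            width = len(line) - len(stripped)
            while width < indents[-1]:
                indents.pop()
                yield ('DEDENT',)
            if width > indents[-1]:
                indents.append(width)
                yield ('INDENT',)
        yield ('TEXT', stripped)   # stand-in for the line's real tokens
        depth += sum(stripped.count(c) for c in '([{')
        depth -= sum(stripped.count(c) for c in ')]}')
        if depth == 0:
            yield ('NEWLINE',)     # suppressed inside brackets
    while len(indents) > 1:
        indents.pop()
        yield ('DEDENT',)          # close any open blocks at EOF

for tok in tokens(["if x:", "    f(a,", "        b)", "done"]):
    print(tok)
# No NEWLINE after 'f(a,' and no INDENT before 'b)': both are
# suppressed because the line break occurs inside parentheses.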
Is it because the shell language is not LALR(1), so there is no way to avoid changing the behavior of yylex() by the grammar actions?
The Posix shell syntax is most certainly not LALR(1), or even context-free. But most languages could not be parsed scannerlessly with an LALR(1) parser, and many languages turn out not to have context-free grammars if you take all syntactic considerations into account. (Cf. C-style cast expressions, above.) Perhaps shell is further from the platonic ideal than most. But then, it grew over the years from a kernel intended to be simple to type, rather than formally analysable. (No comment from me about whether this excuse can be extended to Perl, which I don't plan to discuss here.)
What I'd say in general is that languages which embed other languages (regular expressions, HTML fragments, Flex/Bison semantic actions, shell arithmetic expansions, etc., etc.) present challenges for a simplistic parser/scanner model. Despite lots of interesting work and solid experimentation, my sense is that language embedding still lacks a good implementable formal structure. And since most languages do have embedded sublanguages, there is and will continue to be a certain adhockery in their parser implementations. In part, that's what makes this field of study so much fun.

Related

How would you re-write the production rules for an existing grammar so that semi-colons were optional?

Suppose that there was a programming language Mod(C) just like C++ except that it was white-space sensitive.
That is, parsers and compilers written for Mod(C) did not ignore line-feeds, spaces, etc...
Also suppose that someone had already written down the production rules for describing a formal grammar for this modified version of C++.
My question is, how would you modify the production rules so that semi-colons were optional in the event that the semi-colon was followed by a line-break?
Actually, the semi-colon would be optional if some <optional_semi_colon> token is followed by:
one or more spaces and tabs
zero or one line-comments
a line-break (\n\r or \r\n or \n or \r)
The following piece of code would compile just fine:
#include <iostream>
using namespace std;

int main() {
    for (int i = 1; i <= 5; ++i) {
        cout << i << " "; // there is a semi-colon here. that's okay
    }
    return 0 // no semi-colon on this line
}
It's not at all clear what you mean by "white-space sensitive" and your example doesn't show any white-space sensitivity other than the possible optionality of semicolons at the end of a line. That is, you don't seem to be looking for an implementation of the off-side rule, as with Python or Haskell, where indentation indicates block structure. [Note 1]
Presumably, you are not just asking how to turn any newline into a semicolon, since that would be trivial. So I assume that you want something like JavaScript's automatic semicolon insertion (ASI), which automatically inserts a semicolon at the end of a line if:
the line doesn't already end with a semicolon, and
the current parse cannot be extended with the first token of the next line, either because
2.1. there is no item in the current state's itemset which would allow the next token to be shifted, or
2.2. the items which might allow a shift have been marked as not allowing a newline.
[Note 2]
Provision 2.1 prevents the incorrect semicolon insertion in:
let x = a
+ b
On the other hand, there are cases when you really want the newline to end the statement, in order to avoid silent bugs, such as:
yield
/* The above yield always produces undefined. */
console.log("We've been resumed");
yield -- and other statements with optional operands -- are annotated in the grammar with [No LineTerminator here], which triggers provision 2.2 and thus allows ASI even though the next token (console in the above example) could otherwise have extended the parse. Another context where newlines are banned is between a value and the postfix ++ operator, so that
let v = 2 * a
++v
is accepted with the (presumably) intended meaning. (Without ASI, there would be a syntax error after the ++, where ASI is not allowed because there's no newline character.)
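Schematically, the insertion decision described above reduces to something like this (names invented; a real implementation asks these questions of the LR automaton's current state):

def asi_inserts_semicolon(newline_before_token, token_can_be_shifted,
                          no_line_terminator_here):
    if not newline_before_token:
        return False    # ASI only applies across a line break
    if not token_can_be_shifted:
        return True     # provision 2.1: the parse cannot be extended
    if no_line_terminator_here:
        return True     # provision 2.2: [No LineTerminator here]
    return False

# 'let x = a NEWLINE + b': '+' extends the parse, no restriction applies
print(asi_inserts_semicolon(True, True, False))   # False: no semicolon
# 'yield NEWLINE console...': 'console' could extend the parse, but the
# yield production forbids a line terminator there
print(asi_inserts_semicolon(True, True, True))    # True: semicolon inserted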
JavaScript is not the only language with optional command terminators, but it's probably the language with the most elaborate set of rules controlling the parse. Other languages include:
Python, which in addition to the off-side rule, ignores newlines inside parentheses, braces and brackets;
Awk, which treats newlines as statement terminators unless the newline follows one of the tokens ,, {, ?, :, ||, &&, do, or else (this rule is sketched just after this list). [Note 3]
Bash, in which a newline is a command terminator except in specific contexts, such as after a |, || or && operator or a keyword like do, then and else, and inside an array literal or an arithmetic expansion. [Note 4]
Kotlin, whose rules I don't know. And undoubtedly other languages as well.
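The Awk rule in the list above is simple enough to write down as a lookup; this sketch is mine, not Awk's source:

CONTINUATION_TOKENS = {",", "{", "?", ":", "||", "&&", "do", "else"}

def newline_terminates_statement(previous_token):
    # a newline acts as a statement terminator unless the previous
    # token promises that the statement continues
    return previous_token not in CONTINUATION_TOKENS

print(newline_terminates_statement("&&"))   # False: the statement continues
print(newline_terminates_statement(")"))    # True: newline ends the statement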
JavaScript (and, I believe, Kotlin) suffer from an ambiguity with function calls, because
let a = b
((x)=>console.log(x))(42)
is parsed as calling b with the argument (x)=>console.log(x) and then calling the result with the argument 42. (Which is a runtime error, because the result of console.log is undefined, which cannot be called.)
There are also languages like Lua, in which semicolons are always optional, even if you run statements together on a single line (a = 3 b = 4), and which therefore also suffers from the misinterpretation of function call expressions. However, unlike JavaScript, Lua requires that the ( be on the same line as the function expression, and therefore flags the equivalent of the above example as a syntax error. (The check is not part of the grammar, for what it's worth: After the function call is parsed, a semantic check is performed to verify that the line number of the ( token is the same as the line number of the last token in the function expression.)
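That check fits in a few lines. A sketch (in Python; Lua's real check lives in its C parser, and the message below mirrors Lua's wording):

def check_call_not_ambiguous(callee_end_line, open_paren_line):
    # after parsing 'f (args)', reject the call if the '(' did not
    # start on the same line where the callee expression ended
    if open_paren_line != callee_end_line:
        raise SyntaxError("ambiguous syntax (function call x new statement)")

check_call_not_ambiguous(1, 1)    # f(42) on one line: accepted
# check_call_not_ambiguous(1, 2)  # '(' on the next line: raises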
I went to the trouble of enumerating all of the above examples by way of illustrating the fact that optional semicolons are not a simple grammatical transformation, and that there is no simple rule which can determine the precise circumstances in which a newline is an implicit statement terminator. Realistic implementations of the feature are non-trivial, and differ in their details; the algorithm chosen needs to be tested against a variety of realistic code samples, and it needs to be carefully documented so that programmers using the language don't find themselves surprised by the results. If you get it wrong, but your language nonetheless becomes popular, you'll find projects with style guides which require semicolons even in contexts in which they were optional. None of that is intended to imply that you shouldn't pursue the idea; only that it is perhaps more complicated than it looks at first glance.
Having said all that, I don't believe that any of the above examples require a context-sensitive grammar (unless you want to implement the off-side rule). Even in the case of JavaScript, possibly with some minor exceptions, a parser can be created by starting with an LALR automaton and then adjusting the transition rules, state by state, in order to either ignore a newline token or reduce it to a statement terminator (as well as implementing the lookahead restrictions in certain rules). Most of these modifications will effectively be simply the deletion of one of two conflicting parser actions, similar to the operator-precedence-based resolution of ambiguous expression grammars. (And it's worth noting that most parser generators make no attempt to rewrite the original grammar after processing of the precedence declarations.)
However, while the existence of a PDA demonstrates the existence of a context-free grammar (at least for a context-free superset of the target language [Note 5]), it does not demonstrate the existence of a simple or elegant grammar. It seems to me likely that recreating a grammar from the modified PDA will produce a bloated monster without much value as a discursive tool. The modified PDA itself is sufficient to perform the parse, so reconstructing a grammar is not of much practical value.
Notes
That's perhaps just as well, because the off-side rule is not context-free, and thus cannot be implemented with a context-free grammar. Although there are well-known techniques for implementing it in a lexical scanner.
That's a slight oversimplification of the ASI rules. In some cases, JavaScript also allows a semicolon to be inserted before a }, even if there is no newline at that point. But that's not a significant complication.
? and : are a Gnu AWK extension.
That's not a complete list, by any means. See the Posix shell grammar and the bash manual for more details.
While the published grammars for the languages mentioned above, like the grammars for C and C++, are nominally context-free grammars, they do not actually encompass the entirety of the well-formedness rules in the respective language standards and/or manuals, which include constraints (or "early errors", in the terms of ECMA-262), "that can be detected and reported prior to the evaluation of any construct". Many of these rules are clearly context-sensitive (such as prohibitions on multiple definitions of the same name in a lexical scope).
Context-sensitive parsing is not necessarily a bad thing. Sometimes it's a lot simpler than trying to achieve the same result with a context-free grammar (as in the case of Lua function calls mentioned above). But it's certainly convenient to parse as much as possible using a generated parser, since such a parser can more easily be repurposed for other applications, such as linters, code browsers, syntax highlighters, and so on, not all of which need to be as precise as a compiler.

Does context-sensitive tokenisation require multiple goal symbols in the lexical grammar?

According to the ECMAScript spec:
There are several situations where the identification of lexical input elements is sensitive to the syntactic grammar context that is consuming the input elements. This requires multiple goal symbols for the lexical grammar.
Two such symbols are InputElementDiv and InputElementRegExp.
In ECMAScript, the meaning of / depends on the context in which it appears. Depending on the context, a / can either be a division operator, the start of a regex literal or a comment delimiter. The lexer cannot distinguish between a division operator and regex literal on its own, so it must rely on context information from the parser.
I'd like to understand why this requires the use of multiple goal symbols in the lexical grammar. I don't know much about language design so I don't know if this is due to some formal requirement of a grammar or if it's just convention.
Questions
Why not just use a single goal symbol like so:
InputElement ::
    [...]
    DivPunctuator
    RegularExpressionLiteral
    [...]
and let the parser tell the lexer which production to use (DivPunctuator vs RegExLiteral), rather than which goal symbol to use (InputElementDiv vs InputElementRegExp)?
What are some other languages that use multiple goal symbols in their lexical grammar?
How would we classify the ECMAScript lexical grammar? It's not context-sensitive in the sense of the formal definition of a CSG (i.e. the LHS of its productions are not surrounded by a context of terminal and nonterminal symbols).
Saying that the lexical production is "sensitive to the syntactic grammar context that is consuming the input elements" does not make the grammar context-sensitive, in the formal-languages definition of that term. Indeed, there are productions which are "sensitive to the syntactic grammar context" in just about every non-trivial grammar. It's the essence of parsing: the syntactic context effectively provides the set of potentially expandable non-terminals, and those will differ in different syntactic contexts, meaning that, for example, in most languages a statement cannot be entered where an expression is expected (although it's often the case that an expression is one of the manifestations of a statement).
However, the difference does not involve different expansions for the same non-terminal. What's required in a "context-free" language is that the set of possible derivations of a non-terminal is the same set regardless of where that non-terminal appears. So the context can provide a different selection of non-terminals, but every non-terminal can be expanded without regard to its context. That is the sense in which the grammar is free of context.
As you note, context-sensitivity is usually abstracted in a grammar by a grammar with a pattern on the left-hand side rather than a single non-terminal. In the original definition, the context --everything other than the non-terminal to be expanded-- needed to be passed through the production untouched; only a single non-terminal could be expanded, but the possible expansions depend on the context, as indicated by the productions. Implicit in the above is that there are grammars which can be written in BNF which don't even conform to that rule for context-sensitivity (or some other equivalent rule). So it's not a binary division, either context-free or context-sensitive. It's possible for a grammar to be neither (and, since the empty context is still a context, any context-free grammar is also context-sensitive). The bottom line is that when mathematicians talk, the way they use words is sometimes unexpected. But it always has a clear underlying definition.
In formal language theory, there are not lexical and syntactic productions; just productions. If both the lexical productions and the syntactic productions are free of context, then the total grammar is free of context. From a practical viewpoint, though, combined grammars are harder to parse, for a variety of reasons which I'm not going to go into here. It turns out that it is somewhat easier to write the grammars for a language, and to parse them, with a division between lexical and syntactic parsers.
In the classic model, the lexical analysis is done first, so that the parser doesn't see individual characters. Rather, the syntactic analysis is done with an "alphabet" (in a very expanded sense) of "lexical tokens". This is very convenient -- it means, for example, that the lexical analysis can simply drop whitespace and comments, which greatly simplifies writing a syntactic grammar. But it also reduces generality, precisely because the syntactic parser cannot "direct" the lexical analyser to do anything. The lexical analyser has already done what it is going to do before the syntactic parser is aware of its needs.
If the parser were able to direct the lexical analyser, it would do so in the same way as it directs itself. In some productions, the token non-terminals would include InputElementDiv and while in other productions InputElementRegExp would be the acceptable non-terminal. As I noted, that's not context-sensitivity --it's just the normal functioning of a context-free grammar-- but it does require a modification to the organization of the program to allow the parser's goals to be taken into account by the lexical analyser. This is often referred to (by practitioners, not theorists) as "lexical feedback" and sometimes by terms which are rather less value neutral; it's sometimes considered a weakness in the design of the language, because the neatly segregated lexer/parser architecture is violated. C++ is a pretty intense example, and indeed there are C++ programs which are hard for humans to parse as well, which is some kind of indication. But ECMAScript does not really suffer from that problem; human beings usually distinguish between the division operator and the regexp delimiter without exerting any noticeable intellectual effort. And, while the lexical feedback required to implement an ECMAScript parser does make the architecture a little less tidy, it's really not a difficult task, either.
Anyway, a "goal symbol" in the lexical grammar is just a phrase which the authors of the ECMAScript reference decided to use. Those "goal symbols" are just ordinary lexical non-terminals, like any other production, so there's no difference between saying that there are "multiple goal symbols" and saying that the "parser directs the lexer to use a different production", which I hope addresses the question you asked.
Notes
The lexical difference in the two contexts is not just that / has a different meaning. If that were all that it was, there would be no need for lexical feedback at all. The problem is that the tokenization itself changes. If an operator is possible, then the /= in
a /=4/gi;
is a single token (a compound assignment operator), and gi is a single identifier token. But if a regexp literal were possible at that point (and it's not, because regexp literals cannot follow identifiers), the /= would no longer be a single token: the / would start a regexp literal, the = would be the first character of its body, and the trailing gi would be the literal's flags rather than an identifier.
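The difference is easy to demonstrate with two toy token patterns (mine, and far cruder than the spec's lexical grammar):

import re

src = "a /=4/gi;"

# Operator context: '/=' is one compound-assignment token.
print(re.findall(r"\s*(/=|/|\d+|[A-Za-z_]\w*|;)", src))
# ['a', '/=', '4', '/', 'gi', ';']

# Regexp context: '/=4/gi' is one regexp-literal token.
print(re.findall(r"\s*(/(?:[^/\\\n]|\\.)*/[A-Za-z]*|\d+|[A-Za-z_]\w*|;)", src))
# ['a', '/=4/gi', ';']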
Parsers which are built from a single set of productions are preferred by some programmers (but not the one who is writing this :-) ); they are usually called "scannerless parsers". In a scannerless parser for ECMAScript there would be no lexical feedback because there is no separate lexical analysis.
There really is a breach between the theoretical purity of formal language theory and the practical details of writing a working parser of a real-life programming language. The theoretical models are really useful, and it would be hard to write a parser without knowing something about them. But very few parsers rigidly conform to the model, and that's OK. Similarly, the things which are popularly called "regular expressions" aren't regular at all, in a formal language sense; some "regular expression" operators aren't even context-free (back-references). So it would be a huge mistake to assume that some theoretical result ("regular expressions can be identified in linear time and constant space") is actually true of a "regular expression" library. I don't think parsing theory is the only branch of computer science which exhibits this dichotomy.
Why not just use a single goal symbol like so:
InputElement ::
    ...
    DivPunctuator
    RegularExpressionLiteral
    ...
and let the parser tell the lexer which production to use (DivPunctuator vs RegExLiteral), rather than which goal symbol to use (InputElementDiv vs InputElementRegExp)?
Note that DivPunctuator and RegExLiteral aren't productions per se, rather they're nonterminals. And in this context, they're right-hand-sides (alternatives) in your proposed production for InputElement. So I'd rephrase your question as: Why not have the syntactic parser tell the lexical parser which of those two alternatives to use? (Or equivalently, which of those two to suppress.)
In the ECMAScript spec, there's a mechanism to accomplish this: grammatical parameters (explained in section 5.1.5).
E.g., you could define the parameter Div, where:
+Div means "a slash should be recognized as a DivPunctuator", and
~Div means "a slash should be recognized as the start of a RegExLiteral".
So then your production would become
InputElement[Div] ::
    ...
    [+Div] DivPunctuator
    [~Div] RegularExpressionLiteral
    ...
But notice that the syntactic parser still has to tell the lexical parser to use either InputElement[+Div] or InputElement[~Div] as the goal symbol, so you arrive back at the spec's current solution, modulo renaming.
What are some other languages that use multiple goal symbols in their lexical grammar?
I think most don't try to define a single symbol that derives all tokens (or input elements), let alone have to divide it up into variants like ECMAScript's InputElementFoo, so it might be difficult to find another language with something similar in its specification.
Instead, it's pretty common to simply define rules for the syntax of different kinds of tokens (e.g. Identifier, NumericLiteral) and then reference them from the syntactic productions. So that's kind of like having multiple lexical goal symbols, but not (I would say) in the sense you were asking about.
How would we classify the ECMAScript lexical grammar?
It's basically context-free, plus some extensions.

Can we use BNF for parsing AND lexing instead of regex?

With a Backus-Naur form grammar (BNF), we can specify the syntax of the programming language in order to parse it and produce an abstract syntax tree (AST).
<if> ::= "if" <expression> "then" <action> "end"
But we can also specify the tokens with a BNF grammar, as the first usage of BNF did for ALGOL-60:
<digit> ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<digit_with_zero> ::= <digit> | "0"
<integer> ::= <digit> | <integer> <digit_with_zero>
However, this usage of the BNF in order to lex (= produce a list of minimal meaningful units aka tokens) has been deprecated in favor of regular expressions (like [1-9][0-9]*).
It seems clear that regexes are much more concise.
It also seems that keeping the structure of an if statement is useful for the interpreter or compiler which will handle the AST produced by the parser, but keeping the structure of an integer (or a float) is not.
But do you agree that BNF could be used for both lexing and parsing?
And do you agree with the reasons which make regex much more suited for lexing?
Or are there others?
Regular expressions (in the mathematical sense) are equivalent in power to regular grammars and regular grammars can be written in BNF. So in that sense, it is clearly possible to write a full grammar for any context-free language in pure BNF.
Indeed, it is not even necessary to maintain the lexer/parser dichotomy. Some programmers find it convenient to use scannerless parsing (the article is not great but it has some interesting references), although many of these are based on the PEG formalism (which is not context-free) rather than BNF. (These are not the same despite the superficial resemblance.)
That said, it might not be convenient. In general, like most questions related to the structure of parsers, the answer is going to be based less on theory and more on a combination of practicality (with reference to a specific use case) and programmer prejudice.
As is well known, purity is rarely the most practical. Most real-life parser and scanner generators deviate from the pure theoretical models in order to provide mechanisms which are easier to use, easier to implement efficiently, or more powerful. For example, the character class syntax ([a-zA-Z]), which is almost universal in scanner generators, is a clear extension to regular expression syntax which deliberately avoids the need to explicitly list the entire contents of the set. One could say that the listing is implicit and unambiguous in the example I just presented, but most scanner generators also allow the use of classes like [[:alnum:]] ("alphanumeric symbols"), where the precise list of matched symbols is either locale-dependent or, in the Unicode world, extensible in the future. Regardless, this is obviously a useful extension.
While it is true that some aspects of regular expressions are more compact than their equivalent regular grammars -- especially the Kleene star operator, which in BNF requires an additional non-terminal and thus an additional name -- there are also cases where the ability to name subexpressions makes regular grammars more compact. Many scanner generators, starting with Lex, allowed named subpatterns as another regular expression extension. Furthermore, it is possible (with some caveats) to add the Kleene star and other operators to BNF as macros, and many parser generators do so. So there is a certain convergence of notation.
As you say, one difference between scanners and parsers is that the scanner generally makes no attempt to parse the substructure of a lexeme. But it is not true that no lexeme has substructure, and these substructures often do need to be analysed. The most notorious example is probably floating point numbers, which have to be analysed into a multiplier and an exponent, and the multiplier also analysed into an integer part and a fractional part. This analysis is commonly done using primitive functions available in the scanner implementation language (such as strtod for C scanners), but that does mean a second lexical scan. (Using the built-in avoids the considerable inconvenience of writing a mathematically correct string-to-internal converter, which is a much more difficult problem than it first appears. Rolling your own number converter is not recommended.)
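As a sketch of that division of labour, with Python's float() standing in for C's strtod:

import re

# The scanner recognizes the lexeme with a regular expression; the
# conversion to a value is effectively a second scan of the same text.
FLOAT = re.compile(r"\d+\.\d*(?:[eE][+-]?\d+)?")

m = FLOAT.match("3.25e2 + x")
lexeme = m.group()       # '3.25e2': the token the scanner emits
value = float(lexeme)    # 325.0: the second, value-producing scan
print(lexeme, value)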
Other lexemes with internal structure include string literals (which may contain escape sequences) and a large variety of more complex lexemes available in certain languages (dates and times, IP addresses, HTML tags, etc., etc.). All of these things tend to blur the boundary between scanning and parsing. Which is fine, because, as I said, the boundary is situational and not restrained by any absolute law of nature.
Still, it is certainly the case that many lexemes do not have any interesting internal structure, and furthermore that while it is easy to rewrite a regular expression as a regular grammar, it is considerably harder to rewrite it as an unambiguous, deterministic regular grammar, much less an LALR(1) regular grammar. (This is one of the reasons scannerless parsing is often associated with PEG, but it can also be solved with GLL or GLR parsers, at a slight loss of efficiency.)

Exactly which part of parsing should be done by the lexical analyser?

Does there exist a formal definition of the purpose, or at least a clear best practice of usage, of lexical analysis (lexer) during/before parsing?
I know that the purpose of a lexer is to transform a stream of characters to a stream of tokens, but can't it happen that in some (context-free) languages the intended notion of a "token" could nonetheless depend on the context and "tokens" could be hard to identify without complete parsing?
There seems to be nothing obviously wrong with having a lexer that transforms every input character into a token and lets the parser do the rest. But would it be acceptable to have a lexer that differentiates, for example, between a "unary minus" and a usual binary minus, instead of leaving this to the parser?
Are there any precise rules to follow when deciding what shall be done by the lexer and what shall be left to the parser?
Does there exist a formal definition of the purpose [of a lexical analyzer]?
No. Lexical analyzers are part of the world of practical programming, for which formal models are useful but not definitive. A program which purports to do something should do that thing, of course, but "lexically analyze my programming language" is not a sufficiently precise requirements statement.
… or a clear best practice of usage
As above, the lexical analyzer should do what it purports to do. It should also not attempt to do anything else. Code duplication should be avoided. Ideally, the code should be verifiable.
These best practices motivate the use of a mature and well-documented scanner framework whose input language doubles as a description of the lexical grammar being analyzed. However, practical considerations based on the idiosyncrasies of particular programming languages normally result in deviations from this ideal.
There seems to be nothing obviously wrong with having a lexer that transforms every input character into a token…
In that case, the lexical analyzer would be redundant; the parser could simply use the input stream as is. This is called "scannerless parsing", and it has its advocates. I'm not one of them, so I won't enter into a discussion of pros and cons. If you're interested, you could start with the Wikipedia article and follow its links. If this style fits your problem domain, go for it.
can't it happen that in some (context-free) languages the intended notion of a "token" could nonetheless depend on the context?
Sure. A classic example is found in EcmaScript regular expression "literals", which need to be lexically analyzed with a completely different scanner. EcmaScript 6 also defines string template literals, which require a separate scanning environment. This could motivate scannerless processing, but it can also be implemented with an LR(1) parser with lexical feedback, in which the reduce action of particular marker non-terminals causes a switch to a different scanner.
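Schematically, that reduce-triggered switch might look like the following (a sketch with invented names, not the actual EcmaScript grammar):

import re

# Two scanners; the parser switches between them by calling on_reduce()
# from the reduce actions of (typically empty) marker non-terminals.
SCANNERS = {
    "code":     re.compile(r"\s*([A-Za-z_]\w*|\d+|`|\S)"),
    "template": re.compile(r"(\$\{|`|[^`$]+|\$)"),
}

class Lexer:
    def __init__(self, src):
        self.src, self.pos, self.mode = src, 0, "code"

    def on_reduce(self, marker):
        if marker == "template-start":
            self.mode = "template"
        elif marker == "template-end":
            self.mode = "code"

    def next_token(self):
        m = SCANNERS[self.mode].match(self.src, self.pos)
        self.pos = m.end()
        return m.group(1)

lx = Lexer("`a${x}`")
print(lx.next_token())         # '`': parser now reduces template-start
lx.on_reduce("template-start")
print(lx.next_token())         # 'a': raw template text, not an identifier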
But would it be acceptable to have a lexer that differentiates, for example, between a "unary minus" and a usual binary minus, instead of leaving this to the parser?
Anything is acceptable if it works, but that particular example strikes me as not particularly useful. LR (and even LL) expression parsers do not require any aid from the lexical scanner to show the context of a minus sign. (Naïve operator precedence grammars do require such assistance, but a more carefully thought out op-prec architecture wouldn't. However, the existence of LALR parser generators more or less obviates the need for op-prec parsers.)
Generally speaking, for the lexer to be able to identify syntactic context, it needs to duplicate the analysis being done by the parser, thus violating one of the basic best practices of code development ("don't duplicate functionality"). Nonetheless, it can occasionally be useful, so I wouldn't go so far as to advocate an absolute ban. For example, many parsers for yacc/bison-like production rules compensate for the fact that a naïve grammar is LALR(2) by specially marking ID tokens which are immediately followed by a colon.
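For instance, that trick can be sketched like this (details invented): the scanner folds a following colon into the identifier token, supplying exactly the extra lookahead that an LALR(1) parser lacks.

import re

def tokenize(src):
    # an identifier followed by ':' becomes a distinct ID_COLON token,
    # so the parser sees the start of a rule definition one token sooner
    for m in re.finditer(r"[A-Za-z_]\w*\s*:|[A-Za-z_]\w*|\S", src):
        text = m.group()
        if text.endswith(":"):
            yield ("ID_COLON", text[:-1].rstrip())
        elif text[0].isalpha() or text[0] == "_":
            yield ("ID", text)
        else:
            yield ("OTHER", text)

print(list(tokenize("expr : expr PLUS term")))
# [('ID_COLON', 'expr'), ('ID', 'expr'), ('ID', 'PLUS'), ('ID', 'term')]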
Another example, again drawn from EcmaScript, is efficient handling of automatic semicolon insertion (ASI), which can be done using a lookup table whose keys are 2-tuples of consecutive tokens. Similarly, Python's whitespace-aware syntax is conveniently handled by assistance from the lexical scanner, which must be able to understand when indentation is relevant (not inside parentheses or braces, for example).

Can regular expressions be used to express all kinds of lexical parser requirements?

I've been learning compiler principles recently. I notice that all the examples from textbooks describe a language's lexical analysis using "lex" or "flex" with regular expressions to show how to analyze input source files.
Does this indicate that all known programming languages can use a type-3 grammar for lexical analysis? Or is it just that textbooks use simple samples to show the ideas?
Most lexemes in most languages can be identified with regular expressions, but there are exceptions. (When it comes to parsing computer languages, there are always exceptions. Without exception.)
For example, you cannot match a C++ raw string literal with a regex. You cannot tell without syntactic analysis whether /= in a Javascript program is the single lexeme used to indicate divide-and-assign, or whether it is the start of a regular expression which matches a string starting with =. Languages which allow nested comments (unlike C) require something a bit more powerful.
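Nested comments make the limitation easy to see: recognizing them requires a counter, which is precisely what a finite automaton lacks. A sketch (mine), using (* and *) as the delimiters:

def skip_nested_comment(src, i):
    # src[i:] starts with '(*'; return the index just past the
    # matching '*)', honouring nesting
    depth = 0
    while i < len(src):
        if src.startswith("(*", i):
            depth += 1
            i += 2
        elif src.startswith("*)", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise SyntaxError("unterminated comment")

print(skip_nested_comment("(* a (* nested *) b *) rest", 0))   # 22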
But it's enormously easier to write a few regexes than to write a full state machine in raw C, so there is a lot of motivation to find ways of bending flex to your will for a few exceptional cases. And flex cooperates to a certain extent by providing features which allow you to escape from the regex straightjacket when necessary. In an advanced class on lexical analysis you might learn more about these features.
