Using midaction rules in Lemon to interpret "let" expression - parsing

I'm trying to write a "toy" interpreter using Flex + Lemon that supports a very basic "let" syntax where a variable X is temporarily bound to an expression. For example, "letx 3 + 4 in x + 8" should evaluate to 15.
In essence, what I'd "like" the rule to say is:
expr(E) ::= LETX expr(N) IN expr(O). {
environment->X = N;
E = O;
}
But that won't work since O is evaluated before the X = N assignment is made.
I understand that the usual solution for this would be a mid-rule action. Lemon doesn't explicitly support this, but I've read elsewhere that would just be syntactic sugar in any event.
So I've tried to put together a mid-rule action that would do my assignment of X = N before interpreting O:
midruleaction ::= /* mid rule */. { environment->X = N; }
expr(E) ::= LETX expr(N) IN midruleaction expr(O). { E = O; }
But that won't work because there's no way for the midruleaction rule to access N, or at least none I can see in the lemon docs/examples.
I think I'm missing something here. I know I could build up a tree and then walk it in a second pass. And I might end up doing that, but I'd like to understand how to solve this more directly first.
Any suggestions?

It's really not a very scalable solution to evaluate immediately in a parser. See below.
It is true that mid-rule actions are (mostly) syntactic sugar. However, in most cases they are not syntactic sugar for "markers" (non-terminals with empty right-hand sides) but rather for non-terminals representing production prefixes. For example, you could write your letx rule like this:
expr(E) ::= letx_prefix IN expr(O). { E = O; }
letx_prefix ::= LETX expr(N). { environment->X = N; }
Or you could do this:
expr(E) ::= LETX assigned_expr IN expr(O). { E = O; }
assigned_expr ::= expr(N). { environment->X = N; }
The first one is the prefix desugaring; the second one is the one I'd use because I feel that it separates concerns better. The important point is that the environment->X = N; action requires access to the semantic values of the prefix of the RHS, so it must be part of a prefix rule (which includes at least the symbols whose semantic values are requires), rather than a marker, which has access to no semantic values at all.
Having said all that, immediate evaluation during parsing is a very limited strategy. It cannot cope with a large range of constructs which require deferred evaluation, such as loops and function definitions. It cannot cleanly cope with constructs which may suppress evaluation, such as conditionals and short-circuit operators. (These can be handled using MRAs and a stateful environment which contains an evaluation-suppressed flag, but that's very ugly.)
Another problem is that syntactically incorrect expressions may be partially evaluated before the syntax error is discovered, and it may not be immediately obvious to the user which parts of the expression have and have not been evaluated.
On the whole, you're better off building an easily-evaluated AST during the parse, and evaluating the AST when the parse successfully completes.

Related

Convert Arithmetic Expression Tree to string without unnecessary parenthesis?

Assume I build a Abstract Syntax Tree of simple arithmetic operators, like Div(left,right), Add(left,right), Prod(left,right),Sum(left,right), Sub(left,right).
However when I want to convert the AST to string, I found it is hard to remove those unnecessary parathesis.
Notice the output string should follow the normal math operator precedence.
Examples:
Prod(Prod(1,2),Prod(2,3)) let's denote this as ((1*2)*(2,3))
make it to string, it should be 1*2*2*3
more examples:
(((2*3)*(3/5))-4) ==> 2*3*3/5 - 4
(((2-3)*((3*7)/(1*5))-4) ==> (2-3)*3*7/(1*5) - 4
(1/(2/3))/5 ==> 1/(2/3)/5
((1/2)/3))/5 ==> 1/2/3/5
((1-2)-3)-(4-6)+(1-3) ==> 1-2-3-(4-6)+1-3
I find the answer in this question.
Although the question is a little bit different from the link above, the algorithm still applies.
The rule is: if the children of the node has lower precedence, then a pair of parenthesis is needed.
If the operator of a node one of -, /, %, and if the right operand equals its parent node's precedence, it also needs parenthesis.
I give the pseudo-code (scala like code) blow:
def toString(e:Expression, parentPrecedence:Int = -1):String = {
e match {
case Sub(left2,right2) =>
val p = 10
val left = toString(left2, p)
val right = toString(right, p + 1) // +1 !!
val op = "-"
lazy val s2 = left :: right :: Nil mkString op
if (parentPrecedence > p )
s"($s2)"
else s"$s2"
//case Modulus and divide is similar to Sub except for p
case Sum(left2,right2) =>
val p = 10
val left = toString(left2, p)
val right = toString(right, p) //
val op = "-"
lazy val s2 = left :: right :: Nil mkString op
if (parentPrecedence > p )
s"($s2)"
else s"$s2"
//case Prod is similar to Sum
....
}
}
For a simple expression grammar, you can eliminate (most) redundant parentheses using operator precedence, essentially the same way that you parse the expression into an AST.
If you're looking at a node in an AST, all you need to do is to compare the precedence of the node's operator with the precedence of the operator for the argument, using the operator's associativity in the case that the precedences are equal. If the node's operator has higher precedence than an argument, the argument does not need to be surrounded by parentheses; otherwise it needs them. (The two arguments need to be examined independently.) If an argument is a literal or identifier, then of course no parentheses are necessary; this special case can easily be handled by making the precedence of such values infinite (or at least larger than any operator precedence).
However, your example includes another proposal for eliminating redundant parentheses, based on the mathematical associativity of the operator. Unfortunately, mathematical associativity is not always applicable in a computer program. If your expressions involving floating point numbers, for example, a+(b+c) and (a+b)+c might have very different values:
(gdb) p (1000000000000000000000.0 + -1000000000000000000000.0) + 2
$1 = 2
(gdb) p 1000000000000000000000.0 + (-1000000000000000000000.0 + 2)
$2 = 0
For this reason, it's pretty common for compilers to avoid rearranging the order of application of multiplication and addition, at least for floating point arithmetic, and also for integer arithmetic in the case of languages which check for integer overflow.
But if you do really want to rearrange based on mathematical associativity, you'll need an additional check during the walk of the AST; before checking precedence, you'll want to check whether the node you're visiting and its left argument use the same operator, where that operator is known to be mathematically associative. (This assumes that only operators which group to the left are mathematically associative. In the unlikely case that you have a mathematically associative operator which groups to the right, you'll want to check the visited node and its right-hand argument.)
If that condition is met, you can rotate the root of the AST, turning (for example) PROD(PROD(a,b),□)) into PROD(a,PROD(b,□)). That might lead to additional rotations in the case that a is also a PROD node.

Does a priority declaration disambiguate between alternative lexicals?

In my previous question, there was a priority > declaration in the example. It turned out not to matter because the solution there did not actually invoke priority but rather avoided it by making the alternatives disjoint. In this question, I'm asking whether priority can be used to select one lexical production over another. In the example below, the language of the production WordInitialDigit is intentionally a subset of that of WordAny. The production Word looks like it should disambiguate between the two properly, but the resulting parse tree has an ambiguity node at the top. Is a priority declaration able to decide between different lexical reductions, or does it require there to be a basis of common lexical elements? Or something else?
The example is contrived (there are no actions in the grammar), but the situations it arises from are not. For example, I'd like to use something like this for error recovery, where I can recognize a natural boundary for a unit of syntax and write a production for it. This generic production would be the last element in a priority chain; if it reduces, it means that there was no valid parse. More generally, I need to be able to select lexical elements based on syntactic context. I had hoped, since Rascal is scannerless, that this would be seamless. Perhaps it is, though I don't see it at the moment.
I'm on the unstable branch, version 0.10.0.201807050853.
EDIT: This question is not about > for defining an expression grammar. The documentation for priority declarations talks mostly about expressions, but the very first sentence provides what looks like a perfectly clear definition:
Priority declarations define a partial ordering between the productions within a single non-terminal.
So the example has two productions, an ordering declared between them, and yet the parser is still generating an ambiguity node in the clear presence of a disambiguation rule. So to put a finer point on my question, it looks like I don't know which of two situations pertains. Either (1) if this isn't supposed to work, then there's a defect in the language definition as documented, a deficiency in error reporting of the compiler, and a language design decision that's somewhere between counter-intuitive and user-hostile. Or (2) if this is supposed to work, there's a defect in the compiler and/or parser (presumably because the focus was initially on expressions) and at some point the example will pass its tests.
module ssce
import analysis::grammars::Ambiguity;
import ParseTree;
import IO;
import String;
lexical WordChar = [0-9A-Za-z] ;
lexical Digit = [0-9] ;
lexical WordInitialDigit = Digit WordChar* !>> WordChar;
lexical WordAny = WordChar+ !>> WordChar;
syntax Word =
WordInitialDigit
> WordAny
;
test bool WordInitialDigit_0() = parseAccept( #Word, "4foo" );
test bool WordInitialDigit_1() = parseAccept( #WordInitialDigit, "4foo" );
test bool WordInitialDigit_2() = parseAccept( #WordAny, "4foo" );
bool verbose = false;
bool parseAccept( type[&T<:Tree] begin, str input )
{
try
{
parse(begin, input, allowAmbiguity=false);
}
catch ParseError(loc _):
{
return false;
}
catch Ambiguity(loc l, str a, str b):
{
if (verbose)
{
println("[Ambiguity] #<a>, \"<b>\"");
Tree tt = parse(begin, input, allowAmbiguity=true) ;
iprintln(tt);
list[Message] m = diagnose(tt) ;
println( ToString(m) );
}
fail;
}
return true;
}
bool parseReject( type[&T<:Tree] begin, str input )
{
try
{
parse(begin, input, allowAmbiguity=false);
}
catch ParseError(loc _):
{
return true;
}
return false;
}
str ToString( list[Message] msgs ) =
( ToString( msgs[0] ) | it + "\n" + ToString(m) | m <- msgs[1..] );
str ToString( Message msg)
{
switch(msg)
{
case error(str s, loc _): return "error: " + s;
case warning(str s, loc _): return "warning: " + s;
case info(str s, loc _): return "info: " + s;
}
return "";
}
Excellent questions.
TL;DR:
the rule priority mechanism is not capable of an algorithmic ordering of a non-terminal's alternatives. Although some kind of partial order is involved in the additional grammatical constraints that a priority declaration generates, there is no "trying" one rule first, before the other. So it simply can't do that. The good news is that the priority mechanism has a formal semantics independent of any parsing algorithm, it's just defined in terms of context-free grammar rules and reduction traces.
using ambiguous rules for error recovery or "robust parsing", is a good idea. However, if there are too many such rules, the parser will eventually start showing quadratic or even cubic behavior, and tree building after parsing might even have higher polynomials. I believe the generated parser algorithm should have a (parameterized) mode for error recovery rather then expressing this at the grammar level.
Accepting ambiguity at parse time, and filtering/choosing trees after parsing is the recommended way to go.
All this talk of "ordering" in the documentation is misleading. Disambiguation is minefield of confusing terminology. For now, I recommend this SLE paper which has some definitions: https://homepages.cwi.nl/~jurgenv/papers/SLE2013-1.pdf
Details
priority mechanism not capable of choosing among alternatives
The use of the > operator and left, right generates a partial order between mutually recursive rules, such as found in expression languages, and limited to specific item positions in each rule: namely the left-most and right-most recursive positions which overlap. Rules which are lower in the hierarchy are not allowed to be grammatically expanded as "children" of rules which are higher in the hierarchy. So in E "*" E, neither E may be expaned to E "+" E if E "*" E > E "+" E.
The additional constraints do not choose for any E which alternative to try first. No they simply disallow certain expansions, assuming the other expansion is still valid and thus the ambiguity is solved.
The reason for the limitation at specific positions is that for these positions the parser generator can "prove" that they will generate ambiguity, and thus filtering one of the two alternatives by disallowing certain nestings will not result in additional parse errors. (consider a rule for array indexing: E "[" E "]" which should not have additional constraints for the second E. This is a so-called "syntax-safe" disambiguation mechanism.
All and all it is a pretty weak mechanism algorithmically, and specifically tailored for mutually recursive combinator/expression-like languages. The end-goal of the mechanism is to make sure we use have to use only 1 non-terminal for the entire expression language, and the parse trees looking very much akin in shape to abstract syntax trees. Rascal inherited all these considerations from SDF, via SDF2, by the way.
Current implementations actually "factor" the grammar or the parse table in some fashion invisibly to get the same effect, as-if somebody would have factored the grammar completely; however these implementations under-the-hood are very specific to the parsing algorithm in question. the GLR version is quite different from the GLL version, which again is quite different from the DataDependent version.
Post-parse filtering
Of course any tree, including ambiguous parse forests produced by the parser, can be manipulated by Rascal programs using pattern matching, visit, etc. You could write any algorithm to remove the trees you want. However, this requires the entire forest to be constructed first. It's possible and often fast enough, but there is a faster alternative.
Since the tree is built in a bottom-up fashion from the parse graph after parsing, we can also apply "rewrite rules" during the construction of the tree, and remove certain alternatives.
For example:
Tree amb({Tree a, *Tree others}) = amb(others) when weDoNotWant(a);
Tree amb({Tree a}) = a;
This first rule would match on the ambiguity cluster for all trees, and remove all alternatives which weDoNotWant. The second rule removes the cluster if only one alternative is left and let's the last tree "win".
If you want to choose among alternatives:
Tree amb({Tree a, Tree b, *Tree others}) = amb({a, others} when weFindPeferable(a, b);
If you don't want to use Tree but a more specific non-terminal like Statement that should also work.
This example module uses #prefer tags in syntax definitions to "prefer" rules which have been tagged over the other rules, as post-parse rewrite rules:
https://github.com/usethesource/rascal/blob/master/src/org/rascalmpl/library/lang/sdf2/filters/PreferAvoid.rsc
Hacking around with additional lexical constraints
Next to priority disambiguation and post-parse rewriting, we still have the lexical level disambiguation mechanisms in the toolkit:
`NT \ Keywords" - rejecting finite (keyword) languages from a non-terminals
CC << NT, NT >> CC, CC !<< NT, NT !>> CC follow and preceede restrictions (where CC stands for character-class and NT for non-terminal)
Solving other kinds of ambiguity apart from the operator precedence stuff can be tried with these, in particular if the length of different sub-sentences is shorter/longer between the different alternatives, !>> can do the "maximal munch" or "longest match" thing. So I was thinking out loud:
lexical C = A? B?;
where A is one lexical alternative and B is the other. With the proper !>> restrictions on A and !<< restrictions on B the grammar might be tricked into always wanting to put all characters in A, unless they don't fit into A as a language, in which case they would default to B.
The obvious/annoying advice
Think harder about an unambiguous and simpler grammar.
Sometimes this means to abstract and allow more sentences in the grammar, avoiding use of the grammar for "type checking" the tree. It's often better to over-approximate the syntax of the language and then use (static) semantic analysis (over simpler trees) to get what you want, rather then staring at a complex ambiguous grammar.
A typical example: C blocks with declarations only at the start are much harder to define unambiguously then C blocks where declarations are allowed everywhere. And for a C90 mode, all you have to do is flag declarations which are not at the start of a block.
This particular example
lexical WordChar = [0-9A-Za-z] ;
lexical Digit = [0-9] ;
lexical WordInitialDigit = Digit WordChar* !>> WordChar;
lexical WordAny = WordChar+ !>> WordChar;
syntax Word =
WordInitialDigit
| [0-9] !<< WordAny // this would help!
;
wrap up
Great question, thanks for the patience. Hope this helps!
The > disambiguation mechanism is for recursive definitions, like for example a expression grammar.
So it's to solve the following ambiguity:
syntax E
= [0-9]+
| E "+" E
| E "-" E
;
The string 1 + 3 - 4 can not be parsed as 1 + (3 - 4) or (1 + 3) - 4.
The > gives an order to this grammar, which production should be at the top of the tree.
layout L = " "*;
syntax E
= [0-9]+
| E "+" E
> E "-" E
;
this now only allows the (1 + 3) - 4 tree.
To finish this story, how about 1 + 1 + 1? That could be 1 + (1 + 1) or (1 + 1) + 1.
This is what we have left, right, and non-assoc for. They define how recursion in the same production should be handled.
syntax E
= [0-9]+
| left E "+" E
> left E "-" E
;
will now enforce: 1 + (1 + 1).
When you take an operator precendence table, like for example this c operator precedance table you can almost literally copy them.
note that these two disambiguation features are not exactly opposite to each other. the first ambiguitity could also have been solved by putting both productions in a left group like this:
syntax E
= [0-9]+
| left (
E "+" E
| E "-" E
)
;
As the left side of the tree is favored, you will now get a different tree 1 + (3 - 4). So it makes a difference, but it all depends on what you want.
More details can be found in the tutor pages on disambiguation

Left recursion parsing

Description:
While reading Compiler Design in C book I came across the following rules to describe a context-free grammar:
a grammar that recognizes a list of one or more statements, each of
which is an arithmetic expression followed by a semicolon. Statements are made up of a
series of semicolon-delimited expressions, each comprising a series of numbers
separated either by asterisks (for multiplication) or plus signs (for addition).
And here is the grammar:
1. statements ::= expression;
2. | expression; statements
3. expression ::= expression + term
4. | term
5. term ::= term * factor
6. | factor
7. factor ::= number
8. | (expression)
The book states that this recursive grammar has a major problem. The right hand side of several productions appear on the left-hand side as in production 3 (And this property is called left recursion) and certain parsers such as recursive-descent parser can't handle left-recursion productions. They just loop forever.
You can understand the problem by considering how the parser decides to apply a particular production when it is replacing a non-terminal that has more than one right hand side. The simple case is evident in Productions 7 and 8. The parser can choose which production to apply when it's expanding a factor by looking at the next input symbol. If this symbol is a number, then the compiler applies Production 7 and replaces the factor with a number. If the next input symbol was an open parenthesis, the parser
would use Production 8. The choice between Productions 5 and 6 cannot be solved in this way, however. In the case of Production 6, the right-hand side of term starts with a factor which, in tum, starts with either a number or left parenthesis. Consequently, the
parser would like to apply Production 6 when a term is being replaced and the next input symbol is a number or left parenthesis. Production 5-the other right-hand side-starts with a term, which can start with a factor, which can start with a number or left parenthesis, and these are the same symbols that were used to choose Production 6.
Question:
That second quote from the book got me completely lost. So by using an example of some statements as (for example) 5 + (7*4) + 14:
What's the difference between factor and term? using the same example
Why can't a recursive-descent parser handle left-recursion productions? (Explain second quote).
What's the difference between factor and term? using the same example
I am not giving the same example as it won't give you clear picture of what you have doubt about!
Given,
term ::= term * factor | factor
factor ::= number | (expression)
Now,suppose if I ask you to find the factors and terms in the expression 2*3*4.
Now,multiplication being left associative, will be evaluated as :-
(2*3)*4
As you can see, here (2*3) is the term and factor is 4(a number). Similarly you can extend this approach upto any level to draw the idea about term.
As per given grammar, if there's a multiplication chain in the given expression, then its sub-part,leaving a single factor, is a term ,which in turn yields another sub-part---the another term, leaving another single factor and so on. This is how expressions are evaluated.
Why can't a recursive-descent parser handle left-recursion productions? (Explain second quote).
Your second statement is quite clear in its essence. A recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures (or a non-recursive equivalent) where each such procedure usually implements one of the productions of the grammar.
It is said so because it's clear that recursive descent parser will go into infinite loop if the non-terminal keeps on expanding into itself.
Similarly, talking about a recursive descent parser,even with backtracking---When we try to expand a non-terminal, we may eventually find ourselves again trying to expand the same non-terminal without having consumed any input.
A-> Ab
Here,while expanding the non-terminal A can be kept on expanding into
A-> AAb -> AAAb -> ... -> infinite loop of A.
Hence, we prevent left-recursive productions while working with recursive-descent parsers.
The rule factor matches the string "1*3", the rule term does not (though it would match "(1*3)". In essence each rule represents one level of precedence. expression contains the operators with the lowest precedence, factor the second lowest and term the highest. If you're in term and you want to use an operator with lower precedence, you need to add parentheses.
If you implement a recursive descent parser using recursive functions, a rule like a ::= b "*" c | d might be implemented like this:
// Takes the entire input string and the index at which we currently are
// Returns the index after the rule was matched or throws an exception
// if the rule failed
parse_a(input, index) {
try {
after_b = parse_b(input, index)
after_star = parse_string("*", input, after_b)
after_c = parse_c(input, after_star)
return after_c
} catch(ParseFailure) {
// If one of the rules b, "*" or c did not match, try d instead
return parse_d(input, index)
}
}
Something like this would work fine (in practice you might not actually want to use recursive functions, but the approach you'd use instead would still behave similarly). Now, let's consider the left-recursive rule a ::= a "*" b | c instead:
parse_a(input, index) {
try {
after_a = parse_a(input, index)
after_star = parse_string("*", input, after_a)
after_b = parse_c(input, after_star)
return after_b
} catch(ParseFailure) {
// If one of the rules a, "*" or b did not match, try c instead
return parse_c(input, index)
}
}
Now the first thing that the function parse_a does is to call itself again at the same index. This recursive call will again call itself. And this will continue ad infinitum, or rather until the stack overflows and the whole program comes crashing down. If we use a more efficient approach instead of recursive functions, we'll actually get an infinite loop rather than a stack overflow. Either way we don't get the result we want.

How to adapt this LL(1) parser to a LL(k) parser?

In the appendices of the Dragon-book, a LL(1) front end was given as a example. I think it is very helpful. However, I find out that for the context free grammar below, a at least LL(2) parser was needed instead.
statement : variable ':=' expression
| functionCall
functionCall : ID'(' (expression ( ',' expression )*)? ')'
;
variable : ID
| ID'.'variable
| ID '[' expression ']'
;
How could I adapt the lexer for LL(1) parser to support k look ahead tokens?
Are there some elegant ways?
I know I can add some buffers for tokens. I'd like to discuss some details of programming.
this is the Parser:
class Parser
{
private Lexer lex;
private Token look;
public Parser(Lexer l)
{
lex = l;
move();
}
private void move()
{
look = lex.scan();
}
}
and the Lexer.scan() returns the next token from the stream.
In effect, you need to buffer k lookahead tokens in order to do LL(k) parsing. If k is 2, then you just need to extend your current method, which buffers one token in look, using another private member look2 or some such. For larger k, you could use a ring buffer.
In practice, you don't need the full lookahead all the time. Most of the time, one-token lookahead is sufficient. You should structure the code as a decision tree, where future tokens are only consulted if necessary to resolve ambiguity. (It's often useful to provide a special token type, "unknown", which can be assigned to the buffered token list to indicate that the lookahead hasn't reached that point yet. Alternatively, you can just always maintain k tokens of lookahead; for handbuilt parsers, that can be simpler.)
Alternatively, you can use a fallback structure where you simply try one alternative and if that doesn't work, instead of reporting a syntax error, restore the state of the parser and lexer to the next alternative. In this model, the lexer takes as an explicit argument the current input buffer position, and the input buffer needs to be rewindable. However, you can use a lookahead buffer to effectively memoize the lexer function, which can avoid rewinding and rescanning. (Scanning is usually fast enough that occasional rescans don't matter, so you might want to put off adding code complexity until your profiling indicates that it would be useful.)
Two notes:
1) I'm skeptical about the rule:
functionCall : ID'(' (expression ( ',' expression )*)* ')'
;
That would allow, for example:
function(a[3], b[2] c[x] d[y], e.foo)
which doesn't look right to me. Normally, you'd mark the contents of the () as optional instead of repeatable, eg. using an optional marker ? instead of the second Kleene star *:
functionCall : ID'(' (expression ( ',' expression )*)? ')'
;
2) In my opinion, you really should consider using bottom-up parsing for an expression language, either a generated LR(1) parser or a hand-built Pratt parser. LL(1) is rarely adequate. Of course, if you're using a parser generator, you can use tools like ANTLR which effectively implement LL(∞); that will take care of the lookahead for you.

Defining Function Signatures in a Simple Language Grammar

I am currently learning how to create a simple expression language using Irony. I'm having a little bit of trouble figuring out the best way to define function signatures, and determining whose responsibility it is to validate the input to those functions.
So far, I have a simple grammar that defines the basic elements of my language. This includes a handful of binary operators, parentheses, numbers, identifiers, and function calls. The BNF for my grammar looks something like this:
<expression> ::= <number> | <parenexp> | <binexp> | <fncall> | <identifier>
<parenexp> ::= ( <expression> )
<fncall> ::= <identifier> ( <argumentlist> )
<binexp> ::= <expression> <binop> <expression>
<binop> ::= + - * / %
... the rest of the grammar definition
Using the Irony parser, I am able to validate the syntax of various input strings to make sure they conform to this grammar:
x + y / z * AVG(a + b, p) -> Valid Syntax
x +/ AVG(x -> Invalid Syntax
All that is well and good, but now I want to go a step further and define the available functions, along with the number of parameters that each function requires. So for example, I want to have a function FOO that accepts one parameter and BAR that accepts two parameters:
FOO(a + b) * BAR(x + y, p + q) -> Valid
FOO(a + b, 13) -> Invalid
When the second statement is parsed, I'd like to be able to output an error message that is aware of the expected input for this function:
Too many arguments specified for function 'FOO'
I don't actually need to evaluate any of these statements, only validate the syntax of the statements and determine if they are valid expressions or not.
How exactly should I be doing this? I know that technically I could simply add the functions to the grammar like so:
<foofncall> ::= FOO( <expression> )
<barfncall> ::= BAR( <expression>, <expression> )
But something about this doesn't feel quite right. To me it seems like the grammar should only define a generic function call, and not every function available to the language.
How is this typically accomplished in other languages?
What are the components called that should handle the responsibilities of analyzing the basic syntax of the language grammar versus the more specific elements like function definitions? Should both responsibilities be handled by the same component?
While you can do typechecking in directly in the grammar so its enforced in the parser, its generally a bad idea to do so. Instead, the parser should just parse the basic syntax, and separate typechecking code should be used for typechecking.
In the normal case of a compiler, the parser just produces an abstract syntax tree or some equivalent representation of the program. Then, a typechecking pass is run over the AST that ensures all types match appropriately -- ensures that functions have the right number of arguments and those arguments have the right type, as well as ensuring that variables have the right type for what is assigned to them and how they are used.
Besides being generally simpler, this usually allows you to give better error messages -- instead of just 'Invalid', you can say 'too many arguments to FOO' or what have you.

Resources