Parser - Segmentation fault when calling yytext - parsing

My parser is recognizing the grammar and indicating the correct error line using yylineno. I want to print the symbol which caused the error.
int yyerror(string s)
{
    extern int yylineno;  // defined and maintained in lex.yy.c
    extern char *yytext;  // defined and maintained in lex.yy.c
    cerr << "error: " << s << " -> " << yytext << " # line " << yylineno << endl;
    //exit(1);
}
I get this error when I write something not acceptable by the grammar:
error: syntax error -> Segmentation fault
Am I not supposed to use yytext? If not, what variable contains the symbol that caused the syntax error?
Thanks

Depending on the version of lex you are using, yytext may be an array or may be a pointer. Since it is defined in a different compilation unit, if it is an array and you declare it as a pointer, you won't see any error messages from the compiler or linker (linkers generally don't do type checking). Instead, your code will treat the first few characters of the array as a pointer, try to dereference it, and probably crash.
If you are using flex, you can add a %pointer declaration to the first section of your .l file to ensure that yytext is a pointer and not an array.
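To illustrate the mismatch, here is a minimal sketch (assuming flex's default pointer definition; check the generated lex.yy.c to see which form your scanner actually uses):
/* The extern declaration must match how the scanner defines yytext. */
extern char *yytext;      /* correct for flex, or for lex with %pointer */
/* extern char yytext[];     correct instead when yytext is defined as an array */
/* If the definition is an array but you declare a pointer, yyerror() reads the
   first bytes of the matched text as an address, and dereferencing it crashes. */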

Are you using lex or flex? If you're using lex, yytext is a char[], not a char*.
EDIT: If you aren't using flex, you should be; it has been superior in every way from the moment of its appearance nearly 30 years ago. lex was obsoleted on that day.

Related

Lua - why is string after function call allowed?

I'm trying to implement a simple C++ function which checks the syntax of a Lua script. For that I'm using Lua's compiler function luaL_loadbufferx() and checking its return value afterwards.
Recently I ran into a problem: code that I thought should be marked invalid was not detected, and instead the script failed later at runtime (e.g. in lua_pcall()).
Example Lua code (can be tested on official Lua demo):
function myfunc()
return "everyone"
end
-- Examples of unexpected behaviour:
-- The following lines pass the compile time check without errors.
print("Hello " .. myfunc() "!") -- Runtime error: attempt to call a string value
print("Hello " .. myfunc() {1,2,3}) -- Runtime error: attempt to call a string value
-- Other examples:
-- The following lines contain examples of invalid syntax, which IS detected by compiler.
print("Hello " myfunc() .. "!") -- Compile error: ')' expected near 'myfunc'
print("Hello " .. myfunc() 5) -- Compile error: ')' expected near '5'
print("Hello " .. myfunc() .. ) -- Compile error: unexpected symbol near ')'
The goal is obviously to catch all syntax errors at compile time. So my questions are:
What exactly is meant by calling a string value?
Why is this syntax allowed in the first place? Is it some Lua feature I'm unaware of, or is luaL_loadbufferx() faulty in this particular example?
Is it possible to detect such errors by any other method without running it? Unfortunately, my function doesn't have access to global variables at compile time, so I can't just run the code directly via lua_pcall().
Note: I'm using Lua version 5.3.4 (manual here).
Thank you very much for your help.
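For context, the kind of compile-only check described above might look roughly like the sketch below. It only illustrates the luaL_loadbufferx() usage pattern; the wrapper name check_lua_syntax and the chunk name "=check" are made up for the example, not taken from the question.
#include <string>
#include <iostream>
#include <lua.hpp>   // Lua 5.3 C API (lua.h, lauxlib.h, lualib.h)

// Compile-only syntax check: returns true if the chunk compiles, otherwise
// fills 'err' with the compiler's message. Nothing is executed.
static bool check_lua_syntax(const std::string& src, std::string& err)
{
    lua_State* L = luaL_newstate();                      // no libraries needed just to compile
    int rc = luaL_loadbufferx(L, src.data(), src.size(),
                              "=check", "t");            // "t" = accept text chunks only
    if (rc != LUA_OK)
        err = lua_tostring(L, -1);                       // error message is pushed on the stack
    lua_close(L);
    return rc == LUA_OK;
}

int main()
{
    std::string msg;
    // Compiles cleanly even though it fails at runtime, as explained in the answers below.
    if (!check_lua_syntax("print('Hello ' .. myfunc() '!')", msg))
        std::cerr << "syntax error: " << msg << '\n';
    else
        std::cout << "chunk compiles\n";
}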
Both myfunc() "!" and myfunc(){1,2,3} are valid Lua expressions.
Lua allows calls of the form exp string. See functioncall and prefixexp in the Syntax of Lua.
So myfunc() "!" is a valid function call that calls whatever myfunc returns and call it with the string "!".
The same thing happens for a call of the form exp table-literal.
Another approach is to change the string metatable, making a call to a string valid.
local mt = getmetatable ""
mt.__call = function (self, args) return self .. args end
print(("x") "y") -- outputs `xy`
Now those valid syntax calls to a string will result in string concatenation instead of runtime errors.
I'm writing an answer to my own question just in case anyone else stumbles upon a similar problem in the future and is also looking for a solution.
Manual
The Lua manual (in section 3.4.10 – Function Calls) states that there are three different ways of providing arguments to a Lua function.
Arguments have the following syntax: args ::= ‘(’ [explist] ‘)’
args ::= tableconstructor
args ::= LiteralString
All argument expressions are evaluated before the call. A call of the form f{fields} is syntactic sugar for f({fields}); that is, the argument list is a single new table. A call of the form f'string' (or f"string" or f[[string]]) is syntactic sugar for f('string'); that is, the argument list is a single literal string.
Explanation
As lhf pointed out in his answer, both myfunc() "!" and myfunc() {1,2,3} are valid Lua expressions. That means the Lua compiler is doing nothing wrong, considering it doesn't know the function's return value at compile time.
The original example code given in the question:
print("Hello " .. myfunc() "!")
Could be then rewritten as:
print("Hello " .. (myfunc()) ("!"))
Which (when executed) translates to:
print("Hello " .. ("everyone") ("!"))
And thus resulting in the runtime error message attempt to call a string value (which could be rewritten as: the string everyone is not a function, so you can't call it).
Solution
As far as I understand, these two alternative ways of supplying arguments have no real benefit over the standard func(arg) syntax. That's why I ended up modifying the Lua parser files; the disadvantage of keeping this alternative syntax was too big. Here is what I've done (relevant for v5.3.4):
In the file lparser.c I searched for the function:
static void suffixedexp (LexState *ls, expdesc *v)
Inside this function I changed the case labels
case '(': case TK_STRING: case '{':
to just
case '(':
Warning! By doing this I have modified the Lua language, so as lhf stated in his comment, it can no longer be called pure Lua. If you are unsure whether it is exactly what you want, I can't recommend this approach.
With this slight modification the compiler detects the two above-mentioned alternative syntaxes as errors. Of course, I can no longer use them inside Lua scripts, but for my specific application that's just fine.
All I need to do is note this change somewhere so I can find it in case I upgrade Lua to a newer version.

How do you read the "<<" and ">>" symbols out loud?

I'm wondering if there is a standard way, if we are pronouncing typographical symbols out loud, for reading the << and >> symbols? This comes up for me when teaching first-time C++ students and discussing/fixing exactly what symbols need to be written in particular places.
The best answer should not be names such as "bitwise shift" or "insertion", because those refer to more specific C++ operators, as opposed to the context-free symbol itself (which is what we want here). In that sense, this question is not the same as questions such as this or this, none of whose answers satisfy this question.
Some comparative examples:
We can read #include <iostream> as "pound include bracket iostream bracket".
We can read int a, b, c; as "int a comma b comma c semicolon".
We can read if (a && b) c = 0; as "if open parenthesis a double ampersand b close parenthesis c equals zero semicolon".
So an equivalent question would be: How do we similarly read cout << "Hello";? At the current time in class we are referring to these symbols as "left arrow" and "right arrow", but if there is a more conventional phrasing I would prefer to use that.
Other equivalent ways of stating this question:
How do we typographically read <<?
What is the general name of the symbol <<, whether being used for bit-shifts, insertion, or overloaded for something entirely new?
If a student said, "Professor, I don't remember how to make an insertion operator; please tell me what symbol to type", then what is the best verbal response?
What is the best way to fill in this analogy? "For the multiplication operation we use an asterisk; for the division operation we use a forward-slash; for the insertion operation we use ____."
Saw this question through your comment on Slashdot. I suggest a simpler name for students that uses an already common understanding of the symbol. In the same way that + is called "plus" and - is (often) called "minus," you can call < by the name "less" or "less-than" and > by "greater" or "greater-than." This recalls math operations and symbols that are taught very early for most students and should be easy for them to remember. Plus, you can use the same name when discussing the comparison operators. So, you would read
std::cout << "Hello, world!" << std::endl;
as
S T D colon colon C out less less double-quote Hello comma world exclamation-point double-quote less less S T D colon colon end L semicolon.
Also,
#include <iostream>
as
pound include less I O stream greater
So, the answer to
"Professor, I don't remember how to make an insertion operator; please tell me what symbol to type."
is "Less less."
The more customary name "left/right angle bracket" should be taught at the same time, since it is the more common name, but "less/greater" is a good reminder of what the actual symbol is, I think.
Chevron is also a neat name, but a bit obscure in my opinion, not to mention the corporate affiliation.
A proposal: Taking the appearance of the insertion/extraction operators as similar to the Guillemet symbols, we might look to the Unicode description of those symbols. There they are described as "Left-pointing double angle quotation mark" and "Right-pointing double angle quotation mark" (link).
So perhaps we could be calling the symbols "double-left angle" and "double-right angle".
My comment was mistaken (Chrome's PDF Reader has a buggy "Find in File" feature that didn't give me all of the results at first).
Regarding the OP's specific question about the name of the operator regardless of context: there is no answer, because the ISO C++ specification does not name the operators outside of a use context (e.g. the + operator is named "addition", but only with number types; it is not named as such when used for string concatenation, for example). That is, the ISO C++ standard does not give operator tokens a specific name.
The section on Shift Operators (5.8) only defines and names them for integral/enum types, and the section on Overloaded Operators does not confer upon them a name.
Myself, if I were teaching C++ and explaining the <</>> operators I would say "the double-angle-bracket operator is used to denote bitshifts with integer types, and insertion/extraction with streams and strings". Or if I were being terse I'd overload the word and simply say "the bitshift operator is overloaded for streams to mean something completely different".
Regarding the secondary question (in the comment thread) about the name of the <</>> operators in the context of streams and strings, the C++14 ISO specification (final working draft: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4296.pdf) does refer to them as "extractors and inserters":
21.4.8.9 Inserters and extractors
template<class charT, class traits, class Allocator>
  basic_istream<charT,traits>&
    operator>>(basic_istream<charT,traits>& is,
               basic_string<charT,traits,Allocator>& str);
(and the rest of the >> operator overload definitions follow)
This is further expanded upon in 27.7.2.2.2:
27.7.2.2.2 Arithmetic extractors
operator>>(unsigned short& val);
operator>>(unsigned int& val);
operator>>(long& val);
(and so on...)
cout << "string" << endl;// I really just say "send string to see out. Add end line."
i++; // i plus plus
auto x = class.func() // auto x equal class dot func
10 - ( i %4) * x; // ten minus the quantity i mod four times x
stdout // stud-out
stderr // stud-err
argc // arg see
argv // arg vee
char* // char pointer
&f // address of f
Just because it's an "extraction" or an "insertion" operator does not mean that is the OPERATION.
The operation is "input" and "output".
They are stream operators.
The natural label would be: c-out stream output double-quote Hello world exclamation double-quote stream output endline.
This is the OPERATION you are doing (the verb).
What the ARM calls the operator is irrelevant, in that it is a systemic way of looking at things, and we are trying to help humans understand things instead.

DParser - warning: trying to write code to binary file

I have a large grammar written for DParser and using the Python binding. When I first run the parser, and DParser generates its internal tables, I get a number of warnings like these:
warning: trying to write code to binary file
warning: trying to write code to binary file
warning: trying to write code to binary file
I am not sure what the cause or source of these warnings is. The only thing I could find was in the DParser source file "write_tables.c":
write_code(FILE *fp, Grammar *g, Rule *r, char *code,
           char *fname, int line, char *pathname)
{
    char *c;
    if (!fp) {
        d_warn("trying to write code to binary file");
        return;
    }
    ...
}
Any hints or ideas would be appreciated.
I found out that these warnings were caused by errors in my grammar: I had forgotten to add quotes around [ ] in some cases, like:
[ example_non_terminal ]
DParser was taking example_non_terminal as a character set. A number of these were causing the problem. The correct grammar should have been:
'[' example_non_terminal ']'

How to make lex/flex recognize tokens not separated by whitespace?

I'm taking a course in compiler construction, and my current assignment is to write the lexer for the language we're implementing. I can't figure out how to satisfy the requirement that the lexer must recognize concatenated tokens. That is, tokens not separated by whitespace. E.g.: the string 39if is supposed to be recognized as the number 39 and the keyword if. Simultaneously, the lexer must also exit(1) when it encounters invalid input.
A simplified version of the code I have:
%{
#include <stdio.h>
%}
%option main warn debug
%%
if |
then |
else printf("keyword: %s\n", yytext);
[[:digit:]]+ printf("number: %s\n", yytext);
[[:alpha:]][[:alnum:]]* printf("identifier: %s\n", yytext);
[[:space:]]+ // skip whitespace
[[:^space:]]+ { printf("ERROR: %s\n", yytext); exit(1); }
%%
When I run this (or my complete version), and pass it the input 39if, the error rule is matched and the output is ERROR: 39if, when I'd like it to be:
number: 39
keyword: if
(I.e. the same as if I entered 39 if as the input.)
Going by the manual, I have a hunch that the cause is that the error rule matches a longer possible input than the number and keyword rules, and flex will prefer it. That said, I have no idea how to resolve this situation. It seems unfeasible to write an explicit regexp that will reject all non-error input, and I don't know how else to write a "catch-all" rule for the sake of handling lexer errors.
UPDATE: I suppose I could just make the catch-all rule be . { exit(1); } but I'd like to get some nicer debug output than "I got confused on line 1".
You're quite right that you should just match a single "any" character as a fallback. The "standard" way of getting information about where in the line the scanner is would be to use the --bison-bridge option, but that can be a bit of a pain, particularly if you're not using bison. There are a bunch of other ways (look in the manual for how to specify your own I/O functions, for example), but the all-around simplest, IMHO, is to use a start condition:
%x LEXING_ERROR
%%
  /* all your rules; the following *must* be at the end */
.                 { BEGIN(LEXING_ERROR); yyless(0); }
<LEXING_ERROR>.+  { fprintf(stderr,
                            "Invalid character '%c' found at line %d,"
                            " just before '%s'\n",
                            *yytext, yylineno, yytext+1);
                    exit(1);
                  }
Note: Make sure that you've ignored whitespace in your rules. The pattern .+ matches one or more non-newline characters, in other words everything up to the end of the current line (it will force flex to read that far, which shouldn't be a problem). yyless(0) gives the whole matched token back to the input, so after the . rule matches, the offending character will be rescanned by the <LEXING_ERROR> rule, producing (hopefully) a semi-reasonable error message. (It won't really be reasonable if your input is multibyte, or has weird control characters, so you could write more careful code. Up to you. It also might not be reasonable if the error is at the end of a line, so you might also want to write a more careful regex which gets more context, and maybe even limits the number of forward characters read. Lots of options here.)
Look up start conditions in the flex manual for more info about %x and BEGIN.
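Putting the pieces together, the simplified scanner from the question might end up looking something like this sketch: the greedy error rule is dropped and the fallback and start condition from above are appended. Treat the exact option list as an assumption; in particular, %option yylineno was not in the original and is added here only so the message can report a line number.
%{
#include <stdio.h>
#include <stdlib.h>
%}
%option main warn debug yylineno
%x LEXING_ERROR
%%
if      |
then    |
else                      printf("keyword: %s\n", yytext);
[[:digit:]]+              printf("number: %s\n", yytext);
[[:alpha:]][[:alnum:]]*   printf("identifier: %s\n", yytext);
[[:space:]]+              /* skip whitespace */
.                         { BEGIN(LEXING_ERROR); yyless(0); }
<LEXING_ERROR>.+          { fprintf(stderr,
                                    "Invalid character '%c' found at line %d,"
                                    " just before '%s'\n",
                                    *yytext, yylineno, yytext+1);
                            exit(1);
                          }
%%
With this, the input 39if should produce number: 39 followed by keyword: if, while a character that no rule matches falls through to the error path.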

Bison grammar warnings

I am writing a parser with Bison and I am getting the following warnings.
fol.y:42 parser name defined to default :"parse"
fol.y:61: warning: type clash ('' 'pred') on default action
I have been using Google to search for a way to get rid of them, but I have pretty much come up empty-handed on what they mean (much less how to fix them), since every post I found with them concerns a compilation error and the warnings themselves aren't addressed. Could someone tell me what they mean and how to fix them? The relevant code is below. Line 61 is the last semicolon. I cut out the rest of the grammar since it is incredibly verbose.
%union {
char* var;
char* name;
char* pred;
}
%token <var> VARIABLE
%token <name> NAME
%token <pred> PRED
%%
fol:
declines clauses {cout << "Done parsing with file" << endl;}
;
declines:
declines decline
|decline
;
decline:
PRED decs
;
The first message is likely just a warning that you didn't include %start parse in the grammar specification.
The second means that somewhere you have a rule that is supposed to return a value but you haven't properly specified which type of value it is to return. PRED returns the pred element of your union; the problem might be that you've not created %type entries for decline and declines. If you have a union, you have to specify the type for most, if not all, rules, or at least for rules that don't have an explicit action (since those use the default $$ = $1; action).
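For illustration, the missing declarations might look like the sketch below (this assumes decline and declines are meant to carry the pred string; adjust the types to whatever those rules really produce):
/* in the declarations section, after the %token lines */
%type <pred> decline declines
With explicit types on those nonterminals, the default $$ = $1; action is well-typed, and the type-clash warning should go away.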
I'm not convinced that the problem is in the line you specify, and because we don't have a complete, minimal reproduction of your problem, we can't investigate for you to validate it. The specification for decs may be relevant (I'm not convinced it is, but it might be).
You may get more information from the output of bison -v, which is the y.output file (or something similar).
Finally found it.
To fix this:
fol.y:42 parser name defined to default :"parse"
Add %name parse before %token
E.g.:
%name parse
%token NUM
(From: https://bdhacker.wordpress.com/2012/05/05/flex-bison-in-ubuntu/#comment-2669)
