We know that a compiler saves line numbers in the symbol table during lexical analysis. I have been wondering whether it is possible to save line numbers in any other phase of the analysis of the source code. If so, when and how?
You can obviously copy saved line numbers from some lexical component to some other component, provided you keep the line number in the lexical object; that copy could be made in any compiler phase.
However, the lexical phase is really the only point in the compilation where the source code itself is being analyzed textually, so it is the phase in which you will know what line number in the source code you have reached.
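For illustration, here is a minimal sketch in Python (hypothetical names, not from any particular compiler) of how the lexer stamps each token with its line number, so that later phases can copy that number into symbol-table entries but never re-derive it:

from dataclasses import dataclass

@dataclass
class Token:
    kind: str      # e.g. "IDENT", "NUMBER"
    text: str
    line: int      # recorded once, during lexical analysis

def lex(source: str) -> list[Token]:
    # Toy lexer: only the lexer sees the raw source text, so only it can
    # observe line boundaries.
    tokens = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for word in line.split():
            kind = "NUMBER" if word.isdigit() else "IDENT"
            tokens.append(Token(kind, word, line_no))
    return tokens

# A later phase (e.g. semantic analysis) merely copies token.line into the
# symbol table; it has no way to discover line numbers on its own.
symbols = {t.text: t.line for t in lex("x 42\ny") if t.kind == "IDENT"}
print(symbols)  # {'x': 1, 'y': 2}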
In my objdump -t output, I see the following two lines:
00000000000004d2 l F .text.unlikely 00000000000000ec function-signature-goes-here [clone .cold.427]
and
00000000000018e0 g F .text 0000000000000690 function-signature-goes-here
I know l means local and g means global. I also know that .text is a section, or a type of section, in an object file, containing compiled program instructions. But what is .text.unlikely? Assuming it's a different section (or type-of-section) from .text - what's the difference?
In my GCC v5.4.0 manpage, I found the following switch:
-freorder-functions
which says:
Reorder functions in the object file in order to improve code
locality. This is implemented by using special subsections
".text.hot" for most frequently executed functions and
".text.unlikely" for unlikely executed functions. Reordering is done
by the linker so object file format must support named sections and
linker must place them in a reasonable way.
Also profile feedback must be available to make this option effective.
See -fprofile-arcs for details.
Enabled at levels -O2, -O3, -Os.
It looks like the compiler was run with optimization flags (or with that switch) for this binary, so functions were organized into subsections to improve spatial locality.
What are the symbol table and AST needed for during code compilation?
I'm trying to get a basic, high-level understanding of the code compilation process.
I understand the basic steps to be:
Lexical analysis
Syntax analysis
Semantic analysis
Code generation
Code optimisation
Linking
As I understand it, the symbol table starts to get built during the lexical analysis step as the code is lexed. This would include the token types and the actual tokens identified. During later steps, additional info is added to the symbol table, such as scope and data type. If I understand correctly, during syntax analysis the AST is built, which represents the structure of the code and is annotated with the same information as the symbol table.
I'm confused as to why both symbol table and AST are needed. Is one used to build the other? Are both of them fed into the code generation step? Is this language dependent (compiled vs interpreted)?
Where does white-space removal take place? I've been told this happens during lexical analysis, but if that's the case, my var = 5 would get converted to myvar=5, which is now syntactically correct.
Thanks for your input.
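On the whitespace question in the post above: a lexer typically consumes whitespace as a token separator rather than deleting it from the text, so my var = 5 still lexes as two identifiers. A toy sketch in Python (hypothetical token names):

import re

# Toy tokenizer (hypothetical): whitespace separates tokens but is never
# "pasted over", so "my var = 5" cannot collapse into "myvar=5".
TOKEN_RE = re.compile(r"\s+|(?P<IDENT>[A-Za-z_]\w*)|(?P<NUMBER>\d+)|(?P<OP>=)")

def lex(source):
    tokens = []
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup:                  # None means the match was whitespace
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(lex("my var = 5"))
# [('IDENT', 'my'), ('IDENT', 'var'), ('OP', '='), ('NUMBER', '5')]
# Two IDENT tokens in a row is exactly what the parser then rejects.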
I am working on a project to analyze COBOL and automatically generate a class diagram. I am developing a .NET console application. I need help tracking down the procedure name in which the PERFORM statement is used in the example below.
Z-POST-COPYRIGHT.
    move 0 to RETURN-CODE
    perform Z-WRITE-FILE
How do I track the procedure name Z-POST-COPYRIGHT under which the procedure Z-WRITE-FILE is called? The only idea I could think of in terms of COBOL is indentation, as the procedure names are always indented. Ideally, in the database, the code should record the procedure name after the word PERFORM and the procedure under which it is called (in this case, Z-POST-COPYRIGHT).
I assume you want to do this "on your own" without external tools (a faster approach can be found at the end).
You first have to "know" your source:
which compiler was it compiled with (get a manual for this compiler)
which options were used
Then you have to preparse the source:
include copybooks (applying the given REPLACING rules, if any)
if the source is in fixed-form reference format: concatenate the contents of the previous line and the current line whenever you find a - in column 7
check for REPLACE and change the result accordingly
remove all comments - in fixed-form reference format this likely means only * and / in column 7 (though extensions like "variable" format or "terminal" format exist), in free-form reference format only inline comments *>, plus compiler-specific extensions such as | - depending on the further re-engineering you want to do, it can be a good idea to extract the comments and store them with at least a line-number reference (a rough sketch of the fixed-form handling follows this list)
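A rough sketch in Python (deliberately simplified; fixed-form reference format only, no copybook or REPLACE handling) of the comment stripping and column-7 continuation just described:

def preparse(lines):
    # Fixed-form reference format: columns 1-6 sequence area, column 7
    # indicator, columns 8-72 code. Real continuation rules for literals
    # are more involved; this is deliberately naive.
    out = []
    for raw in lines:
        indicator = raw[6] if len(raw) > 6 else " "
        code = raw[7:72].rstrip()
        if indicator in ("*", "/"):      # comment line: drop (or store with its line number)
            continue
        if indicator == "-" and out:     # continuation: glue onto the previous line
            out[-1] += code.lstrip()
        else:
            out.append(code)
    return out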
Then you can finally track the procedure name with the following rule:
go backwards to the last separator period (there are more rules, but the rule "at least one line break, another period, a space, a comma or a semicolon" [I've never seen the last two in real code, but they are possible] should be enough)
check if there is only one word between this separator period and the next
if this word is not a reserved COBOL word (this depends on your compiler), it is very likely a procedure name
Start from there and check the output, then fine-tune the rule against actual false positives or missing entries (a minimal sketch of this rule follows below).
If you want to do more than just extract the procedure names for PERFORM and GO TO (you should at least check the sources for PERFORM ... THRU), this can turn into a lot of work...
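For the tracking rule itself, a minimal sketch in Python, run over already-preparsed source; RESERVED_WORDS here is a tiny stand-in for your compiler's real reserved-word list:

import re

# Minimal sketch: a paragraph (procedure) name is a single word on its own
# line ending with a separator period; each PERFORM is then attributed to
# the paragraph it appears under.
RESERVED_WORDS = {"PERFORM", "MOVE", "IF", "END-IF", "GO", "THRU", "THROUGH"}
PARAGRAPH_RE = re.compile(r"^([A-Z0-9][A-Z0-9-]*)\s*\.\s*$", re.IGNORECASE)
PERFORM_RE = re.compile(r"\bPERFORM\s+([A-Z0-9][A-Z0-9-]*)", re.IGNORECASE)

def perform_calls(lines):
    current = None
    calls = []                                   # (caller, callee) pairs
    for line in lines:
        m = PARAGRAPH_RE.match(line.strip())
        if m and m.group(1).upper() not in RESERVED_WORDS:
            current = m.group(1)                 # entered a new paragraph
            continue
        for target in PERFORM_RE.findall(line):
            calls.append((current, target))
    return calls

source = ["Z-POST-COPYRIGHT.",
          "    move 0 to RETURN-CODE",
          "    perform Z-WRITE-FILE"]
print(perform_calls(source))  # [('Z-POST-COPYRIGHT', 'Z-WRITE-FILE')]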
Faster approach with external tools:
run a COBOL compiler on the complete sources and tell it to do only the preparsing - this way the big second point is already solved
if you have the option: tell the compiler or an external tool to create a symbol table / cross-reference - this will tell you in which line a procedure is defined and what its name is (you can then find the correct procedure by comparing line numbers)
Just a note: You may want to check GnuCOBOL (formerly OpenCOBOL) for the preparsing and/or generation of symbol tables/cross-reference and/or printcbl for a completely external tool doing preparsing and/or cobxref for a complete cross reference generation.
I'm writing a small compiler out of interest, and I need to know at what stage a wrong keyword (a keyword that is not in the language) is detected: during lexical analysis or during parsing?
You'll detect this during parsing. It's not that you have a "wrong keyword"; it's that you have an identifier (i.e. a variable name) appearing in a place where you don't expect it. So, if your source code looks like:
reeeturn 3;
From the compiler's perspective, you're just using some variable named reeeturn. That could be an error because a variable with that name isn't defined. Or, in this case, it's probably a syntax error to have a number follow an identifier.
But there's no lexical error here. It's a totally valid sequence of tokens: identifier, number, semicolon.
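A toy sketch in Python (hypothetical token names) of why the lexer stays happy: keywords are usually recognized by looking a lexed word up in a keyword table, and an unknown spelling simply fails the lookup and stays an identifier:

import re

KEYWORDS = {"return", "if", "while"}     # toy keyword table

def lex(source):
    # Lex a word first, *then* consult the keyword table; a misspelled
    # keyword fails the lookup and falls through as an ordinary identifier.
    tokens = []
    for m in re.finditer(r"[A-Za-z_]\w*|\d+|;", source):
        text = m.group()
        if text == ";":
            kind = "SEMI"
        elif text in KEYWORDS:
            kind = "KEYWORD"
        elif text[0].isdigit():
            kind = "NUMBER"
        else:
            kind = "IDENT"
        tokens.append((kind, text))
    return tokens

print(lex("reeeturn 3;"))
# [('IDENT', 'reeeturn'), ('NUMBER', '3'), ('SEMI', ';')] -- no lexical error;
# it is the parser that rejects a NUMBER directly after an IDENT.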
This is probably during lexical analysis. Lexical analysis is the compiler phase where the input file is cut apart into pieces and tagged with what those pieces mean, whereas parsing takes those existing pieces and uses them to assemble an AST. Without seeing the code I can't be certain about this, but based on this reasoning I'd suspect the error is in the scanner and not the parser.
Hope this helps!
It depends on the language.
The lexing phase is responsible for creating a token stream from the source file. If the "wrong keyword" is still a valid token in the language, it will be tokenized properly - for example, in C, the "wrong keyword" will be tokenized as an identifier. It is only later, during parsing, that the mistake will be revealed.
On the other hand, in a language in which the "wrong keyword" cannot be any other valid token (e.g. a language using sigils for variables), the lexer itself will complain.
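For contrast, a toy sketch in Python of a hypothetical sigil language where every variable is written $name, so a bare word can only be a keyword and the misspelling is already a lexical error:

KEYWORDS = {"return", "if", "while"}

def lex_sigil(source):
    # Every variable carries a $ sigil, so a bare word must be a keyword;
    # an unknown bare word can be rejected by the lexer itself.
    tokens = []
    for word in source.replace(";", " ; ").split():
        if word.startswith("$"):
            tokens.append(("VAR", word))
        elif word == ";":
            tokens.append(("SEMI", word))
        elif word in KEYWORDS:
            tokens.append(("KEYWORD", word))
        elif word.isdigit():
            tokens.append(("NUMBER", word))
        else:
            raise SyntaxError(f"unknown keyword: {word!r}")
    return tokens

print(lex_sigil("return 3;"))    # [('KEYWORD', 'return'), ('NUMBER', '3'), ('SEMI', ';')]
lex_sigil("reeeturn 3;")         # raises SyntaxError inside the lexer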
By concept/function/implementation, what are the differences between compilers and parsers?
A compiler is often made up of several components, one of which is a parser.
A common set of components in a compiler is:
Lexer - break the program up into words.
Parser - check that the syntax of the sentences is correct.
Semantic Analysis - check that the sentences make sense.
Optimizer - edit the sentences for brevity.
Code generator - output something with equivalent semantic meaning using another vocabulary.
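To make that division of labour concrete, a tiny end-to-end sketch in Python (hypothetical toy language of integer addition only; every name here is made up for illustration):

import re

def lex(src):                     # Lexer: break the program up into words
    return re.findall(r"\d+|\+", src)

def parse(tokens):                # Parser: check sentence syntax, build a tree
    tree, rest = tokens[0], tokens[1:]
    while rest:
        if rest[0] != "+" or len(rest) < 2:
            raise SyntaxError("expected '+ number'")
        tree = ("+", tree, rest[1])
        rest = rest[2:]
    return tree                   # (semantic analysis is trivial here: every
                                  # well-formed sum of integers "makes sense")

def fold(tree):                   # Optimizer: same meaning, shorter form
    if isinstance(tree, tuple):
        return str(int(fold(tree[1])) + int(fold(tree[2])))
    return tree

def emit(tree):                   # Code generator: another vocabulary
    return [f"PUSH {tree}"]

print(emit(fold(parse(lex("1 + 2 + 3")))))   # ['PUSH 6']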
To add a little bit:
As mentioned elsewhere, Small-C is a recursive descent compiler that generated code as it parsed: basically syntactic analysis, semantic analysis, and code generation in one pass. As I recall, it also did its lexing in the parser.
A long time ago, I wrote a C compiler (actually several: the Introl-C family for microcontrollers) that used recursive descent and did syntax and semantic checking during the parse and produced a tree representation of the program from which code was generated.
Today, I'm working on a compiler that does source -> tokens -> AST -> IR -> code, pretty much as I described above.
A parser just reads a text into an internal, more abstract representation, often a tree or graph of some sort.
A compiler translates such an internal representation into another format. Most often this means converting source code into executable programs. But the target doesn't have to be machine code. It can be another programming language as well; the compiler would still be a compiler. Obviously a compiler needs a parser to actually read its input.
A compiler always has a parser inside. The parser just processes the language and returns a tree representation of it; the compiler then generates something from that tree: actual machine code or another language.
A parser is one element of a compiler.
Are you looking for the differences between an interpreter and a compiler?
A parser takes in raw data and parses it into a tree structure. This syntax tree is then passed on to a generator, which will turn it into whatever it is supposed to generate.
So, a parser is a part of a compiler.
In general, a parser is a part of a compiler, but a compiler is designed to convert the received script into machine-readable code or sometimes into another language.
A compiler is a special type of computer program that translates a human-readable text file into a form that the computer can more easily understand. At its most basic level, a computer can only understand two things: a 1 and a 0. At this level, a human will operate very slowly and find the information contained in the long string of 1s and 0s incomprehensible. A compiler is a computer program that bridges this gap.
A parser is a piece of software that evaluates the syntax of a script when it is executed on a web server. For scripting languages used on the web, the parser works like a compiler might work in other types of application development environments. Parsers are commonly used in script development because they can evaluate code when the script is executed and do not require that the code be compiled first.