the difference between parsing and compiling - parsing

I am new to the world of the assembly and would like to know what parsing means in this context and the difference between parsing and compiling. Thank you.

A recognizer does verify that a sequence of symbols belongs or not to a given language. With the assumption that it terminates, it will terminate with a Boolean value: true (belongs) or false (does not belong).
A parser is a recognizer that also outputs the derivation steps that one can use to build a syntax tree.
Parsing is the process that the parser performs.
A translator does translate from one language to another.
A compiler is a translator that translates from a high-level language to a low-level language.
Compiling is the process that the compiler performs.
A decompiler is a translator that translates from a low-level language to a high-level language.
An assembly language is low-level language that uses mnemonics that are very close to a given machine language. It is the "human friendly" version of the machine language, often with some extras like comments.
A machine language is the language that is used to write machine code and that will be executed by the hardware. For example, the Central Processor Unit (CPU).
It seems to me that its more correct to say that the assembly language is compiled by an assembler to produce a machine code written in a machine language. The term assembled is also used in this case. I write compiled, because, the assembly languages express not exactly the machine language, but have extras, that I think is somehow "higher" level compared to the machine language.

Related

Can't we just write a program in machine level language?

Compiler convert human understandable language into machine level language. Can't we just write a program in machine level language so that it will be easy and quick for a program to execute.
Yes, you can program in assembler under Linux.
Check this and this questions on Stack Overflow, for example. Also, the Linux Assembly HOWTO looks good.
Compiler convert human understandable language into machine level language. Can't we just write a program in machine level language so that it will be easy and quick for a program to execute.
No one writes programs in machine language. Normal binary exectuables are not just machine code anyway, so it would be pointless to try and do this. Binaries contain machine code but include specific, OS dependent formatting. For example, linux uses ELF. This format is understood by the linker and loader (on *nix, the loader is part of the kernel). The only place unadulterated machine code exists is in the system memory.
You can write programs in assembly language, which is very similar to machine language, but then this must be compiled and linked. In other words, it is the same thing as writing a program in any other compiled language.
Finally, creating a binary manually by formatting some machine code would not provide any advantages and it would be an endless headache to work with. You might do it as a learning excercise, but not for any real purpose.
Yes, but then you have to write more and then it takes longer, so using a higher level language and compiler is more efficient on your time. More often than not, a compiler is better than what a mere human can do (they are pretty good programmers these compiler writers).

ANTLR grammar for Scala?

I am trying to build a static analysis tool for a demo project. We are free to choose the language to analyze. I started off by writing a Java code analyzer using ANTLR. I now want to do the same for Scala code. However, I could not find the ANTLR grammar for Scala. Does it exist?
Is there any other machine readable form of Scala grammar?
I don't believe there is such a thing.
The thing is that for any language, but especially for a library language like Scala, lexical analysis and syntactic analysis is the least interesting and most trivial part of static analysis. In order to do anything even remotely interesting, you need to perform a significant amount of semantic analysis: desugaring, type inference, type checking, kind checking, macro expansion, overload resolution, implicit resolution, name binding. In short: you need to re-implement more or less the entire Scala compiler, modulo the actual code generation part. Remember that both Scala's macro system and Scala's type system are Turing-complete (in fact, Scala's macro system is Scala!): there could be significant compile-time and type-level computation going on that is impossible to analyze without actually performing macro expansion, type inference and type checking.
That is a massive task, and there are in fact only two projects that have successfully done it: one is the Scala compiler itself, the other is the IntelliJ IDEA Scala plugin.
And let's not even talk about compiler plugins, which are able to change the syntax and semantics of Scala in almost arbitrary ways.
But behold, there is hope: The Scala compiler itself provides an API called the Presentation Compiler, which is specifically designed for use by IDEs, code highlighters, and all kinds of static analysis tools. It gives you access to the entire information the compiler has during compilation, just before the optimization and code generation phases. It is used by ScalaDoc, the Scala REPL, the Scala Eclipse Plugin, the NetBeans Scala Plugin, SimplyScala.Com, the ENSIME Plugin for Emacs, some static analysis tools, and many others.
You can find a Scala grammar for ANTLR at https://github.com/lrlucena/grammars-v4/tree/master/scala . It is based on the Scala Language Specification http://www.scala-lang.org/files/archive/spec/2.11/13-syntax-summary.html .
Is Appendix A of the Scala Language Reference useful for you? It is in EBNF format.
Scalastyle uses scalariform to do the parsing for it. With this, you get an AST of case classes. However, you only get the information which is in the file, so for instance, you don't get inferred types.
If you don't need all of the extra information, then look at Scalariform. The Scalastyle code is fairly easy to understand, start with Checker.scala.

Text parsing library

A colleague of mine works on an universal text parsing library, based on C# lambdas. The core looks cool, but unfortunately to me he has hardcoded a grammar, specifical to his private task -- math expression evaluating. So, I will not use it as I had intended before I saw the API. And now I'm looking for another lib, that meets at least some of my requirements. It has to:
Be able to load a grammar from an external file -- say, XML, YML or JSON.
Return AST from grammar and parsed tree that is built from any text.
Work fast enough to load C# grammar then parse a large code file.
I'd prefer the library that has grammar format file simple enough for easy writing a grammar for math expressions, is open source and written in C# or C++.
Regards,
--
UPDATED: point 2 has been corrected.
You might check out Text Transformer which claims to be some kind of universal text processing language. I have no specific experience with it.
Building robust langauge front ends and usable processing tools is actually a lot of work.
If you want to process computer languages in a generic way, you might consider our DMS Software Reengineering Toolkit, a kind of generalized compiler technology for parsing, analyzing, transforming, and/or generating code (or any other kind of formal document).
DMS will accept arbitrary context free grammars for langauges, automatically builds an AST with no additional specification effort on your part, and is designed to handle not only large files but very large sets of files in a single computation. Normally people
that want to process code need pattern recognition, code analysis and code transformation capabilities; DMS has all of these built in. It also has a variety of predefined, mature grammars for a wide variety of computer langauges, well-known (C, C++, C#, COBOL, Java, JavaScript, ... ) and otherwise (Natural, EGL, Python, MATLAB, ...), and has been used to carry out massive automated analyses and transformations on programs in these various langauges.
DMS does not meet your open-source or C#/C++ implementation requirements. It is implemented as a set of domain-specific langauges for describing grammars, analyzers, transformations, prettyprinters, and scripting that allows parallel execution to enable complex analyses to run faster than single-threaded programs.

Language parsers

I need to parse C#, Ruby and Python source code to generate some reports. I need to get a list of method names inside a class, and I need some other info such as usage of global variable or something. Just parsing using RE could be a solution, but I expect a better (systematic) solution using parsers, if it is easily possible.
What parsers for those languages are provided?
For C#, I found http://csparser.codeplex.com/Wikipage , but for the others, I found a bunch of parsers using those languages, but not the language parsers of them.
It may be worth looking into the ANTLR parser generator.
You'll find, on the ANTLR site, grammars for all 3 languages you are interested in (Although the Ruby grammar is only for a "simplified" version of the language).
The next difficulty may be to adapt these grammars for the particular target language you would like, i.e. the language in which the parsers themselves will be generated. ANTLR's grammar language is very expressive, allowing one to deal with various context-sensitive languages. This is done by inserting various snippets (in the target language) and/or semantic or syntactic predicates (also in the target language) amid the EBNF-like grammar; consequently the grammar is a bit messier and may need adapting when the target language is changed. The "native" target language of ANTLR is Java, but many other targets languages are supported.
On the whole, ANTLR represents a bit a setup/learning-curve effort, but since you need to deal with 3 languages, it may well be worth the investment, as this will allow you to have a uniform framework (over which you have "full" control), rather than trying to corral three possibly very distinct, and possibly more "locked down" parsers as you started doing.
All three languages are relatively sophisticated languages and although your goal is "merely" to identify methods within programs, you may be able to hack/simplify some of the grammars (or maybe simply "ignore" parts of them), only mapping the few parser-level rules of interest to your eventual goal.
Once these rules are identified, you can apply the same or similar actions, i.e. snippets (in the target language) which implement what you wish to accomplish when the parser encounters such rules (eg: store the method's signature for future reporting, start counting the number of lines... whatever).
A final suggestion:
As hinted in comments to the question, and depending on your goals, you may be able to reuse existing utility programs to perform directly, or indirectly these goals.
Also, because indeed messing with parsers for these sophisticated languages may be somewhat overkill for you possibly simple and possibly error-tolerant goals, the Regular Expressions approach may fit the bill, somehow; the fact of the matter is that none of these languages are regular nor context free, so success with regex will be highly dependent on the eventual goals and on the input data (programs).
Yet another suggestion!
See Larry Lustig's answer! Introspection may simplify much of you task as well. The implication is that you'd need to a) write your logic within each of the the underlying language b) integrate/load the programs to be inspected. All depends, but again, a possible way out from the -let's be fair- relatively heavy investment with formal grammar tools.
For Python, the situation is trivial: there is a Python parser in the standard library as well as a more high-level module for manipulating ASTs.
Also, Python has a somewhat simple grammar (at least if you use the trick to keep an indentation stack in your lexer and inject fake BEGIN and END tokens in your token stream, so that you can treat Python as a simple keyword delimited Algol-like language in your parser), so it is often used as an example grammar for parser generators, which means that you can find literally dozens of Python parsers for pretty much every single parser generator, programming language and platform out there. (E.g., here is a Haskell module implementing a Python lexer and parser.)
For Ruby, there are quite a number of parsers available.
Ruby is incredibly hard to parse, so if you need full fidelity, you pretty much have to use the original YACC grammar file from the YARV Ruby implementation. (parse.y in the top-level source directory.) JRuby's parser is derived from that file, and it is the only one of the implementation parsers that has been explicitly designed to also be used by other clients and not just the interpreter itself. (For example, the Eclipse RDT plugin, the Eclipse DLTK/Ruby plugin, the NetBeans Ruby plugin and the jEdit Ruby syntax highlighting all use JRuby's parser.) To facilitate that, JRuby's parser has actually been repackaged as a separate project.
Of course, there are YACC clones for pretty much every language on the planet. However, be aware that YARV does not use a lex generated scanner. It uses a hand-written scanner in C, and also the YACC grammar contains quite a bit of semantic actions in C. Those parts will have to be re-implemented (like they were in JRuby).
The XRuby compiler is the only full Ruby implementation that does not use YARV's parse.y, it uses an ANTLRv3 grammar and an ANTLRv3 tree grammar that have been developed from scratch. ANTLR can generate parsers for a whole bunch of languages, including for example Java and C#. Its Ruby backend, however, is in dire need of some work.
RedParse is a Ruby parser written in Ruby, which claims to be able to parse all Ruby syntax correctly. It is used, for example, in the YARD Ruby documentation tool to, among other things, extract method names.
ruby_parser is another Ruby parser in Ruby. It is generated from parse.y via the racc parser generator that is part of Ruby's standard library.
YARV actually contains a parser library called ripper, which allows you to parse Ruby code. Unfortunately, it is completely undocumented, so you basically have to figure it out by reading blog posts. Except of course, being undocumented, almost nobody else has figured it out yet, either and written a blog post.
However, for your purposes, you don't actually need a full-blown Ruby parser. You only need enough to extract method names and some other stuff.
RDoc, the Ruby documentation generator, contains a Ruby parser which can parse just enough Ruby to, well, extract method names and some other stuff.
Cardinal is a Ruby implementation for the Parrot Virtual Machine. It does not yet run all of Ruby, but its parser should be powerful enough to support all you need. (The parser is written in the Parrot Grammar Engine, so you will obviously have to run it in Parrot, by for example writing your reporting tool in Perl6.)
tinyrb is another Ruby implementation that does not run full Ruby but contains a better written parser than YARV. In this case, the parser uses Ian Piumarta's leg Parsing Expression Grammar parser generator.
For Ruby and Python, can't you simply introspect the class to learn the name of the methods? You'd have to write the same functionality in each language but (at least in Python) there's hardly anything to it.
The DMS Software Reengineering Toolkit has full, robust C# and Python parsers that automatically build complete ASTs. DMS offers facilities for walking the trees and collecting whatever data you might wish to collect.
Another poster's answer here suggests Ruby is really hard to parse. C++ is also famously hard to parse. DMS has been used to parse some 30 other languages, including full C++ in a number of dialects, so Ruby seems eminently doable. Howeever, DMS doesn't have an off-the-shelf parser for Ruby.

ANTLR vs. Happy vs. other parser generators

I want to write a translator between two languages, and after some reading on the Internet I've decided to go with ANTLR. I had to learn it from scratch, but besides some trouble with eliminating left recursion everything went fine until now.
However, today some guy told me to check out Happy, a Haskell based parser generator. I have no Haskell knowledge, so I could use some advice, if Happy is indeed better than ANTLR and if it's worth learning it.
Specifically what concerns me is that my translator needs to support macro substitution, which I have no idea yet how to do in ANTLR. Maybe in Happy this is easier to do?
Or if think other parser generators are even better, I'd be glad to hear about them.
People keep believing that if they just get a parser, they've got it made
when building language tools. Thats just wrong. Parsers get you to the foothills
of the Himalayas then you need start climbing seriously.
If you want industrial-strength support for building language translators, see our
DMS Software Reengineering Toolkit. DMS provides
Unicode-based lexers
full context-free parsers (left recursion? No problem! Arbitrary lookahead? No problem. Ambiguous grammars? No problem)
full front ends for C, C#, COBOL, Java, C++, JavaScript, ...
(including full preprocessors for C and C++)
automatic construction of ASTs
support for building symbol tables with arbitrary scoping rules
attribute grammar evaluation, to build analyzers that leverage the tree structure
support for control and data flow analysis (as well realization of this for full C, Java and COBOL),
source-to-source transformations using the syntax of the source AND the target language
AST to source code prettyprinting, to reproduce target language text
Regarding the OP's request to handle macros: our C, COBOL and C++ front ends handle their respective language preprocessing by a) the traditional method of full expansion or b) non-expansion (where practical) to enable post-parsing transformation of the macros themselves. While DMS as a foundation doesn't specifically implement macro processing, it can support the construction and transformation of same.
As an example of a translator built with DMS, see the discussion of
converting
JOVIAL to C for the B-2 bomber. This is 100% translation for > 1 MSLOC of hard
real time code. [It may amuse you to know that we were never allowed to see the actual program being translated (top secret).]. And yes, JOVIAL has a preprocessor, and yes we translated most JOVIAL macros into equivalent C versions.
[Haskell is a cool programming language but it doesn't do anything like this by itself.
This isn't about what's expressible in the language. Its about figuring out what machinery is required to support the task of manipulating programs, and
spending 100 man-years building it.]

Resources