Tcl file parser for PYTHON - parsing

I have a .tcl file.
Is there a parser available that extracts data directly from a .tcl file? I don't want to use regex for this task. Will pyparsing work for this problem?
I am using Python 2.7

.tcl files are not data files, they are programming scripts, written in the Tcl programming language.
The Tcl language is extremely flexible in form and style, which makes writing a general-purpose parser a substantial project, whether in pyparsing or any other package. I encourage people, when they are embarking on a new pyparsing project, to begin by roughing out the BNF for the language, to whatever level of detail they want. This page from the Tcl wiki implies that developing a BNF for Tcl is not at all straightforward, if even possible.
It is very unlikely that anyone will respond to your question with an answer containing your Tcl-parser-implemented-in-Python. Perhaps there is a Tcl subset that you are particularly focused on - if you were to post some sample Tcl code and what you want to get from it, you are more likely to get helpful responses.
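To make the "Tcl subset" idea concrete, here is a minimal sketch, assuming the file contains only flat `set name value` commands, one per line, where the value is a bare word, a "quoted" string, or a {braced} word. The helper names `split_words` and `read_settings` are made up for this illustration. It does no substitution ($var, [cmd]) and handles no backslash escapes, so it is nowhere near a real Tcl parser.

```python
def split_words(line):
    """Split one Tcl command line into words, honoring {braces} and "quotes".

    Assumption: no backslash escapes, no $var or [cmd] substitution.
    """
    words, i, n = [], 0, len(line)
    while i < n:
        ch = line[i]
        if ch in " \t":
            i += 1
        elif ch == "{":
            # Braced word: scan to the matching close brace, tracking nesting.
            depth, start = 1, i + 1
            i += 1
            while i < n and depth:
                if line[i] == "{":
                    depth += 1
                elif line[i] == "}":
                    depth -= 1
                i += 1
            if depth:
                raise ValueError("unbalanced brace in: " + line)
            words.append(line[start:i - 1])
        elif ch == '"':
            # Quoted word: everything up to the next double quote.
            end = line.find('"', i + 1)
            if end < 0:
                raise ValueError("unterminated quote in: " + line)
            words.append(line[i + 1:end])
            i = end + 1
        else:
            # Bare word: runs until the next blank.
            start = i
            while i < n and line[i] not in " \t":
                i += 1
            words.append(line[start:i])
    return words


def read_settings(text):
    """Collect `set name value` commands into a dict; skip everything else."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        words = split_words(line)
        if len(words) == 3 and words[0] == "set":
            settings[words[1]] = words[2]
    return settings


sample = 'set width 640\nset title "My plot"\nset dims {10 20 {30 40}}\nputs $width'
print(read_settings(sample))
```

Anything beyond this shape (procs, multi-line braces, substitution) quickly runs into the problems described above, which is why the sample input matters.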

Related

Cross-platform parser development - What are the options?

I'm currently working on a project that makes use of a custom language with a simple context-free grammar.
Due to the project's characteristics the same language will have to be used on several platforms, especially mobile ones. Currently, I'm using my small hand-written Java parser (for the Android platform). Soon, I'll have to write basically the same parser for JavaScript and later possibly also for C# (Windows Phone) and Objective C (iOS). There is an additional chance that I'll also have to write it for PHP.
My question is: What options are there to simplify the parser development process? Do I really have to write basically the same parser for each platform or is there a less work-intensive way?
From a development process point of view the best alternative would enable me to write a grammar definition which would then automatically be compiled into a parser.
However, basically the only cross-platform parser generator I've found so far is the GOLD Parser, which supports two of my target platforms (Java and C#). It would really be awesome if you could point me to other alternatives.
In case you don't know about other cross-platform compiler-compilers: Do you have hints how to structure the code towards future language extensibility?
I commend https://en.wikipedia.org/wiki/Comparison_of_parser_generators to your attention: if we restrict the domain to Java and C/C++, it suggests APG, GOLD, SableCC, and SLK (amongst others) as being cross-language enough for your stated goals. (I'm also requiring that the action code be separated from the grammar rather than inline, since the latter would defeat the purpose.) If you want JavaScript as well, it looks like your choices are APG (GPL-licensed) and WaxEye (MIT-licensed).
If your language is reasonably simple then I would say to just go with whichever you think will be easiest to integrate into your build environment(s) and has a reasonable match with how you think. Unless parsing time is a huge fraction of your application's total workload, parsing speed should not be an issue -- although table size and memory usage might matter in a mobile context. If your grammar is "simple enough," (i.e. not Perl, for instance) I would expect any of those tools to work.
Have a look at ANTLR; I am using it for transforming Java code and it is really great. Moreover, you can find different grammars here.
REx parser generator supports the required targets, except for Objective C and PHP (code generators for those might be possible). It has not yet been published as open source, though, and there is no decent documentation, just sample grammars. But there are projects that are using it successfully, e.g. xqlint. Here is a paper describing the experience from that project.

Verilog gate level parser

I want to parse Verilog gate level code and store the data in a data structure (ex. graph).
Then I want to do something on the gates in C/C++ and output a corresponding Verilog file.
(I would like to build one program whose input and output are both Verilog gate-level code.)
(input.v => myProgram => output.v)
Is there any library or open source code to do so?
I found that it can be done by Flex and Bison but I have no idea how to use Flex and Bison.
There was a similar question a few days ago about doing this in Ruby, in which I pointed to my Verilog parser gem. I'm not sure if it is robust enough for you, though; I would love feedback, bug reports, and feature requests.
There are Perl Verilog parsers out there, but I have not used any of them directly and I avoid Perl; hopefully others can add info about other parsers.
I have used Verilog-Perl successfully to parse Verilog code. It is well-maintained: it even supports the recent SystemVerilog extensions.
Yosys (https://github.com/cliffordwolf/yosys) is a framework for Verilog synthesis written in C++. Yosys is still under construction, but if you only want to read and write gate-level netlists it can do what you need.
PS: A reference manual (that also covers the C++ APIs) is on the way. I've written ~100 pages already, but can't publish it before I've finished my BSc. thesis (another month or so).

Writing a code formatting tool for a programming language

I'm looking into the feasibility of writing a code formatting tool for the Apex language, a Salesforce.com variation on Java, and perhaps VisualForce, its tag-based markup language.
I have no idea on where to start this, apart from feeling/knowing that writing a language parser from scratch is probably not the best approach.
I have a fairly thin grasp of what Antlr is and what it does, but conceptually, I'm imagining one could 'train' antlr to understand the syntax of Apex. I could then get a structured version of the code in a data structure (AST?) which I could then walk to produce correctly formatted code.
Is this the right concept? Is Antlr a tool to do that? Any links to a brief synopsis on this? I'm looking to invest a few days in this task, not months, and I'm not sure if it's even vaguely achievable.
Since Apex syntax is similar to Java, I'd look at Eclipse's JDT. Edit down the Java grammar to match Apex. Do the same with the formatting rules/options. This is more than a few days of work.
Steven Herod wrote:
... I'm imagining one could 'train' antlr to understand the syntax of Apex. ...
What do you mean by "'train' antlr"? "Train" as in artificial intelligence (training a neural-net)? If so, then you are mistaken.
Steven Herod wrote:
... get a structured version of the code in a data structure (AST?) which I could then walk to produce correctly formatted code.
Is this the right concept? Is Antlr a tool to do that?
Yes, more or less. You write a grammar that precisely defines the language you want to parse. Then you use ANTLR which will generate a lexer (tokenizer) and parser based on the grammar file. You can let the parser create an AST from your input source and then walk the AST and emit (custom) output/code.
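The lexer, parser, AST, and tree walk described above can be sketched in a few dozen lines. This is only a toy, under loud assumptions: the "language" is a made-up arithmetic grammar (integers, `+`, `*`, parentheses), not Apex, and the lexer and parser are hand-written here to keep the example self-contained, whereas ANTLR would generate them from a grammar file. All names (`tokenize`, `Parser`, `emit`, `reformat`) are hypothetical.

```python
def tokenize(src):
    """Lexer: turn source text into a list of (kind, text) tokens."""
    tokens, i = [], 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
        elif ch.isdigit():
            start = i
            while i < len(src) and src[i].isdigit():
                i += 1
            tokens.append(("NUM", src[start:i]))
        elif ch in "+*()":
            tokens.append((ch, ch))
            i += 1
        else:
            raise SyntaxError("unexpected character %r" % ch)
    return tokens


class Parser(object):
    """Recursive descent for: expr: term ('+' term)*; term: atom ('*' atom)*;
    atom: NUM | '(' expr ')'. Each method returns an AST node (a tuple)."""

    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def take(self, kind):
        if self.peek() != kind:
            raise SyntaxError("expected %s" % kind)
        self.pos += 1
        return self.tokens[self.pos - 1][1]

    def expr(self):
        node = self.term()
        while self.peek() == "+":
            self.take("+")
            node = ("+", node, self.term())
        return node

    def term(self):
        node = self.atom()
        while self.peek() == "*":
            self.take("*")
            node = ("*", node, self.atom())
        return node

    def atom(self):
        if self.peek() == "(":
            self.take("(")
            node = self.expr()
            self.take(")")
            return node
        return ("num", self.take("NUM"))


def emit(node, parent_op=None):
    """Walk the AST and write it back out with uniform spacing."""
    if node[0] == "num":
        return node[1]
    op, left, right = node
    text = "%s %s %s" % (emit(left, op), op, emit(right, op))
    # Parentheses are re-emitted only where precedence requires them.
    return "(" + text + ")" if op == "+" and parent_op == "*" else text


def reformat(src):
    parser = Parser(tokenize(src))
    tree = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return emit(tree)


print(reformat("1+2*( 3+4 )"))
```

Note that this toy throws away comments, redundant parentheses, and the original layout. A real formatter has to preserve or deliberately normalize all of those, which is a large part of why the effort estimate below is measured in months.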
Steven Herod wrote:
... I'm looking to invest a few days in this task, not months, and I'm not sure if it's even vaguely achievable.
Well, I don't know you of course, but I'd say writing a grammar for a language similar to Java, and then emitting output by walking the AST within just a couple of days is impossible, even more so for someone new to ANTLR. I am fairly familiar with ANTLR, but I couldn't do it in just a few days. Note that I'm only talking about the "parsing-part", after you've done that, you'll need to integrate this in some text editor. This all looks to be more a project of several months, not even weeks, let alone several days.
So, in short, if all you want to do is write a custom code highlighter, ANTLR isn't your best choice.
You could have a look at Xtext which uses ANTLR under the hood. To quote their website:
With Xtext you can easily create your own programming languages and domain-specific languages (DSLs). The framework supports the development of language infrastructures including compilers and interpreters as well as full blown Eclipse-based IDE integration. ...
But I doubt you'll have an Eclipse plugin up and running within just a few days.
Anyway, best of luck!
Our DMS Software Reengineering Toolkit is designed to do this as a kind of poker-pot ante necessary to do any kind of automated software reengineering project.
DMS allows one to define a grammar, similar to ANTLR's (and other parser generator) styles. Unlike ANTLR (and other parser generators), DMS uses a GLR parser, which means you don't have to bend the language grammar rules to meet the requirements of the parser generator. If you can write a context-free grammar, DMS will convert that into a parser for that language. This means in fact you can get a working, correct grammar up considerably faster than with typical LL or L(AL)R parser generators.
Unlike ANTLR (and other parser generators), there is no additional work to build the AST; it is automatically constructed. This means you spend zero time writing tree-building rules and none debugging them.
DMS additionally provides a pretty-printing specification language, specifying text boxes that stack vertically, horizontally, or indented, in which you can define the "format" that is used to convert the AST back into completely legal, nicely formatted source text. None of the well known parser generators provide any help here; if you want to prettyprint the tree, you get to do a great deal of custom coding. For more details on this, see my SO answer to Compiling an AST back to source. What this means is you can build a prettyprinter for your grammar in an (intense) afternoon by simply annotating the grammar rules with box layout directives.
DMS's lexer is very careful to capture comments and "lexical formats" (was that number octal? What kind of quotes did that string have? Escaped characters?) so that they can be regenerated correctly. Parse-to-AST and then prettyprint-AST-to-text round trips arbitrarily ugly code into formatted code following the prettyprinting rules. (This round trip is the poker ante: if you want to go further, to actually manipulate the AST, you still want to be able to regenerate valid source text).
We recently built parser/prettyprinters for EGL. This took about a week end to end. Granted, we are expert at our tools.
You can download any of a number of different formatters built using DMS from our web site, to see what such formatting can do.
EDIT July 2012: Last week (5 days) using DMS, from scratch we (I personally) built a fully compliant IEC61131-3 "Structured Text" (industrial control language, Pascal-like) parser and prettyprinter. (It handles all the examples from the standards documents).
Reverse engineering a language to get a parser is hard. Very hard! Even if it's very close to Java.
But why reinvent the wheel?
There is a wonderful Apex parser implementation as part of the Force.com IDE on GitHub. It's just a jar without source code but you can use it for whatever you want. And the developers behind it are really supportive and helpful.
We are currently building an Apex module of the famous Java static code analyzer PMD here. And we use Salesforce.com internal parser. It works like a charm.
And hey, it's an open source project and we need contributors of any kind ;-)

Is there a scripting language or parser which has been ported to multiple languages?

For a research project I'm looking for an interpreter or even a parser for a programming language (doesn't matter what programming language) that has been ported to a number of languages. This probably means the code is small enough to do so.
I know Lisp-ish languages have been ported to a lot of environments, because Lisp is so easy to parse; however, I haven't found a single implementation that has been multi-targeted. For instance, it is very hard to find a version which works in PHP where the same code (the Lisp which runs on top / is parsed, that is) would also work in Python.
Hope someone here can help...
What do I want to do with it? For a tool I'm making, the user group will write tiny pieces of logic; however the system underneath differs while the logic is the same. We don't want to force our users to learn Java, PHP, C#, etc just to write that logic.
The Lua (www.lua.org) scripting language can run from within C and has bindings to Python, PHP, Java, C#, and probably some other languages too. It's a very small interpreter (something like 200k when compiled) because it comes "without batteries" - no builtin functions for some common operations like copying arrays. It should be pretty trivial to add support for embedding in another language, compared to other scripting languages, if need be.

Language parsers

I need to parse C#, Ruby and Python source code to generate some reports. I need to get a list of method names inside a class, and I need some other info such as usage of global variable or something. Just parsing using RE could be a solution, but I expect a better (systematic) solution using parsers, if it is easily possible.
What parsers for those languages are provided?
For C#, I found http://csparser.codeplex.com/Wikipage , but for the others, I found a bunch of parsers written in those languages rather than parsers for the languages themselves.
It may be worth looking into the ANTLR parser generator.
You'll find, on the ANTLR site, grammars for all 3 languages you are interested in (Although the Ruby grammar is only for a "simplified" version of the language).
The next difficulty may be to adapt these grammars for the particular target language you would like, i.e. the language in which the parsers themselves will be generated. ANTLR's grammar language is very expressive, allowing one to deal with various context-sensitive languages. This is done by inserting various snippets (in the target language) and/or semantic or syntactic predicates (also in the target language) amid the EBNF-like grammar; consequently the grammar is a bit messier and may need adapting when the target language is changed. The "native" target language of ANTLR is Java, but many other target languages are supported.
On the whole, ANTLR represents a bit of a setup/learning-curve effort, but since you need to deal with 3 languages, it may well be worth the investment, as this will allow you to have a uniform framework (over which you have "full" control), rather than trying to corral three possibly very distinct, and possibly more "locked down" parsers as you started doing.
All three languages are relatively sophisticated languages and although your goal is "merely" to identify methods within programs, you may be able to hack/simplify some of the grammars (or maybe simply "ignore" parts of them), only mapping the few parser-level rules of interest to your eventual goal.
Once these rules are identified, you can apply the same or similar actions, i.e. snippets (in the target language) which implement what you wish to accomplish when the parser encounters such rules (eg: store the method's signature for future reporting, start counting the number of lines... whatever).
A final suggestion:
As hinted in comments to the question, and depending on your goals, you may be able to reuse existing utility programs to perform directly, or indirectly these goals.
Also, because messing with parsers for these sophisticated languages may indeed be somewhat overkill for your possibly simple and possibly error-tolerant goals, the Regular Expressions approach may fit the bill, somehow; the fact of the matter is that none of these languages is regular or context-free, so success with regex will be highly dependent on the eventual goals and on the input data (programs).
Yet another suggestion!
See Larry Lustig's answer! Introspection may simplify much of your task as well. The implication is that you'd need to a) write your logic within each of the underlying languages and b) integrate/load the programs to be inspected. All depends, but again, a possible way out from the -let's be fair- relatively heavy investment with formal grammar tools.
For Python, the situation is trivial: there is a Python parser in the standard library as well as a more high-level module for manipulating ASTs.
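As a minimal sketch of what that looks like for the "list of method names inside a class" requirement, the following uses the standard-library `ast` module (presumably the higher-level module meant above). The sample source and the helper name `methods_by_class` are made up for illustration.

```python
import ast

source = '''
class Report(object):
    def load(self, path):
        pass

    def render(self):
        pass

def helper():
    pass
'''


def methods_by_class(code):
    """Map each class name to the functions defined directly in its body.

    Note: on Python 3, `async def` methods are ast.AsyncFunctionDef nodes
    and are not collected by this sketch.
    """
    tree = ast.parse(code)
    result = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            result[node.name] = [
                item.name for item in node.body
                if isinstance(item, ast.FunctionDef)
            ]
    return result


print(methods_by_class(source))
```

Because the code is only parsed and never executed, this works on files that cannot be imported, and the same walk can be extended to look for `ast.Global` nodes if global-variable usage is of interest.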
Also, Python has a somewhat simple grammar (at least if you use the trick to keep an indentation stack in your lexer and inject fake BEGIN and END tokens in your token stream, so that you can treat Python as a simple keyword delimited Algol-like language in your parser), so it is often used as an example grammar for parser generators, which means that you can find literally dozens of Python parsers for pretty much every single parser generator, programming language and platform out there. (E.g., here is a Haskell module implementing a Python lexer and parser.)
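The indentation-stack trick can be sketched as a small generator that turns indented text into a flat stream with explicit BEGIN and END tokens (CPython's own tokenizer calls them INDENT and DEDENT). This is an illustration under stated assumptions: spaces only, no line continuations, no brackets spanning lines, and each line is emitted as a single LINE token rather than being tokenized further. The name `block_tokens` is made up.

```python
def block_tokens(text):
    """Yield ('BEGIN',), ('END',) and ('LINE', content) tokens from indented text."""
    stack = [0]                      # indentation widths of the open blocks
    for raw in text.splitlines():
        if not raw.strip():
            continue                 # blank lines carry no indentation information
        width = len(raw) - len(raw.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)      # deeper than the current block: open one
            yield ("BEGIN",)
        while width < stack[-1]:
            stack.pop()              # shallower: close blocks until we match
            yield ("END",)
        if width != stack[-1]:
            raise IndentationError("dedent does not match any outer level")
        yield ("LINE", raw.strip())
    while len(stack) > 1:            # close blocks still open at end of input
        stack.pop()
        yield ("END",)


sample = "if x:\n    a\n    if y:\n        b\nc\n"
for token in block_tokens(sample):
    print(token)
```

With the blocks made explicit like this, the parser proper never has to look at whitespace again.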
For Ruby, there are quite a number of parsers available.
Ruby is incredibly hard to parse, so if you need full fidelity, you pretty much have to use the original YACC grammar file from the YARV Ruby implementation. (parse.y in the top-level source directory.) JRuby's parser is derived from that file, and it is the only one of the implementation parsers that has been explicitly designed to also be used by other clients and not just the interpreter itself. (For example, the Eclipse RDT plugin, the Eclipse DLTK/Ruby plugin, the NetBeans Ruby plugin and the jEdit Ruby syntax highlighting all use JRuby's parser.) To facilitate that, JRuby's parser has actually been repackaged as a separate project.
Of course, there are YACC clones for pretty much every language on the planet. However, be aware that YARV does not use a lex generated scanner. It uses a hand-written scanner in C, and also the YACC grammar contains quite a bit of semantic actions in C. Those parts will have to be re-implemented (like they were in JRuby).
The XRuby compiler is the only full Ruby implementation that does not use YARV's parse.y, it uses an ANTLRv3 grammar and an ANTLRv3 tree grammar that have been developed from scratch. ANTLR can generate parsers for a whole bunch of languages, including for example Java and C#. Its Ruby backend, however, is in dire need of some work.
RedParse is a Ruby parser written in Ruby, which claims to be able to parse all Ruby syntax correctly. It is used, for example, in the YARD Ruby documentation tool to, among other things, extract method names.
ruby_parser is another Ruby parser in Ruby. It is generated from parse.y via the racc parser generator that is part of Ruby's standard library.
YARV actually contains a parser library called ripper, which allows you to parse Ruby code. Unfortunately, it is completely undocumented, so you basically have to figure it out by reading blog posts. Except, of course, that because it is undocumented, almost nobody else has figured it out yet and written a blog post, either.
However, for your purposes, you don't actually need a full-blown Ruby parser. You only need enough to extract method names and some other stuff.
RDoc, the Ruby documentation generator, contains a Ruby parser which can parse just enough Ruby to, well, extract method names and some other stuff.
Cardinal is a Ruby implementation for the Parrot Virtual Machine. It does not yet run all of Ruby, but its parser should be powerful enough to support all you need. (The parser is written in the Parrot Grammar Engine, so you will obviously have to run it in Parrot, by for example writing your reporting tool in Perl6.)
tinyrb is another Ruby implementation that does not run full Ruby but contains a better written parser than YARV. In this case, the parser uses Ian Piumarta's leg Parsing Expression Grammar parser generator.
For Ruby and Python, can't you simply introspect the class to learn the name of the methods? You'd have to write the same functionality in each language but (at least in Python) there's hardly anything to it.
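On the Python side, a minimal sketch of that introspection approach could look like the following, using the standard-library `inspect` module. The `Report` class and the helper name `method_names` are made up for illustration, and the class has to be loaded into the running interpreter first, which means its module gets executed.

```python
import inspect


class Report(object):
    def load(self, path):
        pass

    def render(self):
        pass


def method_names(cls):
    """Sorted names of the Python-level methods visible on cls (inherited ones included)."""
    def is_method(member):
        # Python 3 exposes methods on a class as plain functions;
        # Python 2 exposes them as unbound methods.
        return inspect.isfunction(member) or inspect.ismethod(member)

    return sorted(name for name, _ in inspect.getmembers(cls, is_method))


print(method_names(Report))
```

To restrict the list to methods defined on the class itself rather than inherited ones, iterating over `vars(cls)` instead of `inspect.getmembers` is one option.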
The DMS Software Reengineering Toolkit has full, robust C# and Python parsers that automatically build complete ASTs. DMS offers facilities for walking the trees and collecting whatever data you might wish to collect.
Another poster's answer here suggests Ruby is really hard to parse. C++ is also famously hard to parse. DMS has been used to parse some 30 other languages, including full C++ in a number of dialects, so Ruby seems eminently doable. However, DMS doesn't have an off-the-shelf parser for Ruby.

Resources