How do I use the Rust parser (libsyntax) myself? - parsing

I want to use the Rust parser (libsyntax) to parse a Rust file and extract information like function names out of it. I started digging in the docs and code, so my first goal is a program that prints all function names of freestanding functions in a .rs file.
The program should expand all macros before it prints the function names, so functions declared via macro aren't missed. That's why I can't write some crappy little parser by myself to do the job.
I have to admit that I'm not yet perfectly good at programming Rust, so I apologize in advance for any stupid statements in this question.
How I understood it I need to do the following steps:
Parse the file via the Parser struct
Expand macros with MacroExpander
???
Use a Visitor to walk the AST and extract the information I need (eg. via visit_fn)
So here are my questions:
How do I use MacroExpander?
How do I walk the expanded AST with a custom visitor?
I had the idea of using a custom lint check instead of a fully fledged parser. I'm investigating this option.
If it matters, I'm using rustc 0.13.0-nightly (f168c12c5 2014-10-25 20:57:10 +0000)

You can use syntex to parse Rust, so you don't need to use unstable Rust.
Here's a simple example:
// Tested against syntex_syntax v0.33
extern crate syntex_syntax as syntax;
use std::rc::Rc;
use syntax::codemap::{CodeMap};
use syntax::errors::{Handler};
use syntax::errors::emitter::{ColorConfig};
use syntax::parse::{self, ParseSess};
fn main() {
let codemap = Rc::new(CodeMap::new());
let tty_handler =
Handler::with_tty_emitter(ColorConfig::Auto, None, true, false, codemap.clone());
let parse_session = ParseSess::with_span_handler(tty_handler, codemap.clone());
let src = "fn foo(x: i64) { let y = x + 1; return y; }".to_owned();
let result = parse::parse_crate_from_source_str(String::new(), src, Vec::new(), &parse_session);
println!("parse result: {:?}", result);
}
This prints the whole AST:
parse result: Ok(Crate { module: Mod { inner: Span { lo: BytePos(0), hi: BytePos(43), expn_id: ExpnId(4294967295) },
items: [Item { ident: foo#0, attrs: [], id: 4294967295, node: Fn(FnDecl { inputs: [Arg { ty: type(i64), pat:
pat(4294967295: x), id: 4294967295 }], output: Default(Span { lo: BytePos(15), hi: BytePos(15), expn_id: ExpnId(4294967295) }),
variadic: false }, Normal, NotConst, Rust, Generics { lifetimes: [], ty_params: [], where_clause: WhereClause { id:
4294967295, predicates: [] } }, Block { stmts: [stmt(4294967295: let y = x + 1;), stmt(4294967295: return y;)], expr:
None, id: 4294967295, rules: Default, span: Span { lo: BytePos(15), hi: BytePos(43), expn_id: ExpnId(4294967295) } }),
vis: Inherited, span: Span { lo: BytePos(0), hi: BytePos(43), expn_id: ExpnId(4294967295) } }] }, attrs: [], config: [],
span: Span { lo: BytePos(0), hi: BytePos(42), expn_id: ExpnId(4294967295) }, exported_macros: [] })

I'm afraid I can't answer your question directly; but I can present an alternative that might help.
If all you need is the AST, you can retrieve it in JSON format using rustc -Z ast-json. Then use your favorite language (Python is great) to process the output.
You can also get pretty-printed source using rustc --pretty=(expanded|normal|typed).
For example, given this hello.rs:
fn main() {
println!("hello world");
}
We get:
$ rustc -Z ast-json hello.rs
{"module":{"inner":null,"view_items":[{"node":{"va... (etc.)
$ rustc --pretty=normal hello.rs
#![no_std]
#[macro_use]
extern crate "std" as std;
#[prelude_import]
use std::prelude::v1::*;
fn main() { println!("hello world"); }
$ rustc --pretty=expanded hello.rs
#![no_std]
#[macro_use]
extern crate "std" as std;
#[prelude_import]
use std::prelude::v1::*;
fn main() {
::std::io::stdio::println_args(::std::fmt::Arguments::new({
#[inline]
#[allow(dead_code)]
static __STATIC_FMTSTR:
&'static [&'static str]
=
&["hello world"];
__STATIC_FMTSTR
},
&match () {
() => [],
}));
}
If you need more than that though, a lint plugin would be the best option. Properly handling macro expansion, config flags, the module system, and anything else that comes up is quite non-trivial. With a lint plugin, you get the type-checked AST right away without fuss. Cargo supports compiler plugins too, so your tool will fit nicely into other people's projects.

The syn crate works indeed. At the beginning I wrongly think it is for writing procedural macros (as its readme suggests), but indeed it can parse a source code file. Please look at this page: https://docs.rs/syn/1.0.77/syn/struct.File.html . It even gives an example that inputs a .rs file and output AST (of course, you can do anything with it - not just printing):
use std::env;
use std::fs::File;
use std::io::Read;
use std::process;
fn main() {
let mut args = env::args();
let _ = args.next(); // executable name
let filename = match (args.next(), args.next()) {
(Some(filename), None) => filename,
_ => {
eprintln!("Usage: dump-syntax path/to/filename.rs");
process::exit(1);
}
};
let mut file = File::open(&filename).expect("Unable to open file");
let mut src = String::new();
file.read_to_string(&mut src).expect("Unable to read file");
let syntax = syn::parse_file(&src).expect("Unable to parse file");
// Debug impl is available if Syn is built with "extra-traits" feature.
println!("{:#?}", syntax);
}
Thanks #poolie for pointing out this hint (though lacking a bit of details).

syntex seems to no longer be maintained (last updated 2017), but https://crates.io/crates/syn may do what you need.

Related

Implement heredocs with trim indent using PEG.js

I working on a language similar to ruby called gaiman and I'm using PEG.js to generate the parser.
Do you know if there is a way to implement heredocs with proper indentation?
xxx = <<<END
hello
world
END
the output should be:
"hello
world"
I need this because this code doesn't look very nice:
def foo(arg) {
if arg == "here" then
return <<<END
xxx
xxx
END
end
end
this is a function where the user wants to return:
"xxx
xxx"
I would prefer the code to look like this:
def foo(arg) {
if arg == "here" then
return <<<END
xxx
xxx
END
end
end
If I trim all the lines user will not be able to use a string with leading spaces when he wants. Does anyone know if PEG.js allows this?
I don't have any code yet for heredocs, just want to be sure if something that I want is possible.
EDIT:
So I've tried to implement heredocs and the problem is that PEG doesn't allow back-references.
heredoc = "<<<" marker:[\w]+ "\n" text:[\s\S]+ marker {
return text.join('');
}
It says that the marker is not defined. As for trimming I think I can use location() function
I don't think that's a reasonable expectation for a parser generator; few if any would be equal to the challenge.
For a start, recognising the here-string syntax is inherently context-sensitive, since the end-delimiter must be a precise copy of the delimiter provided after the <<< token. So you would need a custom lexical analyser, and that means that you need a parser generator which allows you to use a custom lexical analyser. (So a parser generator which assumes you want a scannerless parser might not be the optimal choice.)
Recognising the end of the here-string token shouldn't be too difficult, although you can't do it with a single regular expression. My approach would be to use a custom scanning function which breaks the here-string into a series of lines, concatenating them as it goes until it reaches a line containing only the end-delimiter.
Once you've recognised the text of the literal, all you need to normalise the spaces in the way you want is the column number at which the <<< starts. With that, you can trim each line in the string literal. So you only need a lexical scanner which accurately reports token position. Trimming wouldn't normally be done inside the generated lexical scanner; rather, it would be the associated semantic action. (Equally, it could be a semantic action in the grammar. But it's always going to be code that you write.)
When you trim the literal, you'll need to deal with the cases in which it is impossible, because the user has not respected the indentation requirement. And you'll need to do something with tab characters; getting those right probably means that you'll want a lexical scanner which computes visible column positions rather than character offsets.
I don't know if peg.js corresponds with those requirements, since I don't use it. (I did look at the documentation, and failed to see any indication as to how you might incorporate a custom scanner function. But that doesn't mean there isn't a way to do it.) I hope that the discussion above at least lets you check the detailed documentation for the parser generator you want to use, and otherwise find a different parser generator which will work for you in this use case.
Here is the implementation of heredocs in Peggy successor to PEG.js that is not maintained anymore. This code was based on the GitHub issue.
heredoc = "<<<" begin:marker "\n" text:($any_char+ "\n")+ _ end:marker (
&{ return begin === end; }
/ '' { error(`Expected matched marker "${begin}", but marker "${end}" was found`); }
) {
const loc = location();
const min = loc.start.column - 1;
const re = new RegExp(`\\s{${min}}`);
return text.map(line => {
return line[0].replace(re, '');
}).join('\n');
}
any_char = (!"\n" .)
marker_char = (!" " !"\n" .)
marker "Marker" = $marker_char+
_ "whitespace"
= [ \t\n\r]* { return []; }
EDIT: above didn't work with another piece of code after heredoc, here is better grammar:
{ let heredoc_begin = null; }
heredoc = "<<<" beginMarker "\n" text:content endMarker {
const loc = location();
const min = loc.start.column - 1;
const re = new RegExp(`^\\s{${min}}`, 'mg');
return {
type: 'Literal',
value: text.replace(re, '')
};
}
__ = (!"\n" !" " .)
marker 'Marker' = $__+
beginMarker = m:marker { heredoc_begin = m; }
endMarker = "\n" " "* end:marker &{ return heredoc_begin === end; }
content = $(!endMarker .)*

Building a DspComplex ROM in Chisel

I'm attempting to build a ROM-based Window function using DSPComplex and FixedPoint types, but seem to keep running into the following error:
chisel3.core.Binding$ExpectedHardwareException: vec element 'dsptools.numbers.DspComplex#32' must be hardware, not a bare Chisel type
The source code for my attempt at this looks like the following:
class TaylorWindow(len: Int, window: Seq[FixedPoint]) extends Module {
val io = IO(new Bundle {
val d_valid_in = Input(Bool())
val sample = Input(DspComplex(FixedPoint(16.W, 8.BP), FixedPoint(16.W, 8.BP)))
val windowed_sample = Output(DspComplex(FixedPoint(24.W, 8.BP), FixedPoint(24.W, 8.BP)))
val d_valid_out = Output(Bool())
})
val win_coeff = Vec(window.map(x=>DspComplex(x, FixedPoint(0, 16.W, 8.BP))).toSeq) // ROM storing our coefficients.
io.d_valid_out := io.d_valid_in
val counter = Reg(UInt(10.W))
// Implicit reset
io.windowed_sample:= io.sample * win_coeff(counter)
when(io.d_valid_in) {
counter := counter + 1.U
}
}
println(getVerilog(new TaylorWindow(1024, fp_seq)))
I'm actually reading the coefficients in from a file (this particular window has a complex generation function that I'm doing in Python elsewhere) with the following sequence of steps
val filename = "../generated/taylor_coeffs"
val coeff_file = Source.fromFile(filename).getLines
val double_coeffs = coeff_file.map(x => x.toDouble)
val fp_coeffs = double_coeffs.map(x => FixedPoint.fromDouble(x, 16.W, 8.BP))
val fp_seq = fp_coeffs.toSeq
Does this mean the DSPComplex type isn't able to be translated to Verilog?
Commenting out the win_coeff line seems to make the whole thing generate (but clearly doesn't do what I want it to do)
I think you should try using
val win_coeff = VecInit(window.map(x=>DspComplex.wire(x, FixedPoint.fromDouble(0.0, 16.W, 8.BP))).toSeq) // ROM storing our coefficients.
which will create hardware values like you want. The Vec just creates a Vec of the type specfied

Parsing string with more then 15 characters with boost-spirit

I have a somewhat simple problem that i somehow cannot find any answers for. While working on parsing a larger grammar, i discovered that parsing any string larger then 15 characters would lead the parser to return as failed. The parser looks like this:
namespace parser {
template <typename Iterator>
struct p_grammar : qi::grammar<Iterator, standard::space_type> {
p_grammar() : p_grammar::base_type(spec) {
spec = "qwertyuiopasdfgh";
}
qi::rule<Iterator, standard::space_type> spec;
};
And will be run from within another function:
void MainWindow::parserTest() {
typedef parser::p_grammar<std::string::const_iterator> p_grammar;
p_grammar grammar;
using boost::spirit::standard::space;
std::string::const_iterator iter = editor->toPlainText().toStdString().begin();
std::string::const_iterator end = editor->toPlainText().toStdString().end();
if ( phrase_parse(iter,end,grammar,space) ) {
outputLog->append("Parsing succesfull");
} else {
outputLog->append("Parsing failed");
}
}
Removing the last character in "qwertyuiopasdfgh", so only 15 characters are present, makes it parse without failure.
Feel like I'm overlooking something obvious here.
You should be using valid iterators:
std::string value = editor->toPlainText().toStdString()
std::string::const_iterator iter = value.begin(), end = value.end();
You were using iterators into a temporary that wasn't stored.

When using bytes.replace is there a way to use wildcards?

I am programming in Go and I read a text file in and I replace multiple things on it to translate the code from one language to Go to be able to run. The problem I am having is that when trying to replace things like Println statements I cannot get a parenthesis on the end of the statement without being really specific to the code I am converting. Is there a way to use the code like this?
src = bytes.Replace(src, []byte("Insert"), []byte("Println(" * ")"), -1)
and have the ability to just put a parenthesis at the end of the line of code?
package main
import (
"fmt"
"regexp"
)
func main() {
src := []byte(`
Write(1, 3, "foo", 3*qux(42));
WriteLn("Enter bar: ");
`)
re := regexp.MustCompile(`Write\((.*)\);`)
re2 := regexp.MustCompile(`WriteLn\((.*)\);`)
src = re.ReplaceAll(src, []byte(`Print($1)`))
src = re2.ReplaceAll(src, []byte(`PrintLn($1)`))
fmt.Printf("%s", src)
}
(Alse here)
Output:
Print(1, 3, "foo", 3*qux(42))
PrintLn("Enter bar: ")

How to find functions in a cpp file that contain a specific word

using grep, vim's grep, or another unix shell command, I'd like to find the functions in a large cpp file that contain a specific word in their body.
In the files that I'm working with the word I'm looking for is on an indented line, the corresponding function header is the first line above the indented line that starts at position 0 and is not a '{'.
For example searching for JOHN_DOE in the following code snippet
int foo ( int arg1 )
{
/// code
}
void bar ( std::string arg2 )
{
/// code
aFunctionCall( JOHN_DOE );
/// more code
}
should give me
void bar ( std::string arg2 )
The algorithm that I hope to catch in grep/vim/unix shell scripts would probably best use the indentation and formatting assumptions, rather than attempting to parse C/C++.
Thanks for your suggestions.
I'll probably get voted down for this!
I am an avid (G)VIM user but when I want to review or understand some code I use Source Insight. I almost never use it as an actual editor though.
It does exactly what you want in this case, e.g. show all the functions/methods that use some highlighted data type/define/constant/etc... in a relations window...
(source: sourceinsight.com)
Ouch! There goes my rep.
As far as I know, this can't be done. Here's why:
First, you have to search across lines. No problem, in vim adding a _ to a character class tells it to include new lines. so {_.*} would match everything between those brackets across multiple lines.
So now you need to match whatever the pattern is for a function header(brittle even if you get it to work), then , and here's the problem, whatever lines are between it and your search string, and finally match your search string. So you might have a regex like
/^\(void \+\a\+ *(.*)\)\_.*JOHN_DOE
But what happens is the first time vim finds a function header, it starts matching. It then matches every character until it finds JOHN_DOE. Which includes all the function headers in the file.
So the problem is that, as far as I know, there's no way to tell vim to match every character except for this regex pattern. And even if there was, a regex is not the tool for this job. It's like opening a beer with a hammer. What we should do is write a simple script that gives you this info, and I have.
fun! FindMyFunction(searchPattern, funcPattern)
call search(a:searchPattern)
let lineNumber = line(".")
let lineNumber = lineNumber - 1
"call setpos(".", [0, lineNumber, 0, 0])
let lineString = getline(lineNumber)
while lineString !~ a:funcPattern
let lineNumber = lineNumber - 1
if lineNumber < 0
echo "Function not found :/"
endif
let lineString = getline(lineNumber)
endwhile
echo lineString
endfunction
That should give you the result you want and it's way easier to share, debug, and repurpose than a regular expression spit from the mouth of Cthulhu himself.
Tough call, although as a starting point I would suggest this wonderful VIM Regex Tutorial.
You cannot do that reliably with a regular expression, because code is not a regular language. You need a real parser for the language in question.
Arggh! I admit this is a bit over the top:
A little program to filter stdin, strip comments, and put function bodies on the same line. It'll get fooled by namespaces and function definitions inside class declarations, besides other things. But it might be a good start:
#include <stdio.h>
#include <assert.h>
int main() {
enum {
NORMAL,
LINE_COMMENT,
MULTI_COMMENT,
IN_STRING,
} state = NORMAL;
unsigned depth = 0;
for(char c=getchar(),prev=0; !feof(stdin); prev=c,c=getchar()) {
switch(state) {
case NORMAL:
if('/'==c && '/'==prev)
state = LINE_COMMENT;
else if('*'==c && '/'==prev)
state = MULTI_COMMENT;
else if('#'==c)
state = LINE_COMMENT;
else if('\"'==c) {
state = IN_STRING;
putchar(c);
} else {
if(('}'==c && !--depth) || (';'==c && !depth)) {
putchar(c);
putchar('\n');
} else {
if('{'==c)
depth++;
else if('/'==prev && NORMAL==state)
putchar(prev);
else if('\t'==c)
c = ' ';
if(' '==c && ' '!=prev)
putchar(c);
else if(' '<c && '/'!=c)
putchar(c);
}
}
break;
case LINE_COMMENT:
if(' '>c)
state = NORMAL;
break;
case MULTI_COMMENT:
if('/'==c && '*'==prev) {
c = '\0';
state = NORMAL;
}
break;
case IN_STRING:
if('\"'==c && '\\'!=prev)
state = NORMAL;
putchar(c);
break;
default:
assert(!"bug");
}
}
putchar('\n');
return 0;
}
Its c++, so just it in a file, compile it to a file named 'stripper', and then:
cat my_source.cpp | ./stripper | grep JOHN_DOE
So consider the input:
int foo ( int arg1 )
{
/// code
}
void bar ( std::string arg2 )
{
/// code
aFunctionCall( JOHN_DOE );
/// more code
}
The output of "cat example.cpp | ./stripper" is:
int foo ( int arg1 ) { }
void bar ( std::string arg2 ){ aFunctionCall( JOHN_DOE ); }
The output of "cat example.cpp | ./stripper | grep JOHN_DOE" is:
void bar ( std::string arg2 ){ aFunctionCall( JOHN_DOE ); }
The job of finding the function name (guess its the last identifier to precede a "(") is left as an exercise to the reader.
For that kind of stuff, although it comes to primitive searching again, I would recommend compview plugin. It will open up a search window, so you can see the entire line where the search occured and automatically jump to it. Gives a nice overview.
(source: axisym3.net)
Like Robert said Regex will help. In command mode start a regex search by typing the "/" character followed by your regex.
Ctags1 may also be of use to you. It can generate a tag file for a project. This tag file allows a user to jump directly from a function call to it's definition even if it's in another file using "CTRL+]".
u can use grep -r -n -H JOHN_DOE * it will look for "JOHN_DOE" in the files recursively starting from the current directory
you can use the following code to practically find the function which contains the text expression:
public void findFunction(File file, String expression) {
Reader r = null;
try {
r = new FileReader(file);
} catch (FileNotFoundException ex) {
ex.printStackTrace();
}
BufferedReader br = new BufferedReader(r);
String match = "";
String lineWithNameOfFunction = "";
Boolean matchFound = false;
try {
while(br.read() > 0) {
match = br.readLine();
if((match.endsWith(") {")) ||
(match.endsWith("){")) ||
(match.endsWith("()")) ||
(match.endsWith(")")) ||
(match.endsWith("( )"))) {
// this here is because i guessed that method will start
// at the 0
if((match.charAt(0)!=' ') && !(match.startsWith("\t"))) {
lineWithNameOfFunction = match;
}
}
if(match.contains(expression)) {
matchFound = true;
break;
}
}
if(matchFound)
System.out.println(lineWithNameOfFunction);
else
System.out.println("No matching function found");
} catch (IOException ex) {
ex.printStackTrace();
}
}
i wrote this in JAVA, tested it and works like a charm. has few drawbacks though, but for starters it's fine. didn't add support for multiple functions containing same expression and maybe some other things. try it.

Resources