I have noticed that ruby on rails has some small macros that have a c feel.
__FILE__ and __LINE__
For instance. This makes debugging and generating unique cache keys much easier.
store = Rails.cache.fetch("#{__FILE__}|#{__LINE__}|#{self.store_id}")...
Most handy when I am likely to get the same store over and over for a given set of orders.
What I'm wondering though is; can I create my own MACROS/defines? Precompiled code while not a good idea sometimes, does have its place.
QUESTION: Is there a way to create c style macros in ruby/rails?
They look like C but they are not macros, they are just objects, like everything else in ruby, these in particular having the power to determine the name and line number of the file currently being interpreted.
C macros are just references to compile-time executable code. Ruby is ... like, totally, completely opposite from this :-) Compilation in ruby is really syntax evaluation and then interpretation. If some languages like C, C++ and Java are early-binding and others like PHP and Python are late-binding, ruby has missed the bus and will get there at the last possible moment.
So if you want to create your own stuff, create a module and/or class having something inside. What you create inside will be a variable, symbol, constant, or method, perhaps local, of the instance, or of the class, perhaps public, protected or private. It's generally not good form to use the special naming conventions __FOO__ or others (even if they are valid names).
(end of answer... if I am hearing what you're asking, perhaps this will help you make the mental leap that is needed)...
I went from C, to C++, to Java, then to Ruby. After a brief time in the clinic for reformed Java programmers, I was able to understand how amazingly powerful ruby is. But you really have to think about how to solve problems differently.
I highly recommend the book Eloquent Ruby, linked here to Amazon.
__FILE__ and __LINE__ aren't macros per se, they're keywords (that return String values).
Having said that, you are allowed to define constants and methods that start with an underscore or _ character. For example:
__MICKEY__ = :mouse
Now, a constant can refer to anything—a String, an expression, even a code block (a Proc or lambda) or a whole Class.
That means that instead of compile-time macro expansion (Ruby, being an interpreted language, doesn't really have a 'compile phase') you can take advantage of (arguably more powerful) run-time code execution.
For instance, the C++ macro:
#define SUM(a, b, c) a + b + c
Can be suitably approximated in Ruby as
def SUM(a, b, c); a + b + c; end
I say 'approximated' because a) in this case we're defining a method, not a constant or a macro, and b) C++ macros can still work when evaluated with a different number of arguments than they expect (because C++ macros are expanded, meaning, it's as if we did find and replace), whereas Ruby method call semantics come into play, such as checking arity of methods and arguments.
To get around that, in our particular example, we can do (something arguably even more flexible)
def SUM(*args); args.inject(:+); end
SUM(1, 2) # => 3
SUM(1, 2, 3, 5) # => 11
As you can see, you can achieve a similar (possibly even greater) level of expressiveness that C++ macros provide by leveraging Ruby's dynamic nature.
Related
This is a question I've been mildly irritated about for some time and just never got around to search the answer to.
However I thought I might at least ask the question and perhaps someone can explain.
Basically many languages I've worked in utilize syntactic sugar to write (using syntax from C++):
int main() {
int a = 2;
a += 3; // a=a+3
}
while in lua the += is not defined, so I would have to write a=a+3, which again is all about syntactical sugar. when using a more "meaningful" variable name such as: bleed_damage_over_time or something it starts getting tedious to write:
bleed_damage_over_time = bleed_damage_over_time + added_bleed_damage_over_time
instead of:
bleed_damage_over_time += added_bleed_damage_over_time
So I would like to know not how to solve this if you don't have a nice solution, in that case I would of course be interested in hearing it; but rather why lua doesn't implement this syntactical sugar.
This is just guesswork on my part, but:
1. It's hard to implement this in a single-pass compiler
Lua's bytecode compiler is implemented as a single-pass recursive descent parser that immediately generates code. It does not parse to a separate AST structure and then in a second pass convert that to bytecode.
This forces some limitations on the grammar and semantics. In particular, anything that requires arbitrary lookahead or forward references is really hard to support in this model. This means assignments are already hard to parse. Given something like:
foo.bar.baz = "value"
When you're parsing foo.bar.baz, you don't realize you're actually parsing an assignment until you hit the = after you've already parsed and generated code for that. Lua's compiler has a good bit of complexity just for handling assignments because of this.
Supporting self-assignment would make that even harder. Something like:
foo.bar.baz += "value"
Needs to get translated to:
foo.bar.baz = foo.bar.baz + "value"
But at the point that the compiler hits the =, it's already forgotten about foo.bar.baz. It's possible, but not easy.
2. It may not play nice with the grammar
Lua doesn't actually have any statement or line separators in the grammar. Whitespace is ignored and there are no mandatory semicolons. You can do:
io.write("one")
io.write("two")
Or:
io.write("one") io.write("two")
And Lua is equally happy with both. Keeping a grammar like that unambiguous is tricky. I'm not sure, but self-assignment operators may make that harder.
3. It doesn't play nice with multiple assignment
Lua supports multiple assignment, like:
a, b, c = someFnThatReturnsThreeValues()
It's not even clear to me what it would mean if you tried to do:
a, b, c += someFnThatReturnsThreeValues()
You could limit self-assignment operators to single assignment, but then you've just added a weird corner case people have to know about.
With all of this, it's not at all clear that self-assignment operators are useful enough to be worth dealing with the above issues.
I think you could just rewrite this question as
Why doesn't <languageX> have <featureY> from <languageZ>?
Typically it's a trade-off that the language designers make based on their vision of what the language is intended for, and their goals.
In Lua's case, the language is intended to be an embedded scripting language, so any changes that make the language more complex or potentially make the compiler/runtime even slightly larger or slower may go against this objective.
If you implement each and every tiny feature, you can end up with a 'kitchen sink' language: ADA, anyone?
And as you say, it's just syntactic sugar.
Another reason why Lua doesn't have self-assignment operators is that table access can be overloaded with metatables to have arbitrary side effects. For self assignment you would need to choose to desugar
foo.bar.baz += 2
into
foo.bar.baz = foo.bar.baz + 2
or into
local tmp = foo.bar
tmp.baz = tmp.baz + 2
The first version runs the __index metamethod for foo twice, while the second one does so only once. Not including self-assignment in the language and forcing you to be explicit helps avoid this ambiguity.
I'm learning Elixir and wonder why it has two types of function definitions:
functions defined in a module with def, called using myfunction(param1, param2)
anonymous functions defined with fn, called using myfn.(param1, param2)
Only the second kind of function seems to be a first-class object and can be passed as a parameter to other functions. A function defined in a module needs to be wrapped in a fn. There's some syntactic sugar which looks like otherfunction(&myfunction(&1, &2)) in order to make that easy, but why is it necessary in the first place? Why can't we just do otherfunction(myfunction))? Is it only to allow calling module functions without parenthesis like in Ruby? It seems to have inherited this characteristic from Erlang which also has module functions and funs, so does it actually comes from how the Erlang VM works internally?
It there any benefit having two types of functions and converting from one type to another in order to pass them to other functions? Is there a benefit having two different notations to call functions?
Just to clarify the naming, they are both functions. One is a named function and the other is an anonymous one. But you are right, they work somewhat differently and I am going to illustrate why they work like that.
Let's start with the second, fn. fn is a closure, similar to a lambda in Ruby. We can create it as follows:
x = 1
fun = fn y -> x + y end
fun.(2) #=> 3
A function can have multiple clauses too:
x = 1
fun = fn
y when y < 0 -> x - y
y -> x + y
end
fun.(2) #=> 3
fun.(-2) #=> 3
Now, let's try something different. Let's try to define different clauses expecting a different number of arguments:
fn
x, y -> x + y
x -> x
end
** (SyntaxError) cannot mix clauses with different arities in function definition
Oh no! We get an error! We cannot mix clauses that expect a different number of arguments. A function always has a fixed arity.
Now, let's talk about the named functions:
def hello(x, y) do
x + y
end
As expected, they have a name and they can also receive some arguments. However, they are not closures:
x = 1
def hello(y) do
x + y
end
This code will fail to compile because every time you see a def, you get an empty variable scope. That is an important difference between them. I particularly like the fact that each named function starts with a clean slate and you don't get the variables of different scopes all mixed up together. You have a clear boundary.
We could retrieve the named hello function above as an anonymous function. You mentioned it yourself:
other_function(&hello(&1))
And then you asked, why I cannot simply pass it as hello as in other languages? That's because functions in Elixir are identified by name and arity. So a function that expects two arguments is a different function than one that expects three, even if they had the same name. So if we simply passed hello, we would have no idea which hello you actually meant. The one with two, three or four arguments? This is exactly the same reason why we can't create an anonymous function with clauses with different arities.
Since Elixir v0.10.1, we have a syntax to capture named functions:
&hello/1
That will capture the local named function hello with arity 1. Throughout the language and its documentation, it is very common to identify functions in this hello/1 syntax.
This is also why Elixir uses a dot for calling anonymous functions. Since you can't simply pass hello around as a function, instead you need to explicitly capture it, there is a natural distinction between named and anonymous functions and a distinct syntax for calling each makes everything a bit more explicit (Lispers would be familiar with this due to the Lisp 1 vs. Lisp 2 discussion).
Overall, those are the reasons why we have two functions and why they behave differently.
I don't know how useful this will be to anyone else, but the way I finally wrapped my head around the concept was to realize that elixir functions aren't Functions.
Everything in elixir is an expression. So
MyModule.my_function(foo)
is not a function but the expression returned by executing the code in my_function. There is actually only one way to get a "Function" that you can pass around as an argument and that is to use the anonymous function notation.
It is tempting to refer to the fn or & notation as a function pointer, but it is actually much more. It's a closure of the surrounding environment.
If you ask yourself:
Do I need an execution environment or a data value in this spot?
And if you need execution use fn, then most of the difficulties become much
clearer.
I may be wrong since nobody mentioned it, but I was also under the impression that the reason for this is also the ruby heritage of being able to call functions without brackets.
Arity is obviously involved but lets put it aside for a while and use functions without arguments. In a language like javascript where brackets are mandatory, it is easy to make the difference between passing a function as an argument and calling the function. You call it only when you use the brackets.
my_function // argument
(function() {}) // argument
my_function() // function is called
(function() {})() // function is called
As you can see, naming it or not does not make a big difference. But elixir and ruby allow you to call functions without the brackets. This is a design choice which I personally like but it has this side effect you cannot use just the name without the brackets because it could mean you want to call the function. This is what the & is for. If you leave arity appart for a second, prepending your function name with & means that you explicitly want to use this function as an argument, not what this function returns.
Now the anonymous function is bit different in that it is mainly used as an argument. Again this is a design choice but the rational behind it is that it is mainly used by iterators kind of functions which take functions as arguments. So obviously you don't need to use & because they are already considered arguments by default. It is their purpose.
Now the last problem is that sometimes you have to call them in your code, because they are not always used with an iterator kind of function, or you might be coding an iterator yourself. For the little story, since ruby is object oriented, the main way to do it was to use the call method on the object. That way, you could keep the non-mandatory brackets behaviour consistent.
my_lambda.call
my_lambda.call()
my_lambda_with_arguments.call :h2g2, 42
my_lambda_with_arguments.call(:h2g2, 42)
Now somebody came up with a shortcut which basically looks like a method with no name.
my_lambda.()
my_lambda_with_arguments.(:h2g2, 42)
Again, this is a design choice. Now elixir is not object oriented and therefore call not use the first form for sure. I can't speak for José but it looks like the second form was used in elixir because it still looks like a function call with an extra character. It's close enough to a function call.
I did not think about all the pros and cons, but it looks like in both languages you could get away with just the brackets as long as you make brackets mandatory for anonymous functions. It seems like it is:
Mandatory brackets VS Slightly different notation
In both cases you make an exception because you make both behave differently. Since there is a difference, you might as well make it obvious and go for the different notation. The mandatory brackets would look natural in most cases but very confusing when things don't go as planned.
Here you go. Now this might not be the best explanation in the world because I simplified most of the details. Also most of it are design choices and I tried to give a reason for them without judging them. I love elixir, I love ruby, I like the function calls without brackets, but like you, I find the consequences quite misguiding once in a while.
And in elixir, it is just this extra dot, whereas in ruby you have blocks on top of this. Blocks are amazing and I am surprised how much you can do with just blocks, but they only work when you need just one anonymous function which is the last argument. Then since you should be able to deal with other scenarios, here comes the whole method/lambda/proc/block confusion.
Anyway... this is out of scope.
I've never understood why explanations of this are so complicated.
It's really just an exceptionally small distinction combined with the realities of Ruby-style "function execution without parens".
Compare:
def fun1(x, y) do
x + y
end
To:
fun2 = fn
x, y -> x + y
end
While both of these are just identifiers...
fun1 is an identifier that describes a named function defined with def.
fun2 is an identifier that describes a variable (that happens to contain a reference to function).
Consider what that means when you see fun1 or fun2 in some other expression? When evaluating that expression, do you call the referenced function or do you just reference a value out of memory?
There's no good way to know at compile time. Ruby has the luxury of introspecting the variable namespace to find out if a variable binding has shadowed a function at some point in time. Elixir, being compiled, can't really do this. That's what the dot-notation does, it tells Elixir that it should contain a function reference and that it should be called.
And this is really hard. Imagine that there wasn't a dot notation. Consider this code:
val = 5
if :rand.uniform < 0.5 do
val = fn -> 5 end
end
IO.puts val # Does this work?
IO.puts val.() # Or maybe this?
Given the above code, I think it's pretty clear why you have to give Elixir the hint. Imagine if every variable de-reference had to check for a function? Alternatively, imagine what heroics would be necessary to always infer that variable dereference was using a function?
There's an excellent blog post about this behavior: link
Two types of functions
If a module contains this:
fac(0) when N > 0 -> 1;
fac(N) -> N* fac(N-1).
You can’t just cut and paste this into the shell and get the same
result.
It’s because there is a bug in Erlang. Modules in Erlang are sequences
of FORMS. The Erlang shell evaluates a sequence of
EXPRESSIONS. In Erlang FORMS are not EXPRESSIONS.
double(X) -> 2*X. in an Erlang module is a FORM
Double = fun(X) -> 2*X end. in the shell is an EXPRESSION
The two are not the same. This bit of silliness has been Erlang
forever but we didn’t notice it and we learned to live with it.
Dot in calling fn
iex> f = fn(x) -> 2 * x end
#Function<erl_eval.6.17052888>
iex> f.(10)
20
In school I learned to call functions by writing f(10) not f.(10) -
this is “really” a function with a name like Shell.f(10) (it’s a
function defined in the shell) The shell part is implicit so it should
just be called f(10).
If you leave it like this expect to spend the next twenty years of
your life explaining why.
Elixir has optional braces for functions, including functions with 0 arity. Let's see an example of why it makes a separate calling syntax important:
defmodule Insanity do
def dive(), do: fn() -> 1 end
end
Insanity.dive
# #Function<0.16121902/0 in Insanity.dive/0>
Insanity.dive()
# #Function<0.16121902/0 in Insanity.dive/0>
Insanity.dive.()
# 1
Insanity.dive().()
# 1
Without making a difference between 2 types of functions, we can't say what Insanity.dive means: getting a function itself, calling it, or also calling the resulting anonymous function.
fn -> syntax is for using anonymous functions. Doing var.() is just telling elixir that I want you to take that var with a func in it and run it instead of referring to the var as something just holding that function.
Elixir has a this common pattern where instead of having logic inside of a function to see how something should execute, we pattern match different functions based on what kind of input we have. I assume this is why we refer to things by arity in the function_name/1 sense.
It's kind of weird to get used to doing shorthand function definitions (func(&1), etc), but handy when you're trying to pipe or keep your code concise.
In elixir we use def for simply define a function like we do in other languages.
fn creates an anonymous function refer to this for more clarification
Only the second kind of function seems to be a first-class object and can be passed as a parameter to other functions. A function defined in a module needs to be wrapped in a fn. There's some syntactic sugar which looks like otherfunction(myfunction(&1, &2)) in order to make that easy, but why is it necessary in the first place? Why can't we just do otherfunction(myfunction))?
You can do otherfunction(&myfunction/2)
Since elixir can execute functions without the brackets (like myfunction), using otherfunction(myfunction)) it will try to execute myfunction/0.
So, you need to use the capture operator and specify the function, including arity, since you can have different functions with the same name. Thus, &myfunction/2.
In C, I could generate an executable, do an extensive rename only refactor, then compare executables again to confirm that the executable did not change. This was very handy to ensure that the refactor did not break anything.
Has anyone done anything similar with Ruby, particularly a Rails app? Strategies and methods would be appreciated. Ideally, I could run a script that output a single file of some sort that was purely bytecode and was not changed by naming changes. I'm guessing JRuby or Rubinus would be helpful here.
I don't think this strategy will work for Ruby. Unlike C, where the compiler throws away the names, most of the things you name in Ruby carry that name with them. That includes classes, modules, constants, and instance variables.
Automated unit and integration tests are the way to go to support Ruby refactoring.
Interesting question -- I like the definitive "yes" answer you can get from this regression strategy, at least for the specific case of rename refactoring.
I'm not expert enough to tell whether you can compile ruby (or at least a subset, without things like eval) but there seem to be some hints at:
http://www.hokstad.com/the-problem-with-compiling-ruby.html
http://rubini.us/2011/03/17/running-ruby-with-no-ruby/
Supposing that a complete compilation isn't possible, what about an abstract interpretation approach? You could parse the ruby into an AST, emit some kind of C code from the AST, and then compile the C code. The C code would not need to fully capture the behavior of the ruby code. It would only need to be compilable and to be distinct whenever the ruby was distinct. (Actually running it could result in gibberish, or perhaps an immediate memory violation error.)
As a simple example, suppose that ruby supported multiplication and C didn't. Then you could include a static mult function in your C code and translate from:
a = b + c*d
to
a = b + mult(c,d)
and the resulting compiled code would be invariant under name refactoring but would show discrepancies under other sorts of change. The mult function need not actually implement multiplication, you could have one of these instead:
static int mult( int a, int b ) { return a + b; } // pretty close
static int mult( int a, int b ) { return *0; } // not close at all, but still sufficient
and you'd still get the invariance you need as long as the C compiler isn't going to inline the definition. The same sort of translation, from an uncompilable ruby construct to a less functional but distinct C construct, should work for object manipulation and so forth, mapping class operations into C structure references. The key point is just that you want to keep the naming relationships intact while sacrificing actual behavior.
(I wonder whether you could do something with a single C struct that has members (all pointers to the same struct type) named after all the class and property names in the ruby code. Class and object operations would then correspond to nested dereference operations using this single structure. Just a notion.)
Even if you cannot formulate a precise mapping, an imprecise mapping that misses some minor distinctions might still be enough to increase confidence in the original name refactoring.
The quickest way to implement such a scheme might be to map from byte code to C (rather from the ruby AST to C). That would save a lot of parsing, but the mapping would be harder to understand and verify.
I need to do some metaprogramming on a large Mathematica code base (hundreds of thousands of lines of code) and don't want to have to write a full-blown parser so I was wondering how best to get the code from a Mathematica notebook out in an easily-parsed syntax.
Is it possible to export a Mathematica notebook in FullForm syntax, or to save all definitions in FullForm syntax?
The documentation for Save says that it can only export in the InputForm syntax, which is non-trivial to parse.
The best solution I have so far is to evaluate the notebook and then use DownValues to extract the rewrite rules with arguments (but this misses symbol definitions) as follows:
DVs[_] := {}
DVs[s_Symbol] := DownValues[s]
stream = OpenWrite["FullForm.m"];
WriteString[stream,
DVs[Symbol[#]] & /# Names["Global`*"] // Flatten // FullForm];
Close[stream];
I've tried a variety of approaches so far but none are working well. Metaprogramming in Mathematica seems to be extremely difficult because it keeps evaluating things that I want to keep unevaluated. For example, I wanted to get the string name of the infinity symbol using SymbolName[Infinity] but the Infinity gets evaluated into a non-symbol and the call to SymbolName dies with an error. Hence my desire to do the metaprogramming in a more suitable language.
EDIT
The best solution seems to be to save the notebooks as package (.m) files by hand and then translate them using the following code:
stream = OpenWrite["EverythingFullForm.m"];
WriteString[stream, Import["Everything.m", "HeldExpressions"] // FullForm];
Close[stream];
You can certainly do this. Here is one way:
exportCode[fname_String] :=
Function[code,
Export[fname, ToString#HoldForm#FullForm#code, "String"],
HoldAllComplete]
For example:
fn = exportCode["C:\\Temp\\mmacode.m"];
fn[
Clear[getWordsIndices];
getWordsIndices[sym_, words : {__String}] :=
Developer`ToPackedArray[words /. sym["Direct"]];
];
And importing this as a string:
In[623]:= Import["C:\\Temp\\mmacode.m","String"]//InputForm
Out[623]//InputForm=
"CompoundExpression[Clear[getWordsIndices], SetDelayed[getWordsIndices[Pattern[sym, Blank[]], \
Pattern[words, List[BlankSequence[String]]]], Developer`ToPackedArray[ReplaceAll[words, \
sym[\"Direct\"]]]], Null]"
However, going to other language to do metaprogramming for Mathematica sounds ridiculous to me, given that Mathematica is very well suited for that. There are many techniques available in Mathematica to do meta-programming and avoid premature evaluation. One that comes to my mind I described in this answer, but there are many others. Since you can operate on parsed code and use the pattern-matching in Mathematica, you save a lot. You can browse the SO Mathematica tags (past questions) and find lots of examples of meta-programming and evaluation control.
EDIT
To ease your pain with auto-evaluating symbols (there are only a few actually, Infinity being one of them).If you just need to get a symbol name for a given symbol, then this function will help:
unevaluatedSymbolName = Function[sym, SymbolName#Unevaluated#sym, HoldAllComplete]
You use it as
In[638]:= unevaluatedSymbolName[Infinity]//InputForm
Out[638]//InputForm="Infinity"
Alternatively, you can simply add HoldFirst attribute to SymbolName function via SetAttributes. One way is to do that globally:
SetAttributes[SymbolName,HoldFirst];
SymbolName[Infinity]//InputForm
Modifying built-in functions globally is however dangerous since it may have unpredictable effects for such a large system as Mathematica:
ClearAttributes[SymbolName, HoldFirst];
Here is a macro to use that locally:
ClearAll[withUnevaluatedSymbolName];
SetAttributes[withUnevaluatedSymbolName, HoldFirst];
withUnevaluatedSymbolName[code_] :=
Internal`InheritedBlock[{SymbolName},
SetAttributes[SymbolName, HoldFirst];
code]
Now,
In[649]:=
withUnevaluatedSymbolName[
{#,StringLength[#]}&[SymbolName[Infinity]]]//InputForm
Out[649]//InputForm= {"Infinity", 8}
You may also wish to do some replacements in a piece of code, say, replace a given symbol by its name. Here is an example code (which I wrap in Hold to prevent it from evaluation):
c = Hold[Integrate[Exp[-x^2], {x, -Infinity, Infinity}]]
The general way to do replacements in such cases is using Hold-attributes (see this answer) and replacements inside held expressions (see this question). For the case at hand:
In[652]:=
withUnevaluatedSymbolName[
c/.HoldPattern[Infinity]:>RuleCondition[SymbolName[Infinity],True]
]//InputForm
Out[652]//InputForm=
Hold[Integrate[Exp[-x^2], {x, -"Infinity", "Infinity"}]]
, although this is not the only way to do this. Instead of using the above macro, we can also encode the modification to SymbolName into the rule itself (here I am using a more wordy form ( Trott - Strzebonski trick) of in-place evaluation, but you can use RuleCondition as well:
ClearAll[replaceSymbolUnevaluatedRule];
SetAttributes[replaceSymbolUnevaluatedRule, HoldFirst];
replaceSymbolUnevaluatedRule[sym_Symbol] :=
HoldPattern[sym] :> With[{eval = SymbolName#Unevaluated#sym}, eval /; True];
Now, for example:
In[629]:=
Hold[Integrate[Exp[-x^2],{x,-Infinity,Infinity}]]/.
replaceSymbolUnevaluatedRule[Infinity]//InputForm
Out[629]//InputForm=
Hold[Integrate[Exp[-x^2], {x, -"Infinity", "Infinity"}]]
Actually, this entire answer is a good demonstration of various meta-programming techniques. From my own experiences, I can direct you to this, this, this, this and this answers of mine, where meta-programming was essential to solve problem I was addressing. You can also judge by the fraction of functions in Mathematica carrying Hold-attributes to all functions - it is about 10-15 percents if memory serves me well. All those functions are effectively macros, operating on code. To me, this is a very indicative fact, telling me that Mathematica jeavily builds on its meta-programming facilities.
The full forms of expressions can be extracted from the Code and Input cells of a notebook as follows:
$exprs =
Cases[
Import["mynotebook.nb", "Notebook"]
, Cell[content_, "Code"|"Input", ___] :>
ToExpression[content, StandardForm, HoldComplete]
, Infinity
] //
Flatten[HoldComplete ## #, 1, HoldComplete] & //
FullForm
$exprs is assigned the expressions read, wrapped in Hold to prevent evaluation. $exprs could then be saved into a text file:
Export["myfile.txt", ToString[$exprs]]
Package files (.m) are slightly easier to read in this way:
Import["mypackage.m", "HeldExpressions"] //
Flatten[HoldComplete ## #, 1, HoldComplete] &
I have what I imagine will be a fairly involved technical challenge: I want to be able to reliably alpha-rename identifiers in multiple languages (as many as possible). This will require special consideration for each language, and I'm asking for advice for how to minimize the amount of work I need to do by sharing code. Something like a unified parsing or abstract syntax framework that already has support for many languages would be great.
For example, here is some python code:
def foo(x):
def bar(y):
return x+y
return bar
An alpha renaming of x to y changes the x to a y and preserves semantics. So it would become:
def foo(y):
def bar(y1):
return y+y1
return bar
See how we needed to rename y to y1 in order to keep from breaking the code? That is why this is a hard problem. It seems like the program would have to have a pretty good knowledge of what constitutes a scope, rather than just doing, say, a string search and replace.
I would also like to preserve as much of the formatting as possible: comments, spacing, indentation. But that is not 100% necessary, it would just be nice.
Any tips?
To do this safely, you need to be able to to determine
all the identifiers (and those things that are not, e.g., the middle of a comment) in your code
the scopes of validity for each identifer
the ability to substitute a new identifier for an old one in the text
the ability to determine if renaming an identifier causes another name to be shadowed
To determine identifiers accurately, you need a least a langauge-accurate lexer. Identifiers in PHP look different than the do in COBOL.
To determine scopes of validity, you have to be determine program structure in practice, since most "scopes" are defined by such structure. This means you need a langauge-accurate parser; scopes in PHP are different than scopes in COBOL.
To determine which names are valid in which scopes, you need to know the language scoping rules. Your language may insist that the identifier X will refer to different Xes depending on the context in which X is found (consider object constructors named X with different arguments). Now you need to be able to traverse the scope structures according to the naming rules. Single inheritance, multiple inheritance, overloading, default types all will pretty much require you to build a model of the scopes for the programs, insert the identifiers and corresponding types into each scope, and then climb from the point of encounter of an identifier in the program text through the various scopes according to the language semantics. You will need symbol tables, inheritance linkages, ASTs, and the ability to navigage all of these. These structures are different from PHP and COBOL, but they share lots of common ideas so you likely need a library with the common concept support.
To rename an identifier, you have to modify the text. In a million lines of code, you need to point carefully. Modifying an AST node is one way to point carefully. Actually, you need to modify all the identifiers that correspond to the one being renamed; you have to climb over the tree to find them all, or record in the AST where all the references exist so they can be found easily. After modifyingy the tree you have to regenerate the source text after modifying the AST. That's a lot of machinery; see my SO answer on how to prettyprint ASTs preseriving all of the stuff you reasonably suggest should be preserved.
(Your other choice is to keep track in the AST of where the text for the string is,
and the read/patch/write the file.)
Before you update the file, you need to check that you haven't shadowed something. Consider this code:
{ local x;
x=1;
{local y;
y=2;
{local z;
z=y
print(x);
}
}
}
We agree this code prints "1". Now we decide to rename y to x.
We've broken the scoping, and now the print statement which referred
conceptually to the outer x refers to an x captured by the renamed y. The code now prints "2", so our rename broke it. This means that one must check all the other identifiers in scopes in which the renamed variable might be found, to see if the new name "captures" some name we weren't expecting. (This would be legal if the print statement printed z).
This is a lot of machinery.
Yes, there is a framework that has almost all of this as well as a number of robust language front ends. See our DMS Software Reengineering Toolkit. It has parsers producing ASTs, prettyprinters to produce text back from ASTs, generic symbol table management machinery (including support for multiple inheritance), AST visiting/modification machinery. Ithas prettyprinting machinery to turn ASTs back into text. It has front ends for C, C++, COBOL and Java that implement name and type resolution (e.g. instanting symbol table scopes and identifier to symbol table entry mappings); it has front ends for many other langauges that don't have scoping implemented yet.
We've just finished an exercise in implementing "rename" for Java. (All the above issues of course appeared). We about about to start one for C++.
You could try to create Xtext based implementations for the involved languages. The Xtext framework provides reliable infrastructure for cross language rename refactoring. However, you'll have to provide a grammar a at least a "good enough" scope resolution for each language.
Languages mostly guarantee tokens will be unique, whatever the context. A naive first approach (and this will break many, many pieces of code) would be:
cp file file.orig
sed -i 's/\b(newTokenName)\b/TEMPTOKEN/g' file
sed -i 's/\b(oldTokenName)\b/newTokenName/g' file
With GNU sed, this will break on PHP. Rewriting \b to a general token match, like ([^a-zA-Z~$-_][^a-zA-Z0-9~$-_]) would work on most C, Java, PHP, and Python, but not Perl (need to add # and % to the token characters. Beyond that, it would require a plugin architecture that works for any language you wanted to add. At some point, there will be two languages whose variable and function naming rules will be incompatible, and at that point, you'll need to do more and more in the plugin.