How do I inspect the in-memory representation of OCaml values? - memory

Is there any way to inspect the in-memory representation of OCaml values in OCaml, without using something like gdb? Ideally, the output would give me either hex bytes or binary bits, similar to what you can get via gdb.
The Obj module looks promising, and Real World OCaml has a section about it, but does it offer a single way to get in-memory representation, without having to know how every OCaml value is laid out in memory?

I searched the net and I found an OCaml package named "inspect" that seems to do something like what you want. Here's what it shows for the values true and [1; 2; 3]:
# Sexpr.dump true;;
(DUMP 1)- : unit = ()
# Sexpr.dump [1;2;3];;
(DUMP
(BLK/0
:TAG 0
:VALUES
1
(BLK/1 :TAG 0 :VALUES 2 (BLK/2 :TAG 0 :VALUES 3 0))))- : unit = ()
It's available as an opam package named "inspect". The Github page for the package repository is https://github.com/krohrer/caml-inspect.
The output won't make sense without some familiarity with the layout of OCaml values. That's unavoidable I think. There is information on the Github page and in the OCaml manual.

Related

Extracting values from a deeply nested data structure in haskell

I've been trying to work out how to use the language-bash package to parse some simple bash scripts, and I've come across the following structure
Right (List [Statement (Last (Pipeline {timed = False, timedPosix = False, inverted = False, commands = [Command (SimpleCommand [Assign (Parameter "x" Nothing) Equals (RValue [Char '3'])] []) []]})) Sequential])
as a result of running
import Language.Bash.Parse
parse "" "x=3"
I could theoretically just pattern match the whole thing away, though I was wondering if there was a cleaner way of accessing the values of the Assign datatype ('x', (Char '3').
Is there anyway to cleanly access those values (or generally access values in a complex datastructure) without obsessive pattern matching ?
Not really.
Here's the problem. You probably want to either handle an extremely limited set of possible Bash statements, in which case just writing out the patterns for specific List values will be faster than anything else you could possibly do.
Or, you want to handle a wide variety of Bash statements, in which case you can't really avoid the functional infrastructure to handle general List values. The same way you'd write an interpreter or compiler for any complex abstract syntax tree, you'll end up more or less writing a function for every (major) type and a case for every constructor.
The main Haskell tools for dealing with big, complex data structures are:
The "functional infrastructure" described above. That is, plain old functions defined using pattern matching, that process recursive data structures in a manner that mirrors the structures themselves. Don't underestimate this approach! It may seem like a lot of work, but it's likely to lead you to a correct program that handles all well-formed inputs, in a way that ad hoc approaches won't. Start with:
{-# OPTIONS_GHC -Wall #-}
data M = ... some monad ...
data Result = ... representation of what you want to extract from the script ...
processList :: List -> M Result
...
processStatement :: Statement -> M Result
...
and go from there. The -Wall is important to get the -Wincomplete-patterns warning so you don't miss any constructors.
Lenses, which provide a more ergonomic hierarchical syntax for referring to parts of deeply nested data structures. Since bash-language doesn't provide lenses for these structures, you'd need to write them yourself. They might allow you to write something along the lines of:
lst ^. _Right.statements._head.andOr.pipeline.commands.
_head._SimpleCommand.assignments._head.parameter.base
to extract the "x" from "x=3". Obviously, that doesn't help much, but lenses complement the "functional infrastructure" approach. The code to actually process all those types is often more easily expressed with lenses than pattern matching.
Generics, which allow you to generically access certain patterns within recursive data structures, while ignoring the "rest" of the data structure that you don't care about. The bash-language library includes deriving clauses for both Data and Generic generics. If it didn't, you could use StandaloneDeriving clauses to derive them. As an example, you can use Data generics to extract all Parameters from a List, regardless of where those Parameters appear, with something like:
import Language.Bash.Parse
import Language.Bash.Word
import Data.Data
import Data.Generics.Schemes
import Data.Generics.Aliases
parameters :: (Data a) => a -> [Parameter]
parameters = everything (++) (mkQ [] (\p -> [p]))
main = do
let Right lst = parse "" "x=3; y=4; LANG=C echo $x $y"
print $ parameters lst
Here, this prints out a list of all parameters appearing in this shell "script", whether for purposes of assignment or substitution, so it includes: "x", "y", "LANG", and "x" and "y" again.
This is a powerful tool, but it's likely to be applicable to only a few specific use-cases.
Ultimately, you'll probably have to take the view that you are writing a Bash interpreter (even if your interpreter does something besides "executing" the Bash script). Someone's been nice enough to supply a Bash parser to get the input source code into an AST, but the other half of the interpreter -- the actual interpretation itself -- still needs to be written by you.

What can i get from the lua bytecode retrivied via string.dump()?

I’m using lua + luajit 2.0.4 and I’m wondering - Is it possible to restore the original parts of the code from the dumps of lua functions?
function a(l)
if l > 3 then
print(l*l)
end
end
local b = string.dump(a)
In this example, I am doing the string.dump of the 'a' function, and here I come to the questions like:
Is it possible to write this dump into a .txt file?
Is it possible to get the original names of functions, variables, and upvalues?
Is it possible to get strings, numbers, tables?
Is it possible to restore it to the full code, and if not, is it possible to get a disassembled listing?
"Yes" to all questions with a couple of caveats. For (1), make sure that "b" is used as part of the "mode" parameter in io.open on Windows, as the output of string.dump will have some binary content. For (2), it's only true when string.dump is used without the strip option, which was added in LuaJIT:
string.dump(f [,strip])
An extra argument has been added to string.dump(). If set to true,
'stripped' bytecode without debug information is generated. This
speeds up later bytecode loading and reduces memory usage.
For (4), I found this document to be very useful: http://files.catwell.info/misc/mirror/lua-5.2-bytecode-vm-dirk-laurie/lua52vm.html (it's for Lua 5.2, but most of the content applies to LuaJIT as well); it also include a section on the difference between full and stripped bytecode that may answer some of your questions.

Is a 'standard module' part of the 'programming language'?

On page 57 of the book "Programming Erlang" by Joe Armstrong (2007) 'lists:map/2' is mentioned in the following way:
Virtually all the modules that I write use functions like
lists:map/2 —this is so common that I almost consider map
to be part of the Erlang language. Calling functions such
as map and filter and partition in the module lists is extremely
common.
The usage of the word 'almost' got me confused about what the difference between Erlang as a whole and the Erlang language might be, and if there even is a difference at all. Is my confusion based on semantics of the word 'language'? It seems to me as if a standard module floats around the borders of what does and does not belong to the actual language it's implemented in. What are the differences between a programming language at it's core and the standard libraries implemented in them?
I'm aware of the fact that this is quite the newby question, but in my experience jumping to my own conclusions can lead to bad things. I was hoping someone could clarify this somewhat.
Consider this simple program:
1> List = [1, 2, 3, 4, 5].
[1,2,3,4,5]
2> Fun = fun(X) -> X*2 end.
#Fun<erl_eval.6.50752066>
3> lists:map(Fun, List).
[2,4,6,8,10]
4> [Fun(X) || X <- List].
[2,4,6,8,10]
Both produce the same output, however the first one list:map/2 is a library function, and the second one is a language construct at its core, called list comprehension. The first one is implemented in Erlang (accidentally also using list comprehension), the second one is parsed by Erlang. The library function can be optimized only as much as the compiler is able to optimize its implementation in Erlang. However, the list comprehension may be optimized as far as being written in assembler in the Beam VM and called from the resulted beam file for maximum performance.
Some language constructs look like they are part of the language, whereas in fact they are implemented in the library, for example spawn/3. When it's used in the code it looks like a keyword, but in Erlang it's not one of the reserved words. Because of that, Erlang compiler will automatically add the erlang module in front of it and call erlang:spawn/3, which is a library function. Those functions are called BIFs (Build-In Functions).
In general, what belongs to the language itself is what that language's compiler can parse and translate to the executable code (or in other words, what's defined by the language's grammar). Everything else is a library. Libraries are usually written in the language for which they are designed, but it doesn't necessarily have to be the case, e.g. some of Erlang library functions are written using C as Erlang NIFs.

Using ANTLR to analyze and modify source code; am I doing it wrong?

I'm writing a program where I need to parse a JavaScript source file, extract some facts, and insert/replace portions of the code. A simplified description of the sorts of things I'd need to do is, given this code:
foo(['a', 'b', 'c']);
Extract 'a', 'b', and 'c' and rewrite the code as:
foo('bar', [0, 1, 2]);
I am using ANTLR for my parsing needs, producing C# 3 code. Somebody else had already contributed a JavaScript grammar. The parsing of the source code is working.
The problem I'm encountering is figuring out how to actually properly analyze and modify the source file. Each approach that I try to take in actually solving the problem leads me to a dead end. I can't help but think that I'm not using the tool as it's intended or am just too much of a novice when it comes to dealing with ASTs.
My first approach was to parse using a TokenRewriteStream and implement the EnterRule_* partial methods for the rules I'm interested in. While this seems to make modifying the token stream pretty easy, there is not enough contextual information for my analysis. It seems that all I have access to is a flat stream of tokens, which doesn't tell me enough about the entire structure of code. For example, to detect whether the foo function is being called, simply looking at the first token wouldn't work because that would also falsely match:
a.b.foo();
To allow me to do more sophisticated code analysis, my second approach was to modify the grammar with rewrite rules to produce more of a tree. Now, the first sample code block produces this:
Program
CallExpression
Identifier('foo')
ArgumentList
ArrayLiteral
StringLiteral('a')
StringLiteral('b')
StringLiteral('c')
This is working great for analyzing the code. However, now I am unable to easily rewrite the code. Sure, I could modify the tree structure to represent the code I want, but I can't use this to output source code. I had hoped that the token associated with each node would at least give me enough information to know where in the original text I would need to make the modifications, but all I get are token indexes or line/column numbers. To use the line and column numbers, I would have to make an awkward second pass through the source code.
I suspect I'm missing something in understanding how to properly use ANTLR to do what I need. Is there a more proper way for me to solve this problem?
What you are trying to do is called program transformation, that is, the automated generation of one program from another. What you are doing "wrong" is assuming is parser is all you need, and discovering that it isn't and that you have to fill in the gap.
Tools that do that this well have parsers (to build ASTs), means to modify the ASTs (both procedural and pattern directed), and prettyprinters which convert the (modified) AST back into legal source code. You seem to be struggling with the the fact that ANTLR doesn't come with prettyprinters; that's not part of its philosophy; ANTLR is a (fine) parser-generator. Other answers have suggested using ANTLR's "string templates", which are not by themselves prettyprinters, but can be used to implement one, at the price of implementing one. This harder to do than it looks; see my SO answer on compiling an AST back to source code.
The real issue here is the widely made but false assumption that "if I have a parser, I'm well on my way to building complex program analysis and transformation tools." See my essay on Life After Parsing for a long discussion of this; basically, you need a lot more tooling that "just" a parser to do this, unless you want to rebuild a significant fraction of the infrastructure by yourself instead of getting on with your task. Other useful features of practical program transformation systems include typically source-to-source transformations, which considerably simplify the problem of finding and replacing complex patterns in trees.
For instance, if you had source-to-source transformation capabilities (of our tool, the DMS Software Reengineering Toolkit, you'd be able to write parts of your example code changes using these DMS transforms:
domain ECMAScript.
tag replace; -- says this is a special kind of temporary tree
rule barize(function_name:IDENTIFIER,list:expression_list,b:body):
expression->expression
= " \function_name ( '[' \list ']' ) "
-> "\function_name( \firstarg\(\function_name\), \replace\(\list\))";
rule replace_unit_list(s:character_literal):
expression_list -> expression_list
replace(s) -> compute_index_for(s);
rule replace_long_list(s:character_list, list:expression_list):
expression_list -> expression_list
"\replace\(\s\,\list)-> "compute_index_for\(\s\),\list";
with rule-external "meta" procedures "first_arg" (which knows how to compute "bar" given the identifier "foo" [I'm guessing you want to do this), and "compute_index_for" which given a string literals, knows what integer to replace it with.
Individual rewrite rules have parameter lists "(....)" in which slots representing subtrees are named, a left-hand side acting as a pattern to match, and an right hand side acting as replacement, both usually quoted in metaquotes " which seperates rewrite-rule language text from target-language (e.g. JavaScript) text. There's lots of meta-escapes ** found inside the metaquotes which indicate a special rewrite-rule-language item. Typically these are parameter names, and represent whatever type of name tree the parameter represents, or represent an external meta procedure call (such as first_arg; you'll note the its argument list ( , ) is metaquoted!), or finally, a "tag" such as "replace", which is a peculiar kind of tree that represent future intent to do more transformations.
This particular set of rules works by replacing a candidate function call by the barized version, with the additional intent "replace" to transform the list. The other two transformations realize the intent by transforming "replace" away by processing elements of the list one at a time, and pushing the replace further down the list until it finally falls off the end and the replacement is done. (This is the transformational equivalent of a loop).
Your specific example may vary somewhat since you really weren't precise about the details.
Having applied these rules to modify the parsed tree, DMS can then trivially prettyprint the result (the default behavior in some configurations is "parse to AST, apply rules until exhaustion, prettyprint AST" because this is handy).
You can see a complete process of "define language", "define rewrite rules", "apply rules and prettyprint" at (High School) Algebra as a DMS domain.
Other program transformation systems include TXL and Stratego. We imagine DMS as the industrial strength version of these, in which we have built all that infrastructure including many standard language parsers and prettyprinters.
So it's turning out that I can actually use a rewriting tree grammar and insert/replace tokens using a TokenRewriteStream. Plus, it's actually really easy to do. My code resembles the following:
var charStream = new ANTLRInputStream(stream);
var lexer = new JavaScriptLexer(charStream);
var tokenStream = new TokenRewriteStream(lexer);
var parser = new JavaScriptParser(tokenStream);
var program = parser.program().Tree as Program;
var dependencies = new List<IModule>();
var functionCall = (
from callExpression in program.Children.OfType<CallExpression>()
where callExpression.Children[0].Text == "foo"
select callExpression
).Single();
var argList = functionCall.Children[1] as ArgumentList;
var array = argList.Children[0] as ArrayLiteral;
tokenStream.InsertAfter(argList.Token.TokenIndex, "'bar', ");
for (var i = 0; i < array.Children.Count(); i++)
{
tokenStream.Replace(
(array.Children[i] as StringLiteral).Token.TokenIndex,
i.ToString());
}
var rewrittenCode = tokenStream.ToString();
Have you looked at the string template library. It is by the same person who wrote ANTLR and they are intended to work together. It sounds like it would suit do what your looking for ie. output matched grammar rules as formatted text.
Here is an article on translation via ANTLR

Save Mathematica code in `FullForm` syntax

I need to do some metaprogramming on a large Mathematica code base (hundreds of thousands of lines of code) and don't want to have to write a full-blown parser so I was wondering how best to get the code from a Mathematica notebook out in an easily-parsed syntax.
Is it possible to export a Mathematica notebook in FullForm syntax, or to save all definitions in FullForm syntax?
The documentation for Save says that it can only export in the InputForm syntax, which is non-trivial to parse.
The best solution I have so far is to evaluate the notebook and then use DownValues to extract the rewrite rules with arguments (but this misses symbol definitions) as follows:
DVs[_] := {}
DVs[s_Symbol] := DownValues[s]
stream = OpenWrite["FullForm.m"];
WriteString[stream,
DVs[Symbol[#]] & /# Names["Global`*"] // Flatten // FullForm];
Close[stream];
I've tried a variety of approaches so far but none are working well. Metaprogramming in Mathematica seems to be extremely difficult because it keeps evaluating things that I want to keep unevaluated. For example, I wanted to get the string name of the infinity symbol using SymbolName[Infinity] but the Infinity gets evaluated into a non-symbol and the call to SymbolName dies with an error. Hence my desire to do the metaprogramming in a more suitable language.
EDIT
The best solution seems to be to save the notebooks as package (.m) files by hand and then translate them using the following code:
stream = OpenWrite["EverythingFullForm.m"];
WriteString[stream, Import["Everything.m", "HeldExpressions"] // FullForm];
Close[stream];
You can certainly do this. Here is one way:
exportCode[fname_String] :=
Function[code,
Export[fname, ToString#HoldForm#FullForm#code, "String"],
HoldAllComplete]
For example:
fn = exportCode["C:\\Temp\\mmacode.m"];
fn[
Clear[getWordsIndices];
getWordsIndices[sym_, words : {__String}] :=
Developer`ToPackedArray[words /. sym["Direct"]];
];
And importing this as a string:
In[623]:= Import["C:\\Temp\\mmacode.m","String"]//InputForm
Out[623]//InputForm=
"CompoundExpression[Clear[getWordsIndices], SetDelayed[getWordsIndices[Pattern[sym, Blank[]], \
Pattern[words, List[BlankSequence[String]]]], Developer`ToPackedArray[ReplaceAll[words, \
sym[\"Direct\"]]]], Null]"
However, going to other language to do metaprogramming for Mathematica sounds ridiculous to me, given that Mathematica is very well suited for that. There are many techniques available in Mathematica to do meta-programming and avoid premature evaluation. One that comes to my mind I described in this answer, but there are many others. Since you can operate on parsed code and use the pattern-matching in Mathematica, you save a lot. You can browse the SO Mathematica tags (past questions) and find lots of examples of meta-programming and evaluation control.
EDIT
To ease your pain with auto-evaluating symbols (there are only a few actually, Infinity being one of them).If you just need to get a symbol name for a given symbol, then this function will help:
unevaluatedSymbolName = Function[sym, SymbolName#Unevaluated#sym, HoldAllComplete]
You use it as
In[638]:= unevaluatedSymbolName[Infinity]//InputForm
Out[638]//InputForm="Infinity"
Alternatively, you can simply add HoldFirst attribute to SymbolName function via SetAttributes. One way is to do that globally:
SetAttributes[SymbolName,HoldFirst];
SymbolName[Infinity]//InputForm
Modifying built-in functions globally is however dangerous since it may have unpredictable effects for such a large system as Mathematica:
ClearAttributes[SymbolName, HoldFirst];
Here is a macro to use that locally:
ClearAll[withUnevaluatedSymbolName];
SetAttributes[withUnevaluatedSymbolName, HoldFirst];
withUnevaluatedSymbolName[code_] :=
Internal`InheritedBlock[{SymbolName},
SetAttributes[SymbolName, HoldFirst];
code]
Now,
In[649]:=
withUnevaluatedSymbolName[
{#,StringLength[#]}&[SymbolName[Infinity]]]//InputForm
Out[649]//InputForm= {"Infinity", 8}
You may also wish to do some replacements in a piece of code, say, replace a given symbol by its name. Here is an example code (which I wrap in Hold to prevent it from evaluation):
c = Hold[Integrate[Exp[-x^2], {x, -Infinity, Infinity}]]
The general way to do replacements in such cases is using Hold-attributes (see this answer) and replacements inside held expressions (see this question). For the case at hand:
In[652]:=
withUnevaluatedSymbolName[
c/.HoldPattern[Infinity]:>RuleCondition[SymbolName[Infinity],True]
]//InputForm
Out[652]//InputForm=
Hold[Integrate[Exp[-x^2], {x, -"Infinity", "Infinity"}]]
, although this is not the only way to do this. Instead of using the above macro, we can also encode the modification to SymbolName into the rule itself (here I am using a more wordy form ( Trott - Strzebonski trick) of in-place evaluation, but you can use RuleCondition as well:
ClearAll[replaceSymbolUnevaluatedRule];
SetAttributes[replaceSymbolUnevaluatedRule, HoldFirst];
replaceSymbolUnevaluatedRule[sym_Symbol] :=
HoldPattern[sym] :> With[{eval = SymbolName#Unevaluated#sym}, eval /; True];
Now, for example:
In[629]:=
Hold[Integrate[Exp[-x^2],{x,-Infinity,Infinity}]]/.
replaceSymbolUnevaluatedRule[Infinity]//InputForm
Out[629]//InputForm=
Hold[Integrate[Exp[-x^2], {x, -"Infinity", "Infinity"}]]
Actually, this entire answer is a good demonstration of various meta-programming techniques. From my own experiences, I can direct you to this, this, this, this and this answers of mine, where meta-programming was essential to solve problem I was addressing. You can also judge by the fraction of functions in Mathematica carrying Hold-attributes to all functions - it is about 10-15 percents if memory serves me well. All those functions are effectively macros, operating on code. To me, this is a very indicative fact, telling me that Mathematica jeavily builds on its meta-programming facilities.
The full forms of expressions can be extracted from the Code and Input cells of a notebook as follows:
$exprs =
Cases[
Import["mynotebook.nb", "Notebook"]
, Cell[content_, "Code"|"Input", ___] :>
ToExpression[content, StandardForm, HoldComplete]
, Infinity
] //
Flatten[HoldComplete ## #, 1, HoldComplete] & //
FullForm
$exprs is assigned the expressions read, wrapped in Hold to prevent evaluation. $exprs could then be saved into a text file:
Export["myfile.txt", ToString[$exprs]]
Package files (.m) are slightly easier to read in this way:
Import["mypackage.m", "HeldExpressions"] //
Flatten[HoldComplete ## #, 1, HoldComplete] &

Resources