As part of a program which dynamically loads user inputted strings as Haskell source code, I want to do some pre-processing on the user's input before compiling it.
One of the things I would like to be able to do is to search the source for particular function occurrences and add an extra argument to them. So, for example , I might want all occurrences of :
addThreeNumbers 3 5
To become:
addThreeNumbers 3 5 10
What is the best way of accomplishing such behavior? Is it complicated enough to warrant manipulating some sort of abstract syntax tree with functions in the GHC API / Template Haskell? Or is this something simple that can be accomplished with some sort of Haskell pre-processing / parsing library? If so, what libraries and resources would you recommend?
Ghc 7.6 qualified imports, ghc-pkg hide, and ghc's -package option allow you to seamlessly add a layer between the importing file and the imported file.
Example:
Create a package with your own Data.Char, with standard .cabal file and cabal install.
{-# language PackageImports #-}
module Data.Char (
toUpper
, Char
, String
-- ... Export every else from "Base" Data char because the limitation of
-- the current export facility you can not use
-- module Data.Char hiding (toUpper)
) where
import "base" Data.Char hiding (toUpper)
import qualified "base" Data.Char as OldChar
toUpper :: Char -> IO Char
toUpper c = do
print "Oh Yeahhhhhhhhh"
return $ OldChar.toUpper c
Hide the base package ghc-pkg hide base -- this hides many modules in this case an you need to wrap all of them if you need them.
> ghci -XNoImplicitPrelude -- We need language flag because the Prelude is in
-- base and I did not make a wrapped Prelude
ghci> import Data.Char
ghci> toUpper 'c' -- The wrapped function
"Oh Yeahhhhhhhhh"
'C'
ghci> isSpace ' ' -- The unwrapped normal Data.Char function
True
Now you can use template Haskell to wrap your functions and call any IO action you need to get external information. The Users do not even need to change any of their function calls or module imports with some variation of adding 'internal' to their name.
Being able to wrap module interfaces seamlessly also means you can change the implantation of a imported module without touching the package/module code or the existing code base you are working with either; you only have to make a middle layer.
Edit response to question:
Sure you can the ghc-api lets you do all of that, but is considerably more complex, fewer examples then I would like are floating around and I seem to see more people having a hard time with it then success stories.
For evaluation of code hint
pluggins is suggested for dynamic loading of modules
haskell-src-ext suggested for parsing and changing code. This is what stylish-haskell uses to do small modification to code and is your best bet. It reportedly covers most(all?) of Haskell 2010, and many but not all GHC extensions and is probably your best bet if you do not like the first solution I provided.
The GHC-API is the only one fully compatibly with GHC compatible code as far as I know but is considerably more complex, less well documented, and more likely to change from GHC version to GHC version, or at least there is no promise it will be the same, from my limited experience. I suggested putting a module in the middle because it seemed like the quickest to get working with good test coverage, took the least amount of new knowledge and fulfilled the requirements that I picked out of your question.
Related
I'm studying how dependent pattern matching works in agda.
If I can see elaborated core terms(https://github.com/agda/agda/blob/master/src/full/Agda/Syntax/Internal.hs#L202) of arbitrary source code of .agda file,
it will be really helpful for me.
However, agda cli seems not to offer any options for this usage. Is there any?
There's three options you could try depending on how much detail you want, though none of them are perfect:
If all you want is to see what implicit arguments Agda has inserted, you can enable the flags --show-implicit and --show-irrelevant, create a new hole with the term you want to inspect by adding _ = {! yourTerm !} at the bottom of the file, reload the file with C-c C-l, and then press C-u C-c C-m with the cursor inside the hole. [Writing this out made me realize there ought to be a simpler way to do this.]
If you want to inspect and possibly manipulate the full AST of an Agda term, you can do so using the reflection API (https://agda.readthedocs.io/en/v2.6.2.1/language/reflection.html). In particular, you can get the reflected syntax of an arbitrary Agda term by using the quoteTerm primitive.
Finally, if you need more information you can look in the source code of Agda itself and enable the debug flags for printing the information you want. Note that there is no guarantee that this debug information will be useful or even readable, as it is intended for use by the developers. With that being said, you could for example print the case tree generated from a definition by pattern matching by adding {-# OPTIONS -v tc.cc:12 #-} at the top of your file. In Emacs, this debug information will end up in a separate buffer titled *Agda debug* (which you'll have to open manually after loading the .agda file).
I am new to Haskell, and I have been trying to write a JSON parser using Parsec as an exercise. This has mostly been going well, I am able to parse lists and objects with relatively little code which is also readable (great!). However, for JSON I also need to parse primitives like
Integers (possibly signed)
Floats (possibly using scientific notation such as "3.4e-8")
Strings with e.g. escaped quotes
I was hoping to find ready to use parsers for things like these as part of Parsec. The closest I get is the Parsec.Tokens module (defines integer and friends), but those parsers require a "language definition" that seems way beyond what I should have to make to parse something as simple as JSON -- it appears to be designed for programming languages.
So my questions are:
Are the functions in Parsec.Token the right way to go here? If so, how to make a suitable language definition?
Are "primitive" parsers for integers etc defined somewhere else? Maybe in another package?
Am I supposed to write these kinds of low-level parsers myself? I can see myself reusing them frequently... (obscure scientific data formats etc.)
I have noticed that a question on this site says Megaparsec has these primitives included [1], but I suppose these cannot be used with parsec.
Related questions:
How do I get Parsec to let me call `read` :: Int?
How to parse an Integer with parsec
Are the functions in Parsec.Token the right way to go here?
Yes, they are. If you don't care about the minutiae specified by a language definition (i.e. you don't plan to use the parsers which depend on them, such as identifier or reserved), just use emptyDef as a default:
import Text.Parsec
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (emptyDef)
lexer = P.makeTokenParser emptyDef
integer = P.integer lexer
As you noted, this feels unnecesarily clunky for your use case. It is worth mentioning that megaparsec (cf. Alec's suggestion) provides a corresponding integer parser without the ceremony. (The flip side is that megaparsec doesn't try to bake in support for e.g. reserved words, but that isn't difficult to implement in the cases you actually need it.)
On page 57 of the book "Programming Erlang" by Joe Armstrong (2007) 'lists:map/2' is mentioned in the following way:
Virtually all the modules that I write use functions like
lists:map/2 —this is so common that I almost consider map
to be part of the Erlang language. Calling functions such
as map and filter and partition in the module lists is extremely
common.
The usage of the word 'almost' got me confused about what the difference between Erlang as a whole and the Erlang language might be, and if there even is a difference at all. Is my confusion based on semantics of the word 'language'? It seems to me as if a standard module floats around the borders of what does and does not belong to the actual language it's implemented in. What are the differences between a programming language at it's core and the standard libraries implemented in them?
I'm aware of the fact that this is quite the newby question, but in my experience jumping to my own conclusions can lead to bad things. I was hoping someone could clarify this somewhat.
Consider this simple program:
1> List = [1, 2, 3, 4, 5].
[1,2,3,4,5]
2> Fun = fun(X) -> X*2 end.
#Fun<erl_eval.6.50752066>
3> lists:map(Fun, List).
[2,4,6,8,10]
4> [Fun(X) || X <- List].
[2,4,6,8,10]
Both produce the same output, however the first one list:map/2 is a library function, and the second one is a language construct at its core, called list comprehension. The first one is implemented in Erlang (accidentally also using list comprehension), the second one is parsed by Erlang. The library function can be optimized only as much as the compiler is able to optimize its implementation in Erlang. However, the list comprehension may be optimized as far as being written in assembler in the Beam VM and called from the resulted beam file for maximum performance.
Some language constructs look like they are part of the language, whereas in fact they are implemented in the library, for example spawn/3. When it's used in the code it looks like a keyword, but in Erlang it's not one of the reserved words. Because of that, Erlang compiler will automatically add the erlang module in front of it and call erlang:spawn/3, which is a library function. Those functions are called BIFs (Build-In Functions).
In general, what belongs to the language itself is what that language's compiler can parse and translate to the executable code (or in other words, what's defined by the language's grammar). Everything else is a library. Libraries are usually written in the language for which they are designed, but it doesn't necessarily have to be the case, e.g. some of Erlang library functions are written using C as Erlang NIFs.
I am trying to pretty print an AST generated from
createAstFromFile(|cwd:///Java/Hello.java|,true);
Have I just missed how to do this in the documentation?
If you mean unparsing an AST (getting the Java code back) you will have to write something yourself.
If you however mean printing the AST structure nicely indented, we have iprintln exactly for this purpose.
Also, for large ASTs, the REPL might not like it that much, checkout our (as pf yet) undocumented Fast print functions in util::FastPrint. The fiprintln prints to the rascal output window, which is a lot faster.
No I believe the current release does not contains this feature. If you don't rewrite the AST, you can of course get the source by reading the location, as in:
rascal>import IO;
ok
rascal>readFile(ast#\loc)
str: ...
That only works when the weather is right.. The other solutions are:
to use string templates mapping the AST back to source (simplest)
map ASTs to the Box language and call the format function (most powerful and configurable)
a hybrid of the above
I seem to recall there is a function which maps back M3 ASTs back to JDT ASTs in Java and then calls the pretty print function of JDT, but it looks like it was discontinued. In other words, here are some TODOs.
I'm writing a program where I need to parse a JavaScript source file, extract some facts, and insert/replace portions of the code. A simplified description of the sorts of things I'd need to do is, given this code:
foo(['a', 'b', 'c']);
Extract 'a', 'b', and 'c' and rewrite the code as:
foo('bar', [0, 1, 2]);
I am using ANTLR for my parsing needs, producing C# 3 code. Somebody else had already contributed a JavaScript grammar. The parsing of the source code is working.
The problem I'm encountering is figuring out how to actually properly analyze and modify the source file. Each approach that I try to take in actually solving the problem leads me to a dead end. I can't help but think that I'm not using the tool as it's intended or am just too much of a novice when it comes to dealing with ASTs.
My first approach was to parse using a TokenRewriteStream and implement the EnterRule_* partial methods for the rules I'm interested in. While this seems to make modifying the token stream pretty easy, there is not enough contextual information for my analysis. It seems that all I have access to is a flat stream of tokens, which doesn't tell me enough about the entire structure of code. For example, to detect whether the foo function is being called, simply looking at the first token wouldn't work because that would also falsely match:
a.b.foo();
To allow me to do more sophisticated code analysis, my second approach was to modify the grammar with rewrite rules to produce more of a tree. Now, the first sample code block produces this:
Program
CallExpression
Identifier('foo')
ArgumentList
ArrayLiteral
StringLiteral('a')
StringLiteral('b')
StringLiteral('c')
This is working great for analyzing the code. However, now I am unable to easily rewrite the code. Sure, I could modify the tree structure to represent the code I want, but I can't use this to output source code. I had hoped that the token associated with each node would at least give me enough information to know where in the original text I would need to make the modifications, but all I get are token indexes or line/column numbers. To use the line and column numbers, I would have to make an awkward second pass through the source code.
I suspect I'm missing something in understanding how to properly use ANTLR to do what I need. Is there a more proper way for me to solve this problem?
What you are trying to do is called program transformation, that is, the automated generation of one program from another. What you are doing "wrong" is assuming is parser is all you need, and discovering that it isn't and that you have to fill in the gap.
Tools that do that this well have parsers (to build ASTs), means to modify the ASTs (both procedural and pattern directed), and prettyprinters which convert the (modified) AST back into legal source code. You seem to be struggling with the the fact that ANTLR doesn't come with prettyprinters; that's not part of its philosophy; ANTLR is a (fine) parser-generator. Other answers have suggested using ANTLR's "string templates", which are not by themselves prettyprinters, but can be used to implement one, at the price of implementing one. This harder to do than it looks; see my SO answer on compiling an AST back to source code.
The real issue here is the widely made but false assumption that "if I have a parser, I'm well on my way to building complex program analysis and transformation tools." See my essay on Life After Parsing for a long discussion of this; basically, you need a lot more tooling that "just" a parser to do this, unless you want to rebuild a significant fraction of the infrastructure by yourself instead of getting on with your task. Other useful features of practical program transformation systems include typically source-to-source transformations, which considerably simplify the problem of finding and replacing complex patterns in trees.
For instance, if you had source-to-source transformation capabilities (of our tool, the DMS Software Reengineering Toolkit, you'd be able to write parts of your example code changes using these DMS transforms:
domain ECMAScript.
tag replace; -- says this is a special kind of temporary tree
rule barize(function_name:IDENTIFIER,list:expression_list,b:body):
expression->expression
= " \function_name ( '[' \list ']' ) "
-> "\function_name( \firstarg\(\function_name\), \replace\(\list\))";
rule replace_unit_list(s:character_literal):
expression_list -> expression_list
replace(s) -> compute_index_for(s);
rule replace_long_list(s:character_list, list:expression_list):
expression_list -> expression_list
"\replace\(\s\,\list)-> "compute_index_for\(\s\),\list";
with rule-external "meta" procedures "first_arg" (which knows how to compute "bar" given the identifier "foo" [I'm guessing you want to do this), and "compute_index_for" which given a string literals, knows what integer to replace it with.
Individual rewrite rules have parameter lists "(....)" in which slots representing subtrees are named, a left-hand side acting as a pattern to match, and an right hand side acting as replacement, both usually quoted in metaquotes " which seperates rewrite-rule language text from target-language (e.g. JavaScript) text. There's lots of meta-escapes ** found inside the metaquotes which indicate a special rewrite-rule-language item. Typically these are parameter names, and represent whatever type of name tree the parameter represents, or represent an external meta procedure call (such as first_arg; you'll note the its argument list ( , ) is metaquoted!), or finally, a "tag" such as "replace", which is a peculiar kind of tree that represent future intent to do more transformations.
This particular set of rules works by replacing a candidate function call by the barized version, with the additional intent "replace" to transform the list. The other two transformations realize the intent by transforming "replace" away by processing elements of the list one at a time, and pushing the replace further down the list until it finally falls off the end and the replacement is done. (This is the transformational equivalent of a loop).
Your specific example may vary somewhat since you really weren't precise about the details.
Having applied these rules to modify the parsed tree, DMS can then trivially prettyprint the result (the default behavior in some configurations is "parse to AST, apply rules until exhaustion, prettyprint AST" because this is handy).
You can see a complete process of "define language", "define rewrite rules", "apply rules and prettyprint" at (High School) Algebra as a DMS domain.
Other program transformation systems include TXL and Stratego. We imagine DMS as the industrial strength version of these, in which we have built all that infrastructure including many standard language parsers and prettyprinters.
So it's turning out that I can actually use a rewriting tree grammar and insert/replace tokens using a TokenRewriteStream. Plus, it's actually really easy to do. My code resembles the following:
var charStream = new ANTLRInputStream(stream);
var lexer = new JavaScriptLexer(charStream);
var tokenStream = new TokenRewriteStream(lexer);
var parser = new JavaScriptParser(tokenStream);
var program = parser.program().Tree as Program;
var dependencies = new List<IModule>();
var functionCall = (
from callExpression in program.Children.OfType<CallExpression>()
where callExpression.Children[0].Text == "foo"
select callExpression
).Single();
var argList = functionCall.Children[1] as ArgumentList;
var array = argList.Children[0] as ArrayLiteral;
tokenStream.InsertAfter(argList.Token.TokenIndex, "'bar', ");
for (var i = 0; i < array.Children.Count(); i++)
{
tokenStream.Replace(
(array.Children[i] as StringLiteral).Token.TokenIndex,
i.ToString());
}
var rewrittenCode = tokenStream.ToString();
Have you looked at the string template library. It is by the same person who wrote ANTLR and they are intended to work together. It sounds like it would suit do what your looking for ie. output matched grammar rules as formatted text.
Here is an article on translation via ANTLR