HTML is, above all, a tree of nodes. It's not just text.
However, most templating engines handle their input and output as if it were just text; they don't care what happens around their tags, their {$foo}'s and <% bar() %>'s, and they don't care what they are outputting. Sometimes they happen to produce correct HTML, but that's just a coincidence; they didn't aim for that. All they wanted was to replace some funny marks in the text stream with their evaluation.
There are a few templating engines which do treat their output as a set of nodes; XSLT and Haml come to mind. For some tasks, this has advantages: for example, you can automatically reformat the output (delete all empty text nodes, auto-indent, word-wrap). The result is guaranteed to be correct XML/SGML unless you use a strict subset of operations that can break that. Also, such a templating engine would automatically quote strings, differently in text nodes and in attributes, because it knows exactly whether you're writing an attribute or a text node.
Moreover, it can conditionally remove a node from the output because it knows where the node begins and ends, which is useful, and it can do other non-trivial node operations.
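To make the two points above concrete, here is a minimal sketch that uses Python's standard xml.etree.ElementTree as a stand-in for a node-aware engine. The function render_link and its parameters are made up for illustration; the point is only that attribute values and text nodes are escaped differently, and that a node can be dropped as a unit.

```python
import xml.etree.ElementTree as ET

def render_link(url, label, show_badge):
    """Build a small tree and serialize it; escaping is left to the serializer."""
    p = ET.Element("p")
    a = ET.SubElement(p, "a", {"href": url})  # attribute value
    a.text = label                            # text node
    badge = ET.SubElement(p, "span", {"class": "badge"})
    badge.text = "new"
    if not show_badge:
        # Remove the whole node; there are no open/close tags to balance by hand.
        p.remove(badge)
    return ET.tostring(p, encoding="unicode")

html = render_link('/search?q="a"&b=1', 'Tom & "Jerry" <3', show_badge=False)
print(html)
```

In the attribute, the double quotes come out as &quot;, while in the text node they are left alone; the ampersand and the angle bracket are escaped in both places.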
You might not like XSLT for its verbosity or its functional style, but it damn well helps that your template is xmllint-able XML and your output is good SGML/XML.
So the question is: Which template engines do you know that treat their output as a set of correct nodes, not just unstructured text?
I know XSLT, Haml and some obscure python-based one. Moar!
Which template engines do you know that treat their output as a set of correct nodes
Surprisingly ASP.NET does! You can change the HTML output of the page through a kind of DOM if you want: http://en.wikipedia.org/wiki/Asp.net#Rendering_technique
HTML combinator libraries, e.g.:
Wing Beats (F#)
SharpDOM (C#)
CityLizard (.NET, any language)
BlazeHtml (Haskell)
Ocsigen / Eliom (OCaml)
XML literals as a language feature can easily act as a "HTML template language"
VB.NET XML literals
Hasic (uses VB.NET XML literals)
Scala XML literals
Hamlet (uses Template Haskell)
though these are all embedded DSLs, dependent on their host language
There is a standard way of representing XML (or a subset of it such as XHTML) in Scheme known as SXML. This is one of the things that, in my opinion, make Scheme a good language for web development. It is possible to build up the contents of a document as a native Scheme list, and then render this to (correct) XHTML in one function call.
Here is an example that takes a simple text string, and wraps it as the contents of a one-paragraph HTML page. So the function as-page is acting as a template; its output is a Scheme list which can be easily translated to its equivalent HTML. Unbalanced or malformed tags are not possible with this approach.
(use-modules (sxml simple))

(define (as-page txt)
  `(html
    (head (title "A web page"))
    (body (p ,txt))))

(as-page "It works!!!!")
;; $2 = (html (head (title "A web page")) (body (p "It works!!!!")))
(sxml->xml (as-page "It works!!!!"))
;; prints: <html><head><title>A web page</title></head><body><p>It works!!!!</p></body></html>
TAL (originally part of Zope but now implemented in a variety of languages) is XML-based. It's very logical and intention-revealing to work with - instead of shoving in a heap of text you're telling the template something like "set the href attribute of this link to http://google.com/ and set its text content to 'Search Google'". You don't have to manage which strings need to be escaped - generally if you intend something to be interpreted as markup, you put it in a template, and if you don't intend it to be interpreted as markup you feed it in as an attribute value or text content and TAL will escape it correctly.
Basically all templating engines which use XML as their file format (for defining templates). By using XML, they enforce that the file must be well-formed.
[EDIT] Examples are: Genshi (Python) or JSP 2.0 (Java).
With the Nagare web framework, the views are always a tree of XML nodes, directly built in Python.
The tree can then be manipulated in Python, transformed with XSL, serialized in HTML or XHTML ...
(the 'nagare.namespaces' package comes with the Nagare project but can be used in any Python application)
Example:
>>> from nagare.namespaces import xhtml_base
>>> h = xhtml_base.Renderer()  # The XHTML tree builder
>>>
>>> # Building the tree of nodes
>>> with h.html:
...     with h.body:
...         h << h.h1('Hello world')
...
>>> tree = h.root  # The tree root element
>>>
>>> print tree.write_xmlstring()  # Tree serialized in XML
<html><body><h1>Hello world</h1></body></html>
I maintain a list of push-style templating systems here:
http://perlmonks.org/?node_id=674273
And am in the process of evaluating various Perl templating systems for their separation index:
http://bit.ly/bXaYt7
But the tree-based one is written by me, HTML::Seamstress -
http://search.cpan.org/dist/HTML-Seamstress/
The term "push-style" comes from Terence Parr's paper "Enforcing Strict Model-View Separation in Template Engines" -
http://www.cs.usfca.edu/~parrt/papers/mvc.templates.pdf
Also, http://snapframework.com/docs/tutorials/heist from Haskell's snap seems to fit.
TAL is absolutely not push-style. It may be XML-based but it is pull-style (the most degenerate form of push-style).
Related
So as stated in the title, my task is to traverse the Parse Tree generated for code written in Java (grammar is a standard Java grammar), print most of it unchanged and modify only some words, for example type declarations.
My current approach was to create ParseTreeListener and implement the logic in the enterEveryRule method, but unfortunately it doesn't appear to work even for basic printing. The output is very messy and there are a lot of repetitions, as if every node was visited multiple times.
My other attempt was to implement the appropriate methods in BaseListener to make the changes to the type declarations I need, but from there I see no way to print the rest of the code unchanged.
Looking forward to your help!
You could use ANTLR's string templates to produce code from the ASTs.
In general, you start with a set of "standard" string templates that can regenerate source code corresponding to the underlying tree.
To get the effect you want, you judiciously choose the standard string templates on AST nodes where you don't want changes, and variant templates where you do want changes.
IMHO, it is better to modify the AST, and then simply apply the standard templates.
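As an analogy only, here is a rough sketch of "modify the AST, then regenerate the source" using Python's standard ast module on Python source, not ANTLR and StringTemplate on Java. It assumes Python 3.9+ for ast.unparse; the names RetypeAnnotations and retype are invented for this example.

```python
import ast

class RetypeAnnotations(ast.NodeTransformer):
    """Rename one annotation type and leave every other node alone."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_AnnAssign(self, node):
        self.generic_visit(node)
        if isinstance(node.annotation, ast.Name) and node.annotation.id == self.old:
            node.annotation = ast.copy_location(
                ast.Name(id=self.new, ctx=ast.Load()), node.annotation)
        return node

def retype(source, old, new):
    tree = RetypeAnnotations(old, new).visit(ast.parse(source))
    # ast.unparse plays the role of the "standard templates": it regenerates
    # text from the tree, so original layout and comments are not preserved.
    return ast.unparse(tree)

print(retype("count: int = 1\ntotal = count + 2\n", "int", "float"))
```

The unchanged statements come back out through the same generic printer as the changed one, which is the property the answer is recommending.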
I am working on a complicated system that uses a number of XML schemas and associated parsers. One of the schemas is used to hold general data that are accessed by all of the other schemas. I would like to maintain this division in the (flex and bison) parsers. So, if I parse the main XML file and get to, say, the tag <matrix>, I would like to call a <matrix> parser as a subroutine, return its content to the calling program and continue parsing there after the </matrix> tag. I have been looking around the net, but I have not found anything useful. Is it even possible to do this?
It seems easiest to maintain the common pieces in a separate file and to split the individual parser components into two more files: Part 1 has the Prologue and the individual grammar rules, part 2 has the epilogue. Then the three files can be concatenated (in a Makefile) before calling the parser:
parser.y: parser.part1 common.inc parser.part2
	cat parser.part1 common.inc parser.part2 > parser.y
Your approach is wrong. You shouldn't need a special parser for each distinct tag. You should parse all tags regardless of their properties and link them into a tree. Afterwards you can validate the tree to check that the tags are nested consistently. If the markup language you're talking about is really that special, then you could create a parser that takes rules describing each tag. In this case parsing and checking are done at the same time; most HTML parsers are implemented like this.
I'm writing a program where I need to parse a JavaScript source file, extract some facts, and insert/replace portions of the code. A simplified description of the sorts of things I'd need to do is, given this code:
foo(['a', 'b', 'c']);
Extract 'a', 'b', and 'c' and rewrite the code as:
foo('bar', [0, 1, 2]);
I am using ANTLR for my parsing needs, producing C# 3 code. Somebody else had already contributed a JavaScript grammar. The parsing of the source code is working.
The problem I'm encountering is figuring out how to actually properly analyze and modify the source file. Each approach that I try to take in actually solving the problem leads me to a dead end. I can't help but think that I'm not using the tool as it's intended or am just too much of a novice when it comes to dealing with ASTs.
My first approach was to parse using a TokenRewriteStream and implement the EnterRule_* partial methods for the rules I'm interested in. While this seems to make modifying the token stream pretty easy, there is not enough contextual information for my analysis. It seems that all I have access to is a flat stream of tokens, which doesn't tell me enough about the entire structure of code. For example, to detect whether the foo function is being called, simply looking at the first token wouldn't work because that would also falsely match:
a.b.foo();
To allow me to do more sophisticated code analysis, my second approach was to modify the grammar with rewrite rules to produce more of a tree. Now, the first sample code block produces this:
Program
    CallExpression
        Identifier('foo')
        ArgumentList
            ArrayLiteral
                StringLiteral('a')
                StringLiteral('b')
                StringLiteral('c')
This is working great for analyzing the code. However, now I am unable to easily rewrite the code. Sure, I could modify the tree structure to represent the code I want, but I can't use this to output source code. I had hoped that the token associated with each node would at least give me enough information to know where in the original text I would need to make the modifications, but all I get are token indexes or line/column numbers. To use the line and column numbers, I would have to make an awkward second pass through the source code.
I suspect I'm missing something in understanding how to properly use ANTLR to do what I need. Is there a more proper way for me to solve this problem?
What you are trying to do is called program transformation, that is, the automated generation of one program from another. What you are doing "wrong" is assuming a parser is all you need, and discovering that it isn't and that you have to fill in the gap.
Tools that do this well have parsers (to build ASTs), means to modify the ASTs (both procedural and pattern directed), and prettyprinters which convert the (modified) AST back into legal source code. You seem to be struggling with the fact that ANTLR doesn't come with prettyprinters; that's not part of its philosophy; ANTLR is a (fine) parser-generator. Other answers have suggested using ANTLR's "string templates", which are not by themselves prettyprinters, but can be used to implement one, at the price of implementing one. This is harder to do than it looks; see my SO answer on compiling an AST back to source code.
The real issue here is the widely made but false assumption that "if I have a parser, I'm well on my way to building complex program analysis and transformation tools." See my essay on Life After Parsing for a long discussion of this; basically, you need a lot more tooling than "just" a parser to do this, unless you want to rebuild a significant fraction of the infrastructure by yourself instead of getting on with your task. Other useful features of practical program transformation systems typically include source-to-source transformations, which considerably simplify the problem of finding and replacing complex patterns in trees.
For instance, if you had source-to-source transformation capabilities (of our tool, the DMS Software Reengineering Toolkit), you'd be able to write parts of your example code changes using these DMS transforms:
domain ECMAScript.

tag replace; -- says this is a special kind of temporary tree

rule barize(function_name:IDENTIFIER,list:expression_list,b:body):
    expression->expression
  = " \function_name ( '[' \list ']' ) "
  -> "\function_name( \firstarg\(\function_name\), \replace\(\list\))";

rule replace_unit_list(s:character_literal):
    expression_list -> expression_list
    replace(s) -> compute_index_for(s);

rule replace_long_list(s:character_list, list:expression_list):
    expression_list -> expression_list
    "\replace\(\s\,\list)-> "compute_index_for\(\s\),\list";
with rule-external "meta" procedures "first_arg" (which knows how to compute "bar" given the identifier "foo"; I'm guessing you want to do this), and "compute_index_for", which, given a string literal, knows what integer to replace it with.
Individual rewrite rules have parameter lists "(....)" in which slots representing subtrees are named, a left-hand side acting as a pattern to match, and a right-hand side acting as the replacement, both usually quoted in metaquotes " which separate rewrite-rule language text from target-language (e.g. JavaScript) text. There are lots of meta-escapes \ found inside the metaquotes which indicate a special rewrite-rule-language item. Typically these are parameter names, and represent whatever type of name tree the parameter represents, or represent an external meta procedure call (such as first_arg; you'll note that its argument list ( , ) is metaquoted!), or finally, a "tag" such as "replace", which is a peculiar kind of tree that represents future intent to do more transformations.
This particular set of rules works by replacing a candidate function call by the barized version, with the additional intent "replace" to transform the list. The other two transformations realize the intent by transforming "replace" away by processing elements of the list one at a time, and pushing the replace further down the list until it finally falls off the end and the replacement is done. (This is the transformational equivalent of a loop).
Your specific example may vary somewhat since you really weren't precise about the details.
Having applied these rules to modify the parsed tree, DMS can then trivially prettyprint the result (the default behavior in some configurations is "parse to AST, apply rules until exhaustion, prettyprint AST" because this is handy).
You can see a complete process of "define language", "define rewrite rules", "apply rules and prettyprint" at (High School) Algebra as a DMS domain.
Other program transformation systems include TXL and Stratego. We imagine DMS as the industrial strength version of these, in which we have built all that infrastructure including many standard language parsers and prettyprinters.
So it's turning out that I can actually use a rewriting tree grammar and insert/replace tokens using a TokenRewriteStream. Plus, it's actually really easy to do. My code resembles the following:
var charStream = new ANTLRInputStream(stream);
var lexer = new JavaScriptLexer(charStream);
var tokenStream = new TokenRewriteStream(lexer);
var parser = new JavaScriptParser(tokenStream);
var program = parser.program().Tree as Program;

var dependencies = new List<IModule>();

var functionCall = (
    from callExpression in program.Children.OfType<CallExpression>()
    where callExpression.Children[0].Text == "foo"
    select callExpression
).Single();
var argList = functionCall.Children[1] as ArgumentList;
var array = argList.Children[0] as ArrayLiteral;

tokenStream.InsertAfter(argList.Token.TokenIndex, "'bar', ");
for (var i = 0; i < array.Children.Count(); i++)
{
    tokenStream.Replace(
        (array.Children[i] as StringLiteral).Token.TokenIndex,
        i.ToString());
}

var rewrittenCode = tokenStream.ToString();
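The same "analyze on the tree, rewrite by position in the original text" idea can be sketched without ANTLR. The example below is an analogy, not the code above: it uses Python's standard ast module on Python source (so the input is foo(['a', 'b', 'c']) written as Python), assumes Python 3.8+ for the end_lineno/end_col_offset fields, and the function name rewrite_foo_calls is invented.

```python
import ast

def rewrite_foo_calls(source):
    """Turn foo([...]) into foo('bar', [0, 1, ...]) by editing text spans only."""
    # Start offset of each line, so (lineno, col) maps to an absolute index.
    # Column offsets are counted in UTF-8 bytes, so this assumes ASCII source.
    line_starts = [0]
    for line in source.splitlines(keepends=True):
        line_starts.append(line_starts[-1] + len(line))

    def offset(lineno, col):
        return line_starts[lineno - 1] + col

    edits = []  # (start, end, replacement)
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)  # a.b.foo() is an Attribute, so it is skipped
                and node.func.id == "foo"
                and len(node.args) == 1
                and isinstance(node.args[0], ast.List)):
            array = node.args[0]
            start = offset(array.lineno, array.col_offset)
            edits.append((start, start, "'bar', "))  # pure insertion
            for i, elt in enumerate(array.elts):
                edits.append((offset(elt.lineno, elt.col_offset),
                              offset(elt.end_lineno, elt.end_col_offset),
                              str(i)))

    # Apply from the end of the text so earlier offsets stay valid.
    for start, end, text in sorted(edits, reverse=True):
        source = source[:start] + text + source[end:]
    return source

print(rewrite_foo_calls("foo(['a', 'b', 'c'])\na.b.foo(['x'])\n"))
```

Everything that is not touched by an edit, including whitespace and the a.b.foo call, is carried over byte for byte, which is what the token rewrite stream gives you in the ANTLR version.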
Have you looked at the string template library? It is by the same person who wrote ANTLR, and the two are intended to work together. It sounds like it would do what you're looking for, i.e. output matched grammar rules as formatted text.
Here is an article on translation via ANTLR
I'm looking for a fast library/class to parse plain text using expressions like below:
Text is: <b>Name:</b>John<br><i>Age</i>32<br>
Pattern is: {*}Name:</b>{%}<br>{*}Age</i>{%}<br>
And it will find me two values: John and 32.
Intent is to parse simple HTML web pages without involving heavy duty tools. It should not be using string operations or regexps internally but probably do char by char parsing.
Since you appear to be asking the user to specify the HTML content you want, it's probably alright to use regular expressions here (why do you have an aversion to them?). It's not HTML parsing, anymore, just simple text matching, which is what regular expressions are designed for.
Here's an example:
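$match =~ s/\{\*\}/.*?/g;    # {*} skips any text without capturing it
$match =~ s/\{%\}/(.*?)/g;   # {%} captures any text
$html =~ /$match/;           # captured values end up in $1, $2, ...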
Which will leave what you need in your capturing groups.
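For comparison, here is a small Python sketch of the same translation. It is not the Perl above: it additionally escapes the literal parts of the pattern, so characters that are special in regular expressions are matched literally. The helper names compile_pattern and extract are made up.

```python
import re

def compile_pattern(pattern):
    """Translate a {*} / {%} template into a compiled regular expression."""
    regex = ""
    # The capturing group in re.split keeps the placeholders in the result.
    for part in re.split(r"(\{\*\}|\{%\})", pattern):
        if part == "{*}":
            regex += ".*?"        # skip, do not capture
        elif part == "{%}":
            regex += "(.*?)"      # capture
        else:
            regex += re.escape(part)  # literal text, matched exactly
    return re.compile(regex, re.DOTALL)

def extract(pattern, text):
    match = compile_pattern(pattern).search(text)
    return list(match.groups()) if match else []

text = "<b>Name:</b>John<br><i>Age</i>32<br>"
pattern = "{*}Name:</b>{%}<br>{*}Age</i>{%}<br>"
print(extract(pattern, text))  # ['John', '32']
```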
A regex replacement would work. Just get it to return both values together like "John%32" and then split the response to get the two separate values.
There's really no advantage to implementing character-by-character parsing by hand here, as this kind of problem has by and large already been solved.
If you're dealing with an extremely normalized set of data (i.e. the template you described above is formatted exactly the same in every circumstance with no possibility of missing closing tags, HTML being inserted in odd places, etc.), regular expressions are a perfectly appropriate tool to parse this sort of data.
If the HTML can not be guaranteed to be perfect, then the most straightforward solution is to use a tool to load the HTML structure into a DOM and find the appropriate elements in the document tree.
Developing a character-by-character approach will probably end up being equivalent to manually implementing one of the above two options, which is not a trivial thing to implement.
I'm writing a lexer/parser for a small subset of C in ANTLR that will be run in a Java environment. I'm new to the world of language grammars, and many of the ANTLR tutorials create an AST (Abstract Syntax Tree). Am I forced to create one, and why?
Creating an AST with ANTLR is incorporated into the grammar. You don't have to do this, but it is a really good tool for more complicated requirements. This is a tutorial on tree construction you can use.
Basically, with ANTLR when the source is getting parsed, you have a few options. You can generate code or an AST using rewrite rules in your grammar. An AST is basically an in memory representation of your source. From there, there's a lot you can do.
There's a lot to ANTLR. If you haven't already, I would recommend getting the book.
I found this answer to the question on jGuru written by Terence Parr, who created ANTLR. I copied this explanation from the site linked here:
Only simple, so-called syntax directed translations can be done with actions within the parser. These kinds of translations can only spit out constructs that are functions of information already seen at that point in the parse. Tree parsers allow you to walk an intermediate form and manipulate that tree, gradually morphing it over several translation phases to a final form that can be easily printed back out as the new translation.
Imagine a simple translation problem where you want to print out an html page whose title is "There are n items" where n is the number of identifiers you found in the input stream. The ids must be printed after the title like this:
<html>
<head>
<title>There are 3 items</title>
</head>
<body>
<ol>
<li>Dog</li>
<li>Cat</li>
<li>Velociraptor</li>
</ol>
</body>
</html>
from input
Dog
Cat
Velociraptor
So with simple actions in your grammar how can you compute the title? You can't without reading the whole input. Ok, so now we know we need an intermediate form. The best is usually an AST I've found since it records the input structure. In this case, it's just a list but it demonstrates my point.
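The point can be sketched in a few lines of Python (not ANTLR; parse_ids and render_page are invented names): the first pass builds the intermediate form, and only the second pass can write the title, because the count is not known until all input has been read.

```python
from html import escape

def parse_ids(text):
    """Pass 1: build the intermediate form, here just a list of identifiers."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def render_page(ids):
    """Pass 2: the title needs len(ids), so it can only be written after pass 1."""
    items = "\n".join(f"<li>{escape(name)}</li>" for name in ids)
    return (f"<html>\n<head>\n<title>There are {len(ids)} items</title>\n</head>\n"
            f"<body>\n<ol>\n{items}\n</ol>\n</body>\n</html>")

print(render_page(parse_ids("Dog\nCat\nVelociraptor\n")))
```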
Ok, now you know that a tree is a good thing for anything but simple translations. Given an AST, how do you get output from it? Imagine simple expression trees. One way is to make the nodes in the tree specific classes like PlusNode, IntegerNode and so on. Then you just ask each node to print itself out. For input, 3+4 you would have tree:
+
|
3 -- 4
and classes
class PlusNode extends CommonAST {
    public String toString() {
        AST left = getFirstChild();
        AST right = left.getNextSibling();
        return left + " + " + right;
    }
}
class IntNode extends CommonAST {
    public String toString() {
        return getText();
    }
}
Given an expression tree, you can translate it back to text with t.toString(). So, what's wrong with this? Seems to work great, right? It appears to work well in this case because it's simple, but I argue that, even for this simple example, tree grammars are more readable and are formalized descriptions of precisely what you coded in the PlusNode.toString().
expr returns [String r]
{
    String left=null, right=null;
}
    : #("+" left=expr right=expr) {r=left + " + " + right;}
    | i:INT {r=i.getText();}
    ;
Note that the specific class ("heterogeneous AST") approach actually encodes a complete recursive-descent parser for #(+ INT INT) by hand in toString(). As parser generator folks, this should make you cringe. ;)
The main weakness of the heterogeneous AST approach is that it cannot conveniently access context information. In a recursive-descent parser, your context is easily accessed because it can be passed in as a parameter. You also know precisely which rule can invoke which other rule (e.g., is this expression a WHILE condition or an IF condition?) by looking at the grammar. The PlusNode class above exists in a detached, isolated world where it has no idea who will invoke its toString() method. Worse, the programmer cannot tell in which context it will be invoked by reading it.
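A small illustration of "context passed in as a parameter", written in Python rather than as an ANTLR tree grammar: an external walker over nested tuples receives the enclosing operator from its caller and uses it to decide about parentheses. The tuple encoding, PRECEDENCE table and to_text are assumptions made for this sketch.

```python
PRECEDENCE = {"+": 1, "*": 2}

def to_text(node, parent_op=None):
    """Print an expression tree; parent_op is the context handed down by the caller."""
    if not isinstance(node, tuple):
        return str(node)  # leaf: an integer
    op, left, right = node
    text = f"{to_text(left, op)} {op} {to_text(right, op)}"
    # Only the caller's context tells us whether parentheses are needed.
    if parent_op is not None and PRECEDENCE[parent_op] > PRECEDENCE[op]:
        return f"({text})"
    return text

print(to_text(("+", 3, 4)))            # 3 + 4
print(to_text(("*", ("+", 3, 4), 5)))  # (3 + 4) * 5
```

A parameterless toString() on a PlusNode has no equivalent of parent_op, which is the limitation described above.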
In summary, adding actions to your input parser works for very straightforward translations where:
the order of output constructs is the same as the input order
all constructs can be generated from information parsed up to the point when you need to spit them out
Beyond this, you will need an intermediate form--the AST is the best form usually. Using a grammar to describe the structure of the AST is analogous to using a grammar to parse your input text. Formalized descriptions in a domain-specific high-level language like ANTLR are better than hand coded parsers. Actions within a tree grammar have very clear context and can conveniently access information passed from invoking rules. Translations that manipulate the tree for multipass translations are also much easier using a tree grammar.
I think the creation of the AST is optional. The Abstract Syntax Tree is useful for subsequent processing like semantic analysis of the parsed program.
Only you can decide if you need to create one. If your only objective is syntactic validation then you don't need to generate one. In javacc (similar to ANTLR) there is a utility called JJTree that allows the generation of the AST. So I imagine this is optional in ANTLR as well.