SVG parsing and data type

I'm writing an SVG parser, mainly as an exercise for learning how to use Parsec. Currently I'm using the following data type to represent my SVG file:
data SVG = Element String [Attribute] [SVG]
         | SelfClosingTag [Attribute]
         | Body String
         | Comment String
         | XMLDecl String
This works quite well; however, I'm not sure about the Element String [Attribute] [SVG] part of my data type.
Since there is only a limited number of potential tags for an SVG, I was thinking about using a type to represent an SVG element instead of using a String. Something like this:
data SVG = Element TagName [Attribute] [SVG]
         | ...

data TagName = A
             | AltGlyph
             | AltGlyphDef
             ...
             | View
             | Vkern
Is this a good idea? What would be the benefits of doing this, if there are any?
Is there a more elegant solution?

I personally prefer the approach of enumerating all possible TagNames. This way, the compiler can give you errors and warnings if you make any careless mistakes. For example, if I want to write a function that covers every possible type of Element, then as long as every type is enumerated in an ADT, the compiler can give me non-exhaustive match warnings; if the tag is represented as a string, this is not possible. Additionally, if I want to match an Element of a specific type and I accidentally misspell the TagName, the compiler will catch it. A third reason, which probably doesn't apply here but is worth noting in general, is that if I later decide to add or remove a variant of TagName, the compiler will tell me every place that needs to be modified. I doubt this will happen for SVG tag names, but in general it is something to keep in mind.
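As a minimal sketch of that first point (using only a handful of made-up constructors, not the full SVG tag list): with -Wincomplete-patterns, GHC flags the missing case, which it could never do for a plain String:

{-# OPTIONS_GHC -Wincomplete-patterns #-}

-- A tiny subset of tag names, purely for illustration.
data TagName = A | Circle | Rect | Text
  deriving (Show, Eq)

-- GHC warns that the Text case is not handled here; with a String-typed
-- tag name there would be nothing for the compiler to check.
tagToString :: TagName -> String
tagToString A      = "a"
tagToString Circle = "circle"
tagToString Rect   = "rect"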

To answer your question:
You can do this either way depending on what you are going to do with your parse tree after you make it.
If all you want to do with your SVG parser is describe the shape of the SVG data, you are just fine with a string.
On the other hand, if you want to transform that SVG data into something like a graphic (that is, you anticipate evaluating your AST), you will find that it is best to represent all semantic information in the type system. It will make the next steps much easier.
The question in my mind is whether the parsing pass is exactly the place to make that happen. (Full disclosure: I have only a passing familiarity with SVG.) I suspect that rather than just a flat list of tags, you would be better off with Elements, each with its own set of required and optional attributes. If this transformation happens later in the program, there is no need to create a TagName data type. You can catch all the type errors at the same time you merge the attributes into the Elements.
On the other hand, a good argument could be made for parsing straight into a complete Element tree, in which case I would drop the generic [Attribute] and [SVG] fields of the Element constructor and instead give each constructor its own appropriate fields, as in the sketch below.
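A rough sketch of what that could look like (field choices and types here are guesses, not a complete SVG model):

-- Each element gets the fields it actually needs instead of a generic
-- [Attribute] list; children only where nesting makes sense.
data SVG
  = Rect    Double Double Double Double   -- x, y, width, height
  | Circle  Double Double Double          -- cx, cy, r
  | Group   [SVG]                         -- <g> with child elements
  | Body    String
  | Comment String
  deriving Show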
Another answer to the question you didn't ask:
Put source code location into your parse tree early. From personal experience, I can tell you that retrofitting it gets harder the larger your program gets.
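With Parsec this is cheap to do from the start: getPosition returns the current SourcePos, which you can attach to every node. A minimal sketch (the Located wrapper and its name are made up for illustration):

import Text.Parsec
import Text.Parsec.String (Parser)

-- Wrap any parsed value with the position where its parse began.
data Located a = Located SourcePos a
  deriving Show

located :: Parser a -> Parser (Located a)
located p = do
  pos <- getPosition   -- Parsec's current source position
  x   <- p
  pure (Located pos x)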

Related

Data model for transforming source into AST and back?

I am working on a custom programming language. On compiling it, the parser first converts the text into a simple stream of tokens. The tokens are then converted into a simple tree. The tree is then converted into an object graph (with holes in it, as the types haven't yet been necessarily fully figured out). The holey tree is then transformed into a compact object graph.
Then we can go further and compile it to, say, JavaScript. The compact object graph is then transformed into a JavaScript AST. The JS AST is then transformed into a "concrete" syntax tree (with whitespace and such), and then that is converted into the JS text.
So in going from text to compact object graph, there are four transformation steps (text -> token_list -> tree -> holey_graph -> graph). In other situations (other languages), you might have more or fewer.
The way I am doing this transformation now is very ad-hoc and not keeping track of line numbers, so it's impossible to really tell where an error is coming from. I would like to fix that.
In my case, I am wondering how you could create a data model to keep track of the line of text where something was defined. This way, you could report any compilation errors nicely to the developer. The way I have modeled that so far is with a sort of "folding" model as I'm calling it. The initial "fold" is on the text -> token_list transformation. For each token, it keeps track of 3 things: the line, the column, and the text length, for the token. At first you may model it like this:
{
  token: 'function',
  line: 10,
  column: 2,
  size: 8
}
But that is tying two concepts into one object: the token itself, and the "fold" as I am calling it. Really it would be better like this:
fold = {
  line: 10,
  column: 2,
  size: 8
}
token = {
  value: 'function'
}
// bind the two together.
fold.data = token
token.fold = fold
Then, you transform from token to AST node in the simple tree. That might be like:
treeNode = {
  type: 'function'
}
fold = {
  previous: tokenFold,
  data: treeNode
}
And so on, connecting the dots like this. In the end, you would have a fold list, which could theoretically be traversed from the compact object graph back to the text, so that if there was a compile error during typechecking, for example, you could report the exact line number and everything to the developer. The navigation would look something like this:
fold = compactObjectGraph
  .fold
  .previous.previous.previous.previous
fold.line
fold.column
fold.size
In theory. But the problem is, the "compact object graph" might have been created not from a simple linear chain of inputs but from a suite of inputs. While I have modeled this on paper so far, I am starting to think there isn't actually a clear way, with this sort of "fold" system, of mapping from object to object how each was transformed.
The question is, how can I define the data model to allow for getting back to the source text line/column number, given there is a complex sequence of transformations from one data structure to the next? That is, at a high level, what is a way to model this that will allow you to isolate the transformation data structures, yet be able to map from the last generated one to the first, to find how some compact object graph node was actually represented in the original source text?
I would create a data structure containing the filename, line and column. In C++ it may work well to store a reference to this structure, rather than copy it to many places.
There aren't really that many ways to solve this, but having a single structure that is reusable across your other data structures is almost certainly the right solution.
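In Haskell terms (a sketch only; the constructor and field names are illustrative), that single reusable structure and its use across representations might look like:

-- One reusable location type, attached to nodes in every intermediate
-- representation so errors can always point back to the source.
data SourceLoc = SourceLoc
  { locFile   :: FilePath
  , locLine   :: Int
  , locColumn :: Int
  } deriving (Show, Eq, Ord)

data Token = Token { tokText :: String, tokLoc :: SourceLoc }

data Expr
  = Var String SourceLoc
  | App Expr Expr SourceLoc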
I answered your question on Quora in July, so maybe you missed it: https://qr.ae/pvkrwJ
Basically you have to stamp all the compiler artifacts with the source information from which they are derived, best represented as some kind of structure (as in Mats' response). Yes, that takes effort, because you have to do it everywhere in the compiler.
To do a perfect job, you'd need to stamp it with the complete set of source items that caused its generation; you're essentially producing a dependency graph. (You could represent such sets as trees of subsets to maximize sharing). Then any complaint the compiler issued could clearly identify the set of causes.
To do a less perfect job, you can pick any one of the contributing items and use that as the source location dependency. That means a compiler complaint will only identify one source location that might be the cause, and the reader will have to guess at the others if that isn't the principal source of the problem. Judicious choice of which cause to record can arrange things so that the answer is right much of the time, and that's probably good enough.
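A hedged sketch of the two options (illustrative names; a plain (file, line, column) triple stands in for the location structure):

import qualified Data.Set as Set

-- "Perfect" version: every derived artifact carries the set of source
-- positions it was built from; combining nodes unions their origins.
type Origin  = (FilePath, Int, Int)   -- file, line, column
type Origins = Set.Set Origin

combine :: Origins -> Origins -> Origins
combine = Set.union

-- "Less perfect" version: keep a single representative location,
-- here simply the earliest position in the set.
representative :: Origins -> Maybe Origin
representative = Set.lookupMin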

How to properly do custom markdown markup

I currently work on a personal writing project which has ended up with me maintaining a few different versions due to the differences of the relevant platforms and output formats I want to support that are not trivially solved. After several instances of me glancing at pandoc and the sheer forest that it represents, I have concluded mere templates don't do what I need, and worse, that I seem to need a combination of a custom filter and writer... suffice to say: messing with the AST is where I feel way out of my depth. Enough so that, rather than asking specific questions of 'how do I do X' here, this is a question of 'is X the right way to go about it, or what is the proper way to do it, and can you give an example of how it ties together?'... so if this question is rather lengthy: my apologies.
My current goal is to have custom markup like the following which is supposed to 'track' which character says something:
<paul|"Hi there">
If I convert to HTML, I'd want something similar to:
<span class="speech paul">"Hi there"</span>
to pop out (and perhaps the <p> tags), whereas if it is just pure markdown / plain text, I'd want it to silently disappear:
"Hi there"
Looking at the JSON AST structures I've studied, it would make sense that I'd want a new structure type similar to the 'Emph' tag called 'Speech' which allows whole blobs of text to be put inside of it with a bit of extra information attached (the person speaking). So something like this:
{"t":"Speech","speaker":"paul","c":[ ... ] }
Problem #1: At the point a lua-filter sees the document, it is obviously already distilled to an AST. This means that replacing the items in a manner similar to what most macro-expander samples do cannot really work, since it would require reading forward. With this method, I just replace bits and pieces in place (<NAME| becomes a StartSpeech and the first solitary > that follows becomes an EndSpeech), but that would make malformed input a bigger potential problem because of silent-ish failures. Additionally, these tags would be completely out of sorts with how an AST is supposed to look.
To complicate matters even further, some of my characters end up learning a secondary language throughout the story, for which I apply a different format that contains a simplified understanding of the spoken text with perspective-characters understanding of what was said. Example:
<paul|"Heb je goed geslapen?"|"Did you ?????">
I could probably add a third 'UnderstoodSpeech' group to my filter, but (problem #2) at this point, the relationship between the speaker, the original speech, and the understood translation is completely gone. As long as the final documents need these values in these respective orders and only in these orders, it is fine... but what if I want my HTML version to look like
"Did you?????"
with a tool-tip / hover-over effect containing the original speech? That would be near impossible to achieve because the AST does not contain that kind of relational detail.
Whatever kind of AST I create in the filter is what I need to understand in my custom writer. Ideally, I want to re-use as much stock functionality of pandoc as possible for the writer, but I don't even know if that is feasible at this point.
So now my question: could someone with great pandoc understanding please give me an example on how to keep relevant data-bits together and apply them in the correct manner? By this I mean show a basic example of what needs to be put in the lua-filter and lua-writer scripts in the following toolchain
[CUSTOMIZED MARKDOWN INPUT] -> lua-filter -> lua-writer -> [CUSTOMIZED HTML5 OUTPUT]
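I can't give a full Lua solution, but since pandoc filters can equally be written in Haskell against pandoc-types, here is a hedged sketch of the part that matters for keeping the data-bits together: put the speaker and the original speech on one Span node as attributes, so whatever writer runs later still sees them as a unit. Detecting the custom <name|...> syntax is deliberately left out, and the "speech"/"speaker"/"original" names are invented for this example (it assumes a recent pandoc-types where Attr fields are Text):

{-# LANGUAGE OverloadedStrings #-}
import Text.Pandoc.JSON
import Data.Text (Text)

-- Wrap the displayed text in a Span that also carries the speaker and the
-- original (untranslated) speech, so the relationship is never lost.
mkSpeech :: Text -> Text -> [Inline] -> Inline
mkSpeech speaker original shown =
  Span ("", ["speech", speaker], [("speaker", speaker), ("original", original)])
       shown

-- Filter entry point; a real filter would rewrite the custom markup into
-- mkSpeech spans here instead of passing inlines through unchanged.
main :: IO ()
main = toJSONFilter (id :: Inline -> Inline)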

Does it make any sense to use `nom` to process custom enum types?

I am attempting to implement a parser for a simple query language. The goal is to generate operations from the text and then evaluate them before passing them up the tree. If I understand correctly, I'll have to implement some of the nom traits (InputLength, InputTake, Slice).
Part way through implementing the InputTake trait, I realized that I'm expected to return subslices of the enums which represent my query operations, where a split may be made partway through an identifier. For example, I may parse an identifier name_of_var, and this take_split() method could produce two slices, which doesn't make sense to me.
What should I be doing here? I don't like the idea of slicing a bool/number since they only make sense as a whole.
What do you think about returning None in the case where I consider a byte slice invalid?
For what it's worth...
I assumed that the output type of one parser was the input of a parent parser. What really happens is that all parsers can expect the same input type and return whatever they like. The generated object (which is an AST) is returned and manipulated in the end.
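The same picture in Parsec terms, as a loose analogy rather than nom code (the tiny query language here is invented): every parser reads the same input type and returns whatever value it builds, and the AST is simply the return value of the top-level parser:

import Text.Parsec
import Text.Parsec.String (Parser)

-- Invented two-form query language, e.g. "id:42" or "name:foo".
data Query = ById Int | ByName String
  deriving Show

number :: Parser Int
number = read <$> many1 digit

ident :: Parser String
ident = many1 letter

-- All three parsers consume String input; their output types differ.
query :: Parser Query
query = try (string "id:"   *> (ById   <$> number))
    <|>     (string "name:" *> (ByName <$> ident))

-- parse query "" "id:42"  ==> Right (ById 42)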

F# using XML Type Provider to modify xml

I need to process a bunch of XML documents. They are quite complex in their structure (i.e. loads of nodes), but the processing consists in changing the values for a few nodes and saving the file under a different name.
I am looking for a way to do that without having to reconstruct the output XML by explicitly instantiating all the types and passing all of the unchanged values in, but simply by copying them from the input. If the types generated automatically by the type provider were record types, I could simply create the output by let output = { input with changedNode = myNewValue }, but with the type provider I have to do let output = MyXml.MyRoot(input.UnchangedNode1, input.UnchangedNode2, myNewValue, input.UnchangedNode3, ...). This is further complicated by my changed values being in some of the nested nodes, so I have quite a lot of fluff to pass in to get to it.
The F# Data type providers were primarily designed to provide easy access when reading the data, so they do not have a very good story for writing data (partly, the issue is that the underlying JSON representation is quite different from the underlying XML representation).
For XML, the type provider just wraps the standard XElement types, which happen to be mutable. This means that you can actually navigate to the elements using provided types, but then use the underlying LINQ to XML to mutate the value. For example:
open FSharp.Data
open System.Xml.Linq

type X = XmlProvider<"<foos><foo a=\"1\" /><foo a=\"2\" /></foos>">

// Change the 'a' attribute of all 'foo' nodes to 1234
let doc = X.GetSample()
for f in doc.Foos do
    f.XElement.SetAttributeValue(XName.Get "a", 1234)

// Prints the modified document
doc.ToString()
This is likely not perfect - sometimes, you'll need to change the parent element (like here, the provided f.A property is not mutable), but it might do the trick. I don't know whether this is the best way of solving the problem in general, or whether something like XSLT might be easier - it probably depends on the concrete transformations.

How to walk the whole Parse Tree and print its content with slight changes in ANTLR4?

So as stated in the title, my task is to traverse the Parse Tree generated for code written in Java (grammar is a standard Java grammar), print most of it unchanged and modify only some words, for example type declarations.
My current approach was to create a ParseTreeListener and implement the logic in the enterEveryRule method, but unfortunately it doesn't appear to work even for basic printing. The output is very messy and there are a lot of repetitions, as if every node were visited multiple times.
Another attempt was to implement the appropriate methods in the BaseListener to make the changes to the type declarations I need, but from there I see no way to print the rest of the code unchanged.
Looking forward to your help!
You could use ANTLR's string templates to produce code from the ASTs.
In general, you start with a set of "standard" string templates that can regenerate source code corresponding to the underlying tree.
To get the effect you want, you judiciously choose the standard string templates on AST nodes where you don't want changes, and variant templates where you do want changes.
IMHO, it is better to modify the AST, and then simply apply the standard templates.
