XML parsing in Opa

Is anyone well versed in xml_parsers in Opa? I need two idioms:
1. ignoring an attribute: it may or may not be present on a tag, and if it is present, I don't care about its content;
2. ignoring a tag: it may or may not be there.
Does anyone know how to express these? As an example, say I want to parse:
<x a="123">
<z>...</z>
</x>
I want the parser to store the value of attribute a and the content of tag <z>, and it should work even if <x> has an optional attribute b (or more) and if there is a <y> tag before <z>.
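(Not Opa syntax, which is exactly the part I don't know, but to make the two idioms concrete, here is roughly the shape I'm after, sketched with Haskell's Parsec; only the combinator names are real, the rest is just an illustration.)

-- Sketch of the two idioms: an attribute I ignore (whatever it is),
-- and a tag that may or may not be there.
import Text.Parsec
import Text.Parsec.String (Parser)

-- one attribute: name="value"
attribute :: Parser (String, String)
attribute = do
  name <- many1 letter
  _    <- char '=' >> char '"'
  val  <- many (noneOf "\"")
  _    <- char '"' >> spaces
  return (name, val)

-- an element whose presence and content I don't care about, e.g. <y>...</y>
ignoredElement :: String -> Parser ()
ignoredElement tag = do
  _ <- try (string ("<" ++ tag ++ ">"))
  _ <- manyTill anyChar (try (string ("</" ++ tag ++ ">")))
  spaces

-- <x a="123"> [optional <y>...</y>] <z>...</z> </x>
-- returns (value of attribute a, content of <z>)
xElement :: Parser (String, String)
xElement = do
  _     <- string "<x" >> spaces
  attrs <- many attribute            -- b (or any other attribute) is simply collected and ignored
  _     <- char '>' >> spaces
  a     <- maybe (fail "missing attribute a") return (lookup "a" attrs)
  optional (ignoredElement "y")      -- <y> may or may not be there
  _     <- string "<z>"
  zBody <- manyTill anyChar (try (string "</z>"))
  _     <- spaces >> string "</x>"
  return (a, zBody)

Running xElement on the example above gives back the value of a and the body of <z>, whether or not a b attribute or a <y> element is present. What I'm looking for is the equivalent formulation in Opa's xml_parser.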

Related

How to properly do custom markdown markup

I currently work on a personal writing project which has ended up with me maintaining a few different versions, due to differences between the platforms and output formats I want to support that are not trivially solved. After several instances of glancing at pandoc and the sheer forest that it represents, I have concluded that mere templates don't do what I need and, worse, that I seem to need a combination of a custom filter and writer. Suffice to say: messing with the AST is where I feel way out of my depth. Enough so that, rather than asking specific questions of 'how do I do X' here, this is a question of 'is X the right way to go about it, or what is the proper way to do it, and can you give an example of how it ties together?'. So if this question is rather lengthy: my apologies.
My current goal is to have custom markup like the following which is supposed to 'track' which character says something:
<paul|"Hi there">
If I convert to HTML, I'd want something similar to:
<span class="speech paul">"Hi there"</span>
to pop out (and perhaps the <p> tags), whereas if it is just pure markdown / plain text, I'd want it to silently disappear:
"Hi there"
From the JSON AST structures I've studied, it would make sense to add a new structure type, similar to the 'Emph' tag, called 'Speech', which allows whole blobs of text to be put inside it with a bit of extra information attached (the person speaking). So something like this:
{"t":"Speech","speaker":"paul","c":[ ... ] }
Problem #1: By the point a lua-filter sees the document, it has obviously already been distilled to an AST. This means replacing the items in a manner similar to what most macro expander samples do cannot really work, since it would require reading forward. With this method, I would just replace bits and pieces in place (<NAME| becomes a StartSpeech and the first solitary > that follows becomes an EndSpeech), but that would make malformed input a bigger potential problem because of silent-ish failures. Additionally, these tags would be completely out of sorts with how an AST is supposed to look.
To complicate matters even further, some of my characters end up learning a secondary language throughout the story, for which I apply a different format that contains a simplified understanding of the spoken text, i.e. the perspective character's understanding of what was said. Example:
<paul|"Heb je goed geslapen?"|"Did you ?????">
I could probably add a third 'UnderstoodSpeech' group to my filter, but (problem #2) at this point, the relationship between the speaker, the original speech, and the understood translation is completely gone. As long as the final documents need these values in these respective orders and only in these orders, it is fine... but what if I want my HTML version to look like
"Did you?????"
with a tool-tip / hover-over effect containing the original speech? That would be near impossible to achieve because the AST does not contain that kind of relational detail.
Whatever kind of AST I create in the filter is what I need to understand in my custom writer. Ideally, I want to re-use as much stock functionality of pandoc as possible for the writer, but I don't even know if that is feasible at this point.
So now my question: could someone with a good understanding of pandoc please give me an example of how to keep the relevant data bits together and apply them in the correct manner? By this I mean a basic example of what needs to go in the lua-filter and lua-writer scripts in the following toolchain:
[CUSTOMIZED MARKDOWN INPUT] -> lua-filter -> lua-writer -> [CUSTOMIZED HTML5 OUTPUT]
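A sketch of one way to keep the speaker, the original speech, and the understood translation together: instead of inventing a new 'Speech' node, reuse pandoc's Span, whose Attr can carry the speaker and the original line as key/value pairs, so the relationship never leaves the node. The sketch below uses pandoc's Haskell types (pandoc-types, assuming a recent version where these fields are Text); a Lua filter would build exactly the same structure. The low-level recognition of <name|"..."|"..."> is left out; this only shows the data shape.

{-# LANGUAGE OverloadedStrings #-}
-- Sketch only: assumes an earlier pass has already pulled apart
-- <speaker|"original"|"understood">; this shows how to keep the pieces together.
import Data.Text (Text)
import Text.Pandoc.Definition

-- class "speech" plus the speaker's name, the original line stored as an
-- attribute, and the understood text as the visible content
speech :: Text -> Text -> [Inline] -> Inline
speech speaker original understood =
  Span ("", ["speech", speaker], [("data-original", original)]) understood

example :: Inline
example = speech "paul" "\"Heb je goed geslapen?\"" [Str "\"Did you ?????\""]

main :: IO ()
main = print example

If I read the stock writers correctly, the HTML writer already renders such a Span as <span class="speech paul" data-original="...">...</span> (storing the original under a title key would give the hover-over for free), while the plain writer just emits the content, so a custom writer may only need to restyle the node rather than reinvent it.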

Is Pug context free?

I was thinking of writing a Pug parser. Indentation is well known to be context-sensitive, but that can be trivially handled with a lexer feedback loop to make the rest almost context-free (the approach adopted by Python). Apart from that, what makes Pug not context-free?
XML tags are definitely not context-free, since each start tag needs to match an end tag, but Pug has no such restriction, which makes me wonder whether we could just parse each starting identifier as a production for a tag root.
The main thing that Pug seems to be missing, at least from a casual scan of its website, is a formal description of its syntax. Or even an informal description. Perhaps I wasn't looking in the right places.
Still, based on the examples, it doesn't look awful. There will be some challenges; in particular, it does not have a uniform tokenisation context, so the scanner is going to be complicated, not just because of the indentation issue. (I got the impression from the section on whitespace that the indentation rule is much stricter than Python's, but I didn't find a specification of what it is exactly. It appeared to me that leading whitespace after the two-character indent is significant whitespace. But that doesn't complicate things much; it might even simplify the task.)
What will prove interesting is handling embedded JavaScript. You will at least need to tokenise the embedded JS, and the corner cases in the JS spec make it non-trivial to tokenise without parsing. Anyway, just tokenising isn't sufficient to know where the embedded code terminates. (For the lexical challenge, consider the correct identification of regular expression literals. /= might be the start of a regex or it might be a divide-and-assign operator; how a subsequent { is tokenised will depend on that decision.) Template strings present another challenge (recursive embedding). However, JavaScript parsers do exist, so you might be able to leverage one.
In other words, recognising tag nesting is not going to be the most challenging part of your project. Once you've identified that a given token is a tag, the nesting part is trivial (and context-free) because it is precisely defined by the indentation, so a DEDENT token will terminate the tag.
However, it is worth noting that tag parsing is not particularly challenging for XML (or XML-like HTML variants). If you adopt the XML rule that close tags cannot be omitted (except for self-closing tags), then the tagname in a close tag does not influence the parse of a correct input. (If the tagname in the close tag does not match the tagname in the corresponding open tag, then the input is invalid. But the correspondence between open and close tags doesn't change.) Even if you adopt the HTML-5 rule that close tags cannot be omitted except in the case of a finite list of special-case tagnames, you could theoretically do the parse with a CFG. (However, the various error recovery rules in HTML-5 are far from context-free, so that would only work for input which did not require rematching of close tags.)
Ira Baxter makes precisely this point in the cross-linked post he references in a comment: you can often implement context-sensitive aspects of a language by ignoring them during the parse and detecting them in a subsequent analysis, or even in a semantic predicate during the parse. Correct matching of open- and close tagnames would fall into this category, as would the "declare-before-use" rule in languages where the declaration of an identifier does not influence the parse. (Not true of C or C++, but true in many other languages.)
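A minimal sketch of that pattern (not tied to Pug or any real grammar): let the context-free parse record whatever tagname appeared in the close tag, and reject mismatches in a later pass.

-- Sketch: the parser stores both names; a separate pass checks that they agree.
data Node = Node
  { openName  :: String   -- tagname from the open tag
  , closeName :: String   -- tagname the parser saw in the close tag
  , children  :: [Node]
  }

wellMatched :: Node -> Bool
wellMatched (Node o c kids) = o == c && all wellMatched kids

main :: IO ()
main = print (wellMatched (Node "div" "div" [Node "p" "span" []]))  -- False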
Even if these aspects cannot be ignored -- as with C typedefs, for example -- the simplest solution might be to use an ambiguous CFG and a parsing technology which produces all possible parses. After the parse forest is generated, you could walk the alternatives and reject the ones which are inconsistent. (In the case of C, that would include an alternative parse in which a name was typedef'd and then used in a context where a typename is not valid.)

SVG parsing and data type

I'm writing an SVG parser, mainly as an exercise for learning how to use Parsec. Currently I'm using the following data type to represent my SVG file:
data SVG = Element String [Attribute] [SVG]
         | SelfClosingTag [Attribute]
         | Body String
         | Comment String
         | XMLDecl String
This works quite well, however I'm not sure about the Element String [Attribute] [SVG] part of my data type.
Since there is only a limited number of potential tags for an SVG, I was thinking about using a type to represent an SVG element instead of using a String. Something like this:
data SVG = Element TagName [Attribute] [SVG]
         | ...

data TagName = A
             | AltGlyph
             | AltGlyphDef
             ...
             | View
             | Vkern
Is it a good idea? What would be the benefits of doing this if there are any?
Is there a more elegant solution?
I personally prefer the approach of enumerating all possible TagNames. This way, the compiler can give you errors and warnings if you make any careless mistakes. For example, if I want to write a function that covers every possible type of Element, then if every type is enumerated in an ADT, the compiler can give you non-exhaustive match warnings. If you represent it as a string, this is not possible. Additionally, if I want to match an Element of a specific type, and I accidentally misspell the TagName, the compiler will catch it. A third reason, which probably doesn't really apply here, but is worth noting in general is that if I later decide to add or remove a variant of TagName, then the compiler will tell me every place that needs to be modified. I doubt this will happen for SVG tag names, but in general it is something to keep in mind.
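A toy sketch of that point (the tag list here is obviously abbreviated): with the warning below enabled, deleting any equation of render makes GHC flag the match as non-exhaustive, which a String-based representation can never do.

{-# OPTIONS_GHC -Wincomplete-patterns #-}
-- With an enumerated TagName the compiler tracks coverage; with a String it cannot.
data TagName = A | Circle | Rect   -- abbreviated, not the real SVG list

render :: TagName -> String
render A      = "a"
render Circle = "circle"
render Rect   = "rect"            -- remove this line and GHC warns that the match is non-exhaustive

main :: IO ()
main = putStrLn (render Circle)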
To answer your question:
You can do this either way depending on what you are going to do with your parse tree after you make it.
If all you care to do with your SVG parser is describe the shape of the SVG data, you are just fine with a string.
On the other hand, if you want to somehow transform that SVG data into something like a graphic (that is, you anticipate evaluating your AST), you will find that it is best to represent all semantic information in the type system. It will make the next steps much easier.
The question in my mind is whether the parsing pass is exactly the place to make that happen. (Full disclosure: I have only a passing familiarity with SVG.) I suspect that rather than just a flat list of tags, you would be better off with Elements, each with its own set of required and optional attributes. If this transformation "happens later in the program", there is no need to create a TagName data type. You can catch all the type errors at the same time you merge the attributes into the Elements.
On the other hand, a good argument could be made for parsing straight into a complete Element tree, in which case I would drop the generic [Attribute] and [SVG] fields of the Element constructor and instead put appropriate fields in your TagName constructors.
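Something along these lines, where the elements and their fields are only illustrative and nowhere near the real SVG spec:

-- Sketch of parsing straight into a typed element tree: each constructor gets
-- its own fields instead of a generic [Attribute] list.
data Length = Px Double | Percent Double
  deriving Show

data SVGElement
  = Rect   { x :: Length, y :: Length, width :: Length, height :: Length }
  | Circle { cx :: Length, cy :: Length, r :: Length }
  | Group  { transform :: Maybe String, contents :: [SVGElement] }
  deriving Show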
Another answer to the question you didn't ask:
Put source code locations into your parse tree early. From personal experience, I can tell you it gets harder the larger your program gets.
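With Parsec that can be as little as wrapping each node parser so it records the position it started at; a sketch:

-- Sketch: attach Parsec's SourcePos to every node as it is parsed.
import Text.Parsec (SourcePos, getPosition)
import Text.Parsec.String (Parser)

data Located a = Located SourcePos a
  deriving Show

located :: Parser a -> Parser (Located a)
located p = do
  pos <- getPosition      -- position before the node is parsed
  Located pos <$> p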

haskell parsing data structure with extra information

I have a problem extracting extra information from my parsing.
I have my own data structure to parse, and that works fine. I wrote the parser for my data structure as Parse MyDataStructure, which parses all the information about MyDataStructure.
The problem is that in the string I'm parsing, mixed with MyDataStructure, there is also some information about what I should do with MyDataStructure, which is of course not part of MyDataStructure, i.e. I cannot store this information inside MyDataStructure.
Now the problem is that I don't know how to store this information, since in Haskell I cannot change some global variable to store it, and the return value of my parser is already MyDataStructure.
Is there a way I can store this new information without changing MyDataStructure, i.e. without adding a field for the extra information (the extra information is not part of MyDataStructure, so I would really like to avoid doing that)?
I hope I have been clear enough.
As #9000 says, you could use a tuple. If you find yourself needing to pass it through a number of functions, using the State Monad might make things easier.
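A sketch of the tuple idea, with MyDataStructure and Instruction standing in for your own types (and if the parser happens to be Parsec, its user state via getState/modifyState is a third option):

-- Sketch: the parser returns the structure plus the extra instructions found
-- alongside it, without MyDataStructure having to know about them.
import Text.Parsec
import Text.Parsec.String (Parser)

data MyDataStructure = MyDataStructure String deriving Show   -- placeholder
data Instruction     = Instruction String     deriving Show   -- the "what to do with it" part

myData :: Parser MyDataStructure
myData = MyDataStructure <$> many1 letter

instruction :: Parser Instruction
instruction = Instruction <$> (char '!' *> many1 letter)

document :: Parser (MyDataStructure, [Instruction])
document = do
  d  <- myData <* spaces
  is <- instruction `sepEndBy` spaces
  return (d, is)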

(XML Parsing) Which design pattern or approach is more efficient

I have a problem here, and it goes like this:
I have 50 classes (XML parsers) for different response/message types. To illustrate, a sample of the existing XML messages received is below:
<XML>
<transaction>
<messagetype>message</messagetype>
<message>Blah, Blah, Blah</message>
</transaction>
</XML>
Recently, a requirement to allow multi-transaction messages to be received has been imposed, so messages will now look like the one below:
<XML>
<transaction>
<messagetype>message</messagetype>
<message>Blah, Blah, Blah</message>
<messagetype>notification</messagetype>
<message>stopped</message>
<messagetype>notification</messagetype>
<message>started</message>
<messagetype>alert</messagetype>
<message>no service</message>
</transaction>
</XML>
What I want to know is which approach will be more efficient:
a. Create a new class/method to catch all types of request, traverse all the XML elements and store them in an array, then iterate over the array and pass each element node to its respective parser (see the sketch below).
b. Edit each parser to accommodate the changes. (I see this as a very, very tedious job.)
c. Create one big parser, put all the parsing logic there, and traverse using switch cases (disregarding all the existing parsers).
Also, take note that the element nodes can vary with each response, so the number of child nodes can be 1 to N (where N is the limit).
Are there any viable solutions to this kind of scenario? I do not wish to rewrite the existing code (one of the programmer's virtues), but if it's the only way, then so be it.
I am implementing this on iPhone using Objective-C
TIA
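For what it's worth, option (a) is essentially a dispatch table: pull the (messagetype, message) pairs out of the <transaction> element and route each one to the parser already written for that type, leaving the 50 existing classes untouched. A language-neutral sketch of the idea (written in Haskell only for brevity; the real code would be Objective-C around NSXMLParser or similar):

import qualified Data.Map as Map
import Data.Map (Map)

-- One (messagetype, message) pair pulled out of the <transaction> element.
type MessageType = String
type Payload     = String

-- The existing per-type parsers, registered once in a dispatch table.
type Handler = Payload -> IO ()

handlers :: Map MessageType Handler
handlers = Map.fromList
  [ ("message",      \p -> putStrLn ("message: "      ++ p))
  , ("notification", \p -> putStrLn ("notification: " ++ p))
  , ("alert",        \p -> putStrLn ("alert: "        ++ p))
  ]

-- Option (a): iterate over the pairs and hand each one to its parser,
-- so the existing parsers stay untouched.
dispatch :: [(MessageType, Payload)] -> IO ()
dispatch = mapM_ route
  where
    route (t, p) = case Map.lookup t handlers of
      Just h  -> h p
      Nothing -> putStrLn ("no parser registered for: " ++ t)

main :: IO ()
main = dispatch
  [ ("message",      "Blah, Blah, Blah")
  , ("notification", "stopped")
  , ("notification", "started")
  , ("alert",        "no service")
  ]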
