Parsing text with simple wildcard logic in Java / C / Objective-C

I'm looking for a fast library/class to parse plain text using expressions like below:
Text is: <b>Name:</b>John<br><i>Age</i>32<br>
Pattern is: {*}Name:</b>{%}<br>{*}Age</i>{%}<br>
It should find two values: John and 32.
The intent is to parse simple HTML web pages without involving heavy-duty tools. Ideally it would not use string operations or regexes internally, but would instead do character-by-character parsing.

Since you appear to be asking the user to specify the HTML content you want, it's probably alright to use regular expressions here (why do you have an aversion to them?). It's not HTML parsing anymore, just simple text matching, which is what regular expressions are designed for.
Here's an example:
# turn the wildcard template into a regex
# (assumes the literal parts of the template contain no regex metacharacters)
$match =~ s/\{\*\}/.*?/g;    # {*} matches anything, non-greedy, uncaptured
$match =~ s/\{%\}/(.*?)/g;   # {%} becomes a capturing group
$html  =~ /$match/;          # apply the assembled regex
This will leave what you need in your capture groups ($1 and $2).
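The same translation trick works in other pattern languages too. Here is a minimal sketch in Lua (the compile helper and its escaping step are illustrative, not part of the answer above):

local function compile(template)
  -- escape Lua pattern magic characters in the literal parts of the template
  local p = template:gsub("[%^%$%(%)%%%.%[%]%*%+%-%?]", "%%%0")
  p = p:gsub("{%%%*}", ".-")    -- {*} (escaped above) skips text without capturing
  p = p:gsub("{%%%%}", "(.-)")  -- {%} (escaped above) becomes a capture
  return p
end

local text = "<b>Name:</b>John<br><i>Age</i>32<br>"
local name, age = text:match(compile("{*}Name:</b>{%}<br>{*}Age</i>{%}<br>"))
print(name, age)  --> John    32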

A regex replacement would work too: have it return both values joined with a separator, like "John%32", and then split the result to get the two separate values.

There's really no advantage to a hand-rolled character-by-character parser here; this class of problem has been solved many times over.
If you're dealing with extremely normalized data (i.e. the template you described above is formatted exactly the same way in every circumstance, with no possibility of missing closing tags, HTML inserted in odd places, etc.), regular expressions are a perfectly appropriate tool.
If the HTML cannot be guaranteed to be perfect, then the most straightforward solution is to use a tool that loads the HTML structure into a DOM and to find the appropriate elements in the document tree.
Developing a character-by-character approach will probably end up amounting to a manual implementation of one of the above two options, which is not trivial.

Related

How to properly do custom markdown markup

I currently work on a personal writing project which has ended up with me maintaining a few different versions, due to differences between the platforms and output formats I want to support that are not trivially solved. After several instances of glancing at pandoc and the sheer forest that it represents, I have concluded that mere templates don't do what I need, and worse, that I seem to need a combination of a custom filter and writer... suffice it to say: messing with the AST is where I feel way out of my depth. Enough so that, rather than asking specific 'how do I do X' questions here, this is a question of 'is X the right way to go about it, or what is the proper way to do it, and can you give an example of how it ties together?'... so if this question is rather lengthy: my apologies.
My current goal is to have custom markup like the following which is supposed to 'track' which character says something:
<paul|"Hi there">
If I convert to HTML, I'd want something similar to:
<span class="speech paul">"Hi there"</span>
to pop out (and perhaps the <p> tags), whereas if it is just pure markdown / plain text, I'd want it to silently disappear:
"Hi there"
From the JSON AST structures I've studied, it would seem that I want a new structure type, similar to the 'Emph' tag, called 'Speech', which allows whole blobs of text to be put inside it with a bit of extra information attached (the person speaking). Something like this:
{"t":"Speech","speaker":"paul","c":[ ... ] }
Problem #1: by the time a lua-filter sees the document, it has obviously already been distilled to an AST. This means replacing the items in a manner similar to what most macro-expander samples do cannot really work, since it would require reading forward. With this method I would just replace bits and pieces in place (<NAME| becomes a StartSpeech and the first solitary > that follows becomes an EndSpeech), but that would make malformed input a bigger potential problem because of silent-ish failures. Additionally, these tags would be completely out of step with how an AST is supposed to look.
To complicate matters even further, some of my characters learn a secondary language over the course of the story, for which I apply a different format that pairs the spoken text with the perspective character's simplified understanding of what was said. Example:
<paul|"Heb je goed geslapen?"|"Did you ?????">
I could probably add a third 'UnderstoodSpeech' group to my filter, but (problem #2) at that point the relationship between the speaker, the original speech, and the understood translation is completely gone. As long as the final documents need these values in these respective orders, and only in these orders, it is fine... but what if I want my HTML version to look like
"Did you?????"
with a tool-tip / hover-over effect containing the original speech? That would be near impossible to achieve because the AST does not contain that kind of relational detail.
Whatever kind of AST I create in the filter is what I will need to understand in my custom writer. Ideally, I want to reuse as much of pandoc's stock functionality as possible in the writer, but I don't even know if that is feasible at this point.
So now my question: could someone with a good understanding of pandoc please give me an example of how to keep the relevant bits of data together and apply them in the correct manner? By this I mean a basic example of what needs to go into the lua-filter and lua-writer scripts in the following toolchain:
[CUSTOMIZED MARKDOWN INPUT] -> lua-filter -> lua-writer -> [CUSTOMIZED HTML5 OUTPUT]
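For concreteness, a minimal sketch of the filter half of that chain, assuming pandoc's Lua filter API; it only handles markup that fits inside a single Str element (speech containing spaces would need a walker over whole Inline lists), so it is a starting point rather than a solution:

-- speech.lua: turn <name|text> into a Span in HTML output, bare text elsewhere
function Str(el)
  local speaker, text = el.text:match("^<(%w+)|(.+)>$")
  if not speaker then
    return nil  -- leave ordinary words untouched
  end
  if FORMAT:match("html") then
    -- keep speaker and speech together in a single AST node
    return pandoc.Span({pandoc.Str(text)}, pandoc.Attr("", {"speech", speaker}))
  end
  return pandoc.Str(text)  -- plain output: the markup silently disappears
end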

Multiple parsers calling each other?

I am working on a complicated system that uses a number of XML schemas and associated parsers. One of the schemas is used to hold general data that are accessed by all of the other schemas. I would like to maintain this division in the (flex and bison) parsers. So, if I parse the main XML file and get to, say, the tag <matrix>, I would like to call a <matrix> parser as a subroutine, return its content to the calling program, and continue parsing after the </matrix> tag. I have been looking around the net but have not found anything useful. Is it even possible to do this?
It seems easiest to maintain the common pieces in a separate file and to split each individual parser into two more files: part 1 has the prologue and the individual grammar rules, part 2 has the epilogue. The three files can then be concatenated (in a Makefile) before running the parser generator:
parser.y: parser.part1 common.inc parser.part2
	cat parser.part1 common.inc parser.part2 > parser.y
Your approach is wrong. You shouldn't need a special parser for each distinctive tag. You should parse all tags regardless of their properties and link them into a tree. Afterwards you can validate the tree to ensure the nesting is consistent. If the markup language you're talking about really is that special, you could create a parser that takes rules describing each tag; in that case parsing and checking are done at the same time, and most HTML parsers are implemented like this.
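A minimal sketch of that parse-everything-then-check idea in Lua (grossly simplified: attributes are skipped, text nodes and self-closing tags are ignored, and nesting is checked as tags are popped rather than in a separate validation pass):

local function parse(xml)
  local root = {tag = "#root", children = {}}
  local stack = {root}
  for slash, tag in xml:gmatch("<(/?)([%w_]+)[^>]*>") do
    if slash == "" then                        -- opening tag: push a new node
      local node = {tag = tag, children = {}}
      table.insert(stack[#stack].children, node)
      table.insert(stack, node)
    else                                       -- closing tag: pop and check nesting
      local node = table.remove(stack)
      assert(node and node.tag == tag, "mismatched </" .. tag .. ">")
    end
  end
  assert(#stack == 1, "unclosed <" .. stack[#stack].tag .. ">")
  return root
end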

What is the proper Lua pattern for quoted text?

I've been playing with this for an hour or two and have found myself at a roadblock with the Lua pattern-matching utilities. I am attempting to match all quoted text in a string and replace it if needed.
The pattern I have come up with so far is: (\?[\"\'])(.-)%1
This works in some cases but, not all cases:
Working: "This \"is a\" string of \"text to\" test with"
Not Working: "T\\\"his \"is\' a\" string\" of\' text\" to \"test\" wit\\\"h"
In the not-working example, this is what I would like it to match (I made a function that gets the matches I desire; I'm just looking for a pattern to use with gsub, and am curious whether a Lua pattern can do this):
string
a" string" of
is' a" string" of' text
test
his "is' a" string" of' text" to "test" wit
I'm going to continue to use my function for the time being, but am curious whether there is a pattern I could/should be using, and whether I'm just missing something with patterns.
(a few edits because I forgot about Stack Overflow's formatting)
(another edit to make a non-HTML example, since the original was leading to assumptions that I was attempting to parse HTML)
Trying to match escaped, quoted text using regular expressions is like trying to remove the daisies (and only the daisies) from a field using a lawnmower.
I made a function that gets the matches I desire
This is the correct move.
I'm curious if a lua pattern can do this
From a practical point of view, even if a pattern could do this, you wouldn't want to. From a theoretical point of view, you are trying to find a double quote that is preceded by an even number of backslashes. This is definitely a regular language, and the regular expression you want would be something like the following (in Lua quoting conventions):
[[[^\](\\)*"(.-[^\](\\)*)"]]
And the quoted string would be result #2. But Lua patterns are not full regular expressions; in particular, you cannot put a * after a parenthesized pattern.
So my guess is that this problem cannot be solved using Lua patterns; but since Lua patterns are not a standard object of automata theory, I'm not aware of any body of proof techniques that you could use to prove it.
The issue with escaped quotes is that, in general, if there's an odd number of backslashes before the quote, then it's escaped, and if there's an even number, it's not. I do not believe that Lua pattern-matching is powerful enough to represent this condition, so if you need to parse text like this, then you should seek another way. Perhaps you can iterate through the string and parse it, or you could find each quote in turn and read backwards, counting the backslashes until you find a non-backslash character (or the beginning of the string).
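The backslash-counting idea from that last sentence is only a few lines (a sketch; the function name is mine):

-- is the character at position i escaped by the backslashes before it?
local function is_escaped(s, i)
  local n = 0
  while i - n > 1 and s:sub(i - n - 1, i - n - 1) == "\\" do
    n = n + 1
  end
  return n % 2 == 1  -- an odd number of preceding backslashes means escaped
end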
If you absolutely must use patterns for some reason, you could try doing this in a multi-step process. First, gsub for all occurrences of two backslashes in a row, and replace them with some sentinel value. This must be a value that does not already occur in the string. You could try something like "\001" if you know this string doesn't contain non-printable characters. Anyway, once you've replaced all sequences of two backslashes in a row, any backslashes left are escaping the following character. Now you can apply your original pattern, and then finally you can replace all instances of your sentinel value with two backslashes again.
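Sketched in Lua, with one extra step beyond the description above: escaped quotes are hidden behind sentinels as well, so they cannot pair up with real quotes (the \001..\003 sentinels assume the input contains no control characters):

local function replace_quoted(s, repl)
  -- 1. hide doubled backslashes first, then escaped quotes
  s = s:gsub("\\\\", "\001"):gsub("\\\"", "\002"):gsub("\\'", "\003")
  -- 2. with escapes out of the way, quotes pair up naively
  s = s:gsub("([\"'])(.-)%1", repl)
  -- 3. restore the sentinels
  return (s:gsub("\003", "\\'"):gsub("\002", "\\\""):gsub("\001", "\\\\")))
end

print(replace_quoted([[say \"hi\" and "go home"]], function(q, body)
  return q .. body:upper() .. q
end))
--> say \"hi\" and "GO HOME"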
Lua's pattern language is adequate for many simple cases, and it has at least one trick you don't find in a typical regular-expression package: a way to match balanced parentheses. But it has its limits as well.
When those limits are exceeded, I reach for LPeg. LPeg is an implementation of a Parsing Expression Grammar for Lua, written by one of Lua's original authors, so the adaptation to Lua is done quite well. A PEG allows specification of anything from simple patterns through complete language grammars. LPeg compiles the grammar to a bytecode and executes it extremely efficiently.
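For instance, a minimal LPeg sketch for the quoted-string problem above (assumes the lpeg module is installed; it handles double quotes only and treats backslash-plus-anything as an escape):

local lpeg = require "lpeg"
local P, C, Ct = lpeg.P, lpeg.C, lpeg.Ct

local escape = P"\\" * P(1)                    -- a backslash plus whatever it escapes
local body   = (escape + (P(1) - P'"' - P"\\"))^0
local quoted = P'"' * C(body) * P'"'           -- capture the text between real quotes
local all    = Ct((quoted + escape + P(1))^0)  -- scan a whole string, skipping escapes

for _, s in ipairs(all:match([[a \"b\" c "d \" e" f]])) do
  print(s)  --> d \" e
end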
You should NOT be trying to parse HTML with regular expressions. HTML and XML are not regular languages and cannot be reliably manipulated with regular expressions; you should use a dedicated HTML parser. There are plenty of explanations of why.

Which templating languages output HTML *as a tree of nodes*?

HTML is, above all, a tree of nodes. It's not just text.
However, most templating engines handle their input and output as if it were just text; they don't care what happens around their tags, their {$foo}'s and <% bar() %>'s, nor about what they are outputting. Sometimes they happen to produce correct HTML, but that's just a coincidence; they didn't aim for it. All they wanted was to replace some funny marks in the text stream with their evaluations.
There are a few templating engines which do treat their output as a set of nodes; XSLT and Haml come to mind. For some tasks this has advantages: for example, you can automatically reformat (delete all empty text nodes, auto-indent, word-wrap). The result is guaranteed to be correct XML/SGML, provided you stick to the subset of operations that cannot break it. Also, such a templating engine automatically quotes strings, differently in text nodes and in attributes, because it knows exactly whether you're writing an attribute or a text node.
Moreover, it can conditionally remove a node from the output, because it knows where the node begins and ends, and it can perform other non-trivial node operations.
You might not like XSLT for its verbosity or its functional style, but it helps a great deal that your template is xmllint-able XML and that your output is well-formed SGML/XML.
So the question is: Which template engines do you know that treat their output as a set of correct nodes, not just an unstructured text?
I know XSLT, Haml and some obscure python-based one. Moar!
Which template engines do you know that treat their output as a set of correct nodes?
Surprisingly, ASP.NET does! You can change the HTML output of the page through a kind of DOM if you want: http://en.wikipedia.org/wiki/Asp.net#Rendering_technique
HTML combinator libraries, e.g.:
Wing Beats (F#)
SharpDOM (C#)
CityLizard (.NET, any language)
BlazeHtml (Haskell)
Ocsigen / Eliom (OCaml)
XML literals as a language feature can easily act as an "HTML template language":
VB.NET XML literals
Hasic (uses VB.NET XML literals)
Scala XML literals
Hamlet (uses Template Haskell)
though these are all embedded DSLs, dependent on their host language
There is a standard way of representing XML (or a subset of it such as XHTML) in Scheme known as SXML. This is one of the things that, in my opinion, make Scheme a good language for web development. It is possible to build up the contents of a document as a native Scheme list, and then render this to (correct) XHTML in one function call.
Here is an example that takes a simple text string and wraps it as the contents of a one-paragraph HTML page. The function as-page acts as a template; its output is a Scheme list which can easily be translated to its equivalent HTML. Unbalanced or malformed tags are not possible with this approach.
(use-modules (sxml simple))
(define (as-page txt)
  `(html
     (head (title "A web page"))
     (body (p ,txt))))
(as-page "It works!!!!!")
;; $2 = (html (head (title "A web page")) (body (p "It works!!!!")))
(sxml->xml (as-page "It works!!!!"))
;; $3 = <html><head><title>A web page</title></head><body><p>It works!!!!</p></body></html>
TAL (originally part of Zope but now implemented in a variety of languages) is XML-based. It's very logical and intention-revealing to work with: instead of shoving in a heap of text, you're telling the template something like "set the href attribute of this link to http://google.com/ and set its text content to 'Search Google'". You don't have to manage which strings need to be escaped: generally, if you intend something to be interpreted as markup, you put it in a template, and if you don't intend it to be interpreted as markup, you feed it in as an attribute value or text content and TAL will escape it correctly.
Basically all templating engines which use XML as their file format (for defining templates). By using XML, they enforce that the file must be well-formed.
[EDIT] Examples are: Genshi (Python) or JSP 2.0 (Java).
With the Nagare web framework, the views are always a tree of XML nodes, built directly in Python.
The tree can then be manipulated in Python, transformed with XSL, serialized to HTML or XHTML...
(the 'nagare.namespaces' package comes with the Nagare project but can be used in any Python application)
Example:
>>> from nagare.namespaces import xhtml_base
>>> h = xhtml_base.Renderer()  # The XHTML tree builder
>>>
>>> # Building the tree of nodes
>>> with h.html:
...     with h.body:
...         h << h.h1('Hello world')
...
>>> tree = h.root  # The tree root element
>>>
>>> print tree.write_xmlstring()  # Tree serialized in XML
<html><body><h1>Hello world</h1></body></html>
I maintain a list of push-style templating systems here:
http://perlmonks.org/?node_id=674273
And I am in the process of evaluating various Perl templating systems for their separation index:
http://bit.ly/bXaYt7
The tree-based one is written by me, HTML::Seamstress:
http://search.cpan.org/dist/HTML-Seamstress/
The term "push-style" comes from Terence Parr's paper "Enforcing Strict Model-View Separation in Template Engines":
http://www.cs.usfca.edu/~parrt/papers/mvc.templates.pdf
Also, Heist (http://snapframework.com/docs/tutorials/heist) from Haskell's Snap framework seems to fit.
TAL is absolutely not push-style. It may be XML-based but it is pull-style (the most degenerate form of push-style).

Will ANTLR Help? Different Suggestion?

Before I dive into ANTLR (because it is apparently not for the faint of heart), I just want to make sure I have made the right decision regarding its usage.
I want to create a grammar that will parse a text file with predefined tags so that I can populate values within my application. (The text file is generated by another application.) So, essentially, I want to be able to parse something like this:
Name: TheFileName
Values: 5 3 1 6 1 3
Other Values: 5 3 1 5 1
In my application, TheFileName is stored as a String, and both sets of values are stored in an array. (This is just a sample; the file is much more complicated.) Anyway, am I at least going down the right path with ANTLR? Any other suggestions?
Edit
The files are created by the user and they define the areas via tags. So, it might look something like this.
Name: <string>TheFileName</string>
Values: <array>5 3 1 6 1 3</array>
Important Value: <double>3.45</double>
Something along those lines.
The basic question is: how is the file more complicated? Is it basically more of the same, with a tag, a colon and one or more values, or is the structure of the other lines more complex? If it's basically just more of the same, code to recognize and read the data is pretty trivial, and a parser generator isn't likely to gain you much. If the other lines have substantially different structure, it'll depend primarily on how they differ.
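For the sample shown, that "trivial" reader might look like this (a sketch in Lua; the file name is illustrative):

local record = {}
for line in io.lines("input.txt") do
  local tag, rest = line:match("^([^:]+):%s*(.*)$")
  if tag then
    local values = {}
    for v in rest:gmatch("%S+") do
      values[#values + 1] = v
    end
    record[tag] = values  -- e.g. record["Values"] = {"5","3","1","6","1","3"}
  end
end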
Edit: Based on what you've added, I'd go one (tiny) step further, and format your file as XML. You can then use existing XML parsers (and such) to read the files, extract data, verify that they fit a specified format, etc.
It depends on how much control you have over the format of the file you are parsing. If you have no control, then a parser generator such as ANTLR may be valuable. (We do this ourselves for FORTRAN output files over which we have no control.) It's quite a bit of work, but we have now mastered the basic ANTLR lexer/parser strategy and it's starting to work well.
If, however, you have some or complete control over the format, then create it with as much markup as necessary. I would always create such a file in XML, as there are so many tools for processing it (not only parsing, but also XPath, databases, etc.). In general, we use ANTLR to parse semi-structured information into XML.
If you don't need for the format to be custom-built, then you should look into using an existing format such as JSON or XML, for which there are parsers available.
Even if you do need a custom format, you may be better off designing one that is dirt simple so that you don't need a full-blown grammar to parse it. Designing your own scripting grammar from scratch and doing a good job of it is a lot of work.
Writing grammar parsers can also be really fun, so if you're curious then you should go for it. But I don't recommend carelessly mixing learning exercises with practical work code.
Well, if it's "much more complicated", then yes, a parser generator would be helpful. But since you don't show the actual format of your file, how could anybody know what the right tool for the job might be?
I use the free GOLD Parser Builder, which is incredibly easy to use, and can generate the parser itself in many different languages. There are samples for parsing such expressions also.
If the format of the file is up to the user, can you even define a grammar for it?
It seems like you want a lexer at best. Using ANTLR just for the lexer part is possible, but would seem like overkill.
