Using a parser generator, I want to create a parser for "From" headers in email messages. Here is an example of a From header:
From: "John Doe" <john#doe.org>
I think it will be straightforward to implement a parser for that.
However, there is a complication in the "From header" syntax: comments may be inserted just about anywhere. For example, a comment may be inserted within "john":
From: "John Doe" <jo(this is a comment)hn#doe.org>
And comments may be inserted in many other places.
How to handle this complication? Does it require a "2-pass" parser: one pass to remove all comments and a second pass to create the parse tree for the From header? Do modern parser generators support multiple passes on the input? Can it be parsed in a single pass? If yes, would you sketch the approach please?
I'm not convinced that your interpretation of email addresses is correct; my reading of RFC-822 leads me to believe that a comment can only come before or after a "word", and that "word"s in the local-part of an addr-spec need to be separated by dots ("."). Section 3.1.4 gives a pretty good hint on how to parse: you need a lexical analyzer which feeds syntactic symbols into the parser; the lexical analyzer is expected to unfold headers, ignore whitespace, and identify comments, quoted strings, atoms, and special characters.
Of course, RFC-822 has long been obsoleted, and I think that email headers with embedded comments are anachronistic.
Nonetheless, it seems like you could easily achieve the analysis you wish using flex and bison. As indicated, flex would identify the comments. Strictly speaking, you cannot identify comments with a regular expression, since comments nest. But you can recognize simple nested structures using a start condition stack, or even more economically by maintaining a counter (since flex won't return until the outermost parenthesis is found, the counter doesn't need to be global.)
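A minimal flex sketch of the counter approach (the COMMENT start condition, the depth variable, and the quoted-pair rule are only illustrative; the rest of the lexer for atoms, quoted strings, and specials is omitted):

%x COMMENT
%%
    int depth;               /* local to yylex(); safe, because yylex()
                                does not return while inside a comment */
"("                  { depth = 1; BEGIN(COMMENT); }
<COMMENT>"("         { ++depth; }
<COMMENT>")"         { if (--depth == 0) BEGIN(INITIAL); }
<COMMENT>\\.         { /* quoted-pair: an escaped character such as "\)" */ }
<COMMENT>[^()\\]+    { /* any other comment text is simply discarded */ }

Since none of the COMMENT actions return, the whole comment is consumed in a single call to yylex(), which is why the counter can live on the stack rather than being global.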
I was thinking of writing a Pug parser. Indentation is well known to be context-sensitive, but that can be handled with a lexer feedback loop (the approach Python adopts) so that the rest of the grammar is almost context-free. Apart from the indentation, what else makes Pug not context-free?
XML tags are definitely not context-free, since each start tag needs to match its end tag, but Pug has no such restriction, which makes me wonder whether we could just parse each leading identifier as a production for a tag root.
The main thing that Pug seems to be missing, at least from a casual scan of its website, is a formal description of its syntax. Or even an informal description. Perhaps I wasn't looking in the right places.
Still, based on the examples, it doesn't look awful. There will be some challenges; in particular, it does not have a uniform tokenisation context, so the scanner is going to be complicated, not just because of the indentation issue. (I got the impression from the section on whitespace that the indentation rule is much stricter than Python's, but I didn't find a specification of what it is exactly. It appeared to me that leading whitespace after the two-character indent is significant whitespace. But that doesn't complicate things much; it might even simplify the task.)
What will prove interesting is handling embedded JavaScript. You will at least need to tokenise the embedded JS, and the corner cases in the JS spec make it non-trivial to tokenise without parsing. Anyway, just tokenising isn't sufficient to know where the embedded code terminates. (For the lexical challenge, consider the correct identification of regular expression literals. /= might be the start of a regex or it might be a divide-and-assign operator; how a subsequent { is tokenised will depend on that decision.) Template strings present another challenge (recursive embedding). However, JavaScript parsers do exist, so you might be able to leverage one.
In other words, recognising tag nesting is not going to be the most challenging part of your project. Once you've identified that a given token is a tag, the nesting part is trivial (and context-free) because it is precisely defined by the indentation, so a DEDENT token will terminate the tag.
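Assuming the scanner emits INDENT and DEDENT tokens (the token and rule names here are only illustrative, not taken from any Pug grammar), the nesting part could be as small as this bison-style fragment:

element  : TAG                          /* a tag with no children           */
         | TAG INDENT elements DEDENT   /* children end at the DEDENT token */
         ;
elements : element
         | elements element
         ;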
However, it is worth noting that tag parsing is not particularly challenging for XML (or XML-like HTML variants). If you adopt the XML rule that close tags cannot be omitted (except for self-closing tags), then the tagname in a close tag does not influence the parse of a correct input. (If the tagname in the close tag does not match the tagname in the corresponding open tag, then the input is invalid. But the correspondence between open and close tags doesn't change.) Even if you adopt the HTML-5 rule that close tags cannot be omitted except in the case of a finite list of special-case tagnames, you could theoretically do the parse with a CFG. (However, the various error recovery rules in HTML-5 are far from context-free, so that would only work for input which did not require rematching of close tags.)
Ira Baxter makes precisely this point in the cross-linked post he references in a comment: you can often implement context-sensitive aspects of a language by ignoring them during the parse and detecting them in a subsequent analysis, or even in a semantic predicate during the parse. Correct matching of open- and close tagnames would fall into this category, as would the "declare-before-use" rule in languages where the declaration of an identifier does not influence the parse. (Not true of C or C++, but true in many other languages.)
Even if these aspects cannot be ignored -- as with C typedefs, for example -- the simplest solution might be to use an ambiguous CFG and a parsing technology which produces all possible parses. After the parse forest is generated, you could walk the alternatives and reject the ones which are inconsistent. (In the case of C, that would include an alternative parse in which a name was typedef'd and then used in a context where a typename is not valid.)
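Bison's GLR mode is one such technology. A minimal sketch of the classic C ambiguity, with illustrative names (a real grammar would also need the declarations and the %merge handler itself):

%glr-parser
%token NAME
%%
stmt : decl        %merge <join>
     | expr ';'    %merge <join>
     ;
decl : NAME '*' NAME ';' ;   /* "T *x;"  read as a pointer declaration */
expr : NAME '*' NAME     ;   /* "a * b"  read as a multiplication      */

For input like "a * b;" both readings survive the parse; bison hands both semantic values to the join() merge function, and a later pass can discard whichever alternative is inconsistent with how (or whether) the first NAME was typedef'd.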
I found the powerful RegexNER and its superset TokensRegex from Stanford CoreNLP.
There are some rules that should give me fine results, like the pattern for PERSONs with titles:
"g. Meho Mehic" or "gdin. N. Neko" (g. and gdin. are abbrevs in Bosnian for mr.).
I'm having some trouble with the existing tokenizer. It splits some strings into two tokens and leaves others as one. For example, the token "g." is kept as a single word, <word>g.</word>, while the token "gdin." is split into two tokens: <word>gdin</word> and <word>.</word>.
That causes trouble with my regex: I have to deal with both the one-token and the multi-token case (note the two "maybe-dot"s). RegexNER example:
( /g\.?|gdin\.?/ /\./? ([{ word:/[A-Z][a-z]*\.?/ }]+) ) PERSON
This also causes another issue with sentence splitting: some sentences are not recognized correctly, so the regex fails. For example, when a sentence contains "gdin.", the splitter breaks it in two, and the dot ends a (non-existent) sentence. I managed to bypass this with ssplit.isOneSentence = true for now.
Questions:
Do I have to make my own tokenizer (to merge tokens like "gdin."), and if so, how?
Are there any settings I missed that could help me with this?
OK, I thought about this for a bit and can actually think of something pretty straightforward for your case. One thing you could do is add "gdin" to the list of titles in the tokenizer.
The tokenizer rules are in edu.stanford.nlp.process.PTBLexer.flex (look at line 741)
I do not really understand the tokenizer that well, but there is clearly a list of job titles in there, so those must be cases where it will not split off the period.
This will of course require you to work with a custom build of Stanford CoreNLP.
You can get the full code at our GitHub: https://github.com/stanfordnlp/CoreNLP
There are instructions on the main page for building a jar with all of the main Stanford CoreNLP classes. I think if you just run the ant process it will automatically generate the new PTBLexer.java based on PTBLexer.flex.
What exactly is parsing? I mean, generally. How is parsing different from searching? On the command line, if I use the grep tool/command, is that parsing?
For example, if I have just one string:
"Hello world! How are you doing today?"
and I searched (using grep or any other tool) to see whether the word "you" is within that string; is that parsing?
What if I do a web search, for example on Google? Is that parsing?
Or is parsing the name of a process that is part of the larger process known as "search"?
The verb "parse" is essentially related to the word "part", as in "part of speech". (See, for example, the on-line etymology dictionary.)
To "parse" a sentence has traditionally meant to break the sentence down into its component parts and identify their relationship with each other. For example, given "I asked a question.", we can parse it into a subject ("I"), a transitive verb in past tense ("asked"), and an object phrase consisting of an article ("a") and a noun ("question"). The parse indicates that the subject performed some action on the object; this is not the same statement as *"A question asked I", and not just because the latter is ungrammatical.
With the advent of computer languages and computational theory, the term "parsing" has been generalized to include analysis of strings which are not human languages. Some people would even use it to simply mean "to divide a string into its component parts", such as "parsing" a line in a CSV file into fields.
It's quite a stretch to apply that to merely searching for a string inside another string, although there may be contexts in which that is an acceptable use of the word. Personally, I would only use it for the action of completely deconstructing a structured string.
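To make the distinction concrete, here is a small C sketch (purely illustrative): searching only asks whether "you" occurs somewhere in the string, while parsing the same line as CSV breaks it down into its component fields.

#include <stdio.h>
#include <string.h>

int main(void) {
    char line[] = "Hello world!,How are you doing today?,42";

    /* Searching: a single yes/no answer about a substring. */
    printf("contains \"you\"? %s\n", strstr(line, "you") ? "yes" : "no");

    /* Parsing, in the loose "divide into parts" sense: split the line
       into its CSV fields and look at each component. */
    int i = 0;
    for (char *field = strtok(line, ","); field; field = strtok(NULL, ","))
        printf("field %d: %s\n", i++, field);

    return 0;
}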
I am working on a complicated system that uses a number of XML schemas and associated parsers. One of the schemas is used to hold general data that are accessed by all of the other schemas. I would like to maintain this division in the (flex and bison) parsers. So, if I parse the main XML file and get to, say, the tag <matrix>, I would like to call a <matrix> parser as a subroutine, return its content to the calling program and continue parsing there after the </matrix> tag. I have been looking around the net, but I have not found anything useful. Is it even possible to do this?
It seems easiest to maintain the common pieces in a separate file and to split the individual parser components into two more files: part 1 has the prologue and the individual grammar rules, part 2 has the epilogue. Then the three files can be concatenated (in a Makefile) before running the parser generator:
parser.y: parser.part1 common.inc parser.part2
	cat parser.part1 common.inc parser.part2 >parser.y
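For concreteness, a sketch of what the three pieces might contain (the file layout follows the Makefile above; all token and rule names are illustrative):

/* parser.part1 -- prologue and the rules specific to this schema */
%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *msg);
%}
%token MATRIX_OPEN MATRIX_CLOSE ROW_OPEN ROW_CLOSE NUMBER
%%
document : elements ;
elements : /* empty */ | elements matrix ;

/* common.inc -- the shared <matrix> rules reused by every schema parser */
matrix   : MATRIX_OPEN rows MATRIX_CLOSE ;
rows     : row | rows row ;
row      : ROW_OPEN numbers ROW_CLOSE ;
numbers  : NUMBER | numbers NUMBER ;

/* parser.part2 -- epilogue */
%%
void yyerror(const char *msg) { fprintf(stderr, "%s\n", msg); }

Concatenated in that order, the three files form one complete bison grammar, and each schema parser gets its own copy of the shared rules.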
Your approach is wrong. You shouldn't need a special parser for each distinct tag. You should parse all tags regardless of their properties and link them into a tree. Afterwards you can validate the tree to ensure that the nesting of tags is consistent. If the markup language you're talking about really is that special, then you could create a parser that takes rules describing each tag. In that case parsing and checking are done at the same time; most HTML parsers are implemented like this.
When defining the grammar for a language parser, how do you deal with things like comments (e.g. /* .... */) that can occur at any point in the text?
Building up your grammar from tags within tags seems to work great when things are structured, but comments seem to throw everything off.
Do you just have to parse your text in two steps? First to remove these items, then to pick apart the actual structure of the code?
Thanks
Normally, comments are treated by the lexical analyzer outside the scope of the main grammar. In effect, they are (usually) treated as if they were blanks.
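With flex, for example, a start condition can simply swallow the comment so the grammar never sees it (this is essentially the example from the flex manual):

%x COMMENT
%%
"/*"                  BEGIN(COMMENT);   /* enter comment-skipping mode     */
<COMMENT>[^*\n]*      ;                 /* eat anything that's not a '*'   */
<COMMENT>"*"+[^*/\n]* ;                 /* eat '*'s not followed by '/'    */
<COMMENT>\n           ;                 /* count lines here if you need to */
<COMMENT>"*"+"/"      BEGIN(INITIAL);   /* end of the comment              */

No token is returned by any of these rules, so from the parser's point of view the comment behaves exactly like a run of blanks.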
One approach is to use a separate lexer. Another, much more flexible, approach is to amend all your token-like entries (keywords, lexical elements, etc.) with an implicit whitespace prefix, valid for the current context. This is how most modern Packrat parsers deal with whitespace.
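A small C sketch of that idea, independent of any particular parser generator (all names here are illustrative): every token-matching helper first skips layout, so the grammar rules themselves never have to mention whitespace or comments.

#include <ctype.h>
#include <string.h>

static const char *p;                          /* current input position */

/* Skip layout: whitespace and C-style comments. */
static void skip_layout(void) {
    for (;;) {
        while (isspace((unsigned char)*p)) p++;
        if (p[0] == '/' && p[1] == '*') {      /* a comment starts here */
            const char *end = strstr(p + 2, "*/");
            p = end ? end + 2 : p + strlen(p); /* unterminated: eat the rest */
        } else {
            return;
        }
    }
}

/* Every terminal goes through this helper, so matching a keyword or symbol
   implicitly means "optional layout, then that keyword or symbol". */
static int match(const char *lit) {
    size_t n = strlen(lit);
    skip_layout();
    if (strncmp(p, lit, n) == 0) { p += n; return 1; }
    return 0;
}

Packrat/PEG implementations typically bake that same prefix into every terminal of the grammar automatically rather than writing it by hand, but the effect is the same.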