NotNumber in XPath - parsing

The W3C working draft "Building a Tokenizer for XPath or XQuery" mentions a NotNumber. I can't find any reference for it. What is it?
Reference: https://www.w3.org/TR/xquery-xpath-parsing/#DEFAULT_xpath
There, under "2.1.2 XPath Lexical States", the first table matches a pattern called "NotNumber", among others. Is this something like the IEEE 754 NaN?
The same question applies to NotOperatorKeyword, which also has no reference.


What is the definitive EBNF syntax?

I am currently using the lark parser for Python to try and read in some problem specifications. I am getting confused about what the "proper" syntax is for Extended Backus-Naur Form, especially about how the LHS and RHS are separated. The Wikipedia page uses an equals sign (=), lark expects just a colon (see the lark cheat sheet), and other sources use the ::= separator - e.g. the Atom ebnf package.
Is there a definitive answer? The official ISO spec seems to suggest that the "defining-symbol" should be = but there seems to be wriggle room in the spec. So why all the different versions?
Since the world hasn't yet appointed a Lord High Commissioner of Grammar Formalisms, there is no definitive syntax. You're certainly free to use the ISO "Extended BNF" standard, particularly if you're writing some other ISO standard, but don't expect it to be implemented by a parser generator, even one which extends normal BNF. (There's no definitive standard for BNF, either.)
I have no way of knowing what was going on in the minds of the authors of the ISO standard, but I suspect that their expectations were realistic: it's intended to allow precise description of syntaxes for standards documents, but there are many features which are not suitable for automated implementation (including a way of writing rule restrictions in English to be used when the formalism isn't sufficiently general). It's often possible to automatically extract (most of) a grammar from an ISO standard, but the task is neither simple nor -- as far as I can see -- intended to be simple, since most ISO standards are not distributed as plain text documents and extracting formatted text from either PDF or HTML formats presents its own challenges.
The options you present for punctuation are most of the common ones, although mathematicians often write BNF using ⇒ to separate left- and right-hand sides. (Unfortunately, most keyboards lack that useful character.)
I'm personally not fond of the ::= separator, although it is used by various parser generators. It seems to me to be way too much typing for a simple punctuator, and it is also annoyingly difficult to align with alternatives flagged with |. But to each their own.
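For concreteness, here is the same toy rule written with each of the separators mentioned above (the rule itself is made up purely for illustration):

```
digit = "0" | "1" ;      (ISO 14977: defining-symbol is "=", rules end with ";")
digit: "0" | "1"         (lark: colon separator, no terminator)
digit ::= "0" | "1"      (W3C-style grammars, e.g. XPath/XQuery: "::=")
```

The content of the rule is identical in all three; only the punctuation conventions differ, which is exactly why a grammar written for one tool usually needs mechanical translation before another tool will accept it.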

CommonMark Parsing ***

Let's say I want to parse the string ***cat*** into Markdown using the CommonMark standard. The standard says (http://spec.commonmark.org/0.28/#phase-2-inline-structure):
....
If one is found:
Figure out whether we have emphasis or strong emphasis: if both closer
and opener spans have length >= 2, we have strong, otherwise regular.
Insert an emph or strong emph node accordingly, after the text node
corresponding to the opener.
Remove any delimiters between the opener and closer from the delimiter
stack.
Remove 1 (for regular emph) or 2 (for strong emph) delimiters from the
opening and closing text nodes. If they become empty as a result,
remove them and remove the corresponding element of the delimiter
stack. If the closing node is removed, reset current_position to the
next element in the stack.
....
Based on my reading of this the result should be <em><strong>cat</strong></em> since first the <strong> is added, THEN the <em>. However, all online markdown editors I have tried this in output <strong><em>cat</em></strong>. What am I missing?
Here is a visual representation of what I think should be happening
TextNode[***] TextNode[cat] TextNode[***]
TextNode[*] StrongEmphasis TextNode[cat] TextNode[*]
TextNode[] Emphasis StrongEmphasis TextNode[cat] TextNode[]
Emphasis StrongEmphasis TextNode[cat]
It's important to remember that Commonmark and Markdown are not necessarily the same thing. Commonmark is a recent variant of Markdown. Most Markdown parsers existed and established their behavior long before the Commonmark spec was even started.
While the original Markdown rules make no comment on whether the <em> or <strong> tag should be first in the given example, the reference implementation's (markdown.pl) actual behavior was to list the <strong> tag before the <em> tag in the output. In fact, the MarkdownTest package, which was created by the author of Markdown and markdown.pl, explicitly required that output (the original is no longer available online that I know of, but mdtest is a faithful copy with its history showing no modifications of that test since the initial import from MarkdownTest). AFAICT, every (non-Commonmark) Markdown parser has followed that behavior exactly.
The Commonmark spec took a different route. The spec specifically states in Rule 14 of Section 6.4 (Emphasis and strong emphasis):
An interpretation <em><strong>...</strong></em> is always preferred to <strong><em>...</em></strong>.
... and backs it up with example 444:
***foo***
<p><em><strong>foo</strong></em></p>
In fact, you can see that that is exactly the behavior of the reference implementation of Commonmark.
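The effect of rule 14 can be illustrated with a toy sketch (this is not the spec's delimiter-stack algorithm, just the resulting nesting for a single balanced run of n delimiters around some text):

```python
def emphasis_html(delims: int, text: str) -> str:
    """Toy illustration of CommonMark rule 14: pairs of delimiters
    become <strong> from the inside out, so a leftover single
    delimiter ends up as the outermost <em>."""
    out = text
    while delims >= 2:          # each pair of * wraps a <strong>
        out = f"<strong>{out}</strong>"
        delims -= 2
    if delims == 1:             # a leftover single * wraps the outer <em>
        out = f"<em>{out}</em>"
    return out

print(emphasis_html(3, "cat"))  # <em><strong>cat</strong></em>
```

For `***cat***` (a run of three), this yields `<em><strong>cat</strong></em>`, matching example 444.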
As an aside, the original question quotes from the Appendix to the spec which recommends how to implement a parser. While potentially useful to a parser creator, I would not recommend using that section to determine proper syntax handling and/or output. The actual rules should be consulted instead; and in fact, they clearly provide the expected output in this instance. But this question is about an apparent disparity between implementations and the spec, not interpretation of the spec.
For a more complete comparison, see Babelmark. With the exception of a few (completely) broken implementations, every "classic" Markdown parser follows markdown.pl, while every Commonmark parser follows the Commonmark spec. Therefore, there is no actual disparity between the spec and implementations. The disparity is between Markdown and Commonmark.
As for why the Commonmark authors chose a different route in this regard, or why they insist on calling Commonmark "Markdown" when it is clearly different, those questions are off topic here and better asked of the authors themselves.

Is the alternative operator in ABNF commutative?

Is the alternative operator (/) in Augmented Backus-Naur Form commutative?
For example, is s = a / b the same as s = b / a?
I haven't found any primary sources on BNF or ABNF which explicitly specify / semantics when both sides would yield valid matches. They don't allude to context-free grammars and their allowance for non-determinism either. If anyone knows of clarifying references please share.
EDIT: Tony's answer points out RFC 3501 from 2003 specifies the semantics of ABNF alternation, at least as it's used in that document.
RFC 5234: Augmented BNF for Syntax Specifications: ABNF (2008)
The introduction contrasts BNF and ABNF (with emphasis added here):
Over the years, a modified version of Backus-Naur Form (BNF), called Augmented BNF (ABNF), has been popular among many Internet specifications. It balances compactness and simplicity with reasonable representational power. In the early days of the Arpanet, each specification contained its own definition of ABNF. This included the email specifications, RFC 733 and then RFC 822, which came to be the common citations for defining ABNF. The current document separates those definitions to permit selective reference.
The differences between standard BNF and ABNF involve naming rules, repetition, alternatives, order-independence, and value ranges.
"Selective reference" and "order-independence" may relate to alternation ordering semantics, but it's unclear.
RFC 822: Standard for the Format of ARPA Internet Text Messages (1982)
Unless I'm missing something, the cited RFCs don't specify / semantics either. Section 2.2 evades the problem.
2.2. RULE1 / RULE2: ALTERNATIVES
Elements separated by slash ("/") are alternatives. There-
fore "foo / bar" will accept foo or bar.
Various rule definitions show they recognize the practical importance of avoiding ambiguity. For example, here's how RFC 822 defines optional-field and its dependencies:
optional-field =
/ "Message-ID" ":" msg-id
/ "Resent-Message-ID" ":" msg-id
/ "In-Reply-To" ":" *(phrase / msg-id)
/ "References" ":" *(phrase / msg-id)
/ "Keywords" ":" #phrase
/ "Subject" ":" *text
/ "Comments" ":" *text
/ "Encrypted" ":" 1#2word
/ extension-field ; To be defined
/ user-defined-field ; May be pre-empted
extension-field =
<Any field which is defined in a document
published as a formal extension to this
specification; none will have names beginning
with the string "X-">
user-defined-field =
<Any field which has not been defined
in this specification or published as an
extension to this specification; names for
such fields must be unique and may be
pre-empted by published extensions>
The Syntax and Semantics of the Proposed International Algebraic Language of the Zurich ACM-GAMM Conference (Backus 1958)
BNF comes from the IAL notation. The paper introduces an "or" metalinguistic connective (written with an overline), which is intuitively related to /. However, it also dodges the ambiguous-choice problem and presumably just uses it carefully.
Recommendation
Given the unspecified semantics, my suggestion is to treat every possible match in an alternation rule as valid. When the grammar isn't carefully designed to avoid ambiguity, this interpretation can result in multiple valid parse trees for the same input. Addressing ambiguous parses as they occur is safer than forging ahead with an unintentionally valid parse tree.
Alternatively, if you have influence over how the grammar is specified you could consider a notation with clearer semantics. For example, Parsing Expression Grammar: A Recognition-Based Syntactic Foundation (Ford 2004) gives alternatives deterministic prioritized choice semantics (left-most match wins).
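The two readings can be sketched with toy combinators (the names and structure here are illustrative, not from any particular library):

```python
def match_literal(lit):
    """Return a parser that matches a literal and yields the end position."""
    def parse(text, pos):
        return pos + len(lit) if text.startswith(lit, pos) else None
    return parse

def alt_all(parsers, text, pos):
    """Ambiguity-preserving alternation: keep every branch that matches."""
    return [end for p in parsers if (end := p(text, pos)) is not None]

def alt_peg(parsers, text, pos):
    """PEG-style prioritized choice: the first matching branch wins."""
    for p in parsers:
        end = p(text, pos)
        if end is not None:
            return end
    return None

branches = [match_literal("for"), match_literal("foreach")]
print(alt_all(branches, "foreach", 0))  # [3, 7] -- both alternatives match
print(alt_peg(branches, "foreach", 0))  # 3 -- the earlier alternative takes priority
```

With PEG semantics, swapping the two branches changes the result, which is precisely why ordered choice is not commutative while an ambiguity-preserving reading is.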
Some RFCs do clarify this explicitly; for example, IMAP4rev1's RFC 3501 specifies PEG-like behaviour in section 9:
In the case of alternative or optional rules in which a later rule
overlaps an earlier rule, the rule which is listed earlier MUST take
priority. For example, "\Seen" when parsed as a flag is the \Seen
flag name and not a flag-extension, even though "\Seen" can be parsed
as a flag-extension. Some, but not all, instances of this rule are
noted below.
I don't know how common such disambiguation (hah) is, though. Many other RFCs I've looked at (I've been implementing an ABNF parser library in recent days) just leave it unspecified. Many RFC ABNF grammars are unambiguous (e.g. RFC8259 (JSON)); however, many are ambiguous (e.g. RFC5322 (Internet Messages)) and require fixups to work with an ambiguity-preserving parser :-(

Does this require a 2-pass parse: comments embedded within tokens?

Using a parser generator I want to create a parser for "From headers" in email messages. Here is an example of a From header:
From: "John Doe" <john#doe.org>
I think it will be straightforward to implement a parser for that.
However, there is a complication in the "From header" syntax: comments may be inserted just about anywhere. For example, a comment may be inserted within "john":
From: "John Doe" <jo(this is a comment)hn#doe.org>
And comments may be inserted in many other places.
How to handle this complication? Does it require a "2-pass" parser: one pass to remove all comments and a second pass to create the parse tree for the From header? Do modern parser generators support multiple passes on the input? Can it be parsed in a single pass? If yes, would you sketch the approach please?
I'm not convinced that your interpretation of email addresses is correct; my reading of RFC-822 leads me to believe that a comment can only come before or after a "word", and that "word"s in the local-part of an addr-spec need to be separated by dots ("."). Section 3.1.4 gives a pretty good hint on how to parse: you need a lexical analyzer which feeds syntactic symbols into the parser; the lexical analyzer is expected to unfold headers, ignore whitespace, and identify comments, quoted strings, atoms, and special characters.
Of course, RFC-822 has long been obsoleted, and I think that email headers with embedded comments are anachronistic.
Nonetheless, it seems like you could easily achieve the analysis you wish using flex and bison. As indicated, flex would identify the comments. Strictly speaking, you cannot identify comments with a regular expression, since comments nest. But you can recognize simple nested structures using a start condition stack, or even more economically by maintaining a counter (since flex won't return until the outermost parenthesis is found, the counter doesn't need to be global.)
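As a rough sketch of the counter idea in Python rather than flex (simplified: it ignores quoted strings and backslash escapes, which a real RFC 822 lexer must respect):

```python
def strip_comments(field: str) -> str:
    """Strip RFC 822 comments by tracking parenthesis nesting depth.
    Characters inside any open parenthesis are dropped; everything
    at depth zero is kept."""
    out, depth = [], 0
    for ch in field:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)

print(strip_comments("jo(this is a comment)hn#doe.org"))  # john#doe.org
```

Run as a pre-tokenization step, this makes the rest of the header a single-pass parse: the grammar proper never sees a comment, so no second pass over the parse tree is needed.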

EBNF Grammar for list of words separated by a space

I am trying to understand how to use EBNF to define a formal grammar, in particular a sequence of words separated by a space, something like
<non-terminal> [<word>[ <word>[ <word>[ ...]]]] <non-terminal>
What is the correct way to define a word terminal?
What is the correct way to represent required whitespace?
How are optional, repetitive lists represented?
Are there any show-by-example tutorials on EBNF anywhere?
Many thanks in advance!
You have to decide whether your lexical analyzer is going to return a token (terminal) for the spaces. You also have to decide how it (the lexical analyzer) is going to define words, or whether your grammar is going to do that (in which case, what is the lexical analyzer going to return as terminals?).
For the rest, it is mostly a question of understanding the niceties of EBNF notation, which is an ISO standard (ISO 14977:1996 — and it is available as a free download from Freely Available Standards, which you can also get to from ISO), but it is a standard that is largely ignored in practice. (The languages I deal with — C, C++, SQL — use a BNF notation in the defining documents, but it is not EBNF in any of them.)
Whatever you want the correct definition of a word to be. You need to think about how you'd want to treat the name P. J. O'Neill, for example. What tokens will the lexical analyzer return for that?
This is closely related to the previous issue: what are the terminals that the lexical analyzer is going to return?
Optional repetitive lists are enclosed in { and } braces, or you can use the Kleene Star notation.
There is a paper Extended BNF — A generic base standard by R. S. Scowen that explains EBNF. There's also the Wikipedia entry on EBNF.
I think that a non-empty, space-separated word list might be defined using:
non_empty_word_list = word { space word }
where all the names there are non-terminals. You'd need to define those in terms of the relevant terminals of your system.
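In strict ISO 14977 notation (where concatenation is written with a comma and every rule ends with a semicolon), the same idea might look like this; the letter rule below is truncated purely for illustration and would need to cover your full alphabet:

```
non_empty_word_list = word, { space, word } ;
word = letter, { letter } ;
space = " " ;
letter = "a" | "b" | "c" ;  (* extend to the full alphabet as needed *)
```

The `{ ... }` braces express the optional, repetitive tail: zero or more occurrences of a space followed by a word, which guarantees exactly one space between words and no leading or trailing space.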

Resources