I'm trying to wrap my head around how to handle C-style multiline comments (/* */) with a recursive descent parser. Because these comments can appear anywhere, how do you account for them? For example, suppose you're parsing a sentence into word tokens: what do you do if there's a comment inside a word?
Ex.
This is a sentence = word word word word
vs
This is a sen/*sible*/tence = ???
Thanks!
In C, as in nearly every other programming language, a comment is effectively whitespace; a comment cannot occur within a token.
So comments cannot interrupt the parsing of a token, and thus only need to be recognized and ignored.
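As a rough illustration of "recognized and ignored", here is a minimal Python sketch of a tokenizer that treats a non-nested /* */ comment exactly like a run of blanks. The token set (words made of letters only) and the names TOKEN_RE and tokenize are assumptions made up for this example, not part of any particular parser.

```python
import re

# Hypothetical token set: comments and blanks are skipped, words are kept.
TOKEN_RE = re.compile(
    r"(?P<comment>/\*.*?\*/)"   # non-nested C-style comment
    r"|(?P<space>\s+)"          # ordinary whitespace
    r"|(?P<word>[A-Za-z]+)",    # a word token
    re.DOTALL,
)

def tokenize(text):
    """Return the word tokens of text, ignoring comments and whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup == "word":
            tokens.append(m.group())
        pos = m.end()
    return tokens

print(tokenize("This is a sen/*sible*/tence"))
```

With this sketch the comment in "sen/*sible*/tence" ends one word and starts another, just as a space would; the parser proper never sees the comment at all.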
I'm having a hard time figuring out how to recognize some text only if it is preceded and followed by certain things. The task is to recognize AND, OR, and NOT, but not if they're part of a word:
They should be recognized here:
x AND y
(x)AND(y)
NOT x
NOT(x)
but not here:
xANDy
abcNOTdef
AND gets recognized if it is surrounded by spaces or parentheses. NOT gets recognized if it is at the beginning of the input or preceded by a space, and followed by a space or parenthesis.
The trouble is that if I include parentheses as part of the definition of AND or NOT, they get consumed, and I need them to be separate tokens.
Is there some kind of lookahead/lookbehind syntax I can use?
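For comparison, outside ANTLR the lookahead/lookbehind idea can be sketched with Python's re module. This is only an illustration under a simplifying assumption: a keyword counts when it is not directly adjacent to a letter or digit, which is looser than the exact NOT rule described above. KEYWORD_RE and find_keywords are names invented for the example.

```python
import re

# Assumption: a keyword is recognized when no letter or digit touches it.
# The lookbehind and lookahead check the neighbours without consuming them,
# so the parentheses remain available as separate tokens.
KEYWORD_RE = re.compile(r"(?<![A-Za-z0-9])(AND|OR|NOT)(?![A-Za-z0-9])")

def find_keywords(text):
    """Return the operator keywords found in text, in order."""
    return [m.group(1) for m in KEYWORD_RE.finditer(text)]

for sample in ["x AND y", "(x)AND(y)", "NOT x", "NOT(x)", "xANDy", "abcNOTdef"]:
    print(sample, "->", find_keywords(sample))
```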
EDIT:
Per the comments, here's some context. The problem is related to this problem: Antlr: how to match everything between the other recognized tokens? My working solution there is just to recognize AND, OR, etc. and skip everything else. Then, in a second pass over the text, I manually grab the characters not otherwise covered, and run a totally different tokenizer on it. The reason is that I need a custom, human-language-specific tokenizer for this content, which means that I can't, in advance, describe what is an ID. Each human language is different. I want to combine, in stages, a single query-language tokenizer, and then apply a human-language tokenizer to what's left.
ANTLR is not the right tool for this task. A normal parser is designed for a specific language, that is, a set of sentences consisting of elements that are known at parser creation time. There are ways to make this more flexible, e.g. by using a runtime function in a predicate to recognize words not defined in the grammar, but this has other (negative) implications.
What you should consider instead is NLP (natural language processing), which takes a different approach to processing natural language. It's more than just skipping things between two known tokens.
Which style of multiline comments is used in Dart?
I know the C style of multiline comments. This style does not allow multiline comments inside multiline comments (nested comments).
That is, a C-style comment ends at the first */ encountered inside it.
Examples:
Valid C-style comment:
/*
*/
Not valid C-style comment:
/*
/**/
*/
In Dart both styles are valid, but as far as I know most popular languages use only C-style comments.
Here is my question.
Where does this style in the Dart language come from, from both a historical and a practical point of view?
P.S.
I am writing a PEG parser for Dart and was surprised when I found this in the grammar.
This rule prevents my parser from automatically recognizing a multiline comment as a terminal, because the rule calls itself recursively.
MULTI_LINE_COMMENT <- '/*' (MULTI_LINE_COMMENT / !'*/' .)* '*/' ;
Also, how can this multiline comment be described in Bison/Flex terminology?
This question arises because, in PEG parser terminology, comments are part of whitespace. And whitespace can in most cases be treated as a terminal because it does not change behaviour (it does not branch and is not recursive by human logic, i.e. it is turned directly into tokens by lexical scanners).
I know that in PEG parsers there is no division into terminals and non-terminals, but for better error reporting some heuristic analysis of grammar rules never hurts.
Where does this style in the Dart language come from?
I believe they added this because it makes it easier to comment out large blocks of code which may already contain block comments. Most other grammatical constructs nest, so it always seemed strange to me that C-style block comments did not.
I think C originally worked that way because it made it easier to lex on old PDP-11s with almost no memory. We don't have that limitation anymore, so we can have a more user-friendly comment syntax.
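To make the nesting concrete, the PEG rule quoted in the question can be followed by hand in a few lines of Python. This is only a sketch: match_comment is a hypothetical helper name, and it follows the rule literally (try a nested comment first, otherwise consume any character that does not start "*/") without the memoization a packrat parser would add.

```python
def match_comment(text, pos=0):
    """Return the index just past a nested comment starting at pos, or None.

    Follows: MULTI_LINE_COMMENT <- '/*' (MULTI_LINE_COMMENT / !'*/' .)* '*/'
    """
    if not text.startswith("/*", pos):
        return None
    i = pos + 2
    while True:
        inner = match_comment(text, i)          # first alternative: nested comment
        if inner is not None:
            i = inner
        elif i < len(text) and not text.startswith("*/", i):
            i += 1                              # second alternative: !'*/' .
        else:
            break
    # The rule ends with a literal '*/'.
    return i + 2 if text.startswith("*/", i) else None

source = "/* a /* b */ c */ rest"
print(repr(source[:match_comment(source)]))
```

A C-style lexer would instead stop at the first */ it meets, which is why the second example in the question is not one complete comment in C.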
Using a parser generator I want to create a parser for "From headers" in email messages. Here is an example of a From header:
From: "John Doe" <john@doe.org>
I think it will be straightforward to implement a parser for that.
However, there is a complication in the "From header" syntax: comments may be inserted just about anywhere. For example, a comment may be inserted within "john":
From: "John Doe" <jo(this is a comment)hn@doe.org>
And comments may be inserted in many other places.
How to handle this complication? Does it require a "2-pass" parser: one pass to remove all comments and a second pass to create the parse tree for the From header? Do modern parser generators support multiple passes on the input? Can it be parsed in a single pass? If yes, would you sketch the approach please?
I'm not convinced that your interpretation of email addresses is correct; my reading of RFC-822 leads me to believe that a comment can only come before or after a "word", and that "word"s in the local-part of an addr-spec need to be separated by dots ("."). Section 3.1.4 gives a pretty good hint on how to parse: you need a lexical analyzer which feeds syntactic symbols into the parser; the lexical analyzer is expected to unfold headers, ignore whitespace, and identify comments, quoted strings, atoms, and special characters.
Of course, RFC-822 has long been obsoleted, and I think that email headers with embedded comments are anachronistic.
Nonetheless, it seems like you could easily achieve the analysis you wish using flex and bison. As indicated, flex would identify the comments. Strictly speaking, you cannot identify comments with a regular expression, since comments nest. But you can recognize simple nested structures using a start condition stack, or even more economically by maintaining a counter (since flex won't return until the outermost parenthesis is found, the counter doesn't need to be global.)
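The counter idea is not specific to flex. Here is a minimal Python sketch of it, under some simplifying assumptions: strip_comments is a made-up helper, it drops comments instead of returning tokens, it treats a backslash as escaping the next character everywhere, and it leaves parentheses inside double-quoted strings alone.

```python
def strip_comments(header):
    """Remove (possibly nested) parenthesized comments from a header value."""
    out = []
    depth = 0            # current comment nesting level; local, as noted above
    in_quotes = False    # inside a double-quoted string?
    i = 0
    while i < len(header):
        ch = header[i]
        if ch == "\\" and i + 1 < len(header):
            # Escaped character: keep it only when outside a comment.
            if depth == 0:
                out.append(header[i:i + 2])
            i += 2
            continue
        if in_quotes:
            out.append(ch)
            if ch == '"':
                in_quotes = False
        elif depth > 0:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
        elif ch == "(":
            depth = 1
        else:
            if ch == '"':
                in_quotes = True
            out.append(ch)
        i += 1
    if depth != 0:
        raise ValueError("unterminated comment")
    return "".join(out)

print(strip_comments('"John (Jr.) Doe" <john@doe.org> (work (main) address)'))
```

A real lexical analyzer would hand atoms, quoted strings and specials to the parser rather than rebuild a string, but the depth counter would work the same way.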
When defining the grammar for a language parser, how do you deal with things like comments (e.g. /* ... */) that can occur at any point in the text?
Building up your grammar from tags within tags seems to work great when things are structured, but comments seem to throw everything off.
Do you just have to parse your text in two steps? First to remove these items, then to pick apart the actual structure of the code?
Thanks
Normally, comments are treated by the lexical analyzer outside the scope of the main grammar. In effect, they are (usually) treated as if they were blanks.
One approach is to use a separate lexer. Another, much more flexible way is to amend all your token-like entries (keywords, lexical elements, etc.) with an implicit whitespace prefix that is valid for the current context. This is how most modern Packrat parsers deal with whitespace.
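A small Python sketch of the implicit-prefix idea, with no separate lexer: every token parser first skips blanks and non-nested /* */ comments, so the grammar rules never mention them. The toy grammar (a sum of integers) and the names SKIP, token and parse_sum are assumptions chosen only for illustration.

```python
import re

# What counts as "whitespace" in this context: blanks and non-nested comments.
SKIP = re.compile(r"(?:\s+|/\*.*?\*/)*", re.DOTALL)

def token(pattern):
    """Build a parser for one token that skips whitespace/comments first."""
    regex = re.compile(pattern)
    def parse(text, pos):
        pos = SKIP.match(text, pos).end()   # implicit whitespace prefix
        m = regex.match(text, pos)
        return (m.group(), m.end()) if m else None
    return parse

number = token(r"\d+")
plus = token(r"\+")

def parse_sum(text):
    """sum <- number ('+' number)*, evaluated while parsing."""
    first = number(text, 0)
    if first is None:
        raise SyntaxError("expected a number")
    total, pos = int(first[0]), first[1]
    while (op := plus(text, pos)) is not None:
        rhs = number(text, op[1])
        if rhs is None:
            raise SyntaxError("expected a number after '+'")
        total += int(rhs[0])
        pos = rhs[1]
    return total

print(parse_sum("1 /* one */ + 2 + /* skip me */ 39"))
```

Because the prefix lives inside token(), a different context (say, one where newlines are significant) could use a different SKIP without touching the grammar rules.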
I've been playing with this for an hour or two and have found myself at a roadblock with the Lua pattern matching utilities. I am attempting to match all quoted text in a string and replace it if needed.
The pattern I have come up with so far is: (\?[\"\'])(.-)%1
This works in some cases, but not all:
Working: "This \"is a\" string of \"text to\" test with"
Not Working: "T\\\"his \"is\' a\" string\" of\' text\" to \"test\" wit\\\"h"
In the not working example I would like it to match to (I made a function that gets the matches I desire, I'm just looking for a pattern to use with gsub and curious if a lua pattern can do this):
string
a" string" of
is' a" string" of' text
test
his "is' a" string" of' text" to "test" wit
I'm going to continue to use my function instead for the time being, but am curious if there is a pattern I could/should be using and I'm just missing something with patterns.
(a few edits b/c I forgot about Stack Overflow's formatting)
(another edit to make a non-html example since it was leading to assumptions that I was attempting to parse html)
Trying to match escaped, quoted text using regular expressions is like trying to remove the daisies (and only the daisies) from a field using a lawnmower.
I made a function that gets the matches I desire
This is the correct move.
I'm curious if a lua pattern can do this
From a practical point of view, even if a pattern can do this, you don't want to. From a theoretical point of view, you are trying to find a double quote that is preceded by an even number of backslashes. This is definitely a regular language, and the regular expression you want would be something like the following (Lua quoting conventions)
[[[^\](\\)*"(.-[^\](\\)*)"]]
And the quoted string would be result #2. But Lua patterns are not full regular expressions; in particular, you cannot put a * after a parenthesized pattern.
So my guess is that this problem cannot be solved using Lua patterns, but since Lua patterns are not a standard thing in automata theory, I'm not aware of any body of proof technique that you could use to prove it.
The issue with escaped quotes is that, in general, if there's an odd number of backslashes before the quote, then it's escaped, and if there's an even number, it's not. I do not believe that Lua pattern-matching is powerful enough to represent this condition, so if you need to parse text like this, then you should seek another way. Perhaps you can iterate through the string and parse it, or you could find each quote in turn and read backwards, counting the backslashes until you find a non-backslash character (or the beginning of the string).
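The "read backwards and count the backslashes" suggestion can be sketched as follows. It is written in Python rather than Lua purely for illustration, and unescaped_quote_positions is a hypothetical helper name.

```python
def unescaped_quote_positions(s, quote='"'):
    """Return the indexes of quote characters that are not escaped."""
    positions = []
    for i, ch in enumerate(s):
        if ch != quote:
            continue
        # Walk backwards over the run of backslashes directly before the quote.
        j = i - 1
        while j >= 0 and s[j] == "\\":
            j -= 1
        backslashes = i - 1 - j
        if backslashes % 2 == 0:    # even count: the quote itself is not escaped
            positions.append(i)
    return positions

sample = r'a \" b " c \\" d'
print(unescaped_quote_positions(sample))
```

In the sample, the first quote has one backslash before it (escaped), the second has none, and the third has two (an escaped backslash followed by a real quote).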
If you absolutely must use patterns for some reason, you could try doing this in a multi-step process. First, gsub for all occurrences of two backslashes in a row, and replace them with some sentinel value. This must be a value that does not already occur in the string. You could try something like "\001" if you know this string doesn't contain non-printable characters. Anyway, once you've replaced all sequences of two backslashes in a row, any backslashes left are escaping the following character. Now you can apply your original pattern, and then finally you can replace all instances of your sentinel value with two backslashes again.
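The multi-step process might look like this in Python. This is a sketch under stated assumptions: it uses Python's re module rather than Lua patterns, it only looks for text between unescaped double quotes (not the original pattern from the question), and SENTINEL and quoted_spans are names invented for the example.

```python
import re

SENTINEL = "\x01"   # assumed never to occur in the input

def quoted_spans(s):
    """Return the text between unescaped double quotes, escapes left intact."""
    if SENTINEL in s:
        raise ValueError("input already contains the sentinel character")
    # Step 1: hide every pair of backslashes behind the sentinel.
    masked = s.replace("\\\\", SENTINEL)
    # Step 2: any backslash still present escapes the character after it,
    # so a quote is real exactly when no backslash sits right before it.
    pattern = re.compile(r'(?<!\\)"(.*?)(?<!\\)"')
    # Step 3: put the backslash pairs back into each match.
    return [m.group(1).replace(SENTINEL, "\\\\") for m in pattern.finditer(masked)]

sample = r'say \"hi\" then "a \"b\" c\\" end'
print(quoted_spans(sample))
```

Without step 1, the closing quote after the two backslashes in the sample would be mistaken for an escaped quote.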
Lua's pattern language is adequate for many simple cases. And it has at least one trick you don't find in a typical regular expression package: a way to match balanced parentheses. But it has its limits as well.
When those limits are exceeded, I reach for LPeg. LPeg is an implementation of a Parsing Expression Grammar for Lua, and was implemented by one of Lua's original authors, so the adaptation to Lua is done quite well. A PEG lets you write anything from simple patterns through to complete language grammars. LPeg compiles the grammar to a bytecode and executes it extremely efficiently.
You should NOT be trying to parse HTML with regular expressions. HTML and XML are NOT regular languages and cannot be successfully manipulated with regular expressions. You should use a dedicated HTML parser. Here are lots of explanations why.