Explanation of JFlex Block Comment rule - flex-lexer

I was looking into how to implement block comments in JFlex for custom language support in IntelliJ and found that they can be described as
"/*" !([^]* "*/" [^]*) ("*/")?
I don't quite understand how to read this and would like it if it were explained in plain English.
At the moment I'm reading this as
first expect a /* then
expect not
any character? (Not sure why they used [^]) zero or more times
followed by */
Any character zero or more
An optional */

You've basically deciphered it correctly. Here are a few explanatory notes:
[^]* matches an arbitrary sequence of characters. It's similar to .* except that . doesn't match newlines or unpaired surrogates; [^] matches absolutely anything.
So ([^]* "*/" [^]*) matches any sequence which includes */. And therefore !([^]* "*/" [^]*) matches anything except a sequence containing */. In other words, it matches anything up to but not including */, which is the rest of the comment.
Now what happens if the user makes a mistake and forgets to close the last comment? In that case, there is no */ and the negated expression will match up to the end of input. Since there's no way to know where the comment should have ended (without being able to read the programmer's mind), the best we can do is to stop trying to parse. Thus, we accept the unterminated comment as a comment. That's why the final "*/" is optional: it will match the comment terminator if there is one, and otherwise it will match an empty sequence at the end of the input.
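To make the behaviour concrete, here is a rough Python sketch that emulates the rule with a regular expression. Python's re module has no negation operator like JFlex's !, so "anything not containing */" is approximated with a negative lookahead; the names BLOCK_COMMENT and find_block_comments are made up for this illustration and are not part of JFlex.

```python
import re

# "/*", then any characters (including newlines) as long as no "*/" starts
# at the current position, then an optional "*/" terminator.
BLOCK_COMMENT = re.compile(r"/\*(?:(?!\*/).)*(?:\*/)?", re.DOTALL)


def find_block_comments(source):
    """Return every block comment in source, terminated or not."""
    return [m.group(0) for m in BLOCK_COMMENT.finditer(source)]


if __name__ == "__main__":
    sample = "a /* one */ b /* two\nlines */ c /* never closed"
    for comment in find_block_comments(sample):
        print(repr(comment))
```

The last comment in the sample has no terminator, so it runs to the end of the input, which is the case the optional "*/" is there for.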

Related

Lua Multi-Line comment remover

I'm trying to remove all normal and multi-line comments from a string, but it doesn't remove the entire multi-line comment. I tried
str:gsub("%-%-[^\n\r]+", "")
on this code
print(1)
--a
print(2) --b
--[[
print(4)
]]
output:
print(1)
print(2)
print(4)
]]
expected output:
print(1)
print(2)
The pattern you have provided to gsub, %-%-[^\n\r]+, will only remove "short" comments ("line" comments). It doesn't even attempt to deal with "long" comments and thus just treats their first line as a line comment, removing it.
Thus Piglet is right: You must remove the line comments after removing the long comments, not the other way around, so as not to lose the start of long comments.
The pattern suggested by Piglet however necessarily fails for some (carefully crafted) long comments or even line comments. Consider
--[this is a line comment]print"Hello World!"
Piglet's pattern would strip the balanced brackets, treating the comment as if it were a long comment and uncommenting the rest of the line! We obtain:
print"Hello World!"
In a similar vein, this may happily consider a second line comment part of a long comment, commenting out your entire code:
--[
-- all my code goes here
print"Hello World!"
-- end of all my code
--]
would be turned into the empty string.
Furthermore, long comments may use multiple equal signs (=) and must be terminated by the same sequence of equal signs (which is not equivalent to matching square ([]) brackets):
--[=[
A long long comment
]] <- not the termination of this long long comment
(poor regular-grammar-based syntax highlighters fail this)
]=]
Piglet's pattern would terminate the comment at ]], leaving some syntax errors:
<- not the termination of this long long comment
(poor regular-grammar-based syntax highlighters fail this)
]=]
Considering that Lua 5.1 already deprecates nesting long comments (whereas LuaJIT will entirely reject it), there is no need for matching balanced brackets here. Rather, you need to find long comment start sequences and then terminate at the next stop sequence. Here's some hacky pattern-based code to do just this:
-- One pass per long comment opener found in the original string
for equal_signs in str:gmatch"%-%-%[(=*)%[" do
    str = str:gsub("%-%-%["..equal_signs.."%[(.-)%]"..equal_signs.."%]", "", 1)
end
and here's an example string str for it to process, enclosed in a long string literal for easier testing:
local str = [==[
--[[a "long" comment]]
print"hello world"
--[=[another long comment
--[[this does not disrupt it at all
]=]
--]] oops, just a line comment
--[doesn't care about line comments]
]==]
which yields:
print"hello world"
--]] oops, just a line comment
--[doesn't care about line comments]
retaining the newlines.
Now why is this hacky, despite fixing all of the aforementioned issues? Well, it's inefficient. It runs over the entire source, replacing a long comment of a certain length, each time it encounters a long comment. For n long comments this means clear quadratic complexity O(n²).
You can't trivially optimize this by skipping the replacement once you have already handled a given number of equals signs. That would reduce the complexity to O(n sqrt n), since a source of length n can only contain on the order of sqrt(n) different long comment lengths. It doesn't work, though, because the gsub has to be limited to one replacement so as not to remove part of a long comment with more equals signs:
--[=[another long comment
--[[this does not disrupt it at all
]=]
You could however optimize it by using string.find repeatedly to always find (1) the opening delimiter and (2) then the closing delimiter, adding all the substrings in between to a rope to concatenate to a string. Assuming linear matching performance (which isn't the case but could be, given a better implementation than the current one, for simple patterns such as this one) this would run in linear time. Implementing this is left as an exercise to the reader, as pattern-based approaches are overall infeasible.
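For what it's worth, here is roughly what that find-the-opener-then-the-closer loop could look like. It is written in Python rather than Lua purely so it is easy to run and test, and strip_long_comments and OPENER are names invented for this sketch. It shares the context problems of the pattern-based code: it does not know about string literals or about openers that sit inside line comments, and it assumes an unterminated long comment should simply be left alone.

```python
import re

# "--[", any number of "=", then "[": the start of a Lua long comment.
OPENER = re.compile(r"--\[(=*)\[")


def strip_long_comments(source):
    """Remove Lua long comments in a single left-to-right pass."""
    pieces = []  # the "rope": text between comments, joined at the end
    pos = 0
    while True:
        opener = OPENER.search(source, pos)
        if opener is None:
            break
        # The closer must use the same number of equals signs as the opener.
        closer = "]" + opener.group(1) + "]"
        end = source.find(closer, opener.end())
        if end == -1:
            break  # unterminated: leave the remainder untouched (assumption)
        pieces.append(source[pos:opener.start()])
        pos = end + len(closer)
    pieces.append(source[pos:])
    return "".join(pieces)


if __name__ == "__main__":
    sample = "--[=[ long\n]] still inside\n]=]\nprint(1)\n"
    print(repr(strip_long_comments(sample)))
```

Because the scan resumes after each closing delimiter, an opener that appears inside another long comment is never looked at, so no per-length bookkeeping is needed.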
Note also that removing comments (to minify code?) may introduce syntax errors, as at the tokenization stage, comment (or whitespace) tokens (which are later suppressed) might be used to separate other tokens. Consider the following pathological case:
do--[[]]print("hello world")end
which would be turned into
doprint("hello world")end
which is an entirely different beast (call to doprint now, syntax error since the end isn't matched by an opening do anymore).
In addition, any pattern-based solution is likely to fail to consider context, removing "comments" inside string literals or, even harder to work around, long string literals. Again workarounds might be possible (e.g. by replacing strings with placeholders and later substituting them back), but this gets messy & error-prone. Consider
quoted_string = "--[[this is no comment but rather part of the string]]"
long_string = [=[--[[this is no comment but rather part of the string]]]=]
which would each be turned into an empty string by comment removal patterns.
Conclusion
Pattern-based solutions are bound to fall short on a myriad of edge cases. They will also usually be inefficient.
At least a partial tokenization that distinguishes between comments and "everything else" is needed. This must take care of long strings & long comments properly, counting the number of equals signs. Using a handwritten tokenizer is possible, but I'd recommend using lhf's ltokenp.
Even when using a proper tokenization stage to strip long comments, you might still have the aforementioned tokenization issue. For that reason you'll have to insert whitespace instead of the comment (if there isn't any already). To save the most space you could check whether removing the comment alters the tokenization (e.g. removing the comment in if--[[comment]]"str"then end is fine, since the string will still be considered a distinct token from the keyword if).
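As a rough illustration of the "partial tokenization" idea, here is a small hand-written scanner. It is in Python rather than Lua only to keep it easy to test, it is not ltokenp, and every name in it (strip_comments, _skip_long_bracket, _skip_quoted) is invented for this sketch. It distinguishes just four things: comments, quoted strings, long strings, and everything else. Where a comment sat between two non-whitespace characters it leaves a single space, which is the simple "insert whitespace" strategy rather than the space-saving one.

```python
import re

LONG_OPEN = re.compile(r"\[(=*)\[")


def _skip_long_bracket(src, pos):
    """Return the index just past a long bracket starting at pos, or None."""
    m = LONG_OPEN.match(src, pos)
    if m is None:
        return None
    closer = "]" + m.group(1) + "]"  # same number of equals signs
    end = src.find(closer, m.end())
    return len(src) if end == -1 else end + len(closer)


def _skip_quoted(src, pos):
    """Return the index just past the quoted string starting at pos."""
    quote = src[pos]
    i = pos + 1
    while i < len(src) and src[i] != quote and src[i] != "\n":
        i += 2 if src[i] == "\\" else 1  # step over escaped characters
    return min(i + 1, len(src))


def strip_comments(src):
    out = []
    pos, n = 0, len(src)
    while pos < n:
        ch = src[pos]
        if src.startswith("--", pos):
            end = _skip_long_bracket(src, pos + 2)
            if end is None:  # short comment: stop before the newline
                newline = src.find("\n", pos)
                end = n if newline == -1 else newline
            # Keep neighbouring tokens apart if the comment separated them.
            if out and end < n and not out[-1][-1].isspace() and not src[end].isspace():
                out.append(" ")
            pos = end
        elif ch in "\"'":
            end = _skip_quoted(src, pos)
            out.append(src[pos:end])
            pos = end
        elif ch == "[":
            end = _skip_long_bracket(src, pos) or pos + 1
            out.append(src[pos:end])
            pos = end
        else:
            out.append(ch)
            pos += 1
    return "".join(out)


if __name__ == "__main__":
    print(strip_comments('do--[[]]print("hello world")end'))
```

On the pathological example above this keeps do and print apart, and the two string examples pass through unchanged because the scanner consumes each string as a whole before it ever looks for --. It is still only a sketch; a real tokenizer handles more cases than this does.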
What's your root problem here? If you're searching for a Lua minifier, just grab a battle-tested one rather than trying to roll your own (and especially before you try to rename local variables using patterns!).
Why should str:gsub("%-%-[^\n\r]+", "") remove
print(4)
]]
?
This pattern matches -- followed by one or more characters other than a line feed or carriage return.
So it matches --a and --[[.
If there is an opening bracket immediately after -- you need to match anything until and including the corresponding closing bracket.
That would be -- followed by a balanced pair of brackets.
Hence "%-%-%b[]"
Then in a second run remove any short comments.

How to comment a grammar rule in yacc and a regex matching rule in lex?

I want to comment out this matching rule in lex. I don't want to delete it. I just want it commented out so that anyone who sees the lex file later is informed that this part has been commented out.
<tickPragma_name_1>. {
myyyless(0);
BEGIN (0);
}
How can I do that? I know that I can comment out the C code inside the {}, but I want to comment out the whole rule.
You can surround the rule with /* and */, just as in C, but with two major caveats:
Everything needs to be indented by at least one space, including the /* (and, I believe, the contents).
There cannot be any nested comments in the action.

How to get the last matched text in Flex parser

I want to match something like:
var i=1;
So I want to know whether var has started at a word boundary.
When it matches this line I want to know the last character of the previous yytext,
just to be sure that the char before var is really a non-variable character (aka "\b" in regex).
One crude way is to maintain old_yytext in each rule and also have a default rule ".".
How can I get it?
The only way is to save a copy of the previous token, or at least the last character. Flex's buffer management strategy does not guarantee that the previous token still exists in memory. It is possible that the current token starts at the beginning of flex's buffer.
But doing the work of saving the previous token in every rule would be really silly. You should trust flex to work as advertised, and write appropriate rules. For example, if your identifier pattern looks like this:
[[:alpha:]][[:alnum:]]*
then it is impossible for var to immediately follow an identifier because it would have been included in the identifier.
There is one common case in a "normal" flex scanner definition where a keyword or identifier might immediately follow an alphanumeric character, which is when the keyword immediately follows a number (123var). This is not usually a problem, because in almost all languages, it will trigger a syntax error (and if it isn't a syntax error, maybe it is ok :-) )
If you really want to trigger a lexical error, you can add a pattern which recognizes a number followed by a letter.
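To see why the longest-match rule makes the bookkeeping unnecessary, here is a toy scanner that imitates it. It is Python, not flex, and the rule names (BAD_NUMBER, KEYWORD and so on) as well as the tokenize function are made up for this sketch; the BAD_NUMBER rule is one possible way to spell "a number followed by a letter".

```python
import re

# Rules are tried at the current position; the longest match wins and,
# on a tie, the rule listed first wins (as in flex).
RULES = [
    ("BAD_NUMBER", re.compile(r"[0-9]+[A-Za-z][A-Za-z0-9]*")),
    ("NUMBER", re.compile(r"[0-9]+")),
    ("KEYWORD", re.compile(r"var")),
    ("IDENT", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("SPACE", re.compile(r"\s+")),
    ("OTHER", re.compile(r".", re.DOTALL)),  # default rule, one character
]


def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_end = None, pos
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m and m.end() > best_end:  # strictly longer: earlier rule wins ties
                best_name, best_end = name, m.end()
        if best_name != "SPACE":
            tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens


if __name__ == "__main__":
    for line in ("var i=1;", "myvar variable", "123var"):
        print(line, "->", tokenize(line))
```

In myvar and variable the letters var are swallowed by the identifier rule, so the keyword rule can only fire where var stands on its own, and 123var is reported by the extra error rule instead of being split into a number and a keyword.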

RegEx negative-lookahead and behind to find characters not embedded within a wrapper

I would like to match strings/characters that are not surrounded by a well-defined string-wrapper. In this case the wrapper is '#L#' on the left of the string and '#R#' on the right of the string.
With the following string for example:
This is a #L#string#R# and it's #L#good or ok#R# to change characters in the next string
I would like to be able to search for (any number of characters) to change them on a case by case basis. For example:
Searching for "in", would match twice - the word 'in', and the 'in' contained within the last word 'string'.
Searching for a "g", should be found within the word 'change' and in the final word string (but not the first occurrence of string contained within the wrapper).
I'm somewhat familiar with how lookahead works in the sense that it identifies a match, and doesn't return the matching criteria as part of the identified match.
Unfortunately, I can't get my head around how to do it.
I've also been playing with this at http://regexpal.com/ but can't seem to find anything that works. Examples I've found for iOS are problematic, so perhaps the javascript tester is a tiny bit different.
I took some guidance from a previous question I asked, which seemed to be almost the same but sufficiently different to mean I couldn't work out how to reuse it:
Replacing 'non-tagged' content in a web page
Any ideas?
First match all the #L# to #R# blocks, and then use the alternation operator | to match the string in within the remaining text. To tell the two kinds of match apart, put in inside a capturing group; only matches where that group is set are the ones you want.
#L#.*?#R#|(in)
OR
Use a negative lookahead assertion. This matches the substring in only if it is not followed by a run of characters that contains neither #L# nor #R# and then by #R#. In other words, the next wrapper marker after the match must not be #R#, so this matches all the occurrences of in that are not inside a #L# ... #R# block.
in(?!(?:(?!#[RL]#).)*#R#)
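Here is a small Python sketch of both approaches, run against the sample sentence from the question. The function names are invented for this illustration, and re.escape is used so that the search term can be any literal text; the regex flavour on iOS may differ in details, but both constructs used here are widely supported.

```python
import re


def find_with_alternation(text, term):
    """Skip #L#...#R# blocks first; report only matches of the captured term."""
    pattern = re.compile(r"#L#.*?#R#|(" + re.escape(term) + ")")
    return [m.start(1) for m in pattern.finditer(text) if m.group(1) is not None]


def find_with_lookahead(text, term):
    """Match term unless the next wrapper marker after it is #R#."""
    pattern = re.compile(re.escape(term) + r"(?!(?:(?!#[RL]#).)*#R#)")
    return [m.start() for m in pattern.finditer(text)]


if __name__ == "__main__":
    sample = ("This is a #L#string#R# and it's #L#good or ok#R# "
              "to change characters in the next string")
    for term in ("in", "g"):
        print(term, find_with_alternation(sample, term), find_with_lookahead(sample, term))
```

Both functions return the start offsets of the matches outside the wrappers, so either list can drive a case-by-case replacement.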

Regex for string first chars

In my Rails app I need to validate a string that, on creation, cannot have its first characters empty or made up of special characters.
For example: " file" and "%file" aren't valid. Do you know what Regex I should use?
Thanks!
The following regex will only match if the first character of the string is a letter, number, or '_':
^\w
To restrict to just letters or numbers:
^[0-9a-zA-Z]
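The ^ has a special meaning in regular expressions: when it is outside of a character class ([...]) it is an anchor, matching a position without actually matching any characters. Note that in Ruby ^ matches at the start of every line, not only at the start of the string, so for validating a whole string \A is the safer anchor.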
If you want to match all invalid strings you can place a ^ inside of the character class to negate it, so the previous expressions would be:
^[^\w]
or
^[^0-9a-zA-Z]
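If it helps to see the two variants side by side, here is a quick check written in Python rather than Ruby (so treat it as an illustration of the patterns, not as Rails validation code). The function names are made up, re.match already anchors at the very start of the string, and re.ASCII is passed so that \w means letters, digits and underscore only.

```python
import re

WORD_START = re.compile(r"\w", re.ASCII)      # letter, digit or underscore
ALNUM_START = re.compile(r"[0-9a-zA-Z]")      # letter or digit only


def starts_with_word_char(value):
    # re.match only succeeds at the beginning of the string.
    return WORD_START.match(value) is not None


def starts_with_alnum(value):
    return ALNUM_START.match(value) is not None


if __name__ == "__main__":
    for value in ("file", " file", "%file", "_file"):
        print(repr(value), starts_with_word_char(value), starts_with_alnum(value))
```

The only input the two checks disagree on is "_file", which is the difference between \w and the explicit letters-and-digits class.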
A good place to interactively try out Ruby regexes is Rubular. The link I gave shows the answer that @Dave G gave along with a few test examples (and at first glance it seems to work). You could expand the examples to convince yourself further.
The regex
^[^[:punct:][:space:]]+
should do what you want. I'm not 100% sure what Ruby provides as far as regular expressions and POSIX class support go, so your mileage may vary.

Resources