Lua multi-line comment remover

I'm trying to remove all single-line and multi-line comments from a string, but my pattern doesn't remove multi-line comments entirely. I tried
str:gsub("%-%-[^\n\r]+", "")
on this code:
print(1)
--a
print(2) --b
--[[
print(4)
]]
output:
print(1)
print(2)
print(4)
]]
expected output:
print(1)
print(2)

The pattern you have provided to gsub, %-%-[^\n\r]+, will only remove "short" comments ("line" comments). It doesn't even attempt to deal with "long" comments and thus just treats their first line as a line comment, removing it.
Thus Piglet is right: you must remove the line comments after removing the long comments, not the other way around, so as not to lose the start of long comments.
The pattern suggested by Piglet, however, necessarily fails for some (carefully crafted) long comments and even line comments. Consider
--[this is a line comment]print"Hello World!"
Piglet's pattern would strip the balanced brackets, treating the comment as if it were a long comment and uncommenting the rest of the line! We obtain:
print"Hello World!"
In a similar vein, it may happily consider a second line comment to be part of a long comment, commenting out your entire code:
--[
-- all my code goes here
print"Hello World!"
-- end of all my code
--]
would be turned into the empty string.
Furthermore, long comments may use multiple equal signs (=) and must be terminated by the same sequence of equal signs (which is not equivalent to matching square ([]) brackets):
--[=[
A long long comment
]] <- not the termination of this long long comment
(poor regular-grammar-based syntax highlighters fail this)
]=]
A bracket-matching pattern would terminate the comment at the ]], leaving behind syntax errors:
<- not the termination of this long long comment
(poor regular-grammar-based syntax highlighters fail this)
]=]
Considering that Lua 5.1 already deprecates nested long comments (and LuaJIT rejects them entirely), there is no need to match balanced brackets here. Rather, you need to find long-comment start sequences and then terminate at the next matching stop sequence. Here's some hacky pattern-based code to do just this:
for equal_signs in str:gmatch"%-%-%[(=*)%[" do
    str = str:gsub("%-%-%["..equal_signs.."%[(.-)%]"..equal_signs.."%]", "", 1)
end
and here's an example string str for it to process, enclosed in a long string literal for easier testing:
local str = [==[
--[[a "long" comment]]
print"hello world"
--[=[another long comment
--[[this does not disrupt it at all
]=]
--]] oops, just a line comment
--[doesn't care about line comments]
]==]
which yields:
print"hello world"
--]] oops, just a line comment
--[doesn't care about line comments]
retaining the newlines.
Now why is this hacky, despite fixing all of the aforementioned issues? Well, it's inefficient: every time it encounters a long-comment opener, it runs gsub over the entire source again. For n long comments this means clearly quadratic complexity, O(n²).
You can't trivially optimize this by skipping delimiter lengths you have already processed (which would reduce the complexity to O(n sqrt n), since a source of length n contains at most on the order of sqrt(n) distinct delimiter lengths): the gsub is limited to one replacement precisely so that it does not remove part of a long comment that uses more equal signs:
--[=[another long comment
--[[this does not disrupt it at all
]=]
You could, however, optimize it by using string.find repeatedly to find (1) the next opening delimiter and then (2) the corresponding closing delimiter, appending all the substrings in between to a rope that is finally concatenated into a string. Assuming linear matching performance (which isn't the case with the current implementation, but could be, given a better implementation, for simple patterns such as this one), this would run in linear time. A sketch follows, though pattern-based approaches remain infeasible overall.
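Here is a rough sketch of that find-based loop (my illustration of the idea above, not tested production code; it deliberately ignores line comments and the string-literal issues discussed below):
local parts, pos = {}, 1
while true do
    -- (1) find the next long-comment opening delimiter
    local s, e, equal_signs = str:find("%-%-%[(=*)%[", pos)
    if not s then break end
    parts[#parts + 1] = str:sub(pos, s - 1) -- keep the text before the comment
    -- (2) find the matching closing delimiter; a plain (non-pattern) find suffices
    local close = str:find("]" .. equal_signs .. "]", e + 1, true)
    if not close then -- unterminated comment: drop the rest of the source
        pos = #str + 1
        break
    end
    pos = close + #equal_signs + 2 -- resume scanning after "]=...=]"
end
parts[#parts + 1] = str:sub(pos)
str = table.concat(parts)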
Note also that removing comments (to minify code?) may introduce syntax errors, as at the tokenization stage, comment (or whitespace) tokens (which are later suppressed) might be used to separate other tokens. Consider the following pathological case:
do--[[]]print("hello world")end
which would be turned into
doprint("hello world")end
which is an entirely different beast (now a call to doprint, and a syntax error, since the end isn't matched by an opening do anymore).
In addition, any pattern-based solution is likely to fail to consider context, removing "comments" inside string literals or (even harder to work around) long string literals. Workarounds might be possible (e.g. replacing strings with placeholders and later substituting them back), but this gets messy and error-prone. Consider
quoted_string = "--[[this is no comment but rather part of the string]]"
long_string = [=[--[[this is no comment but rather part of the string]]]=]
where comment-removal patterns would strip the apparent comments, turning both strings' contents into the empty string.
Conclusion
Pattern-based solutions are bound to fall short of myriads of edge cases. They will also usually be inefficient.
At least a partial tokenization that distinguishes between comments and "everything else" is needed. This must take care of long strings and long comments properly, counting the number of equal signs. Using a handwritten tokenizer is possible, but I'd recommend using lhf's ltokenp.
Even when using a proper tokenization stage to strip long comments, you might still run into the aforementioned tokenization issue. For that reason you'll have to insert whitespace in place of the comment (if there isn't whitespace already). To save the most space, you could check whether removing the comment alters the tokenization (e.g. removing the comment in if--[[comment]]"str"then end is fine, since the string will still be considered a token distinct from the keyword if).
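As an illustration of the whitespace substitution in the pattern-based world (a sketch only; it inherits all of the limitations discussed above), the earlier loop can substitute a single space instead of the empty string:
for equal_signs in str:gmatch"%-%-%[(=*)%[" do
    str = str:gsub("%-%-%["..equal_signs.."%[(.-)%]"..equal_signs.."%]", " ", 1)
end
str = str:gsub("%-%-[^\n\r]*", " ") -- line comments also become a single space
With this, do--[[]]print("hello world")end becomes do print("hello world")end, which tokenizes as intended.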
What's your root problem here? If you're searching for a Lua minifier, just grab a battle-tested one rather than trying to roll your own (and especially before you try to rename local variables using patterns!).

Why should str:gsub("%-%-[^\n\r]+", "") remove
print(4)
]]
?
This pattern matches -- followed by anything but a linebreak or carriage return.
So it matches --a and --[[.
If there is an opening bracket immediately after -- you need to match anything until and including the corresponding closing bracket.
That would be -- followed by a balanced pair of brackets.
Hence "%-%-%b[]"
Then in a second run remove any short comments.
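In code, the two passes might look like this (a sketch of this two-step approach; see the other answer for carefully crafted inputs on which the balanced-bracket assumption breaks down):
str = str:gsub("%-%-%b[]", "")     -- first pass: long comments via balanced brackets
str = str:gsub("%-%-[^\n\r]*", "") -- second pass: short (line) comments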

Related

Antlr differentiating a newline from a \n

Let's say I have the following statement:
SELECT "hi\n
there";
Notice there is a literal newline in there, and the escape \n. The string that antlr4 picks up for me is:
String_Literal: "hi\n\nthere"
In other words, not differentiating between the literal newline and the \n one. Is there a way to differentiate the two, or what's the usual process to do that?
My guess is that the output you pasted into your question comes from a call to the Antlr4 runtime method tree.toStringTree(parser) (or equivalent in whatever target language you've chosen).
That function calls escapeWhitespace in the utilities class/module/file, and that function does what its name suggests: it converts (some) whitespace characters to C-like backslash escape sequences. (Specifically, it handles newline, carriage return, and tab characters.) It does not escape backslash characters, which makes its output ambiguous; there's no way to distinguish between the two-character escape sequence \n and the escaped conversion of a newline character in the message.
They are different in the actual character string, because the Antlr4 lexer does not transform the string value of the matched token in any way. That's your responsibility.
In computing, it is very often the case that what you see is not what you got. What you see is just what you see, and a lot of computational power has gone into creating that vision for you. By the same token, nothing guarantees that the vision is an unambiguous, or even useful, representation of the actual values. The best you can say for it is that it's probably more useful than trying to read the data as individual bits. (And, indeed, the individual bits are not physical objects either; despite the common refrain, you could completely disassemble a computer and examine it with an arbitrarily powerful microscope, and you will not see a single 1 or 0.)
That might seem like irrelevant philosophizing, but it has a real consequence: when you're debugging and you see something that makes you think, "that looks wrong", you need to consider two possibilities: maybe the underlying data is incorrect, but maybe it's the process that rendered the representation which is at fault. In this case, I'd say that the failure of escapeWhitespace to convert backslash characters into pairs of backslashes is a bug, but that's a value judgement on my part. Anyway, the function is not critical to the operation of Antlr4, and you could easily replace it.
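For instance, a replacement that also escapes backslashes might look like this (a hypothetical helper sketched in Java; the name and placement are my own, not part of the Antlr4 runtime):
static String escapeUnambiguously(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        switch (c) {
            case '\\': sb.append("\\\\"); break; // escape the escape character itself
            case '\n': sb.append("\\n");  break;
            case '\r': sb.append("\\r");  break;
            case '\t': sb.append("\\t");  break;
            default:   sb.append(c);
        }
    }
    return sb.toString();
}
With this, a literal newline renders as \n while the two-character input sequence \n renders as \\n, so the two can no longer be confused.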

NSRegularExpression not matching number sign (#)

I'm working on a Guitar Chord transposer, and so from a given text file, I want to identify guitar chords. e.g. G#, Ab, F#m, etc.
I'm almost there! I have run into a few problems already due to the number sign (hash tag).
#
For example, you can't include the number sign in your regex pattern. The NSRegularExpression will not initialize with this:
let fail: String = "\\b[ABCDEFG](b|#)?\\b"
let success: String = "\\b[CDEFGAB](b|\\u0023)?\\b"
I had to specifically provide the unicode character. I can live with that.
However, now that I have a NSRegularExpression object, it won't match these (sharps = number sign) when I have a line of text such as:
Am Bb G# C Dm F E
When it starts processing the G#, the sharp associated with that second capture group is not matched (i.e. the NSTextCheckingResult's second range has a location of NSNotFound). Note that it works for Bb: it matches the 'b'.
I'm wondering what I need to do here. It would seem the documentation doesn't cover this case of '#', which IS in fact sometimes used in regex patterns (I think related to comments or something).
One thing that would be great would be to not have to look up the unicode identifier for a #, but just use it as a String "#" then convert that so it plays nicely with the pattern. There exists the chance that \u0023 is in fact not the code associated with # ...
The \b word boundary is a context-dependent construct. It matches in 4 contexts: 1) between the start of string and a word char, 2) between a word char and the end of string, 3) between a word char and a non-word char, and 4) between a non-word char and a word char.
Your regex is written in such a way that ultimately the regex engine sees a \b after # and that means a # will only match if there is a word char after it.
If you replace \b with (?!\w), a negative lookahead that fails the match if there is a word char immediately to the right of the current location, it will work.
So, you may use
\\b[CDEFGAB](b|\\u0023)?(?!\\w)
Details
\b - a word boundary
[CDEFGAB] - a char from the set
(b|\\u0023)? - an optional sequence of b or #
(?!\\w) - a negative lookahead that fails the match if there is a word char immediately to the right of the current position (and causes backtracking into the preceding pattern; to avoid that, add + after ? to prevent backtracking into that pattern)
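Putting it together, a minimal Swift sketch (assuming Foundation; the sample line is the one from the question):
import Foundation

let pattern = "\\b[CDEFGAB](b|\\u0023)?(?!\\w)"
let regex = try! NSRegularExpression(pattern: pattern)
let text = "Am Bb G# C Dm F E"
let range = NSRange(text.startIndex..., in: text)
for match in regex.matches(in: text, range: range) {
    if let r = Range(match.range, in: text) {
        // prints Bb, G#, C, F, E; Am and Dm are skipped because the
        // chord-type letter 'm' is a word char following the note
        print(text[r])
    }
}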
(I'd like to first say @WiktorStribiżew has been a tremendous help and what I am writing now would not have been possible without him! I'm not concerned about StackOverflow points and rep, so if you like this answer, please upvote his answer.)
This issue took many turns and had a few issues going on. Ultimately this question should be called How do I use Regex on iOS to detect Musical Chords in a text file?
The answer is (so far), not simply.
CRASH COURSE IN MUSIC THEORY
In music you have notes. They are made up of a letter between A->G and an optional symbol called an accidental. (A note relates to the acoustic frequency of the sound you hear when that note is played) An accidental can be a flat (represented as a ♭ or simply a b), or a sharp (represented as a ♯ or simply a #, as these are easier to type on a keyboard). An accidental serves to make a note a semitone higher (#) or lower (b). As such, a F# is the same acoustic frequency as a Gb. On a piano, the white keys are notes with no accidentals, and the black keys represent notes with an accidental. Depending on some factors of the piece of music, that piece won't mix accidental types. It will either be flats throughout the piece or sharps. (Depending on the musical key of the composition, but this is not that relevant here.)
In terms of regex, you have something like [ABCDEFG](b|#)? to determine the note. In reality it's more complicated.
Then, a Musical Chord comprises the root note and its chord type. There are over 50 types of chords. Each has a 'text signature' that is unique. Also, a 'major' chord has an empty signature. So in terms of pseudo-regex you have for a Chord:
[ABCDEFG](b|#)?(...|...|...)?
where the first part you recognize as the note (as before), and the last optional is to determine the chord type. The different types were omitted, but can be as simple as a m (for Minor chord), or maj7#5 (for a major 7th chord with an augmented 5th... don't worry about it. Just know there are many string constants to represent a chord type)
Then finally, with guitar you often have a corresponding bass note that changes the chord's tonality somewhat. You denote this by adding a slash and then the note, giving the general pseudoform:
[ABCDEFG](b|#)?(...|...|...)?(/[ABCDEFG](b|#)?)? // NOT real Regex
real examples: C/F or C#m/G# and so on
where the last part has a slash then the same pattern to recognize a note.
So putting these all together, in general we want to find chords that could take on many forms, such as:
F Gm C#maj7/G# F/C Am A7 A7/F# Bmaj13#11
I was hoping to find one Regex to rule them all. I ended up writing code that works, though it seems like I kind of hacked around a bit to get the results I desired.
You can see this code here, written in Swift. It is not complete for my purposes, but it will parse a string, return a list of Chord Results and their text range within the original string. From there you would have to finish the implementation to suit your needs.
There have been a few issues on iOS:
iOS does not handle the number sign (#) well at all. When providing regex patterns or match text, I either had to replace the # with its unicode \u0023, or, what ultimately worked, replace all occurrences of # with another character (such as 'S') and then convert it back once regex did its thing. So this code I wrote often has to 'sanitize' the pattern or the input text before doing anything.
I couldn't get a regex pattern to perfectly parse a chord structure. It wouldn't fully parse a chord with a bass note, but it would successfully match one, so I had to split those two components, parse them separately, then recombine them.
Regex is really a bit of voodoo, and I think it sucks that for something so confusing to many people, there are also different platform-dependent implementations of it. For example, Wiktor referred me to regex patterns he wrote on www.regex101.com to help me solve the problem; they would work on that website, but they would not work on iOS, and NSRegularExpression would throw an error (often it had something to do with this # character).
My solution pays absolutely no regard to performance. I just wanted it to work.

How to get the last matched text in Flex parser

I want match something like:
var i=1;
So I want to know if var has started at word boundary.
When it matches this line I want to know the last character of previous yytext.
Just to be sure that the char before var is really a non-variable character (aka "\b" in regex).
One crude way would be to maintain old_yytext in each rule and also have a default rule ".".
How to get it?
The only way is to save a copy of the previous token, or at least the last character. Flex's buffer management strategy does not guarantee that the previous token still exists in memory. It is possible that the current token starts at the beginning of flex's buffer.
But doing the work of saving the previous token in every rule would be really silly. You should trust flex to work as advertised, and write appropriate rules. For example, if your identifier pattern looks like this:
[[:alpha:]][[:alnum:]]*
then it is impossible for var to immediately follow an identifier, because it would have been included in the identifier.
There is one common case in a "normal" flex scanner definition where a keyword or identifier might immediately follow an alphanumeric character, which is when the keyword immediately follows a number (123var). This is not usually a problem, because in almost all languages, it will trigger a syntax error (and if it isn't a syntax error, maybe it is ok :-) )
If you really want to trigger a lexical error, you can add a pattern which recognizes a number followed by a letter.
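A sketch of what such rules could look like (illustrative flex rules, not from the original answer):
%%
[[:alpha:]][[:alnum:]]*    { /* identifier or keyword */ }
[[:digit:]]+               { /* number */ }
[[:digit:]]+[[:alpha:]]    { fprintf(stderr, "lexical error: '%s'\n", yytext); }
Because flex prefers the longest match, input such as 123var is caught by the third rule rather than being split into a number followed by an identifier.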

Is it always necessary to use '.\n' when reading streams in Prolog?

I'm using pipes for communication between two Prolog processes, and every time I reached a read/2 predicate to read a message from my pipe, the program blocked and remained like that. I couldn't understand why that happened (I tried with extremely simple programs), and in the end I realized three things:
Every time I use write/2 to send a message, the sender process must end that message with .\n. If the message does not end like this, the receiver process will get stuck at the read/2 predicate.
If the sender does not flush the output, the message never reaches the pipe buffer. It may seem obvious, but it wasn't to me at the beginning.
Although read/2 blocks when the message is not flushed, wait_for_input/3 does not block at all, so there is no need for flush_output/1 in that case.
Examples:
This does not work:
example1 :-
pipe(R,W),
write(W,hello),
read(R,S). % The program is blocked here.
That one won't work either:
example2 :-
pipe(R,W),
write(W,'hello.\n'),
read(R,S). % The program is blocked here.
While these two do work:
example3 :-
pipe(R,W),
write(W,'hello.\n'),
flush_output(W),
read(R,S).
example4 :-
pipe(R,W),
write(W,'hello.\n'),
wait_for_input([W],L,infinite).
Now my question is why? Is there a reason why Prolog only "accepts" full lines ended with a period when reading from a pipe (actually reading from any stream you may want to read)? And why does read block while wait_for_input/3 doesn't (assuming the message is not flushed)?
Thanks!
A valid Prolog read-term always ends with a period, called end char (* 6.4.8 *). And in 6.4.8 Other tokens, the standard reads:
An end char shall be followed by a layout character or a %.
So this is what the standard demands.
A newline after the period is one possibility to end a read-term, besides space, tab and other layout characters as well as %. However, due to the prevalence of ttys and related buffering, it seems a good convention to just stick with a newline.
The reason why the end char is needed is that Prolog syntax permits infix and postfix operators. Consider as input
f(1) + g(2).
when reading f(1) you might believe that this is already the entire term, but you still must await the period to be sure that no infix or postfix operator follows.
Also note that you must use writeq/1 or write_canonical/1 to produce output that can be read back. You cannot use write/1.
As an example, consider write([(.)+ .]). First, this is valid syntax: the dots are immediately followed by some other character. Note that the final . is commonly called a period, whereas a . within Prolog text is called a dot.
write/1 will write this as [. + .]. Note that the first . is now followed by a space. So when this text is read back, only [. will be read.
There are many other ugly examples such as this one, usually they do not hit you. But once you are hit, you are hit...
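A simpler illustration of the writeq/1 point (a toplevel transcript; assuming SWI-Prolog, though any conforming system behaves the same):
?- write(f('a b')), nl.
f(a b)
?- writeq(f('a b')), nl.
f('a b')
The write/1 output f(a b) is a syntax error when read back, whereas the writeq/1 output reads back as the original term.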

Tex command which affects the next complete word

Is it possible to have a TeX command which will take the whole next word (or the next letters up to but not including the next punctuation symbol) as an argument and not only the next letter or {} group?
I’d like to have a \caps command on certain acronyms but don’t want to type curly brackets over and over.
First of all, create your command, for example:
\def\capsimpl#1{{\sc #1}}% Your main macro
The solution to catch a space or punctuation:
\catcode`\#=11
\def\addtopunct#1{\expandafter\let\csname punct#\meaning#1\endcsname\let}
\addtopunct{ }
\addtopunct{.} \addtopunct{,} \addtopunct{?}
\addtopunct{!} \addtopunct{;} \addtopunct{:}
\newtoks\capsarg
\def\caps{\capsarg{}\futurelet\punctlet\capsx}
\def\capsx{\expandafter\ifx\csname punct#\meaning\punctlet\endcsname\let
\expandafter\capsend
\else \expandafter\continuecaps\fi}
\def\capsend{\expandafter\capsimpl\expandafter{\the\capsarg}}
\def\continuecaps#1{\capsarg=\expandafter{\the\capsarg#1}\futurelet\punctlet\capsx}
\catcode`\#=12
@Debilski - I wrote something similar to your active * code for the acronyms in my thesis. I activated < and then \def<#1> to print the acronym, as well as the expansion if it's the first time it's encountered. I also went a bit off the deep end by allowing defining the expansions in-line and using the .aux files to send the expansions "back in time" if they're used before they're declared, or to report errors if an acronym is never declared.
Overall, it seemed like it would be a good idea at the time - I rarely needed < to be catcode 12 in my actual text (since all my macros were in a separate .sty file), and I made it behave in math mode, so I couldn't foresee any difficulties. But boy was it brittle... I don't know how many times I accidentally broke my build by changing something seemingly unrelated. So all that to say, be very careful activating characters that are even remotely commonly-used.
On the other hand, with XeTeX and higher unicode characters, it's probably a lot safer, and there are generally easy ways to type these extra characters, such as making a multi (or compose) key (I usually map either numlock or one of the windows keys to this), so that e.g. multi-!-! produces ¡. Or if you're running in emacs, you can use C-\ to switch into TeX input mode briefly to insert unicode by typing the TeX command for it (though this is a pain for actually typing TeX documents, since it intercepts your actual \'s, and please please don't try defining your own escape character!).
Regarding whitespace after commands: see package xspace, and TeX FAQ item Commands gobble following space.
Now why this is very difficult: as you noted yourself, things like that can only be done by changing catcodes, it seems. Catcodes are assigned to characters when TeX reads them, and TeX reads one line at a time, so you cannot do anything with other spaces on the same line, IMHO. There might be a way around this, but I do not see it.
Dangerous code below!
This code will do what you want only at the end of the line, so if what you want is more "fluent" typing without brackets, but you are willing to hit 'return' after each acronym (and not run any auto-indent later), you can use this:
\def\caps{\begingroup\catcode`^^20 =11\mcaps}
\def\mcaps#1{\def\next##1 {\sc #1##1\catcode`^^20 =10\endgroup\ }\next}
One solution might be setting another character as active and using this one for escaping. This does not remove the need for a closing character but avoids typing the \caps macro, thus making it overall easier to type.
Therefore under very special circumstances, the following works.
\catcode`\*=\active
\def*#1*{\textsc{\MakeTextLowercase{#1}}}
Now follows an *Acronym*.
Unfortunately, this makes use of \section*{} impossible without additional macro definitions.
In XeTeX, it seems to be possible to exploit unicode characters for this, so one could define
\catcode`\•=\active
\def•#1•{\textsc{\MakeTextLowercase{#1}}}
Now follows an •Acronym•.
This should reduce the effects on other commands, but of course the character ‘•’ needs to be mapped to the keyboard somewhere to be of use.
