Lua source code manipulation: get innermost function() location for a given line - lua

I've got a file with syntactically correct Lua 5.1 source code.
I've got a position (line and character offset) inside that file.
I need to get an offset in bytes to the closing parenthesis of the innermost function() body that contains that position (or figure out that the position belongs to the main chunk of the file).
I.e.:
local function foo()
^ result
print("bar")
^ input
end
local foo = function()
^ result
print("bar")
^ input
end
local foo = function()
return function()
^ result
print("bar")
^ input
end
end
...And so on.
How do I do that robustly?

EDIT: My original answer did not take into account the "innermost" requirement. I've since taken that into account
To make things "robust," there are a few considerations.
First of all, it's important that you skip over string and comment contents, to avoid incorrect output in situations like:
foo = function()
print(" function() ")
-- function()
print("bar")
^ input
end
This can be somewhat difficult, considering Lua's nested string and comment syntax. Consider, for example, a situation where the input begins in a nested string or comment:
foo = function()
print([[
bar = function()
print("baz")
^ input
end
]])
end
Consequently, if you want a completely robust system, it is not acceptable to only parse backwards until you hit the end of a function parameter list, because you may not have parsed backwards far enough to reach a [[ which would invalidate your match. It is therefore necessary to parse the entire file up to your position (unless you're okay with incorrect matches in these weird situations. If this is an editor plugin, these "incorrect" results may actually be desirable, because they would allow you to edit lua code which is stored in string literal form inside other lua code using the same plugin).
Because the particular syntax that you're trying to match doesn't have any kind of "nesting", a full-blown parser isn't needed. You will need to maintain a stack, however, to keep track of scope. With that in mind, all you need to do is step through the source file character-by-character from the beginning, applying the following logic:
Every time a " or ' is encountered, ignore the characters up to the closing " or '. Be careful to handle escapes like \" and \\
Every time a -- is encountered, ignore the characters up to the closing newline for the comment. Be careful to only do this if the comment is not a multiline comment.
Every time a multiline string opening symbol is encountered (such as [[, [=[, etc), or a multiline comment symbol is encountered (such as --[[ or --[=[, etc) ignore the characters up until the closing square brackets with the proper number of matching equals signs between them.
When a word boundary is encountered check to see if the characters after it could begin a block which ends with an end (for example, if, while, for, function, etc. DO NOT include repeat). If so, push the position on the scope stack. A "word boundary" in this case is any character which could not be used a lua identifier (this is to prevent matches in cases like abcfunction()). The beginning of the file is also considered a word boundary.
If a word boundary is encountered and it is followed by end, pop the top element of the stack. If the stack has no elements, complain about a syntax error.
When you finally step forward and reach your "input" position, pop elements from the stack until you find a function scope. Step forward from that position to the next ), ignoring )'s in comments (which could theoretically be found in an argument list if it spans multiple lines or contains inline --[[ ]] comments). That position is your result.
This should handle every case, including situations where the function syntactic sugar is used, like
function foo()
print("bar")
end
which you did not include in your example but which I imagine you still want to match.

Related

How does groovy distinguish division from strings?

Groovy supports / as a division operator:
groovy> 1 / 2
===> 0.5
It supports / as a string delimiter, which can even be multiline:
groovy> x = /foo/
===> foo
groovy:000> x = /foo
groovy:001> bar/
===> foo
bar
Given this, why can't I evaluate a slashy-string literal in groovysh?
groovy:000> /foo/
groovy:001>
clearly groovysh thinks this is unterminated for some reason.
How does groovy avoid getting confused between division and strings? What does this code mean:
groovy> f / 2
Is this a function call f(/2 .../) where / is beginning a multiline slashy-string, or f divided by 2?
How does Groovy distinguish division from strings?
I'm not entirely sure how Groovy does it, but I'll describe how I'd do it, and I'd be very surprised if Groovy didn't work in a similar way.
Most parsing algorithms I've heard of (Shunting-yard, Pratt, etc) recognize two distinct kinds of tokens:
Those that expect to be preceded by an expression (infix operators, postfix operators, closing parentheses, etc). If one of these is not preceded by an expression, it's a syntax error.
Those that do not expect to be preceded by an expression (prefix operators, opening parentheses, identifiers, literals, etc). If one of these is preceded by an expression, it's a syntax error.
To make things easier, from this point onward I'm going to refer to the former kind of token as an operator and the latter as a non-operator.
Now, the interesting thing about this distinction is that it's made not based on what the token actually is, but rather on the immediate context, particularly the preceding tokens. Because of this, the same token can be interpreted very differently depending on its position in the code, and whether the parser classifies it as an operator or a non-operator. For example, the '-' token, if in an operator position, denotes a subtraction, but the same token in a non-operator position is a negation. There is no issue deciding whether a '-' is a subtraction operator or not, because you can tell based on its context.
The same is, in general, true for the '/' character in Groovy. If preceded by an expression, it's interpreted as an operator, which means it's a division. Otherwise, it's a non-operator, which makes it a string literal. So, you can generally tell if a '/' is a division or not, by looking at the token that immediately precedes it:
The '/' is a division if it follows an identifier, literal, postfix operator, closing parenthesis, or other token that denotes the end of an expression.
The '/' begins a string if it follows a prefix operator, infix operator, opening parenthesis, or other such token, or if it begins a line.
Of course, it isn't quite so simple in practice. Groovy is designed to be flexible in the face of various styles and uses, and therefore things like semicolons or parentheses are often optional. This can make parsing somewhat ambiguous at times. For example, say our parser comes across the following line:
println / foo
This is most likely an attempt to print a multiline string: foo is the beginning of a string being passed to println as an argument, and the optional parentheses around the argument list are left out. Of course, to a simple parser it looks like a division. I expect the Groovy parser can tell the difference by reading ahead to the following lines to see which interpretation does not give an error, but for something like groovysh that is literally impossible (since, as a repl, it doesn't yet have access to more lines), so it's forced to just guess.
Why can't I evaluate a slashy-string literal in groovysh?
As before, I don't know the exact reason, but I do know that because groovysh is a repl, it's bound to have more trouble with the more ambiguous rules. Even so, a simple single-line slashy-string is pretty unambiguous, so I believe something else may be going on here. Here is the result of me playing with various forms in groovysh:
> /foo - unexpected char: '/' # line 2, column 1.
> /foo/ - awaits further input
> /foo/bar - unexpected char: '/' # line 2, column 1.
> /foo/bar/ - awaits further input
> /foo/ + 'bar' - unexpected char: '/' # line 2, column 1.
> 'foo' + /bar/ - evaluates to 'foobar'
> /foo/ - evaluates to 'foo'
> /foo - awaits further input
> /foo/bar - Unknown property: bar
It appears that something strange happens when a '/' character is the first character in a line. The pattern it appears to follow (as far as I can tell) is this:
A slash as the first character of a line begins a strange parsing mode.
In this mode, every line that ends with a slash followed by nothing but whitespace causes the repl to await further lines.
On the first line that ends with something other than a slash (or whitespace following a slash), the error unexpected char: '/' # line 2, column 1. is printed.
I've also noticed a couple of interesting points regarding this:
Both forward slashes (/) and backslashes (\) appear to count, and seem to be completely interchangeable, in this special mode.
This does not appear to happen at all in groovyConsole or in actual Groovy files.
Putting any whitespace before the opening slash character causes groovysh to interpret it correctly, but only if the opening slash is a forward slash, not a backslash.
So, I personally expect that this is just a quirk of groovysh, either a bug or some under-documented feature I haven't heard about.

How would I create a parser which consumes a character that is also at the beginning and end

How would I create a parser that allows a character which also happens to be the same as the begin/end character. Using the following example:
'Isn't it hot'
The second single-quote should be accepted as part of the content that is between the beginning and ending single-quote. I created a parser like this:
char("'").seq((word()|char("'")|whitespace()).plus()).seq(char("'"))
but it fails as:
Failure[1:15]: "'" expected
If I use "any()|char("'") then it greedily consumes the ending single-quote causing an error as well.
Would I need to create an actual Grammar class? I have attempted to create one but can't figure out how to make a Parser that doesn't try to consume the end marker greedily.
The problem is that plus() is greedy and blind. This means the repetition consumes as much input as possible, but does not consider what comes afterwards. In your example, everything up to the end of the input is consumed, but then the last quote in the sequence cannot be matched anymore.
You can solve the problem by using the non-blind variation plusGreedy(Parser) instead:
char("'")
.seq((word() | char("'") | whitespace()).plusGreedy(char("'")))
.seq(char("'"));
This consumes the input as long as there is still a char("'") left that can be consumed afterwards.

(F) Lex, how do I match negation?

Some language grammars use negations in their rules. For example, in the Dart specification the following rule is used:
~('\'|'"'|'$'|NEWLINE)
Which means match anything that is not one of the rules inside the parenthesis. Now, I know in flex I can negate character rules (ex: [^ab] , but some of the rules I want to negate could be more complicated than a single character so I don't think I could use character rules for that. For example I may need to negate the sequence '"""' for multiline strings but I'm not sure what the way to do it in flex would be.
(TL;DR: Skip down to the bottom for a practical answer.)
The inverse of any regular language is a regular language. So in theory it is possible to write the inverse of a regular expression as a regular expression. Unfortunately, it is not always easy.
The """ case, at least, is not too difficult.
First, let's be clear about what we are trying to match.
Strictly speaking "not """" would mean "any string other than """". But that would include, for example, x""".
So it might be tempting to say that we're looking for "any string which does not contain """". (That is, the inverse of .*""".*). But that's not quite correct either. The typical usage is to tokenise an input like:
"""This string might contain " or ""."""
If we start after the initial """ and look for the longest string which doesn't contain """, we will find:
This string might contain " or "".""
whereas what we wanted was:
This string might contain " or "".
So it turns out that we need "any string which does not end with " and which doesn't contain """", which is actually the conjunction of two inverses: (~.*" ∧ ~.*""".*)
It's (relatively) easy to produce a state diagram for that:
(Note that the only difference between the above and the state diagram for "any string which does not contain """" is that in that state diagram, all the states would be accepting, and in this one states 1 and 2 are not accepting.)
Now, the challenge is to turn that back into a regular expression. There are automated techniques for doing that, but the regular expressions they produce are often long and clumsy. This case is simple, though, because there is only one accepting state and we need only describe all the paths which can end in that state:
([^"]|\"([^"]|\"[^"]))*
This model will work for any simple string, but it's a little more complicated when the string is not just a sequence of the same character. For example, suppose we wanted to match strings terminated with END rather than """. Naively modifying the above pattern would result in:
([^E]|E([^N]|N[^D]))* <--- DON'T USE THIS
but that regular expression will match the string
ENENDstuff which shouldn't have been matched
The real state diagram we're looking for is
and one way of writing that as a regular expression is:
([^E]|E(E|NE)*([^EN]|N[^ED]))
Again, I produced that by tracing all the ways to end up in state 0:
[^E] stays in state 0
E in state 1:
(E|NE)*: stay in state 1
[^EN]: back to state 0
N[^ED]:back to state 0 via state 2
This can be a lot of work, both to produce and to read. And the results are error-prone. (Formal validation is easier with the state diagrams, which are small for this class of problems, rather than with the regular expressions which can grow to be enormous).
A practical and scalable solution
Practical Flex rulesets use start conditions to solve this kind of problem. For example, here is how you might recognize python triple-quoted strings:
%x TRIPLEQ
start \"\"\"
end \"\"\"
%%
{start} { BEGIN( TRIPLEQ ); /* Note: no return, flex continues */ }
<TRIPLEQ>.|\n { /* Append the next token to yytext instead of
* replacing yytext with the next token
*/
yymore();
/* No return yet, flex continues */
}
<TRIPLEQ>{end} { /* We've found the end of the string, but
* we need to get rid of the terminating """
*/
yylval.str = malloc(yyleng - 2);
memcpy(yylval.str, yytext, yyleng - 3);
yylval.str[yyleng - 3] = 0;
return STRING;
}
This works because the . rule in start condition TRIPLEQ will not match " if the " is part of a string matched by {end}; flex always chooses the longest match. It could be made more efficient by using [^"]+|\"|\n instead of .|\n, because that would result in longer matches and consequently fewer calls to yymore(); I didn't write it that way above simply for clarity.
This model is much easier to extend. In particular, if we wanted to use <![CDATA[ as the start and ]]> as the terminator, we'd only need to change the definitions
start "<![CDATA["
end "]]>"
(and possibly the optimized rule inside the start condition, if using the optimization suggested above.)

Insert text into current buffer from function

I'm having a hard time finding the documentation for this. How do I read/write text from the current buffer in my vim functions?
More concretely, if my buffer contains the words foo bar how would write a function to overwrite the word bar with cat so that in the end my buffer contains foo cat?
You can use the substitute ex command inside a function. For example
function! ReplaceBar()
:%s/bar/cat/g
endfunction
This defines a function. The % character means operate on the entire buffer. This searches for bar, replaces it with cat, and the g flag replaces every instance on a line, not just the first.
You can run this function by typing :call ReplaceBar() and hitting enter. Often it's convenient to define a function that does this kind of work, then define a command that calls it:
command! -nargs=0 Bar call ReplaceBar()
That command can be run by typing :Bar.
To access line(s), you can use the getline() function. setline() updates those lines in the buffer. Likewise, new lines are inserted via append().
The latter can also be done with :put ={variable or expression}, and replacements with :substitute. What is better depends on the particular use case. The benefit of the former, lower-level functions is that they don't clobber stuff like the expression register, last used search pattern, search history, etc.
In neovim, they have nvim_put. A few examples:
:call nvim_put(['cat'], 'c', v:false, v:true) ; insert 'cat' right where the cursor is, as if you typed `:normal! icat`
:call nvim_put(['cat'], 'l', v:true, v:true) ; insert 'cat' on the next line
Others have covered the use of :put pretty well, so I won't cover it myself.

Easiest way to remove Latex tag (but not its content)?

I am using TeXnicCenter to edit a LaTeX document.
I now want to remove a certain tag (say, emph{blabla}} which occurs multiple times in my document , but not tag's content (so in this example, I want to remove all emphasization).
What is the easiest way to do so?
May also be using another program easily available on Windows 7.
Edit: In response to regex suggestions, it is important that it can deal with nested tags.
Edit 2: I really want to remove the tag from the text file, not just disable it.
Using a regular expression do something like s/\\emph\{([^\}]*)\}/\1/g. If you are not familiar with regular expressions this says:
s -- replace
/ -- begin match section
\\emph\{ -- match \emph{
( -- begin capture
[^\}]* -- match any characters except (meaning up until) a close brace because:
[] a group of characters
^ means not or "everything except"
\} -- the close brace
and * means 0 or more times
) -- end capture, because this is the first (in this case only) capture, it is number 1
\} -- match end brace
/ -- begin replace section
\1 -- replace with captured section number 1
/ -- end regular expression, begin extra flags
g -- global flag, meaning do this every time the match is found not just the first time
This is with Perl syntax, as that is what I am familiar with. The following perl "one-liners" will accomplish two tasks
perl -pe 's/\\emph\{([^\}]*)\}/\1/g' filename will "test" printing the file to the command line
perl -pi -e 's/\\emph\{([^\}]*)\}/\1/g' filename will change the file in place.
Similar commands may be available in your editor, but if not this will (should) work.
Crowley should have added this as an answer, but I will do that for him, if you replace all \emph{ with { you should be able to do this without disturbing the other content. It will still be in braces, but unless you have done some odd stuff it shouldn't matter.
The regex would be a simple s/\\emph\{/\{/g but the search and replace in your editor will do that one too.
Edit: Sorry, used the wrong brace in the regex, fixed now.
\renewcommand{\emph}[1]{#1}
any reasonably advanced editor should let you do a search/replace using regular expressions, replacing emph{bla} by bla etc.

Resources