Parse backticks inside quoted strings antlr4 - parsing

I'm developing a program that parses and executes commands similar to bash. I want to obtain the string within the backticks that might be contained in either single or double quotes. For example, I want the string "echo hello", from the input string 'echo "`echo hello`" so I can process it first.
Is it possible to get the parsed string directly from antlr, or should I handle this command substitution functionality within my actual program? Any help would be much appreciated!

Pretty easy: first do a first parse run for the entire input. Find the string tokens and feed them again to your lexer. You have to have a rule that matches backtick strings, to make this work. In the second lexer run you get all embedded backtick delimited strings.
Alternatively, just iterate over the strings from the first run and look for a start back tick, then search for the next backtick as the second delimiter. Do that as often as you find more back ticks.

Related

Snippets for Gedit: how to change the text in a placeholder to make the letters uppercase?

I’m trying to improve a snippet for Gedit that helps me write shell scripts.
Currently, the snippet encloses the name of a variable into double quotes surrounding curly brackets preceded with a dollar sign. But to make the letters uppercase, I have to switch to the caps-lock mode or hold down a shift key when entering the words. Here is the code of the snippet:
"\${$1}"
I would like that the snippet makes the letters uppercase for me. To do that, I need to know how to make text uppercase and change the content of a placeholder.
I have carefully read the following articles:
https://wiki.gnome.org/Apps/Gedit/Plugins/Snippets
https://blogs.gnome.org/jessevdk/2009/12/06/about-snippets/
https://www.marxists.org/admin/volunteers/gedit-sed.htm
How do you create a date snippet in gedit?
But I still have no idea how to achieve what I want — to make the letters uppercase. I tried to use the output of shell programs, a Python script, the regular expressions — the initial text in the placeholder is not changed. The last attempt was the following (for clarity, I removed the surrounding double-quotes and the curly brackets with the dollar — working just on the letter case):
${1}$<[1]: return $1.upper()>
But instead of MY_VARIABLE I get my_variableMY_VARIABLE.
Perhaps, the solution is obvious, but I cannot get it.
I did it! The solution found!
Before all, I have to say that I don’t count the solution as correct or corresponding to the ideas of the Gedit editor. It’s a dirty hack, of course. But, strangely, there is no way to change the initial content of placeholders in the snippets — haven’t I just found a standard way to do that?
So. If they don’t allow us to change the text in placeholders, let’s ask the system to do that.
The first thought that stroke me was to print backspace characters. There are two ways to do that: a shell script and a python script. The first approach might look like: $1$(printf '\b') The second one should do the same: $1$<[1]: return '\b'> But both of them don’t work — Gedit prints surrogate squares instead of real backspace characters.
Thus, xdotool is our friend! So is metaprogramming! You will not believe — metaprogramming in a shell script inside a snippet — sed will be writing the scenario for xdotool. Also, I’ve added a feature that changes spaces to underscores for easier typing. Here is the final code of the snippet:
$1$(
eval \
xdotool key \
--delay 5 \
`echo "${1}" | sed "s/./ BackSpace/g;"`
echo "\"\${${1}}\"" \
| tr '[a-z ]' '[A-Z_]'
)$0
Here are some explanations.
Usually, I never use backticks in my scripts because of some troubles and incompatibilities. But now is not the case! It seems Gedit cannot interpret the $(...) constructions correctly when they are nested, so I use the backticks here.
A couple of words about using the xdotool command. The most critical part is the --delay option. By default, it’s 12 milliseconds. If I leave it as is, there will be an error when the length of the text in the placeholder is quite long. Not to mention the snippet processing becomes slow. But if I set the time interval too small, some of the emulated keystrokes sometimes will be swallowed somewhere. So, five milliseconds is the delay that turns out optimal for my system.
At last, as I use backspaces to erase the typed text, I cannot use template parts outside the placeholder. Thus, such transformations must be inside the script. The complex heap after the echo command is what the template parts are.
What the last tr command does is the motivator of all this activity.
It turns out, Gedit snippets may be a power tool. Good luck!

Parsing newline in Jison

Hi I am a newbie for Jison and was trying to learn it. I try the online jison parser calculator code on http://techtonik.github.io/jison/try/. It is working fine for the expression
5*PI^2.
But when I added a new expression on a newline, the parser will not take the newline and try to parse another expression as if it is on the same line.
Input :
5*PI^2
23+56
Parser takes it as :
5*PI^223+56
This fails, hence I want to know how to parse newline in jison parsor.
The problem here is that the Jison parser expects a single expression to parse, and it tries to evaluate whether the ENTIRE text is valid as a whole. What you've given it in this case is TWO separate expressions that don't evaluate correctly together, which is why it fails. If, for example, you evaluate
5*PI^2
+
23+56
Then it has no problems. This is because Jison is trying to parse the entire value it's given, and it allows you to break complex expressions up over multiple lines.
However, that doesn't stop you from evaluating lines individually if you want to. Instead of passing the parse function the entire text from the field, just split the text into an array using JavaScript's string split method (splitting on the new-line character, '\n'), then loop through and pass each line of the content to the parse function separately.

antlr: how to [sometimes] parse things in quotes?

I have a situation where my language allows quotes strings but sometimes I want to interpret the contents of the quoted string as language constructs. Think of it as, say, eval function.
So to support quoted strings i need a lexer rule and it overrides my attempts to have a grammar rule evaluating things in quotes if prefixed with 'eval'. Is there any way to deal with this in the grammar?
IMO you should not try to handle this case directly through the lexer.
I think I would leave the string as it in the lexer and add some code in the eval rule of the parser that calls a sub-parser on the string content.
If you want to implement an eval function, you're really looking for a runtime interpreter.
The only time you need an "eval" function is when you want to build up the content to compile at runtime. If you have the content available at compile-time, you can parse it without it being a string...
So... keep it as a string, and then use the same parser at runtime to parse/compile its contents.

What is the proper Lua pattern for quoted text?

I've been playing with this for an hour or tow and have found myself at a road block with the Lua pattern matching utilities. I am attempting to match all quoted text in a string and replace it if needed.
The pattern I have come up with so far is: (\?[\"\'])(.-)%1
This works in some cases but, not all cases:
Working: "This \"is a\" string of \"text to\" test with"
Not Working: "T\\\"his \"is\' a\" string\" of\' text\" to \"test\" wit\\\"h"
In the not working example I would like it to match to (I made a function that gets the matches I desire, I'm just looking for a pattern to use with gsub and curious if a lua pattern can do this):
string
a" string" of
is' a" string" of' text
test
his "is' a" string" of' text" to "test" wit
I'm going to continue to use my function instead for the time being, but am curious if there is a pattern I could/should be using and i'm just missing something with patterns.
(a few edits b/c I forgot about stackoverflows formating)
(another edit to make a non-html example since it was leading to assumptions that I was attempting to parse html)
Trying to match escaped, quoted text using regular expressions is like trying to remove the daisies (and only the daises) from a field using a lawnmower.
I made a function that gets the matches I desire
This is the correct move.
I'm curious if a lua pattern can do this
From a practical point of view, even if a pattern can do this, you don't want to. From a theoretical point of view, you are trying to find a double quote that is preceded by an even number of backslashes. This is definitely a regular language, and the regular expression you want would be something like the following (Lua quoting conventions)
[[[^\](\\)*"(.-[^\](\\)*)"]]
And the quoted string would be result #2. But Lua patterns are not full regular expressions; in particular, you cannot put a * after a parenthesized pattern.
So my guess is that this problem cannot be solved using Lua patterns, but since Lua patterns are not a standard thing in automata theory, I'm not aware of any body of proof technique that you could use to prove it.
The issue with escaped quotes is that, in general, if there's an odd number of backslashes before the quote, then it's escaped, and if there's an even number, it's not. I do not believe that Lua pattern-matching is powerful enough to represent this condition, so if you need to parse text like this, then you should seek another way. Perhaps you can iterate through the string and parse it, or you could find each quote in turn and read backwards, counting the backslashes until you find a non-backslash character (or the beginning of the string).
If you absolutely must use patterns for some reason, you could try doing this in a multi-step process. First, gsub for all occurrences of two backslashes in a row, and replace them with some sentinel value. This must be a value that does not already occur in the string. You could try something like "\001" if you know this string doesn't contain non-printable characters. Anyway, once you've replaced all sequences of two backslashes in a row, any backslashes left are escaping the following character. Now you can apply your original pattern, and then finally you can replace all instances of your sentinel value with two backslashes again.
Lua's pattern language is adequate for many simple cases. And it has at least one trick you don't find in a typical regular expression package: a way to match balanced parenthesis. But it has its limits as well.
When those limits are exceeded, then I reach for LPeg. LPeg is an implementation of a Parsing Expression Grammer for Lua, and was implemented by one of Lua's original authors so the adaptation to Lua is done quite well. A PEG allows specification of anything from simple patterns through complete language grammars to be written. LPeg compiles the grammar to a bytecode and executes it extremely efficiently.
you should NOT be trying to parse HTML with regular expressions, HTML and XML are NOT regular languages and can not be successfully manipulated with regular expressions. You should use a dedicated HTML parser. Here are lots of explanations why.

Writing a subshell parsing rule on ANTLR

I'm trying to create a simple BaSH-like grammar on ANTLRv3 but haven't been able to parse (and check) input inside subshell commands.
Further explanation:
I want to parse the following input:
$(command parameters*)
`command parameters`
"some text $(command parameters*)"
And be able to check it's contents as I would with simple input such as: command parameters.
i.e.:
Parsing it would generate a tree like (SUBSHELL (CMD command (PARAM parameters*))) (tokens are in upper-case)
I'm able to ignore '$('s and '`'s, but that won't cover the cases where the subshells are used inside double-quoted strings, like:
$ echo "String test $(ls -l) end"
So... any tips on how do I achieve this?
I'm not very familiar with the details of Antlr v3, but I can tell you that you can't handle bash-style command substitution inside double-quoted strings in a traditional-style lexer, as the nesting cannot be expressed using a regular grammar. Most traditional compiler-compilers restrict lexers to use regular grammars so that efficient DFAs can be constructed for them. (Lexers, which irreducibly have to scan every single character of the source, have historically been one of the slowest parts of a compiler.)
You must either parse " as a token and (ideally) use a different lexer or lexer mode for the internals of strings, so that most shell metacharacters, e.g. '{', aren't parsed as tokens but as text; or alternatively, do away with the lexer-parser division and use a scannerless approach, so that the "lexer" rule for double-quoted strings can call into the "parser" rule for command substitutions.
I would favour the scannerless approach. I would investigate how well Antlr v3 supports writing grammars that work directly over a character stream, rather than using a token stream.

Resources