Language parsing in Clojure with line numbers - parsing

I have a very simple language. A function is defined as some number of comments (indicated by the line starting with a semicolon) followed by a function name (a word followed by parens), followed by anything else, and ending with a "q". Here is a parse-ez function:
(defn routine []
  (multi* (regex #";.*"))
  (regex #"(\w+)\(.*\).*" 1)
  (multi* (regex #"[^q].*"))
  (regex #"q.*"))
This works, but I want to return the line numbers on which the different patterns match. Is there a way to do this or do I need to write my own parser?
As it stands right now my language is simple enough that writing a new parser wouldn't matter too much, but it will limit me as complexity increases.

There is a "line-pos" function in parse-ez. Can't you use that?
line-pos doc:
"Returns [line column] vector representing the current cursor position
of the parser"
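If line-pos turns out not to fit, the hand-written alternative mentioned in the question is small for a line-oriented language like this one. The following is only a sketch in Python, not parse-ez code: the function name `parse_routine` and the shape of the returned dictionary are made up for illustration, and it assumes the input has already been split into lines.

```python
import re

COMMENT = re.compile(r";.*")          # a comment line
NAME = re.compile(r"(\w+)\(.*\).*")   # a word followed by parens
END = re.compile(r"q.*")              # the terminating line

def parse_routine(lines, start=0):
    """Parse one routine from lines[start:].

    Returns (info, next_index); all line numbers in info are 1-based.
    """
    i = start
    comment_lines = []
    while i < len(lines) and COMMENT.fullmatch(lines[i]):
        comment_lines.append(i + 1)
        i += 1

    match = NAME.fullmatch(lines[i]) if i < len(lines) else None
    if match is None:
        raise ValueError(f"line {i + 1}: expected a function name")
    name, name_line = match.group(1), i + 1
    i += 1

    # Everything up to the first line starting with "q" is the body.
    body_lines = []
    while i < len(lines) and not END.fullmatch(lines[i]):
        body_lines.append(i + 1)
        i += 1
    if i == len(lines):
        raise ValueError("reached end of input before a line starting with 'q'")

    info = {
        "name": name,
        "comment_lines": comment_lines,
        "name_line": name_line,
        "body_lines": body_lines,
        "end_line": i + 1,
    }
    return info, i + 1

source = "; adds two numbers\n; (example input)\nadd(a, b)\nx = a + b\nq\n"
info, next_index = parse_routine(source.splitlines())
print(info)
```

Calling `parse_routine` again with `start=next_index` would pick up the next routine in the same input, which is one way to keep line numbers global across a whole file.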

Related

Finding strings between two strings in lua

I have been trying to find all possible strings in between two strings.
This is my input: "print/// to be able to put any amount of strings here endprint///"
The goal is to print every string in between print/// and endprint///
You can use Lua's string patterns to achieve that.
local text = "print/// to be able to put any amount of strings here endprint///"
print(text:match("print///(.*)endprint///"))
The pattern "print///(.*)endprint///" captures any character that is between "print///" and "endprint///"
See the section on patterns in the Lua reference manual for details.
In this kind of problem, you don't use the greedy quantifiers * or +. Instead, you use the lazy quantifier -. This is because * matches until the last occurrence of the sub-pattern after it, while - matches until the first occurrence of the sub-pattern after it. So, you should use this pattern:
print///(.-)endprint///
And to match it in Lua, you do this:
local text = "print/// to be able to put any amount of strings here endprint///"
local match = text:match("print///(.-)endprint///")
-- `match` should now be the text in-between.
print(match) -- " to be able to put any amount of strings here " (the spaces next to the delimiters are part of the capture)

How to use context free grammars?

Could someone help me with using context-free grammars? Up until now I've used regular expressions to remove comments, block comments and empty lines from a string so that it can be used to count the PLOC. This seems to be extremely slow, so I was looking for a different, more efficient method.
I saw the following post: What is the best way to ignore comments in a java file with Rascal?
I have no idea how to use this, and the help doesn't get me far either. When I try to define the line used in the post I immediately get an error.
lexical SingleLineComment = "//" ~[\n] "\n";
Could someone help me out with this and also explain a bit about how to setup such a context free grammar and then to actually extract the wanted data?
Kind regards,
Bob
First, this will help: ~ is not part of Rascal's grammar notation. The negation of a character class is written with !, like so: ![\n].
Using a context-free grammar in Rascal takes three steps:
Write it, for example like the syntax definition of the Func language here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func
Use it to parse input, like so:
// This is the basic parse command, but be careful: it will not accept spaces and newlines before and after the Prog text:
Prog myParseTree = parse(#Prog, "example string");
// you can do the same directly to an input file:
Prog myParseTree = parse(#Prog, |home:///myProgram.func|);
// if you need to accept layout before and after the program, use a "start nonterminal":
start[Prog] myParseTree = parse(#start[Prog], |home:///myProgram.func|);
Prog myProgram = myParseTree.top;
// shorthand for parsing stuff:
myProgram = [Prog] "example";
myProgram = [Prog] |home:///myLocation.txt|;
Once you have the tree you can start using visit and / deepmatch to extract information from the tree, or write recursive functions if you like. Examples can be found here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func , but here are some common idioms as well to extract information from a parse tree:
// produces the source location of each node in the tree:
myParseTree@\loc
// produces a set of all nodes of type Stat
{ s | /Stat s := myParseTree }
// pattern match an if-then-else and bind the three expressions and collect them in a set:
{ e1, e2, e3 | (Stat) `if <Exp e1> then <Exp e2> else <Exp e3> end` <- myExpressionList }
// collect all locations of all sub-trees (every parse tree is of a non-terminal type, which is a sub-type of Tree). It uses |unknown:///| for small sub-trees which have not been annotated for efficiency's sake, like literals and character classes:
[ t@\loc?|unknown:///| | /Tree t := myParseTree ]
That should give you a start. I'd go try out some stuff and look at more examples. Writing a grammar is a nice thing to do, but like writing a regex it takes trial and error, only more so.
For the grammar you might be writing, which finds source code comments but leaves the rest as "any character" you will need to use the longest match disambiguation a lot:
lexical Identifier = [a-z]+ !>> [a-z]; // means do not accept an Identifier if there is still [a-z] to add to it; so only the longest possible Identifier will match.
This kind of context-free grammar is called an "Island Grammar" metaphorically, because you will write precise rules for the parts you want to recognize (the comments are "Islands") while leaving the rest as everything else (the rest is "Water"). See https://dl.acm.org/citation.cfm?id=837160
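To make the island/water idea concrete outside Rascal, here is a rough hand-written scanner in Python that counts lines containing code. It is only a sketch under stated assumptions: comments are `//` and `/* ... */` as in Java, string literals that happen to contain comment markers are not handled, and the name `count_code_lines` is invented for this example.

```python
def count_code_lines(source: str) -> int:
    """Count lines that contain something other than comments and whitespace."""
    code_lines = set()
    line = 1
    i = 0
    n = len(source)
    while i < n:
        if source.startswith("//", i):
            # Island: single-line comment, skipped up to (not including) the newline.
            end = source.find("\n", i)
            i = n if end == -1 else end
        elif source.startswith("/*", i):
            # Island: block comment; keep the line counter in step with the
            # newlines it spans.
            end = source.find("*/", i + 2)
            end = n if end == -1 else end + 2
            line += source.count("\n", i, end)
            i = end
        else:
            # Water: any other character. Non-blank water marks the line as code.
            ch = source[i]
            if ch == "\n":
                line += 1
            elif not ch.isspace():
                code_lines.add(line)
            i += 1
    return len(code_lines)

sample = (
    "// header\n"
    "int x = 1; // trailing comment\n"
    "\n"
    "/* block\n"
    "   comment */\n"
    "int y = 2;\n"
)
print(count_code_lines(sample))  # 2
```

A grammar-based solution does the same job declaratively and can be extended to strings and nested constructs without the scanner growing new special cases.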

How to split particular words in lua

I am trying to split this statement in Lua:
sendex,000D6F0011BA2D60,fb,btn,1,on,100,null
I need output like this:
Mac:000D6F0011BA2D60
Value:1
command:on
value:100
How do I split the string and get the values?
local input = "sendex,000D6F0011BA2D60,fb,btn,1,on,100,null"
local buffer = {}
for word in input:gmatch('[^,]+') do
  table.insert(buffer, word)
  --print(word) -- uncomment this to see the words as they are being matched ;)
end
print("Mac:"..buffer[2])
print("Value:"..buffer[5])
...
For a complete explanation of what string.gmatch does, see the Lua reference. To summarize, it iterates over a string and searches for a pattern, in this case [^,]+, meaning all groups of 1 or more characters that aren't a comma. Every time it finds said pattern, it does something with it and continues searching.
If your input is exactly like you have described, the code below works:
s="sendex,000D6F0011BA2D60,fb,btn,1,on,100,null"
Mac,Value,command,value = s:match(".-,(.-),.-,.-,(.-),(.-),(.-),")
print(Mac,Value,command,value)
It uses the non-greedy pattern .- to split the input into fields. It also captures the relevant fields.

Lua pattern help (Double parentheses)

I have been coding a program in Lua that automatically formats IRC logs from a roleplay. In the roleplay logs there is a specific guideline for "Out of character" conversation, which we use double parentheses for. For example: ((<Things unrelated to roleplay go here>)). I have been trying to have my program remove text between double brackets (and including both brackets). The code is:
ofile = io.open("Output.txt", "w")
rfile = io.open("Input.txt", "r")
p = rfile:read("*all")
w = string.gsub(p, "%(%(.*?%)%)", "")
ofile:write(w)
The pattern here is "%(%(.*?%)%)". I've tried multiple variations of the pattern, all without success:
1. %(%(.*?%)%) --Wouldn't do anything.
2. %(%(.*%)%) --Would remove *everything* after the first OOC message.
Then, my friend told me that prepending the brackets with percentages wouldn't work, and that I had to use backslashes to 'escape' the parentheses.
3. \(\(.*\)\) --resulted in the output file being completely empty.
4. (\(\(.*\)\)) --Same result as above.
5. (\(\(.*?\)\) --would for some reason, remove large parts of the text for no apparent reason.
6. \(\(.*?\)\) --would just remove all the text except for the last line.
The short, absolute question:
What pattern would I need to use to remove all text between double parentheses, and remove the double parentheses themselves too?
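Your friend is thinking of regular expressions. Lua patterns are similar, but different. % is the correct escape character.
Your pattern should be %(%(.-%)%). The - is similar to * in that it matches any number of repetitions of the preceding item, but while * tries to match as many characters as it can (it's greedy), - matches as few characters as possible (it's non-greedy). It won't go overboard and run on to a later double closing parenthesis. Lua patterns have no *? quantifier: in %(%(.*?%)%) the ? is just a literal question mark, so that pattern only matches when a ? comes right before the closing parentheses, which is presumably why your first attempt did nothing.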
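For comparison, this is what the friend's regular-expression syntax looks like where it does apply. The sketch below uses Python's `re` module rather than Lua, and the sample log line is invented; it only illustrates that the regex `.*?` plays the role of Lua's `.-`, and what the greedy version does to the text in between.

```python
import re

log = "Alice waves. ((brb, phone)) Bob nods. ((ok)) The end."

# Greedy: .* runs to the LAST "))", so the in-character text between the
# two OOC remarks is removed as well.
greedy = re.sub(r"\(\(.*\)\)", "", log)

# Lazy: .*? stops at the FIRST "))", like Lua's .- does.
lazy = re.sub(r"\(\(.*?\)\)", "", log)

print(greedy)  # Alice waves.  The end.
print(lazy)    # Alice waves.  Bob nods.  The end.
```

One difference to keep in mind: Python's `.` does not match a newline unless `re.DOTALL` is passed, while Lua's `.` matches any character, so an OOC remark that spans lines behaves differently in the two.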

searching strings for keywords: questions about the "failure function"

I've got a question on failure function description from "Compilers: Principles, Techniques, and Tools" aka DragonBook
Firstly, the quote:
In order to process text strings rapidly and search those strings for a keyword,
it is useful to define, for keyword b1b2...bn, and position s in that keyword, a failure function, f(s) ...
The objective is that b1b2...bf(s) is the longest proper prefix of
b1...bs, that is also a suffix of b1...bs. The reason f(s) is important is that
if we are trying to match a text string for b1b2...bn, and we have matched the
first s positions, but we then fail (i.e., the next position of the text string does
not hold bs+1), then f(s) is the longest prefix of b1...bn that could possibly
match the text string up to the point we are at. Of course, the next character of
the text string must be bf(s)+1 or else we still have problems and must consider
a yet shorter prefix, which will be bf(f(s)).
So, the questions:
1. If we've matched s positions with the text, why is f(s) the longest prefix of b1...bn that matches the string? I would think s is the longest prefix.
2. Why must the next character of the text string be bf(s)+1? We have a mismatch at this position, so does it matter at all what the character is?
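f(s) is the length of the longest proper prefix of the keyword that still lines up with the text at that position, so it is the longest prefix that might yet grow into a match of the entire keyword. The prefix of length s has just been ruled out by the mismatch, which is why the next candidate has to be shorter. The idea is not to match the keyword against the start of the text only, but to find any position where the keyword appears.
Consider a search for the word 'aaaba' in the text 'aaaabaa'. The match fails after the first three a's, because the fourth text character is an 'a' where the keyword expects a 'b'. It is not necessary to restart the comparison from the second 'a' of the text: f(3) = 2 says that the last two a's already matched form a prefix of the keyword, so matching resumes with two characters in hand. The fourth 'a' extends that to three, and if the next letter is a 'b' (which it is), we may have a match there.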
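The example can be checked with a short Python sketch of the failure function and the search that uses it. This is the standard Knuth-Morris-Pratt construction written from the definition quoted above, not code taken from the book; the function names are illustrative, and the list is 0-based, so `f[s - 1]` holds the book's f(s).

```python
def failure_function(keyword):
    """f[s - 1] is the length of the longest proper prefix of keyword[:s]
    that is also a suffix of keyword[:s]."""
    f = [0] * len(keyword)
    t = 0  # length of the prefix currently being extended
    for s in range(1, len(keyword)):
        # On a mismatch, fall back to successively shorter prefixes.
        while t > 0 and keyword[s] != keyword[t]:
            t = f[t - 1]
        if keyword[s] == keyword[t]:
            t += 1
        f[s] = t
    return f

def find_keyword(text, keyword):
    """Return the 0-based index of the first occurrence of keyword, or -1."""
    if not keyword:
        return 0
    f = failure_function(keyword)
    s = 0  # number of keyword characters matched so far
    for i, ch in enumerate(text):
        # The next text character must equal keyword[s]; otherwise try the
        # shorter prefix given by the failure function, and so on.
        while s > 0 and ch != keyword[s]:
            s = f[s - 1]
        if ch == keyword[s]:
            s += 1
        if s == len(keyword):
            return i - s + 1
    return -1

print(failure_function("aaaba"))         # [0, 1, 2, 0, 1]
print(find_keyword("aaaabaa", "aaaba"))  # 1
```

The inner `while` loop is the "yet shorter prefix" step from the quote: each pass replaces s by f(s) until the current text character fits or no prefix is left.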
