Ignoring a delimiter in a string between ()

In Alteryx I need to parse a string that is ; delimited, but the data is constructed in a way that there are extra ; characters inside the parentheses () when there are multiple outputs.
I found the RegEx expression (;)+(?![^{]*}), which would work if the ; were between {}, but I can't figure out how to substitute () into it to make it work.
Data strings look like
Corp Other Matters (Board; Mgmt); Board (Replace); Strategy(Change)
Corp Other Matters (Board); Strategy (Change; Shift)
I’d like an output with
Corp Other Matters (Board; Mgmt) Board (Replace) Strategy(Change)
Corp Other Matters (Board) Strategy (Change; Shift)
Bonus points if there’s a way to then create entries for
Corp Other Matters (Board)
Corp Other Matters (Mgmt)
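One way to adapt the lookahead, assuming the parentheses are always balanced and never nested (as in the samples above), is ;(?![^(]*\)): match a ; only when no ) appears before the next (. Below is a rough sketch in Python of both the split and the bonus expansion (the expand helper is just an illustration, not an Alteryx feature; in Alteryx itself you would plug the same pattern into whichever regex-based tool you use for splitting):

import re

# Split on ";" only when it is NOT inside parentheses: the lookahead
# (?![^(]*\)) fails whenever a ")" shows up before the next "(".
splitter = re.compile(r';(?![^(]*\))')

rows = [
    "Corp Other Matters (Board; Mgmt); Board (Replace); Strategy(Change)",
    "Corp Other Matters (Board); Strategy (Change; Shift)",
]

for row in rows:
    print([part.strip() for part in splitter.split(row)])
# ['Corp Other Matters (Board; Mgmt)', 'Board (Replace)', 'Strategy(Change)']
# ['Corp Other Matters (Board)', 'Strategy (Change; Shift)']

# Bonus: expand "Name (A; B)" into one entry per value inside the parentheses.
def expand(part):
    m = re.fullmatch(r'(.*?)\s*\((.*)\)\s*', part)
    if m is None:
        return [part]
    name, inner = m.groups()
    return [f"{name} ({value.strip()})" for value in inner.split(';')]

print(expand("Corp Other Matters (Board; Mgmt)"))
# ['Corp Other Matters (Board)', 'Corp Other Matters (Mgmt)']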

Related

ANTLR4 Assembler Language Parser - issues - miscellaneous comments

I am trying to write a parser for the IBM Assembler Language; an example is below.
Comment lines start with a star (*) in the first character position; however, there are two problems:
Beyond a set point in the line there can also be descriptive text, but no star (*) is necessary there.
The descriptive text can (and does) contain lexer tokens, such as ENTRY or INPUT.
* TYPE.
ARG DSECT
NXENT DS F some comment text ENTRY NUMBER
NMADR DS F some comment text INPUT NAME
NAADR DS F some comment text
NATYP DS F some comment text
NAENT DS F some comment text
ORG NATYP some comment text
In my lexer I have devised the following, which works absolutely fine:
fragment CommentLine: Star {getCharPositionInLine() == 1}? .*? Nl
;
fragment Star: '*';
fragment Nl: '\r'? '\n' ;
COMMENT_LINE
: CommentLine -> channel (COMMENT)
;
My question is: how do I manage the line comments starting at a particular character position in the parser grammar? I.e. Parser -> NAME DS INT? LETTER ??????????
Sending comments to a COMMENT channel (or skipping them with -> skip) is a technique used to avoid having to define all the places comments are valid in your parser rules.
(Old 360+ Assembler programmer here)
Since there are not really ways to place arbitrarily positioned comments in Assembler source, you don't really need to deal with shunting them off to the side. In fact, because of the way trailing comments are handled in Assembler source, there's just NOT a way to identify them in a lexer rule.
Since it can be a parser rule, you could set up a rule like:
trailingComment: (ID | STRING | NUMBER)* EOL;
where ID, STRING, NUMBER, etc. are just the tokens in your lexer. (You'd need to include pretty much all of them... a good argument for not getting down to individual tokens for MVC, CLC, CLI, and all the other op codes; that's the path to madness.) And of course EOL is your rule to match end of line (probably something like '\r'? '\n').
You would then end each of your rules for parsing a line that can contain a trailing comment (pretty much all of them) with the trailingComment rule.

How to use context free grammars?

Could someone help me with using context-free grammars? Up until now I've used regular expressions to remove comments, block comments and empty lines from a string so that it can be used to count the PLOC. This seems to be extremely slow, so I was looking for a different, more efficient method.
I saw the following post: What is the best way to ignore comments in a java file with Rascal?
I have no idea how to use this, and the help doesn't get me far either. When I try to define the line used in the post I immediately get an error:
lexical SingleLineComment = "//" ~[\n] "\n";
Could someone help me out with this and also explain a bit about how to setup such a context free grammar and then to actually extract the wanted data?
Kind regards,
Bob
First, this will help: the ~ is not part of Rascal's CFG notation; the negation of a character class is written like so: ![\n].
Using a context-free grammar in Rascal goes in three steps:
write it, like for example the syntax definition of the Func language here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func
Use it to parse input, like so:
// This is the basic parse command, but be careful it will not accept spaces and newlines before and after the TopNonTerminal text:
Prog myParseTree = parse(#Prog, "example string");
// you can do the same directly to an input file:
Prog myParseTree = parse(#Prog, |home:///myProgram.func|);
// if you need to accept layout before and after the program, use a "start nonterminal":
start[Prog] myParseTree = parse(#start[Prog], |home:///myProgram.func|);
Prog myProgram = myParseTree.top;
// shorthand for parsing stuff:
myProgram = [Prog] "example";
myProgram = [Prog] |home:///myLocation.txt|;
Once you have the tree you can start using visit and / (deep match) to extract information from the tree, or write recursive functions if you like. Examples can be found here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func, but here are some common idioms as well to extract information from a parse tree:
// produces the source location of each node in the tree:
myParseTree@\loc
// produces a set of all nodes of type Stat
{ s | /Stat s := myParseTree }
// pattern match an if-then-else and bind the three expressions and collect them in a set:
{ e1, e2, e3 | (Stat) `if <Exp e1> then <Exp e2> else <Exp e3> end` <- myExpressionList }
// collect all locations of all sub-trees (every parse tree is of a non-terminal type, which is a sub-type of Tree); it uses |unknown:///| for small sub-trees which have not been annotated, for efficiency's sake, like literals and character classes:
[ t@\loc ? |unknown:///| | /Tree t := myParseTree ]
That should give you a start. I'd go try out some stuff and look at more examples. Writing a grammar is a nice thing to do, but it does require some trial and error, like writing a regex, only more so.
For the grammar you might be writing, which finds source code comments but leaves the rest as "any character", you will need to use longest-match disambiguation a lot:
lexical Identifier = [a-z]+ !>> [a-z]; // means do not accept an Identifier if there is still [a-z] to add to it; so only the longest possible Identifier will match.
This kind of context-free grammar is called an "Island Grammar" metaphorically, because you will write precise rules for the parts you want to recognize (the comments are "Islands") while leaving the rest as everything else (the rest is "Water"). See https://dl.acm.org/citation.cfm?id=837160

ANTLR: Different token with trailing bracket

I am working on an ANTLRv4 grammar for BUGS - my repo is here; the link points to a particular commit, so it shouldn't go out of date.
A minimal code example is below.
For the grammar below, I would like the input rule to go along the t route if the input is T(, but along the id route if the input is just T.
grammar temp;
input: t | id;
t: T '(';
id: ID;
T: 'T' {_input.LA(1) == '('}?;
ID: [a-zA-Z][a-zA-Z0-9._]*;
My ANTLRv4 specification of the BUGS grammar was heavily inspired by the FLEX+BISON lexing and parsing grammar in the JAGS 4.3.0 source code, in the files src/lib/compiler/parser.yy and src/lib/compiler/scanner.ll.
The way they accomplish it is by using the trailing context in the lexer, e.g. r/s. The way to do it in ANTLR is given here, but I cannot get it to work.
I need it to work this way because another part of the grammar depends on this mechanism - relevant code fragment here.
You can recreate my particular issue by cloning my repo and running make; this will print the list of tokens lexed and an error in the parsing stage. In the token list the letter T is lexed as the token 'T' rather than as ID, as I'd like it to be.
I feel there is a much more natural/correct way to do this in ANTLR, but I'm new to it and cannot figure one out.
PS: If you have an idea of how to better name this question, please edit it.
If I understand the problem correctly the following code will work fine:
grammar temp;
input: t | id;
t: T '(';
id: ID | T;
T: 'T';
LPAREN: '(';
ID: [a-zA-Z][a-zA-Z0-9._]*;

(F) Lex, how do I match negation?

Some language grammars use negations in their rules. For example, in the Dart specification the following rule is used:
~('\'|'"'|'$'|NEWLINE)
Which means: match anything that is not one of the rules inside the parentheses. Now, I know that in flex I can negate character classes (e.g. [^ab]), but some of the rules I want to negate could be more complicated than a single character, so I don't think I can use character classes for that. For example, I may need to negate the sequence '"""' for multiline strings, but I'm not sure what the way to do that in flex would be.
(TL;DR: Skip down to the bottom for a practical answer.)
The inverse of any regular language is a regular language. So in theory it is possible to write the inverse of a regular expression as a regular expression. Unfortunately, it is not always easy.
The """ case, at least, is not too difficult.
First, let's be clear about what we are trying to match.
Strictly speaking "not """" would mean "any string other than """". But that would include, for example, x""".
So it might be tempting to say that we're looking for "any string which does not contain """". (That is, the inverse of .*""".*). But that's not quite correct either. The typical usage is to tokenise an input like:
"""This string might contain " or ""."""
If we start after the initial """ and look for the longest string which doesn't contain """, we will find:
This string might contain " or "".""
whereas what we wanted was:
This string might contain " or "".
So it turns out that we need "any string which does not end with " and which doesn't contain """", which is actually the conjunction of two inverses: (~.*" ∧ ~.*""".*)
It's (relatively) easy to produce a state diagram for that: three states, where state 0 is the accepting start state, each successive " moves you one state further along (a third " in a row would mean the string contains """ and is rejected), and any other character takes you back to state 0.
(Note that the only difference between this and the state diagram for "any string which does not contain """" is that there all the states would be accepting, while here states 1 and 2 are not accepting.)
Now, the challenge is to turn that back into a regular expression. There are automated techniques for doing that, but the regular expressions they produce are often long and clumsy. This case is simple, though, because there is only one accepting state and we need only describe all the paths which can end in that state:
([^"]|\"([^"]|\"[^"]))*
This model will work for any simple string, but it's a little more complicated when the string is not just a sequence of the same character. For example, suppose we wanted to match strings terminated with END rather than """. Naively modifying the above pattern would result in:
([^E]|E([^N]|N[^D]))* <--- DON'T USE THIS
but that regular expression will match the string
ENENDstuff which shouldn't have been matched
The real state diagram we're looking for also needs three states, this time tracking how much of END we have just seen (nothing, an E, or EN), with only state 0 accepting, and one way of writing that as a regular expression is:
([^E]|E(E|NE)*([^EN]|N[^ED]))*
Again, I produced that by tracing all the ways to end up in state 0:
[^E] stays in state 0
E goes to state 1, and then:
  (E|NE)*: stay in state 1
  [^EN]: back to state 0
  N[^ED]: back to state 0 via state 2
This can be a lot of work, both to produce and to read. And the results are error-prone. (Formal validation is easier with the state diagrams, which are small for this class of problems, rather than with the regular expressions which can grow to be enormous).
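Still, it's easy to check the corrected pattern mechanically against the naive one (again using Python's re, which accepts the same syntax):

import re

naive   = re.compile(r'([^E]|E([^N]|N[^D]))*')         # the "DON'T USE THIS" version
correct = re.compile(r'([^E]|E(E|NE)*([^EN]|N[^ED]))*')

bad = "ENENDstuff which shouldn't have been matched"

print(naive.fullmatch(bad) is not None)    # True  -- happily swallows the embedded END
print(correct.fullmatch(bad) is not None)  # False -- correctly rejects it

# A string with no END and no trailing E or EN is still accepted:
print(correct.fullmatch("NED or ENE text with no terminator") is not None)  # True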
A practical and scalable solution
Practical Flex rulesets use start conditions to solve this kind of problem. For example, here is how you might recognize python triple-quoted strings:
%x TRIPLEQ
start \"\"\"
end \"\"\"
%%
{start} { BEGIN( TRIPLEQ ); /* Note: no return, flex continues */ }
<TRIPLEQ>.|\n { /* Append the next token to yytext instead of
* replacing yytext with the next token
*/
yymore();
/* No return yet, flex continues */
}
<TRIPLEQ>{end} { /* We've found the end of the string, but
* we need to get rid of the terminating """
*/
yylval.str = malloc(yyleng - 2);
memcpy(yylval.str, yytext, yyleng - 3);
yylval.str[yyleng - 3] = 0;
return STRING;
}
This works because the . rule in start condition TRIPLEQ will not match " if the " is part of a string matched by {end}; flex always chooses the longest match. It could be made more efficient by using [^"]+|\"|\n instead of .|\n, because that would result in longer matches and consequently fewer calls to yymore(); I didn't write it that way above simply for clarity.
This model is much easier to extend. In particular, if we wanted to use <![CDATA[ as the start and ]]> as the terminator, we'd only need to change the definitions
start "<![CDATA["
end "]]>"
(and possibly the optimized rule inside the start condition, if using the optimization suggested above.)

How to match any symbol in ANTLR parser (not lexer)?

How to match any symbol in ANTLR parser (not lexer)? Where is the complete language description for ANTLR4 parsers?
UPDATE
Is the answer "impossible"?
You first need to understand the roles of each part in parsing:
The lexer: this is the object that tokenizes your input string. Tokenizing means to convert a stream of input characters to an abstract token symbol (usually just a number).
The parser: this is the object that only works with tokens to determine the structure of a language. A language (written as one or more grammar files) defines the token combinations that are valid.
As you can see, the parser doesn't even know what a letter is. It only knows tokens. So your question is already wrong.
Having said that, it would probably help to know why you want to skip individual input letters in your parser. It looks like your base concept needs adjustments.
It depends what you mean by "symbol". To match any token inside a parser rule, use the . (DOT) meta char. If you're trying to match any character inside a parser rule, then you're out of luck, there is a strict separation between parser- and lexer rules in ANTLR. It is not possible to match any character inside a parser rule.
It is possible, but only if you have such a basic grammar that the reason to use ANTLR is negated anyway.
If you had the grammar:
text : ANY_CHAR* ;
ANY_CHAR : . ;
it would do what you (seem to) want.
However, as many have pointed out, this would be a pretty strange thing to do. The purpose of the lexer is to identify different tokens that can be strung together in the parser to form a grammar, so your lexer can either identify the specific string "JSTL/EL" as a token, or [A-Z]'/EL', [A-Z]'/'[A-Z][A-Z], etc., depending on what you need.
The parser is then used to define the grammar, so:
phrase : CHAR* jstl CHAR* ;
jstl : JSTL SLASH QUALIFIER ;
JSTL : 'JSTL' ;
SLASH : '/' ;
QUALIFIER : [A-Z][A-Z] ;
CHAR : . ;
would accept "blah blah JSTL/EL..." as input, but not "blah blah EL/JSTL...".
I'd recommend looking at The Definitive ANTLR 4 Reference, in particular the section on "Islands in the stream" and the Grammar Reference (Ch 15), which specifically deals with Unicode.

Resources