Does the FEEL language built-in string function 'replace' affect the first match or all occurrences of the search pattern? - dmn

The Decision Model and Notation (DMN) FEEL language has many built-in functions.
For strings, one such function is replace. It accepts an input string, a regex pattern, a replacement string, and optional flags.
Does replace act only on the first regex match, or does it replace all matches? The DMN version 1.3 specification, page 138, does not seem to address this.

To answer your question: it replaces all matches.
Some other valid examples:
replace("banana","a","o") = "bonono"
This is taken from the DMN TCK project, where it is one of the agreed behaviour test cases.
I agree that the DMN specification document from OMG could list some more down-to-earth examples :)
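As a rough analogy only (this is Python, not FEEL, and the regex dialects are not identical), Python's re.sub also replaces every non-overlapping match unless told otherwise. The helper names replace_all and replace_first are made up for this sketch:

```python
import re

def replace_all(text: str, pattern: str, replacement: str) -> str:
    """Replace every non-overlapping match, like the TCK test case above."""
    return re.sub(pattern, replacement, text)

def replace_first(text: str, pattern: str, replacement: str) -> str:
    """Replace only the first match, for contrast."""
    return re.sub(pattern, replacement, text, count=1)

print(replace_all("banana", "a", "o"))    # bonono
print(replace_first("banana", "a", "o"))  # bonana
```

The first result matches the "bonono" value from the TCK case; the second shows what a first-match-only replace would have produced instead.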

Related

What is parsing? (And differences from search and grep?)

What exactly is parsing? I mean, generally. How is parsing different from searching? On the command line, if I use the grep tool/command, is that parsing?
For example, if I have just one string:
"Hello world! How are you doing today?"
and I tried to search (using grep or any other tool) whether the word "you" is within that string; is that parsing?
What if I do a web search; for example in Google? Is that parsing?
Or is parsing the name of the process that is a part of the process known as "Search"?
The verb "parse" is essentially related to the word "part", as in "part of speech". (See, for example, the on-line etymology dictionary.)
To "parse" a sentence has traditionally meant to break the sentence down into its component parts and identify their relationship with each other. For example, given "I asked a question.", we can parse it into a subject ("I"), a transitive verb in past tense ("asked"), and an object phrase consisting of an article ("a") and a noun ("question"). The parse indicates that the subject performed some action on the object; this is not the same statement as *"A question asked I", and not just because the latter is ungrammatical.
With the advent of computer languages and computational theory, the term "parsing" has been generalized to include analysis of strings which are not human languages. Some people would even use it to simply mean "to divide a string into its component parts", such as "parsing" a line in a CSV file into fields.
It's quite a stretch to apply that to merely searching for a string inside another string, although there may be contexts in which that is an acceptable use of the word. Personally, I would only use it for the action of completely deconstructing a structured string.
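To make the distinction concrete, here is a small Python sketch (the CSV line is an invented example). Searching answers "is it there, and where?", while parsing, even in the looser sense, breaks a structured string into its parts:

```python
import csv
import io

sentence = "Hello world! How are you doing today?"

# Searching: a yes/no answer, or a position, for a substring.
found = "you" in sentence
position = sentence.find("you")

# Parsing (in the looser sense): split a structured string into its fields,
# respecting the format's rules (here, a quoted field containing a comma).
line = 'Doe,"John, Jr.",42'
fields = next(csv.reader(io.StringIO(line)))

print(found, position)  # True 21
print(fields)           # ['Doe', 'John, Jr.', '42']
```

Note that a plain split on "," would get the second field wrong; knowing the structure of the string is what makes this parsing rather than searching.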

Does this require a 2-pass parse: comments embedded within tokens?

Using a parser generator I want to create a parser for "From headers" in email messages. Here is an example of a From header:
From: "John Doe" <john@doe.org>
I think it will be straightforward to implement a parser for that.
However, there is a complication in the "From header" syntax: comments may be inserted just about anywhere. For example, a comment may be inserted within "john":
From: "John Doe" <jo(this is a comment)hn@doe.org>
And comments may be inserted in many other places.
How to handle this complication? Does it require a "2-pass" parser: one pass to remove all comments and a second pass to create the parse tree for the From header? Do modern parser generators support multiple passes on the input? Can it be parsed in a single pass? If yes, would you sketch the approach please?
I'm not convinced that your interpretation of email addresses is correct; my reading of RFC-822 leads me to believe that a comment can only come before or after a "word", and that "word"s in the local-part of an addr-spec need to be separated by dots ("."). Section 3.1.4 gives a pretty good hint on how to parse: you need a lexical analyzer which feeds syntactic symbols into the parser; the lexical analyzer is expected to unfold headers, ignore whitespace, and identify comments, quoted strings, atoms, and special characters.
Of course, RFC-822 has long been obsoleted, and I think that email headers with embedded comments are anachronistic.
Nonetheless, it seems like you could easily achieve the analysis you wish using flex and bison. As indicated, flex would identify the comments. Strictly speaking, you cannot identify comments with a regular expression, since comments nest. But you can recognize simple nested structures using a start condition stack, or even more economically by maintaining a counter (since flex won't return until the outermost parenthesis is found, the counter doesn't need to be global.)
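The counter idea can be sketched outside flex as well. The following Python function is only an illustration of the technique, not a flex scanner and not a complete RFC-822 lexer; the function name and the sample address are hypothetical. It removes nested comments in a single pass by tracking the nesting depth, and it leaves quoted strings alone:

```python
def strip_comments(header: str) -> str:
    """Remove parenthesized comments, which may nest, in one pass.

    Quoted strings are copied through untouched, so a "(" inside
    double quotes does not start a comment.
    """
    out = []
    depth = 0          # current comment nesting level
    in_quotes = False  # inside a double-quoted string?
    i = 0
    while i < len(header):
        ch = header[i]
        if ch == "\\" and i + 1 < len(header) and (depth > 0 or in_quotes):
            # A backslash escapes the next character inside comments
            # and quoted strings; keep the pair only outside comments.
            if depth == 0:
                out.append(header[i:i + 2])
            i += 2
            continue
        if depth == 0 and ch == '"':
            in_quotes = not in_quotes
            out.append(ch)
        elif not in_quotes and ch == "(":
            depth += 1
        elif depth > 0 and ch == ")":
            depth -= 1
        elif depth == 0:
            out.append(ch)
        i += 1
    if depth != 0:
        raise ValueError("unterminated comment")
    return "".join(out)

print(strip_comments('From: "Jane Doe" <ja(a (nested) comment)ne@example.org>'))
# From: "Jane Doe" <jane@example.org>
```

Because the depth counter lives inside the scanning loop, the rest of the parser never sees a comment, which is the same division of labour as between the lexical analyzer and the parser described above.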

ParseKit greedy matching mode

I am building a kind of formula validator, and I am using the ParseKit framework to do it. My approach is to write a proper grammar; when the didMatchFormula callback is called for a sample string, I assume a formula has been found and that the string is therefore valid.
There is one difficulty, however: a formula is detected in the sample string even when other characters follow the formula part. I need something like a greedy matching mode, in which the entire string is matched against the formula grammar, so that didMatchFormula is called only if the string contains a formula and nothing else.
Can you give me some hints on how to accomplish that with ParseKit, or in some other way?
I cannot use regular expressions since my formulas would use recursion and regexp is not a good tool for handling that.
Developer of ParseKit here.
Probably the simplest and most elegant way to do this with ParseKit (or any parsing toolkit) is to design your formula language to have a terminator char after every statement. This is the same concept as ; terminating statements in most C-like programming languages.
Here's an example toy formula language which uses . as the statement terminator:
@start = lang;
lang = statement+;
statement = Word+ terminator;
terminator = '.';
Notice how I have designed the language so that your "greedy" requirement is an inherent feature of the language. Think about it – if the input string ends with any junk content which is not a valid statement ending in a ., my lang production will not find a match and the parse will fail.
With this type of design, you won't need any "greedy" features in the parsing toolkit you use. Rather, your requirement will be naturally met by your language design.
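To show why the terminator design rejects trailing junk, here is a hand-written Python sketch of the same toy language. It is not ParseKit code; the function names are invented, and Word is assumed to mean a run of ASCII letters:

```python
import re

# One token: optional whitespace, then either a word or the "." terminator.
TOKEN = re.compile(r"\s*(?:(?P<word>[A-Za-z]+)|(?P<dot>\.))")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append("word" if m.group("word") else "dot")
        pos = m.end()
    return tokens

def is_valid_lang(text):
    """lang = statement+ ;  statement = Word+ '.'"""
    try:
        tokens = tokenize(text)
    except ValueError:
        return False
    i, statements = 0, 0
    while i < len(tokens):
        words = 0
        while i < len(tokens) and tokens[i] == "word":
            i += 1
            words += 1
        # A statement needs at least one word and must end with ".".
        if words == 0 or i == len(tokens) or tokens[i] != "dot":
            return False
        i += 1
        statements += 1
    return statements >= 1

print(is_valid_lang("foo bar. baz."))  # True
print(is_valid_lang("foo bar. junk"))  # False
```

The second input fails because "junk" starts a statement that never reaches its terminator, so the whole-string requirement falls out of the grammar itself.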

Making a Lua pattern case insensitive with LPeg

I have an app that (among other things) supports plain-text searches and searches using Lua patterns. As a convenience, the app supports case-insensitive searches. Here is an image snippet:
The code that transforms the given Lua pattern into a case-insensitive Lua pattern isn't too pretty. It basically worries about whether or not a character is preceded by an odd or even number of escapes (%) and whether or not it is located inside of square brackets. The pattern shown in the image becomes %a[bB][bB]%%[cC][%abB%%cC]
I haven't had a chance to learn LPeg yet, and I suppose this could be my motivator.
My question is whether this is something that LPeg could have handled easily?
Yes, but for an easier entry into the LPeg world, consider LPeg's "re" module, which gives you a regex-like syntax in which you can specify a set of rules, as in a grammar (think Yacc, etc.). You'd basically write rules for escaped characters, bracket groups and regular characters. Then you could associate functions with the rules that emit either the same text they consumed from the input or the case-insensitive version of it.
The structure of your rules would take care of the even-odd distinction automatically, bracket context, etc. LPeg uses "ordered choice", so if you add your escape rule first, it will handle %[ correctly and avoid mixing it up with the brackets rule, for example.
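The rule structure can be illustrated without LPeg. The Python sketch below is not LPeg or "re" module code; it only mimics ordered choice by trying the escape rule first on every step. The function name is hypothetical, and the input pattern is a guess at what would produce the output quoted in the question, since the original image is not available. It ignores ranges such as [a-z] and multi-character items such as %b(), which a real solution would need rules for:

```python
def case_insensitive_lua_pattern(pattern: str) -> str:
    out = []
    in_set = False  # are we inside a [...] set?
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%" and i + 1 < len(pattern):
            # Rule 1 (tried first): copy an escape verbatim, so "%a" stays
            # a character class and "%[" never opens a set.
            out.append(pattern[i:i + 2])
            i += 2
            continue
        if ch == "[" and not in_set:
            # Rule 2: opening bracket of a set.
            in_set = True
            out.append(ch)
        elif ch == "]" and in_set:
            in_set = False
            out.append(ch)
        elif ch.isascii() and ch.isalpha():
            # Rule 3: a plain letter gets both cases; inside a set the
            # surrounding brackets already exist.
            both = ch.lower() + ch.upper()
            out.append(both if in_set else "[" + both + "]")
        else:
            out.append(ch)
        i += 1
    return "".join(out)

print(case_insensitive_lua_pattern("%abb%%c[%ab%%c]"))
# %a[bB][bB]%%[cC][%abB%%cC]
```

Because the escape rule consumes two characters at a time, the odd/even counting of % signs never has to be done explicitly.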

What is the proper Lua pattern for quoted text?

I've been playing with this for an hour or two and have found myself at a roadblock with the Lua pattern matching utilities. I am attempting to match all quoted text in a string and replace it if needed.
The pattern I have come up with so far is: (\?[\"\'])(.-)%1
This works in some cases, but not all:
Working: "This \"is a\" string of \"text to\" test with"
Not Working: "T\\\"his \"is\' a\" string\" of\' text\" to \"test\" wit\\\"h"
In the not working example I would like it to match to (I made a function that gets the matches I desire, I'm just looking for a pattern to use with gsub and curious if a lua pattern can do this):
string
a" string" of
is' a" string" of' text
test
his "is' a" string" of' text" to "test" wit
I'm going to continue to use my function for the time being, but I'm curious whether there is a pattern I could/should be using and I'm just missing something with patterns.
(a few edits b/c I forgot about Stack Overflow's formatting)
(another edit to make a non-html example since it was leading to assumptions that I was attempting to parse html)
Trying to match escaped, quoted text using regular expressions is like trying to remove the daisies (and only the daisies) from a field using a lawnmower.
I made a function that gets the matches I desire
This is the correct move.
I'm curious if a lua pattern can do this
From a practical point of view, even if a pattern can do this, you don't want to. From a theoretical point of view, you are trying to find a double quote that is preceded by an even number of backslashes. This is definitely a regular language, and the regular expression you want would be something like the following (Lua quoting conventions)
[[[^\](\\)*"(.-[^\](\\)*)"]]
And the quoted string would be result #2. But Lua patterns are not full regular expressions; in particular, you cannot put a * after a parenthesized pattern.
So my guess is that this problem cannot be solved using Lua patterns, but since Lua patterns are not a standard thing in automata theory, I'm not aware of any body of proof technique that you could use to prove it.
The issue with escaped quotes is that, in general, if there's an odd number of backslashes before the quote, then it's escaped, and if there's an even number, it's not. I do not believe that Lua pattern-matching is powerful enough to represent this condition, so if you need to parse text like this, then you should seek another way. Perhaps you can iterate through the string and parse it, or you could find each quote in turn and read backwards, counting the backslashes until you find a non-backslash character (or the beginning of the string).
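The "find each quote and read backwards" idea might look like this. It is a Python sketch rather than Lua, and the function names are made up for illustration:

```python
from typing import List

def is_escaped(text: str, index: int) -> bool:
    """True if text[index] is preceded by an odd number of backslashes."""
    count = 0
    j = index - 1
    while j >= 0 and text[j] == "\\":
        count += 1
        j -= 1
    return count % 2 == 1

def unescaped_quote_positions(text: str, quote: str = '"') -> List[int]:
    """Positions of every quote character that is not escaped."""
    return [i for i, ch in enumerate(text)
            if ch == quote and not is_escaped(text, i)]

# The sample text is:  a \" b " c \\" d
sample = 'a \\" b " c \\\\" d'
print(unescaped_quote_positions(sample))  # [7, 13]
```

The quote at index 3 follows one backslash and is skipped; the quote at index 13 follows two backslashes, so the backslashes escape each other and the quote counts as real.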
If you absolutely must use patterns for some reason, you could try doing this in a multi-step process. First, gsub for all occurrences of two backslashes in a row, and replace them with some sentinel value. This must be a value that does not already occur in the string. You could try something like "\001" if you know this string doesn't contain non-printable characters. Anyway, once you've replaced all sequences of two backslashes in a row, any backslashes left are escaping the following character. Now you can apply your original pattern, and then finally you can replace all instances of your sentinel value with two backslashes again.
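Here is a sketch of that multi-step process in Python, using the re module in place of Lua patterns (an assumption, since Lua's pattern syntax differs) and handling only double quotes. The function name and the sentinel constant are hypothetical:

```python
import re

SENTINEL = "\x01"  # assumed not to occur in the input

def quoted_strings(text: str):
    if SENTINEL in text:
        raise ValueError("sentinel already present in input")
    # Step 1: hide every pair of backslashes behind the sentinel.
    masked = text.replace("\\\\", SENTINEL)
    # Step 2: any backslash still present escapes the next character,
    # so a real quote is simply one not preceded by a backslash.
    matches = re.findall(r'(?<!\\)"(.*?)(?<!\\)"', masked, flags=re.DOTALL)
    # Step 3: put the double backslashes back.
    return [m.replace(SENTINEL, "\\\\") for m in matches]

# Input text:  a "b \" c" d      (the inner quote is escaped)
first = quoted_strings('a "b \\" c" d')
# Input text:  x \\"y" z         (the backslashes escape each other)
second = quoted_strings('x \\\\"y" z')
```

Masking shortens the string, so this version returns the matched text rather than positions in the original input.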
Lua's pattern language is adequate for many simple cases. And it has at least one trick you don't find in a typical regular expression package: a way to match balanced parentheses. But it has its limits as well.
When those limits are exceeded, I reach for LPeg. LPeg is an implementation of Parsing Expression Grammars (PEGs) for Lua, written by one of Lua's original authors, so it fits Lua well. A PEG can specify anything from simple patterns to complete language grammars. LPeg compiles the grammar to bytecode and executes it very efficiently.
You should NOT be trying to parse HTML with regular expressions. HTML and XML are NOT regular languages and cannot be reliably parsed with regular expressions. You should use a dedicated HTML parser. Here are lots of explanations why.
