How to replace some characters of the input file before it gets lexed in flex? - flex-lexer

How can I replace all occurrences of some character or character sequence with another character or character sequence before flex lexes the input? For example, I want B\65R to match the identifier rule, because it is equivalent to BAR in my grammar. So, essentially, I want to turn each \dd sequence into its equivalent ASCII character and then lex it (\65 -> A, \66 -> B, …).
I know I can first search the entire file for \dd sequences, replace each with the equivalent character, and then feed the result to flex. But I wonder whether there is a better way: something like a rule that matches \dd and replaces it with the corresponding character in the input stream, so that I don't have to process the entire file twice.

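There are several options. One of them (untested) is to have flex read from a filter that substitutes "chr(dd)" for each "\dd".
You could run something along the lines of the following. The backslashes are doubled for the C string literal, so perl sees s/\\(\d\d)/chr($1)/e. With no file argument perl filters the program's standard input; append a file name to the command to read a file instead.
yyin = popen("perl -pe 's/\\\\(\\d\\d)/chr($1)/e'", "r");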
yylex()....
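For readers who want to check what the filter does before wiring it into popen, here is a minimal Python sketch of the same substitution. The function name decode_decimal_escapes is hypothetical, and the sketch assumes, as the question does, that an escape is a backslash followed by exactly two decimal digits.

```python
import re

def decode_decimal_escapes(text: str) -> str:
    """Replace each backslash followed by exactly two digits with chr(dd)."""
    # Assumption: escapes are always two decimal digits, as in the question.
    return re.sub(r"\\(\d\d)", lambda m: chr(int(m.group(1))), text)

print(decode_decimal_escapes(r"B\65R"))  # BAR
```

Because the pattern takes only two digits, an input such as \100 would be decoded as chr(10) followed by a literal 0; a grammar with longer escapes would need a different pattern.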

Related

How to recognise single new line tokens in a flex/bison based parser and ignore multiple new lines?

I want my bison-based parser to recognise a single newline token ('\n') but ignore runs of multiple newlines, so that newlines play no role in the overall grammar except where I want exactly one newline after a pattern. For example, require a newline after a definition but then ignore any further newlines.
So far my lexer just has a rule of the form [\n] { }, which ignores newlines. To recognise single newline tokens I tried [\n{1}] {return '\n';}, but it doesn't seem to work as intended.
Any help is appreciated.
The first problem is that [\n{1}] doesn't do what you think. It is a character class, so it means: "recognize one character that is a newline, an opening curly bracket, the digit 1, or a closing curly bracket".
To solve this, it helps to understand how Flex chooses between rules:
The pattern that gives the longest match wins.
If two patterns match the same length, the pattern listed first in the file wins.
Try the following:
[\n] {return '\n';}
[\n]+ {}
A single newline matches both rules with the same length, so the first rule is used (it returns the token). For a run of two or more newlines, the second rule gives the longer match, so it wins and the run is ignored.
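The two points above can be checked with a small Python sketch. It is not flex: it only imitates the "longest match, then earliest rule" selection with Python's re module, whose character-class syntax agrees with flex for these particular patterns. The RULES table, the tokenize function and the TEXT rule are hypothetical names added for illustration.

```python
import re

# [\n{1}] is a character class: it matches exactly one of these four characters.
char_class = re.compile(r"[\n{1}]")
print([bool(char_class.fullmatch(c)) for c in ["\n", "{", "1", "}"]])
print(char_class.fullmatch("\n\n"))  # None: a class never matches two characters

# Rules in file order; None means "match but return no token".
RULES = [
    (re.compile(r"\n"), "NEWLINE"),   # [\n]  {return '\n';}
    (re.compile(r"\n+"), None),       # [\n]+ {}
    (re.compile(r"[^\n]+"), "TEXT"),  # hypothetical rule for everything else
]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best_len, best_tok = 0, None
        for pattern, tok in RULES:
            m = pattern.match(text, pos)
            # Strictly longer only, so on a tie the earlier rule is kept.
            if m and m.end() - pos > best_len:
                best_len, best_tok = m.end() - pos, tok
        if best_len == 0:
            raise ValueError(f"no rule matches at offset {pos}")
        if best_tok is not None:
            tokens.append(best_tok)
        pos += best_len
    return tokens

print(tokenize("a\nb\n\n\nc"))  # ['TEXT', 'NEWLINE', 'TEXT', 'TEXT']
```

The single newline after "a" produces a NEWLINE token, while the run of three newlines after "b" is consumed by the second rule and produces nothing.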

How to use grep to search for strings with (exclusively) a finite set of characters

I have a plain text file with one string per line. I'd like to identify any instances where a string contains a value outside of a restricted character set. In this particular instance, if the string contains any character outside of the set "[THADGRC.SMBN-WVKY]" I want to retain it and pass it along to a new file.
For example, let's say the original file "mystrings.txt" contained the following data:
THADGRC.SMBN-WVKY
YKVW-NBMS.CRGDHAT
THADGRC.SMBN-WVKYI
My intention is to retain only the third sequence, because it contains a character outside of the allowed set (I, in this case).
It doesn't matter how many times, or in what order, an allowed character is present - all I care about is if a character exists in that string outside of the allowed set.
Originally I tried:
cat mystrings.txt | grep -v [THADGRC\.SMBN-WVKY] > badstrings.txt
but of course the third string contains allowed characters in addition to the non-allowed character, so this search ended up producing no "offending" strings.
Last thing: I'm not sure what characters outside of the allowed set might exist in this text file. It would be great to know ahead of time to just search for anything with an "I", but I don't actually know this ahead of time.
So the question: is there a way to use grep (or another tool, say awk?) to pass in a restricted list of characters, and flag any instances where a string contains any number of characters outside of that set?
Thanks for your consideration
I think that your problem is N-W. This doesn't match "N", "-" and "W", it matches a range from "N" to "W". You should move "-" to the end of the character class, or escape it. I suggest changing to:
grep '[^THADGRC.SMBNWVKY-]' mystrings.txt
Also, note that "." doesn't have to be escaped when it's inside a character class.
Your attempt says "remove any lines which contain one of these characters at least once". But you want "print any lines which contain at least one character not in this set."
(Also, quote your regular expressions so the shell does not interpret them, and lose the useless cat.)
grep '[^-THADGRC.SMBNWVKY]' mystrings.txt > badstrings.txt
I moved the dash to the beginning of the character class on the assumption that you want a literal dash, not the regex range N-W (i.e. N, O, P, Q, R, S, T, U, V, W).
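The difference between a literal dash and the N-W range can be seen in a short Python sketch. It uses Python's re module rather than grep, on the assumption that bracket expressions behave the same way for these two patterns; the variable names are hypothetical.

```python
import re

lines = [
    "THADGRC.SMBN-WVKY",
    "YKVW-NBMS.CRGDHAT",
    "THADGRC.SMBN-WVKYI",
]

# Dash first: it is a literal member of the set.
literal_dash = re.compile(r"[^-THADGRC.SMBNWVKY]")
# Dash between N and W: it is a range N..W, and "-" itself is NOT in the set.
range_n_to_w = re.compile(r"[^THADGRC.SMBN-WVKY]")

# Keep lines that contain at least one character outside the allowed set.
print([l for l in lines if literal_dash.search(l)])
# ['THADGRC.SMBN-WVKYI']

# With the range, every line is flagged because "-" counts as disallowed,
# while letters such as O or P would silently count as allowed.
print([l for l in lines if range_n_to_w.search(l)])
print(range_n_to_w.search("OPQ"))  # None
```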

FParsec - how to escape a separator

I'm working on an EDI file parser, and I'm having considerable difficulty implementing an escape for the 'segment terminator'. For anyone fortunate enough not to work with EDI, the segment terminator (usually an apostrophe) is the delimiter between segments, which are like cells.
The desired behaviour looks something like this:
ABC+123'DEF+567' -> ["ABC+123", "DEF+567"]
ABC+123?'DEF+567' -> ["ABC+123?'DEF+567"]
Using FParsec, without escaping the apostrophe (and, for simplicity, ignoring parameterisation), the parser looks something like this:
let pSegment = //logic to parse the contents of a segment
let pAllSegments = sepEndBy pSegment (str "'")
This approach with the above example would yield ["ABC+123?", "DEF+567"].
My next consideration was to use a regex:
let pAllSegments = sepEndBy pSegment (regex @"[^\?]'")
The problem here is that the character prior to the apostrophe is also consumed, leading to incomplete messages.
I'm fairly certain I just don't understand FParsec well enough here. Does anyone have any pointers?
The issue is in the parse contents step.
The parser is working 'bottom up'. It finds the contents of the segments, which are not permitted to contain the terminator, then finds that all these segments are separated by the terminator, and constructs the list.
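My error was in the pSegment step, which was using a parameterised version of (?:[A-Za-z0-9 \\.]|\?[\?\+:\?])*. Notice the second \? inside the last character class: that should have been a ', so that an escaped terminator (?') is accepted as segment content.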
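The idea in this answer, letting the segment content itself accept "escape character plus any character", can be illustrated outside FParsec with a small Python sketch. It is a simplification and not the author's parser: the names SEGMENT and split_segments are hypothetical, and the content is assumed to be "anything except ? and '" rather than the restricted character set in the regex above.

```python
import re

# A segment is a run of ordinary characters or escaped pairs (? followed by
# any character), ended by an unescaped apostrophe.
SEGMENT = re.compile(r"((?:[^?']|\?.)*)'")

def split_segments(message: str) -> list[str]:
    """Return segment contents, keeping escape sequences as written."""
    return SEGMENT.findall(message)

print(split_segments("ABC+123'DEF+567'"))   # ['ABC+123', 'DEF+567']
print(split_segments("ABC+123?'DEF+567'"))  # ["ABC+123?'DEF+567"]
```

Because the escaped pair ?' is consumed as content, the separator only ever sees unescaped apostrophes, which is the same division of labour as pSegment and sepEndBy.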

How can I combine words with numbers when pattern matching in Lua?

I'm trying to match any incoming strings that follow the format Word 100.00% ~(45.56, 34.76) in Lua. As such, I'm looking for a pattern close (in theory) to this:
%D%s[%d%.%d]%%(%d.%d, %d.%d)
But I'm having no luck so far. Lua's patterns are weird.
What am I missing?
Your pattern is close, but you neglected to allow for multiple digits. You can do this by adding a + after the class, as in %d+.
You also did not use [, ( and . correctly in the pattern.
[ starts a set of characters to match: for example, [abc] matches a single a, b, or c at that position.
( starts a capture, which returns the specific values you want rather than the whole string when there is a match. To match a literal ( or ), you need to escape it with a %.
. matches any character rather than specifically a dot. To match a literal ., escape it as %. instead.
local str = "Word 100.00% ~(45.56, 34.76)"
local pattern = "%w+%s%d+%.%d+%%%s~%(%d+%.%d+, %d+%.%d+%)"
print(string.match(str, pattern))
This prints the input string if it matches the pattern; otherwise it prints nil.
Suggested resource: Understanding Lua Patterns

Regular expression in Ruby

Could anybody help me write a proper regular expression to extract a title from a bunch of text in Ruby? I tried a lot, but I don't know how to handle variable-length titles.
The string will be of format <sometext>title:"<actual_title>"<sometext>. I want to extract actual_title from this string.
I tried /title:"."/ but it doesn't find any matches, because it expects the closing quotation mark exactly one character after the opening one. I couldn't figure out how to make it accept a variable-length string. Any help is appreciated. Thanks.
. matches any single character. Putting + after an element matches one or more repetitions of it, so .+ matches one or more characters of any sort. You should also put a question mark after the +, which makes it lazy so that it stops at the first closing quotation mark it comes across. So:
/title:"(.+?)"/
The parentheses are necessary if you want to extract the title text that it matched out of there.
/title:"([^"]*)"/
The parentheses create a capturing group. Inside is first a character class. The ^ means it's negated, so it matches any character that's not a ". The * means 0 or more. You can change it to one or more by using + instead of *.
I like /title:"(.+?)"/ because of its use of lazy matching, which stops the .+ from consuming all text up to the last " on the line.
It won't work if the string wraps across lines or includes escaped quotes.
In programming languages where you want to be able to include the string delimiter inside a string, you usually provide an 'escape' character or sequence.
If your escape character was \ then you could write something like this...
/title:"((?:\\"|[^"])+)"/
This regex can be drawn as a railroad diagram. Railroad diagrams show you the order in which things are parsed: imagine you are a train starting at the left. You consume title:" and then \" if you can; if you can't, you consume one character that is not a ". The > means this path is preferred, so you try to loop; if you can't, you have to consume a " to finish.
I made this with https://regexper.com/#%2Ftitle%3A%22((%3F%3A%5C%5C%22%7C%5B%5E%22%5D)%2B)%22%2F
but there is now a plugin for Atom text editor too that does this.
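The behaviour of the patterns in these answers can be compared side by side. The sketch below uses Python's re module rather than Ruby, on the assumption that these particular patterns mean the same thing in both engines; the sample strings are made up for illustration.

```python
import re

simple = 'x title:"Hello World" y "z"'

# Greedy .+ runs to the LAST quote on the line; lazy .+? stops at the first.
print(re.search(r'title:"(.+)"', simple).group(1))     # Hello World" y "z
print(re.search(r'title:"(.+?)"', simple).group(1))    # Hello World
print(re.search(r'title:"([^"]*)"', simple).group(1))  # Hello World

# A title that contains backslash-escaped quotes.
escaped = r'id:7 title:"A \"quoted\" word" lang:"en"'

# Both simple patterns stop at the first quote, even though it is escaped.
print(re.search(r'title:"(.+?)"', escaped).group(1))    # A \
print(re.search(r'title:"([^"]*)"', escaped).group(1))  # A \

# Trying \" before [^"] lets escaped quotes stay inside the title.
print(re.search(r'title:"((?:\\"|[^"])+)"', escaped).group(1))
# A \"quoted\" word
```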

Resources