Pattern match dropping newline characters - lua

How can I extract the values from a CSV-like string with a pattern while dropping the newline characters (\r\n or \n)?
A line looks like:
1.1;2.2;Example, 3
Notice that there are only 3 values and the separator is ;. The problem is coming up with a pattern that reads the values while dropping the newline characters. The file comes from a Windows machine, so it has \r\n, but I am reading it on Linux and would like to be independent of the newline convention used.
My simple example right now is:
s = "1.1;2.2;Example, 3\r\n";
p = "(.-);(.-);(.-)";
a, b, c = string.match(s, p);
print(c:byte(1, -1));
The last two characters printed by the code above are the \r\n.
The problem is that both \r and \n are matched by the %c and %s classes (control characters and space characters), as shown by this code:
s = "a\r";
print(s:match("%c"));
print(s:match("%s"));
print(s:match("%d"));
So, is it possible to leave the newline characters out of the match? (It should not be assumed that the last two characters will always be newline characters.)
The third value may contain spaces, punctuation and alphanumeric characters, and since \r\n are detected as space characters, a pattern like "(.-);(.-);([%w%s%c]-).*" does not work.

Your pattern
p = "(.-);(.-);(.-)";
does not work: the third field is always empty because .- matches as little as possible. You need to anchor it at the end of the string, but then the third field will contain the trailing newline chars:
p = "(.-);(.-);(.-)$";
So, just stop at the first trailing newline char. This also anchors the last match. Try this pattern instead:
p = "(.-);(.-);(.-)[\r\n]";
If trailing newline chars are optional, try this pattern:
p = "(.-);(.-);(.-)[\r\n]*$";

Without any Lua experience, I found a naive solution:
clean_CR = s:gsub("\r","");
clean_NL = clean_CR:gsub("\n","");
With POSIX regex syntax I'd use
^([^;]*);([^;]*);([^\n\r]*).*$
.. with "\n" and "\r" possibly shown as control characters such as "^J" and "^M", depending on your editor.

Related

lua match repeating pattern

I need some way to group a pattern in Lua pattern matching so that I can find a whole sequence of repetitions of it in a string. Here is what I mean.
For example, we have a string like this:
"word1,word2,word3,,word4,word5,word6, word7,"
I need to match the first sequence of words, each followed by a comma (word1,word2,word3,).
In Python I would use the pattern "(\w+,)+", but a similar pattern in Lua (like (%w+,)+) just returns nil, because parentheses in Lua patterns only define captures and cannot be followed by a quantifier.
I hope you can see my problem now.
Is there a way to do repeating patterns in Lua?
Your example isn't too clear about what should happen to word4,word5,word6 and word7,
This gives you every sequence of comma-separated words without whitespace or empty positions.
local text = "word1,word2,word3,,word4,word5,word6, word7,"
-- replace any comma followed by any white space or comma
--- by a comma and a single white space
text = text:gsub(",[%s,]+", ", ")
-- then match any sequence of >=1 non-whitespace characters
for sequence in text:gmatch("%S+,") do
    print(sequence)
end
Prints
word1,word2,word3,
word4,word5,word6,
word7,
You could do this easily using LPeg if that's available to you:
local lpeg = require "lpeg"
local str = "word1,word2,word3,,word4,word5,word6, word7,"
local word = (lpeg.R"az"+lpeg.R"AZ"+lpeg.R"09") ^ 1
local sequence = lpeg.C((word * ",") ^1)
print(sequence:match(str))

Rails 5 - regex - for string not found [duplicate]

I have following regex handy to match all the lines containing console.log() or alert() function in any javascript file opened in the editor supporting PCRE.
^.*\b(console\.log|alert)\b.*$
But I encounter many files containing window.alert() lines for alerting important messages, I don't want to remove/replace them.
So the question is how to regex-match (with a single-line regex, so that it does not need to be run repeatedly) all the lines containing console.log() or alert() but not containing the word window. Also, how do I escape round brackets (parentheses), which seem to be unescapable by \, to make them part of a string literal?
I tried following regex but in vain:
^.*\b(console\.log|alert)((?!window).)*\b.*$
You should use a negative lookahead, like this:
^(?!.*window\.).*\b(console\.log|alert)\b.*$
The negative lookahead asserts that no match is possible if the string window. is present anywhere on the line.
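A minimal sketch of the lookahead at work, written in Ruby on the assumption that its regex engine treats (?!...) and \b the same way PCRE does for this pattern. The sample lines and the names SAMPLE_LINES, DEBUG_CALL and debug_lines are made up for illustration.

```ruby
# Hypothetical JavaScript source lines, one per array element.
SAMPLE_LINES = [
  'console.log("debug");',
  'alert("hi");',
  'window.alert("important");',
  'var x = 1;'
].freeze

# The pattern from the answer: refuse any line containing "window.",
# then require console.log or alert as a whole word.
DEBUG_CALL = /^(?!.*window\.).*\b(console\.log|alert)\b.*$/

# Returns only the lines the pattern matches.
def debug_lines(lines)
  lines.select { |line| line.match?(DEBUG_CALL) }
end

puts debug_lines(SAMPLE_LINES)
```

Only the first two sample lines are selected; the window.alert line is rejected by the lookahead before the rest of the pattern is tried.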
As for the parentheses, you can escape them with backslashes. However, a \b placed right after an escaped parenthesis will usually prevent a match: parentheses are not word characters, so there is a word boundary at that position only if the neighbouring character is a word character.
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.
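The effect of a \b after an escaped parenthesis can be checked with a short Ruby sketch. The constant names are hypothetical, and Ruby is used here only because its \b follows the word-boundary rules quoted above.

```ruby
# Same call pattern, with and without a \b after the closing parenthesis.
WITH_TRAILING_BOUNDARY = /\balert\(\)\b/
WITHOUT_TRAILING_BOUNDARY = /\balert\(\)/

# ")" followed by ";" is two non-word characters: no boundary, so no match.
semicolon_with = 'alert();'.match?(WITH_TRAILING_BOUNDARY)
# Dropping the trailing \b lets the same text match.
semicolon_without = 'alert();'.match?(WITHOUT_TRAILING_BOUNDARY)
# ")" followed by a word character does form a boundary.
word_char_with = 'alert()x'.match?(WITH_TRAILING_BOUNDARY)

p [semicolon_with, semicolon_without, word_char_with]
```

In practice this means dropping the \b on the side of the pattern that ends in a parenthesis.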

Lua Pattern Matching, get character before match

Currently I have code that looks like this:
somestring = "param=valueZ&456"
local stringToPrint = (somestring):gsub("(param=)[^&]+", "%1hello", 1)
stringToPrint will look like this:
param=hello&456
I have replaced all of the characters before the & with the string "hello". This is where my question becomes a little strange and specific.
I want my string to appear as: param=helloZ&456. In other words, I want to preserve the character right before the & when replacing the string valueZ with hello to make it helloZ instead. How can this be done?
I suggest:
somestring:gsub("param=[^&]*([^&])", "param=hello%1", 1)
Here, the pattern matches:
param= - literal substring param=
[^&]* - 0 or more chars other than & as many as possible
([^&]) - Group 1 capturing a symbol other than & (here, backtracking will occur, as the previous pattern grabs all such chars other than & and then the engine will take a step back and place the last char from that chunk into Group 1).
There are probably other ways to do this, but here is one:
somestring = "param=valueZ&456"
local stringToPrint = (somestring):gsub("(param=).-([^&]&)", "%1hello%2", 1)
print(stringToPrint)
The idea here is to match the shortest string that ends with a character that is not & followed by an &. Then I append those two ending characters to the replacement.

Removing lines that begin with > in a rails string

I'm trying to remove any lines that begin with the character '>' in a long string (i.e. replies to an email).
In PHP I'd iterate over each line with an if statement, in linux I'd try and use sed or awk.
What's the most elegant rails approach?
You can try this:
your_string.gsub(/^\>.+\n/,'')
Your question implies that the input is one string containing multiple lines.
Do you want the output to be a single multi-line string as well? I'm assuming yes.
Either use String and Array operations:
str.lines.reject{|x| x =~ /^>/}.join # this will return a new string, without those ">" lines
or use regular expressions:
str.gsub(/^>.+\n*/, '')
Better Solution:
Use a non-greedy match together with the m flag:
str.gsub(/^>.*?$\n*/m, '') # by using gsub!() you can modify the string in place
^> matches the ">" character at the start of a line
.*?$ matches any characters after the ">" up to the end of the line (non-greedy)
\n* matches the newline that ends the line, plus any blank lines that follow it (drop the * if blank lines should be kept)
the "m" at the end of the regular expression makes . match newline characters as well; in Ruby, ^ and $ already match at the start and end of every line without it, and it is the non-greedy .*? that keeps each match on a single line.
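A quick way to see what the anchors and the m flag do in Ruby is to run a few matches against a small sample. This is only a sketch; the TEXT constant and the variable names are made up for illustration.

```ruby
TEXT = "keep\n> drop me\nkeep too"

# ^ matches at the start of every line, with or without the m flag.
line_starts = TEXT.scan(/^>/)

# Without m, "." stops at the newline; with m, a greedy ".*" runs on
# to the end of the whole string.
greedy_without_m = TEXT[/^>.*/]
greedy_with_m = TEXT[/^>.*/m]

# With m, the non-greedy ".*?" followed by $ stops at the first line end.
lazy_with_m = TEXT[/^>.*?$/m]

p line_starts, greedy_without_m, greedy_with_m, lazy_with_m
```

Since "." already stops at a newline when the flag is absent, a pattern like /^>.*\n?/ without m is another option for the same job.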
It should work as you expect:
your_string.lines.to_a.reject{|line| line[0] == '>'}.join
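The line-based and regex-based approaches can be compared side by side. The sketch below uses a made-up EMAIL sample and two hypothetical helpers; the regex variant is a slight variation on the ones suggested above (it uses \n? so that blank lines after a quoted line are kept).

```ruby
# Hypothetical reply text: quoted lines, a bare ">", a blank line,
# and a final quoted line with no trailing newline.
EMAIL = "Thanks!\n> quoted line\n>\n\nSee you.\n> last quoted"

# Line-based: split into lines (newlines kept), drop the quoted ones, rejoin.
def strip_quoted_lines(text)
  text.lines.reject { |line| line.start_with?('>') }.join
end

# Regex-based: remove each line starting with ">" and its own newline, if any.
def strip_quoted_regex(text)
  text.gsub(/^>.*\n?/, '')
end

puts strip_quoted_lines(EMAIL)
puts strip_quoted_regex(EMAIL)
```

Both helpers handle a line that is just ">" and a quoted last line without a newline, which patterns requiring .+ or a mandatory \n would leave behind.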

What does this do?

gsub(/^/, "\t" * num)
What character is being substituted?
No character is being substituted; it just inserts num tabs at the beginning, so you could say that it substitutes the zero-width "beginning of line" marker. Whoever wrote that would have been better off with something more like this:
tabbed = "\t" * num + original
A regular expression really isn't the right tool for simple string concatenation.
Clarification: If you're expecting your string to contain multiple lines then using:
gsub(/^/, "\t" * num)
to prefix all the lines with tabs is a reasonable thing to do and less noisy than splitting, prefixing, and re-joining. If you're only expecting to deal with a single line in your string then simple string concatenation would be the better choice.
^ means "start of line" in regex syntax, so this will insert num tab characters at the beginning of every line. Technically you could say that it replaces the empty string at the start of every line.
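The difference between the two cases is easy to check. The indent helper below is a hypothetical wrapper around the gsub call from the question, shown as a sketch.

```ruby
# Prefix every line of text with num tab characters.
def indent(text, num)
  text.gsub(/^/, "\t" * num)
end

single = indent("x", 2)        # same as "\t" * 2 + "x"
multi = indent("a\nb", 1)      # each line gets its own tab

p single, multi
```

For a single-line string the result equals plain concatenation; for a multi-line string, only the gsub version indents the lines after the first.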

Resources