Lua split string using specific pattern - lua

I need to split each row of an input file using the specific pattern " - ". I'm not far from a solution, but my code also splits on single spaces. Each row of the file is formatted as follows:
NAME - ID - USERNAME - GROUP NAME - GROUP ID - TIMESTAMP
The name field may contain spaces, as may the group name and the timestamp. For example, a row like this
LUCKY STRIKE - 11223344 - #lucky - CIGARETTES SMOKERS - 44332211 - 11:42 may/5th
is valid.
So these tokenized values should be stored inside a table.
Here is my code:
local function splitstring(inputstr)
  sep = "(%s-%s)"
  local t={} ; i=1
  for str in string.gmatch(inputstr, "([^"..sep.."]+)") do
    t[i] = str
    i = i + 1
  end
  print("=========="..t[1].."===========")
  print("=========="..t[2].."===========")
  print("=========="..t[3].."===========")
  return t
end
When I run it, it puts "lucky" in the first field, "strike" in the second field, and the ID in the third field.
Is there a way to store "lucky strike" in the first field, splitting ONLY on the specified pattern?
Hope you guys can help me.
P.S. I have already read the Lua manual, but it didn't help me much...

Here is another take:
s="LUCKY STRIKE - 11223344 - #lucky - CIGARETTES SMOKERS - 44332211 - 11:42 may/5th"
s=s.." - "
for v in s:gmatch("(.-)%s+%-%s+") do
  print("["..v.."]")
end
The pattern reflects the definition of the field: everything up to a - surrounded by spaces. Here "everything" is implemented using the non-greedy pattern .- (a dot followed by a minus). To make this work uniformly, we add the separator to the end as well. Many pattern matching problems that use separators can benefit from this uniformity.
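To collect the fields into a table, as the question asks, the loop above could be wrapped in a function. This is only a minimal sketch; the name split_fields is made up for illustration and is not part of the original answer.

```lua
-- split_fields is a hypothetical helper name used for illustration.
local function split_fields(line)
  local fields = {}
  -- Append one separator so the last field is terminated like the others.
  for field in (line .. " - "):gmatch("(.-)%s+%-%s+") do
    fields[#fields + 1] = field
  end
  return fields
end

local row = "LUCKY STRIKE - 11223344 - #lucky - CIGARETTES SMOKERS - 44332211 - 11:42 may/5th"
local t = split_fields(row)
print(#t, t[1], t[6])
```

Because the separator requires whitespace on both sides of the -, a hyphen inside a field (with no surrounding spaces) does not split it.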

There are a few things wrong with what you have.
Firstly, - is a repetition symbol in Lua patterns:
http://www.lua.org/manual/5.2/manual.html#6.4.1
You need to use %- to get a literal -.
We're not done: The resulting gmatch call is string.gmatch(inputstr, "[^%s%-%s]+"). Since your separator pattern is inside [], it's a character class. It says "Give me all the things that aren't a space or a -, and be as greedy as you can", which is why it stops at the first space character.
Your best bet is to do something like:
local function splitstring(inputstr)
  local sep = "%-"   -- %- is a literal minus sign
  local t, i = {}, 1 -- keep both variables local to the function
  for str in string.gmatch(inputstr, "[^"..sep.."]+") do
    t[i] = str
    i = i + 1
  end
  print("=========="..t[1].."===========")
  print("=========="..t[2].."===========")
  print("=========="..t[3].."===========")
  return t
end
Which yields:
==========LUCKY STRIKE ===========
========== 11223344 ===========
========== #lucky ===========
... And now independently fix the problem of the spaces around the values.
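One way to handle the remaining spaces is to trim each piece as it is stored. The sketch below assumes that trimming leading and trailing whitespace is all that is needed; split_on_dash is a hypothetical name.

```lua
-- split_on_dash is a hypothetical helper name used for illustration.
local function split_on_dash(line)
  local fields = {}
  for raw in line:gmatch("[^%-]+") do
    -- Strip leading and trailing whitespace from each piece.
    fields[#fields + 1] = raw:match("^%s*(.-)%s*$")
  end
  return fields
end

local t = split_on_dash("LUCKY STRIKE - 11223344 - #lucky")
print("[" .. t[1] .. "]", "[" .. t[2] .. "]", "[" .. t[3] .. "]")
```

Note that this version splits on every -, so a field that itself contains a hyphen would be cut in two.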

Related

Lua Pattern Matching, get character before match

Currently I have code that looks like this:
somestring = "param=valueZ&456"
local stringToPrint = (somestring):gsub("(param=)[^&]+", "%1hello", 1)
stringToPrint will look like this:
param=hello&456
I have replaced all of the characters before the & with the string "hello". This is where my question becomes a little strange and specific.
I want my string to appear as: param=helloZ&456. In other words, I want to preserve the character right before the & when replacing the string valueZ with hello to make it helloZ instead. How can this be done?
I suggest:
somestring:gsub("param=[^&]*([^&])", "param=hello%1", 1)
Here, the pattern matches:
param= - literal substring param=
[^&]* - 0 or more chars other than & as many as possible
([^&]) - Group 1 capturing a symbol other than & (here, backtracking will occur, as the previous pattern grabs all such chars other than & and then the engine will take a step back and place the last char from that chunk into Group 1).
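The breakdown above can be checked with a short run. This is just a sketch using the sample string from the question; the variable names are arbitrary.

```lua
local somestring = "param=valueZ&456"
-- gsub returns the new string and the number of replacements made.
local result, count = somestring:gsub("param=[^&]*([^&])", "param=hello%1", 1)
print(result, count)
```

Here [^&]* first takes "valueZ", then gives back the "Z" so that ([^&]) has something to capture.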
There are probably other ways to do this, but here is one:
somestring = "param=valueZ&456"
local stringToPrint = (somestring):gsub("(param=).-([^&]&)", "%1hello%2", 1)
print(stringToPrint)
The thing here is that I match the shortest string that ends with a character that is not & and a character that is &. Then I add the two ending characters to the replaced part.
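The two patterns behave the same on the sample input but differ when no & follows the value, because the second one requires a literal &. A small sketch of that difference, with made-up variable names:

```lua
local greedy = "param=[^&]*([^&])"   -- first suggestion
local lazy   = "(param=).-([^&]&)"   -- second suggestion

local with_amp = "param=valueZ&456"
local no_amp   = "param=valueZ"

local a = with_amp:gsub(greedy, "param=hello%1", 1)
local b = with_amp:gsub(lazy, "%1hello%2", 1)
local c = no_amp:gsub(greedy, "param=hello%1", 1)
local d = no_amp:gsub(lazy, "%1hello%2", 1)
print(a, b, c, d)
```

Which behaviour is preferable depends on whether param can be the last item in the string.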

Lua Pattern Matching issue

I'm trying to parse a text file using lua and store the results in two arrays. I thought my pattern would be correct, but this is the first time I've done anything of the sort.
fileio.lua:
questNames = {}
questLevels = {}
lineNumber = 1
file = io.open("results.txt", "w")
io.input(file)
for line in io.lines("questlist.txt") do
  questNames[lineNumber], questLevels[lineNumber]= string.match(line, "(%a+)(%d+)")
  lineNumber = lineNumber + 1
end
for i=1,lineNumber do
  if (questNames[i] ~= nil and questLevels[i] ~= nil) then
    file:write(questNames[i])
    file:write(" ")
    file:write(questLevels[i])
    file:write("\n")
  end
end
io.close(file)
Here's a small snippet of questlist.txt:
If the dead could talk16
Forgotten soul16
The Toothmaul Ploy9
Well-Armed Savages9
And here's a matching snippet of results.txt:
talk 16
soul 16
Ploy 9
Savages 9
What I'm after in results.txt is:
If the dead could talk 16
Forgotten soul 16
The Toothmaul Ploy 9
Well-Armed Savages 9
So my question is, which pattern do I use in order to select all text up to a number?
Thanks for your time.
%a matches letters. It does not match spaces.
If you want to match everything up to a sequence of digits you want (.-)(%d+).
If you want to match a leading sequence of non-digits then you want ([^%d]+)(%d+).
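As an illustration, here is the first pattern applied to the sample lines from the question. The lines are held in a table instead of being read from questlist.txt, purely to keep the sketch self-contained.

```lua
local lines = {
  "If the dead could talk16",
  "Forgotten soul16",
  "The Toothmaul Ploy9",
  "Well-Armed Savages9",
}

local questNames, questLevels = {}, {}
for i, line in ipairs(lines) do
  -- (.-) lazily takes everything up to the first run of digits.
  questNames[i], questLevels[i] = line:match("(.-)(%d+)")
  print(questNames[i] .. " " .. questLevels[i])
end
```

The captured level is still a string; tonumber can convert it if a number is needed.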
That being said, if all you want to do is insert a space before a sequence of digits, you can just use line:gsub("%d+", " %0", 1). The final 1 limits the replacement to the first match; leave it off to do it for every match on the line.
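A quick sketch of the gsub variant, with and without the limit. The second input string is invented to show the difference.

```lua
-- %0 in the replacement stands for the whole match.
local first_only = ("The Toothmaul Ploy9"):gsub("%d+", " %0", 1)
local limited    = ("a1b22"):gsub("%d+", " %0", 1)
local every      = ("a1b22"):gsub("%d+", " %0")
print(first_only)
print(limited)
print(every)
```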
As an aside I don't think io.input(file) is doing anything useful for you (or what you might expect). It is replacing the default standard input file handle with the file handle file.

Break strings into substrings based on delimiters, with empty substrings

I am using Lua to create a table within a table and am running into an issue. I also need to populate the NIL values that appear, but cannot seem to get it right.
String being manipulated:
PatID = '07-26-27~L73F11341687Per^^^SCI^SP~N7N558300000Acc^'
for word in PatID:gmatch("[^\~w]+") do table.insert(PatIDTable,word) end
local _, PatIDCount = string.gsub(PatID,"~","")
PatIDTableB = {}
for i=1, PatIDCount+1 do
PatIDTableB[i] = {}
end
for j=1, #PatIDTable do
for word in PatIDTable[j]:gmatch("[^\^]+") do
table.insert(PatIDTableB[j], word)
end
end
This currently produces this output:
table
[1]=table
[1]='07-26-27'
[2]=table
[1]='L73F11341687Per'
[2]='SCI'
[3]='SP'
[3]=table
[1]='N7N558300000Acc'
But I need it to produce:
table
[1]=table
[1]='07-26-27'
[2]=table
[1]='L73F11341687Per'
[2]=''
[3]=''
[4]='SCI'
[5]='SP'
[3]=table
[1]='N7N558300000Acc'
[2]=''
EDIT:
I think I may have done a bad job explaining what I am looking for. It is not necessarily that I want the carets to be considered "NIL" or "empty", but rather that they signify that a new string is to be started.
They are, I guess for lack of a better explanation, position identifiers.
So, for example:
L73F11341687Per^^^SCI^SP
actually translates to:
1. L73F11341687Per
2.
3.
4. SCI
5. SP
If I were to have
L73F11341687Per^12ABC^^SCI^SP
Then the positions are:
1. L73F11341687Per
2. 12ABC
3.
4. SCI
5. SP
And in turn, the table would be:
table
[1]=table
[1]='07-26-27'
[2]=table
[1]='L73F11341687Per'
[2]='12ABC'
[3]=''
[4]='SCI'
[5]='SP'
[3]=table
[1]='N7N558300000Acc'
[2]=''
Hopefully this sheds a little more light on what I'm trying to do.
Now that we've cleared up what the question is about, here's the issue.
Your gmatch pattern will return all of the matching substrings in the given string. However, your gmatch pattern uses "+". That means "one or more", which therefore cannot match an empty string. If it encounters a ^ character, it just skips it.
But, if you just tried :gmatch("[^\^]*"), which allows empty matches, the problem is that it would effectively turn every ^ character into an empty match. Which is not what you want.
What you want is to eat the ^ at the end of a substring. But, if you try :gmatch("([^\^]*)\^"), you'll find that it won't return the last string. That's because the last string doesn't end with ^, so it isn't a valid match.
The closest you can get with gmatch is this pattern: "([^\^]*)\^?". This has the downside of putting an empty string at the end. However, you can just remove that easily enough, since one will always be placed there.
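Alternatively, append the delimiter to each string and capture non-greedily up to the next delimiter, which keeps the empty fields:
local s0 = '07-26-27~L73F11341687Per^^^SCI^SP~N7N558300000Acc^'
local tt = {}
for s1 in (s0..'~'):gmatch'(.-)~' do
  local t = {}
  for s2 in (s1..'^'):gmatch'(.-)^' do
    table.insert(t, s2)
  end
  table.insert(tt, t)
end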
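The same result can also be reached without patterns, by searching for the delimiter with a plain string.find. This is a sketch under the assumption that the delimiter is a non-empty literal string; split_keep_empty is a hypothetical name.

```lua
-- split_keep_empty is a hypothetical helper; delim must be a non-empty literal.
local function split_keep_empty(s, delim)
  local out, start = {}, 1
  while true do
    -- The fourth argument (true) turns off pattern matching.
    local i, j = s:find(delim, start, true)
    if not i then
      out[#out + 1] = s:sub(start)   -- remainder, possibly ""
      break
    end
    out[#out + 1] = s:sub(start, i - 1)
    start = j + 1
  end
  return out
end

local PatID = '07-26-27~L73F11341687Per^^^SCI^SP~N7N558300000Acc^'
local result = {}
for _, part in ipairs(split_keep_empty(PatID, "~")) do
  result[#result + 1] = split_keep_empty(part, "^")
end
print(#result, #result[2], #result[3])
```

With a plain find there is no need to escape ^ or any other pattern character in the delimiter.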

Split lua string into characters

I only found this related to what I am looking for: Split string by count of characters but it is not useful for what I mean.
I have a string variable that holds a 3-digit number (anywhere from 000 to 999). I need to separate each of the digits (characters) and put them into a table.
I am programming for a game mod which uses lua, and it has some extra functions. If you could help me to make it using: http://wiki.multitheftauto.com/wiki/Split would be amazing, but any other way is ok too.
Thanks in advance
Corrected to what the OP wanted to ask:
To just split a 3-digit number in 3 numbers, that's even easier:
s='429'
c1,c2,c3=s:match('(%d)(%d)(%d)')
t={tonumber(c1),tonumber(c2),tonumber(c3)}
The answer to "How do I split a long string composed of 3 digit numbers":
This is trivial. You might take a look at the gmatch function in the reference manual:
s="123456789"
res={}
for num in s:gmatch('%d%d%d') do
  res[#res+1]=tonumber(num)
end
or if you don't like looping:
res={}
s:gsub('%d%d%d',function(n)res[#res+1]=tonumber(n)end)
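For the more general case of splitting any string into single characters, the same gmatch loop works with the pattern ".". A minimal sketch, where chars is a made-up helper name and "characters" is assumed to mean bytes:

```lua
-- chars is a hypothetical helper; it splits by byte, not by UTF-8 character.
local function chars(s)
  local t = {}
  for c in s:gmatch(".") do
    t[#t + 1] = c
  end
  return t
end

local digits = chars("042")
print(digits[1], digits[2], digits[3])
```

The entries stay strings here, so a leading "0" is preserved; apply tonumber to each one if numbers are wanted.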
I was looking for something like this, but avoiding looping - and hopefully having it as one-liner. Eventually, I found this example from lua-users wiki: Split Join:
fields = {str:match((str:gsub("[^"..sep.."]*"..sep, "([^"..sep.."]*)"..sep)))}
... which is exactly the kind of syntax I'd like - one liner, returns a table - except, I don't really understand what is going on :/ Still, after some poking about, I managed to find the right syntax to split into characters with this idiom, which apparently is:
fields = { str:match( (str:gsub(".", "(.)")) ) }
I guess what happens is that gsub puts parentheses '(.)' around each character '.', so that match treats each one as a separate capture and "extracts" them as separate units as well... But I still don't get why there is an extra pair of parentheses around the str:gsub(".", "(.)") piece.
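On the extra parentheses: gsub returns two values, the resulting string and the number of substitutions, and wrapping a call in parentheses keeps only the first value. Without them the count would be passed to match as its third argument, the start position. The sketch below illustrates this; the variable names are arbitrary.

```lua
local str = "a - b"

-- gsub returns the new string and the number of substitutions.
local pattern, count = str:gsub(".", "(.)")

-- With the extra parentheses only the pattern reaches match.
local fields = { str:match((str:gsub(".", "(.)"))) }

-- Without them, count becomes match's start position, so the match fails here.
local broken = { str:match(str:gsub(".", "(.)")) }

print(pattern, count, #fields, broken[1])
```

Also worth keeping in mind: a stock Lua build allows at most 32 captures per pattern, so this idiom is only suitable for short strings.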
I tested this with Lua5.1:
str = "a - b - c"
fields = { str:match( (str:gsub(".", "(.)")) ) }
print(table_print(fields))
... where table_print is from lua-users wiki: Table Serialization; and this code prints:
"a"
" "
"-"
" "
"b"
" "
"-"
" "
"c"

Funny CSV format help

I've been given a large file with a funny CSV format to parse into a database.
The separator character is a semicolon (;). If one of the fields contains a semicolon it is "escaped" by wrapping it in doublequotes, like this ";".
I have been assured that there will never be two adjacent fields with trailing/ leading doublequotes, so this format should technically be ok.
Now, for parsing it in VBScript I was thinking of
Replacing each instance of ";" with a GUID,
Splitting the line into an array by semicolon,
Running back through the array, replacing the GUIDs with ";"
It seems to be the quickest way. Is there a better way? I guess I could use substrings but this method seems to be acceptable...
Your method sounds fine with the caveat that there's absolutely no possibility that your GUID will occur in the text itself.
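The question is about VBScript, but the three steps are language-neutral. Here is a rough sketch of them in Lua, using a single control character as the placeholder instead of a GUID; that character is assumed never to occur in the data, and split_with_placeholder is a made-up name.

```lua
-- Assumption: the byte "\1" never appears in the input.
local PLACEHOLDER = "\1"

-- split_with_placeholder is a hypothetical helper name.
local function split_with_placeholder(line)
  -- Step 1: hide every escaped semicolon.
  local protected = line:gsub('";"', PLACEHOLDER)
  local fields = {}
  -- Step 2: split on the remaining semicolons.
  for field in (protected .. ";"):gmatch("(.-);") do
    -- Step 3: put the semicolons back.
    fields[#fields + 1] = (field:gsub(PLACEHOLDER, ";"))
  end
  return fields
end

local f = split_with_placeholder('Pax;is;a;good;guy";" so;says;his;wife.')
print(#f, f[5])
```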
One approach I've used for this type of data before is to just split on the semicolons regardless and then, if two adjacent fields end and start with a quote, combine them.
For example:
Pax;is;a;good;guy";" so;says;his;wife.
becomes:
0 Pax
1 is
2 a
3 good
4 guy"
5 " so
6 says
7 his
8 wife.
Then, when you discover that fields 4 and 5 end and start (respectively) with a quote, you combine them by replacing the field 4 closing quote with a semicolon and removing the field 5 opening quote (and joining them of course).
0 Pax
1 is
2 a
3 good
4 guy; so
5 says
6 his
7 wife.
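The split-then-merge idea can be sketched in Lua as follows. It relies on the assurance from the question that genuinely separate fields never have touching trailing and leading quotes; split_and_merge is a hypothetical name.

```lua
-- split_and_merge is a hypothetical helper name.
local function split_and_merge(line)
  local fields = {}
  for piece in (line .. ";"):gmatch("(.-);") do
    local prev = fields[#fields]
    if prev and prev:sub(-1) == '"' and piece:sub(1, 1) == '"' then
      -- Drop the closing and opening quotes and restore the semicolon.
      fields[#fields] = prev:sub(1, -2) .. ";" .. piece:sub(2)
    else
      fields[#fields + 1] = piece
    end
  end
  return fields
end

local f = split_and_merge('Pax;is;a;good;guy";" so;says;his;wife.')
print(#f, f[5])
```

Merging as the pieces arrive also handles a field with several escaped semicolons, since the merged field is checked again against the next piece.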
In pseudo-code, given:
input: A string, first character is input[0]; last
character is input[length]. Further, assume one dummy
character, input[length+1]. It can be anything except
; and ". This string is one line of the "CSV" file.
length: positive integer, number of characters in input
Do this:
set start = 0
if input[0] = ';':
    you have a blank field in the beginning; do whatever with it
    set start = 1
endif
for each c between 1 and length:
    next iteration unless input[c] = ';'
    if input[c-1] ≠ '"' or input[c+1] ≠ '"': // test for escape sequence ";"
        found field consisting of half-open range [start,c); do whatever
        with it. Note that in the case of empty fields, start≥c, leaving
        an empty range
        set start = c+1
    endif
end foreach
the last field is the half-open range [start,length+1); do whatever with it
Untested, of course. Debugging code like this is always fun….
The special case of input[0] is to make sure we don't ever look at input[-1]. If you can make input[-1] safe, then you can get rid of that special case. You can also put a dummy character in input[0] and then start your data—and your parsing—from input[1].
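For comparison, here is a rough Lua rendering of the same scan. Lua strings are 1-based and string.sub returns "" for out-of-range positions, so neither the dummy character nor the special case for the first character is needed. The final unescaping step is an addition that the pseudo-code leaves to the caller, and scan_fields is a made-up name.

```lua
-- scan_fields is a hypothetical helper name.
local function scan_fields(line)
  local fields, start = {}, 1
  for c = 1, #line do
    if line:sub(c, c) == ";" then
      -- A semicolon with a quote on both sides is the escape sequence ";".
      local escaped = line:sub(c - 1, c - 1) == '"'
                  and line:sub(c + 1, c + 1) == '"'
      if not escaped then
        fields[#fields + 1] = line:sub(start, c - 1)
        start = c + 1
      end
    end
  end
  fields[#fields + 1] = line:sub(start)   -- last field
  -- Addition: turn each escaped ";" back into a bare semicolon.
  for i, f in ipairs(fields) do
    fields[i] = (f:gsub('";"', ";"))
  end
  return fields
end

local f = scan_fields('Pax;is;a;good;guy";" so;says;his;wife.')
print(#f, f[5])
```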
One option would be to find instances of the regex:
[^"];[^"]
and then break the string apart with substring:
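List<string> ret = new List<string>();
Regex r = new Regex(@"[^""];[^""]");
Match m;
while((m = r.Match(line)).Success)
{
    // The match starts one character before the semicolon.
    ret.Add(line.Substring(0, m.Index + 1));
    line = line.Substring(m.Index + 2);
}
ret.Add(line); // whatever remains is the last field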
(Sorry about the C#, I don't know VBScript)
Using quotes is normal for .csv files. If you have quotes in the field then you may see opening and closing and the embedded quote all strung together two or three in a row.
If you're using SQL Server you could try using T-SQL to handle everything for you.
SELECT * INTO MyTable FROM OPENDATASOURCE('Microsoft.JET.OLEDB.4.0',
'Data Source=F:\MyDirectory;Extended Properties="text;HDR=No"')...
[MyCsvFile#csv]
That will create and populate "MyTable". Read more on this subject here on SO.
I would recommend using RegEx to break up the strings.
Find every ';' that is not a part of
";" and change it to something else
that does not appear in your fields.
Then go through and replace ";" with ;
Now you have your fields with the correct data.
Most importers can swap out separator characters pretty easily.
This is basically your GUID idea. Just make sure the GUID is unique to your file before you start and you will be fine. I tend to start using 'Z'. After enough 'Z's, you will be unique (sometimes as few as 1-3 will do).
