Lua Pattern Matching issue - lua

I'm trying to parse a text file using lua and store the results in two arrays. I thought my pattern would be correct, but this is the first time I've done anything of the sort.
fileio.lua:
questNames = {}
questLevels = {}
lineNumber = 1
file = io.open("results.txt", "w")
io.input(file)
for line in io.lines("questlist.txt") do
questNames[lineNumber], questLevels[lineNumber]= string.match(line, "(%a+)(%d+)")
lineNumber = lineNumber + 1
end
for i=1,lineNumber do
if (questNames[i] ~= nil and questLevels[i] ~= nil) then
file:write(questNames[i])
file:write(" ")
file:write(questLevels[i])
file:write("\n")
end
end
io.close(file)
Here's a small snippet of questlist.txt:
If the dead could talk16
Forgotten soul16
The Toothmaul Ploy9
Well-Armed Savages9
And here's a matching snippet of results.txt:
talk 16
soul 16
Ploy 9
Savages 9
What I'm after in results.txt is:
If the dead could talk 16
Forgotten soul 16
The Toothmaul Ploy 9
Well-Armed Savages 9
So my question is, which pattern do I use in order to select all text up to a number?
Thanks for your time.

%a matches letters. It does not match spaces.
If you want to match everything up to a sequence of digits you want (.-)(%d+).
If you want to match a leading sequence of non-digits then you want ([^%d]+)(%d+).
That being said if all you want to do is insert a space before a sequence of digits then you can just use line:gsub("%d+", " %0", 1) to do that (the one to only do it for the first match, leave that off to do it for every match on the line).
As an aside I don't think io.input(file) is doing anything useful for you (or what you might expect). It is replacing the default standard input file handle with the file handle file.

Related

Trying to get some kind of key:value data from a string in Lua

I'm (again) stuck because patterns... so let's see if with a little of help... The case is I have e. g. a string returned by a function that contains the following:
đź“„ My Script
ScriptID:RL_SimpleTest
Version:0.0.1
ScriptType:MenuScript
AnotherKey:AnotherValue
And, maybe, some more text...
And I'd want to parse it line by line and should the line contains a ":" get the left side content of the line in a variable (k) and the right content in another one (v), so e. g. I'd have k containing "ScriptID" and v containing "RL_SimpleTest" for the second line (the first one should be just ignored) and so on...
Well, I've started with something like this:
function RL_Test:StringToKeyValue(str, sep1, sep2)
sep1 = sep1 or "\n"
sep2 = sep2 or ":"
local t = {}
for line in string.gmatch(str, "([^" .. sep1 .. "]+)") do
print(line)
for k in string.gmatch(line, "([^" .. sep2 .. "]+)") do --Here is where I'm lost trying to get the key/value pair separately and at the same time...
--t[k] = v
print(k)
end
end
return t
end
With the hope once I got isolated the line containing the data in the key:value form that I want to extract, I'd be able to do some kind of for k, v in string.gmatch(line, "([^" .. sep2 .. "]+)") or something so and that way get the two pieces of data, but of course it doesn't work and even though I have a feeling it's a triviality I don't know even where to start, always for the lack of patterns understanding...
Well, I hope at least I exposed it right... Thanks in advance for any help.
local t = {}
for line in (s..'\n'):gmatch("(.-)\r?\n") do
for a, b in line:gmatch("([^:]+):([^:\n\r]+)") do
t[a] = b
end
end
The pattern is quite simple. Match anything that is not a colon that is followed by a colon that is followed by anything that is not a colon or a line break. Put what you want in captures and you're done.
I assume every line is of the format k:v, containing exactly one colon, or containing no colon (no k/v pair).
Then you can simply first match nonempty lines using [^\n]+ (assuming UNIX LF line endings), then match each line using ^([^:]+):([^:]+)$. Breakdown of the second pattern:
^ and $ are anchors. They force the pattern to match the entire line.
([^:]+) matches & captures one or more non-semicolon characters.
This leaves you with:
function RL_Test:StringToKeyValue(str)
local t = {}
for line in str:gmatch"[^\n]+" do
local k, v = line:match"^([^:]+):([^:]+)$"
if k then -- line is k:v pair?
t[k] = v
end
end
return t
end
If you want to support Windows CRLF line endings, use for line in (s..'\n'):gmatch'(.-)\r?\n' do as in Piglet's answer for matching the lines instead.
This answer differs from Piglet's answer in that it uses match instead of gmatch for matching the k/v pairs, allowing exactly one k/v pair with exactly one colon per line, whereas Piglet's code may extract multiple k/v pairs per line.

Lua split string using specific pattern

i need to split each row of a input file using the specific pattern " - ". I'm not so far from solution but my code actually splits also single spaces. Each row of the file is formatted as follow:
NAME - ID - USERNAME - GROUP NAME - GROUP ID - TIMESTAMP
name field may have spaces, same as group name and timestamp, for example a row like that
LUCKY STRIKE - 11223344 - #lucky - CIGARETTES SMOKERS - 44332211 - 11:42 may/5th
is valid.
So these tokenized values should be stored inside a table.
Here is my code:
local function splitstring(inputstr)
sep = "(%s-%s)"
local t={} ; i=1
for str in string.gmatch(inputstr, "([^"..sep.."]+)") do
t[i] = str
i = i + 1
end
print("=========="..t[1].."===========")
print("=========="..t[2].."===========")
print("=========="..t[3].."===========")
return t
end
when i run it, puts "lucky" in first field, strike in second field, the id inside third field.
Is there a way to store "lucky strike" inside first field, parsing ONLY by pattern specified?
Hope you guys could help me.
p.s. I already see the lua manual but didn't help me so much...
Here is another take:
s="LUCKY STRIKE - 11223344 - #lucky - CIGARETTES SMOKERS - 44332211 - 11:42 may/5th"
s=s.." - "
for v in s:gmatch("(.-)%s+%-%s+") do
print("["..v.."]")
end
The pattern reflects the definition of the field: everything until - surrounded by spaces. Here "everything" is implemented using the non-greedy pattern .-.To make this work uniformly, we add the separator to the end as well. Many pattern matching problems that use separators can benefit from this uniformity.
There are a few things wrong with what you have.
Firstly, - is a repetition symbol in Lua patterns:
http://www.lua.org/manual/5.2/manual.html#6.4.1
You need to use %- to get a literal -.
We're not done: The resulting gmatch call is string.gmatch(inputstr, "[^%s%-%s]+"). Since your separator pattern is inside [], it's a character class. It says "Give me all the things that aren't a space or a -, and be as greedy as you can", which is why it stops at the first space character.
Your best bet is to do something like:
local function splitstring(inputstr)
sep = "%-"
local t={} ; i=1
for str in string.gmatch(inputstr, "[^"..sep.."]+") do
t[i] = str
i = i + 1
end
print("=========="..t[1].."===========")
print("=========="..t[2].."===========")
print("=========="..t[3].."===========")
return t
end
Which yields:
==========LUCKY STRIKE ===========
========== 11223344 ===========
========== #lucky ===========
... And now independently fix the problem of the spaces around the values.

Using string.find

I'm trying write some code that looks at two data sets and matches them (if match), at the moment I am using string.find and this kinda work but its very rigid. For example: it works on check1 but not on check2/3, as theres a space in the feed or some other word. i like to return a match on all 3 of them but how can i do that? (match by more than 4 characters, maybe?)
check1 = 'jan'
check2 = 'janAnd'
check3 = 'jan kevin'
input = 'jan is friends with kevin'
if string.find(input.. "" , check1 ) then
print("match on jan")
end
if string.find( input.. "" , check2 ) then
print("match on jan and")
end
if string.find( input.. "" , check3 ) then
print("match on jan kevin")
end
PS: i have tried gfind, gmatch, match, but no luck with them
find only does direct match, so if the string you are searching is not a substring you are searching in (with some pattern processing for character sets and special characters), you get no match.
If you are interested in matching those strings you listed in the example, you need to look at fuzzy search. This SO answer may help as well as this one. I've implemented the algorithm listed in the second example, but got better results with two- and tri-gram matching based on this algorithm.
Lua's string.find works not just with exact strings but with patterns as well. But the syntax is a bit different from what you have in your "checks". You'd want check2 to be "jan.+" to match "jan" followed by one or more characters. Your third check will need to be jan.+kevin. Here the dot stands for any character, while the following plus sign indicates that this might be a sequence of one or more characters. There's more info at http://www.lua.org/pil/20.2.html.

Break strings into substrings based on delimiters, with empty substrings

I am using LUA to create a table within a table, and am running into an issue. I need to also populate the NIL values that appear, but can not seem to get it right.
String being manipulated:
PatID = '07-26-27~L73F11341687Per^^^SCI^SP~N7N558300000Acc^'
for word in PatID:gmatch("[^\~w]+") do table.insert(PatIDTable,word) end
local _, PatIDCount = string.gsub(PatID,"~","")
PatIDTableB = {}
for i=1, PatIDCount+1 do
PatIDTableB[i] = {}
end
for j=1, #PatIDTable do
for word in PatIDTable[j]:gmatch("[^\^]+") do
table.insert(PatIDTableB[j], word)
end
end
This currently produces this output:
table
[1]=table
[1]='07-26-27'
[2]=table
[1]='L73F11341687Per'
[2]='SCI'
[3]='SP'
[3]=table
[1]='N7N558300000Acc'
But I need it to produce:
table
[1]=table
[1]='07-26-27'
[2]=table
[1]='L73F11341687Per'
[2]=''
[3]=''
[4]='SCI'
[5]='SP'
[3]=table
[1]='N7N558300000Acc'
[2]=''
EDIT:
I think I may have done a bad job explaining what it is I am looking for. It is not necessarily that I want the karats to be considered "NIL" or "empty", but rather, that they signify that a new string is to be started.
They are, I guess for lack of a better explanation, position identifiers.
So, for example:
L73F11341687Per^^^SCI^SP
actually translates to:
1. L73F11341687Per
2.
3.
4. SCI
5. SP
If I were to have
L73F11341687Per^12ABC^^SCI^SP
Then the positions are:
1. L73F11341687Per
2. 12ABC
3.
4. SCI
5. SP
And in turn, the table would be:
table
[1]=table
[1]='07-26-27'
[2]=table
[1]='L73F11341687Per'
[2]='12ABC'
[3]=''
[4]='SCI'
[5]='SP'
[3]=table
[1]='N7N558300000Acc'
[2]=''
Hopefully this sheds a little more light on what I'm trying to do.
Now that we've cleared up what the question is about, here's the issue.
Your gmatch pattern will return all of the matching substrings in the given string. However, your gmatch pattern uses "+". That means "one or more", which therefore cannot match an empty string. If it encounters a ^ character, it just skips it.
But, if you just tried :gmatch("[^\^]*"), which allows empty matches, the problem is that it would effectively turn every ^ character into an empty match. Which is not what you want.
What you want is to eat the ^ at the end of a substring. But, if you try :gmatch("([^\^])\^"), you'll find that it won't return the last string. That's because the last string doesn't end with ^, so it isn't a valid match.
The closest you can get with gmatch is this pattern: "([^\^]*)\^?". This has the downside of putting an empty string at the end. However, you can just remove that easily enough, since one will always be placed there.
local s0 = '07-26-27~L73F11341687Per^^^SCI^SP~N7N558300000Acc^'
local tt = {}
for s1 in (s0..'~'):gmatch'(.-)~' do
local t = {}
for s2 in (s1..'^'):gmatch'(.-)^' do
table.insert(t, s2)
end
table.insert(tt, t)
end

Funny CSV format help

I've been given a large file with a funny CSV format to parse into a database.
The separator character is a semicolon (;). If one of the fields contains a semicolon it is "escaped" by wrapping it in doublequotes, like this ";".
I have been assured that there will never be two adjacent fields with trailing/ leading doublequotes, so this format should technically be ok.
Now, for parsing it in VBScript I was thinking of
Replacing each instance of ";" with a GUID,
Splitting the line into an array by semicolon,
Running back through the array, replacing the GUIDs with ";"
It seems to be the quickest way. Is there a better way? I guess I could use substrings but this method seems to be acceptable...
Your method sounds fine with the caveat that there's absolutely no possibility that your GUID will occur in the text itself.
On approach I've used for this type of data before is to just split on the semi-colons regardless then, if two adjacent fields end and start with a quote, combine them.
For example:
Pax;is;a;good;guy";" so;says;his;wife.
becomes:
0 Pax
1 is
2 a
3 good
4 guy"
5 " so
6 says
7 his
8 wife.
Then, when you discover that fields 4 and 5 end and start (respectively) with a quote, you combine them by replacing the field 4 closing quote with a semicolon and removing the field 5 opening quote (and joining them of course).
0 Pax
1 is
2 a
3 good
4 guy; so
5 says
6 his
7 wife.
In pseudo-code, given:
input: A string, first character is input[0]; last
character is input[length]. Further, assume one dummy
character, input[length+1]. It can be anything except
; and ". This string is one line of the "CSV" file.
length: positive integer, number of characters in input
Do this:
set start = 0
if input[0] = ';':
you have a blank field in the beginning; do whatever with it
set start = 2
endif
for each c between 1 and length:
next iteration unless string[c] = ';'
if input[c-1] ≠ '"' or input[c+1] ≠ '"': // test for escape sequence ";"
found field consting of half-open range [start,c); do whatever
with it. Note that in the case of empty fields, start≥c, leaving
an empty range
set start = c+1
endif
end foreach
Untested, of course. Debugging code like this is always fun….
The special case of input[0] is to make sure we don't ever look at input[-1]. If you can make input[-1] safe, then you can get rid of that special case. You can also put a dummy character in input[0] and then start your data—and your parsing—from input[1].
One option would be to find instances of the regex:
[^"];[^"]
and then break the string apart with substring:
List<string> ret = new List<string>();
Regex r = new Regex(#"[^""];[^""]");
Match m;
while((m = r.Match(line)).Success)
{
ret.Add(line.Substring(0,m.Index + 1);
line = line.Substring(m.Index + 2);
}
(Sorry about the C#, I don't known VBScript)
Using quotes is normal for .csv files. If you have quotes in the field then you may see opening and closing and the embedded quote all strung together two or three in a row.
If you're using SQL Server you could try using T-SQL to handle everything for you.
SELECT * INTO MyTable FROM OPENDATASOURCE('Microsoft.JET.OLEDB.4.0',
'Data Source=F:\MyDirectory;Extended Properties="text;HDR=No"')...
[MyCsvFile#csv]
That will create and populate "MyTable". Read more on this subject here on SO.
I would recommend using RegEx to break up the strings.
Find every ';' that is not a part of
";" and change it to something else
that does not appear in your fields.
Then go through and replace ";" with ;
Now you have your fields with the correct data.
Most importers can swap out separator characters pretty easily.
This is basically your GUID idea. Just make sure the GUID is unique to your file before you start and you will be fine. I tend to start using 'Z'. After enough 'Z's, you will be unique (sometimes as few as 1-3 will do).
Jacob

Resources