Wiki-fying a text using LPeg - lua

Long story coming up, but I'll try to keep it brief. I have many pure-text paragraphs which I extract from a system and re-output in wiki format so that the copying of said data is not such an arduous task. This all goes really well, except that no references are generated automatically for the 'topics' we have pages for; those have to be added by reading through all the text and manually changing Topic to [[Topic]].
First requirement: each topic is only to be made clickable once, at its first occurrence. Otherwise, it would become a really spammy linkfest, which would detract from readability.
Second requirement: to avoid issues with topics that start with the same words, overlapping topic names should be handled in such a way that the most 'precise' topic gets the link; in later occurrences, the less precise topics do not get linked, since they're likely not correct.
Example:
topics = { "Project", "Mary", "Mr. Moore", "Project Omega"}
input = "Mary and Mr. Moore work together on Project Omega. Mr. Moore hates both Mary and Project Omega, but Mary simply loves the Project."
output = function_to_be_written(input)
-- "[[Mary]] and [[Mr. Moore]] work together on [[Project Omega]]. Mr. Moore hates both Mary and Project Omega, but Mary simply loves the [[Project]]."
Now, I quickly figured out that neither a simple nor a complicated string.gsub() could get me what I need to satisfy the second requirement, as it provides no way to say 'Consider this match as if it did not happen - I want you to backtrack further'. I need the engine to do something akin to:
input = "abc def ghi"
-- Looping over the input would, in this order, match the following strings:
-- 1) abc def ghi
-- 2) abc def
-- 3) abc
-- 4) def ghi
-- 5) def
-- 6) ghi
Once a string matches an actual topic and has not been replaced before by its wikified version, it is replaced. If this topic has been replaced by a wikified version before, don't replace, but simply continue the matching at the end of the topic. (So for a topic "abc def", it would test "ghi" next in both cases.)
Thus I arrive at LPeg. I have read up on it, played with it, but it is considerably complex, and while I think I need to use lpeg.Cmt and lpeg.Cs somehow, I am unable to mix the two properly to make what I want to do work. I am refraining from posting my practice attempts as they are of miserable quality and probably more likely to confuse anyone than assist in clarifying my problem.
(Why do I want to use a PEG instead of writing a triple-nested loop myself? Because I don't want to, and it is a great excuse to learn PEGs... except that I am in over my head a bit. Unless it turns out not to be possible with LPeg, writing the loop myself is not an option.)

So... I got bored and needed something to do:
topics = { "Project", "Mary", "Mr. Moore", "Project Omega" }

pcall ( require , 'luarocks.require' )
require 'lpeg'

local locale = lpeg.locale ( )
local endofstring = -lpeg.P(1)
local endoftoken = (locale.space+locale.punct)^1

table.sort ( topics , function ( a , b ) return #a > #b end ) -- Sort by word length (longest first)

local topicpattern = lpeg.P ( false )
for i = 1, #topics do
    topicpattern = topicpattern + topics [ i ]
end

function wikify ( input )
    local topicsleft = { }
    for i = 1 , #topics do
        topicsleft [ topics [ i ] ] = true
    end
    local makelink = function ( topic )
        if topicsleft [ topic ] then
            topicsleft [ topic ] = nil
            return "[[" .. topic .. "]]"
        else
            return topic
        end
    end
    local patt = lpeg.Ct (
        (
            lpeg.Cs ( ( topicpattern / makelink ) ) * #(-locale.alnum+endofstring) -- Match topics followed by something that's not alphanumeric
            + lpeg.C ( ( lpeg.P ( 1 ) - endoftoken )^0 * endoftoken ) -- Skip tokens that aren't topics
        )^0 * endofstring -- Match ad infinitum until end of string
    )
    return table.concat ( patt:match ( input ) )
end

print('"'..wikify("Mary and Mr. Moore work together on Project Omega. Mr. Moore hates both Mary and Project Omega, but Mary simply loves the Project.")..'"')
print('"'..wikify("Mary and Mr. Moore work on Project Omegality. Mr. Moore hates Mary and Project Omega, but Mary loves the Projectaaa.")..'"')
I start off by making a pattern which matches all the different topics; we want to match the longest topics first, so sort the table by word length from longest to shortest.
Now we need to make a list of the topics we haven't seen in the current input.
makelink quotes/links the topic if we haven't seen it already, otherwise leaves it be.
Now for the actual lpeg stuff:
lpeg.Ct packs all our captures into a table (to be concatenated together for output)
topicpattern / makelink captures a topic, and passes it through our makelink function.
lpeg.Cs substitutes the result of makelink back in where the match of the topic was.
+ lpeg.C ( ( lpeg.P ( 1 ) - locale.space )^0 * locale.space^1 ) if we didn't match a topic, skip a word (that is, non-spaces followed by a space)
^0 repeat.
Hope that's what you wanted :)
Daurn
Note: Edited code, description no longer correct

So why don't you use string.find? It searches only for the first occurrence of a topic and gives you its starting index and length. All you have to do is wrap that result in '[[' and ']]'.
For each chunk, copy the topics table, and when the first occurrence of a topic has been found, remove it from the copy.
Sort the topics by length, longest first, so that the most relevant topic is found first.
LPeg is a good tool, but it's not necessary to use it here.
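For illustration, a rough sketch of that approach (untested; the helper names are made up, it reuses the topics table from above, adds a check so that shorter topics don't claim occurrences of longer ones, and for brevity leaves out the word-boundary check that the LPeg version does):

local topics = { "Project", "Mary", "Mr. Moore", "Project Omega" }

-- true if some longer topic also starts at position s, in which case a
-- shorter topic should not claim this occurrence (second requirement)
local function longer_topic_at(text, s, topic)
    for _, other in ipairs(topics) do
        if #other > #topic and text:sub(s, s + #other - 1) == other then
            return true
        end
    end
    return false
end

local function wikify_find(input)
    -- process topics longest first, so "Project Omega" wins over "Project"
    local sorted = {}
    for _, t in ipairs(topics) do sorted[#sorted + 1] = t end
    table.sort(sorted, function(a, b) return #a > #b end)

    for _, topic in ipairs(sorted) do
        local init = 1
        while true do
            local s, e = input:find(topic, init, true)  -- plain-text find
            if not s then break end
            -- skip occurrences that already sit inside a [[...]] link
            local inside_link = input:sub(1, s - 1):find("%[%[[^%]]*$")
            if not inside_link and not longer_topic_at(input, s, topic) then
                input = input:sub(1, s - 1) .. "[[" .. topic .. "]]" .. input:sub(e + 1)
                break  -- each topic is linked only once
            end
            init = e + 1  -- otherwise try the next occurrence
        end
    end
    return input
end

print(wikify_find("Mary and Mr. Moore work together on Project Omega. Mr. Moore hates both Mary and Project Omega, but Mary simply loves the Project."))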

Related

Lua - How to ignore a result from a table iteration without removing it?

I want to create a crossword puzzle solver with Lua. I'm not used to this language though, and my English is poor, sorry for that.
I have to iterate multiple times over the same table of tables, checking whether a given word is present and, if it is, replace every char of that word in the table with a "*" symbol.
For example:
schema= {
{"A","B","C","D","H","F","G","W","T","Y"},
{"U","H","E","L","L","O","I","I","O","L"},
{"G","F","D","R","Y","T","R","G","R","R"}}
function(schema,"HELLO")
schema= {
{"A","B","C","D","H","F","G","W","T","Y"},
{"U","*","*","*","*","*","I","I","O","L"},
{"G","F","D","R","Y","T","R","G","R","R"}}
For now I'm focusing on finding the word by scanning the table from left to right. Here's my code:
i = 1
t = {}
for k,w in pairs(schema) do
    t[k] = w
end

cercaPrima = function(tabella,stringa)
    for v = 1, 10 do
        if string.sub(stringa,1,1) == t[i][v] then
            print(t[i][v]) v = v+1
            return cercaDS(t,stringa,i,v)
        else
            v = v+1
        end
    end
    if i < #t then
        i = i+1
        cercaPrima(tabella,stringa)
    else
        return print("?")
    end
end

cercaDS = function(tabella,stringa,d,s)
    local o = 2
    local l = 2
    while o <= #stringa do
        if string.sub(stringa,o,l) == tabella[d][s] then
            print(tabella[d][s])
            tabella[d][s] = "*"
            s=s+1
            o=o+1
            l=l+1
        else
            l=l-1
            s=s-l
            o=#stringa+1
            tabella[d][s] = "*"
            return cercaPrima(tabella,stringa)
        end
    end
end

cercaPrima(schema,"HELLO")
It's probably overcomplicated, but my question is: how can I make it ignore the first "H" (not turning it into a "*") while it keeps iterating over the table looking for another "H" that fits the criteria?
My goal is to create a function that takes a table and a list of words as input, iterates over the table looking for every single word and, if it finds them all, replaces every char of every found word in the table with a "*" and prints the remaining characters as a string.
Another problem that I'll probably have is: what if a char of one word is also a char of another word? It will read "*" instead of the real char if it has already found the first word.
Should I create a new table for every word I'm looking for? But then how can I merge those tables together to extract the remaining characters?
Thank you for your help!
If you want to ignore something one time you can use a conditional statement. Just remember, using a variable, that you have already encountered it. But I don't see how this makes sense here.
A problem like this is probably better solved by turning each line and column into a string and then simply searching those strings for the words, as in the sketch below.
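A rough sketch of that idea (find_in_grid is a made-up helper name; it assumes the schema table from the question, only scans rows left-to-right and columns top-to-bottom, and marks matched cells with "*"):

local function find_in_grid(schema, word)
    -- rows: concatenate each row into a string and search it
    for r = 1, #schema do
        local s = table.concat(schema[r]):find(word, 1, true)  -- plain-text search
        if s then
            for k = 0, #word - 1 do schema[r][s + k] = "*" end
            return true
        end
    end
    -- columns: concatenate each column into a string and search it
    for c = 1, #schema[1] do
        local column = {}
        for r = 1, #schema do column[r] = schema[r][c] end
        local s = table.concat(column):find(word, 1, true)
        if s then
            for k = 0, #word - 1 do schema[s + k][c] = "*" end
            return true
        end
    end
    return false
end

find_in_grid(schema, "HELLO")  -- marks the "HELLO" cells in row 2

As the question notes, overwriting matched cells with "*" hides characters shared with a later word; keeping the matched positions in a separate table and writing the "*"s only at the very end avoids that.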
I find string.gsub() is a great find-and-replace tool.
Maybe it doesn't hit all the requirements, but it might inspire you.
> function cercaPrisma(tab,txt) for i=1,#tab do print((table.concat(tab[i]):gsub(txt, ('*'):rep(txt:len())))) end end
> cercaPrisma(schema, 'HELLO')
ABCDHFGWTY
U*****IIOL
GFDRYTRGRR
> cercaPrisma(schema, 'DRY')
ABCDHFGWTY
UHELLOIIOL
GF***TRGRR

vba to find word or phrase then highlight the entire paragraph that follows it

I've used the following code to find a key word or phrase and then highlight the line -- but I can't figure out how to make it highlight the entire paragraph and/or the list that follows it...
For example:
(ideally, highlighting this paragraph AND the list that provides further details would be best - but if only the paragraph is achievable, that's better than nothing):
"The Contractor shall turn in monthly status reports within ten business days after the end of each month. The report should include:
(a) Accomplishments
(b) Meetings and Outcomes
(c) Completed Travel and Purpose of Travel"
I've researched several commands and looked for examples but am still at a loss as a novice. I tried "wdParagraph" but couldn't get that to work. I found references to maybe using "paragraph.range.select", and also a note advising the "start" and "end" statements below to select a paragraph, but I'm not sure how to achieve this. Hoping someone has an example of how to accomplish it, as it would help greatly with quickly identifying hundreds of software requirements in a 100-page Word doc.. so frustrated!
* Selection.StartOf Unit:=wdParagraph
* Selection.MoveEnd Unit:=wdParagraph
Sub Find_Highlight_Word_to_End_of_Line()
    'BUT NEED IT TO HIGHLIGHT THROUGH END OF PARAGRAPH
    'AND HIGHLIGHT LISTED ITEMS IF APPLICABLE
    'LIKE THE LISTS IN THE EXAMPLE DOCUMENT
    Dim sFindText As String
    'Start from the top of the document
    Selection.HomeKey wdStory
    sFindText = "Contractor Shall"
    Selection.Find.Execute sFindText
    Do Until Selection.Find.Found = False
        Selection.EndKey Unit:=wdLine, Extend:=wdExtend
        Selection.Range.HighlightColorIndex = wdYellow
        Selection.MoveRight
        Selection.Find.Execute
    Loop
End Sub
A couple of awesome experts shared 2 methods with me to achieve the full paragraph highlighting I needed.. Hope these help others! Fascinating to see different ways to achieve the same result!
Method 2 Code (Alternative) is shorter:
Sub Highlight_Paragraph()
    Dim oRng As Range
    Set oRng = ActiveDocument.Range
    With oRng.Find
        Do While .Execute(FindText:="Contractor Shall")
            oRng.Paragraphs(1).Range.HighlightColorIndex = wdYellow
            oRng.Collapse 0
        Loop
    End With
lbl_Exit:
    Set oRng = Nothing
    Exit Sub
End Sub
Method 1 Code (MatchWildcards=False):
Sub Demo()
    Application.ScreenUpdating = False
    With ActiveDocument.Range
        With .Find
            .ClearFormatting
            .Replacement.ClearFormatting
            .Text = "Contractor Shall"
            .Replacement.Text = ""
            .Forward = True
            .Wrap = wdFindStop
            .Format = True
            .MatchWildcards = False
            .Execute
        End With
        Do While .Find.Found
            .Duplicate.Paragraphs.First.Range.HighlightColorIndex = wdYellow
            .Start = .Duplicate.Paragraphs.First.Range.End
            .Find.Execute
        Loop
    End With
    Application.ScreenUpdating = True
End Sub

Find all upper/lower/mixed combinations of a string

I need this for a game server using Lua.
I would like to be able to save all combinations of a name into a string that can then be used with:
if exists (string)
example:
ABC_-123
aBC_-123
AbC_-123
ABc_-123
abC_-123
etc
In the game only numbers, letters and _ - . can be used in names.
(A_B-C, A-B.C, AB_8 ... etc)
I understand the logic, I just don't know how to code it :D
0-Lower
1-Upper
then
000
001
etc
You can use a recursive generator. The first parameter contains the left part of the string generated so far, and the second parameter is the remaining right part of the original string.
function combinations(s1, s2)
    if s2:len() > 0 then
        local c = s2:sub(1, 1)
        local l = c:lower()
        local u = c:upper()
        if l == u then
            combinations(s1 .. c, s2:sub(2))
        else
            combinations(s1 .. l, s2:sub(2))
            combinations(s1 .. u, s2:sub(2))
        end
    else
        print(s1)
    end
end
So the function is called in this way.
combinations("", "ABC_-123")
You only have to store intermediate results instead of printing them.
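For example, a small variation that returns the results in a table instead of printing them (same recursion as above, plus an accumulator parameter):

local function combinations(s1, s2, results)
    results = results or {}
    if s2:len() > 0 then
        local c = s2:sub(1, 1)
        local l = c:lower()
        local u = c:upper()
        if l == u then
            combinations(s1 .. c, s2:sub(2), results)
        else
            combinations(s1 .. l, s2:sub(2), results)
            combinations(s1 .. u, s2:sub(2), results)
        end
    else
        results[#results + 1] = s1
    end
    return results
end

local all = combinations("", "ABC_-123")
print(#all)  -- 8: three letters, each either lower or upper case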
If you are interested only in the exists function then you don't need all combinations.
local stored_string = "ABC_-123"
function exists(tested_string)
    return stored_string:lower() == tested_string:lower()
end
You simply compare the stored string and the tested string in a case-insensitive way.
It can be easily tested:
assert(exists("abC_-123"))
assert(not exists("abd_-123"))
How to do this?
There's no native function in Lua to generate all permutations of a string, but here are a few things that may prove useful.
Substrings
Probably the simplest solution, but also the least flexible. Rather than combinations, you can check if a substring exists within a given string.
if str:find(substr) then
--code
end
If this solves your problem, I highly recommend it.
Get all permutations
A more expensive, but still a working solution. This accomplishes nearly exactly what you asked.
function GetScrambles(str, tab2)
    local tab = {}
    for i = 1, #str do
        table.insert(tab, str:sub(i, i))
    end
    local tab2 = tab2 or {}
    local scrambles = {}
    -- number of permutations = (#tab)!
    local nperm = 1
    for i = 2, #tab do nperm = nperm * i end
    for i = 0, nperm - 1 do
        local permutation = ""
        local a = nperm
        for j = 1, #tab do
            tab2[j] = tab[j]
        end
        for j = #tab, 1, -1 do
            a = a / j
            local b = math.floor((i/a)%j) + 1
            permutation = permutation .. tab2[b]
            tab2[b] = tab2[j]
        end
        table.insert(scrambles, permutation)
    end
    return scrambles
end
What you asked
Basically this would be exactly what you originally asked for. It's the same as the above code, except with every substring of the string.
function GetAllSubstrings(str)
    local substrings = {}
    for i = 1, #str do
        for ii = i, #str do
            substrings[#substrings+1] = str:sub(i, ii)
        end
    end
    return substrings
end
Capitals
You'd basically have to, with every permutation, make every possible combination of capitals with it.
This shouldn't be too difficult, I'm sure you can code it :)
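One possible sketch of that capitals step, using the 0 = lower / 1 = upper counting idea from the question (capital_combinations is a made-up name; you would apply it to each permutation or substring generated above):

local function capital_combinations(s)
    -- positions of characters that actually have two distinct cases
    local letters = {}
    for i = 1, #s do
        local c = s:sub(i, i)
        if c:lower() ~= c:upper() then letters[#letters + 1] = i end
    end
    local out = {}
    -- every number from 0 to 2^n - 1 picks one case pattern for the n letters
    for mask = 0, 2 ^ #letters - 1 do
        local chars = {}
        for i = 1, #s do chars[i] = s:sub(i, i) end
        local m = mask
        for _, pos in ipairs(letters) do
            if m % 2 == 1 then
                chars[pos] = chars[pos]:upper()
            else
                chars[pos] = chars[pos]:lower()
            end
            m = math.floor(m / 2)
        end
        out[#out + 1] = table.concat(chars)
    end
    return out
end

-- capital_combinations("ABC_-123") returns 8 strings, from "abc_-123" to "ABC_-123"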
Are you joking?
After all this, you should probably be wondering: is all of this really necessary? It seems like a bit much!
The answer lies in what you are doing. Do you really need all the combinations of the given characters? I don't think so. You say in the comments that you need it for case insensitivity... but did you know you could simply convert the string to lower/upper case? It's very simple:
local str = "hELlO"
print(str:lower())
print(str:upper())
This is HOW you should store names; otherwise you should leave them case sensitive.
You decide
Now YOU pick what you're going to do. Whichever direction you pick, I wish you the best of luck!

Walking over strings to guess a name from an email based on dictionary of names?

Let's say I have a dictionary of names (a huge CSV file). I want to guess a name from an email that has no obvious parsable points (., -, _). I want to do something like this:
dict = ["sam", "joe", "john", "parker", "jane", "smith", "doe"]
word = "johnsmith"
x = 0
y = word.length - 1
name_array = []
for i in x..y
  match_me = word[x..i]
  dict.each do |name|
    if match_me == name
      name_array << name
    end
  end
end
name_array
# => ["john"]
Not bad, but I want "John Smith" or ["john", "smith"]
In other words, I recursively loop through the word (i.e., an unparsed email string such as "johndoe@gmail.com") until I find a match within the dictionary. I know: this is incredibly inefficient. If there's a much easier way of doing this, I'm all ears!
If there's no better way of doing it, then show me how to fix the example above, for it suffers from two major flaws: (1) how do I set the length of the loop (see the problem of finding "i" below), and (2) how do I increment "x" in the example above so that I can cycle through all possible character combinations given an arbitrary string?
Problem of finding the length of the loop, "i":
for an arbitrary word, how can we derive "i" given the pattern below?
for a (i = 1)
a
for ab (i = 3)
a
ab
b
for abc (i = 6)
a
ab
abc
b
bc
c
for abcd (i = 10)
a
ab
abc
abcd
b
bc
bcd
c
cd
d
for abcde (i = 15)
a
ab
abc
abcd
abcde
b
bc
bcd
bcde
c
cd
cde
d
de
e
r = /^(#{Regexp.union(dict)})(#{Regexp.union(dict)})$/
word.match(r)
=> #<MatchData "johnsmith" 1:"john" 2:"smith">
The regex might take some time to build, but it's blazing fast.
I dare suggest a brute force solution that is not very elegant but still useful in case:
* you have a large number of items (building a regexp can be a pain)
* the string to analyse is not limited to two components
* you want to get all splittings of a string
* you want only complete analyses of a string, spanning from ^ to $.
Because of my poor English, I could not figure out a long personal name that can be split in more than one way, so let's analyse a phrase:
word = "godisnowhere"
The dictionary:
@dict = [ "god", "is", "now", "here", "nowhere", "no", "where" ]
@lengths = @dict.collect {|w| w.length }.uniq.sort
The array @lengths adds a slight optimization to the algorithm: we will use it to prune subwords with lengths that don't occur in the dictionary, without actually performing a dictionary lookup. The array is sorted; this is another optimization.
The main part of the solution is a recursive function that finds the initial subword in a given word and restarts for the tail subword.
def find_head_substring(word)
  # boundary condition:
  # remaining subword is shorter than the shortest word in @dict
  return [] if word.length < @lengths[0]
  splittings = []
  @lengths.each do |len|
    break if len > word.length
    head = word[0,len]
    if @dict.include?(head)
      tail = word[len..-1]
      if tail.length == 0
        splittings << head
      else
        tails = find_head_substring(tail)
        unless tails.empty?
          tails.collect!{|tail| "#{head} #{tail}" }
          splittings.concat tails
        end
      end
    end
  end
  return splittings
end
Now see how it works
find_head_substring(word)
=>["god is no where", "god is now here", "god is nowhere"]
I have not tested it extensively, so I apologize in advance :)
If you just want the hits of matches in your dictionary:
dict.select{ |r| word[/#{r}/] }
=> ["john", "smith"]
You run a risk of too many confusing subhits, so you might want to sort your dictionary so longer names are first:
dict.sort_by{ |w| -w.size }.select{ |r| word[/#{r}/] }
=> ["smith", "john"]
You will still encounter situations where a longer name has a shorter substring following it and get multiple hits so you'll need to figure out a way to weed those out. You could have an array of first names, and another of last names, and take the first returned result of scanning for each, but given the diversity of first and last names, that doesn't guarantee 100% accuracy, and will still gather some bad results.
This sort of problem has no real good solution without further hints to the code about the person's name. Perhaps scanning the body of the message for salutation or valediction sections will help.
I'm not sure what you're doing with i, but isn't it as simple as:
dict.each do |first|
  dict.each do |last|
    puts first, last if first + last == word
  end
end
This one bags all occurrences, not necessarily exactly two:
pattern = Regexp.union(dict)
matches = []
while match = word.match(pattern)
  matches << match.to_s # Or just leave off to_s to keep the match itself
  word = match.post_match
end
matches

splitting space delimited entries into new columns in R

I am coding a survey that outputs a .csv file. Within this CSV I have some entries that are space delimited, which represent multi-select questions (i.e. questions with more than one response). In the end I want to parse these space-delimited entries into their own columns and create headers for them so I know where they came from.
For example I may start with this (note that the multiselect columns have an _M after them):
Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2
and I want to go to this:
Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6,,,,3,1,1,0,0,0
I imagine this is a relatively common issue to deal with but I have not been able to find it in the R section. Any ideas how to do this in R after importing the .csv ? My general thoughts (which often lead to inefficient programs) are that I can:
(1) pull column numbers that have the special suffix with grep()
(2) loop through (or use an apply) each of the entries in these columns and determine the levels of responses and then create columns accordingly
(3) loop through (or use an apply) and place indicators in appropriate columns to indicate presence of selection
I appreciate any help and please let me know if this is not clear.
I agree with ran2 and aL3Xa that you probably want to change the format of your data to have a different column for each possible response. However, if munging your dataset into a better format proves problematic, it is possible to do what you asked.
process_multichoice <- function(x) lapply(strsplit(x, " "), as.numeric)
q2 <- c("1 2 3 NA 4", "2 5")
processed_q2 <- process_multichoice(q2)
[[1]]
[1] 1 2 3 NA 4
[[2]]
[1] 2 5
The reason different columns for different responses are suggested is because it is still quite unpleasant trying to retrieve any statistics from the data in this form. Although you can do things like
# Number of reponses given
sapply(processed_q2, length)
#Frequency of each response
table(unlist(processed_q2), useNA = "ifany")
EDIT: One more piece of advice. Keep the code that processes your data separate from the code that analyses it. If you create any graphs, keep the code for creating them separate again. I've been down the road of mixing things together, and it isn't pretty. (Especially when you come back to the code six months later.)
I am not entirely sure what you are trying to do, or what your reasons are for coding it like this. Thus my advice is more general, so just feel free to clarify and I will try to give a more concrete response.
1) It sounds like you are coding the survey on your own, which is great because it means you have influence on your .csv file. I would NEVER use different kinds of separation in the same .csv file. Just do the naming from the very beginning, just like you suggested in the second block.
Otherwise you might get into trouble with checkboxes, for example. Let's say someone checks 3 out of 5 possible answers, and the next person checks only 1 (i.e. "don't know"). Now it will be much harder to create a spreadsheet (data.frame) type of results view, as opposed to having an empty field (which turns out to be an NA in R) that only needs to be recoded.
2) Another important question is whether you intend to do a panel survey (i.e. a longitudinal study asking the same participants over and over again). That (among many others) would be a good reason to think about saving your data to a MySQL database instead of .csv. RMySQL can connect directly to the database and access its tables and, more importantly, its VIEWS.
Views really help with survey data since you can rearrange the data in different views, conditional on many different needs.
3) Besides all the personal / opinion and experience, here's some (less biased) literature to get started:
Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology)
The book is comparatively simple and leaves out panel surveys, but gives a lot of R code and examples, which should be a practical start.
To avoid re-inventing the wheel you might want to check out LimeSurvey, a pretty decent (not speaking of the templates :) ) tool for survey conductors. Besides that, the TYPO3 CMS extensions pbsurvey and ke_questionnaire (should) work well too (I have only tested pbsurvey).
Multiple choice items should always be coded as separate variables. That is, if you have 5 alternatives and multiple choice, you should code them as i1, i2, i3, i4, i5, i.e. each one is a binary variable (0-1). I see that you have values 3 5 99 for Q4_M variable in the first example. Does that mean that you have 99 alternatives in an item? Ouch...
First you should go on and create separate variables for each alternative in a multiple choice item. That is, do:
# note that I follow your example with Q4_M variable
dtf_ins <- as.data.frame(matrix(0, nrow = nrow(<initial dataframe>), ncol = 99))
# name vars appropriately
names(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")
Now you have a data.frame of 0s, so what you need to do is get 1s into the appropriate positions (this is a bit cumbersome); a function will do the job...
# first you gotta change spaces to commas and convert the character variable to a numeric one
y <- paste("c(", gsub(" ", ", ", x), ")", sep = "")
z <- eval(parse(text = y))
# now you assign 1 according to the indexes in z
dtf_ins[1, z] <- 1
And that's pretty much it... basically, you would like to reconsider creating a data.frame with _M variables, so you can write a function that does this insertion automatically. Avoid for loops!
Or, even better, create a matrix with logicals, and just do dtf[m] <- 1, where dtf is your multiple-choice data.frame, and m is matrix with logicals.
I would like to help you more on this one, but I'm recuperating after a looong night! =) Hope that I've helped a bit! =)
Thanks for all the responses. I agree with most of you that this format is kind of silly but it is what I have to work with (survey is coded and going into use next week). This is what I came up with from all the responses. I am sure this is not the most elegant or efficient way to do it but I think it should work.
colnums <- grep("_M", colnames(dat))
responses <- nrow(dat)

for (i in colnums) {
    vec <- as.vector(dat[,i]) # turn into vector
    b <- lapply(strsplit(vec, " "), as.numeric) # split up and turn into numeric
    c <- sort(unique(unlist(b))) # which values were used
    newcolnames <- paste(colnames(dat[i]), "_", c, sep = "") # column names
    e <- matrix(nrow = responses, ncol = length(c)) # create new matrix for indicators
    colnames(e) <- newcolnames
    # next loop looks for responses and puts indicators in the correct places
    for (i in 1:responses) {
        e[i,] <- ifelse(c %in% b[[i]], 1, 0)
    }
    dat <- cbind(dat, e)
}
Suggestions for improvement are welcome.
