Lua: getting rid of part of a path (sub, gsub, gmatch?)

So I have this variable:
a = [[C:\aaa\aaa\aa\bbb\ccc\ddd]]
And i need to end up here:
a = [[ccc\ddd]]
Note that the path (the aaa, ccc and ddd folders) might be different from time to time, but the word "bbb" is always going to be there, and that's what I'd like to use to start chopping the text (from the end of the word, not from the beginning).
I've been reading some string tutorials and everything I tried just doesn't work (I'm pretty new to scripting). I think the "\" character messes things up.
What's the best way to deal with this? Thanks!

This is a good time to make use of patterns.
Information on that here: understanding lua patterns
With a pattern, you could use string.match to flexibly capture the part of the string you want:
a ="C:\\aaa\\aaa\\aa\\bbb\\ccc\\ddd"
print(string.match(a, "bbb\\(.*)"))
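The same pattern works against the original long-bracket string too, since backslashes need no escaping inside [[ ]]; a small sketch:
-- The pattern still needs "\\": the string escape yields one backslash,
-- which Lua patterns treat as a literal character.
local a = [[C:\aaa\aaa\aa\bbb\ccc\ddd]]
print(a:match("bbb\\(.*)"))    --> ccc\ddd
-- Or strip everything up to and including "bbb\" with gsub; the extra
-- parentheses drop gsub's second return value (the match count):
print((a:gsub(".*bbb\\", ""))) --> ccc\ddd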

LaTeX - Define a list of words to use a certain font

Problem
I'm writing an essay/documentation about an application. Within this doc there are a lot of code-words which I want highlighted using a different font. Currently I work with:
{\fontfamily{cmtt}\selectfont SOME-KEY-WORD}
which is a bit of work to type every time.
Question
I'm looking for a way to declare a list of words to use a specific font within the text.
I know that I can use the listings package and define morekeywords, which will be highlighted within the listings environment, but I need this in running text.
I thought of something like this:
\defineList{\fontfamily{cmtt}}{
SOME-KEY-WORD-1,
SOME-Key-word-2,
...
}
EDIT
I forgot to mention that I already tried something like:
\def\somekeyword{\fontfamily{cmtt}\selectfont some\_key\_word\normalfont}
which is a little bit better than the first attempt, but I still need to use \somekeyword in the text.
EDIT 2
I came upon a workaround:
\newcommand{\cmtt}[1]{{\fontfamily{cmtt}\selectfont #1\normalfont}}
It's a little better than the version in the first EDIT, but still not the perfect solution.
Substituting every occurrence of a word, without providing any clues to TeX, might be difficult and is beyond my skills (though I'd be interested to see someone come up with a solution).
But why not simply create a macro for each of those words?
\newcommand\somekeyword{{\fontfamily{cmtt}\selectfont SOME-KEY-WORD}}
Use like this:
Hello, \somekeyword{} is the magic word!
The trailing {} are unfortunately necessary to prevent the macro from eating the subsequent whitespace; even the built-in \LaTeX command requires them. (The extra pair of braces around the body keeps the font change from leaking past the word.)
If you have very many of these words and are worried about maintainability, you can even create a macro to create the macros:
\newcommand\declareword[2][]{%
\expandafter\newcommand%
\csname\if\relax#1\relax#2\else#1\fi\endcsname%
{{\fontfamily{cmtt}\selectfont #2}}%
}
\declareword{oneword} % defines \oneword
\declareword{otherword} % defines \otherword
\declareword[urlspy]{urls.py} % defines \urlspy
...
The optional argument indicates the name of the command, in case the word itself contains characters like . which cannot be used in the name of a command.
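Putting it together, a minimal document might look like this (a sketch; the words are just the examples from above):
\documentclass{article}
\newcommand\declareword[2][]{%
  \expandafter\newcommand%
  \csname\if\relax#1\relax#2\else#1\fi\endcsname%
  {{\fontfamily{cmtt}\selectfont #2}}%
}
\declareword{oneword}
\declareword[urlspy]{urls.py}
\begin{document}
Use \oneword{} before editing \urlspy{}.
\end{document}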

How to match efficiently against keys in a table in Lua?

Available in my Lua 5.1 environment are obviously the default Lua pattern matching, but also a reasonably recent version of PCRE and LPEG. I don't honestly care which of these is used; as long as my problem is tackled in an efficient manner I'm happy. (My personal knowledge of LPEG especially is next to non-existent, but I hear it has some very good qualities.)
I have a table with certain string patterns as keys; the accompanying values are to be used once a key matches, which means they aren't really important for this matter.
Suppose you have:
tbl = { ["aaa"] = 12, ["aab"] = 452, ["aba"] = -2 }
Now my goal is to find out which one of these matches first in a particular string like "accaccaacaadacaabacdaaba".
In reality, the keys are more numerous and the match string is considerably lengthier. This means that simply matching against all keys one by one and comparing the column each match begins at is a very inefficient solution that is not viable for me.
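For reference, that naive baseline would look something like this (a sketch only; it runs one full scan per key):
-- Try every key and keep the one whose match starts earliest.
local s = "accaccaacaadacaabacdaaba"
local best_pos, best_key
for key in pairs(tbl) do
  local pos = string.find(s, key)
  if pos and (not best_pos or pos < best_pos) then
    best_pos, best_key = pos, key
  end
end
print(best_key, tbl[best_key]) --> aab  452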
Parts of the match strings can have considerable overlaps, too. From theory, I know that running one state machine per key pattern would be ideal in this regard: just step every pattern through the input, and the moment one of them completes a match, you are done.
But I would be crazy to go code something like that myself when there are so many pattern-matching libraries in my environment. The only one I know to be technically capable is PCRE: just join the keys like "aaa|aab|aba" and you'll get the first feasible match.
But there's a problem there, too. For one, I am unsure how intelligently PCRE compiles such a match (I suspect it first tries 'aaa', backtracks completely once that fails, then tries 'aab' from scratch, but I haven't tested this), which wouldn't be too efficient compared to a pattern like "a(a[ab]|ba)", where the shared prefixes get resolved faster.
Additionally, I'd like the capacity to put in some flexibility ("a.ad" where the second character doesn't matter, or matches a number... basic stuff like that). With a pattern like that in such an additive approach, I do not see a way to recover the original pattern that matched, so that I can use the value that goes with it.
(Worst case, I could just generate a lot of entries in the table to match every possible wildcard variation and do away with the pattern requirement, but I honestly don't want to.)
Which library is the right tool for the job, and how best to use it to achieve the above-stated goals without reinventing the wheel?
A comment to your question mentioned the Aho–Corasick algorithm.
If your environment has access to os.execute or io.popen, you can call fgrep -o -f patterns filename, where patterns is the name of a file that contains the patterns separated by newlines, and filename is the name of your input. -o means that only the matches are output, one per line. You can replace filename with - so that fgrep reads from standard input: echo "String to match" | fgrep -o -f patterns.
fgrep implements the Aho–Corasick algorithm.
However, remember that the Aho–Corasick algorithm does not recognise metacharacters.
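From Lua, a minimal sketch of that route, assuming io.popen is available and fgrep is on the PATH (patterns.txt and input.txt are hypothetical file names):
-- fgrep prints one complete match per line, in input order.
local pipe = io.popen("fgrep -o -f patterns.txt input.txt")
for match in pipe:lines() do
  print(match)
end
pipe:close()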
Just as Alexander Mashin's answer said, the Aho–Corasick algorithm is an efficient algorithm that will solve your problem. In Lua land, cloudflare/lua-aho-corasick is an implementation for LuaJIT using FFI. There's also a pure Lua implementation, jgrahamc/aho-corasick-lua, which might be slower.

Regex with Ruby issue

My regular expression (regex) is still a work in progress, and I'm having the following issue trying to extract some anchor text from a hash where the element is stored.
My hash looks like:
hash["example"]
=> " Project, Area 1"
My ruby of which is trying to do the extraction of "Project" and "Area 1":
hash["ITA Area"].scan(/<a href=\"(.*)\">(.*)<\/a>/)
Any help would be much appreciated as always.
Your groups are using greedy matching, so each (.*) is going to grab as much as it can. Change the (.*) parts to (.*?) to use non-greedy (lazy) matching.
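A quick illustration of the difference (the string s is hypothetical, since the real value isn't shown):
s = '<a href="/p">Project</a>, <a href="/a">Area 1</a>'
# Greedy: the first (.*) runs to the last '">', so everything collapses
# into one big match.
s.scan(/<a href="(.*)">(.*)<\/a>/)
# Lazy: each group stops as early as possible.
s.scan(/<a href="(.*?)">(.*?)<\/a>/)
#=> [["/p", "Project"], ["/a", "Area 1"]]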
There are loads of posts here on why you should not be using regex to parse HTML. There are many reasons why... such as: what if there is more than one space between the a and the href? It would be ideal to use a tool designed for parsing HTML.
You will have to escape the backslashes, so something like \\\\ instead of just \\. It sounds stupid, but I had a similar problem with it.
I'm not entirely sure what your issue is, but the regexp should match. Double quotes " need not be escaped. As mentioned in Dan Breen's answer, you need to use non-greedy matchers if the string is expected to contain more than one possible match.
The canonical SO reason to use a real HTML parser is calmly explained right here.
However, regexen can parse simple snippets without too much trouble.
Update: Aha, the anchor text. That's actually pretty easy:
> s.scan /([^<>]*)<\/a>/
=> [["Project"], ["Area 1"]]

Sanitize pasted text from MS-Word

Here's my wild and wacky pseudo-code. Anyone know how to make this real?
Background:
This dynamic content comes from a CKEditor, and a lot of folks paste Microsoft Word content into it. No worries: if I just render the attribute untouched, it loads pretty. But the catch is that I want it abbreviated to just 125 characters. When I add truncation, all of the Microsoft Word junk starts popping up. Then I added simple_format, and sanitize, and truncate, and even made my controller spot specific strings that MS Word produces and gsub them out. But there are too many of them, and it seems like an awfully messy way to accomplish this. Rendered whole, the content is clean, so I thought: why not just slice it? However, the Microsoft Word text becomes blank but still holds its numbered positions in the string. So I came up with this (probably awful) solution below.
It's in three steps.
1. When the text parses, it doesn't display any of the MS Word junk, but that junk still occupies positions in the string. So I want to use a regexp to find the first actual character.
2. Take that character and find out its numbered position in the total string.
3. Use a slice statement to cut from there.
def about_us_truncated
  x = self.about_us.find.first(regExp representing first actual character)
  x.charCount = y
  self.about_us[y..125]
end
The only other idea I've got is a regex statement that explicitly slices only actual characters, like so:
about_us([a-zA-Z][0..125]), but that is definitely not how it is written.
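For what it's worth, the pseudo-code above can be made concrete along these lines (a sketch; the [A-Za-z] heuristic for "first actual character" is an assumption to adjust as needed):
def about_us_truncated
  # Index of the first real letter, or 0 if none is found.
  start = about_us.index(/[A-Za-z]/) || 0
  about_us[start, 125]
end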
Here is some sample text of MS Word junk:
&lt;!--[if gte mso 9]>&lt;xml>&lt;br /> &lt;o:OfficeDocumentSettings>&lt;br /> &lt;o:AllowPNG/>&lt;br /> &lt;/o:Off...
You haven't provided much information to go off of, but don't be too leery of trying to build this regex on your own before you seek help...
Take your sample text and paste it in Rubular in the test string area and start building your regex. It has a great quick reference at the bottom.
Stumbled across this: http://gist.github.com/139987
It looks like it requires the sanitize gem.
This is technically not a straight answer, but it seems like the best possible one you can find.
To prevent the MS Word junk in the first place, you should be using CKEditor's built-in MS Word sanitizer. Writing a regex for this can get very complicated, and you can very easily break tags in half and destroy your site with it.
What I did as a workaround was to force paste-as-plain-text in CKEditor.

Replace strings in LaTeX

I want LaTeX to automatically replace strings like " a ", " s ", " z " with " a~", " s~", " z~", because single-letter words can't sit at the end of a line. Any suggestions?
For Czech typographic rules, there is a preprocessor called Vlna by Petr Olšák. The set of words it ties (usually single-letter prepositions in Czech) is customizable, so it might be usable for other languages as well.
You can use \StrSubstitute from the xstring package.
e.g.
\StrSubstitute{change ME}{ ME}{d}
will convert change ME into changed.
Nesting is not possible, though, so to make another substitution you must use an intermediate variable, in this way:
\StrSubstitute{change ME}{ ME}{d}[\mystring]
\StrSubstitute{\mystring}{ed}{ing}
Finally, your solution would be:
\usepackage{xstring}
\def\mystring{...source string here...}
\begin{document}
\StrSubstitute{\mystring}{ a }{ a~{}}[\mystring]
\StrSubstitute{\mystring}{ s }{ s~{}}[\mystring]
\StrSubstitute{\mystring}{ z }{ z~{}}[\mystring]
\mystring
\end{document}
Note the use of the empty group {} to avoid the sequence ~}, and the leading space kept in each replacement so that the space before the word survives.
I'm afraid (to the best of my knowledge) this is basically impossible in LaTeX itself. A LuaTeX-based solution might be possible, though; a sketch follows below.
It's not actually clear to me, however, that " a ", for example, shouldn't appear at the end of a
line. Although I might be used to different typographic rules.
(Is there anything wrong with the line break in the last paragraph? :))
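A minimal sketch of that LuaTeX route (an untested assumption on my part): compile with lualatex and rewrite each input line before TeX reads it.
\usepackage{luacode}
\begin{luacode*}
-- Tie " a ", " s ", " z " to the following word. Consecutive
-- single-letter words would need a second pass, because each match
-- consumes the trailing space.
luatexbase.add_to_callback("process_input_buffer",
  function(line)
    return (line:gsub(" ([asz]) ", " %1~"))
  end,
  "tie-single-letters")
\end{luacode*}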
As far as I know there is no way to do this in LaTeX itself. I'd go for automating this with external tools, as my typical setup involves a Makefile handling the LaTeX run by itself. That makes it rather easy to run tools like sed on the sources and do some replacements using regular expressions; a simple rule, like the sketch below, would do it for your case.
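A sketch of the command such a rule would run, assuming GNU sed and hypothetical file names (run it twice if single-letter words can appear back to back, since each match consumes the trailing space):
sed -E 's/ ([asz]) / \1~/g' chapter.src.tex > chapter.tex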
If you use a LaTeX editor that does everything for you, you should check the editor's regular-expression search-and-replace functionality.
Yes, this is the age-old argument of data processing vs. data composition. We have always done these things in a pre-processor environment responsible for extracting the information from its source environment, SQL or plain text, and creating the contents of an \input{file.tex}.
But yes, it is possible (TeX is, after all, a programming language); you will just have to become a wizard. Get the four-volume set TeX in Practice by Stephan von Bechtolsheim.
The approach would be to begin an environment (execute a macro) whose "argument" is all the text down to the end of the environment, then munge through the tokens, fixing the ones you want.
Still, I don't think any of us are advocating this approach.
If you are using TeXmaker to write your LaTeX file, then you may click on the Edit button on the toolbar, then click on Replace.
A dialogue box will come up, and you can enter your strings one after the other.
You put the strings to be changed in the Find text input and what you want it to be changed to in the Replace text input.
You can also specify where you want the replacement to start from.
Click Find and Replace (or similar) in the menu of your text editor and do it.
