Could anybody help me make a proper regular expression to pull a value out of a bunch of text in Ruby? I tried a lot, but I don't know how to handle variable-length titles.
The string will be of format <sometext>title:"<actual_title>"<sometext>. I want to extract actual_title from this string.
I tried /title:"."/ but it doesn't find any matches, because it expects the closing quotation mark right after a single character following the opening one. I couldn't figure out how to make it match a string of variable length. Any help is appreciated. Thanks.
. matches any single character. Putting + after a character will match one or more of those characters. So .+ will match one or more characters of any sort. Also, you should put a question mark after it so that it matches the first closing-quotation mark it comes across. So:
/title:"(.+?)"/
The parentheses are necessary if you want to extract the title text that it matched out of there.
/title:"([^"]*)"/
The parentheses create a capturing group. Inside is first a character class. The ^ means it's negated, so it matches any character that's not a ". The * means 0 or more. You can change it to one or more by using + instead of *.
I like /title:"(.+?)"/ because of its use of lazy matching, which stops the .+ from consuming all text until the last " on the line is found.
It won't work if the string wraps lines or includes escaped quotes.
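To compare the patterns suggested above, here is a small Ruby sketch. The sample string and the method names are made up for illustration; only the regexes come from the answers. The greedy variant is included just to show what the question mark prevents.

```ruby
# Negated character class: the capture can never cross a double quote.
def extract_title(text)
  text[/title:"([^"]*)"/, 1]
end

# Lazy quantifier: stops at the first closing quote it can reach.
def extract_title_lazy(text)
  text[/title:"(.+?)"/, 1]
end

# Greedy variant, shown only to illustrate the problem it causes.
def extract_title_greedy(text)
  text[/title:"(.+)"/, 1]
end

# Hypothetical sample input; the real string is not shown in the question.
sample = 'id:7 title:"Hello World" author:"Someone"'
puts extract_title(sample)         # Hello World
puts extract_title_lazy(sample)    # Hello World
puts extract_title_greedy(sample)  # Hello World" author:"Someone
```

Each method returns nil when the string contains no title:"..." part, so check for that before using the result.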
In programming languages where you want to be able to include the string delimiter inside a string, you usually provide an 'escape' character or sequence.
If your escape character was \ then you could write something like this...
/title:"((?:\\"|[^"])+)"/
This is a railroad diagram. Railroad diagrams show the order in which things are parsed: imagine you are a train starting at the left. You consume title:" and then \" if you can; if you can't, you consume any character that is not a ". The > means this path is preferred, so you try to loop; when you can't, you have to consume a " to finish.
I made this with https://regexper.com/#%2Ftitle%3A%22((%3F%3A%5C%5C%22%7C%5B%5E%22%5D)%2B)%22%2F
but there is now a plugin for Atom text editor too that does this.
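As a rough check of the escaped-quote pattern, the Ruby sketch below runs it next to the lazy pattern on a title containing \" sequences. The sample string and method names are assumptions for illustration, and the backslash is assumed to be the escape character, as in the answer.

```ruby
# Pattern from the answer: prefer an escaped quote (\"), otherwise take any
# single character that is not a double quote.
def extract_escaped_title(text)
  text[/title:"((?:\\"|[^"])+)"/, 1]
end

# Lazy pattern, for comparison: it stops at the first quote, escaped or not.
def extract_lazy_title(text)
  text[/title:"(.+?)"/, 1]
end

# In a single-quoted Ruby literal, \" stays as a backslash plus a quote.
sample = 'x title:"say \"hi\" now" y'

puts extract_escaped_title(sample)  # say \"hi\" now
puts extract_lazy_title(sample)     # say \

# Optional: turn the escaped quotes back into plain quotes.
puts extract_escaped_title(sample).gsub('\\"', '"')  # say "hi" now
```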
I am trying to get rid of shortcodes inside a Google Sheet column. I have many items such as [spacer type="1" height="20"][spacer] or [FinalTilesGallery id="37"], and I would just like to remove them. Is there any simple way to do it?
Thanks !
For in-place replacement, the quick option would be to use the Find and Replace dialog (Ctrl + H) with Search Using Regular Expressions turned on, which is more powerful than your standard Find and Replace.
Find: \[.*?\] - Match anything within an open-bracket up to the very next close-bracket. This should work assuming you have no nested brackets, e.g. [[no][no]].
If you do have nested brackets, you'll have to change this to \[[^\[\]]*\]. And continue to Replace All until all the codes are gone.
Replace: Nothing.
Replace All. If you don't want to affect other sheets that may be in your document, make sure you select the right range to work with, too.
This just erases everything within the brackets.
If you want to erase any redundant spaces left by this, simply Find and Replace again (with Regular Expressions) on " +" (a space followed by a plus), which will match 1 or more spaces, and replace with " " (a single space).
E.g.:
"string [] [] string2" -> "string   string2" (three spaces between the words) after the shortcode replacement.
After replacing spaces, it will become "string string2".
Let's say your original strings are in the range A2:A. Place the following into B2 of an otherwise completely empty Column B (or the second cell of any other empty column):
=ArrayFormula(IF(A2:A="",,TRIM(REGEXREPLACE(A2:A,"\[[^\[\]]+\]",""))))
I can't see your data, so I don't know what kind of information is between these shortcodes. If you find that this leaves you with concatenated pieces of data where there should be spaces between them, replace the above with this version:
=ArrayFormula(IF(A2:A="",,TRIM(REGEXREPLACE(SUBSTITUTE(SUBSTITUTE(A2:A,"["," ["),"]","] "),"\[[^\[\]]+\]",""))))
I can't teach regular expression language here. But I will note that, since square brackets have specific meaning within regex, your literal square brackets must be indicated with the escape character: the backslash.
Here is the regex expression alone:
\[[^\[\]]+\]
The opening \[ and the closing \], then, reference your actual opening and closing bracket sets. If we remove those, we have this left:
[^\[\]]+
Again, you see the escaped opening and closing square brackets, which I'll replace with the word these:
[^these]+
What remains there are opening and closing brackets with regex meaning, i.e., "anything in this group." And the circumflex symbol ^ as the first character within this set of square brackets means "anything except." The + symbol means "in any string length of one or more characters."
So that whole regex expression then reads: "A literal open square bracket, followed by one or more characters that are anything except square brackets, ending with a literal closing square bracket."
And we are REGEXREPLACE-ing any instance of that with "" (i.e., nothing).
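To try the bracket pattern outside Sheets, here is a Ruby sketch of the same idea. It is not Sheets code: the method name and sample strings are made up, and it uses the * variant so that an empty [] is removed as well. The pattern uses only basic regex syntax, so it should behave the same way in both places.

```ruby
# Remove [shortcode ...] blocks. The loop repeats the replacement so nested
# brackets are peeled from the inside out, like pressing Replace All again.
def strip_shortcodes(text)
  pattern = /\[[^\[\]]*\]/
  text = text.gsub(pattern, '') while text.match?(pattern)
  # Collapse the runs of spaces left behind and trim the ends (like TRIM).
  text.gsub(/ +/, ' ').strip
end

# Hypothetical cell contents for illustration.
puts strip_shortcodes('Intro [spacer type="1" height="20"][spacer] text [FinalTilesGallery id="37"] end')
# Intro text end
puts strip_shortcodes('a [[no][no]] b')
# a b
```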
I have a plain text file with one string per line. I'd like to identify any instances where a string contains a value outside of a restricted character set. In this particular instance, if the string contains any character outside of the set "[THADGRC.SMBN-WVKY]" I want to retain it and pass it along to a new file.
For example, let's say the original file "mystrings.txt" contained the following data:
THADGRC.SMBN-WVKY
YKVW-NBMS.CRGDHAT
THADGRC.SMBN-WVKYI
My intention is to retain only the third sequence, because it contains a character outside of the allowed set (I) in this case.
It doesn't matter how many times, or in what order, an allowed character is present - all I care about is if a character exists in that string outside of the allowed set.
Originally I tried:
cat mystrings.txt | grep -v [THADGRC\.SMBN-WVKY] > badstrings.txt
but of course the third string contains those allowed character in addition to the non-allowed characters, thus this search ended up producing no "offending" strings.
Last thing: I'm not sure what characters outside of the allowed set might exist in this text file. It would be great to know ahead of time to just search for anything with an "I", but I don't actually know this ahead of time.
So the question: is there a way to use grep (or another tool, say awk?) to pass in a restricted list of characters, and flag any instances where a string contains any number of characters outside of that set?
Thanks for your consideration
I think that your problem is N-W. This doesn't match "N", "-" and "W", it matches a range from "N" to "W". You should move "-" to the end of the character class, or escape it. I suggest changing to:
grep '[^THADGRC.SMBNWVKY-]' mystrings.txt
Also, note that "." doesn't have to be escaped when it's inside a character class.
Your attempt says "remove any lines which contain one of these characters at least once". But you want "print any lines which contain at least one character not in this set."
(Also, quote your regular expressions, and lose the useless cat.)
grep '[^-THADGRC.SMBNWVKY]' mystrings.txt > badstrings.txt
I moved the dash to the beginning of the character class on the assumption that you want a literal dash, not the regex range N-W (i.e. N, O, P, Q, R, S, T, U, V, W).
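The question also asks about other tools, so here is the same check as a Ruby sketch. The constant and method names are made up; the character set is the one from the question, with the dash first so that it is literal.

```ruby
# Matches any single character outside the allowed set.
DISALLOWED = /[^-THADGRC.SMBNWVKY]/

# Keep only the lines that contain at least one disallowed character.
# chomp matters: the trailing newline is not in the allowed set, so
# without it every line would be flagged.
def offending_lines(lines)
  lines.map(&:chomp).select { |line| line.match?(DISALLOWED) }
end

# The three sample lines from the question.
sample = ["THADGRC.SMBN-WVKY\n", "YKVW-NBMS.CRGDHAT\n", "THADGRC.SMBN-WVKYI\n"]
puts offending_lines(sample)  # THADGRC.SMBN-WVKYI
```

To run it on the real file, pass File.readlines('mystrings.txt') instead of the sample array.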
I'm trying to match any incoming strings that follow the format Word 100.00% ~(45.56, 34.76) in Lua. As such, I'm looking for a pattern close (in theory) to this:
%D%s[%d%.%d]%%(%d.%d, %d.%d)
But I'm having no luck so far. Lua's patterns are weird.
What am I missing?
Your pattern is close, but you neglected to allow for multiple digits. You can do this by adding a + after the class, as in %d+.
You also did not use [, ( and . correctly in the pattern.
A [ in a pattern starts a set of characters to match at that position; for example, [abc] matches an a, a b, or a c.
A ( defines a capture, which returns the specific values you want rather than the whole string when there is a match. To match a literal ( you need to escape it with a %.
A . matches any character rather than specifically a period; escape it with a % (%.) if you want to match a literal period.
local str = "Word 100.00% ~(45.56, 34.76)"
local pattern = "%w+%s%d+%.%d+%%%s~%(%d+%.%d+, %d+%.%d+%)"
print(string.match(str, pattern))
Here you will see the input string printed if it matches the pattern; otherwise you will see nil.
Suggested resource: Understanding Lua Patterns
Hi I've been struggling with this for the last hour and am no closer. How exactly do I strip everything except numbers, commas and decimal points from a rails string? The closest I have so far is:-
rate = rate.gsub!(/[^0-9]/i, '')
This strips everything but the numbers. When I try to add commas to the expression, everything gets stripped. I got the above from somewhere else, and as far as I can gather:
^ = not
Everything to the left of the comma gets replaced by what's in the '' on the right
No idea what the /i does
I'm very new to gsub. Does anyone know of a good tutorial on building expressions?
Thanks
Try:
rate = rate.gsub(/[^0-9,\.]/, '')
Basically, you know the ^ means "not" when inside the character class brackets [] which you are using, so you can just add the comma to the list. Outside a character class, the period is a special character that means "match any character" and must be escaped with a backslash to match literally; inside [] it is already literal, so the backslash here is harmless but optional.
Also, be aware of whether you are using gsub or gsub!
gsub! has the bang, so it edits the string you call it on rather than returning a new one. It returns nil when no substitutions were made, so avoid assigning its result back to the variable.
So if using gsub! it would be:
rate.gsub!(/[^0-9,\.]/, '')
And rate would be altered.
If you do not want to alter the original variable, then you can use the version without the bang (and assign it to a different var):
cleaned_rate = rate.gsub(/[^0-9,\.]/, '')
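A short sketch of the difference between the two methods. The method name and sample values are made up for illustration.

```ruby
# Non-destructive: returns a cleaned copy and leaves the argument untouched.
def clean_rate(rate)
  rate.gsub(/[^0-9,.]/, '')
end

original = '$1,234.50 USD'
cleaned = clean_rate(original)
puts original  # $1,234.50 USD
puts cleaned   # 1,234.50

# gsub! edits the receiver in place, but returns nil when nothing changed.
already_clean = '42.00'.dup
result = already_clean.gsub!(/[^0-9,.]/, '')
p result         # nil
p already_clean  # "42.00"
```

The nil return is why assigning the result of gsub! back to the same variable can wipe out a value that was already clean.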
I'd just google for tutorials. I haven't used one. Regexes are a LOT of time and trial and error (and table-flipping).
This is a cool tool to use with a mini cheat-sheet on it for ruby that allows you to quickly edit and test your expression:
http://rubular.com/
You can just add the comma and period in the square-bracketed expression:
rate.gsub(/[^0-9,.]/, '')
You don't need the i for case-insensitivity for numbers and symbols.
There's lots of info on regular expressions, regex, etc. Maybe search for those instead of gsub.
You can use this:
rate = rate.gsub(/[^0-9\.\,]/, '')
Also check this out to learn more about regular expressions:
http://www.regexr.com/
I have been coding a program in Lua that automatically formats IRC logs from a roleplay. In the roleplay logs there is a specific guideline for "Out of character" conversation, which we use double parentheses for. For example: ((<Things unrelated to roleplay go here>)). I have been trying to have my program remove text between double brackets (and including both brackets). The code is:
ofile = io.open("Output.txt", "w")
rfile = io.open("Input.txt", "r")
p = rfile:read("*all")
w = string.gsub(p, "%(%(.*?%)%)", "")
ofile:write(w)
The pattern here is "%(%(.*?%)%)". I've tried multiple variations of the pattern, and none of them worked:
1. %(%(.*?%)%) --Wouldn't do anything.
2. %(%(.*%)%) --Would remove *everything* after the first OOC message.
Then, my friend told me that prepending the brackets with percentages wouldn't work, and that I had to use backslashes to 'escape' the parentheses.
3. \(\(.*\)\) --resulted in the output file being completely empty.
4. (\(\(.*\)\)) --Same result as above.
5. (\(\(.*?\)\) --would for some reason, remove large parts of the text for no apparent reason.
6. \(\(.*?\)\) --would just remove all the text except for the last line.
The short, absolute question:
What pattern would I need to use to remove all text between double parentheses, and remove the double parentheses themselves too?
Your friend is thinking of regular expressions. Lua patterns are similar, but different. % is the correct escape character.
Your pattern should be %(%(.-%)%). The - is similar to * in that it matches any number of repetitions of the preceding item, but while * tries to match as many characters as it can (it's greedy), - matches as few characters as possible (it's non-greedy). It won't go overboard and run past the first double closing parenthesis.