I have a big text and I'd like to remove everything before a certain string.
The problem is, there are several occurrences of that string in the text, and I want to decide which one is correct by later analyzing the found piece of text.
I can't include that analysis in a regular expression because of its complexity:
text = <<HERE
big big text
goes here
HERE
pos = -1
a = text.scan(/some regexp/im)
a.each do |m|
s = m[0]
# analysis of found string
...
if ( s is good ) # is the right candidate
pos = ??? # here I'd like to have a position of the found string in the text.
end
end
result_text = text[pos..-1]
$~.offset(n) returns the start and end offsets of the n-th group of the last match (n = 0 is the whole match).
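A minimal sketch of how that might fit the loop from the question: when scan is given a block, Regexp.last_match (the same object as $~) describes the current match inside the block, so its offset can be read there. The sample text, the pattern, and the good? check are placeholders for the real ones.

```ruby
text = "header foo=1\nbody foo=22\nfooter foo=333\n"  # placeholder text
pattern = /foo=(\d+)/                                   # placeholder regexp

# Hypothetical stand-in for the complex analysis of a candidate.
def good?(candidate)
  candidate.length == 2
end

pos = nil
text.scan(pattern) do |groups|
  # Inside the block, Regexp.last_match refers to the match being visited.
  m = Regexp.last_match
  if good?(groups[0])
    pos = m.offset(0).first  # start of the whole match; offset(1) would be group 1
    break
  end
end

# Fall back to the full text when no candidate passes the check.
result_text = pos ? text[pos..-1] : text
puts result_text
```

Stopping with break as soon as a candidate passes avoids scanning the rest of a large text.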
I think you should count how many occurrences there are in your big string then use index to cut off all the occurrences that do not match the final pattern.
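One possible reading of this suggestion, as a sketch: String#index accepts a start offset, so it can be called repeatedly to walk through the occurrences until one passes the analysis. The text, the marker, and the check below are placeholders.

```ruby
text = "a MARK x\nb MARK yy\nc MARK zzz\n"  # placeholder text
marker = "MARK"                              # placeholder search string

pos = nil
start = 0
# Each call resumes the search just after the previous hit.
while (i = text.index(marker, start))
  candidate = text[i..-1]
  # Placeholder check standing in for the real analysis.
  if candidate.start_with?("MARK yy")
    pos = i
    break
  end
  start = i + 1
end

# Fall back to the full text when no occurrence passes the check.
result_text = pos ? text[pos..-1] : text
puts result_text
```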
I'm trying to replace words (sequence of characters, more generally) in a string with corresponding values in an array. An example is:
"The dimension of the square is {{width}} and {{length}}" with array [10,20] should give
"The dimension of the square is 10 and 20"
I have tried to use gsub as
substituteValues.each do |sub|
value.gsub(/\{\{(.*?)\}\}/, sub)
end
But I could not get it to work. I have also thought of using a hash instead of an array as follows:
{"{{width}}"=>10, "{{length}}"=>20}. I feel this might work better but again, I'm not sure how to code it (new to Ruby). Any help is appreciated.
You can use
h = {"{{width}}"=>10, "{{length}}"=>20}
s = "The dimension of the square is {{width}} and {{length}}"
puts s.gsub(/\{\{(?:width|length)\}\}/, h)
# => The dimension of the square is 10 and 20
See the Ruby demo. Details:
\{\{(?:width|length)\}\} - a regex that matches
\{\{ - a {{ substring
(?:width|length) - a non-capturing group that matches width or length words
\}\} - a }} substring
gsub replaces all occurrences in the string with
h - used as the second argument, allows replacing the found matches that are equal to hash keys with the corresponding hash values.
You may use a slightly simpler hash definition, with keys that omit the {{ and }}, and then use a capturing group in the regex to match length or width. In that case you need a block:
h = {"width"=>10, "length"=>20}
s = "The dimension of the square is {{width}} and {{length}}"
puts s.gsub(/\{\{(width|length)\}\}/) { h[Regexp.last_match[1]] }
See this Ruby demo. So, here, (width|length) is used instead of (?:width|length) and only Group 1 is used as the key in h[Regexp.last_match[1]] inside the block.
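If the placeholder names are not known in advance, a generic pattern can capture whatever is between the braces. The sketch below is an illustration rather than part of the answer above: the names template, values, and queue are made up, leaving unknown placeholders untouched is an assumed requirement, and the array version assumes the values are listed in the order the placeholders appear.

```ruby
template = "The dimension of the square is {{width}} and {{length}}"

# Hash version: capture any placeholder name and look it up.
values = { "width" => 10, "length" => 20 }
by_name = template.gsub(/\{\{(\w+)\}\}/) do |match|
  name = Regexp.last_match(1)
  # Unknown placeholders are left as they are instead of becoming empty.
  values.key?(name) ? values[name].to_s : match
end

# Array version (as in the question): consume the values in order.
queue = [10, 20].each
by_position = template.gsub(/\{\{.*?\}\}/) { queue.next.to_s }

puts by_name
puts by_position
```

Note that queue.next raises StopIteration if the template contains more placeholders than there are values.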
I recently solved this problem, but felt there is a simpler way to do it. I'd like to use fewer lines of code than I am now. I'm new to ruby so if the answer is simple I'd love to add it to my toolbag. Thank you in advance.
Goal: accept a word as an argument and return the word with its last vowel removed; if there are no vowels, return the original word.
def hipsterfy(word)
vowels = "aeiou"
i = word.length - 1
while i >= 0
if vowels.include?(word[i])
return word[0...i] + word[i+1..-1]
end
i -= 1
end
word
end
Try this regex magic:
def hipsterfy(word)
word.gsub(/[aeiou](?=[^aeiou]*$)/, "")
end
How does it work?
[aeiou] looks for a vowel, and the lookahead (?=[^aeiou]*$) adds the constraint "only where no other vowel occurs in the rest of the string". So the regex finds the last vowel. Then we just gsub the matched (last) vowel with "".
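To see what the lookahead selects, the same pattern can be run through match and inspected. This is only an illustration; the sample words are arbitrary.

```ruby
pattern = /[aeiou](?=[^aeiou]*$)/

m = pattern.match("tonic")
puts m[0]                        # i  (the "o" is rejected: a vowel still follows it)
puts m.begin(0)                  # 3  (index of that vowel in the word)
puts m.pre_match + m.post_match  # tonc

# A word without any of these vowels does not match at all.
p pattern.match("rhythm")        # nil
```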
You could use rindex to find the last vowel's index and []= to remove the corresponding character:
def hipsterfy(word)
idx = word.rindex(/[aeiou]/)
word[idx] = '' if idx
word
end
The if idx is needed because rindex returns nil if no vowel is found. Note that []= modifies word.
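A short illustration of that side effect, and of passing a copy when the caller's string should stay intact. The method is repeated here only so the snippet runs on its own.

```ruby
def hipsterfy(word)
  idx = word.rindex(/[aeiou]/)
  word[idx] = '' if idx
  word
end

original = "tonic".dup         # dup yields a string that is safe to mutate
result = hipsterfy(original)
puts original                  # tonc  (the caller's string changed too)
puts result.equal?(original)   # true  (the very same object is returned)

kept = "tonic"
copy = hipsterfy(kept.dup)     # pass a copy to leave kept alone
puts kept                      # tonic
puts copy                      # tonc
```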
There's also rpartition, which splits the string at the last occurrence of the given pattern, returning an array containing the part before, the match, and the part after. By concatenating the former and the latter, you can effectively remove the middle part (i.e. the vowel):
def hipsterfy(word)
before, _, after = word.rpartition(/[aeiou]/)
before.concat(after)
end
This variant returns a new string, leaving word unchanged.
Another common approach when dealing with some last occurrence is to reverse the string so you can deal with a first occurrence instead (which is usually simpler). Here, you can utilize sub:
def hipsterfy(word)
word.reverse.sub(/[aeiou]/, '').reverse
end
Here is another way to do it.
Reverse the characters of the string
Use find_index to get the first vowel location in this reversed string
Delete the character at this index
Un-reverse the characters and join them back together.
reverse_chars = str.chars.reverse
vowel_idx = reverse_chars.find_index { |char| char =~ /[aeiou]/ }
reverse_chars.delete_at(vowel_idx) if vowel_idx
result = reverse_chars.reverse.join
I'm trying to write some code that looks at two data sets and matches them (if they match). At the moment I am using string.find, and this kind of works, but it's very rigid. For example, it works on check1 but not on check2/3, as there's a space in the feed or some other word. I'd like to return a match on all 3 of them, but how can I do that? (Match by more than 4 characters, maybe?)
check1 = 'jan'
check2 = 'janAnd'
check3 = 'jan kevin'
input = 'jan is friends with kevin'
if string.find(input.. "" , check1 ) then
print("match on jan")
end
if string.find( input.. "" , check2 ) then
print("match on jan and")
end
if string.find( input.. "" , check3 ) then
print("match on jan kevin")
end
PS: I have tried gfind, gmatch, and match, but had no luck with them.
find only does a direct match, so if the string you are searching for is not a substring of the one you are searching in (with some pattern processing for character sets and special characters), you get no match.
If you are interested in matching those strings you listed in the example, you need to look at fuzzy search. This SO answer may help as well as this one. I've implemented the algorithm listed in the second example, but got better results with two- and tri-gram matching based on this algorithm.
Lua's string.find works not just with exact strings but with patterns as well. But the syntax is a bit different from what you have in your "checks". You'd want check2 to be "jan.+" to match "jan" followed by one or more characters. Your third check will need to be jan.+kevin. Here the dot stands for any character, while the following plus sign indicates that this might be a sequence of one or more characters. There's more info at http://www.lua.org/pil/20.2.html.
I am using Lua to create a table within a table, and am running into an issue. I also need to populate the NIL values that appear, but cannot seem to get it right.
String being manipulated:
PatID = '07-26-27~L73F11341687Per^^^SCI^SP~N7N558300000Acc^'
PatIDTable = {}
for word in PatID:gmatch("[^~]+") do table.insert(PatIDTable,word) end
local _, PatIDCount = string.gsub(PatID,"~","")
PatIDTableB = {}
for i=1, PatIDCount+1 do
PatIDTableB[i] = {}
end
for j=1, #PatIDTable do
for word in PatIDTable[j]:gmatch("[^%^]+") do
table.insert(PatIDTableB[j], word)
end
end
This currently produces this output:
table
[1]=table
[1]='07-26-27'
[2]=table
[1]='L73F11341687Per'
[2]='SCI'
[3]='SP'
[3]=table
[1]='N7N558300000Acc'
But I need it to produce:
table
[1]=table
[1]='07-26-27'
[2]=table
[1]='L73F11341687Per'
[2]=''
[3]=''
[4]='SCI'
[5]='SP'
[3]=table
[1]='N7N558300000Acc'
[2]=''
EDIT:
I think I may have done a bad job explaining what it is I am looking for. It is not necessarily that I want the carets to be considered "NIL" or "empty", but rather that they signify that a new string is to be started.
They are, I guess for lack of a better explanation, position identifiers.
So, for example:
L73F11341687Per^^^SCI^SP
actually translates to:
1. L73F11341687Per
2.
3.
4. SCI
5. SP
If I were to have
L73F11341687Per^12ABC^^SCI^SP
Then the positions are:
1. L73F11341687Per
2. 12ABC
3.
4. SCI
5. SP
And in turn, the table would be:
table
[1]=table
[1]='07-26-27'
[2]=table
[1]='L73F11341687Per'
[2]='12ABC'
[3]=''
[4]='SCI'
[5]='SP'
[3]=table
[1]='N7N558300000Acc'
[2]=''
Hopefully this sheds a little more light on what I'm trying to do.
Now that we've cleared up what the question is about, here's the issue.
Your gmatch pattern will return all of the matching substrings in the given string. However, your gmatch pattern uses "+". That means "one or more", which therefore cannot match an empty string. If it encounters a ^ character, it just skips it.
But if you just tried :gmatch("[^%^]*"), which allows empty matches, the problem is that it would effectively turn every ^ character into an empty match. Which is not what you want. (The caret is written here with Lua's % escape, since \^ is not a valid string escape in Lua 5.2 and later.)
What you want is to eat the ^ at the end of a substring. But if you try :gmatch("([^%^]*)%^"), you'll find that it won't return the last string. That's because the last string doesn't end with ^, so it isn't a valid match.
The closest you can get with gmatch is this pattern: "([^%^]*)%^?". This has the downside of putting an empty string at the end. However, you can remove that easily enough, since one will always be placed there.
local s0 = '07-26-27~L73F11341687Per^^^SCI^SP~N7N558300000Acc^'
local tt = {}
for s1 in (s0..'~'):gmatch'(.-)~' do
local t = {}
for s2 in (s1..'^'):gmatch'(.-)^' do
table.insert(t, s2)
end
table.insert(tt, t)
end
I'm practicing with Ruby and regex to delete certain unwanted characters. For example:
input = input.gsub(/<\/?[^>]*>/, '')
and for special characters, example ☻ or :
input = input.gsub('&#', '')
This leaves only numbers, ok. But this only works if the user enters a special character as a code, like this:
My question:
How I can delete special characters if the user enters a special character without code, like this:
™ ☻
First of all, I think it might be easier to define what constitutes "correct input" and remove everything else. For example:
input = input.gsub(/[^0-9A-Za-z]/, '')
If that's not what you want (you want to support non-latin alphabets, etc.), then I think you should make a list of the glyphs you want to remove (like ™ or ☻), and remove them one-by-one, since it's hard to distinguish between a Chinese, Arabic, etc. character and a pictograph programmatically.
Finally, you might want to normalize your input by converting to or from HTML escape sequences.
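Both ideas can be sketched in a few lines. The glyph list follows the suggestion above; the whitelist uses Ruby's Unicode property classes \p{L} (letters) and \p{N} (numbers), which is a different technique from the one described and assumes that symbols such as ™ should count as unwanted. The sample input is made up.

```ruby
input = "Größe 42 ™ ☻ 日本語"

# Denylist: remove an explicit list of glyphs.
without_glyphs = input.delete("™☻")

# Whitelist: keep letters, numbers and whitespace from any script.
cleaned = input.gsub(/[^\p{L}\p{N}\s]/, '')

# Dropping the symbols leaves doubled spaces behind; squeeze them if that matters.
cleaned = cleaned.squeeze(' ').strip

puts without_glyphs
puts cleaned
```

The whitelist keeps ö, ß and the Japanese characters, which a pattern like [^0-9A-Za-z] would strip.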
If you just wanted ASCII characters, then you can use:
original = "aøbauhrhræoeuacå"
cleaned = ""
original.each_byte { |x| cleaned << x unless x > 127 }
cleaned # => "abauhrhroeuac"
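Two shorter ways to get the same result, offered as alternatives rather than as the method above: a regexp using the [[:ascii:]] bracket class, and String#encode with replacement options.

```ruby
original = "aøbauhrhræoeuacå"

# Regexp version: remove every character outside the ASCII range.
by_regexp = original.gsub(/[^[:ascii:]]/, '')

# Transcoding version: characters with no ASCII equivalent are replaced by ''.
by_encode = original.encode('US-ASCII', undef: :replace, invalid: :replace, replace: '')

puts by_regexp
puts by_encode
```

The encode version returns a string tagged as US-ASCII, which may matter if it is later combined with non-ASCII text.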
You can use parameterize (an ActiveSupport method, so this needs Rails or require 'active_support/core_ext/string'):
'#!#$%^&*()111'.parameterize
=> "111"
You can match all the characters you want, and then join them together, like this:
original = "aøbæcå"
stripped = original.scan(/[a-zA-Z]/).join
puts stripped
which outputs "abc"
An easier way to do this, inspired by Can Berk Güder's answer, is:
In order to delete special characters:
input = input.gsub(/\W/, '')
In order to keep word characters:
input = input.scan(/\w/).join
Either way, the resulting string is the same. Try it on http://rubular.com/