Best way to count words in a string in Ruby? - ruby-on-rails

Is there anything better than string.scan(/(\w|-)+/).size (the - is so, e.g., "one-way street" counts as 2 words instead of 3)?

string.split.size
Edited to explain multiple spaces
From the Ruby String Documentation page
split(pattern=$;, [limit]) → anArray
Divides str into substrings based on a delimiter, returning an array
of these substrings.
If pattern is a String, then its contents are used as the delimiter
when splitting str. If pattern is a single space, str is split on
whitespace, with leading whitespace and runs of contiguous whitespace
characters ignored.
If pattern is a Regexp, str is divided where the pattern matches.
Whenever the pattern matches a zero-length string, str is split into
individual characters. If pattern contains groups, the respective
matches will be returned in the array as well.
If pattern is omitted, the value of $; is used. If $; is nil (which is
the default), str is split on whitespace as if ' ' were specified.
If the limit parameter is omitted, trailing null fields are
suppressed. If limit is a positive number, at most that number of
fields will be returned (if limit is 1, the entire string is returned
as the only entry in an array). If negative, there is no limit to the
number of fields returned, and trailing null fields are not
suppressed.
" now's the time".split #=> ["now's", "the", "time"]
While that is the current version of ruby as of this edit, I learned on 1.7 (IIRC), where that also worked. I just tested it on 1.8.3.

I know this is an old question, but this might be useful to someone else looking for something more sophisticated than string.split. I wrote the words_counted gem to solve this particular problem, since defining words is pretty tricky.
The gem lets you define your own custom criteria, or use the out of the box regexp, which is pretty handy for most use cases. You can pre-filter words with a variety of options, including a string, lambda, array, or another regexp.
counter = WordsCounted::Counter.new("Hello, Renée! 123")
counter.word_count #=> 2
counter.words #=> ["Hello", "Renée"]
# filter the word "hello"
counter = WordsCounted::Counter.new("Hello, Renée!", reject: "Hello")
counter.word_count #=> 1
counter.words #=> ["Renée"]
# Count numbers only
counter = WordsCounted::Counter.new("Hello, Renée! 123", rexexp: /[0-9]/)
counter.word_count #=> 1
counter.words #=> ["123"]
The gem provides a bunch more useful methods.

If the 'word' in this case can be described as an alphanumeric sequence which can include '-' then the following solution may be appropriate (assuming that everything that doesn't match the 'word' pattern is a separator):
>> 'one-way street'.split(/[^-a-zA-Z]/).size
=> 2
>> 'one-way street'.split(/[^-a-zA-Z]/).each { |m| puts m }
one-way
street
=> ["one-way", "street"]
However, there are some other symbols that can be included in the regex - for example, ' to support the words like "it's".

This is pretty simplistic but does the job if you are typing words with spaces in between. It ends up counting numbers as well but I'm sure you could edit the code to not count numbers.
puts "enter a sentence to find its word length: "
word = gets
word = word.chomp
splits = word.split(" ")
target = splits.length.to_s
puts "your sentence is " + target + " words long"

The best way to do is to use split method.
split divides a string into sub-strings based on a delimiter, returning an array of the sub-strings.
split takes two parameters, namely; pattern and limit.
pattern is the delimiter over which the string is to be split into an array.
limit specifies the number of elements in the resulting array.
For more details, refer to Ruby Documentation: Ruby String documentation
str = "This is a string"
str.split(' ').size
#output: 4
The above code splits the string wherever it finds a space and hence it give the number of words in the string which is indirectly the size of the array.

The above solution is wrong, consider the following:
"one-way street"
You will get
["one-way","", "street"]
Use
'one-way street'.gsub(/[^-a-zA-Z]/, ' ').split.size

This splits words only on ASCII whitespace chars:
p " some word\nother\tword|word".strip.split(/\s+/).size #=> 4

Related

How to find last occurrence of a substring in a given string?

I have a string, which describe some word, I must change ending of it to "sd", if ending == "jk".
For an example, I have word: "lazerjk", I need to get from it "lazersd".
I tried to use method .gsub!, but it doesn't work correctly if we have more than one occurrence of substring "jk" in a word.
String#rindex returns the index of the last occurrence of the given substring
String#[]= can take two integers arguments, first is index where start to replace and second - length of replaced string
You can use them this way:
replaced = "foo"
replacing = "booo"
string = "foo bar foo baz"
string[string.rindex(replaced), replaced.size] = replacing
string
# => "foo bar booo baz"
"jughjkjkjk\njk".sub(/jk$\z/, 'sd')
=> "jughjkjkjk\nsd"
without $ is probably sufficient.
It sounds like you're looking to replace a specific suffix only. If so, I would probably suggest using sub along with an anchored regex (to check for the desired characters only at the end of the string):
string_1 = "lazerjk"
string_2 = "lazerjk\njk"
string_3 = "lazerjkr"
string_1.sub(/jk\z/, "sd")
#=> "lazersd"
string_2.sub(/jk\z/, "sd")
#=> "lazerjk\nsd"
string_3.sub(/jk\z/, "sd")
#=> "lazerjkr"
Or, you could do without a regex at all by using the reverse! method along with a simple conditional statement to sub! only when the suffix is present:
string = "lazerjk"
old_suffix = "jk"
new_suffix = "sd"
string.reverse!.sub!(old_suffix.reverse, new_suffix.reverse).reverse! if string.end_with? (old_suffix)
string
#=> "lazersd"
OR, you could even use a completely different approach. Here's an example using chomp to remove the unwanted suffix and then ljust to pad the desired suffix to the modified string.
string = "lazerjk"
string.chomp("jk").ljust(string.length, "sd")
#=> "lazersd"
Note that the new suffix only gets added if the length of the string was modified with the initial chomp. Otherwise, the string remains unchanged.
If the goal is to substitute the LAST OCCURRENCE (as opposed to suffix only), then this could be accomplished by using sub along with reverse:
string = "jklazerjkm"
old_substring = "jk"
new_substring = "sd"
string.reverse.sub(old_substring.reverse, new_substring.reverse).reverse
#=> "jklazersdm"
Replacing "jk" at the end of a string with something else is straightforward and can be addressed without concern for other instances of "jk" that may be in the string, so I assume that is not what is being asked. Rather, I assume the problem is to replace the last instance of "jk" in a string with "sd".
Here are two solutions that make use of String#sub with a regular expression.
Use a negative lookahead
The idea here is to match "jk" provided it is not followed later in the string by another instance of "jk".
"lajkz\nejkrjklm".sub(/jk(?!.*jk)/m, "sd")
#=> "lajkz\nejkrsdlm"
Capture the part of the string that precedes the last "jk"
The match, if there is one, consists of the front of the string followed by the last "jk", which is replaced by the captured string followed by "sd".
"lajkz\nejkrjklm".sub(/\A(.*)jk/m) { $1 + "sd" }
#=> "lajkz\nejkrsdlm"
The two regular expressions can be written in free-spacing mode to make them self-documenting. The first is the following.
/
jk # match literal
(?! # begin a negative lookahead
.* # match zero or more characters other than line terminators
jk # match literal
) # end negative lookahead
/mx # invoke multiline and free-spacing regex definition modes.
Multiline mode causes . to match any character, including a line terminator.
The second regular expression can be written as follows.
\A # match the beginning of the string
(.*) # match zero or more characters other than line terminators
# and save the match to capture group 1
jk # match literal
/mx # invoke multiline and free-spacing regex definition modes.
Note that in both expressions .* is greedy, meaning that it will match as many characters as possible, including "jk" so long as other requirements of the expression are met, here that the last instance of "jk" in the string is matched.
Here is a different solution:
str = "jughjkjkjk\njk"
pattern = "jk"
replace_with = "sd"
str = str.reverse.sub(pattern.reverse, replace_with.reverse).reverse

How to find a word in a single long string?

I want to be able to copy and paste a large string of words from say a text document where there are spaces, returns and not commas between each and every word. Then i want to be able to take out each word individually and put them in a table for example...
input:
please i need help
output:
{1, "please"},
{2, "i"},
{3, "need"},
{4, "help"}
(i will have the table already made with the second column set to like " ")
havent tried anything yet as nothing has come to mind and all i could think of was using gsub to turn spaces into commas and find a solution from there but again i dont think that would work out so well.
Your delimiters are spaces ( ), commas (,) and newlines (\n, sometimes \r\n or \r, the latter very rarely). You now want to find words delimited by these delimiters. A word is a sequence of one or more non-delimiter characters. This trivially translates to a Lua pattern which can be fed into gmatch. Paired with a loop & inserting the matches in a table you get the following:
local words = {}
for word in input:gmatch"[^ ,\r\n]+" do
table.insert(words, word)
end
if you know that your words are gonna be in your locale-specific character set (usually ASCII or extended ASCII), you can use Lua's %w character class for matching sequences of alphanumeric characters:
local words = {}
for word in input:gmatch"%w+" do
table.insert(words, word)
end
Note: The resulting table will be in "list" form:
{
[1] = "first",
[2] = "second",
[3] = "third",
}
(for which {"first", "second", "third"} would be shorthand)
I don't see any good reasons for the table format you have described, but it can be trivially created by inserting tables instead of strings into the list.

How to capture a string between signs in lua?

how can I extract a few words separated by symbols in a string so that nothing is extracted if the symbols change?
for example I wrote this code:
function split(str)
result = {};
for match in string.gmatch(str, "[^%<%|:%,%FS:%>,%s]+" ) do
table.insert(result, match);
end
return result
end
--------------------------Example--------------------------------------------
str = "<busy|MPos:-750.222,900.853,1450.808|FS:2,10>"
my_status={}
status=split(str)
for key, value in pairs(status) do
table.insert(my_status,value)
end
print(my_status[1]) --
print(my_status[2]) --
print(my_status[3]) --
print(my_status[4]) --
print(my_status[5]) --
print(my_status[6]) --
print(my_status[7]) --
output :
busy
MPos
-750.222
900.853
1450.808
2
10
This code works fine, but if the characters and text in the str string change, the extraction is still done, which I do not want to be.
If the string change to
str = "Hello stack overFlow"
Output:
Hello
stack
over
low
nil
nil
nil
In other words, I only want to extract if the string is in this format: "<busy|MPos:-750.222,900.853,1450.808|FS:2,10>"
In lua patterns, you can use captures, which are perfect for things like this. I use something like the following:
--------------------------Example--------------------------------------------
str = "<busy|MPos:-750.222,900.853,1450.808|FS:2,10>"
local status, mpos1, mpos2, mpos3, fs1, fs2 = string.match(str, "%<(%w+)%|MPos:(%--%d+%.%d+),(%--%d+%.%d+),(%--%d+%.%d+)%|FS:(%d+),(%d+)%>")
print(status, mpos1, mpos2, mpos3, fs1, fs2)
I use string.match, not string.gmatch here, because we don't have an arbitrary number of entries (if that is the case, you have to have a different approach). Let's break down the pattern: All captures are surrounded by parantheses () and get returned, so there are as many return values as captures. The individual captures are:
the status flag (or whatever that is): busy is a simple word, so we can use the %w character class (alphanumeric characters, maybe %a, only letters would also do). Then apply the + operator (you already know that one). The + is within the capture
the three numbers for the MPos entry each get (%--%d+%.%d+), which looks weird at first. I use % in front of any non-alphanumeric character, since it turns all magic characters (such as + into normal ones). - is a magic character, so it is required here to match a literal -, but lua allows to put that in front of any non-alphanumerical character, which I do. So the minus is optional, so the capture starts with %-- which is one or zero repetitions (- operator) of a literal - (%-). Then I just match two integers separated by a dot (%d is a digit, %. matches a literal dot). We do this three times, separated by a comma (which I don't escape since I'm sure it is not a magical character).
the last entry (FS) works practically the same as the MPos entry
all entries are separated by |, which I simply match with %|
So putting it together:
start of string: %<
status field: (%w+)
separator: %|
MPos (three numbers): MPos:(%--%d+%.%d+),(%--%d+%.%d+),(%--%d+%.%d+)
separator: %|
FS entry (two integers): FS:(%d+),(%d+)
end of string: %>
With this approach you have the data in local variables with sensible names, which you can then put into a table (for example).
If the match failes (for instance, when you use "Hello stack overFlow"), nil` is returned, which can simply be checked for (you could check any of the local variables, but it is common to check the first one.

Match a word or whitespaces in Lua

(Sorry for my broken English)
What I'm trying to do is matching a word (with or without numbers and special characters) or whitespace characters (whitespaces, tabs, optional new lines) in a string in Lua.
For example:
local my_string = "foo bar"
my_string:match(regex) --> should return 'foo', ' ', 'bar'
my_string = " 123!#." -- note: three whitespaces before '123!#.'
my_string:match(regex) --> should return ' ', ' ', ' ', '123!#.'
Where regex is the Lua regular expression pattern I'm asking for.
Of course I've done some research on Google, but I couldn't find anything useful. What I've got so far is [%s%S]+ and [%s+%S+] but it doesn't seem to work.
Any solution using the standart library, e.g. string.find, string.gmatch etc. is OK.
Match returns either captures or the whole match, your patterns do not define those. [%s%S]+ matches "(space or not space) multiple times more than once", basically - everything. [%s+%S+] is plain wrong, the character class [ ] is a set of single character members, it does not treat sequences of characters in any other way ("[cat]" matches "c" or "a"), nor it cares about +. The [%s+%S+] is probably "(a space or plus or not space or plus) single character"
The first example 'foo', ' ', 'bar' could be solved by:
regex="(%S+)(%s)(%S+)"
If you want a variable number of captures you are going to need the gmatch iterator:
local capt={}
for q,w,e in my_string:gmatch("(%s*)(%S+)(%s*)") do
if q and #q>0 then
table.insert(capt,q)
end
table.insert(capt,w)
if e and #e>0 then
table.insert(capt,e)
end
end
This will not however detect the leading spaces or discern between a single space and several, you'll need to add those checks to the match result processing.
Lua standard patterns are simplistic, if you are going to need more intricate matching, you might want to have a look at lua lpeg library.

Ruby .scan method returns empty using regex

So given a string like this "\"turkey AND ham\" NOT \"roast beef\"" I need to get an array with the inner strings like so: ["turkey AND ham", "roast beef"] and eliminate OR's, AND's and NOT's that may or may not be there.
With the help of Rubular I came up with this regex /\\["']([^"']*)\\["']/
which returns the following 2 groups:
Match 1
1. turkey AND ham
Match 2
1. roast beef
however when I use it with .scan keep getting and empty array.
I looked at this and this other SO posts, and a few others, but can not figure out where I am going wrong
Here is the result from my rails console:
=> q = "\"turkey and ham\" OR \"roast beef\""
=> q.scan(/\\["']([^"']*)\\["']/)
=> []
Expectation:
["turkey AND ham", "roast beef"]
I shall also mention I suck at regex.
When the regex used with scan contains a capture group (#davidhu2000's approach), one generally can use lookarounds1 instead. It's just a matter of personal preference. To allow for double-quoted strings that contain either single- or (escaped) double-quoted strings, you could use the following regex.
r = /
(?<=") # match a double quote in a positive lookbehind
[^"]+ # match one or more characters that are not double-quotes
(?=") # match a double quote in a positive lookahead
| # or
(?<=') # match a single quote in a positive lookbehind
[^']+ # match one or more characters that are not single-quotes
(?=') # match a single quote in a positive lookahead
/x # free-spacing regex definition mode
"\"turkey AND ham\" NOT 'roast beef'".scan(r)
#=> ["turkey AND ham", "roast beef"]
As '"turkey AND ham" NOT "roast beef"' #=> "\"turkey AND ham\" NOT \"roast beef\"" (i.e., how the single-quoted string is saved), we need not be concerned about that being an additional case to deal with.
1 For any in the audience who still consider regular expressions to be black magic, there are four kinds of lookarounds (positive and negative lookbehinds and lookaheads) as elaborated in the doc for Regexp. Sometimes they are regarded as "zero-width" matches as they are not part of the matched text.
You regex is trying to match \, which won't match anything in the string, since the \ existed to escape the double quote, and won't be part of the string.
So if you remove \\ in your regex
res = q.scan(/["']([^"']*)["']/)
This will return a 2d array
res = [["turkey and ham"], ["roast beef"]]
Each inner array is all the matching groups from the regex, so if you have two capture groups in your regex, you will see two items in the inner array.
If you want a simple array, you can run flatten method on the array.

Resources