Is there a way to add a space after each comma in a string, but only if one isn't already there?
Example:
word word,word,word,
Would end up as
word word, word, word,
Is there a function in ruby or rails to do this?
This will be used on hundreds of thousands of sentences, so it needs to be fast (performance would matter).
Use a negative lookahead to check that no space follows the comma, then replace the comma with a comma and a space.
print 'word word,word,word,'.gsub(/,(?![ ])/, ', ')
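One thing to watch with the negative lookahead: it also matches a comma at the very end of the string, so the result gains a trailing space. A minimal sketch of the difference follows. The helper name space_after_commas is hypothetical, and the positive lookahead /,(?=\S)/ is an alternative that does not come from the answer above.

```ruby
# Hypothetical helper: add a space only after commas that are directly
# followed by a non-whitespace character.
def space_after_commas(str)
  str.gsub(/,(?=\S)/, ', ')
end

sample = 'word word,word,word,'

# Negative lookahead: the final comma has no space after it, so it matches too.
with_negative = sample.gsub(/,(?![ ])/, ', ')

# Positive lookahead: the final comma has nothing after it, so it is left alone.
with_positive = space_after_commas(sample)

p with_negative # => "word word, word, word, "
p with_positive # => "word word, word, word,"
```

If the trailing space is unwanted, the positive form avoids a separate cleanup step.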
Just use a regular expression to replace all instances of "," not followed by a space with ", ".
str = "word word,word,word,"
str = str.gsub(/,([^ ])/, ', \1') # "word word, word, word,"
If the string contains no multiple adjacent spaces (or should not contain such), you don't need a regex:
"word word, word, word,".gsub(',', ', ').squeeze(' ')
#=> "word word, word, word, "
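Since the question mentions performance, it may be worth timing the candidates on your own data rather than guessing. The following is only a sketch using Ruby's standard Benchmark module; the iteration count and the sample sentence are arbitrary assumptions, and the numbers will vary by Ruby version and input.

```ruby
require 'benchmark'

sentence = 'word word,word,word,'
n = 10_000 # arbitrary, small illustrative iteration count

# Time each approach from the answers on the same input.
timings = {
  lookahead: Benchmark.realtime { n.times { sentence.gsub(/,(?![ ])/, ', ') } },
  capture:   Benchmark.realtime { n.times { sentence.gsub(/,([^ ])/, ', \1') } },
  squeeze:   Benchmark.realtime { n.times { sentence.gsub(',', ', ').squeeze(' ') } }
}

timings.each { |name, seconds| puts format('%-10s %.4fs', name, seconds) }
```

Keep in mind that the three approaches do not return identical strings when the input ends with a comma, so compare results as well as timings.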
Add missing space:
"word word,word,word,".gsub(/,(?=\w)/, ', ') # "word word, word, word,"
and remove the trailing comma if necessary:
"word word,word,word,".gsub(/,(?=\w)/, ', ').sub(/,\Z/, '') # "word word, word, word"
Related
I'm having some trouble to find the right pattern to get the string I want.
My starting string is :
,,,,C3:,D3,E3,F3,,
I would like to have
C3: [D3,E3,F3]
I would like to replace each leading comma with a double space
Replace the comma after the colon with a double space and a left square bracket
Replace the trailing commas with a right square bracket
For now, I tried this :
> a = ",,,,C3:,D3,E3,F3,,"
=> ",,,,C3:,D3,E3,F3,,"
> b = a.gsub(/^,*/, " ").gsub(/(?<=:),/, " [").gsub(/[,]*$/,"" ).gsub(/[ ]*$/, "]")
=> " C3: [D3,E3,F3]"
> b == " C3: [D3,E3,F3]"
=> false
I can't manage to replace each leading comma with a double space, which would give 8 spaces in this case.
Could you help me find the right regexp and, if possible, improve my code?
To replace each starting comma with a double space, you need to use the \G operator, i.e. .gsub(/\G,/, '  '). That operator tells the regex engine to match at the start of the string and then after each successful match. So, you only replace each consecutive comma at the beginning of the string with .gsub(/\G,/, '  ').
Then, you can add other replacements:
s.gsub(/\G,/, '  ').sub(/,+\z/, ']').sub(/:,+/, ': [')
See the IDEONE demo
s = ",,,,C3:,D3,E3,F3,,"
puts s.gsub(/\G,/, '  ').sub(/,+\z/, ']').sub(/:,+/, ': [')
Output:
        C3: [D3,E3,F3]
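To see what \G contributes on its own, here is a small sketch that applies only the first replacement and contrasts it with the ^,* pattern from the question. The variable names are illustrative.

```ruby
s = ',,,,C3:,D3,E3,F3,,'

# \G anchors each match to where the previous one ended, so only the
# leading run of commas is replaced, one comma at a time.
leading_only = s.gsub(/\G,/, '  ')

# For contrast: ^,* matches the whole leading run once, so the four
# commas collapse into a single double space.
whole_run = s.gsub(/^,*/, '  ')

p leading_only # => "        C3:,D3,E3,F3,,"
p whole_run    # => "  C3:,D3,E3,F3,,"
```

The commas later in the string are untouched because, once a match attempt fails at "C", no later position is contiguous with the previous match.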
To construct the desired string, one needs to know:
the number of leading commas (the size of the string comprised of the leading commas)
the string following the leading commas up to and including the colon
the string between the comma following the colon and two or more commas
It is a simple matter to construct a regex that saves each of these three strings to a capture group:
r = /
    (,*)   # match leading commas in capture group 1
    (.+:)  # match up to and including the colon in capture group 2
    ,      # match comma
    (.+)   # match any number of any characters in capture group 3
    ,,     # match two commas
    /x     # extended/free-spacing regex definition mode
",,,,C3:,D3,E3,F3,," =~ r
We can now form the desired string from the contents of the three capture groups:
"#{'  '*$1.size}#{$2} [#{$3}]"
#=> "        C3: [D3,E3,F3]"
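The same idea can be written with named captures and a MatchData object instead of the $1..$3 globals. This is a sketch under assumptions: the method name bracketize and the group names are hypothetical, and the pattern is slightly stricter than the one above (it anchors both ends and accepts one or more trailing commas).

```ruby
# Hypothetical pattern: leading commas, a label ending in a colon,
# one comma, the items, then one or more trailing commas.
PATTERN = /\A(?<lead>,*)(?<label>[^,]+:),(?<items>.+?),+\z/

# Hypothetical helper: returns the line unchanged when it does not match.
def bracketize(line)
  m = PATTERN.match(line)
  return line unless m

  # Two spaces per leading comma, then the label and the bracketed items.
  "#{'  ' * m[:lead].size}#{m[:label]} [#{m[:items]}]"
end

result = bracketize(',,,,C3:,D3,E3,F3,,')
p result # => "        C3: [D3,E3,F3]"
```

Returning the input unchanged on a failed match is a design choice for the sketch; raising an error may suit stricter input validation better.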
I was learning regular expression in iOS, saw this tutorial:http://www.raywenderlich.com/30288/nsregularexpression-tutorial-and-cheat-sheet
It reads like this for \b:
\b matches word boundary characters such as spaces and punctuation. to\b will match the "to" in "to the moon" and "to!", but it will not match "tomorrow". \b is handy for "whole word" type matching.
and \s:
\s matches whitespace characters such as spaces, tabs, and newlines. hello\s will match "hello " in "Well, hello there!".
I have two questions on this:
1) what is the difference between \s and \b? when to use which?
2) \b is handy for "whole word" type matching -> Don't understand the meaning..
Need some guidance on these two.
\b Boundary characters
\b matches the boundary itself but not the boundary character (like a comma or period). It has no length in itself but can be used to find, for example, an e at the end of a word.
For example in the sentence: "Hello there, this is one test. Testing"
The regex e\b will match an e if it's at the end of a word (followed by a word boundary), so it matches the final e in "there" and the e in "one". The e in "test" and "Testing" doesn't match since it is not followed by a boundary.
\s Whitespace
\s on the other hand matches the actual white space characters (like spaces and tabs). In the same sentence it will match all the spaces between the words.
Edit
Since \b doesn't make much sense alone, I showed how to use it as e\b (above). The OP asked (in a comment) what e\s would match compared to e\b, to better explain the difference between \b and \s.
In the same string there is only one match for e\s (the e in "one" plus the following space), while there were two matches for e\b, since the comma is not whitespace. Note that the e\s match includes the white space, whereas the e\b match doesn't.
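The comparison can be reproduced in Ruby with String#scan, using the same sentence. This is just a sketch with illustrative variable names.

```ruby
sentence = 'Hello there, this is one test. Testing'

# e\b: an "e" followed by a word boundary; the boundary itself has zero width.
boundary_matches = sentence.scan(/e\b/)

# e\s: an "e" followed by a whitespace character, which is part of the match.
space_matches = sentence.scan(/e\s/)

p boundary_matches # => ["e", "e"]
p space_matches    # => ["e "]
```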
\b matches a word boundary. It is a zero-width assertion, meaning it does not match a character; it matches a position where a certain condition is true.
\b is related to \w. \w defines "word characters": letters, digits and underscores. So \b matches at a change from a word character to a non-word character, or the other way round. That means it matches at the start and end of a word, but not the character before or after the word.
\s is a predefined character class that matches any whitespace character.
See and try out what \bFoo\b matches here on Regexr
See and try out what \sFoo\s matches here on Regexr
\b is zero-width. That is, it doesn't actually match any character. Meanwhile, \s does match a character. This is an important distinction for capturing and more complicated regular expressions.
For example, say you're trying to match numbers that begin with multiple zeros, like 007 or 000101101. You might try:
0+\d*
But see, that would also match 1007 and 101000101101! So then, you might try:
\s0+\d*
But see how that wouldn't match a 007 at the beginning of the string (because there's no space character)? Using \b allows you to get the "whole word (or number)":
\b0+\d*
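Here is how the three patterns behave in Ruby on a sample string. The sample text is made up for illustration.

```ruby
text = '007 1007 000101101 42'

# No anchor: also finds the "007" inside "1007".
unanchored = text.scan(/0+\d*/)

# Leading \s: misses "007" at the start and includes the space in the match.
with_space = text.scan(/\s0+\d*/)

# Leading \b: only numbers that begin with a zero, with no extra characters.
with_boundary = text.scan(/\b0+\d*/)

p unanchored    # => ["007", "007", "000101101"]
p with_space    # => [" 000101101"]
p with_boundary # => ["007", "000101101"]
```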
\b matches the position between a word character (letter, digit or underscore) and a non-word character, without including any character in the match.
\s matches only white space.
For example:
\b would match between a word character and any of these: "!?,.##$%^&*()+ ".
$text = "Hello, Yo! moo .";
$regex = "~o\b~";
^---Will match all three o's.
$text = "Hello, Yo! moo .";
$regex = "~o\s~";
^---Will only match the 'o' in 'moo'.
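A Ruby version of the same check, as a sketch. The two extra strings, 'foo_bar' and 'foo-bar', are added here to illustrate that the underscore counts as a word character, so there is no boundary next to it.

```ruby
text = 'Hello, Yo! moo .'

o_boundary = text.scan(/o\b/) # "o" at the end of "Hello", "Yo" and "moo"
o_space    = text.scan(/o\s/) # only the last "o" of "moo", plus the space

# "_" is a word character, so "o_" contains no boundary; "-" is not.
underscore = 'foo_bar'.scan(/o\b/)
hyphen     = 'foo-bar'.scan(/o\b/)

p o_boundary.size # => 3
p o_space         # => ["o "]
p underscore      # => []
p hyphen          # => ["o"]
```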
I need to take/cut first 300 words or characters from a string.
That means, I need a limited number of characters from a string, from the beginning.
Something like truncating.
Is there a function to do this?
str = "many words here words words words ..."
first_500_words = str.split(" ").first(500).join(" ")
first_500_chars = str[0...500] # exclusive range: 0..500 would return 501 characters
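A small sketch of both variants with the count passed as a parameter. The helper names first_words and first_chars are hypothetical. Note that the two-argument slice text[0, n] returns at most n characters, whereas an inclusive range 0..n covers n + 1.

```ruby
# Hypothetical helper: first `count` whitespace-separated words.
def first_words(text, count)
  text.split.first(count).join(' ')
end

# Hypothetical helper: first `count` characters.
def first_chars(text, count)
  text[0, count]
end

p first_words('one two three four', 2) # => "one two"
p first_chars('abcdef', 3)             # => "abc"
p 'abcdef'[0..3].length                # => 4
```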
Depending on the size of your text and performance needs, one option is @text.split(/\s+/).slice(0,300).join(' ')
If you actually want to truncate on character level, which is advisable because different words differ in display length quite a bit, use:
def truncate_words(text, length = 300, end_string = ' …')
  words = text.split()
  words[0..(length-1)].join(' ') + (words.length > length ? end_string : '')
end
which I found here: http://snippets.dzone.com/posts/show/804
If you're using Rails, you can also use string.truncate, but by default it does not take word boundaries into account.
str = "this is really long string which I want to truncate..."
str.truncate 300, separator: " "
or, if you prefer to use parentheses:
str.truncate(300, separator: " ")
It's the most elegant solution of all above. As you mentioned in the topic, you use Rails so it will work. If you code in raw Ruby, you should write something like this:
str.split.first(300).join " "
The split method needs no argument if you want to split the text on whitespace.
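Outside Rails, a character limit that respects word boundaries can be approximated in plain Ruby. The sketch below is not the ActiveSupport implementation and may differ from it in edge cases; the name truncate_at_word and its omission: keyword are hypothetical, and it assumes limit is larger than the omission string.

```ruby
# Hypothetical helper: cut `text` to at most `limit` characters (including
# the omission marker), backing up to the last space so no word is split.
def truncate_at_word(text, limit, omission: '...')
  return text.dup if text.length <= limit

  cut = text[0, limit - omission.length]
  last_space = cut.rindex(' ')
  cut = cut[0, last_space] if last_space
  cut + omission
end

p truncate_at_word('this is a really long string', 15) # => "this is a..."
p truncate_at_word('short', 10)                        # => "short"
```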
I'm trying to parse words out of a string and put them into an array. I've tried the following thing:
@string1 = "oriented design, decomposition, encapsulation, and testing. Uses "
puts @string1.scan(/\s([^\,\.\s]*)/)
It seems to do the trick, but it's a bit shaky (I should include more special characters for example). Is there a better way to do so in ruby?
Optional: I have a CS course description. I intend to extract all the words out of it and place them in a string array, remove the most common words in the English language from the array produced, and then use the rest of the words as tags that users can use to search for CS courses.
The split command.
words = @string1.split(/\W+/)
will split the string into an array based on a regular expression. \W means any "non-word" character and the "+" means to combine multiple delimiters.
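Two details of split(/\W+/) are easy to miss: a string that starts with a delimiter yields a leading empty string, and apostrophes count as delimiters. A short sketch, with scan(/\w+/) shown as an alternative that is not part of the answer above:

```ruby
text = 'oriented design, decomposition, encapsulation, and testing. Uses '
split_words = text.split(/\W+/)

# A leading delimiter produces an empty first element with split ...
leading = ", can't stop".split(/\W+/)

# ... whereas scan collects only the word runs.
scanned = ", can't stop".scan(/\w+/)

p split_words
# => ["oriented", "design", "decomposition", "encapsulation", "and", "testing", "Uses"]
p leading # => ["", "can", "t", "stop"]
p scanned # => ["can", "t", "stop"]
```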
For me, the best way to split sentences is:
line.split(/[^[[:word:]]]+/)
It works perfectly even with multilingual words and punctuation marks:
line = 'English words, Polski Żurek!!! crème fraîche...'
line.split(/[^[[:word:]]]+/)
=> ["English", "words", "Polski", "Żurek", "crème", "fraîche"]
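The reason [[:word:]] helps here is that Ruby's \w only covers ASCII letters, digits and underscore, while the POSIX bracket class is Unicode-aware on UTF-8 strings. A sketch of the difference using scan on the same line (the variable names are illustrative):

```ruby
line = 'English words, Polski Żurek!!! crème fraîche...'

# Unicode-aware word characters keep accented words intact.
unicode_words = line.scan(/[[:word:]]+/)

# ASCII-only \w breaks words at every non-ASCII letter.
ascii_words = line.scan(/\w+/)

p unicode_words
# => ["English", "words", "Polski", "Żurek", "crème", "fraîche"]
p ascii_words
# => ["English", "words", "Polski", "urek", "cr", "me", "fra", "che"]
```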
Well, you could split the string on spaces if that's your delimiter of interest
@string1.split(' ')
Or split on word boundaries
\W # Any non-word character
\b # Any word boundary (a zero-width position, not a character)
Or on non-words
\s # Any whitespace character
Hint: try testing each of these on http://rubular.com
And note that ruby 1.9 has some differences from 1.8
For Rails you can use something like this:
@string1.split(/\s/).delete_if(&:blank?)
I would write something like this:
@string
  .split(/,+|\s+/) # any ',' or any whitespace characters (space, tab, newline)
  .reject(&:empty?)
  .map { |w| w.gsub(/\W+$|^\W+^*/, '') } # \W+$ => any trailing punctuation; ^\W+^* => any leading punctuation
irb(main):047:0> @string1 = "oriented design, 'with', !!qwe, and testing. can't rubyisgood#)(*#%)(*, and,rails,is,good"
=> "oriented design, 'with', !!qwe, and testing. can't rubyisgood#)(*#%)(*, and,rails,is,good"
irb(main):048:0> @string1.split(/,+|\s+/).reject(&:empty?).map { |w| w.gsub(/\W+$|^\W+^*/, '')}
=> ["oriented", "design", "with", "qwe", "and", "testing", "can't", "rubyisgood", "and", "rails", "is", "good"]
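For the optional part of the question (turning a course description into search tags), one possible sketch is to lowercase the text, extract the words, and drop common words. The STOP_WORDS list below is a tiny made-up sample rather than a real stop-word list, and the method name tags_from is hypothetical.

```ruby
require 'set'

# Illustrative sample only; a real list of common English words is far longer.
STOP_WORDS = Set.new(%w[a an and the of to in is uses]).freeze

# Hypothetical helper: unique, lowercased words minus the stop words.
def tags_from(description)
  description.downcase
             .scan(/[[:word:]]+/)
             .reject { |word| STOP_WORDS.include?(word) }
             .uniq
end

tags = tags_from('oriented design, decomposition, encapsulation, and testing. Uses ')
p tags # => ["oriented", "design", "decomposition", "encapsulation", "testing"]
```

A Set keeps the membership test cheap when the stop-word list grows.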
Is there anything better than string.scan(/(\w|-)+/).size (the - is so, e.g., "one-way street" counts as 2 words instead of 3)?
string.split.size
Edited to explain multiple spaces
From the Ruby String Documentation page
split(pattern=$;, [limit]) → anArray
Divides str into substrings based on a delimiter, returning an array
of these substrings.
If pattern is a String, then its contents are used as the delimiter
when splitting str. If pattern is a single space, str is split on
whitespace, with leading whitespace and runs of contiguous whitespace
characters ignored.
If pattern is a Regexp, str is divided where the pattern matches.
Whenever the pattern matches a zero-length string, str is split into
individual characters. If pattern contains groups, the respective
matches will be returned in the array as well.
If pattern is omitted, the value of $; is used. If $; is nil (which is
the default), str is split on whitespace as if ' ' were specified.
If the limit parameter is omitted, trailing null fields are
suppressed. If limit is a positive number, at most that number of
fields will be returned (if limit is 1, the entire string is returned
as the only entry in an array). If negative, there is no limit to the
number of fields returned, and trailing null fields are not
suppressed.
" now's the time".split #=> ["now's", "the", "time"]
While that is the current version of ruby as of this edit, I learned on 1.7 (IIRC), where that also worked. I just tested it on 1.8.3.
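The documented behaviours of split can be checked with a few one-liners. This is a sketch with arbitrary sample strings.

```ruby
whitespace = " now's  the time".split  # leading and repeated whitespace ignored
trailing   = 'a,b,,'.split(',')        # trailing empty fields suppressed
negative   = 'a,b,,'.split(',', -1)    # negative limit keeps them
limited    = 'a,b,c'.split(',', 2)     # at most two fields
grouped    = 'a-b'.split(/(-)/)        # group matches are returned as well
chars      = 'abc'.split(//)           # zero-length match splits into characters

p whitespace # => ["now's", "the", "time"]
p trailing   # => ["a", "b"]
p negative   # => ["a", "b", "", ""]
p limited    # => ["a", "b,c"]
p grouped    # => ["a", "-", "b"]
p chars      # => ["a", "b", "c"]
```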
I know this is an old question, but this might be useful to someone else looking for something more sophisticated than string.split. I wrote the words_counted gem to solve this particular problem, since defining words is pretty tricky.
The gem lets you define your own custom criteria, or use the out of the box regexp, which is pretty handy for most use cases. You can pre-filter words with a variety of options, including a string, lambda, array, or another regexp.
counter = WordsCounted::Counter.new("Hello, Renée! 123")
counter.word_count #=> 2
counter.words #=> ["Hello", "Renée"]
# filter the word "hello"
counter = WordsCounted::Counter.new("Hello, Renée!", reject: "Hello")
counter.word_count #=> 1
counter.words #=> ["Renée"]
# Count numbers only
counter = WordsCounted::Counter.new("Hello, Renée! 123", regexp: /[0-9]/)
counter.word_count #=> 1
counter.words #=> ["123"]
The gem provides a bunch more useful methods.
If the 'word' in this case can be described as an alphabetic sequence which can include '-', then the following solution may be appropriate (assuming that everything that doesn't match the 'word' pattern is a separator):
>> 'one-way street'.split(/[^-a-zA-Z]/).size
=> 2
>> 'one-way street'.split(/[^-a-zA-Z]/).each { |m| puts m }
one-way
street
=> ["one-way", "street"]
However, there are some other symbols that can be included in the regex - for example, ' to support the words like "it's".
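One way to allow both hyphens and apostrophes inside a word, while not counting a stray "-" on its own, is to scan for words instead of splitting on separators. The pattern and the name count_words below are assumptions made for this sketch, and the pattern is limited to ASCII letters.

```ruby
# Letters, optionally joined by single apostrophes or hyphens.
WORD = /[a-zA-Z]+(?:['-][a-zA-Z]+)*/

# Hypothetical helper: number of words under the definition above.
def count_words(text)
  text.scan(WORD).size
end

p count_words("it's a one-way street") # => 4
p count_words('rock - paper')          # => 2
```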
This is pretty simplistic but does the job if you are typing words with spaces in between. It ends up counting numbers as well but I'm sure you could edit the code to not count numbers.
puts "enter a sentence to find its word length: "
word = gets
word = word.chomp
splits = word.split(" ")
target = splits.length.to_s
puts "your sentence is " + target + " words long"
The best way to do this is to use the split method.
split divides a string into substrings based on a delimiter, returning an array of the substrings.
split takes two parameters: pattern and limit.
pattern is the delimiter on which the string is split into an array.
limit specifies the maximum number of elements in the resulting array.
For more details, refer to the Ruby String documentation.
str = "This is a string"
str.split(' ').size
#output: 4
The above code splits the string on whitespace, so the size of the resulting array is the number of words in the string.
The above solution is wrong when separators are adjacent. Consider the following string, which has two spaces:
"one-way  street"
You will get
["one-way", "", "street"]
Use
'one-way  street'.gsub(/[^-a-zA-Z]/, ' ').split.size
This splits words only on ASCII whitespace chars:
p " some word\nother\tword|word".strip.split(/\s+/).size #=> 4