If I have some text that I want to print out on a page, but only want to print, say, the first 100 words before truncating it with an ellipsis, what's the easiest way to do this?
How's this for a start:
def first_words(s, n)
a = s.split(/\s/) # or /[ ]+/ to only split on spaces
a[0...n].join(' ') + (a.size > n ? '...' : '')
end
s = "The quick brown fox jumps over the lazy dog. " * 20
puts "#{s.size}, #{s.split(/\s/).size}"
#-> 900, 180
puts first_words(s, 10)
#-> The quick brown fox jumps over the lazy dog. The...
puts first_words("a b c d", 10)
#-> a b c d
You have a couple of options. One is to assume that a word is n characters long, take a substring of the corresponding length, append the ellipsis, and display it. The other is to run through the string and count the spaces: if you assume there is only one space between words, the 100th space comes right after the 100th word, so you cut there, append the ellipsis, and you are done.
Which one performs better would likely depend on how the functions are written; most likely the substring operation is going to be faster than counting the spaces. However, the performance difference might be negligible, so unless you are doing this a lot, counting spaces would likely be the more accurate way to go.
Also, just as a reference, the average length of a word in the English language is 5.1 characters.
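For what it's worth, the space-counting approach could look roughly like the sketch below. The method name truncate_at_nth_space is made up for illustration, and it assumes words are separated by exactly one space, as described above.

```ruby
# Hypothetical helper: cut the text at the n-th space.
# Assumes exactly one space between words.
def truncate_at_nth_space(text, n, ellipsis = '...')
  pos = -1
  n.times do
    # Look for the next space after the previous one.
    pos = text.index(' ', pos + 1)
    # Fewer than n spaces means n words or fewer: nothing to cut.
    return text if pos.nil?
  end
  text[0...pos] + ellipsis
end

puts truncate_at_nth_space("a b c d e", 3)
puts truncate_at_nth_space("a b", 3)
```

If the text can contain runs of spaces, tabs, or newlines, splitting into words is probably the safer route.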
short = text.slice(0...100) # first 100 characters (not words)
short += "..." if text.size > 100
puts short
http://www.ruby-doc.org/core/classes/String.html
I have a name of a show like oferson of interest.
In my code I am trying to split it into single words, then capitalize the first letter of each word, then join them back together with a space between each word, which gives: Oferson Of Interest. I then want to search for the word Of and replace it with a lowercase one.
The problem I can't seem to figure out is, at the end of the program I get oferson of Interest which isn't what I want. I just wanted the word "of" to be lower case not the first letter of the word "Oferson", simply put I wanted an output of Oferson of Interest not oferson of Interest.
How can I search for the single word 'of' not for every instance of the letters 'o' and 'f' in the sentence?
mine = 'oferson of interest'.split(' ').map {|w| w.capitalize }.join(' ')
if mine.include? "Of"
mine.gsub!(/Of/, 'of')
else
puts 'noting;'
end
puts mine
The simplest answer is to use word boundaries in your regular expression:
str = "oferson of interest".split.collect(&:capitalize).join(" ")
str.gsub!(/\bOf\b/i, 'of')
# => Oferson of Interest
You're dealing with "stop words": Words you don't want to process for some reason. Build a list of stopwords you want to ignore, and compare each word to them to see whether you want to do further processing to it:
require 'set'
STOPWORDS = %w[a for is of the to].to_set
TEXT = [
'A stitch in time saves nine',
'The quick brown fox jumped over the lazy dog',
'Now is the time for all good men to come to the aid of their country'
]
TEXT.each do |text|
puts text.split.map{ |w|
STOPWORDS.include?(w.downcase) ? w.downcase : w.capitalize
}.join(' ')
end
# >> a Stitch In Time Saves Nine
# >> the Quick Brown Fox Jumped Over the Lazy Dog
# >> Now is the Time for All Good Men to Come to the Aid of Their Country
That's a simple example, but shows the basics. In real life you'll want to handle punctuation, like hyphenated words.
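As a rough idea of what handling hyphenated words might look like, here is a sketch that applies the same stop-word check to each hyphen-separated part. The names STOPWORDS_SKETCH, titleize_word and titleize are made up for this example, and it does nothing about other punctuation such as commas or quotes.

```ruby
require 'set'

# Same stop words as above, under a different (hypothetical) constant name.
STOPWORDS_SKETCH = %w[a for is of the to].to_set

# Capitalize each hyphen-separated part unless it is a stop word.
def titleize_word(word)
  word.split('-').map { |part|
    STOPWORDS_SKETCH.include?(part.downcase) ? part.downcase : part.capitalize
  }.join('-')
end

# Apply the per-word rule to a whole string.
def titleize(text)
  text.split.map { |w| titleize_word(w) }.join(' ')
end

puts titleize('state-of-the-art design')
puts titleize('a one-way street')
```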
I used a Set because it stays extremely fast as the list of stop words grows. It's akin to a Hash, so the check is faster than using include? on an array:
require 'set'
require 'fruity'
LETTER_ARRAY = ('a' .. 'z').to_a
LETTER_SET = LETTER_ARRAY.to_set
compare do
array {LETTER_ARRAY.include?('0') }
set { LETTER_SET.include?('0') }
end
# >> Running each test 16384 times. Test will take about 2 seconds.
# >> set is faster than array by 10x ± 0.1
It gets more interesting when you want to protect the first letter of the resulting string, but the simple trick is to force just that letter back to uppercase if it matters:
require 'set'
STOPWORDS = %w[a for is of the to].to_set
TEXT = [
'A stitch in time saves nine',
'The quick brown fox jumped over the lazy dog',
'Now is the time for all good men to come to the aid of their country'
]
TEXT.each do |text|
str = text.split.map{ |w|
STOPWORDS.include?(w.downcase) ? w.downcase : w.capitalize
}.join(' ')
str[0] = str[0].upcase
puts str
end
# >> A Stitch In Time Saves Nine
# >> The Quick Brown Fox Jumped Over the Lazy Dog
# >> Now is the Time for All Good Men to Come to the Aid of Their Country
This isn't a good task for a regular expression, unless you're dealing with very consistent text patterns. Since you're working on the names of TV shows, odds are good you're not going to find much consistency and your pattern would grow in complexity quickly.
I have a big text and I'd like to remove everything before a certain string.
The problem is, there are several occurrences of that string in the text, and I want to decide which one is correct by later analyzing the found piece of text.
I can't include that analysis in a regular expression because of its complexity:
text = <<HERE
big big text
goes here
HERE
pos = -1
a = text.scan(/some regexp/im)
a.each do |m|
s = m[0]
# analysis of found string
...
if ( s is good ) # is the right candidate
pos = ??? # here I'd like to have a position of the found string in the text.
end
end
result_text = text[pos..-1]
$~.offset(n) will give the position of the n-th part of a match.
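A small sketch of how that could fit into the loop from the question: when scan is given a block, $~ (also available as Regexp.last_match) is set for each match, so the position can be read inside the block. The sample text, the /MARK \w+/ pattern and the end_with? check are placeholders standing in for the real regexp and analysis.

```ruby
text = "intro MARK one\nmiddle MARK two\nend MARK three"

pos = nil
text.scan(/MARK \w+/) do |candidate|
  match = Regexp.last_match          # MatchData for the current candidate
  # Placeholder for the real analysis of the found string.
  next unless candidate.end_with?("two")
  pos = match.begin(0)               # same as match.offset(0).first
  break
end

# Keep everything from the chosen match onward (or the whole text if none fit).
result_text = pos ? text[pos..-1] : text
puts result_text
```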
I think you should count how many occurrences there are in your big string then use index to cut off all the occurrences that do not match the final pattern.
I managed to cobble together this method based on lots of help and copying and pasting. It returns the first x words of a string, and I'm using it as a helper in my app.
Could someone please help me understand how to add a condition so that, if the actual string has fewer than x words, the finishing bit (the …) is not added? In the method below, I'd like the 'finish' section to be added only if there are more words than the number passed in.
def first_x_words(str,n=20,finish='…')
str.split(' ')[0,n].inject{|sum,word| sum + ' ' + word} + finish
end
Actually - if I could make it more complicated, is it possible, after I find a condition where there are less than x words, to check to see if the last 4 characters are </p> and if they are, remove them.
Thanks,
Adam
This should do what you're looking for:
def first_x_words(str, n = 20, finish = '…')
# By default, Ruby will split on whitespace, so no
# argument needs to be passed.
words = str.split
# Rebuild the first 'n' words into a new string.
# (join also copes with an empty string, where inject would return nil.)
truncated = words.first(n).join(' ')
# Either append a finishing string or remove any
# trailing '</p>' tag.
if words.length > n
truncated << finish
else
truncated.chomp!("</p>")
end
# Return the completed string.
truncated
end
It's messy, but if you really want to do it:
def first_x_words(str,n=20,finish='…')
# make finish blank if the text is short enough
finish = '' if str.split(' ').count <= n
result = str.split(' ')[0,n].inject{|sum,word| sum + ' ' + word} + finish
# remove trailing </p> if any (chomp on the result, not on the original str)
result.chomp('</p>')
end
I added one line of code before and after your original code so you can hopefully understand it better.
Given a string like:
"#[19:Sara Mas] what's the latest with the TPS report? #[30:Larry Peters] can you help out here?"
I want to find a way to dynamically return each tagged user and the content that follows the tag. Results should be:
user_id: 19
copy: what's the latest with the TPS report?
user_id: 30
copy: can you help out here?
Any ideas on how this can be done with ruby/rails? Thanks
How is this regex for finding matches?
#\[\d+:\w+\s\w+\]
Split the string, then handle the content iteratively. I don't think it'd take more than:
tmp = string.split('#').reject(&:empty?).map {|str| [str[/\[(\d+)/,1], str[/\](.*)/,1].to_s.strip] }
tmp.first #=> ["19", "what's the latest with the TPS report?"]
Does that help?
result = subject.scan(/\[(\d+).*?\](.*?)(?=#|\Z)/m)
This grabs the id and the content in backreferences 1 and 2 respectively. The capture stops when either a # or the end of the string is reached.
\[ # Match the character “[” literally
( # Match the regular expression below and capture its match into backreference number 1
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
. # Match any single character (including line breaks, because of the /m flag)
*? # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
\] # Match the character “]” literally
( # Match the regular expression below and capture its match into backreference number 2
. # Match any single character (including line breaks, because of the /m flag)
*? # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
# Match either the regular expression below (attempting the next alternative only if this one fails)
# # Match the character “#” literally
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
\Z # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
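To get from the scan result to the user_id / copy pairs asked for in the question, something like the following sketch might do. The variable name mentions and the hash keys are just illustrative choices.

```ruby
subject = "#[19:Sara Mas] what's the latest with the TPS report? " \
          "#[30:Larry Peters] can you help out here?"

# Each scan result is a pair: [id, text following the tag].
mentions = subject.scan(/\[(\d+).*?\](.*?)(?=#|\Z)/m).map do |id, copy|
  { user_id: id.to_i, copy: copy.strip }
end

mentions.each do |m|
  puts "user_id: #{m[:user_id]}"
  puts "copy: #{m[:copy]}"
end
```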
This will match everything starting from a # and ending at the next punctuation mark. Sorry if I didn't understand correctly.
result = subject.scan(/#.*?[.?!]/)
Is there anything better than string.scan(/(\w|-)+/).size (the - is so, e.g., "one-way street" counts as 2 words instead of 3)?
string.split.size
Edited to explain multiple spaces
From the Ruby String Documentation page
split(pattern=$;, [limit]) → anArray
Divides str into substrings based on a delimiter, returning an array
of these substrings.
If pattern is a String, then its contents are used as the delimiter
when splitting str. If pattern is a single space, str is split on
whitespace, with leading whitespace and runs of contiguous whitespace
characters ignored.
If pattern is a Regexp, str is divided where the pattern matches.
Whenever the pattern matches a zero-length string, str is split into
individual characters. If pattern contains groups, the respective
matches will be returned in the array as well.
If pattern is omitted, the value of $; is used. If $; is nil (which is
the default), str is split on whitespace as if ' ' were specified.
If the limit parameter is omitted, trailing null fields are
suppressed. If limit is a positive number, at most that number of
fields will be returned (if limit is 1, the entire string is returned
as the only entry in an array). If negative, there is no limit to the
number of fields returned, and trailing null fields are not
suppressed.
" now's the time".split #=> ["now's", "the", "time"]
While that is the current version of ruby as of this edit, I learned on 1.7 (IIRC), where that also worked. I just tested it on 1.8.3.
I know this is an old question, but this might be useful to someone else looking for something more sophisticated than string.split. I wrote the words_counted gem to solve this particular problem, since defining words is pretty tricky.
The gem lets you define your own custom criteria, or use the out of the box regexp, which is pretty handy for most use cases. You can pre-filter words with a variety of options, including a string, lambda, array, or another regexp.
counter = WordsCounted::Counter.new("Hello, Renée! 123")
counter.word_count #=> 2
counter.words #=> ["Hello", "Renée"]
# filter the word "hello"
counter = WordsCounted::Counter.new("Hello, Renée!", reject: "Hello")
counter.word_count #=> 1
counter.words #=> ["Renée"]
# Count numbers only
counter = WordsCounted::Counter.new("Hello, Renée! 123", rexexp: /[0-9]/)
counter.word_count #=> 1
counter.words #=> ["123"]
The gem provides a bunch more useful methods.
If the 'word' in this case can be described as an alphabetic sequence which can include '-', then the following solution may be appropriate (assuming that everything that doesn't match the 'word' pattern is a separator):
>> 'one-way street'.split(/[^-a-zA-Z]/).size
=> 2
>> 'one-way street'.split(/[^-a-zA-Z]/).each { |m| puts m }
one-way
street
=> ["one-way", "street"]
However, there are some other symbols that can be included in the regex - for example, ' to support the words like "it's".
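One way this could be extended, as a sketch only: instead of splitting on separators, scan for runs of letters that may be joined by a single ' or -. The pattern and the count_words name below are my own assumptions about what counts as a word, not a standard definition.

```ruby
# Letters, optionally joined by single apostrophes or hyphens.
WORD_PATTERN = /[[:alpha:]]+(?:['-][[:alpha:]]+)*/

# Hypothetical helper: count the matches of WORD_PATTERN.
def count_words(text)
  text.scan(WORD_PATTERN).size
end

puts count_words("it's a one-way street")
puts count_words("a - b")
```

A stand-alone hyphen is not counted here, and digits are ignored entirely.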
This is pretty simplistic but does the job if you are typing words with spaces in between. It ends up counting numbers as well but I'm sure you could edit the code to not count numbers.
puts "enter a sentence to find its word length: "
word = gets
word = word.chomp
splits = word.split(" ")
target = splits.length.to_s
puts "your sentence is " + target + " words long"
The best way to do this is to use the split method.
split divides a string into substrings based on a delimiter, returning an array of the substrings.
split takes two parameters: pattern and limit.
pattern is the delimiter on which the string is split into an array.
limit specifies the maximum number of elements in the resulting array.
For more details, refer to the Ruby String documentation.
str = "This is a string"
str.split(' ').size
#output: 4
The above code splits the string wherever it finds whitespace, so the size of the resulting array is the number of words in the string.
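A quick illustration of the limit parameter, in case it helps. This is only a sketch using made-up sample strings; for counting words you would normally leave limit out.

```ruby
str = "This is a string"

# A positive limit caps the number of fields; the rest stays in the last one.
first_two = str.split(' ', 2)

# Without a limit, trailing empty fields are dropped...
trimmed = "a,b,,".split(',')

# ...while a negative limit keeps them.
untrimmed = "a,b,,".split(',', -1)

p first_two
p trimmed
p untrimmed
```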
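The split(/[^-a-zA-Z]/) solution above goes wrong when two words are separated by more than one non-word character. Consider the following (note the two spaces):
"one-way  street"
You will get
["one-way", "", "street"]
Use
'one-way  street'.gsub(/[^-a-zA-Z]/, ' ').split.size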
This splits words only on ASCII whitespace chars:
p " some word\nother\tword|word".strip.split(/\s+/).size #=> 4
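If the text may contain non-ASCII whitespace (a non-breaking space, for example), the POSIX bracket class [[:space:]] could be used instead of \s. The sketch below uses a made-up sample string to show the difference.

```ruby
text = "some\u00A0word other"   # \u00A0 is a non-breaking space

# \s only knows ASCII whitespace, so the non-breaking space is not a separator.
ascii_count = text.split(/\s+/).size

# [[:space:]] also matches Unicode whitespace.
unicode_count = text.split(/[[:space:]]+/).size

puts ascii_count
puts unicode_count
```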