Alter regex markup to not separate float numbers (like 2.0) - ruby-on-rails

I was looking for a solution to a regex problem in Rails I had and an answer on a separate question lead me 90% of the path to the answer. Basically, what I would like to do is to have a ruby/rails script that will format a messy text in terms of capitalizing every letter after a "./,/!/?". This code by "Mark S"
ng = Nokogiri::HTML.fragment("<p>hello, how are you? oh, that's nice! i am glad you are fine. i am too.<br />i am glad to have met you.</p>")
ng.traverse{|n| (n.content = n.content.gsub(/(.*?)([\.|\!|\?])/) { " #{$1.strip.capitalize}#{$2}" }.strip) if n.text?}
ng.to_s
The only issue I have with this code, and it is a big issue, is that the code adds a space in between float numbers like "2.0", making a text like:
there is a cat in the hat.it has a 2.0 inch tail!
isn't that awesome?!I think so.
Become
There is a cat i the hat. It has a 2. 0 inch tail!
Isn't that awesome?! I think so.
where I obviously want it to be:
There is a cat i the hat. It has a 2.0 inch tail!
Isn't that awesome?! I think so.
Any suggestions on how to alter this text, for example so that any "." will be ignored by this code?

It seems you want to capitalize any lowercase letter at the beginning of the string or after ., !, or ?.
Use
s.gsub(/(\A|[.?!])(\p{Ll})/) { Regexp.last_match(1).length > 0 ? "#{$1} #{$2.capitalize}" : "#{$2.capitalize}" }
See the Ruby demo
Pattern details:
(\A|[.?!]) - Group 1 capturing the start of string location (empty string) or a ., ?, or !
(\p{Ll}) - Group 2 capturing any Unicode lowercase letter
Inside the replacement, we check if Group 1 value is not empty, and if it is, we just return the capitalized letter. Else, return the punctuation, a space, and the capitalized letter.
NOTE: However, there is a problem with abbreviations (as usual in these cases), like i.e., e.g., etc. Then there are words like iPhone, iCloud, eSklep, and so on.

Related

How to specify a range in Ruby

I've been looking for a good way to see if a string of items are all numbers, and thought there might be a way of specifying a range from 0 to 9 and seeing if they're included in the string, but all that I've looked up online has really confused me.
def validate_pin(pin)
(pin.length == 4 || pin.length == 6) && pin.count("0-9") == pin.length
end
The code above is someone else's work and I've been trying to identify how it works. It's a pin checker - takes in a set of characters and ensures the string is either 4 or 6 digits and all numbers - but how does the range work?
When I did this problem I tried to use to_a? Integer and a bunch of other things including ranges such as (0..9) and ("0..9) and ("0".."9") to validate a character is an integer. When I saw ("0-9) it confused the heck out of me, and half an hour of googling and youtube has only left me with regex tutorials (which I'm interested in, but currently just trying to get the basics down)
So to sum this up, my goal is to understand a more semantic/concise way to identify if a character is an integer. Whatever is the simplest way. All and any feedback is welcome. I am a new rubyist and trying to get down my fundamentals. Thank You.
Regex really is the right way to do this. It's specifically for testing patterns in strings. This is how you'd test "do all characters in this string fall in the range of characters 0-9?":
pin.match(/\A[0-9]+\z/)
This regex says "Does this string start and end with at least one of the characters 0-9, with nothing else in between?" - the \A and \z are start-of-string and end-of-string matchers, and the [0-9]+ matches any one or more of any character in that range.
You could even do your entire check in one line of regex:
pin.match(/\A([0-9]{4}|[0-9]{6})\z/)
Which says "Does this string consist of the characters 0-9 repeated exactly 4 times, or the characters 0-9, repeated exactly 6 times?"
Ruby's String#count method does something similar to this, though it just counts the number of occurrences of the characters passed, and it uses something similar to regex ranges to allow you to specify character ranges.
The sequence c1-c2 means all characters between c1 and c2.
Thus, it expands the parameter "0-9" into the list of characters "0123456789", and then it tests how many of the characters in the string match that list of characters.
This will work to verify that a certain number of numbers exist in the string, and the length checks let you implicitly test that no other characters exist in the string. However, regexes let you assert that directly, by ensuring that the whole string matches a given pattern, including length constraints.
Count everything non-digit in pin and check if this count is zero:
pin.count("^0-9").zero?
Since you seem to be looking for answers outside regex and since Chris already spelled out how the count method was being implemented in the example above, I'll try to add one more idea for testing whether a string is an Integer or not:
pin.to_i.to_s == pin
What we're doing is converting the string to an integer, converting that result back to a string, and then testing to see if anything changed during the process. If the result is =>true, then you know nothing changed during the conversion to an integer and therefore the string is only an Integer.
EDIT:
The example above only works if the entire string is an Integer and won’t properly deal with leading zeros. If you want to check to make sure each and every character is an Integer then do something like this instead:
pin.prepend(“1”).to_i.to_s(1..-1) == pin
Part of the question seems to be exactly HOW the following portion of code is doing its job:
pin.count("0-9")
This piece of the code is simply returning a count of how many instances of the numbers 0 through 9 exist in the string. That's only one piece of the relevant section of code though. You need to look at the rest of the line to make sense of it:
pin.count("0-9") == pin.length
The first part counts how many instances then the second part compares that to the length of the string. If they are equal (==) then that means every character in the string is an Integer.
Sometimes negation can be used to advantage:
!pin.match?(/\D/) && [4,6].include?(pin.length)
pin.match?(/\D/) returns true if the string contains a character other than a digit (matching /\D/), in which case it it would be negated to false.
One advantage of using negation here is that if the string contains a character other than a digit pin.match?(/\D/) would return true as soon as a non-digit is found, as opposed to methods that examine all the characters in the string.

Ruby: Split a string into substring of maximum 40 characters

I have some strings with a sentence and i need to subdivise it into a substring of maximum 40 characters.
But i don't want to split the sentence in the middle of a word.
I tried with .gsub function but it's return 40 characters maximum and avoid to cut the string in the middle of a word. But it's return only the first occurence.
sentence[0..40].gsub(/\s\w+$/,'')
I tried with split but i can select only the fist 40 characters and split in the middle of a word...
sentence.split(...){40}
My string is "Sure, we will show ourselves only when we know the east door has been opened.".
The string output i want is
["Sure, we will show ourselves only when we","know the east door has
been opened."]
Do you have a solution ? Thanks
Your first attempt:
sentence[0..40].gsub(/\s\w+$/,'')
almost works, but it has one fatal flaw. You are splitting on the number of characters before cutting off the last word. This means you have no way of knowing whether the bit being trimmed off was a whole word, or a partial word.
Because of this, your code will always cut off the last word.
I would solve the problem as follows:
sentence[/\A.{0,39}[a-z]\b/mi]
\A is an anchor to fix the regex to the start of the string.
.{0,39}[a-z] matches on 1 to 40 characters, where the last character must be a letter. This is to prevent the last selected character from being punctuation or space. (Is that desired behaviour? Your question didn't really specify. Feel free to tweak/remove that [a-z] part, e.g. [a-z.] to match a full stop, if desired.)
\b is a word boundary look-around. It is a zero-width matcher, on beginning/end of words.
/mi modifiers will include case insensitive (i.e. A-Z) and multi-line matches.
One very minor note is that because this regex is matching 1 to 40 characters (rather than zero), it is possible to get a null result. (Although this is seemingly very unlikely, since you'd need a 1-word, 41+ letter string!!) To account for this edge case, call .to_s on the result if needed.
Update: Thank you for the improved edit to your question, providing a concrete example of an input/result. This makes it much clearer what you are asking for, as the original post was somewhat ambiguous.
You could solve this with something like the following:
sentence.scan(/.{0,39}[a-z.!?,;](?:\b|$)/mi)
String#scan returns an array of strings that match the pattern - so you can then re-join these strings to reconstruct the original.
Again, I have added a few more characters (!?,;) to the list of "final characters in the substring". Feel free to tweak this as desired.
(?:\b|$) means "either a word boundary, or the end of the line". This fixes the issue of the result not including the final . in the substrings. Note that I have used a non-capture group (?:) to prevent the result of scan from changing.

Rails strip all except numbers commas and decimal points

Hi I've been struggling with this for the last hour and am no closer. How exactly do I strip everything except numbers, commas and decimal points from a rails string? The closest I have so far is:-
rate = rate.gsub!(/[^0-9]/i, '')
This strips everything but the numbers. When I try add commas to the expression, everything is getting stripped. I got the aboves from somewhere else and as far as I can gather:
^ = not
Everything to the left of the comma gets replaced by what's in the '' on the right
No idea what the /i does
I'm very new to gsub. Does anyone know of a good tutorial on building expressions?
Thanks
Try:
rate = rate.gsub(/[^0-9,\.]/, '')
Basically, you know the ^ means not when inside the character class brackets [] which you are using, and then you can just add the comma to the list. The decimal needs to be escaped with a backslash because in regular expressions they are a special character that means "match anything".
Also, be aware of whether you are using gsub or gsub!
gsub! has the bang, so it edits the instance of the string you're passing in, rather than returning another one.
So if using gsub! it would be:
rate.gsub!(/[^0-9,\.]/, '')
And rate would be altered.
If you do not want to alter the original variable, then you can use the version without the bang (and assign it to a different var):
cleaned_rate = rate.gsub!(/[^0-9,\.]/, '')
I'd just google for tutorials. I haven't used one. Regexes are a LOT of time and trial and error (and table-flipping).
This is a cool tool to use with a mini cheat-sheet on it for ruby that allows you to quickly edit and test your expression:
http://rubular.com/
You can just add the comma and period in the square-bracketed expression:
rate.gsub(/[^0-9,.]/, '')
You don't need the i for case-insensitivity for numbers and symbols.
There's lots of info on regular expressions, regex, etc. Maybe search for those instead of gsub.
You can use this:
rate = rate.gsub!(/[^0-9\.\,]/g,'')
Also check this out to learn more about regular expressions:
http://www.regexr.com/

Lua pattern help (Double parentheses)

I have been coding a program in Lua that automatically formats IRC logs from a roleplay. In the roleplay logs there is a specific guideline for "Out of character" conversation, which we use double parentheses for. For example: ((<Things unrelated to roleplay go here>)). I have been trying to have my program remove text between double brackets (and including both brackets). The code is:
ofile = io.open("Output.txt", "w")
rfile = io.open("Input.txt", "r")
p = rfile:read("*all")
w = string.gsub(p, "%(%(.*?%)%)", "")
ofile:write(w)
The pattern here is > "%(%(.*?%)%)" I've tried multiple variations of the pattern. All resulted in fruitless results:
1. %(%(.*?%)%) --Wouldn't do anything.
2. %(%(.*%)%) --Would remove *everything* after the first OOC message.
Then, my friend told me that prepending the brackets with percentages wouldn't work, and that I had to use backslashes to 'escape' the parentheses.
3. \(\(.*\)\) --resulted in the output file being completely empty.
4. (\(\(.*\)\)) --Same result as above.
5. (\(\(.*?\)\) --would for some reason, remove large parts of the text for no apparent reason.
6. \(\(.*?\)\) --would just remove all the text except for the last line.
The short, absolute question:
What pattern would I need to use to remove all text between double parentheses, and remove the double parentheses themselves too?
You're friend is thinking of regular expressions. Lua patterns are similar, but different. % is the correct escape character.
Your pattern should be %(%(.-%)%). The - is similar to * in that it matches any number of the preceding sequence, but while * tries to match as many characters as it can (it's greedy), - matches the least amount of characters possible (it's non-greedy). It won't go overboard and match extra double-close-parenthesis.

Titleize with roman numerals, dashes, apostrophes, etc. in Ruby on Rails

I'm simply trying to convert uppercased company names into proper names.
Company names can include:
Dashes
Apostrophes
Roman Numerals
Text like LLC, LP, INC which should stay uppercase.
I thought I might be able to use acronyms like this:
ACRONYMS = %W( LP III IV VI VII VIII IX GI)
ActiveSupport::Inflector.inflections(:en) do |inflect|
ACRONYMS.each { |a| inflect.acronym(a) }
end
However, the conversion does not take into account word breaks, so having VI and VII does not work. For example, the conversion of "ADVISORS".titleize is "Ad VI Sors", as the VI becomes a whole word.
Dashes get removed.
It seems like there should be a generic gem for this generic problem, but I didn't find one. Is this problem really not that common? What's the best solution besides completely hacking the current inflection library?
Company names are a little odd, since a lot of times they're Marks (as in Service Mark) more than proper names. That means precise capitalization might actually matter, and trying to titleize might not be worth it.
In any case, here's a pattern that might work. Build your list of tokens to "keep", then manually split the string up and titleize the non-token parts.
# Make sure you put long strings before short (VII before VI)
word_tokens = %w{VII VI IX XI}
# Special characters need to be separate, since they never appear as "part" of another word
special_tokens = %w{-}
# Builds a regex like /(\bVII\b|\bVI\b|-|)/ that wraps "word tokens" in a word boundary check
token_regex = /(#{word_tokens.map{|t| /\b#{t}\b/}.join("|")}|#{special_tokens.join("|")})/
title = "ADVISORS-XI"
title.split(token_regex).map{|s| s =~ token_regex ? s : s.titleize}.join

Resources