Regex for striping non alphabetic and non numeric characters - ruby-on-rails

Total programming newbie here. In ruby, how would I go about striping the following string of non alphabetic and non numeric characters and then split the string into an array by splitting it through spaces.
Example
string = "Honey - a sweet, sticky, yellow fluid made by bees and other insects from nectar collected from flowers."
Into this
tokenized_string = ["Honey", "a", "sweet", "sticky", "yellow", "fluid", "made", "by", "bees", "and", "other", "insects", "from", "nectar", "collected", "from", "flowers"]
Any help would be much appreciated!

I'd use:
string = "Honey - a sweet, sticky, yellow fluid made by bees and other insects from nectar collected from flowers."
string.delete('^A-Za-z0-9 ').split
# => ["Honey",
# "a",
# "sweet",
# "sticky",
# "yellow",
# "fluid",
# "made",
# "by",
# "bees",
# "and",
# "other",
# "insects",
# "from",
# "nectar",
# "collected",
# "from",
# "flowers"]
If you're trying to remove everything but alphanumerics, then the \w character class can't be used because it is defined as [A-Za-z0-9_], which allows _ to leak in or squeeze through. Here's an example:
'foo_BAR12'[/\w+/] # => "foo_BAR12"
That matched the entire string, including _.
'foo_BAR12'[/[A-Za-z0-9]+/] # => "foo"
That stopped at _, because the class [A-Za-z0-9] doesn't include it.
\w should be considered a matching pattern for variable names, NOT for alphanumerics. If you want a character class for alphanumerics, look at the POSIX \[\[:alnum:\]\] class:
'foo_BAR12'[/[[:alnum:]]+/] # => "foo"

There are a lot of possibilities, e.g.:
string.gsub(/\W/) { |m| m if m == ' ' }.split
or, even clearer:
string.gsub(/\W/) { |m| m if m.strip.empty? }.split

Very simple. The following gives you the array you want without your having to use split:
string.scan(/\w+/)
Play around with it on Rubular.com.

Do as belowe using String#scan
string = "Honey - a sweet, sticky, yellow fluid made by bees and other insects from nectar collected from flowers."
string.scan(/[a-zA-Z0-9]+/)
# => ["Honey",
# "a",
# "sweet",
# "sticky",
# "yellow",
# "fluid",
# "made",
# "by",
# "bees",
# "and",
# "other",
# "insects",
# "from",
# "nectar",
# "collected",
# "from",
# "flowers"]

Related

How to read a CSV that contains double quotes (")

I have a CSV File that looks like this
"url","id","role","url","deadline","availability","location","my_type","keywords","source","external_id","area","area (1)"
"https://myurl.com","123456","This a string","https://myurl.com?source=5&param=1","31-01-2020","1","Location´s Place","another_string, my_string","key1, key2, key3","anotherString","145129","Place in Earth",""
It has 13 columns.
The issue is that I get each row with a \" and I don't want that. Also, I get 16 columns back in the read.
This is what I have done
csv = CSV.new(File.open('myfile.csv'), quote_char:"\x00", force_quotes:false)
csv.read[1]
Output:
["\"https://myurl.com\"", "\"123456\"", "\"This a string\"", "\"https://myurl.com?source=5&param=1\"", "\"31-01-2020\"", "\"1\"", "\"Location´s Place\"", "\"another_string", " my_string\"", "\"key1", " key2", " key3\"", "\"anotherString\"", "\"145129\"", "\"Place in Earth\"", "\"\""]
The file you showed is a standard CSV file. There is nothing special needed. Just delete all those unnecessary arguments:
csv = CSV.new(File.open('myfile.csv'))
csv.read[1]
#=> [
# "https://myurl.com",
# "123456",
# "This a string",
# "https://myurl.com?source=5&param=1",
# "31-01-2020",
# "1",
# "Location´s Place",
# "another_string, my_string",
# "key1, key2, key3",
# "anotherString",
# "145129",
# "Place in Earth",
# ""
# ]
force_quotes doesn't do anything in your code, because it controls whether or not the CSV library will quote all fields when writing CSV. You are reading, not writing, so this argument is useless.
quote_char: "\x00" is clearly wrong, since the quote character in the example you posted is clearly " not NUL.
quote_char: '"' would be correct, but is not necessary, since it is the default.

Nokogiri: Clean up data output

I am trying to scrape player information from MLS sites to create a map of where the players come from, as well as other information. I am as new to this as it gets.
So far I have used this code:
require 'HTTParty'
require 'Nokogiri'
require 'JSON'
require 'Pry'
require 'csv'
page = HTTParty.get('https://www.atlutd.com/players')
parse_page = Nokogiri::HTML(page)
players_array = []
parse_page.css('.player_list.list-reset').css('.row').css('.player_info').map do |a|
player_info = a.text
players_array.push(player_info)
end
#CSV.open('atlantaplayers.csv', 'w') do |csv|
# csv << players_array
#end
pry.start(binding)
The output of the pry function is:
"Miguel Almirón10\nMidfielder\n-\nAsunción, ParaguayAge:\n23\nHT:\n5' 9\"\nWT:\n140\n"
Which when put into the csv creates this in a single cell:
"Miguel Almirón10
Midfielder
-
Asunción, ParaguayAge:
23
HT:
5' 9""
WT:
140
"
I've looked into things and have determined that it is possible nodes (\n)? that is throwing off the formatting.
My desired outcome here is to figure out how to get the pry output into the array as follows:
Miguel, Almiron, 10, Midfielder, Asuncion, Paraguay, 23, 5'9", 140
Bonus points if you can help with the accent marks on names. Also if there is going to be an issue with height, is there a way to convert it to metric?
Thank you in advance!
I've looked into things and have determined that it is possible nodes (\n)? that is throwing off the formatting.
Yes that's why it's showing in this odd format, you can strip the rendered text to remove extra spaces/lines then your text will show without the \ns
player_info = a.text.strip
[1] pry(main)> "Miguel Almirón10\n".strip
=> "Miguel Almirón10"
This will only remove the \n if you wish to store them in a CSV in this order
Miguel, Almiron, 10, Midfielder, Asuncion, Paraguay, 23, 5'9", 140
then you might want to split by spaces and then create an array for each row so when pushing the line to the CSV file it will look like this:
csv << ["Miguel", "Almiron", 10, "Midfielder", "Asuncion", "Paraguay", 23, "5'9\"", 140]
with the accent marks on names
you can use the transliterate method which will remove accents
[8] pry(main)> ActiveSupport::Inflector.transliterate("Miguel Almirón10")
=> "Miguel Almiron10"
See http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-transliterate and you might want to require 'rails' for this
Here's what I would use, i18n and people gems:
require 'people'
require "i18n"
I18n.available_locales = [:en]
#np = People::NameParser.new
players_array = []
parse_page.css('.player_info').each do |div|
name = #np.parse I18n.transliterate(div.at('.name a').text)
players_array << [
name[:first],
name[:last],
div.at('.jersey').text,
div.at('.position').text,
]
end
# => [["Miguel", "Almiron", "10", "Midfielder"],
# ["Mikey", "Ambrose", "22", "Defender"],
# ["Yamil", "Asad", "11", "Forward"],
# ...
That should get you started.

Rails validates NOT in regex

I'm trying to create a Rails 3 validation that will ensure that people are not using one of the common free email addresses.
My thought was something like this ....
validates_format_of :email, :with => /^((?!gmail).*)$|^((?!yahoo).*)$|^((?!hotmail).*)$/
or
validates_exclusion_of :email, :in => %w( gmail. GMAIL. hotmail. HOTMAIL. live. LIVE. aol. AOL. ), :message => "You must use your corporate email address."
But neither works properly. Any ideas?
Basically you've written a regex that matches anything. Let's break it down.
/
^( # [ beginning of string
(?!gmail) # followed by anything other than "gmail"
. # followed by any one character
)$ # followed by the end the of string
| # ] OR [
^( # beginning of the string
(?!yahoo) # followed by anything other than "yahoo"
. # followed by any one character
)$ # followed by the end of the string
| # ] OR [
^( # beginning of the string
(?!hotmail) # followed by anything other than "hotmail"
.* # followed by any or no characters
)$ # followed by the end the of string
/ # ]
When you think about it you'll realize that the only strings that won't match are ones that start with "gmail," "yahoo," and "hotmail"--all at the same time, which is impossible.
What you really want is something like this:
/
.+# # one or more characters followed by #
(?! # followed by anything other than...
(gmail|yahoo|hotmail) # [ one of these strings
\. # followed by a literal dot
) # ]
.+ # followed by one or more characters
$ # and the end of the string
/i # case insensitive
Put it together and you have:
expr = /.+#(?!(gmail|yahoo|hotmail)\.).+$/i
test_cases = %w[ foo#gmail.com
bar#yahoo.com
BAZ#HOTMAIL.COM
qux#example.com
quux
]
test_cases.map {|addr| expr =~ addr }
# => [nil, nil, nil, 0, nil]
# (nil means no match, 0 means there was a match starting at character 0)

Map string to another string in Ruby Rails

Hey guys,
i have 5 model attributes, for example, 'str' and 'dex'. A user has strength, dexterity attribute.
When i call user.increase_attr('dex') i want to do it through 'dex' and not having to pass 'dexterity' string all the way.
Of course, i can just check if ability == 'dex' and convert it to 'dexterity' when i will need to do user.dexterity += 1 and then save it.
But what is a good ruby way to do that ?
Look at Ruby's Abbrev module that's part of the standard library. This should give you some ideas.
require 'abbrev'
require 'pp'
class User
def increase_attr(s)
"increasing using '#{s}'"
end
end
abbreviations = Hash[*Abbrev::abbrev(%w[dexterity strength speed height weight]).flatten]
user = User.new
user.increase_attr(abbreviations['dex']) # => "increasing using 'dexterity'"
user.increase_attr(abbreviations['s']) # => "increasing using ''"
user.increase_attr(abbreviations['st']) # => "increasing using 'strength'"
user.increase_attr(abbreviations['sp']) # => "increasing using 'speed'"
If an ambiguous value is passed in, (the "s"), nothing will match. If a unique value is found in the hash, the returned value is the full string, making it easy to map short strings to the full string.
Because having varying lengths of the trigger strings would be confusing to the user you could strip all elements of the hash that have keys shorter than the shortest unambiguous key. In other words, remove anything shorter than two characters because of the collision of "speed" ("sp") and "strength" ("st"), meaning "h", "d" and "w" need to go. It's a "be kind to the poor human users" thing.
Here's what is created when Abbrev::abbrev does its magic and it's coerced into a Hash.
pp abbreviations
# >> {"dexterit"=>"dexterity",
# >> "dexteri"=>"dexterity",
# >> "dexter"=>"dexterity",
# >> "dexte"=>"dexterity",
# >> "dext"=>"dexterity",
# >> "dex"=>"dexterity",
# >> "de"=>"dexterity",
# >> "d"=>"dexterity",
# >> "strengt"=>"strength",
# >> "streng"=>"strength",
# >> "stren"=>"strength",
# >> "stre"=>"strength",
# >> "str"=>"strength",
# >> "st"=>"strength",
# >> "spee"=>"speed",
# >> "spe"=>"speed",
# >> "sp"=>"speed",
# >> "heigh"=>"height",
# >> "heig"=>"height",
# >> "hei"=>"height",
# >> "he"=>"height",
# >> "h"=>"height",
# >> "weigh"=>"weight",
# >> "weig"=>"weight",
# >> "wei"=>"weight",
# >> "we"=>"weight",
# >> "w"=>"weight",
# >> "dexterity"=>"dexterity",
# >> "strength"=>"strength",
# >> "speed"=>"speed",
# >> "height"=>"height",
# >> "weight"=>"weight"}
def increase_attr(attr)
attr_map = {'dex' => :dexterity, 'str' => :strength}
increment!(attr_map[attr]) if attr_map.include?(attr)
end
Basically create a Hash that has the key of 'dex', 'str' etc and points to the expanded version of that word(in symbol format).

Is it possible to simulate the behaviour of sprintf("%g") using the Rails NumberHelper methods?

sprintf("%g", [float]) allows me to format a floating point number without specifying precision, such that 10.00 is rendered as 10 and 10.01 is rendered as 10.01, and so on. This is neat.
In my application I'm rendering numbers using the Rails NumberHelper methods so that I can take advantage of the localization features, but I can't figure out how to achieve the above functionality through these helpers since they expect an explicit :precision option.
Is there a simple way around this?
Why not just use Ruby's Kernel::sprintf with NumberHelper? Recommended usage with this syntax: str % arg where str is the format string (%g in your case):
>> "%g" % 10.01
=> "10.01"
>> "%g" % 10
=> "10"
Then you can use the NumberHelper to print just the currency symbol:
>> foo = ActionView::Base.new
>> foo.number_to_currency(0, :format => "%u") + "%g"%10.0
=> "$10"
and define your own convenience method:
def pretty_currency(val)
number_to_currency(0, :format => "%u") + "%g"%val
end
pretty_currency(10.0) # "$10"
pretty_currency(10.01) # "$10.01"
I have solved this by adding another method to the NumberHelper module as follows:
module ActionView
module Helpers #:nodoc:
module NumberHelper
# Formats a +number+ such that the the level of precision is determined using the logic of sprintf("%g%"), that
# is: "Convert a floating point number using exponential form if the exponent is less than -4 or greater than or
# equal to the precision, or in d.dddd form otherwise."
# You can customize the format in the +options+ hash.
#
# ==== Options
# * <tt>:separator</tt> - Sets the separator between the units (defaults to ".").
# * <tt>:delimiter</tt> - Sets the thousands delimiter (defaults to "").
#
# ==== Examples
# number_with_auto_precision(111.2345) # => "111.2345"
# number_with_auto_precision(111) # => "111"
# number_with_auto_precision(1111.2345, :separator => ',', :delimiter => '.') # "1,111.2345"
# number_with_auto_precision(1111, :separator => ',', :delimiter => '.') # "1,111"
def number_with_auto_precision(number, *args)
options = args.extract_options!
options.symbolize_keys!
defaults = I18n.translate(:'number.format', :locale => options[:locale], :raise => true) rescue {}
separator ||= (options[:separator] || defaults[:separator])
delimiter ||= (options[:delimiter] || defaults[:delimiter])
begin
number_with_delimiter("%g" % number,
:separator => separator,
:delimiter => delimiter)
rescue
number
end
end
end
end
end
It is the specific call to number_with_delimiter with the %g option which renders the number as described in the code comments above.
This works great for me, but I'd welcome thoughts on this solution.

Resources