How can I create a nokogiri case insensitive text * search? - ruby-on-rails

Currently I am doing:
words = []
words << "philip morris"
words << "Philip morris"
words << "philip Morris"
words << "Philip Morris"
for word in words
doc.search("[text()*='#{word}']")
end
When I was using hpricot I found where to downcase the results within the gem so I could just keep all my searches lowercase; with nokogiri, however, it has been quite difficult to find where one could even do that. Is anyone aware of a way to do this?
Thank you very much for your time

The lower-case XPath function is not available but you can use the translate XPath 1.0 function to convert your text to lowercase e.g. for the English alphabet:
translate(text(),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz')
I couldn't seem to use this in combination with the *= operator but you can use contains to do a substring search instead, making the full thing:
doc.search("//*[contains(translate(text(),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),'philip morris')]")
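A minimal sketch of wrapping the expression above in a helper, so the two alphabet strings are not repeated at every call site. The helper name `ci_contains_xpath` and the constants are hypothetical; the method only builds the XPath string (it does not call Nokogiri itself), and it assumes the search term contains no single quotes.

```ruby
UPPER = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
LOWER = 'abcdefghijklmnopqrstuvwxyz'

# Hypothetical helper: builds an XPath 1.0 expression matching any element
# whose first text node contains `term`, ignoring ASCII case.
# Assumption: `term` contains no single quotes.
def ci_contains_xpath(term)
  "//*[contains(translate(text(),'#{UPPER}','#{LOWER}'),'#{term.downcase}')]"
end

puts ci_contains_xpath('Philip Morris')
```

The resulting string could then be passed to the search call from the question, e.g. `doc.search(ci_contains_xpath('Philip Morris'))`. Note that translate only folds the letters listed, so accented characters would need to be added to both alphabets.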

Related

How to remove from string before __

I am building a Rails 5.2 app.
In this app I got outputs from different suppliers (I am building a webshop).
The name of the shipping provider is in this format:
dhl_freight__233433
It could also be in this format:
postal__US-320202
How can I remove all that is before (and including) the __ so all that remains are the things after the __, like for example 233433?
Perhaps some sort of RegEx.
A very simple approach would be to use String#split and then pick the second part that is the last part in this example:
"dhl_freight__233433".split('__').last
#=> "233433"
"postal__US-320202".split('__').last
#=> "US-320202"
You can use a very simple Regexp and ask the resulting MatchData for the post_match part:
p "dhl_freight__233433".match(/__/).post_match
# another (magic) way to access the post_match part:
p $'
Postscript: Learnt something from this question myself: you don't even have to use a RegExp for this to work. Just "asddfg__qwer".match("__").post_match does the trick (it does the conversion to regexp for you)
r = /[^_]+\z/
"dhl_freight__233433"[r] #=> "233433"
"postal__US-320202"[r] #=> "US-320202"
The regular expression matches one or more characters other than an underscore, followed by the end of the string (\z). The ^ at the beginning of the character class reads, "other than any of the characters that follow".
See String#[].
This assumes that the last underscore is preceded by an underscore. If the last underscore might not be preceded by an underscore, in which case there should be no match, add a positive lookbehind:
r = /(?<=__)[^_]+\z/
This requires the match to be preceded by two underscores.
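The approaches above agree on the two sample names but differ on edge cases, which a small sketch can make visible. The method names are hypothetical wrappers around the expressions shown in the answers; the inputs "no_separator" and "a__b__c" are made-up test values.

```ruby
# Hypothetical wrappers around the three techniques from the answers.
def after_split(name)
  name.split('__').last
end

def after_post_match(name)
  m = name.match(/__/)
  m && m.post_match          # nil when the name contains no "__"
end

def after_lookbehind(name)
  name[/(?<=__)[^_]+\z/]     # nil unless the tail directly follows "__"
end

['dhl_freight__233433', 'postal__US-320202', 'no_separator', 'a__b__c'].each do |name|
  p [name, after_split(name), after_post_match(name), after_lookbehind(name)]
end
```

With no separator, split returns the whole string while the other two return nil; with two separators, post_match keeps everything after the first one while the others keep only the last segment.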
There are many ruby ways to extract numbers from string. I hope you're trying to fetch numbers out of a string. Here are some of the ways to do so.
Ref- http://www.ruby-forum.com/topic/125709
line.delete("^0-9")
line.scan(/\d/).join('')
line.tr("^0-9", '')
In the above delete is the fastest to trim numbers out of strings.
All of the above extract the digits from the string and join them. If a string is like this "String-with-67829___numbers-09764", the output would be "6782909764".
If you want the numbers split like this ["67829", "09764"]:
line.split(/[^\d]/).reject { |c| c.empty? }
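As a quick check, the variants above can be run side by side on the sample string. The method names are hypothetical, and `scan(/\d+/)` is an alternative not mentioned in the answer that yields the same groups as the split-and-reject form.

```ruby
SAMPLE = "String-with-67829___numbers-09764"

# Hypothetical helper: all digits joined into one string.
def all_digits(line)
  line.delete("^0-9")
end

# Hypothetical helper: each run of digits kept as its own element.
def digit_groups(line)
  line.scan(/\d+/)
end

p all_digits(SAMPLE)
p SAMPLE.scan(/\d/).join('')
p SAMPLE.tr("^0-9", '')
p digit_groups(SAMPLE)
p SAMPLE.split(/[^\d]/).reject { |c| c.empty? }
```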
Hope these answers help you! Happy coding :-)

Rails strip all except numbers commas and decimal points

Hi I've been struggling with this for the last hour and am no closer. How exactly do I strip everything except numbers, commas and decimal points from a rails string? The closest I have so far is:-
rate = rate.gsub!(/[^0-9]/i, '')
This strips everything but the numbers. When I try to add commas to the expression, everything gets stripped. I got the above from somewhere else and as far as I can gather:
^ = not
Everything to the left of the comma gets replaced by what's in the '' on the right
No idea what the /i does
I'm very new to gsub. Does anyone know of a good tutorial on building expressions?
Thanks
Try:
rate = rate.gsub(/[^0-9,\.]/, '')
Basically, you know the ^ means not when inside the character class brackets [] which you are using, and then you can just add the comma to the list. The decimal point is escaped with a backslash because, outside a character class, an unescaped . is a special character that matches any character; inside the brackets the escape is optional but harmless.
Also, be aware of whether you are using gsub or gsub!
gsub! has the bang, so it edits the instance of the string you're passing in, rather than returning another one.
So if using gsub! it would be:
rate.gsub!(/[^0-9,\.]/, '')
And rate would be altered.
If you do not want to alter the original variable, then you can use the version without the bang (and assign it to a different var):
cleaned_rate = rate.gsub(/[^0-9,\.]/, '')
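The difference between the two methods matters for the assignment in the question, which a short sketch can show. The helper name `clean_rate` and the sample values are hypothetical.

```ruby
# Hypothetical helper: keeps only digits, commas and decimal points.
def clean_rate(rate)
  rate.gsub(/[^0-9,.]/, '')   # non-bang: returns a new string, leaves `rate` alone
end

original = '$1,234.50 USD'
puts clean_rate(original)     # the cleaned copy
puts original                 # still the original text

# gsub! changes the receiver in place and returns nil when nothing was replaced,
# so `rate = rate.gsub!(...)` can silently set rate to nil.
already_clean = '1,234.50'.dup
p already_clean.gsub!(/[^0-9,.]/, '')
```

This is why `rate = rate.gsub!(...)` is risky: when the string is already clean, rate ends up nil.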
I'd just google for tutorials. I haven't used one. Regexes are a LOT of time and trial and error (and table-flipping).
This is a cool tool to use with a mini cheat-sheet on it for ruby that allows you to quickly edit and test your expression:
http://rubular.com/
You can just add the comma and period in the square-bracketed expression:
rate.gsub(/[^0-9,.]/, '')
You don't need the i for case-insensitivity for numbers and symbols.
There's lots of info on regular expressions, regex, etc. Maybe search for those instead of gsub.
You can use this:
rate = rate.gsub(/[^0-9\.\,]/, '')
Also check this out to learn more about regular expressions:
http://www.regexr.com/

Problem with TXT file extraction in ruby

I have a data file in TXT format, and I would like to parse the URL field from it using the Ruby code below:
f = File.open(txt_file, "r")
f.each_line { |line|
rows = line.split(',')
rows[3].each do |url|
next if url=="URL"
puts url
end
}
TXT contains:
name,option,price,URL
"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"
output:
0
Why does the output come from the option field "0,0,0,0,0,0"? How do I skip this and get the URL field?
Environment
ruby 1.8.7
rails 2.3.8
gem 1.3.7
I'd check out a CSV parsing tool to make this easier:
require 'rubygems'
require 'faster_csv'
FasterCSV.foreach(txt_file, :quote_char => '"',
:col_sep =>',', :row_sep =>:auto) do |row|
puts row[3] if row[3] != "URL"
end
Also, I think you're misunderstanding how the split() would work. If you run split() against one row from your file, you're going to get back an array of columns for that single row, not a multidimensional array as rows[3].each would suggest.
EDIT: Before reading, I completely agree with the answer by Jeff Swensen, I'll leave my answer here regardless.
I'm not entirely sure what your inside loop is for (rows[3].each) Because you can't convert a single line into a 'row' when you only have a single URL. You could split by the ** characters and return an Array of urls but then you still need to remove the extra double quotes, or you could use a Regular Expression, like so:
#!/usr/bin/env ruby
f = DATA
urls = f.readlines.map do |line|
line[/([^"]+)"\*\*/, 1]
end
urls.compact!
p urls
__END__
name ,option,price, **URL**
"x", "0,0,0,0,0,0", "123.40",**"http://domain.com/xym.jpg"**
"x", "0,0,0,0,0,0", "111.34",**"http://domain.com/yum.jpg"**
The call to compact is needed because map will insert nil objects when you hit something that doesn't match that expression. For the String#[] method, see here
The reason that "0" is the result is that your code is blindly splitting on the comma char when you seem to be expecting CSV-style parsing (where column values may contain delimiter chars if the entire column value is enclosed in quotes). I highly suggest using a CSV parser. If you are using Ruby 1.9.2, then you will already have access to the FasterCSV library.
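To see concretely where the "0" comes from, here is a small sketch that splits one data line from the question on commas and prints each piece with its index. The constant names are hypothetical.

```ruby
LINE = '"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"'

# A plain split knows nothing about quotes, so the quoted option field
# is cut into six pieces of its own.
FIELDS = LINE.split(',')

FIELDS.each_with_index { |field, i| puts "#{i}: #{field.inspect}" }
```

Index 3 lands inside the option field, which is why the original loop prints 0 instead of the URL; the URL ends up as the last element.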
If you are sure that the fields you want are always surrounded by double quotations, you can use that as the basis for extracting rather than the comma.
File.open(txt_file) do |f|
f.each_line do |l|
cols = l.scan(/(?<!\\)"(.*?)(?<!\\)"/)
cols[3].tap{|url| puts url if url}
end
end
In your code, the opened IO is not closed. This is a bad practice. It is better to use a block so that you do not forget to close it.
The two (?<!\\)" in the regex match non-escaped double quotations. They use negative lookbehind.
.*? is a non-greedy match, which avoids a match from exceeding a non-escaped double quotation.
tap is to avoid repeating the cols[3] operation twice in puts and if.
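The quote-based extraction can be tried without a file by feeding it the lines from the question. This is a sketch under the same assumption as above, namely that every field of interest is wrapped in double quotes; the method name `quoted_fields` is hypothetical, and it needs a regex engine with lookbehind support (Ruby 1.9 or later).

```ruby
# Hypothetical helper: returns the contents of every double-quoted field.
# The lookbehinds skip quotes that are escaped with a backslash.
def quoted_fields(line)
  line.scan(/(?<!\\)"(.*?)(?<!\\)"/).flatten
end

lines = [
  'name,option,price,URL',
  '"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"',
  '"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"'
]

lines.each do |line|
  url = quoted_fields(line)[3]
  puts url if url   # the header line has no quoted fields, so it is skipped
end
```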
Edit again
If you use ruby 1.8.7, you can either
update your regex engine to oniguruma by following easy steps here, http://oniguruma.rubyforge.org/
or
replace the regex. tap cannot be used also. Use the following instead:
File.open(txt_file) do |f|
f.each_line do |l|
cols = l.scan(/(?:\A|[^\\])"(.*?[^\\]|)"/)
url = cols[3]
puts url if url
end
end
I would recommend using oniguruma. It is the regex engine that has been built into Ruby since 1.9, and it is much more powerful and faster than the one used in Ruby 1.8. It can be installed easily on Ruby 1.8.
The data is in CSV format, but if all you want to do is grab the last field in the string, then do just that:
text =<<EOT
name,option,price,URL
"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"
EOT
require 'pp'
text.lines.map{ |l| l.split(',').last }
If you want to clean up the double-quotes and trailing line-breaks:
text.lines.map{ |l| l.split(',').last.gsub('"', '').chomp }
# => ["URL", "http://domain.com/xym.jpg", "http://domain.com/yum.jpg"]

Assistance with Some Interesting Syntax in Some Ruby Code I've Found

I'm currently reading Agile Web Development With Rails, 3rd edition. On page 672, I came across this method:
def capitalize_words(string)
string.gsub(/\b\w/) { $&.upcase }
end
What is the code in the block doing? I have never seen that syntax. Is it similar to the array.map(&:some_method) syntax?
It's Title Casing The Input. Inside the block, $& is a built-in representing the current match (\b\w i.e. the first letter of each word) which is then uppercased.
You've touched on one of the few things I don't like about Ruby :)
The magic variable $& contains the matched string from the previous successful pattern match. So in this case, it'll be the first character of each word.
This is mentioned in the RDoc for String.gsub:
http://ruby-doc.org/core/classes/String.html#M000817
gsub replaces everything that matched in the regex with the result of the block. so yes, in this case you're matching the first letter of words, then replacing it with the upcased version.
as to the slightly bizarre syntax inside the block, this is equivalent (and perhaps easier to understand):
def capitalize_words(string)
string.gsub(/\b\w/) {|x| x.upcase}
end
or even slicker:
def capitalize_words(string)
string.gsub(/\b\w/, &:upcase)
end
as to the regex (courtesy the pickaxe book), \b matches a word boundary, and \w any 'word character' (alphanumerics and underscore). so \b\w matches the first character of the word.
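A short sketch confirming that the three spellings behave the same. The method names are hypothetical, chosen only to tell the variants apart.

```ruby
# Variant from the book: $& holds the text of the last successful match.
def capitalize_with_global(string)
  string.gsub(/\b\w/) { $&.upcase }
end

# Same thing with an explicit block parameter.
def capitalize_with_param(string)
  string.gsub(/\b\w/) { |match| match.upcase }
end

# Same thing with Symbol#to_proc.
def capitalize_with_symbol(string)
  string.gsub(/\b\w/, &:upcase)
end

puts capitalize_with_global('agile web development')
puts capitalize_with_param('agile web development')
puts capitalize_with_symbol("don't panic")
```

One caveat of \b\w: an apostrophe is not a word character, so the letter after it also counts as the start of a word.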

How do I replace accented Latin characters in Ruby?

I have an ActiveRecord model, Foo, which has a name field. I'd like users to be able to search by name, but I'd like the search to ignore case and any accents. Thus, I'm also storing a canonical_name field against which to search:
class Foo
validates_presence_of :name
before_validation :set_canonical_name
private
def set_canonical_name
self.canonical_name ||= canonicalize(self.name) if self.name
end
def canonicalize(x)
x.downcase. # something here
end
end
I need to fill in the "something here" to replace the accented characters. Is there anything better than
x.downcase.gsub(/[àáâãäå]/,'a').gsub(/æ/,'ae').gsub(/ç/, 'c').gsub(/[èéêë]/,'e')....
And, for that matter, since I'm not on Ruby 1.9, I can't put those Unicode literals in my code. The actual regular expressions will look much uglier.
ActiveSupport::Inflector.transliterate (requires Rails 2.2.1+ and Ruby 1.9 or 1.8.7)
example:
>> ActiveSupport::Inflector.transliterate("àáâãäå").to_s
=> "aaaaaa"
Rails has already a builtin for normalizing, you just have to use this to normalize your string to form KD and then remove the other chars (i.e. accent marks) like this:
>> "àáâãäå".mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s
=> "aaaaaa"
Better yet is to use I18n:
1.9.3-p392 :001 > require "i18n"
=> false
1.9.3-p392 :002 > I18n.transliterate("Olá Mundo!")
=> "Ola Mundo!"
I have tried a lot of these approaches, but each failed to meet one or several of these requirements:
Respect spaces
Respect 'ñ' character
Respect case (I know it is not a requirement for the original question, but it is not difficult to convert a string to lowercase)
What worked has been this:
# coding: utf-8
string.tr(
"ÀÁÂÃÄÅàáâãäåĀāĂăĄąÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňŉŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž",
"AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz"
)
– http://blog.slashpoundbang.com/post/12938588984/remove-all-accents-and-diacritics-from-string-in-ruby
You have to modify the character list a little to respect the 'ñ' character, but that is an easy job.
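The same idea on a deliberately tiny mapping, as a sketch: only accented vowels are listed, so 'ñ', spaces and case are left alone. The constant and method names are hypothetical, and the two strings must stay the same length so each character lines up with its replacement.

```ruby
# coding: utf-8
ACCENTED = "ÁÉÍÓÚáéíóú"
PLAIN    = "AEIOUaeiou"

# Hypothetical helper: replaces each character in ACCENTED with the
# character at the same position in PLAIN; everything else is untouched.
def strip_vowel_accents(string)
  string.tr(ACCENTED, PLAIN)
end

puts strip_vowel_accents("Olá Mundo, El Niño")
```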
My answer: the String#parameterize method:
"Le cœur de la crémiére".parameterize
=> "le-coeur-de-la-cremiere"
For non-Rails programs:
Install activesupport: gem install activesupport then:
require 'active_support/inflector'
"a&]'s--3\014\xC2àáâã3D".parameterize
# => "a-s-3-3d"
Decompose the string and remove non-spacing marks from it.
irb -ractive_support/all
> "àáâãäå".mb_chars.normalize(:kd).gsub(/\p{Mn}/, '')
aaaaaa
You may also need this if used in a .rb file.
# coding: utf-8
The normalize(:kd) part here splits out diacriticals where possible (ex: the "n with tilde" single character is split into an n followed by a combining diacritical tilde character), and the gsub part then removes all the diacritical characters.
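The split described here can be inspected at the code point level. This sketch assumes Ruby 2.2 or later and uses the built-in String#unicode_normalize(:nfkd) in place of ActiveSupport's mb_chars.normalize(:kd), so it runs without Rails; the method name `strip_marks` is hypothetical.

```ruby
# "\u00F1" is the single precomposed character "n with tilde".
decomposed = "\u00F1".unicode_normalize(:nfkd)

# After decomposition there are two code points:
# the plain letter n and the combining tilde (U+0303).
p decomposed.codepoints.map { |cp| format('U+%04X', cp) }

# Hypothetical helper: decompose, then drop all non-spacing marks.
def strip_marks(string)
  string.unicode_normalize(:nfkd).gsub(/\p{Mn}/, '')
end

puts strip_marks("àáâãäå")
```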
I think that you maybe don't really want to go down that path. If you are developing for a market that has these kinds of letters, your users probably will think you are a sort of ...pip.
Because 'å' isn't even close to 'a' in any meaning to a user.
Take a different road and read up about searching in a non-ascii way. This is just one of those cases for which someone invented unicode and collation.
A very late PS:
http://www.w3.org/International/wiki/Case_folding
http://www.w3.org/TR/charmod-norm/#sec-WhyNormalization
Besides that, I have no idea why the link to collation goes to an MSDN page, but I'll leave it there. It should have been http://www.unicode.org/reports/tr10/
This assumes you use Rails.
"anything".parameterize.underscore.humanize.downcase
Given your requirements, this is probably what I'd do... I think it's neat, simple and will stay up to date in future versions of Rails and Ruby.
Update: dgilperez pointed out that parameterize takes a separator argument, so "anything".parameterize(" ") (deprecated) or "anything".parameterize(separator: " ") is shorter and cleaner.
Convert the text to normalization form D, remove all codepoints with unicode category non spacing mark (Mn), and convert it back to normalization form C. This will strip all diacritics, and your problem is reduced to a case insensitive search.
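A minimal sketch of those three steps plus the case folding, assuming Ruby 2.2 or later for String#unicode_normalize. The method name `canonicalize` mirrors the placeholder in the question; the implementation here is only one possible way to fill it in.

```ruby
# Sketch: NFD -> drop non-spacing marks (Mn) -> NFC -> downcase.
def canonicalize(text)
  text.unicode_normalize(:nfd)   # split letters from their combining marks
      .gsub(/\p{Mn}/, '')        # remove the marks
      .unicode_normalize(:nfc)   # recompose whatever is left
      .downcase
end

puts canonicalize("Visual Caf\u00E9")
puts canonicalize("Visual Cafe")
```

Both inputs reduce to the same canonical string, so a search on the canonical column would find either spelling. Characters without a decomposition, such as æ or ø, pass through unchanged and would need an explicit mapping.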
See http://www.siao2.com/2005/02/19/376617.aspx and http://www.siao2.com/2007/05/14/2629747.aspx for details.
The key is to use two columns in your database: canonical_text and original_text. Use original_text for display and canonical_text for searches. That way, if a user searches for "Visual Cafe," she sees the "Visual Café" result. If she really wants a different item called "Visual Cafe," it can be saved separately.
To get the canonical_text characters in a Ruby 1.8 source file, do something like this:
register_replacement([0x008A].pack('U'), 'S')
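The pack('U') call turns a Unicode code point into a UTF-8 string, which avoids typing the literal character into the source file. A small sketch, with a hypothetical helper name:

```ruby
# Hypothetical helper: builds a one-character UTF-8 string from a code point.
def char_from_codepoint(codepoint)
  [codepoint].pack('U')
end

e_acute = char_from_codepoint(0x00E9)   # LATIN SMALL LETTER E WITH ACUTE
p e_acute.encoding
p e_acute.bytes
```

One thing worth double-checking in the line above: 'U' expects a Unicode code point, not a byte from a legacy code page. If the character meant to map to 'S' is Š, its code point is 0x0160; 0x8A is that character's byte value in Windows-1252.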
You probably want Unicode decomposition ("NFD"). After decomposing the string, just filter out anything not in [A-Za-z]. ã will decompose to "a~" (approximately - the diacritical will become a separate character), so the filtering leaves a reasonable approximation. Note that æ has no Unicode decomposition, so it would be filtered out entirely unless you map it to "ae" yourself.
iconv:
http://groups.google.com/group/ruby-talk-google/browse_frm/thread/8064dcac15d688ce?
=============
a perl module which i can't understand:
http://www.ahinea.com/en/tech/accented-translate.html
============
brute force (there's a lot of those critters!):
http://projects.jkraemer.net/acts_as_ferret/wiki#UTF-8support
http://snippets.dzone.com/posts/show/2384
I had problems getting the foo.mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s solution to work. I'm not using Rails and there was some conflict with my activesupport/ruby versions that I couldn't get to the bottom of.
Using the ruby-unf gem seems to be a good substitute:
require 'unf'
foo.to_nfd.gsub(/[^\x00-\x7F]/n,'').downcase
As far as I can tell this does the same thing as .mb_chars.normalize(:kd). Is this correct? Thanks!
If you are using PostgreSQL >= 9.4 as your DB adapter, maybe you could add its "unaccent" extension in a migration, which I think does what you want, like this:
def self.up
enable_extension "unaccent" # Does not fail if it already exists
end
In order to test, in the console:
2.3.1 :045 > ActiveRecord::Base.connection.execute("SELECT unaccent('unaccent', 'àáâãäåÁÄ')").first
=> {"unaccent"=>"aaaaaaAA"}
Notice that the result is still case sensitive at this point.
Then, maybe use it in a scope, like:
scope :with_canonical_name, -> (name) {
where("unaccent(foos.name) iLIKE unaccent(?)", name)
}
The iLIKE operator makes the search case insensitive. There is another approach, using the citext data type. Here is a discussion about these two approaches. Notice also that use of PostgreSQL's lower() function is not recommended.
This will save you some DB space, since you will no longer require the canonical_name field, and perhaps make your model simpler, at the cost of some extra processing in each query, in an amount depending on whether you are using iLIKE or citext, and on your dataset.
If you are using MySQL maybe you can use this simple solution, but I have not tested it.
lol.. I just tried this.. and it is working.. I am still not quite sure why.. but when I use these 4 lines of code:
str = str.gsub(/[^a-zA-Z0-9 ]/,"")
str = str.gsub(/[ ]+/," ")
str = str.gsub(/ /,"-")
str = str.downcase
it automatically removes any accent from filenames.. which is what I was trying to do (remove accents from filenames and then rename them). Hope it helped :)
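For what it is worth, here is a sketch of what those four lines do, wrapped in a hypothetical method `ascii_slug` and run on a made-up filename written with precomposed accented characters. The first gsub deletes every character outside a-z, A-Z, 0-9 and space, so an accented letter is dropped entirely rather than replaced by its base letter.

```ruby
# Hypothetical wrapper around the four lines above.
def ascii_slug(str)
  str = str.gsub(/[^a-zA-Z0-9 ]/, "")   # drop everything except ASCII letters, digits, spaces
  str = str.gsub(/[ ]+/, " ")           # collapse runs of spaces
  str = str.gsub(/ /, "-")              # spaces become hyphens
  str.downcase
end

# "Crème Brûlée.txt", written with escapes so the characters are precomposed.
puts ascii_slug("Cr\u00E8me Br\u00FBl\u00E9e.txt")
```

If the filename happens to be stored in decomposed form (base letter plus combining mark), only the combining mark is removed and the base letter survives, which may be why it appeared to strip just the accents.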
