Problem with TXT file extraction in ruby

Problem with TXT file extraction in ruby - ruby-on-rails

I have data file as in format of TXT , I like to parse the URL field from TXT file using the below ruby code
f = File.open(txt_file, "r")
f.each_line { |line|
rows = line.split(',')
rows[3].each do |url|
next if url=="URL"
puts url
end
}
TXT contains:
name,option,price,URL
"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"
output:
0
Why does the output come from the option field "0,0,0,0,0,0"? How do I skip this and get the URL field?
Environment
ruby 1.8.7
rails 2.3.8
gem 1.3.7

I'd check out a CSV parsing tool to make this easier:
require 'rubygems'
require 'faster_csv'
FasterCSV.foreach(txt_file, :quote_char => '"',
:col_sep =>',', :row_sep =>:auto) do |row|
puts row[3] if row[3] != "URL"
break
end
Also, I think you're misunderstanding how the split() would work. If you run split() against one row from your file, you're going to get back an array of columns for that single row, not a multidimensional array as rows[3].each would suggest.

EDIT: Before reading, I completely agree with the answer by Jeff Swensen, I'll leave my answer here regardless.
I'm not entirely sure what your inside loop is for (rows[3].each) Because you can't convert a single line into a 'row' when you only have a single URL. You could split by the ** characters and return an Array of urls but then you still need to remove the extra double quotes, or you could use a Regular Expression, like so:
#!/usr/bin/env ruby
f = DATA
urls = f.readlines.map do |line|
line[/([^"]+)"\*\*/, 1]
end
urls.compact!
p urls
__END__
name ,option,price, **URL**
"x", "0,0,0,0,0,0", "123.40",**"http://domain.com/xym.jpg"**
"x", "0,0,0,0,0,0", "111.34",**"http://domain.com/yum.jpg"**
The call to compact is needed because map will insert nil objects when you hit something that doesn't match that expression. For the String#[] method, see here

The reason that "0" is the result is that your code is blindly splitting on the comma char when you seem to be expecting parsing CSV-style (where column values may contain delimiter chars if the entire column value is enclosed in quotes. I highly suggest using a csv parser. If you are using Ruby 1.9.2, then you will already have access to the FasterCSV library.

If you are sure that the fields you want are always surrounded by double quotations, you can use that as the basis for extracting rather than the comma.
File.open(txt_file) do |f|
f.each_line do |l|
cols = l.scan(/(?<!\\)"(.*?)(?<!\\)"/)
cols[3].tap{|url| puts url if url}
end
end
In your code, the opened IO is not closed. This is a bad practice. It is better to use a block so that you do not forget to close it.
The two (?<!\\)" in the regex match non-escaped double quotations. They use negative lookbehind.
.*? is a non-greedy match, which avoids a match from exceeding a non-escaped double quotation.
tap is to avoid repeating the cols[3] operation twice in puts and if.
Edit again
If you use ruby 1.8.7, you can either
update your regex engine to oniguruma by following easy steps here, http://oniguruma.rubyforge.org/
or
replace the regex. tap cannot be used also. Use the following instead:
.
File.open(txt_file) do |f|
f.each_line do |l|
cols = l.scan(/(?:\A|[^\\])"(.*?[^\\]|)"/)
url = cols[3]
puts url if url
end
end
I would recomment using oniguruma. It is a new regex engine introduced since ruby 1.9, and is much powerful and faster than the one used in ruby 1.8. It can be installed easily on ruby 1.8.

The data is in CSV format, but if all you want to do is grab the last field in the string, then do just that:
text =<<EOT
name,option,price,URL
"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"
EOT
require 'pp'
text.lines.map{ |l| l.split(',').last }
If you want to clean up the double-quotes and trailing line-breaks:
text.lines.map{ |l| l.split(',').last.gsub('"', '').chomp }
# => ["URL", "http://domain.com/xym.jpg", "http://domain.com/yum.jpg"]

Related

Apply modification only to substring in Ruby

I have a string of the form "award.x_initial_value.currency" and I would like to camelize everything except the leading "x_" so that I get a result of the form: "award.x_initialValue.currency".
My current implementation is:
a = "award.x_initial_value.currency".split(".")
b = a.map{|s| s.slice!("x_")}
a.map!{|s| s.camelize(:lower)}
a.zip(b).map!{|x, y| x.prepend(y.to_s)}
I am not very happy with it since it's neither fast nor elegant and performance is key since this will be applied to large amounts of data.
I also googled it but couldn't find anything.
Is there a faster/better way of achieving this?

Since "performance is key" you could skip the overhead of ActiveSupport::Inflector and use a regular expression to perform the "camelization" yourself:
a = "award.x_initial_value.currency"
a.gsub(/(?<!\bx)_(\w)/) { $1.capitalize }
#=> "award.x_initialValue.currency"

▶ "award.x_initial_value.x_currency".split('.').map do |s|
"#{s[/\Ax_/]}#{s[/(\Ax_)?(.*)\z/, 2].camelize(:lower)}"
end.join('.')
#⇒ "award.x_initialValue.x_currency"
or, with one gsub iteration:
▶ "award.x_initial_value.x_currency".gsub(/(?<=\.|\A)(x_)?(.*?)(?=\.|\z)/) do |m|
"#{$~[1]}" << $~[2].camelize(:lower)
end
#⇒ "award.x_initialValue.x_currency"
In the latter version we use global substitution:
$~ is a short-hand to a global, storing the last regexp match occured;
$~[1] is the first matched entity, corresponding (x_)?, because of ? it might be either matched string, or nil; that’s why we use string extrapolation, in case of nil "#{nil}" will result in an empty string;
after all, we append the camelized second match to the string, discussed above;
NB Instead of $~ for the last match, one might use Regexp::last_match

Could you try solmething like this:
'award.x_initial_value.currency'.gsub(/(\.|\A)x_/,'\1#').camelize(:lower).gsub('#','x_')
# => award.x_initialValue.currency
NOTE: for # char can be used any of unused char for current name/char space.

How do I read each line of uploaded file with mixed Windows and Unix line endings?

I am trying to read each line of an uploaded file in Rails.
file_data = params[:files]
if file_data.respond_to?(:read)
file_data.read.gsub( /\n/, "\r\n" ).split("\r\n").each do |line|
inputUsers.push(line.strip)
end
elsif file_data.respond_to?(:path)
File.read(file_data.path).gsub( /\n/, "\r\n" ).split("\r\n").each do |line|
inputUsers.push(line.strip)
end
If the uploaded file contains a mix of Windows and Unix encodings, presumably being due to copying from multiple places, Rails doesn't properly seperate each line of the file and sometimes returns two lines as one.
The application is hosted on a Linux box. Also, the file is copied from a Google docs spreadsheet column.
Are there any solutions for this problem?
Edit:
Hex code for lines that don't get seperated into new lines look like:
636f 6d0d 0a4e 6968

Here's how I'd go about this. First, to test some code:
SAMPLE_TEXT = [
"now\ris\r\nthe\ntime\n",
"for all good men\n"
]
def read_file(data)
data.each do |li|
[ *li.split(/[\r\n]+/) ].each do |l|
yield l
end
end
end
read_file(SAMPLE_TEXT) do |li|
puts li
end
Which outputs:
now
is
the
time
for all good men
The magic occurs in [ *li.split(/[\r\n]+/) ]. Breaking it down:
li.split(/[\r\n]+/) causes the line to be split on returns, new-lines and combinations of those. If a line has multiples the code will gobble empty lines, so if there's a chance you'll receive those you'll need a little more sophisticated pattern, /[\r\n]{1,2}/ which, though untested, should work.
*li.split(/[\r\n]+/) uses the "splat" operator * which says to explode the following array into its component elements. This is a convenient way to get an array when you're not sure whether you have a single element or an array being passed into a method.
[*li.split(/[\r\n]+/)] takes the components returned and turns them back into a single array.
To modify the method to handle a file instead is easy:
def read_file(fname)
File.foreach(fname) do |li|
[ *li.split(/[\r\n]+/) ].each do |l|
yield l
end
end
end
Call it almost the same way as in the previous example:
read_file('path/to/file') do |li|
puts li
end
The reason you want to use foreach is it'll read line-by-line, which is a lot more memory efficient than slurping a file using read or readlines, either of which read the entire file into memory at once. foreach is extremely fast also, so you don't take a speed-hit when using it. As a result there's little advantage to read-type methods, and good advantages to using foreach.

You are substituting \n with \r\n, which is problematic when parsing Windows files. Now \r\n becomes \r\r\n.
Better is to substitute to the Unix line ending format and then split on \n:
file_data.read.gsub( /\n/, "\r\n" ).split("\r\n").each do |line|
becomes:
file_data.read.gsub( /\r\n/, "\n" ).split("\n").each do |line|

Try the built-in method:
File.readlines('foo').each do |line|
Or:
File.open('foo').read.gsub(/\r\n?/, "\n").each_line do |line|

gsub string for all spaces in Ruby 1.8

I have a string with spaces (one simple space and one ideographic space):
"qwe　rty uiop".gsub(/[\s]+/,'') #=> "qwe　rtyuiop"
How can I add all space-codes (for example 3000, 2060, 205f) in my pattern?
In Ruby 1.9 I just added \u3000 and other codes, but how do it in 1.8?

I think i found answer. In ActiveSupport::Multibyte::Chars is a UNOCODE_WHITESPACE constant. Solution:
pattern = ActiveSupport::Multibyte::Chars::UNICODE_WHITESPACE.collect do |c|
c.pack "U*"
end.join '|'
puts "qwe　rty uiop".mb_chars.gsub(/#{pattern}/,'')
#=> qwertyuiop

How to make a Ruby string safe for a filesystem?

I have user entries as filenames. Of course this is not a good idea, so I want to drop everything except [a-z], [A-Z], [0-9], _ and -.
For instance:
my§document$is°° very&interesting___thisIs%nice445.doc.pdf
should become
my_document_is_____very_interesting___thisIs_nice445_doc.pdf
and then ideally
my_document_is_very_interesting_thisIs_nice445_doc.pdf
Is there a nice and elegant way for doing this?

I'd like to suggest a solution that differs from the old one. Note that the old one uses the deprecated returning. By the way, it's anyway specific to Rails, and you didn't explicitly mention Rails in your question (only as a tag). Also, the existing solution fails to encode .doc.pdf into _doc.pdf, as you requested. And, of course, it doesn't collapse the underscores into one.
Here's my solution:
def sanitize_filename(filename)
# Split the name when finding a period which is preceded by some
# character, and is followed by some character other than a period,
# if there is no following period that is followed by something
# other than a period (yeah, confusing, I know)
fn = filename.split /(?<=.)\.(?=[^.])(?!.*\.[^.])/m
# We now have one or two parts (depending on whether we could find
# a suitable period). For each of these parts, replace any unwanted
# sequence of characters with an underscore
fn.map! { |s| s.gsub /[^a-z0-9\-]+/i, '_' }
# Finally, join the parts with a period and return the result
return fn.join '.'
end
You haven't specified all the details about the conversion. Thus, I'm making the following assumptions:
There should be at most one filename extension, which means that there should be at most one period in the filename
Trailing periods do not mark the start of an extension
Leading periods do not mark the start of an extension
Any sequence of characters beyond A–Z, a–z, 0–9 and - should be collapsed into a single _ (i.e. underscore is itself regarded as a disallowed character, and the string '$%__°#' would become '_' – rather than '___' from the parts '$%', '__' and '°#')
The complicated part of this is where I split the filename into the main part and extension. With the help of a regular expression, I'm searching for the last period, which is followed by something else than a period, so that there are no following periods matching the same criteria in the string. It must, however, be preceded by some character to make sure it's not the first character in the string.
My results from testing the function:
1.9.3p125 :006 > sanitize_filename 'my§document$is°° very&interesting___thisIs%nice445.doc.pdf'
=> "my_document_is_very_interesting_thisIs_nice445_doc.pdf"
which I think is what you requested. I hope this is nice and elegant enough.

From http://web.archive.org/web/20110529023841/http://devblog.muziboo.com/2008/06/17/attachment-fu-sanitize-filename-regex-and-unicode-gotcha/:
def sanitize_filename(filename)
returning filename.strip do |name|
# NOTE: File.basename doesn't work right with Windows paths on Unix
# get only the filename, not the whole path
name.gsub!(/^.*(\\|\/)/, '')
# Strip out the non-ascii character
name.gsub!(/[^0-9A-Za-z.\-]/, '_')
end
end

In Rails you might also be able to use ActiveStorage::Filename#sanitized:
ActiveStorage::Filename.new("foo:bar.jpg").sanitized # => "foo-bar.jpg"
ActiveStorage::Filename.new("foo/bar.jpg").sanitized # => "foo-bar.jpg"

If you use Rails you can also use String#parameterize. This is not particularly intended for that, but you will obtain a satisfying result.
"my§document$is°° very&interesting___thisIs%nice445.doc.pdf".parameterize

For Rails I found myself wanting to keep any file extensions but using parameterize for the remainder of the characters:
filename = "my§doc$is°° very&itng___thsIs%nie445.doc.pdf"
cleaned = filename.split(".").map(&:parameterize).join(".")
Implementation details and ideas see source: https://github.com/rails/rails/blob/master/activesupport/lib/active_support/inflector/transliterate.rb
def parameterize(string, separator: "-", preserve_case: false)
# Turn unwanted chars into the separator.
parameterized_string.gsub!(/[^a-z0-9\-_]+/i, separator)
#... some more stuff
end

If your goal is just to generate a filename that is "safe" to use on all operating systems (and not to remove any and all non-ASCII characters), then I would recommend the zaru gem. It doesn't do everything the original question specifies, but the filename produced should be safe to use (and still keep any filename-safe unicode characters untouched):
Zaru.sanitize! " what\ēver//wëird:user:înput:"
# => "whatēverwëirduserînput"
Zaru.sanitize! "my§docu*ment$is°° very&interes:ting___thisIs%nice445.doc.pdf"
# => "my§document$is°° very&interesting___thisIs%nice445.doc.pdf"

There is a library that may be helpful, especially if you're interested in replacing weird Unicode characters with ASCII: unidecode.
irb(main):001:0> require 'unidecoder'
=> true
irb(main):004:0> "Grzegżółka".to_ascii
=> "Grzegzolka"

How do I replace accented Latin characters in Ruby?

I have an ActiveRecord model, Foo, which has a name field. I'd like users to be able to search by name, but I'd like the search to ignore case and any accents. Thus, I'm also storing a canonical_name field against which to search:
class Foo
validates_presence_of :name
before_validate :set_canonical_name
private
def set_canonical_name
self.canonical_name ||= canonicalize(self.name) if self.name
end
def canonicalize(x)
x.downcase. # something here
end
end
I need to fill in the "something here" to replace the accented characters. Is there anything better than
x.downcase.gsub(/[àáâãäå]/,'a').gsub(/æ/,'ae').gsub(/ç/, 'c').gsub(/[èéêë]/,'e')....
And, for that matter, since I'm not on Ruby 1.9, I can't put those Unicode literals in my code. The actual regular expressions will look much uglier.

ActiveSupport::Inflector.transliterate (requires Rails 2.2.1+ and Ruby 1.9 or 1.8.7)
example:
>> ActiveSupport::Inflector.transliterate("àáâãäå").to_s
=> "aaaaaa"

Rails has already a builtin for normalizing, you just have to use this to normalize your string to form KD and then remove the other chars (i.e. accent marks) like this:
>> "àáâãäå".mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s
=> "aaaaaa"

Better yet is to use I18n:
1.9.3-p392 :001 > require "i18n"
=> false
1.9.3-p392 :002 > I18n.transliterate("Olá Mundo!")
=> "Ola Mundo!"

I have tried a lot of this approaches but they were not achieving one or several of these requirements:
Respect spaces
Respect 'ñ' character
Respect case (I know is not a requirement for the original question but is not difficult to move an string to lowcase)
Has been this:
# coding: utf-8
string.tr(
"ÀÁÂÃÄÅàáâãäåĀāĂăĄąÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňŉŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž",
"AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz"
)
– http://blog.slashpoundbang.com/post/12938588984/remove-all-accents-and-diacritics-from-string-in-ruby
You have to modify a little bit the character list to respect 'ñ' character but is an easy job.

My answer: the String#parameterize method:
"Le cœur de la crémiére".parameterize
=> "le-coeur-de-la-cremiere"
For non-Rails programs:
Install activesupport: gem install activesupport then:
require 'active_support/inflector'
"a&]'s--3\014\xC2àáâã3D".parameterize
# => "a-s-3-3d"

Decompose the string and remove non-spacing marks from it.
irb -ractive_support/all
> "àáâãäå".mb_chars.normalize(:kd).gsub(/\p{Mn}/, '')
aaaaaa
You may also need this if used in a .rb file.
# coding: utf-8
the normalize(:kd) part here splits out diacriticals where possible (ex: the "n with tilda" single character is split into an n followed by a combining diacritical tilda character), and the gsub part then removes all the diacritical characters.

I think that you maybe don't really what to go down that path. If you are developing for a market that has these kind of letters your users probably will think you are a sort of ...pip.
Because 'å' isn't even close to 'a' in any meaning to a user.
Take a different road and read up about searching in a non-ascii way. This is just one of those cases someone invented unicode and collation.
A very late PS:
http://www.w3.org/International/wiki/Case_folding
http://www.w3.org/TR/charmod-norm/#sec-WhyNormalization
Besides that I have no ide way the link to collation go to a msdn page but I leave it there. It should have been http://www.unicode.org/reports/tr10/

This assumes you use Rails.
"anything".parameterize.underscore.humanize.downcase
Given your requirements, this is probably what I'd do... I think it's neat, simple and will stay up to date in future versions of Rails and Ruby.
Update: dgilperez pointed out that parameterize takes a separator argument, so "anything".parameterize(" ") (deprecated) or "anything".parameterize(separator: " ") is shorter and cleaner.

Convert the text to normalization form D, remove all codepoints with unicode category non spacing mark (Mn), and convert it back to normalization form C. This will strip all diacritics, and your problem is reduced to a case insensitive search.
See http://www.siao2.com/2005/02/19/376617.aspx and http://www.siao2.com/2007/05/14/2629747.aspx for details.

The key is to use two columns in your database: canonical_text and original_text. Use original_text for display and canonical_text for searches. That way, if a user searches for "Visual Cafe," she sees the "Visual Café" result. If she really wants a different item called "Visual Cafe," it can be saved separately.
To get the canonical_text characters in a Ruby 1.8 source file, do something like this:
register_replacement([0x008A].pack('U'), 'S')

You probably want Unicode decomposition ("NFD"). After decomposing the string, just filter out anything not in [A-Za-z]. æ will decompose to "ae", ã to "a~" (approximately - the diacritical will become a separate character) so the filtering leaves a reasonable approximation.

iconv:
http://groups.google.com/group/ruby-talk-google/browse_frm/thread/8064dcac15d688ce?
=============
a perl module which i can't understand:
http://www.ahinea.com/en/tech/accented-translate.html
============
brute force (there's a lot of htose critters!:
http://projects.jkraemer.net/acts_as_ferret/wiki#UTF-8support
http://snippets.dzone.com/posts/show/2384

I had problems getting the foo.mb_chars.normalize(:kd).gsub(/[^\x00-\x7F]/n,'').downcase.to_s solution to work. I'm not using Rails and there was some conflict with my activesupport/ruby versions that I couldn't get to the bottom of.
Using the ruby-unf gem seems to be a good substitute:
require 'unf'
foo.to_nfd.gsub(/[^\x00-\x7F]/n,'').downcase
As far as I can tell this does the same thing as .mb_chars.normalize(:kd). Is this correct? Thanks!

If you are using PostgreSQL => 9.4 as your DB adapter, maybe you could add in a migration it's "unaccent" extension that I think does what you want, like this:
def self.up
enable_extension "unaccent" # No falla si ya existe
end
In order to test, in the console:
2.3.1 :045 > ActiveRecord::Base.connection.execute("SELECT unaccent('unaccent', 'àáâãäåÁÄ')").first
=> {"unaccent"=>"aaaaaaAA"}
Notice there is case sensitive up to now.
Then, maybe use it in a scope, like:
scope :with_canonical_name, -> (name) {
where("unaccent(foos.name) iLIKE unaccent('#{name}')")
}
The iLIKE operator makes the search case insensitive. There is another approach, using citext data type. Here is a discussion about this two approaches. Notice also that use of PosgreSQL's lower() function is not recommended.
This will save you some DB space, since you will no longer require the cannonical_name field, and perhaps make your model simpler, at the cost of some extra processing in each query, in an amount depending of whether you are using iLIKE or citext, and your dataset.
If you are using MySQL maybe you can use this simple solution, but I have not tested it.

lol.. i just tryed this.. and it is working.. iam still not pretty sure why.. but when i use this 4 lines of code:
str = str.gsub(/[^a-zA-Z0-9 ]/,"")
str = str.gsub(/[ ]+/," ")
str = str.gsub(/ /,"-")
str = str.downcase
it automaticly removes any accent from filenames.. which i was trying to remove(accent from filenames and renaming them than) hope it helped :)

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Problem with TXT file extraction in ruby - ruby-on-rails

Related

Apply modification only to substring in Ruby

How do I read each line of uploaded file with mixed Windows and Unix line endings?

gsub string for all spaces in Ruby 1.8

How to make a Ruby string safe for a filesystem?

How do I replace accented Latin characters in Ruby?

Categories

Resources