How would I find similar lines in two CSV files? - ruby-on-rails

Here is my code but it takes forever for huge files:
require 'rubygems'
require "faster_csv"
fname1 =ARGV[0]
fname2 =ARGV[1]
if ARGV.size!=2
puts "Display common lines in the two files \n Usage : ruby user_in_both_files.rb <file1> <file2> "
exit 0
end
puts "loading the CSV files ..."
file1=FasterCSV.read(fname1, :headers => :first_row)
file2=FasterCSV.read(fname2, :headers => :first_row)
puts "CSV files loaded"
#puts file2[219808].to_s.strip.gsub(/\s+/,'')
lineN1=0
lineN2=0
# count how many common lines
similarLines=0
file1.each do |line1|
lineN1=lineN1+1
#compare line 1 to all line from file 2
lineN2=0
file2.each do |line2|
puts "file1:l#{lineN1}|file2:l#{lineN2}"
lineN2=lineN2+1
if ( line1.to_s.strip.gsub(/\s+/,'') == line2.to_s.strip.gsub(/\s+/,'') )
puts "file1:l#{line1}|file2:l#{line2}->#{line1}\n"
similarLines=similarLines+1
end
end
end
puts "#{similarLines} similar lines."

Ruby has set operations available with arrays:
a_ary = [1,2,3]
b_ary = [3,4,5]
a_ary & b_ary # => [3]
So, from that you should try the following. With :headers set, FasterCSV.read returns a table of row objects rather than a plain array, so convert each row to a string as you load it; this is also the place to do any preprocessing:
puts "loading the CSV files ..."
file1 = FasterCSV.read(fname1, :headers => :first_row).map{ |l| l.to_s.strip.gsub(/\s+/, '') }
file2 = FasterCSV.read(fname2, :headers => :first_row).map{ |l| l.to_s.strip.gsub(/\s+/, '') }
puts "CSV files loaded"
common_lines = file1 & file2
puts common_lines.size
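For very large files, a hash-based lookup avoids comparing every line of one file against every line of the other. The following is a minimal sketch, assuming Ruby's standard csv library (the successor to FasterCSV in Ruby 1.9+); the helper name common_rows and the sample data are made up for illustration.

```ruby
require 'csv'
require 'set'

# Return the whitespace-normalized rows that appear in both CSV texts.
# (common_rows is a hypothetical helper name, not part of any library.)
def common_rows(csv_text1, csv_text2)
  # Normalize every row of the first file once and keep the results in a Set.
  seen = CSV.parse(csv_text1, headers: true)
            .map { |row| row.to_s.gsub(/\s+/, '') }
            .to_set

  # Each row of the second file then needs only one Set lookup.
  CSV.parse(csv_text2, headers: true)
     .map { |row| row.to_s.gsub(/\s+/, '') }
     .select { |line| seen.include?(line) }
     .uniq
end

first  = "id,name\n1, alice\n2,bob\n"
second = "id,name\n2, bob\n3,carol\n"
puts common_rows(first, second).inspect
```

For files on disk, CSV.read(path, headers: true) could replace the CSV.parse calls.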

You're running gsub on every line of file2 each time you loop through file1. I'd normalize both files once up front, and then just compare the results.
Edit: Something like this (untested):
file1lines = []
file1.each do |line1|
file1lines << line1.to_s.strip.gsub(/\s+/, '')
end
# Do the same for `file2lines`
file1lines.each do |line1|
lineN1=lineN1+1
#compare line 1 to all line from file 2
lineN2=0
file2lines.each do |line2|
puts "file1:l#{lineN1}|file2:l#{lineN2}"
lineN2=lineN2+1
if ( line1 == line2 )
puts "file1:l#{line1}|file2:l#{line2}->#{line1}\n"
similarLines=similarLines+1
end
end
end
I'd also get rid of all the puts calls inside the loops unless you really need them. If the files are huge, printing a line for every comparison is probably slowing it down noticeably.

Related

File is not filled when puts is used

I'm trying to create a Rails locale file from a CSV. The file is created and the CSV is correctly parsed, but the file is not filled. I don't have errors so I don't know what is wrong...
This is my code:
# frozen_string_literal: true
class FillLanguages
require 'csv'
def self.get
result = []
file = File.new('config/locales/languages.yml', 'w')
CSV.foreach('lib/csv/BCP-47_french.csv', headers: false, col_sep: ';') do |row|
result.push(row[0])
hash = {}
key = row[0]
hash[key] = row[1]
file.puts(hash.to_yaml)
end
result
end
end
Rails.logger.debug(hash) returns
{"af-ZA"=>"Africain (Afrique du Sud)"}
{"ar-AE"=>"Arabe (U.A.E.)"}
{"ar-BH"=>"Arabe (Bahreïn)"}
{"ar-DZ"=>"Arabe (Algérie)"}
{"ar-EG"=>"Arabe (Egypte)"}
{"ar-IQ"=>"Arabe (Irak)"}
...
as expected.
Rails.logger.debug(hash.to_yaml) returns
---
af-ZA: Africain (Afrique du Sud)
---
ar-AE: Arabe (U.A.E.)
---
ar-BH: Arabe (Bahreïn)
---
ar-DZ: Arabe (Algérie)
---
ar-EG: Arabe (Egypte)
---
ar-IQ: Arabe (Irak)
...
But the file is still empty.
My CSV looks like:
https://i.gyazo.com/f3fa5ba8b1bfdd014018da5b46fa7ec0.png
Even if I try to puts a string like 'hello world' just after the line where I'm creating the file, it doesn't work...
You forgot to close the file.
You can either close it explicitly (best practice is to do that in an ensure block) or use File.open with a block, which closes the file for you.
UPDATE:
IO#close → nil
Closes ios and flushes any pending writes to the operating system. The stream is unavailable for any further data operations; an IOError is raised if such an attempt is made. I/O streams are automatically closed when they are claimed by the garbage collector.
https://ruby-doc.org/core-2.5.0/IO.html#method-i-close
So your changes are not flushed to disk from the IO buffers. You can also call IO#flush explicitly to do that, but it's better to close the files you open.
# explicit close
class FillLanguages
require 'csv'
def self.get
result = []
file = File.new('config/locales/languages.yml', 'w')
CSV.foreach('lib/csv/BCP-47_french.csv', headers: false, col_sep: ';') do |row|
result.push(row[0])
hash = {}
key = row[0]
hash[key] = row[1]
file.puts(hash.to_yaml)
end
result
ensure
file.close
end
end
--
# block version
class FillLanguages
require 'csv'
def self.get
result = []
File.open('config/locales/languages.yml', 'w') do |file|
CSV.foreach('lib/csv/BCP-47_french.csv', headers: false, col_sep: ';') do |row|
result.push(row[0])
hash = {}
key = row[0]
hash[key] = row[1]
file.puts(hash.to_yaml)
end
end
result
end
end
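To see the buffering described above in isolation, here is a small sketch that writes one line and reports the file size before and after close. The helper name sizes_around_close and the temporary path are assumptions for illustration; the size reported before close depends on Ruby's internal write buffer, so no particular value is claimed for it.

```ruby
require 'tmpdir'

# Write one line with File.new and report the on-disk size
# before and after the file is closed.
# (sizes_around_close is a hypothetical helper name.)
def sizes_around_close(path)
  file = File.new(path, 'w')
  file.puts('hello world')
  before = File.size(path) # data may still sit in Ruby's write buffer
  file.close               # close flushes pending writes to the OS
  after = File.size(path)
  [before, after]
end

Dir.mktmpdir do |dir|
  before, after = sizes_around_close(File.join(dir, 'languages.yml'))
  puts "size before close: #{before}, after close: #{after}"
end
```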

Parse binary CSV file in Ruby

This should have been such an easy thing... but I can't for the life of me figure out how to parse a CSV file that doesn't seem to have a specific encoding.
File.open(Rails.root.join('data', 'mike/test-csv.csv'), 'rb') { |f| f.read }
=> "ID,\x00Q\x00u\x00a\x00n\x00t\x00i\x00t\x00y\n\x006\x00e\x005\x004\x009\x001\x00e\x007\x00-\x007\x00f\x001\x005\x00-\x004\x001\x007\x00d\x00-\x00a\x004\x000\x003\x00-345\x00,\x00\x005\x000\x00.\x000\x000\x000\x000\x000\x000\x000\x000\x00\n"
Here's a gist of it, can't figure out a way to post the specific CSV.
All I get from checking the encoding of the file is that it's in binary format, any thoughts on how I could get it into a normal csv?
Note: This is a downloaded CSV so converting it to another encoding via opening it in excel and exporting (or something like that) is not an option :)
Thanks!
Updating with attempted solution 1:
path = Rails.root.join('data', 'mike/test-csv.csv')
CSV.read(path, {:headers => true, :encoding => 'utf-8'}).each do |d|
puts d
end
Result: 6e5491e7-7f15-417d-a403-345,50.00000000
While this is correct, it ONLY works with puts, for example:
CSV.read(path, {:headers => true, :encoding => 'utf-8'}).map { |row| row }
=> [#<CSV::Row "ID":"\u00006\u0000e\u00005\u00004\u00009\u00001\u0000e\u00007\u0000-\u00007\u0000f\u00001\u00005\u0000-\u00004\u00001\u00007\u0000d\u0000-\u0000a\u00004\u00000\u00003\u0000-345\u0000" "\u0000Q\u0000u\u0000a\u0000n\u0000t\u0000i\u0000t\u0000y":"\u0000\u00005\u00000\u0000.\u00000\u00000\u00000\u00000\u00000\u00000\u00000\u00000\u0000">]
CSV.read(path, {:headers => true, :encoding => 'utf-8'}).map(&:to_s)
=> ["\u00006\u0000e\u00005\u00004\u00009\u00001\u0000e\u00007\u0000-\u00007\u0000f\u00001\u00005\u0000-\u00004\u00001\u00007\u0000d\u0000-\u0000a\u00004\u00000\u00003\u0000-345\u0000,\u0000\u00005\u00000\u0000.\u00000\u00000\u00000\u00000\u00000\u00000\u00000\u00000\u0000\n"]
It's unfortunately still not the correct string :(
Final Solution (via @ashmaroli below):
path = Rails.root.join('data', 'mike/test-csv.csv')
csv_text = ''
File.open(path, 'r') do |csv|
csv.each_line do |line|
csv_text << line.gsub(/\u0000/, '')
end
end
CSV.parse(csv_text, headers:true).map do |row| row end
Result:
[#<CSV::Row "ID":"6e5491e7-7f15-417d-a403-345" "Quantity":"50.00000000">]
path = Rails.root.join('data', 'mike/test-csv.csv')
file = ""
File.open(path, 'r') do |csv|
csv.each_line do |line|
file << line.gsub(/\u0000/, '')
end
end
print file
print file.inspect # same as above just wraps the string in a
# single line with "\n" chars
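The same idea can be tried without touching the filesystem. This is a sketch only: parse_without_nuls is a hypothetical helper, and the sample string is simulated (a NUL byte in front of every character) rather than taken from the real file.

```ruby
require 'csv'

# Remove NUL bytes from raw CSV text, then parse it with headers.
# (parse_without_nuls is a hypothetical helper name.)
def parse_without_nuls(raw)
  cleaned = raw.delete("\u0000")
  CSV.parse(cleaned, headers: true).map(&:to_h)
end

# Simulated input: a NUL byte before every character,
# loosely resembling the dump shown in the question.
sample = "ID,Quantity\n6e5491e7-345,50.00000000\n"
         .each_char.map { |c| "\u0000#{c}" }.join

puts parse_without_nuls(sample).inspect
```

If a file turns out to be cleanly UTF-16 encoded, transcoding on read (for example File.read(path, encoding: 'UTF-16LE:UTF-8')) might be the tidier route; the dump in the question mixes plain and NUL-prefixed bytes, which is why stripping is used here.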

Writing to CSV returns : undefined method `map' for "\n" or "0"

I am trying to write to a CSV but I ran into a problem. I have seen this so I have just applied the solution but I get an error.
This is my code:
require 'csv'
data = Owner.find(2).cats
CSV.open("file.csv", "w") do |csv|
data.each do |cat|
csv << cat.name
end
end
I have checked in console and I am getting data for Owner.find(2).cats.
When trying to write this to my CSV I get the error:
undefined method `map' for 0:Fixnum
and when I try the simple solution from the same question :
require 'csv'
CSV.open("file.csv", "w") do |csv|
csv << "\n"
end
I get this error:
undefined method `map' for "\n":String
Do you know what I am doing wrong?
I am new to Ruby so maybe I am making one of the rookie mistakes.
A CSV is a collection of rows, each of which is a collection of columns; it's a two-dimensional array, that gets converted to text form. So the top-level CSV object expects you to append arrays to it, not individual cell values.
Note that in this code:
CSV.open('filename','w') do |csv|
# do stuff
end
The "do stuff" part runs exactly once. It's up to you to create the structure of the CSV, usually with something like this:
CSV.open('filename','w') do |csv|
data.each do |item|
row = [item.field1, item.field2, item.field3]
csv << row
end
end
or even a double loop:
CSV.open('filename','w') do |csv|
data.each do |item|
row = []
fields.each do |field|
row << item[field]
end
csv << row
end
end
As an example:
$ irb
irb(main):001:0> require 'csv' #=> true
irb(main):002:0> CSV.open("cats.csv", "w") do |csv|
irb(main):003:1* csv << [ "cat1" ] << [ "cat2" ] << [ "cat3" ]
irb(main):004:1> end
#=> #<CSV io_type:File io_path:"cats.csv" encoding:UTF-8 lineno:3 col_sep:"," row_sep:"\n" quote_char:"\"">
irb(main):005:0>
$ cat cats.csv
cat1
cat2
cat3
$
Notice that the file has no quotation marks or square brackets in it.
require 'csv'
data = Owner.find(2).cats
CSV.open("file.csv", "w") do |csv|
data.each { |cat| csv << [cat.name] }
end
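The row-as-array rule is easy to check with CSV.generate, which builds the CSV text in memory instead of writing a file. A minimal sketch: the Cat struct is a hypothetical stand-in for the ActiveRecord model in the question, and the age column is invented to show a multi-column row.

```ruby
require 'csv'

# Hypothetical stand-in for the ActiveRecord model in the question.
Cat = Struct.new(:name, :age)

# Build CSV text with one row (an array) per record.
def cats_to_csv(cats)
  CSV.generate do |csv|
    csv << %w[name age]                             # header row
    cats.each { |cat| csv << [cat.name, cat.age] }  # one array per row
  end
end

puts cats_to_csv([Cat.new('Tom', 3), Cat.new('Felix', 5)])
```

Swapping CSV.generate for CSV.open("file.csv", "w") writes the same rows to disk.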

Rake task not saving or creating new record in database

I've created a ruby script that executes fine if I run it from Console.
The script fetches some information from various websites and saves it to my database table.
However, when I want to turn the code into a rake task, the code still runs, but it does not save any new records. I don't get any errors from the rake either.
# Add your own tasks in files placed in lib/tasks ending in .rake,
# for example lib/tasks/capistrano.rake, and they will automatically be available to Rake.
require File.expand_path('../config/application', __FILE__)
Rails.application.load_tasks
require './crawler2.rb'
task :default => [:crawler]
task :crawler do
### ###
require 'rubygems'
require 'nokogiri'
require 'open-uri'
start = Time.now
$a = 0
sites = ["http://www.nytimes.com","http://www.news.com"]
for $a in 0..sites.size-1
url = sites[$a]
$i = 75
$error = 0
avoid_these_links = ["/tv", "//www.facebook.com/"]
doc = Nokogiri::HTML(open(url))
links = doc.css("a")
hrefs = links.map {|link| link.attribute('href').to_s}.uniq.sort.delete_if {|href| href.empty?}.delete_if {|href| avoid_these_links.any? { |w| href =~ /#{w}/ }}.delete_if {|href| href.size < 10 }
#puts hrefs.length
#puts hrefs
for $i in 0..hrefs.length
begin
#puts hrefs[60] #for debugging)
#file = open(url)
#doc = Nokogiri::HTML(file) do
if hrefs[$i].downcase().include? "http://"
doc = Nokogiri::HTML(open(hrefs[$i]))
else
doc = Nokogiri::HTML(open(url+hrefs[$i]))
end
image = doc.at('meta[property="og:image"]')['content']
title = doc.at('meta[property="og:title"]')['content']
article_url = doc.at('meta[property="og:url"]')['content']
description = doc.at('meta[property="og:description"]')['content']
category = doc.at('meta[name="keywords"]')['content']
newspaper_id = 1
puts "\n"
puts $i
#puts "Image: " + image
#puts "Title: " + title
#puts "Url: " + article_url
#puts "Description: " + description
puts "Category: " + category
Article.create({
:headline => title,
:caption => description,
:thumbnail_url => image,
:category_id => 3,
:status => true,
:journalist_id => 2,
:newspaper_id => newspaper_id,
:from_crawler => true,
:description => description,
:original_url => article_url}) unless Article.exists?(original_url: article_url)
$i +=1
#puts $i #for debugging
rescue
#puts "Error here: " + url+hrefs[$i] if $i < hrefs.length
$i +=1 # do_something_* again, with the next i
$error +=1
end
end
puts "Page: " + url
puts "Articles: " + hrefs.length.to_s
puts "Errors: " + $error.to_s
$a +=1
end
finish = Time.now
diff = ((finish - start)/60).to_s
puts diff + " Minutes"
### ###
end
The code executes fine, if I save the file as crawler.rb and open it in Console by doing --> " load './crawler2.rb' ". When I use the exact same code in a rake task, I get no new records.
I figured out what was wrong.
I need to remove:
require './crawler2.rb'
task :default => [:crawler]
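and instead make the task depend on :environment, so that the Rails environment (and with it the models and the database connection) is loaded before the task body runs: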
task :crawler => :environment do
Now the crawler runs every ten minutes with a bit of help from Heroku scheduler :-)
Thanks for the help guys - and sorry for the bad formatting. Hope this answer may help others.
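How the :environment prerequisite works can be sketched with plain Rake, outside Rails. In a real Rails app the :environment task is provided by Rails and loads the application; here it is replaced by a stub that only records that it ran, and RAKE_LOG is a hypothetical name used for the demonstration.

```ruby
require 'rake'
extend Rake::DSL # makes `task` available at the top level of this script

RAKE_LOG = [] # hypothetical log, only used to show the execution order

# Stub standing in for the :environment task that Rails defines.
task :environment do
  RAKE_LOG << 'environment loaded'
end

# Declaring :environment as a prerequisite makes Rake run it first.
task crawler: :environment do
  RAKE_LOG << 'crawler ran'
end

Rake::Task[:crawler].invoke
puts RAKE_LOG.inspect
```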

Ruby csv import trouble

I tried to import data from csv in my rails app, but something went wrong:
CSV::MalformedCSVError in ArticlesController#index
Unclosed quoted field on line 1.
My csv looks like this:
"Код";"№ по каталогу (артикул)";"Наименование товара";"Ед. изм.";"Цена опт.";"Доп.";"Остатки";"Класс";"Группа";"Бренд";"Блок."
2223;15-562-44;15-562-44 (27-B07-F) VW Polo 95-R ;шт ;37,430;;;Амортизаторы ;Амортизаторы BOGE ;;
10327;24-052-1;24-052-1(46-A27-0) LAND ROVER 84- F ;шт ;68,750;;;Амортизаторы ;Амортизаторы BOGE ;;
10328;24-053-1;24-053-1(46-A28-0) LAND ROVER 84- R ;шт ;68,750;;;Амортизаторы ;Амортизаторы BOGE ;;
Maybe this is because of the first line (which has no ;;)
My code looks like this:
def csv_import
require 'csv'
file = File.open("/#{Rails.public_path}/uploads/smallcsv.csv")
#csv = CSV.parse(file)
csv = CSV.open(file, "r:ISO-8859-15:UTF-8", {:col_sep => ";", :row_sep => ";;", :headers => :first_row})
file_path = "/#{Rails.public_path}/uploads/smallcsv.csv"
##parsed_file=CSV::Reader.parse(file_path)
csv.each do |row|
ename = row[2]
eprice = row[5]
eqnt = row[7]
esupp = row[10]
logger.warn(ename)
end
end
I'm running ruby 1.9+ with fastercsv gem
I figured this out myself using "CSV - Unquoted fields do not allow \r or \n (line 2)".
The problem was with the first line, so setting the row separator to :auto (:row_sep => :auto) instead of ";;" helped me.
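A minimal sketch of that fix, assuming Ruby's standard csv library: keep ; as the column separator and let the parser detect the line ending. The sample data uses simplified English headers invented for illustration, and parse_semicolon_csv is a hypothetical helper name.

```ruby
require 'csv'

# Made-up sample in the same shape as the file in the question:
# a quoted header line followed by unquoted, semicolon-separated rows.
SAMPLE = <<~DATA
  "Code";"Part no.";"Name";"Unit";"Price"
  2223;15-562-44;VW Polo 95-R ;pcs ;37,430
  10327;24-052-1;LAND ROVER 84- F ;pcs ;68,750
DATA

# Parse with ";" as the column separator and automatic row-separator
# detection, trimming the trailing spaces seen in the data.
# (parse_semicolon_csv is a hypothetical helper name.)
def parse_semicolon_csv(text)
  CSV.parse(text, col_sep: ';', row_sep: :auto, headers: true).map do |row|
    row.to_h.transform_values { |value| value&.strip }
  end
end

parse_semicolon_csv(SAMPLE).each do |row|
  puts "#{row['Name']}: #{row['Price']}"
end
```

With this setup, a trailing ;; on a data line would simply produce two empty columns at the end of the row rather than marking the end of the row.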
