I use CSV to save each line in a text file as a separate object in the database.
Each line is saved wrapped in square brackets and double quotes:
["One line of text"]
Is there any option in CSV to exclude those, or else any other nifty way to remove them?
require 'csv'
def self.import_lines_of_text(filename)
  csv_file_path = "db/questions/#{filename}"
  CSV.foreach(csv_file_path) do |row|
    clean_sentence = row.join(",")
    self.create!(content: clean_sentence)
  end
end
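For what it's worth, the brackets and quotes are what an Array looks like when it is converted to a string: CSV.foreach yields each row as an Array of fields. A minimal sketch of two ways around that, assuming the file is plain text with one sentence per line (the helper names are made up for illustration):

```ruby
require 'csv'

# CSV yields every row as an Array of fields, so a line with no commas
# becomes a one-element Array such as ["One line of text"].
# Joining the fields with "," rebuilds the original sentence.
def sentences_via_csv(text)
  CSV.parse(text).map { |row| row.join(",") }
end

# If the file is not really CSV, plain line-by-line reading avoids the
# Array altogether. (File.foreach(path) works the same way on a file.)
def sentences_via_lines(text)
  text.each_line.map(&:chomp)
end

sample = "One line of text\nAnother, with a comma\n"
p sentences_via_csv(sample)
p sentences_via_lines(sample)
```

Reading plain lines also sidesteps CSV quoting rules, which can raise an error on a stray double quote in the text.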
I want to loop over a csv file using CSV.foreach, read the data, perform some operation with it, and write the result to the last column of that row, using the Row object.
Say I have a CSV with data I need to save to a database using Rails ActiveRecord. I validate each record; if it is valid, I write true in the last column, and if not, I write the errors.
Example csv:
id,title
1,some title
2,another title
3,yet another title
CSV.foreach(path, "r+", headers: true) do |row|
  archive = Archive.new(
    title: row["title"]
  )
  archive.save!
  row["valid"] = true
rescue ActiveRecord::RecordInvalid => e
  row["valid"] = archive.errors.full_messages.join(";")
end
When I run the code it reads the data, but it does not write anything to the csv. Is this possible?
Is it possible to write in the same csv file?
Using:
Ruby 3.0.4
The row variable in your iterator exists only in memory. You need to write the information back to the file yourself, like this:
new_csv = ["id,title,valid\n"]
CSV.foreach(path, 'r', headers: true) do |row| # the mode must be 'r'; 'r+' is not valid here
  row["valid"] = 'foo'
  new_csv << row.to_s
end
File.open(path, 'w') do |f|
  f.write new_csv.join # join first; writing the Array itself would write its inspect form
end
[EDIT] An earlier version of this answer passed 'r+' to foreach; that option is not valid, it should be 'r'.
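As an alternative sketch of the same idea, the whole file can be loaded as a CSV::Table, changed in memory, and serialized again with to_csv. The helper name add_valid_column and the sample data are made up for illustration, and the block stands in for whatever validation you run:

```ruby
require 'csv'

# Hypothetical helper: adds a "valid" column to CSV text and returns the new text.
# The block decides what goes into the new column for each row.
def add_valid_column(csv_text)
  table = CSV.parse(csv_text, headers: true) # CSV.read(path, headers: true) for a file
  table.each do |row|
    row["valid"] = yield(row)                # appends the field to the in-memory row
  end
  table.to_csv                               # header line plus all rows, as one String
end

input = "id,title\n1,some title\n2,\n"
output = add_valid_column(input) { |row| !row["title"].nil? }
puts output
# File.write(path, output) would then replace the original file.
```

This keeps everything in memory, so it is only a reasonable choice when the file is small enough to load at once.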
Maybe this is over-engineering things a bit. But I would do the following:
Read the original CSV file.
Create a temporary CSV file.
Insert the updated headers into the temporary CSV file.
Insert the updated records into the temporary CSV file.
Replace the original CSV file with the temporary CSV file.
require 'csv'
require 'fileutils'
require 'securerandom'
csv_path = 'archives.csv'
input_csv = CSV.read(csv_path, headers: true)
input_headers = input_csv.headers
# using a UUID to prevent file conflicts
tmp_csv_path = "#{csv_path}.#{SecureRandom.uuid}.tmp"
output_headers = input_headers + %w[errors]
CSV.open(tmp_csv_path, 'w', write_headers: true, headers: output_headers) do |output_csv|
  input_csv.each do |archive_data|
    values = archive_data.values_at(*input_headers)
    archive = Archive.new(archive_data.to_h)
    # valid? runs the validations and populates archive.errors
    archive.valid?
    # error_messages is an empty string if there are no errors
    error_messages = archive.errors.full_messages.join(';')
    output_csv << values + [error_messages]
  end
end
# replace the original file with the updated copy
FileUtils.move(tmp_csv_path, csv_path)
I'd like multiple pieces of data on different lines within the same CSV cell like this:
"String" 2-15-2021 05:26pm
"String ..."
"String..."
I have tried the following and ended up with \n in the cell and not an actual new line, like this "2-15-2021 05:26pm \nHi, it's ...".
["\n", time, text.body].join("\n")
[time, text.body, "\n"].join("\n")
[time, text.body].join("\n")
The input data is an array of hashes. The output of a row is a hash with keys and values, one of the values is a list of strings (or this can be a list of lists of string, I am playing with what I can get to work). The list of strings is where I am trying to add line breaks.
I am using this to create the csv:
CSV.open("data.csv", "wb") do |csv|
  csv << list.first.keys
  list.each do |hash|
    csv << hash.values
  end
end
I ended up needing a list of strings that I could then join and add new lines onto.
values = []
values.push("#{time}, #{text.body}")
# And then in the hash for the csv, setting the value for that column like this:
{ message: values.join("\n\n")}
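A small sketch of why joining into a single String matters. It assumes the earlier attempts handed CSV a list (an Array) as the cell value, as described in the question; the helper names are hypothetical:

```ruby
require 'csv'

# A String with real newlines becomes one quoted, multi-line cell.
def row_with_string_cell(time, body)
  CSV.generate_line(["message", [time, body].join("\n")])
end

# An Array value is converted with to_s (its inspect form), so the newline
# turns into the two characters backslash and n inside the cell.
def row_with_array_cell(time, body)
  CSV.generate_line(["message", ["#{time}\n#{body}"]])
end

time = "2-15-2021 05:26pm"
body = "Hi, it's ..."
puts row_with_string_cell(time, body)
puts row_with_array_cell(time, body)
```

Whether the line break is then displayed inside the cell still depends on the program used to open the file.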
I am trying to get data out of pdf files, convert them to CSV, then organize into one table.
A sample pdf can be found here
https://www.ttb.gov/statistics/2011/201101wine.pdf
It's data on US wine production. So far, I have been able to get the PDF files and convert them into CSV.
Here is the CSV file that has been converted from PDF:
https://github.com/jjangmes/wine_data/blob/master/csv_data/201101wine.txt
However, when I try to find data by row, it's not working.
require 'csv'
csv_in = CSV.read('201001wine.txt', :row_sep => :auto, :col_sep => ";")
csv_in.each do |line|
  puts line
end
When I print line[0], the entire data set is printed, so it looks like all of the data is just shoved into line[0].
line will extract all the data.
line[0] will extract all the data with space in between lines.
line[1] gives the error "/bin/bash: shell_session_update: command not found"
How can I correctly divide up the data so I can parse them by row?
This is really messy data with no headings or IDs, so I think the best approach is to get the data into CSV and find the data I want by looking up the right row number.
Though not all data have the same number of rows, most do. So I thought that'd be the best way for now.
Thanks!
Edit 1:
Here is the code that I have to scrape and get the data.
require 'mechanize'
require 'docsplit'
require 'byebug'
require 'csv'
def pdf_to_text(pdf_filename)
  extracted_text = Docsplit.extract_text([pdf_filename], ocr: false, col_sep: ";", output: 'csv_data')
  extracted_text
end
def save_by_year(starting, ending)
  agent = Mechanize.new{|a| a.ssl_version, a.verify_mode = 'TLSv1', OpenSSL::SSL::VERIFY_NONE}
  agent.get('https://www.ttb.gov')
  (starting..ending).each do |year|
    year_page = agent.get("/statistics/#{year}winestats.shtml")
    (1..12).each do |month|
      month_trail = '%02d' % month
      link = year_page.links_with(:href => "20#{year}/20#{year}#{month_trail}wine.pdf").first
      page = agent.click(link)
      File.open(page.filename.gsub(" /","_"), 'w+b') do |file|
        file << page.body.strip
      end
      pdf_to_text("20#{year}#{month_trail}wine.pdf")
    end
  end
end
After converting, I try to access the data by reading the text file and then going through each row in it.
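One thing worth checking: puts on an Array prints each element on its own line, so line and line[0] printing the same thing suggests that every row holds a single field, meaning the text contains no ";" for CSV to split on. A rough sketch of splitting on runs of spaces instead, assuming the extracted text keeps its columns apart with two or more spaces (the helper name and the sample line are made up, not taken from the report):

```ruby
# Split one line of extracted text into cells on runs of two or more
# whitespace characters. This is an assumption about the layout of the
# extracted text, so check it against a few real lines first.
def split_columns(line)
  line.strip.split(/\s{2,}/)
end

# Made-up sample line, for illustration only.
sample = "Still Wine    1,234    5,678"
p split_columns(sample)

# For a whole file (the path is hypothetical):
# rows = File.foreach("csv_data/201101wine.txt").map { |l| split_columns(l) }
```

Each element of rows would then be an Array of cells, so a value can be looked up as rows[row_number][column_number].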
Ruby n00b here. I am scraping the same page twice - but in a slightly different way each time - and exporting them to separate CSV files. I would like to then combine the first column from CSV no.1 and the second column from CSV no.2 to create CSV no.3.
The code to pull CSVs no.1 & 2 works. But adding my attempt to combine the two CSVs into the third one (commented out at the bottom) returns the following error. The two CSVs populate fine, but the third stays blank and the script appears to be stuck in an infinite loop. I know these lines shouldn't be at the bottom, but I can't see where else they would go...
alts.rb:45:in `block in <main>': undefined local variable or method `scrapedURLs1' for main:Object (NameError)
from /Users/JammyStressford/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/csv.rb:1266:in `open'
from alts.rb:44:in `<main>'
The code itself:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'
url = "http://www.example.com/page"
page = Nokogiri::HTML(open(url))
CSV.open("results1.csv", "wb") do |csv|
  page.css('img.product-card-image').each do |scrape|
    product1 = scrape['alt']
    page.css('a.product-card-image-link').each do |scrape|
      link1 = scrape['href']
      scrapedProducts1 = "#{product1}"[0..-7]
      scrapedURLs1 = "{link1}"
      csv << [scrapedProducts1, scrapedURLs1]
    end
  end
end
CSV.open("Results2.csv", "wb") do |csv|
  page.css('a.product-card-image-link').each do |scrape|
    link2 = scrape['href']
    page.css('img.product-card-image').each do |scrape|
      product2 = scrape['alt']
      scrapedProducts2 = "#{product2}"[0..-7]
      scrapedURLs2 = "http://www.lyst.com#{link2}"
      csv << [scrapedURLs2, scrapedProducts2]
    end
  end
end
## Here is where I am trying to combine the two columns into a new CSV. ##
## It doesn't work. I suspect that this part should be further up... ##
# CSV.open("productResults3.csv", "wb") do |csv|
# csv << [scrapedURLs1, scrapedProducts2]
#end
puts "upload complete!"
Thanks for reading.
Thank you for sharing your code and your question. I hope my input helps!
Your scrapedURLs1 = "{link}" and scrapedProducts1 = "#{scrape['alt']}"[0..-7] have a 1 on the end, but you don't use that suffix in csv << [scrapedProducts, scrapedURLs]. This is the error you are getting.
I would recommend combining your first two steps: instead of writing to a file, collect the rows in an Array of Arrays, and then write them to file.
Do you realize that, in the example code you've given, scrapedURLs1, scrapedProducts2 would be mixing the wrong URLs with the wrong products? Is that what you mean to do?
Within the commented-out code, scrapedURLs1 and scrapedProducts2 do not exist; they have not been declared in that scope. You need to open both files for reading, one with .each do |scrapedURLs1| and the other with .each do |scrapedProducts2|, and then those variables will exist, because each yields them as block parameters.
Reusing the same |scrape| variable on your inner iteration isn't a good idea. Change the name to something else such as |scrape2| . It "happens" to work because you've already taken what you need in product=scrape['alt'] before the second loop. If you rename the second loop variable you can move the product=scrape['alt'] line into the inner loop and combine them. Example:
# In your code example you may get many links per product.
# If that was your intent then that may be fine.
# Note that nested loops like these still pair every product with every link.
CSV.open("results1.csv", "wb") do |csv|
  page.css('img.product-card-image').each do |scrape|
    page.css('a.product-card-image-link').each do |scrape2|
      # [ product , link ]
      csv << [scrape['alt'][0..-7], scrape2['href']]
      # NOTE that scrape['alt'][0..-7] and scrape2['href'] are already strings
      # so you don't need to use "#{ }"
    end
  end
end
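Since nested loops pair every image with every link, here is a rough sketch of pairing them one-to-one instead, collecting the rows in an Array of Arrays before writing. It assumes the two selectors return the same number of elements in the same order. Plain arrays stand in for the Nokogiri results, and the helper name and sample values are invented:

```ruby
require 'csv'

# Pair the nth product with the nth link and return the rows as an Array of Arrays.
def pair_rows(products, links)
  products.zip(links)
end

# Invented stand-ins for page.css(...).map { |node| node['alt'] } and
# page.css(...).map { |node| node['href'] }.
products = ["Red Dress", "Blue Coat"]
links    = ["/dress/1", "/coat/2"]

rows = pair_rows(products, links)

# Write everything in one go once the rows are collected.
csv_text = CSV.generate do |csv|
  rows.each { |row| csv << row }
end
puts csv_text
```

With the rows held in memory, a third file that mixes columns can be written from the same Array without reopening the first two CSVs.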
Side note: Ruby 2.0.0 does not need the line require "rubygems"
If you're working with CSVs, note that James Edward Gray II's faster_csv gem became the standard csv library in Ruby 1.9, so require 'csv' already gives it to you on Ruby 2.0. Its examples are still a useful reference: https://github.com/JEG2/faster_csv/blob/master/examples/csv_writing.rb
I am new to Ruby. I want to remove non-numeric characters from a phone number parsed from a CSV file.
Here is the code I am using.
require 'csv'
csv_text = File.read('file.csv')
csv = CSV.parse(csv_text, :headers => true)
csv.each do |row|
  puts "First Name: #{row['Name']} - HomePhone: #{row['Phone']} - Zip Code: #{row['Zipcode']}"
end
The output prints as follows:
FirstName:Abiel HomePhone:6667-88-76
(In the CSV file, HomePhone contains non-numeric characters.)
I want the output to be FirstName:Abiel HomePhone:66678876
This should work:
row['Phone'].gsub(/[^0-9]/, "")
Yes, or just row['Phone'].gsub(/\D/, "")
where \d means a numeric char, and \D means anything non-numeric.
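A short sketch of both suggestions in use. gsub returns a new string rather than changing the field, so its result has to be the thing that gets printed; the helper name digits_only is made up for illustration:

```ruby
# Strip everything that is not a digit from a phone number.
def digits_only(phone)
  phone.to_s.gsub(/\D/, "") # to_s guards against a nil (empty) CSV field
end

puts digits_only("6667-88-76")
puts "6667-88-76".delete("^0-9") # String#delete gives the same result without a regex
```

In the loop from the question, this would go inside the interpolation, for example HomePhone: #{digits_only(row['Phone'])}.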