How to parse data from a TXT file with a tab separator? - ruby-on-rails

I am using Ruby 1.8.7 and Rails 2.3.8. I want to parse data from a TXT dump file whose fields are separated by tabs.
The dump contains some CSS properties and appears to include some invalid data.
When I run my code using the FasterCSV gem:
FasterCSV.foreach(txt_file, :quote_char => '"', :col_sep => "\t",
                  :row_sep => :auto, :headers => :first_row) do |row|
  puts row[15]
end
the console prints "Illegal quoting on line 38." Can anyone suggest how to skip the rows with invalid data and continue loading the remaining rows?

Here's one way to do it. We drop to a lower level, using shift to parse each row, silencing the MalformedCSVError exception and continuing with the next iteration. The drawback is that the loop doesn't look very nice; if anyone can improve this, you're welcome to edit the code.
FasterCSV.open(filename, :quote_char => '"', :col_sep => "\t", :headers => true) do |csv|
  row = true
  while row
    begin
      row = csv.shift
      break unless row
      # Do things with the row here...
    rescue FasterCSV::MalformedCSVError
      next
    end
  end
end
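The same shift-and-rescue pattern reads a little more naturally with `loop`. A sketch using Ruby 1.9+'s standard CSV library (which is FasterCSV merged into the stdlib; on 1.8, substitute FasterCSV and FasterCSV::MalformedCSVError), with made-up sample data:

```ruby
require 'csv'

# Build a small tab-separated sample file (made-up data).
File.write('sample.tsv', "a\tb\n1\t2\n3\t4\n")

rows = []
CSV.open('sample.tsv', :quote_char => '"', :col_sep => "\t", :headers => true) do |csv|
  loop do
    begin
      row = csv.shift or break      # shift returns nil at end of file
    rescue CSV::MalformedCSVError
      next                          # skip a malformed row and keep reading
    end
    rows << row.fields
  end
end
rows  # => [["1", "2"], ["3", "4"]]
```

The `or break` replaces the `row = true` / `while row` scaffolding, so the loop body only deals with real rows.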

Just read the file as a regular text file (not with FasterCSV), split each line on \t as you do now, and it should work.
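A minimal sketch of that approach (the file name and contents are made up): since no quoting rules apply, "illegal quoting" can never be raised.

```ruby
# Write a small tab-separated dump to parse (made-up sample data).
File.write('dump.txt', "name\tcity\nalice\tparis\nbob\tlondon\n")

# Plain-text parsing: split each line on tabs; -1 keeps trailing empty fields.
rows = File.foreach('dump.txt').map { |line| line.chomp.split("\t", -1) }
header = rows.shift

header  # => ["name", "city"]
rows    # => [["alice", "paris"], ["bob", "london"]]
```

The trade-off is that you lose CSV's handling of quoted fields entirely, which is exactly what you want for a dump that was never properly quoted in the first place.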

So the problem is that TSV files don't have a quote character; the format simply forbids tabs inside the data.
The CSV library doesn't really support this use case. I've worked around it by specifying a quote character that I know won't appear in my data. For example:
CSV.parse(txt_file, :quote_char => '☎', :col_sep => "\t") do |row|
  puts row[15]
end

Related

CSV::MalformedCSVError: Unquoted fields do not allow \r or \n in Ruby

I have CSVs which I am trying to import into my Oracle database, but unfortunately I keep getting the same error:
> CSV::MalformedCSVError: Unquoted fields do not allow \r or \n (line 1).
I know there are tons of similar questions that have been asked, but none relates specifically to my issue other than this one, and unfortunately it didn't help.
To explain my scenario:
I have CSVs in which the rows don't always end with a value; sometimes a row ends with just a comma, because the last value is null and stays blank.
I would like to import the CSVs regardless of whether they end with a comma or not.
Here are the first 5 lines of my CSV, with values changed for privacy reasons:
id,customer_id,provider_id,name,username,password,salt,email,description,blocked,created_at,updated_at,deleted_at
1,,1,Default Administrator,admin,1,1," ",Initial default user.,f,2019-10-04 14:28:38.492000,2019-10-04 14:29:34.224000,
2,,2,Default Administrator,admin,2,1,,Initial default user.,,2019-10-04 14:28:38.633000,2019-10-04 14:28:38.633000,
3,1,,Default Administrator,admin,3,1," ",Initial default user.,f,2019-10-04 14:41:38.030000,2019-11-27 10:23:03.329000,
4,1,,admin,admin,4,1," ",,,2019-10-28 12:21:23.338000,2019-10-28 12:21:23.338000,
5,2,,Default Administrator,admin,5,1," ",Initial default user.,f,2019-11-12 09:00:49.430000,2020-02-04 08:20:06.601000,2020-02-04 08:20:06.601000
As you can see, a row sometimes ends with a comma and sometimes without one, and this structure repeats quite often.
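Worth noting: a trailing comma by itself is legal CSV and simply yields a trailing nil field, so the error almost certainly comes from stray \r characters rather than from the commas. A quick check with minimal made-up rows:

```ruby
require 'csv'

# A trailing comma just produces a trailing nil field -- no error:
fields = CSV.parse_line("1,,admin,")
fields  # => ["1", nil, "admin", nil]

# A bare \r inside an unquoted field is what raises MalformedCSVError:
error = begin
  CSV.parse_line("1,admin\r,x", :row_sep => "\n")
  nil
rescue CSV::MalformedCSVError => e
  e
end
```

That points at line-ending normalization, not comma handling, as the thing to fix.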
This is the code I have been playing around with:
def csv_replace_empty_string
  Dir.foreach(Rails.root.join('db', 'csv_export')) do |filename|
    next if filename == '.' or filename == '..' or filename == 'extract_db_into_csv.sh' or filename == 'import_csv.rb'
    read_file = File.read(Rails.root.join('db', 'csv_export', filename))
    replace_empty_string = read_file.gsub(/(?<![^,])""(?![^,])/, '" "')
    format_csv = replace_empty_string.gsub(/\r\r?\n?/, "\n")
    # format_csv = remove_empty_lines.sub!(/(?:\r?\n)+\z/, "")
    File.open(Rails.root.join('db', 'csv_export', filename), "w") { |file| file.puts format_csv }
  end
end
I have tried many different kinds of gsubs found in similar forums, but they didn't help.
Here is my function for importing the CSV in the db:
def import_csv_into_db
  Dir.foreach(Rails.root.join('db', 'csv_export')) do |filename|
    next if filename == '.' or filename == '..' or filename == 'extract_db_into_csv.sh' or filename == 'import_csv.rb'
    filename_renamed = File.basename(filename, File.extname(filename)).classify
    CSV.foreach(Rails.root.join('db', 'csv_export', filename), :headers => true, :skip_blanks => true) do |row|
      class_name = filename_renamed.constantize
      class_name.create!(row.to_hash)
      puts "Insert on table #{filename_renamed} complete"
    end
  end
end
I have also tried the options provided by CSV, such as :row_sep => "\n" or :row_sep => "\r", but I keep getting the same error.
I am pretty sure I have some sort of thinking error, but I can't seem to figure it out.
I fixed the issue by using the following:
format_csv = replace_empty_string.gsub(/\r\r?\n?/, "\n")
This was originally @mgrims's answer, but I had to adjust my code by also removing the :skip_blanks and :row_sep options.
It is importing successfully now!

Finding correct row in data extracted from PDF to text

I am trying to get data out of PDF files, convert them to CSV, and then organize everything into one table.
A sample pdf can be found here
https://www.ttb.gov/statistics/2011/201101wine.pdf
It's data on US wine production. So far, I have been able to get the PDF files and convert them into CSV.
Here is the CSV file that has been converted from PDF:
https://github.com/jjangmes/wine_data/blob/master/csv_data/201101wine.txt
However, when I try to find data by row, it's not working.
require 'csv'

csv_in = CSV.read('201001wine.txt', :row_sep => :auto, :col_sep => ";")
csv_in.each do |line|
  puts line
end
When I puts line[0], the entire data set is printed, so it looks like everything has been shoved into row[0]:
line extracts all the data.
line[0] extracts all the data with a space between lines.
line[1] gives the error "/bin/bash: shell_session_update: command not found".
How can I correctly divide up the data so I can parse it by row?
This is really messy data with no heading or ID, so I think the best approach is to get the data into CSV and then find what I want by looking up the right row number.
Though not all files have the same number of rows, most do, so I thought that would be the best way for now.
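One hedged fallback, given that ";" apparently never occurs in the extracted text (which is why everything lands in one field): read the file as plain text and split each line on runs of two or more spaces. The file name and column layout below are assumptions about what the extracted report looks like:

```ruby
# Made-up stand-in for the extracted report text.
File.write('sample_report.txt', "Wine  Gallons  Tax\nStill Wine  1000  50\n")

lines = File.foreach('sample_report.txt').map(&:chomp)
# Split on 2+ spaces so multi-word labels like "Still Wine" stay intact.
fields = lines.map { |line| line.strip.split(/\s{2,}/) }

fields  # => [["Wine", "Gallons", "Tax"], ["Still Wine", "1000", "50"]]
```

Whether the 2-space heuristic holds depends entirely on how Docsplit laid out the columns, so inspect a few lines of the real output before relying on it.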
Thanks!
Edit 1:
Here is the code I use to scrape and fetch the data.
require 'mechanize'
require 'docsplit'
require 'byebug'
require 'csv'
def pdf_to_text(pdf_filename)
  extracted_text = Docsplit.extract_text([pdf_filename], ocr: false, col_sep: ";", output: 'csv_data')
  extracted_text
end
def save_by_year(starting, ending)
  agent = Mechanize.new do |a|
    a.ssl_version = 'TLSv1'
    a.verify_mode = OpenSSL::SSL::VERIFY_NONE
  end
  agent.get('https://www.ttb.gov')
  (starting..ending).each do |year|
    year_page = agent.get("/statistics/#{year}winestats.shtml")
    (1..12).each do |month|
      month_trail = '%02d' % month
      link = year_page.links_with(:href => "20#{year}/20#{year}#{month_trail}wine.pdf").first
      page = agent.click(link)
      File.open(page.filename.gsub(" /", "_"), 'w+b') do |file|
        file << page.body.strip
      end
      pdf_to_text("20#{year}#{month_trail}wine.pdf")
    end
  end
end
After converting, I try to access the data by opening the text file and iterating through its rows.

Specify starting point of foreach enumerator in ruby

I have a rake file that pulls in data from an external CSV file and enumerates through it with:
CSV.foreach(file, :headers => true) do |row|
What is the most effective way (in Ruby) to specify the starting point within the spreadsheet?
:headers => true lets me start importing from the second line, but what if I want to start at line 20?
Ruby enumerators include a drop method that will skip over the first n items.
When not passed a block, CSV.foreach returns an enumerator.
You can use
CSV.foreach(file, :headers => true).drop(20).each do |row|
This will skip the first 20 data rows (the header row does NOT count as one of those twenty).
Use .drop(number_of_rows_to_ignore):
CSV.open(file, :headers => true).drop(20).each do |row|
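One caveat with drop: it materializes every remaining row into an array before you iterate, so for very large files a lazy skip via with_index avoids that buffering. A sketch with made-up file contents and column names:

```ruby
require 'csv'

# Made-up sample: a header row plus 30 data rows.
File.write('big.csv', "h1,h2\n" + (1..30).map { |i| "#{i},#{i * 2}" }.join("\n") + "\n")

first_processed = nil
CSV.foreach('big.csv', :headers => true).with_index do |row, i|
  next if i < 20               # skip the first 20 data rows without buffering them
  first_processed = row['h1']  # row 21 is the first one we handle
  break
end
first_processed  # => "21"
```

CSV.foreach without a block returns an Enumerator, which is what makes the with_index chaining possible.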

Rails FasterCSV "unquoted fields do not allow \r or \n"

I'm having an issue with FasterCSV and my rake db:seeds migration. I get the error:
"rake aborted! Unquoted fields do not allow \r or \n (line 2)"
on the following seeds.rb data:
require 'csv'

directory = "db/init_data/"
file_name = "gardenzing020812.csv"
path_to_file = directory + file_name
puts 'Loading Plant records'

# Pre-load all Plant records
n = 0
CSV.foreach(path_to_file) do |row|
  Plant.create! :name => row[1],
                :plant_type => row[3],
                :group => row[2],
                :image_path => row[45],
                :height => row[5],
                :sow_inside_outside => row[8]
  n = n + 1
end
I've searched for a solution to this problem and have discovered that for a lot of folks it's a UTF-8 encoding problem. I've tried requiring iconv and :encoding => 'u', but that then gives me the error "invalid byte sequence in UTF-8".
I'm a newbie, and I can't figure out whether it's really an encoding issue I need to crack (which I've been trying to do unsuccessfully; if so, I could really use some guidance) or, as seems more likely, whether I've made a simple misstep in the way I've set up seeds.rb and possibly my Excel -> CSV file. There's no bad or awkward data in the CSV file; it's just simple one-word strings, text and integers. Please help!
It was as simple as clearing all the formatting in the CSV. Excel seems to retain a lot of formatting after saving to a CSV file, which was causing the failure. After I copied and pasted all the data without formatting into a new CSV file, it was fine.
Use String#encode(universal_newline: true) instead of gsub.
It converts CRLF and CR to LF, so lines always break with \n.
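A quick illustration of that normalization (the sample string is made up):

```ruby
# String#encode with :universal_newline converts CRLF and bare CR to LF.
raw        = "id,name\r\n1,foo\r2,bar\n"
normalized = raw.encode(raw.encoding, :universal_newline => true)
normalized  # => "id,name\n1,foo\n2,bar\n"
```

Unlike a hand-rolled gsub, this handles all three newline conventions in one pass and can't accidentally collapse adjacent line breaks.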
I do not have enough reputation to comment, but I wanted to say that I had been looking for this error across the web day and night for a long time, and finally found the solution in the comments, posted by mu is too short.
I finally got it to work when I put quotes around all of my values.
EDIT: Link to the answer: Rails FasterCSV "unquoted fields do not allow \r or \n"

How do I skip the first three rows instead of only the first in FasterCSV?

I am using FasterCSV and I am looping with foreach like this:
FasterCSV.foreach("#{Rails.public_path}/uploads/transfer.csv", :encoding => 'u', :headers => :first_row) do |row|
but the problem is that my CSV has its headers in the first 3 lines. Is there any way to make FasterCSV skip the first three rows rather than only the first?
Not sure about FasterCSV, but in Ruby 1.9 standard CSV library (which is made from FasterCSV), I can do something like:
c = CSV.open '/path/to/my.csv'
c.drop(3).each do |row|
  # do whatever with row
end
I'm not a user of FasterCSV, but why not control the skipping yourself:
additional_rows_to_skip = 2
FasterCSV.foreach("...", :encoding => 'u', :headers => :first_row) do |row|
  if additional_rows_to_skip > 0
    additional_rows_to_skip -= 1
  else
    # do stuff...
  end
end
Thanks to Mladen Jablanović, I got my clue. But I realized something interesting:
In 1.9, reading seems to happen from the current file position.
By this I mean that if you do
c = CSV.open iFileName
logger.debug c.first
logger.debug c.first
logger.debug c.first
You'll get three different results in your log. One for each of the three header rows.
c.each do |row| # now seems to start on the 4th row
It makes perfect sense that it would read the file this way. Then it would only have to have the current row in memory.
I still like Mladen Jablanović's answer, but this is an interesting bit of logic too.
