How to check the CSV column consistency? - ruby-on-rails

I have a CSV file like:
Header: 1,2,3,4
content: a,b,c,d,e
a,b,c,d
a,b
a,b,c,d,d
Is there any CSV method that I can use to easily validate the column consistency instead of
parsing the CSV line by line?

One way or another the whole file has to be read.
Here is a relatively simple way. First the file is read and converted to an array, which is then mapped to another array of row lengths (number of fields per row). That array is then checked to see whether every length is the same.
If you'd rather not go through the data twice, you could remember the length of the header and, while parsing the file, check that each record has the same number of fields, raising an exception otherwise.
require "csv"
def valid?(file)
  a = CSV.read(file).map { |e| e.length }
  a.min == a.max
end
p valid?("data.csv")
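The single-pass variant described above could look something like the following sketch. The helper name inconsistent_rows is made up for illustration. It assumes the first row is the header, and it collects the offending row numbers instead of raising, which is a design choice rather than anything the CSV library prescribes.

```ruby
require "csv"

# Hypothetical helper (not part of the CSV library): returns the 1-based
# row numbers whose field count differs from the first (header) row.
def inconsistent_rows(path)
  expected = nil
  bad = []
  # CSV.foreach streams the file one row at a time instead of loading it all.
  CSV.foreach(path).with_index(1) do |row, number|
    expected ||= row.length
    bad << number if row.length != expected
  end
  bad
end
```

An empty result means every row has the same width as the header. For a file holding just the five data lines from the question (without the "Header:" and "content:" labels), this should return [2, 4, 5].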

csv_validator gem would be helpful here.

Related

Iterating through CSV::Rows

I'm going to preface that I'm still learning ruby.
I'm writing a script to parse a .csv and identify possible duplicate records in the data-set.
I have a .csv file with headers, so I'm parsing the data so that I can access each row using a header title as such:
@contact_table = CSV.parse(File.read("app/data/file.csv"), headers: true)
# Prints all last names in table
puts @contact_table['last_name']
I'm trying to iterate over each row in the table and identify if the last name I'm currently iterating over is similar to the next last name, but I'm having trouble doing this. I guess the way I'm handling it is as if it's an array, but I checked the type and it's a CSV::Row.
example (this doesn't work):
@contact_table.each_with_index do |c, i|
  puts "first contact is #{c['last_name']}, second contact is #{c[i + 1]['last_name']}"
end
I realized this doesn't work like this because the table isn't an array, it's a CSV::Row like I previously mentioned. Is there any method that can achieve this? I'm really blanking right now.
My csv looks something like this:
id,first_name,last_name,company,email,address1,address2,zip,city,state_long,state,phone
1,Donalt,Canter,Gottlieb Group,dcanter0@nydailynews.com,9 Homewood Alley,,50335,Des Moines,Iowa,IA,515-601-4495
2,Daphene,McArthur,"West, Schimmel and Rath",dmcarthur1@twitter.com,43 Grover Parkway,,30311,Atlanta,Georgia,GA,770-271-7837
@contact_table should be a CSV::Table, which is a collection of CSV::Rows, so in this:
@contact_table.each_with_index do |c, i|
  ...
end
c is a CSV::Row. That's why c['last_name'] works. The problem is that here:
c[i + 1]['last_name']
you're looking at c (a single row) instead of @contact_table. If you said:
@contact_table[i + 1]['last_name']
then you'd get the next last name or, when c is the last row, an exception because @contact_table[i + 1] will be nil.
Also, inside the iteration, c is the current (that is, the (i+1)th) row and won't always be the first.
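One way to avoid the index arithmetic (and the nil on the last row) is Enumerable#each_cons, which CSV::Table supports because it is enumerable. The sketch below is only an illustration: the helper name adjacent_last_names is hypothetical, and the third sample row is invented so that one adjacent pair shares a last name.

```ruby
require "csv"

# Sample rows in the same shape as the question's file; the third row is
# invented here so that one adjacent pair shares a last name.
SAMPLE = <<~DATA
  id,first_name,last_name
  1,Donalt,Canter
  2,Daphene,McArthur
  3,Dana,McArthur
DATA

# Hypothetical helper: pairs each row's last name with the next row's.
# each_cons(2) stops at the last full pair, so there is no nil to guard against.
def adjacent_last_names(table)
  table.each_cons(2).map do |current, following|
    [current["last_name"], following["last_name"]]
  end
end

table = CSV.parse(SAMPLE, headers: true)
adjacent_last_names(table).each do |a, b|
  puts "#{a} / #{b}#{' (same)' if a == b}"
end
```

Note that this only compares neighbouring rows, so it finds duplicates reliably only if the data is sorted by last name first.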
What is your use case for this? Seems like a school project?
I recommend foreach instead of parse (see this comparison), since it reads the file row by row. I would probably use a Set for this.
Create a Set outside of the scope of parsing the file (i.e., above the parsing code). Let's call it rows.
Call rows.include?(row) during each iteration while parsing the file
If true, then you know you have a duplicate
If false, then call rows.add(row) to add the new row to the set
You could also just fill your set with an individual value from a column that must be distinct (e.g., row.field(:some_column_name)), such as email or phone number, and do the same inclusion check for that.
(If this is for a real app, please don't do this. Use model validations instead.)
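The Set-based steps above could be sketched roughly as follows. The helper name duplicate_values and the choice to return the repeated values are assumptions made for illustration; the column name is whatever must be distinct in your data, such as email.

```ruby
require "csv"
require "set"

# Hypothetical helper: returns the values of `column` that appear more than
# once, in the order the repeats are encountered.
def duplicate_values(path, column)
  seen = Set.new
  duplicates = []
  CSV.foreach(path, headers: true) do |row|
    value = row[column]
    # Set#add? returns nil when the value is already in the set.
    duplicates << value unless seen.add?(value)
  end
  duplicates
end
```

Calling duplicate_values("app/data/file.csv", "email") would then list every email that occurs on more than one row, without holding the whole file in memory.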
I would use #read instead of #parse and do something like this:
require 'csv'
LASTNAME_INDEX = 2
data = CSV.read('data.csv')
data[1..-1].each_with_index do |row, index|
  puts "Contact number #{index + 1} has the following last name : #{row[LASTNAME_INDEX]}"
end
#~> Contact number 1 has the following last name : Canter
#~> Contact number 2 has the following last name : McArthur

How to count empty range on csv files in one folder?

Suppose I have 7 CSV files in one folder, and two of them are empty in the [2..-1] range. How do I count them and get the answer 2?
This code iterates over the CSV files in the current folder, opens them, and checks that every cell in the 3rd, 4th, 5th... columns is empty in every line.
separator = ";"
empty_count = Dir["*.csv"].count do |csv|
  # chomp: true strips the line ending, which would otherwise make the last cell non-empty
  File.readlines(csv, chomp: true).all? do |line|
    line.split(separator, -1).drop(2).all? do |cell|
      cell.empty?
    end
  end
end
You can always check the size of the file; there are several methods available. Use something like
File.zero?("csv1.csv")
This returns true for CSV files that are completely empty (zero bytes).

Reading "Awkward" CSV Files with FSharp CsvParser

I have a large file (200K - 300K lines of text).
It's almost but not quite a CSV file.
The column headers are on the second row, there's a row of dummy text
before that.
There are rows interspersed with the actual data rows. They have
commas, but most of the columns are blank. They aren't relevant to me.
I need to read this file efficiently, and parse the lines that actually are
valid, as CSV data.
My first idea was to write a clean procedure that strips out the first line, and the blank lines, leaving only the headers and details that I want
in a CSV File that the CsvParser can read.
This is easy enough, just ReadLine from a StreamReader, I can keep or disregard each line just by looking at it as a string.
Now though I have a new issue.
There is a column in the valid data that I can use to disregard a whole lot more rows.
If I read the Cleaned file using the CsvParser it's easy to filter by that column.
But, I don't really want to waste writing the rows I don't need to the Clean file.
I'd like to be able to check that Column, while Cleaning the File. But, at that point I'm working with strings representing entire lines. It's not easy to get at the specific column I want.
I can't Split on ',' there may be commas in the text of other columns.
I'm ending up writing the Csv Parsing Logic, that I was using CsvParser for in the first place.
Ideally, I'd like to read in the existing file, clean out the lines that I can based on strings, then somehow parse the resulting seq using the CsvParser.
I see CsvFile can Load from Streams and Readers, but I'm not sure that's much help.
Any suggestions or am I just asking too much? Should I just deal with the extra filtering on loading the Cleaned File?
You can avoid doing most of the work of parsing by using the CsvFile class directly.
The F# Data documentation has some extended examples that show how to do this in some detail.
Skipping over lines at the start of a file is handled by the skipRows parameter. Passing the ignoreErrors parameter will also ignore rows that fail to parse.
open FSharp.Data
let csv = CsvFile.Load(file, skipRows=1, ignoreErrors=true)
for row in csv.Rows do
    printfn "%s" (row.GetColumn "Name")
If you have to do more complex filtering of rows, a simple approach that doesn't require temporary files is to filter the results of File.ReadLines and pass that to CsvFile.Parse.
The example below skips a six-line prelude, reads in lines until it hits a blank line, uses CsvFile to parse the data, and finally filters the resulting rows to those of interest.
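open System.IO
open FSharp.Data
open FSharp.Data.CsvExtensions  // enables row?Column and AsFloat()
let tableA =
    File.ReadLines(file)
    |> Seq.skip 6
    |> Seq.takeWhile (fun l -> String.length l > 0)
    |> String.concat "\n"
let csv = CsvFile.Parse(tableA)
// csv.Rows is a plain sequence, so filter it with Seq.filter
for row in csv.Rows |> Seq.filter (fun row -> row?Close.AsFloat() > row?Open.AsFloat()) do
    printfn "%s" (row.GetColumn "Name")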

Parse tab delimited CSV file to array of hashes in Ruby 2.0

I have the following code:
def csv_to_array(file)
  csv = CSV::parse(file)
  fields = csv.shift
  array = csv.collect { |record| Hash[*fields.zip(record).flatten] }
end
This creates an array of hashes, and works fine with comma separated values. I am trying to replicate this code for a tab delimited file. Currently, when I run the above code on my tab delimited file, I get something like this:
array[0] = {"First Name\tLast Name\tCode\t"=>"Luigi\tSmith\t1406\t"}
So, each array object is a hash as intended, but it has one key value pair - The entire tab delimited header row being the key, and the individual row of data being the value.
How can I alter this code to return an array of hashes with individual key value pairs, with the header of each column mapping to the row value for that column?
It seems that the options you pass to parse are listed in ::new
>> CSV.parse("qwe\tq\twe", col_sep: "\t"){|a| p a}
["qwe", "q", "we"]
Use the col_sep option. This post has the code: Changing field separator/delimiter in exported CSV using Ruby CSV
Also check out the docs: http://ruby-doc.org/stdlib-2.1.0/libdoc/csv/rdoc/CSV.html
There is lots of good stuff in the DEFAULT_OPTIONS section.
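Putting col_sep together with the headers option gives the array of hashes the question asks for without zipping fields by hand. This is a minimal sketch: the method name tsv_to_array is made up, and the sample text deliberately omits the trailing tab seen in the question's output.

```ruby
require "csv"

# Hypothetical replacement for csv_to_array: parses tab-delimited text and
# returns one hash per data row, keyed by the header row.
def tsv_to_array(text)
  CSV.parse(text, col_sep: "\t", headers: true).map(&:to_h)
end

rows = tsv_to_array("First Name\tLast Name\tCode\nLuigi\tSmith\t1406\n")
p rows.first
```

With headers: true the first line is consumed as the header row, so each CSV::Row already knows its keys and to_h does the rest.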

selecting specific rows from an unstructured csv file and writing to another file using python

I am trying to iterate through an unstructured CSV file (it has no specific headings). The file is generated by an instrument. I need to select the rows that have specific column values and write them to another file. Below is an example of the file layout:
,success, (row1)
1,2,protocol (row2)
78,f14,34(row3)
,67,34(row4)
,f14,34(row5)
3,f14,56,56(row6)
I need to select all rows with the 'f14' value. Below is the code:
import csv
import sys
reader = csv.reader(open('c:/test_file.csv', newline=''), delimiter=',', quotechar='|')
for row in reader:
    print(','.join(row))
I am unable to go beyond this point.
You're almost there:
for row in reader:
    if row[1] == 'f14':
        print(','.join(row))
You just need to check whether the row is one you're interested in by looking at the value of the column. That could be done with a simple if row[1] == 'f14' conditional statement. However, that would fail on any blank lines -- which it looks like your input file may have -- so you'd need to preface that check with another to make sure the row has at least that many columns in it.
To create another CSV file with just those rows in it, all you'd need to do is write each row that passes all the checks to another file opened for output -- instead of, or in addition to, printing the row out. Here's a very concise way of just writing the rows to another file.
(Note: I'm not sure why you had the quotechar='|' in your code on the csv.reader() call, because there aren't any quote characters in the input file shown, so I left it out in the code below -- you might need to add it back if that really is the quote character your files use.)
import csv
with open('test_file.csv', newline='') as infile, \
     open('test_file_out.csv', 'w', newline='') as outfile:
    csv.writer(outfile).writerows(row for row in csv.reader(infile)
                                  if len(row) >= 2 and row[1] == 'f14')
Contents of 'test_file_out.csv' file afterwards:
78,f14,34(row3)
,f14,34(row5)
3,f14,56,56(row6)
