How to count empty range on csv files in one folder?

Suppose I have 7 CSV files in one folder, and two of them are empty in the [2..-1] range. How do I count them and get the answer 2?

This code iterates over the CSV files in the current folder, opens them, and checks that every cell in the 3rd, 4th, 5th... columns is empty on every line.
separator = ";"
empty_count = Dir["*.csv"].count do |csv|
  # chomp: true strips the trailing newline, so an empty last cell is "" and not "\n"
  File.readlines(csv, chomp: true).all? do |line|
    line.split(separator, -1).drop(2).all? do |cell|
      cell.empty?
    end
  end
end

You can always check the size of the file; there are several methods available. Use something like
File.zero?("csv1.csv")
This will give true for the empty csv files.
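To turn that check into a count, the `Dir[...].count` pattern from the first answer could be reused. This is only a minimal sketch: the helper name `count_zero_byte_csvs` and its `dir` argument are illustrative, not from the original answers. It counts completely empty (zero-byte) files only, not files whose columns from index 2 onward are blank.

```ruby
# Hypothetical helper (not from the original answers): counts the CSV files
# in +dir+ that are completely empty, i.e. zero bytes long.
def count_zero_byte_csvs(dir = ".")
  Dir[File.join(dir, "*.csv")].count { |path| File.zero?(path) }
end

# Count zero-byte CSV files in the current folder
puts count_zero_byte_csvs
```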

Related

check for matching rows in csv file ruby

I am very new to ruby and I want to check for rows with the same number in a csv file.
What I am trying to do is go through the input CSV file and copy each row to the output file, adding another column called "duplicate" to the output. While copying, I want to check whether the same phone number is already in the output file; if it is, add "dupl" to that row's duplicate column.
This is what I have.
file = CSV.read('input_file.csv')
output_file = File.open("output2.csv", "w")
for row in file
  output_file.write(row)
  output_file.write("\n")
end
output_file.close
Example:
Phone
(202) 221-1323
(201) 321-0243
(202) 221-1323
(310) 343-4923
Output file:
Phone            Duplicate
(202) 221-1323
(201) 321-0243
(202) 221-1323   dupl
(310) 343-4923
So basically you want to write the input to output and append a "dupl" on the second occurrence of a duplicate?
Your input to output seems fine. To get the "dupl" flag, simply count the occurrences of each number in the list. If there is more than one, it's a duplicate. But since you only want the flag shown on the second occurrence, just count how often the number appeared up to that point:
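require 'csv'

lines = CSV.read('input_file.csv')
File.open('output2.csv', 'w') do |output_file|
  lines.each_with_index do |l, i|
    # l is an array of fields, so turn it back into CSV text before writing
    output_file.write(l.to_csv.chomp + ",")
    # flag the row if the same row already appeared earlier in the file
    if lines.take(i).count(l) >= 1
      output_file.write("dupl")
    end
    output_file.write("\n")
  end
end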
l is the current line. take(i) returns all lines before, but not including, the current line, and count(l) counts how often the current line appeared among them. If it appeared at least once, "dupl" is written.
There probably is a more efficient answer to this, this is just a quick and easy to understand version.
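One possible more efficient variant, sketched here as an assumption rather than as the answerer's method, keeps a Set of rows already seen so each row is checked once instead of rescanning all earlier lines. The helper name `flag_duplicates` is hypothetical, and the sample rows are the phone numbers from the question.

```ruby
require 'csv'
require 'set'

# Hypothetical helper: returns each row with an extra trailing field that is
# "dupl" for every occurrence after the first, and nil otherwise.
def flag_duplicates(rows)
  seen = Set.new
  rows.map do |row|
    # Set#add? returns nil when the element is already in the set
    flag = seen.add?(row) ? nil : "dupl"
    row + [flag]
  end
end

rows = CSV.parse("(202) 221-1323\n(201) 321-0243\n(202) 221-1323\n(310) 343-4923\n")
flag_duplicates(rows).each { |row| puts row.to_csv }
```

Like the version above, this flags every repeat after the first occurrence, but it does so in a single pass over the rows.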

Iterating through CSV::Rows

I'll preface this by saying that I'm still learning Ruby.
I'm writing a script to parse a .csv and identify possible duplicate records in the data-set.
I have a .csv file with headers, so I'm parsing the data so that I can access each row using a header title as such:
@contact_table = CSV.parse(File.read("app/data/file.csv"), headers: true)
# Prints all last names in table
puts @contact_table['last_name']
I'm trying to iterate over each row in the table and identify if the last name I'm currently iterating over is similar to the next last name, but I'm having trouble doing this. I guess the way I'm handling it is as if it's an array, but I checked the type and it's a CSV::Row.
example (this doesn't work):
@contact_table.each_with_index do |c, i|
  puts "first contact is #{c['last_name']}, second contact is #{c[i + 1]['last_name']}"
end
I realized this doesn't work like this because the table isn't an array, it's a CSV::Row like I previously mentioned. Is there any method that can achieve this? I'm really blanking right now.
My csv looks something like this:
id,first_name,last_name,company,email,address1,address2,zip,city,state_long,state,phone
1,Donalt,Canter,Gottlieb Group,dcanter0@nydailynews.com,9 Homewood Alley,,50335,Des Moines,Iowa,IA,515-601-4495
2,Daphene,McArthur,"West, Schimmel and Rath",dmcarthur1@twitter.com,43 Grover Parkway,,30311,Atlanta,Georgia,GA,770-271-7837
@contact_table should be a CSV::Table, which is a collection of CSV::Rows, so in this:
@contact_table.each_with_index do |c, i|
  ...
end
c is a CSV::Row. That's why c['last_name'] works. The problem is that here:
c[i + 1]['last_name']
you're looking at c (a single row) instead of @contact_table. If you said:
@contact_table[i + 1]['last_name']
then you'd get the next last name or, when c is the last row, an exception because @contact_table[i + 1] will be nil.
Also, inside the iteration, c is the current (or (i+1)th) row and won't always be the first.
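One way to compare each row with the next without indexing past the end is `each_cons(2)`, which CSV::Table gets from Enumerable. The following is a rough sketch under that assumption; the sample data is made up for illustration (the third row is not from the question), and the variable names are hypothetical.

```ruby
require 'csv'

# Sample data, invented for this sketch
csv_text = <<~ROWS
  id,first_name,last_name
  1,Donalt,Canter
  2,Daphene,McArthur
  3,Dana,McArthur
ROWS

contact_table = CSV.parse(csv_text, headers: true)

# each_cons(2) yields [row, next_row] pairs and stops at the last pair,
# so there is no nil lookup after the final row
same_as_next = contact_table.each_cons(2).select do |current, following|
  current['last_name'] == following['last_name']
end

same_as_next.each do |current, following|
  puts "#{current['first_name']} and #{following['first_name']} share the last name #{current['last_name']}"
end
```

This only catches matches in adjacent rows, which is what the question asks about; sorting by last name first would be needed to catch matches that are further apart.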
What is your use case for this? Seems like a school project?
I recommend foreach instead of parse (see this comparison). I would probably use a Set for this.
Create a Set outside of the scope of parsing the file (i.e., above the parsing code). Let's call it rows.
Call rows.include?(row) during each iteration while parsing the file
If true, then you know you have a duplicate
If false, then call rows.add(row) to add the new row to the set
You could also just fill your set with an individual value from a column that must be distinct (e.g., row.field(:some_column_name)), such as email or phone number, and do the same inclusion check for that.
(If this is for a real app, please don't do this. Use model validations instead.)
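The Set-based steps above could be sketched roughly as follows. This is an illustration under assumptions: the helper name `duplicate_rows` is hypothetical, and it checks a single column (such as email) rather than whole rows.

```ruby
require 'csv'
require 'set'

# Hypothetical helper: returns the rows whose +column+ value already appeared
# earlier in the file. CSV.foreach reads the file one row at a time.
def duplicate_rows(path, column)
  seen = Set.new
  duplicates = []
  CSV.foreach(path, headers: true) do |row|
    value = row[column]
    # Set#add? returns nil if the value was already in the set
    duplicates << row unless seen.add?(value)
  end
  duplicates
end
```

With the file from the question, a call might look like `duplicate_rows("app/data/file.csv", "email")`. Exact matching will not catch "similar" names; that would need a normalization or fuzzy-matching step before the values go into the set.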
I would use #read instead of #parse and do something like this:
require 'csv'
LASTNAME_INDEX = 2
data = CSV.read('data.csv')
data[1..-1].each_with_index do |row, index|
  puts "Contact number #{index + 1} has the following last name : #{row[LASTNAME_INDEX]}"
end
#~> Contact number 1 has the following last name : Canter
#~> Contact number 2 has the following last name : McArthur

Ruby CSV.foreach start at specific row

I've seen a couple of posts about this with no real answers or out-of-date answers, so I'm wondering if there are any new solutions. I have an enormous CSV I need to read in. I can't call open() on it because it kills my server. I have no choice but to use .foreach().
Doing it this way, my script will take 6 days to run. I want to see if I can cut that down by using Threads and splitting the task in two or four. So one thread reads lines 1-n and one thread simultaneously will read lines n+1-end.
So I need to be able to only read in the last half of the file in one thread (and later if I split it into more threads, just a specific line through a specific line).
Is there any way in Ruby to do this? Can this start at a certain row?
CSV.foreach(FULL_FACT_SHEET_CSV_PATH) do |trial|
EDIT:
Just to give an idea of what one of my threads looks like:
threads << Thread.new {
  CSV.open('matches_thread3.csv', 'wb') do |output_csv|
    output_csv << HEADER
    count = 1
    index = 0
    CSV.foreach(CSV_PATH) do |trial|
      index += 1
      if index > 120000
        break if index > 180000
        #do stuff
      end
    end
  end
}
But as you can see, it has to iterate the file until it gets to record 120,000 before it starts. So the goal would be to eliminate reading all of the rows before row 120,000 by starting to read at row 120,000.
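If this is still relevant, you can chain .with_index after foreach and skip rows by index (the skipped rows are still read and parsed):
rows_array = []
# columns is assumed to be an array of the column indexes you want to keep
CSV.foreach(path).with_index do |row, i|
  next if i == 0 # skip first row
  rows_array << columns.map { |n| row[n] }
end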
"But as you can see, it has to iterate the file until it gets to record 120,000 before it starts. So the goal would be to eliminate reading all of the rows before row 120,000 by starting to read at row 120,000."
Impossible. The content of a CSV file is just a blob of text with some commas and newlines. You can't know at which offset in the file row N starts without knowing where row N-1 ends. And to know this, you have to know where row N-1 starts (see the recursion?) and read the file until you see where it ends (that is, until you encounter a newline that is not part of a field value).
The exception is when all your rows are of a fixed size. In that case, you can seek directly to offset 120_000 * row_size. I have yet to see a file like this, though.
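For that fixed-size special case only, the seek could look something like the sketch below. It assumes every row, including its newline, is exactly `row_size` bytes, which ordinary CSV files do not guarantee; the helper name `read_fixed_rows` and its parameters are hypothetical.

```ruby
# Hypothetical helper: reads up to +count+ rows starting at 0-based +start_row+,
# ASSUMING every row (newline included) is exactly +row_size+ bytes long.
def read_fixed_rows(path, row_size, start_row, count)
  File.open(path, "rb") do |file|
    # Jump straight to the first wanted row without reading the earlier ones
    file.seek(start_row * row_size)
    # File#read(n) returns nil at end of file, so compact drops missing rows
    rows = Array.new(count) { file.read(row_size) }
    rows.compact.map(&:chomp)
  end
end
```

Each returned string would still need to be parsed, for example with `CSV.parse_line`, and the approach breaks as soon as a single row has a different length.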
As far as I understand your question, something like this in Ruby may help you.
require 'csv'
csv_file = "matches_thread3.csv"
# define a constant chunk size for the jobs
CHUNK_SIZE = 120000
# split("\n") produces an array of CSV records (one per line)
# drop(1) skips the header row
# each_slice groups the records into arrays of CHUNK_SIZE
File.read(csv_file).split("\n").drop(1).each_slice(CHUNK_SIZE).with_index do |chunk, index|
  data = []
  # each chunk can be handled as a separate job of up to 120000 records
  chunk.each do |row|
    data << row
    # do stuff
  end
end
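Because File.read loads the whole file into memory, which the question says is a problem for this file, a streaming variant of the same chunking idea may be worth considering. The sketch below relies on CSV.foreach returning an Enumerator when called without a block; the helper name `each_csv_chunk` is hypothetical.

```ruby
require 'csv'

CHUNK_SIZE = 120_000

# Hypothetical helper: yields arrays of up to +chunk_size+ rows plus the chunk
# index, reading the file row by row instead of loading it all at once.
def each_csv_chunk(path, chunk_size = CHUNK_SIZE)
  # headers: true treats the first line as the header instead of a data row
  CSV.foreach(path, headers: true).each_slice(chunk_size).with_index do |chunk, index|
    yield chunk, index
  end
end
```

This still reads the file from the start, as the answer above explains it must, but only one chunk of rows is held in memory at a time.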

How to check the CSV column consistency?

I have a CSV file like:
Header: 1,2,3,4
content: a,b,c,d,e
a,b,c,d
a,b
a,b,c,d,d
Is there any CSV method that I can use to easily validate the column consistency instead of
parsing the CSV line by line?
One way or another the whole file has to be read.
Here is a relatively simple way. First the file is read and converted to an array, which is then mapped to another array of lengths (the number of fields per row). This array is then checked to see whether the length is always the same.
If you'd hate to read the file twice, you could remember the length of the header and, while you parse the file, check whether each record has the same number of fields, and otherwise raise an exception.
require "csv"
def valid? file
  a = CSV.read(file).map { |e| e.length }
  a.min == a.max
end
p valid?("data.csv")
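The single-pass alternative mentioned above, remembering the header length and checking each record while parsing, might look roughly like this. It is a sketch, not the answerer's code; the helper name `validate_columns!` and the error message are hypothetical.

```ruby
require 'csv'

# Hypothetical helper: raises on the first row whose field count differs from
# the first (header) row's; returns true if every row matches.
def validate_columns!(path)
  expected = nil
  CSV.foreach(path).with_index(1) do |row, row_no|
    # The first row sets the expected number of fields
    expected ||= row.length
    next if row.length == expected
    raise ArgumentError, "row #{row_no} has #{row.length} fields, expected #{expected}"
  end
  true
end
```

Unlike the min/max version, this stops at the first inconsistent row and never holds the whole file in memory.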
csv_validator gem would be helpful here.

selecting specific rows from an unstructured csv file and writing to another file using python

I am trying to iterate through an unstructured CSV file (it has no specific headings). The file is generated by an instrument. I need to select the rows that have specific column values and write them to another file. Below is an example of the file layout:
,success, (row1)
1,2,protocol (row2)
78,f14,34(row3)
,67,34(row4)
,f14,34(row5)
3,f14,56,56(row6)
I need to select all rows with the 'f14' value. Below is the code:
import csv
import sys
reader = csv.reader(open('c:/test_file.csv', newline=''), delimiter=',', quotechar='|')
for row in reader:
    print(','.join(row))
I am unable to go beyond this point.
You're almost there:
for row in reader:
    if row[1] == 'f14':
        print(','.join(row))
You just need to check whether the row is one you're interested in by looking at the value of the column and seeing if it's what you're looking for. That could be done with a simple if row[1] == 'f14' conditional statement. However, that would fail on any blank lines (which it looks like your input file may have), so you'd need to preface that check with another to make sure the row has at least that many columns in it.
To create another CSV file with just those rows in it, all you'd need to do is write each row that passed all the checks to another file opened for output, instead of, or in addition to, printing the row out. Here's a very concise way of just writing the rows to another file.
(Note: I'm not sure why you had quotechar='|' in the csv.reader() call in your code, because there aren't any quote characters in the input file shown, so I left it out in the code below. You might need to add it back if that is indeed the quote character the file would use.)
import csv
with open('test_file.csv', newline='') as infile, \
     open('test_file_out.csv', 'w', newline='') as outfile:
    csv.writer(outfile).writerows(row for row in csv.reader(infile)
                                  if len(row) >= 2 and row[1] == 'f14')
Contents of the 'test_file_out.csv' file afterwards:
78,f14,34(row3)
,f14,34(row5)
3,f14,56,56(row6)
