check for matching rows in csv file ruby - ruby-on-rails

I am very new to Ruby and I want to check for rows with the same phone number in a CSV file.
What I am trying to do is go through the input CSV file and copy each row from the input file to the output file, adding another column called "duplicate" to the output file. While copying the data, I want to check whether the same phone number is already in the output file; if it is, write "dupl" in the duplicate column for that row.
This is what I have.
require 'csv'

file = CSV.read('input_file.csv')
output_file = File.open("output2.csv", "w")
for row in file
  output_file.write(row.join(","))
  output_file.write("\n")
end
output_file.close
Example:
input file
Phone
(202) 221-1323
(201) 321-0243
(202) 221-1323
(310) 343-4923
output file
Phone,Duplicate
(202) 221-1323,
(201) 321-0243,
(202) 221-1323,dupl
(310) 343-4923,

So basically you want to write the input to the output and append "dupl" on the second occurrence of a duplicate?
Your input-to-output copying seems fine. To get the "dupl" flag, simply count the occurrences of each number in the list. If it's more than one, it's a duplicate. But since you only want the flag to be shown on the second occurrence, just count how often the number has appeared up until that point:
lines = CSV.read('input_file.csv')
output_file = File.open("output2.csv", "w")
lines.each_with_index do |l, i|
  output_file.write(l.join(",") + ",")
  # lines.take(i) is every line before (not including) the current one;
  # count(l) says how many of those are identical to the current line
  if lines.take(i).count(l) >= 1
    output_file.write("dupl")
  end
  output_file.write("\n")
end
output_file.close
l is the current line, take(i) is all lines before (but not including) the current one, and count(l) applied to that counts how often the number has appeared before; if it has appeared at least once already, print "dupl".
There is probably a more efficient answer to this; this is just a quick and easy-to-understand version.
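For larger files, a rough sketch of the same idea that remembers the numbers already seen in a Set (so each phone is checked in constant time) could look like the following; the "Phone" header name and the file names are carried over from the example above, not from your actual data:

require 'csv'
require 'set'

seen = Set.new   # phone numbers already written to the output
CSV.open("output2.csv", "w") do |out|
  out << ["Phone", "Duplicate"]
  CSV.foreach("input_file.csv", headers: true) do |row|
    phone = row["Phone"]
    flag  = seen.include?(phone) ? "dupl" : nil   # flag only repeat occurrences
    seen << phone
    out << [phone, flag]
  end
end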

Related

Reading "Awkward" CSV Files with FSharp CsvParser

I have a large file (200K - 300K lines of text).
It's almost but not quite a CSV file.
The column headers are on the second row; there's a row of dummy text before that.
There are other rows interspersed with the actual data rows. They have commas, but most of the columns are blank, and they aren't relevant to me.
I need to read this file efficiently and parse the lines that are actually valid as CSV data.
My first idea was to write a clean procedure that strips out the first line and the blank lines, leaving only the headers and detail rows that I want in a CSV file that the CsvParser can read.
This is easy enough: just ReadLine from a StreamReader; I can keep or discard each line just by looking at it as a string.
Now though I have a new issue.
There is a column in the valid data that I can use to disregard a whole lot more rows.
If I read the Cleaned file using the CsvParser it's easy to filter by that column.
But I don't really want to waste time writing the rows I don't need to the Clean file.
I'd like to be able to check that column while cleaning the file. But at that point I'm working with strings representing entire lines, so it's not easy to get at the specific column I want.
I can't Split on ',' because there may be commas in the text of other columns.
I'd end up writing the CSV parsing logic that I was using CsvParser for in the first place.
Ideally, I'd like to read in the existing file, clean out the lines that I can based on strings, then somehow parse the resulting seq using the CsvParser.
I see CsvFile can Load from Streams and Readers, but I'm not sure that's much help.
Any suggestions, or am I just asking too much? Should I just deal with the extra filtering when loading the cleaned file?
You can avoid doing most of the work of parsing by using the CsvFile class directly.
The F# Data documentation has some extended examples that show how to do this in some detail.
Skipping over lines at the start of a file is handled by the skipRows parameter. Passing the ignoreErrors parameter will also ignore rows that fail to parse.
open FSharp.Data

let csv = CsvFile.Load(file, skipRows=1, ignoreErrors=true)
for row in csv.Rows do
    printfn "%s" (row.GetColumn "Name")
If you have to do more complex filtering of rows, a simple approach that doesn't require temporary files is to filter the results of File.ReadLines and pass that to CsvFile.Parse.
The example below skips a six-line prelude, reads in lines until it hits a blank line, uses CsvFile to parse the data, and finally filters the resulting rows to those of interest.
open System.IO
open FSharp.Data
open FSharp.Data.CsvExtensions

// Skip the six-line prelude, stop at the first blank line,
// then hand the remaining text to CsvFile for parsing
let tableA =
    File.ReadLines(file)
    |> Seq.skip 6
    |> Seq.takeWhile (fun l -> String.length l > 0)
    |> String.concat "\n"

let csv = CsvFile.Parse(tableA)
for row in csv.Rows |> Seq.filter (fun row -> row?Close.AsFloat() > row?Open.AsFloat()) do
    printfn "%s" (row.GetColumn "Name")

Listing two or more variables alongside each other

I want an alternative to running FREQUENCIES for string variables, because I also want to get a case number for each of the string values (I have a separate variable for case ID).
After reviewing the string values I will need to find them to recode them, which is why I need to know the case number.
I know that the PRINT command should do what I want, but I get an error - is there any alternative?
PRINT / id var2 .
EXECUTE.
>Error # 4743. Command name: PRINT
>The line width specified exceeds the output page width or the record length or
>the maximum record length of 2147483647. Reduce the number of variables or
>split the output line into several records.
>Execution of this command stops.
Try the LIST command.
I often use the TEMPORARY command prior to the LIST command, as often there is only a small selection of records of interest that I want to "list"/investigate.
For example, the below lists only the records where VAR2 is not a blank string.
TEMP.
SELECT IF (len(VAR2)>0).
LIST ID VAR2.
Alternatively (dependent on having the CUSTOM TABLES add-on module), you could do something like the below, which gets the results into a tabular format (which may be preferable if then exporting to Excel, for example):
CTABLES
  /VLABELS VARIABLES=ALL DISPLAY=NONE
  /TABLE A[C]>B[C]
  /CATEGORIES VARIABLES=ALL EMPTY=EXCLUDE.

selecting specific rows from an unstructured csv file and writing to another file using python

I am trying to iterate through an unstructured CSV file (it has no specific headers). The file is generated by an instrument. I need to select specific rows that have specific column values and create another file. Below is an example of the file layout:
,success, (row1)
1,2,protocol (row2)
78,f14,34(row3)
,67,34(row4)
,f14,34(row5)
3,f14,56,56(row6)
I need to select all rows with the 'f14' value. Below is my code:
import csv
import sys

reader = csv.reader(open('c:/test_file.csv', newline=''), delimiter=',', quotechar='|')
for row in reader:
    print(','.join(row))
I am unable to go beyond this point.
You're almost there:
for row in reader:
    if row[1] == 'f14':
        print(','.join(row))
You just need to check whether the row is one you're interested in by looking at the value of the column and seeing if it's what you're looking for. That can be done with a simple if row[1] == 'f14' conditional statement. However, that would fail on any blank lines -- which it looks like your input file may have -- so you'd need to preface that check with another one to make sure the row has at least that many columns in it.
To create another csv file with just those rows in it, all you'd need to do is write each row that passes the checks to another file opened for output -- instead of, or in addition to, printing the row out. Here's a very concise way of just writing the rows to another file.
(Note: I'm not sure why you had the quotechar='|' in your code on the csv.reader() call, because there aren't any quote characters in the input file shown, so I left it out in the code below -- you might need to add it back if that's indeed what the quote character would be if there were any.)
import csv

with open('test_file.csv', newline='') as infile, \
     open('test_file_out.csv', 'w', newline='') as outfile:
    csv.writer(outfile).writerows(row for row in csv.reader(infile)
                                  if len(row) >= 2 and row[1] == 'f14')
Contents of the 'test_file_out.csv' file afterwards:
78,f14,34(row3)
,f14,34(row5)
3,f14,56,56(row6)

YAML, counting lines including blank lines

I am parsing a YAML file and searching for specific values; after the search matches I want to get the line number and print it. I managed to do exactly that, but the major problem is that while parsing the YAML file using YAML.load, the blank lines are ignored.
I can count the rest of the lines using keys (i.e. 1 key per line) but I am unable to count blank lines. Please help, I've been stuck with this for a few days now.
This is how my code looks:
require 'yaml'

hash = YAML.load(IO.read(File.join(File.dirname(__FILE__), 'en.yml')))

def recursive_hash_to_yml_string(input, hash, depth = 0)
  hash.keys.each do |search|
    @count = @count + 1
    if hash[search].is_a?(String) && hash[search] == input
      @yml_array.push(@count)
    elsif hash[search].is_a?(Hash)
      recursive_hash_to_yml_string(input, hash[search], depth + 1)
    end
  end
end
I agree with @Wukerplank - parsing a file should ignore blank lines. You might want to think about finding the line number using a different approach.
Perhaps you don't need to parse the YAML at all. If you are just searching the file for some matching text and returning the line number, you'd probably manage better reading each line of the file using File.each_line.
You could iterate over each line in the file until you find a match and then do something with the line number.
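For example, a minimal sketch of that line-based approach (the file name and the search string here are placeholders, not taken from your data):

search = "some value"                          # the text you are looking for
File.open('en.yml') do |f|
  f.each_line.with_index(1) do |line, lineno|  # with_index(1) gives 1-based line numbers
    puts "match on line #{lineno}" if line.include?(search)
  end
end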

How to compare and verify user input with a csv file columns data?

I have been trying to get an answer for the last few weeks. I want to give users a text field in the app and ask them to enter an ID number, and this number will be checked against the uploaded CSV file's columns. If it matches, then display an alert saying that it found a match; otherwise say it didn't.
A csv file is basically a normal file in which items are separated by commas.
For example:
id,name,email
1,Joe,joe@d.com
2,Pat,pat@d.com
To get the list of IDs stored in this CSV file, simply loop through the lines and grab the first item after splitting each line on the comma:
id = line.split(",").get(0); // first element ** note: this depends on what language you're using
Now add this id to a collection of IDs you are storing, such as a list:
IDs.add(id);
When you take the user input, all you need to do is check whether the id is in your list of IDs:
if (IDs.contains(userId)) { print "Found"; }
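Since that pseudocode is language-agnostic, here is a rough Ruby sketch of the same idea; the file name 'uploaded.csv', reading the id from the 'id' column, and using puts in place of an in-app alert are assumptions for illustration:

require 'csv'
require 'set'

# Collect the first column (the id) of every data row into a set
ids = Set.new
CSV.foreach('uploaded.csv', headers: true) do |row|
  ids << row['id']
end

user_id = gets.chomp                 # stand-in for the app's text field input
if ids.include?(user_id)
  puts "Found a match"               # stand-in for showing an alert
else
  puts "No match found"
end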
