YAML, counting lines including blank lines - ruby-on-rails

I am parsing a YAML file and searching for specific values; after a search matches, I want to get the line number and print it. I managed to do exactly that, but the major problem is that while parsing the file with YAML.load, blank lines are ignored.
I can count the rest of the lines using keys, i.e. one key per line, but I am unable to count blank lines. Please help, I've been stuck on this for a few days now.
This is what my code looks like:
hash = YAML.load(IO.read(File.join(File.dirname(__FILE__), 'en.yml')))

def recursive_hash_to_yml_string(input, hash, depth = 0)
  hash.keys.each do |search|
    @count = @count + 1
    if hash[search].is_a?(String) && hash[search] == input
      @yml_array.push(@count)
    elsif hash[search].is_a?(Hash)
      recursive_hash_to_yml_string(input, hash[search], depth + 1)
    end
  end
end

I agree with @Wukerplank - parsing a file should ignore blank lines. You might want to think about finding the line number using a different approach.
Perhaps you don't need to parse the YAML at all. If you are just searching the file for some matching text and returning the line number, maybe you'd manage better reading each line of the file using File.each_line.
You could iterate over each line in the file until you found a match and then do something with the line number.
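For example, a minimal sketch along those lines (the search string here is a placeholder, and with_index(1) makes the count 1-based):

search = "some value" # placeholder for whatever you are looking for
path = File.join(File.dirname(__FILE__), 'en.yml')
File.open(path) do |f|
  f.each_line.with_index(1) do |line, line_number|
    # blank lines are counted too, because we read the raw file
    puts "match on line #{line_number}" if line.include?(search)
  end
end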

Related

Check for matching rows in a CSV file in Ruby

I am very new to Ruby and I want to check for rows with the same number in a CSV file.
What I am trying to do: go through the input CSV file, copy each row from the input file to the output file, and add another column called "duplicate" to the output file. While copying, check whether the same phone number is already in the output file; if the phone number already exists, add "dupl" to that row in the duplicate column.
This is what I have:
require 'csv'

file = CSV.read('input_file.csv')
output_file = File.open("output2.csv", "w")
for row in file
  output_file.write(row)
  output_file.write("\n")
end
output_file.close
Example input:
Phone
(202) 221-1323
(201) 321-0243
(202) 221-1323
(310) 343-4923

Output file:
Phone,Duplicate
(202) 221-1323,
(201) 321-0243,
(202) 221-1323,dupl
(310) 343-4923,
So basically you want to write the input to output and append a "dupl" on the second occurrence of a duplicate?
Your input-to-output code seems fine. To get the "dupl" flag, simply count the occurrences of each number in the list. If it's more than one, it's a duplicate. But since you only want the flag shown from the second occurrence onward, count how often the number has appeared up to that point:
require 'csv'

lines = CSV.read('input_file.csv')
output_file = File.open('output2.csv', 'w')
lines.each_with_index do |l, i|
  output_file.write(l.join(',') + ',')
  # take(i) is every line before the current one; if the current
  # line already appears there, this is a repeat occurrence
  if lines.take(i).count(l) >= 1
    output_file.write('dupl')
  end
  output_file.write("\n")
end
output_file.close
l is the current line. take(i) is all lines before, but not including, the current line, and count(l) applied to that counts how often the number appeared earlier; if it appeared at least once, print a "dupl".
There is probably a more efficient answer to this; this is just a quick and easy-to-understand version.
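For instance, a single-pass sketch that keeps a Hash of counts instead of rescanning the preceding rows with take(i).count on every iteration (same input and output files as above):

require 'csv'

seen = Hash.new(0) # row => number of times seen so far
CSV.open('output2.csv', 'w') do |out|
  CSV.read('input_file.csv').each do |row|
    # flag only the second and later occurrences
    out << row + [seen[row] >= 1 ? 'dupl' : nil]
    seen[row] += 1
  end
end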

Iterating through CSV::Rows

I'm going to preface that I'm still learning ruby.
I'm writing a script to parse a .csv and identify possible duplicate records in the data-set.
I have a .csv file with headers, so I'm parsing the data so that I can access each row using a header title as such:
@contact_table = CSV.parse(File.read("app/data/file.csv"), headers: true)
# Prints all last names in table
puts @contact_table['last_name']
I'm trying to iterate over each row in the table and identify if the last name I'm currently iterating over is similar to the next last name, but I'm having trouble doing this. I guess the way I'm handling it is as if it's an array, but I checked the type and it's a CSV::Row.
example (this doesn't work):
@contact_table.each_with_index do |c, i|
  puts "first contact is #{c['last_name']}, second contact is #{c[i + 1]['last_name']}"
end
I realized this doesn't work like this because the table isn't an array, it's a CSV::Row like I previously mentioned. Is there any method that can achieve this? I'm really blanking right now.
My csv looks something like this:
id,first_name,last_name,company,email,address1,address2,zip,city,state_long,state,phone
1,Donalt,Canter,Gottlieb Group,dcanter0@nydailynews.com,9 Homewood Alley,,50335,Des Moines,Iowa,IA,515-601-4495
2,Daphene,McArthur,"West, Schimmel and Rath",dmcarthur1@twitter.com,43 Grover Parkway,,30311,Atlanta,Georgia,GA,770-271-7837
@contact_table should be a CSV::Table, which is a collection of CSV::Rows, so in this:
@contact_table.each_with_index do |c, i|
  ...
end
c is a CSV::Row. That's why c['last_name'] works. The problem is that here:
c[i + 1]['last_name']
you're looking at c (a single row) instead of @contact_table. If you said:
@contact_table[i + 1]['last_name']
then you'd get the next last name or, when c is the last row, an exception because @contact_table[i + 1] will be nil.
Also, inside the iteration, c is the current (i.e. (i+1)th) row and won't always be the first.
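Putting that together, a corrected sketch of the loop might look like this:

@contact_table.each_with_index do |c, i|
  next_row = @contact_table[i + 1]
  break if next_row.nil? # c is the last row
  puts "first contact is #{c['last_name']}, second contact is #{next_row['last_name']}"
end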
What is your use case for this? Seems like a school project?
I recommend foreach instead of parse. I would probably use a Set for this:
Create a Set outside of the scope of parsing the file (i.e., above the parsing code). Let's call it rows.
Call rows.include?(row) during each iteration while parsing the file.
If true, then you know you have a duplicate.
If false, then call rows.add(row) to add the new row to the set.
You could also just fill your set with an individual value from a column that must be distinct (e.g., row.field(:some_column_name)), such as email or phone number, and do the same inclusion check for that, as sketched below.
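A rough sketch of that approach, assuming the email column is the distinct field:

require 'csv'
require 'set'

rows = Set.new
CSV.foreach('app/data/file.csv', headers: true) do |row|
  email = row['email']
  if rows.include?(email)
    puts "possible duplicate: #{row['last_name']} (#{email})"
  else
    rows.add(email)
  end
end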
(If this is for a real app, please don't do this. Use model validations instead.)
I would use #read instead of #parse and do something like this:
require 'csv'

LASTNAME_INDEX = 2

data = CSV.read('data.csv')
data[1..-1].each_with_index do |row, index|
  puts "Contact number #{index + 1} has the following last name : #{row[LASTNAME_INDEX]}"
end
#~> Contact number 1 has the following last name : Canter
#~> Contact number 2 has the following last name : McArthur

Ruby CSV.foreach start at specific row

I've seen a couple of posts about this with no real answers or out-of-date answers, so I'm wondering if there are any new solutions. I have an enormous CSV I need to read in. I can't call open() on it because it kills my server. I have no choice but to use .foreach().
Done that way, my script will take 6 days to run. I want to see if I can cut that down by using Threads and splitting the task in two or four: one thread reads lines 1 to n while another thread simultaneously reads lines n+1 to the end.
So I need to be able to read only the last half of the file in one thread (and later, if I split it into more threads, just a specific line through a specific line).
Is there any way in Ruby to do this? Can CSV.foreach start at a certain row?
CSV.foreach(FULL_FACT_SHEET_CSV_PATH) do |trial|
EDIT:
Just to give an idea of what one of my threads looks like:
threads << Thread.new {
  CSV.open('matches_thread3.csv', 'wb') do |output_csv|
    output_csv << HEADER
    count = 1
    index = 0
    CSV.foreach(CSV_PATH) do |trial|
      index += 1
      if index > 120000
        break if index > 180000
        # do stuff
      end
    end
  end
}
But as you can see, it has to iterate the file until it gets to record 120,000 before it starts. So the goal would be to eliminate reading all of the rows before row 120,000 by starting to read at row 120,000.
If still relevant, you can do something like this using .with_index:
rows_array = []
columns = [0, 1, 2] # hypothetical: indexes of the columns you want to keep
CSV.foreach(path).with_index do |row, i|
  next if i == 0 # skip the header row
  rows_array << columns.map { |n| row[n] }
end
But as you can see, it has to iterate the file until it gets to record 120,000 before it starts. So the goal would be to eliminate reading all of the rows before row 120,000 by starting to read at row 120,000.
Impossible. The content of a CSV file is just a blob of text, with some commas and newlines. You can't know at which offset in the file row N starts without knowing where row N-1 ends. And to know that, you have to know where row N-1 starts (see the recursion?) and read the file until you see where it ends (encounter a newline that is not part of a field value).
The exception to this is if all your rows are of fixed size, in which case you can seek directly to offset 120_000 * row_size. I have yet to see a file like that, though.
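For illustration, a sketch of that fixed-size case (ROW_SIZE is hypothetical; this only works when every row, newline included, is exactly that many bytes):

require 'csv'

ROW_SIZE = 64 # hypothetical: every row is exactly 64 bytes
File.open('data.csv') do |f|
  f.seek(120_000 * ROW_SIZE) # jump straight to the start of row 120,000
  f.each_line do |line|
    row = CSV.parse_line(line)
    # do stuff with row
  end
end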
As I understand your question, this Ruby approach may help you:
require 'csv'

csv_file = 'matches_thread3.csv'

# define one constant chunk size for jobs
CHUNK_SIZE = 120000

# split("\n") turns the file content into an array of CSV records,
# drop(1) skips the header, and each_slice groups the records
# into arrays of CHUNK_SIZE
File.read(csv_file).split("\n").drop(1).each_slice(CHUNK_SIZE).with_index do |chunk, index|
  data = []
  # each chunk works as a separate job of 120000 records
  chunk.each do |row|
    data << row
    # do stuff
  end
end

How to check the CSV column consistency?

I have a CSV file like:
Header:  1,2,3,4
Content: a,b,c,d,e
         a,b,c,d
         a,b
         a,b,c,d,d
Is there any CSV method that I can use to easily validate the column consistency, instead of parsing the CSV line by line?
One way or another, the whole file has to be read.
Here is a relatively simple way. First the file is read and converted to an array, which is then mapped to another array of lengths (the number of fields per row). This array is then checked to see whether the length is always the same.
If you'd hate to read the file twice, you could remember the length of the header and, while you parse the file, check that each record has the same number of fields, throwing an exception otherwise (a sketch of that follows the code below).
require "csv"
def valid? file
a = CSV.read(file).map { |e|e.length }
a.min == a.max
end
p valid?("data.csv")
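A sketch of that single-pass variant, raising on the first inconsistent record:

require "csv"

def validate!(file)
  expected = nil
  CSV.foreach(file).with_index(1) do |row, line_number|
    expected ||= row.length # the header sets the expected field count
    if row.length != expected
      raise "line #{line_number} has #{row.length} fields, expected #{expected}"
    end
  end
  true
end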
The csv_validator gem would be helpful here.

Get line that matches regex in rails

I have a long list of information stored in a variable and I need to run some regular expressions against that variable and get various pieces of information from what is found.
How can you store the line that matches a regular expression in a variable?
How can you get the line number of the line that matches a regular expression?
Here is an example of what I'm talking about.
body = "service timestamps log datetime msec localtime show-timezone
service password-encryption
!
hostname switch01
!
boot-start-marker"
If I search for the line that contains "hostname" I need the line number, in this case it would be 4. I also need to store the line "hostname switch01" as another variable.
Any ideas?
Thanks!
First you'd want to convert the string to lines: body.split("\n"). Then you want to add line numbers to the lines: .each_with_index. Then you want to select the matching lines: .select { |line, line_nr| line =~ your_regex }. Putting it all together:
body.split("\n").each_with_index
    .select { |line, line_nr| line =~ your_regex }
    .map { |line, line_nr| line_nr }
This will give you the line numbers of all lines matching your_regex. Note that each_with_index counts from 0, so add 1 if you want 1-based line numbers.
Let's say you have a file object that provides a #lines method:
lines = file.lines.each_with_index.select {|line, i| line =~ /regex/ }
If you already have a list of lines you can leave out the call to #lines. If you have a string you can use string.split("\n").
This will result in the variable lines containing an array of 2-element arrays with the line that matched your RegEx and the index of the line in the original file.
Breakdown
file.lines gets the lines - of course, the other methods I mentioned might also apply here for you. We then add the index to each element with #each_with_index, because you want to store those as well; this has the same effect as #map.with_index { |e, i| [e, i] }, i.e. mapping every element to [element, index]. We then use #select to get all lines that match your RegEx (FYI, =~ is the matching operator in Ruby, Perl and other languages - in case you didn't already know). We're done after that, but you might need to further transform the data so you can process it.
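For instance, applied to the body string from the earlier question (indexes are zero-based):

matches = body.split("\n").each_with_index.select { |line, i| line =~ /hostname/ }
matches.first # => ["hostname switch01", 3], i.e. line 4 when counting from 1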
