I have a simple 4-column Excel spreadsheet that matches universities to their ID codes for lookup purposes. The file is pretty big (300k).
I need to come up with a way to turn this data into a populated table in my Rails app. The catch is that the document is updated now and then, so it can't just be a one-time solution. Ideally, it would be some sort of Ruby script that reads the file and creates the entries automatically, so that when we get emailed a new version we can just re-import it. I'm on Heroku if that matters at all.
How can I accomplish something like this?
If you can, save the spreadsheet as CSV; there are much better gems for parsing CSV files than for parsing Excel spreadsheets. I've found that an effective way of handling this kind of problem is to make a rake task that reads the CSV file and creates all the records as appropriate.
So, for example, here's how to read all the lines from a file using the old, but still effective, FasterCSV gem:
data = FasterCSV.read('lib/tasks/data.csv')
columns = data.shift # First row holds the column names
unique_column_index = -1 # The index of a column that's always unique per row in the spreadsheet

data.each do |row|
  r = Record.find_or_initialize_by_unique_column(row[unique_column_index])
  columns.each_with_index do |column_name, index|
    r[column_name] = row[index]
  end
  begin
    r.save!
  rescue => e
    Rails.logger.error("Failed to save #{r.inspect}: #{e.message}")
  end
end
It does kinda rely on you having a column in the original spreadsheet that's unique per row to go off, though.
If you put that into a rake task, you can then wire it into your Capistrano deploy script so it runs every time you deploy. The find_or_initialize should ensure you don't end up with duplicate records.
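For example, a minimal rake task wrapper might look something like this (the task and file names are placeholders, and Record stands in for your university model):

# lib/tasks/import.rake -- hypothetical file name
namespace :import do
  desc "(Re)load university records from the bundled CSV"
  task :universities => :environment do
    data = FasterCSV.read('lib/tasks/data.csv')
    columns = data.shift
    unique_column_index = -1 # adjust to whichever column is unique per row
    data.each do |row|
      record = Record.find_or_initialize_by_unique_column(row[unique_column_index])
      columns.each_with_index { |column_name, index| record[column_name] = row[index] }
      record.save!
    end
  end
end

Then rake import:universities can be run by hand, or hooked into your Capistrano deploy, whenever a new version of the file arrives.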
Parsing newish Excel files isn't too much trouble using Hpricot. This will give you a two-dimensional array:
require 'hpricot'

doc = open("data.xlsx") { |f| Hpricot(f) }
rows = doc.search('row')
rows = rows[1..rows.length] # Skips the header row
rows = rows.map do |row|
  columns = []
  row.search('cell').each do |cell|
    # Excel stores cell indexes rather than blank cells
    next_index = (cell.attributes['ss:Index']) ? (cell.attributes['ss:Index'].to_i - 1) : columns.length
    columns[next_index] = cell.search('data').inner_html
  end
  columns
end
I'm going to preface this by saying that I'm still learning Ruby.
I'm writing a script to parse a .csv and identify possible duplicate records in the data set.
I have a .csv file with headers, so I'm parsing the data so that I can access each column by its header title, like so:
@contact_table = CSV.parse(File.read("app/data/file.csv"), headers: true)
# Prints all last names in table
puts @contact_table['last_name']
I'm trying to iterate over each row in the table and check whether the last name I'm currently on is similar to the next row's last name, but I'm having trouble doing this. I've been treating it as if it were an array, but when I checked the type it's a CSV::Row.
Example (this doesn't work):
@contact_table.each_with_index do |c, i|
  puts "first contact is #{c['last_name']}, second contact is #{c[i + 1]['last_name']}"
end
I realized this doesn't work like this because the table isn't an array, it's a CSV::Row like I previously mentioned. Is there any method that can achieve this? I'm really blanking right now.
My csv looks something like this:
id,first_name,last_name,company,email,address1,address2,zip,city,state_long,state,phone
1,Donalt,Canter,Gottlieb Group,dcanter0@nydailynews.com,9 Homewood Alley,,50335,Des Moines,Iowa,IA,515-601-4495
2,Daphene,McArthur,"West, Schimmel and Rath",dmcarthur1@twitter.com,43 Grover Parkway,,30311,Atlanta,Georgia,GA,770-271-7837
@contact_table should be a CSV::Table, which is a collection of CSV::Rows, so in this:
@contact_table.each_with_index do |c, i|
  ...
end
c is a CSV::Row. That's why c['last_name'] works. The problem is that here:
c[i + 1]['last_name']
you're looking at c (a single row) instead of @contact_table. If you said:
@contact_table[i + 1]['last_name']
then you'd get the next last name or, when c is the last row, an exception because @contact_table[i + 1] will be nil.
Also, inside the iteration, c is the current row (the (i+1)th, counting from one), so it won't always be the first contact the way your message implies.
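A corrected version of the loop might look something like this (just a sketch; it still only compares each row with the one immediately after it):

@contact_table.each_with_index do |c, i|
  next_row = @contact_table[i + 1]
  break if next_row.nil? # c is the last row
  puts "current contact is #{c['last_name']}, next contact is #{next_row['last_name']}"
end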
What is your use case for this? Seems like a school project?
I recommend CSV.foreach instead of CSV.parse (see this comparison). I would probably use a Set for this.
Create a Set outside of the scope of parsing the file (i.e., above the parsing code). Let's call it rows.
Call rows.include?(row) during each iteration while parsing the file
If true, then you know you have a duplicate
If false, then call rows.add(row) to add the new row to the set
You could also just fill your set with an individual value from a column that must be distinct (e.g., row.field(:some_column_name)), such as email or phone number, and do the same inclusion check for that.
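A rough sketch of that approach, using the email column from the sample data above as the distinct value (adjust the path and field names to your file):

require 'csv'
require 'set'

seen_emails = Set.new

CSV.foreach("app/data/file.csv", headers: true) do |row|
  email = row.field('email')
  if seen_emails.include?(email)
    puts "Possible duplicate: #{row['first_name']} #{row['last_name']} (#{email})"
  else
    seen_emails.add(email)
  end
end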
(If this is for a real app, please don't do this. Use model validations instead.)
I would use #read instead of #parse and do something like this:
require 'csv'
LASTNAME_INDEX = 2
data = CSV.read('data.csv')
data[1..-1].each_with_index do |row, index|
  puts "Contact number #{index + 1} has the following last name : #{row[LASTNAME_INDEX]}"
end
#~> Contact number 1 has the following last name : Canter
#~> Contact number 2 has the following last name : McArthur
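If you'd rather not hard-code the column index, a variation on the same idea using headers: true (so columns can be referenced by name):

data = CSV.read('data.csv', headers: true)

data.each_with_index do |row, index|
  puts "Contact number #{index + 1} has the following last name : #{row['last_name']}"
end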
Problem: I have a large CSV that I want to insert into a DB2 table with Rails.
Description: The CSV is about 2k lines/8K characters. The CLOB column is set up to handle over 10K characters. I can insert the CSV just fine through the RubyMine database console. However, my app crashes.
ActiveRecord produces one huge insert query. Code:
Logger.create(csv: csv_data.to_s)
DB2 returns an error:
ActiveRecord::JDBCError: [SQL0102] String constant beginning with 'foobar' too long.
I can insert huge PDF files into BLOB columns just fine using similar code. I tried creating the record first and then updating it with data, no difference.
This problem is the same as this one, except I need a Rails solution rather than a general one.
Found a hack around this by splitting the csv_data into chunks and appending them to the column
update_attribute(:csv, '') if self.csv.nil? # Can't CONCAT to nil
# Split csv_data into chunks, concatenate each one to the field
csv_data.scan(/.{1,6144}/m).each do |part|
  parm = ActiveRecord::Base.connection.quote(part)
  ActiveRecord::Base.connection.execute("update #{Logger.table_name} set csv = CONCAT(csv, #{parm}) where id = #{self.id}")
end
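For what it's worth, the whole hack can be wrapped up in a model method, along these lines (a sketch, assuming the Logger model and csv CLOB column from above):

class Logger < ActiveRecord::Base
  CHUNK_SIZE = 6144

  # Append the CSV text to the CLOB column in chunks small enough to avoid
  # DB2's string-constant limit.
  def append_csv(csv_data)
    update_attribute(:csv, '') if csv.nil? # Can't CONCAT onto NULL
    csv_data.scan(/.{1,#{CHUNK_SIZE}}/m).each do |part|
      quoted = self.class.connection.quote(part)
      self.class.connection.execute(
        "update #{self.class.table_name} set csv = CONCAT(csv, #{quoted}) where id = #{id}"
      )
    end
  end
end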
I am writing an app that needs to quickly process hundreds of thousands of rows of data, so I've looked into nesting raw SQL in my Ruby code using ActiveRecord::Base.connection.execute, which is working beautifully. However, whenever I run it I get the following object as a result:
#<PG::Result:0x007fe158ab18c8 status=PGRES_TUPLES_OK ntuples=0 nfields=1 cmd_tuples=0>
I've googled around and can't find a way to parse the PG::Result into something actually useful. Is there any built-in PG way to do this, or a workaround, or anything really?
Here is the query I'm using:
SELECT row_to_json(row(company_name, ccn_short_title, title))
FROM contents
WHERE contents.company_name = '#{company_name}'
AND contents.title = '#{title}';
Actually, PG::Result responds to many well-known methods from the Enumerable module. You can list them all to look for the ones you need:
query = "SELECT row_to_json(row) from (select * from users) row"
result = ActiveRecord::Base.connection.execute(query)
result.methods - Object.methods
# => returns an array of methods which can be used
For example, you could iterate the results and map them to something more suitable...
result.map do |row|
  JSON.parse(row["row_to_json"])
end
# => returns familiar hashes
Get a desired result hash by its index...
result[0]
And much more.
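For reference, PG::Result also has a few accessors of its own that are often handy (these come from the pg gem itself):

result.fields  # => ["row_to_json"] -- the column names
result.ntuples # number of returned rows
result.values  # raw values as an array of arrays
result.to_a    # array of hashes, one per row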
Given an array of part ids containing duplicates, how can I find the corresponding records in my Part model, including the duplicates?
An example array of part ids would be ["1B", "4", "3421", "4"]. If we assume I have a record corresponding to each of those values I would like to see 4 records returned in total, not 3. If possible, I was hoping to be able to make additional SQL operations on whatever is returned.
Here's what I'm currently using which doesn't include the duplicates:
@parts = Part.where(:part_id => params[:ids])
To give a little background, I'm trying to upload an XML file containing a list of parts used in some item. My application is meant to parse the XML file and compare the parts listed within against my Parts database so that I can see how much the part weighs. These items will sometimes contain duplicates of various parts so that's what I'm trying to account for here.
The only way I can think of doing it is using map:
@parts = params[:ids].map { |id| Part.find_by_part_id(id) }
Hard to tell exactly what you are doing: are you looking up the weight from the XML or from your data?
parts_xml = some_method_that_loads_xml
part_ids_from_xml = parts_xml.... # pull out the ids
parts = Part.where("id IN (?)", part_ids_from_xml)
Now you have two collections (the XML data and your 'matching' database records), and you can use select or detect to do in-memory lookups by part id:
part_ids_from_xml.each do |part_id|
  weight = parts.detect { |item| item.id == part_id }.weight
  puts "#{part_id} weighs #{weight}"
end
see http://ruby-doc.org/core-2.0.0/Enumerable.html#method-i-detect
and http://ruby-doc.org/core-2.0.0/Enumerable.html#method-i-select
I'm trying to limit the number of times I do a MySQL query, as this could end up being 2k+ queries just to accomplish a fairly small result.
I'm going through a CSV file, and I need to check that the format of the content in the CSV matches the format the db expects, and sometimes I need to do some basic clean-up (for example, I have one field that is a string, but sometimes comes through the CSV as jb2003-343, and I need to strip out the -343).
The first thing I do is get from the database the list of fields by name that I need to retrieve from the CSV, then I get the index of those columns in the CSV, then I go through each line in the CSV and pull out each of the indexed columns:
get_fields = BaseField.find(:all, :conditions => ['group IN (?)', params[:group_ids]])

csv = CSV.read(csv.path)
first_line = csv.first

col_indexes = []
csv_data = []

csv.each_with_index do |row, index|
  if index == 0
    get_fields.each do |col|
      col_indexes << row.index(col.name)
    end
  else
    csv_row = []
    col_indexes.each do |col|
      # possibly check the value here against another mysql query, but that's ugly
      csv_row << row[col]
    end
    csv_data << csv_row
  end
end
The problem is that when I'm adding the content of the csv_data for output, I no longer have any connection to the original get_fields query. Therefore, I can't seem to say 'does this match the type of data expected from the db'.
I could work my way back through the same process that got me down to that level, and make another query like this
get_cleanup = BaseField.find_by_csv_col_name(first_line[col])
if get_cleanup.format == row[col].is_a
  csv_row << row[col]
else
  # do some data clean-up
end
but as I mentioned, that could mean the get_cleanup query is run 2000+ times.
Instead of doing this, is there a way to search within the original get_fields result for the name, and then get the associated field?
I tried searching for 'search rails object', but kept getting back results about building search, not searching within an already existing object.
I know I can do array.search, but don't see anything in the object api about search.
Note: The code above may not be perfect, because I'm not running it yet, just wrote that off the top of my head, but hopefully it gives you the idea of what I'm going for.
When you populate your col_indexes array, rather than storing a single value, you can store a hash that includes the index, the column name, and the expected format.
get_fields.each do |col|
  col_info = { :row_index => row.index(col.name), :name => col.name, :format => col.format }
  col_indexes << col_info
end
You can then access all of that data inside the loop.
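For instance, the else branch from the original loop could then look something like this (a rough sketch; value_matches_format? and clean_up are hypothetical helpers standing in for whatever check and clean-up you actually need):

csv_row = []
col_indexes.each do |col_info|
  value = row[col_info[:row_index]]
  if value_matches_format?(value, col_info[:format]) # hypothetical format check
    csv_row << value
  else
    csv_row << clean_up(value, col_info[:format])    # hypothetical clean-up helper
  end
end
csv_data << csv_row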