I'm trying to limit the number of times I run a MySQL query, as this could end up being 2k+ queries just to produce a fairly small result.
I'm going through a CSV file, and I need to check that the format of the content in the csv matches the format the db expects, and sometimes I need to do some basic clean-up (for example, I have one field that is a string, but sometimes appears in the csv as jb2003-343, and I need to strip out the -343).
The first thing I do is get from the database the list of fields by name that I need to retrieve from the csv, then I get the index of those columns in the csv, and then I go through each line in the csv and collect each of the indexed columns:
require 'csv'

# Fields (by name) that need to be pulled out of the csv
get_fields = BaseField.find(:all, :conditions => ['group IN (?)', params[:group_ids]])

csv = CSV.read(csv.path)
first_line = csv.first # CSV.read already splits each line into an array of columns

col_indexes = []
csv_data = []

csv.each_with_index do |row, i|
  if i == 0
    # header row: remember the index of each column we care about
    get_fields.each do |col|
      col_indexes << row.index(col.name)
    end
  else
    csv_row = []
    col_indexes.each do |col|
      # possibly check the value here against another mysql query but that's ugly
      csv_row << row[col]
    end
    csv_data << csv_row
  end
end
The problem is that when I'm adding the content of the csv_data for output, I no longer have any connection to the original get_fields query. Therefore, I can't seem to say 'does this match the type of data expected from the db'.
I could work my way back through the same process that got me down to that level, and make another query like this
get_cleanup = BaseField.find_by_csv_col_name(first_line[col])
if get_cleanup.format == row[col].class.to_s # assuming format holds the name of the expected class
  csv_row << row[col]
else
  # do some data clean-up
end
but as I mentioned, that could mean the get_cleanup is run 2000+ times.
Instead of doing this, is there a way to search within the original get_fields result for the name, and then get the associated field?
I tried searching for 'search rails object', but kept getting back results about building search, not searching within an already existing object.
I know I can do array.search, but don't see anything in the object api about search.
Note: The code above may not be perfect, because I'm not running it yet, just wrote that off the top of my head, but hopefully it gives you the idea of what I'm going for.
When you populate your col_indexes array, rather than storing a single value, you can store a hash that includes the column index along with its name and expected format.
get_fields.each do |col|
  col_info = { :row_index => row.index(col.name), :name => col.name, :format => col.format }
  col_indexes << col_info
end
You can then access both the index and the expected format inside the main row loop.
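A rough sketch of how that might look in the row loop (col_indexes and csv_data are initialized before the loop as before; clean_up is a hypothetical helper you'd write, and :format is assumed to hold a class name such as "String"):

csv.each_with_index do |row, i|
  if i == 0
    get_fields.each do |col|
      col_indexes << { :row_index => row.index(col.name), :name => col.name, :format => col.format }
    end
  else
    csv_row = []
    col_indexes.each do |col|
      value = row[col[:row_index]]
      if value.class.to_s == col[:format] # expected type, keep as-is
        csv_row << value
      else
        csv_row << clean_up(value, col[:format]) # hypothetical clean-up helper
      end
    end
    csv_data << csv_row
  end
end

That way the format check never touches the database inside the loop.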
I'm going to preface this by saying that I'm still learning Ruby.
I'm writing a script to parse a .csv file and identify possible duplicate records in the data set.
I have a .csv file with headers, so I'm parsing the data so that I can access each row using a header title as such:
@contact_table = CSV.parse(File.read("app/data/file.csv"), headers: true)
# Prints all last names in table
puts @contact_table['last_name']
I'm trying to iterate over each row in the table and identify if the last name I'm currently iterating over is similar to the next last name, but I'm having trouble doing this. I guess the way I'm handling it is as if it's an array, but I checked the type and it's a CSV::Row.
example (this doesn't work):
@contact_table.each_with_index do |c, i|
  puts "first contact is #{c['last_name']}, second contact is #{c[i + 1]['last_name']}"
end
I realized this doesn't work like this because the table isn't an array, it's a CSV::Row like I previously mentioned. Is there any method that can achieve this? I'm really blanking right now.
My csv looks something like this:
id,first_name,last_name,company,email,address1,address2,zip,city,state_long,state,phone
1,Donalt,Canter,Gottlieb Group,dcanter0#nydailynews.com,9 Homewood Alley,,50335,Des Moines,Iowa,IA,515-601-4495
2,Daphene,McArthur,"West, Schimmel and Rath",dmcarthur1#twitter.com,43 Grover Parkway,,30311,Atlanta,Georgia,GA,770-271-7837
@contact_table should be a CSV::Table, which is a collection of CSV::Rows, so in this:
@contact_table.each_with_index do |c, i|
...
end
c is a CSV::Row. That's why c['last_name'] works. The problem is that here:
c[i + 1]['last_name']
you're looking at c (a single row) instead of @contact_table. If you said:
@contact_table[i + 1]['last_name']
then you'd get the next last name or, when c is the last row, an exception because @contact_table[i + 1] will be nil.
Also, inside the iteration, c is the current row (the (i+1)-th, counting from one), so it won't always be the first.
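Putting that together, one way to compare each row with the next one (a sketch; the nil check guards the last row):

@contact_table.each_with_index do |c, i|
  next_row = @contact_table[i + 1]
  next if next_row.nil? # c is the last row, nothing to compare against

  puts "first contact is #{c['last_name']}, second contact is #{next_row['last_name']}"
end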
What is your use case for this? Seems like a school project?
I recommend CSV.foreach instead of CSV.parse. I would probably use a Set for this (there's a sketch after the steps below).
Create a Set outside of the scope of parsing the file (i.e., above the parsing code). Let's call it rows.
Call rows.include?(row) during each iteration while parsing the file
If true, then you know you have a duplicate
If false, then call rows.add(row) to add the new row to the set
You could also just fill your set with an individual value from a column that must be distinct (e.g., row.field(:some_column_name)), such as email or phone number, and do the same inclusion check for that.
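Putting those steps together, a minimal sketch (assuming the file has headers and last_name is the field you want to check for duplicates):

require 'csv'
require 'set'

seen = Set.new
duplicates = []

CSV.foreach("app/data/file.csv", headers: true) do |row|
  key = row['last_name'] # or row['email'], row['phone'], etc.
  if seen.include?(key)
    duplicates << row # possible duplicate record
  else
    seen.add(key)
  end
end

puts "Found #{duplicates.size} possible duplicates"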
(If this is for a real app, please don't do this. Use model validations instead.)
I would use #read instead of #parse and do something like this:
require 'csv'
LASTNAME_INDEX = 2
data = CSV.read('data.csv')
data[1..-1].each_with_index do |row, index|
puts "Contact number #{index + 1} has the following last name : #{row[LASTNAME_INDEX]}"
end
#~> Contact number 1 has the following last name : Canter
#~> Contact number 2 has the following last name : McArthur
I am writing an app that needs to quickly process hundreds of thousands of rows of data, so I've looked into nesting raw SQL in my Ruby code using ActiveRecord::Base.connection.execute, which is working beautifully. However whenever I run it I get the following Object as a result:
#<PG::Result:0x007fe158ab18c8 status=PGRES_TUPLES_OK ntuples=0 nfields=1 cmd_tuples=0>
I've googled around and can't find a way to parse the PG Result into something actually useful. Is there any built-in PG way to do this, or a workaround, or anything really?
Here is the query I'm using:
SELECT row_to_json(row(company_name, ccn_short_title, title))
FROM contents
WHERE contents.company_name = '#{company_name}'
AND contents.title = '#{title}';
Actually, PG::Result responds to many well-known methods from the Enumerable module. You can print them all to look for the ones you need:
query = "SELECT row_to_json(row) from (select * from users) row"
result = ActiveRecord::Base.connection.execute(query)
result.methods - Object.methods
# => returns an array of methods which can be used
For example, you could iterate the results and map them to something more suitable...
result.map do |row|
JSON.parse(row["row_to_json"])
end
# => returns familiar hashes
Get a desired result hash by its index...
result[0]
And much more.
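Each element of the result is a plain hash keyed by column name, with values returned as strings, so for the query above result[0] would look roughly like this (the column values are purely illustrative):

result[0]
# => {"row_to_json"=>"{\"id\":1,\"email\":\"someone@example.com\"}"}
# the JSON string can then be fed to JSON.parse as shown above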
I have three tables in my database: values, keys, and sources. Both keys and sources have many values. sources has a datetime column called file_date. For each key, I have to select all values whose source falls within a specific date range (usually within two years). Not every date within that range has a source, and not every source has a value. I need to create an array that contains all values from within that range.
I at first tried to simply query the entire sources array at once, like so:
Value.where(key: 1, source: sources_array)
However, since several of those sources have no value, that only returned records for the dates that actually have one.
So now, I've created an array containing all of the sources within that date range. Days that don't have a source simply have nil. Then, for each key, I iterate through the sources array, returning the value that matches that key and source. That is obviously not ideal, and it takes about 7 seconds for the page to load.
Here is that code:
sources.map do |source|
Value.where(source: source, key: key)
end
Any ideas?
You could also generate a single SQL query if you only need to retrieve the data and getting back arrays will suffice.
output = []
sources.compact.each do |source| # skip the nil entries for days without a source
  # assumes key is a Key record and the foreign key columns on values are source_id and key_id
  output << "select * from values where (source_id = #{source.id} and key_id = #{key.id})"
end
result = ActiveRecord::Base.connection.execute(output.join(" union ")).to_a
The big issue with your .map is that it runs a separate query for each source, instead of one query with all the results you want. With each additional query, you add the overhead of sending the request to the database and receiving the response back. What you need to do is generate the query so that it returns all the results you may be looking for, and then deal with any grouping or sorting on the application side.
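For example, a sketch of fetching everything in one query and regrouping in Ruby (assuming the same source_id column as above):

# one query for all values belonging to these sources and this key
values_by_source = Value.where(source: sources.compact, key: key).group_by(&:source_id)

# rebuild the per-day array application side, keeping nil where a day has no source or no value
result = sources.map do |source|
  source && values_by_source[source.id]
end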
You could try wrapping your queries in a single transaction:
ActiveRecord::Base.transaction do
sources.map do |source|
Value.where(source: source, key: key)
end
end
I'll need to refactor this like crazy, but I have something that has cut it down to 3 seconds:
# map each source (or nil, for days without a source) to its id
source_ids = sources.map { |source| source && source.id }

values = Value.where(source: sources, key: key)
value_sources = values.pluck(:source_id)
value_decimals = values.pluck(:value_decimal)

# for each day, look up the matching decimal (nil where there is no value)
source_ids.map do |id|
  index = value_sources.index(id)
  index && value_decimals[index]
end
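A possible next step for the refactor (a sketch, assuming Rails 4+ where pluck accepts multiple columns, and the same source_id / value_decimal columns): build a source_id => value_decimal hash once, so each day becomes a constant-time hash lookup instead of an Array#index scan, and only one pluck is needed:

# [[source_id, value_decimal], ...] -> { source_id => value_decimal }
decimals_by_source_id = Hash[Value.where(source: sources.compact, key: key).pluck(:source_id, :value_decimal)]

sources.map { |source| source && decimals_by_source_id[source.id] }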
Given an array of part ids containing duplicates, how can I find the corresponding records in my Part model, including the duplicates?
An example array of part ids would be ["1B", "4", "3421", "4"]. If we assume I have a record corresponding to each of those values I would like to see 4 records returned in total, not 3. If possible, I was hoping to be able to make additional SQL operations on whatever is returned.
Here's what I'm currently using which doesn't include the duplicates:
@parts = Part.where(:part_id => params[:ids])
To give a little background, I'm trying to upload an XML file containing a list of parts used in some item. My application is meant to parse the XML file and compare the parts listed within against my Parts database so that I can see how much the part weighs. These items will sometimes contain duplicates of various parts so that's what I'm trying to account for here.
The only way I can think of doing it is using map...
@parts = params[:ids].map { |id| Part.find_by_part_id(id) } # one query per id, but preserves duplicates and order
Hard to tell exactly what you are doing; are you looking up the weight from the XML or from your data?
parts_xml = some_method_that_loads_xml
part_ids_from_xml = parts_xml.... # pull out the ids
parts = Part.where("id IN (?)", part_ids_from_xml)
Now you have two arrays (the XML data and your 'matching' database records), and you can use select or detect to do in-memory lookups by part id:
part_ids_from_xml.each do |part_id|
weight = parts.detect { |item| item.id == part_id }.weight
puts "#{id} weighs #{weight}"
end
see http://ruby-doc.org/core-2.0.0/Enumerable.html#method-i-detect
and http://ruby-doc.org/core-2.0.0/Enumerable.html#method-i-select
I have a simple 4-column Excel spreadsheet that matches universities to their ID codes for lookup purposes. The file is pretty big (300k).
I need to come up with a way to turn this data into a populated table in my Rails app. The catch is that this is a document that is updated now and then, so it can't just be a one-time solution. Ideally, it would be some sort of ruby script that would read the file and create the entries automatically so that when we get emailed a new version, we can just update it automatically. I'm on Heroku if that matters at all.
How can I accomplish something like this?
If you can, save the spreadsheet as CSV; there are much better gems for parsing CSV files than for parsing Excel spreadsheets. I've found that an effective way of handling this kind of problem is to make a rake task that reads the CSV file and creates the records as appropriate.
So, for example, here's how to read all the lines from a file using the old, but still effective, FasterCSV gem:
require 'fastercsv'

data = FasterCSV.read('lib/tasks/data.csv')
columns = data.shift # the first row holds the column names
unique_column_index = -1 # the index of a column that's always unique per row in the spreadsheet

data.each do |row|
  r = Record.find_or_initialize_by_unique_column(row[unique_column_index])
  columns.each_with_index do |column_name, index|
    r[column_name] = row[index]
  end
  begin
    r.save!
  rescue => e
    Rails.logger.error("Failed to save #{r.inspect}: #{e.message}")
  end
end
It does kinda rely on you having a unique column in the original spreadsheet to go off though.
If you put that into a rake task, you can then wire it into your Capistrano deploy script so it'll be run every time you deploy. The find_or_initialize should ensure you don't get duplicate records.
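A sketch of the rake task wrapper (the namespace, task, and file names here are made up; substitute your own):

# lib/tasks/import_universities.rake
namespace :import do
  desc "Import or update university records from the CSV export"
  task :universities => :environment do
    # ... the FasterCSV loop from above goes here ...
  end
end

You'd run it with rake import:universities (or heroku run rake import:universities on Heroku), or have Capistrano invoke it after deploy.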
Parsing Excel files saved in the XML Spreadsheet format (the ss:-namespaced XML that Excel can export) isn't too much trouble using Hpricot. This will give you a two-dimensional array:
require 'hpricot'
# the file needs to be Excel's "XML Spreadsheet" export, not a binary .xls or a zipped .xlsx
doc = open("data.xml") { |f| Hpricot(f) }
rows = doc.search('row')
rows = rows[1..-1] # Skips the header row
rows = rows.map do |row|
columns = []
row.search('cell').each do |cell|
# Excel stores cell indexes rather than blank cells
next_index = (cell.attributes['ss:Index']) ? (cell.attributes['ss:Index'].to_i - 1) : columns.length
columns[next_index] = cell.search('data').inner_html
end
columns
end