How to insert large CSV into a CLOB column with DB2/Rails - ruby-on-rails

Problem: I have a large CSV that I want to insert into a DB2 table with Rails.
Description: The CSV is about 2k lines / 8K characters. The CLOB column is set up to handle over 10K characters. I can insert the CSV just fine through RubyMine's database console; however, my app crashes.
ActiveRecord produces one huge insert query. Code:
Logger.create(csv: csv_data.to_s)
DB2 returns an error:
ActiveRecord::JDBCError: [SQL0102] String constant beginning with 'foobar' too long.
I can insert huge PDF files into BLOB columns just fine using similar code. I tried creating the record first and then updating it with the data; no difference.
This problem is the same as this one, except I need a Rails solution rather than a general one.

I found a workaround: split csv_data into chunks and append each chunk to the column:
update_attribute(:csv, '') if self.csv.nil? # Can't CONCAT to nil
# Split csv_data into chunks, concatenate each one to the field
csv_data.scan(/.{1,6144}/m).each do |part|
  parm = ActiveRecord::Base.connection.quote(part)
  ActiveRecord::Base.connection.execute("update #{Logger.table_name} set csv = CONCAT(csv, #{parm}) where id = #{self.id}")
end
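For reuse, a minimal sketch of the same workaround wrapped as a model method (the method name and the 6144-character chunk size are my own choices, not from the original post):
class Logger < ActiveRecord::Base
  CLOB_CHUNK_SIZE = 6144

  # Append csv_data to the CLOB column in chunks small enough to keep each
  # UPDATE statement under DB2's limit on string constants.
  def append_csv!(csv_data)
    update_attribute(:csv, '') if csv.nil? # can't CONCAT to NULL
    conn = self.class.connection
    csv_data.scan(/.{1,#{CLOB_CHUNK_SIZE}}/m).each do |part|
      quoted = conn.quote(part)
      conn.execute("update #{self.class.table_name} set csv = CONCAT(csv, #{quoted}) where id = #{id}")
    end
  end
end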

Related

How to upload Polygons from GeoPandas to Snowflake?

I have a geometry column of a geodataframe populated with polygons and I need to upload these to Snowflake.
I have been exporting the geometry column of the geodataframe to file and have tried both CSV and GeoJSON formats, but so far I either get an error or the staging table winds up empty.
Here's my code:
design_gdf['geometry'].to_csv('polygons.csv', index=False, header=False, sep='|', compression=None)
import sqlalchemy
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL
engine = create_engine(
    URL(<Snowflake Credentials Here>)
)
with engine.connect() as con:
    con.execute("PUT file://<path to polygons.csv> @~ AUTO_COMPRESS=FALSE")
Then on Snowflake I run
create or replace table DB.SCHEMA.DESIGN_POLYGONS_STAGING (geometry GEOGRAPHY);
copy into DB.SCHEMA."DESIGN_POLYGONS_STAGING"
from @~/polygons.csv
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' SKIP_HEADER = 1 compression = None encoding = 'iso-8859-1');
This generates the following error:
"Number of columns in file (6) does not match that of the corresponding table (1), use file format option error_on_column_count_mismatch=false to ignore this error File '#~/polygons.csv.gz', line 3, character 1 Row 1 starts at line 2, column "DESIGN_POLYGONS_STAGING"[6] If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client."
Can anyone identify what I'm doing wrong?
Inspired by @Simeon_Pilgrim's comment I went back to Snowflake's documentation. There I found an example of converting a string literal to a GEOGRAPHY.
https://docs.snowflake.com/en/sql-reference/functions/to_geography.html#examples
select to_geography('POINT(-122.35 37.55)');
My polygons looked more like strings describing polygons than actual GEOGRAPHY values, so I decided I needed to treat them as strings and then call TO_GEOGRAPHY() on them.
I quickly discovered that they needed to be explicitly enclosed in single quotes and copied into a VARCHAR column in the staging table. This was accomplished by modifying the CSV export code:
import csv
design_gdf['geometry'].to_csv(<path to polygons.csv>,
index=False, header=False, sep='|', compression=None, quoting=csv.QUOTE_ALL, quotechar="'")
The staging table now looks like:
create or replace table DB.SCHEMA."DESIGN_POLYGONS_STAGING" (geometry VARCHAR);
I ran into further problems copying into the staging table related to the presence of a polygons.csv.gz file I must have uploaded in a previous experiment. I deleted this file using:
remove @~/polygons.csv.gz
Finally, converting the staging table to GEOGRAPHY:
create or replace table DB.SCHEMA."DESIGN_GEOGRAPHY" (geometry GEOGRAPHY);
insert into DB.SCHEMA."DESIGN_GEOGRAPHY"
select to_geography(geometry)
from DB.SCHEMA."DESIGN_POLYGONS_STAGING"
and I wound up with a DESIGN_GEOGRAPHY table with a single column of GEOGRAPHYs in it. Success!!!

Converting string to csv

I have a piece of code in Ruby which essentially adds multiple lines into a csv through the use of
csv_out << listX
I have both a header, which I supply in the **options, and regular data.
The problem is that when I try to view the CSV, all the values are in one row; no software seems to recognize '\n' as a line separator.
Input example:
Make, Year, Mileage\n,Ford,2019,10000\nAudi, 2000, 100000
Output dimensions:
8x1 table
Desired dimensions:
3x3 table
Any idea how to get around that, either by replacing '\n' with something else or by using something other than CSV.generate?
csv = CSV.generate(encoding: 'UTF-8') do |csv_out|
  csv_out << headers
  data.each do |row|
    csv_out << row.values
  end
end
The problem seems to be the data.each part. Assuming that data holds the string you have posted, this loop is executed only once, and the whole string is written into a single row.
You have to loop over the individual pieces of data, for instance with
data.split("\n").each

How can I prevent SQL injections during CSV uploads?

I've just started learning about Rails security, and I'm wondering how I can avoid security issues while allowing users to upload CSV files into our database. We're using Postgres' "copy from stdin" functionality to upload the data from the CSV into a temp table, which is then used for upserts into another table. This is the basic code (thanks to this post):
conn = ActiveRecord::Base.connection_pool.checkout
raw = conn.raw_connection
raw.exec("COPY temp_table (col1, col2) FROM STDIN DELIMITER '|'")
# read column values from the CSV line by line in the following format:
# attributes = {column_1: 'column 1 data', column_2: 'column 2 data'}
# line = "#{attributes.values.join('|')}\n"
raw.put_copy_data line
# wrap up copy process & insert into & update primary table
I am wondering what I can or should do to sanitize the column values. We're using Rails 3.2 and Postgres 9.2.
No action is required; COPY never interprets the values as SQL syntax. Malformed CSV will produce an error due to bad quoting or an incorrect column count. If you're sending your own data line by line you should probably exclude a line containing just \. followed by a newline (COPY's end-of-data marker), but otherwise it's rather safe.
PostgreSQL doesn't sanitize the data in any way, it just handles it safely. So if you accept a string ');DROP TABLE customer;-- in your CSV it's quite safe in COPY. However, if your application reads that out of the database, assumes that "because it came from the database not the user it's safe," and interpolates it into an SQL string you're still just as stuffed.
Similarly, incorrect use of PL/PgSQL functions where EXECUTE is used with unsafe string concatenation will create problems. You must use format with the %I or %L specifiers, use quote_literal / quote_ident, or (for literals) use EXECUTE ... USING.
This is not just true of COPY, it's the same if you do an INSERT of the manipulated data then use it unsafely after reading it back from the DB.
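As a hedged illustration (not part of the original answer): the only escaping COPY's text format needs is for its own control characters, so a hypothetical helper like copy_escape below could be applied to each value before building line in the question's code:
# Characters that are special in COPY's text format (with '|' as the delimiter)
COPY_TEXT_ESCAPES = {
  "\\" => "\\\\", # backslash is COPY's escape character
  "|"  => "\\|",  # the chosen delimiter
  "\n" => "\\n",  # embedded newline
  "\r" => "\\r",  # embedded carriage return
}

def copy_escape(value)
  value.to_s.gsub(/[\\|\n\r]/) { |ch| COPY_TEXT_ESCAPES[ch] }
end

line = attributes.values.map { |v| copy_escape(v) }.join('|') + "\n"
raw.put_copy_data line
This keeps stray delimiters and newlines inside a field from shifting columns; it has nothing to do with SQL escaping, because the copied values are never parsed as SQL.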

searching within an already retrieved mysql result

I'm trying to limit the number of times I run a mysql query, as this could end up being 2k+ queries just to accomplish a fairly small task.
I'm going through a CSV file, and I need to check that the format of the content in the csv matches the format the db expects; sometimes I also need to do some basic clean-up (for example, I have one field that is a string, but it sometimes appears in the csv as jb2003-343, and I need to strip out the -343).
The first thing I do is get from the database the list of fields, by name, that I need to retrieve from the csv; then I get the index of those columns in the csv; then I go through each line in the csv and collect each of the indexed columns:
get_fields = BaseField.find(:all, :conditions => ['group IN (?)', params[:group_ids]])
csv = CSV.read(csv.path)
first_line = csv.first # CSV.read already returns each row as an array, so no extra split is needed
col_indexes = []
csv_data = []
csv.each_with_index do |row, i|
  if i == 0
    get_fields.each do |col|
      col_indexes << row.index(col.name)
    end
  else
    csv_row = []
    col_indexes.each do |col|
      # possibly check the value here against another mysql query, but that's ugly
      csv_row << row[col]
    end
    csv_data << csv_row
  end
end
The problem is that when I'm adding the content of the csv_data for output, I no longer have any connection to the original get_fields query. Therefore, I can't seem to say 'does this match the type of data expected from the db'.
I could work my way back through the same process that got me down to that level, and make another query like this
get_cleanup = BaseField.find_by_csv_col_name(first_line[col])
if get_cleanup.format == row[col].class.to_s
  csv_row << row[col]
else
  # do some data clean-up
end
but as I mentioned, that could mean the get_cleanup is run 2000+ times.
Instead of doing this, is there a way to search within the original get_fields result for the name, and then get the associated field?
I tried searching for 'search rails object', but kept getting back results about building search, not searching within an already existing object.
I know I can do array.search, but don't see anything in the object api about search.
Note: The code above may not be perfect, because I'm not running it yet, just wrote that off the top of my head, but hopefully it gives you the idea of what I'm going for.
When you populate your col_indexes array, rather than storing a single value, you can store a hash which includes index and the datatype.
get_fields.each do |col|
  col_info = { :row_index => row.index(col.name), :name => col.name, :format => col.format }
  col_indexes << col_info
end
You can then access all of that data inside the loop.
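For instance, a minimal sketch of the non-header branch using those hashes (assuming BaseField exposes a format attribute, as the question implies):
csv_row = []
col_indexes.each do |col|
  value = row[col[:row_index]]
  # col[:format] is available right here for validation or clean-up,
  # without running another database query
  csv_row << value
end
csv_data << csv_row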

How do I populate a table from an Excel Spreadsheet in Rails?

I have a simple 4-column Excel spreadsheet that matches universities to their ID codes for lookup purposes. The file is pretty big (300k).
I need to come up with a way to turn this data into a populated table in my Rails app. The catch is that this is a document that is updated now and then, so it can't just be a one-time solution. Ideally, it would be some sort of Ruby script that would read the file and create the entries, so that when we get emailed a new version we can just update the table automatically. I'm on Heroku, if that matters at all.
How can I accomplish something like this?
If you can, save the spreadsheet as CSV; there are much better gems for parsing CSV files than for parsing Excel spreadsheets. An effective way of handling this kind of problem is to make a rake task that reads the CSV file and creates or updates the records as appropriate.
So, for example, here's how to read all the lines from a file using the old but still effective FasterCSV gem:
data = FasterCSV.read('lib/tasks/data.csv')
columns = data.shift # the first row holds the column names
unique_column_index = -1 # the index of a column that's always unique per row in the spreadsheet
data.each do |row|
  r = Record.find_or_initialize_by_unique_column(row[unique_column_index])
  columns.each_with_index do |column_name, index|
    r[column_name] = row[index]
  end
  begin
    r.save!
  rescue => e
    Rails.logger.error("Failed to save #{r.inspect}: #{e.message}")
  end
end
It does kinda rely on you having a unique column in the original spreadsheet to go off though.
If you put that into a rake task, you can then wire it into your Capistrano deploy script so it'll be run every time you deploy. The find_or_initialize should ensure you don't get duplicate records.
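For completeness, a sketch of the rake-task wrapper I have in mind (the task and file names here are placeholders, not from the original answer):
# lib/tasks/import.rake
namespace :import do
  desc 'Load records from lib/tasks/data.csv'
  task :records => :environment do
    data = FasterCSV.read('lib/tasks/data.csv')
    # ... the find_or_initialize loop from above goes here ...
  end
end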
Parsing Excel files saved in Excel's XML Spreadsheet format isn't too much trouble using Hpricot. This will give you a two-dimensional array:
require 'hpricot'
doc = open("data.xml") { |f| Hpricot(f) }
rows = doc.search('row')
rows = rows[1..-1] # skip the header row
rows = rows.map do |row|
  columns = []
  row.search('cell').each do |cell|
    # Excel stores cell indexes rather than blank cells
    next_index = (cell.attributes['ss:Index']) ? (cell.attributes['ss:Index'].to_i - 1) : columns.length
    columns[next_index] = cell.search('data').inner_html
  end
  columns
end
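And, as a hedged follow-on, the resulting two-dimensional array can be fed into the same find_or_initialize pattern as the rake-task answer above (the University model, its code column, and the column order are assumptions for illustration only):
rows.each do |columns|
  # assuming column 0 holds the university name and column 1 its ID code
  university = University.find_or_initialize_by_code(columns[1])
  university.name = columns[0]
  university.save!
end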
