Custom Sink to Snowflake using Snowflake JDBC driver is very slow - spring-cloud-dataflow

I am using Spring Cloud Data Flow to create a custom stream to load data into Snowflake. I have written a custom sink that loads data into Snowflake using Snowflake's JDBC driver. The methodology I used is similar to any other database update, with the following steps:
Create a connection pool (I used HikariCP) to obtain Snowflake database connections.
Using a prepared statement, build a batch of rows to commit all at once.
Using a scheduled timer, commit the batch to Snowflake.
This is when I noticed that the batch is being applied very slowly in Snowflake, i.e. one or two records at a time, and a batch of 8K rows took well over 45 minutes to land in the Snowflake table (using an XS warehouse).
My question: Is there a better/another/recommended method to stream data into Snowflake? I am aware of the Kafka connector for Snowflake and of Snowpipe (which uses an internal/external stage), but these are not options we would like to pursue.
String compiledQuery = "INSERT INTO " + env.getProperty("snowtable")
        + " SELECT parse_json(column1) FROM VALUES (?)";

// Obtain a pooled connection and send the whole batch in one executeBatch() call
try (Connection conn = DataSource.getConnection();
     PreparedStatement preparedStatement = conn.prepareStatement(compiledQuery)) {
    for (int i = 0; i < messageslocal.size(); i++) {
        preparedStatement.setString(1, messageslocal.get(i));
        preparedStatement.addBatch();
    }
    preparedStatement.executeBatch();
}
Thank you!

Generally speaking, Snowflake - like many column-store or hybrid-store databases - does not perform well for single-row or small-batch inserts. So the poor performance you are seeing does not look strange to me, especially on an XS warehouse.
Without knowing the context of your task, I would suggest writing to a JSON, Parquet, or CSV file (stored on S3 if you're in AWS) instead of writing directly to Snowflake through JDBC. You can make that JSON/Parquet/CSV file available through a stage in Snowflake.
Then you can either write a process that copies the stage data into a table, or put a materialized view on top of the stage. The materialized view does more or less the equivalent of triggering the extraction of the JSON/Parquet/CSV data into a Snowflake table, but it operates asynchronously without impacting your application's performance.
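For reference, here is a minimal sketch of the stage-then-copy step driven through JDBC; the stage, file, and table names are placeholder assumptions, not values from the question:
// Hypothetical names: /tmp/events.json, @my_json_stage, my_table.
// PUT uploads and compresses the local file into the named internal stage,
// COPY INTO then loads the staged file(s) into the target table in bulk.
try (Connection conn = DataSource.getConnection();
     Statement stmt = conn.createStatement()) {
    stmt.execute("PUT file:///tmp/events.json @my_json_stage AUTO_COMPRESS=TRUE");
    stmt.execute("COPY INTO my_table FROM @my_json_stage FILE_FORMAT = (TYPE = 'JSON')");
}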

In addition to the great answer by @JeromeE, you should also try using a multi-row insert. What you have in your code is a batch of individual inserts.
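As a rough sketch only (reusing the question's messageslocal list, with a placeholder table name instead of env.getProperty("snowtable")), a multi-row insert could be built like this:
// Build one INSERT that carries every row, instead of a JDBC batch of single-row inserts.
StringBuilder sql = new StringBuilder(
        "INSERT INTO my_table SELECT parse_json(column1) FROM VALUES ");
for (int i = 0; i < messageslocal.size(); i++) {
    sql.append(i == 0 ? "(?)" : ", (?)");
}
try (Connection conn = DataSource.getConnection();
     PreparedStatement ps = conn.prepareStatement(sql.toString())) {
    for (int i = 0; i < messageslocal.size(); i++) {
        ps.setString(i + 1, messageslocal.get(i));   // JDBC parameters are 1-based
    }
    ps.executeUpdate();                              // one statement, many rows
}
For very large batches you may need to chunk the list, since drivers and servers typically cap the number of bind parameters per statement.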

Related

Reading bulk data from a database using Apache Beam

I would like to know how JdbcIO would execute a query in parallel if my query returns millions of rows.
I have referred to https://issues.apache.org/jira/browse/BEAM-2803 and the related pull requests, but I couldn't understand it completely.
The ReadAll expand method uses a ParDo. Would it therefore create multiple connections to the database to read the data in parallel? If I restrict the number of connections that can be created to the DB in the datasource, will it stick to that connection limit?
Can anyone please help me understand how this is handled in JdbcIO? I am using version 2.2.0.
Update:
.apply(
    ParDo.of(
        new ReadFn<>(
            getDataSourceConfiguration(),
            getQuery(),
            getParameterSetter(),
            getRowMapper())))
The above code shows that ReadFn is applied with a ParDo. I think the ReadFn will run in parallel. If my assumption is correct, how would I use the readAll() method to read from a DB where I can establish only a limited number of connections at a time?
Thanks
Balu
The ReadAll method handles the case where you have many queries. You can store the queries as a PCollection of strings, where each string is a query. Then, when reading, each item is processed as a separate query in a single ParDo.
This does not work well for a small number of queries, because it limits parallelism to the number of queries. But if you have many, it will perform much faster. This is the case for most ReadAll calls.
From the code it looks like a connection is made per worker in the setup function. That connection might serve several queries, depending on the number of workers and the number of queries.
Where is the query limit set? It should behave similarly with or without ReadAll.
See the JIRA for more information: https://issues.apache.org/jira/browse/BEAM-2706
I am not very familiar with JdbcIO, but it seems they implemented the version suggested in the JIRA, where the PCollection can be of anything and a callback then modifies the query depending on the element in the PCollection. This allows each item in the PCollection to represent a query, but is a bit more flexible than having a new query as each element.
I created a DataSource as follows.
ComboPooledDataSource cpds = new ComboPooledDataSource();
cpds.setDriverClass("com.mysql.jdbc.Driver"); // loads the jdbc driver
cpds.setJdbcUrl("jdbc:mysql://<IP>:3306/employees");
cpds.setUser("root");
cpds.setPassword("root");
cpds.setMaxPoolSize(5);
There is a better way to set this driver now.
I set the database pool size to 5. When applying the JdbcIO transform, I used this datasource to create the connection.
In the pipeline, I set
option.setMaxNumWorkers(5);
option.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
I used a query which would return around 3 million records. While observing the DB connections, the number of connections was gradually increasing while the program was running, and it used at most 5 connections at any point.
I think this is how we can limit the number of connections created to a DB when running a JdbcIO transform to load a bulk amount of data from a database.
Maven dependency for ComboPooledDataSource:
<dependency>
    <groupId>c3p0</groupId>
    <artifactId>c3p0</artifactId>
    <version>0.9.1.2</version>
</dependency>
Please feel free to correct the answer if I missed something here.
I had a similar task.
I got the count of records from the database and split it into ranges of 1000 records.
Then I applied readAll to a PCollection of those ranges (see the sketch after this answer).
Here is a description of the solution.
And thanks, Balu, regarding the datasource configuration.
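For illustration only, a minimal sketch of that ranges-plus-readAll approach with the Beam Java SDK might look like the following. The table, column names, and range bounds are placeholder assumptions; cpds is the pooled DataSource from the earlier answer, and pipeline/rangeList are assumed to exist in your driver code:
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// rangeList holds e.g. KV.of(0, 1000), KV.of(1000, 2000), ... built from a COUNT/MIN/MAX query
PCollection<KV<Integer, Integer>> ranges = pipeline.apply(Create.of(rangeList));

PCollection<String> names = ranges.apply(
    JdbcIO.<KV<Integer, Integer>, String>readAll()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(cpds))
        .withQuery("SELECT first_name FROM employees WHERE emp_no >= ? AND emp_no < ?")
        .withParameterSetter((range, stmt) -> {
            // each element of the PCollection parameterizes one query
            stmt.setInt(1, range.getKey());
            stmt.setInt(2, range.getValue());
        })
        .withRowMapper(rs -> rs.getString(1))
        .withCoder(StringUtf8Coder.of()));
Because the pool is capped at 5 connections and each worker reads through that pool, the number of concurrent DB connections stays bounded even though many ranges are read in parallel.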

Custom ETL OrientDB from PostgreSQL

I am working on a Rails project whose database is the OrientDB graph database. I need to transfer data from Postgres into an OrientDB graph. I have written scripts in Ruby to fetch data from Postgres and load it into the graph structure by creating the relevant edges and nodes.
However, this process is very slow and is taking months to load a million records. The graph is somewhat densely connected.
I wanted to use the built-in ETL configuration provided by OrientDB, but it seems relatively complex since I need to create multiple vertices from fields in the same table and then connect them. I referred to this documentation.
Can I write a custom ETL that loads data into OrientDB at the same speed as the built-in ETL tool?
Also, are there any benchmarks for the speed of data loading into OrientDB?
If ETL doesn't fit your needs, you can write a custom importer using Java or any other JVM language of your choice.
If you only need to import the DB once, the best way is to use plocal access (embedded) and then move the resulting database under the server.
With this approach you can achieve the best performance, because the network isn't involved.
The code should be something like this snippet:
OrientGraphFactory fc = new OrientGraphFactory("plocal:./databases/import", "admin", "admin");
Connection conn = DriverManager.getConnection("jdbc....");
Statement stmt = conn.createStatement();
ResultSet resultSet = stmt.executeQuery("SELECT * FROM table");
while (resultSet.next()) {
    // one transaction per source row: create both vertices and the edge, then commit
    OrientGraph graph = fc.getTx();
    OrientVertex vertex1 = graph.addVertex("class:Class1", "name", resultSet.getString("name"));
    OrientVertex vertex2 = graph.addVertex("class:Class2", "city", resultSet.getString("city"));
    graph.addEdge(null, vertex1, vertex2, "class:edgeClass");
    graph.shutdown();
}
resultSet.close();
stmt.close();
conn.close();
fc.close();
It is more pseudo-code than working code, but take it as a template for the operations needed to import a single query/table from the original RDBMS.
About performance, it is quite hard to give numbers; it depends on many factors: schema complexity, type of access (plocal, remote), lookups, and the connection speed of the data source.
A few more words about Teleporter. It imports the original database schema and data into OrientDB automatically. AFAIK you already have a working OrientDB schema, and Teleporter will certainly not recreate the same schema on OrientDB that you designed.

Stored procedure history and what application is calling it?

I was given the batch work to research our 200 stored procedures and find out a bunch of different information about them. Is there any way in SQL Server 2012 to pull execution history on stored procedures? Also, is there any way to tell what application might be calling a stored procedure? Even an IP address would be helpful, because we have several servers that do various processing.
Any information you can provide me about this would be extremely helpful. I am relatively new to this type of thing in SQL. Thanks!
Is there any way in SQL Server 2012 to pull execution history on stored procedures?
You can use sys.dm_exec_procedure_stats to find stored procedure execution times, as well as the most time-consuming and CPU-intensive ones:
SELECT TOP 10
    d.object_id, d.database_id,
    OBJECT_NAME(object_id, database_id) AS 'proc name',
    d.cached_time, d.last_execution_time, d.total_elapsed_time,
    d.total_elapsed_time / d.execution_count AS [avg_elapsed_time],
    d.last_elapsed_time, d.execution_count
FROM
    sys.dm_exec_procedure_stats AS d
ORDER BY
    [total_worker_time] DESC;
Also, is there any way to tell what application might be calling the stored procedure? Even an IP address would be helpful, because we have several servers that do various processing.
The answer to both of the above questions is NO, unless you monitor them in real time using the query below. You could run it from a SQL Server Agent job at predefined intervals and capture the output into a table. Also note that it returns the individual statements currently running inside a stored procedure.
SELECT
    r.session_id,
    s.login_name,
    c.client_net_address,
    s.host_name,
    s.program_name,
    st.text
FROM
    sys.dm_exec_requests r
INNER JOIN
    sys.dm_exec_sessions s ON r.session_id = s.session_id
LEFT JOIN
    sys.dm_exec_connections c ON r.session_id = c.session_id
OUTER APPLY
    sys.dm_exec_sql_text(r.sql_handle) st

How to persist large amounts of data by reading from a CSV file

How do I persist large amounts of data by reading from a CSV file (say 20 million rows)?
This has been running for close to 1.5 days so far and has persisted only 10 million rows. How can I batch this so that it becomes faster, and is there a possibility of running it in parallel?
I am using the code here to read the CSV; I would like to know if there is a better way to achieve this.
Refer: dealing with large CSV files (20G) in ruby
You can try to first split the file into several smaller files; then you will be able to process several files in parallel.
For splitting the file, it will probably be faster to use a tool like split:
split -l 1000000 ./test.txt ./out-files-
Then, while processing each of the files, instead of inserting the records one by one you can combine them into batches and do bulk inserts. Something like:
INSERT INTO some_table
VALUES
(1,'data1'),
(2, 'data2')
For better performance you'll need to build the SQL statement yourself and execute it:
ActiveRecord::Base.connection.execute('INSERT INTO <whatever you have built>')
Since you would like to persist your data to MySQL for further processing, using LOAD DATA INFILE from MySQL would be faster. Something like the following, adapted to your schema:
sql = "LOAD DATA LOCAL INFILE 'big_data.csv'
INTO TABLE tests
FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
LINES TERMINATED BY '\n'
(foo,foo1)"
con = ActiveRecord::Base.connection
con.execute(sql)
Key points:
If you use the MySQL InnoDB engine, my advice is to always define an auto-increment PRIMARY KEY: InnoDB uses a clustered index to store data in the table, and a clustered index determines the physical order of data in a table.
Refer: http://www.ovaistariq.net/521/understanding-innodb-clustered-indexes/
Configure your MySQL server parameters; the most important ones are:
(1) disable the MySQL binlog
(2) innodb_buffer_pool_size
(3) innodb_flush_log_at_trx_commit
(4) bulk_insert_buffer_size
You can read this: http://www.percona.com/blog/2013/09/20/innodb-performance-optimization-basics-updated/
You should use a producer-consumer scenario (see the sketch after this answer).
Sorry for my poor English.
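For what it's worth, here is a minimal producer-consumer sketch, shown in Java since the pattern is language-agnostic (in Ruby you could do the same with Thread and Queue). The file name, batch size, thread count, and the insertBatch helper are hypothetical placeholders:
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
final String POISON = new String("EOF");          // sentinel telling consumers to stop
int consumerCount = 4;

// producer: stream the file line by line instead of loading it all into memory
new Thread(() -> {
    try (BufferedReader reader = Files.newBufferedReader(Paths.get("big_data.csv"))) {
        String line;
        while ((line = reader.readLine()) != null) queue.put(line);
        for (int i = 0; i < consumerCount; i++) queue.put(POISON);
    } catch (Exception e) { throw new RuntimeException(e); }
}).start();

// consumers: collect rows into batches and run one multi-row INSERT per batch
for (int i = 0; i < consumerCount; i++) {
    new Thread(() -> {
        try {
            List<String> batch = new ArrayList<>(1_000);
            String line;
            while ((line = queue.take()) != POISON) {
                batch.add(line);
                if (batch.size() == 1_000) {
                    insertBatch(batch);              // hypothetical helper: one bulk INSERT per batch
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) insertBatch(batch);
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }).start();
}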

How to execute SQL statements on a dataset which didn't come from a database?

Suppose I have an application which fetches a custom XML packet from the server that represents a dataset. Then suppose I wish to execute an SQL statement on that data via a dataset. What can I use to do this? I don't necessarily need the code, just what to use to make this possible and a general explanation of how it works.
For example, I may fetch a list of customers in XML format from the server. Then I can use any third-party parser to dump that XML data into some client dataset. Then I want to execute a query on that dataset, for example select * from customers where ZipCode = '12345', without fetching this data from the server again.
XML is not the only limitation; that's just an example. I might want to do the same with some application settings loaded from an INI file. Either way, the concept is that the original source of the data is unknown.
Whether the dataset stores its temporary data in memory or on disk doesn't matter, but it would be excellent if it could keep it on disk.
TXQuery (http://code.google.com/p/txquery/) is a component that provides a local SQL engine for executing SQL queries against one or more TDataSets. The only issue I have had with it is updating data via a TDBGrid of a query joining multiple tables (TDataSets) - specifically, which table is being updated.
AnyDac v6 (now FireDac) also has a local SQL engine. http://www.da-soft.com/anydac/docu/frames.html?frmname=topic&frmfile=Local_SQL.html
Edit: For the example SQL in your question, because it only involves a single table, you can do this with just a Filter on the dataset. For example:
ADataSet.Filtered := False;
ADataSet.Filter := 'ZipCode=' + QuotedStr('12345');
ADataSet.Filtered := True;
Such a feature can be achieved using a local database. You just insert the TDataSet result into a local in-memory (or file-based) stand-alone database; then you can use regular SQL queries on it, including JOINs.
You can, for instance, use SQLite3 or the free edition of NexusDB.
NexusDB embedded has the benefit of being a native Delphi database, so it sticks to the DB.pas TDataSet paradigm.
Another option is to use the so-called virtual table mechanism of SQLite3, which allows you to expose any data (even from a TDataSet, XML, JSON, or in-memory objects) to the SQLite3 engine, just like regular tables. Then you can run SQL statements on those "virtual" tables, including JOINs. With this approach you do not need to INSERT the data into regular tables; the data remains in its original form. Of course, you will miss some performance features like indexes, which have to be handled on the virtual table provider side. We use this feature as the database core of our mORMot ORM/SOA framework, and it is pretty powerful.
The general process that you want to perform is complicated by the difference in data representation. SQL data is stored in tables made up of distinguishable records. XML is a structured representation of data, but in tree form rather than table/row form.
Each of these data forms may be qualified by a schema that provides a context for the data.
You have two general paths that you can follow:
Take the XML and, based on the schema, insert it into a set of interlinked tables, then perform the SQL query. If you have the schema, you can use code generators to make a parser, and then, based on the parse tree, insert into a local DB with tables constructed on the fly. You can set up MySQL fairly easily from https://dev.mysql.com/doc/refman/5.7/en/installing.html and then, in your version of Delphi, make a connection to the database, fill it first, then query. This would satisfy your desire to have the data stored on disk; unless you purge the tables when done, the data remain available in the local machine's DB.
This seems like more work than:
Use XPath or XQuery and work directly on the XML. For this, a package like Saxon in your favorite environment, or Expat in Python, would work nicely.
Let me know if either of these paths seems as if it may be fruitful.
