I would like to know how JdbcIO would execute a query in parallel if my query returns millions of rows.
I have referred to https://issues.apache.org/jira/browse/BEAM-2803 and the related pull requests, but I couldn't understand them completely.
The ReadAll expand method uses a ParDo. Hence, would it create multiple connections to the database to read the data in parallel? If I restrict the number of connections that can be created to the DB in the datasource, will it stick to that connection limit?
Can anyone please help me to understand how this is handled in JdbcIO? I am using version 2.2.0.
Update:
.apply(
    ParDo.of(
        new ReadFn<>(
            getDataSourceConfiguration(),
            getQuery(),
            getParameterSetter(),
            getRowMapper())))
The above code shows that ReadFn is applied with a ParDo. I think the ReadFn will run in parallel. If my assumption is correct, how would I use the readAll() method to read from a DB where I can establish only a limited number of connections at a time?
Thanks
Balu
The ReadAll method handles the case where you have many queries. You can store the queries as a PCollection of strings, where each string is a query. Then, when reading, each item is processed as a separate query in a single ParDo.
This does not work well for a small number of queries because it limits parallelism to the number of queries. But if you have many, it will perform much faster. This is the case for most of the ReadAll calls.
From the code it looks like a connection is made per worker in the setup function. This might include several queries depending on the number of workers and number of queries.
Where is the query limit set? It should behave similarly with or without ReadAll.
See the jira for more information: https://issues.apache.org/jira/browse/BEAM-2706
I am not very familiar with JdbcIO, but it seems they implemented the version suggested in the jira, where the PCollection can be of anything, and a callback modifies the query depending on the element in the PCollection. This allows each item in the PCollection to represent a query, but it is a bit more flexible than having a new query as each element.
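For illustration, here is a rough, untested sketch of what such a readAll() call could look like; the table, columns, range starts, and the dataSource variable are placeholders, not something from the original question:

PCollection<KV<String, String>> rows = pipeline
    // each element parameterizes one query (here: the start of a 1000-row key range)
    .apply(Create.of(1, 1001, 2001))
    .apply(JdbcIO.<Integer, KV<String, String>>readAll()
        .withDataSourceConfiguration(
            JdbcIO.DataSourceConfiguration.create(dataSource)) // e.g. a pooled javax.sql.DataSource
        .withQuery("SELECT emp_no, first_name FROM employees WHERE emp_no >= ? AND emp_no < ?")
        .withParameterSetter((element, statement) -> {
            statement.setInt(1, element);
            statement.setInt(2, element + 1000);
        })
        .withRowMapper(resultSet -> KV.of(resultSet.getString(1), resultSet.getString(2)))
        .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));

Parallelism then comes from how the runner distributes the elements of that PCollection across workers, while the DataSource still caps the number of physical connections each worker opens.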
I created a DataSource as follows.
ComboPooledDataSource cpds = new ComboPooledDataSource();
cpds.setDriverClass("com.mysql.jdbc.Driver"); // loads the jdbc driver
cpds.setJdbcUrl("jdbc:mysql://<IP>:3306/employees");
cpds.setUser("root");
cpds.setPassword("root");
cpds.setMaxPoolSize(5);
There is a better way to set this driver now.
I set the database pool size to 5. While doing the JdbcIO transform, I used this DataSource to create the connections.
In the pipeline, I set
option.setMaxNumWorkers(5);
option.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
I used a query which would return around 3 million records. While observing the DB connections, the number of connections gradually increased while the program was running, but it used at most 5 connections at any given time.
I think this is how we can limit the number of connections created to a DB while running a JdbcIO transform to load a bulk amount of data from a database.
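For reference, a minimal sketch of how the pooled DataSource above can be handed to JdbcIO (untested; the query and column names are placeholders):

pipeline.apply(JdbcIO.<KV<String, String>>read()
    // DataSourceConfiguration.create(DataSource) lets the c3p0 pool cap the connections
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(cpds))
    .withQuery("SELECT emp_no, first_name FROM employees")
    .withRowMapper(resultSet -> KV.of(resultSet.getString(1), resultSet.getString(2)))
    .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));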
Maven dependency for ComboPooledDataSource:
<dependency>
    <groupId>c3p0</groupId>
    <artifactId>c3p0</artifactId>
    <version>0.9.1.2</version>
</dependency>
*Please feel free to correct the answer if I missed something here.*
I had a similar task.
I got the count of records from the database and split it into ranges of 1,000 records.
Then I applied readAll to the PCollection of ranges (a rough sketch follows below).
Here is a description of the solution.
And thanks to Balu regarding the DataSource configuration.
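Roughly, a sketch of the approach (untested; the table, columns, and the cpds DataSource from Balu's answer are placeholders, and the surrounding method is assumed to declare throws SQLException):

// 1. Get the total row count with a plain JDBC query outside the pipeline.
long total = 0;
try (Connection c = cpds.getConnection();
     Statement s = c.createStatement();
     ResultSet rs = s.executeQuery("SELECT COUNT(*) FROM employees")) {
    rs.next();
    total = rs.getLong(1);
}

// 2. Build one offset per 1,000-row range.
List<Long> offsets = new ArrayList<>();
for (long offset = 0; offset < total; offset += 1000) {
    offsets.add(offset);
}

// 3. readAll() then runs one query per range element.
pipeline
    .apply(Create.of(offsets))
    .apply(JdbcIO.<Long, KV<String, String>>readAll()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(cpds))
        .withQuery("SELECT emp_no, first_name FROM employees ORDER BY emp_no LIMIT 1000 OFFSET ?")
        .withParameterSetter((offset, statement) -> statement.setLong(1, offset))
        .withRowMapper(resultSet -> KV.of(resultSet.getString(1), resultSet.getString(2)))
        .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of())));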
I've always worked with relational DBs and recently decided to migrate a performance-critical service from SQL Server to Tarantool in the hope of taking advantage of the fast in-memory search and processing. I've got a couple of questions while planning for the migration.
I've got a table with about one million records containing pricing information which means I'm dealing mostly with numbers and uuids. First, I need to run a select containing multiple conditions to get a subset of the data, like
SELECT * FROM rates WHERE SupplierId = #SupplierId AND ProductId = #ProductId AND (LocalDistributionZoneId = #LocalDistributionZoneId OR LocalDistributionZoneId IS NULL)
Q1: What is the strategy for running such a query in Lua? Do I create an index for each field in the predicate, or can I go along with one secondary composite index?
Q2: Will it be more convenient to run such a query in SQL (box.sql.execute) rather than in pure Lua? Will it be considerably slower than running the same query in pure Lua?
Q3: If I use SQL, is it possible to review the execution plan to make sure that the query I run really uses the indexes I've defined in the space?
Ok, after I've got the results from the first query, I need to analyse the data and then, based on the results of the analysis, run one more query on the dataset returned by the first query.
Q4: Can Tarantool help me in dealing with the intermediate dataset? More specifically, may I somehow run more queries against the intermediate subset of tuples, leveraging the indexes created in the space? Or would I need to implement alternative strategies, like re-adding the interim results to a temporary space with pre-defined indexes and then doing another select, or implementing further searching myself?
Thank you!
Q1: Don't. Use SQL, it's faster: it doesn't create garbage-collected objects for intermediate execution results.
Q2: Yes, please use our SQL features for that.
Q3: Use the EXPLAIN statement.
Q4: I don't know what exactly you mean by "help". You could try whatever strategy works best: create a more complex query, save the original query in a view to use in the resulting query, or create a temporary table and work with it. To give more details, let's look at whether the execution plan Tarantool chooses is good enough or whether you have to optimize it manually.
I've been running a production website for 4 years with Azure SQL.
With the help of the 'Top Slow Request' query from alexsorokoletov on GitHub, I have 1 super slow query according to Azure query stats.
The one on top is the one that uses a lot of CPU.
When looking at the LINQ query and the execution plans / live stats, I can't find the bottleneck yet.
And the live stats
The join from results to project is not direct; there is a projectsession table in between, not visible in the query, but maybe under the hood of Entity Framework.
Might I be affected by parameter sniffing? Can I reset a hash? Maybe the optimized query plan was created in 2014, and now the result table is about 4 million rows and the query is far from optimal?
If I run this query in Management Studio it's very fast!
Is it just the stats that are wrong?
Regards
Vincent - The Netherlands.
I would suggest you try adding OPTION (HASH JOIN) at the end of the query, if possible. Once you start getting into large arity, a loops join is not particularly efficient. That would prove out whether there is a more efficient plan (likely yes).
Without seeing more of the details (your screenshots are helpful but cut off whether auto-param or forced parameterization has kicked in and auto-parameterized your query), it is hard to confirm/deny this explicitly. You can read more about parameter sniffing in a blog post I wrote a bit longer ago than I care to admit ;) :
https://blogs.msdn.microsoft.com/queryoptteam/2006/03/31/i-smell-a-parameter/
Ultimately, if you update stats, run DBCC FREEPROCCACHE, or otherwise cause this plan to recompile, your odds of getting a faster plan in the cache are higher if this particular query + parameter values are executed often enough to be sniffed during plan compilation. Your other option is to add an OPTIMIZE FOR UNKNOWN hint, which will disable sniffing and direct the optimizer to use an average value for the frequency of any filters over parameter values. This will likely encourage more hash or merge joins instead of loops joins, since the cardinality estimates of the operators in the tree will likely increase.
I've been investigating the use of InfluxDB for storing our metrics.
Seeing that InfluxDB does not offer a clustered (free) version, there is an 'alternative' of using influxdb-relay, which can handle both replication and 'write-sharding'.
But the relay does not handle read queries.
In Grafana you can define multiple data sources, each pointing to a specific shard, but from what I can tell it cannot combine the results from the shards into 1 data series. (I know Graphite has 'web-master', which does just this: it will query multiple graphite-web instances and combine the results prior to rendering.) Is there such a beast out there for Grafana/InfluxDB?
If not, how difficult do folks think it would be to update the influxdb-relay to accept queries, query all shards matching the query, merge the results, etc.? Yeah, I know that depending on the query, the merge actions would likely require query interpretation, which is where it gets difficult.
Thoughts?
I am trying to count the number of Postgres statements my Ruby on Rails application is performing against our database. I found this entry on Stack Overflow, but it counts transactions. We have several transactions that make very large numbers of statements, so that doesn't give a good picture. I am hoping the data is available from PG itself rather than trying to parse a log.
https://dba.stackexchange.com/questions/35940/how-many-queries-per-second-is-my-postgres-executing
I think you are looking for ActiveSupport instrumentation. Part of Rails, this framework is used throughout Rails applications to publish certain events. For example, there's an sql.active_record event type that you can subscribe to in order to count your queries.
ActiveSupport::Notifications.subscribe "sql.active_record" do |*args|
  counter += 1
end
You could put this in config/initializers/ (to count across the app) or in one of the various before_ hooks of a controller (to count statements for a single request).
(The fine print: I have not actually tested this snippet, but that's how it should work AFAIK.)
PostgreSQL provides a few facilities that will help.
The main one is pg_stat_statements, an extension you can install to collect statement statistics. I strongly recommend this extension; it's very useful. It can tell you which statements run most often, which take the longest, etc. You can query it to add up the number of queries for a given database.
To get a rate over time, you should have a script sample pg_stat_statements regularly, creating a table with the values that changed since the last sample.
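For example, each sample could be read along these lines (a sketch using plain JDBC, untested; any client works the same way; the connection string and credentials are placeholders, the extension must already be installed, and the surrounding method is assumed to declare throws SQLException):

// Total number of statement executions recorded by pg_stat_statements so far.
try (Connection conn = DriverManager.getConnection(
         "jdbc:postgresql://localhost:5432/mydb", "user", "password");
     Statement st = conn.createStatement();
     ResultSet rs = st.executeQuery("SELECT sum(calls) FROM pg_stat_statements")) {
    rs.next();
    System.out.println("Statements executed so far: " + rs.getLong(1));
}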
The pg_stat_database view tracks values including the transaction rate. It does not track the number of queries.
There are also pg_stat_user_tables, pg_stat_user_indexes, etc., which provide usage statistics for tables and indexes. These track individual index scans, sequential scans, etc. done by a query, but again not the number of queries.
Can you please share any links/sample source code for generating a graph with Neo4j from Oracle database table data?
My use case is: Oracle schema table names as nodes, and columns as properties. I also need to generate the graph in a tree structure.
Make sure you commit the transaction after creating the nodes with tx.success(), tx.finish().
If you still don't see the nodes, please post your code and/or any exceptions.
Use JDBC to extract your Oracle DB data. Then use the Java API to build the corresponding nodes:
GraphDatabaseService db; // obtained elsewhere, e.g. from a GraphDatabaseFactory
try (Transaction tx = db.beginTx()) {
    // Labels is a user-defined enum implementing org.neo4j.graphdb.Label
    Node datanode = db.createNode(Labels.TABLENAME);
    datanode.setProperty("column name", "column value"); // do this for each column
    tx.success();
}
Also remember to scale your transactions. I tend to use around 1,500 creates per transaction, and it works fine for me, but you might have to play with it a little bit.
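A rough sketch of that batching (untested; db and Labels are from the snippet above, and rows stands for whatever collection of JDBC results you are iterating over):

// Commit roughly every 1500 creates; property values must be primitives, Strings, or arrays of those.
int batchSize = 1500;
int counter = 0;
Transaction tx = db.beginTx();
try {
    for (Map<String, Object> row : rows) {
        Node node = db.createNode(Labels.TABLENAME);
        for (Map.Entry<String, Object> column : row.entrySet()) {
            node.setProperty(column.getKey(), column.getValue());
        }
        if (++counter % batchSize == 0) {
            tx.success();
            tx.close();        // tx.finish() on older Neo4j versions
            tx = db.beginTx();
        }
    }
    tx.success();
} finally {
    tx.close();
}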
Just do a SELECT * FROM table LIMIT 1000 OFFSET X*1000, with X being the number of times you've run the query before. Then keep those 1000 records stored somewhere in a collection so you can build your nodes with them. Repeat this until you've handled every record in your database.
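For example, the paging loop could look roughly like this (untested; my_table is a placeholder, connection is a plain java.sql.Connection, and the surrounding method is assumed to declare throws SQLException):

int pageSize = 1000;
for (int x = 0; ; x++) {
    List<Map<String, Object>> page = new ArrayList<>();
    try (Statement st = connection.createStatement();
         ResultSet rs = st.executeQuery(
             "SELECT * FROM my_table LIMIT " + pageSize + " OFFSET " + (x * pageSize))) {
        ResultSetMetaData meta = rs.getMetaData();
        while (rs.next()) {
            Map<String, Object> row = new HashMap<>();
            for (int i = 1; i <= meta.getColumnCount(); i++) {
                row.put(meta.getColumnName(i), rs.getObject(i));
            }
            page.add(row);
        }
    }
    if (page.isEmpty()) {
        break; // every record has been handled
    }
    // create the nodes for this page here (see the transaction sketch above)
}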
Not sure what you mean by "generate the graph in a tree structure". If you mean you'd like to convert foreign keys into relationships, remember to just index the key and, instead of adding the FK as a property, create a relationship to the original node instead. You can find it by doing an index lookup. Or you could just create your own little in-memory index with a HashMap. But since you're already storing 1000 SQL records in memory, plus you are building the transaction, you need to be a bit careful with your memory depending on your JVM settings.
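A sketch of the HashMap variant (untested; the ID and PARENT_ID columns and the RelationshipTypes enum are hypothetical names):

// Keep an in-memory map from primary key to the node created for it.
Map<Object, Node> nodesByPk = new HashMap<>();

// While creating a node for the referenced table:
nodesByPk.put(row.get("ID"), datanode);

// While creating a node for the referencing table, turn the FK into a relationship
// instead of storing it as a property. RelationshipTypes is a user-defined enum
// implementing org.neo4j.graphdb.RelationshipType.
Node parent = nodesByPk.get(row.get("PARENT_ID"));
if (parent != null) {
    datanode.createRelationshipTo(parent, RelationshipTypes.BELONGS_TO);
}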
You need to code this ETL process yourself. Follow the steps below:
Write your first Neo4j example by following this article.
Understand how to model with graphs.
There are multiple ways of talking to Neo4j using Java. Choose the one that suits your needs.