sql azure batch size - entity-framework-4

We have some employee segmentation tasks that produce a large number of records (about 2,000) that need to be inserted into SQL Azure. The records themselves are very small, about 4 integers each. An Azure worker role performs the segmentation task and inserts the resulting rows into a SQL Azure table. There might be multiple such tasks (each with about 1,000 to 2,000 rows) in the queue, so each of these inserts needs to be fast.
Timing tests from a local machine to SQL Azure took significant time (approximately 2 minutes for 1,000 inserts). This might be caused by network latency. I am assuming the inserts from the worker role should be much faster.
However, since Entity Framework does not batch inserts well, we were thinking about using SqlBulkCopy. Would using SqlBulkCopy result in the queries being throttled if the batch size is, say, 1,000? Is there a recommended approach?

The Bulk Copy API should serve your purposes perfectly and result in very dramatic performance improvements.
I have tested inserting 10 million records with a batch size of 2000 into an Azure database, and no throttling occurred. Performance was about 10 seconds per batch when running from my local machine.
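To make the idea of a fixed batch size concrete, here is a minimal Python sketch that splits rows into batches before each batch is handed to a bulk-insert call. The `batches` helper and the sample rows are hypothetical, and the insert itself is left out. In .NET, SqlBulkCopy exposes a BatchSize property that does this chunking for you.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[int, int, int, int]  # each record is about 4 integers


def batches(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield consecutive lists of at most `size` rows."""
    if size <= 0:
        raise ValueError("size must be positive")
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Hypothetical data: 4,500 small rows, sent in batches of 2,000.
rows = [(i, i + 1, i + 2, i + 3) for i in range(4500)]
batch_sizes = [len(b) for b in batches(rows, 2000)]
```

Each batch costs one round trip instead of one per row, which is where most of the gain over row-by-row inserts comes from when network latency dominates.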

Related

SQL database DTUs affecting Power BI report refresh

I have created about 5 reports in Microsoft Power BI using a SQL database hosted in Microsoft Azure.
The database has more than 50 million rows.
Recently my reports have stopped refreshing. When they do refresh, it takes a long time and runs very slowly.
Here is a screenshot of the error I'm getting (image not included).
Yesterday I contacted Microsoft Power BI support to check whether the issue is in the software itself. I showed them my database and my reports, and they told me that the DTU usage of my SQL database is peaking at 100%, which slows down the database's responses during the refresh and prevents it from performing well. Here is a screenshot of my database performance (image not included). Please note that this picture shows only the maximum DTU usage; the average is about 50%.
I'm not an expert in Azure, and I need to know whether DTUs can really affect the performance of loading data from the database into Power BI.
Yes, DTUs affect query performance for Azure SQL Database, so of course they will affect the performance of loading data from the database into Power BI.
Reference Database transaction units (DTUs):
A database transaction unit (DTU) represents a blended measure of CPU, memory, reads, and writes.
For a single database at a specific compute size within a service tier, Microsoft guarantees a certain level of resources for that database (independent of any other database in the Azure cloud). This guarantee provides a predictable level of performance. The amount of resources allocated for a database is calculated as a number of DTUs and is a bundled measure of compute, storage, and I/O resources.
The ratio among these resources is originally determined by an online transaction processing (OLTP) benchmark workload designed to be typical of real-world OLTP workloads. When your workload exceeds the amount of any of these resources, your throughput is throttled, resulting in slower performance and time-outs.
When DTU usage reaches 100%, the database has hit its resource limits.
You need to either scale up the Azure SQL Database service tier or do some performance tuning.
For more details, please see: Monitoring and performance tuning. Azure SQL Database provides tools and methods you can use to monitor usage easily, add or remove resources (such as CPU, memory, or I/O), troubleshoot potential problems, and make recommendations to improve the performance of a database.
Hope this helps.

Does InfluxDB have a limit to the number of databases you can have?

We are perf testing a tool. We generate a bunch of metrics per test run, but we want to keep each test run separate. A db per run seems like it would allow us to do that, and at the same time let us give the tools we create to customers who would only have one install and thus need only one db. But we are talking about hundreds of dbs. Granted, they should be small, as most would only hold a set of metrics covering a couple of hours. Will InfluxDB limit us? Or will performance suffer significantly?
It looks like memory/virtual memory is the limiting factor.
On two 16 GB boxes, I added 500 dbs with data on one, and 500 sets of data to a single db on the other. The data was pretty small: the individual dbs were 440K after being loaded.
Memory use with 500 dbs was much higher:
500 dbs: 19.9g virtual, 3.3g resident
1 db: 2.5g virtual, 0.9g resident
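As a rough back-of-the-envelope reading of those two measurements, the difference can be divided across the 499 additional databases to estimate per-database overhead. This is only a sketch: it assumes decimal units (1 GB = 1000 MB) and that the overhead grows linearly, which a single pair of data points cannot confirm.

```python
# Figures reported above, in GB; decimal units assumed for simplicity.
many_dbs = {"count": 500, "virtual_gb": 19.9, "resident_gb": 3.3}
one_db = {"count": 1, "virtual_gb": 2.5, "resident_gb": 0.9}

extra_dbs = many_dbs["count"] - one_db["count"]  # 499 additional databases


def per_db_overhead_mb(key: str) -> float:
    """Extra memory per additional database, in MB (1 GB = 1000 MB)."""
    return (many_dbs[key] - one_db[key]) * 1000 / extra_dbs


virtual_mb = per_db_overhead_mb("virtual_gb")
resident_mb = per_db_overhead_mb("resident_gb")
print(f"~{virtual_mb:.0f} MB virtual, ~{resident_mb:.1f} MB resident per extra db")
```

That works out to roughly 35 MB of virtual memory and about 5 MB of resident memory per additional database, for databases that are only 440K on disk.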

Dataflow control high fanout between steps

I have 3 steps in a Dataflow pipeline:
1. Reads from Pub/Sub, saves to a table, and splits into multiple events (emitted via context output).
2. For each split event, queries the db and decorates the event with additional data.
3. Publishes to another Pub/Sub topic for further processing.
PROBLEM:
After step 1, it's splitting into 10K to 20K events.
Now step 2 is running out of database connections (I have a static Hikari connection pool).
It works absolutely fine with less data. I am using an n1-standard-32 machine.
What should I do to limit the input to the next step, so that the parallelism is restricted or events are throttled on their way to the next step?
I think the basic idea is to reduce parallelism when executing step 2. (With massive parallelism, you will need 20k connections for 20k events, because the 20k events are processed in parallel.)
Ideas include:
Stateful ParDo: execution is serialized per key per window, which means only one connection is needed for a stateful ParDo, because only one element is processed at a given time for a given key and window.
One connection per bundle: you can initialize a connection at startBundle and have all elements within the same bundle use that connection (if my understanding is correct, execution within a bundle is likely serialized).
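The one-connection-per-bundle idea can be sketched in plain Python. This is not Beam code: `FakeConnection`, `DecorateEvents`, and `run_bundle` are hypothetical stand-ins, and the bundle size of 1,000 is made up. In a real pipeline the runner decides bundle boundaries and calls the start/finish hooks itself.

```python
from typing import Dict, Iterable, List, Optional


class FakeConnection:
    """Stand-in for a real database connection; counts how many were opened."""
    opened = 0

    def __init__(self) -> None:
        FakeConnection.opened += 1

    def lookup(self, event_id: int) -> str:
        return f"extra-data-for-{event_id}"

    def close(self) -> None:
        pass


class DecorateEvents:
    """Mimics a per-bundle lifecycle: open once, reuse, close."""

    def __init__(self) -> None:
        self.conn: Optional[FakeConnection] = None

    def start_bundle(self) -> None:
        self.conn = FakeConnection()

    def process(self, event_id: int) -> Dict[str, object]:
        assert self.conn is not None, "start_bundle must run first"
        return {"id": event_id, "extra": self.conn.lookup(event_id)}

    def finish_bundle(self) -> None:
        if self.conn is not None:
            self.conn.close()
            self.conn = None


def run_bundle(fn: DecorateEvents, bundle: Iterable[int]) -> List[Dict[str, object]]:
    """Hypothetical driver playing the role of the runner for one bundle."""
    fn.start_bundle()
    try:
        return [fn.process(e) for e in bundle]
    finally:
        fn.finish_bundle()


# 10,000 events split into 10 bundles of 1,000 (bundle size is made up).
events = list(range(10_000))
bundles = [events[i:i + 1000] for i in range(0, len(events), 1000)]
fn = DecorateEvents()
decorated = [row for b in bundles for row in run_bundle(fn, b)]
```

With this shape, the number of connections opened tracks the number of bundles rather than the number of events. In the Beam Java SDK, the corresponding hooks are methods annotated with @StartBundle and @FinishBundle on the DoFn.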

Slow Query Performance

I am running some very large databases (500 MB and 300 MB) in my application on several different machines.
From a hardware perspective, the machines have been identically configured.
I am using SQL Server CE 4.0 as my DBMS.
The performance critical query has been indexed to improve its performance.
The problem is that on [only] one of the machines, I am observing egregiously slow query performance. This usually happens after a long period of inactivity (from a query perspective). After I run several (about 7-8) queries, the slow performance disappears.
The weird thing is that this initial slow query performance does not happen on the other machine.
The only difference between the two machines is the data contained inside the databases.
I suspect that the distribution of data on the slow machine is somehow reducing the effectiveness of the indexing and that SQL Server CE has to rebalance the indexing in a much more significant way than on the other faster machine.
One thing I notice is that when the query is very slow, the disk activity increases significantly and the process corresponding to reading the database shows a spike in the read bytes.
This does not happen on the other machine.
Does anyone know how I might go about root causing this issue?
My code is written in C++ and uses the ATL/OLEDB API to manipulate the database.
UPDATE: My performance profiling activities indicate that it's not the query itself that is slow - it is the processing of the returned rowset that takes a while. For each row returned, I query another database for related data. I understand that this is not the right way to do it but the performance problem only happens on one machine. One thing I noticed is that when I have other unrelated queries happening on the same database in other threads, the unrelated queries will stall the query that is exhibiting the performance problem.

how is one I/O measured in google app engine

I am making my first app. I am new to both SQL and GAE. Google Cloud SQL has a tier "D0", which has "included I/O per day" of 200k. I have an example; could you please explain how many I/Os this example uses?
Suppose I have a table in my Cloud SQL with 10 rows and 3 columns. The column headers are "article name", "author", and "date of publishing", so there are 30 fields in total. When a user starts my app and requests the latest information, I want to send the user all 30 fields. I can fetch this for the user with a single SQL query.
Is the execution of that query counted as thirty I/O because 30 fields were transferred or one I/O because one SQL query was run?
Appreciate your help.
The pricing guide has this to say:
The number of I/O requests to storage made by your database instance depends on your queries, workload and data set. Cloud SQL will cache data in memory to serve your queries efficiently and to minimise the number of I/O requests.
In other words, it is neither of the two options: some queries may be served entirely from memory, generating no I/O, while others may generate many I/O requests. Optimising the database well with indexes will make your queries cheaper; table scans over large tables will cost more.
In short, the same good-practice rules apply as for keeping a database fast on a local machine, but skipping the optimisation won't just make your queries slower, it will also make them cost more.
The # of I/Os refers to disk operations. So that really depends on the query and the cached data.
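A toy model may help show why the count depends on caching rather than on fields or queries. The sketch below is an assumption-laden illustration, not how Cloud SQL actually accounts for I/O: `PageCache` is a hypothetical LRU cache, pages are just integers, and the 10-row table is assumed to fit in a single page.

```python
from collections import OrderedDict
from typing import Iterable


class PageCache:
    """Tiny LRU page cache that counts reads reaching 'disk'."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.pages: "OrderedDict[int, bool]" = OrderedDict()
        self.disk_reads = 0

    def read(self, page: int) -> None:
        if page in self.pages:
            self.pages.move_to_end(page)    # cache hit: no disk I/O
            return
        self.disk_reads += 1                # cache miss: one disk read
        self.pages[page] = True
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict least recently used page


def run_query(cache: PageCache, pages: Iterable[int]) -> int:
    """Return how many disk reads this query caused."""
    before = cache.disk_reads
    for p in pages:
        cache.read(p)
    return cache.disk_reads - before


cache = PageCache(capacity=100)
small_table = [0]                            # 10-row table in one page (assumed)
first = run_query(cache, small_table)        # cold cache
second = run_query(cache, small_table)       # same query again, now cached
scan = run_query(cache, range(1000, 1500))   # scan of a table larger than the cache
```

In this model the same query costs one read the first time and none the second time, while a scan over a table larger than the cache costs one read per page, regardless of how many fields each row has.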
