I am making my first app. I am new to both SQL and GAE. Google Cloud SQL has a tier "D0", which has an "included I/O per day" of 200k. I have an example; could you please explain how many I/Os this example would use?
Suppose I have a table in my Cloud SQL instance with 10 rows and 3 columns. The columns are "article name", "author", and "date of publishing", so there are 30 fields in total. When a user starts my app and requests the latest information, I want to send the user all 30 fields. I can do this with a single SQL query.
Is the execution of that query counted as thirty I/Os because 30 fields were transferred, or as one I/O because one SQL query was run?
Appreciate your help.
The pricing guide has this to say:
The number of I/O requests to storage made by your database instance depends on your queries, workload and data set. Cloud SQL will cache data in memory to serve your queries efficiently and to minimise the number of I/O requests.
In other words, it is neither of your two options: some queries may be served entirely from memory, generating no I/O, while others may generate many I/O requests. Optimising the database with suitable indexes will make your queries cheaper; queries that cause table scans over large tables will cost more.
In short, the same good-practice rules apply as for keeping a database fast on a local machine, except that skipping the optimisation won't just make your queries slower, it will also make them cost more.
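As a hedged illustration using the question's table (MySQL syntax; the index and identifier names are assumptions): an index on the publish date lets the "latest articles" query be answered from a handful of index and row reads instead of a table scan.
-- Hypothetical schema based on the question; an index on the publish date
-- lets MySQL find the newest rows without scanning the whole table.
CREATE INDEX idx_articles_published ON articles (date_of_publishing);

SELECT article_name, author, date_of_publishing
FROM articles
ORDER BY date_of_publishing DESC
LIMIT 10;
On a 10-row table everything fits in memory anyway, so a query like this will typically generate little or no I/O; the index only starts to matter as the table grows.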
The # of I/Os refers to disk operations. So that really depends on the query and the cached data.
I have created about 5 reports in Microsoft PowerBI using a SQL database created in Microsoft Azure.
The database has more than 50 million rows.
Recently my reports have stopped refreshing. When they do refresh, the refresh takes a long time and runs really slowly.
Here is a screenshot of the error I'm getting: [screenshot of the error message]
Yesterday I contacted Microsoft Power BI support to check whether the issue is with the software itself. I showed them my database and my reports, and they told me that the DTU usage of my SQL database is reaching a maximum of 100%, which is slowing down the database's response during the refresh and preventing it from performing well. Here is a screenshot of my database performance: [screenshot of the DTU chart]. Please note that this picture shows only the maximum of the DTUs; the average is around 50%.
I'm not an expert in Azure, and I need to know whether the DTUs can really affect the performance of pulling the data from the database into Power BI.
Yes, the DTUs will affect query performance for an Azure SQL database, and of course that affects the performance of pulling data from the database into Power BI.
Reference: Database transaction units (DTUs):
A database transaction unit (DTU) represents a blended measure of CPU, memory, reads, and writes.
For a single database at a specific compute size within a service tier, Microsoft guarantees a certain level of resources for that database (independent of any other database in the Azure cloud). This guarantee provides a predictable level of performance. The amount of resources allocated for a database is calculated as a number of DTUs and is a bundled measure of compute, storage, and I/O resources.
The ratio among these resources is originally determined by an online transaction processing (OLTP) benchmark workload designed to be typical of real-world OLTP workloads. When your workload exceeds the amount of any of these resources, your throughput is throttled, resulting in slower performance and time-outs.
When the DTUs are hitting a maximum of 100%, it means the database has reached its resource limits.
You need to scale up the Azure SQL Database service tier or do some performance tuning.
For more details, please see: Monitoring and performance tuning. Azure SQL Database provides tools and methods you can use to monitor usage easily, add or remove resources (such as CPU, memory, or I/O), troubleshoot potential problems, and make recommendations to improve the performance of a database.
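As a hedged sketch, you can check the recent DTU components yourself from inside the database, and scale up if needed; the database name and target service objective below are only examples:
-- Recent resource usage for this database, one row per 15-second interval.
SELECT TOP (20)
       end_time,
       avg_cpu_percent,
       avg_data_io_percent,
       avg_log_write_percent,
       avg_memory_usage_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;

-- Scale the database up, e.g. to Standard S3 (example database name and tier).
ALTER DATABASE [MyDatabase] MODIFY (SERVICE_OBJECTIVE = 'S3');
If avg_data_io_percent is the component that is pegged at 100%, query and index tuning will usually help more than simply adding CPU.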
Hope this helps.
Perf testing a tool. We generate a bunch of metrics per test run, but we want to keep each test run separate. A DB per run seems like it would allow us to do that, and at the same time allow us to give the tools we create to customers, who would only have one install and thus need only one DB. But we are talking hundreds of DBs... granted, they should be smaller, as most would only hold a set of metrics covering a couple of hours. Will InfluxDB limit us, or will performance suffer significantly?
Memory/virtual memory appears to be the limiting factor.
On two 16 GB boxes I added 500 DBs with data on one, and 500 sets of data to the same DB on the other. The data was pretty small, actually; the individual DBs were 440K after being loaded.
Memory use with 500 DBs was way higher:
500 DBs: 19.9 GB virtual, 3.3 GB resident
1 DB: 2.5 GB virtual, 0.9 GB resident
I have an MVC web site, where users can search for large recordsets from SQL Server and Oracle databases. Some of these recordsets can be very large, with many thousands of records. Sadly, it is a user requirement that they do not make their searches more specific.
When a user posts their search request to the database, my web page is hanging before often timing out (due to the amount of time taken to query the database).
We are thinking about removing the expensive database calls from the MVC site, and sending the query to a separate process to run in the background. When the query is complete, we can notify the user.
My proposed solution is:
1) When the user completes the search form in the web page, to simply display a message that the results are being generated and will be sent when complete
2) Write the SQL query to a database table that holds a list of queries waiting to be processed (a possible table layout is sketched after this list)
3) Create a Windows Service which checks this database every couple of minutes for new queries
4) This Windows Service then queries the database. When the query is completed, it will create a CSV of the results, and email this to the user
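A minimal sketch of what the queue table and the service's "claim the next request" query might look like on SQL Server; all names, and the locking hints, are assumptions rather than a full design:
-- Queue table the web app inserts into and the Windows Service polls.
CREATE TABLE QueryQueue (
    Id          INT IDENTITY(1,1) PRIMARY KEY,
    UserEmail   NVARCHAR(256) NOT NULL,
    QueryText   NVARCHAR(MAX) NOT NULL,
    Status      VARCHAR(20)  NOT NULL DEFAULT 'Pending',  -- Pending / Running / Done / Failed
    SubmittedAt DATETIME2    NOT NULL DEFAULT SYSUTCDATETIME(),
    CompletedAt DATETIME2    NULL
);

-- The service claims the oldest pending request; UPDLOCK/READPAST stops two
-- service instances from picking up the same row.
WITH next_request AS (
    SELECT TOP (1) *
    FROM QueryQueue WITH (UPDLOCK, READPAST)
    WHERE Status = 'Pending'
    ORDER BY SubmittedAt
)
UPDATE next_request
SET Status = 'Running'
OUTPUT inserted.Id, inserted.QueryText, inserted.UserEmail;
The Windows Service would then run the claimed QueryText, write the CSV, email it, and set Status to 'Done' (or 'Failed') with CompletedAt filled in.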
I am looking for some advice and comments on the above approach. What do folks think of this as a way to process expensive database calls in the background?
Generally speaking the requests will be made infrequently, but as mentioned, will be for a great amount of data. There is a chance that two or more requests could be made at the same time, but this will be infrequent.
I will also look at optimising the databases.
Grateful for any tips.
Martin :)
Another option is to supplement the existing code to execute the query on a separate thread, so that periodic keep-alive updates can be sent to the requesting page while you wait for the query results. This is similar to the way insurance quote aggregator pages work.
A second option is to make the results available as a hyperlink when they are ready and then communicate that either through the website or by email to the user.
Option three: if these queries are not completely ad hoc, you could profile for the most frequent parameter combinations and pre-compute them periodically, placing the results into new tables (sort of halfway to optimising the current database structure).
The caveat there is that the data won't be as up to date - but given the time the queries are currently taking it probably isn't that important to be up to the second?
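A hedged sketch of that pre-computation, assuming SQL Server and made-up names; a scheduled job rebuilds the summary table, and the website queries the small table instead of the large ones:
-- Rebuild a small summary table from the expensive aggregate.
IF OBJECT_ID('dbo.OrderSummary', 'U') IS NOT NULL
    DROP TABLE dbo.OrderSummary;

SELECT CustomerId,
       ProductCategory,
       COUNT(*)        AS OrderCount,
       SUM(OrderTotal) AS TotalSpend
INTO dbo.OrderSummary              -- on Oracle, use CREATE TABLE ... AS SELECT instead
FROM dbo.Orders
GROUP BY CustomerId, ProductCategory;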
Whichever solution you choose, I think it's going to depend on user expectations. Do they know what they want, send one big query, get the results, and end up happy? Or do they try several queries to find the right combination of parameters? If the latter, waiting for an email delivery of results might not be acceptable to them; but if what they want is a downloadable results document and they know what they want first time, it may be. The only problem I see here is emails going astray, or taking longer than the user thinks they should, causing the request to be resubmitted multiple times and increasing the server workload; caching queries and results is probably a very good idea.
I would suggest introducing a layer of abstraction such as a messaging broker. The request goes onto a queue, a batch layer consumes the request from the queue, and once the heavy work is done the batch layer notifies the web layer again via the messaging broker (the Request-Reply pattern).
In addition, on the database side it is always good to optimize queries.
I am working in the testing space on data warehousing. In scope I have newly created dimensions and facts which need to be validated. Based on my knowledge and the information I found while browsing, I would cover the following:
Schema validation of Facts and Dimension tables as per spec
Data duplicate check for Facts and Dimension table
Look-up validation for dimension table
Is there anything else that I can verify here?
In addition, I am just curious how I can check whether data is correctly populated into the fact table: row counts, correct surrogate keys, etc. From the developers' point of view, are they using DML scripts to load the data?
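For context on that last question: a typical fact load is indeed a DML INSERT ... SELECT that resolves surrogate keys by joining the staging feed to the dimension tables on their business keys. A hedged sketch with made-up names:
-- Staging rows carry business keys; the joins swap them for surrogate keys.
INSERT INTO fact_sales (date_key, product_key, customer_key, quantity, amount)
SELECT d.date_key,
       p.product_key,
       c.customer_key,
       s.quantity,
       s.amount
FROM stg_sales s
JOIN dim_date     d ON d.calendar_date = s.sale_date
JOIN dim_product  p ON p.product_code  = s.product_code
JOIN dim_customer c ON c.customer_code = s.customer_code;
Testing can then reuse the same joins: any staging row that cannot find its dimension row, or that lands on a default "unknown" key, is something to flag.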
Testing the Database
The database is tested in the following three ways:
1. Testing the database manager and monitoring tools - To test the database manager and the monitoring tools, they should be used in the creation, running, and management of a test database.
2. Testing database features - Here is the list of features that we have to test:
- Querying in parallel
- Create index in parallel
- Data load in parallel
3. Testing database performance - Query execution plays a very important role in data warehouse performance measures. There are sets of fixed queries that need to be run regularly, and they should be tested. To test ad hoc queries, one should go through the user requirement document and understand the business completely. Take time to test the most awkward queries that the business is likely to ask against different index and aggregation strategies.
http://www.tutorialspoint.com/dwh/dwh_testing.htm
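For the "Data load in parallel" item in the quoted list, a hedged Oracle sketch of a parallel, direct-path load from a staging table (table names and the degree are assumptions):
-- Parallel DML must be enabled for the session before a parallel insert.
ALTER SESSION ENABLE PARALLEL DML;

INSERT /*+ APPEND PARALLEL(fact, 4) */ INTO fact
SELECT /*+ PARALLEL(stg, 4) */ *
FROM stg_fact stg;

COMMIT;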
You can also use ETL (Extract, Transform, Load) testing.
ETL Testing Techniques:
1) Verify that data is transformed correctly according to the various business requirements and rules (a sample check is sketched below).
2) Make sure that all projected data is loaded into the data warehouse without any data loss or truncation.
3) Make sure that the ETL application appropriately rejects invalid data, replaces it with default values, and reports it.
4) Make sure that data is loaded into the data warehouse within the prescribed and expected time frames, to confirm improved performance and scalability.
Apart from these four main ETL testing methods, other testing methods such as integration testing and user acceptance testing are also carried out to make sure everything is smooth and reliable.
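For point 1, a transformation rule can often be checked with a query that searches for rows violating it. A hedged example, assuming a rule that net amount equals gross amount minus discount, with made-up table and column names:
-- Any rows returned here break the documented transformation rule.
SELECT *
FROM fact_sales
WHERE net_amount <> gross_amount - discount_amount;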
You can also test the schedule, backup and recovery, the operational environment, the application, and the logistics of the test.
For more information about ETL Testing / Data Warehouse Testing please visit http://www.softwaretestinghelp.com/etl-testing-data-warehouse-testing/
UPD:
Creating Indexes in Parallel
Indexes on the fact table can be partitioned or non-partitioned. Local partitioned indexes provide the simplest administration. The only disadvantage is that a search of a local non-prefixed index requires searching all index partitions.
The considerations for creating index tablespaces are similar to those for creating other tablespaces. Operating system striping with a small stripe width is often a good choice, but to simplify administration it is best to use a separate tablespace for each index. If it is a local index you may want to place it into the same tablespace as the partition to which it corresponds. If each partition is striped over a number of disks, the individual index partitions can be rebuilt in parallel for recovery. Alternatively, operating system mirroring can be used. For these reasons the NOLOGGING option of the index creation statement may be attractive for a data warehouse.
Tablespaces for partitioned indexes should be created in parallel in the same manner as tablespaces for partitioned tables.
Partitioned indexes are created in parallel using partition granules, so the maximum DOP possible is the number of granules. Local index creation has less inherent parallelism than global index creation, and so may run faster if a higher DOP is used. The following statement could be used to create a local index on the fact table:
-- Local partitioned index on the fact table, one index partition per
-- tablespace, built in parallel with a degree of 12 and NOLOGGING.
CREATE INDEX I ON fact (dim_1, dim_2, dim_3) LOCAL
  (PARTITION jan95 TABLESPACE Tsidx1,
   PARTITION feb95 TABLESPACE Tsidx2,
   ...)
  PARALLEL (DEGREE 12) NOLOGGING;
To backup or restore January data, you only need to manage tablespace Tsidx1.
Parallel Query Tuning
The parallel query feature is useful for queries that access a large amount of data by way of large table scans, large joins, the creation of large indexes, bulk loads, aggregation, or copying. It benefits systems with all of the following characteristics:
- symmetric multiprocessors (SMP), clusters, or massively parallel systems
- high I/O bandwidth (that is, many datafiles on many different disk drives)
- underutilized or intermittently used CPUs (for example, systems where CPU usage is typically less than 30%)
- sufficient memory to support additional memory-intensive processes such as sorts, hashing, and I/O buffers
If any one of these conditions is not true for your system, the parallel query feature may not significantly help performance. In fact, on over-utilized systems or systems with small I/O bandwidth, the parallel query feature can impede system performance.
Here you can read about this in more detail: http://docs.oracle.com/cd/A57673_01/DOC/server/doc/A48506/pqo.htm#1559
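As a hedged illustration of what invoking the parallel query feature can look like (Oracle hint syntax; table names, alias, and degree are assumptions):
-- Ask the optimizer to scan and aggregate the fact table with 8 parallel slaves.
SELECT /*+ PARALLEL(f, 8) */
       d.region,
       SUM(f.amount) AS total_amount
FROM fact f
JOIN dim_region d ON d.region_key = f.region_key
GROUP BY d.region;
The hint only helps on the kind of under-utilized, high-bandwidth systems the quote describes.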
I hope these resources will be helpful for you:
https://wiki.postgresql.org/wiki/Parallel_Query_Execution
https://technet.microsoft.com/en-us/library/ms178065%28v=sql.105%29.aspx
http://www.csee.umbc.edu/portal/help/oracle8/server.815/a67775/ch24_pex.htm#1978
I am an ETL tester.
For data validation and data quality testing in a data warehouse, follow the checks below:
1) Metadata testing - testing the structure of the underlying tables (as per the design document).
2) Data validation - in data validation you test the mapping transformations using SQL and PL/SQL. We generally test this using source and target table counts, Source minus Target, Source intersect Target, and Target minus Source (see the SQL sketch after this list).
3) Duplicate check - to ensure there is no redundancy in the data warehouse.
4) Loading strategy check - to check whether your target table is a slowly changing dimension (SCD) or delete-and-reload (depends on requirements).
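A hedged sketch of the checks in points 2 and 3, with made-up schema and table names (MINUS is Oracle syntax; use EXCEPT on SQL Server):
-- Row counts should match, or differ only by intentionally rejected rows.
SELECT COUNT(*) AS source_rows FROM src.orders;
SELECT COUNT(*) AS target_rows FROM dwh.fact_orders;

-- "Source minus Target": business keys present in the source but missing
-- from the warehouse.
SELECT order_id FROM src.orders
MINUS
SELECT order_id FROM dwh.fact_orders;

-- Duplicate check: no business key should appear more than once.
SELECT order_id, COUNT(*) AS cnt
FROM dwh.fact_orders
GROUP BY order_id
HAVING COUNT(*) > 1;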
I'm developing a polling application that will deal with an average of 1000-2000 votes per second coming from different users. In other words, it'll receive 1k to 2k requests per second with each request making a DB insert into the table that stores the voting data.
I'm using RoR 4 with MySQL and planning to push it to Heroku or AWS.
What performance issues related to database and the application itself should I be aware of?
How can I address this amount of inserts per second into the database?
EDIT
I was thinking of not inserting into the DB for each request, but instead writing the insert data to an in-memory buffer. A scheduled job running every second would then read from this buffer and generate a bulk insert, so that each vote does not have to be inserted individually. But I cannot think of a nice way to implement this.
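Whatever buffers the votes (an in-process array, a queue, or Redis), the flush job would then issue one multi-row insert per second instead of 1,000-2,000 single-row statements. A hedged MySQL sketch with made-up table and column names:
-- One round trip and one transaction for the whole batch.
INSERT INTO votes (poll_id, option_id, user_id, created_at)
VALUES
  (42, 1, 1001, NOW()),
  (42, 3, 1002, NOW()),
  (42, 1, 1003, NOW());
The trade-off is that votes buffered inside the Rails process can be lost if it dies before the flush, which is one reason to buffer in an external store such as Redis instead.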
While you can certainly do what you need to do in AWS, that high level of I/O will probably cost you. RDS can support up to 30,000 IOPS; you can also use multiple EBS volumes in different configurations to support high IO if you want to run the database yourself.
Depending on your planned usage patterns, I would probably look at pushing into an in-memory data store, something like memcached or redis, and then processing the requests from there. You could also look at DynamoDB, which might work depending on how your data is structured.
Are you going to have that level of sustained throughput consistently, or will it be in bursts? Do you absolutely have to preserve every single vote, or do you just need summary data? How much will you need to scale - i.e. will you ever get to 20,000 votes per second? 200,000?
These types of questions will help determine the proper architecture.