We're trying to reload the TFS 2010 SSAS cube, but when the warehouse is processing, we get an exception in the log. It is important to note that the cube does not fail completely, but loads incompletely. For example, we have data up to June 2011, but not beyond.
Microsoft.TeamFoundation.Server.WarehouseException: OLE DB error: OLE
DB or ODBC error: Snapshot isolation transaction failed in database
'Tfs_Warehouse' because the object accessed by the statement has been
modified by a DDL statement in another concurrent transaction since
the start of this transaction. It is disallowed because the metadata
is not versioned. A concurrent update to metadata can lead to
inconsistency if mixed with snapshot isolation.; 42000
This is our future production system, and contains data migrated over from a TFS 2008 system. The database size of the version control repository is close to 200GB, so we're dealing with a relatively large instance of TFS.
We could remove snapshot isolation from our warehouse, but I'm a little concerned about doing this, as I can't find anything that tells me whether snapshot isolation is required on the TFS_Warehouse database. Any insight would be appreciated.
From this source (see the TempDB and RCSI section), it would seem that removing snapshot isolation could be a big mistake.
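If you want to confirm what the warehouse database is actually using before deciding anything, a minimal check (assuming you can query the SQL Server instance hosting Tfs_Warehouse) is to read the snapshot settings from sys.databases:

SELECT name,
       snapshot_isolation_state_desc,   -- ALLOW_SNAPSHOT_ISOLATION state
       is_read_committed_snapshot_on    -- read committed snapshot (RCSI) flag
FROM sys.databases
WHERE name = 'Tfs_Warehouse';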
Here are some other options, ordered from easiest to most difficult from an implementation standpoint...
increase the size of TempDB to accommodate longer-running SELECTs (a sizing sketch follows this list)
decrease the partition size of the measure groups in the cube. You may want to run a Profiler trace during SSAS processing first to determine which measure groups are taking the longest and chop those partitions up first
implement an incremental processing strategy
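For the TempDB option, a minimal sketch of pre-growing the primary data file; the logical file name (tempdev) and the sizes are placeholders you should check against sys.master_files before running anything like this:

-- Hypothetical sizes; verify the logical file name in sys.master_files first
ALTER DATABASE tempdb
MODIFY FILE (NAME = tempdev, SIZE = 20GB, FILEGROWTH = 1GB);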
Here's a link providing more information on cube-partitioning...
I have created about 5 reports in Microsoft PowerBI using a SQL database created in Microsoft Azure.
The database has more than 50 million rows.
Recently my reports have stopped refreshing. When they do refresh, it takes a long time and runs very slowly.
Here is a screenshot of the error I'm getting: [screenshot not included]
Yesterday I contacted Microsoft Power BI support to check whether the issue is with the software itself. I showed them my database and my reports, and they told me that the DTU usage in my SQL database is reaching a maximum of 100%, which is slowing down the response of the database during the refresh and preventing it from performing well. Here is a screenshot of my database performance: [screenshot not included]. Please note that this picture shows only the maximum of the DTUs; the average is around 50%.
I'm not an expert in Azure, and I need to know whether the DTUs can really affect the performance of pulling the data from the database into Power BI.
Yes, the DTUs will affect query performance for an Azure SQL database, and of course they affect the performance of pulling data from the database into Power BI.
Reference: Database transaction units (DTUs):
A database transaction unit (DTU) represents a blended measure of CPU, memory, reads, and writes.
For a single database at a specific compute size within a service tier, Microsoft guarantees a certain level of resources for that database (independent of any other database in the Azure cloud). This guarantee provides a predictable level of performance. The amount of resources allocated for a database is calculated as a number of DTUs and is a bundled measure of compute, storage, and I/O resources.
The ratio among these resources is originally determined by an online transaction processing (OLTP) benchmark workload designed to be typical of real-world OLTP workloads. When your workload exceeds the amount of any of these resources, your throughput is throttled, resulting in slower performance and time-outs.
When the DTUs reach a maximum of 100%, it means the database has hit its resource limits.
You need to scale up the Azure SQL Database pricing tier or do some performance tuning.
For more details, please see: Monitoring and performance tuning. Azure SQL Database provides tools and methods you can use to monitor usage easily, add or remove resources (such as CPU, memory, or I/O), troubleshoot potential problems, and make recommendations to improve the performance of a database.
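If you want to see which resource is actually driving the DTU percentage, and optionally scale the tier from T-SQL, something along these lines should work (the service objective below is just an example, not a recommendation for your workload):

-- Recent resource usage for this database (15-second granularity, roughly the last hour)
SELECT end_time, avg_cpu_percent, avg_data_io_percent,
       avg_log_write_percent, avg_memory_usage_percent
FROM sys.dm_db_resource_stats
ORDER BY end_time DESC;

-- Example only: move the database to a higher tier (here Standard S3)
ALTER DATABASE [YourDatabase]
MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S3');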
Hope this helps.
I have a server with 48 GB of memory and SQL Server Analysis Services 2016 Standard Edition SP1 CU7 (Tabular mode) installed on it.
I can deploy a tabular model from visual studio.
I can manually run an XMLA script:
{
  "refresh": {
    "type": "full",
    "objects": [
      {
        "database": "MyCube"
      }
    ]
  }
}
But when I run that script from a SQL Agent job, I get this error:
the JSON DDL request failed with the following error: Failed to execute XMLA. Error returned: 'There's not enough memory to complete this operation. Please try again later when there may be more memory available.'.. at Microsoft.AnalysisServices.Xmla.XmlaClient.CheckForSoapFault
The memory before processing is about 4 GB; it increases while processing the cube, but when it hits about 18.5 GB, it fails.
Does anybody know a solution?
Analysis Services Tabular instances in SQL Server 2016 are limited to 16GB of RAM as documented here if you are running Standard Edition. Enterprise Edition removes that cap.
When you do a full process, you keep a working copy of the cube and in the background you process a shadow copy. When the shadow copy is ready, it replaces the working copy. Basically this means that at processing time you need twice as much memory as the size of your cube. This can be an issue when you have the 16 GB limitation per instance with SSAS Standard Edition.
One solution is to do a process with clearValues first (this empties the cube) and then do the full process; a sketch follows below. More details here http://byobi.com/2016/12/how-much-ram-do-i-need-for-my-ssas-tabular-server/
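A rough sketch of that first step, using the same TMSL shape as the script in the question (run it as a separate step before the full refresh):

{
  "refresh": {
    "type": "clearValues",
    "objects": [
      {
        "database": "MyCube"
      }
    ]
  }
}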
Or another one is to play with the Memory \ VertiPaqPagingPolicy settings of the SSAS server. See more details here https://www.jamesserra.com/archive/2012/05/what-happens-when-a-ssas-tabular-model-exceeds-memory/ and here https://www.sqlbi.com/articles/memory-settings-in-tabular-instances-of-analysis-services/
And of course another solution is to upgrade to Enterprise Edition.
To follow up on Greg's comments: I am facing a similar issue at work, and the workaround was to do table-level refreshes instead of a database refresh. I created 2 SQL Agent jobs. My tabular model had 40 tables, so based on the sizes of the tables, I refresh some of the tables in one job and the rest in the other job. You can create more than 2 SQL jobs with fewer tables per job if you wish. This puts less load on memory (a table-level refresh sketch follows below).
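For reference, a table-level refresh keeps the same TMSL shape as the script in the question and just targets tables instead of the whole database; the table names here are placeholders for your own tables:

{
  "refresh": {
    "type": "full",
    "objects": [
      { "database": "MyCube", "table": "FactSales" },
      { "database": "MyCube", "table": "DimDate" }
    ]
  }
}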
You can process small subsets of your data by partitioning your tables; this can be handled in SSMS. This article provides a nice overview of how to achieve it.
I am planning to configure some sort of 2-node replication for neo4j, similar to MySQL replication. Since I am a little constrained on resources, I don't want to pay for more than two cloud compute instances. Also, I am happy with just one real-time or near real-time copy of the neo4j database. So the approaches I can think of are:
Configure HA on the two compute nodes with the help of an arbiter instance. Set up one neo4j instance (master) on the first node, and another neo4j instance (slave) plus a third neo4j instance (arbiter, only for arbitration, no data logging) on the second node.
OR
Set up a cron job for online backups using the neo4j-backup tool, with incremental backups every hour or so. I'm not sure about the load it may put on the prod server; I'm planning to test that out.
I am more inclined toward the first approach since I get a more real-time copy of the database (I also get HA/load balancing with instant failover, but that is not a priority right now).
Please let me know:
which of the two approaches is better,
if there is another way to achieve the same, or
if any of the above approaches is unsuitable or has flaws.
I am a little new to Neo4j HA so please pardon me for my ignorance. Thanks !
So. You already mentioned available solutions.
TL;DR: I prefer the first option.
Cluster
In general, the recommended layout is 3 nodes (2 slaves + 1 master).
But your layout - 2 nodes (1 master + 1 slave + 1 arbiter) - is viable too, especially if one server can handle your workload.
Good things:
Almost "real-time" replica.
Possibility to utilise resources to handle bigger workload.
Better availability.
Notes:
If you have a 10 MB/sec write load on the master, then the same load will be applied to the slave node. This shouldn't affect reads from the slave at all (unless the write load is REALLY huge).
Maintenance costs are bigger than for a single-instance installation. You should plan how to handle cluster upgrades, configuration updates, and plugin updates.
Branched data. In a clustered environment there is a possibility of ending up in a "split-brain" scenario, when 2 nodes have different data and a decision must be made about which data should be kept. Neo4j handles such cases quite well, but you should keep in mind that a small data loss can occur in VERY RARE scenarios.
Backup
Good things:
Simple. Just take backups of the database.
Consistency check. When a backup is made, the tool runs a consistency check to verify that the database is not damaged. There is no possibility that the backup will screw up the live database. If there are any issues, you will be notified via the logs from the backup utility. See below for detailed info on how the backup is performed.
Database. A Neo4j backup is a fully functional database. You can spin up a server that points to the backup database and do everything you want.
Incremental backups. You can do incremental backups as often as you want.
Notes:
Neo4j scales vertically very well (depending on the size of the database). It can handle a huge load on a single instance (we had up to 3k requests/second on a medium machine). So you can get one bigger machine for the Neo4j server and another, smaller (cheaper) one for backups.
How is the backup performed?
One thing that should be kept in mind - the live database stays fully operational. The backup utility doesn't stop or prevent any actions.
When a transaction in the database is committed, all changes are appended to the transaction log.
When no previous backup is present: copy the whole store.
When there is a previous backup AND transaction logs are available: copy the new transaction logs and replay them onto the existing backup.
When there is a previous backup AND transaction logs are NOT available: discard the existing backup and copy the whole store again.
Why might transaction logs not be available? Your configuration may say to keep only the latest transaction logs (e.g. 1 hour), or not to keep them at all.
Relevant settings:
keep_logical_logs
logical_log_rotation_threshold
Other
Anyway, you should consider making backups even in a clustered environment. Everything can fail, at any moment.
In general - everything depends on your load and database size.
If your database is small enough to fit fully in memory and one machine is enough to handle all the load, then one Neo4j instance will be enough. Just do backups.
If you want better scalability/availability and a real-time working replica, then the cluster setup is the best choice.
I am working in the testing space on data warehousing. In scope, I have newly created dimensions and facts that should be validated. Based on my knowledge and what I found while browsing, I decided to cover the following:
Schema validation of Facts and Dimension tables as per spec
Data duplicate check for Fact and Dimension tables
Look-up validation for dimension tables
Is there anything else that I can verify here?
In addition, I am just curious how I can check whether data is correctly populated into the fact table - row counts, correct surrogate keys, etc. From a developer's point of view, are they using DML scripts to load the data?
Testing the Database
The database is tested in the following three ways:
1. Testing the database manager and monitoring tools - to test the database manager and the monitoring tools, they should be used in the creation, running, and management of a test database.
2. Testing database features - here is the list of features that we have to test:
- Querying in parallel
- Creating an index in parallel
- Loading data in parallel
3. Testing database performance - query execution plays a very important role in data warehouse performance measures. There are sets of fixed queries that need to be run regularly, and they should be tested. To test ad hoc queries, one should go through the user requirement document and understand the business completely. Take time to test the most awkward queries that the business is likely to ask against different index and aggregation strategies.
http://www.tutorialspoint.com/dwh/dwh_testing.htm
Also you can use ETL testing (Extract, Transform, and Load).
ETL Testing Techniques:
1) Verify that data is transformed correctly according to various business requirements and rules.
2) Make sure that all projected data is loaded into the data warehouse without any data loss and truncation.
3) Make sure that the ETL application appropriately rejects invalid data, replaces it with default values, and reports it.
4) Make sure that data is loaded in data warehouse within prescribed and expected time frames to confirm improved performance and scalability.
Apart from these 4 main ETL testing methods, other testing methods like integration testing and user acceptance testing are also carried out to make sure everything is smooth and reliable.
You can also test the schedule, backup/recovery, the operational environment, the application, and the logistics of the test.
For more information about ETL Testing / Data Warehouse Testing please visit http://www.softwaretestinghelp.com/etl-testing-data-warehouse-testing/
UPD:
Creating Indexes in Parallel
Indexes on the fact table can be partitioned or non-partitioned. Local partitioned indexes provide the simplest administration. The only disadvantage is that a search of a local non-prefixed index requires searching all index partitions.
The considerations for creating index tablespaces are similar to those for creating other tablespaces. Operating system striping with a small stripe width is often a good choice, but to simplify administration it is best to use a separate tablespace for each index. If it is a local index you may want to place it into the same tablespace as the partition to which it corresponds. If each partition is striped over a number of disks, the individual index partitions can be rebuilt in parallel for recovery. Alternatively, operating system mirroring can be used. For these reasons the NOLOGGING option of the index creation statement may be attractive for a data warehouse.
Tablespaces for partitioned indexes should be created in parallel in the same manner as tablespaces for partitioned tables.
Partitioned indexes are created in parallel using partition granules, so the maximum DOP possible is the number of granules. Local index creation has less inherent parallelism than global index creation, and so may run faster if a higher DOP is used. The following statement could be used to create a local index on the fact table:
CREATE INDEX I ON fact(dim_1, dim_2, dim_3) LOCAL
(PARTITION jan95 TABLESPACE Tsidx1,
 PARTITION feb95 TABLESPACE Tsidx2,
 ...)
PARALLEL(DEGREE 12) NOLOGGING;
To backup or restore January data, you only need to manage tablespace Tsidx1.
Parallel Query Tuning
The parallel query feature is useful for queries that access a large amount of data by way of large table scans, large joins, the creation of large indexes, bulk loads, aggregation, or copying. It benefits systems with all of the following characteristics:
- symmetric multiprocessors (SMP), clusters, or massively parallel systems
- high I/O bandwidth (that is, many datafiles on many different disk drives)
- underutilized or intermittently used CPUs (for example, systems where CPU usage is typically less than 30%)
- sufficient memory to support additional memory-intensive processes such as sorts, hashing, and I/O buffers
If any one of these conditions is not true for your system, the parallel query feature may not significantly help performance. In fact, on over-utilized systems or systems with small I/O bandwidth, the parallel query feature can impede system performance.
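As a small illustration, on an Oracle system meeting those conditions a single query can request parallelism with a hint; the fact table matches the earlier example, but the degree and grouping column are made up:

-- Hypothetical example: request degree-8 parallelism on a scan of the fact table
SELECT /*+ PARALLEL(f, 8) */ dim_1, COUNT(*)
FROM fact f
GROUP BY dim_1;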
Here you can read about this in more detail: http://docs.oracle.com/cd/A57673_01/DOC/server/doc/A48506/pqo.htm#1559
I hope these resources will be helpful for you:
https://wiki.postgresql.org/wiki/Parallel_Query_Execution
https://technet.microsoft.com/en-us/library/ms178065%28v=sql.105%29.aspx
http://www.csee.umbc.edu/portal/help/oracle8/server.815/a67775/ch24_pex.htm#1978
I am an ETL tester.
For data validation and data quality testing in a data warehouse, follow the checks below:
1) Metadata testing - test the structure of the underlying tables against the design document.
2) Data validation - test the mapping transformations using SQL and PL/SQL.
We generally test this using source and target table counts, source minus target, source intersect target, and target minus source (a sketch follows this list).
3) Duplicate check: to ensure there is no redundancy in the data warehouse.
4) Loading strategy check: to check whether your target table is SCD or delete-on-reload (depends on requirements).
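As a hedged sketch of the checks in items 2 and 3, assuming hypothetical source_customer and dw_dim_customer tables with a business key cust_id:

-- Source minus target: rows present in the source but missing from the warehouse
SELECT cust_id FROM source_customer
MINUS
SELECT cust_id FROM dw_dim_customer;

-- Duplicate check: business keys loaded more than once into the dimension
SELECT cust_id, COUNT(*)
FROM dw_dim_customer
GROUP BY cust_id
HAVING COUNT(*) > 1;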
Our Firebird 2.1.3 database seems to be creating a lot of garbage from uncompleted transactions; this is causing the database to run very slowly until the garbage is removed via a database sweep or a server restart. My database size is 30 GB+.
Do you have any idea what could be causing this?
Do any of the new stored procedures create excess garbage?
Please help.
A Firebird database getting slow after a period of time is usually a sign of bad client transaction management. This can be easily checked by inspecting various transaction counters from the header page, which can be queried by running:
gstat -h <yourdatabase>
when your database becomes slow. For example, pretty much all access libraries, when running transactions in auto-commit mode (basically when you don't care about starting explicit transactions in your client application), use COMMIT RETAINING, which basically blocks moving the OIT/OAT forward.
Beside the gstat command-line tool, with Firebird 2.1 you also have the monitoring tables, in particular MON$TRANSACTIONS, to identify long-running transactions.
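As a starting point, a query against MON$TRANSACTIONS along these lines lists the currently active transactions, oldest first, so you can spot what is holding the OIT/OAT back (treat the state value as an assumption and check it against the Firebird 2.1 documentation):

SELECT t.MON$TRANSACTION_ID, t.MON$ATTACHMENT_ID,
       t.MON$TIMESTAMP, t.MON$ISOLATION_MODE
FROM MON$TRANSACTIONS t
WHERE t.MON$STATE = 1  /* assumed: 1 = active */
ORDER BY t.MON$TIMESTAMP;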