Cleanup TFS tbl_TestCodeSignature - tfs

I have a large table in my TFS collection called "tbl_TestCodeSignature" and I want to clean this up. Any ideas how?
It looks like this:
I also ran the following query found here
select tbc.BuildUri, COUNT(*) from tbl_TestCodeSignature tc
join tbl_TestRun tr on tc.TestRunId = tr.TestRunId
join tbl_buildconfiguration tbc on tbc.BuildConfigurationId = tr.BuildConfigurationId
group by tbc.BuildUri
Result:

That data is from builds that have created test impact analysis information.
The test impact data is mostly stored in the tbl_TestCodeSignature table
in the project collection database. This table essentially keeps the
mapping between a test result and the impacted code signatures from the
product DLL. Normally a test case will use a lot of code signatures from
the product, so this table grows into millions of rows. Test impact data
is associated with a test run, which in turn is associated with a
particular build. So when a build gets deleted, all the runs associated
with that build also get deleted. As part of run deletion, we delete the
test impact data from the tbl_TestCodeSignature table as well. So one
approach to keeping the size of the test impact data table in check is to
delete redundant builds that carry a lot of test impact data.
Ref: https://blogs.msdn.microsoft.com/nipun-jain/2012/10/27/cleanup-redundant-test-impact-data/
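Building on the query from the question, a hedged sketch that ranks builds by how many test impact rows they own, so the biggest contributors can be targeted for deletion first (through build retention or the TFS UI, rather than by deleting rows directly):
select tbc.BuildUri, COUNT(*) as ImpactRows
from tbl_TestCodeSignature tc
join tbl_TestRun tr on tc.TestRunId = tr.TestRunId
join tbl_BuildConfiguration tbc on tbc.BuildConfigurationId = tr.BuildConfigurationId
group by tbc.BuildUri
order by COUNT(*) desc -- biggest test impact footprints first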

Related

Is it safe to reduce TFS DB size by using Delete stored procedures?

We have TFS 2017.3 and its database is huge - about 1.6 TB.
I want to try to free up space by running these three stored procedures:
prc_CleanupDeletedFileContent
prc_DeleteUnusedFiles
prc_DeleteUnusedContent
Is it safe to run them?
Is there a chance they will delete important things I am currently using? (Of course, I will make a backup first...)
What are the best parameter values to pass to these stored procedures?
Another thing - If I run this query:
SELECT A.[ResourceId]
FROM [Tfs_DefaultCollection].[dbo].[tbl_Content] As A
left join [Tfs_DefaultCollection].[dbo].[tbl_FileMetadata] As B on A.ResourceId=B.ResourceId
where B.[ResourceId] IS Null
I get a result of 10,681 rows.
If I run this query:
SELECT A.[ResourceId]
FROM PTU_NICE_Coll.[dbo].[tbl_Content] As A
left join PTU_NICE_Coll.[dbo].tbl_FileReference As B on A.ResourceId=B.ResourceId
where B.[ResourceId] IS Null
I get a result of 10,896 rows.
How can I remove these rows? And is it completely safe to remove them?
Generally we don't recommend running actions directly against the databases, as it may cause problems.
However, if you have to do that, you need to back up the databases first.
You can refer to the articles below on cleaning up and reducing the size of the TFS databases:
Control\Reduce TFS DB Size
Cleaning up and reduce the size of the TFS database
Clean up your Team Project Collection
Another option is to dive deep into the database, and run the cleanup
stored procedures manually. If your Content table is large:
EXEC prc_DeleteUnusedContent 1
If your Files table is large:
EXEC prc_DeleteUnusedFiles 1, 0, 1000
This second sproc may run for a long time; that's why it has the third
parameter, which defines the batch size. You may need to run this sproc
multiple times, or, if it completes quickly, you can increase the chunk
size.
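If you want to script the repeated runs, here is a minimal sketch; the iteration cap is arbitrary and the parameter values simply mirror the example above:
-- Re-run the batched cleanup; stop early if the Files table stops shrinking.
DECLARE @i INT = 0;
WHILE @i < 50
BEGIN
    EXEC prc_DeleteUnusedFiles 1, 0, 1000; -- same batch size of 1000 per call
    SET @i += 1;
END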

How to create staging table to handle incremental load

We are designing a Staging layer to handle incremental load. I want to start with a simple scenario to design the staging.
In the source database there are two tables, e.g. tbl_Department and tbl_Employee. Both tables feed a single table in the destination database, e.g. tbl_EmployeRecord.
The query that loads tbl_EmployeRecord is:
SELECT EMPID,EMPNAME,DEPTNAME
FROM tbl_Department D
INNER JOIN tbl_Employee E
ON D.DEPARTMENTID=E.DEPARTMENTID
Now we need to identify the incremental changes in tbl_Department and tbl_Employee, store them in staging, and load only those changes to the destination.
The columns of the tables are,
tbl_Department : DEPARTMENTID,DEPTNAME
tbl_Employee : EMPID,EMPNAME,DEPARTMENTID
tbl_EmployeRecord : EMPID,EMPNAME,DEPTNAME
Kindly suggest how to design the staging for this to handle Insert, Update and Delete.
Identifying Incremental Data
Incremental loading needs to be based on some segregating information present in your source table. Such information helps you identify the incremental portion of the data that you will load. Oftentimes, the load date or last-updated date of the record is a good choice for this.
Consider this: your source table has a date column that stores both the date a record was inserted and the date of any update to it. During any staging load, you can use this date to identify which records were newly inserted or updated since your last staging load, and treat only those changed/updated records as your incremental delta.
Given your table structures, I am not sure which column you could use for this. ID columns will not help, because if a record gets updated you won't know about it.
Maintaining Load History
It is important to store information about how much you have loaded today so that you can load the next part in the next load. To do this, maintain a staging table, often called a Batch Load Details table. It will typically have a structure such as below:
BATCH ID | START DATE | END DATE | LOAD DATE | STATUS
------------------------------------------------------
1 | 01-Jan-14 | 02-Jan-14 | 02-Jan-14 | Success
You need to insert a new record into this table every day before you start the data load. The new record will have a start date equal to the end date of the last successful load, and a null status. Once loading is successful, you update the status to 'Success'.
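A minimal sketch of that bookkeeping, assuming a BATCH_LOAD table shaped like the example above and at least one prior successful batch:
-- Open a new batch before the load: start where the last success ended.
INSERT INTO BATCH_LOAD (BATCH_ID, START_DATE, END_DATE, LOAD_DATE, STATUS)
SELECT MAX(BATCH_ID) + 1, MAX(END_DATE), GETDATE(), GETDATE(), NULL
FROM BATCH_LOAD
WHERE STATUS = 'Success';

-- Close the batch once the load succeeds.
UPDATE BATCH_LOAD
SET STATUS = 'Success'
WHERE STATUS IS NULL;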
Modifying the Data Extraction Query to Take Advantage of the Batch Load Table
Once you maintain your loading history like above, you can include this table in your extraction query:
SELECT EMPID,EMPNAME,DEPTNAME
FROM tbl_Department D
INNER JOIN tbl_Employee E
ON D.DEPARTMENTID=E.DEPARTMENTID
WHERE E.load_date >= (SELECT max(START_DATE) FROM BATCH_LOAD WHERE status IS NULL)
What I am going to suggest is by no means a standard. In fact, you should evaluate my suggestion carefully against your requirements.
Suggestion
Use incremental loading for transaction data, not for master data. Transaction data are generally higher in volume and can be easily segregated into incremental chunks. Master data tend to be more manageable and can be loaded in full every time. In the above example, I am assuming your Employee table behaves like transactional data, whereas your Department table is your master data.
I trust this article on incremental loading will be very helpful for you
I'm not sure what database you are using, so I'll just talk in conceptual terms. If you want to add tags for specific technologies, we can probably provide specific advice.
It looks like you have 1 row per employee and that you are only keeping the current record for each employee. I'm going to assume that EMPIDs are unique.
First, add a field to the query that currently populates the dimension. This field will be a hash of the other fields in the table: EMPID, EMPNAME, DEPTNAME. You can create a view, populate a new staging table, or just use the query. Also add this same hash field to the dimension table. Basically, the hash is an easy way to generate a field that is unique for each record and efficient to compare.
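On SQL Server, for example, a hedged sketch of the staging query with a HASHBYTES-based hash (column names taken from the question; the delimiter keeps concatenated values unambiguous):
SELECT E.EMPID,
       E.EMPNAME,
       D.DEPTNAME,
       -- Hash of the business columns; compare this against the dimension's copy.
       HASHBYTES('SHA2_256',
                 CONCAT(E.EMPID, '|', E.EMPNAME, '|', D.DEPTNAME)) AS ROW_HASH
FROM tbl_Department D
INNER JOIN tbl_Employee E
    ON D.DEPARTMENTID = E.DEPARTMENTID;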
Inserts: These are the records for which the EMPID does not already exist in the dimension table but does exist in your staging query/view.
Updates: These are the records for which the EMPID exists in both the staging query/view and the dimension table, but the hash fields don't match.
Deletes: These are the records for which the EMPID exists in the dimension but does not exist in the staging query/view.
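A hedged sketch of those three comparisons, assuming a staging view STG and a dimension table DIM that both carry the ROW_HASH column (the names are illustrative):
-- Inserts: in staging but not yet in the dimension.
SELECT S.*
FROM STG S
LEFT JOIN DIM D ON D.EMPID = S.EMPID
WHERE D.EMPID IS NULL;

-- Updates: in both, but the hashes differ.
SELECT S.*
FROM STG S
INNER JOIN DIM D ON D.EMPID = S.EMPID
WHERE D.ROW_HASH <> S.ROW_HASH;

-- Deletes: in the dimension but gone from staging.
SELECT D.*
FROM DIM D
LEFT JOIN STG S ON S.EMPID = D.EMPID
WHERE S.EMPID IS NULL;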
If this will be high-volume, you may want to create new tables to hold the records that should be inserted and the records that should be updated. Once you have identified the records, you can insert/update them all at once instead of one-by-one.
It's a bit uncommon to delete lots of records from a data warehouse, as it is typically used to keep history. I would suggest perhaps creating a status column or a bit field that indicates whether the record is active or deleted in the source. Of course, how you handle deletes should depend on your business needs/reporting requirements. Just remember that if you do a hard delete you can never get that data back if you decide you need it later.
Updating the existing dimension in place (rather than creating historical records for each change) is called a Type 1 dimension in dimensional modeling terms. This is fairly common. But if you decide you need to keep history, you can use the hash to help you create the SCD Type 2 records.

SSIS Foreach through a table, insert into another and delete the source row

I have an SSIS routine that reads from a very dynamic table and inserts whichever rows it finds into a table in a different database, before truncating the original source table.
Due to the dynamic nature of the source table, this truncation unsurprisingly loses data: rows that arrive after the read but before the truncate never make it to the second database.
What is the best way of deleting only those rows that have been migrated?
There is an identity column on the source table but it is not migrated across.
I can't change either table schema.
An option that might sound stupid, but works, is to delete first and use the OUTPUT clause.
I created a simple control flow that populates a table for me.
IF EXISTS
(
    SELECT 1 FROM sys.tables AS T WHERE T.name = 'DeleteFirst'
)
BEGIN
    DROP TABLE dbo.DeleteFirst;
END

CREATE TABLE dbo.DeleteFirst
(
    [name] sysname
);

INSERT INTO
    dbo.DeleteFirst
SELECT
    V.name
FROM
    master.dbo.spt_values AS V
WHERE
    V.name IS NOT NULL;
In my OLE DB Source, instead of using a SELECT, DELETE the data you want to send down the pipeline and OUTPUT the DELETED virtual table. Something like:
DELETE DF
OUTPUT DELETED.*
FROM dbo.DeleteFirst AS DF;
It works, it works!
One option would be to create a table to log the identities of your processed records into, and then a separate package (or dataflow) to delete those records. If you're already logging processed records somewhere, you could just add the identity there; otherwise, create a new table to store the data.
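A hedged sketch of the cleanup step under that approach; dbo.SourceTable, its identity column Id, and the log table dbo.ProcessedRows are all stand-ins for your actual names:
-- Delete only the rows whose identity was logged as successfully migrated.
DELETE S
FROM dbo.SourceTable AS S
INNER JOIN dbo.ProcessedRows AS P
    ON P.SourceId = S.Id;

-- Optionally clear the log once the source rows are gone.
TRUNCATE TABLE dbo.ProcessedRows;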
A second option: if you're trying to avoid creating additional tables, then separate the record selection and record processing into two stages. Broadly, you'd select all your records in the control flow, then process them one-by-one in the dataflow.
Specifically:
Create a variable of type Object to store your record list, and another variable matching your identity type (int presumably) to store the 'current record identity'.
In the control flow, add an Execute SQL task which uses a query to build a list of identity values to process, then stores them into the recordlist variable.
Add a Foreach Loop Container to process that list; the foreach task would load the current record identifier into the second variable you defined above.
In the foreach task, add a dataflow to copy that single record, then delete it from the source.
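A hedged sketch of the two parameterized statements that dataflow would run, using the OLE DB ? placeholder bound to the current-record variable (dbo.SourceTable and Id are stand-ins for your actual table and identity column):
-- OLE DB Source: fetch the single record for the current identity.
SELECT * FROM dbo.SourceTable WHERE Id = ?;

-- Execute SQL task, after the copy succeeds: remove it from the source.
DELETE FROM dbo.SourceTable WHERE Id = ?;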
There are quite a few examples of this online; e.g. this one from the venerable Jamie Thomson, or this one which includes a bit more detail.
You didn't mention the scale of the data; if you have very large numbers of records, the first suggestion is likely the better choice. Note that in both cases you lose the advantage of table truncation, because you're using standard delete calls.

Performance of generated T-SQL from Entity Framework

I recently used Entity Framework for a project, despite my DBA's strong disapproval. So one day he came to my office complaining about the generated T-SQL that reaches his database.
For instance, when I want to select a product based on the id, I write something like this:
context.Products.FirstOrDefault(p=>p.Id==id);
Which translates to
SELECT ... FROM (SELECT TOP 1 ... FROM PRODUCTS WHERE ID = @id)
So he is shouting, "Why on earth would you write a SELECT * FROM (SELECT TOP 1)"
So I changed my code to
context.Products.Where(p=>p.Id==id).ToList().FirstOrDefault()
and this produces a much cleaner T-SQL:
SELECT ... FROM PRODUCTS WHERE ID = @id
The inner query and the TOP 1 disappeared. Enough rambling; my question is this: does the first query really put an overhead on SQL Server? Is it harder to parse than the second method? The Id column has a clustered index on it. I want a good answer so I can rub it in his face (or mine).
Thanks,
Themos
Have you tried running the queries manually and comparing the execution plans?
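For instance, a hedged sketch of that comparison in SSMS (enable the actual execution plan; the literal id and the * column list are placeholders); with a clustered index on Id, both shapes should produce essentially the same index seek:
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Shape 1: the nested form EF generates.
SELECT * FROM (SELECT TOP 1 * FROM Products WHERE Id = 42) AS q;

-- Shape 2: the flattened form.
SELECT * FROM Products WHERE Id = 42;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;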
The biggest problem here isn't that the SQL isn't perfectly formed to your DBA's standards (although I'm fairly certain that the query engine will optimize out the extra select). The second query materializes the results with ToList() and only then takes the first element in memory, pulling work into the application layer that should be performed by the DB.
In short, he's being a pedant; leave it the way it was.

Avoiding round-trips when importing data from Excel

I'm using EF 4.1 (Code First). I need to add/update products in a database based on data from an Excel file. As discussed here, one way to achieve this is to use dbContext.Products.ToList() to force loading all products from the database, then use db.Products.Local.FirstOrDefault(...) to check whether a product from the Excel file exists in the database, and proceed accordingly with an insert or an update. This is only one round-trip.
Now, my problem is that there are too many products in the database, so it's not possible to load all products into memory. What's the way to achieve this without multiplying round-trips to the database? My understanding is that if I just do a search with db.Products.FirstOrDefault(...) for each Excel product to process, this will perform a round-trip each time, even if I issue the statement for the exact same product several times! What's the purpose of EF caching objects and returning the cached value if it goes to the database anyway?
There is actually no way to make this better. EF is not a good solution for this kind of task. You must know whether a product already exists in the database to use the correct operation, so you always need an additional query. You can group multiple products into a single query using .Contains (like SQL IN), but that only solves the existence check. The worse problem is that each INSERT or UPDATE is executed in a separate round-trip as well, and there is no way to solve this because EF doesn't support command batching.
Create a stored procedure and pass the product information to it. The stored procedure will perform an insert or update based on the existence of the record in the database.
You can even use more advanced features like table-valued parameters to pass multiple records from Excel into the procedure with a single call, or import the Excel data into a temporary table (for example with SSIS) and process the rows directly on the SQL server. Lastly, you can use bulk insert to get all records into a special import table and again process them with a single stored procedure call.
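A hedged sketch of the table-valued-parameter variant; dbo.Products and the column names are stand-ins for your actual schema:
-- Row type carrying the data extracted from Excel.
CREATE TYPE dbo.ProductImportType AS TABLE
(
    ProductCode NVARCHAR(50) PRIMARY KEY,
    Name        NVARCHAR(200) NOT NULL,
    Price       DECIMAL(18, 2) NOT NULL
);
GO

-- Insert-or-update all imported rows in a single round-trip.
CREATE PROCEDURE dbo.UpsertProducts
    @Rows dbo.ProductImportType READONLY
AS
BEGIN
    SET NOCOUNT ON;

    MERGE dbo.Products AS target
    USING @Rows AS source
        ON target.ProductCode = source.ProductCode
    WHEN MATCHED THEN
        UPDATE SET Name = source.Name, Price = source.Price
    WHEN NOT MATCHED THEN
        INSERT (ProductCode, Name, Price)
        VALUES (source.ProductCode, source.Name, source.Price);
END
GO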
