We are planning a migration from an on-premises TFS instance to VSTS very shortly. Ahead of the migration, we ran the prerequisite validation task and obtained the following output report on the TPC database size:
"The database is currently 191GBs. This is above the recommended size of 150GBs to use the DACPAC import method. The largest table size is currently 172GBs. This is above the recommended size of 20GBs to use the DACPAC import method.
Validation completed 'Validate Collection Database Size' with result Warning, message The largest table size is currently 172GBs. This is above the recommended size of 20GBs to use the DACPAC import method."
We are therefore keen to reduce the size of the TPC database and have two main considerations:
Shrink the database and generate the DACPAC from the resulting output.
Delete any of the following objects which are unused or redundant:
a) Older Workspaces
b) Build Results
c) Redundant Team Projects
d) Unused Files
e) Test Attachments created during test runs
f) XAML Builds
We would therefore appreciate some advice or feedback on the pros and cons of either approach, and which one is recommended.
Given that you need to reduce your largest table by 150GB, I wonder whether DACPAC is ever going to be an option. That said, it's always a good idea to clean up your TFS instance. Your first step (shrinking) won't help a lot until you've managed to strip enough data to actually get any benefit out of a shrink.
The actions you've identified would indeed help; most of them are already documented here. Queries that can help detect where your space is allocated can also be found in this recent support ticket.
Delete old workspaces
Deleting workspaces and shelvesets can reduce your migration and upgrade times considerably. Either use the tf command line or leverage a tool like TFS SideKicks to identify and delete them.
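A rough sketch using the tf command line (the collection URL, workspace, and shelveset names are placeholders; the exact switches can vary slightly between TFS versions):

# List all workspaces in the collection, then delete the ones that are no longer needed
tf workspaces /collection:https://tfs.example.com/tfs/DefaultCollection /owner:* /computer:* /format:brief
tf workspace /delete "OldWorkspace;DOMAIN\olduser" /collection:https://tfs.example.com/tfs/DefaultCollection

# The same goes for stale shelvesets
tf shelvesets /owner:* /collection:https://tfs.example.com/tfs/DefaultCollection
tf shelveset /delete "OldShelveset;DOMAIN\olduser" /collection:https://tfs.example.com/tfs/DefaultCollection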
Build results
Not just the build results: often overlooked, the actual build records can take up a considerable amount of data. Use tfsbuild destroy (XAML) to permanently delete the build records. In the past, I've encountered clients who had 1.8 million "hidden" builds in their database, and removing them shaved off quite a considerable amount of data. These records were kept around for the warehouse.
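A hedged sketch for destroying old XAML build records (the collection URL, definition spec, and date range are placeholders; the supported switches differ between TFS versions, so confirm them with TFSBuild destroy /? first):

# Hypothetical invocation: destroy the records of builds that completed before 2017
TFSBuild destroy /collection:https://tfs.example.com/tfs/DefaultCollection /buildDefinition:"\MyTeamProject\*" /dateRange:~2017-01-01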
Old team projects
Of course, destroying old team projects can give back a lot of data. Anything you don't need to send to Azure helps. You could also consider splitting the collection and leaving the old projects behind. That gives you the option to detach that collection and store it somewhere, should you ever need that data again.
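For example, a redundant project could be removed with the TFSDeleteProject tool (the collection URL and project name are placeholders; on newer versions you can also delete projects from the administration web UI):

# Permanently remove a team project that nobody needs anymore
TFSDeleteProject /collection:https://tfs.example.com/tfs/DefaultCollection OldProject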
Redundant files
Deleted branches are a very common hidden size hog. When deleting things in TFVC, they are not actually deleted, they're just hidden. Finding deleted files and especially old development or feature branches can give you back a lot of data. Use tf destroy to get rid of them.
You may also want to look for checked-in NuGet package folders; those can quickly rack up a lot of space as well.
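A minimal sketch with tf destroy (the server path and collection URL are placeholders; /preview shows what would be removed without actually destroying anything):

# Preview first, then permanently destroy a deleted feature branch and its history
tf destroy "$/MyTeamProject/Dev/OldFeatureBranch" /collection:https://tfs.example.com/tfs/DefaultCollection /preview
tf destroy "$/MyTeamProject/Dev/OldFeatureBranch" /collection:https://tfs.example.com/tfs/DefaultCollection /startcleanup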
Test Attachments
Oh yes, especially when you use test attachments these can grow like crazy. Depending on your TFS version, either use the built-in test attachment cleanup features or the Test Attachment Cleaner from the TFS Power Tools.
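A rough sketch of the Power Tools cleaner, tcmpt.exe (the collection URL, project name, and settings file are placeholders, and the exact parameter names may differ per Power Tools release):

# Preview which attachments match the settings file before switching the mode to delete
tcmpt attachmentcleanup /collection:https://tfs.example.com/tfs/DefaultCollection /teamproject:MyTeamProject /settingsfile:CleanupSettings.xml /mode:preview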
XAML Builds
The build definitions themselves won't take up a lot of DB space, but the build results may; those have already been covered in a previous section.
Git Repositories
You may have data in your Git repositories that is no longer accessible due to force pushes or deleted branches. It's also possible that certain data in Git could be packed more efficiently. To clean your repositories you have to clone them locally, clean them up, delete the remote repo from TFS, and push the cleaned copy to a new repository (you can use the same name as the old one). Doing this will break references with existing build definitions, which you will have to fix up afterwards. While you're at it, you could also run the BFG Repo-Cleaner and convert the repositories to Git-LFS to handle large binary files more elegantly.
# Clone a bare mirror of the repository (then change into the resulting *.git directory)
git clone --mirror <<repo>>
# Optionally run the BFG Repo-Cleaner at this point
# Expire the reflog and repack aggressively so unreachable objects are actually dropped
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git repack -adf
# Delete and recreate the remote repository with the same name, then push everything back
git push origin --all
git push origin --tags
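If you also want to adopt Git-LFS for large binaries while the history is already being rewritten, a hedged sketch (requires a reasonably recent Git LFS client; the file patterns are just examples, and this must run before pushing to the recreated remote):

# Rewrite all refs so matching binaries are stored as LFS pointers instead of blobs
git lfs migrate import --include="*.zip,*.dll,*.pdb" --everything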
Work item (attachments)
Work items can gather up a considerable amount of data, especially when people start attaching large files to them. You can use witadmin destroywi to delete work items with unreasonably large attachments. To retain a work item but delete its attachments, delete the attachments from the current work item and then clone it; after cloning, destroy the old work item so the attachments can be cleaned up.
Old work items that you no longer need (say, the sprint items from 6 years ago) can also be deleted. My colleague Rene has a nice tool that allows you to bulk-destroy work items by first creating the appropriate work item query.
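A minimal sketch of the destroy command (the collection URL and work item ID are placeholders); destroywi is irreversible, so double-check the IDs before running it:

# Permanently destroy a single work item, including its attachments and history
witadmin destroywi /collection:https://tfs.example.com/tfs/DefaultCollection /id:12345 /noprompt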
Be sure to run the cleanup jobs
TFS often doesn't directly prune data from the database; in many cases it just marks things as deleted for later processing. To force the cleanup to happen immediately, run the following stored procedures on your Project Collection database:
EXEC prc_CleanupDeletedFileContent 1
-- You may have to run the following command multiple times; the last
-- parameter is the batch size. If there are more items to prune than the
-- number passed in, you will have to run it again.
EXEC prc_DeleteUnusedFiles 1, 0, 100000
Other useful queries
To identify how much data is stored in each section, there are a few useful queries you can run. The actual query depends on your TFS version, but since you're preparing for migration I suspect you're on TFS 2017 or 2018 at the moment.
Find the largest tables:
SELECT TOP 10 o.name,
SUM(reserved_page_count) * 8.0 / 1024 SizeInMB,
SUM(CASE
WHEN p.index_id <= 1 THEN p.row_count
ELSE 0
END) Row_Count
FROM sys.dm_db_partition_stats p
JOIN sys.objects o
ON p.object_id = o.object_id
GROUP BY o.name
ORDER BY SUM(reserved_page_count) DESC
Find the largest content contributors:
SELECT Owner =
CASE
WHEN OwnerId = 0 THEN 'Generic'
WHEN OwnerId = 1 THEN 'VersionControl'
WHEN OwnerId = 2 THEN 'WorkItemTracking'
WHEN OwnerId = 3 THEN 'TeamBuild'
WHEN OwnerId = 4 THEN 'TeamTest'
WHEN OwnerId = 5 THEN 'Servicing'
WHEN OwnerId = 6 THEN 'UnitTest'
WHEN OwnerId = 7 THEN 'WebAccess'
WHEN OwnerId = 8 THEN 'ProcessTemplate'
WHEN OwnerId = 9 THEN 'StrongBox'
WHEN OwnerId = 10 THEN 'FileContainer'
WHEN OwnerId = 11 THEN 'CodeSense'
WHEN OwnerId = 12 THEN 'Profile'
WHEN OwnerId = 13 THEN 'Aad'
WHEN OwnerId = 14 THEN 'Gallery'
WHEN OwnerId = 15 THEN 'BlobStore'
WHEN OwnerId = 255 THEN 'PendingDeletion'
END,
SUM(CompressedLength) / 1024.0 / 1024.0 AS BlobSizeInMB
FROM tbl_FileReference AS r
JOIN tbl_FileMetadata AS m
ON r.ResourceId = m.ResourceId
AND r.PartitionId = m.PartitionId
WHERE r.PartitionId = 1
GROUP BY OwnerId
ORDER BY 2 DESC
If file containers are the issue:
SELECT CASE WHEN Container = 'vstfs:///Buil' THEN 'Build'
WHEN Container = 'vstfs:///Git/' THEN 'Git'
WHEN Container = 'vstfs:///Dist' THEN 'DistributedTask'
ELSE Container
END AS FileContainerOwner,
SUM(fm.CompressedLength) / 1024.0 / 1024.0 AS TotalSizeInMB
FROM (SELECT DISTINCT LEFT(c.ArtifactUri, 13) AS Container,
fr.ResourceId,
ci.PartitionId
FROM tbl_Container c
INNER JOIN tbl_ContainerItem ci
ON c.ContainerId = ci.ContainerId
AND c.PartitionId = ci.PartitionId
INNER JOIN tbl_FileReference fr
ON ci.fileId = fr.fileId
AND ci.DataspaceId = fr.DataspaceId
AND ci.PartitionId = fr.PartitionId) c
INNER JOIN tbl_FileMetadata fm
ON fm.ResourceId = c.ResourceId
AND fm.PartitionId = c.PartitionId
GROUP BY c.Container
ORDER BY TotalSizeInMB DESC
Related
We have TFS 2017.3 and its database is huge - about 1.6 TB.
I want to try to clean up the space by running these stored procedures:
prc_CleanupDeletedFileContent
prc_DeleteUnusedFiles
prc_DeleteUnusedContent
Is it safe to run them?
Is there a chance that they will delete important things I am currently using? (Of course, I will do a backup first...)
What are the best values to put in these stored procedures?
Another thing - If I run this query:
SELECT A.[ResourceId]
FROM [Tfs_DefaultCollection].[dbo].[tbl_Content] As A
left join [Tfs_DefaultCollection].[dbo].[tbl_FileMetadata] As B on A.ResourceId=B.ResourceId
where B.[ResourceId] IS Null
I got a result of 10681 rows.
If I run this query:
SELECT A.[ResourceId]
FROM PTU_NICE_Coll.[dbo].[tbl_Content] As A
left join PTU_NICE_Coll.[dbo].tbl_FileReference As B on A.ResourceId=B.ResourceId
where B.[ResourceId] IS Null
I got a result of 10896 rows.
How can I remove these rows? And is it completely safe to remove them?
Generally we don't recommend performing actions directly against the DB, as it may cause problems.
However, if you have to do that, you need to back up the DBs first.
You can refer to the articles below to clean up and reduce the size of the TFS databases:
Control\Reduce TFS DB Size
Cleaning up and reduce the size of the TFS database
Clean up your Team Project Collection
Another option is to dive deep into the database and run the cleanup stored procedures manually. If your Content table is large:
EXEC prc_DeleteUnusedContent 1
If your Files table is large:
EXEC prc_DeleteUnusedFiles 1, 0, 1000
This second sproc may run for a long time; that's why it has the third parameter, which defines the batch size. You may need to run this sproc multiple times, or, if it completes quickly, you can increase the chunk size.
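If you prefer not to re-run the statement by hand, a hedged sketch of a simple T-SQL loop around the batched call (the iteration count and batch size are arbitrary placeholders):

-- Repeat the batched cleanup a fixed number of times; adjust the count and
-- batch size to taste, or stop once the sproc starts returning quickly.
DECLARE @i INT = 0;
WHILE @i < 10
BEGIN
    EXEC prc_DeleteUnusedFiles 1, 0, 100000;
    SET @i += 1;
END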
In order to keep track of the history of changes to a contract, I am growing a tail of versions off the current version node; a new current version is created at each change. A change (renewal of contract) involves deleting old relationships (not deleting any node), inserting the new current version node, and creating new relationships. I assume this all happens in a single transaction. If a client tries to access the current version the moment that change is happening, will the client request...
fail to find the node or its contents?
find the previous version of the contract node?
find the new version of the contract node?
In the example code below we have a coach and an athlete who renew their joint contract at the beginning of each year. If you run the first query once and the second query several times, it will build this model. I don't know how to conclusively test this race-condition scenario.
//INITIALIZE FIRST CONTRACT
CREATE (coach:PERSON {name:'coach'})
CREATE (athlete:PERSON {name:'athlete'})
CREATE (curr:CONTRACT {name:'contract', year:2017, content: "The signees herein..."})
MERGE (coach)-[:BOUND_TO]->(curr)<-[:BOUND_TO]-(athlete)
//RENEW CONTRACT
MATCH (coach:PERSON {name:'coach'} )
MATCH (athlete:PERSON {name:'athlete'} )
MATCH (coach)-[r1:BOUND_TO]->(curr)<-[r2:BOUND_TO]-(athlete)
MERGE (coach)-[:BOUND_TO]->(new:CONTRACT {name:'contract', year:curr.year + 1, content: "The signees herein..."})<-[:BOUND_TO]-(athlete)
MERGE (new)-[:PREV]->(curr)
DELETE r1,r2
Neo4j is a fully ACID database, so the answer depends on when the commit happens:
before the commit: users will find the previous version
after the commit: users will find the new version
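If you want to make the commit point explicit while testing, one option is to wrap the renewal in an explicit transaction; a sketch using cypher-shell's :begin/:commit commands (assuming the model built by the queries above) - concurrent readers will see either the old contract or the new one, never a half-applied change:

:begin
MATCH (coach:PERSON {name:'coach'})-[r1:BOUND_TO]->(curr:CONTRACT)<-[r2:BOUND_TO]-(athlete:PERSON {name:'athlete'})
MERGE (coach)-[:BOUND_TO]->(new:CONTRACT {name:'contract', year:curr.year + 1, content: "The signees herein..."})<-[:BOUND_TO]-(athlete)
MERGE (new)-[:PREV]->(curr)
DELETE r1, r2;
:commit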
I have a large table in my TFS collection called "tbl_TestCodeSignature" and I want to clean it up - any ideas how?
It looks like this:
I also ran the following query found here
select tbc.BuildUri, COUNT(*) from tbl_TestCodeSignature tc
join tbl_TestRun tr on tc.TestRunId = tr.TestRunId
join tbl_buildconfiguration tbc on tbc.BuildConfigurationId = tr.BuildConfigurationId
group by tbc.BuildUri
Result:
That data is from builds that have created test impact analysis information.
The test impact data is mostly stored in the tbl_TestCodeSignature table in the project collection database. This table essentially keeps the mapping between a test result and the impacted code signatures from the product DLL. Normally a test case will use a lot of code signatures from the product, so this table grows to millions of rows. Test impact data is associated with a test run, which in turn is associated with a particular build. So when a build gets deleted, all the runs associated with that build also get deleted. As part of run deletion, the test impact data is deleted from the tbl_TestCodeSignature table as well. So one approach to keep the size of the test impact data table in check is to delete redundant builds that have a lot of test impact data.
Ref: https://blogs.msdn.microsoft.com/nipun-jain/2012/10/27/cleanup-redundant-test-impact-data/
I've done some searching but have not been able to find out if there are any limits in TFS to the number of files in a single changeset.
This came up with IntelliJ IDEA where we found that it was splitting up changesets with >200 files. I want to argue that there shouldn't be any limit at all, or at least the limit should be the same as TFS' own limit, if there is one. See the defect I reported on this issue at http://youtrack.jetbrains.net/issue/IDEA-54846.
The number of changes in a changeset is stored as the CLR's int type. So there's definitely an upper limit of int.MaxValue or 2,147,483,647. I don't think there are any checks to limit the number of changes in any other way (though I may be mistaken.) Realistically, you probably have disk space contention to deal with on the server long before you reach that value.
One of the specific design goals of Team Foundation Server was to deal with large changesets - particularly merging large feature branches with a lot of churn - which can produce a changeset with a large number of merge or merge/edit changes.
In short, no. And even if there were, hundreds is several orders of magnitude off. There should be no reason to split them into multiple changesets - you only do yourself a disservice by doing that. You're hurting traceability, basically devolving into a non-atomic system (yay, CVS!) and making the state of your repository unreliable. It negatively impacts continuous integration, linking to work items and builds, and overall traceability. Imagine checking in half your merges to a branch... then the other half. That sounds like a nightmare.
Based on my observations of our TFS site, the number of files a TFS changeset can contain is at least 11670.
USE Tfs_Warehouse;
GO
SELECT
FCC.ChangesetSK
, COUNT(1) AS row_count
FROM
dbo.FactCodeChurn FCC
INNER JOIN
dbo.DimChangeset DCS
ON DCS.ChangesetSK = FCC.ChangesetSK
INNER JOIN
dbo.DimFile DF
ON DF.FileSK = FCC.FilenameSK
GROUP BY
FCC.ChangesetSK
HAVING
COUNT(1) > 200
ORDER BY
2 DESC;
Partial results
ChangesetSK row_count
53172 11670
4436 7940
4442 7808
43808 6262
21016 6047
53173 5835
So this is what I have:
valid changeset id 8
valid changeset id 7
valid changeset id 6
valid changeset id 5
invalid merge from branch X changeset id 4
valid changeset id 3
valid changeset id 2
valid changeset id 1
Is there a way to "delete", "skip", or "ignore" the invalid changeset?
If not, I will lose a week recovering from this mess.
To answer the question "Is there a way to skip or otherwise ignore a changeset?" - the answer is no.
Which leaves you with three choices:
The first is to pull all the changes you want from 5 through 8 and roll back to 3. Basically, get the files that changed and hand-merge them into rev 3.
The second is to look at everything that the merge updated and roll those items back by hand. In short, depending on the number of files involved, you are in for a long editing session.
The third option is only available to you if sets 5 through 8 did not modify the same files as 4. If this is true, then just select the files from the 4th set and roll those back individually. Then check in the new set as #9. Somehow I doubt this is available to you.
If you use TFS 2010 then you can use the tf rollback command; this will attempt to remove the offending changeset. If there are conflicts because subsequent changesets have modified the same code, the merge tool will appear and you can select the code you need to keep or remove.
For earlier versions of TFS you can install the TFS Power Tools and use the tfpt rollback command to do the same thing.
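A minimal sketch, assuming changeset 4 is the one to undo and you run this from a workspace mapped to the affected branch (the comment text is just an example):

# Create pending changes that undo everything changeset 4 did, then check them in
tf rollback /changeset:4
tf checkin /comment:"Rolled back changeset 4"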