I am trying to retrieve the last commit in which a specific file was changed (something like "git log foo.cc").
My approach is to get all the commits and go through them,
looking for the file in each commit's tree.
The problem is that all the commits contain the file I'm interested in.
Is the commit tree supposed to contain only the files that were submitted, or the full tree as it was at the time of the commit?
If the tree is supposed to be full:
How can I tell whether the file (TreeEntry) was modified in a specific commit?
Thanks!
In Git, every commit contains a representation of the state of the entire repository. For more info look here. The paragraph "Snapshots, Not Differences" is clear about why this choice was made. Note that if a file didn't change between two commits, the later commit just contains a pointer to the same file content as the previous commit. You could compare the pointers in two consecutive commits to spot differences. Another approach would be to use a diff tool, as mentioned in the comment by Edward Thomson.
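The pointer comparison can be sketched without any Git library. This is only a minimal model, assuming each commit's tree has already been flattened into a mapping of path to blob id and that history is linear; the Commit class and the last_commit_touching helper are hypothetical names for illustration, not part of libgit2 or any of its bindings.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Commit:
    """Hypothetical stand-in for a commit: an id plus a flattened tree."""
    id: str
    tree: Dict[str, str]  # path -> blob id (the "pointer" to the content)


def last_commit_touching(history: List[Commit], path: str) -> Optional[str]:
    """Return the id of the newest commit whose blob id for `path`
    differs from the blob id in its parent.

    `history` is ordered newest first and assumed linear (one parent each).
    """
    for i, commit in enumerate(history):
        # The oldest commit has no parent, so compare it with an empty tree.
        parent_tree = history[i + 1].tree if i + 1 < len(history) else {}
        if commit.tree.get(path) != parent_tree.get(path):
            return commit.id
    return None


history = [
    Commit("c3", {"foo.cc": "blob-b", "bar.cc": "blob-y"}),  # only bar.cc changed
    Commit("c2", {"foo.cc": "blob-b", "bar.cc": "blob-x"}),  # foo.cc changed
    Commit("c1", {"foo.cc": "blob-a", "bar.cc": "blob-x"}),  # initial commit
]
print(last_commit_touching(history, "foo.cc"))
```

Every commit lists foo.cc, but only c2 and c1 introduce a new blob id for it, which is why the presence of the file in a tree says nothing about whether the commit modified it. Merge commits have several parents and would need a comparison against each of them.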
We are planning a migration from an on-premises TFS instance to VSTS very shortly. Ahead of the migration, we ran the prerequisite validation task and obtained the following output report on the TPC database size:
"The database is currently 191GBs. This is above the recommended size of 150GBs to use the DACPAC import method. The largest table size is currently 172GBs. This is above the recommended size of 20GBs to use the DACPAC import method.
Validation completed 'Validate Collection Database Size' with result Warning, message The largest table size is currently 172GBs. This is above the recommended size of 20GBs to use the DACPAC import method."
We are therefore keen to reduce the size of the TPC database and have two main considerations:
Shrink the Database and generate the DACPAC from the resultant output.
Delete any of the following objects which are unused or redundant:
a) Older Workspaces
b) Build Results
c) Redundant Team Projects
d) Unused Files
e) Test Attachments created during test runs
f) XAML Builds
We would therefore appreciate some advice or feedback on the pros and cons of either approach, and on which would be recommended.
Given that you need to reduce your largest table by about 150GB, I wonder whether DACPAC is ever going to be an option. That said, it's always a good idea to clean up your TFS instance. Your first step won't help much until you've stripped out enough data to actually get any benefit from a shrink.
Your identified actions would indeed help, most are already documented here. Queries that can aid in detecting where your space is allocated are also found in this recent support ticket.
Delete old workspaces
Deleting workspaces and shelvesets can reduce your migration and upgrade times considerably. Either use the tf command line or leverage a tool like the TFS SideKicks to identify and delete these.
Build results
It's not just the build results: the often-overlooked build records themselves can take up a considerable amount of data. Use tfsbuild destroy (XAML) to permanently delete the build records. In the past, I've encountered clients who had 1.8 million "hidden" builds in their database, and removing them shaved off quite a considerable amount of data. These records were kept around for the warehouse.
Old team projects
Of course, destroying old team projects can give back a lot of data. Anything you don't need to send to Azure helps. You could also consider splitting the collection and leaving the old projects behind. That gives you the option to detach that collection and store it somewhere, should you ever need that data again.
Redundant files
Deleted branches are a very common hidden size hog. When deleting things in TFVC, they are not actually deleted, they're just hidden. Finding deleted files and especially old development or feature branches can give you back a lot of data. Use tf destroy to get rid of them.
You may also want to look for checked-in NuGet package folders; those can quickly rack up a lot of space as well.
Test Attachments
Oh yes, if you use test attachments, these can grow like crazy. Depending on your TFS version, either use the built-in test attachment cleanup features or use the Test Attachment Cleaner from the TFS Power Tools.
XAML Builds
The build definitions themselves won't take a lot of db space, but the build results may. But those have been covered in a previous section.
Git Repositories
You may have data in your Git repositories that is no longer accessible due to force pushes or deleted branches. It's also possible that certain data in Git could be packed more efficiently. To clean your repositories, you have to clone them locally, clean them up, delete the remote repo from TFS and push the cleaned copy to a new repository (you can use the same name as the old one). Doing this will break references from existing build definitions, and you will have to fix these up. While you're at it, you could also run the BFG Repo-Cleaner and convert the repositories to enable Git-LFS support, to handle large binary files in your repositories more elegantly.
git clone --mirror <<repo>>
# optionally run BFG Repo-Cleaner at this point
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git repack -adf
# Delete and recreate the remote repository with the same name
git push origin --all
git push origin --tags
Work item (attachments)
Work items can gather up a considerable amount of data, especially when people start attaching large attachments to them. You can use witadmin destroywi to delete work items with unreasonably large attachments. To retain the work item but delete its attachments, you can delete the attachments from the current work item and then clone it. After cloning, destroy the old work item to allow the attachments to be cleaned up.
Old work items that you no longer need (say, the sprint items from 6 years ago) can also be deleted. My colleague Rene has a nice tool that allows you to bulk-destroy by first creating the appropriate work item query.
Be sure to run the cleanup jobs
TFS often doesn't directly prune data from the database; in many cases it just marks stuff as deleted for later processing. To force the cleanup to happen immediately, run the following stored procedures on your Project Collection database:
EXEC prc_CleanupDeletedFileContent 1
-- You may have to run the following command multiple times. The last
-- parameter is the batch size; if there are more items to prune than the
-- number passed in, you will have to run it again.
EXEC prc_DeleteUnusedFiles 1, 0, 100000
Other useful queries
To identify how much data is stored in each section, there are a few useful queries you can run. The actual query depends on your TFS version, but since you're preparing for migration I suspect you're on TFS 2017 or 2018 at the moment.
Find the largest tables:
SELECT TOP 10 o.name,
SUM(reserved_page_count) * 8.0 / 1024 SizeInMB,
SUM(CASE
WHEN p.index_id <= 1 THEN p.row_count
ELSE 0
END) Row_Count
FROM sys.dm_db_partition_stats p
JOIN sys.objects o
ON p.object_id = o.object_id
GROUP BY o.name
ORDER BY SUM(reserved_page_count) DESC
Find the largest content contributors:
SELECT Owner =
CASE
WHEN OwnerId = 0 THEN 'Generic'
WHEN OwnerId = 1 THEN 'VersionControl'
WHEN OwnerId = 2 THEN 'WorkItemTracking'
WHEN OwnerId = 3 THEN 'TeamBuild'
WHEN OwnerId = 4 THEN 'TeamTest'
WHEN OwnerId = 5 THEN 'Servicing'
WHEN OwnerId = 6 THEN 'UnitTest'
WHEN OwnerId = 7 THEN 'WebAccess'
WHEN OwnerId = 8 THEN 'ProcessTemplate'
WHEN OwnerId = 9 THEN 'StrongBox'
WHEN OwnerId = 10 THEN 'FileContainer'
WHEN OwnerId = 11 THEN 'CodeSense'
WHEN OwnerId = 12 THEN 'Profile'
WHEN OwnerId = 13 THEN 'Aad'
WHEN OwnerId = 14 THEN 'Gallery'
WHEN OwnerId = 15 THEN 'BlobStore'
WHEN OwnerId = 255 THEN 'PendingDeletion'
END,
SUM(CompressedLength) / 1024.0 / 1024.0 AS BlobSizeInMB
FROM tbl_FileReference AS r
JOIN tbl_FileMetadata AS m
ON r.ResourceId = m.ResourceId
AND r.PartitionId = m.PartitionId
WHERE r.PartitionId = 1
GROUP BY OwnerId
ORDER BY 2 DESC
If file containers are the issue:
SELECT CASE WHEN Container = 'vstfs:///Buil' THEN 'Build'
WHEN Container = 'vstfs:///Git/' THEN 'Git'
WHEN Container = 'vstfs:///Dist' THEN 'DistributedTask'
ELSE Container
END AS FileContainerOwner,
SUM(fm.CompressedLength) / 1024.0 / 1024.0 AS TotalSizeInMB
FROM (SELECT DISTINCT LEFT(c.ArtifactUri, 13) AS Container,
fr.ResourceId,
ci.PartitionId
FROM tbl_Container c
INNER JOIN tbl_ContainerItem ci
ON c.ContainerId = ci.ContainerId
AND c.PartitionId = ci.PartitionId
INNER JOIN tbl_FileReference fr
ON ci.fileId = fr.fileId
AND ci.DataspaceId = fr.DataspaceId
AND ci.PartitionId = fr.PartitionId) c
INNER JOIN tbl_FileMetadata fm
ON fm.ResourceId = c.ResourceId
AND fm.PartitionId = c.PartitionId
GROUP BY c.Container
ORDER BY TotalSizeInMB DESC
I have a Rugged tree object and I want to find out its path (relative to the root) and the id of the commit in which that tree was written. For example:
tree = repo.lookup '7892eeee70c08fae4db63aef7000dea39f883b30' #sha/oid of tree
What operations should I perform on this tree object so that I get its path and commit id?
That information is not stored in the tree at all. Git uses Merkle trees where the parents know what the children trees are, but each tree can be contained in multiple commits (this is the typical situation, as some subdirs are very rarely touched).
A tree may also be accessible through many different paths, if those directories have the same contents.
The only way to figure out where a tree belongs would be to go and look at each commit and recursively look from the root to see if you can find the tree you were given. This is going to be a very expensive operation.
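To make the cost concrete, here is a rough sketch of that exhaustive search over a toy object store. It does not use Rugged; the TREES and COMMITS tables and the paths_to and locate helpers are hypothetical and only mimic how trees reference their children by id.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical object store: tree id -> {entry name: child id}.
# A child id that is itself a key of TREES is a subtree; anything else is a blob.
TREES: Dict[str, Dict[str, str]] = {
    "root1": {"src": "t-src", "README": "b1"},
    "root2": {"src": "t-src", "vendor": "t-src", "README": "b2"},
    "t-src": {"main.c": "b3"},
}
# Hypothetical history: (commit id, root tree id).
COMMITS: List[Tuple[str, str]] = [("c1", "root1"), ("c2", "root2")]


def paths_to(tree_id: str, target: str, prefix: str = "") -> Iterator[str]:
    """Yield every path under `tree_id` at which tree `target` appears."""
    for name, child in TREES[tree_id].items():
        path = prefix + name
        if child == target:
            yield path
        if child in TREES:
            # Descend into subtrees; blobs have nothing below them.
            yield from paths_to(child, target, path + "/")


def locate(target: str) -> List[Tuple[str, str]]:
    """Scan every commit from its root and collect (commit id, path) hits."""
    return [(cid, p) for cid, root in COMMITS for p in paths_to(root, target)]


print(locate("t-src"))
```

The same tree id turns up in both commits, and twice in c2 under different paths, so there is no single answer to "which commit and path does this tree belong to". The sketch does not report a match when the target is a commit's root tree, and on a real repository the walk visits every tree of every commit unless results are cached.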
I would recommend you take a step back and figure out why you think you need to figure out where a tree is reachable from. It sounds like you've already decided many steps and you're asking about a detail, when you should be looking at it from a higher level.
I am a total Hadoop n00b. I am trying to solve the following as my first Hadoop project. I have a million+ sub-folders sitting in an Amazon S3 bucket. Each of these folders has two files. File1 has data as follows:
date,purchaseItem,purchaseAmount
01/01/2012,Car,12000
01/02/2012,Coffee,4
....................
File2 has the information of the customer in the following format:
ClientId:Id1
ClientName:"SomeName"
ClientAge:"SomeAge"
This same pattern is repeated across all the folders in the bucket.
Before I write all this data into HDFS, I want to join File1 and File2 as follows:
Joined File:
ClientId,ClientName,ClientAge,date,purchaseItem,purchaseAmount
Id1,"SomeName","SomeAge",01/01/2012,Car,12000
Id1,"SomeName","SomeAge",01/02/2012,Coffee,4
I need to do this for each and every folder and then feed this joined dataset into HDFS. Can somebody point out how I could achieve something like this in Hadoop? A push in the right direction would be much appreciated.
What comes to mind quickly is an implementation in Cascading.
Figure out a way to turn the rows of File2 into columns programmatically, so that you can iterate over all the folders and transpose each file into a single row (ClientId, ClientName, ClientAge).
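For a single folder, the transposition and the join could look roughly like the sketch below. It is plain Python rather than Cascading, it assumes both files have already been read into strings, and parse_client and join_folder are hypothetical helper names.

```python
from typing import Dict, List


def parse_client(text: str) -> Dict[str, str]:
    """Turn the 'Key:Value' lines of File2 into a dict, keeping values verbatim."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        if line.strip():
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields


def join_folder(purchases_text: str, client_text: str) -> List[str]:
    """Prefix every purchase row of File1 with the client columns of File2."""
    client = parse_client(client_text)
    prefix = ",".join(
        [client["ClientId"], client["ClientName"], client["ClientAge"]]
    )
    rows = [row for row in purchases_text.splitlines() if row.strip()]
    header, data = rows[0], rows[1:]
    return ["ClientId,ClientName,ClientAge," + header] + [
        prefix + "," + row for row in data
    ]


file1 = "date,purchaseItem,purchaseAmount\n01/01/2012,Car,12000\n01/02/2012,Coffee,4\n"
file2 = 'ClientId:Id1\nClientName:"SomeName"\nClientAge:"SomeAge"\n'
for line in join_folder(file1, file2):
    print(line)
```

Because File2 holds a single client per folder, the "join" is really just a constant prefix per folder. That is also why a map-only job (one folder per map task) may be enough, with no reduce-side join needed.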
For just one subfolder:
Perhaps set up two Schemes: a TextDelimited Scheme for File1 and a TextLine Scheme for File2. Set these up as Taps, then wrap each of them in a MultiSourceTap, which concatenates all those files into one Pipe.
At this point you should have two separate MultiSourceTaps, one for all the File1(s) and one for all the File2(s).
Keep in mind that there are some details in between here. It may be best to just set this up for one subfolder, then iterate over the other million subfolders and output to some other area, then use hadoop fs -getmerge to combine all the small output files into one big one.
Keeping with the Cascading theme, you could then construct Pipes that add the subfolder name using new Insert(subfolder_name) inside an Each function, so that both data sets carry a reference to the subfolder they came from. Then join them on that reference using a Cascading CoGroup or a Hive-QL join.
There may be a much easier implementation than this, but this is what comes to mind thinking quickly. :)
TextDelimited,
TextLine,
MultiSourceTap
Have a look at the CombineFileInputFormat.
We have a changeset in which the developer checked in changes to both the source and the target branch: many changes, including renames, in both branches. The merge of the changeset from the source to the target branch goes fine, but the changeset remains in the list of changesets to be merged.
When I now try to merge the changeset again, it says "There are no changes to merge.", and the changeset remains in the queue.
We have tried to use the command line tool to discard the changeset like this:
C:\src\project\sourceBranch>tf merge /discard /recursive /version:C8137~C8137 $/Project/sourceBranch $/Project/targetBranch
This did not help. We have also tried using other options like /force and /baseless with no luck.
What other possibilities are there of getting rid of the changeset among the merge candidates?
OK, so basically you have a changeset with items that belong to two directly related branches. That means the merge of such a changeset works on the "partial changeset" subcomponents of the changeset.
Let me explain it another way:
CS1234 (your changeset)
Partial CS1234A for branch A (say the source branch)
Partial CS1234B for branch B (say the target)
You did a merge from A to B, which merged CS1234A to B.
Now when you attempt a new merge, still from A to B, you still have CS1234 as a candidate, right? Then if you select it, nothing is done, which is totally understandable: you already merged CS1234A, and CS1234B does not belong to the source branch (A).
This looks like a TFS bug to me, one that I have already run into. I thought Microsoft had fixed it with the TFS 2010 RTM; apparently not.
Basically, TFS gives you CS1234 as a candidate because only part of it was merged, but as the other part can't be merged, it doesn't make sense to offer it as a candidate.
What about:
You initiate a merge from B to A (the reverse direction): is CS1234 given as a candidate? My assumption is that if you merge CS1234 from B to A, you won't be bothered by this changeset again when you display the candidates from A to B. But I don't know if that's something you're willing to do.
Anyway, you should file a bug at the Microsoft Connect site.
I moved from Subversion to Microsoft's Team Foundation Server for version control, and it is my understanding that you cannot merge discontinuous change-sets in TFS.
For example, I have a file called "baseline.txt" that looks like this:
line one
Then, I branch the file to a new file called "branch.txt", and then do two check-ins on "baseline.txt" so that it finally looks like this:
line one
line two //checked-in change-set A
line three //checked in change-set B
Now, I want to merge only change-set B into "branch.txt". In other words, I expect "branch.txt" to look like this after the merge:
line one
line three //checked in change-set B
Basically, I want to skip change-set A and merge change-set B. It is possible in Subversion, but in TFS if I want to get changeset-B, I have to also get all change-sets "up-to" B.
Is this true? That's what my experiments show, but "Understanding ChangeSets and Merge with Team Foundation Server" seems to indicate differently.
That article is confusing, and I don't believe it is accurate. When the second change is checked in, it should generate a merge conflict. At that time, you would need to resolve the conflict in one of three ways:
Merge the changes
Overwrite with the new changeset, or
Keep the old, and discard the new changes.
No matter what, when you get ready to merge back to baseline.txt, you have a "point-in-time" version of the file that you're going to check in.