I have a large directory, over 16 GB in size, that contains only CS and Math material; the file types are text, png, pdf, and chm. There are currently two branches: my brother's and mine. The initial files were the same, and I need to compare them. I have tried using Git, but the loading time is very long.
What is the best way to compare two big directories?
[Mixed Solution]
Do a "ls -R > different_files" in both directories [1]
"sdiff <(echo file1 | md5deep) <(echo file2 | md5deep)" [2]
What do you think? Any drawbacks?
[1] thanks to Paul Tomblin
[2] great thanks to all repliers!
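For reference, a working form of those two steps might look like the following sketch (directory names are placeholders; note that md5deep walks a tree itself with -r and prints relative paths with -l, rather than reading names from a pipe):

    # Side-by-side listing comparison, then side-by-side checksum comparison.
    sdiff <(cd dir1 && ls -R) <(cd dir2 && ls -R)
    sdiff <(cd dir1 && md5deep -r -l . | sort -k 2) <(cd dir2 && md5deep -r -l . | sort -k 2)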
Use fslint (see its website). One of the options of the tool is "Duplicates". As per the description from the site:
One of the most commonly used features of FSlint is the ability to find duplicate files. The easiest way to remove lint from a hard drive is to discard any duplicate files that may exist. Often a computer user may not know that they have four, five, or more copies of the exact same song in their music collection under different names or directories. Any file type whether it be music, photos, or work documents can easily be copied and replicated on your computer. As the duplicates are collected, they eat away at the available hard drive space. The first menu option offered by FSlint allows you to find and remove these duplicate files.
How to compare 2 folders without pre-existing commands/products:
Simply create a program that scans each directory and creates a file hash of each file. It outputs a file with each relative file path and the file hash.
Run this program on both folders.
Then you simply compare the two output files to see if they are the same; to do that, just load each into a string and do a string compare.
The hashing algorithm you use doesn't matter. You can use MD5, SHA, CRC, ...
You could also use the file size in the output files to help reduce the chance of collisions.
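A minimal sketch of that idea with standard tools, assuming GNU find, md5sum, and diff are available (directory names are placeholders):

    # Hash every file in each tree, keyed by its path relative to the tree root.
    (cd dir1 && find . -type f -exec md5sum {} + | sort -k 2) > dir1.sums
    (cd dir2 && find . -type f -exec md5sum {} + | sort -k 2) > dir2.sums
    # No output means the trees are identical; otherwise diff shows changed, added, and removed files.
    diff dir1.sums dir2.sums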
How to compare 2 folders with pre-existing commands/products:
Now, if you just want an existing program that does it, use diff -r, or WinDiff on Windows-based systems.
Use md5deep to create recursive md5sum listings of every file in those directories.
You can then use a diff tool to compare the generated listings.
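For example (a sketch with placeholder directory names; -r makes md5deep recurse and -l keeps the paths relative so the two listings line up):

    (cd dir1 && md5deep -r -l .) | sort -k 2 > dir1.md5
    (cd dir2 && md5deep -r -l .) | sort -k 2 > dir2.md5
    # Any diff tool will do for the listings, e.g.:
    diff -u dir1.md5 dir2.md5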
Are you just trying to discover what files are present in one that aren't in the other, and vice versa? A couple of suggestions:
Do a "ls -R" in both directories, redirect to files, and diff the files.
Do a "rsync -n" between them to see what rsync would have to copy if it were to be allowed to copy. (-n means don't do the rsync, just show you what it would do if you ran it without the -n)
I would do the diffing by comparing the output of md5sum * | sort for each directory.
That will point you to the files that are different or missing.
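A one-line sketch of this (placeholder directory names; note that * only covers the top level, so nested folders need one of the recursive approaches above):

    diff <(cd dir1 && md5sum * | sort) <(cd dir2 && md5sum * | sort)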
I know this question has already been answered, but if you are not into writing such a tool yourself, there is a well-established open source project named tardiff on SourceForge that does essentially exactly what you want, and it even supports automated creation of patches (in tar format, obviously) to account for differences.
Hope this helps
I've got a genrule that produces some output files but the tool I'm using needs to know where to put the files.
So far, I've been able to get it working by using dirname $(location outputfile), but this seems like a very fragile solution.
You can read about which make variables are available in a genrule here:
https://docs.bazel.build/versions/master/be/make-variables.html
In particular:
@D: The output directory. If there is only one filename in outs, this expands to the directory containing that file. If there are multiple filenames, this variable instead expands to the package's root directory in the genfiles tree, even if all the generated files belong to the same subdirectory! If the genrule needs to generate temporary intermediate files (perhaps as a result of using some other tool like a compiler) then it should attempt to write the temporary files to @D (although /tmp will also be writable), and to remove any such generated temporary files. Especially, avoid writing to directories containing inputs - they may be on read-only filesystems, and even if they aren't, doing so would trash the source tree.
In general, if the tool lets you (or if you're writing your own tool), it's best to give the tool the individual input and output file names. For example, if the tool understands inputs only as directories, that's usually fine if the directory contains only the things you want; if it doesn't, you have to rely on sandboxing to show the tool only the files you want, or you have to manually create temporary directories. Treating outputs as directories gives you less control over what the outputs are named, and you still have to enumerate the files in the genrule's outs.
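To make that concrete, here is a minimal sketch of a genrule whose command hands the tool its output directory via $(@D); the rule name, file names, and the //tools:gen_docs tool are hypothetical, not taken from the question:

    genrule(
        name = "generated_docs",
        srcs = ["manual.md"],
        outs = [
            "docs/manual.html",
            "docs/manual.css",
        ],
        # With more than one file in outs, $(@D) is the package's output root,
        # so the tool must still write to the exact relative paths listed in outs.
        cmd = "$(location //tools:gen_docs) --src $(location manual.md) --out $(@D)/docs",
        tools = ["//tools:gen_docs"],
    )

If there is only a single file in outs, $(@D) is simply that file's directory, which matches the dirname $(location ...) trick from the question.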
I have a TFS workspace which I need to move to my new PC. I have copied the whole folder structure over and ensured that the workspace is mapping to the correct folders. However the "Latest" column for every file displays as "Not downloaded". How can I reconcile this such that TFS is aware that the files match the server version?
The standard answer seems to be to re-download the whole thing. Unfortunately the repository is huge, my connection is unreliable, and I have monthly download quotas. Is there anything in the command-line tools or power tools that can make it compare file hashes or similar and realise that the files are identical?
Thanks.
There's a binary metadata file inside the working copy that stores the mapping of every path in your repo to the path on the filesystem.
It uses absolute paths - so unless your new project folder occupies the exact same location as it did on the original computer, they won't match.
Because it's a binary format, you can't do something simple like mass replace the paths with a text editor or sed.
We are using TFS to build our solutions. We have some help files that we don't include in our projects as we don't want to grant our document writer access to the source. These files are placed in a folder on our network.
When the build kicks off we want the process to grab the files from the network location and place them into a help folder that is part of source.
I have found an activity in the xaml for the build process called CopyDirectory. I think this may work but I'm not sure what values to place into the Destination and Source properties. After each successful build the build is copied out to a network location. We want to copy the files from one network location into the new build directory.
I may be approaching this the wrong way, but any help would be much appreciated.
Thanks.
First, you might want to consider having your documentation author place his documents in TFS. You can give him access to a separate folder or project without granting access to your source code. The advantages of this are:
Everything is in source control. Files dropped in a network folder are easily misplaced or corrupted, and you have no history of changes to them. The ideal for any project is that everything related to the project is captured in source control so you can lift out a complete historical version whenever one is needed.
You can map the documentation to a different local folder on your build server such that simply executing the "get" of the source code automatically copies the documentation exactly where it's needed.
The disadvantage is that you may need an extra CAL for him to be able to do this.
Another (more laborious) approach is to let him save to the network location, and have a developer check the new files into TFS periodically. If the docs aren't updated often this may be an acceptable compromise.
However, if you wish to copy the docs from the network during your build, you can use one of the MSBuild Copy commands (as you are already aware), or you can use Exec. The copy commands are more complicated to use because they are often populated with filename lists generated from the outputs of other build targets, and are usually used with solution-relative pathnames. But if you're happy with DOS commands (xcopy/robocopy), you may find it much easier just to use Exec to run an xcopy/robocopy command. You can then "develop" and test the xcopy command outside the MSBuild environment and then just paste it into the MSBuild script with confidence that it will work - much easier than trialling copy settings as part of your full build process.
Exec is documented here. The example shows pretty well how to do what you want, but in your case you can probably just replace the Command attribute with the entire xcopy/robocopy command (or even the name of a batch file) you want to use, so you won't need to set up the ItemGroup etc.
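As a sketch (the share and local paths are placeholders, not taken from the question), the command you would develop and test outside MSBuild might look like this:

    rem Copy the help files from the network share into the Help folder of the sources being built.
    robocopy \\fileserver\TeamDocs\Help C:\Builds\MyProject\Sources\Help /E /NP /R:2 /W:5

One thing to watch: robocopy reports success with non-zero exit codes (1-7), which Exec treats as a failure by default, so you may need Exec's IgnoreExitCode attribute or an explicit exit-code check.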
We have an ASP.NET MVC project that we want to create a publish package from during an automated build. The build is using the unmodified default template with Arguments /p:DeployOnBuild=True /p:CreatePackageOnPublish=True.
If I do a WebDeploy directly to a server it works fine (if I change /p:CreatePackageOnPublish to false), but I would prefer to just create a package that I can deploy during a Lab build.
The error message looks like this:
TF270002: An error occurred copying files from 'C:\Builds\19\Binaries'
to '\\nas\Build\Drop\MyProject\MyProject_Development.Test\20120209.1'.
Details: The specified path, file name, or both are too long. The
fully qualified file name must be less than 260 characters, and the
directory name must be less than 248 characters.
The first part of the problem was the build folder path was too long (274 characters) but after changing the working directory from $(SystemDrive)\Builds\$(BuildAgentId)\$(BuildDefinitionPath) to $(SystemDrive)\Builds\$(BuildDefinitionId) it's down to 230 characters as the longest path so it should be ok.
The problem now seems to be the path in the drop folder: even though its root path is not that long by itself, \\nas\Build\Drop\MyProject, the build name and Build Number Format quickly add to the length, MyProject_Development.Test\MyProject_Development.Test_20120208.1. After that, all the nested paths create really deep folder structures: _PublishedWebsites\MyProject.Web_Package\Archive\Content\C_C\Builds\19\Sources\MyProject\Source\MyProject.Web\obj\Debug\Package\PackageTmp\Content\ui-lightness\Images\ui-bg_diagonals-thick_18_b81900_40x40.png.
So is there any way to get around this problem? I shortened the build number format from $(BuildDefinitionName)_$(Date:yyyyMMdd)$(Rev:.r) to $(Date:yyyyMMdd)$(Rev:.r) to save a few characters, but it's not enough. I guess we could shorten the build name a bit, but that would break the naming convention used (OK, that would not be a really big problem, but it would be annoying!), and it would still feel like a short-term solution.
What else is there to do?
The short answer is the path length limitation is really annoying, and you're going to have to spend some (more) time tweaking your file/folder structure to make this work.
For example, instead of \\nas\Build\Drop\MyProject, just use \\nas\Build\Drop (or \\nas\Builds), since the project name is also in the build name.
Flatten the folder structure in your projects (do you really need a Source folder under MyProject?).
Also, go vote for the UserVoice suggestion for the TFS team to fix the path length limitations: http://visualstudio.uservoice.com/forums/121579-visual-studio/suggestions/2156195-fix-260-character-file-name-length-limitation
I know the question is old, but I faced the same problem and devised a solution to it, although it errs more toward preventing the problem from ever occurring than fixing an existing path-length condition. It can then be applied once the issue has been manually resolved.
Please note that it applies to TFS with Git. A similar approach could be devised for TFVC, although it would have to run after the code is merged.
Essentially, it's a short script to be run as part of the PR build. It enforces that no file added or modified has a path longer than the one you allow.
It is described in this blog post
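The idea can be sketched roughly like this for a Git-based PR build (a sketch only; the length limit and the target branch are assumptions, and the actual script is the one described in the blog post):

    # Fail the PR build if any file added or modified by the PR has a path longer than the limit.
    MAX_PATH_LENGTH=200
    git diff --name-only --diff-filter=AM origin/master...HEAD |
        awk -v max="$MAX_PATH_LENGTH" 'length($0) > max { print "Path too long: " $0; bad = 1 } END { exit bad }'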
I like to back up the source code set for a project when I release a version. I use GExperts project backups, which seems to gather up all the files in the Project Manager into the ZIP file. You can also add arbitrary files to this file set, but I'm always conscious of the fact that I haven't necessarily got all the files. Unless I specifically go through the uses clauses and add all the units I have sources for to the project, I'll never be sure of storing all the files necessary to recreate the installable/executable.
I've thought about rolling an app to traverse a project, following all the units used and looking down all the search paths and seeing if there is a source file available for that unit, and building a list of files to back up that way, but hey - maybe someone has already done the work?
You should (highly recommended) look into version control,
e.g. SVN (Subversion) or CVS.
This will allow you to control revisions of all of your source. It will allow you to add or remove source files, roll back, merge, and do all the other nice things related to managing project sources.
This WILL save your a$%# one day.
You can interpret your question in two ways:
How can I make sure that I backup at least enough files so I can build the project
How can I make sure that I backup not too many files so I can still build the project
The first is to make sure you can build the system at all, the second to allow you to clean up unused files.
For both, a version control system including a separate build system is the way to go.
Then, for each new set of changes, you can use these steps to ensure that both conditions hold:
On your daily development system, check in the new revision of your source code into your version control system.
On your separate build system, get the latest version from your version control system.
Build the project on the build system; if this fails, go back to step 1 and add the missing files to your version control system from your development system.
Start removing (one-by-one) files from the project that you suspect are not needed, then rebuild until it fails.
When the build fails, restore that particular file from the version control system, then continue step 3 with the next candidate
When the build succeeds, you have the minimum set of files.
Now make a difference overview of the files in your version control system and the files on the build machine.
Mark the files that are in your version control system but not on your build machine as deprecated or deleted.
Most version control systems have good ways of generating a difference between the files on your development or build system and the files in the version control system (usually fine-grained, for each point in time at which you added/removed/updated files in your version control system).
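For example, if you use Subversion (suggested in another answer), that difference overview can be as simple as running svn status in the working copy on the build machine:

    # '?' marks files present locally but not under version control;
    # '!' marks files under version control that are missing locally.
    svn status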
The reason you want a separate build system (or two separate development systems) is that you want them to be independent: you use one for developing, and the other for checking if the build is still OK.
This is a first step; in the future you might want to extend it into a continuous integration system (one that runs unit tests, automatically creates product setups and much more).
--jeroen
I'm not sure if you're asking about version control or how to be sure you've got all the files.
One useful utility I run occasionally is a program that makes a DirList of all of the files in my dcu output folder. Changing the extensions from .dcu to .pas gives me a list of all of the source code files.
Of course it misses .inc files and other non-.pas files, but perhaps this line of thinking would be helpful to you in some way?
The value of this utility to me is that a second housekeeping utility program then makes a list of all .pas files in my source tree that do not have corresponding .dcu files. This (after a full compile of all programs) generally reveals some "junk" .pas files that are no longer in use.
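A rough sketch of that first utility as a one-liner at a Windows command prompt (the DCU folder path is a placeholder; double the % signs if you put this in a batch file):

    rem For every compiled unit, emit the name of the corresponding source file.
    for /f "delims=" %f in ('dir /b C:\MyProject\dcu\*.dcu') do @echo %~nf.pas >> sources.txt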
To get a list of all units compiled into an executable, you can let the compiler generate a MAP file. This file will contain entries for all the units used.
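For example (a sketch; the project name is a placeholder), you can ask the command-line compiler for a detailed map file:

    rem -GD produces MyProject.map next to the output; its segment listing names every unit that was linked in.
    dcc32 -GD MyProject.dpr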