Our TFSVersionControl database has grown significantly in the past couple years, and is edging on 80GB. Unfortunately, we're in an environment where every gig of data storage is internally charged at a high rate, so there's lots of focus on keeping storage growth to a minimum.
I believe the majority of growth is happening because we chose to store binary files in our repository. This is something we will be remedying in the medium term.
In the short-term, there are a few places where we do not need to keep a history of our binaries. Particularly in our mainline branch and our development branch, so we're looking into doing a TF Destroy on these binaries and recreating them as part of the upcoming release.
What I'd like to know is: Is there any way to run a query against the TFSVersionControl database to understand which files are storing deltas that are over a given size?
Ideally, what I'd like to know is for a given path (item spec), for each file, the base size, and the total size of the deltas.
I think this page may be what you're looking for.
Just like asking someone else will often drive you to find your own answer, I did some additional digging, and came up with this:
select ver.VersionFrom, ver.Command, ver.ChildItem, tf.*, ct.CreationDate, ct.OffsetFrom, ct.OffsetTo, DataLength(ct.Content) as Size
from tbl_version ver with (nolock)
inner join tbl_file tf with (nolock) on tf.FileId = ver.FileId
inner join tbl_content ct with (nolock) on ct.FileId = tf.fileid
where parentpath = '$\ProjectName\Branch\Folder\'
ORDER BY ver.ChildItem, Ver.VersionFrom
--where fullpath = '$\ProjectName\Branch\Folder\FileName.cs\'
The query as written will iterate through all files in a particular path and will retrieve a record per checkin. The calculated Size field will show you the size in bytes of the delta. I'm not sure if this is compressed size or "actual" size.
The commented "where" statement will show you the same for an indvidual file.
Note that the typical forward slashes ("/") are stored in the database as backslashes ("\"), and there is always a trailing backslash at the end.
If you pull this data into Excel, you can quickly create a pivot table on it to calculate the sizes (or you can add them up manually).
Related
I am trying to find out what indices TDB2 builds. I found out by the code that it uses B+ trees to store them on disc but I didn't get what they contain and how they are used.
So my detailed questions are:
For which collation order of RDF triples (like SPO, SOP, POS, PSO, ... ) does it build indices?
How are RDF Terms encoded and stored?
What strategy is used to load the indices into main memory? (I would expect paging)?
It would also help me if you could point me to a white paper or something similar about TDB2's software design. I searched for it but couldn't find anything.
TDB2 has a "id" for each RDF term (literal's URIs, blank nodes). The id is a fixed length 64. Another way of say ting is it keeps a dictionary.
For triples it keeps SPO, POS, and OSP (this is configurable but that's the default). A triple is stored in an index as those ids - so 3 ids per triple. Fixed length.
Indexes are memory mapped files outside the heap by default. They provide the good usability.
That's the current default setup. The code isolates changes e.g. 64 bit ids could be longer, different index choices made.
I have SPSSmodeler stream which is now used and updated every week constantly to generate a certain dataset. A raw data for this stream is also renewed on a weekly basis.
In part of this stream, there is a chunk of nodes that were necessary to modify and update manually every week, and the sequence of this part is below: Type Node => Restructure Node => Aggregate Node
To simplify the explanation of those nodes' role, I drew an image of them as bellow.
Because the original raw data is changed weekly basis, the range of Unit value above is always varied, sometimes more than 6 (maybe 100) others less than 6 (maybe 3). That is why somebody has to modify there and update those chunk of nodes on a weekly basis until now. *Unit value has a certain limitation (300 for now)
However, now we are aiming to run this stream automatically without touching any human operations on it that we need to customize there to work perfectly, automatically. Please help and will appreciate your efforts, thanks!
In order to automatize, I suggest to try to use global nodes combined with clem scripts inside the execution (default script). I have a stream that calculates the first date and the last date and those variables are used to rename files at the end of execution. I think you could use something similar as explained here:
1) Create derive nodes to bring the unit values used in the weekly stream
2) Save this information in a table named 'count_variable'
3) Use a Global node named Global with a query similar to this:
#GLOBAL_MAX(variable created in (2)) (only to record the number of variables. The step 2 created a table with only 1 values, so the GLOBAL_MAX will only bring the number of variables).
4) The query inside the execution tab will be similar to this:
execute count_variable
var tabledata
var fn
set tabledata = count_variable.output
set count_variable = value tabledata at 1 1
execute Global
5) You now can use the information of variables just using the already creatde "count_variable"
It's not easy to explain just by typing, but I hope to have been helpful.
Please mark as +1 in this answer if it was relevant one.
I think there is a better, simpler and more effective (yet risky, due to node's requirements to input data) solution to your problem. It is called Transpose node and does exactly that - pivot your table. But just from version 18.1 on. Here's an example:
https://developer.ibm.com/answers/questions/389161/how-does-new-feature-partial-transpose-work-in-sps/
In the past I had to work with big files, somewhere about in the 0.1-3GB range. Not all the 'columns' were needed so it was ok to fit the remaining data in RAM.
Now I have to work with files in 1-20GB range, and they will probably grow as the time will pass. That is totally different because you cannot fit the data in RAM anymore.
My file contains several millions of 'entries' (I have found one with 30 mil entries). On entry consists in about 10 'columns': one string (50-1000 unicode chars) and several numbers. I have to sort the data by 'column' and show it. For the user only the top entries (1-30%) are relevant, the rest is low quality data.
So, I need some suggestions about in which direction to head out. I definitively don't want to put data in a DB because they are hard to install and configure for non computer savvy persons. I like to deliver a monolithic program.
Showing the data is not difficult at all. But sorting... without loading the data in RAM, on regular PCs (2-6GB RAM)... will kill some good hours.
I was looking a bit into MMF (memory mapped files) but this article from Danny Thorpe shows that it may not be suitable: http://dannythorpe.com/2004/03/19/the-hidden-costs-of-memory-mapped-files/
So, I was thinking about loading only the data from the column that has to be sorted in ram AND a pointer to the address (into the disk file) of the 'entry'. I sort the 'column' then I use the pointer to find the entry corresponding to each column cell and restore the entry. The 'restoration' will be written directly to disk so no additional RAM will be required.
PS: I am looking for a solution that will work both on Lazarus and Delphi because Lazarus (actually FPC) has 64 bit support for Mac. 64 bit means more RAM available = faster sorting.
I think a way to go is Mergesort, it's a great algorithm for sorting a
large amount of fixed records with limited memory.
General idea:
read N lines from the input file (a value that allows you to keep the lines in memory)
sort these lines and write the sorted lines to file 1
repeat with the next N lines to obtain file 2
...
you reach the end of the input file and you now have M files (each of which is sorted)
merge these files into a single file (you'll have to do this in steps as well)
You could also consider a solution based on an embedded database, e.g. Firebird embedded: it works well with Delphi/Windows and you only have to add some DLL in your program folder (I'm not sure about Lazarus/OSX).
If you only need a fraction of the whole data, scan the file sequentially and keep only the entries needed for display. F.I. lets say you need only 300 entries from 1 million. Scan the first first 300 entries in the file and sort them in memory. Then for each remaining entry check if it is lower than the lowest in memory and skip it. If it is higher as the lowest entry in memory, insert it into the correct place inside the 300 and throw away the lowest. This will make the second lowest the lowest. Repeat until end of file.
Really, there are no sorting algorithms that can make moving 30gb of randomly sorted data fast.
If you need to sort in multiple ways, the trick is not to move the data itself at all, but instead to create an index for each column that you need to sort.
I do it like that with files that are also tens of gigabytes long, and users can sort, scroll and search the data without noticing that it's a huge dataset they're working with.
Please finde here a class which sorts a file using a slightly optimized merge sort. I wrote that a couple of years ago for fun. It uses a skip list for sorting files in-memory.
Edit: The forum is german and you have to register (for free). It's safe but requires a bit of german knowledge.
If you cannot fit the data into main memory then you are into the realms of external sorting. Typically this involves external merge sort. Sort smaller chunks of the data in memory, one by one, and write back to disk. And then merge these chunks.
I have a text file sized 300MB, I want to count the occurrences of each 10,000 substrings in the file. I want to know how to do it fast.
Now, I use the following code:
content = IO.read("path/to/mytextfile")
Word.each do |w|
w.occurrence = content.scan(w.name).size
w.save
end
Word is an ActiveRecord class.
It took me almost 1 day to finish the counting. Is there anyway to do it faster? Thanks.
Edit1:
Thank you again. I am running rails 2.3.9. The name filed of words table contains what I am searching for, and it contains only unique values. Instead of using Word.each, I use batch(1000 rows a time) load. It should help.
I rewrited the whole code with the idea from bpaulon. Now it only took a few hours to finish the counting.
I profiled the new version code, now the largest time costing methods are utf8 encode supported string truncating code
def truncate(n)
self.slice(/\A.{0,#{n}}/m)
end
and characters counting code
def utf8_length
self.unpack('U*').size
end
Any other faster methods to replace them?
Your use of scan creates an array, counts the size of it, then throws it away. If you have a lot of occurrences of the substring inside a big file, you will create a big array temporarily, potentially burning up CPU time with memory management, but that should still run pretty quickly, even with 300MB.
Because Word is an ActiveRecord class, it is dependent on the schema and any indexes in your database, plus any issues your database server might be having. If the database is not optimized or is responding slowly or the query used to retrieve the data is not efficient, then the iteration will be slow. You might find it a lot faster to grab groups of Word so they are in RAM, then iterate over them.
And, if the database and your code are running on the same machine, you could be suffering from resource constraints like having only one drive, not enough RAM, etc.
Without knowing more about your environment and hardware it's hard to say.
EDIT:
I can grab the substrings into an array/hash first, then add the count results to the array or hash, and write the results back to database after all the counting is done. You think it be faster, right?
No, I doubt that will help a lot, and, without knowing where the problem lies all you might do is make the problem worse because you'll have to load 10,000 records as objects from the database, then build a 10,000 element hash or array which will also be in memory along with the DB records, then write them out.
Ruby will only use a single core, currently, but you can gain speed by using Ruby 1.9+. I'd recommend installing RVM and letting it manage your Ruby. Be sure to read the instructions on that page, then run rvm notes and follow those directions.
What is your Word model and the underlying schema and indexes look like? Is the database on the same machine?
EDIT: From looking at your table schema, you have no indexes except for id which really won't help much for normal look-ups. I'd recommend presenting your schema on Stack Overflow's sibling site https://dba.stackexchange.com/ and explain what you want to do. At a minimum I'd add a key to the text fields to help avoid full table scans for any searches you do.
What might help more is to read: Retrieving Multiple Objects in Batches from "Active Record Query Interface".
Also, look at the SQL being emitted when your Word.each is running. Is it something like "select * from word"? If so, Rails is pulling in 10,000 records to iterate over them one by one. If it is something like "select * from word where id=1" then for every record you have a database read followed by a write when you update the count. That is the scenario that the "Retrieving Multiple Objects in Batches" link will help fix.
Also, I am guessing that content is the text you are searching for, but I can't tell for sure. Is it possible you have duplicated text values causing you to do scans more than once for the same text? If so, select your records using a unique condition on that field and then update your counts for all matching records at one time.
Have you profiled your code to see if Ruby itself can help you pinpoint the problem? Modify your code a little to process 100 or 1000 records. Start the app with the -r profile flag. When the app exits profiler will output a table showing where time was spent.
What version of Rails are you running?
I think you could approach this problem differently
You do not need to scan the file this many times, you could create a db, like in mongo or mysql, and for each word you find, you fetch the db for it and then adds on some "counter" field.
You could ask me "but then I will have to scan my database a lot and it could take a lot more". Well, sure you wouldn't ask this, but it won't take more time because databases are focused in IO, besides you could always index it.
EDIT: There is no way to delimit at all?? Let's say that where you have the a Word.name string you really holds a (not simple) regex. Could the regex contain the \n? Well, if the regex can contain any value, you should estimate the maximum size of string the regex can fetch, double it, and scan the file by that ammount of chars but moving the cursor by that number.
Lets say your estimate of the maximum your regex could fetch it is like 20 chars nad your file has from 0 to 30000 chars. You pass each regex you have from 0 to 40 chars, then again from 20 to 60, from 40 to 80, etc...
You should also hold the position you found of your smaller regex so it wouldn't repeat it.
Finally, this solution seems to be not worth the effort, your problem may have a greater solution based on what that regexes are, but it will be faster than invoke scan Words.count times your your 300Mb string.
You could load your entire "Word" table into a Trie, then do back-tracking since you said there are no delimiters in the text.
So for each character in the text, go down the Trie of words. If you hit a word, increment its count. "Going down the trie" involves three cases:
There's no node at this character. (If you're mid-search, pop the back-tracking stack)
There's a node at this character. (But it's not a Word)
There's a node at this character. (It's a Word - increment and "dirty")
Back-tracking is just keeping track of places you want to go after you've exhausted this "search" of the Trie, which is when you run out of nodes to visit. This will probably be each character you visit that is a root of the Trie.
After you've done this, you can then visit all the nodes you changed and just update the records they represent.
This will take some time to implement, but will surely be faster than each & scan.
I've been thinking over some branching strategies (creating branches per feature, maybe per developer since we're a small group) and was wondering if anyone had experienced any issues. Does creating a branch take up much space?
Last time I looked, TFS uses copy-on-write, which means that you won't increase disk space until you change files. It's kind of like using symlinks until you need to change things.
James is basically correct. For a more complete answer, we need to start with Buck's post from back in 2006: http://blogs.msdn.com/buckh/archive/2006/02/22/tfs_size_estimation.aspx
Each new row in the local version table adds about 520 bytes (one row gets added for each workspace that gets the newly added item, and the size is dominated by the local path column). If you have 100 workspaces that get the newly added item, the database will grow by 52 KB. If you add 1,000 new files of average size (mix of source files, binaries, images, etc.) and have 100 workspaces get them, the version control database grows by approximately 112 MB (60 KB * 1,000 + 520 * 1,000 * 100).
We can omit the 60KB figure since branched items do not duplicate file contents. (It's not quite "copy-on-write," James -- an O(N) amount of metadata must be computed and stored during the branch operation itself, vs systems like git which I believe branch in O(1) -- but you're correct that the new item points to the same record in tbl_Content as the source item until it's edited). That leaves us with merely the 520 * num_workspaces * files_per_workspace factor. On the MS dogfood server there are something like 2 billion rows in tbl_LocalVersion, but in a self-described small group it should be utterly negligible.
Something Buck's blog does not mention is merge history. If you adopt a branch-heavy workflow and stick with it through several development cycles, it's likely tbl_MergeHistory will grow nearly as big as tbl_LocalVersion. Again, I doubt it will even register on a small team's radar, but on large installations you can easily amass hundreds of millions of rows. That said, each row is much smaller since there are no nvarchar(260) fields.