I've been thinking over some branching strategies (creating branches per feature, maybe per developer since we're a small group) and was wondering if anyone had experienced any issues. Does creating a branch take up much space?
Last time I looked, TFS uses copy-on-write, which means that you won't increase disk space until you change files. It's kind of like using symlinks until you need to change things.
James is basically correct. For a more complete answer, we need to start with Buck's post from back in 2006: http://blogs.msdn.com/buckh/archive/2006/02/22/tfs_size_estimation.aspx
Each new row in the local version table adds about 520 bytes (one row gets added for each workspace that gets the newly added item, and the size is dominated by the local path column). If you have 100 workspaces that get the newly added item, the database will grow by 52 KB. If you add 1,000 new files of average size (mix of source files, binaries, images, etc.) and have 100 workspaces get them, the version control database grows by approximately 112 MB (60 KB * 1,000 + 520 * 1,000 * 100).
We can omit the 60KB figure since branched items do not duplicate file contents. (It's not quite "copy-on-write," James -- an O(N) amount of metadata must be computed and stored during the branch operation itself, vs systems like git which I believe branch in O(1) -- but you're correct that the new item points to the same record in tbl_Content as the source item until it's edited). That leaves us with merely the 520 * num_workspaces * files_per_workspace factor. On the MS dogfood server there are something like 2 billion rows in tbl_LocalVersion, but in a self-described small group it should be utterly negligible.
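To put rough numbers on the branching case itself, here's a back-of-envelope sketch reusing the 1,000-file, 100-workspace example from Buck's post and the 520-byte row figure:
# Branching doesn't duplicate file contents, so the growth is just
# the local-version metadata rows.
bytes_per_row  = 520      # approximate size of one tbl_LocalVersion row
files_branched = 1_000    # items in the new branch (Buck's example figure)
workspaces     = 100      # workspaces that get the branch (Buck's example figure)
puts "#{bytes_per_row * files_branched * workspaces / 1_000_000.0} MB"  # => 52.0 MB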
Something Buck's blog does not mention is merge history. If you adopt a branch-heavy workflow and stick with it through several development cycles, it's likely tbl_MergeHistory will grow nearly as big as tbl_LocalVersion. Again, I doubt it will even register on a small team's radar, but on large installations you can easily amass hundreds of millions of rows. That said, each row is much smaller since there are no nvarchar(260) fields.
I have 93 arrays. Each array has about 18 values on average.
I need to make a product of these arrays.
So I have a two-dimensional array that stores these 93 arrays.
Here is what I tried:
DATASET.first.product(*DATASET[1..-1])
Ruby returns
RangeError: too big to product
Does anyone know a workaround to get around this?
Some way to chunk them?
What you want is impossible.
The product of 93 arrays with ~18 elements each is an array with approximately 549975033204266172374216967425209467080301768557741749051999338598022831065169332830885722071173603516904554174087168 elements, each of which is a 93-element array.
This means you need 549975033204266172374216967425209467080301768557741749051999338598022831065169332830885722071173603516904554174087168 * 93 * 64bit of memory to store it, which is roughly 409181424703974032246417423764355843507744515806959861294687507916928986312485983626178977220953161016576988305520852992 bytes. That is about 40 orders of magnitude more than the number of particles in the universe. In other words, even if you were to convert the entire universe into RAM, you would still need to find a way to store on the order of 827180612553027 yobibyte on each and every particle in the universe; that is about 6000000000000000000000000 times the information content of the World Wide Web and 10000000000000000000000 times the information content of the dark web.
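To see the scale for yourself without materializing anything, just multiply the array sizes (this assumes DATASET is your array of 93 arrays):
combinations = DATASET.map(&:size).reduce(:*)
puts combinations           # a ~117-digit number for 93 arrays of ~18 entries each
puts combinations * 93 * 8  # rough bytes needed: 93 elements per combination, ~8 bytes each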
Does anyone know a workaround to get around this? Some way to chunk them?
Even if you process them in chunks, that doesn't change the fact that you still need to process 51147678087996754030802177970544480438468064475869982661835938489616123289060747953272372152619145127072123538190106624 elements. Even if you were able to process one element per CPU instruction (which is unrealistic, you will probably need dozens if not hundreds of instructions), and even if each instruction only takes one clock cycle (which is unrealistic, on current mainstream CPUs, each instruction takes multiple clock cycles), and even if you had a terahertz CPU (which is unrealistic, the fastest current CPUs top out at 5 GHz), and even if your CPU had a million cores (which is unrealistic, even GPUs only have a couple of thousand extremely simple cores), and even if your motherboard had a million sockets (which is unrealistic, mainstream motherboards only have a maximum of 4 sockets, and even the biggest supercomputers only have 10 million cores in total), and even if you had a million of those computers in a cluster, and even if you had a million of those clusters in a supercluster, and even if you had a million friends that also have a supercluster like this, it would still take you about 1621000000000000000000000000000000000000000000000000000000000000000000 years to iterate through them.
Right, so now that it is hopefully clear that this should not be attempted, I'll take a risk and try to solve your actual problem.
You've mentioned in the comments that you need this array for property testing - I'll take a massive leap of faith here and assume you want to test that every possible combination satisfies some conditions - and that is the mistake, as the number of possible combinations is just... large...
Instead, you can test that some of the combinations work. You can easily generate a short, randomized list of combinations using:
Array.new(num) { DATASET.map(&:sample) }
where num is the number of combinations you want to test. Note that there is a chance that some of the entries will be duplicated - but given your dataset size, the chances are comparable to UUIDs colliding and can be safely ignored.
Generating such a subset of possible solutions is much easier, faster and, most importantly, possible. Since the output is randomized, it will test slightly different combinations on each run, so remember to have some randomization setup (e.g. a recorded seed) in your test suite if you want to be able to recreate failures.
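As an example of how this could be used in a test (check_property here is a hypothetical stand-in for whatever assertion your property test actually makes):
num = 1_000  # number of random combinations to exercise per run
Array.new(num) { DATASET.map(&:sample) }.each do |combination|
  # check_property is hypothetical -- replace it with your real check.
  raise "property failed for #{combination.inspect}" unless check_property(combination)
end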
How does the size of a Realm file develop?
To start with: I have a Realm file with several properties, one of them being an array of 860 entries, where each array entry in turn consists of a couple of properties.
One array-property states the name of the entry.
I observed the following:
If the name property says "Criteria_A1" (up to "Criteria_A860"), the Realm file is 1.6 MB.
If the name property says "A1" (up to "A860"), the Realm file is only 786 kB.
Why do the extra letters in the name property make the Realm file this much bigger?
A second observation:
If I add more objects (each again having an array with 860 entries), the file size becomes 1.6 MB again (no matter how many objects I add; I guess until another critical value is reached where the size triples... or am I wrong?).
It almost seems to me that the 786 kB Realm file doubles in size as soon as something is added (either a property with more letters or an additional object). Why does the Realm file double at a critical value instead of growing linearly as more content is added?
Thanks for a clarification on this.
It's pretty well observed. :-) The Realm file starts out at about 4 kB and will double in size once it runs out of free space. It keeps doubling until 128 MB and then grows by a constant 128 MB thereafter.
The reason to double the file rather than grow it linearly is purely performance: keeping on doubling is a common strategy for dynamic data structures.
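To make that growth policy concrete, here's an illustrative sketch (Ruby used purely as executable pseudocode; this models the behaviour described above, not Realm's actual implementation):
MB = 1024 * 1024

# Grow a file according to the described policy: double until 128 MB,
# then add 128 MB at a time, until there is room for the data.
def grown_size(current, needed)
  size = current
  size *= 2 while size < needed && size < 128 * MB
  size += 128 * MB while size < needed
  size
end

grown_size(786 * 1024, 800 * 1024)  # => ~1.6 MB, i.e. one doubling step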
You can use the methods listed below to write a compacted copy that removes all free space from the file. This can be useful if you no longer add new data, want to ship a static database, or want to send the file over the network.
Realm.writeCopyToURL(_:encryptionKey:) in Swift
-[RLMRealm writeCopyToURL:encryptionKey:error:] in Objective-C
Realm.writeCopyTo() in Java
The thresholds and algorithm mentioned are the current ones, though they may change in future versions.
Hope this clarifies things.
I'm using Cypher's LOAD CSV syntax in Neo4J 2.1.2. So far it's been a huge improvement over the more manual ETL process required in previous versions. But I'm running into some behavior in a single case that's not what I'd expect and I wonder if I'm missing something.
The cypher query being used is this:
USING PERIODIC COMMIT 500
LOAD CSV FROM 'file:///Users/James/Desktop/import/dependency_sets_short.csv' AS row
MATCH (s:Sense {uid: toInt(row[4])})
MERGE (ds:DependencySet {label: row[2]}) ON CREATE SET ds.optional=(row[3] = 't')
CREATE (s)-[:has]->(ds)
Here's a couple of lines of the CSV:
227303,1,TO-PURPOSE-NOMINAL,t,73830
334471,1,AT-LOCATION,t,92048
334470,1,AT-TIME,t,92048
334469,1,ON-LOCATION,t,92048
227302,1,TO-PURPOSE-INFINITIVE,t,73830
116008,1,TO-LOCATION,t,68204
116007,1,IN-LOCATION,t,68204
227301,1,TO-LOCATION,t,73830
334468,1,ON-DATE,t,92048
116006,1,AT-LOCATION,t,68204
334467,1,WITH-ASSOCIATE,t,92048
Basically, I'm matching a Sense node (previously imported) based on its ID value, which is the fifth column. Then I'm doing a merge to either get a DependencySet node if it exists, or create it. Finally, I'm creating a has edge between the Sense node and the DependencySet node. So far so good; this all works as expected. What's confusing is the performance as the size of the CSV grows.
CSV Lines Time (msec)
------------------------------
500 480
1000 717
2000 1110
5000 1521
10000 2111
50000 4794
100000 5907
200000 12302
300000 35494
400000 Java heap space error
My expectation is that growth would be more-or-less linear, particularly as I'm committing every 500 lines as recommended by the manual, but it's actually closer to polynomial.
What's worse is that somewhere between 300k and 400k rows, it runs into a Java heap space error. Based on the trend from previous imports, I'd expect the import of 400k to take a bit over a minute. Instead, it churns away for about 5-7 minutes before running into the heap space error. It seems like I could split this file into 300,000-line chunks, but isn't that what "USING PERIODIC COMMIT" is supposed to do, more or less? I suppose I could give Neo4J more memory too, but again, it's not clear why I should have to in this scenario.
Also, to be clear, the lookups on both Sense.uid and DependencySet.label are indexed, so the lookup penalty for these should be pretty small. Here's a snippet from the schema:
Indexes
ON :DependencySet(label) ONLINE (for uniqueness constraint)
ON :Sense(uid) ONLINE (for uniqueness constraint)
Any explanations or thoughts on an alternative approach would be appreciated.
EDIT: The problem definitely seems to be in the MATCH and/or CREATE part of the query. If I remove lines 3 and 5 from the Cypher query it performs fine.
I assume that you've already created all the Sense labeled nodes before running this LOAD CSV import. What I think is going on is that as you are matching nodes with the label Sense into memory and creating relationships from the DependencySet to the Sense node via CREATE (s)-[:HAS]->(ds) you are increasing utilization of the available heap.
Another possibility is that the size of your relationship store in your memory-mapped settings needs to be increased. In your scenario it looks like the Sense nodes have a high degree of connectivity to other nodes in the graph. When this happens, the relationship store for those nodes requires more memory. Eventually, when you hit 400k nodes, the heap is maxed out; up until that point it has to do more and more garbage collection and disk reads.
Michael Hunger put together an excellent blog post on memory mapped settings for fast LOAD CSV performance. See here: http://jexp.de/blog/2014/06/load-csv-into-neo4j-quickly-and-successfully/
That should resolve your problem. I don't see anything wrong with your query.
I believe the line
MATCH (s:Sense {uid: toInt(row[4])})
is what drives the timing curve. Somewhere around the 200,000 mark on the x-axis of your graph, you no longer have all the Sense nodes in memory; some of them must be cached out to disk. Thus the increase in time is simply re-loading data from disk cache to memory and vice versa (otherwise it would stay roughly linear if everything were kept in memory).
Maybe if you could post your server memory settings, we could dig deeper.
For the Java heap error, refer to Kenny's answer.
In the past I had to work with big files, somewhere in the 0.1-3 GB range. Not all the 'columns' were needed, so it was OK to fit the remaining data in RAM.
Now I have to work with files in the 1-20 GB range, and they will probably grow as time passes. That is totally different, because you cannot fit the data in RAM anymore.
My file contains several million 'entries' (I have found one with 30 million entries). An entry consists of about 10 'columns': one string (50-1000 Unicode chars) and several numbers. I have to sort the data by 'column' and show it. For the user only the top entries (1-30%) are relevant; the rest is low-quality data.
So, I need some suggestions about which direction to head in. I definitely don't want to put the data in a DB, because databases are hard to install and configure for non-computer-savvy persons. I'd like to deliver a monolithic program.
Showing the data is not difficult at all. But sorting... without loading the data in RAM, on regular PCs (2-6GB RAM)... will kill some good hours.
I was looking a bit into MMF (memory mapped files) but this article from Danny Thorpe shows that it may not be suitable: http://dannythorpe.com/2004/03/19/the-hidden-costs-of-memory-mapped-files/
So, I was thinking about loading into RAM only the data from the column that has to be sorted, AND a pointer to the address (in the disk file) of each 'entry'. I sort the 'column', then I use the pointer to find the entry corresponding to each column cell and restore the entry. The 'restoration' will be written directly to disk, so no additional RAM will be required.
PS: I am looking for a solution that will work both on Lazarus and Delphi because Lazarus (actually FPC) has 64 bit support for Mac. 64 bit means more RAM available = faster sorting.
I think the way to go is merge sort; it's a great algorithm for sorting a large number of fixed-size records with limited memory (a rough code sketch follows the steps below).
General idea:
read N lines from the input file (a value that allows you to keep the lines in memory)
sort these lines and write the sorted lines to file 1
repeat with the next N lines to obtain file 2
...
you reach the end of the input file and you now have M files (each of which is sorted)
merge these files into a single file (you'll have to do this in steps as well)
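Your implementation will be in Delphi/FPC, but to make the steps concrete, here's a minimal sketch of the idea (Ruby used as executable pseudocode; it assumes one record per line, plain string comparison, made-up file names, and does the final merge in a single k-way pass rather than in steps):
require "tempfile"

CHUNK = 100_000  # lines to hold in memory at once; tune to the available RAM

# Phase 1: split the input into sorted runs on disk.
runs = []
File.open("input.txt") do |f|
  until f.eof?
    lines = []
    lines << f.readline while lines.size < CHUNK && !f.eof?
    run = Tempfile.new("run")
    run.write(lines.sort.join)
    run.rewind
    runs << run
  end
end

# Phase 2: k-way merge of the sorted runs into one output file.
File.open("sorted.txt", "w") do |out|
  heads = runs.map { |r| r.eof? ? nil : r.readline }
  until heads.compact.empty?
    i = heads.each_index.reject { |j| heads[j].nil? }.min_by { |j| heads[j] }
    out.write(heads[i])
    heads[i] = runs[i].eof? ? nil : runs[i].readline
  end
end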
You could also consider a solution based on an embedded database, e.g. Firebird Embedded: it works well with Delphi/Windows and you only have to add some DLLs to your program folder (I'm not sure about Lazarus/OSX).
If you only need a fraction of the whole data, scan the file sequentially and keep only the entries needed for display. For instance, let's say you need only 300 entries out of 1 million. Scan the first 300 entries in the file and sort them in memory. Then, for each remaining entry, check if it is lower than the lowest one in memory and, if so, skip it. If it is higher than the lowest entry in memory, insert it into the correct place inside the 300 and throw away the lowest. This will make the second lowest the lowest. Repeat until the end of the file.
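A rough sketch of that scan (again Ruby as executable pseudocode; sort_key and the file name are hypothetical placeholders):
TOP_N = 300  # number of entries you actually need to display

top = []  # kept sorted ascending, so top.first is always the current lowest
File.foreach("input.txt") do |line|
  key = sort_key(line)            # hypothetical: extract the value you sort on
  if top.size < TOP_N
    top << [key, line]
    top.sort_by!(&:first) if top.size == TOP_N
  elsif key > top.first.first     # better than the current lowest?
    top.shift                     # throw away the lowest
    idx = top.index { |k, _| k > key } || top.size
    top.insert(idx, [key, line])  # keep the buffer sorted
  end
end
top.sort_by!(&:first)             # in case the file had fewer than TOP_N entries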
Really, there is no sorting algorithm that can make moving 30 GB of randomly sorted data fast.
If you need to sort in multiple ways, the trick is not to move the data itself at all, but instead to create an index for each column that you need to sort.
I do it like that with files that are also tens of gigabytes long, and users can sort, scroll and search the data without noticing that it's a huge dataset they're working with.
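A minimal sketch of that index idea (Ruby as executable pseudocode; sort_key and the file name are placeholders):
# Build an index of [sort key, byte offset] pairs and sort only the index;
# the records themselves never move. Build one such index per sortable column.
index = []
File.open("input.txt") do |f|
  until f.eof?
    offset = f.pos
    line = f.readline
    index << [sort_key(line), offset]  # sort_key: hypothetical column parser
  end
end
index.sort_by!(&:first)

# Fetch the i-th record in sorted order on demand (for display or scrolling).
def record_at(path, index, i)
  File.open(path) do |f|
    f.seek(index[i].last)
    f.readline
  end
end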
Please find here a class which sorts a file using a slightly optimized merge sort. I wrote it a couple of years ago for fun. It uses a skip list for sorting files in memory.
Edit: The forum is German and you have to register (for free). It's safe but requires a bit of German knowledge.
If you cannot fit the data into main memory then you are into the realms of external sorting. Typically this involves external merge sort. Sort smaller chunks of the data in memory, one by one, and write back to disk. And then merge these chunks.
Our TFSVersionControl database has grown significantly in the past couple years, and is edging on 80GB. Unfortunately, we're in an environment where every gig of data storage is internally charged at a high rate, so there's lots of focus on keeping storage growth to a minimum.
I believe the majority of growth is happening because we chose to store binary files in our repository. This is something we will be remedying in the medium term.
In the short-term, there are a few places where we do not need to keep a history of our binaries. Particularly in our mainline branch and our development branch, so we're looking into doing a TF Destroy on these binaries and recreating them as part of the upcoming release.
What I'd like to know is: Is there any way to run a query against the TFSVersionControl database to understand which files are storing deltas that are over a given size?
Ideally, what I'd like to know is for a given path (item spec), for each file, the base size, and the total size of the deltas.
I think this page may be what you're looking for.
Just like asking someone else will often drive you to find your own answer, I did some additional digging, and came up with this:
-- One row per check-in of each file under the given path.
-- Size = bytes stored for that version's content (delta).
select ver.VersionFrom, ver.Command, ver.ChildItem, tf.*,
       ct.CreationDate, ct.OffsetFrom, ct.OffsetTo, DataLength(ct.Content) as Size
from tbl_version ver with (nolock)
inner join tbl_file tf with (nolock) on tf.FileId = ver.FileId
inner join tbl_content ct with (nolock) on ct.FileId = tf.FileId
where parentpath = '$\ProjectName\Branch\Folder\'
-- or, for a single file, swap in this predicate instead:
--where fullpath = '$\ProjectName\Branch\Folder\FileName.cs\'
ORDER BY ver.ChildItem, ver.VersionFrom
The query as written will iterate through all files in a particular path and will retrieve a record per checkin. The calculated Size field will show you the size in bytes of the delta. I'm not sure if this is compressed size or "actual" size.
The commented "where" statement will show you the same for an indvidual file.
Note that the typical forward slashes ("/") are stored in the database as backslashes ("\"), and there is always a trailing backslash at the end.
If you pull this data into Excel, you can quickly create a pivot table on it to calculate the sizes (or you can add them up manually).