Can someone clarify my Duplicity statistics for me?

Here's the output from a Duplicity backup that I run every night on a server:
--------------[ Backup Statistics ]--------------
StartTime 1503561610.92 (Thu Aug 24 02:00:10 2017)
EndTime 1503561711.66 (Thu Aug 24 02:01:51 2017)
ElapsedTime 100.74 (1 minute 40.74 seconds)
SourceFiles 171773
SourceFileSize 83407342647 (77.7 GB)
NewFiles 15
NewFileSize 58450408 (55.7 MB)
DeletedFiles 4
ChangedFiles 6
ChangedFileSize 182407535 (174 MB)
ChangedDeltaSize 0 (0 bytes)
DeltaEntries 25
RawDeltaSize 59265398 (56.5 MB)
TotalDestinationSizeChange 11743577 (11.2 MB)
Errors 0
-------------------------------------------------
I don't know if I'm reading this right, but what it seems to be saying is that:
- I started with 77.7 GB
- I added 15 files totaling 55.7 MB
- I deleted or changed files whose sum total was 174 MB
- My deltas after taking all changes into account totaled 56.5 MB
- The total disk space on the remote server that I pushed the deltas to was 11.2 MB
It seems to me that this says I only pushed 11.2 MB, but I should probably have pushed at least 55.7 MB because of those new files (you can't really make a small delta of a file that didn't exist before), plus whatever other space the deltas would have taken.
I get confused when I see these reports. Can someone help clarify? I've tried digging for documentation but haven't found much in the way of clear, concise, plain-English explanations of these values.

Disclaimer: I couldn't find a proper resource explaining the difference, nor anything in the duplicity docs that supports this theory.
ChangedDeltaSize, DeltaEntries and RawDeltaSize do not refer to changes in the actual files; they refer to differences between sequential data. Duplicity uses the rsync algorithm to create your backups, which in turn is a type of delta encoding.
Delta encoding is a way of storing data in the form of differences rather than complete files. So the delta sizes you see listed are changes in those pieces of data, and can therefore be smaller than the files themselves. In fact, I'd expect them to be smaller, as they are just small snippets of changed data.
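As a rough illustration of the idea (a toy sketch, not duplicity's actual librsync-based format): a delta encoder stores only the pieces that differ between the old and new version of a file. Brand-new files have no old version, so their whole content goes into the delta; duplicity also compresses the volumes it uploads, which helps explain why TotalDestinationSizeChange (11.2 MB) can end up smaller than RawDeltaSize (56.5 MB).

# Toy delta encoding: store only the changed pieces of a file, not the whole file.
# Illustration of the concept only, not duplicity's real librsync-based format.
def make_delta(old: bytes, new: bytes, block: int = 4096):
    """Return a list of (offset, data) pairs for the blocks that changed."""
    delta = []
    for off in range(0, max(len(old), len(new)), block):
        old_blk = old[off:off + block]
        new_blk = new[off:off + block]
        if old_blk != new_blk:
            delta.append((off, new_blk))
    return delta

def apply_delta(old: bytes, delta, new_len: int) -> bytes:
    """Rebuild the new file from the old file plus the delta."""
    buf = bytearray(old[:new_len].ljust(new_len, b"\0"))
    for off, data in delta:
        buf[off:off + len(data)] = data
    return bytes(buf[:new_len])

old = b"A" * 10000
new = b"A" * 9000 + b"B" * 1000                 # only the tail of the file changed
delta = make_delta(old, new)
assert apply_delta(old, delta, len(new)) == new
print(sum(len(d) for _, d in delta), "bytes of delta for a", len(new), "byte file")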
Some sources:
- http://duplicity.nongnu.org/ - "Encrypted bandwidth-efficient backup using the rsync algorithm"
- https://en.wikipedia.org/wiki/Rsync - "The rsync algorithm is a type of delta encoding..."
- https://en.wikipedia.org/wiki/Delta_encoding

Related

Repair corrupted hdf5 file

We are saving time-of-flight mass spectra in hdf5 files. In more than 99% of cases this works without any errors, but sometimes a measurement crashes.
We save various (meta)data, but my question is about one incrementally growing table:
Storage Layout: CHUNKED: 1x1x10x248584
Compression: 8,312:1, GZIP: level = 5
Values are inserted each time step (usually 10 seconds). After 1min the layout is 6x1x10x248584.
But the corrupted table has size 0x1x10x248584 (tried with h5py 2.10.0 and HDFView 3.1.0).
My question: is there any low-level library (Python preferred) where I can try to access the lost data? The file size (measurements run several hours --> several GB) suggests the data is there, but it cannot be read with the two programs I tried.
Thank you.
[Update]:
In HDFView it looks like this for the corrupted file (screenshot omitted). From the file size I expect the first dimension to be >1000.
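As a first pass, one low-tech option is to salvage whatever h5py can still open. This is a hedged sketch (the file names are placeholders): it copies every object that is still readable into a fresh file and skips anything that raises an error. It cannot recover chunks the HDF5 library itself refuses to read, and for the table that reports a first dimension of 0 it will just return an empty array, so it only helps for the other objects in the file.

# Hedged sketch: copy whatever is still readable out of a (possibly corrupted)
# HDF5 file into a fresh one. It only salvages objects that still open cleanly.
import h5py

def salvage(src_path: str, dst_path: str) -> None:
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        def copy_item(name, obj):
            try:
                if isinstance(obj, h5py.Dataset):
                    data = obj[...]                  # loads the whole dataset; may raise on bad chunks
                    dst.create_dataset(name, data=data)
                else:                                # it's a group
                    dst.require_group(name)
            except Exception as exc:
                print(f"skipping {name}: {exc}")
        src.visititems(copy_item)

salvage("corrupted.h5", "salvaged.h5")               # file names are placeholders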

Over 8 minutes to read a 1 GB side input view (List)

We have a 1 GB List that was created using the View.asList() method on Beam SDK 2.0. We are trying to iterate through every member of the list and do, for now, nothing significant with it (we just sum up a value). Just reading this 1 GB list takes about 8 minutes (and that was after we set workerCacheMb=8000, which we think means the worker cache is 8 GB; if we don't set it, it takes over 50 minutes before we just kill the job). We're using an n1-standard-32 instance, which should have more than enough RAM. There is ONLY a single thread reading this 1 GB list. We know this because we create a dummy PCollection of one integer and use it to then read the 1 GB ViewList side input.
It should not take 8 minutes to read a 1 GB list, especially if there's enough RAM. Even if the list were materialized to disk (which it shouldn't be), a normal single non-SSD disk can read data at 100 MB/s, so it should take ~10 seconds in this absolute worst-case scenario...
What are we doing wrong? Did we discover a Dataflow bug? Or is workerCacheMb really in KB instead of MB? We're tearing our hair out here...
Try using setWorkerCacheMb(1000); 1000 MB is around 1 GB. The side input will then be served from each worker node's cache, which is fast.
// Clone the pipeline options as Dataflow worker harness options and raise the
// per-worker side-input cache; the value is in MB.
DataflowWorkerHarnessOptions options = PipelineOptionsFactory.create().cloneAs(DataflowWorkerHarnessOptions.class);
options.setWorkerCacheMb(1000);
Do you really need to iterate over the whole 1 GB of side input data every time, or do you only need some specific data during the iteration?
If you only need specific data, get it by passing a specific index into the list. Getting data by index is a much faster operation than iterating over the whole 1 GB.
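As a rough sketch of that indexed-lookup idea (shown with the Beam Python SDK rather than the Java SDK used in the question, and with illustrative names only): a list side input can be indexed directly instead of being fully iterated.

# Illustrative only: the question uses the Java SDK, but the same idea, looking up
# specific elements of a list side input instead of iterating all of it, looks
# like this with the Beam Python SDK. Element order in the list view is not
# guaranteed, so the indices here just pick "some" elements.
import apache_beam as beam

def lookup(_, big_list):
    # Touch a handful of indices instead of walking the whole list.
    return [big_list[0], big_list[-1]]

with beam.Pipeline() as p:
    big = p | "Big" >> beam.Create(range(1_000_000))     # stands in for the 1 GB list
    trigger = p | "One" >> beam.Create([0])               # dummy single-element PCollection
    (trigger
     | "Lookup" >> beam.FlatMap(lookup, big_list=beam.pvalue.AsList(big))
     | "Print" >> beam.Map(print))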
After checking with the Dataflow team, the rate of 1GB in 8 minutes sounds about right.
Side inputs in Dataflow are always serialized. This is because to have a side input, a view of a PCollection must be generated. Dataflow does this by serializing it to a special indexed file.
If you give more information about your use case, we can help you think of ways of doing it in a faster manner.

How to prevent OutOfMemory exception when opening large Access .MDB files?

I'm trying to read data that is generated by another application and stored in a Microsoft Office Access .MDB file. The number of records in some particular tables can vary from a few thousand up to over 10 million, depending on the size of the model (in the other application). Opening the whole table in one query can cause an Out Of Memory exception with large files, so I split the table on some criteria and read each part in a separate query. But the problem is that medium-sized files could be read significantly faster in one single query with no exceptions.
So, am I on the right track? Can I solve the OutOfMemory problem in another way? Is it OK to choose one of the mentioned strategies (1 query or N queries) based on the number of records?
By the way, I'm using Delphi XE5 and Delphi's standard ADO components. I need the whole data of the table, and no joins to other tables are needed. I'm creating the ADO components in code and they are not connected to any visual controls.
Edit:
Well, it seems that my question is not clear enough. Here are some more details, which are actually answers to questions or suggestions posed in the comments:
This .mdb file does not hold a real database; it's just structured data, so there is no writing of new data, no transactions, no user interaction, no server, nothing. A third-party application uses Access files to export its calculation results. The total size of these files is usually a few hundred MB, but they can grow up to 2 GB. Now I need to load this data into a Delphi data structure before starting my own calculations, since there's no room for waiting on I/O during those calculations.
I can't compile this project for x64; it's extremely dependent on some old DLLs that share the same memory manager with the main executable, and their authors will never release an x64 version. The company hasn't decided to replace them yet, and that won't change in the near future.
And, you know, support guys just prefer to tell us “fix this” rather than asking two thousand customers to “buy more memory”. So I have to be really stingy about memory usage.
Now my question is: does TADODataSet provide any better memory management for fetching this amount of data? Is there any property that prevents the DataSet from fetching all the data at once?
When I call ADOTable1.Open, it starts to allocate memory and waits to fetch the entire table, just as expected. But reading all those records in a for loop takes a while, and there's no need to have all that data in memory at once; on the other hand, there's no need to keep a record in memory after reading it, since there's no seeking back through rows. That's why I split the table with several queries. Now I want to know whether TADODataSet can handle this, or whether what I'm doing is the only solution.
I did some trial and error and improved the performance of reading the data, in both memory usage and elapsed time. My test case is a table with more than 5,000,000 records. Each record has 3 string fields and 8 doubles. No index, no primary key. I used the GetProcessMemoryInfo API to measure memory usage.
Initial State
Table.Open: 33.0 s | 1,254,584 kB
Scrolling : +INF s | I don't know. But allocated memory doesn't increase in Task Manager.
Sum : - | -
DataSet.DisableControls;
Table.Open: 33.0 s | 1,254,584 kB
Scrolling : 13.7 s | 0 kB
Sum : 46.7 s | 1,254,584 kB
DataSet.CursorLocation := clUseServer;
Table.Open: 0.0 s | -136 kB
Scrolling : 19.4 s | 56 kB
Sum : 19.4 s | -80 kB
DataSet.LockType := ltReadOnly;
Table.Open: 0.0 s | -144 kB
Scrolling : 18.4 s | 0 kB
Sum : 18.5 s | -144 kB
DataSet.CacheSize := 100;
Table.Open: 0.0 s | 432 kB
Scrolling : 11.4 s | 0 kB
Sum : 11.5 s | 432 kB
I also checked Connection.CursorLocation, Connection.IsolationLevel, Connection.Mode, DataSet.CursorType and DataSet.BlockReadSize, but they made no appreciable difference.
I also tried TADOTable, TADOQuery and TADODataSet, and unlike what Jerry said in the comments, both ADOTable and ADOQuery performed better than ADODataSet.
The value assigned to CacheSize should be decided for each case; greater values do not necessarily lead to better results.
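The settings above are Delphi/ADO-specific, but the underlying pattern (a read-only cursor and fetching rows in small batches instead of materializing the whole table) carries over to other stacks. Here is a hedged sketch of the same pattern in Python with pyodbc; the driver name, file path and table name are assumptions, not part of the original setup.

# Analogous pattern in Python/pyodbc, not the Delphi solution above: stream the
# table in small batches so memory stays flat regardless of the row count.
import pyodbc

conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};DBQ=C:\data\results.mdb"
)
cursor = conn.cursor()
cursor.execute("SELECT * FROM Results")       # table name is a placeholder

total = 0
while True:
    rows = cursor.fetchmany(100)              # roughly the CacheSize=100 idea
    if not rows:
        break
    total += len(rows)                        # process the batch, then let it go

print("rows read:", total)
conn.close()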

What does virtual size of docker image mean?

When you type docker images, it shows you which images are locally available, along with other information. Part of this information is the virtual size. What exactly is that?
I found a little explanation in GitHub issue #22 on docker, but this is still not clear to me. What I'd really like to know is how many bytes need to be downloaded and how many bytes an image needs on my hard drive.
Additionally, Docker Hub 2.0 shows yet another figure. When you look at the Tags page of an image, another value is shown, and it always seems to be much smaller than the value given by docker images.
The "virtual size" refers to the total sum of the on-disk size of all the layers the image is composed of. For example, if you have two images, app-1 and app-2, and both are based on a common distro image/layer whose total size is 100MB, and app-1 adds an additional 10MB but app-2 adds an additional 20MB, the virtual sizes will be 110MB and 120MB respectively, but the total disk usage will only be 130MB since that base layer is shared between the two.
The transfer size is going to be less (in most cases by quite a bit) due to gzip compression being applied to the layers while in transit.
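A toy calculation with the numbers from the example above (just arithmetic, to make the shared-layer accounting explicit):

# Toy numbers from the example above: two images sharing a 100 MB base layer.
base, app1_extra, app2_extra = 100, 10, 20           # MB

virtual_app1 = base + app1_extra                      # 110 MB, reported per image
virtual_app2 = base + app2_extra                      # 120 MB
actual_disk  = base + app1_extra + app2_extra         # 130 MB, base stored only once

print(virtual_app1, virtual_app2, actual_disk)        # 110 120 130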
The extended details provided in https://github.com/docker-library/docs/blob/162cdda0b66dd62ea1cc80a64cb6c369e341adf4/irssi/tag-details.md#irssilatest might make this more concretely obvious. As you can see there, the virtual size (the sum of all the on-disk layer sizes) of irssi:latest is 261.1MB, but the "Content-Length" (the compressed size in transit) is only 97.5MB. That assumes you don't already have any of the layers; it's fairly likely you already have the first layer downloaded, which accounts for 125.1MB of the virtual size and 51.4MB of the "Content-Length" (likely, because that first layer is debian:jessie, a common base for the top-level images).
irssi:latest
Total Virtual Size: 261.1 MB (261122797 bytes)
Total v2 Content-Length: 97.5 MB (97485603 bytes)
Layers (13)
6d1ae97ee388924068b7a4797d995d57d1e6194843e7e2178e592a880bf6c7ad
Created: Fri, 04 Dec 2015 19:27:57 GMT
Docker Version: 1.8.3
Virtual Size: 125.1 MB (125115267 bytes)
v2 Blob: sha256:d4bce7fd68df2e8bb04e317e7cb7899e981159a4da89339e38c8bf30e6c318f0
v2 Content-Length: 51.4 MB (51354256 bytes)
v2 Last-Modified: Fri, 04 Dec 2015 19:45:49 GMT
8b9a99209d5c8f3fc5b4c01573f0508d1ddaa01c4f83c587e03b67497566aab9
...

TFS Branching and Disk Space

I've been thinking over some branching strategies (creating branches per feature, maybe per developer since we're a small group) and was wondering if anyone had experienced any issues. Does creating a branch take up much space?
Last time I looked, TFS uses copy-on-write, which means that you won't increase disk space until you change files. It's kind of like using symlinks until you need to change things.
James is basically correct. For a more complete answer, we need to start with Buck's post from back in 2006: http://blogs.msdn.com/buckh/archive/2006/02/22/tfs_size_estimation.aspx
Each new row in the local version table adds about 520 bytes (one row gets added for each workspace that gets the newly added item, and the size is dominated by the local path column). If you have 100 workspaces that get the newly added item, the database will grow by 52 KB. If you add 1,000 new files of average size (mix of source files, binaries, images, etc.) and have 100 workspaces get them, the version control database grows by approximately 112 MB (60 KB * 1,000 + 520 * 1,000 * 100).
We can omit the 60KB figure since branched items do not duplicate file contents. (It's not quite "copy-on-write," James -- an O(N) amount of metadata must be computed and stored during the branch operation itself, vs systems like git which I believe branch in O(1) -- but you're correct that the new item points to the same record in tbl_Content as the source item until it's edited). That leaves us with merely the 520 * num_workspaces * files_per_workspace factor. On the MS dogfood server there are something like 2 billion rows in tbl_LocalVersion, but in a self-described small group it should be utterly negligible.
Something Buck's blog does not mention is merge history. If you adopt a branch-heavy workflow and stick with it through several development cycles, it's likely tbl_MergeHistory will grow nearly as big as tbl_LocalVersion. Again, I doubt it will even register on a small team's radar, but on large installations you can easily amass hundreds of millions of rows. That said, each row is much smaller since there are no nvarchar(260) fields.
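A quick back-of-the-envelope version of Buck's estimate, using the ~520 bytes-per-row figure quoted above and dropping the per-file content term, since branched items share file contents:

# Rough estimate of version-control metadata growth for a branch, using the
# ~520 bytes per tbl_LocalVersion row figure quoted from Buck's post. Branched
# items share file contents, so the per-file 60 KB content term is omitted.
def branch_metadata_bytes(files_branched: int, workspaces: int, row_bytes: int = 520) -> int:
    return row_bytes * files_branched * workspaces

print(branch_metadata_bytes(1_000, 100) / 1e6, "MB")  # 52.0 MB: the 520 * 1,000 * 100 term above
print(branch_metadata_bytes(1_000, 5) / 1e6, "MB")    # 2.6 MB for a small team with 5 workspaces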
