Checking data integrity for cloned drives

I have n disk clones that need to be checked against the source, and I am looking for an efficient way to do this. I have already tried md5sum and sha256sum, but these take several minutes on a multi-gigabyte drive and won't produce the intended result when the drives differ slightly in size.
Background info:
I am building an SD cloning tool.
Any thoughts?
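One way around the size mismatch (a rough sketch, not a drop-in solution; /dev/sdS and /dev/sdC are placeholder device names) is to hash or compare only the first source-sized span of each clone, so a clone that is slightly larger than the source still checks out:

# Placeholders: /dev/sdS is the source, /dev/sdC one of the clones.
SRC=/dev/sdS
CLONE=/dev/sdC

# Size of the source in bytes.
SIZE=$(blockdev --getsize64 "$SRC")

# Hash only the first $SIZE bytes of each device, so differing total
# drive sizes no longer change the result.
head -c "$SIZE" "$SRC"   | sha256sum
head -c "$SIZE" "$CLONE" | sha256sum

# Alternatively, skip hashing and stop at the first differing byte:
cmp -n "$SIZE" "$SRC" "$CLONE"

This still has to read the full source-sized span, so it isn't dramatically faster than a single checksum pass, but it makes clones of slightly different sizes comparable, and cmp at least exits early when the copies actually differ.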

Related

Why is docker pull not extracting layers in parallel?

Does extracting (untarring) of docker image layers by docker pull have to be conducted sequentially or could it be parallelized?
Example
docker pull mirekphd/ml-cpu-r40-base - an image which had to be split into more than 50 layers for build performance reasons. It contains around 4k R packages precompiled as DEBs (the entire CRAN Task Views contents), which would be impossible to build in Docker without splitting these packages into multiple layers of roughly equal size; doing so cuts the build time from a whole day to minutes. The extraction stage - if parallelized - could become up to 50 times faster...
Context
When you observe docker pull for a large multi-layer image (gigabytes in size), you will notice that the download of each layer can be performed separately, in parallel. Not so for subsequent extracting (untarring) of each of these layers, which is performed sequentially. Do we know why?
From my anecdotal observations with such large images, parallel extraction would speed up the docker pull operation considerably.
Moreover, if splitting an image into more layers let you spin up containers faster, people would start writing Dockerfiles that are more readable and faster to both debug/test and pull/run, rather than piling all instructions into a single slow-building, impossibly convoluted and cache-busting string of instructions only to save a few megabytes of extra layer overhead (which would be easily recouped by parallel extraction).
From the discussion at https://github.com/moby/moby/issues/21814, there are two main reasons that layers are not extracted in parallel:
It would not work on all storage drivers.
It would potentially use lots of CPU.
See the related comments below:
Note that not all storage drivers would be able to support parallel extraction. Some are snapshotting filesystems where the first layer must be extracted and snapshotted before the next one can be applied.
#aaronlehmann
We also don't really want a pull operation consuming tons of CPU on a host with running containers.
#cpuguy83
And the user who closed the linked issue wrote the following:
This isn't going to happen for technical reasons. There's no room for debate here. AUFS would support this, but most of the other storage drivers wouldn't support this. This also requires having specific code to implement at least two different code paths: one with this parallel extraction and one without it.
An image is basically something like this graph A->B->C->D and most Docker storage drivers can't handle extracting any layers which depend on layers which haven't been extracted already.
Should you want to speed up docker pull, you most certainly want faster storage and faster network. Go itself will contribute to performance gains once Go 1.7 is out and we start using it.
I'm going to close this right now because any gains from parallel extraction for specific drivers aren't worth the complexity for the code, the effort needed to implement it and the effort needed to maintain this in the future.
#unclejack
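As a side note, while extraction stays sequential, the download half of the pipeline is tunable; a minimal sketch, assuming dockerd reads /etc/docker/daemon.json (max-concurrent-downloads defaults to 3):

{
  "max-concurrent-downloads": 6
}

followed by a daemon restart (e.g. sudo systemctl restart docker). This only changes how many layers are fetched in parallel; it does not affect the serial untarring discussed above.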

Artifactory Docker Registry: freeing /var partition space

Due to unregulated pushing of Docker images to the JFrog Docker registry, my /var partition is currently FULL.
As I am able to SSH into the machine, I wanted to know whether I can directly delete the images at the /var location, since I am not able to start the Artifactory service due to insufficient space.
The Docker images are stored as checksum-named binary files in the filestore. You will have no way of knowing which checksum belongs to which image, and since images often share the same layers, deleting even a single file can corrupt several images.
For the short term, I recommend moving (not deleting) a few binary files elsewhere to allow you to start your registry back up. You can also delete the backup directory (backup is on by default, you may not actually want/need it, and it occupies a lot of space). Once that is done, start Artifactory and delete enough images to clear space OR, preferably, expand the filestore size OR, better yet, move it to a different partition so you don't mix the app/OS with the application data. In any case, when you have more free space, move the binary files back to their original location.
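A rough sketch of that short-term move (the paths are assumptions for a default Linux install; check your own $ARTIFACTORY_HOME and binarystore configuration before touching anything):

# See what is actually filling /var.
du -xh --max-depth=2 /var | sort -h | tail

# Park (do not delete) a handful of filestore binaries on a partition
# that still has room, just so Artifactory can start again.
mkdir -p /mnt/spare/filestore-parked
mv /var/opt/jfrog/artifactory/data/filestore/00/* /mnt/spare/filestore-parked/

# After cleaning up or growing the filestore, put them back unchanged.
mv /mnt/spare/filestore-parked/* /var/opt/jfrog/artifactory/data/filestore/00/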

How to browse the contents of a docker/btrfs container-specific layer

I have read, and I believe understood, the Docker pages on using btrfs, notably this one.
My question is rather simple: I need to be able to navigate (e.g. using cd and ls, but any other means is fine) what the above link calls the thin R/W layer attached to a given container.
The reason I need this is that I use an image I have not built myself - namely jupyter/scipy-notebook:latest - and what I can see is that each container starts with a circa 100-200 MB impact on overall disk usage, even though nothing much should be going on in the container.
So I suspect some rather verbose logs get created that I need to silence a bit; however, the whole union filesystem is huge - circa 5 GB - so it would help me greatly to navigate only the data that is specific to one container so I can pinpoint the problem.
To list the files that have changed or been created since the original image, use
docker diff my-container
This is quite handy if you want to get an idea of what's happening inside, though it doesn't give you file sizes.
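If sizes or the on-disk location are what you need, a couple of extra commands may help (a sketch; the GraphDriver fields and paths vary by storage driver, so treat the btrfs path below as an assumption to verify):

# Per-container disk usage: SIZE is the writable (thin R/W) layer,
# the value in parentheses includes the shared image layers.
docker ps -s

# Ask Docker where the container-specific layer lives on disk.
docker inspect -f '{{ json .GraphDriver }}' my-container

# With the btrfs driver the data typically sits under a subvolume like:
sudo du -sh /var/lib/docker/btrfs/subvolumes/<subvolume-id>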

Large Storage Solution

We are a small bootstrapped ISP in a third-world country where bandwidth is usually expensive and slow. We recently got a customer who needs a storage solution for tens of TB of mostly video files (it's a TV station). The thing is, I know my way around Linux but I have never done anything like this before. We have a Backblaze Storage Pod 3.0 chassis which we are thinking of using as a storage server. The server will be connected to the customer directly, so traffic won't go through the internet, because 100+ Mbps speeds are unheard of in this part of the world.
I was thinking of using 4 TB HDDs, all formatted with ext4, and using LVM to make them one large volume (50-70 TB at least). The customer then logs in with an FTP-like client and dumps whatever files he/she wants, but only ever sees a single volume, and we can add space as the requirements increase. Of course this is just on paper from preliminary research, as I don't have prior experience with this kind of system. I also have to take cost into consideration, so I can't go for any proprietary solution.
My questions are:
Is this the best way to handle this problem, or are there equally good or better solutions out there?
For large storage solutions (at least large for me), what are my cost-effective options when it comes to dealing with data corruption and HDD failure?
Would love to hear any other solutions and tips you guys might have. Thanks!
ZFS might be a good option but there is no native bug-free solution for Linux, yet. I would recommend other operating systems in that case.
Today I would recommend Linux MD RAID 5 on enterprise disks or RAID 6 on consumer/desktop disks. I would not assign more than 6 disks to an array. LVM can then be used to tie the arrays into a logical volume suitable for ext4.
The ext4 filesystem is well tested and stable, while XFS might be better for large-file storage. The downside to XFS is that it is not possible to shrink an XFS filesystem. I would prefer ext4 because of its more flexible nature.
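A minimal sketch of that layout, assuming six placeholder disks /dev/sdb through /dev/sdg and leaving chunk/stripe tuning aside:

# One 6-disk RAID 6 array.
mdadm --create /dev/md0 --level=6 --raid-devices=6 /dev/sd[b-g]

# Pool the array(s) with LVM; more arrays can be added later with vgextend.
pvcreate /dev/md0
vgcreate storage /dev/md0
lvcreate -l 100%FREE -n bulk storage

# ext4 on top; grow later with lvextend followed by resize2fs.
mkfs.ext4 /dev/storage/bulk
mount /dev/storage/bulk /srv/storage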
Please also take into consideration that backups are still required even if you are storing your data on raid-arrays. The data can silently corrupt or be accidentally deleted.
In the end, everything depends on what the customer wants. Telling the customer the price of the service usually has an effect on the requirements.
I would like to add to the answer that mingalsuo gave. As he stated, it really comes down to the customer requirements. You don't say what, specifically, the customer will do with this data. Is it for archive only? Will they be actively streaming the data? What is your budget for this project? These types of answers will better determine the proposed solution. Here are some options based on a great many assumptions. Maybe one of them will be a good fit for your project.
CAPACITY:
In this case, you are not that concerned about performance but are more interested in capacity, so the number of spindles doesn't really matter much. As mingalsuo stated, put together a set of RAID-6 SATA arrays and use LVM to produce a large volume.
SMALL BUSINESS PERFORMANCE:
In this case, you need performance. The customer is going to store files but also requires the ability to run a small number of simultaneous data streams. Here you want as many spindles as possible; for streaming, it does little good to focus on the size of the controller cache, so just focus on the spindle count. Keep in mind that the time to rebuild a failed drive increases with the size of the drive, and during a rebuild your performance will suffer. For these reasons I'd suggest smaller drives, maybe 1 TB at most. This will provide you with faster rebuild times and more spindles for streaming.
ENTERPRISE PERFORMANCE:
Here you need high performance - similar to what an enterprise demands. You require many simultaneous data streams and consistent performance. In this case, I would stay away from SATA drives and use 900 GB or 1.2 TB SAS drives instead. I would also suggest that you consider abstracting the storage layer from the server layer: create a Linux server and use iSCSI (or Fibre Channel) to connect to the storage device. This will allow you to load balance if possible, or at the very least make recovery from disaster easier.
NON TRADITIONAL SOLUTIONS:
You stated that the environment has few high-speed connections to the internet. Again, depending on the requirements, you still might consider cloud storage. Hear me out :) Let's assume that the files will be uploaded today, used for the next week or month, and then rarely read. In this case, these files are sitting on (potentially) expensive disks for no reason except archive. Wouldn't it be better to keep those active files on expensive (local) disk until they "retire" and then move them to less expensive disk? There are solutions that do just that. One, for example, is called StorSimple. This is an appliance that contains SAS (and even flash) drives and uses cloud storage to automatically migrate "retired" data from the local storage to cloud storage. Because this data is retired it wouldn't matter if it took longer than normal to move it to the cloud. And, this appliance automatically pulls it back from the cloud to local storage when it is accessed. This solution might be too expensive for your project but there are similar ones that you might find will work for you. The added benefit of this is that your data is automatically backed up by the cloud provider and you have an unlimited supply of storage at your disposal.

Batch processing of image

I have been assigned to write a Windows application which will copy images from a source folder and its subfolders (there could be any number of subfolders, holding up to 50 GB of images). Each image might vary in size from a few KB to 20 MB. I need to resize and compress the pictures.
I am clueless and wondering whether this can be done without hitting the CPU too hard while, on the other hand, still being reasonably fast.
Is it possible? Can you guide me on the best way to implement this?
Image processing is always a CPU-intensive task. You could do little tricks like lowering the priority of the process that's performing the image processing so it impacts your machine less, but there's very little room for a tradeoff.
As for how to do it,
Write a script that looks for all the files in the current directory and its subdirectories. If you're not sure how, do a Google search. You could do this in Perl, Python, PHP, C#, or even a BAT file.
Call one of the 10,000,000 free or open-source programs to do image conversion. The most widely used Linux program is ImageMagick and there's a Windows version of it available too.
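As a rough sketch of both points combined (assuming ImageMagick is installed and that writing results into a parallel resized/ tree is acceptable; on Windows the same convert call can be driven from a BAT or PowerShell loop):

# Walk the source tree and convert each image at low CPU priority
# (nice) so other processes are barely affected.
SRC=/path/to/source
find "$SRC" -type f \( -iname '*.jpg' -o -iname '*.png' \) -print0 |
while IFS= read -r -d '' img; do
    out="resized/${img#"$SRC"/}"
    mkdir -p "$(dirname "$out")"
    # Shrink anything larger than 1920px on its long edge and recompress.
    nice -n 19 convert "$img" -resize '1920x1920>' -quality 85 "$out"
done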
