My application regularly uploads large files. Regardless of their size, all files are compressed before being uploaded to the server.
Part of this project's requirements is to resume cleanly after a crash or power failure, so right now compression is done this way:
large-file.bin is sliced into N slices
Each slice is compressed and uploaded
In case of a crash, I pick up from the last slice.
To optimize upload speed, I'm currently looking into sending the whole file (uploads are resumed if failed) instead of sending slices one by one, so I'm looking into compressing the whole file instead of compressing each slice.
I'm currently using 7z.dll. I wonder if it's possible, in case of power failure, to tell 7z to resume compression.
I know I could always implement my own compression routine with such a feature, but before going down that road I wonder if it's possible to do that in 7z (which already has an excellent compression ratio).
As far as I know, no compression algorithm supports that. You will likely have to recompress the source file from the beginning every time, discarding any output bytes until you reach the desired resume position, and then you can send the remaining output bytes from that point on.
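A minimal sketch of that approach, assuming the compressor is deterministic (the same library version, settings and input chunking produce byte-identical output). It uses Python's standard lzma module rather than 7z.dll, and the names compressed_tail and resume_offset are illustrative only, not part of any 7-Zip API.

```python
import io
import lzma
import os


def compressed_tail(src, resume_offset, chunk_size=64 * 1024, preset=6):
    """Recompress src from the start and yield only the compressed bytes
    located at or after resume_offset in the compressed stream."""
    compressor = lzma.LZMACompressor(preset=preset)
    produced = 0  # compressed bytes generated so far, including discarded ones
    while True:
        chunk = src.read(chunk_size)
        # An empty read means end of input: flush to finish the stream.
        out = compressor.compress(chunk) if chunk else compressor.flush()
        if out and produced + len(out) > resume_offset:
            # Drop the part of this output that the server already has.
            yield out[max(0, resume_offset - produced):]
        produced += len(out)
        if not chunk:
            break


# Demo: pretend the server already holds the first 12345 compressed bytes.
data = os.urandom(200_000)  # stand-in for large-file.bin
full = b"".join(compressed_tail(io.BytesIO(data), 0))
tail = b"".join(compressed_tail(io.BytesIO(data), 12345))
print(len(full), len(tail))
```

The CPU cost of recompressing the discarded prefix is paid again on every resume; only the upload itself is saved.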
I am trying to put a watermark on video files (of any size) through an Azure Function (C#). To do this, I download the video into a Stream from an external source, split it into frames, apply a watermark to each frame, and then merge the frames back. I am using FFmpeg and OpenCV for this.
To split the file, I have to load the whole file and then process it frame by frame. This is fine for smaller files, but for large files (2+ GB) it will put heavy pressure on memory.
Any suggestions for a better way to do this?
For all the other steps, I am using Durable Functions, with the work split across 5-6 small functions.
Does it mean that it will take 100MB (Open via Disk)?
Or does it mean that it will take 100MB (Open via Memory)?
That's the threshold at which libvips will flip from open-via-memory to open-via-disc.
For small images (under 100 MB when decompressed, in this case), libvips will decompress to memory and then process from there. This is obviously not a good idea for large images, so for these libvips will decompress to a temporary disc file, then map that area of disc into virtual memory and use that as the pixel source.
tldr: set VIPS_DISC_THRESHOLD to a small number to prefer the use of disc, set it to a large number to prefer RAM.
There's a chapter in the libvips docs which goes into a lot more detail:
https://www.libvips.org/API/current/How-it-opens-files.md.html
To very quickly summarize:
libvips has at least four ways of opening images and tries hard to pick the best one for you automatically.
Sometimes it'll need a bit of help to hit the best path for your use case and you have three main ways of influencing this.
You can hint the access pattern you expect for this image with the access= parameter, you can set the threshold at which it'll flip between preferring memory and preferring disc, and you can say where you'd like disc temporaries to be held.
I am wondering why we need to compress files before uploading them to a server in some scenarios. As I understand it, as soon as the server receives the compressed files, they need to be extracted so that the server can read their content. That certainly consumes the server's computing power if many HTTP POSTs are sent from many client platforms.
Therefore, the only scenarios I can think of for sending compressed files are uploading backup files, settings files, and other files that only serve as backups for the client platforms. Please give me more scenarios for uploading compressed data.
I think the following article gives a perfect explanation: http://www.dataexpedition.com/support/notes/tn0014.html
Here's the content:
Compression Pros & Cons
Simply put, compression is a process which trades CPU cycles for bytes. But the trade isn't always a good one. Sometimes you can spend a lot of valuable CPU cycles for little or no gain.
In the context of network data transport, "Should I compress?" is a common question. But the answer can get complicated, depending on several factors. The most important thing to remember is that compression can actually make your data move much slower, so it should not be used without some consideration.
When Compression Is Good
Compression algorithms try to identify large repeating patterns in a data set and replace them with smaller patterns. Ideally, this shrinks the size of the data set. For the purposes of network transport, having less data to move means it should take less time to move it.
Documents and files which consist mostly of plain text or machine executable code tend to compress well. Examples include word processing documents, HTML files, some .exe files, and some database files.
Combining many small files into a single archive prior to network transfer can often result in faster speeds than transferring each file individually. This may be true even if the individual files themselves are not compressible. Many archiving utilities have options to pack files into an archive without compression, such as the "-0" option for "zip". ExpeDat will combine the contents of a folder into a single data stream when you enable Streaming Folders.
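As a rough illustration of packing without compressing (the equivalent of "zip -0"), here is a small sketch using Python's standard zipfile module. The function name pack_without_compression and the sample file names are made up for the example.

```python
import io
import zipfile
from typing import Mapping


def pack_without_compression(files: Mapping[str, bytes]) -> bytes:
    """Bundle several payloads into one ZIP archive, stored as-is."""
    buf = io.BytesIO()
    # ZIP_STORED writes each member uncompressed, so no CPU time is
    # spent trying to shrink data that may not be compressible.
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, payload in files.items():
            zf.writestr(name, payload)
    return buf.getvalue()


archive = pack_without_compression({"a.txt": b"hello", "b.bin": b"\x00\x01"})
print(len(archive), "bytes in a single archive")
```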
When Compression Is Bad
Many data types are not compressible, because the repeating patterns have already been removed. This includes most images, videos, songs, any data that is already compressed, or any data that has been encrypted.
Trying to compress data that is not compressible wastes CPU time. When you are trying to move data at high speeds, that CPU time may be critical to feeding the network. So by taking away processing time with worthless compression, you can actually end up moving your data much more slowly than if you had compression turned off.
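The difference is easy to measure. The sketch below uses Python's zlib and treats random bytes as a stand-in for already-compressed or encrypted data; the helper name compression_ratio is only for illustration.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, level)) / len(data)


# Repetitive plain text: lots of patterns for the compressor to find.
text_like = b"The quick brown fox jumps over the lazy dog.\n" * 2000
# Random bytes: no repeating patterns, like compressed or encrypted data.
random_like = os.urandom(len(text_like))

print(f"text-like:   {compression_ratio(text_like):.3f}")
print(f"random-like: {compression_ratio(random_like):.3f}")
```

A ratio close to (or slightly above) 1.0 means the CPU time bought nothing.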
If you are using a compression utility only for the purposes of combining many small files, check for options that disable compression. For example, the "zip" command has a "-0" option which packages files into an archive without spending time trying to compress them.
Inline versus Offline
Many transport mechanisms allow you to apply compression algorithms to data as it's being transferred. This is convenient because the compression and decompression occur seamlessly without the user having to perform extra steps. But it is also risky because any CPU time spent on compression is time NOT being spent on feeding data through the network. If the network is very fast, the CPU is very slow, or the compression algorithm is unable to scale, having inline compression turned on may cause your data to move more slowly than if you turn compression off. Inline compression can be slower than no compression even when the data is compressible!
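One way to reason about this trade-off is a deliberately simplified pipeline model, which is an assumption of this sketch and not something the article states: the compressor and the network run in parallel, so the rate at which original data moves is capped by whichever is slower. Decompression cost and latency are ignored, and all names here are illustrative.

```python
def effective_throughput(network_rate: float, compressor_rate: float,
                         ratio: float) -> float:
    """Rate at which ORIGINAL bytes move with inline compression.

    network_rate:    bytes/s the link can carry
    compressor_rate: bytes/s of input the compressor can consume
    ratio:           original size / compressed size (>= 1 if compressible)
    """
    # The link carries compressed bytes, so it moves network_rate * ratio
    # original bytes per second, but never faster than the compressor.
    return min(compressor_rate, network_rate * ratio)


def inline_compression_helps(network_rate: float, compressor_rate: float,
                             ratio: float) -> bool:
    return effective_throughput(network_rate, compressor_rate, ratio) > network_rate


MB = 1_000_000
# Fast link, slow compressor: compressible data still moves more slowly.
print(inline_compression_helps(100 * MB, 40 * MB, 3.0))
# Slow link, same compressor: compression is a clear win.
print(inline_compression_helps(10 * MB, 40 * MB, 3.0))
```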
If you are going to be transferring the same data set multiple times, it pays to compress it first using Zip or Tar-Gzip. Then you can transfer the compressed archive without taking CPU cycles away from the network processing. If you are planning to encrypt your data, make sure you compress it first, then encrypt second.
Hidden Compression
Devices in your network may be applying compression without you realizing it. This becomes evident if the "speed" of the network seems to change for different data types. If the network seems slow when you are transferring data that is already compressed, but fast when you are transferring uncompressed text files, then you can be pretty sure that something out there is making compression decisions for you.
Network compression devices can be helpful in that they take the compression burden away from the end-point CPUs. But they can also create very inconsistent results since they will not work for all destinations and data types. Network level compression can also run into the same CPU trade-offs discussed above, resulting in some files moving more slowly than they would if there was no compression.
If you are testing the speed of your network, try using data that is already compressed or encrypted to ensure consistent results.
Should I Turn On Inline Compression?
For compressed data, images, audio, video, or encrypted files: No.
For other types of data, test it both ways to see which is faster.
If the network is very fast (hundreds of megabits per second or faster), consider turning off inline compression and instead compress the data before you move it.
If anyone has used the iOS wrapper for the LZMA SDK available at https://github.com/mdejong/lzmaSDK and has been able to tweak it to report the progress of unarchiving, please help.
I am going to use this SDK on iOS to extract a 16 MB file, which decompresses to a 150 MB file; this takes around 40 seconds to complete. It would be good to have some kind of callback for showing the progress of decompression.
Help is greatly appreciated.
Thanks
So, I looked at this issue quite a bit recently, and honestly the best you are going to be able to do is look for all the files in a specific tmp dir where decompression is going on and then count them and compare to a known size N. The problem with attempting to do this in the library is that it spans multiple runtimes and the callback idea makes the code a mess. Also, a callback would not help that much because of the way 7z compression works. To decode, one needs to build up the decompression dictionary before specific files can be decompressed, and that process of building up the dictionary takes a long time before the first file can even be written. So, if you put a "percent done" counter in your app showing how much was done, it would show 0% done for a long time, then jump to 50% and then 90 or 100 %. Basically, it would not be that useful even if it was implemented.
You could try the C++ port of the latest LZMA SDK (15.06), which does not have the limitations of the C version described above. Memory allocation and I/O read/write can be tuned at runtime, and it supports password-encrypted archives, smoothed progress, LZMA and LZMA2 archive types, and so on.
GitHub: https://github.com/OlehKulykov/LzmaSDKObjC
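For comparison, here is a minimal sketch of progress reporting for a single LZMA/XZ stream (not a multi-file 7z archive), written with Python's standard lzma module rather than either of the iOS wrappers above. Progress is estimated from the fraction of compressed input consumed; the function name and the on_progress callback are hypothetical.

```python
import io
import lzma


def decompress_with_progress(src, dst, total_compressed, on_progress,
                             chunk_size=64 * 1024):
    """Decompress one XZ stream from src into dst, reporting progress as
    the fraction of compressed bytes consumed so far (0.0 to 1.0)."""
    decomp = lzma.LZMADecompressor()
    consumed = 0
    while not decomp.eof:
        chunk = src.read(chunk_size)
        if not chunk:
            raise EOFError("compressed stream ended before its end marker")
        dst.write(decomp.decompress(chunk))
        consumed += len(chunk)
        on_progress(min(consumed / total_compressed, 1.0))


# Demo with a tiny chunk size so the callback fires several times.
payload = bytes(range(256)) * 2000
packed = lzma.compress(payload)
steps = []
out = io.BytesIO()
decompress_with_progress(io.BytesIO(packed), out, len(packed),
                         steps.append, chunk_size=16)
print(f"{len(steps)} progress updates")
```

Because the estimate follows compressed input rather than output written, it can still advance unevenly when parts of the data compress much better than others.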
I recently backed up my soon-to-expire university home directory by sending it as a tar stream and compressing it on my end: ssh user@host "tar cf - my_dir/" | bzip2 > uni_backup.tar.bz2.
This got me thinking: I only know the basics of how compression works, but I would imagine that this ability to compress a stream of data would lead to poorer compression since the algorithm needs to finish handling a block of data at one point, write this to the output stream and continue to the next block.
Is this the case? Or do these programs simply read a lot of data into memory, compress it, write it out, and then repeat? Or are there any clever tricks used in these “stream compressors”? I see that both bzip2 and xz's man pages talk about memory usage, and man bzip2 also hints at the fact that little is lost by chopping the data to be compressed into blocks:
Larger block sizes give rapidly diminishing marginal returns. Most of the compression comes from the first two or three hundred k of block size, a fact worth bearing in mind when using bzip2 on small machines. It is also important to appreciate that the decompression memory requirement is set at compression time by the choice of block size.
I would still love to hear if other tricks are used, or about where I can read more about this.
This question relates more to buffer handling than to the compression algorithm, although a bit could be said about that too.
Some compression algorithms are inherently "block based", which means they absolutely need to work with blocks of a specific size. This is the case for bzip2, whose block size is selected with the "level" switch, from 100 kB to 900 kB.
So, if you stream data into it, it will wait for the block to be filled, and start compressing this block when it's full (alternatively, for the last block, it will work with whatever size it receives).
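This buffering can be observed with Python's standard bz2 module. The sketch below feeds random data in small chunks to a level-1 compressor (100 kB blocks) and records how many compressed bytes each call returns; the helper name feed_in_chunks is illustrative.

```python
import bz2
import os


def feed_in_chunks(data: bytes, chunk_size: int, level: int):
    """Compress data incrementally; return the compressed stream and the
    number of output bytes produced by each compress() call."""
    comp = bz2.BZ2Compressor(level)
    sizes = []
    out = bytearray()
    for i in range(0, len(data), chunk_size):
        piece = comp.compress(data[i:i + chunk_size])
        sizes.append(len(piece))  # 0 while the current block is still filling
        out += piece
    out += comp.flush()  # the last, possibly partial, block is emitted here
    return bytes(out), sizes


data = os.urandom(250_000)
compressed, sizes = feed_in_chunks(data, 10_000, level=1)
print(sizes)
```

Most calls return nothing; output only appears once a whole block has been collected.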
Other compression algorithms can handle streams, which means they can continuously compress new data using older data kept in a memory buffer. Algorithms based on "sliding windows" can do this, and zlib is a typical example.
That said, even "sliding window" compressors may still choose to cut the input data into blocks, either for easier buffer management or to enable multi-threading, as pigz does.
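The benefit of keeping older data in the window can be shown with Python's zlib. In this sketch the same 2 kB of random bytes is sent 20 times: compressed as one stream, later chunks can refer back to earlier ones, whereas chunks compressed independently cannot. The function names are illustrative.

```python
import os
import zlib


def compress_as_stream(chunks, level=6):
    """Compress all chunks through one compressor that keeps its history."""
    comp = zlib.compressobj(level=level)
    out = b"".join(comp.compress(c) for c in chunks)
    return out + comp.flush()


def compress_independently(chunks, level=6):
    """Compress each chunk on its own, with no shared history."""
    return [zlib.compress(c, level) for c in chunks]


block = os.urandom(2000)  # incompressible on its own
chunks = [block] * 20     # but repeated within the 32 kB window

streamed = compress_as_stream(chunks)
separate = compress_independently(chunks)
print(len(streamed), sum(len(s) for s in separate))
```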