2 Files, Half the Content, vs. 1 File, Twice the Content, Which is Greater? - memory

If I have 2 files each with this:
"Hello World" (x 1000)
Does that take up more space than 1 file with this:
"Hello World" (x 2000)
What are the drawbacks of dividing content into multiple smaller files (assuming there's reason to divide them into more files, not like this example)?
Update:
I'm using a MacBook Pro running Mac OS X 10.5, but I'd also like to know the answer for Ubuntu Linux.

Marcelo's answer gives the general performance case. I'd argue worrying about this is premature optimization: you should split things into different files where it is logical to split them.
Also, if you really care about the size of such repetitive files, you can compress them.
Your example even hints at this: a simple run-length encoding of
"Hello World" x 1000
is much more space-efficient than actually writing "Hello World" out 1000 times.
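As a rough illustration, here is a minimal sketch using Python's standard zlib module (the exact compressed size depends on the compressor and settings):

import zlib

data = b"Hello World" * 1000          # about 11,000 bytes of highly repetitive text
compressed = zlib.compress(data)
print(len(data), len(compressed))     # 11000 vs. a few dozen bytes: the repetition compresses away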

Files take up space in the form of clusters on the disk. A cluster is a number of sectors, and the size depends on how the disk was formatted.
A typical size for clusters is 8 kilobytes. That would mean that the two smaller files would use two clusters (16 kilobytes) each and the larger file would use three clusters (24 kilobytes).
On average, a file will use half a cluster more than its actual size. So with a cluster size of 8 kilobytes, each file will on average have an overhead of 4 kilobytes.
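To make the arithmetic concrete, here is a small Python sketch (assuming an 8-kilobyte cluster and 12 bytes per "Hello World" line, as in the example above):

import math

def allocated_size(logical_size, cluster_size=8 * 1024):
    # On-disk size is rounded up to a whole number of clusters.
    return math.ceil(logical_size / cluster_size) * cluster_size

line = len("Hello World\n")                 # 12 bytes per line
print(2 * allocated_size(1000 * line))      # two files: 32768 bytes (2 clusters each)
print(allocated_size(2000 * line))          # one file:  24576 bytes (3 clusters)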

Most filesystems use a fixed-size cluster (4 kB is typical but not universal) for storing files. Files below this cluster size will all take up the same minimum amount.
Even above this size, the proportional wastage tends to be high when you have lots of small files. Ignoring skewness of size distribution (which makes things worse), the overall wastage is about half the cluster size times the number of files, so the fewer files you have for a given amount of data, the more efficiently you will store things.
Another consideration is that metadata operations, especially file deletion, can be very expensive, so again smaller files aren't your friends. Some interesting work was done in ReiserFS on this front until the author was jailed for murdering his wife (I don't know the current state of that project).
If you have the option, you can also tune the file sizes to always fill up a whole number of clusters, and then small files won't be a problem. This is usually too finicky to be worth it though, and there are other costs. For high-volume throughput, the optimal file size these days is between 64 MB and 256 MB (I think).
Practical advice: Stick your stuff in a database unless there are good reasons not to. SQLite substantially reduces the number of reasons.
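As a minimal sketch of the SQLite approach (using Python's built-in sqlite3 module; the table and file names are made up for illustration):

import sqlite3

# Thousands of small records live in one database file,
# avoiding the per-file cluster and metadata overhead.
conn = sqlite3.connect("content.db")
conn.execute("CREATE TABLE IF NOT EXISTS docs (name TEXT PRIMARY KEY, body BLOB)")
for i in range(2000):
    conn.execute("INSERT OR REPLACE INTO docs VALUES (?, ?)", ("doc%d" % i, b"Hello World"))
conn.commit()
conn.close()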

How the files will be used is also worth taking into consideration, since the API and the language used to read/write them may impose their own restrictions.
Disk fragmentation, which tends to decrease when you only have big files, will penalize data access if you're reading one big file in one shot, whereas several accesses to small files spread out over time will not be penalized as much by fragmentation.

Most filesystems allocate space in units larger than a byte (typically 4KB nowadays). Effective file sizes get "rounded up" to the next multiple of that "cluster size". Therefore, dividing up a file will almost always consume more total space. And of course there's one extra entry in the directory, which may cause it to consume more space, and many file systems have an extra intermediate layer of inodes where each file consumes one entry.
What are the drawbacks of dividing content into multiple smaller files (assuming there's reason to divide them into more files, not like this example)?
More wasted space
The possibility of running out of inodes (in extreme cases)
On some filesystems: very bad performance when directories contain many files (because they're effectively unordered lists)
Content in a single file can usually be read sequentially (i.e. without having to move the read/write head) from the HD, which is the most efficient way. When it spans multiple files, this ideal case becomes much less likely.
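If you want to see the rounding for yourself, here is a quick sketch (Python; st_blocks is reported in 512-byte units on Linux and macOS):

import os

with open("hello.txt", "w") as f:
    f.write("Hello World\n" * 1000)

st = os.stat("hello.txt")
print(st.st_size)          # logical size: 12000 bytes
print(st.st_blocks * 512)  # allocated size, rounded up to whole filesystem blocks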

Related

What does libvips VIPS_DISC_THRESHOLD default=100 mean?

Does it mean that it will take 100 MB (open via disc)?
Or does it mean that it will take 100 MB (open via memory)?
That's the threshold at which libvips will flip from open-via-memory to open-via-disc.
For small images (under 100 MB when decompressed, in this case), libvips will decompress to memory and then process from there. This is obviously not a good idea for large images, so for these libvips will decompress to a temporary disc file, then map that area of disc into virtual memory and use that as the pixel source.
tldr: set VIPS_DISC_THRESHOLD to a small number to prefer the use of disc, set it to a large number to prefer RAM.
There's a chapter in the libvips docs which goes into a lot more detail:
https://www.libvips.org/API/current/How-it-opens-files.md.html
To very quickly summarize:
libvips has at least four ways of opening images and tries hard to pick the best one for you automatically.
Sometimes it'll need a bit of help to hit the best path for your use case and you have three main ways of influencing this.
You can hint the access pattern you expect for this image with the access= parameter, you can set the threshold at which it'll flip between preferring memory and preferring disc, and you can say where you'd like disc temporaries to be held.
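As a small sketch with the pyvips binding (the threshold value, file names, and the assumption that the variable is read at startup are illustrative; see the linked chapter for the authoritative details):

import os

# Prefer disc temporaries for anything larger than about 10 MB when decompressed.
# Set before importing pyvips so libvips sees it when it initializes.
os.environ["VIPS_DISC_THRESHOLD"] = "10m"

import pyvips

# access="sequential" hints that the image will be read top-to-bottom once,
# letting libvips stream it instead of holding the whole image in memory.
image = pyvips.Image.new_from_file("big.tif", access="sequential")
image.write_to_file("out.jpg")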

What are the scenarios that make us compress data before we transfer it?

I am wondering why we need to apply file compression before we upload files to a server in some scenarios. As I understand it, as soon as the server receives the compressed files, they need to be extracted so the server can read their content. That certainly consumes the server's computing power if multiple HTTP POSTs are sent from many client-side platforms.
Therefore, the only scenarios I can think of for sending compressed files are uploading backup files, settings files, and files that only serve as backups for the client-side platforms. Please give me more scenarios for uploading compressed data.
I think the following article gives a perfect explanation of the question: http://www.dataexpedition.com/support/notes/tn0014.html
Here's the content:
Compression Pros & Cons
Simply put, compression is a process which trades CPU cycles for bytes. But the trade isn't always a good one. Sometimes you can spend a lot of valuable CPU cycles for little or no gain.
In the context of network data transport, "Should I compress?" is a common question. But the answer can get complicated, depending on several factors. The most important thing to remember is that compression can actually make your data move much slower, so it should not be used without some consideration.
When Compression Is Good
Compression algorithms try to identify large repeating patterns in a data set and replace them with smaller patterns. Ideally, this shrinks the size of the data set. For the purposes of network transport, having less data to move means it should take less time to move it.
Documents and files which consist mostly of plain text or machine executable code tend to compress well. Examples include word processing documents, HTML files, some .exe files, and some database files.
Combining many small files into a single archive prior to network transfer can often result in faster speeds than transferring each file individually. This may be true even if the individual files themselves are not compressible. Many archiving utilities have options to pack files into an archive without compression, such as the "-0" option for "zip". ExpeDat will combine the contents of a folder into a single data stream when you enable Streaming Folders.
When Compression Is Bad
Many data types are not compressible, because the repeating patterns have already been removed. This includes most images, videos, songs, any data that is already compressed, or any data that has been encrypted.
Trying to compress data that is not compressible wastes CPU time. When you are trying to move data at high speeds, that CPU time may be critical to feeding the network. So by taking away processing time with worthless compression, you can actually end up moving your data much more slowly than if you had compression turned off.
If you are using a compression utility only for the purposes of combining many small files, check for options that disable compression. For example, the "zip" command has a "-0" option which packages files into an archive without spending time trying to compress them.
Inline versus Offline
Many transport mechanisms allow you to apply compression algorithms to data as it's being transferred. This is convenient because the compression and decompression occur seamlessly without the user having to perform extra steps. But it is also risky because any CPU time spent on compression is time NOT being spent on feeding data through the network. If the network is very fast, the CPU is very slow, or the compression algorithm is unable to scale, having inline compression turned on may cause your data to move more slowly than if you turn compression off. Inline compression can be slower than no compression even when the data is compressible!
If you are going to be transferring the same data set multiple times, it pays to compress it first using Zip or Tar-Gzip. Then you can transfer the compressed archive without taking CPU cycles away from the network processing. If you are planning to encrypt your data, make sure you compress it first, then encrypt second.
Hidden Compression
Devices in your network may be applying compression without you realizing it. This becomes evident if the "speed" of the network seems to change for different data types. If the network seems slow when you are transferring data that is already compressed, but fast when you are transferring uncompressed text files, then you can be pretty sure that something out there is making compression decisions for you.
Network compression devices can be helpful in that they take the compression burden away from the end-point CPUs. But they can also create very inconsistent results since they will not work for all destinations and data types. Network level compression can also run into the same CPU trade-offs discussed above, resulting in some files moving more slowly than they would if there was no compression.
If you are testing the speed of your network, try using data that is already compressed or encrypted to ensure consistent results.
Should I Turn On Inline Compression?
For compressed data, images, audio, video, or encrypted files: No.
For other types of data, test it both ways to see which is faster.
If the network is very fast (hundreds of megabits per second or faster), consider turning off inline compression and instead compress the data before you move it.
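You can see the compressible vs. incompressible difference with a short Python sketch (random bytes stand in for data that is already compressed or encrypted):

import gzip
import os

text = b"Hello World\n" * 100000     # repetitive, highly compressible
noise = os.urandom(len(text))        # stand-in for encrypted or already-compressed data

print(len(gzip.compress(text)))      # a tiny fraction of the original size
print(len(gzip.compress(noise)))     # slightly larger than the original, and the CPU time was wasted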

How to calculate VmRSS from an ELF file

I am working on an embedded system with limited memory. I want to find a way to calculate how much memory will be used when an ELF file is running, by analyzing the file itself.
I hope the result is close to VmRSS, which I can get with cat /proc/pid/status. The memory usage changes from moment to moment while running, so a close estimate or a lower bound is also useful.
Assume that there is no dynamic memory (like through malloc) or mapped memory (through mmap).
Simplifying assumptions:
you don't use shared libraries
you don't run multiple instances of your ELF binary
you don't use swap
your binary accesses all of its code and data
there is no significant malloc or mmap usage
With above assumptions, you can look at readelf -Wl a.out | grep LOAD, and simply add together the PT_LOAD segment sizes for an upper bound on RSS.
If you do use shared libraries, you'll need to add their PT_LOAD segments as well. But if they are used by more than one binary, then total system memory consumed will be less than the total of RSS for each process. Same goes for violating assumption 2.
Violating assumptions 3 and 4 will reduce observed RSS, while violating 5 will increase it.
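Here is a small sketch of that calculation in Python, driving readelf ("a.out" is a placeholder for your binary, and the column positions assume readelf's -Wl output format):

import subprocess

def load_segments_total(path):
    # Sum the MemSiz column of all PT_LOAD segments: an upper bound on RSS
    # under the assumptions listed above.
    out = subprocess.run(["readelf", "-Wl", path],
                         capture_output=True, text=True, check=True).stdout
    total = 0
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0] == "LOAD":
            total += int(fields[5], 16)   # MemSiz is the sixth column, in hex
    return total

print(load_segments_total("a.out"))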

What is the fastest way for reading huge files in Delphi?

My program needs to read chunks from a huge binary file with random access. I have got a list of offsets and lengths which may have several thousand entries. The user selects an entry and the program seeks to the offset and reads length bytes.
The program internally uses a TMemoryStream to store and process the chunks read from the file. Reading the data is done via a TFileStream like this:
// Seek to the chunk's offset and copy Size bytes into the in-memory buffer.
FileStream.Position := Offset;
MemoryStream.CopyFrom(FileStream, Size);
This works fine but unfortunately it becomes increasingly slower as the files get larger. The file size starts at a few megabytes but frequently reaches several tens of gigabytes. The chunks read are around 100 kbytes in size.
The file's content is only read by my program. It is the only program accessing the file at the time. Also the files are stored locally so this is not a network issue.
I am using Delphi 2007 on a Windows XP box.
What can I do to speed up this file access?
edit:
The file access is slow for large files, regardless of which part of the file is being read.
The program usually does not read the file sequentially. The order of the chunks is user driven and cannot be predicted.
It is always slower to read a chunk from a large file than to read an equally large chunk from a small file.
I am talking about the performance for reading a chunk from the file, not about the overall time it takes to process a whole file. The latter would obviously take longer for larger files, but that's not the issue here.
I need to apologize to everybody: After I implemented file access using a memory mapped file as suggested it turned out that it did not make much of a difference. But it also turned out after I added some more timing code that it is not the file access that slows down the program. The file access takes actually nearly constant time regardless of the file size. Some part of the user interface (which I have yet to identify) seems to have a performance problem with large amounts of data and somehow I failed to see the difference when I first timed the processes.
I am sorry for being sloppy in identifying the bottleneck.
If you open the help topic for the CreateFile() WinAPI function, you will find interesting flags there such as FILE_FLAG_NO_BUFFERING and FILE_FLAG_RANDOM_ACCESS. You can play with them to gain some performance.
Next, copying the file data, even 100 KB at a time, is an extra step which slows down operations. It is a good idea to use the CreateFileMapping and MapViewOfFile functions to get a ready-to-use pointer to the data. This way you avoid copying and also possibly get certain performance benefits (but you need to measure speed carefully).
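The memory-mapping idea, sketched in Python for brevity since the principle is the same (in Delphi you would use CreateFileMapping/MapViewOfFile as described; the file name, offset and size here are made up):

import mmap

with open("huge.bin", "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Slicing the map reads the chunk straight from the OS page cache,
        # with no explicit seek-and-copy into an intermediate buffer.
        offset, size = 1000000, 100 * 1024
        chunk = mm[offset:offset + size]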
Maybe you can take this approach:
Sort the entries on file position and then do the following:
Take the entries that only need the first X MB of the file (up to a certain file position)
Read X MB from the file into a buffer (TMemoryStream)
Now read the entries from the buffer (maybe multithreaded)
Repeat this for all the entries.
In short: cache a part of the file and read all entries that fit into it (multithreaded), then cache the next part, etc.
Maybe you can gain speed if you just take your original approach, but sort the entries on position.
The stock TMemoryStream in Delphi is slow due to the way it allocates memory. The NexusDB company has TnxMemoryStream which is much more efficient. There might be some free ones out there that work better.
The stock Delphi TFileStream is also not the most efficient component. Way back in history, Julian Bucknall published a component named BufferedFileStream in a magazine or somewhere that worked with file streams very efficiently.
Good luck.

OutOfMemoryException Processing Large File

We are loading a large flat file into BizTalk Server 2006 (Original release, not R2) - about 125 MB. We run a map against it and then take each row and make a call out to a stored procedure.
We receive an OutOfMemoryException during orchestration processing; the Windows service restarts, uses the full 2 GB of memory, and crashes again.
The server is 32-bit and set to use the /3GB switch.
Also I've separated the flow into 3 hosts - one for receive, the other for orchestration, and the third for sends.
Anyone have any suggestions for getting this file to process without error?
Thanks,
Krip
If this is a flat file being sent through a map, you are converting it to XML, right? The increase in size could be huge. XML can easily add a factor of 5-10 times over a flat file, especially if you use descriptive or long XML tag names (which normally you would).
Something simple you could try is to rename the XML nodes to shorter names; depending on the number of records (which sounds like a lot), it might actually have a pretty significant impact on your memory footprint.
Perhaps a more enterprise approach would be to subdivide this in a custom pipeline into separate message packets that can be fed through the system in more manageable chunks (similar to what Chris suggests). Then the system throttling and memory metrics could take over. Without knowing more about your data it would be hard to say how best to do this, but with a 125 MB file I am guessing that you probably have a ton of repeating rows that do not need to be processed sequentially.
Where does it crash? Does it make it past the Transform shape? Another suggestion to try is to run the transform in the Receive Port. For more efficient processing, you could even debatch the message and have multiple simultaneous orchestration instances calling the stored procs. This would definitely reduce the memory profile and increase performance.

Resources