I am developing an application that needs to read and write some data. My first solution was to store the data as a JSON-encoded string in an SQLite database. Since deserialization of the JSON string was slow (about 5 seconds) and I couldn't pre-buffer any data, I decided to store the data in a binary file on disk. For that I implemented a reader that reads the binary file. Now I have compared the speed results and found that the times are more or less the same (the file size is better, though).
I am using NSFileHandle for reading the file and I am reading it line by line. I tested this on an iPhone 3GS with 0.5 MB of data. Is this normal? Should I switch to reading the file using C/C++ functions? Would it be any better? Does anyone have any experience with this? My code is more or less based on this question: How to read data from NSFileHandle line by line?
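For reference, this is roughly the plain C version I have in mind: read the whole file into one buffer with a single fread instead of going line by line (untested sketch, error handling trimmed):

#include <stdio.h>
#include <stdlib.h>

/* Read the complete binary file into one heap buffer with a single fread.
   Returns the buffer (caller frees it) and writes the size to *out_size. */
static unsigned char *read_whole_file(const char *path, size_t *out_size)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;

    fseek(f, 0, SEEK_END);              /* determine the file size */
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);

    unsigned char *buffer = malloc((size_t)size);
    if (buffer && fread(buffer, 1, (size_t)size, f) != (size_t)size) {
        free(buffer);
        buffer = NULL;
    }
    fclose(f);

    if (buffer) *out_size = (size_t)size;
    return buffer;
}

With only 0.5 MB of data the whole file fits comfortably in memory, and the individual records would then be parsed straight out of the buffer, which is where the per-line read overhead would disappear.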
Thanks!
I am writing an Excel parsing web API that reads a structured file into objects (using a recursive function).
But I've noticed a significant memory spike while parsing some files, which ends up throwing an OutOfMemory exception. The Excel parsing engine needs the whole file to be loaded before it can read its structure. I found that it's not the loading that consumes most of the memory; it's the parsing (turning the Excel data into structured objects) and the JSON HTTP response (serializing the objects to JSON) that finally kill the memory. For example, a 1 MB file can parse into 70 MB of JSON.
So I googled around, found the .NET Memory Profiler, and tried to analyze what was leading to this huge memory usage. Here is the snapshot that I captured while parsing the same file twice. I noticed that there are huge string/Object[] allocations that are not being GCed.
Now I'm at a loss. What are the best practices when dealing with lots of List and string instances? Where should I start looking in order to reduce the memory usage? And what are the best practices for handling a long-running process (adding a queue? Using SignalR to notify about the result?)?
Some guidance would be really appreciated!
I'd say you need to avoid loading the whole file at once.
Here is a good place to start:
Load Large Excel file in C#
I've got an iOS app compressing a bunch of small chunks of data. I use compression_encode_buffer running in LZ4 mode to do it so that it is fast enough for my needs.
Later, I take the file[s] I made and decode them on a non-Apple device. Previously I'd been using their ZLIB compression mode and could successfully decode it in C# with System.IO.Compression.DeflateStream.
However, I'm having a hell of a time with the LZ4 output. Based on the LZ4 docs here, Apple breaks the stream into a bunch of blocks, each starting with a 4-byte magic number, a 4-byte decompressed size, and a 4-byte compressed size. All that makes sense, and I'm able to parse the file into its constituent raw-LZ4 chunks. Each chunk in the buffer iOS outputs decompresses to about 65,635 bytes, and there are about 10 of them in my case.
But then: I have no idea what to DO with the LZ4 chunks I'm left with. I've tried decoding them with LZ4net's LZ4.LZ4Stream and LZ4.LZ4Codec (the latter manages the first block, but then fails when I feed in the 2nd one). I've also tried several C++ libraries to decode the data. Each of them seems to be looking for a header that the iOS compression functions encode in a non-standard way.
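For what it's worth, based purely on the block layout described above (4-byte magic, 4-byte decompressed size, 4-byte compressed size, then the compressed payload), this is the kind of loop I would expect to work with the reference lz4 C library's LZ4_decompress_safe. Treat it as an untested sketch: little-endian headers are assumed, and the end-of-stream and "stored uncompressed" block markers would need their own handling.

#include <stdint.h>
#include <string.h>
#include "lz4.h"   /* reference lz4 library: LZ4_decompress_safe */

/* Walk a buffer produced by compression_encode_buffer in LZ4 mode and decode
   each block with a raw-LZ4 block decoder.  Per-block layout as described
   above: 4-byte magic, 4-byte decompressed size, 4-byte compressed size,
   then the compressed payload. */
static int decode_apple_lz4(const uint8_t *src, size_t src_len,
                            uint8_t *dst, size_t dst_capacity)
{
    size_t in = 0, out = 0;

    while (in + 12 <= src_len) {
        uint32_t magic, raw_size, comp_size;
        memcpy(&magic,     src + in,     4);   /* block-type marker          */
        memcpy(&raw_size,  src + in + 4, 4);   /* decompressed size of block */
        memcpy(&comp_size, src + in + 8, 4);   /* compressed payload size    */
        in += 12;
        (void)magic;   /* marker not checked in this sketch; end-of-stream and
                          uncompressed-block variants need special handling */

        if (comp_size == 0)                    /* assumed end-of-stream      */
            break;
        if (in + comp_size > src_len || out + raw_size > dst_capacity)
            return -1;                         /* malformed or dst too small */

        /* Each payload should be a plain LZ4 block, so the reference decoder
           can handle it one block at a time. */
        int n = LZ4_decompress_safe((const char *)(src + in),
                                    (char *)(dst + out),
                                    (int)comp_size, (int)raw_size);
        if (n < 0 || (uint32_t)n != raw_size)
            return -1;

        in  += comp_size;
        out += raw_size;
    }
    return (int)out;   /* total decompressed bytes */
}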
Answering my own question: Apple's LZ4 decompressor (with the necessary modifications to handle their raw storage format) is here: https://opensource.apple.com/source/xnu/xnu-3789.21.4/osfmk/vm/lz4.c.auto.html
Edit afterwards: I actually wasn't able to get this working, but I didn't spend much time on it because I found Apple's LZFSE decompressor.
LZFSE Decompressor can be found here: https://github.com/lzfse/lzfse
A piece of software I'm working on outputs quite a lot of files, which are then stored on a server. During its runtime I've had one file go corrupt on me. These files are critical to the operation, so this cannot happen. I'm therefore trying to come up with a way of adding error correction to the files to prevent this from ever happening again.
I've read up on Reed-Solomon, which encodes k blocks of data plus m blocks of parity and can then reconstruct up to m missing blocks. So what I'm thinking is taking the data stream, splitting it into these blocks, and then storing them in sequence on disk: first the data blocks, then the parity blocks. Repeat until the entire file is stored. k, m, and the block size are of course variables I'll have to investigate and play with.
However, it's my understanding that Reed-Solomon requires you to know which blocks are corrupt. How could I possibly know that? My thinking is that I'd have to add some extra, simpler error-detection code, like CRC32, to each of the blocks as I write them; otherwise I can't know whether they're corrupted.
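To make that concrete, this is the kind of per-block scheme I have in mind, sketched in C: every block (data or parity) is written with a CRC32 trailer, and on read the blocks whose CRC doesn't match become the erasure list handed to whatever Reed-Solomon decoder I end up using. The CRC routine below is a plain bitwise implementation; zlib's crc32() would do just as well.

#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32 (IEEE polynomial, reflected).  Slow but dependency-free. */
static uint32_t crc32_buf(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    }
    return ~crc;
}

/* One stripe on disk = k data blocks + m parity blocks, each followed by its
   own 4-byte CRC32 (little-endian).  On read, every block's CRC is verified;
   the indices that fail become the erasure list given to the Reed-Solomon
   decoder, which can repair up to m of them. */
typedef struct {
    size_t block_size;   /* payload bytes per block               */
    int    k, m;         /* data blocks, parity blocks per stripe */
} stripe_layout;

/* Fills erasures[] with the indices (0..k+m-1) of corrupt blocks and returns
   how many were found; the stripe is recoverable as long as that is <= m. */
static int find_erasures(const stripe_layout *L, const uint8_t *stripe,
                         int *erasures)
{
    int bad = 0;
    for (int i = 0; i < L->k + L->m; i++) {
        const uint8_t *block = stripe + (size_t)i * (L->block_size + 4);
        uint32_t stored = (uint32_t)block[L->block_size]
                        | (uint32_t)block[L->block_size + 1] << 8
                        | (uint32_t)block[L->block_size + 2] << 16
                        | (uint32_t)block[L->block_size + 3] << 24;
        if (crc32_buf(block, L->block_size) != stored)
            erasures[bad++] = i;
    }
    return bad;
}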
Have I understood this correctly, or is there a better way to accomplish this?
This is a bit of an older question, but (in my mind) it is always something useful and in some cases necessary. Bit rot will never be completely cured (hush, ZFS community; ZFS only has control of what's on its filesystem while it's there), so we always have to come up with proactive prevention and recovery plans.
While it was designed to facilitate piracy (specifically storing and extracting multi-GB files in chunks on newsgroups where any chunk could go missing or be corrupted), "Parchives" are actually exactly what you're looking for (see the white paper, though don't implement that scheme directly as it has a bug and newer schemes are available), and they work in practice as follows:
The complete file is input into the encoder
Blocks are processed and Reed-Solomon blocks are generated
.par files containing those blocks are output alongside the original file
When integrity is checked (typically on the other end of a file transfer), the blocks are rechecked and any blocks that need to be used to reconstruct missing data are pulled from the .par files.
Things eventually settled into "PAR2" (essentially a rewrite with additional features) with the following scheme:
Large file compressed with RAR and split into chunks (typically around 100 MB each, as that was a "usually safe" maximum for Usenet)
An "index" file is placed alongside the file (for example bigfile.PAR2). This has no recovery chunks.
A series of .par files totaling 10% of the original data size sit alongside it, in increasingly larger file sizes (bigfile.vol029+25.PAR2, bigfile.vol104+88.PAR2, etc.)
The person on the other end then gets all the .rar files
An integrity check is run and returns a count in MB of how much data needs recovery
.PAR2 files are downloaded in an amount equal to or greater than the need
Recovery is done and integrity verified
RAR is extracted, and the original file is successfully transferred
Now without a filesystem layer this system is still fairly trivial to implement using the Parchive tools, but it has two requirements:
That the files do not change (any change to a file on disk invalidates its parity data; of course you could allow changes, at the cost of added complexity, with a copy-on-change writing scheme)
That you run both the file generation and integrity check/recovery when appropriate.
Since all the math and methods are both known and battle-tested, you can also roll your own to meet whatever needs you have (as a hook into file read/write, spanning arbitrary path depths, storing recovery data on a separate drive, etc). For initial tips, refer to the pros: https://www.backblaze.com/blog/reed-solomon/
Edit: The same research that led me to this question led me to a whole subset of already-done work that I was previously unaware of
https://crates.io/crates/solana-reed-solomon-erasure (as well as a bunch of other implementations in the Rust crate registry)
https://github.com/klauspost/reedsolomon (based on the BackBlaze code, and processes 1Gbps per core)
Etc. Look for "Reed-Solomon file recovery".
I'm struggling with memory management in iOS while downloading relatively large files from the web (such as videos around 350 MB in size).
The goal here is to download these kinds of files and store them in Core Data in a Binary Data field.
At the moment I'm using the NSURLSession.dataTaskWithURL and NSURLSession.dataTaskWithRequest methods to retrieve these files, but it looks like these methods don't handle issues such as memory usage; they just keep filling memory until it reaches its maximum, leaving me with a memory warning when I reach about 380 MB.
(Screenshots: initial memory usage and the memory warning.)
What's the best strategy to perform this kind of large data retrieval from the web without hitting a memory warning? Can Alamofire or other libraries deal with this problem?
It is better to use a download task, save the video as a file in the Documents or Library directory, and then store the relative path in Core Data.
If you use a download task:
You can resume if the last download fails
It needs less memory
You can try AFNetworking to download large files.
My program needs to read chunks from a huge binary file with random access. I have got a list of offsets and lengths which may have several thousand entries. The user selects an entry and the program seeks to the offset and reads length bytes.
The program internally uses a TMemoryStream to store and process the chunks read from the file. Reading the data is done via a TFileStream like this:
FileStream.Position := Offset;
MemoryStream.CopyFrom(FileStream, Size);
This works fine but unfortunately it becomes increasingly slower as the files get larger. The file size starts at a few megabytes but frequently reaches several tens of gigabytes. The chunks read are around 100 kbytes in size.
The file's content is only read by my program. It is the only program accessing the file at the time. Also the files are stored locally so this is not a network issue.
I am using Delphi 2007 on a Windows XP box.
What can I do to speed up this file access?
edit:
The file access is slow for large files, regardless of which part of the file is being read.
The program usually does not read the file sequentially. The order of the chunks is user driven and cannot be predicted.
It is always slower to read a chunk from a large file than to read an equally large chunk from a small file.
I am talking about the performance for reading a chunk from the file, not about the overall time it takes to process a whole file. The latter would obviously take longer for larger files, but that's not the issue here.
I need to apologize to everybody: after I implemented file access using a memory-mapped file as suggested, it turned out that it did not make much of a difference. But after I added some more timing code, it also turned out that it is not the file access that slows down the program. The file access takes nearly constant time regardless of the file size. Some part of the user interface (which I have yet to identify) seems to have a performance problem with large amounts of data, and somehow I failed to see the difference when I first timed the processes.
I am sorry for being sloppy in identifying the bottleneck.
If you open the help topic for the CreateFile() WinAPI function, you will find interesting flags there such as FILE_FLAG_NO_BUFFERING and FILE_FLAG_RANDOM_ACCESS. You can play with them to gain some performance.
Next, copying the file data, even 100 KB in size, is an extra step which slows down operations. It is a good idea to use the CreateFileMapping and MapViewOfFile functions to get a ready-to-use pointer to the data. This way you avoid copying and also possibly get certain performance benefits (but you need to measure speed carefully).
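For illustration, a minimal sketch of the mapping approach in plain Win32 C (the same functions are declared in Delphi's Windows unit); the one non-obvious part is that the view offset passed to MapViewOfFile must be a multiple of the system allocation granularity:

#include <windows.h>

/* Map a read-only view that covers [offset, offset+size) and return a pointer
   to the requested bytes.  The offset is aligned down to the allocation
   granularity and the difference is added back to the returned pointer.
   The caller passes *view_base to UnmapViewOfFile when done. */
static const BYTE *map_chunk(HANDLE mapping, ULONGLONG offset, SIZE_T size,
                             LPVOID *view_base)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);

    ULONGLONG aligned = offset - (offset % si.dwAllocationGranularity);
    SIZE_T    slack   = (SIZE_T)(offset - aligned);

    LPVOID view = MapViewOfFile(mapping, FILE_MAP_READ,
                                (DWORD)(aligned >> 32),
                                (DWORD)(aligned & 0xFFFFFFFFu),
                                size + slack);
    if (view == NULL) return NULL;

    *view_base = view;
    return (const BYTE *)view + slack;
}

/* Usage sketch:
   HANDLE file    = CreateFile(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_FLAG_RANDOM_ACCESS, NULL);
   HANDLE mapping = CreateFileMapping(file, NULL, PAGE_READONLY, 0, 0, NULL);
   LPVOID base;
   const BYTE *p  = map_chunk(mapping, Offset, Size, &base);
   ... use p[0..Size-1], e.g. copy the bytes into the TMemoryStream ...
   UnmapViewOfFile(base); */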
Maybe you can take this approach:
Sort the entries on their maximum file position and then do the following:
Take the entries that only need the first X MB of the file (up to a certain file position)
Read X MB from the file into a buffer (TMemoryStream)
Now read the entries from the buffer (maybe multithreaded)
Repeat this for all the entries.
In short: cache a part of the file and read all entries that fit into it (multithreaded), then cache the next part, etc.; a rough sketch follows below.
Maybe you can gain speed if you just take your original approach, but sort the entries on position.
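For illustration, here is the caching idea above as a rough C sketch: sort the requests by offset, read one large sequential window, and serve every request that falls entirely inside it before moving on. The 64 MB window and the structures are placeholders; the right window size is something to measure.

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    long long offset;   /* position of the chunk in the file */
    size_t    length;   /* chunk size, ~100 KB in this case  */
} chunk_request;

static int by_offset(const void *a, const void *b)
{
    const chunk_request *x = a, *y = b;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

#define WINDOW_SIZE (64u * 1024 * 1024)   /* arbitrary cache window */

/* Serve all requests from large sequential reads instead of one small
   seek+read per chunk.  'handle' receives each chunk's bytes. */
static void read_sorted(FILE *f, chunk_request *reqs, size_t count,
                        void (*handle)(const unsigned char *data, size_t len))
{
    unsigned char *window = malloc(WINDOW_SIZE);
    size_t i = 0;

    qsort(reqs, count, sizeof *reqs, by_offset);

    while (i < count && window != NULL) {
        long long base = reqs[i].offset;
        fseeko(f, base, SEEK_SET);        /* _fseeki64 on Windows compilers */
        size_t got = fread(window, 1, WINDOW_SIZE, f);
        size_t served = 0;

        /* hand out every request that lies entirely inside this window */
        while (i < count &&
               reqs[i].offset + (long long)reqs[i].length <= base + (long long)got) {
            handle(window + (size_t)(reqs[i].offset - base), reqs[i].length);
            i++;
            served++;
        }
        if (served == 0)   /* request larger than the window or past EOF */
            break;
    }
    free(window);
}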
The stock TMemoryStream in Delphi is slow due to the way it allocates memory. The NexusDB company has TnxMemoryStream which is much more efficient. There might be some free ones out there that work better.
The stock Delphi TFileStream is also not the most efficient component. Way back in history Julian Bucknall published a component named BufferedFileStream in a magazine or somewhere that worked with file streams very efficiently.
Good luck.