TStringList.LoadFromFile - Exceptions with Large Text Files - delphi

I'm running Delphi RAD Studio XE2.
I have some very large files, each containing a large number of lines. The lines themselves are small - just 3 tab separated doubles. I want to load a file into a TStringList using TStringList.LoadFromFile but this raises an exception with large files.
For files of 2 million lines (approximately 1GB) I get the EIntOverflow exception. For larger files (20 million lines and approximately 10GB, for example) I get the ERangeCheck exception.
I have 32GB of RAM to play with and am just trying to load this file and use it quickly. What's going on here and what other options do I have? Could I use a file stream with a large buffer to load this file into a TStringList? If so could you please provide an example.

When Delphi switched to Unicode in Delphi 2009, the TStrings.LoadFromStream() method (which TStrings.LoadFromFile() calls internally) became very inefficient for large streams/files.
Internally, LoadFromStream() reads the entire file into memory as a TBytes, then converts that to a UnicodeString using TEncoding.GetString() (which decodes the bytes into a TCharArray, copies that into the final UnicodeString, and then frees the array), then parses the UnicodeString (while the TBytes is still in memory) adding substrings into the list as needed.
So, just prior to LoadFromStream() exiting, there are four copies of the file data in memory - three copies taking up at worst filesize * 3 bytes of memory (where each copy uses its own contiguous memory block, plus some memory-manager overhead), and one copy for the parsed substrings! Granted, the first three copies are freed when LoadFromStream() actually exits. But this explains why you are getting memory errors before reaching that point - LoadFromStream() is trying to use 3-4 GB of memory to load a 1 GB file, and the RTL's memory manager cannot handle that.
If you want to load the content of a large file into a TStringList, you are better off using TStreamReader instead of LoadFromFile(). TStreamReader uses a buffered file I/O approach to read the file in small chunks. Simply call its ReadLine() method in a loop, Add()'ing each line to the TStringList. For example:
//MyStringList.LoadFromFile(filename);
Reader := TStreamReader.Create(filename, true);
try
  MyStringList.BeginUpdate;
  try
    MyStringList.Clear;
    while not Reader.EndOfStream do
      MyStringList.Add(Reader.ReadLine);
  finally
    MyStringList.EndUpdate;
  end;
finally
  Reader.Free;
end;
Maybe some day, LoadFromStream() might be re-written to use TStreamReader internally like this.
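On the "large buffer" part of the question: TStreamReader also has a constructor overload that takes a TEncoding together with an explicit buffer size, so the read chunk can be made larger if profiling shows the default matters. A minimal sketch, where the UTF-8 encoding and the 1 MB buffer size are assumptions rather than requirements:
// Same reading loop as above, only the constructor call changes.
// TEncoding.UTF8 and the 1 MB buffer size are arbitrary choices here.
Reader := TStreamReader.Create(filename, TEncoding.UTF8, True, 1024 * 1024);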

Related

How to process big files in Dart asynchronously in a performant way

By using File.openRead(), Dart allows big files to be read asynchronously in chunks of 64 kB. But as the chunks are of type List<int>, I doubt that this is a performant method.
There is a datatype ByteBuffer which would probably be a perfect match for that requirement, as the data could be transferred directly from disk to memory.
But by returning a List<int>, the file has to be read byte by byte, and for every byte a 64-bit integer object has to be created and appended to the list. So my question:
Is there an internal optimization to List to make it performant?
Or are there different methods for more efficiency?
It seems there is an internal optimization for this: the implementation uses a Uint8List, so there isn't the wasted memory you describe.
Source: file_impl.dart

What is the maximum file size for TIniFile?

My program manipulates an ini file using TIniFile. I've read that the TIniFile class has a 64 kB limit on a single section. However, it seems to be working for more than 100 kB in my tests. I'm using Delphi 10.3.3 and Windows 10.
Does the 64 kB limit exist only in old versions of Windows? Or should I use TMemIniFile to stay safe?
Basically, there is no limit on the size of an ini file, nor one imposed by the GetPrivateProfileString routine (which TIniFile uses to read the data).
But there are some limits and things to consider when using TIniFile.
Looking into the code of the TIniFile implementation (thank you Delphi), there are several places where GetPrivateProfileString is used to retrieve data from an ini file.
In TIniFile.ReadString the buffer size is fixed to 2048 (2k) for reading string values.
As all other 'value' requesting routines use this routine to actually read the data from the inifile, it basically limits the buffer size for all those routines.
Second, the TIniFile.ReadSections routine uses a starting buffer of 16384 (16k) characters. But when this buffer is too small it uses a dynamic buffer which is based on the file size, so this way you won't run into a buffer problem (but because this actually reads the entire file to estimate the buffer size, this will be very slow with large ini files).
Last, there is the TIniFile.ReadSection routine, which uses an initial buffer size of 1024 (1k) but dynamically allocates a larger buffer when needed. So at this point there also doesn't seem to be a limit to the (file) size.
NOTE: this information is based on Delphi 10.3 and Delphi XE2.
In older versions there were other buffer allocation strategies...
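If a value ever does hit ReadString's fixed 2048-character buffer, one way around it is TMemIniFile, which loads and parses the whole file itself rather than going through GetPrivateProfileString, so it is not tied to those buffer sizes. A minimal sketch (the file, section and key names are made up for illustration):
var
  Ini: TMemIniFile;   // TMemIniFile lives in the IniFiles unit
  Value: string;
begin
  Ini := TMemIniFile.Create('C:\Data\settings.ini');  // hypothetical file
  try
    // Not clipped to TIniFile.ReadString's 2048-character buffer.
    Value := Ini.ReadString('Section1', 'LongValue', '');
  finally
    Ini.Free;
  end;
end;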

Delphi out of memory error when i load file to memory stream

I want to convert a 2 GB file to an array of bytes with Delphi. I use this function, then load the file into a memory stream to get the bytes. But I get an "Out of memory" error. How can I solve this problem?
type
  TByteArray = array of Byte;

function StreamToByteArray(Stream: TStream): TByteArray;
begin
  // Check stream
  if Assigned(Stream) then
  begin
    // Reset stream position
    Stream.Position := 0;
    // Allocate size
    SetLength(Result, Stream.Size);
    // Read contents of stream
    Stream.Read(Result[0], Stream.Size);
  end
  else
    // Clear result
    SetLength(Result, 0);
end;
// then in a button click handler I use:
var
  strmMem: TMemoryStream;
  bytes: TByteArray;
begin
  strmMem := TMemoryStream.Create;
  if OpenDialog1.Execute then
    strmMem.LoadFromFile(OpenDialog1.FileName);
  bytes := StreamToByteArray(strmMem);
  strmMem.Free;
end;
A 32 bit process has a total of 4GB of address space. Unless it has the large address aware flag set, only 2GB of that address space is available to it.
You are attempting to load a 2GB file into memory, in a contiguous block of address space. There is no chance of you being able to succeed. Even with a large address aware 4GB address space there's little hope for you finding a contiguous 2GB block of address space.
Furthermore, you are also attempting to read the file into memory twice, so you actually need two 2GB contiguous blocks. One for the stream, and one for the array. This is a result of you using the memory stream anti-pattern as described below.
Some options:
Switch to a 64 bit process, or
load the entire file, but in discontinuous blocks, or
process the file piece by piece, in smaller chunks.
Regarding the use of a memory stream, this is a recurring anti-pattern. I'd say that >90% of the uses of memory streams that we see here in the Delphi Stack Overflow tag are needless and wasteful.
The mistake is to load into memory just to be able to copy to some other memory. You are trying to read the file into an array. So read it directly into an array. The memory stream is pointless. Use a file stream. Read from the file stream into the array. That way you only load a single copy of the file into memory.
Of course, you'll still struggle to fit a 2GB file into memory even with that change, but you should still aim to hold only one copy of the data in memory.
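To illustrate that last point, here is a minimal sketch that reads the file straight into the dynamic array through a TFileStream, so only one copy of the data is ever held (it still needs one contiguous block of roughly the file size, so the 64-bit or chunked options above still apply; TByteArray is the type from the question):
function FileToByteArray(const FileName: string): TByteArray;
var
  fs: TFileStream;
begin
  // Read from the file stream directly into the array -
  // no intermediate TMemoryStream copy.
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, fs.Size);
    if fs.Size > 0 then
      fs.ReadBuffer(Result[0], fs.Size);
  finally
    fs.Free;
  end;
end;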

+ [NSString stringWithContentsOfFile:] performance

I need to load files in iOS, now I use the + [NSString stringWithContentsOfFile:].
The files are mostly 500 kB to 5 MB.
I load an approx. 4 MB file, and Instruments and a stopwatch told me it needs 1.5 seconds to load this file. In my opinion it's a bit slow; is there a way to get the string faster?
EDIT:
I tried some things and have now noticed that the creation of the NSString is my problem: it takes 97% of the time, not the actual loading from disk.
If you either know the encoding or can determine it (the API you use now is basic), you can just treat it as a char buffer (in a manner which is encoding aware).
I'd begin by opening it using memory mapped data (mmap, you can also approach this using NSData). madvise can be used to hint how you will access the file.
If memory mapped I/O consumes too much memory for your use, you should drop down to incremental reads, like C I/O facilities - fopen, fread, etc.. This will typically require more I/O events than memory mapped data (can be much slower, depending on how the data is accessed).
In both cases, you would treat the string as a C string -- don't simply convert the whole file to an NSString upon opening.
Foundation has a lot of tricks, so make sure this actually improves performance for your specific use case.
If those solutions are too low-level for your use case, just consider using smaller files instead (dividing the existing files).

TStringList of objects taking up tons of memory in Delphi XE

I'm working on a simulation program.
One of the first things the program does is read in a huge file (28 MB, about 79,000 lines), parse each line (about 150 fields), create an object for it, and add it to a TStringList.
It also reads in another file, which adds more objects during the run. At the end, it ends up being about 85,000 objects.
I was working with Delphi 2007, and the program used a lot of memory, but it ran OK. I upgraded to Delphi XE, and migrated the program over and now it's using a LOT more memory, and it ends up running out of memory half way through the run.
So in Delphi 2007, it would end up using 1.4 GB after reading in the initial file, which is obviously a huge amount, but in XE it ends up using almost 1.8 GB, which is really huge and leads to it running out of memory and getting the error.
So my questions are:
Why is it using so much memory?
Why is it using so much more memory in XE than 2007?
What can I do about this? I can't change how big or long the file is, and I do need to create an object for each line and store it somewhere.
Thanks
Just one idea which may save memory.
You could let the data stay on the original files, then just point to them from in-memory structures.
For instance, it's what we do for browsing big log files almost instantly: we memory-map the log file content, then we parse it quickly to create indexes of useful information in memory, then we read the content dynamically. No string is created during the reading, only pointers to each line beginning, with dynamic arrays containing the needed indexes. Calling TStringList.LoadFromFile would definitely be much slower and more memory consuming.
The code is here - see the TSynLogFile class. The trick is to read the file only once, and make all indexes on the fly.
For instance, here is how we retrieve a line of text from the UTF-8 file content:
function TMemoryMapText.GetString(aIndex: integer): string;
begin
  if (self = nil) or (cardinal(aIndex) >= cardinal(fCount)) then
    result := ''
  else
    result := UTF8DecodeToString(fLines[aIndex], GetLineSize(fLines[aIndex], fMapEnd));
end;
We use the exact same trick to parse JSON content. Such a mixed approach is what the fastest XML access libraries use.
To handle your high-level data, and query them fast, you may try to use dynamic arrays of records, and our optimized TDynArray and TDynArrayHashed wrappers (in the same unit). Arrays of records will be less memory consuming, will be faster to search in because the data won't be fragmented (even faster if you use ordered indexes or hashes), and you'll be able to have high-level access to the content (you can define custom functions to retrieve the data from the memory-mapped file, for instance). Dynamic arrays aren't suited to fast deletion of items (or you'll have to use lookup tables) - but you wrote that you are not deleting much data, so it won't be a problem in your case.
So you won't have any duplicated structure any more, only logic in RAM, and data on memory-mapped file(s) - I added a "s" here because the same logic could perfectly map to several source data files (you need some "merge" and "live refresh" AFAIK).
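As a rough illustration of the arrays-of-records idea (independent of the TDynArray wrappers), the per-line storage could be declared something like this; the record and field names are made up for illustration:
type
  // Hypothetical per-line record: all ~85,000 entries live in one
  // contiguous dynamic array instead of one heap-allocated object each.
  TLineInfo = record
    LineStart: PAnsiChar;   // points into the memory-mapped file content
    LineLength: Integer;
    Key: Cardinal;          // whatever field you search or sort on
  end;
var
  Lines: array of TLineInfo; // SetLength(Lines, LineCount) once, then fill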
It's hard to say why your 28 MB file is expanding to 1.4 GB worth of objects when you parse it out into objects without seeing the code and the class declarations. Also, you say you're storing it in a TStringList instead of a TList or TObjectList. This sounds like you're using it as some sort of string->object key/value mapping. If so, you might want to look at the TDictionary class in the Generics.Collections unit in XE.
As for why you're using more memory in XE, it's because the string type changed from an ANSI string to a UTF-16 string in Delphi 2009. If you don't need Unicode, you could key the mapping on AnsiString values (for example, a TDictionary<AnsiString, TObject>) to save space.
Also, to save even more memory, there's another trick you could use if you don't need all 79,000 of the objects right away: lazy loading. The idea goes something like this:
Read the file into a TStringList. (This will use about as much memory as the file size. Maybe twice as much if it gets converted into Unicode strings.) Don't create any data objects.
When you need a specific data object, call a routine that checks the string list and looks up the string key for that object.
Check if that string has an object associated with it. If not, create the object from the string and associate it with the string in the TStringList.
Return the object associated with the string.
This will keep both your memory usage and your load time down, but it's only helpful if you don't need all (or a large percentage) of the objects immediately after loading.
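A minimal sketch of that lazy-loading idea; TDataObject, its Line field, and GetDataObject are hypothetical names standing in for your own class and parsing code:
type
  // Hypothetical stand-in for your per-line class; real code would
  // parse the ~150 fields in the constructor.
  TDataObject = class
  public
    Line: string;
    constructor Create(const ALine: string);
  end;

constructor TDataObject.Create(const ALine: string);
begin
  inherited Create;
  Line := ALine;  // parse the fields here instead of keeping the raw line
end;

// Return the object for row Index, creating it only on first access and
// caching it in the string list's Objects[] slot.
function GetDataObject(List: TStringList; Index: Integer): TDataObject;
begin
  if List.Objects[Index] = nil then
    List.Objects[Index] := TDataObject.Create(List[Index]);
  Result := TDataObject(List.Objects[Index]);
end;
The objects created this way still need to be freed when you are done with the list.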
In Delphi 2007 (and earlier), a string is an Ansi string, that is, every character occupies 1 byte of memory.
In Delphi 2009 (and later), a string is a Unicode string, that is, every character occupies 2 bytes of memory.
AFAIK, there is no way to make a Delphi 2009+ TStringList object use Ansi strings. Are you really using any of the features of the TStringList? If not, you could use an array of strings instead.
Then, naturally, you can choose between
type
  TAnsiStringArray = array of AnsiString;
  // or
  TUnicodeStringArray = array of string; // in Delphi 2009+, string = UnicodeString
Reading though the comments, it sounds like you need to lift the data out of Delphi and into a database.
From there it is easy to match organ donors to recipients *)
SELECT pw.* FROM patients_waiting pw
INNER JOIN organs_available oa ON (pw.bloodtype = oa.bloodtype)
AND (pw.tissuetype = oa.tissuetype)
AND (pw.organ_needed = oa.organ_offered)
WHERE oa.id = '15484'
If you want to see the patients that might match against new organ-donor 15484.
In memory you only handle the few patients that match.
*) simplified beyond all recognition, but still.
In addition to Andreas' post:
Before Delphi 2009, a string header occupied 8 bytes. Starting with Delphi 2009, a string header takes 12 bytes. So every unique string uses 4 bytes more than before, on top of each character taking twice the memory.
Also, starting with Delphi 2010 I believe, TObject started using 8 bytes instead of 4. So for each single object created by Delphi, 4 more bytes are now used. Those 4 bytes were added to support the TMonitor class, I believe.
If you're in desperate need to save memory, here's a little trick that could help if you have a lot of string values that repeat themselves.
var
  // uUniqueStrings must be created before use, with Sorted set to True
  // so that Find() works:
  //   uUniqueStrings := TStringList.Create;
  //   uUniqueStrings.Sorted := True;
  uUniqueStrings: TStringList;

function ReduceStringMemory(const S: string): string;
var
  idx: Integer;
begin
  if not uUniqueStrings.Find(S, idx) then
    idx := uUniqueStrings.Add(S);
  Result := uUniqueStrings[idx];
end;
Note that this will help ONLY if you have a lot of string values that repeat themselves. For example, this code uses about 150 MB less on my system.
var
  sl: TStringList;
  I: Integer;
begin
  sl := TStringList.Create;
  try
    for I := 0 to 5000000 do
      sl.Add(ReduceStringMemory(StringOfChar('A', 5)));
  finally
    sl.Free;
  end;
end;
I also read in a lot of strings in my program that can approach a couple of GB for large files.
Short of waiting for 64-bit XE2, here is one idea that might help you:
I found storing individual strings in a stringlist to be slow and wasteful in terms of memory. I ended up blocking the strings together. My input file has logical records, which may contain between 5 and 100 lines. So instead of storing each line in the stringlist, I store each record. Processing a record to find the line I need adds very little time to my processing, so this is possible for me.
If you don't have logical records, you might just want to pick a blocking size, and store every (say) 10 or 100 strings together as one string (with a delimiter separating them).
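A rough sketch of that blocking idea, reading the lines with TStreamReader (the BlockSize of 100 and the #10 delimiter are arbitrary choices; the Blocks list is assumed to be created by the caller):
procedure LoadInBlocks(const FileName: string; Blocks: TStringList);
const
  BlockSize = 100;  // arbitrary blocking factor
var
  Reader: TStreamReader;
  Buffer: string;
  LineNo: Integer;
begin
  Reader := TStreamReader.Create(FileName, True);
  try
    Buffer := '';
    LineNo := 0;
    while not Reader.EndOfStream do
    begin
      Buffer := Buffer + Reader.ReadLine + #10;  // #10 separates lines within a block
      Inc(LineNo);
      if LineNo mod BlockSize = 0 then
      begin
        Blocks.Add(Buffer);  // one list entry per BlockSize lines
        Buffer := '';
      end;
    end;
    if Buffer <> '' then
      Blocks.Add(Buffer);    // remaining partial block
  finally
    Reader.Free;
  end;
end;
Blocks then holds roughly 1/BlockSize as many (much longer) strings; split an entry on #10 when an individual line is needed.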
The other alternative is to store them in a fast and efficient on-disk file. The one I'd recommend is the open source Synopse Big Table by Arnaud Bouchez.
May I suggest you try using the JEDI Code Library (JCL) class TAnsiStringList, which is like the TStringList from Delphi 2007 in that it is made up of AnsiStrings.
Even then, as others have mentioned, XE will be using more memory than Delphi 2007.
I really don't see the value of loading the full text of a giant flat file into a stringlist. Others have suggested a big-table approach such as Arnaud Bouchez's, or using SQLite, or something like that, and I agree with them.
I think you could also write a simple class that will load the entire file you have into memory, and provide a way to add line-by-line object links to a giant in-memory AnsiChar buffer.
Starting with Delphi 2009, not only strings but also every TObject has doubled in size. (See Why Has the Size of TObject Doubled In Delphi 2009?). But this would not explain this increase if there are only 85,000 objects. Only if these objects contain many nested objects, their size could be a relevant part of the memory usage.
Are there many duplicate strings in your list? Maybe trying to only store unique strings will help reduce the memory size. See my question about a string pool for a possible (but maybe too simple) answer.
Are you sure you don't suffer from a case of memory fragmentation?
Be sure to use the latest FastMM (currently 4.97), then take a look at the UsageTrackerDemo demo that contains a memory map form showing the actual usage of the Delphi memory.
Finally take a look at VMMap that shows you how your process memory is used.
