How to read or write huge Unicode files? - delphi

I need to read huge Unicode files into my program and convert them to ANSI for parsing. Some of the files must then be stored again as Unicode, while others should be saved in an ANSI code page.
As I understand it, plain read/write doesn't support Unicode text, and for the biggest files (some maybe as big as 300 MB or even bigger) using twidestring.loadfromfile is out of the question, both because of memory usage and the time it takes to load.
I have been wondering whether reading the file in blocks could be a way forward, but as far as I know, that approach doesn't handle the Unicode BOM?
Any suggestions?

There is an excellent and very fast text reader in the German 'Delphi Forum'. It uses memory-mapped files.
You will probably be able to modify it to read Unicode text files. However, you might have to test the BOM yourself.
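A minimal sketch of testing the BOM yourself in Delphi, assuming the file starts with one of the standard byte-order marks (the enum and function names here are illustrative, not from the forum code):

    uses
      SysUtils, Classes;

    type
      TTextEncodingGuess = (teAnsi, teUtf8, teUtf16LE, teUtf16BE);

    // Read the first bytes of the file and compare them with the
    // well-known BOM signatures; fall back to ANSI when no BOM is found.
    function DetectBom(const FileName: string): TTextEncodingGuess;
    var
      FS: TFileStream;
      Buf: array[0..2] of Byte;
      Count: Integer;
    begin
      Result := teAnsi;
      FS := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
      try
        Count := FS.Read(Buf, SizeOf(Buf));
        if (Count >= 3) and (Buf[0] = $EF) and (Buf[1] = $BB) and (Buf[2] = $BF) then
          Result := teUtf8
        else if (Count >= 2) and (Buf[0] = $FF) and (Buf[1] = $FE) then
          Result := teUtf16LE
        else if (Count >= 2) and (Buf[0] = $FE) and (Buf[1] = $FF) then
          Result := teUtf16BE;
      finally
        FS.Free;
      end;
    end;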

In Delphi you can also use memory mapped files.
The primary benefit of memory mapping a file is increasing I/O performance, especially when used on large files.
...
A possible benefit of memory-mapped files is a "lazy loading", thus using small amounts of RAM even for a very large file.
Memory-mapped file. (2013, February 26). In Wikipedia, The Free Encyclopedia. Retrieved 15:14, March 17, 2013, from http://en.wikipedia.org/w/index.php?title=Memory-mapped_file&oldid=540609840
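As a rough illustration (not from the quoted article), here is a minimal Delphi sketch of mapping a file read-only with the Win32 API; error handling is kept to a minimum and the parsing itself is left as a placeholder:

    uses
      Windows, SysUtils;

    // Open a file, map it read-only and get a pointer to its bytes.
    // For huge files on 32-bit Windows, map smaller views (the
    // offset/size parameters of MapViewOfFile) rather than the whole
    // file at once.
    procedure ScanMappedFile(const FileName: string);
    var
      hFile, hMap: THandle;
      SizeLow, SizeHigh: DWORD;
      P: PByte;
    begin
      hFile := CreateFile(PChar(FileName), GENERIC_READ, FILE_SHARE_READ, nil,
        OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
      Win32Check(hFile <> INVALID_HANDLE_VALUE);
      try
        SizeLow := GetFileSize(hFile, @SizeHigh);
        hMap := CreateFileMapping(hFile, nil, PAGE_READONLY, 0, 0, nil);
        Win32Check(hMap <> 0);
        try
          P := MapViewOfFile(hMap, FILE_MAP_READ, 0, 0, 0);
          Win32Check(P <> nil);
          try
            // P now points at the raw file bytes (SizeLow/SizeHigh give
            // the length); check the BOM and parse directly from here.
          finally
            UnmapViewOfFile(P);
          end;
        finally
          CloseHandle(hMap);
        end;
      finally
        CloseHandle(hFile);
      end;
    end;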

Related

Advantage for hex formats like SREC or Intel HEX

Can someone explain the benefit of using hex formats (e.g. Motorola S-Record or Intel HEX) over direct binary images, for things like firmware or memory dumps?
I understand that it is useful to have some meta information about the binary file, like the memory areas used, checksums for data integrity and so on…
However, the fact that the actual data size is doubled, because everything is stored in a hex-ASCII representation, is confusing me.
Is the reason for using a hex-ASCII representation only portability, to avoid problems with systems that have a different byte endianness, or are there other benefits?
I found many tutorials on how to convert binary to hex and back, and the specifications of the various formats, but no information about the advantages and disadvantages.
There are several advantages of hex-ASCII formats over a binary image. Here are a few to start with.
Transport. An ASCII file can be transferred through most mediums; binary files may not survive some of them intact.
Validity checks. The hex file has a checksum on each line (see the sketch below); the binary may not have any checks at all.
Size. Data is written only to selected memory addresses in the hex file. The binary image has limited addressing information and is likely to contain every memory location from the start address to the end of the file.
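To illustrate the per-line validity check: an Intel HEX record's checksum is the two's complement of the low byte of the sum of all the record's bytes - byte count, address, record type and data. A small Delphi sketch (the function name is mine):

    // Checksum of one Intel HEX record: add up every byte of the record
    // (byte count, 16-bit address, record type, data), keep the low
    // 8 bits and take the two's complement.
    // E.g. for the record ':0300300002337A1E' the bytes are
    // 03 00 30 00 02 33 7A, their sum is $E2, so the checksum is $1E.
    function IntelHexChecksum(const RecordBytes: array of Byte): Byte;
    var
      Sum, I: Integer;
    begin
      Sum := 0;
      for I := 0 to High(RecordBytes) do
        Inc(Sum, RecordBytes[I]);
      Result := Byte((-Sum) and $FF);
    end;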

What is the right way to write a big file with delphi

I need to write a very big file (about 100-150 MB) with Delphi.
This file will be either a CSV file or a SQL file (the user can choose).
In the past I have always used a TStringList object to write text files, but in this case the file is very big and I don't think that is the best solution.
I need a fast, low-memory solution if possible.
Use a TFileStream and write your output data to it as you are creating the data. That way, you are writing to the file immediately and it grows accordingly, and you are not wasting memory.
If you are using Delphi 2009 or later, you can use the TStreamWriter class to wrap the TFileStream. TStreamWriter has Write() and WriteLine() methods for writing String and formatted data.
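A minimal sketch of that approach (Delphi 2009 or later; the file name and row contents are made up for illustration):

    uses
      SysUtils, Classes;

    procedure WriteBigCsv(const FileName: string);
    var
      Stream: TFileStream;
      Writer: TStreamWriter;
      I: Integer;
    begin
      Stream := TFileStream.Create(FileName, fmCreate);
      try
        // TStreamWriter writes straight through to the stream, so the
        // whole file is never held in memory at once.
        Writer := TStreamWriter.Create(Stream);
        try
          Writer.WriteLine('Id,Name');                       // header row
          for I := 1 to 1000000 do
            Writer.WriteLine(Format('%d,Item%d', [I, I]));   // one row at a time
        finally
          Writer.Free;
        end;
      finally
        Stream.Free;
      end;
    end;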

How to choose migration path from Delphi 2007

I'm working in a team on a fairly large application written in Delphi 2007. It uses a large legacy framework to access the data. Both the app and the framework use String as the datatype for strings. I have started to modify the framework code to support Delphi 2009 strings; see my previous questions about this.
I see 2 alternatives now:
Alt 1 - Continue to use string as before. This is probably the cleanest solution, as the framework will then support Unicode. But the framework code must be modified a lot to make this work, which requires in-depth understanding of its internal algorithms. There is also a bigger chance of introducing new bugs.
Alt 2 - Replace String with AnsiString and Char with AnsiChar. This is probably a much easier solution, and it is also how I started modifying the code (but then I started thinking and asked this question...). The downside is that there is no support for Unicode. Unicode support is not a requirement, as it worked before, but it is nice to have and could be useful in the future. Another problem is that the application would have to pass AnsiString variables as parameters to the framework's methods instead of String as before. There are thousands of calls to change...
So I don't know right now. Both options require a lot of work, but Alt 1 is probably more risky and time consuming. What I want from this forum is feedback and comments, as I guess I am not the first to face this problem.
EDIT
Another issue is the memory footprint. I wrote a quick test that allocates an array of one million strings. Each string was filled with the 26 chars from A to Z.
With Delphi 2007 it took 40,011,600 bytes and the time was 4:15 minutes.
With Delphi 2009 it took 72,015,580 bytes and the time was 4:45 minutes.
The memory consumption was measured with GetHeapStatus.TotalAllocated.
I don't think we can afford to have the strings allocate twice as much memory.
It is not unusual for a client to consume 500 MB of memory now, and I guess much of that is strings. We will probably try to use AnsiString as much as possible.
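For reference, a rough sketch of the kind of test described (the exact figures will depend on the Delphi version and the memory manager):

    procedure MeasureStringMemory;
    var
      Strings: array of string;
      I, J: Integer;
      Before: Cardinal;
    begin
      Before := GetHeapStatus.TotalAllocated;
      SetLength(Strings, 1000000);
      for I := 0 to High(Strings) do
      begin
        SetLength(Strings[I], 26);                 // force a unique heap block per string
        for J := 1 to 26 do
          Strings[I][J] := Chr(Ord('A') + J - 1);  // fill with A..Z
      end;
      Writeln('Allocated: ', GetHeapStatus.TotalAllocated - Before, ' bytes');
    end;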
Regards
Either stay with the old version of Delphi, or go all the way. You'll have to sooner or later anyway.
Note that the "replace everything with AnsiString" scheme is also not entirely foolproof, especially if you touch streams and your file formats need to stay the same. There are no explicit AnsiString versions of TStringList, TStringStream, etc. anymore.
The same probably goes for DataSnap, Indy and other frameworks.
You can try this trick for certain string-intensive parts at first, to avoid changing too much code right away. For example, I had my own XML library, which I patched to remain mostly AnsiString. The library was only used on the side, and Unicode was of no importance to it.
Start with "alt 2", then gradually add unicode support to your framework, then move over to Unicode.
Rationale: you want a stable app; switching over to Delphi 2009+ will eventually require you to really support Unicode.
Edit: 20100125
While doing "alt 2" watch the Delphi compiler hints warnings.
The situation that Andreas describes will generate such hints and warnings.
I have explained this in my CodeRage 4 session about Unicode and other encodings.
The above link points to a page where you can view the replay of that session.
If you still have questions, just drop them here.
--jeroen
We evaluated the 2007 -> 2009 transition a year ago and tried it on a smaller project (200k lines). The result was that everywhere you do not use "fancy" things like pointers, set of char, etc., the porting really is not that difficult. The GUI units in particular were ported within a day or so. This is equivalent to Alt 1.
The library units with low-level routines, access to measurement systems and so on were a whole different story. Here we chose to translate string -> AnsiString, char -> AnsiChar, etc. Porting these units is a pain to get correct, and the customer won't pay for the transition. Hence Alt 2 for those units.
This mixed approach gave us the best of both worlds, but we will keep some larger projects on Delphi 2007 and will probably only port them once a 64-bit version of the compiler comes out.
It'll be more work, but I'd really recommend that you upgrade to Unicode strings, because that's the native string type of the VCL and so all your controls will be dealing with Unicode strings anyway. Trying to convert everything back and forth will cause you all sorts of hassles.

What are the differences or advantages of using a binary file vs XML with TClientDataSet?

Is there any difference or advantage to using a binary file or an XML file with TClientDataSet?
Binary will be smaller and faster.
XML will be more portable and human readable.
The Binary file will be a little smaller.
The main advantage of the XML format is that you can pass it around via http(s) protocols.
Binary is smaller and faster, but only readable by TClientDataSets.
XML is larger and slower (both are not that bad, i.e. not by orders of magnitude bigger or slower).
XML is readable by people (not recommended in general, but it is doable), and software.
Therefore it is more portable (as Nick wrote).
TClientDataSets can load and save their own style of XML, or you can use the Delphi XML Mapper tool to read and write any kind of XML.
XSLT can for instance be used to transform those XML files into any kind of text, including other XML, HTML, CSV, fixed columns, etc.
In contrast to what Tim indicates, both binary and XML can be transferred over HTTP and HTTPS. However, XML is often preferred there because it is easier to trace.
Without having tested it: I guess the binary format would be quite a lot faster when reading and writing. You'd better do your own benchmarks for that, though.
Another advantage of binary might be that it cannot easily be edited, which prevents people from mucking up the data outside the application.
When using Delphi 2009, we have noticed that if the file has a .XML extension, it will not be saved in binary format over an existing dfXMLUTF8 file, even with LoadFromFile/SaveToFile. Changing the file extension to something else (.DAT, for example) allows saving the file as dfBinary. Our experience is that the binary file, in addition to being somewhat more difficult for the end user to manipulate (a plus!), is approximately 50% smaller than the dfXMLUTF8 file.
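A minimal sketch of saving the same (already populated) TClientDataSet in both formats; the file names are only examples:

    uses
      DBClient;  // Datasnap.DBClient in unit-scoped Delphi versions

    procedure SaveBothFormats(CDS: TClientDataSet);
    begin
      CDS.SaveToFile('data.cds', dfBinary);    // compact, readable only by TClientDataSet
      CDS.SaveToFile('data.xml', dfXMLUTF8);   // bigger, portable, human-readable
      CDS.LoadFromFile('data.cds');            // the format is detected from the file itself
    end;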

Best Free Text Editor Supporting *More Than* 4GB Files? [closed]

I am looking for a text editor that will be able to load a 4+ gigabyte file. Textpad doesn't work: I own a copy of it and have been to its support site; it just doesn't do it. Maybe I need new hardware, but that's a different question. The editor needs to be free OR, if it's going to cost me, then no more than $30. For Windows.
glogg could also be considered, for a different usage:
Caveat (reported by Simon Tewsi in the comments, Feb. 2013)
One caveat - has two search functions, Main Search and Quick Find.
The lower one, which I assume is Quick Find, is at least an order of magnitude slower than the upper one, which is fast.
I've had to look at monster (runaway) log files (20+ GB). I used the free version of hexedit, which can work with files of any size. It is also open source and is a Windows executable.
Jeff Atwood has a post on this here: http://www.codinghorror.com/blog/archives/000229.html
He eventually went with Edit Pad Pro, because "Based on my prior usage history, I felt that EditPad Pro was the best fit: it's quite fast on large text files, has best-of-breed regex support, and it doesn't pretend to be an IDE."
Instead of loading a gigantic log file in an editor, I'm using Unix command line tools like grep, tail, gawk, etc. to filter the interesting parts into a much smaller file and then, I open that.
On Windows, try Cygwin.
Have you tried the Context editor? It is small and fast.
I stumbled on this post many times, as I often need to handle huge files (10+ GB).
After getting tired of buggy and rather limited freeware, and not being willing to pay for costly editors once their trials expired (not worth the money after all), I just used VIM for Windows, with great success and satisfaction.
It is simply PERFECT for this need: fully customizable, with ALL the features one can think of when dealing with text files (searching, replacing, reading, etc., you name it).
I am very surprised nobody suggested it (except a previous answer, but for MacOS)...
For the record, I stumbled on it in this blog post, which wisely advised it.
It's really tough to handle a 4 GB file as such. I used to handle larger text files, but I never loaded them into my editor. I mostly used UltraEdit in my previous company; now I use Notepad++, but I would extract just the parts I needed to edit. (In most cases, the files never needed an edit.)
Why do you want to load such a big file into an editor? When I handled files of this size, I used GNU Core Utils. The most common operations I performed on those files were head (to get the top 250k lines, etc.), tail, split, sort, shuf, uniq and so on. It's really powerful.
There's a lot of things you can do with GNU Core Utils. I would definitely recommend those, instead of a new editor.
Sorry to post on such an old thread, but I tried several of the tips here, and none of them worked for me.
It's slightly different from a text editor, but I found that Beyond Compare could handle an extremely large (3.6 GB) file on my Vista 32-bit machine.
This is a file that Emacs, Large Text File Viewer, HexEdit, and Notepad++ all choked on.
-Eric
My favourite after trying a few to read a 6GB mysqldump file:
PilotEdit Lite http://www.pilotedit.com/
Because:
Memory usage has (somehow?!) never gone above 25MB, so basically no impact on the rest of my system - though it took several minutes to open.
There was an accurate progress bar during that time so I knew how it was getting on.
Once open, simple searching, and browsing through the file all worked as well as a small notepad file.
It's free.
Others I tried...
EmEditor Pro trial was very impressive, the file opened almost instantly, but unfortunately too expensive for my requirements.
EditPad Pro loaded the whole 6GB file into memory and slowed everything to a crawl.
For Windows, Unix, or Mac? On the Mac or *nix you can use command line or GUI versions of emacs or vim.
For the Mac: TextWrangler handles big files well. I'm not versed enough in the Windows landscape to help out there.
If you just want to view a large file rather than edit it, there are a couple of freeware programs that read files a chunk at a time rather than trying to load the entire file into memory. I use these when I need to read through large (> 5 GB) files.
Large Text File Viewer by swiftgear http://www.swiftgear.com/ltfviewer/features.html
Big File Viewer by Team Walrus.
You'll have to find the link yourself for that last one because, being a newbie, I can only post a maximum of one hyperlink.
When I'm faced with an enormous log file, I don't try to look at the whole thing, I use Free File Splitter
Admittedly this is a workaround rather than a solution, and there are times when you would need the whole file. But often I only need to see a few lines from a larger file and that seems to be your problem too. If not, maybe others would find that utility useful.
A viewer that lets you see enormous text files isn't much help if you are trying to get it loaded into Excel to use the Autofilter, for example. Since we all spend the day breaking down problems into smaller parts to be able to solve them, applying the same principle to a large file didn't strike me as contentious.
HxD -- it's a hex editor, but it allows in-place edits and doesn't barf on large files.
Tweak is a hex editor which can handle edits to very large files, including inserts and deletes.
EmEditor should handle this. As their site claims:
EmEditor is now able to open files even larger than 248 GB (or 2.1 billion lines) by opening a portion of the file with the new custom bar - Large File Controller. The Large File Controller allows you to specify the beginning point, end point, and range of the file to be opened. It also allows you to stop the opening of the file and monitor the real size of the file and the size of the temporary disk space available.
Not free though.
I found that FAR Commander could open large files (I tried a 4.2 GB XML file). It does not load the entire file into memory, and it works fast.
Opened a 5 GB file (quickly) with:
1) Hex Editor Neo
2) 010 editor
Textpad also works well at opening files that size. I have done it many times when having to deal with extremely large log files in the 3-5 GB range. Also, using grep to pull out the worthwhile lines and then looking at those works great.
The question would need more details.
Do you want just to look at a file (eg. a log file) or to edit it?
Do you have more memory than the size of the file you want to load or less?
For example, TheGun, a very small text editor written in assembly language, claims to "not have an effective file size limit and the maximum size that can be loaded into it is determined by available memory and loading speed of the file. [...] It has been speed optimised for both file load and save."
To get around the memory limit, I suppose one could use mapped memory. But then, if you need to edit the file, some clever method would be needed, like storing the local changes in memory and applying them chunk by chunk when saving. That might be inefficient in some cases (a big search/replace, for example).
I have had problems with TextPad on 4G files too. Notepad++ works nicely.
Emacs can handle huge file sizes and you can use it on Windows or *nix.
What OS and CPU are you using? If you are using a 32-bit OS, then a process on your system physically cannot address more than 4GB of memory. Since most text editors try to load the entire file into memory, I doubt you'll find one that will do what you want. It would have to be a very fancy text editor, that can do out-of-core processing, i. e. load a chunk of the file at a time.
You may be able to load such a huge file if you use a 64-bit text editor on a computer with a 64-bit CPU and a 64-bit operating system. And you have to make sure that you have enough space in your swap partition or swap file.
Why do you want to load a 4+ GB file into memory? Even if you find a text editor that can do that, does your machine have 4 GB of memory? And unless it has a lot more than 4 GB in physical memory, your machine will slow down a lot and go swap file crazy.
So why do you want to load a 4+ GB file? If you want to transform it, or do a search and replace, you may be better off writing a small, quick program to do it.
I also like notepad++.
