In my application I want to use files for storing data. I don't want to use database or clear text file, the goal is to save double and integer values along with string just to identify the name of the record ; I simple need to save data on disk for generating reports. File can grow even to gigabyte. What format you suggest to use? Binary? If so what vcl component/library you know which is good to use? My goal is to create an application which creates and updates the files while another tool will "eat" those file
producing nice pdf reports for user on demand. What do you think? Any idea or suggestion?
Thanks in advance.
If you don't want to reinvent the wheel, you may find all needed Open Source tools for your task from our side:
Synopse Big Table to store huge amount of data - see in particular the TSynBigTableRecord class to store an unlimited number of records with fields, including indexes if needed - it will definitively be faster and use less disk size than any other regular SQL DB
Synopse SQLite3 Framework if you would rather use a standard SQLite engine for the storage - it comes with a full Client/Server ORM
Reporting from code, including pdf file generation
With full Source code, working from Delphi 6 up to XE.
I've just updated the documentation of the framework. More than 600 pages, with details of every class method, and new enhanced general introduction. See the SAD document.
Update: If you plan to use SQLite, you should first guess how the data will be stored, which indexes are to be created, and how a SQL query may speed up your requests. It's a bad idea to read all file content for every request: you should better structure your data so that a single SQL query would be able to return the expended results. Sometimes, using additional values (like temporary sums or means) to the data is a good idea. Also consider using the RTree virtual table of SQLite3, which is dedicated to speed up access to double min/max multi-dimensional data: it may speed up a lot your requests.
You don't want to use a full SQL database, and you think that a plain text file is too simple.
Points in between those include:
Something that isn't a full SQL database, but more of a key-value store, would technically not be a flat file, but it does provide a single "key+value" list, that is quickly searchable on a single primary key. Such as BSDDB. It has the letter D and B in the name. Does that make it a database, in your view? Because it's not a relational database, and doesn't do SQL. It's just a binary key-value (hashtable) blob storage mechanism, using a well-understood binary file format. Personally, I wouldn't start a new project and use anything in this category.
Recommended: Something that uses SQL but isn't as large as standalone SQL database servers. For example, you could use SQLite and a delphi wrapper. It is well tested, and used in lots of C/C++ and Delphi applications, and can be trusted more than anything you could roll yourself. It is a very light embedded database, and is trusted by many.
Roll your own ISAM, or VLIR, which will eventually morph over time into your own in-house DBMS. There are multiple files involved, and there are indexes, so you can look up data fast without loading everything into memory. Not recommended.
The most flat of flat binary fixed-record-length files. You mentioned originally in your question, power basic which has something called Random Access files, and then you deleted that from your question. Probably what you are looking for, especially for append-only write as the primary operation. Roll your own TurboPascal era "file of record". If you use the "FILE OF RECORD" type, you hit the 2gb limit, and there are problems with Unicode. So use TStream instead, like this. Binary file formats have a lot of strikes against them, especially since it is difficult to grow and expand your binary file format over time, without breaking your ability to read old files. This is a key reason why I would recommend you start out with what might at first seem like overkill (SQLite) instead of rolling your own binary solution.
(Update 2: After updating the question to mention PDFs and what sounds like a reporting-system requirement, I think you really should be using a real database but perhaps a small and easy to use one, like firebird, or interbase.)
I would suggest using TClientDataSet, and use it's SaveToFile() / SaveToStream() methods by the generating program, and LoadFromFile() / LoadFromStream() methods for the program that will "consume" the data. That way, you can still make indexed records without connecting to any external database, all while keeping the interchange data in a single file.
Define API to work with your flat file, so that the API can be implemented by a separate data layer in many ways.
Implement the API using standard embedded SQL database (ex SQLite or Firebird).
Only if there is something wrong with the standard solution think of your own.
I use KBMMemtable - see http://www.components4developers.com/ - fast, reliable, been around a long time - supports binary and CSV streaming in and out of files, as well indexing, filters, and lots of other goodies - TClientDataSet will not do well with large datasets.
Related
What is the best way to load huge text file data in delphi? Is there any component that can load text file superfast?
Let's say I have a text file contains database and stored in fix length format.
It contains 150 field with each at least 50 characters.
1. I need to load it into memory
2. I need to parse it and probably store it in a memdataset for processing
My questions:
1. Is it enough if I use TStringList.loadFromFile method?
2. Is there any other better component to manipulate the text file?
3. Should I use low level reading from textfile?
Thank you in advance.
TStringList is never the optimal way of working with lots of text, but it's the simplest. If you've got small files on your hands you can use TStringList without issues. Even if you have large files (not huge files) you might implement a version of you algorithm using TStringList for testing purposes, because it's simple and easy to understand.
If your files are large, as they probably are since you call them "databases", you need to look into alternative technologies that will enable you to read only as much as you need from the database. Look into:
TFileStream
Memory mapped files.
Don't look at the old "file" based API's still available in Delphi, they're plain old.
I'm not going to go into details on how to access text using those methods because we've recently had two similar questions on SO:
How Can I Efficiently Read The FIrst Few Lines of Many Files in Delphi
and
Fast Search to see if a String Exists in Large Files with Delphi
Since you have a fixed length that you're working with, you can build an access class based on TList with a TWriter and TReader that will take your records into account. You'll have none of the overhead of a TStringList (not that it's a bad thing, but if you don't need it, why have it) and you can build in your own access to records into the class.
Ultimately it depends on what you are trying to accomplish with the data once you have it loaded into memory. While TStringlist is easy to use, it isn't as efficient as "rolling your own".
However, efficiency in data manipulation may not be that much of an issue, as you are using text files to hold a database. If you just need to read in and make decisions based on data in the file, the more flexible TList may be overkill.
I recommend to adhere to TStringList if you find it convenient for your problem. Optimization is another thing that should be done later.
As for TStringList the optimization is to declare a descendant class that overrides TStrings.LoadFromStream method - you can make it practically as fast as possible, taking into account the structure of your files.
It is not entirely clear from your question why you need to load the entire file into memory, prior to then going on to create an in-memory data set.... are you conflating the two issues? (i.e. because you need to create an in-memory data set you think you first need to load the source data entirely into memory? Or is there some initial pre-processing of the source file which is only possible with the entire file loaded in memory (this is unlikely and even if this is the case, it isn't necessary with a navigable stream object such as a TFileStream).
But I think the answer you are looking for is right there in the question....
If you are loading this file in order to parse it and populate/initialise a further data structure (the data set) for further processing, then using an existing high level data structure is an unnecessary and potentially costly (in terms of time) step.
Use the lowest level means of access that provides the capabilities you need.
In this case a TFileStream will likely provide the best balance of convenience and ease of use.
We've got a rails app that processes large amounts of xml data imports. Right now we're storing these ~5MB xml docs in Postgres. This is not ideal given that we use each xml doc once or twice for parsing. We'd like to have an intelligent way of storing and archiving these docs, but not overly complicate the retrieval process for the sake of space. We've considered moving the docs to Mongo (which we're also using), but then aren't we just artificially boosting the memory requirements of our Mongo db servers?
What's the best way for us to deal with this?
I would just store a link to the file in the DB if you use it only for parsing once or twice and then load the file from the given link. Another aproach is to use a XML DB, e.g. eXist.
You could try eXist, an XML database. If you are just archiving them, though, why not just store them in a directory tree?
You may want to look into DB2's PureXML capabilities. To play with it, you can download the free DB2 Express-C version here. For the record, IBM is also the only database provider officially supporting their Ruby driver and Rails adapter, so you wouldn't be on your own.
What harm are they doing where they are? They will take up 'space' wherever you put them.
If are confident you will never need them again then there is a case for archival to less expensive storage (eg tape?) - otherwise whatever you do will 'overly complicate the retrieval process'
You could consider compressing them in-place if you are not already doing so
What is the best way to load huge text file data in delphi? Is there any component that can load text file superfast?
Let's say I have a text file contains database and stored in fix length format.
It contains 150 field with each at least 50 characters.
1. I need to load it into memory
2. I need to parse it and probably store it in a memdataset for processing
My questions:
1. Is it enough if I use TStringList.loadFromFile method?
2. Is there any other better component to manipulate the text file?
3. Should I use low level reading from textfile?
Thank you in advance.
TStringList is never the optimal way of working with lots of text, but it's the simplest. If you've got small files on your hands you can use TStringList without issues. Even if you have large files (not huge files) you might implement a version of you algorithm using TStringList for testing purposes, because it's simple and easy to understand.
If your files are large, as they probably are since you call them "databases", you need to look into alternative technologies that will enable you to read only as much as you need from the database. Look into:
TFileStream
Memory mapped files.
Don't look at the old "file" based API's still available in Delphi, they're plain old.
I'm not going to go into details on how to access text using those methods because we've recently had two similar questions on SO:
How Can I Efficiently Read The FIrst Few Lines of Many Files in Delphi
and
Fast Search to see if a String Exists in Large Files with Delphi
Since you have a fixed length that you're working with, you can build an access class based on TList with a TWriter and TReader that will take your records into account. You'll have none of the overhead of a TStringList (not that it's a bad thing, but if you don't need it, why have it) and you can build in your own access to records into the class.
Ultimately it depends on what you are trying to accomplish with the data once you have it loaded into memory. While TStringlist is easy to use, it isn't as efficient as "rolling your own".
However, efficiency in data manipulation may not be that much of an issue, as you are using text files to hold a database. If you just need to read in and make decisions based on data in the file, the more flexible TList may be overkill.
I recommend to adhere to TStringList if you find it convenient for your problem. Optimization is another thing that should be done later.
As for TStringList the optimization is to declare a descendant class that overrides TStrings.LoadFromStream method - you can make it practically as fast as possible, taking into account the structure of your files.
It is not entirely clear from your question why you need to load the entire file into memory, prior to then going on to create an in-memory data set.... are you conflating the two issues? (i.e. because you need to create an in-memory data set you think you first need to load the source data entirely into memory? Or is there some initial pre-processing of the source file which is only possible with the entire file loaded in memory (this is unlikely and even if this is the case, it isn't necessary with a navigable stream object such as a TFileStream).
But I think the answer you are looking for is right there in the question....
If you are loading this file in order to parse it and populate/initialise a further data structure (the data set) for further processing, then using an existing high level data structure is an unnecessary and potentially costly (in terms of time) step.
Use the lowest level means of access that provides the capabilities you need.
In this case a TFileStream will likely provide the best balance of convenience and ease of use.
I am constructing an anagram generator that was a coding exercise, and uses a word list thats about 633,000 lines long (one word per line). I wrote the program just in Ruby originally, and I would like to modify this to deploy it online.
My hosting service supports Ruby on Rails as about the only Ruby-based solution. I thought of hosting on my own machine, and using a smaller framework, but I don't want to deal with the security issues at this moment.
I have only used RoR for database-driven (CRUD) apps. However, I have never populated a sqlite database this way, so this is a two-part question:
1) Should I import this to a database? If so, what's the best method to do so? I would like to stick with sqlite to keep things simple if that's the case.
2) Is a 'flat file' better? I wont be doing any creating or updating, just checking against the list of words.
Thank you.
How about keeping it in memory? Storing that many words would take just a few megabytes of RAM, and otherwise you'd be accessing the file frequently so it'd probably be cached anyway. The advantage of keeping the word list in memory is that you can organize it in whatever data structure suits your needs best (I'm thinking a trie). If you can't spare that much memory, it might be to your advantage to use a database so you can efficiently load only the parts of the word list you need for any given query - of course, in that case you'd want to create some index columns (well at least one) so you can take advantage of the indexing capabilities of SQL.
Assuming that what you're doing is looking up whether a word exists in your list, I would say that SQLite with an indexed column will likely be faster than scanning through the word list linearly. Now, if your current approach is fast enough for your purposes, then I see no reason to bother porting it over to a database; it's just an added headache for no gain as far as you're concerned. If you're seeing the search times become a burden, then dumping it into an indexed database would be a good idea.
You can create the table with the following schema:
CREATE TABLE words (
word text primary key
);
CREATE INDEX word_idx ON words(word);
And import your data with:
sqlite words.db < schema.sql
while read word
do
sqlite3 words.db "INSERT INTO words values('$word');"
done < words.txt
I would skip the database for reasons listed above. A simple hash in memory will perform about as fast a lookup in the database.
Even if the database was a bit faster for the lookup, you're still wasting time with the DB having to parse the query and create a plan for the lookup, then assemble the results and send them back to your program. Plus you can save yourself a dependency.
If you plan on moving other parts of your program to a persistent store, then go for it. But a hashmap should be sufficient for your use.
I'm about to write a small utility to organze and tag my mp3s.
What is the best way to store small amounts of data. More importantly, are there databases which exist where I don't need to install a client/server environment, I just include the library and I'm good?
I could use XML, but I'm afraid that the file size would become large and hard to handle, not to mention keeping the memory footprint small.
Thanks
EDIT: I haven't decided on the language, I wanted to make my decision independent of platform. If I had to choose, most likely .NET, second Java, third C++.
My apologies, this is for a Windows App.
On Windows you can use the built-in esent database engine. There is an API you can use from C++
http://blogs.msdn.com/windowssdk/archive/2008/10/23/esent-extensible-storage-engine-api-in-the-windows-sdk.aspx
There is also a managed interop layer that you can use from C# code:
http://www.codeplex.com/ManagedEsent
Which language/platform are you talking about?
In the Java world I prefer using embedded databases such as HSQLDB, H2 or JavaDB (f.k.a. Derby).
They don't need installing and still provide the simple access you're used to from a "real" DBMS.
In the C/Python/Unixy world SQLite is a hot contender in that area.
Another option is the various forms of the Berkeley database (eg, db3, db4, SleepyCat.)
SQLITE if you want the pain of a relational DB without a server install or hassle.
I would use one of the many text-serialization formats. I personally think that YAML 1.1 is the most powerful (built-in support for referential object graphs) and easiest to read/modify by a human (parsing is a bear, use a library such as PyYAML or JYaml or some .NET libaray).
Otherwise XML or JSON are adequate file formats.
Whichever format you use, just compress the file if you're concerned about disk usage. If you're worried about in-memory usage, then I don't see how your serialization format matters...
Have a look at Prevayler - it's a serialization persistence framework (use xstream etc if you want to human-read your data), which is really fast, does not require annotations and "just works". Some basic info:
It does impose a more rigorous transaction pattern, as it does not give you automatic rollback:
Ensure transaction will succeed (with current state of system) - e.g. does it make sense now?
[transaction is added to queue], and stored (for power reset etc)
transaction is executed and applied to the object structure.
Writes of 1000's of transactions/sec
Reads of 100,000's transactions/sec
I haven't used it much, but it's sooo much nicer to use for small projects (persisting any serializable object is so nice)
Oh - as for every one saying "what platform you running on?", Prevayler (java) has/had ports to quite a few platforms, but I can't find a decent list :(. I remember that there were around 5-7, but can only remember .NET.
If you're planning on storing everything in memory while your program does work on it, then serializing to a file using a basic load() and save() function that you write would be fine, and less pain than a full on DB.
In Java that can be done using standard Serialization (or can serialize to and from XML to make it somewhat human readable and editable).
It shouldn't affect your memory footprint at all as it is merely saving and restoring your objects. You just won't get transactions and random access and queries and all that good stuff.
you could even use xml, json, an .ini file... a text file even
I would advise a SQL like database (such as SQLLite). Today your requirements might make a full SQL database seem silly. But you never know how much this "little project" will grow over the years. When it does grow to the point where you have to have a SQL engine, you will be glad you didn't just serialize some Java objects or store stuff in JSON format.