I'm about to write a small utility to organize and tag my MP3s.
What is the best way to store small amounts of data? More importantly, are there databases that don't require installing a client/server environment, where I can just include the library and I'm good?
I could use XML, but I'm afraid the file size would become large and hard to handle, not to mention the difficulty of keeping the memory footprint small.
Thanks
EDIT: I haven't decided on the language; I wanted to make my decision independent of platform. If I had to choose: most likely .NET, second Java, third C++.
My apologies, this is for a Windows App.
On Windows you can use the built-in ESENT database engine. There is an API you can use from C++:
http://blogs.msdn.com/windowssdk/archive/2008/10/23/esent-extensible-storage-engine-api-in-the-windows-sdk.aspx
There is also a managed interop layer that you can use from C# code:
http://www.codeplex.com/ManagedEsent
Which language/platform are you talking about?
In the Java world I prefer using embedded databases such as HSQLDB, H2 or JavaDB (f.k.a. Derby).
They don't need installing and still provide the simple access you're used to from a "real" DBMS.
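For instance, here's a minimal sketch of the embedded approach with H2 over JDBC; the file name and schema are invented for illustration, and having the H2 jar on the classpath is the entire "install":

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class TagStore {
    public static void main(String[] args) throws Exception {
        // "jdbc:h2:./tags" creates/opens a file-based database in the
        // working directory; no server process is involved.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./tags")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS track ("
                        + "path VARCHAR(260) PRIMARY KEY, genre VARCHAR(64))");
            }
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO track (path, genre) VALUES (?, ?)")) {
                ps.setString(1, "C:/music/song.mp3");
                ps.setString(2, "rock");
                ps.executeUpdate();
            }
        }
    }
}
```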
In the C/Python/Unixy world SQLite is a hot contender in that area.
Another option is one of the various forms of the Berkeley database (e.g. db3, db4, Sleepycat).
SQLite, if you want the pain of a relational DB without a server install or hassle.
I would use one of the many text-serialization formats. I personally think that YAML 1.1 is the most powerful (built-in support for referential object graphs) and the easiest for a human to read and modify (parsing is a bear, so use a library such as PyYAML, JYaml, or some .NET library).
Otherwise XML or JSON are adequate file formats.
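For example, a hedged round-trip sketch using SnakeYAML (another Java YAML library; the map layout and file name are invented):

```java
import org.yaml.snakeyaml.Yaml;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.LinkedHashMap;
import java.util.Map;

public class YamlDemo {
    public static void main(String[] args) throws Exception {
        Yaml yaml = new Yaml();

        // Serialize a simple map to a human-readable YAML file.
        Map<String, Object> track = new LinkedHashMap<>();
        track.put("path", "/music/song.mp3");
        track.put("rating", 5);
        try (FileWriter out = new FileWriter("tags.yaml")) {
            yaml.dump(track, out);
        }

        // Load it back; plain YAML maps to Maps/Lists/scalars.
        try (FileReader in = new FileReader("tags.yaml")) {
            Map<String, Object> loaded = yaml.load(in);
            System.out.println(loaded.get("path"));
        }
    }
}
```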
Whichever format you use, just compress the file if you're concerned about disk usage. If you're worried about in-memory usage, then I don't see how your serialization format matters...
Have a look at Prevayler - it's a serialization persistence framework (use XStream etc. if you want your data to be human-readable), which is really fast, does not require annotations, and "just works". Some basic info:
It does impose a more rigorous transaction pattern, as it does not give you automatic rollback:
1. Ensure the transaction will succeed given the current state of the system - e.g. does it make sense now?
2. The transaction is added to a queue and stored (to survive a power reset etc.).
3. The transaction is executed and applied to the object structure.
Rough performance figures:
- Writes: 1000s of transactions/sec
- Reads: 100,000s of transactions/sec
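To make that pattern concrete, here's a minimal sketch against Prevayler's Java API (the 2.x generic form); the TagSystem and TagTrack classes are invented for illustration:

```java
import org.prevayler.Prevayler;
import org.prevayler.PrevaylerFactory;
import org.prevayler.Transaction;

import java.io.Serializable;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

// The "prevalent system": all state lives in this serializable object graph.
class TagSystem implements Serializable {
    final Map<String, String> genreByPath = new HashMap<>();
}

// Every change is a serializable Transaction; Prevayler journals it to disk
// before applying it, which is what survives a power reset.
class TagTrack implements Transaction<TagSystem> {
    private final String path, genre;
    TagTrack(String path, String genre) { this.path = path; this.genre = genre; }
    public void executeOn(TagSystem system, Date timestamp) {
        system.genreByPath.put(path, genre);
    }
}

public class PrevaylerDemo {
    public static void main(String[] args) throws Exception {
        Prevayler<TagSystem> prevayler =
                PrevaylerFactory.createPrevayler(new TagSystem(), "journal-dir");
        prevayler.execute(new TagTrack("/music/song.mp3", "rock"));
        prevayler.close();
    }
}
```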
I haven't used it much, but it's sooo much nicer to use for small projects (persisting any serializable object is so nice)
Oh - as for everyone asking "what platform are you running on?": Prevayler (Java) has/had ports to quite a few platforms, but I can't find a decent list :(. I remember there were around 5-7, but I can only remember the .NET one.
If you're planning on storing everything in memory while your program does work on it, then serializing to a file using basic load() and save() functions that you write would be fine, and less pain than a full-on DB.
In Java that can be done using standard serialization (or you can serialize to and from XML to make it somewhat human-readable and editable).
It shouldn't affect your memory footprint at all as it is merely saving and restoring your objects. You just won't get transactions and random access and queries and all that good stuff.
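A minimal sketch of those load()/save() functions using standard Java serialization (the Track class and file name are invented):

```java
import java.io.*;
import java.util.ArrayList;

// A simple serializable record type, invented for illustration.
class Track implements Serializable {
    String path;
    String genre;
}

public class Library {
    // Write the whole in-memory collection to disk in one call.
    static void save(ArrayList<Track> tracks, File file) throws IOException {
        try (ObjectOutputStream out =
                new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(tracks);
        }
    }

    // Restore the collection; the object graph comes back as saved.
    @SuppressWarnings("unchecked")
    static ArrayList<Track> load(File file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                new ObjectInputStream(new FileInputStream(file))) {
            return (ArrayList<Track>) in.readObject();
        }
    }
}
```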
You could even use XML, JSON, an .ini file... even a plain text file.
I would advise an SQL database (such as SQLite). Today your requirements might make a full SQL database seem silly, but you never know how much this "little project" will grow over the years. When it does grow to the point where you have to have a SQL engine, you will be glad you didn't just serialize some Java objects or store stuff in JSON format.
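For instance, a hedged sketch using the xerial sqlite-jdbc driver (my assumption; any SQLite binding accepts the same SQL):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SqliteDemo {
    public static void main(String[] args) throws Exception {
        // The driver creates tags.db on first use; no server, just a file.
        try (Connection conn =
                DriverManager.getConnection("jdbc:sqlite:tags.db");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS track ("
                    + "path TEXT PRIMARY KEY, genre TEXT)");
            st.execute("INSERT OR REPLACE INTO track VALUES "
                    + "('/music/song.mp3', 'rock')");
        }
    }
}
```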
I'm writing a program for a client. The data they send us is essentially information from a relational database that got flattened, resulting in utterly gigantic comma-delimited text files that consist of extremely redundant information with only a few fields changing per line.
I am reading this into a typed dataset and essentially organizing the data I'm getting into third normal form, which drastically cuts down on the sheer amount of redundancy. From there, I convert the data in the dataset to XML and send it off to another program to create forms and statements.
However, I'm wondering if there's a better way to go about this. It might not be as bad as I think it is, but I can't shake the feeling that there's a better, faster way to do this. The important thing is that the data is organized and easily understood, and that it is constraint-checked and validated before I convert it to XML.
Since none of the data needs to persist (in fact, it shouldn't), an actual RDBMS didn't seem worth it if I was just going to end up clearing it after every use.
The program also needs to run in a myriad of environments; my workstation is Windows 7 64-bit, the testing server is Windows XP 32-bit, and the production server is Windows 7 64-bit or 32-bit depending on which server it's going on.
IMHO I would start off with SQL Server Express - it's designed to work through those kinds of data volumes; it will adapt itself to the different platforms you're running; it's scalable to the bigger editions if necessary; in SSMS you have a tool for easily examining intermediate results etc.; importing .csv is straightforward; and it's free.
For all the above reasons, I would give SQL Express a try and evaluate its real-world performance.
Going back to your original question, my opinion fwiw is that this is a reasonable approach; I don't think you're missing anything.
In my application I want to use files for storing data. I don't want to use a database or a clear-text file; the goal is to save double and integer values along with a string, just to identify the name of the record. I simply need to save data on disk for generating reports. A file can grow even to a gigabyte. What format do you suggest? Binary? If so, what VCL component/library do you know of that is good to use? My goal is to create an application which creates and updates the files, while another tool will "eat" those files,
producing nice PDF reports for the user on demand. What do you think? Any idea or suggestion?
Thanks in advance.
If you don't want to reinvent the wheel, you may find all the needed open-source tools for your task on our side:
Synopse Big Table to store huge amounts of data - see in particular the TSynBigTableRecord class to store an unlimited number of records with fields, including indexes if needed - it will definitely be faster and use less disk space than any regular SQL DB
Synopse SQLite3 Framework if you would rather use a standard SQLite engine for the storage - it comes with a full Client/Server ORM
Reporting from code, including pdf file generation
With full source code, working from Delphi 6 up to XE.
I've just updated the documentation of the framework. More than 600 pages, with details of every class method, and new enhanced general introduction. See the SAD document.
Update: If you plan to use SQLite, you should first plan how the data will be stored, which indexes are to be created, and how a SQL query may speed up your requests. It's a bad idea to read the whole file content for every request: you should instead structure your data so that a single SQL query can return the expected results. Sometimes, storing additional values (like running sums or means) alongside the data is a good idea. Also consider using the R*Tree virtual table of SQLite3, which is dedicated to speeding up access to double min/max multi-dimensional data: it may speed up your requests a lot.
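For illustration, a hedged sketch of that kind of structure - an ordinary index plus a one-dimensional R*Tree table. It is written in Java/JDBC here since every SQLite binding accepts the same SQL; it assumes a driver built with the R*Tree module, and the table and column names are invented:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ReportStore {
    public static void main(String[] args) throws Exception {
        try (Connection conn =
                DriverManager.getConnection("jdbc:sqlite:report.db");
             Statement st = conn.createStatement()) {
            // Plain table for the records...
            st.execute("CREATE TABLE IF NOT EXISTS sample ("
                    + "id INTEGER PRIMARY KEY, name TEXT, value DOUBLE)");
            // ...an ordinary index so a single query can answer a request...
            st.execute("CREATE INDEX IF NOT EXISTS idx_sample_name "
                    + "ON sample(name)");
            // ...and an R*Tree virtual table for fast min/max range queries
            // over double values (one dimension here: [lo, hi]).
            st.execute("CREATE VIRTUAL TABLE IF NOT EXISTS sample_range "
                    + "USING rtree(id, lo, hi)");
        }
    }
}
```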
You don't want to use a full SQL database, and you think that a plain text file is too simple.
Points in between those include:
Something that isn't a full SQL database but more of a key-value store would technically not be a flat file, but it does provide a single "key+value" list that is quickly searchable on a single primary key - such as BSDDB (Berkeley DB). It has the letters D and B in the name; does that make it a database, in your view? It's not a relational database and doesn't do SQL - it's just a binary key-value (hashtable) blob-storage mechanism, using a well-understood binary file format. Personally, I wouldn't start a new project using anything in this category.
Recommended: Something that uses SQL but isn't as large as standalone SQL database servers. For example, you could use SQLite and a Delphi wrapper. It is well tested, used in lots of C/C++ and Delphi applications, and can be trusted more than anything you could roll yourself. It is a very light embedded database, and is trusted by many.
Roll your own ISAM, or VLIR, which will eventually morph over time into your own in-house DBMS. There are multiple files involved, and there are indexes, so you can look up data fast without loading everything into memory. Not recommended.
The flattest of flat binary fixed-record-length files. You originally mentioned PowerBASIC in your question, which has something called Random Access files, and then you deleted that from your question. This is probably what you are looking for, especially if append-only writes are the primary operation. You could roll your own Turbo Pascal-era "file of record", but if you use the "FILE OF RECORD" type you hit the 2 GB limit, and there are problems with Unicode, so use a TStream instead (a language-agnostic sketch follows this list). Binary file formats have a lot of strikes against them, especially since it is difficult to grow and expand your binary file format over time without breaking your ability to read old files. This is a key reason why I would recommend you start out with what might at first seem like overkill (SQLite) instead of rolling your own binary solution.
(Update 2: After updating the question to mention PDFs and what sounds like a reporting-system requirement, I think you really should be using a real database, but perhaps a small and easy-to-use one, like Firebird or InterBase.)
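This thread is about Delphi, but the fixed-record idea is language-agnostic; here is a hedged sketch of the same pattern using Java's RandomAccessFile, with an invented 44-byte record layout:

```java
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FixedRecords {
    // One record: 32-byte padded name + one double + one int = 44 bytes.
    static final int NAME_LEN = 32, RECORD_LEN = NAME_LEN + 8 + 4;

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile f = new RandomAccessFile("data.bin", "rw")) {
            // Append one record at the end of the file.
            f.seek(f.length());
            byte[] name = Arrays.copyOf(
                    "temperature".getBytes(StandardCharsets.US_ASCII), NAME_LEN);
            f.write(name);
            f.writeDouble(21.5);
            f.writeInt(42);

            // Random access: jump straight to record #0 by offset.
            f.seek(0L * RECORD_LEN);
            byte[] buf = new byte[NAME_LEN];
            f.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.US_ASCII).trim()
                    + " = " + f.readDouble() + " (" + f.readInt() + ")");
        }
    }
}
```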
I would suggest using TClientDataSet: use its SaveToFile() / SaveToStream() methods in the generating program, and its LoadFromFile() / LoadFromStream() methods in the program that will "consume" the data. That way, you can still have indexed records without connecting to any external database, all while keeping the interchange data in a single file.
Define an API to work with your flat file, so that the API can be implemented by a separate data layer in many ways (a sketch follows this list).
Implement the API using a standard embedded SQL database (e.g. SQLite or Firebird).
Only if there is something wrong with the standard solution should you think of rolling your own.
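A sketch of such an API boundary, in Java for brevity (all names invented; a flat-file class and a SQLite-backed class could both implement it):

```java
// Invented data-layer API: the report generator codes against this
// interface only, so the storage can be a flat file today and an
// embedded SQL database (SQLite, Firebird) tomorrow.
public interface RecordStore extends AutoCloseable {
    // Append one named double/integer record.
    void append(String name, double value, int count);

    // Stream back all (value, count) pairs recorded under a name.
    Iterable<double[]> valuesFor(String name);
}
```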
I use kbmMemTable - see http://www.components4developers.com/ - fast, reliable, been around a long time - supports binary and CSV streaming in and out of files, as well as indexing, filters, and lots of other goodies - TClientDataSet will not do well with large datasets.
We've got a Rails app that processes large amounts of XML data imports. Right now we're storing these ~5 MB XML docs in Postgres. This is not ideal given that we use each XML doc only once or twice for parsing. We'd like an intelligent way of storing and archiving these docs without overly complicating the retrieval process for the sake of space. We've considered moving the docs to Mongo (which we're also using), but then aren't we just artificially boosting the memory requirements of our MongoDB servers?
What's the best way for us to deal with this?
I would just store a link to the file in the DB, if you use each file only for parsing once or twice, and then load the file from the given link. Another approach is to use an XML DB, e.g. eXist.
You could try eXist, an XML database. If you are just archiving them, though, why not just store them in a directory tree?
You may want to look into DB2's PureXML capabilities. To play with it, you can download the free DB2 Express-C version here. For the record, IBM is also the only database provider officially supporting their Ruby driver and Rails adapter, so you wouldn't be on your own.
What harm are they doing where they are? They will take up 'space' wherever you put them.
If you are confident you will never need them again, then there is a case for archiving them to less expensive storage (e.g. tape?) - otherwise whatever you do will "overly complicate the retrieval process"
You could consider compressing them in place if you are not already doing so.
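Compression is nearly a one-liner in most stacks; for illustration, a hedged Java sketch that gzips a stored doc and removes the original (file names invented):

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

public class CompressDoc {
    public static void main(String[] args) throws Exception {
        Path src = Path.of("import-1234.xml");      // invented file name
        try (FileInputStream in = new FileInputStream(src.toFile());
             GZIPOutputStream out = new GZIPOutputStream(
                     new FileOutputStream("import-1234.xml.gz"))) {
            in.transferTo(out);                     // stream-copy, low memory
        }
        Files.delete(src);                          // "in place": keep only .gz
    }
}
```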
What is the best way to load huge text-file data in Delphi? Is there any component that can load a text file super fast?
Let's say I have a text file that contains a database stored in fixed-length format.
It contains 150 fields, each at least 50 characters.
1. I need to load it into memory.
2. I need to parse it and probably store it in a memory dataset for processing.
My questions:
1. Is it enough if I use the TStringList.LoadFromFile method?
2. Is there any better component for manipulating the text file?
3. Should I use low-level reading from the text file?
Thank you in advance.
TStringList is never the optimal way of working with lots of text, but it's the simplest. If you've got small files on your hands, you can use TStringList without issues. Even if you have large files (not huge files), you might implement a version of your algorithm using TStringList for testing purposes, because it's simple and easy to understand.
If your files are large, as they probably are since you call them "databases", you need to look into alternative technologies that will enable you to read only as much as you need from the file. Look into:
TFileStream
Memory mapped files.
Don't look at the old "file"-based APIs still available in Delphi; they're just plain old.
I'm not going to go into details on how to access text using those methods because we've recently had two similar questions on SO:
How Can I Efficiently Read the First Few Lines of Many Files in Delphi
and
Fast Search to see if a String Exists in Large Files with Delphi
Since you have a fixed length that you're working with, you can build an access class based on TList with a TWriter and TReader that will take your records into account. You'll have none of the overhead of a TStringList (not that it's a bad thing, but if you don't need it, why have it) and you can build in your own access to records into the class.
Ultimately it depends on what you are trying to accomplish with the data once you have it loaded into memory. While TStringList is easy to use, it isn't as efficient as "rolling your own".
However, efficiency in data manipulation may not be that much of an issue, as you are using text files to hold a database. If you just need to read in and make decisions based on data in the file, the more flexible TList may be overkill.
I recommend sticking with TStringList if you find it convenient for your problem. Optimization is another thing that should be done later.
As for TStringList, the optimization is to declare a descendant class that overrides the TStrings.LoadFromStream method - you can make it practically as fast as possible, taking into account the structure of your files.
It is not entirely clear from your question why you need to load the entire file into memory before going on to create an in-memory dataset. Are you conflating the two issues - i.e. because you need to create an in-memory dataset, do you think you first need to load the source data entirely into memory? Or is there some initial pre-processing of the source file which is only possible with the entire file loaded in memory? (That is unlikely, and even if it is the case, it isn't necessary with a navigable stream object such as a TFileStream.)
But I think the answer you are looking for is right there in the question....
If you are loading this file in order to parse it and populate/initialise a further data structure (the data set) for further processing, then using an existing high level data structure is an unnecessary and potentially costly (in terms of time) step.
Use the lowest level means of access that provides the capabilities you need.
In this case a TFileStream will likely provide the best balance of convenience and ease of use.
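To make the point concrete, here is a hedged sketch of streaming fixed-length records without loading the whole file; it is in Java, but a Delphi version would do the same with a TFileStream and a byte buffer (the record layout follows the question's 150 fields x 50 chars, and the file name and buffer size are assumptions):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.nio.charset.StandardCharsets;

public class StreamParse {
    // 150 fields x 50 chars per the question; adjust to the real layout.
    static final int RECORD_LEN = 150 * 50;

    public static void main(String[] args) throws Exception {
        byte[] record = new byte[RECORD_LEN];
        try (BufferedInputStream in = new BufferedInputStream(
                new FileInputStream("huge.txt"), 1 << 20)) {   // 1 MB buffer
            while (in.readNBytes(record, 0, RECORD_LEN) == RECORD_LEN) {
                // Parse one fixed-length record; only this record is in memory.
                String firstField =
                        new String(record, 0, 50, StandardCharsets.US_ASCII).trim();
                // ...populate the in-memory dataset here...
            }
        }
    }
}
```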
Is there drop-in replacement for ActiveRecord that uses some sort of Object Store?
I am thinking something like Erlang's Mnesia would be ideal.
Update
I've been investigating CouchDB and I think this is the option I am going to go with. It's a toss-up between using CouchRest and ActiveCouch. CouchRest is pretty mature and is used in the CouchDB PeepCode episode, but it's not a drop-in replacement for ActiveRecord, which is a bit of a disadvantage.
Suffice to say CouchDB is pretty phenomenal.
Update (November 10, 2009)
CouchDB hasn't really worked for me. CouchDB doesn't really support arbitrary queries (queries need to be written and compiled ahead of time). It also breaks on very large datasets.
I have been playing with MongoDB and it's really incredible. Schema-less JSON data store with queries and indexing.
I've even started building a management tool for it called Ming.
Try Maglev!
ActiveCouch purports to be just such a library for CouchDB, which is, in fact, written in Erlang. I wouldn't say it's as mature as ActiveRecord, though.
That is the closest thing I know of to what you're asking for.
Madeleine is a Ruby implementation of the Java Prevayler object store;
see http://madeleine.rubyforge.org/
I'm currently working on a Ruby object database that uses MySQL as a backing store (hence the name, hybriddb) that you may be interested in.
It uses no SQL or migrations; you just save your objects to the database. It also tries to work around the conventional problems with object databases (speed, finding objects quickly, large object graphs) transparently.
It is still an early version, so take care. The code is here:
http://github.com/pauliephonic/hybriddb/tree/master - the development branch has support for transactions, and I'm currently adding basic validations.
I have a web site with some tutorials etc.: http://www.hybriddb.org/pages/tutorial_starter
Any comments are welcome there.
Apart from Madeleine, you can also see:
http://purple.rubyforge.org/
But it depends on scale, too. Mnesia is known to support large amounts of data and is clustered, whereas these solutions won't work so well with large amounts of data.
If the amount of data is not huge, another option is:
http://copiousfreetime.rubyforge.org/amalgalite/files/README.html