Handling Huge data from TAdoQuery in delphi - delphi

Programming Language: Delphi 6
SQL Server in The Back end.
Problem:
Application used to hit DB each time we needed something , Ended up hitting it for more than 2000 times for getting certain things, Caused problems with application being slow. This hitting DB happened for a lot of tables each having different structure and different number of columns. So I’m trying to reduce the number of calls.
We can have about 4000 records at a time from each table.
Proposed Solution:
Let’s get all the data from DB at once and use it when we need it so we don’t have to keep hitting the DB.
How Solution is turning out so far:
This version of Delphi doesn’t have a dictionary. So we already have an implementation of a dictionary from String List (Let us assume that implementation is good).
Solution 1:
Store this in a dictionary that we created with:
A unique field as a key.
And Add rest of the data as strings in String List separated like this:
FiledName1:FileValue,FieldName2:FieldValue2,…..
Might Have to create about 2000 String List to map data to key.
I took a look at the following link:
How Should I Implement a Huge but Simple Indexed StringList in Delphi?
Looks like they could move to a different DB not possible with me.
Is this a sane solution?
Solution 2:
Store this in a dictionary with List.
That list will have Delphi Records.
Records can’t directly be added to so I took a look at this Link:
Delphi TList of records
Solution 3:
Or Given that I’m using TAdoQuery should I use Seek or locate to find my records.
Please advice on the best way to do this?
Requirements:
Need Random Access of this data.
Insertion of data will happened only once when we get all the data , per table as we require them.
Need to only read the data , don’t have to modify it.
Constantly need to search in terms of primary key.
In addition to changing the application we have already done good indexing on DB to take care of things from DB side. This is more to make things run well from the application.

This sounds like a perfect use case for TClientDataSet. It's an in-memory dataset that be indexed, filtered, and searched easily, hold any information you can retrieve from the database using a SQL statement, and it has pretty good performance over a few thousand reasonable-sized rows of data. (The link above is to the current documentation, as I don't have one available for the Delphi 6 docs. They should be very similar, although I don't recall which specific version added the ability to directly include MidasLib in your uses clause to eliminate distributing Midas.dll with your app.)
Carey Jensen wrote a series of articles about it a few years back that you might find useful. The first one can be found in A ClientDataset in Every Database Application - the others in the series are linked from it.

Related

How to dynamically generate CoreData objects

I haven't found an explicit "no" in the documentation/discussion but suspect it is not possible to generate CoreData objects programmatically, at runtime.
What I want to do is similar to executing DDL commands (eg. "Create Table", "Drop Table" etc.) from inside running code, because I don't know until I ask the user how many columns his table needs, or what data types they need to be. Maybe he needs multiple tables, at that.
Does anyone know whether this is possible? Would appreciate a pointer to something to read.
(Would also appreciate learning the negative, so I can stop wondering.)
If not doable in CoreData, would this be a reason to switch to SQLite?
You can create the entire Core Data model at run time-- there's no requirement to use Xcode's data modeler at all, and there's API support for creating and configuring every detail of the model. But it's probably not as flexible as it sounds like you want it to be. Although you can create new entity descriptions or modify existing ones, you can only do so before loading a data store file. Once you're reading and writing data, you must consider the data model as being fixed. Changing it at that point will generate an exception.
It's not quite the same as typical SQLite usage. It's sort of like the SQLite tables are defined in one fie and the data is stored in another file-- and you can modify the tables on the fly but only before loading the actual data. (I know that's not how SQLite really works, but that's basically the approach that Core Data enforces).
If you expect to need to modify your model / schema as you describe, you'll probably be better off going with direct SQLite access. There are a couple of Objective-C SQLite wrappers that allow an ObjC-style approach while still supporting SQLite-style access:
PLDatabase
FMDB

Entity, dealing with large number of records (> 35 mlns)

We have a rather large set of related tables with over 35 million related records each. I need to create a couple of WCF methods that would query the database with some parameters (data ranges, type codes, etc.) and return related results sets (from 10 to 10,000 records).
The company is standardized on EF 4.0 but is open to 4.X. I might be able to make argument to migrate to 5.0 but it's less likely.
What’s the best approach to deal with such a large number of records using Entity? Should I create a set of stored procs and call them from Entity or there is something I can do within Entity?
I do not have any control over the databases so I cannot split the tables or create some materialized views or partitioned tables.
Any input/idea/suggestion is greatly appreciated.
At my work I faced a similar situation. We had a database with many tables and most of them contained around 7- 10 million records each. We used Entity framework to display the data but the page seemed to display very slow (like 90 to 100 seconds). Even the sorting on the grid took time. I was given the task to see if it could be optimized or not. and well after profiling it (ANTS profiler) I was able to optimize it (under 7 secs).
so the answer is Yes, Entity framework can handle loads of records (in millions) but some care must be taken
Understand that call to database made only when the actual records are required. all the operations are just used to make the query (SQL) so try to fetch only a piece of data rather then requesting a large number of records. Trim the fetch size as much as possible
Yes, not you should, you must use stored procedures and import them into your model and have function imports for them. You can also call them directly ExecuteStoreCommand(), ExecuteStoreQuery<>(). Sames goes for functions and views but EF has a really odd way of calling functions "SELECT dbo.blah(#id)".
EF performs slower when it has to populate an Entity with deep hierarchy. be extremely careful with entities with deep hierarchy .
Sometimes when you are requesting records and you are not required to modify them you should tell EF not to watch the property changes (AutoDetectChanges). that way record retrieval is much faster
Indexing of database is good but in case of EF it becomes very important. The columns you use for retrieval and sorting should be properly indexed.
When you model is large, VS2010/VS2012 Model designer gets real crazy. so break your model into medium sized models. There is a limitation that the Entities from different models cannot be shared even though they may be pointing to the same table in the database.
When you have to make changes in the same entity at different places, try to use the same entity by passing it and send the changes only once rather than each one fetching a fresh piece, makes changes and stores it (Real performance gain tip).
When you need the info in only one or two columns try not to fetch the full entity. you can either execute your sql directly or have a mini entity something. You may need to cache some frequently used data in your application also.
Transactions are slow. be careful with them.
if you keep these things in mind EF should give almost similar performance as plain ADO.NET if not the same.
My experience with EF4.1, code first: if you only need to read the records (i.e. you won't write them back) you will gain a performance boost by turning of change tracking for your context:
yourDbContext.Configuration.AutoDetectChangesEnabled = false;
Do this before loading any entities. If you need to update the loaded records you can allways call
yourDbContext.ChangeTracker.DetectChanges();
before calling SaveChanges().
The moment I hear statements like: "The company is standardized on EF4 or EF5, or whatever" This sends cold shivers down my spine.
It is the equivalent of a car rental saying "We have standardized on a single car model for our entire fleet".
Or a carpenter saying "I have standardized on chisels as my entire toolkit. I will not have saws, drills etc..."
There is something called the right tool for the right job
This statement only highlights that the person in charge of making key software architecture decisions has no clue about software architecture.
If you are dealing with over 100K records and the datamodels are complex (i.e. non trivial), Maybe EF6 is not the best option.
EF6 is based on the concepts of dynamic reflection and has similar design patterns to Castle Project Active Record
Do you need to load all of the 100K records into memory and perform operations on these ? If yes ask yourself do you really need to do that and why wouldn't executing a stored procedure across the 100K records achieve the same thing. Do some analysis and see what is the actual data usage pattern. Maybe the user performs a search which returns 100K records but they only navigate through the first 200. Example google search, Hardly anyone goes past page 3 of the millions of search results.
If the answer is still yes you need to load all of the 100K records into memory and perform operations. Then maybe you need to consider something else like a custom built write through cache with light weight objects. Maybe lazy load dynamic object pointers for nested objects. etc... One instance where I use something like this is large product catalogs for eCommerce sites where very large numbers of searches get executed against the catalog. Why is in order to provide custom behavior such as early exit search, and regex wildcard search using pre-compiled regex, or custom Hashtable indexes into the product catalog.
There is no one size fits all answer to this question. It all depends the data usage scenarios and how the application works with the data. Consider Gorilla Vs Shark who would win? It all depends on the environment and the context.
Maybe EF6 is perfect for one piece that would benefit from dynamic reflection, While NetTiers is better for another that needs static reflection and an extensible ORM. While low level ADO is perhaps best for extreme high performance pieces.

Is there a better solution than ActiveRecord for batch data imports?

I've developed a web interface for a legacy (vendor) database using Ruby on Rails. The database schema is a complete mess, > 450 tables, and customer data spread over more than 20, involving complex joins, etc.
I've got a good solution for this for the web app, it works very well. But we also do nightly imports from external data sources (currently a view to a SQL Server DB and a SOAP feed) and they run SLOW. About 1.5-2.5 hours for the XML data import and about 4 hours for the DB import.
This is after doing some basic optimizations, which include manually starting the MRI garbage collector. And that right there suggests to me I'm Doing It Wrong. I've considered moving the nightly update/insert tasks out of the main Rails app and trying to use either JRuby or Rubinius to take advantage of the better concurrency and garbage collection.
My question is this: I know ActiveRecord isn't really designed for this type of task. But out of the O/RM options for Ruby (my preferred language), it seems to have the best Oracle support.
What would you do? Stick with AR and use a different interpreter? Would that really help? What about DataMapper or Sequel? Is there a better way of doing this?
I'm open to using Scala or Clojure if there's a better alternative (not limited to, but these are the other languages I'm playing with right now)... but what I don't want is something like DBI where I'm writing straight SQL, if for no other reason than that vendor updates occasionally change the DB schema, and I'd rather change a couple of classes than hundreds of UPDATE or INSERT statements.
Hopefully this question isn't 'too vague,' but I could really use some advice about this issue.
FWIW, Ruby is 1.9.2, Rails is 3.0.7, platform is OS X Server Snow Leopard (or optionally Debian 6.0).
Edit ok just realized that this solution will not work for oracle, sorry ---
You should really check out ActiveRecord-Import, it is easy to use and handles bulk imports with minimal amounts of sql statements. I saw a speed up from 5 hours to 2 minutes. And it will still run validations on the data.
from the github page:
books = []
10.times do |i|
books << Book.new(:name => "book #{i}")
end
Book.import books
https://github.com/zdennis/activerecord-import
From my experience, ORMs are a great tool to use on the front end, where you're mostly just reading the data or updating a single row at a time. On the back end where you're ingesting lost of data at a time, they can cause problems because of the way they tend to interact with the database.
As an example, assume you have a Person object that has a list of Friends that is long (lets say 100 for now). You create the Person object and assign 100 Friends to it, and then save it to the database. It's common for the naive use of an ORM to do 101 writes to the database (one for each Friend, one for the Person). If you were to do this in pure SQL at a lower level, you'd do 2 writes, one for Person and then one for all the Friends at once (an insert with 100 actual rows). The difference between the two actions is significant.
There are a couple ways I've seen to work around the problem.
Use a lower level database API that lets you write your "insert 100 friends in a single call" type command
Use an ORM that lets you write lower level SQL in order to do the Friends insert as a single SQL command (not all of them allow this and I don't know if Rails does)
Use an ORM that lets you batch writes into a single database call. It's still 101 writes to the database, but it allows the ORM to batch them into a single network call to the database and say "do these 101 things". I'm not sure what ORMs allow for this.
There's probably other ways
The main point being that using the ORM to ingest any real sized amount of data can run into efficiency problems. Understanding what the ORM is doing underneath the hood (asking it to log all db calls is a good way to understand what it's doing) is the best first step. Once you know what it's doing, you can look for ways to tell it "what I'm doing doesn't fit well into the normal pattern, lets change how you're using it"... and, should it not have a way that works, you can look at using a lower level API to allow for it.
I'll point out one other thing you can look at with a STRONG caveat that it should be one of the last things you consider. When inserting rows into the database in bulk, you can create a raw text file with all the data (format depends on the db, but the concept is similar to a CSV file) and give the file to the database to import in bulk. It's a bad way to go in almost every case, but I wanted to include it because it does exist as an option.
Edit: As a side note, the comment about more efficiently parsing the XML is a good thing to look at too. Using SAX vs DOM, or a different XML library, can be a huge win in time to completion. In some cases, it can be an even bigger win than more efficient database interaction. For example, you may be parsing a LOT of XML with lots of small pieces of data, and then only use small parts of it. In a case like that, the parsing could take a long time via DOM while SAX could ignore the parts you don't need... or it could be using a lot of memory creating DOM objects and slow down the whole thing due to garbage collection, etc. At the very least, it's worth looking at.
Since your question is indeed "a bit vague", I can only recommend you optimizing the XML import by using XML Pull parsing.
Take a look at this:
https://gist.github.com/827475
I needed to import MySQL XML, and to be fair, using the XML Pull method improved the parse part in factor of around 7 (yes, almost 7 times faster than reading the entire thing in the memory).
Another thing: you are saying "the DB import takes 4 hours". What file formats are these DB exports you are importing?

Flat file in delphi

In my application I want to use files for storing data. I don't want to use database or clear text file, the goal is to save double and integer values along with string just to identify the name of the record ; I simple need to save data on disk for generating reports. File can grow even to gigabyte. What format you suggest to use? Binary? If so what vcl component/library you know which is good to use? My goal is to create an application which creates and updates the files while another tool will "eat" those file
producing nice pdf reports for user on demand. What do you think? Any idea or suggestion?
Thanks in advance.
If you don't want to reinvent the wheel, you may find all needed Open Source tools for your task from our side:
Synopse Big Table to store huge amount of data - see in particular the TSynBigTableRecord class to store an unlimited number of records with fields, including indexes if needed - it will definitively be faster and use less disk size than any other regular SQL DB
Synopse SQLite3 Framework if you would rather use a standard SQLite engine for the storage - it comes with a full Client/Server ORM
Reporting from code, including pdf file generation
With full Source code, working from Delphi 6 up to XE.
I've just updated the documentation of the framework. More than 600 pages, with details of every class method, and new enhanced general introduction. See the SAD document.
Update: If you plan to use SQLite, you should first guess how the data will be stored, which indexes are to be created, and how a SQL query may speed up your requests. It's a bad idea to read all file content for every request: you should better structure your data so that a single SQL query would be able to return the expended results. Sometimes, using additional values (like temporary sums or means) to the data is a good idea. Also consider using the RTree virtual table of SQLite3, which is dedicated to speed up access to double min/max multi-dimensional data: it may speed up a lot your requests.
You don't want to use a full SQL database, and you think that a plain text file is too simple.
Points in between those include:
Something that isn't a full SQL database, but more of a key-value store, would technically not be a flat file, but it does provide a single "key+value" list, that is quickly searchable on a single primary key. Such as BSDDB. It has the letter D and B in the name. Does that make it a database, in your view? Because it's not a relational database, and doesn't do SQL. It's just a binary key-value (hashtable) blob storage mechanism, using a well-understood binary file format. Personally, I wouldn't start a new project and use anything in this category.
Recommended: Something that uses SQL but isn't as large as standalone SQL database servers. For example, you could use SQLite and a delphi wrapper. It is well tested, and used in lots of C/C++ and Delphi applications, and can be trusted more than anything you could roll yourself. It is a very light embedded database, and is trusted by many.
Roll your own ISAM, or VLIR, which will eventually morph over time into your own in-house DBMS. There are multiple files involved, and there are indexes, so you can look up data fast without loading everything into memory. Not recommended.
The most flat of flat binary fixed-record-length files. You mentioned originally in your question, power basic which has something called Random Access files, and then you deleted that from your question. Probably what you are looking for, especially for append-only write as the primary operation. Roll your own TurboPascal era "file of record". If you use the "FILE OF RECORD" type, you hit the 2gb limit, and there are problems with Unicode. So use TStream instead, like this. Binary file formats have a lot of strikes against them, especially since it is difficult to grow and expand your binary file format over time, without breaking your ability to read old files. This is a key reason why I would recommend you start out with what might at first seem like overkill (SQLite) instead of rolling your own binary solution.
(Update 2: After updating the question to mention PDFs and what sounds like a reporting-system requirement, I think you really should be using a real database but perhaps a small and easy to use one, like firebird, or interbase.)
I would suggest using TClientDataSet, and use it's SaveToFile() / SaveToStream() methods by the generating program, and LoadFromFile() / LoadFromStream() methods for the program that will "consume" the data. That way, you can still make indexed records without connecting to any external database, all while keeping the interchange data in a single file.
Define API to work with your flat file, so that the API can be implemented by a separate data layer in many ways.
Implement the API using standard embedded SQL database (ex SQLite or Firebird).
Only if there is something wrong with the standard solution think of your own.
I use KBMMemtable - see http://www.components4developers.com/ - fast, reliable, been around a long time - supports binary and CSV streaming in and out of files, as well indexing, filters, and lots of other goodies - TClientDataSet will not do well with large datasets.

Should i keep a file as text or import to a database?

I am constructing an anagram generator that was a coding exercise, and uses a word list thats about 633,000 lines long (one word per line). I wrote the program just in Ruby originally, and I would like to modify this to deploy it online.
My hosting service supports Ruby on Rails as about the only Ruby-based solution. I thought of hosting on my own machine, and using a smaller framework, but I don't want to deal with the security issues at this moment.
I have only used RoR for database-driven (CRUD) apps. However, I have never populated a sqlite database this way, so this is a two-part question:
1) Should I import this to a database? If so, what's the best method to do so? I would like to stick with sqlite to keep things simple if that's the case.
2) Is a 'flat file' better? I wont be doing any creating or updating, just checking against the list of words.
Thank you.
How about keeping it in memory? Storing that many words would take just a few megabytes of RAM, and otherwise you'd be accessing the file frequently so it'd probably be cached anyway. The advantage of keeping the word list in memory is that you can organize it in whatever data structure suits your needs best (I'm thinking a trie). If you can't spare that much memory, it might be to your advantage to use a database so you can efficiently load only the parts of the word list you need for any given query - of course, in that case you'd want to create some index columns (well at least one) so you can take advantage of the indexing capabilities of SQL.
Assuming that what you're doing is looking up whether a word exists in your list, I would say that SQLite with an indexed column will likely be faster than scanning through the word list linearly. Now, if your current approach is fast enough for your purposes, then I see no reason to bother porting it over to a database; it's just an added headache for no gain as far as you're concerned. If you're seeing the search times become a burden, then dumping it into an indexed database would be a good idea.
You can create the table with the following schema:
CREATE TABLE words (
word text primary key
);
CREATE INDEX word_idx ON words(word);
And import your data with:
sqlite words.db < schema.sql
while read word
do
sqlite3 words.db "INSERT INTO words values('$word');"
done < words.txt
I would skip the database for reasons listed above. A simple hash in memory will perform about as fast a lookup in the database.
Even if the database was a bit faster for the lookup, you're still wasting time with the DB having to parse the query and create a plan for the lookup, then assemble the results and send them back to your program. Plus you can save yourself a dependency.
If you plan on moving other parts of your program to a persistent store, then go for it. But a hashmap should be sufficient for your use.

Resources