I am investigating moving a thick client SQL based Delphi application to Multi Tier thin clients, and have been looking at using Datasnap in Delphi 2010. I have worked through the White Paper written by Bob Swart and extended this further.
My main question really is that I want to make the server side efficient in terms of connections and SQL Queries due to multiple queries being run and remaining open at the same time to interrogate data, can anyone point me in a direction for guidance on how to design a real world Datasnap Server application, as the demo's don't go into enough detail.
Thanks
Matt
You have to decide in your design:
(Mid-Server) Will manage session or client will identify their session each connection (stateful vs stateless)
(Mid-Server) How much cached data you desire to have. You can cache just some annoying very stable tables and only querying them when they change (if all modifications is through mid-server, if not, you need something like an arbitrary mark - a GUID, a counter - on the table to match if data changed).
(Client/Mid-Server) If your client will always get a full collection of data or just fragments of the collection.
(ex.: a product can have an categoryId column, which is a FK to a Category table. You can send both all the time or the client can request only the product data).
(Mid-Server/RDBMS)You maybe have to provide some form of custom search. If you have an clue of the most used search conditions, you can provide (if needed) the index covering to do that.
(Mid-Server/RDBMS)Don't bring great datasets to mid-server, unless you plan on do some aggresive caching of data and/or have some good use for them. Mid-Server is just another client application to the RDBMS - if both are on the same machine, you can enter in a memory/CPU/IO competition with the RDBMS.
(Mid-Server/RDBMS)Execute your business rules on the Mid-Server, is the mantra of many purists out there. For me, equilibrium is key. Let's say the RDBMS have 2000 Stored Procedures (I'm not exagerating, there's real business with such a number of SPs) and your Mid-Server can make an excellent work on 1500 of that 2000 problems solved by SPs, GREAT and do that. But if the last 500 can be a change for worst, let them alone. It can be a bigger hassle than the 1500.. So mix the two, and make those 500 an software project to an other version.
(Mid-Server/RDBMS) What I said for Stored Procedures can also be applied to triggers and other any other kind of RDBMS server features that can make your live easier.
(Mid-Server/DataSnap)Most of the crude details can be let to dataset provider. But learn to leverage the OnBeforeUpdateRecord event to do custom processing when needed.
(Mid-Server/DataSnap)You know you can create a ClientDataset only with modified data by
this kind of code below? You cannot imagine how useful it can be... :-)
Cds_New.Data := Cds_Updated.Delta;
Related
For database management, my team right now is using a RDBMS based solution (MSSQL to be exact), but we expect to move to Cassandra soon as we're expecting a huge bump in traffic.
The application logic right now is decoupled from insertion logic, as the application only calls the specific procedures in SQL which calls some data validations and makes corresponding insertions.
I want to do something similar in Cassandra. However, I am unable to find anything that could aid me in doing so. UDFs are not useful as they are mostly used in SELECT query. I'd appreciate the community's help/advice on this, thanks!
The closest feature to a stored procedure will be a batch as it will allow you to "bundle" different DML statements associated to an insert, update or delete.
If you are moving from RDBMS to Cassandra, one of the biggest challenges is to adjust to the data modeling required, and more specific, to denormalization of data. The data model is the key factor of success (and failure) of any Cassandra implementation, and because of that, you may find several resources in the web (to mention the basics eBay blog, Datastax academy's Data model course)
Good luck with your implementation!
I want to post in memory some child rows, and then conditionally post them, or don't post them to an underlying SQL database, depending on whether or not a parent row is posted, or not posted. I don't need a full ORM, but maybe just this:
User clicks Add doctor. Add doctor dialog box opens.
Before clicking Ok on Add doctor, within the Add doctor dialog, the user adds one or more patients which persist in memory only.
User clicks Ok in Add doctor window. Now all the patients are stored, plus the new doctor.
If user clicked Cancel on the doctor window, all the doctor and patient info is discarded.
Try if you like, mentally, to imagine how you might do the above using delphi data aware controls, and TADOQuery or other ADO objects. If there is a non-ADO-specific way to do this, I'm interested in that too, I'm just throwing ADO out there because I happen to be using MS-SQL Server and ADO in my current applications.
So at a previous employers where I worked for a short time, they had a class called TMasterDetail that was specifically written to add the above to ADO recordsets. It worked sometimes, and other times it failed in some really interesting and difficult to fix ways.
Is there anything built into the VCL, or any third party component that has a robust way of doing this technique? If not, is what I'm talking about above requiring an ORM? I thought ORMs were considered "bad" by lots of people, but the above is a pretty natural UI pattern that might occur in a million applications. If I was using a non-ADO non-Delphi-db-dataset style of working, the above wouldn't be a problem in almost any persistence layer I might write, and yet when databases with primary keys that use identity values to link the master and detail rows get into the picture, things get complicated.
Update: Transactions are hardly ideal in this case. (Commit/Rollback is too coarse a mechanism for my purposes.)
Your asking two separate questions:
How do I cache updates?
How can I commit updates to related tables at the same time.
Cached updates can be accomplished a number of different ways. Which one is best depends on your specific situation:
ADO Batch Updates
Since you've already stated that you're using ADO to access the data this is a reasonable option. You simply need to set the LockType to ltBatchOptimistic and CursorType to either ctKeySet or ctStatic before opening the dataset. Then call TADOCustomDataset.UpdateBatch when you're ready to commit.
Note: The underlying OLEDB provider must support batch updates to take advantage of this. The provider for SQL Server fully supports this.
I know of no other way to enforce the master/detail relationship when persisting the data than to call UpdateBatch sequentially on both datasets.
Parent.UpdateBatch;
Child.UpdateBatch;
Client Datasets
Data caching is one of the primary reasons for TClientDataset's existence and synchronizing a master/detail relationship isn't difficult at all.
To accomplish this you define the master/detail relationship on two dataset components as usual (in your case ADOQuery or ADOTable). Then create a single provider and connect it to the master dataset. Connect a single TClientDataset to the provider and you're done. TClientDatset interprets the detail dataset as a nested dataset field, which can be accessed and bound to data aware controls just like any other dataset.
Once this is in place you simply call TClientDataset.ApplyUpdates and the client dataset will take care of ordering the updates for the master/detail data correctly.
ORMs
There is a lot that can be said about ORMs. Too much to fit into an answer on StackOverflow so I'll try to be brief.
ORMs have gotten a bad rap lately. Some pundits have gone so far as to label them an anti-pattern. Personally I think this is a bit unfair. Object-relational mapping is an incredibly difficult problem to solve correctly. ORMs attempt to help by abstracting away a lot of the complexity involved in transferring data between a relational table and an instance of an object. But like with everything else in software development there are no silver bullets and ORMs are no exception.
For a simple data entry application without a lot of business rules an ORM is probably overkill. But as an application becomes more and more complex an ORM starts to look more appealing.
In most cases you'll want to use a third party ORM rather than rolling your own. Writing a custom ORM that perfectly fits your requirements sounds like a good idea and its easy to get started with simple mappings but you'll soon start running into issues like parent/child relationships, inheritance, caching and cache invalidation (trust me I know this from experience). Third party ORMs have already encountered these issues and spent an enormous amount of resources to solve them.
With many ORMs you trade code complexity for configuration complexity. Most of them are actively working to reduce the boilerplate configuration by turning to conventions and policies. If you name all your primary keys Id rather than having to map each table's Id column to a corresponding Id property for each class you simply tell the ORM about this convention and it assumes all tables and classes its aware of follow the convention. You only have to override the convention for specific cases where it doesn't apply. I'm not familiar with all of the ORMs for Delphi so I can't say which support this and which don't.
In any case you'll want to design your application architecture so you can push off the decision of which ORM framework (or for that matter any framework) to use as long as possible.
Is using Db4o as a backend datastore for a Web site (ASP.NET MVC) a judicious choice as an alternative to MS SQL Server ?
The main issue with DB4o is: Can you cut your object net in some useful manner? If not, then you'll keep too many objects in RAM for too long and your performance will suffer.
For example, in SQL, you can create a cursor and then easily traverse a huge set of results. You can also query for a small set of columns while DB4o always loads the whole objects (and its references and the references of the references). With DB4o, you must make sure that DB4o doesn't try to pull in all objects from the DB at once.
You'll also need to get used to querying things your "DB" by filling out example objects which feels weird in the beginning.
That depends, what kind of site your creating, the traffic your expecting etc...Are you going to handle a million requests a second, or 100 a minute...Does your domain justify using a Object Database? Do you really need it?
In general, most sites are not heavy hitters so they might not require all the scale out functionality (I believe and this is only a belief that traditional RDBMS have been tested and designed to handle extreme loads where as Object DB's might not have been given the same attention).
So then the question is does your domain justify this? Your going to base a core piece of your site on a technology that you will not find a lot of experts in. So how do you handle turn over rate? Are you willing to take the cost associated with training all current and future employees on this?
I am building out some reporting stuff for our website (a decent sized site that gets several million pageviews a day), and am wondering if there are any good free/open source data warehousing systems out there.
Specifically, I am looking for only something to store the data--I plan to build a custom front end/UI to it so that it shows the information we care about. However, I don't want to have to build a customized database for this, and while I'm pretty sure an SQL database would not work here, I'm not sure what to use exactly. Any pointers to helpful articles would also be appreciated.
Edit: I should mention--one DB I have looked at briefly was MongoDB. It seems like it might work, but their "Use Cases" specifically mention data warehousing as "Less Well Suited": http://www.mongodb.org/display/DOCS/Use+Cases . Also, it doesn't seem to be specifically targeted towards data warehousing.
http://www.hypertable.org/ might be what you are looking for is (and I'm going by your descriptions above here) something to store large amounts of logged data with normalization. i.e. a visitor log.
Hypertable is based on google's bigTable project.
see http://code.google.com/p/hypertable/wiki/PerformanceTestAOLQueryLog for benchmarks
you lose the relational capabilities of SQL based dbs but you gain a lot in performance. you could easily use hypertable to store millions of rows per hour (hard drive space withstanding).
hope that helps
I may not understand the problem correctly -- however, if you find some time to (re)visit Kimball’s “The Data Warehouse Toolkit”, you will find that all it takes for a basic DW is a plain-vanilla SQL database, in other words you could build a decent DW with MySQL using MyISAM for the storage engine. The question is only in desired granularity of information – what you want to keep and for how long. If your reports are mostly periodic, and you implement a report storage or cache, than you don’t need to store pre-calculated aggregations (no need for cubes). In other words, Kimball star with cached reporting can provide decent performance in many cases.
You could also look at the community edition of “Pentaho BI Suite” (open source) to get a quick start with ETL, analytics and reporting -- and experiment a bit to evaluate the performance before diving into custom development.
Although this may not be what you were expecting, it may be worth considering.
Pentaho Mondrian
Open source
Uses standard relational database
MDX (think pivot table)
ETL ( via Kettle )
I use this.
In addition to Mike's answer of hypertable, you may want to take a look at Apache's Hadoop project:
http://hadoop.apache.org/
They provide a number of tools which may be useful for your application, including HBase, another implementation of the BigTable concept. I'd imagine for reporting, you might find their mapreduce implementation useful as well.
It all depends on the data and how you plan to access it. MonetDB is a column-oriented database engine from the most revolutionary team on database technologies. They just got VLDB's 10-year best paper award. The DB is open source and there are plenty of reviews online praising them.
Perhaps you should have a look at TPC and see which of their test problem datasets match best your case and work from there.
Also consider the need for concurrency, it adds a big overhead for any kind of approach and sometimes is not really required. For example, you can pre-digest some summary or index data and only have that protected for high concurrency. Profiling your data queries is the following step.
About SQL, I don't like it either but I don't think it's smart ruling out an engine just because of the front-end language.
I see a similar problem and thinking of using plain MyISAM with http://www.jitterbit.com/ as data access layer. Jitterbit (or another free tool alike) seems very nice for this sort of transformations.
Hope this helps a bit.
A lot of people just use Mysql or Postgres :)
Given that database is generally the least scalable component (of a web application), are there any situations where one would put logic in procedures/triggers over keeping it in his favorite programming language (ruby...) or her favorite web framework (...rails!).
Server-side logic is often much faster, even with procedural approach.
You can fine-tune your grant options and hide the data you don't want to show
All queries in one places are more convenient than if they were scattered all around the code.
And here's a (very subjective) article in my blog on the reason I prefer stored procedures:
Schema Junk
BTW, triggers (as opposed to functions / stored procedures / packages) I generally dislike.
They are completely other story.
You're keeping the processing in the database, along with the data.
If you process on the server side, then you have to transfer the data out to a server process across the network, process it, and (optionally) send it back. You have the network bandwidth/latency issues, plus memory overheads.
To clarify - if I have 10m rows of data, my two extreme scenarios are to a) pull those 10m rows across the network and process on the server side, or b) process in place in the database using the server and language (SQL) optimised for this purpose. Note that this is a generalisation and not a hard-and-fast rule, but it's the one I follow for most scenarios.
When many heterogeneous applications and various other systems need to access your single database and be sure through their operations data stays consistent without integrity conflicts. So you put your logic into triggers and stored procedures that will offer an interface to external clients.
Maybe not for most web-based systems, but certainly for enterprise databases. Stored procedures and the like allow you much greater control over security and performance, as well as offering a bit of encapsulation for the database itself. You can change the schema all you want as long as the stored procedure interface remains the same.
In (almost) every situation you would keep the processing that is part of the database in the database. Application code cannot substitute for triggers, you won't get very far before you have updated the database and failed to fire the application's equivalent of the triggers (the first time you use the DBMS's management console, for instance).
Let the database do the database work and let the application to the application's work. If you have a specific performance problem with the database, and that performance problem can be addressed by moving processing from the database, in that case you might want to consider doing so.
But worrying about database performance without a database performance problem existing (which is what you seem to be doing here) is both silly and, sadly, apparently a pre-occupation of many Stackoverlow posters.
Least scalable? SQL???
Look up, "federating."
If the database is shared, having logic in the database is better in order to control everything that happens. If it's not it might just make the system overly complicated.
If you have multiple applications that talk to your database, stored procedures and triggers can enforce correctness more pervasively. Accordingly, if correctness is more important than convenience, putting logic in the database is sensible.
Scalability may be a red herring, though. Sometimes it's easier to express the behavior you want in the domain layer of an OO language, but it can be actually more expensive than doing the idiomatic SQL way.
The security mechanism at a previous company was first built in the service layer, then pushed to the db side. The motivation was actually due to some limitations in a data access framework we were using. The solution turned out to be a bit buggy because our security model was complicated, but the upside was that bugs only had to be fixed in the database; we didn't have to worry about different clients following different rules.
Triggers mean 3rd-party apps can modify the database without creating logical inconsistencies.
If you do that, you are tying your business logic to your model. If you code all your business logic in T-SQL, you aren't going to have a lot of fun if later you need to use Oracle or what have you as your database server. Actually, I'm not sure I understand this question exactly. How do you think this would improve scalability? It really shouldn't.
Personally, I'm really not a fan of triggers, particularly in a database dedicated to a single application. I hate trying to track down why some data is inconsistent, to find it's down to a poorly written trigger (and they can be tricky to get exactly correct).
Security is another advantage of using stored procs. You do not have to set the security at the table level if you don't use dynamic code (Including ithe stored proc). This means your users cannot do anything unless they have a proc to to it. This is one way of reducing the possibility of fraud.
Further procs are easier to performance tune than most application code and even better, when one needs to change, that is all you have to put on production, not recomplie the whole application.
Data integrity must be maintained at the database level. That means constraints, defaults values, foreign keys, possibly triggers (if you have very complex rules or ones involving multiple tables). If you do not do this at the database level, you will eventually have integrity issues. Peolpe will write a quick fix for a problem and run the code in the query window and the required rules are missed creating a larger problem. A millino new records will have to be imported through an ETL program that doesn't access the application because going through the application code would take too long running one record at a time.
If you think you are building an application where scalibility will be an issue, you need to hire a database professional and follow his or her suggestions for design based on performance. Databases can scale to terrabytes of data but only if they are originally designed by someone is a specialist in this kind of thing. When you wait until the while application is runnning slower than dirt and you havea new large client coming on board, it is too late. Database design must consider performance from the beginning as it is very hard to redesign when you already have millions of records.
A good way to reduce scalability of your data tier is to interact with it on a procedural basis. (Fetch row..process... update a row, repeat)
This can be done within a stored procedure by use of cursors or within an application (fetch a row, process, update a row) .. The result (poor performance) is the same.
When people say they want to do processing in their application it sometimes implies a procedural interaction.
Sometimes its necessary to treat data procedurally however from my experience developers with limited database experience will tend to design systems in a way that do not leverage the strenght of the platform because they are not comfortable thinking in terms of set based solutions. This can lead to severe performance issues.
For example to add 1 to a count field of all rows in a table the following is all thats needed:
UPDATE table SET cnt = cnt + 1
The procedural treatment of the same is likely to be orders of magnitude slower in execution and developers can easily overlook concurrency issues that make their process inconsistant. For example this kind of code is inconsistant given the avaliable read isolation levels of many RDMBS platforms.
SELECT id,cnt FROM table
...
foreach row
...
UPDATE table SET cnt = row.cnt+1 WHERE id=row.id
...
I think just in terms of abstraction and ease of servicing a running environment utilizing stored procedures can be a useful tool.
Procedure plan cache and reduced number of network round trips in high latency environments can also have significant performance advantages.
It is also true that trying to be too clever or work very complex problems in the RDBMS's half-baked procedural language can easily become a recipe for disaster.
"Given that database is generally the least scalable component (of a web application), are there any situations where one would put logic in procedures/triggers over keeping it in his favorite programming language (ruby...) or her favorite web framework (...rails!)."
What makes you think that "scalability" is the only relevant concern in a system design ? I agree with rexem where he commented that it is very obvious that you are "not" biased ...
Databases are sets of assertions of fact. Those sets become more valuable if they can also be guaranteed to conform to certain integrity rules. Those guarantees are not worth a dime if it is the applications that are expected to enforce such integrity. Triggers and sprocs are the only way SQL systems have to allow such guarantees to be offered by the DBMS itself.
That aspect outweighs "scalability" anytime, anywhere, anyhow.