Entity, dealing with large number of records (> 35 mlns) - entity-framework-4

We have a rather large set of related tables with over 35 million related records each. I need to create a couple of WCF methods that would query the database with some parameters (data ranges, type codes, etc.) and return related results sets (from 10 to 10,000 records).
The company is standardized on EF 4.0 but is open to 4.X. I might be able to make argument to migrate to 5.0 but it's less likely.
What’s the best approach to deal with such a large number of records using Entity? Should I create a set of stored procs and call them from Entity or there is something I can do within Entity?
I do not have any control over the databases so I cannot split the tables or create some materialized views or partitioned tables.
Any input/idea/suggestion is greatly appreciated.

At my work I faced a similar situation. We had a database with many tables and most of them contained around 7- 10 million records each. We used Entity framework to display the data but the page seemed to display very slow (like 90 to 100 seconds). Even the sorting on the grid took time. I was given the task to see if it could be optimized or not. and well after profiling it (ANTS profiler) I was able to optimize it (under 7 secs).
so the answer is Yes, Entity framework can handle loads of records (in millions) but some care must be taken
Understand that call to database made only when the actual records are required. all the operations are just used to make the query (SQL) so try to fetch only a piece of data rather then requesting a large number of records. Trim the fetch size as much as possible
Yes, not you should, you must use stored procedures and import them into your model and have function imports for them. You can also call them directly ExecuteStoreCommand(), ExecuteStoreQuery<>(). Sames goes for functions and views but EF has a really odd way of calling functions "SELECT dbo.blah(#id)".
EF performs slower when it has to populate an Entity with deep hierarchy. be extremely careful with entities with deep hierarchy .
Sometimes when you are requesting records and you are not required to modify them you should tell EF not to watch the property changes (AutoDetectChanges). that way record retrieval is much faster
Indexing of database is good but in case of EF it becomes very important. The columns you use for retrieval and sorting should be properly indexed.
When you model is large, VS2010/VS2012 Model designer gets real crazy. so break your model into medium sized models. There is a limitation that the Entities from different models cannot be shared even though they may be pointing to the same table in the database.
When you have to make changes in the same entity at different places, try to use the same entity by passing it and send the changes only once rather than each one fetching a fresh piece, makes changes and stores it (Real performance gain tip).
When you need the info in only one or two columns try not to fetch the full entity. you can either execute your sql directly or have a mini entity something. You may need to cache some frequently used data in your application also.
Transactions are slow. be careful with them.
if you keep these things in mind EF should give almost similar performance as plain ADO.NET if not the same.

My experience with EF4.1, code first: if you only need to read the records (i.e. you won't write them back) you will gain a performance boost by turning of change tracking for your context:
yourDbContext.Configuration.AutoDetectChangesEnabled = false;
Do this before loading any entities. If you need to update the loaded records you can allways call
yourDbContext.ChangeTracker.DetectChanges();
before calling SaveChanges().

The moment I hear statements like: "The company is standardized on EF4 or EF5, or whatever" This sends cold shivers down my spine.
It is the equivalent of a car rental saying "We have standardized on a single car model for our entire fleet".
Or a carpenter saying "I have standardized on chisels as my entire toolkit. I will not have saws, drills etc..."
There is something called the right tool for the right job
This statement only highlights that the person in charge of making key software architecture decisions has no clue about software architecture.
If you are dealing with over 100K records and the datamodels are complex (i.e. non trivial), Maybe EF6 is not the best option.
EF6 is based on the concepts of dynamic reflection and has similar design patterns to Castle Project Active Record
Do you need to load all of the 100K records into memory and perform operations on these ? If yes ask yourself do you really need to do that and why wouldn't executing a stored procedure across the 100K records achieve the same thing. Do some analysis and see what is the actual data usage pattern. Maybe the user performs a search which returns 100K records but they only navigate through the first 200. Example google search, Hardly anyone goes past page 3 of the millions of search results.
If the answer is still yes you need to load all of the 100K records into memory and perform operations. Then maybe you need to consider something else like a custom built write through cache with light weight objects. Maybe lazy load dynamic object pointers for nested objects. etc... One instance where I use something like this is large product catalogs for eCommerce sites where very large numbers of searches get executed against the catalog. Why is in order to provide custom behavior such as early exit search, and regex wildcard search using pre-compiled regex, or custom Hashtable indexes into the product catalog.
There is no one size fits all answer to this question. It all depends the data usage scenarios and how the application works with the data. Consider Gorilla Vs Shark who would win? It all depends on the environment and the context.
Maybe EF6 is perfect for one piece that would benefit from dynamic reflection, While NetTiers is better for another that needs static reflection and an extensible ORM. While low level ADO is perhaps best for extreme high performance pieces.

Related

Core Data Model Design - Attributes vs Entities

I've been developing a very basic core data application for over a year now (Toy Collector, http://bit.ly/tocapp), and I'm looking at doing a redesign so that I can build in iCloud support. I figured while I'm doing that, I might as well update my core data model (if needed), and I'm having a heck of a time tracking down "best practices" for the following:
Currently, I have 2 entities:
Toy, Keywords
Toy has all the information about the object: Name, Year, Set, imageName, Owned, Wanted, Manufacturer, etc, (18 attributes in all)
Keywords has the normalized words to help speed up the search
My question is whether or not there is any advantage to breaking out some of the Toy attributes into their own entities. For example, I could have a manufacturer entity that stores the dozen or so manufacturers, instead of keeping that information in the Toy object. My gut tells me this could reduce the memory footprint (instead of 50,000 objects storing a manufacturer string, there would simple be 12 manufacturer strings in an entity with a relationship to the main Toy entity). Does that kind of organization really matter? Am I trying to overcomplicate things? I just feel like my entity has a lot of attributes, and I'm not sure if taking the time to break it apart into multiple entities would make a difference.
Any advice or pointers would be appreciated!
Zack
Your question is pretty broad, since it addresses the topic of database design. Let me say upfront that it is almost impossible to give you any sensible suggestions, since I would need to know a lot more about your app, use cases, etc. than it is possible through a S.O. question.
Coming to your concrete questions, I would say that you correctly identify one of advantages of splitting a table into multiple ones; actually, the advantage of doing that is not just reducing the database footprint, rather keep data redundancy to a minimum. Redundancy not only affects memory footprint but also manageability and modifiability of your data, and lack of redundancy could even cause anomalies or corruption. There is even a whole database theory topic which is known as database normalisation that addresses this king of concerns.
On the other hand, as it is always the case, redundancy can help performance, and this is actually the case when you can fetch your data through a simple query instead of multiple queries or table joins. There is a technique to improving a database performance which is known as database denormalization and is the exact opposite to normalisation. Your current scheme is fully denormalized.
Using Core Data, which is a relational object graph manager running often on top of SQLite, which is a relational database manager, you have also to take into account the fact that Core Data will automatically build your object graph and fetch into memory the data when you need it. This means that if you can take a smaller memory footprint on disk for granted, this might not be the case when it comes to RAM footprint of your query results (Core Data will "explode", so to say, at some moment your data from multiple tables into one object plus its attributes).
In your specific case, you should also possibly take into account the cost of migrating your existing user base (if the database is not read-only).
All in all, I would say that if your app does not have any database footprint issues at the moment; if you do not feel that creating new tables might be useful, e.g., in order to add new functionality, such as listing all manufacturers; and, finally, if you do not foresee tasks like renaming a manufacturer or such at some point, then maybe refactoring your database will not add much benefit. But, as I say, without knowing your app in detail and your roadmap for it, it is difficult to say anything really on spot. In any case, I hope this general considerations will help you take a decision.
EDIT:
If you want to investigate your core data performance and try to understand where the bottlenecks are, give a try to Instruments/Core Data tool (Product/Profile menu). There are a lot of things that can go bad.
On the other hand, it is really hard to help you further without having more details about the type of searches your app allows to do. One thing that is not clear to me is if your searches are slow only when they return a lot of results or they are slow even when returning a few results.
Normalizing might help performance if you only use (say, after doing a search) just one normalized entity (e.g., to display the toy name in a table). In this case all of the attributes referring to other entities would be faults (hence would not occupy memory nor take) and this might speed up things. But, if you do a search and then display the information from the other tables as well, then there might not be any advantage, quite the opposite, since the faults would have to be resolved immediately and this would produce more accesses to the database.
Also it is true that depending on how you use it, core data could not be the best way to handle your data. Have a look at this Brent Simmons' post relating his experience.

Ruby on Rails database and application design

We have to create rather large Ruby on Rails application based on large database. This database is updated daily, each table has about 500 000 records (or more) and this number will grow over time. We will also have to provide proper versioning of all data along with referential integrity. It must be possible for user to move from version to version, which are kind of "snapshots" of main database at different points of time. In addition some portions of data need to be served to other external applications with and API.
Considering large amounts of data we thought of splitting database into pieces:
State of the data at present time
Versioned attributes of each table
Snapshots of the first database at specific, historical points in time
Each of those would have it's own application, creating a service with API to interact with the data. It's needed as we don't want to create multiple applications connecting to multiple databases directly.
The question is: is this the proper approach? If not, what would you suggest?
We've never had any experience with project of this magnitude and we're trying to find the best possible solution. We don't know if this kind of data separation has any sense. If so, how to provide proper communication of different applications with individual services and between services themselves, as this will be also required.
In general the amount of data in the tables should not be your first concern. In PostgreSQL you have a very large number of options to optimize queries against large tables. The larger question has to do with what exactly you are querying, when, and why. Your query loads are always larger concerns than the amount of data. It's one thing to have ten years of financial data amounting to 4M rows. It's something different to have to aggregate those ten years of data to determine what the balance of the checking account is.
In general it sounds to me like you are trying to create a system that will rely on such aggregates. In that case I recommend the following approach, which I call log-aggregate-snapshot. In this, you have essentially three complementary models which work together to provide up-to-date well-performing solution. However the restrictions on this are important to recognize and understand.
Event model. This is append-only, with no updates. In this model inserts occur, and updates to some metadata used for some queries only as absolutely needed. For a financial application this would be the tables representing the journal entries and lines.
The aggregate closing model. This is append-only (though deletes are allowed for purposes of re-opening periods). This provides roll-forward information for specific purposes. Once a closing entry is in, no entries can be made for a closed period. In a financial application, this would represent closing balances. New balances can be calculated by starting at an aggregation point and rolling forward. You can also use partial indexes to make it easier to pull just the data you need.
Auxiliary data model. This consists of smaller tables which do allow updates, inserts, and deletes provided that integrity to the other models is not impinged. In a financial application this might be things like customer or vendor data, employee data, and the like.

Is there some way in Delphi to cache master-detail rows and post both master and detail child rows at the same time

I want to post in memory some child rows, and then conditionally post them, or don't post them to an underlying SQL database, depending on whether or not a parent row is posted, or not posted. I don't need a full ORM, but maybe just this:
User clicks Add doctor. Add doctor dialog box opens.
Before clicking Ok on Add doctor, within the Add doctor dialog, the user adds one or more patients which persist in memory only.
User clicks Ok in Add doctor window. Now all the patients are stored, plus the new doctor.
If user clicked Cancel on the doctor window, all the doctor and patient info is discarded.
Try if you like, mentally, to imagine how you might do the above using delphi data aware controls, and TADOQuery or other ADO objects. If there is a non-ADO-specific way to do this, I'm interested in that too, I'm just throwing ADO out there because I happen to be using MS-SQL Server and ADO in my current applications.
So at a previous employers where I worked for a short time, they had a class called TMasterDetail that was specifically written to add the above to ADO recordsets. It worked sometimes, and other times it failed in some really interesting and difficult to fix ways.
Is there anything built into the VCL, or any third party component that has a robust way of doing this technique? If not, is what I'm talking about above requiring an ORM? I thought ORMs were considered "bad" by lots of people, but the above is a pretty natural UI pattern that might occur in a million applications. If I was using a non-ADO non-Delphi-db-dataset style of working, the above wouldn't be a problem in almost any persistence layer I might write, and yet when databases with primary keys that use identity values to link the master and detail rows get into the picture, things get complicated.
Update: Transactions are hardly ideal in this case. (Commit/Rollback is too coarse a mechanism for my purposes.)
Your asking two separate questions:
How do I cache updates?
How can I commit updates to related tables at the same time.
Cached updates can be accomplished a number of different ways. Which one is best depends on your specific situation:
ADO Batch Updates
Since you've already stated that you're using ADO to access the data this is a reasonable option. You simply need to set the LockType to ltBatchOptimistic and CursorType to either ctKeySet or ctStatic before opening the dataset. Then call TADOCustomDataset.UpdateBatch when you're ready to commit.
Note: The underlying OLEDB provider must support batch updates to take advantage of this. The provider for SQL Server fully supports this.
I know of no other way to enforce the master/detail relationship when persisting the data than to call UpdateBatch sequentially on both datasets.
Parent.UpdateBatch;
Child.UpdateBatch;
Client Datasets
Data caching is one of the primary reasons for TClientDataset's existence and synchronizing a master/detail relationship isn't difficult at all.
To accomplish this you define the master/detail relationship on two dataset components as usual (in your case ADOQuery or ADOTable). Then create a single provider and connect it to the master dataset. Connect a single TClientDataset to the provider and you're done. TClientDatset interprets the detail dataset as a nested dataset field, which can be accessed and bound to data aware controls just like any other dataset.
Once this is in place you simply call TClientDataset.ApplyUpdates and the client dataset will take care of ordering the updates for the master/detail data correctly.
ORMs
There is a lot that can be said about ORMs. Too much to fit into an answer on StackOverflow so I'll try to be brief.
ORMs have gotten a bad rap lately. Some pundits have gone so far as to label them an anti-pattern. Personally I think this is a bit unfair. Object-relational mapping is an incredibly difficult problem to solve correctly. ORMs attempt to help by abstracting away a lot of the complexity involved in transferring data between a relational table and an instance of an object. But like with everything else in software development there are no silver bullets and ORMs are no exception.
For a simple data entry application without a lot of business rules an ORM is probably overkill. But as an application becomes more and more complex an ORM starts to look more appealing.
In most cases you'll want to use a third party ORM rather than rolling your own. Writing a custom ORM that perfectly fits your requirements sounds like a good idea and its easy to get started with simple mappings but you'll soon start running into issues like parent/child relationships, inheritance, caching and cache invalidation (trust me I know this from experience). Third party ORMs have already encountered these issues and spent an enormous amount of resources to solve them.
With many ORMs you trade code complexity for configuration complexity. Most of them are actively working to reduce the boilerplate configuration by turning to conventions and policies. If you name all your primary keys Id rather than having to map each table's Id column to a corresponding Id property for each class you simply tell the ORM about this convention and it assumes all tables and classes its aware of follow the convention. You only have to override the convention for specific cases where it doesn't apply. I'm not familiar with all of the ORMs for Delphi so I can't say which support this and which don't.
In any case you'll want to design your application architecture so you can push off the decision of which ORM framework (or for that matter any framework) to use as long as possible.

Linq to Sql structure standard

I was wondering what the "correct"-way is when using a Linq to Sql-model in Visual Studio.
Should I create a new model for each of my components, say Blog, Users, News and so on and have all different xxxDataContext's with tables and SPROCs added in each of these.
Or should I create one MyDbDataContext and always work against that?
What's the pro's/con's? My gut tells me to divide it up in smaller context's, but it also feels like that could bring problems as the project expands?
What's the deal? Help me Stackoverflow :)
There will always be overhead when creating the data context as the model needs to be built. Depending on the number of tables in your database this might not be much of a big deal though. If it's only 10 tables or so, the overhead will not be much more than that for a context with say 1 table (sorry, I don't have actual stress testing to show the overhead, but, hey, maybe that gives me something to blog on this weekend).. When looking at large databases the overhead might be a enough to consider using seperate contexts.
The main advantage I would see with using a single data context is that you gain the ability to use JOINs in your LINQ query and that will be translated to T-SQL. Where as if you do the join after the arrays of objects are pulled, the performance might be a bit slower. Additionally, keeping track of multiple data contexts might be confusing and good naming conventions would be needed. So building your own data model w/ business logic which encapsulates the contexts would be a bit harder. I've done this and it's not fun :)
However, if you still feel you want to go that route, then I would recommend putting similar tables (that you might need to join) in the same context. Also, there are some tuts online that recommend using a shared MappingSource when using multiple contexts that use the same source. Information on this can be found here: http://www.albahari.com/nutshell/speedinguplinqtosql.aspx
Sorry, I know that's not really a black and white answer, but hopefully it helps :)
Addition:
Just wanted to add that I did a small test and ran 20,000 SELECT statements against a small sized table using 2 different data contexts:
DataClasses1DataContext contained mappings to all tables in the db (4 total)
DataClasses2DataContext contained a single mapping for just the one table
Results:
Time to execute 20000 SELECTs using DataClasses1DataContext: 00:00:10.4843750
Time to execute 20000 SELECTs using DataClasses2DataContext: 00:00:10.4218750
As you can see, it's not much of a difference.

Stored Procedures - General or Specific

What is the preferred way to use stored procedures between the following two methods and why:
One general SP such as 'GetOrders' which returns all the columns for the table Order. Several different parts of the application will use the same SP.
OR
Several more specific SPs such as 'GetOrdersForUse1' and 'GetOrdersForUse2' which return a subset of all the columns. Each SP is only used by one part of the application.
In the general case, the application will only use a subset of the columns returned by the SP. I was thinking of using the specific method for performance reasons but is it really going to be worth the extra work? I am developing a web site using ASP.NET and SQL 2005.
Like all great things it depends. How different is the logic in your variations. If for example the only difference is the return columns, then all your saving is some bandwidth over the network and some memory both of which are a lot cheaper then the time its going to take to create the variations test them and maintain them.
Now if there is very significant different selection logic going (joining different tables etc), then you might be better off having specialized SP's.
One last thing don't prematurely optimize. Build it simple and working first, then when you discover you need that extra millisecond then you can look at tweaking.
I would go for your second option simply because you should NOT be extracting data from the database that you don't need (either rows or columns) - it puts unnecessary strain of the DBMS and transmits useless data across the wire, wasting network bandwidth.
Think of them as functions, if you would write a separate function then probably use separate stored procedures. If there is any doubt remaining, use separate stored procedures because:
it will save bandwith
it will save memory
I find that separate stored procedures are easier to maintain than one giant one.

Resources