I have a hierarchical structure with millions of records.
I'm doing a recursive scan of the DB in order to update some of the connections and some of the data.
The problem is that I get an OutOfMemoryException, since the entire DB is eventually loaded into the context (via lazy loading). Data that I no longer need stays in the context, and I have no way of removing it.
I also can't wrap the work in using (context ...) blocks, since I need the context to stay alive for the whole recursive scan.
Please take the recursion as a fact.
Thanks
This sort of operation is really not handled well by entities, nor does it scale well with them. I tend to resort to stored procedures for batch ops.
If you do want to remove/dump objects from context, I believe this post has some info (solution to your problem at the bottom).
I just ran into the same problem. I used NHibernate as my ORM tool before I used EF, and it had exactly the same problem. These frameworks just keep the objects in memory as long as the context is alive, which has two consequences:
Serious performance slowdown: the framework does comparisons between the objects in memory (e.g. to see if an object exists or not). You will notice a gradual degradation of performance when processing many records.
You will eventually run out of memory.
If possible I always try to do large batch operations on the database using pure SQL (as the post above states clearly), but in this case that wasn't an option. To solve this, NHibernate has a 'Clear' method on the session, which throws away all objects in memory that refer to database records (new ones, added ones, corrupt ones...)
I tried to mimic this method in entity framework as follows (using the post described above):
public partial class MyEntities
{
    public IEnumerable<ObjectStateEntry> GetAllObjectStateEntries()
    {
        return ObjectStateManager.GetObjectStateEntries(EntityState.Added |
                                                        EntityState.Deleted |
                                                        EntityState.Modified |
                                                        EntityState.Unchanged);
    }

    public void ClearEntities()
    {
        foreach (var objectStateEntry in GetAllObjectStateEntries())
        {
            Detach(objectStateEntry.Entity);
        }
    }
}
The GetAllObjectStateEntries() method is kept separate because it's useful for other things. This goes into a partial class with the same name as your Entities class (the one EF generates, MyEntities in this example), so it is available on your entities instance.
I now call this clear method every 1000 records I process, and my application, which used to run for about 70 minutes (only about 400k entities to process, not even millions), now finishes in 25 minutes. Memory used to peak at 300MB; now it stays around 50MB.
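To make the "clear every 1000 records" idea concrete, here is a minimal sketch of the batching loop on its own, without any EF types. BatchedScan, Run and the clearContext callback are hypothetical names for illustration; in the real code clearContext would be something like the ClearEntities() method above. Since detached entities are no longer tracked, I'd assume you save any pending changes before each clear.

```csharp
using System;
using System.Collections.Generic;

public static class BatchedScan
{
    // Runs `process` on every item and invokes `clearContext` after each
    // full batch of `batchSize` items. Returns the number of clears performed.
    public static int Run<T>(IEnumerable<T> items, Action<T> process,
                             Action clearContext, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        int processed = 0;
        int clears = 0;
        foreach (var item in items)
        {
            process(item);
            processed++;
            if (processed % batchSize == 0)
            {
                // Hypothetical hook: e.g. save pending changes, then ClearEntities().
                clearContext();
                clears++;
            }
        }
        return clears;
    }
}
```

The batch size is a tuning knob: smaller batches keep memory flatter, larger ones mean fewer clears.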
One of our contractors implemented a repository pattern with a code-first approach. We use Service Locator as our DI pattern. When we retrieve data from the DB, we pass an interface to the GetQueryable function and get the data back. However, I see serious performance issues in our application. I added MiniProfiler and MiniProfiler.EF to see where the bottleneck is.
We have a case table which has quite a few fields (around 25), and some of those fields are associated with other tables as one-to-one or one-to-many (only one field has a one-to-many relation to another table). When I open the case detail, it runs around 400 SQL queries, and SQL takes around 40 percent of the load time as far as MiniProfiler is concerned. Here are our GetQueryable and Find methods:
public IQueryable<T> GetQueryable<T>(params string[] includes)
{
    Type type = _impls.Value[typeof (T).Name].GetType();
    DbSet dbSet = Db.Set(type);
    foreach (var include in includes)
    {
        dbSet.Include(include);
    }
    return ((IQueryable<T>) dbSet);
}
I added includes to this method to attach other related tables, but it did not make any difference. Here is the Find method:
public T Find<T>(long? id)
{
    Type type = _impls.Value[typeof(T).Name].GetType();
    return (T) Db.Set(type).Find(id);
}
I tried to apply pretty much all the usual performance improvements, but the number of SQL queries has not gone down. I tried to disable lazy loading, but it caused many problems in other parts of the application.
Some additional information: the case table has 70000 rows and our dialogs table has 500000 rows. Case and Dialog are associated as one-to-many, and each case has 20-40 dialog entries.
My questions are;
Why does Include not make any difference when I use it?
Is there any other way to reduce the number of queries run?
Do you think the implementation is the problem?
Thanks
Include returns a new IQueryable and does not modify the source query. In addition you can use the generic version of Set which simplifies the code a bit:
public IQueryable<T> GetQueryable<T>(params string[] includes)
{
    IQueryable<T> query = Db.Set<T>();
    foreach (var include in includes)
    {
        query = query.Include(include);
    }
    return query;
}
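The same pitfall shows up with any LINQ operator on IQueryable, not just Include, so it can be reproduced without EF at all. The sketch below uses Where on an in-memory query as a stand-in; QueryComposition and its two methods are made-up names for illustration only.

```csharp
using System.Linq;

public static class QueryComposition
{
    // Mirrors the original bug: the operator's return value is discarded,
    // so the caller gets back the unmodified source query.
    public static IQueryable<int> Discarded(IQueryable<int> source)
    {
        source.Where(x => x > 2); // builds a new query and throws it away
        return source;
    }

    // Mirrors the fix: the new query is assigned back before returning.
    public static IQueryable<int> Reassigned(IQueryable<int> source)
    {
        source = source.Where(x => x > 2);
        return source;
    }
}
```

Calling both with new[] { 1, 2, 3, 4 }.AsQueryable() makes the difference visible: the first still yields all four items, the second only the two that pass the filter.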
Step 1: Fire your contractor. Seriously. Like right now. That is some awful code. Not only did they miss something as simple and basic as using the generic version of Set, but they've successfully only made working with Entity Framework more complex, because all the repository does is proxy Entity Framework methods with its own unique and bastardized API.
That said, there's really not enough here to diagnose what your problem is. The use of Include may give you larger queries, but it should actually serve to reduce the overall number of queries issued. It's possible you're just not using includes where you should be.
Now, the fact that you "tried to disable lazy loading, but it caused many problems in other parts of the application", means that you're relying too heavily on lazy-loading. Basically, you're loading in stuff you don't even know about, which is the antithesis of optimization. Ironically, you'd actually be best served by going ahead and disabling lazy-loading, and then tracking down where your code fails because of that. If you want to actually lazy-load that thing, you can use .Load (see: Explicit Loading). But, if you want to eager-load to reduce queries, then you know what includes you need to add.
I'm learning ASP.NET MVC and I have some questions that the tutorials I've read so far haven't covered well enough for me. I've tried searching, but I didn't see any questions asking this. Still, please forgive me if I have missed an existing one.
If I have a single ASP.NET MVC application that has a number of models (some of which related and some unrelated with each other), how many DbContext subclasses should I create, if I want to use one connection string and one database globally for my application?
One context for every model?
One context for every group of related models?
One context for all the models?
If the answer is one of the first two, then is there anything I should have in mind to make sure that only one database is created for the whole application? I ask because, when debugging locally in Visual Studio, it looks to me like it's creating as many databases as there are contexts. That's why I find myself using the third option, but I'd like to know if it's a correct practice or if I'm making some kind of mistake that will come back and bite me later.
@jrummell is only partially correct. Entity Framework will create one database per DbContext type, if you leave it to its own devices. Using the concept of "bounded contexts" that @NeilThompson mentioned from Julie Lerman, all you're doing is essentially telling each context to actually use the same database. Julie's method uses a generic pattern so that each DbContext that implements it ends up on the same database, but you could do it manually for each one, which would look like:
public class MyContext : DbContext
{
    public MyContext()
        : base("name=DatabaseConnectionStringNameHere")
    {
        Database.SetInitializer(null);
    }
}
In other words, Julie's method just sets up a base class that each of your contexts can inherit from that handles this piece automatically.
This does two things: 1) it tells your context to use a specific database (i.e., the same as every other context) and 2) it tells your context to disable database initialization. This last part is important because these contexts are now essentially treated as database-first. In other words, you now have no context that can actually cause a database to be created, or to signal that a migration needs to occur. As a result, you actually need another "master" context that will have every single entity in your application in it. You don't have to use this context for anything other than creating migrations and updating your database, though. For your code, you can use your more specialized contexts.
The other thing to keep in mind with specialized contexts is that each instantiation of each context represents a unique state even if they share entities. For example, a Cat entity from one context is not the same thing as a Cat entity from a second context, even if they share the same primary key. You will get an error if you retrieved the Cat from the first context, updated it, and then tried to save it via the second context. That example is a bit contrived since you're not likely to have the same entity explicitly in two different contexts, but when you get into foreign key relationships and such it's far more common to run into this problem. Even if you don't explicitly declare a DbSet for a related entity, if an entity in the context depends on it, EF will implicitly create a DbSet for it. All this is to say that if you use specialized contexts, you need to ensure that they are truly specialized and that there is zero crossover at any level of related items.
I use what Julie Lerman calls the Bounded Context
The SystemUsers code might have nothing to do with Products - so I might have a System DbContext and a Shop DbContext (for example).
Life is easier with a single context in a small app, but for a larger application it helps to break the contexts up.
Typically, you should have one DbContext per database. But if you have separate, unrelated groups of models, it would make sense to have separate DbContext implementations.
"it looks to me like it's creating as many databases as there are contexts."
That's correct, Entity Framework will create one database per DbContext type.
Linq To SQL's DataContext has an overload on SubmitChanges that allows updates to continue when an optimistic concurrency exception is thrown, and provides the developer with a mechanism to resolve the conflicts afterwards in a single Try Catch block.
Even the WCF Data Services context has a SaveChangesOptions.ContinueOnError option for its SaveChanges method that at least allows you to continue updating when an error has occurred and leaves conflicting updates unresolved so you can look into them later.
(1) Why then does the ObjectContext.SaveChanges method have no such option?
(2) Do any update patterns exist that will mimic the Linq To SQL behaviour? The examples I find on MSDN make it appear as if a single Try Catch block will see you home in the case of multiple updates. But this pattern does not allow you to investigate each conflicting update separately: it just alerts you to the first conflict and then gives you the option to "wipe the table clean in one sweep" to prevent any further optimistic concurrency exceptions from surfacing, without your knowing whether any exist and what you would have liked to do about them.
Why then does the ObjectContext.SaveChanges method have no such option?
I think the simplest answer is because Linq-to-Sql, Entity Framework and WCF Data Services were all implemented by different teams and internal communication among these teams doesn't work as we would hope. I have described some interesting features missing in newer APIs in one of my former answers but I don't think this is a missing feature - I will explain it in the second part of my answer.
WCF Data Services have more interesting features which should be part of Entity framework as well. For example:
Change and Query interceptors
Batching multiple queries and SaveChanges operations to single call to server
Asynchronous operations - this will come to EF6 in form of async/await implementation
Do any update patterns exist that will mimic the Linq To SQL behaviour?
There is a pattern that solves this, but you will probably not like it. EF's SaveChanges works as a unit of work: it either saves all changes or none. If you have a scenario where your save operation can end with only part of your changes persisted, then it should not be handled by a single SaveChanges call. Each atomic set of changes should have its own SaveChanges call:
using (var scope = new TransactionScope(...)) {
    foreach (var entity in someEntitiesToModify) {
        try {
            context.SomeEntities.Attach(entity);
            context.ObjectStateManager.ChangeObjectState(entity, EntityState.Modified);
            context.SaveChanges();
        }
        catch (OptimisticConcurrencyException e) {
            // Do something here to inspect and resolve this one conflict.
            // Note: Refresh takes the refresh mode first, then the entity.
            context.Refresh(RefreshMode.ClientWins, e.StateEntries[0].Entity);
            context.SaveChanges();
        }
    }
    scope.Complete();
}
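If you'd rather look at all the conflicts afterwards (closer to what Linq To SQL offers) instead of resolving each one inline, the same "one save per atomic change" loop can collect them. This is only a sketch with no EF types involved: ConflictException is a hypothetical stand-in for OptimisticConcurrencyException, and saveOne stands for whatever attaches one entity and calls SaveChanges.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for EF's OptimisticConcurrencyException.
public class ConflictException : Exception
{
    public ConflictException(string message) : base(message) { }
}

public static class PerItemSave
{
    // Saves each item with its own call and returns the items whose save
    // raised a conflict, so each one can be investigated separately later.
    // Any other exception type is not caught and aborts the loop.
    public static List<T> SaveEach<T>(IEnumerable<T> items, Action<T> saveOne)
    {
        var conflicts = new List<T>();
        foreach (var item in items)
        {
            try
            {
                saveOne(item);
            }
            catch (ConflictException)
            {
                conflicts.Add(item);
            }
        }
        return conflicts;
    }
}
```

This only works cleanly when the items really are independent; related entities run into the problems described next.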
I think the reason this feature doesn't exist is that it is not generic and, as mentioned above, it goes against the unit of work pattern. Consider this example:
You load an entity
You add a new dependent entity to navigation property of your loaded entity
You change something on the loaded entity
In the meantime somebody else concurrently deletes your loaded entity
You trigger SaveChanges with relaxed conflict resolution
EF will try to save changes to the principal entity but it conflicts because there is no entity to update in the database
EF will continue because conflict resolution is relaxed
EF will try to insert the dependent entity but it will fire a SqlException because the principal entity doesn't exist in the database. This exception will break the persistence operation, and you will not know why it is complaining about referential integrity because, as far as you can see, you have a principal entity. (It is possible that this insert will not even happen and EF fires another exception due to inconsistency of the context's inner state, but that depends on EF's internal implementation.)
This immediately makes whole relaxing of conflict resolution much more complex feature. There are IMHO three ways to solve it:
Simply not support it. If you need conflict resolution per entity basis you can still use the example I showed above but for complex scenarios it may not work because complex scenarios are hard to solve.
Rebuild the database change set each time a conflict occurs - this means exploring the remaining change set and excluding all entities related to the conflicting entity, their relations and so on from the processed persistence. There is a problem: EF cannot exclude any changed entity from processing. That would break the meaning of unit of work, and I repeat it once more: relaxing conflict resolution can also break the meaning of unit of work.
Let EF proceed with dependencies even if the principal entity conflicted. This requires handling the database exception and understanding its content to know whether the exception was fired due to the conflicting principal or due to another error (which should fail the whole persistence operation immediately). It can be quite difficult to understand database exceptions at the code level, and moreover it is provider specific for every supported database.
It doesn't mean it may not be possible to make such functionality but it will need to cover all scenarios when it comes to relations and this can be pretty complex. I'm not sure if Linq-to-Sql handles this.
You can always make a suggestion on Data UserVoice or check out the code and try to implement it yourself. Maybe I see this feature as too complicated and it can be implemented easily.
I am experiencing some bizarre problems with Nhibernate within my MVC web application.
There is not 1 consistent error, I keep getting loads of random ones:
Transaction not successfully started
New request is not allowed to start because it should come with valid transaction descriptor
Unexpected row count: -1; expected: 1
To give a little context to the setup, I am using Ninject to DI the sessions and other Nhibernate related objects, currently I am using RequestScope however I have tried SingletonScope. I have a large and complicated data model, which is read out as a whole, but persisted back in separate parts, as these can all be edited and saved individually.
An example would be having a Customer object, which contains a address object, a contact object, friends object, previous orders object etc etc...
So the whole object is read out, then mapped to the UI domain models and then displayed in different partials within the page. Each partial can be updated individually via ajax, so you may update 1 section or you could update them all together. It mainly seems to give me problems when I try to persist them all together (so 2-4 simultaneous ajax requests to persist chunks of the model).
Now I have integration tests that work fine, which just test the persistence and retrieval of entities, as a whole and individually, and all pass. However, in the web app they just seem to keep throwing random exceptions, and originally refused to persist outside of the NHibernate cache. I found a way round this by wrapping most units of work within transactions, which got the data persisting but started adding new errors to the mix.
Originally I was thinking of just scrapping NHibernate from the project, as although I really want its persistence/caching layer, it just didn't seem to be flexible enough for my domain, which seems odd as I have used it before without much problem, although it doesn't like 1-1 mappings.
So has anyone else had flaky transaction/NHibernate issues like this within an ASP MVC app... I know this may be a bit vague as the errors don't point to one thing, and it doesn't always error, so it's like stabbing in the dark, but I am out of ideas so any help would be great!
-- Update --
I cannot post all relevant code as the project is huge, but the transaction bit looks like:
using (var transaction = sessionManager.Session.BeginTransaction(IsolationLevel.ReadUncommitted))
{
    try
    {
        // Do unit of work
        transaction.Commit();
    }
    catch (Exception)
    {
        transaction.Rollback();
        throw;
    }
}
Some of the main problems I have had on this project have stemmed from:
There are some 1-1 relationships with composite keys, but logically it makes sense
The NHibernate domain entities go through a mapping layer to become the UI domain entities, then vice versa when saving. The problem here is that with the 1-1 mappings, when persisting the example Address I have to make a surrogate Customer object with the correct Id and then merge.
There is a lot of Ajax that deals with chunks of the overall model (I talk like there is one single model, but there are quite a few top level models, just one that is most important)
Some notes that may help. I use windsor but imagine the concepts are the same. Sounds like there may be a combination of things.
SessionFactory should be created as singleton and session should be per web request. Something like:
Bind<ISessionFactory>()
.ToProvider<SessionFactoryBuilder>()
.InSingletonScope();
Bind<ISession>()
.ToMethod( context => context.Kernel.Get<ISessionFactory>().OpenSession() )
.InRequestScope();
Be careful of keeping transactions open for too long, keep them as short lived as possible to avoid deadlocks.
Check your queries are running as expected by using a tool like NHProf. Often people load up too much of the graph, which impacts performance and can create deadlocks.
Check your mappings for things like Not.LazyLoad() and see if you actually need the additional data in the queries, and keep the results returned to a minimum. Check your queries' execution plans and ensure adequate indexes are in place.
I have had issues with mvc3 action filters being cached, which meant transactions were not always started, but would attempt to be closed causing issues. Moved all my transaction commits into ActionResults in the controllers to keep transaction as short as possible and close to the action.
Check your cascades in your mappings and keep the updates to a minimum.
I'm currently in the middle of a reasonably large question / answer based application (kind of like stackoverflow / answerbag.com)
We're using SQL (Azure) and nHibernate for data access and MVC for the UI app.
So far, the schema is roughly along the lines of the stackoverflow db in the sense that we have a single Post table (contains both questions / answers)
Probably going to use something along the lines of the following repository interface:
public interface IPostRepository
{
    void PutPost(Post post);
    void PutPosts(IEnumerable<Post> posts);
    void ChangePostStatus(string postID, PostStatus status);
    void DeleteArtefact(string postId, string artefactKey);
    void AddArtefact(string postId, string artefactKey);
    void AddTag(string postId, string tagValue);
    void RemoveTag(string postId, string tagValue);
    void MarkPostAsAccepted(string id);
    void UnmarkPostAsAccepted(string id);
    IQueryable<Post> FindAll();
    IQueryable<Post> FindPostsByStatus(PostStatus postStatus);
    IQueryable<Post> FindPostsByPostType(PostType postType);
    IQueryable<Post> FindPostsByStatusAndPostType(PostStatus postStatus, PostType postType);
    IQueryable<Post> FindPostsByNumberOfReplies(int numberOfReplies);
    IQueryable<Post> FindPostsByTag(string tag);
}
My question is:
Where / how would i fit solr into this for better querying of these "Posts"
(I'll be using solrnet for the actual communication with Solr)
Ideally, I'd be using the SQL db as merely a persistent store-
The bulk of the above IQueryable operations would move into some kind of SolrFinder class (or something like that)
The Body property is the one that causes the problems currently - it's fairly large, and slows down queries on sql.
My main problem is, for example, if someone "updates" a post - adds a new tag, for example, then that whole post will need re-indexing.
Obviously, doing this will require a query like this:
"SELECT * FROM POST WHERE ID = xyz"
This will of course, be very slow.
Solrnet has an nHibernate facility- but i believe this will be the same result as above?
I thought of a way around this, which I'd like your views on:
Adding the ID to a queue (amazon sqs or something - i like the ease of use with this)
Having a service (or bunch of services) somewhere that do the above mentioned query, construct the document, and re-add it to solr.
Another problem I'm having with my design:
Where should the "re-indexing" method(s) be called from?
The MVC controller? or should i have a "PostService" type class, that wraps the instance of IPostRepository?
Any pointers are gratefully received on this one!
On the e-commerce site that I work for, we use Solr to provide fast faceting and searching of the product catalog. (In non-Solr geek terms, this means the "ATI Cards (34), NVIDIA (23), Intel (5)" style of navigation links that you can use to drill-down through product catalogs on sites like Zappos, Amazon, NewEgg, and Lowe's.)
This is because Solr is designed to do this kind of thing fast and well, and trying to do this kind of thing efficiently in a traditional relational database is, well, not going to happen, unless you want to start adding and removing indexes on the fly and go full EAV, which is just cough Magento cough stupid. So our SQL Server database is the "authoritative" data store, and the Solr indexes are read-only "projections" of that data.
You're with me so far because it sounds like you are in a similar situation. The next step is determining whether or not it is OK that the data in the Solr index may be slightly stale. You've probably accepted the fact that it will be somewhat stale, but the next decisions are
How stale is too stale?
When do I value speed or querying features over staleness?
For example, I have what I call the "Worker", which is a Windows service that uses Quartz.NET to execute C# IJob implementations periodically. Every 3 hours, one of these jobs that gets executed is the RefreshSolrIndexesJob, and all that job does is ping an HttpWebRequest over to http://solr.example.com/dataimport?command=full-import. This is because we use Solr's built-in DataImportHandler to actually suck in the data from the SQL database; the job just has to "touch" that URL periodically to make the sync work. Because the DataImportHandler commits the changes periodically, this is all effectively running in the background, transparent to the users of the Web site.
This does mean that information in the product catalog can be up to 3 hours stale. A user might click a link for "Medium In Stock (3)" on the catalog page (since this kind of faceted data is generated by querying SOLR) but then see on the product detail page that no mediums are in stock (since on this page, the quantity information is one of the few things not cached and queried directly against the database). This is annoying, but generally rare in our particular scenario (we are a reasonably small business and not that high traffic), and it will be fixed up in 3 hours anyway when we rebuild the whole index again from scratch, so we have accepted this as a reasonable trade-off.
If you can accept this degree of "staleness", then this background worker process is a good way to go. You could take the "rebuild the whole thing every few hours" approach, or your repository could insert the ID into a table, say, dbo.IdentitiesOfStuffThatNeedsUpdatingInSolr, and then a background process can periodically scan through that table and update only those documents in Solr if rebuilding the entire index from scratch periodically is not reasonable given the size or complexity of your data set.
A third approach is to have your repository spawn a background thread that updates the Solr index in regards to that current document more or less at the same time, so the data is only stale for a few seconds:
class MyRepository
{
    void Save(Post post)
    {
        // the following method runs on the current thread
        SaveThePostInTheSqlDatabaseSynchronously(post);
        // the following method spawns a new thread, task,
        // queueuserworkitem, whatever floats our boat this week,
        // and so returns immediately
        UpdateTheDocumentInTheSolrIndexAsynchronously(post);
    }
}
But if this explodes for some reason, you might miss updates in Solr, so it's still a good idea to have Solr do a periodic "blow it all away and refresh", or have a reaper background Worker-type service that checks for out-of-date data in Solr every once in a blue moon.
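One way to sketch that "save synchronously, index in the background, remember what failed" idea, using Task.Run as the "whatever floats our boat" mechanism. PostSaver, FailedIndexUpdates and the two delegates are hypothetical names; the delegates stand in for the real SQL save and the real Solr update, and the failed-ID queue is my own addition to give a reaper something to work from.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public class PostSaver
{
    private readonly Action<int> _saveToDatabase; // stand-in for the SQL save
    private readonly Action<int> _updateIndex;    // stand-in for the Solr update

    // IDs whose index update failed; a periodic job could retry these.
    public ConcurrentQueue<int> FailedIndexUpdates { get; } = new ConcurrentQueue<int>();

    public PostSaver(Action<int> saveToDatabase, Action<int> updateIndex)
    {
        _saveToDatabase = saveToDatabase;
        _updateIndex = updateIndex;
    }

    // Saves on the calling thread, then updates the index on a pool thread.
    // The returned task completes when the index update has been attempted.
    public Task Save(int postId)
    {
        _saveToDatabase(postId); // exceptions here propagate to the caller

        return Task.Run(() =>
        {
            try
            {
                _updateIndex(postId);
            }
            catch (Exception)
            {
                // Index failures must not fail the save; record them instead.
                FailedIndexUpdates.Enqueue(postId);
            }
        });
    }
}
```

Callers that don't care about the index can ignore the returned task. An in-memory queue is lost on restart, so the periodic full refresh is still worth keeping.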
As for querying this data from Solr, there are a few approaches you could take. One is to hide the fact that Solr exists entirely via the methods of the Repository. I personally don't recommend this because chances are your Solr schema is going to be shamelessly tailored to the UI that will be accessing that data; we've already made the decision to use Solr to provide easy faceting, sorting, and fast display of information, so we might as well use it to its fullest extent. This means making it explicit in code when we mean to access Solr and when we mean to access the up-to-date, non-cached database object.
In my case, I end up using NHibernate to do the CRUD access (loading an ItemGroup, futzing with its pricing rules, and then saving it back), forgoing the repository pattern because I don't typically see its value when NHibernate and its mappings are already abstracting the database. (This is a personal choice.)
But when querying on the data, I know pretty well if I'm using it for catalog-oriented purposes (I care about speed and querying) or for displaying in a table on a back-end administrative application (I care about currency). For querying on the Web site, I have an interface called ICatalogSearchQuery. It has a Search() method that accepts a SearchRequest where I define some parameters--selected facets, search terms, page number, number of items per page, etc.--and gives back a SearchResult--remaining facets, number of results, the results on this page, etc. Pretty boring stuff.
Where it gets interesting is that the implementation of that ICatalogSearchQuery is using a list of ICatalogSearchStrategys underneath. The default strategy, the SolrCatalogSearchStrategy, hits SOLR directly via a plain old-fashioned HttpWebRequest and parsing the XML in the HttpWebResponse (which is much easier to use, IMHO, than some of the SOLR client libraries, though they may have gotten better since I last looked at them over a year ago). If that strategy throws an exception or vomits for some reason, then the DatabaseCatalogSearchStrategy hits the SQL database directly--although it ignores some parameters of the SearchRequest, like faceting or advanced text searching, since that is inefficient to do there and is the whole reason we are using Solr in the first place. The idea is that usually SOLR is answering my search requests quickly in full-featured glory, but if something blows up and SOLR goes down, then the catalog pages of the site can still function in "reduced-functionality mode" by hitting the database with a limited feature set directly. (Since we have made explicit in code that this is a search, that strategy can take some liberties in ignoring some of the search parameters without worrying about affecting clients too severely.)
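Roughly, the fallback chain looks like the sketch below. The interface and class names follow the ones mentioned above, but the members shown (Terms, Source, Total), the CatalogSearchQuery class and the DelegateStrategy helper are simplified guesses for illustration, not the real implementation.

```csharp
using System;
using System.Collections.Generic;

public class SearchRequest { public string Terms = ""; }
public class SearchResult { public string Source = ""; public int Total; }

public interface ICatalogSearchStrategy
{
    SearchResult Search(SearchRequest request);
}

// Small helper so a strategy can be supplied as a lambda (illustration only).
public class DelegateStrategy : ICatalogSearchStrategy
{
    private readonly Func<SearchRequest, SearchResult> _search;
    public DelegateStrategy(Func<SearchRequest, SearchResult> search) { _search = search; }
    public SearchResult Search(SearchRequest request) { return _search(request); }
}

public class CatalogSearchQuery
{
    private readonly IReadOnlyList<ICatalogSearchStrategy> _strategies;

    // Strategies are tried in order, e.g. Solr first, then the database.
    public CatalogSearchQuery(IReadOnlyList<ICatalogSearchStrategy> strategies)
    {
        _strategies = strategies;
    }

    public SearchResult Search(SearchRequest request)
    {
        Exception last = null;
        foreach (var strategy in _strategies)
        {
            try
            {
                return strategy.Search(request);
            }
            catch (Exception ex)
            {
                last = ex; // fall through to the next, more limited strategy
            }
        }
        throw new InvalidOperationException("All catalog search strategies failed.", last);
    }
}
```

In real code you would probably log each failure and catch something narrower than Exception.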
Key takeaway: What is important is that the decision to perform a query against a possibly-stale data store versus the authoritative data store has been made explicit--if I want fast, possibly stale data with advanced search features, I use ICatalogSearchQuery. If I want slow, up-to-date data with the insert/update/delete capability, I use NHibernate's named queries (or a repository in your case). And if I make a change in the SQL database, I know that the out-of-process Worker service will update Solr eventually, making things eventually consistent. (And if something was really important, I could broadcast an event or ping the SOLR store directly, telling it to update, possibly in a background thread if I had to.)
Hope that gives you some insight.
We use solr to query a large product database.
Around 1 million products, and 30 stores.
What we did is we used triggers on the product table and stock tables on our Sql server.
Each time a row is changed, the trigger flags the product to be reindexed. We have a Windows service that grabs these flagged products and posts them to Solr every 10 seconds (with a limit of 100 products per batch).
It's super efficient, almost real time info for the stock.
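The flag-and-drain part can be sketched in memory like this. In our setup the flags live in SQL Server and are set by triggers, so ReindexQueue, Flag and TakeBatch are hypothetical names that only illustrate the de-duplication and the per-batch limit.

```csharp
using System.Collections.Generic;

public class ReindexQueue
{
    private readonly object _gate = new object();
    private readonly HashSet<int> _flagged = new HashSet<int>(); // de-duplicates IDs
    private readonly Queue<int> _order = new Queue<int>();       // preserves flag order

    // Marks a product as needing a reindex; flagging it again while it is
    // still pending has no additional effect.
    public void Flag(int productId)
    {
        lock (_gate)
        {
            if (_flagged.Add(productId))
            {
                _order.Enqueue(productId);
            }
        }
    }

    // Removes and returns up to maxBatchSize pending IDs, oldest first.
    public List<int> TakeBatch(int maxBatchSize)
    {
        var batch = new List<int>();
        lock (_gate)
        {
            while (batch.Count < maxBatchSize && _order.Count > 0)
            {
                int id = _order.Dequeue();
                _flagged.Remove(id);
                batch.Add(id);
            }
        }
        return batch;
    }
}
```

A timer in the service would call TakeBatch(100) every 10 seconds and post the result to Solr; anything left over waits for the next tick.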
If you have a big text field (your 'body' field), then yes, re-index in background. The solutions you mentioned (queue or periodic background service) will do.
MVC controllers should be oblivious of this process.
I noticed you have IQueryables in your repository interface. SolrNet does not currently have a LINQ provider. Anyway, if those operations are all you're going to do with Solr (i.e. no faceting), you might want to consider using Lucene.Net instead, which does have a LINQ provider.