I am currently building a Core Data migration for an app which averages 200k-500k rows of data per entity. There are currently 15 entities within the Core Data model.
This is the 7th migration I have built for this app; all of the previous ones have been simple (add one or two attributes) migrations, which have not been any trouble and have not needed any mapping models.
This Migration
The migration we are working on is fairly sizeable in comparison to previous migrations and adds a new entity between two existing entities. This requires a custom NSEntityMigrationPolicy which we have built to map the new entity relationships. We also have a *.xcmappingmodel, which defines the mapping between model 6 and the new model 7.
We have implemented our own subclass of NSMigrationManager (as per http://www.objc.io/issue-4/core-data-migration.html + http://www.amazon.com/Core-Data-Management-Pragmatic-Programmers/dp/1937785084/ref=dp_ob_image_bk).
The Problem
Apple uses the migrateStoreFromURL method of NSMigrationManager to migrate the store; however, this seems to be built for low-to-medium dataset sizes that do not overload memory.
We are finding that the app crashes due to memory overload (~500-600 MB on iPad Air / iPad 2) because the following Apple method does not release memory frequently enough while transferring the data:
[manager migrateStoreFromURL:sourceStoreURL type:type options:nil withMappingModel:mappingModel toDestinationURL:destinationStoreURL destinationType:type destinationOptions:nil error:error];
Apple's Suggested Solution
Apple suggests that we divide the *.xcmappingmodel up into a series of mapping models, one per entity - https://developer.apple.com/library/ios/documentation/cocoa/conceptual/CoreDataVersioning/Articles/vmCustomizing.html#//apple_ref/doc/uid/TP40004399-CH8-SW2. This would work neatly with the progressivelyMigrateURL methods defined in the above NSMigrationManager subclasses. However, we are not able to use this approach, as a single entity on its own is large enough to cause a memory overload.
My guess would be that we would need to write our own migrateStoreFromURL method, but we would like to keep this as close as possible to what Apple intended. Has anyone done this before and/or have any ideas for how we could achieve this?
The short answer is that heavy migrations are not good for iOS and should be avoided at literally any cost. They were never designed to work on a memory-constrained device.
Having said that, a few questions for you before we discuss a resolution:
Is the data recoverable? Can you download it again or is this user data?
Can you resolve the relationships between the entities without having the old relationship in place? Can it be reconstructed?
I have a few solutions but they are data dependent, hence the questions back to you.
Update 1
The data is not recoverable and cannot be re-downloaded. The data is formed from user activity within the application over a period of time (reaching up to 1 year into the past). The relationships also cannot be reconstructed unless we store them before we lose access to the old relationships.
Ok, what you are describing is the worst case and therefore the hardest case. Fortunately it isn't unsolvable.
First, heavy migration is not going to work. We must write code to solve this issue.
First Option
My preferred solution is to do a lightweight migration that only adds the new relationship between the (now) three entities; it does not remove the old relationship. This lightweight migration will occur in SQLite and will be very quick.
Once that migration has been completed, we then iterate over the objects and set up the new relationship based on the old relationship. This can be done as a background process or it can be done piecemeal as the objects are used, etc. That is a business decision.
Once that conversion has been completed you can then do another migration, if needed, to remove the old relationship. This step is not necessary but it does help to keep the model clean.
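To make the first option concrete, here is a minimal sketch of the background fix-up pass that runs after the lightweight migration. The entity and relationship names (Parent, oldChild, Middle, parent, child) are hypothetical placeholders for your model; the important parts are fetching object IDs only, working inside autorelease pools, and saving/resetting in batches so memory stays flat.

    // Hypothetical names: `Parent` still has its old direct `oldChild`
    // relationship; the new intermediate entity is `Middle` with `parent`
    // and `child` relationships. Adapt to your model.
    - (void)rebuildRelationshipsInContext:(NSManagedObjectContext *)context
    {
        NSError *error = nil;
        NSFetchRequest *request = [NSFetchRequest fetchRequestWithEntityName:@"Parent"];
        request.resultType = NSManagedObjectIDResultType; // fetch IDs only, not full objects

        NSArray *parentIDs = [context executeFetchRequest:request error:&error];
        NSUInteger processed = 0;

        for (NSManagedObjectID *parentID in parentIDs) {
            @autoreleasepool {
                NSManagedObject *parent = [context existingObjectWithID:parentID error:&error];
                NSManagedObject *oldChild = [parent valueForKey:@"oldChild"];
                if (oldChild != nil) {
                    NSManagedObject *middle =
                        [NSEntityDescription insertNewObjectForEntityForName:@"Middle"
                                                      inManagedObjectContext:context];
                    [middle setValue:parent forKey:@"parent"];
                    [middle setValue:oldChild forKey:@"child"];
                }
                if (++processed % 500 == 0) {
                    [context save:&error];
                    [context reset]; // drop materialized objects so memory stays flat
                }
            }
        }
        [context save:&error];
    }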
Second Option
Another option which has value is to export and re-import the data. This has the added value of setting up code to back up the user's data in a format that is readable on other platforms. It is fairly simple to export the data out to JSON and then set up an import routine that pulls the data into the new model along with the new relationship.
The second option has the advantage of being cleaner but requires more code as well as a "pause" in the user's activities. The first option can be done without the user even being aware there is a migration taking place.
If I understand this correctly, you have one entity that is so big that migrating it on its own causes the memory overload. In that case, how about splitting the migration of this one entity into several steps, and therefore migrating only some of its properties per migration pass?
That way you won't need to write your own code but can still benefit from the "standard" code.
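If you go down that road, the multi-pass approach from Apple's documentation looks roughly like the sketch below: one compiled mapping model (.cdm) per pass, applied with the same source and destination models. The mapping model names here are hypothetical, and each pass is assumed to cover a disjoint slice of the data.

    // Hypothetical mapping model names; sourceModel, destinationModel,
    // sourceStoreURL and destinationStoreURL are assumed to be set up already.
    NSArray *mappingNames = @[@"Step1", @"Step2", @"Step3"];
    NSError *error = nil;

    for (NSString *name in mappingNames) {
        @autoreleasepool {
            NSURL *mappingURL = [[NSBundle mainBundle] URLForResource:name withExtension:@"cdm"];
            NSMappingModel *mappingModel = [[NSMappingModel alloc] initWithContentsOfURL:mappingURL];

            NSMigrationManager *manager =
                [[NSMigrationManager alloc] initWithSourceModel:sourceModel
                                               destinationModel:destinationModel];

            BOOL ok = [manager migrateStoreFromURL:sourceStoreURL
                                              type:NSSQLiteStoreType
                                           options:nil
                                  withMappingModel:mappingModel
                                  toDestinationURL:destinationStoreURL
                                   destinationType:NSSQLiteStoreType
                                destinationOptions:nil
                                             error:&error];
            if (!ok) {
                break; // handle the error; a failed pass leaves the destination incomplete
            }
        }
    }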
Related
I'm trying to figure out how to use Core Data in my App. I already have in mind what the object graph would be like at runtime:
An Account object owns a TransactionList object.
A TransactionList object contains all the transactions of the account. Rather than being a flat list, it organizes transactions per day. So it contains a list of DailyTransactions objects sorted by date.
A DailyTransactions contains a list of Transaction objects which occur in a single day.
At first I thought Core Data was an ORM, so I thought I might just need two tables: an Account table and a Transaction table containing all transactions, and I would set up the above object graph (i.e., organizing transactions per day, generating DailyTransactions objects, etc.) in application code at run time.
When I started to learn Core Data, however, I realized Core Data is more of an object graph manager than an ORM. So I'm thinking about using Core Data to implement the above runtime object relationships directly (it's not clear to me what the benefit is, but I believe Core Data must have some features that will be helpful).
So I'm thinking about a data model in Core Data like the following:
Account <--> TransactionList -->> DailyTransactions -->> Transaction
Since I'm still learning Core Data, I'm not able to verify the design yet. I suppose this is the right way to use Core Data. But doesn't this put too many implementation details, instead of raw data, in the persistent store? The issue with saving implementation details, I think, is that they are far more complex than raw data and they may contain duplicate data. To put it another way, what exactly does the "data" in a data model mean: raw data or any useful runtime objects?
An alternative approach is to use Core Data as ORM by defining a data model like:
Account <-->> Transactions
and setting up the runtime object graph using application code. This leads to more complex application code but a simpler database design (I understand the user doesn't need to deal with the database directly when using Core Data, but it's still good to have a simpler system). That said, I suspect this is not the right way to use Core Data.
A more general question: I have done little database programming before, but I had the impression that there was usually a business object layer above the plain data object layer in server-side programming frameworks like J2EE. In those architectures, the objects that encapsulate application business logic are not the same as the objects loaded from the database. It seems that's not the case with Core Data?
Thanks for any explanations or suggestions in advance.
(Note: the example above is a simplification. A transaction like a transfer involves two accounts. I ignore that detail for simplicity.)
Now that I have read more about Core Data, I'll try to answer my own question since no one else did. I hope this may help other people who have the same confusion as I did. Note the answer is based on my current (limited) understanding.
1. Core Data is an object graph manager for data to be persistently stored
There are a lot of articles on the net emphasizing that Core Data manages an object graph and that it's not an ORM or a database. While they might be technically correct, they unfortunately cause confusion for beginners like me. In my opinion, it's equally important to point out that the objects managed by Core Data are not arbitrary runtime objects but ones that are suitable for being saved in a database. By suitable, I mean that all these objects conform to the principles of database schema design.
So, what a proper data model is, is very much a database design question (it's important to point this out because most articles ask their readers to forget about databases).
For example, in the account and transactions example I gave above, I'd like to organize transactions per day at runtime (e.g., putting them in a two-level list, first by date, then by transaction timestamp). But the best practice in database design is to save all transactions in a single table and generate the two-level list at runtime using application code (I believe so).
So the data model in Core Data should be like:
Account <->> Transaction
The remaining question is where I can add the code to generate the runtime structure (e.g., the two-level list) I'd like to have. I think the answer is to extend the Account class.
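As a rough sketch of that idea (assuming NSManagedObject subclasses named Account and Transaction, a to-many transactions relationship, and a timestamp attribute; none of these names come from Core Data itself), the grouping could live in a category on Account:

    // Builds the two-level structure at runtime; nothing extra is persisted.
    @interface Account (DailyGrouping)
    - (NSDictionary *)transactionsByDay; // start-of-day NSDate -> NSArray of Transaction
    @end

    @implementation Account (DailyGrouping)
    - (NSDictionary *)transactionsByDay
    {
        NSCalendar *calendar = [NSCalendar currentCalendar];
        NSMutableDictionary *grouped = [NSMutableDictionary dictionary];

        for (Transaction *transaction in self.transactions) {
            NSDate *day = [calendar startOfDayForDate:transaction.timestamp];
            NSMutableArray *bucket = grouped[day];
            if (bucket == nil) {
                bucket = [NSMutableArray array];
                grouped[day] = bucket;
            }
            [bucket addObject:transaction];
        }
        return grouped;
    }
    @end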
2. Constraints of Core Data
The fact that Core Data is designed to work with database (see 1) explains why it has some constraints on the data model design (i.e., attribute can't be of an arbitrary type, etc.).
While I haven't seen anyone mention this on the net, personally I think a relationship in Core Data is quite limited. It can't be of a custom type (e.g., a class) but has to be a single reference (to-one) or a collection (to-many) at run time. That makes it far less expressive. Note: I guess it's so for some technical reason; I just wish it could be a class and hence more flexible.
For example, in my App I actually have complex logic between an Account and its Transactions and want to encapsulate it in a single class. So I'm thinking of introducing an entity to represent the relationship explicitly:
Account <->> AccountTransactionMap <-> Transaction
I know it's odd to do this in Core Data. I'll see how it works and update the answer when I finish my app. If someone knows a better way to avoid this, please let me know!
3. Benefits of Core Data
If one is writing a simple App (for example, an App where data model changes are driven by the user, hence occur in sequence, and there are no asynchronous data changes from iCloud), I think it's OK to ignore all the discussions about object graph vs. ORM, etc. and just use the basic features of Core Data.
From the documents I have read so far (there are still a lot I haven't finished), the benefits of Core Data include automatic establishment and clean-up of mutual references, relationship property values that stay live and automatically updated, undo support, etc. But if your App is not complex, it might be easier to implement these features using application code.
That said, it's interesting to learn a new technology which has limitations but at the same time can be very powerful in more complex situations. BTW, just curious, are there similar frameworks to Core Data on other platforms (either open source or commercial)? I don't think I have read about similar things before.
I'll leave the question open for other answers and comments :) I'll update my answer when I have more practical experience with Core Data.
I'm in the process of a manual Core Data migration and keep running into Cocoa error 134140: NSMigrationMissingMappingModelError. I've noticed this happens any time I make any change to the model, even something as small as marking a property optional. So far, the only solution I've found when this happens is to delete my mapping model and create a new one. Are there any better, less tedious solutions?
There's a menu option to resolve this. If you update your model anytime after creating your mapping model just do the following:
Select the mapping model.
Choose Editor -> Refresh Data Models.
This happens because:
The migration map identifies the model files by the entity hashes, and
When you change an entity, you change its hash.
When you change the model, the map no longer matches it, and migration fails because no matching map can be found.
The workaround is to not mess with migration until you've nailed down what the new model looks like. Then create the map with the final version of the model. If you can't finalize the new model and need to work on migration, you've already discovered the necessary procedure.
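For what it's worth, you can reproduce the failure in code: Core Data locates the mapping model by comparing entity version hashes, and the same lookup returns nil once the destination model has drifted from the one the map was built against. A small diagnostic sketch (sourceModel and destinationModel are assumed to be loaded already):

    NSMappingModel *mapping =
        [NSMappingModel mappingModelFromBundles:@[[NSBundle mainBundle]]
                                 forSourceModel:sourceModel
                               destinationModel:destinationModel];
    if (mapping == nil) {
        // This is the condition that surfaces as NSMigrationMissingMappingModelError.
        NSLog(@"No mapping model matches the current entity version hashes.");
    }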
Tom is correct but I would take it one further. I would not do a manual/heavy migration, ever. If it cannot be done in a lightweight migration consider doing an export/import. It will be faster and more performant than a heavy migration.
My standard recommendation is to keep your changes small enough so that you can always do a lightweight migration.
Update on Import/Export
Heavyweight migration is a hold-over from OS X where memory was cheap. It should not be used in iOS. So what is the right answer?
My recommendation to people is to handle it on your own. Lightweight migration if at all possible, even if it requires walking through several models to get from A to B. However in your case that does not sound possible.
So the second option is export/import. It is very easy to export Core Data out to JSON. I even did a quick example in a Stack Overflow post about it.
First, you stand up the old model and the current store. This involves finding the right model version and manually loading it using [[NSManagedObjectModel alloc] initWithContentsOfURL:] pointed at the right model version. There are details on how to find the right model version in my book (grin).
Then export the data in the current store out to JSON. That should be fairly quick. However, don't do this in your -applicationDidFinish.. for obvious reasons.
Step two is to load up the new Core Data stack with the "current" model and import that JSON. Since that JSON is in a known format you can import it fairly easily.
This will allow you to control the entire experience and avoid the issues that heavy migration has.
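As a very condensed sketch of the export half (the variable names are placeholders; relationships and non-JSON types such as dates are glossed over, and the import half would simply walk this dictionary and insert objects into the new stack):

    // oldModelURL, oldStoreURL and exportURL are assumed to be known.
    NSError *error = nil;
    NSManagedObjectModel *oldModel =
        [[NSManagedObjectModel alloc] initWithContentsOfURL:oldModelURL];
    NSPersistentStoreCoordinator *psc =
        [[NSPersistentStoreCoordinator alloc] initWithManagedObjectModel:oldModel];
    [psc addPersistentStoreWithType:NSSQLiteStoreType
                      configuration:nil
                                URL:oldStoreURL
                            options:nil
                              error:&error];

    NSManagedObjectContext *context =
        [[NSManagedObjectContext alloc] initWithConcurrencyType:NSPrivateQueueConcurrencyType];
    context.persistentStoreCoordinator = psc;

    NSMutableDictionary *exportedData = [NSMutableDictionary dictionary];
    [context performBlockAndWait:^{
        for (NSEntityDescription *entity in oldModel.entities) {
            NSFetchRequest *request = [[NSFetchRequest alloc] init];
            request.entity = entity;
            request.resultType = NSDictionaryResultType; // plain attribute dictionaries
            NSError *fetchError = nil;
            NSArray *rows = [context executeFetchRequest:request error:&fetchError];
            exportedData[entity.name] = rows ?: @[];
        }
    }];

    // Attribute values must be JSON-representable here; convert dates, data, etc. as needed.
    NSData *json = [NSJSONSerialization dataWithJSONObject:exportedData
                                                   options:0
                                                     error:&error];
    [json writeToURL:exportURL atomically:YES];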
I am new to iOS programming but have done SQL stuff for years. I am trying to use Core Data to build my model. Following the tutorials I have created a schema for my application that involves a number of one-to-many relationships that are not bi-directional.
For example I have a Games entity and a Player entity. A Game includes a collection of Players. Because a Player can be involved in more than one game, an inverse relationship does not make any sense and is not needed.
Yet when I compile my application, I get Consistency Error messages in two forms. One says:
Game.players does not have an inverse; this is an advanced setting.
Really? Is this such an "advanced" capability that it earns a warning message? Should I just ignore this message, or am I actually doing something wrong here that Core Data is not designed to do?
The other is of the form Misconfigured Property and logs the text:
Something.something should have an inverse.
So why would it think that?
I can't find any pattern to why it picks one error message over the other. Any tips for an iOS newb would be appreciated.
This is under Xcode 5.0.2.
Core Data is not a database. This is an important fact to grasp otherwise you will be fighting the framework for a long time.
Core Data is your data model that happens to persist to a database as one of its options. That is not its main function, it is a secondary function.
Core Data requires/recommends that you use inverse relationships so that it can keep referential integrity intact without costly maintenance. For example, if you have a one-way relationship between A and B (expressed A --> B) and you delete a B, Core Data may need to walk the entire A table looking for references to that B so that it can clean them up. This is expensive. If you have a proper bi-directional relationship (A <-> B) then Core Data knows exactly which A objects it needs to touch to keep the referential integrity.
This is just one example.
Bi-directionality is not required but it is recommended highly enough that it really should be considered a requirement. 99.999% of the time you want that bi-directional relationship even if you never use it. Core Data will use it.
Why not just add the inverse relationship? It can be to-many as well, and you may well end up using it - often fetch requests or object graph navigation works faster or better coming from a different end of a relationship.
Core Data prefers you to define relationships in both directions (hence the warnings) and it costs you nothing to do so, so you may as well. Don't fight the framework - Core Data isn't an SQLite "manager", it is an object graph and persistence tool that can use SQLite as a back end.
I have a rather large Core Data-based database schema (~20 entities, over 140 properties) that is undergoing large changes as it migrates from our 1.x code base over to our 2.x code base.
I'm very familiar with performing lightweight migrations, but I'm a bit flummoxed by this particular migration because there are a few entities that used to store related objects as transformable attributes on the entity itself, and now I want to migrate those to actual entities.
This seems like a perfect example of when you should use a heavy migration instead of a lightweight one, but I'm not too happy about that either. I'm not familiar with heavy migrations; one of the entities undergoing this array-to-modeled-relationship conversion accounts for ~90% of the rows in the database; these databases tend to be more than 200 MB in size; and I know a good portion of our customers are using iPad 1s. That, combined with the repeated warnings in Apple's documentation and Marcus Zarra's (excellent) Core Data book regarding the speed and memory usage of a heavy migration, makes me very wary and searching for another way to handle this situation.
WWDC 2010's "Mastering Core Data" session 118 (slides here, requires login; the 9th-to-last slide, titled 'Migration Post-Processing', is what I'm referring to) mentions a way to sort of work around this: performing the migration, then using the store metadata to flag whether or not the custom post-processing you want to perform has been completed. I'm thinking this might be the way to go, but it feels a bit hacky (for lack of a better word) to me. Also, I'm worried about leaving attributes hanging around that are, in practice, deprecated. For example, if I move entity foo's barArray attribute into a relationship between entity foo and entity bar, and I nil out barArray, barArray still exists as an attribute that can be written to and read from. A potential way to solve this would be to signal that these attributes are deprecated by changing their names to have "deprecated" in front, as well as perhaps overriding the accessors to assert if they're used, but with KVO there's no guaranteed compile-time solution that will prevent people from using them, and I loathe leaving 'trap code' around, especially since said 'trap code' will have to stick around as long as I potentially have customers who still need to migrate from 1.0.
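(For the 'trap code' idea, a minimal sketch of a deprecated accessor might look like the following, assuming the attribute has been renamed to deprecated_barArray; as noted, KVC/KVO can still route around it, so it only catches direct accessor use.)

    // In the NSManagedObject subclass for entity foo.
    - (id)deprecated_barArray
    {
        NSAssert(NO, @"barArray has been migrated to the bar relationship; do not read it.");
        return nil;
    }

    - (void)setDeprecated_barArray:(id)value
    {
        NSAssert(NO, @"barArray has been migrated to the bar relationship; do not write it.");
    }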
This turned into more of a brain dump than I intended, so for the sake of clarity, my questions are:
1) Is a heavy migration a particularly poor choice with the constraints I'm working under? (business-critical app, lack of experience with heavy migrations, databases over 200 MB in size, tens of thousands of rows, customers using iPad 1s running iOS 5+)
2) If so, is the migration post-processing technique described in session 118 my best bet?
3) If so, how can I right away/eventually eliminate those 'deprecated' attributes so that they are no longer polluting my code base?
My suggestion is to stay away from heavy migration; full stop. It is too expensive on iOS and will most likely lead to an unacceptable user experience.
In this situation I would do a lazy migration. Create a lightweight migration to a model that includes the new entities for the associated objects.
Then do the migration but don't move any data yet.
Change the accessor for that new relationship so that it first checks the old transformable attribute; if the transformable is populated, it pulls the data out, copies it over to the new relationship, and then nils out the transformable.
Doing this will cause the data to move over as it is being used.
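A rough sketch of such an accessor, with hypothetical names (an entity Container holding the old transformable legacyItems, assumed to be an NSArray of dictionaries, and a new to-many items relationship to an Item entity whose inverse is container), could look like this:

    // In the Container NSManagedObject subclass.
    - (NSSet *)items
    {
        [self willAccessValueForKey:@"items"];

        NSArray *legacy = [self primitiveValueForKey:@"legacyItems"];
        if (legacy != nil) {
            // Clear the transformable first so this block runs only once.
            [self setPrimitiveValue:nil forKey:@"legacyItems"];
            for (NSDictionary *payload in legacy) {
                NSManagedObject *item =
                    [NSEntityDescription insertNewObjectForEntityForName:@"Item"
                                                  inManagedObjectContext:self.managedObjectContext];
                [item setValuesForKeysWithDictionary:payload];
                [item setValue:self forKey:@"container"]; // inverse keeps both sides consistent
            }
        }

        NSSet *result = [self primitiveValueForKey:@"items"];
        [self didAccessValueForKey:@"items"];
        return result;
    }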
Now there are some issues with this design.
If you want to use predicates against those new objects it is going to be ... messy. You will want to do a two-pass fetch, i.e. fetch with a predicate that does not hit the new objects, and then do a second fetch once they are in memory so that the transformable data gets moved over.
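Using the same hypothetical names as above (plus placeholder values cutoffDate and wantedName, and an existing context), the two-pass fetch could be sketched as: fetch broadly on attributes that already exist, then apply the real predicate in memory once the accessor has pulled the transformable data across.

    // Pass 1: a predicate that touches only pre-existing attributes.
    NSFetchRequest *request = [NSFetchRequest fetchRequestWithEntityName:@"Container"];
    request.predicate = [NSPredicate predicateWithFormat:@"createdAt >= %@", cutoffDate];
    NSError *error = nil;
    NSArray *candidates = [context executeFetchRequest:request error:&error];

    // Pass 2: filtering in memory walks the `items` accessor, which lazily
    // migrates the transformable data before the predicate is evaluated.
    NSPredicate *itemPredicate =
        [NSPredicate predicateWithFormat:@"ANY items.name == %@", wantedName];
    NSArray *matches = [candidates filteredArrayUsingPredicate:itemPredicate];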
We have a rather large set of related tables with over 35 million related records each. I need to create a couple of WCF methods that would query the database with some parameters (data ranges, type codes, etc.) and return related results sets (from 10 to 10,000 records).
The company is standardized on EF 4.0 but is open to 4.X. I might be able to make an argument to migrate to 5.0, but it's less likely.
What’s the best approach to deal with such a large number of records using Entity Framework? Should I create a set of stored procedures and call them from Entity Framework, or is there something I can do within Entity Framework itself?
I do not have any control over the databases so I cannot split the tables or create some materialized views or partitioned tables.
Any input/idea/suggestion is greatly appreciated.
At my work I faced a similar situation. We had a database with many tables and most of them contained around 7-10 million records each. We used Entity Framework to display the data, but the page took very long to display (around 90 to 100 seconds); even sorting the grid took time. I was given the task of seeing whether it could be optimized, and after profiling it (with ANTS Profiler) I was able to get it under 7 seconds.
So the answer is yes, Entity Framework can handle loads of records (in the millions), but some care must be taken:
Understand that the call to the database is made only when the actual records are required; all the operations before that just build up the query (SQL). So try to fetch only the piece of data you need rather than requesting a large number of records, and trim the fetch size as much as possible.
Not just should: you must use stored procedures. Import them into your model and create function imports for them. You can also call them directly with ExecuteStoreCommand() and ExecuteStoreQuery<>(). The same goes for functions and views, but EF has a really odd way of calling functions: "SELECT dbo.blah(@id)".
EF performs more slowly when it has to populate an entity with a deep hierarchy, so be extremely careful with entities that have deep hierarchies.
Sometimes, when you are retrieving records and do not need to modify them, you should tell EF not to track property changes (AutoDetectChanges); that way record retrieval is much faster.
Database indexing is always good, but with EF it becomes very important: the columns you use for retrieval and sorting should be properly indexed.
When your model is large, the VS2010/VS2012 model designer gets really unwieldy, so break your model into medium-sized models. One limitation is that entities from different models cannot be shared, even though they may point to the same table in the database.
When you have to make changes to the same entity in different places, try to pass the same entity around and send the changes only once, rather than having each place fetch a fresh copy, make its changes, and store it (a real performance-gain tip).
When you need the info in only one or two columns, try not to fetch the full entity; you can either execute your SQL directly or use a mini-entity of some sort. You may also need to cache some frequently used data in your application.
Transactions are slow, so be careful with them.
If you keep these things in mind, EF should give you performance close to, if not the same as, plain ADO.NET.
My experience with EF 4.1, code first: if you only need to read the records (i.e. you won't write them back) you will gain a performance boost by turning off change tracking for your context:
yourDbContext.Configuration.AutoDetectChangesEnabled = false;
Do this before loading any entities. If you need to update the loaded records you can always call
yourDbContext.ChangeTracker.DetectChanges();
before calling SaveChanges().
The moment I hear statements like "The company is standardized on EF4 or EF5, or whatever", cold shivers run down my spine.
It is the equivalent of a car rental saying "We have standardized on a single car model for our entire fleet".
Or a carpenter saying "I have standardized on chisels as my entire toolkit. I will not have saws, drills etc..."
There is something called the right tool for the right job.
This statement only highlights that the person in charge of making key software architecture decisions has no clue about software architecture.
If you are dealing with over 100K records and the data models are complex (i.e. non-trivial), maybe EF6 is not the best option.
EF6 is based on the concepts of dynamic reflection and has similar design patterns to Castle Project Active Record.
Do you need to load all 100K records into memory and perform operations on them? If yes, ask yourself whether you really need to do that, and whether executing a stored procedure across the 100K records would achieve the same thing. Do some analysis and see what the actual data usage pattern is. Maybe the user performs a search which returns 100K records but only navigates through the first 200. Take a Google search as an example: hardly anyone goes past page 3 of the millions of search results.
If the answer is still yes, and you really do need to load all 100K records into memory and perform operations on them, then maybe you need to consider something else, like a custom-built write-through cache with lightweight objects, perhaps with lazily loaded dynamic object pointers for nested objects, etc. One case where I use something like this is large product catalogs for eCommerce sites, where very large numbers of searches get executed against the catalog. The reason is to provide custom behavior such as early-exit search, regex wildcard search using pre-compiled regexes, or custom hashtable indexes into the product catalog.
There is no one-size-fits-all answer to this question. It all depends on the data usage scenarios and how the application works with the data. Consider gorilla vs. shark: who would win? It all depends on the environment and the context.
Maybe EF6 is perfect for one piece that would benefit from dynamic reflection, while NetTiers is better for another that needs static reflection and an extensible ORM, and low-level ADO is perhaps best for extremely high-performance pieces.