I have a rather large Core Data-based database schema (~20 entities, over 140 properties) that is undergoing large changes as it migrates from our 1.x code base over to our 2.x code base.
I'm very familiar with performing lightweight migrations, but I'm a bit flummoxed by this particular migration because there are a few entities that used to store related objects as transformable attributes on the entity itself, and I now want to migrate those into actual entities.
This seems like a perfect example of when you should use a heavy migration instead of a lightweight one, but I'm not too happy about that either. I'm not familiar with heavy migrations; one of the entities undergoing this array -> modeled relationship conversion accounts for ~90% of the rows in the database; these databases tend to be more than 200 MB in size; and I know a good portion of our customers are using iPad 1s. That, combined with the repeated warnings in Apple's documentation and Marcus Zarra's (excellent) Core Data book regarding the speed and memory usage of a heavy migration, makes me very wary and has me searching for another way to handle this situation.
WWDC 2010's "Mastering Core Data" session 118 (slides here, requires login, the 9th to last slide, with the title 'Migration Post-Processing' is what I'm referring to) mentions a way to sort of work around this – performing the migration, then using the store metadata to flag whether or not custom post processing you want to perform has been completed. I'm thinking this might be the way to go, but it feels a bit hacky (for lack of a better word) to me. Also, I'm worried about leaving attributes hanging around that are in practice, deprecated. ex. if I move entity foo's barArray attribute into a relationship between entity foo, and entity bar, and I nil out barArray, barArray still exists as an attribute that can be written to and read from. A potential way to solve this would be to signal that these attributes are deprecated by changing their names to have "deprecated" in front, as well as perhaps overriding the accessors to assert if they're used, but with KVO, there's no guaranteed compile-time solution that will prevent people from using them, and I loathe leaving 'trap code' around, especially since said 'trap code' will have to be around as long as I potentially have customers who still need to migrate from 1.0.
This turned into more of a brain dump than I intended, so for sake of clarity, my questions are:
1) Is a heavy migration a particularly poor choice with the constraints I'm working under? (business-critical app, lack of experience with heavy migrations, databases over 200 MB in size, tens of thousands of rows, customers using iPad 1s running iOS 5+)
2) If so, is the migration post-processing technique described in session 118 my best bet?
3) If so, how can I eventually (or, better, immediately) eliminate those 'deprecated' attributes so that they are no longer polluting my code base?
My suggestion is to stay away from heavy migration, full stop. It is too expensive on iOS and will most likely lead to an unacceptable user experience.
In this situation I would do a lazy migration. Create a new model version that adds the entities and relationships for those associated objects, set up so it still qualifies as a lightweight migration.
Then do the migration but don't move any data yet.
Change the accessor for that new relationship so that it first checks the old transformable attribute; if the transformable is populated, it pulls the data out, copies it over to the new relationship, and then nils out the transformable.
Doing this will cause the data to move over as it is being used.
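A rough sketch of that accessor idea, using the foo/bar names from the question; here it is written as a small helper the app calls, and the attribute and relationship names (barArray, bars, the inverse foo) are assumptions about the model:

import CoreData

// Sketch only: assumes the new model keeps the legacy transformable attribute
// `barArray` on Foo and adds a to-many `bars` relationship whose inverse is
// `foo`. The payload type ([String]) is a stand-in for whatever was stored.
extension Foo {
    func barsMigratingIfNeeded() -> Set<Bar> {
        if let legacy = barArray, let context = managedObjectContext {
            for name in legacy {
                let bar = NSEntityDescription.insertNewObject(forEntityName: "Bar",
                                                              into: context) as! Bar
                bar.name = name   // map whatever the transformable actually held
                bar.foo = self    // setting the inverse populates self.bars
            }
            barArray = nil        // nil out the legacy attribute so this runs once
        }
        return bars as? Set<Bar> ?? []
    }
}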
Now there are some issues with this design.
If you want to use predicates against those new objects it is going to be ... messy. You will want to do a two-pass fetch: first fetch with a predicate that does not touch the new objects, then do a second pass once they are in memory so that the transformable data gets moved over.
Related
I'm trying to figure out how to use Core Data in my App. I already have in mind what the object graph would be like at runtime:
An Account object owns a TransactionList object.
A TransactionList object contains all the transactions of the account. Rather than being a flat list, it organizes transactions per day. So it contains a list of DailyTransactions objects sorted by date.
A DailyTransactions object contains a list of the Transaction objects that occur in a single day.
At first I thought Core Data was an ORM, so I thought I might just need two tables: an Account table and a Transaction table containing all transactions, and I would set up the above object graph (i.e., organizing transactions per date and generating DailyTransactions objects, etc.) in application code at run time.
When I started to learn Core Data, however, I realized Core Data was more of an object graph manager than an ORM. So I'm thinking about using Core Data to implement the above runtime object relationships directly (it's not clear to me what the benefit is, but I believe Core Data must have some features that will be helpful).
So I'm thinking about a data model in Core Data like the following:
Account <--> TransactionList -->> DailyTransactions -->> Transaction
Since I'm still learning Core Data, I'm not able to verify the design yet. I suppose this is the right way to use Core Data, but doesn't this put too many implementation details, instead of raw data, in the persistent store? The issue with saving implementation details, I think, is that they are far more complex than raw data and they may contain duplicate data. To put it another way, what exactly does the "data" in "data model" mean: raw data, or any useful runtime objects?
An alternative approach is to use Core Data as an ORM by defining a data model like:
Account <-->> Transactions
and setting up the runtime object graph in application code. This leads to more complex application code but a simpler database design (I understand the user doesn't need to deal with the database directly when using Core Data, but it's still good to have a simpler system). That said, I suspect this is not the right way to use Core Data.
A more general question: I have done little database programming before, but I had the impression that there is usually a business-object layer above the plain data-object layer in server-side frameworks like J2EE. In those architectures, the objects that encapsulate application business logic are not the same as the objects loaded from the database. It seems that's not the case with Core Data?
Thanks for any explanations or suggestions in advance.
(Note: the example above is a simplification. A transaction like a transfer involves two accounts; I ignore that detail for simplicity.)
Now that I have read more about Core Data, I'll try to answer my own question since no one else has. I hope this may help other people who have the same confusion I did. Note that the answer is based on my current (limited) understanding.
1. Core Data is an object graph manager for data to be persistently stored
There are a lot of articles on the net emphasizing that Core Data manages an object graph and is not an ORM or a database. While they may be technically correct, they unfortunately cause confusion for beginners like me. In my opinion, it's equally important to point out that the objects managed by Core Data are not arbitrary runtime objects but objects that are suitable for being saved in a database. By suitable I mean that these objects conform to the principles of database schema design.
So what a proper data model is, is very much a database design question (it's important to point this out because most articles ask their readers to forget about the database).
For example, in the account and transactions example I gave above, I'd like to organize transactions per day at runtime (e.g., putting them in a two-level list, first by date, then by transaction timestamp). But the best practice in database design is to save all transactions in a single table and generate the two-level list at runtime in application code (I believe so).
So the data model in Core Data should be like:
Account <->> Transaction
The remaining question is where to add the code that generates the runtime structure (e.g., the two-level list) I'd like to have. I think the answer is to extend the Account class, roughly as sketched below.
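Something like this is what I have in mind (a sketch, assuming generated Account/Transaction classes with a to-many transactions relationship on Account and a non-optional timestamp attribute on Transaction):

import CoreData

// Sketch of building the two-level list at runtime instead of persisting it.
// Assumes @NSManaged var transactions: NSSet? on Account and a non-optional
// timestamp: Date on Transaction.
extension Account {
    func transactionsByDay(calendar: Calendar = .current) -> [(day: Date, transactions: [Transaction])] {
        let all = transactions as? Set<Transaction> ?? []
        let grouped = Dictionary(grouping: all) { calendar.startOfDay(for: $0.timestamp) }
        return grouped
            .map { entry in (day: entry.key, transactions: entry.value.sorted { $0.timestamp < $1.timestamp }) }
            .sorted { $0.day < $1.day }
    }
}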
2. Constraints of Core Data
The fact that Core Data is designed to work with a database (see 1) explains why it puts some constraints on the data model design (e.g., an attribute can't be of an arbitrary type).
While I haven't seen anyone mention this on the net, I personally think relationships in Core Data are quite limited. A relationship can't be of a custom type (e.g., a class); it has to be a single reference (to-one) or a collection (to-many) at run time. That makes it far less expressive. Note: I guess this is so for some technical reason; I just wish a relationship could be a class and hence more flexible.
For example, in my app I actually have complex logic between an Account and its Transactions and want to encapsulate it in a single class. So I'm thinking of introducing an entity to represent the relationship explicitly:
Account <->> AccountTransactionMap <-> Transaction
I know it's odd to do this in Core Data. I'll see how it works and update the answer when I finish my app. If someone knows a better way to avoid this, please let me know!
3. Benefits of Core Data
If one is writing a simple app (for example, an app where data model changes are driven by the user and hence occur in sequence, with no asynchronous data changes from iCloud), I think it's OK to ignore all the discussion about object graph vs. ORM, etc. and just use the basic features of Core Data.
From the documents I have read so far (there are still a lot I haven't finished), the benefits of Core Data include automatic establishment and cleanup of mutual (inverse) references, relationship property values that stay live and automatically updated, undo, etc. But if your app is not complex, it might be easier to implement these features in application code.
That said, it's interesting to learn a new technology that has limitations but at the same time can be very powerful in more complex situations. BTW, just curious: is there a similar framework to Core Data on other platforms (either open source or commercial)? I don't think I've read about anything similar before.
I'll leave the question open for other answers and comments :) I'll update my answer when I have more practical experience with Core Data.
I realize this may be common sense for a lot of people, so apologies if this seems like a stupid question.
I am trying to learn Core Data for iOS programming, and I have repeatedly read and heard it said that Core Data (CD) is not a relational database. But very little else is said about this, or why exactly it is important to know beyond an academic sense. I mean, functionally at least, it seems you can use CD as though it were a database for most things - storing and fetching data, running queries, etc. From my very rudimentary understanding of it, I don't really see how it differs from a database.
I am not questioning the fact that the distinction is important. I believe that a lot of smart people would not be wasting their time on this point if it weren't useful to understand. But I would like someone to please explain - ideally with examples - how CD not being a relational database affects how we use it. Or perhaps: if I were never told that CD isn't a relational database, how would that adversely impact my performance as an Objective-C/Swift programmer?
Are there things that one might try to do incorrectly if they treated CD as a relational database? Or, are there things which a relational database cannot do or does less well that CD is designed to do?
Thank you all for your collective wisdom.
People stress the "not a relational database" angle because people with some database experience are prone to specific errors with Core Data that result from trying to apply their experience too directly. Some examples:
Creating entities that are essentially SQL junction tables. This is almost never necessary and usually makes things more complex and error prone. Core Data supports many-to-many relationships directly.
Creating a unique ID field in an entity because they think they need one to ensure uniqueness and to create relationships. Sometimes creating custom unique IDs is useful, usually not.
Setting up relationships between objects based on these unique IDs instead of using Core Data relationships-- i.e. saving the unique ID of a related object instead of using ObjC/Swift semantics to relate the objects.
Core Data can and often does serve as a database, but thinking of it in terms of other relational databases is a good way to screw it up.
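To make the third point concrete, here is a small sketch with made-up Customer and Order entities: the relational habit is to store a foreign-key-style ID, while the Core Data way is to set the relationship and let the framework maintain the inverse.

import CoreData

// Sketch with assumed Customer and Order NSManagedObject subclasses whose
// model defines the inverse pair customer.orders <-->> order.customer.
func addOrder(to customer: Customer, in context: NSManagedObjectContext) -> Order {
    let order = NSEntityDescription.insertNewObject(forEntityName: "Order",
                                                    into: context) as! Order
    // Relational habit (not needed): order.customerID = customer.customerID
    order.customer = customer   // Core Data keeps customer.orders in sync
    return order
}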
Core Data is a technology with many powerful features and tools such as:
Change tracking (undo/redo)
Faulting (not having to load entire objects which can save memory)
Persistence
The list goes on..
The persistence part of Core Data is typically backed by SQLite, which is a relational database.
One of the reasons I think people stress that Core Data is not a relational database is because it is so much more than just persistence, and it can be taken advantage of without using persistence at all.
By treating Core Data as a relational database, I assume you mean that relationships between objects are mapped by ids, i.e. a Customer has a customerId and a product has a productId.
This would certainly be incorrect, because Core Data lets you define powerful relationships between object models that make things easy to manage.
For example, if you want your customer to have multiple products and want to delete them all when you delete the customer, Core Data gives you the ability to do that without having to manage customerIds/productIds and figure out how to format complex SQL queries to match your in-memory model. With Core Data, you simply update your model and save your context; the SQL is done for you under the hood. (In fact you can even turn on debugging to print out the SQL Core Data is performing for you by passing '-com.apple.CoreData.SQLDebug 1' as a launch argument.)
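For instance, assuming the Customer entity's products relationship uses the Cascade delete rule in the model (the names here are made up for the sketch), the whole cleanup is just a delete and a save:

import CoreData

// Sketch: Customer.products is assumed to use the Cascade delete rule.
func deleteCustomer(_ customer: Customer, in context: NSManagedObjectContext) throws {
    context.delete(customer)   // related Product objects go away via the cascade rule
    try context.save()         // Core Data generates the necessary SQL for you
}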
In terms of performance, Core Data does some serious optimizations under the hood that make accessing data much easier without having to dive deep into SQL, concurrency, or validation.
I THINK the point is that it is different from a relational database and that trying to apply relational techniques will lead the developer astray as others have mentioned. It actually operates at a higher level by abstracting the functionality of the relational database out of your code.
A key difference, from a programming standpoint, is that you don't need unique identifiers because Core Data just handles that. If you try to create your own, you will come to see that they are redundant and a whole lot of extra trouble.
From the programmer's perspective, whenever you access an entity "record", you will have a pointer to any relationship -- perhaps a single pointer for a "to-one" relationship, or a set of pointers to the records in a "to-many" relationship. Core Data handles the retrieval of the actual "records" when you use one of the pointers.
Because Core Data efficiently handles faults (where the "record" (object) referenced by a pointer is not in memory) you do not have to concern yourself with their retrieval generally. When they're needed by your program Core Data will make them available.
At the end of the day, it provides similar functionality but under the hood it is different. It does require some different thinking in that ordinary SQL doesn't make sense in the context of Core Data as the SQL (in the case of a sqlite store) is handled for you.
The main adjustments for me in transitioning to Core Data were as noted -- getting rid of the concept of unique identifiers. They're going on behind the scenes but you never have to worry about them and should not try to define your own. The second adjustment for me was that whenever you need an object that is related to yours, you just grab it by using the appropriate pointer(s) in the entity object you already have.
I am currently building a Core Data migration for an app which has on average 200k-500k rows of data per entity. There are currently 15 entities in the Core Data model.
This is the 7th migration I have built for this app. All of the previous ones have been simple (add one or two columns) migrations, which have not been any trouble and have not needed any mapping models.
This Migration
The migration we are working on is fairly sizeable in comparison to previous migrations and adds a new entity between two existing entities. This requires a custom NSEntityMigrationPolicy which we have built to map the new entity relationships. We also have a *.xcmappingmodel, which defines the mapping between model 6 and the new model 7.
We have implemented our own subclass of NSMigrationManager (as per http://www.objc.io/issue-4/core-data-migration.html + http://www.amazon.com/Core-Data-Management-Pragmatic-Programmers/dp/1937785084/ref=dp_ob_image_bk).
The Problem
Apple uses the migrateStoreFromURL method of NSMigrationManager to migrate the store; however, this seems to be built for small-to-medium dataset sizes, which do not overload the memory.
We are finding that the app crashes due to memory overload (at 500-600 MB on iPad Air/iPad 2) because the following Apple method does not regularly release memory during the data transfer.
[manager migrateStoreFromURL:sourceStoreURL type:type options:nil withMappingModel:mappingModel toDestinationURL:destinationStoreURL destinationType:type destinationOptions:nil error:error];
Apple's Suggested Solution
Apple suggests that we should divide the *.xcmappingmodel up into a series of mapping models per individual entity - https://developer.apple.com/library/ios/documentation/cocoa/conceptual/CoreDataVersioning/Articles/vmCustomizing.html#//apple_ref/doc/uid/TP40004399-CH8-SW2. This would work neatly with the progressivelyMigrateURL methods defined in the above NSMigrationManager subclasses. However, we are not able to use this approach, as one entity alone will still lead to a memory overload due to its sheer size.
My guess would be that we need to write our own migrateStoreFromURL method, but we would like to keep this as close to what Apple intended as possible. Has anyone done this before and/or have any ideas for how we could achieve this?
The short answer is that heavy migrations are not good for iOS and should be avoided at literally any cost. They were never designed to work on a memory constrained device.
Having said that, a few questions for you before we discuss a resolution:
Is the data recoverable? Can you download it again or is this user data?
Can you resolve the relationships between the entities without having the old relationship in place? Can it be reconstructed?
I have a few solutions but they are data dependent, hence the questions back to you.
Update 1
The data is not recoverable and cannot be re-downloaded. The data is formed from user activity within the application over a time period (reaching up to 1 year in the past). The relationships are also not reconstructable, unless we store them before we lose access to the old relationships.
Ok, what you are describing is the worst case and therefore the hardest case. Fortunately it isn't unsolvable.
First, heavy migration is not going to work. We must write code to solve this issue.
First Option
My preferred solution is to do a lightweight migration that only adds the new relationship between the (now) three entities; it does not remove the old relationship. This lightweight migration will occur in SQLite and will be very quick.
Once that migration has been completed, we then iterate over the objects and set up the new relationship based on the old relationship. This can be done as a background process, or it can be done piecemeal as the objects are used, etc. That is a business decision.
Once that conversion has been completed, you can then do another migration, if needed, to remove the old relationship. This step is not necessary, but it does help to keep the model clean.
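Roughly, the background conversion pass could look like the sketch below; the entity and relationship names are placeholders, and the real mapping from the old relationship to the new intermediate entity depends on your model.

import CoreData

// Sketch of the background conversion pass. "Parent" and the chunk sizes are
// placeholders; the real conversion work happens where the inner comment sits.
func convertLegacyRelationships(coordinator: NSPersistentStoreCoordinator) {
    let context = NSManagedObjectContext(concurrencyType: .privateQueueConcurrencyType)
    context.persistentStoreCoordinator = coordinator
    context.perform {
        let request = NSFetchRequest<Parent>(entityName: "Parent")
        request.fetchBatchSize = 100
        guard let parents = try? context.fetch(request) else { return }

        var processed = 0
        for parent in parents {
            parent.willAccessValue(forKey: nil)   // fire the fault so the old data is loaded
            // ... create the new objects here and relate them to `parent`
            //     using the data still reachable through the old relationship ...
            processed += 1
            if processed % 500 == 0 {
                try? context.save()            // push changes out in chunks
                context.refreshAllObjects()    // re-fault to keep memory flat
            }
        }
        try? context.save()
    }
}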
Second Option
Another option worth considering is to export and re-import the data. This has the added benefit of setting up code to back up the user's data in a format that is readable on other platforms. It is fairly simple to export the data to JSON and then set up an import routine that pulls the data into the new model along with the new relationship.
The second option has the advantage of being cleaner but requires more code as well as a "pause" in the user's activities. The first option can be done without the user even being aware there is a migration taking place.
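A tiny sketch of the export side is below (the entity name is assumed, and date or binary attributes would need converting to JSON-friendly values first); the import side walks the same JSON and inserts objects into the new model.

import CoreData

// Export sketch: dictionary results keep memory low because no full managed
// objects are materialized. "Transaction" is an assumed entity name.
func exportTransactions(from context: NSManagedObjectContext, to url: URL) throws {
    let request = NSFetchRequest<NSDictionary>(entityName: "Transaction")
    request.resultType = .dictionaryResultType
    let rows = try context.fetch(request)   // attribute values must be JSON-compatible
    let data = try JSONSerialization.data(withJSONObject: rows, options: [.prettyPrinted])
    try data.write(to: url, options: .atomic)
}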
If I understand this correctly, you have one entity that is so big that migrating it on its own causes the memory overload. In that case, how about splitting the migration of this one entity into several steps and migrating only some of its properties in each migration pass?
That way you won't need to write your own code, but you can still benefit from the "standard" code.
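In code, the per-pass loop might look roughly like this sketch: each pass runs the same migrate call with a different, smaller mapping model inside its own autorelease pool (the mapping model names and error handling are placeholders).

import CoreData

// Sketch of migrating in several small passes, one mapping model per pass,
// each inside its own autorelease pool so the migration manager's memory is
// released between passes. Mapping model names are placeholders.
func migrateInPasses(sourceURL: URL, destinationURL: URL,
                     sourceModel: NSManagedObjectModel,
                     destinationModel: NSManagedObjectModel,
                     mappingModelNames: [String]) throws {
    for name in mappingModelNames {
        try autoreleasepool {
            guard let mappingURL = Bundle.main.url(forResource: name, withExtension: "cdm"),
                  let mapping = NSMappingModel(contentsOf: mappingURL) else {
                throw NSError(domain: "Migration", code: 1, userInfo: nil)
            }
            let manager = NSMigrationManager(sourceModel: sourceModel,
                                             destinationModel: destinationModel)
            try manager.migrateStore(from: sourceURL,
                                     sourceType: NSSQLiteStoreType,
                                     options: nil,
                                     with: mapping,
                                     toDestinationURL: destinationURL,
                                     destinationType: NSSQLiteStoreType,
                                     destinationOptions: nil)
        }
    }
}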
I've been developing a very basic Core Data application for over a year now (Toy Collector, http://bit.ly/tocapp), and I'm looking at doing a redesign so that I can build in iCloud support. I figured while I'm doing that, I might as well update my Core Data model (if needed), and I'm having a heck of a time tracking down "best practices" for the following:
Currently, I have 2 entities:
Toy, Keywords
Toy has all the information about the object: Name, Year, Set, imageName, Owned, Wanted, Manufacturer, etc. (18 attributes in all)
Keywords has the normalized words to help speed up the search
My question is whether or not there is any advantage to breaking out some of the Toy attributes into their own entities. For example, I could have a Manufacturer entity that stores the dozen or so manufacturers, instead of keeping that information in the Toy object. My gut tells me this could reduce the memory footprint (instead of 50,000 objects storing a manufacturer string, there would simply be 12 manufacturer strings in an entity with a relationship to the main Toy entity). Does that kind of organization really matter? Am I trying to overcomplicate things? I just feel like my entity has a lot of attributes, and I'm not sure if taking the time to break it apart into multiple entities would make a difference.
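For reference, what I'm imagining is roughly the sketch below: a small Manufacturer entity related to Toy, with a fetch-or-create helper so the dozen manufacturer rows stay unique (the entity and attribute names are just my working assumptions).

import CoreData

// Sketch of the normalized shape: assumes a Manufacturer entity with a `name`
// attribute and a to-many inverse relationship to Toy.
func manufacturer(named name: String, in context: NSManagedObjectContext) -> Manufacturer {
    let request = NSFetchRequest<Manufacturer>(entityName: "Manufacturer")
    request.predicate = NSPredicate(format: "name == %@", name)
    request.fetchLimit = 1
    if let existing = (try? context.fetch(request))?.first {
        return existing
    }
    let created = NSEntityDescription.insertNewObject(forEntityName: "Manufacturer",
                                                      into: context) as! Manufacturer
    created.name = name
    return created
}

// Usage: toy.manufacturer = manufacturer(named: "Some Manufacturer", in: context)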
Any advice or pointers would be appreciated!
Zack
Your question is pretty broad, since it addresses the topic of database design. Let me say upfront that it is almost impossible to give you any sensible suggestions, since I would need to know a lot more about your app, use cases, etc. than is possible through an S.O. question.
Coming to your concrete questions, I would say that you correctly identify one of the advantages of splitting a table into multiple ones; actually, the advantage of doing that is not just reducing the database footprint, but rather keeping data redundancy to a minimum. Redundancy not only affects the memory footprint but also the manageability and modifiability of your data, and redundancy can even cause anomalies or corruption. There is even a whole database theory topic, known as database normalisation, that addresses this kind of concern.
On the other hand, as is always the case, redundancy can help performance, and this is actually the case when you can fetch your data through a simple query instead of multiple queries or table joins. There is a technique for improving database performance known as database denormalisation, which is the exact opposite of normalisation. Your current schema is fully denormalized.
Using Core Data, which is an object graph manager often running on top of SQLite (a relational database manager), you also have to take into account the fact that Core Data will automatically build your object graph and fetch the data into memory when you need it. This means that while you can take a smaller memory footprint on disk for granted, this might not be the case when it comes to the RAM footprint of your query results (at some point Core Data will, so to speak, "explode" your data from multiple tables into one object plus its attributes).
In your specific case, you should also possibly take into account the cost of migrating your existing user base (if the database is not read-only).
All in all, I would say that if your app does not have any database footprint issues at the moment; if you do not feel that creating new tables might be useful, e.g., in order to add new functionality, such as listing all manufacturers; and, finally, if you do not foresee tasks like renaming a manufacturer at some point, then maybe refactoring your database will not add much benefit. But, as I say, without knowing your app in detail and your roadmap for it, it is difficult to say anything really on the spot. In any case, I hope these general considerations will help you reach a decision.
EDIT:
If you want to investigate your Core Data performance and try to understand where the bottlenecks are, give the Core Data template in Instruments a try (Product > Profile menu). There are a lot of things that can go wrong.
On the other hand, it is really hard to help you further without having more details about the type of searches your app allows. One thing that is not clear to me is whether your searches are slow only when they return a lot of results, or whether they are slow even when returning just a few.
Normalizing might help performance if, say after doing a search, you only use one normalized entity (e.g., to display the toy name in a table). In this case all of the attributes referring to other entities would be faults (hence they would not occupy memory or take time to fetch) and this might speed things up. But if you do a search and then display the information from the other tables as well, there might not be any advantage - quite the opposite, since the faults would have to be resolved immediately and this would produce more accesses to the database.
Also, it is true that, depending on how you use it, Core Data might not be the best way to handle your data. Have a look at this post by Brent Simmons relating his experience.
We have a rather large set of related tables with over 35 million related records each. I need to create a couple of WCF methods that would query the database with some parameters (data ranges, type codes, etc.) and return related result sets (from 10 to 10,000 records).
The company is standardized on EF 4.0 but is open to 4.X. I might be able to make an argument to migrate to 5.0, but it's less likely.
What's the best approach to dealing with such a large number of records using Entity Framework? Should I create a set of stored procs and call them from EF, or is there something I can do within EF itself?
I do not have any control over the databases so I cannot split the tables or create some materialized views or partitioned tables.
Any input/idea/suggestion is greatly appreciated.
At my work I faced a similar situation. We had a database with many tables, and most of them contained around 7-10 million records each. We used Entity Framework to display the data, but the page was very slow to display (around 90 to 100 seconds); even sorting the grid took time. I was given the task of seeing whether it could be optimized, and after profiling it (with ANTS Profiler) I was able to optimize it (to under 7 seconds).
So the answer is yes, Entity Framework can handle loads of records (in the millions), but some care must be taken:
Understand that the call to the database is made only when the actual records are required; all the preceding operations just build up the query (SQL). So try to fetch only the piece of data you need rather than requesting a large number of records, and trim the fetch size as much as possible.
Yes, not just should, you must use stored procedures: import them into your model and create function imports for them. You can also call them directly with ExecuteStoreCommand() and ExecuteStoreQuery<>(). The same goes for functions and views, but EF has a really odd way of calling scalar functions ("SELECT dbo.blah(@id)").
EF performs more slowly when it has to populate an entity with a deep hierarchy, so be extremely careful with entities that have deep hierarchies.
Sometimes, when you are requesting records and are not required to modify them, you should tell EF not to watch for property changes (AutoDetectChanges); that way record retrieval is much faster.
Indexing your database is always good, but in the case of EF it becomes very important: the columns you use for retrieval and sorting should be properly indexed.
When your model is large, the VS2010/VS2012 model designer gets really unwieldy, so break your model into medium-sized models. There is a limitation that entities from different models cannot be shared, even though they may be pointing to the same table in the database.
When you have to make changes to the same entity in different places, try to pass the same entity around and send the changes only once, rather than having each place fetch a fresh copy, make changes, and store it (a real performance-gain tip).
When you need the info from only one or two columns, try not to fetch the full entity: you can either execute your SQL directly or use a mini entity/projection. You may also need to cache some frequently used data in your application.
Transactions are slow, so be careful with them.
If you keep these things in mind, EF should give you almost the same performance as plain ADO.NET, if not the same.
My experience with EF 4.1, code first: if you only need to read the records (i.e. you won't write them back), you will gain a performance boost by turning off change tracking for your context:
yourDbContext.Configuration.AutoDetectChangesEnabled = false;
Do this before loading any entities. If you need to update the loaded records you can always call
yourDbContext.ChangeTracker.DetectChanges();
before calling SaveChanges().
The moment I hear statements like "The company is standardized on EF4 or EF5, or whatever," cold shivers run down my spine.
It is the equivalent of a car rental saying "We have standardized on a single car model for our entire fleet".
Or a carpenter saying "I have standardized on chisels as my entire toolkit. I will not have saws, drills etc..."
There is something called the right tool for the right job.
This statement only highlights that the person in charge of making key software architecture decisions has no clue about software architecture.
If you are dealing with over 100K records and the data models are complex (i.e. non-trivial), maybe EF6 is not the best option.
EF6 is based on the concepts of dynamic reflection and has design patterns similar to Castle Project's ActiveRecord.
Do you need to load all of the 100K records into memory and perform operations on them? If yes, ask yourself whether you really need to do that, and why executing a stored procedure across the 100K records wouldn't achieve the same thing. Do some analysis and see what the actual data usage pattern is. Maybe the user performs a search that returns 100K records but only navigates through the first 200. Take a Google search as an example: hardly anyone goes past page 3 of the millions of results.
If the answer is still yes and you do need to load all of the 100K records into memory and perform operations on them, then maybe you need to consider something else, like a custom-built write-through cache with lightweight objects, perhaps with lazily loaded pointers for nested objects, etc. One instance where I use something like this is large product catalogs for e-commerce sites, where very large numbers of searches get executed against the catalog. The reason is to provide custom behavior such as early-exit search, regex wildcard search using pre-compiled regexes, or custom hashtable indexes into the product catalog.
There is no one-size-fits-all answer to this question. It all depends on the data usage scenarios and how the application works with the data. Consider gorilla vs. shark: who would win? It all depends on the environment and the context.
Maybe EF6 is perfect for one piece that would benefit from dynamic reflection, while NetTiers is better for another that needs static reflection and an extensible ORM, and low-level ADO is perhaps best for extremely high-performance pieces.