Core Data Model Design - Attributes vs Entities - iOS

I've been developing a very basic Core Data application for over a year now (Toy Collector, http://bit.ly/tocapp), and I'm looking at doing a redesign so that I can build in iCloud support. I figured while I'm doing that, I might as well update my Core Data model (if needed), and I'm having a heck of a time tracking down "best practices" for the following:
Currently, I have 2 entities:
Toy, Keywords
Toy has all the information about the object: Name, Year, Set, imageName, Owned, Wanted, Manufacturer, etc. (18 attributes in all)
Keywords has the normalized words to help speed up the search
My question is whether or not there is any advantage to breaking out some of the Toy attributes into their own entities. For example, I could have a Manufacturer entity that stores the dozen or so manufacturers, instead of keeping that information in the Toy object. My gut tells me this could reduce the memory footprint (instead of 50,000 objects each storing a manufacturer string, there would simply be 12 manufacturer strings in an entity with a relationship to the main Toy entity). Does that kind of organization really matter? Am I trying to overcomplicate things? I just feel like my entity has a lot of attributes, and I'm not sure if taking the time to break it apart into multiple entities would make a difference.
Any advice or pointers would be appreciated!
Zack

Your question is pretty broad, since it addresses the topic of database design. Let me say upfront that it is almost impossible to give you any sensible suggestions, since I would need to know a lot more about your app, use cases, etc. than is possible through an S.O. question.
Coming to your concrete questions, I would say that you correctly identify one of the advantages of splitting a table into multiple ones; actually, the advantage of doing that is not just reducing the database footprint, but rather keeping data redundancy to a minimum. Redundancy not only affects the memory footprint but also the manageability and modifiability of your data, and redundancy could even cause anomalies or corruption. There is even a whole database theory topic, known as database normalisation, that addresses this kind of concern.
On the other hand, as is always the case, redundancy can help performance, and this is actually the case when you can fetch your data through a simple query instead of multiple queries or table joins. There is a technique for improving database performance known as database denormalisation, which is the exact opposite of normalisation. Your current scheme is fully denormalised.
With Core Data, an object graph manager that often runs on top of SQLite (which is a relational database manager), you also have to take into account the fact that Core Data automatically builds your object graph and fetches the data into memory when you need it. This means that even if a smaller footprint on disk is guaranteed, the same might not be true of the RAM footprint of your query results: at some moment Core Data will, so to say, "explode" your data from multiple tables into one object plus its attributes.
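For concreteness, the normalized model you describe would replace the repeated string attribute with a relationship. A minimal sketch, with hypothetical class and attribute names (the entities themselves would be defined in the model editor):

    import CoreData

    // A dozen Manufacturer rows instead of 50,000 repeated strings.
    final class Manufacturer: NSManagedObject {
        @NSManaged var name: String
        @NSManaged var toys: Set<Toy>               // to-many, inverse of Toy.manufacturer
    }

    final class Toy: NSManagedObject {
        @NSManaged var name: String
        @NSManaged var year: Int16
        @NSManaged var manufacturer: Manufacturer?  // to-one, replaces the string attribute
    }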
In your specific case, you should also possibly take into account the cost of migrating your existing user base (if the database is not read-only).
All in all, I would say that if your app does not have any database footprint issues at the moment; if you do not feel that creating new tables might be useful, e.g., in order to add new functionality, such as listing all manufacturers; and, finally, if you do not foresee tasks like renaming a manufacturer at some point, then maybe refactoring your database will not add much benefit. But, as I say, without knowing your app in detail and your roadmap for it, it is difficult to say anything really on the spot. In any case, I hope these general considerations will help you make a decision.
EDIT:
If you want to investigate your Core Data performance and try to understand where the bottlenecks are, give the Core Data instrument in Instruments a try (Product > Profile menu). There are a lot of things that can go wrong.
On the other hand, it is really hard to help you further without more details about the type of searches your app supports. One thing that is not clear to me is whether your searches are slow only when they return a lot of results, or slow even when returning just a few.
Normalizing might help performance if, say after doing a search, you use just one normalized entity (e.g., to display the toy name in a table). In this case all of the attributes referring to other entities would be faults (hence would occupy neither memory nor fetch time), and this might speed things up. But if you do a search and then display the information from the other tables as well, there might not be any advantage; quite the opposite, since the faults would have to be resolved immediately, and this would produce more accesses to the database.
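As a sketch of the two fetch styles just described, assuming the hypothetical normalized model above:

    import CoreData

    // Search that touches only the Toy entity: the manufacturer relationship
    // comes back as a fault (a cheap placeholder) for each result.
    func searchToys(matching query: String,
                    in context: NSManagedObjectContext) throws -> [Toy] {
        let request = NSFetchRequest<Toy>(entityName: "Toy")
        request.predicate = NSPredicate(format: "name CONTAINS[cd] %@", query)
        return try context.fetch(request)
    }

    // Search whose results will also display manufacturer data: prefetch the
    // relationship so each fault isn't resolved with a separate store trip.
    func searchToysWithManufacturers(matching query: String,
                                     in context: NSManagedObjectContext) throws -> [Toy] {
        let request = NSFetchRequest<Toy>(entityName: "Toy")
        request.predicate = NSPredicate(format: "name CONTAINS[cd] %@", query)
        request.relationshipKeyPathsForPrefetching = ["manufacturer"]
        return try context.fetch(request)
    }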
It is also true that, depending on how you use it, Core Data might not be the best way to handle your data. Have a look at this post by Brent Simmons relating his experience.

Related

EDW Kimball vs Inmon

I've been tasked with coming up with a recommendation for how to proceed with an EDW and am looking for clarification on what I'm seeing. Everything I am learning states that Kimball's approach will bring value to the business quicker than Inmon's. I get that Kimball's approach is a dimensional model from the get-go, and different data marts (star schemas) are integrated through conformed dimensions... thus the theory is I can simply come up with my immediate DM to solve a business need and go on from there.
What I'm learning states that Inmon's model suggests I have an EDW designed in 3NF. The EDW is not defined by source system but instead by the structure of the business, the Corporate Factory (Orders, HR, etc.). So data from disparate systems maps into this structure. Once the data is in this form, ETLs are then created to produce DMs.
Personally I feel Inmon's approach is a better way. I believe this way is going to ensure that data is consistent, and it feels like you can do more with this data. What holds me back with this approach, though, is that everything I'm reading says it's going to take much more time to deliver something, but I'm not seeing how that is true. From my narrow view, it feels like no matter what, the end result is that we need a DM. Regardless of using Kimball's or Inmon's approach, the end result is the same.
So then the question becomes how do we get there? In Kimball's approach we create ETLs to some staging location and generally from there create a DM. In Inmon's approach I feel we just add another layer... that is, from the staging area we load this data into another database in 3NF organized by function. What I'm missing is how this step adds so much time.
I feel I can look at the end DM that needs to be made, map it back to a DW in 3NF, and then as more DMs are requested keep building up the 3NF DW with more and more data. However, if I create a DM in Kimball's model, that DM is built around the level of grain decided for it, and what if the next DM requested wants reporting at an even deeper grain? (To me it feels like Kimball's methodology would take more work.) With Inmon's it doesn't matter: I have everything at the transactional level, so if DMs of varying grains are requested, well, I have the data; just ETL it to a DM, and all DMs will report the same since they are sourced from the same data.
I dunno... just looking for others' views. Everything I read says Kimball's is quicker... I say sure, maybe a little bit, but there is certainly a cost attached to going the quicker route. And for the sake of argument... let's say it takes a week to get a DM up and running through Kimball's methodology... to me it feels like it should only take 10%, maybe 20%, longer using Inmon's.
If anyone has any real world experience with the different models and one really takes so much longer than the other... please share. Or if I have this completely backwards, tell me that too!
For context: I look after a 3-billion-record data warehouse for a large multinational. Our data makes its way from the various source systems through staging and into a 3NF db. From there our ELT processes move the data into a dimensionally modelled, star schema db.
If I could start again I would definitely drop the 3NF step. When I first built that layer I thought it would add real value. I felt sure that normalisation would protect the integrity of my data. I was equally confident the 3NF db would be the best place to run large/complex queries.
But in practice, it has slowed our development. Most changes require an update to the stage, 3NF and star schema db.
The extra layer also increases the amount of time it takes to publish our data. The additional transformations, checks and reconciliations all add up.
The promised improvement in integrity never materialised. I realise now that because I control the ETL, and the validation processes within, I can ensure my data is both denormalised and accurate. In reporting data we control every cell in every table. The more I think about that, the more I see it as a real opportunity.
Large and complex queries were another myth busted by experience. I now see the need to write complex reporting queries as a failing of my star db. When this occurs I always ask myself: why isn't this question easy to answer? The answer is most often bad table design. The heavy lifting is best carried out when transforming the data.
Running a 3NF and star also creates an opportunity for the two systems to disagree. When this happens it is often a very subtle difference. Neither is wrong, per se. Instead, it is possible the 3NF and star query are asking slightly different questions, and therefore returning different results. Although technically correct, this can be hard to explain. Even minor and explainable differences can erode confidence, over time.
In defence of our 3NF db, it does make loading into the star easier. But I would happily trade more complex SSIS packages for one less layer.
Having said all of this, it is very hard to recommend an approach to anyone without a deep understanding of their systems, requirements, culture, skills, etc. Having read your question I am sure you have wrestled with all these issues, and many more no doubt! In the end, only you can decide what the best approach for your situation is. Once you've made your mind up, stick with it. Consistency, clarity and a well-defined methodology are more important than anything else.
Dimensions and measures are a well-proven method for presenting and simplifying data for end users.
If you present end users with a schema based on the source system (3NF) versus a dimensionally modelled star schema (Kimball), they will be able to make much more sense of the dimensionally modelled one.
I've never really looked into an Inmon decision support system, but to me it seems to be just the ODS portion of a full data warehouse.
You are right in saying "The EDW is not defined by source system but instead the structure of the business". A star schema reflects this, but an ODS (a copy of the source system) doesn't.
A star schema takes longer to build than just an ODS but gives many benefits (see the sketch after this list), including
Slowly changing dimensions can track changes over time
Denormalisation simplifies joins and improves performance
Surrogate keys allow you to disconnect from source systems
Conformed dimensions let you report across business units (e.g., profit per headcount)
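The sketch below renders two of those ideas, Type 2 slowly changing dimensions and surrogate keys, as a plain data shape. It is illustrative only (all names invented; in a warehouse this is a table, not application code):

    // A Type 2 slowly changing dimension keeps history by adding rows.
    // The surrogate key identifies a version of the customer; the business
    // key ties versions back to the source system without coupling to it.
    struct CustomerDimRow {
        let customerKey: Int       // surrogate key: one per version
        let customerId: String     // business key from the source system
        let segment: String        // attribute that changes slowly over time
        let validFrom: String      // date this version became current
        let validTo: String?       // nil marks the current version
    }

    let history = [
        CustomerDimRow(customerKey: 1, customerId: "C042", segment: "Retail",
                       validFrom: "2014-01-01", validTo: "2015-06-30"),
        CustomerDimRow(customerKey: 2, customerId: "C042", segment: "Corporate",
                       validFrom: "2015-07-01", validTo: nil),
    ]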
If your Inmon 3NF database is not just an ODS (replica of source systems), but some kind of actual business model then you have two layers to model: the 3NF layer and the star schema layer.
It's difficult nowadays to sell the benefit of even one layer of data modelling when everyone thinks they can just do it all in a 'self-service' tool (which I believe is a fallacy!). Your system should be no more complicated than it needs to be, because all that complexity adds up to maintenance, and that's the real issue: introducing changes 12 months into the build when you have to change many layers.
To paraphrase #destination-data: your source-system-to-star-schema transformation (and separation) is already achieved through ETL, so the 3NF layer seems redundant to me. You design your star schema to be independent of source systems by correctly implementing surrogate keys and business keys, and by modelling it on the business, not on the source system.
With ETL and back-end data wrangling taking up about 70% of the project time for this kind of endeavour, an extra layer makes a big difference. It's an extra layer to transform from source to target, to agree with the business, and to test. It all adds up.
While I'm not saying that dimensional models (the Kimball kind) are always easy to change, you have a whole lot more inflexibility if you always have to change lots of layers when you want to change your BI.
In fact, the places where I've consulted that have data warehouses considered inflexible, expensive to develop for, and unable to keep pace with changes to the business have, without exception, included the 3NF layer prior to the DMs. As Nick mentioned, it is hard nowadays to sell the idea of a 'proper' data warehouse as opposed to a data discovery BI tool, and the appeal of these tools is often driven by DWs being seen as slow and expensive to develop.
Kimball isn't against having a 3NF layer prior to his DW if it makes sense for a given situation; he just doesn't agree with Inmon that there's always a point to it.
One common misunderstanding is that Kimball proposes distinct data marts that you'd have to change each time there is a different reporting request. Instead, Kimball's DMs are based on real-life business processes and modelled accordingly. Although it's true you will then try to make them suitable for reporting, you make them so they can answer foreseeable queries. You don't aggregate and store just the aggregates: you work with the transactional data in a Kimball dimensional model.
So no need to be reluctant from that perspective.
If an ODS works for you, then go for it, but a Kimball DW will meet the majority of requirements.

When not to use Neo4j?

Neo4j is a great tool for mapping relational data, but I am curious under what conditions it would not be a good tool to use.
In which use cases would using neo4j be a bad idea?
You might want to check out this slide deck and in particular slides 18-22.
Your question could have a lot of details to it, but let me try to focus on the big pieces. Graph databases are naturally indexed by relationships, so they will be good when you need to traverse a lot of relationships. Graphs themselves are very flexible, so they'll be good when the interconnections between your data need to change from time to time, or when the data about your core objects that's important to store needs to change. Graphs are a very natural method of modeling some (but not all) data sources: things like peer-to-peer networks, road maps, organizational structures, etc.
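To make "traverse a lot of relationships" concrete, here is a hypothetical friends-of-friends lookup sketched in Swift (illustrative only, not Neo4j syntax). One hop per iteration is a natural walk in a graph store, whereas in SQL each extra hop typically becomes another self-join:

    // Two-hop traversal over an adjacency list: the access pattern that
    // relationship-indexed (graph) storage serves natively.
    func friendsOfFriends(of start: String, in graph: [String: [String]]) -> Set<String> {
        var seen: Set<String> = [start]
        var frontier = [start]
        for _ in 0..<2 {   // two hops: friends, then friends of friends
            frontier = frontier.flatMap { graph[$0] ?? [] }
                               .filter { seen.insert($0).inserted }
        }
        return Set(frontier)
    }

    let graph = ["ann": ["bob"], "bob": ["ann", "cat"], "cat": ["dan"]]
    print(friendsOfFriends(of: "ann", in: graph))   // ["cat"]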
Graphs tend not to be good at managing huge lists of things. For example, if you were going to build a customer transaction database with analytics (where you have 1 million customers and 50 million transactions, and all you do is post transactions all day long), then it's probably not a good fit. An RDBMS is great at that; notice how that use case doesn't really exploit relationships.
Make sure to read those two links I provided, they have much more discussion.
For maintenance reasons, any service aggregating data feeds has until now been well advised to keep its sources independent.
If I want to explore relationships between different feeds, this can be done at the application level, using data that tracks (for example) user preferences across the other feeds.
Graph databases are about managing relationship complexity, but this complexity is in many cases a design choice. Putting all your kids in one bathtub is fine until you drop the soap.

"Core Data is not a relational database." Why exactly is this important to know?

I realize this may be common sense for a lot of people, so apologies if this seems like a stupid question.
I am trying to learn Core Data for iOS programming, and I have repeatedly read and heard it said that Core Data (CD) is not a relational database. But very little else is said about this, or about why exactly it is important to know beyond an academic sense. I mean, functionally at least, it seems you can use CD as though it were a database for most things: storing and fetching data, running queries, etc. From my very rudimentary understanding of it, I don't really see how it differs from a database.
I am not questioning the fact that the distinction is important. I believe that a lot of smart people would not be wasting their time on this point if it weren't useful to understand. But I would like someone to explain, ideally with examples, how CD not being a relational database affects how we use it. Or perhaps, if I were not told that CD isn't a relational database, how would this adversely impact my performance as an Objective-C/Swift programmer?
Are there things that one might try to do incorrectly if they treated CD as a relational database? Or, are there things which a relational database cannot do or does less well that CD is designed to do?
Thank you all for your collective wisdom.
People stress the "not a relational database" angle because people with some database experience are prone to specific errors with Core Data that result from trying to apply their experience too directly. Some examples:
Creating entities that are essentially SQL junction tables. This is almost never necessary and usually makes things more complex and error prone. Core Data supports many-to-many relationships directly.
Creating a unique ID field in an entity because they think they need one to ensure uniqueness and to create relationships. Sometimes creating custom unique IDs is useful, but usually not.
Setting up relationships between objects based on these unique IDs instead of using Core Data relationships, i.e., saving the unique ID of a related object instead of using ObjC/Swift semantics to relate the objects.
Core Data can and often does serve as a database, but thinking of it in terms of other relational databases is a good way to screw it up.
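As a sketch of the first two points, a many-to-many modelled the Core Data way needs no junction entity and no ID columns (entity names here are invented; the relationship is declared in the model, and Core Data maintains the underlying join for you):

    import CoreData

    final class Student: NSManagedObject {
        @NSManaged var name: String
        @NSManaged var courses: Set<Course>    // to-many, inverse of Course.students
    }

    final class Course: NSManagedObject {
        @NSManaged var title: String
        @NSManaged var students: Set<Student>  // to-many, inverse of Student.courses
    }

    // Relating objects is object-to-object, never ID-to-ID:
    //   student.courses.insert(course)   // the inverse updates automatically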
Core Data is a technology with many powerful features and tools such as:
Change tracking (undo/redo)
Faulting (not having to load entire objects which can save memory)
Persistence
The list goes on...
The persistence part of Core Data is backed by SQLite, which is a relational database.
One of the reasons I think people stress that Core Data is not a relational database is that it is so much more than just persistence, and it can be taken advantage of without using persistence at all.
By treating Core Data as a relational database, I assume you mean that relationships between objects are mapped by IDs, i.e., a Customer has a customerId and a Product has a productId.
This would certainly be incorrect, because Core Data lets you define powerful relationships between object models that make things easy to manage.
For example, if you want your customer to have multiple products and want them all deleted when you delete the customer, Core Data gives you the ability to do that without having to manage customerIds/productIds or figure out how to format complex SQL queries to match your in-memory model. With Core Data, you simply update your model and save your context; the SQL is done for you under the hood. (In fact, you can even turn on debugging to print the SQL Core Data performs for you by passing '-com.apple.CoreData.SQLDebug 1' as a launch argument.)
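A minimal sketch of that cascade (class names are invented; the Cascade delete rule itself is configured on the relationship in the model editor, not in code):

    import CoreData

    final class Customer: NSManagedObject {
        @NSManaged var name: String
        @NSManaged var products: Set<Product>   // delete rule: Cascade
    }

    final class Product: NSManagedObject {
        @NSManaged var name: String
        @NSManaged var customer: Customer?
    }

    func delete(_ customer: Customer, in context: NSManagedObjectContext) throws {
        context.delete(customer)   // related Products go too, via the Cascade rule
        try context.save()         // Core Data emits the underlying SQL for you
    }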
In terms of performance, Core Data does some serious optimizations under the hood that make accessing data much easier without having to dive deep into SQL, concurrency, or validation.
I THINK the point is that it is different from a relational database, and that trying to apply relational techniques will lead the developer astray, as others have mentioned. It actually operates at a higher level by abstracting the functionality of the relational database out of your code.
A key difference, from a programming standpoint, is that you don't need unique identifiers, because Core Data just handles that. If you try to create your own, you will come to see that they are redundant and a whole lot of extra trouble.
From the programmer's perspective, whenever you access an entity "record", you will have a pointer for any relationship: perhaps a single pointer for a "to-one" relationship, or a set of pointers to the records in a "to-many" relationship. Core Data handles the retrieval of the actual "records" when you use one of the pointers.
Because Core Data efficiently handles faults (where the "record" (object) referenced by a pointer is not in memory), you generally do not have to concern yourself with their retrieval. When they're needed by your program, Core Data will make them available.
At the end of the day, it provides similar functionality, but under the hood it is different. It does require some different thinking, in that ordinary SQL doesn't make sense in the context of Core Data, as the SQL (in the case of a SQLite store) is handled for you.
The main adjustments for me in transitioning to Core Data were as noted: getting rid of the concept of unique identifiers. They're going on behind the scenes, but you never have to worry about them and should not try to define your own. The second adjustment was that whenever you need an object related to yours, you just grab it by using the appropriate pointer(s) in the entity object you already have.
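In code, that second adjustment is plain property access (reusing the hypothetical Customer/Product classes sketched above); if a related object is a fault, Core Data resolves it transparently on first use:

    // No foreign keys and no lookup by ID: just follow the pointer.
    func productNames(for customer: Customer) -> [String] {
        return customer.products.map { $0.name }   // any faults fire here
    }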

Persisting Game Actor Objects

This question pertains to a game I have been developing, but I believe it is a pretty generic concept for which I have not been able to find a clear answer.
I have been trying to figure out how to serialize actors (objects in a game world) to a file, dynamically and at arbitrary times.
Context
To understand my question you need to know how the world is generally constructed. The game is a cell-based world with 3 dimensions, divided into smaller, more manageable sections that I'll refer to as chunks. The terrain info is all of fixed, known length, and I can serialize that information just fine, simply writing/reading to/from a world file at the appropriate offset whenever a chunk needs to be loaded into memory (say, when a player gets near it). That's all well and good until I have to deal with actors and writing them to a single file.
The Problem
I know that ISerializable is an incredibly useful resource for actually obtaining the data from the actors, but the problem I'm having is committing that data to disk dynamically. By that I mean inserting/removing actors in the middle of a big file containing all actors. It would be a lot easier if I could serialize the entire game state and actor tree, but I need to be able to do this for small sections of the world at a time. Some sections will have no actors, some will have many (up to a couple hundred, say). These sections are loaded and saved as the players move around the world. Furthermore, the number of actors and the size of their data will change over the course of the game, so I cannot handle them like I do the terrain. I need a way of committing the actors quickly, such that I can find them quickly later and am not wasting a lot of file space. One thing that may be of use: all actors in a chunk are serialized/de-serialized at once, never individually.
Note: These worlds can get very large (16k x 16k x 6) and therefore easily have millions of actors in all.
The Question
Is a database really the best way to do this? I am not opposed to implementing one, but that is an involved process and I want to be sure it is a recommended course of action before I continue. It seems like there might be serious performance implications.
A traditional database (RDBMS) is not always the right way to go. But alas, you ARE trying to persist data.
Most IT professionals will likely guide you towards a traditional database, simply because for us it ISN'T involved. It is our bread and butter. Furthermore, there are hundreds of libraries that make our lives easier, the latest generation of which are full-blown ORMs.
However, as you have noted, a full-blown RDBMS is a little heavyweight for your application (depending on your particular scaling needs). So I'll suggest a few alternatives:
MongoDB
RavenDB
CouchDB
Cassandra
Redis
Now, it IS true that in many ways these are much lighter weight than RDBMSs. However, these so-called NoSQL databases (I picked document stores, since they seem to be the closest match to your requirements) are somewhat immature. That is not to say they are buggy or unreliable (they have higher reliability than RDBMSs), but people don't really know how to work with them.
Again, I need to qualify that statement. RDBMSs have several decades of research and best practices behind them. There are vast swathes of plug-ins for the tool chains of each implementation. Every contributor on SO knows how to use a DB well. But none of those things is true of NoSQL.
TLDR
So it really boils down to this. YES, RDBMSs (traditional DBs) are complex, like a modern road car. But like a road car (which you buy), there exists the infrastructure to support them.
The alternative is a NoSQL database, which is like building a small electric scooter. Yes, it's simpler. But take it to a car shop and they'll still have no clue.
Finally
My advice: use an off-the-shelf ORM with an RDBMS. The current generation of ORMs can pretty much hide your database from you. The setup won't be hugely performant (you won't be doing microsecond algo trading with it), but it should be enough for your needs.
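Whichever store you choose, the note that all actors in a chunk are serialized/de-serialized at once suggests a simple shape: one variable-length record per chunk, keyed by chunk coordinates, which sidesteps mid-file inserts entirely. A minimal Swift/Codable sketch of that shape (the original stack is presumably .NET, given ISerializable; the names and the file-per-chunk backing are invented for illustration):

    import Foundation

    struct Actor: Codable {
        var kind: String
        var x: Int, y: Int, z: Int   // position within the chunk
    }

    struct ChunkStore {
        let root: URL   // directory holding one file per chunk

        private func url(_ cx: Int, _ cy: Int, _ cz: Int) -> URL {
            root.appendingPathComponent("chunk_\(cx)_\(cy)_\(cz).actors")
        }

        // Replace the chunk's whole actor list in one atomic write.
        func save(_ actors: [Actor], chunk cx: Int, _ cy: Int, _ cz: Int) throws {
            try JSONEncoder().encode(actors).write(to: url(cx, cy, cz), options: .atomic)
        }

        // Chunks with no actors simply have no file.
        func load(chunk cx: Int, _ cy: Int, _ cz: Int) throws -> [Actor] {
            let file = url(cx, cy, cz)
            guard FileManager.default.fileExists(atPath: file.path) else { return [] }
            return try JSONDecoder().decode([Actor].self, from: Data(contentsOf: file))
        }
    }

Swapping the file-per-chunk backing for a key-value table or a document store keeps the same record shape while adding the durability and tooling discussed above.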

Achieving better DB performance

I have a website backed by a relational database comprised of the usual e-commerce related tables (Order, OrderItem, ShoppingCart, CreditCard, Payment, Customer, Address, etc...).
The stored procedure that returns order history is painfully slow due to the amount of data plus the numerous joins that must occur, and depending on the search parameters it sometimes times out (despite the indexing that is in place).
The DB schema is pretty well normalized, and I believe I can achieve better performance by moving toward something like a data warehouse. DW projects aren't trivial, and then there's the issue of keeping the data in sync, so I was wondering if anyone knows of a shortcut. Perhaps an out-of-the-box solution that will create the DW schema and keep the data in sync (via triggers, perhaps). I've heard of Lucene, but it seems geared more toward text search and document management. Does anyone have other suggestions?
How big is your database?
There aren't really any shortcuts, but dimensional modelling is really NOT that hard. You first determine a grain, then identify your facts and the dimensions associated with those facts. Then you divide the dimensions into tables that allow the dimensions to grow only slowly over time. The choice of dimensions is completely practical and based on the behavior of the data.
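For readers new to the vocabulary, here is the shape that process produces, rendered as Swift types rather than DDL purely for illustration (all names invented). The grain here is one row per order line; measures live on the fact, and descriptive attributes live on dimensions referenced by surrogate keys:

    import Foundation

    // Dimensions: small, descriptive, slowly changing.
    struct DateDim     { let dateKey: Int; let year: Int; let month: Int; let day: Int }
    struct CustomerDim { let customerKey: Int; let name: String; let segment: String }
    struct ProductDim  { let productKey: Int; let name: String; let category: String }

    // Fact: one row per order line (the grain), measures plus dimension keys.
    struct OrderLineFact {
        let dateKey: Int
        let customerKey: Int
        let productKey: Int
        let quantity: Int
        let amount: Decimal
    }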
I recommend you have a look at Kimball's books.
For a database of a few GB, it's certainly possible to repopulate a reporting database from scratch several times a day (no history, just repopulating from the 3NF source into a different model of the same data). There are also real-time data warehousing techniques which apply changes continuously throughout the day.
So while DW projects might not be trivial, the denormalization techniques are very approachable and usable without necessarily building a complete time-invariant data warehouse.
Materialized views are what you might use in Oracle. They give you the "keeping the data in sync" feature you are looking for, combined with fast access to aggregate data. Since you didn't mention any specifics of your platform (server specs, number of rows, number of hits/second, etc.), I can't really help much more than that.
Of course, we are assuming you've already checked that all your SQL is written properly and optimally, that your indexing is correct, that you are properly using caching at all levels of your app, that your DB server has enough RAM, fast hard drives, etc.
Also, have you considered denormalizing your schema just enough to serve your most common queries faster? That's better than implementing an entire data warehouse, which might not even be what you want anyway. Usually a data warehouse is for reporting purposes, not for serving interactive apps.
