Does it make sense to use different CoreData configurations to improve performance/reduce storage even if using just one persistent store?

I am working on a suite of apps that will have a lot of model code in common. I'm using Core Data, so I currently plan on having just one model file for all the different apps, although not every app uses all of the entities defined in the model.
I have read about Core Data configurations, which can be defined in the managed object model to expose only a subset of all entities. I am wondering whether I could also use these to optimize the Core Data usage in my apps.
Consider the following scenario:
I have three apps, App1, App2 and App3.
They have a shared managed object model with the following entities.
A, A1, A2, A3, B, C, D
where A is abstract and A1, A2 and A3 all inherit from A. Each of the A1, A2 and A3 entities has around 10-20 attributes/relationships.
Now
App1 only uses A, A1, B, C, D,
App2 only uses A, A2, B, C, D,
App3 only uses A, A3, B
I have read (can't remember where) that to model sub-entities in SQLite, Core Data just creates a single table for the parent entity that contains all the attributes and relationships of the sub-entities as table columns. Therefore it is often not advisable to create a small parent entity with several large sub-entities, since that leads to a lot of empty columns for each sub-entity (which doesn't need the columns for the attributes of the other sub-entities).
Now, using configurations, I could create three configurations Conf1, Conf2, Conf3 like that:
Conf1 contains entities A, A1, B, C, D,
Conf2 contains entities A, A2, B, C, D,
Conf3 contains entities A, A3, B
Each of the apps would use a single store with the appropriate configuration, so I wouldn't make use of the "store the object automatically in the correct store" advantage that configurations have when used with several stores.
However, my hope is that by adding a store for the specific configuration in each of the apps, the store would ignore the attributes of the non-included entities and thus not create the corresponding table columns. In the case of App3/Conf3 it would even avoid creating tables for entities C and D altogether.
My question is: does it work that way? Would the superfluous columns be left out in persistent stores that use the matching configuration?
And if so: does it actually make a difference in performance or storage requirements (assuming a number of objects large enough that performance optimizations actually start to make sense)?

How Core Data represents sub entities in the SQLite store is an implementation detail that is hidden from you and subject to change. Do not depend on it working one way, because at some point it may work completely differently.
You may be prematurely optimizing. Build it, test it, and if there is a performance issue stemming from how you're using entities address it at that point.
As to your broader question of whether there should be performance advantages in using multiple configurations for a single store: there shouldn't be. If you have one SQLite store and only one configuration, Core Data is not going to make additional optimizations based on that (single) configuration.
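For completeness, attaching a single store to one of the model's configurations is straightforward either way. Here is a minimal sketch for App3/Conf3 using NSPersistentContainer; the model name "SharedModel" and the store file name are assumptions, and "Conf3" is the configuration from the question:
import CoreData
// Load the shared model, but attach the single store using only the "Conf3"
// configuration (entities A, A3, B). Model and file names are hypothetical.
let container = NSPersistentContainer(name: "SharedModel")
let storeURL = NSPersistentContainer.defaultDirectoryURL()
    .appendingPathComponent("App3.sqlite")
let storeDescription = NSPersistentStoreDescription(url: storeURL)
storeDescription.type = NSSQLiteStoreType
storeDescription.configuration = "Conf3"   // configuration defined in the managed object model
container.persistentStoreDescriptions = [storeDescription]
container.loadPersistentStores { _, error in
    if let error = error {
        fatalError("Failed to load store: \(error)")
    }
}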
Much of Core Data's performance comes from your data model design and access patterns. An application that is architected to be aware of Core Data's faulting behavior and that uses a well-thought-out data model will be quite performant. Even with a less-than-optimal data model, Core Data can be very fast if you optimize your round trips to the persistent store (i.e. managing faults, batch faulting when appropriate, implementing a correct find-or-create).
The Incremental Store Programming Guide contains a very good description of how faults are fulfilled. The Core Data Programming Guide has a higher level description of faulting, and discusses batch faulting, prefetching, and the find or create pattern.
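As a concrete illustration of that last point, here is a hedged sketch of the find-or-create pattern in Swift. It uses the question's A1 entity and a hypothetical "identifier" attribute, and does a single fetch for the whole batch instead of one fetch per object:
import CoreData
// Find-or-create sketch: one fetch for the whole batch of incoming identifiers.
// ("identifier" is a hypothetical attribute; adapt the names to the real model.)
func findOrCreate(identifiers: [String], in context: NSManagedObjectContext) throws -> [NSManagedObject] {
    let request = NSFetchRequest<NSManagedObject>(entityName: "A1")
    request.predicate = NSPredicate(format: "identifier IN %@", identifiers)
    let existing = try context.fetch(request)
    // Index the existing objects by identifier for quick lookup.
    var byIdentifier = [String: NSManagedObject]()
    for object in existing {
        if let id = object.value(forKey: "identifier") as? String {
            byIdentifier[id] = object
        }
    }
    // Insert only the objects that were not found, so the whole batch costs one round trip.
    var result = [NSManagedObject]()
    for id in identifiers {
        if let found = byIdentifier[id] {
            result.append(found)
        } else {
            let created = NSEntityDescription.insertNewObject(forEntityName: "A1", into: context)
            created.setValue(id, forKey: "identifier")
            result.append(created)
        }
    }
    return result
}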

Related

Does a data warehouse need to satisfy 2NF or another normal form?

I'm investigating data warehouses and have a question about star schemas.
It's in the Oracle® OLAP Application Developer's Guide, 10g Release 1 (10.1), section 3.2.1 "Dimension Table: TIME_DIM":
https://docs.oracle.com/cd/B13789_01/olap.101/b10333/global.htm#CHDCGABE
To represent the hierarchy MONTH -> QUARTER -> YEAR, we need some keys such as: YEAR_ID, QUARTER_ID. But there are some things that I do not understand:
1) Why do we need the fields YEAR_DSC and QUARTER_DSC? I think we could look up these values from the YEAR and QUARTER tables. And it breaks 2NF.
2) What is the normal form that a schema in a data warehouse needs to satisfy? (1NF, 2NF, 3NF, or any.)
NFs (normal forms) don't matter for data warehouse base tables.
We normalize to reduce certain kinds of redundancy, so that when we update a database we don't have to say the same thing in multiple places, and so that we can't accidentally fail to say the same thing in every place it needs to be said. That is not a problem for query results, because we are not updating them. The same is true for a data warehouse's base tables (which are themselves just queries on the original database's base tables).
Data warehouses are usually optimized for reading speed, and that usually means some denormalization compared to the original database, to avoid recomputation at the expense of space. (Notice, though, that sometimes rereading something bigger can be slower than reading smaller parts and recomputing the big thing.) We probably don't want to drop the normalized tables when moving to a data warehouse, because they answer simple queries and we don't want to slow those down by recomputing them. Other than those trade-offs, there's no reason not to denormalize. Some particular warehouse design methods have their own rules about which parts should be denormalized and by how much.
(Whatever NF our original database design is chosen to satisfy, we should always first normalize to 5NF and then consciously denormalize. We don't need to normalize or know constraints in order to update or query a database.)
Read some textbook basics on why we normalize & why we use data warehouses.

Data Warehouse: when is the cleaning and transforming performed?

I am reading a book "Modeling the agile data warehouse with data vault" by H. Hultgren. He states:
EDW represents what did happen - not what should have happened
When is the cleaning and possible transforming performed? By transforming I mean standardization of the values; for example, a sex column can contain only two possible values, 'f' and 'm', and not 'female' or 'male' or 0 or 1.
If you are importing data through ETL, that is one place to do it. Or you can use some other kind of data cleansing tool. This is a very general question. It depends on the architecture of your data warehouse.
For example, you might have a data warehouse that loads data and tries to automatically clean it, or you might have an architecture where every single 'bad' record goes to an approval area to be cleaned by a person. I can assure you that in the real world, no business user wants to have to pick from 6 values for gender.
The other thing is that you might be loading data from three different systems, and those three different representations are completely valid in each system, but an end user doesn't want to have to pick from 6 choices - they want the data to be cleansed.
I'm thinking maybe this statement
EDW represents what did happen - not what should have happened
is a data vault specific thing, since DV is all about modelling and storing the source system data no matter how the schema changes. I guess in this case you would treat the data vault as an ODS and preserve the data as-is, then cleanse it on the way into the reporting star schema.

Graph Databases vs Triple Stores - when to use which?

I know that there are similar questions around on Stackoverflow but I don't feel they answer the following.
Graph Databases to my understanding store data following mostly this schema:
Table/Collection 1: store nodes with UID
Table/Collection 2: store relations referencing nodes via UID
This allows storing arbitrary types of graphs. Now as I understand triple stores store nothing but triples:
Triple/Collection 1: store triples (2 nodes, 1 relation)
Now I would see the following distinction regarding use cases:
Graph Databases: when you have known, static connections
Triple Stores: when you have loosely connected nodes and are often looking for new connections
I am confused by the fact that people do not seem to be discussing which one to use according to these criteria. Most articles I find are talking about arguments like speed or compatibility. But is this not the most relevant point?
Put the other way round:
Imagine having a clearly connected, user-defined graph. Why on earth would you want to store that as triples only, losing all the info about connections? Or have to implement some custom solution storing IDs in the triple subject.
Imagine having loosely collected nodes that you want to query for unknown relations using SPARQL. Graph databases do support that. But for this, I assume, they have to build another index and would be slower?
EDIT:
I see that "loosing info about connections" is the wrong way to put it. If you do as shown in the accepted answer and insert several triples for 2 nodes + 1 relation then you keep all the info and specifically the info what exact nodes are connected.
The main difference between graph databases and triple stores is how they model the graph. In a triple store (or quad store), the data tends to be very atomic. What I mean is that the "nodes" in the graph tend to be primitive data types like string, integer, date, etc. Relationships link primitives together, and so the "unit of discourse" in a triple store is a triple, and not a node or a relationship, typically.
By contrast, other graph databases are often called "property stores" because nodes are data containers that correspond to objects in a domain. A node stands in for an object, and has properties; they act as rich data types specified by the graph modelers, more than just primitive data types. In these graph databases, nodes and relationships are the "unit of discourse".
Let's say I have a person named "Bob" who knows "Susan". In RDF, it would be something like this:
<http://example.org/person/1> :hasName "Bob".
<http://example.org/person/1> foaf:knows <http://example.org/person/2>.
<http://example.org/person/2> :hasName "Susan".
In a graph database like neo4j, it would be this:
(a:Person {name: "Bob"})-[:KNOWS]->(b:Person {name: "Susan"})
Notice that in RDF, it's 3 relationships but only one of those relationships actually expresses semantics between two entities. The other two relationships are just tracking properties of a single higher-level entity (the person). In neo4j, it's 1 relationship amongst two nodes, with each node having a property. In RDF you'll tend to identify things by URI, in neo4j it's a database object that gets a database ID automatically. That's what I mean about the difference between a more atomic/primitive store (triple stores) and a richer property graph.
RDF and triple stores are mostly built for the kinds of architectural challenges you'd run into with the semantic web. For example, XML namespacing is built in, on the architectural assumption that you'll be mixing and matching the use of many different vocabularies and namespaces. (That right there is a very "semantic web" assumption). So in SPARQL and RDF you'll see typically at least the use of xsd, rdf, and rdfs namespaces concurrently, and probably also owl, skos, and many others. SPARQL and RDF/RDFS also have many hooks and features that are there explicitly to make things like ontology inference easier. You'll tend to identify things with URIs as a way of "namespacing your identifiers" but also because some people may want to de-reference the URI...again the assumption here is a wide data sharing arrangement between many parties.
Property stores, by contrast, are keyed towards different use cases, like flexible modeling of data within one model/namespace, mappings between objects and graphs for persistence of enterprise applications, rapid evolvability, and so on. You'll tend to identify things with your own scheme (or an internal database ID). An auto-incrementing integer may not be the best form of ID for any random consumer on the web (and it certainly can't be de-referenced like a URL), but a URI might not be your first thought for a company-internal application.
So which is better? The more atomic triple store format, or a rich property graph? Do you need to mix and match many different vocabularies in one query or data model? Do you need to create an OWL ontology or do inference? Do you need to serialize a bunch of java objects in memory to a database? Do you need to do fast traversal of long paths? Those types of questions would guide your selection.
Graphs are graphs, both of them do graphs, and so I don't think there's much difference in terms of what they can represent, or how you go about thinking about a problem in "graph terms". The differences boil down to the architecture under the hood, and what sorts of use cases you think you'll need. I won't tell you one is better than the other, but choose wisely.
(in reply to the comments on this answer: https://stackoverflow.com/a/30167732 )
When an owl:inverseOf production rule is defined, the inverse property triple is inferred by the reasoner either when adding or updating the store, or when selecting from the store. This is a "materialized relation".
Schema.org - an RDFS vocabulary - defines, for example, https://schema.org/isPartOf as the inverse property of hasPart. If both are specified, it's not necessary to run another graph pattern query to traverse a directed relation in the other direction; any of these patterns can then be matched:
(:book1 schema:hasPart ?o)
(?o schema:isPartOf :book1)
(?s schema:hasPart :chapter2)
It's certainly possible to use RDFS and OWL to describe schema for and within neo4j property graphs; but there's no reasoner to e.g. infer inverse properties or do schema validation.
Is there any RDF graph that neo4j cannot store? RDF has datatypes and languages for objects: you'd need to reify properties where datatypes and/or languages are specified (and you'd be re-implementing well-defined semantics)
Can every neo4j graph be represented with RDF? Yes.
RDF is a representation for graphs for which there are very many store implementations that are optimized for various use cases like insert and query performance.
Comparing neo4j to a particular triplestore (with reasoning support) might be a more useful comparison given that all neo4j graphs can be expressed as RDF.

NSKeyedArchiver vs Core Data

I am building an app with Objective-C and I would like to persist data. I am hesitating between NSKeyedArchiver and Core Data. I am aware there are plenty of resources about this on the web (including Objective-C best choice for saving data), but I am still doubtful about which one I should use. Here are the two things that make me wonder:
(1) I am assuming I will have around 1000-10000 objects to handle, for a data volume of 1-10 MB. I will do standard database queries on these objects. I would like to be able to load all these objects on launching and to save them from time to time -- a 1-second processing time for loading or saving would be fine by me.
(2) For the moment my model is rather intricate: for instance, classA contains among other properties an array of classB, which is itself formed by (among others) a property of type classC and a property of type classD. And classD itself contains properties of type classE.
Am I right to assume that (1) means that NSKeyedArchiver will still work fine, and that (2) means that using Core Data may not be very simple? I have tried to look for cases where Core Data was used with a complex object graph structure like my case (2), but haven't found that many resources. This is, for the moment, what deters me most from using it.
The two things you identify both make me lean towards using CoreData rather than NSKeyedArchiver:
CoreData is well able to cope with 10,000 objects (if not considerably more), and it can support relatively straight-forward "database-like" queries of the data (sorting with NSSortDescriptors, filtering with NSPredicate). There are limitations on what can be achieved, but worst case you can load all the data into memory - which is what you would have to do with the NSKeyedArchiver solution.
Loading in sub-second times should be achievable (I've just tested with 10,000 objects, totalling 14Mb, in 0.17 secs in the simulator), particularly if you optimise to load only essential data initially, and let CoreData's faulting process bring in the additional data when necessary. Again, this will be better than NSKeyedArchiver.
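For example, a typical "database-like" query is just a fetch request with a predicate and sort descriptors; the entity and attribute names below are made up for illustration:
import CoreData
// Hypothetical fetch: non-archived "Item" objects newer than a cutoff date, newest first.
// fetchBatchSize lets Core Data fault rows in lazily instead of materialising everything at once.
func recentItems(since cutoff: Date, in context: NSManagedObjectContext) throws -> [NSManagedObject] {
    let request = NSFetchRequest<NSManagedObject>(entityName: "Item")
    request.predicate = NSPredicate(format: "archived == NO AND date >= %@", cutoff as NSDate)
    request.sortDescriptors = [NSSortDescriptor(key: "date", ascending: false)]
    request.fetchBatchSize = 50
    return try context.fetch(request)
}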
Although most demos/tutorials opt for relatively straightforward data models (enough to demonstrate attributes and relationships), CoreData can cope with much more sophisticated data models. Below is a mock-up of the relationships that you describe, which took a few minutes to put together:
If you generate subclasses for all those entities, then traversing those relationships is simple (both forwards and backwards - inverse relationships are managed automatically for you). Again, there are limitations (CoreData does the SQL work for you, but in so doing it is less flexible than using a relational database directly).
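To give a feel for that traversal, here is a hedged sketch using key-value coding on plain NSManagedObject; with generated subclasses the same thing becomes ordinary property access. The relationship names ("items", "detail", "subdetail") are assumptions standing in for the classA/classB/classD/classE structure in the question:
import CoreData
// Walk classA -> its classB objects -> classD -> classE via key paths.
// Core Data faults the related objects in lazily as they are touched.
func traverse(_ objectA: NSManagedObject) {
    guard let items = objectA.value(forKey: "items") as? Set<NSManagedObject> else { return }
    for objectB in items {
        let objectE = objectB.value(forKeyPath: "detail.subdetail")
        print(objectE ?? "no value")
    }
}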
Hope that helps.

SQL SELECT with table aliases in Core Data

I have the following SQL query that I want to do using Core Data:
SELECT t1.date, t1.amount + SUM(t2.amount) AS importantvalue
FROM specifictable AS t1, specifictable AS t2
WHERE t1.amount < 0 AND t2.amount < 0 AND t1.date IS NOT NULL AND t2.date IS NULL
GROUP BY t1.date, t1.amount;
Now, it looks like CoreData fetch requests can only fetch from a single entity. Is there a way to do this entire query in a single fetch request?
The best way I know of is to create an abstract parent entity for the entities you wish to fetch together.
So if you have 'Meat', 'Vegetables' and 'Fruits' entities, you can create an abstract parent entity 'Food' and then fetch all the sweet objects from the 'Food' entity.
This way you will get all the sweet 'Meat', 'Vegetables' and 'Fruits'.
Look here:
Entity Inheritance in Apple documentation.
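A hedged sketch of what that fetch might look like; "Food" is the abstract parent entity from this answer and "sweet" is a hypothetical boolean attribute defined on it:
import CoreData
// Fetching against the abstract parent entity returns instances of all of its
// sub-entities (Meat, Vegetables, Fruits) in one result set.
func sweetFood(in context: NSManagedObjectContext) throws -> [NSManagedObject] {
    let request = NSFetchRequest<NSManagedObject>(entityName: "Food")
    request.predicate = NSPredicate(format: "sweet == YES")
    return try context.fetch(request)
}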
Nikolay,
Core Data is not a SQL system. It has a more primitive query language. While this appears to be a deficit, it really isn't. It forces you to bring things into RAM and do your complex calculations there instead of in the DB. The NSSet/NSMutableSet operations are extremely fast and effective. This also results in a faster app. (This is particularly apparent on iOS where the flash is slow and, hence, big fetches are to be preferred.)
In answer to your question, yes, a fetch request operates on a single entity. No, you are not limited to data on that entity. One uses key paths to traverse relationships in the predicate language.
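To make that concrete, here is a hedged Swift sketch of the in-memory approach for a query like the one in the question. The entity name "SpecificTable" and the attribute names mirror the SQL, and the grouping is simplified to one result per dated row:
import CoreData
struct ImportantValue {
    let date: Date
    let value: Double
}
// Fetch the negative-amount rows once, then do the aggregation in RAM,
// roughly mirroring the intent of the SQL in the question.
func importantValues(in context: NSManagedObjectContext) throws -> [ImportantValue] {
    let request = NSFetchRequest<NSManagedObject>(entityName: "SpecificTable")
    request.predicate = NSPredicate(format: "amount < 0")
    let rows = try context.fetch(request)
    // SUM(t2.amount) over the rows whose date is nil.
    let nilDateSum = rows
        .filter { $0.value(forKey: "date") == nil }
        .reduce(0.0) { $0 + (($1.value(forKey: "amount") as? Double) ?? 0) }
    // t1.amount + that sum, for every row that does have a date.
    return rows.compactMap { row -> ImportantValue? in
        guard let date = row.value(forKey: "date") as? Date,
              let amount = row.value(forKey: "amount") as? Double else { return nil }
        return ImportantValue(date: date, value: amount + nilDateSum)
    }
}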
Shannoga's answer is one good way to solve your problem. But I don't know enough about what you are actually trying to accomplish with your data model to judge whether using entity inheritance is the right path for your app. It may not be.
Your SQL schema from a server may not make sense in a CD app. Both the query language and how the data is used in the UI probably force a different structure. (For example, using a fetched results controller on iOS can force you to denormalize your data differently than you would on a server.)
Entity inheritance, like inheritance in OOP, is a stiff technology. It is hard to change. Hence, I use it carefully. When I do use it, I gain performance in some fetches and some simplification in other calculations. At other times, it is the wrong answer, performance wise.
The real answer is a question: what are you really trying to do?
Andrew
