CoreData performance duplicate entities vs properties - ios

I'm making an iOS app with thousands of flash cards with questions. The questions pool has about 10,000 questions, and is divided in 5 categories. One single question can have only one category. Categories won't change, they are fixed. Questions are just text. No images involved.
I was thinking about two approaches:
1) Create an Entity for the question with a category field (int) in it.
Fetch the results to get only the questions of a specific category.
2) Create 5 Entities, all with the same fields, except for the category, which has a default value corresponding with the category.
Why option 2?
I think option 1) is the clean proper one, but the app has so many questions that I'm thinking that submitting a query filtering a specific field, is maybe slower than retrieving a completely different Entity. I'm thinking from an SQL point of view, where maybe performing a SELECT on one table and then another one, should be faster than a SELECT...WHERE on the same table?

I agree with you, option 1 is the clean proper one. Retrieving the category from another entity will add minimal overhead. Moreover, if you have to edit a category you have only one entry to change. You can also add other categories more easily.
If you are really concerned about performance (and I don't think you should at that point) you can code both and do a speed test. But that is really overkill and the difference will probably be insignificant.

Related

Creating a YAML based list vs. a model in Rails

I have an app that consists mainly of restaurant model instances. One of the essential attributes for these restaurants is labeling the cuisine it falls under. I'm currently at odds with myself in regards to designing this. On one hand I thought of creating a Cuisine model and creating either a HMT or HABTM association between Restaurants and Cuisines.
More recently I came across this post which shows how to create a pre-defined set of attributes. To take the answer one step further I'm assuming (in my case) I'd add a string-based cuisine column to my restaurant model and setup a select box in my restaurant form that would save the selected value.
What I was wondering was what would be the most efficient way of doing this? The goal is to eventually be able to query restaurants based what cuisine(s) they fall under. I wasn't sure if a model would be the best choice due to it only serving as a join table in a sense with a name attribute. Wasn't sure if having this extra table for something so minute would be optimal.
On the other hand I didn't know if using YAML for this would be conducive since the values are essentially dummy strings with no tangible records on file like I'd have with a model instance. Can someone help me sort out this confusion?
There are many benefits of normalizing many-to-many relationships in the db. Here are some:
Searching, sorting, and creating indexes is faster, since tables are narrower, and more rows fit on a data page.
You can have more clustered indexes (one per table), so you get more flexibility in tuning queries.
Index searching is often faster, since indexes tend to be narrower and shorter.
More tables allow better use of segments to control physical placement of data.
You usually have fewer indexes per table, so data modification commands are faster.
Fewer null values and less redundant data, making your database more compact.
Triggers execute more quickly if you are not maintaining redundant data.
Data modification anomalies are reduced.
Normalization is conceptually cleaner and easier to maintain and change as your needs change.
Also, by normalizing you get the cleaner syntax and other infrastructure benefits from ActiveRecord, e.g.
cuisine.restaurants.where(city: 'Toledo')

Fact table linked to Slowly Changing Dimension

I'm struggling to understand the best way to model a particular scenario for a data warehouse.
I have a Person dimension, and a Tenancy dimension. A person could be on 0, 1 or (rarely) multiple tenancies at any one time, and will often have a succession of tenancies over time. A tenancy could have one or more people associated with it. The people associated with a tenancy can change over time, and tenancies generally last for many years.
One option is to add tenancy reference, start and end dates to the Person Dimension as type 2 SCD columns. This would work well as long as I ignore the possibility of multiple concurrent tenancies for a person. However, I have other areas of the data warehouse where I am facing a similar design issue and ignoring multiple relationships is not a possibility.
Another option is to model the relationship as an accumulating snapshot fact table. I'm not sure how well this would work in practice though as I could only link it to one version of a Person and Tenancy (both of which will have type 2 SCD columns) and that would seem to make it impossible to produce current or historical reports that link people and tenancies together.
Are there any recommended ways of modelling this type of relationship?
Edit based on the patient answer and comments given by SQL.Injection
I've produced a basic model showing the model as described by SQL.Injection.
I've moved tenancy start/end dates to the 'junk' dimension (Dim.Tenancy) and added Person tenancy start/end dates to the fact table as I felt that was a more accurate way to describe the relationship.
However, now that I see it visually I don't think that this is fundamentally any different from the model that I started with, other than the fact table is a periodic snapshot rather than an accumulating snapshot. It certainly seems to suffer from the same flaw that whenever I update a type 2 slowly changing attribute in any of the dimensions it is not reflected in the fact.
In order to make this work to reflect current changes and also allow historical reporting it seems that I will have to add a row to the fact table every time a SCD2 change occurs on any of the dimensions. Then, in order to prevent over-counting by joining to multiple versions of the same entity I will also need to add new versions of the other related dimensions so that I have new keys to join on.
I need to think about this some more. I'm beginning to think that the database model is right and that it's my understanding of how the model will be used that is wrong.
In the meantime any comments or suggestions are welcome!
Your problem is similar to to the sale transactions with multiple item. The difference, is that a transaction usually has multiple items and your tenancy fact usually has a single person (the tenant).
Your hydra is born because you are trying to model the tenancy as a dimension, when you should be modeling it as a fact.
The reason why I think you have a tenancy dimension, is because somewhere you have a fact rent. To model the fact rent consider use the same approach i stated above, if two persons are tenants of the same property two fact records should be inserted each month:
1) And now comes some magic (that is no magic at all), split the value of the of the rent by the number of tenants and store it the fact
2) store also the full value of the rent (you don't know how the data scientist is going to use the data)
3) check 1) with the business user (i mean people that build the risk models); there might be some advanced rule on how to do the spliting (a similar thing happens when the cost of shipping is to be divided across multiple item lines of the same order -- it might not be uniformly distributed)

How to handle product/subproduct in Rails 4

I am making a proof of concept for an app, using Rails 4 and Postgresql. I am thinking on the best way to handle the relation between Products and SubProducts.
A Product have a name, a description... and a SubProduct could have multiple fields too.
A Product have many SubProduct, a SubProduct belongs to one Product.
I have some Products and SubProducts with hundreds of fields. So I think it is best to not use STI to avoid thousands of null value.
Also I am working with remote designers, I would like to keep it simple for them. So when they want to display the value of a field from a sub product, they would write something like #product.name (from Product table) or #product.whatever (field from SubProduct table).
My question is how to handle this ? For the moment, I was thinking to delete the Products table and to make multiple SELECT to db, one for each SubProducts table. But maybe there is a solution to keep the Products table ? Or maybe I can take advantage of table inheritance from Postgresql ?
Thank you :-)
Are the hundreds of fields all different for each subproduct? (As you mentioned, "sparse" attributes can lead to lots of nulls.)
This brings to mind an entity-attribute-value model, as described here:
https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model
Here's a presentation with one organization's solution (key/value hstore):
https://wiki.postgresql.org/images/3/37/Eav-pgconfnyc2014.pdf
This can quickly get very complicated, and makes things like search much more challenging.
And if there are many variations, this also brings to mind a semi-structured, document-oriented or "NoSQL" design.

Core data database design issues

I want to create a core data model to store user input questions and then entries for those questions.
The questions will be 1 of 3 types. (1-10, Yes No or a number of specific units). So users will go into the app, click "add question" and will be shown a form where they will input the question title choose the question type (Yes/No, Scale 1-10 or Specific Units) - and if they choose specific units they have to put in those units eg: centimetres, millimetres, litres etc...
There will be multiple entries to these questions mapped over a period of time. IE: Users may put in the questions "How tall is your tomato plant? - specific units - CMs" and "How red are your tomatos? - scale 1-10" and will go back into the app time and again to map out the progress over time.
My question is about database design for this. How would I set this up in my core data?
At the moment I have this:
Now, I'm stuck! I don't quite know how to account for the fact that questions have different types, and therefore the entries will be different.
Plus I need to account for questions where units may be required. These are the questions that will just be a number.
So I'm thinking I probably need different question entites. Something like having a Questions entity and the specific questions as sub-entites - but I'm not sure how to do this, and not sure how to then record the entries made to those questions!
Thank you for looking, and for helping if you can. :)
There's no correct way to design it, and there are indeed many different ways to design a solution.
As #dasdom suggested, you could remove the value from the Entry entity description, and make Entry abstract. Create three subclasses of Entry (BooleanAnswer, RankedAnswer, and ValueAnswer for example). The ValueAnswer would have a decimalValue and units, the others similar related values.
Also, FYI on naming conventions:
Entities: upper case, singular: Question and Entry (though i'd name it Answer)
Attributes/Relationships: lower case: questionText, type, entries, value, questions ('forQuestion' and 'forEntity' are redundant, being the relationship is defined on each of those)
Another option is to have Entity have units, boolValue, rankValue, and decimalValue but only use the attribute that's relevant. That's not an elegant solution, imho.

Single Inheritance or Polymorphic?

I'm programming a website that allows users to post classified ads with detailed fields for different types of items they are selling. However, I have a question about the best database schema.
The site features many categories (eg. Cars, Computers, Cameras) and each category of ads have their own distinct fields. For example, Cars have attributes such as number of doors, make, model, and horsepower while Computers have attributes such as CPU, RAM, Motherboard Model, etc.
Now since they are all listings, I was thinking of a polymorphic approach, creating a parent LISTINGS table and a different child table for each of the different categories (COMPUTERS, CARS, CAMERAS). Each child table will have a listing_id that will link back to the LISTINGS TABLE. So when a listing is fetched, it would fetch a row from LISTINGS joined by the linked row in the associated child table.
LISTINGS
-listing_id
-user_id
-email_address
-date_created
-description
CARS
-car_id
-listing_id
-make
-model
-num_doors
-horsepower
COMPUTERS
-computer_id
-listing_id
-cpu
-ram
-motherboard_model
Now, is this schema a good design pattern or are there better ways to do this?
I considered single inheritance but quickly brushed off the thought because the table will get too large too quickly, but then another dilemma came to mind - if the user does a global search on all the listings, then that means I will have to query each child table separately. What happens if I have over 100 different categories, wouldn't it be inefficient?
I also thought of another approach where there is a master table (meta table) that defines the fields in each category and a field table that stores the field values of each listing, but would that go against database normalization?
How would sites like Kijiji do it?
Your database design is fine. No reason to change what you've got. I've seen the search done a few ways. One is to have your search stored procedure join all the tables you need to search across and index the columns to be searched. The second way I've seen it done which worked pretty well was to have a table that is only used for search which gets a copy of whatever fields that need to be searched. Then you would put triggers on those fields and update the search table.
They both have drawbacks but I preferred the first to the second.
EDIT
You need the following tables.
Categories
- Id
- Description
CategoriesListingsXref
- CategoryId
- ListingId
With this cross reference model you can join all your listings for a given category during search. Then add a little dynamic sql (because it's easier to understand) and build up your query to include the field(s) you want to search against and call execute on your query.
That's it.
EDIT 2
This seems to be a little bigger discussion that we can fin in these comment boxes. But, anything we would discuss can be understood by reading the following post.
http://www.sommarskog.se/dyn-search-2008.html
It is really complete and shows you more than 1 way of doing it with pro's and cons.
Good luck.
I think the design you have chosen will be good for the scenario you just described. Though I'm not sure if the sub class tables should have their own ID. Since a CAR is a Listing, it makes sense that the values are from the same "domain".
In the typical classified ads site, the data for an ad is written once and then is basically read-only. You can exploit this and store the data in a second set of tables that are more optimized for searching in just the way you want the users to search. Also, the search problem only really exists for a "general" search. Once the user picks a certain type of ad, you can switch to the sub class tables in order to do more advanced search (RAM > 4gb, cpu = overpowered).

Resources