How to resolve a 1-n relationship in a star schema? - data-warehouse

I'm working on a data storage model for a clickstream analytics system. User action data comes from a third-party system as a set of large JSON files. Currently, we will have an ETL process to read JSON files as a source and save data into our store for future analysis and reporting.
Depending on some business rules of the source system, each event can have an is_success field set to true or false. Non-successful user actions have a JSON field with an array of nested objects with diagnostic data about failures.
The draft data model for the storage system is the following:
I have concerns about the relation between fact_events and dim_failure_details on the diagram above. To me, dim_failure_details does not look like a dimension because it has a many-to-one relationship to the fact table.
I've read a design tip from the Kimball Group. That article recommends using a bridge table in a similar situation. But I don't understand how to apply that solution in my case because each event can have different and unpredictable values for attribute_key and attribute_value even when failure_type is the same for multiple events.
I also saw a few similar questions (Star schema [fact 1:n dimension]...how?, Star schema [fact 1:n dimension]...how?), but still don't know how the relationship should be organized correctly. Any help will be much appreciated.

Related

Storing Product Properties

I'm creating a jewellery product catalogue application and I need to store properties for each product such as material, finishes, product type etc.
I've concluded that there needs to be a model for each property, mainly because things like material and finishes might have prices and weights and other things associated with them.
Which of the two options is the more efficient and scalable way to store the data?
Create a PropertyMap model that maps property types and IDs to a Product ID.
Create several other models, such as ProductMaterial, ProductFinish etc., that each map a property to a product.
All the data needs to be searchable & filterable. The database will probably index around 10K products.
Open to other smarter ways to store this data as well!
As a rule of thumb, to get the most out of your database tools, it's best to normalize your data according to the typical SQL conventions. That means that a bunch of fields that have a one-to-one relationship with each other should be collected together into the same table. That way you can grab them all (and they're frequently needed together) with a simple and efficient query.
If you instead have to gather them up from some different organization, both you and the database will end up having to do a lot more work. It will scale poorly, both on the hardware and in your brain as you struggle to maintain and extend it.
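A minimal sketch of what such a normalized layout could look like, using Python's standard-library sqlite3 module. The table and column names (product, material, product_material, price, weight_g) and the sample rows are assumptions made for illustration; they do not come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Fields that belong together (a material's price and weight) share a table.
    CREATE TABLE material (
        id       INTEGER PRIMARY KEY,
        name     TEXT NOT NULL,
        price    REAL,
        weight_g REAL
    );
    -- Link table: a product can use several materials.
    CREATE TABLE product_material (
        product_id  INTEGER NOT NULL REFERENCES product(id),
        material_id INTEGER NOT NULL REFERENCES material(id),
        PRIMARY KEY (product_id, material_id)
    );
""")
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [(1, "Ring"), (2, "Bracelet")])
conn.executemany("INSERT INTO material VALUES (?, ?, ?, ?)",
                 [(1, "Gold", 55.0, 3.2), (2, "Silver", 0.8, 4.1)])
conn.executemany("INSERT INTO product_material VALUES (?, ?)",
                 [(1, 1), (2, 1), (2, 2)])

def products_with_material(material_name):
    """Return the names of products that use the given material."""
    rows = conn.execute("""
        SELECT p.name
        FROM product p
        JOIN product_material pm ON pm.product_id = p.id
        JOIN material m ON m.id = pm.material_id
        WHERE m.name = ?
        ORDER BY p.name
    """, (material_name,)).fetchall()
    return [name for (name,) in rows]

print(products_with_material("Gold"))  # ['Bracelet', 'Ring']
```

Filtering by another property, such as finish, would follow the same pattern with its own pair of tables.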

Is this pattern suitable for Core Data?

The only database I've worked with before is MySQL, so the database design of Core Data is confusing me a little bit.
Briefly, the design consists of a many-to-many relationship between people and businesses. Many people can own one business. One person can own many businesses.
In this simplified design, there are 3 tables:
PERSON    BUSINESS    OWNED BUSINESS
------    --------    --------------
id        id          personID
name      name        businessID
email     website     acquisitionDate
The OwnedBusiness table is the one that's confusing me. In MySQL, this table is used to support many-to-many relationships. I understand that CoreData doesn't require this, however I have an extra field in OwnedBusiness: acquisitionDate.
Does the extra field, acquisitionDate warrant the use of the extra entity/table? If not, where would that field go?
First, Core Data is not a database, full stop.
Core Data is an object graph management framework, your model in your application.
It can persist to disk in a database. It can also persist as binary, XML and just about anything else. It does not even need to persist.
Think about Core Data as an object graph only. In your example you would have a Person entity, a Business entity and an OwnedBusiness entity.
The OwnedBusiness entity would have two relationships and one property. You would not manage the foreign keys because Core Data handles that if you end up persisting to a database. Otherwise they are object pointers.
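To make the "object graph" idea concrete, here is a rough sketch in plain Python. It is not Core Data code; the class and attribute names simply mirror the entities in the question, and the acquire helper is a hypothetical convenience. The point is that the link object holds direct references to the two objects it connects, plus its own property, with no foreign keys in sight.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass(eq=False)  # identity-based equality, like object pointers
class Person:
    name: str
    email: str
    owned: list = field(default_factory=list)   # to-many -> OwnedBusiness

@dataclass(eq=False)
class Business:
    name: str
    website: str
    owners: list = field(default_factory=list)  # to-many -> OwnedBusiness

@dataclass(eq=False)
class OwnedBusiness:
    person: Person          # to-one relationship
    business: Business      # to-one relationship
    acquisition_date: date  # the one property of the link itself

def acquire(person, business, when):
    """Create the link object and register it on both sides (the inverse)."""
    link = OwnedBusiness(person, business, when)
    person.owned.append(link)
    business.owners.append(link)
    return link

alice = Person("Alice", "alice@example.com")
shop = Business("Corner Shop", "https://example.com")
link = acquire(alice, shop, date(2014, 1, 15))

print(alice.owned[0].business.name)  # Corner Shop
```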
So first of all, Core Data is not a relational database, just to clear that up.
Second, I think you should have a quick look at the Core Data documentation. Since you are familiar with MySQL it will be an easy read, and I think you will be rather amazed by the extra features that Core Data provides.
Regarding the many-to-many relationship, Core Data supports it without the need for extra tables. Also, relationships are not based on IDs; they are based directly on objects.
So in your case, you don't have to use the person ID and business ID to create the relationship. You can create the relationship in the Relationship section of your xcdatamodel, where you can set the relationship class (or Destination), an inverse to that relationship (a useful thing) and of course the type of relationship (to-many, to-one).
So to answer your question, you can add it there depending on your business logic. As a short piece of advice, please don't try to normalise the data model as you would on a normal MySQL instance; you will lose a lot of performance by normalising, which is often ignored by developers.

CoreData object modeling with multiple timeframes for weather data

I have a JSON file (http://jsonblob.com/530664b3e4b0237f7f82bdfa) that I am pulling from forecast.io.
I am a little confused about how I should create my Core Data entities and relationships.
In the setup below, I made my Location entity the parent entity and created separate entities for Currently, Minutely, Hourly and Daily. However, I have decided it's best to hold all the weather data in one entity, so I created a Data table for that purpose and tied it to Daily and Currently in the image below.
Before going further, I paused and would like to get a second opinion on it. Is this a valid way of going forward with this?
EDIT: Based on Wain's response I changed my model to this
Currently, Minutely and Hourly add little value as they don't have any attributes or relationships. It's also generally easier to add a type attribute than to have a number of sub-entities, because you can easily filter on the type using a predicate while doing a fetch. If you're going to add more in the future then there could be a case for keeping the sub-entities.
Once the entities are trimmed down, you only have Location and Data with a relationship between them. You should make that relationship bi-directional so that Core Data can manage the data store contents better. (This applies to all relationships, even if you keep the sub-entities you already have.)
Other than that, fine :-)

How to do a join in Elasticsearch -- or at the Lucene level

What's the best way to do the equivalent of an SQL join in Elasticsearch?
I have an SQL setup with two large tables: Persons and Items.
A Person can own many items.
Both Person and Item rows can change (i.e. be updated).
I have to run searches which filter by aspects of both the person and the item.
In Elasticsearch, it looks like you could make Person a nested document of Item, then use has_child.
But: if you then update a Person, I think you'd need to update every Item they own (which could be a lot).
Is that correct?
Is there a nice way to solve this query in Elasticsearch?
As already mentioned, the way to go is parent/child. The point is that nested documents are extremely performant, but in order to update them you need to re-submit the whole structure (parent + nested documents). Although the internal implementation of nested documents consists of separate Lucene documents, those nested docs are neither visible nor directly accessible. In fact, when using nested documents you need to use the proper queries to access them (nested query, nested filter, nested facet etc.).
On the other hand, parent/child allows you to have separate documents that refer to each other, which can be updated independently. It has a cost in terms of performance and memory used, but it is far more flexible than nested documents.
As mentioned in this article though, the fact that Elasticsearch helps you manage relations doesn't mean that you must use those features. In a lot of complex use cases it is just better to have some custom logic in the application layer that handles relations. In fact there are limitations with parent/child too: for instance, you can never get back both parent and children at the same time, as opposed to nested documents, which don't allow you to get back only the matching children (for now).
Take a look at my answer for: In Elasticsearch, can multiple top-level documents share a single nested document?
This discusses the use of _parent mapping as a way to avoid the issue with needing to update every Item when a Person is updated.
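As a rough illustration of the parent/child approach, the request bodies could be assembled like this in Python. The sketch uses the legacy _parent mapping syntax that the answers refer to (releases from 6.0 onward replaced it with a join field), and the type names person and item as well as the fields city and color are assumptions made for the example. Nothing here talks to a cluster; it only builds the JSON.

```python
import json

# Legacy mapping: every "item" document declares "person" as its parent type,
# so a person can be re-indexed without touching the items they own.
mappings = {
    "person": {},
    "item": {"_parent": {"type": "person"}},
}

def persons_owning(color, city):
    """Search body: persons in `city` with at least one child item of `color`."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"term": {"city": city}},
                    {
                        "has_child": {
                            "type": "item",
                            "query": {"term": {"color": color}},
                        }
                    },
                ]
            }
        }
    }

body = persons_owning("red", "paris")
print(json.dumps(body, indent=2))
```

This filters on aspects of both the person and the item in one query, at the memory and performance cost described above.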

Rails - Good way to support creation of drafts for several models

I want to allow users to create drafts of several models (such as article, blog post etc). I am thinking of implementing this by creating a draft model for each of my current models (such as articleDraft, blogpostDraft etc.). Is there a better way to do this? Creating a new model for every existing model that should support drafts seems messy and is a lot of work.
I think the better way is to have a flag in the table (e.g. an int column called draft) to identify whether the record is a draft or not.
Advantages of having such a column without a separate table, as I see them:
It's easy to make your record non-draft (just change the flag)
you will not duplicate data (because practically you will have the same in draft and non-draft records)
coding will be easy, no complex logic
all the data will be in one place and hence less room for error
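The flag approach can be sketched with Python's sqlite3 module rather than Rails, purely to show the shape of the data. The articles table, its columns and the helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE articles (
        id    INTEGER PRIMARY KEY,
        title TEXT,
        body  TEXT,
        draft INTEGER NOT NULL DEFAULT 1   -- 1 = draft, 0 = published
    )
""")

def save_draft(title, body):
    """Insert a new record; it starts life as a draft."""
    cur = conn.execute(
        "INSERT INTO articles (title, body) VALUES (?, ?)", (title, body))
    return cur.lastrowid

def publish(article_id):
    """Making a record non-draft is just a flag change."""
    conn.execute("UPDATE articles SET draft = 0 WHERE id = ?", (article_id,))

def published_titles():
    """Every query for 'real' records has to filter on the flag."""
    rows = conn.execute(
        "SELECT title FROM articles WHERE draft = 0 ORDER BY id")
    return [title for (title,) in rows]

first = save_draft("Hello", "First post")
second = save_draft("Unfinished", None)
publish(first)
print(published_titles())  # ['Hello']
```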
I've been working on Draftsman, a Ruby gem for creating a draft state of your ActiveRecord data.
Draftsman's default approach is to store draft data for all drafted models in a single drafts table via a polymorphic relationship. It stores the object state as JSON in an object column and optionally stores JSON data representing changes in an object_changes column.
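The single-table idea can be sketched as follows, again in Python with sqlite3. This is not Draftsman's actual schema or API: only the object and object_changes columns are named in the description above, while item_type, item_id and the helper functions are assumptions for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE drafts (
        id             INTEGER PRIMARY KEY,
        item_type      TEXT NOT NULL,  -- assumed: which model the draft is for
        item_id        INTEGER,        -- assumed: NULL until the record exists
        object         TEXT NOT NULL,  -- drafted state serialized as JSON
        object_changes TEXT            -- optional JSON describing the changes
    )
""")

def save_draft(item_type, attributes, item_id=None, changes=None):
    """Store a draft for any model in the one shared table."""
    cur = conn.execute(
        "INSERT INTO drafts (item_type, item_id, object, object_changes) "
        "VALUES (?, ?, ?, ?)",
        (item_type, item_id, json.dumps(attributes),
         json.dumps(changes) if changes is not None else None),
    )
    return cur.lastrowid

def load_drafts(item_type):
    """Return the drafted states for one model, oldest first."""
    rows = conn.execute(
        "SELECT object FROM drafts WHERE item_type = ? ORDER BY id",
        (item_type,))
    return [json.loads(obj) for (obj,) in rows]

save_draft("Article", {"title": "Hello", "body": None})
save_draft("BlogPost", {"title": "Notes"}, item_id=7,
           changes={"title": ["Old notes", "Notes"]})
print(load_drafts("Article"))  # [{'title': 'Hello', 'body': None}]
```

Because the drafted state lives in a JSON column, incomplete records do not run into the main table's constraints.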
Draftsman also allows you to create a separate draft model for each model (e.g., article_drafts, blog_post_drafts) if you want. I agree that this approach is fairly cumbersome and error-prone.
The real advantage to splitting the draft data out into separate models (or to just use a boolean draft flag on the main table, per sameera207's answer) is that you don't end up with a gigantic drafts table with tons of records. I'd offer that that only becomes a real problem when your application has a ton of usage though.
All that to say that my ultimate recommendation is to store all of your draft data in the main model (blog) or a single drafts table, then separate out as needed if your application needs to scale up.
Check out the Active Record Versioning category at The Ruby Toolbox. The current leader is Paper Trail.
I'd go down the state machine route. You can validate each attribute when the model's in a certain state only. Far easier than multiple checkboxes and each state change can have an action (or actions) associated with it.
Having a flag in the model has some disadvantages:
You cannot save as a draft unless the data is valid. Sure, you can skip validations in the Rails model, but think about the "NOT NULL" columns defined in the database.
To find the "real" records, you have to use a filter (like "WHERE draft = FALSE"). This can slow down query performance.
As an alternative, check out my gem drafting. It stores drafts for different models in a separate table.
