Data Warehouse example - simply explained [closed] - data-warehouse

I'm trying to learn about Data Warehouses right now, but I really don't get it. My question isn't really specific, but I just want somebody to explain to me the idea of data warehouses.
I'm trying right now to create a data warehouse out of SO's database.
In this database there are 8 tables, they are pretty self-explanatory for those who use SO:
Badges
Comments
PostHistory
PostLinks
Posts
Tags
Users
Votes
1. Dimensions
What would be the dimensions? That's the big part I don't understand. For me I see 7 dimensions: Badges, Comments, Posts, PostLinks, Tags, Users and Votes. But then I don't see the point of using data warehouses, the dimensions are exactly the tables.
-Would date be a dimension? Date of what? Of each comment AND post?
-Would it be relevant to separate Post into a Question dimension and an Answer dimension?
-What other dimensions can I put?
2. Fact Table
How can I put all the foreign keys (userId, postId, commentId...) in one table? For example, let's say a user posts a question but there's no comment. Would I have a row in my fact table with his userId, the postId, and NULL in the commentId column?
Measures. I'm thinking of the following measures in the fact table: number of questions, number of users, number of tags...
Can someone tell me if I'm going in the right direction?

The first question you have to answer when building a data warehouse is "What question(s) do I want to answer?"
Using Stack Overflow as an example, one question could be, "How many posts are there about X each month over the last 2 years?"
To answer this question, we need to create Posts and Post Tags fact tables. Since these tables are select and insert only, we can denormalize the fact data so it's easier to select.
So, we might have a Post fact table that looks something like this.
Post
----
Post Number
Post Text
Post Timestamp
Post Tag 1
Post Tag 2
Post Tag 3
Post Tag 4
Post Tag 5
It would be somewhat straightforward to select based on the timestamp and group by month. We only care about the first 5 post tags, and we don't care if some of them are null.
Now, you don't have to denormalize the data, but queries generally run faster if you do.
You do the same thing for the other data available. What question(s) do you want to answer?
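As a rough illustration of the monthly question above, here is a minimal sketch using Python's built-in sqlite3 module. The table name post_fact, the snake_case column names, and the sample rows are assumptions made for this example; only the column layout follows the Post table described above.

```python
import sqlite3

# In-memory database holding a denormalized post fact table (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE post_fact (
        post_number    INTEGER PRIMARY KEY,
        post_text      TEXT,
        post_timestamp TEXT,  -- ISO 8601 string, e.g. '2023-01-05 10:00:00'
        post_tag_1 TEXT, post_tag_2 TEXT, post_tag_3 TEXT,
        post_tag_4 TEXT, post_tag_5 TEXT
    )
""")

# Made-up sample rows; unused tag slots are simply NULL.
rows = [
    (1, "q1", "2023-01-05 10:00:00", "sql", "etl", None, None, None),
    (2, "q2", "2023-01-20 12:30:00", "python", "sql", None, None, None),
    (3, "q3", "2023-02-02 09:15:00", "sql", None, None, None, None),
    (4, "q4", "2023-02-10 18:45:00", "python", None, None, None, None),
]
conn.executemany("INSERT INTO post_fact VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)


def posts_per_month(conn, tag):
    """Count posts carrying `tag` in any of the five tag columns, grouped by month."""
    query = """
        SELECT strftime('%Y-%m', post_timestamp) AS month, COUNT(*) AS posts
        FROM post_fact
        WHERE ? IN (post_tag_1, post_tag_2, post_tag_3, post_tag_4, post_tag_5)
        GROUP BY month
        ORDER BY month
    """
    return conn.execute(query, (tag,)).fetchall()


monthly_sql_posts = posts_per_month(conn, "sql")
```

Because the tags sit in the same row as the timestamp, the query needs no join; that is the trade-off the denormalized layout buys.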

Stack Overflow is probably not the best data model to consider if you're trying to wrap your head around the concept of DW. It doesn't contain many "traditional" facts. The only examples which jump to my mind immediately are the Up/Down votes and the user rankings.
You would find many of what we call "factless facts". These essentially treat the intersection of multiple dimensions as a fact, with just an implied "count" as the sole measure. As an example, in the Post Fact, it would simply be a count at the intersection of User, Date, SO Database, etc.
You would probably consider a concept such as the Junk dimension to support referencing the Tags in a Fact table. This would see you assign a pseudo key to each unique combination of Tags, and then this key is what you would store in the Fact table.
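A minimal sketch of that idea in plain Python, assuming tag order does not matter within a combination. The names tag_dim, tag_combo_key, and the sample posts are hypothetical and only illustrate the pseudo-key assignment.

```python
def tag_combo_key(tag_dim, tags):
    """Return the pseudo key for a combination of tags, creating it on first sight."""
    combo = tuple(sorted(set(tags)))  # normalize so tag order does not matter
    if combo not in tag_dim:
        tag_dim[combo] = len(tag_dim) + 1  # next surrogate key
    return tag_dim[combo]


# The junk dimension: one entry per unique tag combination.
tag_dim = {}

# Made-up source rows.
posts = [
    {"post_id": 101, "tags": ["sql", "etl"]},
    {"post_id": 102, "tags": ["etl", "sql"]},
    {"post_id": 103, "tags": ["python"]},
]

# The fact rows store only the pseudo key, plus the implied count of 1.
post_fact = [
    {
        "post_id": p["post_id"],
        "tag_combo_key": tag_combo_key(tag_dim, p["tags"]),
        "post_count": 1,
    }
    for p in posts
]
```

Posts 101 and 102 end up sharing one dimension row because they carry the same set of tags.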
If you want to learn about DW, use your personal finances, this is how I learned. You can learn about snapshot facts with your account balances, you can learn about transactional facts with your purchases, and you can create Vendor and Account dimensions, among others.


What is the difference between adding a column to a table in rails and joining a table? [closed]

I recently discovered that you can add columns to a table in Rails by doing something like:
rails generate migration add_lastname_to_users lastname:string
Previously I used to join tables, which was very complicated to me, but adding a column seems to accomplish the same task.
Why should I choose one over another?
A table represents a single "entity". In this case, it probably makes most sense to store the users.lastname in the same table.
On the other hand, suppose a user can have many phone numbers. In this case, it is better to normalise the database and store this data in a separate table.
In other words, you want to avoid doing something like this:
users.phone_number_1
users.phone_number_2
users.phone_number_3
The key issues with this approach (as explained in more detail by the above link) are:
You'll have lots of redundant columns for most users. This wastes storage space and can decrease performance.
You need to keep adding new columns if a user goes over the limit (e.g. 3 numbers, because there are 3 columns).
Querying the data gets much harder. For example, suppose you want to query "all users who have phone number X" -- you now need to search across multiple columns!
Instead, create a separate phones table, which is joined to the user by a user_id column.
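To make the difference concrete, here is a small sketch using Python's built-in sqlite3 module rather than Rails. The users and phones tables follow the text above; the number column, the sample data, and the helper users_with_number are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, lastname TEXT);
    CREATE TABLE phones (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),
        number  TEXT NOT NULL
    );
    INSERT INTO users (id, lastname) VALUES (1, 'Smith'), (2, 'Jones');
    INSERT INTO phones (user_id, number) VALUES
        (1, '555-0100'), (1, '555-0101'), (2, '555-0100');
""")


def users_with_number(conn, number):
    """Return the last names of all users who have the given phone number."""
    rows = conn.execute(
        """
        SELECT users.lastname
        FROM users
        JOIN phones ON phones.user_id = users.id
        WHERE phones.number = ?
        ORDER BY users.lastname
        """,
        (number,),
    ).fetchall()
    return [lastname for (lastname,) in rows]


shared = users_with_number(conn, "555-0100")
```

The search touches a single column no matter how many numbers a user has, and a fourth number is just another row rather than a schema change.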
I guess it depends on your application, but generally it's best to "normalize" your database. That is, define individual tables for specific objects. A user table might have the fields user_id, first_name, & last_name. You can then join on the user_id field. This tends to make your lookups faster and your tables smaller.
https://en.wikipedia.org/wiki/Database_normalization
This is not so much related to rails and ActiveRecord but more a question of database design.
Without going into too much detail: In a relational database management system you join tables (or columns in your case?) when some piece of information you need is already available in a different place (usually another table) (hence the "relational" in the name, tables are related to each other and share information). You do not want to repeat data.
This is different in NoSQL databases where joins might not even exist (MongoDB has the notion of embedded documents, you will at some point have to repeat data)
In your case, it is easy (from the point of view of the DB) to just take the already available information (first_name + last_name) and return that. Adding another column with the same information seems 'wasteful'.
You should be able to define a helper method in your model that returns the full name, see create a name helper

Rails Association Guidance [closed]

I am new to Rails 4. I have gone through lots of tutorials while trying to solve the scenario below, but still with no success. Can anybody point me in the right direction on how to handle associations for it?
Scenario:
1. Patient can have many surgeries.
2. Surgery has two types (Patient will undergo only one surgery at a time.)
a.) Surgery Type A (if chosen, record 15 column fields values)
b.) Surgery Type B (if chosen, record 25 column fields values)
3. Surgery Type A can be of any one type.
a.) unilateral (if chosen, record 10 column fields for unilateral & skip bilateral)
b.) bilateral (if chosen, record 10 column fields for bilateral & skip unilateral)
4. Surgery Type B can be of any one type.
a.) unilateral (if chosen, record 10 column fields for unilateral & skip bilateral)
b.) bilateral (if chosen, record 10 column fields for bilateral & skip unilateral)
I need some suggestions to handle the associations correctly. This is a little confusing: I need to record a lot of field values in a table based on each surgery type and subtype (unilateral or bilateral). What is the best way to handle associations for this scenario so that the data can later be fetched easily for reporting purposes?
Thanks in Advance
So, the complex part of your situation is that you have one thing (Surgery) that can be of many different types, and the different types have different fields. There are a number of different approaches to this problem, and I don't believe there's wide consensus on the 'best way'.
The first and simplest way (from a data model perspective, at least) is 'just put it all in one thing': make a Surgery model, give it a surgery_type field to specify which type it is, and give that one record all 45 fields. Then use logic in your views to display only the relevant fields based on the surgery_type field, to validate the presence of only those fields, and so on.
A more complex variant on this is Single Table Inheritance, in which you have multiple models, but they all live in the same table.
There are some obvious downsides to this approach - 45 fields is a lot, and when most of them are empty for any given record, that feels wasteful (and can have performance impacts). Which is why the various other patterns exist.
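The 'one record, validate by type' idea can be sketched outside Rails in a few lines of Python. Everything named here is hypothetical: the field names are placeholders standing in for the real 15, 25, and 10 columns, and the sketch assumes the unilateral and bilateral fields are the same for both surgery types.

```python
# Placeholder field names; a real model would list every column for each group.
FIELDS_BY_TYPE = {
    "A": {"a_field_1", "a_field_2"},
    "B": {"b_field_1", "b_field_2"},
}
FIELDS_BY_LATERALITY = {
    "unilateral": {"uni_field_1"},
    "bilateral": {"bi_field_1"},
}


def required_fields(surgery_type, laterality):
    """Fields that must be present for this type and subtype combination."""
    return FIELDS_BY_TYPE[surgery_type] | FIELDS_BY_LATERALITY[laterality]


def missing_fields(record):
    """Return the required fields that the wide record leaves empty."""
    required = required_fields(record["surgery_type"], record["laterality"])
    return sorted(name for name in required if record.get(name) is None)


# One wide record: fields that do not apply are simply absent (NULL in the table).
record = {
    "surgery_type": "A",
    "laterality": "unilateral",
    "a_field_1": 3,
    "uni_field_1": "left",
}
problems = missing_fields(record)
```

The lookup tables play the role that conditional validations would play in the model; the record itself stays one flat row, which is what makes reporting queries simple and the empty columns wasteful.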
And so, as an alternative, there is Multiple Table Inheritance, which Rails implements via polymorphism. In this pattern, a Patient has_many Surgeries, but this is a polymorphic association, meaning that it can refer to other objects of multiple types. In this pattern, you'd have either two or four models representing types of surgery, and associate each one to a patient. The UnilateralEndocrineSurgery model, for instance, only needs its ten fields. The downsides of polymorphism include making it more difficult to work with groups of Surgery objects, because they are of different types and respond to different methods.
The relative strengths of these approaches are complex and frequently debated, and enumerating them is beyond the scope of a SO answer.

iOS Swift - Core Data examples with a complex data relationship [closed]

I'm trying to understand how to model complex data relationships with Core Data.
With my app, I currently have an entity of Recipe, Ingredient and RecipeIngredient which bind ingredients to a recipe.
I have not come across any example of fetching data out of this joint entity. I'd appreciate it if someone could give an example of an entity like my RecipeIngredient in Swift.
The reason you haven't seen examples similar to your RecipeIngredient entity is that you don't need joint entities like that in Core Data. You're treating Core Data as if it were a relational database, where you'd typically use a join table in order to create efficient many-to-many relationships between entities. As explained in the Many-to-Many Relationships sub-section of the Core Data Programming Guide, with Core Data all you need to do is to specify a to-many relationship in both directions between two entities. Note the parenthetical remark in the docs:
(If you have a background in database management and this causes you concern, don't worry: if you use a SQLite store, Core Data
automatically creates the intermediate join table for you.)
Here's an illustration of the relationship as you should model it, ripped straight from Xcode's model editor:
If you'd still like to see examples of how to do this, search for something like "Core Data many to many relationships" and you'll find plenty. You could start here on StackOverflow; a quick search turned up a number of examples, including How do you manage and use "Many to many" core data relationships?.
Update: From your comment, I understand that you want to use an intermediate object to add information about the relationship between recipes and ingredients. That is a case where another entity is warranted. So let's say your model looks like this:
It seems unlikely that you'd want to fetch one of these RecipeIngredient objects directly; you'd probably just follow the appropriate relationship. So, you might create a fetch request to find all the Recipes whose name matches @"chocolate cake". (There are plenty of examples of fetch requests using a predicate in the docs and all over the net, so I won't do that here.) Your fetch request will return an array of recipes that we could call cakeRecipes, but you're probably only interested in one:
Recipe *cake = cakeRecipes.firstObject;
Now, what do you want to know about the ingredients for your cake? Here's a list of the ingredients:
NSArray *ingredientNames = [[cake.ingredients valueForKeyPath:@"ingredient.name"] allObjects];
If you'd like to log the ingredient names and amounts:
for (RecipeIngredient *i in cake.ingredients) {
    NSLog(@"%@ %@", i.amount, i.ingredient.name);
}
Or, you could use a fetch request to find the ingredients matching "celery", storing the result in celeries. After that, you might look for recipes including celery:
Ingredient *celery = celeries.firstObject;
NSArray *recipes = [[celery.recipes valueForKeyPath:@"recipe"] allObjects];
If this doesn't help, perhaps you could be more specific about the problem. Also, I know you asked for Swift, but my fingers are still used to Obj-C, and the language specifics don't really come into play here -- Core Data works the same in both languages.

Properly gathering metadata/object data using Core Data/sqlite [closed]

I'm new to objective c and core data. I am making an iOS app kind of like a flash card app. I have a core data set up with an EnglishWord entity in a many to one relationship with ForeignWord entities (different languages).
For each ForeignWord entity I want to keep track of certain metadata: how many times I have viewed the word, the dates I viewed it, a score I give it, etc. Ideally would be if I could have an array/dictionary as an attribute within the ForeignWord managed object itself. This is not possible.
The only option I can think of is to create a new entity called 'Score', with each ForeignWord entity 'owning' many Score managed objects (one to many), a new 'Score' managed object being created every time I view the foreignWord.
However, this sounds very messy. If I have 1000 words, then I would have 1000 different tables in the sqlite database, one for each card. Does that slow things down? Is it bad to have 1000 different tables?
Is this really the way to do it? Is there a more elegant solution? Thanks!
You might consider adding a table called something like "Viewing" that has a relationship (to 1) both to EnglishWord and ForeignWord. You could then track the metadata that interest you in this table and aggregate the data in this table to determine how many times that you viewed a particular word, whether or not you identified it correctly, etc.
I would create one new table called ViewEntry or something and link relationships to both of your word tables. That way when a word is viewed you can store whatever meta information you want, as well as having an active link to the English and foreign version. Something like this:
This way you can create a new ViewEntry, set its foreignVersion and word attributes to the English and foreign words, and set the date, score, and total answer time (along with anything else you want). You could then do some really nice querying to pull up useful information:
Give me all of the foreign versions of the English word "school" where the score was less than 50%.
I want all of the times the user attempted the Russian translation of "house" and the associated scores.
What English words have the worst/best score?
What is the most/least viewed English (or foreign) word?
All of this would be relatively easy since the English word could access all of the times it was viewed and any foreign version of any English version could do the same. You could also access the score and view date and pull up all of the English/foreign versions from that data.
You also would not need to create 1000 tables :)
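A rough sketch of this single-table layout, using Python's built-in sqlite3 module instead of Core Data. The table names, column names, and sample data are assumptions; Core Data would create its own tables from the entities, so treat this only as an illustration of the shape of the data and the queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE english_word (id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE foreign_word (
        id              INTEGER PRIMARY KEY,
        english_word_id INTEGER REFERENCES english_word(id),
        language        TEXT,
        word            TEXT
    );
    -- One row per viewing, for every word: a single table, not one per card.
    CREATE TABLE view_entry (
        id              INTEGER PRIMARY KEY,
        english_word_id INTEGER REFERENCES english_word(id),
        foreign_word_id INTEGER REFERENCES foreign_word(id),
        viewed_on       TEXT,
        score           REAL
    );
    INSERT INTO english_word VALUES (1, 'school'), (2, 'house');
    INSERT INTO foreign_word VALUES
        (1, 1, 'Russian', 'shkola'), (2, 1, 'French', 'ecole'), (3, 2, 'Russian', 'dom');
    INSERT INTO view_entry (english_word_id, foreign_word_id, viewed_on, score) VALUES
        (1, 1, '2024-01-01', 40), (1, 2, '2024-01-02', 90),
        (2, 3, '2024-01-03', 70), (2, 3, '2024-01-04', 80);
""")

# Foreign versions of "school" that were scored below 50.
low_school = conn.execute("""
    SELECT f.language, f.word
    FROM view_entry v
    JOIN english_word e ON e.id = v.english_word_id
    JOIN foreign_word f ON f.id = v.foreign_word_id
    WHERE e.word = 'school' AND v.score < 50
""").fetchall()

# View count and average score per English word.
views_per_word = conn.execute("""
    SELECT e.word, COUNT(*), AVG(v.score)
    FROM view_entry v
    JOIN english_word e ON e.id = v.english_word_id
    GROUP BY e.id, e.word
    ORDER BY e.word
""").fetchall()
```

Each viewing adds a row to the one view_entry table, so the number of tables stays fixed however many words there are.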

user matching system, efficient search approach?

EDIT: I know it's been over a year, but I finally got something new to this problem. To see an update for this look at this question: Rails 3 user matching-algorithm to SQL Query (COMPLICATED)
I'm working on a site where users are matched based on answered questions.
The match percentage is calculated each time a user, for example, visits another users profile page. So the matching percentage is not stored in the database and is recalculated all the time.
Now I want to build in a search where users can search for their best match.
The question I have is, what is the most efficient way to do this?
What if I have 50k users and I have to list them ordered by match percentages. Do I have to calculate each matching percentage between one and the other 50k users and then create a list out of that? Sounds kind of inefficient to me. Wouldn't that slow down the app drastically?
I hope someone can help me with this, because this gives me kind of a headache.
EDIT:
To clear things up a bit, here is my database model for user, questions, answers, user_answers and accepted_answers:
Tables:
Users(:id, :username, etc.)
Questions(:id, :text)
Answers(:id, :question_id, :text)
UserAnswers(:id, :user_id, :question_id, :answer_id, :importance)
AcceptedAnswers(:id, :user_answer_id, :answer_id)
Questions <-> Answers: one-to-many
Questions <-> UserAnswers: one-to-many
Users <-> UserAnswers: one-to-many
UserAnswers <-> AcceptedAnswers: one-to-many
So there is a list of Questions(with possible answers to this question) and Users give their "UserAnswers" to those questions, assign how important that question is to them and what answers they accept from other users.
Then if you take User1 and User2, you look for common answered questions, so UserAnswers where the question_id is the same. Say they have 10 questions in common. User1 gave the importance value 10 to the first five questions and the importance value 20 to the other five. User2 gave acceptable answers to two of the 20-value questions and three of the 10-value questions, for a total of 70 points. The highest reachable point score is of course 20x5 + 10x5 = 150, so User2 reached 70/150 * 100 = 46.67%. The same thing is done the other way around for how much User1 reached of User2's assigned points to those questions. Those two percentages are then combined through the geometric mean, sqrt(percentage1 * percentage2), which gives the final match percentage.
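The calculation described above can be checked with a short Python sketch. The function names are made up for this example, and since the question gives no numbers for the reverse direction, the value 60.0 for User1's percentage is purely hypothetical.

```python
from math import sqrt


def earned_percentage(importances, accepted):
    """Share of the other user's importance points that were earned, in percent.

    importances: importance value per common question, as assigned by one user.
    accepted:    whether the other user's answer to that question was acceptable.
    """
    total = sum(importances)
    earned = sum(points for points, ok in zip(importances, accepted) if ok)
    return 100.0 * earned / total


def match_percentage(percentage1, percentage2):
    """Combine both directions with the geometric mean."""
    return sqrt(percentage1 * percentage2)


# User1's importance values for the 10 common questions: five 10s, then five 20s.
user1_importances = [10] * 5 + [20] * 5
# User2 gave acceptable answers to three 10-value and two 20-value questions.
user2_accepted = [True] * 3 + [False] * 2 + [True] * 2 + [False] * 3

user2_score = earned_percentage(user1_importances, user2_accepted)  # 70 of 150 points

# Hypothetical value for the reverse direction (not given in the question).
user1_score = 60.0

final_match = match_percentage(user2_score, user1_score)
```

With the assumed 60% in the other direction, the final match is sqrt(46.67 * 60), roughly 52.9%.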
@Wassem's answer seems spot on for your problem. I would also suggest you take an approach where percentages are updated on new answers and new accepted answers.
I have created a db only solution(gist), which would work but has an additional complexity of an intermediate table.
Ideally you should create two more tables, one for importance and another for percentage matches. You should insert, update, or delete rows in these tables when a user assigns or updates importance for an answer or marks some answer as acceptable. You can also leverage delayed_job or resque to update the tables in the background on those particular actions.
You may need to run the SQL queries once in a while to sync up the data in the two new tables, as inconsistencies can arise due to concurrency and also due to the ordering of update actions in certain cases.
Updates on an accepted answer should be straightforward, as you only need to update one pair. But when somebody assigns importance to a question, there can be a lot of calculations and a lot of percentages might need updating. To avoid this, you might choose to maintain only the table with sums of importance for each pair, update it when required, and calculate the actual percentages on the fly (in the db, of course).
I suggest you keep the match percentages of the users in your database. Create a table matches that holds the match percentage for a pair of users. You do not need to save a match percentage for every pair of users in your database. A valid match percentage can be calculated for two users only when at least one of them has accepted an answer from the other. Most users will not accept the answers of most other users.
I suggest you calculate and save the match percentage not when a user visits another user's profile, but when a user accepts another user's answers. This makes sure that you do not make any unnecessary calculations and that the match percentage for a pair of users is always fresh.
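A minimal sketch of such a matches table, again with Python's built-in sqlite3 module rather than Rails. The column names, the convention of storing each pair once with the smaller id first, and the helpers save_match and best_matches are assumptions for this illustration; the percentages are made-up values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE matches (
        user_a     INTEGER NOT NULL,
        user_b     INTEGER NOT NULL,
        percentage REAL NOT NULL,
        PRIMARY KEY (user_a, user_b),
        CHECK (user_a < user_b)  -- each pair is stored once, smaller id first
    )
""")


def save_match(conn, user1, user2, percentage):
    """Insert or refresh the stored percentage for a pair, e.g. after an accept."""
    a, b = sorted((user1, user2))
    conn.execute(
        "INSERT OR REPLACE INTO matches (user_a, user_b, percentage) VALUES (?, ?, ?)",
        (a, b, percentage),
    )


def best_matches(conn, user_id, limit=10):
    """Return (other_user_id, percentage) pairs for a user, best match first."""
    return conn.execute(
        """
        SELECT CASE WHEN user_a = ? THEN user_b ELSE user_a END AS other_user,
               percentage
        FROM matches
        WHERE user_a = ? OR user_b = ?
        ORDER BY percentage DESC
        LIMIT ?
        """,
        (user_id, user_id, user_id, limit),
    ).fetchall()


save_match(conn, 1, 2, 52.9)
save_match(conn, 3, 1, 81.0)
save_match(conn, 2, 3, 10.0)
save_match(conn, 2, 1, 55.0)  # same pair as (1, 2): replaces the earlier row

top_for_user_1 = best_matches(conn, 1)
```

Storing one row per pair works because the geometric mean is symmetric, and the search becomes a sorted read over rows that already exist instead of 50k recalculations.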
