Database design question - repeating duplicate values [closed] - ruby-on-rails

The application I am working on has a hierarchy to its data tables, which is mostly straightforward. However, there is one field shared between most of the tables that acts like a secondary PK. Searching through Google, this site, and other places, I don't see any similar examples of this design. It appears that this design pattern is uncommon -- but does its use create a problem that should be resolved?
Parent Table - Top level data object (ex: Product)
There are at least 10 sub tables (e.g., Manufacturer, Materials, SalesRep, Vendor, etc.).
Each of the sub tables may or may not have other dependent tables.
The parent table, and some (but not all) of the dependent tables, have a field called "Type", saved as an integer (e.g., Physical, Electronic, Both).
The issue is that when selecting data, the type_id is passed into all of the retrievals, for all of the tables. Doing so allows a "Product" (e.g., a book) to have one complete set of data (manufacturers, materials, reps, vendors, etc.) for one type of product (e.g., an electronic book), and that same "Product" to have a completely different (or the same) set of data for another type of product (e.g., a physical printed book).
Repeating the type_id through all of the tables duplicates the same data throughout, resulting in essentially a two-field PK for each record.
Currently:
--// Table: product
+------+-------------+----------------+
| id | date_issued | product_fields |
+------+-------------+----------------+
| 1 | 2010-08-20 | Book 1 |
| 2 | 2010-08-20 | Book 2 |
| 3 | 2010-08-20 | Book 3 |
+------+-------------+----------------+
--// Table: manufacturer
+------+------------+----------+-------------------+
| id | product_id | type_id | name |
+------+------------+----------+-------------------+
| 1 | 1 | 1 | Digital Printers |
+------+------------+----------+-------------------+
| 2 | 1 | 2 | Physical Printers |
+------+------------+----------+-------------------+
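For reference, a minimal SQL sketch of the current design (column types and constraints are my assumptions; names follow the tables above). Note how every sub table carries both halves of the de facto two-field key:

CREATE TABLE product (
  id          INTEGER PRIMARY KEY,
  date_issued DATE
  -- ... other product fields
);

CREATE TABLE manufacturer (
  id         INTEGER PRIMARY KEY,
  product_id INTEGER NOT NULL REFERENCES product(id),
  type_id    INTEGER NOT NULL,  -- repeated in every sub table
  name       VARCHAR(255)
);

-- every retrieval filters on both columns together:
CREATE INDEX idx_manufacturer_product_type ON manufacturer (product_id, type_id);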
From what I can see, one alternative is to make the "Type" relation a sub table under "Product", and then have every other table depend on that product/type association. However, implementing such a design change would require a great deal of refactoring in both the database and the code. While it is an alternative, is that how others would do this?
Resulting in something like this:
--// Table: product
+------+-------------+----------------+
| id | date_issued | product_fields |
+------+-------------+----------------+
| 101 | 2010-08-20 | Book 1 |
| 102 | 2010-08-20 | Book 2 |
| 103 | 2010-08-20 | Book 3 |
+------+-------------+----------------+
--// Table: product_type_assoc
+------+-------------+-------------+
| id | product_id | type_id |
+------+-------------+-------------+
| 5 | 101 | 1 |
+------+-------------+-------------+
| 6 | 101 | 2 |
+------+-------------+-------------+
| 7 | 102 | 1 |
+------+-------------+-------------+
--// Table: manufacturer
+------+---------------------+-------------------+
| id | assoc_id | name |
+------+---------------------+-------------------+
| 1 | 5 | Digital Printers |
+------+---------------------+-------------------+
| 2 | 6 | Physical Printers |
+------+---------------------+-------------------+
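A minimal sketch of that association design in SQL (again, types and constraints are assumptions); each sub table then references the association row instead of repeating (product_id, type_id):

CREATE TABLE product_type_assoc (
  id         INTEGER PRIMARY KEY,
  product_id INTEGER NOT NULL REFERENCES product(id),
  type_id    INTEGER NOT NULL,
  UNIQUE (product_id, type_id)  -- one association per product/type pair
);

CREATE TABLE manufacturer (
  id       INTEGER PRIMARY KEY,
  assoc_id INTEGER NOT NULL REFERENCES product_type_assoc(id),
  name     VARCHAR(255)
);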
An interim step seems to be keeping the current "type" on the product table and passing it to the sub-queries, along the lines of:
SELECT *
FROM vendor
WHERE vendor.product_id = 1
  AND vendor.type_id = (SELECT current_type FROM product WHERE id = 1);
What do you think -- is this the preferred way, or is there a more commonly accepted design that does the same thing?

Related

How to create a relationship using Cypher

I have been learning neo4j/cypher for the last week. I have finally been able to upload two CSV files and create a relationship, "captured". However, I am not fully confident in my understanding of the code, as I was following the tutorial on the neo4j site. Could you please help me confirm that what I did is correct?
I have two CSV files, "cap.csv" and "survey.csv". The survey table contains data for each unique survey conducted at the survey sites. The cap table contains data for each unique organism captured. In the cap table I have a foreign key, "survey_id", which in the Postgres db you would join to the primary key in the survey table.
I want to create a relationship, "captured", showing each unique organism that was captured, based on the "date" column in the survey table.
Survey table
| lake_id | date     | survey_id | duration |
|---------|----------|-----------|----------|
| 1       | 05/27/14 | 1         | 7        |
| 2       | 03/28/13 | 2         | 10       |
| 2       | 06/29/19 | 3         | 23       |
| 3       | 08/21/21 | 4         | 54       |
| 1       | 07/23/18 | 5         | 23       |
| 2       | 07/22/23 | 6         | 12       |
Capture table
| cap_id | species | capture_life_stage | weight | survey_id |
|--------|---------|--------------------|--------|-----------|
| 1      | a       | adult              | 10     | 1         |
| 2      | a       | adult              | 10     | 2         |
| 3      | b       | juv                | 23     | 3         |
| 4      | a       | adult              | 54     | 4         |
| 5      | b       | juv                | 23     | 5         |
| 6      | c       | juv                | 12     | 6         |
LOAD CSV WITH HEADERS FROM 'file:///cap.csv' AS row
WITH
row.id as id,
row.species as species,
row.capture_life_stage as capture_life_stage,
toInteger(row.weight) as weight,
row.survey_id as survey_id
MATCH (c:cap {id: id})
MERGE (s) - [rel:captured {survey_id: survey_id}] ->(c)
return count(rel)
I am struggling to understand the code I wrote above. I followed the neo4j tutorial exactly but used my data (https://neo4j.com/developer/desktop-csv-import/).
I am fairly confident from data checks, but did the above code create the "captured" relationship showing each unique organism captured on that unique survey date? Based on the visual output I believe it did, but I don't fully understand each step in the code.
What is the purpose of the MATCH (c:cap {id: id}) in the code?
The code below
MATCH (c:cap {id: id})
is the same as
MATCH (c:cap)
WHERE c.id = id
It is a shorter way of finding the cap node by id; after that, you create the relationship to the Survey node.
Question: s is not defined in your query. Where is it?
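For what it's worth, here is a sketch of how the query might be corrected so that s is bound before the MERGE. It assumes the survey nodes were loaded earlier with a label of survey and a survey_id property (both assumptions based on the description above):

LOAD CSV WITH HEADERS FROM 'file:///cap.csv' AS row
// find the capture node created earlier
MATCH (c:cap {id: row.id})
// bind s explicitly so MERGE connects the existing survey node
// instead of silently creating a new anonymous node on every run
MATCH (s:survey {survey_id: row.survey_id})
MERGE (s)-[rel:captured]->(c)
RETURN count(rel)

With s bound to the matching survey node, the survey_id property on the relationship becomes redundant, so it is dropped here.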

In a data warehouse, can a fact table contain two identical records?

If a user ordered the same product with two different order_ids, and the orders were created within the same date-hour granularity, for example:
order#1 2019-05-05 17:23:21
order#2 2019-05-05 17:33:21
In the data warehouse, should we put them into two rows like this (Option 1):
| id | user_key | product_key | date_key | time_key | price | quantity |
|-----|----------|-------------|----------|----------|-------|----------|
| 001 | 1111 | 22 | 123 | 456 | 10 | 1 |
| 002 | 1111 | 22 | 123 | 456 | 10 | 2 |
Or just put them in one row with the aggregated quantity (Option 2):
| id | user_key | product_key | date_key | time_key | price | quantity |
|-----|----------|-------------|----------|----------|-------|----------|
| 001 | 1111 | 22 | 123 | 456 | 10 | 3 |
I know if I put the order_id as a degenerate dimension in the fact table, it should be Option 1. But in our case, we don't really want to keep the order_id.
Also, I once read an article saying that when all dimensions are filtered out, there should be only one row of data in the fact table. If that statement is correct, Option 2 would be the choice.
Is there a principle I can refer to?
Conceptually, fact tables in a data warehouse should be designed at the most detailed grain available. You can always aggregate data from the lower granularity to the higher one, while the opposite is not true - if you combine the records, some information is lost permanently. If you ever need it later (even though you might not see it now), you'll regret the decision.
I would recommend the following approach: in the data warehouse, keep the order number as a degenerate dimension. Then, when you publish a star schema, you might build a pre-aggregated version of the table (skip the order number, group identical records by date/hour). This way, you get a smaller/cleaner fact table in your dimensional model, yet preserve more detailed data in the DW.
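A sketch of that approach in SQL (the table and view names here are hypothetical; column names follow the examples above):

-- detailed fact table, keeping order_id as a degenerate dimension
CREATE TABLE fact_order_detail (
  order_id    VARCHAR(20),
  user_key    INT,
  product_key INT,
  date_key    INT,
  time_key    INT,
  price       DECIMAL(10, 2),
  quantity    INT
);

-- pre-aggregated version published to the star schema:
-- order_id is dropped and identical records are grouped
CREATE VIEW fact_order_agg AS
SELECT user_key, product_key, date_key, time_key, price,
       SUM(quantity) AS quantity
FROM fact_order_detail
GROUP BY user_key, product_key, date_key, time_key, price;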

Understanding Slowly Changing Dimension Type 2

I am having a difficult time understanding how to use Slowly Changing Dimension Type 2 in my scenario.
I have gone through different tutorial websites, but they don't fit my scenario.
I have an employee dimension table containing:
+-----+----------+-------------+------------+
| id  | employee | designation | Location   |
+-----+----------+-------------+------------+
| 1   | Ola      | CEO         | Newyork    |
| 2   | Ahmed    | DEVELOPER   | California |
| 3   | Ola      | Manager     | California |
+-----+----------+-------------+------------+
I have an Account fact table:
+--------+---------+
| emp_id | Amount  |
+--------+---------+
| 1      | 2000000 |
| 2      | 300000  |
+--------+---------+
Now we see that the dimension has changed, and thus a new ID has been given to the same employee, Ola.
How would we manage this in the fact table?
The new ID for Ola will not be found in the fact table.
If we add a new row to the fact table with Ola's new ID, how would we record that the two IDs are the same employee when they are identified by different primary keys?
How would we distinguish that this is not a new employee, but one whose location/designation has actually changed?
I am sure there are many ways of doing it; here is one way: have an "emp_key" in your dimension table that is unique per employee. Your dimension table will then look like this:
| id | emp_key | employee | designation | Location   | Valid From | Valid To |
|----|---------|----------|-------------|------------|------------|----------|
| 1  | EMP1    | Ola      | CEO         | Newyork    | 1/1/1900   | 1/1/2016 |
| 2  | EMP2    | Ahmed    | DEVELOPER   | California | 1/1/1900   | NULL     |
| 3  | EMP1    | Ola      | Manager     | California | 1/2/2016   | NULL     |
You can continue loading your fact table with the "New" ID for the employee. In this case you will have 2 different sets of Keys for that employee.
+--------+---------+
| emp_id | Amount  |
+--------+---------+
| 1      | 2000000 |
| 2      | 300000  |
| 3      | 100000  |
+--------+---------+
If you want to rollup (say Sum of amounts) for an employee from the beginning, you would join the fact and dimension using the ID key and group by emp_key.
So,
select dim.emp_key, sum(fact.amount)
from employee dim
join account fact on fact.emp_id = dim.id
group by dim.emp_key;
If you want to find out the amount since he became a manager, you just roll up on the ID field.
select dim.id, sum(fact.amount)
from employee dim
join account fact on fact.emp_id = dim.id
group by dim.id;
or this way -
select fact.emp_id, sum(fact.amount)
from account fact
group by fact.emp_id;
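For completeness, a sketch of how the Type 2 change itself might be applied to the dimension, using the example's values (date literals kept in the example's own format):

-- expire the current row for the employee
UPDATE employee
SET valid_to = '1/1/2016'
WHERE emp_key = 'EMP1'
  AND valid_to IS NULL;

-- insert the new version with a new surrogate id
INSERT INTO employee (id, emp_key, employee, designation, location, valid_from, valid_to)
VALUES (3, 'EMP1', 'Ola', 'Manager', 'California', '1/2/2016', NULL);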

Designing a Core Data managed object model for an iOS app that creates dynamic databases

I'm working on an iPhone app for users to create mini databases. The user can create a custom database schema and add columns with the standard data types (e.g. string, number, boolean) as well as other complex types such as objects and collections of a data type (e.g. an array of numbers).
For example, the user can create a database to record his meals.
Meal database:
[
{
"timestamp": "2013-03-01T13:00:00",
"foods": [1, 2],
"location": {
"lat": 47.253603,
"lon": -122.442537
}
}
]
Meal-Food database:
[
{
"id": 1,
"name": "Taco",
"healthRating": 0.5
},{
"id": 2,
"name": "Salad",
"healthRating": 0.8
}
]
What is the best way to implement a database for an app like this?
My current solution is to create the following database schema for the app:
When the user creates a new database schema as in the example above, the definition table will look like this:
+----+-----------+--------------+------------+-----------------+
| id | parent_id | name | data_type | collection_type |
+----+-----------+--------------+------------+-----------------+
| 1 | | meal | object | |
| 2 | 1 | timestamp | timestamp | |
| 3 | 1 | foods | collection | list |
| 4 | 1 | location | location | |
| 5 | | food | object | |
| 6 | 5 | name | string | |
| 7 | 5 | healthRating | number | |
+----+-----------+--------------+------------+-----------------+
When the user populates the database, the record table will look like this:
+----+-----------+---------------+------------------------+-----------+-----+
| id | parent_id | definition_id | string_value           | int_value | ... |
+----+-----------+---------------+------------------------+-----------+-----+
| 1  |           | 1             |                        |           |     |
| 2  | 1         | 2             | 2013-03-01T13:00:00    |           |     |
| 3  | 1         | 3             |                        | 1         |     |
| 4  | 1         | 3             |                        | 2         |     |
| 5  | 1         | 4             | 47.253603, -122.442537 |           |     |
+----+-----------+---------------+------------------------+-----------+-----+
More details about this approach:
Values for different data types are stored in different columns in the record table. It is up to the app to parse values correctly (e.g. converting timestamp int_value into a date object).
Constraints and validation must be performed in the app, as they are not possible at the database level.
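To illustrate that parsing burden, a sketch of how the app might read back all field rows of meal record 1 (the table names record and definition are taken from the schema above):

SELECT r.id,
       d.name,
       d.data_type,
       r.string_value,
       r.int_value
FROM record r
JOIN definition d ON d.id = r.definition_id
WHERE r.parent_id = 1;  -- all field rows belonging to meal record 1

The app then has to inspect d.data_type for each row to decide which value column to read and how to convert it.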
What are other drawbacks with this approach and are there better solutions?
First of all, your record table is very inefficient and somewhat hard to work with. Instead, you can have separate record tables for each record type you need to support. That will simplify everything a lot and add some flexibility, because introducing support for a new record type will no longer be a problem.
With that said, basic table management will be enough to make your system functional. Naturally, there is the ALTER TABLE command, but in some cases it might be very expensive, and some engines have various limitations. For example:
SQLite supports a limited subset of ALTER TABLE. The ALTER TABLE command in SQLite allows the user to rename a table or to add a new column to an existing table.
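A sketch of the table-per-record-type idea in SQLite (table and column names are assumed from the meal example above); adding a user-defined field becomes an ALTER TABLE, which SQLite's subset does allow:

-- one table per user-defined record type
CREATE TABLE meal (
  id        INTEGER PRIMARY KEY,
  timestamp TEXT,
  lat       REAL,
  lon       REAL
);

CREATE TABLE food (
  id            INTEGER PRIMARY KEY,
  name          TEXT,
  health_rating REAL
);

-- a hypothetical new column the user adds later
ALTER TABLE meal ADD COLUMN notes TEXT;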
Another approach might be to use BLOBs with some type tags to store record values. This reduces the need for separate tables and leads us to a schemaless approach.
Do you absolutely have to use CoreData for this?
It might make more sense to use a schema-less solution, such as http://developer.couchbase.com/mobile/develop/references/couchbase-lite/release-notes/iOS/index.html

Ideal solution for the following scenario in a database

There are 50 exams to be written online by millions of students. One person may or may not write more than one exam, and a person can also write a single exam more than once (retries).
So which of the solutions below is better for this case? I am also open to a better solution than these two.
Option 1: store each exam in its own table:
Subject 1
+----------------+---------+
| student id | Marks |
+----------------+---------+
| 1 | 85 |
| 2 | 32 |
| 2 | 60 |
+----------------+---------+
Subject 2
+----------------+---------+
| student id | Marks |
+----------------+---------+
| 1 | 85 |
| 2 | 32 |
| 2 | 60 |
+----------------+---------+
As above, each table will have a student id only if that particular person has taken that exam, with multiple occurrences of the student id if he has taken it more than once.
Option 2 :
+----------------+---------+---------+
| student id | Subject | Marks |
+----------------+---------+---------+
| 1 | Subj1 | 85 |
| 2 | Subj1 | 32 |
| 2 | Subj1 | 60 |
| 1 | Subj2 | 80 |
| 3 | Subj2 | 90 |
+----------------+---------+---------+
with all the values in a single table.
Which is better from a performance and storage perspective?
I think the best here is the following:
Table STUDENT with information about students
Table EXAM with information about exams
Table EXAM_TRY with references to the STUDENT and EXAM tables, and fields DATE_OF_EXAM and RESULT_OF_EXAM
Two indexes on the foreign keys in table EXAM_TRY
Depending on the situation, an index on the date field (for example, you would need it for planning examiners' work), as sketched below
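A minimal DDL sketch of that design (column types, and any column names beyond those mentioned above, are assumptions):

CREATE TABLE student (
  id   INT PRIMARY KEY,
  name VARCHAR(100)          -- other student attributes as needed
);

CREATE TABLE exam (
  id      INT PRIMARY KEY,
  subject VARCHAR(100)       -- other exam attributes as needed
);

-- one row per attempt, so retries are just extra rows
CREATE TABLE exam_try (
  id             INT PRIMARY KEY,
  student_id     INT NOT NULL REFERENCES student(id),
  exam_id        INT NOT NULL REFERENCES exam(id),
  date_of_exam   DATE,
  result_of_exam INT
);

-- the two foreign-key indexes, plus the optional date index
CREATE INDEX idx_exam_try_student ON exam_try (student_id);
CREATE INDEX idx_exam_try_exam ON exam_try (exam_id);
CREATE INDEX idx_exam_try_date ON exam_try (date_of_exam);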
