Ideal solution for the following case scenario in database - join

There are 50 exams to be written online by millions of students. A person may or may not write more than one exam, and a person can also write a single exam more than once (retries).
So which of the two solutions below is better for this case? I am also open to a better solution than these two.
Option 1: store each exam in its own table:
Subject 1
+------------+-------+
| student id | Marks |
+------------+-------+
| 1          | 85    |
| 2          | 32    |
| 2          | 60    |
+------------+-------+
Subject 2
+------------+-------+
| student id | Marks |
+------------+-------+
| 1          | 85    |
| 2          | 32    |
| 2          | 60    |
+------------+-------+
As above, each table contains a student id only if that person has taken that exam, and contains multiple occurrences of the student id if they have taken it more than once.
Option 2:
+------------+---------+-------+
| student id | Subject | Marks |
+------------+---------+-------+
| 1          | Subj1   | 85    |
| 2          | Subj1   | 32    |
| 2          | Subj1   | 60    |
| 1          | Subj2   | 80    |
| 3          | Subj2   | 90    |
+------------+---------+-------+
with all the values in a single table.
Which is better from a performance and storage perspective?

I think the best approach here is the following:
Table STUDENT with information about students
Table EXAM with information about exams
Table EXAM_TRY with references to the STUDENT and EXAM tables, plus fields DATE_OF_EXAM and RESULT_OF_EXAM
Two indexes on the foreign keys in EXAM_TRY
Depending on the situation, an index on the date field (for example, you would need it for planning work for examiners)
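A minimal DDL sketch of that design, in generic SQL (the types, constraint details, and index names are illustrative assumptions, not part of the answer above):
CREATE TABLE student (
    student_id BIGINT PRIMARY KEY
    -- plus whatever student attributes you need (name, email, ...)
);
CREATE TABLE exam (
    exam_id INT PRIMARY KEY,
    subject VARCHAR(100) NOT NULL
);
CREATE TABLE exam_try (
    exam_try_id    BIGINT PRIMARY KEY,
    student_id     BIGINT NOT NULL REFERENCES student (student_id),
    exam_id        INT    NOT NULL REFERENCES exam (exam_id),
    date_of_exam   DATE   NOT NULL,
    result_of_exam INT    NOT NULL
);
-- the two foreign-key indexes
CREATE INDEX idx_exam_try_student ON exam_try (student_id);
CREATE INDEX idx_exam_try_exam    ON exam_try (exam_id);
-- optional, depending on how you query by date
CREATE INDEX idx_exam_try_date    ON exam_try (date_of_exam);
Each retry is simply one more row in EXAM_TRY, so both "all attempts by a student" and "all attempts at an exam" remain cheap to query through the two foreign-key indexes.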

Related

How to create a relationship using Cypher

I have been learning neo4j/cypher for the last week. I have finally been able to upload two CSV files and create a relationship, "captured". However, I am not fully confident in my understanding of the code, as I was following the tutorial on the neo4j site. Could you please help me confirm that what I did is correct?
I have two CSV files, "cap.csv" and "survey.csv". The survey table contains data for each unique survey conducted at the survey sites. The cap table contains data for each unique organism captured. In the cap table I have a foreign key, "survey_id", which in the Postgres db you would join to the primary key in the survey table.
I want to create a relationship, "captured", showing each unique organism that was captured, based on the "date" column in the survey table.
Survey table
| lake_id | date     | survey_id | duration |
|---------|----------|-----------|----------|
| 1       | 05/27/14 | 1         | 7        |
| 2       | 03/28/13 | 2         | 10       |
| 2       | 06/29/19 | 3         | 23       |
| 3       | 08/21/21 | 4         | 54       |
| 1       | 07/23/18 | 5         | 23       |
| 2       | 07/22/23 | 6         | 12       |
Capture table
| cap_id | species | capture_life_stage | weight | survey_id |
|--------|---------|--------------------|--------|-----------|
| 1      | a       | adult              | 10     | 1         |
| 2      | a       | adult              | 10     | 2         |
| 3      | b       | juv                | 23     | 3         |
| 4      | a       | adult              | 54     | 4         |
| 5      | b       | juv                | 23     | 5         |
| 6      | c       | juv                | 12     | 6         |
LOAD CSV WITH HEADERS FROM 'file:///cap.csv' AS row
WITH
row.id as id,
row.species as species,
row.capture_life_stage as capture_life_stage,
toInteger(row.weight) as weight,
row.survey_id as survey_id
MATCH (c:cap {id: id})
MERGE (s) - [rel:captured {survey_id: survey_id}] ->(c)
return count(rel)
I am struggling to understand the code I wrote above. I followed the neo4j tutorial exactly but used my data (https://neo4j.com/developer/desktop-csv-import/).
I am fairly confident from data checks, but did the above code create the "captured" relationship showing each unique organism captured on that unique survey date? Based on the visual I can see, I believe it did, but I don't fully understand each step in the code.
What is the purpose of the MATCH (c:cap {id: id}) in the code?
The code below
MATCH (c:cap {id: id})
is the same as
MATCH (c:cap)
WHERE c.id = id
It is a shorter way of finding the cap node based on its id; after that you create a relationship with the survey node.
Question: s is not defined in your query. Where is it?
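For reference, here is a sketch of how the import could look with s actually defined. It assumes the survey nodes were created in an earlier step with the label survey and an id property taken from survey.csv, and that the cap.csv header is cap_id as in the table above; adjust the labels, property names and row.* references to whatever your earlier import really used:
LOAD CSV WITH HEADERS FROM 'file:///cap.csv' AS row
WITH
  row.cap_id AS id,                // use row.id instead if that is the actual CSV header
  row.survey_id AS survey_id
MATCH (c:cap {id: id})             // the capture node loaded from cap.csv
MATCH (s:survey {id: survey_id})   // the survey it belongs to -- this is where s gets defined
MERGE (s)-[rel:captured]->(c)      // link each survey to the organisms captured on it
RETURN count(rel)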

How to forecast (or any other function) in Google Sheets with only one cell of data?

My sheet:
+---------+-----------+---------+---------+-----------+
| product | value 1 | value 2 | value 3 | value 4 |
+---------+-----------+---------+---------+-----------+
| name 1 | 700,000 | 500 | 10,000 | 2,000,000 |
+---------+-----------+---------+---------+-----------+
| name 2 | 200,000 | 800 | 20,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 3 | 100,000 | 150 | 6,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 4 | 1,000,000 | 1,000 | 25,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 5 | 2,000,000 | 1,500 | 30,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 6 | 2,500,000 | 3,000 | 65,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 7 | 300,000 | 300 | 12,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 8 | 350,000 | 200 | 9,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 9 | 900,000 | 1,200 | 28,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 10 | 150,000 | 100 | 5,000 | ? |
+---------+-----------+---------+---------+-----------+
What I am attempting is to predict the empty cells in value 4 based on the data that I do have. Should I base the prediction on all of the columns that have data in every row, or should I focus on just one of them?
I have used FORECAST previously, but I had more data in the column I was predicting values for; I think the lack of data is my root problem(?). I am not sure FORECAST is best for this, so recommendations for other functions are welcome.
The last thing I can add is that the known value in column E (value 4) is a number I am confident in, and ideally it is used in whatever formula I end up with (although I am open to other recommendations).
The formula I was using:
=FORECAST(D3,E2,$D$2:$D$11)
I don't think this is possible without more information. If you think about it, value 4 could be a constant (always 2,000,000), depend on only one other value (say 200 times value 3), or follow a more complex formula (say the sum of values 1, 2 and 3 plus a constant). Each of these three models agrees with the values for name 1, yet they generate vastly different value 4 predictions.
In the case of name 2, the models would output the following for value 4:
Constant: 2,000,000
200 × value 3: 4,000,000
Sum plus constant: 1,510,300 (the constant that fits name 1 is 2,000,000 − 710,500 = 1,289,500)
Any of those values could be valid; to narrow them down you need further constraints, either more data points or a specified kind of model (probably both).
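Purely to illustrate that point, each of those three hypothetical models can be written as a formula for name 2, assuming the table above sits in A1:E11 with headers in row 1 (so name 1 is row 2 and name 2 is row 3):
Constant model:                                  =$E$2
200 × value 3 (200 = E2/D2 for name 1):          =200*D3
Sum plus constant (constant = E2-(B2+C2+D2)):    =B3+C3+D3+($E$2-(B2+C2+D2))
All three reproduce name 1's 2,000,000 exactly, which is why a single known row cannot tell them apart.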

Time span accumulating fact tables design

I need to design a star schema for order processing. The progress of an order looks like this:
Customer C places an order for item I with quantity 100
Factory F1 takes part of the order, with quantity 30
Factory F2 takes part of the order, with quantity 20
We buy 50 items from the market
F1 delivers 20 items
F1 delivers 7 items
F1 cancels the contract (so we need to buy 3 more items from the market)
F2 delivers 20 items
We buy 3 items from the market
The order is complete
How can I design a fact table in this case, given that the number of steps is not fixed and the events are not all of the same type?
I'm sorry for my bad English.
The definition of an Accumulating Snapshot Fact table according to Kimball is:
summarizes the measurement events occurring at predictable steps between the beginning and the end of a process.
For this particular use case I would go with a Transaction Fact Table, as the events (steps) are unpredictable; it is more of an event fact table, similar to a log or audit trail.
| order_key | date_key | full_datetime | entity_key (customer, factory, etc. varchar) | entity_type | state | quantity |
|-----------|----------|---------------------|----------------------------------------------|-------------|----------|----------|
| 1 | 20190602 | 2019-06-02 04:30:00 | C1 | customer | request | 100 |
| 1 | 20190602 | 2019-06-02 05:30:00 | F1 | factory | receive | 30 |
| 1 | 20190602 | 2019-06-02 05:30:00 | F2 | factory | receive | 20 |
| 1 | 20190602 | 2019-06-02 05:40:00 | Company? | company | buy | 50 |
| 1 | 20190603 | 2019-06-03 02:40:00 | F1 | factory | deliver | 20 |
| 1 | 20190603 | 2019-06-03 04:40:00 | F1 | factory | deliver | 7 |
| 1 | 20190603 | 2019-06-03 06:40:00 | F1 | factory | cancel | |
| 1 | 20190604 | 2019-06-04 07:40:00 | F2 | factory | deliver | 20 |
| 1 | 20190604 | 2019-06-04 07:40:00 | Company? | company | buy | 3 |
| 1 | 20190604 | 2019-06-04 09:40:00 | Company? | company | complete | 100 |
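For reference, a possible DDL sketch of that transaction fact table in generic SQL (the table name order_event_fact, the types, and the nullability are assumptions):
CREATE TABLE order_event_fact (
    order_key     INT         NOT NULL,  -- reference to the order
    date_key      INT         NOT NULL,  -- reference to the date dimension (yyyymmdd)
    full_datetime TIMESTAMP   NOT NULL,  -- exact time of the event
    entity_key    VARCHAR(50) NOT NULL,  -- customer, factory, company, ...
    entity_type   VARCHAR(20) NOT NULL,
    state         VARCHAR(20) NOT NULL,  -- request / receive / buy / deliver / cancel / complete
    quantity      INT         NULL       -- left empty for events that carry no quantity (e.g. cancel)
);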
I'm not sure about your reporting needs as they were not specified, but assuming you need to measure the lag/duration between unpredictable steps, you could PIVOT and use dynamic SQL to create the required view:
SQL Server dynamic PIVOT query?
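As a rough illustration of that idea (SQL Server syntax; order_event_fact and its columns are the assumed names from the sketch above), a static version measuring the time from request to completion per order could look like this, and the linked question shows how to build the IN list dynamically:
SELECT order_key,
       DATEDIFF(hour, [request], [complete]) AS hours_to_complete
FROM (
    SELECT order_key, state, full_datetime
    FROM order_event_fact
) AS src
PIVOT (
    MIN(full_datetime) FOR state IN ([request], [complete])
) AS p;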
Let me know if you come up with something different, as I'm interested in this particular use case. Good luck!

Understanding Slowly Changing Dimension Type 2

I am having a difficult time understanding how to use a Slowly Changing Dimension Type 2 in my scenario.
I have gone through different tutorial websites, but they don't fit my case.
I have an employee dimension table containing:
+----+----------+-------------+------------+
| id | employee | designation | Location   |
+----+----------+-------------+------------+
| 1  | Ola      | CEO         | Newyork    |
| 2  | Ahmed    | DEVELOPER   | California |
| 3  | Ola      | Manager     | California |
+----+----------+-------------+------------+
I have an Account fact table:
+--------+---------+
| emp_id | Amount  |
+--------+---------+
| 1      | 2000000 |
| 2      | 300000  |
+--------+---------+
Now the dimension has changed, and a new id has been given to the same employee, Ola.
How would we manage this in the fact table?
The new id for Ola will not be found in the fact table.
So if we add a new row to the fact table with Ola's new id, how would we record that both rows refer to the same employee, when they are identified by different primary keys?
How would we distinguish that this is not a new employee, but an existing employee whose location/designation has changed?
I am sure there are many ways of doing it; here's one way. Have an "emp_key" column in your dimension table which is unique for each employee, so your dimension table will look like this:
| id | emp_key | employee | designation | Location   | Valid From | Valid To |
|----|---------|----------|-------------|------------|------------|----------|
| 1  | EMP1    | Ola      | CEO         | Newyork    | 1/1/1900   | 1/1/2016 |
| 2  | EMP2    | Ahmed    | DEVELOPER   | California | 1/1/1900   | NULL     |
| 3  | EMP1    | Ola      | Manager     | California | 1/2/2016   | NULL     |
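A minimal DDL sketch of that dimension in generic SQL (the table name employee_dim and the types are assumptions; the columns follow the table above):
CREATE TABLE employee_dim (
    id          INT PRIMARY KEY,        -- surrogate key: a new row (and id) per change
    emp_key     VARCHAR(20)  NOT NULL,  -- durable business key, stays the same across versions
    employee    VARCHAR(100) NOT NULL,
    designation VARCHAR(100) NOT NULL,
    location    VARCHAR(100) NOT NULL,
    valid_from  DATE NOT NULL,
    valid_to    DATE NULL               -- NULL marks the current version
);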
You can continue loading your fact table with the "new" id for the employee. In this case you will have two different keys for that employee in the fact table.
+--------+---------+
| emp_id | Amount  |
+--------+---------+
| 1      | 2000000 |
| 2      | 300000  |
| 3      | 100000  |
+--------+---------+
If you want to roll up (say, the sum of amounts) for an employee from the beginning, you would join the fact and dimension tables on the id key and group by emp_key.
So:
select emp_key, sum(amount) from employee dim, account fact where dim.id = fact.emp_id group by emp_key
If you want to find out the amount since he became a Manager, you just have to roll up on the id field instead:
select dim.id, sum(amount) from employee dim, account fact where dim.id = fact.emp_id group by dim.id
or this way:
select fact.emp_id, sum(amount) from account fact group by fact.emp_id
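The same two rollups written with explicit JOIN syntax, just as a sketch (employee_dim and account_fact are assumed names for the sample tables above):
-- lifetime total per employee: group by the durable key, so both versions of Ola roll up together
SELECT d.emp_key, SUM(f.amount) AS total_amount
FROM account_fact f
JOIN employee_dim d ON d.id = f.emp_id
GROUP BY d.emp_key;
-- total per dimension version (e.g. only since Ola became Manager): group by the surrogate key
SELECT f.emp_id, SUM(f.amount) AS total_amount
FROM account_fact f
GROUP BY f.emp_id;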

How do you SUM two fields from two tables, even when the field in the second table could be null?

I have the following tables:
products.rb
# has_many :sales
+----+----------+----------+-------+
| id | name     | quantity | price |
+----+----------+----------+-------+
| 1  | Pencil   | 30       | 1.0   |
| 2  | Pen      | 50       | 1.5   |
| 3  | Notebook | 100      | 2.0   |
+----+----------+----------+-------+
sales.rb
# belongs_to :product
+----+----------+------------+
| id | quantity | product_id |
+----+----------+------------+
| 1  | 10       | 1          |
| 2  | 2        | 1          |
| 3  | 5        | 1          |
| 4  | 2        | 2          |
| 5  | 10       | 2          |
+----+----------+------------+
I'd like to know, first, how many items I have left, regardless of their type. The answer is of course 151, but counting by hand would be cheating. I could simply run a SUM on each table individually and then combine the results to get the final number, but I'm wondering if this could be done via ActiveRecord in a single command.
I tried the following:
Product.includes(:sales).group('products.id').sum('products.quantity - sales.quantity')
but I get:
=> {1=>73, 2=>88, 3=>0}
which is understandable, as it is going through each one to do the sum like this:
+-------------------+----------------+-----+
| products.quantity | sales.quantity | sum |
+-------------------+----------------+-----+
| 30                | 10             | 20  |
| 30                | 2              | 28  |
| 30                | 5              | 25  |
+-------------------+----------------+-----+
which equals 73.
Anyway, how could this be achieved with ActiveRecord? I want to know the total number of items, but I'd also like to know the total of each type.
I'm not aware of any pure ActiveRecord way to achieve what you want, but you can try mixing a little SQL in there:
Product
.group('products.id')
.sum('products.quantity - (SELECT SUM(sales.quantity) AS sales_quantity FROM sales WHERE sales.product_id = products.id)')
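If you also want the single grand total, one option (an untested sketch against the same two models) is to wrap the subquery in COALESCE, so products with no sales, like Notebook, keep their full quantity instead of returning nil, and then sum the resulting hash in Ruby:
left_per_product = Product
  .group('products.id')
  .sum('products.quantity - COALESCE((SELECT SUM(sales.quantity) FROM sales WHERE sales.product_id = products.id), 0)')
# => {1=>13, 2=>38, 3=>100} for the sample data above
left_per_product.values.sum
# => 151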
