How to join tables of events to report non-events? - join

Employees must complete continuing education modules to stay competent. I need to build a report showing supervisors which modules their employees have – and have not -- completed.
I have a simple table of employees and their supervisors. Each row is an employee, with two columns: Supervisor Name and Employee Name. Call this the Staff table.
Supervisor
Employee Name
Mike
Jack
I have a table of module completions. Each row is an event -- a module completion, with columns for the Module Code, Employee Name, and Date of Completion. Call this the Completions table.
Employee Name
Module Code
Date of Completion
Jack
MOD1
2022-01-20
Jack
MOD2
2022-01-21
I have a table of all the Modules. Each row is a module, with columns for the Module Code and other module-level information, such as Module Name, etc.
Module Code
Module Name
Format
MOD1
Important Content
Online
MOD2
Super Important Content
Online
MOD3
Oh so Critical Content
In person
MOD4
You Must Learn This Content
In person
I can join the Staff table to the Completion table on Employee Name. That gives me a list of employees and the modules they have completed.
Employee Name
Module Code
Completion Date
Completions
Jack
MOD1
2022-01-20
1
Jack
MOD2
2022-01-21
1
I’m halfway there.
The challenge is how to report the uncompleted modules as well, so that we see this:
Employee Name
Module Code
Completion Date
Completions
Jack
MOD1
2022-01-20
1
Jack
MOD2
2022-01-21
1
Jack
MOD3
0
Jack
MOD4
0
By what configuration of joins among the Staff, Completions, and Modules tables can I report completions of all modules for every employee, even there is no completion date for some modules?
I'm stumped. I'm imagining different kinds of joins -- some kind of full outer join of Completions and Modules, or some kind of self join of Completions that reports a sum of all completions for all modules for every employee -- but I've reached the current limit of my skill.

Since we don't know what modules each employee has been assigned. We then assume each module must be associated to each employee and we build that assocation.
To do this we use a CROSS JOIN which is every record of one table related to every record of another table. Caution as this results in a Cartesian product. (1000*1000) = 1,000,000 records! but each employee is then assigned a module.
SELECT ...
FROM SupervisorEmployee E
CROSS JOIN Modules M -- This cross join ensure every employee is assigned every
-- record. note there is no ON clause when using a cross join.
LEFT JOIN Completions C
on {keys}
WHERE...
NOTE: Since we are using a left join. the where clause should only have limits on the SupervisorEmploye and Modules table. It should not have a limit on the completions table or it will negate the left join. (you can do it but you have to use an or and handle nulls it's just simpler to not put limits here)
If you need to limit on completions; do so using the On criteria so the reduction occurs before/as the join occurs and you retain the records without a completion.

Related

Relational Algebra Join - necessary to rename?

Let's assume i have some 2 simple tables:
IMPORTANT: This is about relational algebra, not SQL.
Band table:
band_name founded
Gambo 1975
John. 1342
Album table:
album_name band_name
Celsius. Gambo
Trambo Gambo
Now, since the Band and the Album table share the same column name "band_name", would it be necessary to rename it when i would join them?
As far as i know, the join eliminates the duplicate entry that is shared amongst the join. This example, where i simply pick all Bands that are existing in the Album table (obviously just 'Gambo' in this giving example)
Πfounded, band_name(Band ⋈ Album)
should therefore work fine, right? Can somebody confirm?
(Have to enter a caveat that there are many variants of Relational Algebra; that they differ in semantics; and they differ in syntax. Assuming you intend a variant similar to that in wikipedia ...)
Yes that expression should work fine. The natural join operator ⋈ matches same-named attributes between its two operands. So the subexpression Band ⋈ Album produces a result with attributes {band_name, founded, album_name}. Your expression projects two of those.
Note the attributes for a relation value are a set not a sequence; therefore any operation over relation operands with same-named attributes must match attributes.
In contrast, Cartesian Product × requires its operands to have disjoint attribute names. Then Band × Album is ill-formed and would be rejected. (So you'd need to Rename band_name in one of them, to get relations that could be operands.)
I'm not all that happy with your way of putting it "the join eliminates the duplicate entry that is shared amongst the join." Because only in SQL do you get a duplicate (from SELECT * FROM Band, Album ... -- which results in a table with four columns, of which two are named band_name). SQL FROM list of tables is a botch-up: neither join nor Cartesian Product, but something trying to be both, and succeeding only in being neither. RA's ⋈ never produces a "duplicate" so never does it "eliminate" anything.
Particularly if there's Keys declared and a Foreign Key constraint (from Album's band_name to Band's) I see those as identifying the same band, then the natural operation is to bring together that which has been taken apart, so the name 'Natural Join'.

Rails (Activerecord) - Can't make query with joins and global sum without duplicates

I'm using a query with multiple user-set filters in order to show a list of invoices in a Rails app. One of the filters adds a where condition on a column of a separate table, which needs a double join in order to be accessible (estimates -through projects-).
scope :by_seller, lambda {|user_id|
joins(project: :estimates)
.where(estimates: {:user_id => user_id}) unless user_id.blank?
}
Additionally, I use Rails' aggregate method "sum" in order to find out the total amount of the invoices, #invoices.sum(:total_cache), where total_cache is a cached column in the database specifically designed to perform this kind of sum in a performant way.
#invoices.sum(:total_cache)
My problem is, given the fact that I need a double join in order to access Estimates through Projects, and that each Invoice belongs to a Project, BUT a project can have many Estimates, the join operation results in duplicate records, so my Invoices table shows some of the invoices many times (as many as the number of estimates its project has). This results in an invoices table with duplicate records, and in an incorrect sum value, as it sums some of the invoice totals N times.
The filtering behaviour is just fine, as my intention is to filter by the user who made ANY of the estimates in the invoice project. However, the issue is that when I try to avoid the duplicates by adding a group('invoices.id') -the way I always solved such situations-, the final sum operation won't return the total sum of the invoices' total, but a grouped sum of each one of them (totally useless).
The only workaround I've found is to include the group clause and perform the sum in pure ruby code, treating the collection as an array, which IMHO is terribly inefficient, as there are tons of invoices:
#invoices.map(&:total_cache).inject(0, &:+)
Is there a way I can obtain a unique ActiveRecord collection of Invoices without duplicates in a way I can then call the aggregate sum method and obtain a total calculated by Postgres?
Of course, if there is something wrong in my base idea I'm completely open to hearing it! It's quite a complex query (I simplified it for the sake of the question here) and there can be many approaches I'm sure!
Thank you everyone!
I'm not sure how much "slower" or "faster" this is than doing the sum in ruby code. But if you want to still retain an ActiveRecord::Relation object, then you can do something like below. I reproduced your setup environment in a local Rails project.
user = User.first
Invoice.where(
id: Invoice.by_seller(user.id).select(:id)
).sum(:total_cache)
# (1.2 ms) SELECT SUM("invoices"."total_cache") FROM "invoices" WHERE "invoices"."id" IN (SELECT "invoices"."id" FROM "invoices" INNER JOIN "projects" ON "projects"."id" = "invoices"."project_id" INNER JOIN "estimates" ON "estimates"."project_id" = "projects"."id" WHERE "estimates"."user_id" = $1) [["user_id", 1]]
# => 5

Reducing over multiple joins in CouchDB

In my CouchDB database, I have the following models (implemented as documents in the database with different type fields):
Team: name, id (has many matches, has many fans)
Match: name, team_a, team_b, time (has many teams, has many tweets)
Fan: team_id (has many tweets)
Tweet: time, sentiment, fan_id
I want to average the tweet sentiment for each team. If I were using SQL I'd do it like this:
SELECT avg(sentiment)
FROM team
JOIN match on team.id = match.team_a OR team.id = match.team_b
JOIN fan on fan.team = team.id
JOIN tweet on (tweet.time BETWEEN match.time AND match.time + interval '1 hour') AND tweet.user = fan.id
GROUP BY team.id
However in CouchDB you can at best do 1 join in a view function, as explained in the docs (by emitting the join field as the key).
How can this be better modelled in CouchDB to allow for this query to work? I don't really want to denormalise too much, but I guess I will if I have to?
It's a bit complex, but I use what I call "tertiary indexes". The goal is to be able to write a view that is applied to another view. Unfortunately, the only way to do this is to use a view to write data to a secondary database and then have another view that works on that database. Doing this requires an outside process - I use a script that listens to the _changes feed of the primary database, and then updates the relevant documents in the secondary database when something changes.
So in your example your secondary database could consist of a single document for each team with all of the (or the latest) match/fan/tweet data in that one document. Then you write a view that extracts the sentiment (or whatever) from that secondary database.

Select rows with "one of each" in relational algebra

Say I have a Personstable with attributes {name, pet}. How do I select the names of people where they have one of each kind of pet (dog, cat, bird), but a person only has one of each kind of pet if they pet is in the table.
Example: Bob, Dog and Bob, Cat are the only rows in the table. Therefore, Bob has one of each kind of pet. But the moment Lynda, Bird are added, Bob doesn't have one of each type of pet anymore.
I think the first step to this is to π(pet). You get a list of all kinds of pets since relational algebra removes duplicates. Not sure what to do after this, but I have think I need to join π(pet) and Persons.
I've tried a few things like Natural Join and Cross products but I haven't arrived at a result yet and I'm out of ideas.
The answer to the question can be found with the Division operator:
Persons ÷ πpet(Persons)
This relational algebra expression returns a relation with only the column name, containing all the names of the persons that have all the different kind of pets currently present in the Persons table itself.
The division is an operator that, in some sense, is the inverse of the product operator (the name is derived exactly from this fact). It is a derived operator that can be defined in terms of projection, set difference and product (see for instance this answer).

Matching students to courses with course limit (Hungarian, Max Flow, Min-Cost-Flow, ...)

I am currently writing a program which maps students to courses. Currently, I am using a SAT-Solver, but I am trying to implement a polynomial time / non greedy algorithm which solves the following sub-problem:
There are students (50-150)
There are subjects (10-20), e.g. 'math', 'biology', 'art'
There are courses per subject (at least one), e.g. 'math-1', 'math-2', 'biology-1', 'art-1', 'art-2', 'art-3'
A student selects some (fixed) subjects (10-12) and for each subject the student has to be assigned to exactly one of the existing courses (if possible). It does not matter which course 'math-1' or 'math-2' is being selected.
The courses have a maximum number of allowed students (20-34)
Each course is in a fixed block (= timeslot 1 to 13)
A student may not be assigned to courses being in the same block
I am now describing what I have done so far.
(1) Ignoring the course-student-limit
I was able to solve this with the hungarian algorithm / bipartite matching. Each student may be computed individually by modelling it as following:
left nodes represent the subjects 'math', 'biology', 'art' (of the student)
right nodes represent the blocks '1', '2', .... '13'
an edge is inserted for each course from 'subject' to 'block'
This way the student is assigned for every subject to a course while not attending courses which are in the same block. But course-limits are ignored.
(2) Ignoring the selected subjects of the student
I was able to solve this with a max-flow-algorithm. For each student the following is modelled:
Layer 1: From source to each student with a flow of 13
Layer 2: From each student to his/her personal block with a flow of 1
Layer 3: From each student-block to each course in that block with flow 1
Layer 4: From each course to the sink with 'max-student-limit'
This way the student selects arbitrary courses and the course-limit is fullfilled. But he/she may be unlucky and be assigned to 'math-1', 'math-2' and 'math-3' ignoring the subjects 'biology' and 'art'.
(3) Greedy Hungarian
Another idea I had was to match one student at a time with the hungarian algorithm and adjusting the weights so that 'more empty courses' are preferred. For example one could model:
left nodes are subjects of the student
right nodes are blocks
for each course insert an edge from subject to the block of the course with weight = number of free seats
And then computing a Maximum-Weight-Matching.
I would really appreciate any suggestions / help.
Thank you!

Resources