How is cardinality described for an arrowed line? - entity-relationship

There is an E.R. diagram like below.
My questions are:
1. What is the cardinality for the arrowed line between “teacher” and “offer”? (Or is it even possible to talk about cardinality for this diagram?)
2. What does this arrow mean about the relation of “teacher” and “course”? (I found out that it shows the relation’s reading direction, left to right, and means “teachers offer courses”. Is that true?)
3. Why is this arrowed line between “teacher” and “offer”, but not between “offer” and “course”? Is there any difference between these situations? If yes, what are the differences?
4. What is the notation type of this arrowed line according to the third link below (or according to any other source)? If it is “Shlaer/Mellor”, why is the arrowed line in bold style?
I examined the related questions and links below, but I'm really confused.
entity relationship diagram
One-to many relationships in ER diagram
Class Diagrams
Entity Relationship Diagram

As far as I know, your diagram uses a mixed notation. It's mostly Chen's notation, but Chen used 1 to indicate unique constraints on a component of a relationship, and variables (M, N, P...) to indicate multiple possible occurrences. Different variables were used per relationship so that a numerical correspondence between different roles wasn't accidentally implied.
Some online sources (e.g. this one) show that an arrow represents a unique constraint, and a bold line indicates total participation. In your diagram, that would mean a teacher must occur exactly once - each teacher offers a single course.
In some examples the arrow is reversed without a change in meaning. It's also possible that the author of your diagram only meant to indicate a preferred reading direction. Without a reference or explanation, we can't be sure.
If the arrow was used to indicate a unique constraint, then it matters on which role it's used. An arrow between offer and course would mean each course can be offered by only a single teacher. An arrow on both roles would indicate a one-to-one relationship.
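In relational terms (a hypothetical translation, with invented table and column names), the three cases correspond to which side of the relationship table carries the key:

    -- Sketch only: the "offer" relationship as a table.
    CREATE TABLE Offer (
        TeacherId INT NOT NULL REFERENCES Teacher,
        CourseId  INT NOT NULL REFERENCES Course,
        PRIMARY KEY (TeacherId)    -- arrow on teacher: each teacher offers at most one course
        -- PRIMARY KEY (CourseId) instead: each course is offered by at most one teacher
        -- UNIQUE on each column separately: a one-to-one relationship
    );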
Whether this notation was adopted from data structure diagrams, the Shlaer-Mellor method, or just for a more visual indication of cardinality, I don't know.

Related

How to infer surprisingly "missing" data in a large set of strings. Or nerds do unique (but sane) baby names

I was thinking about this problem the other day when trying to find an available email address for my very common name.
Let's say I had all the names of the roughly 150 million men in the United States in a file, and I wanted to figure out "men who don't exist but sound like they should". That is, I wanted to figure out a combination of names (First, Middle, Last) that exist without a person being named that combination in my record of all names. Let's say I appreciate the advantages of unique names but don't want any of the disadvantages of unfamiliarity and mispronunciation.
Of course I could make up a name like "Nickleback Sunshine Cheeseburger" and reasonably suspect that nobody is named that combination, but that may confuse people, so I want names that exist in the set. A name like "Chao-Lin", although it might appear with the last name "Jones", would be less likely to, and would seem more consistent next to a last name of similar language origin, as in "Chao-Lin Kuo". José is more likely to appear with Gonzalez than with Patel, and so on.
Of course, any of these notions would have to be reinforced by the structure of the data.
So an example would be: if "John Marcus Black" doesn't exist, that would be interesting, because every part of the name is common and the parts appear together frequently, just not in that order.
The first thing that came to mind was some sort of trie or directed graph weighted by frequency, but that only really works for an autocomplete-like feature, where what we are looking for is not actually present in the set. I was thinking about suffix trees as well, but I'm not sure this is a good use case.
I'm sure there is a machine learning algorithm that would be suited to finding these names, but I don't know very many.
Bonus: the most normal unique name given a required last name. Given a starting name like "Smith", come up with the most surprising missing names.
tl;dr: Given all the names of men in the US in a file, find n names that probably should exist but don't. Also: some men have middle names, some don't.
The obvious choice would be character-level Markov chains.
That won't prevent the generation of existing names, or of profanity, though. E.g., it might combine FUnk and niCK.
You could then rank the results by some surprisingness measure, e.g. based on character bigram/trigram frequencies.
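A minimal sketch of that idea in Python (the names.txt file, the context length, and the absence of a profanity filter are my assumptions, not part of the answer):

    # Character-level Markov chain over full names, a sketch.
    # Assumes names.txt holds one "First Middle Last" name per line.
    import random
    from collections import defaultdict

    ORDER = 3  # context length in characters; an assumed tuning choice

    def train(names, order=ORDER):
        # Map each character context to the list of observed next characters.
        model = defaultdict(list)
        for name in names:
            padded = "^" * order + name + "$"   # ^ pads the start, $ marks the end
            for i in range(len(padded) - order):
                model[padded[i:i + order]].append(padded[i + order])
        return model

    def generate(model, order=ORDER, max_len=40):
        # Random-walk the chain from the start context until the end symbol.
        out, ctx = [], "^" * order
        while len(out) < max_len:
            nxt = random.choice(model[ctx])
            if nxt == "$":
                break
            out.append(nxt)
            ctx = ctx[1:] + nxt
        return "".join(out)

    existing = {line.strip() for line in open("names.txt")}
    model = train(existing)
    # Generate candidates and keep only those absent from the data.
    candidates = {generate(model) for _ in range(10000)}
    missing = [c for c in candidates if c and c not in existing]
    print(missing[:20])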

Unknown symbol in ERD

I was searching for the meaning of the symbol marked in red in the image below, but I didn't find anything.
So, do you guys know what it means?
It indicates a supertype/subtype relationship. The notation is used in Microsoft Visio, where it's called a category. A double horizontal line is used to indicate a complete category.
In your image, Jurusan records information about the parent/supertype while Animasi, TKJ, RPL and Otomotif describe children/subtypes.
Here's a video on the topic.
Supertype/subtype hierarchies
It does seem to represent supertypes and subtypes. Often the circle will have either a d or an o inside. The d stands for "disjoint" or non-overlapping, and the o stands for "overlapping". These are constraints. A d means that an instance can appear in only one subtype. An o implies that an instance can be in multiple subtypes. In your example, I'm assuming the supertype instance in question would come from your Colon Siswa table.
Since no letter is included in the circle, we do not know which constraint applies. As for the lines beneath the circle, there are two options: one line (partial completeness) or two lines (total completeness). The completeness constraint identifies whether an instance of the supertype has to be in one of the subtypes. A single line implies that a supertype instance does not have to be in a subtype, while a double line implies that an instance HAS to be in a subtype.
Supertypes and subtypes are sometimes organized in a specialization hierarchy. The ER model is sometimes called an "enhanced" or "extended" ER Model or EER in that case.
The model you provided is also missing the subtype discriminator. The subtype discriminator lets us know which attribute (field) determines which subtype an instance belongs to. Since you're familiar with the language your model was created in, it may be obvious how it will be determined which instances from the supertype (parent entity) fall into which subtypes (child entities).
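For illustration, here is a minimal SQL sketch of a disjoint supertype/subtype pair with a discriminator (the table names come from your diagram; the columns are invented):

    -- Supertype with an invented discriminator column 'tipe'.
    CREATE TABLE Jurusan (
        jurusan_id INT PRIMARY KEY,
        nama       VARCHAR(100) NOT NULL,
        tipe       VARCHAR(10) NOT NULL
            CHECK (tipe IN ('Animasi', 'TKJ', 'RPL', 'Otomotif'))
    );

    -- One table per subtype; sharing the key makes it an "is a" relationship.
    CREATE TABLE Animasi (
        jurusan_id INT PRIMARY KEY REFERENCES Jurusan (jurusan_id),
        studio     VARCHAR(50)   -- invented subtype-specific attribute
    );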
We sometimes call the relationships shown on a specialization hierarchy "is a" relationships.
You can find examples of these relationships in the textbook below:
Coronel, C. (2017). Database Systems: Design, Implementation, & Management, 12th Edition. [VitalSource Bookshelf Online]. Retrieved from https://bookshelf.vitalsource.com/#/books/9781337509596/
or the following reading room...not my own, but it is an example from the text above.
Reading Room Notes
Looks like a cardinality symbol; I think it's the one-to-many symbol.

What's the optimal structure for a multi-domain sentence/word graph in Neo4j?

I'm implementing abstractive summarization based on this paper, and I'm having trouble deciding the best way to implement the graph so that it can be used for multi-domain analysis. Let's start with Twitter as an example domain.
For every tweet, each sentence would be graphed like this (ex: "#stackoverflow is a great place for getting help #graphsftw"):
(#stackoverflow)-[next]->(is)
-[next]->(a)
-[next]->(great)
-[next]->(place)
-[next]->(for)
-[next]->(getting)
-[next]->(help)
-[next]->(#graphsftw)
This would yield a graph similar to the one outlined in the paper:
To have a kind of domain layer for each word, I'm adding them to the graph like this (with properties including things like part of speech):
MERGE (w:Word:TwitterWord {orth: "word" }) ON CREATE SET ... ON MATCH SET ...
In the paper, they set a property on each word {SID:PID}, which describes the sentence id of the word (SID) and also the position of each word in the sentence (PID); so in the example sentence "#stackoverflow" would have a property of {1:1}, "is" would be {1:2}, "#graphsftw" {1:9}, etc. Each subsequent reference to the word in another sentence would add an element to the {SID:PID} property array: [{1:x}, {n:n}].
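In Cypher, maintaining that per-node array might look like this (the sid_pid property name is my invention; the paper's {SID:PID} pairs are stored as strings here):

    MERGE (w:Word:TwitterWord {orth: "is"})
    ON CREATE SET w.sid_pid = ["1:2"]
    ON MATCH SET w.sid_pid = w.sid_pid + "1:2"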
Keeping sentence and positional information as an array of elements inside a property of each node doesn't seem efficient, especially when dealing with multiple word-domains and sub-domains within each word layer.
For each word layer or domain like Twitter, what I want to do is get an idea of what's happening around specific domain/layer entities like mentions and hashtags; in this example, #stackoverflow and #graphsftw.
What is the best way to add subdomain layers on top of, for example, a 'Twitter' layer, such that different words are directed towards specific domain entities like #hashtags and #mentions? I could use a separate label for each subdomain, like :Word:TwitterWord:Stackoverflow, but that would give my graph a ton of separate labels.
If I include the subdomain entities in a node property array, then it seems like traversal would become an issue.
Since all tweets and extracted entities like #mentions and #hashtags are being graphed as nodes/vertices prior to the word-graph step, I could have edges going from #hashtags and #mentions to words. Or, I could have edges going from tweets to words with the entities as an edge property. Basically, I'm looking for a structure that is the "cheapest" in terms of both storage and traversal.
Any input on how generally to structure this graph would be greatly appreciated. Thanks!
You could also put the domains / positions on the relationships (and perhaps also add a source-id).
OTOH you can also infer that information as long as your relationships represent the original sentence.
You could then either aggregate the relationships dynamically to compute the strengths or have a separate "composite" relationship that aggregates all the others into a counter or sum.
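A sketch of that alternative in Cypher (the NEXT relationship type and the property names are assumptions, not from the paper):

    // Sentence id and position live on the relationship, not the nodes.
    MATCH (w1:Word {orth: "#stackoverflow"}), (w2:Word {orth: "is"})
    CREATE (w1)-[:NEXT {sid: 1, pid: 1, source: "tweet-42"}]->(w2)

    // Strength computed by aggregating the parallel relationships.
    MATCH (a:Word)-[r:NEXT]->(b:Word)
    RETURN a.orth, b.orth, count(r) AS strength
    ORDER BY strength DESC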

Modeling arrows/relationships as nodes in Neo4j

Relationships/arrows in Neo4j cannot get more than one type/label (see here, and here). I have a data model where edges need to get labels and (probably) properties. If I decide to use Neo4j (instead of OrientDB, which supports labeled arrows), I think I would then have two options to model an arrow, say f, between two nodes A and B:
1) encode an arrow f as a span, say A<--f-->B, such that f is also a node and --> and <-- are arrows.
or
2) encode an arrow f as A --> f -->B, such that f is a node again and two --> are arrows.
Though this seems to add unnecessary complexity to my data model, there does not seem to be any other option at the moment if I want to use Neo4j. So I am trying to see which of the above encodings might fit better with my queries (queries are the core of my system). To do so, I need to resort to examples. So I have two questions:
First Question:
part 1) I have nodes labeled Person and father, and there are arrows between them like Person<-[:sr]-father-[:tr]->Person in order to model who is the father of whom (tr is the father of sr). For a given person p1, how can I get all of his ancestors?
part 2) If I instead had the structure Person-[:sr]->father-[:tr]->Person for modeling the father relationship, what would the same query look like?
This is answered here when father is considered a simple relationship (instead of being encoded as a node).
Second Question:
part 1) I have nodes labeled A, each with a property p1. I want to query the A nodes, get those elements where p1 < 5, and then create the following structure: for each a1 in the query result, create qa1<-[:sr]-isA-[:tr]->a1 such that isA and qa1 are nodes.
part 2) What if I wanted to create qa1-[:sr]->isA-[:tr]->qa1 instead?
This question is answered here when isA is considered a simple arrow (instead of being modeled as a node).
First, some terminology: relationships don't have labels, they only have types. And yes, one type per relationship.
Second, relative to modeling, I think the direction of the relationship isn't always super important, since with neo4j you can traverse it both ways easily. So the difference between A-->f-->B and A<--f-->B should be driven entirely by what makes sense semantically for your domain, nothing else. So your options (1) and (2) at the top seem the same to me in terms of overall complexity, which brings me to point #3:
Your main choice is between making a complex relationship into a node (which I think we're calling f here) or keeping it as a relationship. Making "a relationship into a node" is called reification and I think it's considered a fairly standard practice to accommodate a number of modeling issues. It does add complexity (over a simple relationship) but adds flexibility. That's a pretty standard engineering tradeoff everywhere.
So with all of that said, for your first question I wouldn't recommend an intermediate node at all. :father is a very simple relationship, and I don't see why you'd ever need more than one type on it. So for question one, I would pick "neither of the options you list" and would instead model it as (personA)-[:father]->(personB). Simpler. You'd query that by saying
MATCH (personA { name: "Bob"})-[:father]->(bobsDad) RETURN bobsDad
Yes, you could model this as (personA)-[:sr]->(fatherhood)-[:tr]->(personB), but I don't see how this gains you much. As for the relationship direction, again it doesn't matter for performance or querying, only for the semantics of whatever :tr and :sr are supposed to mean.
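And since part 1 asks for all ancestors, not just the father, a variable-length match over that simple relationship does it (a sketch, assuming the name property):

    MATCH (p:Person { name: "Bob" })-[:father*]->(ancestor)
    RETURN DISTINCT ancestor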
I have nodes labeled as A nodes with the property p1 for each. I want
to query A nodes, get those elements that p1<5, then create the
following structure: for each a1 in the query result I create
qa1<-[:sr]-isA-[:tr]->a1 such that isA and qa1 are nodes.
That's this:
MATCH (aNode:A)
WHERE aNode.p1 < 5
WITH aNode
MATCH (qa1 { label: "some qa1 node" })
CREATE (qa1)<-[:sr]-(isA)-[:tr]->(aNode);
Note that you'll need to adjust the criteria for qa1 and also specify something meaningful for isA.
What if I wanted to create qa1-[:sr]->isA-[:tr]->qa1 instead?
It should be trivial to modify the query above: just change the direction of the arrows; it's otherwise the same query.
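With the sketch above, that last line would become:

    CREATE (qa1)-[:sr]->(isA)-[:tr]->(aNode);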

Would like to Understand 6NF with an Example

I have just read #PerformanceDBA's arguments re: 6NF and E-A-V. I am intrigued. I had previously been skeptical of 6NF as it was presented as "merely" sticking some timestamp columns on tables.
I have always worked with a data dictionary and do not need to be convinced to use one, or to generate SQL code. So I expect an answer that would require a dictionary (or catalog) that is used to generate code.
So I would like to know how 6NF would deal with an extremely simple example. A table of items, descriptions and prices. The prices change over time.
So anyway, what does the Items table look like when converted to 6NF? What is the "explosion of tables" that happens here?
If the example does not work with a table this simple, feel free to add what is necessary to get the point across.
I actually started putting an answer together, but I ran into complications, because you (quite understandably) want a simple example. The problem is manifold.
First, I don't have a good idea of your level of actual expertise re Relational Databases and 5NF; I don't have a starting point from which to take up and then discuss the specifics of 6NF.
Second, just like any of the other NFs, it is variegated. You can just barely step into it; you can implement 6NF for certain tables; you can go the whole hog on every table, etc. Sure, there is an explosion of tables, but then you Normalise that, and kill the explosion; that's an advanced or mature implementation of 6NF. There's no use providing the full or partial levels of 6NF when you are asking for the simplest, most straightforward example.
I trust you understand that some tables can be "in 5NF" while others are "in 6NF".
So I put one together for you. But even that needs explanation.
Now, SQL barely supports 5NF, and it does not support 6NF at all (I think dportas says the same thing in different words). Now, I implement 6NF at a deep level, for performance reasons, simplified pivoting (of entire tables; any and all columns, not the silly PIVOT function in MS), columnar access, etc. For that you need a full catalogue, which is an extension to the SQL catalogue, to support the 6NF that SQL does not support, and to maintain data Integrity and business Rules. So, you really do not want to implement 6NF for fun; you only do that if you have a need, because you have to implement a catalogue. (This is what the EAV crowd do not do, and this is why most EAV systems have data integrity problems. Most of them do not use the declarative Referential & Data Integrity that SQL does have.)
But most people who implement 6NF don't implement the deeper level, with a full catalogue. They have simpler needs, and thus implement a shallower level of 6NF. So, let's take that, to provide a simple example for you. Let's start with an ordinary Product table that is declared to be in 5NF (and let's not argue about what 5NF is). The company sells various different kinds of Products, half the columns are mandatory, and the other half are optional, meaning that, depending on the Product Type, certain columns may be Null. While they may have done a good job with the database, the Nulls are now a big problem: columns that should be Not Null for certain ProductTypes are Null, because the declaration states NULL, and their app code is only as good as the next guy's.
So they decide to go with 6NF to fix that problem, because the subtitle of 6NF states that it eliminates The Null Problem. Sixth Normal Form is the irreducible Normal Form, there will be no further NFs after this, because the data cannot be Normalised further. The rows have been Normalised to the utmost degree. The definition of 6NF is:
a table is in 6NF when the row contains the Primary Key and, at most, one attribute.
Notice that by that definition, millions of tables across the planet are already in 6NF, without having had that intent. Eg. typical Reference or Look-up tables, with just a PK and Description.
Right. Well, our friends look at their Product table, which has eight non-key attributes, so if they make the Product table 6NF, they will have eight sub-Product tables. Then there is the issue that some columns are Foreign Keys to other tables, and that leads to more complications. And they note the fact that SQL does not support what they are doing, and they have to build a small catalogue. Eight tables are correct, but not sensible. Their purpose was to get rid of Nulls, not to write a little subsystem around each table.
Simple 6NF Example
Readers who are unfamiliar with the Standard for Modelling Relational Databases may find IDEF1X Notation useful in order to interpret the symbols in the example.
So typically, the Product Table retains all the Mandatory columns, especially the FKs, and each Optional column, each Nullable column, is placed in a separate sub-Product table. That is the simplest form I have seen. Five tables instead of eight. In the Model, the four sub-Product tables are "in 6NF"; the main Product table is "in 5NF".
Now we really do not need every code segment that SELECTs from Product to have to figure out what columns it should construct, based on the ProductType, etc, so we supply a View, which essentially provides the 5NF "view" of the Product table cluster.
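A minimal sketch of that shape (all names invented; the real cluster has four sub-Product tables, and the FK to ProductType is omitted):

    CREATE TABLE Product (
        ProductCode CHAR(10) NOT NULL PRIMARY KEY,
        ProductType CHAR(2)  NOT NULL,   -- FK to ProductType, omitted here
        Name        VARCHAR(60) NOT NULL
    );

    -- One optional column moved to its own 6NF sub-table: no Nulls anywhere.
    CREATE TABLE ProductWeight (
        ProductCode CHAR(10) NOT NULL PRIMARY KEY REFERENCES Product,
        Weight      DECIMAL(8,2) NOT NULL
    );

    -- The 5NF "view" of the cluster, for code that wants one wide row.
    CREATE VIEW Product_v AS
    SELECT p.ProductCode, p.ProductType, p.Name, pw.Weight
    FROM   Product p
           LEFT JOIN ProductWeight pw ON pw.ProductCode = p.ProductCode;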
The next thing we need is the basic rudiments of an extension to the SQL catalog, so that we can ensure that the rules (data integrity) for the various ProductTypes are maintained in one place, in the database, and not dependent on app code. The simplest catalogue you can get away with. That is driven off ProductType, so ProductType now forms part of that Metadata. You can implement that simple structure without a catalogue, but I would not recommend it.
Update
It is important to note that I implement all Business Rules in the database. Otherwise it is not a database (the notion of implementing rules "in application code" is hilarious in the extreme, especially nowadays, when we have florists working as "developers"). Therefore all rules, etc are first and foremost implemented as SQL declarations, CHECK constraints, functions, etc. That preserves all Declarative Referential Integrity, and declarative Data Integrity. The extension to the SQL catalog covers the area that SQL does not have declarations for, and they are then implemented as SQL. Being a good data dictionary, it does much more. Eg. I do not write Views every time I change the tables or add or change columns or their characteristics, they are created directly from the catalog+extension using a simple code generator.
One more very important note. You cannot implement 6NF (or EAV properly, for that matter), without completing a full and faithful Normalisation exercise, to 5NF. The problem I see at every site is, they don't have a genuine 5NF state, they have a mish-mash of partial normalisation or no normalisation at all, but they are very attached to that. Creating either 6NF or EAV from that is a disaster. Creating EAV or 6NF from that without all business rules implemented in declarative SQL is a nuclear disaster, burning for years. You get what you pay for.
End update.
Finally, yes, there are at least four further levels of Normalisation (Normalisation is a Principle, not a mere reference to a Normal Form) that can be applied to that simple 6NF Product cluster, providing more control, fewer tables, etc. The deeper we go, the more extensive the catalogue, and the higher the levels of performance. When you are ready, just ask; I have already erected the models and posted details in other answers.
In a nutshell, 6NF means that every relation consists of a candidate key plus no more than one other (key or non-key) attribute. To take up your example, if an "item" is identified by a ProductCode and the other attributes are Description and Price then a 6NF schema would consist of two relations (* denotes the key in each):
ItemDesc {ProductCode*, Description}
ItemPrice {ProductCode*, Price}
This is potentially a very flexible approach because it minimises the dependencies. That's also its main disadvantage however, especially in a SQL database. SQL makes it hard or impossible to enforce many multi-table constraints. Using the above schema, in most cases it will not be possible to enforce a business rule that every product must always have a description AND a price. Similarly, you may not be able to enforce some compound keys that ought to apply (because their attributes could be split over multiple tables).
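In SQL terms the decomposition is simply this (a sketch; the comment marks the rule that plain SQL cannot declare):

    CREATE TABLE ItemDesc (
        ProductCode CHAR(10) NOT NULL PRIMARY KEY,
        Description VARCHAR(100) NOT NULL
    );

    CREATE TABLE ItemPrice (
        ProductCode CHAR(10) NOT NULL PRIMARY KEY,
        Price       DECIMAL(10,2) NOT NULL
    );
    -- Nothing here forces a ProductCode in ItemDesc to also appear in
    -- ItemPrice (or vice versa); that rule needs triggers or a catalogue.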
So in considering 6NF you have to weigh up which dependencies and integrity rules are important to you. In many cases you may find it more practical and useful to stick to 5NF and normalize no further than that.
I had previously been skeptical of 6NF
as it was presented as "merely"
sticking some timestamp columns on
tables.
I'm not quite sure where this apparent misconception comes from. Perhaps from the fact that 6NF was introduced in the book "Temporal Data and the Relational Model" by Date, Darwen and Lorentzos? Anyhow, I hope the other answers here have clarified that 6NF is not limited to temporal databases.
The point I wanted to make is that although 6NF is "academically respectable" and always achievable, it may not necessarily lead to the optimal design in every case (and not just when considering implementation in SQL either). Even the aforementioned discoverers and proponents of 6NF seem to agree, e.g.:
Chris Date: "For practical purposes, stick to 5NF (and 6NF)."
Hugh Darwen: "the 6NF decomposition around Date [not the person!] would be overkill... an optimal design for the soccer club is... 5-and-a-bit-NF!"
Hugh Darwen: "we are in 5NF but not in 6NF, and again 5NF is sufficient" (several similar examples).
Then again, I can also find evidence to the contrary:
Chris Date: "Darwen and I have both felt for some time that all base relvars should be in 6NF".
On a practical note, I recently extended the SQL schema of one of our products to add a minor feature. I adopted 6NF to avoid nullable columns and ended up with six new tables where most (all?) of my colleagues would have used one table (or perhaps extended an existing table) with nullable columns. Despite me providing several 'helper' stored procs and a 'denormalized' VIEW with INSTEAD OF triggers, every coder that has had to work with this feature at the SQL level has gone out of their way to curse me :)
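For readers unfamiliar with that trick: the view presents the 6NF tables as one wide table, and INSTEAD OF triggers route writes back to the underlying tables. A sketch in T-SQL with invented names, not the actual schema:

    CREATE VIEW FeatureWide AS
    SELECT a.Id, a.Alpha, b.Beta
    FROM   FeatureAlpha a
           LEFT JOIN FeatureBeta b ON b.Id = a.Id;
    GO
    CREATE TRIGGER FeatureWide_ins ON FeatureWide INSTEAD OF INSERT AS
    BEGIN
        INSERT INTO FeatureAlpha (Id, Alpha)
        SELECT Id, Alpha FROM inserted;
        -- Only create the optional row when a value was actually supplied.
        INSERT INTO FeatureBeta (Id, Beta)
        SELECT Id, Beta FROM inserted WHERE Beta IS NOT NULL;
    END;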
These guys have it down: Anchor Modeling. Great academic papers on the subject, combined with practical examples. Their writings have finally pushed me over the edge to consider building a DW in 6NF on an upcoming project. The POC work I have done has validated (for me, at least) that the enormous benefits of 6NF outweigh the costs.
