Is col X functionally dependent of col y? - normalization

I am trying to understand database normalisation. I saw this example of 2 Normal form which is not 3 normal forms
Tournament Year Winner Winner_Date_of_Birth
Indiana Invitational 1998 Al Fredrickson 21 July 1975
Cleveland Open 1999 Bob Albertson 28 September 1968
Des Moines Masters 1999 Al Fredrickson 21 July 1975
Indiana Invitational 1999 Chip Masterson 14 March 1977
Here the primary key is Tournament, Year. So no non primary key attribute is Functionally dependent on subset of primary, it is in 2NF.
How, acc to wikipedia, it is not in 3 NF because
Touranment, Year -> Winner and
Winner -> Winner_Date_Of_Birth
So there is a transitive property of Functional Dependency among keys. I understand this part, but what I would like to know is that, Since for our key
(Tournament,Year) there can only be one unique winner_date_of_birth, is it right to say that ( Touranment, Year ) -> Winner_Date_Of_Birth without using the transitive property above?

Yes, transitive means that you can derive A -> C from A -> B and B -> C.

Related

Minimum number of states in DFA

Minimum number states in the DFA accepting strings (base 3 i.e,, ternary form) congruent to 5 modulo 6?
I have tried but couldn't do it.
At first sight, It seems to have 6 states but then it can be minimised further.
Let's first see the state transition table:
Here, the states q0, q1, q2,...., q5 corresponds to the states with modulo 0,1,2,..., 5 respectively when divided by 6. q0 is our initial state and since we need modulo 5 therefore our final state will be q5
Few observations drawn from above state transition table:
states q0, q2 and q4 are exactly same
states q1, q3 and q5 are exactly same
The states which make transitions to the same states on the same inputs can be merged into a single state.
Note: Final and Non-final states can never be merged.
Therefore, we can merge q0, q2, q4 together and q1, q3 together leaving the state q5 aloof from collation.
The final Minimal DFA has 3 states as shown below:
Let's look at a few strings in the language:
12 = 1*3 + 2 = 5 ~ 5 (mod 6)
102 = 1*9 + 0*3 + 2 = 11 ~ 5 (mod 6)
122 = 1*9 + 2*3 + 2 = 17 ~ 5 (mod 6)
212 = 2*9 + 1*3 + 2 = 23 ~ 5 (mod 6)
1002 = 1*18 + 0*9 + 0*9 + 2 = 29 ~ 5 (mod 6)
We notice that all the strings end in 2. This makes sense since 6 is a multiple of 3 and the only way to get 5 from a multiple of 3 is to add 2. Based on this, we can try to solve the problem of strings congruent to 3 modulo 6:
10 = 3
100 = 9
120 = 15
210 = 21
1000 = 27
There's not a real pattern emerging, but consider this: every base-3 number ending in 0 is definitely divisible by 3. The ones that are even are also divisible by 6; so the odd numbers whose base-3 representation ends in 0 must be congruent to 3 mod 6. Because all the powers of 3 are odd, we know we have an odd number if the number of 1s in the string is odd.
So, our conditions are:
the string begins with a 1;
the string has an odd number of 1s;
the string ends with 2;
the string can contain any number of 2s and 0s.
To get the minimum number of states in such a DFA, we can use the Myhill-Nerode theorem beginning with the empty string:
the empty string can be followed by any string in the language. Call its equivalence class [e]
the string 0 cannot be followed by anything since valid base-3 representations don't have leading 0s. Call its equivalence class [0].
the string 1 must be followed with stuff that has an even number of 1s in it ending with a 2. Call its equivalence class [1].
the string 2 can be followed by anything in the language. Indeed, you can verify that putting a 2 at the front of any string in the language gives another string in the language. However, it can also be followed by strings beginning with 0. Therefore, its class is new: [2].
the string 00 can't be followed by anything to fix it; its class is the same as its prefix 0, [0]. same for the string 01.
the string 10 can be followed by any string with an even number of 1s that ends in a 2; it is therefore equivalent to the class [1].
the string 11 can be followed by any string in the language whatever; indeed, you can verify prepending 11 in front of any string in the language gives another solution. However, it can also be followed by strings beginning with 0. Therefore, its class is the same as [2].
12 can be followed by a string with an even number of 1s ending in 2, as well as by the empty string (since 12 is in fact in the language). This is a new class, [12].
21 is equivalent to 1; class [1]
22 is equivalent to 2; class [2]
20 is equivalent to 2; class [2]
120 is indistinguishable from 1; its class is [1].
121 is indistinguishable from [2].
122 is indistinguishable from [12].
We have seen no new equivalence classes on new strings of length 3; so, we know we have seen all the equivalence classes. They are the following:
[e]: any string in the language can follow this
[0]: no string can follow this
[1]: a string with an even number of 1s ending in 2 can follow this
[2]: same as [e] but also strings beginning with 0
[12]: same as [1] but also the empty string
This means that a minimal DFA for our language has five states. Here is the DFA:
[0]
^
|
0
|
----->[e]--2-->[2]<-\
| ^ |
| | |
1 __1__/ /
| / /
| | 1
V V |
[1]--2-->[12]
^ |
| |
\___0___/
(transitions not pictured are self-loops on the respective states).
Note: I expected this DFA to have 6 states, as Welbog pointed out in the other answer, so I might have missed an equivalence class. However, the DFA seems right after checking a few examples and thinking about what it's doing: you can only get to accepting state [12] by seeing a 2 as the last symbol (definitely necessary) and you can only get to state [12] from state [1] and you must have seen an odd number of 1s to get to [1]…
The minimum number of states for almost all modulus problems is the base of the modulus. The general strategy is one state for every modulus, as transitions between moduli are independent of what the previous numbers were. For example, if you're in state r4 (representing x = 4 (mod 6)), and you encounter a 1 as your next input, your new modulus is 4x6+1 = 25 = 1 (mod 6), so the transition from r4 on input 1 is to r1. You'll find that the start state and r0 can be merged, for a total of 6 states.

how to use seq2seq to decode concatenated string

Am trying to decode a concatenated String like below ...
SQCB7A750BATWE SQ CB 7 A 750 B A T WE
PT05A1219PY023 PT 05 A 12 19 P Y 023
PT55A1019PX02 PT 55 A 10 19 P X 02
PT33SE2215SW023 PT 33 SE 22 15 S W 023
PT05A2216PW023(LC) PT 05 A 22 16 P W 023 (LC)
am looking for a smarter way rather than hard-coded rules as the input will have variations(number of characters and digits), I came across SEQ2SEQ model and I want to know if it's possible to use it in such problem
I already followed some tutorials to get a taste of it, but the results weren't even close
it also seems there are 2 approaches character level and word level as per this tutorial
Character level:
Input sentence: SQCACA333BA71A
Decoded sentence: P 9(PDD366AZ2IDD4K )F)F(L)L)1)1)1) 6A
-
Input sentence: SQCAAC152DA71A
Decoded sentence: P 9(PDD366AZ2IDD4K )F)F(L)L)1)1)1) 6A
am still trying to implement the word level, but I'd like to know if the problem can be solved using this approach (seq2seq)

Estimating mortality with acmeR package

There is a relatively new package that has come out called acmeR for producing estimates of mortality (for birds and bats), and it takes into consideration things like bleedthrough (was the carcass still there but undetected, and then found in a later search?), diminishing searcher efficiency, etc. This is extremely useful, except I can't seem to get it to work, despite it seeming to be pretty straightforward. The data structure should be like:
Date, in US format mm/dd/yyyy or ISO 8601 format yyyy-mm-dd
Time, in am/pm US format HH:MM:SS AM or 24-hr ISO format HH:MM:SS
ID, arbitrary distinct alpha strings unique to each carcas
Species, arbitrary distinct alpha strings (e.g. AOU, ABMP, IBP)
Event, “Place”, “Check”, or “Search” (only 1st letter counts)
Found, TRUE or FALSE (only 1st letter counts)
and look something like this:
“Date”,“Time”,“ID”,“Species”,“Event”,“Found”
“1/7/2011”,“08:00:00 PM”,“T091”,“UNBA”,“Place”,TRUE
“1/8/2011”,“12:00:00 PM”,“T091”,“UNBA”,“Check”,TRUE
“1/8/2011”,“16:00:00 PM”,“T091”,“UNBA”,“Search”,FALSE
My data look like this:
Date: Date, format: "2017-11-09" "2017-11-09" "2017-11-09" ...
Time: Factor w/ 644 levels "1:00 PM","1:01 PM",..: 467 491 518 89 164 176 232 39 53 247 ...
Species: Factor w/ 52 levels "AMCR","AMKE",..: 31 27 39 27 39 31 39 45 27 24 ...
ID: Factor w/ 199 levels "GHBT001","GHBT002",..: 1 3 2 3 2 1 2 7 3 5 ...
Event: Factor w/ 3 levels "Check","Place",..: 2 2 2 3 3 3 1 2 1 2 ...
Found: logi TRUE TRUE TRUE FALSE TRUE TRUE ...
I have played with the date, time, event, etc formats too, trying multiple formats because I have had some issues there. So here are some of the errors I have encountered, depending on what subset of data I use:
Error in optim(p0, f, rd = rd, method = "BFGS", hessian = TRUE) :non-finite value supplied by optim In addition: Warning message: In log(c(a0, b0, t0)) : NaNs produced
Error in read.data(fname, spec = spec, blind = blind) : Expecting date format YYYY-MM-DD (ISO) or MM/DD/YYYY (USA) USA
Error in solve.default(rv$hessian): system is computationally singular: reciprocal condition number = 1.57221e-20
Warning message: # In sqrt(diag(Sig)[2]) : NaNs produced
Error in solve.default(rv$hessian) : Lapack routine dgesv: system is exactly singular: U[2,2] = 0
The last error is most common (and note, my data are non-numeric, sooooo... I am not sure what math is being done behind the scenes to come up with these equations, then fail in the solve), but the others are persistent too. Sometimes, despite the formatting being exactly the same in one dataset, a subset of that data will return an error when the parent dataset does not (does not have anything to do with species being there/not being there in one or the other dataset, as far as I can tell).
I cannot find any bug reports or issues with the acmeR package out there - so maybe it runs perfectly and my data are the problem, but after three ecologists have vetted the data and pronounced it good, I am at a dead end.
Here is a subset of the data, so you can see what they look like:
Date Time Species ID Event Found
8 2017-11-09 1:39 PM VATH GHBT007 P T
11 2017-11-09 2:26 PM CORA GHBT004 P T
12 2017-11-09 2:30 PM EUST GHBT006 P T
14 2017-11-09 6:43 AM CORA GHBT004 S T
18 2017-11-09 8:30 AM EUST GHBT006 S T
19 2017-11-09 9:40 AM CORA GHBT004 C T
20 2017-11-09 10:38 AM EUST GHBT006 C T
22 2017-11-09 11:27 AM VATH GHBT007 S F
32 2017-11-09 10:19 AM EUST GHBT006 C F

Lookup data from two ranges and if text matches, then assign numeric value

I am trying to calculate points in a Formula 1 racing league. I'm having trouble with a bonus 15 points if a constructor qualifies 1st and finishes the race 1st. The issue is there could be two different drivers who do this. For example. As you can see, HAM qualified 1st and ROS finished 1st in the race. Because they both drive for Mercedes, 15 points need to be awarded to Mercedes. The data can't be moved around as it's imported using an API (not in the example) but a copy of the layout can be found here
Qualifying Race Driver Team
14 1 ROS mercedes
1 15 HAM mercedes
3 3 VET ferrari
8 4 RIC red_bull
6 5 MAS williams
19 6 GRO haas
10 7 HUL force_india
16 8 BOT williams
7 9 SAI toro_rosso
5 10 VES toro_rosso
13 11 PAL renault
Put this in I2 and copy down. See if that is how you want it:
=IF(AND(VLOOKUP(1, $A$2:$H$12, 8, FALSE)=VLOOKUP(1, $B$2:$H$12, 7, FALSE), VLOOKUP(1, $B$2:$H$12, 7, FALSE)=H2, MATCH(H2, H:H, 0)=ROW(H2)), 15, 0)

Clustering unique datasets based on similarities (equality)

I just entered into the space of data mining, machine learning and clustering. I'm having special problem, and do not know which technique to use it for solving it.
I want to perform clustering of observations (objects or whatever) on specific data format. All variables in each observation is numeric. My data input looks like this:
1 2 3 4 5 6
1 3 5 7
2 9 10 11 12 13 14
45 1 22 23 24
Let's say that n represent row (observation, or 1D vector,..) and m represents column (variable index in each vector). n could be very large number, and 0 < m < 100. Also main point is that same observation (row) cannot have identical values (in 1st row, one value could appear only once).
So, I want to somehow perform clustering where I'll put observations in one cluster based on number of identical values which contain each row/observation.
If there are two rows like:
1
1 2 3 4 5
They should be clustered in same cluster, if there are no match than for sure not. Also number of each rows in one cluster should not go above 100.
Sick problem..? If not, just for info that I didn't mention time dimension. But let's skip that for now.
So, any directions from you guys,
Thanks and best regards,
JDK
Its hard to recommend anything since your problem is totally vague, and we have no information on the data. Data mining (and in particular explorative techniques like clustering) is all about understanding the data. So we cannot provide the ultimate answer.
Two things for you to consider:
1. if the data indicates presence of species or traits, Jaccard similarity (and other set based metrics) are worth a try.
2. if absence is less informative, maybe you should be mining association rules, not clusters
Either way, without understanding your data these numbers are as good as random numbers. You can easily cluster random numbers, and spend weeks to get the best useless result!
Can your problem be treated as a Bag-of-words model, where each article (observation row) has no more than 100 terms?
Anyway, I think your have to give more information and examples about "why" and "how" you want to cluster these data. For example, we have:
1 2 3
2 3 4
2 3 4 5
1 2 3 4
3 4 6
6 7 8
9 10
9 11
10 12 13 14
What is your expected clustering? How many clusters are there in this clustering? Only two clusters?
Before you give more information, according to you current description, I think you do not need a cluster algorithm, but a structure of connected components. The first round you process the dataset to get the information of connected components, and you need a second round to check each row belong to which connected components. Take the example above, first round:
1 2 3 : 1 <- 1, 1 <- 2, 1 <- 3 (all point linked to the smallest point to
represent they are belong to the same cluster of the smallest point)
2 3 4 : 2 <- 4 (2 and 3 have already linked to 1 which is <= 2, so they do
not need to change)
2 3 4 5 : 2 <- 5
1 2 3 4 : 1 <- 4 (in fact this change are not essential because we have
1 <- 2 <- 4, but change this can speed up the second round)
3 4 6 : 3 <- 6
6 7 8 : 6 <- 7, 6 <- 8
9 10 : 9 <- 9, 9 <- 10
9 11 : 9 <- 11
10 11 12 13 14 : 10 <- 12, 10 <- 13, 10 <- 14
Now we have a forest structure to represent the connected components of points. The second round you can easily pick up one point in each row (the smallest one is the best) and trace its root in the forest. The rows which have the same root are in the same, in your words, cluster. For example:
1 2 3 : 1 <- 1, cluster root 1
2 3 4 5 : 1 <- 1 <- 2, cluster root 1
6 7 8 : 1 <- 1 <- 3 <- 6, cluster root 1
9 10 : 9 <- 9, cluster root 9
10 11 12 13 14 : 9 <- 9 <- 10, cluster root 9
This process takes O(k) space where k is the number of points, and O(nm + nh) time, where r is the height of the forest structure, where r << m.
I am not sure if this is the result you want.

Resources