Fourth Normal Form - normalization

Fourth Normal Form describes a relation that is in BCNF and also contains no non-trivial multivalued dependencies.
I am struggling to understand what trivial and non-trivial multivalued dependencies are, and how they differ. How do I identify the latter in order to normalize to 4NF?
EDIT:
I mainly need to know what the difference between a trivial and a non-trivial dependency is.

There is a fairly good example on Wikipedia: Fourth normal form. Is there any specific part you don't understand?
You might also want to look at Multivalued dependency.
UPDATE: so what is the difference between trivial and non trivial dependencies?
It depends if we are talking about functional or multivalued dependencies.
A trivial functional dependency X -> Y is one where Y is a subset of X. Since X -> Y means "Y can be determined from X", this is trivially true for any X and Y where Y is made up of attributes from X; obviously if we know X we can determine Y if it only contains stuff from X!
A trivial multivalued dependency X ->-> Y is one where Y is a subset of X, or where Y contains every attribute not in X (it may also contain attributes of X). Such a multivalued dependency holds in every relation over those attributes and is therefore trivial. For the second case, this follows from the definition of multivalued dependency:
denote by (x,y,z) the tuple having values for X, Y, R − X − Y collectively equal to x, y, z, correspondingly, then whenever the tuples (a,b,c) and (a,d,e) exist in r, the tuples (a,b,e) and (a,d,c) should also exist in r.
In a trivial multivalued dependency where Y contains every attribute not in X, the set R - X - Y is empty, so the requirement reduces to (0 being the empty set):
whenever the tuples (a,b,0) and (a,d,0) exist in r, the tuples (a,b,0) and (a,d,0) should also exist in r.
Which is obviously true.
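To make the definition concrete, here is a minimal sketch of a checker that tests whether X ->-> Y holds in a given relation instance. The function names (project, mvd_holds) and the restaurant/variety/area rows are made up for illustration; they are not taken from the answer above.

```python
from itertools import product

def project(row, names):
    """Return the values of `row` for the attribute names, as a tuple."""
    return tuple(row[a] for a in names)

def mvd_holds(rows, attrs, lhs, rhs):
    """Check lhs ->-> rhs on a relation instance given as a list of dicts."""
    x = [a for a in attrs if a in lhs]
    y = [a for a in attrs if a in rhs and a not in lhs]
    z = [a for a in attrs if a not in lhs and a not in rhs]  # R - X - Y
    present = {(project(r, x), project(r, y), project(r, z)) for r in rows}
    # Whenever (a,b,c) and (a,d,e) exist, (a,b,e) must exist as well.
    for (x1, y1, _), (x2, _, z2) in product(present, repeat=2):
        if x1 == x2 and (x1, y1, z2) not in present:
            return False
    return True

# Hypothetical data: one restaurant, two varieties, two delivery areas.
ATTRS = ["restaurant", "variety", "area"]
full = [dict(zip(ATTRS, ("R1", v, ar)))
        for v in ("thin", "thick") for ar in ("north", "south")]
broken = full[:-1]  # drop one combination

print(mvd_holds(full, ATTRS, {"restaurant"}, {"variety"}))            # True
print(mvd_holds(broken, ATTRS, {"restaurant"}, {"variety"}))          # False
print(mvd_holds(broken, ATTRS, {"restaurant"}, {"variety", "area"}))  # True
```

The last call is the trivial case: X and Y together cover every attribute, so R - X - Y is empty and the check passes on any instance, even the one where the non-trivial dependency is violated.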

X -> Y is trivial if and only if the right-hand side is a subset of the left-hand side.
X -> Y is non-trivial if Y is not contained in X.


Trying to understand expected value in Linear Regression

I'm having trouble understanding a lecture slide in my school's machine learning course.
Why does the expected value of Y equal f(X)? What does that mean?
My understanding is that X and Y are vectors, and f(X) outputs a vector Y where each individual value y_i corresponds to f(x_i), with x_i being the value in X at index i. But now it's taking the expected value of Y, which is going to be a single value, so how can that be equal to f(X)?
X, Y (uppercase) are vectors
x_i,y_i (lowercase with subscript) are scalars at index i in X,Y
There is a lot of confusion here. First, let's start with definitions.
Definitions
Expectation operator E[.]: Takes a random variable as input and gives a scalar/vector as output. Say Y is a normally distributed random variable with mean Mu and variance Sigma^{2}, usually stated as
Y ~ N( Mu , Sigma^{2} ); then E[Y] = Mu.
Function f(.): Takes a scalar/vector (not a random variable) and gives a scalar/vector. In this context it is an affine function, that is f(X) = a*X + b where a and b are fixed constants.
What's Going On
Now you can view linear regression from two angles.
Stats View
One angle assumes that your response variable Y is a normally distributed random variable because:
Y = a*X + b + epsilon
where
epsilon ~ N( 0 , sigma^2 )
and X follows some other distribution. We don't really care how X is distributed and treat it as given. In that case the conditional distribution is
Y|X ~ N( a*X + b , sigma^2 )
Notice that here a, b and also X are numbers; there is no randomness associated with them.
Maths View
The other view is the math view, where I assume that there is a function f(.) that governs the real-life process: if in real life I observe X, then f(X) should be the output. Of course this is not exactly the case, and the deviations are assumed to be due to various causes such as gauge error. The claim is that this function is linear:
f(X) = a*X + b
Synthesis
Now how do we combine these? Well, as follows:
E[Y|X] = a*X + b = f(X)
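The identity above can be checked numerically. What follows is only a rough sketch: the constants a = 2, b = 1, sigma = 1, the point x = 3 and the helper name sample_y_given_x are arbitrary choices for illustration. It fixes one value of X, draws many Y values at that X, and compares their average with f(X).

```python
import random
import statistics

def f(x, a=2.0, b=1.0):
    """The assumed affine function f(X) = a*X + b (a, b are illustrative)."""
    return a * x + b

def sample_y_given_x(x, n, sigma=1.0, seed=0):
    """Draw n values of Y = a*x + b + epsilon with epsilon ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    return [f(x) + rng.gauss(0.0, sigma) for _ in range(n)]

x = 3.0
ys = sample_y_given_x(x, n=20000)

# Each individual y is noisy, but the sample mean approximates E[Y | X = x].
print("sample mean of Y given X=3:", statistics.mean(ys))
print("f(3):", f(x))
```

Each draw of Y differs from f(x), yet the average settles near f(x); that is the sense in which the conditional expectation equals f(X).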
About your question: first, I would point out that it should be Y|X and not Y by itself.
Second, there are tons of possible ontological discussions over what each term here represents in real life. X, Y (uppercase) could be vectors. X, Y (uppercase) could also be random variables. A sample of these random variables might be stored in vectors, and both would be represented with uppercase letters (the best way is to use a different font for each). In that case, your sample becomes your data. Discussions about the general view of the model and its relevance to real life should be held at the random-variable level. Discussions about how to infer the parameters and how linear regression algorithms work should be held at the matrix and vector level. There could be other discussions where you should care about both.
I hope this rather unorganized answer helps you. In general, if you want to learn such material, be sure you know what kind of math objects and operators you are dealing with, what they take as input, and how they relate to real life.

Machine Learning: Why xW+b instead of Wx+b?

I started to learn Machine Learning. Now I have tried to play around with TensorFlow.
Often I see examples like this:
pred = tf.add(tf.mul(X, W), b)
I also saw such a line in a plain numpy implementation. Why is x*W+b always used instead of W*x+b? Is there an advantage if the matrices are multiplied in this way? I see that it is possible (if X, W and b are transposed), but I do not see an advantage. In math class at school we only ever used Wx+b.
Thank you very much
This is the reason:
By default w is a vector of weights, and in maths a vector is considered a column, not a row.
X is a collection of data. It is an n x d matrix (where n is the number of data points and d the number of features). Upper case X is the full n x d matrix; lower case x is a single data point, a 1 x d matrix.
To multiply both correctly and apply the correct weight to the correct feature you must use X*w+b:
With X*w you multiply every feature by its corresponding weight, and by adding b you add the bias term to every prediction.
If you multiply w*X you multiply a (d x 1) by an (n x d): the inner dimensions do not match (unless n = 1), so it makes no sense.
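The shape argument can be sketched in plain Python, without any library. The matmul helper and the numbers below are made up for illustration; X is n x d with n = 3 and d = 2, and w is a d x 1 column.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows, checking the shapes."""
    n, k = len(A), len(A[0])
    if len(B) != k:
        raise ValueError(
            f"shape mismatch: ({n} x {k}) * ({len(B)} x {len(B[0])})")
    m = len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

X = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]          # n = 3 data points, d = 2 features
w = [[0.5], [-1.0]]       # d x 1 column of weights
b = 0.25                  # scalar bias added to every prediction

# (n x d) * (d x 1) -> (n x 1): one prediction per data point.
pred = [[row[0] + b] for row in matmul(X, w)]
print(pred)  # [[-1.25], [-2.25], [-3.25]]

# (d x 1) * (n x d): inner dimensions 1 and 3 do not match.
try:
    matmul(w, X)
except ValueError as err:
    print("w*X fails:", err)
```

With the data stored one row per sample, X*w yields one row per prediction, which is why this ordering shows up so often in code.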
I'm also confused by this. I guess it may be a matter of dimensions. For an n x m matrix W and an n-dimensional vector x, xW+b can easily be viewed as mapping an n-dimensional feature to an m-dimensional feature, i.e., you can think of W as an n-dimension -> m-dimension operation, whereas Wx+b (x must be an m-dimensional vector now) becomes an m-dimension -> n-dimension operation, which looks less comfortable in my opinion. :D

How to parse a context-sensitive grammar?

A CSG is similar to a CFG, but the left-hand side of a production may consist of multiple symbols.
So, can I just use a CFG parser to parse a CSG, reducing a production's right-hand side to multiple terminals or non-terminals?
For example:
1. S → a bc
2. S → a S B c
3. c B → W B
4. W B → W X
5. W X → B X
6. B X → B c
7. b B → b b
When we meet W X, can we just reduce W X to W B?
When we meet W B, can we just reduce W B to c B?
So if a CSG parser is based on a CFG parser, it's not hard to write; is that true?
But when I checked the wiki, it said that to parse a CSG we should use a linear bounded automaton.
What is a linear bounded automaton?
Context-sensitive grammars are non-deterministic. So you cannot assume that a reduction will take place just because the RHS happens to be visible at some point in a derivation.
LBAs (linear-bounded automata) are also non-deterministic, so they are not really a practical algorithm. (You can simulate one with backtracking, but there is no convenient bound on the amount of time it might take to perform a parse.) The fact that they are acceptors for CSGs is interesting for parsing theory but not really for parsing practice.
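As a rough illustration of the exhaustive-search idea (not a practical parser), the sketch below explores sentential forms of the seven-rule grammar from the question breadth-first. It relies on one observation: no rule in that grammar shortens a string, so any form longer than the input can be discarded, which keeps the search finite. The names RULES and derives are made up for this example.

```python
from collections import deque

# The seven productions from the question, as (left side, right side) pairs.
RULES = [
    (("S",), ("a", "b", "c")),
    (("S",), ("a", "S", "B", "c")),
    (("c", "B"), ("W", "B")),
    (("W", "B"), ("W", "X")),
    (("W", "X"), ("B", "X")),
    (("B", "X"), ("B", "c")),
    (("b", "B"), ("b", "b")),
]

def derives(target, rules=RULES, start=("S",)):
    """Return True if `target` can be derived from the start symbol."""
    target = tuple(target)
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        for lhs, rhs in rules:
            # Try the rule at every position where its left side occurs.
            for i in range(len(form) - len(lhs) + 1):
                if form[i:i + len(lhs)] == lhs:
                    new = form[:i] + rhs + form[i + len(lhs):]
                    # Rules never shrink a form, so longer forms are dead ends.
                    if len(new) <= len(target) and new not in seen:
                        seen.add(new)
                        queue.append(new)
    return False

print(derives("aabbcc"))  # True
print(derives("aabbc"))   # False
```

The search tries every applicable rule at every position instead of committing to one reduction, which is exactly what a deterministic CFG-style reduce step would not do. The number of forms it may visit grows exponentially with the input length, so this only works for tiny inputs.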
Just as with CFGs, there are different classes of CSGs. Some restricted subclasses of CSGs are easier to parse (CFGs are one subclass, for example), but I don't believe there has been much investigation into practical uses; in practice, CSGs are hard to write, and there is no obvious analog of a parse tree which can be constructed from a derivation.
For more reading, you could start with the wikipedia entry on LBAs and continue by following its references. Good luck.

How to design an O(m) time algorithm to compute the shortest cycle of G(undirected unweighted graph) that contains s?

How to design an O(m) time algorithm to compute the shortest cycle of G(undirected unweighted graph) that contains s(s ∈ V) ?
You can run a BFS from your node s as the starting point; this will give you a BFS-tree. Afterwards you can build a lowest-common-ancestor (LCA) data structure on this BFS-tree. This can be done, for example, with Tarjan's lowest-common-ancestor algorithm. I will not go into details here. Given two nodes v and w, LCA lets you find the lowest node in a tree (the BFS-tree in our case) that has v and w as descendants. The idea is that when you are considering two nodes that are connected by an edge of G, you want to check whether their paths to the root (s in this case) plus the edge that connects them form a cycle (with s). This is the case if their LCA is s.
Assuming you have built the LCA, you run a second BFS. When expanding the neighbours of a node v, you also take into consideration the nodes already marked as explored. Suppose x is a neighbour of v such that x has already been explored. If the LCA of v and x is s, then the path from x to s and from v to s in the BFS-tree plus the edge xv forms a cycle. The first x and v that you encounter in your second BFS give you the desired result. If no such x exists, then s is not contained in any cycle.
The cycle is also the shortest containing s.
The two BFS runs take O(m) and the LCA construction can also be done in linear time, hence the whole procedure can be implemented in O(m).
This might be a bit of an overkill. There surely is a much simpler solution.
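One simpler variant, sketched here under the assumption that G is a simple graph (no self-loops or parallel edges) given as an adjacency dict, replaces the LCA structure with a branch label. During the BFS each vertex records which neighbour of s its tree path passes through; two vertices have LCA s exactly when their labels differ. Instead of stopping at the first such edge, the sketch takes the minimum of dist[u] + dist[v] + 1 over all of them. The function name shortest_cycle_through is made up for this example, and it returns only the length of the cycle.

```python
from collections import deque

def shortest_cycle_through(adj, s):
    """Length of the shortest cycle containing s, or None if there is none."""
    dist = {s: 0}
    branch = {s: None}  # which child of s the BFS-tree path goes through
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                branch[v] = v if u == s else branch[u]
                queue.append(v)

    best = None
    for u in dist:
        if u == s:
            continue
        for v in adj[u]:
            if v == s:
                continue
            # Different branches: the two tree paths meet only at s,
            # so together with the edge (u, v) they close a cycle through s.
            if branch[u] != branch[v]:
                length = dist[u] + dist[v] + 1
                if best is None or length < best:
                    best = length
    return best

square = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(shortest_cycle_through(square, 0))  # 4
```

Both passes touch every edge a constant number of times, so the running time is O(n + m), which is O(m) for a connected graph.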

Z3 tactic for finding partial assignment

Suppose that I have a formula F which contains the variables w, x, y, z.
Is there any Z3 tactic that finds a partial model of F, where the partial model must contain assignments for y and z? (I don't care about w and x.)
The hope is that, by applying this tactic, Z3 would spend less time finding the partial model than finding a full model.
Does such a tactic exist?
There is no built-in tactic for doing that.
It is not cheap to find the precise set of "don't cares".
Moreover, if w and x are really "don't cares", then they should not affect Z3's performance in a significant way.
