Lossy OR Lossless Decomposition - normalization

Consider the relation R(A,B,C,D,E) with the set of FDs F = {A->C, B->C, C->D, DE->C, CE->A}.
Suppose the relation has been decomposed into the relations R1(A,D), R2(A,B), R3(B,E), R4(C,D,E), R5(A,E).
Is this decomposition lossy or lossless?
I tried solving this question using the matrix method and I get the answer as lossless, because I managed to get a row in the 5x5 matrix filled entirely with 'a' symbols. However, the book from which I am solving gives the answer as lossy. Which one is the correct answer?

It is a lossless decomposition for sure: the row corresponding to R3 gets filled entirely with 'a' symbols.
As an aside, if the above decomposition was obtained using Bernstein synthesis, then just checking whether any of the decomposed relations contains all the attributes of a key of the original relation R is enough to guarantee a lossless decomposition. For example, BE is a key of the relation R in the example above. The decomposed relation R3 contains both prime attributes B and E, and hence the decomposition is lossless.
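For anyone who wants to check this mechanically, here is a rough Python sketch of the matrix (chase) method, using the FDs and decomposition from the question; the symbol-handling details are one reasonable choice among several, not taken from any particular textbook's pseudocode.

```python
# Attributes, decomposition, and FDs from the question above.
R = ['A', 'B', 'C', 'D', 'E']
decomposition = [set('AD'), set('AB'), set('BE'), set('CDE'), set('AE')]
fds = [(set('A'), set('C')), (set('B'), set('C')), (set('C'), set('D')),
       (set('DE'), set('C')), (set('CE'), set('A'))]

# Row i, column j holds the distinguished symbol 'a_j' if attribute j
# appears in R_i, and a row-specific symbol 'b_i_j' otherwise.
table = [{a: (f'a_{a}' if a in ri else f'b_{i}_{a}') for a in R}
         for i, ri in enumerate(decomposition)]

changed = True
while changed:
    changed = False
    for lhs, rhs in fds:
        # Group rows that agree on every LHS attribute, then equate
        # their RHS symbols, preferring the distinguished 'a' symbol.
        groups = {}
        for row in table:
            key = tuple(row[a] for a in sorted(lhs))
            groups.setdefault(key, []).append(row)
        for rows in groups.values():
            for a in rhs:
                symbols = {row[a] for row in rows}
                target = f'a_{a}' if f'a_{a}' in symbols else min(symbols)
                for row in rows:
                    if row[a] != target:
                        row[a] = target
                        changed = True

# The decomposition is lossless iff some row ends up all-distinguished.
print('lossless' if any(all(row[a] == f'a_{a}' for a in R) for row in table)
      else 'lossy')   # prints: lossless (the R3 row fills up)
```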

Using awkward1.Array for BDT

I want to implement a boosted decision tree for my analysis, but the entries in my array are of varying length, so the array is not directly convertible into NumPy or Pandas.
Is there any way to use existing ML libraries with awkward array?
Your ML library might assume that the arrays are NumPy arrays and not recognize an ak.Array. That problem, in itself, is easily solved: call ak.to_numpy on it (or equivalently, cast it with np.asarray) to put it in a form the ML library expects. Incidentally, there's also ak.to_pandas to make a DataFrame in which variable-length nested lists are represented by a MultiIndex (with limitations: there has to be only one nested list, since a DataFrame has only one index).
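For concreteness, here is a minimal sketch of those conversions (the array contents are invented for illustration):

```python
import awkward as ak
import numpy as np

# Equal-length lists convert cleanly to a 2-D NumPy array.
rectangular = ak.Array([[1.1, 2.2], [3.3, 4.4]])
as_np = ak.to_numpy(rectangular)
also_np = np.asarray(rectangular)   # equivalent cast

# Variable-length lists can still go to Pandas:
# the nesting becomes a MultiIndex on the rows.
jagged = ak.Array([[1.1, 2.2, 3.3], [], [4.4]])
df = ak.to_pandas(jagged)
```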
The above is what I'd call a "branding" issue: the ML library just doesn't recognize the ak.Array "brand" of array, so we relabel it. But there's a more fundamental issue: does the ML algorithm in question intrinsically require rectilinear data? For instance, a feedforward neural network maps N-dimensional inputs to M-dimensional outputs; N and M can't be different for each input. This is a problem even if you're not using Awkward Array. In HEP, the old solution was to run variable-length data through a recurrent neural network (thus ignoring the boundaries between lists and imposing an irrelevant order on them) and the new solution seems to be graph neural networks (which is a more theoretically correct thing to do).
I've noticed that some ML libraries are introducing their own "jagged arrays," which are the minimum structure that Awkward Array provides: TensorFlow has RaggedTensors and PyTorch is getting NestedTensors. I don't know to what degree these data types have been integrated into the ML algorithms, though. If they have been, then Awkward Array ought to get an ak.to_tensorflow and ak.to_pytorch to complement ak.to_numpy and ak.to_pandas, as a way to preserve jaggedness when sending data to these libraries. Hopefully, they'll be able to use that jaggedness in their ML algorithms! (Otherwise, what's the point? But I haven't been following these developments closely.)
You're interested in boosted decision trees (BDTs). I can't think of how a decision tree model, boosted or not, could be adapted to different-length inputs... Or maybe I can: the nodes of a decision tree choose which subtree to pass the data down to based on the value of one index in the N-dimensional input. That doesn't imply there's a maximum index value N, though a particular tree would have a set of indexes that it splits on, and there would be some maximum of that set (because the tree is finite!). Applying a tree that wants to split on index k to an input with n < k elements would require a contingency for how to split anyway, but there are already methods for applying decision trees to datasets with missing values. An input datum with n elements could be treated as an input for which indexes greater than n are considered missing values. To train such a BDT, you'd have to give it inputs with missing values beyond each list's maximum element.
In Awkward Array, the function for that is ak.pad_none. If you know the maximum list length in your sample (ak.num and ak.max), you can pad the whole array so that all lists have the same length, with missing values at the end. If you set clip=True, then the resulting array type is "regular": it no longer considers the possibility that a list can have a length different from the chosen length. If you pass such an array to ak.to_numpy (and not np.asarray), then it becomes a NumPy masked array, which a BDT algorithm that expects missing values should be able to recognize.
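A minimal sketch of that padding recipe (the jagged values are made up, and the final fill_none step is just one way to hand the result to a library that treats NaN as missing):

```python
import awkward as ak
import numpy as np

events = ak.Array([[1.1, 2.2, 3.3], [4.4], [5.5, 6.6]])   # invented jagged data

max_len = ak.max(ak.num(events))                  # length of the longest list
padded = ak.pad_none(events, max_len, clip=True)  # regular type, None at the end
masked = ak.to_numpy(padded)                      # NumPy masked array

# For libraries that expect NaN rather than a mask:
X = ak.to_numpy(ak.fill_none(padded, np.nan))
```

(For example, scikit-learn's HistGradientBoostingClassifier treats NaN entries as missing values natively, which makes it one candidate for this scheme.)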
The only problem with this plan is that padding every list to the length of the longest list uses more memory. If the BDT algorithm were aware of jaggedness (the way TensorFlow is, and PyTorch soon will be), then it should be able to build these trees and apply them to data without the memory-padding step. I don't know if there are any such BDT implementations out there, but if someone wants to write a "BDT with missing values that accepts jagged arrays," I'd be happy to help them get it set up with Awkward Arrays!

How to solve basic HMM problems with hmmlearn

There are three fundamental HMM problems:
Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O|λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.
I'm interested in Problems 1 and 3. In general, the first problem can be solved with the Forward Algorithm and the third problem with the Baum-Welch Algorithm. Am I right that I should use the score(X, lengths) and fit(X, lengths) methods from hmmlearn for solving the first and third problems respectively? (The documentation does not say that score uses the Forward Algorithm.)
And I have some more questions about the score method. Why does score calculate the log probability? And why, if I pass several sequences to score, does it return the sum of their log probabilities instead of the probability of each sequence?
My original task was the following: I have 1 million short sentences of equal length (10 words each). I want to train an HMM with that data and, for test data (10-word sentences again), predict the probability of each sentence under the model. Based on that probability I will decide whether the phrase is usual or unusual.
And maybe there are better Python libraries for solving these problems?
If you are fitting the model on a single sequence, you should use score(X) and fit(X) to solve the first and third problems, respectively (since lengths=None is the default value, you do not need to pass it explicitly). When working with multiple sequences, you should pass the list of their lengths as the lengths parameter; see the documentation.
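A minimal sketch of that API (the data here is random and Gaussian purely for illustration; for 10-word sentences you would use a discrete-observation model such as hmmlearn's MultinomialHMM instead):

```python
import numpy as np
from hmmlearn import hmm

# Three observation sequences of different lengths, stacked into one array.
seq1, seq2, seq3 = np.random.randn(10, 2), np.random.randn(8, 2), np.random.randn(12, 2)
X = np.concatenate([seq1, seq2, seq3])
lengths = [len(seq1), len(seq2), len(seq3)]

model = hmm.GaussianHMM(n_components=3, n_iter=100)
model.fit(X, lengths)                    # Problem 3: Baum-Welch parameter learning
total_logprob = model.score(X, lengths)  # Problem 1: summed log P(O|λ)
```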
The score method calculates the log probability for numerical stability. Multiplying many probabilities results in numerical underflow: since each factor is at most 1, the product quickly becomes too small to distinguish from zero in floating point. The solution is to add their logarithms instead.
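A two-line demonstration of why the logarithm is necessary:

```python
import numpy as np

p = np.full(1000, 1e-5)     # a thousand modest per-step probabilities
print(np.prod(p))           # 0.0: the product underflows float64
print(np.sum(np.log(p)))    # about -11512.9: easily representable
```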
The score method returns the sum of the log probabilities of all sequences, well, because that is how it is implemented. A feature request for the per-sequence behavior you want was submitted a month ago, though, so maybe it will appear soon: https://github.com/hmmlearn/hmmlearn/issues/272 Or you can simply score each sequence separately.
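Scoring each sequence separately is a one-liner, reusing the fitted model and sequences from the sketch above:

```python
# One log-probability per sequence instead of the summed total.
per_sequence = [model.score(seq) for seq in (seq1, seq2, seq3)]
```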
The hmmlearn library is a good Python library for hidden Markov models. I tried a different library, ghmm, and I was getting weird numerical underflows. It turned out that its implementation of the Baum-Welch algorithm for Gaussian HMMs was numerically unstable: it used the LU decomposition instead of the Cholesky decomposition for computing the inverse of the covariance matrix, which sometimes led to the covariance matrix ceasing to be positive semidefinite. The hmmlearn library uses the Cholesky decomposition. I switched to hmmlearn and my program started working fine, so I recommend it.

consistency among different training pairs of skip-gram model

Regarding the skip-gram model: for training purposes, the input is a word (one-hot representation) and the outputs are its context words (multiple one-hot representations), for example (A,B), (A,C), (A,D).
My question is: when we run the training process, do we train the model pair by pair, or do we feed in [A, B|C|D] altogether?
Another question is regarding the word-vector matrix M (the matrix between the input and hidden layers). Since the input is one-hot, the result of multiplying the input (size |V|) by M is a vector of the hidden-layer size, which is exactly one row of the word-vector matrix. My question is: when we run back propagation, it seems that only that row of the word-vector matrix gets updated.
Is this true?
If that is the case, and suppose we train the model pair by pair on (A,B), (A,C), (A,D): how is consistency kept across the back-propagation steps for the different pairs? For example, once pair (A,B) is done, the row in the word-vector matrix gets updated, and through that update the error becomes smaller for (A,B). Then we run pair (A,C); the same row is picked and updated through back propagation, and this time the error becomes smaller for (A,C). But the correction for (A,B) will be erased, which means the back propagation for (A,B) is discarded. Is my understanding correct here?
Thanks
You can think of it as presenting the pairs individually as training examples: (A, B), (A, C), (A, D). That's how the algorithm is defined. (However, some implementations might in practice be able to process them as a batch.)
Note that 'outputs' are only "one-hot" when using negative-sampling. (In hierarchical-softmax, predicting one word requires a variable huffman-coded set of output nodes to have specific 0/1 values.)
Yes, in skip-gram, each training example (such as "A predicts B") only results in a single word's vector (one row of the word-vector matrix) being updated, since only one input word is involved.
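Here is a toy NumPy sketch of one skip-gram-with-negative-sampling step, just to make the "only one row is touched" point concrete; the sizes, learning rate, and negative samples are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, lr = 6, 4, 0.025                       # toy vocabulary/embedding sizes
W_in = rng.normal(scale=0.1, size=(V, d))    # "M": one row per input word
W_out = rng.normal(scale=0.1, size=(V, d))   # output-side matrix

def sgd_step(center, context, negatives):
    h = W_in[center]                         # only this row of W_in is read
    dh = np.zeros(d)
    for word, label in [(context, 1.0)] + [(neg, 0.0) for neg in negatives]:
        score = 1.0 / (1.0 + np.exp(-h @ W_out[word]))  # sigmoid of dot product
        grad = score - label
        dh += grad * W_out[word]
        W_out[word] -= lr * grad * h         # a handful of output rows move
    W_in[center] -= lr * dh                  # ...and only this input row moves

A, B, C, D = 0, 1, 2, 3
for center, context in [(A, B), (A, C), (A, D)]:   # pair by pair, as described
    sgd_step(center, context, negatives=[4, 5])
```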
Note that while the paper describes using a focus word to predict its nearby words, the released Google word2vec.c code (and other implementations modeled on it) as a matter of implementation uses context words to predict the focus word. Essentially it's just a different way of structuring the loops (and thus ordering the training-pair examples), which I believe they did for slightly better CPU cache efficiency.
In your scenario, the subsequent backprop for the (A, C) example does not necessarily "erase" the prior (A, B) backprop: depending on the out-representations of B and C, it might reinforce it (in some ways/directions/dimensions) or weaken it (in others). It is essentially that repeated tug-of-war between all the diverse examples that eventually creates an arrangement of weights which reflects some useful similarities of A, B, and C.
The best way to rigorously understand what is happening is to review the source code of implementations, such as the original Google word2vec.c or the Python implementation in gensim.

Does using dummy values make a model's performance better?

I see that many feature-engineering pipelines have a get_dummies step for the object features, for example dummying the sex column, which contains 'M' and 'F', into two columns with a one-hot representation.
Why do we not directly encode 'M' and 'F' as 0 and 1 in the sex column?
Does the dummy method have a positive impact on machine learning models, in both classification and regression? If so, why?
Thanks.
In general, directly encoding a categorical variable with N different values as (0, 1, ..., N-1), turning it into a numerical variable, won't work with many algorithms, because you are giving ad hoc meaning to the different category values. The gender example works since it is binary, but think of a price-estimation example with car models. If there are N distinct models and you encode model A as 3 and model B as 6, this would mean, for example in OLS linear regression, that model B affects the response variable twice as much as model A. You can't simply give such arbitrary meanings to different categorical values; the generated model would be meaningless. To prevent such numerical ambiguity, the most common way is to encode a categorical variable with N distinct values as N-1 binary, one-hot variables.
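In pandas this is a one-liner; a small sketch with invented data (drop_first=True produces the N-1 columns discussed below, so a binary column like sex collapses to a single 0/1 column):

```python
import pandas as pd

df = pd.DataFrame({'sex': ['M', 'F', 'F', 'M'],
                   'model': ['A', 'B', 'C', 'A']})

# One-hot encode both columns, dropping the first level of each.
encoded = pd.get_dummies(df, columns=['sex', 'model'], drop_first=True)
print(encoded)   # columns: sex_M, model_B, model_C
```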
To one-hot-encode a feature with N possible values you only need N-1 columns of 0/1 values. So you are right: binary sex can be encoded with a single binary feature.
Using dummy coding with N columns instead of N-1 shouldn't really add performance to any machine learning model, and it complicates some statistical analyses such as ANOVA.
See the patsy docs on contrasts for reference.

LSA - Feature selection

I have this SVD decomposition of the document matrix.
I've read this page, but I don't understand how I can compute the best features for document separation.
I know that:
S x Vt gives me the relation between documents and features
U x S gives me the relation between terms and features
But what is the key for the best feature selection?
SVD is concerned only with the inputs, not with their labels; in other words, it can be seen as an unsupervised technique. As such, it cannot tell you which features are good for separation without further assumptions.
What it does tell you is which 'basis vectors' are more important than others in terms of reconstructing the original data using only a subset of the basis vectors.
Nevertheless, you can think about LSA in the following manner (this is only an interpretation; the math is what matters). A document is generated by a mixture of topics. Each topic is represented by a vector of length n, which tells you how likely each word is in that topic. For example, if the topic is sports, then words like football or game are more likely than bestseller or movie. These topic vectors are the columns of U. In order to generate a document (a column of A), you take a linear combination of topics. The coefficients of the linear combination are the columns of Vt: each column tells you what proportions of topics to take in order to generate a document.
In addition, each topic has an overall 'gain' factor, which tells you how important this topic is in your set of documents (maybe you have just one document about sports out of 1000 total documents). These are the singular values, the diagonal of S. If you throw away the smaller ones, you can represent your original matrix A with fewer topics and a small amount of information lost. Of course, 'small' is a matter of application.
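A small NumPy illustration of that interpretation (the matrix values are invented; rows are terms, columns are documents):

```python
import numpy as np

# Toy term-document matrix A: 4 terms x 3 documents.
A = np.array([[2., 0., 1.],
              [1., 0., 0.],
              [0., 3., 1.],
              [0., 1., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                   # keep the two strongest "topics"
S_k = np.diag(s[:k])
A_k = U[:, :k] @ S_k @ Vt[:k, :]        # best rank-k approximation of A

doc_features = S_k @ Vt[:k, :]          # S x Vt: documents in topic space
term_features = U[:, :k] @ S_k          # U x S: terms in topic space
```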
One drawback of LSA is that it is not entirely clear how to interpret the numbers; they are not probabilities, for example. It makes sense to have "0.5" units of sports in a document, but what does it mean to have "-1" units?
