Consistency among different training pairs of the skip-gram model

Regarding the skip-gram model: for training purposes, the input is a word (a one-hot representation) and the outputs are its context words (multiple one-hot representations), for example (A, B), (A, C), (A, D).
My question is: when we run the training process, do we run the model pair by pair, or do we feed in [A, B|C|D] all together?
Another question concerns the word vector matrix "M" (the matrix between the input and hidden layers). Since the input is one-hot, the product of the input (size |V|) and M is a vector the size of the hidden layer, which is exactly one row of the word vector matrix. My question is: when we run back propagation, it seems that only that row of the word vector matrix gets updated.
Is this true?
If that is the case, and suppose we train the model pair by pair with (A, B), (A, C), (A, D), how is consistency kept across the back propagation runs for the different pairs? For example, once pair (A, B) is done, the corresponding row in the word vector matrix gets updated, and through the update the error becomes smaller for (A, B). Then we run pair (A, C); the same row is picked and updated through back propagation, and now the error becomes smaller for (A, C). But the correction for (A, B) will be erased, which means the back propagation for (A, B) is discarded. Is my understanding correct here?
Thanks

You can think of it as presenting the pairs individually as training examples: (A, B), (A, C), (A, D). That's how the algorithm is defined. (However, some implementations may in practice be able to do them as a batch.)
Note that the 'outputs' are only "one-hot" when using negative sampling. (With hierarchical softmax, predicting one word requires a variable, Huffman-coded set of output nodes to take specific 0/1 values.)
Yes, in skip-gram, each training example (such as "A predicts B") only results in a single word's vector (one row of the word vector matrix) being updated, since only one input word is involved.
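To make that concrete, here is a minimal NumPy sketch (toy sizes; the names are mine, not from any particular implementation) showing that the one-hot input selects a single row of M, so the gradient with respect to M is non-zero only in that row:

    import numpy as np

    V, d = 5, 3                      # toy vocabulary size and embedding size
    M = np.random.rand(V, d)         # input->hidden weights: the "word vector matrix"
    x = np.zeros(V); x[2] = 1.0      # one-hot input for word index 2

    h = x @ M                        # hidden layer: exactly M[2], one row of M

    # Whatever error signal dL/dh comes back from the output layer,
    # dL/dM = outer(x, dL/dh) is zero everywhere except row 2.
    dL_dh = np.random.rand(d)        # stand-in for the backpropagated error
    dL_dM = np.outer(x, dL_dh)
    print(np.nonzero(dL_dM.sum(axis=1)))   # -> (array([2]),)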
Note that while the paper describes using a focus word to predict its nearby words, the released Google word2vec.c code (and other implementations modeled on it), as a matter of implementation, uses context words to predict the focus word. Essentially it's just a different way of structuring the loops (and thus ordering the training-pair examples), which I believe they did for slightly better CPU cache efficiency.
In your scenario, the subsequent backprop for the (A, C) example does not necessarily "erase" the prior (A, B) backprop – depending on the out-representations of B & C, it might reinforce it (in some ways/directions/dimensions) or weaken it (in others). It is essentially that repeated tug-of-war, between all the diverse examples, that eventually creates an arrangement of weights that somehow reflects some useful similarities of A, B, and C.
The best way to rigorously understand what is happening is to review the source code of implementations, such as the original Google word2vec.c or the Python implementation in gensim.
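If you just want to experiment rather than read the C code first, a minimal gensim sketch (toy corpus; parameter names assume gensim 4.x) that trains skip-gram vectors looks like this:

    from gensim.models import Word2Vec

    sentences = [["A", "B", "C", "D"], ["A", "C", "B", "D"]]   # toy corpus
    model = Word2Vec(sentences, sg=1, vector_size=50, window=2,
                     min_count=1, epochs=10)    # sg=1 selects the skip-gram architecture
    print(model.wv["A"])                        # the learned input vector (row) for word "A"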

Related

Using awkward1.Array for BDT

I want to implement a boosted decision tree for my analysis, but the entries in my array are of varying length, so the array is not directly convertible into NumPy or pandas.
Is there any way to use existing ML libraries with Awkward Array?
Your ML library might assume that the arrays are NumPy arrays and not recognize an ak.Array. That problem, in itself, is easily solved: call ak.to_numpy (or, equivalently, cast it with np.asarray) to put it in a form the ML library expects. Incidentally, there's also ak.to_pandas to make a DataFrame in which variable-length nested lists are represented by a MultiIndex (with limitations: there has to be only one nested list, since a DataFrame has only one index).
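A minimal sketch of that relabeling (made-up contents, assuming Awkward Array 1.x):

    import awkward as ak
    import numpy as np

    # A rectangular ak.Array converts cleanly; truly jagged ones need padding first (see below).
    rect = ak.Array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    as_np = ak.to_numpy(rect)          # or np.asarray(rect); either hands the library a plain ndarray
    print(type(as_np), as_np.shape)    # <class 'numpy.ndarray'> (3, 2)

    jagged = ak.Array([[1, 2, 3], [], [4, 5]])
    print(ak.to_pandas(jagged))        # variable-length lists become a MultiIndex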
The above is what I'd call a "branding" issue: the ML library just doesn't recognize the ak.Array "brand" of array, so we relabel it. But there's a more fundamental issue: does the ML algorithm in question intrinsically require rectilinear data? For instance, a feedforward neural network maps N-dimensional inputs to M-dimensional outputs; N and M can't be different for each input. This is a problem even if you're not using Awkward Array. In HEP, the old solution was to run variable-length data through a recurrent neural network (thus ignoring the boundaries between lists and imposing an irrelevant order on them) and the new solution seems to be graph neural networks (which is a more theoretically correct thing to do).
I've noticed that some ML libraries are introducing their own "jagged arrays," which are the minimum structure that Awkward Array provides: TensorFlow has RaggedTensors and PyTorch is getting NestedTensors. I don't know to what degree these data types have been integrated into the ML algorithms, though. If they have been, then Awkward Array ought to get an ak.to_tensorflow and ak.to_pytorch to complement ak.to_numpy and ak.to_pandas, as a way to preserve jaggedness when sending data to these libraries. Hopefully, they'll be able to use that jaggedness in their ML algorithms! (Otherwise, what's the point? But I haven't been following these developments closely.)
You're interested in boosted decision trees (BDTs). I can't think of how a decision tree model, boosted or not, could be adapted to different-length inputs... or maybe I can: the nodes of a decision tree choose which subtree to pass the data down to based on the value of one index in the N-dimensional input. That doesn't imply there's a maximum index value N, though a particular tree would have a set of indexes that it splits on, and there would be some maximum in that set (because the tree is finite!). Applying a tree that wants to split on index k to an input with n < k elements would require a contingency for how to split anyway, but there are already methods for applying decision trees to datasets with missing values. An input datum with n elements could be treated as an input for which indexes greater than n are considered missing values. To train such a BDT, you'd have to give it inputs with missing values beyond each list's last element.
In Awkward Array, the function for that is ak.pad_none. If you know the length of the longest list in your sample (ak.num and ak.max), you can pad the whole array so that all lists have the same length, with missing values at the end. If you set clip=True, the resulting array type is "regular": it no longer considers the possibility that a list can have a length different from the chosen length. If you pass such an array to ak.to_numpy (and not np.asarray), it becomes a NumPy masked array, which a BDT algorithm that expects missing values should be able to recognize.
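A minimal sketch of that padding step (toy data, again assuming Awkward Array 1.x):

    import awkward as ak
    import numpy as np

    jagged = ak.Array([[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]])

    max_len = ak.max(ak.num(jagged))                  # longest list in the sample (3 here)
    padded = ak.pad_none(jagged, max_len, clip=True)  # every list now has exactly max_len slots
    masked = ak.to_numpy(padded)                      # a NumPy masked array; padded slots are masked

    print(masked)                  # masked entries print as --
    print(masked.filled(np.nan))   # or fill them with whatever sentinel your BDT treats as "missing"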
The only problem with this plan is that padding every list to the length of the longest list uses more memory. If the BDT algorithm were aware of jaggedness (the way TensorFlow is, and PyTorch soon will be), it should be able to build these trees and apply them to data without the memory-padding step. I don't know if there are any such BDT implementations out there, but if someone wants to write a "BDT with missing values that accepts jagged arrays," I'd be happy to help them get it set up with Awkward Arrays!

Using currently invalid input data for prediction purposes

Let's say we have some data (inputs) with which we want to predict some output. If the possible values that a specific input can take have changed over time, is it still appropriate to use all of the data?
Let me try to clarify with an example. Suppose that one of the inputs is a categorical variable that has the unique values [A, B, C] in the data, but we know for a fact that in the current setting in which we will ultimately make predictions, only the values [A, B] are possible.
Would it still be appropriate to use all of the data, or should all of the observations that include a C be excluded?
Suppose C does not map uniquely to the target variable, but rather shares some target values with A and/or B. In that case, leaving C in the dataset, knowing that it will definitely not occur in future inputs (i.e. where you predict for unseen data), will shift the model's hypothesis (how much depends on the model; linear models are more prone to this), and the final hypothesis will consequently be based on redundant information.
In simple terms: the in-sample data does not represent the out-of-sample data, so the model will overfit and won't generalize.
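If you do decide to exclude those observations, the filtering itself is trivial; a sketch with hypothetical column names (pandas assumed):

    import pandas as pd

    df = pd.DataFrame({"category": ["A", "B", "C", "A"],
                       "x":        [0.1, 0.4, 0.9, 0.2],
                       "target":   [0, 1, 1, 0]})

    # Keep only the categories that can actually occur at prediction time.
    train = df[df["category"].isin(["A", "B"])]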

How to solve basic HMM problems with hmmlearn

There are three fundamental HMM problems:
Problem 1 (Likelihood): Given an HMM λ = (A, B) and an observation sequence O, determine the likelihood P(O|λ).
Problem 2 (Decoding): Given an observation sequence O and an HMM λ = (A, B), discover the best hidden state sequence Q.
Problem 3 (Learning): Given an observation sequence O and the set of states in the HMM, learn the HMM parameters A and B.
I'm interested in Problems 1 and 3. In general, the first problem can be solved with the Forward algorithm and the third problem can be solved with the Baum-Welch algorithm. Am I right that I should use the score(X, lengths) and fit(X, lengths) methods from hmmlearn for solving the first and third problems, respectively? (The documentation does not say that score uses the Forward algorithm.)
And I have some more questions about the score method. Why does score calculate the log probability? And why, if I pass several sequences to score, does it return the sum of the log probabilities instead of a probability for each sequence?
My original task was the following: I have 1 million short sentences of equal size (10 words). I want to train an HMM with that data and, for test data (10-word sentences again), predict the probability of each sentence under the model. Based on that probability I will decide whether a phrase is usual or unusual.
And maybe there are better Python libraries for solving these problems?
If you are fitting the model on a single sequence, you should use score(X) and fit(X) to solve the first and third problems, respectively (since lengths=None is the default value, you do not need to pass it explicitly). When working with multiple sequences, you should pass the list of their lengths as the lengths parameter; see the documentation.
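A minimal sketch with two sequences (toy continuous data with a GaussianHMM; for word observations you would use a discrete-emission model such as MultinomialHMM instead):

    import numpy as np
    from hmmlearn import hmm

    # Two observation sequences stacked into one 2-D array, plus their lengths.
    seq1 = np.array([[0.1], [0.3], [0.2]])
    seq2 = np.array([[1.1], [0.9]])
    X = np.concatenate([seq1, seq2])
    lengths = [len(seq1), len(seq2)]

    model = hmm.GaussianHMM(n_components=2, n_iter=100)
    model.fit(X, lengths)              # Problem 3: Baum-Welch
    print(model.score(X, lengths))     # Problem 1: total log-likelihood via the forward algorithm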
The score method calculates the log probability for numerical stability. Multiplying a lot of numbers can result in numerical overflow or underflow - i.e. a number may get too big to represent or too small to distinguish from zero. The solution is to add their logarithms instead.
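A quick illustration of why the logarithm is used:

    import numpy as np

    p = np.full(1000, 0.1)      # a thousand per-step probabilities of 0.1
    print(np.prod(p))           # 0.0 -- the true value 1e-1000 underflows double precision
    print(np.sum(np.log(p)))    # about -2302.6, perfectly representable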
The score method returns the sum of the log probabilities of all sequences, well, because that is how it is implemented. A feature request for what you want was submitted a month ago, though, so maybe it will appear soon: https://github.com/hmmlearn/hmmlearn/issues/272. Or you can simply score each sequence separately.
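Scoring each sequence separately is just a loop over the sequences (reusing the names from the sketch above):

    per_sequence = [model.score(seq) for seq in (seq1, seq2)]
    print(per_sequence)         # one log-likelihood per sequence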
The hmmlearn library is a good Python library to use for hidden Markov models. I tried a different library, ghmm, and I was getting weird numerical underflows. It turned out that its implementation of the Baum-Welch algorithm for Gaussian HMMs was numerically unstable: it used LU decomposition instead of Cholesky decomposition for computing the inverse of the covariance matrix, which sometimes leads to the covariance matrix ceasing to be positive semidefinite. The hmmlearn library uses the Cholesky decomposition. I switched to hmmlearn and my program started working fine, so I recommend it.

Sequence learning with discrete/categorical output available to the objective function

Please bear with me; I am still quite new to deep learning, so what I am looking for may or may not make a lot of sense. But this is why I ask: I need some guidance on where to find a guiding example or paper.
Here is what I want and have:
I am interested in sequence data (time-series actually), hence I am using RNN/LSTM/GRU models and those types of things.
Now suppose I have a 1-D time-series; let's call it X = [x_1, ..., x_n].
For my particular problem, it turns out that X is produced by another function, a generator function if you will, such that X = f(a, b).
That function takes two integer parameters a and b
Here is my problem: I want to find the values of a and b that best reconstruct my time-series X (assume that I can generate time-series with f(a, b)).
This leads me to believe that I must include the actual values of the network output, i.e. a and b, in my objective function.
My objective function could be something like objectiveFunction(X_true, X_pred), but then my X_pred is generated from f with parameters a and b.
Further, the batch size may need to be the whole time-series (they are small, and I have many examples), but we can use big mini-batches if need be.
Suppose the search space over a and b is [0, 10] for both (again, a and b can only take integer values); then I have 100 pairs of values for a and b, e.g. (6, 7). As I train the network, I expect the weight of the edge leading from the Dense layer to the categorical outputs (i.e. my parameter pairs (a, b)) to be maximised as the network finds better and better time-series generator parameters a and b.
Initially, I was just going to test a network structure as such:
Input -->> RNN -->> Dense -->> Categorical output
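In Keras terms, a rough sketch of that structure might be something like the following (the layer sizes, and the softmax over all candidate (a, b) pairs, are just placeholders on my part):

    import tensorflow as tf

    n_pairs = 100   # one class per candidate (a, b) pair, as described above

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, 1)),                        # the 1-D time-series
        tf.keras.layers.SimpleRNN(32),                          # plain RNN, nothing fancier yet
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(n_pairs, activation="softmax"),   # categorical output over (a, b) pairs
    ])
    # The objective comparing X_true with X_pred = f(a, b) would still have to be
    # wired in as a custom loss around this output; the sketch only covers the layers.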
I want to keep it simple for now and not try anything fancier such as an LSTM as my time-series only have short-term dependencies.
Hence, any advice would be most welcome. Thanks in advance.

Neural Network Normalization of Nominal Data for 1 Output Neuron

I am new to machine learning and AI and started with NNs recently.
I already got some information here on Stack Overflow, but at the moment I don't understand the logic behind everything I've gathered.
Let's take 4 nominal (but not ordinal) values [A, B, C, D] and 2 numerical values that are already normalized, [0.35, 0.55] - so 2 input neurons, one for the nominal value and one for the numerical value.
In the NN literature I mostly see that you have to use 4 input neurons for the encoding. But I don't need to predict those nominal values; I have only one output neuron, which represents at most a relationship, much as I would use it with expert systems and rules.
If I normalized them to [0.2, 0.4, 0.6, 0.8], for example, wouldn't the NN be able to distinguish between them? For the NN it's only a number, isn't it?
Naive approach and thinking:
A with numerical value 0.35 leads to an ideal output of 1.
B with numerical value 0.55 leads to an ideal output of 0.
C with numerical value 0.35 leads to an ideal output of 0.
D with numerical value 0.55 leads to an ideal output of 1.
Is there a mistake in my way of thinking about this approach?
Additional info (edit):
Those nominal values are included in the decision making (their significance, if measured with statistics tools, comes from combining them with the numerical values), depending on whether they are true or not. I know they can be encoded in binary, but the list of nominal values is a little bit larger.
Other example:
Symptom A with blood test 1 leads to diagnosis X (the ideal).
Symptom B with blood test 1 leads to diagnosis Y (the ideal).
Expert systems are actually used for this. Symptoms are nominal values, but in combination with the blood test value you get the diagnosis. The main question, finally: do I have to encode symptoms in a binary way, or can I replace the symptoms with numbers? And if I can't replace them with numbers, why is a binary representation the only way to use them in a NN?
INPUTS
Theoretically it doesn't really matter how you encode your inputs. As long as different samples are represented by different points in the input space, it is possible to separate them with a line - and that is what the input layer (if it's linear) is doing: it combines the inputs linearly. However, the way the data is laid out in the input space can have a huge impact on convergence time during learning. A simple way to see this: imagine a set of lines crossing the origin in 2D space. If your data is scattered around the origin, then it is likely that some of these lines will already separate the data into parts, and few "moves" will be required, especially if the data is linearly separable. On the other hand, if your input data is dense and far from the origin, then most of the initial discrimination lines won't even "hit" the data. So it will take a large number of weight updates to reach the data, and then a large number of precise steps to "cut" it into initial categories.
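In practice, that mostly means centering and scaling your inputs before training; a minimal sketch (scikit-learn assumed, made-up numbers):

    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[100.2, 0.35],
                  [101.7, 0.55],
                  [ 99.9, 0.35]])                  # raw inputs, dense and far from the origin

    X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per column
    print(X_scaled.mean(axis=0), X_scaled.std(axis=0))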
OUTPUTS
If you have categories, then encoding them as binary is quite important. Imagine that you have three categories: A, B and C. If you encode them with three neurons as 1;0;0, 0;1;0 and 0;0;1, then during learning, and later with noisy data, a point about which the network is "not sure" can end up as 0.5;0.0;0.5 on the output layer. That makes sense if it is really something conceptually between A and C, but surely not B. If you instead chose one output neuron and encoded A, B and C as 1, 2 and 3, then in the same situation the network would give an output of the average of 1 and 3, which is 2! So the answer would be "definitely B" - clearly wrong!
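A minimal sketch of the two encodings side by side (scikit-learn assumed):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    cats = np.array([["A"], ["B"], ["C"]])
    onehot = OneHotEncoder().fit_transform(cats).toarray()
    print(onehot)                              # [[1,0,0],[0,1,0],[0,0,1]]; "between A and C" is [0.5, 0, 0.5]

    as_numbers = np.array([1, 2, 3])           # the single-number encoding
    print((as_numbers[0] + as_numbers[2]) / 2) # 2.0 -- which reads as "definitely B"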
Reference:
ftp://ftp.sas.com/pub/neural/FAQ.html
