Deep learning: is Flatten a special form of Embedding?

In deep learning, an Embedding is defined as a mapping from data to a dense vector.
Flatten is a widely used operation that lays data out in a single line, so we can consider that Flatten also returns a vector. It just doesn't change the number of features.
So, is Flatten a special form of Embedding? Is this logically right?
Flatten ⊂ Embedding?

No. Flatten is a layer that takes an input of higher dimension, i.e. shape (d1, d2, ..., dn), and flattens it out to a 1-D vector. This vector has (d1 * d2 * ... * dn) elements. It doesn't learn anything; it just takes a higher-dimensional tensor and converts it to a one-dimensional tensor.
Embeddings, on the other hand, have learnable parameters that get updated during training. These parameters learn a meaningful representation of the data.
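
To make the difference concrete, here is a minimal NumPy sketch (not taken from any particular framework; the sizes and the name embedding_table are made up for illustration). Flatten is just a reshape with zero parameters, while an embedding is a lookup into a weight table that training would update.

```python
import numpy as np

# Flatten: a pure reshape, nothing to learn.
x = np.arange(24).reshape(2, 3, 4)   # shape (d1, d2, d3) = (2, 3, 4)
flat = x.reshape(-1)                 # shape (2 * 3 * 4,) = (24,)

# Embedding: a lookup into a table of weights (hypothetical sizes).
vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embed_dim))  # would be trained

token_ids = np.array([3, 7, 3])
embedded = embedding_table[token_ids]  # shape (3, 4); same id gives same row

n_flatten_params = 0                          # Flatten has no weights
n_embedding_params = embedding_table.size     # vocab_size * embed_dim
```

In a real framework the embedding table would receive gradients during training, whereas the reshape has nothing that could receive them.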

Related

Using awkward1.Array for BDT

I want to implement a boosted decision tree for my analysis, but the entries in my array are of varying length, so the array is not directly convertible to NumPy or pandas.
Is there any way to use existing ML libraries with awkward array?

Your ML library might assume that the arrays are NumPy arrays and not recognize an ak.Array. That problem, in itself, is easily solved: call ak.to_numpy (or equivalently, cast it with np.asarray) to put it in a form the ML library expects. Incidentally, there's also ak.to_pandas to make a DataFrame in which variable-length nested lists are represented by a MultiIndex (with limitations: there has to be only one nested list, since a DataFrame has only one index).
The above is what I'd call a "branding" issue: the ML library just doesn't recognize the ak.Array "brand" of array, so we relabel it. But there's a more fundamental issue: does the ML algorithm in question intrinsically require rectilinear data? For instance, a feedforward neural network maps N-dimensional inputs to M-dimensional outputs; N and M can't be different for each input. This is a problem even if you're not using Awkward Array. In HEP, the old solution was to run variable-length data through a recurrent neural network (thus ignoring the boundaries between lists and imposing an irrelevant order on them) and the new solution seems to be graph neural networks (which is a more theoretically correct thing to do).
I've noticed that some ML libraries are introducing their own "jagged arrays," which are the minimum structure that Awkward Array provides: TensorFlow has RaggedTensors and PyTorch is getting NestedTensors. I don't know to what degree these data types have been integrated into the ML algorithms, though. If they have been, then Awkward Array ought to get an ak.to_tensorflow and ak.to_pytorch to complement ak.to_numpy and ak.to_pandas, as a way to preserve jaggedness when sending data to these libraries. Hopefully, they'll be able to use that jaggedness in their ML algorithms! (Otherwise, what's the point? But I haven't been following these developments closely.)
You're interested in boosted decision trees (BDTs). I can't think of how a decision tree model, boosted or not, could be adapted to different length inputs... Or maybe I can: the nodes of a decision tree choose which subtree to pass the data down to based on the value of one index in the N-dimensional input. That doesn't imply there's a maximum index value N, though a particular tree would have a set of indexes that it splits on, and there would be some maximum of that set (because the tree is finite!). Applying a tree that wants to split on index k to an input with n < k elements would need a contingency for how to split anyway, but there are already methods for applying decision trees to datasets with missing values. An input datum with n elements could be treated as an input for which indexes greater than n are considered missing values. To train such a BDT, you'd have to give it inputs with missing values beyond each list's maximum element.
In Awkward Array, the function for that is ak.pad_none. If you know the maximum length list in your sample (ak.num and ak.max), you can pad the whole array such that all lists have the same length with missing values at the end. If you set clip=True, then the resulting array type is "regular": it no longer considers the possibility that a list can have a length different from the chosen length. If you pass such an array to ak.to_numpy (and not np.asarray), then it becomes a NumPy masked array, which a BDT algorithm that expects missing values should be able to recognize.
The only problem with this plan is that padding every list to have the same length as the maximum length list uses more memory. If the BDT algorithm were aware of jaggedness (the way that TensorFlow and soon PyTorch is/will be aware of jaggedness), then it should be able to make these trees and apply them to data without the memory-padding step. I don't know if there are any such BDT implementations out there, but if someone wants to write a "BDT with missing values that accepts jagged arrays," I'd be happy to help them get it set up with Awkward Arrays!
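
To show what the padded result looks like, here is a library-free sketch of the same idea using plain Python lists and NumPy. It does not call Awkward Array itself; pad_to_masked is a hypothetical helper written for this illustration, and the sample data is made up.

```python
import numpy as np

def pad_to_masked(jagged, fill_value=0.0):
    """Pad variable-length lists to a common length and mask the padding."""
    max_len = max(len(row) for row in jagged)
    data = np.full((len(jagged), max_len), fill_value, dtype=float)
    mask = np.ones((len(jagged), max_len), dtype=bool)  # True means missing
    for i, row in enumerate(jagged):
        data[i, :len(row)] = row
        mask[i, :len(row)] = False
    return np.ma.masked_array(data, mask=mask)

jagged = [[1.1, 2.2, 3.3], [], [4.4, 5.5]]
padded = pad_to_masked(jagged)   # shape (3, 3), padding is masked

# Some libraries expect NaN rather than a mask for missing values
# (an assumption about the downstream library, check its documentation).
as_nan = padded.filled(np.nan)
```

Every row now has the length of the longest list, which is exactly the memory cost mentioned above.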

Neural Networks normalizing output data

I have training data for a NN along with expected outputs. Each input is a 10-dimensional vector and has 1 expected output. I have normalised the training data using a Gaussian, but I don't know how to normalise the outputs since they have only a single dimension. Any ideas?
Example:
Raw Input Vector: -128.91, 71.076, -100.75, 4.2475, -98.811, 77.219, 4.4096, -15.382, -6.1477, -361.18
Normalised Input Vector: -0.6049, 1.0412, -0.3731, 0.4912, -0.3571, 1.0918, 0.4925, 0.3296, 0.4056, -2.5168
The raw expected output for the above input is 1183.6 but I don't know how to normalise that. Should I normalise the expected output as part of the input vector?

From the looks of your problem, you are trying to implement some sort of regression algorithm. For regression problems you don't normally normalize the outputs. For the training data you provide for a regression system, the expected output should be within the range you're expecting, or simply whatever data you have for the expected outputs.
Therefore, you can normalize the training inputs to allow the training to go faster, but you typically don't normalize the target outputs. When it comes to testing time or providing new inputs, make sure you normalize the data in the same way that you did during training. Specifically, use exactly the same normalization parameters that you used during training for any test inputs into the network.

One important remark is that you normalized across the elements of a single input vector. With a one-dimensional output space, you cannot normalize the output that way.
The correct way is, indeed, to take a complete batch of training data, say N input (and output) vectors, and normalize each dimension (variable) individually (using N samples). Thus, for a one-dimensional output, you will have N samples for normalization. In this way, the vector space of your input will not be distorted.
Normalizing the output dimension is usually required when the scales of the output variables differ significantly. After training, you should use the same set of normalization parameters (e.g., for z-score it is "mean" and "std") that you obtained from the training data. In this way, you put new (unseen) data into the same scale space as in training.
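
A minimal sketch of this per-dimension z-score normalization, using NumPy and made-up random data in place of the real training set (the array names and sizes are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=100.0, size=(200, 10))  # made-up inputs
y_train = rng.normal(loc=1000.0, scale=300.0, size=200)      # made-up targets

# Statistics are computed per column over the N training samples.
x_mean, x_std = X_train.mean(axis=0), X_train.std(axis=0)
y_mean, y_std = y_train.mean(), y_train.std()

X_train_n = (X_train - x_mean) / x_std
y_train_n = (y_train - y_mean) / y_std

# New data reuses the training statistics; nothing is recomputed.
x_new = rng.normal(loc=50.0, scale=100.0, size=(1, 10))
x_new_n = (x_new - x_mean) / x_std

def denormalize(y_pred_n):
    """Map a normalized prediction back to the original output scale."""
    return y_pred_n * y_std + y_mean
```

If the targets are normalized, the network's predictions have to go through denormalize before being compared with raw values such as 1183.6.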

Reshaping Inputs that contain continuous and discrete values

The inputs I am using are 2xN, where the first 1xN row are continuous numbers, and the second 1xN row are discrete numbers (that encodes a specific class out of 7 possible classes). I expect there to be a relation between vertically adjacent pairs.
I am looking to use a neural net for a multi-class classifier on this input, but am unsure of how to reshape my data for forward propagation in a way that makes sense.
What is a feasible way to reshape my data into 1x2N for forward propagation that makes sense?
edit:
Example input:
input_features = [[99.3, 22.1, 41.7], [1, 3, 4]]

Unless you know something more than "there might be some kind of relation", you should just flatten the array and pass it as a vector. A NN can (in theory) find such relations on its own, given enough data.
What are the other options? If you suspect that there is a single relation that holds for every column, then you might want to construct a specific neural net. One option is to have a convolution of size 2x1 (a single column) in the input layer. On the other hand, if you create a large enough set of kernels, this will be able to model more complex relations too. In that case, leave it as a matrix (think of it as an image). There is nothing wrong with discrete values, as long as they are on a reasonable scale.
In general, you will actually just work with a specific wiring of the net, not a reshaping of the array (however, implementations of conv nets actually use the shape to do the work for you, as described).
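
For the plain flattening option, a small NumPy sketch using the example input from the question. Two orderings are shown; for a fully connected input layer either works, as long as the same ordering is used for every sample.

```python
import numpy as np

input_features = np.array([[99.3, 22.1, 41.7],
                           [1, 3, 4]])          # shape (2, N) with N = 3

# Row after row: all continuous values first, then all class codes.
flat_row_major = input_features.reshape(-1)     # length 2N

# Column after column: each (value, class) pair stays adjacent.
flat_paired = input_features.T.reshape(-1)      # length 2N
```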

Two vectors for every word in the basic skip-gram word2vec model with softmax function

I'm reading the original word2vec paper: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
According to the softmax equation in the paper, every word has two vectors: one is used when the word is the center word predicting its context, the other is used when it is a context word. The former can be updated with gradient descent in each iteration. But how do we update the latter? And which vector is the final vector in the final model?

To my understanding, irrespective of which architecture is used (skip-gram/CBOW), word vectors are read from the same word-vector matrix.
As suggested in the second footnote of the paper, v_in and v'_out of the same word (say, dog) should be different, and they are assumed to come from different vocabularies during the derivation of the loss function.
In practice, the probability of a word appearing in its own context is very low, and most implementations don't save two vector representations of the same word, to save memory and improve efficiency.
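
As an illustration of the two-vector formulation used in the derivation, here is a toy NumPy sketch of the full-softmax skip-gram objective with separate input and output matrices. The sizes, the learning rate, and the names V_in and V_out are assumptions for this example; it is not how any specific word2vec implementation is written. It shows that one gradient step on -log p(context | center) yields a gradient for the center word's input vector and for the output vectors alike.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 5, 3
V_in = rng.normal(size=(vocab_size, dim))    # vectors used as center word
V_out = rng.normal(size=(vocab_size, dim))   # vectors used as context word

def context_probs(center_id):
    """Softmax over the vocabulary of the scores v'_w . v_center."""
    scores = V_out @ V_in[center_id]
    scores = scores - scores.max()           # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

center_id, context_id, lr = 2, 4, 0.1
p = context_probs(center_id)

# Gradient of -log p[context_id] with respect to the scores.
err = p.copy()
err[context_id] -= 1.0

grad_in = V_out.T @ err                      # for the center word's v_in
grad_out = np.outer(err, V_in[center_id])    # for every output vector v'_w

V_in[center_id] -= lr * grad_in
V_out -= lr * grad_out
```

Both gradients are computed from the same error term before either matrix is changed, so the output vectors are updated by the same gradient descent step as the input vector.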

Qualitative Classification in Neural Network on Weka

I have a training set where the input vectors are speed, acceleration and turn angle change. The output is a crisp class: an activity state from the given set {rest, walk, run}. E.g., for input vectors [3.1 1.2 2] --> run; [2.1 1 1] --> walk; and so on.
I am using Weka to develop a neural network model. I am defining the outputs as crisp ones (or rather qualitative ones in words, i.e. categorical values). After training, the model classifies the test data fairly well.
I was wondering how the internal process (the mapping function) takes place. Are the qualitative output states given some nominal value inside the model, and then converted back to categorical data after processing? A NN model cannot map float input values to categorical data through hidden neurons, so what is actually happening, given that the model works fine?
If the model converts the categorical outputs into nominal ones before processing, on what basis does it convert the categorical values into some arbitrary numerical values?

Yes, categorical values are usually converted to numbers, and the network learns to associate input data with these numbers. However, these numbers are often further encoded, so that more than a single output neuron is used. The most common way to do it, for unordered labels, is to add dummy output neurons dedicated to each category and use 1-of-C encoding, with 0.1 and 0.9 as target values. The output is interpreted using the winner-take-all paradigm.
Using only one neuron and encoding categories with different numbers for unordered labels often leads to problems, as the network will treat middle categories as "averages" of the boundary categories. This, however, may sometimes be desired if you have ordered categorical data.
You can find very good explanation of this issue in this part of the online Neural Network FAQ.
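
A small sketch of 1-of-C encoding with 0.1/0.9 targets and winner-take-all decoding, using the three classes from the question. The helper names encode and decode are made up for illustration, and this is not a description of what Weka does internally.

```python
import numpy as np

classes = ["rest", "walk", "run"]
class_to_index = {name: i for i, name in enumerate(classes)}

def encode(label, low=0.1, high=0.9):
    """1-of-C target: one output neuron per class, `high` for the true class."""
    target = np.full(len(classes), low)
    target[class_to_index[label]] = high
    return target

def decode(outputs):
    """Winner-take-all: the class whose output neuron is largest."""
    return classes[int(np.argmax(outputs))]

target_for_run = encode("run")            # [0.1, 0.1, 0.9]
predicted = decode([0.2, 0.7, 0.3])       # "walk"
```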

The neural net's computations all take place on continuous values. To do multiclass classification with discrete output, its final layer produces a vector of such values, one for each class. To make a discrete class prediction, take the index of the maximum element in that vector.
So if the final layer in a classification network for four classes predicts [0 -1 2 1], then the third element of the vector is the largest and the third class is selected. Often, these values are also constrained to form a probability distribution by means of a softmax activation function.
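
The example above can be checked in a few lines of NumPy (a sketch only; note that indices are zero-based, so the third class is index 2):

```python
import numpy as np

logits = np.array([0.0, -1.0, 2.0, 1.0])   # final-layer outputs, 4 classes

predicted_class = int(np.argmax(logits))   # 2, i.e. the third class

# Softmax turns the raw values into a probability distribution.
shifted = logits - logits.max()            # for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum()
```

Softmax preserves the ordering of the values, so taking the argmax of probs selects the same class as taking the argmax of logits.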
