My input parameters for the neural network contain symbolic data. I do not understand whether it is possible to feed the data to the network in this form, or whether it first needs to be normalized/encoded. Is feeding it directly not possible? And if not, how can I fix it?
Neural networks are not symbolic systems; they work only on numeric data. If by "symbolic" you mean categorical (like "cat", "dog"), then the typical approach is one-hot encoding. However, if by symbolic you mean actual symbolic expressions (like "a+b-2^c", "a*b"), then there is no standard way to go; it is a problem-specific question, impossible to answer without an exact specification of the data and a lot of research.
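For the categorical case, here is a minimal one-hot encoding sketch with pandas (the column name and category values are made up for illustration):

import pandas as pd

df = pd.DataFrame({"animal": ["cat", "dog", "cat", "bird"]})
one_hot = pd.get_dummies(df["animal"], prefix="animal")
# columns animal_bird, animal_cat, animal_dog: a 1 in the column matching
# each row's value and 0 elsewhere
X = one_hot.to_numpy().astype("float32")  # purely numeric matrix, ready to feed a network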
Related
I am facing a binary prediction task and have a set of features, all of which are categorical. A key challenge is therefore to encode those categorical features as numbers, and I was looking for smart ways to do so.
I stumbled over word2vec, which is mostly used for NLP, but I was wondering whether I could use it to encode my variables, i.e. simply take the weights of the neural net as the encoded features.
However, I am not sure whether it is a good idea, since the context words, which serve as the input features in word2vec, are in my case more or less random, in contrast to the real sentences word2vec was originally made for.
Do you guys have any advice, thoughts, recommendations on this?
You should look into entity embedding if you are searching for a way to utilize embeddings for categorical variables.
Google has a good crash course on the topic: https://developers.google.com/machine-learning/crash-course/embeddings/categorical-input-data
This is a good paper on arXiv, written by a team from a Kaggle competition: https://arxiv.org/abs/1604.06737
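For a rough idea of what an entity embedding looks like in practice, here is a minimal Keras sketch (the vocabulary size, embedding dimension and single-feature setup are illustrative assumptions, not taken from the crash course or the paper):

import tensorflow as tf

n_categories = 1000   # number of distinct values of the categorical feature (assumed)
emb_dim = 16          # size of the learned embedding vector (a hyperparameter)

cat_input = tf.keras.Input(shape=(1,), dtype="int32")               # integer-encoded category
emb = tf.keras.layers.Embedding(n_categories, emb_dim)(cat_input)   # lookup table trained with the model
emb = tf.keras.layers.Flatten()(emb)
out = tf.keras.layers.Dense(1, activation="sigmoid")(emb)           # binary prediction head

model = tf.keras.Model(cat_input, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
# after training, the Embedding layer's weights are the learned category vectors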
It's certainly possible to use the word2vec algorithm to train up 'dense embeddings' for things like keywords, tags, categories, and so forth. It's been done, sometimes beneficially.
Whether it's a good idea in your case will depend on your data & goals – the only way to know for sure is to try it, and evaluate the results versus your alternatives. (For example, if the number of categories is modest from a controlled vocabulary, one-hot encoding of the categories may be practical, and depending on the kind of binary classifier you use downstream, the classifier may itself be able to learn the same sorts of subtle interrelationships between categories that could also otherwise be learned via a word2vec model. On the other hand, if categories are very numerous & chaotic, the pre-step of 'compressing' them into a smaller-dimensional space, where similar categories have similar representational vectors, may be more helpful.)
That such tokens don't quite have the same frequency distributions & surrounding contexts as true natural language text may mean it's worth trying a wider range of non-default training options on any word2vec model.
In particular, if your categories don't have a natural ordering giving rise to meaningful near-neighbors relationships, using a giant window (so all words in a single 'text' are in each others' contexts) may be worth considering.
Recent versions of the Python gensim Word2Vec allow changing a parameter named ns_exponent – it was fixed at 0.75 in many early implementations, but at least one paper has suggested it can usefully vary far from that value for certain corpus data and recommendation-like applications.
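For example, a minimal gensim sketch along those lines (the corpus, dimensions and window value are made up; vector_size is the gensim 4.x parameter name, older versions call it size):

from gensim.models import Word2Vec

# toy corpus: each "sentence" is the list of categorical tokens attached to one record
corpus = [
    ["red", "xl", "cotton"],
    ["blue", "s", "wool"],
    ["red", "m", "wool"],
]

model = Word2Vec(
    corpus,
    vector_size=8,     # dimensionality of the learned vectors
    window=1000,       # giant window: every token in a record is in every other token's context
    min_count=1,       # keep rare categories instead of discarding them (the default discards them)
    sg=1,              # skip-gram
    ns_exponent=0.75,  # the historical default; worth sweeping for recommendation-like data
)

print(model.wv["red"])  # learned vector for the category token "red"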
I am coming from a Google BERT context (Bidirectional Encoder Representations from Transformers). I have gone through the architecture and the code. People say it is bidirectional by nature, and that to make its attention unidirectional some mask has to be applied.
Basically, a transformer takes keys, values and queries as input; uses an encoder-decoder architecture; and applies attention to these keys, queries and values. What I understood is that we need to pass the tokens explicitly rather than the transformer understanding this by nature.
Can someone please explain what makes the transformer bidirectional by nature?
Bidirectional is actually a carry-over term from RNN/LSTM. Transformer is much more than that.
Transformer and BERT can directly access all positions in the sequence, equivalent to having full random access memory of the sequence during encoding/decoding.
Classic RNN has only access to the hidden state and the last token, e.g. encoding of word3 = f(hidden_state, word2), so it has to compress all previous words into a hidden state vector, which is theoretically possible but a severe limitation in practice. Bidirectional RNN/LSTM is slightly better. Memory networks are another way to work around this. Attention is yet another way to improve LSTM seq2seq models. The insight of the Transformer is that you want full memory access and don't need the RNN at all!
Another piece of history: an important ingredient that lets us deal with sequence structure without using an RNN is positional encoding, which comes from the CNN seq2seq model. It would not have been possible without this. It turns out you don't need the CNN either, as a CNN doesn't have full random access: each convolution filter can only look at a number of neighboring words at a time.
Hence, the Transformer is more like an FFN, where encoding of word1 = f1(word1, word2, word3), and encoding of word3 = f2(word1, word2, word3). All positions are available all the time.
You might also appreciate the beauty which is that the authors made it possible to compute attention for all positions in parallel, through the use of Q, K, V matrices. It's quite magical!
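To get a concrete feel for that parallelism, here is a minimal NumPy sketch of scaled dot-product attention (toy shapes of my own, not from the discussion above); the single (N, N) score matrix is where every position attends to every other position, and it is also where the quadratic cost mentioned next comes from:

import numpy as np

N, d = 4, 8                        # sequence length and model dimension (toy values)
rng = np.random.default_rng(0)
Q = rng.normal(size=(N, d))        # queries, one row per position
K = rng.normal(size=(N, d))        # keys
V = rng.normal(size=(N, d))        # values

scores = Q @ K.T / np.sqrt(d)      # (N, N): every position scored against every other position
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
output = weights @ V               # (N, d): each output row mixes information from ALL positions at once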
Once you understand this, you'll also appreciate the limitation of the Transformer: it requires O(N^2 * d) computation, where N is the sequence length, precisely because we're doing N*N attention of all words with all other words. An RNN, on the other hand, is linear in the sequence length and requires O(N * d^2) computation; d is the dimension of the model's hidden state.
Transformer just won't write a novel anytime soon!
The masked-word training setup shows in a really clear way why BERT is bidirectional: to predict a masked word, the model is forced to use information from the entire sentence simultaneously – regardless of position – to make a good prediction.
BERT has been a clear breakthrough, enabled by the well-known "Attention Is All You Need" paper and architecture.
This bidirectional (masked) idea is different from classic LSTM cells, which until now used the forward method, the backward method, or both, but not truly at the same time.
Edit:
This is done by the Transformer. The "Attention Is All You Need" paper presents an encoder-decoder system implementing a sequence-to-sequence framework. BERT uses this Transformer (a sequence-to-sequence bidirectional network) to do other NLP tasks, and it does so with a masked approach.
The important thing is: BERT uses attention, but attention was developed for translation, and as such it does not care about bidirectionality. Remove a word, however, and you get bidirectionality.
So why BERT now?
Well, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. This means the model allows far more effective sentence embeddings than before. In fact, RNN-based architectures are hard to parallelize and can have difficulty learning long-range dependencies within the input and output sequences. So: a breakthrough in architecture, plus the use of this idea to train a network by masking a word (or more), leads to BERT.
Edit of Edit:
Forget about the scaled dot-product; it is inside the attention, which is inside a multi-head attention, which is itself inside the Transformer: you are looking too deep. The Transformer uses the entire sequence every time to find the other sequence (in the case of BERT it's the missing 15% of the sentence), and that's it. The use of BERT as a language model is really transfer learning (see this).
As stated in your post, unidirectional can be done with a certain type of mask; bidirectional is better. And it is used because BERT goes from a full sentence to a full sentence, but not the way classic seq2seq is done (with LSTMs and RNNs), and as such it can be used for language modeling.
BERT is a bidirectional transformer whereas the original transformer (Vaswani et al., 2017) is unidirectional. This can be shown by comparing the masks in the code.
Original Transformer
TensorFlow's tutorial is a good reference. The look_ahead_mask is what makes the Transformer unidirectional.
import tensorflow as tf

def create_look_ahead_mask(size):
    # 1s mark "future" positions: row i has 1s in every column j > i.
    mask = 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
    return mask  # (seq_len, seq_len)
If you trace the code, you can find that the look_ahead_mask is applied to the attention_weights in the decoder. Basically, each row in the attention_weights represents an attention query for the token at a certain position (the first row -> first token position; the second row -> second token position, etc.). The look_ahead_mask blacks out the tokens that appear after this position, so the decoder does not see the "future". In that sense, the decoder is unidirectional, analogous to a unidirectional RNN.
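As a small illustration of my own (not from the tutorial), for a toy sequence length of 3:

mask = create_look_ahead_mask(3)
# mask == [[0., 1., 1.],
#          [0., 0., 1.],
#          [0., 0., 0.]]
# Row i is the query at position i; the 1s mark positions after i ("the future"),
# which get a large negative value added to their attention logits so the softmax
# gives them essentially zero weight.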
BERT
On the other hand, if you check the original BERT implementation (also in TensorFlow), there is only an optional input_mask applied to the entire BertModel. And if you follow the README on pre-training the model and run create_pretraining_data.py, you will observe that the input_mask is only used for padding the input sequence, so that for short sentences the unused token positions are ignored. Thus, attention in BERT can be applied to both the "past" and the "future" of a given token position. In that sense, the encoder in BERT is bidirectional, analogous to a bidirectional RNN.
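As a made-up illustration (not BERT's actual code) of what that padding-only mask looks like:

tokens     = ["[CLS]", "the", "cat", "sat", "[SEP]", "[PAD]", "[PAD]"]
input_mask = [1, 1, 1, 1, 1, 0, 0]
# 1 = a real token that every other position may attend to, whether it comes
#     before or after the query position;
# 0 = padding that is masked out.
# Nothing here restricts attention to earlier positions only, hence "bidirectional".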
I know this is an old post but for anyone coming back to this:
Adding to what Hai-Anh Trinh said, Transformers aren't 'bi-directional'; it would be better to call them "omni-directional". Because of their self-attention mechanism, they are able to consider every single word simultaneously.
BERT, on the other hand, is "deeply bidirectional". This is because of the masked language model (MLM) pre-training objective used in BERT. (There are a lot of resources online; I can link some if need be.)
It's easy to get confused so don't worry about it.
(https://arxiv.org/pdf/1810.04805.pdf; link to the original BERT paper)
(https://arxiv.org/pdf/1706.03762.pdf; link to the original Transformer paper)
I usually set the 'unk' value to a randomly drawn vector or a zero vector.
This performs not badly, but in most situations it is also not the best choice for many tasks, I think.
I'm curious about the best method for handling the 'unk' word vector; thanks for any helpful advice.
If you're training word-vectors, the most-common strategy is to discard low-frequency terms entirely. (That's what the min_count setting does, in Google's original word2vec.c, Python gensim Word2Vec, etc.)
Whether you'd need to remember that something was at a particular place would be more common in sequence-learning scenarios, rather than plain word2vec. (If that's your concern, you could make your question more specific about why & how you're using word-vectors.)
I was reading about neural networks and found this:
"Many-state nominal variables are more difficult to handle. ST Neural Networks has facilities to convert both two-state and many-state nominal variables for use in the neural network. Unfortunately, a nominal variable with a large number of states would require a prohibitive number of numeric variables for one-of-N encoding, driving up the network size and making training difficult. In such a case it is possible (although unsatisfactory) to model the nominal variable using a single numeric index; a better approach is to look for a different way to represent the information."
This is exactly what is happening when I am building my input layer: one-of-N encoding is making the model very complex to design. However, it is mentioned above that you can use a single numeric index, and I am not sure what is meant by that. What is a better approach to represent the information? Can neural networks solve a problem with many-state nominal variables?
References:
http://www.uta.edu/faculty/sawasthi/Statistics/stneunet.html#gathering
Solving this task is very often crucial for modeling. Depending on the complexity of the distribution of this nominal variable, it is very often truly important to find a proper embedding of its values into R^n for some n.
One of the most successful examples of such an embedding is word2vec, where a mapping between words and vectors is learned. In other cases, you should either use a ready-made solution if one exists, or prepare your own via representation learning (e.g. with autoencoders or RBMs).
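As a side note on the "single numeric index" mentioned in the quoted text: that is just ordinal/label encoding. A minimal sklearn sketch (my own illustration, not part of the original answer):

from sklearn.preprocessing import OrdinalEncoder

colors = [["red"], ["green"], ["blue"], ["green"]]
enc = OrdinalEncoder()
indices = enc.fit_transform(colors)
# blue -> 0, green -> 1, red -> 2: a single numeric column instead of one-of-N,
# but the network may read a spurious ordering/distance into these values,
# which is why the quoted text calls this approach unsatisfactory and why a
# learned embedding is usually the better representation.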
I have 20 numeric input parameters (or more) and a single output parameter, and I have thousands of these data points. I need to find the relation between the input parameters and the output parameter. Some input parameters might not relate to the output parameter, or possibly none of them do. I want some magic system that can statistically calculate the output parameter when I provide all the input parameters, and it would be even better if this system also provided a confidence rate with the output result.
What technique (in machine learning) do I need to use to solve this problem? I think it should be a neural network, a genetic algorithm, or something related, but I'm not sure. Beyond that, I need to know the limitations of the technique.
Thanks,
Your question seems to simply define the regression problem, which can be solved by numerous algorithms and models, not just neural networks:
Support Vector Regression
Neural Networks
Linear regression (and many modifications and generalizations) using for example OLS method
Nearest Neighbours Regression
Decision Tree Regression
many, many more!
Simply look for "regression methods", "regression models", etc. In particular, the sklearn library implements many such methods.
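For instance, a minimal sklearn sketch of the setup described in the question (the data here is synthetic and the model choice is arbitrary):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X = np.random.rand(5000, 20)                                         # thousands of rows, 20 numeric inputs
y = 2 * X[:, 0] - X[:, 3] + np.random.normal(scale=0.1, size=5000)   # toy target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestRegressor(n_estimators=200).fit(X_train, y_train)

print(model.score(X_test, y_test))   # R^2 on held-out data: a rough measure of how well the relation is captured
print(model.feature_importances_)    # hints at which of the 20 inputs actually relate to the output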
I would recommend Genetic Programming (GP), which is a genetic-based machine learning approach where the learnt model is a single mathematical expression/equation that best fits your data. Most GP packages out there come with a standard regression suite which you can run "as is" with your data, with minimal setup cost.
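As one concrete possibility (gplearn is my example choice; the answer does not name a specific package), a minimal sketch:

import numpy as np
from gplearn.genetic import SymbolicRegressor

X = np.random.rand(1000, 20)
y = 3 * X[:, 0] - X[:, 1] ** 2        # toy relationship for the GP to recover

est = SymbolicRegressor(population_size=1000, generations=20)
est.fit(X, y)
print(est._program)                   # the single evolved expression that best fits the data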