Recurrent Neural Networks (RNN) with different sequence-length inputs - machine-learning

I have input texts of different lengths, from a few characters to a hundred words, so I decided to use a different MAX_LENGTH for each batch instead of a fixed MAX_LENGTH for all batches (obviously a shorter MAX_LENGTH for shorter texts).
After googling, I found this thread on the Keras GitHub page, which gave the following solution:
Sequences should be grouped together by length, and segmented manually
into batches by that length before being sent to Keras.
If I use this trick, I guess there is no way to shuffle the data at training time, which might lead to overfitting.
I have seen many discussions on Kaggle that use this trick. I want to know: is there any other solution to this problem?

One solution is to pad your data with a dummy value so that all your input sequences have the same length.
Suppose you have these two sequences:
[1,2,3,1,2], you keep it as [1,2,3,1,2]
[1,3,2,3], you pad it with zeros until it reaches the desired length: [1,3,2,3,0]
Then you start your model with a Masking layer.
This will automatically ignore the additional length in the samples.
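For example, a minimal sketch with tf.keras (the layer sizes and the binary output are illustrative assumptions, not from the original answer):

# Pad variable-length integer sequences to a common length and let a
# Masking layer hide the padded zeros from the LSTM.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Masking, LSTM, Dense

sequences = [[1, 2, 3, 1, 2],
             [1, 3, 2, 3]]

X = pad_sequences(sequences, padding='post', value=0)   # shape (2, 5)
X = X[..., np.newaxis].astype('float32')                # (batch, timesteps, features)

model = Sequential([
    Input(shape=(X.shape[1], 1)),
    Masking(mask_value=0.0),            # timesteps equal to 0 are ignored downstream
    LSTM(16),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')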

Related

Data Encoding for Training in Neural Network

I have converted 349,900 words from a dictionary file to MD5 hashes. Samples are below:
74b87337454200d4d33f80c4663dc5e5
594f803b380a41396ed63dca39503542
0b4e7a0e5fe84ad35fb5f95b9ceeac79
5d793fc5b00a2348c3fb9ab59e5ca98a
3dbe00a167653a1aaee01d93e77e730e
ffc32e9606a34d09fca5d82e3448f71f
2fa9f0700f68f32d2d520302906e65ce
1c9b32ff1b53bd892b87578a11cbd333
26a10043bba821303408ebce568a2746
c3c32ff3481e9745e10defa7ce5b511e
I want to train a neural network to decrypt a hash using just a simple architecture like a multilayer perceptron. Since every hash value has length 32, I was thinking that the number of input nodes should be 32, but the problem here is the number of output nodes. Since the outputs are words in the dictionary, they don't have any specific length; they can be of various lengths. That is why I'm confused about how many output nodes I should have.
How should I encode my data so that I can have a specific number of output nodes?
I have found a paper in this link that actually decrypts a hash using a neural network. The paper says:
The input to the neural network is the encrypted text that is to be decoded. This is fed into the neural network either in bipolar or binary format. This then traverses through the hidden layer to the final output layer which is also in the bipolar or binary format (as given in the input). This is then converted back to the plain text for further process.
How should I implement what is described in the paper? I am thinking of limiting the number of characters to decrypt. Initially, I can limit it to 4 characters only (just for test purposes).
My input layer will have 32 nodes, one representing each character of the hash. Each input node will take the (ASCII value of the hash character / 256). My output layer will also have 32 nodes, representing binary format. Since 8 bits/8 nodes represent one character, my network will only be able to decrypt up to 4 characters, because 32/8 = 4. (I can increase it if I want to.) For the hidden layer I'm planning to use 33 nodes. Is my network architecture feasible: 32 x 33 x 32? If not, why? Please guide me.
You could map the words in the dictionary into a vector space (e.g. bag of words, word2vec, ...). In that case the words are encoded with a fixed length, and the number of neurons in the output layer will match that length.
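Purely as an illustration (the word list and the fixed length below are made-up assumptions), one simple way to get fixed-length targets is to pad character indices up to a chosen maximum word length:

# Give every dictionary word a fixed-length target by padding character
# indices (a=1, b=2, ...) with 0 up to MAX_WORD_LEN.
from tensorflow.keras.preprocessing.sequence import pad_sequences

words = ['cat', 'banana', 'zebra']   # hypothetical dictionary entries
MAX_WORD_LEN = 8                     # chosen fixed length (assumption)

def encode_word(word, max_len=MAX_WORD_LEN):
    codes = [ord(c) - ord('a') + 1 for c in word.lower()]
    return pad_sequences([codes], maxlen=max_len, padding='post')[0]

targets = [encode_word(w) for w in words]
# Each target now has length MAX_WORD_LEN, so the output layer can have a
# fixed number of units (MAX_WORD_LEN, or MAX_WORD_LEN * alphabet size if
# each character is one-hot encoded).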
There's a great discussion about the possibility of cracking SHA256 hashes using neural networks in another Stack Exchange forum: https://security.stackexchange.com/questions/135211/can-a-neural-network-crack-hashing-algorithms
The accepted answer was:
No.
Neural networks are pattern matchers. They're very good pattern
matchers, but pattern matchers just the same. No more advanced than
the biological brains they are intended to mimic. More thorough, more
tireless, but not more sophisticated.
The patterns have to be there to be found. There has to be a bias in
the data to tease out. But cryptographic hashes are explicitly and
extremely carefully designed to eliminate any bias in the output. No
one bit is more likely than any other, no one output is more likely to
correlate to any given input. If such a correlation were possible, the
hash would be considered "broken" and a new algorithm would take its
place.
Flaws in hash functions have been found before, but never with the aid
of a neural network. Instead it's been with the careful application of
certain mathematical principles.
The following answer also makes a funny comparison:
SHA256 has an output space of 2^256, and an input space that's
essentially infinite. For reference, the time since the big bang is
estimated to be 5 billion years, which is about 1.577 x 10^27
nanoseconds, which is about 2^90 ns. So assuming each training
iteration takes 1 ns, you would need 2^166 ages of the universe to
train your neural net.

one hot encoding of output labels

While I understand the need to one-hot encode features in the input data, how does one-hot encoding of output labels actually help? The TensorFlow MNIST tutorial encourages one-hot encoding of output labels. The first assignment in CS231n (Stanford), however, does not suggest one-hot encoding. What's the rationale behind choosing / not choosing to one-hot encode output labels?
Edit: Not sure about the reason for the downvote, but just to elaborate more, I missed mentioning the softmax function along with the cross-entropy loss function, which is normally used in multinomial classification. Does it have something to do with the cross-entropy loss function?
Having said that, one can calculate the loss even without the output labels being one-hot encoded.
One-hot vectors are used in cases where the output has no natural order (it is categorical rather than ordinal). Let's assume you encode your output as integers, giving each label a number.
Integer values have a natural ordered relationship with each other, and machine learning algorithms may be able to learn and exploit this relationship, but your labels may be unrelated, with no similarity between them. For categorical variables where no such ordinal relationship exists, integer encoding is not a good choice.
In fact, using this encoding and allowing the model to assume a natural ordering between categories may produce unexpected results, where model predictions fall halfway between categories.
What do I mean by that?
The idea is that if we train an ML algorithm - for example a neural network - it's going to think that a cat (which is 1) is halfway between a dog and a bird, because they are 0 and 2 respectively. We don't want that; it's not true and it's an extra thing for the algorithm to learn.
The same may happen when the output is encoded in an n-dimensional space and the vectors have continuous values: the result may be hard to interpret and map back to labels.
In this case, one-hot encoding can be applied to the label representation, since it has a clear interpretation and its values are well separated: each label lies in a different dimension.
If you need more information, or would like to see the reasoning behind one-hot encoding from the perspective of the loss function, see https://www.linkedin.com/pulse/why-using-one-hot-encoding-classifier-training-adwin-jahn/
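As a minimal illustration (assuming tf.keras; the labels below are made up), to_categorical turns integer labels into one-hot rows, and the choice of target format mainly changes which loss variant you use:

# Integer labels vs. one-hot labels for a 3-class problem.
import numpy as np
from tensorflow.keras.utils import to_categorical

labels = np.array([0, 1, 2, 1])          # e.g. 0 = dog, 1 = cat, 2 = bird
one_hot = to_categorical(labels, num_classes=3)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
# With one-hot targets use categorical_crossentropy; with raw integer labels
# use sparse_categorical_crossentropy -- both compute the same quantity.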

Seq2Seq with Keras understanding

For some self-study, I'm trying to implement a simple sequence-to-sequence model using Keras. While I get the basic idea and there are several tutorials available online, I still struggle with some basic concepts when looking at these tutorials:
Keras tutorial: I've tried to adapt this tutorial. Unfortunately, it is for character sequences, while I'm aiming for word sequences. There is a block explaining what is required for word sequences, but this is currently throwing "wrong dimension" errors -- that's OK, probably some data preparation errors on my side. More importantly, in this tutorial I can clearly see the two types of input and one type of output: encoder_input_data, decoder_input_data, decoder_target_data
MachineLearningMastery tutorial: Here the network model looks very different: completely sequential, with one input and one output. From what I can tell, the decoder here gets just the output of the encoder.
Is it correct to say that these are indeed two different approaches to Seq2Seq? Which one is better, and why? Or am I reading the second tutorial wrongly? I already have an understanding of sequence classification and sequence labelling, but sequence-to-sequence hasn't properly clicked yet.
Yes, those two are different approaches, and there are other variations as well. MachineLearningMastery simplifies things a bit to make them accessible. I believe the Keras method may perform better and is what you will need if you want to advance to seq2seq with attention, which is almost always the case.
MachineLearningMastery has a hacky workaround that allows it to work without feeding in decoder inputs. It simply repeats the last hidden state and passes that as the input at each timestep. This is not a flexible solution.
model.add(RepeatVector(tar_timesteps))
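For context, a minimal sketch of that sequential RepeatVector-style model might look like the following; the vocabulary sizes, sequence lengths and unit counts are placeholders, not values from the tutorial:

# Encoder-decoder where the decoder only sees the encoder's final state,
# repeated once per output timestep.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, RepeatVector, TimeDistributed, Dense

src_vocab, tar_vocab = 5000, 5000       # hypothetical vocabulary sizes
tar_timesteps = 20                      # hypothetical target sequence length
n_units = 256

model = Sequential([
    Embedding(src_vocab, n_units, mask_zero=True),
    LSTM(n_units),                              # encoder: keep only the last hidden state
    RepeatVector(tar_timesteps),                # repeat it for every output timestep
    LSTM(n_units, return_sequences=True),       # decoder
    TimeDistributed(Dense(tar_vocab, activation='softmax')),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')  # targets one-hot per timestep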
On the other hand, the Keras tutorial involves several other concepts, like teacher forcing (using targets as inputs to the decoder), embeddings (or rather the lack of them) and a lengthier inference process, but it should set you up for attention.
I would also recommend the PyTorch tutorial, which I feel is the most appropriate method.
Edit:
I don't know your task, but what you would want for word embeddings is:
x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
Before that, you need to map every word in the vocabulary to an integer, turn every sentence into a sequence of integers, and pass those sequences of integers to the model (with an embedding layer of latent_dim, maybe 120). Each of your words is then represented by a vector of size 120. Your input sentences must also all be the same size, so find an appropriate maximum sentence length, truncate every sentence to that length, and pad with zeros if a sentence is shorter than the maximum length, where 0 represents a null word.
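A minimal sketch of that preprocessing with tf.keras (the toy corpus and the sizes are assumptions, not from the question):

# Map words to integers, pad to a common length, then embed.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Input, Embedding

sentences = ['the cat sat', 'the dog barked loudly']    # toy corpus
max_len, latent_dim = 10, 120                           # illustrative sizes

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)                       # word -> integer mapping
seqs = tokenizer.texts_to_sequences(sentences)          # sentences -> lists of integers
padded = pad_sequences(seqs, maxlen=max_len, padding='post')  # 0-padded, equal length

num_encoder_tokens = len(tokenizer.word_index) + 1      # +1 for the padding index 0
encoder_inputs = Input(shape=(max_len,))
x = Embedding(num_encoder_tokens, latent_dim, mask_zero=True)(encoder_inputs)
# x has shape (batch, max_len, 120): one 120-dimensional vector per word.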

Features of vector form of sentences for opinion finding.

I want to find the opinion of a sentence, either positive or negative. For example, consider just one sentence:
The play was awesome
If I change it to vector form:
[0,0,0,0]
After searching through the bag of words:
bad
naughty
awesome
The vector form becomes
[0,0,0,1]
The same goes for other sentences. Now I want to pass these to a machine learning algorithm for training (to find the opinion of unseen sentences). How can I train the network using these multiple vectors? Obviously I can't just feed in the raw sentences, because the input size of a neural network is fixed. Is there any way? The above procedure is just my own thinking; kindly correct me if I am wrong. Thanks in advance.
Your intuitive input format is a sentence, which is a string of tokens of arbitrary length. Representing sentences as raw token sequences is not a good choice, because many existing algorithms only work on inputs of a fixed format.
Hence, I suggest using a tokenizer on your entire training set. This will give you vectors whose length equals the size of the dictionary, which is fixed for a given training set.
Even when the lengths of the sentences vary drastically, the size of the dictionary stays stable.
You can then apply neural networks (or other algorithms) to the tokenized vectors, as in the sketch below.
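A minimal sketch with scikit-learn (the toy sentences, labels and the choice of classifier are assumptions) of how every sentence, regardless of its length, becomes a fixed-length count vector:

# Turn sentences into fixed-length vectors over the training vocabulary.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_sentences = ['The play was awesome', 'The plot was bad and naughty']
labels = [1, 0]                                  # 1 = positive, 0 = negative (toy labels)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_sentences)    # shape: (n_sentences, vocab_size)
clf = LogisticRegression().fit(X, labels)        # any classifier works on these vectors
prediction = clf.predict(vectorizer.transform(['The play was bad']))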
However, the vectors generated by the tokenizer are extremely sparse, because you are only working with sentences rather than articles.
You can try LDA (Linear Discriminant Analysis, which is supervised, unlike PCA) to reduce the dimensionality as well as amplify the differences between classes.
That will keep the essential information in your training data while expressing it at a fixed size that is not too large.
By the way, you may not have to label each word by its sentiment, since the opinion of a sentence also depends on other kinds of words.
Simple arithmetic on the number of opinion-expressing words may leave your model highly biased. It is better to label the sentences and leave the rest of the job to the classifier.
To clear up the confusion:
PCA and LDA are dimensionality reduction techniques.
The difference:
Let's assume each sample is denoted as x (a 1-by-p vector). If p is too large, we don't like that.
We look for a matrix A (p-by-k) in which k is pretty small, so that reduced_x = x * A; most importantly, reduced_x must still be able to represent the characteristics of x.
Given labelled data, LDA can provide an A that maximizes the distance between the reduced_x of different classes while minimizing the distance within each class.
In simple words: compress the data, keep the information.
Once you have reduced_x, you can define the training data as (reduced_x | y), where y is 0 or 1.
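A minimal sketch of that LDA step with scikit-learn (the data here is random and purely illustrative):

# Reduce p-dimensional, labelled vectors to a small, class-separating space.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.random.rand(100, 500)               # 100 sentence vectors, p = 500
y = np.random.randint(0, 2, size=100)      # labels: 0 or 1

lda = LinearDiscriminantAnalysis(n_components=1)   # k <= n_classes - 1
reduced_x = lda.fit_transform(X, y)        # shape (100, 1)
# (reduced_x, y) is the training data described above.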

word2vec application to non-human words: an implementation that allows providing my own context and distance

I would like to use word2vec to transform 'words' into numerical vectors and possibly make predictions for new words. I've tried extracting features from words manually and training a linear regression model (using Stochastic Gradient Descent), but this only works to an extent.
The input data I have is:
Each word is associated with a numerical value. You can think of this value as being the word's coordinate in 1D space.
For each word I can provide a distance to any other word (because I have the words' coordinates).
Because of this, I can provide the context for each word: given a distance, I can list all the other words within this distance of the target one.
Words are composed of Latin letters only (e.g. AABCCCDE, BKEDRRS).
Words almost never repeat, but their structural elements repeat a lot within different words.
Words can be of different length (say 5-50 letters max).
Words have common features; some subsequences in them occur multiple times in different words (e.g. certain doublets or triplets of letters, their position within a word, etc.).
The question:
Is there an implementation of word2vec which allows provision of your own distances and context for each word?
A big bonus would be if the trained model could spit out the predicted coordinate for any word you feed in after training.
Preferably in Java; Python is also fine, but in general anything will do.
I am also not restricting myself to word2vec; it just seems like a good fit. My knowledge of machine learning and data mining is very limited, though, so I might be missing a better way to tackle the problem.
PS: I know about deeplearning4j, but I haven't looked around the code enough to figure out if what I want to do is easy to implement in it.
Example of data: (typical input contains thousands to tens of thousands of words)
ABCD 0.50
ABCDD 0.51
ABAB 0.30
BCDAB 0.60
DABBC 0.59
SPQTYRQ 0.80
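One possible direction, purely as a rough sketch (in Python with gensim, even though Java is preferred; the coordinates, radius and helper below are assumptions): word2vec implementations consume token "sentences", so a custom notion of context can be encoded by building one pseudo-sentence per word from its distance-based neighbourhood:

# Build distance-based context groups and train gensim's Word2Vec on them
# (gensim >= 4.0 API).
from gensim.models import Word2Vec

coords = {'ABCD': 0.50, 'ABCDD': 0.51, 'ABAB': 0.30,
          'BCDAB': 0.60, 'DABBC': 0.59, 'SPQTYRQ': 0.80}
radius = 0.05                      # hypothetical context distance

def neighbourhood(word):
    # All words within `radius` of `word`, with the target word first.
    c = coords[word]
    return [word] + [w for w in coords if w != word and abs(coords[w] - c) <= radius]

pseudo_sentences = [neighbourhood(w) for w in coords]
model = Word2Vec(pseudo_sentences, vector_size=16, window=5, min_count=1, sg=1)
vec = model.wv['ABCD']             # learned vector for one word

Since the words themselves rarely repeat, a subword-based model such as gensim's FastText, which builds vectors from character n-grams and can infer vectors for unseen words, might be a better fit for predicting values for new words.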
