What is the difference between RNN output and rule-based output? - machine-learning

I am new to machine learning and I have a question. I am following this tutorial, and I read about LSTMs and RNNs. I used the code provided by the tutorial and ran it; it completed training, and then I gave it some strings for testing.
The training data is this:
The output is:
Iter= 20000, Average Loss= 0.531466, Average Accuracy= 84.60%
['the', 'sly', 'and'] - [treacherous] vs [treacherous]
Optimization Finished!
Elapsed time: 12.159853319327036 min
Run on command line.
tensorboard --logdir=/tmp/tensorflow/rnn_words
Point your web browser to: http://localhost:6006/
3 words: ,hello wow and
Word not in dictionary
3 words: mouse,mouse,mouse
3 words: mouse
3 words: mouse mouse mouse
mouse mouse mouse very well , but who is to bell the cat approaches the until will at one another and take mouse a receive some signal of her approach , we he easily escape
3 words: had a general
had a general to proposal to make round the neck will all agree , said he easily at and enemy approaches to consider what common the case . you will all agree , said he
3 words: mouse mouse mouse
mouse mouse mouse very well , but who is to bell the cat approaches the until will at one another and take mouse a receive some signal of her approach , we he easily escape
3 words: what was cat
what was cat up and said he is all very well , but who is to bell the cat approaches the until will at one another and take mouse a receive some signal of her
3 words: mouse fear cat
Word not in dictionary
3 words: mouse tell cat
Word not in dictionary
3 words: mouse said cat
Word not in dictionary
3 words: mouse fear fear
Word not in dictionary
3 words: mouse ring bell
Word not in dictionary
3 words: mouse ring ring
Word not in dictionary
3 words: mouse bell bell
mouse bell bell and general to make round the neck will all agree , said he easily at and enemy approaches to consider what common the case . you will all agree , said he
3 words: mouse and bell
mouse and bell this means we should always , but looked is young always , but looked is young always , but looked is young always , but looked is young always , but looked
3 words: mouse was bell
mouse was bell and said he is all very well , but who is to bell the cat approaches the until will at one another and take mouse a receive some signal of her approach
3 words:
Now here is what I am not getting: when I give it three words, it returns something that we could seemingly achieve with a regular expression or rule-based if-else code, e.g. if the input words are in the file, fetch the previous or next sentences. What is special about this output? How is it different? Please explain.
For example, it sometimes says "Word not in dictionary", so if I can only give words that are in the training file, then it feels like it is just matching the input words against the training data and fetching some result from the file. We could do the same thing with if-else, in pure programming without any module, so how is it different?

Your training dataset only has ~180 words and is achieving an 84.6% (training) accuracy, so it is overfitting quite a bit. Essentially, the model is simply predicting the next most likely word based on the training data.
Usually language models are trained on much larger datasets, such as PTB or the 1B Word Benchmark. PTB is still a small dataset, with roughly one million training words, while the 1B Word Benchmark has one billion.
RNN models have a limited vocabulary so that words or characters can be encoded. The vocabulary size depends on the model; most word-level models trained on PTB use a vocabulary of 10,000 words, which is enough for most common words.
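To make the contrast with a rule-based lookup concrete, here is a minimal sketch (a toy example using the Keras API, not the tutorial's code) of what the trained RNN actually computes: given three input words, it outputs a probability for every word in its vocabulary, rather than fetching a stored sentence. An if-else lookup can only return text it has already seen; the model scores every possible next word, even for three-word combinations that never occur together in the training file.
# Toy sketch: the vocabulary and layer sizes below are made up.
import numpy as np
import tensorflow as tf

vocab = ["the", "cat", "mouse", "bell", "said", "who", "is", "to"]
word_to_id = {w: i for i, w in enumerate(vocab)}

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=16),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(len(vocab), activation="softmax"),  # P(next word | 3 input words)
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

# Three input word IDs in -> one probability per vocabulary word out.
x = np.array([[word_to_id[w] for w in ["who", "is", "to"]]])
probs = model.predict(x)[0]                       # shape (len(vocab),), sums to 1
print(sorted(zip(vocab, probs), key=lambda p: -p[1])[:3])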

Related

Is a small vocabulary for Neural Nets ok?

I am designing a neural network to try and generate music. The neural network would be a 2 layered LSTM (Long Short Term Memory).
I am hoping to encode the music into a many-hot format for training, i.e. it would be a 1 if that note is playing and a 0 if it is not.
Here is an excerpt of what this data would look like:
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000001000100100001000000000000000000000000
0000000000000000000000000000000000000000000000000011010100100001010000000000000000000000
There are 88 columns which represent 88 notes, and each row represents a new beat. The output will be at a character level.
I am just wondering since there are only 2 characters in the vocabulary, would the probability of a 0 being next always be higher than the probability of a 1 being next?
I know for a large vocabulary, a large training set is needed, but I only have a small vocabulary. I have 229 files which corresponds to about 50,000 lines of text. Is this enough to prevent the output being all 0s?
Also, would it be better to have 88 nodes, 1 for each note, or just one node for one character at a time?
Thanks in advance
A small vocabulary is fine as long as your dataset is not skewed overwhelmingly towards one of the "words".
As to "would it be better to have 88 nodes, 1 for each note, or just one node for one character at a time?": each timestep is represented as 88 characters, and each character is a feature of that timestep. Your LSTM should output the next timestep, so you should have 88 nodes, each outputting the probability of that note being present in that timestep.
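A rough sketch of that setup (the layer sizes are assumptions, written with the Keras API rather than anything from the question): the model reads a window of previous beats and outputs, for each of the 88 notes, the probability that it is playing on the next beat.
import tensorflow as tf

seq_len, n_notes = 32, 88            # hypothetical window length; 88 notes per beat
model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_len, n_notes)),
    tf.keras.layers.LSTM(256, return_sequences=True),
    tf.keras.layers.LSTM(256),
    tf.keras.layers.Dense(n_notes, activation="sigmoid"),  # independent per-note probabilities
])
# Binary cross-entropy treats each note as its own yes/no prediction.
model.compile(loss="binary_crossentropy", optimizer="adam")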
Finally, since you are building a char-RNN, I would strongly suggest using ABC notation to represent your data. A song in ABC notation looks like this:
X:1
T:Speed the Plough
M:4/4
C:Trad.
K:G
|:GABc dedB|dedB dedB|c2ec B2dB|c2A2 A2BA|
GABc dedB|dedB dedB|c2ec B2dB|A2F2 G4:|
|:g2gf gdBd|g2f2 e2d2|c2ec B2dB|c2A2 A2df|
g2gf g2Bd|g2f2 e2d2|c2ec B2dB|A2F2 G4:|
This is perfect for char-RNNs because it represents every song as a sequence of characters, and you can run conversions from MIDI to ABC and vice versa. All you have to do is train your model to predict the next character in this sequence instead of dealing with 88 output nodes.
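As a small illustration, turning an ABC tune into char-RNN training data is just a matter of mapping characters to integer IDs (a hypothetical snippet in plain Python, not from the answer):
# The char-RNN is then trained to predict ids[t+1] from the ids up to position t.
abc_tune = "X:1\nT:Speed the Plough\nM:4/4\nK:G\n|:GABc dedB|dedB dedB|"
chars = sorted(set(abc_tune))
char_to_id = {c: i for i, c in enumerate(chars)}
ids = [char_to_id[c] for c in abc_tune]
print(len(chars), "distinct characters instead of 88 simultaneous output nodes")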

Estimating both the category and the magnitude of output using neural networks

Let's say I want to estimate which courses a final-year student will take and which grades they will receive in those courses. We have data on previous students' courses and grades for each year (not just the final year) to train with. We also have the grades and courses from previous years for the students we want to estimate results for. I want to use a recurrent neural network with long short-term memory to solve this problem. (I know this problem can be solved by regression, but I want the neural network specifically, to see if this problem can be properly solved with one.)
The way I want to set up the output (label) space is to have a feature for each possible course a student can take, with a value between 0 and 1 in each entry describing whether the student will attend the class (if not, the entry for that course would be 0) and, if so, what their mark would be (i.e. if the student attends class A and gets 57%, then the label for class A would be 0.57).
Am I setting the output space properly?
If yes, what optimization and activation functions I should use?
If no, how can I re-shape my output space to get good predictions?
If I understood you correctly, you want the network to be given the history of a student and then output one entry for each course. This entry is supposed to simultaneously signify whether the student will take the course (0 for not taking it, 1 for taking it) and also give the expected grade? Then the interpretation of the output for a single course would be like this:
0.0 -> won't take the course
0.1 -> will take the course and get 10% of points
0.5 -> will take the course and get half of points
1.0 -> will take the course and get full points
If this is indeed your plan, I would definitely advise to rethink it.
Some obviously realistic cases do not fit this pattern. For example, how would you represent that an (A+) student is "unlikely" to take a course? Should the network output 0.9999, because (s)he is very likely to get the maximum number of points if (s)he takes the course, OR should it output 0.0001, because the student is very unlikely to take the course at all?
Instead, you should output two values in [0, 1] for each student and each course:
First value in [0, 1] gives the probability that the student will participate in the course
Second value in [0, 1] gives the expected relative number of points.
As loss, I'd propose something like binary cross-entropy on the first value, and simple square error on the second, and then combine all the losses using some L^p metric of your choice (e.g. simply add everything up for p=1, square and add for p=2).
Few examples:
(0.01, 1.0) : very unlikely to participate, would probably get 100%
(0.5, 0.8): 50%-50% whether participates or not, would get 80% of points
(0.999, 0.15): will participate, but probably pretty much fail
The quantity that you wanted to output seemed to be something like the product of these two, which is a bit difficult to interpret.
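A minimal sketch of that combined loss for a single course (plain NumPy; masking the grade error by participation is my own assumption, the answer leaves that detail open):
import numpy as np

# p = predicted participation probability, g = predicted relative grade.
def course_loss(p, g, took_course, grade, p_norm=1):
    bce = -(took_course * np.log(p) + (1 - took_course) * np.log(1 - p))
    se = took_course * (g - grade) ** 2        # grade error only counts if the course was taken
    # Combine the two terms with an L^p metric; p_norm=1 simply adds them up.
    return (bce ** p_norm + se ** p_norm) ** (1.0 / p_norm)

print(course_loss(p=0.9, g=0.8, took_course=1, grade=0.75))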
There is more than one way to solve this problem. Andrey's answer gives one good approach.
I would like to suggest simplifying the problem by bucketing grades into categories and adding an additional category for "did not take", for both input and output.
This turns the task into a classification problem only, and solves the issue of trying to differentiate between receiving a low grade and not taking the course in your output.
For example your training set might have m students, n possible classes, and six possible results: ['A', 'B', 'C', 'D', 'F', 'did_not_take'].
And you might choose the following architecture:
Input -> Dense Layer -> RELU -> Dense Layer -> RELU -> Dense Layer -> Softmax
Your input shape is (m, n, 6) and your output shape could be (m, n*6), where you apply softmax for every group of 6 outputs (corresponding to one class) and sum into a single loss value. This is an example of multiclass, multilabel classification.
I would start by trying 2n neurons in each hidden layer.
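A rough Keras sketch of this architecture (the sizes are hypothetical; the grouped softmax is done here by reshaping to one 6-way distribution per course):
import tensorflow as tf

n, n_results = 20, 6                     # hypothetical number of courses; 6 result buckets
inputs = tf.keras.Input(shape=(n, n_results))
x = tf.keras.layers.Flatten()(inputs)
x = tf.keras.layers.Dense(2 * n, activation="relu")(x)
x = tf.keras.layers.Dense(2 * n, activation="relu")(x)
x = tf.keras.layers.Dense(n * n_results)(x)
x = tf.keras.layers.Reshape((n, n_results))(x)
outputs = tf.keras.layers.Softmax(axis=-1)(x)   # one 6-way softmax per course
model = tf.keras.Model(inputs, outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam")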
If you really want a continuous output for grades, however, then I recommend using separate classification and regression networks. This way you don't have to combine classification and regression loss into one number, which can get messy with scaling issues.
You can keep the grade buckets for input data only, so the two networks take the same input data, but for the grade regression network your last layer can be n sigmoid units with log loss. These will output numbers between 0 and 1, corresponding to the predicted grade for each class.
If you want to go even further, consider using an architecture that takes into account the order in which students took previous classes. For example, if a student took French I the previous year, it is more likely he/she will take French II this year than if he/she took French in freshman year and did not continue with it after that.

Using SVM to predict text with label

I have data in a csv file in the following format
Name Power Money
Jon Red 30
George blue 20
Tom Red 40
Bob purple 10
I consider values like "jon", "red" and "30" as inputs. Each input has a label. For instance, the inputs [jon, george, tom, bob] have the label "name", and the inputs [red, blue, purple] have the label "power". This is basically my training data: a bunch of values that are each mapped to a label.
Now I want to use an SVM to train a model on this training data that accurately identifies the correct label for a new input. For instance, if the input provided is "444", the model should be smart enough to categorize it under the "Money" label.
I have installed Python and also installed sklearn, and I have completed the following tutorial as well. I am just not sure how to prepare the input data to train the model.
Also, I am new to machine learning, so if I have said something that sounds wrong or odd, please point it out, as I will be happy to learn the correct way.
With how your current question is formulated, you are not dealing with a typical machine learning problem. Currently, you have column-wise data:
Name Power Money
Jon Red 30
George blue 20
Tom Red 40
Bob purple 10
If a user now inputs "Jon", you know it is going to be of type "Name" by a simple hash-map look-up, e.g.:
hashmap["Jon"] -> "Name"
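In Python this look-up needs nothing more than a dictionary built from the columns (a tiny illustration, not part of the original answer):
# Build the value -> column-name map once; every "prediction" is then a dict lookup.
table = {"Name": ["Jon", "George", "Tom", "Bob"],
         "Power": ["Red", "blue", "purple"]}
lookup = {value: column for column, values in table.items() for value in values}
print(lookup["Jon"])   # -> "Name"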
The main reason people are saying it is not a machine-learning problem is that your "categorisation" or "prediction" is defined entirely by your column names. Machine learning problems, instead, typically predict some response variable. For example, imagine you had asked this instead:
Name Power Money Bought_item
Jon Red 30 yes
George blue 20 no
Tom Red 40 no
Bob purple 10 yes
We could build a model to predict Bought_item using the features Name, Power, and Money using SVM.
Your problem would have to look more like:
Feature1 Feature2 Feature3 Category
1.0 foo bar Name
3.1 bar foo Name
23.4 abc def Money
22.22 afb dad Power
223.1 dad vxv Money
You then use Feature1, Feature2, and Feature3 to predict Category. At the moment your question does not give enough information for anyone to really understand what you need or what data you have; you would either have to reformulate it this way, or consider an unsupervised approach.
Edit:
So frame it this way:
Name Power Money Label
Jon Red 30 Foo
George blue 20 Bar
Tom Red 40 Foo
Bob purple 10 Bar
OneHotEncode Name and Power, so you now have a variable for each name that can be 0/1.
Standardise Money so that it has a range between approximately -1 and 1.
LabelEncode your labels so that they are 0,1,2,3,4,5,6 and so on.
Use a One vs. All classifier, http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html.
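A sketch of those steps with scikit-learn (toy data from the question; the exact preprocessing choices are assumptions on my part):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.pipeline import make_pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

df = pd.DataFrame({"Name": ["Jon", "George", "Tom", "Bob"],
                   "Power": ["Red", "blue", "Red", "purple"],
                   "Money": [30, 20, 40, 10],
                   "Label": ["Foo", "Bar", "Foo", "Bar"]})

features = ColumnTransformer([
    ("onehot", OneHotEncoder(), ["Name", "Power"]),  # one 0/1 column per name and power
    ("scale", StandardScaler(), ["Money"]),          # standardise Money
])
y = LabelEncoder().fit_transform(df["Label"])        # labels -> 0, 1, 2, ...

clf = make_pipeline(features, OneVsRestClassifier(LinearSVC()))
clf.fit(df[["Name", "Power", "Money"]], y)
print(clf.predict(df[["Name", "Power", "Money"]]))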

What does the multiple outputs in skip-gram mean?

I've been trying to understand the process of skip-gram learning algorithm. There's this small detail that confuses me.
In the following graph (which is used in many articles and blog posts to explain skip-gram), what do the multiple outputs mean? I mean, the input word is the same and the output matrix is the same, so when you calculate the output vector, which I believe is the probability distribution of all words appearing near the input word, it should be the same every time.
[figure: skip-gram model diagram]
Hope someone can help me with this~
This article seems to explain it adequately — each "chunk" of the output represents the prediction of a word at one position in the context (the window of words before and after the input word in the text). The output is "really" a single vector, but the diagram is trying to make it clear that it corresponds to C instances of a word-vector where C is the size of the context.
It's kind of a prone-to-misinterpretation diagram. Each of the three outputs in that diagram should be considered the results for a different input (context) word.
Feed it word 1, and through the hidden layer to the output layer, you'll get V (size-of-vocabulary) output values (one at each output node, assuming the easier-to-think-about negative-sampling mode) – the top results in the diagram. Feed it word 2, and you'll get the middle results. Feed it word 3, and you'll get the bottom results.
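A tiny NumPy sketch of the forward pass (toy sizes, my own illustration) makes the first answer's point explicit: for a fixed input word, the softmax over the vocabulary is computed once, and the C output "panels" in the diagram are just that same distribution repeated, one per context position; they differ only in which context word is used as the training target.
import numpy as np

V, N, C = 10, 4, 3                      # vocabulary size, hidden size, context size
rng = np.random.default_rng(0)
W_in, W_out = rng.normal(size=(V, N)), rng.normal(size=(N, V))

x = 2                                   # index of the one-hot input word
h = W_in[x]                             # hidden layer = the input word's row of W_in
scores = h @ W_out
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the whole vocabulary

panels = np.tile(probs, (C, 1))         # the diagram's C outputs: C copies of probs
print(panels.shape)                     # (3, 10)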

Store textual dataset for binary classification

I am currently working on a machine learning project and am in the process of building the dataset. The dataset will be comprised of a number of different textual features, varying in length from 1 sentence to around 50 sentences (including punctuation). What is the best way to store this data so I can then pre-process it and use it for machine learning in Python?
In most cases you can use a method called Bag of Words; however, when you are performing a more complicated task like similarity extraction, or want to make comparisons between sentences, you should use Word2Vec.
Bag of Words
You may use the classical Bag-of-Words representation, in which you encode each sample into a long vector indicating the count of all the words from all samples. For example, if you have two samples:
"I like apple, and she likes apple and banana.",
"I love dogs but Sara prefer cats.".
Then all the possible words are (order doesn't matter here):
I she Sara like likes love prefer and but apple banana dogs cats , .
Then the two samples will be encoded to
First: 1 1 0 1 1 0 0 2 0 2 1 0 0 1 1
Second: 1 0 1 0 0 1 1 0 1 0 0 1 1 0 1
If you are using sklearn, the task would be as simple as:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
# Now you can feed X into any other machine learning algorithms.
Word2Vec
Word2Vec is a more complicated method, which attempts to find relationships between words by training an embedding neural network underneath. An embedding, in plain English, can be thought of as the mathematical representation of a word in the context of all the samples provided. The core idea is that words are similar if their contexts are similar.
The result of Word2Vec is the vector representation (embedding) of every word that appears in the samples. The amazing thing is that we can perform arithmetic operations on these vectors. A cool example is: Queen - Woman + Man = King (reference here).
To use Word2Vec, we can use a package called gensim, here is a basic setup:
from gensim.models import Word2Vec

# `sentences` is your data: a list of tokenised sentences (each a list of words)
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
model.most_similar(positive=['woman', 'king'], negative=['man'])
>>> [('queen', 0.50882536), ...]
Here sentences is your data, and size is the dimension of the embeddings: the larger size is, the more space is used to represent each word, but there is also more risk of overfitting to think about. window is the size of the context we care about: it is the number of words around the target word that we look at when predicting the target from its context during training.
One common way is to create your dictionary (all the possible words) and then encode each of your examples in terms of this dictionary. For example (this is a very small and limited dictionary, just for illustration), you could have the dictionary: hello, world, from, python. Every word is associated with a position, and for each of your examples you define a vector with 0 for absence and 1 for presence, so the example "hello python" would be encoded as: 1, 0, 0, 1.
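A minimal sketch of this dictionary-based encoding in plain Python (the helper name is hypothetical, not from the answer):
dictionary = ["hello", "world", "from", "python"]

def encode(text):
    # 1 if the dictionary word occurs in the text, 0 otherwise
    words = text.lower().split()
    return [1 if word in words else 0 for word in dictionary]

print(encode("hello python"))  # -> [1, 0, 0, 1]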
