I have more of an abstract question about LSTMs.
So I have time series data: a long series of, let's say, 30,000 data points.
For LSTMs in Tensorflow, the input shape is [batch_size, time_steps, features]. My batch size is equal to 1 since I only have one time series.
Now my question is about the role of the sequence length.
I could pass the entire time series into a TensorFlow LSTM using time_steps=1, in which case the data points would be fed into the LSTM one by one. Or I could use some number of time steps greater than 1 and make it a sequence-to-sequence model.
This way, I could plug in 7 datapoints at once (= a week of data) and predict 7 outputs (= one week into the future).
The thing is, we know that the hidden state of the LSTM will eventually remember the entire dataset (if we do enough epochs of training). So what is the difference between
a) timesteps=1, I predict one period ahead, and plug the prediction back into the neural net 7 times,
b) timesteps=7, I predict a sequence of 7 (=one week).
In practical terms, seq2seq models (i.e. your second example) are far more difficult to train. In anthropomorphic terms, your LSTM will face considerable pressure to encode all of the inputs into a single vector, a process that discards quite a bit of information, and then somehow decode that single vector into 7 outputs in the correct sequential order.
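To make the shapes concrete, here is a rough Keras sketch of the two setups; the layer sizes are placeholders I chose for illustration, and the second model is a simplified stand-in for a full encoder-decoder seq2seq:

```python
import tensorflow as tf

# a) time_steps=1: one step in, one step out; roll predictions forward 7 times.
one_step = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(1, 1)),    # [batch, 1 step, 1 feature]
    tf.keras.layers.Dense(1),
])

# b) time_steps=7: a week in, one value out per step (seven outputs at once).
week_model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=True, input_shape=(7, 1)),
    tf.keras.layers.Dense(1),                        # applied at each time step
])
```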
I am new to the machine/deep learning area!
If I understood correctly, when I am using images as input,
the number of neurons in the input layer = the number of pixels (i.e. the resolution),
and the weights and biases are updated through back-propagation to achieve as low an error rate as possible.
Question 1.
So, even a single image will adjust the values of the weights & biases (through the back-propagation algorithm), so how does adding more similar images to this MLP improve the performance?
(I must be missing something big... to me, it seems like the network will only be optimised for the given single image, and if I input the next one (a similar image), it will only be optimised for that next one.)
Question 2.
If I want to train my MLP to recognise certain types of images (let's say clothes / animals), what is a good training set size for each label (i.e. clothes, animals)? I know a larger training set will produce better results, but how many examples would be ideal for good enough performance?
Question 3. (continued)
A question from a slightly different angle:
There is the Google Cloud Vision API, which takes images as input and produces labels/probabilities as output. So this API will give me an output of, let's say, 100 labels and the probability of each label.
(E.g. when I put in an online game screenshot, it will produce output as below.)
Can this type of data be used as input to an MLP to categorise certain types of images?
(Assuming I know all the possible labels that the Google API produces and use all of them as input neurons.)
Pixel values represent an image. But I think this type of API output can also represent an image, from a different angle.
If so, what would be the performance difference?
E.g. when classifying 10 different types of images:
(model trained on pixels) vs (model trained on output labels)
I can help you with the "intuitive" picture.
First, it may be worth looking at convolutional neural nets and deep learning to see how images are handled as input in order to reduce the number of weights. It will not be 1 weight per pixel.
Also, what exactly do you mean by "performance"? That is not a well-defined question. If you use 1 image, say a cat, do you mean by performance how well you can identify cats in other pictures, or how well you are able to get close to your one cat image?
Imagine you have a network with 3 weights, 1 input and 1 output, you trained it to an error of < 0.01, and the desired output is 0.5:

W1  | W2  | W3   | Output
0.1 | 0.2 | 0.05 | 0.5006

If you retrain the network, you may get a different set of weights:

W1  | W2  | W3   | Output
0.3 | 0.2 | 0.08 | 0.49983
Since the weights are way different, you can imagine that there are several solutions.
Then, if you add another input, you can imagine that some of those weight sets which worked for the first input will also work for the second.
Then you add another input: a subset of the solutions that worked for 2 inputs will work for 3 inputs. Etc.
When you have enough unrelated or noisy inputs, you won't find a subset of weights which meet your error criterion. Either you need to add weights (more degrees of freedom) or increase the error target, or both.
Now, you have a learning rate when you train a network. Say you are doing online training (you update the weights for each input), not batch training (you compute the error vector over a batch (subset) of the inputs and update your weights based on that, once per batch).
Now, suppose your learning rate was 0.01 and one weight was 0.1. Intuitively:
If, for the first input, that weight had a derivative of 5, then the weight gets a new value of 0.1 - 0.01*5 = 0.05.
If you feed your next input, say the derivative was -5. That means that the second input "disagrees" with the first change, and tries to push the weight back up to 0.1.
If the derivative for the second input was 5, that means the second input "agrees" with the first.
If you have 20 inputs, some will pull the value up, some will push the value down. You keep looping through the training and then the value will approach a value which most of the inputs agree on, hence minimizing the error caused by that weight.
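As a hedged toy illustration of that agree/disagree intuition (the numbers are just the ones from the example above):

```python
# Online SGD updates on a single weight, learning rate 0.01, starting at 0.1.
w, lr = 0.1, 0.01

for g in [5.0, -5.0, 5.0]:        # per-input derivatives from the example
    w -= lr * g
    print(f"gradient {g:+.1f} -> weight {w:.2f}")
# gradient +5.0 -> weight 0.05
# gradient -5.0 -> weight 0.10   (this input "disagrees" and pulls the weight back)
# gradient +5.0 -> weight 0.05
```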
For question 2:
My mathematical gut feeling tells me you definitely need at least 2x the number of weights for the training to mean anything, but you should make that at least 10x the number of weights as a bare minimum before drawing any conclusion about your network, unless you are not trying to generalise to something new (for example, for an XOR gate you can probably get away with far fewer inputs than weights, but that is a longer discussion).
Note:
With 1 image, you can rotate it, stretch it, mix it with other images, and so on, to create new images and increase your input set (a rough sketch follows below).
If you have a simple input like an XOR gate, you can create inputs like (0.3, 0.7), (0.3, 0.6), (0.2, 0.8)... to expand your training set.
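For the image case, a minimal augmentation sketch (assuming Pillow is installed; the file name is made up):

```python
from PIL import Image

img = Image.open("example.png")                        # hypothetical image file
augmented = [
    img.rotate(15),                                    # small rotations
    img.rotate(-15),
    img.transpose(Image.Transpose.FLIP_LEFT_RIGHT),    # mirror
    img.resize((int(img.width * 1.2), img.height)),    # horizontal stretch
]
```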
For question 3:
This is equivalent to chaining Google's network with a network you create, in series, but training each part separately.
Basically: You have Pictures --> 10 labels input to your network --> your classification
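A hedged sketch of that chaining, with the API's label/probability vector as the input features of a small MLP (the 100-label vocabulary size is an assumption carried over from the question, not something documented here):

```python
import tensorflow as tf

NUM_API_LABELS = 100   # assumed size of the API's label vocabulary
NUM_CLASSES = 10       # your own categories

mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(NUM_API_LABELS,)),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
mlp.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
```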
The problem I see there is that you may not know all the possible outputs of Google's classification. But say they are consistent:
Is your label the same as one of the 10 labels? If so, use the given label. If it is a different type of label, you can use that API to simplify your network. What are the consequences, or what is the performance?
That is beyond me. While neural nets have good mathematical theories to tell us what they can do, many posed problems such as the one you asked require either a special mathematical analysis (perhaps a PhD's worth of insight into that class of problems) or, as most people do, showing empirical results.
I have time series data of size 100,000 x 5: 100,000 samples and five variables. I have labeled each of the 100,000 samples as either 0 or 1, i.e. binary classification.
I want to train it using an LSTM because of the time series nature of the data. I have seen examples of LSTMs for time series prediction; is it suitable to use one in my case?
Not sure about your needs.
LSTM is best suited to sequence modelling, like the time series you mention, but your description doesn't quite read like a time series problem.
Anyway, you can use an LSTM on time series not for prediction but for classification, as in this article.
In my experience, for binary classification with only 5 features you could find better methods; an LSTM will consume more memory than other methods and could get worse results.
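For what it's worth, a minimal Keras sketch of that classification framing: slice the 100,000 x 5 series into fixed windows and predict one 0/1 label per window. The window length is my assumption, not something given in the question:

```python
import tensorflow as tf

WINDOW = 50   # assumed window length
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(WINDOW, 5)),   # 5 variables per step
    tf.keras.layers.Dense(1, activation="sigmoid"),      # binary label per window
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```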
First of all, you can see it from a different perspective: instead of having 100,000 labeled samples of 5 variables, you can treat it as 100,000 unlabeled samples of 6 variables, where the 6th variable is the label.
Therefore, you can train your LSTM as a multivariate predictor of the 6th variable, i.e. the sample label, and compare the predictions with the ground truth during testing to evaluate its performance.
I have the below dataset for a chemical process, where 5 consecutive input vectors produce 1 output. Each input is sampled every minute while the output is sampled every 5 minutes.
Since I believe the output depends on the 5 previous input vectors, I decided to look at LSTMs for my design. After a lot of research on what my LSTM architecture should be, I concluded that I should mask part of the output sequence with zeros and only keep the last output. The final architecture, according to my dataset, is below:
My question is: What should be my 3D input tensor parameters? E.g. [5, 5, ?]? And also what should be my "Batch size"? Should it be the quantity of my samples?
Since you are going for many-to-one sequence modelling, you don't need to pad your output with zeros (it's not needed). The easiest thing would be to produce the output at the last time step, i.e. after the RNN/LSTM sees the 5th input. The dimensions of your 3D input tensor will be [batch_size, sequence_length, input_dimensionality], where sequence_length is 5 in your case (rows 1-5, 7-11, 13-17, etc.), and input_dimensionality is also 5 (i.e. columns A-E).
Batch_size depends on the number of examples (and on how reliable your data is); if you have more than 10,000 examples, a batch size of 30-50 should be okay (read this explanation about choosing the appropriate batch size).
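As a hedged sketch of those shapes, with made-up data standing in for the real table (columns A-E):

```python
import numpy as np

num_windows, seq_len, n_features = 1000, 5, 5            # sizes assumed for illustration
X = np.random.randn(num_windows, seq_len, n_features)    # [batch, 5 steps, 5 features]
y = np.random.randn(num_windows, 1)                      # one output per 5-step window

batch_size = 32   # a starting point in line with the 30-50 range suggested above
```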
Looking at the previous answer, I would say that you do not have to use a many-to-one architecture. It really depends on the problem you have. For example, if your system has long dependencies on the past, i.e. more than 5 samples back in your case, it would be better to use a many-to-many architecture, but with different input and output frequencies. But if you think that samples further back than the previous 5 do not impact the output, then a many-to-one architecture would do it.
Also, if your problem is regression, you can add a Dense layer, since the output of an LSTM cell is a tanh with output range (-1, 1).
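A minimal many-to-one Keras model along those lines, with a Dense head so the regression output is not squashed into (-1, 1); the hidden size is arbitrary:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, input_shape=(5, 5)),   # 5 time steps, 5 features each
    tf.keras.layers.Dense(1),                       # linear output for regression
])
model.compile(optimizer="adam", loss="mse")
```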
I used the caret package to train a random forest, including repeated cross-validation. I’d like to know whether the OOB, as in the original RF by Breiman, is used or whether this is replaced by the cross-validation. If it is replaced, do I have the same advantages as described in Breiman 2001, like increased accuracy by reducing the correlation between input data? As OOB is drawn with replacement and CV is drawn without replacement, are both procedures comparable? What is the OOB estimate of error rate (based on CV)?
How are the trees grown? Is CART used?
As this is my first thread, please let me know if you need more details. Many thanks in advance.
There are a lot of basic questions here and you would be better served by reading a book on machine learning or predictive modeling. That's probably why you haven't gotten much of a response.
For caret you should also consult the package website where some of these questions are answered.
Here are some notes:
CV and OOB estimation for RF are somewhat different. This post might help explain how. For this application, the OOB rate from random forest is computed while the model is being built, whereas CV uses holdout samples that are predicted after the random forest model is computed.
The original random forest model (used here) uses unpruned CART trees. Again, this is in many text books and papers.
Max
I recently got a little confused with this too, but reading chapter 4 in Applied Predictive Modeling by Max Kuhn helped me to understand the difference.
If you use randomForest in R, you grow a number of decision trees by sampling N cases with replacement (N is the number of cases in the training set). You then sample m variables at each node where m is less than the number of predictors. Each tree is then grown fully and terminal nodes are assigned to a class based on the mode of cases in that node. New cases are classified by sending them down all the trees and then taking a vote; the majority vote wins.
The key points to note here are:
how the trees are grown - sampling WITH replacement (a bootstrap). This means that some cases will be represented many times in your bootstrap sample and others may not be represented at all. The bootstrap sample will be the same size as your training dataset.
The cases that are not selected for building a tree are referred to as the OOB samples; an OOB error estimate is calculated by classifying the cases that weren't selected when building that tree. On average, about 63% of the original data points appear at least once in a bootstrap sample, and the rest are out of bag (see the quick simulation below).
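A quick simulation of that 63% figure (a small numpy sketch; any language would do):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
boot = rng.integers(0, n, size=n)           # bootstrap: n draws with replacement
in_bag = np.unique(boot).size / n           # fraction of cases seen at least once
print(in_bag)                               # ~0.632; the remaining ~37% are OOB
```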
If you use caret in R, you will normally use caret::train(....) with method = "rf" and a trainControl with method = "repeatedcv"; you can change that to "oob" if you want out-of-bag estimation. The way this works is as follows (I'm going to use a simple example of 10-fold CV repeated 5 times): the training dataset is split into 10 folds of roughly equal size, and a number of trees are built using only 9 of the folds, omitting the 1st fold (which is held out). The held-out fold is predicted by running its cases through the trees and is used to estimate performance measures. The first fold is then returned to the training set and the procedure repeats with the 2nd fold held out, and so on. The process is repeated 10 times. This whole procedure can be repeated multiple times (in my example, I do this 5 times); for each of the 5 runs, the training dataset will be split into 10 slightly different folds. It should be noted that, in total, 50 different held-out samples are used to calculate model efficacy.
The key points to note are:
this involves sampling WITHOUT replacement - you split the training data, build a model on 9 of the folds, predict the held-out fold (the remaining 1 of the 10), and repeat this process as above
the model is built using a dataset that is smaller than the training dataset; this is different to the bootstrap method discussed above
You are using 2 different resampling techniques, which will yield different results; therefore they are not directly comparable. Repeated k-fold CV tends to have low bias (for large k); where k is 2 or 3, bias is high and comparable to the bootstrap method. K-fold CV tends to have higher variance, though...
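Caret aside, the same contrast can be sketched in scikit-learn terms (a Python analogue, not the caret call itself): the forest's built-in OOB estimate versus repeated 10-fold CV on the same kind of model:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# OOB: computed from the bootstrap leftovers while the forest is being built.
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)

# Repeated 10-fold CV: 50 held-out folds predicted after each model is fit.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=500, random_state=0), X, y, cv=cv)
print("Repeated 10-fold CV accuracy:", scores.mean())
```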
I'm working on sentiment analysis for text classification, and I want to classify tweets from Twitter into 3 categories: positive, negative, or neutral. I have 210 training examples, and I'm using Naive Bayes as the classifier. I'm implementing it in PHP, with MySQL as the database for the training data.
What I've done, in sequence:
I split my data, based on 10-fold cross-validation, into 189 training examples and 21 testing examples.
I insert the training examples into the database, so my classifier can classify based on the training data.
Then I classify my testing data using the classifier, giving 21 prediction results.
Repeat steps 2 and 3 ten times, once per fold of the 10-fold cross-validation.
I evaluate the accuracy of the classifier fold by fold, so I get 10 accuracy results. Then I take the average of the results.
What I want to know is:
Which is the learning process? What is the input, process, and output?
Which is the validation process? What is the input, process, and output?
Which is the testing process? What is the input, process, and output?
I just want to make sure that my understanding of these 3 processes (learning, validation, and testing) is correct.
In your example, I don't think there is a meaningful distinction between validation and testing.
Learning is when you train the model, which means that your outputs are, in general, parameters, such as coefficients in a regression model or weights for connections in a neural network. In your case, the outputs are estimated probabilities: the probability of seeing a word w in a tweet given that the tweet is positive, P(w|+), of seeing a word given negative, P(w|-), and of seeing a word given neutral, P(w|*), along with the probabilities of not seeing words in the tweet given positive, negative, or neutral, P(~w|+), etc. The inputs are the training data, and the process is simply estimating the probabilities by measuring the frequencies with which words occur (or don't occur) in each of your classes, i.e. just counting!
Testing is where you see how well your trained model does on data you haven't seen before. Training tends to produce outputs that overfit the training data, i.e. the coefficients or probabilities are "tuned" to noise in the training data, so you need to see how well your model does on data it hasn't been trained on. In your case, the inputs are the test examples, the process is applying Bayes theorem, and the outputs are classifications for the test examples (you classify based on which probability is highest).
I have come across cross-validation -- in addition to testing -- in situations where you don't know what model to use (or where there are additional, "extrinsic", parameters to estimate that can't be done in the training phase). You split the data into 3 sets.
So, for example, in linear regression you might want to fit a straight line model, i.e. estimate p and c in y = px + c, or you might want to fit a quadratic model, i.e. estimate p, c, and q in y = px + qx^2 + c. What you do here is split your data into three. You train the straight line and quadratic models using part 1 of the data (the training examples). Then you see which model is better by using part 2 of the data (the cross-validation examples). Finally, once you've chosen your model, you use part 3 of the data (the test set) to determine how good your model is. Regression is a nice example because a quadratic model will always fit the training data better than the straight line model, so you can't just look at the errors on the training data alone to decide what to do.
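As a toy sketch of that three-way split (made-up data, numpy only): fit the line and the quadratic on the training part, pick between them on the cross-validation part, and report the error of the winner on the test part:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 90)
y = 2 * x + 0.5 + rng.normal(scale=0.1, size=x.size)     # hypothetical noisy line

idx = rng.permutation(x.size)
tr, cv, te = idx[:30], idx[30:60], idx[60:]              # train / cross-val / test

models = {deg: np.polyfit(x[tr], y[tr], deg) for deg in (1, 2)}   # line vs quadratic
cv_err = {deg: np.mean((np.polyval(c, x[cv]) - y[cv]) ** 2)
          for deg, c in models.items()}
best = min(cv_err, key=cv_err.get)                       # model chosen on the CV set
test_err = np.mean((np.polyval(models[best], x[te]) - y[te]) ** 2)
print(best, test_err)                                    # final performance estimate
```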
In the case of Naive Bayes, it might make sense to explore different prior probabilities, i.e. P(+), P(-), P(*), using a cross-validation set, and then use the test set to see how well you've done with the priors chosen using cross-validation and the conditional probabilities estimated using the training data.
As an example of how to calculate the conditional probabilities, consider 4 tweets, which have been classified as "+" or "-" by a human
T1, -, contains "hate", "anger"
T2, +, contains "don't", "hate"
T3, +, contains "love", "friend"
T4, -, contains "anger"
So for P(hate|-) you add up the number of times hate appears in negative tweets. It appears in T1 but not in T4, so P(hate|-) = 1/2. For P(~hate|-) you do the opposite, hate doesn't appear in 1 out of 2 of the negative tweets, so P(~hate|-) = 1/2.
Similar calculations give P(anger|-) = 1, and P(love|+) = 1/2.
A fly in the ointment is that any probability that is 0 will mess things up in the calculation phase, so instead of using a zero probability you use a very low number, like 1/n or 1/n^2, where n is the number of training examples. So you might put P(~anger|-) = 1/4 or 1/16.
(The maths of the calculation I put in this answer).
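A small sketch of the counting described above, using the 4 example tweets; the 1/n floor for zero counts is the crude substitution mentioned, not proper Laplace smoothing:

```python
tweets = [
    ("-", {"hate", "anger"}),
    ("+", {"don't", "hate"}),
    ("+", {"love", "friend"}),
    ("-", {"anger"}),
]
n = len(tweets)

def cond_prob(word, label):
    docs = [words for lab, words in tweets if lab == label]
    p = sum(word in words for words in docs) / len(docs)
    return p if p > 0 else 1.0 / n     # floor instead of a zero probability

print(cond_prob("hate", "-"))    # 0.5
print(cond_prob("anger", "-"))   # 1.0
print(cond_prob("love", "+"))    # 0.5
print(cond_prob("anger", "+"))   # 0.25, the 1/n floor
```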