how to train data with large differences between values - machine-learning

I'm currently working on recurrent neural networks for text-to-speech but I'm stuck at one point.
I have some input files containing characteristic features of text (phonemes etc.) with dimension 490. The output files are mgc (60-d), bap (25-d) and lf0 (1-d). The mgc and bap files are fine because there are no big gaps between values; I can train them with reasonable time and accuracy. Inputs and outputs are sequential and properly aligned, e.g. if an input has shape (300, 490), then the shapes of mgc, bap and lf0 are (300, 60), (300, 25) and (300, 1), respectively.
My problem here is with lf0 (the log of the fundamental frequency, I suppose). The values look like, say, [0.23, 1.2, 0.54, 3.4, -10e9, -10e9, -10e9, 3.2, 0.25]. I tried to train it using MSE, but the error is too high and not decreasing at all.
I'd like to hear any suggestion for this problem. I'm open to anything.
PS: I'm using 2 GRU layers with 256 or 512 units each.

Related

What is the purpose of having the same input and output in PyTorch nn.Linear function?

I think this is a comprehension issue, but I would appreciate any help.
I'm trying to learn how to use PyTorch for autoencoding. In the nn.Linear function, there are two specified parameters,
nn.Linear(input_size, hidden_size)
When reshaping a tensor to its minimum meaningful representation, as one would in autoencoding, it makes sense that the hidden_size would be smaller. However, in the PyTorch tutorial there is a line specifying identical input_size and hidden_size:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )
I guess my question is, what is the purpose of having the same input and hidden size? Wouldn't this just return an identical tensor?
I suspect that this is just a requirement after calling the nn.ReLU() activation function.
As stated by Wikipedia:
An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data. The encoding is validated and refined by attempting to regenerate the input from the encoding.
In other words, the idea of the autoencoder is to learn an identity mapping. This identity function will be learned only for particular inputs (i.e. inputs without anomalies). From this, the following points derive:
(1) The input will have the same dimensions as the output.
(2) Autoencoders are (generally) built to learn the essential features of the input.
Because of point (1), the autoencoder is built from a series of layers (e.g. a series of nn.Linear() or nn.Conv2d() layers) whose final output has the same dimension as the input.
Because of point (2), you generally have an Encoder which compresses the information (as in your code snippet, going from 28x28 down to 10) and a Decoder which decompresses it (10 -> 28x28). Across implementations of this architecture, the latent-space dimensionality (10) is generally much smaller than the input (28x28). With the end goal of the Encoder clear, you can see that the compression may temporarily expand the data along the way (nn.Linear(28*28, 512)) before the series of layers produces the final compact output (10).
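To make the Encoder/Decoder split concrete, here is a minimal sketch (not the tutorial's code) of a fully-connected autoencoder for 28x28 inputs with a 10-dimensional latent space; the 512-unit intermediate layer mirrors the snippet in the question:

import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: compress 28*28 inputs down to a 10-d latent code.
        self.encoder = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )
        # Decoder: reconstruct the 28*28 input from the 10-d code.
        self.decoder = nn.Sequential(
            nn.Linear(10, 512),
            nn.ReLU(),
            nn.Linear(512, 28*28),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))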
Note that because the model in your question includes a nonlinearity after the linear layer, the model will not learn an identity transform between the input and output. In the specific case of the relu nonlinearity, the model could learn an identity transform if all of the input values were positive, but in general this won't be the case.
I find it a little easier to imagine the issue if we had an even smaller model consisting of Linear --> Sigmoid --> Linear. In such a case, the input will be mapped through the first matrix transform and then "squashed" into the space [0, 1] as the "hidden" layer representation. The next ("output") layer would need to take this squashed view of the input and come up with some way of "unsquashing" it back into the original. But with an affine output layer, it's not possible to do this, so the model will have to learn some other, non-identity, transforms for the two matrices.
There are some neat visualizations of this concept on Chris Olah's blog that are well worth a look.

deep neural network model stops learning after one epoch

I am training an unsupervised NN model, and for some reason, after exactly one epoch (80 steps), the model stops learning.
Do you have any idea why this might happen and what I should do to prevent it?
This is more info about my NN:
I have a deep NN that tries to solve an optimization problem. My loss function is customized and it is my objective function in the optimization problem.
So if my optimization problem is min f(x), then in my DNN the loss is f(x). I have 64 inputs, 64 outputs, and 3 layers in between:
self.l1 = nn.Linear(input_size, hidden_size)
self.relu1 = nn.LeakyReLU()
self.BN1 = nn.BatchNorm1d(hidden_size)
and last layer is:
self.l5 = nn.Linear(hidden_size, output_size)
self.tan5 = nn.Tanh()
self.BN5 = nn.BatchNorm1d(output_size)
The Tanh and BatchNorm at the end are there to scale the network's output.
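For reference, a minimal sketch of how these pieces might fit together in the forward pass (the intermediate hidden layers, which are not shown above, are assumed to repeat the Linear -> LeakyReLU -> BatchNorm pattern, and the hidden size is a placeholder):

import torch.nn as nn

class Net(nn.Module):
    def __init__(self, input_size=64, hidden_size=128, output_size=64):
        super().__init__()
        # First hidden block, as in the snippet above.
        self.l1 = nn.Linear(input_size, hidden_size)
        self.relu1 = nn.LeakyReLU()
        self.BN1 = nn.BatchNorm1d(hidden_size)
        # Last layer: Linear -> Tanh -> BatchNorm to scale the output.
        self.l5 = nn.Linear(hidden_size, output_size)
        self.tan5 = nn.Tanh()
        self.BN5 = nn.BatchNorm1d(output_size)

    def forward(self, x):
        x = self.BN1(self.relu1(self.l1(x)))
        return self.BN5(self.tan5(self.l5(x)))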
With more layers and nodes (doubled: 8 layers with 200 nodes each), I can make a little more progress toward a lower error, but again after about 100 steps the training error becomes flat!
The symptom is that the training loss stops improving relatively early. Assuming your problem is learnable at all, there are many possible reasons for this behavior. The following are the most relevant:
Improper preprocessing of input: neural networks prefer inputs with zero mean. For example, if the input is all positive, the weight updates in a layer are restricted to all move in the same direction, which may not be desirable (https://youtu.be/gYpoJMlgyXA).
Therefore, you may want to subtract the mean from all the images (e.g., subtract 127.5 from each of the 3 channels). Scaling to unit standard deviation in each channel may also be helpful.
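For example, a simple per-channel standardization might look like this (the array shapes and the 8-bit image framing are assumptions; compute the statistics from your own training set):

import numpy as np

# Dummy stand-in for a batch of 8-bit RGB images; replace with your own data.
X_train = np.random.randint(0, 256, size=(100, 32, 32, 3)).astype(np.float32)
X_test = np.random.randint(0, 256, size=(20, 32, 32, 3)).astype(np.float32)

# Per-channel standardization: zero mean, unit standard deviation.
mean = X_train.mean(axis=(0, 1, 2), keepdims=True)
std = X_train.std(axis=(0, 1, 2), keepdims=True)
X_train = (X_train - mean) / (std + 1e-8)
X_test = (X_test - mean) / (std + 1e-8)   # always reuse the training-set statistics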
Capacity of the network: the network is not complicated or deep enough for the task.
This is very easy to check. Train the network on just a few images (say, 3 to 10). The network should be able to overfit this data and drive the loss to almost 0; a sketch of this sanity check follows below. If that is not the case, you may have to add more layers, for example more than one Dense layer.
Another good idea is to use pre-trained weights (see the Applications section of the Keras documentation). You may adjust the Dense layers at the top to fit your problem.
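A quick way to run the overfit-a-few-samples check mentioned above in Keras (the data shapes, layer sizes and class count here are placeholders):

from keras import layers, models
import numpy as np

# Tiny synthetic batch standing in for "a few images": 5 samples, 10 classes.
x_small = np.random.rand(5, 64).astype("float32")
y_small = np.random.randint(0, 10, size=(5,))

model = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(64,)),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# If the network has enough capacity, the loss on these few samples
# should get close to zero within a few hundred epochs.
model.fit(x_small, y_small, epochs=300, verbose=0)
print(model.evaluate(x_small, y_small, verbose=0))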
Improper weight initialization: this can prevent the network from converging (https://youtu.be/gYpoJMlgyXA, the same video as before).
For the ReLU activation, you may want to use He initialization instead of the default Glorot initialization. I find that this is sometimes necessary, but not always.
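In Keras, for example, switching a layer from the default Glorot to He initialization is a one-argument change:

from keras import layers

# Default: kernel_initializer='glorot_uniform'
dense_default = layers.Dense(256, activation='relu')

# He initialization, usually a better match for ReLU-family activations.
dense_he = layers.Dense(256, activation='relu', kernel_initializer='he_normal')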
Lastly, you can use debugging tools for Keras such as keras-vis, keplr-io and deep-viz-keras. They are very useful for opening up the black box of convolutional networks.
I faced the same problem; here is how I solved it:
After going through a blog post, I determined that my problem resulted from the encoding of my labels. Originally I had them as one-hot encodings, which looked like [[0, 1], [1, 0], [1, 0]], while in the blog post they were in the format [0 1 0 0 1]. Changing my labels to this format and using binary cross-entropy got my model to work properly. Thanks to Ngoc Anh Huynh and rafaelvalle!
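In other words, for a two-class problem you can replace two-column one-hot labels with a single 0/1 label per sample, a one-unit sigmoid output, and binary cross-entropy. A minimal sketch of that change (layer sizes and input dimension are placeholders):

import numpy as np
from keras import layers, models

# One-hot labels like [[0, 1], [1, 0], [1, 0]] collapse to a flat 0/1 vector.
y_onehot = np.array([[0, 1], [1, 0], [1, 0]])
y_binary = y_onehot.argmax(axis=1)          # -> [1, 0, 0]

model = models.Sequential([
    layers.Dense(32, activation="relu", input_shape=(16,)),
    layers.Dense(1, activation="sigmoid"),  # single output unit for a binary target
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])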

neural network produces similar pattern for all inputs

I am attempting to train an ANN on time series data in Keras. I have three vectors of data that are broken into rolling window sequences (e.g. for a vector l):
np.array([l[i:i+window_size] for i in range(len(l) - window_size)])
The target vector is similarly windowed so the neural net output is a prediction of the target vector for the next window_size number of time steps. All the data is normalized with a min-max scaler. It is fed into the neural network as a shape=(nb_samples, window_size, 3). Here is a plot of the 3 input vectors.
The only output I've managed to muster from the ANN is the following plot. Target vector in blue, predictions in red (plot is zoomed in to make the prediction pattern legible). Prediction vectors are plotted at window_size intervals so each one of the repeated patterns is one prediction from the net.
I've tried many different model architectures, number of epochs, activation functions, short and fat networks, skinny, tall. This is my current one (it's a little out there).
from keras.models import Sequential
from keras.layers import Conv1D, Dropout, LSTM, Dense

model = Sequential([
    Conv1D(64, 4, input_shape=(None, 3)),
    Conv1D(32, 4),
    Dropout(0.24),       # Dropout takes a rate in [0, 1]; the original 24 looks like a typo
    LSTM(32),
    Dense(window_size),
])
But nothing I try stops the neural net from outputting this repeated pattern. I must be misunderstanding something about time series or LSTMs in Keras, but I'm very lost at this point, so any help is greatly appreciated. I've attached the full code at this repository:
https://github.com/jaybutera/dat-toy
I played with your code a little and I think I have a few suggestions for getting you on the right track. The code doesn't seem to match your graphs exactly, but I assume you've tweaked it a bit since then. Anyway, there are two main problems:
The biggest problem is in your data preparation step. You basically have the data shapes backwards, in that you have a single timestep of input for X and a timeseries for Y. Your input shape is (18830, 1, 8), when what you really want is (18830, 30, 8) so that the full 30 timesteps are fed into the LSTM. Otherwise the LSTM is only operating on one timestep and isn't really useful. To fix this, I changed the line in common.py from
X = X.reshape(X.shape[0], 1, X.shape[1])
to
X = windowfy(X, winsize)
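The windowfy helper comes from the linked repository and isn't shown here; this is just a sketch of what such a function typically does (signature assumed from the call above):

import numpy as np

def windowfy(data, window_size):
    """Slice a (timesteps, features) array into overlapping windows of
    shape (n_windows, window_size, features)."""
    return np.array([data[i:i + window_size]
                     for i in range(len(data) - window_size)])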
Similarly, the output data should probably be only 1 value, from what I've gathered of your goals from the plotting function. There are certainly some situations where you want to predict a whole timeseries, but I don't know if that's what you want in this case. I changed Y_train to use fuels instead of fuels_w so that it only had to predict one step of the timeseries.
Training for 100 epochs might be way too much for this simple network architecture. In some cases when I ran it, it looked like there was some overfitting going on. Observing the decrease of loss in the network, it seems like maybe only 3-4 epochs are needed.
Here is the graph of predictions after 3 training epochs with the adjustments I mentioned. It's not a great prediction, but it looks like it's on the right track now at least. Good luck to you!
EDIT: Example predicting multiple output timesteps:
from sklearn import datasets, preprocessing
import numpy as np
from scipy import stats
from keras import models, layers
INPUT_WINDOW = 10
OUTPUT_WINDOW = 5 # Predict 5 steps of the output variable.
# Randomly generate some regression data (not true sequential data; samples are independent).
np.random.seed(11798)
X, y = datasets.make_regression(n_samples=1000, n_features=4, noise=.1)
# Rescale 0-1 and convert into windowed sequences.
X = preprocessing.MinMaxScaler().fit_transform(X)
y = preprocessing.MinMaxScaler().fit_transform(y.reshape(-1, 1))
X = np.array([X[i:i + INPUT_WINDOW] for i in range(len(X) - INPUT_WINDOW)])
y = np.array([y[i:i + OUTPUT_WINDOW] for i in range(INPUT_WINDOW - OUTPUT_WINDOW,
                                                    len(y) - OUTPUT_WINDOW)])
print(np.shape(X))  # (990, 10, 4) - Ten timesteps of four features
print(np.shape(y))  # (990, 5, 1) - Five timesteps of one feature
# Construct a simple model predicting output sequences.
m = models.Sequential()
m.add(layers.LSTM(20, activation='relu', return_sequences=True, input_shape=(INPUT_WINDOW, 4)))
m.add(layers.LSTM(20, activation='relu'))
m.add(layers.RepeatVector(OUTPUT_WINDOW))
m.add(layers.LSTM(20, activation='relu', return_sequences=True))
m.add(layers.wrappers.TimeDistributed(layers.Dense(1, activation='sigmoid')))
print(m.summary())
m.compile(optimizer='adam', loss='mse')
m.fit(X[:800], y[:800], batch_size=10, epochs=60) # Train on first 800 sequences.
preds = m.predict(X[800:], batch_size=10) # Predict the remaining sequences.
print('Prediction:\n' + str(preds[0]))
print('Actual:\n' + str(y[800]))
# Correlation should be around r = .98, essentially perfect.
print('Correlation: ' + str(stats.pearsonr(y[800:].flatten(), preds.flatten())[0]))

feature number in tensorflow tf.nn.conv2d

In the TensorFlow example "Deep MNIST for Experts" (https://www.tensorflow.org/get_started/mnist/pros),
I am not clear on how the number of features specified in the weight variable of each layer is determined.
For example:
We can now implement our first layer. It will consist of convolution, followed by max pooling. The convolution will compute 32 features for each 5x5 patch.
W_conv1 = weight_variable([5, 5, 1, 32])
Why is 32 picked here?
In order to build a deep network, we stack several layers of this type. The second layer will have 64 features for each 5x5 patch.
W_conv2 = weight_variable([5, 5, 32, 64])
Again, why is 64 picked?
Now that the image size has been reduced to 7x7, we add a fully-connected layer with 1024 neurons to allow processing on the entire image.
W_fc1 = weight_variable([7 * 7 * 64, 1024])
Why 1024 here?
Thanks
Each of these filters will actually do something, like check for edges, check for colour changes, shift the image right or left, sharpen, blur, etc.
Each of these filters works toward extracting the meaning of the image by sharpening, enhancing, smoothing, intensifying, and so on.
For example, see this link, which explains the meaning of such filters:
http://setosa.io/ev/image-kernels/
So all these filters are actually neurons whose outputs are max-pooled and eventually fed into an FC layer after some activation.
If you are only trying to understand what the filters do, that is a separate exercise. If you are trying to learn how convolutional architectures work, these are tried-and-tested values for this dataset, so you should just go with them for now.
The filters also learn through Backprop.
32 and 64 are number of filters in the respective layers.
1024 is the number of output neurons in the fully connected layer.
Your question basically is about the reason behind the choice of these hyperparameters.
There is no mathematical or programming reason behind these specific choices. These have been picked up after experiments as they delivered a good accuracy over MNIST dataset.
You can change these numbers and that is one way by which you can modify a model.
Unfortunately, you will not find a deeper justification for these particular choices in the TensorFlow documentation or elsewhere in the literature; they are simply hyperparameters that worked well.
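To see how those numbers chain together, here is a rough Keras equivalent of the tutorial's stack (the tutorial itself uses raw tf.nn ops; the padding and activations here are assumptions). The input-channel dimension of each weight must match the previous layer's output channels, while the 32, 64 and 1024 are free choices:

from keras import layers, models

model = models.Sequential([
    layers.Conv2D(32, (5, 5), padding='same', activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),   # 28x28 -> 14x14
    layers.Conv2D(64, (5, 5), padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2)),   # 14x14 -> 7x7
    layers.Flatten(),              # 7*7*64 values per image, matching W_fc1's input size
    layers.Dense(1024, activation='relu'),
    layers.Dense(10, activation='softmax'),
])
model.summary()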

MLP with both Pattern Recognition & Forecasting Inputs: Bad Idea?

This is regarding a 3-layer MLP (Input, Hidden, Output) in Ward Systems NeuroShell 2
I would prefer to split the input layer classes (pattern recognition (PR) and forecasting (F)) into 2 separate nets with their own hidden layers that then feed a single output layer - this would be a 3-layer network. There could also be a 4-layer version using a new hidden layer to combine the 2 nets:
1) Inputs (partitioned into F and PR classes)
2) Hiddens (partitioned into F and PR classes)
3) Hiddens (fully connected "mixing" layer)
4) Output
These structures would be trained all at once, as opposed to training the two networks, getting their outputs/predictions, and then averaging those 2 numbers.
I've found that while averaging outputs works, "letting a net do it" works even better. But this requires layer partitioning, which my platform (NeuroShell 2) cannot do. And I've never read a paper where anyone attempts to do better than averaging.
FYI the ratio of PR to F inputs is 10:1.
Most discussion of forecasting nets involves on the order of 10 inputs. Pattern recognition involves orders of magnitude more: hundreds, thousands, or even more.
In fact, it seems that the two types of problems are virtually mutually exclusive when searching the research.
So my conclusion is having both types of structure in a single network is probably a very bad idea.
Agreed?
Not a bad idea at all! In fact this approach is a very common one that is used very frequently, you just missed out on some secret lingo.
Basically, what you're trying to do here is an ensemble prediction. The best way to approach this is to train two entirely separate nets for both halves of your problem. Then use those outputs as inputs to a new neural network.
The field is known as ensemble learning and the results are often quite good.
As far as your question of blending pattern recognition and forecasting, well it's really impossible to make a call on that without knowing more specifics around the data you're working with, but just because people haven't tried it doesn't mean you shouldn't either.
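A minimal sketch of that two-stage (stacked ensemble) setup in Keras terms (feature counts and layer sizes are placeholders; in practice the combiner should be trained on held-out predictions to avoid leakage):

import numpy as np
from keras import layers, models

# Stand-ins for the two input groups: many pattern-recognition features, a few forecasting ones.
X_pr = np.random.rand(1000, 200).astype("float32")
X_f = np.random.rand(1000, 20).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

def make_net(n_inputs):
    m = models.Sequential([
        layers.Dense(64, activation="relu", input_shape=(n_inputs,)),
        layers.Dense(1),
    ])
    m.compile(optimizer="adam", loss="mse")
    return m

# Stage 1: train each net on its own feature group.
net_pr, net_f = make_net(200), make_net(20)
net_pr.fit(X_pr, y, epochs=5, verbose=0)
net_f.fit(X_f, y, epochs=5, verbose=0)

# Stage 2: a small combiner net learns how to blend the two predictions.
stacked = np.hstack([net_pr.predict(X_pr), net_f.predict(X_f)])
combiner = make_net(2)
combiner.fit(stacked, y, epochs=5, verbose=0)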
Forecasting with time series creates a sliding window of data sequences as input into the network. I don't think you should mix forecasting with classification. The output for classification can be either a softmax or a binary result, whereas the output of a forecast will be a dense layer with a single neuron.
https://machinelearningmastery.com/how-to-develop-multilayer-perceptron-models-for-time-series-forecasting/
Series: [10, 20, 30, 40, 50, 60, 70, 80, 90]

X            y
10, 20, 30   40
20, 30, 40   50
30, 40, 50   60
The input windows X and targets y are then fit by the multilayer perceptron.
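A small sketch of that setup, following the pattern in the linked tutorial (layer sizes and epoch count are arbitrary):

import numpy as np
from keras import layers, models

series = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90], dtype="float32")
window = 3

# Build (X, y) pairs: each window of 3 values predicts the next value.
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]

model = models.Sequential([
    layers.Dense(100, activation="relu", input_shape=(window,)),
    layers.Dense(1),   # single neuron for the forecast value
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2000, verbose=0)

print(model.predict(np.array([[70, 80, 90]], dtype="float32")))  # expect roughly 100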

Resources