I want to create a neural network that will exactly fit to my sample, however at some point the learning process stops. Different optimizers like adam or sgd also don't work.
Cybenko proved that the statement from the title is true for any sigmoidal function as an activation function so I want to use either sigmoid or tanh. I want also to keep only one hidden layer.
What are tweaks that I can use? Here is my setup (python3) and results:
features = np.linspace(-0.5,.5,2000).reshape(2000,1)
target = np.sin(12*features)
hn = 100
nn = models.Sequential()
#Only one hidden layer
nn.add(layers.Dense(units=hn, activation='sigmoid',
input_shape=(features.shape[1],)))
nn.add(layers.Dense(units=1, activation='linear'))
# Compile neural network
nn.compile(loss='mse',
optimizer='RMSprop',
metrics=['mse'])
nn.fit(features,
target,
epochs=200,
verbose=2,
batch_size=5)
Related
When I used to do classification work with textcnn, I had experience finetuning textcnn using pretrained word embedding with like Word2Vec and fasttext. And I use this process:
Create an embedding layer in textcnn
Load the embedding matrix of the words used this time by Word2Vec or
fasttext
Since the vector value of the embedding layer will change during training, the network is
being finetuning.
Recently I also want to try BERT to do this. I thought, 'As there should be few differences to use BERT pretrained embedding to initial other networks' embedding layer and finetuning, it should be easy!' But in fact yesterday I tried all day and still cannot do it.
The fact I found is that, as BERT's embedding is a contextual embedding, especially when extracting the word embeddings, the vector of each word from each sentence will vary, so it seems that there is no way to use that embedding to initialize the embedding layer of another network as usual...
Finally, I thought up one method to 'finetuning', as the following steps:
First, do not define an embedding layer in textcnn.
Instead of using embedding layer, in the network training part, I
firstly pass sequence tokens to the pretrained BERT model and get
the word embeddings for each sentence.
Put the BERT word embedding from 2. into textcnn and train the
textcnn network.
By using this method I was finally able to train, but thinking seriously, I don't think I'm doing a finetuning at all...
Because as you can see, every time when I start a new training loop, the word embedding generated from BERT is always the same vector, so just input these unchanged vectors to the textcnn wouldn't let the textcnn be finetuned at all, right?
UPDATE:
I thought up a new method to use the BERT embeddings and 'train' BERT and textcnn together.
Some part of my code is:
BERTmodel = AutoModel.from_pretrained('bert-
base-uncased',output_hidden_states=True).to(device)
TextCNNmodel = TextCNN(EMBD_DIM, CLASS_NUM, KERNEL_NUM,
KERNEL_SIZES).to(device)
optimizer = torch.optim.Adam(TextCNNmodel.parameters(), lr=LR)
loss_func = nn.CrossEntropyLoss()
for epoch in range(EPOCH):
TextCNNmodel.train()
BERTmodel.train()
for step, (token_batch, seg_batch, y_batch) in enumerate(train_loader):
token_batch = token_batch.to(device)
y_batch = y_batch.to(device)
BERToutputs = BERTmodel(token_batch)
# I want to use the second-to-last hidden layer as the embedding, so
x_batch = BERToutputs[2][-2]
output = TextCNNmodel(x_batch)
output = output.squeeze()
loss = loss_func(output, y_batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
I think by enable BERTmodel.train() and delete torch.no_grad() when get the embedding, the loss gradient could be backward to BERTmodel, too. The training process of TextCNNmodel also went smoothly.
To use this model later, I saved the parameters of both TextCNNmodel and BERTmodel.
Then to experiment whether the BERTmodel was really being trained and changed, in another program I load the BERTModel, and input a sentence to test that whether the BERTModel was really being trained.
However, I found that the output (the embedding) of original 'bert-base-uncased' model and my 'BERTmodel' are the same, which is disappointing...
I really have no idea why the BERTmodel part did not change...
Here I would like to thanks #Jindřich , thank you for giving me the important hint!
I think I am almost there when using my updated version code, but I forgot to set an optimizer for BERTmodel.
After I set the optimizer and did the training process again, this time when I load my BERTmodel, I found that the output (the embedding) of original 'bert-base-uncased' model and my 'BERTmodel' are finally different, which means this BERTmodel is changed and should be finetuned.
Here is my final codes, hope it could help you, too.
BERTmodel = AutoModel.from_pretrained('bert-
base-uncased',output_hidden_states=True).to(device)
TextCNNmodel = TextCNN(EMBD_DIM, CLASS_NUM, KERNEL_NUM,
KERNEL_SIZES).to(device)
optimizer = torch.optim.Adam(TextCNNmodel.parameters(), lr=LR)
optimizer_bert = torch.optim.Adamw(BERTmodel.parameters(), lr=2e-5, weight_decay=1e-2)
loss_func = nn.CrossEntropyLoss()
for epoch in range(EPOCH):
TextCNNmodel.train()
BERTmodel.train()
for step, (token_batch, seg_batch, y_batch) in enumerate(train_loader):
token_batch = token_batch.to(device)
y_batch = y_batch.to(device)
BERToutputs = BERTmodel(token_batch)
# I want to use the second-to-last hidden layer as the embedding, so
x_batch = BERToutputs[2][-2]
output = TextCNNmodel(x_batch)
output = output.squeeze()
loss = loss_func(output, y_batch)
optimizer.zero_grad()
optimizer_bert.zero_grad()
loss.backward()
optimizer.step()
optimizer_bert.step()
I will continue my experiments to see if my BERTmodel is really being finetuned.
Hi I am developing a neural network model using keras.
code
def base_model():
# Initialising the ANN
regressor = Sequential()
# Adding the input layer and the first hidden layer
regressor.add(Dense(units = 4, kernel_initializer = 'he_normal', activation = 'relu', input_dim = 7))
# Adding the second hidden layer
regressor.add(Dense(units = 2, kernel_initializer = 'he_normal', activation = 'relu'))
# Adding the output layer
regressor.add(Dense(units = 1, kernel_initializer = 'he_normal'))
# Compiling the ANN
regressor.compile(optimizer = 'adam', loss = 'mse', metrics = ['mae'])
return regressor
I have been reading about which kernel_initializer to use and came across the link- https://towardsdatascience.com/hyper-parameters-in-action-part-ii-weight-initializers-35aee1a28404
it talks about glorot and he initializations. I have tried with different intilizations for weights, but all of them give the same results. I want to understand how important is it do a proper initialization?
Thanks
I'll give you an explanation of how much weights initialisation is important.
Let's suppose our NN has an input layer with 1000 neurons, and suppose we start to initialise weights as they are normal distributed with mean 0 and variance 1 ().
At the second layer, we assume that only 500 first layer's neurons are activated, while the other 500 not.
The neuron's input of the second layer z will be the sum of :
so, it will be even normal distributed but with variance .
This means its value will be |z| >> 1 or |z| << 1, so neurons will saturate. The network will learn slowly at all.
A solution is to initialise weights as where is the number of the inputs of the first layer. In this way z will be and so less spreader, consequently neurons are less prone to saturate.
This trick can help as a start but in deep neural networks, due to the presence of hidden multi-layers, the weights initialisation should be done at each layer. A method may be using the batch normalization
Besides this from your code I can see you'v chosen as cost function the MSE, so it is a quadratic cost function. I don't know if your problem is a classification one, but if this is the case I suggest you to use a cross-entropy function as cost function for increasing the learning rate of your network.
I'm trying to train a neural network to predict the ratings for players in FIFA 18 by easports (ratings are between 64-99). I'm using their players database (https://easports.com/fifa/ultimate-team/api/fut/item?page=1) and I've processed the data into training_x, testing_x, training_y, testing_y. Each of the training samples is a numpy array containing 7 values...the first 6 are the different stats of the player (shooting, passing, dribbling, etc) and the last value is the position of the player (which I mapped between 1-8, depending on the position), and each of the testing values is a single integer between 64-99, representing the rating of that player.
I've tried many different hyperparameters, including changing the activation functions to tanh and relu, and I've tried adding a batch normalization layer after the first dense layer (I thought that it might be useful since one of my features is very small and the other features are between 50-99), I've played around with the SGD optimizer (changed the learning rate, momentum, even tried changing the optimizer to Adam), tried different loss functions, added/removed dropout layers, and tried different regularizers for the weights of the model.
model = Sequential()
model.add(Dense(64, input_shape=(7,),
kernel_regularizer=regularizers.l2(0.01)))
//batch normalization?
model.add(Activation('sigmoid'))
model.add(Dense(64, kernel_regularizer=regularizers.l2(0.01),
activation='sigmoid'))
model.add(Dropout(0.3))
model.add(Dense(32, kernel_regularizer=regularizers.l2(0.01),
activation='sigmoid'))
model.add(Dense(1, activation='linear'))
sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_absolute_error', metrics=['accuracy'],
optimizer=sgd)
model.fit(training_x, training_y, epochs=50, batch_size=128, shuffle=True)
When I train the model, the loss is always nan and the accuracy is always 0, even though I've tried adjusting a lot of different parameters. However, if I remove the last feature from my data, the position of the players, and update the input shape of the first dense layer, the model actually "trains" and ends up with around 6% accuracy no matter what parameters I change. In that case, I've found that the model only predicts 79 to be the player's rating. What am I doing inherently wrong?
You can try the following steps :
Use mean squared error loss function.
Use Adam which will help you converge faster with low learning rate like 0.0001 or 0.001. Otherwise, try using the RMSprop optimizer.
Use the default regularizers. That is none actually.
Since this is a regression task, use activation function like ReLU in all the layers except the output layer ( including the input layer ). Use linear activation in output layer.
As mentioned in the comments by #pooyan , normalize the features. See here. Even try standardizing the features. Use whichever suites the best.
I am following the official TensorFlow with Keras tutorial and I got stuck here: Predict house prices: regression - Create the model
Why is an activation function used for a task where a continuous value is predicted?
The code is:
def build_model():
model = keras.Sequential([
keras.layers.Dense(64, activation=tf.nn.relu,
input_shape=(train_data.shape[1],)),
keras.layers.Dense(64, activation=tf.nn.relu),
keras.layers.Dense(1)
])
optimizer = tf.train.RMSPropOptimizer(0.001)
model.compile(loss='mse', optimizer=optimizer, metrics=['mae'])
return model
The general reason for using non-linear activation functions in hidden layers is that, without them, no matter how many layers or how many units per layer, the network would behave just like a simple linear unit. This is nicely explained in this short video by Andrew Ng: Why do you need non-linear activation functions?
In your case, looking more closely, you'll see that the activation function of your final layer is not the relu as in your hidden layers, but the linear one (which is the default activation when you don't specify anything, like here):
keras.layers.Dense(1)
From the Keras docs:
Dense
[...]
Arguments
[...]
activation: Activation function to use (see activations). If you don't specify anything, no activation is applied (ie. "linear" activation: a(x) = x).
which is indeed what is expected for a regression network with a single continuous output.
I'm a newbie to machine learning and this is one of the first real-world ML tasks challenged.
Some experimental data contains 512 independent boolean features and a boolean result.
There are about 1e6 real experiment records in the provided data set.
In a classic XOR example all 4 out of 4 possible states are required to train NN. In my case its only 2^(10-512) = 2^-505 which is close to zero.
I have no more information about the data nature, just these (512 + 1) * 1e6 bits.
Tried NN with 1 hidden layer on available data. Output of the trained NN on the samples even from the training set are always close to 0, not a single close to "1". Played with weights initialization, gradient descent learning rate.
My code utilizing TensorFlow 1.3, Python 3. Model excerpt:
with tf.name_scope("Layer1"):
#W1 = tf.Variable(tf.random_uniform([512, innerN], minval=-2/512, maxval=2/512), name="Weights_1")
W1 = tf.Variable(tf.zeros([512, innerN]), name="Weights_1")
b1 = tf.Variable(tf.zeros([1]), name="Bias_1")
Out1 = tf.sigmoid( tf.matmul(x, W1) + b1)
with tf.name_scope("Layer2"):
W2 = tf.Variable(tf.random_uniform([innerN, 1], minval=-2/512, maxval=2/512), name="Weights_2")
#W2 = tf.Variable(tf.zeros([innerN, 1]), name="Weights_2")
b2 = tf.Variable(tf.zeros([1]), name="Bias_2")
y = tf.nn.sigmoid( tf.matmul(Out1, W2) + b2)
with tf.name_scope("Training"):
y_ = tf.placeholder(tf.float32, [None,1])
cross_entropy = tf.reduce_mean(
tf.nn.softmax_cross_entropy_with_logits(
labels = y_, logits = y)
)
train_step = tf.train.GradientDescentOptimizer(0.005).minimize(cross_entropy)
with tf.name_scope("Testing"):
# Test trained model
correct_prediction = tf.equal( tf.round(y), tf.round(y_))
# ...
# Train
for step in range(500):
batch_xs, batch_ys = Datasets.train.next_batch(300, shuffle=False)
_, my_y, summary = sess.run([train_step, y, merged_summaries],
feed_dict={x: batch_xs, y_: batch_ys})
I suspect two cases:
my fault – bad NN implementation, wrong architecture;
bad data. Compared to XOR example, incomplete training data would result in a failing NN. However, the training examples fed to the trained NN are supposed to give right predictions, aren't they?
How to evaluate if it is possible at all to train a neural network (a 2-layer perceptron) on the provided data to forecast the result? A case of aceptable set would be the XOR example. Opposed to some random noise.
There are only ad hoc ways to know if it is possible to learn a function with a differentiable network from a dataset. That said, these ad hoc ways do usually work. For example, the network should be able to overfit the training set without any regularisation.
A common technique to gauge this is to only fit the network on a subset of the full dataset. Check that the network can overfit to that, then increase the size of the subset, and increase the size of the network as well. Unfortunately, deciding whether to add extra layers or add more units in a hidden layer is an arbitrary decision you'll have to make.
However, looking at your code, there are a few things that could be going wrong here:
Are your outputs balanced? By that I mean, do you have the same number of 1s as 0s in the dataset targets?
Your initialisation in the first layer is all zeros, the gradient to this will be zero, so it can't learn anything (although, you have a real initialisation above it commented out).
Sigmoid nonlinearities are more difficult to optimise than simpler nonlinearities, such as ReLUs.
I'd recommend using the built-in definitions for layers in Tensorflow to not worry about initialisation, and switching to ReLUs in any hidden layers (you need sigmoid at the output for your boolean target).
Finally, deep learning isn't actually very good at most "bag of features" machine learning problems because they lack structure. For example, the order of the features doesn't matter. Other methods often work better, but if you really want to use deep learning then you could look at this recent paper, showing improved performance by just using a very specific nonlinearity and weight initialisation (change 4 lines in your code above).