-- Practice Final Thread
Manasa Mananjaya, Prajna Puranik
7. Give one adaptive learning rate algorithm for neural nets that we considered in class and motivate how it works.
Solution:
RMSProp modifies AdaGrad to use an exponentially decaying average of the squared-gradient history. It can be combined with momentum and has been shown to work well with deep neural nets. Here is what the algorithm with Nesterov momentum looks like:
epsilon := learning rate
rho := decay rate
alpha := momentum coefficient
theta := initial weights
v := initial velocity
r := 0  // squared-gradient accumulation variable
while stopping criteria not met:
    Sample minibatch {x[1], ..., x[m]} from the training set with targets {y[1], ..., y[m]}
    Compute interim update: tmpTheta := theta + alpha*v
    Compute gradient: g := (1/m) * grad_{tmpTheta} sum_i L(f(x[i]; tmpTheta), y[i])
    Accumulate squared gradient: r := rho*r + (1 - rho) * hadamard(g, g)  // pointwise product
    Update velocity: v := alpha*v - hadamard(vec(epsilon/sqrt(r_j)), g)   // make the vector of epsilon/sqrt(r_j), then hadamard with g
    Apply update: theta := theta + v
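The pseudocode above can be sketched in plain NumPy. This is only an illustration: the toy quadratic objective, the hyperparameter values, and the small stability constant `delta` added under the square root (standard practice, though not shown in the pseudocode) are all assumptions.

```python
import numpy as np

def rmsprop_nesterov(grad_fn, theta0, epsilon=0.01, rho=0.9, alpha=0.9,
                     n_steps=2000, delta=1e-8):
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)  # velocity
    r = np.zeros_like(theta)  # squared-gradient accumulation
    for _ in range(n_steps):
        tmp_theta = theta + alpha * v          # interim (look-ahead) update
        g = grad_fn(tmp_theta)                 # gradient at the look-ahead point
        r = rho * r + (1 - rho) * g * g        # decaying average of g (.) g
        v = alpha * v - (epsilon / np.sqrt(r + delta)) * g
        theta = theta + v                      # apply update
    return theta

# toy problem: minimize ||theta - target||^2, whose gradient is 2*(theta - target)
target = np.array([3.0, -2.0])
theta = rmsprop_nesterov(lambda th: 2.0 * (th - target), theta0=np.zeros(2))
```

Note how dividing by sqrt(r) gives each weight its own effective learning rate, which is the "adaptive" part of the method.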
8. Give a code example with explanation of how to create a custom layer in Keras. Give a code example of how to create a custom network topology using Keras' functional API.
Solution:
Creating a Parametric ReLU layer as a custom layer using Keras:
import tensorflow as tf

class ParametricRelu(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        # the constructor here just calls the parent's constructor
        super(ParametricRelu, self).__init__(**kwargs)

    # in build we create one new trainable weight per input unit,
    # each with initial value 0
    def build(self, input_shape):
        self.alpha = self.add_weight(
            name='alpha', shape=(input_shape[-1],),
            initializer='zeros',
            trainable=True
        )
        super(ParametricRelu, self).build(input_shape)

    # given the inputs to the layer, compute its outputs
    def call(self, x):
        return tf.maximum(0., x) + self.alpha * tf.minimum(0., x)
Given a Sequential model, we could then add this layer using:
model.add(ParametricRelu())
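The forward computation in `call` is just max(0, x) + alpha * min(0, x): identity for non-negative inputs, slope alpha for negative ones. As a sanity check, the same formula in plain NumPy (the sample inputs and the fixed alpha here are illustrative, not part of Keras):

```python
import numpy as np

def prelu(x, alpha):
    # parametric ReLU: identity for x >= 0, slope alpha for x < 0
    return np.maximum(0.0, x) + alpha * np.minimum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
y = prelu(x, alpha=0.1)   # -> [-0.2, -0.05, 0., 1., 3.]
```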
Example of how to create a custom network topology using Keras' functional API:
from tensorflow.keras.layers import Input, Dense, Concatenate
from tensorflow.keras.models import Model

input_layer = Input(shape=input_shape)
# take all inputs and feed them to 32 neurons
dense_layer_1 = Dense(32, activation='relu')(input_layer)
# take all inputs and feed them to 16 neurons
dense_layer_2 = Dense(16, activation='relu')(input_layer)
# make a single layer with the two layers above running in parallel
merged_layer = Concatenate()([dense_layer_1, dense_layer_2])
# feed the merged results into a final layer
final_layer = Dense(10, activation='softmax')(merged_layer)
# create the model, specifying its input and output layers
model = Model(inputs=input_layer, outputs=final_layer, name="my_model")
9. Give the back-propagation through time algorithm. Explain the teacher-forcing algorithm. Explain the long-term dependency problem and why LSTMs might help to solve this problem.
Solution:
a. Computing the gradient of a recurrent network involves performing a forward pass, moving left to right through the unrolled computation graph, followed by a backward propagation pass moving right to left through the graph. Together this gives a runtime of O(τ) and a memory cost of O(τ). This process is called back-propagation through time (BPTT) and can be very expensive: it cannot easily be parallelized, since the forward graph is inherently sequential, and every intermediate value must be remembered for the backward pass.
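As a concrete illustration, here is a minimal NumPy sketch of BPTT for a toy vanilla tanh RNN with a squared-error loss on the final hidden state (the architecture, sizes, and loss are illustrative assumptions, not the only choice): the forward pass stores every hidden state, and the backward pass walks the unrolled graph right to left.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_in, n_h = 4, 3, 5                    # sequence length, input size, hidden size
W_xh = rng.normal(0, 0.1, (n_h, n_in))
W_hh = rng.normal(0, 0.1, (n_h, n_h))
xs = rng.normal(size=(T, n_in))
target = rng.normal(size=n_h)

# forward pass, left to right: store every hidden state (the O(tau) memory cost)
hs = [np.zeros(n_h)]
for t in range(T):
    hs.append(np.tanh(W_xh @ xs[t] + W_hh @ hs[-1]))
loss = 0.5 * np.sum((hs[-1] - target) ** 2)

# backward pass, right to left through the unrolled graph
dW_xh = np.zeros_like(W_xh)
dW_hh = np.zeros_like(W_hh)
dh = hs[-1] - target                      # dL/dh_T
for t in reversed(range(T)):
    da = dh * (1 - hs[t + 1] ** 2)        # back through tanh at step t
    dW_xh += np.outer(da, xs[t])          # same W_xh is reused at every step
    dW_hh += np.outer(da, hs[t])          # same W_hh is reused at every step
    dh = W_hh.T @ da                      # pass gradient back to h_{t-1}
```

Note that the gradients for the shared weight matrices are summed over all time steps, which is exactly why the stored hidden states are needed.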
b. The teacher forcing algorithm applies to an RNN that produces an output at each time step and has recurrent connections only from the output at one step to the hidden units at the next. During training, rather than feeding the model's own previous output back into itself, the model at step t receives the input sequence x⃗(1), ..., x⃗(t) together with the ground-truth output y⃗(t−1), where y⃗(t−1) acts as a teacher forcing what the previous answer should have been.
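A minimal sketch of the difference between training-time (teacher-forced) and test-time (free-running) feedback, using a toy recurrent cell as a stand-in for a real model; all names, sizes, and the cell itself are illustrative assumptions:

```python
import numpy as np

def step(h, x, W):
    # toy recurrent cell: new hidden state from previous state and fed-back output
    return np.tanh(W @ np.concatenate([h, x]))

rng = np.random.default_rng(1)
n = 2
W = rng.normal(0, 0.5, (n, 2 * n))
ys = [rng.normal(size=n) for _ in range(5)]   # ground-truth outputs y(0)..y(4)

# training (teacher forcing): the value fed back at step t is the
# *ground-truth* previous output y(t-1), not the model's own output
h = np.zeros(n)
for t in range(1, 5):
    h = step(h, ys[t - 1], W)

# test time: no ground truth is available, so the model's own
# previous output is fed back instead (free-running mode)
h_free = np.zeros(n)
y_prev = ys[0]
for t in range(1, 5):
    h_free = step(h_free, y_prev, W)
    y_prev = h_free
```

The gap between these two regimes (training sees ground truth, test time sees its own, possibly erroneous, outputs) is the well-known downside of teacher forcing.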
c. Long-term dependencies can cause problems when training RNNs: unfolding the network over many steps has the effect of repeatedly multiplying by the same weight matrix (matrix powering), so gradients tend to vanish or explode. To cope with long-term dependencies, LSTMs make use of a memory cell that can carry the previous state forward. Unlike in normal RNNs, this cell's value is also controlled by forget gates, which can roughly reset the memory of the previous state to 0, preventing the powering issue.
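The matrix-powering effect can be seen directly: unfolding a linear recurrence h(t) = W h(t−1) over t steps multiplies by W^t, so gradient magnitudes scale with the t-th power of W's spectral radius. A quick NumPy illustration (the matrices here are illustrative):

```python
import numpy as np

def power_norm(W, t):
    # norm of W^t: how much a gradient is scaled after t unrolled steps
    return np.linalg.norm(np.linalg.matrix_power(W, t))

W_shrink = 0.5 * np.eye(3)   # spectral radius 0.5: gradients vanish
W_grow   = 1.5 * np.eye(3)   # spectral radius 1.5: gradients explode

print(power_norm(W_shrink, 20))   # on the order of 0.5**20, essentially zero
print(power_norm(W_grow, 20))     # on the order of 1.5**20, huge

# by contrast, an LSTM cell-state update c(t) = f (.) c(t-1) + i (.) g is
# additive: the gradient through t steps is the product of the forget gates f,
# which the network can learn to hold near 1 to preserve information
```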
(Edited: 2021-12-06)