Experiment: Tweaking Network and Learning Parameters

To illustrate the effect of the different techniques, we have defined five configurations, shown in Table 5-1. Configuration 1 is the same network that we studied in Chapter 4 and at the beginning of this chapter. Configuration 2 is the same network but with a learning rate of 10.0. In Configuration 3, we change the initialization method to Glorot uniform and the optimizer to Adam, with all of its parameters at their default values. In Configuration 4, we change the activation function for the hidden units to ReLU, the initializer for the hidden layer to He normal, and the loss function to cross-entropy. When we described the cross-entropy loss function earlier, it was in the context of a binary classification problem, where the output neuron used the logistic sigmoid function. For multiclass classification problems, we instead use the categorical cross-entropy loss function, which is paired with a different output activation known as softmax. The details of softmax are described in Chapter 6, but we use it here together with the categorical cross-entropy loss function. Finally, in Configuration 5, we change the mini-batch size to 64.
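
To make the softmax/categorical cross-entropy pairing concrete before Chapter 6, the following small sketch (not part of the book's code; the function names are purely illustrative) computes softmax probabilities from a vector of raw outputs and the corresponding categorical cross-entropy loss for a one-hot encoded target.

import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def categorical_cross_entropy(y_true, y_pred):
    # y_true is one-hot, so only the term for the true class contributes.
    return -np.sum(y_true * np.log(y_pred))

logits = np.array([2.0, 1.0, 0.1])   # raw outputs from the last layer
y_true = np.array([1.0, 0.0, 0.0])   # one-hot encoded target class

probs = softmax(logits)              # probabilities that sum to 1
loss = categorical_cross_entropy(y_true, probs)
print(probs, loss)                   # approx. [0.66 0.24 0.10] and 0.42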

Table 5-1 Configurations with Tweaks to Our Network

CONFIGURATION | HIDDEN ACTIVATION | HIDDEN INITIALIZER | OUTPUT ACTIVATION | OUTPUT INITIALIZER | LOSS FUNCTION | OPTIMIZER | MINI-BATCH SIZE
Conf1 | tanh | Uniform 0.1 | Sigmoid | Uniform 0.1 | MSE | SGD lr=0.01 | 1
Conf2 | tanh | Uniform 0.1 | Sigmoid | Uniform 0.1 | MSE | SGD lr=10.0 | 1
Conf3 | tanh | Glorot uniform | Sigmoid | Glorot uniform | MSE | Adam | 1
Conf4 | ReLU | He normal | Softmax | Glorot uniform | CE | Adam | 1
Conf5 | ReLU | He normal | Softmax | Glorot uniform | CE | Adam | 64

Note: CE, cross-entropy; MSE, mean squared error; SGD, stochastic gradient descent.

Modifying the code to model these configurations is trivial using our DL framework. In Code Snippet 5-13, we show the statements for setting up the model for Configuration 5, using ReLU units with He normal initialization in the hidden layer and softmax units with Glorot uniform initialization in the output layer. The model is then compiled using categorical cross-entropy as the loss function and Adam as the optimizer. Finally, the model is trained for 20 epochs using a mini-batch size of 64 (BATCH_SIZE is set to 64 in the initialization code).

Code Snippet 5-13 Code Changes Needed for Configuration 5

from tensorflow import keras

# Hidden layer: 25 ReLU units with He normal weight initialization.
# Output layer: 10 softmax units with Glorot uniform weight initialization.
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(25, activation='relu',
                       kernel_initializer='he_normal',
                       bias_initializer='zeros'),
    keras.layers.Dense(10, activation='softmax',
                       kernel_initializer='glorot_uniform',
                       bias_initializer='zeros')])

# Categorical cross-entropy loss paired with the Adam optimizer.
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Train for EPOCHS epochs with mini-batches of BATCH_SIZE examples.
history = model.fit(train_images, train_labels,
                    validation_data=(test_images, test_labels),
                    epochs=EPOCHS, batch_size=BATCH_SIZE,
                    verbose=2, shuffle=True)
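
Code Snippet 5-13 relies on variables defined earlier in the program (train_images, train_labels, test_images, test_labels, EPOCHS, and BATCH_SIZE). That initialization code is not reproduced here; the sketch below is one plausible version of it, assuming the MNIST digit dataset, standardized pixel values, and one-hot encoded labels (which categorical_crossentropy expects). The book's actual preprocessing may differ in its details.

import numpy as np
from tensorflow import keras
from tensorflow.keras.utils import to_categorical

EPOCHS = 20
BATCH_SIZE = 64

# Load the MNIST dataset (28x28 grayscale digit images).
(train_images, train_labels), (test_images, test_labels) = \
    keras.datasets.mnist.load_data()

# Standardize the pixel values using the training set statistics.
mean = np.mean(train_images)
stddev = np.std(train_images)
train_images = (train_images - mean) / stddev
test_images = (test_images - mean) / stddev

# One-hot encode the labels, as required by categorical_crossentropy.
train_labels = to_categorical(train_labels, num_classes=10)
test_labels = to_categorical(test_labels, num_classes=10)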

If you run this configuration on a GPU-accelerated platform, you will notice that it is much faster than the previous configurations. The key is the mini-batch size of 64, which allows 64 training examples to be computed in parallel, whereas the earlier configurations, with a mini-batch size of 1, processed them serially.
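
To put a rough number on this, consider how many sequential weight updates one epoch requires. Assuming 60,000 training examples (the size of the MNIST training set), a mini-batch size of 64 cuts the number of update steps per epoch from 60,000 to 938, with each step operating on 64 examples at once:

import math

# Weight updates per epoch for two mini-batch sizes,
# assuming 60,000 training examples (as in MNIST).
train_examples = 60000
for batch_size in (1, 64):
    steps = math.ceil(train_examples / batch_size)
    print(f"batch_size={batch_size}: {steps} weight updates per epoch")
# batch_size=1: 60000 updates; batch_size=64: 938 updates.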

The results of the experiment are shown in Figure 5-12, which shows how the test errors for all configurations evolve during the training process.

FIGURE 5-12 Error on the test dataset for the five configurations

Configuration 1 (red line) ends up at an error of approximately 6%. We spent a nontrivial amount of time testing different parameters to arrive at that configuration (that tuning process is not shown in this book).

Configuration 2 (green) shows what happens if we set the learning rate to 10.0, which is significantly higher than 0.01. The error fluctuates at approximately 70%, and the model never learns much.

Configuration 3 (blue) shows what happens if, instead of using our tuned learning rate and initialization strategy, we choose a “vanilla configuration” with Glorot initialization and the Adam optimizer with its default values. The error is approximately 7%.

For Configuration 4 (purple), we switch the hidden activation function to ReLU, the output activation to softmax, and the loss function to categorical cross-entropy. We also change the initializer for the hidden layer to He normal. We see that the test error is reduced to 5%.

For Configuration 5 (yellow), the only thing we change compared to Configuration 4 is the mini-batch size: 64 instead of 1. This is our best configuration, which ends up with a test error of approximately 4%. It also runs much faster than the other configurations because the use of a mini-batch size of 64 enables more examples to be computed in parallel.

Although the improvements might not seem that impressive, we should recognize that reducing the error from 6% to 4% means removing one-third of the error cases, which is definitely significant. More important, the presented techniques enable us to train deeper networks.
