I’ve recently finished the first pass of CS231N Convolutional Neural Networks for Visual Recognition. Now it’s time to try out a library and get my hands dirty. Keras seems to be an easy-to-use high-level library that wraps three different backend engines: TensorFlow, CNTK and Theano. Just perfect for a beginner in Deep Learning.
The tutorial I picked is the one on the MNIST dataset. I’m adding some notes along the way to refresh my memory on what I have learned as well as some links so that I can find the references in CS231N quickly in the future.
Step 1: Import libraries and prepare parameters for training
'''Trains a simple convnet on the MNIST dataset.

Gets to 99.25% test accuracy after 12 epochs
(there is still a lot of margin for parameter tuning).
16 seconds per epoch on a GRID K520 GPU.
'''

from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K

batch_size = 128
num_classes = 10
epochs = 12

# input image dimensions
img_rows, img_cols = 28, 28
Not much to explain on the import statements, so let’s look at some of the parameters defined in this section.
What are batch_size and epochs?
A good explanation can be found in Training a Model from DL4J. An epoch means training the model on all of your data once: a single pass over the whole dataset. Why do we need to train the model for multiple epochs?
To answer this question, we need to know what happens during training in a neural network. Using the example in CS231N, training minimizes the loss function using gradient descent. One gradient descent update most likely won’t give you the minimal loss, so we have to do multiple passes until the loss converges or we hit a pre-set limit, for example the number of epochs. Of course, not all machine learning algorithms require multiple passes like this; the K-Nearest Neighbour (K-NN) algorithm, for example, does not.
Now let’s talk about batch_size. It relates to how we train the model, specifically how to optimize the loss function. In the naive form, we compute the loss function over the whole dataset. Quoted from CS231N:
while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)
  weights += - step_size * weights_grad # perform parameter update
However, if we have millions of records, it becomes wasteful to compute the loss over the full dataset just to perform a single parameter update. Therefore, a common way to solve the scalability issue is to compute the gradient over batches of training data.
while True:
  data_batch = sample_training_data(data, 256) # sample 256 examples
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights += - step_size * weights_grad # perform parameter update
So why does this work? To quote from the course notes:
“….the gradient from a mini-batch is a good approximation of the gradient of the full objective. Therefore, much faster convergence can be achieved in practice by evaluating the mini-batch gradients to perform more frequent parameter updates.”
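To make the idea concrete for myself, here is a minimal runnable sketch of mini-batch gradient descent in plain Python. The toy problem (fitting y = w * x to noise-free data), the function name minibatch_sgd and all hyperparameter values are my own assumptions for illustration; none of this comes from CS231N or Keras.

```python
import random


def minibatch_sgd(xs, ys, batch_size=10, epochs=20, step_size=0.5, seed=0):
    """Fit y = w * x by mini-batch gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    updates = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the data in a new order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # gradient of mean((w * x - y) ** 2) w.r.t. w, on this batch only
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w += -step_size * grad  # perform parameter update
            updates += 1
    return w, updates


# Hypothetical toy data: 100 points on the line y = 3 * x.
xs = [i / 100.0 for i in range(100)]
ys = [3.0 * x for x in xs]

w, updates = minibatch_sgd(xs, ys)
print(w, updates)
```

Each update only looks at 10 of the 100 points, yet the estimate of w still settles close to 3, which is the point the quote makes about mini-batch gradients being a good approximation.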
Gradient descent using mini-batches like this is called Minibatch Gradient Descent (MGD). Strictly speaking, Stochastic Gradient Descent (SGD) is the extreme case where the batch size is 1, but in practice the term SGD is usually used for mini-batch gradient descent as well.
- One question I have: with epochs and batch_size, does this mean that we update the weights with SGD multiple times in one epoch?
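As far as I can tell, the answer is yes: the weights are updated once per batch, so one epoch consists of many updates. Here is a quick back-of-the-envelope calculation, assuming the 60,000 MNIST training images and assuming that the last, smaller batch still counts as one update.

```python
import math

num_train_samples = 60000  # size of the MNIST training set
batch_size = 128           # value used in the tutorial
epochs = 12                # value used in the tutorial

# One weight update per batch; the final partial batch is counted too (assumption).
updates_per_epoch = math.ceil(num_train_samples / batch_size)
total_updates = updates_per_epoch * epochs

print(updates_per_epoch)  # 469
print(total_updates)      # 5628
```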
Set the image dimensions
- We specified the image dimensions in the code, which raised two questions:
- Do all the images in the dataset have to have the same dimensions?
- I assume if they don’t, we will have to resize them to the same size. How? Doesn’t the resizing distort the subject in the image?
Step 2: Prepare the dataset for training and testing
# the data, shuffled and split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
Specify the depth of the images
Since we are using a CNN, one important step is to arrange the neurons in 3D (width, height and depth). I’ll skip the details, but the depth here in the code is 1. That means our images have only 1 channel (grayscale), instead of 3 (RGB channels).
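The two branches of the reshape differ only in where the depth (channel) axis goes. A small sketch of the resulting shapes, in plain Python so it runs without Keras; the helper name reshape_target is hypothetical and only mirrors the if/else in the tutorial code.

```python
def reshape_target(num_samples, rows, cols, channels, data_format):
    """Return (full array shape, per-sample input_shape) for a data format."""
    if data_format == 'channels_first':
        input_shape = (channels, rows, cols)
    elif data_format == 'channels_last':
        input_shape = (rows, cols, channels)
    else:
        raise ValueError('unknown data format: %r' % (data_format,))
    return (num_samples,) + input_shape, input_shape


# 60,000 grayscale training images of 28 x 28 pixels, depth 1.
full_shape, input_shape = reshape_target(60000, 28, 28, 1, 'channels_last')
print(full_shape)   # (60000, 28, 28, 1)
print(input_shape)  # (28, 28, 1)
```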
Normalize the mean and standard-deviation
It seems that the code above doesn’t perform this processing; it only rescales the pixel values from the range 0 to 255 down to the range 0 to 1 in the two lines below:
x_train /= 255
x_test /= 255
As a guideline (the quote is about a different dataset; here the pixel values end up in the range 0 to 1):
“Normally we would want to preprocess the dataset so that each feature has zero mean and unit standard deviation, but in this case the features are already in a nice range from -1 to 1, so we skip this step.”
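To see the difference between the rescaling the tutorial does and the standardization the guideline describes, here is a small sketch in plain Python. The pixel values are made up, and computing one mean and standard deviation over a single list is a simplification I chose for illustration; on real image data the statistics would be computed over the training set.

```python
import statistics

pixels = [0, 51, 102, 153, 204, 255]  # made-up raw pixel intensities

# What the tutorial does: rescale from [0, 255] to [0, 1].
scaled = [p / 255.0 for p in pixels]

# What the guideline describes: zero mean and unit standard deviation.
mean = statistics.mean(scaled)
std = statistics.pstdev(scaled)
standardized = [(s - mean) / std for s in scaled]

print(scaled)
print(standardized)
```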
Preprocess the class labels
We need the class label to be a 10-dimensional one-hot vector for each record. Not sure if this is related, but the scoring function of the model outputs a 10-dimensional array, with each value representing the score assigned to a particular class. If we look at the raw labels (before the conversion), we find them in a 1-dimensional array of digits. Hence the conversion.
print(y_train[:10])  # [5 0 4 1 9 2 1 3 1 4]
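Here is roughly what the conversion does, written as a plain-Python sketch. The function to_one_hot is my own stand-in, not the Keras implementation; keras.utils.to_categorical returns an array rather than nested lists.

```python
def to_one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows (plain lists of floats)."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # mark the position of the true class
        rows.append(row)
    return rows


labels = [5, 0, 4, 1, 9, 2, 1, 3, 1, 4]  # the first ten labels printed above
one_hot = to_one_hot(labels, 10)
print(one_hot[0])  # [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```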
Step 3: Define the model structure
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
A Sequential model is a linear stack of layers. Here we added 8 layers. Why do we add these layers but not others? I don’t know. People spend a great deal of time trying out different network architectures. If we are just starting out, we might just rely on architectures that are proven to be useful, like the examples provided by Keras.
……A ConvNet is made up of Layers. Every Layer has a simple API: It transforms an input 3D volume to an output 3D volume with some differentiable function that may or may not have parameters……We use three main types of layers to build ConvNet architectures: Convolutional Layer, Pooling Layer, and Fully-Connected Layer (exactly as seen in regular Neural Networks). We will stack these layers to form a full ConvNet architecture…..
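To check my understanding of "input volume to output volume", I traced the shapes and parameter counts of the model by hand. The sketch below assumes the Keras defaults for these layers (stride 1 and no padding for Conv2D, pooling stride equal to the pool size); the helper functions are hypothetical and only do the arithmetic.

```python
def conv2d_valid(shape, filters, kernel):
    """Output shape and parameter count of a stride-1, unpadded convolution."""
    rows, cols, depth = shape
    out_shape = (rows - kernel + 1, cols - kernel + 1, filters)
    params = kernel * kernel * depth * filters + filters  # weights + biases
    return out_shape, params


def max_pool(shape, pool):
    """Output shape of non-overlapping max pooling (no parameters)."""
    rows, cols, depth = shape
    return (rows // pool, cols // pool, depth)


def dense_params(num_inputs, units):
    """Parameter count of a fully connected layer: weights + biases."""
    return num_inputs * units + units


shape = (28, 28, 1)                       # one grayscale MNIST image
shape, p1 = conv2d_valid(shape, 32, 3)    # (26, 26, 32)
shape, p2 = conv2d_valid(shape, 64, 3)    # (24, 24, 64)
shape = max_pool(shape, 2)                # (12, 12, 64)
flat_size = shape[0] * shape[1] * shape[2]  # Flatten: 9216 values
p3 = dense_params(flat_size, 128)
p4 = dense_params(128, 10)
total_params = p1 + p2 + p3 + p4

print(shape, flat_size, total_params)
```

Dropout and Flatten have no parameters, so they don't show up in the total. Almost all of the parameters sit in the first Dense layer, not in the convolutional layers.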
So why do we use Convolutional layers instead of the regular ones? In short, to solve performance and scalability issues as well as to avoid overfitting when processing full images.
The links above covered Conv2D, MaxPooling, and Dense layers. What about Dropout and Flatten here?
- In short, Dropout is a way to prevent overfitting in the training.
- Flatten is not covered in the CS231N notes, but what it does is flatten the 3D input into a 1D vector before sending it to the fully connected Dense layer.
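A minimal sketch of both operations in plain Python, to show the idea rather than the Keras implementation. The function names and the tiny 2 x 2 x 2 volume are my own; the dropout here is the "inverted" variant that rescales the surviving values during training, and it would simply be skipped at test time.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability rate; scale survivors by 1 / (1 - rate)."""
    keep_prob = 1.0 - rate
    return [v / keep_prob if rng.random() >= rate else 0.0 for v in values]


def flatten(volume):
    """Unroll a rows x cols x depth nested list into one flat list."""
    return [value for row in volume for pixel in row for value in pixel]


volume = [[[1.0, 2.0], [3.0, 4.0]],
          [[5.0, 6.0], [7.0, 8.0]]]  # a tiny 2 x 2 x 2 volume

flat = flatten(volume)
dropped = dropout(flat, 0.5, random.Random(0))  # fixed seed, arbitrary choice

print(flat)
print(dropped)
```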
At this point, the model structure is defined. We then specify the loss function, the way to optimize it and the measurement metric in the compile method.
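For reference, this is my understanding of what the softmax activation and the categorical cross-entropy loss compute for a single example, sketched in plain Python. It is a simplification: the helper names are mine, and the real Keras loss also guards against taking the log of zero, which I left out.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_label, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probabilities))


scores = [0.0] * 10                        # an untrained model: all classes tie
probs = softmax(scores)                    # every class gets probability 0.1
label = [0.0] * 5 + [1.0] + [0.0] * 4      # one-hot label for the digit 5
loss = categorical_crossentropy(label, probs)

print(probs)
print(loss)
```

With ten equally likely classes the loss is ln(10), about 2.3, which is a handy sanity check for the first few batches of training.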
Step 4: Train the model and test it
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
Not much to explain here. We train the model with the training data. However, one concern I have with this piece is that the model is validated on the test data after each epoch, and the final score is computed on that same test data. What we should do instead is hold out a dedicated validation set from the training set (as suggested in the course notes; the validation set is considered already burned during training, see the last point in the summary). Therefore, using the validation_split option may be a better idea.
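In Keras this would mean passing something like validation_split=0.1 to model.fit in place of validation_data. As I understand the documentation, the held-out part is taken from the end of the training data, before any shuffling. The sketch below mimics that split in plain Python; the function name and the 10% fraction are my own choices for illustration.

```python
def split_for_validation(samples, validation_split):
    """Hold out the last fraction of the samples for validation."""
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]


sample_ids = list(range(60000))  # stand-ins for the 60,000 training images
train_ids, val_ids = split_for_validation(sample_ids, 0.1)

print(len(train_ids), len(val_ids))  # 54000 6000
```

The test set is then touched only once, in model.evaluate, after training has finished.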