The following is an implementation of 'ImageNet Classification with Deep Convolutional Neural Networks' by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. It allows you to train a model with specifications identical to AlexNet's on training and validation images saved in a specified directory.
The AlexNet architecture consists of five convolutional layers followed by three fully connected layers. Three of the convolutional layers (layers 1, 2, and 5) are followed by overlapping maxpooling layers, and the first and second convolutional layers are additionally followed by local response normalization layers, applied before their respective pooling layers. While ImageNet contains 256x256 color images, the model is trained on 224x224 slices of these images in order to construct additional samples. Furthermore, predictions are made by creating 10 such 224x224 slices (method described below) from the original 256x256 image and averaging predictions across these slices.
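For reference, here is a minimal Keras sketch of that layer stack. This is illustrative rather than this repo's exact code; padding choices and the LRN wrapper are assumptions.

import tensorflow as tf
from tensorflow.keras import layers

# LRN is not a built-in Keras layer, so we wrap tf.nn.local_response_normalization.
# The paper's window of n=5 channels corresponds to depth_radius=2 (2 * 2 + 1 = 5).
def lrn():
    return layers.Lambda(lambda x: tf.nn.local_response_normalization(
        x, depth_radius=2, bias=2.0, alpha=1e-4, beta=0.75))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    layers.Conv2D(96, 11, strides=4, activation='relu'),       # conv1
    lrn(),
    layers.MaxPool2D(pool_size=3, strides=2),                  # overlapping pool
    layers.Conv2D(256, 5, padding='same', activation='relu'),  # conv2
    lrn(),
    layers.MaxPool2D(pool_size=3, strides=2),
    layers.Conv2D(384, 3, padding='same', activation='relu'),  # conv3
    layers.Conv2D(384, 3, padding='same', activation='relu'),  # conv4
    layers.Conv2D(256, 3, padding='same', activation='relu'),  # conv5
    layers.MaxPool2D(pool_size=3, strides=2),
    layers.Flatten(),
    layers.Dense(4096, activation='relu'),
    layers.Dropout(0.5),                                       # dropout 0.5, as in the paper
    layers.Dense(4096, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1000, activation='softmax'),
])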
AlexNet utilizes stochastic gradient descent with momentum and constant weight decay, with values of 0.9 and 0.0005 respectively. The authors note that weight decay was not simply a means of regularization, but was a requirement for their model to learn. Additionally, an adaptive learning rate is used: the learning rate is initialized to 0.01 and is decreased by a factor of 10 whenever validation error stops decreasing. The exact criteria for this reduction are not specified by the authors, so this implementation uses a fairly standard heuristic.
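As a sketch, the optimizer and that plateau heuristic might look like the following in tf.keras. Note the weight_decay argument only exists in recent TensorFlow releases (older versions need an L2 kernel regularizer instead), and the patience value here is an assumption, not something specified by the paper.

import tensorflow as tf

# SGD with momentum 0.9 and weight decay 0.0005, approximating the paper's setup.
optimizer = tf.keras.optimizers.SGD(
    learning_rate=0.01, momentum=0.9, weight_decay=5e-4)

# A standard plateau heuristic: divide the learning rate by 10 whenever
# validation loss has not improved for a few epochs.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.1, patience=3)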
AlexNet contains three maxpooling layers, following the first, second, and fifth convolutional layers. A 3x3 kernel with a stride of (2, 2) is used, so that neighboring pooling windows intentionally overlap, which the authors found further increases the model's performance.
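In other words, each 3x3 window steps by only 2 pixels and shares a row and column with its neighbor. A quick shape check:

import tensorflow as tf

x = tf.random.normal([1, 55, 55, 96])                    # e.g. a conv1-sized output
pool = tf.keras.layers.MaxPool2D(pool_size=3, strides=2)
print(pool(x).shape)                                     # (1, 27, 27, 96)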
While batch normalization is now the most prevalent normalization method for avoiding vanishing and exploding gradients when training very deep networks, AlexNet instead uses local response normalization. This method has largely been replaced by batch normalization, but our model still utilizes local response normalization to stay true to the original paper.
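TensorFlow still exposes this operation directly. With the paper's hyperparameters (k=2, n=5, alpha=1e-4, beta=0.75), a standalone call looks like this:

import tensorflow as tf

# Each activation is divided by a sum of squares over neighboring channels;
# depth_radius=2 spans 2 * 2 + 1 = 5 channels, matching the paper's n=5.
x = tf.random.normal([1, 27, 27, 256])
y = tf.nn.local_response_normalization(
    x, depth_radius=2, bias=2.0, alpha=1e-4, beta=0.75)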
AlexNet employs several types of image augmentation. The first is to rescale every image to 256 pixels along its short side and crop to 256x256, standardizing image shape; the model then trains and predicts on 224x224 slices of these images, rather than the full 256x256 image, which artificially and substantially increases the size of our dataset. The second is to reflect each image horizontally with probability 0.5. Finally, RGB values are first normalized by centering around zero and scaling by 255; PCA is then performed over the entire set of training RGB values, and the resulting principal components are used to randomly jitter each image's RGB values.
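Here is a rough NumPy sketch of that PCA jitter. The function names are hypothetical, and sigma=0.1 follows the paper's Gaussian-distributed coefficients; this is not this repo's exact code.

import numpy as np

def fit_pca_terms(pixels):
    # `pixels`: an (N, 3) array of normalized RGB values from the training set.
    cov = np.cov(pixels, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # columns of eigvecs are components
    return eigvals, eigvecs

def pca_jitter(image, eigvals, eigvecs, sigma=0.1):
    # Draw one Gaussian coefficient per principal component, then add the
    # resulting RGB offset to every pixel of the (H, W, 3) image.
    alphas = np.random.normal(0.0, sigma, size=3)
    return image + eigvecs @ (alphas * eigvals)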
While the set of data augmentations is identical during training and prediction, the manner in which they are applied differs slightly between the two phases. Firstly, upon initialization, our model calculates the PCA terms needed for augmentation. Next, during training, each 256x256 scaled input image is randomly flipped horizontally with probability 0.5, and then a random 224x224 slice of the image is chosen. The RGB values are then transformed, and the image is fed through our network. Finally, at prediction time, each 256x256 input image is used to create 10 new images on which to predict: the corners and center of the original image provide 5 of these images, and their horizontal reflections provide the remaining 5. The RGB values of these 10 images are then transformed using the PCA terms, and predictions are made across all 10 images. Our final prediction is the average of these 10 predictions.
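A sketch of that ten-crop scheme in NumPy, where model_fn stands in for whatever function maps a batch of crops to class probabilities:

import numpy as np

def ten_crops(image, size=224):
    # Four corner crops and a center crop, plus their horizontal reflections.
    h, w = image.shape[:2]
    offsets = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
               ((h - size) // 2, (w - size) // 2)]
    crops = [image[t:t + size, l:l + size] for t, l in offsets]
    crops += [np.fliplr(c) for c in crops]
    return np.stack(crops)                     # shape: (10, size, size, 3)

# Final prediction: average the class probabilities over all ten crops.
# probs = model_fn(ten_crops(image)).mean(axis=0)   # model_fn is hypothetical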
The paper includes many finer details that are less novel: ReLU activation is used for all non-output layers, weight and bias initializations are specified, and so on. Each of these details is accounted for, but they are likely of little interest.
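For instance, the paper initializes weights from a zero-mean Gaussian with standard deviation 0.01 and sets the biases of certain layers to 1. In Keras that might look like this (a sketch, not this repo's exact code):

from tensorflow.keras import layers, initializers

# Zero-mean Gaussian weights (std 0.01); the paper uses bias 1 in conv2, conv4,
# conv5, and the hidden fully connected layers, and bias 0 elsewhere.
conv2 = layers.Conv2D(
    256, 5, padding='same', activation='relu',
    kernel_initializer=initializers.RandomNormal(stddev=0.01),
    bias_initializer=initializers.Constant(1.0))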
The creators of the AlexNet network trained their model across two GPUs by splitting the network's layers between the GPUs, with layers from separate GPUs being combined only sporadically. This implementation takes a different approach to multi-GPU utilization, instead offering the option to distribute the entire training task across multiple GPUs. Rather than each GPU being responsible for only a subset of the model, this implementation utilizes TensorFlow's MirroredStrategy to fully replicate model training across all available GPUs on the current machine. This can easily be extended to train across multiple workers using TensorFlow's MultiWorkerMirroredStrategy if desired.
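The distribution itself only requires building and compiling the model inside the strategy's scope. A minimal sketch, where build_alexnet is a hypothetical stand-in for this repo's model construction:

import tensorflow as tf

# Each GPU receives a full replica of the model and a slice of every batch;
# gradients are averaged across replicas after each step.
strategy = tf.distribute.MirroredStrategy()
print('Number of replicas:', strategy.num_replicas_in_sync)

with strategy.scope():
    model = build_alexnet()   # hypothetical helper returning a keras.Model
    model.compile(optimizer='sgd', loss='sparse_categorical_crossentropy')

# Swapping in tf.distribute.MultiWorkerMirroredStrategy() extends the same
# pattern to multiple machines.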
Let's create and train an AlexNet-inspired model. Before we begin, we must ensure our data is in the desired format. We assume that all images are located in a single directory with subdirectories 'train/', 'val/', and 'test/'. Additionally, we assume that we have a JSON file mapping image IDs to their corresponding labels, and another JSON file mapping labels to label indices. The label indices can be arbitrary, but must be provided for the sake of consistency across sessions. The variable 'data_path' may be either a local path or a URL pointing to the desired image directory, and both 'label_path' and 'label_encoding' must be JSON files.
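Concretely, the expected layout looks something like this (file names are illustrative):

data_path/
    train/   # training images
    val/     # validation images
    test/    # test images

The two JSON label files may live anywhere accessible; their expected formats are shown below.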
Let's begin by defining these paths.
from AlexNet import AlexNet
data_path = 'DIRECTORY-CONTAINING-IMAGE-SUBDIRECTORIES'
# Example file format: {image_1: 'cat', image_2: 'mug', ...}
label_path = 'PATH-TO-IMAGE-LABEL-MAP.json'
# Example file format: {'cat': 0, 'mug': 1, ...}
label_encoding = 'PATH-TO-LABEL-INDEX-MAP.json'
With the paths prepared, creating and fitting our model is as simple as running the following:
# Create and fit model. Store history for plotting training curve.
model = AlexNet(data_path, label_path, label_encoding)
history = model.fit(epochs=5)
And just like that, we're ready to make predictions with our model. The 'predict' method can be used to return the 1,000-dimensional output array, or the 'predict_n' method can be used to return the n classes with the highest probabilities. All image augmentations are handled automatically, so the path to our color input image is the only input needed.
test_image = '/Users/justinsima/dir/implementations/datasets/ImageNet/dummy_data/test/ILSVRC2012_test_00018560.JPEG'
pred = model.predict(test_image)
top_pred = model.predict_n(test_image, n=1)
print(f'Prediction Probabilities: \n{pred}')
print(f'Most Likely Class: {top_pred}')
Thanks for checking out my repo. For more information, please see the original paper here: https://papers.nips.cc/paper/2012/hash/c399862d3b9d6b76c8436e924a68c45b-Abstract.html