Convolutional Neural Networks (CNNs) are among the most commonly used deep learning algorithms. They are widely used for image-related tasks, such as image recognition, object detection, and image segmentation. Their applications range from powering vision in self-driving cars to the automatic tagging of friends in our Facebook pictures. Although CNNs are mostly used for image datasets, they can also be applied to textual datasets.
What are CNNs?
A CNN, also known as a ConvNet, is one of the most widely used deep learning algorithms for computer vision tasks. Let's understand how it works through an example.
Suppose we want a CNN to recognize the following image:
When we feed the image to a computer, it converts the image into a matrix of pixel values with dimensions [image width x image height x number of channels]. For a color image this is a 3D matrix, which is hard to visualize. So, for the sake of understanding, consider a grayscale image, which has a single channel and is therefore a 2D matrix. The input grayscale image is converted into a matrix of pixel values ranging from 0 to 255, with each value representing the intensity of the pixel at that point:
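As a rough illustration of this representation, the sketch below builds a tiny grayscale image as a 2D matrix and a color version with a channel axis. The pixel values are made up, and the use of NumPy is an assumption of this example. Note that NumPy orders the axes as (height, width, channels).

```python
import numpy as np

# Hypothetical 4x4 grayscale image: one channel,
# intensities from 0 (black) to 255 (white).
gray_image = np.array([
    [0,  50, 100, 150],
    [20, 70, 120, 170],
    [40, 90, 140, 190],
    [60, 110, 160, 255],
], dtype=np.uint8)

# A color image adds a channel axis. Here the same values are
# simply copied into three channels for illustration.
color_image = np.stack([gray_image] * 3, axis=-1)

print(gray_image.shape)   # (4, 4)
print(color_image.shape)  # (4, 4, 3)
```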
Now, how does the CNN come to understand that the image contains a horse? A CNN consists of the following three important layers:
The convolutional layer
The pooling layer
The fully connected layer
Let's look at each of them in more depth.
1. Convolutional layers
The convolutional layer is the first and core layer of the CNN. It is one of the building blocks of a CNN and is used for extracting important features from the image. So how does the CNN extract these features? It does so through the convolution operation, which helps us understand what the image is all about.
As we know, every input image is represented by a matrix of pixel values. Apart from the input matrix, we also have another matrix called the filter matrix. The filter matrix is also known as a kernel, or simply a filter, as shown in fig.1.
We take the filter matrix, slide it over the input matrix by one pixel, perform element-wise multiplication, sum up the results, and produce a single number as shown in fig.2.
We are basically sliding the filter matrix over the entire input matrix by one pixel, performing element-wise multiplication and summing their results, which creates a new matrix called a feature map or activation map. This is called the convolution operation.
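The sliding-window procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up input and filter values, not an optimized implementation. Like most deep learning libraries, it does not flip the filter before multiplying. The function name `convolve2d` is hypothetical.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image one pixel at a time (no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiplication of the window and the filter,
            # then a sum that yields a single number.
            window = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(window * kernel)
    return feature_map

# Made-up 4x4 input matrix and 2x2 filter matrix.
image = np.array([
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
])
kernel = np.array([
    [1, 0],
    [0, 1],
])

feature_map = convolve2d(image, kernel)
print(feature_map)  # values: [[2, 4, 6], [0, 2, 4], [6, 0, 2]]
```

A 4x4 input and a 2x2 filter give a 3x3 feature map, because the filter fits in three positions along each axis.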
Various filters are used for extracting different features from the image. For instance, if we use a sharpen filter, then it will sharpen our image. So, instead of using one filter, we can use multiple filters to extract different features from the image and produce multiple feature maps. The depth of the resulting feature map is then equal to the number of filters.
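To make the relation between the number of filters and the depth concrete, here is a small sketch that applies two filters to the same input and stacks the results. The sharpen kernel below is a commonly used 3x3 choice and is an assumption of this example, not necessarily the filter the article had in mind. The box blur filter and the random input are also illustrative.

```python
import numpy as np

def convolve2d(image, kernel):
    """Plain sliding-window convolution with a stride of 1 and no padding."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Two illustrative 3x3 filters (assumed values, see the note above).
filters = {
    "sharpen": np.array([[0, -1, 0],
                         [-1, 5, -1],
                         [0, -1, 0]], dtype=float),
    "box_blur": np.ones((3, 3)) / 9.0,
}

# Made-up 5x5 grayscale input with values in [0, 255].
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(5, 5)).astype(float)

# One feature map per filter, stacked along a new depth axis.
feature_maps = np.stack(
    [convolve2d(image, kernel) for kernel in filters.values()],
    axis=-1,
)

print(feature_maps.shape)  # (3, 3, 2): depth 2 because two filters were used
```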
Strides: The number of pixels by which we slide the filter matrix over the input matrix is called the stride. If we set the stride to 2, then we slide the filter matrix over the input matrix two pixels at a time. The following diagram shows a convolution operation with a stride of 2:
Padding: In some cases, the filter does not perfectly fit the input matrix. In this case, we perform padding.
Zero padding or same padding: We can simply pad the input matrix with zeros so that the filter fits the input matrix, as shown in fig.(a).
Valid padding: Instead of padding with zeros, we can simply discard the region of the input matrix where the filter doesn't fit, as shown in fig.(b).
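The effect of strides and padding on the size of the feature map can be sketched as follows. The helper names `conv_output_size` and `convolve2d` and the 6x6 input are hypothetical. The size formula floor((n + 2p - f) / s) + 1 is the standard one for an input of size n, filter of size f, padding p, and stride s.

```python
import numpy as np

def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Number of filter positions along one axis."""
    return (input_size + 2 * padding - filter_size) // stride + 1

def convolve2d(image, kernel, stride=1, padding=0):
    # Zero padding adds a border of zeros around the input.
    padded = np.pad(image, padding, mode="constant", constant_values=0)
    kh, kw = kernel.shape
    out_h = conv_output_size(image.shape[0], kh, stride, padding)
    out_w = conv_output_size(image.shape[1], kw, stride, padding)
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Move the window by `stride` pixels at each step.
            top, left = i * stride, j * stride
            out[i, j] = np.sum(padded[top:top + kh, left:left + kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)  # made-up 6x6 input
kernel = np.ones((3, 3))                          # made-up 3x3 filter

valid_out = convolve2d(image, kernel)               # stride 1, no padding
same_out = convolve2d(image, kernel, padding=1)     # zero padding keeps 6x6
strided_out = convolve2d(image, kernel, stride=2)   # stride 2, no padding

print(valid_out.shape)    # (4, 4)
print(same_out.shape)     # (6, 6)
print(strided_out.shape)  # (2, 2)
```

With a stride of 2 and no padding, the 3x3 filter only fits at offsets 0 and 2 of the 6-pixel axis, so the last row and column of the input are discarded, which matches the valid padding case. Padding the input with one ring of zeros instead keeps the output the same size as the input at a stride of 1.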
2. Pooling layers
The feature maps produced by the convolutional layers are often large in dimension. In order to reduce the dimensions of the feature maps, we perform a pooling operation. A pooling operation is also called a downsampling or subsampling operation. There are different types of pooling operations, including max pooling, average pooling, and sum pooling.
In max pooling, we slide the filter over the input matrix and simply take the maximum value within the filter window, as the accompanying figure shows.
In average pooling, we take the average of the input values within the filter window, and in sum pooling, we sum all the input values within the filter window, as the accompanying figure shows.
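The three pooling variants differ only in how each window is reduced to one number, which the following sketch tries to show. The function name `pool2d`, the 2x2 window with a stride of 2, and the input values are assumptions chosen for illustration.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Reduce each size x size window to a single value."""
    reducers = {"max": np.max, "average": np.mean, "sum": np.sum}
    reduce_fn = reducers[mode]
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = feature_map[top:top + size, left:left + size]
            out[i, j] = reduce_fn(window)
    return out

# Made-up 4x4 feature map.
feature_map = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
], dtype=float)

max_pooled = pool2d(feature_map, mode="max")      # [[6, 4], [7, 9]]
avg_pooled = pool2d(feature_map, mode="average")  # [[3.75, 2.25], [3.5, 5]]
sum_pooled = pool2d(feature_map, mode="sum")      # [[15, 9], [14, 20]]

print(max_pooled)
print(avg_pooled)
print(sum_pooled)
```

In each case the 4x4 feature map shrinks to 2x2, which is the dimensionality reduction that pooling is meant to provide.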
3. Fully connected layers
A CNN can have multiple convolutional layers and pooling layers. However, these layers only extract features from the input image and produce the feature maps; that is, they are just feature extractors. We still need to classify the extracted features, and for that we use a feedforward neural network. We flatten the feature maps into a vector and feed it to the feedforward network, which applies an activation function, such as sigmoid, and returns the output, stating whether the image contains a horse or not. This is called a fully connected layer.
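The flatten-then-classify step can be sketched as a single fully connected unit with a sigmoid activation. The weights below are random and untrained, so the resulting probability is meaningless; the example only shows the shapes and the flow of data. All names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Hypothetical pooled feature maps: 2x2 spatial size, 3 filters deep.
feature_maps = rng.random((2, 2, 3))

# Flatten the feature maps into a vector of length 2 * 2 * 3 = 12.
flat = feature_maps.reshape(-1)

# Untrained placeholder parameters of one fully connected output unit.
weights = rng.normal(size=flat.shape[0])
bias = 0.0

# Weighted sum followed by the sigmoid activation.
probability = sigmoid(weights @ flat + bias)

print(flat.shape)   # (12,)
print(probability)  # a value between 0 and 1, e.g. read as "horse" if above 0.5
```

In a trained network, the weights and bias would be learned from labeled images rather than drawn at random.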
Here's the code for a CNN that classifies handwritten digits from the MNIST dataset:
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K

batch_size = 128
num_classes = 10
epochs = 12
img_rows, img_cols = 28, 28  # MNIST images are 28x28 pixels

(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Preprocessing: reshape the data so the channel axis is where the backend expects it
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

# Scale the pixel values from [0, 255] to [0, 1]
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# Convert the vector of classes into binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

# Instantiate the Sequential model
model = Sequential()

# Add the layers
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

# Fit the model
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))

# Evaluate the model; score holds [loss, accuracy]
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])