The Basic Usage of CNN

What are Convolutional Neural Networks?

They’re basically just neural networks that use convolutional layers (conv layers), which are based on the mathematical operation of convolution. A conv layer consists of a set of filters, which you can think of as just 2D matrices of numbers. Here’s an example 3x3 filter:
[Figure: an example 3x3 filter]
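
To make "a 2D matrix of numbers" concrete, here is a minimal sketch of a filter written as plain nested lists. The values are an assumption: they are the vertical Sobel filter, picked only for illustration, and may differ from the filter in the article's figure. The name `example_filter` is hypothetical too.

```python
# A 3x3 filter is just a small grid of numbers, written here as nested lists.
# Assumption: these values are the vertical Sobel filter, chosen only for
# illustration; they may differ from the filter shown in the article's figure.
example_filter = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

rows = len(example_filter)
cols = len(example_filter[0])
print(f"{rows}x{cols} filter")  # 3x3 filter
```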

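We can use an input image and a filter to produce an output image by convolving the filter with the input image. This consists of overlaying the filter on the image at some location, multiplying the overlapping image and filter values element-wise, summing the products to get the value of one output pixel, and repeating for every location.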

Consider this tiny 4x4 grayscale image and this 3x3 filter:

[Figure: the 4x4 grayscale image and the 3x3 filter]
The numbers in the image represent pixel intensities, where 0 is black and 255 is white. We’ll convolve the input image and the filter to produce a 2x2 output image:

[Figure: convolving the 4x4 image with the 3x3 filter to produce a 2x2 output]
To start, let’s overlay our filter in the top left corner of the image:

Next, we perform element-wise multiplication between the overlapping image values and filter values. Here are the results, starting from the top left corner and going right, then down:

[Figure: the element-wise products for the top left overlay]
Next, we sum up all the results. That’s easy enough:

[Figure: the sum of the element-wise products]
Finally, we place our result in the destination pixel of our output image. Since our filter is overlaid in the top left corner of the input image, our destination pixel is the top left pixel of the output image:

[Figure: the result placed in the top left pixel of the output image]
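
The three steps for this first pixel (overlay, multiply element-wise, sum) can be sketched in a few lines of plain Python. The pixel values and the filter below are assumptions made up for illustration, since the article's figures are not reproduced here; `image`, `kernel`, `products`, and `top_left` are hypothetical names.

```python
# Assumption: an illustrative 4x4 grayscale image (0 = black, 255 = white).
image = [
    [0, 50, 0, 29],
    [0, 80, 31, 2],
    [33, 90, 0, 75],
    [0, 9, 0, 95],
]

# Assumption: the vertical Sobel filter, used only as an example 3x3 filter.
kernel = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

# Overlay the filter on the top left corner and multiply each filter value
# by the image value underneath it (going right, then down).
products = [
    [image[r][c] * kernel[r][c] for c in range(3)]
    for r in range(3)
]

# Sum all nine products to get the top left pixel of the output image.
top_left = sum(sum(row) for row in products)

print(products)  # [[0, 0, 0], [0, 0, 62], [-33, 0, 0]]
print(top_left)  # 29
```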
We do the same thing to generate the rest of the output image:

[Figure: the completed 2x2 output image]
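
Putting the whole walkthrough together, here is a minimal sketch of convolving an image with a filter without any padding. It uses plain Python lists so it runs with no dependencies. The function name `convolve_valid` and the pixel and filter values are assumptions for illustration, not values from the article's figures.

```python
def convolve_valid(image, kernel):
    """Slide the kernel over the image (no padding) and return the output."""
    image_h, image_w = len(image), len(image[0])
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    # The filter must fit entirely inside the image, so the output shrinks.
    out_h = image_h - kernel_h + 1
    out_w = image_w - kernel_w + 1

    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Element-wise multiply the overlapping values, then sum them.
            total = 0
            for r in range(kernel_h):
                for c in range(kernel_w):
                    total += image[i + r][j + c] * kernel[r][c]
            row.append(total)
        output.append(row)
    return output


# Assumption: illustrative 4x4 image and 3x3 vertical Sobel filter.
image = [
    [0, 50, 0, 29],
    [0, 80, 31, 2],
    [33, 90, 0, 75],
    [0, 9, 0, 95],
]
kernel = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

output = convolve_valid(image, kernel)
print(output)  # [[29, -192], [-35, -22]]
```

A 4x4 image and a 3x3 filter give a (4 - 3 + 1) x (4 - 3 + 1) = 2x2 output, matching the walkthrough. Note that the filter is not flipped here, as is usual in deep learning code, so strictly speaking this is cross-correlation.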

Padding

Remember convolving a 4x4 input image with a 3x3 filter earlier to produce a 2x2 output image? Often, we’d prefer the output image to be the same size as the input image. To do this, we add zeros around the image so we can overlay the filter in more places. A 3x3 filter requires 1 pixel of padding:

[Figure: an image with 1 pixel of zero padding]
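
Here is a minimal sketch of zero padding followed by the same convolution, to show that the output keeps the input's size. The helper names `zero_pad` and `convolve_valid` and the example values are assumptions for illustration. In general, an odd k x k filter needs (k - 1) / 2 pixels of padding, which is 1 for a 3x3 filter.

```python
def zero_pad(image, pad):
    """Return a copy of the image surrounded by `pad` rows/columns of zeros."""
    padded_width = len(image[0]) + 2 * pad
    top_bottom = [[0] * padded_width for _ in range(pad)]
    middle = [[0] * pad + list(row) + [0] * pad for row in image]
    return top_bottom + middle + [list(row) for row in top_bottom]


def convolve_valid(image, kernel):
    """Slide the kernel over the image (no extra padding) and return the output."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    return [
        [
            sum(
                image[i + r][j + c] * kernel[r][c]
                for r in range(kernel_h)
                for c in range(kernel_w)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Assumption: illustrative 4x4 image and 3x3 vertical Sobel filter.
image = [
    [0, 50, 0, 29],
    [0, 80, 31, 2],
    [33, 90, 0, 75],
    [0, 9, 0, 95],
]
kernel = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

padded = zero_pad(image, 1)
same_output = convolve_valid(padded, kernel)

print(len(padded), len(padded[0]))            # 6 6
print(len(same_output), len(same_output[0]))  # 4 4
```

The inner 2x2 block of `same_output` is identical to the unpadded result, because those positions never touch the added zeros.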

Weights

A conv layer’s primary parameter is the number of filters it has.
For our MNIST CNN, we’ll use a small conv layer with 8 filters as the initial layer in our network. Assuming 3x3 filters like the one above and no padding, each filter produces its own 26x26 output, so the layer turns the 28x28 input image into a 26x26x8 output volume:

[Figure: the conv layer turning the 28x28 input image into a 26x26x8 output volume]
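
As a rough sketch of where the 26x26x8 shape comes from, the snippet below applies 8 filters to a 28x28 image and checks the sizes. It assumes 3x3 filters with no padding and uses random numbers in place of real MNIST pixels and trained weights. All names are hypothetical. The 8 outputs are stored as a list of 8 separate 26x26 maps, which is the same volume with the filter axis first.

```python
import random


def convolve_valid(image, kernel):
    """Slide the kernel over the image (no padding) and return the output."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    return [
        [
            sum(
                image[i + r][j + c] * kernel[r][c]
                for r in range(kernel_h)
                for c in range(kernel_w)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
num_filters = 8
filter_size = 3  # assumption: 3x3 filters

# Assumption: random values stand in for the layer's trained weights.
filters = [
    [[rng.uniform(-1, 1) for _ in range(filter_size)] for _ in range(filter_size)]
    for _ in range(num_filters)
]

# Assumption: random pixel intensities stand in for a 28x28 MNIST digit.
image = [[rng.randint(0, 255) for _ in range(28)] for _ in range(28)]

# One 26x26 output per filter; together they form the 26x26x8 volume.
feature_maps = [convolve_valid(image, f) for f in filters]
num_weights = num_filters * filter_size * filter_size

print(len(feature_maps), len(feature_maps[0]), len(feature_maps[0][0]))  # 8 26 26
print(num_weights)  # 72
```

Under these assumptions the whole layer has only 8 x 3 x 3 = 72 weights, no matter how large the input image is.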

Pooling

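Neighboring pixels in images tend to have similar values, so conv layers will typically also produce similar values for neighboring pixels in outputs. As a result, much of the information contained in a conv layer’s output is redundant. Pooling layers address this by shrinking their input: they replace each small block of neighboring values with a single value, such as the block’s maximum.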

Here’s an example of a Max Pooling layer with a pooling size of 2:
[Figure: a Max Pooling layer with a pooling size of 2]
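
Here is a minimal sketch of max pooling with a pooling size of 2, assuming non-overlapping 2x2 blocks (any leftover row or column that doesn't fill a block is dropped). The function name `max_pool` and the input values are assumptions for illustration.

```python
def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[i * size + r][j * size + c]
                for r in range(size)
                for c in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Assumption: an illustrative 4x4 input.
feature_map = [
    [0, 50, 0, 29],
    [0, 80, 31, 2],
    [33, 90, 0, 75],
    [0, 9, 0, 95],
]

pooled = max_pool(feature_map)
print(pooled)  # [[80, 31], [90, 95]]
```

Width and height are both halved, so a 26x26 conv output would become 13x13 after this pooling step.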