This part is about Convolutional Neural Networks (CNNs). In the last part we already touched on this topic while using a pre-trained neural network. This part goes a bit deeper into the theory.
Convolutional neural networks
Convolutional neural networks are mostly used for images, and they consist of different kinds of layers. Let's imagine a CNN as a black box that takes an image and outputs a prediction. What happens inside? We'll look at two main types of layers: convolutional layers and dense layers. There are many more layer types. You can find good information here.
Convolutional layers and filters
Convolutional layers consist of filters, which are like small images. They're usually quite small, for example 5×5, or even smaller. These filters contain simple shapes, like
[ ^ ] [ – ] [ / ] [ u ] [ | ] [ \ ]

(This is not from a real network; it's just meant to build intuition.)
Then we take one filter and slide it across the image. Every time we apply the filter to a part of the image, we measure how similar the filter is to that part. This produces a table in which each cell's value corresponds to the similarity between the filter and the corresponding part of the image, e.g. a number between 0 (no similarity) and 9 (very similar). This table is called a feature map. So a feature map is the result of applying a filter to an image.
So we slide a filter across the image, and at every position we calculate the similarity between this particular filter and that part of the image, record the result, and get a feature map. High values mean a higher degree of similarity. We repeat these steps for each filter. With six filters, we get six feature maps.
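The sliding-and-scoring procedure above can be sketched in NumPy. Here the "similarity" is simplified to the raw dot product between the filter and each image patch; the tiny image and the "/"-shaped filter are made up for illustration.

```python
import numpy as np

def feature_map(image, filt):
    """Slide a filter over a grayscale image and record the similarity
    (here simplified to the raw dot product) at each position."""
    ih, iw = image.shape
    fh, fw = filt.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + fh, j:j + fw]
            out[i, j] = np.sum(patch * filt)  # high value = patch looks like the filter
    return out

# A tiny 6x6 image containing a "/"-like diagonal, and a 3x3 "/"-shaped filter
image = np.eye(6)[::-1]   # ones along the anti-diagonal
filt = np.eye(3)[::-1]    # small "/" shape
fmap = feature_map(image, filt)
print(fmap.shape)         # (4, 4): one similarity value per filter position
```

The highest values in `fmap` appear exactly where the filter lines up with the diagonal in the image.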
So the output of the first convolutional layer is a set of feature maps. We can take this output and treat it as a new image that we made from the original one. Then we can have another convolutional layer that applies its own set of filters, producing its own feature maps. Let's say in the first convolution we used 6 filters and in the second convolution we have 12 filters. Then the second convolution produces 12 feature maps. We can easily imagine continuing with further convolutional layers, each with its own filters. Because of this chaining, each layer learns more and more complex filters.
image –> CONV LAYER 1 –> CONV LAYER 2 –> CONV LAYER 3 –> …

| CONV LAYER 1 | CONV LAYER 2 | CONV LAYER 3 |
|---|---|---|
| [ ^ ] [ – ] [ / ] | [ < ] [ o ] [ > ] | [ ( ] [ U ] [ ) ] |
| [ u ] [ \| ] [ \ ] | [ L ] [ x ] [ + ] | [ ? ] [ Q ] [ ~ ] |
| Low-Level | Mid-Level | High-Level |
The second layer learns new filters by combining the filters from the previous layer. That means with every layer the network can learn more complex shapes (= higher-level features of the image). This is not exactly what filters look like, but we can think of them this way.
What do filters do?
When we apply a filter to some area, it looks at that region across all the values of all the feature maps in that particular region. That means it "goes in depth" and looks at all the different feature maps at once. So let's assume that in one place there is a 6 for [/] and a 6 for [\]; then there is a high possibility that there is an "X" in this area of the image.
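A minimal sketch of how a filter "goes in depth": each filter in the second layer spans all the feature maps produced by the first layer. The image size, filter sizes, and random values below are arbitrary placeholders; only the shapes matter.

```python
import numpy as np

def conv_layer(volume, filters):
    """Apply a bank of filters to a volume of shape (H, W, C).
    Each filter has shape (fh, fw, C): it looks across all C input
    feature maps at once ("goes in depth").
    Output shape: (H - fh + 1, W - fw + 1, n_filters)."""
    H, W, C = volume.shape
    n, fh, fw, _ = filters.shape
    out = np.zeros((H - fh + 1, W - fw + 1, n))
    for k in range(n):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, k] = np.sum(volume[i:i+fh, j:j+fw, :] * filters[k])
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))     # toy RGB image
layer1 = rng.random((6, 5, 5, 3))   # 6 filters -> 6 feature maps
layer2 = rng.random((12, 5, 5, 6))  # 12 filters, each spanning all 6 maps
maps1 = conv_layer(image, layer1)
maps2 = conv_layer(maps1, layer2)
print(maps1.shape, maps2.shape)     # (28, 28, 6) (24, 24, 12)
```

Note how the output of the first layer (6 feature maps) becomes the input depth of the second layer's filters.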
What happens when we take an image and pass it through a set of convolutional layers?
The result is a vector representation of the image. Let's say the image is 299×299×3; then the vector representation could be a one-dimensional array of, say, 2,048 values. This way we turn an image into a vector. This vector captures all the information that the filters of the convolutional layers were able to extract from the image.
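One common way to collapse the final stack of feature maps into a single vector is global average pooling, sketched below. The 8×8 spatial grid is an illustrative assumption; the 2,048 feature maps match the vector size mentioned above.

```python
import numpy as np

# Hypothetical output of the last convolutional layer: an 8x8 spatial
# grid with 2048 feature maps (shapes are illustrative).
conv_output = np.random.default_rng(1).random((8, 8, 2048))

# Global average pooling: average each feature map over its spatial
# positions, collapsing the volume into a single 2048-dim vector.
vector = conv_output.mean(axis=(0, 1))
print(vector.shape)  # (2048,)
```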
Dense layers
What we can do with this vector representation is build some dense layers on top of it. These layers turn the vector representation into the final prediction. The final prediction could be a label like "t-shirt". While the role of the convolutional layers is to extract a vector representation, the role of a dense layer is to use that representation to make the final prediction.
How to get the final prediction?
The vector representation consists of many numbers (its dimensionality could be 1,024 or 2,048; a power of two is the usual convention).
Binary classification problem
Using this vector, we want to build a model that makes a prediction. Let's start with a *binary classification problem*: "Is this image a t-shirt or not?" The vector is x, and y = {0, 1}, where 0 means not a t-shirt and 1 means a t-shirt. Here we use logistic regression: g(x) = sigmoid(xᵀw). x is the feature vector, and we have to train the model to get the weights w. The output is the probability that x is a t-shirt.
With this vector w we can make a prediction by computing x1*w1 + x2*w2 + … + xn*wn. Then we take this sum and turn it into a probability using the sigmoid function. The output is the probability of the image being a t-shirt.
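The weighted sum plus sigmoid can be written out directly. The weights below are random placeholders standing in for trained values:

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_tshirt(x, w):
    """Binary prediction: the dot product x1*w1 + ... + xn*wn,
    turned into a probability by the sigmoid."""
    return sigmoid(np.dot(x, w))

rng = np.random.default_rng(42)
x = rng.random(2048)              # vector representation of an image (toy values)
w = rng.normal(size=2048) * 0.01  # weights; random here, trained in practice
p = predict_tshirt(x, w)
print(0.0 < p < 1.0)              # True: always a valid probability
```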
Multiclass classification problem
What do we do in the case of multiple classes? We can build one model for each class: one for shirt, one for t-shirt, and one for dress. So we get another w vector for t-shirts, and we can do the same for dresses. We end up with three different w vectors, one per class. But instead of three separate sigmoid functions we use something different: the generalization of the sigmoid to multiple classes is called softmax. The output of softmax has three numbers (in this case). The first is the probability of x being a shirt, the second the probability of x being a t-shirt, and the last the probability of x being a dress.
So what happens here is that we put multiple logistic regressions together, and as a result we get a neural network.
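The three-class case with softmax, as a sketch (again with random placeholder weights; the class order shirt / t-shirt / dress is just the convention used above):

```python
import numpy as np

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.random(2048)                   # vector representation
W = rng.normal(size=(3, 2048)) * 0.01  # one weight vector per class
scores = W @ x                         # three raw scores, one per class
probs = softmax(scores)                # P(shirt), P(t-shirt), P(dress)
print(np.isclose(probs.sum(), 1.0))   # True
```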
```
        DENSE LAYER

  | x1 |
  | x2 |
  | x3 |           | o |
  | .  |   --->    | o |
  | .  |           | o |
  | .  |
  | xn |

  Input           Output
```
This layer is called a dense layer. It’s dense because it connects each element of the input with all the elements of its output. For this reason, these layers are sometimes called “fully connected”.
Each of the output elements has its own w (w1, w2, w3), so together they form a matrix W:
```
    |– w1 –|
W = |– w2 –|
    |– w3 –|
```
All we need to do to transform the input vector x into the output is compute W*x. That means a dense layer is a matrix multiplication.
We can put multiple dense layers together. Then we have an inner (hidden) dense layer and an outer (output) dense layer.
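Stacking two dense layers as plain matrix multiplications can be sketched as follows. The layer sizes are arbitrary, and the ReLU between the layers is an assumption (some nonlinearity is needed, otherwise two matrix multiplications collapse into a single one):

```python
import numpy as np

def dense(x, W):
    """A dense layer is just a matrix multiplication (bias omitted)."""
    return W @ x

rng = np.random.default_rng(7)
x = rng.random(2048)                       # vector representation
W_inner = rng.normal(size=(100, 2048))     # inner dense layer: 2048 -> 100
W_outer = rng.normal(size=(3, 100))        # outer dense layer: 100 -> 3 classes
hidden = np.maximum(dense(x, W_inner), 0)  # ReLU nonlinearity between layers
output = dense(hidden, W_outer)
print(output.shape)                        # (3,)
```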
Summary
To summarize, we have seen the following steps from image to prediction:
image –> conv layers –> vector representation –> dense layers –> prediction
This gives just a high-level overview of the internals of convolutional neural networks. There are many other internals and layer types as well. For more in-depth knowledge, check the notes of the course CS231n: Convolutional Neural Networks for Visual Recognition.
There is one important layer that hasn't been covered here: the "pooling layer". Its purpose is to reduce the size of the feature maps, which in turn means fewer parameters. For example, you can reduce a 200×200 feature map to 100×100. This layer in particular is described very well in the CS231n course notes.
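A 2×2 max pooling step (one common pooling variant) can be sketched as follows: each 2×2 block of the feature map is replaced by its maximum, halving both spatial dimensions.

```python
import numpy as np

def max_pool_2x2(fmap):
    """Keep the max of each 2x2 block, halving both spatial dimensions."""
    h, w = fmap.shape
    trimmed = fmap[:h - h % 2, :w - w % 2]  # drop odd leftover row/column
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(16.0).reshape(4, 4)  # toy 4x4 feature map: 0..15
pooled = max_pool_2x2(fmap)
print(pooled)
# [[ 5.  7.]
#  [13. 15.]]
```

A 4×4 map becomes 2×2, so the next layer has four times fewer values to process.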