Artificial intelligence is constantly expanding the capabilities of machines, bringing their functionality closer to that of humans. Thanks to the great interest in this field in recent years, many areas of science have made a significant leap forward. One of the goals of artificial intelligence is to allow machines to perceive the world around them the way humans do. This is possible through the use of neural networks – mathematical structures inspired by the natural neurons found in nerves and the human brain.
Without a shadow of a doubt, you have encountered neural networks in everyday life, for example in:
  • face detection and recognition in photos on your smartphone,
  • recognition of voice commands by a virtual assistant,
  • autonomous cars.
The potential of neural networks is huge, and the examples above are only a fraction of their current applications. However, they all rely on a particular class of neural networks called convolutional neural networks (CNN or ConvNet).

Neural networks in image processing

To bring the topic of convolutional neural networks closer, we will focus on their most common application, i.e. image processing. A CNN is an algorithm that takes an input image and classifies it into predefined categories (e.g. dog breeds). This is possible thanks to assigning different weights to various shapes, structures and objects. Training a convolutional network allows it to learn which features of the image can help in its classification. Its advantage over standard networks is greater efficiency in detecting complicated relations in images, which comes from the use of specific filters that analyse the relations between adjacent pixels.

Each image is a matrix of values whose number is proportional to the matrix's width and height in pixels. RGB images have three basic colors, so each pixel is represented by three values. A ConvNet's task is to reduce the image size without losing important features, i.e. those which carry key information for further classification. A CNN consists of two key layers. The first is the convolutional layer.
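The representation described above can be seen directly in code. The sketch below (using NumPy, with an invented 4×4 image) shows that an RGB image is simply a height × width × 3 array of values:

```python
import numpy as np

# A hypothetical 4x4 RGB image: height x width x 3 color channels,
# each value in the range 0-255 (values here are random, for illustration).
image = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)

print(image.shape)  # (4, 4, 3)
print(image[0, 0])  # the three RGB values of the top-left pixel
```

A grayscale image would need only one value per pixel, giving a plain height × width matrix.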

The animation above shows an RGB image and three 3×3 filters moving over it with a defined stride. The stride is the number of pixels by which the filter moves at each step. To preserve more information, the "zero padding" method (filling the border with zeros, shown as white squares) can be used, but it usually results in lower performance. The values of the output matrix are calculated as follows:
  • multiplying the values in the selected part of the image by the filter (element by element),
  • summing the calculated values within each channel,
  • summing the results across the channels and adding the bias (in this case it equals 1).
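The steps above can be sketched in NumPy. This is a minimal illustration, not an optimized implementation; the 5×5 input, the filter values and the stride of 1 are invented, while the bias of 1 follows the example in the article:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 5x5 RGB input and one 3x3 filter per channel (values invented).
image = rng.integers(0, 256, size=(5, 5, 3)).astype(float)
filters = rng.standard_normal((3, 3, 3))   # height x width x channel
bias = 1.0                                 # bias from the article's example

stride = 1
out_size = (5 - 3) // stride + 1           # 3x3 output for a 5x5 input, no padding
output = np.zeros((out_size, out_size))

for i in range(out_size):
    for j in range(out_size):
        patch = image[i:i + 3, j:j + 3, :]                # region covered by the filter
        per_channel = (patch * filters).sum(axis=(0, 1))  # multiply element-wise, sum per channel
        output[i, j] = per_channel.sum() + bias           # sum the channels and add the bias

print(output.shape)  # (3, 3)
```

With zero padding, the input would first be surrounded by a border of zeros so that the output keeps the original 5×5 size.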
It is worth mentioning that the filter's values may differ from channel to channel. The task of the first convolutional layers is to distinguish simple features such as edges, colors and gradients; the more layers, the more complex the features that can be determined.

After the convolutional layer there is an activation layer (most often the ReLU function), which introduces non-linearity to the network.

The next layer is the pooling layer. Its task is to reduce the dimensions of the convolutional features obtained in the previous layer while maintaining the key information. This layer is also responsible for noise reduction. The most popular method is "max pooling".
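The ReLU activation mentioned above is simple enough to show in one line: it replaces every negative value in the feature map with zero and leaves positive values unchanged. The feature map below is invented for illustration:

```python
import numpy as np

# ReLU zeroes out negative activations, introducing non-linearity.
def relu(x):
    return np.maximum(0, x)

feature_map = np.array([[-2.0, 3.0],
                        [0.5, -1.0]])
print(relu(feature_map))  # [[0.  3. ]
                          #  [0.5 0. ]]
```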

The pooling operation is similar to that used in the convolutional layer: a filter and a stride are defined, and each value of the output matrix is the maximum value covered by the filter.

The aforementioned layers together form one layer of the convolutional network. After a selected number of such layers, the obtained matrix is flattened and becomes the input to a standard neural network built from fully connected layers. This allows the algorithm to learn non-linear relations between the features determined by the convolutional layers.

The last layer is the softmax layer, which produces the probabilities that the image belongs to particular classes (for example, the probability that there is a cat in the picture).
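The final steps described above can also be sketched briefly. The example below assumes a 2×2 max-pooling filter with stride 2 and invented input values; softmax then turns a vector of class scores into probabilities that sum to 1:

```python
import numpy as np

# Max pooling with a 2x2 filter and stride 2: keep the largest value per window.
def max_pool(x, size=2, stride=2):
    h, w = x.shape
    out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

# Softmax converts the final scores into class probabilities that sum to 1.
def softmax(scores):
    e = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return e / e.sum()

pooled = max_pool(np.array([[1., 3., 2., 4.],
                            [5., 6., 1., 2.],
                            [7., 2., 9., 1.],
                            [3., 4., 5., 6.]]))
print(pooled)            # [[6. 4.]
                         #  [7. 9.]]
flat = pooled.flatten()  # the flattened input to the fully connected part

# Hypothetical scores for three classes, e.g. three dog breeds:
print(softmax(np.array([2.0, 1.0, 0.1])))
```

Flattening and the fully connected layers are handled by standard neural-network machinery; the sketch only shows the shapes of the operations the article describes.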