Abstract
Deep learning has revolutionized the field of machine learning. Convolutional Neural Networks (CNNs) have become very popular for solving problems related to image recognition, image reconstruction, and various other computer vision problems. Libraries such as TensorFlow* and Keras* make the programmer's job easier, but they do not directly support complex networks and uncommonly used layers. This guide will help you write complex neural networks such as Siamese networks in Keras. It also explains the procedure for writing your own custom layers in Keras.
Introduction
Person re-identification is the task of identifying whether the same person appears in a given pair of images. Some of the challenges in this problem arise from pictures being taken from different viewpoints and from variations in light intensity, which can make pictures of different people look similar, creating false positives. The Normalized X-Corr model [1] is used to solve the problem of person re-identification. This guide demonstrates a step-by-step implementation of a Normalized X-Corr model in Keras; the model is a modification of a Siamese network [2].
Figure 1. Architectural overview of a Normalized X-Corr model.
Overview of the Normalized X-Corr Model
Arulkumar Subramaniam and his colleagues [1] propose a deep neural network that treats person re-identification as a binary classification problem. Figure 1 gives an overview of the Normalized X-Corr (normxcorr) model. First, both images are passed through conv-pool-conv-pool layers to extract features. Since these layers only extract features, the weights of the conv layers are shared (i.e., both images are passed through the same layers). After extracting the features, a similarity between them must be established. This is done by the normalized correlation layer, a custom layer discussed later in this guide. This layer takes a small 5×5 patch from one feature map, convolves it around the other feature map, and calculates the normalized correlation, which (as implemented in the custom layer below) is given by:

normxcorr(X, Y) = Σ_{i=1..N} ((X_i − μ_X) / σ_X) · ((Y_i − μ_Y) / σ_Y)

where X and Y are the two flattened patches, μ and σ denote their means and standard deviations, and N is the number of elements in a patch (25 for a 5×5 patch).
We denote the feature maps belonging to the two images as X and Y. Considering the sizes in Figure 1, we take a patch from X centered at (x, y) at a given depth, and normxcorr is calculated with patches of Y centered at (a, b), where 1 <= a <= 12 and y − 2 <= b <= y + 2. Thus, for every X(x, y), 5×12 = 60 values are generated and stored along the depth of the output feature map. This is done at all 25 depths; therefore, the output dimensions are 12×37×1500 (i.e., 60×25 = 1500 along the depth).
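To make the computation concrete, here is a minimal NumPy sketch (not part of the model code; the patch values are made up) that computes normxcorr between two patches the same way the custom layer below does: standardize each patch, then take the dot product.

import numpy as np

def normxcorr(patch_x, patch_y):
    # Flatten each patch and standardize it (zero mean, unit standard deviation).
    x = patch_x.ravel()
    y = patch_y.ravel()
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    # The normalized correlation is the dot product of the standardized patches.
    return np.dot(x, y)

patch_x = np.random.rand(5, 5)
patch_y = np.random.rand(5, 5)
print(normxcorr(patch_x, patch_x))  # a patch with itself gives N = 25
print(normxcorr(patch_x, patch_y))  # unrelated patches give a value near 0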
In Figure 2, the size of the image is assumed to be 8×8 for the purpose of demonstration. If we consider the patch of size 5×5 centered at the block marked by the red square in image 1, we calculate the Normalized-X-Corr of this patch with the patches marked by the green squares in image 2 (i.e., across the entire width of the image) and with height within [3 − 2, 3 + 2], which is [1, 5]. Thus, the total number of values generated by a single patch in image 1 is the allowed width × the allowed height (i.e., 8×5 = 40). These values are stored along the depth of the output feature map. Thus, for one patch, we generate an output of 1×1×40. Considering the entire image, we get a feature map of size 8×8×40. If the input has more than one channel, the calculated feature maps are stacked one behind the other. The height and width of the output feature map remain the same, but the depth gets multiplied by the depth of the input images. Hence, an input image of 8×8×5 generates an output feature map of 8×8×(40×5) (i.e., 8×8×200). For the patch centered at the block marked in blue, we see that padding is needed to satisfy the criteria; in such cases, the image is padded with zeros.
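The output-size arithmetic from this example can be checked with a few lines (the numbers below are the ones assumed in Figure 2):

width, height, channels = 8, 8, 5
height_window = 5                            # centers within [y - 2, y + 2]
values_per_patch = width * height_window     # 8 * 5 = 40
output_depth = values_per_patch * channels   # 40 * 5 = 200
print((width, height, output_depth))         # (8, 8, 200)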
After the Normalized-X-Corr layer, two conv layers and a pooling layer are added to concisely incorporate greater context information. On top of these, two fully connected layers are added and a softmax activation function is applied.
More information about the architecture is available in the paper “Deep Neural Networks with Inexact Matching for Person Re-Identification.”
Figure 2. Demonstrating the normalized correlation layer's operation.
Diving into the Code
The code below was tested on Intel® AI DevCloud. The following libraries and frameworks were also used: Python* 3 (February 2018 version), Keras* (version 2.1.2), Intel® Optimization for TensorFlow* (version 1.3.0), NumPy (version 1.14.0).
import keras
import sys
from keras import backend as K
from keras.layers import Conv2D, MaxPooling2D, Dense, Input, Flatten
from keras.models import Model, Sequential
from keras.engine import InputSpec, Layer
from keras import regularizers
from keras.optimizers import SGD, Adam
from keras.utils.conv_utils import conv_output_length
from keras import activations
import numpy as np
These are the imports from Keras and other libraries that we need to implement this model.
a = Input((160, 60, 3))
b = Input((160, 60, 3))
These create placeholders for the input images.
model = Sequential()
model.add(Conv2D(kernel_size=(5, 5), filters=20, input_shape=(160, 60, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Conv2D(kernel_size=(5, 5), filters=25, activation='relu'))
model.add(MaxPooling2D((2, 2)))
These are the layers that need to be shared between the images. Therefore, we create a model of these layers.
feat_map1 = model(b)
feat_map2 = model(a)
model(a) passes the input it gets through the model and returns the output layer. This is done for both inputs so that they share the same model, outputting the two feature maps feat_map1 and feat_map2.
normalized_layer = Normalized_Correlation_Layer(stride=(1, 1), patch_size=(5, 5))([feat_map1, feat_map2])
This is the custom layer that establishes a similarity between the feature maps extracted from the images. We pass the feature maps as a list input. Its implementation is mentioned later in this guide.
final_layer = Conv2D(kernel_size=(1, 1), filters=25, activation='relu')(normalized_layer)
final_layer = Conv2D(kernel_size=(3, 3), filters=25, activation=None)(final_layer)
final_layer = MaxPooling2D((2, 2))(final_layer)
final_layer = Flatten()(final_layer)  # flatten before the fully connected layers
final_layer = Dense(500)(final_layer)
final_layer = Dense(2, activation='softmax')(final_layer)
These are the layers added on top of the normalized correlation layer. The feature map is flattened before the two fully connected layers so that the model ends in a single two-way softmax.
x_corr_mod = Model(inputs=[a, b], outputs=final_layer)
Finally, a new model is created that takes the two images as a list of inputs and gives a binary output.
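The article stops at building the model; a minimal training sketch, assuming dummy data and the categorical cross-entropy loss that usually accompanies a two-way softmax, might look like this:

x_corr_mod.compile(optimizer=SGD(lr=0.01, momentum=0.9),
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])

# Dummy batch: eight pairs of 160x60 RGB images with one-hot same/different labels.
img_a = np.random.rand(8, 160, 60, 3)
img_b = np.random.rand(8, 160, 60, 3)
labels = keras.utils.to_categorical(np.random.randint(2, size=8), num_classes=2)

x_corr_mod.fit([img_a, img_b], labels, batch_size=8, epochs=1)

In practice the dummy arrays would be replaced by actual image pairs and their same/different-person labels.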
The visualizations of layers of this model are available in the paper “Supplementary Material for the Paper: Deep Neural Networks with Inexact Matching for Person Re-Identification.”
Normalized Correlation Layer
This is not a layer provided by Keras, so we have to write our own layer with the support provided by the Keras backend.
class Normalized_Correlation_Layer(Layer):
    # Create a class inherited from keras.engine.Layer.

    def __init__(self, patch_size=(5, 5),
                 dim_ordering='tf',
                 border_mode='same',
                 stride=(1, 1),
                 activation=None,
                 **kwargs):

        if border_mode != 'same':
            raise ValueError('Invalid border mode for Correlation Layer '
                             '(only "same" is supported as of now):', border_mode)
        self.kernel_size = patch_size
        self.subsample = stride
        self.dim_ordering = dim_ordering
        self.border_mode = border_mode
        self.activation = activations.get(activation)
        super(Normalized_Correlation_Layer, self).__init__(**kwargs)
This constructor simply stores the values passed as parameters in class variables and initializes its parent class by calling its constructor.
def compute_output_shape(self, input_shape):
    return (input_shape[0][0], input_shape[0][1], input_shape[0][2],
            self.kernel_size[0] * input_shape[0][2] * input_shape[0][-1])
This returns, as a tuple, the shape of the feature map output by this layer. The first element is the number of images, the second is the number of rows, the third is the number of columns, and the last is the depth, which is the allowed movement in height × the allowed movement in width × the input depth. In our case this is 5×12×25 = 1500.
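As a quick sanity check, assuming the feature maps from Figure 1, which have shape (None, 37, 12, 25) in Keras's default batch × rows × cols × channels ordering:

layer = Normalized_Correlation_Layer(patch_size=(5, 5))
print(layer.compute_output_shape([(None, 37, 12, 25), (None, 37, 12, 25)]))
# (None, 37, 12, 1500), since 5 * 12 * 25 = 1500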
def get_config(self):
    config = {'patch_size': self.kernel_size,
              'activation': self.activation.__name__,
              'border_mode': self.border_mode,
              'stride': self.subsample,
              'dim_ordering': self.dim_ordering}
    base_config = super(Normalized_Correlation_Layer, self).get_config()
    return dict(list(base_config.items()) + list(config.items()))
This takes the configuration passed as arguments to the constructor, appends it to the configuration of the parent class, and returns it. Keras calls this function to get the layer's configuration.
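Because get_config is implemented, a trained model containing this layer can be saved and reloaded; the custom class just has to be supplied through custom_objects. A hypothetical sketch (the file name is made up):

x_corr_mod.save('normxcorr_model.h5')
loaded = keras.models.load_model(
    'normxcorr_model.h5',
    custom_objects={'Normalized_Correlation_Layer': Normalized_Correlation_Layer})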
def call(self, x, mask=None):
This function is called on every forward pass. It receives the two feature maps as a list, as defined in the model.
input_1, input_2 = x
stride_row, stride_col = self.subsample
inp_shape = input_1._keras_shape
Separate the inputs from the list and copy some values into local variables to make them easier to refer to later on.
output_shape = self.compute_output_shape([inp_shape, inp_shape])
This uses the function defined earlier to compute the desired output shape and stores it in a variable.
padding_row = (int(self.kernel_size[0] / 2), int(self.kernel_size[0] / 2))
padding_col = (int(self.kernel_size[1] / 2), int(self.kernel_size[1] / 2))
input_1 = K.spatial_2d_padding(input_1, padding=(padding_row, padding_col))
input_2 = K.spatial_2d_padding(input_2, padding=((padding_row[0] * 2, padding_row[1] * 2), padding_col))
This block of code adds padding to the feature maps. Padding is required because we take patches centered at (0,0) and at the other edges, too; therefore, we need to add a padding of 2 in our case. But for the feature map of the second input, we need to take patches offset by up to 2 rows from the center of the corresponding patch of the first feature map. Thus, for the patch at (0,0) of the first map we need to consider the patches centered at (0,0), (0,1), (0,2), (0,-1), and (0,-2) of the second feature map, all with the same x-coordinate. Thus, the second feature map needs a padding of 4 on each side along the rows.
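A quick way to see the resulting shapes, assuming the (None, 37, 12, 25) feature maps from Figure 1:

x = K.placeholder((None, 37, 12, 25))
print(K.int_shape(K.spatial_2d_padding(x, padding=((2, 2), (2, 2)))))  # (None, 41, 16, 25)
print(K.int_shape(K.spatial_2d_padding(x, padding=((4, 4), (2, 2)))))  # (None, 45, 16, 25)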
output_row = output_shape[1]
output_col = output_shape[2]

Store the number of output rows and columns in local variables.
output = []
for k in range(inp_shape[-1]):
Loop over all the depths (channels).
xc_1 = []
xc_2 = []
for i in range(padding_row[0]):
    for j in range(output_col):
        xc_2.append(K.reshape(input_2[:, i:i + self.kernel_size[0], j:j + self.kernel_size[1], k],
                              (-1, 1, self.kernel_size[0] * self.kernel_size[1])))
This is done for the patches of feature map 2 that fall in the extra padding we added (i.e., the patches that are not centered on the feature map and lie above its first rows).
for i in range(output_row):
    slice_row = slice(i, i + self.kernel_size[0])
    slice_row2 = slice(i + padding_row[0], i + self.kernel_size[0] + padding_row[0])
    for j in range(output_col):
        slice_col = slice(j, j + self.kernel_size[1])
        xc_2.append(K.reshape(input_2[:, slice_row2, slice_col, k],
                              (-1, 1, self.kernel_size[0] * self.kernel_size[1])))
        xc_1.append(K.reshape(input_1[:, slice_row, slice_col, k],
                              (-1, 1, self.kernel_size[0] * self.kernel_size[1])))
Extract patches of size 5×5 from both feature maps and store them in xc_1 and xc_2, respectively. The patches are flattened and reshaped to the form (-1, 1, 25).
for i in range(output_row, output_row + padding_row[1]):
    for j in range(output_col):
        xc_2.append(K.reshape(input_2[:, i:i + self.kernel_size[0], j:j + self.kernel_size[1], k],
                              (-1, 1, self.kernel_size[0] * self.kernel_size[1])))
This extracts the patches of feature map 2 whose centers lie below the bottom edge of the feature map.
xc_1_aggregate = K.concatenate(xc_1, axis=1)
These patches are joined along axis=1 so that, for any given depth, the result has the shape (-1, output_row × output_col, 25).
xc_1_mean = K.mean(xc_1_aggregate, axis=-1, keepdims=True)
xc_1_std = K.std(xc_1_aggregate, axis=-1, keepdims=True)
xc_1_aggregate = (xc_1_aggregate - xc_1_mean) / xc_1_std
This implements the normalization of the patches of the first feature map: subtract the mean and divide by the standard deviation.
xc_2_aggregate = K.concatenate(xc_2, axis=1)
xc_2_mean = K.mean(xc_2_aggregate, axis=-1, keepdims=True)
xc_2_std = K.std(xc_2_aggregate, axis=-1, keepdims=True)
xc_2_aggregate = (xc_2_aggregate - xc_2_mean) / xc_2_std
Similarly, for the feature maps of image 2.
xc_1_aggregate = K.permute_dimensions(xc_1_aggregate, (0, 2, 1))
block = []
len_xc_1 = len(xc_1)
for i in range(len_xc_1):
    # Select the rows of feature map 2's patches that patch i of
    # feature map 1 must be correlated with (entire width, +/- 2 rows).
    sl1 = slice(int(i / inp_shape[2]) * inp_shape[2],
                int(i / inp_shape[2]) * inp_shape[2] + inp_shape[2] * self.kernel_size[0])
    block.append(K.reshape(K.batch_dot(xc_2_aggregate[:, sl1, :],
                                       xc_1_aggregate[:, :, i]),
                           (-1, 1, 1, inp_shape[2] * self.kernel_size[0])))
Calculate the dot product (i.e., the normalized correlation) and store it in "block".
block = K.concatenate(block, axis=1)
block = K.reshape(block, (-1, output_row, output_col, inp_shape[2] * self.kernel_size[0]))
output.append(block)
Join the calculated normalized correlation values, reshape them (they were calculated sequentially, which makes the reshape straightforward), and append the result to "output".
output = K.concatenate(output, axis=-1)
Join the output feature map calculated at each depth, along the depth of “output.”
output = self.activation(output)
return output
Apply the activation if one was passed as an argument, and return the generated output.
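Before wiring the layer into the full model, it can be sanity-checked on its own. A minimal sketch, assuming random feature maps of the size produced by the shared conv-pool stack:

feat_x = Input((37, 12, 25))
feat_y = Input((37, 12, 25))
xcorr = Normalized_Correlation_Layer(stride=(1, 1), patch_size=(5, 5))([feat_x, feat_y])
test_model = Model(inputs=[feat_x, feat_y], outputs=xcorr)
out = test_model.predict([np.random.rand(1, 37, 12, 25), np.random.rand(1, 37, 12, 25)])
print(out.shape)  # expected: (1, 37, 12, 1500)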
Applications
Such a network has various applications, such as matching a person's identity across crime-scene images. It can also be generalized to find the similarity between two images (e.g., to determine whether the same fruit appears in both images).
Further Scope
The code runs sequentially and is devoid of parallelism. The matrix multiplications of the patches could be parallelized across multiple cores using libraries such as multiprocessing, which would help speed up training. The accuracy of the model could be improved by finding a more suitable similarity measure between the image patches. A rough sketch of the parallelization idea follows.
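As an illustration only (outside the Keras graph, with NumPy stand-ins for the standardized patch matrices; the sizes are hypothetical), the per-depth correlations could be distributed over a process pool:

from multiprocessing import Pool
import numpy as np

def correlate_depth(args):
    # Dot products between the standardized patch matrices of one depth slice.
    xc_1, xc_2 = args
    return xc_1 @ xc_2.T

if __name__ == '__main__':
    # 25 depth slices with hypothetical patch counts (444 and 492 patches of 25 values).
    slices = [(np.random.rand(444, 25), np.random.rand(492, 25)) for _ in range(25)]
    with Pool() as pool:
        per_depth_outputs = pool.map(correlate_depth, slices)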
Acknowledgement
I would like to thank the Intel® Student Ambassador Program for AI, which provided me with the necessary training resources on the Intel® AI DevCloud and the technical support that helped me to use DevCloud.
References
1. A. Subramaniam, M. Chatterjee, and A. Mittal. "Deep Neural Networks with Inexact Matching for Person Re-Identification." In NIPS, 2016.
2. D. Yi, Z. Lei, S. Liao, and S. Z. Li. "Deep Metric Learning for Person Re-Identification." In ICPR, 2014.
- Code on GitHub*