Before the advent of deep learning, features were hand-crafted using sophisticated statistical methods. These hand-crafted features were commonly used with traditional machine learning approaches such as Support Vector Machines (SVMs). Hand-crafted systems relied on two main components: image descriptors and distance metrics. Image descriptors use features such as color and texture for foreground/background distinction \cite{item2} and local binary patterns \cite{item8}. Regarding distance metrics, many approaches have been proposed, including locally adaptive decision functions \cite{item14}, KISSME \cite{item9}, and cross-view quadratic discriminant analysis \cite{item15}.
Beyond descriptor-based strategies that focus on generating hand-crafted features, deep learning networks are now widely used to solve a wide variety of computer vision tasks. Consequently, researchers have adopted this technique to address the problem of person re-identification.
Following the success of deep learning, convolutional neural networks (CNNs) have gained popularity in image classification \cite{item10} and person re-ID \cite{item19}. Hand-crafted methods optimize feature extraction and metric learning independently, which makes them sub-optimal. CNN-based systems instead employ an end-to-end structure that jointly learns features and metrics. Each CNN layer produces a collection of feature maps in which every pixel of the input image is mapped to a specific feature representation.
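To make the feature-map view concrete, the following minimal PyTorch sketch extracts spatial feature maps from a truncated ResNet-50. The backbone choice and input resolution are assumptions for illustration only, not necessarily those of the proposed network.
\begin{verbatim}
import torch
import torchvision

# Truncated ResNet-50 used as a feature extractor (illustrative choice only;
# not necessarily the backbone of the proposed network).
backbone = torchvision.models.resnet50(weights=None)  # pretrained=False on older torchvision
extractor = torch.nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc

images = torch.randn(8, 3, 256, 128)   # a batch of 8 person crops (H=256, W=128, assumed)
feature_maps = extractor(images)       # shape: (8, 2048, 8, 4)
print(feature_maps.shape)              # each spatial location holds a 2048-d descriptor
\end{verbatim}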
CNN-based models differ in their strategies and objective functions, but most can be grouped by two major criteria, as sketched below. The first is obtaining better feature representations, by means of global full-body representations \cite{item4, item12}, part-based representations \cite{item1, item13}, or both \cite{item11, item16}. The second is robust feature learning via better learning objectives, such as posing the problem as classification \cite{item1} or ranking \cite{item4}.
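As an illustration of the two representation strategies, the sketch below pools the same feature maps once globally and once over horizontal stripes; the stripe count is an assumption made here for illustration.
\begin{verbatim}
import torch
import torch.nn.functional as F

# `feature_maps` is assumed to come from a CNN backbone, shape (B, C, H, W).
feature_maps = torch.randn(8, 2048, 8, 4)

# Global full-body representation: average over all spatial locations.
global_feat = F.adaptive_avg_pool2d(feature_maps, 1).flatten(1)   # (B, 2048)

# Part-based representation: pool over horizontal stripes
# (4 parts assumed here, in the spirit of stripe-based models).
num_parts = 4
part_feats = F.adaptive_avg_pool2d(feature_maps, (num_parts, 1)).squeeze(-1)  # (B, 2048, 4)
print(global_feat.shape, part_feats.shape)
\end{verbatim}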
Considerable research \cite{item2, item25} has demonstrated substantial progress along these two directions. However, since deep learning approaches require extensive training data to learn robust, discriminative features, the major drawback of this strategy in re-ID is the scarcity of available training data: models often lack sufficient instances per identity to converge.
Apart from extracting better feature representations, various approaches focus on learning a better ranking loss or a better classification loss. According to \cite{item3}, the person re-ID problem lies at an intermediate stage between the retrieval task and the image classification task. In the first case, person re-ID is treated as a ranking task in which a ranking loss is adopted to rank images by similarity score. To compute the similarity score of two person images, every part of the two people must be compared; the score cannot be obtained from a few local parts alone.
In other words, during ranking, the global features of whole images deserve more attention than local parts. In \cite{item15}, a new term is appended to the original triplet loss to pull images of the same individual closer together. Hermans et al. \cite{item6} proposed a variant of the conventional triplet loss that performs hard mining within each batch; this is one of the loss functions incorporated in the proposed network.
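A minimal PyTorch sketch of this batch-hard triplet loss follows. The margin value and the use of Euclidean distance are assumptions for illustration, and the batch is assumed to be sampled so that each identity appears more than once (PK sampling).
\begin{verbatim}
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    # Batch-hard mining in the spirit of Hermans et al.: for each anchor,
    # take the farthest positive and the closest negative within the batch.
    # The margin of 0.3 is an assumed value, not taken from the paper.
    dist = torch.cdist(embeddings, embeddings, p=2)       # (B, B) pairwise distances

    same_id = labels.unsqueeze(0) == labels.unsqueeze(1)  # (B, B) bool
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos_mask = same_id & ~eye   # same identity, excluding the anchor itself
    neg_mask = ~same_id         # different identity

    # Hardest positive: the farthest image of the same identity.
    hardest_pos = (dist * pos_mask).max(dim=1).values
    # Hardest negative: the closest image of a different identity.
    hardest_neg = dist.masked_fill(~neg_mask, float('inf')).min(dim=1).values

    return torch.relu(hardest_pos - hardest_neg + margin).mean()

# Usage: 16 embeddings over 4 identities, so every anchor has positives.
embeddings = torch.randn(16, 128)
labels = torch.randint(0, 4, (16,))
loss = batch_hard_triplet_loss(embeddings, labels)
\end{verbatim}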
As far as the classification task is concerned, the person re-ID problem is usually solved with a softmax loss. There are two ways to cast person re-ID as classification. The first is the verification network \cite{item3}, which takes a pair of images as input and decides, via binary classification, whether they belong to the same identity. The second is the multi-class recognition network, in which each individual is treated as a distinct category.
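The multi-class recognition formulation reduces to a linear classifier over identities trained with a softmax (cross-entropy) loss, as in the sketch below; the embedding size and the identity count (751, as in Market-1501) are assumptions for illustration.
\begin{verbatim}
import torch
import torch.nn as nn

# Each identity is one class, trained with a softmax (cross-entropy) loss.
num_identities = 751                 # assumed, e.g. Market-1501 training IDs
classifier = nn.Linear(2048, num_identities)
criterion = nn.CrossEntropyLoss()    # applies log-softmax internally

features = torch.randn(8, 2048)      # global features from the backbone
id_labels = torch.randint(0, num_identities, (8,))
id_loss = criterion(classifier(features), id_labels)
\end{verbatim}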
Some approaches \cite{item4, item26} combine a ranking loss with a classification loss, treating the two losses as mutually complementary. Our proposed model optimizes the network in a similar fashion by leveraging this complementary loss design; the difference lies in how we assign weights to each loss during training.
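A hedged sketch of this complementary design follows: the total objective is a weighted sum of the two losses. The fixed weights and stand-in loss values shown are placeholders, since our model differs precisely in how these weights are assigned during training.
\begin{verbatim}
import torch

# Stand-ins for the softmax (ID) loss and the batch-hard triplet loss
# computed as in the sketches above.
id_loss = torch.tensor(2.3, requires_grad=True)
triplet_loss = torch.tensor(0.8, requires_grad=True)

w_id, w_triplet = 1.0, 1.0   # assumed fixed weights; the paper's scheme
                             # instead varies the weighting during training
total_loss = w_id * id_loss + w_triplet * triplet_loss
total_loss.backward()
\end{verbatim}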