Handbook of Intelligent Computing and Optimization for Sustainable Development. Group of authors


with a sigmoid activation function is utilized to produce the final fashion landmark heatmaps. Although the approach performs well in detecting garment landmarks, this performance does not carry over to video surveillance scenarios, which often include occluded garments that must be detected consistently on a per-frame basis.

      Ge et al. [13] proposed DeepFashion2, a benchmark for detection, pose estimation, segmentation, and re-identification of clothing images. In addition to creating an expansive dataset comprising 491,000 images of clothes, the authors proposed a model called Match R-CNN, which is based on the Mask R-CNN object detection model proposed by He et al. [9]. Match R-CNN is an end-to-end framework that jointly performs clothes detection, landmark estimation, instance segmentation, and customer-to-shop retrieval. Different streams are used, and a Siamese model is stacked on top of these streams to aggregate the learned features. Match R-CNN comprises three components, namely, a Feature Network (FN), a Perception Network (PN), and a Matching Network (MN). FN builds a pyramid of feature maps, and RoIAlign is used to extract features from different levels of the pyramid. PN contains three streams of networks: landmark estimation, clothes detection, and mask prediction. The RoI features are fed into these streams of the PN. MN contains a feature extractor and a similarity learning network for clothes retrieval, which is used for recognition. Although the Match R-CNN model is state-of-the-art for identifying garments, it is trained only on the fashion images available in the DeepFashion2 dataset, which covers a wide array of clothing items but does not cater to garments such as Indian sarees.

      Hara et al. [14] proposed a CNN-based algorithm for fashion item detection that incorporates background information from human pose skeletons. The authors consider the dynamic rigidity of the garments while using human pose estimation models to obtain the coordinates of garments that lie close to the detected human pose coordinates. However, the use of R-CNN as the baseline object detection model significantly increases training cost in both space and time and results in slow object detection compared with other state-of-the-art object detection frameworks such as Mask R-CNN.

      Kita et al. [15] proposed a deformable-model-driven method to identify hanging garments. The authors recognize the state of a garment by considering its 3D location and posture. This 3D data is obtained from the deformable model by comparing the observed state of the garment with predicted candidate shapes. Sutoyo et al. [16] proposed a methodology for hand detection based on an image dataset comprising positive (with hands) and negative (without hands) images. A Haar cascade classifier was trained on these images to build a hand detection model. The key disadvantage of this model is that it requires a close-up image of a hand to classify it accurately, which surveillance footage cannot provide.

      Taken together, the previous works discussed above have the following limitations: some could not detect complex garments accurately, some had difficulty detecting occluded garments in video surveillance, some performed less adequately on uncommon garments such as Indian sarees, some required close-up images of hands for proper identification, and some relied on an outdated object detection framework such as R-CNN. Our proposed approach attempts to address a majority of these problems. Color masks are applied to detect regions of garments, and these regions are linked to obtain the entire garment. Missing regions of partially occluded garments are also identified before linking. The OpenPose framework is used for pose estimation because it does not require close-up images of wrists, and Mask R-CNN is used because it outperforms R-CNN.

      In this section, we describe the key stages of our proposed framework, which aims to identify the garments that interest customers as they browse the collection available at a garment store. The framework comprises three integral stages, namely:

      1. Stage 1: Obtaining the foreground information

      2. Stage 2: Detection of active garments

      3. Stage 3: Identification of garments of interest

      3.3.1 Obtaining the Foreground Information

      The proposed approach processes an input video from the dataset on a per-frame basis. Before a frame is processed for garment identification, it is converted from the RGB color space to the HSV color space so that pixel intensity can be separated from color information. To obtain the foreground information, we use a background subtraction model in tandem with the Mask R-CNN object detection algorithm. The background subtraction model identifies the pixels associated with non-static objects in a given frame, for instance, when a customer picks up a garment they find interesting. Because the garments worn by the customers are also included in this foreground, the Mask R-CNN model is used to identify the customers and obtain the pixels associated with them alone. These pixels are then excluded from the foreground produced by the background subtraction algorithm, which ensures that only pixels associated with the garments at the store are considered by the subsequent stages of the proposed framework.
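
      The steps of this stage can be illustrated with a minimal sketch in pure Python. It is not the authors' implementation: simple thresholded differencing against a static background frame stands in for the background subtraction model, and the customer mask is assumed to be supplied by a segmentation model such as Mask R-CNN rather than computed here. The function names, the nested-list frame representation, and the threshold value are all hypothetical choices made for illustration.

```python
import colorsys


def rgb_frame_to_hsv(frame):
    """Convert rows of 8-bit (R, G, B) tuples to (H, S, V) tuples in [0, 1]."""
    return [
        [colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0) for (r, g, b) in row]
        for row in frame
    ]


def hsv_distance(p, q):
    """Largest per-channel difference between two HSV pixels.

    Hue is circular, so its difference wraps around at 1.0.
    """
    dh = abs(p[0] - q[0])
    dh = min(dh, 1.0 - dh)
    return max(dh, abs(p[1] - q[1]), abs(p[2] - q[2]))


def subtract_background(hsv_frame, hsv_background, threshold=0.2):
    """Assumed stand-in for the background subtraction model.

    A pixel counts as foreground when it differs from the static
    background by more than `threshold` (a hypothetical value).
    """
    return [
        [hsv_distance(p, q) > threshold for p, q in zip(row, bg_row)]
        for row, bg_row in zip(hsv_frame, hsv_background)
    ]


def exclude_customers(foreground, customer_mask):
    """Drop foreground pixels that the (assumed) customer mask marks as a person."""
    return [
        [fg and not person for fg, person in zip(fg_row, person_row)]
        for fg_row, person_row in zip(foreground, customer_mask)
    ]


# Toy 2x3 scene: a gray background, one red store-garment pixel, and one
# blue pixel that belongs to a customer's own clothing.
GRAY, RED, BLUE = (128, 128, 128), (200, 30, 30), (30, 30, 200)
background = [[GRAY, GRAY, GRAY], [GRAY, GRAY, GRAY]]
frame = [[GRAY, RED, GRAY], [GRAY, GRAY, BLUE]]
customer_mask = [[False, False, False], [False, False, True]]

foreground = subtract_background(rgb_frame_to_hsv(frame), rgb_frame_to_hsv(background))
garment_mask = exclude_customers(foreground, customer_mask)
print(garment_mask)
```

      In the toy scene, both the red and the blue pixel are flagged as foreground, and only the red one survives once the customer mask is applied. A real pipeline would operate on image arrays and would likely use a learned or statistical background model, since hue becomes unstable for pixels with low saturation.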

      An input frame I(k) is quantized in color space and compared against the static background image model, Ĥ(k), to generate a posterior probability image. The resulting image is filtered using morphological operations. The filtered image is then segmented into a set of bounding boxes using connected components. The Kalman-filter bank maintains a set of tracked foreground objects and a set of predicted
