Handbook of Intelligent Computing and Optimization for Sustainable Development - Group of authors



Mask R-CNN is used to segment each customer in a given video frame and to construct a pixel-wise mask for each of them. The output of this step is a dictionary of masks and bounding-box coordinates that enclose the detected customers. This person-detection data is also used later in Stage 3 of the framework.

To obtain the foreground information, we take the foreground mask produced by the background subtraction model and remove from it the regions it shares with the person masks produced by Mask R-CNN. This step ensures that the clothing worn by the customer is excluded from the foreground.
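The mask subtraction described above can be sketched in a few lines. This is a minimal illustration, assuming binary masks stored as nested lists of 0/1 values; the function name `remove_person_regions` and the toy masks are hypothetical and not taken from the chapter.

```python
def remove_person_regions(foreground_mask, person_masks):
    """Clear every foreground pixel that is also covered by a person mask.

    foreground_mask: 2D list of 0/1 values from the background subtraction model.
    person_masks: list of 2D lists of 0/1 values, one per detected person.
    Returns a new mask; the inputs are left unchanged.
    """
    result = [row[:] for row in foreground_mask]
    for mask in person_masks:
        for y, row in enumerate(mask):
            for x, is_person in enumerate(row):
                if is_person:
                    result[y][x] = 0
    return result


# Toy example: the person occupies the middle two columns of the top two rows.
foreground = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
person = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
garment_foreground = remove_person_regions(foreground, [person])
```

Only the pixels that belong to the foreground but not to any person survive, which is the region where a handled garment can appear.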

3.3.2 Detection of Active Garments

We define an active garment as a garment present in the foreground frame obtained from the preceding stage. Individual color masks for the dominant colors, such as red, blue, green, and yellow, are applied to this foreground frame. Because our data pre-processing converts the video frames to the HSV color model, each color mask used in this step covers the entire range of HSV values for a given color (i.e., all possible shades of that color) rather than a single predetermined value. Applying each color mask to the foreground frame yields one image per color. The resulting images are converted to grayscale to reduce space and computational complexity. Morphological image processing techniques, most notably the closing operation, are then applied to fill the small holes that noise leaves in the images.
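The following sketch illustrates the two ideas in this paragraph: a color mask defined over a range of HSV values, and a morphological closing (dilation followed by erosion) that fills small holes. It uses only the Python standard library so that it is self-contained; an image library such as OpenCV (`cv2.inRange`, `cv2.morphologyEx`) would normally do this work. The hue intervals and the saturation and value thresholds are assumptions chosen for illustration, not the values used by the authors.

```python
import colorsys

# Assumed hue intervals in degrees. Red wraps around 0, so it needs two intervals.
HUE_RANGES = {
    "red": [(0, 15), (345, 360)],
    "yellow": [(45, 75)],
    "green": [(90, 150)],
    "blue": [(200, 260)],
}
MIN_SATURATION = 0.3  # assumed: rejects grey, washed-out pixels
MIN_VALUE = 0.3       # assumed: rejects near-black pixels


def color_mask(rgb_image, color):
    """Return a 0/1 mask of the pixels whose HSV value falls in the color's range."""
    mask = []
    for row in rgb_image:
        mask_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            hue = h * 360
            in_hue = any(low <= hue <= high for low, high in HUE_RANGES[color])
            keep = in_hue and s >= MIN_SATURATION and v >= MIN_VALUE
            mask_row.append(1 if keep else 0)
        mask.append(mask_row)
    return mask


def _morph(mask, combine, border):
    """Apply `combine` (max or min) over each pixel's 3x3 neighbourhood."""
    height, width = len(mask), len(mask[0])
    out = []
    for y in range(height):
        out_row = []
        for x in range(width):
            window = [
                mask[y + dy][x + dx]
                if 0 <= y + dy < height and 0 <= x + dx < width
                else border
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out_row.append(combine(window))
        out.append(out_row)
    return out


def closing(mask):
    """Morphological closing with a 3x3 square: dilation, then erosion."""
    dilated = _morph(mask, max, border=0)
    return _morph(dilated, min, border=1)


# A bright red, a pure blue, a grey, and a dark red pixel.
image = [[(255, 0, 0), (0, 0, 255), (128, 128, 128), (100, 0, 0)]]
red_mask = color_mask(image, "red")

# A solid 5x5 region with a one-pixel hole in the middle.
holed = [[1] * 5 for _ in range(5)]
holed[2][2] = 0
closed = closing(holed)
```

Both the bright and the dark red pixel land in the red mask because the test is on a hue interval rather than a single value, and the closing fills the one-pixel hole while leaving the rest of the region unchanged.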

Edge detection is used to detect the contours and edges present in each of the preceding frames. A given contour can represent either an entire active garment or only a region of an active garment present in the foreground. Hence, once the garment regions are obtained, an essential step identifies missing garment regions. These missing regions arise when a customer occludes the active garment under consideration and thus hides a portion of it. They are identified as garment regions before the linking process. Linking identifies adjacent contours that belong to the same active garment and associates them based on spatial distance. Thus, this stage yields all active garments present in a given video frame, which are used by the subsequent stages.
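The linking step can be illustrated with a small sketch. It assumes that each contour has already been reduced to an axis-aligned bounding box `(x1, y1, x2, y2)` from a single color mask, and that two regions are linked when the gap between their boxes is at most `max_gap` pixels. Both the gap measure and the threshold are assumptions made for illustration; the chapter states only that linking is based on spatial distance.

```python
import math


def box_gap(a, b):
    """Shortest distance between two boxes (0 if they touch or overlap)."""
    dx = max(0, max(a[0], b[0]) - min(a[2], b[2]))
    dy = max(0, max(a[1], b[1]) - min(a[3], b[3]))
    return math.hypot(dx, dy)


def link_regions(boxes, max_gap):
    """Merge boxes whose gap is at most max_gap into one enclosing box each."""
    parent = list(range(len(boxes)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if box_gap(boxes[i], boxes[j]) <= max_gap:
                parent[find(j)] = find(i)

    groups = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)

    return [
        (
            min(b[0] for b in group),
            min(b[1] for b in group),
            max(b[2] for b in group),
            max(b[3] for b in group),
        )
        for group in groups.values()
    ]


# Two fragments of one garment split by an occluding arm, plus a distant garment.
regions = [(0, 0, 10, 10), (12, 0, 20, 10), (100, 100, 110, 110)]
garments = link_regions(regions, max_gap=5)
```

The first two fragments are 2 pixels apart and are merged into one garment box, while the third region stays separate.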

3.3.3 Identification of Garments of Interest

Once the active garments are determined, we introduce a novel approach to determine the garments of interest. We define a garment of interest as an active garment that is close to the wrists of a customer; this proximity indicates that the customer is interested in the given active garment. We quantify it by establishing a Confidence Score (C) between a person’s wrists and an active garment.

3.3.3.1 Centroid Tracking

A person detected by Mask R-CNN [9] in Stage 1 is tracked using the centroid-based tracking algorithm proposed by Nascimento et al. [20]. This algorithm tracks the identified persons by measuring the Euclidean distance between the centroids of people detected over successive frames. It relies on the assumption that, although an object moves between successive frames of a recording, the distance between its centroids in two consecutive frames is smaller than the distance to the centroid of any other object identified in those frames.

This step enables us to track and associate every person detected in the recording with a unique tracking ID across numerous frames.
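A generic nearest-centroid tracker conveys the idea. The sketch below is not necessarily the exact procedure of [20]: it greedily matches each existing track to the closest new centroid, and the class name `CentroidTracker`, the `max_distance` cutoff, and the choice to drop unmatched tracks immediately are all simplifying assumptions.

```python
import math


class CentroidTracker:
    def __init__(self, max_distance=50.0):
        self.max_distance = max_distance  # assumed cutoff, in pixels
        self.next_id = 0
        self.tracks = {}  # tracking ID -> last known centroid

    def update(self, boxes):
        """Assign a tracking ID to each bounding box (x1, y1, x2, y2)."""
        centroids = [((x1 + x2) / 2, (y1 + y2) / 2) for x1, y1, x2, y2 in boxes]

        # Every (distance, track ID, detection index) pair, closest first.
        pairs = sorted(
            (math.dist(old, new), track_id, index)
            for track_id, old in self.tracks.items()
            for index, new in enumerate(centroids)
        )

        assigned = {}  # detection index -> tracking ID
        used_tracks = set()
        for distance, track_id, index in pairs:
            if distance > self.max_distance:
                break
            if track_id in used_tracks or index in assigned:
                continue
            assigned[index] = track_id
            used_tracks.add(track_id)

        # Detections with no nearby track start a new track.
        for index in range(len(centroids)):
            if index not in assigned:
                assigned[index] = self.next_id
                self.next_id += 1

        self.tracks = {track_id: centroids[i] for i, track_id in assigned.items()}
        return dict(self.tracks)


tracker = CentroidTracker()
first = tracker.update([(0, 0, 10, 10), (100, 0, 110, 10)])
# The same two people, slightly moved and reported in the opposite order.
second = tracker.update([(102, 0, 112, 10), (3, 0, 13, 10)])
```

Each person keeps the same ID in the second frame even though the detector lists them in a different order, because each new centroid is closer to its own previous position than to the other person's.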

3.3.3.2 Pose Estimation

We determine the coordinates of a person’s wrists using OpenPose [10], a state-of-the-art pose estimation framework. OpenPose is the first real-time 2D multi-person pose estimation framework that jointly detects human body, hand, face, and foot key points in a single image. The framework identifies a total of 135 key points for each detected person. This is accomplished using a multi-stage Convolutional Neural Network (CNN) that uses a nonparametric representation called Part Affinity Fields (PAFs) to learn how to associate the body parts with the corresponding humans in the image. The OpenPose multi-stage CNN architecture has three crucial steps:

1. The first set of stages predicts the PAFs from the input feature map.

2. The second set of stages uses the PAFs from the preceding stages to refine the prediction of the confidence maps.

3. The final set of PAFs and confidence maps is passed to a greedy algorithm, which approximates the global solution and outputs the key points of each person in the given input image.

The basic convolution block of the CNN used in OpenPose replaces each 7×7 convolutional kernel with three consecutive 3×3 kernels, which preserves the receptive field while reducing the number of computations. Additionally, the outputs of these consecutive kernels are concatenated, producing the basic convolution step in the multi-stage CNN. Before the input image (in RGB color space) reaches the first stage of the network, it is passed through the first 10 layers of the VGG-19 network to generate a set of feature maps. These feature maps are then passed through the multi-stage CNN pipeline to generate Part Confidence Maps and PAFs. A confidence map is a 2D representation of the belief that a given body part is located at a given pixel of the input image. A PAF is a set of 2D vector fields that encodes the orientation and the location of body parts in a given image.
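To make the notion of a confidence map concrete, the sketch below reads a single key point off one map by taking its highest-scoring pixel. It is a deliberately simplified, single-person illustration: the function name `peak_location` and the `threshold` value are assumptions, and a multi-person pipeline would instead extract several local maxima per map and use the PAFs to assign them to individuals.

```python
def peak_location(confidence_map, threshold=0.1):
    """Return (x, y) of the strongest response above threshold, or None."""
    best = None
    best_score = threshold
    for y, row in enumerate(confidence_map):
        for x, score in enumerate(row):
            if score > best_score:
                best, best_score = (x, y), score
    return best


# Toy 3x3 confidence map for one body part, peaking at the centre pixel.
wrist_map = [
    [0.0, 0.1, 0.0],
    [0.2, 0.9, 0.3],
    [0.0, 0.2, 0.0],
]
wrist_pixel = peak_location(wrist_map)
```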

We use the OpenPose framework’s “BODY_25” pose model to extract the spatial coordinates of both wrist landmarks, $P_L(x, y)$ and $P_R(x, y)$, of a person, denoted by keypoints 7 and 4, respectively, as shown in Figure 3.3.


Figure 3.3 Keypoints for pose output [10].
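Reading the wrists out of a BODY_25 result can be sketched as follows. The sketch assumes one person's keypoints are available as a list of 25 `(x, y, confidence)` triples, which is the per-keypoint layout OpenPose reports; in the BODY_25 ordering documented by OpenPose, index 4 is the right wrist and index 7 is the left wrist. The function name and the `min_confidence` cutoff are hypothetical.

```python
R_WRIST = 4  # BODY_25 index of the right wrist
L_WRIST = 7  # BODY_25 index of the left wrist


def wrist_coordinates(keypoints, min_confidence=0.1):
    """Return (left_wrist, right_wrist) as (x, y) tuples, or None if undetected.

    keypoints: list of 25 (x, y, confidence) triples for one person.
    """
    def pick(index):
        x, y, confidence = keypoints[index]
        return (x, y) if confidence >= min_confidence else None

    return pick(L_WRIST), pick(R_WRIST)


# Toy person: right wrist clearly detected, left wrist occluded (low confidence).
person_keypoints = [(0.0, 0.0, 0.0)] * 25
person_keypoints[R_WRIST] = (120.0, 200.0, 0.9)
person_keypoints[L_WRIST] = (80.0, 210.0, 0.05)
left_wrist, right_wrist = wrist_coordinates(person_keypoints)
```

Returning `None` for a low-confidence wrist lets later steps skip it instead of measuring distances to a spurious point.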

3.3.3.3 Calculation of Confidence Score

We determine the confidence score (C), which indicates the extent to which a given active garment is a garment of interest for a given customer. To do so, we first calculate the area of the active garment ($A_{AG}$), whose top-left and bottom-right spatial coordinates are denoted as $AG_{TL}(x, y)$ and $AG_{BR}(x, y)$, respectively, using Equation (3.1).

$$A_{AG} = \bigl(AG_{BR}(x) - AG_{TL}(x)\bigr) \times \bigl(AG_{BR}(y) - AG_{TL}(y)\bigr) \tag{3.1}$$

Then, we determine the minimum Euclidean distance (D) between the centroid of the active garment ($AG_C(x, y)$) and the coordinates of the two wrists of a person using Equations (3.2) and (3.3).

(3.2)
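The two quantities described in this subsection, the garment area and the minimum wrist-to-centroid distance, can be computed as in the sketch below. It assumes image coordinates with the origin at the top-left corner and takes the garment centroid to be the centre of its bounding box; that choice, the function names, and the handling of undetected wrists are illustrative assumptions rather than the authors' definitions.

```python
import math


def garment_area(top_left, bottom_right):
    """Area of the bounding box spanned by the two corner points."""
    width = bottom_right[0] - top_left[0]
    height = bottom_right[1] - top_left[1]
    return width * height


def garment_centroid(top_left, bottom_right):
    """Centre of the bounding box (assumed to stand in for the garment centroid)."""
    return (
        (top_left[0] + bottom_right[0]) / 2,
        (top_left[1] + bottom_right[1]) / 2,
    )


def min_wrist_distance(centroid, left_wrist, right_wrist):
    """Smallest Euclidean distance from the centroid to a detected wrist."""
    distances = [
        math.dist(centroid, wrist)
        for wrist in (left_wrist, right_wrist)
        if wrist is not None
    ]
    return min(distances) if distances else None


top_left, bottom_right = (100, 50), (200, 250)
area = garment_area(top_left, bottom_right)
centroid = garment_centroid(top_left, bottom_right)
distance = min_wrist_distance(centroid, (150, 180), (300, 150))
```

In this toy case the left wrist is 30 pixels from the garment centroid and the right wrist is 150 pixels away, so the minimum distance is determined by the left wrist.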