Machine Vision Inspection Systems, Machine Learning-Based Approaches. Group of authors
Twin network: The twin network consists of two identical networks that share their weights. Sharing weights ensures that both networks produce the same output when the same image is fed to each of them. Since we wanted the twin network to learn to extract features that help distinguish images, we experimented with convolutional layers, capsule layers, and deep capsule layers; the model based on deep capsule layers gave the best performance.
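The effect of weight sharing can be illustrated with a minimal sketch. The feature extractor below is a toy single linear layer, not the capsule model used in this study, and the names extract_features, twin_forward, and shared_weights are hypothetical.

```python
import math


def extract_features(image, weights):
    """Toy feature extractor: one linear layer followed by tanh.

    `image` is a flat list of pixel values and `weights` holds one row per
    output feature. Both are illustrative stand-ins for the real network.
    """
    return [math.tanh(sum(w * p for w, p in zip(row, image))) for row in weights]


def twin_forward(image_a, image_b, shared_weights):
    """Run both branches of the twin network with the same weights."""
    return (
        extract_features(image_a, shared_weights),
        extract_features(image_b, shared_weights),
    )


# Assumed toy values: 3 input pixels, 2 output features.
shared_weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
image = [0.2, 0.7, 0.1]

# Feeding the same image to both branches yields identical feature vectors.
features_a, features_b = twin_forward(image, image, shared_weights)
```

Because there is only one set of weights, any difference between the two outputs can only come from a difference between the two inputs.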
The capsule network consists of four layers. Since we consider relatively simple images with plain backgrounds, adding many layers has little effect. The first layer is a convolutional layer with 256 kernels of size 9 × 9 and a stride of 5, which discovers basic features in the 2D image. The second, third, and fourth layers are capsule layers with 32 channels of 4-dimensional capsules, where each capsule consists of 4 convolutional units with a 3 × 3 kernel and strides of 2 and 1, respectively. The next capsule layer contains 16 channels of 6-dimensional capsules, each consisting of a convolutional unit with a 3 × 3 kernel and a stride of 2. The sixth layer is a fully connected capsule layer, called the entity capsule layer, which contains 20 capsules of dimension 16. We use the dynamic routing proposed in Ref. [9] between the final convolutional capsule layer and the entity capsule layer, with three routing iterations.
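As a rough illustration of routing between two capsule layers, the sketch below follows the general routing-by-agreement scheme (softmax coupling coefficients, a squash non-linearity, agreement updates) with three iterations. It is written in plain Python for readability; the toy dimensions and the names squash, dynamic_routing, and u_hat are assumptions, and the prediction vectors are taken as given rather than computed from learned transformation matrices.

```python
import math


def squash(s):
    """Shrink vector s to a length in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0] * len(s)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, num_iterations=3):
    """Route lower-level capsules to upper-level capsules.

    u_hat[i][j] is the prediction vector that lower capsule i makes for
    upper capsule j. Returns one output vector per upper capsule.
    """
    num_lower, num_upper, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    logits = [[0.0] * num_upper for _ in range(num_lower)]
    outputs = []
    for _ in range(num_iterations):
        # Coupling coefficients: each lower capsule distributes itself over the upper ones.
        coupling = [softmax(row) for row in logits]
        outputs = []
        for j in range(num_upper):
            weighted_sum = [
                sum(coupling[i][j] * u_hat[i][j][d] for i in range(num_lower))
                for d in range(dim)
            ]
            outputs.append(squash(weighted_sum))
        # Increase the logit where a prediction agrees with the resulting output.
        for i in range(num_lower):
            for j in range(num_upper):
                logits[i][j] += sum(p * q for p, q in zip(u_hat[i][j], outputs[j]))
    return outputs


# Toy input: 3 lower capsules, 2 upper capsules, 2-dimensional vectors.
# The predictions agree strongly on upper capsule 0 and weakly on capsule 1.
u_hat = [
    [[1.0, 0.0], [0.0, 0.1]],
    [[0.9, 0.1], [0.1, 0.0]],
    [[1.1, -0.1], [0.0, -0.1]],
]
entity_capsules = dynamic_routing(u_hat, num_iterations=3)
```

In this toy input the first upper capsule ends up with the longer output vector, since all three lower capsules predict nearly the same vector for it.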
Vector difference layer: After the twin network identifies and extracts the important features in the two input images, the vector difference layer compares those features to reach a final decision about similarity. Each capsule in the twin network is trained to extract a specific type of property or entity, such as an object or part of an object. Here, the length and the direction of the output vector are determined by the probability of feature detection and the state of the detected feature, respectively [11]. For example, when an identified feature changes its state by moving, the probability, and therefore the vector length, remains the same while the orientation changes. Because of this property, a scalar difference based on the L1 distance is not sufficient; a more complex vector difference must be computed and analysed. We obtain 20 vectors of dimension 16 after the difference layer and feed them to a fully connected network.
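The following sketch illustrates why a comparison based only on vector lengths can miss a change of state. The exact difference operation is not spelled out in this section, so element-wise subtraction is an assumption here, and the function names are hypothetical.

```python
import math


def length(vector):
    """Euclidean length of a capsule output vector."""
    return math.sqrt(sum(x * x for x in vector))


def vector_difference(capsules_a, capsules_b):
    """Element-wise difference for each pair of corresponding capsules."""
    return [[a - b for a, b in zip(va, vb)] for va, vb in zip(capsules_a, capsules_b)]


def length_difference(capsules_a, capsules_b):
    """Scalar comparison that looks only at capsule lengths (probabilities)."""
    return [abs(length(va) - length(vb)) for va, vb in zip(capsules_a, capsules_b)]


# One 2-dimensional capsule per image: same length, different orientation.
capsules_a = [[0.8, 0.0]]
capsules_b = [[0.0, 0.8]]

scalar_result = length_difference(capsules_a, capsules_b)  # no difference seen
vector_result = vector_difference(capsules_a, capsules_b)  # orientation change kept
```

The length-based comparison reports no difference for this pair, while the vector difference retains the change in orientation for the later layers to analyse.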
Figure 2.1 Siamese network architecture.
Fully connected network: The fully connected network comprises four fully connected layers with the parameters shown in Figure 2.1. The last fully connected layer uses a sigmoid activation, while the other layers use the Rectified Linear Unit (ReLU) activation [35]. In this study, multiple fully connected layers are used to analyse the complex output of the vector difference layer and produce an accurate probability.
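A minimal sketch of such a head is given below: three ReLU layers followed by a single sigmoid unit, applied to the flattened 20 × 16 difference vectors. The layer widths in layer_sizes are placeholders, since the actual parameters appear only in Figure 2.1, and the weights are random rather than trained.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, for illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def similarity_head(difference_vectors, layers):
    """Flatten the difference vectors and map them to a probability."""
    x = [value for vector in difference_vectors for value in vector]
    for weights, biases in layers[:-1]:
        x = dense(x, weights, biases, relu)
    weights, biases = layers[-1]
    return dense(x, weights, biases, sigmoid)[0]


rng = random.Random(0)
layer_sizes = [320, 64, 32, 16, 1]  # hypothetical widths; 320 = 20 capsules x 16 dims
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

difference_vectors = [[rng.uniform(-1.0, 1.0) for _ in range(16)] for _ in range(20)]
probability = similarity_head(difference_vectors, layers)
```

The sigmoid on the last layer keeps the output strictly between 0 and 1, so it can be read as the probability that the two images match.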
2.3.1 One-Shot Learning Implementation
The goal of this study is to classify characters in new alphabets. After fine-tuning the model for the verification task, we expect it to have learned a function general enough to distinguish between any two images. Hence, we can model character classification as a one-shot learning task, which uses only one sample to learn or perform a particular task [6]. This study creates a reference set containing a single image for each possible class, feeds the verification model with pairs formed from the test image and one image from the reference set, and predicts the class using the similarity score given by the model. This approach is further extended to improve accuracy and for testing purposes, as explained in Section 2.4.
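The classification step described above can be sketched as follows. The trained verification model is replaced by a placeholder, toy_similarity, and the images are short lists of numbers; both are assumptions made purely for illustration.

```python
def one_shot_classify(test_image, reference_set, similarity):
    """Pick the class whose single reference image is most similar to the test image.

    `reference_set` maps a class label to exactly one reference image and
    `similarity` is any function returning a score in [0, 1] for an image pair.
    """
    scores = {
        label: similarity(test_image, reference_image)
        for label, reference_image in reference_set.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


def toy_similarity(image_a, image_b):
    """Placeholder for the verification model: 1 / (1 + L1 distance)."""
    distance = sum(abs(a - b) for a, b in zip(image_a, image_b))
    return 1.0 / (1.0 + distance)


# One reference image per class (hypothetical labels and pixel values).
reference_set = {
    "alpha": [1.0, 0.0, 0.0],
    "beta": [0.0, 1.0, 0.0],
    "gamma": [0.0, 0.0, 1.0],
}
test_image = [0.1, 0.9, 0.0]

predicted_label, scores = one_shot_classify(test_image, reference_set, toy_similarity)
```

With these toy values the test image is closest to the "beta" reference, so that label is predicted. In practice, `similarity` would wrap a forward pass through the trained twin network.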
2.3.2 Optimization and Learning
The proposed methodology learns the optimal model parameters by optimizing a cost function defined over the expected output and the actual result. The binary cross-entropy function [36], given in Equation (2.2), is used to quantify the prediction error. Here θ denotes the parameters of the model. The symbols x_i, x_j, and y_{i,j} represent the input image, the reference image, and the expected output, respectively. Training pushes the output of the function F up when the reference and test images match, and down otherwise. The Adam optimizer [37] is used to optimize this cost function.
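Equation (2.2) is not reproduced in this excerpt. For orientation, the standard binary cross-entropy written with the symbols defined above takes the following form; the averaging over N training pairs is an assumption and may differ from the normalization used in the original equation.

```latex
\mathcal{L}(\theta) = -\frac{1}{N} \sum_{(i,j)} \Big[\, y_{i,j} \log F(x_i, x_j; \theta)
  + \big(1 - y_{i,j}\big) \log\big(1 - F(x_i, x_j; \theta)\big) \Big]
```

When y_{i,j} = 1 only the first term is active, so the loss falls as F approaches 1; when y_{i,j} = 0 only the second term is active, so the loss falls as F approaches 0.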
2.3.3 Dataset
This study focuses on the character domain. Therefore, we use the Omniglot dataset to train the model to learn a discriminative function and the features of the images. The Omniglot dataset consists of 1,623 handwritten characters that belong to 50 alphabets [6]. Each character has 20 samples, written by 20 different individuals through the Amazon Mechanical Turk platform. The dataset is divided into a training set with 30 alphabets and a test set with 20 alphabets. For the training sessions, we train on data from the training set only and validate using the data in the test set.
2.3.4 Training Process
The learning model is trained on an AWS EC2 instance with four vCPUs and an Nvidia Tesla V100 GPU with 16 GB of memory. We trained our models for up to 500 epochs while manually adjusting the learning rate depending on the convergence.
Before model training, the images were paired. For image pairs of the same category, the expected prediction is 1; for all other pairs it is 0. Data fetching is done on the CPU while, at the same time, previously fetched batches are fed to and processed on the GPU. This significantly reduced the training time.
Algorithm 1 describes the data generation process for model training. The process takes as inputs the list of character categories and the images that belong to each category. It outputs the image couples and their expected output values. The process starts by generating similar couples. As stated in line 1, the loop goes through each character category and generates the couples belonging to the same category, using the get_similar_couples function in line 2. These image couples are added to the output array training_couples in line 3, along with the expected values in line 4. For matching image couples the prediction is one, so the value 1 is added to the expected values array once for each couple.
In lines 5 and 6, the algorithm loops over the category list twice in a nested loop, and checks in line 7 whether the two categories are the same. If they are, the process immediately moves to the next iteration of the loop, using the continue keyword in line 8. If the categories are different, the process generates the mismatching image couples from the images of the two categories under consideration, as given in line 9. The image couples are then added to the training_couples array. Since these are false couples, the prediction should be zero. Thus, in line 11, the value 0 is added to the expected values array once for each element of the image_couples array.
After that, in line 12, the output arrays are shuffled before the model is trained, to generate random training samples.
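The steps described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: category_images is taken to be a dictionary from category to image list, get_different_couples is a hypothetical helper name, and all pairwise combinations are generated, which may differ from the sampling used in practice.

```python
import itertools
import random


def get_similar_couples(images):
    """All unordered pairs of images from one category."""
    return list(itertools.combinations(images, 2))


def get_different_couples(images_a, images_b):
    """All pairs with one image from each of two different categories."""
    return list(itertools.product(images_a, images_b))


def generate_training_data(cat_list, category_images, seed=None):
    training_couples, expected_values = [], []

    # Matching couples: both images come from the same category, label 1.
    for category in cat_list:
        image_couples = get_similar_couples(category_images[category])
        training_couples.extend(image_couples)
        expected_values.extend([1] * len(image_couples))

    # Mismatching couples: images come from two different categories, label 0.
    for cat_a in cat_list:
        for cat_b in cat_list:
            if cat_a == cat_b:
                continue
            image_couples = get_different_couples(
                category_images[cat_a], category_images[cat_b]
            )
            training_couples.extend(image_couples)
            expected_values.extend([0] * len(image_couples))

    # Shuffle couples and labels together so that they stay aligned.
    paired = list(zip(training_couples, expected_values))
    random.Random(seed).shuffle(paired)
    training_couples = [couple for couple, _ in paired]
    expected_values = [value for _, value in paired]
    return training_couples, expected_values


# Toy data: image identifiers stand in for actual image arrays.
cat_list = ["a", "b"]
category_images = {"a": ["a1", "a2", "a3"], "b": ["b1", "b2"]}
training_couples, expected_values = generate_training_data(
    cat_list, category_images, seed=0
)
```

Note that the nested loop visits every ordered pair of distinct categories, so each mismatching couple appears in both orders. Shuffling the couples and labels as a single zipped list avoids the misalignment that would result from shuffling the two arrays independently.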
Algorithm 1: Data generation
Input: cat_list[], category_images[]
Output: training_couples[], expected_values[]