Machine Vision Inspection Systems, Machine Learning-Based Approaches. Group of authors
2.4.3 MNIST Classification
The Omniglot dataset has more than 1,600 character classes but only 20 samples per class. In contrast, the MNIST dataset has 10 classes and 60,000 training samples in total [30]. Because the proposed model aims to learn abstract knowledge about characters and extend it to identify new characters, we can treat MNIST as an entirely new alphabet with 10 characters and apply the proposed capsule layers-based Siamese network model to classify it. Table 2.4 shows the accuracy obtained by different MNIST models. Large neural networks trained on the full dataset achieve more than 90% accuracy, while the proposed capsule layers-based Siamese network model reaches 74.5% accuracy with only 20 images per class.
Figure 2.6 Gurmukhi (left) and Cyrillic (right) alphabets.
Table 2.4 Accuracies of different MNIST models.
MNIST model | Accuracy
1-layer NN [18] | 88%
2-layer NN [18] | 95.3%
Large convolutional NN [25] | 99.5%
Proposed capsule layers-based Siamese network (1-shot) | 51%
Proposed capsule layers-based Siamese network (20-shot) | 74.5%
The MNIST dataset is a benchmark dataset for image classification algorithms, and several models reach more than 90% accuracy on it, as summarized in Table 2.4. These methods are based on deep neural networks and use all 60,000 training images in the dataset.
Although the proposed capsule layers-based Siamese network model achieves only 51% accuracy on the MNIST dataset in the one-shot setting, it uses only one sample for each digit class, whereas the other models have access to the full set of 60,000 training samples. The proposed solution improves on this accuracy with the same n-shot learning technique: with 20 samples per class, accuracy rises by 23.5 percentage points, from 51% to 74.5%, as depicted in Figure 2.7.
Figure 2.7 MNIST n-shot learning performance.
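The n-shot classification described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's implementation: `similarity` is a hypothetical stand-in for the trained Siamese network, and averaging the scores over the n references of each class is an assumed aggregation rule.

```python
import math


def similarity(a, b):
    # Hypothetical stand-in for the trained Siamese network: maps the
    # Euclidean distance between two feature vectors to a score in (0, 1].
    return math.exp(-math.dist(a, b))


def n_shot_classify(query, support):
    # `support` maps each class label to its n reference samples.
    # Assumed aggregation rule: average the similarity over the n references
    # of a class, then pick the class with the highest average.
    scores = {
        label: sum(similarity(query, ref) for ref in refs) / len(refs)
        for label, refs in support.items()
    }
    return max(scores, key=scores.get)


# Toy 2-D "embeddings" with two references per class (n = 2).
support = {
    "0": [(0.0, 0.0), (0.2, 0.1)],
    "1": [(1.0, 1.0), (0.9, 1.2)],
}
prediction = n_shot_classify((0.1, 0.0), support)
print(prediction)  # prints 0
```

With a single reference per class this reduces to one-shot classification. Adding references gives each decision more evidence per class, which is the effect reported in Figure 2.7.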
2.4.4 Sinhala Language Classification
One of the main goals of this research is to evaluate the performance of one-shot learning for the Sinhala language. Conventional deep learning approaches are not an option for Sinhala character recognition because of a lack of datasets. The Sinhala alphabet has 60 characters, which makes it a complex alphabet. For each character in the Sinhala alphabet, we added 20 new images to the Omniglot dataset. We first classified Sinhala characters with a model that had not been trained on Sinhala characters and achieved 49% accuracy. After training the model with 5% of the Sinhala dataset, the accuracy improved to 56%. Of the languages used in the experiment, Sinhala has the largest alphabet, yet the model gave better accuracy for Sinhala than for some languages with fewer characters. This could be due to significant visual structural differences between Sinhala characters.
2.5 Discussion
2.5.1 Study Contributions
This chapter has presented a novel architecture for image verification that combines the Siamese network structure with capsule networks. We have improved the energy function used in the Siamese network to extract the complex details output by capsules, and obtained performance on par with Siamese networks based on convolutional units [7] while using a significantly smaller number of parameters.
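For readers unfamiliar with the term, the role of an energy function in a Siamese network can be illustrated with a common baseline form: a weighted L1 distance between the two twin outputs, passed through a sigmoid. The sketch below shows only this baseline, not the improved function of this chapter. The name `siamese_energy` and the hand-set weights and bias are hypothetical; in a trained network these values would be learned.

```python
import math


def siamese_energy(h1, h2, weights, bias=0.0):
    # Weighted L1 distance between the twin feature vectors, squashed to
    # (0, 1) by a sigmoid so it can be read as a "same class" score.
    z = sum(w * abs(a - b) for w, a, b in zip(weights, h1, h2)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical hand-set parameters: negative weights make the score fall
# as the component-wise distance grows.
weights = [-1.0, -1.0]
bias = 2.0

same_score = siamese_energy((0.5, 0.5), (0.5, 0.5), weights, bias)
different_score = siamese_energy((0.0, 0.0), (3.0, 3.0), weights, bias)
print(same_score > different_score)  # prints True
```

Identical inputs yield the highest score the parameters allow, and the score decreases as the two feature vectors move apart.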
Another major objective of this study is to replicate the human ability to understand completely new visual concepts using previously learnt knowledge. Capsule-based Siamese networks can learn a well-generalized function that extends effectively to previously unseen data. We have evaluated this capability through n-way classification with one-shot learning. The results show more than 80.5% classification accuracy with 20 different characters of which the model has no previous experience.
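The n-way one-shot evaluation protocol can be sketched as below. This is an illustrative outline under assumptions rather than the study's evaluation code: the helper names are hypothetical, and the toy `similarity` on scalars stands in for the trained network. The sketch also computes the chance level, which for n equally likely classes is 1/n, so random guessing in a 20-way task would score 5%.

```python
def one_shot_predict(query, references, similarity):
    # `references` holds exactly one (label, sample) pair per candidate class.
    # The prediction is the label whose single reference is most similar.
    best_label, _ = max(references, key=lambda pair: similarity(query, pair[1]))
    return best_label


def n_way_accuracy(trials, similarity):
    # Each trial is (query, true_label, references); accuracy is the
    # fraction of trials in which the prediction matches the true label.
    correct = sum(
        one_shot_predict(query, refs, similarity) == label
        for query, label, refs in trials
    )
    return correct / len(trials)


def chance_level(n_way):
    # Expected accuracy of uniform random guessing among n_way classes.
    return 1.0 / n_way


def toy_similarity(a, b):
    # Hypothetical stand-in for the trained network on scalar "features".
    return -abs(a - b)


refs = [("a", 1.0), ("b", 5.0), ("c", 9.0)]  # 3-way, one reference each
trials = [
    (0.9, "a", refs),
    (5.2, "b", refs),
    (6.9, "c", refs),  # closer to the "b" reference, so this one is missed
]
accuracy = n_way_accuracy(trials, toy_similarity)
print(accuracy, chance_level(20))
```

Reporting accuracy next to the chance level makes results from tasks with different numbers of candidate classes easier to compare.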
Moreover, the model is evaluated on the MNIST dataset, which is considered a de facto standard for evaluating image classification models [30]. The proposed capsule layers-based Siamese network achieves 51% classification accuracy using only one image for each digit. The latest deep learning models achieve more than 90% accuracy [39], but they use all 60,000 training images available in the MNIST dataset. The solution proposed in this study improves on the one-shot accuracy with the n-shot learning method, which uses n samples from each image class for classification. In this way, accuracy improved by 23.5 percentage points with 20 samples. As depicted in Figure 2.5, even 28-way learning showed a classification accuracy of 90% on the Omniglot dataset, while the MNIST dataset reached 74.5% accuracy, as shown in Table 2.4.
Further, we have extended the Omniglot dataset by adding a new set of characters for the Sinhala language. It contains 600 new handwritten characters for the 60 characters in the alphabet. The proposed model gives 49% accuracy for Sinhala without any training stage on Sinhala characters, and it reaches a classification accuracy of 56.22% with a trained model, using only one reference image, as shown in Table 2.3.
In comparison with related studies, Koch et al. [7], the authors of the Omniglot dataset, used a convolutional layer-based Siamese network to solve the one-shot problem [6]. They reported an accuracy of 94% for class-independent classification, which is similar to the performance of the proposed capsule layers-based Siamese network model. However, the capsule layers achieve this accuracy with 40% fewer parameters. In an experiment on the MNIST dataset using one-shot learning, Koch et al. achieved 70% accuracy [7] and Vinyals et al. [27] reported 72% accuracy, while the proposed capsule layers-based Siamese network model gave 76% accuracy. The approach of Vinyals et al. [27] is based on memory-augmented neural networks (MANN) and has a structure similar to recurrent neural networks with external memory.
2.5.2 Challenges and Future Research Directions
Although the proposed solution achieves more than 50% accuracy, the general threshold for the tested languages, for most of the alphabet types in the Omniglot dataset, it uses only a small set of images to reach that accuracy. This limitation can be overcome by using handcrafted features,