Reservoir Characterization. Group of authors

      If AD(Yk) exceeds the anomaly detection cutoff, then record Yk is classified as anomalous. The anomaly cutoff is defined by the expected false discovery rate (expFD):

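      $$\mathrm{expFD} = \frac{N\left(AD(Y_k) > \text{anomaly cutoff};\ Y_k \in \text{TrainSet}\right)}{K} \tag{3.3}$$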

      Here N(AD(Yk) > anomaly cutoff; Yk ∈ TrainSet) is the number of records in the training set whose anomaly detection classifier value exceeds the cutoff, and K is the total number of records in the training set.
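      The relation between the expected false discovery rate and the cutoff can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the cutoff is taken as an order statistic of the training scores, so that at most a fraction expFD of the training records strictly exceeds it. The function names `anomaly_cutoff` and `exceed_fraction` are hypothetical.

```python
import math


def anomaly_cutoff(train_scores, exp_fd):
    """Pick a cutoff so that at most exp_fd of the training scores exceed it.

    Assumption: the cutoff is an order statistic of the training scores.
    """
    if not 0.0 <= exp_fd < 1.0:
        raise ValueError("exp_fd must be in [0, 1)")
    ordered = sorted(train_scores)
    k_total = len(ordered)
    # Number of training records allowed above the cutoff
    # (small epsilon guards against floating-point round-off).
    n_exceed = math.floor(exp_fd * k_total + 1e-9)
    # Scores strictly greater than this value are flagged as anomalous.
    return ordered[k_total - n_exceed - 1]


def exceed_fraction(scores, cutoff):
    """Fraction of scores strictly above the cutoff, i.e. N(...) / K."""
    return sum(s > cutoff for s in scores) / len(scores)


# Toy training scores 1..20 with expFD = 15%: three records may exceed the cutoff.
train_scores = list(range(1, 21))
cutoff = anomaly_cutoff(train_scores, 0.15)
```

      With tied training scores, the realized fraction above the cutoff can be smaller than the requested expFD.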

      To construct the anomaly detection classifiers, we selected parameters based on the results of dissimilarity analysis (Katz et al. [8]). These parameters are Vp/Vs and Poisson's ratio.

      Three basic classifiers are introduced, analyzed and tested in this paper:

      1. Distance from the center of the training set, defined by Eq. (3.4), where ym and ctr,m are the m-th coordinates of the tested record and of the center of the training set, respectively. The center of the training set is defined as the mean over the training set records, so its coordinates are of the form

      $$ctr_m = \frac{1}{K}\sum_{k=1}^{K} y_{k,m},$$

      where yk,m is the m-th coordinate of the k-th record in the training set and K is the total number of records in the training set.

      2. Nearest neighbors sparsity, defined by Eq. (3.5), where dist(Y, neighborl) is the distance between the tested record Y and its l-th nearest neighbor from the training set. The farther a tested record lies from the training set records in parameter space, the larger both its sparsity and its distance from the center of the training set. These two classifiers are universal: their performance is not affected by the properties of the records in the training set.

      3. Divergence, defined by Eq. (3.6).
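      The first two classifiers in the list above can be sketched in a few lines. The exact forms of Eqs. (3.4) and (3.5) are not reproduced here, so this sketch makes two assumptions: the distance is Euclidean, and the sparsity is the mean distance to a fixed number of nearest neighbors. The function names and the `n_neighbors` parameter are hypothetical. The divergence classifier is not sketched because its definition is not available in the text.

```python
import math


def center(train):
    """Coordinate-wise mean of the training records."""
    k_total = len(train)
    dim = len(train[0])
    return [sum(rec[m] for rec in train) / k_total for m in range(dim)]


def euclidean(a, b):
    """Euclidean distance (an assumed choice of metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_from_center(y, train):
    """Classifier 1: distance of record y from the training set center."""
    return euclidean(y, center(train))


def nn_sparsity(y, train, n_neighbors=3):
    """Classifier 2 (assumed form): mean distance to the nearest neighbors."""
    dists = sorted(euclidean(y, rec) for rec in train)
    nearest = dists[:n_neighbors]
    return sum(nearest) / len(nearest)


# Toy two-parameter records, standing in for (Vp/Vs, Poisson's ratio) pairs.
train = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
```

      Both scores grow as a tested record moves away from the training records, which is the behavior the text describes for these two classifiers.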

      To characterize anomaly detection quality, we distinguish two types of quality characteristics: (a) prior quality characteristics and (b) posterior (actual) quality characteristics. The only prior characteristic is the expected false discovery rate (expFD). Its value is assigned before the data analysis is performed and is used to calculate the anomaly detection cutoff (AD cutoff) on the training set. Posterior characteristics are calculated on a test set in which regular and anomaly records are identified. They include the true and false discovery rates as functions of the AD cutoff. Together, the true and false discovery rates form a posterior ROC curve, which is used to evaluate the area under the ROC curve and to compare the efficiency of several anomaly detection classifiers.
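      A minimal sketch of the posterior characteristics follows. It assumes that the true discovery rate is the fraction of anomaly test records above the cutoff, that the false discovery rate is the fraction of regular test records above the cutoff, and that the area under the ROC curve is obtained with the trapezoid rule. The function names are hypothetical.

```python
def posterior_rates(regular_scores, anomaly_scores, cutoff):
    """Return (true discovery rate, false discovery rate) at a given cutoff."""
    tdr = sum(s > cutoff for s in anomaly_scores) / len(anomaly_scores)
    fdr = sum(s > cutoff for s in regular_scores) / len(regular_scores)
    return tdr, fdr


def roc_points(regular_scores, anomaly_scores):
    """Sweep the cutoff from high to low and collect (fdr, tdr) points."""
    cutoffs = sorted(set(regular_scores) | set(anomaly_scores), reverse=True)
    cutoffs.append(float("-inf"))  # lowest cutoff flags every record
    points = []
    for c in cutoffs:
        tdr, fdr = posterior_rates(regular_scores, anomaly_scores, c)
        points.append((fdr, tdr))
    return points


def roc_auc(regular_scores, anomaly_scores):
    """Area under the posterior ROC curve (trapezoid rule)."""
    pts = roc_points(regular_scores, anomaly_scores)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

      An area close to 1 means the classifier ranks nearly all anomaly records above the regular ones; an area near 0.5 means it does no better than chance.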

      We used the bootstrap for statistical analysis of the anomaly detection results and for comparative analysis of the posterior efficiency characteristics. At each bootstrap run, records were sampled with replacement to form a random pair of training and test sets. The training set was drawn from a pool of regular records, and each test set contained both regular and anomaly records. The multiple pairs of training and test sets produced by random sampling were used to calculate quality characteristics of the AD classifiers: the mean and median values and the width of the quantile region of the analyzed AD characteristics, as well as parameters characterizing the relation between the expected false discovery rate and the posterior AD characteristics. The ROC curve analysis was done using multiple posterior ROC curves.
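      The bootstrap procedure described above might be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: training and test sets are drawn independently with replacement, the classifier is the Euclidean distance from the training set center, and the quantile region is taken between the 5% and 95% quantiles. All function names, set sizes, and default values are hypothetical.

```python
import math
import random
import statistics


def center(train):
    dim = len(train[0])
    return [sum(rec[m] for rec in train) / len(train) for m in range(dim)]


def distance(y, ctr):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, ctr)))


def cutoff_for(train_scores, exp_fd):
    """Order-statistic cutoff: at most exp_fd of training scores exceed it."""
    ordered = sorted(train_scores)
    n_exceed = math.floor(exp_fd * len(ordered) + 1e-9)
    return ordered[len(ordered) - n_exceed - 1]


def summarize(values):
    """Mean, median, and 5%/95% quantiles over the bootstrap runs."""
    q = statistics.quantiles(values, n=20)  # 19 cut points: 5%, 10%, ..., 95%
    return {"mean": statistics.mean(values),
            "median": statistics.median(values),
            "q05": q[0], "q95": q[-1]}


def bootstrap_rates(regular, anomaly, exp_fd=0.15, n_train=20, n_test=15,
                    n_runs=200, seed=0):
    rng = random.Random(seed)
    tdrs, fdrs = [], []
    for _ in range(n_runs):
        # Sampling with replacement; training records come from regular pool only.
        train = rng.choices(regular, k=n_train)
        test_regular = rng.choices(regular, k=n_test)
        test_anomaly = rng.choices(anomaly, k=n_test)
        ctr = center(train)
        cutoff = cutoff_for([distance(r, ctr) for r in train], exp_fd)
        # Posterior rates on the test set at the training-derived cutoff.
        tdrs.append(sum(distance(r, ctr) > cutoff for r in test_anomaly) / n_test)
        fdrs.append(sum(distance(r, ctr) > cutoff for r in test_regular) / n_test)
    return {"tdr": summarize(tdrs), "fdr": summarize(fdrs)}
```

      Comparing the summarized posterior false discovery rate with the assigned `exp_fd` gives one simple view of how the prior and posterior characteristics relate across runs.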

      The first 20 values in Figure 3.1, marked by circles (Index ≤ 20), are the values of the divergence classifier on the records from the training set. The points marked by triangles and crosses (Index > 20) show the values of the divergence classifier on the records from the test set. The horizontal dashed line shows the anomaly detection cutoff. Records with values above the cutoff are classified as anomalous. One can observe that the distribution of the divergence classifier values on the regular records in the test set is very similar to that in the training set. On the other hand, the divergence values for anomaly records are systematically higher than those for regular records. The anomaly detection cutoff corresponds to an expected false discovery rate of 15%. The expected FDR is calculated as the percentage of records in the training set that exceed the AD cutoff. The posterior true and false discovery rates are, respectively, the percentages of anomaly and regular records in the test set that exceed the AD cutoff. In this particular case, the posterior false discovery rate is 6.6%, smaller than the expected FDR. The true discovery rate is high and equals 84%. The high true discovery rate is due to the large proportion of anomaly records characterized by positive divergence values. The low posterior FDR is due to the fact that the divergence values on a large proportion of regular records in the test set are smaller than the classification cutoff.

Figure 3.1 Schematic illustration of divergence values for records in the training and test sets.
