Database Anonymization. David Sánchez
2.8 SUMMARY
This chapter has presented a broad overview of disclosure risk limitation. We have identified the privacy threats (identity and/or attribute disclosure), and we have introduced the main families of SDC methods (data masking via perturbative and non-perturbative methods, as well as synthetic data generation). Also, we have surveyed disclosure risk and information loss metrics and we have discussed how risk and information loss can be traded off in view of finding the best SDC method and parameterization.
CHAPTER 3
Anonymization Methods for Microdata
It was commented in Section 2.5 that the protected data set Y was generated either by masking the original data set X or by building it from scratch based on a model of the original data. Microdata masking techniques were further classified into perturbative masking (which distorts the original data and leads to the publication of non-truthful data) and non-perturbative masking (which reduces the amount of information, either by suppressing some of the data or by reducing the level of detail, but preserves truthfulness). This chapter classifies and reviews some well-known SDC techniques. These techniques are not only useful on their own but they also constitute the basis to enforce the privacy guarantees required by privacy models.
3.1 NON-PERTURBATIVE MASKING METHODS
Non-perturbative methods do not alter data; rather, they produce partial suppressions or reductions of detail in the original data set.
Sampling
Instead of publishing the original microdata file X, what is published is a sample S of the original set of records [104]. Sampling methods are suitable for categorical microdata [58], but for continuous microdata they should probably be combined with other masking methods. The reason is that sampling alone leaves a continuous attribute unperturbed for all records in S. Thus, if any continuous attribute is present in an external administrative public file, unique matches with the published sample are very likely: indeed, given a continuous attribute and two respondents xi and xj, it is unlikely that both respondents will take the same value for the continuous attribute unless xi = xj (this is true even if the continuous attribute has been truncated to represent it digitally). If, for a continuous identifying attribute, the score of a respondent is only approximately known by an attacker, it might still make sense to use sampling methods to protect that attribute. However, assumptions on restricted attacker resources are perilous and may prove definitely too optimistic if good quality external administrative files are at hand.
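The linkage argument above can be illustrated with a minimal sketch. It assumes a hypothetical population with one continuous attribute ("income") and one categorical attribute ("sex"), and an attacker whose external file contains the whole population; the function names and the data are illustrative assumptions, not part of any SDC package.

```python
import random


def sample_records(records, fraction, seed=None):
    """Return a simple random sample (without replacement) of the records."""
    rng = random.Random(seed)
    n = max(1, round(fraction * len(records)))
    return rng.sample(records, n)


def count_unique_matches(sample, external, attribute):
    """Count sample records whose attribute value occurs exactly once in `external`."""
    counts = {}
    for rec in external:
        counts[rec[attribute]] = counts.get(rec[attribute], 0) + 1
    return sum(1 for rec in sample if counts.get(rec[attribute], 0) == 1)


# Hypothetical population (assumption): income is continuous, sex is categorical.
rng = random.Random(0)
population = [
    {"id": i, "income": round(rng.uniform(10_000, 90_000), 2), "sex": rng.choice("MF")}
    for i in range(1000)
]

# Publish a 10% sample; the continuous values are left unperturbed.
sample = sample_records(population, fraction=0.1, seed=1)

# The attacker's external file is assumed to be the whole population.
income_matches = count_unique_matches(sample, population, "income")
sex_matches = count_unique_matches(sample, population, "sex")
print(len(sample), income_matches, sex_matches)
```

In this sketch nearly every sampled record can be uniquely linked through the continuous attribute, whereas the categorical attribute alone yields no unique match. This is why sampling is usually combined with other masking methods for continuous microdata.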
Generalization
This technique is also known as global recoding in the statistical disclosure control literature. For a categorical attribute Xi, several categories are combined to form new (less specific) categories, thus resulting in a new Yi with |Dom(Yi)| < |Dom(Xi)| where |·| is the cardinality operator and Dom(·) is the domain where the attribute takes values. For a continuous attribute, generalization means replacing Xi by another attribute Yi which is a discretized version of Xi. In other words, a potentially infinite range Dom(Xi) is mapped onto a finite range Dom(Yi). This is the technique used in the μ-Argus SDC package [45]. This technique is more appropriate for categorical microdata, where it helps disguise records with strange combinations of categorical attributes. Generalization is used heavily by statistical offices.
Example 3.1 If there is a record with “Marital status = Widow/er” and “Age = 17,” generalization could be applied to “Marital status” to create a broader category “Widow/er or divorced,” so that the probability of the above record being unique would diminish.
Generalization can also be used on a continuous attribute, but the inherent discretization leads very often to an unaffordable loss of information. Also, arithmetical operations that were straightforward on the original Xi are no longer easy or intuitive on the discretized Yi.
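A minimal sketch of generalization follows, under stated assumptions: the recoding map, the interval width of 10 years, and the toy records are hypothetical choices made for illustration only. The categorical recoding merges categories so that the recoded domain is smaller than the original one, and the continuous recoding replaces each value with an interval label.

```python
from collections import Counter

# Hypothetical recoding map (assumption): merge two categories into a broader one.
MARITAL_RECODING = {
    "Widow/er": "Widow/er or divorced",
    "Divorced": "Widow/er or divorced",
}


def generalize_categorical(value, recoding):
    """Replace a category with its broader category, if one is defined."""
    return recoding.get(value, value)


def generalize_continuous(value, width):
    """Discretize a numeric value into the interval [lo, lo + width)."""
    lo = (value // width) * width
    return f"[{lo}, {lo + width})"


def unique_count(rows):
    """Number of records whose (marital, age) combination occurs only once."""
    counts = Counter((r["marital"], r["age"]) for r in rows)
    return sum(1 for c in counts.values() if c == 1)


records = [
    {"marital": "Widow/er", "age": 17},
    {"marital": "Divorced", "age": 19},
    {"marital": "Single", "age": 17},
    {"marital": "Single", "age": 18},
    {"marital": "Married", "age": 34},
    {"marital": "Married", "age": 36},
]

generalized = [
    {
        "marital": generalize_categorical(r["marital"], MARITAL_RECODING),
        "age": generalize_continuous(r["age"], 10),
    }
    for r in records
]

print(unique_count(records), unique_count(generalized))
print(len({r["marital"] for r in records}), len({r["marital"] for r in generalized}))
```

The generalized ages are now strings such as "[10, 20)", which shows the drawback noted above: arithmetic that was straightforward on the original attribute no longer applies directly to the discretized one.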
Top and bottom coding
Top and bottom coding are special cases of generalization which can be used on attributes that can be ranked, that is, continuous or categorical ordinal. The idea is that top values (those above a certain threshold) are lumped together to form a new category. The same is done for bottom values (those below a certain threshold).
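As a minimal sketch, top and bottom coding of a rankable attribute might look as follows. The thresholds (18 and 85) and the category labels are hypothetical choices; in practice they would be picked according to the data and the disclosure scenario.

```python
def top_bottom_code(values, bottom, top):
    """Lump values below `bottom` and above `top` into two open-ended categories."""
    coded = []
    for v in values:
        if v < bottom:
            coded.append(f"<{bottom}")   # bottom coding
        elif v > top:
            coded.append(f">{top}")      # top coding
        else:
            coded.append(v)              # values in between are kept as they are
    return coded


ages = [15, 23, 41, 67, 88, 102]
coded_ages = top_bottom_code(ages, bottom=18, top=85)
print(coded_ages)
```

Only the extreme values, which are typically the rarest and therefore the most identifying, lose detail; values between the thresholds are published unchanged.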
Local suppression
This is a masking method in which certain values of individual attributes are suppressed with the aim of increasing the set of records agreeing on a combination of key values. Ways to combine local suppression and generalization are implemented in the μ-Argus SDC package [45].
If a continuous attribute Xi is part of a set of key attributes, then each combination of key values is probably unique. Since it does not make sense to systematically suppress the values of Xi, we conclude that local suppression is rather oriented to categorical attributes.
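The idea can be sketched with a deliberately naive heuristic. This is an assumption made for illustration, not the algorithm implemented in μ-Argus: in every record whose combination of key values occurs fewer than k times, the value of one chosen attribute is suppressed (set to None). A suppressed value is then treated as compatible with any value when counting the records that agree on the key attributes.

```python
from collections import Counter


def local_suppress(rows, keys, target, k):
    """Suppress `target` in rows whose key combination occurs fewer than k times."""
    counts = Counter(tuple(r[a] for a in keys) for r in rows)
    result = []
    for r in rows:
        r = dict(r)  # copy so that the input records are not modified
        if counts[tuple(r[a] for a in keys)] < k:
            r[target] = None  # suppressed value
        result.append(r)
    return result


def compatible_count(record, rows, keys):
    """Count rows agreeing with `record` on every key; None matches anything."""
    return sum(
        all(record[a] is None or r[a] is None or record[a] == r[a] for a in keys)
        for r in rows
    )


KEYS = ["marital", "age_group"]
rows = [
    {"marital": "Widow/er", "age_group": "<20"},
    {"marital": "Single", "age_group": "<20"},
    {"marital": "Single", "age_group": "<20"},
    {"marital": "Married", "age_group": "30-39"},
    {"marital": "Married", "age_group": "30-39"},
]

masked = local_suppress(rows, KEYS, target="marital", k=2)
print(compatible_count(rows[0], rows, KEYS), compatible_count(masked[0], masked, KEYS))
```

Only the rare record is altered. After suppression it agrees with every record in the same age group, so the set of records sharing its key values grows.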
3.2 PERTURBATIVE MASKING METHODS
Noise addition
Additive noise is a family of perturbative masking methods. The values in the original data set are masked by adding some random noise. The statistical properties of the noise being added determine the effect of noise addition on the original data set. Several noise addition procedures have been developed, each of them with the aim to better preserve the statistical properties of the original data.
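• Masking by uncorrelated noise addition. The vector of observations, $x_i$, for the i-th attribute $X_i$ of the original data set is replaced by a vector $y_i = x_i + e_i$, where $e_i$ is a vector of normally distributed errors. Let $e_i^k$ and $e_i^l$ be, respectively, the k-th and l-th components of vector $e_i$. We have that $e_i^k$ and $e_i^l$ are independent and drawn from a normal distribution $N(0, \sigma^2_{e_i})$. The usual approach is for the variance of the noise added to attribute $X_i$ to be proportional to the variance of $X_i$; that is, $\sigma^2_{e_i} = \alpha \sigma^2_{X_i}$ for some constant $\alpha > 0$. The term “uncorrelated” is used to mean that there is no correlation between the noise added to different attributes. This method preserves means and covariances:
$$E(Y_i) = E(X_i) + E(e_i) = E(X_i), \qquad \mathrm{Cov}(Y_i, Y_j) = \mathrm{Cov}(X_i, X_j) \quad (i \neq j).$$
However, neither variances nor correlations are preserved:
$$\mathrm{Var}(Y_i) = \mathrm{Var}(X_i) + \alpha\,\mathrm{Var}(X_i) = (1 + \alpha)\,\mathrm{Var}(X_i), \qquad \rho_{Y_i, Y_j} = \frac{\mathrm{Cov}(Y_i, Y_j)}{\sqrt{\mathrm{Var}(Y_i)\,\mathrm{Var}(Y_j)}} = \frac{1}{1 + \alpha}\,\rho_{X_i, X_j}.$$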
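The effect of uncorrelated noise addition can be checked empirically with a small simulation. The sketch below makes several assumptions: two hypothetical attributes with a population correlation of 0.8, independent zero-mean normal noise added to each attribute, and a noise variance equal to a proportionality constant (called alpha here, set to 0.5) times the sample variance of the attribute. All function names are illustrative.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    """Sample covariance; cov(a, a) is the sample variance."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def corr(a, b):
    """Sample Pearson correlation."""
    return cov(a, b) / math.sqrt(cov(a, a) * cov(b, b))


def add_uncorrelated_noise(column, alpha, rng):
    """Add independent N(0, alpha * Var(column)) noise to every value."""
    sd = math.sqrt(alpha * cov(column, column))
    return [x + rng.gauss(0.0, sd) for x in column]


rng = random.Random(42)
n = 20_000
alpha = 0.5  # hypothetical proportionality constant

# Hypothetical original attributes; x2 is built to be correlated with x1.
x1 = [rng.gauss(50, 10) for _ in range(n)]
x2 = [0.8 * a + rng.gauss(0, 6) for a in x1]

# Noise is drawn independently for each attribute ("uncorrelated").
y1 = add_uncorrelated_noise(x1, alpha, rng)
y2 = add_uncorrelated_noise(x2, alpha, rng)

print("mean:       ", mean(x1), mean(y1))
print("covariance: ", cov(x1, x2), cov(y1, y2))
print("variance:   ", cov(x1, x1), cov(y1, y1))
print("correlation:", corr(x1, x2), corr(y1, y2))
```

Up to sampling error, the means and the covariance between the two attributes are unchanged, the variance of each masked attribute is inflated by a factor of about 1 + alpha, and the correlation shrinks by a factor of about 1 / (1 + alpha).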