Administrative Records for Survey Methodology. Group of authors
The technique adopted is called partially synthetic data with multiple imputation of missing items. The term “partially synthetic data” means that the person-level records are released containing some variables from the actual responses and other variables where the actual responses have been replaced by values sampled from the posterior predictive distribution (PPD) for that record, conditional on all of the confidential data. From 2003 until 2015, seven preliminary versions of the SSB were produced. In this chapter, we will focus on the protections that pertain to the linked nature of the data. The interested reader is referred to Abowd, Stinson, and Benedetto (2006) for details on data sources, imputation, and linkage. The analysis here is for the SSB version 4. Since version 4, two additional versions have been released with slightly different structure.3 Subsequent versions are well-illustrated by the extensive analysis described here.
2.3.2.2 Disclosure Avoidance Methods
The existence of SIPP public use files poses a key challenge for disclosure avoidance. To protect the confidentiality of survey respondents, it was deemed necessary to prevent reidentification of a record that appears in the synthetic data against the existing SIPP public use files. Hence, all information regarding the dating of variables whose source was a SIPP response, and not administrative data, has to be made consistent across individuals regardless of the panel and wave from which the response was taken. The public use file contains several variables that were never missing and are not synthesized. These variables are: gender, marital status, spouse’s gender, initial type of Social Security benefits, type of Social Security benefits in 2000, and the same benefit type variables for the spouse. All other variables in the SSB v4 were synthesized.
The model first imputes any missing data, then synthesizes the completed data (Reiter 2004). For each iteration of the missing data imputation phase and again during the synthesis phase, a joint PPD for all of the required variables is estimated according to the following protocol. At each node of the parent/child tree, a statistical model is estimated for each of the variables at the same level. The statistical model is a Bayesian bootstrap, logistic regression, or linear regression (possibly with transformed inputs). The missing data phase included nine iterations of estimation. The synthetic data phase occurred on the 10th iteration. Four missing data implicates were created. These constitute the completed data files that are the inputs to the synthesis phase. Four synthetic implicates were created for each missing data implicate, for a total of 16 synthetic implicates on the released file. Because copying the final weight to each implicate of the synthetic data would have provided an additional unsynthesized variable with 55 552 distinct values, the disclosure risk associated with the weight variable had to be addressed. A synthetic weight using a PPD based on the Multinomial/Dirichlet natural conjugate likelihood and prior was created.
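The nesting of implicates described above can be illustrated with a small sketch. It is a toy, not the SSB production model: it uses only an unconditional Bayesian bootstrap on a single variable, ignores the parent/child tree and the regression models, and performs one imputation pass rather than nine iterations. The names bayesian_bootstrap_draw and make_implicates are hypothetical and exist only for this illustration.

```python
import random


def bayesian_bootstrap_draw(values, rng):
    """Resample len(values) items using Bayesian bootstrap probabilities."""
    # Dirichlet(1, ..., 1) probabilities, obtained by normalizing
    # independent Exponential(1) draws.
    gammas = [rng.expovariate(1.0) for _ in values]
    total = sum(gammas)
    probs = [g / total for g in gammas]
    # Sample observed values with replacement using those probabilities.
    return rng.choices(values, weights=probs, k=len(values))


def make_implicates(observed, n_missing=4, n_synth=4, seed=0):
    """Return a dict keyed by (missing-data implicate, synthetic implicate).

    `observed` is a list in which None marks a missing item (assumption).
    """
    rng = random.Random(seed)
    donors = [v for v in observed if v is not None]
    implicates = {}
    for m in range(1, n_missing + 1):
        # Missing data phase (toy version): fill gaps from a bootstrap draw.
        fill = bayesian_bootstrap_draw(donors, rng)
        completed = [v if v is not None else rng.choice(fill) for v in observed]
        for s in range(1, n_synth + 1):
            # Synthesis phase: replace every value with a draw based on
            # the completed data file for implicate m.
            implicates[(m, s)] = bayesian_bootstrap_draw(completed, rng)
    return implicates


implicates = make_implicates([1, 2, None, 4, 5])
```

With the default settings the dictionary holds 4 × 4 = 16 synthetic implicates, matching the count on the released file.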
2.3.2.3 Disclosure Avoidance Assessment
The link of administrative earnings, benefits and SIPP data adds a significant amount of information to an already very detailed survey and could pose potential disclosure risks beyond those originally managed as part of the regular SIPP public use file disclosure avoidance process. The synthesis of the earnings data meets the IRS disclosure officer’s criteria for properly protecting the federal tax information found in the summary and detailed earnings histories used to create the longitudinal earnings variables.
The Census Bureau Disclosure Review Board at the time of release used two standards for disclosure avoidance in partially synthetic data. First, using the best available matching technology, the percentage of true matches relative to the size of the files should not be excessively large. Second, the ratio of true matches to the total number of matches (true and false) should be close to one-half.
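The two quantities behind these standards can be computed as in the following minimal sketch. It assumes that each declared match is available as a pair of identifiers and that a match is true when the two identifiers agree; the function name match_metrics and that data layout are assumptions made for illustration. No numeric threshold for "excessively large" is encoded, because the text does not give one.

```python
def match_metrics(declared_matches, n_records):
    """Compute the two disclosure review quantities for a matching exercise.

    declared_matches: list of (synthetic_id, confidential_id) pairs that the
        matching software declared to be matches (hypothetical layout).
    n_records: size of the file against which the match rate is expressed.
    """
    true_matches = sum(1 for syn_id, conf_id in declared_matches
                       if syn_id == conf_id)
    total_matches = len(declared_matches)
    # Standard 1: true matches relative to the size of the file.
    true_match_rate = true_matches / n_records
    # Standard 2: true matches relative to all matches (true and false).
    true_match_share = true_matches / total_matches if total_matches else 0.0
    return true_match_rate, true_match_share


rate, share = match_metrics([(1, 1), (2, 3), (3, 3), (4, 5)], n_records=100)
```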
The disclosure avoidance analysis (Abowd, Stinson, and Benedetto 2006) uses the principle that a potential intruder would first try to reidentify the source record for a given synthetic data observation in the existing SIPP public use files. Two distinct matching exercises – one probabilistic (Fellegi and Sunter 1969), one distance-based (Torra, Abowd, and Domingo-Ferrer 2006) – between the synthetic data and the harmonized confidential data were conducted.4 The harmonized confidential data – actual values of the data items as released in the original SIPP public use files – are the equivalent of the best available information for an intruder attempting to reidentify a record in the synthetic data. Successful matches between the harmonized confidential data and the synthetic data represent potential disclosure risks. In practice, the intruder would also need to make another successful link to exogenous data files that contain direct identifiers such as names, addresses, telephone numbers, etc. The results from the experiments are conservative estimates of reidentification risk. For the probabilistic matching, the assessment matched synthetic and confidential files exactly on the unsynthesized variables of gender and marital status, and success of the matching exercise is assessed using a person identifier which is not, in fact, available in the released version of the synthetic data. Without the personid, an intruder would have to compare many more record pairs to find true matches, would not find any more true matches (the true match is guaranteed to be in the blocks being compared), and would almost certainly find more false matches. In fact, the records that can be reidentified represent only a very small proportion (less than 3%) of candidate records, and correct reidentifications are swamped by a sea of false reidentifications (Abowd, Stinson, and Benedetto 2006, p. 6).
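As background for the probabilistic exercise, the standard Fellegi and Sunter agreement weight for a candidate record pair can be sketched as follows. The field names and the m and u probabilities below are invented for illustration and are not taken from the SSB assessment; the function fs_weight is likewise hypothetical.

```python
import math


def fs_weight(agreements, m, u):
    """Sum of log2 likelihood-ratio weights for one candidate record pair.

    agreements: dict mapping field name -> True if the pair agrees on it.
    m: dict of P(agree on field | pair is a true match).
    u: dict of P(agree on field | pair is not a match).
    """
    total = 0.0
    for field, agrees in agreements.items():
        if agrees:
            total += math.log2(m[field] / u[field])
        else:
            total += math.log2((1 - m[field]) / (1 - u[field]))
    return total


# Illustrative probabilities only (assumed, not estimated from any data).
m = {"birth_year": 0.9, "earnings_band": 0.8}
u = {"birth_year": 0.1, "earnings_band": 0.2}
w_agree = fs_weight({"birth_year": True, "earnings_band": True}, m, u)
w_disagree = fs_weight({"birth_year": False, "earnings_band": False}, m, u)
```

Pairs with high total weight are declared matches and pairs with low total weight are declared nonmatches; exact agreement on the blocking variables restricts which pairs are compared at all.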
In distance-based matching, records in the harmonized confidential and synthetic data are blocked in a similar way, and distances (or similarity scores) are computed between a given confidential record and every synthetic record within its block. The three closest records are declared matches, and the personid is again checked to verify how often a true match is obtained. A putative intruder who treated the closest record as a match would correctly link about 1% of all synthetic records, and less than 3% in the worst-case subgroup (Abowd, Stinson, and Benedetto 2006, p. 8).
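The blocking and nearest-record logic can be sketched as follows. This is a simplified illustration that assumes Euclidean distance on numeric fields; the distance measure actually used in the assessment is not specified here. The record layout, the field names, and the function distance_match are hypothetical. The parameter k lets the check use either the single closest record or the three closest.

```python
import math
from collections import defaultdict


def distance_match(confidential, synthetic, block_keys, num_keys, k=1):
    """Share of confidential records whose true synthetic counterpart is
    among the k closest synthetic records within the same block."""
    # Group synthetic records by their blocking-variable values.
    blocks = defaultdict(list)
    for rec in synthetic:
        blocks[tuple(rec[b] for b in block_keys)].append(rec)

    true_hits = 0
    for conf in confidential:
        candidates = blocks.get(tuple(conf[b] for b in block_keys), [])
        point = [conf[x] for x in num_keys]
        # Rank candidates in the block by distance to the confidential record.
        ranked = sorted(
            candidates,
            key=lambda s: math.dist(point, [s[x] for x in num_keys]),
        )
        # The personid is used only to score the exercise afterwards.
        if any(s["personid"] == conf["personid"] for s in ranked[:k]):
            true_hits += 1
    return true_hits / len(confidential) if confidential else 0.0


# Invented toy records, for illustration only.
confidential = [
    {"personid": 1, "gender": "F", "marital": "M", "earn": 100.0},
    {"personid": 2, "gender": "F", "marital": "M", "earn": 200.0},
    {"personid": 3, "gender": "M", "marital": "S", "earn": 50.0},
]
synthetic = [
    {"personid": 1, "gender": "F", "marital": "M", "earn": 190.0},
    {"personid": 2, "gender": "F", "marital": "M", "earn": 120.0},
    {"personid": 3, "gender": "M", "marital": "S", "earn": 55.0},
]
rate_closest = distance_match(confidential, synthetic,
                              ["gender", "marital"], ["earn"], k=1)
rate_top3 = distance_match(confidential, synthetic,
                           ["gender", "marital"], ["earn"], k=3)
```

In the toy data, synthesis has moved the earnings of the first two records far enough that each is closest to the other's synthetic record, so only the third record is linked correctly when k=1.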
Figure 2.1 Probability density function of the ramp distribution used in LEHD disclosure avoidance system.
2.3.2.4 Analytical Validity Assessment
Although synthetic data are designed to solve a confidentiality protection problem, the success of this solution is measured by both the degree of protection provided and the user’s ability to reliably estimate scientifically interesting quantities. The latter property of the synthetic data is known as analytical (or statistical) validity. Analytical validity exists when, at a minimum, estimands can be estimated without bias and their confidence intervals (or the nominal level of significance for hypothesis tests) can be stated accurately (Rubin 1987). To verify analytical validity, the confidence intervals surrounding the point estimates obtained from confidential