Administrative Records for Survey Methodology. Группа авторов

Чтение книги онлайн.

Читать онлайн книгу Administrative Records for Survey Methodology - Группа авторов страница 22

Administrative Records for Survey Methodology - Группа авторов

Скачать книгу

Synthetic Beta File or SSB, combines variables from the Census Bureau’s SIPP, the IRS individual lifetime earnings data, and the SSA individual benefit data. Aimed at a user community that was primarily interested in national retirement and disability programs, the selection of variables for the proposed SIPP/SSA/IRS-PUF focused on the critical demographic data to be supplied from the SIPP, earnings histories going back to 1937 from the IRS data maintained at SSA, and benefit data from SSA’s master beneficiary records, linked using respondents’ SSNs. After attempting to determine the feasibility of adding a limited number of variables from the SIPP directly to the linked earnings and benefit data, it was decided that the set of variables that could be added without compromising the confidentiality protection of the existing SIPP public use files was so limited that alternative methods had to be used to create a useful new file.

      The existence of SIPP public use files poses a key challenge for disclosure avoidance. To protect the confidentiality of survey respondents, it was deemed necessary to prevent reidentification of a record that appears in the synthetic data against the existing SIPP public use files. Hence, all information regarding the dating of variables whose source was a SIPP response, and not administrative data, has to be made consistent across individuals regardless of the panel and wave from which the response was taken. The public use file contains several variables that were never missing and are not synthesized. These variables are: gender, marital status, spouse’s gender, initial type of Social Security benefits, type of Social Security benefits in 2000, and the same benefit type variables for the spouse. All other variables in the SSB v4 were synthesized.

      The model first imputes any missing data, then synthesizes the completed data (Reiter 2004). For each iteration of the missing data imputation phase and again during the synthesis phase, a joint PPD for all of the required variables is estimated according to the following protocol. At each node of the parent/child tree, a statistical model is estimated for each of the variables at the same level. The statistical model is a Bayesian bootstrap, logistic regression, or linear regression (possibly with transformed inputs). The missing data phase included nine iterations of estimation. The synthetic data phase occurred on the 10th iteration. Four missing data implicates were created. These constitute the completed data files that are the inputs to the synthesis phase. Four synthetic implicates were created for each missing data implicate, for a total of 16 synthetic implicates on the released file. Because copying the final weight to each implicate of the synthetic data would have provided an additional unsynthesized variable with 55 552 distinct values, the disclosure risk associated with the weight variable had to be addressed. A synthetic weight using a PPD based on the Multinomial/Dirichlet natural conjugate likelihood and prior was created.

      2.3.2.3 Disclosure Avoidance Assessment

      The link of administrative earnings, benefits and SIPP data adds a significant amount of information to an already very detailed survey and could pose potential disclosure risks beyond those originally managed as part of the regular SIPP public use file disclosure avoidance process. The synthesis of the earnings data meets the IRS disclosure officer’s criteria for properly protecting the federal tax information found in the summary and detailed earnings histories used to create the longitudinal earnings variables.

      In distance-based matching, records between the harmonized confidential and synthetic data are blocked in a similar way, and distances (or similarity scores) are computed for a given confidential record and every synthetic record within a block. The three closest records are declared matches, and the personid again checked to verify how often a true match is obtained. A putative intruder who treated the closest record as a match would correctly link about 1% of all synthetic records, and less than 3% in the worst-case subgroup (Abowd, Stinson, and Benedetto 2006, p. 8).

Schematic illustration of the probability density function of the ramp distribution used in LEHD disclosure avoidance system.

      2.3.2.4 Analytical Validity Assessment

Скачать книгу