Administrative Records for Survey Methodology. Group of authors


along dimensions that are broadly relevant.

      Relatively recently, formal privacy models have emerged from the literature on database security and cryptography. In formal privacy models, the data are distorted by a randomized mechanism prior to publication. The goal is to explicitly characterize, given a particular mechanism, how much private information is leaked to data users.

      Differential privacy is a particularly prominent and useful approach to characterizing formal privacy guarantees. Briefly, a formal privacy mechanism that grants ε-differential privacy places an upper bound, parameterized by ε, on the ability of a user to infer from the published output whether any specific data item, or response, was in the original, confidential data (see Dwork and Roth 2014 for an in-depth discussion).
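      As an illustration of the kind of randomized mechanism described above, the following is a minimal sketch of the Laplace mechanism applied to a counting query. It is not taken from the text: the function names `laplace_noise` and `private_count` and the example data are hypothetical, and the sketch assumes that adding or removing one record changes the count by at most 1 (sensitivity 1), so noise with scale 1/ε is one standard way to obtain ε-differential privacy for that query.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) variate as the difference of two
    independent exponential variates with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Release a noisy count of the records that satisfy `predicate`.

    Assumption: one record changes the count by at most 1, so the
    sensitivity is 1 and the Laplace noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical confidential data: respondent ages.
ages = [52, 67, 71, 58, 80]
noisy = private_count(ages, lambda age: age >= 65, epsilon=0.5)
```

      A smaller ε means a larger noise scale and therefore a tighter bound on what a user can infer about any single record, at the cost of a less accurate published count.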

      Formal privacy models are appealing because they address two key challenges for disclosure limitation. First, formal privacy models by definition provide provable guarantees on how much privacy is lost, in a probabilistic sense, in any given data publication. Second, the privacy guarantee does not require that the implementation details, specifically the parameter ε, be kept secret. This allows researchers using data published under formal privacy models to conduct fully SDL-aware analysis. This is not the case with many traditional disclosure limitation methods, which require that key parameters, such as the swap rate, suppression rate, or variance of noise, not be made available to data users (Abowd and Schmutte 2015).
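      To make the idea of SDL-aware analysis concrete, here is a small sketch of one adjustment that becomes possible when ε is public. It rests on assumptions that are not in the text: each published value is taken to be a confidential value plus independent Laplace noise with scale sensitivity/ε, and the helper names `laplace_noise_variance` and `corrected_variance` are hypothetical. Under those assumptions the noise variance is 2(sensitivity/ε)², so an analyst can subtract it from the variance observed in the published data.

```python
def laplace_noise_variance(epsilon, sensitivity=1.0):
    """Variance of Laplace noise with scale b = sensitivity / epsilon,
    which equals 2 * b**2."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    return 2.0 * scale ** 2


def corrected_variance(observed_variance, epsilon, sensitivity=1.0):
    """Estimate the variance of the confidential values from noisy ones.

    Independent additive noise inflates the variance, in expectation, by
    the noise variance; subtracting it undoes that inflation. The result
    is floored at zero because sample estimates can fall below the
    noise variance.
    """
    noise_variance = laplace_noise_variance(epsilon, sensitivity)
    return max(0.0, observed_variance - noise_variance)


# With epsilon = 1 and sensitivity 1, the noise variance is 2.
estimate = corrected_variance(observed_variance=10.0, epsilon=1.0)
```

      If the noise variance had to be kept secret, as with many traditional methods, this correction could not be computed by the data user.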

      To illustrate the application of new disclosure avoidance techniques, we describe three examples of linked data and the means by which confidentiality protection is applied to each. First, the Health and Retirement Study (HRS) links extensive survey information to respondents’ administrative data from the Social Security Administration (SSA) and the Centers for Medicare and Medicaid Services (CMS). To protect confidentiality in the linked HRS–SSA data, its data custodians use a combination of restrictive licensing agreements, physical security, and restrictions on model output. Our second example is the Census Bureau’s Survey of Income and Program Participation (SIPP), which has also been linked to earnings data from the Internal Revenue Service (IRS) and benefit data from the SSA. The Census Bureau makes the linked data available to researchers as the SIPP Synthetic Beta File (SSB). Researchers can directly access synthetic data via a restricted server and, once their analysis is ready, request output based on the original harmonized confidential data via a validation server. Finally, the Longitudinal Employer-Household Dynamics Program (LEHD) at the Census Bureau links data provided by 51 state administrations to data from federal agencies and to surveys and censuses on businesses, households, and people conducted by the Census Bureau. Tabular summaries of LEHD are published with greater detail than most business and demographic data. The LEHD is accessible in restricted enclaves, but there are also restrictions on the output researchers can release. There are many other linked data sources. These three are each innovative in some fashion and allow us to illustrate the issues faced when devising disclosure avoidance methods for linked data.

      2.3.1 HRS–SSA

      2.3.1.1 Data Description

      The HRS is conducted by the Institute for Social Research at the University of Michigan. Data collection was launched in 1992, and the original sample of respondents has been reinterviewed every two years since then. New cohorts and sample refreshment have made the HRS one of the largest representative longitudinal samples of Americans over 50, with over 26 000 respondents in a given wave (Sonnega and Weir 2014). In 2006, the HRS started collecting measures of physical function, biomarkers, and DNA samples. The collection of these additional sensitive attributes reinforces confidentiality concerns.

      2.3.1.2 Linkages to Other Data

      The CMS maintain claims records for the medical services received by essentially all Americans age 65 and older and by those younger than 65 who receive Medicare benefits. These records include comprehensive information about hospital stays, outpatient services, physician services, home health care, and hospice care. When linked to the HRS interview data, this supplementary information provides far more detail on the health circumstances and medical treatments received by HRS participants than would otherwise be available.

      Data from HRS interviews are also linked to information about respondents’ employers. This improves information on employer-provided benefits, including pensions. While most pension-eligible workers have some idea of the benefits available through their pension plans, they generally are not knowledgeable about detailed provisions of the plans. By linking HRS interview data with detailed information on pension plans, researchers can better understand the contribution of the pension to economic circumstances and the effects of the pension structure on work and retirement decisions.

      HRS data are also linked at the individual level to administrative records from Social Security and Medicare, the Veterans Administration, the National Death Index, and employer-provided pension plan information (Sonnega and Weir 2014).

      2.3.1.3 Disclosure Avoidance Methods

      To ensure privacy and confidentiality, all study participants’ names, addresses, and contact information are maintained in a secure control file (National Institute on Aging and the National Institutes of Health 2017). Anyone with access to identifying information must sign a pledge of confidentiality. The survey data are released to the research community only after undergoing a rigorous process to remove or mask any identifying information. First, a set of sensitive variables (such as state of residence or specific occupation) is suppressed or masked. Next, the remaining variables are tested for any possible identifying content. When testing is complete, the data files are subject to final review and approval by the HRS Data Release Protocol Committee. Data ready for public use are made available to qualified researchers via a secure website. All researchers must register before downloading files for analysis. In addition, use of linked data from other sources, such as Social Security or Medicare records, is strictly controlled under special agreements with specially approved researchers operating in secure computing environments that are periodically audited for compliance.
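      The suppress-or-mask step described above can be sketched in a few lines. This is only an illustration of the general technique, not the HRS procedure: the field names, the set `SUPPRESSED_FIELDS`, the mapping `OCCUPATION_GROUPS`, and the function `mask_record` are all hypothetical.

```python
# Hypothetical field names and groupings, chosen only for illustration.
SUPPRESSED_FIELDS = {"name", "address", "state_of_residence"}
OCCUPATION_GROUPS = {
    "cardiac surgeon": "health care",
    "welder": "production",
}


def mask_record(record):
    """Return a copy of `record` with suppressed fields dropped and the
    specific occupation replaced by a broad group ("other" if unknown)."""
    masked = {
        key: value
        for key, value in record.items()
        if key not in SUPPRESSED_FIELDS
    }
    if "occupation" in masked:
        masked["occupation"] = OCCUPATION_GROUPS.get(
            masked["occupation"], "other"
        )
    return masked


respondent = {
    "name": "A. Example",
    "state_of_residence": "MI",
    "occupation": "welder",
    "birth_year": 1948,
}
public_record = mask_record(respondent)
```

      Suppression removes a variable entirely, while masking keeps a coarser version of it; the later testing and committee review described in the text would still be needed to judge whether the remaining variables are identifying.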

      The HRS uses licensing as its primary method of giving access to restricted files. A license can be secured only after meeting a stringent set of criteria that leads to a contractual agreement between the HRS, the researcher, and the researcher’s employer. The license enables the user to receive restricted files and use them at the researcher’s own institutional facility.

      2.3.2 SIPP–SSA–IRS (SSB)

      2.3.2.1 Data Description

      The SIPP/SSA/IRS Public Use
