Many social scientists believe that suppression can be complemented by restricted access agreements that allow the researcher to use all of the confidential data but limit what can be published from the analysis. Such a strategy is not a complete solution because SDL must still be applied to the output of the analysis, which quickly brings the problem of which output to suppress back to the forefront.
Custom tabulations and data enclaves. Another traditional response by data custodians to researchers' demand for more extensive and detailed summaries of confidential data was to create a custom tabulation: a table not previously published, generated by data custodian staff with access rights to the confidential data and typically subject to the same suppression rules. As these requests increased, the tabulation and analysis work was offloaded onto researchers by providing them with access to protected microdata. This approach has expanded rapidly in the last two decades and is widely used around the world. We discuss it in detail later in this chapter.
Coarsening is a method for protecting data that involves mapping confidential values into broader categories. The simplest method is a histogram, which maps values into (fixed) intervals. Intuitively, the broader the interval, the more protection is provided.
Sampling is a protection mechanism that can be applied either at the collection stage or at the data publication stage. At the collection stage, it is a natural part of conducting surveys. In combination with coarsening and the use of statistical weights, the basic idea is simple: if a table cell is based on only a few sampled individuals who collectively represent the underlying population, then statistical inference will not reveal the attributes of any particular individual with any precision, as long as the identity of the sampled individuals is not revealed. Both coarsening and sampling underlie the release of public use microdata samples.
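To make the mechanics concrete, the following Python sketch coarsens a hypothetical income variable into fixed bins and then releases only a weighted subsample; the bin boundaries, sampling rate, and variable names are illustrative assumptions, not any agency's actual rules.

```python
import pandas as pd

# Hypothetical confidential microdata: one record per person.
confidential = pd.DataFrame({
    "person_id": range(1, 11),
    "income": [8_000, 12_500, 19_000, 23_000, 31_000,
               47_000, 52_000, 68_000, 91_000, 140_000],
    "county": ["A", "A", "B", "B", "B", "C", "C", "C", "D", "D"],
})

# Coarsening: map detailed incomes into broad, fixed intervals.
bins = [0, 10_000, 25_000, 50_000, 100_000, float("inf")]
labels = ["<10k", "10-25k", "25-50k", "50-100k", "100k+"]
confidential["income_band"] = pd.cut(confidential["income"],
                                     bins=bins, labels=labels)

# Sampling: release only a subsample, carrying a weight that sums back to
# the population, so weighted estimates remain design-unbiased while the
# identities of the sampled individuals stay unknown.
rate = 0.3
puf = confidential.sample(frac=rate, random_state=42).copy()
puf["weight"] = 1 / rate
puf = puf.drop(columns=["person_id", "income"])  # drop identifiers and detail

print(puf)
```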
2.2.1 Input Noise Infusion
Protection mechanisms for microdata are often similar in spirit, though not in their details, to the methods employed for tabular data. Consider coarsening, in which a detailed response to a question (say, about income) is classified into a much smaller set of bins (for instance, income categories such as “[10 000; 25 000]”). In fact, many tables can be viewed as a coarsening of the underlying microdata, followed by a count of the coarsened cases.
Many microdata methods are based on input noise infusion: distorting the value of some or all of the inputs before any publication data are built. The Census Bureau uses this technique before building publication tables for many of its business establishment products and in the American Community Survey (ACS) publications, and we will discuss it in more detail for one of those data products later in this chapter. The noise infusion parameters can be set such that all of the published statistics are formally unbiased – the expected value of the published statistic equals the value of the confidential statistic with respect to the probability distribution of the infused noise – or nearly so. Hence, the disclosure risk and data quality can be conveniently summarized by two parameters: one measuring the absolute distortion in the data inputs and the other measuring the mean squared error of publication statistics (either overall for censuses or relative to the undistorted survey estimates).
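A minimal sketch of the idea, under an assumed multiplicative noise model (the distribution, the parameter delta, and the variable names are illustrative, not the Census Bureau's actual noise model): each input is distorted by a random factor with expectation one, so published totals are unbiased, and the two summary parameters correspond to the minimum distortion delta and the simulated mean squared error of the published statistic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical confidential establishment payrolls.
payroll = np.array([120., 340., 95., 510., 760., 45., 230.])

def infuse(values, delta=0.10):
    """Multiply each value by a factor drawn uniformly from
    [1 - 2*delta, 1 - delta] or [1 + delta, 1 + 2*delta], chosen with
    equal probability. The factor has expectation 1, so published totals
    are unbiased; delta controls the minimum absolute distortion."""
    sign = rng.choice([-1.0, 1.0], size=values.shape)
    factor = 1.0 + sign * rng.uniform(delta, 2 * delta, size=values.shape)
    return values * factor

# Disclosure-risk parameter: every input is distorted by at least delta.
# Data-quality parameter: mean squared error of the published total,
# approximated here by simulation.
reps = [infuse(payroll).sum() for _ in range(10_000)]
print("confidential total:", payroll.sum())
print("mean published total:", np.mean(reps))   # close to the confidential total
print("MSE of published total:", np.var(reps))  # quality measure
```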
From the viewpoint of the empirical social sciences, however, all input distortion systems with the same risk-quality parameters are not equivalent. In a regression discontinuity design, for example, there will now be a window around the break point in the running variable that reflects the uncertainty associated with the noise infusion. If the effect is not large enough, it will be swamped by noise even though all the inputs to the analysis are unbiased, or nearly so. Once again, using the unmodified confidential data via a restricted access agreement does not completely solve the problem: once the noisy data have been published, the agency has to consider the consequences of allowing publication of a clean regression discontinuity estimate, since a plot of the unprotected outcomes against the running variable could be compared with the corresponding plot produced from the public, noisy data.
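The point can be illustrated with a small simulation, under assumed values for the discontinuity, the noise scale, and a deliberately naive difference-in-means estimator; none of these choices come from the text above.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
true_jump = 0.5

# Running variable and an outcome that jumps by `true_jump` at x = 0.
x = rng.uniform(-1, 1, n)
y = true_jump * (x >= 0) + rng.normal(0, 0.5, n)

# Input noise infusion on the running variable (illustrative scale).
x_noisy = x + rng.normal(0, 0.1, n)

def rd_estimate(running, outcome, bandwidth=0.05):
    """Naive RD estimate: difference in mean outcomes just above
    versus just below the cutoff."""
    above = outcome[(running >= 0) & (running < bandwidth)].mean()
    below = outcome[(running < 0) & (running > -bandwidth)].mean()
    return above - below

print("confidential data: ", rd_estimate(x, y))        # close to 0.5
print("noise-infused data:", rd_estimate(x_noisy, y))  # strongly attenuated:
# records near the cutoff are shuffled across it, blurring the discontinuity
# even though each infused value is unbiased.
```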
An even more invasive input noise technique is data swapping. Sensitive data records (usually households) are identified based on a priori criteria. Then, sensitive records are compared to “nearby” records on the basis of a few variables. If there is a match, the values of some or all of the other variables are swapped (usually the geographic identifiers, thus effectively relocating the records in each other's location). The formal theory of data swapping was developed shortly after the theory of primary/complementary suppression (Dalenius and Reiss 1982, first presented at the American Statistical Association (ASA) meetings in 1978). Basically, the marginal distribution of the variables used to match the records is preserved at the cost of all joint and conditional distributions involving the swapped variables. In general, very little is published about the swapping rates, the matching variables, or the definition of “nearby,” making analysis of the effects of this protection method very difficult. Furthermore, even arrangements that permit restricted access to the confidential files still require the use of the swapped data; some providers destroy the unswapped data. Data swapping is used by the Census Bureau, NCHS, and many other agencies (FCSM 2005). The Census Bureau does not allow analysis of the unswapped decennial census and ACS data except under extraordinary circumstances, which usually involve preparing linked data from outside sources and then reimposing the original swap (so that the records acquire the correct linked information, but the geographies are swapped according to the original algorithm before any analysis is performed). NCHS allows the use of unswapped data in its restricted access environment but prohibits publication of most subnational geographies when the research is published.
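A stylized sketch of targeted swapping follows; the sensitivity flag, the matching variables, and the choice of swapping partner are illustrative assumptions, since, as noted, the actual rules are rarely published.

```python
import pandas as pd

households = pd.DataFrame({
    "hh_id":     [1, 2, 3, 4],
    "county":    ["A", "B", "A", "B"],             # geography to be swapped
    "size":      [4, 4, 2, 2],                     # matching variable
    "tenure":    ["own", "own", "rent", "rent"],   # matching variable
    "income":    [250_000, 60_000, 30_000, 32_000],
    "sensitive": [True, False, False, False],      # a priori risk flag
})

match_vars = ["size", "tenure"]

# For each sensitive record, look for a "nearby" record in a different
# county that agrees on the matching variables, then swap geographies.
for i in households.index[households["sensitive"]]:
    candidates = households[
        (households["county"] != households.at[i, "county"])
        & (households[match_vars] == households.loc[i, match_vars]).all(axis=1)
    ]
    if not candidates.empty:
        j = candidates.index[0]
        households.loc[[i, j], "county"] = households.loc[[j, i], "county"].values

# Marginals of the matching variables are unchanged, but joint and
# conditional distributions involving county are distorted for swapped records.
print(households)
```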
The basic problem for empirical social scientists is that agencies must have a general purpose data publication strategy in order to provide the public good that is the reason for incurring the cost of data collection in the first place. But this publication strategy inherently advantages certain analyses over others. Statisticians and computer scientists have developed two related ways to address this problem: synthetic data combined with validation servers and privacy-protected query systems. Statisticians define “synthetic data” as samples from the joint probability distribution of the confidential data that are released for analysis. After the researcher analyzes the synthetic data, the validation server is used to repeat some or all of the analyses on the underlying confidential data. Conventional SDL methods are used to protect the statistics released from the validation server.
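The following toy sketch shows the division of labor: synthetic records are drawn from a model fitted to the confidential data, the researcher's analysis runs on the synthetic file, and a validation server would re-run it on the confidential file before an SDL-checked release. The multivariate normal model and the variables are illustrative assumptions; real synthesizers are far more elaborate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical confidential data: log income and years of schooling.
confidential = rng.multivariate_normal(
    mean=[10.5, 13.0], cov=[[0.6, 0.3], [0.3, 4.0]], size=5_000
)

# "Synthesizer": estimate the joint distribution from the confidential
# data, then release draws from the fitted model instead of real records.
mu_hat = confidential.mean(axis=0)
sigma_hat = np.cov(confidential, rowvar=False)
synthetic = rng.multivariate_normal(mu_hat, sigma_hat, size=5_000)

# Researcher's analysis on the synthetic file, e.g. the slope from
# regressing log income on schooling ...
slope_syn = np.polyfit(synthetic[:, 1], synthetic[:, 0], 1)[0]

# ... which a validation server would re-run on the confidential data
# and release only after conventional SDL checks.
slope_conf = np.polyfit(confidential[:, 1], confidential[:, 0], 1)[0]
print(slope_syn, slope_conf)
```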
2.2.2 Formal Privacy Models
Computer scientists define a privacy-protected query system as one in which all analyses of the confidential data are passed through a noise-infusion filter before they are published. Some of these systems use input noise infusion – the confidential data are permanently altered at the record level, and then all analyses are done on the protected data. Other formally private systems apply output noise infusion to the results of statistical analyses before they are released.
All formal privacy models define a cumulative, global privacy loss associated with all of the publications released from a given confidential database. This is called the total privacy-loss budget. The budget can then be allocated to each of the released queries. Once the budget is exhausted, no more analysis can be conducted. The researcher must decide how much of the privacy-loss budget to spend on each query – producing noisy answers to many queries or sharp answers to a few. The agency must decide the total privacy-loss budget for all queries and how to allocate it among competing potential users.
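In the simplest formally private setting, the mechanics look like the following sketch of a Laplace mechanism for counting queries; the total budget, its allocation across queries, and the example data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical confidential record-level data: 1 = has the attribute.
records = rng.integers(0, 2, size=1_000)

TOTAL_EPSILON = 1.0   # total privacy-loss budget set by the agency
remaining = TOTAL_EPSILON

def noisy_count(data, epsilon):
    """Answer a counting query under epsilon-differential privacy by adding
    Laplace noise with scale 1/epsilon (the sensitivity of a count is 1)."""
    global remaining
    if epsilon > remaining:
        raise RuntimeError("privacy-loss budget exhausted")
    remaining -= epsilon
    return data.sum() + rng.laplace(scale=1.0 / epsilon)

# The researcher decides how to spend the budget: a sharp answer to one
# query, or noisier answers to several.
print(noisy_count(records, epsilon=0.5))    # relatively precise
print(noisy_count(records, epsilon=0.25))   # noisier
print(noisy_count(records, epsilon=0.25))   # budget now exhausted
# Any further query would raise an error: the cumulative, global loss is capped.
```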
An increasing number of modern SDL and formal privacy procedures replace methods like deterministic suppression and targeted random swapping with some form of noisy query system. Over the last decade these approaches have moved to the forefront because they provide the agency with a formal method of quantifying the global disclosure risk in the output and of evaluating