Administrative Records for Survey Methodology. Group of authors


once an item has been designated for either primary or complementary suppression, it would disappear from the release tables until the entire product is redesigned.

      Many social scientists believe that suppression can be complemented by restricted access agreements that allow the researcher to use all of the confidential data but limit what can be published from the analysis. Such a strategy is not a complete solution because SDL must still be applied to the output of the analysis, which quickly brings the problem of which output to suppress back to the forefront.

      Custom tabulations and data enclaves. Another traditional response by data custodians to researchers' demand for more extensive and detailed summaries of confidential data was to create a custom tabulation: a table not previously published, generated by data custodian staff with access rights to the confidential data, and typically subject to the same suppression rules. As these requests increased, the tabulation and analysis work was offloaded onto researchers by providing them with access to protected microdata. This approach has expanded rapidly in the last two decades and is widely used around the world. We discuss it in detail later in this chapter.

      Coarsening is a method for protecting data that involves mapping confidential values into broader categories. The simplest method is a histogram, which maps values into (fixed) intervals. Intuitively, the broader the interval, the more protection is provided.

      2.2.1 Input Noise Infusion

      Protection mechanisms for microdata are often similar in spirit, though not in their details, to the methods employed for tabular data. Consider coarsening, in which a more detailed response to a question (say, about income) is classified into a much smaller set of bins (for instance, income categories such as “[10 000; 25 000]”). In fact, many tables can be viewed as a coarsening of the underlying microdata, with a subsequent count of the coarsened cases.
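      The view of a table as "coarsen, then count" can be sketched in a few lines of Python. This is only an illustration: the bin edges, the sample incomes, and the function names `coarsen` and `tabulate` are assumptions made for the example, and half-open intervals are used instead of the closed interval quoted above so that every value falls into exactly one category.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical lower bin edges for income categories (not taken from the text).
BIN_EDGES = [0, 10_000, 25_000, 50_000, 100_000]

def coarsen(income, edges=BIN_EDGES):
    """Map an exact income to a half-open interval label such as '[10000, 25000)'."""
    if income < edges[0]:
        raise ValueError("income below the lowest bin edge")
    i = bisect_right(edges, income) - 1          # index of the bin's lower edge
    lower = edges[i]
    if i + 1 < len(edges):
        return f"[{lower}, {edges[i + 1]})"
    return f"[{lower}, inf)"                     # open-ended top category

def tabulate(incomes, edges=BIN_EDGES):
    """Coarsen each microdata record, then count records per category."""
    return Counter(coarsen(x, edges) for x in incomes)

# Made-up microdata records.
incomes = [8_500, 12_000, 24_999, 25_000, 61_000, 250_000]
table = tabulate(incomes)
for category, count in sorted(table.items(), key=lambda kv: BIN_EDGES.index(int(kv[0][1:].split(",")[0]))):
    print(category, count)
```

      Widening the intervals in `BIN_EDGES` merges categories, which is the sense in which broader intervals provide more protection at the cost of detail.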

      Many microdata methods are based on input noise infusion: distorting the value of some or all of the inputs before any publication data are built. The Census Bureau uses this technique before building publication tables for many of its business establishment products and in the American Community Survey (ACS) publications, and we will discuss it in more detail for one of those data products later in this chapter. The noise infusion parameters can be set such that all of the published statistics are formally unbiased – the expected value of the published statistic equals the value of the confidential statistic with respect to the probability distribution of the infused noise – or nearly so. Hence, the disclosure risk and data quality can be conveniently summarized by two parameters: one measuring the absolute distortion in the data inputs and the other measuring the mean squared error of publication statistics (either overall for censuses or relative to the undistorted survey estimates).
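      A minimal sketch of input noise infusion follows, assuming a multiplicative noise factor of the form 1 ± δ with δ drawn uniformly from a hypothetical interval [a, b]. This particular noise distribution, the parameter values, and the establishment data are illustrative assumptions, not the specification of any agency's system. Because the sign is chosen with equal probability, the factor has mean 1, so a total built from the distorted inputs is unbiased with respect to the noise distribution; the lower bound a controls the minimum distortion of each input, and the mean squared error summarizes the quality of the published statistic.

```python
import random

def draw_noise_factor(rng, a=0.05, b=0.15):
    """Draw a multiplicative factor 1 + delta or 1 - delta, delta ~ Uniform[a, b].

    The sign is chosen with equal probability, so the factor has mean 1,
    yet it is never closer than `a` to 1: every input is distorted.
    """
    delta = rng.uniform(a, b)
    sign = 1 if rng.random() < 0.5 else -1
    return 1 + sign * delta

def noisy_total(values, rng, a=0.05, b=0.15):
    """Distort each input first, then build the statistic from the noisy inputs."""
    return sum(v * draw_noise_factor(rng, a, b) for v in values)

rng = random.Random(12345)
employment = [120, 45, 300, 15]      # made-up establishment sizes
true_total = sum(employment)

# Repeating the noise draw many times approximates the expectation over the
# noise distribution. An agency would draw the noise once and publish one value.
replications = 20_000
draws = [noisy_total(employment, rng) for _ in range(replications)]
mean_published = sum(draws) / replications
mse = sum((d - true_total) ** 2 for d in draws) / replications

print(f"confidential total: {true_total}")
print(f"average published total: {mean_published:.2f}")
print(f"mean squared error: {mse:.1f}")
```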

      From the viewpoint of the empirical social sciences, however, not all input distortion systems with the same risk-quality parameters are equivalent. In a regression discontinuity design, for example, there will now be a window around the break point in the running variable that reflects the uncertainty associated with the noise infusion. If the effect is not large enough, it will be swamped by noise even though all the inputs to the analysis are unbiased, or nearly so. Once again, using the unmodified confidential data via a restricted access agreement does not completely solve the problem. Once the noisy data have been published, the agency has to consider the consequences of allowing the publication of a clean regression discontinuity design estimate, because the plot of the unprotected outcomes versus the running variable can then be compared to the similar plot produced from the public noisy data.
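      The regression discontinuity problem can be illustrated with a small simulation. Everything in it is an assumption made for the example: the cutoff, bandwidth, effect size, simulated data, and the multiplicative noise applied only to the running variable. The estimator is a crude difference in local means on either side of the cutoff, not a full local-linear regression discontinuity estimator. The point is only that unbiased noise in the running variable moves units across the break point and blurs the jump.

```python
import random
from statistics import mean

rng = random.Random(2024)
CUTOFF, BANDWIDTH, EFFECT = 50.0, 5.0, 1.0   # all hypothetical values

def noise_factor():
    """Mean-one multiplicative noise: 1 + delta or 1 - delta, delta ~ Uniform[0.05, 0.15]."""
    delta = rng.uniform(0.05, 0.15)
    return 1 + delta if rng.random() < 0.5 else 1 - delta

def local_jump(running, outcome):
    """Difference in mean outcomes just above and just below the cutoff."""
    right = [y for x, y in zip(running, outcome) if CUTOFF <= x < CUTOFF + BANDWIDTH]
    left = [y for x, y in zip(running, outcome) if CUTOFF - BANDWIDTH <= x < CUTOFF]
    return mean(right) - mean(left)

n = 20_000
x_true = [rng.uniform(0, 100) for _ in range(n)]
# Treatment is assigned on the confidential running variable.
y = [EFFECT * (x >= CUTOFF) + rng.gauss(0, 0.5) for x in x_true]
# The public file carries a noise-infused running variable.
x_public = [x * noise_factor() for x in x_true]

jump_confidential = local_jump(x_true, y)
jump_public = local_jump(x_public, y)

print(f"jump estimated on confidential data: {jump_confidential:.2f}")
print(f"jump estimated on public noisy data: {jump_public:.2f}")
```

      In this setup the jump estimated from the public data is much smaller than the one estimated from the confidential data, because units on both sides of the true cutoff are mixed within the window created by the noise.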

      The basic problem for empirical social scientists is that agencies must have a general purpose data publication strategy in order to provide the public good that is the reason for incurring the cost of data collection in the first place. But this publication strategy inherently advantages certain analyses over others. Statisticians and computer scientists have developed two related ways to address this problem: synthetic data combined with validation servers, and privacy-protected query systems. Statisticians define “synthetic data” as samples from the joint probability distribution of the confidential data that are released for analysis. After the researcher analyzes the synthetic data, the validation server is used to repeat some or all of the analyses on the underlying confidential data. Conventional SDL methods are used to protect the statistics released from the validation server.
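      The synthetic data and validation server workflow might be sketched as follows. The sketch makes strong simplifying assumptions: a single variable, a fitted normal distribution standing in for the joint distribution of the confidential data, and rounding plus a minimum record count standing in for conventional SDL. The data and the names `release_synthetic` and `validate` are hypothetical.

```python
from statistics import NormalDist, mean

# Made-up confidential values; in this sketch they never leave the custodian.
_confidential = [31_200, 48_750, 27_300, 95_400, 52_100, 39_900, 61_250, 44_800]

def release_synthetic(n, seed=0):
    """Fit a simple model to the confidential data and release draws from it."""
    model = NormalDist.from_samples(_confidential)
    return model.samples(n, seed=seed)

def validate(analysis, min_records=5, round_to=100):
    """Hypothetical validation server.

    Reruns the researcher's `analysis` on the confidential data, then applies
    stand-in SDL rules (a minimum record count and rounding) to the output.
    """
    if len(_confidential) < min_records:
        return None                               # suppress the result
    return round(analysis(_confidential) / round_to) * round_to

# Researcher side: develop the analysis on the public synthetic file.
synthetic = release_synthetic(1000)
estimate_on_synthetic = mean(synthetic)

# Custodian side: repeat the same analysis on the confidential data.
validated_estimate = validate(mean)

print(f"estimate from synthetic data: {estimate_on_synthetic:.0f}")
print(f"validated, SDL-protected estimate: {validated_estimate}")
```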

      2.2.2 Formal Privacy Models

      All formal privacy models define a cumulative, global privacy loss associated with all of the publications released from a given confidential database. This is called the total privacy-loss budget. The budget can then be allocated to each of the released queries. Once the budget is exhausted, no more analysis can be conducted. The researcher must decide how much of the privacy-loss budget to spend on each query – producing noisy answers to many queries or sharp answers to a few. The agency must decide the total privacy-loss budget for all queries and how to allocate it among competing potential users.
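      The budget accounting described above can be sketched with a toy query system. The passage does not name a specific mechanism, so the sketch assumes one standard choice from differential privacy: counting queries answered with Laplace noise of scale 1/ε, with the privacy loss of successive queries simply added up. The class name, the data, and the budget values are hypothetical.

```python
import random

class BudgetExhausted(Exception):
    """Raised when a query would exceed the total privacy-loss budget."""

class PrivateQuerySystem:
    def __init__(self, records, total_epsilon, seed=None):
        self._records = list(records)
        self.remaining = total_epsilon            # unspent privacy-loss budget
        self._rng = random.Random(seed)

    def noisy_count(self, predicate, epsilon):
        """Answer a counting query, spending `epsilon` of the budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining + 1e-12:      # small tolerance for float error
            raise BudgetExhausted("privacy-loss budget exhausted")
        self.remaining = max(0.0, self.remaining - epsilon)
        true_count = sum(1 for r in self._records if predicate(r))
        # The difference of two Exponential(rate=epsilon) draws is Laplace
        # noise with scale 1/epsilon; a count changes by at most 1 per record.
        noise = self._rng.expovariate(epsilon) - self._rng.expovariate(epsilon)
        return true_count + noise

incomes = [8_500, 12_000, 24_999, 25_000, 61_000, 250_000]   # made-up data
system = PrivateQuerySystem(incomes, total_epsilon=1.0, seed=7)

# A larger epsilon buys a sharper answer; a smaller epsilon gives a noisier one.
sharp = system.noisy_count(lambda x: x >= 25_000, epsilon=0.8)
noisy = system.noisy_count(lambda x: x < 10_000, epsilon=0.2)

try:
    third = system.noisy_count(lambda x: x >= 100_000, epsilon=0.1)
except BudgetExhausted:
    third = None                                  # no budget left for more queries

print(sharp, noisy, third, system.remaining)
```

      The researcher's trade-off shows up in the choice of `epsilon` per query, while the agency's decision corresponds to `total_epsilon` and to how that total is divided among users.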

      An increasing number of modern SDL and formal privacy procedures replace methods like deterministic suppression and targeted random swapping with some form of noisy query system. Over the last decade these approaches have moved to the forefront because they provide the agency with a formal method of quantifying the global disclosure risk in the output and of evaluating
