Administrative Records for Survey Methodology. Group of authors
LODES provides aggregated information on where workers are employed (Destinations) and where they live (Origins), along with the characteristics of those places. As the name implies, the data are intended for use in understanding commuting patterns and the nature of local labor markets. The fundamental geographic unit in LODES is the Census block, which is much more detailed than in the QWI, where data are published as county-level aggregates. LODES is tabulated from the same microdata as the QWI, and for workplaces (the destination), uses a variation of the QWI noise infusion technique. Cells that do not meet the publication criteria of the QWI continue to be suppressed in LODES, but are replaced using synthetic data.6 For residences (the origin), the protection system relies on a provably private synthetic data model (Machanavajjhala et al. 2008). A statistical model is built from the data, as the posterior predictive distribution (PPD) of release data X′ given the confidential data X: Pr[X′|X]. Synthetic data points X′ are sampled from the model and released. In general, to satisfy differential privacy (Dwork 2006; Dwork et al. 2006, 2017), the amount of noise that must be injected into the synthetic data model is quite large, typically rendering the releasable data of low utility. The novelty of the LODES protection system was to introduce the concept of “probabilistic differential privacy,” an early variant of what are now called approximate differential privacy systems. By allowing the differential privacy guarantee (parametrized by ε) to fail in certain rare cases (which occur with probability δ), (ε, δ)-probabilistic differential privacy (Machanavajjhala et al. 2008) greatly improves the analytical validity of the data. LODES uses Census tract-to-tract relations to estimate the PPD for the block-to-block model. A unique model is estimated for each block, recovering the likelihood of a place of residence conditional on place of work and characteristics of the workers and the workplaces.
Several additional measures further improve the privacy and analytical validity of the model (see Machanavajjhala et al. 2008 for further details). The resulting privacy-preserving algorithm guarantees ε-differential privacy with ε = 8.99, holding with 99.999 999% confidence (δ = 10⁻⁶).
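As a minimal sketch of sampling synthetic residences from a posterior predictive distribution, the following assumes a Dirichlet-multinomial model for the origins of workers at a single workplace block. The function name, the counts, and the prior weights (standing in for information borrowed from tract-to-tract relations) are hypothetical. Nothing here reproduces the LODES parameters or establishes any (ε, δ) guarantee.

```python
import numpy as np


def sample_synthetic_origins(origin_counts, prior_counts, n_synthetic, rng):
    """Draw synthetic origin counts from a Dirichlet-multinomial PPD.

    origin_counts: confidential counts of workers by residence block (X).
    prior_counts:  positive prior pseudo-counts per residence block
                   (hypothetical; illustration only).
    n_synthetic:   number of synthetic workers to release (X').
    """
    posterior = np.asarray(origin_counts, dtype=float) + np.asarray(
        prior_counts, dtype=float
    )
    # One draw of residence probabilities from the posterior ...
    probs = rng.dirichlet(posterior)
    # ... then synthetic residences sampled given those probabilities.
    return rng.multinomial(n_synthetic, probs)


rng = np.random.default_rng(42)
origin_counts = np.array([12, 0, 3, 0, 5])   # fake confidential data
prior_counts = np.full(origin_counts.size, 0.5)  # assumed prior weights
synthetic = sample_synthetic_origins(
    origin_counts, prior_counts, int(origin_counts.sum()), rng
)
print("confidential:", origin_counts)
print("synthetic:   ", synthetic)
```

Larger prior pseudo-counts pull the synthetic data away from the confidential counts, which is the tension between protection and analytical validity described above.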
2.3.3.3 Disclosure Avoidance Assessment for QWI
The extent of the protection of the QWI microdata can be measured in two ways: by the percentage deviation, which measures the uncertainty about the true value that one can infer from the released value, and by the amount of reallocation of small cells (fewer than five entities in a tabulation cell).7 Each cell underlying the tabulation holds a statistic X_kt, where k is a cell defined by a combination of age, gender, industry, and county, and t ranges over all released time periods for the states available at the time of these experiments.8 The interested reader may find an example assessment in table 1 of Abowd, Schmutte, and Vilhuber (2018) undistorted, unweighted data.
2.3.3.4 Analytical Validity Assessment for QWI
The noise infusion algorithm for QWI is designed to preserve the validity of the data for particular analysis tasks. We demonstrate analytical validity using two statistics: the time-series properties of several distorted estimates relative to their confidential counterparts, and the cross-sectional unbiasedness of the published data for beginning-of-quarter employment B. The unit of analysis is an interior substate geography × industry × age × sex cell kt.9 Analytical validity is obtained when the data display no bias and the additional dispersion due to the confidentiality protection system can be quantified, so that statistical inferences can be adjusted to accommodate it.
Time-Series Properties of Distorted Data
We estimate an AR(1) for the time series associated with each cell kt. For each cell, the error Δr = r − r* is computed, where r and r* are the first-order serial correlation coefficients computed using confidential data and protected data, respectively. Table 2.1 shows the distribution of the errors Δr across SIC-division × county cells, for accessions A, beginning-of-quarter employment B, full-quarter employment F, net job flows JF, and separations S (for additional tables, see Abowd et al. 2012). Table 2.1 shows that the time-series properties of the QWI remain largely unaffected by the distortion. The central tendency of the bias (as measured by the median of the Δr distribution) is never greater than 0.001 in absolute value, and the error distribution is tight: the semi-interquartile range of the distortion for B in Table 2.1 is 0.022, which is less than the precision with which estimated serial correlation coefficients are normally displayed.10 The overall spread of the distribution is slightly higher when considering two-digit SIC × county and three-digit SIC × county cells (not reported here), due to the greater sparsity. The time-series properties of the QWI data are unbiased. The small amount of additional noise in the time-series statistics is, in general, economically meaningless.
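The comparison of r and r* can be sketched on fake data. The example below is a rough illustration rather than the procedure used for Table 2.1. It assumes that each cell aggregates a few establishments, that each establishment carries one fixed multiplicative noise factor over time, and that the lag-1 sample autocorrelation serves as the estimate of the serial correlation coefficient. All parameter values and function names are hypothetical.

```python
import numpy as np


def serial_corr(x):
    """Lag-1 sample autocorrelation of a one-dimensional series."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float(np.dot(d[1:], d[:-1]) / np.dot(d, d))


def semi_iqr(values):
    """Half the distance between the 75th and 25th percentiles."""
    q1, q3 = np.percentile(values, [25, 75])
    return float((q3 - q1) / 2.0)


rng = np.random.default_rng(1)
n_cells, n_estab, n_quarters = 500, 4, 60   # assumed sizes
errors = np.empty(n_cells)

for k in range(n_cells):
    # Fake establishment-level series: AR(1) fluctuations around a level.
    shocks = rng.normal(0.0, 5.0, size=(n_estab, n_quarters))
    series = np.empty((n_estab, n_quarters))
    series[:, 0] = shocks[:, 0]
    for t in range(1, n_quarters):
        series[:, t] = 0.7 * series[:, t - 1] + shocks[:, t]
    series += rng.uniform(20.0, 200.0, size=(n_estab, 1))

    # One time-invariant factor per establishment, bounded away from 1
    # (bounds of 5% and 15% are illustrative assumptions).
    sign = rng.choice([-1.0, 1.0], size=(n_estab, 1))
    factor = 1.0 + sign * rng.uniform(0.05, 0.15, size=(n_estab, 1))

    r = serial_corr(series.sum(axis=0))                  # confidential
    r_star = serial_corr((factor * series).sum(axis=0))  # protected
    errors[k] = r - r_star

print("median of errors:", np.median(errors))
print("semi-interquartile range:", semi_iqr(errors))
```

Because the serial correlation coefficient is invariant to rescaling, a cell made up of a single establishment with a fixed factor has an error of exactly zero in this setup; errors arise only from the reweighting of establishments within a cell.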
Cross-sectional Unbiasedness of the Distorted Data
The distribution of the infused noise is symmetric, and allocation of the noise factors is random. The data distribution resulting from the noise infusion should thus be unbiased. We compute the bias ΔX in each cell kt, expressed in percentage terms.
Table 2.1 Distribution of errors Δr in first-order serial correlation, QWI.
Variable | Median | Semi-interquartile range |
---|---|---|
Accessions | −0.000 542 | 0.026 314 |
Beginning-of-quarter employment | 0.000 230 | 0.021 775 |
Full-quarter employment | 0.000 279 | 0.018 830 |
Net job flows | −0.000 025 | 0.002 288 |
Separations | 0.000 797 | 0.025 539 |
Evidence of unbiasedness is provided by Figure 2.2, which shows the distribution of the bias for X = B.11 The distribution of ΔB has most of its mass around the mode at 0%. Also, as is to be expected, secondary spikes are present around ±c, the inner bound of the noise distribution.
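The shape described here can be reproduced on fake data. The sketch below makes several assumptions that are not taken from the text: the noise factors are 1 ± m, with m drawn from a ramp-shaped density that is highest at the inner bound c and falls to zero at an outer bound; the bounds are set to 5% and 15%; and the percentage bias is computed as 100 × (X − X*) / X, following the sign convention of Δr. All names are hypothetical.

```python
import numpy as np


def draw_noise_factors(n, inner, outer, rng):
    """Draw n multiplicative factors 1 +/- m, with inner <= m <= outer.

    m follows a density that declines linearly from `inner` to `outer`
    (inverse-CDF sampling); the sign is chosen at random.
    """
    u = rng.random(n)
    magnitude = outer - (outer - inner) * np.sqrt(1.0 - u)
    sign = rng.choice([-1.0, 1.0], size=n)
    return 1.0 + sign * magnitude


def percent_bias(confidential, protected):
    """Percentage bias, confidential minus protected (assumed convention)."""
    return 100.0 * (confidential - protected) / confidential


rng = np.random.default_rng(0)
n_estab, n_cells = 20_000, 5_000
inner, outer = 0.05, 0.15                      # assumed bounds c and d

cell = rng.integers(0, n_cells, size=n_estab)  # cell of each establishment
employment = rng.integers(1, 200, size=n_estab).astype(float)
factor = draw_noise_factors(n_estab, inner, outer, rng)

n_per_cell = np.bincount(cell, minlength=n_cells)
true_total = np.bincount(cell, weights=employment, minlength=n_cells)
prot_total = np.bincount(cell, weights=employment * factor, minlength=n_cells)

occupied = n_per_cell > 0
bias = percent_bias(true_total[occupied], prot_total[occupied])
single = n_per_cell[occupied] == 1             # one-establishment cells

print("mean bias (%):", bias.mean())
print("smallest |bias| in single-establishment cells (%):",
      np.abs(bias[single]).min())
```

In this setup, cells with many establishments average out their factors and pile up near 0%, while cells with a single establishment can never be distorted by less than the inner bound, which is one way secondary spikes near ±c can arise.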
Box 2.2 Sidebox: Do-It-Yourself Noise Infusion
The interested user might consult a simple example (with fake data)