Administrative Records for Survey Methodology. Group of authors
Most of these systems only provide access to household and person surveys. Of the known systems surveyed above, only Australia’s RADL systems and the Bank of Italy’s implementation of LISSY (Bruno, D’Aurizio, and Tartaglia-Polcini 2009, 2014) seem to provide access to business microdata through automated remote processing facilities.
2.4.3 Licensing
Users of secure research data centers always sign some form of legally binding user or licensing agreement. These agreements describe acceptable user behavior, such as not copying or photographing screen contents. However, licensing alone may also be used to provide access to restricted-use microdata outside of formal restricted access data centers. In general, the detail in licensed microdata files is greater than in the equivalent (or related) public-use file, and may allow for disclosure of confidential data if inappropriately exploited. For this reason, licensed microdata files tend to have several additional levels of disclosure avoidance methods applied, including output review in some cases. For instance, even without linkages, the HRS licensed files have more detailed geography on respondents (county, say, rather than Census region), but do not have the most detailed geography (GPS coordinates or exact address). Generally, the legally enforceable license imposes restrictions on what can be published by the researchers, and restricts who can access the data, and for what purpose. The contracting organization is the researcher’s university, which is subject to penalties such as loss of eligibility status for research grants if the license is violated.
In the United States, some surveys (NCES, NLSY, and HRS) use licensing to distribute portions of the data they collect on their respondents. Commercial data providers (COMPUSTAT, etc.) also license the data distributed to researchers. Penalties for license infractions range from restricting future research grant funding, for example in HRS, to monetary penalties, for example in commercial data licenses. We are not aware of any studies that quantify the violation rates or financial penalties actually incurred due to license violations. Licensing may be limited by the enforceability of laws or contracts, and thus may be limited to residents of the same jurisdiction in which the data provider is housed. Often, some licensing is combined with the creation of ad-hoc data enclaves, the simplest of these being stand-alone, nonnetworked computer workstations.
2.4.4 Disclosure Avoidance Methods
Data enclaves exist to allow researchers to perform analyses within the restricted environment, and then extract or publish some form of statistical summary that can be released from the secure environment. Generally, these summaries are estimates from a statistical model. In general, model-based output is evaluated in accordance with the same criteria traditionally used for tabular output (minimum number of units within a reporting cell, minimum percentage of global activity within a reporting cell). In contrast to licensing arrangements, which allow researchers to self-monitor, statistical data enclaves have regimented output monitoring, typically by staff of the data provider. Generally, released statistical outputs are registered in some fashion, but documentation of the full provenance chain may be limited.
No systematic attempt has been made, to our knowledge, to measure formally the cumulative privacy impact of model-based releases because the science and technology for doing so are rudimentary. Remote processing facilities, on the other hand, when using automated mechanisms, rely on several practices to reduce the risk of disclosure. First, they limit the scope of possible analyses to those for which the agency has developed safe procedures. The number of times a researcher may request releases may also be limited. Nevertheless, most agencies recognize that this review system does not scale because of the infeasibility of a full accounting of all possible query combinations over time. In general, they apply basic disclosure avoidance techniques such as suppression, perturbation, masking, recoding, and bootstrap sampling of the input data to each project separately. Some systems apply automated analysis of log and output files (Schouten and Cigrang 2003), although often a manual review is also included (O’Keefe et al. 2013). Some systems provide for self-monitored release of model results, either under licensing or remote access. There are also limitations on quantity and frequency of self-released results, combined with sampling by human reviewers. More sophisticated tools, such as perturbation or synthesizing of estimated model parameters, have been proposed (Reiter 2003). Finally, such systems require review of the draft research paper before submission to any publication medium, including online preprint repositories like arXiv.org.
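To make three of the basic techniques named above concrete, the following Python sketch applies recoding (county collapsed to a coarser region), top-coding as a simple form of masking, and bounded multiplicative noise as a simple form of perturbation to a handful of synthetic records. The field names, the county-to-region mapping, the income cap, and the 5 percent noise bound are all hypothetical choices made for illustration; the sketch does not reproduce any agency's actual procedure.

```python
import random


def recode(value, mapping, other="OTHER"):
    """Recoding: replace a detailed category with a coarser one."""
    return mapping.get(value, other)


def top_code(x, cap):
    """Masking by top-coding: values above the cap are reported as the cap."""
    return min(x, cap)


def perturb(x, rng, max_rel_noise=0.05):
    """Perturbation: multiply by a random factor within +/- max_rel_noise."""
    return x * (1.0 + rng.uniform(-max_rel_noise, max_rel_noise))


def protect(records, county_to_region, income_cap, seed=0):
    """Apply recoding, top-coding, and noise to each record.

    The detailed 'county' field is dropped from the output entirely.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    protected = []
    for rec in records:
        protected.append({
            "region": recode(rec["county"], county_to_region),
            "income": perturb(top_code(rec["income"], income_cap), rng),
        })
    return protected


# Synthetic example data (hypothetical counties and incomes).
COUNTY_TO_REGION = {"Tompkins": "Northeast", "Washtenaw": "Midwest"}
RAW = [
    {"county": "Tompkins", "income": 52_000.0},
    {"county": "Washtenaw", "income": 1_000_000.0},
    {"county": "Unlisted", "income": 30_000.0},
]
PROTECTED = protect(RAW, COUNTY_TO_REGION, income_cap=150_000.0)
```

In this sketch the order matters: top-coding is applied before the noise, so an extreme income is first pulled down to the cap and the perturbed value then stays within 5 percent of that cap.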
All three of the examples of linked data provided in this paper rely on some version of secure data enclaves to provide microdata access to approved researchers. HRS data are made available to tenure-track researchers who sign a data use agreement and provide documentation of a secure local computing environment. An additional option for HRS data is to visit the Michigan Center on the Demography of Aging data enclave, which makes data accessible to researchers in a physical data enclave at “headquarters,” like many NSOs. More recently, HRS has started to offer secure VDI access to researchers. The confidential data underlying the SSB, and against which validation requests are run, are also available either within the FSRDC network, or by sending validation requests by email to staff at Census headquarters (a form of “remote processing”). LEHD microdata are only available through the FSRDC.
An open question is whether the disclosure risks addressed through physical security measures are greater for linked data. Enabling researchers to measure heuristic disclosure risk criteria, such as the n cell count or the p-percent rule (O’Keefe et al. 2013), becomes more important when any possible combination of k variables (k large) leads to small cells or dominated cells. Even subject matter experts cannot assess these situations a priori.
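As a rough illustration of why such situations are hard to assess by eye, the following Python sketch scans every cross-tabulation of up to k classifying variables and flags cells that fail a minimum count of n units or the p-percent rule. The p-percent rule is taken here in its common form: a cell is sensitive when its total, minus its two largest contributions, is less than p percent of the largest contribution. The record layout, the function names, and the thresholds (n_min=3, p=10) are assumptions made for illustration, not values used by any particular agency.

```python
from collections import defaultdict
from itertools import combinations


def risky_cells(records, keys, value, n_min=3, p=10.0):
    """Flag cells of one cross-tabulation that fail the n or p-percent rule."""
    cells = defaultdict(list)
    for rec in records:
        cells[tuple(rec[k] for k in keys)].append(rec[value])

    flagged = {}
    for cell, contributions in cells.items():
        reasons = []
        if len(contributions) < n_min:
            reasons.append("n")  # too few units in the cell
        ranked = sorted(contributions, reverse=True)
        x1 = ranked[0]
        x2 = ranked[1] if len(ranked) > 1 else 0.0
        remainder = sum(ranked) - x1 - x2
        if remainder < (p / 100.0) * x1:
            reasons.append("p")  # largest unit could be estimated too closely
        if reasons:
            flagged[cell] = reasons
    return flagged


def scan_all_tables(records, variables, value, max_k, **rules):
    """Check every table built from 1..max_k of the classifying variables."""
    report = {}
    for k in range(1, max_k + 1):
        for keys in combinations(variables, k):
            flagged = risky_cells(records, keys, value, **rules)
            if flagged:
                report[keys] = flagged
    return report


# Synthetic establishments (hypothetical states, industries, and payrolls).
FIRMS = [
    {"state": "A", "industry": "retail", "payroll": 100.0},
    {"state": "A", "industry": "retail", "payroll": 90.0},
    {"state": "A", "industry": "retail", "payroll": 80.0},
    {"state": "A", "industry": "mining", "payroll": 200.0},
    {"state": "B", "industry": "retail", "payroll": 70.0},
    {"state": "B", "industry": "retail", "payroll": 60.0},
    {"state": "B", "industry": "mining", "payroll": 50.0},
    {"state": "B", "industry": "mining", "payroll": 40.0},
]
REPORT = scan_all_tables(FIRMS, ["state", "industry"], "payroll", max_k=2)
```

With these synthetic records, both one-way tables pass, while the two-way table by state and industry contains flagged cells. The number of tables to check grows combinatorially with k, which is one reason an automated scan may be more practical than expert judgment.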
2.4.5 Data Silos
One concern with the increasing move to multiple distinct access points for confidential data is the “siloing” of data. The critical symptom is a physical separation of files in distinct secure data enclaves. The underlying causes are the incompatible legal restrictions on different data. Typically, these restrictions impose administrative barriers to combining data sources for which linking is technically possible.
Such administrative barriers may also be driven by ethical or confidentiality concerns. The question of consent by survey or census respondents may explicitly prevent the linkage of their survey responses or of their biological specimens with other data. For example, the Canadian Census long form of 2006 offered respondents the option to either answer survey questions on earnings, or consent to the linking of their tax data on earnings. In the 2016 census, the question was no longer asked, and users were simply notified that linkage would happen.
In the case of the LEHD data, as of December 2015, all 50 states as well as the District of Columbia had signed agreements with the Census Bureau to share data and produce public-use statistics. It would thus seem possible for researchers to access a comprehensive LEHD jobs database through the FSRDC network, by linking together the job databases from 51 administrative entities. However, all but 12 of the States had declined to automatically extend the right to use the data to external researchers within the FSRDC network. Nevertheless, some of the same states that declined to provide such permission in the FSRDC give access to researchers through their state data centers or other means. The UI state-level data is thus siloed, and researchers may be faced with nonrepresentative data on the American job market. Several European projects, such as Data without Boundaries (DwB), have investigated cross-national access with elevated expectations but relatively limited success (Schiller and Welpton 2014; Bender and Heining 2011). Increasingly, the U.S. Census Bureau and CASD also host data from other data providers, through collaborative agreements, moving