Administrative Records for Survey Methodology. Group of authors
Survey evaluation covers the use of register data for checking, validating, or assessing survey data, whether they are collected in a sample or a census. This may be done at both individual and aggregate levels. Conversely, using survey estimates for external validation of register-based statistics has been a natural approach from early on (Myrskyla 1991). A quality survey in a census year is another common approach in Scandinavia (Axelson et al. 2020); it is usually directed not at the population coverage errors of the Central Population Register in those countries, but at the various classification and measurement errors in the register data. Or, as mentioned above, survey data are commonly used implicitly to define the processing rules or to assess the accuracy of the register data.
In summary, one can speak of a multisource data perspective for combining register and survey data on at least two different levels. In the wider sense, the uses of both register and survey data can be characterized equally by four broad categories: (i) single-source estimation, (ii) multisource estimation, (iii) frames, and (iv) evaluation. Both can be treated as statistical data and used as such. In a narrower sense, one can greatly extend the scope of “indirect estimation” under the multisource data perspective, where register and survey data may each comprise part of the inputs on an equal footing, provided the proxy variables are present. Indirect estimation will be discussed in more detail in Section 1.3. But first we shall explain below what we mean by proxy variables.
1.1.2 Concept of Proxy Variable
According to Upton and Cook (2008), a proxy variable is “a measured variable that is used in place of a variable that cannot be measured.” We make two observations. Firstly, one may distinguish between the cases where the ideal measure is unobservable in principle and where it is unavailable by chance. For example, per-capita gross domestic product (GDP) is sometimes used as a proxy measure of living standard, where it seems reasonable to acknowledge that the latter is unobservable in principle. For a contrasting example, country of birth can generate a proxy to mother tongue, by referring to the official language in that country. One should think that in this case the ideal measure is unavailable only due to circumstances. Secondly, in order for a proxy to be used in place of the ideal measure, the two should have the same support. Taking the previous example, it is not the birth country that is a proxy to the true mother tongue, but the official language in that country, with the common support of the proxy and ideal measures being all the existing languages in this case.
Zhang (2015a) defines a proxy variable as one that is similar in definition and has the same support as the target variable. It follows that one can regard two variables as proxy to each other, without having to specify one of them to be the target (or ideal) measure. Variables such as age, sex, education, and income can be useful auxiliary variables, but not proxy variables, for the binary International Labour Organization (ILO) unemployment status. In particular, sex is not a proxy despite being binary and thus having the same support as the unemployment status, because the two do not have similar definitions. The binary register-based job-seeker status is a proxy, and the ILO unemployment status does not have to be the ideal measure for every conceivable purpose. But the job-seeker status is not a proxy variable for the activity status defined as (employed, unemployed, and inactive), because the two have different support.
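The two conditions in this definition can be illustrated with a minimal Python sketch. The category codings and the function names `same_support` and `is_proxy` are hypothetical and introduced only for illustration; whether two definitions are "similar" is a subject-matter judgment that the analyst must supply, so it enters here as a plain boolean argument.

```python
# Hypothetical codings of the variables discussed above (illustration only).
ILO_UNEMPLOYED = {0, 1}   # 1 = unemployed according to the ILO definition
JOB_SEEKER = {0, 1}       # 1 = registered as job seeker
SEX = {0, 1}              # an arbitrary binary coding of sex
ACTIVITY_STATUS = {"employed", "unemployed", "inactive"}


def same_support(candidate, target):
    """Return True if both variables take values in the same set."""
    return set(candidate) == set(target)


def is_proxy(candidate, target, similar_definition):
    """A candidate is treated as a proxy only if BOTH conditions hold:
    same support, and a similar definition (analyst's judgment)."""
    return same_support(candidate, target) and similar_definition


# Job-seeker status vs. ILO unemployment: same support, similar definition.
job_seeker_vs_ilo = is_proxy(JOB_SEEKER, ILO_UNEMPLOYED, similar_definition=True)
# Sex vs. ILO unemployment: same (binary) support, but unrelated definition.
sex_vs_ilo = is_proxy(SEX, ILO_UNEMPLOYED, similar_definition=False)
# Job-seeker status vs. three-category activity status: different support.
job_seeker_vs_activity = is_proxy(JOB_SEEKER, ACTIVITY_STATUS, similar_definition=True)
```

In this sketch the support check is mechanical, whereas the similarity of definitions is not; having the same support is thus necessary but not sufficient, as the example of sex shows.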
Proxy variables can arise from survey data. For example, indirect interview yields proxy measures (Thomsen and Villund 2011), where household members respond on behalf of the absentees. Data collected in different modes can be proxy to each other. A variable collected in a census can be proxy to the same variable or a similarly defined one in the postcensal years. Synthetic datasets released for research can contain proxy variables for the target measures, based on which the synthetic ones are modeled and generated. Register data are perhaps the richest source of proxy variables. It is often possible to have both complete coverage and concurrency, or nearly so. As some common examples of register proxy variables one can mention economic activity status, education level, income, family and housing condition, etc. in social statistics; and value-added tax (VAT) based turnover, export and import, house price, animal holding, fishing and hunting figures, arable soils, vegetation, etc. in economic and environmental statistics.
Finally, it is useful to reflect on the relationship between a proxy variable and one that can be affected by measurement errors, since one can always envisage a proxy variable as an attempt to measure the target variable, whether the effort is real or imaginary. Measurement errors are commonly decomposed into two components: random errors and systematic errors. By definition, random errors occur by chance and have zero expectation. Insofar as one considers random measurement errors to be unavoidable and omnipresent, any measured variable can only be a proxy of the ideal measure. In contrast, many proxy variables will remain proxies even when it is acceptable to disregard the potential random errors for practical purposes. Systematic errors due to discrepancy in definition, instrument, time point, etc. are then the cause of the imperfect measure, including when the ideal measure is unobservable in principle. Notice that this interpretation of systematic errors differs from the usage of the term in statistical data editing (de Waal, Pannekoek, and Scholtus 2011), where a systematic error is regarded as an error for which a plausible cause can be detected, and knowledge of the underlying error mechanism then enables a satisfactory treatment in an unambiguous deterministic manner. Some examples of such systematic errors are typographical, measurement unit, or sign errors. In summary, regardless of whether proxy variables may arise due to measurement errors, we are concerned here with the proxy variables that cannot be corrected by data editing methods.
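The distinction between the two error components can be made concrete with a small simulation. The following is only a sketch under stated assumptions: the additive error model, the normal distribution of the random errors, the bias of 0.5, and the function name `simulate_errors` are all hypothetical choices made for illustration and are not taken from the text.

```python
import random


def simulate_errors(n=100_000, bias=0.5, noise_sd=1.0, seed=0):
    """Simulate measurement errors under an assumed additive model:
    error = systematic part (constant bias) + random part (mean zero)."""
    rng = random.Random(seed)
    # Purely random errors: zero expectation by construction.
    random_only = [rng.gauss(0.0, noise_sd) for _ in range(n)]
    # Random errors plus a constant systematic discrepancy.
    with_bias = [bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
    return sum(random_only) / n, sum(with_bias) / n


mean_random, mean_with_bias = simulate_errors()
# mean_random is close to zero: random errors average out over many units.
# mean_with_bias stays close to the assumed bias: the systematic part does
# not average out, however many units are observed.
```

Under these assumptions, averaging over more units reduces the effect of the random component but leaves the systematic component untouched, which is why a variable affected by the latter remains a proxy even when random errors are disregarded.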
1.2 Instances of Proxy Variable
Zhang (2012) presents a two-phase life cycle model of integrated statistical micro data, which provides a total-error framework for combining data from multiple sources. The first phase concerns the respective input data before integration takes place. Here, we consider the instances of proxy variables in relation to the various processing steps and associated error sources at the second, integration phase (Figure 1.1).
1.2.1 Representation
We start with the Representation side in Figure 1.1, which concerns the target population and units. Let us consider coverage error first. For instance, one may have a Population Register that is not sufficiently accurate to allow for direct tabulation of census-like population counts at detailed aggregation levels, so that Population Coverage Surveys are carried out in order to obtain the desired population estimates. The Population Register and Coverage Survey enumerations are proxies of the true population enumeration. This is the situation in Switzerland 2000 (Renaud 2007) and Israel 2008 (Nirel and Glickman 2009). Other instances may involve one or several register enumerations, Census enumeration, and Census Coverage Survey enumeration. Capture–recapture methodology is a commonly used estimation approach that combines two or more proxy enumerations subject to under-counts (Fienberg 1972; Wolter 1986; Hogan 1993). Adjustment of erroneous over-counts has attracted increasing attention recently, in situations where one does not have a Population Register and over-coverage errors are found to be large in the available register enumerations (ONS 2013). See, e.g., Zhang (2015b) for an extension of the capture–recapture modeling approach, Zhang and Dunne (2017) for trimmed dual-system estimation, and Di Cecco et al. (2018) for a latent class modeling approach.
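As an illustration of how two proxy enumerations subject to under-counts can be combined, the following Python sketch computes the basic dual-system (Lincoln-Petersen) estimate. It is a textbook simplification and not the specific method of any of the works cited above: it assumes a closed population, independent captures, error-free linkage between the two lists, and no over-coverage. The function name `dual_system_estimate` and the toy identifiers are hypothetical.

```python
def dual_system_estimate(register_ids, survey_ids):
    """Basic dual-system (Lincoln-Petersen) population size estimate.

    N_hat = n1 * n2 / m, where n1 and n2 are the sizes of the two
    enumerations and m is the number of units found in both.
    """
    register_ids = set(register_ids)
    survey_ids = set(survey_ids)
    matched = len(register_ids & survey_ids)
    if matched == 0:
        raise ValueError("no matched units; the estimator is undefined")
    return len(register_ids) * len(survey_ids) / matched


# Toy data (hypothetical): 8 units in the register, 6 in the coverage
# survey, of which 4 (ids 5 to 8) appear in both enumerations.
register = range(1, 9)    # ids 1..8
survey = range(5, 11)     # ids 5..10
n_hat = dual_system_estimate(register, survey)
```

With these toy lists the estimate is 8 × 6 / 4 = 12, which exceeds the 10 distinct units observed in either list, reflecting the units presumed to be missed by both enumerations. When over-coverage is present, as in the situations described above, this simple estimator is no longer adequate and the extensions cited become relevant.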