Administrative Records for Survey Methodology. Group of authors
Source: Zhang (2012).
All the methods mentioned above require matching records across separate sources. In practice, linkage errors may be unavoidable unless a unique identifier exists in the different files and allows exact matching. A linkage error occurs if a pair of linked records do not actually belong to the same entity, or if records that belong to the same entity fail to be linked. Both multi-pass deterministic and probabilistic record linkage procedures are common in practice and are often used in tandem; see, e.g. Fellegi and Sunter (1969) and Herzog, Scheuren, and Winkler (2007). The records in different files are compared with each other in terms of key variables such as name, birth date, and address. One can regard a concatenated string of key variables as a proxy for the true identifier, insofar as the key variables involved could in principle lead to unique combinations. Distortion of the key variables would then result in erroneous proxy identifiers and potentially cause linkage errors. For population size estimation, see, e.g. Di Consiglio and Tuoto (2015) for a study of linkage errors and dual system estimation, and Zhang and Dunne (2017) for a discussion of trimmed dual-system estimation. More generally, since record linkage is a prerequisite for combining multisource data at the individual level, linkage errors due to imperfect proxy identifiers can be relevant in many other situations.
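The idea of a concatenated proxy identifier, and the two kinds of linkage error it can produce, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the field names, the normalisation rule, the toy records, and the `entity` field, which stands for the true identity that would not be observable in practice. It shows single-pass exact matching only, not the multi-pass or probabilistic procedures cited above.

```python
def proxy_id(record):
    """Concatenate normalised key variables into a proxy identifier (hypothetical rule)."""
    parts = (record["name"], record["birth_date"], record["address"])
    # Lower-case and collapse whitespace so trivial differences do not block a match.
    return "|".join(" ".join(p.lower().split()) for p in parts)


def link(file_a, file_b):
    """Link records of file_a to records of file_b that share the same proxy identifier."""
    index_b = {}
    for j, rec in enumerate(file_b):
        index_b.setdefault(proxy_id(rec), []).append(j)
    return [(i, j) for i, rec in enumerate(file_a) for j in index_b.get(proxy_id(rec), [])]


# Toy files; "entity" is the true identity, used here only to evaluate the links.
file_a = [
    {"entity": 1, "name": "Ann Lee", "birth_date": "1980-01-02", "address": "1 Main St"},
    {"entity": 2, "name": "Bob Ray", "birth_date": "1975-05-06", "address": "2 Oak Ave"},
    {"entity": 3, "name": "Cy Dunn", "birth_date": "1990-09-09", "address": "3 Elm Rd"},
]
file_b = [
    {"entity": 1, "name": "ann  lee", "birth_date": "1980-01-02", "address": "1 Main St"},
    {"entity": 2, "name": "Bob Rey", "birth_date": "1975-05-06", "address": "2 Oak Ave"},  # distorted name
    {"entity": 4, "name": "Cy Dunn", "birth_date": "1990-09-09", "address": "3 Elm Rd"},   # other entity, same keys
]

links = link(file_a, file_b)

# Linked pairs that do not belong to the same entity.
false_links = [(i, j) for i, j in links if file_a[i]["entity"] != file_b[j]["entity"]]

# Pairs that belong to the same entity but were not linked.
true_pairs = {
    (i, j)
    for i, a in enumerate(file_a)
    for j, b in enumerate(file_b)
    if a["entity"] == b["entity"]
}
missed_links = sorted(true_pairs - set(links))
```

In this toy example the distorted name causes a missed link, while the non-unique combination of key variables causes a false link.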
In a frame constructed by combining multiple population datasets, one can often find several related classification variables. Identification errors (Zhang 2012) arise if the classification variables, or the relationships between them, are determined incorrectly from the input datasets. For instance, the variable address is central for population and household statistics. Multiple addresses can be collected by combining the Population Register with resident address, the Post Register with postal address, the Higher-Education Student Register with term-time address, the various Utility datasets with occupant address, etc. Each person may be assigned a unique de jure address based on all these sources, in the way judged most appropriate, which would then yield a proxy variable for the de facto address that is of interest in many socioeconomic statistics.
The economic activity classification, e.g. NACE in Europe, is a well-known example in business statistics. The NACE code in the Business Register is generally a proxy for the target “pure” economic classification that has its root in the System of National Accounts. Several issues contribute to this, such as inconsistent operational rules of the Business Register, misreporting, and lack of updating. It is common in sample surveys to observe that, for some units, the NACE code based on the updated survey returns differs from the existing one in the Business Register. Such a domain classification error is a kind of identification error. See, e.g. Brion and Gros (2015) for an example of how the matter is dealt with in the French Structural Business Surveys, and Van Delden, Scholtus, and Burger (2016) for an analysis of the NACE-classification errors in the Dutch context.
For survey data, the statistical unit can be identified in fieldwork. Based on register data, however, it is sometimes necessary to construct a proxy for the statistical unit of interest, in which case unit errors may be unavoidable even if all the input data are error-free. For instance, consider the register-based household. Provided all dwellings (or addresses) in the Population Register are correct, one may define a dwelling household to consist of all the persons who de jure share the same dwelling. We do not consider such a dwelling household to be a constructed statistical unit, precisely because it can be obtained directly from error-free input data. Saying that the input data are error-free is another way of saying that there are no identification errors. An example of a constructed unit in this context is the living household, which does not have to include everyone registered at the same dwelling, nor be limited to them. An error in a constructed living household occurs if two persons in different living households are placed in the same constructed living household, or if two persons in the same living household are placed in different constructed living households.
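The two kinds of error in a constructed living household can be made concrete by comparing every pair of persons. The following is a minimal sketch under stated assumptions: the person labels, the household labels, and the function name `pairwise_unit_errors` are all hypothetical, and the true living households are taken as known only for the purpose of the illustration.

```python
from itertools import combinations


def pairwise_unit_errors(true_household, constructed_household):
    """Return the pairs wrongly joined and the pairs wrongly split by the construction."""
    wrongly_joined, wrongly_split = [], []
    for p, q in combinations(sorted(true_household), 2):
        same_true = true_household[p] == true_household[q]
        same_constructed = constructed_household[p] == constructed_household[q]
        if same_constructed and not same_true:
            wrongly_joined.append((p, q))   # different living households, same constructed one
        elif same_true and not same_constructed:
            wrongly_split.append((p, q))    # same living household, different constructed ones
    return wrongly_joined, wrongly_split


# Hypothetical example: four persons, two true living households A and B.
true_household = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
# The construction wrongly places p3 together with p1 and p2.
constructed_household = {"p1": "X", "p2": "X", "p3": "X", "p4": "Y"}

wrongly_joined, wrongly_split = pairwise_unit_errors(true_household, constructed_household)
```

Here a single misplaced person produces both kinds of error: p3 is wrongly joined with p1 and p2, and wrongly split from p4.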
Whether or not the unit is constructed, unit error can arise from either a lack of data or errors in the data. Zhang (2011) devises a mathematical representation of unit error. It is assumed that each statistical unit of interest can consist of one or several so-called base units, but never cuts across a base unit. For example, person can be the base unit for household. The mapping from the set of base units to the set of statistical units can then be specified in terms of an allocation matrix, where each element takes the value 1 or 0 depending on whether or not the corresponding base unit (arranged by column) belongs to the statistical unit (arranged by row). In the case where a base unit can be assigned to one and only one statistical unit, as when a person can belong to only one household, each column sum of the allocation matrix is equal to 1. Zhang (2011) develops a unit error theory for household statistics. Although unit error is clearly one of the most fundamental difficulties in business statistics, a statistical theory has so far been lacking. This may be partly due to the prominence of the identification error mentioned above. Another important reason may simply be the lack of a commonly acknowledged choice of base unit in business statistics.
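The allocation matrix described above can be sketched in a few lines. The code below is only an illustration of the representation, not of the theory in Zhang (2011): the function names and the four-person example are assumptions, and the matrix is stored as a plain list of rows, with statistical units by row and base units by column.

```python
def allocation_matrix(assignment, n_units):
    """Build the 0/1 allocation matrix: rows are statistical units, columns are base units.

    assignment[k] is the row index of the statistical unit that base unit k belongs to.
    """
    return [
        [1 if assignment[k] == r else 0 for k in range(len(assignment))]
        for r in range(n_units)
    ]


def column_sums(matrix):
    """Number of statistical units each base unit is allocated to."""
    return [sum(col) for col in zip(*matrix)]


def as_partition(matrix):
    """Grouping of base units implied by the matrix, ignoring the order of the rows."""
    return {frozenset(k for k, a in enumerate(row) if a == 1) for row in matrix}


# Hypothetical example: four persons (base units) and two households (statistical units).
true_matrix = allocation_matrix([0, 0, 1, 1], n_units=2)         # true households
constructed_matrix = allocation_matrix([0, 0, 0, 1], n_units=2)  # third person misplaced

household_sizes = [sum(row) for row in constructed_matrix]       # row sums
has_unit_error = as_partition(true_matrix) != as_partition(constructed_matrix)
```

Since each person is assigned to exactly one household, every column sums to 1 in both matrices; the row sums give the household sizes, and a difference between the two implied groupings indicates unit error.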
1.2.2 Measurement
Consider now the measurement side in Figure 1.1. Relevance error refers to the discrepancy between the target measure, which may be a theoretical construct, and the measure that is achievable based on the available data. In a widespread scenario for combining register and survey data, the survey variable is treated as the target measure and the register proxy as an auxiliary variable, which can be used either to adjust the survey sampling weights or to build a prediction model of the survey variable.
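One simple way of using a register proxy to adjust survey weights is a ratio adjustment, in which all weights are rescaled so that the weighted sample total of the proxy reproduces its known register total. The source does not name a particular adjustment method, so the sketch below is only one possible instance; the function name and all the numbers are hypothetical.

```python
def ratio_adjusted_weights(weights, proxy, register_total):
    """Rescale sampling weights so the weighted proxy total equals the register total."""
    weighted_proxy_total = sum(w * x for w, x in zip(weights, proxy))
    factor = register_total / weighted_proxy_total
    return [w * factor for w in weights]


# Hypothetical sample of three units.
weights = [10.0, 10.0, 10.0]   # initial sampling weights
proxy = [1, 0, 1]              # register proxy (auxiliary variable), known for every unit
target = [1, 0, 0]             # survey variable (target measure), observed in the sample only
register_total = 25.0          # proxy total over the whole register

adjusted = ratio_adjusted_weights(weights, proxy, register_total)

# After adjustment the weighted proxy total matches the register total.
calibrated_proxy_total = sum(w * x for w, x in zip(adjusted, proxy))

# Estimated population total of the survey variable with the adjusted weights.
estimated_target_total = sum(w * y for w, y in zip(adjusted, target))
```

The adjustment is useful to the extent that the proxy is correlated with the survey variable; the proxy itself is never substituted for the target measure.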
Sometimes, however, all the available measures entail relevance error, regardless of the source of the data, and there is no way of combining them to derive the target measure directly. For instance, Meijer, Rohwedder, and Wansbeek (2012) adopt such a viewpoint and study earnings data from register and survey sources using a mixture model approach, whereas Pavlopoulos and Vermunt (2015) apply latent class models to analyze income-based labor market mobility. It is also possible to formulate an adjusted measure as the solution of an appropriately defined constrained optimization problem, without explicitly introducing a model that spells out the relationship between the true measure and the observed proxy measures. For instance, Mushkudiani, Daalmans, and Pannekoek (2014) apply such an approach to Census aggregated tables and turnover variable from different sources.
Mapping error due to reclassification of input register data is highly common, since a register proxy variable often arises by means of reclassification. For instance, inferring the mother tongue from birth country is a reclassification of the input variable birth country to the outcome variable mother tongue. As another example, to classify someone receiving unemployment benefit as unemployed is to reclassify the input variable “benefit or not” to the outcome variable “unemployed or not”. Such examples are numerous.
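The unemployment example can be sketched as a reclassification rule followed by a cross-tabulation of the resulting proxy against the true status. The records below are invented for illustration, and the true status is assumed to be known only so that the mapping error can be counted; the function name `reclassify` is likewise hypothetical.

```python
from collections import Counter


def reclassify(receives_benefit):
    """Map the input variable 'benefit or not' to the outcome 'unemployed or not'."""
    return receives_benefit


# Hypothetical records: (receives_benefit, truly_unemployed).
records = [
    (True, True),
    (True, True),
    (True, False),   # receives benefit but is not unemployed
    (False, True),   # unemployed but receives no benefit
    (False, False),
    (False, False),
]

# Cross-tabulate the register proxy against the true status.
table = Counter((reclassify(benefit), truth) for benefit, truth in records)

# Units for which the reclassified value differs from the true status.
mapping_errors = sum(count for (proxy, truth), count in table.items() if proxy != truth)
mapping_error_rate = mapping_errors / len(records)
```

The off-diagonal cells of the table are the mapping errors: units for which the reclassified register value disagrees with the outcome variable it is meant to represent.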
It is worth noting that mapping error may be caused by delays or mistakes in the administrative sources, even where reclassification poses no conceptual difficulty. Register data may be progressive in the sense that the observations for a particular reference time point may differ depending on when they are compiled. Following Zhang and Fosen (2012) and Zhang and Pritchard (2013), let $t$ be the reference time point of interest and $t + d$ the measurement time point, for $d \geq 0$. Let $U(t)$ and $y(t)$ be the target population and value at $t$, respectively. For a unit $i$, let $I_i(t; t + d) = 1$ if it is to be included in the target population and 0 otherwise, based on the register data available at $t + d$, and let $y_i(t; t + d)$ be the observed value for $t$ at $t + d$. The data are said to be progressive if, for $d \neq d' > 0$, one can have $I_i(t; t + d) \neq I_i(t; t + d')$ and $y_i(t; t + d) \neq y_i(t; t + d')$. Progressiveness is a distinct feature of register data