4 Data Requirements and Criteria | Creating an Integrated System of Data and Statistics on Household Income, Consumption, and Wealth: Time to Build

Suggested Citation:"4 Data Requirements and Criteria." National Academies of Sciences, Engineering, and Medicine. 2024. Creating an Integrated System of Data and Statistics on Household Income, Consumption, and Wealth: Time to Build. Washington, DC: The National Academies Press. doi: 10.17226/27333.

4

Data Requirements and Criteria

Policy makers and the public require high-quality, consistent, timely, and granular statistics on household income, consumption, and wealth (ICW) using standard definitions. Researchers require microdata with those attributes to study household economic wellbeing in depth and with tailored definitions appropriate to their research. To serve both purposes, relevant statistical agencies will need to integrate multiple data sources, which will in turn require careful examination of data quality and the amelioration of data quality problems, to the extent possible, during the integration process.

In this chapter, the panel describes data requirements and criteria for an integrated system of ICW and contrasts the advantages and disadvantages of different types of data in terms of quality. The chapter’s sections cover (a) an ideal dataset for statistics and research on household ICW; (b) frameworks for measuring data quality in a broad sense, including not only accuracy but also relevance and accessibility; (c) assessments, on relevance and other data quality attributes excluding accuracy, of surveys, administrative records, and commercial data sources that could contribute to an integrated ICW dataset; (d) assessments of the accuracy (alternatively, errors) of data sources and of an integrated ICW dataset, which requires consideration of additional sources of error, including data linkage error; (e) projects underway at statistical agencies in the United States to improve the accuracy and other quality attributes of selected data sources; (f) the experience of other countries, including Denmark, Finland, the Netherlands, New Zealand, Norway, Sweden, and the United Kingdom, with establishing integrated data systems to produce official statistics, with various levels of maturity and experience, to provide lessons for the United States; and (g) the panel’s conclusions and recommendations about data requirements, criteria for assessment, and priorities for improving data sources. Chapter 5 takes the next step of outlining methods, processes, and short-term and long-term priorities for building an integrated ICW dataset.

AN IDEAL DATASET

The ideal data infrastructure for statistics and research on household ICW is a micro dataset with records for households with some data on the individuals living in the household. Such a dataset would allow users to precisely estimate the joint distribution of ICW for individuals aggregated to households and other units of analysis with as much geographic, race/ethnicity, and demographic detail as possible at various breakdown points in the distribution, including the very top (e.g., top 1% and top 0.1%). The ideal dataset is easily accessible, regularly updated, and comparable both over time and with other countries. Critical attributes of an ideal dataset include the following:

Unit of observation and population universe. The units of the ideal dataset are individuals whose records include pointers that enable them to be assembled into family and household units (and other kinds of units for particular purposes, such as tax filing units). The dataset includes indicators of the size and composition of households, families, and other types of aggregate units to permit the application of equivalence scales for comparability. The population universe consists of all residents in the United States, including groups not always included in existing data sources such as individuals not living in private households (i.e., living in institutional or noninstitutional group quarters or homeless) and undocumented individuals. As in the System of National Accounts (SNA), private and institutional households are distinguished and treated differently in analysis and results reporting.
Variables on ICW. The ideal dataset contains variables for individuals on the components of their individual ICW, so that statistical agencies and researchers can use the components that are appropriate to their purposes. For example, the ideal dataset has variables that permit estimating both prefiscal (before taxes and government transfers) and postfiscal incomes, both of which are relevant to policy. The variables on ICW permit one to obtain estimates that are consistent with macro-level aggregate variables. The ideal dataset yields household distributions of ICW that are consistent not only with the standard definitions put forward in Chapter 3 but also with other tailored definitions for particular analyses (see Chapter 2), including the definitions used by the Distributional National Accounts (DINA), the Organisation for Economic Co-operation and Development (OECD)–Eurostat Expert Group on Disparities in a National Accounts framework (EG DNA), the OECD ICW Framework (Organisation for Economic Co-operation and Development, 2013), and the Canberra Group (2011).
Variables on unit characteristics. The ideal dataset has variables for individuals on a wide range of characteristics to support descriptive statistics (e.g., breakdowns of income, consumption, or wealth by race and ethnicity) and for analysis. These variables include characteristics of the person (age, race and ethnicity, gender, household relationship, marital status, employment status, and many other attributes) and geographic information, such as geocoded location of units to place them within administrative divisions such as electoral districts, counties, and states.
Frequency, timeliness, and updateability. Although most existing U.S. statistics on ICW are released annually, there is potential interest in shorter reference periods (e.g., month or quarter) because volatility is a topic of interest, and also in longer periods such as multiyear averages, including “lifetime” estimates. The ideal dataset is not a one-off project but one that is updated frequently at relatively low cost once the initial dataset has been constructed. Updates are completed in a timely manner so that analysis and statistics produced from the dataset are not out of date. Ideally, the dataset allows tracking the same units over time, as in a panel study, and does not support only repeated cross-section perspectives.
Flexibility. The ideal dataset is developed recognizing that there is no single best set of choices for standard statistics and research analyses for each of the desired attributes set out above. Choices need to be informed by the purposes of the statistics and analyses. Thus, the ideal integrated dataset on individual ICW offers users a menu of analytical choices, for example, in the choice of variables and unit definitions.
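
As an illustration of the unit-of-observation attribute above, the following minimal Python sketch shows how person-level records carrying a household pointer could be assembled into household units and adjusted with an equivalence scale. The field names, the example incomes, and the choice of the square-root scale (household income divided by the square root of household size) are assumptions made for illustration only; the panel does not prescribe a particular scale or record layout here.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import sqrt


@dataclass
class Person:
    person_id: int
    household_id: int  # pointer used to assemble individuals into households
    income: float      # hypothetical annual income, in dollars


def equivalized_household_income(people):
    """Sum person incomes by household, then divide by sqrt(household size).

    The square-root scale is one common equivalence scale, used here only
    as an illustrative assumption.
    """
    totals = defaultdict(float)
    sizes = defaultdict(int)
    for person in people:
        totals[person.household_id] += person.income
        sizes[person.household_id] += 1
    return {hh: totals[hh] / sqrt(sizes[hh]) for hh in totals}


# Hypothetical records: a four-person household and a one-person household.
people = [
    Person(1, 100, 60_000.0),
    Person(2, 100, 20_000.0),
    Person(3, 100, 0.0),
    Person(4, 100, 0.0),
    Person(5, 200, 45_000.0),
]

result = equivalized_household_income(people)
for household_id, income in sorted(result.items()):
    print(household_id, income)
```

Because the records are kept at the person level, the same pointers could instead group individuals into families or tax filing units, which is the kind of flexibility the list above describes.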

The ideal dataset captures the desired units of observation, coverage, characteristics, consistency, frequency, timeliness, updateability, and flexibility. However, such a dataset does not exist in the United States. Currently, multiple data sources are required to support useful statistics and research on household ICW distributions. The resulting integrated ICW dataset needs to combine survey data, administrative data, and even commercial data. The need to use multiple data sources and combine them inevitably gives rise to operational issues (see Chapter 5) and data quality issues, discussed later in this chapter.

As a preface to the rest of the chapter, the panel notes that many existing datasets—surveys, administrative records, commercial datasets—could contribute value to an integrated ICW dataset. Relevant surveys include the American Community Survey (ACS), Consumer Expenditure Survey (CE), Current Population Survey Annual Social and Economic Supplement (CPS-ASEC), Health and Retirement Study (HRS), Panel Study of Income Dynamics (PSID), Survey of Consumer Finances (SCF), and Survey of Income and Program Participation (SIPP). Relevant administrative records include, among others, Internal Revenue Service/Statistics of Income (IRS/SOI) tax records, Social Security Administration (SSA) records, and federal and state social insurance and transfer program records (e.g., housing subsidy records, unemployment insurance records, and nutrition program records). Relevant commercial sources include, among others, JP Morgan Chase Institute and Black Knight.

After considering extant data quality frameworks, the panel assesses the survey, administrative records, and commercial data sources listed above against the criteria in these frameworks and separately discusses such criteria as relevance, accessibility, and the various sources of error that can impair data accuracy.

DATA QUALITY FRAMEWORKS

The quality of an integrated ICW dataset refers not only to the quality of the individual survey, administrative, and commercial data sources that contribute to it, but also to the quality attributes of the integrated dataset as a whole. Quality assessment thus requires a broad framework beyond each of the data sources (see National Academies of Sciences, Engineering, and Medicine, 2017, 2023a,b).

Survey research has developed the total survey error framework, which calls for identifying all the errors in the design, collection, processing, and analysis of survey data that cause a survey estimate to differ from the underlying true value (Groves, 2004). Such errors can include coverage error, sampling variability, response error, and others. Even for survey data, however, this framework omits important aspects of quality from the user perspective, such as relevance, accessibility, and consistency. Moreover, administrative and commercial data are typically collected for purposes that differ from those motivating surveys, and quality control of such data mainly lies in data cleaning and processing. Biemer (2010) and Amaya et al. (2020) accordingly adapt the total survey error framework to “big data” processing, listing relevant error sources, although they too omit some quality aspects, such as accessibility.

The Eurostat Quality Assurance Framework (Eurostat, 2019) developed by the European Statistical System Committee has five major criteria for assessing the quality of statistics in a broad framework, which might be summed up as criteria to ensure that the data are “fit for use”:

Relevance
Accuracy and reliability
Timeliness and punctuality
Accessibility and clarity
Coherence and comparability

The U.S. Federal Committee on Statistical Methodology has also developed a set of criteria, organized into three domains, for assessing data quality (Federal Committee on Statistical Methodology, 2020):

Utility
1. Relevance
2. Accessibility
3. Timeliness
4. Punctuality
5. Granularity
Objectivity
1. Accuracy and reliability
2. Coherence
Integrity
1. Scientific integrity
2. Credibility
3. Computer and physical security
4. Confidentiality

There is considerable overlap between the frameworks of the U.S. Federal Committee on Statistical Methodology and the European Statistical System Committee. For the panel’s purposes, five criteria are most appropriate for assessing data quality. These five criteria are assessed as a group for surveys, administrative records, and commercial data as input data sources for an integrated ICW dataset, and for the dataset as a whole:

Relevance: Whether the data are responsive to users’ analytic needs in terms of subject content, population covered, adequacy for estimation, frequency, and utility for longitudinal as well as repeated cross-sectional analysis;
Timeliness and punctuality: How long the interval is between the phenomena being measured and the availability of the data, and whether the data are released punctually in accordance with an announced publication schedule;
Accessibility and clarity: Whether users can easily obtain and understand the microdata and associated documentation;
Coherence and comparability: Whether the data accord with standard definitions and are consistent over time and with other relevant data; and
Granularity: The extent to which the data can be provided for population groups and geographic areas.

The panel considered separately the criteria of accuracy and reliability (the attributes the total survey error framework addresses) because so many error sources are involved in that assessment.
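
The overlap among the frameworks described above can be made explicit with a small Python sketch. It breaks the paired labels (for example, "Accuracy and reliability") into single attributes by hand and compares the resulting sets; matching attributes by name alone is a simplifying assumption, since the frameworks may define the same word somewhat differently.

```python
# Attributes named in the Eurostat Quality Assurance Framework criteria,
# with paired labels split by hand into single attributes.
eurostat = {
    "relevance", "accuracy", "reliability", "timeliness", "punctuality",
    "accessibility", "clarity", "coherence", "comparability",
}

# Attributes named in the Federal Committee on Statistical Methodology
# framework, grouped by its three domains.
fcsm_by_domain = {
    "utility": {"relevance", "accessibility", "timeliness",
                "punctuality", "granularity"},
    "objectivity": {"accuracy", "reliability", "coherence"},
    "integrity": {"scientific integrity", "credibility",
                  "computer and physical security", "confidentiality"},
}
fcsm = set().union(*fcsm_by_domain.values())

# The panel's five grouped criteria, plus accuracy and reliability,
# which the panel considers separately.
panel = {
    "relevance", "timeliness", "punctuality", "accessibility", "clarity",
    "coherence", "comparability", "granularity", "accuracy", "reliability",
}

shared = eurostat & fcsm          # named in both frameworks
eurostat_only = eurostat - fcsm   # named only by Eurostat
fcsm_only = fcsm - eurostat       # named only by the FCSM

print("shared:", sorted(shared))
print("Eurostat only:", sorted(eurostat_only))
print("FCSM only:", sorted(fcsm_only))
print("panel criteria all traceable:", panel <= (eurostat | fcsm))
```

Under this name-matching assumption, every attribute the panel uses appears in at least one of the two frameworks, with granularity coming from the U.S. framework and clarity and comparability from the European one.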

It is rare that any one dataset satisfies all quality criteria, and an integrated dataset may fall short of multiple criteria depending on the available input data sources and on errors that occur in the process of integration, such as linkage errors. Inevitably, tradeoffs must be made. For this reason, the efforts underway at statistical agencies to improve data quality for key datasets are absolutely crucial (see discussion below). Finally, the cost, response burden, and operational challenges involved in the production of statistics act as constraints on quality. The panel uses its best judgment to recommend priorities for data improvement in the development of an integrated ICW dataset.

RELEVANCE AND OTHER QUALITY ATTRIBUTES

Surveys

Overall, surveys are essential for building an integrated ICW dataset because of the wide range of information they collect and because, often, one of their design goals is to include almost all of the residential U.S. population. Determining the precise role that each survey should play, however, is not an easy task—for example, whether a survey should be the “spine” of an integrated ICW dataset and, if so, which survey. Alternatively, IRS tax records might form the spine. How best to exploit the strengths of each survey is also not an easy task—for example, how best to use the ACS to obtain geographic granularity when its content on ICW is so limited, or how best to use the capability for longitudinal analysis of the HRS and PSID when their sample sizes are so small. Chapter 5 discusses these issues in depth.

Table 4-1 describes key quality attributes of seven surveys likely to figure prominently in the ICW dataset: ACS, CE, CPS-ASEC, HRS, PSID, SCF, and SIPP. Tables 4-1A, 4-1B, and 4-1C, located at the end of this chapter, focus on each survey’s variables on income, consumption, and wealth, respectively. Applying the panel’s five data quality criteria outlined above, the following observations may be made about the characteristics and features of these surveys:

Relevance—subject content. All seven surveys obtain data on one or more ICW measures for individuals or aggregates of individuals (e.g., the consumer unit in the CE or the primary economic unit in the SCF). The seven surveys differ substantially in the level of detail they obtain on ICW—for example, the ACS has limited detail on income (six components), nothing on consumption, and only house value and vehicle availability for wealth; in contrast, the SCF has extensive information on income and wealth, although much less on consumption.
Relevance—population covered. Five of the seven surveys (listed above) include the household population and people living in noninstitutional group quarters but not the homeless; the ACS includes people living in institutions in addition to noninstitutional group quarters and households; the HRS includes the household population ages 51 and over.
Relevance—adequacy for estimation. The ACS has the largest number of individual records available per year (2.1 million households and almost 5 million people) followed by the CPS-ASEC, SIPP, CE, HRS, SCF, and PSID.
Relevance—frequency. The ACS, CE, CPS-ASEC, and SIPP provide annual microdata files; the HRS and PSID data are biennial; and the SCF data are triennial.
Relevance—following people over time. The HRS and PSID follow people over long periods of time; SIPP follows people over several years; the SCF has occasionally had a longitudinal component; and some people in the CPS-ASEC can be followed for 2 years. The HRS and PSID permit examination of intergenerational transfers, such as inheritances.
Timeliness and punctuality. The ACS, CE, and CPS-ASEC release data files annually and punctually, 9–12 months after data collection; the HRS and PSID release data files biennially and the SCF triennially, about 18 months after data collection, and all three are generally punctual; the SIPP has had a checkered history of data release.
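
To give a rough sense of what "adequacy for estimation" means at the very top of the distribution, the Python sketch below computes how many sample units would be expected to fall in the top 1% and top 0.1% for three of the sample sizes reported in Table 4-1. It assumes simple random sampling with equal selection probabilities, which none of these surveys actually uses (the SCF, for example, deliberately overrepresents wealthy families), so the figures are only an order-of-magnitude illustration.

```python
# Achieved sample sizes as reported in Table 4-1 of this chapter.
SAMPLE_SIZES = {
    "ACS": 2_100_000,    # final interviews per year
    "CPS-ASEC": 78_000,  # interviewed households
    "SCF": 4_600,        # families in the 2022 survey
}


def expected_units(sample_size, per_thousand):
    """Expected number of sample units in a top group of the population.

    Assumes simple random sampling (an illustrative assumption), so the
    expected count is the sample size times the group's population share,
    expressed here in units per thousand.
    """
    return sample_size * per_thousand / 1000


expected = {
    name: {
        "top 1%": expected_units(n, 10),
        "top 0.1%": expected_units(n, 1),
    }
    for name, n in SAMPLE_SIZES.items()
}

for name, counts in expected.items():
    print(name, counts)
```

The contrast between thousands of expected units in the largest survey and a handful in the smallest illustrates why sample design, and not only sample size, matters for estimates at the top of the distribution.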

TABLE 4-1 Quality Assessment of Relevant Surveys for an Integrated Household Income, Consumption, and Wealth Dataset: Relevance, Timeliness and Punctuality, Accessibility and Clarity, Coherence and Comparability, and Granularity

Quality Metric/Survey	ACS (U.S. Census Bureau)	CE Interview Survey (BLS)	CPS-ASEC (U.S. Census Bureau)
Relevance: Income content (see Table 4-1A); Consumption content (see Table 4-1B); Wealth content (see Table 4-1C)	Income: Prior 12 months—6 major cash sources, SNAP benefit receipt/no $^a Consumption: Housing (annual real estate taxes, fire/flood insurance, mobile home fees; monthly rent, mortgage/2nd/equity $, condo fee) Wealth: House value (owners); vehicles available	Income: Previous 12 months (4th interview)—12 cash sources; earnings (1st/4th interviews); Quarterly—SNAP $, housing subsidies $, WIC/NSLP/LIHEAP imputed Consumption: 60 categories: Quarterly—$/ detail for major expenses (60–70%); global $ for other expenses (20–25%) Wealth: Major durable goods inventory (1st interview); assets (4th interview)	Income: Prior calendar year—30+ cash sources, noncash benefit $ (SNAP), noncash benefit receipt/no $ (NSLP, WIC, LIHEAP, housing) Consumption: Housing (whether have a mortgage/second mortgage/home equity loan, no $); child care costs; health insurance costs; child support payments Wealth: House value (owners)
Relevance: Population covered	People in households and all GQ except for homeless and residents of domestic violence shelters (college students in dorms sampled at college)	People in households and noninstitutional group quarters (college students in dorms, and in any college-sponsored housing, are a CU, Armed Forces living with 1+ civilian adult); excludes homeless, institutionalized, Armed Forces barracks	People in households and noninstitutional group quarters (college students in dorms are counted at home); excludes homeless, institutionalized, Armed Forces barracks
Relevance: Units for data collection; Longitudinal analysis (possible?)	Household (all people at an address); Family (related people at an address); Unrelated individual (in GQ or household); cross-sectional	Consumer Unit (family members at an address, including foster children; unrelated people in a household who share major expense; unrelated individuals who are financially independent); cross-sectional, yet CUs are interviewed for up to 4 quarters and can form a short longitudinal panel	(Same as the ACS); cross-sectional but can link addresses for part of sample to previous year
Relevance: Estimation adequacy (sample size/design)	Sample size: 295,000 addresses per month; 3.54 million addresses per year; 2.1 million final interviews per year^b Sample design: Nonclustered sample every month; small governmental units oversampled, large census tracts undersampled; nonrespondents (after mail follow-up) subsampled for field follow-up (1/3 subsample on average, rates vary inversely by response rates)	Sample size: 12,000 addresses in sample each quarter; 6,900 usable interviews each quarter Sample design: Multistage, stratified, clustered, rotational design; addresses in sample 4 quarters; new sample each quarter	Sample size: 89,000 addresses in sample selected from 1,385 of nation’s 3,143 counties; 78,000 interviewed households Sample design: Supplement to the monthly CPS: multistage, stratified, clustered, rotational design (addresses in sample 4 months, out 8 months, in again 4 months); CPS-ASEC has 100% oversample of Hispanic households; oversample for estimates of states’ Children’s Health Insurance Program
Relevance: Frequency	Data collection: Monthly (estimates are cumulated to calendar year) Publication: Annual—1-year tables for areas with 65,000+ people; reduced set of 1-year tables for areas with 20,000+ people; 5-year tables for all areas to block group level; 1-year/5-year microdata files	Data collection: Monthly Publication: National-level publications on 12-month consumer expenditures (integrating Interview and Diary Surveys) released every 6 months; microdata files available annually (Diary and Interview Survey)	Data collection: Annual (interviews conducted in February–April) Publication: National-level Income and Poverty in the United States released mid-September; additional tables and microdata file released concurrently
Timeliness and Punctuality	About 9–12 months after collection; punctual	About 9 months after collection; punctual	About 5 months after collection; punctual
Accessibility and Clarity	High marks for both availability and quality of documentation	(Same as ACS)	(Same as ACS)
Coherence and comparability	Internally coherent (e.g., definitions stable over time), but inconsistencies in ICW definitions and other features with other datasets (see Chapter 3)	(Same as ACS)	(Same as ACS)
Granularity: Geographic (on PUMS files)	PUMAs of ~100,000 people; 1-year PUMS about 2/3rds of ACS sample = ~ 2.1 million households (~ 3 million people)	All states, 23 metropolitan areas, population size (6 categories) for each consumer unit record; 6,000 consumer units for each of 4 quarters (consumer units may appear in 4 quarters or fewer)	U.S., states, large areas (e.g., metro areas, counties) that meet confidentiality restrictions; about 150K person-records
Granularity: Demographic/socio-economic	Limited detail per topic: Family/household composition, person demographics, ancestry, citizenship, nativity, commuting, place of work, disability status, education, employment, fertility, grandparents as caregivers, health insurance, industry, occupation, class of worker, language spoken at home, marital history, residence 1 year ago, military service, undergraduate field of degree, veteran status, work status last year, housing characteristics, computer/internet use	Limited detail per topic: Consumer unit characteristics, person demographics, employment (1st/4th interviews)	Detailed questions: Family/household composition, person demographics, marital status, educational attainment, health insurance coverage/costs, nativity, work experience, geographic mobility

Quality Metric/Survey	HRS (University of Michigan)	PSID (University of Michigan)	SCF (Federal Reserve Board)	SIPP (U.S. Census Bureau)
Relevance: Income content (see Table 4-1A); Consumption content (see Table 4-1B); Wealth content (see Table 4-1C)	Income: Prior calendar year for sample member and spouse (wages and all other for other household members)—15 cash sources, SNAP, source of inheritance Consumption: Sample unit expenses prior 12 months for ~30 categories Wealth: Gross/net value of home, property, financial assets, life insurance, vehicles, other assets	Income: Prior calendar year for reference person, spouse, and all other family members (some items)—30 cash sources, SNAP, school lunch, WIC, LIHEAP, etc. Consumption: Household expenses prior year (most items) for ~30 categories Wealth: Gross/net value of home, property, financial assets, vehicles, medical/legal/credit card debt	Income: Prior calendar year, for PEU as a whole—15 cash sources, free meals, taxes Consumption: PEU rent/mortgage/other expenses, principal residence; food at home/away; charity; alimony/support to others; gifts to others Wealth: PEU real/financial assets; loans (applied for/turned down); pension wealth for reference person and spouse/partner; inheritances	Income: Monthly—30+ cash sources, noncash benefit $; Annual—income from assets Consumption: Rent/mortgage payment/utilities; support payments for child, parent, ex-spouse, someone else; work expenses (parking/tolls, other) Wealth: House value (owners), financial assets, life insurance (face/cash value), property assets, business assets, vehicles, debts, educational savings accounts, retirement contributions (employer/employee)
Relevance: Population covered	Adults ages 51 and older in households	(Same as CPS-ASEC originally)	(Same as CE)	(Same as CPS-ASEC)
Relevance: Units for data collection; Longitudinal analysis (possible?)	Single person or couple; Follows each sample member until death (including in nursing homes)	Family: Follows members (adults and children) and their children over time	PEU (economically dominant single person or couple/all others financially interdependent with them); cross-sectional (but 1983 respondents reinterviewed in 1986, 1989; 2007 respondents reinterviewed in 2009)	Household (all people at an address); Family (related people at an address); Unrelated individual (in noninstitutional GQ or household); Program units (e.g., SNAP); longitudinal over 12–48 months
Relevance: Estimation adequacy (sample size/design)	Sample size: 12,622–27,192; Respondents: 10,694–21,384, including ~450 interviews with nursing home residents each wave Sample design: Multistage area probability design; initial cohort in 1992 of people ages 51–61 and spouses of any age; other cohorts added; since 2004 in steady state of adding new cohort of people ages 51–56 (and spouses of any age) every 6 years; Black people, Hispanic people, and Florida residents oversampled	Sample size:17,000–31,000 individuals, currently 25,000 (response rates about 90%) Sample design: Original 1968 sample included 1,872 low-income families from the Survey of Economic Opportunity and nationally representative sample of 2,930 households; Latino sample included 1990–1995; immigrant sample added, low-income sample cut back in 1997 (changes driven by cost and desire to maintain representativeness)	Sample size: About 5,800–6,300 families in 2016, 2019 surveys, 4,600 in 2022 survey Sample design: Multistage area-probability sample + tax return sample to overrepresent wealthy families (excludes Forbes’ list of wealthiest 400 people in U.S.)	Sample size: Wave 1 eligible households: 2018 panel, 45,000; 2019 panel, 24,500; 2020 panel, 22,000; 2021 panel, 14,500; 2022 panel, 47,500 Sample design: Multistage, stratified, clustered design; low-income areas oversampled; new panel every year (as of 2018); each panel with 4 annual waves (2019 panel had one wave; 2020 panel had response problems due to COVID-19; 2022 panel had low response in Wave 1) NOTE: Design has changed substantially several times since SIPP began in 1984

Relevance: Frequency	Data collection: Every 2 years since 1992 Publication: No regular publications; microdata files support research	Data collection: Every 2 years since 1997 Publication: No regular publications; microdata files support research	Data collection: Every 3 years since 1983 Publication: Overview of findings (2022 findings released October 2023); technical reports; microdata file from each survey	Data collection: Annual (interviews conducted in winter-spring) Publication: Topical national-level publications released sporadically, P-70 series; Microdata files released by year; latest year available: 2022 (2020 Wave 3, 2021 Wave 2, 2022 Wave 1)
Timeliness and Punctuality	Punctual with availability of microdata file from each biennial wave about 18 months after collection (links with Medicare/Social Security available on restricted basis)	Punctual with availability of microdata file from each biennial wave about 18 months after collection (links with Medicare, 1940 Census available on restricted basis)	Punctual with release from each triennial survey	Timeliness—an issue for SIPP in the past—has improved in recent years
Accessibility and Clarity	High marks for accessibility and quality of documentation	High marks for accessibility and quality of documentation	High marks for accessibility and quality of documentation	High marks for accessibility and quality of documentation
Quality Metric/Survey	HRS (University of Michigan)	PSID (University of Michigan)	SCF (Federal Reserve Board)	SIPP (U.S. Census Bureau)

Coherence and Comparability	Largely consistent over time but differences in economic unit (single/couple) and variables from other surveys	Largely consistent over time (content expanded in several areas) and with other surveys	Consistent over time but differences in economic unit (PEU vs. household/family) and variables collected from other surveys	SIPP has changed design and content often but core focus on income and program participation has remained
Granularity: Geographic	U.S., regions (states available on restricted basis)	U.S. (census tract and block available on restricted basis)	U.S., around 5,000–6,000 family records in microdata file (full sample)	U.S., states; metropolitan/nonmetropolitan status; about 41,000 persons with up to 12 months of data for the 2022 SIPP microdata file
Granularity: Demographic/socio-economic	Detailed questions: Family composition, person demographics, veteran status, parents’ occupations/education/financial status, language at home, children born/fathered, marital/partner status/history, residences, place of birth, citizenship, education, childhood health, smoking, health status, health care use/costs, disability, functional limitations, cognition, expectations, internet use, food security	Detailed questions: Family composition, person demographics, employment/history, marital/birth histories, education, occupation, veteran status, education/occupation of parents, certification/licenses, time use, food security, health status, childhood health, health conditions/disabilities, smoking, volunteering/giving	Limited questions: Economic expectations/credit attitude/financial institutions; attitudes on saving/investing; work history/demographics (education, sex, race/ethnicity, veteran status, marital history, health status, smoking) for reference person and spouse; most information for PEU; some separately for reference person and spouse/partner; summary information for rest of household	Detailed questions: Family/household composition, person demographics, language at home, citizenship, nativity (self/parents), veteran status, residences, marital history, parent mortality, education, employment, commuting, health care utilization/insurance/costs, child/dependent care, disability, fertility, food security, adult/child wellbeing

NOTES: ACS = American Community Survey; CE = Consumer Expenditure Survey; CPS = Current Population Survey; CPS-ASEC = Annual Social and Economic Supplement to the Current Population Survey; CU = consumer unit; GQ = group quarter; HRS = Health and Retirement Study; ICW = income, consumption, and wealth; LIHEAP = Low Income Home Energy Assistance Program; NSLP = National School Lunch Program; PEU = primary economic unit; PSID = Panel Study of Income Dynamics; PUMA = Public Use Microdata Area; SCF = Survey of Consumer Finances; SIPP = Survey of Income and Program Participation; SNAP = Supplemental Nutrition Assistance Program; WIC = Special Supplemental Nutrition Program for Women, Infants, and Children.

^a $ indicates that a dollar value is an available variable (and not just an indicator).

^b Due to COVID-19, the 2020 ACS had fewer sample addresses and fewer final interviews; the 2020 CPS-ASEC had fewer interviewed households.

SOURCES: Bureau of Labor Statistics (2022); Census Bureau (2021a,b); Eurostat (2003); Health and Retirement Study (n.d.a,b); psidonline.isr.umich.edu/documents/psid/questionnaires/q2023.pdf, User Guide for the 2021 Interviewing Year (umich.edu) - psidonline.isr.umich.edu/data/Documentation/UserGuide2021.pdf; SCF codebook and questionnaire outline: www.federalreserve.gov/econres/files/codebk2019.txt, scfoutline.2019.pdf (federalreserve.gov) - www.federalreserve.gov/econres/files/scfoutline.2019.pdf; SIPP data dictionary and users’ guide: 2022 Survey of Income and Program Participation Data Dictionary (census.gov) - www2.census.gov/programs-surveys/sipp/tech-documentation/data-dictionaries/2021/2021_SIPP_Data_Dictionary_AUG22.pdf, 2022 Survey of Income and Program Participation Users’ Guide (census.gov) - www2.census.gov/programs-surveys/sipp/tech-documentation/methodology/2021_SIPP_Users_Guide_AUG22.pdf

Accessibility and clarity. All seven surveys currently release public-use microdata sample (PUMS) files¹ and have good documentation. All seven also provide access to additional information in secure environments (e.g., income amounts that are not top-coded in the CPS-ASEC are accessible through the Federal Statistical Research Data Center [FSRDC] network).
Coherence and comparability. As amply documented in Chapter 3 with respect to concepts and components of ICW, relevant surveys differ in many respects that affect coherence and comparability.
Granularity. The seven surveys differ greatly in the granularity of the geographic and population detail they can provide. The ACS currently publishes statistics for areas as small as block groups and census tracts and provides PUMS files with identifiers of public-use microdata areas of 100,000 or more people. In contrast, the HRS, PSID, and SCF provide only regional geographic detail. The ACS includes a long list of individual and household characteristics but with limited detail for each. Other surveys provide more detail for a shorter list of characteristics.

Several of the above-described quality attributes factor fundamentally into the strategy for constructing an optimal ICW database.

Recommendation 4-1: The relevant statistical agencies should ensure that the integrated household-level data on income, consumption, and wealth are representative of the national population; cover individual, family, and household units of analysis; have key components of income, consumption, and wealth that bear on economic wellbeing; and can be used to construct estimates that are consistent with published macro aggregates.

Administrative Records

Because of errors and gaps in survey reports of income, administrative records are essential to constructing an accurate, integrated ICW dataset for household and family income (administrative records are typically less useful for estimating consumption or wealth because of the lack of applicable content). As stated in the recent Committee on National Statistics report, Toward a 21st Century National Data Infrastructure: Mobilizing Information for the Common Good (National Academies, 2023b), “In the panel’s opinion, U.S. statistical agencies’ reliance on sample-survey data and census data is unsustainable.”

___________________

¹ There is a movement at the Census Bureau to restrict the availability of PUMS files on grounds of confidentiality protection or to provide them as synthetic files with provision for validation on the original data. See further discussion in Chapter 6.

Conclusion 4-1: Current federal household survey datasets that include income, consumption, and wealth data suffer from multiple data quality issues, including unit and item nonresponse, coverage error, and reporting error (mostly underreporting). Use of administrative records can provide additional data to address these errors.

While administrative data can reduce reporting error in surveys, they may not, depending on their coverage, resolve survey nonresponse error. Table 4-2 describes the characteristics of three relevant administrative sources: IRS individual income tax records, Social Security Detailed Earnings Records (DER), and Supplemental Nutrition Assistance Program (SNAP) records. Table 4-3 details the income data available in the IRS and SNAP records (see Table 4-2 for the limited detail in the DER records). Finally, Table 4-4 indicates the records currently available to the Census Bureau, which has been the statistical agency most active in acquiring relevant records.

Indeed, substantial progress has been made over the past two decades linking Census Bureau survey data to administrative tax records and transfer program data from the SSA, IRS, U.S. Department of Agriculture (USDA), and numerous state agencies. SSA has provided information to the Census Bureau on individual earnings through the Detailed Earnings Records as well as information on transfer benefits from Social Security Retirement, Social Security Disability, and Supplemental Security Income. The IRS has provided access to 1040 tax returns, along with associated forms such as the W-2 wage statement, 1099s for self-employment and retirement income, and the Earned Income Tax Credit (EITC; and recently the Child Tax Credit) recipient file. USDA has provided access to SNAP and the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) records for cooperating states, with some states also providing access to data on other programs such as Temporary Assistance for Needy Families (TANF). The Census Bureau also has access to limited data from records for Department of Veterans Affairs benefits and other disability and survivor benefits.

TABLE 4-2 Quality Assessment of Selected Relevant Administrative Records for an Integrated Household Income, Consumption, and Wealth Dataset: Relevance, Timeliness and Punctuality, Accessibility and Clarity, Coherence and Comparability, Granularity

Quality Metric/Records	Federal Individual Income Tax Records	Social Security DER	SNAP Records
Responsible Agency	SOI/IRS	SSA	States (counties in CA); FNS, USDA
Relevance: Income content; Consumption content; Wealth content	Income: All taxable sources (prior year) on 1040 and other forms (see Table 4-3) Consumption: Only expenses that are eligible for deductions or credits (e.g., charitable giving, owner property taxes, child care) (deductions only for itemizers) Wealth: No data (some wealth components can be imputed from income, e.g., rental property, interest-generating assets)	Income: Wage and salary earnings (total, not top-coded at the FICA limits) from yearly W-2 forms (back to 1978) with separate record for each worker for each job during the year; also deferred contributions to retirement and trust plans and health savings accounts, sick pay, railroad wages, other noncovered earnings; positive self-employment earnings from 1040-SE forms Consumption: No data Wealth: No data	Income: SNAP benefits (monthly); countable income (monthly, for unit and each adult) from a variety of sources (see Table 4-3) Consumption: Only expenses eligible for deductions from countable income: child support, dependent care, medical expenses, rent/mortgage, utility allowance Wealth: Countable assets under state rules: vehicle value, liquid assets, other nonliquid assets, real property

Relevance: Population covered	People required to file income-tax returns (excludes people with income under a threshold) (information returns [such as 1099 forms] can pick up additional people) (can include the unhoused)	People in covered employment—federal employees excluded before 1983, state and local before 1991, police and firefighters before 1994 (can include homeless). For complete records, there is a Summary Earnings Record and Master Earnings File	People who applied for and were determined to be eligible for benefits (can include homeless)
Relevance: Units for data collection; Publication; Longitudinal analysis (possible?)	Tax filing unit (taxpayer, spouse, dependents) for filers, individuals for nonfilers Longitudinal analysis possible	Individuals in covered employment Longitudinal analysis possible (SER has limited data to 1959)	SNAP unit (people in a household eligible to receive benefits) Longitudinal analysis possible
Relevance: Estimation adequacy	Not applicable (whole covered population)	Not applicable (whole covered population)	Not applicable (whole covered population)
Relevance: Frequency	Data collection: Yearly Publication: Yearly tables	Data collection: Weekly from IRS data (posted to Master Earnings File of which DER and SER are extracts); 90% of data for employees with name and SSN match to SSA Numident File; 6% probabilistic matches where one identifier fails to match; remainder set aside for followup Publication: Basic tables on covered employment/benefits	Data collection: Monthly Publication: Basic monthly and yearly tables

Timeliness and Punctuality	Publications: About 9–12 months after collection; punctual Availability to Census Bureau: timely	Publications: Variable (e.g., 2022 Annual Statistical Supplement tables have 2020 or 2021 as latest year) Availability to Census Bureau: Timely link to CPS-ASEC and SIPP	Publications: About 2 months after collection for monthly tables; 12 months for yearly tables; punctual Availability to Census Bureau: Varies by State
Accessibility and Clarity	SOI/IRS from 1996–2014 issued PUMS; release on hold, pursuing synthesized PUMS for testing and validation server for production and privacy protection of output; Census Bureau has access to some items (see Table 4-4), which are accessible to researchers in FSRDCs	Requires special arrangements with SSA (see Table 4-4 for Census Bureau access); documentation not publicly available	Requires special arrangements with states; Census Bureau and USDA ERS and FNS cooperate on gaining access (see Table 4-4 for Census Bureau access; data for some states accessible to researchers in FSRDCs)
Coherence and Comparability	Time series comparability affected by changes in tax law	Time series comparability affected by coverage differences (e.g., police not covered before 1994) and changes in tax law (e.g., allowable pre-tax deductions)	Cross-state comparability affected by eligibility rule nuances and in variables provided to Census Bureau/ERS/FNS; unit and person income, consumption, wealth, and demographics pertain to time of certification (which can be 6 months or longer)
Granularity: Geographic	In principle, any type of geography; SOI/IRS publishes ZIP Code tabulations	In principle, any type of geography; SSA publishes earnings/employment for covered workers by state/county	In principle, any type of geography; states provide county identifiers to Census Bureau

Granularity: Demographic/socio-economic	Limited detail: SSN, name, occupation of filer(s), number of dependent children, elderly, disabled in filing unit; type of filing unit (e.g., married filing jointly)	Limited detail: SSN, name, sex, race, date of birth, date of death (from SSA Numident File)	SNAP unit: Size, household size, whether/number nonelderly, disabled, elderly, children, school-age children, noncitizens, whether single-female-headed unit; Adult person: Age, citizenship, disability and employment status, race/ethnicity, relationship to head of household, sex, highest educational level completed

NOTE: CPS-ASEC = Current Population Survey Annual Social and Economic Supplement; DER = Detailed Earnings Records; ERS = Economic Research Service; FICA = Federal Insurance Contributions Act; FNS = Food and Nutrition Service; FSRDC = Federal Statistical Research Data Centers; IRS = Internal Revenue Service; PUMS = Public Use Microdata Sample; SER = Summary Earnings Record; SIPP = Survey of Income and Program Participation; SNAP = Supplemental Nutrition Assistance Program; SOI/IRS = Statistics of Income/Internal Revenue Service; SSA = Social Security Administration; SSN = Social Security Number; USDA = U.S. Department of Agriculture.

SOURCES: Bryant (2017); Genadek et al. (2021); Internal Revenue Service (2023); Prell (2022); Social Security Administration (n.d.); and USDA ERS - SNAP and WIC Administrative Data.

TABLE 4-3 Income Components (and Taxes) in Federal Individual Income Tax and Supplemental Nutrition Assistance Program (SNAP) Administrative Records

Component	IRS/SOI Tax Records	Census Bureau/ERS/FNS SNAP Records
Reference Period	Prior calendar year	Month (as of eligibility certification)
Unit	Tax filing unit, filer and spouse	SNAP unit, adult members
Labor income
Earnings (gross) (wages, bonuses, tips, commissions)	Yes (including pre-tax retirement contributions etc.)	Yes
Self-employment	Yes (net/gross)	Yes
Income from assets
Dividends	Yes (in AGI/qualified)	Yes (in other unearned income)
Interest	Yes (gross/tax-exempt)	Yes (in other unearned income)
Rent	Yes (net/gross)	Yes (in other unearned income)
Royalties/Estates/Trusts	Yes	Yes (in other unearned income)
Realized capital gains/losses	Yes	No
Social insurance
Unemployment compensation	Yes	Yes
Worker’s compensation	No	Yes
Social Security/Railroad Retirement	Yes (gross/in AGI)	Yes (in other unearned income)
Disability benefits	No	Yes (in other unearned income)
Veterans’ payments	No	Yes (in other unearned income)
Survivor benefits	No	Yes (in other unearned income)
Retirement
Pensions	Pensions + annuities (total/in AGI)	Yes (in other unearned income)
Annuities	Pensions + annuities (total/in AGI)	Yes (in other unearned income)
Retirement withdrawals	Yes (taxable)	No
Public assistance
Supplemental Security Income	No	Yes
TANF/GA (cash)	No	Yes
Other income
Educational assistance	No	Yes

Alimony	Yes	Yes
Child support	No	Yes
Financial assistance from friends/relatives	No	Yes
Alaska dividend	No	No
All other regular income	Yes (on various schedules—e.g., honoraria)	Yes
Lump sums	Yes (capital gains/losses; retirement distributions)	No
In-kind benefits
SNAP	No	Yes
School lunch	No	No
Public/subsidized housing	No	No
WIC	No	No
Energy assistance (LIHEAP)	No	No
Free meals at work	No	No
Health care benefits
Type (employer/union, Medicaid, Medicare, ACA, Military/VA, Indian Health Service)	No	No
Premium	Yes (if itemized)	No
Medical out-of-pocket, MOOP	Yes (if itemized)	Excess over $35 per month for elderly/disabled
Taxes/tax credits
Federal taxes	Yes (withheld/additional paid)	No
State/local taxes	Yes (if itemized)	No
FICA taxes	Available in W-2	No
EITC	Yes	Yes
Other tax credits	Yes	No
Other tax information	Yes (extensive, AGI, taxable income, tax owed/paid, detail on schedules/forms)	No

NOTE: ACA = Affordable Care Act; AGI = adjusted gross income; EITC = Earned Income Tax Credit; ERS = Economic Research Service; FICA = Federal Insurance Contributions Act (payroll tax); FNS = Food and Nutrition Service; GA = General Assistance; IRS = Internal Revenue Service; LIHEAP = Low Income Home Energy Assistance Program; MOOP = Medical Out-of-Pocket; SNAP = Supplemental Nutrition Assistance Program; SOI = Statistics of Income Division; TANF = Temporary Assistance for Needy Families; VA = Department of Veterans Affairs; WIC = Special Supplemental Nutrition Program for Women, Infants, and Children.

SOURCES: See Table 4-2.

TABLE 4-4 Available or Potentially Available (to the Census Bureau) Administrative Data for Money Income, In-Kind Benefits, and Taxes

Income Item	Data Source and Administrative Item (available to Census Bureau unless otherwise noted)	Notes
Wages and salaries	IRS: Limited W-2 Information SSA (via IRS): DER States: Unemployment insurance data in Longitudinal Employer-Household Dynamics data	Earnings net of employee deductions for health insurance, etc.; excludes unreported earnings (e.g., tips). The DER has W-2 earnings as well as deferred wage contributions to 401(k), 403(b), 408(k), 457(b), and 501(c) plans. (available for use with the CPS-ASEC and SIPP and recently for the 2017–2019 ACS)
Self-employment (sole proprietor/independent contractor)	SSA: DER IRS: 1040 Schedule C, SE; 1099-MISC; 1099-K; K-1—not available	See above for availability of DER Underreported income not in tax data
Self-employment (pass-through)	IRS: 1040, Schedules E, F; K-1—not available	Income from owners of C-corps not reported unless dividends taken
Unemployment compensation	IRS: 1099-G; 1040—not available
Workers’ compensation	Not available	Mostly administered by private insurance firms
Social Security	SSA: Payment History Update System IRS: SSA 1099—not available	Available for use with the CPS-ASEC, SIPP and, recently, the 2017–2019 ACS
SSI	SSA: Supplemental Security Record	See above for availability; nontaxable, not on any IRS form
Public assistance	States: DHHS: TANF	Not available for all states; not all cash assistance covered
Veterans’ benefits	Veterans Administration: Administrative data (limited)	Some benefit data available for limited uses
Disability	IRS: 1099-R, limited data	Excludes Social Security and VA
Survivor income	IRS: 1099-R, limited data	Excludes Social Security and VA
Interest	IRS: 1040 IRS: 1099-INT—not available	Includes taxable and nontaxable; excludes tax-preferred

Dividends	IRS: 1040 IRS: 1099-DIV—not available	Excludes tax-preferred
Rent and royalty income	IRS: 1040 IRS: 1040 Schedule E, K-1—not available	Only gross rent Excludes depreciation
Educational assistance	IRS: 1098-T, 1099-Q—not available	1098-T covers financial aid; 1099-Q covers spending from tax-preferred education accounts (529, Coverdell)
Other income	IRS: Capital gains, 1040, 1099-B, K-1—not available IRS: Alimony, 1040—not available IRS: Gambling income, W-2G—not available IRS: Alaska dividend, 1099-MISC—not available
Noncash/deferred compensation from employers	Firms: Retirement plan contributions, Form 5500 public data IRS: Health insurance contributions, other benefits (e.g., moving expenses, etc.), W-2—not available	Only available at aggregate firm level
Government taxes, credits	IRS: EITC, other credits (e.g., child tax, education expense), 1040—not available IRS: Federal tax obligations, 1040—not available IRS: State, local, property tax obligations (for itemizers up to cap), 1040—not available	Census Bureau models federal and state income taxes, including various credits

Near-income items	States: SNAP, WIC	SNAP, WIC availability varies by year; not available for all states
	CMS/DHHS: Medicare	
	CMS/DHHS: Medicaid	
	States: School lunch—not available	
	HUD: Housing assistance	HUD data do not cover all possible sources of housing assistance
	States: LIHEAP	LIHEAP data available for one state for some years

NOTE: ACS = American Community Survey; CMS = Centers for Medicare and Medicaid Services; CPS-ASEC = Annual Social and Economic Supplement to the Current Population Survey; DER = Social Security Detailed Earning Records; EITC = Earned Income Tax Credit; HUD = U.S. Department of Housing and Urban Development; LIHEAP = Low Income Home Energy Assistance Program; SIPP = Survey of Income and Program Participation; SNAP = Supplemental Nutrition Assistance Program; SSA = Social Security Administration; TANF = Temporary Assistance for Needy Families; WIC = Special Supplemental Nutrition Program for Women, Infants, and Children.

SOURCES: Adapted from Bee and Rothbaum (2019, Table 1).

Observations from these tables include:

Relevance: Administrative records are structured, and contain data, in accordance with the laws and regulations governing the program to which they apply. The records listed in Table 4-2 include one or more components of income but have few or no data on consumption or wealth. The records are large, but they cover only those eligible for a benefit or required to pay a tax rather than the entire population, and not all eligible people file. Some records pertain to individuals, while others pertain to program or filing units. Some records pertain to a calendar year, while others are monthly. Individuals can potentially be tracked over time as they enter and exit a program.
Timeliness and punctuality: Administrative records systems are updated to suit their own program needs and are not always available on a time schedule that would support the use of, say, tax records for calendar year t in year t estimates of household income, consumption, and wealth to be released no later than year t + 1.
Accessibility and clarity: Access to administrative records typically requires executing a legal agreement of some sort. In the past, the IRS/SOI released a PUMS of individual income data, but it has not done so since 2014 because of concerns about disclosure risk. Consequently, administrative records used in developing an integrated ICW dataset must be handled within a restricted enclave, such as one of the FSRDCs, or by an agency such as the Census Bureau within its own secure data storage and processing environment. Documentation for administrative records is often minimal and not publicly accessible.
Coherence and comparability: Definitions of variables differ among administrative records and with surveys, whether the variable is a unit of observation (e.g., tax filing unit, SNAP eligibility unit) or type of income (e.g., earnings reported to the IRS, which exclude pre-tax retirement and other deductions).
Granularity: Demographic and geographic detail varies widely among administrative records.
Overall: Administrative records on income (e.g., SNAP records of benefits paid out) are essential for an integrated ICW dataset because, in most instances, they can be used to correct underreporting in surveys. Other items on records, such as unit membership or income for determining program eligibility, require careful evaluation to determine whether they are out of date or erroneous. Only IRS/SOI tax records have the detail and population coverage to merit consideration as the “spine” of an integrated ICW dataset.
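
The observation that administrative benefit amounts can be used to correct survey underreporting can be illustrated with a minimal sketch. It is not a description of any agency's actual procedure. The record layout, the `link_id` key, the function `integrate_snap`, and all dollar amounts are hypothetical, and the rule shown (prefer the administrative amount when a link exists, otherwise keep the survey report) is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class SurveyHousehold:
    # Hypothetical protected linkage key; None when the household cannot be linked.
    link_id: Optional[str]
    # Annual SNAP benefits as reported by the survey respondent (made-up values).
    reported_snap: float


def integrate_snap(
    households: List[SurveyHousehold], admin_snap: Dict[str, float]
) -> List[Tuple[float, str]]:
    """Return (amount, source) per household, preferring linked administrative amounts."""
    results = []
    for hh in households:
        if hh.link_id is not None and hh.link_id in admin_snap:
            # A linked administrative record exists: use the benefit actually paid out.
            results.append((admin_snap[hh.link_id], "admin"))
        else:
            # No link or no record (e.g., state data not available): keep the survey report.
            results.append((hh.reported_snap, "survey"))
    return results


households = [
    SurveyHousehold("A1", 0.0),     # linked; survey reports no benefits
    SurveyHousehold("B2", 1200.0),  # linked; survey underreports
    SurveyHousehold(None, 900.0),   # not linkable
    SurveyHousehold("C3", 500.0),   # linkable, but no administrative record on file
]
admin_snap = {"A1": 2400.0, "B2": 1800.0}  # hypothetical administrative amounts

integrated = integrate_snap(households, admin_snap)
survey_total = sum(hh.reported_snap for hh in households)
integrated_total = sum(amount for amount, _ in integrated)
print(survey_total, integrated_total)
```

The sketch keeps the source of each amount alongside the value, so that later evaluation can separate administrative from survey-reported figures. It deliberately leaves open how to treat a linkable household with no administrative record, which could reflect either true nonparticipation or incomplete state coverage.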

Commercial Data Sources

Commercial data sources can help in many ways to fill in gaps in surveys and administrative records, cross-validate survey and administrative information, and select among alternative assumptions for distributing ICW components to households. However, commercial sources have significant drawbacks, including instability. Zillow’s Transaction and Assessment real estate database is a case in point. For a number of years, Zillow offered this database free of charge to researchers in the academic, nonprofit, and government sectors who signed a data-use agreement. The database, updated quarterly, contained property characteristics, mortgages, geographic information, prior valuations, and more for about 200 million parcels in more than 3,100 counties, which could be very useful in measuring a principal component of household wealth. Zillow recently announced the discontinuation of its transactions database, effective September 30, 2023, stating that the Zillow research team could no longer adequately serve the growing number of researchers.² Zillow is currently working with the Inter-university Consortium for Political and Social Research at the University of Michigan on depositing the data and providing access to them.

___________________

² See www.zillow.com/research/ztrax/

Black Knight has data similar to Zillow’s on residences and commercial properties, which are available to clients of the company for custom reports or client-generated reports.³ Presumably, researchers can acquire the company’s data, although the company’s website offers no information on what data can be acquired, how, or at what cost. Black Knight data have been linked to the National Experimental Wellbeing Statistics (NEWS) system at the Census Bureau (see Bee et al., 2023; Rothbaum, 2022), other commercial data offerings such as those from Experian and CoreLogic have been linked to census and other survey data (see Brown et al., 2023), and Zillow data have been used at the Bureau of Economic Analysis (BEA; see Fixler et al., 2020).

The JPMorgan Chase Institute uses its large databases of banking and credit card transactions to generate research reports on relevant topics, such as households’ cash balances and how households used the advanced portion of the refundable child tax credit in 2021.⁴ The de-identified data are available on request to researchers, although ascertaining the details of what can be obtained, under what terms, at what cost, and with what documentation and support is not possible without initiating a request. Wheat (2022), in his presentation to the panel, demonstrated the usefulness of linking income and spending using the banking data.

Federal statistical agencies constructing an integrated ICW dataset would need to exercise caution in planning to use a commercial data source on a continuing basis. Nonetheless, such sources may be valuable for particular aspects at particular times.

ACCURACY AND RELIABILITY

The quality metric of accuracy and reliability merits its own discussion. Table 4-5 lists types of error that affect surveys, administrative records, and commercial datasets and additional sources of error that arise in the process of linking multiple data sources, and demonstrates that these errors are present in all types of data.

This section discusses several major sources of error for an integrated ICW dataset: nonresponse, misreporting, and population coverage error in surveys, using income data in the CPS-ASEC as a prime example; coverage and other deficiencies in administrative records; and errors in linking multiple data sources. A full assessment of errors and prioritization of the most important to address is critical to a successful project of data integration.

___________________

³ See www.blackknightinc.com/products/property-data/

⁴ See www.jpmorganchase.com/institute/research/household-income-spending

TABLE 4-5 Error Components of Multiple Data Sources and Linked Datasets

Error Component	Survey Data	Administrative Data	Commercial Data
Construct misspecification/misalignment	Inconsistent ICW concepts and definitions	Inconsistent ICW concepts and definitions	Inconsistent ICW concepts and definitions
Coverage error	Undercoverage; Overcoverage (depends on adjustment)	Undercoverage; Overcoverage	Undercoverage; Overcoverage; Nonrepresentative
Sampling error	Inadequate sample size in most cases for subdomain estimation	Large size	Large size
Nonresponse error	Item nonresponse; Unit nonresponse	Item nonresponse; Unit nonresponse	Item nonresponse; Unit nonresponse
Measurement error	Misreporting	Misreporting	Misreporting
Processing error	Editing	Editing	Editing
Modeling error	Imputation; Nonresponse adjustment; Calibration/weighting	Imputation; Non-filing adjustment	Imputation; Calibration/weighting
Linkage error	False matches, false nonmatches, records unable to be matched
Analysis on integrated datasets	Small domain estimation, multilevel models, multigenerational analysis

SOURCES: Panel generated using survey information (see Table 4-1).

Survey Errors from Nonresponse and Misreporting

Household surveys have been the flagship of federal statistics on income, poverty, expenditures, and many other topics for at least 75 years. Over the past 30 years, however, survey quality has deteriorated in many cases.⁵ Specifically, response rates (unit and item) have declined in recent decades and underreporting has increased, requiring a greater reliance on imputation and weighting adjustments.

___________________

⁵ See detailed discussions in Chapter 6.3.3 in National Academies (2023a).

**FIGURE 4-1** Unit response rates to selected household surveys conducted by the Census Bureau, 1984–2019.
NOTES: CE Interview = Consumer Expenditure Interview Survey; CPS = monthly Current Population Survey; CPS-ASEC = CPS Annual Social and Economic Supplement. Annual or annual average response rates reported.
SOURCES: Panel generated using data in Abraham (2022).

Household (unit) Response Rates

Household (unit) response rates have been declining in nearly all surveys, in the United States and abroad. The Census Bureau obtains higher response rates in its surveys than other organizations, but even Census Bureau surveys have not been immune to the problem. Figure 4-1 shows the decline in response rates to the basic CPS, CPS-ASEC, and CE surveys from 1984 to 2019. Based on findings by Sabelhaus et al. (2013), in 2014 the Bureau of Labor Statistics (BLS) began using ZIP-Code-level income estimates from the IRS for nonresponse weighting adjustments. Such adjustments are needed to correct for underrepresentation of high-income households and overrepresentation of low-income households in the CE. BLS is currently using IRS data on income at an aggregate level for nonresponse weighting adjustments (Steinberg et al., 2020). The Census Bureau is also doing this type of research for the CPS-ASEC and ACS.⁶ Low response rates do not necessarily indicate low-quality data, but they do indicate a higher risk of nonresponse error.
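The general logic of a nonresponse weighting adjustment can be illustrated with a minimal sketch of a weighting-class adjustment. The sketch assumes that each sampled household carries a hypothetical `income_class` label (standing in for an area-level income estimate such as one derived from ZIP-Code-level IRS data); the field names, classes, and numbers are invented for illustration, and the code is not the procedure used by BLS or the Census Bureau.

```python
from collections import defaultdict

def adjust_for_nonresponse(households):
    """Weighting-class adjustment: within each class, inflate respondents'
    base weights so they also represent the class's nonrespondents.
    Assumes every class has at least one respondent."""
    sampled = defaultdict(float)    # sum of base weights, all sampled units
    responded = defaultdict(float)  # sum of base weights, respondents only
    for h in households:
        sampled[h["income_class"]] += h["base_weight"]
        if h["responded"]:
            responded[h["income_class"]] += h["base_weight"]

    adjusted = []
    for h in households:
        if not h["responded"]:
            continue  # nonrespondents drop out; their weight is redistributed
        factor = sampled[h["income_class"]] / responded[h["income_class"]]
        adjusted.append({**h, "final_weight": h["base_weight"] * factor})
    return adjusted

# Hypothetical sample: households in high-income areas respond at a lower rate.
sample = (
    [{"income_class": "high", "base_weight": 100.0, "responded": i < 2} for i in range(4)]
    + [{"income_class": "low", "base_weight": 100.0, "responded": i < 3} for i in range(4)]
)
respondents = adjust_for_nonresponse(sample)

weight_by_class = defaultdict(float)
for r in respondents:
    weight_by_class[r["income_class"]] += r["final_weight"]
```

In this toy sample, respondents in the high-income class receive a larger adjustment factor than those in the low-income class, so each class ends up with the same total weight it had in the full sample. The adjustment removes bias only to the extent that respondents and nonrespondents within a class are similar.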

Item Nonresponse to Income Questions

Item nonresponse to income questions has also increased in the CPS-ASEC and other surveys. For example, there has been a substantial increase in nonresponse to the CPS-ASEC questions on labor-market earnings, which are the dominant source of income among nonretired households, comprising at least 80% of personal income (Bollinger et al., 2018; Hokayem et al., 2015). This nonresponse can occur either from refusal to answer the earnings questions (item nonresponse) or from refusal to respond to most or all of the ASEC (supplement nonresponse). Publicly available data from the CPS for people ages 16–64 indicate that earnings item nonresponse more than doubled from 1990 to 2004, then trended down for the next decade, only to jump several percentage points over the past 5 years. Even more striking is the increase in ASEC nonresponse, which jumped from 10% in 2010 to 23% in 2021. Combined, this means that earnings are missing for at least 40% of potential prime-aged workers.

The Census Bureau does not drop observations with missing earnings or a missing supplement, but instead retains these observations and imputes values for the missing data. Depending on the questions being addressed, the reasons for nonresponse, and the quality of the imputation methods, use of imputed values can have little effect or can produce severe bias (Bollinger et al., 2018; Bollinger & Hirsch, 2006; Hirsch & Schumacher, 2004; Hokayem et al., 2015). If earnings data are “missing completely at random,” then nonresponse is completely independent of earnings; if earnings are “missing at random,” then nonresponse is not dependent on earnings after conditioning on covariates; and if earnings are not missing at random, then nonresponse depends on the value of missing earnings even after conditioning on covariates (Little & Rubin, 2019; Rubin, 1976). The last case is generally referred to as “nonresponse bias.” Both Census Bureau imputation procedures and common methods to deal with nonresponse assume that earnings are missing at random; that is, those not reporting earnings are assumed to have earnings similar to those of respondents with equivalent measured attributes. Bollinger et al. (2018) present evidence that missing earnings are not missing at random; and Hokayem et al. (2015) show that the official poverty measure was biased downward from nonresponse by about 10% in a typical year during the period 1998–2009, which was prior to the large run-up in ASEC supplement nonresponse.
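The practical difference among the three missingness mechanisms can be seen in a small simulation. The sketch below is illustrative only: the population, the single covariate (a two-level education indicator), the response probabilities, and the simple cell-mean imputation are all assumptions chosen for clarity, and they do not represent the Census Bureau's imputation procedures.

```python
import random

random.seed(0)
N = 20_000
CELL_MEAN = {0: 30_000.0, 1: 60_000.0}  # hypothetical mean earnings by education group

people = []
for _ in range(N):
    educ = random.randint(0, 1)
    earnings = random.gauss(CELL_MEAN[educ], 10_000.0)
    people.append((educ, earnings))

def response_prob(mechanism, educ, earnings):
    if mechanism == "MCAR":   # independent of earnings and covariates
        return 0.7
    if mechanism == "MAR":    # depends only on the observed covariate
        return 0.9 if educ == 0 else 0.5
    # Not missing at random: depends on the unobserved earnings value
    # even within an education group.
    return 0.9 if earnings < CELL_MEAN[educ] else 0.5

def mean_after_cell_imputation(mechanism):
    """Impute each missing value with the respondent mean of its education cell."""
    observed = {0: [], 1: []}
    missing_count = {0: 0, 1: 0}
    for educ, earnings in people:
        if random.random() < response_prob(mechanism, educ, earnings):
            observed[educ].append(earnings)
        else:
            missing_count[educ] += 1
    total = 0.0
    for educ in (0, 1):
        cell_mean = sum(observed[educ]) / len(observed[educ])
        total += sum(observed[educ]) + missing_count[educ] * cell_mean
    return total / N

true_mean = sum(e for _, e in people) / N
bias = {m: mean_after_cell_imputation(m) - true_mean for m in ("MCAR", "MAR", "MNAR")}
```

Under the first two mechanisms, imputing within education cells recovers the overall mean closely. Under the third, higher earners within each cell respond less often, so the imputed mean falls well below the true mean even though the imputation conditions on education.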

___________________

⁶ See, for example, www.census.gov/library/working-papers/2020/demo/SEHSD-WP2020-10.html

Long-running household panels such as the PSID and HRS have item nonresponse rates on earnings that are about one-third the rates found in the CPS-ASEC. This is presumably because of the extra effort that can be devoted to the smaller number of households sampled in these surveys compared to the CPS-ASEC (see Schoeni et al., 2013).

Substantial and Increased Underreporting of Transfer Income

Substantial and increased underreporting of transfer income in surveys, as compared with administrative records aggregates, is well documented for many sources (e.g., Meyer et al., 2009, 2015). The growing problem of nonresponse and underreporting of transfer income is not unique to the CPS-ASEC, nor to surveys in the United States (Brewer et al., 2017). The reasons for these problems are not well understood.

Imputation can fill in missing income amounts for respondents who report receipt and can add income for respondents who do not answer a receipt-related question and are imputed to receive the income source. Income will remain unaccounted for, however, when respondents received an income source but reported that they did not, and when they reported a lower amount than they in fact received.

Net underreporting of income from many transfer programs (both cash and in-kind) is found to be high and increasing up through 2005 for the CPS-ASEC, ACS, and CE (Meyer et al., 2015, Tables 2, 3, 4, 7), and Larrimore et al. (2022) found substantial underreporting of unemployment benefits in the CPS-ASEC. Moreover, it is likely that reporting has not improved since these comparisons were made. For example, less than 50% of Aid to Families with Dependent Children/Temporary Assistance for Needy Families (AFDC/TANF) benefits were reported in the CPS-ASEC and ACS in 2004 (and only 25% of such benefits were reported in the CE). Reporting of SNAP benefits was not much better. Even Social Security benefits are somewhat underreported (90%, 81%, and 90% of benefits were reported in the CPS-ASEC, ACS, and CE, respectively, in 2005), while underreporting of SSI benefits is somewhat worse (78%, 84%, and 66% of benefits were reported in the CPS-ASEC, ACS, and CE, respectively, in 2005). Meyer (2022) in his presentation to the panel emphasized this underreporting of government transfer programs in the CPS-ASEC, ACS, and SIPP.
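The reporting rates cited above are ratios of a weighted survey aggregate to an administrative aggregate. A minimal sketch of that arithmetic follows; the records, weights, and program total are hypothetical and do not reproduce any published estimate.

```python
def reporting_rate(survey_records, admin_total):
    """Weighted survey aggregate (reported plus imputed dollars) divided by
    the administrative aggregate of benefits paid."""
    survey_total = sum(r["weight"] * r["benefit"] for r in survey_records)
    return survey_total / admin_total

# Hypothetical survey records: weight = households represented,
# benefit = annual benefit dollars recorded in the survey.
records = [
    {"weight": 1_000, "benefit": 2_400.0},
    {"weight": 1_500, "benefit": 0.0},      # a recipient who reported no receipt
    {"weight": 2_000, "benefit": 1_800.0},
]
admin_benefits_paid = 10_000_000.0  # hypothetical program outlays

rate = reporting_rate(records, admin_benefits_paid)
net_underreporting = 1.0 - rate
```

Here the survey accounts for 60% of the dollars paid out, so net underreporting is 40%. In practice the comparison also requires aligning the survey universe and reference period with the administrative totals.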

Using samples linked to New York State administrative records, Meyer et al. (2015, Table 4) estimate the extent of net underreporting in SNAP estimates in the CPS-ASEC and AFDC/TANF estimates in the CPS-ASEC and ACS due to the combination of unit nonresponse, coverage error, and weighting, as well as underreporting due to item nonresponse and to measurement error. They find measurement error to be the largest source of the discrepancy. Meyer and Mittag (2019) estimate the effect of net underreporting on estimates of deep poverty, poverty, and near poverty, using matched CPS-ASEC and administrative records for New York State for 2008–2011 for SNAP, TANF, General Assistance, and housing assistance. Using the administrative records, they found reductions of 0.9, 2.5, and 3.1 percentage points in the estimated rates of deep poverty, poverty, and near poverty, respectively, compared with the survey estimates, and even larger reductions for certain types of households, such as those headed by single mothers.

Survey Coverage of Population Groups

Survey coverage of population groups is uneven. For example, even after weighting for nonresponse, surveys disproportionately miss such groups as Black men when compared with decennial census counts. Use of census-based population estimates by age, sex, and race/ethnicity as a final stage in weighting rectifies the disparities on these basic demographic characteristics under the missing-at-random assumption. Nevertheless, it is difficult to correct for disparities that may exist for socioeconomic groups (e.g., men experiencing periods of low income, as well as very-high-income and high-wealth groups), and that in turn may distort estimates of inequality, poverty, and other measures of household economic wellbeing.⁷
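One common way to align weighted survey totals with population controls is raking (iterative proportional fitting). The sketch below is a simplified illustration with two hypothetical control dimensions and invented totals; it is not the weighting procedure of any particular survey.

```python
def rake(records, margins, iterations=50):
    """Iterative proportional fitting: rescale weights until weighted totals
    match each set of population control totals in turn."""
    weights = [r["weight"] for r in records]
    for _ in range(iterations):
        for variable, controls in margins.items():
            totals = {level: 0.0 for level in controls}
            for r, w in zip(records, weights):
                totals[r[variable]] += w
            weights = [w * controls[r[variable]] / totals[r[variable]]
                       for r, w in zip(records, weights)]
    return weights

# Hypothetical respondent weights (after nonresponse adjustment) and
# hypothetical population control totals.
respondents = [
    {"sex": "M", "age": "18-44", "weight": 80.0},
    {"sex": "M", "age": "45+",   "weight": 120.0},
    {"sex": "F", "age": "18-44", "weight": 150.0},
    {"sex": "F", "age": "45+",   "weight": 150.0},
]
controls = {
    "sex": {"M": 260.0, "F": 290.0},
    "age": {"18-44": 270.0, "45+": 280.0},
}
raked = rake(respondents, controls)
```

After raking, the weighted counts match the controls for sex and for age, so an undercovered group such as men in this toy example is weighted up. Characteristics that are not among the controls, such as income or wealth, are corrected only to the extent that they are related to the controlled characteristics.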

Errors in Administrative Records

A major problem with administrative records for use in estimating and analyzing household income, consumption, and wealth is that the records are designed for program administration and not for the needs of statistical agencies or researchers. Consequently, the content and coverage of records will change when program provisions change, and definitions of what seem to be the same concepts, such as earnings, will depend on program provisions and not on what makes sense for statistics and analysis. That said, records are essential for an integrated ICW dataset, given the deficiencies in surveys outlined in the previous section.

___________________

⁷ Bollinger et al. (2019) use a model applying a copula-based process to correct for missing earnings values. That model could be modified to also address undercoverage of people.

Bee et al. (2023) illustrate the challenges and opportunities of using administrative records of earnings from income tax returns to improve the accuracy of earnings for respondents to the CPS-ASEC. They matched the CPS-ASEC for 2019, IRS 1040 and information returns, SSA DER records, and job records in the Longitudinal Employer-Household Dynamics (LEHD) dataset maintained by the Census Bureau (which is derived from state reports of quarterly wages collected as part of the unemployment insurance program). Of all adults ages 15 and older with either wages or self-employment income (or both) in either the CPS-ASEC or administrative records (or both), 79.8% reported wages on the survey and had administrative wage records, while 7.7% had administrative wage records but no survey reports of wages, and 6.4% had survey reports of wages but no administrative wage records. The remaining adults with earnings had self-employment income only, which was reported on the survey, in records, or in both (Bee et al., 2023, Table A8). Choosing which administrative records source of earnings to use also posed challenges. The IRS W-2 information available to the Census Bureau does not include information on pretax wages used to pay health insurance premiums, so wages are underreported for people in that situation. Wage records from the LEHD program are truly gross wages, but many jobs are not covered by unemployment insurance (Bee et al., 2023, p. 9).
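A cross-classification of this kind can be sketched as follows. The identifiers and wage amounts are hypothetical stand-ins for linked records, the tabulation is unweighted for simplicity, and the code does not reproduce the methods of Bee et al. (2023).

```python
from collections import Counter

def wage_match_status(survey_wages, admin_wages):
    """Classify each person ID by where positive wages appear."""
    status = Counter()
    for person in set(survey_wages) | set(admin_wages):
        in_survey = survey_wages.get(person, 0) > 0
        in_admin = admin_wages.get(person, 0) > 0
        if in_survey and in_admin:
            status["both"] += 1
        elif in_admin:
            status["admin_only"] += 1
        elif in_survey:
            status["survey_only"] += 1
        else:
            status["neither"] += 1
    return status

# Hypothetical linked IDs (stand-ins for PIKs) mapped to annual wages.
survey = {"p1": 42_000, "p2": 0, "p3": 18_500, "p4": 61_000}
admin = {"p1": 43_250, "p2": 27_000, "p4": 60_400, "p5": 9_800}

counts = wage_match_status(survey, admin)
shares = {k: v / sum(counts.values()) for k, v in counts.items()}
```

In this toy example two people have wages in both sources, two have administrative wages only, and one has survey wages only. A production tabulation would apply survey weights and would treat self-employment income separately.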

In another application, Jones and Ziliak (2022) demonstrate the importance of using tax data to measure the effectiveness of the EITC. Linkages between tax data and the CPS-ASEC are also important in evaluating take-up rates for the EITC and other tax credits.

Data Linkage Errors

Using linked multiple data sources for an integrated ICW dataset (or any other purpose) adds error from the linking process. To start with, some linkages require the survey respondents’ informed consent to extract the administrative records. This is not the case for linkages made by the Census Bureau, but obtaining informed consent from respondents to surveys such as the HRS and PSID before linking their responses with administrative records is often mandated by research ethics boards and legislation. The underlying mechanisms driving the consent decision could threaten the validity of inferences, playing a similar role to survey response mechanisms. Sakshaug and Kreuter (2012) find that the linkage consent rates of several multidisciplinary linkage projects vary between 24% and 89%. In the HRS, response rates sharply declined beginning in the early 2010s from close to 90%, a rate that had been sustained for many years, to under 75% in 2020. Conditional on response, however, consent rates for linkage to Medicare records have remained relatively steady at 85–90%. Consent rates are lower for Black participants and those with less than a high school education (Weir, 2021).

The decline in linkage consent rates raises concerns about nonconsent bias as a source of selection, which can potentially distort conclusions drawn from linked-data analyses. Researchers can adapt strategies for nonresponse-bias adjustment to understand the magnitude of linkage nonconsent bias and whether this source of bias is likely to affect their analytic objectives. Studies find that linkage consenters often differ from nonconsenters on characteristics observed in the survey and administrative data (e.g., Jenkins et al., 2006; Sakshaug & Kreuter, 2012). Statistical adjustment procedures, including weighting and imputation (Si et al., 2022, 2023), have been used in linkage consent studies to adjust for systematic differences between consenters and nonconsenters.

The linkage and statistical integration of multiple data sources add a unique error source, due to incomplete linkages and probabilistic linkage error. The Census Bureau links person records through its Person Identification Validation System. In this system, survey respondents are matched to their SSN from the SSA Numident file,⁸ using personally identifiable information from the survey, such as name, gender, date of birth, and residential address. The SSN is then converted to a PIK, which is used to link data such as the CPS-ASEC to “PIKed” SSNs in tax and other administrative records (see Box 4-1).

It is not possible to reliably assign a PIK to every record, either due to insufficient identifying information or because the information does not uniquely match any of the administrative records used in the person validation process. For example, 14% of CPS-ASEC respondents in the period 2006–2011 could not be linked to administrative records (Bollinger et al., 2019). On average, each year since 2008 about 87–88% of respondents to the CPS-ASEC and about 90% of respondents to the SIPP received a PIK (Genadek et al., 2021, Table 2), but the PIK rate is lower for low-income population groups. Bollinger et al. (2018) report that the inability to obtain a link (or PIK) is highest among noncitizens of Hispanic ethnicity, and Jones and Ziliak (2022) report that the antipoverty effectiveness of programs such as the EITC is attenuated because of missing PIKs among this population, which tends to be more disadvantaged.

Errors in text fields, such as the use of nicknames, fake names, or duplicate names, will cause incomplete or mismatched linkages, and the probabilistic linkage process itself also introduces linkage error. Multiple datasets can be integrated via statistical matching (rather than exact linking as in the PIK process) based on the variables they share in common, but when the matching variables are categorical there is no single best link (Flaxman, 2022). Many studies in Europe have demonstrated that linkage error has important adverse implications for the quality of the administrative data that are linked to the survey data. Research in Sweden and the United Kingdom comparing the reliability of survey and administrative earnings data shows that reliability is notably lower for the linked administrative data than for the survey data on employment and earnings. And, as discussed above, linkage error is not random and is more likely to affect minority groups (see Jenkins & Rios-Avila, 2023; Meijer et al., 2012).

___________________

⁸ The Numident file is a record of applications for Social Security cards and includes data elements for name, date and place of birth, parents’ names, and date of death (see Genadek et al., 2021; McNabb et al., 2009).

BOX 4-1
Introduction to the Person Identification Validation System and Its Protected Identification Keys

The Census Bureau uses the Person Identification Validation System to assign unique Protected Identification Keys (PIKs) (see Wagner & Layne, 2014) to person records to facilitate deduplication and record linkage. PIKs are anonymous, unique person identifiers that, like Social Security Numbers (SSNs), remain unchanged over time. Each PIK is associated with an SSN or an Individual Taxpayer Identification Number, which is used by people who do not have an SSN but are required to file federal income taxes. SSNs are replaced by PIKs in files that initially contain SSNs when received by the Census Bureau, since access to files containing SSNs is limited to a small staff that specializes in maintaining the record linkage system. A reference file is constructed from the SSA Numident file enhanced with address data obtained from other federal administrative records. Census Bureau household surveys no longer collect SSNs, so linking surveys to administrative data requires mapping from characteristics such as name, date of birth, or address to determine the unique SSN.

The Person Identification Validation System uses a probabilistic matching algorithm (Fellegi & Sunter, 1969) to compare personally identifiable information of person records in census, survey, and administrative data to person records in the reference file. When a census or survey record is linked to a record in this composite reference file, it is assigned the relevant PIK. Using the PIKs, the record can be linked to tax and other administrative records.

Brown et al. (2023) find that 83.6% of people in the 2020 census were assigned a unique PIK and 83.4% an SSN, both rates being lower than in the 2010 census. In particular, people who were not matched are more likely to be in race groups other than non-Hispanic White, to be Hispanic, or to be young children or young adults.
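The core idea of probabilistic matching in the Fellegi and Sunter (1969) framework can be sketched as a sum of field-level agreement weights compared against thresholds. In the sketch below the comparison fields, the m- and u-probabilities, the thresholds, and the example records are all hypothetical; they are not the parameters or fields of the Person Identification Validation System.

```python
import math

# Hypothetical m- and u-probabilities: P(field agrees | true match) and
# P(field agrees | non-match). Real systems estimate these from data.
FIELD_PARAMS = {
    "last_name":  (0.95, 0.01),
    "first_name": (0.90, 0.02),
    "birth_date": (0.97, 0.001),
    "zip_code":   (0.80, 0.05),
}

def match_weight(record_a, record_b):
    """Sum of log2 likelihood ratios over the comparison fields."""
    total = 0.0
    for field, (m, u) in FIELD_PARAMS.items():
        if record_a[field] == record_b[field]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total

def classify(weight, upper=10.0, lower=0.0):
    """Illustrative thresholds; they trade off false matches and false nonmatches."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "clerical review"

survey_rec = {"last_name": "garcia", "first_name": "ana",
              "birth_date": "1980-02-14", "zip_code": "10027"}
ref_same = {"last_name": "garcia", "first_name": "ana",
            "birth_date": "1980-02-14", "zip_code": "10031"}
ref_other = {"last_name": "garcia", "first_name": "luis",
             "birth_date": "1975-07-01", "zip_code": "10027"}

w_same = match_weight(survey_rec, ref_same)
w_other = match_weight(survey_rec, ref_other)
```

The first candidate agrees on name and date of birth and is classified as a link despite a different ZIP Code, whereas the second agrees only on last name and ZIP Code and falls in the intermediate zone. Records in that zone, or records with too little identifying information, are one source of the false matches and false nonmatches listed in Table 4-5.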

Most probabilistic record linkage techniques create a single point estimate, ignoring the uncertainty in linking and matching. Analyses of linked data need to estimate and propagate the linking uncertainty. Potential solutions include Bayesian approaches obtaining posterior distributions of the matching matrix, multiple imputation with combining inference rules, and fitting a layer of regression on the linkage model using hierarchical modeling (Reiter, 2022).
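As one illustration of propagating uncertainty, the standard multiple-imputation combining rules (Rubin's rules) pool estimates computed on several plausible completed or linked datasets. The sketch below shows only the generic combining step with hypothetical inputs; it is not the specific approach described by Reiter (2022).

```python
from statistics import mean, variance

def combine_estimates(point_estimates, within_variances):
    """Rubin's combining rules for m completed datasets."""
    m = len(point_estimates)
    q_bar = mean(point_estimates)    # combined point estimate
    w_bar = mean(within_variances)   # average within-dataset variance
    b = variance(point_estimates)    # between-dataset (sample) variance
    total_variance = w_bar + (1 + 1 / m) * b
    return q_bar, total_variance

# Hypothetical mean-income estimates and their sampling variances from
# five plausible sets of links or imputations.
estimates = [51_200.0, 50_800.0, 51_500.0, 50_900.0, 51_100.0]
variances = [90_000.0, 88_000.0, 92_000.0, 91_000.0, 89_000.0]

q_bar, total_var = combine_estimates(estimates, variances)
```

In this example the between-dataset component doubles the total variance relative to the average within-dataset variance, which is the uncertainty that a single point-estimate linkage would leave out.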

Conclusion 4-2: It is important that statistical agencies producing estimates of income, consumption, and wealth using multiple data sources maintain high quality of the data and estimates while maintaining low burden on survey respondents and containing costs to the extent feasible.

Recommendation 4-2: Statistical agencies should develop estimates of error when producing estimates for households of income, consumption, and wealth using integrated multiple data sources. The estimates of error should account for linkage errors in addition to other errors such as those arising from sampling, lack of coverage, and misreporting. Agencies should regularly publish measures of the different kinds of error and both agencies and data users should report the methods used to account for them.

INTEGRATION PROJECTS: THE UNITED STATES

Examples of projects to integrate multiple data sources on one or more of income, consumption, and wealth in the United States include the NEWS at the Census Bureau and the Comprehensive Income Dataset (CID) of the University of Chicago in partnership with the Census Bureau. Table 4-6 lists nine projects in all. Boxes 4-2A, 4-2B, and 4-2C provide descriptions of each project grouped into three categories: continuing projects on household and family income (four projects), adult distribution of national income and wealth (two projects and a general description of the World Inequality Database), and one-time projects.

Observations about the projects listed and described in Table 4-6 and Boxes 4-2A, 4-2B, and 4-2C, as well as their underlying source documents, follow:

Developing an integrated, high-quality dataset on household income, consumption, or wealth is challenging, given the need to acquire and evaluate data sources, decide how to use them, and implement the necessary data linkages.
The one-time data-integration efforts all have useful insights and methods to contribute, and can provide guidance on imputing ICW components within particular datasets.
The World Inequality Database projects are interesting and important, particularly for cross-national comparisons, and provide a guide for using tax data as the basic source of income information. As shown in Chapter 2, however, their income definition is different from that used by both BEA and OECD.
The BEA household personal income distributions, the BEA/BLS consumer unit expenditure distributions, and the Census Bureau’s NEWS projects are key to making near-term progress toward an integrated ICW dataset.

TABLE 4-6 Integrated Income, Consumption, and Wealth Dataset Projects (USA)

ICW focus	Project/Home	Scope/Unit of Analysis/Time Period	Data and Methods	Products and Access	Key Reference(s)
I	Personal Income; BEA	Personal income distributions consistent with NIPA Table 2.1; Households; 2000 forward (annual—latest year, 2021, is a projection)	CPS-ASEC, supplement top decile with IRS tabulations, SCF for some income sources	Tables of household personal income and components by decile; Inequality metrics	BEA (2023); Fixler et al. (2020)
I	NEWS; Census Bureau	Comprehensive income distributions and population characteristics; Households; 2018 forward (to be annual)	PIK-based linkage between CPS-ASEC, IRS tax data, public transfer records	Working papers; Microfile (internal Census Bureau access only)	Bee et al. (2023)
I	CID Project; Census Bureau with University of Chicago	Comprehensive income distributions and population characteristics; Families; extending back and forward in time (Corinth et al., 2022, compare data for 1995, 2016)	PIK-based linkage between household surveys (e.g., CPS-ASEC, SIPP, ACS), IRS tax returns, public transfer records	Academic papers; Microfile (internal Census Bureau access only)	Corinth et al. (2022); Medalia et al. (2019)
I	Household Incomes in Tax Data; Treasury Department with Joint Committee on Taxation	Comprehensive income distributions, some population characteristics; Households; 2010 tax year	IRS universe file, includes 1040 filers and nonfilers on information returns (e.g., 1099s)	Academic papers; Microfile (internal Treasury access only)	Larrimore et al. (2021)
I	Distribution of Household Income; Congressional Budget Office	Comprehensive income distributions, some population characteristics; Households; 1979 to latest year for which tax data available	IRS/SOI 1040 file statistically matched with CPS-ASEC; correct for underreporting of Medicaid, SNAP, and SSI	Annual reports with tables/graphs by quintile (more detail for highest quintile)	CBO (2023); Habib (2018)
I	DINA: Income (in the WID); Paris School of Economics	National income and income net of taxes and transfers distributions; Adults (and couples); 1962–2014 (limited data back to 1913); extended in time by projecting 2014 IRS/SOI PUMS file with various sources	IRS/SOI PUMS, nonfilers created from CPS-ASEC, variety of sources to estimate nontaxable incomes (e.g., retained earnings)	Academic papers; Microfiles available from WID	Chancel et al. (2022); Piketty et al. (2018)
W	DINA: Wealth (in the WID)	National wealth distributions; Adults (and couples); 1962–2012 (limited data back to 1913); extended in time by projecting 2012 IRS/SOI PUMS file	IRS/SOI PUMS supplemented from 1996 with internal SOI tabulations; capitalize wealth using income reported on tax returns, supplement with SCF	Academic papers; Microfiles available from WID	Chancel et al. (2022); Saez & Zucman (2016, 2020)
W	Top Wealth in America; Treasury Department with Princeton and University of Chicago	Comprehensive wealth distributions; Adults; 1989–2016	IRS CDW, capitalize wealth using income flows, supplement with SCF	Academic papers; Tables (internal Treasury access only)	Smith, Zidar, & Zwick (2021)
ICW	“3D”; Washington Center for Equitable Growth	Joint distributions of ICW; Households; Data for every 3 years, 1989–2016	SCF for income and wealth, statistical match to CE for consumption	Academic papers; Microfiles available from authors	Fisher et al. (2022)

NOTE: ACS = American Community Survey; BEA = Bureau of Economic Analysis; CDW = Compliance Data Warehouse; CID = Comprehensive Income Dataset; CPS-ASEC = Annual Social and Economic Supplement to the Current Population Survey; DINA = Distributional National Accounts; ICW = income, consumption, and wealth; IRS = Internal Revenue Service; NEWS = National Experimental Wellbeing Statistics; NIPA = National Income and Product Accounts; PIK = Protected Identification Key, generated in data linkage by the Census Bureau (represents anonymized versions of Social Security or Taxpayer Identification Numbers); SCF = Survey of Consumer Finances; SIPP = Survey of Income and Program Participation; SNAP = Supplemental Nutrition Assistance Program; SOI = Statistics of Income Division; SSI = Supplemental Security Income; WID = World Inequality Database.

SOURCE: Compiled by the panel.

BOX 4-2A
Continuing Projects on Estimating U.S. Household/Family Income Distributions

Distribution of Personal Income

This is a project initiated at the BEA and intended to move to a production basis. It measures how U.S. personal income (NIPA Table 2.1) is distributed across households using the CPS-ASEC as the base with IRS/SOI estimates used to adjust incomes for higher-income households for underreporting and a variety of sources used to allocate personal income components to households. The statistics build on at least a decade of BEA research by bringing in new sources of data, including demographic surveys, aggregated tax records, and administrative records. BEA published the first set of prototype statistics in March 2020 for 2007–2018 and added statistics on the distribution of disposable (after-tax) personal income later that year. BEA has extended the series back to 2000, continued publishing new estimates annually, and incorporated methodological improvements since then. In December 2022, BEA published supplemental internationally comparable data (using OECD guidelines), along with accompanying documentation.^a

NEWS

NEWS is a project initiated at the Census Bureau intended to move from an experimental to a production status. Its goal is to provide high-quality household distributions of money income, cash income, and near-cash income both pre- and post-government taxes and transfers. It uses a variety of administrative records to correct for the massive nonresponse and misreporting (mostly underreporting) of various income sources ascertained in the CPS-ASEC. The first available estimates are for 2018 for pre-tax and transfer money income (see Bee et al., 2023); they show an increase of 6 percentage points in household median income, mostly due to the use of administrative records for retirement and investment income for the elderly.

CID

CID is a foundation-funded project of Bruce Meyer and colleagues at the University of Chicago working with the Census Bureau to combine surveys, tax records, and federal and state benefit program records. The goal is to overcome inaccuracies in basic understanding of economic wellbeing by correcting for nonresponse and misreporting of income in household surveys. To date, the CID project has linked tax records and 12 sources of federal and state administrative program data with the CPS-ASEC, ACS, and SIPP, and has produced new research on extreme poverty and homelessness. The intention is to extend the dataset back in time for two decades and to update it continuously going forward (see the CID website at cid.harris.uchicago.edu).

The Distribution of Household Income

This is an annual publication of the Congressional Budget Office (CBO), estimating income (money and in-kind plus realized capital gains) before and after government taxes and transfers. Each report covers 1979 through the latest year for which IRS/SOI-cleaned samples of individual income tax returns are available to CBO (the latest year of estimates is 3 years prior to the publication date). CBO statistically matches the IRS/SOI data to the CPS-ASEC and corrects CPS reports of Medicaid, SNAP, and SSI for underreporting with administrative-record aggregate information. The latest report (CBO, 2023) covers 1979 through 2019, with graphs and tables for income quintiles and more detail for the top quintile.

__________________

^a BEA has a similar project under way with BLS to develop estimates of U.S. personal consumption expenditures for consumer units in the CE; see Distribution of Personal Consumption Expenditures: U.S. Bureau of Labor Statistics (bls.gov). Prototype estimates are available for 2017–2021.

BOX 4-2B
World Inequality Database (WID) Project to Estimate Adult National Income and Wealth Cross-Nationally

DINA

Piketty et al. (2018) constructed “distributional national [income] accounts (DINA)” for the United States from 1913 to 2014. They estimated national income (GDP minus capital depreciation plus net income from abroad) for the adult population, creating pre-tax and post-tax series (the latter allocates all government spending, not just transfer programs per se). For the years 1962–2014, their foundational database comprised IRS/SOI PUMS. They used the CPS-ASEC to simulate nonfiling tax units and a variety of data sources to estimate the incidence of retained earnings and other nontaxable income sources. They found that, from 1980 to 2014, average pre-tax real national income per adult stagnated for the bottom 50% of the distribution; grew for adults between the median and the 90th percentile faster than tax and survey data suggest, due in particular to the rise of tax-exempt fringe benefits; and increased greatly at the top. They also found that the government offset only a small fraction of the increase in inequality.

Distributional National Wealth Accounts

Saez and Zucman (2016) constructed “distributional national [wealth] accounts” for the United States from 1913 to 2012, estimating national wealth from the Federal Reserve Financial Accounts for the adult population. For the years 1962–2012, their foundational database comprised IRS/SOI PUMS, supplemented from 1996 with internally created IRS/SOI tabulations. They estimated wealth by capitalizing asset incomes reported by individual taxpayers, accounting for assets that do not generate taxable income. They successfully tested their capitalization method with the SCF, linked estate and income tax returns, and foundations’ tax records. They found that the wealth share of the top 0.1% of households rose from 7% in 1978 to 22% in 2012, almost as high as in 1929. They attribute the increase in wealth inequality in recent decades to the upsurge of top incomes combined with an increase in savings rate inequality.

WID

The WID, housed at the Paris School of Economics, built on and extended the work of Piketty et al. (2018) and Saez and Zucman (2016) and earlier work on constructing distributional estimates of income and wealth. WID provides guidelines for countries to construct distributions of national income and wealth for adult individuals and couples (Chancel et al., 2022). The goal is an accessible database of consistently derived statistics for research and policy use. Ultimately, a goal of WID developers is to have it adopted by national statistical offices. For the United States, the WID income and wealth distributional files are updated by projecting the 2012 IRS/SOI PUMS file forward using a variety of data sources for population, income, and wealth. Various improvements have been made in data and methods since the predecessor to WID, the World Top Incomes Database, was begun in 2011.
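
The capitalization idea described above for the distributional national wealth accounts can be illustrated with a minimal sketch. It assumes that, for each asset class, a capitalization factor is formed as aggregate wealth divided by aggregate reported income, and that a taxpayer's wealth in the class is the reported income multiplied by that factor. The asset classes, dollar amounts, and function names are hypothetical and chosen only for illustration; the published method includes many adjustments (for example, for assets that generate no taxable income) that are not shown here.

```python
from typing import Dict


def capitalization_factors(aggregate_wealth: Dict[str, float],
                           aggregate_income: Dict[str, float]) -> Dict[str, float]:
    """Factor per asset class = aggregate wealth / aggregate reported income."""
    return {asset: aggregate_wealth[asset] / aggregate_income[asset]
            for asset in aggregate_wealth}


def capitalized_wealth(unit_income: Dict[str, float],
                       factors: Dict[str, float]) -> float:
    """Estimate one unit's wealth by scaling each income flow by its factor."""
    return sum(factors[asset] * income for asset, income in unit_income.items())


# Hypothetical aggregates (arbitrary units), not actual Financial Accounts data.
agg_wealth = {"interest": 1000.0, "dividends": 3000.0}
agg_income = {"interest": 40.0, "dividends": 60.0}
factors = capitalization_factors(agg_wealth, agg_income)

# Hypothetical taxpayer reporting 2 units of interest and 1 unit of dividends.
taxpayer = {"interest": 2.0, "dividends": 1.0}
estimated_wealth = capitalized_wealth(taxpayer, factors)
print(factors, estimated_wealth)
```

In this toy setup the implied rate of return is the same for every taxpayer within an asset class. That is the kind of assumption Smith et al. (2021) relax when they observe that high-wealth households earn interest in higher-yielding forms.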

BOX 4-2C
One-Time Data Integration Projects on ICW

Fisher et al. (2022) use data on income and wealth from the every-3-years SCF for 1989–2016 to examine distributional characteristics of the United States. They statistically match the SCF with the CE Interview Survey to estimate the ratio of expenditures measured in the SCF (homes, vehicles, and food since 2004) to total expenditures so that their dataset contains household-level estimates of ICW. They find an increase in inequality over the study period in both two dimensions (income and wealth) and three dimensions (income, wealth, and expenditures), with a faster increase in multi-dimensional inequality than in one-dimensional inequality. They conclude that viewing inequality through one dimension greatly understates the level and the growth in inequality in two and three dimensions. The SCF data and CE data they used are available from the Federal Reserve Board and BLS; programs to create imputations are available from authors.

Larrimore et al. (2021) use individual income tax returns and 1099 information returns for tax year 2010 to construct households for the entire population, including people who do and those who do not file tax returns. The linkage is address-based so that, for example, a child filing a return is linked to the child’s parents through a common address. They find that using tax units as a proxy for households overstates household income inequality, as measured by Gini coefficients, while the CPS-ASEC understates household income inequality. They further find that the federal income tax code and EITC are less progressive when measured at the household instead of the tax filing unit level. The dataset they assembled is internal to the Treasury Department.

Smith et al. (2021) use a variety of data to estimate wealth distributions in the United States and assess wealth inequality from 1989 to 2016. Their core dataset is the IRS/SOI stratified random samples of tax returns for 1965–2016. For each asset class (e.g., pass-throughs, trusts, loans to businesses) they use numerous sources to generate unique source-owner linked data. For example, housing asset estimates incorporate data from private-sector sources on state property tax rates, assessed tax values, and state price indexes, and from the Census Bureau on state property tax revenues and population. Among other data sources they use are business tax returns, estate tax filings, Compustat public company filings, the SCF, and estimates by other researchers. They find that rich people earn much more of their interest income in higher-yielding forms and also have much greater exposure to credit risk. They estimate less dramatic (compared with other estimates) but still large changes in wealth among the top 1%, 0.1%, and 0.01% and find that wealth is very concentrated: the top 1% holds nearly as much wealth as the bottom 90%. Their dataset is internal to the Treasury Department.
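
The address-based household construction described above for Larrimore et al. (2021), and the comparison of inequality across tax units and households, can be sketched as follows. This is a toy illustration under strong assumptions: addresses are matched after a simple, hypothetical normalization, every record carries a single income amount, and the Gini coefficient is unweighted. The records and helper names are invented for the example and do not reproduce the authors' procedure.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def normalize_address(addr: str) -> str:
    """Crude normalization: uppercase, drop periods/commas, collapse spaces."""
    return " ".join(addr.upper().replace(".", "").replace(",", "").split())


def household_incomes(records: List[dict]) -> Dict[str, float]:
    """Sum tax-unit incomes that share a normalized address."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        totals[normalize_address(rec["address"])] += rec["income"]
    return dict(totals)


def gini(values: Iterable[float]) -> float:
    """Unweighted Gini coefficient for non-negative incomes."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Hypothetical tax units: a child filing separately at the parents' address.
records = [
    {"address": "12 Oak St.", "income": 60000.0},
    {"address": "12 oak st", "income": 5000.0},
    {"address": "9 Elm Ave", "income": 65000.0},
]
tax_unit_gini = gini(r["income"] for r in records)
households = household_incomes(records)
household_gini = gini(households.values())
print(round(tax_unit_gini, 3), round(household_gini, 3))
```

In this example the low-income child filer is absorbed into the parents' household, so measured inequality falls when moving from tax units to households, which is the direction of the effect the authors report.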

Recommendation 4-3: Statistical agencies that produce estimates for households about income, consumption, and wealth should regularly consult with expert groups and agency advisory committees and evaluate their datasets (whether based on a single data source or multiple sources) to assess whether the datasets and the estimates derived from them meet quality criteria—namely relevance, accuracy, timeliness and punctuality, accessibility and clarity, and comparability and coherence.

INTEGRATED DATA INFRASTRUCTURES: OTHER COUNTRIES

This section provides an overview of the integrated data infrastructures (IDIs) that exist in a selection of other countries. The experience of other nations provides information about the preconditions necessary for the successful introduction of IDIs and the nature of the IDIs that are introduced.

The countries included in the review range from nations with relatively mature IDIs, specifically the Nordic countries (Denmark, Finland, Norway, and Sweden), to nations with more recently established national infrastructures, such as the Netherlands and New Zealand. The panel also heard from Statistics Canada on its new Distributions of Household Economic Accounts; see the Sinclair (2022) presentation. The United Kingdom is an example of a country still establishing a more integrated data system, with its ongoing “household finances transformation” project. Lessons learned are listed at the end of the section.

The Nordic Countries

The most comprehensive overview of the establishment of IDIs in the Nordic countries is provided by the United Nations’ Economic Commission for Europe (United Nations, 2007), a co-production of the national statistical agencies of Denmark, Finland, Norway, and Sweden—each of which is a single agency with certain rights and responsibilities. The IDIs for these countries are mature. UNECE (2007, especially pp. 5–6 and accompanying table) describes their origins in the mid- to late-1960s when central population registers were established in all Nordic countries, based on unique personal identification numbers. (These registers correspond to the concept of a “spine,” as cited in Chapter 5.) Other registers were set up in subsequent years, including other “base” registers of businesses and dwellings, as well as linked “topic” registers (e.g., education, employment, income). Official statistics based on the administrative registers appeared from the early 1970s. Register-based data on income were first used in census statistics in Finland in 1970, in Sweden in 1971, in Norway in 1980, and in Denmark in 1981.

Based on their collective experience, UNECE (2007, Chapter 3) cites five “general preconditions”—that is, “certain preconditions that facilitate the extensive use of administrative sources in statistics production” (2007, p. 7). These are as follows:

Legal base. UNECE (2007, p. 7) states that “[a]ll Nordic countries have a national statistics act that gives the NSI the right to access administrative data on unit level with identification data and to link them with other administrative registers for statistical purposes. Furthermore, the statistics act provides a detailed definition of data protection.” The Nordic countries also have laws covering the processing of personal data including provisions to ensure that individuals’ data are appropriately protected. In addition, “processing data for statistical purposes is allowed even if it was not the main aim of the data collection” (p. 7), and data, once processed by the national IDI agency, must not be used for purposes other than statistics and research. In addition, the “statistical authority may grant access to confidential data for scientific research or statistical surveys” (p. 7).
Public approval. UNECE (2007) points to a tension. On the one hand, the establishment of multiple linked registers can lead to what UNECE labels a potential “Big Brother” problem: the perception that the state knows everything about its citizens. On the other hand, citizens can also see advantages to having data in this form, including lower costs, reduced respondent data collection burdens (by comparison with surveys), and improved data security and privacy due to computers being more involved and interviewers less involved in data collection. Although most non-Nordic researchers probably believe that public trust and confidence in the Nordic register-based IDIs is very high by comparison with other countries, UNECE (2007) points out that this trust and confidence cannot be assumed (and points to some heterogeneity in views across the Nordic countries). UNECE concludes that it is “very easy to lose the confidence of the general public, but a major effort to regain it” (2007, p. 8).
Unified identification systems. The fundamental point is that a common identification number is essential for linking base registers (the individuals in the population register) to other registers. Linkages can be undertaken in other ways (such as statistical matching), but this is resource-intensive and leads to potential inaccuracies. The need for common identifiers is closely related to the earlier precondition relating to public approval, although UNECE (2007) does not comment on this. The use of national ID numbers is routine and part of everyday life in the Nordic countries (and in many other European countries) in a way that simply does not exist in much of the world, often because of societal mistrust in such numbers.
Comprehensive and reliable register systems developed for administrative needs. The basic point is that an IDI needs to contain appropriate data. Under this heading, UNECE (2007) also remarks that development of high-quality linked register data not only helps governments govern and administer efficiently but may also be efficient for citizens, for example by providing proof of residence, civil status, and citizenship. Consequently, there can be a coincidence of interests in having well-functioning infrastructure. Interestingly, UNECE (2007, Chapter 3, p. 9) also remarks that “[o]ne reason for the successful development of administrative registers in the Nordic countries may be that all countries are rather small and homogeneous. In larger and more heterogeneous countries it may be more problematic to establish administrative registers at state [national] level due to technical, organizational and other constraints.” The United States is undoubtedly a larger and more heterogeneous country than any of the Nordic countries.
Cooperation among administrative authorities. UNECE’s final precondition refers to a dual requirement for “firm and explicit commitment from the highest possible level as well as close collaboration among relevant authorities” (2007, p. 9). The first requirement is the point made earlier about the need to have a top-level coordinating agency with rights and responsibilities regarding data (supported by the relevant legislation). This agency needs to be sufficiently well resourced to fulfil its role. The second point underlines that success also requires coordination with the other agencies and government departments that produce and supply register data for the national IDI.

The IDIs of the Nordic countries have retained their essential features since the UNECE review in 2007, but they have evolved. Because registers about individuals can be linked longitudinally, the passage of time has facilitated a substantial increase in research about ICW topics that exploits long-term panel features. For example, one can now examine income histories over a complete working life, and incomes can be linked across generations (Bengtsson et al., 2016; Björklund, 1993; Nybom & Stuhler, 2017).

UNECE (2007) devotes a section to “micro data to researchers” (8.2). UNECE’s discussion raises important issues. First, the analysis datasets that researchers wish to use are often not the ones routinely produced for official statistics. The Nordic statistical offices have devoted substantial resources to creating additional research datasets, thereby also reducing duplicated effort in the creation of samples and variables. UNECE cites the example of the Swedish Louise database, with its anonymized annual microdata on individuals’ education, income, and employment.

Second, there are issues of acquiring access per se to the data and the costs of doing so. The situation has changed since the UNECE (2007) review was undertaken, reflecting developments in information technology in particular. Nowadays, all four Nordic national statistical agencies provide access under what are essentially the Five Safes conditions (safe data, projects, people, settings, and outputs, as discussed elsewhere in this report), albeit with some heterogeneity in arrangements. For example, Statistics Denmark provides access through a one-stop Data Portal⁹ (a form of secure data server that approved researchers access remotely). (Users based outside Denmark can access data in principle but only under tightly circumscribed conditions. More typically, foreign users work with researchers based at approved Danish institutions.) Users are charged for data use. Access arrangements and charges of a similar nature apply for data from Statistics Finland,¹⁰ Statistics Norway,¹¹ and Statistics Sweden.¹²

Although Nordic register data infrastructures are the envy of researchers around the world, they are not perfect sources for analyzing the joint distribution of ICW. The essential problem for ICW analysis is that there are no register data on consumption. However, several recent studies for Nordic countries have shown how, if sufficiently comprehensive data about wealth and its components are available, it is possible to derive measures of consumption. (This possible route is discussed further in the United States context in Chapter 5.) The trick is to back out estimates of consumption from the household budget identity: consumption expenditure in a given year equals disposable income (post-tax labor and capital income) plus capital gains minus the change in wealth between last year and this year.
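
The household budget identity stated above can be written as a short calculation. The sketch below is illustrative only: the function name and the figures are hypothetical, and it assumes that all four inputs are measured for the same household over the same calendar year.

```python
def imputed_consumption(disposable_income: float,
                        capital_gains: float,
                        wealth_start: float,
                        wealth_end: float) -> float:
    """Back out consumption from the household budget identity.

    consumption = disposable income + capital gains - change in wealth,
    where the change in wealth is end-of-year minus start-of-year wealth.
    """
    change_in_wealth = wealth_end - wealth_start
    return disposable_income + capital_gains - change_in_wealth


# Hypothetical household: 50,000 disposable income, 2,000 capital gains,
# wealth rising from 100,000 to 110,000 over the year.
consumption = imputed_consumption(50_000.0, 2_000.0, 100_000.0, 110_000.0)
print(consumption)
```

Because consumption is computed as a residual, any error in measured income, capital gains, or wealth passes directly into the consumption estimate. This is why the availability and completeness of wealth data matter so much for this approach.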

The condition on wealth data availability is important. For instance, Sweden had a wealth tax until 2007 (and Finland until 2006), meaning relatively extensive wealth data were available in register form. Without these data, estimates of wealth have to be pieced together from other sources of information provided by multiple registers.

Proof of concept for the budget identity strategy for deriving measures of consumption is provided most recently by Eika et al. (2020) for Norway.

___________________

⁹ www.dst.dk/en/TilSalg/Forskningsservice

¹⁰ www.stat.fi/tup/mikroaineistot/aineistot_en.html

¹¹ www.ssb.no/en/data-til-forskning/utlan-av-data-til-forskere

¹² www.scb.se/en/services/ordering-data-and-statistics/ordering-microdata/

They link administrative data on income and wealth (Norway has a wealth tax) to other administrative data with information on financial and real estate transactions. The authors conclude that their approach produces reliable measures of household consumption expenditure (and they compare well with aggregate measures from the national accounts). Their sensitivity checks yield an important caveat, however: measures of household consumption expenditure based on tax records on income and wealth alone are likely to contain substantial measurement error. Having the additional (register) information on financial and real estate transactions is essential. Other Nordic studies using the household budget identity to derive measures of consumption include Kreiner et al. (2013) for Denmark and Koijen et al. (2014) for Sweden. Both use data from time periods when there was a wealth tax and hence register data. There is no similar study for Finland. Törmälehto (2022) highlights the main issue: no comprehensive measure of wealth currently exists in Finland’s administrative registers. To build such a measure, one could combine information from many additional topic registers (e.g., share and bond holdings, property and business wealth), but there are no register data on bank deposits or life insurance savings, for which there are only survey data collected every 3 years or so.

The Netherlands

The Netherlands is less well known as a register country than the Nordic countries, but its System of Social-Statistical Datasets (SSD) has many of the features of the Nordic IDIs described above. The SSD developed quickly. Feasibility work started in 1996, and by 2001 the national census was already entirely register-based. As in the Nordic countries, there is a single coordinating agency, Statistics Netherlands, responsible for the linked register data. There are base registers for population, businesses, and dwellings, and many additional topic registers providing “a wealth of information on persons, households, jobs, benefits, pensions, education, hospitalizations, crime reports, dwellings, vehicles and more” (Bakker et al., 2014, p. 1). In addition, data from surveys such as the Labour Force Survey are linked to individuals in the register data. Individual linking is on the basis of a unique national identity number (citizen service number). Data are available to external researchers, with remote access under the equivalent of the Five Safes conditions, and there are user charges.¹³

Bakker et al. (2014) provide a good overview of the SSD (see also van Rooijen, 2022). Bakker et al.’s article is also instructive because it provides detailed information about the organization and coordination required to establish and maintain the SSD, which is relevant for the panel’s proposals for the United States in Chapters 5 and 6. Bakker et al. stress how the workflow for linked-register data infrastructures is very different from the workflow based on (more autonomous) survey-based infrastructures. The authors emphasize coordination issues that have to be addressed, which are organizational and technical and related to content and output. Clearly, addressing these issues requires appropriate resourcing for the coordinating statistical agency.

___________________

¹³ See www.cbs.nl/en-gb/our-services/customised-services-microdata/microdata-conducting-your-own-research

Like the Nordic IDIs, the Dutch SSD is not an ICW infrastructure, and it also appears to currently have less potential to be used in construction of an ICW data infrastructure. Although the Netherlands currently has a wealth tax, wealth data do not appear to be included in the SSD. (Statistics Netherlands does produce wealth distribution statistics, however.¹⁴) There is also a household budget survey, undertaken annually,¹⁵ but it is unclear whether the survey data are linked into the SSD in the same way as the Labour Force Survey data are.

New Zealand

New Zealand is another country, like the Netherlands, that has recently developed an IDI (“IDI” is in fact the New Zealand label for its infrastructure). From a prototype that began in 2011, using tax data to provide a spine based on taxpayers, the IDI has changed substantially over time and now has an “ever resident” population-based spine and incorporates a growing number of datasets. For overviews of New Zealand’s IDI, see Atkinson and Blakely (2017) and Milne et al. (2019).

New Zealand’s IDI has some features in common with the infrastructure of the countries discussed so far. For example, a single agency is responsible for the IDI, Statistics New Zealand. There is legislation supporting the use of data for statistics and research purposes. Atkinson and Blakely (2017) also highlight the role played by the government in its emphasis on the value of evidence-based investment in public services, and they discuss how ministers “have worked to promote the use of data to improve the lives of New Zealanders” (p. 3, and section 6). They also explicitly discuss the extent to which the IDI has a “social license” (section 5). (Recall the UNECE’s second and fourth general preconditions discussed earlier.) On the other hand, Milne et al. (2019, p. 677) point out problems with incomplete data documentation and metadata:

___________________

¹⁴ See www.cbs.nl/en-gb/figures/detail/83834eng#shortTableDescription

¹⁵ See www.cbs.nl/en-gb/news/2009/52/traditional-role-patterns-in-spending-by-single-men-and-women/budget-survey

Because the data have only recently been made available for wider research use, comprehensive documentation has lagged behind data availability. Furthermore, documentation about measures and constructs is stored at the agency level, with no central database of the measures contained in the IDI.

Data access is governed by the Five Safes discussed earlier, and currently access is only possible through approved secure data centers (based at Statistics New Zealand, some government departments, universities, and other approved organizations). Remote access from outside New Zealand is not possible, and there are also user charges.¹⁶

New Zealand is an exception to the countries discussed so far, because there is no national identity number that can be used to link across data sources. This is a major challenge for the IDI, leading to potential issues of non-coverage and error. For example, the “ever resident” population that comprises the IDI spine has to be derived by combining three sources of information: tax records since 1999, New Zealand birth records from 1920, and long-term visas from 1997. (See Black, 2016, and Atkinson & Blakely, 2017, for more information.) Moreover, probabilistic matching methods have to be used to link topic-based data modules to the spine when there is no common identifier in the spine and topic datasets. Typical match keys are date of birth, first and last name, sex, and address. Mismatch error is likely to be nonrandom (Milne et al., 2019). (See Atkinson & Blakely, 2017, for further details about the criteria used to trade off maximizing linkage rates against minimizing the probabilities of erroneous linkages; for more details of linkage methods, see Statistics New Zealand, 2014.) Atkinson and Blakely (2017, Appendix 1) provide information about linkage rates and false-positive rates as of April 2017. For example, the pairwise link between the income tax and birth registers, one of the links used to create the resident population spine, has a match rate of 84.4% with an estimated false-positive rate of 1.3%. The linkage rate between the spine and Ministry of Social Development data (about benefits) is 80%, with a false-positive rate of 0.9%.
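
To make the idea of probabilistic matching on the keys listed above more concrete, the following is a minimal sketch of a generic Fellegi-Sunter-style agreement score. It is not Statistics New Zealand's actual procedure. The m- and u-probabilities (the chance that a field agrees for a true match and for a non-match, respectively), the threshold, and the sample records are all hypothetical values chosen for illustration.

```python
import math
from typing import Dict, Tuple

# Hypothetical (m, u) probabilities per match key: m = P(agree | same person),
# u = P(agree | different people). Real values would be estimated from data.
FIELD_PROBS: Dict[str, Tuple[float, float]] = {
    "date_of_birth": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "last_name": (0.90, 0.02),
    "sex": (0.98, 0.50),
    "address": (0.70, 0.01),
}


def match_weight(rec_a: dict, rec_b: dict) -> float:
    """Sum log2 likelihood ratios over fields (agreement adds, disagreement subtracts)."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        if rec_a.get(field) == rec_b.get(field):
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def is_link(rec_a: dict, rec_b: dict, threshold: float = 10.0) -> bool:
    """Declare a link when the score reaches a (hypothetical) threshold."""
    return match_weight(rec_a, rec_b) >= threshold


spine = {"date_of_birth": "1980-02-01", "first_name": "ANA", "last_name": "KING",
         "sex": "F", "address": "1 HIGH ST"}
moved = dict(spine, address="7 LOW RD")  # same person, new address
other = {"date_of_birth": "1975-06-30", "first_name": "MEI", "last_name": "TANE",
         "sex": "F", "address": "3 BAY RD"}

print(is_link(spine, moved), is_link(spine, other))
```

Raising the threshold lowers the false-positive rate at the cost of a lower linkage rate, which is the trade-off discussed above. Because fields such as address agree less often for people who move frequently, missed links of this kind would not be random across the population.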

___________________

¹⁶ See www.stats.govt.nz/integrated-data/apply-to-use-microdata-for-research/ for more details.

A notable feature of New Zealand’s IDI is that it does not contain household identifiers. This contrasts with the Nordic and Dutch infrastructures, in which household identification numbers can be and have been derived from registers about family relationships, co-residence, and fiscal relationships. As a result, all IDI-based research about the distribution of income based on the administrative data from the New Zealand tax authority has analyzed the distribution of individual income, not family or household income. Some limited household identifier information is available in the IDI through links with the 2013 national census and the household budget survey, but the panel did not learn of studies that have exploited these opportunities. Another possibility would be to imitate the innovative, though complicated, method of creating households using address fields, as done by Larrimore et al. (2021) using U.S. income tax and information returns.
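The basic idea of forming households from address fields can be sketched as follows. This is only an illustration under strong simplifying assumptions, not the procedure of Larrimore et al. (2021): the record layout (`person_id`, `year`, `address`, `income`), the normalization rule, and the example data are hypothetical, and real address data would need far more careful handling (for example, of unit numbers and shared mailing addresses).

```python
import re
from collections import defaultdict


def normalize_address(raw):
    """Crude normalization: lowercase, drop punctuation, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", raw.lower())
    return " ".join(cleaned.split())


def build_households(person_records):
    """Group person records that share a year and a normalized address."""
    groups = defaultdict(list)
    for rec in person_records:
        key = (rec["year"], normalize_address(rec["address"]))
        groups[key].append(rec)

    households = []
    for hh_id, key in enumerate(sorted(groups), start=1):
        members = groups[key]
        households.append({
            "household_id": hh_id,
            "year": key[0],
            "members": sorted(m["person_id"] for m in members),
            "income": sum(m["income"] for m in members),
        })
    return households


# Illustrative (made-up) individual-level records
records = [
    {"person_id": "p1", "year": 2019, "address": "12 Main St., Apt 4", "income": 30000},
    {"person_id": "p2", "year": 2019, "address": "12 main st apt 4", "income": 20000},
    {"person_id": "p3", "year": 2019, "address": "7 Oak Ave", "income": 50000},
]
households = build_households(records)
```

Here p1 and p2 are pooled into one household because their addresses normalize to the same string, while p3 forms a household alone; household income is then the sum of members' individual incomes.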

New Zealand’s IDI is not an ICW dataset. There are longitudinal data about income from the income tax data. However, there are no data about wealth; New Zealand has no wealth or capital gains taxes. The IDI’s holdings appear to provide no opportunity to derive estimates of consumption using the household budget identity from information about income and wealth components. However, there is some IDI information about income and wealth in the linked Household Expenditure Survey data.
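To make concrete what the household budget identity would require, the sketch below uses one common form of the identity: consumption equals income minus active saving, where active saving is the change in net worth net of capital gains. The function name, argument names, and figures are hypothetical, and the sketch assumes that income is measured after taxes and that wealth is observed at the start and end of the same period.

```python
def implied_consumption(income, wealth_start, wealth_end, capital_gains=0.0):
    """Consumption implied by the budget identity C = Y - (change in wealth - capital gains)."""
    active_saving = (wealth_end - wealth_start) - capital_gains
    return income - active_saving


# Made-up example: income of 60,000; net worth rises from 100,000 to 112,000,
# of which 5,000 reflects capital gains rather than saving out of income.
consumption = implied_consumption(60000, 100000, 112000, capital_gains=5000)
```

With these figures, active saving is 7,000 and implied consumption is 53,000. The calculation cannot be carried out without wealth data for two points in time, which is why the absence of wealth information in the IDI rules it out.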

The United Kingdom

This section ends with a brief discussion of the data integration situation in the United Kingdom. Unlike the countries discussed so far, the United Kingdom is a relatively large “Anglo” country, as is the United States, though like the other countries discussed, it has a national statistical agency, the Office for National Statistics. This office is currently implementing a Household Financial Statistics Transformation (HFST) project, components of which were made possible by legislation. Specifically, the 2017 Digital Economy Act provided a new and more permissive framework for sharing personal data for defined purposes including statistical purposes.

The HFST project involves “a new integrated survey design (the Household Finance Survey […]) as well as identifying and integrating administrative data sources. [The] ultimate vision is for an ‘admin data first’ approach to the production of household financial statistics wherever possible, which is particularly relevant for the production of small area income statistics” (Massey, 2018, p. 2).

The Office for National Statistics’ vision remains a long way from a fully fledged register-based data infrastructure in the Nordic and Dutch senses. Obstacles remain, including cross-agency coordination issues. For example, benefit and credit administrative data are “owned” by the Department for Work and Pensions, which has responsibility for the benefit and tax credit systems. Administrative data on income reported for tax purposes are held by the tax authorities (His Majesty’s Revenue and Customs). The United Kingdom does not have a national personal identification number. Instead, individuals have multiple identifiers such as national insurance numbers related to employment and social insurance contributions (used by the Department for Work and Pensions and His Majesty’s Revenue and Customs), driver license numbers, and National Health Service numbers, all of which provide incomplete population coverage. Proposals to introduce a national identity card, as is common in European countries, have never gotten off the ground. (National identity cards were introduced in 2006 legislation but abolished in 2011.) Although there are national registers of dwellings and addresses, there is currently no population register that could be used to form an IDI spine.

From an ICW perspective, the United Kingdom does best on the income dimension, for which both income tax administrative data and survey data are available. The Survey of Personal Incomes has been linked to tax records to understand income inequality (Burkhauser et al., 2018), and Jenkins (2022) presents a review of recent progress, especially in improving estimates of the top of the income distribution. Wealth data are primarily survey-based, using the Wealth and Assets Survey, which the HFST project aims to include in its survey integration plans. Survey data on both income and consumption are available from the annual Living Costs and Food Survey, which is the national household budget survey (and is also included in the HFST survey integration plans). All estimates of the joint distribution of ICW to date have been derived by statistically matching data from the Living Costs and Food Survey onto the Wealth and Assets Survey dataset (see, e.g., Office for National Statistics, 2020). Similar statistical matching approaches have been undertaken in a coordinated fashion for multiple European countries.¹⁷
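Statistical matching of the kind mentioned above can be sketched as a nearest-neighbor donor imputation: each household in the recipient survey receives the consumption value of the most similar household in the donor survey, where similarity is judged on variables both surveys share. This is a minimal sketch under stated assumptions, not the Office for National Statistics' procedure; the matching variables (`income`, `hh_size`), the range-based scaling, and the example records are hypothetical.

```python
def statistical_match(recipients, donors, match_vars, donate_var):
    """Attach donate_var from the nearest donor, measured on range-scaled match_vars."""
    # Scale each matching variable by its range in the pooled data so that
    # no single variable dominates the distance.
    pooled = recipients + donors
    scale = {}
    for v in match_vars:
        vals = [r[v] for r in pooled]
        scale[v] = (max(vals) - min(vals)) or 1.0

    def distance(a, b):
        return sum(((a[v] - b[v]) / scale[v]) ** 2 for v in match_vars)

    fused = []
    for rec in recipients:
        donor = min(donors, key=lambda d: distance(rec, d))
        fused.append({**rec, donate_var: donor[donate_var]})
    return fused


# Made-up records: a wealth survey (recipients) and a budget survey (donors)
wealth_survey = [
    {"id": "w1", "income": 20000, "hh_size": 1, "wealth": 5000},
    {"id": "w2", "income": 80000, "hh_size": 4, "wealth": 300000},
]
budget_survey = [
    {"income": 22000, "hh_size": 1, "consumption": 19000},
    {"income": 75000, "hh_size": 4, "consumption": 60000},
    {"income": 50000, "hh_size": 2, "consumption": 40000},
]
fused = statistical_match(wealth_survey, budget_survey, ["income", "hh_size"], "consumption")
```

The fused records carry income, wealth, and an imputed consumption value, but no household is actually observed on all three. The resulting joint distribution therefore rests on the assumption that consumption and wealth are unrelated once the matching variables are accounted for, which is a key limitation compared with true record linkage.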

Lessons Learned

Four key lessons may be drawn from this review, as follows.

The IDIs that currently exist do not provide linked data on all three of ICW for individual units. The IDIs are perhaps better described as well-developed sources of linked data where most (but not all) data are derived from administrative registers. No existing IDI contains data on consumption. These features link to the proposals presented in Chapter 5 for integrated data about ICW.
The most comprehensive IDIs to date are “owned” by a national statistical agency. That is, there is a single organization with IDI oversight, coordinating powers, and responsibility. The IDI may use data from multiple agencies and sources, and such an agency needs to be well resourced. There is no such single agency in the United States (see the discussion in Chapter 6).
Relatedly, establishment of an IDI needs legal structures that enable the collection, sharing, and use of linked data while respecting considerations of relevance, quality, privacy, and confidentiality. These legal structures are themselves conditional on social consent and trust for IDIs, that is, the willingness of politicians and the public to provide support to, or at least not hinder, the establishment and maintenance of IDIs. These structures have implications for how data can be accessed, using the Five Safes principles that are discussed in Chapter 6.
Having a single personal identification number that uniquely identifies every individual in the target population underpins the most successful national register-based data infrastructures to date. This is relevant to the discussions of the base population spine in Chapter 5.

___________________

¹⁷ See Eurostat’s experimental statistics at ec.europa.eu/eurostat/web/experimental-statistics/income-consumption-and-wealth and the methodological discussion by Lamarche (2017).

Conclusion 4.3: International experience shows that the most comprehensive integrated data infrastructures (IDIs) to date are “owned” by a national statistical agency—that is, there is a single organization with IDI oversight, coordinating powers, and responsibility. The IDI may use data from multiple agencies and sources, and such an agency needs to be well resourced. Establishment of an IDI needs legal structures that enable the collection, sharing, and use of linked data while respecting considerations of relevance, quality, privacy, and confidentiality. These legal structures are themselves conditional on social consent and trust for IDIs. Having a single personal identification number that uniquely identifies every individual in the target population underpins the most successful national register–based data infrastructures to date.

TABLE 4-1A Quality Assessment of Relevant Surveys for an Integrated Household ICW Dataset: Income Components (and Taxes)

INCOME COMPONENT (receipt and amount unless otherwise noted)

Survey	ACS	CE Interview Survey	CPS-ASEC
Reference Period/Recipients	Prior 12 months (except where noted); people ages 15+	Prior 12 months; consumer unit	Prior calendar year (except where noted); people ages 15+
LABOR INCOME
Wages, bonuses, etc. (gross)	Yes	Yes	Yes
Self-employment	Yes (net)	Yes (net/gross)	Yes (net)
INCOME FROM ASSETS
Dividends	Property income	Dividends + interest	Yes
Interest	Property income	Dividends + interest	Yes
Rent	Property income	Yes (net)	Rent (net) + royalties
Royalties	Property income	Royalties + trusts	Rent (net) + royalties
Trusts	Property income	Royalties + trusts	Other income
Other financial assets	Property income	No	Other income
Realized capital gains/losses	No	No	No
SOCIAL INSURANCE
Unemployment compensation	All other	Other regular income	Yes (federal/state, supp., union)
Worker’s compensation	All other	No (not mentioned)	Yes
Social Security/Railroad Ret.	Yes	Yes	Yes
Disability benefits	Retirement etc.	Retirement etc.	Yes
Veterans’ payments	All other	Other regular income	Yes
Survivor benefits	Retirement etc.	Retirement etc.	Yes
RETIREMENT
Pensions	Retirement etc.	Retirement etc.	Yes
Annuities	Retirement etc.	Retirement etc.	Yes
Retirement withdrawals	Retirement etc.	Lump sum income	Yes

PUBLIC ASSISTANCE
Supplemental Security Income	Yes	Yes	Yes
TANF/GA (cash)	Yes	Yes	Yes
OTHER INCOME
Educational assistance	All other	Other income	Yes (Pell grants, other assistance)
Child support	All other	Other regular income	Yes
Financial assistance from friends/relatives	All other	Other regular income	Yes
All other regular income	All other	No	Yes (also Alaska dividend)
Lump sums	No (but 2020–21 stimulus payments)	Other regular income	No (except retirement withdrawals, also 2020–21 stimulus payments)
IN-KIND BENEFITS
SNAP	Receipt only; requires $ imputation	Yes	Yes
School lunch	No; requires imputation (receipt/$)	No	Receipt only; amount imputed
Public/subsidized housing	No; requires imputation (receipt/$)	Yes	Receipt only; amount imputed
WIC	No; requires imputation (receipt/$)	No	Receipt only; amount imputed
Energy assistance (LIHEAP)	No; requires imputation (receipt/$)	No	Yes
Free meals at work	No	Yes	No

HEALTH CARE BENEFITS
Type (Employer/Union, Medicaid/CHIP, Medicare, Marketplace (ACA), Military/VA, Indian Health Service)	Yes to all	Yes to all	Yes to all
Premium	Yes (no $, whether subsidized)	Yes ($, single service—e.g., vision, premium subsidy)	Yes ($, employer pays all, some, none)
Other medical out-of-pocket $	No	Yes (by type of service)	Yes
TAXES/TAX CREDITS	Estimated by NBER’s TAXSIM model	Estimated by NBER’s TAXSIM model	Estimated by Census Bureau’s tax model
Federal taxes	No	No (calculated using TAXSIM)	No
State/local taxes	No	No (calculated using TAXSIM)	No
FICA taxes	No	Yes	No
EITC	No	Yes	No
Other tax credits	No	Monthly child tax credit	No
Other tax information	No	Yes (e.g., AMT liability)	No

Survey	HRS	PSID	SCF	SIPP
Reference Period/Recipients	Prior calendar year; sample member/spouse	Prior calendar year (some items monthly); reference person/spouse/other family	Prior calendar year; principal economic unit	Monthly (except where noted); people ages 15+
LABOR INCOME
Wages etc. (gross)	Yes (also for others 16+) (tips etc. separately)	Yes (tips etc. separately)	Yes	Yes
Self-employment	Yes (gross, trade/practice)	Yes (net/gross, business/farm, trade/practice)	Yes (net/gross)	Yes
INCOME FROM ASSETS
Dividends	Yes (from stocks, mutual funds, bonds, checking/savings/money market funds, CDs/govt. bonds/Treasury bills)	Yes	Yes	Yes
Interest	Included with dividends	Yes	Yes	Yes (share from govt. securities, checking/savings, money market funds, CDs, municipal/corporate bonds)
Rent	Yes (from business/farm); Home rent in Other income	Yes	Rent etc. (net)	Yes (gross/net)
Royalties/Estates/Trusts	Trusts (no $)	Yes	Rent etc. (also income from other investments)	Royalties, Trusts

Realized capital gains/losses	On sale of home	Yes (sale of home, stocks)	Yes	No
SOCIAL INSURANCE
Unemployment comp.	Yes	Yes	Unemployment + WC	Yes
Worker’s compensation	Yes	Yes	Unemployment + WC	Yes
Social Security (OASDI)/Railroad Retirement	Yes (net)	Yes	Yes	Yes
Disability benefits	Other income	Yes	Retirement etc.	Yes
Veterans’ payments	Yes (benefits, pension)	Yes	Retirement etc.	Yes
Survivor benefits	Other income	Yes	Retirement etc.	Yes
RETIREMENT
Pensions	Yes (detail on 2+)	Yes	Pensions	Yes
Annuities	Yes (if from IRA conversion, detail on 2+)	Yes	Pensions	Yes
Retirement withdrawals	Yes	No	Pensions	Yes
PUBLIC ASSISTANCE
Supplemental Security Income	Yes	Yes	Welfare etc.	Yes
TANF/GA (cash)	Yes (cash welfare)	Yes (cash welfare)	Welfare etc.	Yes

OTHER INCOME
Educational assistance	No	No	Other income	Yes
Alimony	No	Yes	Alimony/child support	Yes
Child support	No	Yes	Alimony/child support	Yes
Financial assistance from friends/relatives	Yes (parents)	Yes (also church/community groups)	Other income	Yes
All other income	Other income (also for others 16+)	Yes	Other income (sources from detailed list; total amount)	Yes
Lump Sums	Yes (by type/from whom if inheritance/trust)	Yes	Gifts/inheritances of $10,000 or more	Yes (plus stimulus payments in 2020–21)
IN-KIND BENEFITS
SNAP	Yes	Yes	Welfare etc.	Yes
School lunch	No	Yes	No	Yes
Public/subsidized hsg.	No	Yes	No	Yes
WIC	No	Yes	No	Yes
LIHEAP	No	Yes	No	Yes
Free meals at work	Yes (at home, no $)	Yes (also transportation/job training/clothing/other)	Yes (at work)	No (food assistance from others)

HEALTH CARE BENEFITS
Type (Employer/Union, Medicaid, Medicare, ACA, Military/VA, Indian Health Service)	Yes to all	Yes to all	Yes to all	Yes to all
Premium	Yes ($ paid by self)	Yes	Yes ($, source of payment)	Yes
Medical out-of-pocket	Yes	Yes	No	Yes (inc. OTC)
TAXES/TAX CREDITS
Federal taxes	No	No	No	No
State/local taxes	No	No	No	No
FICA taxes	No	No	No	No
EITC	No	No	No	No
Other tax credits	No	No	No	No
Other tax information	Filing status, whether itemized	Filing status, whether itemized	Filing status, schedules filed, whether itemized	Filing status, EITC, whether itemized

NOTE: ACA = Affordable Care Act; ACS = American Community Survey; AMT = Alternative Minimum Tax; CD = Certificate of Deposit; CE = Consumer Expenditure Survey; CHIP = Children’s Health Insurance Program; CPS-ASEC = Annual Social and Economic Supplement to the Current Population Survey; EITC = Earned Income Tax Credit; FICA = Federal Insurance Contributions Act (payroll tax); GA = General Assistance; HRS = Health and Retirement Study; IRA = individual retirement account; LIHEAP = Low Income Home Energy Assistance Program; NBER = National Bureau of Economic Research; OASDI = Old-Age, Survivors, and Disability Insurance; PSID = Panel Study of Income Dynamics; SCF = Survey of Consumer Finances; SIPP = Survey of Income and Program Participation; SNAP = Supplemental Nutrition Assistance Program; TANF = Temporary Assistance for Needy Families; TAXSIM = Tax Simulator; VA = Veterans Affairs; WC = Workers’ Compensation.

SOURCES: Beaule et al. (2023), Bureau of Labor Statistics (2022, 2023); Census Bureau (2021a,b,c, 2022a, 2023a); Federal Reserve (2019, 2020); Health and Retirement Study (n.d.a).

TABLE 4-1B Quality Assessment of Relevant Surveys for an Integrated Household Income, Consumption, and Wealth Dataset: Consumption and Expenditures

CONSUMPTION/EXPENDITURE COMPONENT (item and amount unless otherwise noted; ascertained for economic unit—household, consumer unit, etc.)

Survey	CE Interview Survey	HRS	PSID
Reference Period	Prior 3 months	Prior 12 months	Varies—Prior year, last month, average week
ALCOHOL/TOBACCO
Alcohol	Yes	Food at home	No
Tobacco	Yes	No	No
APPLIANCES/HOUSEHOLD EQUIPMENT/REPAIRS
Major appliances	Yes – detail	Yes	Furnishings/equipment
Small appliances/tools	Yes – detail	Furnishings/small appliances	Furnishings/equipment
Heating/cooling	Yes – detail	No	No
TVs/radios/computer	Yes – detail	Yes	Yes
Sports/exercise equip.	Yes – detail	Yes	Recreation
Equipment repairs/service contracts	Yes – detail	No	No
CLOTHING – FOOTWEAR – ACCESSORIES/CLOTHING SERVICES/PERSONAL CARE
For adults 18	Yes	Yes	Apparel
For children	Yes	No	Apparel
For outside household	Yes	No	No
Dry cleaning, etc.	Yes – detail	Yes	No
Hair cuts, wigs, etc.	Yes – detail	Yes	No
CONTRIBUTIONS
$ to college students	Yes	No	No
$ to outside household	Yes	Cash or Gifts	Yes
Child support/alimony	Yes	Cash or Gifts	Yes (separately)
$ charities	Yes – detail	Contributions	Yes – detail

$ political orgs.	Yes	Contributions	No
Stocks, bonds	Yes	No	With cash
EDUCATIONAL EXPENSES
Tuition	Yes	No	Tuition etc.
Child day care	Yes	No	Yes
Recreational lessons	Yes	No	No
Other school expenses	Yes – detail	No	Tuition etc.
FOOD
Food at home	Yes	Yes (delivery separate)	Yes (delivery separate)
Food away from home	Yes	Yes	Yes
School meals	Yes	No	Yes
FURNISHINGS/RELATED ITEMS
Purchases	Yes – detail	Furnishing/small appliances	Furnishings/equipment
Rentals/repairs	Yes – detail	No	No
Household supplies	Yes	Yes	No
HEALTH INSURANCE/MEDICAL EXPENSES/OTHER INSURANCE
Health insurance	Yes – detail	Yes	Yes
Medical care	Yes – detail	Yes	Yes
Medicine/supplies	Yes – detail	Yes	Yes
Life/other insurance	Yes – detail	No	No
HOUSING: OWNED
Mortgage interest/Principal	Yes – detail	Yes	Yes
Taxes/insurance/fees	Yes – detail	Yes	Yes (taxes, insurance separately)
HOUSING: RENTED
Rent	Yes – detail	Yes	Yes
Costs included in rent	Yes – detail	No	Yes

HOUSING CONSTRUCTION/REPAIRS/MAINTENANCE
Major construction	Yes – detail	No	Yes
Repairs	Yes – detail	Yes	Yes
MISCELLANEOUS
Gardening	Yes – detail	Yes	No
Pets	Yes – detail	No	No
Other	Yes – detail	No	No
SUBSCRIPTIONS/MEMBERSHIPS/BOOKS/ENTERTAINMENT EXPENSES
Tickets, memberships	Yes – detail	Yes	Recreation
Books, newspapers	Yes – detail	No	Recreation
Other	Yes – detail	Yes	Recreation
TRANSPORTATION
Public transportation	Yes	No	Vehicle insurance etc.
Taxis, etc.	Yes	No	Vehicle insurance etc.
TRIPS/VACATIONS
Entertainment on trips	Yes – detail	Trips	Trips
Trips/vacations	Yes – detail	Trips	Trips
UTILITIES AND FUELS
Phone/cable/internet	Yes – detail	Yes	Yes
Fuels	Yes – detail	Yes	Yes (gas, electricity)
Water	Yes	Yes	Yes
Garbage	Yes	No	Other
Other	Yes	No	Other
VEHICLES/OPERATING EXPENSES
Owned/leased	Yes – detail	Yes	Yes (payments)
Maintenance, repairs	Yes – detail	Yes	No
Fuel, fees, parking	Yes – detail	Yes	Insurance, gas, parking, bus/train, taxi, other

NOTE: The ACS, CPS-ASEC, SCF, and SIPP do not collect detailed information on consumption.

SOURCES: Beaule et al. (2023), Bureau of Labor Statistics (2022, 2023); Census Bureau (2021a,b,c, 2022a, 2023a); Federal Reserve (2019, 2020); Health and Retirement Study (n.d.a).

TABLE 4-1C Quality Assessment of Relevant Surveys for an Integrated Household Income, Consumption, and Wealth Dataset: Assets and Debts

ASSET/DEBT COMPONENT (item and gross/net value unless noted; ascertained for economic unit unless noted)

Survey	CE (Interview)	HRS	PSID	SCF	SIPP
Reference Period	4th interview	Time of interview	Time of interview	Time of interview	Time of interview
BUSINESSES (INCLUDING DEBT)
Active management	Net profit/loss prior year	Yes (business or farm)	Yes (business or farm)	Yes – detail (up to 2)/more detail if < 500 employees	Yes
Non-active (e.g., LLCs, S-Corps)	No	Yes (business or farm)	Yes (business or farm)	Yes – detail	Yes
CHARGE ACCOUNTS/CREDIT CARDS
Bank cards/store-branded cards	Yes	Credit card debt	Credit card debt	Yes – detail	Yes
Revolving accounts	No	Credit card debt	Credit card debt	Yes	No
EDUCATION/OTHER LOANS
Education loans	Yes	Yes	Yes	Yes – detail (up to 6)	Yes
Other loans—regular/other/payday	Yes (medical/personal)	Yes (health care, other)	Loans from relatives	Yes – detail (up to 6)	No
FINANCIAL ASSETS
Checking accounts	Checking etc.	Checking etc.	Checking etc.	Yes – detail (up to 6)	Yes (interest-earning)
Prepaid debit card accounts	No	No	No	Yes – total amount	No
IRA/KEOGH accounts	Yes	Yes (ownership for up to 3+ accounts; % in stocks)	Yes (with annuities)	Yes – detail, separately for each person	Yes

CDs	Checking etc.	Government bonds	CDs, bonds	Yes – detail	Yes
Savings/money market	Checking etc.	Checking etc.	Checking etc.	Yes – detail (up to 5)	Yes (savings/MMA)
Mutual funds/hedge funds	Stocks and mutual funds	Yes	Stocks and mutual funds	Yes – detail	Yes
Bonds	Yes	Yes (corporate/municipal, government)	CDs, bonds	Yes – detail	Yes
Publicly traded stock	Stock and mutual funds	Yes	Stocks and mutual funds	Yes – detail	Yes
Brokerage account	No	No	No	Yes	No
Annuities	Other financial assets	Yes	Yes (with IRAs)	Yes – detail	Yes
Trusts/managed investment accounts	Other financial assets	Other assets	Life insurance etc.	Yes – detail	Yes
Life insurance (term, whole life)	Yes (exclude term)	Yes	Life insurance etc.	Yes – detail	Yes
Loans to people outside unit	No	Other assets	Yes	Yes	Yes
Other assets	Other financial assets	Other assets	Yes	Yes (up to 3)	Yes (artwork, etc.)
Other debts	No	Yes	Yes (legal bills, other)	Yes	Yes (medical and other debt separately)
INHERITANCES
Inheritance received	Property	Yes	Yes	Yes – detail (up to 3)	No

Expect receive/leave	No	Yes (disposition of deceased spouse’s estate; provisions in will/trust for children, grandchildren)	No	Yes	No
Personal trust/foundation	No	No	No	Yes	No
PENSIONS
Stock options	No	No	No	Yes (employed reference person/spouse/partner) – detail	No
Employer plans	DC Plans	Pensions (up to 2+)	Yes (current/former) – detail	Same as stock options (up to 2 plans)	Yes (DB/DC)
Lump sum payout/rollover	Included with all lump sum income	Yes	No	Same as stock options (up to 4 payouts) – detail, including how used payout	Yes
PRINCIPAL RESIDENCE/LINES OF CREDIT – FOR OWNERS
Owned—farm-ranch/mobile home/house	Yes – detail	Yes	Yes	Type/value/purchase price/when purchased	Type/value
Mortgage/home equity/other loans	Yes – detail	Yes	Yes	Yes – detail (up to 3)	Yes
Lines of credit	Yes	No	No	Yes – detail (up to 3)	No
Loans for improvements	No	No	Yes	Yes – detail	No

Rent out part of property to others	Yes	Yes	No	Yes	No
Durable goods	Yes – vehicles only (see also Table 4-1B)	Yes	No	No	No
REAL ESTATE LOANS TO OTHERS/INVESTMENT REAL ESTATE
Loans to others	Yes – detail	No	No	Yes – detail (up to 2)	Yes
Investment property/second homes	Yes – detail	Yes	Yes	Yes – detail (up to 2)	Yes
VEHICLES
Owned—cars/trucks/SUVs/vans/minivans	Yes – detail	Vehicles	Yes – detail	Yes – detail (up to 4)	Yes
Owned—RVs/motorcycles/boats/airplanes/helicopters/motorhomes	Yes – detail	Vehicles	Yes (all vehicles)	Yes – detail (up to 2)	Yes (RVs)
Leased	Yes	No	No	Yes – detail (up to 2)	No

NOTES: For the limited asset/debt information in the ACS and CPS-ASEC, see Table 4-1. The HRS asks whether other family members have more than $5,000 in assets; the SCF collects financial assets/debts of independent household members in addition to the principal economic unit; and the SIPP obtains most items for people ages 15 and older in the household. CD = Certificate of Deposit; CE = Consumer Expenditure Survey; DB = Defined Benefit; DC = Defined Contribution; HRS = Health and Retirement Study; IRA = individual retirement account; LLC = Limited Liability Company; MMA = Money Market Account; PSID = Panel Study of Income Dynamics; RV = Recreational Vehicle; SCF = Survey of Consumer Finances; SIPP = Survey of Income and Program Participation.

SOURCES: Beaule et al. (2023), Bureau of Labor Statistics (2022, 2023); Census Bureau (2021a,b,c, 2022a, 2023a); Federal Reserve (2019, 2020); Health and Retirement Study (n.d.a).
