How Data Disaggregation Improves Public Policy, with Professor Amy O’Hara

GPPR Podcast Editor Kharl Reynado (MPP ’23) spoke with Amy O’Hara, Research Professor in the Massive Data Institute at Georgetown’s McCourt School of Public Policy. In this podcast, Professor O’Hara addresses data disaggregation and how disaggregated data better informs policy analysis and development. Professor O’Hara also talks about where data originates, how it is collected, and why policymakers need to be mindful about the source of their data. Finally, we talk about how to use disaggregated data to improve visibility for underserved communities and what policymakers can do to make that happen.

Check out more podcasts from the Georgetown Public Policy Review (GPPR) Podcast Team: https://soundcloud.com/gppolicyreview
To follow GPPR podcasts, click the above link to GPPR’s Soundcloud Page, then click “FOLLOW” on the right-hand side of the page to be sure to know when our podcasts drop! GPPR Podcasts are also published to Apple Podcasts and to Spotify (see button at bottom of GPPR page).

[Episode Begins]

Introduction by Kharl Reynado (MPP ’23): Hey GPPR listeners! I am Kharl Reynado, a Georgetown Public Policy Review Editor. I had the pleasure of interviewing Amy O’Hara, a research professor at the Massive Data Institute at the McCourt School of Public Policy at Georgetown University. We talked about data disaggregation and where data comes from. Professor O’Hara provides insight into how data disaggregation can better inform policy analysis and how it can improve visibility for underserved communities.

Here is our conversation.

Kharl Reynado: Thank you for joining us today, Professor O’Hara. Can you just start by introducing yourself and some of the hats that you wear?

Amy O’Hara: Sure, thanks for having me! I’m a Research Professor in the Massive Data Institute at the McCourt School of Public Policy, and in that role, I do research primarily on data governance and data access issues.

O’Hara: Also in the McCourt School, I’m the Director of the Federal Statistical Research Data Center. That is one of 31 locations across the country where you can apply for and gain access to restricted government data sets. So that’s a pretty unique thing that we have here at Georgetown. I’m currently on a Federal Advisory Committee that’s looking into how to use data for evidence building so that’s called the Advisory Committee on Data for Evidence Building. We’re trying to figure out, what are the regulations and what are the policies and what are the incentives so that people will use a lot more data.

Reynado: So most of us, I think, have a basic understanding of the general data categories we see in policy reports, such as race or income. They’re usually reported in the aggregate. Today, we’re here to talk about data disaggregation. So, what is data disaggregation?

O’Hara: Data disaggregation refers to being able to see and use data at the level of granularity that you need for your questions, whether that’s understanding what’s going on in our communities or longer-term trends. We’re all generating a lot of different data points. In many cases, the only way that you can publish that information is by aggregating it so that none of us are singled out or stand out in the data.

O’Hara: But data disaggregation refers to the intentional categories that we need to understand what’s going on. This is really important in public policy where you think about people. Well, you want people by age, and so the way that those data are gathered and then displayed or disseminated really matters and the degree of granularity can be calibrated.

Reynado: What does disaggregated data provide that aggregated data might miss?

O’Hara: It’s really that richness and that context. When we think about the way that our communities are reflected in data, the biggest regular data collection is a decennial census. Every 10 years information is pulled together about every single resident in the United States. And for that information, in order to do apportionment, you say, how many humans are there in the U.S. and that’s adequate for that purpose.

O’Hara: But then, you really want to start breaking it down. What are the characteristics of these people? How many are male? How many are female? How many are old? How many are young? And you get these disaggregations of the data that were collected. The aggregate information is useful, but depending on what your policy question is, it’s not going to be useful enough.

O’Hara: Demographers often use age data in five-year age buckets so zero to four and five to nine and so forth. But if you’re talking about eligibility for a program like WIC (Women, Infants, and Children), which is a food assistance program, you need to know how many kids are zero to two. Disaggregation is really important depending on what your use is. And sometimes you need a very fine cut like maybe you need single years of age and then sometimes you might be able to use bigger buckets of age data.

O’Hara: Show me the kids that are preschool age, then show me the kids in primary school, etc., and maybe for adults you just need very broad bins. But that’s an example of how disaggregate data is absolutely necessary for some policy questions and aggregate data is never going to get you there.
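To make the age-bucket point concrete, here is a minimal Python sketch. The ages and the cut points in `program_bucket` are hypothetical, chosen only to mirror the zero-to-two example above; they are not official eligibility rules. The sketch counts the same people two ways: in standard five-year buckets and in bins tailored to a question about young children.

```python
from collections import Counter

# Hypothetical single-year ages for residents of one community.
ages = [0, 1, 1, 2, 3, 4, 4, 6, 9, 17, 34, 35, 70]


def five_year_bucket(age):
    """Label an age with its five-year bucket, e.g. 7 -> '5-9'."""
    low = (age // 5) * 5
    return f"{low}-{low + 4}"


def program_bucket(age):
    """Illustrative bins for a question about young children (assumed cut points)."""
    if age <= 2:
        return "0-2"
    if age <= 4:
        return "3-4"
    if age <= 17:
        return "5-17"
    return "18+"


# Same people, two different levels of aggregation.
five_year_counts = Counter(five_year_bucket(a) for a in ages)
program_counts = Counter(program_bucket(a) for a in ages)

print(five_year_counts["0-4"])  # everyone aged zero to four, lumped together
print(program_counts["0-2"])    # only the children aged zero to two
```

The five-year table alone cannot answer the zero-to-two question: once ages are published only as "0-4", the finer split is no longer recoverable, which is why the level of detail has to be decided before the data are aggregated.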

Reynado: So, how does data disaggregation policy influence goals such as racial equality and racial equity?

O’Hara: It’s crucial. When I mentioned the information for apportionment, it’s really how many residents are there in the United States. But then when you take it to the next stage, such as the Voting Rights Act, you want to have more information on, “Are individuals of voting age?” So, there’s a cut over and under 18. And then, looking at the demographic characteristics of these individuals — looking at race, looking at ethnicity. And when you look at the history of how the data were collected, it has varied over time.

O’Hara: Sometimes data collections were getting “White” and “Not White.” And right now, the categories for race are really set by the Office of Management and Budget. And then all the federal agencies are using that standard in order to report out on data.

O’Hara: But that standard is malleable. Right now, there are discussions about whether Hispanic origin should be included in the race question and also whether Middle East and North African should be another category added. This is something that changes as our population changes. It’s really important for policy analysis in understanding what groups are binned together and which groups can be identified on their own, so you can mark progress over time.

Reynado: What role do privacy considerations play when you’re moving from aggregated data to disaggregated data? How do policymakers balance accuracy and privacy?

O’Hara: I think that that happens a lot in the agencies where the data are being published. And I’m afraid that a lot of policymakers are not even acutely aware of this. So, as you mentioned, there is this tradeoff between absolutely private data, where you can’t see anything. And then, absolutely accurate data, where everybody would be identifiable. So there’s this dial that needs to be turned. And that tuning is really important for the uses that you’re going to have for the data. When you look at how that privacy tradeoff is happening right now – especially with disaggregations around race and ethnicity – there is a law, say at the Census Bureau and the laws at other agencies, where you can’t have this singling out of any individual.

O’Hara: You don’t want that outlier to be viewable, because you don’t want any harm coming to somebody who could be re-identified by a data publication. So, there are a variety of methods that could be used to improve the privacy of the data that’s to be released. Sometimes that could be through suppression: don’t release that table. Or it could be, roll that up into another table. You don’t want the person that stands out in that one county, so you combine counties.

O’Hara: Another way that you could protect privacy that was recently done for the 2020 Census is by adding a little bit of noise and so you can’t tell that it is that person that has unique characteristics that make them stand out in the data. So those practices are happening right now. I am not sure how well aware people are of their implementation or the consequences of using different forms of privacy protection.
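The three approaches mentioned here (suppression, rolling up, and adding noise) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure any agency actually uses: the county counts, the threshold of 5, the region groupings, and the noise scale are all hypothetical.

```python
import random

# Hypothetical counts of people in a small demographic group, by county.
counts = {"County A": 1, "County B": 40, "County C": 2, "County D": 57}

THRESHOLD = 5  # assumed minimum cell size; real agencies set their own rules


def suppress(table, threshold=THRESHOLD):
    """Suppression: withhold (None) any cell smaller than the threshold."""
    return {k: (v if v >= threshold else None) for k, v in table.items()}


def roll_up(table, groups):
    """Roll-up: combine counties into larger regions and publish the sums."""
    return {region: sum(table[c] for c in members)
            for region, members in groups.items()}


def add_noise(table, scale=2.0, seed=0):
    """Noise: perturb each count with Laplace noise, rounded and floored at zero."""
    rng = random.Random(seed)
    noisy = {}
    for k, v in table.items():
        # The difference of two exponential draws follows a Laplace distribution.
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        noisy[k] = max(0, round(v + noise))
    return noisy


suppressed = suppress(counts)
regions = roll_up(counts, {"Region 1": ["County A", "County B"],
                           "Region 2": ["County C", "County D"]})
noisy = add_noise(counts)

print(suppressed)  # small cells are withheld entirely
print(regions)     # small cells are hidden inside larger totals
print(noisy)       # every cell is published, but none is exact
```

Each option gives up something different: suppression leaves gaps in the table, rolling up loses geographic detail, and noise keeps every cell but makes small counts less reliable. That is the tradeoff between privacy and accuracy described above.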

O’Hara: This also was a factor in the 2020 Census, that part of this great civic opportunity is to be counted – to show your uniqueness, to make sure that your community is reflected in the data. But you have to balance that with being able to have privacy protection in the data that are published on the web. So that you don’t have any concerns that harms could come to individuals that are members of unique populations, or that are outliers in a geography.

O’Hara: That is the balance that we’re watching play out right now and I hope that more folks become aware of it and can have their voices heard in the way that they want to be represented in data moving forward.

Reynado: I will shift gears a little bit here. I know you were previously a senior executive at the U.S. Census Bureau. Can you tell us a bit more about where the data comes from and really who provides this kind of information?

O’Hara: We are all generating a lot of data points at all times right now. There are data points on who we are and where we live – on the types of transactions that we’re having with platforms or with merchants. Our location is known by a lot of different companies, you know with Google or Waze and location tracking. We’re generating a lot of information, but government data are usually a bit more well behaved. The government ends up with data because it was collected – it was intentionally collected.

O’Hara: It might have been collected on a tax return or it might have been collected on a Census response. But these are data that are gathered and, especially for the statistical system in the United States, are really crucial for policy to understand where programs are working, where interventions are needed. In the example of the Census that I’ve already referred to – it’s kind of our denominator for everything. That bottom number in, “How many vaccines do we need or what is the vaccination rate?” “What is the mortality rate?” You need to understand what your base is.
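The denominator idea is simple arithmetic, and a short Python sketch shows why it matters for disaggregated reporting. All numbers and group labels below are hypothetical; the population figures stand in for the kind of base counts a census provides.

```python
# Hypothetical counts; the population figures stand in for census-style denominators.
population = {"under 18": 2000, "18-64": 6000, "65+": 2000}
vaccinated = {"under 18": 500, "18-64": 4200, "65+": 1800}


def rate(numerator, denominator):
    """Return a rate as a fraction; refuse to divide by an empty base."""
    if denominator <= 0:
        raise ValueError("a rate needs a positive denominator")
    return numerator / denominator


# One rate for everyone, using the total population as the base.
overall_rate = rate(sum(vaccinated.values()), sum(population.values()))

# One rate per group, each using that group's own population as the base.
rates_by_group = {g: rate(vaccinated[g], population[g]) for g in population}

print(f"overall: {overall_rate:.0%}")
for group, r in rates_by_group.items():
    print(f"{group}: {r:.0%}")
```

A group-level rate is only as good as the group-level denominator, so if the base population is not available at the same level of detail as the numerator, the disaggregated rate cannot be computed.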

O’Hara: And so, data are being gathered. And they are then validated, you know, making sure that you have complete and accurate information and then they are disseminated or published. And you have these two pieces that need to happen, especially with government data. You want to protect the input privacy, so that no one gets their hands on your tax return or your census response. It’s kind of a cyber security angle of making sure that the data that you do collect are kept secure and that there is no unauthorized access.

O’Hara: But then, to make the data useful, you need to publish the data. When you’re publishing the data, you want to make sure that you don’t stand out in the data in a way that risks harm to you from re-identification. So, there are a set of controls in place in government to prevent input privacy breaches, as well as these output privacy risks. It’s kind of a big enterprise that involves privacy officers, data officers, information security officers, and chief information officers that are all part of the solutions to make sure that you have the best security stance possible.

Reynado: You mentioned earlier that the Census data is obviously collected, and people are filling out the forms and providing all the information. Why should policy analysts and policymakers be mindful of where their data comes from and who is filling the information out?

O’Hara: On April 1st, the results from the 1950 Census were publicly released. And the 1950 Census was collected by enumerators. They would have asked you or actually asked whoever is responding for your family. I imagine that you know in many households they went to the householder which may have been the man of the house or the matriarch of a family that is reporting on behalf of everybody that lives in the household.

O’Hara: You need to understand whether it is a self-reported response, where they ask, “Well, what is your race? What is your ethnicity?”, or whether that is being observed by somebody else. In the Census, race and ethnicity have been self-reported since 1960, but prior to that there was an enumerator involved.

O’Hara: But you compare that to data collected say in law enforcement, where the officer may not be asking, they may be observing and filling in the race and ethnicity, based upon what they see. Whenever you think about data, especially disaggregate data that’s being broken down by categories, thinking through: Who is that respondent? And did they have any incentives or disincentives for providing accurate information?

O’Hara: Another challenge here is that for some individuals their own conception of their own identification with a category may evolve over time. Part of the richness of government data collections is that you’re able to observe that over time.

Reynado: I’m going to shift gears one more time again. You served on the National Commission to Transform Public Health Data Systems. Can you tell us a bit about the Commission’s goals and why it was formed in the first place?

O’Hara: I felt a little odd on that Commission because I have a PhD in economics, and I worked for the government for a long time and I’m there with all of these leaders in the health space. It was great because we had complementary perspectives. That Commission came together, because public health data systems have really not been invested in in this country. Yes, states have vital statistics programs. They have systems to record births and deaths.

O’Hara: But then over time they’ve largely been focused on specific diseases. “Here’s a terrible disease. We should get money and we should try to deal with that.” But they haven’t really taken a holistic view or really a community level view and thought through: What are these data points that we need? They may be collected somewhere else already and how do we take best advantage of them in order to understand what’s going on in our communities and make better plans? The Commission had as its goal assessing where we were and then discussing, very bluntly, where we wanted to be. We want to be in a world where data are available to inform decisions and to inform policies. And that data are accurate and disaggregated so that we understand the subpopulations in our communities.

O’Hara: The National Commission wanted to arrest the legacy of institutional racism. We all acknowledged at the front end that our public health data systems have a lot of blind spots and a lot of weaknesses in capacity and infrastructure. And that that was harmful to many segments of society. This was not just focusing on race. This was also looking at Native Americans, looking at the disabled community. Just taking that broader lens of who is left out, and how do we remedy this? What can we do?

O’Hara: The Commission put forth a number of recommendations. We’ve got to start somewhere, you know we’re not apologizing for where we are, but we’re saying we gotta start today and we’ve got to have better health systems moving forward that are attuned to the needs of everyone – not just the people that are easy to see.

Reynado: You mentioned some of the Commission’s recommendations. Can you talk a bit more about the Commission’s recommendations for improving public health data related to disaggregated data?

O’Hara: One in particular is a tie to the scholarly community such as Georgetown and many other institutions. Researchers need to be aware of where the data are coming from and how they conform to different standards. So, there’s a big push to understanding whether your data are interoperable with another data set. And in doing so you need to examine the way that race and ethnicity data, for instance, were collected in your data set and other data that you may be joining together or analyzing as a whole. So that you are getting the right inferences about the populations that you’re attempting to understand.
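One way to picture the interoperability check is a crosswalk that maps each data set's own categories onto a shared scheme before the two are combined. The Python sketch below is illustrative only: the category codes, the shared labels, the `race_std` field name, and the decision to fold one group into "Other" are assumptions made for the example, not a published standard.

```python
# Hypothetical category codes from two data sets that record race differently.
CROSSWALK_A = {"W": "White", "B": "Black", "A": "Asian", "O": "Other"}
CROSSWALK_B = {
    "white": "White",
    "black or african american": "Black",
    "asian": "Asian",
    "pacific islander": "Other",  # assumed: data set A has no finer category
}


def harmonize(records, crosswalk, field="race"):
    """Map each record's category to the shared scheme; flag anything unmapped."""
    out = []
    for rec in records:
        raw = str(rec[field]).strip()
        out.append({**rec, "race_std": crosswalk.get(raw, "UNMAPPED")})
    return out


data_a = harmonize([{"id": 1, "race": "W"}, {"id": 2, "race": "X"}], CROSSWALK_A)
data_b = harmonize([{"id": 3, "race": "asian"},
                    {"id": 4, "race": "pacific islander"}], CROSSWALK_B)

# Records that could not be mapped need review before the data sets are combined.
unmapped = [r for r in data_a + data_b if r["race_std"] == "UNMAPPED"]
print(unmapped)
```

Flagging unmapped values instead of silently dropping them keeps the alignment decision visible. Folding a detailed category into a broader one, as the last crosswalk entry does, is exactly the kind of binning that affects which groups can later be identified on their own.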

O’Hara: And this is not just an issue with race and ethnicity. This is also critical when you look at sexual orientation and gender identity. The way that information is captured in different data systems – if you haven’t been thoughtful in how you’re aligning it, you could make inferences from the data points that aren’t really reflecting what is happening on the ground.

Reynado: Thank you so much for being on this podcast with me. I wanted to give you a chance to say any final words for those of us who are interested in improving data systems.

O’Hara: I would encourage you to get involved and make your voices heard. During the next couple of years, the Office of Management and Budget and the Census Bureau, they are going to be looking at whether they should change the way race data are collected and reported. This is going to matter for how we propose policy, how we analyze policy, how we understand our communities, how we ourselves and how our children identify as members of society. And so (1) be aware that these conversations are happening; and (2) when you have an opportunity, make your voice heard.

O’Hara: Whenever there is a request for comment to a Federal Register notice, as dry and wonky as that sounds, send a short note. All you have to do is send them an email or fill out a simple form online, but that is how you make your voice heard.

O’Hara: It can be in support. It can be against whatever the proposed change is. I would urge people to participate and then, if you do find yourself in quantitative or qualitative data collection and analysis, be thoughtful. Think about how you’re phrasing the way that you’re capturing information, particularly about characteristics, knowing that you may want to report on different groups. You will want to have this granular analysis. Think about that at the front end: How are we collecting this information? How are we validating it? And then ultimately, are we being respectful of the populations when we are reporting on it?

O’Hara: Our analyses can drive change and making sure that that is responsible to the people, the data subjects themselves, is something that I hope that there is a lot more awareness of as well.

Closing (Reynado):  Thank you for listening to this Georgetown Public Policy Review podcast. I hope you feel inspired to learn more about how data systems better inform public policy. In this episode, we talked about the National Commission to Transform Public Health Data Systems. Their October 2021 report called “Charting a Course for an Equity-Centered Data System” outlines the recommendations Professor O’Hara discussed. The Commission was convened by the Robert Wood Johnson Foundation. If you enjoyed this podcast, please subscribe! Check out more from the Georgetown Public Policy Review at www.gppreview.com. Thank you!

[Episode Ends]

For information about the National Commission to Transform Public Health Data Systems, please visit www.rwjf.com.


Established in 1995, the Georgetown Public Policy Review is the McCourt School of Public Policy’s nonpartisan, graduate student-run publication. Our mission is to provide an outlet for innovative new thinkers and established policymakers to offer perspectives on the politics and policies that shape our nation and our world.
