Category Archives: cautionary tales

Counting everyone: An acid test for a democracy

Taking a census is not one of the more exciting jobs of a national government. But a census is nothing if not democratic, so it can serve as a canary in a coal mine, a sentinel of deeper problems. Nations as diverse as Canada, Pakistan, Australia and the United States are struggling with the same thing: how to balance privacy with collecting the data needed for a fair distribution of spending and electoral representation.

The U.S. Census Bureau hopes to save billions of dollars on Census 2020 with a major overhaul, but it faces congressional skepticism about the investment needed. April 1 marked the start of major tests in Maricopa County, Ariz., and 20 counties around Savannah, Ga. Among the tests: Internet response and the use of government records when people don’t fill out a form. The results will shape the design of a massive infrastructure that must be ready for testing by late 2018 so that it can go live in early 2020.

Governments frequently track people through surveys and through data from benefits programs. But only a census counts everyone (or tries to) at the same time and assigns them to an exact place. That seems simple, but it’s very hard to do well on a national scale.

Some European nations have shifted away from taking a census in favor of assigning everyone an ID number and keeping a central register of personal data.  Don’t count on Americans embracing this idea.

Our distrust of government is just one reason that the U.S. census is so difficult to take — and so expensive. (Census 2010 cost $13 billion across a decade.) Information-sharing between agencies that is routine in some countries isn’t allowed here. Add mobility, continental sprawl, linguistic diversity and the fears of millions of immigrants living here illegally.

The United States isn’t alone:

— Citing privacy qualms, Canadian Prime Minister Stephen Harper scrapped the detailed version of the 2011 Census for a voluntary survey. Response plunged from 94% to 67%. Citing unreliability, the government did not publish results on more than 1,000 localities where the rate fell below 50%. “One in 4 … towns disappeared from the statistical registry,” said Paul Jacobson, president of the Canadian Association for Business Economics, at a Washington conference in March. Even Toronto was affected: “We’ve got hunks of the city that have disappeared from the statistical registry.” In a country that is diversifying rapidly, the gaps worry planners, researchers and businesses.

— Pakistani officials have finally agreed to undertake the country’s 2008 census. It will be conducted next March with the help of the armed forces, a sign of the strife and factionalism that have delayed it. The last census was taken in 1998, when Pakistan had 40 million fewer people.

— Australia takes a census every five years but may skip 2016 to catch up on the cost and work of shifting to digital collection. It’s facing the same efficiency vs. privacy debate as the USA and Canada.

Lest Americans get smug, the House of Representatives voted last year to make the American Community Survey voluntary. (The Senate did not.) ACS surveys 2.5 million households a year for the same information that the Census Bureau used to collect during the census every 10 years.

Congress, through various laws, has asked for every ACS question. Support for the ACS runs deep among local governments, business groups, social scientists, and civil rights and economic development groups.

But distrust of government has found a home in Congress.

Torturing the data till it lies

“Top 10 states for left-handers!” “Worst states for tall people!” “Best country to travel to if you are 45!”

The Web is rife with news features like this. The recipe: Assemble a basket of social measures for states or nations. Blend, rank and present as a measure of some condition. They are usually built as galleries of images or pages. Even as a reward for multiple clicks, they rarely offer a reader-friendly at-a-glance list.

The biggest problem with rankings like this: They use grouped data to conclude something about experiences that are much more tightly linked to local and personal factors.

This is the ecological fallacy. Put simply, you often can’t infer something about individuals from data about a group of them. This is especially true if the link that’s being claimed is barely plausible.

A simple and famous example: In the 1930 Census, a strong correlation existed between states’ English literacy rates and their shares of foreign-born people. But were immigrants more likely to be literate in English than native-born Americans? No. Census data for individuals showed the opposite, of course: immigrants were less likely than natives to be literate in English. But immigrants had clustered in states with relatively high literacy rates, so grouped data made them seem more literate than natives.
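The reversal is easy to reproduce with invented numbers. In this minimal sketch, the literacy rates and immigrant shares are made up for illustration, not the actual 1930 figures:

```python
# Two hypothetical states. In BOTH, immigrants are less literate than
# natives -- yet the state with more immigrants has higher overall literacy,
# because immigrants clustered where literacy was already high.
states = {
    # name: (immigrant_share, native_literacy, immigrant_literacy)
    "High-literacy state": (0.30, 0.95, 0.80),
    "Low-literacy state":  (0.05, 0.75, 0.60),
}

for name, (imm_share, nat_lit, imm_lit) in states.items():
    overall = (1 - imm_share) * nat_lit + imm_share * imm_lit
    print(f"{name}: {imm_share:.0%} foreign-born, overall literacy {overall:.1%}")
    # Within every state, the individual-level relationship is the same:
    assert imm_lit < nat_lit
```

Grouped across states, foreign-born share and literacy move together; at the individual level, the relationship runs the other way.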

Another example: In the presidential election of 1968, segregationist George Wallace won the electoral votes of AL, AR, GA, LA and MS. These states had the highest rates of black voters. Should we conclude that blacks voted strongly for Wallace? Hardly: black voters overwhelmingly opposed him. The grouped data reflects where black Americans lived, not how they voted.

States – diverse collections of people acting through laws and policies – exert little or no effect on many conditions in daily life, such as crime. And most social conditions vary within a state far more than they do among states. Data journalists spend a lot of time and sweat trying to get this right by collecting *local* crime rates or pupil-teacher ratios before they start probing for patterns.

There are legitimate times to rank states, most obviously on something the state government itself can affect directly, like the climate for startup businesses or the strength of consumer protection laws.

And USA TODAY has run such lists from content partners. They can be fun, clickable lists. But they really don’t tell us anything about ourselves.

So if your state ranks low as a place to be a coin collector or a Chevy driver, don’t fret.

–Paul Overberg

That stomach-sinking feeling when data is wrong

That nagging feeling that something is wrong with the data? Listen to it.

When my colleague Marisol Bello asked whether we could figure out how often parents kill their children – she was reporting in the wake of a high-profile case in Georgia last summer – I knew we could probably find some help in the FBI’s supplemental homicide reports, which include victim/suspect relationship details. I ran the queries and came up with some preliminary figures, but I also knew the SHR was notoriously spotty, because many cities fail to provide details on murders.

So I started looking for other research on the topic, eventually digging up what seemed to be the gold standard of analysis: a piece co-authored by an Ivy League researcher and published in a peer-reviewed journal. The only problem? The researchers had found six times as many filicides each year as I had in the FBI data.

I contacted the researchers. They hadn’t used the data directly from the FBI, but rather had used cleaned-up figures publicly available from James Alan Fox and Marc Swatt of Northeastern University. That must account for the differences, they said.

But something kept bothering me: According to the researchers’ findings, 3,000 children each year were killed by their parents. Keep in mind that there are roughly 16,000 homicides each year. That would mean that nearly 20% of all victims were children killed by a parent or stepparent. I covered the cops beat for the first four years of my career – I thought back to all the gang battles, lovers’ quarrels and drug deals gone wrong. I could count on one hand the number of child/parent murders I had seen. It certainly wasn’t anywhere near 20%.

So I followed the researchers’ lead, downloaded Fox and Swatt’s data and opened it in SPSS. It didn’t take long to realize each case number, which was supposed to be a unique ID, was in the file six times. A phone call to Fox, who walked me through the data, revealed the researchers’ mistake. Unlike the raw FBI file, Fox and Swatt’s dataset is built for advanced statistical analysis – it has multiple imputations to allow academic researchers to fill in holes that we know exist in FBI data, either where cases are missing entirely or where certain details (the relationship between victim and killer, for instance) aren’t included. Each killing was broken into six lines: the original record, and five different imputations with different weights applied and missing values filled in.

Fox walked me through a way to properly weight cases (his data set includes separate weights for national analysis and state-by-state analysis) and how to properly fill in gaps where relationship details weren’t provided in the raw data.
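For anyone working with a multiply-imputed file like this in code rather than SPSS, the fix looks roughly like the sketch below. The column names and the tiny dataset are invented for illustration; the real Fox/Swatt file uses its own labels and weights:

```python
import numpy as np
import pandas as pd

# Toy stand-in for a multiply-imputed homicide file. Each case appears
# six times: the raw record (imputation 0, which can have missing
# relationship codes) plus five imputed versions with weights applied.
shr = pd.DataFrame({
    "case_id":     list("AAAAAA") + list("BBBBBB"),
    "imputation":  [0, 1, 2, 3, 4, 5] * 2,
    "natl_weight": [1.0, 1.1, 1.0, 1.2, 1.1, 1.0,
                    1.0, 1.3, 1.2, 1.1, 1.2, 1.2],
    # 1 = victim was a child killed by a parent/stepparent
    "filicide":    [np.nan, 1, 1, 1, 1, 0,   # raw record missing a code
                    0,      0, 0, 0, 0, 0],
})

# The mistake: summing raw rows counts every case up to six times.
naive = shr["filicide"].sum()

# The fix: drop the un-imputed raw records, total each imputation
# using the national weights, then AVERAGE across the five imputations.
imputed = shr[shr["imputation"] > 0]
per_imputation = (imputed["filicide"] * imputed["natl_weight"]).groupby(
    imputed["imputation"]).sum()
estimate = per_imputation.mean()

print(f"naive count: {naive}, multiple-imputation estimate: {estimate:.2f}")
```

Averaging across imputations, rather than summing across them, is what keeps each homicide counted once.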

The upshot? I found that on average, about 450 children are killed by a parent or stepparent each year.

Brown University has since issued a correction to its press release on the researchers’ findings. Marisol and I used the data and our findings in a story published in USA TODAY, along with another follow-up story published later.

(A version of this post was published on the American Press Institute’s Fact-Checking Project blog.)

–Meghan Hoyer

Hey baby! What’s in a name?

Earlier this year we published a story on the most popular 2014 baby names based on data from BabyCenter, a website that caters to expectant parents. It covers about 1 in 8 newborns.

Nothing wrong with that, but the names tend to skew white and trendy. We won’t have the complete 2014 list until spring, when the Social Security Administration publishes its own list, based on near-universal infant registration for SS numbers.

This is a baby. He was born in 2013 but he does not have a trendy name.

A comparison of 2013 top 10 boys’ lists shows, in order:

— BabyCenter: Jackson, Aiden, Liam, Lucas, Noah, Mason, Jayden, Ethan, Jacob, Jack.

— Social Security: Noah, Liam, Jacob, Mason, William, Ethan, Michael, Alexander, Jayden and Daniel.
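Set arithmetic on those two top-10 lists makes the divergence concrete; only six names appear on both:

```python
# The two 2013 top-10 boys' lists from the post, as sets:
babycenter = {"Jackson", "Aiden", "Liam", "Lucas", "Noah",
              "Mason", "Jayden", "Ethan", "Jacob", "Jack"}
ssa = {"Noah", "Liam", "Jacob", "Mason", "William",
       "Ethan", "Michael", "Alexander", "Jayden", "Daniel"}

print("On both lists:  ", sorted(babycenter & ssa))
print("BabyCenter only:", sorted(babycenter - ssa))
print("SSA only:       ", sorted(ssa - babycenter))
```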

There’s less difference for popular girls’ names. Sophia/Sofia, Emma, Olivia, Ava, Isabella and Mia rule both lists.

Is that because boys’ names tend to be less trendy? Or because parents who choose more traditional names don’t bother to register on the website? In any case, we’re left with a subtly white-centric view of the nursery. Outside of the top names, the lists diverge sharply, especially for traditional names. And BabyCenter’s list is missing several Hispanic names in the top 100, like Angel, Jose, Luis and Juan.

You can also see this in names that moved up the most, according to the SSA list: more ethnic-sounding names such as Jayceon and Castiel for boys, and Daleyza and Freya for girls.

And not for nothing, the names of the Data Team members are past the bubble: All of our names have been losing popularity since at least 2000. Our trendiest member: Paul, who last broke the top 100 (at number 100) 14 years ago.

—Jodi Upton and Paul Overberg