Researchers use traffic-stop data to find racial disparities that could be the result of racial profiling or could have other innocuous explanations.

Researchers earlier this month released their third annual statewide report analyzing traffic-stop data in an effort to find signs of potential racial profiling by police.

It’s several hundred pages of often dense technical writing. Adding to the difficulty of understanding what the report says and how much stock to put in it, police chiefs who have long criticized the research made public a peer review that seemed to cast doubt on the work.

The peer review was made public just hours before the report came out, frustrating researchers who had no time to use the review to improve their work for this version of the report.

Understanding the report isn’t a simple thumbs up or thumbs down, judging whether it’s right or wrong. “It’s not as easy as saying it’s valid or it’s not; there’s lots of stuff in there,” said Michael Smith, one of the peer reviewers. He’s the criminal justice department chair at the University of Texas at San Antonio.

To make sense of this report, we have to understand each of its numerical tests of varying rigor, their strengths and weaknesses, and what the expert reviewers say about them.

1. Hold on, what is this report, anyway?

The report analyzes traffic-stop data from every police department in the state and identifies a few departments that will get a closer look in a follow-up report.

The report was written by researchers from Central Connecticut State University’s Institute for Municipal and Regional Policy, under the oversight of the Racial Profiling Prohibition Project Advisory Board, set up under the state’s Office of Policy and Management.

The analysis uses data to decide where to focus its efforts each year. This report doesn’t prove or disprove any racial profiling by police. Instead, it looks to identify which departments have the most consistent statistical irregularities across the most tests, so those departments can be subject to deeper scrutiny.

The researchers break the tests into two categories: more reliable “statistical” tests and less robust “descriptive” tests.

The three statistical tests are the veil of darkness analysis, a synthetic control, and a search hit-rate test.

2. What is the Veil of Darkness test?

The test considered most rigorous is the veil of darkness analysis, which compares stops made when it’s light out against stops made when it’s dark, to see if minority drivers are pulled over more often when officers can more easily see them and determine their race and ethnicity.

You might think that means comparing stops at 3 p.m. to stops at 10 p.m., but that approach has a flaw.

“The problem with that is there may be other differences between day- and night-time driving, people going to work […], people going out at night. Maybe those aren’t comparable groups,” said Jeffrey Grogger, a University of Chicago professor, one of the peer reviewers, and one of the creators of the veil of darkness test. If that’s how the test worked, a department could be flagged for stopping more minority drivers during the daytime because more minority drivers come into town to work or shop.

The veil test gets around that problem by comparing stops made at the same time of day under different lighting conditions: after dark in the winter and in daylight in the summer. The idea is that the driving population at a given clock time should be roughly the same across seasons, so researchers can isolate the effect of whether officers can see the driver.
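
For readers who want a concrete sense of the mechanics, here is a minimal sketch of how a veil-of-darkness comparison can be run. This is not the IMRP's actual code; the file name, column names, and the fixed evening window are assumptions for illustration, and the real analysis uses sunset and dusk times rather than fixed hours.

```python
# Minimal illustrative sketch of a veil-of-darkness comparison.
# Assumed columns: stop_time, minority (1/0), is_dark (1/0).
import pandas as pd
import statsmodels.formula.api as smf

stops = pd.read_csv("stops.csv")  # hypothetical file of traffic stops
stops["clock_hour"] = pd.to_datetime(stops["stop_time"]).dt.hour

# Keep only clock times that are dark in winter but light in summer,
# so the same times of day are compared across lighting conditions.
# (Approximated here with fixed evening hours; the real test uses
# actual sunset and dusk times.)
window = stops[(stops["clock_hour"] >= 17) & (stops["clock_hour"] <= 20)].copy()
window["daylight"] = 1 - window["is_dark"]

# If lighting doesn't matter, the daylight coefficient should be near zero.
# A positive coefficient means minority drivers make up a larger share of
# stops when officers can see who is driving, the pattern the test flags.
model = smf.logit("minority ~ daylight + C(clock_hour)", data=window).fit()
print(model.summary())
```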

Reviewers pointed out that there may be times when the underlying assumptions are wrong — when an officer can’t see a driver in broad daylight, or an officer can clearly see a driver in darkness, such as in a well-lit urban setting.

Kenneth Barone, a project manager at CCSU and an author of the annual reports, said the assumption doesn’t always have to be true for the veil test to be valid. If officers can discern a driver’s race 20 percent of the time during daylight and only 3 percent in darkness, that would be good enough, he said.

3. What the heck is a ‘synthetic control’?

Don’t be scared off by the science-y sounding name.

This test works by grouping each department with others considered similar in a number of ways, and then seeing which departments stick out among that peer group.

Some of the factors researchers use to group the departments include employment levels, housing density, age and gender makeup, employment in the retail and entertainment sectors, and the racial and ethnic makeup of the town and its neighboring towns.
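
As a rough illustration of the peer-group idea, the sketch below standardizes a handful of town characteristics, finds each department's closest statistical neighbors, and flags departments whose minority stop share sits well above that peer group. The file name, column names and threshold are assumptions, and the report's actual synthetic control method is more involved than this.

```python
# Simplified peer-group sketch (the report's synthetic control is more involved).
# Assumed file: one row per department with matching covariates and the
# share of its stops involving minority drivers.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

depts = pd.read_csv("departments.csv", index_col="department")  # hypothetical file
covariates = ["employment_rate", "housing_density", "median_age",
              "retail_employment_share", "minority_resident_share"]

X = StandardScaler().fit_transform(depts[covariates])
nn = NearestNeighbors(n_neighbors=6).fit(X)        # each department plus 5 peers
_, neighbor_idx = nn.kneighbors(X)

for i, dept in enumerate(depts.index):
    peers = depts.index[neighbor_idx[i][1:]]       # drop the department itself
    gap = (depts.loc[dept, "minority_stop_share"]
           - depts.loc[peers, "minority_stop_share"].mean())
    if gap > 0.10:                                 # illustrative threshold only
        print(f"{dept}: minority stop share {gap:+.2f} above its peer group")
```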

This test, as Grogger points out, could mask signs of racial profiling. For instance, if a department conducted a high number of equipment stops affecting minorities, it wouldn’t stick out if it were grouped with departments that had similar patterns.

CCSU’s Barone agrees the test could mask profiling, which he says makes it more conservative, erring on the side of avoiding false positives.

4. Post-stop ‘hit-rate’ analysis

A third statistical test, called a “hit-rate” analysis, looks at how often drivers are searched and how often anything illegal is actually found.

The idea is that if one group is searched much more often but those searches turn up contraband less often, that pattern may signal unfair treatment.

Statewide, and in some of the jurisdictions singled out for a closer look, minorities are more likely to be searched than white drivers, and officers are much less likely to find anything illegal during the searches, the researchers found.
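
A back-of-the-envelope version of that comparison looks something like the sketch below. The column names are assumptions, and the report's analysis is further limited to higher-discretion searches, a point discussed next.

```python
# Illustrative hit-rate comparison. Assumed columns: driver_race,
# searched (1/0), contraband_found (1/0, recorded for searched stops).
import pandas as pd

stops = pd.read_csv("stops.csv")  # hypothetical stops file
searched = stops[stops["searched"] == 1]

rates = stops.groupby("driver_race").agg(
    total_stops=("searched", "size"),
    search_rate=("searched", "mean"),
)
rates["hit_rate"] = searched.groupby("driver_race")["contraband_found"].mean()
print(rates)
# A group that is searched more often (higher search_rate) but yields
# contraband less often (lower hit_rate) shows the pattern the test flags.
```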

Peer reviewers point out that sometimes the decision to search is not within an officer’s discretion, for instance if an arrest is made and a vehicle is impounded.

Barone, however, said his team excluded these “low-discretion” searches and focused on those in which the officer had greater latitude to decide whether to search. That point, he said, could have been cleared up if his team had had a chance to respond to the peer reviews before they were published.

The main drawback in this analysis is that searches are very rare, so the results are based on a small amount of data. Just 3 percent of stops statewide involved a search of any kind.

5. OK, what about those less formal tests?

Researchers included three tests they admitted are less reliable. Peer reviewers viewed these tests more dimly, calling them “invalid” and saying they should be cut altogether.

“If you actually took that one section of the report out, I thought it was a pretty good report,” said Edward Maguire, a criminology and criminal justice professor at Arizona State University, and one of the three peer reviewers.

Barone figures the more tests the better, as long as there’s a clear explanation of their weaknesses, because the report is meant to be a useful “policy document,” rather than a journal article.

Each of these descriptive tests relies on estimating the racial composition of a town’s driving population and comparing it to the racial composition of drivers stopped, to see if stops are proportionate to who is on the road. The problem is those population estimates are hard to make and are not considered very accurate.

The estimates are error-prone because drivers don’t just drive in the town where they live, and different age and race groups might spend more or less time on the road because of differing access to vehicles, commutes, or other driving characteristics.

One of these tests eliminates some uncertainty by focusing only on stops of drivers who live in that town. That way, researchers only need to estimate the resident driving population without guessing how many people from other towns are on the roads. But, the other challenges still remain, and it’s still an estimate.
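
The basic arithmetic behind these descriptive comparisons is simple, which is part of their appeal; the hard, error-prone part is producing the driving-population estimates in the first place. The figures in the sketch below are made up purely to show the calculation.

```python
# Descriptive benchmark comparison with made-up, purely illustrative numbers.
stopped_share = {"white": 0.70, "black": 0.18, "hispanic": 0.12}
estimated_driving_share = {"white": 0.78, "black": 0.12, "hispanic": 0.10}  # the weak link

for group in stopped_share:
    gap = stopped_share[group] - estimated_driving_share[group]
    print(f"{group}: {stopped_share[group]:.0%} of stops vs. "
          f"{estimated_driving_share[group]:.0%} of estimated drivers "
          f"(gap {gap:+.0%})")
```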

Smith suggested a better estimate of the driving population that the IMRP researchers did not use: crash data. The race of not-at-fault drivers involved in car crashes, he said, gives a truer picture of who is really on the road.

Barone said he would be happy to add crash data, but police don’t record race and ethnicity data on crash reports.

“We tried to do that in the state of Connecticut four years ago,” Barone said. “Police opposed the inclusion of race and ethnicity on accident forms.” Even if police began capturing that data immediately, he said, it could take three years to build an adequate sample size.

6. How much does the peer review undermine the report?

The peer review released on the eve of the third traffic-stop analysis is full of critical statements.

For non-scientists, it can be hard to understand whether the criticism adds up to a fatally flawed analysis or a sound approach with room for improvement.

The reviewers say their comments were part of a common process of constructive criticism, not to be taken as a litany of reasons to discount the work.

“I obviously can’t speak for everybody, but I was expecting those reviews to be used in a constructive way,” Grogger said. He said it was “pretty disappointing” that researchers received his feedback, which he wrote in the summer, only hours before the latest report was made public. “That’s something that doesn’t happen; that’s political, that’s not an academic thing,” he said.

Grogger, Smith and Maguire all told The Connecticut Mirror that they were far more critical of the so-called descriptive tests, but where the researchers employed academically rigorous analyses, they did well.

“In terms of the quality of analysis, I think it’s excellent, at least the parts of it that are done well. The parts that are problematic you can’t make them not problematic,” Smith said, referring to the “descriptive” population tests.

Grogger said, “If they just focused on the state-of-the-art techniques, Connecticut would be way ahead of the national curve.”

As a way of putting harsh peer reviews into context for people who don’t often read them, Grogger told a story from early in his career about submitting a paper to a well-respected journal and receiving a letter from its editor saying that, after reading the peer reviews, he was “not optimistic” about publishing the paper.

“It was the most discouraging-sounding letter I’ve ever heard,” Grogger said.

But, he said, he worked hard to address all of the issues that had been raised. “They not only published it; they published it as the lead article.”

7. Where does it all lead?

While there may always be debate over how to interpret the data, it’s all just number-crunching unless departments can use it to identify ways to improve.

The closer examinations of departments that showed disparities will include more department-specific context, data, and input from the departments on what might be driving their disparities.

Police chiefs, even if they’re defensive about being singled out for having disparities, have for the most part tried to see what improvements they can make, said Mike Lawlor, the state’s undersecretary for criminal justice policy and planning.

Lawlor and Barone pointed to Hamden as a place where the research has led to real changes. The department’s high number of stops for defective equipment was affecting minorities disproportionately, and the data showed it wasn’t effective.

WNPR reported last year that the chief there changed his priorities to focus on more safety-oriented enforcement and the disparities went away.

Kenneth Barone (beneath the screen), a project manager at Central Connecticut State University’s Institute for Municipal and Regional Policy, describes the latest traffic-stop analysis earlier this month. Credit: Jake Kara / CTMirror.org
