Racial Profiling and Bike Sharing: Urban Data Science at Microsoft Research

A liveblog of Microsoft Research’s Data Science Summer School. Errors likely mine.

The Data Science Summer School program recruits some of the most talented data students in the city to solve really difficult problems. Fortunately, the organizers had a city of 8 million people from which to choose eight extremely talented students.

Data Science Summer School students. Photo by Microsoft Research.

Microsoft Research’s instructors and directors pulled all the necessary strings to put this program together on an expedited timeline. Tonight the students give their final presentations:

An Empirical Analysis of Stop-and-Frisk in New York City

Md. Afzal Hossain (New York City College of Technology)
Khanna Pugach (Baruch College)
Derek Sanz (Brooklyn College)
Siobhan Wilmot-Dunbar (Pace University)

Between 2006 and 2012, the New York City Police Department made roughly four million stops as part of the city’s controversial stop-and-frisk program. We empirically study two aspects of the program by analyzing a large public dataset released by the police department that records all documented stops in the city. First, by comparing to block-level census data, we estimate stop rates for various demographic subgroups of the population. In particular, we find, somewhat remarkably, that the average annual number of stops of young, black men exceeds the number of such individuals in the general population. This disparity is even more pronounced when we account for geography, with the number of stops of young black men in certain neighborhoods several times greater than their number in the local population. Second, we statistically analyze the reasons recorded in our data that officers state for making each stop (e.g., “furtive movements” or “sights and sounds of criminal activity”). By comparing which stated reasons best predict whether a suspect is ultimately arrested, we develop simple heuristics to aid officers in making better stop decisions. We believe our results will help both the general population and the police department better understand the burden of stop-and-frisk on certain subgroups of the population, and that the guidelines we have developed will help improve stop-and-frisk programs in New York City and across the country.

Stop and Frisk is based on individual officers’ perception of an individual’s risk. The stop rates are disproportionate across racial and ethnic groups: 87% of those stopped are Black or Hispanic. The social costs placed on many, many innocent individuals are significant. To measure these costs, the team divided the number of stops by the total population. The data came from the NYPD and US Census PUMS and block-level datasets.
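The stop-rate calculation described here (stops divided by population) can be sketched as follows. This is a minimal illustration, not the team's code: the function name, the input layout, and the toy numbers are all assumptions made for the example.

```python
from collections import Counter


def stop_rates(stops, population, years=1):
    """Average annual stops per person for each demographic group.

    stops:      iterable of group labels, one per recorded stop
                (hypothetical layout, not the NYPD file format).
    population: dict mapping group label -> census head count.
    years:      number of years the stop records cover.
    """
    counts = Counter(stops)
    return {
        group: counts.get(group, 0) / (size * years)
        for group, size in population.items()
        if size > 0  # skip empty groups to avoid dividing by zero
    }


# Toy numbers invented purely for illustration (not the team's data).
toy_stops = ["A"] * 120 + ["B"] * 20
toy_population = {"A": 100, "B": 100}
toy_rates = stop_rates(toy_stops, toy_population, years=1)
```

A rate above 1.0, as for group "A" in the toy data, means there were more stops in a year than there are people in the group.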

Women are stopped far less often than men, but the stops still disproportionately affect black women compared with white women. A 19-year-old black male can expect to be stopped 1.2 times each year, vs. 0.2 times for a 19-year-old white male.

The team then quantified the effect of location on the likelihood of police stops. White males are most likely to be stopped in Coney Island (an average of once every two years). The stop rate for a young black male in Jamaica, Queens is over three times a year. 75% of the stops made in Brownsville occur in public housing. The average stop rate for a young Hispanic male is 0.24.

How do we make this system more effective? How do we improve stop decisions? A better stop isn’t necessarily one that leads to an arrest, because arrests aren’t perfect, but the team used this data as a proxy. NYPD’s Stop form includes a variety of reasons an officer can give to justify their stop, and several of these reasons are subjective: what counts as “furtive movement”?

The team trained a model to predict the likelihood of an arrest from the available data, then ranked stopped individuals by that predicted likelihood. According to the model, making only the top-ranked 25% of stops would still yield 56% of the arrests.
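The "25% of stops, 56% of arrests" style of figure comes from ranking stops by predicted arrest probability and counting how many arrests fall in the top slice. Here is a rough sketch of that calculation, assuming the scores come from some already-trained model. The function name and the toy scores are hypothetical.

```python
def arrests_captured(scores, arrested, fraction):
    """Share of all arrests found if only the top `fraction` of stops,
    ranked by predicted arrest probability, had been made.

    scores:   predicted arrest probabilities, one per stop (assumed given).
    arrested: 1 if the stop ended in an arrest, else 0.
    """
    ranked = sorted(zip(scores, arrested), key=lambda pair: pair[0], reverse=True)
    keep = int(len(ranked) * fraction)  # number of stops retained
    total = sum(arrested)
    if total == 0:
        return 0.0
    kept_arrests = sum(flag for _, flag in ranked[:keep])
    return kept_arrests / total


# Toy scores and outcomes, invented for illustration only.
toy_scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_arrested = [1, 0, 1, 0, 0, 1, 0, 0]
share = arrests_captured(toy_scores, toy_arrested, 0.25)
```

With the toy data, keeping the top two of eight stops captures one of the three arrests.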

They developed a heuristic model by calculating the probability of an arrest for each of the stated reasons on the NYPD’s form. To do so, they had to eliminate the impossible-to-predict category of justification known as “other”.
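As a sketch of how such a heuristic might be computed, the snippet below estimates the arrest rate for each stated reason and drops "other". The data layout and function name are assumptions for illustration. The two reason strings are the examples quoted in the abstract, and the outcomes are made up.

```python
from collections import defaultdict


def arrest_rate_by_reason(stops, excluded=("other",)):
    """Empirical P(arrest | reason) for each stated reason.

    stops: iterable of (reasons, arrested) pairs, where `reasons` is the
           list of boxes ticked on the form (hypothetical layout).
    """
    totals = defaultdict(int)
    arrests = defaultdict(int)
    for reasons, arrested in stops:
        for reason in reasons:
            if reason in excluded:
                continue  # "other" is dropped, as in the team's heuristic
            totals[reason] += 1
            arrests[reason] += int(arrested)
    return {reason: arrests[reason] / totals[reason] for reason in totals}


# Toy records, invented for illustration only.
toy_records = [
    (["furtive movements"], False),
    (["furtive movements"], False),
    (["furtive movements", "sights and sounds of criminal activity"], True),
    (["sights and sounds of criminal activity"], True),
    (["other"], False),
]
reason_rates = arrest_rate_by_reason(toy_records)
# Reasons ordered from most to least predictive of an arrest.
ranked_reasons = sorted(reason_rates, key=reason_rates.get, reverse=True)
```

Sorting the reasons by arrest rate gives a simple ordering an officer could consult, which is the sense in which a heuristic is easier to use than a full statistical model.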

Their heuristic model performs worse than their full statistical model, but far better than random stops, and is much easier to use in the real world. Given the social costs and demographically unfair burdens of stop and frisk in its current incarnation, the team finds the heuristic model to be an improved solution.

—————

Self-Balancing CitiBikes

Briana Vecchione (Pace University)
Franky Rodriguez (St. Joseph’s College)
Donald Hanson (Adelphi University)
Jahaziel Guzman (Brooklyn College)

Bike sharing is an internationally implemented system for reducing public transit congestion, minimizing carbon emissions, and encouraging a healthy lifestyle. Since New York City’s launch of the CitiBike program in May 2013, however, various issues have arisen due to overcrowding and general flow. In response to these issues, CitiBike employees redistribute bicycles by vehicle throughout the New York City area. During the past year, over 500,000 bikes have been redistributed in this fashion. This solution is financially taxing, environmentally and economically inefficient, and often suffers from timing issues. What if CitiBike instead used its clientele to redistribute bicycles?

In this talk, we will describe the data analysis that we conducted in hopes of creating an incentive and rerouting scheme for riders to self-balance the system. We anticipate that we can decrease vehicle transportations by offering financial incentives to take bikes from relatively full stations and return bikes to relatively empty stations (with rerouting advice provided via an app). We used publicly available data obtained via the CitiBike website, consisting of starting and ending locations, times, and user characteristics for each trip taken from July 2013 through May 2014. Using this dataset, we estimated CitiBike traffic flow, which enabled us to build agent-based simulation models in response to incentives and rerouting information. By estimating various parameters under which to organize incentive schemes, we found that such a program would help to improve CitiBike’s environmentalism and increase productivity, as well as being financially beneficial for both CitiBike and its riders.

Bike-sharing programs have surged around the world in recent years. New York’s own, CitiBike, opened in May 2013 and quickly became the largest in the US, with 20,000-40,000 trips a day.

Bike-share schemes operate in cities that come with established business and residential zones. People ride the bikes into business areas during the work day, which leaves some stations empty and others full. The current solution is to drive racks of bikes back to empty stations to relieve demand on the system.

Donald Hanson says there’s a van transport for every 10-20 rides, costing 10-15% of the program’s gross revenue. Could we instead use riders themselves to better balance the system?

The CitiBike website has official system data with trip logs. The team augmented this official data with Abe Stanway’s open dataset of bike availability.

They divided stations into bike-congested and bike-starved groups and found that the system is imbalanced.

CitiBike doesn’t provide much information on their truck-based rebalancing, so the team dug into the data themselves and found the moments where a bike magically transported to a new station location. They totaled these numbers to discover how many bikes were transported (or incorrectly locked, or taken for repair, or just stolen). They plotted this data to create a map of drop-offs. Stations at major transit hubs need refilling more often, while Union Square requires more frequent station emptying.
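One way to spot these "teleporting" bikes in trip logs is to sort each bike's trips by time and flag any gap where the next trip starts at a different station from where the previous one ended. The sketch below assumes a simplified tuple layout and made-up station labels, not the actual CitiBike file format.

```python
from collections import defaultdict


def find_relocations(trips):
    """Return (bike_id, from_station, to_station) for every gap in which a
    bike's next trip starts somewhere other than where its last trip ended.

    trips: iterable of (bike_id, start_time, start_station, end_station)
           tuples (hypothetical layout).
    """
    by_bike = defaultdict(list)
    for bike_id, start_time, start_station, end_station in trips:
        by_bike[bike_id].append((start_time, start_station, end_station))

    moves = []
    for bike_id, rides in by_bike.items():
        rides.sort()  # chronological order by start time
        for (_, _, prev_end), (_, next_start, _) in zip(rides, rides[1:]):
            if prev_end != next_start:
                moves.append((bike_id, prev_end, next_start))
    return moves


# Toy trips, invented for illustration only.
toy_trips = [
    (1, 10, "A", "B"),
    (1, 20, "B", "C"),
    (1, 30, "D", "A"),  # bike 1 ended at C but reappears at D
    (2, 15, "C", "C"),
    (2, 25, "C", "A"),
]
moves = find_relocations(toy_trips)
```

As the text notes, a flagged gap is not proof of a truck transport. The same signature appears when a bike is taken for repair, incorrectly locked, or stolen.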

Truck transports peak during commute hours, and nearby stations experience complementary activity. The team computed what would happen without truck drop-offs and discovered that about half of transported bikes serve trips that otherwise couldn’t have occurred.

Could local rebalancing alleviate these problems? They created a simulation that redirects a commuter about an avenue away to take a bike from a less starved station. Likewise at the destination dock: the commuter is re-routed a block away to a less congested station.
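A greedy re-routing rule of this kind can be sketched in a few lines. This is an assumed simplification, not the team's simulation: the function name, the neighbor map, and the station labels are hypothetical, and "nearby" is taken as given rather than computed from distances.

```python
def reroute(preferred, nearby, available):
    """Greedy local choice: among the preferred station and its neighbors,
    pick the one with the most available units (bikes for a pickup, free
    docks for a return). Ties keep the rider at the preferred station.

    nearby:    dict mapping station -> list of stations within walking range.
    available: dict mapping station -> current count of bikes or free docks.
    """
    candidates = [preferred] + list(nearby.get(preferred, []))
    # max() returns the first maximal element, so the preferred station wins ties.
    return max(candidates, key=lambda station: available.get(station, 0))


# Toy network, invented for illustration only.
toy_nearby = {"S1": ["S2"], "S3": ["S4"]}
toy_bikes = {"S1": 1, "S2": 9}   # bikes available for pickup
toy_docks = {"S3": 0, "S4": 5}   # free docks available for return

pickup_station = reroute("S1", toy_nearby, toy_bikes)
return_station = reroute("S3", toy_nearby, toy_docks)
```

Because each decision only looks at adjacent stations, a rule like this can smooth out neighboring stations without moving bikes across the city.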

Such re-routing drastically reduces local congestion, but the global imbalance remains. The simple greedy re-routing described above could improve bike availability, and incentives could encourage riders to accept such re-routes.

Some news articles have estimated the cost of bike transport at about $6 per bike, with 40,000 bikes transported per month.
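Taking those two estimates at face value, the implied transport bill is simple arithmetic. The figures are the news estimates quoted above, and the variable names are just for illustration.

```python
cost_per_bike = 6          # dollars per transported bike (news estimate)
bikes_per_month = 40_000   # bikes transported per month (news estimate)

monthly_cost = cost_per_bike * bikes_per_month   # 240,000 dollars
bikes_per_year = bikes_per_month * 12            # 480,000 bikes
annual_cost = monthly_cost * 12                  # 2,880,000 dollars
```

The 480,000 bikes a year implied by this estimate is in the same range as the "over 500,000" redistributed bikes mentioned in the abstract.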