Modeling NYC Subway Flow and School Districts’ Effect on Housing Value

Justin Rao and Jake Hofman coordinate the Data Science Summer School program, hosted and sponsored by Microsoft Research. Each year, dedicated students spend their summer learning how to conduct research thanks to a network of researchers, mentors, and advisors. All of the course materials are openly and freely available on GitHub.

Tonight, we’re celebrating the program’s second class. Last year’s students researched questions about racial profiling in New York City and how to optimize the city’s bikeshare system.


New York City Subway data visualized

The first team, consisting of Eiman, Shannon, Riva, and Steven, studied the New York City subway system. They were interested in how people travel on complex transit networks and in classifying stations based on turnstile data. The system’s 468 stations serve approximately 6 million trips per weekday. The team combined the MTA’s General Transit Feed Specification (GTFS) feed with MTA turnstile data, but had to exclude the PATH, LIRR, NJT, and Staten Island Railroad systems due to incomplete data.

There are 30% more entries than exits in the data because New Yorkers habitually leave through the more accessible emergency exit doors, which the turnstiles do not count. New Yorkers also travel less on Mondays and Fridays, as well as on holidays and during major snowstorms. The data required significant cleaning to balance entries and exits and to match station names across the datasets.

Different stations serve different purposes in the city. The team categorized over 400 subway stations as commercial stations, residential stations, or link stations, based on the ratio of daytime to nighttime entries and exits. Residential stations serve roughly 1,000 fewer entries per hour (~400 per hour) than commercial stations (~1,500 per hour). Grand Central alone serves over 188,000 daily exits. As you might guess, the commercial stations serve Manhattan south of 59th Street, while residential stations cover the rest of the city, with a few exceptions for military bases, car dealerships, and other outliers.
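Here is a minimal sketch of what a ratio-based classification could look like. It is not the team’s code: the choice of comparing morning entries to evening entries, the function name, and the 1.5 cutoff are all assumptions for illustration, and the counts are made up.

```python
def classify_station(morning_entries, evening_entries, threshold=1.5):
    """Label a station by comparing morning and evening entry counts.

    threshold is a hypothetical cutoff, not a value from the team's analysis.
    """
    if morning_entries <= 0 or evening_entries <= 0:
        return "unknown"
    ratio = morning_entries / evening_entries
    if ratio >= threshold:
        return "residential"  # riders mostly board here in the morning
    if ratio <= 1 / threshold:
        return "commercial"   # riders mostly board here in the evening
    return "link"             # traffic is roughly balanced


# Made-up (morning_entries, evening_entries) counts for three stations.
stations = {"A": (1200, 300), "B": (250, 1400), "C": (600, 650)}
labels = {name: classify_station(m, e) for name, (m, e) in stations.items()}
```

With real turnstile data, the cutoff would need to be tuned against stations whose role is already known.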

Next, the team constructed a network graph to visualize the flow of activity through the stations. Nodes were train stations, edges were rail links between adjacent stations, and the cost of each edge was the scheduled travel time between the two stations. The resulting adjacency matrix is a rat’s nest (sorry) of stations, improved by adding the stations’ geocoordinates.

One finding is that 14th Street Union Square has 10 adjacent stations, while Times Square, the most trafficked station, has only 7.
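To make the graph idea concrete, here is a small sketch of a station network stored as an adjacency map, with travel minutes as edge costs. The station labels and times are invented, and the helper name is hypothetical; counting each station’s neighbors is how a figure like the one above could be derived.

```python
from collections import defaultdict


def build_graph(links):
    """Build an undirected weighted graph from (station_a, station_b, minutes)."""
    graph = defaultdict(dict)
    for a, b, minutes in links:
        graph[a][b] = minutes
        graph[b][a] = minutes
    return graph


# Invented links; real ones would come from the schedule data.
links = [
    ("S1", "S2", 2.0),
    ("S2", "S3", 3.0),
    ("S2", "S4", 1.5),
    ("S4", "S5", 4.0),
]
graph = build_graph(links)

# Number of adjacent stations for every station in the network.
degree = {station: len(neighbors) for station, neighbors in graph.items()}
most_connected = max(degree, key=degree.get)
```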

The team estimated demand of inflow and outflow, and computed minimum cost flow, where demand is satisfied while minimizing the previously defined cost. They chose Grand Central Station as the center of the visualization. The algorithm identifies high flow corridors. In the mornings, the flow is generally inbound to Grand Central, with the exception of flow down to Lower Manhattan.
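For readers curious about what a minimum cost flow computation involves, here is a toy sketch using the successive shortest paths approach. The source does not say which algorithm or library the team used, so treat this as one possible illustration; the node numbers, capacities, costs, and demands are all invented.

```python
def min_cost_flow(n, edges, supply):
    """Route supply to demand at minimum total cost (successive shortest paths).

    n: number of nodes, labeled 0..n-1
    edges: list of (u, v, capacity, cost) directed links, no duplicates
    supply: supply[i] > 0 means riders start at i, < 0 means riders end at i
    Returns (total_cost, flows) where flows[(u, v)] is the flow on that link.
    """
    source, sink = n, n + 1
    # Each arc is [head, residual_capacity, cost, index_of_reverse_arc].
    graph = [[] for _ in range(n + 2)]

    def add_arc(u, v, cap, cost):
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])

    originals = []
    for u, v, cap, cost in edges:
        add_arc(u, v, cap, cost)
        originals.append((u, v, cap, graph[u][-1]))
    for node, amount in enumerate(supply):
        if amount > 0:
            add_arc(source, node, amount, 0)
        elif amount < 0:
            add_arc(node, sink, -amount, 0)

    total_cost, sent = 0, 0
    while True:
        # Bellman-Ford, because residual arcs can have negative cost.
        dist = [float("inf")] * (n + 2)
        prev = [None] * (n + 2)
        dist[source] = 0
        for _ in range(n + 1):
            changed = False
            for u in range(n + 2):
                if dist[u] == float("inf"):
                    continue
                for i, (v, cap, cost, _rev) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v], prev[v], changed = dist[u] + cost, (u, i), True
            if not changed:
                break
        if prev[sink] is None:
            break  # no augmenting path left
        # Find the bottleneck along the path, then push that much flow.
        push, v = float("inf"), sink
        while v != source:
            u, i = prev[v]
            push = min(push, graph[u][i][1])
            v = u
        v = sink
        while v != source:
            u, i = prev[v]
            graph[u][i][1] -= push
            graph[v][graph[u][i][3]][1] += push
            v = u
        total_cost += push * dist[sink]
        sent += push

    if sent != sum(a for a in supply if a > 0):
        raise ValueError("demand cannot be satisfied with these capacities")
    flows = {(u, v): cap - arc[1] for u, v, cap, arc in originals}
    return total_cost, flows


# Invented example: nodes 0 and 1 are origins, 2 is a transfer, 3 is the hub.
edges = [(0, 2, 100, 5), (1, 2, 100, 3), (2, 3, 40, 4), (0, 3, 100, 12)]
total_cost, flows = min_cost_flow(4, edges, supply=[30, 20, 0, -50])
```

In this toy network the cheap link into the hub fills up first, and the leftover riders spill onto the slower direct link, which is the kind of corridor pattern a flow visualization would highlight.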

Next, they wanted to model population flow. US Census data only shows residential population, which wouldn’t work for New York’s immense number of commuters. By combining the Census data with the commuter data, the team can watch Census tracts empty or swell throughout the day based on commuting patterns. The team suggests applications such as correcting stop and frisk activity for an area’s current population, or studying disease spread in epidemiology.
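As a rough sketch of the idea, a tract’s daytime population can be estimated by starting from its resident count and adding the running total of station exits minus entries. This assumes each station maps to a single tract and that everyone starts the day at home; the function name and numbers are illustrative, not the team’s.

```python
from itertools import accumulate


def population_by_hour(residents, hourly_entries, hourly_exits):
    """Estimate people present in a tract at the end of each hour.

    Riders exiting a station arrive in the tract; riders entering leave it.
    """
    net = [exited - entered
           for entered, exited in zip(hourly_entries, hourly_exits)]
    return [residents + change for change in accumulate(net)]


# Made-up counts for a business district over four hours.
estimate = population_by_hour(
    residents=1000,
    hourly_entries=[50, 100, 900, 1200],
    hourly_exits=[800, 1500, 200, 100],
)
```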

Q: Has anything you’ve learned changed how you use the subway?

A: “I definitely don’t use the [emergency] exit doors anymore”.

“I don’t know if Midtown is fire-safe”.


New York City School District

The second data science team, made up of Glenda, Thomas, Nikki, and Anastassiya, sought to understand the relationship between housing values and the quality of the school district in which the homes are situated.

The New York City school system is the largest in the country, with over 1 million students. It has some of the best schools, and some of the worst, depending on where you live. The team looked at test scores and plotted them relative to the rest of the state. The result is a huge disparity, both between boroughs like the Bronx and Queens and within boroughs like Manhattan. They took shapefiles and plotted school performance on a map of New York by color.

PS 111 performed worse than 60% of all schools in the state, whereas PS 59 performed better than 99.2% of all state elementary schools. They are a ten-minute walk from one another, on 53rd and 56th Streets.

This is where school district boundaries can make or break your child’s education. Park Slope recently redrew their lines, stoking parents’ anxieties.

The team compared high-performing school zones to housing values. In an ideal experiment, they’d sell identical apartments in two different school districts. Instead, they looked at historical sales data scraped from StreetEasy, a major NYC-area real estate website. The data wasn’t perfect: apparently you can have a negative number of bedrooms and bathrooms.

They wrote a Python script to geocode the addresses with the NYC GeoClient API. They computed and displayed the latitude, longitude, school, and price per square foot for each home. 40,000 distinct sales produced 10,000 sales with complete data mappable to known school zones.
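The filtering step might look something like the sketch below, which drops sales with missing or impossible values and computes price per square foot. The field names are hypothetical stand-ins, not the team’s schema, and the geocoding call is left out entirely since its details aren’t described here.

```python
def clean_sales(records):
    """Keep sales with usable fields and attach price per square foot."""
    cleaned = []
    for sale in records:
        price = sale.get("price")
        sqft = sale.get("sqft")
        beds = sale.get("bedrooms")
        baths = sale.get("bathrooms")
        zone = sale.get("school_zone")
        # Skip incomplete records and sales not mapped to a school zone.
        if None in (price, sqft, beds, baths, zone):
            continue
        # Skip impossible values such as negative bedroom counts.
        if price <= 0 or sqft <= 0 or beds < 0 or baths < 0:
            continue
        cleaned.append({**sale, "price_per_sqft": price / sqft})
    return cleaned


# Made-up records: one usable, one with bad data, one with no zone.
raw = [
    {"price": 500000, "sqft": 1000, "bedrooms": 2, "bathrooms": 1,
     "school_zone": "Zone A"},
    {"price": 750000, "sqft": 900, "bedrooms": -1, "bathrooms": 1,
     "school_zone": "Zone B"},
    {"price": 300000, "sqft": 600, "bedrooms": 1, "bathrooms": 1,
     "school_zone": None},
]
usable = clean_sales(raw)
```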

Apartment prices alone illustrate huge disparities in New York: $110 per square foot in Woodlawn, Bronx vs. $3,393 per square foot around Central Park South. When plotted against school performance, there’s a correlation between price and competitive schools, but the relationship is confounded by other factors like neighborhood quality and location. We can’t confidently say that price per square foot increases only because of school quality.

So how do we isolate the school zone premium? If we had infinite data, we’d simply subtract the average sale price in neighboring zones from the average sale price in the school zone. With limited data, the team instead fit a linear model to predict sale price per square foot. They used regularization to select important features and to avoid unreliable estimates in areas with few sales. The model took into account the number of bedrooms and bathrooms, demographics, test scores, and neighborhood, as well as the interaction between the number of bedrooms and the school (to distinguish family-sized homes from studios).
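To see why regularization helps, compare the naive difference of averages with a shrunken version. The sketch below is a deliberate simplification, not the team’s model: it treats the neighboring-zone average as a fixed baseline and applies a ridge-style penalty to the premium alone, so zones with few sales get pulled toward zero. The penalty value and prices are invented.

```python
from statistics import mean


def naive_premium(zone_prices, neighbor_prices):
    """Difference between average price per square foot in and out of the zone."""
    return mean(zone_prices) - mean(neighbor_prices)


def shrunken_premium(zone_prices, neighbor_prices, penalty=5.0):
    """Ridge-style estimate of the premium p.

    Minimizes sum((price - baseline - p) ** 2) + penalty * p ** 2,
    with baseline fixed at the neighboring-zone average. The solution is
    the naive premium scaled by n / (n + penalty).
    """
    n = len(zone_prices)
    return n / (n + penalty) * naive_premium(zone_prices, neighbor_prices)


# Made-up prices per square foot.
zone = [900, 1000, 1100]
neighbors = [800, 900, 1000]
naive = naive_premium(zone, neighbors)
shrunk = shrunken_premium(zone, neighbors, penalty=3.0)
```

With only three sales and a penalty of 3, the estimate is cut in half; with hundreds of sales, the same penalty would barely move it.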

The median absolute deviation for all boroughs was $103 per square foot, but it varied by borough: $48.99 in the Bronx and $138.48 in Manhattan. The model accurately captures average home price within each school zone.
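One way to compute a figure like this is to take the median of the absolute gaps between actual and predicted prices. Whether the team measured it exactly this way is an assumption on my part, and the numbers below are made up.

```python
from statistics import median


def median_absolute_error(actual, predicted):
    """Median of |actual - predicted| across sales."""
    return median(abs(a - p) for a, p in zip(actual, predicted))


# Made-up actual and predicted prices per square foot.
actual = [500, 620, 710, 980]
predicted = [520, 600, 800, 950]
error = median_absolute_error(actual, predicted)
```

Using the median rather than the mean keeps a handful of wildly mispriced sales from dominating the score.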

The team zoomed in on Park Slope to identify school-based premiums. They found a premium of $84 per square foot for residences in PS 321’s school district vs. those in PS 282’s, despite PS 282 being closer to the Atlantic Ave subway hub. Repeating the same procedure in each school district, they found that you do pay less to live in a school district with poorer test scores.

The result? New York City’s schools demonstrate extreme inequality, often over small geographic areas.

The information is compelling, but static, so the team built an interactive app where you can enter your address and number of bedrooms and bathrooms to see the price average, median price, and premium price (positive or negative) in each of the city’s school zones:

[Screenshot of the interactive app]

In praise of echo chambers, and Nuzzel

I once spent an afternoon during my time at the MIT Media Lab with a marker board and Kshitij Marwah. We drew out the various news products we could make using link-sharing data from once-removed contacts in users’ networks. We thought we might help people discover content they were likely to like sooner, by surfacing trending links before even their own network had discovered and shared them.

A version of this idea has successfully been productized by the team at Nuzzel. Once a critical mass of your contacts share a link (8 seems to be the magic number in my network), the app sends you a push notification with the story. The app primarily looks at shares within your immediate network, but also has an extended network view. With its timely but manageable updates, it fits squarely within a new generation of apps designed to live in your phone’s notifications shade.

Screenshot by @nmonroe

Tech in Cuba in 2015


Illustration by J. Longo

Last month, I had the incredible opportunity to visit Cuba with my global travel companion Marco Bani. It’s a dynamic place facing rapid changes. I talked to everyone I met – regular people, but for their exposure to the lucrative tourism sector – about technology. The result is this primer in Kernel, the Daily Dot’s Sunday magazine, for their travel issue. Thanks to Jesse Hicks for his editing. More photos, below.

What is civic tech?

Civic tech is when we apply technology toward shared problems and opportunities. Technology’s daily advance continuously expands the collection of potential ways to improve our society. Civic tech is when we consciously apply technology’s new potentials toward societal needs.


And happy birthday to Civicist, the re-launch of TechPresident, which has provided more coverage of civic tech than any other media outlet.

TEDxAlbany: Activism Drives Attention Drives Aid

I was grateful to be able to share a chapter of my thesis on Participatory Aid at TEDxAlbany last month. The video’s online now. Thanks to Lisa Barone and the OverIt team for inviting me and doing such a great job producing the event. Thanks also to Ethan Zuckerman and everyone at MIT Center for Civic Media for connecting me to these ideas in the first place.

It’s been an extremely violent year. What makes a crisis worthy of our attention? It turns out that human suffering does not predict media coverage. How closely is disaster aid correlated to receiving public attention? And, if we’re newly able to use our networks creatively to drive attention, can our active participation improve these formulas?

Personal Data Geographies

Our phones track our personal geographies. This enables dystopian surveillance, but also provides an interesting layer of biographical data that we haven’t had access to previously. My personal perspective is that if other actors (cellphone companies, marketers, governments) are going to have access to this information, I should at least be able to view and analyze this data, too. That’s why I’m thankful that Google exposes this data to end-users through the Location History page, and also allows outputs of raw geodata.

I’m going to use this data as a personal reflection aid, sort of the way social media data helps power TimeHop’s semi-automated moments of reflection. I’m also experimenting with artistic visualizations (as in, actual paint and paper). But to start, I’ve taken the data from the 5 or so months that I’ve lived in New York, imported it into Google Earth, and created a GIF of my geographic footprint:
