Research Papers

Big Data in Economics (with Matthew Harding) IZA World of Labor
Big Data refers to datasets of much larger size, higher frequency, and often more personalized information. Examples include data collected by smart sensors in homes or aggregation of tweets on Twitter. In small datasets, traditional econometric methods tend to outperform more complex techniques. In large datasets, however, machine learning methods shine. New analytic approaches are needed to make the most of Big Data in economics. Researchers and policy makers should thus pay close attention to recent developments in machine learning techniques if they want to fully take advantage of these new sources of Big Data.
[final draft]


Poverty Mapping Using Convolutional Neural Networks Trained on High and Medium Resolution Satellite Images, With an Application in Mexico (with Boris Babenko, David Newhouse, Anusha Ramakrishnan, and Tom Swartz) Proceedings of the Neural Information Processing Systems 2017
Mapping the spatial distribution of poverty in developing countries remains an important and costly challenge. These “poverty maps” are key inputs for poverty targeting, public goods provision, political accountability, and impact evaluation, that are all the more important given the geographic dispersion of the remaining bottom billion severely poor individuals. In this paper we train Convolutional Neural Networks (CNNs) to estimate poverty directly from high and medium resolution satellite images. We use both Planet and Digital Globe imagery with spatial resolutions of 3-5 m² and 50 cm² respectively, covering all 2 million km² of Mexico. Benchmark poverty estimates come from the 2014 MCS-ENIGH combined with the 2015 Intercensus and are used to estimate poverty rates for 2,456 Mexican municipalities. CNNs are trained using the 896 municipalities in the 2014 MCS-ENIGH. We experiment with several architectures (GoogleNet, VGG) and use GoogleNet as a final architecture where weights are fine-tuned from ImageNet. We find that 1) the best models, which incorporate satellite-estimated land use as a predictor, explain approximately 57% of the variation in poverty in a validation sample of 10 percent of MCS-ENIGH municipalities; 2) Across all MCS-ENIGH municipalities explanatory power reduces to 44% in a CNN prediction and landcover model; 3) Predicted poverty from the CNN predictions alone explains 47% of the variation in poverty in the validation sample, and 37% over all MCS-ENIGH municipalities; 4) In urban areas we see slight improvements from using Digital Globe versus Planet imagery, which explain 61% and 54% of poverty variation respectively. We conclude that CNNs can be trained end-to-end on satellite imagery to estimate poverty, although there is much work to be done to understand how the training process influences out of sample validation.
[conference draft, full draft to follow]


Poverty from Space: Using High Resolution Satellite Imagery for Estimating Economic Well-being and Geographic Targeting (with Ryan Engstrom and David Newhouse) Submitted
 Can features extracted from high spatial resolution satellite imagery accurately estimate poverty and economic well-being? We investigate this question by extracting both object and texture features from satellite images of Sri Lanka, which are used to estimate poverty rates and average log consumption for 1,291 administrative units (Grama Niladhari (GN) Divisions). Features extracted include the number and density of buildings, the prevalence of building shadows (a proxy for building height), the number of cars, density and length of roads, type of agriculture, roof material, and a suite of texture and spectral features. A linear regression model explains sixty percent of both poverty headcount rates and average log consumption. In comparison, models built using Night Time Lights explain only 15 percent. Estimates remain accurate throughout the consumption distribution. Two sample applications, extrapolating predictions into adjacent areas and estimating local area poverty using an artificially reduced census, confirm the out of sample predictive capabilities.
[latest draft] [online appendix] [slides]

[Press: Brookings, Bloomberg, Atlantic City Lab, Fast Company, Borgen Magazine]


Unintended Consequences of the African Growth and Opportunity Act: The Role of Trade Diversion and Structural Change (with Klaus-Peter Hellwig)
This paper investigates the effects of preferential trade programs such as the U.S. African Growth and Opportunity Act (AGOA) on the direction of African countries’ exports. While these programs intend to promote African exports, textbook models of trade suggest that such asymmetric tariff reductions could additionally divert African exports from other destinations to the tariff reducing economy. We examine the import patterns of 177 countries and estimate the diversion effect using a triple-difference estimation strategy, which exploits time variation in the product and country coverage of AGOA. We find no evidence of systematic trade diversion within Africa, whereas diversion from other industrialized destinations to the US was significant, in particular for apparel products. At the same time we show that, more than diverting trade, AGOA had positive spillovers on the product composition of trade, which suggests that the product coverage of preferential trade agreements can influence structural change in Africa.
[draft available upon request] 


Predicting Trade Decline Following the Financial Crisis: How Much Better are Ensemble Methods? (with Marianne Baxter)
What predicts trade contraction following financial crises? World trade declined by as much as 10 percent in 2009 following the financial crisis of 2008, and while several researchers have investigated proximate causes of the decline, we still lack an understanding of which models should be used to forecast the next possible trade contraction. To shed light on this problem, we estimate six methods (OLS, Lasso, Ridge, Adaptive Lasso, random forest, and gradient boosted trees) on pre-crisis data to determine which method is best able to predict trade during the crisis period of 2009-11. We find that ensemble tree-based methods (random forest and gradient boosted trees) are much better at predicting trade based on pre-crisis signals. Random forest models have nearly twice the explanatory power of models prior to the crisis, and 50% more explanatory power out-of-sample at predicting trade decline during the crisis. Trading partner heterogeneity does not appear to explain the larger explanatory power, but the ability of tree-based methods to capture variable non-linearities does.
[latest draft] [slides]


Building a better model: Variable selection to predict poverty in Pakistan and Sri Lanka (with Marium Afzal and David Newhouse)
Numerous studies have developed models to predict poverty, but surprisingly few have rigorously examined different approaches to developing prediction models. This paper applies out of sample validation techniques to household data from Pakistan and Sri Lanka, to compare the accuracy of regional poverty predictions from models derived using manual selection, stepwise regression, and Lasso-based procedures. It also examines how much incorporating publicly available satellite data into the model improves its accuracy. The five main findings are that: 1) Lasso tends to outperform both discretionary and stepwise models in Pakistan, where the set of potential predictors is large. 2) Lasso and stepwise models give comparable results in Sri Lanka, where the set of predictors is smaller. 3) The accuracy of the prediction model depends considerably on the poverty threshold. 4) Including publicly available satellite data makes poverty predictions more accurate in Sri Lanka, where predictors are scarce, but slightly less accurate in Pakistan. 5) Including the satellite data increases the benefit of using Lasso in Sri Lanka. We conclude that among the three model selection methods considered, Lasso-based models are preferred for generating poverty predictions, especially when the pool of candidate variables is large. Furthermore, when the pool of candidate variables available from household surveys is smaller, incorporating publicly available satellite data can considerably improve the accuracy of regional poverty predictions.
[latest draft]  


Historical Health Conditions in Major U.S. Cities (with Carlos Villarreal, Brian Bettenhausen, and Eric Hanss)
The Historical Urban Ecological data set is a new resource detailing health and environmental conditions within seven major U.S. cities during the study period from 1830 to 1930. Researchers collected and digitized ward-level data from annual reports of municipal departments that detail the epidemiological, economic, and demographic conditions within each city. They then drafted new geographic information system data to link the tabular records to ward geographies. These data provide a new foundation to revisit questions surrounding the urban mortality transition and the growth of U.S. cities.
[published version]  


Sweet diversity: Colonial goods and the rise of European living standards after 1492
When did overseas trade start to matter for living standards? Traditional real-wage indices suggest that living standards in Europe stagnated before 1800. In this paper, we argue that welfare rose substantially, but surreptitiously, because of an influx of new goods as a result of overseas trade. Colonial luxuries such as tea, coffee, and sugar transformed European diets after the discovery of America and the rounding of the Cape of Good Hope. These goods became household items in many countries by the end of the 18th century. We use three different methods to calculate welfare gains based on price data and the rate of adoption of these new colonial goods. Our results suggest that by 1800, the average Englishman would have been willing to forgo 10% or more of his income in order to maintain access to sugar and tea alone. These findings are robust to a wide range of alternative assumptions, data series, and valuation methods.
[latest draft]  


Work in progress

Predicting Firm Performance with API Flows, with Seth Benzell, Guillermo Lagarda Cuervas, and Marshall Van Alstyne


Cell Phone Coverage and Traffic Accidents: New Evidence Using Cell Phone Towers, with Bree Lang and Matthew Lang


Analyzing Conflict From Space: Identification of Physical Destruction During the Syrian Civil War, with Hannes Muller, Andre Groger, and Andrea Matranga


How Can Poverty Estimates Derived from Artificial Intelligence and Satellite Imagery Improve Surveys? Evidence from Mexico, with Boris Babenko, David Newhouse, Anusha Ramakrishnan, and Tom Swartz