Shapefiles in Python: shapely polygons

In my last post, I described how to take a shapefile and plot the outlines of the geometries in the shapefile. But the power of shapefiles is in the records (the data) associated with each shape. One common way of presenting shapefile data is to plot the shapefile geometry as polygons that are colored by some value of data. So as a prelude to doing just that, this post will cover how to plot polygons using the shapely and descartes libraries. As always, my code is up on my github page.

The two python libraries that I’ll be using are shapely (for constructing a polygon) and descartes (for adding a polygon to a plot). So step 0 is to go install those! I’ll also be using the numpy and matplotlib libraries, but you probably already have those.

Though the documentation for shapely has some nice sample source code, I wrote my own script, simple_polygons.py, to get to know the libraries better. In this approach, there are two steps to building a polygon from scratch: constructing the points that define the polygon’s shape and then mapping those points into a polygon structure. The first step doesn’t require any special functions, just standard numpy. The second step uses the  shapely.geometry.Polygon class to build a polygon from a list of coordinates.

There are limitations for valid polygons, but virtually any shape can be constructed, like the following pacman:

pacman

The first step is to build the list of coordinates defining the exterior points (the outer circle) and a list of interior points to exclude from the polygon (the eyeball). Starting with the exterior points, I calculate the x and y coordinates of unit circle from 0.25pi to 7/4pi (0 to 2pi would map a whole circle rather than a pacman):

theta = np.linspace(0.25*3.14,1.75*3.14,80) 

# add random perturbation 
max_rough=0.05 
pert=max_rough * np.random.rand(len(theta)) 

x = np.cos(theta)+pert 
y = np.sin(theta)+pert

I also add a random, small perturbation to each x-y position to add a bit of roughness to the outer pacman edge, because I wanted some small scale roughness more similar to the shapefiles I’d be plotting later. Next, I build a python list of all those x-y points. This list, ext, is the list of exterior points that I’ll give to shapely:

# build the list of points 
ext = list() 

# loop over x,y, add each point to list 
for itheta in range(len(theta)): 
    ext.append((x[itheta],y[itheta])) 

ext.append((0,0)) # add 0 point

At the end, I add the 0,0 point, otherwise the start and end points on the circle would connect to each other and I’d get a pacman that was punched in the face:

pacman_punch

That takes care of the exterior points, and making the list of interior points is similar. This list, inter, will be a list of points that define interior geometries to exclude from the polygon:

# build eyeball interior points 
theta=np.linspace(0,2*3.14,30) 
x = 0.1*np.cos(theta)+0.2 
y = 0.1*np.sin(theta)+0.7 

inter = list() 
for itheta in range(len(theta)): 
    inter.append((x[itheta],y[itheta])) 
inter.append((x[0],y[0]))

Now that we have the list of exterior and interior points, you just give that to shapely’s polygon function (shapely.geometry.Polygon):

polygon = Polygon(ext,[inter[::-1]])

Two things about passing Polygon the interior list: (1) you can actually pass Polygon a list of lists to define multiple areas to exclude from the polygon, so you have to add the brackets around inter and (2) I haven’t quite figured out the [::-1] that the shapely documentation includes. I know that generally, [::-1] will take all the elements of a list and reverse them, but why does Polygon need the points in reverse? No idea. Without it, I only get an outer edge defining the eyeball:

pacman_badeye

I would love to get some information on why Polygon needs the reversed list, so leave me a note in the comments if you know why.

Regardless, the next step is to add that polygon structure to a plot, with a straightforward use of matplotlib.pyplot (imported as plt) and descartes.patch.PolygonPatch:

 

# initialize figure and axes 
fig = plt.figure() 
ax = fig.add_axes((0.1,0.1,0.8,0.8)) 

# put the patch on the plot 
patch = PolygonPatch(polygon, facecolor=[0,0,0.5], edgecolor=[1,1,1], alpha=1.0) 
ax.add_patch(patch) 

# new axes 
plt.xlim([-1.5, 1.5]) 
plt.ylim([-1.5,1.5]) 
ax.set_aspect(1) 

plt.show()

PolygonPatch’s arguments are pretty self explanatory: facecolor and edgecolor set the colors for the fill and edge of the polygon. Conveniently, facecolor and edgecolor can be specified as RGB values, which I’ll take advantage of for plotting shapefile records in my next post. It can also accept any of the kwargs available to matplotlib.patches.Polygon class (like the transparency,alpha, between 0 and 1).

So that’s it! Pretty easy! And in some ways it is even easier to plot polygons from a shapefile, since pyshp imports shapefile coordinates as a list and you can just give that list directly to Polygon… more on that in the next post.

Shapefiles in Python: a super basic tutorial

I recently started a couple of projects that will involve using shapefiles and I got frustrated real fast. Many tutorials that I found assumed some previous knowledge of either shapefiles or the python libraries used to manipulate them. But what I wanted was a tutorial that helped me to plot a simple shapefile while getting to know what a shapefile actually is!

So here’s a SUPER simple example of how to load, inspect and plot a shapefile to make a map of the U.S! There are quite a few Python libraries dealing with shapefiles and it was hard to find the easiest place to start. I found the pyshp Python library the most approachable, so that’s what I use in the following example. There are many ways to visualize shapefiles in a more automated way than I do here, but I think that my approach here gives a clearer picture to a beginner of what a shapefile is and how to use Python with shapefiles.

The shapefile

Go get yourself a shapefile! The one I used (which will definitely work with my code below) is the lowest resolution state-level cartographic boundary shapefile from census.gov (link to census.gov, direct link to lowest resolution 20m .zip file). Once you download the .zip file, unpack it and take a look inside. A shapefile is actually a collection of different files, including a .shp file containing information on shape geometry (state boundaries in this case), a .dbf file containing attributes of each shape (like the name of each state) and others (check out the wiki page on shapefiles for a description of the other file extensions).

The code!

You can download my Python code: https://github.com/chrishavlin/learning_shapefiles

At present, the src folder includes only one python script: basic_read_plot.py. To run this script you will need to:

  1. install the pyshp Python library  (and numpy and matplotlib if you don’t have them already)
  2. edit the variables in the source code describing the path to the shapefile (dat_dir and shp_file_base in src/basic_read_plot.py)

After those two steps, just open up a terminal and run the script (assuming you’re in the src directory):

$ python basic_read_plot.py

The three plots described below should pop up.

So what does the code do? 

After the initial comment block and library import, the code reads in the shapefile using the string variables that give the location of the shapefile directory (data_dir) and the name of the shapefile without extension (shp_file_base):

sf = shapefile.Reader(dat_dir+shp_file_base)

This creates a shapefile object, sf, and the next few lines do some basic inspections of that object. To check how many shapes have been imported:

print 'number of shapes imported:',len(sf.shapes())

For the census.gov state boundary shapefile, this returns 52 for the 50 states, Washington D.C. and Puerto Rico.

For each shape (or state), there are a number of attributes defined: bbox, parts, points and shapeType. The pyshp documentation describes each, and I’ll touch on each one in the following (except for shapeType).

The first thing I wanted to do after importing the shapefile was just plot a single state. So I first pull out the information for a single shape (in this case, the 5th shape):

shape_ex = sf.shape(5)

The points attribute contains a list of latitude-longitude values that define the shape (state) boundary. So I loop over those points to create an array of longitude and latitude values that I can plot. A single point can be accessed with shape_ex.points[0] and will return a lon/lat pair, e.g. (-70.13123,40.6210). So I pull out the first and second index and put them in pre-defined numpy arrays:

x_lon = np.zeros((len(shape_ex.points),1))
y_lat = np.zeros((len(shape_ex.points),1))
for ip in range(len(shape_ex.points)):
    x_lon[ip] = shape_ex.points[ip][0]
    y_lat[ip] = shape_ex.points[ip][1]

And then I plot it:

plt.plot(x_lon,y_lat,'k')

# use bbox (bounding box) to set plot limits
plt.xlim(shape_ex.bbox[0],shape_ex.bbox[2])

single

This returns the state of Oregon! I also used the bbox attribute to set the x limits of the plot. bbox contains four elements that define a bounding box using the lower left lon/lat and upper right lon/lat. Since I’m setting the axes aspect ratio equal here, I only define the x limit.

Great! So all we need now is to loop over each shape (state) and plot it! Right? Well this code snippet does just that:

plt.figure()
ax = plt.axes()
ax.set_aspect('equal')
for shape in list(sf.iterShapes()):
   x_lon = np.zeros((len(shape.points),1))
   y_lat = np.zeros((len(shape.points),1))
   for ip in range(len(shape.points)):
       x_lon[ip] = shape.points[ip][0]
       y_lat[ip] = shape.points[ip][1]

   plt.plot(x_lon,y_lat)

plt.xlim(-130,-60)
plt.ylim(23,50)

And we can see some problems with the result:

bad_map

The issue is that in some of the shapes (states), the geometry has multiple closed loops (because of the islands in some states), so simply connecting the lat/lon points creates some weird lines.

But it turns out that the parts attribute of each shape includes information to save us! For a single shape the parts attribute (accessed with shape.parts) contains a list of indeces corresponding to the start of a new closed loop within a shape. So I modified the above code to first check if there are any closed loops (number of parts > 1) and then loop over each part, pulling out the correct index range for each segment of geometry:

plt.figure()
ax = plt.axes() # add the axes
ax.set_aspect('equal')

for shape in list(sf.iterShapes()):
    npoints=len(shape.points) # total points
    nparts = len(shape.parts) # total parts

    if nparts == 1:
       x_lon = np.zeros((len(shape.points),1))
       y_lat = np.zeros((len(shape.points),1))
       for ip in range(len(shape.points)):
           x_lon[ip] = shape.points[ip][0]
           y_lat[ip] = shape.points[ip][1]
       plt.plot(x_lon,y_lat)

    else: # loop over parts of each shape, plot separately
       for ip in range(nparts): # loop over parts, plot separately
           i0=shape.parts[ip]
           if ip < nparts-1:
              i1 = shape.parts[ip+1]-1
          else:
              i1 = npoints

         seg=shape.points[i0:i1+1]
         x_lon = np.zeros((len(seg),1))
         y_lat = np.zeros((len(seg),1))
         for ip in range(len(seg)):
             x_lon[ip] = seg[ip][0]
             y_lat[ip] = seg[ip][1]

         plt.plot(x_lon,y_lat)

plt.xlim(-130,-60)
plt.ylim(23,50)
plt.show()

And we can see those spurious lines are now gone:

good_map

Final Thoughts

Now that I feel pretty good about the information contained in a shapefile and how it’s stored, I’ll be moving on to more exciting visualizations. It’s important to note, that there are many Python libraries that can plot shapefiles without manually pulling out the points as I’ve done here. But I feel much better about using those fancier approaches now that I’ve gone through this exercise.

Also, in this post I’ve only touched on the geometry information in a shapefile. But it’s really the records included in the .dbf files that will make this an interesting visualization. The records contain measurements, observations or descriptions for each shape and that information can be used to color or fill each shape to create visualizations like this one (not my work).

Useful links: pyshp documentation, Plot shapefile with matplotlib (Stack Exchange)

Taxi Tracers: Speed and Hysteresis of NYC Cabs

In my previous post, I outlined how I took the NYC yellow cab trip record data and made some pretty maps of Manhattan. Given how many variables are tracked, there are many possible avenues of further exploration.  It turns out that the average speed of taxis holds some interesting trends and in this post I’ll describe some of these trends (code available at my GitHub page).

I started off with plotting average speed against time of day, hoping to see some trends related to rush hour. I took three months of yellow cab data (January, February and March 2016) and calculated the average speed of every ride using the pick up time, drop off time and distance traveled. I then calculated a 2D histogram of average speeds vs time of day:

speedhist2d

I also calculated the median speed (solid curve) and the first/third quartiles (dashed curve) at each time of day. As expected, rush hour is evident in the rapid drop in speed from 6-10.  I didn’t expect such a large peak in the early hours (around 5) and I also found it interesting that after the work day, speeds only gradually increase back to the maximum.

Next, I took the 2-D histogram, defined three representative time ranges and calculated 1-D histograms of the speed for those time ranges (left plot, below):

1dhistograms

These distributions are highly skewed, to varying degrees. The 5-6 time range (blue curve) sees a much wider distribution of speeds compared to the work day (9-18, green curve) and evening (20-24, red curve). Because of the skewness, I used the median rather than the mean in the 2-D histogram plot. Similarly, the standard deviation is not a good representation for skewed data, and so I plotted the first and third quartiles (the dashed lines in the 2-D plot).

The difference between the 1st and 3rd quartile (the interquartile range, or midspread), is a good measure of how the distribution of average cab speeds changes (plotted above, right, if the data were normally distributed, this plot would just be a plot of the standard deviation). Not only are taxis going faster on average around 5-6 am, but there’s a much wider variation in taxi speeds. Traffic speeds during the work day are much more uniform.

After playing with these plots for a bit, I got to thinking about how the speed of a taxi is only indirectly related to time of day! A taxi’s speed should depend on how much traffic is on the road (as well as the speed limit I suppose…), the time of day only reflects trends of when humans decide to get on the road. So if I instead plot the average speed vs the number of hired cabs, I get quite an interesting result:

speed_hyst_3_mo_a

Generally, the average speed decreases with increased number of cabs (increases with decreased cab count), but the changes are by no means linear. It’s easiest to read this plot by picking a starting time: let’s start at 6am, up at the top. At 6am, the average speed with about 1000 hired cabs on the road is about 23 mph. From 6am to 9am, the number of cabs increases, causing traffic to slow to about 12 mph. From 9am to 7pm, things get complicated (see the zoomed inset below), with the number of cabs and average speed bouncing around through the workday until 7pm, when the cab count drops and average speed rises in response. Eventually, the loop closes on itself, forming a sort of representative daily traffic cycle for Manhattan.

speed_hyst_3_mo_zoom

When I first looked at this, it immediately reminded me of hysteresis loops from all my thermodynamics classes. And so I started googling “Traffic Hysteresis” and it turns out there’s a whole field of study devoted to understanding hysteresis loops in traffic patterns in order to better design systems of roadways that can accommodate fluctuations in traffic volume (e.g., Zhang 1999, Geroliminis and Sun 2011Saberi and Mahmassani 2012 ). One of important differences between those studies and the brief exploration I’ve done here is that the above studies have counts of total traffic. The calculations here only consider the taxis floating in the larger stream of traffic. While these taxi tracers are still a good measure of average flow of traffic, the number of hired cabs at a given time is only a proxy for total traffic. A lot of the complicated behavior during the workday (9am-7pm) likely arises from  changes in traffic volume not accounted for like delivery trucks, commuters or buses. So in some ways it’s remarkable that I got a hysteresis loop at all!

The Code

I described most of the methodology for reading and manipulating the NYC yellow cab data in my previous post. The two main additions to the code are a new class in taxi_plotmod.py (binned_variable) and an algorithm for finding the number of hired taxis (find_N_unique_vs_t of taxi_plotmod.py). The binned_variable class returns a number of useful values including median, first quartile, third quartile and binned values. The method find_N_unique_vs_t loops over a given time range and counts the number of taxis that are hired during that time.

A few useful tips learned during the process:

PyTip! Calculating the first and third quartiles. This was quite simple using the numpy’s where method. The first quartile is defined as the median of all values lying below the median, so for an array of values (var2bin), the first quartile at a given time of day (designated by the i_bin index) can be calculated with:

firstquart[i_bin]=np.median(var2bin[np.where(var2bin<self.med[i_bin])]) 

The where command finds all the indeces with values less than the median at a given time of day, so var2bin[np.where(var2bin<self.med[i_bin])] selects all values that are less than the median (self.med[i_bin]) from which a new median is calculated.

PyTip! Converting to datetime from numpy’s datetime64. So at one point, I found myself needing to know the number of days between two dates. I had been importing the date of each taxi ride as a numpy datetime64 type to make selecting given date ranges with conditional statements easier. Both datetime and numpy’s datetime64 allow you to simply subtract two dates. The following two commands will give the number of days between two dates (assuming numpy has been imported as np and datetime as dt):

a=dt.datetime(2016,4,3)-dt.datetime(2015,3,1)
b=np.datetime64('2016-04-03')-np.datetime64('2015-03-01')

Both methods will give 399 days as the result, but I needed an integer value. The datetime64 method returned a weird I-don’t-know-what type of class that I couldn’t figure out how to convert to an integer value easily.  The easiest solution I found was to use the .astype method available with the datetime64 class:

date_end=np.datetime64('2016-04-03')
date_start=np.datetime('2015-03-01')
Ndates=date_end.astype(dt.datetime)-date_start.astype(dt.datetime)

Sp what’s happening here? Well date_end and date_start are both numpy datetime64 variables. date_end.astype(dt.datetime) takes the end date and casts it as a normal datetime value. So I do that for both the start and end date, take the difference and I’m left with Ndates as a datetime timedelta, which you can conveniently get an integer date range using

Ndates.days

So that’s that!

Stay tuned for either more taxi analysis or some fun with shapefiles!

Taxi Tracers: Mapping NYC

The city of New York has been recording and releasing trip data for every single taxi ride in the city since 2009 (TLC Trip Data, NYC) and these publicly available datasets are treasure trove of information. Every single time a NYC taxi picks up a passenger, 19 different pieces of information are recorded, including gps coordinates of pickup and dropoff, total fare and distance traveled. Think about how many cabs you see while walking the streets of New York and then imagine how many times in a day just one of those cabs picks up a passenger… yeah, that’s a lot of data. A single data file from  TLC Trip Data, NYC contains data for one month between 2009-2016 and weighs in at a  MASSIVE 2 GB.

There have already been plenty of cool visualizations and analyses of this data set, Todd  W. Schneider processed the ENTIRE data set from 2009-2015, amounting to over a billion taxi trips (link) and found lots of interesting trends. His comparison of yellow and green taxis to Uber rides in individual boroughs (link) is particularly interesting. Chris Wong created a visualization that follows a randomly selected cab through a day of driving (link) and Juan Francisco mapped over 10,000 cabs in a 24 hour period to create a mesmerizing video (link). Both used the start and end points from the TLC Trip data and created a best guess route via the Google Maps API. It’s pretty fun to watch the taxis jump around the city, especially if you happen to have just read Rawi Hage’s novel Carnival, in which the main character, a cab driver, describes one type of cabbie as “the flies”, opportunistic wanderers floating the streets waiting for passengers and getting stuck in their own web of routine.

In my own exploration of the data, I’ve been following two main threads. The first, which I’ll describe today, is map-making using gps coordinates of the taxis. The second, which I’ll leave to a future post, involves a more statistical analysis of the data focused on understanding average taxi speed. I’ll start by describing the final product, then dig into the details of my coding approach (python source code is available on my github page here).

Taxi Maps (the final products)!

Just look at the map, it’s beautiful:

taxi_count_map_maxval_labeled

This map was made by creating a rectangular grid spanning New York City and counting how many pickups occurred within each grid bin (more details on this below) within three months. The color scale is the base-10 log of the number of pickups in each bin (yellow/white are a lot of pickups, red are a lower number and black are no pickups). The horizontal direction (longitudinal) is divided into 600 bins, while the vertical (latitudinal) is divided into 800 bins, a grid fine enough to resolve individual streets.

The level of detail visible in this map is impressive. My eye was initially drawn to the odd half-circle patterns in the East Village, which it turns out is Stuyvesant Village. It’s also odd that the bridges to Brooklyn and Queens show up at all. I’m pretty sure it’s impossible for cabs to pick up anyone on one of these bridges. Perhaps the presence of the bridges arises from uncertainty in the GPS coordinates or cabbies starting their trip meter after getting on the bridge? Hard to say. But it’s kinda neat! The other very clear observation is the drop in cab pickups in Manhattan around the north end of Central Park. The NYC Taxi and Limo Commission observed this same feature and calculated that 95% of pickups in Manhattan occurred south of this boundary, which led to the creation of the green Boro taxis in order to provide more equitable access to taxis (more info here and here).

In addition to simply mapping taxi pickup location, some interesting trends are visible by changing the color scale to reflect different variables. The following map takes the same pickup locations, but colors each grid point by the average distance traveled away from the pickup location:

trip_distance_vs_lat

The map is fairly uniform in color, except for around Wall Street and the southern tip of Manhattan, where the more yellow coloration shows a generally higher trip distance. It’s hard to pick off numbers from the color scale, so I took the map and created a modified zonal average; at each latitude, I averaged across the longitudinal grid points, only using grid points with non-zero values, and then plotted the result (above, right). The line plot shows pretty clearly that cabs picking up passengers around Wall Street tend to go about 2 miles farther than trips originating anywhere else. Why is this? Hard to say without a lot more analysis, but 4 to 4.5 miles happens to be about the distance up to Penn Station or Grand Central Station, so I suspect commuters who don’t mind spending extra money to avoid the subway on their way to the train.

In the following I’ll dive into the horrid detail of how I made the above maps, read on if you dare! Future posts will look into some more complicated analysis of the yellow cab data, beyond simple visualization, so stay tuned for that…

Data Description

The datafiles at TLC Trip Data, NYC are HUGE comma separated value (csv) plain text files, each weighing in at about 2 GB. A single file covers all entries for a single month and contains 19 entries per line. A single line in the csv file looks like:

1,2015-03-06 08:02:34,2015-03-06 08:14:45,1,3.80,-73.889595031738281,40.747310638427734,1,N,-73.861953735351563,40.768550872802734,2,14,0,0.5,0,0,0.3,14.8

The TLC yellow cab data dictionary, describes what each entry is, but the ones I find most interesting are: pickup time (entry 2), dropoff time (entry 3), passenger count (entry 4), trip distance (entry 5), pickup and drop off longitude and latitude coordinates (entries 5, 6, 9 and 10, respectively), total fare (entry 12) and tip (entry 15).  So in most basic terms, all I’ve done is read each line of the file, pull out the entry of interest along with its gps coordinates and then find a way to plot that aggregated data.

Source Code, Tricks and Tips

The source code (available here) is divided into two python modules: taxi_main.py and taxi_plotmod.py. The first deals with reading in the raw .csv files while the second contains routines for processing and plotting that data. The source includes several a README and several scripts that show examples of using modules, so I won’t go into that here. Rather, I’ll describe my general approach along with several problems I ran into (and their solutions).

Reading the data

Reading in a single csv data file is relatively straightforward. The standard open command (open(filename, ‘r’)) allows you to read in one line at a time. The functions  of interest are in taxi_main.py: read_taxi_file loops over all of the csv files in a specified directory and for each file calls read_all_variables. I did add some fancy indexing in the form of a list of strings that identify which of the 19 variables you want to save (see the Vars_To_Import variable) so that you don’t have to import all the data. Check out the main function in taxi_main.py to see an example of using the commands to read in files.

One  of the design decisions I made early on was to read in the data then store it in memory before processing it. Meaning there is a big array (Vars or VarBig in taxi_main.py) that stores all of the imported data before feeding it to the routines for spatially binning the data (see below). This seriously limits the scalability of the present code:  for the 3 months of data used in the above maps, I was using up 30-45% of my laptop’s memory while reading and processing the data. Including more months of data would use up the remaining RAM pretty quick. The way around this problem would be to write some routines to do that spatial binning or other processing on the fly, so that the whole data set wouldn’t need to be stored. The advantage of holding the data set in memory, however, is that I can easily process and re-process that data multiple times without the bottle neck of reading in the files. So I stuck with the present design rather than rewriting everything…

Spatially binning point data

To create spatial maps and images, I had to go from point data to a gridded data product. This is a common practice in geo-located data sets (data sets that have latitude, longitude and some associated measurement at those coordinates) and though there are many complicated ways to carry out this process, the concept is simple:

gridded_product_concept

Conceptual cartoon of how to make a gridded data product from distributed point data

Each data point has a latitude, longitude and associated measurement (the color of the dots in the case of the conceptual image above) and the data points are spatially scattered according to their coordinates. To make a gridded data product, I first specified a latitude/longitude grid (dashed lines, left). I then find which points lie within each bin, aggregate the measurements within a bin and then store that new single value in the bin. In this conceptual case, I’ve averaged the color of the data points to define the new gridded data product. In the NYC taxi maps shown above, the color is first determined by simply the total number of measurements within a bin and then the average of all the trip distance measurements within a bin.

Since the reading of the raw data file and subsequent spatial binning can take a few minutes, the code can write out the gridded data products (see the main function in taxi_main.py). These files are only a few MB in size, depending on the resolution of the grid, so it’s much easier to repeatedly load the gridded data products rather than re-processing the dataset when fine tuning the plotting.

pyTip: where. One of the useful python tricks I learned in this is the numpy routine where that identifies the indeces of an array that meet a specified criteria. Say you have a giant array X with values between 0 and 100. If you want to find the mean of X, but only using values less than 50, you could use np.mean(X[np.where(X<50)]).

pyTip: looping efficiency. Less of a python tip and more of a computational tip, but there are several ways to write a routine that spatially bins point data. The first approach that I stupidly employed was to loop over each bin in the spatial grid, find the points that fall within those bins and then aggregate them. If N is the total number of data points, Nx is the number of bins in the x dimension and Ny is the number of bins in the y dimension, then in psuedo-code, this looks like

 for ix in 1:Nx
     for iy in 1:Ny
 
      Current_Grid_Values = All data points within Lat_Lot[ix,iy]
      GridValue[iy,ix]=mean(Current_Grid_Values)

Why is this approach so bad? Well, it involves searching through the entire data set of N values each time through the loop. N is huge (the smallest cases I ran have N > 50,000, for the maps above, N>1000000), so for a fine grid like in the maps above (where Nx = 600, Ny=800), this is incredibly slow.

A much better way to do it is to loop over N, find the corresponding bin and aggregate the measurements within that bin (again, this is psuedo-code, check out the real source code, map_proc within taxi_plotmod.py to see the working python code):

 for iN in 1:N
      Measurement = all_data_points[iN]
      Current_Lat_Lon = Lat_Lot[iN]
      find ix,iy that contains the Current_Lat_Lon 
      GridValue[iy,ix]=GridValue[iy,ix] + Measurement
      GridCount[iy,ix]=GridCount[iy,ix] + 1

After the loop, we can find the mean value of measurements within each binned value by normalizing by the total number of measurements within each bin:

 GridValueMean=GridValue / GridCount

A simple switch, but it scales to large values of N, Nx and Ny much better. The above maps took only a few minutes to process using the second (good) approach, while when I initially tried using the first (stupid) approach, the code was still running when I killed it after 30 minutes.

Plotting the spatially binned data: The plotting makes use of the matplotlib module, specifically the pcolormesh function. I went through a lot of color maps and ended up settling on ‘hot.’ Gives the maps an almost electric feel.  Not much else to say about the plotting, pretty standard stuff. Check out plt_map within taxi_plotmod.py if you want the detail.

Stay tuned for more exploration of the NYC taxi data!

Animating Historical Urban Population Growth

In their recent article,  “Spatializing 6 ,000 years of global urbanization from 3700 BC to AD 2000”,  Reba et al. describe efforts to create a digital, geocoded dataset tracking the distribution of urban locations since 3700 BC. The digital database provides a record of human movement over the geologically recent past and is useful for understanding the forces that drive human societies towards urbanization. The database seemed like a fun test of my fledgling python skills and in the post here, I’ll describe a visualization of Reba et al.’s database. Reba et al. released their data here. My python source is available vi GitHub here. As this is my first actual post on the blog here, let me remind you that I’m something of a python noob. So if you actually check out the full source code, I expect you’ll find lots of mistakes and perhaps some poor choices. You’re welcome to let me know of all the ways I could improve the code in the comments section below.

Before jumping into what I actually did, a bit more on the dataset. Reba et al. (2016) created their new digital database from two earlier compilations, Chandler (1984)  and Modelski (2000,2003). The authors of those previous studies meticulously scoured archaelogical and historical records in search of locations of urban centers. Reba et al. took those datasets and created a .csv file listing each city’s name, latitude, longitude, population estimate and a time corresponding to the population estimate, a process that involved transcribing the original databases manually (ouch!! none of the available automated print-to-digital methods worked) and geocoding each entry to obtain a latitude and longitude. In the end, Reba et al. ended up with three digital .csv datasets: Chandler’s database for urban centers from 2250 BC to 1975 AD, Modelski’s ancient city database covering 3700 BC to 1000 AD and Modelski’s modern 2000 AD database. All told, there are just over 2000 unique cities recorded between the three databases, many of which have multiple population estimates through time.

It’s worth noting that the historical database is by no means complete. As Reba et al. discuss in detail, omissions of urban centers from the original Chandler and Modelski databases, unclear entries  in the original databases or errors in transcription would all result in missing cities. South Asian, South American, North American, and African cities are particularly under-represented. As a geologist, I’m used to incomplete records. Interpreting a fossil record, a regional sedimentary sequence or structural juxtaposition often requires some interpolation. A given rock unit may be preserved in one location while it is eroded and lost to knowledge in another. Thus, the completeness of any (pre)historical dataset depends on both preservation and sampling – there could be cities missing because the local climate did not preserve abandoned structures or simply because archaeology is a relatively young pursuit and excavation efforts have traditionally focused on a small fraction of the Earth’s surface. But as Reba et al. state “These data alone are not accurate global representations of all population values through time. Rather, it highlights the population values of important global cities during important time periods.”

I was pretty impressed by Reba et al.’s work and decided their dataset would provide an interesting context to improve my python. So I set out to write some python code capable of importing the database and making some pretty plots (source code here). Note that I do not distribute Reba et al.’s database with my source, you’ll have to download that separately from their site here.  See the README file in the source code for a list of other dependencies required to run the code, which was only tested with python 2.7.

Before digging too deeply into the code, let’s just start with the end product. Here’s an animation of Chandler’s data set from 2250 BC to present day. Each circle is an urban center and the color of circles changes from blue to red as the time approaches present day.

Pretty neat!  We can see clustering and expansion of early urban centers around Mesopotamia, then seemingly separate loci of urban development popping up in East Asia and Central America. Along South America, it’s interesting how urban centers trace out the coastline. And as the animation approaches present day, the number of urban centers explodes (from 1960 to 2014, the percentage of the world’s population living in urban settings increased from 34% to 54%).

In addition to animations, the python source can plot a single frame for a user-specified time range. Here are the entries for Modelski’s Ancient Cities database from 500 BC to 50 AD:

HistoricalCities_Modelski

The Code

The three main steps to producing the above animation were (1) import the data, (2) subsample the data and (3) plot the data. The module urbanmap.py includes all functions needed to reproduce the above figures. And the scripts ex_animate.py and ex_single_frame.py are examples that call the appropriate functions to create the above animation and single frame plot.

In the following, I’ll walk through the different steps and their related functions in urbanmap.py. Again, full source code is here.

(1) Importing the data

The first step is to actually read in some data! The function urbanmap.load_cities does just that for a specified dataset. The directory where the dataset is located is specified and the name of the dataset are given by the data_dir and city_file argument, respectively:

 39 def load_cities(data_dir,city_file):
 40     """ loads population, lat/lon and time arrays for historical 
 41         city data. """

The function works with any of the three original plain text, comma separated valued (CSV) files from Reba et al.: chandlerV2.csv, modelskiAncientV2.csv and modelskiModernV2.csv.

Each .csv file has a row for each city, where the columns are the City Name, Alternate City Name, Country, Latitude,Longitude, Certainty and Population. So first, I open up the .csv file and create a csv object using the CSV reader:

 44 #   load city files
 45     flnm=data_dir+city_file
 46     fle = open(flnm, 'r') # open the file for reading
 47     csv_ob=csv.reader(fle,skipinitialspace=True)  

Some of the Alternate City Names have commas within quotes, which causes those entries to split. Adding the second argument (skipinitialspace=True) to csv.reader prevents those commas within quotes from being read as a new comma-separated value.

The remainder of the function reformats the data into arrays that I find more flexible to work with. First, I generate an array called Time, which contains every time for which a population record exists. In the original .csv files, the header line of each Population column gives the time at which the population estimate corresponds to. The header values are strings, such as BC_1400, BC_1390,…,AD_100,AD_110,AD_1000…  So the first thing I do is convert these header strings to a 1D numpy array where each element of the array is the time in years before present (ybp).

 57 #   get header line 
 58     header = fle.next()
 59     header = header.rstrip()
 60     header = header.split(',')
 61 
 62 #   build the Time array
 63     nt = len(header)
 64     Time_string=header[6:nt]
 65     nt = len(Time_string)
 66     Time = np.zeros((nt,1))
 67 
 68 #   convert BC/AD to years before present 
 69     for it in range(0,nt):
 70         ct = Time_string[it]
 71         ct = ct.split('_')
 72         Time[it]=bc_ad_ce_to_ybp(ct[1],ct[0])

Why go through all this trouble? Well, let’s say I want all cities with a recorded population for the time range 1400 BC to 50 AD. If I knew the header values exactly, I could find the indeces in Time_string corresponding to BC_1400 and AD_50. But the headers aren’t uniform within a single .csv file or across .csv files. The way I’ve constructed the Time array, however, allows for straightforward conditional indexing. The usefulness becomes more apparent after reading in the remaining data and I describe it in section 2 below.

The next lines (lines 74-101 of urbanmap.py) are pretty straightforward. Each row of the database is read in and distributed to one of three arrays: PopuL, city_lat and city_lon. The latter two contain the latitude and longitude of every city. PopuL is a 2D matrix with a row for each city and a column for each population record (i.e., PopuL.shape() returns n_city by n_Times).

I did run into some trouble with blank space characters. A few lines of the .csv files have some non-ascii blank space characters ‘\xa0’ that resulted in errors when I tried to convert the entries into floating point values. So I had to replace those characters with a true blank space before converting:

 81     for row in csv_ob:
 82 #       read in current line
 83         line = row
 84         line = [item.replace('\xa0',' ') for item in line]

 87 #       pull out lat/lon    
 88         city_lat[indx] = float(line[3])
 89         city_lon[indx] = float(line[4]) 

And that’s about it for reading in the data…

(2) Subsampling the data

Now that I’ve got all the data loaded, I want to be able to subsample that data for a specified time range. The main function to do this in urbanmap.py is get_lon_lat_at_t:

177 def get_lon_lat_at_t(year_range,city_lon,city_lat,Time,PopuL):

Most of the arguments (city_lon,city_lat,Time,PopuL) are returned by the load_cities function, described above. The year_range argument is a string argument that specifies the time span for which I want to select cities with non-zero population records. I chose to make year_range a comma separated string:

year_range='5000,BCE,1900,CE'

This year range starts at 5000 BCE and ends at 1900 CE. The time unit can be BCE,CE,BC or AD. Within get_lon_lat_at_t, I first convert this year_range to a start and end date in ybp:

186 #   convert year_range to years before present
187     year_range=year_range.replace(" ", "")
188     years=year_range.split(',')
189     time_range=[0,0]
190     time_range[0]=bc_ad_ce_to_ybp(years[0],years[1])
191     time_range[1]=bc_ad_ce_to_ybp(years[2],years[3])

Now, I can easily select the cities within the desired time range without knowing beforehand whether or not the chosen endpoints exist exactly in the Time array. First, I loop through each city and select the population records of the current city

193 #   find lat and lon of cities with recorded populations in database
194     for icit in range(0,ncit):
195         pop=PopuL[icit,:] # current population

Next, I find the times in current city that have a nonzero population record

196         pop_t=Time[pop>0] # times with nonzero pops    

And now I pull out times that are within the specified time range

197         pop_t=pop_t[pop_t<=time_range[0]] # pops within time range
198         pop_t=pop_t[pop_t>=time_range[1]]

The final step is check if there are any cities left. If there are no cities with a nonzero population record in the specified time range, I flag them for removal:

200         if pop_t.size == 0: # flag for removal 
201            lons[icit]=999.
202            lats[icit]=999.

So at the end of all this, I select the lons and lats that are not equal to 999 and those are the longitudes of the cities with a nonzero population within the specified time range. Neat-o! I can now return these lon/lat values and make some plots!

(3) Plotting the data

Now that we’ve got a function to subsample the data, we can plot that data in a number of ways. The simplest place to start is to put all the lat/lon of cities with a nonzero population record for a specified time range on a map. This is what the __main__ function of urbanmap.py and the script ex_single_frame.py accomplish. In both, I sequentially call load_cities and get_lon_lat_at_t functions then plotting the resulting lat/lon values using the basemap toolkit (mpl_toolkits).

The plotting is accomplished in two functions: urbanmap.base_plot() and urbanmap.city_plots(). The first creates a basemap object with the background image of the globe while the second actually puts the current lat/lon values onto that background image. base_plot() follows this example pretty closely.

The following, from ex_single_frame.py, shows the full sequence to plot cities within a current time span.

 33 import urbanmap as um
 34 import matplotlib.pyplot as plt
 35 
 36 # select time range
 37 time_span='500,BCE,50,CE'
 38        # comma separated string noting desired time span. The age descriptor can 
 39        # be BC, BCE, AD or CE.
 40 
 41 # select data set
 42 data_dir='../data_refs/'
 43 city_file='modelskiAncientV2.csv'
 49 
 50 # import data set 
 51 (Time,PopuL,city_lat,city_lon)=um.load_cities(data_dir,city_file)
 52 
 53 # get lon/lat of cities in time span    
 54 lons,lats,time_range=um.get_lon_lat_at_t(time_span,city_lon,city_lat,Time,PopuL)
 55 
 56 # plot it
 57 plt.figure(facecolor=(1,1,1))
 58 m = um.base_plot() # create base plot and map object
 59 um.add_annotations() # adds references
 60 um.city_plot(lons,lats,m,'singleframe',time_range,Time) # plot points
 61 plt.show() # and display the plot

Now that we have functions to pick out cities within a time range and then plot those points, creating an animation is conceptually straightforward. I just needed to repeatedly call get_lon_lat_at_t and city_plot, varying the time range each call. However in practice, sorting through the animation routines in the python animation package was the trickiest part of this whole exercise. I quickly gave up on using the animation routines, and simply looped over a time range, subsampling and plotting at each increment, saving the figure at each step along the way. I was then left with a bunch of image files (the frames of the animation), which I then concatenated into an animation using a bash script and ImageStack.

In the end, I managed to figure out the python animation routines, and that’s I included in the source code.

PyTip: shifting the basemap. I used two functions to rotate the center longitude of the map:  mpl_toolkits.basemap.Basemap() and  mpl_toolkits.basemap.shiftgrid(). The Basemap function creates the longitude/latitude projection while shiftgrid rotates the topography data to align with the Basemap. BOTH functions take an argument lon0, but in Basemap, lon0 is defined as the center longitude of the projection while in shiftgrid lon0 is the westernmost longitude. I was tripped up by this for a while because I assumed lon0 had the same meaning in each… whoops.

PyTip: accessing the data for animation. Most of the tutorials for the animation function (matplotlib.animation.FuncAnimation) are relatively simple and are set up to re-calculate the data to plot at each frame. The issue I ran into was that FuncAnimation animates a specified function by sequentially feeding it a frame index. I couldn’t figure out how to pass additional arguments (the full dataset) and importing the data set at each frame would be way too slow. I had an existing dataset that I wanted to read in only once at the start of the program. I first got around this by declaring the database variables (PopuL, city_lats, city_lons,…) as global variables so that they’d be accessible within the FuncAnimation call. This was pretty easy but I’m always a little uncomfortable using global variables. My approach in the end relied on simply better understanding how python handles variables. By nesting all of the animation functions under one top level function, any variables set in that top level function are available at the lower levels (in a sense, they’re locally global?). I found this post useful.

References:

Reba, Meredith, Femke Reitsma, and Karen C. Seto. “Spatializing 6,000 years of global urbanization from 3700 BC to AD 2000.” Scientific data 3:160034 doi: 10.1038/sdata.2016.34 (2016).
Historical Urban Population Growth Data: http://urban.yale.edu/data
Chandler, Tertius. “Four thousand years of urban growth: an historical census.” The Edwin Mellen Press (1987).
Modelski G. World Cities: -3000 to 2000 (FAROS 2000, 2003).