Great Circle Paths

Been quite a while since any updates, but here’s a short one!

As a part of a contract I’m working on, I found myself having to plot the major arc of great circle paths on a map. But if you google “how to plot great circle path in [insert python library here]” all the solutions are for plotting minor arcs. Turns out in the end, there’s a really simple trick to plotting the major arc (and I felt pretty dumb when I realized it after wasting a ton of time), but I figured I’d write it up here in case it saves anyone else a bit of time. The short answer: given two points that you want the major arc for, just add the antipodes for each point to the list of points in your path.

First off, in case you need a review of great circles, here’s a globe:

Given two points on the surface of a sphere, there is a single circle that that contains both points (unless you’re at a pole, in which case there are infinite great circles). The short way round is the minor arc (red curve), the long way round is the major arc (green curve). And I needed to plot both of them.

The reason I got into plotting this in the first place is that in seismology, surface waves are described by major and minor arcs. When an earthquake generates seismic waves and is measured at a seismometer somewhere else, the raypath between the epicenter and seismic station falls on a great circle path. And surface waves are referred to in terms of the minor and major arcs: the R1 wave travels the minor arc and the R2 travels the major arc. These waves will actually keep going around the earth’s surface before dissipating: R3 is the R1 after it goes around again, R4 is the R2 after it goes around again, and on and on. So I needed to be able to plot all these.

Ok, so back to plotting…

I was using the plotly library in Python for this plot, so I’ll stick with that for examples here, but there should be similar functions in whatever mapping library you’re using. The full script is on my github page here.

So the important bit is just defining the list of latitude and longitude points. Here, the minor arc points are put into a dictionary:

paths={}
paths['minor_arc']={'lon':[ start_lon, end_lon ],
                    'lat':[start_lat,end_lat], 'clr':'red','dash':None}

When we give this to plotly, we’ll tell it to connect the two points, which will give us the shortest path between the two, the minor arc.

To plot the major arc, we just need to add some points between the start and end so that it takes the long way around. But how to choose the points? Well, turns out that there are tons of confusing pages out there on the trig used for calculating great circle paths, and I almost started to code up some of it… until I realized only needed a couple points. And the antipodes (the point that is exactly opposite a given point on the surface) are both real easy to calculate and  guaranteed to lie on the great circle path. Just add 180 to the longitude and flip the sign of the latitude:

ant1lon=start_lon+180
if ant1lon>360:
    ant1lon= ant1lon - 360
ant1lat=-start_lat

Same for the antipode of the second, end point. The >360 bit is just to make sure longitude remains between 0 and 360 degrees.

And now, we can put the antipodes in a list for the major arc:

paths['major_arc']={'lon':[ start_lon,ant2lon,ant1lon, end_lon ],
                    'lat':[start_lat,ant2lat,ant1lat,end_lat], 
                    'clr':'green','dash':None}

In the script, these paths are then added as a data dictionary used in creating the plotly figure:

DataDict=list()
for path in ['minor_arc','major_arc']:
    DataDict.append(
        dict(
            type = 'scattergeo',
            lon = paths[path]['lon'],
            lat = paths[path]['lat'],
            name= path,
            mode = 'lines',
            line = dict(
                width = 2.5,
                color = paths[path]['clr'],
                dash=paths[path]['dash'],
            ),
            opacity = 1.0,
        )
    )

figdata={}
figdata['data']=DataDict

The full script has a bit more where it actually plots the data (to give the image above), but plotly has some really nice tutorials for that already so I won’t bother explaining all that.

So that’s that! Hope it saves someone else some time.

 

Quick tutorial on Geocoding with Python

I recently found myself needing to get the latitude/longitude of a list of cities (for this map here) and it turns out, it’s pretty easy now that I know how to do it. Here’s a quick tutorial!

Ok, so the process of taking a city and assigning a latitude/longitude point is called geocoding. There are many services that offer this (e.g., Google or Bing Maps APIs) but most I looked at seemed overkill for a one-time task of assigning lat/lon to about 500 cities. But then I discovered OpenStreetMap’s Nominatim!  You can modify the http address to return results in an xml file. For example,  the following searches for Providence, RI:

https://nominatim.openstreetmap.org/search.php?q=Providence+RI+USA&format=xml

 

And returns:

<searchresults timestamp="Thu, 02 Feb 17 16:17:00 +0000" attribution="Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright" querystring="Providence RI USA" polygon="false" exclude_place_ids="158799064,159481664" more_url="https://nominatim.openstreetmap.org/search.php?format=xml&exclude_place_ids=158799064,159481664&accept-language=en-US,en;q=0.8&q=Providence+RI+USA">
<place place_id="158799064" osm_type="relation" osm_id="191210" place_rank="16" boundingbox="41.772414,41.861571,-71.4726669,-71.3736134" lat="41.8239891" lon="-71.4128342" display_name="Providence, Providence County, Rhode Island, United States of America" class="place" type="city" importance="0.80724054252736" icon="https://nominatim.openstreetmap.org/images/mapicons/poi_place_city.p.20.png"/>
<place place_id="159481664" osm_type="relation" osm_id="1840541" place_rank="12" boundingbox="41.7232498,42.0188529,-71.7992521,-71.3177699" lat="41.8677428" lon="-71.5814833" display_name="Providence County, Rhode Island, United States of America" class="boundary" type="administrative" importance="0.58173948152676" icon="https://nominatim.openstreetmap.org/images/mapicons/poi_boundary_administrative.p.20.png"/>
</searchresults>

If you scroll to the right you’ll see:

lat="41.8239891" lon="-71.4128342"

It’s pretty easy to write a python script to request then parse the xml result for lat and lon.  Here’s what that might look like (BUT DON’T DO THIS):

import urllib2

city='Providence, RI'
city_search=city.replace(' ','').split(',') # removes whitespace, splits city/state

# build the http address:
# (results in a string: 'https://nominatim.openstreetmap.org/search.php?q=Providence+RI+USA&format=xml')
osm='https://nominatim.openstreetmap.org/search.php?q=' 
fmt='+USA&polygon=1&format=xml'
srch = osm + city_search[0] + '+' + city_search[1] + fmt

# now use urllib2 to open the url and store the result:
response = urllib2.urlopen(srch)
the_page = response.read().split()

# and now we can parse the resulting string array where the xml info is stored.
# the it only stores the first Lon/Lat that it encounters

Lon = 0.0
Lat = 0.0
for iel in range(0,len(the_page)): # loop over the strings in the_page, look for Lat/Lon
    if 'lon=' in the_page[iel] and Lon == 0.0:
        Lon=float(the_page[iel].split("'")[1])
    if 'lat=' in the_page[iel] and Lat == 0.0:
        Lat=float(the_page[iel].split("'")[1])

So. Why not just loop over your list of cities and repeat this exercise? Well if you check out Nomanatim’s documentation page, and take a look at the usage policy, it requires: “(1) No heavy uses (an absolute maximum of 1 request per second). (2) Provide a valid HTTP Referer or User-Agent identifying the application (stock User-Agents as set by http libraries will not do). (3) Clearly display attribution as suitable for your medium. (4) Data is provided under the ODbL license which requires to share alike (although small extractions are likely to be covered by fair usage / fair dealing).” While I don’t think that my case of simply geocoding 500 or so cities falls under heavy usage and I could just delay my successive calls, I decided to look into their suggestions for other options.

In the end I settled on MapQuest’s implementation of Nominatim. It provides access to all the OpenStreetMaps data (still open source and subject to the OSM license agreements) and a MapQuest free developer account gets you 15,000 request/month for free. Waaay more than I’d need for this project.

So to geocode a list of cities, first sign up for a MapQuest Developer Account. You’ll get an API key assigned to you. Unlike some other API’s, MapQuest doesn’t use any fancy authentication. You basically just make a request for the URL with the API in the http address. Reaaaaally easy (but not exactly secure).

Then you can run a code very similar to that above. My implementation is here: look_up_latlons.py. But it’s kind of tied to the data that I was mapping.

Some notes on the code.

(1)  the API key is passed in through a command line argument, so when you run this code you have to type

$ python look_up_latlons.py AL1243KSFD242332552134KLJ

where that long string of letters/numbers is whatever your API key is.

(2) And then the formatting of the http address is slightly different from the standard Nominatim api. The same search for Providence RI  looks like:

 http://open.mapquestapi.com/nominatim/v1/search.php?key=API_KEY&format=xml&q=Providence+RI

where API_KEY is, again, your API key.

(3) In my implementation, I have imported a CSV file as a pandas dataframe (called Counts). Each row contains a city name along with the number of people who marched in the Women’s Marches on Jan. 21. The meat of the code is copied below, in which I iterate over the rows in the dataframe (named Counts here), find the lat/lon for each row (i.e., each city) and then store that lat/lon in a new dataframe (NewCounts) because it’s bad to modify an existing dataframe while iterating over it. Here’s what that looks like:

 osm='http://open.mapquestapi.com/nominatim/v1/search.php?key='+API_KEY+'&format=xml&q='

 # loop over cities in crowd counts, find Lat/Lon
 NewCounts=Counts.copy()
 NewCounts['lon']=np.zeros(len(Counts)) # add new column for lon
 NewCounts['lat']=np.zeros(len(Counts)) # add new column for lat
 for index, row in Counts.iterrows():

     srch=osm+str(row['City']).replace(' ','+')

     print '\n\nLooking up lat/lon for',row['City'],index
     time.sleep(dt) 
     response = urllib2.urlopen(srch)
     the_page = response.read().split()
 
     for iel in range(0,len(the_page)):
         if 'lon=' in the_page[iel] and NewCounts['lon'][index]==0.0:
            NewCounts['lon'][index]=float(the_page[iel].split("'")[1])
         if 'lat=' in the_page[iel] and NewCounts['lat'][index]==0.0:
            NewCounts['lat'][index]=float(the_page[iel].split("'")[1])

      print row['City'],NewCounts['lon'][index],NewCounts['lat'][index]

The MapQuest API didn’t have any specific usage constraints for how frequently you make a request, just overall number in a month, but I added a small delay between calls using the time.sleep() function anyway.

That’s all for now, hopefully some more posts with colorful plots coming soon!

 

Mapping some things!

Since publishing my series of posts on manipulating shapefiles in Python (1, 2 and 3), I’ve been exploring different open data catalogs so I thought I’d share some of the maps I’ve mapped! All these were produced using scripts in my learning_shapefiles repository on GitHub (see denver_stack.py and denver_tree_canopy.py).

Downtown Denver

denver_downtown

MapZen is a pretty sweet service! You can draw a regional box then the folks at MapZen will take that box and extract all the OpenStreetMap data within the box then give you the shapefiles with all that info. First thing I (stupidly) did after downloading my data extraction from MapZen was to just plot all of the shapes… and I then proceeded to sit around for quite some time while my laptop chugged away and produced a highly detailed map of all the things in Denver County. I zoomed in to downtown Denver for the above image to show off the detail.

Metro-Denver

denver_road_rivers

A more abstract representation of the primary roadways in metro Denver. I figured out how the MapZen/OpenStreetMap shapefile was organized and only plotted motorways, primary and secondary roads (bright white in the map). I also created a grid containing the shortest distance to a roadway (using  the distance method available for shapely.geometry.LineString) and contoured the inverse distance (1/x) to evoke the topographic contours along rivers.

Tree Cover in Denver County

denver_1

Denver’s Open Data Catalog has a bunch of databases with shapefiles galore. I downloaded one for the Tree Canopy and then plotted up the percent tree cover in each geometry. This is what actually lead me to learn how to plot roadways… and here I overlaid the tree cover on the same MapZen extraction of OpenStreeMap data. Along with the roadways underneath, it forms a sort of abstract tree.

So that’s it for today. Three maps from a couple different open data sources using some inefficient Python. Not going to go into detail on the code this time, because, well, it’s slooow. I’m using modifications of the simple scripts that I used in my shapefile tutorials because I was excited to just plot things! But there are much better ways to handle shapefiles with hundreds of thousands of geometries. So until next time (after I figure out how to use fiona or geopandas), just enjoy the visuals!

Shapely Polygons: Coloring Shapefile Polygons

In my previous two posts, I showed how to (1) read and plot shapefile geometries using the pyshp library and (2) plot polygons using shapely and descartes. So the obvious next step is to combine the two! And that’s what I’ll cover today, again using my learning_shapefiles github repo along with the shapefile of state boundaries from census.gov.

The Final Map

In case you don’t care about the Python and are just curious about the end product, here’s the final map where the color of each state reflects its total land area:

shapefile_us_colored_by_area_sat

It’s kind of neat to see the gradient of state size from east to west, reflecting the historical expansion of the U.S. westward, but other than that, there’s not much to the map. But it does serve as a simple case for learning to manipulate shapefiles.

The Code

There are two scripts in learning_shapefiles/src of relevance for today’s post: basic_readshp_plotpoly.py and read_shp_and_rcrd.py. The first script is a simple combination of basic_read_plot.py and simple_polygons.py (from my previous two posts), plotting the shapefile geometries using polygons instead of lines, so let’s start there.

basic_readshp_plotpoly.py

The code starts out the same as basic_read_plot.py, but now also imports Polygon and PolygonPatch from shapely and descartes, before reading in the shapefile:

import shapefile
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import Polygon
from descartes.patch import PolygonPatch

"""
 IMPORT THE SHAPEFILE 
"""
shp_file_base='cb_2015_us_state_20m'
dat_dir='../shapefiles/'+shp_file_base +'/'
sf = shapefile.Reader(dat_dir+shp_file_base)

The next part of the code plots a single geometry from the shapefile. This is super easy because shapefile.Reader reads a shapefile geometry as a list of points, which is exactly what the Polygon function needs. So we can just give that list of points directly to the Polygon function:

plt.figure()
ax = plt.axes()
ax.set_aspect('equal')

shape_ex = sf.shape(5) # could break if selected shape has multiple polygons. 

# build the polygon from exterior points
polygon = Polygon(shape_ex.points)
patch = PolygonPatch(polygon, facecolor=[0,0,0.5], edgecolor=[0,0,0], alpha=0.7, zorder=2)
ax.add_patch(patch)

# use bbox (bounding box) to set plot limits
plt.xlim(shape_ex.bbox[0],shape_ex.bbox[2])
plt.ylim(shape_ex.bbox[1],shape_ex.bbox[3])

And we get Washington, now as a colored polygon rather than an outline:

shapefile_single

Woo!

And as before, we can now loop over each shape (and each part of each shape), construct a polygon and plot it:

""" PLOTS ALL SHAPES AND PARTS """
plt.figure()
ax = plt.axes() # add the axes
ax.set_aspect('equal')

icolor = 1
for shape in list(sf.iterShapes()):

    # define polygon fill color (facecolor) RGB values:
    R = (float(icolor)-1.0)/52.0
    G = 0
    B = 0

    # check number of parts (could use MultiPolygon class of shapely?)
    nparts = len(shape.parts) # total parts
    if nparts == 1:
       polygon = Polygon(shape.points)
       patch = PolygonPatch(polygon, facecolor=[R,G,B], alpha=1.0, zorder=2)
       ax.add_patch(patch)

    else: # loop over parts of each shape, plot separately
      for ip in range(nparts): # loop over parts, plot separately
          i0=shape.parts[ip]
          if ip < nparts-1:
             i1 = shape.parts[ip+1]-1
          else:
             i1 = len(shape.points)

          polygon = Polygon(shape.points[i0:i1+1])
          patch = PolygonPatch(polygon, facecolor=[R,G,B], alpha=1.0, zorder=2)
          ax.add_patch(patch)

    icolor = icolor + 1

plt.xlim(-130,-60)
plt.ylim(23,50)
plt.show()

In order to distinguish each polygon, I set each shape’s color based on how many shapes have already been plotted:

R = (float(icolor)-1.0)/52.0

This grades the red scale in an RGB tuple between 0 and 1 (since there are 52 shapes), and it is then used in the facecolor argument of PolygonPatch. The coloring is simply a function of the order in which the shapes are accessed:

shapefile_us

The goal, however, is to color each polygon by some sort of data so that we can actually learn something interesting, and that is exactly what read_shp_and_rcrd.py does.

read_shp_and_rcrd.py

Up to now, we’ve only considered the shape geometry, but that is only one part of a shapefile. Also included in most shapefiles are the records, or the data, associated with each shape. When a shapefile is imported,

shp_file_base='cb_2015_us_state_20m'
dat_dir='../shapefiles/'+shp_file_base +'/'
sf = shapefile.Reader(dat_dir+shp_file_base)

The resulting shapefile object (sf in this case) contains records associated with each shape. I wasn’t sure what fields were included for the State Boundary shapefile from census.gov, so I opened up a Python shell in terminal, read in the shapefile then typed

>>> sf.fields

to get a list of available fields:

[('DeletionFlag', 'C', 1, 0), ['STATEFP', 'C', 2, 0], ['STATENS', 'C', 8, 0], ['AFFGEOID', 'C', 11, 0], ['GEOID', 'C', 2, 0], ['STUSPS', 'C', 2, 0], ['NAME', 'C', 100, 0], ['LSAD', 'C', 2, 0], ['ALAND', 'N', 14, 0], ['AWATER', 'N', 14, 0]]

Down towards the end, there’s an interesting entry

['ALAND', 'N', 14, 0]

Though I couldn’t find any documentation on the included fields, I suspected ALAND stood for land area (especially since it was followed by AWATER). So in read_shp_and_rcrd.py, the first thing I do is extract the field names and find the index corresponding the the land area:

""" Find max/min of record of interest (for scaling the facecolor)"""

# get list of field names, pull out appropriate index
# fieldnames of interest: ALAND, AWATER are land and water area, respectively
fld = sf.fields[1:]
field_names = [field[0] for field in fld]
fld_name='ALAND'
fld_ndx=field_names.index(fld_name)

I found this post helpful for extracting the fieldnames of each record.

Next, I loop over the records using the interRecords() object to find the minimum and maximum land area in order to scale the polygon colors:

# loop over records, track global min/max
maxrec=-9999
minrec=1e21
for rec in sf.iterRecords():
    if rec[4] != 'AK': # exclude alaska so the scale isn't skewed
       maxrec=np.max((maxrec,rec[fld_ndx]))
       minrec=np.min((minrec,rec[fld_ndx]))

maxrec=maxrec/1.0 # upper saturation limit

print fld_name,'min:',minrec,'max:',maxrec

I excluded Alaska (if rec[4] != ‘AK’:) so that the color scale wouldn’t be thrown off, and then I also scale the maximum (maxrec=maxrec/1.0) to adjust the color scale manually (more on this later).

Now that I know the max/min, I loop over each shape and (1) calculate the RGB value for each polygon using a linear scale between the max and min and then (2) plot a polygon for each shape (and all the parts of a shape) using that RGB value:

for shapeRec in sf.iterShapeRecords():
    # pull out shape geometry and records 
    shape=shapeRec.shape
    rec = shapeRec.record

    # select polygon facecolor RGB vals based on record value
    if rec[4] != 'AK':
         R = 1
         G = (rec[fld_ndx]-minrec)/(maxrec-minrec)
         G = G * (G<=1) + 1.0 * (G>1.0)
         B = 0
    else:
         R = 0
         B = 0
         G = 0

    # check number of parts (could use MultiPolygon class of shapely?)
    nparts = len(shape.parts) # total parts
    if nparts == 1:
       polygon = Polygon(shape.points)
       patch = PolygonPatch(polygon, facecolor=[R,G,B], edgecolor=[0,0,0], alpha=1.0, zorder=2)
       ax.add_patch(patch)
    else: # loop over parts of each shape, plot separately
       for ip in range(nparts): # loop over parts, plot separately
           i0=shape.parts[ip]
           if ip < nparts-1:
              i1 = shape.parts[ip+1]-1
           else:
              i1 = len(shape.points)

          # build the polygon and add it to plot 
          polygon = Polygon(shape.points[i0:i1+1])
          patch = PolygonPatch(polygon, facecolor=[R,G,B], alpha=1.0, zorder=2)
          ax.add_patch(patch)

plt.xlim(-130,-60)
plt.ylim(23,50)
plt.show()

One import thing not to miss is that on the first line, I loop over the iterShapeRecords iterable rather than using iterShapes. This is neccesary so that I have access to both shape geometry and the associated records, rather than just the shapes (iterShapes) or just the records (iterRecords).

Running the above code will produce the following map:

shapefile_us_colored_by_area

Because Texas is so much larger than the rest of the states, we don’t see much of a difference between the states. But we can adjust this by decreasing the max value using in the scaling. So after finding the max/min value, I set

maxrec=maxrec/2.0 # upper saturation limit

and end up with the following map that brings out more of the variation in the states’ land area (same map as in the very beginning of this post):

shapefile_us_colored_by_area_sat

Note that because I’m decreased the maxvalue for scaling, I had to ensure that the RGB value did not exceed 1, which is why I had the following lines limiting the green value (G):

    if rec[4] != 'AK':
         R = 1
         G = (rec[fld_ndx]-minrec)/(maxrec-minrec)
         G = G * (G<=1) + 1.0 * (G>1.0)

So that’s about it! That’s how you can read in a shapefile and plot polygons of each shape colored by some data (record) associated with each shape. There are plenty of more sophisticated ways to do this exercise, and I’ll be looking into some other shapefile Python libraries for upcoming posts.

Shapefiles in Python: shapely polygons

In my last post, I described how to take a shapefile and plot the outlines of the geometries in the shapefile. But the power of shapefiles is in the records (the data) associated with each shape. One common way of presenting shapefile data is to plot the shapefile geometry as polygons that are colored by some value of data. So as a prelude to doing just that, this post will cover how to plot polygons using the shapely and descartes libraries. As always, my code is up on my github page.

The two python libraries that I’ll be using are shapely (for constructing a polygon) and descartes (for adding a polygon to a plot). So step 0 is to go install those! I’ll also be using the numpy and matplotlib libraries, but you probably already have those.

Though the documentation for shapely has some nice sample source code, I wrote my own script, simple_polygons.py, to get to know the libraries better. In this approach, there are two steps to building a polygon from scratch: constructing the points that define the polygon’s shape and then mapping those points into a polygon structure. The first step doesn’t require any special functions, just standard numpy. The second step uses the  shapely.geometry.Polygon class to build a polygon from a list of coordinates.

There are limitations for valid polygons, but virtually any shape can be constructed, like the following pacman:

pacman

The first step is to build the list of coordinates defining the exterior points (the outer circle) and a list of interior points to exclude from the polygon (the eyeball). Starting with the exterior points, I calculate the x and y coordinates of unit circle from 0.25pi to 7/4pi (0 to 2pi would map a whole circle rather than a pacman):

theta = np.linspace(0.25*3.14,1.75*3.14,80) 

# add random perturbation 
max_rough=0.05 
pert=max_rough * np.random.rand(len(theta)) 

x = np.cos(theta)+pert 
y = np.sin(theta)+pert

I also add a random, small perturbation to each x-y position to add a bit of roughness to the outer pacman edge, because I wanted some small scale roughness more similar to the shapefiles I’d be plotting later. Next, I build a python list of all those x-y points. This list, ext, is the list of exterior points that I’ll give to shapely:

# build the list of points 
ext = list() 

# loop over x,y, add each point to list 
for itheta in range(len(theta)): 
    ext.append((x[itheta],y[itheta])) 

ext.append((0,0)) # add 0 point

At the end, I add the 0,0 point, otherwise the start and end points on the circle would connect to each other and I’d get a pacman that was punched in the face:

pacman_punch

That takes care of the exterior points, and making the list of interior points is similar. This list, inter, will be a list of points that define interior geometries to exclude from the polygon:

# build eyeball interior points 
theta=np.linspace(0,2*3.14,30) 
x = 0.1*np.cos(theta)+0.2 
y = 0.1*np.sin(theta)+0.7 

inter = list() 
for itheta in range(len(theta)): 
    inter.append((x[itheta],y[itheta])) 
inter.append((x[0],y[0]))

Now that we have the list of exterior and interior points, you just give that to shapely’s polygon function (shapely.geometry.Polygon):

polygon = Polygon(ext,[inter[::-1]])

Two things about passing Polygon the interior list: (1) you can actually pass Polygon a list of lists to define multiple areas to exclude from the polygon, so you have to add the brackets around inter and (2) I haven’t quite figured out the [::-1] that the shapely documentation includes. I know that generally, [::-1] will take all the elements of a list and reverse them, but why does Polygon need the points in reverse? No idea. Without it, I only get an outer edge defining the eyeball:

pacman_badeye

I would love to get some information on why Polygon needs the reversed list, so leave me a note in the comments if you know why.

Regardless, the next step is to add that polygon structure to a plot, with a straightforward use of matplotlib.pyplot (imported as plt) and descartes.patch.PolygonPatch:

 

# initialize figure and axes 
fig = plt.figure() 
ax = fig.add_axes((0.1,0.1,0.8,0.8)) 

# put the patch on the plot 
patch = PolygonPatch(polygon, facecolor=[0,0,0.5], edgecolor=[1,1,1], alpha=1.0) 
ax.add_patch(patch) 

# new axes 
plt.xlim([-1.5, 1.5]) 
plt.ylim([-1.5,1.5]) 
ax.set_aspect(1) 

plt.show()

PolygonPatch’s arguments are pretty self explanatory: facecolor and edgecolor set the colors for the fill and edge of the polygon. Conveniently, facecolor and edgecolor can be specified as RGB values, which I’ll take advantage of for plotting shapefile records in my next post. It can also accept any of the kwargs available to matplotlib.patches.Polygon class (like the transparency,alpha, between 0 and 1).

So that’s it! Pretty easy! And in some ways it is even easier to plot polygons from a shapefile, since pyshp imports shapefile coordinates as a list and you can just give that list directly to Polygon… more on that in the next post.

Shapefiles in Python: a super basic tutorial

I recently started a couple of projects that will involve using shapefiles and I got frustrated real fast. Many tutorials that I found assumed some previous knowledge of either shapefiles or the python libraries used to manipulate them. But what I wanted was a tutorial that helped me to plot a simple shapefile while getting to know what a shapefile actually is!

So here’s a SUPER simple example of how to load, inspect and plot a shapefile to make a map of the U.S! There are quite a few Python libraries dealing with shapefiles and it was hard to find the easiest place to start. I found the pyshp Python library the most approachable, so that’s what I use in the following example. There are many ways to visualize shapefiles in a more automated way than I do here, but I think that my approach here gives a clearer picture to a beginner of what a shapefile is and how to use Python with shapefiles.

The shapefile

Go get yourself a shapefile! The one I used (which will definitely work with my code below) is the lowest resolution state-level cartographic boundary shapefile from census.gov (link to census.gov, direct link to lowest resolution 20m .zip file). Once you download the .zip file, unpack it and take a look inside. A shapefile is actually a collection of different files, including a .shp file containing information on shape geometry (state boundaries in this case), a .dbf file containing attributes of each shape (like the name of each state) and others (check out the wiki page on shapefiles for a description of the other file extensions).

The code!

You can download my Python code: https://github.com/chrishavlin/learning_shapefiles

At present, the src folder includes only one python script: basic_read_plot.py. To run this script you will need to:

  1. install the pyshp Python library  (and numpy and matplotlib if you don’t have them already)
  2. edit the variables in the source code describing the path to the shapefile (dat_dir and shp_file_base in src/basic_read_plot.py)

After those two steps, just open up a terminal and run the script (assuming you’re in the src directory):

$ python basic_read_plot.py

The three plots described below should pop up.

So what does the code do? 

After the initial comment block and library import, the code reads in the shapefile using the string variables that give the location of the shapefile directory (data_dir) and the name of the shapefile without extension (shp_file_base):

sf = shapefile.Reader(dat_dir+shp_file_base)

This creates a shapefile object, sf, and the next few lines do some basic inspections of that object. To check how many shapes have been imported:

print 'number of shapes imported:',len(sf.shapes())

For the census.gov state boundary shapefile, this returns 52 for the 50 states, Washington D.C. and Puerto Rico.

For each shape (or state), there are a number of attributes defined: bbox, parts, points and shapeType. The pyshp documentation describes each, and I’ll touch on each one in the following (except for shapeType).

The first thing I wanted to do after importing the shapefile was just plot a single state. So I first pull out the information for a single shape (in this case, the 5th shape):

shape_ex = sf.shape(5)

The points attribute contains a list of latitude-longitude values that define the shape (state) boundary. So I loop over those points to create an array of longitude and latitude values that I can plot. A single point can be accessed with shape_ex.points[0] and will return a lon/lat pair, e.g. (-70.13123,40.6210). So I pull out the first and second index and put them in pre-defined numpy arrays:

x_lon = np.zeros((len(shape_ex.points),1))
y_lat = np.zeros((len(shape_ex.points),1))
for ip in range(len(shape_ex.points)):
    x_lon[ip] = shape_ex.points[ip][0]
    y_lat[ip] = shape_ex.points[ip][1]

And then I plot it:

plt.plot(x_lon,y_lat,'k')

# use bbox (bounding box) to set plot limits
plt.xlim(shape_ex.bbox[0],shape_ex.bbox[2])

single

This returns the state of Oregon! I also used the bbox attribute to set the x limits of the plot. bbox contains four elements that define a bounding box using the lower left lon/lat and upper right lon/lat. Since I’m setting the axes aspect ratio equal here, I only define the x limit.

Great! So all we need now is to loop over each shape (state) and plot it! Right? Well this code snippet does just that:

plt.figure()
ax = plt.axes()
ax.set_aspect('equal')
for shape in list(sf.iterShapes()):
   x_lon = np.zeros((len(shape.points),1))
   y_lat = np.zeros((len(shape.points),1))
   for ip in range(len(shape.points)):
       x_lon[ip] = shape.points[ip][0]
       y_lat[ip] = shape.points[ip][1]

   plt.plot(x_lon,y_lat)

plt.xlim(-130,-60)
plt.ylim(23,50)

And we can see some problems with the result:

bad_map

The issue is that in some of the shapes (states), the geometry has multiple closed loops (because of the islands in some states), so simply connecting the lat/lon points creates some weird lines.

But it turns out that the parts attribute of each shape includes information to save us! For a single shape the parts attribute (accessed with shape.parts) contains a list of indeces corresponding to the start of a new closed loop within a shape. So I modified the above code to first check if there are any closed loops (number of parts > 1) and then loop over each part, pulling out the correct index range for each segment of geometry:

plt.figure()
ax = plt.axes() # add the axes
ax.set_aspect('equal')

for shape in list(sf.iterShapes()):
    npoints=len(shape.points) # total points
    nparts = len(shape.parts) # total parts

    if nparts == 1:
       x_lon = np.zeros((len(shape.points),1))
       y_lat = np.zeros((len(shape.points),1))
       for ip in range(len(shape.points)):
           x_lon[ip] = shape.points[ip][0]
           y_lat[ip] = shape.points[ip][1]
       plt.plot(x_lon,y_lat)

    else: # loop over parts of each shape, plot separately
       for ip in range(nparts): # loop over parts, plot separately
           i0=shape.parts[ip]
           if ip < nparts-1:
              i1 = shape.parts[ip+1]-1
          else:
              i1 = npoints

         seg=shape.points[i0:i1+1]
         x_lon = np.zeros((len(seg),1))
         y_lat = np.zeros((len(seg),1))
         for ip in range(len(seg)):
             x_lon[ip] = seg[ip][0]
             y_lat[ip] = seg[ip][1]

         plt.plot(x_lon,y_lat)

plt.xlim(-130,-60)
plt.ylim(23,50)
plt.show()

And we can see those spurious lines are now gone:

good_map

Final Thoughts

Now that I feel pretty good about the information contained in a shapefile and how it’s stored, I’ll be moving on to more exciting visualizations. It’s important to note, that there are many Python libraries that can plot shapefiles without manually pulling out the points as I’ve done here. But I feel much better about using those fancier approaches now that I’ve gone through this exercise.

Also, in this post I’ve only touched on the geometry information in a shapefile. But it’s really the records included in the .dbf files that will make this an interesting visualization. The records contain measurements, observations or descriptions for each shape and that information can be used to color or fill each shape to create visualizations like this one (not my work).

Useful links: pyshp documentation, Plot shapefile with matplotlib (Stack Exchange)