# Great Circle Paths

Been quite a while since any updates, but here’s a short one!

As a part of a contract I’m working on, I found myself having to plot the major arc of great circle paths on a map. But if you google “how to plot great circle path in [insert python library here]” all the solutions are for plotting minor arcs. Turns out in the end, there’s a really simple trick to plotting the major arc (and I felt pretty dumb when I realized it after wasting a ton of time), but I figured I’d write it up here in case it saves anyone else a bit of time. The short answer: given two points that you want the major arc for, just add the antipodes for each point to the list of points in your path.

First off, in case you need a review of great circles, here’s a globe: Given two points on the surface of a sphere, there is a single circle that that contains both points (unless you’re at a pole, in which case there are infinite great circles). The short way round is the minor arc (red curve), the long way round is the major arc (green curve). And I needed to plot both of them.

The reason I got into plotting this in the first place is that in seismology, surface waves are described by major and minor arcs. When an earthquake generates seismic waves and is measured at a seismometer somewhere else, the raypath between the epicenter and seismic station falls on a great circle path. And surface waves are referred to in terms of the minor and major arcs: the R1 wave travels the minor arc and the R2 travels the major arc. These waves will actually keep going around the earth’s surface before dissipating: R3 is the R1 after it goes around again, R4 is the R2 after it goes around again, and on and on. So I needed to be able to plot all these.

Ok, so back to plotting…

I was using the plotly library in Python for this plot, so I’ll stick with that for examples here, but there should be similar functions in whatever mapping library you’re using. The full script is on my github page here.

So the important bit is just defining the list of latitude and longitude points. Here, the minor arc points are put into a dictionary:

```paths={}
paths['minor_arc']={'lon':[ start_lon, end_lon ],
'lat':[start_lat,end_lat], 'clr':'red','dash':None}
```

When we give this to plotly, we’ll tell it to connect the two points, which will give us the shortest path between the two, the minor arc.

To plot the major arc, we just need to add some points between the start and end so that it takes the long way around. But how to choose the points? Well, turns out that there are tons of confusing pages out there on the trig used for calculating great circle paths, and I almost started to code up some of it… until I realized only needed a couple points. And the antipodes (the point that is exactly opposite a given point on the surface) are both real easy to calculate and  guaranteed to lie on the great circle path. Just add 180 to the longitude and flip the sign of the latitude:

```ant1lon=start_lon+180
if ant1lon>360:
ant1lon= ant1lon - 360
ant1lat=-start_lat
```

Same for the antipode of the second, end point. The >360 bit is just to make sure longitude remains between 0 and 360 degrees.

And now, we can put the antipodes in a list for the major arc:

```paths['major_arc']={'lon':[ start_lon,ant2lon,ant1lon, end_lon ],
'lat':[start_lat,ant2lat,ant1lat,end_lat],
'clr':'green','dash':None}
```

In the script, these paths are then added as a data dictionary used in creating the plotly figure:

```DataDict=list()
for path in ['minor_arc','major_arc']:
dict(
type = 'scattergeo',
lon = paths[path]['lon'],
lat = paths[path]['lat'],
name= path,
mode = 'lines',
line = dict(
width = 2.5,
color = paths[path]['clr'],
dash=paths[path]['dash'],
),
opacity = 1.0,
)
)

figdata={}
```

The full script has a bit more where it actually plots the data (to give the image above), but plotly has some really nice tutorials for that already so I won’t bother explaining all that.

So that’s that! Hope it saves someone else some time.

# Mapping some things!

Since publishing my series of posts on manipulating shapefiles in Python (1, 2 and 3), I’ve been exploring different open data catalogs so I thought I’d share some of the maps I’ve mapped! All these were produced using scripts in my learning_shapefiles repository on GitHub (see denver_stack.py and denver_tree_canopy.py).

Downtown Denver MapZen is a pretty sweet service! You can draw a regional box then the folks at MapZen will take that box and extract all the OpenStreetMap data within the box then give you the shapefiles with all that info. First thing I (stupidly) did after downloading my data extraction from MapZen was to just plot all of the shapes… and I then proceeded to sit around for quite some time while my laptop chugged away and produced a highly detailed map of all the things in Denver County. I zoomed in to downtown Denver for the above image to show off the detail.

Metro-Denver A more abstract representation of the primary roadways in metro Denver. I figured out how the MapZen/OpenStreetMap shapefile was organized and only plotted motorways, primary and secondary roads (bright white in the map). I also created a grid containing the shortest distance to a roadway (using  the distance method available for shapely.geometry.LineString) and contoured the inverse distance (1/x) to evoke the topographic contours along rivers.

Tree Cover in Denver County Denver’s Open Data Catalog has a bunch of databases with shapefiles galore. I downloaded one for the Tree Canopy and then plotted up the percent tree cover in each geometry. This is what actually lead me to learn how to plot roadways… and here I overlaid the tree cover on the same MapZen extraction of OpenStreeMap data. Along with the roadways underneath, it forms a sort of abstract tree.

So that’s it for today. Three maps from a couple different open data sources using some inefficient Python. Not going to go into detail on the code this time, because, well, it’s slooow. I’m using modifications of the simple scripts that I used in my shapefile tutorials because I was excited to just plot things! But there are much better ways to handle shapefiles with hundreds of thousands of geometries. So until next time (after I figure out how to use fiona or geopandas), just enjoy the visuals!

# Shapely Polygons: Coloring Shapefile Polygons

In my previous two posts, I showed how to (1) read and plot shapefile geometries using the pyshp library and (2) plot polygons using shapely and descartes. So the obvious next step is to combine the two! And that’s what I’ll cover today, again using my learning_shapefiles github repo along with the shapefile of state boundaries from census.gov.

The Final Map

In case you don’t care about the Python and are just curious about the end product, here’s the final map where the color of each state reflects its total land area: It’s kind of neat to see the gradient of state size from east to west, reflecting the historical expansion of the U.S. westward, but other than that, there’s not much to the map. But it does serve as a simple case for learning to manipulate shapefiles.

The Code

There are two scripts in learning_shapefiles/src of relevance for today’s post: basic_readshp_plotpoly.py and read_shp_and_rcrd.py. The first script is a simple combination of basic_read_plot.py and simple_polygons.py (from my previous two posts), plotting the shapefile geometries using polygons instead of lines, so let’s start there.

The code starts out the same as basic_read_plot.py, but now also imports Polygon and PolygonPatch from shapely and descartes, before reading in the shapefile:

```import shapefile
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import Polygon
from descartes.patch import PolygonPatch

"""
IMPORT THE SHAPEFILE
"""
shp_file_base='cb_2015_us_state_20m'
dat_dir='../shapefiles/'+shp_file_base +'/'

The next part of the code plots a single geometry from the shapefile. This is super easy because shapefile.Reader reads a shapefile geometry as a list of points, which is exactly what the Polygon function needs. So we can just give that list of points directly to the Polygon function:

```plt.figure()
ax = plt.axes()
ax.set_aspect('equal')

shape_ex = sf.shape(5) # could break if selected shape has multiple polygons.

# build the polygon from exterior points
polygon = Polygon(shape_ex.points)
patch = PolygonPatch(polygon, facecolor=[0,0,0.5], edgecolor=[0,0,0], alpha=0.7, zorder=2)

# use bbox (bounding box) to set plot limits
plt.xlim(shape_ex.bbox,shape_ex.bbox)
plt.ylim(shape_ex.bbox,shape_ex.bbox)```

And we get Washington, now as a colored polygon rather than an outline: Woo!

And as before, we can now loop over each shape (and each part of each shape), construct a polygon and plot it:

```""" PLOTS ALL SHAPES AND PARTS """
plt.figure()
ax = plt.axes() # add the axes
ax.set_aspect('equal')

icolor = 1
for shape in list(sf.iterShapes()):

# define polygon fill color (facecolor) RGB values:
R = (float(icolor)-1.0)/52.0
G = 0
B = 0

# check number of parts (could use MultiPolygon class of shapely?)
nparts = len(shape.parts) # total parts
if nparts == 1:
polygon = Polygon(shape.points)
patch = PolygonPatch(polygon, facecolor=[R,G,B], alpha=1.0, zorder=2)

else: # loop over parts of each shape, plot separately
for ip in range(nparts): # loop over parts, plot separately
i0=shape.parts[ip]
if ip < nparts-1:
i1 = shape.parts[ip+1]-1
else:
i1 = len(shape.points)

polygon = Polygon(shape.points[i0:i1+1])
patch = PolygonPatch(polygon, facecolor=[R,G,B], alpha=1.0, zorder=2)

icolor = icolor + 1

plt.xlim(-130,-60)
plt.ylim(23,50)
plt.show()```

In order to distinguish each polygon, I set each shape’s color based on how many shapes have already been plotted:

`R = (float(icolor)-1.0)/52.0`

This grades the red scale in an RGB tuple between 0 and 1 (since there are 52 shapes), and it is then used in the facecolor argument of PolygonPatch. The coloring is simply a function of the order in which the shapes are accessed: The goal, however, is to color each polygon by some sort of data so that we can actually learn something interesting, and that is exactly what read_shp_and_rcrd.py does.

Up to now, we’ve only considered the shape geometry, but that is only one part of a shapefile. Also included in most shapefiles are the records, or the data, associated with each shape. When a shapefile is imported,

```shp_file_base='cb_2015_us_state_20m'
dat_dir='../shapefiles/'+shp_file_base +'/'

The resulting shapefile object (sf in this case) contains records associated with each shape. I wasn’t sure what fields were included for the State Boundary shapefile from census.gov, so I opened up a Python shell in terminal, read in the shapefile then typed

`>>> sf.fields`

to get a list of available fields:

`[('DeletionFlag', 'C', 1, 0), ['STATEFP', 'C', 2, 0], ['STATENS', 'C', 8, 0], ['AFFGEOID', 'C', 11, 0], ['GEOID', 'C', 2, 0], ['STUSPS', 'C', 2, 0], ['NAME', 'C', 100, 0], ['LSAD', 'C', 2, 0], ['ALAND', 'N', 14, 0], ['AWATER', 'N', 14, 0]]`

Down towards the end, there’s an interesting entry

`['ALAND', 'N', 14, 0]`

Though I couldn’t find any documentation on the included fields, I suspected ALAND stood for land area (especially since it was followed by AWATER). So in read_shp_and_rcrd.py, the first thing I do is extract the field names and find the index corresponding the the land area:

```""" Find max/min of record of interest (for scaling the facecolor)"""

# get list of field names, pull out appropriate index
# fieldnames of interest: ALAND, AWATER are land and water area, respectively
fld = sf.fields[1:]
field_names = [field for field in fld]
fld_name='ALAND'
fld_ndx=field_names.index(fld_name)```

I found this post helpful for extracting the fieldnames of each record.

Next, I loop over the records using the interRecords() object to find the minimum and maximum land area in order to scale the polygon colors:

```# loop over records, track global min/max
maxrec=-9999
minrec=1e21
for rec in sf.iterRecords():
if rec != 'AK': # exclude alaska so the scale isn't skewed
maxrec=np.max((maxrec,rec[fld_ndx]))
minrec=np.min((minrec,rec[fld_ndx]))

maxrec=maxrec/1.0 # upper saturation limit

print fld_name,'min:',minrec,'max:',maxrec```

I excluded Alaska (if rec != ‘AK’:) so that the color scale wouldn’t be thrown off, and then I also scale the maximum (maxrec=maxrec/1.0) to adjust the color scale manually (more on this later).

Now that I know the max/min, I loop over each shape and (1) calculate the RGB value for each polygon using a linear scale between the max and min and then (2) plot a polygon for each shape (and all the parts of a shape) using that RGB value:

```for shapeRec in sf.iterShapeRecords():
# pull out shape geometry and records
shape=shapeRec.shape
rec = shapeRec.record

# select polygon facecolor RGB vals based on record value
if rec != 'AK':
R = 1
G = (rec[fld_ndx]-minrec)/(maxrec-minrec)
G = G * (G<=1) + 1.0 * (G>1.0)
B = 0
else:
R = 0
B = 0
G = 0

# check number of parts (could use MultiPolygon class of shapely?)
nparts = len(shape.parts) # total parts
if nparts == 1:
polygon = Polygon(shape.points)
patch = PolygonPatch(polygon, facecolor=[R,G,B], edgecolor=[0,0,0], alpha=1.0, zorder=2)
else: # loop over parts of each shape, plot separately
for ip in range(nparts): # loop over parts, plot separately
i0=shape.parts[ip]
if ip < nparts-1:
i1 = shape.parts[ip+1]-1
else:
i1 = len(shape.points)

# build the polygon and add it to plot
polygon = Polygon(shape.points[i0:i1+1])
patch = PolygonPatch(polygon, facecolor=[R,G,B], alpha=1.0, zorder=2)

plt.xlim(-130,-60)
plt.ylim(23,50)
plt.show()

```

One import thing not to miss is that on the first line, I loop over the iterShapeRecords iterable rather than using iterShapes. This is neccesary so that I have access to both shape geometry and the associated records, rather than just the shapes (iterShapes) or just the records (iterRecords).

Running the above code will produce the following map: Because Texas is so much larger than the rest of the states, we don’t see much of a difference between the states. But we can adjust this by decreasing the max value using in the scaling. So after finding the max/min value, I set

`maxrec=maxrec/2.0 # upper saturation limit`

and end up with the following map that brings out more of the variation in the states’ land area (same map as in the very beginning of this post): Note that because I’m decreased the maxvalue for scaling, I had to ensure that the RGB value did not exceed 1, which is why I had the following lines limiting the green value (G):

```    if rec != 'AK':
R = 1
G = (rec[fld_ndx]-minrec)/(maxrec-minrec)
G = G * (G<=1) + 1.0 * (G>1.0)
```

So that’s about it! That’s how you can read in a shapefile and plot polygons of each shape colored by some data (record) associated with each shape. There are plenty of more sophisticated ways to do this exercise, and I’ll be looking into some other shapefile Python libraries for upcoming posts.

# Taxi Tracers: Mapping NYC

The city of New York has been recording and releasing trip data for every single taxi ride in the city since 2009 (TLC Trip Data, NYC) and these publicly available datasets are treasure trove of information. Every single time a NYC taxi picks up a passenger, 19 different pieces of information are recorded, including gps coordinates of pickup and dropoff, total fare and distance traveled. Think about how many cabs you see while walking the streets of New York and then imagine how many times in a day just one of those cabs picks up a passenger… yeah, that’s a lot of data. A single data file from  TLC Trip Data, NYC contains data for one month between 2009-2016 and weighs in at a  MASSIVE 2 GB.

There have already been plenty of cool visualizations and analyses of this data set, Todd  W. Schneider processed the ENTIRE data set from 2009-2015, amounting to over a billion taxi trips (link) and found lots of interesting trends. His comparison of yellow and green taxis to Uber rides in individual boroughs (link) is particularly interesting. Chris Wong created a visualization that follows a randomly selected cab through a day of driving (link) and Juan Francisco mapped over 10,000 cabs in a 24 hour period to create a mesmerizing video (link). Both used the start and end points from the TLC Trip data and created a best guess route via the Google Maps API. It’s pretty fun to watch the taxis jump around the city, especially if you happen to have just read Rawi Hage’s novel Carnival, in which the main character, a cab driver, describes one type of cabbie as “the flies”, opportunistic wanderers floating the streets waiting for passengers and getting stuck in their own web of routine.

In my own exploration of the data, I’ve been following two main threads. The first, which I’ll describe today, is map-making using gps coordinates of the taxis. The second, which I’ll leave to a future post, involves a more statistical analysis of the data focused on understanding average taxi speed. I’ll start by describing the final product, then dig into the details of my coding approach (python source code is available on my github page here).

## Taxi Maps (the final products)!

Just look at the map, it’s beautiful: This map was made by creating a rectangular grid spanning New York City and counting how many pickups occurred within each grid bin (more details on this below) within three months. The color scale is the base-10 log of the number of pickups in each bin (yellow/white are a lot of pickups, red are a lower number and black are no pickups). The horizontal direction (longitudinal) is divided into 600 bins, while the vertical (latitudinal) is divided into 800 bins, a grid fine enough to resolve individual streets.

The level of detail visible in this map is impressive. My eye was initially drawn to the odd half-circle patterns in the East Village, which it turns out is Stuyvesant Village. It’s also odd that the bridges to Brooklyn and Queens show up at all. I’m pretty sure it’s impossible for cabs to pick up anyone on one of these bridges. Perhaps the presence of the bridges arises from uncertainty in the GPS coordinates or cabbies starting their trip meter after getting on the bridge? Hard to say. But it’s kinda neat! The other very clear observation is the drop in cab pickups in Manhattan around the north end of Central Park. The NYC Taxi and Limo Commission observed this same feature and calculated that 95% of pickups in Manhattan occurred south of this boundary, which led to the creation of the green Boro taxis in order to provide more equitable access to taxis (more info here and here).

In addition to simply mapping taxi pickup location, some interesting trends are visible by changing the color scale to reflect different variables. The following map takes the same pickup locations, but colors each grid point by the average distance traveled away from the pickup location: The map is fairly uniform in color, except for around Wall Street and the southern tip of Manhattan, where the more yellow coloration shows a generally higher trip distance. It’s hard to pick off numbers from the color scale, so I took the map and created a modified zonal average; at each latitude, I averaged across the longitudinal grid points, only using grid points with non-zero values, and then plotted the result (above, right). The line plot shows pretty clearly that cabs picking up passengers around Wall Street tend to go about 2 miles farther than trips originating anywhere else. Why is this? Hard to say without a lot more analysis, but 4 to 4.5 miles happens to be about the distance up to Penn Station or Grand Central Station, so I suspect commuters who don’t mind spending extra money to avoid the subway on their way to the train.

In the following I’ll dive into the horrid detail of how I made the above maps, read on if you dare! Future posts will look into some more complicated analysis of the yellow cab data, beyond simple visualization, so stay tuned for that…

## Data Description

The datafiles at TLC Trip Data, NYC are HUGE comma separated value (csv) plain text files, each weighing in at about 2 GB. A single file covers all entries for a single month and contains 19 entries per line. A single line in the csv file looks like:

```1,2015-03-06 08:02:34,2015-03-06 08:14:45,1,3.80,-73.889595031738281,40.747310638427734,1,N,-73.861953735351563,40.768550872802734,2,14,0,0.5,0,0,0.3,14.8
```

The TLC yellow cab data dictionary, describes what each entry is, but the ones I find most interesting are: pickup time (entry 2), dropoff time (entry 3), passenger count (entry 4), trip distance (entry 5), pickup and drop off longitude and latitude coordinates (entries 5, 6, 9 and 10, respectively), total fare (entry 12) and tip (entry 15).  So in most basic terms, all I’ve done is read each line of the file, pull out the entry of interest along with its gps coordinates and then find a way to plot that aggregated data.

## Source Code, Tricks and Tips

The source code (available here) is divided into two python modules: taxi_main.py and taxi_plotmod.py. The first deals with reading in the raw .csv files while the second contains routines for processing and plotting that data. The source includes several a README and several scripts that show examples of using modules, so I won’t go into that here. Rather, I’ll describe my general approach along with several problems I ran into (and their solutions).

Reading in a single csv data file is relatively straightforward. The standard open command (open(filename, ‘r’)) allows you to read in one line at a time. The functions  of interest are in taxi_main.py: read_taxi_file loops over all of the csv files in a specified directory and for each file calls read_all_variables. I did add some fancy indexing in the form of a list of strings that identify which of the 19 variables you want to save (see the Vars_To_Import variable) so that you don’t have to import all the data. Check out the main function in taxi_main.py to see an example of using the commands to read in files.

One  of the design decisions I made early on was to read in the data then store it in memory before processing it. Meaning there is a big array (Vars or VarBig in taxi_main.py) that stores all of the imported data before feeding it to the routines for spatially binning the data (see below). This seriously limits the scalability of the present code:  for the 3 months of data used in the above maps, I was using up 30-45% of my laptop’s memory while reading and processing the data. Including more months of data would use up the remaining RAM pretty quick. The way around this problem would be to write some routines to do that spatial binning or other processing on the fly, so that the whole data set wouldn’t need to be stored. The advantage of holding the data set in memory, however, is that I can easily process and re-process that data multiple times without the bottle neck of reading in the files. So I stuck with the present design rather than rewriting everything…

Spatially binning point data

To create spatial maps and images, I had to go from point data to a gridded data product. This is a common practice in geo-located data sets (data sets that have latitude, longitude and some associated measurement at those coordinates) and though there are many complicated ways to carry out this process, the concept is simple:

Each data point has a latitude, longitude and associated measurement (the color of the dots in the case of the conceptual image above) and the data points are spatially scattered according to their coordinates. To make a gridded data product, I first specified a latitude/longitude grid (dashed lines, left). I then find which points lie within each bin, aggregate the measurements within a bin and then store that new single value in the bin. In this conceptual case, I’ve averaged the color of the data points to define the new gridded data product. In the NYC taxi maps shown above, the color is first determined by simply the total number of measurements within a bin and then the average of all the trip distance measurements within a bin.

Since the reading of the raw data file and subsequent spatial binning can take a few minutes, the code can write out the gridded data products (see the main function in taxi_main.py). These files are only a few MB in size, depending on the resolution of the grid, so it’s much easier to repeatedly load the gridded data products rather than re-processing the dataset when fine tuning the plotting.

pyTip: where. One of the useful python tricks I learned in this is the numpy routine where that identifies the indeces of an array that meet a specified criteria. Say you have a giant array X with values between 0 and 100. If you want to find the mean of X, but only using values less than 50, you could use np.mean(X[np.where(X<50)]).

pyTip: looping efficiency. Less of a python tip and more of a computational tip, but there are several ways to write a routine that spatially bins point data. The first approach that I stupidly employed was to loop over each bin in the spatial grid, find the points that fall within those bins and then aggregate them. If N is the total number of data points, Nx is the number of bins in the x dimension and Ny is the number of bins in the y dimension, then in psuedo-code, this looks like

``` for ix in 1:Nx
for iy in 1:Ny

Current_Grid_Values = All data points within Lat_Lot[ix,iy]
GridValue[iy,ix]=mean(Current_Grid_Values)```

Why is this approach so bad? Well, it involves searching through the entire data set of N values each time through the loop. N is huge (the smallest cases I ran have N > 50,000, for the maps above, N>1000000), so for a fine grid like in the maps above (where Nx = 600, Ny=800), this is incredibly slow.

A much better way to do it is to loop over N, find the corresponding bin and aggregate the measurements within that bin (again, this is psuedo-code, check out the real source code, map_proc within taxi_plotmod.py to see the working python code):

``` for iN in 1:N
Measurement = all_data_points[iN]
Current_Lat_Lon = Lat_Lot[iN]
find ix,iy that contains the Current_Lat_Lon
GridValue[iy,ix]=GridValue[iy,ix] + Measurement
GridCount[iy,ix]=GridCount[iy,ix] + 1```

After the loop, we can find the mean value of measurements within each binned value by normalizing by the total number of measurements within each bin:

` GridValueMean=GridValue / GridCount`

A simple switch, but it scales to large values of N, Nx and Ny much better. The above maps took only a few minutes to process using the second (good) approach, while when I initially tried using the first (stupid) approach, the code was still running when I killed it after 30 minutes.

Plotting the spatially binned data: The plotting makes use of the matplotlib module, specifically the pcolormesh function. I went through a lot of color maps and ended up settling on ‘hot.’ Gives the maps an almost electric feel.  Not much else to say about the plotting, pretty standard stuff. Check out plt_map within taxi_plotmod.py if you want the detail.

Stay tuned for more exploration of the NYC taxi data!