Pandas Dataframes to Fortran Formatting

Ahhhh, Fortran.

The past few weeks I’ve had the pleasure (I think that’s the right word?) of dusting off my olde Fortran skills for several unrelated projects, one of which requires exporting of a pandas dataframe from python to a plain text CSV readable by some legacy fortran code. We wanted to speed up and simplify that code, and in the end I figured out a neat way to do it. The solutions out on the web (e.g., this SO post) generally follow the approach of iterating over every row in the dataframe, writing out each row with the appropriate formatting. While the pandas.DataFrame.to_csv() function has an option to format columns, it applies that formatting to every column and we need individual control.

So here’s a notebook (available over at github as well) to help out anyone out using python and fortran in tandem! It makes use of the lovely fortranformat package so that you can use your beloved fortran format strings exactly as you do in fortran!

A primer on map projections in Python

Ok, so after my post a little while ago about converting contours to shapefiles, I spent a day adapting the approach for use in a in a dash-plotly app… and in general it works, and took a ~800 MB file down to a ~14 MB shapefile. But the rendering of the shapefile was still very slooooooow, and due to limitations of how data callbacks update a figure’s data in Dash, I couldn’t just render the background map once. So, after some thinking, I realized that if I “simply” calculated my own map projections, I could use the standard x-y plotting routines (which includes contours, as well as the faster scattergl method for plotting a large number of points).

But then I realized that I only vaguely know what map projections are: I knew different projections vary in how they portray the 3D earth on a 2D surface, but when you start using all the mapping libraries out there, you run into a lot of jargon really fast. So here’s a super basic primer on calculation map projections in Python that explains some of that jargon!

Also, this is the first post I decided to write as a Python notebook. The notebook is in my github learning_shapefiles repo, but github has been having some issues displaying notebooks lately, so you can view the full notebook here.

PyProj

To start off, pyproj (link), is a python library based on the command-line library https://proj.org/ that provides a large number of algorithms related to map projections. So go install that (simple to do with pip or conda).

In order to use pyproj to project a lat/lon point onto a map projection, it’s helpful to know a few acronyms:

  • CRS: coordinate reference system
  • EPSG: a database of coordinate reference systems and transformations, different projections covering different regional or local (or global) domains are specified by the number following EPSG. For example, EPSG:4326 is the global reference ellipse used by modern day GPS.
  • WGS84 : World Geodetic System, latest revision, a.k.a. WGS1984, same as EPSG:4326
  • reference ellipse: an ellipse describing the surface of the Earth on which positions (like latitude and longitude) are defined. GPS points use a reference ellipse that approximates the Earth’s geoid (i.e., the gravitational equipotential surface that sea-level would follow if due to gravity alone, see wiki).

So now that’s out of the way…. to project a single lat/lon point with Pyproj, first import pyproj then initialize a projection (Proj) object:

from pyproj import Proj
p=Proj(proj='hammer',ellps='WGS84')

When initializing the Proj (projection) object, you give it the reference ellipse that your lat/lon is defined on, ellps=‘WGS84’, and the projection you want to transform to, in this case a hammer projection. Once you’ve initialized Proj, you can use it to move from lat/lon to cartesian x,y:

lon=-120.5 
lat=42.4
x,y=p(lon,lat)

This lon/lat point become (x,y)=(-9894366.0792,5203184.81636). These values in meters by default and are the cartesian x-y distance from the map center for your point.

The Proj object is flexible enough to take lists (and numpy arrays!), so you can project many points at once. It can also take any of the parameters defined for the PROJ projections as keyword arguments. For example, the Hammer projection, has an argument ‘+lon_0’ to set the 0-longitude for the view, and so in pyProj you can use ‘lon_0’ as a keyword argument when using the hammer projection:

p = Proj(proj='hammer',ellps='WGS84',lon_0=-50)

The rest of the notebook goes through how you could construct a grid of lat/lon lines and plot them for different projections. And at the end, I take a digital elevation model of the western US (which gives an elevation for lat/lon points), project the lat/lon points onto a hammer projection  to get points of (elevation,x,y) and contour it using the standard matplotlib.pyplot.contourf to get the following:

In my particular application, I’ll now be able to project my data to x,y and then use the cartesian contour functions of dash-plotly! The main drawback of this approach is having to write all the routines for handling rotations and plotting lines on maps… but ultimately I think it will work well with Dash, and it will be nice to be able use scattergl with my map point data.

 

Quick tutorial on Geocoding with Python

I recently found myself needing to get the latitude/longitude of a list of cities (for this map here) and it turns out, it’s pretty easy now that I know how to do it. Here’s a quick tutorial!

Ok, so the process of taking a city and assigning a latitude/longitude point is called geocoding. There are many services that offer this (e.g., Google or Bing Maps APIs) but most I looked at seemed overkill for a one-time task of assigning lat/lon to about 500 cities. But then I discovered OpenStreetMap’s Nominatim!  You can modify the http address to return results in an xml file. For example,  the following searches for Providence, RI:

https://nominatim.openstreetmap.org/search.php?q=Providence+RI+USA&format=xml

 

And returns:

<searchresults timestamp="Thu, 02 Feb 17 16:17:00 +0000" attribution="Data © OpenStreetMap contributors, ODbL 1.0. http://www.openstreetmap.org/copyright" querystring="Providence RI USA" polygon="false" exclude_place_ids="158799064,159481664" more_url="https://nominatim.openstreetmap.org/search.php?format=xml&exclude_place_ids=158799064,159481664&accept-language=en-US,en;q=0.8&q=Providence+RI+USA">
<place place_id="158799064" osm_type="relation" osm_id="191210" place_rank="16" boundingbox="41.772414,41.861571,-71.4726669,-71.3736134" lat="41.8239891" lon="-71.4128342" display_name="Providence, Providence County, Rhode Island, United States of America" class="place" type="city" importance="0.80724054252736" icon="https://nominatim.openstreetmap.org/images/mapicons/poi_place_city.p.20.png"/>
<place place_id="159481664" osm_type="relation" osm_id="1840541" place_rank="12" boundingbox="41.7232498,42.0188529,-71.7992521,-71.3177699" lat="41.8677428" lon="-71.5814833" display_name="Providence County, Rhode Island, United States of America" class="boundary" type="administrative" importance="0.58173948152676" icon="https://nominatim.openstreetmap.org/images/mapicons/poi_boundary_administrative.p.20.png"/>
</searchresults>

If you scroll to the right you’ll see:

lat="41.8239891" lon="-71.4128342"

It’s pretty easy to write a python script to request then parse the xml result for lat and lon.  Here’s what that might look like (BUT DON’T DO THIS):

import urllib2

city='Providence, RI'
city_search=city.replace(' ','').split(',') # removes whitespace, splits city/state

# build the http address:
# (results in a string: 'https://nominatim.openstreetmap.org/search.php?q=Providence+RI+USA&format=xml')
osm='https://nominatim.openstreetmap.org/search.php?q=' 
fmt='+USA&polygon=1&format=xml'
srch = osm + city_search[0] + '+' + city_search[1] + fmt

# now use urllib2 to open the url and store the result:
response = urllib2.urlopen(srch)
the_page = response.read().split()

# and now we can parse the resulting string array where the xml info is stored.
# the it only stores the first Lon/Lat that it encounters

Lon = 0.0
Lat = 0.0
for iel in range(0,len(the_page)): # loop over the strings in the_page, look for Lat/Lon
    if 'lon=' in the_page[iel] and Lon == 0.0:
        Lon=float(the_page[iel].split("'")[1])
    if 'lat=' in the_page[iel] and Lat == 0.0:
        Lat=float(the_page[iel].split("'")[1])

So. Why not just loop over your list of cities and repeat this exercise? Well if you check out Nomanatim’s documentation page, and take a look at the usage policy, it requires: “(1) No heavy uses (an absolute maximum of 1 request per second). (2) Provide a valid HTTP Referer or User-Agent identifying the application (stock User-Agents as set by http libraries will not do). (3) Clearly display attribution as suitable for your medium. (4) Data is provided under the ODbL license which requires to share alike (although small extractions are likely to be covered by fair usage / fair dealing).” While I don’t think that my case of simply geocoding 500 or so cities falls under heavy usage and I could just delay my successive calls, I decided to look into their suggestions for other options.

In the end I settled on MapQuest’s implementation of Nominatim. It provides access to all the OpenStreetMaps data (still open source and subject to the OSM license agreements) and a MapQuest free developer account gets you 15,000 request/month for free. Waaay more than I’d need for this project.

So to geocode a list of cities, first sign up for a MapQuest Developer Account. You’ll get an API key assigned to you. Unlike some other API’s, MapQuest doesn’t use any fancy authentication. You basically just make a request for the URL with the API in the http address. Reaaaaally easy (but not exactly secure).

Then you can run a code very similar to that above. My implementation is here: look_up_latlons.py. But it’s kind of tied to the data that I was mapping.

Some notes on the code.

(1)  the API key is passed in through a command line argument, so when you run this code you have to type

$ python look_up_latlons.py AL1243KSFD242332552134KLJ

where that long string of letters/numbers is whatever your API key is.

(2) And then the formatting of the http address is slightly different from the standard Nominatim api. The same search for Providence RI  looks like:

 http://open.mapquestapi.com/nominatim/v1/search.php?key=API_KEY&format=xml&q=Providence+RI

where API_KEY is, again, your API key.

(3) In my implementation, I have imported a CSV file as a pandas dataframe (called Counts). Each row contains a city name along with the number of people who marched in the Women’s Marches on Jan. 21. The meat of the code is copied below, in which I iterate over the rows in the dataframe (named Counts here), find the lat/lon for each row (i.e., each city) and then store that lat/lon in a new dataframe (NewCounts) because it’s bad to modify an existing dataframe while iterating over it. Here’s what that looks like:

 osm='http://open.mapquestapi.com/nominatim/v1/search.php?key='+API_KEY+'&format=xml&q='

 # loop over cities in crowd counts, find Lat/Lon
 NewCounts=Counts.copy()
 NewCounts['lon']=np.zeros(len(Counts)) # add new column for lon
 NewCounts['lat']=np.zeros(len(Counts)) # add new column for lat
 for index, row in Counts.iterrows():

     srch=osm+str(row['City']).replace(' ','+')

     print '\n\nLooking up lat/lon for',row['City'],index
     time.sleep(dt) 
     response = urllib2.urlopen(srch)
     the_page = response.read().split()
 
     for iel in range(0,len(the_page)):
         if 'lon=' in the_page[iel] and NewCounts['lon'][index]==0.0:
            NewCounts['lon'][index]=float(the_page[iel].split("'")[1])
         if 'lat=' in the_page[iel] and NewCounts['lat'][index]==0.0:
            NewCounts['lat'][index]=float(the_page[iel].split("'")[1])

      print row['City'],NewCounts['lon'][index],NewCounts['lat'][index]

The MapQuest API didn’t have any specific usage constraints for how frequently you make a request, just overall number in a month, but I added a small delay between calls using the time.sleep() function anyway.

That’s all for now, hopefully some more posts with colorful plots coming soon!

 

Taxi Tracers: Speed and Hysteresis of NYC Cabs

In my previous post, I outlined how I took the NYC yellow cab trip record data and made some pretty maps of Manhattan. Given how many variables are tracked, there are many possible avenues of further exploration.  It turns out that the average speed of taxis holds some interesting trends and in this post I’ll describe some of these trends (code available at my GitHub page).

I started off with plotting average speed against time of day, hoping to see some trends related to rush hour. I took three months of yellow cab data (January, February and March 2016) and calculated the average speed of every ride using the pick up time, drop off time and distance traveled. I then calculated a 2D histogram of average speeds vs time of day:

speedhist2d

I also calculated the median speed (solid curve) and the first/third quartiles (dashed curve) at each time of day. As expected, rush hour is evident in the rapid drop in speed from 6-10.  I didn’t expect such a large peak in the early hours (around 5) and I also found it interesting that after the work day, speeds only gradually increase back to the maximum.

Next, I took the 2-D histogram, defined three representative time ranges and calculated 1-D histograms of the speed for those time ranges (left plot, below):

1dhistograms

These distributions are highly skewed, to varying degrees. The 5-6 time range (blue curve) sees a much wider distribution of speeds compared to the work day (9-18, green curve) and evening (20-24, red curve). Because of the skewness, I used the median rather than the mean in the 2-D histogram plot. Similarly, the standard deviation is not a good representation for skewed data, and so I plotted the first and third quartiles (the dashed lines in the 2-D plot).

The difference between the 1st and 3rd quartile (the interquartile range, or midspread), is a good measure of how the distribution of average cab speeds changes (plotted above, right, if the data were normally distributed, this plot would just be a plot of the standard deviation). Not only are taxis going faster on average around 5-6 am, but there’s a much wider variation in taxi speeds. Traffic speeds during the work day are much more uniform.

After playing with these plots for a bit, I got to thinking about how the speed of a taxi is only indirectly related to time of day! A taxi’s speed should depend on how much traffic is on the road (as well as the speed limit I suppose…), the time of day only reflects trends of when humans decide to get on the road. So if I instead plot the average speed vs the number of hired cabs, I get quite an interesting result:

speed_hyst_3_mo_a

Generally, the average speed decreases with increased number of cabs (increases with decreased cab count), but the changes are by no means linear. It’s easiest to read this plot by picking a starting time: let’s start at 6am, up at the top. At 6am, the average speed with about 1000 hired cabs on the road is about 23 mph. From 6am to 9am, the number of cabs increases, causing traffic to slow to about 12 mph. From 9am to 7pm, things get complicated (see the zoomed inset below), with the number of cabs and average speed bouncing around through the workday until 7pm, when the cab count drops and average speed rises in response. Eventually, the loop closes on itself, forming a sort of representative daily traffic cycle for Manhattan.

speed_hyst_3_mo_zoom

When I first looked at this, it immediately reminded me of hysteresis loops from all my thermodynamics classes. And so I started googling “Traffic Hysteresis” and it turns out there’s a whole field of study devoted to understanding hysteresis loops in traffic patterns in order to better design systems of roadways that can accommodate fluctuations in traffic volume (e.g., Zhang 1999, Geroliminis and Sun 2011Saberi and Mahmassani 2012 ). One of important differences between those studies and the brief exploration I’ve done here is that the above studies have counts of total traffic. The calculations here only consider the taxis floating in the larger stream of traffic. While these taxi tracers are still a good measure of average flow of traffic, the number of hired cabs at a given time is only a proxy for total traffic. A lot of the complicated behavior during the workday (9am-7pm) likely arises from  changes in traffic volume not accounted for like delivery trucks, commuters or buses. So in some ways it’s remarkable that I got a hysteresis loop at all!

The Code

I described most of the methodology for reading and manipulating the NYC yellow cab data in my previous post. The two main additions to the code are a new class in taxi_plotmod.py (binned_variable) and an algorithm for finding the number of hired taxis (find_N_unique_vs_t of taxi_plotmod.py). The binned_variable class returns a number of useful values including median, first quartile, third quartile and binned values. The method find_N_unique_vs_t loops over a given time range and counts the number of taxis that are hired during that time.

A few useful tips learned during the process:

PyTip! Calculating the first and third quartiles. This was quite simple using the numpy’s where method. The first quartile is defined as the median of all values lying below the median, so for an array of values (var2bin), the first quartile at a given time of day (designated by the i_bin index) can be calculated with:

firstquart[i_bin]=np.median(var2bin[np.where(var2bin<self.med[i_bin])]) 

The where command finds all the indeces with values less than the median at a given time of day, so var2bin[np.where(var2bin<self.med[i_bin])] selects all values that are less than the median (self.med[i_bin]) from which a new median is calculated.

PyTip! Converting to datetime from numpy’s datetime64. So at one point, I found myself needing to know the number of days between two dates. I had been importing the date of each taxi ride as a numpy datetime64 type to make selecting given date ranges with conditional statements easier. Both datetime and numpy’s datetime64 allow you to simply subtract two dates. The following two commands will give the number of days between two dates (assuming numpy has been imported as np and datetime as dt):

a=dt.datetime(2016,4,3)-dt.datetime(2015,3,1)
b=np.datetime64('2016-04-03')-np.datetime64('2015-03-01')

Both methods will give 399 days as the result, but I needed an integer value. The datetime64 method returned a weird I-don’t-know-what type of class that I couldn’t figure out how to convert to an integer value easily.  The easiest solution I found was to use the .astype method available with the datetime64 class:

date_end=np.datetime64('2016-04-03')
date_start=np.datetime('2015-03-01')
Ndates=date_end.astype(dt.datetime)-date_start.astype(dt.datetime)

Sp what’s happening here? Well date_end and date_start are both numpy datetime64 variables. date_end.astype(dt.datetime) takes the end date and casts it as a normal datetime value. So I do that for both the start and end date, take the difference and I’m left with Ndates as a datetime timedelta, which you can conveniently get an integer date range using

Ndates.days

So that’s that!

Stay tuned for either more taxi analysis or some fun with shapefiles!