Pandas Dataframes to Fortran Formatting

Ahhhh, Fortran.

The past few weeks I’ve had the pleasure (I think that’s the right word?) of dusting off my olde Fortran skills for several unrelated projects, one of which requires exporting a pandas dataframe from python to a plain-text CSV readable by some legacy fortran code. We wanted to speed up and simplify that code, and in the end I figured out a neat way to do it. The solutions out on the web (e.g., this SO post) generally follow the approach of iterating over every row in the dataframe and writing out each row with the appropriate formatting. While the pandas.DataFrame.to_csv() function has an option to format columns, it applies the same formatting to every column, and we needed individual control over each column’s format.

So here’s a notebook (available over at github as well) to help anyone out there using python and fortran in tandem! It makes use of the lovely fortranformat package so that you can use your beloved fortran format strings exactly as you would in fortran!
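I won’t reproduce the whole notebook here, but the core idea is something like the following sketch (the column names and format strings are made up for illustration, and the notebook’s actual implementation may differ): build a fortranformat writer for each column, format each column as fixed-width strings, then write everything out with a single to_csv call.

import pandas as pd
import fortranformat as ff

# some made-up data standing in for the real dataframe
df = pd.DataFrame({'depth': [10.0, 25.5, 100.0],
                   'vs': [3.2, 3.5, 4.48],
                   'flag': [1, 2, 3]})

# one fortran format string per column (hypothetical formats)
col_formats = {'depth': '(F10.2)', 'vs': '(E12.4)', 'flag': '(I5)'}

formatted = pd.DataFrame()
for col, fmt in col_formats.items():
    writer = ff.FortranRecordWriter(fmt)
    # format every value in this column as a fixed-width string
    formatted[col] = df[col].apply(lambda val: writer.write([val]))

# every cell is now a string, so a plain to_csv preserves the formatting
formatted.to_csv('for_fortran.csv', index=False, header=False)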


3D rendering of velocity anomalies

Lately I’ve been working on using the yt library (https://yt-project.org) for 3D visualization of seismic data sets. Seismologists generally avoid 3D plots in favor of 2D slices through 3D data, as 3D structure is often hard to interpret. Much of that difficulty comes from relying on iso-surfaces to highlight structure; yt is different in that it uses ray-tracing to project through a data set, integrating information as it goes. The result is a 3D rendering that preserves the structure accumulated along each ray.

Here’s an example of a 3D rendering of negative velocity anomalies in the Western U.S. from 50-1200 km deep, using the shear wave data from James et al. 2011, one of the 3D datasets available via IRIS (https://ds.iris.edu/ds/products/emc-nwus11-s/):

The highlighted structures represent regions in the Earth’s mantle that exhibit slower seismic shear wave speeds when compared to a reference model and the faint grid in the middle of the model domain is at 410 km, the upper limit of the mantle transition zone. The white dot on the surface is Yellowstone.

This is a preliminary image, so I won’t dive into interpreting the structure embedded here. For now, just enjoy this mesmerizing GIF:
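For anyone curious about the mechanics, here’s a minimal sketch of a yt volume rendering of a gridded model. The array, field name and bounds below are placeholders (the actual rendering above uses the IRIS EMC netCDF file plus some extra transfer function and camera tweaking):

import numpy as np
import yt

# stand-in for the gridded dVs values read from the EMC netCDF file
dvs = np.random.random((64, 64, 64))
bbox = np.array([[-125., -100.],   # longitude range
                 [30., 50.],       # latitude range
                 [50., 1200.]])    # depth range (km)

# load the array as a uniform-grid "stream" dataset
ds = yt.load_uniform_grid({'dvs': dvs}, dvs.shape, bbox=bbox)

# build a volume rendering scene; yt creates a default transfer function,
# and tuning that transfer function is what highlights specific dVs ranges
sc = yt.create_scene(ds, ('stream', 'dvs'))
sc.camera.zoom(1.5)
sc.save('dvs_rendering.png')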

Shapefiles in Python: converting contours to shapes

I’m in the process of finishing up a web-app that I’ve written using Dash/Plotly, and as someone who is not an expert in front end web development, I’ve really enjoyed how easy it is to build an interactive app without getting bogged down by all the javascript/html/css (though knowing some definitely helps!). But recently I’ve run into an issue while figuring out how to plot various datasets as a background contour on a number of interactive maps: while Dash/Plotly is great at plotting interactive points on a map, there’s no simple way to contour a dataset and use it as the base layer (there is the new scattermapbox class, but it does not yet have a simple contouring method for maps like, for example, contourf in the basemap library).

But Dash/Plotly is very good at choropleth (colored shapes) plots, so I figured that if I could convert a matplotlib.pyplot.contourf plot into a shapefile of polygons, I could then load that shapefile in Dash/Plotly as needed and plot it as a choropleth, which seems easier to me than rendering a bitmap image of the contour plot in the background (the only other option I can think of)… and so I spent the morning experimenting with how to do this. There are a few scattered stackoverflow questions related to this, but I had to pull together a bunch of pieces to get it working, so I figured a short writeup here might be useful to others out there.

So with that said, here are the python libraries that you’ll need to run my test script:

  • shapely (for polygons, used previously here, here and here)
  • fiona (for writing a shapefile)
  • descartes (for making some plots)
  • matplotlib (for the initial contouring of the data)
  • numpy (for making up some initial data and some array manipulation)

The full code is available in my learning_shapefiles repo and the new script is contourf_to_shp.py. And some useful stackoverflow answers that helped significantly: Converting Matplotlib contour objects to Shapely objects, matplotlib – extracting values from contour lines.

Creating data to contour 

To start off, I wanted to work with a small test data set that would include multiple domains at the same contour level, so I created a lat/lon grid and then superimposed two 2d gaussian curves with different amplitudes and decay rates and plotted those up:

from shapely import geometry
import matplotlib.pyplot as plt
from matplotlib import cm
import numpy as np
import fiona
import os,json
from descartes.patch import PolygonPatch

# create some test data with multiple peaks
lon = np.linspace(0,45,100)
lat = np.linspace(-20,32,90)
long,latg=np.meshgrid(lon,lat)
C1=np.sqrt((long-5.)**2+(latg-25)**2)/30.
C2=np.sqrt((long-30.)**2+(latg-1)**2)/10.
m = 30*np.exp(-C1**2)+20.*np.exp(-C2**2)

# make the contourf plot, storing the resulting ContourSet in cs
plt.figure(figsize=[10,5])
plt.subplot(1,2,1)
Nlevels=10
cs = plt.contourf(lon,lat,m,Nlevels,cmap='gist_heat')
plt.title('contourf figure with Nlevels='+str(Nlevels))

In my actual use-case I’ll be loading a 2D dataset, so most of this will be replaced. The important bit is that I store the object returned by the contourf call: that cs object contains all the data on the contours.

Storing the contour results

The first step in saving the contoured data is to create a lookup table for the contour levels. The cs object stores each contour in cs.collections and the levels in cs.levels:

# create lookup table for levels
lvl_lookup = dict(zip(cs.collections, cs.levels))

Next, I loop over each contour of cs.collections, convert the contour’s shape into a shapely polygon and store it in a list of dictionaries that also saves that contour’s level value:

# loop over collections (and polygons in each collection), store in list for fiona
PolyList=[]
for col in cs.collections:
    z=lvl_lookup[col] # the value of this level
    for contour_path in col.get_paths():
        # create the polygon for this level
        for ncp,cp in enumerate(contour_path.to_polygons()):
            lons = cp[:,0]
            lats = cp[:,1]
            new_shape = geometry.Polygon([(i[0], i[1]) for i in zip(lons,lats)])
            if ncp == 0:                
                poly = new_shape # first shape
            else:
                poly = poly.difference(new_shape) # Remove the holes
            
            PolyList.append({'poly':poly,'props':{'z': z}})

The list PolyList now holds each shapely polygon along with the z-value (or level) of that polygon.

Writing the data

In order to use these polygons elsewhere, the easiest thing to do is simply save them within a shapefile. I used the Fiona library for writing, which entails writing the geometry (the polygons) and the properties for each polygon that I stored earlier:

# define ESRI schema, write each polygon to the file
outfi=os.path.join(outname,'shaped_contour.shp')
schema = {'geometry': 'Polygon','properties': {'z': 'float'}}
with fiona.collection(outfi, "w", "ESRI Shapefile", schema) as output:
    for p in PolyList:
        output.write({'properties': p['props'],
            'geometry': geometry.mapping(p['poly'])})

In order to reconstruct the contour plot, I need to save the information about the different levels (to get the colormap correct). And while I could load the shapefile and loop over the shapes to calculate min/max values, it saves time to simply write that data to a different file. And rather than write it as metadata of some kind for the shapefile, I just drop it into a json file:

# save the levels and global min/max as a separate json for convenience
Lvls={'levels':cs.levels.tolist(),'min':m.min(),'max':m.max()}
with open(os.path.join(outname,'levels.json'), 'w') as fp:
    json.dump(Lvls, fp)

Plotting the polygons

From there, the shapefile and levels data can be re-loaded and plotted to get the following:

polygon_conts_N10

On the left is the original contour plot; on the right is the plot of polygon patches colored using the same colormap. And it’s an almost perfect replica… I’m not sure if it’s my eyes or my screen, but the lightest colors seem just a hair different between the two. That difference becomes even less noticeable as I use more contours, e.g. 1000:

polygon_conts_N1000

The final bit of code is mostly self-explanatory: loop over the shapes in the shapefile, then plot colored polygon patches like in previous posts. The only tricky business is making sure the polygon facecolor matches the filled contour color. To do this, I load in the contour level json file that I saved off, along with the matplotlib colormap that I used in the initial call to contourf:

# read in levels, define colormap
with open(os.path.join(outname,'levels.json')) as jfile:
    Lvls=json.load(jfile)
levels=np.array(Lvls['levels'])
cmap=plt.cm.gist_heat
lv_range=[Lvls['min'],Lvls['max']]

And then for each polygon, I pull out the z-value that I saved and find the appropriate RGB value for that level with:

        lv=shp['properties']['z'] # this shape's level
        clr=cmap((lv - lv_range[0])/(lv_range[1]-lv_range[0]))

The bit of math with lv_range is to ensure that the value passed to the colormap is between 0 and 1. The clr variable is a tuple that can be given to the descartes PolygonPatch:

patch = PolygonPatch(poly, facecolor=clr, edgecolor=clr)
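For completeness, here’s roughly how those pieces fit together when re-loading the shapefile (a sketch that reuses outfi, cmap, lv_range, lon and lat from the snippets above; the actual loop in contourf_to_shp.py may differ in the details):

import fiona
from shapely import geometry
from descartes.patch import PolygonPatch

ax = plt.subplot(1,2,2)
with fiona.open(outfi) as shapefile:
    for shp in shapefile:
        lv = shp['properties']['z'] # this shape's level
        clr = cmap((lv - lv_range[0])/(lv_range[1]-lv_range[0]))
        poly = geometry.shape(shp['geometry']) # rebuild the shapely polygon
        ax.add_patch(PolygonPatch(poly, facecolor=clr, edgecolor=clr))

ax.set_xlim(lon.min(),lon.max())
ax.set_ylim(lat.min(),lat.max())
plt.title('polygons from shapefile')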

And that’s pretty much it. I still have some experimenting to do with how many contour levels I need for the datasets I’m working with, and I expect to run into some scaling issues (the full dataset fields I’m working with are 100s of MB), but at least I have a feasible approach!

Full code here.

Shapely Polygons: Expansion!

Today I found myself needing to scale a large number of geometric shapes corresponding to volcanic fields in the Western United States. Shapely (used in previous posts) makes this easy with the shapely.affinity.scale() method (an affine transformation maps lines to lines and preserves parallelism; scale() is a simplified affine transformation with no rotation or skew).
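Usage is about as simple as it gets. Here’s a quick sketch with a made-up circular shape (the real script loads the volcanic field shapes from a file instead):

from shapely.affinity import scale
from shapely.geometry import Point

field = Point(-112.5, 36.0).buffer(0.5)      # a circular stand-in shape
doubled = scale(field, xfact=2.0, yfact=2.0) # scales about the shape's center by default

print(field.bounds)
print(doubled.bounds) # bounding box twice as wide and tall, same center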

So I put together a test script using the scale() method (on my github here) that takes the scaling factor as input. Since I’ll be using this with geographic coordinates and will be testing the sensitivity of measurements to different sized shapes, I added the option to scale the shape by a sequence of distances (angular or linear). The code is fairly straightforward and I don’t have time for a detailed post, so for now, here’s a trippy plot showing a squiggly-circle shape scaled by different amounts:

affine_expand

To generate the above plot (assuming you’ve cloned my learning_shapefiles repo and are in the directory containing affine_expansion.py):

python affine_expansion.py -f ../data/expansion_test_data.csv -exp 0.25,.5,.75,1.25,1.5,1.75,2


Animating Historical Urban Population Growth

In their recent article, “Spatializing 6,000 years of global urbanization from 3700 BC to AD 2000”, Reba et al. describe efforts to create a digital, geocoded dataset tracking the distribution of urban locations since 3700 BC. The digital database provides a record of human movement over the geologically recent past and is useful for understanding the forces that drive human societies towards urbanization. The database seemed like a fun test of my fledgling python skills, and in this post I’ll describe a visualization of Reba et al.’s database. Reba et al. released their data here. My python source is available via GitHub here. As this is my first actual post on the blog here, let me remind you that I’m something of a python noob. So if you actually check out the full source code, I expect you’ll find lots of mistakes and perhaps some poor choices. You’re welcome to let me know of all the ways I could improve the code in the comments section below.

Before jumping into what I actually did, a bit more on the dataset. Reba et al. (2016) created their new digital database from two earlier compilations, Chandler (1987) and Modelski (2000, 2003). The authors of those previous studies meticulously scoured archaeological and historical records in search of locations of urban centers. Reba et al. took those datasets and created a .csv file listing each city’s name, latitude, longitude, population estimates and the times to which those estimates correspond, a process that involved transcribing the original databases manually (ouch!! none of the available automated print-to-digital methods worked) and geocoding each entry to obtain a latitude and longitude. In the end, Reba et al. ended up with three digital .csv datasets: Chandler’s database for urban centers from 2250 BC to 1975 AD, Modelski’s ancient city database covering 3700 BC to 1000 AD and Modelski’s modern 2000 AD database. All told, there are just over 2000 unique cities recorded between the three databases, many of which have multiple population estimates through time.

It’s worth noting that the historical database is by no means complete. As Reba et al. discuss in detail, omissions of urban centers from the original Chandler and Modelski databases, unclear entries in the original databases or errors in transcription would all result in missing cities. South Asian, South American, North American, and African cities are particularly under-represented. As a geologist, I’m used to incomplete records. Interpreting a fossil record, a regional sedimentary sequence or structural juxtaposition often requires some interpolation. A given rock unit may be preserved in one location while it is eroded and lost to knowledge in another. Thus, the completeness of any (pre)historical dataset depends on both preservation and sampling – there could be cities missing because the local climate did not preserve abandoned structures or simply because archaeology is a relatively young pursuit and excavation efforts have traditionally focused on a small fraction of the Earth’s surface. But as Reba et al. state, “These data alone are not accurate global representations of all population values through time. Rather, it highlights the population values of important global cities during important time periods.”

I was pretty impressed by Reba et al.’s work and decided their dataset would provide an interesting context to improve my python. So I set out to write some python code capable of importing the database and making some pretty plots (source code here). Note that I do not distribute Reba et al.’s database with my source; you’ll have to download that separately from their site here. See the README file in the source code for a list of other dependencies required to run the code, which was only tested with python 2.7.

Before digging too deeply into the code, let’s just start with the end product. Here’s an animation of Chandler’s data set from 2250 BC to present day. Each circle is an urban center and the color of circles changes from blue to red as the time approaches present day.

Pretty neat!  We can see clustering and expansion of early urban centers around Mesopotamia, then seemingly separate loci of urban development popping up in East Asia and Central America. Along South America, it’s interesting how urban centers trace out the coastline. And as the animation approaches present day, the number of urban centers explodes (from 1960 to 2014, the percentage of the world’s population living in urban settings increased from 34% to 54%).

In addition to animations, the python source can plot a single frame for a user-specified time range. Here are the entries for Modelski’s Ancient Cities database from 500 BC to 50 AD:

HistoricalCities_Modelski

The Code

The three main steps to producing the above animation were (1) import the data, (2) subsample the data and (3) plot the data. The module urbanmap.py includes all functions needed to reproduce the above figures. And the scripts ex_animate.py and ex_single_frame.py are examples that call the appropriate functions to create the above animation and single frame plot.

In the following, I’ll walk through the different steps and their related functions in urbanmap.py. Again, full source code is here.

(1) Importing the data

The first step is to actually read in some data! The function urbanmap.load_cities does just that for a specified dataset. The directory containing the dataset and the name of the dataset file are given by the data_dir and city_file arguments, respectively:

 39 def load_cities(data_dir,city_file):
 40     """ loads population, lat/lon and time arrays for historical 
 41         city data. """

The function works with any of the three original plain-text, comma-separated value (CSV) files from Reba et al.: chandlerV2.csv, modelskiAncientV2.csv and modelskiModernV2.csv.

Each .csv file has a row for each city, where the columns are the City Name, Alternate City Name, Country, Latitude, Longitude, Certainty and then a series of Population columns. So first, I open up the .csv file and create a csv object using the CSV reader:

 44 #   load city files
 45     flnm=data_dir+city_file
 46     fle = open(flnm, 'r') # open the file for reading
 47     csv_ob=csv.reader(fle,skipinitialspace=True)  

Some of the Alternate City Names have commas within quotes, which causes those entries to split. Adding the second argument (skipinitialspace=True) to csv.reader prevents those commas within quotes from being read as a new comma-separated value.

The remainder of the function reformats the data into arrays that I find more flexible to work with. First, I generate an array called Time, which contains every time for which a population record exists. In the original .csv files, the header of each Population column gives the time to which that column’s population estimates correspond. The header values are strings, such as BC_1400, BC_1390, …, AD_100, AD_110, AD_1000… So the first thing I do is convert these header strings to a 1D numpy array where each element is the time in years before present (ybp).

 57 #   get header line 
 58     header = fle.next()
 59     header = header.rstrip()
 60     header = header.split(',')
 61 
 62 #   build the Time array
 63     nt = len(header)
 64     Time_string=header[6:nt]
 65     nt = len(Time_string)
 66     Time = np.zeros((nt,1))
 67 
 68 #   convert BC/AD to years before present 
 69     for it in range(0,nt):
 70         ct = Time_string[it]
 71         ct = ct.split('_')
 72         Time[it]=bc_ad_ce_to_ybp(ct[1],ct[0])

Why go through all this trouble? Well, let’s say I want all cities with a recorded population for the time range 1400 BC to 50 AD. If I knew the header values exactly, I could find the indices in Time_string corresponding to BC_1400 and AD_50. But the headers aren’t uniform within a single .csv file or across .csv files. The way I’ve constructed the Time array, however, allows for straightforward conditional indexing. The usefulness becomes more apparent after reading in the remaining data, and I describe it in section 2 below.
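For reference, that conversion can be as simple as the following sketch of bc_ad_ce_to_ybp (hypothetical; the version in urbanmap.py may differ, e.g. in what year counts as “present”):

def bc_ad_ce_to_ybp(year, era, present=2000.):
    """ convert a year plus era string (BC, BCE, AD or CE) to years
        before present, assuming 'present' is AD 2000 """
    year = float(year)
    if era in ('BC', 'BCE'):
        return present + year  # e.g., 1400 BC -> 3400 ybp
    return present - year      # e.g., AD 100 -> 1900 ybp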

The next lines (lines 74-101 of urbanmap.py) are pretty straightforward. Each row of the database is read in and distributed to one of three arrays: PopuL, city_lat and city_lon. The latter two contain the latitude and longitude of every city. PopuL is a 2D matrix with a row for each city and a column for each population record (i.e., PopuL.shape is n_city by n_Times).

I did run into some trouble with blank space characters. A few lines of the .csv files have some non-ascii blank space characters ‘\xa0’ that resulted in errors when I tried to convert the entries into floating point values. So I had to replace those characters with a true blank space before converting:

 81     for row in csv_ob:
 82 #       read in current line
 83         line = row
 84         line = [item.replace('\xa0',' ') for item in line]

 87 #       pull out lat/lon    
 88         city_lat[indx] = float(line[3])
 89         city_lon[indx] = float(line[4]) 

And that’s about it for reading in the data…

(2) Subsampling the data

Now that I’ve got all the data loaded, I want to be able to subsample that data for a specified time range. The main function to do this in urbanmap.py is get_lon_lat_at_t:

177 def get_lon_lat_at_t(year_range,city_lon,city_lat,Time,PopuL):

Most of the arguments (city_lon, city_lat, Time, PopuL) are returned by the load_cities function, described above. The year_range argument is a string that specifies the time span for which I want to select cities with non-zero population records. I chose to make year_range a comma separated string:

year_range='5000,BCE,1900,CE'

This year range starts at 5000 BCE and ends at 1900 CE. The era descriptor can be BCE, CE, BC or AD. Within get_lon_lat_at_t, I first convert this year_range to a start and end date in ybp:

186 #   convert year_range to years before present
187     year_range=year_range.replace(" ", "")
188     years=year_range.split(',')
189     time_range=[0,0]
190     time_range[0]=bc_ad_ce_to_ybp(years[0],years[1])
191     time_range[1]=bc_ad_ce_to_ybp(years[2],years[3])

Now, I can easily select the cities within the desired time range without knowing beforehand whether or not the chosen endpoints exist exactly in the Time array. First, I loop through each city and select the population records of the current city:

193 #   find lat and lon of cities with recorded populations in database
194     for icit in range(0,ncit):
195         pop=PopuL[icit,:] # current population

Next, I find the times at which the current city has a nonzero population record:

196         pop_t=Time[pop>0] # times with nonzero pops    

And now I pull out the times that fall within the specified time range:

197         pop_t=pop_t[pop_t<=time_range[0]] # pops within time range
198         pop_t=pop_t[pop_t>=time_range[1]]

The final step is to check whether any times are left. If the current city has no nonzero population record within the specified time range, I flag it for removal:

200         if pop_t.size == 0: # flag for removal 
201            lons[icit]=999.
202            lats[icit]=999.

So at the end of all this, I select the lons and lats that are not equal to 999, and those are the coordinates of the cities with a nonzero population record within the specified time range. Neat-o! I can now return these lon/lat values and make some plots!
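That last selection is just boolean indexing on the flag value, something like:

keep = lons != 999.   # cities that weren't flagged for removal
lons = lons[keep]
lats = lats[keep]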

(3) Plotting the data

Now that we’ve got a function to subsample the data, we can plot that data in a number of ways. The simplest place to start is to put all the lat/lon of cities with a nonzero population record for a specified time range on a map. This is what the __main__ function of urbanmap.py and the script ex_single_frame.py accomplish. In both, I sequentially call the load_cities and get_lon_lat_at_t functions, then plot the resulting lat/lon values using the basemap toolkit (mpl_toolkits.basemap).

The plotting is accomplished in two functions: urbanmap.base_plot() and urbanmap.city_plots(). The first creates a basemap object with the background image of the globe while the second actually puts the current lat/lon values onto that background image. base_plot() follows this example pretty closely.
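If you just want the gist of base_plot() without digging through urbanmap.py, a stripped-down stand-in looks something like this (a sketch; the real function also shifts and draws a topography grid, as discussed in the PyTips below):

from mpl_toolkits.basemap import Basemap

def base_plot(lon_0=0.):
    """ stripped-down stand-in for urbanmap.base_plot(): draws the
        background globe and returns the Basemap object """
    m = Basemap(projection='robin', lon_0=lon_0, resolution='c')
    m.drawcoastlines(linewidth=0.5)
    m.drawmapboundary()
    m.shadedrelief()  # simple background; the real version draws shifted topography data
    return m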

The following, from ex_single_frame.py, shows the full sequence for plotting cities within a given time span.

 33 import urbanmap as um
 34 import matplotlib.pyplot as plt
 35 
 36 # select time range
 37 time_span='500,BCE,50,CE'
 38        # comma separated string noting desired time span. The age descriptor can 
 39        # be BC, BCE, AD or CE.
 40 
 41 # select data set
 42 data_dir='../data_refs/'
 43 city_file='modelskiAncientV2.csv'
 49 
 50 # import data set 
 51 (Time,PopuL,city_lat,city_lon)=um.load_cities(data_dir,city_file)
 52 
 53 # get lon/lat of cities in time span    
 54 lons,lats,time_range=um.get_lon_lat_at_t(time_span,city_lon,city_lat,Time,PopuL)
 55 
 56 # plot it
 57 plt.figure(facecolor=(1,1,1))
 58 m = um.base_plot() # create base plot and map object
 59 um.add_annotations() # adds references
 60 um.city_plot(lons,lats,m,'singleframe',time_range,Time) # plot points
 61 plt.show() # and display the plot

Now that we have functions to pick out cities within a time range and then plot those points, creating an animation is conceptually straightforward: I just needed to repeatedly call get_lon_lat_at_t and city_plot, varying the time range each call. In practice, however, sorting through the routines in the python animation package was the trickiest part of this whole exercise. I initially gave up on the animation routines and simply looped over a time range, subsampling and plotting at each increment and saving the figure at each step along the way. That left me with a bunch of image files (the frames of the animation), which I then concatenated into an animation using a bash script and ImageStack.

In the end, I managed to figure out the python animation routines, and that’s what’s included in the source code.

PyTip: shifting the basemap. I used two functions to rotate the center longitude of the map: mpl_toolkits.basemap.Basemap() and mpl_toolkits.basemap.shiftgrid(). The Basemap function creates the longitude/latitude projection while shiftgrid rotates the topography data to align with the Basemap. BOTH functions take a lon0 argument, but in Basemap, lon0 is the center longitude of the projection, while in shiftgrid, lon0 is the westernmost longitude of the shifted grid. I was tripped up by this for a while because I assumed lon0 had the same meaning in each… whoops.
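In code, the distinction looks something like this (a sketch with a made-up topography array, just to show the two lon0 meanings):

import numpy as np
from mpl_toolkits.basemap import Basemap, shiftgrid

topo = np.random.random((90, 180))   # stand-in topography grid
lons = np.arange(0., 360., 2.)       # longitudes of that grid, 0 to 358

center_lon = 180.
# Basemap: lon_0 is the CENTER longitude of the projection
m = Basemap(projection='hammer', lon_0=center_lon)
# shiftgrid: lon0 is the WESTERNMOST longitude of the shifted grid
topo_shift, lons_shift = shiftgrid(center_lon - 180., topo, lons, start=True)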

PyTip: accessing the data for animation. Most of the tutorials for the animation function (matplotlib.animation.FuncAnimation) are relatively simple and are set up to re-calculate the data to plot at each frame. The issue I ran into was that FuncAnimation animates a specified function by sequentially feeding it a frame index. I couldn’t figure out how to pass additional arguments (the full dataset) and importing the data set at each frame would be way too slow. I had an existing dataset that I wanted to read in only once at the start of the program. I first got around this by declaring the database variables (PopuL, city_lats, city_lons,…) as global variables so that they’d be accessible within the FuncAnimation call. This was pretty easy but I’m always a little uncomfortable using global variables. My approach in the end relied on simply better understanding how python handles variables. By nesting all of the animation functions under one top level function, any variables set in that top level function are available at the lower levels (in a sense, they’re locally global?). I found this post useful.
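Here’s the shape of that nested approach (a sketch, not the exact code in urbanmap.py; the 'animation' mode string and the year_ranges list are made up for illustration):

import matplotlib.pyplot as plt
import matplotlib.animation as animation
import urbanmap as um

def animate_growth(data_dir, city_file, year_ranges):
    # load the database once; everything defined here is visible
    # to update() below without any global declarations
    (Time, PopuL, city_lat, city_lon) = um.load_cities(data_dir, city_file)
    fig = plt.figure(facecolor=(1, 1, 1))
    m = um.base_plot()

    def update(iframe):
        lons, lats, time_range = um.get_lon_lat_at_t(year_ranges[iframe],
                                     city_lon, city_lat, Time, PopuL)
        return um.city_plot(lons, lats, m, 'animation', time_range, Time)

    return animation.FuncAnimation(fig, update, frames=len(year_ranges))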

References:

Reba, Meredith, Femke Reitsma, and Karen C. Seto. “Spatializing 6,000 years of global urbanization from 3700 BC to AD 2000.” Scientific Data 3:160034, doi:10.1038/sdata.2016.34 (2016).
Historical Urban Population Growth Data: http://urban.yale.edu/data
Chandler, Tertius. “Four thousand years of urban growth: an historical census.” The Edwin Mellen Press (1987).
Modelski, George. “World Cities: -3000 to 2000.” FAROS 2000 (2003).