I’ll be attending the 29th European Conference on Operational Research – more simply, Euro 2018 – in Valencia this summer. I’m not giving a presentation, but as I’ll be attending on behalf of OVO I’d be particularly keen to talk to anyone with experience of applying OR techniques in the energy industry. At a personal level, I’m also keen to learn more about MILP / constraint satisfaction algorithms at scale.

# Visualising air quality data with Voronoi diagrams

*This is the extended version of my lightning talk from the second PyData Bristol meetup.*

The basic idea for this visualisation – using Voronoi diagrams to map air quality data – is one I’ve been carrying around for several years, but only recently did I have the right combination of data, tools, and project time to make it a reality. My new job at OVO required me to dust off my python skills – for which I prefer a toy project to abstract study – and gave me access to Tableau, which meant I could lean on its mapping capabilities and easy sharing of visualisations. Working for a green energy company also seemed more appropriate for shining a light on poor air quality than my previous roles in aviation! Open Data Bristol provided the raw material – although their site seems to be in flux, with the exact dataset I used no longer available – and here’s what I did with it:

## Voronoi diagrams in the wild

So what *is* a Voronoi diagram? Given a particular point from a set of points – in this case, individual air quality monitors set up around Bristol – the Voronoi cell is all the locations which are closer to that point than any of the others. A Voronoi diagram (or tessellation) is then the collection of all these cells; here, it partitions Bristol so that for any location in the city, you can see at a glance the air quality reading from its nearest monitor. This has a more instant impact than just plotting the points: highlighting that poor air quality likely affects broad areas, not just a particular monitor; and eliminating the need to determine the nearest sensor to a place of interest (especially as zooming to a local level of detail may remove monitors from view entirely).

However, I’m not aware of any dataviz programs that offer Voronoi diagrams ‘out of the box’, and indeed, this is not their main field of application. They arise in an impressive variety of domains across mathematics, CS and the natural sciences – and have even found their way into artistic endeavours – so I’m surprised that they don’t appear in undergraduate curricula. Or, at least, they didn’t feature in mine; so here’s a few contexts in which I’ve since encountered them!

### Biology

Long-necked Voronoi diagrams

Instead of thinking of Voronoi cells as being constructed, we can imagine them arising organically. If we have growth occurring at the same rate from multiple seed sites, then collisions will occur precisely at the boundaries of the Voronoi cells corresponding to those sites. Similarly, if pigments or other chemicals are spread from the sites, they will meet and react for the first time along those same boundaries. As a result, Voronoi diagrams can be used to model natural phenomena ranging from the skin patterns of giraffes to the structures of tortoise shells or dragonfly wings (Wikipedia has a neat animation of this).

### Physics

Making Voronoi diagrams with sand

If waiting for biology seems too slow, you can also produce your Voronoi diagrams with physics. In the above example, holes have been drilled into blocks covered with fine sand; once the holes are uncovered from below, some sand will flow through. But grains will drain out from their nearest hole, and are more likely to do so if they’re closer. Thus, once the pile settles, it does so with ridges at the locations furthest from the holes: that is, the boundaries of the Voronoi cells. Similar processes occur in mud cracking, lava cooling or soap bubble formation; these can be thought of as physical optimisation algorithms, where in this case the best solution relates to your nearest or furthest neighbour. (By analogy, we can exploit Voronoi diagrams to answer operational research style problems such as where best to place a new shop or distribution centre.)

### Architecture

Design for the Fry building, University of Bristol

Voronoi diagrams have become popular in architecture, particularly as screens or roofs, where the irregular cell layout enables a mixture of light and shade patterns (and an alternative, more organic, aesthetic than arises from grid-based designs). A local example will soon be found in pride of place at the University of Bristol’s Fry building, which will be the new home for the School of Mathematics. However, their summary doesn’t specify what it’s a diagram *of*!

## Voronoi diagrams in python

### Obtaining Voronoi cells

The mathematics of Voronoi diagrams belongs to discrete and computational geometry, and I highly recommend the book of the same name by Devadoss and O’Rourke – so good, I actually have two copies! This was an area that I dabbled in shortly before I departed academia, and have always wanted to do more with.

Fortunately, python being python, there’s no need to reinvent the wheel. Importing `voronoi` from `scipy.spatial` we can easily construct and plot the Voronoi diagram from an array of X,Y pairs describing our points of interest:

The blue points are the original sites, whilst the orange ones are the vertices of the Voronoi cells (or regions as they are referred to here). So at first glance, it might seem that we’re done… but on closer inspection – or rather, further, by expanding our viewing window from the conveniently chosen one – some issues are revealed:

The ‘problem’ is with the outer cells, and mathematically there is no problem at all. Recall that the goal of the Voronoi diagram is to divide up the plane into regions according to their closest site. But this is the idealised mathematical plane: in the simplest case of two starting points, the entire (flat) universe is divided into two *infinite* halves, separated by the perpendicular bisector of the line through the points. For more complicated collections of points, we get two types of Voronoi cells: the finite ones, with a well defined boundary made up of finite line segments; plus the infinite ones, where borders shared with other infinite cells will be rays that extend indefinitely from a boundary point. In the plots above, these rays are represented by dashed lines rather than the solid boundaries of the finite cells.

The `Voronoi` methods actually handle this by working with the list of boundary vertices for each region, rather than its ‘shape’. Each orange point we see carries an index, from which we can recover its coordinates in the plane. But there is also a ‘point at infinity’ (with index -1) for which there is no corresponding point; the infinite regions are those which include this point in their vertex list.

### Obtaining finite Voronoi cells

For visualisation purposes, we’d like to restrict the cells to fall within some bounding box – in our case, large enough to include all our air quality monitors, but not otherwise stray too far from Bristol. As we’ll soon see, there are python data structures that make this operation straightforward for polygons – but we need to deal with those infinite regions first. The trick is to turn them into large – much larger than our eventual area of interest – but finite polygons first, by projecting those infinite boundary rays out much further before truncating.

I would like to say I thought through the mathematics for this, but instead I turned to Stackoverflow, suspecting that others had had the same problem. Sure enough, this question – and this solution by user ‘sklavit’ – give us the pieces we need. This plot, on the same axis as before, but now showing individual polygons rather than collections of points and lines – illustrates some of the effect:

### Working with shapes in python

Sklavit’s solution actually contains two parts: generating the finite regions as lists of boundary vertices; then using the `shapely` library to construct `Polygon`s with those boundaries. These can then have geometric operations easily applied to them; in particular, a bounding box polygon can be created and the intersection of this with each Voronoi cell found.

This suffices to demonstrate their solution to the problem. But in practice, working directly with individual polygons can be fiddly – there are issues with shared boundaries, for instance – and for our visualisation purposes we’d need some book-keeping to track which data values correspond to each cell (especially if the original points are no longer contained in any of the bounded polygons).

We can therefore benefit from a more elaborate structure. The `geopandas` library extends `pandas` data frames to allow each row to have a *geometry* (described by a shapely polygon). This allows us to perform set-theoretic operations on collections of shapes en-masse, and get a new collection back. Better still, we can associate our shapes with data and metadata, and this will be carried through to the new shapes as appropriate.

For example, taking the union of overlapping shapely polygons directly only gives us a single larger polygon. With geopandas, we get a collection of polygons which add up to the same result, but track each piece’s relationship with the original polygons. Here’s an illustration of that effect:

For our purposes we can therefore preview the effect of applying a particular bounding box, by considering its *union* with a geodata frame of the finite cells:

Although, of course, it is the *intersection* that we actually wish to keep:

where each of the surviving shapes has data columns inherited from the finite cells, which could themselves have been easily associated with monitor data (since shapely allows us to test whether a point is contained in a polygon).

## Putting it all together

### Working with shapes in tableau

From here we can move to Tableau, provided we can translate to data it understands. This is fairly straightforward, once you’re aware of a few quirks of Tableau’s polygon mark type…

- A Tableau polygon is defined by a set of boundary points, and we need one data row per point.
- An index or label is needed to indicate which points belong to which polygon (and should be applied to the detail or colour shelf).
- An additional
*path index*is needed to indicate the order in which the boundary is constructed from the points (and should be applied to the path shelf). - Unlike shapely polygons
*and, confusingly, tableau’s path mark type*, you should not ‘close the loop’ by repeating the initial vertex.

Thus the remaining task in python is to produce these appropriately indexed lists of points. These could then be merged back to the original data in Tableau, but I found it simpler (albeit wasteful of space) to produce a single file, replicating each cell’s data across all its boundary points.

If you wish to be able to plot both the cells and the original sites, then you’ll need to use a dual axis plot in Tableau, and differentiate the two types of geometry (polygon and point) in the data. You’ll also need to hide the boundary points of each cell (or use the shape shelf to render them invisible), since Tableau filters apply to both axes simultaneously. The visualisation above demonstrates this effect if you enable the ‘Show Monitor Locations’ option.

### Example notebook

Our recipe is therefore as follows:

- obtain some data with X,Y coordinates;
- recover the (potentially infinite) Voronoi regions;
- reduce these to finite Voronoi cells;
- truncate these to polygons within a suitable bounding box;
- convert to a point-by-point format for Tableau.

I’ve implemented each of these steps in a Jupyter notebook, which you can find here. Note that this only produces the output needed for a single Voronoi diagram using the 2016 air quality data, which is (currently) available under the Open Government Licence v3.0. To produce the multi-year visualisation (with a diagram for each year based on the monitors that were active then) I simply iterated this process, and added year as an additional column before concatenating the output and exporting to Tableau; sadly this is no longer available as a single data set!

## Limitations

Although I’m happy with this particular visualisation, this code is not suitable for all such uses. This is because the process makes a faulty – but apparently increasingly popular – assumption: that the world is a flat plane! Since we actually live on (roughly) a sphere, this creates a couple of issues when treating longitude and latitude as if they were X,Y coordinates.

Firstly, we define our Voronoi cells by their boundary vertices, which we then join by a ‘straight’ line. However, the shortest path from one geographic coordinate to another – e.g. an edge of a cell – would be a great circle, which appears as a curve under this projection. Fortunately, this distortion is relatively minor for paths across, say, Bristol. But it is worth keeping in mind at an international scale; this example shows how Tableau (in orange) would render a triangle between three major airports, versus how it would slice a segment out of the globe (in blue):

One solution would be for Tableau’s path rendering to be aware of these issues; alternatively, we could compute these curves in python first and output a much larger number of points to define the boundary of each cell.

However, treating geographic coordinates as points in a plane will cause much bigger problems elsewhere in the world, regardless of scale:

Had we been attempting to map Fijian rather than Bristolian air quality, we’d have run into difficulties as extremely negative and positive longitudes indicate locations which are close by, not far apart. Although latitude at the poles does not have the same ‘wrap around’ effect, distortion would be pronounced there too.

Fortunately these are known and solved issues, resulting in spherical Voronoi diagrams and some of my favourite visualisations. So if you found all of this straightforward, I’d love to see a python implementation that computes these correctly anywhere on the globe, and can then project them to a map!

# Analytics 2015

I will be attending Analytics 2015 in November, where some work at BA will be presented: *Using Text Mining and Natural Language Processing to Automate the Classification of Passenger Complaints*.

# Data Visualisations in use at NRS

Some of my dissertation work on data visualisation is now in use by the National Records of Scotland:

# Visualisation Of Scottish Demographic Data

My MSc dissertation is available for download here, as a 7.54Mb PDF. Here’s the abstract:

This project explores the use of interactive visualisations to augment the extensive data published by the National Records of Scotland. Good visualisation can illustrate key trends in statistical data, increasing impact and accessibility; great visualisation can go further, and enable us to identify and explore unexpected connections. Data visualisations can therefore support operational research, but we will see that producing them also entails solving problems of an OR flavour.We survey the existing literature for principles of good design in presenting data visually; much of this is aimed at hand-produced imagery for print, so we examine how it can be best used in the new context of procedurally-generated, interactive visualisations for the web. In the first instance, we consider this for chart types which have proven popular or successful for static visualisations, particularly if already used by NRS.

This leads us to investigate more complicated data sets which can be interpreted as having a graph theoretic structure. We will show how the constrained layout of networks of vertices with an associated size can be posed as an optimisation problem, and develop a visualisation that operates under such constraints. Further, we will consider the use of geographic clustering to represent migration flow, describing and implementing a novel `re-wiring’ algorithm to generate tree structures that produce better visualisations than standard agglomerative approaches.

Finally, we present a portfolio of visualisations created for NRS that follow the design principles identified and make use of the software tools developed during the project.

There is also an online version of the appendix with links to the various visualisations developed, including source code and sample data files. The rest of this post gives at-a-glance versions.

The Cause of Death Explorer

Cause of Death Treemap

*Experimental alternative presentation of the above data set; not suitable for Internet Explorer*

Fertility data (cohort effects)

Popular baby names for boys

*Also available for girls!*

Life Expectancy

*The first to be used in NRS reporting, appearing here.*

# My Erdős number…

…appears to be four (an infinite improvement). I coauthored a paper with Gary Greaves, whose recent paper Edge-signed graphs with smallest eigenvalue greater than -2 also saw contributions from Jack Koolen and Akhiro Munemasa. They both have an Erdős number of two (each via Chris Godsil, who is an Erdős coauthor), making Gary a three and myself a four.

If I do not publish any more papers, the best I can hope for is three, if Gary later collaborates with a one. But for now my goal should be to obtain a Bacon number…

# Recent popular baby names in Scotland

The data available is the top 100 names for each of boys and girls born in Scotland, every year 1998-2013 except 2000 (for unknown reasons). For each name that features, the precise count is also given – but for any that fail to make the cut, we don’t have this figure. As well as this censoring effect – for which the precise threshold will vary each year – raw counts should really be considered in the context of varying birth rates too: there may be less children with a particular name simply because there are less children! So for the visualisation project I focused my efforts on the rankings. After various experiments, I settled on simply showing the top 20 each year. Interestingly, this doesn’t require too much data. For example, there are just 25 different boys names that feature in any of the 13 top 10’s; and only another 16 are needed to form the pool for the top 20’s, as many of those are past or future members of the top 10. So, here’s the boys:

and similarly for the girls:

However, I couldn’t resist going back to the raw counts to look at some of these in more detail. For instance, at the top of the charts we seem to have captured “peak *Emma*“; from highs of around 630 in 2003-04, it not only lost the top spot to *Sophie* but plummeted out of the top 10 (and nearly the top 20), with just 237 of them a decade later. The shift is even more pronounced when you consider that *Sophia* cracked the top 20 from 2011, and *Sofia* is also to be found further down the top 100. For the boys, *Lewis* has also declined substantially from its chart-topping days, but still holds a top three position despite there being less than half as many in 2013 than 2003.

The Sophie/Sophia/Sofia situation is an example of a rather common phenomenon girls names. Although the truncated rankings will suppress the least popular variants, a sufficiently popular name can carry with it homophones (such as

*Niamh*/

*Neve*

^{1}or

*Abbie*/

*Abby*;

*Nieve*,

*Abi*and

*Abbi*have also featured in top 100’s) or clusters of similar names (

*Ella*/

*Elle*/

*Ellie*,

*Eva*/

*Eve*/

*Evie*) as the next graph shows:

As mentioned, the most popular names usually spend some time as moderately popular ones first, and take a while to disappear entirely. But an interesting example of a name that has very recently sprung into prominence is

*Amelia*. The shorter

*Amy*held a top ten spot for all but two years 2001-2010, and the variant

*Aimee*also made the top 20 for all of 2003-2006 (finding favour slightly later than Amy). But save for the 87th spot in 2005, Amelia was nowhere to be found until 2007; and only made the top

*thirty*for the first time in 2012, somehow leaping straight to ninth place and staying there for 2013 too. Definitely one to watch for 2014!

Finally, I couldn’t resist an egotistical look at the data. However, in a sure sign of my advancing age, neither

*Graeme*nor

*Graham*ever make the top 100 for any of the years available…fortunately in 2013 the complete list was also published, and from this I note seven instances of Graham, three of Graeme (despite that being the more traditionally Scottish spelling), and both a

*Gray*and a

*Graye*too. On the other hand, my surname has the distinction of being reasonably popular as a first name for both boys and girls – the only other unisex example I spotted was

*Jordan*, but that was substantially more common for boys. For

*Taylor*, it’s fairly even – but also falling out of fashion it seems!

^{1} Yes, those sound the same. Blame gaelic. For bonus marks, can you pronouce 2007’s twentieth most popular name, *Eilidh*?

# Fertility in Scotland

A month in, I am starting to produce my own d3 visualisations essentially from scratch. This example is an interactive version of Figure 2.4 from *The Registrar General’s Annual Review of Demographic Trends (158th Edition)*. The user can select from many more years, but as only one is shown at a time clutter is reduced (whilst animation helps to reveal the changing patterns, and tooltips provide clarification and precise data values). Moreover, there is a cohort effect within this data: it is not entirely accurate that fertility of 25 year olds fell from 1973 to 1974, as these are different groups of women. The transition animations therefore instead show the changing experiences of each of these groups as they age, identified by colour coding. For an alternative slice through the data along these lines, this version instead shows fertility at each age for a selected cohort; I am still considering if there is an effective way to combine the two.

**Source: Vital Events Reference Tables 2012**Table 3.6: Age-specific birth rates, per 1,000 female population, Scotland, 1951 to 2012.- Live births only. Excludes births where mother’s age is not stated.
- Rate for age 15 includes births at younger ages and for age 44 includes births at older ages.
- The average age is calculated by adding 0.5 years to the mother’s age at her last birthday (e.g. it is assumed that 30-year-old mothers were, on average, aged 30 years and 6 months when they gave birth).
- The age-specific birth rates for 2002 to 2010 are the revised figures calculated using the rebased population estimates which were published on 17th December 2013.

# DataViz first steps

For the dissertation component of my MSc I’ll be working with the National Records of Scotland on a project entitled *Data Visualisation of Scottish Demographic Information*. Here’s a first dip into the world of D3, lightly adapted from these examples of chord diagrams. The data shown are Migration flows between Council areas for 2011-12 (most recent).

# The Giant 4D Buckyball

Earlier in the year I participated in the building of a `Giant 4D Buckyball’ sculpture; the first of its kind in the UK, and assembled by a team of twenty during the opening day of the University of Edinburgh’s Innovative Learning Week. I then represented the project at the ASCUS Art and Science Salon as part of TEDxUniversityofEdinburgh at the end of the week. The build was one of several ILW events organised by Julia Collins from the School of Mathematics, and you can read her account here. There was a lot of coverage of this event, from student blogs to Scottish Television – although of varying standards of mathematical literacy! So I’ve put together a series of posts describing the fundamental building block, the `buckyball’:

Whilst the sculpture definite counts as mathematical artwork, it also gave me a chance to indulge some of my other creative interests. As well as the images above, during the construction (and more recent deconstruction) I was able to capture the action through a pair of time-lapse videos (as always, setting to HD is recommended!):