I will be attending Analytics 2015 in November, where some work at BA will be presented: Using Text Mining and Natural Language Processing to Automate the Classification of Passenger Complaints.
My MSc dissertation is available for download here, as a 7.54MB PDF. Here’s the abstract:
This project explores the use of interactive visualisations to augment the extensive data published by the National Records of Scotland. Good visualisation can illustrate key trends in statistical data, increasing impact and accessibility; great visualisation can go further, and enable us to identify and explore unexpected connections. Data visualisations can therefore support operational research, but we will see that producing them also entails solving problems of an OR flavour.
We survey the existing literature for principles of good design in presenting data visually; much of this is aimed at hand-produced imagery for print, so we examine how it can be best used in the new context of procedurally-generated, interactive visualisations for the web. In the first instance, we consider this for chart types which have proven popular or successful for static visualisations, particularly if already used by NRS.
This leads us to investigate more complicated data sets which can be interpreted as having a graph theoretic structure. We will show how the constrained layout of networks of vertices with an associated size can be posed as an optimisation problem, and develop a visualisation that operates under such constraints. Further, we will consider the use of geographic clustering to represent migration flow, describing and implementing a novel `re-wiring’ algorithm to generate tree structures that produce better visualisations than standard agglomerative approaches.
Finally, we present a portfolio of visualisations created for NRS that follow the design principles identified and make use of the software tools developed during the project.
There is also an online version of the appendix with links to the various visualisations developed, including source code and sample data files. The rest of this post gives at-a-glance versions.
The Cause of Death Explorer
Cause of Death Treemap
Experimental alternative presentation of the above data set; not suitable for Internet Explorer
…appears to be four (an infinite improvement). I coauthored a paper with Gary Greaves, whose recent paper Edge-signed graphs with smallest eigenvalue greater than -2 also saw contributions from Jack Koolen and Akihiro Munemasa. They both have an Erdős number of two (each via Chris Godsil, who is an Erdős coauthor), making Gary a three and myself a four.
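The chain above is just a shortest path in the coauthorship graph, so it can be checked with a breadth-first search. This is a minimal sketch using only the links named in this post; the dictionary keys are surnames, "author" is a placeholder label for myself, and the real collaboration graph is of course vastly larger.

```python
from collections import deque

# Coauthorship links mentioned in the post (a tiny, incomplete graph).
# "author" is a placeholder label for the writer of the post.
coauthors = {
    "Erdős": ["Godsil"],
    "Godsil": ["Erdős", "Koolen", "Munemasa"],
    "Koolen": ["Godsil", "Munemasa", "Greaves"],
    "Munemasa": ["Godsil", "Koolen", "Greaves"],
    "Greaves": ["Koolen", "Munemasa", "author"],
    "author": ["Greaves"],
}


def collaboration_distance(graph, source):
    """Breadth-first search: fewest coauthorship steps from source to everyone."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        person = queue.popleft()
        for other in graph[person]:
            if other not in dist:
                dist[other] = dist[person] + 1
                queue.append(other)
    return dist


erdos_numbers = collaboration_distance(coauthors, "Erdős")
```

With these links, `erdos_numbers["Greaves"]` is 3 and `erdos_numbers["author"]` is 4, matching the count above.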
If I do not publish any more papers, the best I can hope for is three, if Gary later collaborates with a one. But for now my goal should be to obtain a Bacon number…
The data available is the top 100 names for each of boys and girls born in Scotland, every year 1998-2013 except 2000 (for unknown reasons). For each name that features, the precise count is also given – but for any that fail to make the cut, we don’t have this figure. As well as this censoring effect – for which the precise threshold will vary each year – raw counts should really be considered in the context of varying birth rates too: there may be fewer children with a particular name simply because there are fewer children! So for the visualisation project I focused my efforts on the rankings. After various experiments, I settled on simply showing the top 20 each year. Interestingly, this doesn’t require too much data. For example, there are just 25 different boys’ names that feature in any of the 13 top 10s; and only another 16 are needed to form the pool for the top 20s, as many of those are past or future members of the top 10. So, here’s the boys:
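For anyone wanting to reproduce the "pool" counts, the idea is to rank each year separately and take the union of the names that make the cut. A minimal sketch follows; the names and counts in `yearly_counts` are invented placeholders rather than NRS figures, and breaking ties alphabetically is my assumption, not necessarily how the published rankings handle them.

```python
def top_n(counts, n):
    """Return the n most common names for one year, highest count first.

    Ties are broken alphabetically (an assumption for this sketch).
    """
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:n]]


def pool(yearly_counts, n):
    """Every name that appears in the top n of at least one year."""
    names = set()
    for counts in yearly_counts.values():
        names.update(top_n(counts, n))
    return names


# Invented placeholder data: {year: {name: count}}.
yearly_counts = {
    2003: {"NameA": 630, "NameB": 600, "NameC": 550, "NameD": 500},
    2013: {"NameB": 550, "NameE": 500, "NameF": 480, "NameA": 237},
}

top_two_pool = pool(yearly_counts, 2)
```

Here `top_two_pool` contains three names rather than four, because NameB is in the top two in both years; the same overlap is why the real pools stay small.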
and similarly for the girls:
However, I couldn’t resist going back to the raw counts to look at some of these in more detail. For instance, at the top of the charts we seem to have captured “peak Emma”; from highs of around 630 in 2003-04, it not only lost the top spot to Sophie but plummeted out of the top 10 (and nearly the top 20), with just 237 of them a decade later. The shift is even more pronounced when you consider that Sophia cracked the top 20 from 2011, and Sofia is also to be found further down the top 100. For the boys, Lewis has also declined substantially from its chart-topping days, but still holds a top three position despite there being fewer than half as many in 2013 as in 2003.
The Sophie/Sophia/Sofia situation is an example of a rather common phenomenon in girls’ names. Although the truncated rankings will suppress the least popular variants, a sufficiently popular name can carry with it homophones (such as Niamh/Neve¹ or Abbie/Abby; Nieve, Abi and Abbi have also featured in top 100s) or clusters of similar names (Ella/Elle/Ellie, Eva/Eve/Evie) as the next graph shows:
As mentioned, the most popular names usually spend some time as moderately popular ones first, and take a while to disappear entirely. But an interesting example of a name that has very recently sprung into prominence is Amelia. The shorter Amy held a top ten spot for all but two years 2001-2010, and the variant Aimee also made the top 20 for all of 2003-2006 (finding favour slightly later than Amy). But save for the 87th spot in 2005, Amelia was nowhere to be found until 2007; and only made the top thirty for the first time in 2012, somehow leaping straight to ninth place and staying there for 2013 too. Definitely one to watch for 2014!
Finally, I couldn’t resist an egotistical look at the data. However, in a sure sign of my advancing age, neither Graeme nor Graham ever make the top 100 for any of the years available…fortunately in 2013 the complete list was also published, and from this I note seven instances of Graham, three of Graeme (despite that being the more traditionally Scottish spelling), and both a Gray and a Graye too. On the other hand, my surname has the distinction of being reasonably popular as a first name for both boys and girls – the only other unisex example I spotted was Jordan, but that was substantially more common for boys. For Taylor, it’s fairly even – but also falling out of fashion it seems!
1 Yes, those sound the same. Blame Gaelic. For bonus marks, can you pronounce 2007’s twentieth most popular name, Eilidh?
A month in, I am starting to produce my own d3 visualisations essentially from scratch. This example is an interactive version of Figure 2.4 from The Registrar General’s Annual Review of Demographic Trends (158th Edition). The user can select from many more years, but as only one is shown at a time clutter is reduced (whilst animation helps to reveal the changing patterns, and tooltips provide clarification and precise data values). Moreover, there is a cohort effect within this data: it is not entirely accurate to say that the fertility of 25-year-olds fell from 1973 to 1974, as these are different groups of women. The transition animations therefore instead show the changing experiences of each of these groups as they age, identified by colour coding. For an alternative slice through the data along these lines, this version instead shows fertility at each age for a selected cohort; I am still considering if there is an effective way to combine the two.
- Source: Vital Events Reference Tables 2012 Table 3.6: Age-specific birth rates, per 1,000 female population, Scotland, 1951 to 2012.
- Live births only. Excludes births where mother’s age is not stated.
- Rate for age 15 includes births at younger ages and for age 44 includes births at older ages.
- The average age is calculated by adding 0.5 years to the mother’s age at her last birthday (e.g. it is assumed that 30-year-old mothers were, on average, aged 30 years and 6 months when they gave birth).
- The age-specific birth rates for 2002 to 2010 are the revised figures calculated using the rebased population estimates which were published on 17th December 2013.
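To make the cohort slice and the half-year adjustment concrete, here is a rough sketch. The data layout (`(year, age)` keys mapping to a rate per 1,000 women), both function names, and the sample numbers are my own assumptions for illustration; weighting the average by the rates rather than by birth counts is also an assumption, and `year - age` is only an approximate birth year for the cohort.

```python
def cohort_slice(rates, birth_year):
    """Rates experienced by one cohort as it ages.

    rates maps (year, age) -> births per 1,000 women.
    The cohort is approximated as the women for whom year - age == birth_year.
    """
    return {
        age: rate
        for (year, age), rate in rates.items()
        if year - age == birth_year
    }


def mean_age_at_birth(rates_by_age):
    """Rate-weighted mean age, adding 0.5 to the age at last birthday."""
    total = sum(rates_by_age.values())
    weighted = sum((age + 0.5) * rate for age, rate in rates_by_age.items())
    return weighted / total


# Invented placeholder rates, not NRS figures.
rates = {
    (1973, 25): 1.0,
    (1974, 25): 2.0,
    (1974, 26): 3.0,
}

cohort_1948 = cohort_slice(rates, 1948)
example_mean = mean_age_at_birth({29: 100.0, 30: 100.0})
```

The 25-year-olds of 1973 reappear as the 26-year-olds of 1974, so `cohort_1948` picks up the `(1973, 25)` and `(1974, 26)` entries but not `(1974, 25)`, which belongs to the following cohort.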
For the dissertation component of my MSc I’ll be working with the National Records of Scotland on a project entitled Data Visualisation of Scottish Demographic Information. Here’s a first dip into the world of D3, lightly adapted from these examples of chord diagrams. The data shown are Migration flows between Council areas for 2011-12 (most recent).
Earlier in the year I participated in the building of a `Giant 4D Buckyball’ sculpture; the first of its kind in the UK, and assembled by a team of twenty during the opening day of the University of Edinburgh’s Innovative Learning Week. I then represented the project at the ASCUS Art and Science Salon as part of TEDxUniversityofEdinburgh at the end of the week. The build was one of several ILW events organised by Julia Collins from the School of Mathematics, and you can read her account here. There was a lot of coverage of this event, from student blogs to Scottish Television – although of varying standards of mathematical literacy! So I’ve put together a series of posts describing the fundamental building block, the `buckyball’:
Whilst the sculpture definitely counts as mathematical artwork, it also gave me a chance to indulge some of my other creative interests. As well as the images above, during the construction (and more recent deconstruction) I was able to capture the action through a pair of time-lapse videos (as always, setting to HD is recommended!):
In the previous post we saw how we could project polyhedra into the plane, and use some simple properties about planar graphs to classify all the possible Platonic solids. In this post we’ll finally get to the buckyball, by considering a less restrictive class of polyhedra: the fullerenes.
The Platonic solids were extremely regular: every face had to be the same, with all angles and side lengths the same, and the same number of faces meeting at each vertex. In a fullerene we allow the faces to be either pentagons or hexagons, but we require exactly three faces to meet at each vertex. We can still take a one-point projection and get a planar graph: it’ll be 3-regular from the vertex condition, and every face has degree five or six.
This turns out to force a seemingly stronger condition on our graphs:
A fullerene has exactly twelve faces of degree five.
To see this, suppose there are P faces of degree five (pentagons) and H faces of degree six (hexagons). Then all-in-all there are F = P + H faces, and we know from Euler’s formula that F = 2 + E - V. By 3-regularity we know 2E = 3V. So P + H = F = 2 + 3V/2 - V = 2 + V/2. Further, by handshaking for planar graphs we know 2E = 5P + 6H; so
3V = 2E = 5P + 6H = 5(P + H) + H = 5(2 + V/2) + H = 10 + 5V/2 + H.
This tells us that V=2H+20, so H = V/2 -10. As P + H = 2 +V/2, we conclude P = 2 + V/2 – H = 2+ V/2 – (V/2 -10) = 12, as claimed.
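As a sanity check on the algebra, the counts can be recomputed from the number of hexagons alone. This is only arithmetic on the formulas above (V = 2H + 20, 2E = 3V, F = 2 + E - V); the function name is mine, and it says nothing about whether a fullerene with a given H actually exists.

```python
def fullerene_counts(hexagons):
    """Vertex, edge, face and pentagon counts implied by H hexagons."""
    vertices = 2 * hexagons + 20      # V = 2H + 20
    edges = 3 * vertices // 2         # 3-regular: 2E = 3V (V is always even here)
    faces = 2 + edges - vertices      # Euler's formula: F = 2 + E - V
    pentagons = faces - hexagons      # F = P + H
    return vertices, edges, faces, pentagons


dodecahedron = fullerene_counts(0)
sixty_vertex_case = fullerene_counts(20)
```

With no hexagons this gives 20 vertices, 30 edges and 12 faces, all pentagons; with 20 hexagons it gives 60 vertices, 90 edges and 32 faces, and still exactly 12 pentagons.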
So in a certain degenerate sense we’ve already seen a fullerene – if we have 12 pentagons and no hexagons, with three pentagons meeting at every point, then we have one of our Platonic solids – specifically, the dodecahedron. However, the motivation for studying fullerenes comes from molecular chemistry, where they arise as different allotropes of carbon. But the laws of physics get in the way of having adjacent pentagonal faces when building with carbon – the bonds are not stable. To be a viable fullerene in the chemical sense, our fullerene graph has to have isolated pentagons. That means that none of the five vertices of each of the twelve pentagons can be shared with another pentagon, so a chemically viable fullerene has to have at least 12 × 5 = 60 vertices. But, remarkably, we can exhibit a 60 vertex planar graph with twelve pentagonal faces, all other faces hexagonal, three faces meeting at every vertex, and no two pentagons touching:
This, as you may have guessed, is our long-awaited Buckyball! Or, more properly, Buckminsterfullerene. This is the simplest possible isolated pentagon fullerene, but it is still much more complicated than the more familiar allotropes of carbon: graphite and diamond. The theoretical existence of the C60 allotrope had been advanced several times in the 1960s and 1970s, but was not generally accepted as a realistic possibility by the scientific community. That had to change in the 1980s, when it was first synthesised by Kroto, Curl and Smalley. They named it Buckminsterfullerene due to its resemblance to geodesic dome constructions by the architect Richard Buckminster Fuller:
They also produced C70, showing that C60 was just one instance of a general class, the fullerenes: they received the 1996 Nobel prize in Chemistry for opening up this field of study. It has subsequently been shown that C60 is naturally occurring – it can be found in soot, created by lightning, and has even been identified in clouds of cosmic dust!
How can we represent a 3-dimensional object such as a cube in only 2 dimensions, such as on a flat piece of paper? This is the problem of projection, and it inevitably introduces inaccuracies. Different choices of perspective will alter what features survive the projection process. For instance, a perfect cube has all faces square, with corner angles of 90 degrees, and opposite sides of each square are parallel. But in the two point perspective shown only the vertical lines remain parallel; the introduction of vanishing points has distorted the horizontal ones and thus the angles.
Instead of thinking of the cube as a solid object, we can describe it in terms of its vertices (the corners or points) and the edges that join them – that is, as a graph! But in the previous post we were interested in planar graphs, and in our 2-point perspective we have edges crossing. This might seem unavoidable – the `front’ blocking our view of the `back’ – but through a different choice of projection we can get a planar graph of the cube, or indeed any suitably well-behaved solid. This is the key to using graph theory to study those solids.
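To see the "cube as a graph" idea in concrete terms, here is a small sketch that builds the graph from corner coordinates and counts what it finds. Labelling the corners by 0/1 triples is just a convenient choice of mine; the face count comes from Euler's formula for planar graphs, F = 2 + E - V, where one of the faces is the unbounded outer region of the drawing.

```python
from itertools import combinations, product

# Corners of the unit cube, labelled by their (x, y, z) coordinates.
vertices = list(product((0, 1), repeat=3))

# Two corners are joined by an edge when they differ in exactly one coordinate.
edges = [
    (u, v)
    for u, v in combinations(vertices, 2)
    if sum(a != b for a, b in zip(u, v)) == 1
]

# Number of edges at each corner.
degree = {v: sum(v in edge for edge in edges) for v in vertices}

# Euler's formula for a planar drawing (counting the outer face too).
faces = 2 + len(edges) - len(vertices)
```

This gives 8 vertices and 12 edges, with every vertex of degree three, and Euler's formula returns 6: the six square faces of the cube, one of which becomes the outer region once the graph is drawn in the plane.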