- Data Acquisition and Processing
There is a small – but growing – community of individuals dedicated to making the most out of airline ‘frequent flyer’ programs, and although my own travel is relatively modest, I consider myself a devotee of this unusual art. A focal point online is the website FlyerTalk, where you can find discussion on everything from collecting handfuls of points through supermarket deals, through to the best champagnes available in first class. (A good insight into this world is given in the short film Frequent Flyer.)
Beyond a certain point, you tend to need a sprawling collection of spreadsheets (or a good PA) to manage your trips, and a number of flight-logging websites have sprung up. I favour FlightDiary, but for this project I drew upon the records of BA97. This site – an offshoot of the British Airways board on FlyerTalk – allows users to track their flights, and liase with other travellers to share lounge access (as most programs allow their higher-tier members to guest in others). Crucially, the data is publically available, in an easy-to-parse format of lists of pairs of IATA codes, so I collected a year’s worth of data, from October 1st 2011 through to 30th September 2012. After cleanup, this gave over 17,000 flights, across around 4000 pairs from nearly 700 airports.
From this, I was able to create a weighted, directed network (with airports as nodes, and flights as edges); I also produced the unweighted directed, and unweighted undirected versions. A quick rendering of the latter, after running Force Atlas 2 with hub dissuasion, is shown above.
(1) Community Detection
From the image above, some potential communities of airports are apparent. Sticking with the unweighted, undirected network, here’s a rendering after the modularity process has been run; nodes are now also scaled by degree:
By over-familiarity with airport codes, it seemed that the community classes corresponded to geography at a continental level: orange for Europe (class 2), green for North America (3), light blue for South America (0), red for Asia (9). Interestingly, it didn’t seem able to find a good home for LHR – London Heathrow – presumably as it connects to so many communities, it doesn’t obviously belong to any of them (the rest of its class (7), in yellow, is a confusing assortment of locations!). Further, Asia and Australasia seem to have been lumped together.
To make this a bit more precise, I went back to Python to collect some geographic data with which to augment the nodes; this is included in the gephi files. In the data laboratory, I sorted by modularity class, then exported the spreadsheet, and assigned OneWorld continents to each node by hand (a fun test of geography skills!). The chart below shows percentages by continent for each of the main modularity classes detected, and the overall percentage in the final column.
For instance, a random airport has a 27% chance of being in North America. However, given a random airport from class 3, there is a 96% chance that it is from North America; whilst if it is not in class 3, it has only a 6% chance of being in North America. Thus, given an airport of unknown location, but known flight patterns, one can make a much better guess as to whether or not it is North American by adding it to the network and determining the class it falls into based upon modularity. Similarly, if an aiport is not in class 9, it almost certainly isn’t Asian or Australasian – although distinguishing the two within that class is hard!
(2) Degree Distribution
As always, one should be cautious in assuming a power-law distribution for a ‘real world’ data set- for frequent flyer data, there cannot be an arbitrarily long tail, since there are only so many passengers, and so many possible flights for them to be on. Nonetheless, there may be a reasonably good fit over some range of the data.
For this, I used the list of weighted degrees of the nodes (that is, the number of flights in or out of each airport), and the power law tools from Aaron Clauset used earlier in the course. This computed that the maximum likelihood value for alpha is 1.5, valid for x at least 2 (there are a lot of airports that were visited only once). I couldn’t get plplot to work, so I rolled my own in SAGE: below is a log-log plot of the weighted degree frequencies (i.e, the x co-ordinate is the log of an observed weighted degree, and its y co-ordinate is the log of the number of observations of that degree. These are not probabilities, but rescaling would be multiplicative and thus a linear shift on the log-log plot). Also plotted is the calculated line of slope -1.5, taken through the suggested start point of x=2.
This fairly low value of alpha implies that it takes a long time for frequency to decay with weighted degree; since airport networks are specifically designed to have high-traffic hubs, this is not too surprising. At the other end, it is hard to take a very short flight, and for moderate distances, other modes of transport may be viable alternatives.
(3) Measures of centrality
Looking at the top five entries for each, the dataset illustrates that measures of importance like eigenvector centrality or betweenness are not simply a side effect of weighted degree, although it certainly has a major effect:
|Weighted Degree||Eigenvalue Centrality||Betweenness|
|LHR (London Heathrow)||LHR||LHR|
|FRA (Frankfurt, Germany)||FRA||FRA|
|LGW (London Gatwick)||ZRH (Zurich, Switzerland)||LGW|
|HKG (Hong Kong International)||AMS (Schiphol, Amsterdam, The Netherlands)||BKK (Bangkok, Thailand)|
|SIN (Changi, Singapore)||MUC (Munich, Germany)||AKL (Auckland, New Zealand)|
It is unsurprising to see Heathrow top the charts for all three measures. It is the third busiest airport in the world, and as British Airways’ UK hub, will be even more prominent in a dataset drawn largely from the flight patterns of the members of BA’s Executive Club. Perhaps most interesting is the other major London airport, Gatwick: despite being the third most used in the dataset, and also third in betweenness, it doesn’t even make the top ten for eigenvector centrality (landing in twelfth place). At a guess, I would put this down to it being more used for there-and-back holiday flights, to destinations that aren’t so crucial to the business network or as interlinked themselves, so are unable to funnel influence to LGW.
(4) Distribution of physical distances
We have seen in lectures that the physical distances between nodes can play a role in finding short distances in the graph-sense. For instance, when selecting who to forward a letter to in the small world challenge, a good early strategy is to pick someone geographically closer to the target. Rewiring models also suggest there is a sweet-spot for networks of mostly local, but with a few long-range, connections that allows efficient navigation. So, I thought it would be interesting to see if the flight network was structured in this way: are most flights local, with the occasional long haul? Phrased like so, it sounds reasonable. But again, whilst a power law would certainly imply this structure, there are very definite upper limits to the length of a flight, unless you can refuel mid-air! So such a model would only be valid when truncated to sensible flight lengths.
Since I had the coordinates of all the airports, I could turn these into distances for each flight made; if a route was flown multiple times, it was included repeatedly, giving a substantial dataset of over 17,000 flights. I then fed these to the power law fitting tools, but the results are perhaps disappointing. The most likely value of alpha is around 6.29, which would imply a very aggressive penalty for long journeys as expected: but it was only considered valid for distances greater than 5186 miles, which is just 237 of the 4000+ routes flown. The log-log plot demonstrates just what a mess the situation is, with no real reason to believe in a linear relationship:
Were the computed alpha to be valid, it would imply that the probability of a flight between two cities a distance d apart is k/d^6.29. However, looking a little into the literature, it seems that a ‘gravity model’ may make more sense. In these the likelihood of travel still falls off as some power of distance, but is also proportional to the ‘mass’ of the cities involved, as estimated by something like population or GDP. This can be modelled as M1*M2/d^r for some r, with the product of masses M1,M2 allowing for fine-tuning compared to a fixed k. For instance, a particularly long journey can be justified if the end destination is particularly important, whilst local journeys to uninteresting places are penalised. One might expect rather more traffic between London and New York than London and the (similarly distant) city of Kunduz, Afghanistan! With more time to devote to the topic, it’d be interesting to aggregate the airports by nearest city, and see how closely such a model fits; or conversely, if a new measure of ‘mass’ can be derived from the observed flight data.
Data Acquisition and Processing
This section describes in detail how I obtained and formatted the data, working mostly with networkx in python. I should add the caveat that this was the first time I had attempted many of these tasks (a real project is the best way to learn a programming language!), so often the solution is simply the first thing I cooked up after googling for useful libraries.
Scraping the list of flights
I split this into two functions, for getting a page of data from ba97 (given a date, you can view 6 days worth), and extracting a list of flights from such a page. The first requires urllib2, and the second BeautifulSoup from bs4 and the regular expressions module re.
def get_page_from_date(urlDate): urlStart='http://ba97.com/ba97/calendar/bigcal.asp?start=' urlEnd='&filt=' urlString=urlStart+urlDate+urlEnd usock=urllib2.urlopen(urlString) ba97page=usock.read() usock.close() return ba97page def process_page(ba97page): soup=BeautifulSoup(ba97page) flightList= for link in soup.find_all('a'): match=re.search(r"^\w\w\w-\w\w\w$",link.text) if match!=None: t=match.group() flightList.append([t[0:3],t[4:7]]) return flightList
I built a list of dates, dlist, covering the last year. To save ba97’s server from getting hammered, I’ll not include dlist here, but dates are of the format ’10/1/2011′, that is, US-style month first. Armed with this and the functions above, collecting the flights was simple, if a little slow:
fl= for d in dlist: pd=get_page_from_date(d) fd=process_page(pd) fl.extend(fd) print d
This gave me an initial list of 17,278 ‘flights’, which I converted to a basic networkx graph to do some sanity-checking:
import networkx as nx G=nx.Graph() for f in fl: G.add_edge(f,f) nx.write_gml(G,'ba97v1.gml')
Firing up Gephi and loading this network, I asked it to partition into connected components: there were six, with one giant as to be expected, containing all but a handful of the ‘airports’. Hopping over to the data laboratory, I was able to examine these in greater detail. Proving that it really is a small world, two of the components were related to SEN – the airport at Southend, England, where I grew up! Curiously, the other member of one of the components was Dallas, Fort Worth (dfw): from personal experience of both, I knew it was deeply unlikely that this was an actual flight- and besides, the case was wrong. The regular expression above is not case sensitive, and as luck would have it, there’s a user with the name “dfw-sen” that was being picked up (an earlier, even sloppier regex I used for a trial run had produced an airport ‘of-‘, by matching the user “land-of-miles”, so I was on guard for such issues). However, the other pairing, WAT-SEN, corresponds to Aer Lingus services to Waterford in Ireland, which apparently exist.
There were a couple of other 2-airport components, corresponding to legitimate-looking pairings in Western Canada (YWH-CXH), and the Ukraine (SIP-IEV). Lastly, there was a 5-entry component, with some very odd looking codes: QQS, QQW, XQE, XPG and ZYR. These are not airports, but railway stations! I think they all correspond to stops on the Eurostar service, so they do represent international travel – but since most users of the site are unlikely to be logging their train journeys, I felt it would be best to exclude these for this analysis.
During a later stage of this project I found two more non-airports: Strasborg bus station, and Cologne rail station. Although one can book ‘flights’ to the last of these (actually flying to, say, Frankfurt then continuing by train) I dropped them for simplicity.
I therefore built a ‘kill-list’ of invalid origins/destinations, and removed anything matching those from the flight list- fortunately python string matching is case sensitive, so I wouldn’t kill off valid flights to DFW or SEN. This slightly drops the number of entries for the year to 17,254.
The Final Networks
With this, there is enough information to build a weighted, directed graph on 699 airports, with 4099 routes flown. Here’s the relevant functions:
def countFlight(flightList,countsDictionary): for f in flightList: k=(f,f) if not countsDictionary.has_key(k): countsDictionary[k]=1 else: countk=countsDictionary[k] countsDictionary[k]=countk+1 return countsDictionary def MultiDiGraphFromCounts(countsDictionary): G=nx.MultiDiGraph() G.add_weighted_edges_from([(k,k,countsDictionary[k]) for k in countsDictionary.keys()]) return G def DiGraphFromCounts(countsDictionary): G=nx.DiGraph() G.add_edges_from([(k,k) for k in countsDictionary.keys()]) return G def GraphFromCounts(countsDictionary): G=nx.Graph() G.add_edges_from([(k,k) for k in countsDictionary.keys()]) return G
Then to build a network, it’s simply a case of:
fc=dict() fc=countFlight(fl,fc) H=MultiDiGraphFromCounts(fc) nx.write_gml(H,'ba97v2.gml')
First I built a list airports, which are simply the names of the nodes in the network. I then attempted to build up a dictionary, airportd of metadata collected from Google, via the geopy library:
from geopy import geocoders g=geocoders.Google() import time airportd=dict() for a in airports: ... try: ... place,(lat,lng)=g.geocode(a+' airport') ... airportd[a]=[place,lat,lng] ... time.sleep(0.5) ... print a,place,lat,lng ... except: ... print 'Failed for',a
You are only allowed so many requests to the google geocoder in a day (and the sleep command is needed to prevent too many in a small interval, which also seems to be forbidden), but for a small project like this it is fine. Using an assortment of the other geocoders, I was able to pin down 665 of the airports, leaving 36. I could probably have found other ways to automate the data collection – such as parsing the freely-available Global Airport Database, but with so few, it seemed easier to just run the codes through the Great Circle Mapper and add them to the dictionary myself. A slight wrinkle is that GCM produces neatly-formatted coordinates in degrees, minutes and seconds, such as 50°01’35″N 8°32’35″E for Frankfurt. But geopy’s point function can get you floats from a string representation of those:
from geopy import Point p=Point("50 01m 35s N 8 32m 35s E") p,p (50.02638888888889, 8.543055555555556)
I also had to manually convert some of the airport names to ascii by substituting suitable characters, otherwise export to graphml format fails. Having populated airportd, the following adds the data to a networkx graph G (provided every key of the dictionary is a node label in the graph):
for a in airportd.keys(): ln=airportd[a].encode('ascii','ignore') G.node[a]['long_name']=ln G.node[a]['lat']=airportd[a] G.node[a]['lng']=airportd[a]
Producing a log-log plot of degree data
Given the MultiDiGraph G, I built an array of pairs (x,y) of weighted degrees and their frequency as follows:
weightlist=[G.degree(weighted=True)[a] for a in G.nodes()] weights= weightdict=dict() for w in weightlist: if w not in weights: weights.append(w) weightdict[w]=0 for w in weightlist: weightdict[w]=weightdict[w]+1 D=[(w,weightdict[w]) for w in weightdict.keys()]
The value of alpha and xmin can then be computed with plfit, and a scatter plot of the logs of the entries of D produced:
[alpha, xmin, L] = plfit(weightlist) alpha,xmin logD=[(log(a),log(a)) for a in D] P=scatter_plot(logD,facecolor='green',markersize=25) P
Using the first x after xmin, and its log(x),log(y) value, allows the x-intercept of the line through that point with slope -alpha to be computed and added to the plot.
The geopy library already includes the capability to compute geodesics – shortest paths on the curved surface of the earth, as typically flown by planes. These are often described as ‘great circle’ routes, although an ellipsoidal model is more appropriate- the more familiar ‘straight line’ or ‘Euclidean’ distance from standard geometry is inappropriate as it would require tunneling through the earth! In the course, we have also thought about shortest paths in the context of a graph, where moves are only allowed along edges – these are all examples of ‘metrics’: a graph just gives another, possibly weird, notion of ‘space’ and ‘distance’.
from geopy import distance from geopy import Point def getDistance(flight,airportd): A=airportd[flight] B=airportd[flight] P=Point(A,A) Q=Point(B,B) return distance.distance(P,Q).miles dists= distd=dict() for d in distlist: if d not in dists: dists.append(d) distd[d]=0 for d in distlist: distd[d]=distd[d]+1 D=[(d,distd[d]) for d in distd.keys()]
This was then processed into log-log data and plotted as for weighted degrees in the previous section; distlist was used for testing for a suitable alpha value.