
The Information Sources are examples of philanthropic organisations which generously make the information they hold available in a structured or readily accessible form (e.g. downloadable files or a standard data exchange format). More often, you will come across websites that have what you’re looking for, but where the information is spread over multiple web pages or is not presented in a way that makes it easy to collate, extract and download. However, if the Hyper Text Markup Language (HTML) used to create the web pages is well-defined, ‘Web Scraping’ tools and techniques may help to extract and manipulate the data.

Below is a Case Study of just such a situation. In this example, I explain how I used Python to scrape data from a website for a specific purpose. The approach is applicable to a range of similar problems.

Problem Statement:

The Power of 10 is a website which publishes race results for amateur and professional athletes participating in UK events. The website provides search facilities (i) to list all the members of a registered athletics club or, (ii) for a single, named athlete, to display the events they participated in and their performance. The website does not provide a facility to list the events and recorded times for all athletes in a club.

Figure 1: The Power of 10 home page

A local running club uses the website to look up the performances recorded for their athletes. The club’s administrator has to perform multiple search tasks, one for each of the 117 athletes, and compile the individual results into a report. The process of accessing, extracting and collating the relevant details for each athlete is tedious and time-consuming, and can take a couple of hours each week.

Web Site Functional Analysis

From the home page of the ‘Power of 10’ website users can enter a club name – in this example “Woodbridge”. The application then presents a web page listing the members of the relevant club. Note the URL (http://www.thepowerof10.info/athletes/athleteslookup.aspx?surname=&firstname=&club=Woodbridge). In order to access details of the events an athlete has participated in, and their performance, the user needs to select the relevant ‘Show’ link under the heading ‘Profile’.

Figure 2: List of club members

The HTML code can be viewed by ‘right clicking’ the web page and selecting ‘View page source’ (see Figure 3). The resulting code can be seen in Figure 4.

Figure 3: Accessing web page HTML code

If you are familiar with HTML you will be able to spot the tags which are used to define and format blocks of text and tables. The tags <tr> and </tr> define the start and end of a row of information, and the tags <td> and </td> delimit data cells in a table. There are multiple rows in each web page, so the Python code needs to be able to distinguish those which are part of the members table (with the headers ‘First’, ‘Surname’, ‘Track’ etc.) from the rest.

Analysis of the HTML shows that each row contains 9 data cells and that the value in the 7th column (headed ‘Club’) is consistently ‘Woodbridge’. These are characteristics that can be used to identify the relevant table rows and distinguish them from other tables on the same page.

Figure 4: HTML code for the club members’ web page

After selecting the ‘Profile’ of an athlete the web application displays any relevant events they competed in (see Figure 5). It is important to scrutinise the URL of the web page (http://www.thepowerof10.info/athletes/profile.aspx?athleteid=796371). The ‘athleteid’ is not an attribute shown on the ‘list of athletes’ web page (Figure 2), but careful examination shows the unique athlete identifier is embedded in the <a href> tag of the code (Figure 4). Consequently, in order to navigate to each athlete’s performance record the Python code needs to extract the ‘athleteid’ from the relevant data cell and tag in the HTML of the club members’ web page (Figure 4).

Figure 5: Details of events and performance for one club member

The web page for each athlete’s performance record can be examined, as before, with ‘View page source’. Again the HTML code has structured tables, with the tags <tr> & </tr> delimiting rows and <td> & </td> defining data cells. However, examination shows there are 12 cells in each row of this table.

Figure 6: HTML code of one club member’s events and performance

Code Schema Workflow

This analysis of the structure and function of the website gives sufficient insight to start building the Python web scraping programme. A schematic for the sequence of steps required to set up Python, access the website, analyse the HTML, and extract the relevant details is given in Figure 7.

Figure 7: Python programme schema

Code

At the start of the programme the required Python library functions need to be imported:

  • urlopen – allows website URLs to be accessed
  • BeautifulSoup – functions to support extracting data from HTML files
  • pandas – structures data into tables for analysis and manipulation
  • re – regular expression module for matching text strings
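
A minimal sketch of the import section, assuming Python 3 (where urlopen lives in the urllib.request module) and version 4 of BeautifulSoup:

    from urllib.request import urlopen   # opens website URLs
    from bs4 import BeautifulSoup        # extracts data from HTML
    import pandas as pd                  # tabular data structures
    import re                            # regular expression matching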

Acknowledgements: thanks to Jean-Nicolas Hould (Scraping for Craft Beers: A Dataset Creation Tutorial) and Greg Reda (Web Scraping 101 with Python), whose articles are useful introductions to this subject.

The next steps involve opening the Power of 10 URL (using the Python function ‘urlopen’), analysing the HTML to find the table which holds the athlete names, and extracting the relevant fields. This is achieved through the Python module ‘BeautifulSoup’ and two bespoke functions: get_all_athletes() and is_athlete_entry().
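
A sketch of the first of these steps, assuming the ‘html.parser’ backend that ships with Python:

    # Open the club search URL and parse the returned HTML.
    url = ('http://www.thepowerof10.info/athletes/athleteslookup.aspx'
           '?surname=&firstname=&club=Woodbridge')
    soup = BeautifulSoup(urlopen(url), 'html.parser')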

The function get_all_athletes() declares a new array ‘athletes’ to store the data from the source table. It iterates through each row of the web page (identified by the tag <tr>) and uses the function is_athlete_entry() to check whether the row content has the characteristics of an athlete entry. That function tests whether the row has 9 data cells (identified by the tag <td>) and whether the 7th field contains the text string “Woodbridge”. If the 7th column is empty this test would produce a run-time error, so I had to include a ‘try’ clause to confirm there is a valid entry. For each row that passes the check, the code takes the values from the table columns and assigns them to the variables “forename”, “surname”, “trackcat” etc.
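
A sketch of the row test, with one assumption flagged: the text does not show exactly which exception the ‘try’ clause guards against, so a broad guard is used here:

    def is_athlete_entry(table_row):
        # A club-member row has exactly 9 data cells, and the 7th
        # cell (index 6, zero-based) holds the club name.
        row_cells = table_row.find_all('td')
        try:
            return (len(row_cells) == 9
                    and 'Woodbridge' in row_cells[6].text)
        except (AttributeError, IndexError):
            # Covers the empty-cell case described above.
            return False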

Note that Python uses standard zero-based indexing, i.e. the first column corresponds to array position 0, the second to array position 1, and so on.

Inspecting the code it is apparent that the ‘athleteid’ is not available in a data cell as a textual value, but can be found embedded in a string within the <a href> tag. Consequently, the command to extract the identifier is different from the other fields. The reference ‘row_cells[7].a’ refers to the <a> tag of column 8. The ‘str’ function allows the tag to be manipulated as a text string: [32:38] selects the substring at index positions 32 to 37, where the ‘athleteid’ can be found. Running the code a few times showed that ‘athleteids’ are not always 7 digits, and for shorter strings superfluous characters (such as quotes [“] and tag brackets [>]) need to be removed.
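
A sketch of get_all_athletes() under two labelled assumptions: each entry is stored as a dictionary (which would explain the alphabetical column ordering noted below), and the stray characters are stripped with the ‘re’ module; only the fields named above are shown:

    def get_all_athletes(soup):
        # Collect one entry per club-member row on the parsed page.
        athletes = []
        for table_row in soup.find_all('tr'):
            if is_athlete_entry(table_row):
                row_cells = table_row.find_all('td')
                # The athlete id is embedded in the <a href> tag of
                # the 8th cell; the [32:38] slice mirrors the text
                # above, and re.sub strips any stray quotes or '>'.
                raw = str(row_cells[7].a)[32:38]
                athlete_entry = {
                    'forename': row_cells[0].text,
                    'surname': row_cells[1].text,
                    'trackcat': row_cells[2].text,
                    'athleteid': re.sub(r'[^0-9]', '', raw),
                }
                athletes.append(athlete_entry)
        return athletes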

Finally, the entry ‘athlete_entry’ derived from each table row is appended to the array ‘athletes’, and it is this array which is returned and assigned to ‘athletes_list’. This is then converted to the pandas dataframe ‘df’. The output of this dataframe shows that the fields have been reordered alphabetically based on the names of the column headings.
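
The conversion itself is a couple of lines; note that older versions of pandas sort dictionary keys alphabetically when building a dataframe, which matches the column ordering described above (pandas 0.23 and later preserve insertion order):

    athletes_list = get_all_athletes(soup)
    df = pd.DataFrame(athletes_list)
    print(df.head())   # columns may appear in alphabetical order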

While not essential to the programme, the contents of the pandas dataframe ‘df’ holding the details of the club members can be exported to a named comma-separated values (CSV) file using the command df.to_csv('filename.csv').

The next phase of the programme involves a loop (using the variable ‘i’) which opens the web page of each club member in turn. The row labels of the dataframe ‘df’ can be obtained from ‘df.index.values’. On each iteration the ‘athleteid’ is found at row ‘i’ and column ‘athleteid’ of the dataframe ‘df’; the value obtained is assigned to the temporary variable ‘string’ and used to create the required URL.
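
A sketch of that loop; the URL template follows the profile address examined earlier:

    for i in df.index.values:
        # Look up this athlete's id and build their profile URL.
        string = df.loc[i, 'athleteid']
        url = ('http://www.thepowerof10.info/athletes/profile.aspx'
               '?athleteid=' + string)
        profile_soup = BeautifulSoup(urlopen(url), 'html.parser')
        # ... extract this athlete's performances (next step) ...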

As in the earlier explanation, the code needs to step through each row of the web page and, where the table characteristics match those of an athlete’s race record, assign the relevant data to specific variables in the array ‘athlete_time_entry’. In this case 12 data cells are expected. Having run the code, some inappropriate rows were extracted, and in order to deselect them an additional test was added (row[0].text must not contain the string ‘Event’), as shown in the sketch below.
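
A sketch of the extraction, with the function name and the field names chosen for illustration (the site’s actual column order is not reproduced here); the athlete’s name is passed in so it can be carried into each entry:

    def get_athlete_times(profile_soup, forename, surname):
        # Collect performance rows: 12 data cells, excluding the
        # header rows whose first cell contains 'Event'.
        athlete_times = []
        for table_row in profile_soup.find_all('tr'):
            row = table_row.find_all('td')
            if len(row) == 12 and 'Event' not in row[0].text:
                athlete_time_entry = {
                    'Forename': forename,
                    'Surname': surname,
                    'Event': row[0].text,
                    'Perf': row[1].text,
                    # ... map the remaining cells to named fields ...
                }
                athlete_times.append(athlete_time_entry)
        return athlete_times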

The athlete performances were added to the pandas dataframe ‘dft’. By default the column headings are arranged alphabetically, which isn’t ideal. The data fields (e.g. ‘Forename’, ‘Surname’) can easily be reordered by pandas to the desired format using the command dft[['field1', 'field2', …]]. Finally, the contents of the dataframe ‘dft’ holding each athlete’s performance record can be exported to a named CSV file using the command dft.to_csv('filename.csv').
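
For example, sticking with the illustrative field names used above and an assumed output filename:

    # Reorder the columns to the desired layout, then export.
    dft = dft[['Forename', 'Surname', 'Event', 'Perf']]
    dft.to_csv('athlete_times.csv', index=False)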