BBC Online@20: Updating the BBC Web Archive
Carl Davies
Service Development and Delivery Manager
The BBC Web Archive has grown at an increasing rate over the last couple of years. The BBC has one of the biggest websites in the world, dating back to 1994, so pulling together so much material has been a huge task. How does our web archive collection break down? Here goes...
Since the mid-nineties, copies of various BBC websites have been deposited with the Archives on an ad hoc basis, often on disc and including the website code. These included the very first BBC website, known as the BBC Networking Club. By 2004 it was clear that a new web archive solution was needed, so the Archive joined forces with our technology partners to create an in-house web archive system, the BBC Web Archive 1.0. It took copies of large parts of the BBC website and of any new or amended page, which created millions of pages within the first web archive. From 2008 the BBC moved to dynamic web hosting with much more interactive content, which this older automatic archive system could not cope with as well, and it was effectively closed in 2010.
This meant looking for a new solution, which became very pressing when BBC Online was rationalised in 2011 after a review and given a major facelift. This resulted in hundreds of websites closing or moving. We archived all the code for these website closures and redesigns, and also introduced screencasting into our archive capture in order to emulate the user journey within many of our major websites. We have also selected for the archive nearly all the exclusive video, audio, podcasts and games, because these have intrinsic value beyond the website itself. So if there are web-exclusive films from Live Lounge, for example, we have archived them so that one day a programme maker can re-use them in television documentaries.

The BBC's 1997 election website
The next step was to introduce a more robust web archive process, in line with what The British Library and The National Archives currently undertake with our national and government web resources. That led us to adopt a more focused web crawling solution. To backfill our archives with older crawls, we linked up with the Internet Archive and took a copy of 16,000 web crawls they held, dating from 1996 to 2014. Together these crawls constitute over 160 million archived web pages of BBC Online. In addition, we have been working with the web archive specialists ‘Hanzo’ on a proof-of-concept web crawl of bbc.co.uk. This is the most extensive and highest-quality crawl of our website ever, at approximately 10 million pages.
The next steps will be to convert our BBC Web Archive 1.0 and any older archived sites into WARC, the international web archive standard, so they can sit alongside the crawls we have obtained from the Internet Archive (160 million pages) and the current high-quality proof-of-concept crawl (10 million pages). Once these collections are together, we will look towards storing them on a server, indexing the WARC crawls and backing them up to LTO data tape. Finally we will place a search interface on top of our web archive collections, much like the open source Wayback Machine software. This will initially be for research and internal re-use, but it will also help us manage and preserve our website history. We are also archiving the BBC’s eBooks and iWonder guides, which are new and innovative ways of hosting BBC content online. And we have a rich document archive of the BBC’s online innovations, particularly around iPlayer.
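For readers curious what converting a page to WARC and indexing it might involve, here is a minimal Python sketch. It is not the BBC's tooling: the function names `make_warc_record` and `build_index`, the `example.org` addresses and the choice of the `resource` record type are all assumptions made for illustration, and a production archive would use an established WARC library and a proper index format.

```python
import io
from datetime import datetime, timezone
from uuid import uuid4


def make_warc_record(target_uri: str, payload: bytes, captured_at: datetime,
                     content_type: str = "text/html") -> bytes:
    """Serialise one archived page as a WARC 1.0 'resource' record.

    Hypothetical helper: the name and signature are illustrative only.
    """
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid4()}>",
        f"WARC-Date: {captured_at.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF, a blank line separates them from the
    # content block, and two CRLFs close the record.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    return head + payload + b"\r\n\r\n"


def build_index(warc_bytes: bytes) -> dict[str, int]:
    """Map each WARC-Target-URI to the byte offset of its record."""
    index: dict[str, int] = {}
    stream = io.BytesIO(warc_bytes)
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break  # end of file
        if line.strip() != b"WARC/1.0":
            continue  # blank separator lines between records
        uri, length = None, 0
        while True:
            raw = stream.readline()
            if raw in (b"\r\n", b""):
                break  # end of this record's headers
            name, _, value = raw.decode("utf-8").partition(":")
            if name.strip().lower() == "warc-target-uri":
                uri = value.strip()
            elif name.strip().lower() == "content-length":
                length = int(value.strip())
        # Skip the content block so page content is never parsed as headers.
        stream.seek(length, io.SEEK_CUR)
        if uri is not None:
            index[uri] = offset
    return index


# Two placeholder pages written back to back, then indexed by address.
page_a = make_warc_record("http://example.org/a", b"<html>A</html>",
                          datetime(1997, 5, 1, tzinfo=timezone.utc))
page_b = make_warc_record("http://example.org/b", b"<html>B</html>",
                          datetime(2014, 1, 1, tzinfo=timezone.utc))
offsets = build_index(page_a + page_b)
```

An index of this kind is what lets a search interface jump straight to one archived page inside a very large file instead of reading the whole crawl.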
So there is a fair amount of work to do, but we now have a vast and rich Web Archive, with over 200 million archived pages from 20 years of BBC Online, dating from the 1994 BBC Networking Club through to the present day.
Carl Davies is an Archive Innovation Executive, BBC Information and Archives
