I had the very great pleasure of speaking at the ‘Libraries, Media & The Semantic Web’ event hosted by the BBC Academy last Wednesday, along with folks from the New York Times, the BBC, Google in the guise of Schema.org, Historypin and KONA. The event was organised by the Lotico London Semantic Web Group.
The General Manager for News and Media at the BBC, Phil Fearnley, introduced the event, and immediately caught our attention by informing us that the BBC is continuing to make a substantial commitment to semantic web technologies, having devoted 20% of its entire digital budget to activities underpinned by this technology. Nice one Phil.
After a few opening words from Marco Neumann of Lotico, Jon Voss was up, giving us a briefing on Linked Open Data in Libraries, Archives and Museums (LODLAM) efforts around the world, and upcoming plans within the community. He talked about how the first International LODLAM Summit, held in San Francisco last year, has galvanised the LODLAM community and helped kick off a number of activities. Jon was the main convener of the summit, and kindly asked me to be on the organising committee, so, although you could say I’m biased, I can vouch for the fact that it was a great event. He also mentioned how the number of LODLAM events across the world has grown, with meetups in Australia, the UK and a number of places around the USA. Jon also talked about some recent work Historypin are doing to allow users to dig deeper into archival records based on time and place, to enhance the Historypin experience using linked data principles. He wrapped up by emphasising the importance of open licenses, and how open data has to come before linked open data.
I (Adrian Stevenson, Senior Technical Innovations Coordinator at Mimas, University of Manchester) was up next, giving a whistle-stop tour of UK LODLAM activities. Given that I was in the vicinity of where the classic glam rock bands have played, I couldn’t resist the temptation to use the galleries, libraries, archives and museums ‘GLAM’ acronym for my presentation title, and throw in a glitter platform shoe on the opening slide. I covered the work of the LOCAH and Linking Lives projects, before giving a heads up to a number of the JISC-funded Discovery projects doing linked data work, including the Bricolage project in which our own Pete Johnston is involved, and the newish World War One Discovery project I’m working on. I finished up by focussing on particular challenges we’ve met on LOCAH and Linking Lives, namely the difficulty of creating links based around names, and the general problem of finding data to link to.
We then moved to the media perspective on things, with Evan Sandhaus, lead architect for semantic platforms at the New York Times, giving us the lowdown on rNews, an embedded data standard for the news industry from the IPTC. Evan explained the ‘silly’ situation we’ve ended up in, where the data content of news articles is kept in structured form behind the scenes in databases, but this structure is lost when the data is presented to the Web in HTML. To address this weakness, the IPTC came up with the rNews data standard, which is defined as “a data model for embedding machine-readable publishing metadata in web documents and a set of suggested implementations”. Currently there are RDFa and HTML5 implementations, with a JSON implementation under consideration.
Addressing the benefits, Evan explained that rNews can provide superior algorithmically generated links, such as those generated by Google Rich Snippets, thereby improving referral traffic. In addition, it can allow for better analytics provided by the better quality data. It was noted, however, that these benefits will depend on the wide adoption of rNews in the community. He then gave a short history of the development of rNews, culminating in the announcement that it has now been adopted by the New York Times, and is used on all news articles published after 29th January 2012. Evan mentioned how the arrival of Schema.org, which essentially does the same thing as rNews, caused something of an “existential crisis”. Fortunately, the organisations have worked together, and schema.org has now been expanded to absorb about 98% of the rNews data model.
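As a rough illustration of the point about structure being lost in HTML, here is a minimal sketch of rendering a database-style record so that the structure survives as embedded markup. It is not taken from Evan's talk: the record, its values and the `render_article` function are all made up for illustration, and I've used schema.org `NewsArticle` microdata properties (`headline`, `datePublished`, `author`, `articleBody`) rather than rNews's own RDFa form.

```python
from html import escape

# Hypothetical article record, standing in for a row in a newsroom database.
article = {
    "headline": "Example headline",
    "datePublished": "2012-01-30",
    "author": "A. Reporter",
    "articleBody": "Body text of the story.",
}


def render_article(record):
    """Render a record as HTML that keeps its structure as schema.org microdata."""
    date = escape(record["datePublished"])
    return (
        '<article itemscope itemtype="http://schema.org/NewsArticle">\n'
        f'  <h1 itemprop="headline">{escape(record["headline"])}</h1>\n'
        f'  <time itemprop="datePublished" datetime="{date}">{date}</time>\n'
        f'  <span itemprop="author">{escape(record["author"])}</span>\n'
        f'  <div itemprop="articleBody">{escape(record["articleBody"])}</div>\n'
        '</article>'
    )


html_out = render_article(article)
print(html_out)
```

A consumer such as a search engine can then read the headline, date and author straight from the attributes rather than guessing them from the page layout.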
Dan Brickley from Google, working on Schema.org, gave a really interesting talk looking back at the history of search and structured data over the past 100 years. He used this as a way to highlight the connections between the GLAM sector, the media, and the problems schema.org is aiming to solve. Dan proposed the notion that somewhere in Belgium, semantic search over structured data went mainstream as long ago as 1912. He backed this up by quoting some search queries logged in the 1912 annual report from the Belgian Institute of Bibliography. Dan went on to talk about Lonclass, a BBC media archives classification system still used today. Dan suggested that Lonclass is based on structured semantic data, having compositional semantics predating computing. Using Lonclass, it’s possible to build sentences from its semantics, e.g. the Lonclass code ‘656.881:301.162.721’ for “Letters of apology” can be combined with the codes for ‘resignation letters’, ‘Margaret Thatcher’ etc.
Dan described how Schema.org, launched in June 2011, is essentially the result of a loose collaboration of engineering groups from Google, Bing, Yahoo & Yandex. Having been somewhat behind the scenes, they are moving increasingly to a collaboration model in the public space, the vocabulary development now being hosted by the W3C. Google Rich Snippets was cited as the best known way in which this markup is being used, and the business story is that if you use schema.org markup, your page is better described, you get more click-throughs, and people can better understand search result lists. He noted there’s also an advertising aspect, though this is not part of Dan’s work. The overarching aim is to give more accurate search results. Dan reckons schema.org counts as linked data, as the markup that describes someone, say Douglas Adams, points off to another page providing more info about Douglas Adams. Dan rounded off suggesting Schema.org is basically a dictionary of terms drawing on the everyday scenarios of search. It was interesting to note that he thinks the semantic web world is too polite in feeling the need to use other people’s terms. Schema.org is relatively ‘rude’, having about 300 terms, but he believes this makes it easier to deploy.
Silver Oliver from BBC News and Knowledge outlined how they’ve been doing ‘more of the same’, building on the semantic web work used for the World Cup and applying it to the new sports site, and the upcoming 2012 Olympics site. There’ll be representations for every athlete, medal event, venue, and so on. The underlying linked data principles are the same, i.e. tagging with HTTP URIs that are then used as hooks into the web graph. They’ll be using GeoNames for locations, hooked onto IOC Olympic content, which typically comes in spreadsheet form. They use Google Refine with the DERI RDF plugin to get RDF from spreadsheets, then add in other existing BBC RDF content, stitching these datasets together to create useful graphs. This approach gives the benefit of providing ‘page furniture’, for example, using information on the country Jamaica, and the IOC statistics on Jamaica’s performance in the Olympics, to frame and enhance the BBC content on Jamaican athletes.
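To make the spreadsheet-to-graph idea a bit more concrete, here is a minimal sketch in plain Python. It is not the BBC's pipeline (they use Google Refine with the DERI RDF plugin); the column names, the `example.org` URIs, the `representsCountry` and `medalCount` properties and the GeoNames identifier are all placeholders I've invented. The only things assumed to be real are the FOAF `name` property and the general shape of GeoNames feature URIs.

```python
import csv
import io

# Hypothetical spreadsheet export; columns and values are invented for illustration.
SHEET = """athlete_id,name,country_geonames_id
a1,Example Athlete,1234567
"""

BASE = "http://example.org/athlete/"                        # hypothetical URI space
GEONAMES = "http://sws.geonames.org/"                       # GeoNames feature URI prefix
FOAF_NAME = "<http://xmlns.com/foaf/0.1/name>"
COUNTRY = "<http://example.org/vocab/representsCountry>"    # hypothetical property
MEDALS = "<http://example.org/vocab/medalCount>"            # hypothetical property


def uri(value):
    """Serialise a URI as an N-Triples term."""
    return f"<{value}>"


def literal(value):
    """Serialise a plain string as an N-Triples literal."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def rows_to_triples(text):
    """Turn each spreadsheet row into (subject, predicate, object) triples."""
    triples = set()
    for row in csv.DictReader(io.StringIO(text)):
        subject = uri(BASE + row["athlete_id"])
        country = uri(GEONAMES + row["country_geonames_id"] + "/")
        triples.add((subject, FOAF_NAME, literal(row["name"])))
        triples.add((subject, COUNTRY, country))
    return triples


def describe(graph, node):
    """Return every triple in the graph whose subject is the given node."""
    return sorted(t for t in graph if t[0] == node)


sheet_triples = rows_to_triples(SHEET)

# A second, separately sourced dataset that happens to use the same country URI.
country_uri = uri(GEONAMES + "1234567/")
stats_triples = {(country_uri, MEDALS, literal("0"))}  # placeholder statistic

# 'Stitching' is just a union: the shared URI is what joins the two datasets.
graph = sheet_triples | stats_triples

athlete = uri(BASE + "a1")
countries = [o for s, p, o in graph if s == athlete and p == COUNTRY]
page_furniture = describe(graph, countries[0])

for s, p, o in sorted(graph):
    print(f"{s} {p} {o} .")
```

The point of the sketch is the join: because both datasets name the country with the same HTTP URI, the statistics about the country can be pulled in alongside the athlete without any bespoke matching.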
Silver mentioned that Google is their biggest data consumer, using their microdata and RDFa. He noted that the 2012 Olympics pages will have schema.org data in, and also mentioned work using hRecipe for exposing structured recipe information: these have surfaced really well on Google.
Yves Raimond from BBC R&D then talked about the challenge of surfacing the huge amount of excellent BBC archive content, and the challenge of making it connect with current content. The BBC has a massive archive, but tagging has only been used for a few years, and much of the archive has only very sparse and often incorrect metadata. He described how they’ve been using automated tagging with linked data URIs to make connections to current content to help push the archive to users. They’ve been trialling the approach on the World Service archive, which contains a massive audio database. They’re using a piece of software they’ve developed called ‘KiWi’, built with open source components, and some custom-built algorithms to automatically tag content. CMU Sphinx is used to create ‘very noisy’ speech-to-text transcripts. More will be published on how they’re using KiWi in the next few months. Yves then showed us examples of autotagged programme content. As he noted, it appears to do a decent job, but some of the tags are wrong. He mentioned the possibility of using crowdsourced tags to improve the accuracy of the content.
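For a feel of what tagging a noisy transcript with URIs might involve, here is a deliberately naive sketch. It is not how KiWi works (those details haven't been published yet); it simply counts occurrences of known labels in a transcript and returns the matching URIs, most frequent first. The label list, the `example.org` URIs and the `tag_transcript` function are all hypothetical.

```python
import re

# Hypothetical label -> URI lookup; a real system would draw on a far larger dataset.
LABELS = {
    "world service": "http://example.org/id/world-service",
    "jamaica": "http://example.org/id/jamaica",
}


def tag_transcript(text, labels=LABELS, min_count=1):
    """Return (uri, count) pairs for labels found in the text, most frequent first."""
    # Lower-case and strip punctuation so matching tolerates messy transcripts.
    normalised = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    tags = []
    for label, uri in labels.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        count = len(re.findall(pattern, normalised))
        if count >= min_count:
            tags.append((uri, count))
    return sorted(tags, key=lambda pair: (-pair[1], pair[0]))


transcript = "the World Service reported from Jamaica, jamaica again"
tags = tag_transcript(transcript)
print(tags)
```

Raising `min_count` is one crude way to trade recall for precision when the transcript is noisy, which is the same tension Yves described with some tags coming out wrong.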
That was basically it for the presentations part of the proceedings. All the speakers then came up for a short Q&A session, mainly focussed on the media side of things, and after this we headed to the nearest bar. All in all it was a great evening, and I felt quite privileged to be part of a panel of such esteemed experts.
I’ve included the speakers’ slides below, where I’ve been able to track them down: