(I drafted this post at the same time as Jane was working on the previous post, and I’ve amended it slightly to pick up on some of the points Jane highlighted. I think I’ve focused on slightly different aspects of similar territory, but I’m conscious there’s probably still some element of overlap. It’s more a set of loosely connected notes than a fully formed post, but I’ve had it on the go for ages and think I’ve slightly lost the thread of where I was trying to go with it – so decided I should just publish it!)
In my previous post, I focused on capturing, tracking and disclosing information about the licences that apply to packages of data within the Linked Archives Hub dataset. Towards the end of that post, I introduced the notion that licence data was just one part of a broader set of metadata that we should capture and make available.
As discussed in that post, the Linked Archives Hub dataset, like the Archives Hub EAD dataset from which it is derived, is an aggregation composed of data contributed from multiple sources. The EAD data contributed to the Hub is all created by cataloguers from the contributor archives, professionals working directly with the materials they are describing, which provides assurances about the nature, quality and accuracy of the information provided in the data.
Linking Lives as Mashup
For Linking Lives, the idea is to bring data from the Linked Archives Hub dataset together with data from further, possibly more diverse, sources. Some of these sources are created by other information professionals; others, such as DBpedia, may be the result of work by a broader community; others may be the product of algorithmic processes. In this context, I think it will be important to be able to indicate clearly which data comes from which source, when it was drawn from that source, who is responsible for the source (assuming that information is available), any processes which have been applied to it, and so on. Providing this information enables a user of the data to make judgements about the quality and currency of the data.
It may well be important for providers, too: links back to original sources may be a “route in” to other resources they offer, and the simple presence of attribution may be important to them, both in terms of disseminating their “brand” and in providing evidence for the use of the data they provide.
(As I was writing this post, I noticed some discussion on Twitter of a recently published paper from Europeana on “business models” associated with metadata in the cultural heritage sector, and “loss of attribution” is highlighted as one of the concerns that institutions had expressed over opening up their metadata collections.)
For the project too, and for Mimas as providers of the Archives Hub service, I think we need to improve our management of this sort of data (as “administrative metadata”, if you like). While our processes aren’t currently very complex, we do need to be better at tracking when data was “ingested”, which versions of which processes were applied at what point in time, and so on.
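To make the idea of this sort of "administrative metadata" a little more concrete, here is a minimal sketch of the kind of record we might keep for each ingested data package. All the names here – the class, the field names, the source URI, the process names and versions – are invented for illustration; this isn't a description of any actual Hub process.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# A minimal, hypothetical record of "administrative metadata" for one
# ingested data package: where it came from, when, and what was done to it.
@dataclass
class IngestRecord:
    source_uri: str                 # URI of the contributed data package
    harvested_at: datetime          # when the data was "ingested"
    processes: list = field(default_factory=list)  # (process, version) pairs

    def applied(self, name, version):
        """Log that a named, versioned process was run over this package."""
        self.processes.append((name, version))

record = IngestRecord(
    source_uri="http://example.org/ead/gb1234",   # placeholder URI
    harvested_at=datetime(2012, 3, 1, tzinfo=timezone.utc),
)
record.applied("ead-to-rdf", "0.3")     # hypothetical transformation step
record.applied("uri-patching", "1.1")   # hypothetical post-processing step
print(record.processes)
```

Even something this simple would let us answer "which version of which process touched this data, and when?" – the question we can't currently answer well.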
Potentially there is quite a lot of information which might be provided in this area, so as Jane discusses, its presentation will need some careful design.
We may wish to build features into the interface which allow a viewer to “drop out”/“add back in” chunks of data based on their assessment of such metadata.
Consider this example from Sig.ma, an application which Jane also mentions in her post. Sig.ma is a viewer application for Sindice, an RDF aggregator which crawls RDF data sources on the Web, aggregates the data it finds and provides various search/display/analysis services over the aggregated data.
Figure 1 shows the Sig.ma page generated by a search for the Ordnance Survey URI for the county of Oxfordshire http://data.ordnancesurvey.co.uk/id/7000000000008328. The URI of the page is http://sig.ma/search?q=http://data.ordnancesurvey.co.uk/id/7000000000008328
The left hand column shows a summary view of the triples that Sindice has cached with (I think) this URI as subject or object. The right-hand column lists the sources from which those triples were obtained. (I find it slightly confusing that in this case the right-hand column shows the “thing URI” rather than the “document URI”).
For each triple in the left-hand column, there’s a reference to one or more data sources in the right-hand column indicating the origin(s) of that triple. If I mouseover a particular triple – an attribute/value pair in the left-hand column – the sources of that triple are highlighted in the right-hand column. So e.g. I can see that the triple providing an area for the county was drawn from the Ordnance Survey source:
And if I mouseover an entry for a particular data source in the right-hand column, then the triples from that source are highlighted in the left-hand column. So e.g. I can see that the dotac.rkbexplorer.com source for Oxford Brookes provides the triples saying that it is located in Oxfordshire and that the county is a “Spatial Thing”:
There are options to switch the data sources in and out, so that the triples from that source in the left-hand column will disappear/reappear.
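The underlying behaviour – triples tracked per source, with sources switchable in and out of the merged view – can be sketched in a few lines. The triples and the area value below are simplified stand-ins, not the actual data Sindice holds:

```python
# Sig.ma-style behaviour: keep triples per source, so a source can be
# switched out and its triples disappear from the merged view.
# The triples below are simplified, illustrative stand-ins.
triples_by_source = {
    "http://data.ordnancesurvey.co.uk/": {
        ("oxfordshire", "hasArea", "2605 sq km"),
    },
    "http://dotac.rkbexplorer.com/": {
        ("oxford-brookes", "locatedIn", "oxfordshire"),
        ("oxfordshire", "type", "SpatialThing"),
    },
}
active = set(triples_by_source)      # all sources switched in initially

def visible_triples():
    """Merge the triples of every currently active source."""
    merged = set()
    for src in active:
        merged |= triples_by_source[src]
    return merged

def sources_of(triple):
    """Which active sources assert this triple? (the mouseover highlight)"""
    return {s for s in active if triple in triples_by_source[s]}

active.discard("http://dotac.rkbexplorer.com/")   # switch a source out
print(len(visible_triples()))  # → 1
```

The `sources_of` function corresponds to the mouseover highlighting described above; the `discard` call corresponds to switching a source out.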
The metadata about the sources presented here is fairly minimal: for each source, there’s an indication of the number of triples from that source, and a date, which I imagine is the date they were harvested. But in principle, one could envisage that extended to display a fuller set of “provenance metadata” for each data source.
The Sig.ma view is something of a “raw” “under the bonnet” (“hood”, for our American readers) view: I’m presenting it here to try to illustrate the general point about tracking the sources of chunks of data and displaying that information; I’m not suggesting that Linking Lives would adopt this same style of presentation.
The other point, as Jane notes in her post, is that Sig.ma surfaces any/all data it crawls: its “selection criteria” are very broad, if you like.
Linking Lives as Aggregator
We haven’t yet made any final decisions on an “architecture” for Linking Lives, but one of the key choices we need to make is whether the user-facing application draws data from the Linked Archives Hub and from other sources “on the fly”/”on demand”, or whether we periodically harvest and store/cache data from other sources and then the user-facing app runs from that stored data. It may be that some hybrid of the two approaches is appropriate, but I suspect that for reasons of performance and reliability some element of harvesting and storing will be required.
And that process of aggregation may go beyond simply collecting and merging data from distributed sources. We might perform some simple inferencing to generate some additional triples to simplify querying. That might be the simple “materialization” of triples based on RDF Schema/OWL semantics, or it might be some more complex rules-based processing, say, to expand a simple statement of a life date into the sort of event-centric form I described here.
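As a rough illustration of what the simple “materialization” case might look like, here is a toy fixpoint computation applying two of the RDFS entailment rules for rdfs:subClassOf. The class and instance names are invented; a real implementation would of course use an existing RDF toolkit rather than tuples and sets:

```python
# Toy "materialization": apply the RDFS rules for rdfs:subClassOf to a set
# of (s, p, o) triples until no new triples appear. Names are illustrative.
SUBCLASS, TYPE = "rdfs:subClassOf", "rdf:type"

triples = {
    ("ex:Letter", SUBCLASS, "ex:Document"),
    ("ex:Document", SUBCLASS, "ex:Item"),
    ("ex:letter42", TYPE, "ex:Letter"),
}

def materialize(triples):
    triples = set(triples)
    while True:
        new = set()
        for s, p, o in triples:
            for s2, p2, o2 in triples:
                # rdfs9: (x type C) + (C subClassOf D) => (x type D)
                if p == TYPE and p2 == SUBCLASS and o == s2:
                    new.add((s, TYPE, o2))
                # rdfs11: subClassOf is transitive
                if p == SUBCLASS and p2 == SUBCLASS and o == s2:
                    new.add((s, SUBCLASS, o2))
        if new <= triples:          # fixpoint reached
            return triples
        triples |= new

closed = materialize(triples)
print(("ex:letter42", TYPE, "ex:Item") in closed)  # → True
```

Querying the materialized store for everything of type ex:Item then finds ex:letter42 directly, without the query itself having to walk the subclass hierarchy – which is exactly the “simplify querying” pay-off mentioned above.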
And (again returning to one of Jane’s points), there is a process of selection involved – choosing particular data sources, and perhaps choosing particular subsets of data from within those sources – and it can be argued that that selection in itself – if performed effectively! – may be a source of “added value”.
I think sometimes when we’re discussing “aggregations” there’s a tendency to think in terms of “aggregations” as quite large-scale, “formal” collections, scoped by “domain” or “community” boundaries. But it seems to me we form aggregations like this all the time, typically pulling together specific subsets of data to address specific requirements of some particular application, where the scope is determined by the “functions” of that particular application.
In the discussion above, I considered the provenance issue mainly from the perspective of presenting information to a human reader. But in principle at least, the data presented in human-readable form by Linking Lives could also be re-exposed as new graphs, new packages of RDF data, available via content negotiation. I think this is one of the characteristics that led Kingsley Idehen to label this sort of application as “meshups”, to distinguish them from the traditional “mashups”.
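The content negotiation mentioned above can be sketched very simply: the server looks at the HTTP Accept header and picks a representation it can serve. This is a deliberately naive version – it takes the first recognised media type and ignores q-values, which a real implementation would need to honour – and the set of formats offered is an assumption, not a decision the project has made:

```python
# Naive content negotiation sketch: map an HTTP Accept header to a
# serialization for a re-exposed RDF graph. Formats offered are assumptions.
FORMATS = {
    "text/turtle": "turtle",
    "application/rdf+xml": "rdfxml",
    "text/html": "html",          # the human-readable view
}

def choose_format(accept_header, default="html"):
    """Return the first media type in the Accept header we can serve."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip()   # drop any q= parameters
        if media_type in FORMATS:
            return FORMATS[media_type]
    return default

print(choose_format("text/turtle, text/html;q=0.8"))   # → turtle
print(choose_format("image/png"))                      # → html
```

The same URI would then yield either the human-readable page or a machine-readable RDF graph, depending on what the client asks for.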
With these two factors (aggregating data from different sources into a store, and (re)exposing data as RDF) in mind, I think it becomes easier to see some similarities between this and the architectures I sketched for the Linked Archives Hub in the previous post.
In the diagrams in that post, I had the RDF graphs derived from EAD documents as the inputs on the left hand side, and the Linked Archives Hub “bounded description” pages as outputs on the right. For Linking Lives, we would have data from the Linked Archives Hub (either the individual pages or some other subsets generated by SPARQL queries) and similar from the other sources, as the inputs on the left, and the Linking Lives “views” on that data on the right.
In the Linked Archives Hub case, the “container” for our aggregation was a single triple store; for Linking Lives, it may be something similar or it may be more of a “hybrid” of data stored centrally and data retrieved in real time. But logically, I think the picture is probably very similar.
In the previous post I focused on metadata about licensing associated with each source; but as I acknowledged towards the end of that post, licence metadata would be part of a broader set of provenance metadata associated with each source.
I’ve tried to reflect this in a modified version of the diagrams from the previous post. I reduced the number of sources to two, to try to reduce the number of criss-crossing lines, but it is still rather messy! My point is really to try to illustrate the broad similarity of the two scenarios.
A few points to note:
- I added the pale green boxes to represent the human-readable HTML output pages – analogous pages are present for the Linked Archives Hub case too, but for simplicity I omitted them in the diagrams in the previous post.
- I’ve tried to indicate the relationship between the metadata “about” the output graphs and that “about” the input graphs. We’ll need to be careful with issues of identity here, I think – when are we talking about the same graph and when are we talking about distinct graphs, one derived from the other.
- Bearing in mind what I said above about the “architecture” still being a work-in-progress, what is represented above as a single “store” may in practice turn out to be several distinct components.
I think one of our next tasks should be to focus in on some scenarios/use cases to distill out some requirements for provenance metadata, and give ourselves a clearer idea of what metadata we need to collect or generate.