As a follow-up to Jane’s post on licensing, I thought I’d write up a few notes looking at some of the practical challenges of managing the Hub linked data and the licence(s) that apply to it. I’m not an expert in rights/licensing at all, so here I’m mostly thinking about the “mechanics” of “labelling” “packages” of data with appropriate licences (as part of other metadata that might be applied to those packages) at various points in the system. The thoughts here are prompted mostly by my thinking about the current LOCAH system rather than anything more specific to the “Linking Lives” project. I should try to write more about that in a follow-up post, but I think many of the underlying issues are common to both contexts.
Figure 1 represents the current situation, with data flowing (roughly) from left to right.
The “input packages” here are EAD XML documents (e.g. EAD 1 and EAD 2 in the diagram), provided by multiple archives (contributors to the Archives Hub “aggregation”, to use the terminology of the UK Discovery initiative). These are converted to RDF/XML documents on a one-to-one basis, i.e. each EAD XML document is transformed into a single RDF/XML document, a serialisation of a single RDF graph, a set of RDF triples (Graphs i1 and i2 in the diagram). Triples are the “atomic units” of our data, with each triple expressing some simple assertion of a relationship between two things.
That RDF data is uploaded to a triple store. (Our current process actually involves converting the RDF/XML into another format, but to simplify this discussion, I’ll ignore that and focus on the graphs-as-sets-of-triples.)
Some further RDF data not derived directly from the EAD data (Graph iX) is also uploaded to the triple store. Currently this includes data about repository postcodes, links to external resources generated through a post-conversion process, and data about the dataset itself. As these have different sources and characteristics, it would be more appropriate to represent several different graphs here, but again for simplicity, let’s assume we just have the one “package” to deal with.
This merged data is then exposed as Linked Data, in the form of a second set of graphs (Graphs o1, o2 and o3 on the right of the diagram). These are “bounded descriptions” of single resources, i.e. packages of data “about” that single resource generated by querying the triple store. Roughly, they are sets of triples with the URI of a single resource as subject or object (for a more formal discussion, see e.g. “Bounded Descriptions”). Again, they are made available in multiple formats, but, again, let’s leave that to one side here. Currently, that “expose” process applies the same licence to each of those output graphs, and a triple stating the relationship between graph/document and licence is included in the output graphs, i.e. it forms part of the graph-/document-level metadata provided on output – the graph is “labelled” with that licence, if you like.
So for example, the bounded description of the Beverley Skinner Collection includes the triple:
```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://data.archiveshub.ac.uk/doc/archivalresource/gb1086skinner>
    dcterms:license <http://creativecommons.org/publicdomain/zero/1.0/> .
```
Inputs, outputs, decomposition and recomposition
One of the points to note here is that there is no simple correspondence between the content of those input graphs on the left and the bounded description output graphs on the right: for a single output graph, the triples within it may have originated from a single input graph or in multiple input graphs. I’ve tried to highlight this with the dashed boxes and arrows overlaid in Figure 2. Output graph o1 (lower right) contains a subset of the triples from input graph iX, output graph o2 (centre right) contains the same triples as input graph i2, and output graph o3 (upper right) contains triples from both graph i1 and graph iX.
For examples of these cases in the “live” LOCAH data, consider:
- A description of an “archival resource”, like that of the Beverley Skinner Collection: in this case the data is all (I think) from the single input graph derived from the corresponding EAD document.
- A description of a “concept” the name of which has been used as an index term, like that of the concept of “Economic history” from the UNESCO Thesaurus: here, the data includes links from two different archival resources, one held by the British Library of Political and Economic Science and one held by Glasgow University Archives. Those links have their origin in two different input graphs, derived from two different EAD documents, created and contributed to the Hub by two different institutions.
So there is a process of “decomposition”/”recomposition” or “unpackaging” and “repackaging” taking place here. On upload to the triple store, the input graph/”package” is decomposed into its constituent atomic triples/assertions; the process of exposing Linked Data recomposes subsets of those triples into a new set of output graphs/”packages”.
I think this is a characteristic of many (most?) applications based on aggregating data from multiple sources, though the degree of decomposition/recomposition may vary. However, I think it may be a characteristic that is perhaps unfamiliar/unusual from the “traditional”, “document-centric” perspective of archival description. From that perspective, an archival finding aid is typically seen as something created, distributed, and “used” or “consumed” as a single document – the finding aid document is the “package”. Yes, there are identifiable individual component “descriptions” within that document, but they are typically “read”/interpreted within the context of that whole. The design of the EAD XML format, I think, largely reflects that, though of course applications based on EAD often do extract, index, and (re-)present components in new combinations.
In the current LOCAH system, the triple store is essentially a “big pot of triples”: it does not record the information that a particular set of triples had their origin in a particular input graph. At the moment, if we needed that information when generating the output graphs, I think we’d probably be relying on some rather ad hoc heuristics to try to reconstruct it, e.g.
- triples in which the subject URIs contain “gb248abc{-n-n-n}” must have originated in a graph derived from the EAD document http://archiveshub.ac.uk/data/gb248abc.xml
- triples in which the subject or object URIs contain http://viaf.org/ must have originated in a graph of external links
And so on. But even with some inventive rules, I’m not completely sure we could account for all triples in this way.
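As a rough illustration of the first of those heuristics, a query along the following lines could pull out the triples presumed to derive from a given EAD document. This is only a sketch: the use of a regular expression match on the subject URI is an assumption for illustration, not something the live LOCAH system actually does.

```sparql
# Hypothetical sketch of the first heuristic: select all triples whose
# subject URI contains the collection code "gb248abc". The pattern is an
# assumption, not necessarily the URI scheme used in the live LOCAH data.
CONSTRUCT { ?s ?p ?o }
WHERE {
  ?s ?p ?o .
  FILTER regex( str(?s), "gb248abc" )
}
```

The fragility is easy to see: any triple whose subject URI does not carry the collection code (blank nodes, shared concept URIs, and so on) falls through the net.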
This particular challenge of recording and tracking the relationship between triple and input graph could be addressed by making use of named graph/”quad” support. In that context, the decomposition and recomposition still takes place, but the association between a triple and its source graph can be explicitly maintained within the store, and is available to applications querying the store. This is represented in Figure 3:
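To make that concrete, here is a sketch, in the TriG format (which serialises multiple named graphs in a single document), of what the store’s contents might look like with quad support. The graph URIs and triples here are invented for illustration:

```trig
@prefix dcterms: <http://purl.org/dc/terms/> .

# Triples derived from the first EAD document, held in their own named graph:
<http://data.archiveshub.ac.uk/graph/ead1> {
    <http://data.archiveshub.ac.uk/id/archivalresource/gb0001abc>
        dcterms:title "Example Collection One" .
}

# Triples derived from the second EAD document:
<http://data.archiveshub.ac.uk/graph/ead2> {
    <http://data.archiveshub.ac.uk/id/archivalresource/gb0002xyz>
        dcterms:title "Example Collection Two" .
}
```

With the data held this way, the source of any triple can be recovered at query time using the SPARQL GRAPH keyword, e.g. SELECT ?g WHERE { GRAPH ?g { ?s ?p ?o } }.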
Managing multiple licences
Now, let’s return to the particular question of licensing. Consider a new scenario in which the different EAD input documents are subject to different licence terms, with the expectation that this is “reflected in” the Linked Data outputs. I wonder whether, in practice, this “reflection” may not be quite as simple as “output licence” = “input licence”, but for simplicity (again :)) in this discussion, let’s assume it is. The cases of output graphs o1 and o2 are relatively straightforward as in each case there is only a single input graph to consider in determining the output licence. But with graph o3, there are two inputs to consider, each with different licences.
It seems to me there are various possibilities here:
- Output a single document/graph labelled (as above) with a licence (or licences?) that somehow accommodates/encompasses the requirements of both the input licences. This is illustrated in Figure 4 below. Whether this is possible depends on the characteristics of the licences: it may be that they are sufficiently different that this condition cannot be satisfied.
- Instead of outputting a single document/graph, output multiple documents/graphs with the data partitioned up so that each graph is associated with a different licence. This is illustrated in Figure 5 below. (In the diagram, I’ve suggested generating a “stub” document with “see also” links to two other documents.)
A third option would be to output a single document which labels subsets of the triples with different licences as appropriate, either using reification or by making use of an RDF format that supports multiple graphs per document – sort of the equivalent of labelling different blocks of text on a human-readable page with different licence information, rather than labelling the whole document. This would be a break from the graph-document equivalence which I posited above, though.
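A sketch of that third option, again using TriG, with invented graph URIs, triples and licences: the output document carries two named graphs, each labelled with its own licence.

```trig
@prefix dcterms: <http://purl.org/dc/terms/> .

# Graph-level metadata: each subset of the description gets its own licence.
<http://data.archiveshub.ac.uk/graph/o3-a>
    dcterms:license <http://creativecommons.org/publicdomain/zero/1.0/> .
<http://data.archiveshub.ac.uk/graph/o3-b>
    dcterms:license <http://creativecommons.org/licenses/by/3.0/> .

# Triples with their origin in input graph i1:
<http://data.archiveshub.ac.uk/graph/o3-a> {
    <http://example.org/id/archivalresource/one>
        dcterms:subject <http://data.archiveshub.ac.uk/id/concept/example> .
}

# Triples with their origin in input graph iX:
<http://data.archiveshub.ac.uk/graph/o3-b> {
    <http://example.org/id/archivalresource/two>
        dcterms:subject <http://data.archiveshub.ac.uk/id/concept/example> .
}
```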
I should hold my hands up at this point and admit that I haven’t tried to implement these options above, but while they certainly add an element of complexity, it does seem that the second or third could be managed.
Dumps and Queries
In addition to generating a set of individual bounded descriptions, we also make available a complete dump of the dataset, i.e. all the triples in a single graph – and currently subject to a single licence. The “multiple licences” challenge could be managed in a similar way to the second and third options above, i.e. partitioning the data, either into multiple physical dumps or into subgraphs within a single dump, and associating each with a different licence.
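For the dump case, the “subgraphs within a single dump” variant could be realised with a quad format such as N-Quads, where a fourth column records the named graph – and hence the licence partition – that each triple belongs to. A hypothetical fragment (invented URIs):

```nquads
<http://example.org/id/archivalresource/one> <http://purl.org/dc/terms/title> "Example Collection One" <http://data.archiveshub.ac.uk/graph/ead1> .
<http://example.org/id/archivalresource/two> <http://purl.org/dc/terms/title> "Example Collection Two" <http://data.archiveshub.ac.uk/graph/ead2> .
```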
In both the cases discussed so far – the individual bounded description graphs and the dataset dumps – the creation of those “packages” of data is under our control, in the sense that we code the application that exposes the Linked Data pages and we create the dump, so we can “label” those “packages” with a suitable licence based on the information we have about the inputs from which they were generated. However, we also provide a SPARQL endpoint over the content of the store. Currently all the data forms part of the “default graph”. A SPARQL CONSTRUCT or DESCRIBE query to that endpoint generates an output RDF graph, the form of which is determined by the agent making the query. There’s no requirement for the query engine to provide the sort of graph-level metadata labelling that I outlined above for the bounded description case, so it would be left to the agent making the query to try to obtain that. I suspect there are also some further complexities, particularly for the CONSTRUCT case, where the output graphs may include triples that are not part of the input graphs. SPARQL also supports query forms (SELECT, ASK) which generate non-graph result sets, and I must admit I’m not sure how the licensing of the input data “maps into” that of the output data for some of these cases (e.g. what licence applies to a Boolean response to an ASK query asking “Does the dataset contain this triple?”), such as:
```sparql
PREFIX dcterms: <http://purl.org/dc/terms/>

ASK {
  <http://example.org/id/thing/T> dcterms:creator <http://example.org/id/person/P>
}
```
As I was writing this, I found the paper “Publishing and Consuming Provenance Metadata on the Web of Linked Data” by Olaf Hartig and Jun Zhao (Preprint). While their focus is broader than licensing metadata, section 3 covers the addition of “provenance” metadata for the three cases I’ve discussed here: Linked Data objects (what I was calling “bounded descriptions”); RDF dumps; and SPARQL endpoints. In their coverage of the latter they say:
We propose to make provenance metadata a part of the dataset published via such a SPARQL endpoint so that queries can ask for provenance information. Furthermore, a provenance-enhanced SPARQL query engine could also add provenance metadata automatically to query results.
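Under that approach – and assuming the store holds per-source metadata in named graphs along the lines sketched earlier, with dcterms:license statements about each graph – a consumer could ask for data and its licence together in a single query. A hypothetical example:

```sparql
PREFIX dcterms: <http://purl.org/dc/terms/>

# Retrieve each triple together with the licence attached to the named
# graph it belongs to (assumes per-graph dcterms:license statements exist
# in the dataset, which is not the case in the current LOCAH store).
SELECT ?s ?p ?o ?licence
WHERE {
  GRAPH ?g { ?s ?p ?o }
  ?g dcterms:license ?licence .
}
```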
Concluding Thoughts
A couple of broader points:
- Although I’ve discussed this in the context of an application that generates Linked Data, many of the underlying issues are not specific to that context, and it can be seen as a particular example of the sort of issues that arise when combining resources which are individually subject to different licences. (See, for example, some of the discussions around open licensing for educational resources.) Having said that, as I noted above, I think the data case makes this more apparent because the nature of data (re)use almost always introduces a strong element of “recombination”.
- Tracking the sources of triples and characteristics of those sources may be important/necessary for reasons other than licensing. For example, even in a case where all the source data is covered by a single “open” licence, it may be useful for any output to disclose information about the data sources from which it was created, and any processes applied to it, so that consumers of the output data can make assessments about its quality (currency, accuracy etc). This is the focus of the work by Hartig and Zhao I noted above. And indeed I was reminded by a recent thread on the W3C public-lod list that it may be considered good practice to explicitly partition up a dataset along these lines so that consumers can clearly understand – and applications can work with – the different characteristics of different subsets of the data.
It seems to me that one of the primary motivations for creating “aggregations” like the Archives Hub is the notion that “the whole is more than the sum of the parts”, that there is “value added” in bringing distributed data together – logically or physically – and enabling operations of various forms across that aggregated data. Such operations almost inevitably involve some form of complex recombination of components of the input data. So for that value to be fully realised, first, the data must be available on terms which permit those sort of operations, and, second, we must be able to ensure that the terms of such licences are reflected appropriately in the outputs of those operations.
You might want to look at VOAG, you can find a link to it at http://www.linkedmodel.org
VOAG stands for Vocabulary of Attribution and Governance. It has over 100 licenses in RDF/OWL.
Ralph