URIs, identity, aliases & "consolidation"

Jane has written a few posts recently on our efforts to improve the stability of URIs used for pages about archival resources on the “live” Archives Hub service, and as far as possible we’ll be trying to reflect the changes made there in the URIs we use in the Linked Archives Hub RDF data. Much of that work has led to a review of the conventions used in the source EAD XML data and a concerted effort to “cleanse” or enhance that data to improve its coherence and consistency.

In this post, I’ll focus on some issues around the URIs used to identify Persons in the Linked Archives Hub data. It’s something I’ve been trying to write on and off over a period of several weeks, and a combination of some work I’ve been doing for the Bricolage project, and some subsequent conversations, have prompted me to try to knock my rather rambling drafts into shape.

The Data Transformation Process

It’s probably worth taking a step back and emphasising that the process by which the Linked Archives Hub RDF data is generated is currently a relatively simple one:

  • EAD XML documents are transformed into RDF/XML using an XSLT transform. This process is performed on a “document-by-document” basis, i.e. it has as input a single EAD document and an XSLT stylesheet and outputs a single RDF/XML document; the process does not have any “knowledge” of the other EAD documents within the dataset to be transformed.
  • The output from the transform is uploaded to a triple store.
  • Some supplementary data is uploaded alongside the data derived from the EAD documents. This data is the product of various processes: some is “hand-crafted”; some is imported from external sources; some is the result of processes run over the EAD-derived data; some is the result of “lookups” against external datasets – but for the purposes of this discussion, the key point to note is that it is “added to” the EAD-derived data, and that EAD-derived data itself is not changed.
  • That data is served as “Linked Data” “bounded descriptions”.

The URIs used to identify persons in the Linked Archives Hub dataset have their origins in the names of persons occurring in the Archives Hub EAD XML documents. Within those documents, person names occur in two contexts (or at least the EAD-to-RDF transformation process currently takes into account occurrences of names in two contexts). I’ll describe here how the conversion process handles this data, what RDF data is generated and then look at some of the issues this raises.

The examples I’ll use are all from the small subset of EAD documents included in the current Linked Archives Hub data. I’ve picked the case of Beatrice Webb, which illustrates several of the variations which can occur and the issues which arise.

Personal names as index terms

The first context is that of personal names added to the description by the cataloguer as “index terms” on the basis that they may be useful for the purposes of retrieval/search/browse. In the Hub EAD documents, they occur in XML structures like the following, using the EAD controlaccess element. In its simplest form, this looks like:

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
    </did>
    ...
    <controlaccess>
      <persname source="nra">(name)</persname>
    </controlaccess>
  </archdesc>
</ead>

In some (but not all) of the Hub EAD documents, a convention employing the emph element and emph/@altrender attribute is used to capture the distinction between the component parts of a name constructed according to a name rules system. This is something local/“proprietary” to the Hub application (and really a “redefinition” of the EAD tag semantics): a “standard EAD” application would not interpret the markup in this way.

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
    </did>
    ...
    <controlaccess>
      <persname source="nra">
        <emph altrender="surname">Webb</emph>
        <emph altrender="forename">Martha Beatrice</emph>
        <emph altrender="dates">1858-1943</emph>
        <emph altrender="epithet">Social Reformer</emph>
      </persname>
    </controlaccess>
  </archdesc>
</ead>

Within the subset of the Hub EAD data currently transformed into RDF, this same “index term” (the same XML fragment) is used in three different EAD XML documents.

In this example, the persname/@source attribute is used to capture the name of a “name authority file” from which the name is drawn, the “nra” value here indicating the use of the National Register of Archives (NRA). The NRA itself is not currently available as Linked Data, so does not provide URIs for the entities described. The NRA record for Beatrice Webb is http://www.nationalarchives.gov.uk/nra/searches/subjectView.asp?ID=P29999. In fact, the actual form of the name used in the authority record (“Webb, Martha Beatrice (1858-1943) nee Potter, Social Reformer”) does appear to differ slightly from that used in these three EAD documents (i.e. it includes “nee Potter”).
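To make the emph/@altrender convention a little more concrete, here is a minimal Python sketch of how the component parts and the authority/rules name might be pulled out of a persname element. It is an illustration only: the actual transform is an XSLT stylesheet, and the function name `parse_persname` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The persname fragment from the example above.
PERSNAME_XML = """
<persname source="nra">
  <emph altrender="surname">Webb</emph>
  <emph altrender="forename">Martha Beatrice</emph>
  <emph altrender="dates">1858-1943</emph>
  <emph altrender="epithet">Social Reformer</emph>
</persname>
"""

def parse_persname(xml_text):
    """Return (scheme, parts) for a single persname element.

    scheme is the @source value if present, otherwise the @rules value;
    parts is a list of (altrender, text) pairs in document order.
    Hypothetical helper, not part of the Hub transform.
    """
    elem = ET.fromstring(xml_text.strip())
    scheme = elem.get("source") or elem.get("rules")
    parts = [
        (emph.get("altrender"), (emph.text or "").strip())
        for emph in elem.findall("emph")
    ]
    return scheme, parts

scheme, parts = parse_persname(PERSNAME_XML)
print(scheme)
for role, text in parts:
    print(role, "=", text)
```

Note that the altrender values are simply reported as found, so a document using "a" and "y" rather than "surname" and "dates" would need its own interpretation step.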

As I discussed on the LOCAH project blog, in our mapping of the EAD data into an RDF representation, from this XML structure we generate two resources to try to capture the distinction between the person and the “conceptualisation” of that person reflected in the authority file entry or the use of the name rules. The two resources have distinct URIs and are linked using the foaf:focus property.

The patterns for the URIs for both the concept and the person are similar, and based on a combination of:

  • the name of the authority file or (see below) of the name rules
  • a “slug” derived from the name itself (including life dates, titles, epithets etc)

So for the cases above the Person URI generated is:

  • http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer

and the RDF data generated for the concept and the person is (in Turtle):

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://data.archiveshub.ac.uk/id/concept/person/nra/webbmarthabeatrice1858-1943socialreformer>
  a skos:Concept ;
  rdfs:label "Webb, Martha Beatrice, 1858-1943, social reformer" ;
  foaf:focus
    <http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer> .

<http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer>
  a foaf:Person ;
  rdfs:label "Webb, Martha Beatrice, 1858-1943, social reformer" ;
  foaf:name "Martha Beatrice Webb" ;
  foaf:familyName "Webb" ;
  foaf:givenName "Martha Beatrice" .
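The URI pattern described above can be sketched in a few lines of Python. This is not the XSLT the transform actually uses; the function names are hypothetical, and the character-filtering rule (lower-case, then keep only a-z, 0-9 and the hyphen) is an assumption inferred from the example URIs shown here.

```python
import re

BASE = "http://data.archiveshub.ac.uk/id"

def slug(name):
    """Lower-case the name and drop everything except a-z, 0-9 and '-'.

    Assumed rule, inferred from the example URIs; non-ASCII letters
    would simply be dropped by this version.
    """
    return re.sub(r"[^a-z0-9-]", "", name.lower())

def person_uri(scheme, name):
    """URI for the person; scheme is the authority file or name rules."""
    return f"{BASE}/person/{scheme}/{slug(name)}"

def concept_uri(scheme, name):
    """URI for the concept whose foaf:focus is the person."""
    return f"{BASE}/concept/person/{scheme}/{slug(name)}"

name = "Webb, Martha Beatrice, 1858-1943, Social Reformer"
print(person_uri("nra", name))
print(concept_uri("nra", name))
```

For the NRA form of the name this reproduces the two URIs used in the Turtle above.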

In other cases, the persname/@source attribute is not present, but instead the persname/@rules attribute is used to provide the name of a set of “name rules” under which the name is constructed. The example below refers to the use of “ncarules”, i.e. the National Council of Archives’ Rules for the Construction of Personal, Place and Corporate Names.

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
    </did>
    ...
    <controlaccess>
      <persname rules="ncarules">
        <emph altrender="a">Webb</emph>,
        <emph altrender="forename">Martha Beatrice</emph>
        <emph altrender="dates">1858-1943</emph>
        <emph altrender="other">nee Potter</emph>
        <emph altrender="epithet">social reformer and historian</emph>
      </persname>
    </controlaccess>
  </archdesc>
</ead>

This form is present in seven EAD documents, and is mapped by the transform to the URI:

  • http://data.archiveshub.ac.uk/id/person/ncarules/webbmarthabeatrice1858-1943neepottersocialreformerandhistorian

A second form of the name, also constructed using NCA Rules, but with a variation in the epithet and the “nee Potter” omitted, is also used:

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
    </did>
    ...
    <controlaccess>
      <persname rules="ncarules">
        <emph altrender="a">Webb</emph>,
        <emph altrender="forename">Martha Beatrice</emph>. (
        <emph altrender="y">1858-1943</emph>)
        <emph altrender="epithet">social reformer</emph>
      </persname>
    </controlaccess>
  </archdesc>
</ead>

This appears in one EAD document, and is mapped to the URI:

  • http://data.archiveshub.ac.uk/id/person/ncarules/webbmarthabeatrice1858-1943socialreformer

There are a few points worth noting here:

First, and most obviously (and this was the point that initially prompted me to start writing this post), the fact that different forms of name can – quite legitimately, within the constraints of the EAD format and the Hub data entry guidelines – be used as index terms to refer to the same person across the dataset means that we end up generating through our transform process – and publishing/exposing to the Web in our data – multiple URIs for the same person. From the cases above, we have three distinct “URI aliases” for Beatrice Webb:

  • http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer
  • http://data.archiveshub.ac.uk/id/person/ncarules/webbmarthabeatrice1858-1943socialreformer
  • http://data.archiveshub.ac.uk/id/person/ncarules/webbmarthabeatrice1858-1943neepottersocialreformerandhistorian

Second, the use of the name to construct the URI is not a guarantee of avoiding URI ambiguity (i.e. of having a single URI used to refer to what are in fact two different things). In archival description data it is quite common to encounter names without complete life dates or epithets, and in a dataset the size of the Hub, it is quite possible that there are two occurrences of an index term like “Smith, John, 1945-, engineer”, both constructed using the same “name rules”, which are intended as references to two distinct individuals but would be mapped to the same URI.

Third, the “repeatability” of the transformation process over time is not guaranteed. If any of the name components changes in the EAD document (e.g. a previously unknown date of death is added, or an “epithet” is added or removed), then the subsequent re-transformation of the data will generate a different URI from that generated from the previous process using the initial form of the name. (Is “Scott, James, 1950-2012, biologist” in this version the same person who was referred to as “Scott, James, 1950-, scientist” in a previous version?)

Fourth, for both URIs, that of the Concept and of the Person, the URI includes the name of the “authority file” or name rule system.

I’m willing to concede that for the Person case this may be “overkill”. I think I chose this because I was wary of conflating what were in reality two different persons based on matches in their names. So, on this basis, it should not be automatically assumed that the same form of name in two different authority files refers to the same person, at least not without some human verification – though having said that, if there is a match on “life dates” and “epithets”, then it seems highly probable that they do.

Similarly with the name rule systems case. The situation here is probably even more complex, as in archival description data it is quite common to encounter names without complete life dates or epithets. I also wondered if it was theoretically possible that under two different name rule systems, different surname/forename ordering rules might result in two quite different names mapping to the same string in the URI. For example, forename = James and surname = Scott under a surname-first rule would result in “scottjames…”, and forename = Scott and surname = James under a forename-first rule would also result in “scottjames…”.

So, in short, retaining the name of the name rules or the authority file as part of the Person URI was part of an attempt to avoid accidentally conflating what may be two different persons, i.e. to reduce instances of the second problem above, though this very tactic potentially contributes to the first one!
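The points above can be illustrated with a small Python sketch. It assumes the same slug rule as the example URIs suggest (lower-case, keep only a-z, 0-9 and the hyphen); the function name and the two scheme names "rules-a" and "rules-b" are hypothetical.

```python
import re

def person_uri(scheme, name):
    # Assumed slug rule: lower-case, keep only a-z, 0-9 and '-'.
    slug = re.sub(r"[^a-z0-9-]", "", name.lower())
    return f"http://data.archiveshub.ac.uk/id/person/{scheme}/{slug}"

# 1. Aliases: three forms of name for one person give three URIs.
aliases = {
    person_uri("nra", "Webb, Martha Beatrice, 1858-1943, Social Reformer"),
    person_uri("ncarules", "Webb, Martha Beatrice, 1858-1943, social reformer"),
    person_uri("ncarules",
               "Webb, Martha Beatrice, 1858-1943, nee Potter, "
               "social reformer and historian"),
}
print(len(aliases))  # 3

# 2. Ambiguity: two different people described by the same index term
#    under the same rules end up sharing one URI.
first_john = person_uri("ncarules", "Smith, John, 1945-, engineer")
second_john = person_uri("ncarules", "Smith, John, 1945-, engineer")
print(first_john == second_john)  # True

# 3. Repeatability: editing the name changes the URI on re-transformation.
before = person_uri("ncarules", "Scott, James, 1950-, scientist")
after = person_uri("ncarules", "Scott, James, 1950-2012, biologist")
print(before == after)  # False

# 4. Ordering: the slugs collide, and only the scheme segment keeps
#    the two URIs apart.
surname_first = person_uri("rules-a", "Scott, James")   # surname = Scott
forename_first = person_uri("rules-b", "Scott James")   # forename = Scott
print(surname_first == forename_first)  # False
```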

Personal names as names of the creators/“originators” of archival resources

The second context in which personal names are found is as the names of agents responsible for the creation or bringing together of the resources described. In the Hub EAD documents, they occur in XML structures like:

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
      <origination>(name)</origination>
    </did>
    ...
  </archdesc>
</ead>

In the Hub EAD data, there is no guarantee that the data indicates whether the name is that of a person or an organisation. Although the EAD schema does support the use of the <persname> and <corpname> elements within the <origination> element, and indeed they are present in some Hub data, the Hub data entry tool does not provide this distinction.

While cataloguers are encouraged to provide the name of the originator also as an index term, this guideline is not always followed.

Furthermore, the Hub data entry guidelines for this element encourage the use of “the commonly used form of name”, so it may be that the form of name used here is different from that used as an “index term”, which creates potential complexity in trying to “reconcile” the two.

Beatrice Webb appears as the creator/originator of five collections, using one of the following two XML structures:

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
      <origination encodinganalog="3.2.1">
      Webb, (Martha) Beatrice, 1858-1943, wife of 1st Baron Passfield, social reformer and historian
      </origination>
    </did>
    ...
  </archdesc>
</ead>

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
      <origination encodinganalog="3.2.1">
      Webb, Martha Beatrice, 1858-1943, wife of 1st Baron Passfield, social reformer and historian
      </origination>
    </did>
    ...
  </archdesc>
</ead>

The name-to-URI mapping algorithm discards the parentheses so both cases map to a single URI:

  • http://data.archiveshub.ac.uk/id/agent/gb97/webbmarthabeatrice1858-1943wifeof1stbaronpassfieldsocialreformerandhistorian
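As a quick check of that behaviour, here is a minimal Python sketch (again, not the XSLT itself). The function name is hypothetical, the slug rule is the one assumed from the example URIs, and the "gb97" path segment is simply passed in as a parameter; reading it as a code for the holding repository is an assumption on my part here.

```python
import re

def agent_uri(repository, name):
    # Assumed slug rule: lower-case, keep only a-z, 0-9 and '-',
    # which discards parentheses, commas and whitespace.
    slug = re.sub(r"[^a-z0-9-]", "", name.lower())
    return f"http://data.archiveshub.ac.uk/id/agent/{repository}/{slug}"

with_parens = ("Webb, (Martha) Beatrice, 1858-1943, "
               "wife of 1st Baron Passfield, social reformer and historian")
without_parens = ("Webb, Martha Beatrice, 1858-1943, "
                  "wife of 1st Baron Passfield, social reformer and historian")

print(agent_uri("gb97", with_parens))
print(agent_uri("gb97", with_parens) == agent_uri("gb97", without_parens))  # True
```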

Post-transform processing

After this EAD-derived data is uploaded to the triple store, some further processes are applied:

  • a “lookup” process which extracts information about “persons” in the Hub data and searches for candidate matches in the VIAF dataset
  • a process which seeks candidate matches within the Hub dataset between “agents” (generated from the creator/origination context) and “persons” (generated from the index terms context)

The result of this is the addition of a set of triples with owl:sameAs predicates to indicate that the various data.archiveshub.ac.uk URIs (and the VIAF URI) identify the same person.

One of the problems with this approach is that an application consuming the data still has to be prepared to work with these multiple URI aliases, and particularly with SPARQL, this can be quite cumbersome: given URI X denoting a person, to find all the data we hold about the person, an application has to search for patterns involving not just that known URI X but also any URI Y, where URI Y sameAs URI X.
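To show why this is cumbersome, here is a small in-memory Python sketch of the work a consumer has to do. Triples are plain tuples, the "ex:" and "viaf:" names are abbreviated placeholders rather than real URIs, and the function names are hypothetical; a real application would be doing the equivalent in SPARQL against the triple store.

```python
SAME_AS = "owl:sameAs"

# Illustrative data only: abbreviated placeholder names, not real URIs.
triples = {
    ("ex:personA", "foaf:name", "Martha Beatrice Webb"),
    ("ex:archive1", "ex:creator", "ex:agentB"),
    ("ex:agentB", SAME_AS, "ex:personA"),
    ("ex:personA", SAME_AS, "viaf:86607236"),
}

def aliases_of(uri, triples):
    """All URIs reachable from uri via owl:sameAs, in either direction."""
    found = {uri}
    frontier = [uri]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == SAME_AS and current in (s, o):
                other = o if s == current else s
                if other not in found:
                    found.add(other)
                    frontier.append(other)
    return found

def describe(uri, triples):
    """Every non-sameAs triple that mentions uri or one of its aliases."""
    names = aliases_of(uri, triples)
    return {
        (s, p, o) for s, p, o in triples
        if p != SAME_AS and (s in names or o in names)
    }

for triple in sorted(describe("ex:personA", triples)):
    print(triple)
```

Asking only for triples that mention "ex:personA" directly would miss the creator statement, which is attached to the alias "ex:agentB".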

Materialising inferences?

As a possible further measure to mitigate these difficulties, we might perhaps take the approach of further “materialising inferences” based on these owl:sameAs predicates, i.e. explicitly adding to the data the further set of triples which can be inferred from those triples. While this would facilitate querying, it increases the size of the dataset and also (from a “provenance” perspective) adds to the complexity of managing how we distinguish the different sources of data (e.g. which triples had their origin in the transformation of the source EAD documents and which were added by subsequent processes).
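A minimal Python sketch of what such materialisation might involve, using plain tuples for triples and abbreviated placeholder names ("ex:...") instead of real URIs. The function names are hypothetical, and only the substitution of co-referent URIs in subject and object position is covered, not full OWL reasoning.

```python
from itertools import product

SAME_AS = "owl:sameAs"

def alias_groups(triples):
    """Map each URI used in a sameAs statement to its full set of co-referents."""
    groups = {}
    for s, p, o in triples:
        if p == SAME_AS:
            merged = groups.get(s, {s}) | groups.get(o, {o})
            for uri in merged:
                groups[uri] = merged
    return groups

def materialise(triples):
    """Return only the NEW triples implied by the sameAs statements.

    Returning them separately means they could be kept apart from
    the EAD-derived data (for example, in a graph of their own).
    """
    groups = alias_groups(triples)
    inferred = set()
    for s, p, o in triples:
        if p == SAME_AS:
            continue
        for s2, o2 in product(groups.get(s, {s}), groups.get(o, {o})):
            inferred.add((s2, p, o2))
    return inferred - set(triples)

triples = {
    ("ex:personA", "foaf:name", "Martha Beatrice Webb"),
    ("ex:archive1", "ex:creator", "ex:agentB"),
    ("ex:agentB", SAME_AS, "ex:personA"),
}

extra = materialise(triples)
print(len(triples), "original triples,", len(extra), "inferred")
```

Even in this tiny case, two statements plus one sameAs link produce two further triples; with four aliases per person the multiplication is correspondingly larger.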

Consolidation and “Annotation”

I’m coming to the conclusion that while our current process is “OK-ish” as a first stab at generating an RDF representation, the “repeatability” issue (change in name resulting in change of URI) is a problem, and the presence of these multiple URI aliases in the published data is, while not strictly “wrong”, at best rather “sub-optimal” for consumers of the data.

The “repeatability” problem is the consequence of our basing the “slug” in the Person URI pattern on data attributes that can change over time. At the time the transform is applied, the only data that is available is the name (and the associated attributes), so I’m not sure I have a good answer to this. One approach would be to see the transformation stage as only the first part of a larger process, to keep track of the URIs generated over time, and build in a stage of processing to reconcile the URI generated from “Scott, James, 1950-2012, Sir, biologist” this week with the URI generated from “Scott, James, 1950-, scientist” in the previous version of the document six months ago. This perhaps then becomes simply a special case of the second problem, of dealing with multiple URIs for a single entity.

On the second problem, given the nature of our input data, it may well be a necessary part of the process that the initial transformation stage does result in multiple URIs. But once we’ve applied the post-transform processing to “reconcile” these references, rather than publishing a set of sameAs triples, maybe we should take a step further and consider “consolidating” our data to use a single URI for the person?

So e.g., if our post-transform processing tells us that, as I describe above, we have four distinct data.archiveshub.ac.uk URIs which all refer to the person Beatrice Webb, should we “distill down” to one of those four, and replace the occurrences of the other URIs in the data?
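A rough Python sketch of what such a “consolidation” step might look like, assuming the sameAs statements from the post-transform processing are available as plain tuples. The function name, the "ex:" placeholders and the predicate names are hypothetical, and the rule for choosing the surviving URI (here just `min`, i.e. alphabetical order) is an arbitrary stand-in for whatever policy we would actually want.

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
HUB = "http://data.archiveshub.ac.uk/id"

NRA = f"{HUB}/person/nra/webbmarthabeatrice1858-1943socialreformer"
NCA_SHORT = f"{HUB}/person/ncarules/webbmarthabeatrice1858-1943socialreformer"
NCA_LONG = (f"{HUB}/person/ncarules/"
            "webbmarthabeatrice1858-1943neepottersocialreformerandhistorian")
AGENT = (f"{HUB}/agent/gb97/webbmarthabeatrice1858-1943"
         "wifeof1stbaronpassfieldsocialreformerandhistorian")

def consolidate(triples, prefer=min):
    """Rewrite triples so each sameAs group uses one URI; drop the sameAs triples."""
    groups = {}
    for s, p, o in triples:
        if p == SAME_AS:
            merged = groups.get(s, {s}) | groups.get(o, {o})
            for uri in merged:
                groups[uri] = merged
    canonical = {uri: prefer(group) for uri, group in groups.items()}
    return {
        (canonical.get(s, s), p, canonical.get(o, o))
        for s, p, o in triples
        if p != SAME_AS
    }

triples = {
    ("ex:concept1", "foaf:focus", NRA),
    ("ex:concept2", "foaf:focus", NCA_SHORT),
    ("ex:concept3", "foaf:focus", NCA_LONG),
    ("ex:archive1", "ex:origination", AGENT),
    (NRA, SAME_AS, NCA_SHORT),
    (NCA_LONG, SAME_AS, NCA_SHORT),
    (AGENT, SAME_AS, NCA_LONG),
}

consolidated = consolidate(triples)
person_uris = {o for _, _, o in consolidated}
print(len(consolidated), "triples,", len(person_uris), "URI for the person")
```

The `prefer` argument is where a real selection policy would go; nothing in the sketch depends on alphabetical order being a sensible choice.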

Furthermore, if we know that the content of any name is potentially unstable (i.e. “Scott, James, 1950-, scientist” can be replaced by “Scott, James, 1950-2012, Sir, biologist”), should we be using this as the basis of a URI at all, even in the case where – at this point in time – it is the only name for that person in our dataset? Should we instead manage a mapping to some sort of code and use that to construct a distinct URI again? The challenge is in creating a process/workflow which makes this easy to do, and repeatable if/when data is reprocessed or new data is added.
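One way to picture the “mapping to some sort of code” idea is a small registry that hands out opaque identifiers and remembers which forms of name map to them. The Python sketch below is purely illustrative: the class, its methods, the key format and the "p000001"-style URI pattern are all hypothetical, and the decision that two forms of name refer to the same person is assumed to come from a separate reconciliation step.

```python
class PersonRegistry:
    """Maps name-form keys to opaque identifiers that do not change
    when the name itself is edited. Hypothetical sketch."""

    def __init__(self, mapping=None, next_id=1):
        self.mapping = dict(mapping or {})
        self.next_id = next_id

    def identifier_for(self, key):
        """Return the code for key, coining a new one on first sight."""
        if key not in self.mapping:
            self.mapping[key] = f"p{self.next_id:06d}"
            self.next_id += 1
        return self.mapping[key]

    def record_alias(self, new_key, existing_key):
        """Record that new_key names the person already known as existing_key."""
        self.mapping[new_key] = self.identifier_for(existing_key)

    def uri_for(self, key):
        code = self.identifier_for(key)
        return f"http://data.archiveshub.ac.uk/id/person/{code}"

registry = PersonRegistry()
old_form = "ncarules/Scott, James, 1950-, scientist"
new_form = "ncarules/Scott, James, 1950-2012, Sir, biologist"

uri_before = registry.uri_for(old_form)
registry.record_alias(new_form, old_form)   # outcome of reconciliation
uri_after = registry.uri_for(new_form)
print(uri_before == uri_after)  # True
```

The mapping would of course have to be stored somewhere between runs of the transform, which is exactly the workflow question raised above.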

A further possibility is suggested by a post by Leigh Dodds which I’ve had at the back of my mind for a while, and which he mentions again in a more recent post.

Leigh argues that as Linked Data providers we tend to publish data using our own URIs, then “reconcile” some of those URIs with some existing published URIs for the same entity created by other providers, and add owl:sameAs assertions to indicate that they are co-references – much as I’ve described here for the Linked Archives Hub case. But an alternative approach in which, instead of publishing our own URIs, we use those existing URIs directly in our own data may well make our data easier to use. Leigh refers to this as “Annotated Data” – in the sense that we are providing new triples using an existing URI. Applying this to our concrete example for Beatrice Webb, if, as I suggest above, it would be a Good Thing to “distill down” our four different URIs for Beatrice to a single URI and substitute that single URI in our data, could we use, say, VIAF’s URI for her for that purpose?

In fact, we already make use of externally-owned URIs directly for the case of languages, where we simply use lexvo.org URIs directly in our data. One motivation for choosing this approach was that it was trivial to construct the lexvo.org URIs in the transform process using the language codes present in the EAD data. Obtaining a VIAF URI for a person, on the other hand, is a rather more complex task involving a search of another dataset and (in some cases, at least), a process of manual verification of candidate matches. But in spite of the difference in the processes of obtaining the URIs, are the two cases so distinct? Particularly if we start to think of our data publication as rather more of a multiple-stage process, I admit I’m less sure than I might have been at one point.

One factor might be our level of confidence in the stability of any external URIs we use. I’m not sure VIAF has published any formal policy regarding its URIs. But on the other hand, part of the problem that we are grappling with is that of maintaining the stability of our own URIs!

Another factor is that the consequence of adopting the “annotation” approach is that when it comes to dereferencing URIs, we would no longer have a data.archiveshub.ac.uk “Person URI” which we can redirect to a document/graph that we serve. Obviously, the VIAF URI for Beatrice Webb redirects to a document served by VIAF – which would not provide the information that, say, she was the creator/originator of the five archives above or the “foaf:focus” of those concepts associated with the eleven other archives. That information is still present in our dataset, and would be available via SPARQL, and as part of other Linked Data documents we do serve (e.g. the bounded description of the archival resource would include a triple indicating its creator/originator). In principle, we could also, as Leigh suggests in the penultimate paragraph of his post, continue to serve a document providing a bounded description, much as we do now, but its subject would be <http://viaf.org/viaf/86607236> rather than a data.archiveshub.ac.uk URI. The challenge then becomes one of how to make that document discoverable (through foaf:isPrimaryTopicOf/wdrs:describedby/rdfs:seeAlso links? through third-party services built on such links?)

I admit I hesitate to advocate taking this plunge at this point. The cases of the Language URIs and the Person URIs do seem to be different – although in ways I’m not sure I can articulate very clearly! Using the lexvo.org Language URIs seems appropriate in part because it doesn’t seem like we have “anything interesting to say” about languages, but the person case feels more “core” to “our data”. Also we will almost certainly always have to handle cases for which VIAF doesn’t provide a URI and we need our own Person URI. On the other hand, if, say, the National Register of Archives “authority file” data had already existed as Linked Data, and provided URIs for persons, would we still coin our own URIs for those cases? Or would we have simply adopted their URIs wherever we could? I’d hope we’d have chosen the latter. Maybe we really do need to become more relaxed about embracing the use of others’ URIs.

So… I think we need to think more about whether to take that step of using external URIs instead of our own, but I do think our URI alias issues in general need some attention, probably involving some sort of an extension to the current process to introduce a “consolidation” step between the transformation stage and the publishing stage so that where we know we are coining multiple URIs for an entity, we publish only one of them.


2 Responses to URIs, identity, aliases & "consolidation"

  2. Unless we find an effective way to deal with the “dare I use your URLs?” problem, there isn’t any prospect that we will produce a useful Linked Data web. Mint-your-own-URL plus make-some-same-as-links is a strategy which will clearly not scale. Imagine 1000 archives all minting their own URLs for people, and the work in making same-as links for all of them. Then add in 1000 museums …

    One approach we are exploring with the Modes software is the idea of “web termlists” with a selective local cache. We’ve implemented the idea with Geonames. You send a search term (place name) to the Geonames web service and get back a hit list expressed as an XML document. This is re-formatted and presented to the user as a pick list, and she selects the Geonames concept which represents the intended place. As a result, the selected Geonames URL for the place is inserted into the Modes data, but at the same time another Modes file (the “Geonames shadow”) is updated with a copy of the selected Geonames record (reformatted to suit local requirements).

    This approach gives two benefits:

    – there is a local security copy of those Geonames concepts which have actually been used in the Modes data;
    – additional information held in the Geonames authority data (such as lat/long coordinates) is now available in the local database, removing the need to invoke web services in order to access it

    When discussing the use of Linked Data URLs, it saddens me that we tend to see this as an additional burden, rather than as the gateway to a whole lot of additional information which we can use for free. Maybe that reflects the current primitive/fragmented state of our Linked Data processing toolset. (I’m managing quite happily with standard XSLT 1.0, though I seem to be in a minority of one!)
