
Persistent IDs

pboisver edited this page Aug 23, 2017 · 3 revisions

publishing flows

The Object Teller and the Fedora repository

The Object Teller is a front end for the Fedora Commons repositories used to store Knowledge Objects (KOs) as part of the Knowledge Grid (K-grid). The Object Teller application is responsible for:

  • Managing the lifecycle of KOs (create, update, configure, publish, delete)
  • Enabling discovery of KOs (view, filter, search, import/export, maybe proxy)
  • Controlling access to KOs for object owners, informaticians/librarians, and library users (using a combination of simple role-based access control and discretionary access control)
  • Providing a (RESTful) API usable by clients of the K-grid (object management and object execution)
  • Providing an execution stack for certain types of KOs (Python, simple OWL)

We are using instances of Fedora repositories as the back-end data store for knowledge objects and associated information. The Object Teller front end should interoperate reasonably well with any Fedora repository instance, with minimal configuration and setup. So far we are not really exposing the underlying repository (all use cases go through the Object Teller front end).

The problem

(tl;dr: the solution is in the Persistent ID proposal.)

A particular bit of knowledge may be encapsulated in more than one KO, or in a single KO with multiple versions. These may exist under different names and owners, be either public or private, and reside in more than one instance of a K-grid repository.

The Knowledge Grid is inherently distributed. Both the management and the use of KOs are carried out by a diverse set of authors and clients, human and machine.

Each of these individual KOs (and the bit of knowledge encapsulated) has an identity in the world because it was created and is maintained by some person or organization for some ongoing purpose. The problem is two-fold: we must be able to find (and find again) a KO of interest, and we must be able to tell that it is, or is not, more or less definitively, the same as some other KO.

To that end we want to assign an identifier.

Internal identifiers

Because no one group or application controls the lifecycle of every instance of every KO (we expect there to be many libraries and many knowledge creators) we are not primarily interested in specifying the internal identifiers used by various repositories.

The internal identifiers must be sufficient for implementing the use cases supported by the implementing platform, but need not be exposed (much) to users of the system. (In a relational database, for example, it's good practice to make use of synthetic keys that are generated internally by the database engine.)
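As a minimal sketch of this separation, a library could generate a synthetic internal key for each KO and keep a private mapping from the external identifier to it. The class and method names below are hypothetical and not part of the Object Teller; the ark: value is only an example string.

```python
import uuid


class InternalIdMap:
    """Hypothetical mapping from external identifiers to synthetic internal keys."""

    def __init__(self):
        self._by_external = {}

    def register(self, external_id):
        # The internal key is synthetic: generated here, never chosen by users.
        if external_id in self._by_external:
            raise ValueError(f"already registered: {external_id}")
        internal_id = uuid.uuid4().hex
        self._by_external[external_id] = internal_id
        return internal_id

    def resolve(self, external_id):
        # Callers only ever present the external identifier.
        return self._by_external[external_id]


id_map = InternalIdMap()
internal_key = id_map.register("ark:/99999/fk4example")
```

Users of the system would only ever see the external identifier; the internal key can change format or storage engine without affecting them.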

External (entity/resource) identifiers

For external ids, which must be unique across libraries, we have a choice of natural (business) keys or synthetic (surrogate) keys (see note 1).

An example of a platform that uses a global natural key is Maven, a repository for binary artifacts, primarily code. The backing repo, in Maven's case, is a simple hierarchical file system, with conventions around the names of the nodes and the binaries (and associated metadata and variant files). Each artifact (along with its associated metadata/variant files) is uniquely located using "coordinates" consisting of a group id, artifact id, and version.

These three metadata attributes form a unique natural key that enables discovery, hierarchical grouping, and versioning. They are human readable and easily associated with other systems (like source code repositories) and organizations. The attributes forming the unique natural key are chosen by the creator of the artifact. Conventionally, the group id is a reversed domain, the artifact id is a simple name, and the version string follows a prescribed major.minor.patch format. Attributes are not always chosen wisely, and organizations and code bases change regularly, so sometimes it's difficult to track the provenance and variations of an artifact.

Nonetheless, the artifacts are uniquely identified and the repositories are easy to navigate and replicate.
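To make the natural-key idea concrete, here is a small sketch of Maven-style coordinates and the conventional repository path derived from them. The `Coordinates` type and the `org.example:demo-lib:1.2.3` value are illustrative assumptions, and the colon-separated text form is just a common shorthand.

```python
from typing import NamedTuple


class Coordinates(NamedTuple):
    """A natural key made of three creator-chosen attributes."""

    group_id: str
    artifact_id: str
    version: str

    @classmethod
    def parse(cls, text):
        # Expect exactly "group:artifact:version", with no empty parts.
        parts = text.split(":")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected group:artifact:version, got {text!r}")
        return cls(*parts)

    def repo_path(self):
        # Conventional layout: dots in the group id become directory levels.
        group_dir = self.group_id.replace(".", "/")
        return f"{group_dir}/{self.artifact_id}/{self.version}"


coords = Coordinates.parse("org.example:demo-lib:1.2.3")
path = coords.repo_path()
```

Because the key is the location, anyone who knows the coordinates can find the artifact in any replica of the repository. The flip side is that renaming a group or artifact changes its identity.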

KO use cases involved or impacted by persistent ID strategy

  1. Create object

    • needs a new identifier
  2. Edit/update object

    • might need to update metadata/version in an external source
  3. Delete a KO

    • probably need to notify the persistent ID store
    • may need to keep track of other KOs referring to or cloning from this persistent id
  4. Publish/Unpublish a KO (public/private)

    • again may need to notify or update an external source
    • and keep track of references
  5. Find a particular KO

    • Probably accept only persistent id regardless of internal id
  6. Import/Export, and other library-to-library use cases

    • depend on a consistent persistent id across libraries
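The use cases above can be sketched as operations on a persistent ID store. This is an in-memory stand-in under stated assumptions, not a proposed design: the class, its method names, the status values, and the `ark:/99999/fk4` prefix are all hypothetical. It also assumes that deleting a KO leaves a tombstone record so that references from other KOs still resolve.

```python
class PersistentIdStore:
    """Hypothetical in-memory stand-in for an external persistent ID service."""

    STATUSES = {"private", "public", "deleted"}

    def __init__(self, prefix="ark:/99999/fk4"):
        self._prefix = prefix
        self._counter = 0
        self._records = {}

    def mint(self, metadata):
        # Use case 1: a newly created KO needs a new identifier.
        self._counter += 1
        pid = f"{self._prefix}{self._counter:06d}"
        self._records[pid] = {
            "status": "private",
            "metadata": dict(metadata),
            "referenced_by": set(),
        }
        return pid

    def update(self, pid, **metadata):
        # Use case 2: edits may change metadata held by the external source.
        self._records[pid]["metadata"].update(metadata)

    def set_status(self, pid, status):
        # Use cases 3 and 4: delete, publish, and unpublish notify the store.
        if status not in self.STATUSES:
            raise ValueError(f"unknown status: {status}")
        self._records[pid]["status"] = status

    def add_reference(self, pid, from_pid):
        # Track other KOs that refer to, or were cloned from, this one.
        self._records[pid]["referenced_by"].add(from_pid)

    def find(self, pid):
        # Use case 5: lookup is by persistent id only.
        return self._records[pid]


store = PersistentIdStore()
original = store.mint({"title": "Example KO"})
clone = store.mint({"title": "Example KO (clone)"})
store.add_reference(original, clone)
store.set_status(original, "deleted")
```

After the last line, `store.find(original)` still returns a record whose status is "deleted" and whose `referenced_by` set contains the clone's id, which is the bookkeeping that use cases 3 and 4 call for.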

Background material from the California Digital Library on the ezid service and ark: IDs (our target solution)

More about ezid accounts - http://ezid.cdlib.org/learn/id_basics

The ezid.cdlib.org API - http://ezid.cdlib.org/doc/apidoc.html

The ezid service operations: mint, create, delete, get (metadata), and modify (metadata); compare with the Noid operations below - http://ezid.cdlib.org/doc/apidoc.html#operation-get-identifier-metadata

Command API for the Noid Perl implementation: mint, bind, get (metadata). There is no delete (you don't delete Chuck Norris, Chuck Norris deletes you!) - http://search.cpan.org/dist/Noid/noid#COMMANDS_AND_MODES

Info about the noid Perl module and how to run it locally - http://www.cdlib.org/inside/diglib/ark/noid.pdf

Java wrapper classes for talking to ezid or a local minter (for illustration purposes, not for copy/paste) - https://github.com/pulibrary/arkifier
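As a rough illustration of what a client for the ezid operations listed above has to handle, the sketch below encodes metadata as line-oriented "name: value" text and reads the status line of a response. It is based on my reading of the ezid API documentation linked above and should be checked against it: the escaping rules, the `success:`/`error:` status line, and the `erc.who`/`erc.what` element names are assumptions here, and the function names are hypothetical. No network calls are made.

```python
import re


def anvl_escape(value, is_name=False):
    # Percent-encode characters that would break the line-oriented format.
    # Assumption: names additionally escape ":" since it separates name and value.
    pattern = r"[%:\r\n]" if is_name else r"[%\r\n]"
    return re.sub(pattern, lambda m: "%%%02X" % ord(m.group(0)), value)


def encode_metadata(metadata):
    # One "name: value" pair per line, in the order given.
    return "\n".join(
        f"{anvl_escape(name, is_name=True)}: {anvl_escape(value)}"
        for name, value in metadata.items()
    )


def parse_status(body):
    # Assumption: the first response line looks like "success: <identifier>"
    # or "error: <reason>".
    lines = body.splitlines()
    if not lines:
        raise ValueError("empty response body")
    status, _, detail = lines[0].partition(": ")
    return status, detail


request_body = encode_metadata({"erc.who": "KGrid", "erc.what": "Example KO"})
status, detail = parse_status("success: ark:/99999/fk4example\n")
```

A real client would send `request_body` with an authenticated HTTP request and pass the response text to `parse_status`; the arkifier project linked above shows one way to wrap those calls in Java.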


Notes

1: For example, I am represented in countless databases and repositories by one or more of my email addresses. On the other hand, some ask me to choose an id, which could be a random string for all the system cares. My email is then associated with the entity identified by that random string. The difference is that if my email changes, the system can simply update it, rather than having to change an underlying key that has been used throughout the system in ways that complicate the update. Is my email permanent and unique? Is my SSN? Is my thumbprint? My DNA? Or are those attributes of the real me, which might as well be identified as 06HR$jkj&%ghi007... just call me "007" for short.
