Updates needed for get_aspace_refids.py process for new JPCA collection guides #2

@fordmadox

Description

One possible patch is to augment the EAD exported from ArchivesSpace to ensure that each JPCA finding aid will have the expected c01-c03 structure. This issue will be tagged to a branch that can be used for testing that approach.

Longer description, referenced elsewhere:

To improve the researcher’s user experience in RCV, additional grouping levels will be added to ArchivesSpace to cut down on the extremely flat lists of A-Z files. While the JPCA finding aids to date have had very specific and predictable hierarchical arrangements, these new groupings will result in varying depths of the EAD hierarchy. Such varying depths are not only a very common feature of archival description, but also something the tooling should expect: they reflect international best practices, which rely on an archivist’s judgement to both create and update archival description over time. In other words, all finding aids should be considered living documents, and their structure should be expected to change in response to researcher feedback and shifts in historical understanding.

Practically speaking, this upcoming change means that Osprey’s “get_aspace_refids.py” Python script will need to be updated so that it no longer relies on numbered EAD components that never extend beyond the “c03” depth. Instead, the depth in JPCA finding aids will likely extend to at least “c05”. And within the Smithsonian EAD corpus, the depth currently extends to “c11”!
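As a rough illustration of what "not relying on numbered components" could look like, here is a minimal sketch that walks components of any depth, whether they are exported as “c” or as “c01” through “c12”. The helper names (is_component, iter_refid_paths, max_depth) and the sample EAD are hypothetical and not taken from the Osprey code; only the EAD 2002 namespace comes from the existing script.

```python
import re
import xml.etree.ElementTree as ET

EAD_NS = "{urn:isbn:1-931666-22-9}"
# Matches the unnumbered <c> tag as well as numbered <c01>..<c12> tags.
COMPONENT_TAG = re.compile(r"^c(0[1-9]|1[0-2])?$")


def is_component(elem):
    """Return True if elem is an EAD component, numbered or not (hypothetical helper)."""
    tag = elem.tag
    local = tag[len(EAD_NS):] if tag.startswith(EAD_NS) else tag
    return bool(COMPONENT_TAG.match(local))


def iter_refid_paths(parent, path=()):
    """Yield a tuple of ids (outermost first) for every component nested under parent."""
    for child in parent:
        if is_component(child):
            current = path + (child.get("id", ""),)
            yield current
            yield from iter_refid_paths(child, current)


def max_depth(dsc):
    """Return the deepest component level found under dsc (0 if there are none)."""
    return max((len(p) for p in iter_refid_paths(dsc)), default=0)


# Made-up sample that goes one level deeper than the current c01 -> c02 -> c03 loop handles.
SAMPLE = """<ead xmlns="urn:isbn:1-931666-22-9"><archdesc><dsc>
<c01 id="a"><did><unittitle>Series</unittitle></did>
  <c02 id="b"><did><unittitle>Group</unittitle></did>
    <c03 id="c"><did/>
      <c04 id="d"><did/></c04>
    </c03>
  </c02>
</c01>
</dsc></archdesc></ead>"""

root = ET.fromstring(SAMPLE)
dsc = root.find(f"{EAD_NS}archdesc/{EAD_NS}dsc")
for refid_path in iter_refid_paths(dsc):
    print("|".join(refid_path))
print("max depth:", max_depth(dsc))
```

Because the recursion follows parent/child nesting rather than tag names, the same loop would keep working if a finding aid later gained a “c05” level, or if the export switched to unnumbered “c” elements.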

To make the Osprey script fully compatible with ArchivesSpace and EAD, just a few small changes should be required:

  • When the EAD file is requested from the ArchivesSpace endpoint, the “numbered_cs=true” parameter should be removed altogether or changed to false. The code would then only need to reference “c” elements, rather than c01, c02, and c03 as it currently does. Otherwise it would later need to add c04 and c05, then c06 through c11 to support the Smithsonian finding aids, and finally c12 to support the full EAD 2002 standard. See
    "{}{}/resource_descriptions/{}.xml?include_unpublished=false&include_daos=true&numbered_cs=true&print_pdf=false&ead3=false".format(
  • Rather than parsing for c03 elements explicitly, the code could either process the “c” elements recursively or (perhaps best of all) look for the lowest levels of description in a finding aid that also have a container. For example, the XPath statement “//c[did/container][not(descendant::c[did/container])]” would return all “c” elements that need to be processed by Osprey, regardless of their depth. This approach would likely require a fuller XPath engine, such as lxml's xpath() method or https://pypi.org/project/saxonche/, rather than the ElementTree-style find/findall calls the script uses now, which support only a limited subset of XPath.
  • A few other data points would need to be grabbed differently (e.g., c01/unittitle + c02/unittitle).
    • Because of this issue, we will try out the approach of modifying the EAD from ASpace to ensure that we can still rely on the c01 and c02 unittitle elements.
  • This current test,
    r = requests.get("{}/repositories/2/resources?page=1".format(settings.aspace_api), headers=Headers)
    list_resources = json.loads(r.text.encode('utf-8'))['results']
    for resource in list_resources:
        repository_id = resource['repository']['ref']
        resource_id = resource['ead_id']
        resource_title = resource['title']
        resource_tree = resource['tree']['ref']
        r = requests.get(
            "{}{}/resource_descriptions/{}.xml?include_unpublished=false&include_daos=true&numbered_cs=true&print_pdf=false&ead3=false".format(
                settings.aspace_api, repository_id, resource_tree.split('/')[4]), headers=Headers)
        # get root element
        tree = ET.fromstring(r.text)
        root = ET.ElementTree(tree).getroot()
        ns = "{urn:isbn:1-931666-22-9}"
        # Implement later more elegant
        c01_list = root.findall('.//' + ns + 'archdesc/' + ns + 'dsc/' + ns + 'c01')
        i = 0
        # Run the hierarchy, c01 -> c02 -> c03
        for c01_item in c01_list:
            # try:
            # iterate child elements of item
            refid_1 = c01_item.attrib['id']
            unit_title = c01_item.find('.//' + ns + 'did/' + ns + 'unittitle').text
            c02_items = c01_item.findall('.//' + ns + 'c02')
            for c02_item in c02_items:
                refid_2 = c02_item.attrib['id']
                try:
                    fol_type = c02_item.find('.//' + ns + 'did/' + ns + 'unittitle').text
                except AttributeError:
                    print(unit_title)
                    print("109")
                    exit
                try:
                    c03_items = c02_item.findall('.//' + ns + 'c03')
                except AttributeError:
                    print("129")
                    print(unit_title)
                    exit
                for c03_item in c03_items:
                    refid_3 = c03_item.attrib['id']
                    print("{}|{}|{}".format(refid_1, refid_2, refid_3))
    , would also need to be updated.
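The changes listed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the Osprey implementation: build_ead_url, has_container, unittitle, and iter_lowest_with_container are hypothetical names, the API base URL and sample EAD are made up, and the selection logic is written as plain ElementTree traversal that mirrors the example XPath statement instead of calling an XPath engine. Ancestor titles are collected along the way as one possible replacement for the fixed c01/unittitle + c02/unittitle lookups.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

EAD_NS = "{urn:isbn:1-931666-22-9}"


def build_ead_url(api_base, repository_ref, resource_id):
    """Build the EAD export URL, asking for unnumbered <c> components (hypothetical helper)."""
    params = {
        "include_unpublished": "false",
        "include_daos": "true",
        "numbered_cs": "false",  # the one value changed from the current request
        "print_pdf": "false",
        "ead3": "false",
    }
    return "{}{}/resource_descriptions/{}.xml?{}".format(
        api_base, repository_ref, resource_id, urlencode(params))


def has_container(c):
    """True if this component's own <did> holds at least one <container>."""
    return c.find(f"{EAD_NS}did/{EAD_NS}container") is not None


def unittitle(c):
    """Return the component's own unittitle text, or '' if it has none."""
    el = c.find(f"{EAD_NS}did/{EAD_NS}unittitle")
    return "".join(el.itertext()).strip() if el is not None else ""


def iter_lowest_with_container(parent, ancestors=()):
    """Yield (component, ancestor_titles) for the lowest components that have a container.

    Intended to mirror //c[did/container][not(descendant::c[did/container])].
    """
    for c in parent.findall(f"{EAD_NS}c"):
        # c.iter() includes c itself, so exclude it when checking descendants.
        nested = any(has_container(d) for d in c.iter(f"{EAD_NS}c") if d is not c)
        if has_container(c) and not nested:
            yield c, ancestors
        yield from iter_lowest_with_container(c, ancestors + (unittitle(c),))


# Made-up sample with folders at two different depths.
SAMPLE = """<ead xmlns="urn:isbn:1-931666-22-9"><archdesc><dsc>
<c id="s1" level="series"><did><unittitle>Subject Files</unittitle></did>
  <c id="g1"><did><unittitle>A-C</unittitle><container type="box">1</container></did>
    <c id="f1"><did><unittitle>Folder one</unittitle><container type="folder">1</container></did></c>
  </c>
  <c id="f2"><did><unittitle>Folder two</unittitle><container type="folder">2</container></did></c>
</c>
</dsc></archdesc></ead>"""

print(build_ead_url("https://aspace.example/api", "/repositories/2", "123"))
dsc = ET.fromstring(SAMPLE).find(f"{EAD_NS}archdesc/{EAD_NS}dsc")
for component, titles in iter_lowest_with_container(dsc):
    print(component.get("id"), "|", " > ".join(titles))
```

In this sample, “g1” is skipped because a component below it also has a container, while “f1” and “f2” are both returned even though they sit at different depths. Whether the ancestor titles alone are enough for Osprey's series and folder-type fields is an open question that the test branch would need to confirm.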
