Updates needed for get_aspace_refids.py process for new JPCA collection guides

One possible patch is to augment the EAD exported from ArchivesSpace to ensure that each JPCA finding aid will have the expected c01-c03 structure.  This issue will be tagged to a branch that can be used for testing that approach.  

Longer description, referenced elsewhere:

To improve the researcher’s experience in RCV, additional grouping levels will be added in ArchivesSpace to break up the extremely flat lists of A-Z files.  While the JPCA finding aids to date have had very specific and predictable hierarchical arrangements, these new groupings will result in varying depths of the EAD hierarchy.  Varying depths of description are very common in archival description, and they should be expected: international best practices rely on an archivist’s judgment to create and update archival description over time.  In other words, all finding aids should be considered living documents, and their structure should be expected to change in response to researcher feedback and shifts in historical understanding.  
 
Practically speaking, this upcoming change means that Osprey’s “get_aspace_refids.py” Python script will need to be updated so that it no longer relies on numbered EAD components or on the assumption that they never extend beyond the “c03” depth.  The JPCA finding aids will likely extend to at least the “c05” depth, and within the Smithsonian EAD corpus the depth currently reaches “c11”!
 
To make the Osprey script fully compatible with ArchivesSpace and EAD, just a few small changes should be required:
 

- When the EAD file is requested from the ArchivesSpace endpoint, the “numbered_cs=true” parameter should be removed altogether or changed to false.  The code would then only need to reference “c” elements, rather than c01, c02, and c03 as it currently does.  Otherwise it would later need to add c04 and c05, then c06 through c11 to support the Smithsonian finding aids, and finally c12 to support the full EAD 2002 standard.  See https://github.com/Smithsonian/MassDigi-tools/blob/d433e874492de96f645669aa3a2edf1b40419512/unit_projects/JPC_Archive_Digitization/ASpace_to_Osprey/get_aspace_refids.py#L98 
- Rather than parsing for c03 elements explicitly, the code could either process the “c” elements recursively or (perhaps best of all) look for the lowest levels of description in a finding aid that also have a container.  For example, the XPath statement “//c[did/container][not(descendant::c[did/container])]” would return all “c” elements that need to be processed by Osprey, regardless of their depth.  Python’s built-in xml.etree.ElementTree supports only a small XPath subset and cannot evaluate this expression.  The expression itself is XPath 1.0, so the xpath() method in lxml should be able to handle it once the EAD namespace is bound to a prefix; https://pypi.org/project/saxonche/ would only be needed for XPath 2.0 or later features.
- A few other data points would need to be grabbed differently (e.g., c01/unittitle + c02/unittitle).
  - Because of this issue, we will try out the approach of modifying the EAD from ASpace to ensure that we can still rely on the c01 and c02 unittitle elements.
- This current test, https://github.com/Smithsonian/MassDigi-tools/blob/d433e874492de96f645669aa3a2edf1b40419512/unit_projects/JPC_Archive_Digitization/systems_tests/aspace_refid_test.py#L23-L70 , would also need to be updated.

	import json
	import sys
	import xml.etree.ElementTree as ET  # assumed; the original imports are not shown here

	import requests

	# `settings` (providing `aspace_api`) and `Headers` are defined elsewhere in the script.
	r = requests.get("{}/repositories/2/resources?page=1".format(settings.aspace_api), headers=Headers)

	list_resources = json.loads(r.text.encode('utf-8'))['results']

	for resource in list_resources:
	    repository_id = resource['repository']['ref']
	    resource_id = resource['ead_id']
	    resource_title = resource['title']
	    resource_tree = resource['tree']['ref']
	    # Request the EAD 2002 export with numbered components (c01, c02, c03, ...)
	    r = requests.get(
	        "{}{}/resource_descriptions/{}.xml?include_unpublished=false&include_daos=true&numbered_cs=true&print_pdf=false&ead3=false".format(
	            settings.aspace_api, repository_id, resource_tree.split('/')[4]), headers=Headers)

	    # get root element
	    root = ET.fromstring(r.text)

	    ns = "{urn:isbn:1-931666-22-9}"

	    # Implement later more elegant
	    c01_list = root.findall('.//' + ns + 'archdesc/' + ns + 'dsc/' + ns + 'c01')

	    # Run the hierarchy, c01 -> c02 -> c03
	    for c01_item in c01_list:
	        # iterate child elements of item
	        refid_1 = c01_item.attrib['id']
	        unit_title = c01_item.find('.//' + ns + 'did/' + ns + 'unittitle').text
	        c02_items = c01_item.findall('.//' + ns + 'c02')
	        for c02_item in c02_items:
	            refid_2 = c02_item.attrib['id']
	            try:
	                fol_type = c02_item.find('.//' + ns + 'did/' + ns + 'unittitle').text
	            except AttributeError:
	                # find() returned None: this c02 has no did/unittitle
	                print(unit_title)
	                print("109")
	                sys.exit(1)  # the bare `exit` in the original was a no-op
	            # findall() returns an empty list when there are no c03 children
	            c03_items = c02_item.findall('.//' + ns + 'c03')
	            for c03_item in c03_items:
	                refid_3 = c03_item.attrib['id']
	                print("{}|{}|{}".format(refid_1, refid_2, refid_3))
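
For comparison, here is a minimal sketch of the depth-independent approach proposed in the list above: it selects the lowest-level “c” elements that have a container and collects the unittitle of each ancestor, without relying on c01/c02/c03. It assumes the EAD was exported with unnumbered components and uses only the standard library, so the “no descendant with a container” test is done in Python rather than in XPath. The sample document, its ids, and the function names are hypothetical and not taken from the Osprey code.

```python
import xml.etree.ElementTree as ET

NS = "{urn:isbn:1-931666-22-9}"

# Hypothetical, heavily trimmed EAD 2002 export with unnumbered <c> elements.
SAMPLE_EAD = """<ead xmlns="urn:isbn:1-931666-22-9">
  <archdesc level="collection">
    <dsc>
      <c id="aspace_s1" level="series">
        <did><unittitle>Series 1</unittitle></did>
        <c id="aspace_g1" level="subseries">
          <did><unittitle>A-C</unittitle></did>
          <c id="aspace_f1" level="file">
            <did><unittitle>Folder 1</unittitle><container type="box">1</container></did>
          </c>
        </c>
        <c id="aspace_f2" level="file">
          <did><unittitle>Folder 2</unittitle><container type="box">2</container></did>
        </c>
      </c>
    </dsc>
  </archdesc>
</ead>"""


def has_container(c):
    """True if this component's own <did> holds at least one <container>."""
    return c.find(NS + "did/" + NS + "container") is not None


def leaf_components(root):
    """Return (id, [ancestor unittitles, outermost first]) for every <c> that has
    a container and no descendant <c> with a container, at any depth."""
    # ElementTree has no parent pointers, so build a child -> parent map once.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    results = []
    for c in root.iter(NS + "c"):
        if not has_container(c):
            continue
        if any(has_container(d) for d in c.iter(NS + "c") if d is not c):
            continue  # a deeper component carries the container-level description
        titles = []
        node = parent_of.get(c)
        while node is not None and node.tag == NS + "c":
            title = node.find(NS + "did/" + NS + "unittitle")
            titles.append(title.text if title is not None else None)
            node = parent_of.get(node)
        results.append((c.get("id"), list(reversed(titles))))
    return results


for refid, ancestors in leaf_components(ET.fromstring(SAMPLE_EAD)):
    print("{}|{}".format(refid, "|".join(t or "" for t in ancestors)))
```

Because the ancestor titles come back as a list, the existing c01/unittitle + c02/unittitle values would correspond to the first two entries, and any extra grouping levels simply make the list longer.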

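The request change described in the first bullet could look roughly like the sketch below, which builds the same export URL as the snippet above but with numbered_cs set to false by default. The helper name and the example host are hypothetical; the path and query parameters are the ones already used in the existing request.

```python
from urllib.parse import urlencode


def build_ead_url(api_base, repository_uri, resource_id, numbered_cs=False):
    """Build the EAD 2002 export URL; numbered_cs=False asks for plain <c> elements."""
    params = {
        "include_unpublished": "false",
        "include_daos": "true",
        "numbered_cs": "true" if numbered_cs else "false",
        "print_pdf": "false",
        "ead3": "false",
    }
    return "{}{}/resource_descriptions/{}.xml?{}".format(
        api_base, repository_uri, resource_id, urlencode(params))


# Hypothetical host and resource id, for illustration only.
print(build_ead_url("https://aspace.example.org/api", "/repositories/2", "123"))
```

Keeping the flag as an argument would let the existing numbered-component code path keep working while the unnumbered approach is tested on a branch.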