From a24d4add272ce0609d0dad3c97b0fe42130496b0 Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Wed, 11 Feb 2026 14:13:05 -0500 Subject: [PATCH 01/44] feat(arclight#29): Add creator records and automatic indexing MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Implement data ETL for standalone creator records from ArchivesSpace agents and automatically index them to Solr for discovery in ArcLight. PROBLEM: ArchivesSpace agents (people, organizations, families) were not discoverable as standalone entities in ArcLight. Users could not browse or search for agents independently of their collections. SOLUTION: 1. Extract ALL agents from ArchivesSpace via API 2. Generate EAC-CPF XML documents for each agent 3. Define Solr schema fields for agent metadata 4. Configure traject to index agent XML to Solr 5. Implement automatic indexing after XML generation FEATURES: - Processes all agent types (people, corporate entities, families) - Generates standards-compliant EAC-CPF XML - Links agents to their collections via persistent_id - Automatic discovery of traject config (bundle show arcuit) - Batch processing (100 files per traject call) - Robust error handling with detailed logging - Multiple processing modes (normal, agents-only, collections-only) COMPONENTS: 1. Python Processing (arcflow/main.py - 1428 lines): - get_all_agents() - Fetch ALL agents from ArchivesSpace API - task_agent() - Generate EAC-CPF XML via archival_contexts endpoint - process_creators() - Batch process all agents in parallel (10 workers) - find_traject_config() - Auto-discover traject configuration - index_creators() - Batch index to Solr via traject 2. 
Solr Schema (solr/conf/arcuit_creator_fields.xml - 11 fields): - is_creator (boolean) - Identifies agent records - creator_persistent_id (string) - Unique identifier - agent_type (string) - Type: corporate/person/family - agent_id (string) - ArchivesSpace agent ID - agent_uri (string) - ArchivesSpace agent URI - entity_type (string) - EAC-CPF entity type - related_agents_ssim (multiValued) - Related agent names - related_agent_uris_ssim (multiValued) - Related agent URIs - relationship_types_ssim (multiValued) - Relationship types - document_type (string) - Document type (eac-cpf) - record_type (string) - Record type (creator/agent) 3. Traject Configuration (traject_config_eac_cpf.rb - 271 lines): - Maps EAC-CPF XML elements to Solr fields - Extracts agent identity information - Processes biographical/historical notes - Captures related agents and relationships - Handles collection linkages USAGE: python -m arcflow.main \ --arclight-dir /path/to/arclight \ --aspace-dir /path/to/archivesspace \ --solr-url http://localhost:8983/solr/blacklight-core python -m arcflow.main ... --agents-only python -m arcflow.main ... --collections-only python -m arcflow.main ... --skip-creator-indexing python -m arcflow.main ... 
--arcuit-dir /path/to/arcuit COMMAND LINE OPTIONS: --arclight-dir PATH Path to ArcLight application (required) --aspace-dir PATH Path to ArchivesSpace data (required) --solr-url URL Solr instance URL (required) --arcuit-dir PATH Path to arcuit gem (optional, auto-detected) --agents-only Process only agents, skip collections --collections-only Process only collections, skip agents --skip-creator-indexing Generate XML but don't index to Solr --force-update Process all records regardless of timestamps ARCHITECTURE: ArchivesSpace API ↓ (archivessnake library) arcflow Python code ↓ (fetches via /repositories/1/archival_contexts) EAC-CPF XML files (public/xml/agents/*.xml) ↓ (indexed via traject) Solr (blacklight-core) ↓ (discovered via ArcLight) ArcLight DATA FLOW: 1. arcflow calls get_all_agents() - fetches ALL agents from ArchivesSpace API 2. For each agent, task_agent() retrieves EAC-CPF from archival_contexts endpoint 3. Saves EAC-CPF XML to public/xml/agents/ directory 4. find_traject_config() discovers config via 'bundle show arcuit' or --arcuit-dir 5. index_creators() batches XML files (100 per call) and invokes traject 6. traject indexes XML to Solr with is_creator:true flag 7. 
Agent records now searchable in ArcLight BENEFITS: - Users can discover all agents independently of collections - Direct navigation to agent pages - Browse all agents of a specific type - View all collections linked to a specific agent - Standards-based EAC-CPF format for interoperability - Automatic indexing reduces manual steps - Flexible processing modes for different workflows TECHNICAL DETAILS: - EAC-CPF format: urn:isbn:1-931666-33-4 namespace - ID extraction: Filename-based (handles empty control element in EAC-CPF) - Batch size: 100 files per traject call - Parallel processing: 10 worker processes for agent generation - Timeout: 5 minutes per batch - Error handling: Log errors, continue processing - Linking: Via Solr persistent_id field (not direct XML updates) FILES CHANGED: arcflow-phase1-revised/ ├── arcflow/ │ ├── __init__.py (updated imports) │ ├── main.py (1428 lines, core logic) │ └── utils/ │ ├── __init__.py │ ├── bulk_import.py │ └── stage_classifications.py ├── traject_config_eac_cpf.rb (271 lines) ├── requirements.txt ├── .archivessnake.yml.example ├── README.md (updated) ├── HOW_TO_USE.md (updated) └── TESTING.md (updated) solr/ ├── README.md (installation instructions) └── conf/ └── arcuit_creator_fields.xml (11 field definitions) Documentation: ├── CREATOR_INDEXING_GUIDE.md (comprehensive guide) ├── AUTOMATED_INDEXING_IMPLEMENTATION.md (technical details) └── README.md (updated with creator section) TESTING: Manual verification: 1. Run arcflow with --agents-only flag 2. Verify XML files generated in public/xml/agents/ 3. Check Solr for indexed agent records 4. Verify is_creator:true in Solr documents 5. Test agent-collection linking via persistent_id Automated testing: - Python syntax validation - Ruby syntax validation (traject config) - Solr schema validation DEPLOYMENT: 1. Add Solr schema fields to schema.xml 2. Reload Solr core 3. Run arcflow to generate and index agents 4. Verify agents appear in Solr 5. 
Test in ArcLight interface BACKWARD COMPATIBILITY: - No breaking changes to existing functionality - Collections continue to work as before - Agent indexing is additive - Can be disabled with --skip-creator-indexing --- HOW_TO_USE.md | 294 +++++++++++++++++ README.md | 180 ++++++++++- TESTING.md | 552 +++++++++++++++++++++++++++++++ arcflow/main.py | 661 +++++++++++++++++++++++++++++++++++++- traject_config_eac_cpf.rb | 271 ++++++++++++++++ 5 files changed, 1943 insertions(+), 15 deletions(-) create mode 100644 HOW_TO_USE.md create mode 100644 TESTING.md create mode 100644 traject_config_eac_cpf.rb diff --git a/HOW_TO_USE.md b/HOW_TO_USE.md new file mode 100644 index 0000000..0917842 --- /dev/null +++ b/HOW_TO_USE.md @@ -0,0 +1,294 @@ +# Arcflow Phase 1: Creator Records Implementation + +This directory contains the complete implementation of Phase 1 (Creator Records Data Pipeline) for the arcflow repository. + +## What This Is + +This directory is a **complete, working copy of arcflow** with all Phase 1 creator records changes already applied. You can run it directly from here! + +## Purpose + +This allows you to: +1. **Run arcflow directly** to test the creator records feature immediately +2. **Review all arcflow changes** without needing separate repository access +3. 
**Create the arcflow PR** using the provided documentation when ready + +--- + +## How to Use + +### Option 1: Run Directly from This Directory (Recommended for Testing) + +The simplest way to test the creator records feature is to run arcflow directly from this directory: + +#### Step 1: Install Dependencies + +```bash +cd arcflow-phase1 + +# Install Python dependencies +pip install -r requirements.txt +``` + +#### Step 2: Configure Credentials + +```bash +# Copy the example configuration +cp .archivessnake.yml.example .archivessnake.yml + +# Edit with your ArchivesSpace credentials +nano .archivessnake.yml # or use your preferred editor +``` + +Edit `.archivessnake.yml` with your settings: +```yaml +baseurl: http://your-archivesspace-server:8089 +username: your-username +password: your-password +``` + +#### Step 3: Create ArcFlow Configuration + +```bash +# Create .arcflow.yml to track last update time +cat > .arcflow.yml << EOF +last_updated: '1970-01-01T00:00:00+00:00' +EOF +``` + +Or run with `--force-update` flag to process all resources. 
+ +#### Step 4: Run ArcFlow + +```bash +# Run arcflow with required arguments +python -m arcflow.main \ + --arclight-dir /path/to/your/arclight-app \ + --aspace-dir /path/to/your/archivesspace \ + --solr-url http://localhost:8983/solr/blacklight-core + +# Or with force update to process everything +python -m arcflow.main \ + --arclight-dir /path/to/your/arclight-app \ + --aspace-dir /path/to/your/archivesspace \ + --solr-url http://localhost:8983/solr/blacklight-core \ + --force-update + +# Or to process only agents (skip collections - useful for testing) +python -m arcflow.main \ + --arclight-dir /path/to/your/arclight-app \ + --aspace-dir /path/to/your/archivesspace \ + --solr-url http://localhost:8983/solr/blacklight-core \ + --agents-only +``` + +#### Step 5: View Results + +```bash +# Check for creator XML files +ls -lh $ARCLIGHT_DIR/public/xml/agents/ + +# View the creator files (pretty-printed XML) +xmllint --format $ARCLIGHT_DIR/public/xml/agents/creator_*.xml + +# Index to Solr +cd $ARCLIGHT_DIR +bundle exec traject -u $SOLR_URL -i xml \ + -c /path/to/arcuit/arcflow-phase1-revised/traject_config_eac_cpf.rb \ + public/xml/agents/*.xml +``` + +**See `TESTING.md` for comprehensive testing instructions!** + +#### Testing a Single Creator + +For faster testing, use the test-single-creator command to process just one agent: + +```bash +cd arcflow-phase1 + +# Set environment variables +export ARCLIGHT_DIR=/path/to/your/arclight-app +export ASPACE_DIR=/path/to/your/archivesspace + +# Test a single creator agent +python -m arcflow.main test-single-creator \ + --agent-uri /agents/corporate_entities/123 + +# The command will show you: +# - The created XML file path +# - The traject command to index it +``` + +This is much faster than processing all creators and is ideal for development and testing. + +--- + +### Configure Solr Schema (Required Before Indexing) + +⚠️ **CRITICAL PREREQUISITE** - Before you can index creator records to Solr, you must configure the Solr schema. 
+ +**See [SOLR_SCHEMA.md](SOLR_SCHEMA.md) for complete instructions on:** +- Which fields to add (is_creator, creator_persistent_id, etc.) +- Three methods to add them (Schema API recommended, managed-schema, or schema.xml) +- How to verify they're added +- Troubleshooting "unknown field" errors + +**Quick Schema Setup (Schema API method):** +```bash +# Add is_creator field +curl -X POST -H 'Content-type:application/json' \ + http://localhost:8983/solr/blacklight-core/schema \ + -d '{"add-field": {"name": "is_creator", "type": "boolean", "indexed": true, "stored": true}}' + +# Add other required fields (see SOLR_SCHEMA.md for complete list) +``` + +**Verify schema is configured:** +```bash +curl "http://localhost:8983/solr/blacklight-core/schema/fields/is_creator" +# Should return field definition, not 404 +``` + +⚠️ **If you skip this step, you'll get:** +``` +ERROR: [doc=creator_corporate_entities_584] unknown field 'is_creator' +``` + +This is a **one-time setup** per Solr instance. + +--- + +### Option 2: Copy to Separate Arcflow Repository (For Creating PR) + +If you want to create a PR in the official arcflow repository: + +```bash +# In your local environment with access to arcflow repo: +cd /path/to/arcflow +git checkout -b copilot/add-creator-records + +# Copy the modified files +cp /path/to/arcuit/arcflow-phase1/arcflow/main.py arcflow/main.py +cp /path/to/arcuit/arcflow-phase1/traject_config_creators.rb . +cp /path/to/arcuit/arcflow-phase1/CREATOR_RECORDS_DESIGN.md . +cp /path/to/arcuit/arcflow-phase1/PR_SUMMARY.md . +cp /path/to/arcuit/arcflow-phase1/README.md . 
+cp /path/to/arcuit/arcflow-phase1/.github/copilot-instructions.md .github/ + +git add -A +git commit -m "Add standalone creator records extraction and indexing pipeline" +git push -u origin copilot/add-creator-records + +# Then create PR via GitHub UI using PR_SUMMARY.md as description +``` + +**Alternative: Create a patch file** + +```bash +cd /path/to/arcuit/arcflow-phase1 + +# Create a patch file comparing against main branch +git diff c2486e4..HEAD > ../arcflow-phase1-changes.patch + +# Then in arcflow repo: +cd /path/to/arcflow +git checkout -b copilot/add-creator-records +git apply /path/to/arcflow-phase1-changes.patch +``` + +--- + +## Key Files + +### Implementation +- **`arcflow/main.py`** - Core code with creator agent processing methods +- **`traject_config_eac_cpf.rb`** - Solr indexing configuration for creator EAC-CPF XML + +### Documentation +- **`CREATOR_RECORDS_DESIGN.md`** - Comprehensive design document +- **`PR_SUMMARY.md`** - Complete PR description (use for GitHub PR) +- **`README.md`** - Updated with creator records usage instructions +- **`.github/copilot-instructions.md`** - Architecture documentation + +## Changes Summary + +### New Methods in `arcflow/main.py` + +1. **`get_all_agents(agent_types, modified_since, indent_size)`** + - Fetches all agents from ArchivesSpace + - Returns set of unique agent URIs + - Lines: ~651-705 + +2. **`task_agent(agent_uri, agents_dir, repo_id, indent_size)`** + - Processes individual agent into EAC-CPF XML document + - Extracts bioghist, dates, relationships + - Only processes agents with biographical notes + - Lines: ~708-774 + +3. 
**`process_creators(agents_dir, modified_since, agent_uri, indent_size)`** + - Main orchestration method for agent processing + - Processes agents in parallel + - Lines: ~893-946 + +### Workflow Integration + +Added to `update_eads()` method after PDF processing (around line 492): +- Calls `process_creators()` to process all agents +- Generates EAC-CPF XML files in `public/xml/agents/` directory +- Collection linking handled via Solr using persistent_id field + +## Testing + +⭐ **See `TESTING.md` for comprehensive testing instructions!** + +The testing guide includes: +- Step-by-step instructions for migrating a single creator record +- Command-line Solr queries with curl +- Browser-based Solr query examples +- Expected output and troubleshooting + +### Quick Test + +After applying changes to arcflow: + +1. **Run ArcFlow**: + ```bash + python -m arcflow.main [options] + ``` + +2. **Check Output**: + ```bash + ls -lh public/xml/agents/creator_*.xml + ``` + +3. **Index to Solr**: + ```bash + bundle exec traject -u $SOLR_URL -i xml \ + -c traject_config_eac_cpf.rb public/xml/agents/*.xml + ``` + +4. **Query Solr** (command line): + ```bash + curl "http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=is_creator:true&rows=5&wt=json" | jq '.' + ``` + +5. **Query Solr** (browser): + ``` + http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=is_creator:true&wt=xml&indent=true + ``` + +For detailed testing procedures including how to test a single creator record and all query options, see **`TESTING.md`**. + +## Next Steps + +1. Copy these changes to the arcflow repository +2. Create PR in arcflow using `PR_SUMMARY.md` as description +3. Test with real ArchivesSpace data +4. Once merged, begin Phase 2 in arcuit repository (search exclusion) + +## Questions? + +See `CREATOR_RECORDS_DESIGN.md` for detailed design rationale and `PR_SUMMARY.md` for PR description. 
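For larger agent sets, the single `traject` call in the Quick Test can be split into batches; the commit description mentions 100 files per call. The sketch below only builds the command lists and is not arcflow's actual `index_creators()`. The helper names `chunked` and `traject_commands` are hypothetical, and the flags are the ones used in the examples above.

```python
BATCH_SIZE = 100  # files per traject call, as stated in the commit description


def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def traject_commands(xml_files, solr_url, config_path, batch_size=BATCH_SIZE):
    """Build one traject argument list per batch of XML files.

    Hypothetical helper for illustration: it returns the commands
    instead of running them, so it has no side effects.
    """
    paths = sorted(str(path) for path in xml_files)
    return [
        ["bundle", "exec", "traject", "-u", solr_url, "-i", "xml", "-c", config_path, *batch]
        for batch in chunked(paths, batch_size)
    ]


if __name__ == "__main__":
    # Made-up file names, used only to show the batching.
    files = [f"public/xml/agents/creator_people_{i}.xml" for i in range(250)]
    commands = traject_commands(
        files,
        "http://localhost:8983/solr/blacklight-core",
        "traject_config_eac_cpf.rb",
    )
    for command in commands:
        print(len(command) - 9, "files in this batch")
```

Each list could then be passed to `subprocess.run(command, cwd=arclight_dir, timeout=300)`, where `arclight_dir` is a placeholder for your ArcLight path and 300 seconds matches the five-minute batch timeout in the commit description.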
diff --git a/README.md b/README.md index f6397ac..bc434e6 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,181 @@ # ArcFlow -Code for exporting data from ArchivesSpace to ArcLight, along with additional utility scripts for data handling and transformation. \ No newline at end of file +Code for exporting data from ArchivesSpace to ArcLight, along with additional utility scripts for data handling and transformation. + +## Quick Start + +This directory contains a complete, working installation of arcflow with creator records support. To run it: + +```bash +# 1. Install dependencies +pip install -r requirements.txt + +# 2. Configure credentials +cp .archivessnake.yml.example .archivessnake.yml +nano .archivessnake.yml # Add your ArchivesSpace credentials + +# 3. Set environment variables +export ARCLIGHT_DIR=/path/to/your/arclight-app +export ASPACE_DIR=/path/to/your/archivesspace +export SOLR_URL=http://localhost:8983/solr/blacklight-core + +# 4. Run arcflow +python -m arcflow.main + +``` + +--- + +## Features + +- **Collection Indexing**: Exports EAD XML from ArchivesSpace and indexes to ArcLight Solr +- **Creator Records**: Extracts creator agent information and indexes as standalone documents +- **Biographical Notes**: Injects creator biographical/historical notes into collection EAD XML +- **PDF Generation**: Generates finding aid PDFs via ArchivesSpace jobs +- **Incremental Updates**: Supports modified-since filtering for efficient updates + +## Creator Records + +ArcFlow now generates standalone creator documents in addition to collection records. 
Creator documents: + +- Include biographical/historical notes from ArchivesSpace agent records +- Link to all collections where the creator is listed +- Can be searched and displayed independently in ArcLight +- Are marked with `is_creator: true` to distinguish from collections +- Must be indexed into a Solr instance whose schema defines the creator fields (see Configure Solr Schema below) + +### How Creator Records Work + +1. **Extraction**: `get_all_agents()` fetches all agents from ArchivesSpace +2. **Processing**: `task_agent()` generates an EAC-CPF XML document for each agent with bioghist notes +3. **Linking**: Handled via Solr using the persistent_id field (agents and collections linked through bioghist references) +4. **Indexing**: Creator XML files are indexed to Solr using `traject_config_eac_cpf.rb` + +### Creator Document Format + +Creator documents are stored as XML files in the `agents/` directory using the ArchivesSpace EAC-CPF export: + +```json +{ + "id": "creator_agent_corporate_entities_123", + "record_type": "creator", + "is_creator": true, + "agent_type": "agent_corporate_entities", + "agent_id": 123, + "title": "University Archives", + "creator_sort_name": "University Archives", + "bioghist_html": "<p>Established in 1963...</p>", + "bioghist_text": "Established in 1963...", + "dates": "1963-", + "collection_ids": ["15-0-1234", "15-0-5678"], + "collection_titles": ["Collection A", "Collection B"], + "repository": ["University Library"] +} +``` + +### Indexing Creator Documents + +#### Configure Solr Schema (Required Before Indexing) + +⚠️ **CRITICAL PREREQUISITE** - Before you can index creator records to Solr, you must configure the Solr schema. + +**See [SOLR_SCHEMA.md](SOLR_SCHEMA.md) for complete instructions on:** +- Which fields to add (is_creator, creator_persistent_id, etc.) +- Three methods to add them (Schema API recommended, managed-schema, or schema.xml) +- How to verify they're added +- Troubleshooting "unknown field" errors + +**Quick Schema Setup (Schema API method):** +```bash +# Add is_creator field +curl -X POST -H 'Content-type:application/json' \ + http://localhost:8983/solr/blacklight-core/schema \ + -d '{"add-field": {"name": "is_creator", "type": "boolean", "indexed": true, "stored": true}}' + +# Add other required fields (see SOLR_SCHEMA.md for complete list) +``` + +**Verify schema is configured:** +```bash +curl "http://localhost:8983/solr/blacklight-core/schema/fields/is_creator" +# Should return field definition, not 404 +``` + +⚠️ **If you skip this step, you'll get:** +``` +ERROR: [doc=creator_corporate_entities_584] unknown field 'is_creator' +``` + +This is a **one-time setup** per Solr instance. + +--- + +To index creator documents to Solr: + +```bash +bundle exec traject \ + -u http://localhost:8983/solr/blacklight-core \ + -i xml \ + -c traject_config_eac_cpf.rb \ + /path/to/agents/*.xml +``` + +Or integrate into your ArcFlow deployment workflow. + +## Installation + +See the original installation instructions in your deployment documentation. 
+ +## Configuration + +- `.archivessnake.yml` - ArchivesSpace API credentials +- `.arcflow.yml` - Last update timestamp tracking + +## Usage + +```bash +python -m arcflow.main --arclight-dir /path --aspace-dir /path --solr-url http://... [options] +``` + +### Command Line Options + +Required arguments: +- `--arclight-dir` - Path to ArcLight installation directory +- `--aspace-dir` - Path to ArchivesSpace installation directory +- `--solr-url` - URL of the Solr core (e.g., http://localhost:8983/solr/blacklight-core) + +Optional arguments: +- `--force-update` - Force update of all data (recreates everything from scratch) +- `--traject-extra-config` - Path to extra Traject configuration file +- `--agents-only` - Process only agent records, skip collections (useful for testing agents) +- `--collections-only` - Process only collections (EAD export, PDF finding aids, indexing), skip creators +- `--skip-creator-indexing` - Generate EAC-CPF files but do not index them into Solr +### Examples + +**Normal run (process all collections and agents):** +```bash +python -m arcflow.main \ + --arclight-dir /path/to/arclight \ + --aspace-dir /path/to/archivesspace \ + --solr-url http://localhost:8983/solr/blacklight-core +``` + +**Process only agents (skip collections):** +```bash +python -m arcflow.main \ + --arclight-dir /path/to/arclight \ + --aspace-dir /path/to/archivesspace \ + --solr-url http://localhost:8983/solr/blacklight-core \ + --agents-only +``` + +**Force full update:** +```bash +python -m arcflow.main \ + --arclight-dir /path/to/arclight \ + --aspace-dir /path/to/archivesspace \ + --solr-url http://localhost:8983/solr/blacklight-core \ + --force-update +``` + +See `--help` for all available options. 
\ No newline at end of file diff --git a/TESTING.md b/TESTING.md new file mode 100644 index 0000000..3b3aeb5 --- /dev/null +++ b/TESTING.md @@ -0,0 +1,552 @@ +# Testing Guide: Creator Records Migration + +This guide provides step-by-step instructions for testing the creator records migration using native EAC-CPF format from ArchivesSpace. + +## Table of Contents +1. [Prerequisites](#prerequisites) +2. [Migrating a Single Creator Record](#migrating-a-single-creator-record) +3. [Viewing Creator Records](#viewing-creator-records) +4. [Querying Solr](#querying-solr) +5. [Troubleshooting](#troubleshooting) + +--- + +## Step 0: Configure Solr Schema (PREREQUISITE) + +⚠️ **CRITICAL FIRST STEP** - Before indexing creator records, configure the Solr schema. + +**See the parent directory's `solr/README.md` for:** +- Which fields to add to Solr +- How to manually add them to schema.xml +- How to verify they're added + +**Quick check if schema is configured:** +```bash +curl "http://localhost:8983/solr/blacklight-core/schema/fields/is_creator" +# Should return field definition +``` + +--- + +## Prerequisites + +### Required Software +- Python 3.4.3+ with ArchivesSnake installed +- Ruby 3.4.3+ with Bundler +- Access to ArchivesSpace instance +- Access to Solr instance (typically running on port 8983) +- ArcLight application installed +- Solr schema configured (see Step 0 above) + +### Required Configuration + +1. **ArchivesSpace credentials** (`.archivessnake.yml`): + ```yaml + baseurl: http://your-archivesspace-server:8089 + username: your-username + password: your-password + ``` + +2. **Solr URL**: Note your Solr endpoint, typically: + ``` + http://localhost:8983/solr/blacklight-core + ``` + +3. 
**ArcLight directory**: Path to your ArcLight application (e.g., `/path/to/arclight-app`) + +--- + +## Migrating a Single Creator Record + +### Quick Method: Using the Test Function + +Test a single creator using the built-in test function: + +```bash +cd /path/to/arcuit/arcflow-phase1-revised + +# Set environment variables +export ARCLIGHT_DIR=/path/to/arclight-app +export ASPACE_DIR=/path/to/archivesspace +export SOLR_URL=http://localhost:8983/solr/blacklight-core + +# Test a single creator +python -m arcflow.main test-single-creator \ + --agent-uri /agents/corporate_entities/584 +``` + +This will: +1. Process the specified agent +2. Generate the creator EAC-CPF XML file +3. Link it to collections +4. Show you the output file path and indexing command + +### Step 1: Identify a Test Creator + +Find a creator agent in ArchivesSpace that has a biographical/historical note: + +```bash +# List agents with creator role +curl -u username:password \ + "http://your-archivesspace-server:8089/repositories/2/resources/1" | \ + jq '.linked_agents[] | select(.role == "creator") | .ref' +``` + +Example output: +``` +"/agents/corporate_entities/584" +``` + +### Step 2: Verify Agent Has Bioghist + +Check that the agent has biographical/historical notes: + +```bash +curl -u username:password \ + "http://your-archivesspace-server:8089/agents/corporate_entities/584" | \ + jq '.notes[] | select(.jsonmodel_type == "note_bioghist")' +``` + +If this returns data, the agent is suitable for testing. + +### Step 3: Run ArcFlow for Single Creator + +```bash +cd /path/to/arcuit/arcflow-phase1-revised + +export ARCLIGHT_DIR=/path/to/arclight-app +export ASPACE_DIR=/path/to/archivesspace +export SOLR_URL=http://localhost:8983/solr/blacklight-core + +# Process the specific creator +python -m arcflow.main test-single-creator \ + --agent-uri /agents/corporate_entities/584 +``` + +This processes the creator and shows you the output. 
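The bioghist check from Step 2 can also be scripted. The sketch below mirrors the `jq` filter `.notes[] | select(.jsonmodel_type == "note_bioghist")` in Python. The helper name `bioghist_notes` is hypothetical, and the sample record is a made-up, heavily trimmed stand-in for a real ArchivesSpace agent response.

```python
def bioghist_notes(agent):
    """Return the notes of an agent record whose jsonmodel_type is note_bioghist."""
    return [
        note
        for note in agent.get("notes", [])
        if note.get("jsonmodel_type") == "note_bioghist"
    ]


# Hypothetical sample for illustration only; a real agent record
# returned by the ArchivesSpace API carries many more fields.
sample_agent = {
    "uri": "/agents/corporate_entities/584",
    "notes": [
        {"jsonmodel_type": "note_bioghist", "label": "Example history note"},
        {"jsonmodel_type": "some_other_note", "label": "Placeholder for any other note type"},
    ],
}

if __name__ == "__main__":
    notes = bioghist_notes(sample_agent)
    print(f"{sample_agent['uri']}: {len(notes)} bioghist note(s)")
```

An empty result means the agent has no biographical/historical note and is a poor candidate for this test.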
+ +To process all creators: + +```bash +cd /path/to/arcuit/arcflow-phase1-revised +python -m arcflow.main --force-update +``` + +### Step 4: Locate the Generated Creator EAC-CPF XML + +After ArcFlow completes, check the agents directory: + +```bash +cd /path/to/arclight-app/public/xml/agents + +# List all creator XML files +ls -lh creator_*.xml + +# View a specific creator file +cat creator_corporate_entities_584.xml +``` + +**EAC-CPF format structure:** +```xml + + + + + + corporateBody + + "I" Men's Association + local + + + + +

The "I" Men's Association is composed of alumni...

+
+
+ + + University of Illinois... + + + 1927 Reunion Publications + + +
+
+``` + +**Key Elements:** +- `<control>` - Typically empty from ArchivesSpace +- `<identity>` - Entity type and name +- `<biogHist>` - Biographical/historical note +- `<resourceRelation>` - Links to collections + +### Step 5: Index the Creator to Solr + +Index the creator record to Solr: + +```bash +cd /path/to/arclight-app + +# Index a single creator file +bundle exec traject \ + -u http://localhost:8983/solr/blacklight-core \ + -i xml \ + -c /path/to/arcuit/arcflow-phase1-revised/traject_config_eac_cpf.rb \ + public/xml/agents/creator_corporate_entities_584.xml +``` + +Expected output: +``` +Traject indexer starting id=... +INFO: Using filename-based ID: creator_corporate_entities_584 +Indexed creator: creator_corporate_entities_584 +Committed 1 documents to Solr +``` + +To index all creators: +```bash +bundle exec traject \ + -u http://localhost:8983/solr/blacklight-core \ + -i xml \ + -c /path/to/arcuit/arcflow-phase1-revised/traject_config_eac_cpf.rb \ + public/xml/agents/creator_*.xml +``` + +--- + +## Viewing Creator Records + +View creator records in three ways: + +### Method 1: Direct File Inspection + +View creator data directly: + +```bash +# View a creator EAC-CPF XML file +cat public/xml/agents/creator_corporate_entities_584.xml + +# Or use xmllint for pretty printing +xmllint --format public/xml/agents/creator_corporate_entities_584.xml + +# View specific elements +xmllint --xpath '//identity/nameEntry/part/text()' public/xml/agents/creator_corporate_entities_584.xml +``` + +Example EAC-CPF structure: +```xml + + + + + + corporateBody + + "I" Men's Association + + + + +

The "I" Men's Association is composed of alumni...

+
+
+
+
+``` + +### Method 2: Command-Line Solr Queries (curl) + +Query Solr directly using curl: + +#### Basic Query: All Creator Records +```bash +curl "http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=is_creator:true&rows=10&wt=json" | jq '.' +``` + +#### Query Specific Creator by ID +```bash +curl "http://localhost:8983/solr/blacklight-core/select?q=id:creator_corporate_entities_584&wt=json" | jq '.response.docs[0]' +``` + +#### Search Creators by Name +```bash +curl "http://localhost:8983/solr/blacklight-core/select?q=title:*Association*&fq=is_creator:true&wt=json" | jq '.response.docs' +``` + +#### Get Creator with All Fields +```bash +curl "http://localhost:8983/solr/blacklight-core/select?q=id:creator_corporate_entities_584&wt=json" | jq '.response.docs[0]' +``` + +Example response: +```json +{ + "id": "creator_corporate_entities_584", + "title": ["\"I\" Men's Association"], + "is_creator": true, + "entity_type": "corporateBody", + "agent_type": "corporate_entities", + "agent_id": 584 +} +``` + +#### Count Total Creator Records +```bash +curl "http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=is_creator:true&rows=0&wt=json" | jq '.response.numFound' +``` + +#### Search Bioghist Content +```bash +curl "http://localhost:8983/solr/blacklight-core/select?q=text:established&fq=is_creator:true&fl=id,title&wt=json" | jq '.response.docs' +``` + +### Method 3: Browser-Based Solr Queries + +Open these URLs in your web browser for formatted output: + +#### View All Creator Records +``` +http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=is_creator:true&rows=10&wt=json&indent=true +``` + +#### View Specific Creator +``` +http://localhost:8983/solr/blacklight-core/select?q=id:creator_corporate_entities_584&wt=json&indent=true +``` + +#### Search by Creator Name +``` +http://localhost:8983/solr/blacklight-core/select?q=title:*Association*&fq=is_creator:true&wt=json&indent=true +``` + +#### Browse Creators with Facets +``` 
+http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=is_creator:true&facet=true&facet.field=entity_type&rows=10&wt=json&indent=true +``` + +#### Get Only Specific Fields +``` +http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=is_creator:true&fl=id,title,entity_type&rows=10&wt=json&indent=true +``` + +--- + +## Querying Solr + +### Understanding Solr Query Parameters + +- **`q=*:*`** - Match all documents (use `q=field:value` to search specific fields) +- **`fq=is_creator:true`** - Filter to only creator records +- **`rows=10`** - Return up to 10 results (Solr's default is 10) +- **`fl=id,title`** - Return only specified fields (default is all fields) +- **`wt=json`** - Return JSON format (alternatives: xml, csv) +- **`indent=true`** - Pretty-print JSON output +- **`start=0`** - Pagination offset (start=10 for second page with rows=10) + +### Useful Query Patterns + +#### 1. Find Collections Linked to a Creator + +To find collections created by a specific creator, look for resourceRelation links in the creator's EAC-CPF: + +```bash +# First get the creator's linked collections +curl "http://localhost:8983/solr/blacklight-core/select?q=id:creator_corporate_entities_584&fl=related_resources&wt=json" | jq '.response.docs[0]' + +# Then query for those specific collection IDs (spaces URL-encoded as %20) +curl "http://localhost:8983/solr/blacklight-core/select?q=id:(resource_586%20OR%20resource_123)&wt=json" +``` + +**Note:** Collection links are stored in the creator's `<resourceRelation>` elements in the EAC-CPF XML. + +#### 2. Find All Corporate Entity Creators +```bash +curl "http://localhost:8983/solr/blacklight-core/select?q=entity_type:corporateBody&fq=is_creator:true&rows=20&wt=json" | jq '.response.docs[] | {id, title}' +``` + +#### 3. Full-Text Search in Bioghist +```bash +curl "http://localhost:8983/solr/blacklight-core/select?q=text:university&fq=is_creator:true&fl=id,title&wt=json" | jq '.response.docs[0]' +``` + +#### 4. 
Get Creator with All Fields +```bash +curl "http://localhost:8983/solr/blacklight-core/select?q=id:creator_corporate_entities_584&wt=json&indent=true" | jq '.response.docs[0]' +``` + +#### 5. Verify Creator Records Don't Appear in Standard Searches +This is important for Phase 2 - ensure creators are properly filtered: + +```bash +# Counts documents that are NOT creator records (fq=-is_creator:true); Phase 2 is expected to apply this filter to standard searches +curl "http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=-is_creator:true&rows=0&wt=json" | jq '.response.numFound' +``` + +### Advanced Queries + +#### Wildcard Search +```bash +curl "http://localhost:8983/solr/blacklight-core/select?q=title:*Association*&fq=is_creator:true&wt=json" | jq '.response.docs[] | .title' +``` + +#### Boolean Operators +```bash +# OR +curl "http://localhost:8983/solr/blacklight-core/select?q=title:(Association%20OR%20University)&fq=is_creator:true&wt=json" + +# AND +curl "http://localhost:8983/solr/blacklight-core/select?q=title:Association%20AND%20text:alumni&fq=is_creator:true&wt=json" + +# NOT +curl "http://localhost:8983/solr/blacklight-core/select?q=*:*%20NOT%20entity_type:person&fq=is_creator:true&wt=json" +``` + +--- + +## Troubleshooting + +### Issue: No XML Files Generated + +**Symptom**: The `public/xml/agents/` directory is empty after running ArcFlow. 
+ +**Solutions**: +```bash +# Check ArcFlow logs +tail -f logs/arcflow.log + +# Verify agents have bioghist notes in ArchivesSpace +curl -u username:password \ + "http://your-archivesspace-server:8089/agents/corporate_entities/584" | \ + jq '.notes[] | select(.jsonmodel_type == "note_bioghist")' + +# Check if agents directory exists +ls -la /path/to/arclight-app/public/xml/agents/ +``` + +### Issue: Missing ID Field Error + +**Symptom**: `Document is missing mandatory uniqueKey field: id` + +**Solutions**: +```bash +# Verify using the correct traject config +bundle exec traject \ + -c /path/to/arcuit/arcflow-phase1-revised/traject_config_eac_cpf.rb \ + public/xml/agents/creator_*.xml + +# Check filename format (should start with creator_) +ls public/xml/agents/creator_*.xml +``` + +### Issue: Traject Indexing Fails + +**Symptom**: Error when running `bundle exec traject` + +**Solutions**: +```bash +# Verify Solr is running +curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json" + +# Validate XML file +xmllint --noout public/xml/agents/creator_*.xml + +# Verify Solr schema has required fields +curl "http://localhost:8983/solr/blacklight-core/schema/fields/is_creator" +``` + +### Issue: No Results in Solr + +**Symptom**: Queries return 0 results even after indexing + +**Possible causes**: +1. Documents not committed to Solr +2. Wrong Solr core/collection +3. 
Indexing to different Solr than querying + +**Solutions**: +```bash +# Force commit in Solr +curl "http://localhost:8983/solr/blacklight-core/update?commit=true" + +# Check which cores exist +curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json" | jq '.status | keys' + +# Verify documents were indexed +curl "http://localhost:8983/solr/blacklight-core/select?q=*:*&rows=0&wt=json" | jq '.response.numFound' + +# Check specifically for creator records +curl "http://localhost:8983/solr/blacklight-core/select?q=is_creator:true&rows=0&wt=json" | jq '.response.numFound' +``` + +### Issue: Missing Fields in Solr + +**Symptom**: Some fields are missing when querying Solr + +**Possible causes**: +1. Fields not defined in Solr schema +2. Traject config not mapping fields correctly +3. Source data missing from ArchivesSpace + +**Solutions**: +```bash +# Check which fields exist for a document +curl "http://localhost:8983/solr/blacklight-core/select?q=id:creator_corporate_entities_584&wt=json" | jq '.response.docs[0] | keys' + +# View Solr schema for creator-related fields +curl "http://localhost:8983/solr/blacklight-core/schema/fields?wt=json" | jq '.fields[] | select(.name | contains("creator") or . == "is_creator")' + +# Check source XML has the data +xmllint --xpath '//identity/nameEntry/part/text()' public/xml/agents/creator_corporate_entities_584.xml +``` + +### Issue: Creator Records Appear in Standard Searches + +**Note**: This is expected behavior for Phase 1. Phase 2 will add search exclusion in Arcuit. 
+ +To manually filter creators from searches: +```bash +curl "http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=-is_creator:true&wt=json" +``` + +--- + +## Verification Checklist + +Use this checklist to verify your creator records migration: + +- [ ] ArcFlow runs without errors +- [ ] EAC-CPF XML files are created in the `public/xml/agents/` directory +- [ ] XML files have correct structure (control, cpfDescription, identity, biogHist, relations) +- [ ] Filename format is correct (e.g., `creator_corporate_entities_584.xml`) +- [ ] Traject indexing completes successfully with `traject_config_eac_cpf.rb` +- [ ] Solr query returns creator records: `curl "http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=is_creator:true&rows=1&wt=json"` +- [ ] Creator has `is_creator: true` field +- [ ] Creator has `entity_type` field (corporateBody, person, or family) +- [ ] Creator name is indexed in `title` field +- [ ] Bioghist content is searchable: `curl "http://localhost:8983/solr/blacklight-core/select?q=text:*&fq=is_creator:true&rows=1&wt=json"` +- [ ] Related resources are captured (if present in `<resourceRelation>` elements) +- [ ] All expected fields are present in Solr document + +--- + +## Next Steps + +After verifying creator records are indexed: + +1. **Phase 2**: Implement search exclusion in Arcuit to filter `is_creator:true` from standard searches +2. **Phase 3**: Create creator show page in Arcuit to display creator records +3. **Phases 4-7**: Add UI enhancements (search dropdown, links from collections, etc.) 
+ +--- + +## Additional Resources + +- **Solr Configuration**: See `../solr/README.md` for schema setup +- **Solr Documentation**: https://solr.apache.org/guide/ +- **ArcLight Documentation**: https://github.com/projectblacklight/arclight +- **EAC-CPF Standard**: https://eac.staatsbibliothek-berlin.de/ +- **ArchivesSnake Documentation**: https://github.com/archivesspace-labs/ArchivesSnake diff --git a/arcflow/main.py b/arcflow/main.py index a8621fa..74dd7f3 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -9,12 +9,14 @@ import re import logging import math +import sys from xml.dom.pulldom import parse, START_ELEMENT from xml.sax.saxutils import escape as xml_escape +from xml.etree import ElementTree as ET from datetime import datetime, timezone from asnake.client import ASnakeClient from multiprocessing.pool import ThreadPool as Pool -from utils.stage_classifications import extract_labels +from .utils.stage_classifications import extract_labels base_dir = os.path.abspath((__file__) + "/../../") @@ -38,7 +40,7 @@ class ArcFlow: """ - def __init__(self, arclight_dir, aspace_dir, solr_url, traject_extra_config='', force_update=False): + def __init__(self, arclight_dir, aspace_dir, solr_url, traject_extra_config='', force_update=False, agents_only=False, collections_only=False, arcuit_dir=None, skip_creator_indexing=False): self.solr_url = solr_url self.batch_size = 1000 self.traject_extra_config = f'-c {traject_extra_config}' if traject_extra_config.strip() else '' @@ -46,6 +48,10 @@ def __init__(self, arclight_dir, aspace_dir, solr_url, traject_extra_config='', self.aspace_jobs_dir = f'{aspace_dir}/data/shared/job_files' self.job_type = 'print_to_pdf_job' self.force_update = force_update + self.agents_only = agents_only + self.collections_only = collections_only + self.arcuit_dir = arcuit_dir + self.skip_creator_indexing = skip_creator_indexing self.log = logging.getLogger('arcflow') self.pid = os.getpid() self.pid_file_path = os.path.join(base_dir, 
'arcflow.pid') @@ -395,6 +401,7 @@ def update_eads(self): pdf_dir = f'{self.arclight_dir}/public/pdf' modified_since = int(self.last_updated.timestamp()) + if self.force_update or modified_since <= 0: modified_since = 0 # delete all EADs and related files in ArcLight Solr @@ -454,7 +461,7 @@ def update_eads(self): # Tasks for indexing pending resources results_3 = [pool.apply_async( - self.index, + self.index_collections, args=(repo_id, f'{xml_dir}/{repo_id}_*_batch_{batch_num}.xml', indent_size)) for repo_id, batch_num in batches] @@ -527,22 +534,60 @@ def update_eads(self): page += 1 - def index(self, repo_id, xml_file_path, indent_size=0): + def index_collections(self, repo_id, xml_file_path, indent_size=0): + """Index collection XML files to Solr using traject.""" indent = ' ' * indent_size self.log.info(f'{indent}Indexing pending resources in repository ID {repo_id} to ArcLight Solr...') try: + # Get arclight traject config path + result_show = subprocess.run( + ['bundle', 'show', 'arclight'], + capture_output=True, + text=True, + cwd=self.arclight_dir + ) + arclight_path = result_show.stdout.strip() if result_show.returncode == 0 else '' + + if not arclight_path: + self.log.error(f'{indent}Could not find arclight gem path') + return + + traject_config = f'{arclight_path}/lib/arclight/traject/ead2_config.rb' + + cmd = [ + 'bundle', 'exec', 'traject', + '-u', self.solr_url, + '-s', 'processing_thread_pool=8', + '-s', 'solr_writer.thread_pool=8', + '-s', f'solr_writer.batch_size={self.batch_size}', + '-s', 'solr_writer.commit_on_close=true', + '-i', 'xml', + '-c', traject_config + ] + + if self.traject_extra_config: + cmd.extend(self.traject_extra_config.split()) + + cmd.append(xml_file_path) + + env = os.environ.copy() + env['REPOSITORY_ID'] = str(repo_id) + result = subprocess.run( - f'REPOSITORY_ID={repo_id} bundle exec traject -u {self.solr_url} -s processing_thread_pool=8 -s solr_writer.thread_pool=8 -s solr_writer.batch_size={self.batch_size} -s 
solr_writer.commit_on_close=true -i xml -c $(bundle show arclight)/lib/arclight/traject/ead2_config.rb {self.traject_extra_config} {xml_file_path}', -# f'FILE={xml_file_path} SOLR_URL={self.solr_url} REPOSITORY_ID={repo_id} TRAJECT_SETTINGS="processing_thread_pool=8 solr_writer.thread_pool=8 solr_writer.batch_size=1000 solr_writer.commit_on_close=false" bundle exec rake arcuit:index', - shell=True, + cmd, cwd=self.arclight_dir, - stderr=subprocess.PIPE,) - self.log.error(f'{indent}{result.stderr.decode("utf-8")}') + env=env, + capture_output=True, + text=True + ) + + if result.stderr: + self.log.error(f'{indent}{result.stderr}') if result.returncode != 0: self.log.error(f'{indent}Failed to index pending resources in repository ID {repo_id} to ArcLight Solr. Return code: {result.returncode}') else: self.log.info(f'{indent}Finished indexing pending resources in repository ID {repo_id} to ArcLight Solr.') - except subprocess.CalledProcessError as e: + except Exception as e: self.log.error(f'{indent}Error indexing pending resources in repository ID {repo_id} to ArcLight Solr: {e}') @@ -625,6 +670,438 @@ def get_creator_bioghist(self, resource, indent_size=0): return None + def get_all_agents(self, agent_types=None, modified_since=0, indent_size=0): + """ + Fetch ALL agents from ArchivesSpace (not just creators). + Uses direct agent API endpoints for comprehensive coverage. + + Args: + agent_types: List of agent types to fetch. 
Default: ['corporate_entities', 'people', 'families'] + modified_since: Unix timestamp to filter agents modified since this time (if API supports it) + indent_size: Indentation size for logging + + Returns: + set: Set of agent URIs (e.g., '/agents/corporate_entities/123') + """ + if agent_types is None: + agent_types = ['corporate_entities', 'people', 'families'] + + indent = ' ' * indent_size + all_agents = set() + + self.log.info(f'{indent}Fetching ALL agents from ArchivesSpace...') + + for agent_type in agent_types: + try: + # Try with modified_since parameter first + params = {'all_ids': True} + if modified_since > 0: + params['modified_since'] = modified_since + + response = self.client.get(f'/agents/{agent_type}', params=params) + agent_ids = response.json() + + self.log.info(f'{indent}Found {len(agent_ids)} {agent_type} agents') + + # Add agent URIs to set + for agent_id in agent_ids: + agent_uri = f'/agents/{agent_type}/{agent_id}' + all_agents.add(agent_uri) + + except Exception as e: + self.log.error(f'{indent}Error fetching {agent_type} agents: {e}') + # If modified_since fails, try without it + if modified_since > 0: + self.log.warning(f'{indent}Retrying {agent_type} without modified_since filter...') + try: + response = self.client.get(f'/agents/{agent_type}', params={'all_ids': True}) + agent_ids = response.json() + self.log.info(f'{indent}Found {len(agent_ids)} {agent_type} agents (no date filter)') + for agent_id in agent_ids: + agent_uri = f'/agents/{agent_type}/{agent_id}' + all_agents.add(agent_uri) + except Exception as e2: + self.log.error(f'{indent}Failed to fetch {agent_type} agents: {e2}') + + self.log.info(f'{indent}Found {len(all_agents)} total agents across all types.') + return all_agents + + + def task_agent(self, agent_uri, agents_dir, repo_id=1, indent_size=0): + """ + Process a single agent and generate a creator document in EAC-CPF XML format. + Retrieves EAC-CPF directly from ArchivesSpace archival_contexts endpoint. 
+ + Args: + agent_uri: Agent URI from ArchivesSpace (e.g., '/agents/corporate_entities/123') + agents_dir: Directory to save agent XML files + repo_id: Repository ID to use for archival_contexts endpoint (default: 1) + indent_size: Indentation size for logging + + Returns: + str: Creator document ID if successful, None otherwise + """ + indent = ' ' * indent_size + + try: + # Parse agent URI to extract type and ID + # URI format: /agents/{agent_type}/{id} + parts = agent_uri.strip('/').split('/') + if len(parts) != 3 or parts[0] != 'agents': + self.log.error(f'{indent}Invalid agent URI format: {agent_uri}') + return None + + agent_type = parts[1] # e.g., 'corporate_entities', 'people', 'families' + agent_id = parts[2] + + # Construct EAC-CPF endpoint + # Format: /repositories/{repo_id}/archival_contexts/{agent_type}/{id}.xml + eac_cpf_endpoint = f'/repositories/{repo_id}/archival_contexts/{agent_type}/{agent_id}.xml' + + self.log.debug(f'{indent}Fetching EAC-CPF from: {eac_cpf_endpoint}') + + # Fetch EAC-CPF XML + response = self.client.get(eac_cpf_endpoint) + + if response.status_code != 200: + self.log.error(f'{indent}Failed to fetch EAC-CPF for {agent_uri}: HTTP {response.status_code}') + return None + + eac_cpf_xml = response.text + eac_cpf_xml = response.text + + # Parse the EAC-CPF XML to extract key information + try: + root = ET.fromstring(eac_cpf_xml) + except ET.ParseError as e: + self.log.error(f'{indent}Failed to parse EAC-CPF XML for {agent_uri}: {e}') + return None + + # Generate creator ID + creator_id = f'creator_{agent_type}_{agent_id}' + + # Save EAC-CPF XML to file + filename = f'{agents_dir}/{creator_id}.xml' + with open(filename, 'w', encoding='utf-8') as f: + f.write(eac_cpf_xml) + + self.log.info(f'{indent}Created creator document: {creator_id}') + return creator_id + + except Exception as e: + self.log.error(f'{indent}Error processing agent {agent_uri}: {e}') + import traceback + self.log.error(f'{indent}{traceback.format_exc()}') + return 
None + + + def update_creator_collection_links(self, agents_dir, indent_size=0): + """ + Update creator documents with links to their associated collections. + Scans all resources to build agent -> collections mapping, then updates creator XML files. + + Args: + agents_dir: Directory containing agent XML files + indent_size: Indentation size for logging + """ + indent = ' ' * indent_size + + # Build mapping of agent_uri -> [collection info] + self.log.info(f'{indent}Building agent-collection linkage map...') + agent_collections = {} + + repos = self.client.get('repositories').json() + for repo in repos: + repo_id = self.get_repo_id(repo) + resources = self.client.get( + f'{repo["uri"]}/resources', + params={'all_ids': True} + ).json() + + self.log.info(f'{indent}Processing {len(resources)} resources in repository ID {repo_id}...') + + for resource_id in resources: + try: + resource = self.client.get( + f'{repo["uri"]}/resources/{resource_id}', + params={'resolve': ['linked_agents']} + ).json() + + # Only process published resources + if not resource.get('publish') or resource.get('suppressed'): + continue + + ead_id = resource.get('ead_id', '').replace('.', '-') + + if 'linked_agents' in resource: + for linked_agent in resource['linked_agents']: + if linked_agent.get('role') == 'creator': + agent_ref = linked_agent.get('ref') + if agent_ref: + if agent_ref not in agent_collections: + agent_collections[agent_ref] = [] + agent_collections[agent_ref].append({ + 'ead_id': ead_id, + 'title': resource.get('title', 'Untitled'), + 'repository': repo.get('name', '') + }) + except Exception as e: + self.log.error(f'{indent}Error fetching resource {resource_id}: {e}') + + # Update creator documents with collection links + self.log.info(f'{indent}Updating creator documents with collection links...') + updated_count = 0 + + for xml_file in os.listdir(agents_dir): + if xml_file.endswith('.xml'): + filepath = os.path.join(agents_dir, xml_file) + try: + # Parse XML file + tree = 
ET.parse(filepath) + root = tree.getroot() + + # Find agent URI from controlaccess + agent_uri = None + controlaccess = root.find('.//controlaccess') + if controlaccess is not None: + for name_elem in controlaccess.findall('.//*[@identifier]'): + agent_uri = name_elem.get('identifier') + break + + if not agent_uri: + self.log.warning(f'{indent}Could not find agent URI in {xml_file}') + continue + + if agent_uri in agent_collections: + collections = agent_collections[agent_uri] + + # Find or create relatedmaterial section for collections + archdesc = root.find('.//archdesc') + if archdesc is None: + self.log.warning(f'{indent}No archdesc found in {xml_file}') + continue + + # Remove existing collection relatedmaterial if present + for rm in archdesc.findall('relatedmaterial[@type="collections"]'): + archdesc.remove(rm) + + # Add new relatedmaterial section for collections + relatedmaterial = ET.SubElement(archdesc, 'relatedmaterial') + relatedmaterial.set('type', 'collections') + head = ET.SubElement(relatedmaterial, 'head') + head.text = 'Related Collections' + + # Add each collection + for collection in collections: + item = ET.SubElement(relatedmaterial, 'item') + item.text = collection['title'] + item.set('ead_id', collection['ead_id']) + item.set('repository', collection['repository']) + + # Save updated XML + ET.indent(tree, space=' ') + tree.write(filepath, encoding='utf-8', xml_declaration=True) + + updated_count += 1 + creator_id = xml_file.replace('.xml', '') + self.log.info(f'{indent}Updated {creator_id} with {len(collections)} collection links') + + except Exception as e: + self.log.error(f'{indent}Error updating {xml_file}: {e}') + + self.log.info(f'{indent}Updated {updated_count} creator documents with collection links.') + + + def process_creators(self, agents_dir, modified_since=0, agent_uri=None, indent_size=0): + """ + Process creator agents and generate standalone creator documents. 
+ + Args: + agents_dir: Directory to save agent XML files + modified_since: Unix timestamp to filter agents modified since this time + agent_uri: Optional. If provided, process only this single agent (for testing) + indent_size: Indentation size for logging + + Returns: + list: List of created creator document IDs + """ + indent = ' ' * indent_size + self.log.info(f'{indent}Processing creator agents...') + + # Create agents directory if it doesn't exist + os.makedirs(agents_dir, exist_ok=True) + + # Get agents to process + if agent_uri: + # Single agent mode (for testing) + self.log.info(f'{indent}Single agent mode: processing {agent_uri}') + agents = {agent_uri} + else: + # Get ALL agents (not just creators) + agents = self.get_all_agents(modified_since=modified_since, indent_size=indent_size) + + # Process agents in parallel + with Pool(processes=10) as pool: + results_agents = [pool.apply_async( + self.task_agent, + args=(agent_uri_item, agents_dir, 1, indent_size)) # Use repo_id=1 + for agent_uri_item in agents] + + creator_ids = [r.get() for r in results_agents] + creator_ids = [cid for cid in creator_ids if cid is not None] + + self.log.info(f'{indent}Created {len(creator_ids)} creator documents.') + + # NOTE: Collection links are NOT added to creator XML files. + # Instead, linking is handled via Solr using the persistent_id field: + # - Creator bioghist has persistent_id as the 'id' attribute + # - Collection EADs reference creators via bioghist with persistent_id + # - Solr indexes both, allowing queries to link them + # This avoids the expensive operation of scanning all resources to build a linkage map. 
+ + # Index creators to Solr (if not skipped) + if not self.skip_creator_indexing and creator_ids: + self.log.info(f'{indent}Indexing {len(creator_ids)} creator records to Solr...') + traject_config = self.find_traject_config() + if traject_config: + indexed = self.index_creators(agents_dir, creator_ids) + self.log.info(f'{indent}Creator indexing complete: {indexed}/{len(creator_ids)} indexed') + else: + self.log.info(f'{indent}Skipping creator indexing (traject config not found)') + self.log.info(f'{indent}To index manually:') + self.log.info(f'{indent} cd {self.arclight_dir}') + self.log.info(f'{indent} bundle exec traject -u {self.solr_url} -i xml \\') + self.log.info(f'{indent} -c /path/to/arcuit/arcflow-phase1-revised/traject_config_eac_cpf.rb \\') + self.log.info(f'{indent} {agents_dir}/*.xml') + elif self.skip_creator_indexing: + self.log.info(f'{indent}Skipping creator indexing (--skip-creator-indexing flag set)') + + return creator_ids + + + def find_traject_config(self): + """ + Find the traject config for creator indexing. + + Tries: + 1. bundle show arcuit (finds installed gem) + 2. self.arcuit_dir (explicit path) + 3. 
Returns None if neither works + + Returns: + str: Path to traject config, or None if not found + """ + # Try bundle show arcuit first + try: + result = subprocess.run( + ['bundle', 'show', 'arcuit'], + cwd=self.arclight_dir, + capture_output=True, + text=True, + timeout=10 + ) + if result.returncode == 0: + arcuit_path = result.stdout.strip() + traject_config = f'{arcuit_path}/arcflow-phase1-revised/traject_config_eac_cpf.rb' + if os.path.exists(traject_config): + self.log.info(f'Found traject config via bundle show: {traject_config}') + return traject_config + else: + self.log.warning(f'bundle show arcuit succeeded but traject config not found at expected path') + else: + self.log.debug('bundle show arcuit failed (gem not installed?)') + except Exception as e: + self.log.debug(f'Error running bundle show arcuit: {e}') + + # Fall back to arcuit_dir if provided + if self.arcuit_dir: + traject_config = f'{self.arcuit_dir}/arcflow-phase1-revised/traject_config_eac_cpf.rb' + if os.path.exists(traject_config): + self.log.info(f'Using traject config from arcuit_dir: {traject_config}') + return traject_config + else: + self.log.warning(f'arcuit_dir provided but traject config not found: {traject_config}') + + # No config found + self.log.warning('Could not find traject config (bundle show arcuit failed and arcuit_dir not provided)') + return None + + + def index_creators(self, agents_dir, creator_ids, batch_size=100): + """ + Index creator XML files to Solr using traject. 
+ + Args: + agents_dir: Directory containing creator XML files + creator_ids: List of creator IDs to index + batch_size: Number of files to index per traject call (default: 100) + + Returns: + int: Number of successfully indexed creators + """ + traject_config = self.find_traject_config() + if not traject_config: + return 0 + + indexed_count = 0 + failed_count = 0 + + # Process in batches to avoid command line length limits + total_batches = math.ceil(len(creator_ids) / batch_size) + for i in range(0, len(creator_ids), batch_size): + batch = creator_ids[i:i+batch_size] + batch_num = (i // batch_size) + 1 + + # Build list of XML files for this batch + xml_files = [f'{agents_dir}/{cid}.xml' for cid in batch] + + # Filter to only existing files + existing_files = [f for f in xml_files if os.path.exists(f)] + + if not existing_files: + self.log.warning(f' Batch {batch_num}/{total_batches}: No files found, skipping') + continue + + try: + cmd = [ + 'bundle', 'exec', 'traject', + '-u', self.solr_url, + '-i', 'xml', + '-c', traject_config + ] + existing_files + + self.log.info(f' Indexing batch {batch_num}/{total_batches}: {len(existing_files)} files') + + result = subprocess.run( + cmd, + cwd=self.arclight_dir, + capture_output=True, + text=True, + timeout=300 # 5 minute timeout per batch + ) + + if result.returncode == 0: + indexed_count += len(existing_files) + self.log.info(f' Successfully indexed {len(existing_files)} creators') + else: + failed_count += len(existing_files) + self.log.error(f' Traject failed with exit code {result.returncode}') + if result.stderr: + self.log.error(f' STDERR: {result.stderr}') + + except subprocess.TimeoutExpired: + self.log.error(f' Traject timed out for batch {batch_num}/{total_batches}') + failed_count += len(existing_files) + except Exception as e: + self.log.error(f' Error indexing batch {batch_num}/{total_batches}: {e}') + failed_count += len(existing_files) + + if failed_count > 0: + self.log.warning(f'Creator indexing 
completed with errors: {indexed_count} succeeded, {failed_count} failed') + + return indexed_count + + def get_repo_id(self, repo): """ Get the repository ID from the repository URI. @@ -753,11 +1230,28 @@ def run(self): Run the ArcFlow process. """ self.log.info(f'ArcFlow process started (PID: {self.pid}).') - self.update_repositories() - self.update_eads() + + # Update repositories (unless agents-only mode) + if not self.agents_only: + self.update_repositories() + + # Update collections/EADs (unless agents-only mode) + if not self.agents_only: + self.update_eads() + + # Update creator records (unless collections-only mode) + if not self.collections_only: + xml_dir = f'{self.arclight_dir}/public/xml' + agents_dir = f'{xml_dir}/agents' + modified_since = int(self.last_updated.timestamp()) + indent_size = 0 + self.process_creators(agents_dir, modified_since=modified_since, indent_size=indent_size) + self.save_config_file() self.log.info(f'ArcFlow process completed (PID: {self.pid}). Elapsed time: {time.strftime("%H:%M:%S", time.gmtime(int(time.time()) - self.start_time))}.') + + def main(): parser = argparse.ArgumentParser(description='ArcFlow') @@ -781,16 +1275,155 @@ def main(): '--traject-extra-config', default='', help='Path to extra Traject configuration file',) + parser.add_argument( + '--agents-only', + action='store_true', + help='Process only agent records, skip collections (for testing)',) + parser.add_argument( + '--collections-only', + action='store_true', + help='Process only repositories and collections, skip creator processing',) + parser.add_argument( + '--arcuit-dir', + default=None, + help='Path to arcuit repository (for traject config). 
If not provided, will try bundle show arcuit.',) + parser.add_argument( + '--skip-creator-indexing', + action='store_true', + help='Generate creator XML files but skip Solr indexing (for testing)',) args = parser.parse_args() + + # Validate mutually exclusive flags + if args.agents_only and args.collections_only: + parser.error('Cannot use both --agents-only and --collections-only') arcflow = ArcFlow( arclight_dir=args.arclight_dir, aspace_dir=args.aspace_dir, solr_url=args.solr_url, traject_extra_config=args.traject_extra_config, - force_update=args.force_update) + force_update=args.force_update, + agents_only=args.agents_only, + collections_only=args.collections_only, + arcuit_dir=args.arcuit_dir, + skip_creator_indexing=args.skip_creator_indexing) arcflow.run() +def test_single_creator(): + """ + Test function to process a single creator record. + + Usage: + python -c "from arcflow.main import test_single_creator; test_single_creator()" \ + --agent-uri /agents/corporate_entities/123 \ + --arclight-dir /path/to/arclight \ + --aspace-dir /path/to/archivesspace + + Or add to a separate test script. 
+ """ + parser = argparse.ArgumentParser( + description='Test single creator record processing', + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + # Test a single creator agent + python -m arcflow.main test-single-creator \\ + --agent-uri /agents/corporate_entities/123 \\ + --arclight-dir /path/to/arclight-app \\ + --aspace-dir /path/to/archivesspace + + # With environment variables + export ARCLIGHT_DIR=/path/to/arclight-app + export ASPACE_DIR=/path/to/archivesspace + python -m arcflow.main test-single-creator \\ + --agent-uri /agents/people/456 + """) + + parser.add_argument( + '--agent-uri', + required=True, + help='Agent URI to process (e.g., /agents/corporate_entities/123)',) + parser.add_argument( + '--arclight-dir', + default=os.environ.get('ARCLIGHT_DIR'), + help='Path to ArcLight installation directory (default: $ARCLIGHT_DIR)',) + parser.add_argument( + '--aspace-dir', + default=os.environ.get('ASPACE_DIR'), + help='Path to ArchivesSpace installation directory (default: $ASPACE_DIR)',) + parser.add_argument( + '--solr-url', + default=os.environ.get('SOLR_URL', 'http://localhost:8983/solr/blacklight-core'), + help='URL of the Solr core (default: $SOLR_URL or http://localhost:8983/solr/blacklight-core)',) + parser.add_argument( + '--arcuit-dir', + default=os.environ.get('ARCUIT_DIR'), + help='Path to arcuit repository (for traject config). 
If not provided, will try bundle show arcuit.',) + parser.add_argument( + '--skip-creator-indexing', + action='store_true', + help='Generate creator XML files but skip Solr indexing',) + + args = parser.parse_args() + + # Validate required arguments + if not args.arclight_dir: + parser.error('--arclight-dir is required (or set $ARCLIGHT_DIR)') + if not args.aspace_dir: + parser.error('--aspace-dir is required (or set $ASPACE_DIR)') + + print(f'Testing single creator: {args.agent_uri}') + print(f'ArcLight directory: {args.arclight_dir}') + print(f'ArchivesSpace directory: {args.aspace_dir}') + print(f'Solr URL: {args.solr_url}') + print() + + # Create ArcFlow instance (without running full process) + arcflow = ArcFlow( + arclight_dir=args.arclight_dir, + aspace_dir=args.aspace_dir, + solr_url=args.solr_url, + traject_extra_config='', + force_update=False, + arcuit_dir=args.arcuit_dir, + skip_creator_indexing=args.skip_creator_indexing) + + # Process single creator + agents_dir = f'{args.arclight_dir}/public/xml/agents' + creator_ids = arcflow.process_creators( + agents_dir=agents_dir, + modified_since=0, + agent_uri=args.agent_uri, + indent_size=0) + + if creator_ids: + print(f'\nSuccess! Created {len(creator_ids)} creator document(s):') + for creator_id in creator_ids: + xml_file = f'{agents_dir}/{creator_id}.xml' + print(f' - {xml_file}') + + if args.skip_creator_indexing: + print(f'\nTo index to Solr:') + print(f' cd {args.arclight_dir}') + print(f' bundle exec traject -u {args.solr_url} -i xml \\') + print(f' -c /path/to/arcuit/arcflow-phase1-revised/traject_config_eac_cpf.rb \\') + print(f' public/xml/agents/{creator_ids[0]}.xml') + else: + print(f'\nIndexing was handled automatically (or check logs if it was skipped)') + else: + print('\nNo creator documents created. 
Check that the agent has biographical notes.') + + # Clean up PID file + if os.path.exists(arcflow.pid_file_path): + os.remove(arcflow.pid_file_path) + + if __name__ == '__main__': - main() \ No newline at end of file + # Check if we're running a subcommand + if len(sys.argv) > 1 and sys.argv[1] == 'test-single-creator': + # Remove the subcommand from argv so argparse works correctly + sys.argv.pop(1) + test_single_creator() + else: + main() \ No newline at end of file diff --git a/traject_config_eac_cpf.rb b/traject_config_eac_cpf.rb new file mode 100644 index 0000000..76c2df1 --- /dev/null +++ b/traject_config_eac_cpf.rb @@ -0,0 +1,271 @@ +# Traject configuration for indexing EAC-CPF creator records to Solr +# +# This config file processes EAC-CPF (Encoded Archival Context - Corporate Bodies, +# Persons, and Families) XML documents from ArchivesSpace archival_contexts endpoint. +# +# Usage: +# bundle exec traject -u $SOLR_URL -c traject_config_eac_cpf.rb /path/to/agents/*.xml +# +# The EAC-CPF XML documents are retrieved directly from ArchivesSpace via: +# /repositories/{repo_id}/archival_contexts/{agent_type}/{id}.xml + +require 'traject' +require 'traject_plus' +require 'traject_plus/macros' + +# Use TrajectPlus macros (provides extract_xpath and other helpers) +extend TrajectPlus::Macros + +settings do + provide "solr.url", ENV['SOLR_URL'] || "http://localhost:8983/solr/blacklight-core" + provide "solr_writer.commit_on_close", "true" + provide "solr_writer.thread_pool", "8" + provide "solr_writer.batch_size", "100" + provide "processing_thread_pool", "4" + + # Use NokogiriReader for XML processing + provide "reader_class_name", "Traject::NokogiriReader" +end + +# Each record from reader +each_record do |record, context| + context.clipboard[:is_creator] = true +end + +# Core identity field +# CRITICAL: The 'id' field is required by Solr's schema (uniqueKey) +# Must ensure this field is never empty or indexing will fail +# +# IMPORTANT: Real EAC-CPF from 
ArchivesSpace has empty <recordId> element! +# Cannot rely on recordId being present. Must extract from filename or generate. +to_field 'id' do |record, accumulator, context| + # Try 1: Extract from control/recordId (if present) + record_id = record.xpath('//eac-cpf/control/recordId', + 'eac-cpf' => 'urn:isbn:1-931666-33-4').first + record_id ||= record.xpath('//control/recordId').first + + if record_id && !record_id.text.strip.empty? + accumulator << record_id.text.strip + else + # Try 2: Extract from source filename (most reliable for ArchivesSpace exports) + # Filename format: creator_corporate_entities_584.xml or similar + source_file = context.source_record_id || context.input_name + if source_file + # Remove .xml extension and any path + id_from_filename = File.basename(source_file, '.xml') + # Check if it looks valid (starts with creator_ or agent_) + if id_from_filename =~ /^(creator_|agent_)/ + accumulator << id_from_filename + context.logger.info("Using filename-based ID: #{id_from_filename}") + else + # Try 3: Generate from entity type and name + entity_type = record.xpath('//identity/entityType').first&.text&.strip + name_entry = record.xpath('//identity/nameEntry/part').first&.text&.strip + + if entity_type && name_entry + # Create stable ID from type and name + type_short = case entity_type + when 'corporateBody' then 'corporate' + when 'person' then 'person' + when 'family' then 'family' + else 'entity' + end + name_id = name_entry.gsub(/[^a-z0-9]/i, '_').downcase[0..50] # Limit length + generated_id = "creator_#{type_short}_#{name_id}" + accumulator << generated_id + context.logger.warn("Generated ID from name: #{generated_id}") + else + # Last resort: timestamp-based unique ID + fallback_id = "creator_unknown_#{Time.now.to_i}_#{rand(10000)}" + accumulator << fallback_id + context.logger.error("Using fallback ID: #{fallback_id}") + end + end + else + # No filename available, generate from name + entity_type = 
record.xpath('//identity/entityType').first&.text&.strip + name_entry = record.xpath('//identity/nameEntry/part').first&.text&.strip + + if entity_type && name_entry + type_short = case entity_type + when 'corporateBody' then 'corporate' + when 'person' then 'person' + when 'family' then 'family' + else 'entity' + end + name_id = name_entry.gsub(/[^a-z0-9]/i, '_').downcase[0..50] + generated_id = "creator_#{type_short}_#{name_id}" + accumulator << generated_id + context.logger.warn("Generated ID from name: #{generated_id}") + else + # Absolute last resort + fallback_id = "creator_unknown_#{Time.now.to_i}_#{rand(10000)}" + accumulator << fallback_id + context.logger.error("Using fallback ID: #{fallback_id}") + end + end + end +end + +# Add is_creator marker field +to_field 'is_creator' do |record, accumulator| + accumulator << 'true' +end + +# Record type +to_field 'record_type' do |record, accumulator| + accumulator << 'creator' +end + +# Entity type (corporateBody, person, family) +to_field 'entity_type' do |record, accumulator| + entity = record.xpath('//cpfDescription/identity/entityType', + 'eac-cpf' => 'urn:isbn:1-931666-33-4').first + if entity + accumulator << entity.text + else + # Fallback without namespace + entity = record.xpath('//identity/entityType').first + accumulator << entity.text if entity + end +end + +# Title/name fields - from authorized form of name +to_field 'title' do |record, accumulator| + # Try with namespace + name = record.xpath('//cpfDescription/identity/nameEntry/part', + 'eac-cpf' => 'urn:isbn:1-931666-33-4') + if name.any? + accumulator << name.map(&:text).join(' ') + else + # Fallback without namespace + name = record.xpath('//identity/nameEntry/part') + accumulator << name.map(&:text).join(' ') if name.any? + end +end + +to_field 'title_display' do |record, accumulator| + name = record.xpath('//identity/nameEntry/part') + accumulator << name.map(&:text).join(' ') if name.any? 
+end + +to_field 'title_sort' do |record, accumulator| + name = record.xpath('//identity/nameEntry/part') + if name.any? + text = name.map(&:text).join(' ') + accumulator << text.gsub(/^(a|an|the)\s+/i, '').downcase + end +end + +# Dates of existence +to_field 'dates' do |record, accumulator| + # Try existDates element + dates = record.xpath('//existDates/dateRange/fromDate | //existDates/dateRange/toDate | //existDates/date') + if dates.any? + from_date = record.xpath('//existDates/dateRange/fromDate').first + to_date = record.xpath('//existDates/dateRange/toDate').first + + if from_date || to_date + from_text = from_date ? from_date.text : '' + to_text = to_date ? to_date.text : '' + accumulator << "#{from_text}-#{to_text}".gsub(/^-|-$/, '') + else + # Single date + dates.each { |d| accumulator << d.text } + end + end +end + +# Biographical/historical note - text content +to_field 'bioghist_text' do |record, accumulator| + # Extract text from biogHist elements + bioghist = record.xpath('//biogHist//p') + if bioghist.any? + text = bioghist.map(&:text).join(' ') + accumulator << text + end +end + +# Biographical/historical note - HTML +to_field 'bioghist_html' do |record, accumulator| + bioghist = record.xpath('//biogHist//p') + if bioghist.any? + html = bioghist.map { |p| "

<p>#{p.text}</p>

" }.join("\n") + accumulator << html + end +end + +# Full-text search field +to_field 'text' do |record, accumulator| + # Title + name = record.xpath('//identity/nameEntry/part') + accumulator << name.map(&:text).join(' ') if name.any? + + # Bioghist + bioghist = record.xpath('//biogHist//p') + accumulator << bioghist.map(&:text).join(' ') if bioghist.any? +end + +# Related agents (from cpfRelation elements) +to_field 'related_agents_ssim' do |record, accumulator| + relations = record.xpath('//cpfRelation') + relations.each do |rel| + # Get the related entity href/identifier + href = rel['href'] || rel['xlink:href'] + relation_type = rel['cpfRelationType'] + + if href + # Store as: "uri|type" for easy parsing later + accumulator << "#{href}|#{relation_type}" + elsif relation_entry = rel.xpath('relationEntry').first + # If no href, at least store the name + name = relation_entry.text + accumulator << "#{name}|#{relation_type}" if name + end + end +end + +# Related agents - just URIs (for simpler queries) +to_field 'related_agent_uris_ssim' do |record, accumulator| + relations = record.xpath('//cpfRelation') + relations.each do |rel| + href = rel['href'] || rel['xlink:href'] + accumulator << href if href + end +end + +# Relationship types +to_field 'relationship_types_ssim' do |record, accumulator| + relations = record.xpath('//cpfRelation') + relations.each do |rel| + relation_type = rel['cpfRelationType'] + accumulator << relation_type if relation_type && !accumulator.include?(relation_type) + end +end + +# Agent source URI (from original ArchivesSpace) +to_field 'agent_uri' do |record, accumulator| + # Try to extract from control section or otherRecordId + other_id = record.xpath('//control/otherRecordId[@localType="archivesspace_uri"]').first + if other_id + accumulator << other_id.text + end +end + +# Timestamp +to_field 'timestamp' do |record, accumulator| + accumulator << Time.now.utc.iso8601 +end + +# Document type marker +to_field 'document_type' do |record, 
accumulator| + accumulator << 'creator' +end + +# Log successful indexing +each_record do |record, context| + record_id = record.xpath('//control/recordId').first + if record_id + context.logger.info("Indexed creator: #{record_id.text}") + end +end From 8a96e8d420ce007b6835752196abf2527e186b62 Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Wed, 11 Feb 2026 14:23:49 -0500 Subject: [PATCH 02/44] fix: remove method to test a single creator record --- arcflow/main.py | 125 +----------------------------------------------- 1 file changed, 2 insertions(+), 123 deletions(-) diff --git a/arcflow/main.py b/arcflow/main.py index 74dd7f3..248433b 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -932,13 +932,7 @@ def process_creators(self, agents_dir, modified_since=0, agent_uri=None, indent_ os.makedirs(agents_dir, exist_ok=True) # Get agents to process - if agent_uri: - # Single agent mode (for testing) - self.log.info(f'{indent}Single agent mode: processing {agent_uri}') - agents = {agent_uri} - else: - # Get ALL agents (not just creators) - agents = self.get_all_agents(modified_since=modified_since, indent_size=indent_size) + agents = self.get_all_agents(modified_since=modified_since, indent_size=indent_size) # Process agents in parallel with Pool(processes=10) as pool: @@ -1310,120 +1304,5 @@ def main(): arcflow.run() -def test_single_creator(): - """ - Test function to process a single creator record. - - Usage: - python -c "from arcflow.main import test_single_creator; test_single_creator()" \ - --agent-uri /agents/agent_corporate_entities/123 \ - --arclight-dir /path/to/arclight \ - --aspace-dir /path/to/archivesspace - - Or add to a separate test script. 
- """ - parser = argparse.ArgumentParser( - description='Test single creator record processing', - formatter_class=argparse.RawDescriptionHelpFormatter, - epilog=""" -Examples: - # Test a single creator agent - python -m arcflow.main test-single-creator \\ - --agent-uri /agents/agent_corporate_entities/123 \\ - --arclight-dir /path/to/arclight-app \\ - --aspace-dir /path/to/archivesspace - - # With environment variables - export ARCLIGHT_DIR=/path/to/arclight-app - export ASPACE_DIR=/path/to/archivesspace - python -m arcflow.main test-single-creator \\ - --agent-uri /agents/agent_people/456 - """) - - parser.add_argument( - '--agent-uri', - required=True, - help='Agent URI to process (e.g., /agents/agent_corporate_entities/123)',) - parser.add_argument( - '--arclight-dir', - default=os.environ.get('ARCLIGHT_DIR'), - help='Path to ArcLight installation directory (default: $ARCLIGHT_DIR)',) - parser.add_argument( - '--aspace-dir', - default=os.environ.get('ASPACE_DIR'), - help='Path to ArchivesSpace installation directory (default: $ASPACE_DIR)',) - parser.add_argument( - '--solr-url', - default=os.environ.get('SOLR_URL', 'http://localhost:8983/solr/blacklight-core'), - help='URL of the Solr core (default: $SOLR_URL or http://localhost:8983/solr/blacklight-core)',) - parser.add_argument( - '--arcuit-dir', - default=os.environ.get('ARCUIT_DIR'), - help='Path to arcuit repository (for traject config). 
If not provided, will try bundle show arcuit.',) - parser.add_argument( - '--skip-creator-indexing', - action='store_true', - help='Generate creator XML files but skip Solr indexing',) - - args = parser.parse_args() - - # Validate required arguments - if not args.arclight_dir: - parser.error('--arclight-dir is required (or set $ARCLIGHT_DIR)') - if not args.aspace_dir: - parser.error('--aspace-dir is required (or set $ASPACE_DIR)') - - print(f'Testing single creator: {args.agent_uri}') - print(f'ArcLight directory: {args.arclight_dir}') - print(f'ArchivesSpace directory: {args.aspace_dir}') - print(f'Solr URL: {args.solr_url}') - print() - - # Create ArcFlow instance (without running full process) - arcflow = ArcFlow( - arclight_dir=args.arclight_dir, - aspace_dir=args.aspace_dir, - solr_url=args.solr_url, - traject_extra_config='', - force_update=False, - arcuit_dir=args.arcuit_dir, - skip_creator_indexing=args.skip_creator_indexing) - - # Process single creator - agents_dir = f'{args.arclight_dir}/public/xml/agents' - creator_ids = arcflow.process_creators( - agents_dir=agents_dir, - modified_since=0, - agent_uri=args.agent_uri, - indent_size=0) - - if creator_ids: - print(f'\nSuccess! Created {len(creator_ids)} creator document(s):') - for creator_id in creator_ids: - xml_file = f'{agents_dir}/{creator_id}.xml' - print(f' - {xml_file}') - - if args.skip_creator_indexing: - print(f'\nTo index to Solr:') - print(f' cd {args.arclight_dir}') - print(f' bundle exec traject -u {args.solr_url} -i xml \\') - print(f' -c /path/to/arcuit/arcflow-phase1-revised/traject_config_eac_cpf.rb \\') - print(f' public/xml/agents/{creator_ids[0]}.xml') - else: - print(f'\nIndexing was handled automatically (or check logs if it was skipped)') - else: - print('\nNo creator documents created. 
Check that the agent has biographical notes.') - - # Clean up PID file - if os.path.exists(arcflow.pid_file_path): - os.remove(arcflow.pid_file_path) - - if __name__ == '__main__': - # Check if we're running a subcommand - if len(sys.argv) > 1 and sys.argv[1] == 'test-single-creator': - # Remove the subcommand from argv so argparse works correctly - sys.argv.pop(1) - test_single_creator() - else: - main() \ No newline at end of file + main() \ No newline at end of file From 604f68c02e6517f88b74bd2fc09d151db547d058 Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Wed, 11 Feb 2026 15:22:09 -0500 Subject: [PATCH 03/44] refactor: move variable declarations into method body There is no longer any need for these to be defined outside the method --- arcflow/main.py | 22 +++++++++------------- 1 file changed, 9 insertions(+), 13 deletions(-) diff --git a/arcflow/main.py b/arcflow/main.py index 248433b..0445550 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -912,20 +912,20 @@ def update_creator_collection_links(self, agents_dir, indent_size=0): self.log.info(f'{indent}Updated {updated_count} creator documents with collection links.') - def process_creators(self, agents_dir, modified_since=0, agent_uri=None, indent_size=0): + def process_creators(self): """ Process creator agents and generate standalone creator documents. - - Args: - agents_dir: Directory to save agent XML files - modified_since: Unix timestamp to filter agents modified since this time - agent_uri: Optional. 
If provided, process only this single agent (for testing) - indent_size: Indentation size for logging - + Returns: list: List of created creator document IDs """ + + xml_dir = f'{self.arclight_dir}/public/xml' + agents_dir = f'{xml_dir}/agents' + modified_since = int(self.last_updated.timestamp()) + indent_size = 0 indent = ' ' * indent_size + self.log.info(f'{indent}Processing creator agents...') # Create agents directory if it doesn't exist @@ -1235,11 +1235,7 @@ def run(self): # Update creator records (unless collections-only mode) if not self.collections_only: - xml_dir = f'{self.arclight_dir}/public/xml' - agents_dir = f'{xml_dir}/agents' - modified_since = int(self.last_updated.timestamp()) - indent_size = 0 - self.process_creators(agents_dir, modified_since=modified_since, indent_size=indent_size) + self.process_creators() self.save_config_file() self.log.info(f'ArcFlow process completed (PID: {self.pid}). Elapsed time: {time.strftime("%H:%M:%S", time.gmtime(int(time.time()) - self.start_time))}.') From 38c3612fc3dc62889e21a51cc04639035ca88bc4 Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Wed, 11 Feb 2026 16:32:43 -0500 Subject: [PATCH 04/44] remove unwanted documentation --- HOW_TO_USE.md | 294 --------------------------- TESTING.md | 552 -------------------------------------------------- 2 files changed, 846 deletions(-) delete mode 100644 HOW_TO_USE.md delete mode 100644 TESTING.md diff --git a/HOW_TO_USE.md b/HOW_TO_USE.md deleted file mode 100644 index 0917842..0000000 --- a/HOW_TO_USE.md +++ /dev/null @@ -1,294 +0,0 @@ -# Arcflow Phase 1: Creator Records Implementation - -This directory contains the complete implementation of Phase 1 (Creator Records Data Pipeline) for the arcflow repository. - -## What This Is - -This directory is a **complete, working copy of arcflow** with all Phase 1 creator records changes already applied. You can run it directly from here! - -## Purpose - -This allows you to: -1. 
**Run arcflow directly** to test the creator records feature immediately -2. **Review all arcflow changes** without needing separate repository access -3. **Create the arcflow PR** using the provided documentation when ready - ---- - -## How to Use - -### Option 1: Run Directly from This Directory (Recommended for Testing) - -The simplest way to test the creator records feature is to run arcflow directly from this directory: - -#### Step 1: Install Dependencies - -```bash -cd arcflow-phase1 - -# Install Python dependencies -pip install -r requirements.txt -``` - -#### Step 2: Configure Credentials - -```bash -# Copy the example configuration -cp .archivessnake.yml.example .archivessnake.yml - -# Edit with your ArchivesSpace credentials -nano .archivessnake.yml # or use your preferred editor -``` - -Edit `.archivessnake.yml` with your settings: -```yaml -baseurl: http://your-archivesspace-server:8089 -username: your-username -password: your-password -``` - -#### Step 3: Create ArcFlow Configuration - -```bash -# Create .arcflow.yml to track last update time -cat > .arcflow.yml << EOF -last_updated: '1970-01-01T00:00:00+00:00' -EOF -``` - -Or run with `--force-update` flag to process all resources. 
- -#### Step 4: Run ArcFlow - -```bash -# Run arcflow with required arguments -python -m arcflow.main \ - --arclight-dir /path/to/your/arclight-app \ - --aspace-dir /path/to/your/archivesspace \ - --solr-url http://localhost:8983/solr/blacklight-core - -# Or with force update to process everything -python -m arcflow.main \ - --arclight-dir /path/to/your/arclight-app \ - --aspace-dir /path/to/your/archivesspace \ - --solr-url http://localhost:8983/solr/blacklight-core \ - --force-update - -# Or to process only agents (skip collections - useful for testing) -python -m arcflow.main \ - --arclight-dir /path/to/your/arclight-app \ - --aspace-dir /path/to/your/archivesspace \ - --solr-url http://localhost:8983/solr/blacklight-core \ - --agents-only -``` - -#### Step 5: View Results - -```bash -# Check for creator XML files -ls -lh $ARCLIGHT_DIR/public/xml/agents/ - -# View a creator file -cat $ARCLIGHT_DIR/public/xml/agents/creator_*.xml | jq '.' - -# Index to Solr -cd $ARCLIGHT_DIR -bundle exec traject -u $SOLR_URL -i xml \ - -c /path/to/arcuit/arcflow-phase1-revised/traject_config_eac_cpf.rb \ - public/xml/agents/*.xml -``` - -**See `TESTING.md` for comprehensive testing instructions!** - -#### Testing a Single Creator - -For faster testing, use the test-single-creator command to process just one agent: - -```bash -cd arcflow-phase1 - -# Set environment variables -export ARCLIGHT_DIR=/path/to/your/arclight-app -export ASPACE_DIR=/path/to/your/archivesspace - -# Test a single creator agent -python -m arcflow.main test-single-creator \ - --agent-uri /agents/agent_corporate_entities/123 - -# The command will show you: -# - The created XML file path -# - The traject command to index it -``` - -This is much faster than processing all creators and is ideal for development and testing. - ---- - -### Configure Solr Schema (Required Before Indexing) - -⚠️ **CRITICAL PREREQUISITE** - Before you can index creator records to Solr, you must configure the Solr schema. 
- -**See [SOLR_SCHEMA.md](SOLR_SCHEMA.md) for complete instructions on:** -- Which fields to add (is_creator, creator_persistent_id, etc.) -- Three methods to add them (Schema API recommended, managed-schema, or schema.xml) -- How to verify they're added -- Troubleshooting "unknown field" errors - -**Quick Schema Setup (Schema API method):** -```bash -# Add is_creator field -curl -X POST -H 'Content-type:application/json' \ - http://localhost:8983/solr/blacklight-core/schema \ - -d '{"add-field": {"name": "is_creator", "type": "boolean", "indexed": true, "stored": true}}' - -# Add other required fields (see SOLR_SCHEMA.md for complete list) -``` - -**Verify schema is configured:** -```bash -curl "http://localhost:8983/solr/blacklight-core/schema/fields/is_creator" -# Should return field definition, not 404 -``` - -⚠️ **If you skip this step, you'll get:** -``` -ERROR: [doc=creator_corporate_entities_584] unknown field 'is_creator' -``` - -This is a **one-time setup** per Solr instance. - ---- - -### Option 2: Copy to Separate Arcflow Repository (For Creating PR) - -### Option 2: Copy to Separate Arcflow Repository (For Creating PR) - -If you want to create a PR in the official arcflow repository: - -```bash -# In your local environment with access to arcflow repo: -cd /path/to/arcflow -git checkout -b copilot/add-creator-records - -# Copy the modified files -cp /path/to/arcuit/arcflow-phase1/arcflow/main.py arcflow/main.py -cp /path/to/arcuit/arcflow-phase1/traject_config_creators.rb . -cp /path/to/arcuit/arcflow-phase1/CREATOR_RECORDS_DESIGN.md . -cp /path/to/arcuit/arcflow-phase1/PR_SUMMARY.md . -cp /path/to/arcuit/arcflow-phase1/README.md . 
-cp /path/to/arcuit/arcflow-phase1/.github/copilot-instructions.md .github/ - -git add -A -git commit -m "Add standalone creator records extraction and indexing pipeline" -git push -u origin copilot/add-creator-records - -# Then create PR via GitHub UI using PR_SUMMARY.md as description -``` - -**Alternative: Create a patch file** - -```bash -cd /path/to/arcuit/arcflow-phase1 - -# Create a patch file comparing against main branch -git diff c2486e4..HEAD > ../arcflow-phase1-changes.patch - -# Then in arcflow repo: -cd /path/to/arcflow -git checkout -b copilot/add-creator-records -git apply /path/to/arcflow-phase1-changes.patch -``` - ---- - -## Key Files - -### Implementation -- **`arcflow/main.py`** - Core code with creator agent processing methods -- **`traject_config_eac_cpf.rb`** - Solr indexing configuration for creator EAC-CPF XML - -### Documentation -- **`CREATOR_RECORDS_DESIGN.md`** - Comprehensive design document -- **`PR_SUMMARY.md`** - Complete PR description (use for GitHub PR) -- **`README.md`** - Updated with creator records usage instructions -- **`.github/copilot-instructions.md`** - Architecture documentation - -## Changes Summary - -### New Methods in `arcflow/main.py` - -1. **`get_all_agents(agent_types, modified_since, indent_size)`** - - Fetches all agents from ArchivesSpace - - Returns set of unique agent URIs - - Lines: ~651-705 - -2. **`task_agent(agent_uri, agents_dir, repo_id, indent_size)`** - - Processes individual agent into EAC-CPF XML document - - Extracts bioghist, dates, relationships - - Only processes agents with biographical notes - - Lines: ~708-774 - -3. 
**`process_creators(agents_dir, modified_since, agent_uri, indent_size)`** - - Main orchestration method for agent processing - - Processes agents in parallel - - Lines: ~893-946 - -### Workflow Integration - -Added to `update_eads()` method after PDF processing (around line 492): -- Calls `process_creators()` to process all agents -- Generates EAC-CPF XML files in `public/xml/agents/` directory -- Collection linking handled via Solr using persistent_id field - -## Testing - -⭐ **See `TESTING.md` for comprehensive testing instructions!** - -The testing guide includes: -- Step-by-step instructions for migrating a single creator record -- Command-line Solr queries with curl -- Browser-based Solr query examples -- Expected output and troubleshooting - -### Quick Test - -After applying changes to arcflow: - -1. **Run ArcFlow**: - ```bash - python -m arcflow.main [options] - ``` - -2. **Check Output**: - ```bash - ls -lh public/xml/agents/creator_*.xml - ``` - -3. **Index to Solr**: - ```bash - bundle exec traject -u $SOLR_URL -i xml \ - -c traject_config_eac_cpf.rb public/xml/agents/*.xml - ``` - -4. **Query Solr** (command line): - ```bash - curl "http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=is_creator:true&rows=5&wt=xml" | jq '.' - ``` - -5. **Query Solr** (browser): - ``` - http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=is_creator:true&wt=xml&indent=true - ``` - -For detailed testing procedures including how to test a single creator record and all query options, see **`TESTING.md`**. - -## Next Steps - -1. Copy these changes to the arcflow repository -2. Create PR in arcflow using `PR_SUMMARY.md` as description -3. Test with real ArchivesSpace data -4. Once merged, begin Phase 2 in arcuit repository (search exclusion) - -## Questions? - -See `CREATOR_RECORDS_DESIGN.md` for detailed design rationale and `PR_SUMMARY.md` for PR description. 
diff --git a/TESTING.md b/TESTING.md deleted file mode 100644 index 3b3aeb5..0000000 --- a/TESTING.md +++ /dev/null @@ -1,552 +0,0 @@ -# Testing Guide: Creator Records Migration - -This guide provides step-by-step instructions for testing the creator records migration using native EAC-CPF format from ArchivesSpace. - -## Table of Contents -1. [Prerequisites](#prerequisites) -2. [Migrating a Single Creator Record](#migrating-a-single-creator-record) -3. [Viewing Creator Records](#viewing-creator-records) -4. [Querying Solr](#querying-solr) -5. [Troubleshooting](#troubleshooting) - ---- - -## Step 0: Configure Solr Schema (PREREQUISITE) - -⚠️ **CRITICAL FIRST STEP** - Before indexing creator records, configure the Solr schema. - -**See the parent directory's `solr/README.md` for:** -- Which fields to add to Solr -- How to manually add them to schema.xml -- How to verify they're added - -**Quick check if schema is configured:** -```bash -curl "http://localhost:8983/solr/blacklight-core/schema/fields/is_creator" -# Should return field definition -``` - ---- - -## Prerequisites - -### Required Software -- Python 3.4.3+ with ArchivesSnake installed -- Ruby 3.4.3+ with Bundler -- Access to ArchivesSpace instance -- Access to Solr instance (typically running on port 8983) -- ArcLight application installed -- Solr schema configured (see Step 0 above) - -### Required Configuration - -1. **ArchivesSpace credentials** (`.archivessnake.yml`): - ```yaml - baseurl: http://your-archivesspace-server:8089 - username: your-username - password: your-password - ``` - -2. **Solr URL**: Note your Solr endpoint, typically: - ``` - http://localhost:8983/solr/blacklight-core - ``` - -3. 
**ArcLight directory**: Path to your ArcLight application (e.g., `/path/to/arclight-app`) - ---- - -## Migrating a Single Creator Record - -### Quick Method: Using the Test Function - -Test a single creator using the built-in test function: - -```bash -cd /path/to/arcuit/arcflow-phase1-revised - -# Set environment variables -export ARCLIGHT_DIR=/path/to/arclight-app -export ASPACE_DIR=/path/to/archivesspace -export SOLR_URL=http://localhost:8983/solr/blacklight-core - -# Test a single creator -python -m arcflow.main test-single-creator \ - --agent-uri /agents/corporate_entities/584 -``` - -This will: -1. Process the specified agent -2. Generate the creator EAC-CPF XML file -3. Link it to collections -4. Show you the output file path and indexing command - -### Step 1: Identify a Test Creator - -Find a creator agent in ArchivesSpace that has a biographical/historical note: - -```bash -# List agents with creator role -curl -u username:password \ - "http://your-archivesspace-server:8089/repositories/2/resources/1" | \ - jq '.linked_agents[] | select(.role == "creator") | .ref' -``` - -Example output: -``` -"/agents/corporate_entities/584" -``` - -### Step 2: Verify Agent Has Bioghist - -Check that the agent has biographical/historical notes: - -```bash -curl -u username:password \ - "http://your-archivesspace-server:8089/agents/corporate_entities/584" | \ - jq '.notes[] | select(.jsonmodel_type == "note_bioghist")' -``` - -If this returns data, the agent is suitable for testing. - -### Step 3: Run ArcFlow for Single Creator - -```bash -cd /path/to/arcuit/arcflow-phase1-revised - -export ARCLIGHT_DIR=/path/to/arclight-app -export ASPACE_DIR=/path/to/archivesspace -export SOLR_URL=http://localhost:8983/solr/blacklight-core - -# Process the specific creator -python -m arcflow.main test-single-creator \ - --agent-uri /agents/corporate_entities/584 -``` - -This processes the creator and shows you the output. 
- -To process all creators: - -```bash -cd /path/to/arcuit/arcflow-phase1-revised -python -m arcflow.main --force-update -``` - -### Step 4: Locate the Generated Creator EAC-CPF XML - -After ArcFlow completes, check the agents directory: - -```bash -cd /path/to/arclight-app/public/xml/agents - -# List all creator XML files -ls -lh creator_*.xml - -# View a specific creator file -cat creator_corporate_entities_584.xml -``` - -**EAC-CPF format structure:** -```xml - - - - - - corporateBody - - "I" Men's Association - local - - - - -

The "I" Men's Association is composed of alumni...

-
-
- - - University of Illinois... - - - 1927 Reunion Publications - - -
-
-``` - -**Key Elements:** -- `<recordId>` - Typically empty from ArchivesSpace -- `<identity>` - Entity type and name -- `<biogHist>` - Biographical/historical note -- `<resourceRelation>` - Links to collections - -### Step 5: Index the Creator to Solr - -Index the creator record to Solr: - -```bash -cd /path/to/arclight-app - -# Index a single creator file -bundle exec traject \ - -u http://localhost:8983/solr/blacklight-core \ - -i xml \ - -c /path/to/arcuit/arcflow-phase1-revised/traject_config_eac_cpf.rb \ - public/xml/agents/creator_corporate_entities_584.xml -``` - -Expected output: -``` -Traject indexer starting id=... -INFO: Using filename-based ID: creator_corporate_entities_584 -Indexed creator: creator_corporate_entities_584 -Committed 1 documents to Solr -``` - -To index all creators: -```bash -bundle exec traject \ - -u http://localhost:8983/solr/blacklight-core \ - -i xml \ - -c /path/to/arcuit/arcflow-phase1-revised/traject_config_eac_cpf.rb \ - public/xml/agents/creator_*.xml -``` - ---- - -## Viewing Creator Records - -View creator records in three ways: - -### Method 1: Direct File Inspection - -View creator data directly: - -```bash -# View a creator EAC-CPF XML file -cat public/xml/agents/creator_corporate_entities_584.xml - -# Or use xmllint for pretty printing -xmllint --format public/xml/agents/creator_corporate_entities_584.xml - -# View specific elements -xmllint --xpath '//identity/nameEntry/part/text()' public/xml/agents/creator_corporate_entities_584.xml -``` - -Example EAC-CPF structure: -```xml - - - - - - corporateBody - - "I" Men's Association - - - - -

The "I" Men's Association is composed of alumni...

-
-
-
-
-``` - -### Method 2: Command-Line Solr Queries (curl) - -Query Solr directly using curl: - -#### Basic Query: All Creator Records -```bash -curl "http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=is_creator:true&rows=10&wt=json" | jq '.' -``` - -#### Query Specific Creator by ID -```bash -curl "http://localhost:8983/solr/blacklight-core/select?q=id:creator_corporate_entities_584&wt=json" | jq '.response.docs[0]' -``` - -#### Search Creators by Name -```bash -curl "http://localhost:8983/solr/blacklight-core/select?q=title:*Association*&fq=is_creator:true&wt=json" | jq '.response.docs' -``` - -#### Get Creator with All Fields -```bash -curl "http://localhost:8983/solr/blacklight-core/select?q=id:creator_corporate_entities_584&wt=json" | jq '.response.docs[0]' -``` - -Example response: -```json -{ - "id": "creator_corporate_entities_584", - "title": ["\"I\" Men's Association"], - "is_creator": true, - "entity_type": "corporateBody", - "agent_type": "corporate_entities", - "agent_id": 584 -} -``` - -#### Count Total Creator Records -```bash -curl "http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=is_creator:true&rows=0&wt=json" | jq '.response.numFound' -``` - -#### Search Bioghist Content -```bash -curl "http://localhost:8983/solr/blacklight-core/select?q=text:established&fq=is_creator:true&fl=id,title&wt=json" | jq '.response.docs' -``` - -### Method 3: Browser-Based Solr Queries - -Open these URLs in your web browser for formatted output: - -#### View All Creator Records -``` -http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=is_creator:true&rows=10&wt=json&indent=true -``` - -#### View Specific Creator -``` -http://localhost:8983/solr/blacklight-core/select?q=id:creator_corporate_entities_584&wt=json&indent=true -``` - -#### Search by Creator Name -``` -http://localhost:8983/solr/blacklight-core/select?q=title:*Association*&fq=is_creator:true&wt=json&indent=true -``` - -#### Browse Creators with Facets -``` 
-http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=is_creator:true&facet=true&facet.field=entity_type&rows=10&wt=json&indent=true -``` - -#### Get Only Specific Fields -``` -http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=is_creator:true&fl=id,title,entity_type&rows=10&wt=json&indent=true -``` - ---- - -## Querying Solr - -### Understanding Solr Query Parameters - -- **`q=*:*`** - Match all documents (use `q=field:value` to search specific fields) -- **`fq=is_creator:true`** - Filter to only creator records -- **`rows=10`** - Return 10 results (default is 10, max is usually 1000+) -- **`fl=id,title`** - Return only specified fields (default is all fields) -- **`wt=json`** - Return JSON format (alternatives: xml, csv) -- **`indent=true`** - Pretty-print JSON output -- **`start=0`** - Pagination offset (start=10 for second page with rows=10) - -### Useful Query Patterns - -#### 1. Find Collections Linked to a Creator - -To find collections created by a specific creator, look for resourceRelation links in the creator's EAC-CPF: - -```bash -# First get the creator's linked collections -curl "http://localhost:8983/solr/blacklight-core/select?q=id:creator_corporate_entities_584&fl=related_resources&wt=json" | jq '.response.docs[0]' - -# Then query for those specific collection IDs -curl "http://localhost:8983/solr/blacklight-core/select?q=id:(resource_586 OR resource_123)&wt=json" -``` - -**Note:** Collection links are stored in the creator's `<resourceRelation>` elements in the EAC-CPF XML. - -#### 2. Find All Corporate Entity Creators -```bash -curl "http://localhost:8983/solr/blacklight-core/select?q=entity_type:corporateBody&fq=is_creator:true&rows=20&wt=json" | jq '.response.docs[] | {id, title}' -``` - -#### 3. Full-Text Search in Bioghist -```bash -curl "http://localhost:8983/solr/blacklight-core/select?q=text:university&fq=is_creator:true&fl=id,title&wt=json" | jq '.response.docs[0]' -``` - -#### 4. 
Get Creator with All Fields -```bash -curl "http://localhost:8983/solr/blacklight-core/select?q=id:creator_corporate_entities_584&wt=json&indent=true" | jq '.response.docs[0]' -``` - -#### 5. Verify Creator Records Don't Appear in Standard Searches -This is important for Phase 2 - ensure creators are properly filtered: - -```bash -# This should return 0 if Phase 2 is implemented (currently will return creators) -curl "http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=-is_creator:true&rows=0&wt=json" | jq '.response.numFound' -``` - -### Advanced Queries - -#### Wildcard Search -```bash -curl "http://localhost:8983/solr/blacklight-core/select?q=title:*Association*&fq=is_creator:true&wt=json" | jq '.response.docs[] | .title' -``` - -#### Boolean Operators -```bash -# OR -curl "http://localhost:8983/solr/blacklight-core/select?q=title:(Association%20OR%20University)&fq=is_creator:true&wt=json" - -# AND -curl "http://localhost:8983/solr/blacklight-core/select?q=title:Association%20AND%20text:alumni&fq=is_creator:true&wt=json" - -# NOT -curl "http://localhost:8983/solr/blacklight-core/select?q=*:*%20NOT%20entity_type:person&fq=is_creator:true&wt=json" -``` - ---- - -## Troubleshooting - -### Issue: No XML Files Generated - -**Symptom**: The `public/xml/agents/` directory is empty after running ArcFlow. 
- -**Solutions**: -```bash -# Check ArcFlow logs -tail -f logs/arcflow.log - -# Verify agents have bioghist notes in ArchivesSpace -curl -u username:password \ - "http://your-archivesspace-server:8089/agents/corporate_entities/584" | \ - jq '.notes[] | select(.jsonmodel_type == "note_bioghist")' - -# Check if agents directory exists -ls -la /path/to/arclight-app/public/xml/agents/ -``` - -### Issue: Missing ID Field Error - -**Symptom**: `Document is missing mandatory uniqueKey field: id` - -**Solutions**: -```bash -# Verify using the correct traject config -bundle exec traject \ - -c /path/to/arcuit/arcflow-phase1-revised/traject_config_eac_cpf.rb \ - public/xml/agents/creator_*.xml - -# Check filename format (should start with creator_) -ls public/xml/agents/creator_*.xml -``` - -### Issue: Traject Indexing Fails - -**Symptom**: Error when running `bundle exec traject` - -**Solutions**: -```bash -# Verify Solr is running -curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json" - -# Validate XML file -xmllint --noout public/xml/agents/creator_*.xml - -# Verify Solr schema has required fields -curl "http://localhost:8983/solr/blacklight-core/schema/fields/is_creator" -``` - -### Issue: No Results in Solr - -**Symptom**: Queries return 0 results even after indexing - -**Possible causes**: -1. Documents not committed to Solr -2. Wrong Solr core/collection -3. 
Indexing to different Solr than querying - -**Solutions**: -```bash -# Force commit in Solr -curl "http://localhost:8983/solr/blacklight-core/update?commit=true" - -# Check which cores exist -curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json" | jq '.status | keys' - -# Verify documents were indexed -curl "http://localhost:8983/solr/blacklight-core/select?q=*:*&rows=0&wt=json" | jq '.response.numFound' - -# Check specifically for creator records -curl "http://localhost:8983/solr/blacklight-core/select?q=is_creator:true&rows=0&wt=json" | jq '.response.numFound' -``` - -### Issue: Missing Fields in Solr - -**Symptom**: Some fields are missing when querying Solr - -**Possible causes**: -1. Fields not defined in Solr schema -2. Traject config not mapping fields correctly -3. Source data missing from ArchivesSpace - -**Solutions**: -```bash -# Check which fields exist for a document -curl "http://localhost:8983/solr/blacklight-core/select?q=id:creator_corporate_entities_584&wt=json" | jq '.response.docs[0] | keys' - -# View Solr schema for creator-related fields -curl "http://localhost:8983/solr/blacklight-core/schema/fields?wt=json" | jq '.fields[] | select(.name | contains("creator") or . == "is_creator")' - -# Check source XML has the data -xmllint --xpath '//identity/nameEntry/part/text()' public/xml/agents/creator_corporate_entities_584.xml -``` - -### Issue: Creator Records Appear in Standard Searches - -**Note**: This is expected behavior for Phase 1. Phase 2 will add search exclusion in Arcuit. 
- -To manually filter creators from searches: -```bash -curl "http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=-is_creator:true&wt=json" -``` - ---- - -## Verification Checklist - -Use this checklist to verify your creator records migration: - -- [ ] ArcFlow runs without errors -- [ ] EAC-CPF XML files created in `public/xml/agents/` directory -- [ ] XML files have correct structure (control, cpfDescription, identity, biogHist, relations) -- [ ] Filename format is correct (e.g., `creator_corporate_entities_584.xml`) -- [ ] Traject indexing completes successfully with `traject_config_eac_cpf.rb` -- [ ] Solr query returns creator records: `curl "http://localhost:8983/solr/blacklight-core/select?q=*:*&fq=is_creator:true&rows=1&wt=json"` -- [ ] Creator has `is_creator: true` field -- [ ] Creator has `entity_type` field (corporateBody, person, or family) -- [ ] Creator name is indexed in `title` field -- [ ] Bioghist content is searchable: `curl "http://localhost:8983/solr/blacklight-core/select?q=text:*&fq=is_creator:true&rows=1&wt=json"` -- [ ] Related resources are captured (if present in `` elements) -- [ ] All expected fields are present in Solr document - ---- - -## Next Steps - -After verifying creator records are indexed: - -1. **Phase 2**: Implement search exclusion in Arcuit to filter `is_creator:true` from standard searches -2. **Phase 3**: Create creator show page in Arcuit to display creator records -3. **Phase 4-7**: Add UI enhancements (search dropdown, links from collections, etc.) 
- --- - -## Additional Resources - -- **Solr Configuration**: See `../solr/README.md` for schema setup -- **Solr Documentation**: https://solr.apache.org/guide/ -- **ArcLight Documentation**: https://github.com/projectblacklight/arclight -- **EAC-CPF Standard**: https://eac.staatsbibliothek-berlin.de/ -- **ArchivesSnake Documentation**: https://github.com/archivesspace-labs/ArchivesSnake From 4912ddc50db39fa60fcbc17739c1e4dd6515b79a Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Wed, 11 Feb 2026 16:53:10 -0500 Subject: [PATCH 05/44] refactor: revert to PIPE for stderr for consistency --- arcflow/main.py | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/arcflow/main.py b/arcflow/main.py index 0445550..450e61f 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -577,17 +577,16 @@ def index_collections(self, repo_id, xml_file_path, indent_size=0): cmd, cwd=self.arclight_dir, env=env, - capture_output=True, - text=True + stderr=subprocess.PIPE, ) - + if result.stderr: - self.log.error(f'{indent}{result.stderr}') + self.log.error(f'{indent}{result.stderr.decode("utf-8")}') if result.returncode != 0: self.log.error(f'{indent}Failed to index pending resources in repository ID {repo_id} to ArcLight Solr.
Return code: {result.returncode}') else: self.log.info(f'{indent}Finished indexing pending resources in repository ID {repo_id} to ArcLight Solr.') - except Exception as e: + except subprocess.CalledProcessError as e: self.log.error(f'{indent}Error indexing pending resources in repository ID {repo_id} to ArcLight Solr: {e}') @@ -1069,8 +1068,7 @@ def index_creators(self, agents_dir, creator_ids, batch_size=100): result = subprocess.run( cmd, cwd=self.arclight_dir, - capture_output=True, - text=True, + stderr=subprocess.PIPE, timeout=300 # 5 minute timeout per batch ) @@ -1081,7 +1079,7 @@ def index_creators(self, agents_dir, creator_ids, batch_size=100): failed_count += len(existing_files) self.log.error(f' Traject failed with exit code {result.returncode}') if result.stderr: - self.log.error(f' STDERR: {result.stderr}') + self.log.error(f' STDERR: {result.stderr.decode("utf-8")}') except subprocess.TimeoutExpired: self.log.error(f' Traject timed out for batch {batch_num}/{total_batches}') @@ -1089,7 +1087,7 @@ def index_creators(self, agents_dir, creator_ids, batch_size=100): except Exception as e: self.log.error(f' Error indexing batch {batch_num}/{total_batches}: {e}') failed_count += len(existing_files) - + if failed_count > 0: self.log.warning(f'Creator indexing completed with errors: {indexed_count} succeeded, {failed_count} failed') From 6014dd19b34bd9d0f6b9e48b636cd136f36376f1 Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Wed, 11 Feb 2026 16:57:09 -0500 Subject: [PATCH 06/44] fix: spacing --- arcflow/main.py | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/arcflow/main.py b/arcflow/main.py index 450e61f..fd0293a 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -911,7 +911,7 @@ def update_creator_collection_links(self, agents_dir, indent_size=0): self.log.info(f'{indent}Updated {updated_count} creator documents with collection links.') - def process_creators(self): + def process_creators(self): """ Process creator 
agents and generate standalone creator documents. @@ -926,32 +926,32 @@ def process_creators(self): indent = ' ' * indent_size self.log.info(f'{indent}Processing creator agents...') - + # Create agents directory if it doesn't exist os.makedirs(agents_dir, exist_ok=True) - + # Get agents to process agents = self.get_all_agents(modified_since=modified_since, indent_size=indent_size) - + # Process agents in parallel with Pool(processes=10) as pool: results_agents = [pool.apply_async( self.task_agent, args=(agent_uri_item, agents_dir, 1, indent_size)) # Use repo_id=1 for agent_uri_item in agents] - + creator_ids = [r.get() for r in results_agents] creator_ids = [cid for cid in creator_ids if cid is not None] - + self.log.info(f'{indent}Created {len(creator_ids)} creator documents.') - + # NOTE: Collection links are NOT added to creator XML files. # Instead, linking is handled via Solr using the persistent_id field: # - Creator bioghist has persistent_id as the 'id' attribute # - Collection EADs reference creators via bioghist with persistent_id # - Solr indexes both, allowing queries to link them # This avoids the expensive operation of scanning all resources to build a linkage map. 
- + # Index creators to Solr (if not skipped) if not self.skip_creator_indexing and creator_ids: self.log.info(f'{indent}Indexing {len(creator_ids)} creator records to Solr...') @@ -968,7 +968,7 @@ def process_creators(self): self.log.info(f'{indent} {agents_dir}/*.xml') elif self.skip_creator_indexing: self.log.info(f'{indent}Skipping creator indexing (--skip-creator-indexing flag set)') - + return creator_ids From 5dbe81e84bd5a38d6f7968e9b7f1dae8c15b6ee5 Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Wed, 11 Feb 2026 16:59:46 -0500 Subject: [PATCH 07/44] fix: duplicate line --- arcflow/main.py | 1 - 1 file changed, 1 deletion(-) diff --git a/arcflow/main.py b/arcflow/main.py index fd0293a..11bbddf 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -767,7 +767,6 @@ def task_agent(self, agent_uri, agents_dir, repo_id=1, indent_size=0): return None eac_cpf_xml = response.text - eac_cpf_xml = response.text # Parse the EAC-CPF XML to extract key information try: From 1c63cae69e89fceaeb3cd126017a621fcafb5740 Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Wed, 11 Feb 2026 17:37:01 -0500 Subject: [PATCH 08/44] fix: remove unused method --- arcflow/main.py | 118 ------------------------------------------------ 1 file changed, 118 deletions(-) diff --git a/arcflow/main.py b/arcflow/main.py index 11bbddf..08b2bc6 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -792,124 +792,6 @@ def task_agent(self, agent_uri, agents_dir, repo_id=1, indent_size=0): self.log.error(f'{indent}{traceback.format_exc()}') return None - - def update_creator_collection_links(self, agents_dir, indent_size=0): - """ - Update creator documents with links to their associated collections. - Scans all resources to build agent -> collections mapping, then updates creator XML files. 
- - Args: - agents_dir: Directory containing agent XML files - indent_size: Indentation size for logging - """ - indent = ' ' * indent_size - - # Build mapping of agent_uri -> [collection info] - self.log.info(f'{indent}Building agent-collection linkage map...') - agent_collections = {} - - repos = self.client.get('repositories').json() - for repo in repos: - repo_id = self.get_repo_id(repo) - resources = self.client.get( - f'{repo["uri"]}/resources', - params={'all_ids': True} - ).json() - - self.log.info(f'{indent}Processing {len(resources)} resources in repository ID {repo_id}...') - - for resource_id in resources: - try: - resource = self.client.get( - f'{repo["uri"]}/resources/{resource_id}', - params={'resolve': ['linked_agents']} - ).json() - - # Only process published resources - if not resource.get('publish') or resource.get('suppressed'): - continue - - ead_id = resource.get('ead_id', '').replace('.', '-') - - if 'linked_agents' in resource: - for linked_agent in resource['linked_agents']: - if linked_agent.get('role') == 'creator': - agent_ref = linked_agent.get('ref') - if agent_ref: - if agent_ref not in agent_collections: - agent_collections[agent_ref] = [] - agent_collections[agent_ref].append({ - 'ead_id': ead_id, - 'title': resource.get('title', 'Untitled'), - 'repository': repo.get('name', '') - }) - except Exception as e: - self.log.error(f'{indent}Error fetching resource {resource_id}: {e}') - - # Update creator documents with collection links - self.log.info(f'{indent}Updating creator documents with collection links...') - updated_count = 0 - - for xml_file in os.listdir(agents_dir): - if xml_file.endswith('.xml'): - filepath = os.path.join(agents_dir, xml_file) - try: - # Parse XML file - tree = ET.parse(filepath) - root = tree.getroot() - - # Find agent URI from controlaccess - agent_uri = None - controlaccess = root.find('.//controlaccess') - if controlaccess is not None: - for name_elem in controlaccess.findall('.//*[@identifier]'): - 
agent_uri = name_elem.get('identifier') - break - - if not agent_uri: - self.log.warning(f'{indent}Could not find agent URI in {xml_file}') - continue - - if agent_uri in agent_collections: - collections = agent_collections[agent_uri] - - # Find or create relatedmaterial section for collections - archdesc = root.find('.//archdesc') - if archdesc is None: - self.log.warning(f'{indent}No archdesc found in {xml_file}') - continue - - # Remove existing collection relatedmaterial if present - for rm in archdesc.findall('relatedmaterial[@type="collections"]'): - archdesc.remove(rm) - - # Add new relatedmaterial section for collections - relatedmaterial = ET.SubElement(archdesc, 'relatedmaterial') - relatedmaterial.set('type', 'collections') - head = ET.SubElement(relatedmaterial, 'head') - head.text = 'Related Collections' - - # Add each collection - for collection in collections: - item = ET.SubElement(relatedmaterial, 'item') - item.text = collection['title'] - item.set('ead_id', collection['ead_id']) - item.set('repository', collection['repository']) - - # Save updated XML - ET.indent(tree, space=' ') - tree.write(filepath, encoding='utf-8', xml_declaration=True) - - updated_count += 1 - creator_id = xml_file.replace('.xml', '') - self.log.info(f'{indent}Updated {creator_id} with {len(collections)} collection links') - - except Exception as e: - self.log.error(f'{indent}Error updating {xml_file}: {e}') - - self.log.info(f'{indent}Updated {updated_count} creator documents with collection links.') - - def process_creators(self): """ Process creator agents and generate standalone creator documents. 
From fef9307063a6339a0ae720fa37a98cda930c5135 Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Wed, 11 Feb 2026 17:39:46 -0500 Subject: [PATCH 09/44] Update arcflow/main.py allow for whitespace in filenames and for quoted arguments Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- arcflow/main.py | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arcflow/main.py b/arcflow/main.py index 08b2bc6..61826d9 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -566,7 +566,10 @@ def index_collections(self, repo_id, xml_file_path, indent_size=0): ] if self.traject_extra_config: - cmd.extend(self.traject_extra_config.split()) + if isinstance(self.traject_extra_config, (list, tuple)): + cmd.extend(self.traject_extra_config) + else: + cmd.append(self.traject_extra_config) cmd.append(xml_file_path) From 849098e15cda8c58648016ae83f09ee691772eee Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Wed, 11 Feb 2026 17:42:16 -0500 Subject: [PATCH 10/44] fix: add time in case it isn't loaded when traject runs --- traject_config_eac_cpf.rb | 1 + 1 file changed, 1 insertion(+) diff --git a/traject_config_eac_cpf.rb b/traject_config_eac_cpf.rb index 76c2df1..3e4b7ef 100644 --- a/traject_config_eac_cpf.rb +++ b/traject_config_eac_cpf.rb @@ -12,6 +12,7 @@ require 'traject' require 'traject_plus' require 'traject_plus/macros' +require 'time' # Use TrajectPlus macros (provides extract_xpath and other helpers) extend TrajectPlus::Macros From ec3c9618502fcdc36609259fb4e1fd4066acc14e Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Wed, 11 Feb 2026 17:46:15 -0500 Subject: [PATCH 11/44] Update arcflow/main.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- arcflow/main.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arcflow/main.py b/arcflow/main.py index 61826d9..34a37a1 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -771,9 +771,10 @@ def task_agent(self, agent_uri, agents_dir, repo_id=1, 
indent_size=0): eac_cpf_xml = response.text - # Parse the EAC-CPF XML to extract key information + # Parse the EAC-CPF XML to validate and inspect its structure try: root = ET.fromstring(eac_cpf_xml) + self.log.debug(f'{indent}Parsed EAC-CPF XML root element: {root.tag}') except ET.ParseError as e: self.log.error(f'{indent}Failed to parse EAC-CPF XML for {agent_uri}: {e}') return None From a043c9df9ff8751c6b93a869b10e497190b9f679 Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Wed, 11 Feb 2026 17:56:25 -0500 Subject: [PATCH 12/44] fix: remove hardcoded directory from local implementation --- arcflow/main.py | 43 +++++++++++++++++++++++++++---------------- 1 file changed, 27 insertions(+), 16 deletions(-) diff --git a/arcflow/main.py b/arcflow/main.py index 34a37a1..5cdd072 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -849,7 +849,7 @@ def process_creators(self): self.log.info(f'{indent}To index manually:') self.log.info(f'{indent} cd {self.arclight_dir}') self.log.info(f'{indent} bundle exec traject -u {self.solr_url} -i xml \\') - self.log.info(f'{indent} -c /path/to/arcuit/arcflow-phase1-revised/traject_config_eac_cpf.rb \\') + self.log.info(f'{indent} -c /path/to/arcuit/arcflow/traject_config_eac_cpf.rb \\') self.log.info(f'{indent} {agents_dir}/*.xml') elif self.skip_creator_indexing: self.log.info(f'{indent}Skipping creator indexing (--skip-creator-indexing flag set)') @@ -880,28 +880,39 @@ def find_traject_config(self): ) if result.returncode == 0: arcuit_path = result.stdout.strip() - traject_config = f'{arcuit_path}/arcflow-phase1-revised/traject_config_eac_cpf.rb' - if os.path.exists(traject_config): - self.log.info(f'Found traject config via bundle show: {traject_config}') - return traject_config - else: - self.log.warning(f'bundle show arcuit succeeded but traject config not found at expected path') + # Prefer config at gem root, fall back to legacy subdirectory layout + candidate_paths = [ + os.path.join(arcuit_path,
'traject_config_eac_cpf.rb'), + os.path.join(arcuit_path, 'arcflow', 'traject_config_eac_cpf.rb'), + ] + for traject_config in candidate_paths: + if os.path.exists(traject_config): + self.log.info(f'Found traject config via bundle show: {traject_config}') + return traject_config + self.log.warning( + 'bundle show arcuit succeeded but traject_config_eac_cpf.rb ' + 'was not found in any expected location under the gem root' + ) else: self.log.debug('bundle show arcuit failed (gem not installed?)') except Exception as e: self.log.debug(f'Error running bundle show arcuit: {e}') - # Fall back to arcuit_dir if provided if self.arcuit_dir: - traject_config = f'{self.arcuit_dir}/arcflow-phase1-revised/traject_config_eac_cpf.rb' - if os.path.exists(traject_config): - self.log.info(f'Using traject config from arcuit_dir: {traject_config}') - return traject_config - else: - self.log.warning(f'arcuit_dir provided but traject config not found: {traject_config}') - + candidate_paths = [ + os.path.join(self.arcuit_dir, 'traject_config_eac_cpf.rb'), + os.path.join(self.arcuit_dir, 'arcflow', 'traject_config_eac_cpf.rb'), + ] + for traject_config in candidate_paths: + if os.path.exists(traject_config): + self.log.info(f'Using traject config from arcuit_dir: {traject_config}') + return traject_config + self.log.warning( + 'arcuit_dir provided but traject_config_eac_cpf.rb was not found ' + 'in any expected location' + ) # No config found - self.log.warning('Could not find traject config (bundle show arcuit failed and arcuit_dir not provided)') + self.log.warning('Could not find traject config (bundle show arcuit failed and arcuit_dir not provided or invalid)') return None From a407d728560c95cf975c620f700a851e67bc9c55 Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Wed, 11 Feb 2026 18:09:02 -0500 Subject: [PATCH 13/44] fix: use updated example --- README.md | 39 ++++++++++++++++++++++++--------------- 1 file changed, 24 insertions(+), 15 deletions(-) diff --git a/README.md 
b/README.md index bc434e6..bc49f37 100644 --- a/README.md +++ b/README.md @@ -56,21 +56,30 @@ ArcFlow now generates standalone creator documents in addition to collection rec Creator documents are stored as XML files in `agents/` directory using the ArchivesSpace EAC-CPF export: ```xml -{ - "id": "creator_agent_corporate_entities_123", - "record_type": "creator", - "is_creator": true, - "agent_type": "agent_corporate_entities", - "agent_id": 123, - "title": "University Archives", - "creator_sort_name": "University Archives", - "bioghist_html": "

<p>Established in 1963...</p>
", - "bioghist_text": "Established in 1963...", - "dates": "1963-", - "collection_ids": ["15-0-1234", "15-0-5678"], - "collection_titles": ["Collection A", "Collection B"], - "repository": ["University Library"] -} + + + + + + corporateBody + + Core: Leadership, Infrastructure, Futures + local + + + + + 2020- + + +

Founded on September 1, 2020, the Core: Leadership, Infrastructure, Futures division of the American Library Association has a mission to cultivate and amplify the collective expertise of library workers in core functions through community building, advocacy, and learning. + In June 2020, the ALA Council voted to approve Core: Leadership, Infrastructure, Futures as a new ALA division beginning September 1, 2020, and to dissolve the Association for Library Collections and Technical Services (ALCTS), the Library Information Technology Association (LITA) and the Library Leadership and Management Association (LLAMA) effective August 31, 2020. The vote to form Core was 163 to 1.(1)

+ 1. "ALA Council approves Core; dissolves ALCTS, LITA and LLAMA," July 1, 2020, http://www.ala.org/news/member-news/2020/07/ala-council-approves-core-dissolves-alcts-lita-and-llama. +
+
+ +
+
``` ### Indexing Creator Documents From 8ec04457597262598913a52cf01f38d3369ef2a6 Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Wed, 11 Feb 2026 18:10:08 -0500 Subject: [PATCH 14/44] Update README.md fix typo Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index bc49f37..6c570bf 100644 --- a/README.md +++ b/README.md @@ -157,7 +157,7 @@ Optional arguments: - `--force-update` - Force update of all data (recreates everything from scratch) - `--traject-extra-config` - Path to extra Traject configuration file - `--agents-only` - Process only agent records, skip collections (useful for testing agents) -- `--collections-only` - Skips creators, proccesses EAD, PDF finding aid and indexes collections +- `--collections-only` - Skips creators, processes EAD, PDF finding aid and indexes collections - `--skip-creator-indexing` - Collects EAC-CPF files only, does not index into Solr ### Examples From 202a532f47df426aea8a6fb2eba3144f58f6f768 Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Wed, 11 Feb 2026 18:12:00 -0500 Subject: [PATCH 15/44] Update traject_config_eac_cpf.rb In Nokogiri XPath, every namespaced element must be prefixed (e.g., //eac:control/eac:recordId) with a namespace mapping. exists. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- traject_config_eac_cpf.rb | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/traject_config_eac_cpf.rb b/traject_config_eac_cpf.rb index 3e4b7ef..517ee55 100644 --- a/traject_config_eac_cpf.rb +++ b/traject_config_eac_cpf.rb @@ -41,8 +41,8 @@ # Cannot rely on recordId being present. Must extract from filename or generate. 
to_field 'id' do |record, accumulator, context| # Try 1: Extract from control/recordId (if present) - record_id = record.xpath('//eac-cpf/control/recordId', - 'eac-cpf' => 'urn:isbn:1-931666-33-4').first + record_id = record.xpath('//eac-cpf:control/eac-cpf:recordId', + 'eac-cpf' => 'urn:isbn:1-931666-33-4').first record_id ||= record.xpath('//control/recordId').first if record_id && !record_id.text.strip.empty? From deb325ec7e84bb922e79a54ef5ae767dd319ed7f Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 11 Feb 2026 23:12:42 +0000 Subject: [PATCH 16/44] Initial plan From 05239d19dbe9da7e6c229c791cc8aaafb259f74a Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 11 Feb 2026 23:15:39 +0000 Subject: [PATCH 17/44] Use consistent eac: namespace prefix in all XPath queries Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com> --- traject_config_eac_cpf.rb | 68 ++++++++++++++++----------------------- 1 file changed, 27 insertions(+), 41 deletions(-) diff --git a/traject_config_eac_cpf.rb b/traject_config_eac_cpf.rb index 517ee55..7a78c97 100644 --- a/traject_config_eac_cpf.rb +++ b/traject_config_eac_cpf.rb @@ -17,6 +17,9 @@ # Use TrajectPlus macros (provides extract_xpath and other helpers) extend TrajectPlus::Macros +# EAC-CPF namespace - used consistently throughout this config +EAC_NS = { 'eac' => 'urn:isbn:1-931666-33-4' } + settings do provide "solr.url", ENV['SOLR_URL'] || "http://localhost:8983/solr/blacklight-core" provide "solr_writer.commit_on_close", "true" @@ -41,9 +44,7 @@ # Cannot rely on recordId being present. Must extract from filename or generate. 
to_field 'id' do |record, accumulator, context| # Try 1: Extract from control/recordId (if present) - record_id = record.xpath('//eac-cpf:control/eac-cpf:recordId', - 'eac-cpf' => 'urn:isbn:1-931666-33-4').first - record_id ||= record.xpath('//control/recordId').first + record_id = record.xpath('//eac:control/eac:recordId', EAC_NS).first if record_id && !record_id.text.strip.empty? accumulator << record_id.text.strip @@ -60,8 +61,8 @@ context.logger.info("Using filename-based ID: #{id_from_filename}") else # Try 3: Generate from entity type and name - entity_type = record.xpath('//identity/entityType').first&.text&.strip - name_entry = record.xpath('//identity/nameEntry/part').first&.text&.strip + entity_type = record.xpath('//eac:cpfDescription/eac:identity/eac:entityType', EAC_NS).first&.text&.strip + name_entry = record.xpath('//eac:cpfDescription/eac:identity/eac:nameEntry/eac:part', EAC_NS).first&.text&.strip if entity_type && name_entry # Create stable ID from type and name @@ -84,8 +85,8 @@ end else # No filename available, generate from name - entity_type = record.xpath('//identity/entityType').first&.text&.strip - name_entry = record.xpath('//identity/nameEntry/part').first&.text&.strip + entity_type = record.xpath('//eac:cpfDescription/eac:identity/eac:entityType', EAC_NS).first&.text&.strip + name_entry = record.xpath('//eac:cpfDescription/eac:identity/eac:nameEntry/eac:part', EAC_NS).first&.text&.strip if entity_type && name_entry type_short = case entity_type @@ -120,38 +121,23 @@ # Entity type (corporateBody, person, family) to_field 'entity_type' do |record, accumulator| - entity = record.xpath('//cpfDescription/identity/entityType', - 'eac-cpf' => 'urn:isbn:1-931666-33-4').first - if entity - accumulator << entity.text - else - # Fallback without namespace - entity = record.xpath('//identity/entityType').first - accumulator << entity.text if entity - end + entity = record.xpath('//eac:cpfDescription/eac:identity/eac:entityType', EAC_NS).first + 
accumulator << entity.text if entity end # Title/name fields - from authorized form of name to_field 'title' do |record, accumulator| - # Try with namespace - name = record.xpath('//cpfDescription/identity/nameEntry/part', - 'eac-cpf' => 'urn:isbn:1-931666-33-4') - if name.any? - accumulator << name.map(&:text).join(' ') - else - # Fallback without namespace - name = record.xpath('//identity/nameEntry/part') - accumulator << name.map(&:text).join(' ') if name.any? - end + name = record.xpath('//eac:cpfDescription/eac:identity/eac:nameEntry/eac:part', EAC_NS) + accumulator << name.map(&:text).join(' ') if name.any? end to_field 'title_display' do |record, accumulator| - name = record.xpath('//identity/nameEntry/part') + name = record.xpath('//eac:cpfDescription/eac:identity/eac:nameEntry/eac:part', EAC_NS) accumulator << name.map(&:text).join(' ') if name.any? end to_field 'title_sort' do |record, accumulator| - name = record.xpath('//identity/nameEntry/part') + name = record.xpath('//eac:cpfDescription/eac:identity/eac:nameEntry/eac:part', EAC_NS) if name.any? text = name.map(&:text).join(' ') accumulator << text.gsub(/^(a|an|the)\s+/i, '').downcase @@ -161,10 +147,10 @@ # Dates of existence to_field 'dates' do |record, accumulator| # Try existDates element - dates = record.xpath('//existDates/dateRange/fromDate | //existDates/dateRange/toDate | //existDates/date') + dates = record.xpath('//eac:cpfDescription/eac:description/eac:existDates/eac:dateRange/eac:fromDate | //eac:cpfDescription/eac:description/eac:existDates/eac:dateRange/eac:toDate | //eac:cpfDescription/eac:description/eac:existDates/eac:date', EAC_NS) if dates.any? 
- from_date = record.xpath('//existDates/dateRange/fromDate').first - to_date = record.xpath('//existDates/dateRange/toDate').first + from_date = record.xpath('//eac:cpfDescription/eac:description/eac:existDates/eac:dateRange/eac:fromDate', EAC_NS).first + to_date = record.xpath('//eac:cpfDescription/eac:description/eac:existDates/eac:dateRange/eac:toDate', EAC_NS).first if from_date || to_date from_text = from_date ? from_date.text : '' @@ -180,7 +166,7 @@ # Biographical/historical note - text content to_field 'bioghist_text' do |record, accumulator| # Extract text from biogHist elements - bioghist = record.xpath('//biogHist//p') + bioghist = record.xpath('//eac:cpfDescription/eac:description/eac:biogHist//eac:p', EAC_NS) if bioghist.any? text = bioghist.map(&:text).join(' ') accumulator << text @@ -189,7 +175,7 @@ # Biographical/historical note - HTML to_field 'bioghist_html' do |record, accumulator| - bioghist = record.xpath('//biogHist//p') + bioghist = record.xpath('//eac:cpfDescription/eac:description/eac:biogHist//eac:p', EAC_NS) if bioghist.any? html = bioghist.map { |p| "

<p>#{p.text}</p>
" }.join("\n") accumulator << html @@ -199,17 +185,17 @@ # Full-text search field to_field 'text' do |record, accumulator| # Title - name = record.xpath('//identity/nameEntry/part') + name = record.xpath('//eac:cpfDescription/eac:identity/eac:nameEntry/eac:part', EAC_NS) accumulator << name.map(&:text).join(' ') if name.any? # Bioghist - bioghist = record.xpath('//biogHist//p') + bioghist = record.xpath('//eac:cpfDescription/eac:description/eac:biogHist//eac:p', EAC_NS) accumulator << bioghist.map(&:text).join(' ') if bioghist.any? end # Related agents (from cpfRelation elements) to_field 'related_agents_ssim' do |record, accumulator| - relations = record.xpath('//cpfRelation') + relations = record.xpath('//eac:cpfDescription/eac:relations/eac:cpfRelation', EAC_NS) relations.each do |rel| # Get the related entity href/identifier href = rel['href'] || rel['xlink:href'] @@ -218,7 +204,7 @@ if href # Store as: "uri|type" for easy parsing later accumulator << "#{href}|#{relation_type}" - elsif relation_entry = rel.xpath('relationEntry').first + elsif relation_entry = rel.xpath('eac:relationEntry', EAC_NS).first # If no href, at least store the name name = relation_entry.text accumulator << "#{name}|#{relation_type}" if name @@ -228,7 +214,7 @@ # Related agents - just URIs (for simpler queries) to_field 'related_agent_uris_ssim' do |record, accumulator| - relations = record.xpath('//cpfRelation') + relations = record.xpath('//eac:cpfDescription/eac:relations/eac:cpfRelation', EAC_NS) relations.each do |rel| href = rel['href'] || rel['xlink:href'] accumulator << href if href @@ -237,7 +223,7 @@ # Relationship types to_field 'relationship_types_ssim' do |record, accumulator| - relations = record.xpath('//cpfRelation') + relations = record.xpath('//eac:cpfDescription/eac:relations/eac:cpfRelation', EAC_NS) relations.each do |rel| relation_type = rel['cpfRelationType'] accumulator << relation_type if relation_type && !accumulator.include?(relation_type) @@ -247,7 +233,7 @@ 
# Agent source URI (from original ArchivesSpace) to_field 'agent_uri' do |record, accumulator| # Try to extract from control section or otherRecordId - other_id = record.xpath('//control/otherRecordId[@localType="archivesspace_uri"]').first + other_id = record.xpath('//eac:control/eac:otherRecordId[@localType="archivesspace_uri"]', EAC_NS).first if other_id accumulator << other_id.text end @@ -265,7 +251,7 @@ # Log successful indexing each_record do |record, context| - record_id = record.xpath('//control/recordId').first + record_id = record.xpath('//eac:control/eac:recordId', EAC_NS).first if record_id context.logger.info("Indexed creator: #{record_id.text}") end From 5d2588e21fd3ce5ca8c2c2c7c21a05dd7048b956 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 11 Feb 2026 23:17:27 +0000 Subject: [PATCH 18/44] Refactor dates XPath for improved readability Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com> --- traject_config_eac_cpf.rb | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/traject_config_eac_cpf.rb b/traject_config_eac_cpf.rb index 7a78c97..27773cd 100644 --- a/traject_config_eac_cpf.rb +++ b/traject_config_eac_cpf.rb @@ -147,10 +147,11 @@ # Dates of existence to_field 'dates' do |record, accumulator| # Try existDates element - dates = record.xpath('//eac:cpfDescription/eac:description/eac:existDates/eac:dateRange/eac:fromDate | //eac:cpfDescription/eac:description/eac:existDates/eac:dateRange/eac:toDate | //eac:cpfDescription/eac:description/eac:existDates/eac:date', EAC_NS) + base_path = '//eac:cpfDescription/eac:description/eac:existDates' + dates = record.xpath("#{base_path}/eac:dateRange/eac:fromDate | #{base_path}/eac:dateRange/eac:toDate | #{base_path}/eac:date", EAC_NS) if dates.any? 
- from_date = record.xpath('//eac:cpfDescription/eac:description/eac:existDates/eac:dateRange/eac:fromDate', EAC_NS).first - to_date = record.xpath('//eac:cpfDescription/eac:description/eac:existDates/eac:dateRange/eac:toDate', EAC_NS).first + from_date = record.xpath("#{base_path}/eac:dateRange/eac:fromDate", EAC_NS).first + to_date = record.xpath("#{base_path}/eac:dateRange/eac:toDate", EAC_NS).first if from_date || to_date from_text = from_date ? from_date.text : '' From e092b10821c815c23b454bf507791749ad9c8643 Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Thu, 12 Feb 2026 15:09:52 -0500 Subject: [PATCH 19/44] Update README.md punctuation/spacing Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 6c570bf..7352155 100644 --- a/README.md +++ b/README.md @@ -42,7 +42,7 @@ ArcFlow now generates standalone creator documents in addition to collection rec - Link to all collections where the creator is listed - Can be searched and displayed independently in ArcLight - Are marked with `is_creator: true` to distinguish from collections -- Must be fed into a Solr instance with fields to match their specific facets (See:Configure Solr Schema below ) +- Must be fed into a Solr instance with fields to match their specific facets (See: Configure Solr Schema below) ### How Creator Records Work From 32d923e2d462b939c78d0646ebee5e479140b833 Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Fri, 13 Feb 2026 11:49:08 -0500 Subject: [PATCH 20/44] fix: use dynamic mappings for ArcLight Solr fields --- traject_config_eac_cpf.rb | 35 +++++++++++++++++++++++++---------- 1 file changed, 25 insertions(+), 10 deletions(-) diff --git a/traject_config_eac_cpf.rb b/traject_config_eac_cpf.rb index 27773cd..8f6d06b 100644 --- a/traject_config_eac_cpf.rb +++ b/traject_config_eac_cpf.rb @@ -125,27 +125,32 @@ accumulator << entity.text if entity end -# Title/name fields - 
from authorized form of name -to_field 'title' do |record, accumulator| +# Title/name fields - using ArcLight dynamic field naming convention +# _tesim = text, stored, indexed, multiValued (for full-text search) +# _ssm = string, stored, multiValued (for display) +# _ssi = string, stored, indexed (for faceting/sorting) +to_field 'title_tesim' do |record, accumulator| name = record.xpath('//eac:cpfDescription/eac:identity/eac:nameEntry/eac:part', EAC_NS) accumulator << name.map(&:text).join(' ') if name.any? end -to_field 'title_display' do |record, accumulator| +to_field 'title_ssm' do |record, accumulator| name = record.xpath('//eac:cpfDescription/eac:identity/eac:nameEntry/eac:part', EAC_NS) accumulator << name.map(&:text).join(' ') if name.any? end -to_field 'title_sort' do |record, accumulator| +to_field 'title_filing_ssi' do |record, accumulator| name = record.xpath('//eac:cpfDescription/eac:identity/eac:nameEntry/eac:part', EAC_NS) if name.any? text = name.map(&:text).join(' ') + # Remove leading articles and convert to lowercase for filing accumulator << text.gsub(/^(a|an|the)\s+/i, '').downcase end end -# Dates of existence -to_field 'dates' do |record, accumulator| +# Dates of existence - using ArcLight standard field unitdate_ssm +# (matches what ArcLight uses for collection dates) +to_field 'unitdate_ssm' do |record, accumulator| # Try existDates element base_path = '//eac:cpfDescription/eac:description/eac:existDates' dates = record.xpath("#{base_path}/eac:dateRange/eac:fromDate | #{base_path}/eac:dateRange/eac:toDate | #{base_path}/eac:date", EAC_NS) @@ -164,9 +169,12 @@ end end -# Biographical/historical note - text content -to_field 'bioghist_text' do |record, accumulator| - # Extract text from biogHist elements +# Biographical/historical note - using ArcLight conventions +# _tesim for searchable plain text +# _tesm for searchable HTML (text, stored, multiValued but not for display) +# _ssm for section heading display +to_field 'bioghist_tesim' do 
|record, accumulator| + # Extract text from biogHist elements for full-text search bioghist = record.xpath('//eac:cpfDescription/eac:description/eac:biogHist//eac:p', EAC_NS) if bioghist.any? text = bioghist.map(&:text).join(' ') @@ -175,7 +183,8 @@ end # Biographical/historical note - HTML -to_field 'bioghist_html' do |record, accumulator| +to_field 'bioghist_html_tesm' do |record, accumulator| + # Extract HTML for searchable content (matches ArcLight's bioghist_html_tesm) bioghist = record.xpath('//eac:cpfDescription/eac:description/eac:biogHist//eac:p', EAC_NS) if bioghist.any? html = bioghist.map { |p| "
<p>#{p.text}</p>
" }.join("\n") @@ -183,6 +192,12 @@ end end +to_field 'bioghist_heading_ssm' do |record, accumulator| + # Extract section heading (matches ArcLight's bioghist_heading_ssm pattern) + heading = record.xpath('//eac:cpfDescription/eac:description/eac:biogHist//eac:head', EAC_NS).first + accumulator << heading.text if heading +end + # Full-text search field to_field 'text' do |record, accumulator| # Title From 628975ed0dbfc126fe676cc73e5c515e6b8389d5 Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Fri, 13 Feb 2026 11:58:11 -0500 Subject: [PATCH 21/44] Ensure that extra traject config is processed MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit as a separate command. Detailed explanation: self.traject_extra_config is constructed as a single string containing a space (e.g., "-c /path/to/file.rb"), but subprocess.run(cmd) passes arguments verbatim and Traject won't parse that as two flags/values. Store the extra config as a path (or as an already-split argv list) and append it as ['-c', traject_extra_config] (or extend with a list) when building cmd. 
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- arcflow/main.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arcflow/main.py b/arcflow/main.py index 5cdd072..490057e 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -43,7 +43,8 @@ class ArcFlow: def __init__(self, arclight_dir, aspace_dir, solr_url, traject_extra_config='', force_update=False, agents_only=False, collections_only=False, arcuit_dir=None, skip_creator_indexing=False): self.solr_url = solr_url self.batch_size = 1000 - self.traject_extra_config = f'-c {traject_extra_config}' if traject_extra_config.strip() else '' + clean_extra_config = traject_extra_config.strip() + self.traject_extra_config = clean_extra_config or None self.arclight_dir = arclight_dir self.aspace_jobs_dir = f'{aspace_dir}/data/shared/job_files' self.job_type = 'print_to_pdf_job' From ae77c982014f27d042ae19953304e7e347d6c9ac Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Fri, 13 Feb 2026 11:59:22 -0500 Subject: [PATCH 22/44] add extra traject config with extend to prevent commands from running together Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- arcflow/main.py | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arcflow/main.py b/arcflow/main.py index 490057e..4bede4a 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -570,7 +570,8 @@ def index_collections(self, repo_id, xml_file_path, indent_size=0): if isinstance(self.traject_extra_config, (list, tuple)): cmd.extend(self.traject_extra_config) else: - cmd.append(self.traject_extra_config) + # Treat a string extra config as a path and pass it with -c + cmd.extend(['-c', self.traject_extra_config]) cmd.append(xml_file_path) From cd1f94d149e23f3967d53be49b7edbd0cf033b47 Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Fri, 13 Feb 2026 12:03:16 -0500 Subject: [PATCH 23/44] Preserve html markup Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- 
traject_config_eac_cpf.rb | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/traject_config_eac_cpf.rb b/traject_config_eac_cpf.rb index 8f6d06b..deb9ac5 100644 --- a/traject_config_eac_cpf.rb +++ b/traject_config_eac_cpf.rb @@ -187,7 +187,8 @@ # Extract HTML for searchable content (matches ArcLight's bioghist_html_tesm) bioghist = record.xpath('//eac:cpfDescription/eac:description/eac:biogHist//eac:p', EAC_NS) if bioghist.any? - html = bioghist.map { |p| "
<p>#{p.text}</p>
" }.join("\n") + # Preserve inline EAC markup inside <p> by serializing child nodes + html = bioghist.map { |p| "
<p>#{p.inner_html}</p>
" }.join("\n") accumulator << html end end From 107e43d4c457a5bcfa259d696045c5b04e8c4305 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 13 Feb 2026 17:04:11 +0000 Subject: [PATCH 24/44] Initial plan From b54236757281cee44cf2449946570e690995d24f Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 13 Feb 2026 17:05:36 +0000 Subject: [PATCH 25/44] Add check=True to subprocess.run and handle exception properly Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com> --- arcflow/main.py | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/arcflow/main.py b/arcflow/main.py index 4bede4a..c0b0da6 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -583,16 +583,17 @@ def index_collections(self, repo_id, xml_file_path, indent_size=0): cwd=self.arclight_dir, env=env, stderr=subprocess.PIPE, + check=True, ) if result.stderr: self.log.error(f'{indent}{result.stderr.decode("utf-8")}') - if result.returncode != 0: - self.log.error(f'{indent}Failed to index pending resources in repository ID {repo_id} to ArcLight Solr. 
Return code: {result.returncode}') - else: - self.log.info(f'{indent}Finished indexing pending resources in repository ID {repo_id} to ArcLight Solr.') - except subprocess.CalledProcessError as e: + + self.log.info(f'{indent}Finished indexing pending resources in repository ID {repo_id} to ArcLight Solr.') + except subprocess.CalledProcessError as e: self.log.error(f'{indent}Error indexing pending resources in repository ID {repo_id} to ArcLight Solr: {e}') + if e.stderr: + self.log.error(f'{indent}{e.stderr.decode("utf-8")}') def get_creator_bioghist(self, resource, indent_size=0): From 092d1beaef833b3f6f1fecb770d3dd2eb0645589 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 13 Feb 2026 17:22:57 +0000 Subject: [PATCH 26/44] Move success log to else block for clarity Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com> --- arcflow/main.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arcflow/main.py b/arcflow/main.py index c0b0da6..579bd5a 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -588,12 +588,12 @@ def index_collections(self, repo_id, xml_file_path, indent_size=0): if result.stderr: self.log.error(f'{indent}{result.stderr.decode("utf-8")}') - - self.log.info(f'{indent}Finished indexing pending resources in repository ID {repo_id} to ArcLight Solr.') except subprocess.CalledProcessError as e: self.log.error(f'{indent}Error indexing pending resources in repository ID {repo_id} to ArcLight Solr: {e}') if e.stderr: self.log.error(f'{indent}{e.stderr.decode("utf-8")}') + else: + self.log.info(f'{indent}Finished indexing pending resources in repository ID {repo_id} to ArcLight Solr.') def get_creator_bioghist(self, resource, indent_size=0): From 924ea473f143ad76f087862e7aed3bbf1b3e2c3a Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 13 Feb 2026 17:30:00 +0000 Subject: [PATCH 27/44] Revert 
to original error handling without check=True Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com> --- arcflow/main.py | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/arcflow/main.py b/arcflow/main.py index 579bd5a..292d049 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -583,17 +583,16 @@ def index_collections(self, repo_id, xml_file_path, indent_size=0): cwd=self.arclight_dir, env=env, stderr=subprocess.PIPE, - check=True, ) if result.stderr: self.log.error(f'{indent}{result.stderr.decode("utf-8")}') + if result.returncode != 0: + self.log.error(f'{indent}Failed to index pending resources in repository ID {repo_id} to ArcLight Solr. Return code: {result.returncode}') + else: + self.log.info(f'{indent}Finished indexing pending resources in repository ID {repo_id} to ArcLight Solr.') except subprocess.CalledProcessError as e: self.log.error(f'{indent}Error indexing pending resources in repository ID {repo_id} to ArcLight Solr: {e}') - if e.stderr: - self.log.error(f'{indent}{e.stderr.decode("utf-8")}') - else: - self.log.info(f'{indent}Finished indexing pending resources in repository ID {repo_id} to ArcLight Solr.') def get_creator_bioghist(self, resource, indent_size=0): From 4923eb2224f0561659d3ae59f58c9e2eca7b1a58 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 13 Feb 2026 17:34:49 +0000 Subject: [PATCH 28/44] Initial plan From 093d936b9f9ca063d4100ce860025beb67ee9c10 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 13 Feb 2026 17:43:42 +0000 Subject: [PATCH 29/44] Add agent filtering to exclude system users and donors Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com> --- README.md | 24 +++++++- arcflow/main.py | 152 ++++++++++++++++++++++++++++++++++++++++++++---- 2 files changed, 162 insertions(+), 14 deletions(-) diff --git a/README.md b/README.md index 
7352155..55709ed 100644 --- a/README.md +++ b/README.md @@ -44,12 +44,30 @@ ArcFlow now generates standalone creator documents in addition to collection rec - Are marked with `is_creator: true` to distinguish from collections - Must be fed into a Solr instance with fields to match their specific facets (See: Configure Solr Schema below) +### Agent Filtering + +**ArcFlow automatically filters agents to include only legitimate creators** of archival materials. The following agent types are **excluded** from indexing: + +- ✗ **System users** - ArchivesSpace software users (identified by `is_user` field) +- ✗ **System-generated agents** - Auto-created for users (identified by `system_generated` field) +- ✗ **Software agents** - Non-human agents (identified by `agent_type = 'agent_software'`) +- ✗ **Repository agents** - Corporate entities representing the repository itself (identified by `is_repo_agent` field) +- ✗ **Donor-only agents** - Agents with only the 'donor' role and no creator role + +**Agents are included if they meet any of these criteria:** + +- ✓ Have the **'creator' role** in linked_agent_roles +- ✓ Are **linked to published records** (and not excluded by filters above) + +This filtering ensures that only legitimate archival creators are discoverable in ArcLight, while protecting privacy and security by excluding system users and donors. + ### How Creator Records Work 1. **Extraction**: `get_all_agents()` fetches all agents from ArchivesSpace -2. **Processing**: `task_agent()` generates an EAC-CPF XML document for each agent with bioghist notes -3. **Linking**: Handled via Solr using the persistent_id field (agents and collections linked through bioghist references) -4. **Indexing**: Creator XML files are indexed to Solr using `traject_config_eac_cpf.rb` +2. **Filtering**: `is_target_agent()` filters out system users, donors, and non-creator agents +3. 
**Processing**: `task_agent()` generates an EAC-CPF XML document for each target agent with bioghist notes +4. **Linking**: Handled via Solr using the persistent_id field (agents and collections linked through bioghist references) +5. **Indexing**: Creator XML files are indexed to Solr using `traject_config_eac_cpf.rb` ### Creator Document Format diff --git a/arcflow/main.py b/arcflow/main.py index 292d049..58a8496 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -674,10 +674,58 @@ def get_creator_bioghist(self, resource, indent_size=0): return None + def is_target_agent(self, agent): + """ + Determine if agent is a target creator of archival materials. + + Excludes: + - System users (is_user field present) + - System-generated agents (system_generated = true) + - Software agents (agent_type = 'agent_software') + - Repository agents (is_repo_agent field present) + - Donor-only agents (only has 'donor' role, no creator role) + + Args: + agent: Agent record from ArchivesSpace API + + Returns: + bool: True if agent should be indexed, False to exclude + """ + # TIER 1: Exclude system users (PRIMARY FILTER) + if agent.get('is_user'): + return False + + # TIER 2: Exclude system-generated agents + if agent.get('system_generated'): + return False + + # TIER 3: Exclude software agents + if agent.get('agent_type') == 'agent_software': + return False + + # TIER 4: Exclude repository agents (corporate entities only) + if agent.get('is_repo_agent'): + return False + + # TIER 5: Role-based filtering + roles = agent.get('linked_agent_roles', []) + + # Include if explicitly marked as creator + if 'creator' in roles: + return True + + # Exclude if ONLY marked as donor + if roles == ['donor']: + return False + + # TIER 6: Default - include if linked to published records + # (covers cases where roles aren't populated yet) + return agent.get('is_linked_to_published_record', False) + def get_all_agents(self, agent_types=None, modified_since=0, indent_size=0): """ - Fetch ALL 
agents from ArchivesSpace (not just creators). - Uses direct agent API endpoints for comprehensive coverage. + Fetch target agents from ArchivesSpace and filter to creators only. + Excludes system users, donors, and other non-creator agents. Args: agent_types: List of agent types to fetch. Default: ['corporate_entities', 'people', 'families'] @@ -685,15 +733,25 @@ def get_all_agents(self, agent_types=None, modified_since=0, indent_size=0): indent_size: Indentation size for logging Returns: - set: Set of agent URIs (e.g., '/agents/corporate_entities/123') + list: List of filtered agent URIs (e.g., '/agents/corporate_entities/123') """ if agent_types is None: agent_types = ['corporate_entities', 'people', 'families'] indent = ' ' * indent_size - all_agents = set() + target_agents = [] + stats = { + 'total': 0, + 'excluded_user': 0, + 'excluded_system_generated': 0, + 'excluded_software': 0, + 'excluded_repo_agent': 0, + 'excluded_donor_only': 0, + 'excluded_no_links': 0, + 'included': 0 + } - self.log.info(f'{indent}Fetching ALL agents from ArchivesSpace...') + self.log.info(f'{indent}Fetching agents from ArchivesSpace and applying filters...') for agent_type in agent_types: try: @@ -705,12 +763,59 @@ def get_all_agents(self, agent_types=None, modified_since=0, indent_size=0): response = self.client.get(f'/agents/{agent_type}', params=params) agent_ids = response.json() - self.log.info(f'{indent}Found {len(agent_ids)} {agent_type} agents') + self.log.info(f'{indent}Found {len(agent_ids)} {agent_type} agents, filtering...') - # Add agent URIs to set + # Fetch and filter each agent for agent_id in agent_ids: + stats['total'] += 1 agent_uri = f'/agents/{agent_type}/{agent_id}' - all_agents.add(agent_uri) + + try: + # Fetch full agent record to access filtering fields + agent_response = self.client.get(agent_uri) + agent = agent_response.json() + + # Apply filtering logic + if agent.get('is_user'): + stats['excluded_user'] += 1 + continue + + if 
agent.get('system_generated'): + stats['excluded_system_generated'] += 1 + continue + + if agent.get('agent_type') == 'agent_software': + stats['excluded_software'] += 1 + continue + + if agent.get('is_repo_agent'): + stats['excluded_repo_agent'] += 1 + continue + + roles = agent.get('linked_agent_roles', []) + + # Include creators + if 'creator' in roles: + stats['included'] += 1 + target_agents.append(agent_uri) + continue + + # Exclude donor-only agents + if roles == ['donor']: + stats['excluded_donor_only'] += 1 + continue + + # Default: include if linked to published records + if agent.get('is_linked_to_published_record', False): + stats['included'] += 1 + target_agents.append(agent_uri) + else: + stats['excluded_no_links'] += 1 + + except Exception as e: + self.log.warning(f'{indent}Error fetching agent {agent_uri}: {e}') + # On error, include the agent (fail-open) + target_agents.append(agent_uri) except Exception as e: self.log.error(f'{indent}Error fetching {agent_type} agents: {e}') @@ -721,14 +826,39 @@ def get_all_agents(self, agent_types=None, modified_since=0, indent_size=0): response = self.client.get(f'/agents/{agent_type}', params={'all_ids': True}) agent_ids = response.json() self.log.info(f'{indent}Found {len(agent_ids)} {agent_type} agents (no date filter)') + + # Re-process with filtering for agent_id in agent_ids: + stats['total'] += 1 agent_uri = f'/agents/{agent_type}/{agent_id}' - all_agents.add(agent_uri) + + try: + agent_response = self.client.get(agent_uri) + agent = agent_response.json() + + if self.is_target_agent(agent): + stats['included'] += 1 + target_agents.append(agent_uri) + + except Exception as e: + self.log.warning(f'{indent}Error fetching agent {agent_uri}: {e}') + target_agents.append(agent_uri) + except Exception as e2: self.log.error(f'{indent}Failed to fetch {agent_type} agents: {e2}') - self.log.info(f'{indent}Found {len(all_agents)} total agents across all types.') - return all_agents + # Log filtering statistics + 
self.log.info(f'{indent}Agent filtering complete:') + self.log.info(f'{indent} Total agents processed: {stats["total"]}') + self.log.info(f'{indent} Included (target creators): {stats["included"]}') + self.log.info(f'{indent} Excluded (system users): {stats["excluded_user"]}') + self.log.info(f'{indent} Excluded (system-generated): {stats["excluded_system_generated"]}') + self.log.info(f'{indent} Excluded (software agents): {stats["excluded_software"]}') + self.log.info(f'{indent} Excluded (repository agents): {stats["excluded_repo_agent"]}') + self.log.info(f'{indent} Excluded (donor-only): {stats["excluded_donor_only"]}') + self.log.info(f'{indent} Excluded (no published links): {stats["excluded_no_links"]}') + + return target_agents def task_agent(self, agent_uri, agents_dir, repo_id=1, indent_size=0): From f264373d1812a39dc33d58847f7102dcdf15b619 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 13 Feb 2026 18:00:20 +0000 Subject: [PATCH 30/44] Remove redundant software agent filter - already excluded by endpoint selection Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com> --- README.md | 2 +- arcflow/main.py | 19 +++++-------------- 2 files changed, 6 insertions(+), 15 deletions(-) diff --git a/README.md b/README.md index 55709ed..3eac054 100644 --- a/README.md +++ b/README.md @@ -50,7 +50,7 @@ ArcFlow now generates standalone creator documents in addition to collection rec - ✗ **System users** - ArchivesSpace software users (identified by `is_user` field) - ✗ **System-generated agents** - Auto-created for users (identified by `system_generated` field) -- ✗ **Software agents** - Non-human agents (identified by `agent_type = 'agent_software'`) +- ✗ **Software agents** - Excluded by not querying the `/agents/software` endpoint - ✗ **Repository agents** - Corporate entities representing the repository itself (identified by `is_repo_agent` field) - ✗ **Donor-only agents** - Agents 
with only the 'donor' role and no creator role diff --git a/arcflow/main.py b/arcflow/main.py index 58a8496..18875ed 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -681,10 +681,11 @@ def is_target_agent(self, agent): Excludes: - System users (is_user field present) - System-generated agents (system_generated = true) - - Software agents (agent_type = 'agent_software') - Repository agents (is_repo_agent field present) - Donor-only agents (only has 'donor' role, no creator role) + Note: Software agents are excluded by not querying /agents/software endpoint. + Args: agent: Agent record from ArchivesSpace API @@ -699,15 +700,11 @@ def is_target_agent(self, agent): if agent.get('system_generated'): return False - # TIER 3: Exclude software agents - if agent.get('agent_type') == 'agent_software': - return False - - # TIER 4: Exclude repository agents (corporate entities only) + # TIER 3: Exclude repository agents (corporate entities only) if agent.get('is_repo_agent'): return False - # TIER 5: Role-based filtering + # TIER 4: Role-based filtering roles = agent.get('linked_agent_roles', []) # Include if explicitly marked as creator @@ -718,7 +715,7 @@ def is_target_agent(self, agent): if roles == ['donor']: return False - # TIER 6: Default - include if linked to published records + # TIER 5: Default - include if linked to published records # (covers cases where roles aren't populated yet) return agent.get('is_linked_to_published_record', False) @@ -744,7 +741,6 @@ def get_all_agents(self, agent_types=None, modified_since=0, indent_size=0): 'total': 0, 'excluded_user': 0, 'excluded_system_generated': 0, - 'excluded_software': 0, 'excluded_repo_agent': 0, 'excluded_donor_only': 0, 'excluded_no_links': 0, @@ -784,10 +780,6 @@ def get_all_agents(self, agent_types=None, modified_since=0, indent_size=0): stats['excluded_system_generated'] += 1 continue - if agent.get('agent_type') == 'agent_software': - stats['excluded_software'] += 1 - continue - if 
agent.get('is_repo_agent'): stats['excluded_repo_agent'] += 1 continue @@ -853,7 +845,6 @@ def get_all_agents(self, agent_types=None, modified_since=0, indent_size=0): self.log.info(f'{indent} Included (target creators): {stats["included"]}') self.log.info(f'{indent} Excluded (system users): {stats["excluded_user"]}') self.log.info(f'{indent} Excluded (system-generated): {stats["excluded_system_generated"]}') - self.log.info(f'{indent} Excluded (software agents): {stats["excluded_software"]}') self.log.info(f'{indent} Excluded (repository agents): {stats["excluded_repo_agent"]}') self.log.info(f'{indent} Excluded (donor-only): {stats["excluded_donor_only"]}') self.log.info(f'{indent} Excluded (no published links): {stats["excluded_no_links"]}') From e3269832222bc4f8fe4b8db3232ff6839b6ea7e7 Mon Sep 17 00:00:00 2001 From: Copilot <198982749+Copilot@users.noreply.github.com> Date: Thu, 19 Feb 2026 11:16:06 -0500 Subject: [PATCH 31/44] Reorder traject config discovery to follow collection records pattern (#14) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * Improve traject config discovery and logging - Add fallback search in arcflow package directory for development - Add clear logging showing which traject config is being used - Add warning when using arcflow package version (development mode) - Improve error messages when traject config not found - Document that traject config belongs in arcuit gem, not arcflow - Update README with traject config location guidance Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com> * Address code review feedback - Change log level from error to warning for missing traject config - Update example path to clarify arcuit gem location - Show actual searched paths in error message for better troubleshooting Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com> * Reorder traject config search to follow collection records pattern - Change search order: 
arcuit_dir (1st) → bundle show (2nd) → example file (3rd) - Rename traject_config_eac_cpf.rb to example_traject_config_eac_cpf.rb - Prioritize arcuit_dir parameter as most up-to-date user control - Fall back to example file for module usage without arcuit - Update README with new search order and example file guidance Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com> * Address code review feedback on example file - Update usage comment to reference correct filename - Improve log message formatting for consistency - Add note about copying to arcuit for production use Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com> * Update traject config search paths to follow ArcLight pattern - Remove arcuit_dir/arcflow path (development artifact) - Add arcuit_dir/lib/arcuit/traject path (matches EAD traject location) - Apply same paths to both arcuit_dir and bundle show arcuit searches - Update debug message to reflect new subdirectory checked Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com> * Simplify example traject config search to single known location - Remove candidate paths loop for example file - Directly check the one known location at repo root - Add comment explaining we know the exact location Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com> --- README.md | 17 +++- arcflow/main.py | 85 ++++++++++++------- ...pf.rb => example_traject_config_eac_cpf.rb | 4 +- 3 files changed, 74 insertions(+), 32 deletions(-) rename traject_config_eac_cpf.rb => example_traject_config_eac_cpf.rb (98%) diff --git a/README.md b/README.md index 3eac054..710ddc4 100644 --- a/README.md +++ b/README.md @@ -137,7 +137,22 @@ This is a **one-time setup** per Solr instance. 
--- -To index creator documents to Solr: +### Traject Configuration for Creator Indexing + +The `traject_config_eac_cpf.rb` file defines how EAC-CPF creator records are mapped to Solr fields. + +**Search Order**: arcflow searches for the traject config following the collection records pattern: +1. **arcuit_dir parameter** (if provided via `--arcuit-dir`) - Highest priority, most up-to-date user control +2. **arcuit gem** (via `bundle show arcuit`) - For backward compatibility when arcuit_dir not provided +3. **example_traject_config_eac_cpf.rb** in arcflow - Fallback for module usage without arcuit + +**Example File**: arcflow includes `example_traject_config_eac_cpf.rb` as a reference implementation. For production: +- Copy this file to your arcuit gem as `traject_config_eac_cpf.rb`, or +- Specify the location with `--arcuit-dir /path/to/arcuit` + +**Logging**: arcflow clearly logs which traject config file is being used when creator indexing runs. + +To index creator documents to Solr manually: ```bash bundle exec traject \ diff --git a/arcflow/main.py b/arcflow/main.py index 18875ed..50f7583 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -965,14 +965,15 @@ def process_creators(self): self.log.info(f'{indent}Indexing {len(creator_ids)} creator records to Solr...') traject_config = self.find_traject_config() if traject_config: + self.log.info(f'{indent}Using traject config: {traject_config}') indexed = self.index_creators(agents_dir, creator_ids) self.log.info(f'{indent}Creator indexing complete: {indexed}/{len(creator_ids)} indexed') else: - self.log.info(f'{indent}Skipping creator indexing (traject config not found)') + self.log.warning(f'{indent}Skipping creator indexing (traject config not found)') self.log.info(f'{indent}To index manually:') self.log.info(f'{indent} cd {self.arclight_dir}') self.log.info(f'{indent} bundle exec traject -u {self.solr_url} -i xml \\') - self.log.info(f'{indent} -c /path/to/arcuit/arcflow/traject_config_eac_cpf.rb \\') + 
self.log.info(f'{indent} -c /path/to/arcuit-gem/traject_config_eac_cpf.rb \\') self.log.info(f'{indent} {agents_dir}/*.xml') elif self.skip_creator_indexing: self.log.info(f'{indent}Skipping creator indexing (--skip-creator-indexing flag set)') @@ -984,15 +985,32 @@ def find_traject_config(self): """ Find the traject config for creator indexing. - Tries: - 1. bundle show arcuit (finds installed gem) - 2. self.arcuit_dir (explicit path) - 3. Returns None if neither works + Search order (follows collection records pattern): + 1. arcuit_dir if provided (most up-to-date user control) + 2. arcuit gem via bundle show (for backward compatibility) + 3. example_traject_config_eac_cpf.rb in arcflow (fallback when used as module without arcuit) Returns: str: Path to traject config, or None if not found """ - # Try bundle show arcuit first + self.log.info('Searching for traject_config_eac_cpf.rb...') + searched_paths = [] + + # Try 1: arcuit_dir if provided (highest priority - user's explicit choice) + if self.arcuit_dir: + self.log.debug(f' Checking arcuit_dir parameter: {self.arcuit_dir}') + candidate_paths = [ + os.path.join(self.arcuit_dir, 'traject_config_eac_cpf.rb'), + os.path.join(self.arcuit_dir, 'lib', 'arcuit', 'traject', 'traject_config_eac_cpf.rb'), + ] + searched_paths.extend(candidate_paths) + for traject_config in candidate_paths: + if os.path.exists(traject_config): + self.log.info(f'✓ Using traject config from arcuit_dir: {traject_config}') + return traject_config + self.log.debug(' traject_config_eac_cpf.rb not found in arcuit_dir') + + # Try 2: bundle show arcuit (for backward compatibility when arcuit_dir not provided) try: result = subprocess.run( ['bundle', 'show', 'arcuit'], @@ -1003,39 +1021,46 @@ def find_traject_config(self): ) if result.returncode == 0: arcuit_path = result.stdout.strip() - # Prefer config at gem root, fall back to legacy subdirectory layout + self.log.debug(f' Found arcuit gem at: {arcuit_path}') candidate_paths = [ 
os.path.join(arcuit_path, 'traject_config_eac_cpf.rb'), - os.path.join(arcuit_path, 'arcflow', 'traject_config_eac_cpf.rb'), + os.path.join(arcuit_path, 'lib', 'arcuit', 'traject', 'traject_config_eac_cpf.rb'), ] + searched_paths.extend(candidate_paths) for traject_config in candidate_paths: if os.path.exists(traject_config): - self.log.info(f'Found traject config via bundle show: {traject_config}') + self.log.info(f'✓ Using traject config from arcuit gem: {traject_config}') return traject_config - self.log.warning( - 'bundle show arcuit succeeded but traject_config_eac_cpf.rb ' - 'was not found in any expected location under the gem root' + self.log.debug( + ' traject_config_eac_cpf.rb not found in arcuit gem ' + '(checked root and lib/arcuit/traject/ subdirectory)' ) else: - self.log.debug('bundle show arcuit failed (gem not installed?)') + self.log.debug(' arcuit gem not found via bundle show') except Exception as e: - self.log.debug(f'Error running bundle show arcuit: {e}') - # Fall back to arcuit_dir if provided - if self.arcuit_dir: - candidate_paths = [ - os.path.join(self.arcuit_dir, 'traject_config_eac_cpf.rb'), - os.path.join(self.arcuit_dir, 'arcflow', 'traject_config_eac_cpf.rb'), - ] - for traject_config in candidate_paths: - if os.path.exists(traject_config): - self.log.info(f'Using traject config from arcuit_dir: {traject_config}') - return traject_config - self.log.warning( - 'arcuit_dir provided but traject_config_eac_cpf.rb was not found ' - 'in any expected location' + self.log.debug(f' Error checking for arcuit gem: {e}') + + # Try 3: example file in arcflow package (fallback for module usage without arcuit) + # We know exactly where this file is located - at the repo root + arcflow_package_dir = os.path.dirname(os.path.abspath(__file__)) + arcflow_repo_root = os.path.dirname(arcflow_package_dir) + traject_config = os.path.join(arcflow_repo_root, 'example_traject_config_eac_cpf.rb') + searched_paths.append(traject_config) + + if 
os.path.exists(traject_config): + self.log.info(f'✓ Using example traject config from arcflow: {traject_config}') + self.log.info( + ' Note: Using example config. For production, copy this file to your ' + 'arcuit gem or specify location with --arcuit-dir.' ) - # No config found - self.log.warning('Could not find traject config (bundle show arcuit failed and arcuit_dir not provided or invalid)') + return traject_config + + # No config found anywhere - show all paths searched + self.log.error('✗ Could not find traject_config_eac_cpf.rb in any of these locations:') + for i, path in enumerate(searched_paths, 1): + self.log.error(f' {i}. {path}') + self.log.error('') + self.log.error(' Add traject_config_eac_cpf.rb to your arcuit gem or specify with --arcuit-dir.') return None diff --git a/traject_config_eac_cpf.rb b/example_traject_config_eac_cpf.rb similarity index 98% rename from traject_config_eac_cpf.rb rename to example_traject_config_eac_cpf.rb index deb9ac5..1fe97d0 100644 --- a/traject_config_eac_cpf.rb +++ b/example_traject_config_eac_cpf.rb @@ -4,7 +4,9 @@ # Persons, and Families) XML documents from ArchivesSpace archival_contexts endpoint. 
# # Usage: -# bundle exec traject -u $SOLR_URL -c traject_config_eac_cpf.rb /path/to/agents/*.xml +# bundle exec traject -u $SOLR_URL -c example_traject_config_eac_cpf.rb /path/to/agents/*.xml +# +# For production, copy this file to your arcuit gem as traject_config_eac_cpf.rb # # The EAC-CPF XML documents are retrieved directly from ArchivesSpace via: # /repositories/{repo_id}/archival_contexts/{agent_type}/{id}.xml From 2e6619cafffddb60d4522eb253e902d3f982e837 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 19 Feb 2026 19:39:15 +0000 Subject: [PATCH 32/44] Initial plan From 76be50402c114b8af3736d394384ea93a0170fc7 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Thu, 19 Feb 2026 19:42:03 +0000 Subject: [PATCH 33/44] Replace shell wildcards with Python glob for security Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com> --- arcflow/main.py | 34 +++++++++++++++++++++------------- 1 file changed, 21 insertions(+), 13 deletions(-) diff --git a/arcflow/main.py b/arcflow/main.py index 50f7583..cb4de38 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -10,6 +10,7 @@ import logging import math import sys +import glob from xml.dom.pulldom import parse, START_ELEMENT from xml.sax.saxutils import escape as xml_escape from xml.etree import ElementTree as ET @@ -472,18 +473,17 @@ def update_eads(self): # Remove pending symlinks after indexing for repo_id, batch_num in batches: - xml_file_path = f'{xml_dir}/{repo_id}_*_batch_{batch_num}.xml' - try: - result = subprocess.run( - f'rm {xml_file_path}', - shell=True, - cwd=self.arclight_dir, - stderr=subprocess.PIPE,) - self.log.error(f'{" " * indent_size}{result.stderr.decode("utf-8")}') - if result.returncode != 0: - self.log.error(f'{" " * indent_size}Failed to remove pending symlinks {xml_file_path}. 
Return code: {result.returncode}') - except Exception as e: - self.log.error(f'{" " * indent_size}Error removing pending symlinks {xml_file_path}: {e}') + xml_file_pattern = f'{xml_dir}/{repo_id}_*_batch_{batch_num}.xml' + xml_files = glob.glob(xml_file_pattern) + + for xml_file_path in xml_files: + try: + os.remove(xml_file_path) + self.log.info(f'{" " * indent_size}Removed pending symlink {xml_file_path}') + except FileNotFoundError: + self.log.warning(f'{" " * indent_size}File not found: {xml_file_path}') + except Exception as e: + self.log.error(f'{" " * indent_size}Error removing pending symlink {xml_file_path}: {e}') # Tasks for processing PDFs results_4 = [pool.apply_async( @@ -573,7 +573,15 @@ def index_collections(self, repo_id, xml_file_path, indent_size=0): # Treat a string extra config as a path and pass it with -c cmd.extend(['-c', self.traject_extra_config]) - cmd.append(xml_file_path) + # Expand wildcards with glob + xml_files = glob.glob(xml_file_path) + + if not xml_files: + self.log.warning(f'{indent}No files found matching pattern: {xml_file_path}') + return + + # Add all matching files to the command + cmd.extend(xml_files) env = os.environ.copy() env['REPOSITORY_ID'] = str(repo_id) From eefdc7276b66fc1a6fa626dd447cece11557d617 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 20 Feb 2026 02:18:00 +0000 Subject: [PATCH 34/44] Draft: Add timeout and better error handling for PDF jobs Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com> --- arcflow/main.py | 93 +++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 83 insertions(+), 10 deletions(-) diff --git a/arcflow/main.py b/arcflow/main.py index cb4de38..4b4f396 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -347,19 +347,69 @@ def task_repository(self, repo, xml_dir, modified_since, indent_size=0): return (repo, resources) - def task_pdf(self, repo_uri, job_id, ead_id, pdf_dir, indent_size=0): 
+ def task_pdf(self, repo_uri, job_id, ead_id, pdf_dir, indent_size=0, timeout=300): + """ + Wait for an ArchivesSpace PDF generation job to complete and save the result. + + Args: + repo_uri: Repository URI + job_id: Job ID in ArchivesSpace + ead_id: EAD identifier for the resource + pdf_dir: Directory to save PDF files + indent_size: Indentation level for logging + timeout: Maximum seconds to wait for job completion (default: 300 = 5 minutes) + + Returns: + True if successful, False if timeout or job failed + """ indent = ' ' * indent_size + start_time = time.time() + poll_interval = 5 # seconds between status checks + warning_threshold = 60 # warn if queued for more than 1 minute + last_warning_time = 0 + poll_count = 0 + while True: - job_status = self.client.get( - f'{repo_uri}/jobs/{job_id}').json()['status'] + elapsed_time = time.time() - start_time + + # Check for timeout + if elapsed_time > timeout: + self.log.error( + f'{indent}Timeout waiting for ArchivesSpace {self.job_type}_{job_id} ' + f'after {int(elapsed_time)} seconds. Job may be stuck in queued status.\n' + f'{indent}TROUBLESHOOTING: Check if ArchivesSpace background job processor is running:\n' + f'{indent} - Run: ps aux | grep archivesspace | grep background\n' + f'{indent} - Start it: ./archivesspace.sh start-background-job-runner\n' + f'{indent} - Check ArchivesSpace logs for errors\n' + f'{indent}Continuing without PDF for "{ead_id}"...' 
+ ) + # Create empty PDF file and continue + self.save_file( + f'{pdf_dir}/{ead_id}.pdf', + b'', + 'PDF (empty - job timed out)', + indent_size=indent_size) + return False + + try: + job_status = self.client.get( + f'{repo_uri}/jobs/{job_id}').json()['status'] + except Exception as e: + self.log.error(f'{indent}Error checking job status for {self.job_type}_{job_id}: {e}') + time.sleep(poll_interval) + continue if job_status in ('completed', 'canceled', 'failed'): if job_status == 'completed': - file_id = self.client.get( - f'{repo_uri}/jobs/{job_id}/output_files').json()[0] + try: + file_id = self.client.get( + f'{repo_uri}/jobs/{job_id}/output_files').json()[0] - pdf = self.client.get( - f'{repo_uri}/jobs/{job_id}/output_files/{file_id}') + pdf = self.client.get( + f'{repo_uri}/jobs/{job_id}/output_files/{file_id}') + except Exception as e: + self.log.error(f'{indent}Error retrieving PDF output for {self.job_type}_{job_id}: {e}') + pdf = None elif job_status in ('canceled', 'failed'): self.log.error(f'{indent}ArchivesSpace {self.job_type}_{job_id} {job_status}.') pdf = None @@ -387,11 +437,34 @@ def task_pdf(self, repo_uri, job_id, ead_id, pdf_dir, indent_size=0): 'PDF', indent_size=indent_size) - self.log.info(f'Finished processing "{ead_id}".') + self.log.info(f'{indent}Finished processing "{ead_id}" (status: {job_status}).') return True - self.log.info(f'{indent}Waiting for ArchivesSpace {self.job_type}_{job_id} to complete... (current status: {job_status})') - time.sleep(5) + # Enhanced logging for queued jobs + poll_count += 1 + if job_status == 'queued': + # Show warning if job has been queued for too long + if elapsed_time > warning_threshold and (elapsed_time - last_warning_time) > warning_threshold: + self.log.warning( + f'{indent}Job {self.job_type}_{job_id} has been queued for {int(elapsed_time)} seconds. ' + f'This may indicate the ArchivesSpace background job processor is not running.' 
+ ) + last_warning_time = elapsed_time + + # Only log every 4th poll (every 20 seconds) to reduce log spam + if poll_count % 4 == 0: + self.log.info( + f'{indent}Waiting for ArchivesSpace {self.job_type}_{job_id} ' + f'(status: {job_status}, elapsed: {int(elapsed_time)}s, timeout in: {int(timeout - elapsed_time)}s)' + ) + else: + # For non-queued statuses (running, etc.), log every time + self.log.info( + f'{indent}Waiting for ArchivesSpace {self.job_type}_{job_id} ' + f'(status: {job_status}, elapsed: {int(elapsed_time)}s)' + ) + + time.sleep(poll_interval) def update_eads(self): From 579c59498b6c506a19fbb6edc30517f4dd649360 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 20 Feb 2026 02:26:07 +0000 Subject: [PATCH 35/44] Fix: Correct ArchivesSpace background job processor instructions Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com> --- arcflow/main.py | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/arcflow/main.py b/arcflow/main.py index 4b4f396..7dded0b 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -377,10 +377,11 @@ def task_pdf(self, repo_uri, job_id, ead_id, pdf_dir, indent_size=0, timeout=300 self.log.error( f'{indent}Timeout waiting for ArchivesSpace {self.job_type}_{job_id} ' f'after {int(elapsed_time)} seconds. Job may be stuck in queued status.\n' - f'{indent}TROUBLESHOOTING: Check if ArchivesSpace background job processor is running:\n' - f'{indent} - Run: ps aux | grep archivesspace | grep background\n' - f'{indent} - Start it: ./archivesspace.sh start-background-job-runner\n' - f'{indent} - Check ArchivesSpace logs for errors\n' + f'{indent}TROUBLESHOOTING: Background job processing may not be enabled in ArchivesSpace:\n' + f'{indent} 1. Check config/config.rb: AppConfig[:job_thread_count] must be > 0 (default is 2)\n' + f'{indent} 2. If you changed the config, restart ArchivesSpace: ./archivesspace.sh restart\n' + f'{indent} 3. 
Verify ArchivesSpace is processing jobs in the web UI: System → Background Jobs\n' + f'{indent} 4. Check ArchivesSpace logs for errors: logs/archivesspace.out\n' f'{indent}Continuing without PDF for "{ead_id}"...' ) # Create empty PDF file and continue From d5e0974b213ea20f50581f10f690e96ee130da4c Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Fri, 20 Feb 2026 02:31:43 +0000 Subject: [PATCH 36/44] Add two-phase timeout for large PDF handling Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com> --- arcflow/main.py | 84 +++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 67 insertions(+), 17 deletions(-) diff --git a/arcflow/main.py b/arcflow/main.py index 7dded0b..4f8038d 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -41,7 +41,7 @@ class ArcFlow: """ - def __init__(self, arclight_dir, aspace_dir, solr_url, traject_extra_config='', force_update=False, agents_only=False, collections_only=False, arcuit_dir=None, skip_creator_indexing=False): + def __init__(self, arclight_dir, aspace_dir, solr_url, traject_extra_config='', force_update=False, agents_only=False, collections_only=False, arcuit_dir=None, skip_creator_indexing=False, pdf_timeout_queued=300, pdf_timeout_running=1800): self.solr_url = solr_url self.batch_size = 1000 clean_extra_config = traject_extra_config.strip() @@ -54,6 +54,8 @@ def __init__(self, arclight_dir, aspace_dir, solr_url, traject_extra_config='', self.collections_only = collections_only self.arcuit_dir = arcuit_dir self.skip_creator_indexing = skip_creator_indexing + self.pdf_timeout_queued = pdf_timeout_queued # Timeout for jobs stuck in "queued" status + self.pdf_timeout_running = pdf_timeout_running # Timeout for jobs actively "running" self.log = logging.getLogger('arcflow') self.pid = os.getpid() self.pid_file_path = os.path.join(base_dir, 'arcflow.pid') @@ -347,17 +349,20 @@ def task_repository(self, repo, xml_dir, modified_since, 
indent_size=0): return (repo, resources) - def task_pdf(self, repo_uri, job_id, ead_id, pdf_dir, indent_size=0, timeout=300): + def task_pdf(self, repo_uri, job_id, ead_id, pdf_dir, indent_size=0): """ Wait for an ArchivesSpace PDF generation job to complete and save the result. + Uses a two-phase timeout approach: + - Phase 1 (queued): Short timeout to detect if background processor isn't running + - Phase 2 (running): Longer timeout to allow large PDFs to complete + Args: repo_uri: Repository URI job_id: Job ID in ArchivesSpace ead_id: EAD identifier for the resource pdf_dir: Directory to save PDF files indent_size: Indentation level for logging - timeout: Maximum seconds to wait for job completion (default: 300 = 5 minutes) Returns: True if successful, False if timeout or job failed @@ -369,21 +374,38 @@ def task_pdf(self, repo_uri, job_id, ead_id, pdf_dir, indent_size=0, timeout=300 last_warning_time = 0 poll_count = 0 + # Track status transitions for smart timeout + status_start_times = {} # Track when each status began + current_timeout = self.pdf_timeout_queued # Start with queued timeout + while True: elapsed_time = time.time() - start_time - # Check for timeout - if elapsed_time > timeout: - self.log.error( - f'{indent}Timeout waiting for ArchivesSpace {self.job_type}_{job_id} ' - f'after {int(elapsed_time)} seconds. Job may be stuck in queued status.\n' - f'{indent}TROUBLESHOOTING: Background job processing may not be enabled in ArchivesSpace:\n' - f'{indent} 1. Check config/config.rb: AppConfig[:job_thread_count] must be > 0 (default is 2)\n' - f'{indent} 2. If you changed the config, restart ArchivesSpace: ./archivesspace.sh restart\n' - f'{indent} 3. Verify ArchivesSpace is processing jobs in the web UI: System → Background Jobs\n' - f'{indent} 4. Check ArchivesSpace logs for errors: logs/archivesspace.out\n' - f'{indent}Continuing without PDF for "{ead_id}"...' 
- ) + # Check for timeout (with appropriate timeout for current status) + if elapsed_time > current_timeout: + # Determine which phase timed out for appropriate error message + last_status = list(status_start_times.keys())[-1] if status_start_times else 'queued' + + if last_status == 'queued': + self.log.error( + f'{indent}Timeout waiting for ArchivesSpace {self.job_type}_{job_id} ' + f'after {int(elapsed_time)} seconds. Job stuck in queued status.\n' + f'{indent}TROUBLESHOOTING: Background job processing may not be enabled in ArchivesSpace:\n' + f'{indent} 1. Check config/config.rb: AppConfig[:job_thread_count] must be > 0 (default is 2)\n' + f'{indent} 2. If you changed the config, restart ArchivesSpace: ./archivesspace.sh restart\n' + f'{indent} 3. Verify ArchivesSpace is processing jobs in the web UI: System → Background Jobs\n' + f'{indent} 4. Check ArchivesSpace logs for errors: logs/archivesspace.out\n' + f'{indent}Continuing without PDF for "{ead_id}"...' + ) + else: + self.log.error( + f'{indent}Timeout waiting for ArchivesSpace {self.job_type}_{job_id} ' + f'after {int(elapsed_time)} seconds in "{last_status}" status.\n' + f'{indent}This may be a very large PDF taking longer than expected.\n' + f'{indent}Consider increasing timeout with: --pdf-timeout-running {int(current_timeout * 2)}\n' + f'{indent}Continuing without PDF for "{ead_id}"...' 
+ ) + # Create empty PDF file and continue self.save_file( f'{pdf_dir}/{ead_id}.pdf', @@ -399,6 +421,22 @@ def task_pdf(self, repo_uri, job_id, ead_id, pdf_dir, indent_size=0, timeout=300 self.log.error(f'{indent}Error checking job status for {self.job_type}_{job_id}: {e}') time.sleep(poll_interval) continue + + # Track status transitions and adjust timeout accordingly + if job_status not in status_start_times: + status_start_times[job_status] = time.time() + + # When job transitions from queued to running, switch to running timeout + if job_status == 'running' and 'queued' in status_start_times: + time_in_queued = status_start_times[job_status] - status_start_times['queued'] + # Reset timeout clock for running phase + start_time = time.time() + current_timeout = self.pdf_timeout_running + self.log.info( + f'{indent}Job {self.job_type}_{job_id} transitioned to "running" ' + f'after {int(time_in_queued)}s in queue. ' + f'Now allowing up to {int(self.pdf_timeout_running)}s for PDF generation...' 
+ ) if job_status in ('completed', 'canceled', 'failed'): if job_status == 'completed': @@ -456,7 +494,7 @@ def task_pdf(self, repo_uri, job_id, ead_id, pdf_dir, indent_size=0, timeout=300 if poll_count % 4 == 0: self.log.info( f'{indent}Waiting for ArchivesSpace {self.job_type}_{job_id} ' - f'(status: {job_status}, elapsed: {int(elapsed_time)}s, timeout in: {int(timeout - elapsed_time)}s)' + f'(status: {job_status}, elapsed: {int(elapsed_time)}s, timeout in: {int(current_timeout - elapsed_time)}s)' ) else: # For non-queued statuses (running, etc.), log every time @@ -1405,6 +1443,16 @@ def main(): '--skip-creator-indexing', action='store_true', help='Generate creator XML files but skip Solr indexing (for testing)',) + parser.add_argument( + '--pdf-timeout-queued', + type=int, + default=300, + help='Timeout in seconds for PDF jobs stuck in "queued" status (default: 300 = 5 minutes)',) + parser.add_argument( + '--pdf-timeout-running', + type=int, + default=1800, + help='Timeout in seconds for PDF jobs in "running" status (default: 1800 = 30 minutes)',) args = parser.parse_args() # Validate mutually exclusive flags @@ -1420,7 +1468,9 @@ def main(): agents_only=args.agents_only, collections_only=args.collections_only, arcuit_dir=args.arcuit_dir, - skip_creator_indexing=args.skip_creator_indexing) + skip_creator_indexing=args.skip_creator_indexing, + pdf_timeout_queued=args.pdf_timeout_queued, + pdf_timeout_running=args.pdf_timeout_running) arcflow.run() From 50ad7667732a19059d34e91f01d69f92b544e2de Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Fri, 20 Feb 2026 21:00:13 -0500 Subject: [PATCH 37/44] feat: Optimize agent filtering with ArchivesSpace Solr MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Replace per-agent API calls with single Solr query for better performance: - Query ArchivesSpace Solr to filter agents in bulk - Exclude system users (publish=false) - Exclude donors (linked_agent_role includes "dnr") - Exclude 
software agents (agent_type="agent_software") - Use consistent EAC namespace prefixes in XPath queries - Refactor dates extraction for improved readability Performance improvement: O(n) API calls → O(1) Solr query Reduces processing time from minutes to seconds for large repositories. Update README to reflect the required command line arguments. Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- README.md | 63 ++- arcflow/main.py | 407 ++++++++++++------ ...pf.rb => example_traject_config_eac_cpf.rb | 7 +- 3 files changed, 334 insertions(+), 143 deletions(-) rename traject_config_eac_cpf.rb => example_traject_config_eac_cpf.rb (97%) diff --git a/README.md b/README.md index 6c570bf..2a42f18 100644 --- a/README.md +++ b/README.md @@ -14,14 +14,12 @@ pip install -r requirements.txt cp .archivessnake.yml.example .archivessnake.yml nano .archivessnake.yml # Add your ArchivesSpace credentials -# 3. Set environment variables -export ARCLIGHT_DIR=/path/to/your/arclight-app -export ASPACE_DIR=/path/to/your/archivesspace -export SOLR_URL=http://localhost:8983/solr/blacklight-core - -# 4. Run arcflow -python -m arcflow.main - +# 3. 
Run arcflow +python -m arcflow.main \ + --arclight-dir /path/to/your/arclight-app \ + --aspace-dir /path/to/your/archivesspace \ + --solr-url http://localhost:8983/solr/blacklight-core \ + --aspace-solr-url http://localhost:8983/solr/archivesspace ``` --- @@ -42,14 +40,32 @@ ArcFlow now generates standalone creator documents in addition to collection rec - Link to all collections where the creator is listed - Can be searched and displayed independently in ArcLight - Are marked with `is_creator: true` to distinguish from collections -- Must be fed into a Solr instance with fields to match their specific facets (See:Configure Solr Schema below ) +- Must be fed into a Solr instance with fields to match their specific facets (See: Configure Solr Schema below) + +### Agent Filtering + +**ArcFlow automatically filters agents to include only legitimate creators** of archival materials. The following agent types are **excluded** from indexing: + +- ✗ **System users** - ArchivesSpace software users (identified by `is_user` field) +- ✗ **System-generated agents** - Auto-created for users (identified by `system_generated` field) +- ✗ **Software agents** - Excluded by not querying the `/agents/software` endpoint +- ✗ **Repository agents** - Corporate entities representing the repository itself (identified by `is_repo_agent` field) +- ✗ **Donor-only agents** - Agents with only the 'donor' role and no creator role + +**Agents are included if they meet any of these criteria:** + +- ✓ Have the **'creator' role** in linked_agent_roles +- ✓ Are **linked to published records** (and not excluded by filters above) + +This filtering ensures that only legitimate archival creators are discoverable in ArcLight, while protecting privacy and security by excluding system users and donors. ### How Creator Records Work 1. **Extraction**: `get_all_agents()` fetches all agents from ArchivesSpace -2. 
**Processing**: `task_agent()` generates an EAC-CPF XML document for each agent with bioghist notes -3. **Linking**: Handled via Solr using the persistent_id field (agents and collections linked through bioghist references) -4. **Indexing**: Creator XML files are indexed to Solr using `traject_config_eac_cpf.rb` +2. **Filtering**: `is_target_agent()` filters out system users, donors, and non-creator agents +3. **Processing**: `task_agent()` generates an EAC-CPF XML document for each target agent with bioghist notes +4. **Linking**: Handled via Solr using the persistent_id field (agents and collections linked through bioghist references) +5. **Indexing**: Creator XML files are indexed to Solr using `traject_config_eac_cpf.rb` ### Creator Document Format @@ -119,7 +135,22 @@ This is a **one-time setup** per Solr instance. --- -To index creator documents to Solr: +### Traject Configuration for Creator Indexing + +The `traject_config_eac_cpf.rb` file defines how EAC-CPF creator records are mapped to Solr fields. + +**Search Order**: arcflow searches for the traject config following the collection records pattern: +1. **arcuit_dir parameter** (if provided via `--arcuit-dir`) - Highest priority, most up-to-date user control +2. **arcuit gem** (via `bundle show arcuit`) - For backward compatibility when arcuit_dir not provided +3. **example_traject_config_eac_cpf.rb** in arcflow - Fallback for module usage without arcuit + +**Example File**: arcflow includes `example_traject_config_eac_cpf.rb` as a reference implementation. For production: +- Copy this file to your arcuit gem as `traject_config_eac_cpf.rb`, or +- Specify the location with `--arcuit-dir /path/to/arcuit` + +**Logging**: arcflow clearly logs which traject config file is being used when creator indexing runs. 
+ +To index creator documents to Solr manually: ```bash bundle exec traject \ @@ -166,7 +197,9 @@ Optional arguments: python -m arcflow.main \ --arclight-dir /path/to/arclight \ --aspace-dir /path/to/archivesspace \ - --solr-url http://localhost:8983/solr/blacklight-core + --solr-url http://localhost:8983/solr/blacklight-core \ + --aspace-solr-url http://localhost:8983/solr/archivesspace + ``` **Process only agents (skip collections):** @@ -175,6 +208,7 @@ python -m arcflow.main \ --arclight-dir /path/to/arclight \ --aspace-dir /path/to/archivesspace \ --solr-url http://localhost:8983/solr/blacklight-core \ + --aspace-solr-url http://localhost:8983/solr/archivesspace \ --agents-only ``` @@ -184,6 +218,7 @@ python -m arcflow.main \ --arclight-dir /path/to/arclight \ --aspace-dir /path/to/archivesspace \ --solr-url http://localhost:8983/solr/blacklight-core \ + --aspace-solr-url http://localhost:8983/solr/archivesspace \ --force-update ``` diff --git a/arcflow/main.py b/arcflow/main.py index 292d049..bf9375b 100644 --- a/arcflow/main.py +++ b/arcflow/main.py @@ -16,7 +16,7 @@ from datetime import datetime, timezone from asnake.client import ASnakeClient from multiprocessing.pool import ThreadPool as Pool -from .utils.stage_classifications import extract_labels +from utils.stage_classifications import extract_labels base_dir = os.path.abspath((__file__) + "/../../") @@ -40,8 +40,9 @@ class ArcFlow: """ - def __init__(self, arclight_dir, aspace_dir, solr_url, traject_extra_config='', force_update=False, agents_only=False, collections_only=False, arcuit_dir=None, skip_creator_indexing=False): + def __init__(self, arclight_dir, aspace_dir, solr_url, aspace_solr_url, traject_extra_config='', force_update=False, agents_only=False, collections_only=False, arcuit_dir=None, skip_creator_indexing=False): self.solr_url = solr_url + self.aspace_solr_url = aspace_solr_url self.batch_size = 1000 clean_extra_config = traject_extra_config.strip() self.traject_extra_config = 
clean_extra_config or None @@ -217,6 +218,9 @@ def task_resource(self, repo, resource_id, xml_dir, pdf_dir, indent_size=0): 'resolve': ['classifications', 'classification_terms', 'linked_agents'], }).json() + if "ead_id" not in resource: + self.log.error(f'{indent}Resource {resource_id} is missing an ead_id. Skipping.') + return pdf_job xml_file_path = f'{xml_dir}/{resource["ead_id"]}.xml' # replace dots with dashes in EAD ID to avoid issues with Solr @@ -399,6 +403,7 @@ def update_eads(self): ArchivesSpace. """ xml_dir = f'{self.arclight_dir}/public/xml' + resource_dir = f'{xml_dir}/resources' pdf_dir = f'{self.arclight_dir}/public/pdf' modified_since = int(self.last_updated.timestamp()) @@ -412,7 +417,7 @@ def update_eads(self): json={'delete': {'query': '*:*'}}, ) if response.status_code == 200: - self.log.info('Deleted all EADs from ArcLight Solr.') + self.log.info('Deleted all EADs and Creators from ArcLight Solr.') # delete related directories after suscessful # deletion from solr for dir_path, dir_name in [(xml_dir, 'XMLs'), (pdf_dir, 'PDFs')]: @@ -424,10 +429,10 @@ def update_eads(self): else: self.log.error(f'Failed to delete all EADs from Arclight Solr. 
Status code: {response.status_code}') except requests.exceptions.RequestException as e: - self.log.error(f'Error deleting all EADs from ArcLight Solr: {e}') + self.log.error(f'Error deleting all EADs and Creators from ArcLight Solr: {e}') # create directories if don't exist - for dir_path in (xml_dir, pdf_dir): + for dir_path in (resource_dir, pdf_dir): os.makedirs(dir_path, exist_ok=True) # process resources that have been modified in ArchivesSpace since last update @@ -440,7 +445,7 @@ def update_eads(self): # Tasks for processing repositories results_1 = [pool.apply_async( self.task_repository, - args=(repo, xml_dir, modified_since, indent_size)) + args=(repo, resource_dir, modified_since, indent_size)) for repo in repos] # Collect outputs from repository tasks outputs_1 = [r.get() for r in results_1] @@ -448,7 +453,7 @@ def update_eads(self): # Tasks for processing resources results_2 = [pool.apply_async( self.task_resource, - args=(repo, resource_id, xml_dir, pdf_dir, indent_size)) + args=(repo, resource_id, resource_dir, pdf_dir, indent_size)) for repo, resources in outputs_1 for resource_id in resources] # Collect outputs from resource tasks outputs_2 = [r.get() for r in results_2] @@ -463,7 +468,7 @@ def update_eads(self): # Tasks for indexing pending resources results_3 = [pool.apply_async( self.index_collections, - args=(repo_id, f'{xml_dir}/{repo_id}_*_batch_{batch_num}.xml', indent_size)) + args=(repo_id, f'{resource_dir}/{repo_id}_*_batch_{batch_num}.xml', indent_size)) for repo_id, batch_num in batches] # Wait for indexing tasks to complete @@ -472,7 +477,7 @@ def update_eads(self): # Remove pending symlinks after indexing for repo_id, batch_num in batches: - xml_file_path = f'{xml_dir}/{repo_id}_*_batch_{batch_num}.xml' + xml_file_path = f'{resource_dir}/{repo_id}_*_batch_{batch_num}.xml' try: result = subprocess.run( f'rm {xml_file_path}', @@ -495,14 +500,23 @@ def update_eads(self): for r in results_4: r.get() - # processing deleted resources is not 
needed when - # force-update is set or modified_since is set to 0 - if self.force_update or modified_since <= 0: - self.log.info('Skipping deleted resources processing.') - return + return + + + + def process_deleted_records(self): + + xml_dir = f'{self.arclight_dir}/public/xml' + resource_dir = f'{xml_dir}/resources' + agent_dir = f'{xml_dir}/agents' + pdf_dir = f'{self.arclight_dir}/public/pdf' + modified_since = int(self.last_updated.timestamp()) + + # process records that have been deleted since last update in ArchivesSpace + resource_pattern = r'^/repositories/(?P\d+)/resources/(?P\d+)$' + agent_pattern = r'^/agents/(?Ppeople|corporate_entities|families)/(?P\d+)$' + - # process resources that have been deleted since last update in ArchivesSpace - pattern = r'^/repositories/(?P\d+)/resources/(?P\d+)$' page = 1 while True: deleted_records = self.client.get( @@ -513,12 +527,13 @@ def update_eads(self): } ).json() for record in deleted_records['results']: - match = re.match(pattern, record) - if match: - resource_id = match.group('resource_id') + resource_match = re.match(resource_pattern, record) + agent_match = re.match(agent_pattern, record) + if resource_match and not self.agents_only: + resource_id = resource_match.group('resource_id') self.log.info(f'{" " * indent_size}Processing deleted resource ID {resource_id}...') - symlink_path = f'{xml_dir}/{resource_id}.xml' + symlink_path = f'{resource_dir}/{resource_id}.xml' ead_id = self.get_ead_from_symlink(symlink_path) if ead_id: self.delete_ead( @@ -530,6 +545,14 @@ def update_eads(self): else: self.log.error(f'{" " * (indent_size+2)}Symlink {symlink_path} not found. 
Unable to delete the associated EAD from Arclight Solr.') + if agent_match and not self.collections_only: + agent_id = agent_match.group('agent_id') + self.log.info(f'{" " * indent_size}Processing deleted agent ID {agent_id}...') + file_path = f'{agent_dir}/{agent_id}.xml' + agent_solr_id = f'creator_{agent_type}_{agent_id}' + self.delete_creator(file_path, agent_solr_id, indent_size) + + if deleted_records['last_page'] == page: break page += 1 @@ -577,9 +600,9 @@ def index_collections(self, repo_id, xml_file_path, indent_size=0): env = os.environ.copy() env['REPOSITORY_ID'] = str(repo_id) - + cmd_string = ' '.join(cmd) result = subprocess.run( - cmd, + cmd_string, cwd=self.arclight_dir, env=env, stderr=subprocess.PIPE, @@ -673,63 +696,142 @@ def get_creator_bioghist(self, resource, indent_size=0): return '\n'.join(bioghist_elements) return None - - def get_all_agents(self, agent_types=None, modified_since=0, indent_size=0): + def _get_target_agent_criteria(self, modified_since=0): """ - Fetch ALL agents from ArchivesSpace (not just creators). - Uses direct agent API endpoints for comprehensive coverage. - + Defines the Solr query criteria for "target" agents. + These are agents we want to process. 
+ """ + # Basic filters for agents to include + criteria = [ + "system_generated:false", + "is_user:false", + "is_repo_agent:false", + # Include agents that are creators OR are linked to published records + "(linked_agent_roles:creator OR is_linked_to_published_record:true)", + # Exclude agents whose ONLY role is 'donor' + # This logic says: "NOT (role is only donor)" + "(*:* -linked_agent_roles:donor OR (*:* AND linked_agent_roles:[* TO *] AND (*:* -linked_agent_roles:donor)))" + ] + + # Add time filter if applicable + if modified_since > 0 and not self.force_update: + mtime_utc = datetime.fromtimestamp(modified_since, tz=timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ') + criteria.append(f"system_mtime:[{mtime_utc} TO *]") + + return criteria + + def _get_nontarget_agent_criteria(self, modified_since=0): + """ + Defines the Solr query criteria for "non-target" (excluded) agents. + This is the logical inverse of the target criteria. + """ + # The core logic for what makes an agent a "target" + target_logic = " AND ".join([ + "system_generated:false", + "is_user:false", + "is_repo_agent:false", + "(linked_agent_roles:creator OR is_linked_to_published_record:true)", + "(*:* -linked_agent_roles:donor OR (*:* AND linked_agent_roles:[* TO *] AND (*:* -linked_agent_roles:donor)))" + ]) + + # We find non-targets by negating the entire block of target logic + criteria = [f"NOT ({target_logic})"] + + # We still apply the time filter to the overall query + if modified_since > 0 and not self.force_update: + mtime_utc = datetime.fromtimestamp(modified_since, tz=timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ') + criteria.append(f"system_mtime:[{mtime_utc} TO *]") + + return criteria + + def _execute_solr_query(self, query_parts, solr_url=None, fields=['id'], indent_size=0): + """ + A generic function to execute a query against the Solr index. + Args: - agent_types: List of agent types to fetch. 
Default: ['corporate_entities', 'people', 'families'] - modified_since: Unix timestamp to filter agents modified since this time (if API supports it) - indent_size: Indentation size for logging - + query_parts (list): A list of strings that will be joined with " AND ". + fields (list): A list of Solr fields to return in the response. + Returns: - set: Set of agent URIs (e.g., '/agents/corporate_entities/123') + list: A list of dictionaries, where each dictionary contains the requested fields. + Returns an empty list on failure. + """ + indent = ' ' * indent_size + if not query_parts: + self.log.error("Cannot execute Solr query with empty criteria.") + return [] + + if not solr_url: + solr_url = self.solr_url + + query_string = " AND ".join(query_parts) + self.log.info(f"{indent}Executing Solr query: {query_string}") + + try: + # First, get the total count of matching documents + count_params = {'q': query_string, 'rows': 0, 'wt': 'json'} + count_response = requests.get(f'{solr_url}/select', params=count_params) + self.log.info(f" [Solr Count Request]: {count_response.request.url}") + + count_response.raise_for_status() + num_found = count_response.json()['response']['numFound'] + + if num_found == 0: + return [] # No need to query again if nothing was found + + # Now, fetch the actual data for the documents + data_params = { + 'q': query_string, + 'rows': num_found, # Use the exact count to fetch all results + 'fl': ','.join(fields), # Join field list into a comma-separated string + 'wt': 'json' + } + response = requests.get(f'{solr_url}/select', params=data_params) + response.raise_for_status() + # Log the exact URL for the data request + self.log.info(f" [Solr Data Request]: {response.request.url}") + + return response.json()['response']['docs'] + + except requests.exceptions.RequestException as e: + self.log.error(f"Failed to execute Solr query: {e}") + self.log.error(f" Failed query string: {query_string}") + return [] + + def get_all_agents(self, 
agent_types=None, modified_since=0, indent_size=0): + """ + Fetch target agent URIs from the Solr index and log non-target agents. """ if agent_types is None: - agent_types = ['corporate_entities', 'people', 'families'] - + agent_types = ['agent_person', 'agent_corporate_entity', 'agent_family'] + + if self.force_update: + modified_since = 0 indent = ' ' * indent_size - all_agents = set() - - self.log.info(f'{indent}Fetching ALL agents from ArchivesSpace...') - - for agent_type in agent_types: - try: - # Try with modified_since parameter first - params = {'all_ids': True} - if modified_since > 0: - params['modified_since'] = modified_since - - response = self.client.get(f'/agents/{agent_type}', params=params) - agent_ids = response.json() - - self.log.info(f'{indent}Found {len(agent_ids)} {agent_type} agents') - - # Add agent URIs to set - for agent_id in agent_ids: - agent_uri = f'/agents/{agent_type}/{agent_id}' - all_agents.add(agent_uri) - - except Exception as e: - self.log.error(f'{indent}Error fetching {agent_type} agents: {e}') - # If modified_since fails, try without it - if modified_since > 0: - self.log.warning(f'{indent}Retrying {agent_type} without modified_since filter...') - try: - response = self.client.get(f'/agents/{agent_type}', params={'all_ids': True}) - agent_ids = response.json() - self.log.info(f'{indent}Found {len(agent_ids)} {agent_type} agents (no date filter)') - for agent_id in agent_ids: - agent_uri = f'/agents/{agent_type}/{agent_id}' - all_agents.add(agent_uri) - except Exception as e2: - self.log.error(f'{indent}Failed to fetch {agent_type} agents: {e2}') - - self.log.info(f'{indent}Found {len(all_agents)} total agents across all types.') - return all_agents + self.log.info(f'{indent}Fetching agent data from Solr...') + + # Base criteria for all queries in this function + base_criteria = [f"primary_type:({' OR '.join(agent_types)})"] + + # Get and log the non-target agents + nontarget_criteria = base_criteria + 
self._get_nontarget_agent_criteria(modified_since) + excluded_docs = self._execute_solr_query(nontarget_criteria,self.aspace_solr_url, fields=['id']) + if excluded_docs: + excluded_ids = [doc['id'] for doc in excluded_docs] + self.log.info(f"{indent} Found {len(excluded_ids)} non-target (excluded) agents.") + # Optional: Log the actual IDs if the list isn't too long + # for agent_id in excluded_ids: + # self.log.debug(f"{indent} - Excluded: {agent_id}") + + # Get and return the target agents + target_criteria = base_criteria + self._get_target_agent_criteria(modified_since) + self.log.info('Target Criteria:') + target_docs = self._execute_solr_query(target_criteria, self.aspace_solr_url, fields=['id']) + + target_agents = [doc['id'] for doc in target_docs] + self.log.info(f"{indent} Found {len(target_agents)} target agents to process.") + return target_agents def task_agent(self, agent_uri, agents_dir, repo_id=1, indent_size=0): """ @@ -844,14 +946,15 @@ def process_creators(self): self.log.info(f'{indent}Indexing {len(creator_ids)} creator records to Solr...') traject_config = self.find_traject_config() if traject_config: + self.log.info(f'{indent}Using traject config: {traject_config}') indexed = self.index_creators(agents_dir, creator_ids) self.log.info(f'{indent}Creator indexing complete: {indexed}/{len(creator_ids)} indexed') else: - self.log.info(f'{indent}Skipping creator indexing (traject config not found)') + self.log.warning(f'{indent}Skipping creator indexing (traject config not found)') self.log.info(f'{indent}To index manually:') self.log.info(f'{indent} cd {self.arclight_dir}') self.log.info(f'{indent} bundle exec traject -u {self.solr_url} -i xml \\') - self.log.info(f'{indent} -c /path/to/arcuit/arcflow/traject_config_eac_cpf.rb \\') + self.log.info(f'{indent} -c /path/to/arcuit-gem/traject_config_eac_cpf.rb \\') self.log.info(f'{indent} {agents_dir}/*.xml') elif self.skip_creator_indexing: self.log.info(f'{indent}Skipping creator indexing 
(--skip-creator-indexing flag set)') @@ -863,15 +966,32 @@ def find_traject_config(self): """ Find the traject config for creator indexing. - Tries: - 1. bundle show arcuit (finds installed gem) - 2. self.arcuit_dir (explicit path) - 3. Returns None if neither works + Search order (follows collection records pattern): + 1. arcuit_dir if provided (most up-to-date user control) + 2. arcuit gem via bundle show (for backward compatibility) + 3. example_traject_config_eac_cpf.rb in arcflow (fallback when used as module without arcuit) Returns: str: Path to traject config, or None if not found """ - # Try bundle show arcuit first + self.log.info('Searching for traject_config_eac_cpf.rb...') + searched_paths = [] + + # Try 1: arcuit_dir if provided (highest priority - user's explicit choice) + if self.arcuit_dir: + self.log.debug(f' Checking arcuit_dir parameter: {self.arcuit_dir}') + candidate_paths = [ + os.path.join(self.arcuit_dir, 'traject_config_eac_cpf.rb'), + os.path.join(self.arcuit_dir, 'lib', 'arcuit', 'traject', 'traject_config_eac_cpf.rb'), + ] + searched_paths.extend(candidate_paths) + for traject_config in candidate_paths: + if os.path.exists(traject_config): + self.log.info(f'✓ Using traject config from arcuit_dir: {traject_config}') + return traject_config + self.log.debug(' traject_config_eac_cpf.rb not found in arcuit_dir') + + # Try 2: bundle show arcuit (for backward compatibility when arcuit_dir not provided) try: result = subprocess.run( ['bundle', 'show', 'arcuit'], @@ -882,39 +1002,46 @@ def find_traject_config(self): ) if result.returncode == 0: arcuit_path = result.stdout.strip() - # Prefer config at gem root, fall back to legacy subdirectory layout + self.log.debug(f' Found arcuit gem at: {arcuit_path}') candidate_paths = [ os.path.join(arcuit_path, 'traject_config_eac_cpf.rb'), - os.path.join(arcuit_path, 'arcflow', 'traject_config_eac_cpf.rb'), + os.path.join(arcuit_path, 'lib', 'arcuit', 'traject', 'traject_config_eac_cpf.rb'), ] + 
searched_paths.extend(candidate_paths) for traject_config in candidate_paths: if os.path.exists(traject_config): - self.log.info(f'Found traject config via bundle show: {traject_config}') + self.log.info(f'✓ Using traject config from arcuit gem: {traject_config}') return traject_config - self.log.warning( - 'bundle show arcuit succeeded but traject_config_eac_cpf.rb ' - 'was not found in any expected location under the gem root' + self.log.debug( + ' traject_config_eac_cpf.rb not found in arcuit gem ' + '(checked root and lib/arcuit/traject/ subdirectory)' ) else: - self.log.debug('bundle show arcuit failed (gem not installed?)') + self.log.debug(' arcuit gem not found via bundle show') except Exception as e: - self.log.debug(f'Error running bundle show arcuit: {e}') - # Fall back to arcuit_dir if provided - if self.arcuit_dir: - candidate_paths = [ - os.path.join(self.arcuit_dir, 'traject_config_eac_cpf.rb'), - os.path.join(self.arcuit_dir, 'arcflow', 'traject_config_eac_cpf.rb'), - ] - for traject_config in candidate_paths: - if os.path.exists(traject_config): - self.log.info(f'Using traject config from arcuit_dir: {traject_config}') - return traject_config - self.log.warning( - 'arcuit_dir provided but traject_config_eac_cpf.rb was not found ' - 'in any expected location' + self.log.debug(f' Error checking for arcuit gem: {e}') + + # Try 3: example file in arcflow package (fallback for module usage without arcuit) + # We know exactly where this file is located - at the repo root + arcflow_package_dir = os.path.dirname(os.path.abspath(__file__)) + arcflow_repo_root = os.path.dirname(arcflow_package_dir) + traject_config = os.path.join(arcflow_repo_root, 'example_traject_config_eac_cpf.rb') + searched_paths.append(traject_config) + + if os.path.exists(traject_config): + self.log.info(f'✓ Using example traject config from arcflow: {traject_config}') + self.log.info( + ' Note: Using example config. 
For production, copy this file to your ' + 'arcuit gem or specify location with --arcuit-dir.' ) - # No config found - self.log.warning('Could not find traject config (bundle show arcuit failed and arcuit_dir not provided or invalid)') + return traject_config + + # No config found anywhere - show all paths searched + self.log.error('✗ Could not find traject_config_eac_cpf.rb in any of these locations:') + for i, path in enumerate(searched_paths, 1): + self.log.error(f' {i}. {path}') + self.log.error('') + self.log.error(' Add traject_config_eac_cpf.rb to your arcuit gem or specify with --arcuit-dir.') return None @@ -1068,37 +1195,51 @@ def create_symlink(self, target_path, symlink_path, indent_size=0): self.log.info(f'{indent}{e}') return False - - def delete_ead(self, resource_id, ead_id, - xml_file_path, pdf_file_path, indent_size=0): + def delete_arclight_solr_record(self, solr_record_id, indent_size=0): indent = ' ' * indent_size - # delete from solr + try: response = requests.post( f'{self.solr_url}/update?commit=true', - json={'delete': {'id': ead_id}}, + json={'delete': {'id': solr_record_id}}, ) if response.status_code == 200: - self.log.info(f'{indent}Deleted EAD "{ead_id}" from ArcLight Solr.') - # delete related files after suscessful deletion from solr - for file_path in (xml_file_path, pdf_file_path): - try: - os.remove(file_path) - self.log.info(f'{indent}Deleted file {file_path}.') - except FileNotFoundError: - self.log.error(f'{indent}File {file_path} not found.') - - # delete symlink if exists - symlink_path = f'{os.path.dirname(xml_file_path)}/{resource_id}.xml' - try: - os.remove(symlink_path) - self.log.info(f'{indent}Deleted symlink {symlink_path}.') - except FileNotFoundError: - self.log.info(f'{indent}Symlink {symlink_path} not found.') + self.log.info(f'{indent}Deleted Solr record {solr_record_id}. from ArcLight Solr') + return True else: - self.log.error(f'{indent}Failed to delete EAD "{ead_id}" from Arclight Solr. 
Status code: {response.status_code}') + self.log.error( + f'{indent}Failed to delete Solr record {solr_record_id} from Arclight Solr. Status code: {response.status_code}') + return False except requests.exceptions.RequestException as e: - self.log.error(f'{indent}Error deleting EAD "{ead_id}" from ArcLight Solr: {e}') + self.log.error(f'{indent}Error deleting Solr record {solr_record_id} from ArcLight Solr: {e}') + + def delete_file(self, file_path, indent_side=0): + indent = ' ' * indent_size + + try: + os.remove(file_path) + self.log.info(f'{indent}Deleted file {file_path}.') + except FileNotFoundError: + self.log.error(f'{indent}File {file_path} not found.') + + def delete_ead(self, resource_id, ead_id, + xml_file_path, pdf_file_path, indent_size=0): + indent = ' ' * indent_size + # delete from solr + deleted_solr_record = self.delete_arclight_solr_record(ead_id, indent_size=indent_size) + if deleted_solr_record: + self.delete_file(pdf_file_path, indent=indent) + self.delete_file(xml_file_path, indent=indent) + # delete symlink if exists + symlink_path = f'{os.path.dirname(xml_file_path)}/{resource_id}.xml' + self.delete_file(symlink_path, indent=indent) + + def delete_creator(self, file_path, solr_id, indent_size=0): + indent = ' ' * indent_size + deleted_solr_record = self.delete_arclight_solr_record(solr_id, indent_size=indent_size) + if deleted_solr_record: + self.delete_file(file_path, indent=indent) + def save_config_file(self): @@ -1132,7 +1273,14 @@ def run(self): # Update creator records (unless collections-only mode) if not self.collections_only: self.process_creators() - + + # processing deleted resources is not needed when + # force-update is set or modified_since is set to 0 + if self.force_update or int(self.last_updated.timestamp()) <= 0: + self.log.info('Skipping deleted record processing.') + else: + self.process_deleted_records() + self.save_config_file() self.log.info(f'ArcFlow process completed (PID: {self.pid}). 
Elapsed time: {time.strftime("%H:%M:%S", time.gmtime(int(time.time()) - self.start_time))}.') @@ -1156,7 +1304,11 @@ def main(): parser.add_argument( '--solr-url', required=True, - help='URL of the Solr core',) + help='URL of the ArcLight Solr core',) + parser.add_argument( + '--aspace-solr-url', + required=True, + help='URL of the ASpace Solr core',) parser.add_argument( '--traject-extra-config', default='', @@ -1187,6 +1339,7 @@ def main(): arclight_dir=args.arclight_dir, aspace_dir=args.aspace_dir, solr_url=args.solr_url, + aspace_solr_url=args.aspace_solr_url, traject_extra_config=args.traject_extra_config, force_update=args.force_update, agents_only=args.agents_only, diff --git a/traject_config_eac_cpf.rb b/example_traject_config_eac_cpf.rb similarity index 97% rename from traject_config_eac_cpf.rb rename to example_traject_config_eac_cpf.rb index 62c9a5a..6234ccd 100644 --- a/traject_config_eac_cpf.rb +++ b/example_traject_config_eac_cpf.rb @@ -4,7 +4,9 @@ # Persons, and Families) XML documents from ArchivesSpace archival_contexts endpoint. # # Usage: -# bundle exec traject -u $SOLR_URL -c traject_config_eac_cpf.rb /path/to/agents/*.xml +# bundle exec traject -u $SOLR_URL -c example_traject_config_eac_cpf.rb /path/to/agents/*.xml +# +# For production, copy this file to your arcuit gem as traject_config_eac_cpf.rb # # The EAC-CPF XML documents are retrieved directly from ArchivesSpace via: # /repositories/{repo_id}/archival_contexts/{agent_type}/{id}.xml @@ -188,7 +190,8 @@ # Extract HTML for searchable content (matches ArcLight's bioghist_html_tesm) bioghist = record.xpath('//eac:cpfDescription/eac:description/eac:biogHist//eac:p', EAC_NS) if bioghist.any? - html = bioghist.map { |p| "

<p>#{p.text}</p>" }.join("\n") + # Preserve inline EAC markup inside <p> by serializing child nodes + html = bioghist.map { |p| "<p>#{p.inner_html}</p>
" }.join("\n") accumulator << html end end From 7b9522a2e972b97112d4dbd821931e1204e23000 Mon Sep 17 00:00:00 2001 From: Copilot <198982749+Copilot@users.noreply.github.com> Date: Thu, 26 Feb 2026 11:39:16 -0500 Subject: [PATCH 38/44] fix: always use filename for id log record if filename is not expected pattern: creator_{type}_{id} --- example_traject_config_eac_cpf.rb | 83 ++++++------------------------- 1 file changed, 15 insertions(+), 68 deletions(-) diff --git a/example_traject_config_eac_cpf.rb b/example_traject_config_eac_cpf.rb index 6234ccd..be544e9 100644 --- a/example_traject_config_eac_cpf.rb +++ b/example_traject_config_eac_cpf.rb @@ -22,6 +22,9 @@ # EAC-CPF namespace - used consistently throughout this config EAC_NS = { 'eac' => 'urn:isbn:1-931666-33-4' } +# Pattern matching arcflow's creator file naming: creator_{entity_type}_{id} +CREATOR_ID_PATTERN = /^creator_(corporate_entities|people|families)_\d+$/ + settings do provide "solr.url", ENV['SOLR_URL'] || "http://localhost:8983/solr/blacklight-core" provide "solr_writer.commit_on_close", "true" @@ -38,77 +41,21 @@ context.clipboard[:is_creator] = true end -# Core identity field -# CRITICAL: The 'id' field is required by Solr's schema (uniqueKey) -# Must ensure this field is never empty or indexing will fail -# -# IMPORTANT: Real EAC-CPF from ArchivesSpace has empty element! -# Cannot rely on recordId being present. Must extract from filename or generate. +# Solr uniqueKey - extract ID from filename using arcflow's creator_{entity_type}_{id} pattern to_field 'id' do |record, accumulator, context| - # Try 1: Extract from control/recordId (if present) - record_id = record.xpath('//eac:control/eac:recordId', EAC_NS).first - record_id ||= record.xpath('//control/recordId').first - - if record_id && !record_id.text.strip.empty? 
- accumulator << record_id.text.strip - else - # Try 2: Extract from source filename (most reliable for ArchivesSpace exports) - # Filename format: creator_corporate_entities_584.xml or similar - source_file = context.source_record_id || context.input_name - if source_file - # Remove .xml extension and any path - id_from_filename = File.basename(source_file, '.xml') - # Check if it looks valid (starts with creator_ or agent_) - if id_from_filename =~ /^(creator_|agent_)/ - accumulator << id_from_filename - context.logger.info("Using filename-based ID: #{id_from_filename}") - else - # Try 3: Generate from entity type and name - entity_type = record.xpath('//eac:cpfDescription/eac:identity/eac:entityType', EAC_NS).first&.text&.strip - name_entry = record.xpath('//eac:cpfDescription/eac:identity/eac:nameEntry/eac:part', EAC_NS).first&.text&.strip - - if entity_type && name_entry - # Create stable ID from type and name - type_short = case entity_type - when 'corporateBody' then 'corporate' - when 'person' then 'person' - when 'family' then 'family' - else 'entity' - end - name_id = name_entry.gsub(/[^a-z0-9]/i, '_').downcase[0..50] # Limit length - generated_id = "creator_#{type_short}_#{name_id}" - accumulator << generated_id - context.logger.warn("Generated ID from name: #{generated_id}") - else - # Last resort: timestamp-based unique ID - fallback_id = "creator_unknown_#{Time.now.to_i}_#{rand(10000)}" - accumulator << fallback_id - context.logger.error("Using fallback ID: #{fallback_id}") - end - end + source_file = context.source_record_id || context.input_name + if source_file + id_from_filename = File.basename(source_file, '.xml') + if id_from_filename =~ CREATOR_ID_PATTERN + accumulator << id_from_filename + context.logger.info("Using filename-based ID: #{id_from_filename}") else - # No filename available, generate from name - entity_type = record.xpath('//eac:cpfDescription/eac:identity/eac:entityType', EAC_NS).first&.text&.strip - name_entry = 
record.xpath('//eac:cpfDescription/eac:identity/eac:nameEntry/eac:part', EAC_NS).first&.text&.strip - - if entity_type && name_entry - type_short = case entity_type - when 'corporateBody' then 'corporate' - when 'person' then 'person' - when 'family' then 'family' - else 'entity' - end - name_id = name_entry.gsub(/[^a-z0-9]/i, '_').downcase[0..50] - generated_id = "creator_#{type_short}_#{name_id}" - accumulator << generated_id - context.logger.warn("Generated ID from name: #{generated_id}") - else - # Absolute last resort - fallback_id = "creator_unknown_#{Time.now.to_i}_#{rand(10000)}" - accumulator << fallback_id - context.logger.error("Using fallback ID: #{fallback_id}") - end + context.logger.error("Filename doesn't match expected pattern 'creator_{type}_{id}': #{id_from_filename}") + context.skip!("Invalid ID format in filename") end + else + context.logger.error("No source filename available for record") + context.skip!("Missing source filename") end end From 635af2be2cf2339e8318e9ff28a12606b94c8f32 Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Mon, 2 Mar 2026 16:18:47 -0500 Subject: [PATCH 39/44] fix: reduce duplicate fields and make fields dynamic --- example_traject_config_eac_cpf.rb | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/example_traject_config_eac_cpf.rb b/example_traject_config_eac_cpf.rb index be544e9..be0d297 100644 --- a/example_traject_config_eac_cpf.rb +++ b/example_traject_config_eac_cpf.rb @@ -64,13 +64,13 @@ accumulator << 'true' end -# Record type -to_field 'record_type' do |record, accumulator| - accumulator << 'creator' -end +# # Record type +# to_field 'record_type' do |record, accumulator| +# accumulator << 'creator' +# end # Entity type (corporateBody, person, family) -to_field 'entity_type' do |record, accumulator| +to_field 'entity_type_ssi' do |record, accumulator| entity = record.xpath('//eac:cpfDescription/eac:identity/eac:entityType', EAC_NS).first accumulator << entity.text if 
entity end @@ -198,7 +198,7 @@ end # Agent source URI (from original ArchivesSpace) -to_field 'agent_uri' do |record, accumulator| +to_field 'agent_uri_ssi' do |record, accumulator| # Try to extract from control section or otherRecordId other_id = record.xpath('//eac:control/eac:otherRecordId[@localType="archivesspace_uri"]', EAC_NS).first if other_id @@ -211,10 +211,10 @@ accumulator << Time.now.utc.iso8601 end -# Document type marker -to_field 'document_type' do |record, accumulator| - accumulator << 'creator' -end +# # Document type marker +# to_field 'document_type' do |record, accumulator| +# accumulator << 'creator' +# end # Log successful indexing each_record do |record, context| From 24d86a6fc0002341084e0a1d71cda2f1dccdef41 Mon Sep 17 00:00:00 2001 From: Alex Dryden Date: Mon, 2 Mar 2026 17:30:45 -0500 Subject: [PATCH 40/44] feat: store related agent ids, uris, and relationsips in arrays --- example_traject_config_eac_cpf.rb | 74 ++++++++++++++++++++++++++----- 1 file changed, 62 insertions(+), 12 deletions(-) diff --git a/example_traject_config_eac_cpf.rb b/example_traject_config_eac_cpf.rb index be0d297..52d0050 100644 --- a/example_traject_config_eac_cpf.rb +++ b/example_traject_config_eac_cpf.rb @@ -25,6 +25,9 @@ # Pattern matching arcflow's creator file naming: creator_{entity_type}_{id} CREATOR_ID_PATTERN = /^creator_(corporate_entities|people|families)_\d+$/ +# Entity types - SINGLE SOURCE OF TRUTH +ENTITY_TYPES = ['corporate_entities', 'people', 'families'] + settings do provide "solr.url", ENV['SOLR_URL'] || "http://localhost:8983/solr/blacklight-core" provide "solr_writer.commit_on_close", "true" @@ -160,26 +163,25 @@ accumulator << bioghist.map(&:text).join(' ') if bioghist.any? 
end -# Related agents (from cpfRelation elements) -to_field 'related_agents_ssim' do |record, accumulator| +# Related agents (from cpfRelation elements) for display parsing and debugging, stored as a single line +# "https://archivesspace-stage.library.illinois.edu/agents/corporate_entities/57|associative" +to_field 'related_agents_debug_ssim' do |record, accumulator| relations = record.xpath('//eac:cpfDescription/eac:relations/eac:cpfRelation', EAC_NS) relations.each do |rel| - # Get the related entity href/identifier href = rel['href'] || rel['xlink:href'] relation_type = rel['cpfRelationType'] - + if href - # Store as: "uri|type" for easy parsing later - accumulator << "#{href}|#{relation_type}" - elsif relation_entry = rel.xpath('eac:relationEntry', EAC_NS).first - # If no href, at least store the name - name = relation_entry.text - accumulator << "#{name}|#{relation_type}" if name + solr_id = aspace_uri_to_solr_id(href) + if solr_id + # Format: "solr_id|type" + accumulator << "#{solr_id}|#{relation_type || 'unknown'}" + end end end end -# Related agents - just URIs (for simpler queries) +# Related agents - ASpace URIs, in parallel array to match ids and types to_field 'related_agent_uris_ssim' do |record, accumulator| relations = record.xpath('//eac:cpfDescription/eac:relations/eac:cpfRelation', EAC_NS) relations.each do |rel| @@ -188,7 +190,31 @@ end end -# Relationship types +# Related agents - Parallel array of relationship ids to match relationship types and uris +to_field 'related_agent_ids_ssim' do |record, accumulator| + relations = record.xpath('//eac:cpfDescription/eac:relations/eac:cpfRelation', EAC_NS) + relations.each do |rel| + href = rel['href'] || rel['xlink:href'] + if href + solr_id = aspace_uri_to_solr_id(href) # CONVERT URI TO ID + accumulator << solr_id if solr_id + end + end +end + +# Related Agents - Parallel array of relationship types to match relationship ids and uris +to_field 'related_agent_relationship_types_ssim' do |record, 
accumulator|
+  relations = record.xpath('//eac:cpfDescription/eac:relations/eac:cpfRelation', EAC_NS)
+  relations.each do |rel|
+    href = rel['href'] || rel['xlink:href']
+    if href
+      relation_type = rel['cpfRelationType'] || 'unknown'
+      accumulator << relation_type # NO deduplication - keeps array parallel
+    end
+  end
+end
+
+# Relationship types used for faceting,
 to_field 'relationship_types_ssim' do |record, accumulator|
   relations = record.xpath('//eac:cpfDescription/eac:relations/eac:cpfRelation', EAC_NS)
   relations.each do |rel|
@@ -223,3 +249,27 @@
     context.logger.info("Indexed creator: #{record_id.text}")
   end
 end
+
+
+
+
+# Pattern matching arcflow's creator file naming: creator_{entity_type}_{id}
+CREATOR_ID_PATTERN = /^creator_(#{ENTITY_TYPES.join('|')})_\d+$/
+
+# Helper to build and validate creator IDs
+def build_creator_id(entity_type, id_number)
+  creator_id = "creator_#{entity_type}_#{id_number}"
+  unless creator_id =~ CREATOR_ID_PATTERN
+    raise ArgumentError, "Invalid creator ID: #{creator_id} doesn't match pattern"
+  end
+  creator_id
+end
+
+# Helper to convert ArchivesSpace URI to Solr creator ID
+def aspace_uri_to_solr_id(uri)
+  return nil unless uri
+  # Match: /agents/{type}/{id} or https://.../agents/{type}/{id}
+  if uri =~ /agents\/(#{ENTITY_TYPES.join('|')})\/(\d+)/
+    build_creator_id($1, $2)
+  end
+end
\ No newline at end of file

From 3676246495462a13dd1ff64f8c89521b1ec0b8f6 Mon Sep 17 00:00:00 2001
From: Alex Dryden
Date: Tue, 3 Mar 2026 13:16:39 -0500
Subject: [PATCH 41/44] this will require further refinement, but for now this will be a more conservative list of things we know are relevant

---
 arcflow/main.py | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/arcflow/main.py b/arcflow/main.py
index bf9375b..4613ca6 100644
--- a/arcflow/main.py
+++ b/arcflow/main.py
@@ -701,16 +701,11 @@ def _get_target_agent_criteria(self, modified_since=0):
         Defines the Solr query criteria for "target" agents.
         These are agents we want to process.
         """
-        # Basic filters for agents to include
         criteria = [
+            "linked_agent_roles:creator",
             "system_generated:false",
             "is_user:false",
-            "is_repo_agent:false",
-            # Include agents that are creators OR are linked to published records
-            "(linked_agent_roles:creator OR is_linked_to_published_record:true)",
-            # Exclude agents whose ONLY role is 'donor'
-            # This logic says: "NOT (role is only donor)"
-            "(*:* -linked_agent_roles:donor OR (*:* AND linked_agent_roles:[* TO *] AND (*:* -linked_agent_roles:donor)))"
+#            "is_repo_agent:false",
         ]
 
         # Add time filter if applicable
@@ -727,11 +722,10 @@ def _get_nontarget_agent_criteria(self, modified_since=0):
         """
         # The core logic for what makes an agent a "target"
         target_logic = " AND ".join([
+            "linked_agent_roles:creator",
             "system_generated:false",
             "is_user:false",
-            "is_repo_agent:false",
-            "(linked_agent_roles:creator OR is_linked_to_published_record:true)",
-            "(*:* -linked_agent_roles:donor OR (*:* AND linked_agent_roles:[* TO *] AND (*:* -linked_agent_roles:donor)))"
+#            "is_repo_agent:false",
         ])
 
         # We find non-targets by negating the entire block of target logic

From 29907fb43ce373e70b2d2cca49fd4fc8d8ae124e Mon Sep 17 00:00:00 2001
From: Alex Dryden
Date: Tue, 3 Mar 2026 13:17:01 -0500
Subject: [PATCH 42/44] ensure passing of indent size and not the indent string

---
 arcflow/main.py | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arcflow/main.py b/arcflow/main.py
index 4613ca6..44f8c1b 100644
--- a/arcflow/main.py
+++ b/arcflow/main.py
@@ -541,7 +541,7 @@ def process_deleted_records(self):
                         ead_id.replace('.', '-'), # dashes in Solr
                         f'{xml_dir}/{ead_id}.xml', # dots in filenames
                         f'{pdf_dir}/{ead_id}.pdf',
-                        indent=4)
+                        indent_size=4)
                 else:
                     self.log.error(f'{" " * (indent_size+2)}Symlink {symlink_path} not found. Unable to delete the associated EAD from Arclight Solr.')
 
@@ -1207,7 +1207,7 @@ def delete_arclight_solr_record(self, solr_record_id, indent_size=0):
         except requests.exceptions.RequestException as e:
             self.log.error(f'{indent}Error deleting Solr record {solr_record_id} from ArcLight Solr: {e}')
 
-    def delete_file(self, file_path, indent_side=0):
+    def delete_file(self, file_path, indent_size=0):
         indent = ' ' * indent_size
 
         try:
@@ -1218,21 +1218,19 @@ def delete_file(self, file_path, indent_side=0):
 
 
     def delete_ead(self, resource_id, ead_id, xml_file_path, pdf_file_path, indent_size=0):
-        indent = ' ' * indent_size
 
         # delete from solr
         deleted_solr_record = self.delete_arclight_solr_record(ead_id, indent_size=indent_size)
         if deleted_solr_record:
-            self.delete_file(pdf_file_path, indent=indent)
-            self.delete_file(xml_file_path, indent=indent)
+            self.delete_file(pdf_file_path, indent_size=indent_size)
+            self.delete_file(xml_file_path, indent_size=indent_size)
             # delete symlink if exists
             symlink_path = f'{os.path.dirname(xml_file_path)}/{resource_id}.xml'
-            self.delete_file(symlink_path, indent=indent)
+            self.delete_file(symlink_path, indent_size=indent_size)
 
 
     def delete_creator(self, file_path, solr_id, indent_size=0):
-        indent = ' ' * indent_size
 
         deleted_solr_record = self.delete_arclight_solr_record(solr_id, indent_size=indent_size)
         if deleted_solr_record:
-            self.delete_file(file_path, indent=indent)
+            self.delete_file(file_path, indent_size=indent_size)

From 6674a214b27b94b166d4eac29faff8a0dc877527 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Tue, 3 Mar 2026 18:47:53 +0000
Subject: [PATCH 43/44] Remove glob wildcard changes, keep only PDF timeout features

Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
---
 arcflow/main.py | 34 +++++++++++++---------------------
 1 file changed, 13 insertions(+), 21 deletions(-)

diff --git a/arcflow/main.py b/arcflow/main.py
index 4f8038d..3fc84fe 100644
--- a/arcflow/main.py
+++ b/arcflow/main.py
@@ -10,7 +10,6 @@
 import logging
 import math
 import sys
-import glob
 from xml.dom.pulldom import parse, START_ELEMENT
 from xml.sax.saxutils import escape as xml_escape
 from xml.etree import ElementTree as ET
@@ -585,17 +584,18 @@ def update_eads(self):
 
             # Remove pending symlinks after indexing
             for repo_id, batch_num in batches:
-                xml_file_pattern = f'{xml_dir}/{repo_id}_*_batch_{batch_num}.xml'
-                xml_files = glob.glob(xml_file_pattern)
-
-                for xml_file_path in xml_files:
-                    try:
-                        os.remove(xml_file_path)
-                        self.log.info(f'{" " * indent_size}Removed pending symlink {xml_file_path}')
-                    except FileNotFoundError:
-                        self.log.warning(f'{" " * indent_size}File not found: {xml_file_path}')
-                    except Exception as e:
-                        self.log.error(f'{" " * indent_size}Error removing pending symlink {xml_file_path}: {e}')
+                xml_file_path = f'{xml_dir}/{repo_id}_*_batch_{batch_num}.xml'
+                try:
+                    result = subprocess.run(
+                        f'rm {xml_file_path}',
+                        shell=True,
+                        cwd=self.arclight_dir,
+                        stderr=subprocess.PIPE,)
+                    self.log.error(f'{" " * indent_size}{result.stderr.decode("utf-8")}')
+                    if result.returncode != 0:
+                        self.log.error(f'{" " * indent_size}Failed to remove pending symlinks {xml_file_path}. Return code: {result.returncode}')
+                except Exception as e:
+                    self.log.error(f'{" " * indent_size}Error removing pending symlinks {xml_file_path}: {e}')
 
             # Tasks for processing PDFs
             results_4 = [pool.apply_async(
@@ -685,15 +685,7 @@ def index_collections(self, repo_id, xml_file_path, indent_size=0):
                 # Treat a string extra config as a path and pass it with -c
                 cmd.extend(['-c', self.traject_extra_config])
 
-        # Expand wildcards with glob
-        xml_files = glob.glob(xml_file_path)
-
-        if not xml_files:
-            self.log.warning(f'{indent}No files found matching pattern: {xml_file_path}')
-            return
-
-        # Add all matching files to the command
-        cmd.extend(xml_files)
+        cmd.append(xml_file_path)
 
         env = os.environ.copy()
         env['REPOSITORY_ID'] = str(repo_id)

From 595279839f945f66dd60de54e422fae46a5323f6 Mon Sep 17 00:00:00 2001
From: Alex Dryden
Date: Tue, 3 Mar 2026 13:39:52 -0500
Subject: [PATCH 44/44] Expand wildcards with glob and use list command sequence instead of string command with shell=True

---
 arcflow/main.py | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/arcflow/main.py b/arcflow/main.py
index 44f8c1b..c6d8dd9 100644
--- a/arcflow/main.py
+++ b/arcflow/main.py
@@ -17,7 +17,7 @@
 from asnake.client import ASnakeClient
 from multiprocessing.pool import ThreadPool as Pool
 from utils.stage_classifications import extract_labels
-
+import glob
 
 base_dir = os.path.abspath((__file__) + "/../../")
 log_file = os.path.join(base_dir, 'logs/arcflow.log')
@@ -577,7 +577,7 @@ def index_collections(self, repo_id, xml_file_path, indent_size=0):
             return
 
         traject_config = f'{arclight_path}/lib/arclight/traject/ead2_config.rb'
-
+        xml_files = glob.glob(xml_file_path) # Returns list of matching files
         cmd = [
             'bundle', 'exec', 'traject',
             '-u', self.solr_url,
@@ -586,8 +586,8 @@ def index_collections(self, repo_id, xml_file_path, indent_size=0):
             '-s', f'solr_writer.batch_size={self.batch_size}',
             '-s', 'solr_writer.commit_on_close=true',
             '-i', 'xml',
-            '-c', traject_config
-        ]
+            '-c', traject_config,
+        ] + xml_files
 
         if self.traject_extra_config:
             if isinstance(self.traject_extra_config, (list, tuple)):
@@ -595,14 +595,11 @@ def index_collections(self, repo_id, xml_file_path, indent_size=0):
             else:
                 # Treat a string extra config as a path and pass it with -c
                 cmd.extend(['-c', self.traject_extra_config])
-
-        cmd.append(xml_file_path)
-
+
         env = os.environ.copy()
         env['REPOSITORY_ID'] = str(repo_id)
-        cmd_string = ' '.join(cmd)
         result = subprocess.run(
-            cmd_string,
+            cmd,
             cwd=self.arclight_dir,
             env=env,
             stderr=subprocess.PIPE,