IA Templatizer is a command-line Python tool for batch-generating metadata CSV files for Internet Archive ingest. It applies a JSON template to an input CSV, fills in default values, generates standardized identifiers, expands repeatable fields, and validates metadata. The output is a correctly formatted CSV for use with the Internet Archive CLI or Python library.
Full documentation: see IA-Templatizer-User-and-Developer-Manual.md and Student-Programmer-Reference.md.
- Python 3.7 or newer (standard library only — no third-party packages required)
python ia-templatizer.py [options] <template.json> <input.csv> <output.csv>
| Option | Argument | Description |
|---|---|---|
| `--expand-directories` / `-E` | none | Expand directory paths in the file column into per-file output sheets |
| `--flatten` | none | Flatten compound objects (item row + child page rows) before processing |
| `--mapping FILE` | CSV path | Load a column-mapping CSV; overrides any mapping embedded in the template |
| `--delimiter STR` | string | Multi-value delimiter in source cells (default: \|@\|) |
| `--type-col COL` | column name | Column identifying row type for flattening (default: type) |
| `--page-type VAL` | string | Value marking a child page row (default: GraphicalPage) |
| `--images-col COL` | column name | Column containing the image/file path (default: images) |
| `--sequence-col COL` | column name | Column with page sequence number (default: sequence_id) |
CLI options always override values embedded in the template's "options" section.
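The precedence rule can be pictured as a layered merge. The sketch below is illustrative only and is not taken from the tool's source: `resolve_options` and `BUILT_IN_DEFAULTS` are hypothetical names, and the `type_col` and `page_type` keys are assumed by analogy with the `images_col` and `sequence_col` keys that appear elsewhere in this README.

```python
# Hypothetical sketch of option precedence: built-in defaults, then the
# template's "options" section, then any flags given on the command line.
BUILT_IN_DEFAULTS = {
    "flatten": False,
    "delimiter": "|@|",
    "type_col": "type",            # key name assumed
    "page_type": "GraphicalPage",  # key name assumed
    "images_col": "images",
    "sequence_col": "sequence_id",
}


def resolve_options(template_options, cli_options):
    """Return the effective options, with CLI values winning over the template."""
    resolved = dict(BUILT_IN_DEFAULTS)
    resolved.update(template_options)
    # A CLI value of None stands for "flag not supplied" and changes nothing.
    resolved.update({k: v for k, v in cli_options.items() if v is not None})
    return resolved


effective = resolve_options(
    {"flatten": False, "images_col": "files"},   # from the template
    {"images_col": "scans", "delimiter": None},  # from the command line
)
print(effective["images_col"])  # scans
print(effective["delimiter"])   # |@|
```

Treating "not supplied" as `None` keeps an unused flag from overwriting a value the template set deliberately.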
All metadata fields and control fields sit at the top level. This format is best for file listings and manually prepared CSVs.
{
"identifier-prefix": "hamilton",
"mediatype": "image",
"collection": ["hamilton"],
"creator": "Hamilton College",
"rights-statement": "http://rightsstatements.org/vocab/NKC/1.0/",
"subject": ["Hamilton College", "Photographs"],
"notes": "Digitized by LITS Digital Collections, Hamilton College"
}

The combined format, introduced in v3.1, wraps defaults, column mapping, and runtime options in three named sections. No separate mapping CSV or CLI flags are needed.
{
"defaults": { ... },
"mapping": { "source_col": "ia_field", ... },
"options": { "flatten": false, "images_col": "files", "delimiter": "|@|" }
}

The format is detected automatically.
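This README does not say how detection works. One plausible rule, shown here purely as an assumption, is to treat a template as combined when any of the three section names appears as a top-level key. `split_template` is a hypothetical helper, not a function from the codebase.

```python
# Assumed detection rule: a top-level "defaults", "mapping", or "options"
# key marks the combined format; anything else is the top-level format.
COMBINED_SECTIONS = ("defaults", "mapping", "options")


def split_template(template):
    """Return (defaults, mapping, options) for either template format."""
    if any(section in template for section in COMBINED_SECTIONS):
        return (
            template.get("defaults", {}),
            template.get("mapping", {}),
            template.get("options", {}),
        )
    # Top-level format: the whole template acts as the defaults.
    return template, {}, {}


defaults, mapping, options = split_template(
    {"identifier-prefix": "hamilton", "mediatype": "image"}
)
print(defaults["mediatype"])  # image
print(mapping, options)       # {} {}
```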
| Field | Effect |
|---|---|
| `identifier-prefix` | Prepended to every generated identifier |
| `identifier-date` | Date string embedded in identifier; "TRUE" uses each row's date value |
| `identifier-basename` | Fixed string replacing the file-derived identifier component |
| `related-url-base` | Base URL prepended to each row's UUID value to generate related[0] (e.g. "https://litsdigital.hamilton.edu/do/") |
| `related-url-col` | Source column name holding the UUID; defaults to node_uuid when related-url-base is set |
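As a rough illustration of the two `related-url-*` fields, the sketch below builds the `related[0]` value for one row. `related_url` is a hypothetical helper and the UUID value is made up; only the field names and the `node_uuid` default come from the table above.

```python
def related_url(row, defaults):
    """Build the related[0] value for one row, or None when it cannot be built."""
    base = defaults.get("related-url-base")
    if not base:
        return None
    # Falls back to node_uuid, as described for related-url-col.
    column = defaults.get("related-url-col", "node_uuid")
    uuid = (row.get(column) or "").strip()
    return base + uuid if uuid else None


example = related_url(
    {"node_uuid": "0a1b2c3d"},  # made-up UUID for illustration
    {"related-url-base": "https://litsdigital.hamilton.edu/do/"},
)
print(example)  # https://litsdigital.hamilton.edu/do/0a1b2c3d
```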
Any template field whose value is a list becomes repeatable — the tool expands it into indexed output columns (subject[0], subject[1], …). Template values come first, followed by deduplicated values from the source CSV. This applies to subject, collection, source, related, and any other list field in "defaults".
The related field is also activated automatically when related-url-base is set, even if no static "related" list appears in the template.
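The expansion rule for list fields can be sketched in a few lines. This is a minimal illustration under the ordering described above (template values first, then source values not already present); `expand_repeatable` is a hypothetical name, and the tool's own implementation may differ.

```python
def expand_repeatable(field, template_values, source_values):
    """Merge values, template first, and return indexed output columns."""
    merged = []
    for value in list(template_values) + list(source_values):
        value = value.strip()
        if value and value not in merged:  # skip blanks and duplicates
            merged.append(value)
    return {f"{field}[{index}]": value for index, value in enumerate(merged)}


columns = expand_repeatable(
    "subject",
    ["Hamilton College", "Photographs"],  # from the template
    ["Campus views", "Photographs"],      # from the source CSV row
)
print(columns)
# {'subject[0]': 'Hamilton College', 'subject[1]': 'Photographs', 'subject[2]': 'Campus views'}
```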
Use this mode when your source CSV comes from a CONTENTdm/MODS export, where column names are MODS element paths (e.g., mods_titleinfo_title) and multi-value cells use a |@| delimiter.
Each entry in "mapping" translates a source CSV column to one or two IA field names:
"mapping": {
"mods_titleinfo_title": "title",
"mods_subject_topic": "subject",
"mods_genre_authority_local": ["genre", "subject"]
}

Multiple source columns may map to the same IA field; values are merged and deduplicated.
Prefix a target field with ! to replace any previously collected values for that field rather than appending. Use this when a later, more specific source should win.
"mods_origininfo_dateissued": "date",
"date_full": "!date"

mods_origininfo_dateissued first puts the raw ISO timestamp (1876-03-30T00:00:00Z) into the date bucket. When date_full is processed, the ! clears the bucket and replaces it with the human-readable value (1876-03-30). The ! source must appear after the field it overrides in the mapping.
MODS exports represent compound objects as one item row followed by N child GraphicalPage rows. With "flatten": true, the tool:
- Assigns the first child's image path to the item row.
- Converts remaining children into blank continuation rows (image path only).
All pages of a compound object share the same IA identifier.
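A minimal sketch of the flattening step follows. It assumes the input starts with an item row, that child rows directly follow their item, and that pages are ordered by a numeric sequence column. `flatten_compound` is a hypothetical name, and the `Book` and `Image` type values in the sample data are invented.

```python
def flatten_compound(rows, type_col="type", page_type="GraphicalPage",
                     images_col="images", sequence_col="sequence_id"):
    """Fold each item row and its child page rows into item + continuation rows."""
    flattened = []
    position = 0
    while position < len(rows):
        item = dict(rows[position])
        position += 1
        # Collect the child page rows that directly follow this item row.
        children = []
        while position < len(rows) and rows[position].get(type_col) == page_type:
            children.append(rows[position])
            position += 1
        children.sort(key=lambda page: int(page.get(sequence_col) or 0))
        if children:
            # The first page's image path moves onto the item row.
            item[images_col] = children[0].get(images_col, "")
        flattened.append(item)
        for page in children[1:]:
            # Remaining pages become blank rows carrying only the image path.
            continuation = dict.fromkeys(item, "")
            continuation[images_col] = page.get(images_col, "")
            flattened.append(continuation)
    return flattened


sample = [
    {"type": "Book", "title": "Album", "images": "", "sequence_id": ""},
    {"type": "GraphicalPage", "title": "p2", "images": "p2.jpg", "sequence_id": "2"},
    {"type": "GraphicalPage", "title": "p1", "images": "p1.jpg", "sequence_id": "1"},
    {"type": "Image", "title": "Single", "images": "single.jpg", "sequence_id": ""},
]
flat = flatten_compound(sample)
print([row["images"] for row in flat])  # ['p1.jpg', 'p2.jpg', 'single.jpg']
```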
The American Socialist (1876–1879) is a weekly newspaper held at Hamilton College. The source data is a MODS export (MODS_Oneida_American_Socialist_ZIPs.csv) where each row is one issue and file paths point to ZIP archives of page images.
python ia-templatizer.py \
templates/template_oneida-american-socialist.json \
MODS_Oneida_American_Socialist_ZIPs.csv \
american-socialist-out.csv

No additional flags are needed; all options are embedded in the template.
{
"defaults": {
"mediatype": "texts",
"collection": ["hamilton"],
"rights-statement": "http://rightsstatements.org/vocab/NKC/1.0/",
"rights": "For questions ... Hamilton College Special Collections ...",
"subject": [
"Hamilton College", "Oneida Community", "Communal societies",
"Intentionalism", "Utopian socialism", "Socialism",
"Noyes, John Humphrey, 1811-1886", "Periodicals", "Newspapers"
],
"notes": "Digitized by LITS Digital Collections, Hamilton College",
"source": [
"Hamilton College Library Rare Books and Special Collections",
"Communal Societies Collection",
"Oneida Community Collection",
"Folio HX656.O5 A46",
"Hamilton College Library, Clinton, New York, United States"
]
},
"mapping": {
"mods_identifier_local": "identifier",
"files": "file",
"mods_titleinfo_title": "title",
"mods_name_personal_namepart_refined": "creator",
"mods_personal_name_author": "creator",
"mods_origininfo_dateissued": "date",
"date_full": "!date",
"mods_language_languageterm_text": "language",
"mods_physicaldescription_extent": "extent",
"mods_note": "notes",
"mods_genre_authority_local": "genre",
"mods_genre_subgenre_authority_local": "genre",
"mods_subject_geographic": "subject",
"mods_subject_topic": "subject",
"mods_subject_family_name": "subject",
"mods_origininfo_place_placeterm_text": "location",
"mods_accesscondition_use_and_reproduction": "rights-statement"
},
"options": {
"flatten": false,
"images_col": "files",
"delimiter": "|@|"
}
}

Key design decisions in this template:

- `"flatten": false`: no compound objects; each row is a single issue
- `"images_col": "files"`: the source CSV uses `files` (not the default `images`) for file paths
- `source` is in `defaults` as a fixed list: the MODS shelf-locator columns produce inconsistently split values, so the correct values are hardcoded once here
- `date_full` → `"!date"`: overrides the raw ISO timestamp from `mods_origininfo_dateissued`
- `creator` is mapped from two source columns (`mods_name_personal_namepart_refined` and `mods_personal_name_author`); values are deduplicated if both are populated
| Field | Value |
|---|---|
| `identifier` | american-socialist-1876-03-30 |
| `file` | american-socialist-1876-03-30.zip |
| `mediatype` | texts |
| `collection[0]` | hamilton |
| `title` | American socialist, vol. 01, no. 01 (March 30, 1876) |
| `creator` | Noyes, John Humphrey, 1811-1886 |
| `date` | 1876-03-30 |
| `subject[0]` | Hamilton College |
| `subject[6]` | Noyes, John Humphrey, 1811-1886 |
| `source[0]` | Hamilton College Library Rare Books and Special Collections |
| `rights-statement` | http://rightsstatements.org/vocab/NKC/1.0/ |
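The `date` and `creator` values above follow from the mapping rules: cells are split on the delimiter, values from several source columns are merged without duplicates, and a `!` target clears what was collected earlier. The sketch below is one way to express those rules, not the tool's actual code. `apply_mapping` is a hypothetical name, the cell values are assumed, and leaving earlier values in place when the `!` source cell is empty is a guess.

```python
def apply_mapping(row, mapping, delimiter="|@|"):
    """Collect source cells into per-field value lists, in mapping order."""
    buckets = {}
    for source_col, targets in mapping.items():
        cell = row.get(source_col) or ""
        values = [part.strip() for part in cell.split(delimiter) if part.strip()]
        if not values:
            continue  # assumption: an empty cell never clears earlier values
        if isinstance(targets, str):
            targets = [targets]
        for target in targets:
            if target.startswith("!"):
                target = target[1:]
                buckets[target] = []  # the ! prefix discards earlier values
            bucket = buckets.setdefault(target, [])
            for value in values:
                if value not in bucket:
                    bucket.append(value)
    return buckets


mapping = {
    "mods_name_personal_namepart_refined": "creator",
    "mods_personal_name_author": "creator",
    "mods_origininfo_dateissued": "date",
    "date_full": "!date",
}
row = {  # assumed cell values for the first issue
    "mods_name_personal_namepart_refined": "Noyes, John Humphrey, 1811-1886",
    "mods_personal_name_author": "Noyes, John Humphrey, 1811-1886",
    "mods_origininfo_dateissued": "1876-03-30T00:00:00Z",
    "date_full": "1876-03-30",
}
result = apply_mapping(row, mapping)
print(result["date"])     # ['1876-03-30']
print(result["creator"])  # ['Noyes, John Humphrey, 1811-1886']
```

The sketch relies on dictionaries preserving insertion order, which Python guarantees from 3.7 onward, so the order of entries in the JSON mapping is the order in which they are processed.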
- Copy an existing combined template as a starting point.
- Update `"defaults"`: subjects, rights statement, mediatype, notes, and any fixed `source` values.
- Update `"mapping"`: check the source CSV header row and replace column names as needed.
- Set `"options"`: `"flatten": true` only if the CSV has compound objects; set `"images_col"` to match the file-path column name.
- Test: `python ia-templatizer.py templates/template_NEW.json SOURCE.csv test-out.csv`
- Review `test-out.csv`: check identifiers, dates, subject list, and source values.
identifier → file → mediatype → collection[n] → title → date → creator → description → subject[n] → rights-statement → rights → genre[n] → language[n] → extent[n] → notes → source[n] → location[n] → related[n] → (remaining columns alphabetically)
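One way to produce this ordering is a sort key that ranks known field names by their position in the list, orders indexed columns numerically, and sends everything else to the end alphabetically. The sketch below is an illustration with hypothetical names (`PRIORITY`, `column_sort_key`), not the tool's implementation.

```python
import re

# Field names in the priority order listed above.
PRIORITY = [
    "identifier", "file", "mediatype", "collection", "title", "date",
    "creator", "description", "subject", "rights-statement", "rights",
    "genre", "language", "extent", "notes", "source", "location", "related",
]
INDEXED = re.compile(r"(?P<name>.+)\[(?P<index>\d+)\]")


def column_sort_key(column):
    """Sort key: priority fields first, then the remaining columns by name."""
    match = INDEXED.fullmatch(column)
    name = match.group("name") if match else column
    index = int(match.group("index")) if match else -1
    if name in PRIORITY:
        return (0, PRIORITY.index(name), index)
    return (1, name, index)


columns = ["subject[10]", "zebra", "title", "identifier", "subject[2]",
           "alpha", "collection[0]", "file"]
ordered = sorted(columns, key=column_sort_key)
print(ordered)
# ['identifier', 'file', 'collection[0]', 'title', 'subject[2]', 'subject[10]', 'alpha', 'zebra']
```

Comparing the index as an integer keeps `subject[10]` after `subject[2]`, which a plain string sort would get wrong.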
| Symptom | Fix |
|---|---|
| Raw ISO timestamp in `date` output | Map the date-full column to "!date" and place it after the timestamp column in the mapping |
| `creator[0]` instead of `creator` | Add "creator": "" to "defaults" to anchor it as non-repeatable |
| `source` values split incorrectly from \|@\| cells | Move source out of mapping and into defaults as a hardcoded list |
| `WARNING: mapping references 'col'...` | Column name in mapping doesn't match CSV header; check for typos |
| Pages out of order (compound objects) | Verify "sequence_col" matches the actual numeric column in the source CSV |
| Identifier not generated | Ensure file or identifier column is present in source, or set identifier-prefix in defaults |
| `related[0]` not in output | Ensure "related-url-base" is set in "defaults" and the source CSV has a node_uuid column (or set "related-url-col" to the correct column name) |
- Template-driven metadata: Fill in missing or default metadata from a JSON template.
- Identifier generation: Automatically create standardized identifiers using template rules and file names.
- Repeatable fields: Expand list fields (e.g., `subject`, `collection`) into indexed columns (`subject[0]`, `subject[1]`, etc.), with template values first, then deduplicated input values.
- Input normalization: Strips leading/trailing whitespace from all input CSV cell data before processing.
- Validation: Checks for valid media types, license URLs, rights statements, and date formats.
- Custom column ordering: Output CSV columns are ordered for Internet Archive workflows.
- Control fields: Template control fields (e.g., `identifier-date`, `identifier-prefix`) affect behavior but are not included in the output unless explicitly specified.
- Robust error handling: Clear error messages for missing files, invalid formats, and unsupported values.
- Directory expansion: Optionally expand directory paths in the input CSV to generate additional output sheets for their contents.
- Extensible codebase: Modular Python scripts for easy customization and extension.
python ia-templatizer.py [flags] <template_path> <csv_path> <output_path>

- `<template_path>`: Path to your metadata template JSON file.
- `<csv_path>`: Path to your input CSV file.
- `<output_path>`: Path for the output CSV file.
- `[flags]`: Optional flags to control program behavior (see below).

python ia-templatizer.py --expand-directories templates/sample-template_01.json tests/sample-files-listing.csv tests/list-out.csv

| Flag | Description |
|---|---|
| `--expand-directories` | When a directory path is found in the file column, generate an additional output CSV sheet |
| `-E` | Same as --expand-directories |
Note: Only the above flags are currently supported. Any other flags will result in an error.
When the --expand-directories or -E flag is used:
- If a row in the input CSV has a directory path in its `file` column and the directory exists and is listable:
  - The row is not added to the main output CSV.
  - A new output CSV is created, named with `_{last-directory-name}` appended before the extension.
  - Each file in the directory is treated as a new item: a full metadata row is generated for it using the template and original row, and written to the directory output sheet.
  - Hidden files, subdirectories, and `Thumbs.db` are excluded.
  - After processing the directory, the script continues with the next row in the input CSV.
- If the directory does not exist or is not listable:
  - The row is added to the main output CSV as usual, with its `mediatype` set to `"data"`.
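Two pieces of the rules above lend themselves to a short sketch: naming the extra output sheet and listing the files that qualify. The helper names are hypothetical, and treating "hidden" as "name starts with a dot" is an assumption.

```python
import os
import tempfile


def expanded_sheet_path(output_path, directory):
    """Append _{last-directory-name} before the output file's extension."""
    stem, extension = os.path.splitext(output_path)
    last_name = os.path.basename(os.path.normpath(directory))
    return f"{stem}_{last_name}{extension}"


def listable_files(directory):
    """Return the files to expand, or None if the directory cannot be listed."""
    try:
        names = os.listdir(directory)
    except OSError:
        return None
    return sorted(
        name for name in names
        if not name.startswith(".")  # assumed meaning of "hidden"
        and name != "Thumbs.db"
        and os.path.isfile(os.path.join(directory, name))  # drops subdirectories
    )


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    folder = os.path.join(root, "02baseball")
    os.makedirs(os.path.join(folder, "raw"))
    for name in ("team1.jpg", ".DS_Store", "Thumbs.db"):
        open(os.path.join(folder, name), "w").close()
    found = listable_files(folder)
    missing = listable_files(os.path.join(root, "no-such-directory"))

sheet = expanded_sheet_path("tests/list-out.csv", "photos/02baseball/")
print(found)    # ['team1.jpg']
print(missing)  # None
print(sheet)    # tests/list-out_02baseball.csv
```

A `None` result corresponds to the fallback case, where the row stays in the main output with its `mediatype` set to `"data"`.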
A well-formed template JSON file contains metadata fields and control fields. Example:
{
"identifier-prefix": "born-digital",
"mediatype": "DETECT",
"collection": ["middleburycollege"],
"creator": "Middlebury College",
"rights-statement": "http://rightsstatements.org/vocab/CNE/1.0/",
"subject": ["Baseball", "Team photos", "Athletes"],
"inclusive-description-statement": "This collection aims to represent diverse communities and experiences."
}

- `identifier-date`: If a valid date (YYYY, YYYY-MM, YYYY-MM-DD, or with 'x' for uncertainty), it is inserted in the identifier. If `"TRUE"`, the value from the input CSV's `date` column is used (if valid).
- `identifier-prefix` or `identifier_prefix`: Used to construct identifiers. Hyphen and underscore are both supported.
- `identifier-basename` or `identifier_basename`: Used as the core part of the identifier.
- Repeatable fields (lists) such as `subject`, `collection`, etc., are expanded into indexed columns.
- Control fields are never written to output CSVs.
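The date rule for `identifier-date` can be approximated with a format check. The sketch below only checks the shape of the string, with a lowercase `x` allowed in place of any digit; it does not check that months or days are in range, and the tool's real validation may be stricter. `is_valid_date` and `identifier_date` are hypothetical names.

```python
import re

# YYYY, YYYY-MM, or YYYY-MM-DD, where any digit may be an 'x'.
DATE_PATTERN = re.compile(r"[0-9x]{4}(-[0-9x]{2}(-[0-9x]{2})?)?")


def is_valid_date(value):
    """Return True when the value has one of the accepted date shapes."""
    return bool(DATE_PATTERN.fullmatch(value.strip()))


def identifier_date(control_value, row):
    """Pick the date to embed in an identifier, or None if none is usable."""
    if control_value == "TRUE":
        candidate = row.get("date") or ""
    else:
        candidate = control_value or ""
    return candidate if is_valid_date(candidate) else None


print(is_valid_date("19xx"))                           # True
print(is_valid_date("1876-03-30T00:00:00Z"))           # False
print(identifier_date("TRUE", {"date": "2020-05-01"}))  # 2020-05-01
```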
A well-formed input CSV must have a header row. The identifier column is required unless file is present.
Example:
file,title,contributor,notes,date
02baseball/team1.jpg,"Middlebury College Baseball, 2002",,"Team photo",2020-05-01
02baseball/anderson.jpg,"Middlebury College Baseball, 2002: Nate Anderson",Nate Anderson,"Do you know something about this photograph? Email us!",2020-05-02

- If the input CSV contains a column named (case-insensitive) `subject`, `subjects`, or `keywords`, its contents are treated as individual semicolon-delimited values for the repeatable field `subject[n]`.
- If the input CSV contains columns named `subject[0]`, `subject[1]`, etc., those are used directly.
- The same logic applies for other repeatable fields (e.g., `collection`, `collection[0]`, etc.).
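The column-matching rules for `subject` can be sketched as follows. `collect_subjects` is a hypothetical helper; it gathers values in the order the columns appear in the row, which is an assumption about how the two kinds of column are combined.

```python
import re

SUBJECT_ALIASES = ("subject", "subjects", "keywords")
INDEXED_SUBJECT = re.compile(r"subject\[\d+\]")


def collect_subjects(row):
    """Gather subject values from alias columns and indexed columns."""
    values = []
    for column, cell in row.items():
        name = column.strip().lower()  # alias matching is case-insensitive
        cell = cell or ""
        if name in SUBJECT_ALIASES:
            # Alias columns hold semicolon-delimited values.
            values.extend(part.strip() for part in cell.split(";"))
        elif INDEXED_SUBJECT.fullmatch(name):
            # Indexed columns are used directly.
            values.append(cell.strip())
    return [value for value in values if value]


subjects = collect_subjects({
    "file": "02baseball/team1.jpg",
    "Keywords": "Baseball; Team photos ;",
    "subject[0]": "Athletes",
})
print(subjects)  # ['Baseball', 'Team photos', 'Athletes']
```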
The output CSV will contain:
- All original columns from the input CSV, except for control fields and non-indexed repeatable fields (e.g., `subject`, `keywords`, `subjects`, `collection`).
- Any fields from the template not present in the input (except control fields).
- Repeatable fields expanded into indexed columns (e.g., `subject[0]`, `subject[1]`), with template values first, then deduplicated input values.
- Columns ordered as follows:
  - `identifier`
  - `file`
  - `mediatype`
  - All `collection[n]` columns (in order)
  - `title`
  - `date`
  - `creator`
  - `description`
  - All `subject[n]` columns (in order)
  - Any other columns (in no particular order)
Example output:
identifier,file,mediatype,collection[0],collection[1],title,date,creator,description,subject[0],subject[1],subject[2],rights,notes,rights-statement,inclusive-description-statement
born-digital_middmag_finals-week_2011,a10_middmag_finals-week_2011.mp4,movies,middleburycollege,specialcollection,"Finals Week",2011,"Middlebury College","Description here","Baseball","Team photos","Athletes",...,...,http://rightsstatements.org/vocab/CNE/1.0/,"This collection aims to represent diverse communities and experiences."

- Validate your template and input CSV before running the script.
- Use clear, consistent field names in your template and CSV.
- For repeatable fields, use lists in the template to ensure proper expansion.
- For uncertain dates, use 'x' in place of unknown digits (e.g., `19xx`).
- Always use supported flags and check error messages for guidance.
- Keep your codebase modular for easier maintenance and extension.
- Control fields in output: Control fields (e.g., `identifier-date`, `identifier-prefix`) should never appear in output CSVs. If they do, update your codebase to exclude them.
- Identifier not generated correctly: Ensure your template uses either `identifier-prefix` or `identifier_prefix`, and your code supports both.
- Invalid date formats: Dates must be in `YYYY`, `YYYY-MM`, or `YYYY-MM-DD` format, with 'x' allowed for uncertainty (e.g., `19xx`).
- Invalid license or rights statement: Only current Creative Commons licenses and rightsstatements.org statements are accepted.
- File not found: If the input CSV or template file does not exist, the script will exit with an error.
- Output directory does not exist: The script will create the output directory if needed.
- Invalid flags: If an unsupported flag is provided, the script will exit with an error and display allowed flags.
- Repeatable fields not expanded: Ensure repeatable fields are lists in the template.
- Duplicate values in repeatable fields: The script automatically deduplicates values for each repeatable field per row.
- `ia-templatizer.py`: Main CLI script. Handles argument parsing, template and CSV loading, main processing loop, and output writing.
- `codebase/template.py`: Functions for loading and validating template files.
- `codebase/csvutils.py`: Functions for loading and writing CSV files, including whitespace normalization and deduplication utilities.
- `codebase/identifier.py`: Identifier generation logic. Handles control fields, uniqueness, and formatting.
- `codebase/fields.py`: Utility functions for repeatable fields, mediatype detection, and field normalization.
- `codebase/expand_directories.py`: Handles directory expansion logic and writing expanded output sheets.
- Add new control fields:
  - Update the `control_fields` set in both `ia-templatizer.py` and `expand_directories.py`.
  - Implement logic for the new control field in the relevant module (e.g., identifier generation, field expansion).
  - Ensure new control fields are excluded from output CSVs unless explicitly required.
- Add new validation rules:
  - Implement validation logic in `fields.py` or a new module.
  - Call validation functions from the main script as needed.
- Add new repeatable fields:
  - Add the field to your template as a list.
  - Ensure `get_repeatable_fields` in `fields.py` recognizes it.
  - The main script will automatically expand it into indexed columns.
- Change output column order:
  - Update the output column logic in `ia-templatizer.py` and `expand_directories.py`.
- Integrate with other tools:
  - Add new modules to the `codebase/` directory.
  - Import and use them in the main script as needed.
- Keep logic for control fields centralized and consistent.
- Always exclude control fields from output unless explicitly required.
- Use modular functions for validation, identifier generation, and field expansion.
- Document new features and changes in this README and in code comments.
- Test with a variety of templates and input CSVs to ensure robust behavior.
- Prepare your template JSON and input CSV.
- Run the script: `python ia-templatizer.py --expand-directories templates/sample-template_01.json tests/sample-files-listing.csv tests/list-out.csv`
- Review the output CSV and any expanded directory sheets for completeness and accuracy.
- Use the output CSV with Internet Archive CLI tools or other metadata workflows.
- Script fails to run: Check that all dependencies are installed and the `codebase/` directory is present.
- Unexpected output: Verify your template and input CSV for correct field names and formats.
- Validation errors: Read the error message for details on which field or value is invalid.
- Invalid flag error: Ensure you are only using supported flags (`--expand-directories`, `-E`).
- Control fields in output: Update your codebase to exclude control fields from output rows and headers.
If you have questions about using IA Templatizer for your archival project, or need help with advanced configuration, please submit an issue to the project repository.
IA Templatizer is designed to be modular and extensible. You can add new modules to the codebase/ directory to support additional metadata standards, custom validation, or integration with other archival tools.
IA Templatizer is written in Python 3 and relies only on standard Python libraries for its core functionality.
To run the script successfully, you must have:
- Python 3.7 or newer installed on your system.
- The following standard Python modules (included with Python):
  - `os`
  - `sys`
  - `csv`
  - `re`
  - `json`
  - `warnings`
  - `time`
No third-party packages are required for basic operation.
For code editing, testing, and debugging, you may find these tools helpful:
- Visual Studio Code or another Python-aware IDE
- pytest (for unit testing, if you wish to add tests)
- Git (for version control)
To check your Python version:
python3 --version

If you need to install Python, visit python.org/downloads.
Note:
If you add new modules or features that require third-party packages, update this section to list those dependencies and provide installation instructions (e.g., using pip install <package>).