db2pq is a Python library for moving data from PostgreSQL into Apache Parquet files.
It is designed for both general PostgreSQL sources and the WRDS PostgreSQL service.
- Export a single PostgreSQL table to Parquet.
- Export all tables in a PostgreSQL schema to Parquet.
- Export WRDS tables to Parquet.
- Update Parquet files only when the WRDS source table is newer.
- Mirror WRDS tables into a local PostgreSQL database.
- Read `last_modified` metadata embedded in Parquet files.
Install from PyPI:

```sh
pip install --upgrade db2pq
```

Install optional SAS support (used by `wrds_update_pq(..., use_sas=True)`):

```sh
pip install --upgrade "db2pq[sas]"
```
db2pq supports explicit function arguments and environment-based defaults (illustrated in the sketch after the lists below).

Connection defaults:

- `PGUSER`: PostgreSQL user (falls back to the local OS user)
- `PGHOST`: PostgreSQL host (default: `localhost`)
- `PGDATABASE`: PostgreSQL database (default: `PGUSER`)
- `PGPORT`: PostgreSQL port (default: `5432`)
WRDS + output defaults:
- `WRDS_ID`: WRDS username (required for WRDS helpers unless passed directly)
- `DATA_DIR`: base directory where Parquet files are written
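With these variables set, connection arguments can be omitted from calls; a minimal sketch of the fallback behavior described above (the values here are illustrative):

```python
import os
from db2pq import db_to_pq

# db2pq falls back to these when host/database are not passed explicitly.
os.environ["PGHOST"] = "localhost"
os.environ["PGDATABASE"] = "mydb"

pq_file = db_to_pq(table_name="my_table", schema="public")
```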
Example shell setup:

```sh
export WRDS_ID="your_wrds_id"
export DATA_DIR="$HOME/pq_data"
```

`wrds_update_pq(..., use_sas=True)` uses SSH to execute SAS remotely. Configure
an SSH key for your WRDS account first:
```sh
ssh-keygen -t ed25519 -C "your_wrds_id@wrds"
cat ~/.ssh/id_ed25519.pub | \
  ssh your_wrds_id@wrds-cloud-sshkey.wharton.upenn.edu \
  "mkdir -p ~/.ssh && chmod 700 ~/.ssh && \
   cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"
```
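Optionally, confirm that key-based login works before running SAS-backed updates (the interactive `wrds-cloud` host name used here is an assumption):

```sh
# Should print "connected" without prompting for a password
ssh your_wrds_id@wrds-cloud.wharton.upenn.edu "echo connected"
```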
Export a single PostgreSQL table to Parquet:

```python
from db2pq import db_to_pq

pq_file = db_to_pq(
    table_name="my_table",
    schema="public",
    host="localhost",
    database="mydb",
)
print(pq_file)
```
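The returned path can be inspected with any Parquet reader. A minimal read-back check (pyarrow is used here purely for illustration; this example does not depend on it being part of db2pq's documented interface):

```python
import pyarrow.parquet as pq

# Inspect the file db_to_pq() just wrote (pq_file from the example above)
table = pq.read_table(pq_file)
print(table.num_rows, table.schema)
```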
Export a WRDS table to Parquet:

```python
from db2pq import wrds_pg_to_pq

wrds_pg_to_pq(
    table_name="dsi",
    schema="crsp",
    wrds_id="your_wrds_id",  # or set WRDS_ID in the environment
)
```
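To see when the local copy was last refreshed, query `pq_last_updated()` from the API list below (the exact return format is not pinned down here):

```python
from db2pq import pq_last_updated

# Timestamp of the most recent update to the local crsp.dsi file
print(pq_last_updated(table_name="dsi", schema="crsp"))
```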
Update a Parquet file only when the WRDS source table is newer:

```python
from db2pq import wrds_update_pq

wrds_update_pq(
    table_name="dsi",
    schema="crsp",
    wrds_id="your_wrds_id",
)
```
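Because `wrds_update_pq()` skips tables whose local copies are already current, refreshing a list of tables in one pass is cheap; a minimal sketch (the table names are illustrative, and `WRDS_ID` is assumed to be set):

```python
from db2pq import wrds_update_pq

# Each call is a no-op when the local Parquet file is newer than the
# WRDS source table.
for table in ["dsi", "msi", "dsf"]:
    wrds_update_pq(table_name=table, schema="crsp")
```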
Export all tables in a PostgreSQL schema to Parquet:

```python
from db2pq import db_schema_to_pq

files = db_schema_to_pq(schema="public")
print(files)
```

Files are organized as:

```
<DATA_DIR>/<schema>/<table>.parquet
```

For example:

```
/data/crsp/dsi.parquet
```

When `archive=True`, replaced files are moved under:

```
<DATA_DIR>/<schema>/<archive_dir>/<table>_<timestamp>.parquet
```
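Given this layout, the expected location of any table's file can be derived directly; a small sketch (assumes `DATA_DIR` is set as in the shell setup above):

```python
import os
from pathlib import Path

# Expected path of crsp.dsi under the layout described above
path = Path(os.environ["DATA_DIR"]) / "crsp" / "dsi.parquet"
print(path, path.exists())
```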
From `db2pq`:

- `db_to_pq(table_name, schema, ...)`
- `wrds_pg_to_pq(table_name, schema, ...)`
- `db_schema_to_pq(schema, ...)`
- `wrds_update_pq(table_name, schema, ...)`
- `wrds_update_schema(schema, ...)`
- `get_pq_files(schema, ...)`
- `get_modified_pq(file_name)`
- `pq_last_updated(table_name=None, schema=None, data_dir=None, file_name=None)`
- `db_schema_tables(schema, ...)`
- `get_wrds_comment(table_name, schema, use_sas=False, sas_schema=None, ...)`
- `get_pg_comment(table_name, schema, ...)`
- `wrds_update_pg(table_name, schema, ...)`
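As a quick illustration of the file and metadata helpers (treating the return value of `get_pq_files()` as a list of paths is an assumption made for this sketch):

```python
from db2pq import get_pq_files, get_modified_pq

# List Parquet files written for the crsp schema, then read the
# last_modified metadata embedded in the first one.
files = get_pq_files("crsp")
if files:
    print(get_modified_pq(files[0]))
```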
`wrds_update_pq()` supports SQL-style filtering via `where`, for example:

```python
wrds_update_pq("funda", "comp", where="indfmt = 'INDL' AND datafmt = 'STD'")
```
- WRDS PostgreSQL access uses host `wrds-pgdata.wharton.upenn.edu` and port `9737`.
- `batched=True` (the default) lowers memory usage for large tables.
- `col_types` can be used to cast selected columns before writing Parquet.
- `keep`/`drop` accept regex pattern(s) in both `wrds_update_pq()` and `wrds_update_pg()`. If both are supplied, `drop` is applied before `keep` (see the sketch after this list).
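A sketch combining these options (the dict form of `col_types`, and the column names and patterns, are assumptions made for illustration):

```python
from db2pq import wrds_update_pq

# Cast gvkey to string before writing, drop auxiliary columns first,
# then keep only identifier and date columns (drop runs before keep).
wrds_update_pq(
    "funda", "comp",
    col_types={"gvkey": "string"},
    drop="^aux_",
    keep="gvkey|datadate",
)
```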
Run an editable install from this repository:

```sh
pip install -e .
```

With the optional SAS dependency:

```sh
pip install -e ".[sas]"
```

- Docs index: docs/README.md
- Contributor guide: CONTRIBUTING.md
- Release process: RELEASING.md
- Changelog: CHANGELOG.md
MIT License. See LICENSE.