db2pq is a Python library for moving data from PostgreSQL into Apache Parquet files.
It is designed for both general PostgreSQL sources and the WRDS PostgreSQL service.
- Export a single PostgreSQL table to Parquet.
- Export all tables in a PostgreSQL schema to Parquet.
- Export WRDS tables to Parquet.
- Update Parquet files only when the WRDS source table is newer.
- Mirror WRDS tables into a local PostgreSQL database.
- Read `last_modified` metadata embedded in Parquet files.
Install from PyPI:

```sh
pip install --upgrade db2pq
```

Install optional SAS support (used by `wrds_update_pq(..., use_sas=True)`):

```sh
pip install --upgrade "db2pq[sas]"
```
db2pq supports explicit function arguments and environment-based defaults (illustrated in the sketch after the lists below).

Connection defaults:

- `PGUSER`: PostgreSQL user (falls back to the local OS user)
- `PGHOST`: PostgreSQL host (default: `localhost`)
- `PGDATABASE`: PostgreSQL database (default: `PGUSER`)
- `PGPORT`: PostgreSQL port (default: `5432`)
WRDS + output defaults:
- `WRDS_ID`: WRDS username (required for WRDS helpers unless passed directly)
- `DATA_DIR`: base directory where Parquet files are written
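With these variables set, connection arguments can be omitted from calls; a minimal sketch of the fallback behavior described above (the values here are illustrative):

```python
import os
from db2pq import db_to_pq

# db2pq falls back to these when host/database are not passed explicitly.
os.environ["PGHOST"] = "localhost"
os.environ["PGDATABASE"] = "mydb"

pq_file = db_to_pq(table_name="my_table", schema="public")
```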
Example shell setup:

```sh
export WRDS_ID="your_wrds_id"
export DATA_DIR="$HOME/pq_data"
```

`wrds_update_pq(..., use_sas=True)` uses SSH to execute SAS remotely. Configure
an SSH key for your WRDS account first:
```sh
ssh-keygen -t ed25519 -C "your_wrds_id@wrds"
cat ~/.ssh/id_ed25519.pub | \
  ssh your_wrds_id@wrds-cloud-sshkey.wharton.upenn.edu \
  "mkdir -p ~/.ssh && chmod 700 ~/.ssh && \
   cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"
```
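Optionally, confirm that key-based login works before running SAS-backed updates (the interactive `wrds-cloud` host name used here is an assumption):

```sh
# Should print "connected" without prompting for a password
ssh your_wrds_id@wrds-cloud.wharton.upenn.edu "echo connected"
```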
Export a single PostgreSQL table to Parquet:

```python
from db2pq import db_to_pq

pq_file = db_to_pq(
    table_name="my_table",
    schema="public",
    host="localhost",
    database="mydb",
)
print(pq_file)
```
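The returned path can be inspected with any Parquet reader. A minimal read-back check (pyarrow is used here purely for illustration; this example does not depend on it being part of db2pq's documented interface):

```python
import pyarrow.parquet as pq

# Inspect the file db_to_pq() just wrote (pq_file from the example above)
table = pq.read_table(pq_file)
print(table.num_rows, table.schema)
```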
Export a WRDS table to Parquet:

```python
from db2pq import wrds_pg_to_pq

wrds_pg_to_pq(
    table_name="dsi",
    schema="crsp",
    wrds_id="your_wrds_id",  # or set WRDS_ID in the environment
)
```
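To see when the local copy was last refreshed, query `pq_last_updated()` from the API list below (the exact return format is not pinned down here):

```python
from db2pq import pq_last_updated

# Timestamp of the most recent update to the local crsp.dsi file
print(pq_last_updated(table_name="dsi", schema="crsp"))
```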
Update a Parquet file only when the WRDS source table is newer:

```python
from db2pq import wrds_update_pq

wrds_update_pq(
    table_name="dsi",
    schema="crsp",
    wrds_id="your_wrds_id",
)
```
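Because `wrds_update_pq()` skips tables whose local copies are already current, refreshing a list of tables in one pass is cheap; a minimal sketch (the table names are illustrative, and `WRDS_ID` is assumed to be set):

```python
from db2pq import wrds_update_pq

# Each call is a no-op when the local Parquet file is newer than the
# WRDS source table.
for table in ["dsi", "msi", "dsf"]:
    wrds_update_pq(table_name=table, schema="crsp")
```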
Export all tables in a PostgreSQL schema to Parquet:

```python
from db2pq import db_schema_to_pq

files = db_schema_to_pq(schema="public")
print(files)
```

Files are organized as:

```
<DATA_DIR>/<schema>/<table>.parquet
```

For example:

```
/data/crsp/dsi.parquet
```

When `archive=True`, replaced files are moved under:

```
<DATA_DIR>/<schema>/<archive_dir>/<table>_<timestamp>.parquet
```
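Given this layout, the expected location of any table's file can be derived directly; a small sketch (assumes `DATA_DIR` is set as in the shell setup above):

```python
import os
from pathlib import Path

# Expected path of crsp.dsi under the layout described above
path = Path(os.environ["DATA_DIR"]) / "crsp" / "dsi.parquet"
print(path, path.exists())
```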
From `db2pq`:

- `db_to_pq(table_name, schema, ...)`
- `wrds_pg_to_pq(table_name, schema, ...)`
- `db_schema_to_pq(schema, ...)`
- `wrds_update_pq(table_name, schema, ...)`
- `wrds_update_schema(schema, ...)`
- `get_pq_files(schema, ...)`
- `get_modified_pq(file_name)`
- `pq_last_updated(table_name=None, schema=None, data_dir=None, file_name=None)`
- `db_schema_tables(schema, ...)`
- `get_wrds_comment(table_name, schema, use_sas=False, sas_schema=None, ...)`
- `get_pg_comment(table_name, schema, ...)`
- `wrds_update_pg(table_name, schema, ...)`
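As a quick illustration of the file and metadata helpers (treating the return value of `get_pq_files()` as a list of paths is an assumption made for this sketch):

```python
from db2pq import get_pq_files, get_modified_pq

# List Parquet files written for the crsp schema, then read the
# last_modified metadata embedded in the first one.
files = get_pq_files("crsp")
if files:
    print(get_modified_pq(files[0]))
```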
`wrds_update_pq()` supports SQL-style filtering via `where`, for example:

```python
wrds_update_pq("funda", "comp", where="indfmt = 'INDL' AND datafmt = 'STD'")
```
- WRDS PostgreSQL access uses host `wrds-pgdata.wharton.upenn.edu` and port `9737`.
- `batched=True` (the default) lowers memory usage for large tables.
- `col_types` can be used to cast selected columns before writing Parquet.
- `keep`/`drop` accept regex pattern(s) in both `wrds_update_pq()` and `wrds_update_pg()`. If both are supplied, `drop` is applied before `keep` (see the sketch after this list).
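A sketch combining these options (the dict form of `col_types`, and the column names and patterns, are assumptions made for illustration):

```python
from db2pq import wrds_update_pq

# Cast gvkey to string before writing, drop auxiliary columns first,
# then keep only identifier and date columns (drop runs before keep).
wrds_update_pq(
    "funda", "comp",
    col_types={"gvkey": "string"},
    drop="^aux_",
    keep="gvkey|datadate",
)
```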
Run an editable install from this repository:

```sh
pip install -e .
```

With the optional SAS dependency:

```sh
pip install -e ".[sas]"
```

- Docs index: docs/README.md
- Contributor guide: CONTRIBUTING.md
- Release process: RELEASING.md
- Changelog: CHANGELOG.md
MIT License. See LICENSE.