Skip to content
/ db2pq Public

Small library to convert database tables to parquet files

License

Notifications You must be signed in to change notification settings

iangow/db2pq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

59 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

db2pq: export PostgreSQL and WRDS data to Parquet

db2pq is a Python library for moving data from PostgreSQL into Apache Parquet files. It is designed for both general PostgreSQL sources and the WRDS PostgreSQL service.

What it does

  • Export a single PostgreSQL table to Parquet.
  • Export all tables in a PostgreSQL schema to Parquet.
  • Export WRDS tables to Parquet.
  • Update Parquet files only when the WRDS source table is newer.
  • Mirror WRDS tables into a local PostgreSQL database.
  • Read last_modified metadata embedded in Parquet files.

Installation

Install from PyPI:

pip install --upgrade db2pq

Install optional SAS support (used by wrds_update_pq(..., use_sas=True)):

pip install --upgrade "db2pq[sas]"

Environment variables

db2pq supports explicit function arguments and environment-based defaults.

Connection defaults:

  • PGUSER: PostgreSQL user (falls back to local OS user)
  • PGHOST: PostgreSQL host (default: localhost)
  • PGDATABASE: PostgreSQL database (default: PGUSER)
  • PGPORT: PostgreSQL port (default: 5432)

WRDS + output defaults:

  • WRDS_ID: WRDS username (required for WRDS helpers unless passed directly)
  • DATA_DIR: base directory where Parquet files are written

Example shell setup:

export WRDS_ID="your_wrds_id"
export DATA_DIR="$HOME/pq_data"

WRDS SSH setup (for SAS-based metadata)

wrds_update_pq(..., use_sas=True) uses SSH to execute SAS remotely. Configure an SSH key for your WRDS account first:

ssh-keygen -t ed25519 -C "your_wrds_id@wrds"
cat ~/.ssh/id_ed25519.pub | \
ssh your_wrds_id@wrds-cloud-sshkey.wharton.upenn.edu \
"mkdir -p ~/.ssh && chmod 700 ~/.ssh && \
 cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"

Quickstart

1) Export a PostgreSQL table

from db2pq import db_to_pq

pq_file = db_to_pq(
    table_name="my_table",
    schema="public",
    host="localhost",
    database="mydb",
)

print(pq_file)

2) Export a WRDS table to Parquet

from db2pq import wrds_pg_to_pq

wrds_pg_to_pq(
    table_name="dsi",
    schema="crsp",
    wrds_id="your_wrds_id",  # or set WRDS_ID in the environment
)

3) Update only when WRDS data changed

from db2pq import wrds_update_pq

wrds_update_pq(
    table_name="dsi",
    schema="crsp",
    wrds_id="your_wrds_id",
)

4) Export all tables in a PostgreSQL schema

from db2pq import db_schema_to_pq

files = db_schema_to_pq(schema="public")
print(files)

Parquet layout

Files are organized as:

<DATA_DIR>/<schema>/<table>.parquet

For example:

/data/crsp/dsi.parquet

When archive=True, replaced files are moved under:

<DATA_DIR>/<schema>/<archive_dir>/<table>_<timestamp>.parquet

Public API

From db2pq:

  • db_to_pq(table_name, schema, ...)
  • wrds_pg_to_pq(table_name, schema, ...)
  • db_schema_to_pq(schema, ...)
  • wrds_update_pq(table_name, schema, ...)
  • wrds_update_schema(schema, ...)
  • get_pq_files(schema, ...)
  • get_modified_pq(file_name)
  • pq_last_updated(table_name=None, schema=None, data_dir=None, file_name=None)
  • db_schema_tables(schema, ...)
  • get_wrds_comment(table_name, schema, use_sas=False, sas_schema=None, ...)
  • get_pg_comment(table_name, schema, ...)
  • wrds_update_pg(table_name, schema, ...)

wrds_update_pq() supports SQL-style filtering via where, for example:

wrds_update_pq("funda", "comp", where="indfmt = 'INDL' AND datafmt = 'STD'")

Notes

  • WRDS PostgreSQL access uses host wrds-pgdata.wharton.upenn.edu and port 9737.
  • batched=True (default) lowers memory usage for large tables.
  • col_types can be used to cast selected columns before writing Parquet.
  • keep/drop accept regex pattern(s) in both wrds_update_pq() and wrds_update_pg(). If both are supplied, drop is applied before keep.

Development

Run editable install in this repository:

pip install -e .

With optional SAS dependency:

pip install -e ".[sas]"

Project docs

  • Docs index: docs/README.md
  • Contributor guide: CONTRIBUTING.md
  • Release process: RELEASING.md
  • Changelog: CHANGELOG.md

License

MIT License. See LICENSE.

About

Small library to convert database tables to parquet files

Resources

License

Contributing

Stars

Watchers

Forks

Packages

No packages published

Languages