qdata Translation Spec (R -> Python)

Background

qdata is a serialization format for a subset of R data types. The types are broken down into two categories:

  • Container types (objects that can contain other objects)
  • Leaf types (objects that are terminal and cannot contain other objects)

This document is a translation spec for qd_read output types.

Container Translation (list and dataframe-like)

flowchart TD
    A[R container object] --> B{Container kind?}

    B -->|VECSXP| C["np.ndarray(dtype=object)"]

    B -->|data.frame/data.table/tibble| D{User choice:<br>pandas or polars}

    D -->|pandas| E[pandas.DataFrame]
    D -->|polars| F[polars.DataFrame]

    C --> G["Shape or type upgrades depending on attributes;<br>See _Conversion upgrade rules_"]
    E --> H[Columns translated by pandas column spec]
    F --> I[Columns translated by polars column spec]
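One practical detail behind the VECSXP -> np.ndarray(dtype=object) branch: NumPy tries to broadcast nested sequences into extra dimensions, so a one-dimensional object array is more safely built by allocating it first and assigning elements one by one. The helper below is a minimal sketch of that idea; to_object_array is a hypothetical name and not part of qdata.

```python
import numpy as np


def to_object_array(items):
    """Build a 1D object array with one slot per list element (hypothetical helper)."""
    out = np.empty(len(items), dtype=object)
    for i, item in enumerate(items):
        # Element-wise assignment keeps each item intact, even when the
        # items are equal-length sequences that NumPy would otherwise stack.
        out[i] = item
    return out


nested = [[1, 2], [3, 4]]
as_list_like = to_object_array(nested)      # shape (2,), each slot holds a list
stacked = np.array(nested, dtype=object)    # shape (2, 2), nesting is flattened away
```

The difference only shows up when the list elements happen to have equal lengths, which is why the explicit allocation is the more predictable choice for a list-like container.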

Leaf Types within Lists -> Array Translation Spec

flowchart LR
    subgraph R["R type (within VECSXP)"]
        RLIST[R Container object]
        RNULL[NULL]
        RRAW[raw vector]
        RCHAR[character vector]
        RLGL[logical vector]
        RINT[integer vector]
        RREAL[double vector]
        RCPLX[complex vector]
        RFACTOR[factor / ordered]
        RDATE[Date]
        RPOSIX[POSIXct]
        RDIFF[difftime]
    end

    subgraph P["Baseline conversion"]
        PLIST[See _Container Translation_]
        PNULL[None]
        PRAW[np.uint8 array]
        PCHAR[np.StringDType na_object=None]
        PLGL[np.bool_ array]
        PINT[np.int32 array]
        PREAL[np.float64 array]
        PCPLX[np.complex128 array]
        PFACTOR[np.StringDType na_object=None]
        PDATE["np.datetime64[D] array"]
        PPOSIX["np.datetime64[ns] array"]
        PDIFF["np.timedelta64 array"]
    end

    RNULL --> PNULL
    RLIST --> PLIST
    RRAW --> PRAW
    RCHAR --> PCHAR
    RLGL --> PLGL
    RINT --> PINT
    RREAL --> PREAL
    RCPLX --> PCPLX
    RFACTOR --> PFACTOR
    RDATE --> PDATE
    RPOSIX --> PPOSIX
    RDIFF --> PDIFF
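As an illustration of one row of the mapping above, the sketch below converts R Date values to an np.datetime64[D] array. It assumes the values arrive as doubles counting days since 1970-01-01 (R's internal Date representation), that NA arrives as NaN, and that fractional days are floored; r_date_to_numpy is a hypothetical helper name.

```python
import numpy as np


def r_date_to_numpy(days):
    """Convert R Date payloads (days since epoch, as doubles) to datetime64[D]."""
    days = np.asarray(days, dtype=np.float64)
    # Start from NaT everywhere, then fill in the non-missing positions.
    out = np.full(days.shape, np.datetime64("NaT"), dtype="datetime64[D]")
    ok = ~np.isnan(days)
    # Assumption: fractional days are floored to whole days.
    out[ok] = np.floor(days[ok]).astype(np.int64).astype("datetime64[D]")
    return out


dates = r_date_to_numpy([0.0, 19723.0, float("nan")])
```

Here the missing value is carried in-band as NaT, so no separate mask layer is needed for this type.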

Conversion upgrade rules

Baseline conversions may be upgraded to more sophisticated types to handle missing values and shape/label attributes.

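  • Has NA values -> add a mask layer (np.ma.MaskedArray) only where the baseline dtype cannot represent a missing value (logical, integer); floats/complex do not add an extra mask layer, since R already encodes their NA as a NaN payload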
  • names attribute (1D vector labels) -> upgrade to xarray.DataArray with a single labeled names axis
  • dim attribute -> reshape to an np.ndarray of that shape, filled in column-major order as in R; dim takes precedence over names
  • dimnames -> upgrade to xarray.DataArray with per-axis coords
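The first and third rules can be sketched with NumPy alone. The example assumes integer data arrives as a flat int32 buffer in which NA is R's NA_integer_ sentinel (the minimum 32-bit integer); upgrade_integer is a hypothetical helper name, and the xarray upgrades are left out.

```python
import numpy as np

# R stores NA_integer_ as the smallest 32-bit signed integer.
R_NA_INTEGER = np.iinfo(np.int32).min


def upgrade_integer(values, dim=None):
    """Apply the dim and NA upgrade rules to a flat integer vector (sketch)."""
    arr = np.asarray(values, dtype=np.int32)
    if dim is not None:
        # R fills arrays column by column, which is Fortran order in NumPy.
        arr = arr.reshape(dim, order="F")
    mask = arr == R_NA_INTEGER
    if mask.any():
        # Only add the mask layer when at least one NA is present.
        return np.ma.MaskedArray(arr, mask=mask)
    return arr


matrix = upgrade_integer([1, 2, 3, 4, 5, 6], dim=(2, 3))
with_na = upgrade_integer([1, R_NA_INTEGER, 3])
```

In this sketch matrix stays a plain ndarray with rows [1, 3, 5] and [2, 4, 6], matching what matrix(1:6, nrow = 2) looks like in R, while with_na becomes a masked array with only its middle element masked.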

xarray attributes

An xarray.DataArray describes its axis labels with two properties:

  • dims: the axis names, as strings
    • In R this would be names(dimnames(x))
  • coords: a dictionary mapping each dim to its labels
    • In R, the labels are stored positionally, e.g. dimnames(x)[[i]]

The axis label mapping between R and xarray is not perfect. In xarray, dims must exist and be unique, but in R they may be NULL, NA, or duplicated. These cases are handled as follows:

  • If dims are NULL or NA, use sentinels __DIM_0__, __DIM_1__, ...
  • If duplicate axis names occur, append numeric suffixes to make dims unique (for example A, A_2, A_3)
  • Sentinel collisions with literal axis names are accepted (not escaped)

Lastly, coord values are converted to str | None (xarray coerces the storage type internally).
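The dims normalization rules above can be sketched in plain Python. This is one possible reading of the rules, not the reference implementation: it assumes NULL and NA names both arrive as None, and normalize_dims is a hypothetical helper name. The sketch does not guard against a suffixed name such as A_2 colliding with a literal axis name A_2, which the rules above do not address.

```python
def normalize_dims(raw_names, ndim):
    """Turn R axis names into unique xarray dims (sketch of the rules above)."""
    if raw_names is None:
        # names(dimnames(x)) is NULL: every axis gets a sentinel.
        raw_names = [None] * ndim
    seen = {}
    dims = []
    for i, name in enumerate(raw_names):
        # Missing axis names fall back to a positional sentinel.
        base = f"__DIM_{i}__" if name is None else name
        count = seen.get(base, 0) + 1
        seen[base] = count
        # The first occurrence is kept as-is; repeats get _2, _3, ...
        dims.append(base if count == 1 else f"{base}_{count}")
    return dims


mixed = normalize_dims(["A", None, "A", "A"], ndim=4)
unnamed = normalize_dims(None, ndim=2)
```

With these inputs, mixed is ["A", "__DIM_1__", "A_2", "A_3"] and unnamed is ["__DIM_0__", "__DIM_1__"]. A literal axis name that happens to equal a sentinel is passed through unescaped and only deduplicated like any other repeated name.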

Leaf Types within Dataframes -> pandas Column Translation Spec

flowchart LR
    subgraph R["R type (dataframe columns)"]
        RCHAR[character]
        RLGL[logical]
        RINT[integer]
        RREAL[double]
        RCPLX[complex]
        RRAW[raw]
        RFACTOR[factor / ordered]
        RDATE[Date]
        RPOSIX[POSIXct]
        RDIFF[difftime]
        ROTHER[list or unsupported]
    end

    subgraph P["Column dtype"]
        PCHAR["pd.StringDtype(storage='pyarrow')"]
        PLGL["pd.BooleanDtype"]
        PINT["pd.Int32Dtype"]
        PREAL[np.float64]
        PCPLX[np.complex128]
        PRAW[np.uint8]
        PFACTOR["pd.CategoricalDtype"]
        PDATE["np.datetime64[ns]"]
        PPOSIX["np.datetime64[ns] or pd.DatetimeTZDtype"]
        PDIFF["np.timedelta64[ns]"]
        POTHER[object]
    end

    RCHAR --> PCHAR
    RLGL --> PLGL
    RINT --> PINT
    RREAL --> PREAL
    RCPLX --> PCPLX
    RRAW --> PRAW
    RFACTOR --> PFACTOR
    RDATE --> PDATE
    RPOSIX --> PPOSIX
    RDIFF --> PDIFF
    ROTHER --> POTHER

Row names policy (pandas): if row.names is a character vector (STRSXP), set DataFrame.index to it; otherwise ignore row.names and keep the default index.

POSIXct timezone policy (pandas): no tzone attribute -> np.datetime64[ns]; has tzone attribute -> pd.DatetimeTZDtype (datetime64[ns, tz]).

Character storage policy (pandas): character columns are constructed from Arrow UTF-8 buffers and exposed as pandas string dtype with Arrow storage (string[pyarrow]).
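A small sketch of what the pandas column spec and row names policy look like from the Python side. It builds the columns directly with pandas constructors rather than from a qdata file, leaves out the Arrow-backed string column so that it runs without pyarrow, and uses build_frame as a hypothetical helper name.

```python
import pandas as pd


def build_frame(columns, row_names=None):
    """Assemble a DataFrame and apply the row names policy (sketch)."""
    df = pd.DataFrame(columns)
    # Only character row names are kept; anything else leaves the default index.
    if row_names is not None and all(isinstance(n, str) for n in row_names):
        df.index = pd.Index(row_names)
    return df


df = build_frame(
    {
        # logical -> nullable boolean, NA becomes pd.NA
        "flag": pd.array([True, None, False], dtype=pd.BooleanDtype()),
        # integer -> nullable Int32, NA becomes pd.NA
        "count": pd.array([1, None, 3], dtype=pd.Int32Dtype()),
        # factor -> categorical, NA is a missing code rather than a level
        "level": pd.Categorical(["lo", "hi", None], categories=["lo", "hi"]),
    },
    row_names=["r1", "r2", "r3"],
)

# Integer row names (R's compact default) are ignored.
plain = build_frame({"a": [1, 2]}, row_names=[1, 2])
```

The nullable extension dtypes are what allow logical and integer columns to carry NA without being promoted to float or object.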

Leaf Types within Dataframes -> polars Column Translation Spec

flowchart LR
    subgraph R["R type (dataframe columns)"]
        RCHAR[character]
        RLGL[logical]
        RINT[integer]
        RREAL[double]
        RCPLX[complex]
        RRAW[raw]
        RFACTOR[factor / ordered]
        RDATE[Date]
        RPOSIX[POSIXct]
        RDIFF[difftime]
        ROTHER[list or unsupported]
    end

    subgraph P["Column dtype"]
        PCHAR[pl.String]
        PLGL[pl.Boolean]
        PINT[pl.Int32]
        PREAL[pl.Float64]
        PCPLX["pl.Struct({real: pl.Float64, imag: pl.Float64})"]
        PRAW[pl.UInt8]
        PFACTOR[pl.Categorical]
        PDATE[pl.Date]
        PPOSIX["pl.Datetime or pl.Datetime(time_zone=tz)"]
        PDIFF[pl.Duration]
        POTHER[pl.Object]
    end

    RCHAR --> PCHAR
    RLGL --> PLGL
    RINT --> PINT
    RREAL --> PREAL
    RCPLX --> PCPLX
    RRAW --> PRAW
    RFACTOR --> PFACTOR
    RDATE --> PDATE
    RPOSIX --> PPOSIX
    RDIFF --> PDIFF
    ROTHER --> POTHER

Row names policy (polars): ignore row.names, since polars has no row index (use the pandas backend if row labels need to be preserved).

POSIXct timezone policy (polars): no tzone attribute -> pl.Datetime(time_unit="ns"); has tzone attribute -> pl.Datetime(time_unit="ns", time_zone=tz).

Character storage policy (polars): character columns are constructed through Arrow arrays before conversion to polars pl.String.

R -> Python Translation Summary

  • Detect dataframe-like containers first
    • User selects pandas -> use pandas column translation spec
    • User selects polars -> use polars column translation spec
  • Else treat container as list -> use Array translation spec
    • Upgrade array types as necessary (reshape, np.ma.MaskedArray, or xarray.DataArray)
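The decision order above can be sketched as a small dispatcher. This is an illustration only: it assumes the R class attribute is available as a list of strings, and choose_translation and its return labels are hypothetical. data.table and tibble objects both inherit from data.frame in R, so a single membership check covers all three dataframe-like kinds.

```python
def choose_translation(r_classes, backend="pandas"):
    """Pick which translation spec applies to a container (sketch)."""
    # Dataframe-like containers are detected first.
    if "data.frame" in r_classes:
        if backend not in ("pandas", "polars"):
            raise ValueError(f"unsupported backend: {backend!r}")
        return f"{backend} column spec"
    # Everything else is treated as a plain list.
    return "array spec"


tibble_route = choose_translation(["tbl_df", "tbl", "data.frame"], backend="polars")
list_route = choose_translation([])
```

The backend argument only matters on the dataframe branch; a plain list follows the array spec regardless of the user's choice.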

Comparison with rdata Translation

This section compares only the qdata-supported subset of R types.

Differences in top-level/container outputs

  • qdata VECSXP (non-dataframe) defaults to NumPy-first outputs:
    • unlabeled -> np.ndarray(dtype=object)
    • labeled/shaped -> xarray.DataArray(dtype=object) when names/dimnames apply
  • rdata VEC (non-dataframe) outputs:
    • unnamed -> Python list
    • named -> Python dict
  • Dataframe-like objects:
    • qdata spec -> pandas.DataFrame or polars.DataFrame
    • rdata current output -> pandas.DataFrame

Differences in translation spec when the container is VECSXP (named and unnamed)

  • Default unnamed container:
    • qdata spec -> np.ndarray(dtype=object)
    • rdata -> Python list
  • names on VECSXP:
    • qdata spec -> xarray.DataArray(dtype=object) with labeled axis
    • rdata -> dict (via names)
  • dim on VECSXP:
    • qdata spec -> preserved (reshaped object array; xarray if labels are present)
    • rdata -> dropped for VEC
  • dimnames on VECSXP:
    • qdata spec -> xarray.DataArray(dtype=object) with xarray-compatible label normalization
    • rdata -> no VEC dimnames output path
  • Factor / ordered in non-dataframe VECSXP:
    • qdata spec -> NumPy string labels (StringDType, None for NA)
    • rdata -> pandas.Categorical

Differences in translation spec when the container is a dataframe (pandas baseline)

  • Both outputs are pandas.DataFrame
  • Column dtypes are mostly aligned:
    • logical -> nullable boolean
    • integer -> nullable Int32
    • character -> pandas string dtype
    • real / complex -> numeric arrays/columns
    • factor / ordered -> pd.Categorical

Differences in translation spec when the container is a dataframe (polars)

  • qdata spec supports polars.DataFrame
  • rdata's current default class map does not provide a polars dataframe path
  • qdata/polars column outputs are specified as:
    • factor / ordered -> pl.Categorical (the ordered flag may be lost)
    • complex -> pl.Struct({real: pl.Float64, imag: pl.Float64})
    • raw -> pl.UInt8
    • row names -> ignored