qdata Translation Spec (R -> Python)

Background

qdata is a serialization format for a subset of R data types. The types are broken down into two categories:

Container types (objects that can contain other objects)
Leaf types (objects that are terminal and cannot contain other objects)

This document specifies the Python type that qd_read returns for each supported R type.

Container Translation (list and dataframe-like)

flowchart TD
    A[R container object] --> B{Container kind?}

    B -->|VECSXP| C["np.ndarray(dtype=object)"]

    B -->|data.frame/data.table/tibble| D{User choice:<br>pandas or polars}

    D -->|pandas| E[pandas.DataFrame]
    D -->|polars| F[polars.DataFrame]

    C --> G["Shape or type upgrades depending on attributes;<br>See _Conversion upgrade rules_"]
    E --> H[Columns translated by pandas column spec]
    F --> I[Columns translated by polars column spec]

Leaf Types within Lists -> Array Translation Spec

flowchart LR
    subgraph R["R type (within VECSXP)"]
        RLIST[R Container object]
        RNULL[NULL]
        RRAW[raw vector]
        RCHAR[character vector]
        RLGL[logical vector]
        RINT[integer vector]
        RREAL[double vector]
        RCPLX[complex vector]
        RFACTOR[factor / ordered]
        RDATE[Date]
        RPOSIX[POSIXct]
        RDIFF[difftime]
    end

    subgraph P["Baseline conversion"]
        PLIST[See _Container Translation_]
        PNULL[None]
        PRAW[np.uint8 array]
        PCHAR["np.dtypes.StringDType(na_object=None) array"]
        PLGL[np.bool_ array]
        PINT[np.int32 array]
        PREAL[np.float64 array]
        PCPLX[np.complex128 array]
        PFACTOR["np.dtypes.StringDType(na_object=None) array"]
        PDATE["np.datetime64[D] array"]
        PPOSIX["np.datetime64[ns] array"]
        PDIFF["np.timedelta64 array"]
    end

    RNULL --> PNULL
    RLIST --> PLIST
    RRAW --> PRAW
    RCHAR --> PCHAR
    RLGL --> PLGL
    RINT --> PINT
    RREAL --> PREAL
    RCPLX --> PCPLX
    RFACTOR --> PFACTOR
    RDATE --> PDATE
    RPOSIX --> PPOSIX
    RDIFF --> PDIFF
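
The factor row is the least obvious mapping in the chart above, since the labels replace the integer codes. A minimal sketch of that decoding step, assuming the usual R layout (1-based integer codes plus a levels vector, with NA stored as NA_integer_). The helper name `factor_to_labels` is hypothetical and not part of qdata; the resulting list is what would then be stored in a `np.dtypes.StringDType(na_object=None)` array on NumPy 2.x.

```python
# R stores NA_integer_ as the minimum 32-bit signed integer.
NA_INTEGER = -2147483648


def factor_to_labels(codes, levels):
    """Decode 1-based factor codes into labels, using None for NA.

    Hypothetical helper for illustration only.
    """
    return [None if code == NA_INTEGER else levels[code - 1] for code in codes]


labels = factor_to_labels([1, 2, NA_INTEGER, 1], ["lo", "hi"])
print(labels)
```

Note that the ordered flag of an ordered factor has no place in a plain list of labels, so this sketch drops it.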

Conversion upgrade rules

Baseline conversions may be upgraded to more sophisticated types to handle missing values and shape/label attributes.

Has NA values -> wrap logical and integer arrays in a mask layer (np.ma.MaskedArray); float and complex arrays do not get an extra mask layer
names attribute (1D vector labels) -> upgrade to xarray.DataArray with a single labeled names axis
Has dim -> reshape to a shaped np.ndarray in column-major order (as in R); dim takes precedence over names
dimnames -> upgrade to xarray.DataArray with per-axis coords
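
The mask and reshape rules above can be sketched for an integer vector as follows. This is an illustration under stated assumptions, not the qdata implementation: the helper name `upgrade_integer` is hypothetical, and NA is detected through R's NA_integer_ bit pattern (the minimum int32 value).

```python
import numpy as np

# R stores NA_integer_ as INT_MIN.
NA_INTEGER = np.iinfo(np.int32).min


def upgrade_integer(values, dim=None):
    """Apply the dim and NA upgrade rules to an R integer vector (hypothetical helper)."""
    data = np.asarray(values, dtype=np.int32)
    if dim is not None:
        # R fills arrays column by column, which is Fortran order in NumPy.
        data = data.reshape(dim, order="F")
    mask = data == NA_INTEGER
    if mask.any():
        # Only add the mask layer when at least one NA is present.
        return np.ma.MaskedArray(data, mask=mask)
    return data


shaped = upgrade_integer([1, 2, 3, 4, 5, 6], dim=(2, 3))
masked = upgrade_integer([1, NA_INTEGER, 3])
print(shaped)
print(masked)
```

With `dim=(2, 3)`, the first column of the result holds 1 and 2, matching what `matrix(1:6, nrow = 2)` looks like in R.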

xarray attributes

xarray stores axis labels in two attributes:

dims: a list of strings naming each axis
- In R this would be names(dimnames(x))
coords: a dictionary mapping each dim to its labels
- In R, coords are stored positionally, e.g. dimnames(x)[[i]]

The axis label mapping between R and xarray is not perfect. In xarray, dims must exist and be unique, but in R they may be NULL, NA, or duplicated. How this is handled is described below.

If dims are NULL or NA, use sentinels __DIM_0__, __DIM_1__, ...
If duplicate axis names occur, append numeric suffixes to make dims unique (for example A, A_2, A_3)
Sentinel collisions with literal axis names are accepted (not escaped)

Lastly, coord values are converted to str | None (xarray coerces the storage type internally).
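
The dim-name rules above can be sketched in plain Python. The function name `normalize_dims` is hypothetical, NULL and NA names are both represented as None, and the choice to skip suffixes that are already taken is an assumption of this sketch rather than something the spec states.

```python
def normalize_dims(axis_names):
    """Turn R axis names into unique xarray dims (hypothetical helper).

    axis_names holds one entry per axis: a str, or None for NULL/NA.
    """
    # Rule 1: missing names become positional sentinels.
    filled = [
        f"__DIM_{i}__" if name is None else name
        for i, name in enumerate(axis_names)
    ]
    # Rule 2: duplicates get numeric suffixes starting at _2.
    used = set()
    dims = []
    for name in filled:
        candidate = name
        suffix = 2
        while candidate in used:
            candidate = f"{name}_{suffix}"
            suffix += 1
        used.add(candidate)
        dims.append(candidate)
    return dims


print(normalize_dims(["A", "A", "A"]))
print(normalize_dims([None, "x", None]))
```

The returned list is what would be passed as `dims` when constructing an `xarray.DataArray`. A literal axis name such as `__DIM_0__` is not escaped; it only picks up a suffix if it collides with another dim.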

Leaf Types within Dataframes -> pandas Column Translation Spec

flowchart LR
    subgraph R["R type (dataframe columns)"]
        RCHAR[character]
        RLGL[logical]
        RINT[integer]
        RREAL[double]
        RCPLX[complex]
        RRAW[raw]
        RFACTOR[factor / ordered]
        RDATE[Date]
        RPOSIX[POSIXct]
        RDIFF[difftime]
        ROTHER[list or unsupported]
    end

    subgraph P["Column dtype"]
        PCHAR["pd.StringDtype(storage='pyarrow')"]
        PLGL["pd.BooleanDtype"]
        PINT["pd.Int32Dtype"]
        PREAL[np.float64]
        PCPLX[np.complex128]
        PRAW[np.uint8]
        PFACTOR["pd.CategoricalDtype"]
        PDATE["np.datetime64[ns]"]
        PPOSIX["np.datetime64[ns] or pd.DatetimeTZDtype"]
        PDIFF["np.timedelta64[ns]"]
        POTHER[object]
    end

    RCHAR --> PCHAR
    RLGL --> PLGL
    RINT --> PINT
    RREAL --> PREAL
    RCPLX --> PCPLX
    RRAW --> PRAW
    RFACTOR --> PFACTOR
    RDATE --> PDATE
    RPOSIX --> PPOSIX
    RDIFF --> PDIFF
    ROTHER --> POTHER

Row names policy: if row.names is STRSXP, set DataFrame.index; otherwise ignore row.names and keep the default index.

POSIXct timezone policy (pandas): no tzone attribute -> np.datetime64[ns]; has tzone attribute -> pd.DatetimeTZDtype (datetime64[ns, tz]).

Character storage policy (pandas): character columns are constructed from Arrow UTF-8 buffers and exposed as pandas string dtype with Arrow storage (string[pyarrow]).
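
The row names and POSIXct timezone policies above can be sketched with pandas as follows. Both helper names are hypothetical, a Python list of str stands in for a STRSXP row.names vector, and the POSIXct payload is assumed to be seconds since the Unix epoch stored as doubles (NaN for NA), as in R.

```python
import numpy as np
import pandas as pd


def posixct_to_series(seconds, tzone=None):
    """Convert POSIXct seconds to a datetime Series (hypothetical helper)."""
    secs = np.asarray(seconds, dtype=np.float64)
    missing = np.isnan(secs)
    # Replace NA with 0 before the integer cast, then restore it as NaT.
    nanos = np.where(missing, 0.0, secs * 1e9).round().astype(np.int64)
    stamps = nanos.view("datetime64[ns]")
    stamps[missing] = np.datetime64("NaT")
    series = pd.Series(stamps)
    if tzone:
        # The stored instants are UTC-based; tzone selects the display zone.
        series = series.dt.tz_localize("UTC").dt.tz_convert(tzone)
    return series


def apply_row_names(frame, row_names):
    """Set the index only for character row names (hypothetical helper)."""
    is_character = (
        row_names is not None
        and len(row_names) == len(frame)
        and all(isinstance(name, str) for name in row_names)
    )
    if not is_character:
        return frame  # keep the default index
    labelled = frame.copy()
    labelled.index = pd.Index(row_names)
    return labelled


naive = posixct_to_series([0.0, 86400.0, float("nan")])
aware = posixct_to_series([0.0, 86400.0], tzone="UTC")
frame = apply_row_names(pd.DataFrame({"x": [1, 2]}), ["a", "b"])
print(naive.dtype, aware.dtype, list(frame.index))
```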

Leaf Types within Dataframes -> polars Column Translation Spec

flowchart LR
    subgraph R["R type (dataframe columns)"]
        RCHAR[character]
        RLGL[logical]
        RINT[integer]
        RREAL[double]
        RCPLX[complex]
        RRAW[raw]
        RFACTOR[factor / ordered]
        RDATE[Date]
        RPOSIX[POSIXct]
        RDIFF[difftime]
        ROTHER[list or unsupported]
    end

    subgraph P["Column dtype"]
        PCHAR[pl.String]
        PLGL[pl.Boolean]
        PINT[pl.Int32]
        PREAL[pl.Float64]
        PCPLX["pl.Struct({real: pl.Float64, imag: pl.Float64})"]
        PRAW[pl.UInt8]
        PFACTOR[pl.Categorical]
        PDATE[pl.Date]
        PPOSIX["pl.Datetime or pl.Datetime(time_zone=tz)"]
        PDIFF[pl.Duration]
        POTHER[pl.Object]
    end

    RCHAR --> PCHAR
    RLGL --> PLGL
    RINT --> PINT
    RREAL --> PREAL
    RCPLX --> PCPLX
    RRAW --> PRAW
    RFACTOR --> PFACTOR
    RDATE --> PDATE
    RPOSIX --> PPOSIX
    RDIFF --> PDIFF
    ROTHER --> POTHER

Row names policy: ignore row.names (use the pandas backend if row labels need to be preserved).

POSIXct timezone policy (polars): no tzone attribute -> pl.Datetime(time_unit="ns"); has tzone attribute -> pl.Datetime(time_unit="ns", time_zone=tz).

Character storage policy (polars): character columns are constructed through Arrow arrays before conversion to polars pl.String.
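
Since polars has no complex dtype, the spec above represents complex columns as a struct with real and imag fields. A minimal sketch of the splitting step using only NumPy; the helper name `complex_to_struct_fields` is hypothetical, and how the two arrays are handed to polars is deliberately left out.

```python
import numpy as np


def complex_to_struct_fields(values):
    """Split a complex vector into float64 real and imag arrays (hypothetical helper)."""
    z = np.asarray(values, dtype=np.complex128)
    # .real and .imag are strided views; copy them into contiguous arrays.
    return {
        "real": np.ascontiguousarray(z.real),
        "imag": np.ascontiguousarray(z.imag),
    }


fields = complex_to_struct_fields([1 + 2j, 3 - 4j])
print(fields["real"], fields["imag"])
```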

R -> Python Translation Summary

Detect dataframe-like containers first
- User selects pandas -> use pandas column translation spec
- User selects polars -> use polars column translation spec
Else treat container as list -> use Array translation spec
- Upgrade array types as necessary (reshape, np.ma.MaskedArray, or xarray.DataArray)
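
The decision order in this summary can be sketched as a small dispatch function. It only returns the name of the target type and performs no conversion. The function name and its parameters are hypothetical; the dataframe check relies on data.table and tibble objects also carrying the data.frame class in R.

```python
def choose_target(r_classes, backend="pandas", attrs=()):
    """Name the Python output type for an R container (hypothetical helper).

    r_classes: the R class vector, e.g. ["tbl_df", "tbl", "data.frame"].
    attrs: names of the attributes present on a non-dataframe list.
    """
    # Dataframe-like containers are detected first.
    if "data.frame" in r_classes:
        if backend not in ("pandas", "polars"):
            raise ValueError(f"unknown backend: {backend!r}")
        return f"{backend}.DataFrame"
    # Everything else is treated as a list and may be upgraded.
    present = set(attrs)
    if "dimnames" in present:
        return "xarray.DataArray"
    if "dim" in present:
        # dim takes precedence over names.
        return "np.ndarray (shaped, dtype=object)"
    if "names" in present:
        return "xarray.DataArray"
    return "np.ndarray(dtype=object)"


print(choose_target(["data.table", "data.frame"], backend="polars"))
print(choose_target([], attrs=["names"]))
```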

Comparison with `rdata` Translation

This section compares only the qdata-supported subset of R types.

Differences in top-level/container outputs

qdata VECSXP (non-dataframe) defaults to NumPy-first outputs:
- unlabeled -> np.ndarray(dtype=object)
- labeled/shaped -> xarray.DataArray(dtype=object) when names/dimnames apply
rdata VEC (non-dataframe) outputs:
- unnamed -> Python list
- named -> Python dict
Dataframe-like objects:
- qdata spec -> pandas.DataFrame or polars.DataFrame
- rdata current output -> pandas.DataFrame

Differences in translation spec when the container is `VECSXP` (named and unnamed)

Default unnamed container:
- qdata spec -> np.ndarray(dtype=object)
- rdata -> Python list
names on VECSXP:
- qdata spec -> xarray.DataArray(dtype=object) with labeled axis
- rdata -> dict (via names)
dim on VECSXP:
- qdata spec -> preserved (reshaped object array; xarray if labels are present)
- rdata -> dropped for VEC
dimnames on VECSXP:
- qdata spec -> xarray.DataArray(dtype=object) with xarray-compatible label normalization
- rdata -> no VEC dimnames output path
Factor / ordered in non-dataframe VECSXP:
- qdata spec -> NumPy string labels (StringDType, None for NA)
- rdata -> pandas.Categorical

Differences in translation spec when the container is a dataframe (pandas baseline)

Both outputs are pandas.DataFrame
Column dtypes are mostly aligned:
- logical -> nullable boolean
- integer -> nullable Int32
- character -> pandas string dtype
- real / complex -> numeric arrays/columns
- factor / ordered -> pd.Categorical

Differences in translation spec when the container is a dataframe (polars)

qdata spec supports polars.DataFrame
rdata current default class map does not provide a polars dataframe path
qdata/polars column outputs are specified as:
- factor / ordered -> pl.Categorical (ordered flag may be lossy)
- complex -> pl.Struct({real: pl.Float64, imag: pl.Float64})
- raw -> pl.UInt8
- row names -> ignored
