qdata is a serialization format for a subset of R data types. The types are broken down into two categories:
- Container types (objects that can contain other objects)
- Leaf types (objects that are terminal and cannot contain other objects)
This document is a translation spec for qd_read output types.
flowchart TD
A[R container object] --> B{Container kind?}
B -->|VECSXP| C["np.ndarray(dtype=object)"]
B -->|data.frame/data.table/tibble| D{User choice:<br>pandas or polars}
D -->|pandas| E[pandas.DataFrame]
D -->|polars| F[polars.DataFrame]
C --> G["Shape or type upgrades depending on attributes;<br>See _Conversion upgrade rules_"]
E --> H[Columns translated by pandas column spec]
F --> I[Columns translated by polars column spec]
flowchart LR
subgraph R["R type (within VECSXP)"]
RLIST[R Container object]
RNULL[NULL]
RRAW[raw vector]
RCHAR[character vector]
RLGL[logical vector]
RINT[integer vector]
RREAL[double vector]
RCPLX[complex vector]
RFACTOR[factor / ordered]
RDATE[Date]
RPOSIX[POSIXct]
RDIFF[difftime]
end
subgraph P["Baseline conversion"]
PLIST[See _Container Translation_]
PNULL[None]
PRAW[np.uint8 array]
PCHAR[np.StringDType na_object=None]
PLGL[np.bool_]
PINT[np.int32 array]
PREAL[np.float64 array]
PCPLX[np.complex128 array]
PFACTOR[np.StringDType na_object=None]
PDATE["np.datetime64[D] array"]
PPOSIX["np.datetime64[ns] array"]
PDIFF["np.timedelta64 array"]
end
RNULL --> PNULL
RLIST --> PLIST
RRAW --> PRAW
RCHAR --> PCHAR
RLGL --> PLGL
RINT --> PINT
RREAL --> PREAL
RCPLX --> PCPLX
RFACTOR --> PFACTOR
RDATE --> PDATE
RPOSIX --> PPOSIX
RDIFF --> PDIFF
Baseline conversions may be upgraded to more sophisticated types to handle missing values and shape/label attributes.
- Has `NA` values -> use mask layer (`np.ma.MaskedArray`) if needed (logical, integer); floats/complex do not add an extra mask layer
- Has `names` attribute (1D vector labels) -> upgrade to `xarray.DataArray` with a single labeled `names` axis
- Has `dim` -> reshape to a shaped `np.ndarray` in column-major order, like R; takes precedence over `names`
- Has `dimnames` -> upgrade to `xarray.DataArray` with per-axis coords
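The `NA` and `dim` rules can be illustrated with a minimal NumPy sketch. It relies on the fact that R encodes `NA_integer_` as the smallest 32-bit signed integer; the helper name `upgrade_integer_vector` and its signature are hypothetical and not part of qdata.

```python
import numpy as np

# R encodes NA_integer_ as the smallest 32-bit signed integer.
R_NA_INTEGER = np.iinfo(np.int32).min


def upgrade_integer_vector(values, dim=None):
    """Hypothetical helper: apply an R `dim` attribute and mask NA values."""
    arr = np.asarray(values, dtype=np.int32)
    if dim is not None:
        # R stores arrays in column-major order, so reshape with order="F".
        arr = arr.reshape(dim, order="F")
    na_mask = arr == R_NA_INTEGER
    if na_mask.any():
        # Add the mask layer only when an NA is actually present.
        return np.ma.MaskedArray(arr, mask=na_mask)
    return arr


# Rough equivalent of R's matrix(c(1L, 2L, NA, 4L, 5L, 6L), nrow = 2)
m = upgrade_integer_vector([1, 2, R_NA_INTEGER, 4, 5, 6], dim=(2, 3))
```

Here the `NA` ends up at row 0, column 1, as it would in R, and a vector without `NA` values is returned as a plain `np.ndarray`.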
xarray stores two variables referencing axis labels:
- `dims` a string list of axis names
  - In `R` this would be `names(dimnames(x))`
- `coords` a dictionary mapping `dims` to labels
  - In `R`, coords are stored sequentially, e.g. `dimnames(x)[[i]]`
The axis label mapping between R and xarray is not perfect. In xarray, dims must exist and be unique, but in R they may be NULL, NA, or duplicated. How this is handled is described below.
- If `dims` are `NULL` or `NA`, use sentinels `__DIM_0__`, `__DIM_1__`, ...
- If duplicate axis names occur, append numeric suffixes to make dims unique (for example `A`, `A_2`, `A_3`)
- Sentinel collisions with literal axis names are accepted (not escaped)
Lastly, coord values are converted as str | None (xarray coerces the storage type internally).
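The normalization rules above can be sketched in plain Python. The function names are hypothetical, and the sketch assumes that R `NULL` and `NA` axis names arrive as `None`. The text does not say what happens when a generated name collides with a later literal name, so the sketch does not re-check for that.

```python
def normalize_dims(axis_names, ndim):
    """Hypothetical helper: turn R axis names into unique xarray dims."""
    if axis_names is None:
        # names(dimnames(x)) is NULL: every axis gets a sentinel.
        axis_names = [None] * ndim
    dims = []
    seen = {}
    for i, name in enumerate(axis_names):
        if name is None:
            # A missing (NULL or NA) axis name becomes a positional sentinel.
            name = f"__DIM_{i}__"
        count = seen.get(name, 0) + 1
        seen[name] = count
        # The first occurrence keeps its name; repeats get _2, _3, ...
        dims.append(name if count == 1 else f"{name}_{count}")
    return dims


def normalize_coord_values(labels):
    """Hypothetical helper: coerce coord labels to str, keeping None for NA."""
    return [None if value is None else str(value) for value in labels]


dims = normalize_dims(["A", "A", None], ndim=3)
coords = normalize_coord_values(["x", None, 3])
```

With these inputs, `dims` is `["A", "A_2", "__DIM_2__"]` and `coords` is `["x", None, "3"]`.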
flowchart LR
subgraph R["R type (dataframe columns)"]
RCHAR[character]
RLGL[logical]
RINT[integer]
RREAL[double]
RCPLX[complex]
RRAW[raw]
RFACTOR[factor / ordered]
RDATE[Date]
RPOSIX[POSIXct]
RDIFF[difftime]
ROTHER[list or unsupported]
end
subgraph P["Column dtype"]
PCHAR["pd.StringDtype(storage='pyarrow')"]
PLGL["pd.BooleanDtype"]
PINT["pd.Int32Dtype"]
PREAL[np.float64]
PCPLX[np.complex128]
PRAW[np.uint8]
PFACTOR["pd.CategoricalDtype"]
PDATE["np.datetime64[ns]"]
PPOSIX["np.datetime64[ns] or pd.DatetimeTZDtype"]
PDIFF["np.timedelta64[ns]"]
POTHER[object]
end
RCHAR --> PCHAR
RLGL --> PLGL
RINT --> PINT
RREAL --> PREAL
RCPLX --> PCPLX
RRAW --> PRAW
RFACTOR --> PFACTOR
RDATE --> PDATE
RPOSIX --> PPOSIX
RDIFF --> PDIFF
ROTHER --> POTHER
Row names policy: if row.names is STRSXP, set DataFrame.index; otherwise ignore row.names and keep the default index.
POSIXct timezone policy (pandas): no tzone attribute -> np.datetime64[ns]; has tzone attribute -> pd.DatetimeTZDtype (datetime64[ns, tz]).
Character storage policy (pandas): character columns are constructed from Arrow UTF-8 buffers and exposed as pandas string dtype with Arrow storage (string[pyarrow]).
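As an illustration of the POSIXct timezone policy, the following sketch converts seconds since the Unix epoch (the values a POSIXct vector stores) into a pandas column. The helper `posixct_to_pandas` is hypothetical, and the actual qdata conversion path may differ.

```python
import pandas as pd


def posixct_to_pandas(seconds, tzone=None):
    """Hypothetical helper: POSIXct seconds since the epoch -> pandas Series."""
    # Interpret the numbers as seconds since 1970-01-01 and pin the
    # resolution to nanoseconds so the dtype is datetime64[ns].
    naive = pd.Series(pd.to_datetime(seconds, unit="s")).astype("datetime64[ns]")
    if tzone is None:
        # No tzone attribute: keep a timezone-naive column.
        return naive
    # tzone attribute present: the instants are UTC-based, shown in `tzone`.
    return naive.dt.tz_localize("UTC").dt.tz_convert(tzone)


naive = posixct_to_pandas([0, 86400])
aware = posixct_to_pandas([0, 86400], tzone="UTC")
```

The first result has dtype `datetime64[ns]`; the second has a `pd.DatetimeTZDtype` (`datetime64[ns, UTC]`).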
flowchart LR
subgraph R["R type (dataframe columns)"]
RCHAR[character]
RLGL[logical]
RINT[integer]
RREAL[double]
RCPLX[complex]
RRAW[raw]
RFACTOR[factor / ordered]
RDATE[Date]
RPOSIX[POSIXct]
RDIFF[difftime]
ROTHER[list or unsupported]
end
subgraph P["column dtype"]
PCHAR[pl.String]
PLGL[pl.Boolean]
PINT[pl.Int32]
PREAL[pl.Float64]
PCPLX["pl.Struct({real: pl.Float64, imag: pl.Float64})"]
PRAW[pl.UInt8]
PFACTOR[pl.Categorical]
PDATE[pl.Date]
PPOSIX["pl.Datetime or pl.Datetime(time_zone=tz)"]
PDIFF[pl.Duration]
POTHER[pl.Object]
end
RCHAR --> PCHAR
RLGL --> PLGL
RINT --> PINT
RREAL --> PREAL
RCPLX --> PCPLX
RRAW --> PRAW
RFACTOR --> PFACTOR
RDATE --> PDATE
RPOSIX --> PPOSIX
RDIFF --> PDIFF
ROTHER --> POTHER
Row names policy: ignore row.names (use the pandas backend if row labels need to be preserved).
POSIXct timezone policy (polars): no tzone attribute -> pl.Datetime(time_unit="ns"); has tzone attribute -> pl.Datetime(time_unit="ns", time_zone=tz).
Character storage policy (polars): character columns are constructed through Arrow arrays before conversion to polars pl.String.
- Detect dataframe-like containers first
  - User selects Pandas -> Use pandas column translation spec
  - User selects Polars -> Use Polars column translation spec
- Else treat container as list -> use Array translation spec
  - Upgrade array types as necessary (reshape, `np.ma.MaskedArray`, or `xarray.DataArray`)
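The decision order can be sketched as a small dispatcher. This is only an illustration: the function name, the `backend` parameter, and the returned labels are hypothetical. It relies on the fact that data.table and tibble objects also carry `"data.frame"` in their R class vector; how qdata itself detects dataframes is not stated here.

```python
def choose_translation(r_classes, backend="pandas"):
    """Hypothetical dispatcher: pick the translation spec for a container."""
    # data.table and tibble objects inherit from data.frame, so a single
    # membership test covers all three dataframe-like kinds.
    if "data.frame" in r_classes:
        if backend == "pandas":
            return "pandas column spec"
        if backend == "polars":
            return "polars column spec"
        raise ValueError(f"unknown backend: {backend!r}")
    # Anything else is treated as a plain list; array upgrades
    # (reshape, masking, labels) would follow from here.
    return "array spec"


tibble_spec = choose_translation(["tbl_df", "tbl", "data.frame"], backend="polars")
list_spec = choose_translation([])
```

Checking for dataframes first matters because a data.frame is itself a `VECSXP` and would otherwise fall through to the array path.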
This section compares only the qdata-supported subset of R types.
- `qdata` `VECSXP` (non-dataframe) defaults to NumPy-first outputs:
  - unlabeled -> `np.ndarray(dtype=object)`
  - labeled/shaped -> `xarray.DataArray(dtype=object)` when `names`/`dimnames` apply
- `rdata` `VEC` (non-dataframe) outputs:
  - unnamed -> Python `list`
  - named -> Python `dict`
- Dataframe-like objects:
  - `qdata` spec -> `pandas.DataFrame` or `polars.DataFrame`
  - `rdata` current output -> `pandas.DataFrame`
- Default unnamed container:
  - `qdata` spec -> `np.ndarray(dtype=object)`
  - `rdata` -> Python `list`
- `names` on `VECSXP`:
  - `qdata` spec -> `xarray.DataArray(dtype=object)` with labeled axis
  - `rdata` -> `dict` (via `names`)
- `dim` on `VECSXP`:
  - `qdata` spec -> preserved (reshaped object array; xarray if labels are present)
  - `rdata` -> dropped for `VEC`
- `dimnames` on `VECSXP`:
  - `qdata` spec -> `xarray.DataArray(dtype=object)` with xarray-compatible label normalization
  - `rdata` -> no `VEC` `dimnames` output path
- Factor / ordered in non-dataframe `VECSXP`:
  - `qdata` spec -> NumPy string labels (`StringDType`, `None` for `NA`)
  - `rdata` -> `pandas.Categorical`
- Both outputs are `pandas.DataFrame`
- Column dtypes are mostly aligned:
  - logical -> nullable boolean
  - integer -> nullable `Int32`
  - character -> pandas string dtype
  - real / complex -> numeric arrays/columns
  - factor / ordered -> `pd.Categorical`
- `qdata` spec supports `polars.DataFrame`
- `rdata` current default class map does not provide a polars dataframe path
- qdata/polars column outputs are specified as:
  - factor / ordered -> `pl.Categorical` (ordered flag may be lossy)
  - complex -> `pl.Struct({real: pl.Float64, imag: pl.Float64})`
  - raw -> `pl.UInt8`
  - row names -> ignored