Skip to content

Support dask dataframes as input to lens.summarise #12

@zblz

Description

@zblz

Currently, lens requires a pandas dataframe as input to the lens.summarise method. This places an upper limit on the size of the dataset analysed, which must be smaller than the available memory in the machine. Even with efficient optimisation of memory usage during the execution of the dask graph, the initial requirement prevents lens from scaling.

Ideally, lens.summarise should accept dask dataframes as input, and build the execution graph based on this delayed dataframe. This will require a rework of the functions in lens.metrics, given that all of them currently take either pd.Series or pd.Dataframe as arguments. In most cases we should be able to use the dask dataframe API, but for other metrics it will be necessary to access the individual chunks and reduce the result appropriately.

Adding this support, along with the distributed scheduler #11, will allow lens to analyse datasets significantly larger than the memory of the machine.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions