FIX-#431: Moving and adding sampling to backend calculations by westernguy2 · Pull Request #438 · lux-org/lux

westernguy2 · 2021-12-01T17:05:19Z

Overview

This branch builds on the work done in #432. It moves sampling to after the filter step and also adds sampling to the metadata computation.

Changes

Changes the execute function so that sampling happens after filtering is done, which means the sampling is applied to the data of each visualization. It also edits compute_data to sample the data before computing the metadata, so all metadata describes the sample, not the full dataset.
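To illustrate the ordering described above, here is a minimal sketch in plain pandas of filtering first and sampling second. The function name `filter_then_sample`, the `thresh` and `seed` parameters, and the default threshold value are assumptions made for this example; they are not the code in this PR.

```python
import pandas as pd

# Hypothetical default, standing in for a configurable sampling threshold.
SAMPLING_THRESH = 1000


def filter_then_sample(df: pd.DataFrame, column: str, value,
                       thresh: int = SAMPLING_THRESH, seed: int = 0) -> pd.DataFrame:
    """Apply an equality filter, then sample only if the result is still large."""
    # Filter on the full data so that rare matching rows are not lost to sampling.
    filtered = df[df[column] == value]
    # Sample the filtered rows only when they exceed the threshold.
    if len(filtered) > thresh:
        return filtered.sample(n=thresh, random_state=seed)
    return filtered
```

Sampling after the filter means the row budget is spent entirely on rows that match the filter, rather than on a sample of the whole table that may contain few matching rows.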

Example Output

N/A

Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>

lux/executor/PandasExecutor.py

dorisjlee · 2021-12-01T22:36:22Z

lux/executor/PandasExecutor.py

                elif pd.api.types.is_float_dtype(ldf.dtypes[attr]):

-                    if ldf.cardinality[attr] != len(ldf) and (ldf.cardinality[attr] < 20):
+                    if ldf.cardinality[attr] != ldf._length and (ldf.cardinality[attr] < 20):


What is the difference between _length and len(df)? It is probably more general to use the latter, since _length might not be maintained correctly.

I noticed _length in the metadata for a LuxDataFrame, and found that it was not being used anywhere in the code base (as far as I could tell). On line 544 of this file, I changed it to hold the length of the sampled DataFrame. This is necessary because we don't save the sampled DataFrame after the metadata is computed, but its length is still needed for later calculations, especially ones related to cardinality, like the one here.

The name of the attribute is probably not the best, so I could maybe change it to _sampled_length instead?
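A small sketch of why the comparison needs the sampled length: when cardinality is computed on a sample, comparing it against the full length can misclassify a column of unique values. The helper `looks_categorical` and its `max_card` parameter are hypothetical names for this illustration; only the shape of the condition is taken from the diff above.

```python
import pandas as pd


def looks_categorical(sample: pd.DataFrame, attr: str, max_card: int = 20) -> bool:
    """Mirror the diff's condition, using the sample's own length."""
    sampled_length = len(sample)
    cardinality = sample[attr].nunique()
    # A column with one distinct value per sampled row is not treated as categorical.
    return bool(cardinality != sampled_length and cardinality < max_card)


# Every value is unique in the full data and therefore in the sample too.
full = pd.DataFrame({"u": [float(i) for i in range(10000)]})
sample = full.sample(n=10, random_state=0)

# Against the sampled length, cardinality (10) equals the length (10).
print(looks_categorical(sample, "u"))  # False

# Against the full length, 10 != 10000 and 10 < 20, so the check would pass.
card = sample["u"].nunique()
print(card != len(full) and card < 20)  # True
```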

dorisjlee · 2021-12-01T22:37:35Z

lux/executor/PandasExecutor.py

    def compute_stats(self, ldf: LuxDataFrame):
+        # use sample to compute statistics
+        if ldf._sampled is None:
+            ldf_sampled = PandasExecutor.execute_sampling(ldf)


Will the config parameters that we are using for sampling for metadata and the visualization be the same?

Yes, currently they are the same (sampling_thresh). Should we maybe use different parameters?
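As a rough sketch of what sharing one threshold looks like, the class below draws a single sample, caches it, and reuses it for both the statistics and the visualization data. `SampledFrame` and its methods are hypothetical and are not Lux's `LuxDataFrame` or `PandasExecutor`; caching the sample on the object is also an assumption of this example.

```python
import pandas as pd


class SampledFrame:
    """Hypothetical wrapper that samples once and reuses the result."""

    def __init__(self, df: pd.DataFrame, sampling_thresh: int = 1000, seed: int = 0):
        self.df = df
        self.sampling_thresh = sampling_thresh  # one threshold for both code paths
        self.seed = seed
        self._sampled = None  # filled lazily on first use

    def sample(self) -> pd.DataFrame:
        # Draw the sample only once; later calls return the cached frame.
        if self._sampled is None:
            if len(self.df) > self.sampling_thresh:
                self._sampled = self.df.sample(n=self.sampling_thresh,
                                               random_state=self.seed)
            else:
                self._sampled = self.df
        return self._sampled

    def compute_stats(self) -> dict:
        # Metadata describes the sample, not the full data.
        s = self.sample()
        return {"length": len(s),
                "cardinality": {c: s[c].nunique() for c in s.columns}}

    def data_for_vis(self) -> pd.DataFrame:
        # Visualization data comes from the same cached sample.
        return self.sample()
```

Splitting the threshold into two parameters would mean keeping two samples (or resampling), and the metadata would then describe different rows from the ones being plotted.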

Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>

westernguy2 added 7 commits November 10, 2021 00:31

FIX-lux-org#431: implement sampling threshold and edit tests and docs

f4441c5

Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>

FIX-lux-org#431: small cleanup changes

6c82a88

Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>

FIX-lux-org#431: move sampling to after the filtering

75ab18f

Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>

FIX-lux-org#431: add sampling for metadata statistics computation

6b224a0

Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>

FIX-lux-org#431: fix cardinality bug

6ea34a7

Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>

FIX-lux-org#431: remove print statement

c26295d

Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>

FIX-lux-org#431: fix bug in check_id_like

caf1fe1

Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>

dorisjlee requested changes Dec 1, 2021


westernguy2 added 2 commits December 1, 2021 19:42

FIX-lux-org#431: remove filter_executed dictionary implementation

222037a

Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>

FIX-lux-org#431: fix bug with caching sampled df

c696b2b

Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>

westernguy2 force-pushed the move-sampling branch from 75e5f01 to c696b2b on January 31, 2022 20:03

westernguy2 added 2 commits February 1, 2022 21:24

FIX-lux-org#431: Move column filtering after sampling

35bd2d2

Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>

FIX-lux-org#431: changed sampling method to removing rows

5252d91

Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>

westernguy2 wants to merge 11 commits into lux-org:master from westernguy2:move-sampling
