FIX-#431: Moving and adding sampling to backend calculations#438
FIX-#431: Moving and adding sampling to backend calculations#438westernguy2 wants to merge 11 commits intolux-org:masterfrom
Conversation
Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>
Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>
Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>
Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>
Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>
Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>
Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>
| elif pd.api.types.is_float_dtype(ldf.dtypes[attr]): | ||
|
|
||
| if ldf.cardinality[attr] != len(ldf) and (ldf.cardinality[attr] < 20): | ||
| if ldf.cardinality[attr] != ldf._length and (ldf.cardinality[attr] < 20): |
There was a problem hiding this comment.
What is the difference between _length and len(df)? It is probably more general to use the latter since the _length might not be maintained correctly.
There was a problem hiding this comment.
I noticed _length in the metadata for a LuxDataFrame, and found that it was not being used anywhere in the code base (as far as I could tell). On Line 544 of this file, I changed it to be the length of the sampled DataFrame. This is necessary since we don't save the sampled DataFrame after the metadata is computed, but the length of the sampled DataFrame is necessary for future calculations, especially ones related to cardinality, like the one here.
The name of the attribute is probably not the best, so I could maybe change it to _sampled_length instead?
| def compute_stats(self, ldf: LuxDataFrame): | ||
| # use sample to compute statistics | ||
| if ldf._sampled is None: | ||
| ldf_sampled = PandasExecutor.execute_sampling(ldf) |
There was a problem hiding this comment.
Will the config parameters that we are using for sampling for metadata and the visualization be the same?
There was a problem hiding this comment.
Yes, currently they are the same (sampling_thresh). Should we maybe use different parameters?
Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>
Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>
75e5f01 to
c696b2b
Compare
Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>
Signed-off-by: Kunal Agarwal <kagarwal2@berkeley.edu>
Overview
This is a branch that builds off of the work done in #432. This moves the sampling to after the Filter and also adds sampling for metadata computation.
Changes
Changes the
executefunction to move sampling after the Filtering is done, and so the sampling is done on each of the data visualizations. It also editscompute_datato sample the data before computing the metadata. All metadata is the metadata associated with the sample, not the full dataset.Example Output
N/A