Describe the feature you'd like
Support for the Parquet MIME type in the SageMaker toolkit. For example, the README of this repo includes a sample default_input_fn():
    def default_input_fn(self, input_data, content_type, context=None):
        """A default input_fn that can handle JSON, CSV and NPZ formats.

        Args:
            input_data: the request payload serialized in the content_type format
            content_type: the request content_type
            context (obj): the request context (default: None).

        Returns: input_data deserialized into torch.FloatTensor or torch.cuda.FloatTensor depending if cuda is available.
        """
        return decoder.decode(input_data, content_type)
Looking into decoder.decode, I see the following MIME types are supported:
    _decoder_map = {
        content_types.NPY: _npy_to_numpy,
        content_types.CSV: _csv_to_numpy,
        content_types.JSON: _json_to_numpy,
        content_types.NPZ: _npz_to_sparse,
    }
It should not be too hard to add Parquet here. Parquet is a columnar data file format commonly used with large datasets, and it is already supported by other SageMaker services, for example Autopilot.
How would this feature be used? Please describe.
Reduce storage and data I/O costs, and speed up processing.
Describe alternatives you've considered
CSV is the standard, but it is a much less efficient way to store, read, and write column-oriented data.
Additional context