Transforms a CSV dataset by applying operations listed in a JSON-based data transformation specification.
All data, including the modified values, is returned in a new file, transformed_data.csv.
- Clone this repo:
git clone https://github.com/lisamoreno/transform.git
- Install dependencies:
pip install -r requirements.txt
- Run the script:
./transform --transformspec /path/to/transform-spec.json --dataset /path/to/dataset.csv
- Sample transformation spec and dataset files are available in this repo
- To use them, follow the installation instructions, then run:
./transform --transformspec test_valid_json_transform_spec.json --dataset test_dataset.csv
Two arguments are required:
- `--transformspec` takes a full path to your JSON-based data transformation spec
- `--dataset` takes a full path to your dataset file
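For readers curious how two required arguments like these are typically declared, here is a minimal sketch using Python's standard `argparse` module. It is not taken from this repo's source: the `build_parser` helper and the help strings are assumptions made for illustration. Only the two flag names come from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the two required arguments described above."""
    parser = argparse.ArgumentParser(
        prog="transform",
        description="Apply a JSON transformation spec to a CSV dataset.",
    )
    # required=True makes argparse exit with a usage error if the flag is missing.
    parser.add_argument("--transformspec", required=True,
                        help="path to the JSON-based data transformation spec")
    parser.add_argument("--dataset", required=True,
                        help="path to the CSV dataset file")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args([
    "--transformspec", "/path/to/transform-spec.json",
    "--dataset", "/path/to/dataset.csv",
])
print(args.transformspec)
print(args.dataset)
```

With `required=True`, leaving out either flag makes `argparse` print a usage message and exit, which matches the "two arguments are required" behavior described here.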
The following operations have been implemented and can be specified in your transformation specification.
- `slugify`, operates on strings: Removes all punctuation, converts all whitespace into hyphens, and lowercases all letters.
- `f-to-c`, operates on floats or integers: Returns the temperature converted to Celsius, rounded to 1 decimal place.
- `hst-to-unix`, operates on date and time strings: Assumes the source date and times are in Hawaii Standard Time (UTC-10). Converts into the UTC time zone and into a UNIX timestamp format. If this operation is included, one column must include a date as `MM/DD/YY` and another column must include the time as `HH:MM:SS`. A new column for the timestamp will be added.
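The three operations can be sketched in plain Python as follows. This is an illustration of the behavior described above, not the repo's actual implementation. The function names are hypothetical, and several details are assumptions: "punctuation" is taken to mean ASCII punctuation (`string.punctuation`), runs of whitespace collapse into a single hyphen, and the timestamp is returned as whole seconds.

```python
import re
import string
from datetime import datetime, timedelta, timezone

# Hawaii Standard Time is a fixed UTC-10 offset with no daylight saving time.
HST = timezone(timedelta(hours=-10))


def slugify(value: str) -> str:
    """Drop ASCII punctuation, turn whitespace runs into hyphens, lowercase."""
    no_punct = value.translate(str.maketrans("", "", string.punctuation))
    # Assumption: leading/trailing whitespace is trimmed and runs are collapsed.
    return re.sub(r"\s+", "-", no_punct.strip()).lower()


def f_to_c(fahrenheit: float) -> float:
    """Convert Fahrenheit to Celsius, rounded to 1 decimal place."""
    return round((fahrenheit - 32) * 5 / 9, 1)


def hst_to_unix(date_str: str, time_str: str) -> int:
    """Combine an MM/DD/YY date and an HH:MM:SS time in HST into a UNIX timestamp."""
    naive = datetime.strptime(f"{date_str} {time_str}", "%m/%d/%y %H:%M:%S")
    # Attach the HST offset; timestamp() then yields seconds since the UTC epoch.
    return int(naive.replace(tzinfo=HST).timestamp())


print(slugify("Hello, World!"))             # hello-world
print(f_to_c(50))                           # 10.0
print(hst_to_unix("01/01/70", "00:00:00"))  # 36000
```

Midnight HST on 1 January 1970 is 10:00 UTC, so the last call returns 36000 seconds (10 hours). Note that `%y` interprets two-digit years 69 to 99 as 1969 to 1999 and 00 to 68 as 2000 to 2068.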
- The spec must be valid JSON in the following format:
{
"spec_version": 1.0,
"transforms": [
{
"operation": "OPERATION", # Include a [valid operation](#available-operations)
"column": "COLUMN_NAME" # The `operation` specified will be applied to all values in `column`
},
{ ... },
{ ... }
]
}
- Note: If an `operation` is specified more than once for the same column, the latest one will be applied.
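The sketch below illustrates one way a spec in this format could be parsed and applied. It is not the script's actual code. The helper names (`resolve_transforms`, `apply_transforms`), the two stand-in operations, and the `Temperature` column are hypothetical. It also reads the note above as "the last entry for a given column wins", which is an assumption about the intended behavior. It works on in-memory strings rather than writing `transformed_data.csv`.

```python
import csv
import io
import json


def f_to_c(value: str) -> str:
    """Stand-in for the f-to-c operation, working on CSV cell text."""
    return str(round((float(value) - 32) * 5 / 9, 1))


def slugify(value: str) -> str:
    """Simplified stand-in for slugify (whitespace to hyphens, lowercase only)."""
    return "-".join(value.split()).lower()


# Hypothetical registry mapping spec operation names to functions.
OPERATIONS = {"f-to-c": f_to_c, "slugify": slugify}


def resolve_transforms(spec_text: str) -> dict:
    """Parse the spec and keep one operation per column (later entries win)."""
    spec = json.loads(spec_text)  # raises json.JSONDecodeError on invalid JSON
    resolved = {}
    for entry in spec["transforms"]:
        operation, column = entry["operation"], entry["column"]
        if operation not in OPERATIONS:
            raise ValueError(f"unknown operation: {operation}")
        resolved[column] = operation  # overwrites any earlier entry for this column
    return resolved


def apply_transforms(csv_text: str, resolved: dict) -> str:
    """Apply each resolved operation to every value in its column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for column, operation in resolved.items():
            row[column] = OPERATIONS[operation](row[column])
        writer.writerow(row)
    return out.getvalue()


spec_text = """
{
  "spec_version": 1.0,
  "transforms": [
    {"operation": "slugify", "column": "Temperature"},
    {"operation": "f-to-c", "column": "Temperature"}
  ]
}
"""
resolved = resolve_transforms(spec_text)
print(resolved)  # {'Temperature': 'f-to-c'}
print(apply_transforms("Temperature\n50\n212\n", resolved))
```

Because `Temperature` appears twice in the example spec, only the later `f-to-c` entry is kept, and the output rows contain `10.0` and `100.0`. Unlike the annotated template above, a real spec file cannot contain `#` comments, since JSON has no comment syntax.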
- This script was made specifically to work with solar radiation measurements captured by NASA from various locations around the Earth.
- It can also be used with other datasets.
- To run tests, simply run
./test.py