transform

Transforms a CSV dataset by applying operations listed in a JSON-based data transformation specification. All data, including the modified values, is returned in a new file, transformed_data.csv.

Installation

  • Clone this repo:
    • git clone https://github.com/lisamoreno/transform.git
  • Install dependencies:
    • pip install -r requirements.txt

Usage

./transform --transformspec /path/to/transform-spec.json --dataset /path/to/dataset.csv

  • Sample transformation spec and dataset files are available in this repo
  • To use them, follow the installation instructions, then run:
    • ./transform --transformspec test_valid_json_transform_spec.json --dataset test_dataset.csv
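
To illustrate what a run does to the data, here is a minimal sketch of applying one operation to one column and writing out every row. It is not the tool's actual code: the `apply_to_column` helper, the column names, and the in-memory sample data are all hypothetical, and the real tool writes to transformed_data.csv rather than to a string buffer.

```python
import csv
import io


def apply_to_column(rows, column, func):
    """Return new rows with func applied to every value in `column`."""
    return [{**row, column: func(row[column])} for row in rows]


# Hypothetical in-memory dataset standing in for a dataset CSV file.
source = io.StringIO("city,temp_f\nHilo,212\nKona,32\n")
rows = list(csv.DictReader(source))

# Apply a Fahrenheit-to-Celsius conversion to the (hypothetical) temp_f column.
rows = apply_to_column(rows, "temp_f", lambda v: round((float(v) - 32) * 5 / 9, 1))

# Write all columns back out, modified or not.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["city", "temp_f"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
result = out.getvalue()
```

Untouched columns pass through unchanged, which matches the description above that all data, not only the modified values, ends up in the output.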

Arguments

Two arguments are required:

  1. --transformspec takes the path to your JSON-based data transformation spec.
  2. --dataset takes the path to your dataset file.
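
For reference, two required arguments like these could be declared with Python's standard `argparse` module. This is only a sketch of one way to do it, not necessarily how the script parses its arguments; the `build_parser` function and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two required arguments described above."""
    parser = argparse.ArgumentParser(prog="transform")
    parser.add_argument(
        "--transformspec",
        required=True,
        help="path to the JSON-based data transformation spec",
    )
    parser.add_argument(
        "--dataset",
        required=True,
        help="path to the CSV dataset file",
    )
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(
    ["--transformspec", "transform-spec.json", "--dataset", "dataset.csv"]
)
```

With `required=True`, omitting either argument makes `argparse` print a usage message and exit with an error.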

Available Operations

The following operations have been implemented and can be specified in your transformation specification.

  • slugify, operates on strings: Removes all punctuation, converts all whitespace into hyphens, and lowercases all letters.
  • f-to-c, operates on floats or integers: Returns the temperature converted from Fahrenheit to Celsius, rounded to 1 decimal place.
  • hst-to-unix, operates on date and time strings: Assumes the source dates and times are in Hawaii Standard Time (UTC-10). Converts them to UTC and then to a UNIX timestamp. If this operation is included, one column must contain the date as MM/DD/YY and another column must contain the time as HH:MM:SS. A new column for the timestamp will be added.
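
The three operations above can be sketched in plain Python as follows. These are illustrative approximations, not the repository's implementations. The function names are hypothetical, "punctuation" is assumed to mean the ASCII characters in `string.punctuation`, runs of whitespace are assumed to collapse into a single hyphen, and HST is treated as a fixed UTC-10 offset.

```python
import re
import string
from datetime import datetime, timedelta, timezone

# Assumption: Hawaii Standard Time is a fixed UTC-10 offset (no daylight saving).
HST = timezone(timedelta(hours=-10))


def slugify(value: str) -> str:
    """Strip ASCII punctuation, hyphenate whitespace runs, and lowercase."""
    no_punct = value.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", "-", no_punct.strip()).lower()


def f_to_c(value) -> float:
    """Convert Fahrenheit to Celsius, rounded to 1 decimal place."""
    return round((float(value) - 32) * 5 / 9, 1)


def hst_to_unix(date_str: str, time_str: str) -> int:
    """Combine an MM/DD/YY date and HH:MM:SS time in HST into a UNIX timestamp."""
    naive = datetime.strptime(f"{date_str} {time_str}", "%m/%d/%y %H:%M:%S")
    # Attach the HST offset; timestamp() then yields seconds since the UTC epoch.
    return int(naive.replace(tzinfo=HST).timestamp())
```

For example, under these assumptions `slugify("Hello, World!")` gives `"hello-world"`, `f_to_c(212)` gives `100.0`, and `hst_to_unix("01/01/70", "00:00:00")` gives `36000`, since midnight HST is 10:00 UTC.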

Transformation Specification

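  • The spec must be valid JSON in the following format. The # annotations below are explanatory only; JSON does not support comments, so leave them out of your actual spec file: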
{
	"spec_version": 1.0,
	"transforms": [
		{
			"operation": "OPERATION", # Must be one of the operations listed under Available Operations
			"column": "COLUMN_NAME" # The operation is applied to every value in this column
		},
		{ ... },
		{ ... }
	]
}
  • Note: If an operation is specified more than once for the same column, the latest one will be applied.
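
As a rough sketch of how a spec in this format might be loaded and reduced to one operation per column: the example below assumes that "latest" means the entry listed last in `transforms`. The `resolve_transforms` helper and the column names `temp_f` and `name` are hypothetical and not part of this repo.

```python
import json

# A hypothetical spec: the "name" column appears twice.
SPEC_TEXT = """
{
    "spec_version": 1.0,
    "transforms": [
        {"operation": "f-to-c", "column": "temp_f"},
        {"operation": "f-to-c", "column": "name"},
        {"operation": "slugify", "column": "name"}
    ]
}
"""


def resolve_transforms(spec: dict) -> dict:
    """Map each column to a single operation; later entries overwrite earlier ones."""
    resolved = {}
    for transform in spec["transforms"]:
        resolved[transform["column"]] = transform["operation"]
    return resolved


spec = json.loads(SPEC_TEXT)  # raises json.JSONDecodeError if the spec is not valid JSON
operations = resolve_transforms(spec)
```

Here `operations` maps `temp_f` to `f-to-c` and `name` to `slugify`, because the second entry for `name` replaces the first.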

Dataset File

Testing

  • To run the tests, run ./test.py
